Cluster computing

Sunday, January 5, 2020

Apache Flink Applications can use ProcessFunction with streams. It is similar to FlatMap Function but handles all three : events, state and timers. The function applies to each and every event in the input stream. It also gives access to the Flink keyed state via the runtime Context. The timer allows changes in event time and processing time to be handled by the application. Both the timestamps and timerservice are available via the context object. This is only possible with process function on a keyed stream.

The timer service is called for future event processing via registered callbacks. The onTimer method is invoked for this particular processing. Inside this method, states are scoped to the key for which the timer was created.

Joins are possible on low-level operations of two inputs. The ProcessFunction or the KeyedCoProcessFunction is a function that is bound to two different inputs and it gets individual calls to processElement1 and processElement2 for records for respective inputs. One of the inputs is processed first. Its state is updated. Then the other input is processed. The earlier state is probed and the result is emitted. Since events may correspond to different times including delays, a watermark may be used as a reference and when it has elapsed, the operation may be performed.

The timers are fault-tolerant and checkpointed This helps with recovery if the task gets paused. There does not need to be an ordered set of events with the logic above.

The Flink coalescing may maintain only one timer per key and timestamp. The timer resolution is in milliseconds but the runtime may not permit it at this granularity although the system clock can. Besides, the timer may work upto +/- one-time unit equal to resolution from the time specified. Timers can be coalesced between watermarks. However, there can be only one timer per key per resolution time-unit.

We now review asynchronous I/O for external data access in Flink. A MapFunction is synchronous. This might take arbitrary time for a single statement in an Flink Stream application. Asynchronous access helps enable high stream throughput.

The number of request responses increased dramatically with high stream throughput which incurs a high resource cost and bookkeeping cost. Each asynchronous request returns a future and the results of the asynchronous operation are passed as the result of the future.

Asynchronous api implementations extend an AsyncFunction and specify a callback with asyncInvoke override. The I/O operation is then applied as a transformation with the help of
AsyncDataStream.unorderedWait( stream, new AsyncImplementation(), 1000, TimeUnit.MILLISECONDS, 100);

Asynchronous requests come with timeouts. The results can be ordered or unordered.

Cluster computing

Sunday, January 5, 2020

No comments:

Post a Comment