Saturday, November 30, 2019

We were discussing Flink APIs and connectors:

A typical query looks like this:

env.readTextFile(ratingsCsvPath)
   .flatMap(new ExtractRating())
   .groupBy(0)
   .sum(1)
   .print();



There are connectors for different data sources, which can be passed in via addSource.
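
For example, a Kafka source can be attached this way. Here is a minimal sketch, assuming the Flink Kafka connector is on the classpath; the broker address, group id, and topic name are placeholders:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "ratings-consumer");        // placeholder group id

        // addSource accepts any SourceFunction; here, the Kafka connector source
        DataStream<String> lines = env.addSource(
                new FlinkKafkaConsumer<>("ratings", new SimpleStringSchema(), props));

        lines.print();
        env.execute("Kafka source example");
    }
}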



To add parallelism, which governs the number of parallel task instances at runtime, we can set it globally with env.setParallelism(dop);



Otherwise, we can set it per operator through the DataSet API.



There is even a similar method to set the maximum parallelism, which prevents the parallelism from going beyond that limit.
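
A minimal sketch of these settings, reusing the query above (ExtractRating and ratingsCsvPath are the same placeholders; using the maximum-parallelism setter on ExecutionConfig for the batch API is my assumption):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);                    // global default for all operators
env.getConfig().setMaxParallelism(128);   // upper bound the parallelism may not exceed

env.readTextFile(ratingsCsvPath)
   .flatMap(new ExtractRating())
   .setParallelism(2)                     // per-operator override via the DataSet API
   .groupBy(0)
   .sum(1)
   .print();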



The newer DataStream API allows methods such as assignTimestampsAndWatermarks, because it supports watermarks, which allow events in the stream to be ordered by event time. A sink can be added to write the events to a store.
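
A minimal sketch, assuming a hypothetical Event POJO with a long timestamp field and a DataStream<Event> named events obtained from some source:

// Assign event-time timestamps and watermarks, tolerating 5 seconds of out-of-orderness
DataStream<Event> ordered = events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(Event e) {
                return e.timestamp;   // event time in epoch milliseconds
            }
        });

// A sink writes the events to a store; here, simply to text files
ordered.writeAsText("/tmp/events");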



The Flink connector library for Pravega adds a data source and a sink for use with the Flink DataStream API.



The FlinkPravegaWriter and FlinkPravegaReader work with generic events as long as the events can be serialized and deserialized, respectively, via a supplied schema.



The FlinkPravegaWriter extends RichSinkFunction and implements the ListCheckpointed and CheckpointListener interfaces.
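
The following skeletal sink has the same shape (it only illustrates what those interfaces contribute; it is not the actual Pravega implementation): it buffers events, snapshots them with each checkpoint, and treats checkpoint completion as the safe point to commit:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class BufferingSink extends RichSinkFunction<String>
        implements ListCheckpointed<String>, CheckpointListener {

    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value);              // stage the event until the next checkpoint
    }

    @Override
    public List<String> snapshotState(long checkpointId, long timestamp) {
        return new ArrayList<>(buffer); // persist pending events with the checkpoint
    }

    @Override
    public void restoreState(List<String> state) {
        buffer.clear();
        buffer.addAll(state);           // recover pending events after a failure
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        buffer.clear();                 // the checkpoint succeeded; committing is safe here
    }
}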



The FlinkPravegaReader extends RichParallelSourceFunction and implements the ResultTypeQueryable, StoppableFunction and ExternallyInducedSource interfaces.



Notice that the extension of RichParallelSourceFunction from Flink is meant for high scalability through reader groups. This is required in use cases such as reading from a socket, where the data may arrive at a fast rate.
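
A skeletal parallel source (again, illustrative only, not the Pravega code) shows the shape: each parallel subtask runs its own copy of run(), which is how a reader group can scale out across readers:

import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class SkeletalSource extends RichParallelSourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        while (running) {
            // Each subtask would pull from its own partition or segment here
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect("event from subtask " + subtask);
            }
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}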

Flink has a construct for WindowJoin, where it can relate two entities within a window. The two sides of a join may have different rates of arrival, so the output may not always be the same, but the output of the join will always follow a fixed format.
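
A minimal sketch of such a join, in the spirit of Flink's WindowJoin example, assuming two placeholder streams grades and salaries of (name, grade) and (name, salary) tuples:

DataStream<Tuple3<String, Integer, Integer>> joined = grades
    .join(salaries)
    .where(new KeySelector<Tuple2<String, Integer>, String>() {
        @Override
        public String getKey(Tuple2<String, Integer> g) { return g.f0; } // key by name
    })
    .equalTo(new KeySelector<Tuple2<String, Integer>, String>() {
        @Override
        public String getKey(Tuple2<String, Integer> s) { return s.f0; }
    })
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>,
                            Tuple3<String, Integer, Integer>>() {
        @Override
        public Tuple3<String, Integer, Integer> join(Tuple2<String, Integer> g,
                                                     Tuple2<String, Integer> s) {
            // The output always follows this (name, grade, salary) format,
            // even though arrival rates on the two sides differ
            return new Tuple3<>(g.f0, g.f1, s.f1);
        }
    });
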
The word count is inherently in summation form. This means it can be parallelized by splitting the data. If the data arrives on a socket, the program just needs to do a map and a reduce. For example, the SocketWindowWordCount example in the Flink source code repository makes use of a flatMap function that translates a string into a dictionary of words and frequencies. The reduce function in this case merely combines the counts from the flatMap for the same word.
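
A condensed sketch of that pattern, assuming env is a StreamExecutionEnvironment and a text producer is listening on a local socket:

DataStream<String> text = env.socketTextStream("localhost", 9999);

DataStream<Tuple2<String, Integer>> counts = text
    // flatMap: translate each line into (word, 1) pairs
    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    })
    .keyBy(0)                      // group by word
    .timeWindow(Time.seconds(5))
    // reduce: combine the counts from the flatMap for the same word
    .reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1));

counts.print();
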
We can also create a trigger, since the examples are data driven. This trigger calculates a delta, or difference, between the datapoint that caused the last firing and the current datapoint. If the delta is higher than a threshold, the trigger fires. A delta function and a threshold are required for the trigger to work.
The delta function comes in useful for cumulative values, because the path of change does not matter, only the initial and final values. For example, a cumulative value is always monotonic for positive increments: it can climb at different rates, but the final value will be at the same or a higher level than the initial value. We can choose the delta function in this case to suitably determine a threshold. Word counts are all positive values. Therefore a trigger is suitable for a parallelized Flink word count example when we are interested in queries where thresholds are exceeded first.
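
A minimal sketch using Flink's DeltaTrigger on the counts stream from the word count sketch above; the threshold value is a placeholder:

double threshold = 100.0;

counts
    .keyBy(0)
    .window(GlobalWindows.create())
    .trigger(DeltaTrigger.of(
        threshold,
        new DeltaFunction<Tuple2<String, Integer>>() {
            @Override
            public double getDelta(Tuple2<String, Integer> oldPoint,
                                   Tuple2<String, Integer> newPoint) {
                // Counts only grow, so this delta is monotonic; the window fires
                // as soon as a word's count climbs past the threshold
                return newPoint.f1 - oldPoint.f1;
            }
        },
        counts.getType().createSerializer(env.getConfig())))
    .sum(1)
    .print();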
