Wednesday, December 18, 2019

Flink provides two sets of APIs to work with Data. One is for unbounded data such as streams and provided by DataStream APIs. Another is for the bounded data and provided by DataSet API.
Since Java programmers like collections, the latter provides methods to work with collections as source and sink. 
For example, we have methods on StreamExecutionEnvironment such as 
DataSet<T> result = env.fromCollection(data) which returns a DataSet.
And we have methods to specify a collection data sink as follows:
List<T> dataList = new ArrayList<T>();
result.output(new CollectionDataFormat(dataList));
This collection data sink works well as a debugging tool. 
Both the dataType and iterators must be serializable for the read from source and write to sink.
The DataStream API only has support for source from collections. If we need to write a stream to a sink we can access the console or file-system because that will most likely be for debugging. There is no direct conversion from stream to collection.
DataStream<T>  stream = env.fromCollection(data);
Iterator<T> it = DataStreamUtils.collect(stream); // expensive
There is however another method called addSink which can allow custom sinks.
For most purposes of a collection, a DataStream provides an iterate() method that returns an Iterative Stream. 
For example, 
IterativeStream<T> = stream.iterate()
DataStream<T> newStream = iteration.map(new MyMapFunction()); where the map is invoked inside the loop.
Finally, iteration.closeWith(…) to define the tail of the iteration.

We reviewed Flink’s DataStream and DataSet Api for its distinction of bounded and unbounded data.
Let us now take a look at the parallelizatio possibilities with Flink.
The Flink data flow model consists of the streams, transformation and sinks. An example of stream generation is the addSource method on the StreamExecutionEnvironment. An example for transformation is the map operator on the DataStream. An example of using a sink is the addSink method on the DataStream.
Each of these stages can partition the data to take advantage of parallelization. The addSource method may take a partition to begin the data flow on a smaller subset of the data. The transformation opeator may further partition the stream.  Some operators can even multiplex the partitions with keyBy operator and the sink may group the events together.

No comments:

Post a Comment