Thursday, November 28, 2019

We were discussing Flink APIs and connectors.
A typical query looks like this:

env.readTextFile(ratingsCsvPath)
   .flatMap(new ExtractRating())
   .groupBy(0)
   .sum(1)
   .print();
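The ExtractRating function is not shown in the post. A minimal sketch of the kind of parsing it might perform, assuming a MovieLens-style "userId,movieId,rating" CSV layout (the column layout is a hypothetical assumption), written as a plain static helper so the logic stands alone outside Flink's FlatMapFunction wrapper:

```java
// Hypothetical parsing logic behind an ExtractRating flatMap.
// In Flink this would live inside a FlatMapFunction<String, Tuple2<...>>;
// here it is a plain helper so it can be read (and run) in isolation.
class RatingParser {
    // Returns the rating column of a "userId,movieId,rating" CSV line
    // (assumed layout), truncated to an int, or -1 for malformed input.
    static int extractRating(String line) {
        String[] fields = line.split(",");
        if (fields.length < 3) {
            return -1;
        }
        try {
            return (int) Double.parseDouble(fields[2].trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Inside the actual flatMap, each parsed line would be emitted to the Collector as a tuple, which groupBy(0) and sum(1) then aggregate.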

There are connectors for different data sources, which can be passed in via addSource.

To control parallelism, which governs the number of parallel task instances at runtime, we can set it globally with env.setParallelism(dop);

Otherwise, we can set it per operator on the DataSet API.

There is also a similar method, setMaxParallelism, which prevents the parallelism from going beyond that limit.
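As a configuration sketch, the two scopes might look like the following (the operator and stream names are placeholders, not from the post):

```java
// Sketch: parallelism can be set globally on the environment
// or per operator (shown here with the streaming environment).
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.setParallelism(4);      // default degree of parallelism for all operators
env.setMaxParallelism(128); // upper bound the parallelism may not exceed

// ...or per operator, overriding the environment default
// (ratings and ExtractRating are hypothetical names):
ratings.flatMap(new ExtractRating()).setParallelism(2);
```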

The newer DataStream API supports watermarks and therefore offers methods such as assignTimestampsAndWatermarks, which allow the events in the stream to be ordered by event time. A sink can be added to write the events to a store.
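The idea behind a bounded-out-of-orderness watermark, the common strategy used with assignTimestampsAndWatermarks, can be sketched in plain Java: the watermark trails the highest event timestamp seen so far by a fixed delay, asserting that no event older than the watermark is still expected. This is a standalone illustration of the arithmetic, not Flink's actual WatermarkGenerator interface:

```java
// Sketch of the bounded-out-of-orderness idea: the watermark lags the
// maximum observed event timestamp by a fixed tolerance, so events that
// arrive up to that much late can still be ordered correctly.
class BoundedWatermarkTracker {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedWatermarkTracker(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Called once per event with its event-time timestamp.
    void onEvent(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
    }

    // The watermark: no event with a smaller timestamp is expected anymore.
    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs;
    }
}
```

With a 200 ms tolerance, seeing events at t=1000 and then t=900 (out of order) leaves the watermark at 800, so the late event at 900 is still above the watermark and can be placed in order.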

The Flink connector library for Pravega adds a data source and sink for use with the Flink Data API.

The FlinkPravegaWriter and FlinkPravegaReader work with generic events as long as the events are serializable and deserializable, respectively.
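What "serializable and deserializable" amounts to is that the event type must round-trip to and from bytes. A minimal sketch in plain Java, using a UTF-8 string codec rather than the connector's actual schema interfaces:

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch of event (de)serialization, as the Pravega writer and
// reader require: the event must convert to bytes and back losslessly.
// This plain UTF-8 codec stands in for the connector's schema classes.
class EventCodec {
    static byte[] serialize(String event) {
        return event.getBytes(StandardCharsets.UTF_8);
    }

    static String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The writer would apply something like serialize before appending to the stream, and the reader would apply deserialize on each event it pulls.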

The FlinkPravegaWriter extends RichSinkFunction and implements ListCheckpointed and CheckpointListener interfaces.

The FlinkPravegaReader extends RichParallelSourceFunction and implements the ResultTypeQueryable, StoppableFunction and ExternallyInducedSource interfaces.

Notice that extending RichParallelSourceFunction from Flink is meant for high scalability with reader groups. This is required in use cases such as reading from a socket, where the data may arrive at a fast rate.
