Wednesday, November 27, 2019

The solutions that are built on top of SQL queries are not necessarily looking to write SQL. They want programmability and are just as content writing code with the Table API as with SQL queries. Consequently, business intelligence and CRM solutions write their code with standard query operators over their data. These well-defined operators are applicable to most collections of data, and the notion is shared with other programming languages such as .NET.
Consistency also places hard requirements on the choice of processing. Strict consistency has traditionally been associated with relational databases. Eventual consistency has been made possible in distributed storage with the help of consensus algorithms such as Paxos. Big Data is usually associated with eventual consistency.
Partitioning is probably the reason why tools like Spark and Flink have become popular for Big Data:
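The idea behind partitioning can be illustrated with a minimal plain-Java sketch (no Flink required, and the class and method names here are illustrative): records are hashed by key into a fixed number of partitions, and each partition can then be processed by a separate parallel worker.

```java
import java.util.*;

public class PartitionDemo {
    // Assign a record to one of numPartitions buckets by hashing its key.
    static int partitionFor(String key, int numPartitions) {
        // Math.floorMod guards against negative hashCode values.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        List<String> keys = Arrays.asList("user-1", "user-2", "user-3", "user-1");
        // Group records into partitions; each partition can be processed in parallel.
        Map<Integer, List<String>> partitions = new HashMap<>();
        for (String key : keys) {
            partitions.computeIfAbsent(partitionFor(key, numPartitions),
                    p -> new ArrayList<>()).add(key);
        }
        System.out.println(partitions);
    }
}
```

Because the same key always hashes to the same partition, grouped operations such as a per-key sum can run independently on each partition.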
Apache Flink provides a SQL abstraction over its Table API. It also provides syntax to make use of partitioning and increase parallelism.
Apache Flink provides the ability to read from various data sources with methods such as env.readTextFile, env.readCsvFile, and env.fromCollection, while allowing reads from and writes to streams via connectors.
A typical query looks like this:
env.readTextFile(ratingsCsvPath)
   .flatMap(new ExtractRating())
   .groupBy(0)
   .sum(1)
   .print();
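The pipeline above flat-maps each line into (key, rating) pairs, groups by the first field, and sums the second. Since ExtractRating and the CSV layout are not shown in the post, here is a plain-Java sketch of the same grouped-sum logic, with a hypothetical "movieId,rating" line format assumed:

```java
import java.util.*;
import java.util.stream.*;

public class GroupSumDemo {
    // Parse "movieId,rating" lines and sum ratings per movie,
    // mirroring the groupBy(0).sum(1) step in the Flink snippet.
    static Map<String, Integer> sumRatings(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(","))
                .collect(Collectors.groupingBy(
                        cols -> cols[0],
                        Collectors.summingInt(cols -> Integer.parseInt(cols[1]))));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("m1,4", "m2,5", "m1,3");
        System.out.println(sumRatings(lines)); // m1 -> 7, m2 -> 5
    }
}
```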
There are connectors for different data sources, which can be passed in via addSource.
To add parallelization, which governs the tasks, replicas, and runtime, we can set it globally with env.setParallelism(dop);
Otherwise we can set it at the DataSet API level.
There is also a similar method, setMaxParallelism, which prevents the parallelism from going beyond that limit.
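The interplay of a requested degree of parallelism and a configured maximum can be sketched in plain Java. This is only a rough analogy (the class and clamping behavior here are illustrative, not Flink's actual semantics): the effective parallelism never exceeds the maximum.

```java
public class ParallelismDemo {
    private int parallelism = 1;
    private int maxParallelism = Integer.MAX_VALUE;

    void setMaxParallelism(int max) {
        this.maxParallelism = max;
    }

    // The requested degree of parallelism is clamped to the configured maximum.
    void setParallelism(int dop) {
        this.parallelism = Math.min(dop, maxParallelism);
    }

    int getParallelism() {
        return parallelism;
    }

    public static void main(String[] args) {
        ParallelismDemo env = new ParallelismDemo();
        env.setMaxParallelism(8);
        env.setParallelism(16);
        System.out.println(env.getParallelism()); // capped at 8
    }
}
```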
The newer DataStream API allows methods such as assignTimestampsAndWatermarks because it supports watermarks, which allow the ordering of the events in the stream. A sink can be added to write the events to a store.
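The watermark idea can be sketched in plain Java (the class name and the 5-unit out-of-orderness bound are assumptions for illustration): the watermark trails the highest event timestamp seen by a fixed bound, and an event is considered late once its timestamp falls behind the watermark.

```java
public class WatermarkDemo {
    private final long maxOutOfOrderness;
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkDemo(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Track the highest event timestamp observed so far.
    void onEvent(long timestamp) {
        maxTimestampSeen = Math.max(maxTimestampSeen, timestamp);
    }

    // Watermark: the point in event time before which no more events are expected.
    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrderness;
    }

    boolean isLate(long timestamp) {
        return timestamp < currentWatermark();
    }

    public static void main(String[] args) {
        WatermarkDemo wm = new WatermarkDemo(5);
        wm.onEvent(100);
        wm.onEvent(103);
        System.out.println(wm.currentWatermark()); // 98
        System.out.println(wm.isLate(97));         // true: behind the watermark
        System.out.println(wm.isLate(99));         // false: still within bounds
    }
}
```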
The Flink connector library for Pravega adds a data source and sink for use with the Flink DataStream API.
The FlinkPravegaWriter and FlinkPravegaReader work with generic events as long as the events can be serialized and deserialized, respectively.
The FlinkPravegaWriter extends RichSinkFunction and implements the ListCheckpointed and CheckpointListener interfaces.
The FlinkPravegaReader extends RichParallelSourceFunction and implements the ResultTypeQueryable, StoppableFunction, and ExternallyInducedSource interfaces.
Notice that the extension of RichParallelSourceFunction from Flink is meant for high scalability with reader groups. This is required in use cases such as reading from a socket, where the data may arrive at a fast rate.
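The serializability requirement on events amounts to a pair of functions that must round-trip. A minimal plain-Java sketch (the event format and class names are hypothetical, not the connector's actual schema classes): an event is encoded to UTF-8 bytes on the writer side and decoded back on the reader side.

```java
import java.nio.charset.StandardCharsets;

public class SerializationDemo {
    // Serialize an event to bytes, as a writer-side schema would.
    static byte[] serialize(String event) {
        return event.getBytes(StandardCharsets.UTF_8);
    }

    // Deserialize bytes back to the event, as a reader-side schema would.
    static String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String event = "rating:m1:4";
        // The round trip must reproduce the original event exactly.
        System.out.println(deserialize(serialize(event)).equals(event)); // true
    }
}
```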
