Sunday, November 17, 2019

A comparison of Flink SQL execution and Facebook’s Presto:
Flink provides the ability to write SQL query expressions. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language-integrated API centered around tables and follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection, and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.
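As a minimal sketch (assuming Flink's StreamTableEnvironment around the 1.9 release; the table, field names, and the incoming events stream are made up for illustration), a SQL query over a registered table might look like this:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env); // Table API / SQL entry point
tableEnv.registerDataStream("Events", events, "id, value"); // attach a schema to an existing DataStream
Table counts = tableEnv.sqlQuery("SELECT id, COUNT(*) AS cnt FROM Events GROUP BY id"); // relational query, optimized before execution
tableEnv.toRetractStream(counts, Row.class).print(); // emit updates as the aggregate changes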
Flink applications generally do not need to use the Table API and SQL layer described above. Instead, they can work directly with the core APIs: DataStream (unbounded streams) and DataSet (bounded data sets). These APIs provide the ability to perform stateful stream processing.
For example,
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer<>(…)); // source
DataStream<Event> events = lines.map(line -> parse(line)); // transformation
DataStream<Statistics> stats = events.keyBy("id").timeWindow(Time.seconds(10)).apply(new MyWindowAggregationFunction()); // windowed aggregation
stats.addSink(new BucketingSink<>(path)); // sink
Presto from Facebook is a distributed SQL query engine that can operate on data from various sources, supporting ad hoc queries in near real-time. It does not partition work with MapReduce; instead it executes queries with a custom SQL execution engine written in Java. It has a pipelined data model that can run multiple stages at once, streaming data between stages as it becomes available. This reduces end-to-end latency while maximizing parallelization across stages on large data sets. A coordinator takes the incoming query from the user, draws up the execution plan, and assigns resources. Facebook’s Presto can run on large data sets, such as social-media data on the order of petabytes. It can also run over HDFS for interactive analytics.
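A minimal sketch of submitting an ad hoc query through Presto’s JDBC driver (the coordinator host, catalog, schema, user, and table names here are all hypothetical):
import java.sql.*;

String url = "jdbc:presto://coordinator-host:8080/hive/default"; // the coordinator plans the query and assigns resources
try (Connection conn = DriverManager.getConnection(url, "analyst", null);
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT region, COUNT(*) AS views FROM page_views GROUP BY region")) {
    while (rs.next()) {
        System.out.println(rs.getString("region") + ": " + rs.getLong("views"));
    }
}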
There is also a difference in how queries behave when we match a single key versus many keys. For example, when we use the equality operator (=) versus the IN operator in a query, the size of the list of key-values to be iterated does not shrink. Only the efficiency of matching a tuple against the set of keys in the predicate improves with an IN operator, because we don’t have to traverse the entire list multiple times; instead, each entry is matched once against the set of keys specified by the IN operator. A join, on the other hand, can reduce the size of the range significantly and gives the query execution a chance to optimize the plan.
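As an illustrative sketch of the three forms (reusing the tableEnv from the earlier snippet; the Orders and ActiveIds tables are invented):
Table one = tableEnv.sqlQuery("SELECT * FROM Orders WHERE id = 42"); // match a single key
Table several = tableEnv.sqlQuery("SELECT * FROM Orders WHERE id IN (42, 43, 44)"); // one scan, each row checked against the key set
Table joined = tableEnv.sqlQuery("SELECT o.* FROM Orders o JOIN ActiveIds a ON o.id = a.id"); // the optimizer can narrow the range using the other table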
Just like the standard query operators in .NET, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, can run over many kinds of data sources, not just tables registered with a Table API.
