Cluster computing

Friday, November 22, 2019

A comparison of Flink SQL execution and Facebook’s Presto, continued:
A Flink application can express its logic as SQL queries. This abstraction works closely with the Table API, and SQL queries can be executed over tables. The Table API is a language centered around tables and follows a relational model. Tables have a schema attached, and the API provides the relational operators of selection, projection and join. Programs written with the Table API go through an optimizer that applies optimization rules before execution.
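To make the two styles concrete, here is a minimal sketch in plain Python. It is not Flink's API: the `select`, `project` and `join` helpers, the sample tables and their column names are all assumptions made up for illustration, and SQLite merely stands in for a SQL engine. The point is only that the same query can be written either as a chain of relational operators or as a SQL string.

```python
import sqlite3

# Hypothetical sample tables; names and values are illustrative only.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 75.0},
    {"order_id": 3, "customer_id": 10, "amount": 120.0},
]
customers = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 11, "name": "Grace"},
]

def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]

def join(left, right, key):
    """Equi-join on a shared key, using a hash lookup on the right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def table_style():
    """The query written as a chain of relational operators."""
    big = select(orders, lambda r: r["amount"] > 50)
    joined = join(big, customers, "customer_id")
    return project(joined, ["name", "amount"])

def sql_style():
    """The same query written as SQL, run on an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO orders VALUES (:order_id, :customer_id, :amount)", orders)
    conn.executemany("INSERT INTO customers VALUES (:customer_id, :name)", customers)
    cur = conn.execute(
        "SELECT c.name, o.amount FROM orders o "
        "JOIN customers c ON o.customer_id = c.customer_id "
        "WHERE o.amount > 50 ORDER BY o.order_id"
    )
    rows = [{"name": name, "amount": amount} for name, amount in cur]
    conn.close()
    return rows
```

Both functions return the same rows for this data. In a real engine the optimizer, not the author of the query, would decide details such as whether the filter runs before the join.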
Presto, from Facebook, is a distributed SQL query engine that can operate on streams from various data sources and supports ad hoc queries in near real-time.
Consider the time and space dimension queries that are generally required in dashboards, charts and graphs. These queries need to search over accumulated data, which can be quite large and often exceeds terabytes. As the data grows, the metadata becomes all the more important, and its organization can be tailored to the queries instead of mirroring the organization of the data. If there is a need to separate online ad hoc queries on current metadata from more analytical and intensive background queries, then we can choose a different organization of the information for each category so that each serves its queries better.
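One way to picture metadata organized around the queries rather than around the data is a pair of small lookup structures, one keyed by time bucket and one keyed by a spatial grid cell. The sketch below is an assumption-laden illustration: the `MetadataIndex` class, the hourly bucket size and the one-degree grid are all hypothetical choices, not something either Flink or Presto prescribes.

```python
from collections import defaultdict

def hour_bucket(epoch_seconds):
    """Map a timestamp (seconds since epoch) to an hourly bucket number."""
    return epoch_seconds // 3600

def grid_cell(lat, lon, cell_degrees=1.0):
    """Map a coordinate to a coarse grid cell (hypothetical 1-degree grid)."""
    return (int(lat // cell_degrees), int(lon // cell_degrees))

class MetadataIndex:
    """Keeps object ids organized by the dimensions the dashboards query on."""

    def __init__(self):
        self.by_hour = defaultdict(list)
        self.by_cell = defaultdict(list)

    def add(self, object_id, epoch_seconds, lat, lon):
        self.by_hour[hour_bucket(epoch_seconds)].append(object_id)
        self.by_cell[grid_cell(lat, lon)].append(object_id)

    def in_time_range(self, start, end):
        """Ids in every hourly bucket that overlaps [start, end].

        The result is bucket-granular: it is a candidate set that a caller
        would still filter against exact timestamps.
        """
        ids = []
        for bucket in range(hour_bucket(start), hour_bucket(end) + 1):
            ids.extend(self.by_hour.get(bucket, []))
        return ids

    def in_cell(self, lat, lon):
        """Ids recorded in the same grid cell as the given coordinate."""
        return list(self.by_cell.get(grid_cell(lat, lon), []))
```

A dashboard query then touches only the buckets it needs instead of scanning the accumulated data, and a heavier analytical job could keep its own, differently organized copy of the same metadata.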
SQL queries have dominated almost all database and warehouse workloads, no matter how high the software stack built over these products. On the other hand, even simple summation-form logic on big data needs to be written with Map-Reduce methods. Although query languages such as U-SQL are trying to bridge the gap, and adapters are written to translate SQL queries for other forms of data stores, these are not native to the unstructured stores, which often come up with their own query languages.
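As a rough illustration of the gap, here is a grouped sum written in the Map-Reduce style in plain Python. The function names, the `(region, value)` record shape and the sample data are assumptions for this sketch, and everything runs in one process, whereas a real framework would distribute the phases across a cluster.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: (region, value) records.
records = [("east", 3), ("west", 5), ("east", 4), ("west", 1), ("north", 2)]

def map_phase(chunk):
    """Emit a (key, value) pair for every record in one chunk of input."""
    return [(region, value) for region, value in chunk]

def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Fold each key's values into a single sum."""
    return {key: reduce(lambda a, b: a + b, values, 0)
            for key, values in groups.items()}

def map_reduce_sum(chunks):
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))

# The SQL equivalent would be a single statement:
#   SELECT region, SUM(value) FROM records GROUP BY region;
totals = map_reduce_sum([records[:2], records[2:]])
```

The three phases together do what one `GROUP BY` with `SUM` expresses, which is why SQL layers over such stores are attractive.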
The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerable data sets. SQL has supported user defined operators for a while now. These operators help with additional computations that are not present as built-ins. In the case of relational data, they have generally been user defined functions or user defined aggregates. With an enumerable data set, the SQL available in LogParser is somewhat limited. Any implementation of a query execution layer over a key-value collection could choose to allow or disallow user defined operators. These enable computation on, say, user defined data types that are not restricted to the system defined types. Such types have been useful with spatial co-ordinates or geographical data, for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user defined data types and user defined operators.
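The vector addition example can be sketched with a user defined type and a user defined aggregate. The sketch uses Python's built-in `sqlite3` module as a stand-in SQL engine (it is not LogParser), and the `Vec2` type, its `"x,y"` text encoding and the `vec_sum` aggregate name are all hypothetical choices made for illustration.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Vec2:
    """A user defined data type: a two-dimensional vector."""
    x: float
    y: float

    def __add__(self, other):
        # The user defined operator: component-wise addition.
        return Vec2(self.x + other.x, self.y + other.y)

    @staticmethod
    def parse(text):
        x, y = text.split(",")
        return Vec2(float(x), float(y))

    def serialize(self):
        # Stored as text because the engine has no native vector type.
        return f"{self.x},{self.y}"

class VecSum:
    """A user defined aggregate: sums serialized Vec2 values."""

    def __init__(self):
        self.total = Vec2(0.0, 0.0)

    def step(self, value):
        if value is not None:
            self.total = self.total + Vec2.parse(value)

    def finalize(self):
        return self.total.serialize()

def sum_vectors(values):
    conn = sqlite3.connect(":memory:")
    # Register the aggregate so SQL text can call it by name.
    conn.create_aggregate("vec_sum", 1, VecSum)
    conn.execute("CREATE TABLE points (v TEXT)")
    conn.executemany("INSERT INTO points VALUES (?)",
                     [(v.serialize(),) for v in values])
    (result,) = conn.execute("SELECT vec_sum(v) FROM points").fetchone()
    conn.close()
    # Guard for an empty table, where the aggregate may yield no value.
    return Vec2.parse(result) if result is not None else Vec2(0.0, 0.0)
```

For instance, `sum_vectors([Vec2(1.0, 2.0), Vec2(3.0, 4.5)])` yields `Vec2(4.0, 6.5)`. The query text stays ordinary SQL while the type-specific logic lives in the registered operator.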
The solutions that are built on top of SQL queries are not necessarily looking to write SQL. They want programmability and are just as content to write code using the Table API as they are with SQL queries. Consequently, business intelligence and CRM solutions write their code with standard query operators over their data. These well-defined operators are applicable to most collections of data, and the notion is shared with other programming platforms such as .NET.
