Saturday, August 1, 2020

Support for small to large footprint introspection database and query

One of the restrictions that comes with packaged queries exported via REST APIs is their limited ability to scale, since they consume significant resources on the backend and continue to run for a long time. These restrictions cannot be relaxed without reducing that resource usage. The API must therefore provide a way for consumers to launch several queries with trackers, and those queries should complete reliably even if they are executed one by one. This is facilitated with a reference to the query and a progress indicator. The reference is merely an opaque identifier that only the system issues and uses to look up the status. The indicator could be another API that takes the reference and returns the status.

It is relatively easy for the system to separate read-only status information from read-write operations, so that however often the status indicator is called there is no degradation to the rest of the system. The status information forms a cleanly separated part of the system, usually collected periodically from, or pushed by, the rest of the system. Separating read-write from read-only also allows the two to be treated differently. For example, the technology behind the read-only side can be replaced independently of the technology behind the read-write side, and the read-only technology itself can be swapped from one to another for improvements on that side.
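As a minimal client-side sketch of this pattern, assuming hypothetical /queries, /queries/{id}/status and /queries/{id}/results endpoints and the Python requests library (the endpoint names and response fields are illustrative, not part of any particular product):

import time
import requests

BASE = "https://example.com/api"    # placeholder for the actual service

# Submit a long-running query; the service returns an opaque reference.
resp = requests.post(f"{BASE}/queries",
                     json={"text": "SELECT region, count(*) FROM events GROUP BY region"})
resp.raise_for_status()
query_ref = resp.json()["id"]       # opaque identifier issued only by the system

# Poll the read-only status endpoint until the query completes.
while True:
    status = requests.get(f"{BASE}/queries/{query_ref}/status").json()
    if status["state"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(2)                   # polling the status store does not load the query engine

if status["state"] == "SUCCEEDED":
    results = requests.get(f"{BASE}/queries/{query_ref}/results").json()

Because the status endpoint only reads from the separately maintained status store, the client can poll it as often as it likes without slowing down the query itself.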
The design of REST APIs generally follows convention. This practice gives well-recognized URI qualifier patterns, query parameters, and methods. Exceptions, errors, and logging are typically handled by making the best use of the HTTP protocol, and the revisions to HTTP/1.1 help with these conventions.
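A small server-side sketch of those conventions, assuming Flask purely for illustration, with a POST that returns 202 Accepted and an opaque reference, a GET status endpoint, and standard HTTP error codes (all route and field names here are invented):

import uuid
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
QUERIES = {}                                  # in-memory store; a real system would persist this

@app.route("/api/queries", methods=["POST"])
def submit_query():
    body = request.get_json(silent=True)
    if not body or "text" not in body:
        abort(400)                            # malformed request -> 400 Bad Request
    ref = str(uuid.uuid4())                   # opaque reference issued by the system
    QUERIES[ref] = {"text": body["text"], "state": "RUNNING", "progress": 0}
    return jsonify({"id": ref}), 202          # 202 Accepted: work continues in the background

@app.route("/api/queries/<ref>/status", methods=["GET"])
def query_status(ref):
    if ref not in QUERIES:
        abort(404)                            # unknown reference -> 404 Not Found
    q = QUERIES[ref]
    return jsonify({"state": q["state"], "progress": q["progress"]})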
A REST API request can also be translated into a SQL query with the help of engines like Presto or LogParser.
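As an illustration of such a translation (the REST parameter names here are hypothetical, and a real translator would validate and parameterize every value before handing it to Presto or LogParser):

def rest_to_sql(table, params):
    # Map a small, fixed set of REST query parameters onto a SQL statement.
    sql = f"SELECT * FROM {table}"
    if "filter" in params:                    # e.g. filter=region:us
        column, value = params["filter"].split(":", 1)
        sql += f" WHERE {column} = '{value}'"
    if "sort" in params:                      # e.g. sort=count
        sql += f" ORDER BY {params['sort']}"
    if "limit" in params:                     # e.g. limit=10
        sql += f" LIMIT {int(params['limit'])}"
    return sql

print(rest_to_sql("events", {"filter": "region:us", "sort": "count", "limit": "10"}))
# SELECT * FROM events WHERE region = 'us' ORDER BY count LIMIT 10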
Tools like LogParser allow SQL queries to be executed over enumerables. SQL has supported user-defined operators for a while now. Aggregation operations have the benefit that they can work in both batch and streaming mode, and they can therefore operate on large datasets because they view only a portion at a time. The batch operation can be parallelized across a number of processors, as has generally been the case with Big Data and cluster-mode operations. Both batch and stream processing can operate on unstructured data.
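A rough sketch of the difference in plain Python: the batch form materializes the whole dataset before aggregating, while the streaming form folds one record at a time and never holds more than the running total (function names are illustrative):

def batch_sum(records):
    # Batch mode: the whole dataset is materialized before the operator runs,
    # and the work over it can be split across processors.
    data = list(records)
    return sum(data)

def streaming_sum(records):
    # Streaming mode: the aggregate is updated one element at a time,
    # so only a portion of the data is ever in view.
    total = 0
    for value in records:
        total += value
    return total

stream = (x * x for x in range(1_000_000))    # lazily produced, never fully materialized
print(streaming_sum(stream))
print(batch_sum(x * x for x in range(10)))    # 285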
Presto from Facebook is a distributed SQL query engine that can operate on streams from various data sources, supporting ad hoc queries in near real-time.
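A minimal sketch of an ad hoc query against Presto from Python, assuming the presto-python-client package; the host, user, catalog, and table names are placeholders:

import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",                # placeholder coordinator address
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT region, count(*) AS cnt FROM events GROUP BY region")
for region, cnt in cur.fetchall():
    print(region, cnt)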
Just like the standard query operators of .NET, the Flink SQL layer is merely a convenience over the Table API. Presto, on the other hand, offers to run over any kind of data source, not just Table APIs.
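A small PyFlink sketch of that layering, assuming a recent Flink release: the same aggregation expressed once through the Table API and once through the SQL layer on top of it.

from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
orders = t_env.from_elements(
    [("alice", 10), ("bob", 5), ("alice", 7)], ["name", "amount"]
)

# Table API form.
by_table_api = orders.group_by(col("name")).select(
    col("name"), col("amount").sum.alias("total")
)

# Equivalent SQL form layered over the same table.
t_env.create_temporary_view("orders", orders)
by_sql = t_env.sql_query("SELECT name, SUM(amount) AS total FROM orders GROUP BY name")

by_table_api.execute().print()
by_sql.execute().print()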
Although Apache Spark query code and Apache Flink query code look very similar, the former treats stream processing as a special case of batch processing while the latter does just the reverse.
Apache Flink also provides a SQL abstraction over its Table API.
While Apache Spark offers ".map(...)" and ".reduce(...)" syntax to support batch-oriented processing, Apache Flink provides a Table API with ".groupBy(...)" and ".orderBy(...)" syntax, layers its SQL abstraction on top of it, and supports stream processing as the norm.
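For the Spark side, a minimal PySpark sketch of that ".map(...)"/".reduce(...)" style over an RDD; the local master setting is used only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("map-reduce-sketch").getOrCreate()

# Batch-oriented RDD style: transform with map, combine with reduce.
numbers = spark.sparkContext.parallelize(range(1, 101))
sum_of_squares = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(sum_of_squares)                         # 338350

spark.stop()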
The APIs vary as the engines change, even though together they form an interface through which any virtualized query can be executed, with each API defining its own query-processing syntax. They vary because their requests and responses differ in size and content. Some APIs are better suited to stream processing and others to batch processing.
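One way to picture that common interface in Python (purely a sketch; the class and method names are invented): each engine adapter accepts the same virtualized query even though it dispatches it in its own way.

from abc import ABC, abstractmethod

class QueryEngine(ABC):
    # Common facade over engines whose request and response shapes differ.
    @abstractmethod
    def execute(self, sql: str):
        ...

class PrestoEngine(QueryEngine):
    def execute(self, sql: str):
        # Would dispatch to a Presto coordinator; suited to ad hoc batch queries.
        return f"presto: {sql}"

class FlinkEngine(QueryEngine):
    def execute(self, sql: str):
        # Would submit a continuous job; suited to stream processing.
        return f"flink: {sql}"

def run_everywhere(engines, sql):
    # Callers see one interface even though the engines vary underneath.
    return [engine.execute(sql) for engine in engines]

print(run_everywhere([PrestoEngine(), FlinkEngine()], "SELECT 1"))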
