Separating the data records to be processed from the optimization of the operations on those records makes it possible to reduce the overall processing time by parallelizing the queries across separate workers. Because the record partitions are not shared and only the results are, each worker merely has to report its partial result to a central accumulator, which cuts the sequential time roughly by a factor of the number of workers.
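As a rough sketch of this scatter-gather arrangement, the following Python snippet partitions the records, runs the same query on each partition in a worker process, and merges only the per-partition counts in a central accumulator; the count_matches predicate and the partitioning scheme are illustrative assumptions, not a prescribed implementation.

from multiprocessing import Pool

def count_matches(partition):
    # Worker-side query: count records that satisfy a predicate.
    return sum(1 for record in partition if record.get("status") == "error")

def run_parallel(partitions, workers=4):
    with Pool(processes=workers) as pool:
        # Each worker processes its own partition; raw records are not shared.
        partial_results = pool.map(count_matches, partitions)
    # The central accumulator merges only the reported results.
    return sum(partial_results)

if __name__ == "__main__":
    records = [{"status": "ok"}, {"status": "error"}] * 1000
    partitions = [records[i::4] for i in range(4)]
    print(run_parallel(partitions))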
Query languages, tools, and products come with nuances, built-ins, and features that further help to analyze, optimize, and rewrite queries so that they perform better. Some form of query execution statistics is also made available by the store itself or by profiling tools. Efficiency can also be improved by breaking up a query's structure and introducing pipelining, so that the results passed from one stage to the next can be studied.
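For example, SQLite exposes its plan through EXPLAIN QUERY PLAN, which can be read from Python's standard sqlite3 module; other stores offer analogous facilities such as EXPLAIN or ANALYZE through their own tooling. The table and index below are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.execute("CREATE INDEX idx_kind ON events (kind)")

# The reported plan shows whether the (rewritten) query can use the index.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT count(*) FROM events WHERE kind = 'error'"):
    print(row)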
Pipelined execution involves the following stages:
1) acquisition
2) extraction
3) integration
4) analysis
5) interpretation
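A minimal sketch of these five stages chained together might look like the following; the stage bodies are placeholders, and a real pipeline would stream records between stages rather than materialize every intermediate result.

import json

def acquisition():
    # Stage 1: pull raw lines from a source (here, hard-coded samples).
    return ['{"user": "a", "value": "1"}', '{"user": "b", "value": "2"}']

def extraction(raw_lines):
    # Stage 2: parse the raw payloads into records.
    return [json.loads(line) for line in raw_lines]

def integration(records):
    # Stage 3: reconcile records with a common schema or reference data.
    return [{**r, "source": "web"} for r in records]

def analysis(records):
    # Stage 4: compute the quantities of interest.
    return {"total": sum(int(r["value"]) for r in records)}

def interpretation(summary):
    # Stage 5: turn the analysis into a human-readable conclusion.
    return f"observed total = {summary['total']}"

print(interpretation(analysis(integration(extraction(acquisition())))))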
The challenges facing pipelined execution involve:
1) Scale: A pipeline must hold up against a data tsunami. In
addition, data flow may fluctuate, and the pipeline must withstand the ebb
and flow. Data may be measured in rate, duration, and size, and the pipeline
may need to become elastic. Column stores and time series databases have
typically helped with this.
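One way a stage can absorb that ebb and flow is sketched below: a bounded queue applies backpressure to the producer while a pool of consumers drains it and writes onward (to, say, a column store or time series database). The queue size and worker count are illustrative assumptions.

import queue
import threading
import time

buffer = queue.Queue(maxsize=1000)   # bounded: the producer blocks when full

def producer(n):
    for i in range(n):
        buffer.put({"ts": time.time(), "value": i})   # blocks under pressure

def consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        # ... write the item onward, e.g. to a column store ...
        buffer.task_done()

workers = [threading.Thread(target=consumer) for _ in range(4)]
for w in workers:
    w.start()
producer(10_000)
buffer.join()                 # wait until every item has been handled
for _ in workers:
    buffer.put(None)          # sentinel to stop each consumer
for w in workers:
    w.join()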
2) Heterogeneity: Data types, in both format and semantics,
have evolved widely. The pipeline must at least support primitives
such as variants, objects, and arrays. Support for different data stores has
become more important as services have become microservices that are independent of
each other in their implementations, including data storage. Data and their
services also carry a large legacy that needs to be accommodated in the
pipeline. ETL operations and structured storage are still the norm in many
industries.
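A small sketch of the variant problem: the same logical field may arrive as a scalar, an array, or an object depending on the producing service, and the pipeline normalizes all three to one shape. The field and values are illustrative.

def normalize_tags(value):
    # Accept scalar, array, and object variants of a "tags" field.
    if value is None:
        return []
    if isinstance(value, list):                      # array variant
        return [str(v) for v in value]
    if isinstance(value, dict):                      # object variant
        return [f"{k}={v}" for k, v in value.items()]
    return [str(value)]                              # scalar variant

print(normalize_tags("prod"))
print(normalize_tags(["prod", "eu-west"]))
print(normalize_tags({"env": "prod", "region": "eu-west"}))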
3) Extension: ETL may require flexibility both in extending its logic
and in how it is deployed on clusters versus servers. Microservices are much easier
to write, and they became popular alongside Big Data storage. Together they have
bound compute and storage into their own verticals, with data stores expanding
in number and variety. Queries written
in one service now need to be rewritten in another, while the pipeline may
or may not support data virtualization.
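A thin virtualization layer can soften this: one logical filter is translated per backing store so the same query does not have to be hand-written in every service. The translators and store names below are illustrative rather than any particular product's API, and a real translator would bind parameters instead of interpolating strings.

def to_sql(field, value):
    # Illustrative only; a real implementation would use parameter binding.
    return f"SELECT * FROM events WHERE {field} = '{value}'"

def to_document_filter(field, value):
    return {"collection": "events", "filter": {field: value}}

TRANSLATORS = {"warehouse": to_sql, "document_store": to_document_filter}

def logical_query(store, field, value):
    # Route one logical query to the store-specific dialect.
    return TRANSLATORS[store](field, value)

print(logical_query("warehouse", "kind", "error"))
print(logical_query("document_store", "kind", "error"))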
4) Timeliness: Both synchronous and asynchronous processing
need to be facilitated so that some data transfers can run online while
others are relegated to the background. Publisher-subscriber message queues
may be used for this. Individual services and
brokers do not scale the way cluster-based message queues do. It might take
nearly a year to bring the data into the analytics system and only a month to run
the analysis; while the benefit to the user may be immense, their patience for
the overall elapsed time may be thin.
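A sketch of mixing the online and background paths with a publisher-subscriber queue, using Python's asyncio: the urgent item is handled inline with the request, while bulk items are published to a queue that a subscriber task drains in the background. The split between urgent and bulk work here is an illustrative assumption.

import asyncio

async def subscriber(q):
    while True:
        item = await q.get()
        if item is None:
            break
        await asyncio.sleep(0.01)     # simulate a background transfer
        q.task_done()

async def main():
    q = asyncio.Queue()
    background = asyncio.create_task(subscriber(q))
    # Online path: handle the urgent item synchronously with the request.
    urgent_result = {"id": 1, "status": "processed"}
    # Background path: publish bulk items for the subscriber to drain.
    for i in range(100):
        q.put_nowait({"id": i})
    await q.join()                    # wait for the background drain
    q.put_nowait(None)                # stop the subscriber
    await background
    return urgent_result

print(asyncio.run(main()))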
5) Privacy: A user's location, personally identifiable
information, and data from location-based services must be redacted. This
involves not only parsing for such data but also doing so repeatedly,
starting with admission control at the integration boundaries.
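A minimal sketch of redaction at admission control: each record is scanned for personally identifiable fields before it crosses the integration boundary, and the same filter can be re-applied at later stages. The field list and the email pattern are illustrative assumptions.

import re

PII_FIELDS = {"email", "phone", "location"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(record):
    # Drop known PII fields and scrub embedded email addresses from text.
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean

print(redact({"user": "a", "email": "a@example.com",
              "note": "contact a@example.com", "value": 3}))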