Introduction: Query writing has propagated upward from technology teams into business applications, and increasingly into the management of the business itself, as organizations lean into the monitoring and analytical capabilities that queries provide. Yet the best practices available online rarely prepare the author for the errors that are frequently encountered in practice. This article covers a handful of advanced techniques for dealing with them.
Description: One of the most useful features of a query is that it can span diverse data sources. For example, it can refer to a table in one location and join its entries with a table from a different database. Functionally, such joins are easy to write as long as there is at least one column on which the two tables can be joined. But even a simple query can take a long time to execute, and one of the primary reasons for the resulting timeouts is that the data exceeds the number of records that can be processed in a reasonable amount of time. The cardinality of the data on either side matters, so it is best to drive the join from the smaller side. Since the data size is not visible in the query text and the author is often unaware of it at the time of writing, it usually takes a few iterations to identify this root cause and decide on next steps. One mitigation for this kind of error is the distinct operator, which can reduce the data size significantly by eliminating duplicates.
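A minimal sketch of this pattern, assuming a Kusto-style query language (KQL) and hypothetical Orders and Customers tables, since the article names neither a language nor a schema. The remote side of the join is deduplicated so that fewer records flow into the join:

// join a local table with one from another database; all names are assumptions
Orders
| join kind=inner (
    database("SalesDb").Customers      // table in a different database
    | distinct CustomerId, Region      // eliminate duplicates before joining
) on CustomerId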
When the records are not distinct, projecting only the columns required for the join and for the results allows the row set to shrink, because the projection makes otherwise different records look the same.
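As a sketch with the same assumed tables, projecting only the needed columns before deduplicating:

Orders
| project CustomerId, OrderDate        // keep only the columns needed downstream
| join kind=inner (
    database("SalesDb").Customers
    | project CustomerId, Region       // with fewer columns, more rows collapse
    | distinct *                       // deduplicate the narrowed rows
) on CustomerId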
Another approach to making records look the same is the use of computed columns that place different records into the same bucket, so that actions that would have been taken on individual records can instead be taken on the buckets. There are several such operators for putting records into buckets, also called hashes. A proper hashing function yields buckets with more or less equal numbers of records, which gives a uniform distribution of workload, a property we will return to when parallelizing execution.
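A sketch of such bucketing with a computed column, again assuming KQL and the hypothetical Orders table; the bucket count of 16 is arbitrary:

Orders
| extend bucket = hash(CustomerId, 16)       // computed column places each record into a bucket
| summarize OrderCount = count() by bucket   // act on the buckets instead of individual records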
Records can also be partitioned without a hashing operator when they are stored separately even though they belong to the same logical table. A materialized view over such data is significantly more useful for querying because it alleviates the contention of going to the original source. The cost of the transformation is borne once, so queries do not need to recompute the data every time by going to the source and working on it. Materialized views also reduce cost significantly when they are stored local to the store where the queries execute, rather than fetching the original data from a remote location every time.
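A sketch of such a materialized view, assuming Kusto-style syntax and the same hypothetical table; the aggregation is computed once and then served locally to queries:

.create materialized-view DailyOrdersByCustomer on table Orders
{
    Orders
    | summarize OrderCount = count() by CustomerId, OrderDay = bin(OrderDate, 1d)
}

Queries can then read DailyOrdersByCustomer directly instead of recomputing the aggregation from Orders every time.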
A side effect of unknowingly processing a large number of records is that the systems participating in the query execution can easily get inundated with calls and begin to apply the rate limits they have been configured with. When multiple systems participate in multiple joins in a single query, the overall execution halts as soon as any one of them begins to error out. This means the author has no way of predicting what remedial actions to take on the query until the errors are actually seen or the data cardinality is determined beforehand.
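A sketch of finding the cardinality beforehand, using the same assumed tables; each is a separate quick query run before shaping the join:

database("SalesDb").Customers
| count                                // total rows on the remote side

Orders
| summarize dcount(CustomerId)         // distinct join keys on the local side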
Similarly, the problem of unknown data size is compounded when queries are nested one within another. It is rarely visible when the results of the nested queries are merely passed up to higher levels. The same care taken to reduce the data size and to optimize operations on the reduced size must be reapplied at each level of the query. The benefit then propagates up through the levels of the overall execution, much like a bubble rising from the depth.
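A sketch of reapplying the reduction at each level of nesting, with the same assumed tables and columns:

let SmallCustomers =
    database("SalesDb").Customers
    | project CustomerId, Region       // reduce the inner level first
    | distinct *;
Orders
| project CustomerId, Total            // reduce the outer level as well
| join kind=inner (SmallCustomers) on CustomerId
| summarize TotalSales = sum(Total) by Region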
Lastly, separating the data records to be processed and optimizing the operations on those records can be leveraged to reduce the total time by parallelizing the queries on separate workers. Since the sets of records are not shared and only the results are, each worker merely reports its results to a central accumulator, which reduces the sequential time by roughly a factor of the number of workers.
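A sketch of fanning the work out over partitions, assuming KQL's partition operator and the same hypothetical table; each partition is aggregated independently and the results are combined at the end:

Orders
| partition hint.strategy=shuffle by CustomerId
(
    summarize TotalSales = sum(Total) by CustomerId
)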