Introduction: Query writing has propagated upward from technology teams into business applications, and increasingly into the management of the business itself, as organizations lean into the monitoring and analytical capabilities that queries provide. Yet the best practices available online rarely prepare the author for the errors that are frequently encountered in practice. This article covers a handful of advanced techniques for dealing with them.
Description: One of the most useful features of a query is that it can span diverse data sources. For example, it can refer to a table in one location and join its entries with a table from a different database. Functionally, such joins are easy to write as long as there is at least one column on which the two tables can be joined. But even a simple query can take a long time to execute, and one of the primary reasons for the resulting timeouts is that the data exceeds the number of records that can be processed in a reasonable amount of time. The cardinality of the data on either side matters, so it is best to drive the join from the smaller side. Since the data size is not visible in the query text and the author is often unaware of it at the time of writing, it usually takes a few iterations to identify this root cause and decide on next steps. One mitigation for this kind of error is the distinct operator, which can reduce the data size significantly by eliminating duplicates.
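A minimal sketch of this pattern, assuming a Kusto-style query language (KQL) and hypothetical Orders and Customers tables, since the article names neither a language nor a schema. The remote side of the join is deduplicated so that fewer records flow into the join:

// join a local table with one from another database; all names are assumptions
Orders
| join kind=inner (
    database("SalesDb").Customers      // table in a different database
    | distinct CustomerId, Region      // eliminate duplicates before joining
) on CustomerId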
When the records are not distinct, projecting only the columns required for the join and for the results allows the row set to shrink, because the projection makes otherwise different records look the same.
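As a sketch with the same assumed tables, projecting only the needed columns before deduplicating:

Orders
| project CustomerId, OrderDate        // keep only the columns needed downstream
| join kind=inner (
    database("SalesDb").Customers
    | project CustomerId, Region       // with fewer columns, more rows collapse
    | distinct *                       // deduplicate the narrowed rows
) on CustomerId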
Another approach to making records look the same is the use of computed columns that place different records into the same bucket, so that actions that would have been taken on individual records can instead be taken on the buckets. There are several such operators for putting records into buckets, also called hashes. A proper hashing function yields buckets with more or less equal numbers of records, which gives a uniform distribution of workload, a property we will return to when parallelizing execution.
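A sketch of such bucketing with a computed column, again assuming KQL and the hypothetical Orders table; the bucket count of 16 is arbitrary:

Orders
| extend bucket = hash(CustomerId, 16)       // computed column places each record into a bucket
| summarize OrderCount = count() by bucket   // act on the buckets instead of individual records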
Records can also be partitioned without a hashing operator when they are stored separately even though they belong to the same logical table. A materialized view over such data is significantly more useful for querying because it alleviates the contention of going to the original source. The cost of the transformation is borne once, so queries do not need to recompute the data every time by going to the source and working on it. Materialized views also reduce cost significantly when they are stored local to the store where the queries execute, rather than fetching the original data from a remote location every time.
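A sketch of such a materialized view, assuming Kusto-style syntax and the same hypothetical table; the aggregation is computed once and then served locally to queries:

.create materialized-view DailyOrdersByCustomer on table Orders
{
    Orders
    | summarize OrderCount = count() by CustomerId, OrderDay = bin(OrderDate, 1d)
}

Queries can then read DailyOrdersByCustomer directly instead of recomputing the aggregation from Orders every time.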
A side effect of unknowingly processing a large number of records is that the systems participating in the query execution can easily get inundated with calls and begin to apply the rate limits they have been configured with. When multiple systems participate in multiple joins in a single query, the overall execution halts as soon as any one of them begins to error out. This means the author has no way of predicting what remedial actions to take on the query until the errors are actually seen or the data cardinality is determined beforehand.
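A sketch of finding the cardinality beforehand, using the same assumed tables; each is a separate quick query run before shaping the join:

database("SalesDb").Customers
| count                                // total rows on the remote side

Orders
| summarize dcount(CustomerId)         // distinct join keys on the local side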
Similarly, the problem of unknown data size is compounded when queries are nested one within another. It is rarely visible when the results of the nested queries are merely passed up to higher levels. The same care taken to reduce the data size and to optimize operations on the reduced size must be reapplied at each level of the query. The benefit then propagates up through the levels of the overall execution, much like a bubble rising from the depth.
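A sketch of reapplying the reduction at each level of nesting, with the same assumed tables and columns:

let SmallCustomers =
    database("SalesDb").Customers
    | project CustomerId, Region       // reduce the inner level first
    | distinct *;
Orders
| project CustomerId, Total            // reduce the outer level as well
| join kind=inner (SmallCustomers) on CustomerId
| summarize TotalSales = sum(Total) by Region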
Lastly, separating the data records to be processed and optimizing the operations on those records can be leveraged to reduce the total time by parallelizing the queries on separate workers. Since the sets of records are not shared and only the results are, each worker merely reports its results to a central accumulator, which reduces the sequential time by roughly a factor of the number of workers.
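A sketch of fanning the work out over partitions, assuming KQL's partition operator and the same hypothetical table; each partition is aggregated independently and the results are combined at the end:

Orders
| partition hint.strategy=shuffle by CustomerId
(
    summarize TotalSales = sum(Total) by CustomerId
)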