Improving Queries Part 2
This document is a continuation of an earlier one here,
which discussed nuances of writing queries that often surface only after the
queries are written. In this article we explore a few more resolutions around
data syntax, semantics, and constraints.
A query often includes a join, and while the columns of the
left and right tables can be resolved by name, the results of the join might
change considerably if the values do not match exactly. The join operator
specifies an equality operator that is case sensitive. The values in the
columns used to join may differ from each other in case, in leading and
trailing whitespace, or in other non-alphanumeric characters. While a cursory
check of a few hundred records is possible, that technique does not scale to a
large number of records, and it is rather difficult to compile a complete list
of the anomalies in the columns used for joining. Instead, it is easier to
canonicalize the values to a form in which they can be compared. For example,
converting all values to lower case as a new computed column in both
participating tables eliminates mismatches based on case. Similarly, trimming
whitespace or extracting only the alphanumeric characters from the values
includes more records that would otherwise be dropped by the usual join. Such
logic is easily verified to be sufficient by counting the matching records on
both the left and the right tables.
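As a minimal sketch, assuming a Kusto-style query language (whose join operator compares string keys case-sensitively, as described above) and hypothetical Customers and Orders tables that share a CustomerName column, the canonicalization could look like this:

    // Hypothetical tables Customers and Orders joined on a normalized form of CustomerName.
    // JoinKey lower-cases the value and trims non-alphanumeric characters from both ends.
    let NormalizedCustomers = Customers
        | extend JoinKey = tolower(trim(@"[^0-9A-Za-z]+", CustomerName));
    let NormalizedOrders = Orders
        | extend JoinKey = tolower(trim(@"[^0-9A-Za-z]+", CustomerName));
    NormalizedCustomers
    | join kind=inner (NormalizedOrders) on JoinKey

Comparing the row counts of NormalizedCustomers, NormalizedOrders, and the joined result shows whether the normalization recovered records that an exact-match join would have dropped.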
Another concern when writing queries is the meaning of
columns. It might not be surprising for query writers to find that many columns
refer to the same notion of an attribute of an entity. A simple name column
might be available as a friendly name, a business name, or even a legal name.
The choice of column derives not only from the business requirements but also
from the operational need to use the one that has the best values for enabling
a join, both now and later. This means that query writers must explore some of
the data for this purpose. It is not always easy or even possible to inspect
the data, because it might be protected, as in production systems, or covered
by compliance requirements such as privacy or security when storing, say,
credit card data. Samples can be drawn from equivalent systems that are
temporally and spatially spread out so as to best represent the actual data.
Otherwise, it is usually possible to generate data using random characters.
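One way to make that choice, sketched here under the same Kusto-style assumption with a hypothetical Customers table carrying FriendlyName, BusinessName, and LegalName columns, is to profile how well populated each candidate is before settling on a join key:

    // Count how many rows carry a non-empty value in each candidate name column.
    Customers
    | summarize
        Total              = count(),
        FriendlyNameFilled = countif(isnotempty(FriendlyName)),
        BusinessNameFilled = countif(isnotempty(BusinessName)),
        LegalNameFilled    = countif(isnotempty(LegalName))

When the production data itself cannot be read, the same profile can be run against a staging copy or against synthetically generated rows.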
Sometimes the best efforts to come up with a joining logic
are defeated by its complexity, scale, or arduous nature. The values may also
not be unique between rows. In such cases, a computed column that combines two
or more columns into a composite key can be used to make the join. This
additional logic overcomes the limitations of the data in any single column and
enables a robust join of the two tables. When a sequence of tables must be
joined one after another, we can repeat this evaluation for every pairwise
join.
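A sketch of such a composite key, again with hypothetical tables and columns, concatenates the normalized parts into a single computed column on both sides:

    // Combine several columns into one computed key when no single column is unique enough.
    // The separator guards against two different combinations concatenating to the same string.
    let LeftKeyed = LeftTable
        | extend CompositeKey = strcat(tolower(Name), "|", tostring(RegionId), "|", tostring(AccountType));
    let RightKeyed = RightTable
        | extend CompositeKey = strcat(tolower(Name), "|", tostring(RegionId), "|", tostring(AccountType));
    LeftKeyed
    | join kind=inner (RightKeyed) on CompositeKey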
Lastly, as the number of joins increases, the number of records in the
result set might decrease as fewer and fewer records match all the joining
criteria. The decrease in the number of results at each join must be studied to
ensure that the count of records in the final result set is correct.
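For example, assuming three hypothetical tables joined in sequence on hypothetical keys, the row count at each stage can be reported side by side so that an unexpected drop is easy to spot:

    // Report the row count before and after each join in the sequence.
    let AfterFirstJoin  = TableA | join kind=inner (TableB) on Key1;
    let AfterSecondJoin = AfterFirstJoin | join kind=inner (TableC) on Key2;
    union
        (TableA          | summarize Rows = count() | extend Stage = "TableA alone"),
        (AfterFirstJoin  | summarize Rows = count() | extend Stage = "after joining TableB"),
        (AfterSecondJoin | summarize Rows = count() | extend Stage = "after joining TableC")
    | project Stage, Rows

A sharp drop at any one stage points to the join whose keys deserve the canonicalization treatment described earlier.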