Improving Queries Part 2
This document is a continuation of an earlier one here,
which discussed nuances of writing queries that often surface only after the
queries are written. In this article we explore a few more resolutions around
data syntax, semantics, and constraints.
A query often includes a join, and while the columns of the
left and right tables can be resolved by name, the results of the join might
change considerably if the values do not match exactly. The join operator
specifies an equality operator that is case sensitive. The values in the
columns used to join may differ from each other in case, in leading and
trailing whitespace, or in other non-alphanumeric characters. While a cursory
check of a few hundred records is possible, that technique does not scale to a
large number of records, and it is rather difficult to compile a complete list
of the anomalies in the columns used for joining. Instead, it is easier to
canonicalize the values to a form in which they can be compared. For example,
converting all values to lower case as a new computed column in both
participating tables eliminates mismatches based on case. Similarly, trimming
whitespace or extracting only the alphanumeric characters from the values
includes more records that would otherwise be dropped by the usual join. Such
logic is easily verified to be sufficient by counting the matching records on
both the left and the right tables.
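As a minimal sketch, assuming a Kusto-style query language (whose join operator compares string keys case-sensitively, as described above) and hypothetical Customers and Orders tables that share a CustomerName column, the canonicalization could look like this:

    // Hypothetical tables Customers and Orders joined on a normalized form of CustomerName.
    // JoinKey lower-cases the value and trims non-alphanumeric characters from both ends.
    let NormalizedCustomers = Customers
        | extend JoinKey = tolower(trim(@"[^0-9A-Za-z]+", CustomerName));
    let NormalizedOrders = Orders
        | extend JoinKey = tolower(trim(@"[^0-9A-Za-z]+", CustomerName));
    NormalizedCustomers
    | join kind=inner (NormalizedOrders) on JoinKey

Comparing the row counts of NormalizedCustomers, NormalizedOrders, and the joined result shows whether the normalization recovered records that an exact-match join would have dropped.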
Another concern when writing queries is the meaning of
columns. It might not be surprising for query writers to find that many columns
refer to the same notion of an attribute of an entity. A simple name column
might be available as a friendly name, a business name, or even a legal name.
The choice of column derives not only from the business requirements but also
from the operational need to use the one that has the best values for enabling
a join, both now and later. This means that query writers must explore some of
the data for this purpose. It is not always easy or even possible to inspect
the data, because it might be protected, as in production systems, or covered
by compliance requirements such as privacy or security when storing, say,
credit card data. Samples can be drawn from equivalent systems that are
temporally and spatially spread out so as to best represent the actual data.
Otherwise, it is usually possible to generate data using random characters.
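One way to make that choice, sketched here under the same Kusto-style assumption with a hypothetical Customers table carrying FriendlyName, BusinessName, and LegalName columns, is to profile how well populated each candidate is before settling on a join key:

    // Count how many rows carry a non-empty value in each candidate name column.
    Customers
    | summarize
        Total              = count(),
        FriendlyNameFilled = countif(isnotempty(FriendlyName)),
        BusinessNameFilled = countif(isnotempty(BusinessName)),
        LegalNameFilled    = countif(isnotempty(LegalName))

When the production data itself cannot be read, the same profile can be run against a staging copy or against synthetically generated rows.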
Sometimes the best efforts to come up with a joining logic
are defeated by its complexity, scale, or arduous nature. The values may also
not be unique between rows. In such cases, a computed column that combines two
or more columns into a composite key can be used to make the join. This
additional logic overcomes the limitations of the data in any single column and
enables a robust join of the two tables. When a sequence of tables must be
joined one after another, we can repeat this evaluation for every pairwise
join.
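A sketch of such a composite key, again with hypothetical tables and columns, concatenates the normalized parts into a single computed column on both sides:

    // Combine several columns into one computed key when no single column is unique enough.
    // The separator guards against two different combinations concatenating to the same string.
    let LeftKeyed = LeftTable
        | extend CompositeKey = strcat(tolower(Name), "|", tostring(RegionId), "|", tostring(AccountType));
    let RightKeyed = RightTable
        | extend CompositeKey = strcat(tolower(Name), "|", tostring(RegionId), "|", tostring(AccountType));
    LeftKeyed
    | join kind=inner (RightKeyed) on CompositeKey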
Lastly, as the number of joins increases, the number of records in the
result set might decrease as fewer and fewer records match all the joining
criteria. The decrease in the number of results at each join must be studied to
ensure that the count of records in the final result set is correct.
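For example, assuming three hypothetical tables joined in sequence on hypothetical keys, the row count at each stage can be reported side by side so that an unexpected drop is easy to spot:

    // Report the row count before and after each join in the sequence.
    let AfterFirstJoin  = TableA | join kind=inner (TableB) on Key1;
    let AfterSecondJoin = AfterFirstJoin | join kind=inner (TableC) on Key2;
    union
        (TableA          | summarize Rows = count() | extend Stage = "TableA alone"),
        (AfterFirstJoin  | summarize Rows = count() | extend Stage = "after joining TableB"),
        (AfterSecondJoin | summarize Rows = count() | extend Stage = "after joining TableC")
    | project Stage, Rows

A sharp drop at any one stage points to the join whose keys deserve the canonicalization treatment described earlier.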