Cluster computing: Improving queries

This document is a continuation of the article introduced here to improve Kusto queries. In this section, the best practice for writing this query is discussed.

Kusto queries are highly optimized to use time filters. When the criteria for filtering the results are specified in a query, the use of time filters first is a good choice.

String operations are often tripped by case-sensitivity and leading or trailing whitespaces. If full tokens must be found, the use of ‘has’ operator is preferred over ‘contains’ operator since it does not look for substrings.

The case-sensitive operators are preferred for exact matches such as in a join. The use of == operator, in operator, contains_cs is preferable over their approximation counterparts. The use of has and has_cs is still preferred over contains and contains_cs. When case-insensitive operations are involved, it is best to use col =~ “lowercasestring” rather than using the tolower() function. If the data is already in either lower or upper case, then it can be matched exactly with a corresponding constant.

When searching text, it is best to limit it to a column. An asterisk wild card character does a full text search across all columns and is more expensive than a column search.

The column level scoping also applies to extracting fields from dynamic objects across many rows. It might be prudent to materialize the column at ingestion time if most of the queries extract fields from dynamic objects so that the column extraction is paid only once. On the contrary, if the key/values to be looked up are rare in dynamic objects, then it is beneficial to first query if the dynamic column “has” the rare value before proceeding to find the exact match via dynamic-column.somekey == “rare value”. This way most of the records are filtered out before the json parsing occurs on the rest.

Variables are declared with the help of ‘let’ statement. This is good when a value is used more than once. Another option is to use the materialize() function.

Conversions are often necessary, but they must be restricted to the data that is being converted and not include extraneous data so that the size of the data for processing can be reduced. The query can be reshaped to reduce the amount of data fed into the conversion. Avoiding the conversion of large amounts of data is also a performance gain.

When writing new queries, some trial and error is required. It is best not to generate all the results during these intermediary studies. Instead, it is best to work with a small number of results by way of limit.

Filtering on a calculated column is not preferred. The use of a table column for filtering is recommended.

When the summarize operator is used and the group by keys comes with high cardinality, the use of the hint.strategy = shuffle is recommended.

Cluster computing

Tuesday, April 19, 2022

Improving queries – part 3

No comments:

Post a Comment