This document is a continuation of the article introduced here to
improve Kusto queries. In this section,
the best practice for writing this query is discussed.
Kusto queries are highly optimized to use time filters. When
the criteria for filtering the results are specified in a query, the use of
time filters first is a good choice.
String operations are often tripped by case-sensitivity and
leading or trailing whitespaces. If full tokens must be found, the use of ‘has’
operator is preferred over ‘contains’ operator since it does not look for
substrings.
The case-sensitive operators are preferred for exact matches
such as in a join. The use of == operator, in operator, contains_cs is
preferable over their approximation counterparts. The use of has and has_cs is
still preferred over contains and contains_cs. When case-insensitive operations
are involved, it is best to use col =~ “lowercasestring” rather than using the
tolower() function. If the data is already in either lower or upper case, then
it can be matched exactly with a corresponding constant.
When searching text, it is best to limit it to a column. An
asterisk wild card character does a full text search across all columns and is
more expensive than a column search.
The column level scoping also applies to extracting fields
from dynamic objects across many rows. It might be prudent to materialize the
column at ingestion time if most of the queries extract fields from dynamic
objects so that the column extraction is paid only once. On the contrary, if
the key/values to be looked up are rare in dynamic objects, then it is
beneficial to first query if the dynamic column “has” the rare value before
proceeding to find the exact match via dynamic-column.somekey == “rare value”.
This way most of the records are filtered out before the json parsing occurs on
the rest.
Variables are declared with the help of ‘let’ statement.
This is good when a value is used more than once. Another option is to use the
materialize() function.
Conversions are often necessary, but they must be restricted
to the data that is being converted and not include extraneous data so that the
size of the data for processing can be reduced. The query can be reshaped to
reduce the amount of data fed into the conversion. Avoiding the conversion of
large amounts of data is also a performance gain.
When writing new queries, some trial and error is required.
It is best not to generate all the results during these intermediary studies.
Instead, it is best to work with a small number of results by way of limit.
Filtering on a calculated column is not preferred. The use
of a table column for filtering is recommended.
When the summarize operator is used and the group by keys
comes with high cardinality, the use of the hint.strategy = shuffle is
recommended.
No comments:
Post a Comment