Improved queries part 4
This is a continuation of the best practices for writing Kusto
queries introduced here.
Conversions are often necessary, but they should be applied only to the data
that actually needs converting, not to extraneous rows. Reshaping the query so
that filters run before the conversion reduces the amount of data fed into it,
and avoiding the conversion of large volumes of data is itself a performance
gain.
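As a sketch, assuming a hypothetical Logs table with a string Duration column, filtering first keeps the conversion small:

```kusto
// Less efficient: converts every row, then filters
Logs
| extend DurationSec = tolong(Duration)
| where Timestamp > ago(1h)

// Better: filter first so tolong() runs only on the surviving rows
Logs
| where Timestamp > ago(1h)
| extend DurationSec = tolong(Duration)
```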
Writing a new query involves some trial and error. During these intermediate
iterations it is best not to generate the full result set; instead, work with
a small number of rows by applying a limit.
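For example, while a query is still being developed it can be capped with take (limit is a synonym), which returns an arbitrary subset quickly:

```kusto
// Cap the output during development; take does not guarantee
// which rows are returned, only how many at most
Logs
| where Level == "Error"
| take 10
```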
Filtering on a calculated column is not preferred; filtering on a table column
is recommended, since the engine can use its indexes on stored columns.
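A minimal sketch of the difference, again assuming a hypothetical Logs table:

```kusto
// Less efficient: the predicate runs on a computed value
Logs
| extend Day = startofday(Timestamp)
| where Day == startofday(ago(1d))

// Better: filter directly on the stored column
Logs
| where Timestamp >= startofday(ago(1d)) and Timestamp < startofday(now())
```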
When the summarize operator is used with group-by keys of high cardinality,
the hint.strategy=shuffle hint is recommended.
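For instance, if a hypothetical UserId key has millions of distinct values, the shuffle strategy spreads the aggregation across nodes instead of funneling it through one:

```kusto
// Shuffle the aggregation across nodes for a high-cardinality key
Logs
| summarize hint.strategy=shuffle Count = count() by UserId
```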
When a join is performed, it is better for the table with fewer rows to be the
left-hand side of the query. When filtering by a single column, the in
operator is preferred over a leftsemi join. When a join spans clusters, the
query should run on the cluster where most of the data is located.
The hint.strategy=broadcast hint can also be used when the left side is small
and the right side is large.
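The points above can be sketched with two hypothetical tables, a small Users lookup and a large Events table:

```kusto
// Filtering by a single column: prefer 'in' over a leftsemi join
let activeUsers = Users | where IsActive | project UserId;
Events
| where UserId in (activeUsers)

// Joining: put the smaller table on the left; broadcast it
// when the right side is much larger
Users
| where IsActive
| join hint.strategy=broadcast kind=inner Events on UserId
```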
When string columns share the same pattern, their values can be extracted with
a single parse operator instead of multiple extract() calls.
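For example, assuming a hypothetical Traces table whose Message column follows one pattern, a single parse replaces three separate extract() calls:

```kusto
// Messages look like: "Event: start, duration: 123, result: OK"
Traces
| parse Message with "Event: " Event ", duration: " Duration:long ", result: " Result
```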
Both the materialize() function and materialized views can be used judiciously
to reduce the data set being processed while retaining the semantics of the
query. Applying filters and projecting only the required columns helps shrink
this data set. Materialized views store commonly used aggregations, and the
materialized data set can be queried with the materialized_view() function.
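A sketch of both, with hypothetical table and view names:

```kusto
// materialize() caches a subquery so it is computed once even
// though it is referenced twice below
let errors = materialize(
    Logs
    | where Timestamp > ago(1d) and Level == "Error"
    | project Timestamp, Source);
errors
| summarize Count = count() by Source
| join kind=inner (errors | summarize Latest = max(Timestamp) by Source) on Source

// Querying a pre-aggregated materialized view (name is hypothetical)
materialized_view("DailyErrorCounts")
| where Day > ago(7d)
```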
When specifying a database object, care must be taken to
point to the specific table or data source. For example, an unqualified name
might resolve to a function in the local database instead of a table in the
remote database. The best way to avoid this resolution error is to fully
qualify the name of the target, for example
cluster('clusterName').database('databaseName').TableName.
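In context, with hypothetical cluster, database, and table names:

```kusto
// Unqualified, 'MyTable' could resolve to a local function of the
// same name; fully qualifying removes the ambiguity
cluster('clusterName').database('databaseName').MyTable
| where Timestamp > ago(1h)
| count
```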
Whenever queries are run across clusters, it is important to
remember that the cluster that performs the initial query interpretation must
have the schema of the entities referenced on the remote clusters. Fetching
that schema is an expensive network operation, so the remote schema entities
are cached. A cross-cluster query can refresh the schema with
.clear cache remote-schema cluster('clusterName').database('databaseName').
Similarly, the query results cache can be cleared with the
.clear database cache query_results command.
Reference: Sample queries