Improved queries part 4
This is a continuation of the best practices for writing Kusto
queries introduced here.
Conversions are often necessary, but they should be applied only to the data
that actually needs converting, not to extraneous rows. Reshaping the query so
that filters run before the conversion reduces the amount of data fed into it,
and avoiding the conversion of large volumes of data is itself a performance
gain.
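As a sketch, assuming a hypothetical Logs table with a string Duration column, filtering first keeps the conversion small:

```kusto
// Less efficient: converts every row, then filters
Logs
| extend DurationSec = tolong(Duration)
| where Timestamp > ago(1h)

// Better: filter first so tolong() runs only on the surviving rows
Logs
| where Timestamp > ago(1h)
| extend DurationSec = tolong(Duration)
```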
Writing a new query involves some trial and error. During these intermediate
iterations it is best not to generate the full result set; instead, work with
a small number of rows by applying a limit.
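For example, while a query is still being developed it can be capped with take (limit is a synonym), which returns an arbitrary subset quickly:

```kusto
// Cap the output during development; take does not guarantee
// which rows are returned, only how many at most
Logs
| where Level == "Error"
| take 10
```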
Filtering on a calculated column is not preferred; filtering on a table column
is recommended, since the engine can use its indexes on stored columns.
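A minimal sketch of the difference, again assuming a hypothetical Logs table:

```kusto
// Less efficient: the predicate runs on a computed value
Logs
| extend Day = startofday(Timestamp)
| where Day == startofday(ago(1d))

// Better: filter directly on the stored column
Logs
| where Timestamp >= startofday(ago(1d)) and Timestamp < startofday(now())
```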
When the summarize operator is used with group-by keys of high cardinality,
the hint.strategy=shuffle hint is recommended.
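For instance, if a hypothetical UserId key has millions of distinct values, the shuffle strategy spreads the aggregation across nodes instead of funneling it through one:

```kusto
// Shuffle the aggregation across nodes for a high-cardinality key
Logs
| summarize hint.strategy=shuffle Count = count() by UserId
```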
When a join is performed, it is better for the table with fewer rows to be the
left-hand side of the query. When filtering by a single column, the in
operator is preferred over a leftsemi join. When a join spans clusters, the
query should run on the cluster where most of the data is located.
The hint.strategy=broadcast hint can also be used when the left side is small
and the right side is large.
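The points above can be sketched with two hypothetical tables, a small Users lookup and a large Events table:

```kusto
// Filtering by a single column: prefer 'in' over a leftsemi join
let activeUsers = Users | where IsActive | project UserId;
Events
| where UserId in (activeUsers)

// Joining: put the smaller table on the left; broadcast it
// when the right side is much larger
Users
| where IsActive
| join hint.strategy=broadcast kind=inner Events on UserId
```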
When string columns share the same pattern, their values can be extracted with
a single parse operator instead of multiple extract() calls.
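For example, assuming a hypothetical Traces table whose Message column follows one pattern, a single parse replaces three separate extract() calls:

```kusto
// Messages look like: "Event: start, duration: 123, result: OK"
Traces
| parse Message with "Event: " Event ", duration: " Duration:long ", result: " Result
```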
Both the materialize() function and materialized views can be used judiciously
to reduce the data set being processed while retaining the semantics of the
query. Applying filters and projecting only the required columns helps shrink
this data set. Materialized views store commonly used aggregations, and the
materialized data set can be queried with the materialized_view() function.
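A sketch of both, with hypothetical table and view names:

```kusto
// materialize() caches a subquery so it is computed once even
// though it is referenced twice below
let errors = materialize(
    Logs
    | where Timestamp > ago(1d) and Level == "Error"
    | project Timestamp, Source);
errors
| summarize Count = count() by Source
| join kind=inner (errors | summarize Latest = max(Timestamp) by Source) on Source

// Querying a pre-aggregated materialized view (name is hypothetical)
materialized_view("DailyErrorCounts")
| where Day > ago(7d)
```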
When specifying a database object, care must be taken to
point to the specific table or data source. For example, an unqualified name
might resolve to a function in the local database instead of a table in the
remote database. The best way to avoid this resolution error is to fully
qualify the name of the target, for example
cluster('clusterName').database('databaseName').TableName.
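In context, with hypothetical cluster, database, and table names:

```kusto
// Unqualified, 'MyTable' could resolve to a local function of the
// same name; fully qualifying removes the ambiguity
cluster('clusterName').database('databaseName').MyTable
| where Timestamp > ago(1h)
| count
```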
Whenever queries are run across clusters, it is important to
remember that the cluster that performs the initial query interpretation must
have the schema of the entities referenced on the remote clusters. Fetching
that schema is an expensive network operation, so the remote schema entities
are cached. A cross-cluster query can refresh the schema with
.clear cache remote-schema cluster('clusterName').database('databaseName').
Similarly, the query results cache can be cleared with the
.clear database cache query_results command.
Reference: Sample queries