Friday, April 22, 2022

Improving queries part 6


This is a continuation of the best practices for writing Kusto queries introduced here and continued with log queries for analytics, such as those in Azure Monitor Logs. While part 5 covered optimizations and best practices, this part discusses parsing and related routines along with query performance indicators.

 

Parsing text data in Azure Monitor Logs is required when many pieces of information are embedded in a single property. When that data is parsed into multiple properties, a single log entry participates better in searches and aggregations because they can target the individual properties.

There are several methods and best practices for parsing text data. These methods differ depending on whether the parsing is done at ingestion (collection) time or at query time:

Parsing data at collection time allows us to configure custom fields that create new properties in the table. Queries then do not have to include any parsing logic and can simply use these properties like any other column in the table.

The advantages of this method are that it is easier to query the collected data, since we don't need to include parse commands in the query, and query performance is better, since the query does not need to do the parsing.

The disadvantages are that the parsing must be defined in advance, the parsing logic cannot be changed later, and it applies only to new data. There are also fewer parsing options than are available in queries, it increases the latency for collecting data, and errors can be difficult to handle.

Parsing at query time avoids modifying the original table.

The advantage of this method is that it applies to any data, including data that has already been collected; the drawback is that it makes queries more complex.
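As a sketch of query-time parsing, the parse operator can split a free-text column into typed properties on the fly (the text layout and the Resource, Status, and Duration names here are illustrative assumptions, not real schema):

Event
| where TimeGenerated > ago(1h)
| parse RenderedDescription with "Resource: " Resource ", Status: " Status ", Duration: " Duration:int " ms"
| summarize count() by Resource, Status

The parsed properties exist only for the lifetime of the query, which is why the original table is never modified.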

 

Query performance indicators are available for each query that is executed. These include:

Total CPU, which reflects the overall compute used to process the query across all compute nodes. It represents the time used for computing, parsing, and data fetching. All of these actions take time during query execution, but some can be amortized for later or done once up front.
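String predicates drive much of that CPU cost. As an example of trimming it, the term-indexed has operator is generally cheaper than the substring contains operator (the Event table and EventData column are used here for illustration):

// higher CPU: substring scan of every row
Event | where EventData contains "success"
// lower CPU: term-index lookup on whole terms
Event | where EventData has "success"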

 

Data used for processed query, which reflects the overall data that was accessed to process the query. It varies with the size of the target table, the time span used, the filters applied, and the number of columns referenced. Reducing the data aggressively at every stage of processing can yield substantial performance benefits.
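A minimal sketch of that reduction, assuming the Event table schema: filter rows as early as possible, then project away unneeded columns before aggregating.

Event
| where TimeGenerated > ago(1h)              // narrow the rows first
| where EventLevelName == "Error"
| project TimeGenerated, Computer, EventID   // keep only the columns we need
| summarize count() by Computer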

 

Time span of the processed query, which reflects the gap between the newest and the oldest data that was accessed to process the query. A time range can also be specified explicitly in the query, and it is usually better to specify the time span as the first predicate.

 

Age of processed data, which reflects the gap between now and the oldest data that was accessed to process the query. It highly influences the efficiency of data fetching, and the age filter should be specified as the first predicate whenever possible.
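Both indicators reward putting the time filter first, as in this sketch against the Heartbeat table (the "web" computer filter is illustrative):

Heartbeat
| where TimeGenerated > ago(1h)   // first predicate: bounds both the span and the age of data fetched
| where Computer has "web"
| summarize count() by Computer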

 

Number of workspaces, which reflects a count of the workspaces read; some may not have been intended but were still part of the query execution.

 

Number of regions, which reflects the number of regions accessed during query processing as a result of the workspaces involved. Multi-region queries are much less efficient, and their performance indicators present only partial coverage.

 

Parallelism: the degree of parallelism indicates how much of the work can happen in parallel. This matters most for queries with high CPU consumption, and it can vary with the functions and operators used.
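As a hedged sketch, a high-cardinality aggregation can sometimes be spread more evenly across nodes with the shuffle strategy hint (the Computer key is an illustrative choice):

Event
| summarize hint.shufflekey = Computer count() by Computer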

 

One example we can use to exercise these indicators is the following query (the Event table carries the EventData column, and its time column is TimeGenerated):

Event
| where TimeGenerated > ago(1h)
| where EventData !has "%SYSTEM32"
| where EventData has "success"

 

Reference: Sample queries 

 

 

 

 
