Improving queries part 6
This is a continuation of the best practices in writing Kusto queries introduced here and continued with the log queries for analytics, such as with Azure Monitor Logs. While part 5 covered optimizations and best practices, this part discusses parsing and other routines along with performance indicators.
Parsing text data in Azure Monitor Logs is required when many pieces of information are embedded in a single property. When the data is parsed into multiple properties, a single log entry can participate better in search and aggregation because queries can target the individual properties.
There are several methods and best practices for parsing text data. These methods vary depending on whether the parsing is done at ingestion time or at query time:
Parsing data at collection time allows us to configure custom fields that create new properties in the table. Queries do not have to include any parsing logic and can simply use these properties like any other columns in the table.
The advantages of this method are that it is easier to query the collected data, since we don't need to include parse commands in the query, and query performance is better, since the query does not need to do the parsing.
The disadvantages are that the parsing must be defined in advance, the parsing logic can't be changed later, and it only applies to new data. There are fewer parsing options than are available in queries, latency for collecting data increases, and errors can be difficult to handle.
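As a sketch of what querying looks like after a custom field is defined: the query references the new property directly, with no parsing logic. Log Analytics appends a _CF suffix to custom field names; the field name AppName_CF below is a hypothetical example.

```kusto
Event
| where TimeGenerated > ago(1h)
| summarize count() by AppName_CF
```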
Parsing at query time avoids modifying the original table.
The advantages of this method are that it applies to any data, including data that has already been collected, and the parsing logic can be changed at any time; the trade-off is that it makes the queries more complex.
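As a sketch of query-time parsing with the parse operator, assuming a hypothetical SyslogMessage format of "user=<name> action=<verb>" (the Syslog table and SyslogMessage column are standard; the message layout here is an assumption for illustration):

```kusto
Syslog
| where TimeGenerated > ago(1h)
| parse SyslogMessage with "user=" User " action=" Action
| summarize count() by User, Action
```

The parse operator creates the new columns User and Action only for the duration of the query; the original table is untouched.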
Query performance indicators are available for each query that is
executed. These include:
Total CPU, which reflects the overall compute used to process the query across all compute nodes. It represents time used for computing, parsing, and data fetching. All of these actions can take time during query execution, but some can be amortized for later or done upfront once.
Data used for processed query, which reflects the overall data that was accessed to process the query. It can change with the size of the target table, the time span used, the filters applied, and the number of columns referenced. Reducing the data aggressively at every stage of processing can tremendously improve performance.
Time span of the processed query, which reflects the gap between the most recent and the earliest data that was accessed to process the query. The time range can also be specified explicitly in the query, and it is usually better to specify the time span as the first predicate.
Age of processed data, which reflects the gap between now and the oldest data that was accessed to process the query. The age of the data highly influences the efficiency of data fetching, which is another reason to specify the time filter as the first predicate whenever possible.
Number of workspaces, which reflects a count of the workspaces read; some may not have been intended but were still part of the query execution.
Number of regions, which reflects the number of regions accessed during query processing as a result of the workspaces involved. These multi-region queries are much less efficient, and the performance indicators present only partial coverage.
Parallelism, which indicates how much of the work can happen in parallel. This is relevant mainly for queries with high CPU consumption, and it can vary with the functions and operators used.
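The indicators above suggest filtering on time first, reducing rows as early as possible, and keeping only the columns the query actually uses. A minimal sketch, assuming the standard Event table in a Log Analytics workspace:

```kusto
Event
| where TimeGenerated > ago(1h)              // time filter as the first predicate
| where EventLevelName == "Error"            // reduce rows before any heavy work
| project TimeGenerated, Computer, EventID   // keep only the columns referenced
| summarize count() by Computer
```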
One example that applies these practices is the following query against the Event table (the original used the Heartbeat table, which does not have an EventData column):
Event
| where TimeGenerated > ago(1h)
| where EventData !has "%SYSTEM32"
| where EventData has "success"
Reference: Sample queries