Sunday, April 24, 2022

 Improving queries part 8 

This is a continuation of the best practices in writing Kusto queries introduced here and continued with the log queries for analytics, such as with Azure Monitor Logs. While parts 5, 6, and 7 covered optimizations, best practices, and the management aspects of Kusto queries, this part discusses expressions that are useful in log queries spanning workspaces and apps. Such queries give a system-wide view of the data.

When the data is stored in multiple workspaces and apps, log queries can either specify the workspace and app details explicitly or use resource-context queries that run in the context of a specific resource. It is important to remember that a limit of one hundred resources and workspaces applies to a single query. Workspaces can be referred to by their resource name, qualified name, workspace ID, or Azure resource ID.

The app() expression is used to refer to a workspace-based Application Insights resource and to retrieve data from a specific application in the same resource group, another resource group, or another subscription. An app can show all the requests it has received as well as heartbeats.
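
As a minimal sketch, assuming a hypothetical Application Insights app named fabrikamapp, a query against it might look like this:

app("fabrikamapp").requests
| summarize count() by bin(timestamp, 1h)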

The workspace() expression is used to refer to a workspace in the same resource group, another resource group, or another subscription. Read access to the workspace is required.
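
All four identifier forms can be used; the workspace name, qualified-name parts, GUID, and path below are placeholders:

workspace("contosoretail").Heartbeat | count                      // resource name
workspace("<subscription>/<resource-group>/contosoretail").Heartbeat | count    // qualified name
workspace("b438b4f6-912a-46d5-9cb1-b44069212abc").Heartbeat | count             // workspace ID
workspace("/subscriptions/<subscription-id>/resourcegroups/<resource-group>/providers/microsoft.operationalinsights/workspaces/contosoretail").Heartbeat | count   // Azure resource ID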

The resource() expression is used in an Azure Monitor query scoped to a resource to retrieve data from other resources. The URL path for the resource from the Azure portal can be used to refer to the resource, and the same holds for the resource group. Read access to the resource is required, and multiple resources can be queried.
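
A sketch, assuming a resource-scoped query and a placeholder virtual machine resource ID:

union (Heartbeat),
      (resource("/subscriptions/<subscription-id>/resourcegroups/<resource-group>/providers/microsoft.compute/virtualmachines/<vm-name>").Heartbeat)
| summarize count() by _ResourceId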

 

The Application Insights and Log Analytics workspace schemas differ in their property names. UserId, AppId, and name are found in both.

Availability count, type, duration, message, run location, id, name, and timestamp in Log Analytics have corresponding itemCount, duration, message, location, id, name, and timestamp properties in Application Insights. Browser, city, and country, all pertaining to the client, are prefixed with 'client_' in the Application Insights resource properties. Custom event count, dimensions, and name also have equivalent properties without the 'custom' prefix. The device model and device type are available as client_Model and client_Type. Exception count, HandledAt, message, and type have equivalents in the latter, and operation and OperationName are similarly named. Page view count, duration, and name have their 'pageView' prefix dropped; similarly, request count, duration, id, name, and success have their 'request' prefix dropped. Role and role instance are also similar, while session id, source system, telemetry type, URL, and user account id have underscored equivalents.

 

Azure Monitor supports cross-service queries between Azure Data Explorer, Application Insights, and Log Analytics. This enables a Kusto cluster and database to be made available to Log Analytics and Application Insights tools and referred to in a cross-service query. The .show functions, .show function {functionName}, and .show database {databaseName} schema as json commands are all supported in cross-service queries. This is a powerful feature because it brings external data sources directly into the tool.
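
As an illustration, the publicly documented help cluster can be queried from Log Analytics with the adx() pattern; the Samples database and StormEvents table below belong to that cluster:

adx("https://help.kusto.windows.net/Samples").StormEvents
| summarize count() by State
| top 5 by count_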

 

Reference: Sample queries   


Saturday, April 23, 2022

 

Improving queries part 7 

This is a continuation of the best practices in writing Kusto queries introduced here and continued with the log queries for analytics, such as with Azure Monitor Logs. While parts 5 and 6 covered optimizations and best practices, this part discusses the management aspects of Kusto queries.

Kusto databases and clusters pose administration and management challenges similar to those of traditional databases. This section enumerates some of the commands available to database admins and monitors to explore usage, track operations, and investigate ingestion failures.

The .show queries command returns a list of queries that have reached a final state. It is available to users who are not administrators as well, but they can only see the queries that they launched themselves.

The .show running queries command returns a list of currently executing queries. This can show queries invoked by all users.

The .cancel query command makes a best-effort attempt to cancel a specific running query. A reason phrase can optionally be specified.
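
A sketch of these commands together; the client request ID is a placeholder, and the StartedOn and State column names are assumptions about the output schema:

.show queries | where StartedOn > ago(1h) and State == "Failed"
.show running queries
.cancel query "<client-request-id>" with (reason = "freeing the cluster for maintenance")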

The .show commands command lists the admin commands that have reached a final state. These commands are available to query for 30 days. The listing includes CPU usage as well as resource utilization for all completed commands.

The .show commands-and-queries command combines the information for both commands and queries.

The .show journal command displays the history of metadata operations. The metadata operations can result from a control command that a user executed or that the system invoked. These have consistent and well-formatted data with very few variations between versions. Metadata changes are stored per database on the cluster.

The .show database <databaseName> journal command returns the journal for the specified database.
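
For example, assuming a database named MyDatabase and an Event column in the journal output:

.show journal | where Event == "ADD-TABLE"
.show database MyDatabase journal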

The .show operations command returns a table with the administrative operations, both running and completed, that were executed since the admin node was last elected (covering roughly the last two weeks). It may be run in two modes. The first is log mode, in which all entries in the log that the user has access to are returned; it is used when an operation ID is not available. The second is latest-update mode, in which only the last update for each operation ID provided by the user is returned; it is used when the operation ID is available.

The .show ingestion failures command returns a result set that shows failures specific to the data ingestion control commands, as encountered by the Kusto data engine.
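
A sketch; the operation ID is a placeholder, and the FailedOn column name is an assumption about the output schema:

.show operations                               // log mode: all operations the user has access to
.show operations <operation-id>                // latest-update mode for a specific operation
.show ingestion failures | where FailedOn > ago(1d)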

Ingestion metrics and diagnostic logs include errors that are encountered outside of the database engine.

Frequently, some of these queries are executed together for specific diagnostics and troubleshooting cases. They have not been compiled into a published script-like library because they change hands or are lost in silos. A sample collection of such libraries, drawn from past investigations, could pave the way for onboarding any new team.

 

Reference: Sample queries 


Friday, April 22, 2022

Improving queries part 6

This is a continuation of the best practices in writing Kusto queries introduced here and continued with the log queries for analytics, such as with Azure Monitor Logs. While part 5 covered optimizations and best practices, this part discusses parsing and other routines along with performance indicators.

 

Parsing text data in Azure Monitor Logs is required when many pieces of information are embedded in a single property. When the data is parsed into multiple properties, a single log entry can participate better in searches and aggregations that target the individual properties.

There are several methods and best practices for parsing text data. These methods differ depending on whether the parsing is done at ingestion (collection) time or at query time:

Parsing data at collection time allows us to configure custom fields that create new properties in the table. Queries do not have to include any parsing logic and can simply use these properties like any other columns in the table.

The advantages of this method are that it is easier to query the collected data, since we don't need to include parse commands in the query, and that query performance is better, since the query does not need to do the parsing.

The disadvantages are that the parsing logic must be defined in advance, it can't be changed later, and it only applies to new data. There are fewer parsing options than are available in queries, the latency of data collection increases, and errors can be difficult to handle.

Parsing at query time avoids modifying the original table.

The advantage of this method is that it applies to any data, including data that has already been collected; the drawback is that it makes the queries more complex.
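
A minimal query-time parsing sketch over the Syslog table; the message layout with user= and ip= markers is a hypothetical example:

Syslog
| parse SyslogMessage with * "user=" User " ip=" SourceIp " " *
| summarize count() by User, SourceIp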

 

Query performance indicators are available for each query that is executed. These include:

Total CPU, which reflects the overall compute used to process the query across all compute nodes. It represents time used for computing, parsing, and data fetching. All of these actions can take time during query execution, but some can be amortized for later or done upfront once.

 

Data used for processed query, which reflects the overall data that was accessed to process the query. It can change with the size of the target table, the time span used, the filters applied, and the number of columns referenced. Aggressively reducing the data at every stage of processing can bring tremendous performance benefits.

 

Time span of the processed query, which reflects the gap between the most recent and the earliest data that was accessed to process the query. A time range can also be specified for the query, and it is usually better to specify the time span as the first predicate.

 

Age of processed data, which reflects the gap between now and the oldest data that was accessed to process the query. It highly influences the efficiency of data fetching, and a time filter should be specified as the first predicate whenever possible.

 

Number of workspaces, which reflects the count of workspaces read; some may not have been intended but were still part of the query execution.

 

Number of regions, which reflects the number of regions accessed during query processing as a result of the workspaces referenced. Multi-region queries are much less efficient, and the performance indicators present only partial coverage.

 

Parallelism: the degree of parallelism indicates how much of the work can happen in parallel. This matters mostly for queries with high CPU consumption, and it can vary with the functions and operators used.

 

One example that exercises these indicators, filtering early and on full tokens, is the following:

Event
| where TimeGenerated > ago(1h)
| where EventData !has "%SYSTEM32"
| where EventData has "success"

 

Reference: Sample queries 


Thursday, April 21, 2022

 

Improving queries part 5

This is a continuation of the best practices in writing Kusto queries introduced here.

This section focuses on the log queries for analytics, such as with Azure Monitor Logs, which are written using the same Kusto Query Language.

Statements not supported in Azure Monitor include alias statements and query parameters, which are used when sending Kusto queries from client applications. Query parameters help protect against injection attacks, which are less of a concern in the context of logs. On the other hand, there are new operators for Azure Monitor Logs that include app(), resource(), and workspace().

app() retrieves data from a specific Application Insights app in the same resource group, another resource group, or another subscription. It can help query data across multiple applications in an Application Insights query.

The resource() operator is used to retrieve data from a specific resource as identified by its resource ID. Read access to the resource is required, and the query will run scoped to this resource.

The workspace() operator is used to retrieve data from a specific workspace in the same resource group, another resource group, or another subscription. It can also be used across multiple workspaces in a log query.

Log queries deal a lot with literals. Some of these may contain special characters, including quotes, which must be escaped.
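
A quick sketch of the escaping options in KQL:

print single = 'It\'s escaped', double = "a \"quoted\" token", verbatim = @"C:\logs\app"   // @ marks a verbatim string where backslashes are literal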

Some columns are available as standard columns, and these can be leveraged in queries. Columns such as TenantId, TimeGenerated, _TimeReceived, Type, _ItemId, _ResourceId, _SubscriptionId, _IsBillable, and _BilledSize come in useful for filtering the results, transforming them into expressions, and summarizing them with the corresponding KQL operators.
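
For example, billable volume per resource can be summarized from any table that carries these columns; Syslog is used here as an illustration:

Syslog
| where _IsBillable == true
| summarize BilledBytes = sum(_BilledSize) by _ResourceId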

Audit records for queries are not usually included in the logs; query auditing must be enabled from the portal.

Some optimization might be required for log and audit queries due to the size of the data.

Optimized queries will run faster, reduce the overall duration of query execution, and have a smaller chance of rejection. Queries with recurrent or bursty usage, such as dashboards, alerts, Logic Apps, and Power BI, are the most common candidates for optimization.

The query performance indicators can give significant information on total CPU, data used for processed query, time span of the processed query, age of the processed data, number of workspaces, number of regions, and parallelism.

Filtering records before applying compute-intensive operators or functions is preferred.

Filtering on an evaluated expression in the where clause, as opposed to filtering on actual table columns, is discouraged.

Aggregation commands and dimensions in summarize and join must be chosen with care, because some of them, like dcount, use the HyperLogLog algorithm to estimate distinct counts. The join and summarize commands may be compute-intensive on large data. For example, Heartbeat | where Computer startswith "Production" | summarize count() by ComputerIP is more efficient than extending calculated columns or using the contains operator.

Full XML and JSON parsing can also be avoided to reduce the CPU usage of the operators.

This holds true for all data, in fact, since the size of the data passed through the pipe operator can be quite large.

The unnecessary use of the search and union operators must be avoided for similar reasons, and filtering can be applied early in the queries to reduce the data size.
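
As an illustration, a scoped and filtered query such as

SecurityEvent
| where TimeGenerated > ago(1h)
| where EventID == 4624

is preferred over an unscoped search like search "4624" | where TimeGenerated > ago(1h), which scans every table first; SecurityEvent is used here only as an example table.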

 

Reference: Sample queries 


Wednesday, April 20, 2022

 

Improved queries part 4

This is a continuation of the best practices in writing Kusto queries introduced here.

Conversions are often necessary, but they must be restricted to the data that is being converted and not include extraneous data so that the size of the data for processing can be reduced. The query can be reshaped to reduce the amount of data fed into the conversion. Avoiding the conversion of large amounts of data is also a performance gain. 

When writing new queries, some trial and error is required. It is best not to generate all the results during these intermediary studies. Instead, it is best to work with a small number of results by way of limit. 

Filtering on a calculated column is not preferred. The use of a table column for filtering is recommended. 

When the summarize operator is used and the group-by keys have high cardinality, the use of hint.strategy = shuffle is recommended. 
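
A sketch, assuming a hypothetical table T with a high-cardinality UserId column:

T
| summarize hint.strategy = shuffle count() by UserId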

 

When a join is performed, it is better for the side with fewer rows to be the left-hand side of the query. When filtering by a single column, the use of the in operator is preferred over a left semi join. When the join is spread out across clusters, the query should run where most of the data is located.

The hint.strategy=broadcast hint can also be used when the left side is small and the right side is large.
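
Two sketches, assuming hypothetical tables SmallUsers (small) and FactEvents (large):

FactEvents | where UserId in ((SmallUsers | project UserId))           // in instead of a left semi join
SmallUsers | join hint.strategy = broadcast (FactEvents) on UserId     // small left side broadcast to the nodes holding the large right side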

When string values in a column share the same pattern, they can be extracted using the parse operator once, instead of using the extract() function multiple times.
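
For example, assuming a hypothetical Traces table whose EventText follows the pattern "resourceName=..., totalSlices=...", a single parse extracts both values:

Traces
| parse EventText with * "resourceName=" resourceName ", totalSlices=" totalSlices:long *
| project resourceName, totalSlices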

Both the materialize() function and the materialized_view() function can be used judiciously to reduce the data set for processing while still retaining the semantics of the query. The use of filters and the projection of required columns help reduce this data set. Materialized views store commonly used aggregations and make the materialized data set available for querying.
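
A sketch of materialize(), caching a filtered subquery so it is computed only once across the union; the Traces table and its columns are hypothetical:

let cached = materialize(Traces | where TimeGenerated > ago(1d));
union (cached | summarize dcount(Resource)), (cached | summarize count() by Severity)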

When specifying a database object, care must be taken to point to the specific table or data source. For example, an unqualified name might resolve to a function in the local database instead of a table in the remote database. The best way to overcome this resolution error is to fully qualify the name of the target, for example cluster('clusterName').database('databaseName').

Whenever queries are run across clusters, it is important to remember that the cluster that performs the initial query interpretation must have the schema of the entities referenced on the remote clusters. Fetching that schema can be an expensive network operation, so the remote schema entities are cached. A cross-cluster query can refresh the schema with .clear cache remote-schema cluster('clusterName').database('databaseName').

Similarly, the query results cache can be cleared using the .clear database cache query_results command.

Reference: Sample queries


Tuesday, April 19, 2022

Improving queries – part 3

 


This document is a continuation of the article introduced here to improve Kusto queries. In this section, the best practices for writing these queries are discussed.

Kusto queries are highly optimized to use time filters. When the criteria for filtering the results are specified in a query, applying the time filters first is a good choice.
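
For example, placing the time predicate first lets the engine prune data by time before evaluating the remaining filters:

Perf
| where TimeGenerated > ago(1d)
| where CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer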

String operations are often tripped up by case sensitivity and by leading or trailing whitespace. If full tokens must be found, the has operator is preferred over the contains operator, since it does not look for substrings.

The case-sensitive operators are preferred for exact matches, such as in a join. The == operator, the in operator, and contains_cs are preferable over their case-insensitive counterparts. The use of has and has_cs is still preferred over contains and contains_cs. When case-insensitive operations are involved, it is best to use col =~ "lowercasestring" rather than the tolower() function. If the data is already in either lower or upper case, then it can be matched exactly with a corresponding constant.

When searching text, it is best to limit the search to a column. An asterisk wildcard character does a full-text search across all columns and is more expensive than a column search.

Column-level scoping also applies to extracting fields from dynamic objects across many rows. It might be prudent to materialize a column at ingestion time if most queries extract fields from dynamic objects, so that the cost of column extraction is paid only once. On the contrary, if the keys or values being looked up are rare in the dynamic objects, then it is beneficial to first check whether the dynamic column "has" the rare value before proceeding to find the exact match via DynamicColumn.someKey == "rare value". This way, most of the records are filtered out before the JSON parsing occurs on the rest.
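
A sketch of that prefilter, assuming a hypothetical table T with a dynamic column Properties:

T
| where Properties has "rare value"           // cheap token scan filters out most rows
| where Properties.someKey == "rare value"    // exact match is evaluated only on the survivors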

Variables are declared with the help of the let statement. This is good when a value is used more than once. Another option is to use the materialize() function.
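
For example:

let lookbackWindow = 1h;
Perf
| where TimeGenerated > ago(lookbackWindow)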

Conversions are often necessary, but they must be restricted to the data that is being converted and not include extraneous data so that the size of the data for processing can be reduced. The query can be reshaped to reduce the amount of data fed into the conversion. Avoiding the conversion of large amounts of data is also a performance gain.

When writing new queries, some trial and error is required. It is best not to generate all the results during these intermediary studies. Instead, it is best to work with a small number of results by way of limit.

Filtering on a calculated column is not preferred. The use of a table column for filtering is recommended.

When the summarize operator is used and the group-by keys have high cardinality, the use of hint.strategy = shuffle is recommended.

Monday, April 18, 2022

Kusto data access continued

Consumers: 

Kusto data access is an option to read the data directly from a database. It is immensely helpful when we want to browse through the data or explore it. APIs, by contrast, encapsulate logic with access to the data and provide validation, error translation, and response formatting along with diagnosability, telemetry, and troubleshooting help. They are owned by the service team that also owns the data.

Kusto queries can also be quite elaborate or custom-defined to suit specific needs, and they are faster and more lightweight compared to the staged management, release pipelines, and scheduled delivery of features introduced into APIs. The ability to include such queries in background automation is sometimes useful because they don't require interactivity with the customer, and those automations can be kicked off periodically or on demand.

Kusto data access can be programmatic, involving the use of a query provider and an HTTP client. No-code or low-code scenarios prefer this approach.

There are two ways of consuming data from Kusto. These are: 

1.       Polling 

2.       On-demand fetch 

We compare them as follows: 

When the changes are infrequent but the scenario requires immediate responsiveness, on-demand direct access can be especially efficient. The fetch gets just the data that has changed, as soon as the change happens, which enables downstream systems to be reactive and changes to propagate immediately. The downstream systems are also relieved from implementing anything more than a message handler. There is also the benefit that comes from the scope of the change being passed down to the subscribers, so they get additional information on just what data changed. If the changes become frequent, however, the number of messages grows large, leading to performance bottlenecks, variations in queue size, and delays in the messages.

Instead, when the changes span a lot of data, it is best to get those changes in bulk. A polling mechanism can get changes in batches over time and then process all of those changes. It can even find only the updates that were made since the time of the previous poll, which enables incremental updates at the destination. Since a polling mechanism is a loop that perpetually finds changes, if any, and applies them to the destination, it can work in the background, even as a single worker, as long as it has a destination store. A polling mechanism is a read-only operation, and therefore it does not interfere with the store where the configuration is being actively updated.
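
A minimal incremental-poll sketch in KQL, assuming a hypothetical ConfigurationChanges table with the IngestionTime policy enabled and a watermark saved from the previous poll:

ConfigurationChanges
| where ingestion_time() > datetime(2022-04-18T00:00:00Z)   // watermark recorded by the previous poll
| order by ingestion_time() asc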