Saturday, April 23, 2022

 

Improving queries part 7 

This is a continuation of the best practices in writing Kusto queries as introduced here and continued with the log queries for analytics such as with Azure Monitor Logs. While parts 5 and 6 covered optimizations and best practices, this part discusses the management aspects of Kusto queries.

Kusto databases and clusters pose administration and management challenges similar to those of traditional databases. This section enumerates some of the commands available to Database Admins and Database Monitors to explore usage, track operations, and investigate ingestion failures.

The .show queries command returns a list of queries that have reached a final state. It is available to users who are not administrators as well, but they can only see the queries that they launched themselves.

The .show running queries command returns a list of currently executing queries. This can show queries invoked by all users.

The .cancel query command starts a best-effort attempt to cancel a specific running query. A reason phrase can optionally be specified.
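
A minimal sketch of this flow; the ClientRequestId value shown is hypothetical and would be taken from the .show running queries output:

.show running queries

// take the ClientRequestId of the offending query from the output, then:
.cancel query "KE.RunQuery;8f70e9ab-958f-4739-a68f-54ee49a33ebb" "Runaway query canceled by the admin"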

The .show commands command lists the admin commands that have reached a final state. These commands remain available to query for 30 days. The listing includes CPU usage as well as resource utilization for all completed commands.

The .show commands-and-queries command combines the information for both commands and queries.

The .show journal command displays the history of metadata operations. The metadata operations can result from a control command that a user executed or that the system invoked. These have consistent and well-formatted data with very few variations between versions. Metadata changes are stored per database on the cluster.

The .show database <databaseName> journal command returns the journal for the specified database only.

The .show operations command returns a table with administrative operations, both running and completed, since the Admin node was last elected, going back up to two weeks. It may be run in two modes. The first is the log mode, in which all entries in the logs that the user has access to are returned. The second is the latest-update mode, in which only the last update for each operation ID supplied by the user is returned. The log mode is used when the operation ID is not available, and the latest-update mode when it is.
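
As a sketch, the two modes look like this; the operation ID shown is hypothetical:

// log mode: all operations visible to the user
.show operations

// latest-update mode for a specific operation ID
.show operations 56e51622-eb89-4724-b312-c12f34244a6b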

The .show ingestion failures command returns a result set of failures that are specific to the data ingestion control commands and were encountered by the Kusto data engine.

Ingestion metrics and diagnostic logs include errors that are encountered outside of the database engine.

Frequently, some of these queries are executed together for specific diagnostics and troubleshooting cases. They have not been compiled into a published script-like library because they change hands or are lost in silos. A sample collection of such libraries from past forensics could pave the way for onboarding any new team.

 

Reference: Sample queries 

 

 

 

 

Friday, April 22, 2022

Improving queries part 6

This is a continuation of the best practices in writing Kusto queries as introduced here and continued with the log queries for analytics such as with Azure Monitor Logs. While part 5 recognized optimizations and best practices, this part discusses parsing and other routines along with performance indicators.

 

Parsing text data in Azure Monitor Logs is required when many pieces of information are embedded in a single property. When the data is parsed into multiple properties, a single log entry can participate better in search and aggregation because those operations target the individual properties.

There are several methods and best practices for parsing text data. The methods vary depending on whether the parsing is done at ingestion (collection) time or at query time:

Parsing data at collection time allows us to configure Custom Fields that create new properties in the table. Queries do not have to include any parsing logic and can simply use these properties like any other columns in the table.

The advantages of this method are that it is easier to query the collected data, since we don't need to include parse commands in the query, and query performance improves, since the query no longer needs to do the parsing.

The disadvantages are that the parsing logic must be defined in advance, it can't be changed later, and it only applies to new data. There are fewer parsing options than are available in queries, the latency for collecting data increases, and errors can be difficult to handle.

Parsing at query time avoids any modification to the original table.

The advantages of this method are that it applies to any data, including data that has already been collected, but it makes the queries more complex.
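
A minimal sketch of query-time parsing with the parse operator, assuming a hypothetical Traces table whose EventText column embeds several values:

Traces
| parse EventText with * "resourceName=" resourceName " totalSlices=" totalSlices:long
| summarize avg(totalSlices) by resourceName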

 

Query performance indicators are available for each query that is executed. These include:

Total CPU, which reflects the overall compute used to process the query across all compute nodes. It represents the time used for computing, parsing, and data fetching. All of these actions take time during query execution, but some can be amortized for later or done upfront once.

 

Data used for processed query, which reflects the overall data that was accessed to process the query. It can change with the size of the target table, the time span used, the filters applied, and the number of columns referenced. Reducing the data aggressively at every stage of the processing can yield tremendous performance benefits.

 

Time span of the processed query, which reflects the gap between the most recent and the earliest data that was accessed to process the query. A time range can also be specified for the query, and it is usually better to specify the time span as the first predicate.

 

Age of processed data, which reflects the gap between now and the oldest data that was accessed to process the query. It highly influences the efficiency of data fetching, so a time filter should be specified as the first predicate whenever possible.

 

Number of workspaces, which reflects the count of workspaces read; some may not have been intended but were still part of the query execution.

 

Number of regions, which reflects the number of regions accessed during query processing as a result of the workspaces involved. Multi-region queries are much less efficient, and the performance indicators present only partial coverage.

 

Parallelism, which indicates how much of the work can happen in parallel. This matters most for queries with high CPU consumption, and it can vary with the functions and operators used.

 

An example that follows these practices is the following:

Event
| where TimeGenerated > ago(1h)           // time filter first
| where EventData !has "%SYSTEM32"
| where EventData has "success"           // has looks for full tokens

 

Reference: Sample queries 

 

 

 

 

Thursday, April 21, 2022

 

Improving queries part 5

This is a continuation of the best practices in writing Kusto queries as introduced here.

This section focuses on the log queries for analytics such as with Azure Monitor Logs. Specifically, it refers to Azure Monitor log queries, which are written using the same Kusto Query Language.

Statements not supported in Azure Monitor include alias statements and query parameters, which are used when sending Kusto queries. Query parameters help prevent injection attacks, but in the context of log queries they are irrelevant. On the other hand, Azure Monitor Logs adds new expressions that include app(), resource(), and workspace().

app() retrieves data from a specific Application Insights app in the same resource group, another resource group, or another subscription. It can help query data across multiple applications in an Application Insights query.

The resource() expression is used to retrieve data from a specific resource as identified by its resource ID. Read access to the resource is required, and the query runs scoped to that resource.

The workspace() expression is used to retrieve data from a specific workspace in the same resource group, another resource group, or another subscription. It can also be used to query across multiple workspaces in a log query.
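
A minimal sketch of a cross-workspace query; the two named workspaces are hypothetical:

union Update, workspace("contosoretail-it").Update, workspace("contosoretail-ops").Update
| where TimeGenerated >= ago(1h)
| summarize dcount(Computer) by Classification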

Log queries deal a lot with literals. Some of them may contain special characters, including quotes, and these must be escaped.
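
For example, either quoting style can wrap the other, or a backslash can escape the quote:

print doubleQuoted = "it's fine", singleQuoted = 'she said "hello"', escaped = 'it\'s fine'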

Some columns are available as standard, and these can be leveraged in queries. Columns such as TenantId, TimeGenerated, _TimeReceived, Type, _ItemId, _ResourceId, _SubscriptionId, _IsBillable, and _BilledSize come in useful for filtering the results, transforming them into expressions, and summarizing them with the corresponding KQL operators.

Audit records for queries are not usually included in the logs; query auditing must be enabled from the portal.

Some optimization might be required for log and audit queries due to the size of the data.

Optimized queries run faster, reduce the overall duration of query execution, and have a smaller chance of rejection. Queries that are recurrent or that occur in bursts, such as those behind dashboards, alerts, Logic Apps, and Power BI, are the most common candidates for optimization.

The query performance monitor can give significant information on Total CPU, Data used for processed query, Time span of the processed query, Age of the processed data, Number of workspaces, Number of regions, and Parallelism.

Filtering records prior to the use of compute-intensive operators or functions is preferred.

Filtering on an evaluated where clause, as opposed to filtering on the actual column data, is discouraged.

Aggregation commands and dimensions in summarize and join must be chosen with care, because some of them, like dcount, can use the HyperLogLog algorithm. The join and summarize commands may be compute intensive for large data. For example, Heartbeat | where Computer startswith "Production" | summarize count() by ComputerIP is more efficient than filtering on extended columns or using the contains operator.
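
A sketch of the contrast; the column names are from the standard Heartbeat table, and the calculated-column variant is the one to avoid:

// efficient: index-friendly filter on a stored column
Heartbeat
| where Computer startswith "Production"
| summarize count() by ComputerIP

// less efficient: calculated column plus a substring scan
Heartbeat
| extend ComputerUpper = toupper(Computer)
| where ComputerUpper contains "PRODUCTION"
| summarize count() by ComputerIP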

Full XML and JSON parsing can also be avoided to reduce the CPU usage of the operators.

This holds true for all data, in fact, since the size of the data passed through the pipe operator can be quite large.

The unnecessary use of the search and union operators must be avoided for similar reasons, and filtering should be applied early in the query to reduce the data size.
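
As a sketch, a scoped, early-filtered query is preferred over an unscoped search; the Event table and its RenderedDescription column are standard Log Analytics names:

// avoid: unscoped full-text search across all tables and columns
search "error"

// prefer: a scoped query with an early time filter
Event
| where TimeGenerated > ago(1h)
| where RenderedDescription has "error"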

 

Reference: Sample queries 

 

 

 

Wednesday, April 20, 2022

 

Improving queries part 4

This is a continuation of the best practices in writing Kusto queries as introduced here.

Conversions are often necessary, but they must be restricted to the data that is being converted and not include extraneous data so that the size of the data for processing can be reduced. The query can be reshaped to reduce the amount of data fed into the conversion. Avoiding the conversion of large amounts of data is also a performance gain. 

When writing new queries, some trial and error is required. It is best not to generate all the results during these intermediate explorations; instead, work with a small number of results by way of limit. 
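
For example, capping the output while a query is still being shaped; the SecurityEvent table is just an illustration:

SecurityEvent
| take 100   // cheap, unordered sample for iterating on the query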

Filtering on a calculated column is not preferred. The use of a table column for filtering is recommended. 
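
A sketch of the difference, using a hypothetical table T with a Timestamp column:

// not preferred: the filter runs on a calculated value
T | extend Day = startofday(Timestamp) | where Day > ago(7d)

// preferred: the filter runs directly on the stored column
T | where Timestamp > startofday(ago(7d))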

When the summarize operator is used and the group-by keys have high cardinality, the use of hint.strategy = shuffle is recommended. 
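
A minimal sketch of the shuffle strategy, assuming a hypothetical table T with a high-cardinality UserId key:

T
| summarize hint.strategy = shuffle count() by UserId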

 

When a join is performed, it is better for the side with fewer rows to be the left-hand side of the query. When filtering by a single column, the use of the in operator is preferred over a left semi join. When the join is spread across clusters, the query should run where most of the data is located.

The hint.strategy = broadcast hint can also be used when the left side is small and the right side is large.
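
A sketch of a broadcast join, with hypothetical small lookup and large fact tables:

smallLookupTable
| join hint.strategy = broadcast (largeFactTable) on Key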

The values of string columns that share the same pattern can be extracted using the parse operator instead of using the extract() function multiple times.
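
A sketch of the contrast, assuming a hypothetical Traces table whose EventText column has a fixed layout:

// multiple extract() calls, each rescanning the string
Traces
| extend user = extract("user=([^ ]+)", 1, EventText), ip = extract("ip=([^ ]+)", 1, EventText)

// a single parse over the shared pattern
Traces
| parse EventText with * "user=" user " ip=" ip " " *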

Both the materialize() function and the materialized_view() function can be judiciously used to reduce the data set for processing and still retain the semantics of the query. The use of filters and projection of required columns helps reduce this data set. The materialized views store the commonly used aggregations and help to query the materialized data set.
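
A minimal sketch of materialize(), caching a filtered subquery that is referenced more than once; the table and column names are illustrative:

let base = materialize(Event | where TimeGenerated > ago(1d) | summarize Count = count() by Source);
let total = toscalar(base | summarize sum(Count));
base
| extend Pct = 100.0 * Count / total
| top 5 by Pct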

When specifying a database object, care must be taken to point to the specific table or data source. For example, an unqualified name might resolve to a function in the local database instead of a table in the remote database. The best way to overcome this resolution error is to fully qualify the name of the target, for example cluster('clusterName').database('databaseName')

Whenever queries are run across clusters, it is important to remember that the cluster that performs the initial query interpretation must have the schema of the entities referenced on the remote clusters. Sending the command can be an expensive network operation, so the remote schema entities are cached. A cross-cluster query can refresh the schema with .clear cache remote-schema cluster('clusterName').database('databaseName')

Similarly, the query results cache can also be cleared using the .clear database cache query_results command.

Reference: Sample queries

 

 

Tuesday, April 19, 2022

Improving queries – part 3

 


This document is a continuation of the article introduced here to improve Kusto queries. In this section, the best practices for writing such queries are discussed.

Kusto queries are highly optimized to use time filters. When the criteria for filtering the results are specified in a query, placing the time filters first is a good choice.
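
For example, leading with the time filter before any other predicate; the Event table is illustrative:

Event
| where TimeGenerated > ago(1d)   // time filter first
| where EventLevelName == "Error"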

String operations are often tripped up by case sensitivity and leading or trailing whitespace. If full tokens must be found, the has operator is preferred over the contains operator, since has does not look for substrings.

Case-sensitive operators are preferred for exact matches such as in a join. The == operator, the in operator, and contains_cs are preferable to their case-insensitive counterparts. The use of has and has_cs is still preferred over contains and contains_cs. When case-insensitive operations are involved, it is best to use col =~ "lowercasestring" rather than the tolower() function. If the data is already in either lower or upper case, then it can be matched exactly with a corresponding constant.
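
A sketch of the preference, with a hypothetical table T whose Env column is known to be upper case:

// preferred: exact, case-sensitive match against known-case data
T | where Env == "PROD"

// avoid: forcing a case conversion on every row
T | where tolower(Env) == "prod"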

When searching text, it is best to limit the search to a specific column. An asterisk wildcard does a full-text search across all columns and is more expensive than a column search.

The column-level scoping also applies to extracting fields from dynamic objects across many rows. It might be prudent to materialize the column at ingestion time if most queries extract fields from dynamic objects, so that the cost of column extraction is paid only once. On the contrary, if the keys or values to be looked up are rare in the dynamic objects, then it is beneficial to first check whether the dynamic column "has" the rare value before proceeding to find the exact match via dynamic-column.somekey == "rare value". This way most of the records are filtered out before the JSON parsing occurs on the rest.
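
A sketch of this filtering pattern, using hypothetical table and column names:

CustomLogs
| where DynamicPayload has "rareValue"            // cheap token scan filters out most rows
| where DynamicPayload.someKey == "rareValue"     // exact match, parsed only for the survivors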

Variables are declared with the help of the let statement. This is useful when a value is used more than once. Another option is to use the materialize() function.
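
For example, naming reused values with let; the window and threshold are arbitrary, and Perf is a standard table:

let window = 1h;
let threshold = 90.0;
Perf
| where TimeGenerated > ago(window)
| where CounterValue > threshold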

Conversions are often necessary, but they must be restricted to the data that is being converted and not include extraneous data so that the size of the data for processing can be reduced. The query can be reshaped to reduce the amount of data fed into the conversion. Avoiding the conversion of large amounts of data is also a performance gain.

When writing new queries, some trial and error is required. It is best not to generate all the results during these intermediate explorations; instead, work with a small number of results by way of limit.

Filtering on a calculated column is not preferred. The use of a table column for filtering is recommended.

When the summarize operator is used and the group-by keys have high cardinality, the use of hint.strategy = shuffle is recommended.

Monday, April 18, 2022

Kusto data access continued

Consumers: 

Kusto data access is the option of accessing the data directly from a database. It is immensely helpful when we want to browse or explore the data. APIs encapsulate logic with access to the data and provide validation, error translation, and response formatting, along with diagnosability, telemetry, and troubleshooting help. These are owned by the service team that also owns the data. 

Kusto queries can also be quite elaborate or custom-defined to suit specific needs, and they are faster and more lightweight compared to the staged management, release pipelines, and scheduled delivery of features introduced into APIs. The ability to include such queries in background automation is sometimes useful because they don't require interactivity with the customer, and those automations can be kicked off periodically or on demand.  

Kusto data access can be programmatic, involving the use of a query provider and an HTTP client. No-code or low-code scenarios prefer this approach. 

There are two ways of consuming data from Kusto. These are: 

1.       Polling 

2.       On-demand fetch 

We compare them as follows: 

When changes are infrequent but the scenario requires immediate responsiveness, on-demand direct access can be especially efficient. The fetch gets just the data that has changed, as soon as the change happens, which enables downstream systems to be reactive and changes to propagate immediately. The downstream systems are also relieved from implementing anything more than a message handler. There is also the benefit that comes from the scope of the change being passed down to the subscribers, so they get additional information on just what data was needed. If the changes become frequent, however, the number of messages grows large, leading to performance bottlenecks with variations in queue size and delays in the messages.

Instead, when the changes span a lot of data, it is best to get those changes in bulk. A polling mechanism can get changes in batches over time and then process all those changes. It can even find only the updates made since the previous poll, which enables incremental updates at the destination. Since a polling mechanism is a loop that perpetually finds changes, if any, and applies them to the destination, it can work in the background, even as a single worker, if it has a destination store. Polling is a read-only operation, and therefore it does not need to fetch the data from the store where the configuration is being actively updated. 
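
A minimal sketch of an incremental poll in KQL, assuming a hypothetical MyTable with the ingestion-time policy enabled; the watermark would be saved by the previous poll:

MyTable
| extend _ingested = ingestion_time()
| where _ingested > datetime(2022-04-17)   // watermark from the previous poll
| order by _ingested asc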

 

Sunday, April 17, 2022

Kusto data access

 

Purpose:

Many automations using a Workflow Management System framework must reconcile disparate sources of information. Some of them reside in the configuration store, some in the Workflow Management System, and others in external systems. Azure Cloud services and teams leveraging a Workflow Management System find themselves having to write a query to get this information. While their purpose changes on a case-by-case basis, the mechanism to do so has evolved to the point where it is becoming a standard, as seen from the ARM service portfolio above as well as the requirements from foundational services. This document is a definitive guide to the standard way of retrieving this information.

The benefit of standardization and this documentation is that it streamlines and smooths the process that would inevitably be encountered by teams attempting to write their own solutions. There are also significant cost savings in following this guide rather than resorting to the source depot tools and offline mechanisms. It also improves the visibility of the retrieval process as well as the accuracy of the data, with little or no maintenance.

Why Kusto?                                   

Workflow Management System workflow authors are rightfully concerned about the support for their workflows in different clouds such as public, sovereign and air gapped clouds. The goal for these clouds has been to build and prepare all the required packages and workflows to run on the low side and then replicate to the high side. In this regard, Kusto comes with two primary benefits:

1)      Separation of read-only from read-write access for all workflows and activities, universally accessible without specific onboarding. Data virtualization with what feels like public access has never been as easy as it is with Kusto. Different systems can access each other's clusters and databases using the same connection with little or no performance overhead.

2)      Kusto has strong support for the language, execution, and results from almost all angles, so that writers can not only write their query once but also safeguard it from changes due to maintenance, updates, data variations, and software revisions. This eliminates many of the costs of conventional mechanisms that use files, repositories, toolsets, and others.

These reasons are sufficient to move the read-only traffic away from Cosmos DB to Kusto.

Why Workflow Management System?

Azure Buildout workflows must be run in Workflow Management System, and while new deployment technologies may be content to remain above ARM, Workflow Management System must support foundational services deployed to dedicated clusters as well as deployments in isolated clouds and everything in between. Many workflows and activities produce and consume information that must remain shared. Earlier, build-drop locations and Blob storage proved to be great repositories, but writing queries requires a repository such as Kusto to enable these workflows and activities within Workflow Management System.

Will configuration store also show up in Kusto?

If there is existing automation relying directly on the configuration store repository, it is not recommended to change those automations immediately. The Kusto data does have the same information as the depot, including Cloud information, geographical information, Region information, AZ information, DC information, Edge Zone, and their corresponding staging information. New automations and activities can leverage Kusto data access. It is likely that new workflows and activities will replace the old ones rather than require upgrading and paying down the technical debt on the old ones.

How far behind is the data until it is refreshed?

Different systems have different refresh intervals, and even the refresh of tables in the same database can vary. Usually this is a matter of minutes, but large tables can take up to an hour to refresh. Most of these intervals are listed in the documentation of the corresponding systems under the Kusto access section, but it is safe to assume that the data will lag the transactional create, update, and delete operations in the corresponding stores.

I just got a new task involving Kusto, how do I get started?

Determine whether a producer or a consumer is required for Kusto data access; the steps to follow in the guide differ accordingly. Please refer to the post http://ravinote.blogspot.com/2022/04/kusto-data-producer.html?m=1 to get started with producers.