Thursday, April 21, 2022

 

Improving queries part 5

This is a continuation of the best practices for writing Kusto queries introduced here. 

This section focuses on log queries for analytics, such as those used with Azure Monitor Logs. Specifically, it refers to Azure Monitor log queries, which are written in the same Kusto Query Language.

Statements not supported in Azure Monitor include alias statements and query parameters, which are used when sending Kusto queries to protect against injection attacks; in the context of log queries these are irrelevant. On the other hand, Azure Monitor Logs adds new operators that include app(), resource(), and workspace().

The app() operator retrieves data from a specific Application Insights app in the same resource group, another resource group, or another subscription. This can help query data across multiple applications in an Application Insights query.

The resource() operator is used to retrieve data from a specific resource as identified by its resource id. Read access is required to the resource and the query will run scoped to this resource.

The workspace() operator is used to retrieve data from a specific workspace in the same resource group, another resource group, or another subscription. It can also be used to query across multiple workspaces in a log query.
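
The following is a minimal sketch of how these operators can be combined in a cross-resource query; the workspace name ContosoWorkspace and the app name ContosoApp are placeholders, not real resources:

// query the Update table in the current workspace and in another workspace
union Update, workspace("ContosoWorkspace").Update
| where TimeGenerated > ago(1h)
| summarize count() by Classification

// query requests recorded by a specific Application Insights app
app("ContosoApp").requests
| where timestamp > ago(1h)
| count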

Log queries deal heavily with literals. Some of these may contain special characters, including quotes, which must be escaped.
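
A small sketch of escaping in KQL string literals; the values are made up:

// a backslash escapes the quote character; @"..." is a verbatim string where backslashes are literal
print quoted = "he said \"hello\"", apostrophe = 'it\'s done', path = @"C:\logs\app.log"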

Some columns are available as standard columns and can be leveraged in queries. Columns such as TenantId, TimeGenerated, _TimeReceived, Type, _ItemId, _ResourceId, _SubscriptionId, _IsBillable, and _BilledSize are useful for filtering the results, transforming them into expressions, and summarizing the results with the corresponding KQL operators.
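
A hedged example using some of these standard columns, assuming the standard Heartbeat table is being collected:

Heartbeat
| where TimeGenerated > ago(1d)
| summarize BilledBytes = sum(_BilledSize) by Type, _ResourceId, _IsBillable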

Audit logs for queries are not usually collected; they must be enabled from the portal.

Some optimization might be required for log and audit queries due to the size of the data.

Optimized queries run faster, reduce the overall duration of query execution, and have a smaller chance of rejection. Queries that run recurrently or in bursts, such as those behind dashboards, alerts, Logic Apps, and Power BI, are the most common candidates for optimization.

The query performance indicators can give significant information on total CPU, data used for the processed query, time span of the processed query, age of the processed data, number of workspaces, number of regions, and parallelism.

Filtering records before applying compute-intensive operators or functions is preferred.

Filtering with an evaluated where clause, as opposed to filtering on columns that exist in the data, is discouraged.
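
A sketch of the difference, using the standard Heartbeat table; the computer name is a placeholder:

// less efficient: the where clause evaluates an expression for every record
Heartbeat
| where tolower(Computer) == "production-web-01"

// more efficient: filter on the stored column with a case-insensitive operator
Heartbeat
| where Computer =~ "production-web-01"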

Aggregation commands and the dimensions used in summarize and join must be chosen for effectiveness, because some aggregations such as dcount can use the HyperLogLog algorithm, and join and summarize may be compute intensive for large data. For example, Heartbeat | where Computer startswith "Production" | summarize count() by ComputerIP is more efficient than specifying extended columns or using the contains operator.

Full XML and JSON parsing can also be avoided to reduce the CPU usage of these operators.
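
A sketch of avoiding full JSON parsing; the Traces table, the Message column, and its JSON shape are assumptions:

// expensive: parse the whole JSON payload for every record
Traces
| extend Props = parse_json(Message)
| where Props.level == "error"

// cheaper: coarse textual filter first, then extract only the needed value
Traces
| where Message has "error"
| extend level = extract(@'"level"\s*:\s*"(\w+)"', 1, Message)
| where level == "error"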

This holds true for all data, in fact, since the amount of data passed through the pipe operator can be quite large.

The unnecessary use of the search and union operators must be avoided for similar reasons, and filtering should be applied early in the query to reduce the data size.

 

Reference: Sample queries 

 

 

 

Wednesday, April 20, 2022

 

Improved queries part 4

This is a continuation of the best practices for writing Kusto queries introduced here.

Conversions are often necessary, but they must be restricted to the data that is being converted and not include extraneous data so that the size of the data for processing can be reduced. The query can be reshaped to reduce the amount of data fed into the conversion. Avoiding the conversion of large amounts of data is also a performance gain. 

When writing new queries, some trial and error is required. It is best not to generate all the results during these intermediary studies. Instead, it is best to work with a small number of results by way of limit. 

Filtering on a calculated column is not preferred. The use of a table column for filtering is recommended. 

When the summarize operator is used and the group-by keys come with high cardinality, the use of hint.strategy = shuffle is recommended. 
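
A minimal sketch of the shuffle hint; the Events table and the high-cardinality UserId key are assumptions:

Events
| summarize hint.strategy = shuffle count() by UserId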

 

When a join is performed, it is better for the table with fewer rows to be the left-hand side of the query. When filtering by a single column, the use of the in operator is preferred over a left semi join. When the join is spread out across clusters, the query should be run where most of the data is located.

The hint.strategy=broadcast join hint can also be used when the left side is small and the right side is large.
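
A sketch of these join guidelines; the SmallLookup and LargeFacts tables and the Key column are assumptions:

// smaller table on the left-hand side of the join
SmallLookup
| join kind=inner (LargeFacts) on Key

// filtering by a single column with the in operator instead of a left semi join
LargeFacts
| where Key in (SmallLookup | project Key)

// broadcasting a small left side when the right side is large
SmallLookup
| join hint.strategy = broadcast (LargeFacts) on Key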

The values in columns whose strings share the same pattern can be extracted using the parse operator instead of calling the extract() function multiple times.
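
A sketch contrasting the two approaches; the Traces table, the EventText column, and its pattern are assumptions:

// one pass with parse extracts both values
Traces
| parse EventText with * "resourceName=" resourceName ", totalSlices=" totalSlices:long *

// multiple extract() calls scan the string repeatedly
Traces
| extend resourceName = extract("resourceName=([^,]+)", 1, EventText),
         totalSlices  = toint(extract("totalSlices=([0-9]+)", 1, EventText))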

Both the materialize() function and the materialized_view() function can be judiciously used to reduce the data set for processing and still retain the semantics of the query. The use of filters and projection of required columns helps reduce this data set. The materialized views store the commonly used aggregations and help to query the materialized data set.
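
A minimal sketch of materialize(); the Events table and its columns are assumptions, and the materialized view name DailyAggregates is purely illustrative:

let base = materialize(
    Events
    | where TimeGenerated > ago(1d)
    | project UserId, Duration);
base
| summarize avg(Duration) by UserId
| join kind=inner (base | summarize count() by UserId) on UserId
// a pre-aggregated view would instead be queried with materialized_view('DailyAggregates')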

When specifying a database object, care must be taken to point to the specific table or data source. For example, an unqualified name might resolve to a function in the local database instead of a table in the remote database. The best way to overcome this resolution error is to fully qualify the name of the target, for example cluster('clusterName').database('databaseName').
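
For example, a fully qualified reference keeps a remote table from being shadowed by a local function of the same name (the cluster, database, and table names are placeholders):

cluster('clusterName').database('databaseName').TableName
| where TimeGenerated > ago(1h)
| take 10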

Whenever queries are run across clusters, it is important to remember that the cluster performing the initial query interpretation must have the schema of the entities referenced on the remote clusters. Retrieving that schema can be an expensive network operation, so the remote schema entities are cached. A cross-cluster query can refresh the schema with ".clear cache remote-schema cluster('clusterName').database('databaseName')".

Similarly, the query results cache can be cleared using the ".clear database cache query_results" command.

Reference: Sample queries

 

 

Tuesday, April 19, 2022

Improving queries – part 3

 


This document is a continuation of the article introduced here to improve Kusto queries. In this section, best practices for writing these queries are discussed.

Kusto queries are highly optimized to use time filters. When the criteria for filtering the results are specified in a query, the use of time filters first is a good choice.
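
A minimal example with the time filter applied first, using the standard Heartbeat table:

Heartbeat
| where TimeGenerated > ago(1h)
| where Computer startswith "Production"
| summarize count() by Computer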

String operations are often tripped up by case sensitivity and leading or trailing whitespace. If full tokens must be found, the has operator is preferred over the contains operator since it does not look for substrings.

The case-sensitive operators are preferred for exact matches such as in a join: the == operator, the in operator, and contains_cs are preferable over their case-insensitive counterparts. The use of has and has_cs is still preferred over contains and contains_cs. When case-insensitive comparisons are involved, it is best to use col =~ "lowercasestring" rather than the tolower() function. If the data is already in either lower or upper case, then it can be matched exactly with a corresponding constant.
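
A short sketch of these preferences; the table and the literal values are only illustrative:

SecurityEvent
| where Account has "admin"        // prefer has over contains for whole tokens
| where Computer =~ "web-01"       // prefer =~ over tolower(Computer) == "web-01"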

When searching text, it is best to limit the search to a column. An asterisk wildcard character does a full-text search across all columns and is more expensive than a column search.

The column-level scoping also applies to extracting fields from dynamic objects across many rows. It might be prudent to materialize the column at ingestion time if most queries extract fields from dynamic objects, so that the cost of column extraction is paid only once. On the contrary, if the keys or values to be looked up are rare in the dynamic objects, it is beneficial to first check whether the dynamic column "has" the rare value before proceeding to find the exact match via dynamic-column.somekey == "rare value". This way most of the records are filtered out before the JSON parsing occurs on the rest.
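
A sketch of the has-then-match pattern; the Traces table and its dynamic Properties column are assumptions:

Traces
| where Properties has "rare value"          // cheap textual filter on the dynamic column
| where Properties.someKey == "rare value"   // exact match only on the surviving records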

Variables are declared with the let statement. This is good when a value is used more than once. Another option is to use the materialize() function.
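
A small example of a let binding reused across the query; the lookback window is arbitrary:

let lookback = ago(1d);
Heartbeat
| where TimeGenerated > lookback
| summarize count() by Computer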

Conversions are often necessary, but they must be restricted to the data that is being converted and not include extraneous data so that the size of the data for processing can be reduced. The query can be reshaped to reduce the amount of data fed into the conversion. Avoiding the conversion of large amounts of data is also a performance gain.

When writing new queries, some trial and error is required. It is best not to generate all the results during these intermediary studies. Instead, it is best to work with a small number of results by way of limit.

Filtering on a calculated column is not preferred. The use of a table column for filtering is recommended.

When the summarize operator is used and the group-by keys come with high cardinality, the use of hint.strategy = shuffle is recommended.

Monday, April 18, 2022

Kusto data access continued

Consumers: 

Kusto data access is an option to access the data directly from a database. It is immensely helpful when we want to browse through the data or explore it. APIs encapsulate logic with access to the data and provide validation, error translation, and response formatting along with diagnosability, telemetry and troubleshooting help. These are owned by the service team that also owns the data. 

Kusto queries can also be quite elaborate or custom-defined to suit specific needs, and they are faster and more lightweight compared to the staged management, release pipelines, and scheduled delivery of features introduced into the APIs. The ability to include such queries in background automation is sometimes useful because they do not require interactivity with the customer, and those automations can be kicked off periodically or on demand.

Kusto data access can be programmatic, involving the use of a query provider and an HTTP client. No-code or low-code scenarios prefer this approach. 

There are two ways of consuming data from Kusto. These are: 

1.       Polling 

2.       On-demand fetch 

We compare them as follows: 

When the changes are infrequent but the scenario requires immediate responsiveness, on-demand direct access can be especially efficient. The fetch gets just the data that has changed, as soon as the change happens, which enables downstream systems to be reactive and changes to propagate immediately. The downstream systems are also relieved from implementing anything more than a message handler. There is also the benefit that the scope of the change is passed down to the subscribers, so they get additional information on just what data changed. If the changes become frequent, however, the number of messages grows large, leading to performance bottlenecks, variations in queue size, and delays in message delivery.

Instead, when the changes span a lot of data, it is best to get those changes in bulk. A polling mechanism can get changes in batches over time and then process all those changes. It can even find only the updates made since the previous polling, which enables incremental updates at the destination. Since a polling mechanism is a loop that perpetually finds changes, if any, and applies them to the destination, it can work in the background, even as a single worker, if it has a destination store. A polling mechanism is a read-only operation, and therefore it does not need to fetch the data from the store where the configuration is being actively updated. 

 

Sunday, April 17, 2022

Kusto data access

 

Purpose:

Many automations using a Workflow Management System framework must reconcile with disparate sources of information. Some of them reside in the configuration store, some in Workflow Management System and others in external systems. Azure Cloud services and teams leveraging a Workflow Management System find themselves having to write a query to get this information. While their purpose changes on a case-by-case basis, the mechanism to do so has evolved to a point where it is becoming a standard as seen from the above ARM service portfolio as well as the requirements from foundational services. This document is a definitive guide towards using the standard way to retrieve this information.

The benefit of standardization and this documentation is that it has streamlined and smoothed the process that would inevitably be encountered by teams attempting to write their own solutions. There are also significant cost savings in following this guide rather than resorting to the source depot tools and offline mechanisms. It also improves the visibility of the retrieval process as well as the accuracy of the data, with little or no maintenance.

Why Kusto?                                   

Workflow Management System workflow authors are rightfully concerned about the support for their workflows in different clouds such as public, sovereign and air gapped clouds. The goal for these clouds has been to build and prepare all the required packages and workflows to run on the low side and then replicate to the high side. In this regard, Kusto comes with two primary benefits:

1)      Separation of read-only from read-write access for all workflows and activities, universally accessible without specific onboarding. Data virtualization with something approaching public access has never been as easy as it is with Kusto. Different systems can access each other's clusters and databases using the same connection with little or no performance overhead.

2)      Kusto has incredible support for the language, execution, and results from almost all angles, so that writers can not only write their query once but also safeguard it from changes due to maintenance, updates, data variations, and software revisions. This eliminates much of the cost of conventional mechanisms that use files, repositories, toolsets, and the like.

These reasons are sufficient to move the read-only traffic away from Cosmos DB to Kusto.

Why Workflow Management System?

Azure Buildout workflows must run in Workflow Management System, and while new deployment technologies may be content to remain above ARM, Workflow Management System must support foundational services deployed to dedicated clusters as well as deployments in isolated clouds and everything in between. Many workflows and activities produce and consume information that must remain shared. Earlier, build-drop locations and Blob storage proved to be great repositories, but writing queries requires a repository such as Kusto to enable these workflows and activities within Workflow Management System.

Will configuration store also show up in Kusto?

If there is existing automation relying directly on configuration store repository, it is not recommended that we make changes to those automations immediately. The Kusto data does have the same information as the depot including Cloud Information, geographical information, Region information, AZ Information, DC Information, Edge Zone, and their corresponding staging information. New automations and activities can leverage Kusto data access. It is likely that the new workflows and activities will replace the old ones rather than require upgrading and paying the technical debt on the old ones.

How far behind is the data until it is refreshed?

Different systems have different refresh intervals, and even the refresh of tables within the same database can vary. Usually this is a matter of minutes, but for large tables it can take up to an hour. Most of these intervals will be listed in the documentation from the corresponding systems under the Kusto access section, but it is safe to assume that the data will lag the transactional create, update, and delete operations in the corresponding stores.

I just got a new task involving Kusto, how do I get started?

Determine whether a producer or a consumer is required for Kusto data access. The steps to follow in the guide differ according to a producer or a consumer. Please refer to the post: http://ravinote.blogspot.com/2022/04/kusto-data-producer.html?m=1 for getting started with producers.

Saturday, April 16, 2022

 

Improving Queries Part 2

This document is a continuation of an earlier one here, which discusses the nuances of writing queries and the issues often encountered after they are written. In this article we explore a few more resolutions around data syntax, semantics, and constraints.

A query often includes a join, and while the columns between the left and right tables can be resolved by name, the results of the join might change considerably if the values do not match exactly. The join operator applies an equality comparison that is case sensitive. The values in the columns used to join may differ from each other in leading and trailing whitespace or other non-alphanumeric characters. While a cursory check of a few hundred records is possible, that technique cannot be applied to many records, and it is rather difficult to compile a complete list of the anomalies in the data pertaining to the columns used for joining. Instead, it is easier to canonicalize the values to a form in which they can be compared. For example, converting all values to lower case as a new computed column in both participating tables eliminates mismatches based on case sensitivity. Similarly, removing trailing characters or extracting only the alphanumeric characters from the values enables the inclusion of more records that would otherwise be excluded from the join. Such logic is easily verified to be sufficient by counting the number of records on both the left and the right tables.
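
A sketch of canonicalizing the join keys on both sides; the table names and the Name column are assumptions:

TableA
| extend JoinKey = tolower(trim(@"\s+", Name))
| join kind=inner (
    TableB
    | extend JoinKey = tolower(trim(@"\s+", Name))
) on JoinKey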

Another concern for writing queries is the meaning of columns. It might not be surprising for query writers to find that many columns refer to the same notion of an attribute for an entity. A simple name column might be available as a friendly name, a business name, or even a legal name. The choice of column derives not only from the business requirements but also from the operational need to use the one with the best values for enabling a join, both now and later. This means that query writers must explore some data for this purpose. It is not always easy or possible to inspect the data, because it might be protected, as in production systems, or subject to compliance requirements around privacy or security, as when storing credit card data. Samples can be drawn from equivalent systems that are temporally and spatially spread out to best represent the actual data. Otherwise, it is usually possible to generate data using random characters.

Sometimes, the best efforts to come up with a logic for joining are defeated on the grounds of complexity, scale, or its arduous nature. The values may also not be unique between rows. In such cases, a composite key comprising one or more columns, added as a computed column, can be used to make the join. This additional logic overcomes the limitations of data in a single column and enables a robust join of the two tables. When a sequence of tables must be joined one with the other, we can repeat this evaluation for every join between two tables at a time.
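
A sketch of joining on a composite computed column; the table and column names are assumptions:

TableA
| extend CompositeKey = strcat(tolower(Region), "|", tostring(AccountId))
| join kind=inner (
    TableB
    | extend CompositeKey = strcat(tolower(Region), "|", tostring(AccountId))
) on CompositeKey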

Lastly, as joins increase, the number of records in the result set might decrease as fewer and fewer records match all the joining criteria. The decrease in the number of results at each join must be studied to ensure that the number of records in the final result set is correct.

 

Friday, April 15, 2022

Kusto data producer

 Azure Kusto data access differs by producer and consumer. This article explains how to ingest data into Kusto with a producer and how to retrieve data with a consumer.

Producers: 

There are a few products to publish data to Kusto. These include: 

·        Metadata Mover: Cosmos DB changes result in a change feed that can be used to ingest Kusto data tables. 

·        Azure Data Factory: This is a service designed to bridge disparate data sources. It is quick and easy to use a preconfigured data pipeline, it connects to both a SQL DB and a Kusto (KQL) cluster, and it allows the creation of scheduled triggers. The pipeline will not run an activity individually. ADF (Azure Data Factory) requires MSI (Managed Service Identity) and does not have support for MI (Managed Identities). 

·        Studio Jobs: replicates the source to the destination fully every time, including new columns. 

  

These involve two different approaches primarily: 

·        Change tracking: The source must support change tracking and publishing via a change feed. The feed should detail every delta in terms of the scope, type, and description of the change, and must come with versioning so that each change can be referred to by its version. These changes can then be applied to Kusto. 

·        E2E Workflow: This is a two-stage publishing.  

o   The first stage does an initial load from the source to the destination, using some watermark such as a version or a timestamp for the data to be transferred to the destination. 

o   The second stage involves periodic incremental loading 

Some progress indicator is needed for incremental updates. Prefer overwriting the destination to reading and merging changes, if any.
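
A hedged sketch of the two styles using Kusto ingest-from-query commands; the table names and the watermark value are assumptions, and each command is run separately:

// overwrite the destination in full
.set-or-replace DestinationTable <|
    SourceTable

// or append only the records beyond the last processed watermark
.append DestinationTable <|
    SourceTable
    | where Version > 42    // watermark recorded from the previous run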

There can also be a hybrid man-in-the-middle implementation that acts as a consumer for the source and a producer for the destination, but it would implement both the producer and the consumer in code rather than leveraging the capabilities of these technologies in an ADO pipeline. 

ADO (Azure DevOps) is widely recognized as the best way to create a pipeline in Azure, and the task of adding data to Kusto requires connecting your data source to Kusto.  

The Azure DevOps project represents a fundamental container where data is stored when added to Azure DevOps. Since it is a repository for packages and a place for users to plan, track progress, and collaborate on building workflows, it must scale with the organization. When a project is created, a team is created by the same name. For an enterprise, it is better to use collection-project-team structure which provides teams with a high level of autonomy and supports administrative tasks to occur at the appropriate level.