Sunday, April 17, 2022

Kusto data access

 

Purpose:

Many automations built on a Workflow Management System framework must reconcile information from disparate sources. Some of it resides in the configuration store, some in the Workflow Management System, and some in external systems. Azure cloud services and teams leveraging a Workflow Management System find themselves having to write a query to get this information. While the purpose changes on a case-by-case basis, the mechanism for doing so has evolved to the point where it is becoming a standard, as seen from the ARM service portfolio as well as the requirements from foundational services. This document is a definitive guide to the standard way of retrieving this information.

The benefit of standardization and this documentation is that it streamlines the process that teams would otherwise encounter when attempting to write their own solutions. There are also significant cost savings in following this guide rather than resorting to source depot tools and offline mechanisms. It also improves the visibility of the retrieval process as well as the accuracy of the data, with little or no maintenance.

Why Kusto?                                   

Workflow Management System workflow authors are rightfully concerned about support for their workflows in different clouds such as public, sovereign, and air-gapped clouds. The goal for these clouds has been to build and prepare all the required packages and workflows to run on the low side and then replicate them to the high side. In this regard, Kusto comes with two primary benefits:

1)      Separation of read-only from read-write access for all workflows and activities, with data that is universally accessible without specific onboarding. Data virtualization has never been as easy as it is with Kusto: different systems can query each other's clusters and databases over the same connection with little or no performance overhead (a cross-cluster query sketch follows below).

2)      Kusto has strong support for the query language, execution, and results from almost all angles, so that authors can not only write their query once but also safeguard it from changes due to maintenance, updates, data variations, and software revisions. This eliminates much of the cost of conventional mechanisms built on files, repositories, toolsets, and the like.

These reasons are sufficient to move the read-only traffic away from Cosmos DB to Kusto.
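As a minimal sketch of that read-only access, the following KQL query reads a table from another system's cluster without any write permissions being granted; the cluster URI, database, table, and column names here are all placeholders to be replaced with the names published by the owning system.

// Placeholder names: substitute the cluster URI, database, and table published by the owning system.
cluster("othersystem.kusto.windows.net").database("SharedDb").RegionStatus
| where Cloud == "Public"
| project Region, BuildoutState
| take 10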

Why Workflow Management System?

Azure Buildout workflows must run in the Workflow Management System, and while new deployment technologies may be content to remain above ARM, the Workflow Management System must support foundational services deployed to dedicated clusters as well as deployments in isolated clouds and everything in between. Many workflows and activities produce and consume information that must remain shared. Earlier, build-drop locations and Blob storage proved to be great repositories, but writing queries requires a repository such as Kusto to enable these workflows and activities within the Workflow Management System.

Will configuration store also show up in Kusto?

If there is existing automation relying directly on the configuration store repository, it is not recommended to change those automations immediately. The Kusto data does have the same information as the depot, including cloud information, geographical information, region information, AZ information, DC information, Edge Zone information, and their corresponding staging information. New automations and activities can leverage Kusto data access. It is likely that the new workflows and activities will replace the old ones rather than require upgrading the old ones and paying down their technical debt.
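For illustration only, a new automation might read such reference data with a query like the sketch below; the table and column names are hypothetical, and the actual schema comes from the owning system's Kusto access documentation.

// Hypothetical schema for illustration; consult the owning system's documentation for the real names.
RegionInfo
| where CloudName == "Public"
| project RegionName, Geography, AvailabilityZoneCount, StagingState
| order by RegionName asc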

How far behind is the data between refreshes?

Different systems have different refresh intervals, and even the refresh of tables in the same database can vary. Usually this is a matter of minutes, but large tables can take up to an hour to refresh. Most of these intervals are listed in the documentation of the corresponding systems under their Kusto access sections, but it is safe to assume that the data will lag the transactional create, update, and delete operations in the corresponding stores.
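A quick way to gauge that lag, assuming the table has the default IngestionTime policy enabled, is to compare the latest ingestion time with the current time; MyTable below is a placeholder.

// MyTable is a placeholder; ingestion_time() requires the IngestionTime policy on the table.
MyTable
| summarize LatestIngestion = max(ingestion_time())
| extend LagInMinutes = datetime_diff('minute', now(), LatestIngestion)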

I just got a new task involving Kusto, how do I get started?

Determine whether a producer or a consumer is required for Kusto data access; the steps to follow in the guide differ between the two. Please refer to the post http://ravinote.blogspot.com/2022/04/kusto-data-producer.html?m=1 to get started with producers.

Saturday, April 16, 2022

 

Improving Queries Part 2

This document is a continuation of an earlier one here, which discussed the nuances of writing queries that are often encountered only after the queries are written. In this article we explore a few more resolutions around data syntax, semantics, and constraints.

A query often includes a join, and while the columns between the left and right tables can be resolved by name, the results of the join might change considerably if the values do not match exactly. The join operator applies an equality comparison that is case sensitive. The values in the columns used to join may differ from each other only in case, in leading and trailing whitespace, or in other non-alphanumeric characters. While a cursory check of a few hundred records is possible, that technique cannot be applied to many records, and it is rather difficult to compile a complete list of the anomalies in the columns used for joining. Instead, it is easier to canonicalize the values to a form in which they can be compared. For example, converting all values to lower case in a new computed column in both participating tables eliminates mismatches based on case sensitivity. Similarly, removing leading and trailing whitespace, or extracting only the alphanumeric characters from the values, enables the inclusion of records that would otherwise be dropped from the join. Such logic is easily verified to be sufficient by counting the matched records against the number of records in both the left and the right tables.
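A sketch of this canonicalization in KQL might look like the following, where LeftTable, RightTable, and Name are placeholders; the computed key lower-cases the value and strips non-alphanumeric characters on both sides before the join.

// Placeholder table and column names; both sides are normalized the same way before joining.
let LeftNorm = LeftTable
    | extend JoinKey = tolower(replace_regex(Name, @"[^a-zA-Z0-9]", ""));
let RightNorm = RightTable
    | extend JoinKey = tolower(replace_regex(Name, @"[^a-zA-Z0-9]", ""));
LeftNorm
| join kind=inner RightNorm on JoinKey
| summarize MatchedRows = count()

Comparing MatchedRows against the row counts of the two inputs indicates whether the normalization is sufficient.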

Another concern when writing queries is the meaning of columns. It might not be surprising for query writers to find that many columns refer to the same notion of an attribute for an entity. A simple name column might be available as a friendly name, a business name, or even a legal name. The choice of column derives not only from the business requirements but also from the operational need to use the one that has the best values for enabling a join, both now and later. This means that query writers must explore some data for this purpose. It is not always easy or possible to inspect the data, because it might be protected, as in production systems, or restricted for compliance reasons such as privacy or security when storing, say, credit card data. Samples can be drawn from equivalent systems that are temporally and spatially spread out so that they best represent the actual data. Otherwise, it is usually possible to generate data using random characters.
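One lightweight way to explore the candidate columns, with Entities and the three name columns as hypothetical examples, is to compare their distinct and empty counts and then pull a small random sample for visual inspection; the two statements below are run separately.

// Hypothetical entity table and name columns.
Entities
| summarize DistinctFriendly = dcount(FriendlyName),
            DistinctBusiness = dcount(BusinessName),
            DistinctLegal    = dcount(LegalName),
            EmptyFriendly    = countif(isempty(FriendlyName))

// A small random sample for eyeballing the values.
Entities
| sample 20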

Sometimes the best efforts to come up with joining logic are defeated by its complexity, scale, or arduous nature. The values may also not be unique between rows. In such cases, a composite key combining one or more columns into a computed column can be used to make the join. This additional logic overcomes the limitations of data from a single column and enables a robust join between two tables. When a sequence of tables must be joined one with the other, this evaluation can be repeated for every pairwise join.
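A sketch of such a composite key, using hypothetical Orders and Deployments tables with Region and ServiceName columns, might be:

// Neither Region nor ServiceName is unique on its own, so they are combined into one computed key.
Orders
| extend CompositeKey = strcat(tolower(Region), "|", tolower(ServiceName))
| join kind=inner (
    Deployments
    | extend CompositeKey = strcat(tolower(Region), "|", tolower(ServiceName))
  ) on CompositeKey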

Lastly, as the number of joins increases, the number of records in the result set might decrease as fewer and fewer records match all the joining criteria. The decrease in the number of results at each join must be studied to ensure that the final result set is correct.
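A simple check, with A, B, and the join column Key as placeholders, is to carry the input counts alongside the joined count so the shrinkage at each step is visible:

// Placeholder names: A, B, and the join column Key.
let LeftCount  = toscalar(A | count);
let RightCount = toscalar(B | count);
A
| join kind=inner B on Key
| summarize JoinedCount = count()
| extend InputLeftCount = LeftCount, InputRightCount = RightCount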

 

Friday, April 15, 2022

Kusto data producer

Azure Kusto data access differs between producers and consumers. This article explains how to ingest data into Kusto as a producer and how to retrieve it as a consumer.

Producers: 

There are a few products to publish data to Kusto. These include: 

·        Metadata Mover: Cosmos DB changes produce a change feed that can be used to populate Kusto data tables. 

·        Azure Data Factory: a service designed to bridge disparate data sources. A preconfigured data pipeline is quick and easy to use, connects to both a SQL DB and a KQL (Kusto Query Language) cluster, and allows the creation of scheduled triggers. The pipeline will not run an activity individually. ADF (Azure Data Factory) requires MSI (Managed Service Identity) and does not have support for MI (Managed Identities). 

·        Studio Jobs: replicates the source to the destination fully every time, including new columns. 

  

These primarily involve two different approaches: 

·        Change tracking: the source must support change tracking and publishing via a change feed. The feed should detail every delta in terms of the scope, type, and description of the change and must come with versioning so that each change can be referred to by its version. These changes can then be applied to Kusto. 

·        E2E Workflow: This is a two-stage publishing process: 

o   The first stage does an initial load from the source to the destination, using a watermark such as a version or a timestamp to select the data to be transferred to the destination. 

o   The second stage involves periodic incremental loading. 

Some progress indicator is needed for incremental updates. Prefer overwriting the destination to reading and merging changes, if any; a sketch of both options follows below.
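As a rough sketch using Kusto ingest-from-query commands (DestinationTable, SourceStagingTable, and the Version watermark column are placeholders), the incremental path appends only rows beyond the current high watermark while the full-refresh path simply overwrites the destination; each command is run on its own.

// Incremental load: append rows newer than the destination's current watermark.
// Assumes the destination has already been seeded by the initial load.
.set-or-append DestinationTable <|
    SourceStagingTable
    | where Version > toscalar(DestinationTable | summarize max(Version))

// Full refresh: overwrite the destination rather than reading and merging changes.
.set-or-replace DestinationTable <|
    SourceStagingTable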

There can also be a hybrid man-in-the-middle implementation that acts as a consumer for the source and a producer for the destination, but that would mean implementing both the producer and the consumer in code rather than leveraging the capabilities of these technologies in an ADO pipeline. 

ADO (Azure DevOps) is widely used for creating pipelines in Azure, and the task of adding data to Kusto requires connecting your data source to Kusto. 

The Azure DevOps project represents the fundamental container where data is stored when added to Azure DevOps. Since it is a repository for packages and a place for users to plan, track progress, and collaborate on building workflows, it must scale with the organization. When a project is created, a team is created with the same name. For an enterprise, it is better to use a collection-project-team structure, which provides teams with a high level of autonomy and supports administrative tasks at the appropriate level. 

Thursday, April 14, 2022

Query execution continued...

The separation of the data records to process and the optimization of the operations on those records can be leveraged to reduce the total time by parallelizing the queries across separate workers. Since the sets of records are not shared and only the results are shared, the workers merely have to report their processing to a central accumulator, and that reduces the sequential time by a factor of the number of workers.
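In KQL this pattern is approximated by the partition operator, sketched below with hypothetical Records, CustomerId, and Timestamp names; each partition is processed independently and only the per-partition results are unioned back together.

// Placeholder names; the shuffle strategy distributes the partitions across cluster nodes.
Records
| partition hint.strategy=shuffle by CustomerId
  (
      summarize ProcessedRows = count(), LastSeen = max(Timestamp) by CustomerId
  )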

Query languages, tools, and products come with nuances, built-ins, and features that can further help to analyze, optimize, and rewrite queries so that they perform better. Some form of query execution statistics is also made available via the store or from profiling tools. The efficiency of queries can also be improved by breaking up their structure and introducing pipelining, so that the results passed from one stage to the next can be studied.
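On a Kusto cluster, for instance, recent execution statistics can be inspected with a command such as the sketch below; the exact columns available can vary by service version.

// Show the slowest queries of the last hour, most expensive first.
.show queries
| where StartedOn > ago(1h)
| project StartedOn, Duration, State, Text
| order by Duration desc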

Pipelined execution involves the following stages:   

1) acquisition   

2) extraction   

3) integration   

4) analysis   

5) interpretation   

The challenges facing pipelined execution involve:   

1) Scale: A pipeline must hold against a data tsunami. In addition, data flow may fluctuate, and the pipeline must hold against the ebb and the flow. Data may be measured in rate, duration, and size, and the pipeline may need to become elastic. Column stores and time-series databases have typically helped with this.   

2) Heterogeneity: Data types – their format and semantics have evolved to a wide degree. The pipeline must at least support primitives such as variants, objects and arrays. Support for different data stores has become more important as services have become microservices and independent of each other in their implementations including data storage. Data and their services have a large legacy which also needs to be accommodated in the pipeline. ETL operations and structured storage are still the norm for many industries.   

3) Extension: ETL may require flexibility in extending logic and in its usage on clusters versus servers. Microservices are much easier to write. They became popular with Big Data storage. Together they have bound compute and storage into their own verticals, with data stores expanding in number and variety. Queries written in one service now need to be rewritten in another service, while the pipeline may or may not support data virtualization.    

4) Timeliness: Both synchronous and asynchronous processing need to be facilitated so that some data transfers can run online while others may be relegated to the background. Publisher-subscriber message queues may be used in this regard. Services and brokers do not scale the way cluster-based message queues do. It might take nearly a year to fetch the data into the analytics system and only a month for the analysis. While the benefit for the user may be immense, their patience for the overall elapsed time may be thin.   

5) Privacy: A user's location, personally identifiable information, and data from location-based services must be redacted. This involves not only parsing for such data but also doing so over and over, starting from admission control on the boundaries of integration.

Wednesday, April 13, 2022

Query execution – what the documentation does not cover.

Introduction: Writing queries has propagated upward from technology to business applications. It has started pervading even the management of business as leaders lean into the monitoring and analytical capabilities of queries. Yet the best practices for writing queries available online do not prepare the user for the often-encountered errors. This article covers only a handful of these advanced techniques.

Description: One of the most useful features of writing a query is that it can span diverse data sources. For example, it can refer to a table in one location and join its entries with a table from a different database. Functionally, such queries are easy to write so long as there is at least one column that can be used to make the join. But even a simple query can take a long time to execute, and one of the primary reasons for the resulting timeouts is that the data exceeds the number of records that can be processed in a reasonable amount of time. The cardinality of the data in either location matters, so it is best to drive the join from the smaller side. Since the data size does not show up in the query and the author is often unaware of it at the time of writing, it takes some repetition to get to this root cause and figure out the next steps. One of the mitigations for this kind of error is the distinct operator, which reduces the data size significantly by eliminating duplicates. When the records are not distinct, projecting only the columns required for the join and for the results lets the row set shrink, because the projection makes different records look the same. 
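A sketch of this reduction before a cross-database join is shown below, with Customers, Orders, OtherDb, and the column names all as placeholders; the remote side is cut down to the distinct join keys and the needed columns before the join runs.

// Placeholder names; reduce the remote table to distinct keys before joining.
let RecentBuyers = database("OtherDb").Orders
    | where OrderDate > ago(30d)
    | distinct CustomerId;
Customers
| project CustomerId, Name
| join kind=inner RecentBuyers on CustomerId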

Another approach to making records look the same is the use of computed columns that put different records into the same bucket, so that the actions that were to be taken on the records can now be taken on the buckets. There are several such operators for putting records into buckets, also called hashes. The use of proper hashing functions can result in buckets with roughly equal numbers of records. This gives a uniform distribution of workload, which we will bring up again when parallelizing execution.
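A small illustration with a hypothetical Events table and UserId column: hash() with a modulus assigns every record to one of a fixed number of buckets, and the per-bucket counts show how even the distribution is.

// Placeholder names; 16 buckets chosen arbitrarily for the sketch.
Events
| extend Bucket = hash(UserId, 16)
| summarize RecordsPerBucket = count() by Bucket
| order by Bucket asc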

Records can also get partitioned without a hashing operator when they are stored separately even though they belong to the same logical table. A materialized view built this way is significantly more useful for querying because it alleviates the contention of going to the original source. The cost of translation is borne once, so the queries do not need to recompute the data every time by going to the source and then working on it. The use of materialized views also reduces the cost significantly when the view is stored local to the store where the queries are executed rather than fetching the original data from a remote source every time.
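In Kusto, for example, such a precomputed rollup can be declared as a materialized view; the sketch below uses placeholder Events, Timestamp, and DailyEventCounts names, and the view is then queried like any table without returning to the raw source.

// Placeholder names; the aggregation is maintained incrementally as new data is ingested.
.create materialized-view DailyEventCounts on table Events
{
    Events
    | summarize EventCount = count() by Day = bin(Timestamp, 1d)
}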

The side effect of unknowingly processing a large number of records is that the systems participating in the query execution can easily be inundated with calls, to the point where they begin to apply the rate limits they have been configured with. When multiple systems participate in multiple joins in a single query, the overall execution halts as soon as any one of the systems begins to error out. This means that the author has no way of predicting what remedial actions to take on the query until the errors are actually seen or the data cardinality is determined beforehand.

Similarly, the problem of unknown data size is compounded when queries are nested one within another. This is rarely visible when the results of the nested queries are merely passed up to higher levels. The same care taken to reduce data size and optimize the operations on the reduced size must be reapplied at each level of the query. This benefits the overall execution in the same way a bubble propagates up from the depths.
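One way to apply that reduction once and reuse it across levels, with InnerSource, OuterTable, and Key as placeholders, is to cache the reduced inner result with materialize():

// Placeholder names; materialize() caches the reduced inner result so outer levels do not re-evaluate it.
let ReducedInner = materialize(
    InnerSource
    | where Timestamp > ago(1d)
    | distinct Key);
OuterTable
| join kind=inner ReducedInner on Key
| summarize MatchedRows = count()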

Lastly, the separation of the data records to process and the optimization of the operations on those records can be leveraged to reduce the total time by parallelizing the queries across separate workers. Since the sets of records are not shared and only the results are shared, the workers merely have to report their processing to a central accumulator, and that reduces the sequential time by a factor of the number of workers.

Reference: Sample queries

Tuesday, April 12, 2022

This article is a continuation of a series of articles, with the most recent discussion on managed Service Fabric clusters referenced here.

Azure offers a control plane for all resources that can be deployed to the cloud, and services take advantage of it both for themselves and for their customers. While Azure Functions allow extensions via new resources, Azure Resource Provider and ARM APIs provide extensions via existing resources. This eliminates the need to introduce new processes around new resources and is a significant win for reusability and user convenience. New and existing resources are not the only ways to write extensions; there are other options, such as the Azure Store or other control planes such as container orchestration frameworks and third-party platforms. This article focuses on deploying Service Fabric via ARM templates.

 

The Service Fabric managed cluster itself is represented by an ARM template, a JSON document that defines the infrastructure and configuration for the cluster. The template uses declarative syntax, so there is no need to write commands to create the deployment. The template takes parameters such as those shown below, which are applied to the resource and the cluster is created.

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "clusterName": { "value": "GEN-UNIQUE" },
        "clusterSku": { "value": "Basic" },
        "adminUserName": { "value": "GEN-UNIQUE" },
        "adminPassword": { "value": "GEN-PASSWORD" },
        "clientCertificateThumbprint": { "value": "GEN-SF-CERT-THUMBPRINT" },
        "nodeTypeName": { "value": "NT1" },
        "vmImagePublisher": { "value": "MicrosoftWindowsServer" },
        "vmImageOffer": { "value": "WindowsServer" },
        "vmImageSku": { "value": "2019-Datacenter" },
        "vmImageVersion": { "value": "latest" },
        "vmSize": { "value": "Standard_D2s_v3" },
        "vmInstanceCount": { "value": 5 },
        "dataDiskSizeGB": { "value": 128 },
        "managedDataDiskType": { "value": "StandardSSD_LRS" }
    }
}

 

 A vmInstanceCount of 5 is sufficient for a quorum of 3 in an ensemble of 5.

Encryption options can be selected for the VM.

The multiplePlacementGroups property can be used in the nodeType definition to specify a large VMSS. Each nodeType is backed by a VMSS.

Managed identity can be configured via the vmManagedIdentity property, which has been added to node type definitions and contains a list of identities that may be used.

Specifying the managed disk implies the disk type and size do not necessarily need to be specified and can be used for all storage types.

 

Deployment using the template can be kicked off directly from the CLI, PowerShell, the portal, or the SDK, all of which provide programmability options. Resources can be cleaned up by deleting the resource group. The ARM template parameters above are used to create a basic managed Service Fabric cluster with node type NT1.

Monday, April 11, 2022

 

This article is a continuation of the part 1 article on configuration automations, part 2, which describes the syntax of the configuration templates, and part 3, which describes the process of finding and loading configurations. This article describes some of the common routines in configuration automation for existing products.

Take Ubuntu, for example, and you will usually find the following software: Apache, MySQL, ZooKeeper, Storm, Kafka, Spark, Cassandra, Nginx, Marathon, and Docker. Take Windows, and we have the following software: .NET, Visual Studio, Team Foundation Server, Octopus, etc. While Docker has made applications portable, we want to look at a solution for continuous deployment that involves configuration management. 

CloudFoundry is a tool for automated software deployment and hosting with the following benefits: it decreases the time to production, speeds up the develop->build->test->deploy transitions, increases the productivity of developers, improves the quality of the products, improves the efficiency of IT operations, and increases the utilization of hardware. Octopus is a tool that performs massive deployments over several different virtual machines with the same configuration and software of choice. It can maintain simultaneous dev, test, and production environments and tear them down at will. Octopus can prepare a VM with any kind of installer. This is an advantage over CloudFoundry because different VMs, or a pool of VMs, can be chosen for configuration management.

Any configuration service or solution for configuration management must contend with two primary responsibilities. First, it must allow transactional read-write of configuration key-values, including bulk-mode insertion, update, and deletion. Second, it must allow publisher-subscriber notifications of changes associated with the various scopes of configurations.

The first is easily achieved when the store is externalized such that the read-write paths are fast and any analysis can be done over read-only paths that are separate from the transactional path.

The second is easily achieved when a queue is available for delivering subscriber notifications via fan-out. The configuration store has no exposure to the subscribers other than through the queue, which keeps the store the source of truth for the configurations. The queue can have internal and external subscribers, and the update to the state is bidirectional. Subscribers get their notifications when they request them; this keeps writes cheap because the data does not need to be pushed out. If the queue sends messages back to the subscribers, it is a fan-out process. The subscribers can choose to check in at selective times, and the server can be selective about which subscribers to update. Both methods work well in certain situations. The fan-out happens in both writing and loading, and it can be made selective as well; it can be limited during both pull and push. Disabling the writes to all subscribers can significantly reduce the cost; those subscribers can instead load the updates only when reading. It is also helpful to keep track of which subscribers are active over a period so that only those subscribers get preference.

Both the first and the second can be written in any programming language or technology stack, but implementing them in C# as a cloud service works very well for elasticity, performance, and scalability requirements.