Thursday, April 14, 2022

Query execution continued...

Separating the data records to be processed and optimizing the operations on those records can be taken further by parallelizing the queries across separate workers, which reduces the time needed for all the records. Since the sets of records are not shared and only the results are, each worker merely reports its partial results to a central accumulator, and that reduces the sequential time by roughly a factor of the number of workers.
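
A minimal sketch of this in Python, where each worker handles its own partition and only small per-partition reports flow back to a central accumulator (the partitioning and the error predicate are illustrative):

from concurrent.futures import ProcessPoolExecutor

def process_partition(records):
    # worker: processes only its own partition and returns a small report
    # (here, a count of records matching an illustrative predicate)
    return sum(1 for r in records if r.get("status") == "error")

def parallel_count(partitions, workers=4):
    # central accumulator: merges the per-worker reports
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_partition, partitions))

if __name__ == "__main__":
    partitions = [[{"status": "ok"}, {"status": "error"}] for _ in range(8)]
    print(parallel_count(partitions))  # wall-clock time shrinks roughly with the worker count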

Query languages, tools and products come with nuances, built-ins and even dedicated features that can further help to analyze, optimize and rewrite queries so that they perform better. Some form of query execution statistics is also made available via the store or from profiling tools. Efficiency can also be improved by breaking up a query's organizational structure and introducing pipelining, so that the results passed from one stage to the next can be studied.

Pipelined execution involves the following stages, with a sketch after the list:

1) acquisition   

2) extraction   

3) integration   

4) analysis   

5) interpretation   
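
A minimal sketch of such a staged pipeline in Python, with placeholder functions standing in for the real acquisition, extraction, integration, analysis and interpretation logic:

from functools import reduce

# Placeholder stage functions; real stages would pull from sources, parse,
# join, aggregate and summarize respectively.
def acquisition(_):
    return ["raw record one", "raw record two and three"]

def extraction(records):
    return [r.split() for r in records]

def integration(parsed):
    return {i: tokens for i, tokens in enumerate(parsed)}

def analysis(integrated):
    return {key: len(tokens) for key, tokens in integrated.items()}

def interpretation(stats):
    return f"average tokens per record: {sum(stats.values()) / len(stats)}"

PIPELINE = [acquisition, extraction, integration, analysis, interpretation]

def run(pipeline, seed=None):
    # each stage's result is passed to the next, so intermediate results can be studied
    return reduce(lambda data, stage: stage(data), pipeline, seed)

print(run(PIPELINE))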

The challenges facing pipelined execution involve:   

1) Scale: A pipeline must hold up against a data tsunami. In addition, data flow may fluctuate, and the pipeline must hold against the ebb and the flow. Data may be measured in rate, duration and size, and the pipeline may need to become elastic. Column stores and time-series databases have typically helped with this.

2) Heterogeneity: Data types – their formats and semantics – have evolved to a wide degree. The pipeline must at least support primitives such as variants, objects and arrays. Support for different data stores has become more important as services have become microservices, independent of each other in their implementations, including data storage. Data and their services also carry a large legacy, which needs to be accommodated in the pipeline. ETL operations and structured storage are still the norm for many industries.

3) Extension: ETL may require flexibility in extending logic and in its usage on clusters versus servers. Microservices are much easier to write, and they became popular with Big Data storage. Together they have bound compute and storage into their own verticals, with data stores expanding in number and variety. Queries written in one service now need to be rewritten in another service, while the pipeline may or may not support data virtualization.

4) Timeliness: Both synchronous and asynchronous processing need to be facilitated so that some data transfers can run online while others are relegated to the background. Publisher-subscriber message queues may be used in this regard; standalone services and brokers do not scale as well as cluster-based message queues. It might take nearly a year to fetch the data into the analytics system and only a month for the analysis. While the benefit for the user may be immense, their patience for the overall time elapsed may be thin.

5) Privacy: A user's location, personally identifiable information and location-based service data are required to be redacted. This involves not only parsing for such data but also doing so repeatedly, starting from admission control at the boundaries of integration, as sketched below.
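
A minimal sketch of redaction at the admission boundary, assuming simple regular expressions stand in for real PII detectors:

import re

# illustrative patterns; production systems use far richer PII detectors
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "latlon": re.compile(r"-?\d{1,3}\.\d+,\s*-?\d{1,3}\.\d+"),
}

def redact(record: str) -> str:
    # applied at the admission boundary, before records enter the pipeline
    for label, pattern in PII_PATTERNS.items():
        record = pattern.sub(f"<{label} redacted>", record)
    return record

print(redact("user jane@example.com reported from 47.6062, -122.3321"))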

Wednesday, April 13, 2022

Query execution – what the documentation does not cover.

Introduction: Writing queries has propagated upward from technology to business applications. It has started pervading even the management of business, as managers lean into the monitoring and analytical capabilities of queries. Yet the best practices for writing queries available online do not prepare the user for the often-encountered errors. This article covers only a handful of these advanced techniques.

Description: One of the most useful features of writing a query is that it can span diverse data sources. For example, it can refer to a table in one location and join the entries of that table with another from a different database. Functionally, these are easy to write so long as there is at least one column that can be used to make the join. But even a simple query can take a long time to execute, and one of the primary reasons for the resulting timeouts is that the data exceeds the number of records that can be processed in a reasonable amount of time. The cardinality of the data in either location matters, so it is best to drive the join from the smaller side. Since the data size is not reflected in the query, and the author is often unaware of it at the time of writing, it takes some repetition to get to this root cause and figure out the next steps. One mitigation for this kind of error is the distinct operator, which reduces the data size significantly by eliminating duplicates. When the records are not distinct, projecting only the columns required for the join and the results enables the row set to shrink, because the projection makes different records look the same.
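
A minimal sketch of this mitigation, using pandas frames as stand-ins for the two data sources (the table and column names are hypothetical):

import pandas as pd

# stand-ins for two data sources joined on a shared column
orders = pd.DataFrame({"customer_id": [1, 1, 2, 3],
                       "region": ["east", "east", "west", "east"]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ada", "Grace", "Linus"]})

# project only the columns needed for the join and the result, then deduplicate,
# so the smaller, distinct side drives the join
small_side = orders[["customer_id", "region"]].drop_duplicates()
joined = small_side.merge(customers, on="customer_id")
print(joined)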

Another approach to making records look the same is the use of computed columns that put different records into the same bucket, so that the actions that were to be taken on individual records can instead be taken on the buckets. There are several such operators for putting records into buckets, also called hashes. The use of proper hashing functions results in buckets with more or less equal numbers of records. This gives a uniform distribution of workload, which we will bring up again when parallelizing execution.
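
A minimal sketch of such a computed bucket column, using a stable hash of a hypothetical join key:

import hashlib

def bucket_of(key: str, buckets: int = 8) -> int:
    # stable hash so the same key always lands in the same bucket
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % buckets

records = [{"customer_id": f"cust-{i}"} for i in range(1000)]
buckets = {}
for record in records:
    buckets.setdefault(bucket_of(record["customer_id"]), []).append(record)

# with a reasonable hash the buckets are roughly equal, giving a uniform workload
print(sorted(len(v) for v in buckets.values()))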

Records can also be partitioned without a hashing operator when they are stored separately even though they belong to the same logical table. A materialized view is significantly more useful for querying because it alleviates the contention of going to the original source. The cost of translation is borne once, so the queries do not need to recompute the data every time by going to the source and working on it. The use of materialized views also reduces cost significantly when the view is stored local to the store where the queries are executed rather than fetching the original data from a remote source every time.
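
A minimal sketch of materializing a remote result into a local SQLite table once so that later queries run against the local copy (the table name and query are hypothetical):

import sqlite3
import pandas as pd

def materialize(remote_result: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # the cost of translation is borne once, when the view is (re)built
    remote_result.to_sql("mv_daily_totals", conn, if_exists="replace", index=False)

conn = sqlite3.connect("local_cache.db")
materialize(pd.DataFrame({"day": ["2022-04-13"], "total": [42]}), conn)

# subsequent queries hit the local copy instead of the original remote source
print(conn.execute("SELECT total FROM mv_daily_totals WHERE day = '2022-04-13'").fetchall())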

The side effect of processing a large number of records without the author knowing about it is that the systems participating in the query execution can easily become inundated with calls, and those systems begin to apply the rate limits they have been configured with. When multiple systems participate in multiple joins within a single query, the overall execution is halted when any one of the systems begins to error out. This means the author has no way of predicting what remedial actions to take on the query until the errors are actually seen or the data cardinality is found beforehand.

Similarly, the problem of unknown data size is compounded when queries are nested one within the other. This is rarely visible when the results of the nested queries are merely passed up to higher levels. The same care taken to reduce data size and optimize operations on the reduced size must be reapplied at each level of the querying. This benefits the overall execution in the same manner as a bubble propagating up from the depths.

Lastly, separating the data records to be processed and optimizing the operations on those records can be taken further by parallelizing the queries across separate workers, which reduces the time needed for all the records. Since the sets of records are not shared and only the results are, each worker merely reports its partial results to a central accumulator, and that reduces the sequential time by roughly a factor of the number of workers.

Reference: Sample queries

Tuesday, April 12, 2022

This article is a continuation of the series of articles, with the most recent discussion on a managed Service Fabric cluster as referred to here.

Azure offers a control plane for all resources that can be deployed to the cloud, and services take advantage of it both for themselves and for their customers. While Azure Functions allow extensions via new resources, Azure Resource Providers and ARM APIs provide extensions via existing resources. This eliminates the need to introduce new processes around new resources and is a significant win for reusability and user convenience. New and existing resources are not the only ways to write extensions; there are other options, such as the Azure Store or other control planes like container orchestration frameworks and third-party platforms. This article focuses on deploying Service Fabric via ARM templates.

 

The Service Fabric managed cluster itself is represented by an ARM template, a JSON document that defines the infrastructure and configuration for the cluster. The template uses declarative syntax, so there is no need to write commands to create the deployment. The template takes parameters such as those shown below, which are applied to the resource when the cluster is created.

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "clusterName": { "value": "GEN-UNIQUE" },
        "clusterSku": { "value": "Basic" },
        "adminUserName": { "value": "GEN-UNIQUE" },
        "adminPassword": { "value": "GEN-PASSWORD" },
        "clientCertificateThumbprint": { "value": "GEN-SF-CERT-THUMBPRINT" },
        "nodeTypeName": { "value": "NT1" },
        "vmImagePublisher": { "value": "MicrosoftWindowsServer" },
        "vmImageOffer": { "value": "WindowsServer" },
        "vmImageSku": { "value": "2019-Datacenter" },
        "vmImageVersion": { "value": "latest" },
        "vmSize": { "value": "Standard_D2s_v3" },
        "vmInstanceCount": { "value": 5 },
        "dataDiskSizeGB": { "value": 128 },
        "managedDataDiskType": { "value": "StandardSSD_LRS" }
    }
}

 

 A vmInstanceCount of 5 is sufficient for a quorum of 3 in an ensemble of 5.

Encryption options can be selected for the VM.

The multiplePlacementGroups property can be used in the nodeType definition to specify a large virtual machine scale set (VMSS). Each nodeType is backed by a VMSS.

Managed identity can be configured via the vmManagedIdentity property, which has been added to node type definitions and contains a list of identities that may be used.

Since the disks are managed, the disk type and size do not necessarily need to be specified, and managed disks can be used with all storage types.

 

Deployment using the template can be kicked off directly from the CLI, PowerShell, the portal or the SDK, all of which provide programmability options. Resources can be cleaned up by deleting the resource group. The parameters shown above create a basic managed Service Fabric cluster with a node type named NT1.
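
A hedged sketch of the SDK route, assuming the azure-identity and azure-mgmt-resource Python packages, an existing resource group, and the template and parameters saved locally (names and file paths are illustrative):

import json
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<subscription-id>"           # placeholder
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# template and parameters saved locally; file names are illustrative
with open("azuredeploy.json") as t, open("azuredeploy.parameters.json") as p:
    template = json.load(t)
    parameters = json.load(p)["parameters"]

poller = client.deployments.begin_create_or_update(
    resource_group_name="sf-demo-rg",           # illustrative, pre-created resource group
    deployment_name="sf-managed-cluster",
    parameters={"properties": {
        "mode": "Incremental",
        "template": template,
        "parameters": parameters,
    }},
)
print(poller.result().properties.provisioning_state)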

Monday, April 11, 2022

 

This article is a continuation of the part 1 article on Configuration Automations, part 2, which describes the syntax of the configuration templates, and part 3, which describes the process of finding and loading configurations. This article describes some of the common routines in configuration automation for existing products.

Take Ubuntu, for example, and you will usually find the following software: Apache, MySQL, ZooKeeper, Storm, Kafka, Spark, Cassandra, Nginx, Marathon and Docker. Take Windows, and we have the following: .NET, Visual Studio, Team Foundation Server, Octopus, etc. While Docker has made applications portable, we want to view the solution for continuous deployment that involves configuration management.

CloudFoundry is a tool for automated software deployment and hosting with the following benefits: it decreases the time to production, speeds up the develop->build->test->deploy transitions, increases the productivity of developers, improves the quality of the products, improves the efficiency of IT operations and increases the utilization of hardware. Octopus is a tool that performs massive deployments over several different virtual machines with the same configuration and software of choice. It can maintain simultaneous dev, test and production environments and tear them down at will. Octopus can prepare a VM with any kind of installer, which is an advantage over CloudFoundry because different VMs or a pool of VMs can be chosen for configuration management.

Any configuration service or solution for configuration management must contend with two primary responsibilities. First, it must allow transactional read-write of configuration key values, including bulk-mode insertion, update and delete. Second, it must allow publisher-subscriber notifications of changes associated with various scopes of configuration.

The first is easily achieved when the store is externalized such that read-write paths are fast, and any analysis can be done over read-only paths that are kept separate from the transactional read-write paths.

The second is easily achieved when a queue is available for delivering subscriber notifications via fan-out. The configuration store has no exposure to the subscribers other than through the queue, which keeps the store the source of truth for the configurations. The queue can have internal and external subscribers, and the update of state is bidirectional. When subscribers pull on request, they get their notifications without the store having to send the data out, which keeps writes cheap. When the queue pushes messages to the subscribers, it is a fan-out process. Subscribers can choose to check in at selective times, and the server can be selective about which subscribers to update; both methods work well in certain situations. The fan-out happens on both writes and loads and can be limited during both pull and push. Disabling pushes to all subscribers can significantly reduce cost, since subscribers can load the updates only when reading. It also helps to keep track of which subscribers are active over a period so that only those subscribers get preference.
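
A minimal in-process sketch of that fan-out in Python, with the queue reduced to a per-scope list of handlers; a real implementation would sit behind a message broker:

from collections import defaultdict
from typing import Callable, Dict

class ConfigStore:
    # source of truth; subscribers only see changes through the fan-out
    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        self._subscribers: Dict[str, list] = defaultdict(list)  # scope -> handlers

    def subscribe(self, scope: str, handler: Callable[[str, str], None]) -> None:
        self._subscribers[scope].append(handler)

    def set(self, scope: str, key: str, value: str) -> None:
        self._values[f"{scope}/{key}"] = value
        for handler in self._subscribers[scope]:  # fan-out on write
            handler(key, value)

store = ConfigStore()
store.subscribe("billing", lambda k, v: print(f"billing service reloads {k}={v}"))
store.set("billing", "retry_count", "5")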

Both the first and the second can be written in any programming language or technology stack, but implementing them in C# as a cloud service works very well for elasticity, performance and scalability requirements.

 

Sunday, April 10, 2022

 

Configuration Automation part 3.

This article is a continuation of the part 1 article on Configuration Automations and part 2, which describes the syntax of the configuration templates. This part describes the process of finding and loading configurations.

Most application configurations are composable. This means that they can be declared in separate files and can be referenced one from the other. Configuration key values must be unique so that their values are deterministic during the execution. Organization of files for application configuration is subject to DevOps considerations and best practices such as for minimizing the impact on changes to code or environment.

Application configuration files are local to the application. Configuration is loaded from local sources first before resolving from remote ones. Remote configuration sources can include configuration repositories such as those provided by the operating system, a remote store or others. They can also include services that are injected as dependencies into the application so that configuration can be read from a dedicated store that is not on the same host as the application.
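
A minimal sketch of that resolution order, where the remote store is represented by a hypothetical lookup function:

import json

def remote_lookup(key: str):
    # stand-in for a configuration repository or an injected configuration service
    return {"feature.flag": "on"}.get(key)

def resolve(key: str, local_path: str = "appsettings.json"):
    try:
        with open(local_path) as f:
            local = json.load(f)
    except FileNotFoundError:
        local = {}
    if key in local:              # local configuration wins
        return local[key]
    return remote_lookup(key)     # only then resolve from remote

print(resolve("feature.flag"))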

The process of resolving the configuration keys is dependent on the system but the visibility into its final value at the time of execution can be found from logs. Most applications will log their configurations at least at the time of usage for diagnostics. This helps with determining whether they were intended to be set but it does not provide a way to change them. One way to change configuration keys is to store them in a repository outside the application so that they can be fetched dynamically by a component service of the application that was registered to be used at application initialization time.

The use of dynamic retrieval of configuration key values is a concept that allows the alteration of application behavior without having to restart it. This is helpful to change the behavior on subsequent requests to the application after initialization, but it does not give any indication on the trigger that caused the configuration to change. Audit is one way in which these changes can be effectively found but they happen post-mortem.

Consequently, configuration services often use a publisher-subscriber method for listening to changes so that appropriate handlers can take action. This eliminates custom code or expensive and inefficient polling. Events are pushed through an Event Grid to subscribers. Common app configuration event scenarios include refreshing application configuration, triggering deployment or any configuration-oriented workflow.

There is a difference between polling mechanisms and event subscriptions; each has its advantage, which can be seen by considering the size of the changes made to the configuration store. When the changes are infrequent but the scenario requires immediate responsiveness, an event-based architecture can be especially efficient. The publisher and subscribers exchange just the data that has changed, as soon as the change happens, which enables downstream systems to be reactive and multiple subscribers to receive the notifications at once. They are also relieved from implementing anything more than a message handler. There is also the benefit that the scope of the change is passed down to the subscribers, so they get additional information on exactly what configuration has changed. If the changes become frequent, however, the number of notifications grows large, leading to performance bottlenecks with variations in queue size and delays in the messages.

Instead, when the changes span many keys, it is best to get those changes in bulk. A polling mechanism can get changes in batches over time and then process all of them. It can even find only the updates that were made since the previous poll, which enables incremental updates at the destination. Since a polling mechanism is a loop that perpetually finds changes, if any, and applies them to the destination, it can work in the background even as a single worker. Polling is a read-only operation, so it does not need to fetch the data from the store where the configuration is being actively updated; it can even fetch the data from a mirror of the configuration store. Separating the read-write store from a read-only store helps improve throughput for the clients that update the configuration store. Read-only access is only for querying purposes, and with a store dedicated to this purpose, a suitable technology can be deployed to host the read-only store and assist with queries. It is recommended that both the source and the destination of the configuration store changes be made better suited to their purpose.
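
A minimal sketch of the polling alternative, fetching only the changes made since the previous poll from a read-only mirror (the fetch function and its return shape are hypothetical):

import time
from datetime import datetime, timezone

def fetch_changes_since(watermark: datetime):
    # stand-in for querying a read-only mirror of the configuration store;
    # returns (key, value, changed_at) tuples modified after the watermark
    return []

def poll_loop(apply, interval_seconds: int = 30):
    watermark = datetime.now(timezone.utc)
    while True:  # a single background worker is enough
        for key, value, changed_at in fetch_changes_since(watermark):
            apply(key, value)  # incremental update at the destination
            watermark = max(watermark, changed_at)
        time.sleep(interval_seconds)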

 

Saturday, April 9, 2022

Configuration Automations Part 2.

This article is a continuation of the previous article on Configuration Automations and describes the syntax of the configuration templates.  

The templates can declare variables and logic that expand to actual configurations when substituted. As is common with declarative syntax across frameworks, such as those for user interface resources, infrastructure resources and container orchestration, most templates begin and end with a specific delimiter.

The W3C specifies a template language for XML and XQuery that involves a QName. Template arguments can be described as key-value pairs where the key is a proper key name. These keys must be uniquely identified for their usage.

Templates can also import modules and open data files. Imported modules take a namespace to avoid conflicts with other modules. Data files can be opened as a string or an XML node using a built-in intrinsic, and the content can be assigned to a variable. With the help of this variable, it is possible to browse to an element by walking the hierarchy, checking at each level whether it exists and then finding the value corresponding to the name. A text file can be opened as a sequence of strings as well.

XPath is the language for navigating the XML hierarchy to identify a node. The template uses curly braces as delimiters, so curly braces cannot be used inside the XPath embedded in a template. XQuery is the language for querying the nodes with specified criteria. XQuery embedded in an attribute value of the template calls the built-in implicitly; this returns the node itself if an attribute was queried, or the concatenated text of the node and its descendants, and if the input contains multiple nodes their return values are joined and delimited by white space. If the XQuery is embedded in element text, it keeps its original result: an attribute node becomes an attribute of the surrounding element, while an element node becomes a child of the surrounding element. As an example, consider the following XML content:

<root>
  <item name="A" value="1">X</item>
  <item name="B" value="2">Y</item>
  <item name="C" value="3">Z</item>
</root>

This content can be used with a template specifying an attribute as:

<test result="{root/item/@*}"></test>

and this will result in

"A 1 B 2 C 3"

and an element as:

<test result="{root/item}"></test>

which will result in

<test result="X Y Z"></test>

Templates can also include references to other entities in element texts and attribute values, and there is some difference between this format and XQuery.

One of the often-encountered challenges with template syntax is the use of quotes for enclosing literals. Single quotes and double quotes cannot be used together in a string literal, but the concat operator can be used to build a string from parts that are enclosed in either.

There are extensions available that can take a template in one form and translate it into another. For example, a JSONiq extension is designed to write JSON templates naturally.

 

Friday, April 8, 2022

 

Configuration templates:

Configuration for software applications is essential to control their behavior. It includes secrets, endpoints, connection strings, certificate references and many other settings. When the same software is deployed to different environments, the values change. Multiple configurations specific to environments begin to form and teams start leaning towards a configuration store. As with all stores, a suitable service can automate the computations associated with the store and make the results available over the web, so a configuration service forms.

The variations between environments in configurations become easier to manage only when the repetitions are moved to a single template. This form of configuration splits the multiple copies of configurations into a template and a set of values files. There can be many values files, but they will be small and easier to read. The idea of using templates to generate configmaps for deployment is widespread throughout the industry on both Windows and Linux environments and on the hosting infrastructures that leverage these operating systems.

The process of adding a new configuration to a deployed application sometimes involves several manual steps. After the configuration value is set, it must be deployed for consumption in every cluster, environment, and machine function. This involves running a command using a client for every cluster, environment, and machine function where the target application is deployed. The client requires a configuration template and configuration value file as inputs.

As mentioned, the configuration values file needs to be unique for every cluster, environment, and machine function combination. One way to automate this is to generate the configuration values file so that the human involvement in editing it is removed and there is little chance of a typo. Tools are often used for this purpose.

While this solves a set of problems, it does not eliminate the need to run the tool many times to create configuration values for each distinct deployment. Consequently, some form of batching also makes its way into the automations. The batch command merely repeats the invocations once for each deployment so that the human involvement of running them again and again is avoided. The output of the batch command is also easy to read, with the set pattern of invocations differing only in the parameter for the generated configuration values file.
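
A minimal sketch of such a batch, rendering one generated values file per cluster, environment and machine-function combination and printing the client invocation for each (the configclient tool and file names are hypothetical):

import itertools
import json
from string import Template

TEMPLATE = Template('{"db_endpoint": "$db_endpoint", "log_level": "$log_level"}')

clusters = ["cluster01", "cluster02"]
environments = ["test", "prod"]
machine_functions = ["frontend", "worker"]

for cluster, env, func in itertools.product(clusters, environments, machine_functions):
    values = {"db_endpoint": f"{env}-db.{cluster}.internal", "log_level": "info"}
    out = f"values-{cluster}-{env}-{func}.json"
    with open(out, "w") as f:
        json.dump(json.loads(TEMPLATE.substitute(values)), f, indent=2)
    # the set pattern of invocations differs only in the generated values file
    print(f"configclient deploy --template config.template --values {out}")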

This covers the deployment-related automations for configuration, but it does not address the syntax and semantics associated with the templates. While XPath for XML and similar facilities for JSON are popular ways to store and browse data, the template syntax can be used just like a programming language, with repetitions expressed using loops and counters. Some templates provide a veritable library of built-in intrinsics and functions, which makes them more succinct.