Thursday, June 30, 2022

This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy here. This article discusses the architectural approaches for messaging in multitenant solutions.

Messaging services, unlike storage services, are more homogeneous in their functionality: most messaging systems offer similar capabilities, transport protocols, and usage scenarios, and most modern messaging involves asynchronous communication. Multitenant solutions introduce sharing, which brings a higher density of tenants to the infrastructure and reduces operational cost and management overhead. Unlike compute or storage, the isolation model can be as granular as individual messages and events. Based on the information that is published and on how data is consumed and processed by applications, we can distinguish between events and messages: services that deliver events include Azure Event Grid and Azure Event Hubs, while services that send messages include Azure Service Bus.

When these messaging resources are shared, the isolation model, the impact on scaling and performance, state management, and the security of the messaging resources all become more complex. These key decisions for planning a multitenant messaging solution are discussed below.

Scaling resources helps meet the changing demand from a growing number of tenants and an increase in the amount of traffic. We might need to increase the capacity of the resources to maintain an acceptable performance level. For example, if a single messaging topic or queue is provisioned for all tenants and the traffic exceeds the permitted number of messaging operations per second, the Azure messaging service will reject the application's requests and all tenants will be impacted. Scaling depends on the number of producers and consumers, the payload size, the partition count, the egress request rate, and the usage of Event Hubs Capture, Schema Registry, and other advanced features. When additional messaging capacity is provisioned or the rate limit is adjusted, the multitenant solution can retry requests to overcome transient failures. When the number of active users drops or traffic decreases, the messaging resources can be released to reduce costs.

When resources are dedicated to a tenant, they can be independently scaled to meet that tenant's demands. This is the simplest approach, but it requires a minimum number of resources per tenant. Shared scaling of resources in the platform means all tenants are affected, and they all suffer when the scale is insufficient to handle their combined load. If a single tenant uses a disproportionate amount of the resources available in the system, it leads to the well-known noisy neighbor antipattern. When resource usage rises above the total capacity under the peak load of the tenants involved, failures occur that are not specific to any one tenant and degrade performance for all of them. The total capacity can also be exceeded when individual usage is small but the number of tenants increases dramatically. Performance problems often remain undetected until an application is under load; a load testing preview can help analyze the behavior of the application under stress, and scaling horizontally or vertically helps correct the correlated behavior.
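As a minimal illustration of the retry behavior described above, the following sketch wraps a send operation in exponential backoff with jitter. The ThrottledError exception and the send_fn callable are hypothetical placeholders for whatever throttling signal and send call the chosen messaging SDK actually exposes:

import random
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for the throttling error raised by the messaging SDK."""

def send_with_backoff(send_fn, message, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry transient throttling failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return send_fn(message)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads retries across tenants

Jitter matters in a multitenant solution because otherwise many tenants hitting the same limit tend to retry at the same moment and collide again.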

Data isolation depends on the scope of isolation. When Service Bus is used, for instance, separate topics or queues can be deployed for each tenant while topic subscriptions are shared between tenants. Another option is to use some level of sharing for queues and topics and to create more instances once utilization exceeds tolerable limits. Finally, messaging resources can be provisioned within a single Azure subscription or separated into one subscription per tenant.
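To make the sharing choice concrete, the sketch below resolves a tenant to its own queue and spills over to a new namespace once a per-namespace limit is reached; the naming convention and the limit are illustrative assumptions rather than actual Service Bus quotas:

MAX_QUEUES_PER_NAMESPACE = 1000  # assumed threshold; check the quota for the chosen tier

def resolve_queue(tenant_id: str, tenant_index: int) -> tuple:
    """Return (namespace, queue) for a tenant: one queue per tenant,
    spilling over into a new namespace when the assumed quota is reached."""
    namespace_index = tenant_index // MAX_QUEUES_PER_NAMESPACE
    namespace = f"sb-mtsolution-{namespace_index:03d}"  # hypothetical naming convention
    queue = f"tenant-{tenant_id}-requests"
    return namespace, queue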

Varying the level and scope of sharing of messaging resources demands simplicity in the architecture of the multitenant application, so that data can be stored and accessed without specialized expertise. A particular concern for a multitenant solution is the level of customization to be supported.

Patterns such as the deployment stamp pattern, the messaging resource consolidation pattern, and the dedicated messaging resources per tenant pattern help optimize operational cost and management with little or no impact on usage.

 


Wednesday, June 29, 2022

This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy here. This article discusses the architectural approaches for storage in multitenant solutions.

Storage services involve a variety of storage resources such as commodity disks, local storage, remote network shares, blobs, tables, queues, database resources, and specialized resources like cold and archive tiers. Multitenant solutions introduce sharing, which brings a higher density of tenants to the infrastructure and reduces operational cost and management overhead. Unlike compute, data can leak, egress, or remain vulnerable both in transit and at rest, so the isolation model is even more important.

When these storage resources are shared, the isolation model, the impact on scaling and performance, state management, and the security of the storage resources all become more complex. These key decisions for planning a multitenant storage solution are discussed below.

Scaling of resources helps meet the changing demand from a growing number of tenants and an increase in the amount of traffic. We might need to increase the capacity of the resources to maintain an acceptable performance level. For example, if a single storage account is provisioned for all tenants and the traffic exceeds the permitted number of storage operations per second, Azure Storage will reject the application's requests and all tenants will be impacted. When additional storage is provisioned or the rate limit is adjusted, the multitenant solution can retry requests to overcome transient failures. When the number of active users drops or traffic decreases, the storage resources can be released to reduce costs.

When resources are dedicated to a tenant, they can be independently scaled to meet that tenant's demands. This is the simplest approach, but it requires a minimum number of resources per tenant. Shared scaling of resources in the platform means all tenants are affected, and they all suffer when the scale is insufficient to handle their combined load. If a single tenant uses a disproportionate amount of the resources available in the system, it leads to the well-known noisy neighbor antipattern. When resource usage rises above the total capacity under the peak load of the tenants involved, failures occur that are not specific to any one tenant and degrade performance for all of them. The total capacity can also be exceeded when individual usage is small but the number of tenants increases dramatically. Performance problems often remain undetected until an application is under load; a load testing preview can help analyze the behavior of the application under stress, and scaling horizontally or vertically helps correct the correlated behavior.
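One way to keep a single tenant from consuming a disproportionate share of a shared storage account is to throttle per tenant in the application tier before requests reach the storage service. The sketch below is a minimal token-bucket limiter; the per-tenant rate and burst capacity are assumed values:

import time
from collections import defaultdict

class TokenBucket:
    """Per-tenant token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Assumed budget of 100 storage operations per second per tenant, with bursts up to 200.
buckets = defaultdict(lambda: TokenBucket(rate=100, capacity=200))

def try_storage_operation(tenant_id: str) -> bool:
    return buckets[tenant_id].allow()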

Data isolation depends on the data storage provider. When Cosmos DB is used, for instance, separate containers can be deployed for each tenant while databases and accounts are shared between tenants. When Azure Storage is used, either the container or the account can be separated per tenant. When a shared storage management system such as a relational store is used, separate tables or even separate databases can be used for each tenant. Finally, storage resources can be provisioned within a single subscription or separated into one subscription per tenant.
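The Cosmos DB options above can be sketched with the azure-cosmos Python SDK; the account URL, key, and all names are placeholders, and the helper names should be verified against the SDK version in use:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("tenantdata")

# Option 1: a dedicated container per tenant gives stronger isolation.
per_tenant = database.create_container_if_not_exists(
    id="tenant-contoso", partition_key=PartitionKey(path="/id"))

# Option 2: a shared container partitioned by tenant id gives higher density.
shared = database.create_container_if_not_exists(
    id="shared-tenants", partition_key=PartitionKey(path="/tenantId"))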

Varying the level and scope of sharing of storage resources demands simplicity in the architecture of the multitenant application, so that data can be stored and accessed without specialized expertise. A particular concern for a multitenant solution is the level of customization to be supported.

Patterns such as the deployment stamp pattern, the storage resource consolidation pattern, and the dedicated storage resources per tenant pattern help optimize operational cost and management with little or no impact on usage.


Tuesday, June 28, 2022

This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy here. This article discusses the architectural approaches for compute in multitenant solutions.

Compute services involve a variety of compute resources such as commodity virtual machines, containers, queue processors, PaaS resources, and specialized resources like GPUs and high-performance compute. Multitenant solutions introduce sharing, which brings a higher density of tenants to the infrastructure and reduces operational cost and management overhead.

When these compute resources are shared, the isolation model, the impact on scaling and performance, state management, and the security of the compute resources must all be considered. These key decisions for planning a multitenant compute solution are discussed below.

Scaling of resources helps meet the changing demand from a growing number of tenants and an increase in the amount of traffic. We might need to increase the capacity of the resources to maintain an acceptable performance level, and when the number of active users drops or traffic decreases, the compute resources can be released to reduce costs. When resources are dedicated to a tenant, they can be independently scaled to meet that tenant's demands. This is the simplest approach, but it requires a minimum number of resources per tenant. Shared scaling of resources in the platform means all tenants are affected, and they all suffer when the scale is insufficient to handle their combined load. If a single tenant uses a disproportionate amount of the resources available in the system, it leads to the well-known noisy neighbor antipattern. When resource usage rises above the total capacity under the peak load of the tenants involved, failures occur that are not specific to any one tenant and degrade performance for all of them. The total capacity can also be exceeded when individual usage is small but the number of tenants increases dramatically. Performance problems often remain undetected until an application is under load; a load testing preview can help analyze the behavior of the application under stress, and scaling horizontally or vertically helps correct the correlated behavior.
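Because these problems surface only under load, even a simple local load generator helps characterize behavior before turning to a managed load-testing service. A minimal sketch, where call_endpoint is a hypothetical request made against the multitenant solution:

import time
from concurrent.futures import ThreadPoolExecutor

def call_endpoint(tenant_id: str) -> float:
    """Hypothetical request on behalf of one tenant; returns the observed latency in seconds."""
    start = time.monotonic()
    # ... issue the real request for this tenant here ...
    return time.monotonic() - start

def run_load(tenants, requests_per_tenant=100, workers=50):
    """Fan out requests across tenants concurrently and report latency percentiles."""
    work = [t for t in tenants for _ in range(requests_per_tenant)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(call_endpoint, work))
    return {"p50": latencies[len(latencies) // 2],
            "p99": latencies[int(len(latencies) * 0.99)]}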

The triggers that cause the components to scale must be carefully planned. When scaling depends on the number of tenants, it is best to trigger the next scale-out only after a full batch of tenants has been adequately served by the available resources. Many compute services provide autoscaling and require us to specify minimum and maximum levels of scale. Azure App Service, Azure Functions, Azure Container Apps, Azure Kubernetes Service, and virtual machine scale sets can automatically increase or decrease the number of instances that run the application.
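A sketch of the batched trigger described above: the desired instance count changes only after a full batch of tenants has been onboarded, and the batch size and instance bounds are assumptions to be tuned per workload:

import math

def desired_instance_count(tenant_count: int, tenants_per_instance: int = 20,
                           min_instances: int = 2, max_instances: int = 50) -> int:
    """Scale out one instance per full batch of tenants, clamped to the autoscale bounds."""
    needed = math.ceil(tenant_count / tenants_per_instance)
    return max(min_instances, min(max_instances, needed))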

State-based considerations arise when data must be persisted between requests. From a scalability perspective, stateless components are often easy to scale out in terms of workers, instances, or nodes, and they can provide a warm start to process requests immediately. Tenants can also be moved between resources where this is permitted. Stateful resources depend on persisted state; such state is generally kept out of the compute layer and stored in storage services instead. Transient state can be stored in caches or local temporary files.
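A minimal sketch of keeping transient, per-tenant state outside the compute layer so instances remain stateless; it uses redis-py, and the key convention and time-to-live are assumptions:

import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def save_session(tenant_id: str, session_id: str, state: dict, ttl_seconds: int = 900):
    # A per-tenant key prefix keeps each tenant's transient state logically separated.
    cache.setex(f"{tenant_id}:session:{session_id}", ttl_seconds, json.dumps(state))

def load_session(tenant_id: str, session_id: str):
    raw = cache.get(f"{tenant_id}:session:{session_id}")
    return json.loads(raw) if raw else None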

The patterns described above, such as the deployment stamp pattern, the compute resource consolidation pattern, and the dedicated compute resources per tenant pattern, help optimize operational cost and management with little or no impact on usage.


Monday, June 27, 2022

This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy here. This article discusses resource organization for multitenant resources.

Resource organization helps a multitenant solution with tenant isolation and scale. There are specific tradeoffs to consider for multitenant isolation and for scaling out across multiple resources. Azure's resource limits and quotas, and how to scale the solution beyond these limits, are discussed below.

When there is an automated deployment process and a need to scale across resources, the way new resources are deployed and tenants are assigned to them must be decided. As we approach the number of tenants that can be assigned to a specific resource, we must detect that threshold. When we plan to deploy new resources, it must be decided whether they will be provisioned just in time or ahead of time.

Assumptions made in code and configuration can limit the ability to scale. For example, there might be a need to scale out to multiple storage accounts, but the application tier might assume a single storage account for all tenants.
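One way to avoid baking that assumption into the application tier is to route every storage access through a small lookup layer. The catalog below is an in-memory stand-in for a tenant catalog database, and the account and container names are illustrative:

# Hypothetical tenant catalog; in practice this would live in a catalog database.
TENANT_CATALOG = {
    "contoso": {"storage_account": "stmtsol001", "container": "contoso"},
    "fabrikam": {"storage_account": "stmtsol002", "container": "fabrikam"},
}

def blob_container_url(tenant_id: str) -> str:
    """Resolve the tenant's container URL instead of assuming one shared account."""
    entry = TENANT_CATALOG[tenant_id]
    return f"https://{entry['storage_account']}.blob.core.windows.net/{entry['container']}"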

Azure resources are deployed and managed through a hierarchy: most resources are deployed into resource groups, which are contained in subscriptions, and the whole hierarchy belongs to an Azure Active Directory tenant. When we deploy the resources, we have the option to isolate them at different levels, and different models can be used for different components of the same solution.

Resources that are shared across tenants can still achieve a degree of isolation even when a single instance serves all of the tenants' workloads. When we run a single instance of a resource, the service limits, subscription limits, and quotas apply; when these limits are reached, the shared resources must be scaled out.

Isolation within a shared resource requires the application code to be fully aware of multitenancy and to restrict the data it accesses to a specific tenant. An alternative is to separate resources into resource groups, which help manage the lifecycle of resources: all the resources in a resource group can be deleted at once by deleting the group. A naming convention and strategy, resource tags, or a tenant catalog database is required in this case.

Resource groups can also be separated into subscriptions. Putting resource groups into a shared subscription makes it easy to configure policies and access control, but there is a limit to the maximum number of resource groups in a subscription, so tenants must spill over into a new subscription when that limit is exceeded. Separate subscriptions achieve complete isolation of tenant-specific resources and can be created programmatically, and Azure reservations can be used across subscriptions. The only difficulty is requesting quota increases when there is a large number of subscriptions: the Quota API helps for some resource types, while other quota increases must be requested by initiating a support case. Tenant-specific subscriptions can be placed into a management group hierarchy, which enables easy management of access control rules and policies.
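A sketch of the naming convention and tag-based catalog entry mentioned above; the resource group name format and tag keys are illustrative choices, not Azure requirements:

from dataclasses import dataclass, field

@dataclass
class TenantDeployment:
    tenant_id: str
    subscription_id: str
    resource_group: str
    tags: dict = field(default_factory=dict)

def catalog_entry(tenant_id: str, subscription_id: str, environment: str = "prod") -> TenantDeployment:
    """Build the catalog record used later to locate a tenant's resources."""
    return TenantDeployment(
        tenant_id=tenant_id,
        subscription_id=subscription_id,
        resource_group=f"rg-{environment}-tenant-{tenant_id}",      # naming convention
        tags={"tenant-id": tenant_id, "environment": environment},  # resource tags for discovery
    )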

Reference: Multitenancy: https://1drv.ms/w/s!Ashlm-Nw-wnWhLMfc6pdJbQZ6XiPWA?e=fBoKcN     

Sunday, June 26, 2022

Exception StackTrace associations for root cause analysis    

Problem statement: Given a method to collect root causes from many data points in error logs, can associations between those root causes be determined?

Solution: There are two stages to solving this problem:   

Stage 1 – discover root cause and create a summary to capture it   

Stage 2 – use an association data mining algorithm on root causes.

Stage 1:  

The first stage involves a data pipeline that converts log entries into exception stack traces and hashes them into buckets (a sample is linked at the end of this article). When the exception stack traces have been collected from a batch of log entries, we can transform them into a vector representation, using the notable stack frames as features, and then generate a hidden weighted matrix for the neural network.
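A minimal sketch of the bucketing step: keep the notable frames of each stack trace and hash them so that recurring root causes land in the same bucket. The rule for what counts as a notable frame (here, frames from an assumed application namespace) is an illustrative choice:

import hashlib

def notable_frames(stacktrace: str, namespace_prefix: str = "com.mycompany") -> list:
    """Keep only frames from our own namespace; framework and library frames add noise."""
    return [line.strip() for line in stacktrace.splitlines()
            if line.strip().startswith("at " + namespace_prefix)]

def stack_hash(stacktrace: str) -> str:
    """Stable bucket id for a stack trace, derived from its notable frames."""
    digest = hashlib.sha256("\n".join(notable_frames(stacktrace)).encode("utf-8"))
    return digest.hexdigest()[:16]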

We use that hidden layer to determine salience using the gradient descent method. All values fall within the [0, 1] co-occurrence probability range.

The solution to the quadratic form representing the embeddings is found by arriving at the minimum, which satisfies Ax = b, using the conjugate gradient method. We are given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1.

   

This method proceeds as follows:

set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to the dot product of residual-transposed and residual
initialize delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0, do:
    q = dot-product(A, search-direction)
    alpha = delta-new / dot-product(search-direction-transposed, q)
    x = x + alpha * search-direction
    if i is divisible by 50:
        residual = b - Ax
    else:
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1
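The steps above map directly onto a small numerical routine. The following is a minimal sketch in Python with numpy, assuming A is a symmetric positive-definite matrix (as conjugate gradient requires) and b and x are one-dimensional arrays:

import numpy as np

def conjugate_gradient(A, b, x, i_max=1000, eps=1e-8):
    """Solve Ax = b for symmetric positive-definite A, following the steps above."""
    i = 0
    r = b - A @ x                    # residual
    d = r.copy()                     # search direction
    delta_new = r @ r
    delta_0 = delta_new
    while i < i_max and delta_new > eps ** 2 * delta_0:
        q = A @ d
        alpha = delta_new / (d @ q)
        x = x + alpha * d
        if i % 50 == 0:
            r = b - A @ x            # periodic recomputation limits floating-point drift
        else:
            r = r - alpha * q
        delta_old = delta_new
        delta_new = r @ r
        beta = delta_new / delta_old
        d = r + beta * d
        i += 1
    return x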

   

Root cause capture – Exception stack traces that are captured from various sources and appear in the logs can be stack-hashed. A root cause can be described by a specific stack trace, its associated point in time, the duration over which it appears, and the time a fix was introduced, if known.

   

Stage 2 

Association data mining determines whether two root causes occur together. The computation involves two computed columns, Support and Probability. Support defines the percentage of cases in which a rule must occur before it is considered valid; here, a rule must be found in at least 1 percent of cases.

Probability defines how likely an association must be before it is considered valid. We will consider any association with a probability of at least 10 percent.  

Bayesian conditional probability and confidence can also be used. Associations are expressed as association rules, each formed from a pair of antecedent and consequent item-sets, so named because we want to measure the value of taking one item together with another. Let I be a set of items and T a set of transactions. An association S1 is then a subset of I that occurs together in T, and Support(S1) is the fraction of transactions in T containing S1. Let S1 and S2 be subsets of I; the association rule S1 -> S2 has support(S1 -> S2) = Support(S1 union S2) and confidence(S1 -> S2) = Support(S1 union S2) / Support(S1). A third metric, Lift(S1 -> S2) = Confidence(S1 -> S2) / Support(S2), is often preferred: a popular S1 yields high confidence for almost any S2, and lift corrects for this by exceeding 1.0 only when S1 and S2 occur together more often than expected if they were independent.
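These metrics can be computed directly when each transaction is taken to be the set of root-cause buckets observed together in one incident window; a minimal sketch:

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, s1, s2):
    return support(transactions, set(s1) | set(s2)) / support(transactions, s1)

def lift(transactions, s1, s2):
    return confidence(transactions, s1, s2) / support(transactions, s2)

# Example: four incident windows and the root-cause buckets seen in each.
transactions = [{"rc1", "rc2"}, {"rc1", "rc2", "rc3"}, {"rc2"}, {"rc1"}]
print(lift(transactions, {"rc1"}, {"rc2"}))  # below 1.0 here: rc1 and rc2 co-occur less than expected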

Certain databases allow the creation of association models that can be persisted and evaluated against each incoming request. Usually, a 70/30 training/testing data split is used in this regard.
Sample: https://jsfiddle.net/g2snw4da/