Thursday, November 18, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote based on the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the extraneous fetching antipattern. This one focuses on the Chatty I/O antipattern.

When I/O requests are frequent and numerous, they can have a significant impact on performance and responsiveness. Network calls and other I/O operations are much slower than compute tasks. Each I/O request carries significant overhead as it travels up and down the networking stack on both the local and the remote machine, including the round-trip time, and the cumulative effect of numerous I/O operations can slow the system down. Some common causes of chatty I/O include:

Reading and writing individual records to a database as distinct requests – when records are fetched one at a time, a series of queries runs one after another to gather the information. The problem is exacerbated when an Object-Relational Mapper hides the behavior beneath the business logic and each entity is retrieved over several queries. The same can happen when an entity is written. (The sketch after this list illustrates these shapes.)

Implementing a single logical operation as a series of HTTP requests – this occurs when objects residing on a remote server are represented as a proxy in the memory of the local system. The code reads as if an object is modified locally, when in fact every modification costs at least one round trip. When there are many network round trips, the cost is cumulative and can become prohibitive. It is easily observed when a proxy object has many properties and each property get or set requires a relay to the remote object; in such cases there may also be a requirement to perform validation after every access.

Reading and writing to a file on disk – file I/O also hides the distributed nature of interconnected file systems. Every byte written to a file on a mount must be relayed to the original on the remote server. When there are many writes, the cost accumulates quickly, and it is even more noticeable when the writes are frequent and only a few bytes each.
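A minimal sketch of how these chatty shapes tend to look in code, assuming a hypothetical orders database, a hypothetical customers endpoint, and a local log file; every name here is illustrative rather than taken from a real system:

import sqlite3
import requests

conn = sqlite3.connect("orders.db")            # hypothetical database
BASE = "https://api.example.com/customers/42"  # hypothetical remote resource

# 1. Record-at-a-time reads: one query per line item (the N+1 shape an ORM can hide).
def load_order_lines_chatty(order_id):
    line_ids = [row[0] for row in conn.execute(
        "SELECT line_id FROM order_lines WHERE order_id = ?", (order_id,))]
    return [conn.execute(                       # one extra round trip per row
        "SELECT product, quantity, price FROM order_lines WHERE line_id = ?",
        (line_id,)).fetchone() for line_id in line_ids]

# 2. One HTTP request per property of a remote "proxy" object.
def load_customer_chatty():
    return {
        "name":    requests.get(BASE + "/name").json(),     # round trip 1
        "address": requests.get(BASE + "/address").json(),  # round trip 2
        "orders":  requests.get(BASE + "/orders").json(),   # round trip 3
    }

# 3. Many tiny writes, reopening the file for every record.
def log_chatty(records):
    for record in records:
        with open("audit.log", "a") as f:       # open and close per record
            f.write(record + "\n")              # a few bytes per I/O call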

There are several ways to fix the problem, and they come down to detection and remedy. When I/O requests are numerous, they can be batched into coarser requests. The database can be read with one query substituting for many, which also gives the database an opportunity to plan and execute the work more efficiently. Web APIs can be designed along REST best practices: instead of a separate GET method for each property, a single GET method can return the resource representing the whole object, so that even a large response body arrives in a single request. File I/O can be improved with buffering and caching, so that files need not be opened and closed repeatedly; this also helps reduce fragmentation of the file on disk. These remedies are sketched below.
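The same operations rewritten in a coarser-grained form, under the same hypothetical names as the previous sketch:

import sqlite3
import requests

conn = sqlite3.connect("orders.db")            # hypothetical database
BASE = "https://api.example.com/customers/42"  # hypothetical remote resource

# 1. One query returns every line item in a single round trip.
def load_order_lines_batched(order_id):
    return conn.execute(
        "SELECT product, quantity, price FROM order_lines WHERE order_id = ?",
        (order_id,)).fetchall()

# 2. A single GET for the whole resource representation.
def load_customer_batched():
    return requests.get(BASE).json()           # name, address and orders in one response

# 3. Open the file once and let the buffered writer coalesce the small writes.
def log_buffered(records):
    with open("audit.log", "a", buffering=64 * 1024) as f:
        for record in records:
            f.write(record + "\n")              # accumulated in memory, flushed in larger chunks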

When more information is retrieved with fewer I/O calls, there is a risk of falling into the extraneous fetching antipattern; the right tradeoff depends on the usage. It is important to read only as much as necessary, balancing the size of the calls against their frequency. Sometimes data can also be partitioned into two chunks: frequently accessed data that accounts for most requests, and less frequently accessed data that is used rarely. When data is written, resources should not be locked at too large a scope or for too long a duration.

 

 

Wednesday, November 17, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote based on the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the cloud readiness antipatterns. This one talks about design principles and advanced operations.

A management baseline provides a minimum level of business commitment for all supported workloads. It includes a standard business commitment to minimize business interruptions and accelerate recovery if service is interrupted. It usually includes inventory and visibility, operational compliance, and protection and recovery, all of which provide streamlined operational management. It is not sufficient for mission-critical workloads, but it covers roughly 80% of the less critical workloads.

There are a few ways to go beyond the management baseline, which include the enhanced baseline, platform specialization, and workload specialization.

The enhanced management baseline uses cloud-native tools to improve uptime and decrease recovery times. It significantly reduces cost and implementation time.

Management specializations are aspects of workload and platform operations that require changes to design and architecture principles; these changes can take time and result in increased operating expenses. The enhanced management baseline applies broadly to many workloads, while specialization applies to specific cases. There are two areas of specialization: 1) platform specialization and 2) workload specialization. The former resolves key pain points in the platform and spreads the investment across multiple workloads; the latter covers the ongoing operations of a specific mission-critical workload.

In addition to these management baselines, there are a few steps that apply to each specialization process. These include improved system design, automated remediation, scaled solutions, and continuous improvement. Improved system design is the most effective of these approaches, and it applies universally to the operations of any platform. It increases stability and decreases the impact of changes in business operations. Both the Cloud Adoption Framework and the Azure Well-Architected Framework provide guiding tenets for improving the quality of a platform or a specific workload through the five pillars of architecture excellence: cost optimization, operational excellence, performance efficiency, reliability, and security.

Business interruptions caused by technical debt cannot always be resolved through improved system design; in those cases automated remediation is the alternative. Azure Automation and Azure Monitor can detect trends and apply automated remediation, which is the most common approach. Similarly, a service catalog can list applications that can be deployed for internal consumption; a platform can then maximize adoption and minimize maintenance overhead through the service catalog.

 

 

Tuesday, November 16, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote based on the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the cloud readiness antipatterns. This article focuses on the no-caching antipattern.

A no-caching antipattern occurs when a cloud application that handles many concurrent requests fetches the same data over and over. Since there is contention for data access, this can reduce performance and scalability. When the data is not cached, the shortcomings show up in several ways.

First, fetching the data can traverse several layers and go deep into the stack, consuming significant resources and increasing costs in terms of I/O overhead and latency, while the same objects or data structures are repeatedly constructed.

Second, it makes excessive calls to a remote service that has a service quota and throttles clients past a certain limit.

Both these can lead to degradation in response times, increased contention, and poor scalability.

Examples of the no-caching antipattern are easy to spot. Entity Framework calls that repeatedly fetch the same read-only data fit this antipattern. The use of a cache might simply have been overlooked, but more often the cache could not be included in the design because of some unknowns: the benefits and drawbacks of using a cache were not clear, or there were concerns about the accuracy and freshness of the cached data.

Other times, the cache was left out because the application was migrated from on-premises, where network latency and response times were controlled and the system might have been running on expensive high-performance hardware, unlike commodity cloud virtual machine scale sets.

Rarely, it might even be the case that caching was left out of the architecture design on the assumption that operations would add it later via standalone products, and this was never clearly communicated. Other times, the introduction of a cache might increase latency, maintenance, and ownership costs, and decrease overall availability. It might also interfere with existing caching strategies and expiration policies of the underlying systems. Some might prefer not to put an external cache in front of a database and to use one only as a sidecar for the web services. It is true that databases can cache even materialized views for a connection, but a cache lookup is cheap in nearly all cases, whereas the compute in the deeper systems can be costly and is worth avoiding.

There are two strategies to fix the problem. The first is the on-demand, or cache-aside, strategy: the application tries to read the data from the cache and, if it is not there, retrieves it from the data source and adds it to the cache. When the application writes a change, it writes directly to the data source and removes the stale value from the cache; the cached value is refilled the next time it is required.
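A minimal cache-aside sketch, assuming Redis (for example, Azure Cache for Redis) as the cache and hypothetical query_database and save_to_database helpers standing in for the authoritative data source:

import json
import redis

cache = redis.Redis(host="my-cache.redis.cache.windows.net", port=6380, ssl=True)  # assumed cache endpoint
CACHE_TTL_SECONDS = 300

def query_database(customer_id):
    raise NotImplementedError       # placeholder for the real data-source read

def save_to_database(customer_id, value):
    raise NotImplementedError       # placeholder for the real data-source write

def get_customer(customer_id):
    key = f"customer:{customer_id}"
    cached = cache.get(key)
    if cached is not None:                                   # cache hit
        return json.loads(cached)
    value = query_database(customer_id)                      # cache miss: go to the source
    cache.set(key, json.dumps(value), ex=CACHE_TTL_SECONDS)  # refill with a TTL
    return value

def update_customer(customer_id, value):
    save_to_database(customer_id, value)                     # write goes to the source
    cache.delete(f"customer:{customer_id}")                  # invalidate; refilled on the next read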

Another strategy is to always keep static resources in the cache with no expiration date. This is roughly equivalent to using a CDN, although CDNs are intended for content distribution. Applications that cache dynamic data should be designed to support eventual consistency.
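Reusing the assumed Redis client from the previous sketch, static reference data can be written once with no expiration, while dynamic data keeps a short TTL so that consumers tolerate eventual consistency:

# Static reference data: no ex= argument, so the entries never expire.
def seed_static_data(countries, currencies):
    cache.set("ref:countries", json.dumps(countries))
    cache.set("ref:currencies", json.dumps(currencies))

# Dynamic data: a short TTL bounds how stale a cached copy can get.
def cache_dynamic(key, value, ttl_seconds=60):
    cache.set(key, json.dumps(value), ex=ttl_seconds)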

No matter how the cache is implemented, the application must support falling back to the underlying data store when the data is not available in the cache, and a Circuit Breaker or throttling pattern around that fallback keeps it from overwhelming the data source.
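Whatever the cache technology, reads should also degrade gracefully when the cache itself is unreachable. A sketch reusing the assumed client and placeholders above, where the data-source call is the place to apply throttling or a circuit breaker:

def get_customer_with_fallback(customer_id):
    key = f"customer:{customer_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        pass                                     # cache unavailable: fall through to the source
    value = query_database(customer_id)          # guard this call so the source is not overwhelmed
    try:
        cache.set(key, json.dumps(value), ex=CACHE_TTL_SECONDS)
    except redis.RedisError:
        pass                                     # best effort: returning data matters more than caching it
    return value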

 

 

Sunday, November 14, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote based on the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the cloud readiness antipatterns. This article describes ways to manage the antipatterns.

Antipatterns are experienced when planning a cloud adoption. They can be avoided using tools and automations.

One of the antipatterns is about tooling itself. Modern IT tools support many automations that relieve employees of tedious tasks, but the most important aspect of tooling is the business outcome it enables. Focusing on the tooling rather than the business outcome is an easy trap and a common antipattern. One way to overcome it is to measure the usefulness or impact of the tool: a new or modernized toolchain does not automatically provide faster delivery or better business outcomes.

Platforms are yet another case where adoption does not always improve performance. A platform brings many desirable advantages, including conformance, consistency, maintenance, simplicity, automation, and the hiding of differences between the things it manages. A CI/CD pipeline can serve as a platform for standardized processes and governance, bringing tremendous value to different business units and allowing them to deploy features faster. But while platforms improve the speed of certain processes, the overall execution time may still be hampered by approvals or release criteria; the platform cannot guarantee that things will work any better or faster than they did under the previous circumstances. This antipattern can also be avoided by measuring the usefulness or impact of the platform.

One way of measuring usefulness or impact is to define SMART objectives, which are specific, measurable, achievable, relevant, and time-bound goals. With goals written this way, the commitments are clear, progress can be shown, and teams can be held accountable for the deliverables. The caveat is that improper metrics should not be used to justify business impact or usefulness; faster deployment alone is not an indicator of success, although it contributes to the overall impact.

Development team empowerment is a specific goal that tremendously improves business outcomes. It is also well studied, and there are structured approaches for organizations to follow. Revenue growth, operating margin, and innovation can all be improved by increasing developer velocity.

 

Saturday, November 13, 2021

 

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

·        Resources can be locked to prevent unexpected changes. A subscription, resource group, or resource can be locked to prevent other users from accidentally deleting or modifying critical resources. The lock overrides any permissions the users may have. The lock level can be set to CannotDelete or ReadOnly, with ReadOnly being the more restrictive. Locks can be inherited: when a lock is applied at a parent scope, all resources within that scope inherit the same lock. Some considerations still apply after locking. For example, a CannotDelete lock on a storage account does not prevent data within that account from being deleted, and a ReadOnly lock on an application gateway prevents you from getting the backend health of the gateway because that operation uses POST. Only members of the Owner and User Access Administrator roles are granted access to the Microsoft.Authorization/locks/* actions. (Sketches of this and some of the other operations in this list appear after the list.)

·        Blob rehydration from the archive tier can target either the hot or the cool tier. There are two options for rehydrating a blob that is stored in the archive tier: copy the archived blob to an online tier using the blob reference or its URL, or change the blob's access tier to an online tier, which rehydrates the archived blob to hot or cool by changing its tier. Rehydration might take several hours, but several blobs can be rehydrated concurrently, and a rehydration priority can also be set.

·        Virtual network peering allows us to connect virtual networks in the same region or, with Global VNet peering, across regions through the Azure backbone network. When the peering is set up, traffic to the remote virtual network, traffic forwarded from the remote virtual network, and traffic through a virtual network gateway or Route Server can each be allowed, and traffic between the peered virtual networks is allowed by default.

·        Transaction processing in Azure is not on by default. A transaction locks and logs records so that others cannot use them; it can be bound to partitions or enabled as a distributed transaction with a two-phase commit protocol. Transaction processing requires two communication steps for each resource manager plus a response to the transaction coordinator, which is costly inside an Azure datacenter. It does not scale as the number of resources grows: 2 resources take 4 network calls, 4 resources 16 calls, 100 resources 400 calls. Besides, the datacenter contains thousands of machines, failures are expected, and the system must deal with network partitions; waiting for responses from all resource managers carries a costly communication overhead.

·        Diagnostic settings can be authored to send platform logs and metrics to different destinations. Logs include the Azure Activity log and resource logs. Platform metrics are collected by default and stored in the Azure Monitor metrics database. Each Azure resource requires its own diagnostic setting, and a single setting can define no more than one of each type of destination. The available log categories vary for different resource types. The destinations for the logs can include a Log Analytics workspace, Event Hubs, and Azure Storage. Metrics are sent automatically to Azure Monitor Metrics; optionally, a setting can also send metrics to Azure Monitor Logs for analysis alongside other monitoring data using log queries. Multi-dimensional metrics (MDM) are not supported; they must be flattened.
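A few hedged sketches of the lock, rehydration, and diagnostic-settings operations above, assuming the azure-identity, azure-mgmt-resource, azure-storage-blob, and azure-mgmt-monitor Python packages; the client and method names follow those SDKs, but exact parameter shapes may differ by version, and every resource name or ID below is a placeholder:

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.resource.locks import ManagementLockClient
from azure.storage.blob import BlobClient

credential = DefaultAzureCredential()
subscription_id = "<subscription-id>"

# 1. Apply a CannotDelete lock at resource-group scope.
lock_client = ManagementLockClient(credential, subscription_id)
lock_client.management_locks.create_or_update_at_resource_group_level(
    "my-resource-group",                     # hypothetical resource group
    "no-accidental-delete",                  # lock name
    {"level": "CanNotDelete",                # ARM value for the CannotDelete level
     "notes": "Protect critical resources from accidental deletion."},
)

# 2. Rehydrate an archived blob in place by changing its tier to Hot.
blob = BlobClient.from_connection_string(
    "<storage-connection-string>", container_name="backups", blob_name="2021-11.tar")
blob.set_standard_blob_tier("Hot", rehydrate_priority="High")   # may take hours to complete

# 3. Route a resource's platform logs and metrics to a Log Analytics workspace.
monitor = MonitorManagementClient(credential, subscription_id)
monitor.diagnostic_settings.create_or_update(
    "<resource-id>",                         # full ARM ID of the monitored resource
    "send-to-workspace",                     # diagnostic setting name
    {
        "workspace_id": "<log-analytics-workspace-resource-id>",
        "logs": [{"category": "AuditLogs", "enabled": True}],     # categories vary by resource type
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    },
)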

Friday, November 12, 2021

This is a continuation of an article that describes operational considerations for hosting solutions on Azure public cloud.

There are several references to best practices throughout the series of articles we wrote based on the documentation for the Azure Public Cloud. This article focuses on the antipatterns to avoid, specifically the cloud readiness antipatterns.

Antipatterns are experienced when planning a cloud adoption. Misaligned operating models can lead to increased time to market, misunderstandings, and increased workload on IT departments. Companies choose the wrong operating model when they assume that Platform-as-a-Service decreases costs without any involvement on their part. Sometimes a change of direction in the business can lead to radical changes in architecture, requiring replacement projects that become complex and cost intensive.

An operating model articulates the types of accountability, landing zones, and focus, and a company chooses a model based on its strategic priorities and the scope of its portfolio. Assigning too much responsibility to a small team can result in a slow adoption journey: such a team is burdened with approving measures only after fully understanding their impact on the business, operations, and security, and it is worse if these are not the team's main areas of expertise.

Cloud readiness antipatterns are those that are experienced during the readiness phase of cloud adoption.

Assuming released services are ready for production is the first cloud readiness antipattern we discuss.

Services mature over time, and not all of them are mature yet. Preview services do not come with a Service-Level Agreement (SLA), and new services can be unstable. When organizations convince themselves that a new or preview service fits their use case, they take a large risk on the guarantees that an SLA would otherwise provide. This might lead to unexpected downtime, unplanned disaster recovery, and availability issues. When such things do occur, the perception becomes that this is true of cloud services in general, which is not the case and makes matters even more problematic.

Another antipattern is assuming that all cloud services are more resilient and available than those on-premises. Increased resiliency implies recovery after failures, and availability implies running in a healthy state with little or no downtime. It is true that cloud services offer these advantages, but not all of them do, and even when services offer them, they might come at a premium or as an additional feature.

Take availability, for instance: it depends on service models like PaaS and SaaS, or on technical architectures like load-balanced availability sets and availability zones. A single VM may be considered highly available, but it can still be a single point of failure, and its downtime might leave the services it hosts in an unrecoverable state.

Another common antipattern occurs when companies try to make their internal IT department a cloud provider. IT becomes responsible for reference architectures while providing PaaS or SaaS to business units. This antipattern severely hampers usability, efficiency, resiliency, and security. Sometimes IT is even tasked with providing monolithic end-to-end services, which results in an order for a fully managed cloud VM as a service, but IT controls who can access and use the entire platform, and business units don't get to take full advantage of the cloud portal or get SSH or RDP access. This kind of wrapper over cloud services, which can be many and change frequently, does not lower the cost of release that business units want. Instead, a mature cloud operating model, such as centralized operations with guardrails like governance, can empower the business units.

Finally, the choice of the right model improves the cloud adoption roadmap.