Thursday, December 2, 2021

This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.  

This series of articles draws on best practices described in the documentation for the Azure public cloud. The previous article focused on antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on transient fault handling.

Transient errors can occur anywhere in the cloud. When services are hosted on-premises, performance and availability are provided by redundant and often underutilized hardware, and the components are located close to one another. This reduces failures from networking, though not from power failures or other faults. The cloud provides unparalleled availability, but it may involve network latency, and any number of errors can result from an unreachable network. Other sources of transient failures include:

1) Shared resources that are subject to throttling in order to protect them. Some services will refuse connections when the load rises to a specific level, or when a maximum throughput rate is reached. Throttling helps provide a uniform quality of service for neighbors and other tenants using the shared resource (see the sketch after this list).

2) Commodity hardware units that make up the cloud, where performance is delivered by dynamically distributing the load across multiple computing units and infrastructure components. In this case, faults are handled by dynamically recycling the affected components.

3) Hardware components, including network infrastructure such as routers and load balancers, that sit between the application and the resources and services it uses.

4) Client-side conditions, such as intermittent Internet disconnections, that affect the client's ability to reach the service.
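As a simple illustration of the throttling case above, the following Python sketch assumes an HTTP service that signals throttling with a 429 status and a Retry-After header expressed in seconds, and it uses the requests library; get_with_throttling is a hypothetical helper for illustration, not part of any Azure SDK.

    import time
    import requests  # assumed available; any HTTP client exposing status codes and headers works

    def get_with_throttling(url, max_attempts=3):
        """Call a throttled endpoint and honor its Retry-After hint (illustrative only)."""
        for attempt in range(max_attempts):
            response = requests.get(url, timeout=10)
            if response.status_code != 429:   # 429 Too Many Requests signals throttling
                return response
            # Back off for as long as the service asks; fall back to 1 second if no hint is given.
            delay = int(response.headers.get("Retry-After", "1"))
            time.sleep(delay)
        return response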

Cloud-based services, applications, and solutions must work around these transient failures because they are hard to eliminate.

First, they must have a built-in retry mechanism, although it can operate at varying scopes, from an individual system call up to an entire API implementation.
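A minimal sketch of such a retry mechanism at the scope of a single call might look like the following; the helper and its parameters are illustrative, and a production implementation would catch only the error types known to be transient.

    import time

    def retry(operation, attempts=3, delay=1.0):
        """Retry a callable a fixed number of times before giving up (illustrative sketch)."""
        last_error = None
        for _ in range(attempts):
            try:
                return operation()
            except Exception as error:    # in practice, catch only transient error types
                last_error = error
                time.sleep(delay)
        raise last_error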

Second, they should determine whether the operation is suitable for retrying. Retry only those operations where the fault is transient and there is at least some likelihood that the operation will succeed when reattempted. Transient faults can usually be recognized from the error or status codes returned by the failing call.
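For example, an HTTP-based client might treat the following status codes as transient; the exact set of retryable codes is service-specific, so this list is only an assumption for illustration.

    # Commonly treated as transient: timeouts, throttling, and temporary server errors.
    TRANSIENT_STATUS_CODES = {408, 429, 500, 502, 503, 504}

    def is_transient(status_code):
        """Return True when the status code suggests the call may succeed if retried."""
        return status_code in TRANSIENT_STATUS_CODES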

Third, the retry count and interval must be chosen for this to work. Common strategies include exponential backoff, incremental intervals, regular intervals, immediate retry, and randomization (jitter).
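The sketch below shows one of these strategies, exponential backoff combined with randomization (full jitter); the base delay, cap, and attempt count are arbitrary illustrative values.

    import random
    import time

    def backoff_delays(base=1.0, cap=30.0, attempts=5):
        """Yield exponentially growing delays with full jitter, capped at `cap` seconds."""
        for attempt in range(attempts):
            delay = min(cap, base * (2 ** attempt))
            yield random.uniform(0, delay)   # jitter spreads retries from many clients

    # Usage: sleep between attempts of the failing operation.
    # for delay in backoff_delays():
    #     time.sleep(delay)
    #     ... retry the operation ...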

Finally, the retry storm antipattern must be avoided, where many clients retrying aggressively at the same time overwhelm a service that is already struggling.
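One way to guard against a retry storm is to stop retrying for a cooling-off period after repeated failures, as in this simplified circuit-breaker sketch; the thresholds are hypothetical, and a real system would likely rely on a library-provided implementation.

    import time

    class SimpleCircuitBreaker:
        """Stop calling a failing service for a cooling-off period (illustrative sketch)."""

        def __init__(self, failure_threshold=5, reset_after=60.0):
            self.failure_threshold = failure_threshold
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def allow_call(self):
            # While the breaker is open, reject calls until the cooling-off period has passed.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    return False
                self.opened_at = None   # half-open: allow a trial call
                self.failures = 0
            return True

        def record_failure(self):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

        def record_success(self):
            self.failures = 0
            self.opened_at = None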
