This is a continuation of a series of articles on operational considerations
for hosting solutions on the Azure public cloud. The series refers throughout
to best practices drawn from the Azure public cloud documentation. The previous
article focused on antipatterns to avoid, specifically the noisy neighbor
antipattern. This article focuses on transient fault handling.
Transient errors can occur anywhere in the cloud. When services are hosted
on-premises, performance and availability are provided by redundant and often
underutilized hardware, and the components are located close to each other.
This proximity reduces failures caused by networking, though not those caused
by power outages or other faults. The cloud provides unparalleled availability,
but it can involve network latency, and any number of errors can result from an
unreachable network. Other forms of transient failures may come from:
1) Shared resources that are subject to throttling in order to protect the
resource. Some services refuse connections when the load rises to a specific
level or when a maximum throughput rate is reached. This helps provide uniform
quality of service for neighbors and other tenants using the shared resource.
2) Commodity hardware units that make up the cloud, where performance is
delivered by dynamically distributing the load across multiple computing units
and infrastructure components. In this case, faults are handled by dynamically
recycling the affected components.
3) Hardware components, including network infrastructure such as routers and
load balancers, that sit between the application and the resources and services
it uses.
4) Clients, when local conditions such as intermittent Internet disconnections
affect their ability to reach the service.
Cloud-based services, applications, and solutions must work around these
transient failures because such failures are hard to eliminate.
First, they must have a built-in retry mechanism, although its scope can vary
from an individual system call to an entire API implementation.
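A retry mechanism scoped to a single call can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `TransientError`, `call_with_retry`, and the default attempt count and delay are hypothetical names and values chosen for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def call_with_retry(operation, max_attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Run operation(), retrying on TransientError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the fault to the caller
            sleep(delay_seconds)  # fixed pause before the next attempt
```

The same wrapper could be applied at a wider scope, for example around a whole API handler, by passing a larger callable as `operation`. The `sleep` parameter is only there so that the pause can be replaced in tests.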
Second, they should determine whether the operation is suitable for retrying.
Retry only those operations where the fault is transient and there is at least
some likelihood that the operation will succeed when reattempted. This can
usually be determined from the error codes returned by the call in which the
transient error originates.
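As an illustration of deciding from the error code, the sketch below classifies a failure as transient or not. The set of HTTP status codes is an assumption made for the example (timeouts, throttling, and gateway or availability errors); each real service documents its own retryable codes. `ServiceError` is a hypothetical exception type that carries a status code.

```python
# Assumed set of HTTP status codes that often indicate a transient fault.
# Check the documentation of the service in use; this list is illustrative.
RETRYABLE_STATUS_CODES = {408, 429, 502, 503, 504}


class ServiceError(Exception):
    """Hypothetical error raised by a client, carrying the HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"service returned {status_code}")
        self.status_code = status_code


def is_transient(error):
    """Return True if the error looks worth retrying."""
    # Built-in connection and timeout errors are treated as transient.
    if isinstance(error, (ConnectionError, TimeoutError)):
        return True
    # Otherwise decide from the status code, if the error carries one.
    status = getattr(error, "status_code", None)
    return status in RETRYABLE_STATUS_CODES
```

Under these assumptions, a 503 or a dropped connection would be retried, while a 404 or a validation error would be reported to the caller immediately.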
Third, the retry count and interval must be chosen for this to work. Strategies
include exponential backoff, incremental intervals, regular intervals,
immediate retry, and randomization.
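The interval strategies listed above differ only in how the delay before each attempt is computed. The following sketch shows one possible formula for each. The function names, base delay, step, and cap are assumptions chosen for illustration, and the randomization shown is one common variant (a uniform draw between zero and the computed delay), not the only one.

```python
import random


def immediate_retry(attempt):
    """Immediate retry: no wait before the next attempt."""
    return 0.0


def regular_interval(attempt, base=1.0):
    """Regular intervals: the same wait before every attempt."""
    return base


def incremental_interval(attempt, base=1.0, step=2.0):
    """Incremental intervals: the wait grows by a fixed step each attempt."""
    return base + step * (attempt - 1)


def exponential_backoff(attempt, base=1.0, cap=30.0):
    """Exponential backoff: the wait doubles each attempt, up to a cap."""
    return min(cap, base * 2 ** (attempt - 1))


def with_jitter(delay, rng=random):
    """Randomization: pick a wait uniformly between 0 and the given delay."""
    return rng.uniform(0, delay)
```

With these defaults, attempts 1 through 4 wait 1, 3, 5, and 7 seconds under incremental intervals, and 1, 2, 4, and 8 seconds under exponential backoff. Combining `with_jitter(exponential_backoff(attempt))` spreads out clients that would otherwise retry at the same moments.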
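Finally, the retry storm antipattern must be avoided: when many clients retry
aggressively at the same time, the extra load can keep a struggling service
from recovering.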
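One common guard against a retry storm, which this article does not name but which is widely used alongside retries, is the circuit breaker pattern: after repeated failures the client stops calling the service for a cool-down period. The sketch below is a simplified illustration. `CircuitBreaker`, `CircuitOpenError`, the failure threshold, and the reset time are all hypothetical choices for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected to give the service time to recover."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpenError("circuit open; not calling the service")
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or reopen) the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

Limiting the total number of attempts and adding randomization to the retry interval serve the same goal of keeping retries from arriving in synchronized bursts.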
 