This is a continuation of an article that describes operational considerations for hosting
solutions on the Azure public cloud.
There are several references to best practices throughout
this series of articles, which draws on the documentation for the Azure public
cloud. The previous article focused on antipatterns to avoid, specifically
the monolithic persistence antipattern. This one focuses on the retry storm
antipattern.
When I/O requests fail due to transient errors, services
must retry their calls. Retrying helps overcome transient faults, throttling, and rate
limits, and it avoids surfacing operational errors that would otherwise require user
intervention. But when the number of retries or the duration of retrying is not
governed, the retries become frequent and numerous, which can have a significant
impact on performance and responsiveness. Network calls and other I/O operations are
much slower than compute tasks. Each I/O request carries a significant overhead as it
travels up and down the networking stack on both the local and the remote host and
incurs the round-trip time, and the cumulative effect of numerous I/O operations can
slow down the system. There are some common manifestations of the retry storm.
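Before turning to those manifestations, the following minimal sketch shows the kind of ungoverned retry loop described above; the operation callable and the TransientError type are hypothetical placeholders for a real dependency call and the transient failure it raises.

    import time

    class TransientError(Exception):
        """Stand-in for whatever transient failure the remote dependency raises."""

    def call_with_naive_retry(operation, delay=0.1):
        # No cap on the number of attempts and no growing interval between them:
        # during a sustained outage every caller keeps hammering the dependency.
        while True:
            try:
                return operation()
            except TransientError:
                time.sleep(delay)

Because nothing bounds the attempts or spaces them out, a sustained outage turns every caller into a steady stream of repeat requests.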
Reading and writing individual records to a database as
distinct requests – When records are fetched one at a time, a series
of queries runs one after another to get the information. This is
exacerbated when an Object-Relational Mapper hides the behavior underneath
the business logic and each entity is retrieved through several queries. The same
can happen when writing an entity. When each of these queries is wrapped in
its own retry, a few transient failures can multiply into a flood of calls.
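As a hypothetical illustration, the sketch below loads a parent record and then each child record with a separate query, every one of them wrapped in the naive retry helper from the earlier sketch; the db object, the table names, and the column names are invented for the example.

    def load_order_with_lines(db, order_id):
        # One query (and one retry wrapper) for the parent record ...
        order = call_with_naive_retry(lambda: db.query_one(
            "SELECT * FROM orders WHERE id = ?", order_id))
        # ... and one more query, each with its own retry, per child record.
        order["lines"] = [
            call_with_naive_retry(lambda line_id=line_id: db.query_one(
                "SELECT * FROM order_lines WHERE id = ?", line_id))
            for line_id in order["line_ids"]
        ]
        return order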
Implementing a single logical operation as a series of
HTTP requests – This occurs when objects residing on a remote server are
represented as proxies in the memory of the local system. The code reads as
if an object is being modified locally, when in fact every modification costs
at least one round-trip time. When there are many network round trips, the
cost is cumulative and can even be prohibitive. It is easily observed when a proxy
object has many properties and each property get or set requires a relay to the
remote object. In such a case, there is also the requirement to perform
validation after every access.
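A sketch of such a proxy is shown below, assuming an invented HTTP endpoint layout and using the requests library; each property access translates into its own round trip, and in a retry storm each of those calls would also carry its own retry.

    import requests

    class CustomerProxy:
        """Local stand-in for a customer object that actually lives on a remote server."""

        def __init__(self, base_url, customer_id):
            self._url = f"{base_url}/customers/{customer_id}"

        @property
        def name(self):
            # Each property read is a separate HTTP round trip.
            return requests.get(f"{self._url}/name", timeout=5).json()["value"]

        @name.setter
        def name(self, value):
            # Each property write is another round trip, plus server-side validation.
            requests.put(f"{self._url}/name", json={"value": value}, timeout=5)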
Reading and writing to a file on disk – File I/O also
hides the distributed nature of interconnected file systems. Every byte written to a file on a mount must
be relayed to the original on the remote server. When there are many writes,
the cost accumulates quickly. It is even more noticeable when the writes are
only a few bytes each and frequent. When individual requests are wrapped in a retry,
the number of calls can rise dramatically.
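The sketch below, with an invented record format, shows the unbuffered variant of this pattern: the file behind the mount is opened, written a few bytes, and closed once per record.

    def append_records_unbuffered(path, records):
        for record in records:
            # Open the file, write a few bytes, and close it once per record;
            # each small write is relayed to the remote server behind the mount,
            # and each one is typically wrapped in its own retry as well.
            with open(path, "a") as f:
                f.write(record + "\n")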
There are several ways to fix the problem, and they involve
both detection and remediation. The remedies include capping the number of retry attempts
and preventing retries from continuing for a long period of time. Retries can follow an
exponential backoff strategy that increases the interval between successive
calls, errors can be handled gracefully, and the circuit breaker pattern, which
is specifically designed to break a retry storm, can be applied. Official SDKs for
communicating with Azure services already include sample implementations of retry
logic. When there are many I/O requests, they can be batched into coarser
requests. The database can be read with one query in place of many, which
also gives the database an opportunity to plan and execute the work more efficiently.
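Returning to the capped, exponential-backoff remedy mentioned above, a minimal sketch follows, reusing the hypothetical TransientError placeholder from the first sketch; attempts are capped and the wait between them grows exponentially with jitter rather than staying fixed. In practice, the retry policies already included in the Azure SDKs would usually be configured instead of hand-rolling this logic.

    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
        # Attempts are capped and the wait between them grows exponentially,
        # with jitter so that many clients do not retry in lockstep.
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except TransientError:  # placeholder type from the first sketch
                if attempt == max_attempts:
                    raise  # give up instead of retrying indefinitely
                delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
                time.sleep(delay + random.uniform(0, delay / 2))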
Web APIs can be designed following REST best practices. Instead of separate GET
methods for different properties, there can be a single GET method for the
resource representing the object. Even if the response body is large, it will
likely still be a single request. File I/O can be improved with buffering and
caching. Files need not be opened and closed repeatedly, which also helps reduce
fragmentation of the file on disk.
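As a sketch of the buffering remedy applied to the earlier file example, the function below accumulates records in memory and writes them in one coarse request; the record format is again invented.

    def append_records_buffered(path, records):
        # Accumulate the records in memory and flush them in one coarse write:
        # the file is opened and closed once, and the remote mount sees a single
        # large request instead of many tiny ones.
        payload = "".join(record + "\n" for record in records)
        with open(path, "a") as f:
            f.write(payload)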
When more information is retrieved via fewer I/O calls
and fewer retries, this operational necessary evil becomes less risky, but there
is also a risk of falling into the extraneous fetching antipattern. The right
tradeoff depends on the usage. It is also important to read only as much as
necessary, to keep down both the size and the frequency of calls and their retries.
Sometimes data can also be partitioned into two chunks: frequently accessed
data that accounts for most requests, and less frequently accessed data that is
used rarely. When data is written, resources need not be locked at too large a
scope or for too long a duration. Retries can also be prioritized so that only
the lower-scope retries are issued for idempotent workflows.