Thursday, November 25, 2021

 This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.  

This series of articles draws on the documentation for the Azure public cloud and makes several references to its best practices. The previous article focused on an antipattern to avoid, specifically the retry storm antipattern. This one focuses on the noisy neighbor antipattern.

This antipattern occurs when some tenants hold a disproportionate share of the critical resources in a shared, reserved pool that is meant for all tenants, and so starve the others. In short, the noisy neighbor problem occurs when one tenant causes problems for another. Some common examples of resource-intensive operations include retrieving or persisting data to a database, sending a request to a web service, posting a message to or retrieving a message from a queue, and writing to or reading from a file in a blocking manner. There are many advantages to running dedicated calls, especially for debugging and troubleshooting, because the calls do not interfere with one another, but multi-tenancy enables reuse of the same components. Overusing this sharing can hurt performance because some tenants consume resources that other tenants then cannot get. The problem appears notably when components require synchronous I/O, for example when the application uses a library that offers only synchronous methods.

The base tier may have finite capacity to scale up. Compute resources are better suited to scaling out than to scaling up, and one of the primary advantages of a clean separation of layers with asynchronous processing is that the layers can be hosted independently. Container orchestration frameworks facilitate this very well. As an example, the frontend can issue a request and wait for a response without delaying the user experience. It can use the model-view-controller paradigm so that views are not only fast but can also be hosted such that tenants using one view model do not affect the others.
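As a rough illustration of keeping a synchronous-only library from holding up other tenants, the sketch below pushes a blocking call onto a worker thread so the event loop stays free. The function `blocking_fetch` and the tenant identifiers are hypothetical stand-ins, not part of any Azure SDK, and the sketch assumes Python 3.9 or later.

```python
import asyncio
import time


def blocking_fetch(tenant_id: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a library call that only has a synchronous API."""
    time.sleep(delay)  # simulates blocking I/O such as a database or file read
    return f"result-for-{tenant_id}"


async def fetch_for_tenant(tenant_id: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread, so the
    # event loop can keep serving other tenants while this one waits.
    return await asyncio.to_thread(blocking_fetch, tenant_id)


async def fetch_all(tenant_ids: list[str]) -> list[str]:
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(fetch_for_tenant(t) for t in tenant_ids))


def run_demo(tenant_ids: list[str]) -> list[str]:
    return asyncio.run(fetch_all(tenant_ids))
```

Calling `run_demo(["tenant-a", "tenant-b"])` issues both calls concurrently, so one slow tenant call does not serialize the others behind it. The worker thread pool is itself a shared resource, so this only moves the contention rather than removing it.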

This antipattern can be fixed in one of several ways. First, the processing can be moved out of the application tier into an Azure Function or a background API layer. Tenants are given promises and are actively monitored. If the application frontend is confined to data input and output display operations, using only the capabilities that the frontend is optimized for, then it will not manifest this antipattern. APIs and queries can articulate the business layer interactions such that tenants find the service responsive while the system reserves the right to perform the actual work in the background. Second, many libraries and components provide both synchronous and asynchronous interfaces. These can be used judiciously, with the asynchronous pattern working for most API calls. Finally, limits and throttling can be applied. Application gateway and firewall rules can enforce restrictions on specific tenants.
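The limits and throttling mentioned above can be sketched in-process as a per-tenant token bucket. This is only an illustration under assumed names (`TenantThrottle`, `capacity`, `refill_per_sec` are all hypothetical) and does not reproduce how an application gateway or firewall rule enforces its limits.

```python
import time
from typing import Callable, Dict, Tuple


class TenantThrottle:
    """Hypothetical per-tenant token bucket; not an Azure API."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity              # largest burst a tenant may make
        self.refill_per_sec = refill_per_sec  # sustained calls per second
        self.clock = clock                    # injectable so tests can control time
        # tenant_id -> (tokens remaining, time of last refill)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (self.capacity, now))
        # Refill in proportion to elapsed time, never beyond the capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

Because each tenant draws from its own bucket, a tenant that exhausts its allowance is refused without reducing what the other tenants may consume.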

The introduction of long-running queries and stored procedures, blocking I/O, and network waits often goes against the benefits of a responsive multi-tenant service. If the processing is already under the control of the service, then it can be optimized further.

There are several ways to fix this antipattern. They are about detection and remedy. The remedies include capping the number of attempts a tenant can make and preventing retries from continuing for a long period of time. Tenant calls can use an exponential backoff strategy that increases the duration between successive calls exponentially, handle errors gracefully, and use the circuit breaker pattern, which is specifically designed to break a retry storm. Official SDKs for communicating with Azure services already include sample implementations of retry logic.

When there are many I/O requests, they can be batched into coarser requests. The database can be read with one query in place of many, which also gives the database an opportunity to execute it better and faster. Web APIs can be designed with REST best practices: instead of separate GET methods for different properties, there can be a single GET method for the resource representing the object. Even if the response body is large, it will likely be a single request. File I/O can be improved with buffering and caching. Files need not be opened and closed repeatedly, which helps to reduce fragmentation of the file on disk.
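The capped retries, exponential backoff, and circuit breaker described above can be sketched as follows. All names here (`CircuitBreaker`, `call_with_retry`, `backoff_delays`) and the default thresholds are assumptions made for illustration; they are not the retry logic shipped in the Azure SDKs, which should be preferred where available.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; dependency not called")
            # Cool-down elapsed: permit one trial call; one failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Waits between attempts: base * 2**n, capped; max_attempts - 1 gaps in total."""
    return [min(cap, base * (2 ** n)) for n in range(max_attempts - 1)]


def call_with_retry(operation: Callable[[], T], max_attempts: int = 4,
                    base: float = 0.5, cap: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            return operation()
        except CircuitOpenError:
            raise  # retrying against an open circuit would feed the storm
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts are capped; surface the last error
            sleep(delays[attempt])
    raise RuntimeError("max_attempts must be at least 1")
```

A caller would typically combine the two, for example `call_with_retry(lambda: breaker.call(fetch))` with a hypothetical `fetch`, so that retries stop immediately once the breaker has opened.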

When more information is retrieved via fewer I/O calls and fewer retries, this operational necessary evil becomes less risky, but there is also a risk of falling into the extraneous fetching antipattern. The right tradeoff depends on the usage. It is also important to read only as much as necessary, to limit both the size and the frequency of calls and their retries for tenants. Sometimes data can also be partitioned into two chunks: frequently accessed data that accounts for most requests, and less frequently accessed data that is used rarely. When data is written, resources need not be locked at too large a scope or for too long a duration. Tenant calls, limits, and throttling can also be prioritized so that only the higher-priority tenant calls go through.
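The balance between fewer I/O calls and reading only what is needed can be illustrated with a small sketch that uses Python's built-in `sqlite3` module. The `tenants` table, its columns, and the sample rows are invented for the example; the point is a single query that replaces one query per tenant while still selecting only the column that is required.

```python
import sqlite3
from typing import Dict, List


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with hypothetical tenant rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tenants (id INTEGER PRIMARY KEY, display_name TEXT, settings_blob TEXT)"
    )
    conn.executemany(
        "INSERT INTO tenants VALUES (?, ?, ?)",
        [(1, "Contoso", "large"), (2, "Fabrikam", "large"), (3, "Northwind", "large")],
    )
    return conn


def load_display_names(conn: sqlite3.Connection, tenant_ids: List[int]) -> Dict[int, str]:
    """Fetch one needed column for many tenants in a single round trip."""
    if not tenant_ids:
        return {}
    # One placeholder per id keeps the query parameterized.
    placeholders = ",".join("?" for _ in tenant_ids)
    rows = conn.execute(
        # settings_blob is deliberately left out to avoid extraneous fetching.
        f"SELECT id, display_name FROM tenants WHERE id IN ({placeholders})",
        tenant_ids,
    ).fetchall()
    return {row[0]: row[1] for row in rows}
```

Databases limit the number of bound parameters in one statement, so a very long list of ids would need to be split into several batches.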

