This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.
There are several references to best practices throughout this series of articles, which draws on the documentation for the Azure public cloud. The previous article focused on antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for a business transaction that spans multiple backend services.
An example of an application that uses multiple backend services is a drone delivery application that runs on Azure Kubernetes Service (AKS). Customers use a web application to schedule deliveries by drone. The backend services include a delivery service that manages deliveries, a drone scheduler that schedules drones for pickup, and a package service that manages packages. Orders are not processed synchronously: an ingestion service puts the orders on a queue for processing, and a workflow service coordinates the steps in the workflow. Clients call a REST API to get their latest invoice, which includes a summary of deliveries, packages, and total drone utilization. The information is retrieved from multiple backend services and the results are aggregated for the user. The clients do not call the backend services directly; instead, the application implements a Gateway Aggregation pattern.
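To make the pattern concrete, here is a minimal sketch of gateway aggregation in Python. The endpoint URLs and response shapes are hypothetical, not the reference implementation's own code; the point is that the gateway fans out to the backend services in parallel and merges the partial results into a single invoice.

```python
# Minimal sketch of the Gateway Aggregation pattern: the gateway fans out to the
# backend services in parallel and merges the partial results into one response.
# The backend URLs and response shapes below are hypothetical.
import asyncio
import httpx

BACKENDS = {
    "deliveries": "http://delivery/api/deliveries/summary/{owner_id}",
    "packages": "http://package/api/packages/summary/{owner_id}",
    "drone_utilization": "http://dronescheduler/api/utilization/{owner_id}",
}

async def get_invoice(owner_id: str) -> dict:
    async with httpx.AsyncClient(timeout=10.0) as client:
        # Issue all backend calls concurrently; the slowest call bounds the latency.
        responses = await asyncio.gather(
            *(client.get(url.format(owner_id=owner_id)) for url in BACKENDS.values())
        )
    # Aggregate the partial results into a single invoice for the client.
    return {name: resp.json() for name, resp in zip(BACKENDS, responses)}

if __name__ == "__main__":
    print(asyncio.run(get_invoice("owner-123")))
```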
Performance tuning begins with a baseline, usually established with a load test. In this case, a six-node AKS cluster with three replicas of each microservice was deployed for a step load test in which the number of simulated users was stepped up from two to forty over a total duration of 8 minutes. The results show that as the user load increases, the throughput (average requests per second) does not keep up. While no errors are returned to the user, the throughput peaks about halfway through the test and then drops off for the remainder. Resource contention, transient errors, and an increase in the rate of exceptions can all contribute to this pattern.
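Purely as an illustration of the step profile described above, a similar ramp from two to forty simulated users over 8 minutes could be scripted with Locust. The tool choice, target endpoint, and step size here are assumptions, not the configuration used for the original baseline.

```python
# Illustrative step load profile: ramp simulated users from 2 to 40 over 8 minutes.
# The target endpoint and the use of Locust are assumptions for illustration.
from locust import HttpUser, LoadTestShape, task, between

class DeliveryUser(HttpUser):
    wait_time = between(1, 2)

    @task
    def schedule_delivery(self):
        # Hypothetical scheduling endpoint exposed by the gateway.
        self.client.post("/api/deliveryrequests", json={"owner": "owner-123"})

class StepLoadShape(LoadTestShape):
    start_users = 2
    max_users = 40
    step_users = 6           # users added at each step
    step_duration = 60       # seconds per step
    total_duration = 8 * 60  # 8 minutes

    def tick(self):
        run_time = self.get_run_time()
        if run_time > self.total_duration:
            return None  # stop the test
        step = int(run_time // self.step_duration)
        users = min(self.start_users + step * self.step_users, self.max_users)
        return (users, users)  # (user count, spawn rate)
```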
One way to find the bottleneck is to review the monitoring data, starting with the average duration of the HTTP calls from the gateway to the backend services. When the duration of the different backend calls is charted, it shows that GetDroneUtilization takes much longer on average, by an order of magnitude. The gateway makes the calls to the backends in parallel, so the slowest operation determines how long the entire request takes to complete.
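One way to produce that comparison is to query the dependency telemetry directly. The sketch below assumes workspace-based Application Insights (hence the AppDependencies table and DurationMs column) and a placeholder workspace ID, and averages the duration of the gateway's backend calls by operation name.

```python
# Sketch: average duration of the gateway's backend calls, grouped by operation name.
# Assumes workspace-based Application Insights; the workspace ID is a placeholder.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"

QUERY = """
AppDependencies
| summarize avg(DurationMs) by Name
| order by avg_DurationMs desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

for table in response.tables:
    for row in table.rows:
        print(row)
```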
As the investigation narrows to the GetDroneUtilization operation, Azure Monitor for Containers is used to pull up resource consumption data for CPU and memory utilization. Both the average and the maximum values are needed, because the average hides spikes in utilization. If the overall utilization remains under 80%, resource exhaustion is not likely to be the issue.
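Both aggregations can be requested in a single metrics query. The following sketch assumes the AKS platform metric node_cpu_usage_percentage and a placeholder resource ID; the point is simply that average and maximum are pulled side by side so that short spikes are not hidden.

```python
# Sketch: request both average and maximum node CPU utilization for the AKS cluster.
# The resource ID and metric name ("node_cpu_usage_percentage") are assumptions.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

AKS_RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.ContainerService/managedClusters/<cluster-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    AKS_RESOURCE_ID,
    metric_names=["node_cpu_usage_percentage"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.AVERAGE, MetricAggregationType.MAXIMUM],
)

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, "avg:", point.average, "max:", point.maximum)
```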
Another chart, which shows the response codes from the Delivery service's backend database, reveals that a considerable number of HTTP 429 error codes are returned from the calls made to the database. Cosmos DB, the backend database in this case, returns this error only when it is temporarily throttling requests, usually because the caller is consuming more request units (RUs) than provisioned.
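For reference, the 429 status and the per-request RU charge are also visible from the Cosmos DB client side. The following sketch uses the Python SDK with placeholder account, database, container, and query values; note that the SDK retries 429 responses internally, so the exception surfaces only after those retries are exhausted.

```python
# Sketch: observe request-unit charges and 429 throttling from the Cosmos DB Python SDK.
# Account URL, key, database, container, and query are placeholders.
from azure.cosmos import CosmosClient, exceptions

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("invoicing").get_container_client("deliveries")

try:
    items = list(container.query_items(
        query="SELECT * FROM c WHERE c.ownerId = @owner",
        parameters=[{"name": "@owner", "value": "owner-123"}],
        enable_cross_partition_query=True,
    ))
    # The RU charge for the last operation is reported in a response header.
    charge = container.client_connection.last_response_headers.get("x-ms-request-charge")
    print(f"Returned {len(items)} items, request charge: {charge} RUs")
except exceptions.CosmosHttpResponseError as e:
    if e.status_code == 429:
        # Throttled: the request consumed more RUs than were available.
        print("Request rate too large (429); back off and retry.")
    else:
        raise
```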
Fortunately, this level of focus comes with specific tools to assist with the inference. Application Insights provides end-to-end telemetry for a representative sample of requests. The call to the GetDroneUtilization operation is analyzed for external dependencies. It shows that Cosmos DB returns the 429 error code and the client waits 672 ms before retrying the operation, which means most of the delay comes from waits with no corresponding activity. Another chart, plotting request unit consumption per partition against provisioned request units per partition, helps explain the original cause of the 429 error that precedes the wait. It turns out that there are nine partitions, each provisioned with 100 request units, and although the database spreads operations across the partitions, the request unit consumption on individual partitions has exceeded what was provisioned.
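Using the same log query approach as above, the throttled calls and the time spent waiting on them can be charted over time. The table, column names, and the "Azure DocumentDB" dependency type string are again assumptions about workspace-based Application Insights telemetry.

```python
# Sketch: count throttled Cosmos DB dependency calls and the time spent on them,
# bucketed per minute. Table, columns, and dependency type string are assumptions.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

QUERY = """
AppDependencies
| where DependencyType == "Azure DocumentDB" and ResultCode == "429"
| summarize throttled_calls = count(), total_wait_ms = sum(DurationMs) by bin(TimeGenerated, 1m)
| order by TimeGenerated asc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace("<log-analytics-workspace-id>", QUERY,
                                  timespan=timedelta(hours=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```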