Saturday, November 27, 2021

Part-1

This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.

This series of articles draws on best practices from the Azure public cloud documentation. The previous article focused on antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for distributed business transactions.

An example of an application using distributed transactions is a drone delivery application that runs on Azure Kubernetes Service. Customers use a web application to schedule deliveries by drone. The backend includes a delivery service that manages deliveries, a drone scheduler that assigns drones for pickup, and a package service that manages packages. Orders are not processed synchronously: an ingestion service puts them on a queue for processing, and a workflow service coordinates the steps in the workflow.
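To make the asynchronous hand-off concrete, a minimal sketch of the ingestion step might look like the following, assuming an Azure Service Bus queue; the queue name, connection string, and message shape are illustrative assumptions rather than details of the reference implementation.

```python
import json
import os

from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Hypothetical configuration; a real ingestion service would read these from app settings.
CONNECTION_STR = os.environ["SERVICEBUS_CONNECTION_STRING"]
QUEUE_NAME = "delivery-requests"


def enqueue_delivery_request(order: dict) -> None:
    """Accept the order and hand it off to the queue; the caller returns immediately."""
    with ServiceBusClient.from_connection_string(CONNECTION_STR) as client:
        with client.get_queue_sender(queue_name=QUEUE_NAME) as sender:
            sender.send_messages(ServiceBusMessage(json.dumps(order)))

# The workflow service later receives the message and calls the package,
# drone scheduler, and delivery services to complete the transaction.
```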

Performance tuning begins with a baseline, usually established with a load test. In this case, a six-node AKS cluster with three replicas of each microservice was deployed, and a step load test ramped the number of simulated users from two up to forty.
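A step load test of this shape can be scripted with many tools; the sketch below uses Locust purely as an illustration, with a hypothetical ingestion endpoint and payload, to show how the simulated user count is stepped from two up to forty.

```python
from locust import HttpUser, LoadTestShape, task, between


class DeliveryUser(HttpUser):
    wait_time = between(1, 2)

    @task
    def schedule_delivery(self):
        # Hypothetical ingestion endpoint and payload shape.
        self.client.post("/api/deliveryrequests", json={
            "pickup": "47.6097,-122.3331",
            "dropoff": "47.6205,-122.3493",
            "packageSize": "small",
        })


class StepLoad(LoadTestShape):
    """Step the simulated user count from 2 up to 40."""
    step_users = 2      # users added per step
    step_seconds = 60   # duration of each step
    max_users = 40

    def tick(self):
        run_time = self.get_run_time()
        # Stop shortly after the final step reaches the maximum user count.
        if run_time > self.step_seconds * (self.max_users // self.step_users):
            return None
        users = min(self.step_users * (int(run_time // self.step_seconds) + 1),
                    self.max_users)
        return (users, self.step_users)
```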

Since users get a response the moment their request is put on the queue, request latency is not the interesting metric to study. What matters is whether the backend can keep up with the request rate as the number of users increases, because that is when performance improvements become worthwhile. A plot of incoming versus outgoing messages serves this purpose. When outgoing messages fall severely behind incoming messages, it points to an ongoing systemic issue, and the actions to take depend on the errors encountered at that point. For example, the workflow service might be getting errors from the Delivery service. Let us say the errors indicate that an exception is being thrown due to memory limits in Azure Cache for Redis.
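One simple way to visualize that gap, assuming the queue's incoming and completed message counts have been exported from Azure Monitor to a CSV file, is a short sketch like this; the file name and column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of queue metrics, one row per minute:
# timestamp, incoming message count, outgoing (completed) message count.
df = pd.read_csv("queue_metrics.csv", parse_dates=["timestamp"])

ax = df.plot(x="timestamp", y=["incoming", "outgoing"])
ax.set_ylabel("messages per minute")
ax.set_title("Incoming vs. outgoing messages during the step load test")
plt.show()

# A widening gap between the two lines signals that the backend is falling
# behind the request rate and that the error logs deserve a closer look.
```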

When the cache is added, it resolves many of the internal errors seen in the logs, but outgoing responses still lag incoming requests by an order of magnitude. A Kusto query over the logs, measuring the throughput of completed messages at 5-second samples, indicates that the backend is the bottleneck. This can be alleviated by scaling out the backend services - package, delivery, and drone scheduler - to see if throughput increases. The number of replicas is increased from 3 to 6, but the load test shows only modest improvement: outgoing messages are still not keeping up with incoming messages. Azure Monitor for containers indicates that the problem is not resource exhaustion on the cluster nodes, because CPU is underutilized at less than 40% even at the 95th percentile and memory utilization is under 20%. The problem might instead lie with the containers or pods, which might be resource-constrained. If the pods also appear healthy, then adding more pods will not solve the problem.
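Scaling the backend services out from 3 to 6 replicas can be done with kubectl scale or, as a sketch, with the Kubernetes Python client; the deployment names and namespace below are assumptions for illustration, not necessarily those of the reference deployment.

```python
from kubernetes import client, config

# Equivalent to: kubectl scale deployment <name> --replicas=6 -n backend
# Deployment and namespace names are hypothetical.
config.load_kube_config()
apps = client.AppsV1Api()

for name in ["package", "delivery", "dronescheduler"]:
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace="backend",
        body={"spec": {"replicas": 6}},
    )
```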

