This is a continuation of an article series that describes operational considerations for hosting solutions on the Azure public cloud. The series draws on the Azure public cloud documentation and references best practices throughout. The previous article focused on antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for distributed business transactions.
An example of an application that uses distributed transactions is a drone delivery application that runs on Azure Kubernetes Service (AKS). Customers use a web application to schedule deliveries by drone. The backend services include a delivery service that manages deliveries, a drone scheduler that schedules drones for pickup, and a package service that manages packages. Orders are not processed synchronously: an ingestion service puts each order on a queue for processing, and a workflow service coordinates the steps in the workflow.
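To make the handoff concrete, here is a minimal sketch of an ingestion step that drops an order on a queue, assuming Azure Service Bus as the broker. The queue name, connection string variable, and order fields are hypothetical, not taken from the reference implementation.

import json
import os
from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Hypothetical queue name; the actual ingestion service may use a different broker or queue.
QUEUE_NAME = "delivery-requests"

def enqueue_order(order: dict) -> None:
    # The caller can return a response to the user as soon as the message is accepted
    # by the queue; the workflow service picks it up later and drives the delivery,
    # drone scheduler, and package services.
    conn_str = os.environ["SERVICEBUS_CONNECTION_STRING"]
    with ServiceBusClient.from_connection_string(conn_str) as client:
        with client.get_queue_sender(queue_name=QUEUE_NAME) as sender:
            sender.send_messages(ServiceBusMessage(json.dumps(order)))

enqueue_order({"deliveryId": "d-001", "pickup": "Building 1", "dropoff": "Building 9"})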
Performance tuning begins with a baseline, usually established with a load test. In this case, a six-node AKS cluster with three replicas of each microservice was deployed, and a step load test ramped the number of simulated users from two up to forty.
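A step load of this shape could be expressed with a tool such as Locust; the endpoint path, host, and step timings below are assumptions for illustration, not the values used in the actual test.

from locust import HttpUser, LoadTestShape, constant, task

class DeliveryUser(HttpUser):
    host = "http://localhost:8080"  # hypothetical ingestion service address
    wait_time = constant(1)

    @task
    def schedule_delivery(self):
        # Hypothetical ingestion endpoint; it returns as soon as the order is queued.
        self.client.post("/api/deliveryrequests", json={
            "pickup": "Building 1",
            "dropoff": "Building 9",
        })

class StepLoad(LoadTestShape):
    # Step from 2 to 40 simulated users, adding 2 users every 30 seconds (assumed interval).
    step_users = 2
    step_seconds = 30
    max_users = 40

    def tick(self):
        run_time = self.get_run_time()
        if run_time > self.step_seconds * (self.max_users // self.step_users):
            return None  # end the test after the last step
        users = min(self.max_users, self.step_users * (1 + int(run_time // self.step_seconds)))
        return (users, users)  # (target user count, spawn rate per second)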
Since users get a response the moment their request is put on the queue, client-side response times say little about the backend. What matters is whether the backend can keep up with the request rate as the number of users increases; that is where performance improvements become useful. A plot of incoming versus outgoing messages serves this purpose.
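A minimal sketch of such a plot, assuming the per-interval message counts have already been exported from the logs; the numbers below are placeholders.

import matplotlib.pyplot as plt

# Placeholder per-interval counts; in practice these come from a log query over the
# ingestion service (incoming) and the workflow service (outgoing).
intervals = list(range(10))
incoming = [20, 40, 80, 120, 160, 200, 240, 280, 320, 360]
outgoing = [20, 40, 70, 90, 100, 105, 108, 110, 111, 112]

plt.plot(intervals, incoming, label="incoming messages")
plt.plot(intervals, outgoing, label="outgoing messages")
plt.xlabel("interval")
plt.ylabel("messages")
plt.legend()
plt.show()  # a widening gap between the two lines shows the backend falling behind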
When outgoing messages fall severely behind incoming messages, that gap indicates an ongoing systemic issue, and the actions to take depend on the errors encountered at the time. For example, the workflow service might be getting errors from the Delivery service. Let us say the errors indicate that an exception is being thrown because Azure Cache for Redis has hit its memory limit.
Once the cache memory issue is addressed, most of those internal errors disappear from the logs, but outbound responses still lag incoming requests by an order of magnitude. A Kusto query over the logs, computing the throughput of completed messages from data points sampled at 5-second intervals, shows that the backend is the bottleneck.
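A sketch of how such a query might be run against a Log Analytics workspace with the azure-monitor-query client follows; the table and column names in the KQL are assumptions about how the services log completed messages, not the schema of the actual workspace.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Hypothetical table and message text; adjust to whatever the workflow service actually logs.
query = """
AppTraces
| where Message has "Delivery request completed"
| summarize completed = count() by bin(TimeGenerated, 5s)
| extend throughputPerSecond = completed / 5.0
| order by TimeGenerated asc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(hours=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)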
This can be alleviated by scaling out the backend services (package, delivery, and drone scheduler) to see whether throughput increases, so the number of replicas is increased from 3 to 6.
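The replica count can be raised with kubectl or programmatically; here is a minimal sketch using the official Python Kubernetes client, where the deployment names and namespace are assumptions.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
apps = client.AppsV1Api()

# Hypothetical deployment names and namespace for the three backend services.
for deployment in ("package", "delivery", "drone-scheduler"):
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace="backend",
        body={"spec": {"replicas": 6}},
    )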
The next load test shows only modest improvement; outgoing messages are still not keeping up with incoming messages. Azure Monitor for containers indicates that the problem is not resource exhaustion on the cluster nodes: CPU is underutilized at less than 40% even at the 95th percentile, and memory utilization is under 20%. The constraint may instead lie with the containers or pods, which can be resource-constrained by their own requests and limits even when the nodes have headroom. If the pods also appear healthy, then adding more pods will not solve the problem.
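One quick check is to inspect the resource requests and limits configured on the pods and compare them with observed usage; a sketch with the Python Kubernetes client, again assuming a hypothetical namespace.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Tight CPU or memory limits can throttle pods even when the nodes themselves are idle.
for pod in core.list_namespaced_pod(namespace="backend").items:
    for container in pod.spec.containers:
        print(
            pod.metadata.name,
            container.name,
            "requests:", container.resources.requests,
            "limits:", container.resources.limits,
        )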