Sunday, November 28, 2021

Part-2

The problem might not be the cluster nodes but the containers or pods, which might be resource-constrained. If the pods also appear healthy, then adding more pods will not solve the problem.

Application Insights might show that the duration of the workflow service's process operation is 246 ms. The query can even request a breakdown of the processing time for each of the calls to the three backend services. The individual processing times for each of these services might also appear reasonable, leaving the shortfall in request processing unexplained. One key observation here is that an overall processing time of about 250 ms indicates a fixed cost that puts an upper bound on how fast messages can be processed serially. The key to increasing throughput is therefore to enable parallel processing of messages. The delays in processing appear to come from network round-trip time for I/O completions, and the orders in the request queue are independent of each other. These two factors enable us to increase parallelism, which we demonstrate by raising MaxConcurrentCalls from its initial value of 1 to 20 and PrefetchCount from its initial value of 0 to 3000 so that messages are buffered in a local cache.

The best practices for improving performance with Service Bus messaging indicate watching the dead-letter queue. Service Bus atomically retrieves and locks a message as it is processed so that it is not delivered to other receivers. When the lock expires, the message can be delivered to other receivers, and after a configurable maximum number of delivery attempts, Service Bus puts the message in the dead-letter queue for later examination. Because the workflow service now prefetches a large batch of 3000 messages, the total time to process each message grows, which results in messages timing out, going back into the queue, and eventually reaching the dead-letter queue. This behavior can also be tracked via the MessageLockLostException. The symptom is mitigated by setting the lock duration to 5 minutes to prevent lock timeouts.

The plot for incoming and outgoing messages confirms that the system is now keeping up with the rate of incoming messages. The results from the performance load test show that over a total duration of 8 minutes, the application completed 25,000 operations, with a peak throughput of 72 operations per second, representing a 400% increase in maximum throughput.
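As a rough illustration of these settings, the sketch below uses the Azure Service Bus SDK for Java (azure-messaging-servicebus) to raise the queue's lock duration to 5 minutes and to build a processor with the equivalents of MaxConcurrentCalls = 20 and PrefetchCount = 3000. The queue name, connection-string environment variable, and handler are hypothetical placeholders rather than names from the actual workflow service, which may well use a different SDK.

```java
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;
import com.azure.messaging.servicebus.ServiceBusReceivedMessageContext;
import com.azure.messaging.servicebus.administration.ServiceBusAdministrationClient;
import com.azure.messaging.servicebus.administration.ServiceBusAdministrationClientBuilder;
import com.azure.messaging.servicebus.administration.models.QueueProperties;

import java.time.Duration;

public class WorkflowQueueTuning {

    public static void main(String[] args) {
        // Placeholder connection string and queue name (not from the original application).
        String connectionString = System.getenv("SERVICEBUS_CONNECTION_STRING");
        String queueName = "ingestion-queue";

        // Lock duration is a queue-level property. Raising it to 5 minutes (the maximum)
        // keeps locks from expiring while prefetched messages wait to be processed.
        ServiceBusAdministrationClient adminClient = new ServiceBusAdministrationClientBuilder()
                .connectionString(connectionString)
                .buildClient();
        QueueProperties queue = adminClient.getQueue(queueName);
        queue.setLockDuration(Duration.ofMinutes(5));
        adminClient.updateQueue(queue);

        // Processor settings: handle up to 20 messages in parallel and keep a local
        // cache of prefetched messages instead of fetching them one at a time.
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
                .connectionString(connectionString)
                .processor()
                .queueName(queueName)
                .maxConcurrentCalls(20)   // was 1: one message at a time
                .prefetchCount(3000)      // was 0: no local prefetch cache
                .processMessage(WorkflowQueueTuning::handleMessage)
                .processError(context -> System.err.println("Receive error: " + context.getException()))
                .buildProcessorClient();

        processor.start();
    }

    private static void handleMessage(ServiceBusReceivedMessageContext context) {
        // Stand-in for the workflow service's work: calling the three backend services.
        System.out.println("Processing message " + context.getMessage().getMessageId());
    }
}
```

Changing the two settings together matters: a large prefetch without a longer lock duration is exactly what caused messages to time out and land in the dead-letter queue.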
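To see why messages are being dead-lettered, the dead-letter sub-queue can be peeked without locking or removing anything. The following sketch, again a hedged example using the Java SDK with a hypothetical queue name, prints the dead-letter reason recorded on each message.

```java
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverClient;
import com.azure.messaging.servicebus.models.SubQueue;

public class DeadLetterInspector {

    public static void main(String[] args) {
        // Placeholder connection string and queue name (not from the original application).
        String connectionString = System.getenv("SERVICEBUS_CONNECTION_STRING");

        // Open a receiver on the dead-letter sub-queue of the order queue.
        ServiceBusReceiverClient dlqReceiver = new ServiceBusClientBuilder()
                .connectionString(connectionString)
                .receiver()
                .queueName("ingestion-queue")          // hypothetical queue name
                .subQueue(SubQueue.DEAD_LETTER_QUEUE)
                .buildClient();

        // Peek up to 50 dead-lettered messages and log why each one was dead-lettered.
        for (ServiceBusReceivedMessage message : dlqReceiver.peekMessages(50)) {
            System.out.printf("id=%s reason=%s description=%s%n",
                    message.getMessageId(),
                    message.getDeadLetterReason(),
                    message.getDeadLetterErrorDescription());
        }

        dlqReceiver.close();
    }
}
```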

While this solution clearly works, repeating the experiment over a much longer period shows that the application cannot sustain this rate. The container metrics show that the maximum CPU utilization was close to 100%. At this point, the application appears to be CPU bound, so scaling out the cluster now might increase performance, unlike earlier. The new configuration for the cluster includes 12 cluster nodes, with 3 replicas for the ingestion service, 6 replicas for the workflow service, and 9 replicas for the package delivery and drone scheduler services.

To recap, the bottlenecks identified include out-of-memory exceptions in Azure Cache for Redis, lack of parallelism in message processing, insufficient message lock duration (leading to lock timeouts and messages being placed in the dead-letter queue), and CPU exhaustion. The metrics used to detect these bottlenecks include the rate of incoming and outgoing Service Bus messages, the application map in Application Insights, errors and exceptions, custom Log Analytics queries, and CPU and memory utilization for containers.
