Friday, December 3, 2021

Comparisons between web-queue-worker and event-driven architecture.


Introduction:

This article compares two architectural styles for building services on the public cloud.

Description:

The two architectural styles are:

1. Event-driven architecture: event producers generate a stream of events, and event consumers listen for those events.

2. Microservices architecture: a collection of small, autonomous services.

The following is a feature-by-feature comparison of the two styles:

Organization

Event-Driven Architecture: Events are produced in near real-time, so consumers can respond immediately to events as they occur.

Microservices: Each service is self-contained and implements a single business capability, encapsulating a domain model.

Management

Event-Driven Architecture: Subscribers can be added as necessary without impacting existing ones.

Microservices: Services can be added as necessary without impacting existing ones.

Benefits

Event-Driven Architecture: Publishers and subscribers are decoupled, and there are no point-to-point integrations, so it is easy to add new consumers to the system. Consumers can respond to events immediately as they arrive. Such systems are highly scalable and distributed, and subsystems can maintain independent views of the event stream.

Microservices: This is a simple architecture that focuses on end-to-end addition of business capabilities. The services are easy to deploy and manage, and there is a clear separation of concerns. The front end is decoupled from the worker using asynchronous messaging, and the front end and the worker can be scaled independently (a minimal sketch of this decoupling follows the comparison).

Challenges

Event-Driven Architecture: Event loss is tolerated, so guaranteed delivery poses a challenge; some IoT traffic mandates guaranteed delivery. Each consumer type typically runs in multiple instances for resiliency and scalability, which becomes a problem if the events must be processed in order or if the processing logic is not idempotent.

Microservices: Care must be taken to ensure that the front end and the worker do not become large, monolithic components that are difficult to maintain and update. Sharing data schemas or code modules between the front end and the worker creates hidden dependencies.

Best practices

Event-Driven Architecture: Events should be lean, not bloated. Services should share only IDs and/or a timestamp; transferring large amounts of data between services is an antipattern. Loosely coupled event-driven systems work best.

Microservices: Expose a well-designed API to the client. Autoscale to handle changes in load. Cache semi-static data. Use a CDN to host static content. Use polyglot persistence when appropriate. Partition data to improve scalability.

Examples

Event-Driven Architecture: Suitable for edge computing, including IoT traffic. Works well for automations that rely heavily on asynchronous backend processing. Useful where ordering, retries, and dead-letter queues must be handled.

Microservices: Best suited for expanding a backend service portfolio, such as for eCommerce. Works well for transactional processing and a deep separation of data access. Useful together with an application gateway, load balancer, and ingress.
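To make the web-queue-worker decoupling concrete, here is a minimal sketch in Python that uses an in-process queue as a stand-in for a managed broker such as Azure Service Bus or Storage Queues; the order IDs and handler logic are purely illustrative, and the idempotency check mirrors the challenge noted above for consumers that may see redelivered messages.

```python
import queue
import threading

order_queue = queue.Queue()          # stand-in for a managed message broker
processed_ids = set()                # remembers completed work so redelivery is harmless

def front_end(order_id: str) -> str:
    """Web front end: accept the request, enqueue it, and return immediately."""
    order_queue.put(order_id)
    return f"order {order_id} accepted"

def worker() -> None:
    """Background worker: drains the queue asynchronously, independent of the front end."""
    while True:
        order_id = order_queue.get()
        if order_id is None:                 # sentinel used to stop this demo worker
            break
        if order_id not in processed_ids:    # idempotent handling tolerates duplicates
            processed_ids.add(order_id)
            print(f"processed {order_id}")
        order_queue.task_done()

if __name__ == "__main__":
    t = threading.Thread(target=worker)
    t.start()
    for oid in ("A-1", "A-2", "A-2"):        # a duplicate delivery is tolerated
        print(front_end(oid))
    order_queue.put(None)
    t.join()
```

Because the front end only enqueues and returns, it can be scaled independently of the worker, which is the core property the comparison above attributes to this style.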


Conclusion:

These are some comparisons between the two styles.

Thursday, December 2, 2021

This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.  

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on transient fault handling.

Transient errors can occur anywhere in the cloud. When services are hosted on-premises, performance and availability are provided via redundant and often underutilized hardware, and the components are located close to each other. This reduces failures from networking, though not from power failures or other faults. The cloud provides unparalleled availability, but it might involve network latency, and there can be any number of errors resulting from an unreachable network. Other forms of transient failures may come from:

1) many shared resources that are subject to throttling in order to protect the resource. Some services will refuse connections when the load rises to a specific level, or a maximum throughput rate may be reached. This helps to provide uniform quality of service for neighbors and other tenants using the shared resource.

2) commodity hardware units that make up the cloud where the performance is delivered by dynamically distributing the load across multiple computing units and infrastructure components. In this case, the faults are handled by dynamically recycling the affected components.

3) hardware components, including network infrastructure such as routers and load balancers, that sit between the application and the resources and services it uses.

4) clients, when local conditions such as intermittent Internet disconnections affect their ability to reach the service.

Cloud-based services, applications, and solutions must work around these transient failures because they are hard to eliminate.

First, they must have a built-in retry mechanism, which can be applied at varying scopes, from an individual system call up to an entire API implementation.

Second, they should determine whether the operation is suitable for retrying. Retry operations where the fault is transient and there is at least some likelihood that the operation will succeed when reattempted. Transient faults are usually identifiable from the error codes returned by the calls in which they originate.

Third, the retry count and interval must be decided for this to work. Common strategies include exponential backoff, incremental intervals, regular intervals, immediate retry, and randomization (jitter).
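As an illustration of these strategies, here is a minimal sketch of a retry helper that combines exponential backoff with randomization (full jitter); the TransientError type and the example operation are hypothetical placeholders rather than part of any particular SDK.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical exception raised when an error code indicates a transient fault."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry 'operation' on transient faults using exponential backoff with full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the fault to the caller
            # Exponential backoff capped at max_delay, randomized so that many
            # clients retrying at once do not synchronize into a retry storm.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))

# Example usage (call_remote_service is a stand-in for any operation that can fail transiently):
# result = retry_with_backoff(lambda: call_remote_service(request))
```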

Finally, retry storm antipatterns must be avoided.

Wednesday, December 1, 2021

Continued from previous post

 

The next step would require an increase in the request units (RU) provisioned for this operation. When the RUs are quadrupled, the throughput increases from 19 requests per second to 23 requests per second, and the average latency drops from 669 ms to 569 ms. The maximum throughput is not significantly higher, but all of the 429 errors that were previously encountered are eliminated, which is still a significant win.

There is still sufficient headroom between the provisioned RUs and the actual consumption. At this point we could increase the RUs per partition, but let us review another angle by plotting the number of calls to the database per successful operation. The number of calls drops from 11 to 9, but this should be checked against the actual query plan. It turns out that the database call is a cross-partition query that targets all nine physical partitions. The client must fan out the query to all the partitions and collect the results, but the queries are executed one after the other, so the operation takes as long as the sum of all the individual queries, and the problem will only get worse as the size of the data grows and more physical partitions are added.

If the queries were executed in parallel, the latency would decrease and the throughput would increase; in fact, the gains would be large enough that the throughput would keep pace with the load. One side effect of increasing the throughput is that request unit consumption would also increase, shrinking the headroom between provisioned and consumed RUs. This would eventually call for scaling out the database, but an alternative is to optimize the query. The cross-partition query is a concern, especially since it runs on every operation rather than selectively. The query filters the data by the owner and the time of the call, so switching the collection to a new partition key where the owner ID is the partition key eliminates the cross-partition querying. This dramatically improves the throughput and keeps it more regular, just like the other calls observed in the monitoring data. A consequence of the improved performance is that node CPU utilization also improves; when this happens, we know that the bottleneck has been eliminated.
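As a rough illustration of the difference, here is a minimal sketch of the two query shapes using the azure-cosmos Python SDK (the reference implementation itself may use a different SDK); the account, database, container, and field names are hypothetical, and the exact query options should be verified against the installed SDK version.

```python
from azure.cosmos import CosmosClient

# Hypothetical connection details; in practice these come from configuration.
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("dronedelivery").get_container_client("deliveries")

# Cross-partition query: the SDK must fan the query out to every physical partition.
cross_partition = container.query_items(
    query="SELECT * FROM c WHERE c.ownerId = @owner AND c.timestamp > @since",
    parameters=[{"name": "@owner", "value": "owner-123"},
                {"name": "@since", "value": "2021-11-01T00:00:00Z"}],
    enable_cross_partition_query=True,
)

# Single-partition query: with ownerId as the partition key, the query is routed
# to exactly one physical partition, avoiding the serial fan-out described above.
single_partition = container.query_items(
    query="SELECT * FROM c WHERE c.timestamp > @since",
    parameters=[{"name": "@since", "value": "2021-11-01T00:00:00Z"}],
    partition_key="owner-123",
)
```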

 

Tuesday, November 30, 2021

 This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud. 

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for multiple backend services. 

An example of an application using multiple backend services is a drone delivery application that runs on Azure Kubernetes Service. Customers use a web application to schedule deliveries by drone. The backend services include a delivery service manager that manages deliveries, a drone scheduler that schedules drones for pickup, and a package service manager that manages packages. The orders are not processed synchronously: an ingestion service puts the orders on a queue for processing, and a workflow service coordinates the steps in the workflow. Clients call a REST API to get their latest invoice, which includes a summary of deliveries, packages, and total drone utilization. The information is retrieved from multiple backend services, and the results are aggregated for the user. The clients do not call the backend services directly; instead, the application implements a Gateway Aggregation pattern.

Performance tuning begins with a baseline, usually established with a load test. In this case, a six-node AKS cluster with three replicas for each microservice was deployed for a step load test in which the number of simulated users was stepped up from two to forty over a total duration of 8 minutes. It is observed that as the user load increases, the average throughput in requests per second does not keep up. While no errors are returned to the user, the throughput peaks halfway through the test and then drops off for the remainder. Resource contention, transient errors, and an increase in the rate of exceptions can all contribute to this pattern.

One of the ways to tackle this bottleneck is to review the monitoring data. The average duration of the HTTP calls from the gateway to the backend services is noted. When the durations of the different backend calls are charted, it shows that GetDroneUtilization takes much longer on average, by an order of magnitude. The gateway makes the calls to the backends in parallel, so the slowest operation determines how long the entire request takes to complete.
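To make the fan-out concrete, here is a minimal sketch of the Gateway Aggregation pattern using asyncio; the backend calls are simulated with sleeps and the function names are purely illustrative, not the actual service implementation.

```python
import asyncio

# Hypothetical stand-ins for the three backend calls the gateway aggregates.
async def get_delivery_summary(owner_id: str) -> dict:
    await asyncio.sleep(0.05)                 # simulated fast backend call
    return {"deliveries": 12}

async def get_package_summary(owner_id: str) -> dict:
    await asyncio.sleep(0.05)
    return {"packages": 30}

async def get_drone_utilization(owner_id: str) -> dict:
    await asyncio.sleep(0.50)                 # simulated slow call; it dominates total latency
    return {"droneUtilization": 0.71}

async def get_invoice(owner_id: str) -> dict:
    # Gateway Aggregation: issue the backend calls concurrently and merge the results.
    # Total latency is roughly max(call durations) rather than their sum, but the
    # slowest backend (GetDroneUtilization here) still sets the floor for the request.
    deliveries, packages, utilization = await asyncio.gather(
        get_delivery_summary(owner_id),
        get_package_summary(owner_id),
        get_drone_utilization(owner_id),
    )
    return {**deliveries, **packages, **utilization}

if __name__ == "__main__":
    print(asyncio.run(get_invoice("owner-123")))
```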

As the investigation narrows down to the GetDroneUtilization operation, the Azure Monitor for Containers is leveraged to pull up the resource consumption data for the CPU or memory utilization. Both the average and the maximum values are needed because the average will hide the spikes in utilization. If the overall utilization remains under 80%, this is not likely to be the issue. 

Another chart, showing the response codes from the Delivery service's backend database, reveals that a considerable number of 429 errors are returned from the calls made to the database. Cosmos DB, the backend database in this case, throws this error only when it is temporarily throttling requests, usually because the caller is consuming more request units than provisioned.

Fortunately, this level of focus comes with specific tools to assist with inferences. The Application Insights tool provides end-to-end telemetry for a representative sample of requests. The call to the GetDroneUtilization operation is analyzed for external dependencies. It shows that Cosmos DB returns the 429 error code and that the client waits 672 ms before retrying the operation. This means most of the delay is coming from waits without any corresponding activity. Another chart, plotting request unit consumption per partition against provisioned request units per partition, helps identify the original cause of the 429 errors that precede the waits. It turns out that there are nine partitions provisioned with 100 RUs each, and while the database spreads the operations across the partitions, the RU consumption has exceeded the provisioned RUs.


Monday, November 29, 2021

 

This is a summary of the book titled “13 Things Mentally Strong People Don’t Do” by Amy Morin. The author is a licensed clinical social worker, college psychology instructor, and psychotherapist, and the book is dedicated to all those who strive to become better today than they were yesterday. She cuts to the chase with clear and precise instructions. Some excerpts follow in this summary:

Thoughts, behaviors, and feelings are intertwined. When they work together, the “think positive” approach propels us forward; otherwise, they can create a downward spiral. The points mentioned below are manifestations associated with people who understand this intertwining and become mentally strong. They need not appear tough or ignore their emotions, but they are resilient, more satisfied, and demonstrate enhanced performance.

1.       They don’t waste time feeling sorry for themselves. Self-pity is the classic symptom of the weak and to gain strength, they must avoid this self-destructive behavior by behaving in a manner that makes it hard to feel sorry for themselves. One way to do this is to exchange self-pity for gratitude. The more they journal their gratitude, the stronger they become.

2.       They don’t give away their power.  There is always a buffer between the stimulus and their response. They do not let others offend them, turn them or trigger a knee-jerk reaction. Retaining their power is about being confident about who they are and the choices they make. Identifying the people who have taken their power and reframing their language helps them in this regard.

3.       They don’t shy away from change. Managing change can be daunting but the successful create a success plan for the change. They behave like the person they want to become. Balancing emotions and rational thoughts help make it easier.

4.       They don’t focus on things they can’t control. They develop a balanced sense of control. They identify their fears. They focus on what they can do which includes influencing people even without controlling them. Insisting on doing everything by themselves goes against their practice.

5.       They don’t worry about pleasing everyone. They identify their values and behave accordingly. They make a note of who they want to please, and it does not include everybody. They practice tolerating uncomfortable emotions.

6.       They don’t fear taking calculated risks. They are aware of the emotional reactions to risk taking and they identify the types of risks that are particularly challenging. They analyze risks before they decide. They also monitor the results so they can learn from each risk.

7.       They don’t dwell on the past. They reflect on the past just enough to learn from it. They move forward even if it is painful. Working through the grief lets them focus on the present and plan. They also find ways to make peace with the past, but they never pretend that it did not happen. They don’t try to undo the past or make up for past mistakes.

8.       They don’t make the same mistakes repeatedly. They acknowledge their personal responsibility for each mistake and even create a written plan to avoid repeating it. They identify the triggers and the warning signs for old behavior patterns and practice self-discipline strategies. They never make excuses or respond impulsively. They never put themselves in situations where they are likely to fail. Resisting temptation is one way to avoid repeating mistakes.

9.       They don’t resent other people’s success. They replace negative thoughts that breed resentment. They celebrate accomplishments, focus on strengths and co-operate rather than compete with everyone. They do not compare themselves to everyone around them or treat them as direct competition.

10.   They don’t give up after the first failure. They view failure as a learning opportunity, and they resolve to try again.  They identify and replace irrational thoughts and they focus on improving their skill rather than showing them off. They do not quit or assume that future attempts will be the same as the past.

11.   They don’t fear alone time. They learn how to appreciate silence and to be alone with their thoughts. They schedule a date with themselves at least once a month and practice mindfulness and meditation regularly. They do not indulge in beliefs that limit them and they do not always keep background noise on.

12.   They don’t feel the world owes them anything. They develop healthy amounts of self-esteem, and they recognize areas of life where they believe they are superior. They focus on what they must give rather than what they must take. They think about other people’s feelings. They are certainly not selfish or egoist.

13.   They don’t expect immediate results. Instead, they create realistic expectations, find accurate ways to measure progress, and celebrate milestones along the way. They don’t limit themselves to believing that if it is not working for them now, they are not making progress. They don’t look for shortcuts.

And finally, a conclusion on maintaining mental strength. This is a continuous process where they monitor their behavior, regulate their emotions and think about their thoughts.

Sunday, November 28, 2021

Part-2

The problem might not be the cluster nodes but the containers or pods, which might be resource-constrained. If the pods also appear healthy, then adding more pods will not solve the problem.

Application Insights might show that the duration of the workflow service’s process operation is 246 ms. The query can even request a breakdown of the processing time for each of the calls to the three backend services. The individual processing times for each of these services might also appear reasonable, leaving the shortfall in request processing unexplained. One key observation is that an overall processing time of about 250 ms indicates a fixed cost that puts an upper bound on how fast messages can be processed serially.

The key to increasing throughput is therefore to facilitate parallel processing of messages. The delays in processing appear to come from network round-trip time waiting for I/O completions, and the orders in the request queue are independent of each other. These two factors allow us to increase parallelism, which we demonstrate by raising MaxConcurrentCalls from its initial value of 1 to 20 and raising the prefetch count from its initial value of 0 to 3000 so that messages are served from a local cache.

The best practices for improving performance with Service Bus messaging also recommend looking at the dead-letter queue. Service Bus atomically retrieves and locks a message as it is processed so that it is not delivered to other receivers; when the lock expires, the message can be delivered to other receivers, and after a configurable maximum number of delivery attempts, Service Bus puts the message in the dead-letter queue for later examination. The workflow service is prefetching a large batch of 3,000 messages, so the time before each prefetched message is processed grows, which results in messages timing out, going back into the queue, and eventually reaching the dead-letter queue. This behavior can also be tracked via the MessageLockLostException. The symptom is mitigated by setting the lock duration to 5 minutes to prevent lock timeouts. The plot of incoming and outgoing messages then confirms that the system is keeping up with the rate of incoming messages. The results of the performance load test show that over the total duration of 8 minutes, the application completed 25,000 operations, with a peak throughput of 72 operations per second, representing a 400% increase in maximum throughput.
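For illustration, here is a rough Python approximation of these concurrency and prefetch settings using the azure-servicebus SDK; the connection string, queue name, and handler body are placeholders, the thread pool stands in for MaxConcurrentCalls, and the reference implementation configures the equivalent settings through its own SDK.

```python
from concurrent.futures import ThreadPoolExecutor
from azure.servicebus import ServiceBusClient

CONNECTION_STR = "<service-bus-connection-string>"   # placeholder
QUEUE_NAME = "ingestion-queue"                       # hypothetical queue name

def handle(message) -> None:
    # Placeholder for the workflow service's per-message processing logic.
    pass

def run() -> None:
    client = ServiceBusClient.from_connection_string(CONNECTION_STR)
    with client, ThreadPoolExecutor(max_workers=20) as pool:              # ~MaxConcurrentCalls = 20
        receiver = client.get_queue_receiver(QUEUE_NAME, prefetch_count=3000)  # ~PrefetchCount = 3000
        with receiver:
            while True:
                batch = receiver.receive_messages(max_message_count=100, max_wait_time=5)
                if not batch:
                    break
                futures = [pool.submit(handle, msg) for msg in batch]     # process messages in parallel
                for msg, fut in zip(batch, futures):
                    fut.result()                      # wait for the handler to finish
                    receiver.complete_message(msg)    # settle before the message lock expires

if __name__ == "__main__":
    run()
```

Note that a large prefetch only helps if messages are completed within the lock duration, which is exactly the dead-letter behavior discussed above.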

While this solution clearly works, repeating the experiment over a much longer period shows that the application cannot sustain this rate. The container metrics show that maximum CPU utilization is close to 100%; at this point the application appears to be CPU bound, so scaling out the cluster might now increase performance, unlike earlier. The new cluster configuration includes 12 nodes, with 3 replicas for the ingestion service, 6 replicas for the workflow service, and 9 replicas for the package, delivery, and drone scheduler services.

To recap, the bottlenecks identified include out-of-memory exceptions in Azure Cache for Redis; lack of parallelism in message processing; insufficient message lock duration, leading to lock timeouts and messages being placed in the dead-letter queue; and CPU exhaustion. The metrics used to detect these bottlenecks include the rate of incoming and outgoing Service Bus messages, the application map in Application Insights, errors and exceptions, custom Log Analytics queries, and CPU and memory utilization for containers.

Saturday, November 27, 2021

Part-1

 This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on performance tuning for distributed business transactions.

An example of an application using distributed transactions is a drone delivery application that runs on Azure Kubernetes Service.  Customers use a web application to schedule deliveries by drone. The backend services include a delivery service manager that manages deliveries, a drone scheduler that schedules drones for pickup, and a package service manager that manages packages. The orders are not processed synchronously.  An ingestion service puts the orders on a queue for processing and a workflow service coordinates the steps in the workflow.

Performance tuning begins with a baseline, usually established with a load test. In this case, a six-node AKS cluster with three replicas for each microservice was deployed for a step load test in which the number of simulated users was stepped up from two to forty.

Since users get a response the moment their request is put on the queue, the front-end processing of requests is not the interesting thing to study; what matters is whether the backend can keep up with the request rate as the number of users increases. A plot of incoming and outgoing messages serves this purpose. When the outgoing messages fall severely behind the incoming messages, action needs to be taken, depending on the errors encountered at the time, since this indicates an ongoing systemic issue. For example, the workflow service might be getting errors from the Delivery service. Let us say the errors indicate that an exception is being thrown due to memory limits in Azure Cache for Redis.

When the cache is added, it resolves a lot of the internal errors seen in the log, but the outbound responses still lag the incoming requests by an order of magnitude. A Kusto query over the logs, computing the throughput of completed messages from data points sampled at 5-second intervals, indicates that the backend is the bottleneck. This can be alleviated by scaling out the backend services (package, delivery, and drone scheduler) to see if throughput increases. The number of replicas is increased from 3 to 6. The load test shows only modest improvement; outgoing messages are still not keeping up with incoming messages. Azure Monitor for containers indicates that the problem is not resource exhaustion, because CPU is underutilized at less than 40% even at the 95th percentile, and memory utilization is under 20%. The problem might not be the cluster nodes but the containers or pods, which might be resource-constrained. If the pods also appear healthy, then adding more pods will not solve the problem.
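As a rough stand-in for that kind of query, here is a minimal Python sketch that bins completion timestamps into 5-second buckets to compute throughput; the timestamps and record layout are hypothetical, and the real analysis was done with a Kusto query against the collected logs.

```python
from collections import Counter
from datetime import datetime

# Hypothetical completion events pulled from the workflow service's logs.
completions = [
    "2021-11-27T10:00:01Z", "2021-11-27T10:00:02Z", "2021-11-27T10:00:04Z",
    "2021-11-27T10:00:06Z", "2021-11-27T10:00:09Z", "2021-11-27T10:00:12Z",
]

def throughput_per_5s(timestamps):
    """Group completion timestamps into 5-second buckets and return messages/second per bucket."""
    buckets = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        bucket = int(t.timestamp()) // 5 * 5          # start of the 5-second window
        buckets[bucket] += 1
    return {datetime.utcfromtimestamp(b).isoformat(): count / 5.0
            for b, count in sorted(buckets.items())}

print(throughput_per_5s(completions))
```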