Tuesday, December 7, 2021

Event driven architectural style for cloud computing

 

A web service for the cloud must be well suited for the business purpose it serves, not only in its functionality but also in the non-functional aspects recorded in its Service-Level Agreements. The choice of architecture for a web service contributes significantly to this. This post reviews the Event-Driven architectural style.

Event-Driven architecture consists of event producers and event consumers: producers generate a stream of events, and consumers listen for those events.

The scale-out can be adjusted to suit the demands of the workload, and events can be responded to in near real time. Producers and consumers are isolated from one another. In some extreme cases, such as IoT, events must be ingested at very high volumes. There is scope for a high degree of parallelism since the consumers run independently and in parallel, although each consumer remains coupled to the event contract it handles. Network latency for delivering events from producers to consumers is kept to a minimum. Consumers can be added as necessary without impacting existing ones. A minimal sketch of the producer/consumer relationship follows.
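The sketch below is a minimal in-process illustration of that relationship; the Python queue stands in for whatever broker actually carries the events, and the device and reading names are assumptions made for the example.

```python
import queue
import threading

# In-process stand-in for the event stream; in practice this would be a broker.
events = queue.Queue()

def producer(device_id: str, readings) -> None:
    """Emits one event per reading; the producer knows nothing about consumers."""
    for value in readings:
        events.put({"device": device_id, "value": value})

def consumer(name: str) -> None:
    """Listens for events and reacts as they arrive."""
    while True:
        event = events.get()
        if event is None:          # sentinel used to stop the consumer
            break
        print(f"{name} handled {event}")
        events.task_done()

worker = threading.Thread(target=consumer, args=("consumer-1",))
worker.start()
producer("sensor-42", [21.5, 21.7, 21.9])
events.put(None)                   # signal shutdown
worker.join()
```

Because the producer only writes to the queue, more consumers can be attached without touching the producer at all.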

Some of the benefits of this architecture include the following: The publishers and subscribers are decoupled. There are no point-to-point integrations, and it is easy to add new consumers to the system. Consumers can respond to events immediately as they arrive. These systems are highly scalable and distributed, and subsystems can have independent views of the event stream.

Some of the challenges faced with this architecture include the following: Event loss is tolerated by default, so guaranteed delivery poses a challenge, and some IoT traffic mandates guaranteed delivery. Processing events in order or exactly once is also difficult: each consumer type typically runs in multiple instances for resiliency and scalability, which poses a challenge if the processing logic is not idempotent or the events must be processed in order. A sketch of an idempotent consumer follows.
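One common mitigation for duplicate delivery is to make the consumer idempotent. The sketch below assumes each event carries an ID; the in-memory set is an illustrative stand-in for deduplication state that a real system would persist.

```python
# Idempotent consumer sketch: redelivered or duplicated events have no extra effect.
processed_ids = set()   # in practice this state would live in a durable store

def handle(event: dict) -> None:
    event_id = event["id"]
    if event_id in processed_ids:
        return                      # duplicate delivery: safe to ignore
    # ... apply the business logic exactly once ...
    processed_ids.add(event_id)

handle({"id": "evt-001", "payload": "create order"})
handle({"id": "evt-001", "payload": "create order"})   # redelivery, ignored
```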

Some of the best practices for this architectural style include the following: Events should be lean, not bloated. Services should share only IDs and/or a timestamp; transferring large volumes of data between services is an antipattern. Loosely coupled event-driven systems are best. The sketch below contrasts a lean event with a bloated one.
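The field names in this sketch are illustrative assumptions; the point is that the lean event shares only an identifier and a timestamp, and the consumer looks up the full record from the system of record only when it needs it.

```python
# Antipattern: the event carries the whole record, coupling consumers to its shape.
bloated_event = {
    "order_id": "o-123",
    "customer": {"name": "...", "address": "...", "history": ["..."]},
}

# Lean event: identifier and timestamp only.
lean_event = {
    "order_id": "o-123",
    "occurred_at": "2021-12-07T10:15:00Z",
}

def on_order_created(event, order_store):
    # Fetch the details on demand from the system of record.
    order = order_store.get(event["order_id"])
    return order
```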

Examples of this architectural style include edge computing and IoT traffic. It works well for automations that rely heavily on asynchronous backend processing, and it is useful where ordering, retries, and dead-letter queues must be maintained.

Monday, December 6, 2021

Big Compute vs Big Data architectural styles for implementing a cloud service

 

A web service for the cloud must be well suited for the business purpose it serves, not only in its functionality but also in the non-functional aspects recorded in its Service-Level Agreements. The choice of architecture for a web service contributes significantly to this. We review the choices between the Big Compute and Big Data architectural styles.

The Big Compute architectural style refers to workloads that require many cores, such as image rendering, fluid dynamics, financial risk modeling, oil exploration, drug design, and engineering stress analysis. The scale-out of the computational tasks is achieved by their discrete, isolated, and finite nature: some input is taken in raw form and processed into an output. The scale-out can be adjusted to suit the demands of the workload, and the outputs can be conflated, as is customary with map-reduce problems. Most tasks run independently and in parallel; where tasks are tightly coupled and must exchange intermediate results, network latency for message exchanges between them is kept to a minimum. The commodity VMs used from the infrastructure are usually at the higher end of the compute tier. Simulations and number crunching, such as astronomical calculations, involve hundreds if not thousands of such compute units. The sketch below shows the fan-out and conflate shape of these workloads.
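This is a minimal single-machine sketch of that shape: the work is split into discrete, finite chunks, fanned out across cores, and the partial outputs are conflated at the end. The `simulate` function is an assumed stand-in for whatever number crunching the workload actually performs.

```python
from multiprocessing import Pool

def simulate(chunk: range) -> float:
    """Placeholder computation for one isolated task."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    # Split the input into discrete, finite chunks.
    chunks = [range(i, i + 1_000) for i in range(0, 10_000, 1_000)]
    with Pool() as pool:                         # the pool size is the scale-out
        partial_results = pool.map(simulate, chunks)
    total = sum(partial_results)                 # conflate the outputs
    print(total)
```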

Some of the benefits of this architecture include the following: 1) high performance due to the parallelization of tasks, 2) the ability to scale out to an arbitrarily large number of cores, 3) the ability to utilize a wide variety of compute units, and 4) dynamic allocation and deallocation of compute.

Some of the challenges faced with this architecture include the following: Managing the VM architecture, the volume of number crunching, the provisioning of thousands of cores on time and getting diminishing returns from additional cores.

Some of the best practices for this architectural style include the following: Expose a well-designed API to the client. Autoscale to handle changes in load. Cache semi-static data. Use a CDN to host static content. Use polyglot persistence when appropriate. Partition data to improve scalability, reduce contention, and optimize performance.

Examples of this architectural style include applications that use the Azure Batch managed service to run code and data artifacts uploaded to a VM pool. In this case, Azure Batch provisions the VMs, assigns the tasks, and monitors progress, and it can automatically scale out the VMs in response to the workload. When HPC Pack is deployed to Azure, it can burst the HPC cluster to handle peak workloads. A hedged sketch of task submission follows.
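The sketch below assumes an existing Batch pool and job and uses the azure-batch Python SDK; the account name, key, URL, job ID, and the render script are placeholders, and pool creation and artifact upload are omitted.

```python
# Hedged sketch: submit a batch of tasks to an existing Azure Batch job.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("<account-name>", "<account-key>")
client = BatchServiceClient(credentials, "https://<account>.<region>.batch.azure.com")

tasks = [
    batchmodels.TaskAddParameter(
        id=f"render-{i}",
        command_line=f"python render.py --frame {i}",   # hypothetical worker script
    )
    for i in range(100)
]
# Batch schedules the tasks across the nodes of the pool attached to the job.
client.task.add_collection(job_id="render-job", value=tasks)
```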

Sunday, December 5, 2021

The architectural styles for implementing a cloud service. (Continued)

 

Let’s compare the architectural style described in the previous post with the N-Tier architectural style of building services. This style involves many logical layers and physical tiers; it comprises a web tier, messaging, and a middle tier, and it may or may not involve a front end. In the closed style, a layer can only call the layer immediately below it; in the open style, a layer can call any of the layers below it. The sketch below illustrates closed layering.
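This is a small illustrative sketch of closed layering; the class and method names are assumptions. Each layer exposes an interface to the layer above and only calls the layer immediately below, whereas in the open style the web tier could also reach the data tier directly.

```python
class DataTier:
    def get_order(self, order_id: str) -> dict:
        # Would query the database in a real system.
        return {"id": order_id, "status": "shipped"}

class MiddleTier:
    def __init__(self, data: DataTier):
        self.data = data
    def order_summary(self, order_id: str) -> str:
        order = self.data.get_order(order_id)        # only calls the tier below
        return f"Order {order['id']} is {order['status']}"

class WebTier:
    def __init__(self, middle: MiddleTier):
        self.middle = middle
    def handle_request(self, order_id: str) -> str:
        return self.middle.order_summary(order_id)   # never reaches into DataTier directly

web = WebTier(MiddleTier(DataTier()))
print(web.handle_request("o-123"))
```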

Some of the benefits of this architectural style include the following: There is portability between cloud and on-premises, and between cloud platforms. There is a smaller learning curve for most developers. It is a natural evolution from the traditional application model, and it is open to heterogeneous environments (Windows/Linux).

Some of the challenges faced with this architectural style include: The middle tier degenerates to a data access layer that just does CRUD operations on the database which introduces unnecessary latency. There is a monolithic design that prevents independent deployment of features. Managing an IaaS application is more work than an application that uses only managed services. It can be difficult to manage network security in a large system.

Some of the best practices for this architectural style include the following: Changes in load can be accommodated by scaling out. Asynchronous messaging can decouple the tiers. Semi-static data can be cached. The database tier can be configured for high availability, using a solution such as SQL Server Always On availability groups. A web application firewall (WAF) is placed between the front end and the Internet. Each tier is placed in its own subnet, and subnets are used as security boundaries. Access to the data tier is restricted.

Some examples include a simple web application, or an application that migrates an on-premises application to Azure with minimal refactoring, and a unified development of on-premises and cloud applications.

Conclusion: Both these styles serve the purpose of a cloud service very well.

Saturday, December 4, 2021

The architectural styles for implementing a cloud service.

 


Introduction:

A web service for the cloud must be well suited for the business purpose it serves, not only in its functionality but also in the non-functional aspects recorded in its Service-Level Agreements. The choice of architecture for a web service contributes significantly to this. We review the choices between the Web-Queue-Worker architectural style and the N-Tier architectural style.

The web-queue-worker style can absorb latency because user actions are translated into events. It decouples the front end from the worker so that the front end remains responsive to users. Every action taken by the user can be mapped to one form of message or another that is sent to the message queue, which is usually a service bus. The worker can handle plenty of messages and can scale out to catch up with the items in the queue. Each message type is handled by a dedicated handler, and this one-to-one mapping makes the flow easy to follow. A hedged sketch of the flow follows.
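The sketch below shows the two halves of that flow with the azure-servicebus Python SDK (v7); the connection string, queue name, and message body are assumptions, and error handling and scaling concerns are omitted.

```python
import os
from azure.servicebus import ServiceBusClient, ServiceBusMessage

conn_str = os.environ["SERVICE_BUS_CONNECTION_STRING"]   # assumed to be set
queue_name = "user-actions"                              # illustrative queue name

# Front end: translate a user action into a message and return immediately.
with ServiceBusClient.from_connection_string(conn_str) as client:
    with client.get_queue_sender(queue_name=queue_name) as sender:
        sender.send_messages(ServiceBusMessage('{"action": "resize-image", "id": "img-42"}'))

# Worker: drain the queue and process each message out of band.
with ServiceBusClient.from_connection_string(conn_str) as client:
    with client.get_queue_receiver(queue_name=queue_name, max_wait_time=5) as receiver:
        for message in receiver:
            print("processing", str(message))
            receiver.complete_message(message)   # remove the message once handled
```

Because the front end only enqueues, it stays responsive even when the worker falls behind; the worker instances can then be scaled out independently to catch up.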

Some of the benefits of this architecture include the following: 1) It is a relatively simple architecture that is easy to understand. 2) It is easy to deploy and manage. 3) There is a clear separation of concerns. 4) The front end is decoupled from the worker using asynchronous messaging. 5) The front end and the worker can be scaled independently.

Some of the challenges faced with this architecture include the following: The front end and the worker can both become arbitrarily large, monolithic components which increase the maintenance costs. It may also hide dependencies, if the front end and worker share data schemas or code modules.

Some of the best practices for this architectural style include the following: Expose a well-designed API to the client. Autoscale to handle changes in load. Cache semi-static data. Use a CDN to host static content. Use polyglot persistence when appropriate. Partition data to improve scalability, reduce contention, and optimize performance. A small caching sketch follows.
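Caching semi-static data is the easiest of these to illustrate. This is a minimal time-to-live cache; the names and TTL are illustrative rather than taken from any particular library, and a real deployment would more likely use a shared cache such as Redis.

```python
import time

class SemiStaticCache:
    """Caches values that change infrequently, refreshing them after a TTL."""
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}   # key -> (value, expiry timestamp)

    def get(self, key, loader):
        entry = self._store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                       # fresh cache hit
        value = loader(key)                       # miss or stale: reload from the backing store
        self._store[key] = (value, time.time() + self.ttl)
        return value

# Usage with a hypothetical loader function:
catalog_cache = SemiStaticCache(ttl_seconds=600)
# product = catalog_cache.get("product-42", load_product_from_database)
```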

Examples of this architectural style include applications with a relatively simple domain, applications with long-running workflows or batch operations, and deployments that use managed services rather than infrastructure as a service (IaaS).

Friday, December 3, 2021

Comparisons between web-queue-worker and event driven architecture.


Introduction:

This article is a comparison of two architectural styles in building services for the public cloud.

Description:

The two architectural styles correspond to:

1. Event-Driven architectural style: event producers generate a stream of events, and event consumers listen for those events.

2. Microservices architectural style: a collection of small, autonomous services.

Here is a feature-by-feature comparison of the two styles:

Organization

Event-Driven Architecture: Events are produced in near real time, so consumers can respond immediately to events as they occur.

Microservices: Each service is self-contained and implements a single business capability, encapsulating a domain model.

Management

Event-Driven Architecture: Subscribers can be added as necessary without impacting existing ones.

Microservices: Services can be added as necessary without impacting existing ones.

Benefits

Event-Driven Architecture: The publishers and subscribers are decoupled. There are no point-to-point integrations, and it is easy to add new consumers to the system. Consumers can respond to events immediately as they arrive. These systems are highly scalable and distributed, and subsystems can have independent views of the event stream.

Microservices: This is a simple architecture that focuses on end-to-end addition of business capabilities. The services are easy to deploy and manage, and there is a clear separation of concerns. The front end is decoupled from the worker using asynchronous messaging, and the front end and the worker can be scaled independently.

Challenges

Event-Driven Architecture: Event loss is tolerated by default, so guaranteed delivery poses a challenge, and some IoT traffic mandates guaranteed delivery. Processing events in order or exactly once is also difficult: each consumer type typically runs in multiple instances for resiliency and scalability, which poses a challenge if the processing logic is not idempotent or the events must be processed in order.

Microservices: Care must be taken to ensure that the front end and the worker do not become large, monolithic components that are difficult to maintain and update. Shared data schemas or code modules between the front end and the worker can hide dependencies.

Best practices

Event-Driven Architecture: Events should be lean, not bloated. Services should share only IDs and/or a timestamp; transferring large volumes of data between services is an antipattern. Loosely coupled event-driven systems are best.

Microservices: Expose a well-designed API to the client. Autoscale to handle changes in load. Cache semi-static data. Use a CDN to host static content. Use polyglot persistence when appropriate. Partition data to improve scalability.

Examples

Event-Driven Architecture: Edge computing, including IoT traffic. It works well for automations that rely heavily on asynchronous backend processing, and it is useful where ordering, retries, and dead-letter queues must be maintained.

Microservices: Best suited for expanding the backend service portfolio, such as for eCommerce. It works well for transactional processing and deep separation of data access, and it is useful when working with an application gateway, load balancer, and ingress.

Conclusion:

These are some comparisons between the two styles.

Thursday, December 2, 2021

This is a continuation of an article that describes operational considerations for hosting solutions on the Azure public cloud.  

There are several references to best practices throughout the series of articles we wrote from the documentation for the Azure Public Cloud. The previous article focused on the antipatterns to avoid, specifically the noisy neighbor antipattern. This article focuses on transient fault handling.

Transient errors can occur anywhere in the cloud. When the services are hosted on-premises, the performance and availability are provided via redundant and often underutilized hardware, but the components are located close to each other. This reduces the failures from networking though not from power failures or other faults. The cloud provides unparalleled availability, but it might involve network latency and there can be any number of errors resulting from an unreachable network. Other forms of transient failures may come from:

1) many shared resources that are subject to throttling in order to protect the resource. Some services will refuse connections when the load rises to a specific level, or a maximum throughput rate may be reached. This helps to provide uniform quality of service for neighbors and other tenants using the shared resource.

2) commodity hardware units that make up the cloud where the performance is delivered by dynamically distributing the load across multiple computing units and infrastructure components. In this case, the faults are handled by dynamically recycling the affected components.

3) hardware components, including network infrastructure such as routers and load balancers, between the application and the resources and services it uses.

4) clients, when local conditions such as intermittent Internet disconnections affect their ability to reach the service.

Cloud-based services, applications, and solutions must work around these transient failures because they are hard to eliminate.

First, they must have a built-in retry mechanism, which can apply at varying scopes, from an individual system call to an entire API implementation.

Second, they should determine whether the operation is suitable for retrying. Only retry operations where the fault is transient and there is at least some likelihood that the operation will succeed when reattempted; this can usually be determined from the error code returned by the failing call.

Third, the retry count and interval must be decided for this to work. Some strategies include exponential backoff, incremental intervals, regular intervals, immediate retry, and randomization (jitter). A sketch of such a retry helper follows.
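This is a minimal sketch of a retry helper with exponential backoff and jitter. The set of "transient" exceptions and the tuning values are illustrative assumptions; in practice they come from the error codes of the service being called (for example HTTP 429, 503, or timeouts).

```python
import random
import time

TRANSIENT_ERRORS = (TimeoutError, ConnectionError)   # assumed transient fault types

def with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Calls operation(), retrying transient faults with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise                                         # give up, surface the fault
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay))      # jitter helps avoid retry storms

# Usage with a hypothetical call:
# result = with_retries(lambda: fetch_resource("orders/123"))
```

Capping the delay and limiting the attempt count are what keep this from turning into the retry storm antipattern mentioned below.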

Finally, retry storm antipatterns must be avoided.

Wednesday, December 1, 2021

Continued from previous post

 

The next step would require an increase in the Request Units (RUs) provisioned for this operation. When the RUs are quadrupled, the throughput increases from 19 requests per second to 23 requests per second, and the average latency drops from 669 ms to 569 ms. Notice that the maximum throughput is not significantly higher, but the change eliminates all the 429 errors that were encountered. This is still a significant win.

There is still sufficient headroom between the provisioned RUs and the consumed RUs. At this point, we could increase the RUs per partition, but let us review another angle and plot the number of calls to the database per successful operation. The number of calls drops from 11 to 9, and it should match the actual query plan. This implies that the database call was a cross-partition query that targeted all nine physical partitions: the client must fan the query out to all the partitions and collect the results. The queries, however, were executed one after the other, so the operation takes as long as the sum of all the queries, and the problem will only get worse as the data grows and more physical partitions are added.

If the queries were executed in parallel, the latency would decrease, and the throughput would increase. In fact, the gains would be so much that the throughput would keep pace with the load. One of the side effects of increasing the throughput is that the resource unit consumption would increase and the headroom between the provisioned and the consumption would shrink. This would entail a database scale-out of the operation, but an alternative might be to optimize the query. The cross-partition query is a concern especially given that it is being run every time instead of selectively. The query is trying to filter the data based on the owner and the time of the call. Switching the collection to the new partition key where the owner ID is the partition helps mitigate the cross-partition querying. This will dramatically improve the throughput and keep it more regular just like the other calls noticed from the monitoring data. A consequence of the improved performance is that the node CPU utilization is also improved. When this happens, we know that the bottleneck has been eliminated.