Chaos engineering
Administrators will find this section familiar. There are technical considerations that emphasize design and service-level objectives, as well as the scale of the solution, but drills, testing, and validation are essential to guarantee smooth operations. Chaos engineering is one such method for testing the reliability of an infrastructure solution. While the reliability of individual resources might be the public cloud provider's concern, the reliability of the deployments falls on the IaC authors and deployers. In contrast to other infrastructure concerns such as security, which is mitigated in theory and design through Zero Trust and least-privilege principles, chaos engineering is all about trying things out with drills in controlled environments. By the time a deployment of infrastructure is planned, business considerations have included the following:
1. Understanding what kind of solution is being created, such as business-to-business, business-to-consumer, or enterprise software.
2. Defining the tenants in terms of number and growth plans.
3. Defining the pricing model and ensuring it aligns with the tenants’ consumption of Azure resources.
4. Understanding whether the tenants need to be separated into different tiers and, based on the customer’s requirements, deciding on the tenancy model.
And a well-architected review would have addressed the five key pillars: 1) Reliability, 2) Security, 3) Cost Optimization, 4) Operational Excellence, and 5) Performance Efficiency. Performance usually depends on external factors and is closely tied to customer satisfaction; continuous telemetry and responsiveness are essential to keeping performance tuned.
Security and reliability are operational concerns. When exercising deployments to test reliability, it is often necessary to inject faults and bring down components to check how the rest of the system behaves. The idea of injecting failure has a parallel in security hardening in the form of penetration testing. The difference is that security testing is geared toward exploitation, while reliability testing is geared toward increasing the mean time between failures and shortening the time to recovery.
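A minimal sketch of such a drill is shown below, assuming hypothetical hooks for the steady-state probe and for fault injection; check_steady_state, inject_fault, and remove_fault stand in for whatever chaos and monitoring tooling the environment provides and are not a real API.

```python
import time

# Hypothetical hooks: in practice these would call the chaos tooling and
# monitoring stack available in the environment (for example, stopping an
# instance or querying a health endpoint). They are placeholders.
def check_steady_state() -> bool:
    """Return True while the system meets its steady-state hypothesis,
    e.g., error rate below 1% and p95 latency under 500 ms."""
    raise NotImplementedError

def inject_fault() -> None:
    """Bring down the target component (zone, instance, dependency)."""
    raise NotImplementedError

def remove_fault() -> None:
    """Restore the target component."""
    raise NotImplementedError

def run_experiment(max_wait_s: int = 600, poll_s: int = 10) -> float:
    """Inject a fault and measure how long the system takes to return to
    its steady state; the result feeds MTBF/MTTR reliability metrics."""
    assert check_steady_state(), "steady state must hold before the drill"
    inject_fault()
    started = time.monotonic()
    try:
        while time.monotonic() - started < max_wait_s:
            if check_steady_state():
                return time.monotonic() - started  # observed time to recover
            time.sleep(poll_s)
        raise RuntimeError("system did not recover within the drill window")
    finally:
        remove_fault()
```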
Component-down testing is quite drastic: it can involve powering down an entire zone. There are many network connections to and from cloud resources, so it becomes hard to find an alternative to a component that is down. A multi-tiered approach is needed to make the design robust against a component going down, and mitigations often involve workarounds and diverting traffic to healthy redundancies.
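As a rough illustration of the "divert traffic to healthy redundancies" mitigation, the sketch below assumes a list of zone-level endpoints and a pluggable health probe; the zone names and addresses are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Endpoint:
    zone: str          # e.g., "zone-1", "zone-2", "zone-3"
    address: str       # backend address for this redundancy
    priority: int = 0  # lower value is preferred when healthy

def pick_healthy_endpoint(endpoints: List[Endpoint],
                          is_healthy: Callable[[Endpoint], bool]) -> Optional[Endpoint]:
    """Return the preferred endpoint whose health probe passes.

    When the primary zone is powered down during a component-down drill,
    traffic falls through to the next healthy redundancy; None signals a
    total outage that the drill should surface.
    """
    for candidate in sorted(endpoints, key=lambda e: e.priority):
        if is_healthy(candidate):
            return candidate
    return None

# Example: a probe stub that treats zone-1 as down, as a drill would.
if __name__ == "__main__":
    backends = [Endpoint("zone-1", "10.0.1.4", 0),
                Endpoint("zone-2", "10.0.2.4", 1),
                Endpoint("zone-3", "10.0.3.4", 2)]
    chosen = pick_healthy_endpoint(backends, lambda e: e.zone != "zone-1")
    print("routing traffic to", chosen.address if chosen else "nowhere")
```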
Having multi-region deployments of components not only improves reliability but also draws business from the neighborhood of the new presence. A geographical presence for a public cloud is only somewhat different from that of a private cloud. A public cloud lists the regions where the services it offers are hosted. A region may have three availability zones for redundancy and availability, and each zone may host a variety of cloud computing resources, small or big. Each availability zone may have one or more stadium-sized datacenters. Once the infrastructure is established, the process of commissioning services in a new region is referred to as a buildout. Including appropriate buildouts increases the reliability of the system in the face of failures.
Component-down testing for chaos engineering differs from business continuity and disaster recovery planning in that one discovers reliability problems while the other determines acceptable mitigations. One cannot do without the other.
For a complete picture of the reliability of infrastructure deployments, additional considerations become necessary. These include:
First, the automation must involve context switching between the platform and the task of deploying each service to the region. The platform coordinates these services and must maintain order, dependencies, and status across the tasks (see the first sketch after this list).
Second, the task for each service is itself complicated and requires definitions in terms of region-specific parameters applied to an otherwise region-agnostic service model (see the second sketch after this list).
Third, each service must declare its dependencies declaratively so that they can be validated and processed correctly. These dependencies may be on other services, on external resources, or on the availability of an event from another activity; the first sketch after this list shows how such declarations can be validated and ordered.
Fourth, the service buildouts must be retryable on errors and exceptions; otherwise the platform will require a lot of manual intervention, which increases cost (see the third sketch after this list).
Fifth, the communication between the automated activities and manual interventions must be captured with the help of a ticket-tracking or incident-management system.
Sixth, the workflow and the state for each activity pertaining to the task must follow standard operating procedures that are defined independently of the region and are available for audit (see the fourth sketch after this list).
Seventh, the technologies for platform execution and for deploying the services might differ, requiring consolidation and coordination between the two. In such cases, the fewer the context switches between them, the better.
Eighth, the platform itself must support templates, event publishing and subscription, a metadata store, and onboarding and bootstrapping processes that can be reused (see the fifth sketch after this list).
Ninth, the platform should support parameters that allow a region to be differentiated from others, or that improve customer satisfaction in terms of the features or services available.
Tenth, the progress of the buildout of new regions must be actively tracked, for example with tiles for individual tasks and rows per service (see the last sketch below).
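To make the first and third considerations concrete, here is a minimal sketch that assumes each service buildout declares its dependencies in a simple manifest; the service names and the use of Python's graphlib for ordering are illustrative choices, not a prescribed platform design.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical manifests: each service declares what it depends on.
manifests = {
    "networking": [],
    "storage":    ["networking"],
    "compute":    ["networking"],
    "database":   ["storage", "compute"],
    "frontend":   ["database"],
}

def validate(manifests: dict) -> None:
    """Every declared dependency must itself be a known service."""
    known = set(manifests)
    for service, deps in manifests.items():
        missing = set(deps) - known
        if missing:
            raise ValueError(f"{service} depends on undeclared services: {missing}")

def deployment_order(manifests: dict) -> list:
    """Derive an order that respects the declared dependencies."""
    try:
        return list(TopologicalSorter(manifests).static_order())
    except CycleError as err:
        raise ValueError(f"circular dependency in manifests: {err}") from err

validate(manifests)
status = {name: "pending" for name in manifests}
for name in deployment_order(manifests):
    status[name] = "deploying"   # the platform would hand off to the service task here
    status[name] = "done"
print(status)
```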
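For the second consideration, here is a rough sketch of overlaying region-specific parameters on a region-agnostic service model; the parameter names, regions, and the shallow-merge approach are assumptions made for illustration.

```python
# A region-agnostic service model overlaid with region-specific parameters.
# The parameter names, values, and regions below are invented.
BASE_MODEL = {
    "sku": "standard",
    "instance_count": 3,
    "zone_redundant": True,
    "diagnostics": {"retention_days": 30},
}

REGION_OVERRIDES = {
    "westeurope":    {"instance_count": 5},
    "southeastasia": {"zone_redundant": False,        # fewer zones assumed available
                      "diagnostics": {"retention_days": 90}},
}

def render_model(region: str) -> dict:
    """Merge region-specific overrides into the region-agnostic model."""
    overrides = REGION_OVERRIDES.get(region, {})
    model = {**BASE_MODEL, **overrides}
    # Nested sections are merged shallowly here; a real platform would
    # likely use a deep merge or a templating engine instead.
    if "diagnostics" in overrides:
        model["diagnostics"] = {**BASE_MODEL["diagnostics"], **overrides["diagnostics"]}
    return model

print(render_model("southeastasia"))
```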
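For the fourth consideration, here is a sketch of a retryable buildout step with jittered exponential backoff; the attempt count and delays are arbitrary, and a real platform would classify which exceptions are actually transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(step: Callable[[], T], attempts: int = 5,
          base_delay_s: float = 2.0, max_delay_s: float = 60.0) -> T:
    """Run a buildout step, retrying on transient errors with jittered
    exponential backoff so the platform needs less manual intervention."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:                      # a real platform would narrow this
            if attempt == attempts:
                raise                          # escalate to manual intervention
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")
```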
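For the sixth consideration, here is a sketch of a region-independent, auditable state machine for a single activity; the state names and transitions are assumed for illustration rather than taken from any specific standard operating procedure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed transitions for one activity, defined once and reused in every
# region so the workflow stays region-independent and auditable.
TRANSITIONS = {
    "pending":     {"in_progress"},
    "in_progress": {"blocked", "completed", "failed"},
    "blocked":     {"in_progress"},
    "failed":      {"in_progress"},            # retry after remediation
    "completed":   set(),
}

@dataclass
class Activity:
    name: str
    state: str = "pending"
    audit_log: list = field(default_factory=list)

    def transition(self, new_state: str, actor: str, note: str = "") -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {new_state} violates the SOP")
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "from": self.state, "to": new_state,
            "actor": actor, "note": note,
        })
        self.state = new_state

act = Activity("provision database")
act.transition("in_progress", actor="automation")
act.transition("completed", actor="automation", note="buildout succeeded")
print(act.audit_log)
```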
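For the eighth consideration, here is a minimal in-memory publish/subscribe sketch; a production platform would use a durable broker and a metadata store, and the topic and event shapes here are invented.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

class EventBus:
    """A tiny in-memory publish/subscribe hub for illustration only."""
    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
# A downstream service buildout waits on an upstream completion event.
bus.subscribe("buildout.completed",
              lambda e: print(f"unblocking dependents of {e['service']}"))
bus.publish("buildout.completed", {"service": "networking", "region": "westeurope"})
```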
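Finally, for the tenth consideration, here is a sketch that renders buildout progress as rows per service and tiles per task; the services, task names, and statuses are placeholders standing in for whatever the tracking dashboard actually records.

```python
# Progress for a region buildout: one row per service, one tile per task.
TASKS = ["plan", "deploy", "validate", "enable"]
progress = {
    "networking": {"plan": "done", "deploy": "done", "validate": "done", "enable": "done"},
    "storage":    {"plan": "done", "deploy": "done", "validate": "running", "enable": "pending"},
    "database":   {"plan": "done", "deploy": "pending", "validate": "pending", "enable": "pending"},
}

SYMBOL = {"done": "■", "running": "▶", "pending": "□"}

def render(progress: dict) -> str:
    header = "service".ljust(12) + " ".join(t.ljust(10) for t in TASKS)
    rows = [header]
    for service, tiles in progress.items():
        cells = " ".join(f"{SYMBOL[tiles[t]]} {tiles[t]}".ljust(10) for t in TASKS)
        rows.append(service.ljust(12) + cells)
    return "\n".join(rows)

print(render(progress))
```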
#codingexercise: CodingExercise-10-24-2024.docx