Saturday, June 26, 2021

 Zone Down failure simulation: 

Introduction:  A public cloud enables geo-redundancy by offering many geographical regions where resources can be allocated. A region comprises several availability zones, and resources can be allocated redundantly across them. If the resources fail in one zone, they are switched out with those from another zone. A zone may comprise several datacenters, each of which may be a stadium-sized facility that provisions compute, storage, and networking. When the user requests a resource from the public cloud, it usually has 99.99% availability. When resources are provisioned with zone redundancy, their availability increases to 99.999%. Verifying that resources fail over to an alternate zone when one goes down is key to measuring that improvement in availability. This has been a manual exercise so far. This article explores the options for automating that testing from a resource perspective. 

Description: 

Though they sound like availability sets, availability zones (AZs) are different: an AZ comprises datacenters with independent power, cooling, and networking, whereas an availability set is a logical grouping of virtual machines. An AZ acts as both a fault domain and an update domain, so changes do not occur in all zones at the same time. Services that support availability zones fall into two categories: zonal services, where a resource is pinned to a specific zone, and zone-redundant services, where the platform replicates automatically across zones. Business continuity is established with a combination of zones and Azure region pairs. 

The availability zones can be queried from the SDK or the CLI, and they are mere numbers within a location. For example, az vm list-skus --location eastus2 --output table lists VM SKUs by region and zone. The zones are identified by numbers such as 1, 2, 3, and these mean nothing beyond the fact that the zones are distinct. The numbers don’t change for the lifetime of the zone, but they have no direct correlation to physical zone representations. 
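As a rough illustration of working with that listing programmatically, the sketch below reads the same data in JSON form (az vm list-skus --location eastus2 --output json) and maps each VM SKU to its zone numbers. It assumes each entry carries name, resourceType, and a locationInfo list with location and zones fields; that shape should be checked against real output before relying on it. The function name zones_by_sku, the SKU names, and the sample data are hypothetical and hand-written for the example.

```python
import json
from typing import Dict, List


def zones_by_sku(skus: List[dict], location: str) -> Dict[str, List[str]]:
    """Map each VM SKU name to the sorted zone numbers it offers in `location`.

    `skus` is assumed to be the parsed JSON list printed by
    `az vm list-skus --location <location> --output json`.
    """
    result: Dict[str, List[str]] = {}
    for sku in skus:
        # Skip non-VM SKUs (for example disks) if they appear in the listing.
        if sku.get("resourceType") != "virtualMachines":
            continue
        for info in sku.get("locationInfo") or []:
            if (info.get("location") or "").lower() == location.lower():
                # Zone numbers are opaque labels; sorting is only for readability.
                result[sku["name"]] = sorted(info.get("zones") or [])
    return result


# Hand-written sample in the assumed shape, not real CLI output.
SAMPLE = json.loads("""
[
  {"name": "Example_SKU_A", "resourceType": "virtualMachines",
   "locationInfo": [{"location": "eastus2", "zones": ["3", "1", "2"]}]},
  {"name": "Example_SKU_B", "resourceType": "virtualMachines",
   "locationInfo": [{"location": "eastus2", "zones": []}]},
  {"name": "Example_Disk", "resourceType": "disks",
   "locationInfo": [{"location": "eastus2", "zones": ["1"]}]}
]
""")

print(zones_by_sku(SAMPLE, "eastus2"))
```

An empty zone list in this sketch stands for a SKU that is not offered in any zone of that location, which a test harness could use to skip SKUs that cannot take part in a zone-down simulation.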

Individual zone-resilient services provide their own ways to configure zone redundancy. 

When a service allows in-place migration of a resource from zonal to zone-redundant, or allows changing the number of zones for the resource it provisions, simulating zone-down behavior is as straightforward as asking the service to reconfigure the resource with exactly the zones it should have. For example, the resource could start with [“1”, “2”, “3”], and to simulate the failure of zone “3”, it could be reprovisioned with [“1”, “2”]. This in-place migration is not expected to cause any downtime for the resource because “1” and “2” remain part of the configuration. The re-provisioning can also rotate through the zones, involving only a source and target zone pair at each step; since there are three zones, the resource can always be accessed from one zone or another.  
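The reconfiguration loop described above can be sketched as follows. This is a minimal sketch, not any service's actual API: apply_zones and is_healthy are hypothetical callbacks that the test author would supply, the first asking the service to reprovision the resource with a given zone list and the second probing whether the resource is still reachable. All function names here are illustrative.

```python
from typing import Callable, Dict, List, Sequence


def remaining_zones(zones: Sequence[str], down: str) -> List[str]:
    """Return the zone list with `down` removed, keeping the original order."""
    if down not in zones:
        raise ValueError(f"zone {down!r} is not in {list(zones)!r}")
    rest = [z for z in zones if z != down]
    if not rest:
        raise ValueError("cannot take down the only configured zone")
    return rest


def simulate_zone_down(
    zones: Sequence[str],
    down: str,
    apply_zones: Callable[[List[str]], None],  # hypothetical: reconfigure the resource
    is_healthy: Callable[[], bool],            # hypothetical: probe the resource
) -> bool:
    """Reconfigure without `down`, probe the resource, then restore all zones."""
    apply_zones(remaining_zones(zones, down))
    try:
        return is_healthy()
    finally:
        # Always restore the full configuration, even if the probe raises.
        apply_zones(list(zones))


def rotate_zone_down(
    zones: Sequence[str],
    apply_zones: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
) -> Dict[str, bool]:
    """Simulate each zone going down in turn; map zone -> probe result."""
    return {z: simulate_zone_down(zones, z, apply_zones, is_healthy) for z in zones}


# Demo with stand-ins: record requested configurations instead of calling a service.
history: List[List[str]] = []
results = rotate_zone_down(["1", "2", "3"], history.append, lambda: True)
print(results)
print(history)
```

Passing the reconfiguration and the health probe in as callbacks keeps the rotation logic independent of any particular service, so the same loop could be reused for each resource type that supports in-place zone changes.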

Conclusion: Zone down can be simulated when there is adequate support from the services that provide the resource. 

Reference the earlier discussion on this topic: https://1drv.ms/w/s!Ashlm-Nw-wnWzhemFZTD0rT35pTS?e=kTGWox

