Cluster computing

Monday, May 8, 2023

Disaster recovery:

Preparedness for an emergency arising from a region-wide failure constitutes a disaster recovery plan. Such plans can be broadly categorized into four approaches, ranging from the low cost and complexity of taking backups to more complex and costlier options involving partially or fully redundant deployments.

On a sliding scale of the tradeoffs mentioned, these can be pegged as follows:

<==================================================================================================>

         | Backup & Restore | Pilot Light | Warm Standby      | Multi-region active-active
RTO/RPO  | in hours         | ~1 hr       | order of minutes  | near real-time
For      | lower priority   | live data   | business critical | towards zero data loss
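The scale above can be read as a lookup from a recovery time target to the cheapest approach that meets it. The following is a minimal sketch of that reading. The function name `pick_tier` and the numeric bounds are assumptions chosen only to mirror "in hours", "~1 hr", "order of minutes", and "near real-time"; they are not figures from any provider.

```python
from datetime import timedelta

# Approaches ordered from cheapest/simplest to costliest/most complex.
# The achievable-RTO values are ASSUMED placeholders that mirror the scale;
# replace them with figures from your own drills.
DR_TIERS = [
    ("Backup & Restore", timedelta(hours=4)),
    ("Pilot Light", timedelta(hours=1)),
    ("Warm Standby", timedelta(minutes=10)),
    ("Multi-region active-active", timedelta(seconds=0)),
]


def pick_tier(required_rto: timedelta) -> str:
    """Return the cheapest approach whose assumed achievable RTO fits the target."""
    for name, achievable_rto in DR_TIERS:
        if achievable_rto <= required_rto:
            return name
    # Fallback for invalid (negative) targets: the most redundant approach.
    return DR_TIERS[-1][0]


if __name__ == "__main__":
    for target in (timedelta(hours=8), timedelta(hours=2), timedelta(minutes=30)):
        print(target, "->", pick_tier(target))
```

Walking the list from the cheapest end keeps the choice biased toward the lowest cost that still satisfies the target, which matches the tradeoff the scale describes.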

One thing to call out here is that it is a myth that the price increases linearly across these options as $, $$, $$$, $$$$. The unit of redundancy is not a full deployment stamp comprising multiple resource types. Instead, the geo-DR or geo-redundancy features are built into the individual resource types, so by careful selection the price of one option can be lower than that of another for different baskets of selections.
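The arithmetic behind this point can be sketched as a sum over per-resource redundancy choices rather than a multiple of a whole stamp. Everything in the snippet is hypothetical: the resource names, the choice names, and especially the price units, which are placeholders and not real cloud prices.

```python
# HYPOTHETICAL price units per resource type and redundancy choice.
# These numbers are illustrative placeholders, not real cloud prices.
PRICE_UNITS = {
    "storage": {"backup": 1, "geo_redundant": 2},
    "database": {"backup": 2, "geo_replica": 6},
    "compute": {"none": 0, "scaled_down_standby": 3},
}


def basket_cost(selection: dict) -> int:
    """Sum the price units for one redundancy choice per resource type."""
    return sum(PRICE_UNITS[resource][choice] for resource, choice in selection.items())


# Basket A keeps compute off in the secondary region but replicates the database.
basket_a = {"storage": "geo_redundant", "database": "geo_replica", "compute": "none"}
# Basket B runs a scaled-down standby but relies on database backups.
basket_b = {"storage": "geo_redundant", "database": "backup", "compute": "scaled_down_standby"}

if __name__ == "__main__":
    print("basket A:", basket_cost(basket_a))
    print("basket B:", basket_cost(basket_b))
```

With these placeholder units, the basket that keeps standby compute running comes out cheaper than the one that does not, because the total depends on which per-resource features are selected and not on the position of the option on the scale.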

The following are the questions to ask when drawing up a DR plan for some of the cloud service products.

Databricks:

How many notebooks, repos, clusters, jobs does your Databricks instance have?

Has all the data been backed up to Azure Storage Accounts?

Do you export your workspace to a repo with the Databricks workspace CLI command?

Do you have many upstream data sources?

Do you need to replicate a lot of data between failover and failback? Would it be possible to be selective about your data or to leverage storage accounts that are accessible from other regions? Please exclude data held in clusters or instances and consider only data sources.

Do you have any control plane data, such as notebooks, source code, job configurations, cluster management settings, and user/group ACL data, that is not already in IaC or GitHub and needs to be replicated to the secondary region?

Do you have any network configuration in the data plane, such as firewall rules or NAT configuration, that must be replicated to the secondary region and is part of tfstate files?

Do you have a priority order for the processes that are critical to the business and must be replicated in the event of a regional, service-wide outage at the cloud service provider?

Do you have streaming data that you ingest via, say, Kafka queues, change data capture streams, file-based continuous processing, or trigger-once file processing? Checkpoints can be replicated too if they are not already in managed storage such as a managed disk or a storage account.

Do you use Managed Disks that are greater than 32 TiB in size?

Do you want to participate in a drill and recovery procedure?
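The answers to the questions above can be recorded and turned into a list of follow-up items. The sketch below shows one way to do that. The class `DatabricksDrAnswers`, its field names, and the wording of the follow-ups are all hypothetical and simply restate the questions; the 32 TiB threshold is taken from the question as asked.

```python
from dataclasses import dataclass


@dataclass
class DatabricksDrAnswers:
    """Hypothetical record of answers to the DR questions for one workspace."""
    data_backed_up_to_storage_accounts: bool
    workspace_exported_to_repo: bool
    control_plane_data_outside_iac: bool
    network_config_needs_replication: bool
    has_streaming_sources: bool
    checkpoints_in_managed_storage: bool
    largest_managed_disk_tib: float


def follow_ups(answers: DatabricksDrAnswers) -> list:
    """Return follow-up items implied by the answers, in question order."""
    items = []
    if not answers.data_backed_up_to_storage_accounts:
        items.append("Back up data to Azure Storage Accounts")
    if not answers.workspace_exported_to_repo:
        items.append("Export the workspace to a repo")
    if answers.control_plane_data_outside_iac:
        items.append("Replicate control plane data to the secondary region")
    if answers.network_config_needs_replication:
        items.append("Replicate data plane network configuration")
    if answers.has_streaming_sources and not answers.checkpoints_in_managed_storage:
        items.append("Replicate streaming checkpoints")
    if answers.largest_managed_disk_tib > 32:
        items.append("Review managed disks larger than 32 TiB")
    return items


if __name__ == "__main__":
    example = DatabricksDrAnswers(True, True, True, False, True, False, 64.0)
    for item in follow_ups(example):
        print("-", item)
```

Keeping the answers in a typed record makes it easy to rerun the same triage after each drill and see which items have been closed.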
