Disaster recovery:
A disaster recovery (DR) plan is the preparedness for an emergency arising
from a region-wide failure. DR plans can be broadly categorized into four
approaches, ranging from the low cost and low complexity of taking backups to
more complex and costlier options involving partially or fully redundant
deployments.
On a sliding scale of these tradeoffs, the approaches can be placed as
follows:
            <========================================================================>
            Backup & Restore    Pilot Light    Warm Standby         Multi-region active-active
RTO/RPO     in hours            ~1 hr          order of minutes     near real-time
For         lower priority      live data      business critical    towards zero data loss
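The scale above can be read as a lookup from a recovery-time target to the least costly approach that meets it. The following is a minimal sketch of that reading, not a sizing tool: the numeric bounds (`max_rto_minutes`) are assumptions chosen only to match "hours", "~1 hr", "minutes", and "near real-time", and the names `Approach` and `cheapest_approach` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    name: str
    max_rto_minutes: float  # assumed upper bound on recovery time, illustrative only
    suited_for: str


# Ordered from cheapest and slowest to costliest and fastest, as on the scale.
# The numeric bounds are assumptions; replace them with your own targets.
APPROACHES = [
    Approach("Backup & Restore", 24 * 60, "lower priority"),
    Approach("Pilot Light", 60, "live data"),
    Approach("Warm Standby", 10, "business critical"),
    Approach("Multi-region active-active", 1, "towards zero data loss"),
]


def cheapest_approach(rto_target_minutes: float) -> Approach:
    """Return the first (least costly) approach whose assumed RTO meets the target."""
    for approach in APPROACHES:
        if approach.max_rto_minutes <= rto_target_minutes:
            return approach
    # Nothing is fast enough on paper; fall back to the most redundant option.
    return APPROACHES[-1]


if __name__ == "__main__":
    for target in (24 * 60, 240, 15, 0.5):
        print(target, "->", cheapest_approach(target).name)
```

With these assumed bounds, a four-hour target lands on Pilot Light, because Backup & Restore is only assumed to recover within a day.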
One thing to call out is that it is a myth that the price increases linearly
from one approach to the next, as in $, $$, $$$, $$$$.
The unit of redundancy is not a full deployment stamp comprising multiple
resource types. Instead, the geo-DR or geo-redundancy features are built into
the individual resource types, so with careful selection the price of one
approach can be lower than that of another for different baskets of
selections.
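To make the basket argument concrete, here is a small sketch that totals per-resource redundancy choices. Every price in it is invented for illustration and is not taken from any provider's price list. The option names and the `basket_cost` helper are hypothetical as well.

```python
# Hypothetical monthly prices per resource type and redundancy option.
# These figures are made up for illustration; substitute real quotes.
PRICES = {
    "storage": {"local_plus_backup": 40, "geo_redundant": 55},
    "database": {"backup_only": 30, "geo_replica": 200},
    "compute": {"scaled_down_standby": 90, "full_second_region": 400},
}


def basket_cost(selection: dict[str, str]) -> int:
    """Sum the price of the chosen redundancy option for each resource type."""
    return sum(PRICES[resource][option] for resource, option in selection.items())


# A pilot-light style basket: data kept live through a database geo-replica.
pilot_light = {"storage": "geo_redundant", "database": "geo_replica"}

# A warm-standby style basket for a workload with no database:
# data sits in geo-redundant storage and compute runs scaled down.
warm_standby = {"storage": "geo_redundant", "compute": "scaled_down_standby"}

if __name__ == "__main__":
    print("pilot light:", basket_cost(pilot_light))
    print("warm standby:", basket_cost(warm_standby))
```

Under these assumed prices the warm-standby basket comes out cheaper than the pilot-light one, because the two baskets contain different resource types.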
The following are the questions to ask when drawing up a DR plan for some of
the cloud service products.
Databricks:
How many notebooks, repos, clusters, jobs does your
Databricks instance have?
Has all the data been backed up to Azure Storage Accounts?
Do you export your workspace to a repo with the databricks
workspace CLI command?
Do you have many upstream data sources? 
Do you need to replicate a lot of data between failover and
failback? Would it be possible to be selective about your data or leverage
storage accounts that are accessible from other regions? Exclude data held in
clusters or instances and count only the data sources.
Do you have any control plane data, such as notebooks, source
code, job configurations, cluster management settings, and user/group ACL data,
that is not already in IaC or GitHub and needs to be replicated to the
secondary region?
Do you have any network configuration in the data plane, such as
firewall rules or NAT configuration, that must be replicated to the secondary
region and is part of tfstate files?
Have you prioritized the processes that are critical to the
business and must be replicated in the event of a region-wide outage at the
cloud service provider?
Do you have streaming data that you ingest via, say, Kafka
queues, a change data capture stream, file-based continuous processing, or
trigger-once file processing? Checkpoints can be replicated too if they are not
already in managed storage such as a managed disk or a storage account.
Do you use Managed Disks that are greater than 32 TiB in
size?
Do you want to participate in a drill and recovery
procedure?
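For the workspace-export question above, one way to script the export is a thin wrapper around the Databricks CLI. This is a sketch under assumptions: it presumes the CLI is installed and authenticated, and that your version spells the subcommand `export-dir` (the legacy CLI spells it `export_dir`). The function names and the paths are hypothetical.

```python
import shlex
import subprocess


def build_export_command(source_path: str, target_dir: str,
                         subcommand: str = "export-dir") -> list[str]:
    """Build the CLI call that exports a workspace folder to a local directory.

    The default subcommand name is an assumption about the installed CLI;
    pass "export_dir" for the legacy CLI.
    """
    return ["databricks", "workspace", subcommand, source_path, target_dir]


def export_workspace(source_path: str, target_dir: str,
                     dry_run: bool = True) -> list[str]:
    """Print the command in dry-run mode, otherwise run it and fail loudly."""
    cmd = build_export_command(source_path, target_dir)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical paths: export the workspace root into a local folder
    # that is tracked by the repo used for DR.
    export_workspace("/", "./workspace-backup")
```

Keeping `dry_run=True` as the default lets the command be reviewed, for example during a drill, before anything is executed against the workspace.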
 