(Continued from previous post)
1. Databricks jobs, notebooks, and ADF pipelines
a. Recommended posture: active-passive, with a warm secondary workspace and parallelized redeployment of orchestration metadata and code. Azure Databricks DR guidance states clearly that recovery typically means standing up a secondary workspace in another region, redeploying jobs and dependencies, and then reestablishing access, while Azure DR guidance recommends keeping recovery procedures automated and idempotent.
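To make "automated and idempotent" concrete, here is a minimal sketch of a rerun-safe redeploy step: reconcile desired job definitions from source control against what the secondary workspace already has, so a second run converges instead of duplicating jobs. The dict shapes and the `plan_redeploy` helper are illustrative assumptions; a real implementation would read current state from the Databricks Jobs API or Terraform state.

```python
# Sketch of an idempotent redeploy planner. `desired` comes from source
# control, `current` from the secondary workspace; both are plain dicts
# keyed by job name (a simplified stand-in for real job definitions).

def plan_redeploy(desired: dict, current: dict) -> dict:
    """Return the actions a rerun-safe deploy would take."""
    create = [name for name in desired if name not in current]
    update = [name for name in desired
              if name in current and current[name] != desired[name]]
    skip = [name for name in desired
            if name in current and current[name] == desired[name]]
    return {"create": create, "update": update, "skip": skip}
```

Running the planner a second time against the state it just produced yields only skips, which is the property that makes retries during a messy failover safe.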
b. A good service mapping: Azure Databricks workspaces in the primary and secondary regions, Repos or CI/CD for notebooks and job definitions, Azure Data Factory for orchestration, private networking, Key Vault-backed secret scopes or linked services, and durable storage for intermediate artifacts and checkpoints. For ADF, I treat pipelines, triggers, linked services, integration runtimes, and global parameters as source-controlled assets, and make sure any self-hosted integration runtimes have failover capacity or a second node outside the blast radius.
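A cheap pre-flight check in the spirit of that mapping is to verify that every linked service a pipeline references actually exists in the secondary factory before failover. The JSON shapes below are simplified stand-ins for real ADF exports, and `missing_linked_services` is a hypothetical helper, not an ADF API:

```python
# Compare pipeline-level linked-service references against the linked
# services deployed in the target factory; return anything unresolved.

def missing_linked_services(pipelines, linked_services):
    present = {ls["name"] for ls in linked_services}
    missing = set()
    for p in pipelines:
        for ref in p.get("linkedServiceRefs", []):
            if ref not in present:
                missing.add(ref)
    return sorted(missing)
```

Wiring a check like this into CI against the exported ARM templates catches the classic failure mode where pipelines deploy cleanly but their connections do not exist yet.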
c. Recommended RTO/RPO bands: RTO 30–120 minutes and RPO 5–60 minutes, depending on how much data can be replayed versus how much must be preserved at the exact point of failure. If the workload is batch-oriented and upstream systems can replay, I tolerate a larger RPO; if it drives downstream finance or operational reporting, I keep RPO much tighter and checkpoint more frequently.
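The link between RPO and checkpoint frequency is simple arithmetic: a failure can strike just before the next checkpoint, so worst-case loss is roughly the checkpoint interval plus any replication lag, and that sum must fit inside the RPO target. A back-of-the-envelope sketch (the function is mine, not a platform API):

```python
# Largest checkpoint interval (minutes) that still honors an RPO target,
# assuming worst-case loss = checkpoint interval + replication lag.

def max_checkpoint_interval(rpo_minutes: float,
                            replication_lag_minutes: float = 0.0) -> float:
    interval = rpo_minutes - replication_lag_minutes
    if interval <= 0:
        raise ValueError("Replication lag alone exceeds the RPO target")
    return interval
```

For example, a 60-minute RPO with 10 minutes of storage replication lag leaves at most a 50-minute checkpoint cadence; a 5-minute RPO effectively forces near-continuous checkpointing.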
d. Failover checklist: freeze primary job execution; snapshot or validate source data state; confirm the secondary workspace, cluster policies, libraries, identities, and secrets are in place; redeploy notebooks, jobs, triggers, and pipeline definitions; validate integration runtime connectivity; run a small canary job first; confirm read/write access to landing zones and sinks; then resume scheduled batch flows only after checksum or row-count verification.
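The final verification gate can be sketched as comparing a row count plus an order-insensitive checksum per table before schedules resume. This is a minimal illustration under the assumption that rows can be materialized as tuples; in Databricks the inputs would come from Spark reads against each sink, and `table_fingerprint`/`verify` are hypothetical names:

```python
import hashlib

# Fingerprint a table as (row_count, checksum). Summing per-row hashes
# makes the checksum independent of row order, and the count catches
# simple duplicate or truncation mismatches.

def table_fingerprint(rows):
    count = 0
    digest = 0
    for row in rows:
        count += 1
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest = (digest + int(h[:16], 16)) % (1 << 64)
    return count, digest

def verify(primary_rows, secondary_rows):
    """True when both sides agree on count and checksum."""
    return table_fingerprint(primary_rows) == table_fingerprint(secondary_rows)
```

A canary run plus a `verify` pass on the critical sinks is a reasonable gate before re-enabling triggers; anything stricter (full column-level diffs) can follow asynchronously.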