(Continued from previous post)
1. Databricks jobs, notebooks, and ADF pipelines
a. Recommended posture: active-passive, with a warm secondary workspace and parallelized redeployment of orchestration metadata and code. Azure Databricks DR guidance states clearly that recovery typically means standing up a secondary workspace in another region, redeploying jobs and dependencies, and then reestablishing access, while Azure DR guidance recommends keeping recovery procedures automated and idempotent.
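To make "automated and idempotent" concrete, here is a minimal sketch of a rerun-safe redeploy step: reconcile desired job definitions from source control against what the secondary workspace already has, so a second run converges instead of duplicating jobs. The dict shapes and the `plan_redeploy` helper are illustrative assumptions; a real implementation would read current state from the Databricks Jobs API or Terraform state.

```python
# Sketch of an idempotent redeploy planner. `desired` comes from source
# control, `current` from the secondary workspace; both are plain dicts
# keyed by job name (a simplified stand-in for real job definitions).

def plan_redeploy(desired: dict, current: dict) -> dict:
    """Return the actions a rerun-safe deploy would take."""
    create = [name for name in desired if name not in current]
    update = [name for name in desired
              if name in current and current[name] != desired[name]]
    skip = [name for name in desired
            if name in current and current[name] == desired[name]]
    return {"create": create, "update": update, "skip": skip}
```

Running the planner a second time against the state it just produced yields only skips, which is the property that makes retries during a messy failover safe.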
b. A good service mapping: Azure Databricks workspaces in the primary and secondary regions, Repos or CI/CD for notebooks and job definitions, Azure Data Factory for orchestration, private networking, Key Vault-backed secret scopes or linked services, and durable storage for intermediate artifacts and checkpoints. For ADF, I treat pipelines, triggers, linked services, integration runtimes, and global parameters as source-controlled assets, and make sure any self-hosted integration runtimes have failover capacity or a second node outside the blast radius.
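A cheap pre-flight check in the spirit of that mapping is to verify that every linked service a pipeline references actually exists in the secondary factory before failover. The JSON shapes below are simplified stand-ins for real ADF exports, and `missing_linked_services` is a hypothetical helper, not an ADF API:

```python
# Compare pipeline-level linked-service references against the linked
# services deployed in the target factory; return anything unresolved.

def missing_linked_services(pipelines, linked_services):
    present = {ls["name"] for ls in linked_services}
    missing = set()
    for p in pipelines:
        for ref in p.get("linkedServiceRefs", []):
            if ref not in present:
                missing.add(ref)
    return sorted(missing)
```

Wiring a check like this into CI against the exported ARM templates catches the classic failure mode where pipelines deploy cleanly but their connections do not exist yet.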
c. Recommended RTO/RPO bands: RTO 30–120 minutes and RPO 5–60 minutes, depending on how much data can be replayed versus how much must be preserved at the exact point of failure. If the workload is batch-oriented and upstream systems can replay, I tolerate a larger RPO; if it drives downstream finance or operational reporting, I keep RPO much tighter and checkpoint more frequently.
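The link between RPO and checkpoint frequency is simple arithmetic: a failure can strike just before the next checkpoint, so worst-case loss is roughly the checkpoint interval plus any replication lag, and that sum must fit inside the RPO target. A back-of-the-envelope sketch (the function is mine, not a platform API):

```python
# Largest checkpoint interval (minutes) that still honors an RPO target,
# assuming worst-case loss = checkpoint interval + replication lag.

def max_checkpoint_interval(rpo_minutes: float,
                            replication_lag_minutes: float = 0.0) -> float:
    interval = rpo_minutes - replication_lag_minutes
    if interval <= 0:
        raise ValueError("Replication lag alone exceeds the RPO target")
    return interval
```

For example, a 60-minute RPO with 10 minutes of storage replication lag leaves at most a 50-minute checkpoint cadence; a 5-minute RPO effectively forces near-continuous checkpointing.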
d. Failover checklist: freeze primary job execution; snapshot or validate source data state; confirm the secondary workspace, cluster policies, libraries, identities, and secrets are in place; redeploy notebooks, jobs, triggers, and pipeline definitions; validate integration runtime connectivity; run a small canary job first; confirm read/write access to landing zones and sinks; then resume scheduled batch flows only after checksum or row-count verification.
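The final verification gate can be sketched as comparing a row count plus an order-insensitive checksum per table before schedules resume. This is a minimal illustration under the assumption that rows can be materialized as tuples; in Databricks the inputs would come from Spark reads against each sink, and `table_fingerprint`/`verify` are hypothetical names:

```python
import hashlib

# Fingerprint a table as (row_count, checksum). Summing per-row hashes
# makes the checksum independent of row order, and the count catches
# simple duplicate or truncation mismatches.

def table_fingerprint(rows):
    count = 0
    digest = 0
    for row in rows:
        count += 1
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest = (digest + int(h[:16], 16)) % (1 << 64)
    return count, digest

def verify(primary_rows, secondary_rows):
    """True when both sides agree on count and checksum."""
    return table_fingerprint(primary_rows) == table_fingerprint(secondary_rows)
```

A canary run plus a `verify` pass on the critical sinks is a reasonable gate before re-enabling triggers; anything stricter (full column-level diffs) can follow asynchronously.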