Wednesday, April 8, 2026

 (continued from previous article)

    1. Heavy Kubernetes Apps + MySQL/CosmosDB/PostgreSQL + Airflow 

    1. Recommended posture: active-passive at the cluster level, with database-specific DR depending on the backend and strict separation between platform recovery and data recovery. Azure’s AKS guidance favors mirrored clusters across regions for true regional resilience, and the deployment-stamp pattern is a strong fit because each stamp can be destroyed and redeployed as a unit if necessary. 

    1. A good service mapping is: AKS in both regions, Azure Container Registry geo-replicated or otherwise reachable from both regions, GitOps or CI/CD for manifests and Helm charts, Azure Backup or an equivalent backup for cluster state and persistent volumes, and a database DR strategy that matches the engine. Cosmos DB is the easiest to make resilient across regions when configured for multi-region writes or automatic failover; PostgreSQL can use geo-restore or read replicas; MySQL needs its own backup/replication plan tuned to the service tier and business tolerance. Airflow's scheduler, metadata database, DAGs, connections, and secret backends must be recoverable as first-class assets. Otherwise the cluster can come back while orchestration remains broken. 
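The mapping above can be kept as a small, checkable inventory rather than only as prose. The following is a minimal sketch: the engine-to-strategy table paraphrases the paragraph above, and the names `DB_DR_STRATEGY`, `AIRFLOW_ASSETS`, `missing_airflow_assets`, and `dr_plan_gaps` are hypothetical helpers invented for illustration, not part of any Azure or Airflow API.

```python
# Engine-to-DR mapping paraphrased from the text above (illustrative only).
DB_DR_STRATEGY = {
    "cosmosdb": "multi-region writes or automatic failover",
    "postgresql": "geo-restore or read replicas",
    "mysql": "backup/replication plan tuned to service tier and business tolerance",
}

# Airflow assets the text says must be recoverable as first-class assets.
AIRFLOW_ASSETS = (
    "scheduler",
    "metadata database",
    "DAGs",
    "connections",
    "secret backends",
)


def missing_airflow_assets(covered):
    """Return Airflow assets absent from `covered`, in the order listed above."""
    covered_set = {name.lower() for name in covered}
    return [asset for asset in AIRFLOW_ASSETS if asset.lower() not in covered_set]


def dr_plan_gaps(engine, covered_airflow_assets):
    """List human-readable gaps in one stamp's DR plan (hypothetical helper)."""
    gaps = []
    if engine.lower() not in DB_DR_STRATEGY:
        gaps.append(f"no DR strategy recorded for database engine '{engine}'")
    for asset in missing_airflow_assets(covered_airflow_assets):
        gaps.append(f"Airflow asset not covered: {asset}")
    return gaps


if __name__ == "__main__":
    # Example: a plan that backs up DAGs and the scheduler config but nothing else.
    for gap in dr_plan_gaps("postgresql", ["scheduler", "DAGs"]):
        print(gap)
```

Running a check like this in CI is one way to notice that, for example, connections or secret backends were never added to the recovery plan.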

    1. Recommended RTO/RPO bands: RTO 30–180 minutes and RPO 5–30 minutes for many platform workloads, but the database choice changes the practical floor. If Cosmos DB multi-region is used, aim for the tighter end of each band; if the workload relies on database restore rather than continuous replication, both the RPO and the RTO will be noticeably larger. 
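One way to make these bands actionable is to compare the numbers measured in a DR drill against them. The sketch below assumes measurements in minutes; the `Band` class, the labels `tight`, `within`, and `exceeded`, and the function `assess_drill` are hypothetical names chosen for illustration. Only the band limits come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """An inclusive target band in minutes."""
    low_minutes: int
    high_minutes: int

    def classify(self, measured_minutes):
        """Return 'tight', 'within', or 'exceeded' for a measured value."""
        if measured_minutes <= self.low_minutes:
            return "tight"      # at or better than the tight end of the band
        if measured_minutes <= self.high_minutes:
            return "within"     # inside the recommended band
        return "exceeded"       # worse than the band allows


# Band limits taken from the recommendation above.
RTO_BAND = Band(low_minutes=30, high_minutes=180)
RPO_BAND = Band(low_minutes=5, high_minutes=30)


def assess_drill(rto_minutes, rpo_minutes):
    """Classify one drill's measured RTO and RPO against the bands."""
    return {
        "rto": RTO_BAND.classify(rto_minutes),
        "rpo": RPO_BAND.classify(rpo_minutes),
    }


if __name__ == "__main__":
    # Example: a replication-backed drill versus a restore-based drill.
    print(assess_drill(rto_minutes=45, rpo_minutes=4))
    print(assess_drill(rto_minutes=240, rpo_minutes=60))
```

A restore-based database path would typically land in `exceeded` on this scale, which is the signal to either widen the stated targets or move to continuous replication.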

    1. Failover checklist: verify the secondary AKS cluster, node pools, ingress, DNS, secrets, and registry access; restore or confirm app config maps, secrets, and persistent volumes; check the database path before any component that depends on it and validate the chosen DR mechanism; redeploy the Airflow scheduler and metadata database bindings; run a test DAG; validate service endpoints, autoscaling, and background jobs; then cut traffic only after application and job health are confirmed. 
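The checklist above is essentially an ordered series of gates with traffic cutover as the final, conditional step. A minimal sketch of that control flow follows. The function `run_failover` and the stub checks are hypothetical; in practice each check would call into whatever tooling verifies the cluster, the database, or Airflow, which is not shown here.

```python
def run_failover(checks, cut_traffic):
    """Run (name, check) pairs in order and stop at the first failure.

    `cut_traffic` is called only if every check returns True.
    Returns (traffic_was_cut, names_of_checks_that_passed).
    """
    passed = []
    for name, check in checks:
        if not check():
            # A failed gate blocks everything after it, including cutover.
            return False, passed
        passed.append(name)
    cut_traffic()
    return True, passed


if __name__ == "__main__":
    # Stub checks standing in for real verification steps (assumptions).
    checks = [
        ("secondary AKS cluster, ingress, DNS, registry", lambda: True),
        ("config maps, secrets, persistent volumes", lambda: True),
        ("database path and DR mechanism", lambda: True),
        ("Airflow scheduler and metadata bindings", lambda: True),
        ("test DAG run", lambda: True),
        ("endpoints, autoscaling, background jobs", lambda: True),
    ]
    ok, passed = run_failover(checks, cut_traffic=lambda: print("traffic cut over"))
    print("failover complete" if ok else f"stopped after: {passed}")
```

Keeping cutover outside the list of checks makes it harder to shift traffic by accident while a gate such as the test DAG is still failing.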
