(continued from previous article)
Heavy Kubernetes Apps + MySQL/CosmosDB/PostgreSQL + Airflow
Recommended posture: active-passive at the cluster level, with database-specific DR depending on the backend and strict separation between platform recovery and data recovery. Azure’s AKS guidance favors mirrored clusters across regions for true regional resilience, and the deployment-stamp pattern is a strong fit because each stamp can be destroyed and redeployed as a unit if necessary.
A good service mapping is: AKS in both regions, Azure Container Registry replicated or globally reachable, GitOps or CI/CD for manifests and Helm charts, Azure Backup or equivalent backup for cluster state and persistent volumes, and a database DR strategy that matches the engine. Cosmos DB is the easiest to make multi-region resilient when configured for multi-region writes or automatic failover; PostgreSQL can use geo-restore or read replicas; MySQL needs its own backup/replication plan tuned to the service tier and business tolerance. Airflow’s scheduler, metadata database, DAGs, connections, and secret backends must be recoverable as first-class assets, otherwise the cluster can come back but orchestration remains broken.
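One way to treat the Airflow pieces as first-class assets is to keep an explicit inventory of where each one would be recovered from, and to flag anything without a documented source. The sketch below is only illustrative: the asset names come from the paragraph above, while `missing_recovery_sources` and the example source descriptions are hypothetical and not part of any Azure or Airflow API.

```python
from typing import Dict, List

# Airflow assets named above as needing their own recovery path.
AIRFLOW_ASSETS = (
    "scheduler",
    "metadata database",
    "DAGs",
    "connections",
    "secret backends",
)


def missing_recovery_sources(inventory: Dict[str, str]) -> List[str]:
    """Return the Airflow assets that have no documented recovery source.

    `inventory` maps an asset name to a free-text description of where it
    would be restored from (hypothetical format, for illustration only).
    """
    return [
        asset
        for asset in AIRFLOW_ASSETS
        if not inventory.get(asset, "").strip()
    ]


if __name__ == "__main__":
    # Example inventory; the descriptions are placeholders, not recommendations.
    example = {
        "scheduler": "Helm chart in Git",
        "metadata database": "database DR plan",
        "DAGs": "Git repository",
        "connections": "",
    }
    print(missing_recovery_sources(example))
```

An empty result means every asset has at least a named recovery source; it says nothing about whether that source has actually been tested.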
Recommended RTO/RPO bands: an RTO of 30–180 minutes and an RPO of 5–30 minutes suit many platform workloads, but the database choice sets the practical floor. If Cosmos DB multi-region is used, I aim for the tighter end of both bands; if my clients rely on database restore rather than continuous replication, both RPO and RTO will be noticeably larger, because data written since the last backup is lost and recovery has to wait for the restore to finish.
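To make the bands actionable after a DR drill, the measured recovery time and data-loss window can be compared against them. The following is a minimal sketch using the 30–180 minute RTO and 5–30 minute RPO bands quoted above; the `Band` class, `assess_drill` function, and the label strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Band:
    """A target range in minutes (inclusive at both ends)."""
    low_min: float
    high_min: float

    def classify(self, measured_min: float) -> str:
        if measured_min < self.low_min:
            return "tighter than band"
        if measured_min <= self.high_min:
            return "within band"
        return "exceeds band"


# Bands taken from the text above.
RTO_BAND = Band(30, 180)
RPO_BAND = Band(5, 30)


def assess_drill(rto_min: float, rpo_min: float) -> Dict[str, str]:
    """Classify measured drill results against the recommended bands."""
    return {
        "rto": RTO_BAND.classify(rto_min),
        "rpo": RPO_BAND.classify(rpo_min),
    }


if __name__ == "__main__":
    # Example: recovery took 95 minutes, and 45 minutes of data were lost.
    print(assess_drill(rto_min=95, rpo_min=45))
```

A restore-based database path would typically show up here as an RPO that exceeds the band even when the RTO looks acceptable.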
Failover checklist: I check the database path first and validate the chosen DR mechanism. I then verify the secondary AKS cluster, node pools, ingress, DNS, secrets, and registry access; restore or confirm app config maps, secrets, and persistent volumes; redeploy the Airflow scheduler and its metadata database bindings; run a test DAG; and validate service endpoints, autoscaling, and background jobs. I cut traffic over only after application and job health are confirmed.
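The checklist can be sketched as an ordered series of gates, where traffic cutover is allowed only if every gate passes. This is a hedged illustration rather than a real runbook: the step names paraphrase the checklist, the database check is placed first following the "check the database path first" note, and `run_gates` plus the probe callables are hypothetical stand-ins for whatever health checks an environment actually uses.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Ordered gates paraphrased from the checklist; database path goes first.
FAILOVER_STEPS: List[str] = [
    "database DR mechanism validated",
    "secondary AKS cluster, node pools, ingress, DNS, secrets, registry access verified",
    "config maps, secrets, persistent volumes restored or confirmed",
    "Airflow scheduler and metadata database bindings redeployed",
    "test DAG run succeeded",
    "service endpoints, autoscaling, background jobs validated",
]


def run_gates(probes: Dict[str, Callable[[], bool]]) -> Tuple[bool, Optional[str]]:
    """Run each gate in order and stop at the first failure.

    Returns (True, None) when every gate passes, meaning traffic may be cut
    over, or (False, step) naming the first gate that failed or had no probe.
    """
    for step in FAILOVER_STEPS:
        probe = probes.get(step)
        if probe is None or not probe():
            return False, step
    return True, None


if __name__ == "__main__":
    # Placeholder probes; real ones would query the cluster, database, and Airflow.
    probes = {step: (lambda: True) for step in FAILOVER_STEPS}
    probes["test DAG run succeeded"] = lambda: False
    ok, failed_step = run_gates(probes)
    print("cut traffic" if ok else f"hold traffic, failed at: {failed_step}")
```

Treating a missing probe as a failure is a deliberate choice in this sketch: an unchecked step should block cutover rather than pass silently.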