Disaster recovery using Azure DNS and Traffic Manager:
This is a
continuation of a series of articles on operational engineering aspects of
Azure public cloud computing that most recently included networking
discussions on Azure DNS, which is a full-fledged, generally available service.
This article
focuses on disaster recovery using Azure DNS and Azure Traffic Manager. The
purpose of disaster recovery is to restore functionality after a severe loss for
the application. The level of restored functionality may be graded as
unavailable, partially available or fully available. A multi-region
architecture provides some fault tolerance and resiliency against application
or infrastructure failures by facilitating a failover. Region redundancy helps
achieve failover and high availability, but the approaches to disaster recovery
might vary from business to business. The following are some of the options.
-
Active-passive with cold standby: In this failover
solution, the VMs and other applications in the standby environment are
not active until there is a need for failover. Backups, VM images and Resource
Manager templates continue to be replicated, usually to a different region. This
is cost-effective but takes time to complete a failover.
-
Active-passive with pilot light: This
failover solution sets up a standby environment with minimal
configuration. The setup has only the
necessary services running to support a minimal, critical set of
applications. In its default state it can only execute minimal functionality,
but it can scale up and launch more services to take the bulk of the production
load if a failover occurs. Data mirroring can be set up with a site-to-site VPN.
-
Active-passive with warm standby: This solution
is set up so that it can take up a base load and initiate scaling until all
instances are up and running. The solution isn’t scaled to take the full
production workload, but it is functional. It is an enhancement over the pilot
light approach but falls short of a full-blown deployment.
Two requirements that come from this planning deserve
callouts. Firstly, a deployment
mechanism must be used to replicate instances, data and configurations between
the primary and standby environments. The recovery can be done natively or
with third-party services. Secondly, a solution must be developed to divert
network/web traffic from the primary site to the secondary site. This type of
disaster recovery can be achieved via Azure DNS, Azure Traffic Manager (which
is DNS-based) or third-party global load balancers.
The Azure DNS
manual failover solution for disaster recovery uses the standard DNS mechanism
to fail over to the backup site. It assumes that both the primary and the
secondary endpoints have static IP addresses that don’t change often, that an
Azure DNS zone exists for both the primary and secondary sites, and that the TTL
is at or below the RTO SLA set in the organization. Since the DNS server is
outside the failover or disaster zone, it is not impacted by any downtime. The
user is merely required to make a flip. The solution is scripted, and the low
TTL set against the zone ensures that no resolver around the world caches the
record for long periods. For cold standby and pilot light, since some prewarming
activity is involved, enough time must be given before making the flip. The use
of Azure Traffic Manager automates this flip when both the primary and the
secondary have a full deployment complete with cloud services and a
synchronized database. Traffic Manager routes new requests to the
secondary region on service disruption. By virtue of its built-in probes for
various types of health checks, Azure Traffic Manager detects the failure and
uses its rules engine to perform the failover.
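
The two failover styles described above can be sketched in a few lines of
Python. This is only an illustrative model and does not call any Azure API: the
names RecordSet, Endpoint, ttl_fits_rto, manual_flip and pick_endpoint are
hypothetical, the IP addresses come from documentation ranges, and the rule
that prewarming time plus one full TTL should fit inside the RTO is an
assumption layered on the article's "TTL at or below the RTO" condition. The
automated part assumes a priority-style rule in which the lowest priority
number among healthy endpoints wins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordSet:
    """Hypothetical stand-in for a DNS record set in a zone."""
    name: str
    target_ip: str
    ttl_seconds: int


def ttl_fits_rto(record: RecordSet, rto_seconds: int, prewarm_seconds: int = 0) -> bool:
    """Assumed rule: prewarming plus one full TTL of caching must fit in the RTO."""
    return prewarm_seconds + record.ttl_seconds <= rto_seconds


def manual_flip(record: RecordSet, secondary_ip: str) -> RecordSet:
    """Manual failover: repoint the record at the secondary site, keeping the TTL."""
    return RecordSet(record.name, secondary_ip, record.ttl_seconds)


@dataclass
class Endpoint:
    """Hypothetical endpoint as seen by a probe-driven traffic router."""
    name: str
    priority: int  # assumption: lower number means more preferred
    healthy: bool  # outcome of the most recent health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Automated failover: return the most preferred healthy endpoint, if any."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority) if healthy else None


if __name__ == "__main__":
    # Manual flip: check the TTL budget first, then repoint the record.
    www = RecordSet("www", "203.0.113.10", ttl_seconds=30)
    if ttl_fits_rto(www, rto_seconds=300, prewarm_seconds=120):
        www = manual_flip(www, "198.51.100.20")
    print(www)

    # Automated flip: a failed probe on the primary shifts traffic to the secondary.
    sites = [Endpoint("primary", 1, healthy=False), Endpoint("secondary", 2, healthy=True)]
    print(pick_endpoint(sites))
```

In the manual case an operator or script decides when to call manual_flip, so
the prewarming time for cold standby or pilot light has to be budgeted
explicitly. In the automated case the decision follows directly from probe
results, which is why it suits deployments where the secondary is already fully
provisioned.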