- Overview/Purpose:
This pattern describes how to provide business continuity and disaster
recovery for your virtual machines so that their state can be recovered after
a user or application error, a regional data center outage, or another
unplanned disruption.
- Solution Design
Use Azure Site Recovery to:
1. Continuously replicate the VMs to a different target region.
2. Set up a replication policy with:
   a. Recovery point retention set to 1 day.
   b. App-consistent snapshot frequency set to 0 hours.
3. Use a Recovery Services vault, the vault type that Site Recovery requires,
   to manage replication and recovery points.
4. In the replication settings, specify the target location, the target
   subscription (which can be the same as the source subscription), the target
   resource group, the failover virtual network and subnet, the new replica
   managed disk at the destination, the cache storage at the source (a
   Standard storage account), the availability options for each VM, and the
   capacity reservation.
5. After the VMs are enabled for replication, check the health status of each
   VM under Replicated items.
Do not set up VM replication for Databricks VMs or other commodity compute
that does not persist processing state.
Fail over to the secondary region during an outage, and fail back to the
primary region after the outage ends.
With Azure Site Recovery:
Feature | Cost | RTO | RPO |
Fail over and fail back | Free for the first 31 days. Azure storage, storage transactions, and data transfers incur charges. A recovered VM might also incur compute charges. | Varies from a few minutes up to 2 hours | Recovery points can be as frequent as every hour. |
Azure Backup is complementary to Site Recovery. Azure Backup allows granular
backup and restore of specific data, while Site Recovery protects an entire
site, with automation and orchestration that make the failover and failback
process seamless.
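As a rough illustration of how Azure Backup can sit alongside Site Recovery, the following sketch enrolls the same VM in a daily backup policy. It assumes the `azurerm_virtual_machine.vm` and `azurerm_resource_group.primary` resources used in the Terraform section of this pattern. The vault name, policy name, schedule, and retention count are hypothetical values chosen for illustration. A separate vault in the primary region is used because Azure Backup protects a VM from a vault in the VM's own region.

```hcl
# Assumption: a dedicated backup vault in the same region as the source VM.
resource "azurerm_recovery_services_vault" "backup" {
  name                = "example-backup-vault" # hypothetical name
  location            = azurerm_resource_group.primary.location
  resource_group_name = azurerm_resource_group.primary.name
  sku                 = "Standard"
}

# Assumption: one backup per day, kept for 7 days. Adjust to your requirements.
resource "azurerm_backup_policy_vm" "daily" {
  name                = "daily-vm-backup" # hypothetical name
  resource_group_name = azurerm_resource_group.primary.name
  recovery_vault_name = azurerm_recovery_services_vault.backup.name

  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 7
  }
}

# Attach the policy to the VM that Site Recovery also replicates.
resource "azurerm_backup_protected_vm" "vm" {
  resource_group_name = azurerm_resource_group.primary.name
  recovery_vault_name = azurerm_recovery_services_vault.backup.name
  source_vm_id        = azurerm_virtual_machine.vm.id
  backup_policy_id    = azurerm_backup_policy_vm.daily.id
}
```

With both in place, Azure Backup would cover file-level or point-in-time restores, while Site Recovery would handle regional failover.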
If possible, run a test failover (a disaster recovery drill) to validate your
changes.
Terraform to apply:
Definitions:
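The definitions in this section reference two resource groups and a network interface that are not shown. A minimal sketch of those prerequisites might look like the following. The resource group names, regions, address ranges, and the virtual network and subnet are assumptions made for illustration; only the Terraform addresses `azurerm_resource_group.primary`, `azurerm_resource_group.secondary`, and `azurerm_network_interface.vm` come from this pattern.

```hcl
# Assumption: authentication and subscription come from your environment.
provider "azurerm" {
  features {}
}

# Source (primary) region resource group. Name and region are illustrative.
resource "azurerm_resource_group" "primary" {
  name     = "example-vm-primary"
  location = "West US"
}

# Target (secondary) region resource group. Name and region are illustrative.
resource "azurerm_resource_group" "secondary" {
  name     = "example-vm-secondary"
  location = "East US"
}

# Hypothetical source network for the VM.
resource "azurerm_virtual_network" "primary" {
  name                = "network1"
  resource_group_name = azurerm_resource_group.primary.name
  location            = azurerm_resource_group.primary.location
  address_space       = ["192.168.1.0/24"]
}

resource "azurerm_subnet" "primary" {
  name                 = "network1-subnet"
  resource_group_name  = azurerm_resource_group.primary.name
  virtual_network_name = azurerm_virtual_network.primary.name
  address_prefixes     = ["192.168.1.0/24"]
}

# Network interface referenced by the VM definition.
resource "azurerm_network_interface" "vm" {
  name                = "vm-nic"
  location            = azurerm_resource_group.primary.location
  resource_group_name = azurerm_resource_group.primary.name

  ip_configuration {
    name                          = "vm"
    subnet_id                     = azurerm_subnet.primary.id
    private_ip_address_allocation = "Dynamic"
  }
}
```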
# Source VM in the primary region that Site Recovery replicates.
resource "azurerm_virtual_machine" "vm" {
  name                  = "vm"
  location              = azurerm_resource_group.primary.location
  resource_group_name   = azurerm_resource_group.primary.name
  vm_size               = "Standard_B1s"
  network_interface_ids = [azurerm_network_interface.vm.id]

  storage_image_reference {
    publisher = "OpenLogic"
    offer     = "CentOS"
    sku       = "7.5"
    version   = "latest"
  }

  storage_os_disk {
    name              = "vm-os-disk"
    os_type           = "Linux"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "Premium_LRS"
  }

  # Example credentials only. Do not hard-code real credentials.
  os_profile {
    admin_username = "test-admin-123"
    admin_password = "test-pwd-123"
    computer_name  = "vm"
  }

  os_profile_linux_config {
    disable_password_authentication = false
  }
}
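
# The vault is created in the secondary region so that it stays available
# when the primary region is down.
resource "azurerm_recovery_services_vault" "vault" {
  name                = "example-recovery-vault"
  location            = azurerm_resource_group.secondary.location
  resource_group_name = azurerm_resource_group.secondary.name
  sku                 = "Standard"
}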

resource "azurerm_site_recovery_fabric" "primary" {
  name                = "primary-fabric"
  resource_group_name = azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = azurerm_resource_group.primary.location
}

resource "azurerm_site_recovery_fabric" "secondary" {
  name                = "secondary-fabric"
  resource_group_name = azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = azurerm_resource_group.secondary.location
}

resource "azurerm_site_recovery_protection_container" "primary" {
  name                 = "primary-protection-container"
  resource_group_name  = azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
}

resource "azurerm_site_recovery_protection_container" "secondary" {
  name                 = "secondary-protection-container"
  resource_group_name  = azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
}

resource "azurerm_site_recovery_replication_policy" "policy" {
  name                = "policy"
  resource_group_name = azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name

  # Both values are expressed in minutes: 1 day of retention and an
  # app-consistent snapshot every 4 hours.
  recovery_point_retention_in_minutes                  = 24 * 60
  application_consistent_snapshot_frequency_in_minutes = 4 * 60
}
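
The definitions above stop at the replication policy. To actually enable replication, the policy has to be tied to the two protection containers and the VM has to be registered as a replicated item. The following is a sketch of those remaining pieces under stated assumptions, not a definitive configuration. The secondary virtual network and subnet, the cache storage account name, and all resource labels introduced here (`mapping`, `cache`, `vm_replication`) are hypothetical. The sketch sets the failover network directly through `target_network_id` instead of using a separate network mapping resource.

```hcl
# Assumption: failover network and subnet in the secondary region.
resource "azurerm_virtual_network" "secondary" {
  name                = "network2"
  resource_group_name = azurerm_resource_group.secondary.name
  location            = azurerm_resource_group.secondary.location
  address_space       = ["192.168.2.0/24"]
}

resource "azurerm_subnet" "secondary" {
  name                 = "network2-subnet"
  resource_group_name  = azurerm_resource_group.secondary.name
  virtual_network_name = azurerm_virtual_network.secondary.name
  address_prefixes     = ["192.168.2.0/24"]
}

# Cache storage at the source: a Standard storage account in the primary region.
# Assumption: the name is illustrative and must be globally unique.
resource "azurerm_storage_account" "cache" {
  name                     = "primaryrecoverycache"
  location                 = azurerm_resource_group.primary.location
  resource_group_name      = azurerm_resource_group.primary.name
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

# Ties the primary container to the secondary container under the policy.
resource "azurerm_site_recovery_protection_container_mapping" "mapping" {
  name                                      = "container-mapping"
  resource_group_name                       = azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name                      = azurerm_site_recovery_fabric.primary.name
  recovery_source_protection_container_name = azurerm_site_recovery_protection_container.primary.name
  recovery_target_protection_container_id   = azurerm_site_recovery_protection_container.secondary.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
}

# Registers the VM for replication to the secondary region.
resource "azurerm_site_recovery_replicated_vm" "vm_replication" {
  name                                      = "vm-replication"
  resource_group_name                       = azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name               = azurerm_site_recovery_fabric.primary.name
  source_vm_id                              = azurerm_virtual_machine.vm.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
  source_recovery_protection_container_name = azurerm_site_recovery_protection_container.primary.name

  target_resource_group_id                = azurerm_resource_group.secondary.id
  target_recovery_fabric_id               = azurerm_site_recovery_fabric.secondary.id
  target_recovery_protection_container_id = azurerm_site_recovery_protection_container.secondary.id
  target_network_id                       = azurerm_virtual_network.secondary.id

  # Replica managed disk at the destination, staged through the cache account.
  managed_disk {
    disk_id                    = azurerm_virtual_machine.vm.storage_os_disk[0].managed_disk_id
    staging_storage_account_id = azurerm_storage_account.cache.id
    target_resource_group_id   = azurerm_resource_group.secondary.id
    target_disk_type           = "Premium_LRS"
    target_replica_disk_type   = "Premium_LRS"
  }

  # Failover subnet for the VM's network interface.
  network_interface {
    source_network_interface_id = azurerm_network_interface.vm.id
    target_subnet_name          = azurerm_subnet.secondary.name
  }

  depends_on = [
    azurerm_site_recovery_protection_container_mapping.mapping,
  ]
}
```

Availability options and capacity reservation from the replication settings are left out of this sketch; add them according to your target region's requirements.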