Cluster computing

Wednesday, May 31, 2023

Pattern #2: Back up your Virtual Machines

Overview/Purpose:

This pattern describes how to provide business continuity and disaster recovery for your virtual machines so that their state can be recovered after a user or application error, a regional data center outage, or another unplanned disruption.

Solution Design

Use Azure Site Recovery to:

1. Continuously replicate to a different target region.

2. Set up a replication policy with

a. Recovery point retention policy set to 1 day.

b. App-consistent snapshot frequency set to 0 hours.

3. Use a Recovery Services vault to store the recovery points.

4. The replication settings must specify the target location, the target subscription (which can be the same as the source subscription), the target resource group, the failover virtual network and subnet, the new replica managed disk at the destination, the cache storage at the source (a Standard storage account), the availability options for each VM, and the capacity reservation.

5. After the VMs are enabled for replication, we can check the status of VM health under replicated items.

Do not set up VM replication for Databricks VMs or other commodity compute that does not persist processing state.

Fail over to the secondary region and fail back to the primary region during and after outage.

With Azure Site Recovery:

Feature	Cost	RTO	RPO
Fail over and fail back	Free for 31 days, but incurs charges for Azure storage, storage transactions and data transfers. A recovered VM might also incur compute charges.	Varies from a few minutes up to 2 hours	Recovery points can be as frequent as every hour.

Azure Backup is complementary to Site Recovery. Azure Backup allows granular backups and restores of specific data, while Site Recovery protects an entire site, with automation and orchestration that make the failover and failback process seamless.

If possible, run a test drill for your changes.

Terraform to apply:

Definitions:

# The VM to protect. Credentials below are placeholders for illustration only.
resource "azurerm_virtual_machine" "vm" {
  name                  = "vm"
  location              = azurerm_resource_group.primary.location
  resource_group_name   = azurerm_resource_group.primary.name
  vm_size               = "Standard_B1s"
  network_interface_ids = [azurerm_network_interface.vm.id]

  storage_image_reference {
    publisher = "OpenLogic"
    offer     = "CentOS"
    sku       = "7.5"
    version   = "latest"
  }

  storage_os_disk {
    name              = "vm-os-disk"
    os_type           = "Linux"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "Premium_LRS"
  }

  os_profile {
    admin_username = "test-admin-123"
    admin_password = "test-pwd-123"
    computer_name  = "vm"
  }

  os_profile_linux_config {
    disable_password_authentication = false
  }
}

# The vault lives in the secondary (target) region.
resource "azurerm_recovery_services_vault" "vault" {
  name                = "example-recovery-vault"
  location            = azurerm_resource_group.secondary.location
  resource_group_name = azurerm_resource_group.secondary.name
  sku                 = "Standard"
}

# Fabric representing the source region.
resource "azurerm_site_recovery_fabric" "primary" {
  name                = "primary-fabric"
  resource_group_name = azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = azurerm_resource_group.primary.location
}

# Fabric representing the target region.
resource "azurerm_site_recovery_fabric" "secondary" {
  name                = "secondary-fabric"
  resource_group_name = azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = azurerm_resource_group.secondary.location
}

# Protection container in the source fabric.
resource "azurerm_site_recovery_protection_container" "primary" {
  name                 = "primary-protection-container"
  resource_group_name  = azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
}

# Protection container in the target fabric.
resource "azurerm_site_recovery_protection_container" "secondary" {
  name                 = "secondary-protection-container"
  resource_group_name  = azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
}

resource "azurerm_site_recovery_replication_policy" "policy" {
  name                                                 = "policy"
  resource_group_name                                  = azurerm_resource_group.secondary.name
  recovery_vault_name                                  = azurerm_recovery_services_vault.vault.name
  recovery_point_retention_in_minutes                  = 24 * 60 # 1 day
  application_consistent_snapshot_frequency_in_minutes = 4 * 60  # 4 hours
}
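
The definitions above stop at the replication policy. The remaining settings from step 4 of the solution design (cache storage, target resource group, failover network and replica disk) could be wired up roughly as follows. This is a minimal sketch, not a complete configuration: the storage account name is hypothetical, and azurerm_virtual_network.primary, azurerm_virtual_network.secondary and azurerm_subnet.secondary are assumed to be defined elsewhere, like the resource groups and network interface referenced above.

```hcl
# Ties the source container to the target container under the replication policy.
resource "azurerm_site_recovery_protection_container_mapping" "mapping" {
  name                                      = "container-mapping"
  resource_group_name                       = azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name                      = azurerm_site_recovery_fabric.primary.name
  recovery_source_protection_container_name = azurerm_site_recovery_protection_container.primary.name
  recovery_target_protection_container_id   = azurerm_site_recovery_protection_container.secondary.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
}

# Maps the source virtual network to the failover virtual network.
# Both virtual networks are assumptions; they are not defined in this post.
resource "azurerm_site_recovery_network_mapping" "mapping" {
  name                        = "network-mapping"
  resource_group_name         = azurerm_resource_group.secondary.name
  recovery_vault_name         = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
  target_recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
  source_network_id           = azurerm_virtual_network.primary.id
  target_network_id           = azurerm_virtual_network.secondary.id
}

# Cache storage at the source: a Standard storage account in the primary region.
resource "azurerm_storage_account" "cache" {
  name                     = "primaryrecoverycache" # hypothetical; must be globally unique
  location                 = azurerm_resource_group.primary.location
  resource_group_name      = azurerm_resource_group.primary.name
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

# Enables replication for the VM defined above.
resource "azurerm_site_recovery_replicated_vm" "vm" {
  name                                      = "vm-replication"
  resource_group_name                       = azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name               = azurerm_site_recovery_fabric.primary.name
  source_vm_id                              = azurerm_virtual_machine.vm.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
  source_recovery_protection_container_name = azurerm_site_recovery_protection_container.primary.name

  target_resource_group_id                = azurerm_resource_group.secondary.id
  target_recovery_fabric_id               = azurerm_site_recovery_fabric.secondary.id
  target_recovery_protection_container_id = azurerm_site_recovery_protection_container.secondary.id

  # New replica managed disk at the destination, staged through the cache account.
  managed_disk {
    disk_id                    = azurerm_virtual_machine.vm.storage_os_disk[0].managed_disk_id
    staging_storage_account_id = azurerm_storage_account.cache.id
    target_resource_group_id   = azurerm_resource_group.secondary.id
    target_disk_type           = "Premium_LRS"
    target_replica_disk_type   = "Premium_LRS"
  }

  # Failover subnet in the mapped target network (assumed to exist).
  network_interface {
    source_network_interface_id = azurerm_network_interface.vm.id
    target_subnet_name          = azurerm_subnet.secondary.name
  }

  depends_on = [
    azurerm_site_recovery_protection_container_mapping.mapping,
    azurerm_site_recovery_network_mapping.mapping,
  ]
}
```

Once this applies, the VM should appear under replicated items in the vault, where its replication health can be checked as described in step 5.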

Tuesday, May 30, 2023

Pattern #1: Back up your MySQL database

Overview/Purpose:

This pattern describes how to provide business continuity and disaster recovery for your MySQL databases deployed on a single server or on a cluster in Azure, so that the data can be recovered after a user or application error, a regional data center outage, or another unplanned disruption.

Concepts to Understand

Paired Region: Azure supports cross-region replication pairings for all geographies. Regions are paired for cross-region replication based on proximity and other factors. The Azure regional pairs in North America include East US – West US, East US 2 – Central US, North Central US – South Central US, West US 2 – West Central US, and West US 3 – East US. One benefit of choosing from these pairings is that, if there is a broad outage, recovery of at least one region in each pair is prioritized. Without pairings, many deployments default to the Central US region, but it is recommended to achieve high availability through availability zones and locally redundant or zone-redundant storage instead. Regions without a pair do not have geo-redundant storage.

Geo-Restore: This is a feature of Azure Database for MySQL that allows a server to be restored from geo-redundant backups. The backups are hosted in the server’s paired region.

RTO: The Recovery Time Objective is the maximum amount of time that a resource can be down without causing significant damage to the business, including the time spent restoring it to normal operations after the incident.

RPO: The Recovery Point Objective is the amount of time that might pass during a disruption before the quantity of data lost during that period is greater than the allowable threshold.

Solution Design

Configure your MySQL server to either:

1. take geo-redundant backups, with the ability to initiate a geo-restore, or

2. replicate to read replicas deployed in a different region.

With geo-restore, a new server is created using the backup data that is replicated from another region. The overall time it takes to restore and recover depends on the size of the database and the number of logs to recover, and ranges from a few minutes to a few hours.

With read replicas, transaction logs from the primary are asynchronously streamed to the replica. In the event of a primary database outage due to a zone-level or region-level fault, failing over to the replica provides a shorter RTO and reduced data loss.

Feature	Cost	RTO	RPO
Geo-restore	Only on General-purpose/memory-optimized SKU	Varies	<1h
Read replicas	Available on Basic	Minutes but depends on latency, size of data and write workload	< 5 min

Terraform to apply:

Option 1:

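resource "azurerm_mysql_flexible_server" "default" {
  # name, resource_group_name and location are also required; omitted here.
  create_mode                  = "GeoRestore"
  geo_redundant_backup_enabled = true
  source_server_id             = "other_server" # placeholder for the resource ID of the source server
}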

Changing the backup redundancy from the default of locally redundant to geo-redundant via Terraform, to protect against region-level failures, destroys the existing instance and creates it again.
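
Because of that recreate behavior, it helps to enable geo-redundant backups when the source server is first created. A minimal sketch of such a source server follows; the server name, administrator login, SKU, retention period and the mysql_admin_password variable are illustrative assumptions, and azurerm_resource_group.primary is assumed to exist. The prevent_destroy guard is an optional suggestion, not part of the original pattern: it makes Terraform refuse any plan that would destroy the server, including an accidental recreate.

```hcl
# Hypothetical variable holding the administrator password.
variable "mysql_admin_password" {
  type      = string
  sensitive = true
}

resource "azurerm_mysql_flexible_server" "source" {
  name                   = "example-mysql-source" # hypothetical name
  resource_group_name    = azurerm_resource_group.primary.name
  location               = azurerm_resource_group.primary.location
  administrator_login    = "mysqladmin"           # hypothetical login
  administrator_password = var.mysql_admin_password
  sku_name               = "GP_Standard_D2ds_v4"  # a General Purpose SKU, chosen for illustration

  # Set at creation time so that no later recreate is needed.
  backup_retention_days        = 7
  geo_redundant_backup_enabled = true

  lifecycle {
    # Fail the plan instead of destroying the server.
    prevent_destroy = true
  }
}
```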

Option 2:

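resource "azurerm_mysql_flexible_server" "example" {
  # name, resource_group_name and location are also required; omitted here.
  create_mode      = "Replica"
  source_server_id = "other_server" # placeholder for the resource ID of the source server
  sku_name         = "B_Standard_B1s"
}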

If possible, run a test drill for your changes.

Recovery plan: Applications do not see the failure of a database or its storage because the configured MySQL server recovers automatically, but user action is required when there is a region failure or a user error. A region failure is a rare event and requires promoting a read replica to primary. Replication to the replica is stopped, and the replica is then promoted.
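
The promotion step can also be expressed in Terraform. The sketch below assumes the replica from Option 2 already exists and that the azurerm provider's replication_role argument behaves as documented, that is, it cannot be set at creation and can only be updated from Replica to None. Treat it as an illustration to verify against your provider version rather than a tested runbook.

```hcl
resource "azurerm_mysql_flexible_server" "example" {
  # name, resource_group_name and location are also required; omitted here.
  create_mode      = "Replica"
  source_server_id = "other_server" # placeholder for the resource ID of the source server
  sku_name         = "B_Standard_B1s"

  # Added only after the replica exists: stops replication and
  # promotes this server to a standalone read-write server.
  replication_role = "None"
}
```

Promotion is one-way: once promoted, the server does not become a replica of the original again.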

This pattern holds true for a Cassandra cluster as well, which takes continuous backups and lets us specify hours_between_backups, which defaults to 24 hours. Paired region support is also available for Kubernetes clusters and persistent volumes.
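
For the Cassandra case, a minimal sketch using the azurerm_cosmosdb_cassandra_cluster resource might look like the following. The cluster name, the cassandra_admin_password variable and azurerm_subnet.cassandra (the delegated management subnet) are assumptions made for illustration.

```hcl
# Hypothetical variable holding the initial Cassandra admin password.
variable "cassandra_admin_password" {
  type      = string
  sensitive = true
}

resource "azurerm_cosmosdb_cassandra_cluster" "example" {
  name                           = "example-cassandra-cluster" # hypothetical name
  resource_group_name            = azurerm_resource_group.primary.name
  location                       = azurerm_resource_group.primary.location
  delegated_management_subnet_id = azurerm_subnet.cassandra.id # assumed to exist
  default_admin_password         = var.cassandra_admin_password

  # Interval between backups; 24 is the default mentioned above.
  hours_between_backups = 24
}
```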

Note that the databases are typically backed up automatically every day, so we only need to choose between geo-restoring from a backup and linking a replica to the original server. This works for both a single server instance and a high-availability flexible server instance.