9.6 Azure Site Recovery, Failover, and Replication

Key Takeaways

  • Azure Site Recovery replicates workloads so they can fail over to a recovery location during an outage or planned migration.
  • ASR is different from backup because it maintains continuously replicated state, with short-lived recovery points, for failover rather than storing long-term retained recovery points for restore.
  • Replication design includes source and target region, cache storage, target networks, VM sizes, disk choices, and recovery point retention.
  • Test failover validates the disaster recovery plan without committing production failover.
  • Failback and reprotection must be planned so workloads are protected again after recovery.

Last updated: May 2026

Backup versus Site Recovery

Azure Backup and Azure Site Recovery solve different recovery problems. Backup protects data and systems by creating recovery points that can be restored. Site Recovery, or ASR, replicates machines to a recovery location and orchestrates failover. If the requirement says recover a deleted file from last Tuesday, think backup. If it says keep an application available during a regional outage, think ASR, load balancing, DNS, and application architecture.

ASR can protect Azure VMs between regions, VMware or physical servers to Azure, and other supported scenarios. For AZ-104, the most common pattern is Azure VM replication to a secondary Azure region using a Recovery Services vault. You configure source and target settings, enable replication, monitor health, run test failover, and perform failover when required.

Requirement | Best fit | Reason
Restore a VM from last week's recovery point | Azure Backup | Retained point-in-time restore.
Fail over a VM to another Azure region | Azure Site Recovery | Replication and failover orchestration.
Validate DR without disrupting production | ASR test failover | Runs isolated validation.
Keep logs for audit | Diagnostic setting to storage or workspace | Monitoring data retention, not workload DR.
Recover from accidental file deletion | Backup file recovery | ASR is not a file restore tool.

Replication design

A replication setup starts with a Recovery Services vault. The vault holds the ASR configuration and recovery metadata. For Azure-to-Azure replication, choose the source region, target region, target resource group, target virtual network, target subnet, cache storage account, replica managed disk type, and replication policy. During replication, changes are staged in the cache storage account in the source region before being written to the replica managed disks in the target region, which are maintained for recovery.
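The design choices above can be captured as a simple checklist before enabling replication. The following is a minimal sketch, not an Azure SDK type: `ReplicationSettings` and `find_design_gaps` are hypothetical names, and the rule that the target region must differ from the source assumes the region-to-region scenario described here.

```python
from dataclasses import dataclass, fields


@dataclass
class ReplicationSettings:
    """Hypothetical container for the Azure-to-Azure choices; not an SDK type."""
    source_region: str
    target_region: str
    target_resource_group: str
    target_vnet: str
    target_subnet: str
    cache_storage_account: str
    replica_disk_type: str
    replication_policy: str


def find_design_gaps(settings: ReplicationSettings) -> list[str]:
    """Return readable problems; an empty list means no gaps were found."""
    problems = [
        f"{f.name} is not set"
        for f in fields(settings)
        if not getattr(settings, f.name).strip()
    ]
    # Assumption: region-to-region DR, so source and target must differ.
    if settings.source_region and (
        settings.source_region.lower() == settings.target_region.lower()
    ):
        problems.append("target_region must differ from source_region")
    return problems


# Illustrative values only; the cache storage account was left blank on purpose.
draft = ReplicationSettings(
    source_region="eastus",
    target_region="westus",
    target_resource_group="rg-dr",
    target_vnet="vnet-dr",
    target_subnet="snet-app",
    cache_storage_account="",
    replica_disk_type="StandardSSD_LRS",
    replication_policy="policy-24h",
)
print(find_design_gaps(draft))
```

A check like this only confirms that every choice has been made; it does not confirm that the named resources exist or are reachable.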

Network design is critical. The failed-over VM needs a target virtual network and subnet that allow the application to work. IP addressing, DNS, NSGs, UDRs, private endpoints, firewalls, and load balancers may all need corresponding recovery-region design. ASR can replicate a VM, but it does not automatically redesign an application dependency chain. If the app depends on a database, identity endpoint, storage private endpoint, or on-premises route, the recovery plan must account for it.

Sizing also matters. Target VM size should be available in the recovery region and large enough for the workload. Disk SKU choices should match performance needs. The subscription needs enough vCPU quota in the target region for the selected VM family. If that quota is missing, replication can be healthy and correctly configured while test failover still fails, because the recovery VM cannot be created.
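The quota point can be made concrete with a small headroom calculation. This is a sketch under stated assumptions: the family names and numbers are invented for illustration, `FamilyQuota` and `vcpu_shortfalls` are hypothetical helpers, and real quota figures would come from the subscription's usage data for the target region.

```python
from dataclasses import dataclass


@dataclass
class FamilyQuota:
    """vCPU quota for one VM family in the target region (illustrative)."""
    limit: int
    used: int


def vcpu_shortfalls(
    required: dict[str, int], quotas: dict[str, FamilyQuota]
) -> dict[str, int]:
    """Return the missing vCPUs per VM family; empty means enough headroom."""
    shortfalls = {}
    for family, needed in required.items():
        # A family with no quota entry is treated as having zero headroom.
        quota = quotas.get(family, FamilyQuota(limit=0, used=0))
        available = quota.limit - quota.used
        if needed > available:
            shortfalls[family] = needed - available
    return shortfalls


# vCPUs the protected VMs would need after failover, grouped by family.
required = {"DSv5": 16, "ESv5": 8}
quotas = {"DSv5": FamilyQuota(limit=20, used=10), "ESv5": FamilyQuota(limit=32, used=0)}
print(vcpu_shortfalls(required, quotas))  # {'DSv5': 6}
```

Azure also enforces a total regional vCPU limit in addition to per-family limits, so a fuller check would compare against both.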

ASR workflow

Portal path: Azure portal > Recovery Services vault > Site Recovery > Enable replication. For Azure VMs, select source settings, VMs, target settings, replication policy, and review.

A practical workflow:

  1. Create or select a Recovery Services vault.
  2. Confirm target region, target resource group, VNet, subnet, disk type, and cache storage account.
  3. Enable replication for protected VMs.
  4. Wait for initial replication to complete and confirm health.
  5. Create recovery plans for multi-VM applications.
  6. Run test failover into an isolated network.
  7. Document DNS, load balancer, and application validation steps.
  8. Perform planned or unplanned failover only when the business decision is made.
  9. Commit failover after validation.
  10. Reprotect so the workload is protected in the new direction.

Recovery plans are important for multi-tier applications. They group protected machines and can define startup order. For example, start domain services first, then databases, then application servers, then web servers. Recovery plans can also include scripts or manual actions. AZ-104 may ask how to coordinate failover for multiple VMs in order; recovery plans are the answer.
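The startup ordering described above can be modelled as numbered groups, where every VM in a lower-numbered group starts before any VM in a higher one. This is only an illustration of the ordering idea: `boot_order` and the VM names are hypothetical, and the actual grouping is configured in the recovery plan itself.

```python
from collections import defaultdict


def boot_order(vm_groups: dict[str, int]) -> list[list[str]]:
    """Return VM names batched by group number, lowest group first.

    VMs that share a group number are treated as starting together.
    """
    grouped = defaultdict(list)
    for vm, group in vm_groups.items():
        grouped[group].append(vm)
    return [sorted(grouped[g]) for g in sorted(grouped)]


# Illustrative plan: domain services, then database, then app, then web.
plan = {"web01": 4, "dc01": 1, "sql01": 2, "app01": 3, "web02": 4}
for step, batch in enumerate(boot_order(plan), start=1):
    print(step, batch)
```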

Test, planned, and unplanned failover

Test failover validates recovery without affecting production replication. Choose a test network that is isolated from production to avoid duplicate IPs, duplicate domain members, or unintended client traffic. After validation, clean up the test failover. This is the safest routine DR exercise.

Planned failover is used when the source is still available, such as during a planned datacenter migration or expected outage. It attempts to synchronize final changes before failover to minimize data loss. Unplanned failover is used when the source is unavailable. It may use the latest available recovery point and can involve data loss depending on replication state.

Failover type | Use case | Data loss expectation | Key action
Test failover | DR validation | No production impact expected | Use isolated test network and clean up.
Planned failover | Controlled move while source is available | Minimized by final sync | Shut down or coordinate source and commit.
Unplanned failover | Source outage | Depends on latest recovery point | Choose recovery point and validate app.
Failback | Return to original or preferred region | Requires reprotection planning | Reprotect and fail over in reverse direction.
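The choice between the three failover types reduces to two questions: is this only a validation exercise, and is the source still available? The sketch below encodes that decision as described in this section; `FailoverType` and `choose_failover` are hypothetical names, not part of any Azure tooling.

```python
from enum import Enum


class FailoverType(Enum):
    TEST = "test failover"
    PLANNED = "planned failover"
    UNPLANNED = "unplanned failover"


def choose_failover(validating_only: bool, source_available: bool) -> FailoverType:
    """Pick a failover type from the two questions in this section."""
    if validating_only:
        # DR drills never commit production, whatever the source state is.
        return FailoverType.TEST
    # A reachable source allows a final sync; an outage does not.
    return FailoverType.PLANNED if source_available else FailoverType.UNPLANNED


print(choose_failover(validating_only=False, source_available=False).value)
```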

Monitoring replication

ASR provides replication health, RPO, job status, and error details in the vault. Azure Monitor alerts can notify on replication health issues or failover jobs. Backup Center and vault views help centralize operational awareness. A healthy replication status is not the same as a tested application recovery plan. The administrator should test failover regularly and verify application behavior.
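As a rough illustration of RPO monitoring, the check below flags VMs whose current RPO exceeds the business target. The VM names and values are invented, and `rpo_breaches` is a hypothetical helper; in practice the figures would come from the vault's replication health views or exported logs.

```python
def rpo_breaches(current_rpo_seconds: dict[str, float], target_rpo_seconds: float) -> list[str]:
    """Return the VMs whose current RPO is worse than the target, sorted by name."""
    return sorted(
        vm for vm, rpo in current_rpo_seconds.items() if rpo > target_rpo_seconds
    )


# Illustrative readings against a 15-minute (900-second) target.
readings = {"web01": 120, "sql01": 1900, "app01": 900}
print(rpo_breaches(readings, target_rpo_seconds=900))  # ['sql01']
```

An RPO exactly at the target is treated as acceptable here; whether that is the right boundary is a policy decision.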

If the vault's diagnostic logs are sent to a Log Analytics workspace, a KQL query can support operational review. Table schemas depend on configuration, but the pattern is to filter the vault diagnostics for Site Recovery jobs and health events:

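// Column names such as JobStatus_s and ReplicationHealth_s are illustrative; verify them against the workspace schema.
AzureDiagnostics
| where TimeGenerated > ago(24h)
| where Category has "AzureSiteRecovery"
// Require a non-empty health value so rows that lack it are not matched by "!=".
| where JobStatus_s in ("Failed", "Warning")
    or (isnotempty(ReplicationHealth_s) and ReplicationHealth_s != "Normal")
| project TimeGenerated, Resource, Category, JobStatus_s, ReplicationHealth_s, ErrorMessage_s
| order by TimeGenerated desc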

Troubleshooting ASR

Initial replication failures often involve unsupported VM configuration, disk limits, extension problems, cache storage account access, or target region quota. Ongoing replication health issues can involve high churn, network connectivity, storage throttling, or agent issues depending on the protected workload type. Failover failures often involve target region capacity, target network configuration, boot diagnostics, encryption dependencies, or missing permissions.

For encrypted VMs, confirm Key Vault access in the target region and required permissions for recovery. For private workloads, confirm private DNS zones and endpoints exist or have a recovery design. For domain-joined VMs, confirm domain controller availability in the recovery environment. For load-balanced applications, confirm backend pools and health probes in the recovery region.
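The dependency checks above lend themselves to a pre-failover checklist keyed by workload traits. This is a minimal sketch: the trait names and the `checks_for` helper are hypothetical, and the check text simply restates the guidance in this section.

```python
# Maps a workload trait to the dependency that should be confirmed before failover.
RECOVERY_CHECKS = {
    "encrypted": "Key Vault access and permissions in the target region",
    "private_endpoints": "Private DNS zones and endpoints in the recovery design",
    "domain_joined": "Domain controller available in the recovery environment",
    "load_balanced": "Backend pools and health probes in the recovery region",
}


def checks_for(traits: set[str]) -> list[str]:
    """Return the checks that apply to a workload, in checklist order."""
    unknown = traits - set(RECOVERY_CHECKS)
    if unknown:
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    return [check for trait, check in RECOVERY_CHECKS.items() if trait in traits]


for item in checks_for({"load_balanced", "encrypted"}):
    print("-", item)
```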

Exam traps

  • Do not choose ASR to recover a single deleted file.
  • Do not choose Azure Backup alone when the requirement is regional failover with low recovery time.
  • Do not run test failover into the production network unless the scenario explicitly accounts for isolation and conflicts.
  • Do not forget to commit failover and reprotect after a real failover.
  • Do not assume ASR protects every dependency; DNS, identity, networking, database replication, and application configuration must be part of the disaster recovery plan.

Test Your Knowledge

Which ASR operation validates disaster recovery without disrupting the production VM?

A multi-tier application must fail over with database servers started before web servers. What ASR feature should be used?

Which statement best distinguishes Azure Backup from Azure Site Recovery?