9.6 Azure Site Recovery, Failover, and Replication
Key Takeaways
- Azure Site Recovery replicates workloads so they can fail over to a recovery location during an outage or planned migration.
- ASR differs from Azure Backup because it continuously replicates workloads for failover and keeps only short-term recovery points, rather than storing long-retention recovery points for restore.
- Replication design includes source and target region, cache storage, target networks, VM sizes, disk choices, and recovery point retention.
- Test failover validates the disaster recovery plan without committing production failover.
- Failback and reprotection must be planned so workloads are protected again after recovery.
Backup versus Site Recovery
Azure Backup and Azure Site Recovery solve different recovery problems. Backup protects data and systems by creating recovery points that can be restored. Site Recovery, or ASR, replicates machines to a recovery location and orchestrates failover. If the requirement says recover a deleted file from last Tuesday, think backup. If it says keep an application available during a regional outage, think ASR, load balancing, DNS, and application architecture.
ASR can protect Azure VMs between regions, VMware or physical servers to Azure, and other supported scenarios. For AZ-104, the most common pattern is Azure VM replication to a secondary Azure region using a Recovery Services vault. You configure source and target settings, enable replication, monitor health, run test failover, and perform failover when required.
| Requirement | Best fit | Reason |
|---|---|---|
| Restore a VM from last week's recovery point | Azure Backup | Retained point-in-time restore. |
| Fail over a VM to another Azure region | Azure Site Recovery | Replication and failover orchestration. |
| Validate DR without disrupting production | ASR test failover | Runs isolated validation. |
| Keep logs for audit | Diagnostic setting to storage or workspace | Monitoring data retention, not workload DR. |
| Recover from accidental file deletion | Backup file recovery | ASR is not a file restore tool. |
Replication design
A replication setup starts with a Recovery Services vault. The vault holds the ASR configuration and recovery metadata. For Azure-to-Azure replication, choose the source region, target region, target resource group, target virtual network, target subnet, cache storage account, replica managed disk type, and replication policy. The cache storage account, located in the source region, stages changed data before it is sent to the target region, where the replica managed disks are maintained for recovery.
Network design is critical. The failed-over VM needs a target virtual network and subnet that allow the application to work. IP addressing, DNS, NSGs, UDRs, private endpoints, firewalls, and load balancers may all need corresponding recovery-region design. ASR can replicate a VM, but it does not automatically redesign an application dependency chain. If the app depends on a database, identity endpoint, storage private endpoint, or on-premises route, the recovery plan must account for it.
Sizing also matters. Target VM size should be available in the recovery region and large enough for the workload. Disk SKU choices should match performance needs. Quotas must exist in the target region. If test failover fails because the target region lacks quota for the selected VM family, the replication configuration may be valid but the recovery cannot instantiate the VM.
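The quota point can be made concrete with a small pre-flight check. This is a minimal sketch under stated assumptions: the `FamilyQuota` record, the family name, and the numbers are hypothetical stand-ins for values you would read from the target region's usage and quota view, not an Azure SDK API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyQuota:
    """Hypothetical snapshot of vCPU quota for one VM family in the target region."""
    family: str
    limit: int          # vCPUs allowed by the subscription quota
    current_usage: int  # vCPUs already consumed in that region


def can_fail_over(quota: FamilyQuota, vcpus_per_vm: int, vm_count: int) -> bool:
    """Return True if the remaining quota covers every VM that must start."""
    required = vcpus_per_vm * vm_count
    remaining = quota.limit - quota.current_usage
    return required <= remaining


# Illustrative numbers only: 10 vCPUs remain in the family quota.
dsv5 = FamilyQuota(family="standardDSv5Family", limit=20, current_usage=10)
print(can_fail_over(dsv5, vcpus_per_vm=4, vm_count=3))  # False: 12 vCPUs needed
print(can_fail_over(dsv5, vcpus_per_vm=4, vm_count=2))  # True: 8 vCPUs needed
```

A check like this only covers one VM family; a real review would likely also look at the total regional vCPU quota and at whether the chosen size is offered in the target region.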
ASR workflow
Portal path: Azure portal > Recovery Services vault > Site Recovery > Enable replication. For Azure VMs, select source settings, VMs, target settings, replication policy, and review.
A practical workflow:
- Create or select a Recovery Services vault.
- Confirm target region, target resource group, VNet, subnet, disk type, and cache storage account.
- Enable replication for protected VMs.
- Wait for initial replication to complete and confirm health.
- Create recovery plans for multi-VM applications.
- Run test failover into an isolated network.
- Document DNS, load balancer, and application validation steps.
- Perform planned or unplanned failover only when the business decision is made.
- Commit failover after validation.
- Reprotect so the workload is protected in the new direction.
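The ordering in the list above can be modeled as a small state machine. This is a simplified sketch for study purposes: the state and action names are hypothetical labels chosen to mirror the list, not the status values ASR itself reports.

```python
# Hypothetical states and actions mirroring the workflow list; not ASR's own status names.
TRANSITIONS = {
    "unprotected": {"enable_replication": "initial_replication"},
    "initial_replication": {"replication_complete": "protected"},
    "protected": {"test_failover": "test_running", "failover": "failed_over"},
    "test_running": {"cleanup_test": "protected"},
    "failed_over": {"commit": "committed"},
    # Reprotecting starts replication again, this time in the reverse direction.
    "committed": {"reprotect": "initial_replication"},
}


def apply_actions(actions, state="unprotected"):
    """Walk the actions in order and return the final state, rejecting invalid steps."""
    for action in actions:
        allowed = TRANSITIONS.get(state, {})
        if action not in allowed:
            raise ValueError(f"'{action}' is not valid while '{state}'")
        state = allowed[action]
    return state


drill = ["enable_replication", "replication_complete", "test_failover", "cleanup_test"]
print(apply_actions(drill))  # protected
```

The model makes the two easy-to-forget steps explicit: a test failover ends with cleanup, and a real failover ends with commit followed by reprotect.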
Recovery plans are important for multi-tier applications. They group protected machines and can define startup order. For example, start domain services first, then databases, then application servers, then web servers. Recovery plans can also include scripts or manual actions. AZ-104 may ask how to coordinate failover for multiple VMs in order; recovery plans are the answer.
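To illustrate the startup-order idea, the sketch below groups VMs into ordered boot groups by tier. The tier names, VM names, and the `build_boot_groups` helper are hypothetical; in ASR the grouping is defined in the recovery plan itself, not in code like this.

```python
from collections import defaultdict

# Hypothetical tiers; lower numbers start first.
TIER_ORDER = {"domain": 1, "database": 2, "application": 3, "web": 4}


def build_boot_groups(vms: dict[str, str]) -> list[list[str]]:
    """Group VM names by tier and return the groups in startup order."""
    groups = defaultdict(list)
    for name, tier in vms.items():
        groups[TIER_ORDER[tier]].append(name)
    return [sorted(groups[order]) for order in sorted(groups)]


plan = build_boot_groups({
    "web-01": "web",
    "sql-01": "database",
    "dc-01": "domain",
    "app-01": "application",
    "web-02": "web",
})
for number, members in enumerate(plan, start=1):
    print(f"Group {number}: {', '.join(members)}")
```

With the sample input, the domain controller lands in the first group and both web servers share the last group, which matches the order described above.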
Test, planned, and unplanned failover
Test failover validates recovery without affecting production replication. Choose a test network that is isolated from production to avoid duplicate IPs, duplicate domain members, or unintended client traffic. After validation, clean up the test failover. This is the safest routine DR exercise.
Planned failover is used when the source is still available, such as during a planned datacenter migration or expected outage. It attempts to synchronize final changes before failover to minimize data loss. Unplanned failover is used when the source is unavailable. It may use the latest available recovery point and can involve data loss depending on replication state.
| Failover type | Use case | Data loss expectation | Key action |
|---|---|---|---|
| Test failover | DR validation | No production impact expected | Use isolated test network and clean up. |
| Planned failover | Controlled move while source is available | Minimized by final sync | Shut down or coordinate source and commit. |
| Unplanned failover | Source outage | Depends on latest recovery point | Choose recovery point and validate app. |
| Failback | Return to original or preferred region | Depends on reverse replication state; requires reprotection planning | Reprotect and fail over in reverse direction. |
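The table can be read as a simple decision rule plus a data-loss estimate. The sketch below is illustrative only: `choose_failover` and `potential_data_loss` are hypothetical helpers, and the timestamps are made-up sample values.

```python
from datetime import datetime, timedelta


def choose_failover(is_drill: bool, source_available: bool) -> str:
    """Pick the failover type from the two questions the table implies."""
    if is_drill:
        return "test failover"
    return "planned failover" if source_available else "unplanned failover"


def potential_data_loss(outage_at: datetime, latest_recovery_point: datetime) -> timedelta:
    """Changes written after the latest recovery point are not in the replica."""
    return max(outage_at - latest_recovery_point, timedelta(0))


# Sample values: the source failed five minutes after the last recovery point.
outage = datetime(2024, 1, 1, 12, 0)
last_point = datetime(2024, 1, 1, 11, 55)
print(choose_failover(is_drill=False, source_available=False))  # unplanned failover
print(potential_data_loss(outage, last_point))  # 0:05:00
```

The estimate is a worst-case window: whether any data was actually lost depends on what the workload wrote during those minutes.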
Monitoring replication
ASR provides replication health, RPO, job status, and error details in the vault. Azure Monitor alerts can notify on replication health issues or failover jobs. Backup Center and vault views help centralize operational awareness. A healthy replication status is not the same as a tested application recovery plan. The administrator should test failover regularly and verify application behavior.
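KQL-style operational review can use diagnostics if vault logs are sent to Log Analytics. Table schemas depend on configuration, so treat the column names below as placeholders to adjust, but the pattern is to filter the vault diagnostics for Site Recovery jobs and health events:
AzureDiagnostics
| where TimeGenerated > ago(24h)
| where Category has "AzureSiteRecovery"
// Column names vary by workspace; column_ifexists() avoids a query error when one is missing.
| extend JobStatus = column_ifexists("JobStatus_s", ""), ReplicationHealth = column_ifexists("ReplicationHealth_s", ""), ErrorMessage = column_ifexists("ErrorMessage_s", "")
// Skip rows with no health value so that empty fields are not reported as unhealthy.
| where JobStatus in ("Failed", "Warning") or (isnotempty(ReplicationHealth) and ReplicationHealth != "Normal")
| project TimeGenerated, Resource, Category, JobStatus, ReplicationHealth, ErrorMessage
| order by TimeGenerated desc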
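Rows exported from a query like this could then be triaged against an RPO threshold. The following is a sketch under assumptions: the `ReplicatedItem` fields, the sample rows, and the 900-second threshold are hypothetical and do not reflect a fixed vault log schema.

```python
from typing import TypedDict


class ReplicatedItem(TypedDict):
    """Hypothetical shape of one exported row; real column names may differ."""
    name: str
    health: str
    rpo_seconds: int


def needs_attention(items: list[ReplicatedItem], max_rpo_seconds: int = 900) -> list[str]:
    """Return names of items that are unhealthy or whose RPO exceeds the threshold."""
    return [
        item["name"]
        for item in items
        if item["health"] != "Normal" or item["rpo_seconds"] > max_rpo_seconds
    ]


sample: list[ReplicatedItem] = [
    {"name": "web-01", "health": "Normal", "rpo_seconds": 120},
    {"name": "sql-01", "health": "Critical", "rpo_seconds": 300},
    {"name": "app-01", "health": "Normal", "rpo_seconds": 3600},
]
print(needs_attention(sample))  # ['sql-01', 'app-01']
```

Checking RPO separately from health status catches items that still report as healthy but have fallen behind the recovery objective the business agreed to.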
Troubleshooting ASR
Initial replication failures often involve unsupported VM configuration, disk limits, extension problems, cache storage account access, or target region quota. Ongoing replication health issues can involve high churn, network connectivity, storage throttling, or agent issues depending on the protected workload type. Failover failures often involve target region capacity, target network configuration, boot diagnostics, encryption dependencies, or missing permissions.
For encrypted VMs, confirm Key Vault access in the target region and required permissions for recovery. For private workloads, confirm private DNS zones and endpoints exist or have a recovery design. For domain-joined VMs, confirm domain controller availability in the recovery environment. For load-balanced applications, confirm backend pools and health probes in the recovery region.
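The four checks above can be turned into a per-VM checklist generator. This is a minimal sketch: the `VmProfile` fields and the `recovery_checks` helper are hypothetical and simply restate the paragraph as conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmProfile:
    """Hypothetical description of the traits that drive recovery checks."""
    name: str
    encrypted: bool = False
    uses_private_endpoints: bool = False
    domain_joined: bool = False
    load_balanced: bool = False


def recovery_checks(vm: VmProfile) -> list[str]:
    """List the recovery-region items to confirm before failing this VM over."""
    checks = []
    if vm.encrypted:
        checks.append("Key Vault access and permissions in the target region")
    if vm.uses_private_endpoints:
        checks.append("Private DNS zones and endpoints in the recovery design")
    if vm.domain_joined:
        checks.append("Domain controller available in the recovery environment")
    if vm.load_balanced:
        checks.append("Backend pools and health probes in the recovery region")
    return checks


web = VmProfile(name="web-01", encrypted=True, load_balanced=True)
for item in recovery_checks(web):
    print(f"{web.name}: confirm {item}")
```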
Exam traps
- Do not choose ASR to recover a single deleted file.
- Do not choose Azure Backup alone when the requirement is regional failover with low recovery time.
- Do not run test failover into the production network unless the scenario explicitly accounts for isolation and conflicts.
- Do not forget to commit failover and reprotect after a real failover.
- Do not assume ASR protects every dependency; DNS, identity, networking, database replication, and application configuration must be part of the disaster recovery plan.
Which ASR operation validates disaster recovery without disrupting the production VM?
A multi-tier application must fail over with database servers started before web servers. What ASR feature should be used?
Which statement best distinguishes Azure Backup from Azure Site Recovery?