High Availability and Disaster Recovery in Azure
Key Takeaways
- High availability keeps an application running through localized failures using in-region redundancy such as Availability Sets and Availability Zones.
- Disaster recovery restores service after a region-wide outage, typically with Azure Site Recovery and geo-redundant storage.
- Two VMs spread across Availability Zones carry a 99.99% VM SLA; an Availability Set carries 99.95%; a single VM with Premium SSD carries 99.9%.
- RPO measures acceptable data loss in time; RTO measures acceptable downtime in time.
- Azure Backup uses a Recovery Services vault with soft delete on by default, retaining deleted backups 14 extra days against accidental or malicious loss.
Quick Answer: High availability (HA) survives localized failures inside a region (Availability Sets, Availability Zones, load balancers). Disaster recovery (DR) survives an entire region failure (Azure Site Recovery, geo-redundant storage). RPO is how much data you can lose; RTO is how much time you can be down. Your SLA target dictates which pattern you must use.
High availability inside a region
HA is about removing single points of failure so that an application keeps serving users when a disk, rack, or data center fails. Azure gives you increasingly strong (and increasingly expensive) options.
| Pattern | Protects against | VM SLA |
|---|---|---|
| Single VM, Premium/Ultra SSD | Disk failure | 99.9% |
| Availability Set (2+ VMs) | Rack and host-update failures | 99.95% |
| Availability Zones (VMs in 2+ zones) | Whole data center failure | 99.99% |
| Multi-region active/active | Whole region failure | 99.99%+ |
Availability Set mechanics matter for the exam. An Availability Set spreads VMs across fault domains (separate racks with separate power and network) and update domains (groups patched at different times). As a result, planned host maintenance never reboots every replica at once, and a single rack power loss never takes the whole set down.
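The fault-domain and update-domain idea can be sketched in a few lines. This is only an illustration: the round-robin assignment, the VM names, and the helper functions `place_vms` and `survivors` are assumptions made for the example, not Azure's documented placement algorithm.

```python
def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair.

    Round-robin assignment is an illustrative assumption,
    not Azure's documented placement logic.
    """
    return {
        name: (i % fault_domains, i % update_domains)
        for i, name in enumerate(vm_names)
    }


def survivors(placement, failed_fd=None, patched_ud=None):
    """Return the VMs still running after a rack loss or a host update."""
    return [
        name
        for name, (fd, ud) in placement.items()
        if fd != failed_fd and ud != patched_ud
    ]


# Hypothetical four-VM web tier in one Availability Set.
placement = place_vms(["web-0", "web-1", "web-2", "web-3"])

print(survivors(placement, failed_fd=0))   # rack (fault domain 0) loses power
print(survivors(placement, patched_ud=1))  # hosts in update domain 1 reboot
```

In both cases at least one VM stays up, which is the whole point of the set: no single rack failure or patch wave touches every replica.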
Availability Zones are physically separate locations within a region (at least three in zone-enabled regions), each with independent power, cooling, and networking. Placing two or more VMs across two or more zones reaches the 99.99% SLA because the failure of a single zone cannot take them all down.
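It helps to translate the SLA percentages from the table into allowed downtime. The sketch below is plain arithmetic, assuming a 365-day year; the function name `max_downtime_minutes` is made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def max_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime an SLA percentage permits over the period."""
    return (1 - sla_percent / 100) * period_minutes


# SLA figures taken from the table above.
for label, sla in [
    ("Single VM, Premium/Ultra SSD", 99.9),
    ("Availability Set", 99.95),
    ("Availability Zones", 99.99),
]:
    print(f"{label}: {max_downtime_minutes(sla):.1f} minutes per year")
```

That works out to roughly 526, 263, and 53 minutes per year respectively, so each step up the table cuts the permitted downtime substantially.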
Load balancing options
| Service | Layer / scope | Use |
|---|---|---|
| Azure Load Balancer | Layer 4, regional | Spread TCP/UDP traffic across VMs in a region |
| Application Gateway | Layer 7, regional | HTTP routing, WAF, health probes |
| Traffic Manager | DNS, global | Route users to the closest healthy region |
| Front Door | Layer 7, global | Global HTTP load balancing and failover |
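As a memory aid, the table above reduces to a lookup on two questions: which layer, and regional or global? The sketch below simply encodes the table; the key strings and the `pick_load_balancer` helper are invented for illustration and are not part of any Azure SDK.

```python
# (layer, scope) -> service, copied from the table above.
LOAD_BALANCERS = {
    ("layer4", "regional"): "Azure Load Balancer",
    ("layer7", "regional"): "Application Gateway",
    ("dns", "global"): "Traffic Manager",
    ("layer7", "global"): "Front Door",
}


def pick_load_balancer(layer, scope):
    """Return the service the table lists for this layer and scope."""
    try:
        return LOAD_BALANCERS[(layer, scope)]
    except KeyError:
        raise ValueError(f"no entry in the table for {layer!r}, {scope!r}") from None


print(pick_load_balancer("layer7", "global"))    # global HTTP failover
print(pick_load_balancer("layer4", "regional"))  # TCP/UDP inside one region
```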
Disaster recovery across regions
Azure Site Recovery (ASR)
Azure Site Recovery is disaster recovery as a service. It continuously replicates VMs to a secondary region, supports one-click failover and failback, and lets you run a non-disruptive DR drill against an isolated network so you can prove your plan works without touching production. Recovery plans order the failover (database tier before web tier, for example) and can run automation scripts. Typical RTO is minutes and RPO is seconds to a few minutes.
Azure Backup
Azure Backup stores recovery points in a Recovery Services vault with no backup infrastructure to manage.
| What you can back up | How |
|---|---|
| Azure VMs | Application-consistent snapshots |
| Azure Files shares | Share snapshots |
| SQL Server / SAP HANA in Azure VMs | Database backup, down to 15-minute RPO for SQL |
| On-premises files and folders | MARS agent or Azure Backup Server |
Soft delete is enabled by default and keeps deleted backups for 14 additional days at no extra cost — a critical safety net against accidental deletion and ransomware. Backups can use geo-redundant storage so they survive a region loss.
RPO versus RTO
| Term | Means | Example |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss, in time | RPO of 1 hour = lose at most 1 hour of data |
| RTO (Recovery Time Objective) | Maximum acceptable downtime, in time | RTO of 4 hours = back online within 4 hours |
Worked example: A bank requires RPO of 5 minutes and RTO of 30 minutes for its core ledger. Daily Azure Backup alone gives an RPO of up to 24 hours — far too lossy. ASR with continuous replication (RPO in seconds) plus a pre-tested recovery plan meets both targets. Tighter RPO/RTO always costs more, so do not over-engineer a workload whose business tolerates hours of loss.
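The same comparison can be written as a quick check. This is a sketch under stated assumptions: the 24-hour backup RPO and the 5-minute and 30-minute targets come from the worked example, while the backup restore time and the specific ASR figures are placeholder values chosen only to make the example concrete.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta  # worst-case data loss
    rto: timedelta  # worst-case time to restore service


def meets(strategy, target_rpo, target_rto):
    """A strategy qualifies only if it satisfies both objectives."""
    return strategy.rpo <= target_rpo and strategy.rto <= target_rto


# Targets from the bank example.
target_rpo = timedelta(minutes=5)
target_rto = timedelta(minutes=30)

candidates = [
    # RPO of up to 24 hours is from the text; the 4-hour RTO is an assumed value.
    Strategy("Daily Azure Backup", rpo=timedelta(hours=24), rto=timedelta(hours=4)),
    # "Seconds" and "minutes" are from the text; 30 s and 15 min are assumed values.
    Strategy("ASR continuous replication", rpo=timedelta(seconds=30), rto=timedelta(minutes=15)),
]

for s in candidates:
    verdict = "meets" if meets(s, target_rpo, target_rto) else "misses"
    print(f"{s.name}: {verdict} the targets")
```

With these numbers only the ASR option passes. Measure your own RPO and RTO in a DR drill rather than relying on placeholder figures like these.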
HA versus DR: do not confuse them
The exam frequently offers a backup-or-replication scenario where one answer is HA and one is DR. Use this rule of thumb:
| Question | Reach for |
|---|---|
| Keep an app up when one data center fails | Availability Zones (HA) |
| Survive planned host maintenance and rack loss | Availability Set (HA) |
| Recover the whole app after a region disaster | Azure Site Recovery (DR) |
| Restore individual files, VMs, or databases from a point in time | Azure Backup (DR) |
| Keep storage available if its primary region fails | Geo-redundant storage (GRS) |
Backup is not the same as Site Recovery. Azure Backup creates recovery points you restore from after data loss or corruption; ASR keeps a near-real-time replica you fail over to during an outage. A mature design uses both: ASR for fast regional failover and Backup for long-term, point-in-time, ransomware-resistant recovery. Geo-redundant storage complements them by copying storage account and backup vault data to the paired region, typically hundreds of miles away.
On the Exam: RPO = data, RTO = time. HA solves in-region failures (zones, sets); DR solves region failures. "Replicate VMs to another region for failover" = Azure Site Recovery. "Recover an accidentally deleted backup" = soft delete. "Lowest data loss for a critical database" points to continuous replication, not nightly backup.
What does Azure Site Recovery primarily provide?
An application can tolerate losing at most 15 minutes of data but must be running again within 1 hour. Which two values describe these requirements?
Which configuration provides the 99.99% SLA for Azure Virtual Machines?