5.4 Availability Zones, Sets, and VM Scale Sets
Key Takeaways
- Availability zones protect against datacenter-level failure within a region, while availability sets spread VMs across fault and update domains within a single datacenter.
- VM Scale Sets provide a managed group of VM instances with uniform or flexible orchestration, autoscale, load balancing, and rolling upgrade patterns.
- High availability design must include compute placement, load balancing, health probes, application state, data tier resilience, and operational update strategy.
- Troubleshooting scale sets often involves instance health, autoscale rules, image or extension failures, quota, load balancer backend membership, and upgrade policy.
Availability Building Blocks
A single VM has planned and unplanned downtime risk. Azure provides several compute placement options to reduce that risk, but each solves a different problem. Availability zones are physically separate datacenter locations inside a supported region. Zone-redundant designs can survive a zone failure if the application, network path, and data tier are also resilient. Availability sets spread VMs across fault domains and update domains to reduce correlated host and maintenance impact. VM Scale Sets manage multiple VM instances as a group and are commonly used for stateless application tiers.
| Option | Protects against | Best fit | Key limitation |
|---|---|---|---|
| Availability zone | Datacenter or zone failure | Regional HA for supported services | Not every region or SKU supports every zone |
| Availability set | Host rack and planned update grouping | Legacy or non-zonal multi-VM apps | Must be chosen at VM creation |
| VM Scale Set | Instance failure and elastic demand | Horizontal app tiers | App should tolerate instance replacement |
| Azure Site Recovery | Regional disaster recovery | DR to another region | Failover design and testing required |
A VM cannot be added to an availability set after creation. To use one, deploy the VMs into the set at creation time or rebuild the architecture. Zones are likewise selected at creation for zonal VMs; moving a VM to a different zone is not a simple property edit.
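Because set membership is fixed at creation, the usual pattern is to create the set first and then deploy each VM into it. A minimal sketch; resource names such as `avset-web-prod` and `vm-web-01` are illustrative:

```shell
# Create the availability set before any member VMs exist
az vm availability-set create \
  --resource-group rg-prod-web \
  --name avset-web-prod \
  --platform-fault-domain-count 2 \
  --platform-update-domain-count 5

# Each VM must reference the set at creation time
az vm create \
  --resource-group rg-prod-web \
  --name vm-web-01 \
  --image Ubuntu2204 \
  --availability-set avset-web-prod \
  --size Standard_D2s_v5
```

Repeating the second command for `vm-web-02` places both VMs in the same set, and Azure distributes them across the configured domains automatically.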
Availability Zones
Zones are useful when the question mentions datacenter-level isolation, zone-redundant architecture, or high availability within a region. A zonal VM is pinned to one zone. A resilient application normally deploys multiple VMs across zones and places a zone-redundant load balancer, application gateway, or other traffic manager in front where appropriate. The storage design must also match the availability goal. Managed disks can use locally redundant or zone-redundant options depending on disk type and support.
Example CLI pattern:

```shell
az vm create \
  --resource-group rg-prod-web \
  --name vm-web-z1 \
  --image Ubuntu2204 \
  --zone 1 \
  --size Standard_D2s_v5 \
  --vnet-name vnet-prod \
  --subnet snet-web \
  --admin-username azureadmin \
  --ssh-key-values ~/.ssh/id_rsa.pub
```
If a deployment fails in a zone, check SKU availability in that zone, zonal quota, unsupported disk settings, and capacity constraints. A VM size might be available in the region but not in the requested zone.
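Zone and quota restrictions are usually visible before deployment. A quick check, assuming the target region is `eastus` (substitute your own):

```shell
# Show which zones offer the size, plus any subscription restrictions
az vm list-skus \
  --location eastus \
  --size Standard_D2s_v5 \
  --zone \
  --output table

# Check regional vCPU usage against quota for the VM family
az vm list-usage --location eastus --output table
```

A size listed for the region but missing a zone entry, or flagged with a restriction, explains a zonal deployment failure without any trial and error.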
Availability Sets
Availability sets use fault domains and update domains. Fault domains represent groups of hardware that share power and network infrastructure. Update domains represent groups that can be rebooted during planned maintenance. Azure distributes VMs in an availability set across these domains. The design reduces the chance that one host failure or maintenance wave affects all instances.
Availability sets are not a scaling feature. They do not create new VMs, balance traffic, or repair unhealthy apps. You still need a load balancer or application-level traffic distribution, and you need at least two VMs. Managed disks are recommended because Azure aligns disk placement with VM fault domains for managed availability sets.
Bicep example:

```bicep
resource avset 'Microsoft.Compute/availabilitySets@2024-07-01' = {
  name: 'avset-web-prod'
  location: location
  sku: {
    name: 'Aligned'
  }
  properties: {
    platformFaultDomainCount: 2
    platformUpdateDomainCount: 5
  }
}

resource vm 'Microsoft.Compute/virtualMachines@2024-07-01' = {
  name: vmName
  location: location
  properties: {
    availabilitySet: {
      id: avset.id
    }
    hardwareProfile: {
      vmSize: 'Standard_D2s_v5'
    }
    storageProfile: storageProfile
    osProfile: osProfile
    networkProfile: networkProfile
  }
}
```
VM Scale Sets
A VM Scale Set is a group of VM instances managed together. Scale sets can use uniform orchestration, where instances are identical copies of a shared model and are managed as a set, or flexible orchestration, which supports a more VM-like model and broader high availability patterns. Scale sets can integrate with Azure Load Balancer or Application Gateway. They can use autoscale rules based on metrics such as CPU, queue length, or custom signals.
Scale sets work best for stateless workloads or workloads that externalize state to databases, storage, queues, or caches. Instance replacement is normal. If the app stores unique state on the local OS disk, scale-in or repair can cause data loss. Use custom images, cloud-init, extensions, or configuration management to make instances reproducible.
CLI example:

```shell
az vmss create \
  --resource-group rg-prod-web \
  --name vmss-web-prod \
  --image Ubuntu2204 \
  --instance-count 3 \
  --vm-sku Standard_D2s_v5 \
  --upgrade-policy-mode Automatic \
  --lb lb-web-prod \
  --backend-pool-name bepool-web

az monitor autoscale create \
  --resource-group rg-prod-web \
  --resource vmss-web-prod \
  --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name autoscale-web \
  --min-count 2 --max-count 10 --count 3

az monitor autoscale rule create \
  --resource-group rg-prod-web \
  --autoscale-name autoscale-web \
  --condition "Percentage CPU > 70 avg 10m" \
  --scale out 1
```
Upgrade and Health Strategy
Scale set upgrade policy controls how model changes reach instances. Manual means you update instances yourself. Automatic applies model changes to all instances at once, without staged control. Rolling updates instances in batches and can use health probes to pause when a batch becomes unhealthy. Automatic instance repair can replace unhealthy instances when health signals are configured.
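Both settings map to a small set of CLI operations. A hedged sketch against an existing scale set (`vmss-web-prod` is illustrative; automatic repairs require a load balancer health probe or the Application Health extension to already be configured):

```shell
# Switch the scale set to rolling upgrades with batch and health limits
az vmss update \
  --resource-group rg-prod-web \
  --name vmss-web-prod \
  --set upgradePolicy.mode=Rolling \
        upgradePolicy.rollingUpgradePolicy.maxBatchInstancePercent=20 \
        upgradePolicy.rollingUpgradePolicy.maxUnhealthyInstancePercent=20

# Enable automatic instance repair with a grace period after provisioning
az vmss update \
  --resource-group rg-prod-web \
  --name vmss-web-prod \
  --enable-automatic-repairs true \
  --automatic-repairs-grace-period PT30M
```

The grace period keeps repairs from cycling instances that are still booting or warming up.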
| Requirement | Recommended feature |
|---|---|
| Add instances during high CPU | Autoscale rule |
| Replace unhealthy scale set instances | Automatic repairs with health signal |
| Safely roll image updates | Rolling upgrade policy |
| Spread instances across zones | Zone-aware scale set design |
| Put instances behind one frontend IP | Azure Load Balancer backend pool |
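The zone-aware design in the table above usually reduces to passing zones at creation, assuming the region supports them:

```shell
# Spread instances across three zones; Azure balances instance counts per zone
az vmss create \
  --resource-group rg-prod-web \
  --name vmss-web-zonal \
  --image Ubuntu2204 \
  --instance-count 3 \
  --vm-sku Standard_D2s_v5 \
  --zones 1 2 3
```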
Troubleshooting Scenarios
Scenario 1: Autoscale did not add instances. Check the autoscale setting target resource, metric namespace, time aggregation, operator, duration, cooldown, minimum and maximum limits, and whether the scale set already reached max count. Also check quota. Autoscale cannot create instances if the subscription lacks regional family quota.
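The checks in Scenario 1 can be driven from the CLI. A sketch, assuming the resource names used earlier in this section:

```shell
# Inspect the autoscale setting: profiles, rules, min/max/default counts
az monitor autoscale show \
  --resource-group rg-prod-web \
  --name autoscale-web

# Review recent scale actions and failures in the activity log
az monitor activity-log list \
  --resource-group rg-prod-web \
  --offset 1h \
  --output table
```

Comparing the rule's metric, operator, and duration against what the activity log shows actually fired usually isolates the misconfiguration.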
Scenario 2: Instances are created but do not receive traffic. Check backend pool membership, load balancer rule, health probe path and port, NSG effective rules, guest firewall, and whether the app is listening. A failed health probe keeps the instance out of rotation even if the VM is running.
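For Scenario 2, backend membership and probe configuration can be inspected directly. A sketch, assuming the load balancer names from the earlier examples:

```shell
# Confirm the instance NICs are actually in the backend pool
az network lb address-pool show \
  --resource-group rg-prod-web \
  --lb-name lb-web-prod \
  --name bepool-web \
  --query "backendIPConfigurations[].id"

# Check the probe protocol, port, and path the load-balancing rule depends on
az network lb probe list \
  --resource-group rg-prod-web \
  --lb-name lb-web-prod \
  --output table
```

If membership and probe settings look correct, the remaining suspects are the NSG, the guest firewall, and whether the app is listening on the probed port.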
Scenario 3: Rolling upgrade stops. Check instance health, extension status, application health extension, boot diagnostics, and model differences. A bad custom script can make every new instance unhealthy. Fix the model, then update or reimage affected instances.
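The recovery path in Scenario 3 can be sketched as: inspect instance health, fix the model, then push the corrected model out:

```shell
# Per-instance provisioning state and extension status
az vmss get-instance-view \
  --resource-group rg-prod-web \
  --name vmss-web-prod \
  --instance-id 0

# After fixing the model, bring specific instances up to the latest model
az vmss update-instances \
  --resource-group rg-prod-web \
  --name vmss-web-prod \
  --instance-ids 0 1

# Or reimage instances that are beyond repair
az vmss reimage \
  --resource-group rg-prod-web \
  --name vmss-web-prod \
  --instance-ids 0
```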
Exam Design Logic
Choose availability zones when the question requires protection from datacenter failure in the same region. Choose availability sets when the question describes two or more traditional VMs that must be isolated across fault and update domains and zones are not the focus. Choose VM Scale Sets when the requirement is many similar instances, autoscaling, automatic repair, or consistent rolling deployment. Choose Azure Site Recovery when the requirement is regional disaster recovery for VMs.
- You need a group of stateless web VMs to add instances automatically when CPU remains high. Which Azure feature should you use?
- Which statement about availability sets is correct?
- A scale set instance is running but receives no load-balanced traffic. What should you check first?