Recovery Sites, Restoration Ordering, and Failover
Key Takeaways
- Hot sites recover fastest at highest cost; warm sites are partially prepared; cold sites are cheapest but slowest.
- Restoration ordering must follow business priority and technical dependencies, not convenience.
- Failover moves service to an alternate system or site; failback returns it to the primary after validation.
- DNS, identity, network routing, certificates, and databases usually must recover before dependent applications.
- Recovery must be validated before users are sent back, to avoid split-brain, stale data, and duplicate transactions.
Recovery Sites
Disaster recovery often needs an alternate place to run technology — another physical data center, a cloud region, a managed recovery provider, an alternate office, or a temporary workspace. The right choice depends on business impact, RTO, RPO, budget, risk, and operational complexity.
| Site type | Description | Typical RTO | Best fit |
|---|---|---|---|
| Hot site | Fully equipped, current systems and data, near-ready | Minutes to hours | Very short RTO, high criticality |
| Warm site | Some hardware, connectivity, and replicated data | Hours to a day | Moderate RTO, some setup acceptable |
| Cold site | Space, power, basic infrastructure only | Days | Long RTO, lowest cost |
| Cloud recovery | Workloads recover into cloud infrastructure or another region | Minutes to hours | Flexible capacity and automation |
| Mobile site | Transportable recovery facility | Variable | Temporary location after facility loss |
The cost-versus-speed tradeoff is the heart of most exam questions. A hot site recovers quickly but is expensive because capacity, connectivity, software licensing, and continuous data synchronization must be maintained. A cold site is cheap but slow because equipment, systems, and data must be brought in or provisioned from scratch. A warm site sits between them, with some hardware and partial data ready. When a question stresses lowest cost, lean cold; when it stresses the shortest tolerable downtime, lean hot.
Restoration Ordering
Restoration ordering is not a technical preference; it follows mission-essential functions and their dependencies. A web application cannot serve users until network routing, DNS, identity services, secrets, certificates, and database access are restored, and it should not take production traffic until monitoring is active. A typical dependency-aware order:
- Safety and facility-access decisions (people first).
- Network connectivity, routing, DNS, and firewall rules.
- Identity, privileged access, secrets, certificates, and admin tooling.
- Storage platforms, databases, and replicated data.
- Core application services that support mission-essential functions.
- Validation, monitoring, logging, and security controls.
- User access, customer communication, and lower-priority systems.
This sequence flexes by scenario. If identity is compromised, the team may need to rebuild or rotate credentials before restoring dependent applications. If a database is corrupted, restoring application servers first wastes time — the data layer must be clean and validated first. The exam reliably rewards answers that cite dependencies and priority over "restore everything" or "restore the easiest system first."
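The dependency reasoning above can be sketched as a topological sort. This is a minimal illustration, not a planning tool: the service names and the `DEPENDS_ON` map are hypothetical, and a real plan would also weigh business priority among services whose dependencies are already satisfied.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must be up before it.
DEPENDS_ON = {
    "network": set(),
    "dns": {"network"},
    "identity": {"network", "dns"},
    "certificates": {"identity"},
    "database": {"network", "identity"},
    "web_app": {"dns", "identity", "certificates", "database"},
    "reporting": {"web_app", "database"},
}


def restoration_order(depends_on):
    """Return services in an order where every dependency comes first.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = restoration_order(DEPENDS_ON)
```

Several valid orders usually exist; what matters is that no service appears before something it depends on. A circular dependency raises an error, which is itself a useful finding to surface during planning rather than during an outage.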
Failover and Failback
Failover moves service from a failed primary environment to an alternate one. It can be manual or automated. Examples include promoting a standby database, routing users to another cloud region, switching a phone queue to a backup carrier, or activating workloads at a recovery site.
Failback is the controlled return from the alternate environment to the primary after the primary is repaired, cleaned, validated, and ready. Failback carries its own risk because data created during alternate operation must be reconciled.
| Hazard during failback | What it means | Mitigation |
|---|---|---|
| Split-brain | Two nodes both think they are primary | Quorum, fencing, single authoritative source |
| Stale data | Primary still holds pre-failover data | Re-sync from the active replica before cutover |
| Duplicate transactions | Records written in both environments | Reconciliation and idempotent processing |
| Premature return | Cutting back to an unstable primary | Validate stability before scheduling failback |
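The "reconciliation and idempotent processing" mitigation can be sketched as follows. This assumes every transaction carries a unique ID (the `txn_id` field here is hypothetical), so a record written in both environments is applied once, and a record that differs between environments is flagged for review instead of being silently overwritten.

```python
def reconcile(primary_records, recovery_records):
    """Merge records from both environments, keyed by transaction ID.

    Returns (merged_records, conflicting_ids). A transaction seen in both
    environments with identical content is kept once; one with differing
    content keeps the first copy seen and is reported as a conflict.
    """
    merged = {}
    conflicts = []
    for source in (primary_records, recovery_records):
        for record in source:
            txn_id = record["txn_id"]  # hypothetical unique key
            if txn_id not in merged:
                merged[txn_id] = record
            elif merged[txn_id] != record:
                conflicts.append(txn_id)
    return list(merged.values()), conflicts


primary = [{"txn_id": "t1", "amount": 10}, {"txn_id": "t2", "amount": 20}]
recovery = [
    {"txn_id": "t2", "amount": 20},  # duplicate written in both places
    {"txn_id": "t3", "amount": 30},  # created only during failover
    {"txn_id": "t1", "amount": 15},  # diverged copy, needs human review
]
merged, conflicts = reconcile(primary, recovery)
```

Keeping the first copy on conflict is an arbitrary choice made for brevity; which environment is authoritative is a decision the recovery plan should make explicitly.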
Scenario Example
An online testing platform loses its primary region during an exam window. The plan says candidate authentication and active exam delivery have the shortest RTO. The team fails over DNS to a secondary region, confirms identity-provider federation works, validates exam session state from replicated storage, confirms monitoring, and updates proctors. Reporting dashboards wait until active exams are stable.
If the primary region returns after 30 minutes, the correct answer is not "fail back immediately." The team should validate stability, compare data, ensure no sessions are lost, and schedule failback only when risk is acceptable. Often, staying on the recovery site until the exam window ends is safer than moving live users twice during a fragile period — a classic CC distractor that punishes the hasty choice.
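The failback reasoning in this scenario can be written as a simple decision gate. This is only a sketch: the field names and the 120-minute stability threshold are assumptions chosen for illustration, not values from any standard.

```python
from dataclasses import dataclass


@dataclass
class FailbackStatus:
    primary_stable_minutes: int  # how long the primary has been healthy
    data_reconciled: bool        # recovery-side writes merged and verified
    active_sessions: int         # live users on the recovery site
    in_critical_window: bool     # e.g. an exam window is still open


def failback_decision(status, min_stable_minutes=120):
    """Return ("proceed" | "hold", reasons) for a proposed failback."""
    reasons = []
    if status.primary_stable_minutes < min_stable_minutes:
        reasons.append("primary not yet proven stable")
    if not status.data_reconciled:
        reasons.append("data not reconciled")
    if status.in_critical_window and status.active_sessions > 0:
        reasons.append("live users in a critical window")
    return ("hold" if reasons else "proceed", reasons)


# The scenario above: primary back for 30 minutes, exams still running.
decision, reasons = failback_decision(
    FailbackStatus(primary_stable_minutes=30, data_reconciled=False,
                   active_sessions=500, in_critical_window=True)
)
```

Returning the reasons alongside the decision keeps the "hold" explainable to stakeholders who are asking why service has not moved back yet.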
Matching the Site to the Objective
The deciding factor between site types is almost always the RTO set in the business impact analysis, balanced against budget. A function with a 15-minute RTO cannot tolerate a cold site, because provisioning hardware and restoring data would take far longer than 15 minutes. A function with a five-day RTO does not justify the recurring cost of a hot site. When a question hands you an RTO and a cost constraint, choose the cheapest site type that still meets the RTO — that is the disciplined answer the exam expects, not simply "buy the hot site."
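The "cheapest site that still meets the RTO" rule can be sketched in a few lines. The recovery times and cost ranks below are illustrative assumptions, not standard figures; real values come from the organization's own testing and budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteOption:
    name: str
    worst_case_recovery_hours: float  # illustrative assumption
    relative_cost: int                # 1 = cheapest; illustrative ranking


SITES = [
    SiteOption("cold", 96, 1),   # assumed: about four days to provision
    SiteOption("warm", 24, 2),   # assumed: up to a day of setup
    SiteOption("hot", 0.25, 3),  # assumed: ready within 15 minutes
]


def cheapest_site_meeting_rto(rto_hours, sites=SITES):
    """Return the lowest-cost site whose recovery time fits the RTO.

    Returns None when no option is fast enough, which signals that the
    RTO or the budget has to be revisited.
    """
    candidates = [s for s in sites if s.worst_case_recovery_hours <= rto_hours]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)
```

Under these assumed numbers, a 15-minute RTO (`0.25` hours) leaves only the hot site as a candidate, while a five-day RTO (`120` hours) admits all three and the cold site wins on cost.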
Cloud recovery has reshaped this decision. Because cloud capacity can be provisioned on demand, organizations can keep data replicated cheaply and start compute only when disaster strikes, a model often called pilot light. A related model, warm standby, keeps a scaled-down copy of the workload running so it can be scaled up at failover. Both blur the old hot/warm/cold lines, but the underlying tradeoff of speed versus standing cost is unchanged.
Validation Before Returning Users
No recovery is complete until it is validated. The exam consistently rewards answers that verify a system works before sending users to it. Validation steps include:
- Confirming the application starts and authenticates against the identity provider.
- Running representative transactions and checking data integrity.
- Verifying monitoring, logging, and security controls are active.
- Comparing data between primary and recovery copies to detect divergence.
Skipping validation is how teams create duplicate transactions, lose data written during failover, or send customers to a half-working system. A restored server that no one has tested is a liability, not a recovery — connect every restoration step to a validation step before declaring success.
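The validation steps above can be tied together in a small gate that refuses to declare success while any check fails. This is a sketch under stated assumptions: the check names are hypothetical, and the divergence test assumes both copies can be exported as lists of JSON-serializable rows.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash a collection of rows so that row order does not matter."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def run_validation(checks):
    """Run (name, callable) checks; return (all_passed, failed_names).

    A check that raises an exception is treated as a failure.
    """
    failures = []
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return (not failures, failures)


# Hypothetical data exported from each environment.
primary_rows = [{"id": 1, "score": 90}, {"id": 2, "score": 75}]
recovery_rows = [{"id": 2, "score": 75}, {"id": 1, "score": 90}]

ok, failed = run_validation([
    ("data matches", lambda: fingerprint(primary_rows) == fingerprint(recovery_rows)),
    ("monitoring active", lambda: True),  # placeholder for a real probe
])
```

Comparing fingerprints only shows that the copies differ, not where; a real reconciliation would follow up with a row-level comparison.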
Which recovery site offers the fastest recovery but the highest ongoing maintenance cost?
Failover and failback differ in direction. Which statement describes failback?
A restored web server cannot function because DNS, identity, certificates, and its database are still unavailable. What DR weakness does this reveal?