Recovery Sites, Restoration Ordering, and Failover

Key Takeaways

  • Hot sites recover fastest at highest cost; warm sites are partially prepared; cold sites are cheapest but slowest.
  • Restoration ordering must follow business priority and technical dependencies, not convenience.
  • Failover moves service to an alternate system or site; failback returns it to the primary after validation.
  • DNS, identity, network routing, certificates, and databases usually must recover before dependent applications.
  • Recovery must be validated before users are sent back, to avoid split-brain, stale data, and duplicate transactions.
Last updated: June 2026

Recovery Sites

Disaster recovery often needs an alternate place to run technology — another physical data center, a cloud region, a managed recovery provider, an alternate office, or a temporary workspace. The right choice depends on business impact, RTO, RPO, budget, risk, and operational complexity.

Site type | Description | Typical RTO | Best fit
Hot site | Fully equipped, current systems and data, near-ready | Minutes to hours | Very short RTO, high criticality
Warm site | Some hardware, connectivity, and replicated data | Hours to a day | Moderate RTO, some setup acceptable
Cold site | Space, power, basic infrastructure only | Days | Long RTO, lowest cost
Cloud recovery | Workloads recover into cloud infrastructure or another region | Minutes to hours | Flexible capacity and automation
Mobile site | Transportable recovery facility | Variable | Temporary location after facility loss

The cost-versus-speed tradeoff is the heart of most exam questions. A hot site recovers quickly but is expensive because capacity, connectivity, software licensing, and continuous data synchronization must be maintained. A cold site is cheap but slow because equipment, systems, and data must be brought in or provisioned from scratch. A warm site sits between them, with some hardware and partial data ready. When a question stresses lowest cost, lean cold; when it stresses the shortest tolerable downtime, lean hot.

Restoration Ordering

Restoration ordering is not a technical preference — it follows mission-essential functions and their dependencies. A web application is useless until identity services, DNS, network routing, database access, secrets, certificates, and monitoring are restored. A typical dependency-aware order:

  1. Safety and facility-access decisions (people first).
  2. Network connectivity, routing, DNS, and firewall rules.
  3. Identity, privileged access, secrets, certificates, and admin tooling.
  4. Storage platforms, databases, and replicated data.
  5. Core application services that support mission-essential functions.
  6. Validation, monitoring, logging, and security controls.
  7. User access, customer communication, and lower-priority systems.

This sequence flexes by scenario. If identity is compromised, the team may need to rebuild or rotate credentials before restoring dependent applications. If a database is corrupted, restoring application servers first wastes time — the data layer must be clean and validated first. The exam reliably rewards answers that cite dependencies and priority over "restore everything" or "restore the easiest system first."
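One way to make dependency-aware ordering concrete is a topological sort over a dependency map. The sketch below uses Python's standard-library graphlib module; the service names and the dependency map are hypothetical and only illustrate the idea that nothing is restored before the things it relies on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be restored before it.
DEPENDS_ON = {
    "network": set(),
    "dns": {"network"},
    "identity": {"network", "dns"},
    "secrets": {"identity"},
    "database": {"network", "secrets"},
    "web_app": {"dns", "identity", "secrets", "database"},
    "reporting": {"web_app"},
}


def restoration_order(depends_on):
    """Return services in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(depends_on).static_order())


order = restoration_order(DEPENDS_ON)
print(order)
```

A real plan would also weight business priority, which a plain topological sort does not capture. The sort only guarantees that dependencies are respected.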

Failover and Failback

Failover moves service from a failed primary environment to an alternate one. It can be manual or automated, and examples include promoting a standby database, routing users to another cloud region, switching a phone queue to a backup carrier, or activating workloads at a recovery site.

Failback is the controlled return from the alternate environment to the primary after the primary is repaired, cleaned, validated, and ready. Failback carries its own risk because data created during alternate operation must be reconciled.

Hazard during failback | What it means | Mitigation
Split-brain | Two nodes both think they are primary | Quorum, fencing, single authoritative source
Stale data | Primary still holds pre-failover data | Re-sync from the active replica before cutover
Duplicate transactions | Records written in both environments | Reconciliation and idempotent processing
Premature return | Cutting back to an unstable primary | Validate stability before scheduling failback
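The "reconciliation and idempotent processing" mitigation can be illustrated with a small merge routine. This is a minimal sketch that assumes every record carries a unique transaction ID (the txn_id field and the sample records are hypothetical): a record seen twice with identical content is applied once, and a record whose content differs between environments is flagged for review rather than silently overwritten.

```python
def reconcile(primary_records, recovery_records):
    """Merge records from both environments, applying each transaction ID once."""
    merged = {}
    conflicts = []
    for rec in [*primary_records, *recovery_records]:
        txn_id = rec["txn_id"]
        if txn_id not in merged:
            merged[txn_id] = rec          # first time seen: apply it
        elif merged[txn_id] != rec:
            conflicts.append(txn_id)      # same ID, different content: needs review
        # identical repeat: ignored, which is what makes the merge idempotent
    return merged, conflicts


# Hypothetical sample data written before and during failover.
primary = [{"txn_id": "t1", "amount": 10}, {"txn_id": "t2", "amount": 20}]
recovery = [
    {"txn_id": "t2", "amount": 20},   # duplicate of a primary record
    {"txn_id": "t3", "amount": 5},    # written only during failover
    {"txn_id": "t1", "amount": 11},   # diverged from the primary copy
]
merged, conflicts = reconcile(primary, recovery)
print(sorted(merged), conflicts)
```

Which copy wins a conflict is a business decision. The sketch keeps the first copy seen and reports the ID so a person can decide.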

Scenario Example

An online testing platform loses its primary region during an exam window. The plan says candidate authentication and active exam delivery have the shortest RTO. The team fails over DNS to a secondary region, confirms identity-provider federation works, validates exam session state from replicated storage, confirms monitoring, and updates proctors. Reporting dashboards wait until active exams are stable.

If the primary region returns after 30 minutes, the correct answer is not "fail back immediately." The team should validate stability, compare data, ensure no sessions are lost, and schedule failback only when risk is acceptable. Often, staying on the recovery site until the exam window ends is safer than moving live users twice during a fragile period — a classic CC distractor that punishes the hasty choice.

Matching the Site to the Objective

The deciding factor between site types is almost always the RTO set in the business impact analysis, balanced against budget. A function with a 15-minute RTO cannot tolerate a cold site, because provisioning hardware and restoring data would take far longer than 15 minutes. A function with a five-day RTO does not justify the recurring cost of a hot site. When a question hands you an RTO and a cost constraint, choose the cheapest site type that still meets the RTO — that is the disciplined answer the exam expects, not simply "buy the hot site."
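The "cheapest site that still meets the RTO" rule can be written as a short selection function. The recovery times and relative costs below are illustrative assumptions, not figures from any standard. They only preserve the ordering described above: hot is fastest and most expensive, cold is slowest and cheapest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteOption:
    name: str
    worst_case_recovery_hours: float  # assumed figure for illustration
    relative_cost: int                # assumed cost ranking, higher is more expensive


# Hypothetical options; real values come from the BIA and vendor quotes.
OPTIONS = [
    SiteOption("cold", 72, 1),
    SiteOption("warm", 24, 3),
    SiteOption("hot", 0.25, 10),
]


def cheapest_site_meeting_rto(rto_hours, options=OPTIONS):
    """Return the lowest-cost option whose recovery time fits the RTO, or None."""
    candidates = [o for o in options if o.worst_case_recovery_hours <= rto_hours]
    if not candidates:
        return None  # no listed option can meet this RTO
    return min(candidates, key=lambda o: o.relative_cost)


print(cheapest_site_meeting_rto(0.25).name)  # 15-minute RTO
print(cheapest_site_meeting_rto(120).name)   # five-day RTO
```

Under these assumed numbers, the 15-minute RTO leaves only the hot site as a candidate, while the five-day RTO admits all three and the cold site wins on cost.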

Cloud recovery has reshaped this decision. Because cloud capacity can be provisioned on demand, organizations can keep data replicated cheaply and spin up compute only when disaster strikes, a model often called pilot light. Warm standby is a close variant that keeps a scaled-down copy of the workload running at all times. Both blur the old hot/warm/cold lines, but the underlying tradeoff of speed versus standing cost remains the same.

Validation Before Returning Users

No recovery is complete until it is validated. The exam consistently rewards answers that verify a system works before sending users to it. Validation steps include:

  • Confirming the application starts and authenticates against the identity provider.
  • Running representative transactions and checking data integrity.
  • Verifying monitoring, logging, and security controls are active.
  • Comparing data between primary and recovery copies to detect divergence.

Skipping validation is how teams create duplicate transactions, lose data written during failover, or send customers to a half-working system. A restored server that no one has tested is a liability, not a recovery — connect every restoration step to a validation step before declaring success.
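The validation steps above can be treated as a gate: users are released only when every check passes. The sketch below is a minimal illustration in which each check is a hypothetical callable returning True or False. The lambdas stand in for real probes such as a login test or a data comparison.

```python
def run_validation(checks):
    """Run each named check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a check that crashes counts as a failure
        if not passed:
            failed.append(name)
    return failed


# Hypothetical checks; each lambda is a placeholder for a real probe.
checks = {
    "app_authenticates": lambda: True,
    "sample_transactions_ok": lambda: True,
    "monitoring_active": lambda: False,
    "data_matches_replica": lambda: True,
}

failed = run_validation(checks)
release_users = not failed  # only send users over when nothing failed
print(release_users, failed)
```

Treating an exception as a failed check is a deliberately conservative choice: an untestable system is handled the same way as a broken one.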

Test Your Knowledge

Which recovery site offers the fastest recovery but the highest ongoing maintenance cost?

Failover and failback differ in direction. Which statement describes failback?

A restored web server cannot function because DNS, identity, certificates, and its database are still unavailable. What DR weakness does this reveal?
