
DR and HA Concepts: RPO, RTO, MTTR, MTBF, Sites, Failover, and Testing

Key Takeaways

  • RPO defines acceptable data loss in time, while RTO defines acceptable time to restore service.
  • MTTR measures average repair or recovery time, while MTBF measures expected time between failures.
  • Cold, warm, and hot sites represent different tradeoffs among cost, readiness, and recovery speed.
  • High availability designs include redundancy, clustering, active-active, active-passive, load balancing, and failover.
  • DR and HA plans must be tested through tabletop, walk-through, simulation, failover, and restore exercises.
Last updated: April 2026

Disaster Recovery, High Availability, and Testing

Resilience planning connects business requirements to technical design. The right answer depends on how much downtime, data loss, cost, and complexity the organization can accept.

Recovery and Reliability Terms

| Term | Meaning | Example |
| --- | --- | --- |
| RPO | Recovery point objective, maximum acceptable data loss measured in time | Lose no more than 15 minutes of transactions |
| RTO | Recovery time objective, target time to restore service | Restore VPN service within 1 hour |
| MTTR | Mean time to repair or recover | Average time to restore after failure |
| MTBF | Mean time between failures | Expected reliability interval for a component |

RPO and RTO are requirements. MTTR and MTBF are measurements or estimates that help evaluate reliability and supportability.
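To show how MTTR and MTBF are measured rather than chosen, here is a minimal Python sketch that derives both from an outage log. The `Outage` record, the `reliability_metrics` function, and the sample numbers are hypothetical. The availability figure uses the common steady-state estimate MTBF / (MTBF + MTTR), and MTBF is taken here as operating time divided by the number of failures.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One outage, in hours since the observation window began (illustrative record)."""
    start_hour: float
    end_hour: float


def reliability_metrics(outages, observed_hours):
    """Return (mttr, mtbf, availability) for an observation window."""
    if not outages:
        raise ValueError("at least one outage is needed to compute MTTR and MTBF")
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = observed_hours - downtime
    mttr = downtime / len(outages)       # average time to restore
    mtbf = uptime / len(outages)         # average operating time between failures
    availability = mtbf / (mtbf + mttr)  # steady-state estimate
    return mttr, mtbf, availability


# Hypothetical 30-day window (720 hours) with three outages.
outages = [Outage(100.0, 102.0), Outage(400.0, 401.0), Outage(650.0, 653.0)]
mttr, mtbf, availability = reliability_metrics(outages, observed_hours=720.0)
print(f"MTTR: {mttr:.1f} h, MTBF: {mtbf:.1f} h, availability: {availability:.2%}")
```

A measured MTTR that exceeds the RTO suggests the requirement is not being met in practice, even if the plan says otherwise.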

Recovery Sites

| Site type | Readiness | Cost | Recovery speed |
| --- | --- | --- | --- |
| Cold site | Facility or space with minimal equipment | Lower | Slowest |
| Warm site | Some systems, connectivity, and data prepared | Medium | Moderate |
| Hot site | Ready environment with current data and active services or rapid failover | Higher | Fastest |

A cold site may be acceptable for a low-priority office function with a long RTO. A hot site may be required for a service with a short RTO and short RPO.

High Availability Designs

| Design | Description | Notes |
| --- | --- | --- |
| Redundant component | Extra power supply, link, switch, firewall, or ISP | Removes a single point of failure |
| Load balancing | Distributes client requests across multiple systems | Can improve scale and availability |
| Clustering | Multiple nodes act as one service group | Often supports failover |
| Active-active | Multiple nodes handle production traffic at the same time | Efficient but more complex |
| Active-passive | Standby node waits until failover | Simpler capacity planning but idle resources |
| Geographic redundancy | Services run or recover in another location | Reduces site or regional risk |

HA reduces downtime from component failures. DR addresses larger disruptions such as site loss, major outage, corruption, or disaster conditions. They overlap but are not the same.
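The effect of redundancy on component-level downtime can be estimated with basic probability. The sketch below is illustrative only: the function names are hypothetical, and it assumes components fail independently, which is exactly the assumption that a site-wide event breaks and one reason HA alone does not replace DR.

```python
def parallel_availability(component_availability: float, count: int) -> float:
    """Availability of `count` redundant components where any one is enough.

    Assumes independent failures (an idealization).
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    # The group is down only when every component is down at once.
    return 1.0 - (1.0 - component_availability) ** count


def series_availability(availabilities) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


single = 0.99                                  # hypothetical firewall availability
pair = parallel_availability(single, 2)        # redundant pair of firewalls
chain = series_availability([single, single])  # two single points of failure in a row
print(f"single: {single:.4f}, redundant pair: {pair:.4f}, two in series: {chain:.4f}")
```

Under these assumptions, adding a second component raises availability, while every non-redundant component placed in the path lowers it.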

Failover and Split-Brain

Failover moves service from a failed component or site to a standby or peer. Designs should define health checks, failover triggers, state synchronization, routing changes, DNS behavior, and how failback occurs. Cluster designs must prevent split-brain, where two nodes believe they are primary and accept conflicting changes.
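One common way to prevent split-brain is a quorum rule: a node may act as primary only while it can reach a strict majority of the cluster. The text above does not prescribe a specific mechanism, so treat the following as a minimal sketch of that one technique. The function names and node IDs are hypothetical, and real clusters often combine quorum with fencing or a witness node.

```python
def has_quorum(reachable_votes: int, cluster_size: int) -> bool:
    """True when the reachable votes form a strict majority of the cluster."""
    return reachable_votes > cluster_size // 2


def may_act_as_primary(node_id, reachable_peers, cluster_members) -> bool:
    """Decide whether a node is allowed to accept writes as primary.

    The node counts its own vote plus every known peer it can currently reach.
    """
    peers = set(cluster_members) - {node_id}
    votes = 1 + len(set(reachable_peers) & peers)
    return has_quorum(votes, len(cluster_members))


members = ["node-a", "node-b", "node-c"]

# Network partition: node-a is isolated, node-b and node-c still see each other.
print(may_act_as_primary("node-a", [], members))          # minority side
print(may_act_as_primary("node-b", ["node-c"], members))  # majority side

# In a two-node cluster an isolated node never holds a majority,
# which is why such designs usually add a witness or tiebreaker.
print(may_act_as_primary("node-a", [], ["node-a", "node-b"]))
```

Because at most one side of a partition can hold a strict majority, two nodes cannot both qualify as primary at the same time under this rule.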

Testing Methods

| Test type | What happens |
| --- | --- |
| Tabletop | Stakeholders discuss a scenario and decisions |
| Walk-through | Team reviews procedures step by step |
| Simulation | Exercise uses realistic inputs without full production impact |
| Failover test | Service is moved to standby resources |
| Restore test | Data or configuration is restored and validated |
| Full interruption test | Production is intentionally interrupted under controlled conditions |

Testing should produce evidence: results, gaps, timing, failed assumptions, updated contacts, and revised runbooks. A plan that is never tested is only a theory.
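Test evidence is easier to act on when timing is recorded against the stated objectives. The following sketch shows one possible record structure. The `RecoveryTestResult` class, its field names, and the sample measurements are all hypothetical and not taken from any standard.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class RecoveryTestResult:
    """Evidence captured from one DR or HA exercise (illustrative structure)."""
    test_type: str
    rto: timedelta
    rpo: timedelta
    measured_recovery: timedelta
    measured_data_loss: timedelta
    gaps: list = field(default_factory=list)

    def met_rto(self) -> bool:
        return self.measured_recovery <= self.rto

    def met_rpo(self) -> bool:
        return self.measured_data_loss <= self.rpo

    def passed(self) -> bool:
        # A test passes only if both objectives were met.
        return self.met_rto() and self.met_rpo()


# Hypothetical failover test: recovery took longer than the objective allows.
result = RecoveryTestResult(
    test_type="failover",
    rto=timedelta(minutes=30),
    rpo=timedelta(minutes=5),
    measured_recovery=timedelta(minutes=42),
    measured_data_loss=timedelta(minutes=3),
    gaps=["DNS record had a long TTL", "on-call contact list was outdated"],
)
print(f"RTO met: {result.met_rto()}, RPO met: {result.met_rpo()}, passed: {result.passed()}")
```

Recording the gaps alongside the timing gives the runbook revision a concrete starting point.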

Practical Scenario

A company says its ordering system must recover within 30 minutes and lose no more than 5 minutes of orders. That requirement points toward a warm or hot design with frequent replication or transaction protection, tested failover, current documentation, monitoring, and a restoration procedure. A nightly backup to offline storage would not meet the RPO.
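The RPO reasoning in this scenario comes down to simple arithmetic: if data is copied once per interval, a failure just before the next copy loses up to one full interval of changes. The sketch below makes that check explicit. The function names and the one-minute replication interval are assumptions for illustration, and the model ignores transfer lag, which would only make the worst case larger.

```python
from datetime import timedelta


def worst_case_data_loss(copy_interval: timedelta) -> timedelta:
    """Upper bound on lost changes when copies are taken once per interval."""
    # Simplification: a failure right before the next copy loses the whole interval.
    return copy_interval


def meets_rpo(copy_interval: timedelta, rpo: timedelta) -> bool:
    return worst_case_data_loss(copy_interval) <= rpo


rpo = timedelta(minutes=5)                 # from the ordering-system requirement
nightly_backup = timedelta(hours=24)
frequent_replication = timedelta(minutes=1)  # hypothetical replication interval

print("Nightly backup meets RPO:", meets_rpo(nightly_backup, rpo))
print("Frequent replication meets RPO:", meets_rpo(frequent_replication, rpo))
```

The same comparison applies to the RTO: the measured time to fail over or restore must fit within the 30-minute target.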

Common Exam Traps

| Trap | Better exam reasoning |
| --- | --- |
| "RTO is the amount of data loss." | RTO is time to restore service; RPO is acceptable data loss. |
| "Active-active is always best." | It can be costly and complex; requirements drive design. |
| "A hot site eliminates the need for backups." | Replication can copy corruption or deletion; backups are still needed. |
| "A tabletop test proves systems can fail over." | It validates decisions and procedures, not necessarily technical failover. |

Quick Drill

Match the clue:

  1. Back online within 2 hours: RTO.
  2. Lose no more than 10 minutes of changes: RPO.
  3. Average repair time after device failure: MTTR.
  4. Both firewalls pass production traffic: active-active.
  5. Recovery facility with equipment already running: hot site.
Test Your Knowledge

A service must be restored within 45 minutes and can lose no more than 10 minutes of data. Which value is the RTO and which is the RPO?
Test Your Knowledge: Multi-Select

Which statements about recovery sites are correct? Choose two.

  • A hot site is generally faster to recover than a cold site
  • A cold site usually has less equipment and readiness than a warm site
  • A cold site always has live replicated production data
  • A hot site removes the need to test recovery plans
Test Your Knowledge

Which exercise involves stakeholders discussing a disaster scenario and decisions without necessarily moving production services?
