DR and HA Concepts: RPO, RTO, MTTR, MTBF, Sites, Failover, and Testing
Key Takeaways
- RPO defines acceptable data loss in time, while RTO defines acceptable time to restore service.
- MTTR measures average repair or recovery time, while MTBF measures expected time between failures.
- Cold, warm, and hot sites represent different tradeoffs among cost, readiness, and recovery speed.
- High availability designs include redundancy, clustering, active-active, active-passive, load balancing, and failover.
- DR and HA plans must be tested through tabletop, walk-through, simulation, failover, and restore exercises.
Disaster Recovery, High Availability, and Testing
Resilience planning connects business requirements to technical design. The right design depends on how much downtime, data loss, cost, and complexity the organization can accept.
Recovery and Reliability Terms
| Term | Meaning | Example |
|---|---|---|
| RPO | Recovery point objective, maximum acceptable data loss measured in time | Lose no more than 15 minutes of transactions |
| RTO | Recovery time objective, target time to restore service | Restore VPN service within 1 hour |
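| MTTR | Mean time to repair (or recover), the average time to restore a component or service after it fails | Failed links are restored in 2 hours on average |
| MTBF | Mean time between failures, the average operating time between one failure and the next | A power supply expected to run 100,000 hours between failures |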
RPO and RTO are requirements. MTTR and MTBF are measurements or estimates that help evaluate reliability and supportability.
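The relationship between the two measurements can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `Outage` record and made-up sample data; it uses the common definitions MTTR = total downtime / failures, MTBF = total uptime / failures, and the steady-state estimate availability = MTBF / (MTBF + MTTR).

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One failure of a component (hypothetical record format)."""
    start_hour: float  # hours since the observation window began
    end_hour: float    # hour at which service was restored


def mttr(outages):
    """Mean time to repair: average outage duration in hours."""
    return sum(o.end_hour - o.start_hour for o in outages) / len(outages)


def mtbf(outages, window_hours):
    """Mean time between failures: total uptime divided by failure count."""
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    return (window_hours - downtime) / len(outages)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative data only: three outages in a 1,000-hour window.
outages = [Outage(100, 102), Outage(400, 401), Outage(750, 753)]
repair = mttr(outages)         # (2 + 1 + 3) / 3 hours
between = mtbf(outages, 1000)  # (1000 - 6) / 3 hours
print(f"MTTR={repair:.2f} h, MTBF={between:.2f} h, "
      f"availability={availability(between, repair):.4f}")
```

A measured MTTR that is longer than the RTO is a signal that the requirement cannot be met with the current repair process.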
Recovery Sites
| Site type | Readiness | Cost | Recovery speed |
|---|---|---|---|
| Cold site | Facility or space with minimal equipment | Lower | Slowest |
| Warm site | Some systems, connectivity, and data prepared | Medium | Moderate |
| Hot site | Ready environment with current data and active services or rapid failover | Higher | Fastest |
A cold site may be acceptable for a low-priority office function with a long RTO. A hot site may be required for a service with a short RTO and short RPO.
High Availability Designs
| Design | Description | Notes |
|---|---|---|
| Redundant component | Extra power supply, link, switch, firewall, or ISP | Removes a single point of failure |
| Load balancing | Distributes client requests across multiple systems | Can improve scale and availability |
| Clustering | Multiple nodes act as one service group | Often supports failover |
| Active-active | Multiple nodes handle production traffic at the same time | Efficient but more complex |
| Active-passive | Standby node waits until failover | Simpler capacity planning but idle resources |
| Geographic redundancy | Services run or recover in another location | Reduces site or regional risk |
HA reduces downtime from component failures. DR addresses larger disruptions such as site loss, major outage, corruption, or disaster conditions. They overlap but are not the same.
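The effect of a redundant component on downtime can be estimated with simple arithmetic. The sketch below assumes that failures are independent and that any single unit can carry the full load; both are simplifying assumptions, and the 99% figure is illustrative rather than taken from any product.

```python
HOURS_PER_YEAR = 8760  # 365 days; ignores leap years


def parallel_availability(single: float, count: int) -> float:
    """Availability of `count` redundant units when any one can carry the load.

    Assumes independent failures, which shared power, shared software bugs,
    or a site-wide event can violate.
    """
    return 1 - (1 - single) ** count


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


for units in (1, 2):
    a = parallel_availability(0.99, units)
    print(f"{units} unit(s): availability={a:.4f}, "
          f"downtime about {annual_downtime_hours(a):.2f} h/year")
```

Under these assumptions, a second 99%-available unit cuts expected downtime from roughly 88 hours a year to under 1 hour. The independence assumption fails when both units share a site, which is the gap DR is meant to cover.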
Failover and Split-Brain
Failover moves service from a failed component or site to a standby or peer. Designs should define health checks, failover triggers, state synchronization, routing changes, DNS behavior, and how failback occurs. Cluster designs must prevent split-brain, where nodes lose contact with each other, more than one believes it is primary, and each accepts conflicting changes.
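One common way to prevent split-brain is a majority quorum: a group of nodes may host the primary only if it can see more than half of the cluster's votes. The sketch below illustrates that rule only; the function names and node names are hypothetical, and real cluster software adds fencing, timeouts, and witness handling.

```python
def has_quorum(votes_seen: int, cluster_size: int) -> bool:
    """Return True when a partition holds a strict majority of all votes."""
    return votes_seen > cluster_size // 2


def elect_primaries(partitions, cluster_size):
    """Return the partitions (lists of node names) allowed to host a primary."""
    return [p for p in partitions if has_quorum(len(p), cluster_size)]


# A three-node cluster splits: two nodes still see each other, one is isolated.
print(elect_primaries([["node-a", "node-b"], ["node-c"]], cluster_size=3))

# An even split of four nodes leaves no majority, so neither side may be primary.
print(elect_primaries([["node-a", "node-b"], ["node-c", "node-d"]], cluster_size=4))
```

Because at most one partition can hold a strict majority, two primaries can never be elected at once. The even-split case shows why two-node and other even-sized clusters usually add a witness or tiebreaker vote.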
Testing Methods
| Test type | What happens |
|---|---|
| Tabletop | Stakeholders discuss a scenario and decisions |
| Walk-through | Team reviews procedures step by step |
| Simulation | Exercise uses realistic inputs without full production impact |
| Failover test | Service is moved to standby resources |
| Restore test | Data or configuration is restored and validated |
| Full interruption test | Production is intentionally interrupted under controlled conditions |
Testing should produce evidence: results, gaps, timing, failed assumptions, updated contacts, and revised runbooks. A plan that is never tested is only a theory.
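Test evidence is easier to compare over time when it is recorded in a consistent shape. The following is a minimal sketch of such a record; the `ExerciseResult` class, its field names, and the sample values are all hypothetical and would be adapted to an organization's own runbook format.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class ExerciseResult:
    """Evidence captured from one DR or HA exercise (hypothetical format)."""
    test_type: str                  # e.g. "failover test" or "restore test"
    rto_target: timedelta           # required time to restore service
    measured_recovery: timedelta    # time the exercise actually took
    rpo_target: timedelta           # maximum acceptable data loss
    measured_data_loss: timedelta   # age of the newest data recovered
    gaps: List[str] = field(default_factory=list)  # failed assumptions, stale contacts

    def met_objectives(self) -> bool:
        """True only if both measured values are within their targets."""
        return (self.measured_recovery <= self.rto_target
                and self.measured_data_loss <= self.rpo_target)


# Illustrative values only.
result = ExerciseResult(
    test_type="failover test",
    rto_target=timedelta(hours=1),
    measured_recovery=timedelta(minutes=75),
    rpo_target=timedelta(minutes=15),
    measured_data_loss=timedelta(minutes=4),
    gaps=["DNS TTL longer than runbook assumed"],
)
print(result.test_type, "met objectives:", result.met_objectives())
for gap in result.gaps:
    print("gap:", gap)
```

In this sample the data loss is within the RPO but recovery overran the RTO, so the exercise is recorded as a miss and the gap feeds the next runbook revision.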
Practical Scenario
A company says its ordering system must recover within 30 minutes and lose no more than 5 minutes of orders. In other words, the RTO is 30 minutes and the RPO is 5 minutes. That requirement points toward a warm or hot design with frequent replication or transaction protection, tested failover, current documentation, monitoring, and a restoration procedure. A nightly backup to offline storage would not meet the RPO, because a failure shortly before the next backup could lose nearly 24 hours of orders.
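The reasoning in this scenario can be checked with a small calculation. The sketch below assumes that worst-case data loss is roughly one full interval between data copies and ignores replication lag and transfer time; the candidate designs and their recovery estimates are illustrative assumptions, not measured values.

```python
from datetime import timedelta

RTO = timedelta(minutes=30)  # ordering system must be back within 30 minutes
RPO = timedelta(minutes=5)   # at most 5 minutes of orders may be lost


def meets_rpo(copy_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case loss is approximately the time between data copies."""
    return copy_interval <= rpo


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """Compare an estimated (ideally tested) recovery time with the RTO."""
    return estimated_recovery <= rto


# (design name, interval between data copies, estimated recovery time)
candidates = [
    ("nightly offline backup", timedelta(hours=24), timedelta(hours=8)),
    ("replication every minute", timedelta(minutes=1), timedelta(minutes=20)),
]

for name, interval, recovery in candidates:
    print(f"{name}: RPO ok={meets_rpo(interval, RPO)}, "
          f"RTO ok={meets_rto(recovery, RTO)}")
```

The estimated recovery times only become trustworthy once a failover or restore test has measured them.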
Common Exam Traps
| Trap | Better exam reasoning |
|---|---|
| "RTO is the amount of data loss." | RTO is time to restore service; RPO is acceptable data loss. |
| "Active-active is always best." | It can be costly and complex; requirements drive design. |
| "A hot site eliminates the need for backups." | Replication can copy corruption or deletion; backups are still needed. |
| "A tabletop test proves systems can fail over." | It validates decisions and procedures, not necessarily technical failover. |
Quick Drill
Match the clue:
- Back online within 2 hours: RTO.
- Lose no more than 10 minutes of changes: RPO.
- Average repair time after device failure: MTTR.
- Both firewalls pass production traffic: active-active.
- Recovery facility with equipment already running: hot site.
A service must be restored within 45 minutes and can lose no more than 10 minutes of data. Which statement is correct?
Which statements about recovery sites are correct? Choose two.
Which exercise involves stakeholders discussing a disaster scenario and decisions without necessarily moving production services?