DR and HA Concepts: RPO, RTO, MTTR, MTBF, Sites, Failover, and Testing
Key Takeaways
- RPO defines acceptable data loss in time; RTO defines acceptable time to restore service — do not swap them on the exam.
- MTTR is mean time to repair/recover and MTBF is mean time between failures; both are reliability measurements, not requirements.
- Cold, warm, and hot sites trade cost against readiness: cold is cheapest and slowest, hot is most expensive and fastest.
- First Hop Redundancy Protocols (FHRP) such as VRRP and HSRP provide gateway failover; clustering and load balancing provide service redundancy.
- DR/HA plans must be tested — tabletop, walk-through, simulation, failover, and restore — and a hot site never removes the need for backups.
Disaster Recovery, High Availability, and Testing
Network+ objective 3.3 ties business requirements to resilient design. The right architecture depends on how much downtime, data loss, cost, and complexity the organization can tolerate. The exam tests four buckets: the recovery metrics, recovery sites, HA designs, and testing types.
Recovery and reliability terms
| Term | Meaning | Example |
|---|---|---|
| RPO | Recovery Point Objective — max acceptable data loss, measured in time | Lose no more than 15 minutes of transactions |
| RTO | Recovery Time Objective — target time to restore service | Restore VPN within 1 hour |
| MTTR | Mean Time To Repair/Recover — average restore time | 45 minutes to swap a failed switch |
| MTBF | Mean Time Between Failures — expected reliability interval | A power supply rated 100,000 hours |
The single most-tested distinction: RPO is about data (how far back), RTO is about time (how long down). A nightly backup gives an RPO of up to 24 hours; if the requirement is "lose no more than 5 minutes," nightly backup fails and you need frequent replication. RPO and RTO are requirements set by the business; MTTR and MTBF are measurements used to evaluate supportability.
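The nightly-backup reasoning above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the availability formula MTBF / (MTBF + MTTR) is the standard steady-state estimate rather than something this guide defines. The sample figures are borrowed from the table purely as inputs.

```python
from datetime import timedelta

def worst_case_data_loss(copy_interval: timedelta) -> timedelta:
    """Worst case: the failure strikes just before the next copy would run."""
    return copy_interval

def meets_rpo(copy_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss fits inside the business RPO."""
    return worst_case_data_loss(copy_interval) <= rpo

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability estimate from two measurements."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

rpo = timedelta(minutes=5)
print(meets_rpo(timedelta(hours=24), rpo))   # nightly backup: False
print(meets_rpo(timedelta(minutes=1), rpo))  # copy every minute: True

# Illustrative inputs: 100,000 h MTBF and a 45-minute (0.75 h) MTTR.
print(f"{availability(100_000, 0.75):.6f}")
```

The first two calls compare a design against a requirement (RPO), while the last one combines two measurements (MTBF and MTTR), which mirrors the requirement-versus-measurement split described above.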
Recovery sites
| Site type | Readiness | Cost | Recovery speed |
|---|---|---|---|
| Cold | Space and power, little or no equipment | Lowest | Slowest (days) |
| Warm | Some hardware and connectivity, partial data | Medium | Moderate (hours) |
| Hot | Fully equipped, current data, rapid failover | Highest | Fastest (minutes) |
A cold site fits a low-priority function with a long RTO; a hot site is required when both RTO and RPO are short. A cloud or disaster recovery as a service (DRaaS) option can blur these lines by spinning up capacity on demand.
High availability designs
| Design | Description | Note |
|---|---|---|
| Redundant component | Dual power supply, link, switch, firewall, ISP | Removes a single point of failure |
| Load balancing | Spreads requests across nodes | Scale and availability |
| Clustering | Multiple nodes act as one service | Supports failover |
| Active-active | All nodes carry production traffic | Efficient but complex |
| Active-passive | Standby waits for failover | Idle capacity, simpler planning |
| FHRP (VRRP / HSRP) | Redundant default gateway | Hosts keep one virtual gateway IP |
| Geographic redundancy | Service runs/recovers in another region | Survives site loss |
HA reduces downtime from component failures; DR addresses larger disruptions such as site loss, mass corruption, or disaster. First Hop Redundancy Protocols — VRRP (open standard) and HSRP (Cisco) — present one virtual gateway IP so clients keep working when the active router fails.
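To make the gateway failover idea concrete, here is a minimal simulation of a priority-based election in the style of VRRP: the healthy router with the highest priority becomes active, and a tie goes to the higher IP address. It is a sketch, not a protocol implementation; it ignores advertisements, timers, and preemption, and the `Router` class, addresses, and priorities are all hypothetical.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address
from typing import Iterable, Optional

# Hypothetical virtual gateway address that every host is configured with.
VIRTUAL_GATEWAY = IPv4Address("192.0.2.1")

@dataclass(frozen=True)
class Router:
    name: str
    ip: IPv4Address
    priority: int
    healthy: bool = True

def elect_active(routers: Iterable[Router]) -> Optional[Router]:
    """Pick the healthy router with the highest priority; break ties on IP."""
    candidates = [r for r in routers if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, int(r.ip)))

r1 = Router("r1", IPv4Address("192.0.2.2"), priority=110)
r2 = Router("r2", IPv4Address("192.0.2.3"), priority=100)

print(elect_active([r1, r2]).name)            # r1 owns the virtual IP

failed_r1 = replace(r1, healthy=False)        # simulate the active router failing
print(elect_active([failed_r1, r2]).name)     # r2 takes over

print(VIRTUAL_GATEWAY)                        # hosts never change this setting
```

The point of the sketch is the last line: the router answering for the virtual address changes, but the gateway configured on each host does not.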
Failover and split-brain
Failover moves service from a failed component to a standby or peer and must define health checks, triggers, state synchronization, routing/DNS changes, and failback. Cluster designs must prevent split-brain, where two nodes both believe they are primary and accept conflicting writes — usually solved with a quorum or witness.
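A quorum rule can be sketched in a few lines: a node may act as primary only if it can reach a strict majority of the cluster's votes. The function name and the two-node-plus-witness layout below are assumptions chosen for illustration, not a description of any particular cluster product.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """A strict majority is required; exactly half is not enough."""
    return votes_reachable > total_votes // 2

# Two nodes plus a witness: three votes in total.
# A network partition leaves node A able to reach the witness, node B isolated.
print(has_quorum(votes_reachable=2, total_votes=3))  # node A: True, stays primary
print(has_quorum(votes_reachable=1, total_votes=3))  # node B: False, steps down

# Two nodes and no witness: each side sees only its own vote.
print(has_quorum(votes_reachable=1, total_votes=2))  # False on both sides
```

Because at most one side of any partition can hold a strict majority, two nodes can never both pass the check at once. The last case shows why a two-node cluster needs a witness: without one, neither side can safely continue.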
Testing methods
| Test | What happens |
|---|---|
| Tabletop | Stakeholders discuss a scenario and decisions |
| Walk-through | Team reviews procedures step by step |
| Simulation | Realistic inputs without full production impact |
| Failover test | Service is actively moved to standby |
| Restore test | Data/config is restored and validated |
| Full interruption | Production is intentionally stopped under control |
A tabletop validates decisions and communication, not the technical failover — only a failover or full-interruption test proves the systems actually cut over. Testing must produce evidence: timings, gaps, failed assumptions, and updated runbooks. An untested plan is only a theory.
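One way to turn test timings into evidence is to compare what a drill measured against the RTO and RPO targets and list the gaps. The sketch below assumes a hypothetical `DrillResult` record and made-up measurements; a real runbook would capture far more detail.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List

@dataclass
class DrillResult:
    name: str
    recovery_time: timedelta   # measured time until service was restored
    data_loss: timedelta       # measured window of lost data

def find_gaps(result: DrillResult, rto: timedelta, rpo: timedelta) -> List[str]:
    """Return one finding for every target the drill missed."""
    gaps = []
    if result.recovery_time > rto:
        gaps.append(f"{result.name}: recovery took {result.recovery_time}, RTO is {rto}")
    if result.data_loss > rpo:
        gaps.append(f"{result.name}: lost {result.data_loss} of data, RPO is {rpo}")
    return gaps

# Hypothetical failover test measured against a 30-minute RTO and 5-minute RPO.
drill = DrillResult("failover test", timedelta(minutes=42), timedelta(minutes=3))
for finding in find_gaps(drill, rto=timedelta(minutes=30), rpo=timedelta(minutes=5)):
    print(finding)
```

An empty list means the drill met both targets; each finding is a candidate for a runbook update.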
Practical scenario
An ordering system must recover within 30 minutes (RTO) and lose no more than 5 minutes of orders (RPO). That points to a warm or hot design with frequent replication, FHRP or clustered gateways, tested failover, and current documentation. A nightly offline backup cannot meet the 5-minute RPO, because a failure late in the day could lose nearly 24 hours of orders.
Why a hot site still needs backups
A frequent exam misconception is that real-time replication to a hot site removes the need for backups. It does not, because replication faithfully copies every change, including ransomware encryption, an accidental table drop, or a corrupted file. Within seconds the same damage exists at both sites, leaving no clean copy to restore from. Backups address a different threat: they provide point-in-time recovery to a moment before the corruption. Anything written after that backup is lost, so backup frequency still has to fit the RPO.
The resilient design therefore layers both — replication for fast failover against hardware and site loss, and versioned, ideally immutable or offline, backups against logical corruption and deletion.
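The difference between a replica and versioned backups shows up when you ask for the newest copy taken before the corruption. The sketch below uses hypothetical timestamps and a hypothetical helper name; it models a replica as a single copy that always reflects the latest state.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

def latest_clean_copy(copy_times: Iterable[datetime],
                      corruption_time: datetime) -> Optional[datetime]:
    """Return the newest copy taken strictly before the corruption, if any."""
    clean = [t for t in copy_times if t < corruption_time]
    return max(clean) if clean else None

corruption = datetime(2024, 3, 3, 9, 30)

# Versioned nightly backups keep several points in time.
backups = [datetime(2024, 3, 1), datetime(2024, 3, 2), datetime(2024, 3, 3)]
print(latest_clean_copy(backups, corruption))   # the midnight backup from 3 March

# A replica holds only the current state, updated seconds after each change.
replica = [corruption + timedelta(seconds=5)]
print(latest_clean_copy(replica, corruption))   # None: no clean copy exists
```

In this example the restore loses the 9.5 hours written since midnight, which is the cost of recovering from logical corruption with nightly backups.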
Translating requirements into design
The exam often hands you an RTO and RPO and asks for the matching architecture. A long RTO and long RPO can be met cheaply with a cold site and nightly backups. A short RTO with a short RPO forces near-continuous replication, clustered or load-balanced services, FHRP gateway redundancy, and a warm or hot site with tested failover. The discipline is to let the business numbers drive the spend rather than over-engineering every service to active-active, which wastes money on functions that could tolerate hours of downtime.
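The mapping from business numbers to architecture can be written as a small lookup. The thresholds below are assumptions picked to match the rough "minutes, hours, days" scale of the recovery-site table; they are not exam-defined cutoffs, and a real organization would set its own.

```python
from datetime import timedelta
from typing import Tuple

# Assumed cutoffs for illustration only.
HOT_RTO = timedelta(hours=1)
WARM_RTO = timedelta(hours=24)
REPLICATION_RPO = timedelta(minutes=15)
NIGHTLY_RPO = timedelta(hours=24)

def recommend(rto: timedelta, rpo: timedelta) -> Tuple[str, str]:
    """Return (site type, data protection) for the given targets."""
    if rto <= HOT_RTO:
        site = "hot"
    elif rto <= WARM_RTO:
        site = "warm"
    else:
        site = "cold"

    if rpo <= REPLICATION_RPO:
        protection = "near-continuous replication"
    elif rpo < NIGHTLY_RPO:
        protection = "frequent backups or snapshots"
    else:
        protection = "nightly backups"
    return site, protection

# The ordering system from the practical scenario: 30-minute RTO, 5-minute RPO.
print(recommend(timedelta(minutes=30), timedelta(minutes=5)))
# A low-priority function that tolerates days of downtime and a day of data loss.
print(recommend(timedelta(days=3), timedelta(hours=24)))
```

Keeping the site decision (driven by RTO) separate from the data-protection decision (driven by RPO) reflects the idea that each requirement drives its own part of the spend.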
Common exam traps
- RTO is time to restore; RPO is acceptable data loss. Do not swap them.
- Active-active is not always best — it adds cost and complexity; requirements drive design.
- A hot site does not eliminate backups; replication can copy corruption or a deletion.
- A tabletop test does not prove technical failover; only a failover/full-interruption test does.
- MTTR and MTBF are measurements, not requirements; RPO and RTO are the requirements the business sets.
A service must be restored within 45 minutes and can lose no more than 10 minutes of data. Which statement is correct?
Which statements about recovery sites are correct? Choose two.
Hosts on a subnet must keep a single default gateway IP even if the active router fails. Which technology provides this?
Which exercise validates decisions and communication without necessarily moving production services to standby resources?