DR and HA Concepts: RPO, RTO, MTTR, MTBF, Sites, Failover, and Testing

Key Takeaways

  • RPO defines acceptable data loss in time; RTO defines acceptable time to restore service — do not swap them on the exam.
  • MTTR is mean time to repair/recover and MTBF is mean time between failures; both are reliability measurements, not requirements.
  • Cold, warm, and hot sites trade cost against readiness: cold is cheapest and slowest, hot is most expensive and fastest.
  • First Hop Redundancy Protocols (FHRP) such as VRRP and HSRP provide gateway failover; clustering and load balancing provide service redundancy.
  • DR/HA plans must be tested — tabletop, walk-through, simulation, failover, and restore — and a hot site never removes the need for backups.

Last updated: June 2026

Disaster Recovery, High Availability, and Testing

Network+ objective 3.3 ties business requirements to resilient design. The right architecture depends on how much downtime, data loss, cost, and complexity the organization can tolerate. The exam tests four buckets: the recovery metrics, recovery sites, HA designs, and testing types.

Recovery and reliability terms

Term | Meaning | Example
RPO | Recovery Point Objective: maximum acceptable data loss, measured in time | Lose no more than 15 minutes of transactions
RTO | Recovery Time Objective: target time to restore service | Restore VPN within 1 hour
MTTR | Mean Time To Repair/Recover: average restore time | 45 minutes to swap a failed switch
MTBF | Mean Time Between Failures: expected reliability interval | A power supply rated 100,000 hours

The single most-tested distinction: RPO is about data (how far back), RTO is about time (how long down). A nightly backup gives an RPO of up to 24 hours; if the requirement is "lose no more than 5 minutes," nightly backup fails and you need frequent replication. RPO and RTO are requirements set by the business; MTTR and MTBF are measurements used to evaluate supportability.
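
The arithmetic behind this paragraph can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the function names are hypothetical, the one-minute replication interval is an assumed value, and the availability formula MTBF / (MTBF + MTTR) is the common steady-state estimate rather than something the exam objectives spell out. The MTBF and MTTR figures are borrowed from the table purely as sample inputs, even though they describe different devices.

```python
from datetime import timedelta


def worst_case_data_loss(copy_interval: timedelta) -> timedelta:
    """Worst case, a failure hits just before the next copy runs,
    so everything written since the previous copy is lost."""
    return copy_interval


def meets_rpo(copy_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss stays within the RPO requirement."""
    return worst_case_data_loss(copy_interval) <= rpo


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability from two measurements."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


rpo = timedelta(minutes=5)                      # requirement set by the business
print(meets_rpo(timedelta(hours=24), rpo))      # nightly backup: False
print(meets_rpo(timedelta(minutes=1), rpo))     # assumed 1-minute replication: True
print(availability(100_000, 0.75))              # sample inputs: 100,000 h MTBF, 45 min MTTR
```

Note the direction of the comparison: the requirement (RPO) is fixed by the business, and the design's copy interval is what has to fit under it.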

Recovery sites

Site type | Readiness | Cost | Recovery speed
Cold | Space and power, little or no equipment | Lowest | Slowest (days)
Warm | Some hardware and connectivity, partial data | Medium | Moderate (hours)
Hot | Fully equipped, current data, rapid failover | Highest | Fastest (minutes)

A cold site fits a low-priority function with a long RTO; a hot site is required when both RTO and RPO are short. A cloud / DRaaS option can blur these lines by spinning up capacity on demand.

High availability designs

Design | Description | Note
Redundant component | Dual power supply, link, switch, firewall, ISP | Removes a single point of failure
Load balancing | Spreads requests across nodes | Scale and availability
Clustering | Multiple nodes act as one service | Supports failover
Active-active | All nodes carry production traffic | Efficient but complex
Active-passive | Standby waits for failover | Idle capacity, simpler planning
FHRP (VRRP / HSRP) | Redundant default gateway | Hosts keep one virtual gateway IP
Geographic redundancy | Service runs/recovers in another region | Survives site loss

HA reduces downtime from component failures; DR addresses larger disruptions such as site loss, mass corruption, or disaster. First Hop Redundancy Protocols such as VRRP (open standard) and HSRP (Cisco proprietary) present one virtual gateway IP so clients keep working when the active router fails.
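
The gateway failover idea can be modeled with a toy election. This is a simplified sketch, not an implementation of VRRP or HSRP: the `Router` class, the `elect_active` function, and the documentation-range addresses are all hypothetical. It only captures the rule that the healthy router with the highest priority answers for the virtual IP, with the higher interface address breaking ties. Real protocols add advertisements, timers, and preemption settings.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional


@dataclass
class Router:
    name: str
    address: IPv4Address   # the router's own interface IP
    priority: int          # higher value is preferred
    alive: bool = True


def elect_active(routers: list[Router]) -> Optional[Router]:
    """Pick the router that should answer for the virtual gateway IP.

    Simplified rule: among healthy routers, the highest priority wins
    and the higher interface address breaks ties.
    """
    healthy = [r for r in routers if r.alive]
    if not healthy:
        return None
    return max(healthy, key=lambda r: (r.priority, r.address))


VIRTUAL_GATEWAY = IPv4Address("192.0.2.1")   # the only gateway IP hosts configure

r1 = Router("r1", IPv4Address("192.0.2.2"), priority=110)
r2 = Router("r2", IPv4Address("192.0.2.3"), priority=100)
group = [r1, r2]

print(elect_active(group).name)   # r1 holds the virtual IP
r1.alive = False                  # simulate the active router failing
print(elect_active(group).name)   # r2 takes over; hosts still use VIRTUAL_GATEWAY
```

The point for the exam is that nothing changes on the hosts: the virtual IP stays the same and only the router answering for it changes.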

Failover and split-brain

Failover moves service from a failed component to a standby or peer and must define health checks, triggers, state synchronization, routing/DNS changes, and failback. Cluster designs must prevent split-brain, where two nodes both believe they are primary and accept conflicting writes — usually solved with a quorum or witness.
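
The quorum rule can be shown with a short sketch. It assumes a simple one-vote-per-member model and a hypothetical `has_quorum` helper; real cluster software has its own voting and fencing mechanisms.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """A partition may act as primary only if it holds a strict majority."""
    return 2 * votes_reachable > total_votes


# Two-node cluster, network link between the nodes fails.
# Each node sees only itself: 1 of 2 votes, so neither may take writes.
print(has_quorum(1, 2))   # False

# Same cluster plus a witness (3 votes in total).
# The node that can still reach the witness holds 2 of 3 votes.
print(has_quorum(2, 3))   # True  -> stays or becomes primary
print(has_quorum(1, 3))   # False -> isolated node steps down
```

Because at most one partition can hold a strict majority, two nodes can never both pass the check at the same time, which is what prevents conflicting writes.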

Testing methods

Test | What happens
Tabletop | Stakeholders discuss a scenario and decisions
Walk-through | Team reviews procedures step by step
Simulation | Realistic inputs without full production impact
Failover test | Service is actively moved to standby
Restore test | Data/config is restored and validated
Full interruption | Production is intentionally stopped under control

A tabletop validates decisions and communication, not the technical failover — only a failover or full-interruption test proves the systems actually cut over. Testing must produce evidence: timings, gaps, failed assumptions, and updated runbooks. An untested plan is only a theory.

Practical scenario

An ordering system must recover within 30 minutes (RTO) and lose no more than 5 minutes of orders (RPO). That points to a warm or hot design with frequent replication, FHRP or clustered gateways, tested failover, and current documentation. A nightly offline backup cannot meet the 5-minute RPO, because a failure late in the day could lose up to 24 hours of orders.
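
The scenario can be checked mechanically. The sketch below is illustrative only: the `Requirement` and `Design` classes and the `unmet` helper are hypothetical, and the recovery times and copy intervals assigned to each candidate are placeholder values, not measurements from any real system.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Requirement:
    rto: timedelta   # maximum tolerated time to restore service
    rpo: timedelta   # maximum tolerated data loss, expressed as time


@dataclass(frozen=True)
class Design:
    name: str
    recovery_time: timedelta   # assumed time to bring service back
    copy_interval: timedelta   # gap between data copies (worst-case loss)


def unmet(design: Design, req: Requirement) -> list[str]:
    """Return the requirements this design misses (empty list means it fits)."""
    misses = []
    if design.recovery_time > req.rto:
        misses.append("RTO")
    if design.copy_interval > req.rpo:
        misses.append("RPO")
    return misses


ordering = Requirement(rto=timedelta(minutes=30), rpo=timedelta(minutes=5))

# Hypothetical candidates; the timings are placeholders, not measurements.
nightly = Design("nightly offline backup", timedelta(hours=8), timedelta(hours=24))
hot = Design("hot site with replication", timedelta(minutes=10), timedelta(minutes=1))

for d in (nightly, hot):
    print(d.name, "->", unmet(d, ordering) or "meets both")
```

In practice the recovery time fed into a check like this should come from a failover or restore test rather than an estimate.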

Why a hot site still needs backups

A frequent exam misconception is that real-time replication to a hot site removes the need for backups. It does not, because replication faithfully copies everything, including malicious encryption by ransomware, an accidental table drop, or a corrupted file. Within seconds the same damage exists at both sites, leaving no clean copy to restore from. Backups protect against a different threat: they provide point-in-time recovery to a moment before the corruption, and how often they run determines how much data that recovery gives up.

The resilient design therefore layers both — replication for fast failover against hardware and site loss, and versioned, ideally immutable or offline, backups against logical corruption and deletion.
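
A toy model makes the difference between replication and backup concrete. Everything here is hypothetical (the `Site` class, its methods, and the sample data); it assumes replication mirrors each write immediately and backups are independent, versioned snapshots.

```python
import copy


class Site:
    """Primary data store with a live replica and versioned backups."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backups = []   # snapshots, oldest first

    def write(self, key, value):
        self.primary[key] = value
        # Replication mirrors every change, good or bad.
        self.replica = copy.deepcopy(self.primary)

    def take_backup(self):
        # A backup is a separate point-in-time copy, untouched by later writes.
        self.backups.append(copy.deepcopy(self.primary))

    def restore(self, version):
        self.primary = copy.deepcopy(self.backups[version])
        self.replica = copy.deepcopy(self.primary)


site = Site()
site.write("orders", "intact")
site.take_backup()

site.write("orders", "ENCRYPTED")   # ransomware overwrites the data
print(site.replica["orders"])       # the replica now holds the damage too

site.restore(-1)                    # roll back to the latest clean snapshot
print(site.primary["orders"])       # clean data again
```

The replica helps when hardware or a site fails; only the snapshot helps when the data itself is wrong.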

Translating requirements into design

The exam often hands you an RTO and RPO and asks for the matching architecture. A long RTO and long RPO can be met cheaply with a cold site and nightly backups. A short RTO with a short RPO forces near-continuous replication, clustered or load-balanced services, FHRP gateway redundancy, and a warm or hot site with tested failover. The discipline is to let the business numbers drive the spend rather than over-engineering every service to active-active, which wastes money on functions that could tolerate hours of downtime.
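
One way to practice this translation is to write the mapping down as a rule. The sketch below is a study aid under stated assumptions: the `suggest_design` function is hypothetical, and the cut-offs (one hour, one day, 24 hours) are illustrative values loosely following the minutes/hours/days column in the recovery sites table. A real plan takes its thresholds and costs from the business.

```python
from datetime import timedelta

# Assumed cut-offs for illustration only.
HOT_RTO = timedelta(hours=1)       # shorter than this: hot site
WARM_RTO = timedelta(days=1)       # shorter than this: warm site
NIGHTLY_RPO = timedelta(hours=24)  # at least this: nightly backups suffice


def suggest_design(rto: timedelta, rpo: timedelta) -> tuple[str, str]:
    """Map an RTO/RPO pair to a (site type, data protection) suggestion."""
    if rto < HOT_RTO:
        site = "hot site"
    elif rto < WARM_RTO:
        site = "warm site"
    else:
        site = "cold site"

    if rpo >= NIGHTLY_RPO:
        data = "nightly backups"
    else:
        data = "frequent replication plus versioned backups"
    return site, data


# Long RTO and long RPO: the cheap option is enough.
print(suggest_design(timedelta(days=3), timedelta(hours=24)))

# Short RTO and short RPO: spend is justified.
print(suggest_design(timedelta(minutes=30), timedelta(minutes=5)))
```

Keeping the site decision (driven by RTO) separate from the data protection decision (driven by RPO) mirrors how exam questions usually present the two numbers.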

Common exam traps

  • RTO is time to restore; RPO is acceptable data loss. Do not swap them.
  • Active-active is not always best — it adds cost and complexity; requirements drive design.
  • A hot site does not eliminate backups; replication can copy corruption or a deletion.
  • A tabletop test does not prove technical failover; only a failover/full-interruption test does.
  • MTTR and MTBF are measurements, not requirements; RPO and RTO are the requirements the business sets.

Test Your Knowledge

A service must be restored within 45 minutes and can lose no more than 10 minutes of data. Which statement is correct?

Test Your Knowledge (Multi-Select)

Which statements about recovery sites are correct? Choose two.

  • A hot site generally recovers faster than a cold site
  • A cold site usually has less equipment and readiness than a warm site
  • A cold site always holds live, continuously replicated production data
  • A hot site removes the need to test the recovery plan

Test Your Knowledge

Hosts on a subnet must keep a single default gateway IP even if the active router fails. Which technology provides this?

Test Your Knowledge

Which exercise validates decisions and communication without necessarily moving production services to standby resources?
