2.4 Disaster Recovery Strategies — RTO and RPO
Key Takeaways
- RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time; RTO (Recovery Time Objective) is the maximum acceptable downtime measured in time.
- The four AWS DR strategies, cheapest/slowest to costliest/fastest, are Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active.
- Pilot Light keeps the data layer running in the DR Region but provisions compute only on failover; Warm Standby runs a scaled-down full stack ready to scale up.
- Multi-Site Active-Active serves live traffic from two or more Regions for near-zero RTO/RPO at the highest cost and complexity.
- AWS Backup centralizes backups across services and accounts, supports cross-Region/cross-account copy, and offers Vault Lock (WORM) to block deletion for compliance.
Quick Answer: RPO = how much data (in time) you can lose; RTO = how long (in time) you can be down. Match them to four strategies: Backup & Restore (cheapest, hours), Pilot Light (data layer hot, minutes–hours), Warm Standby (scaled-down stack, minutes), Multi-Site Active-Active (live everywhere, near-zero). Tighter objectives cost more.
Defining RPO and RTO
These two terms anchor almost every DR exam question. Read the scenario, extract the numbers, then pick the cheapest strategy that still meets both.
| Metric | Definition | Example |
|---|---|---|
| RPO (Recovery Point Objective) | Max data loss tolerated, in time | RPO 1 hour → may lose up to 1 hour of writes |
| RTO (Recovery Time Objective) | Max downtime tolerated, in time | RTO 4 hours → must be back within 4 hours |
Lower RPO/RTO generally means more standing infrastructure and higher cost; that trade-off is the heart of the domain. A backup-only design typically has an RPO of hours, while an active-active design approaches zero.
The Four Strategies
1. Backup & Restore
Regularly back up data (and AMIs/templates) to another Region; on disaster, restore the data and rebuild the stack. RTO hours, RPO hours, lowest cost. Services: AWS Backup, S3 Cross-Region Replication, EBS snapshots, RDS automated backups. Best for dev/test and other workloads that tolerate hours of downtime and data loss.
2. Pilot Light
Keep only the core data layer (database) replicating and running in the DR Region; compute is off until failover, then provisioned from pre-baked AMIs or CloudFormation. RTO minutes–hours, RPO minutes, low-medium cost.
3. Warm Standby
Run a scaled-down but fully functional copy of production in the DR Region at all times. On disaster you scale it up to full size and shift traffic. RTO minutes, RPO seconds–minutes, medium-high cost. Services: Auto Scaling, Aurora Global Database, Route 53 failover. Best for business-critical apps needing rapid recovery.
4. Multi-Site Active-Active
Full production runs in two or more Regions simultaneously, all serving live traffic. On failure, traffic simply shifts. RTO and RPO near-zero, highest cost and complexity. Services: DynamoDB Global Tables, Aurora Global Database, Route 53 latency/weighted routing, CloudFront. Best for financial, healthcare, and global commerce.
Strategy Comparison
| Strategy | RTO | RPO | Cost | What runs in DR |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Nothing (just backups) |
| Pilot Light | Mins–Hours | Minutes | $$ | Data layer only |
| Warm Standby | Minutes | Secs–Mins | $$$ | Scaled-down full stack |
| Multi-Site | Near-zero | Near-zero | $$$$ | Full production |
A reliable shortcut: "database always on, compute off" = Pilot Light; "small version of everything always running" = Warm Standby; "both Regions live" = Multi-Site.
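The "cheapest tier that meets both numbers" rule can be sketched as a small lookup. This is only an illustration: the table above gives ranges such as "hours" and "minutes", so the numeric ceilings below are assumptions chosen to make the comparison concrete, and `Strategy`, `STRATEGIES`, and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    cost_rank: int      # 1 = cheapest, 4 = most expensive
    rto_ceiling: float  # assumed worst-case recovery time, in minutes
    rpo_ceiling: float  # assumed worst-case data loss, in minutes


# Assumed ceilings for illustration only; real figures depend on the design.
STRATEGIES = [
    Strategy("Backup & Restore", 1, 24 * 60, 24 * 60),
    Strategy("Pilot Light", 2, 120, 15),
    Strategy("Warm Standby", 3, 10, 1),
    Strategy("Multi-Site Active-Active", 4, 1, 0.1),
]


def cheapest_strategy(required_rto: float, required_rpo: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose ceilings satisfy both targets.

    Both arguments are in minutes. Returns None when no tier is fast enough.
    """
    candidates = [
        s for s in STRATEGIES
        if s.rto_ceiling <= required_rto and s.rpo_ceiling <= required_rpo
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_rank)


if __name__ == "__main__":
    # RTO of 10 minutes and RPO of 1 minute, as in a typical scenario.
    choice = cheapest_strategy(required_rto=10, required_rpo=1)
    print(choice.name if choice else "no strategy meets both targets")
```

The filter applies both constraints before cost is considered, which mirrors the exam approach: a strategy that meets the RTO but misses the RPO is not a candidate, however cheap it is.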
AWS Backup
AWS Backup centralizes and automates backups across services so you do not script each one separately.
| Feature | Detail |
|---|---|
| Supported services | EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, S3, Storage Gateway |
| Backup plans | Frequency, retention, lifecycle to cold storage |
| Cross-Region copy | Replicate backups to a DR Region |
| Cross-account copy | Isolate backups in a separate account |
| Vault Lock | WORM (Write Once Read Many) — blocks deletion for compliance |
On the Exam: "Centralized backup across many services/accounts" → AWS Backup. "Prevent anyone, even admins, from deleting backups for a retention period" → AWS Backup Vault Lock in compliance mode. "Lowest-cost DR that tolerates hours of downtime" → Backup & Restore.
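As a rough sketch of how the backup plan, cross-Region copy, and Vault Lock features fit together, the helpers below build the request documents that boto3's `create_backup_plan` and `put_backup_vault_lock_configuration` calls accept. The function names, plan name, vault names, ARN, schedule, and retention values are all hypothetical placeholders, and cold-storage lifecycle is not available for every resource type, so treat this as a starting point rather than a reference configuration.

```python
def build_backup_plan(plan_name, vault_name, dr_vault_arn,
                      cold_after_days=30, delete_after_days=365):
    """Build a BackupPlan document with a daily rule and a cross-Region copy."""
    # AWS Backup requires retention to be at least 90 days longer than the
    # cold-storage transition, so reject inconsistent values early.
    if delete_after_days < cold_after_days + 90:
        raise ValueError("delete_after_days must be >= cold_after_days + 90")
    lifecycle = {
        "MoveToColdStorageAfterDays": cold_after_days,
        "DeleteAfterDays": delete_after_days,
    }
    return {
        "BackupPlanName": plan_name,
        "Rules": [{
            "RuleName": "daily",
            "TargetBackupVaultName": vault_name,
            "ScheduleExpression": "cron(0 5 * * ? *)",  # assumed: 05:00 UTC daily
            "Lifecycle": lifecycle,
            # Copy each recovery point to a vault in the DR Region (or account).
            "CopyActions": [{
                "DestinationBackupVaultArn": dr_vault_arn,
                "Lifecycle": dict(lifecycle),
            }],
        }],
    }


def build_vault_lock(vault_name, min_days, max_days, cooling_off_days=3):
    """Build Vault Lock arguments; setting ChangeableForDays selects compliance mode."""
    return {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_days,
        "MaxRetentionDays": max_days,
        "ChangeableForDays": cooling_off_days,  # lock becomes immutable afterwards
    }


def apply(plan, lock):
    """Send both documents to AWS Backup (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builders work without boto3 installed
    client = boto3.client("backup")
    client.create_backup_plan(BackupPlan=plan)
    client.put_backup_vault_lock_configuration(**lock)


if __name__ == "__main__":
    plan = build_backup_plan(
        "prod-daily", "prod-vault",
        "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
    )
    lock = build_vault_lock("prod-vault", min_days=30, max_days=365)
    print(plan["Rules"][0]["CopyActions"], lock)
```

Omitting `ChangeableForDays` would leave the vault in governance mode, where sufficiently privileged users can still remove the lock; the compliance-mode scenario in the exam tip corresponds to supplying it.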
Building Blocks That Drive RPO and RTO
Your achievable RPO is set by how often data is captured or replicated, and your RTO by how fast you can stand the stack back up. Knowing which AWS feature delivers which window is what the exam tests.
| Data-protection feature | RPO it enables |
|---|---|
| Scheduled snapshots only (e.g., nightly EBS or manual RDS snapshots) | Hours (back to the last snapshot) |
| RDS automated backups or DynamoDB with point-in-time recovery (PITR) | ~5 minutes |
| Aurora Global Database replication | Sub-second |
| DynamoDB Global Tables | ~Sub-second (active-active) |
| S3 Cross-Region Replication | Minutes (async) |
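For periodic captures, the worst-case RPO can be worked out with simple arithmetic: if a disaster strikes just before the next capture becomes usable in the DR Region, everything written since the previous capture is lost. A minimal sketch, assuming the capture interval and copy lag are known (the function names and sample values are hypothetical):

```python
def worst_case_rpo_minutes(capture_interval_min, copy_lag_min=0.0):
    """Worst-case data loss for periodic capture.

    The newest usable copy may have been captured a full interval ago, plus
    however long a capture takes to land in the DR Region.
    """
    return capture_interval_min + copy_lag_min


def meets_rpo(required_rpo_min, capture_interval_min, copy_lag_min=0.0):
    """True when the worst-case loss stays within the required RPO."""
    return worst_case_rpo_minutes(capture_interval_min, copy_lag_min) <= required_rpo_min


if __name__ == "__main__":
    # Nightly snapshot with an assumed 30-minute cross-Region copy,
    # checked against a 1-hour RPO.
    print(worst_case_rpo_minutes(24 * 60, 30), meets_rpo(60, 24 * 60, 30))
    # Continuous log capture with roughly a 5-minute window, same RPO.
    print(worst_case_rpo_minutes(5), meets_rpo(60, 5))
```

The same reasoning explains why snapshot-only designs land in the "hours" row: shortening the interval is the only way to shrink the window without moving to continuous replication.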
To shrink RTO, pre-bake everything: store AMIs and CloudFormation/CDK templates so the DR environment rebuilds with one deploy, keep AMIs copied to the DR Region, and automate DNS cutover with Route 53 health checks + failover. Infrastructure as code is the difference between an hours-long manual rebuild and a minutes-long automated one, so a Backup & Restore design with templated infrastructure can beat a sloppy Pilot Light.
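One way to see why automation dominates RTO is to add up the recovery phases. The sketch below is an estimate under stated assumptions: the DNS figure uses a 30-second health-check interval, a failure threshold of 3, and a 60-second record TTL, while the provisioning and restore durations are made-up numbers for comparison. The function names and phase labels are hypothetical.

```python
from typing import Mapping


def dns_failover_minutes(check_interval_s=30, failure_threshold=3, record_ttl_s=60):
    """Approximate time for failover routing to take effect.

    Health checks must fail `failure_threshold` times in a row, and clients
    may keep the old answer cached for up to one TTL.
    """
    return (check_interval_s * failure_threshold + record_ttl_s) / 60


def estimate_rto_minutes(phases: Mapping[str, float]) -> float:
    """Sum sequential recovery phases (in minutes) into one RTO estimate."""
    return sum(phases.values())


if __name__ == "__main__":
    cutover = dns_failover_minutes()
    manual = {"dns_cutover": cutover, "provision_compute": 180, "restore_data": 60}
    templated = {"dns_cutover": cutover, "provision_compute": 20, "restore_data": 60}
    print("manual rebuild:", estimate_rto_minutes(manual), "min")
    print("templated rebuild:", estimate_rto_minutes(templated), "min")
```

The phases are treated as strictly sequential, which is pessimistic; in practice some steps, such as data restore and compute provisioning, can overlap.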
Testing, Pitfalls, and AWS Elastic Disaster Recovery
- Test the plan. A DR strategy that has never been failover-tested should be assumed broken; the exam favors answers that include regular game-day failover drills and automated runbooks.
- AWS Elastic Disaster Recovery (DRS) continuously replicates on-premises or cross-Region servers at the block level into a low-cost staging area and launches full instances on demand. It is an exam answer for low-RPO recovery of existing servers (including servers that still run on-premises) without re-architecting.
- Don't over-buy. If a scenario tolerates hours of downtime, Multi-Site Active-Active is the wrong (over-engineered, over-cost) answer; pick the cheapest tier that meets both RPO and RTO. Conversely, never answer Backup & Restore when the scenario states near-zero downtime.
- Cross-account isolation (a separate backup account, often with AWS Backup cross-account copy and SCP guardrails) protects backups from ransomware or a compromised production account — choose it when the threat is malicious deletion rather than infrastructure failure.
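A game-day drill is only useful if its results are compared against the objectives. The following sketch scores one drill from three timestamps; the function name, field names, and sample times are hypothetical, and it assumes the team records when the outage began, the timestamp of the last write that survived recovery, and when service was restored.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_recovered_write, outage_start, service_restored,
                   rpo_target, rto_target):
    """Compare what a failover drill achieved with the stated objectives."""
    achieved_rpo = outage_start - last_recovered_write  # data lost, as time
    achieved_rto = service_restored - outage_start      # downtime
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }


if __name__ == "__main__":
    result = evaluate_drill(
        last_recovered_write=datetime(2024, 1, 1, 11, 59, 30),
        outage_start=datetime(2024, 1, 1, 12, 0, 0),
        service_restored=datetime(2024, 1, 1, 12, 8, 0),
        rpo_target=timedelta(minutes=1),
        rto_target=timedelta(minutes=10),
    )
    print(result)
```

Tracking these two measurements across drills shows whether the chosen strategy still clears both bars as the workload and data volume grow.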
In short, decode the two numbers, map each to the feature that achieves it, automate the rebuild, and select the lowest-cost strategy that still clears both bars.
A business-critical application requires an RTO of about 10 minutes and an RPO of seconds, but the team wants to avoid the cost of running full production in two Regions. Which strategy fits best?
In a Pilot Light strategy, which components are kept running continuously in the DR Region?
A compliance mandate requires that backups cannot be deleted or altered by anyone, including administrators, for the full retention period. Which AWS Backup feature satisfies this?