2.4 Disaster Recovery Strategies — RTO and RPO
Key Takeaways
- RPO (Recovery Point Objective) defines how much data loss is acceptable; RTO (Recovery Time Objective) defines how quickly the system must be restored.
- The four DR strategies, in order of increasing cost and decreasing recovery time, are: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active.
- Backup & Restore has the highest RTO/RPO (hours) but lowest cost; Multi-Site Active-Active has near-zero RTO/RPO but highest cost.
- Pilot Light keeps core systems (database) running in the DR Region but requires scaling up compute during failover.
- Warm Standby runs a scaled-down version of the full environment in the DR Region, reducing failover time to minutes.
Quick Answer: RPO = max acceptable data loss (time). RTO = max acceptable downtime (time). Four DR strategies: Backup & Restore (cheapest, hours RTO), Pilot Light (core running, minutes-hours RTO), Warm Standby (scaled-down environment, minutes RTO), Multi-Site Active-Active (most expensive, near-zero RTO).
RPO and RTO Defined
| Metric | Definition | Example |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum amount of data you can afford to lose, measured in time | RPO of 1 hour = you can lose up to 1 hour of data |
| RTO (Recovery Time Objective) | Maximum time your system can be down after a disaster | RTO of 4 hours = system must be back online within 4 hours |
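As a rough illustration of the two definitions, the sketch below treats data loss as the gap between the last backup and the incident, and downtime as the gap between the incident and service restoration. The function name `meets_objectives` and all timestamps are hypothetical, chosen to match the 1-hour RPO and 4-hour RTO examples in the table.

```python
from datetime import datetime, timedelta

def meets_objectives(last_backup, disaster, restored, rpo, rto):
    """Return (rpo_met, rto_met) for a single incident."""
    data_loss = disaster - last_backup  # data written after the last backup is lost
    downtime = restored - disaster      # time the system was unavailable
    return data_loss <= rpo, downtime <= rto

# Hypothetical incident: last backup 09:00, outage 09:45, service back 12:30.
result = meets_objectives(
    last_backup=datetime(2024, 1, 1, 9, 0),
    disaster=datetime(2024, 1, 1, 9, 45),
    restored=datetime(2024, 1, 1, 12, 30),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)  # (True, True): 45 min of data lost, 2 h 45 min of downtime
```

With a stricter RPO of 30 minutes, the same incident would miss the RPO, because 45 minutes of data were lost.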
The RPO/RTO Trade-off
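Lower RPO/RTO = more expensive and complex to achieve, because it requires more frequent (or continuous) replication and more infrastructure kept running in the DR Region.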
The Four DR Strategies
1. Backup and Restore
Concept: Regularly back up data to another Region. In a disaster, restore from backups and rebuild the environment.
| Aspect | Detail |
|---|---|
| RTO | Hours (time to restore + rebuild) |
| RPO | Hours (depends on backup frequency) |
| Cost | Lowest (pay only for backup storage) |
| Complexity | Lowest |
| AWS Services | AWS Backup, S3 Cross-Region Replication, EBS Snapshots, RDS Automated Backups |
Best for: Non-critical workloads, development environments, workloads where hours of downtime are acceptable.
2. Pilot Light
Concept: Keep the core of your system (typically the database) running in the DR Region at all times. Other components (compute, app servers) are provisioned only during failover.
| Aspect | Detail |
|---|---|
| RTO | Minutes to hours (time to scale up compute) |
| RPO | Minutes (near-real-time data replication) |
| Cost | Low-Medium (database running, compute off) |
| Complexity | Medium |
| AWS Services | RDS cross-Region read replicas, Aurora Global DB, pre-configured AMIs, CloudFormation templates |
Best for: Workloads that can tolerate some downtime but need minimal data loss.
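To see why Pilot Light RTO ranges from minutes to hours, it can help to add up the failover steps. The sketch below assumes the steps run one after another; the step names and durations are hypothetical placeholders, and real values would have to be measured in DR drills.

```python
from datetime import timedelta

# Hypothetical step durations for a Pilot Light failover (assumptions only).
FAILOVER_STEPS = [
    ("promote database replica", timedelta(minutes=5)),
    ("launch compute from pre-configured AMIs", timedelta(minutes=20)),
    ("switch DNS to the DR Region", timedelta(minutes=2)),
]

def estimated_rto(steps):
    """Sum step durations, assuming the steps run sequentially."""
    return sum((duration for _, duration in steps), timedelta())

rto = estimated_rto(FAILOVER_STEPS)
print(rto)  # 0:27:00
```

In this model the compute launch dominates the total, which is the step that Warm Standby shortens by keeping a scaled-down fleet already running.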
3. Warm Standby
Concept: Run a scaled-down but fully functional version of the production environment in the DR Region at all times. Scale up to full production capacity during failover.
| Aspect | Detail |
|---|---|
| RTO | Minutes (scale up existing resources) |
| RPO | Seconds to minutes (continuous replication) |
| Cost | Medium-High (scaled-down environment always running) |
| Complexity | Medium-High |
| AWS Services | Auto Scaling, Route 53 failover, Aurora Global DB, reduced-size EC2 instances |
Best for: Business-critical applications that need rapid recovery.
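Route 53 failover routing is typically expressed as a pair of records, one marked PRIMARY and one SECONDARY. The sketch below only builds the change-batch dictionary and makes no AWS call; the helper names, domain, IP addresses, and health check ID are hypothetical.

```python
def failover_record(name, role, target_ip, health_check_id=None):
    """Build one Route 53 failover record set (PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.lower(),  # distinguishes records that share a name
        "Failover": role,
        "TTL": 60,                      # short TTL so clients pick up the switch quickly
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record

def failover_change_batch(name, primary_ip, standby_ip, health_check_id):
    """Pair a health-checked primary record with a standby record."""
    return {
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(name, "PRIMARY", primary_ip, health_check_id)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(name, "SECONDARY", standby_ip)},
        ]
    }

# Hypothetical values (documentation IP ranges, made-up health check ID).
batch = failover_change_batch("app.example.com.", "203.0.113.10", "198.51.100.10", "hc-primary-example")
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])  # ['PRIMARY', 'SECONDARY']
```

Assuming boto3 is used, a dictionary of this shape is what the Route 53 client's `change_resource_record_sets` call accepts as its `ChangeBatch` argument, alongside a `HostedZoneId`.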
4. Multi-Site Active-Active
Concept: Full production environment running in two or more Regions simultaneously. Traffic is served from all Regions at all times.
| Aspect | Detail |
|---|---|
| RTO | Near-zero (traffic automatically shifts) |
| RPO | Near-zero (synchronous or near-synchronous replication) |
| Cost | Highest (full production in multiple Regions) |
| Complexity | Highest |
| AWS Services | DynamoDB Global Tables, Aurora Global DB, Route 53 latency/weighted routing, CloudFront |
Best for: Mission-critical, zero-downtime applications (financial services, healthcare, global e-commerce).
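The near-zero RTO comes from the fact that the surviving Regions are already serving traffic. A simplified model of weighted routing with health checks is sketched below; the function name, Regions, and weights are hypothetical, and the model does not reproduce every Route 53 edge case.

```python
def traffic_shares(weights, healthy):
    """Share of traffic per Region, counting only healthy Regions."""
    active = {region: w for region, w in weights.items() if region in healthy and w > 0}
    total = sum(active.values())
    if total == 0:
        raise RuntimeError("no healthy Region available")
    return {region: w / total for region, w in active.items()}

weights = {"us-east-1": 50, "eu-west-1": 50}

print(traffic_shares(weights, healthy={"us-east-1", "eu-west-1"}))
# {'us-east-1': 0.5, 'eu-west-1': 0.5}

print(traffic_shares(weights, healthy={"eu-west-1"}))
# {'eu-west-1': 1.0}
```

When one Region fails its health check, the remaining Region absorbs all traffic without any provisioning step, which is why each Region must be sized to carry the full load.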
Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Lowest |
| Pilot Light | Mins-Hours | Minutes | $$ | Medium |
| Warm Standby | Minutes | Seconds-Mins | $$$ | Medium-High |
| Multi-Site Active-Active | Near-zero | Near-zero | $$$$ | Highest |
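One way to use the comparison is to pick the cheapest strategy that still meets the required RTO and RPO. The sketch below does that with illustrative numeric bounds; the specific values (for example, 30 minutes for Pilot Light RTO) are assumptions made for the example, not AWS guarantees.

```python
from datetime import timedelta

# (strategy, achievable RTO, achievable RPO), ordered from lowest to highest cost.
# The numeric bounds are illustrative assumptions only.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=4), timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=30), timedelta(minutes=5)),
    ("Warm Standby", timedelta(minutes=5), timedelta(minutes=1)),
    ("Multi-Site Active-Active", timedelta(seconds=30), timedelta(seconds=5)),
]

def cheapest_strategy(required_rto, required_rpo):
    """Return the first (cheapest) strategy meeting both targets, or None."""
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto and rpo <= required_rpo:
            return name
    return None

choice = cheapest_strategy(timedelta(minutes=15), timedelta(minutes=1))
print(choice)  # Warm Standby
```

Under these assumed bounds, a 15-minute RTO rules out Pilot Light, so Warm Standby is the cheapest option that satisfies both targets.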
AWS Backup
AWS Backup is a centralized service to automate and manage backups across AWS services.
| Feature | Detail |
|---|---|
| Supported services | EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, S3, and more |
| Backup plans | Define frequency, retention, lifecycle (transition to cold storage, then delete) |
| Cross-Region | Copy backups to another Region for DR |
| Cross-account | Copy backups to another account for isolation |
| Vault Lock | WORM (Write Once Read Many) — prevents backup deletion (compliance) |
| Point-in-time | Continuous backups for supported services (e.g., RDS, Aurora, S3) |
On the Exam: "Centralized backup management across multiple services and accounts" → AWS Backup. "Prevent backup deletion for compliance" → AWS Backup Vault Lock.
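A backup plan with a cross-Region copy might be described as in the sketch below. It only builds the plan dictionary and makes no AWS call; the helper name, plan name, vault names, account ID, and retention values are hypothetical. The check reflects the AWS Backup rule that a backup moved to cold storage must stay there for at least 90 days.

```python
def build_backup_plan(name, vault, dr_vault_arn,
                      schedule="cron(0 5 * * ? *)",  # daily at 05:00 UTC
                      cold_after_days=30, delete_after_days=365):
    """Build a backup plan dict with one rule and a cross-Region copy action."""
    if delete_after_days < cold_after_days + 90:
        raise ValueError("retention must be at least 90 days longer than the cold-storage transition")
    lifecycle = {
        "MoveToColdStorageAfterDays": cold_after_days,
        "DeleteAfterDays": delete_after_days,
    }
    return {
        "BackupPlanName": name,
        "Rules": [{
            "RuleName": f"{name}-daily",
            "TargetBackupVaultName": vault,
            "ScheduleExpression": schedule,
            "Lifecycle": lifecycle,
            # Copy each recovery point to a vault in the DR Region.
            "CopyActions": [{
                "DestinationBackupVaultArn": dr_vault_arn,
                "Lifecycle": dict(lifecycle),
            }],
        }],
    }

# Hypothetical names and account ID.
plan = build_backup_plan(
    "critical-app",
    "primary-vault",
    "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
)
print(plan["Rules"][0]["CopyActions"][0]["DestinationBackupVaultArn"])
```

Assuming boto3 is used, a dictionary of this shape is what the AWS Backup client's `create_backup_plan` call accepts as its `BackupPlan` argument.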
A company requires an RTO of 15 minutes and an RPO of 1 minute for their critical application. Which DR strategy should they implement?
In a Pilot Light disaster recovery strategy, which components typically run continuously in the DR Region?
Order the disaster recovery strategies from LOWEST cost to HIGHEST cost: