Disaster Recovery Plan
Key Takeaways
- The DRP restores IT systems, data, and infrastructure to meet the RTO and RPO that the BIA defined.
- Site strategy (hot, warm, cold, mobile, cloud) is selected to meet the RTO at acceptable cost; shorter RTOs cost more.
- RPO dictates backup/replication frequency; RTO dictates how fast the recovery process must complete.
- DRP testing escalates from checklist and walkthrough to simulation, parallel, and full-interruption tests.
What the Disaster Recovery Plan Does
The Disaster Recovery Plan (DRP) is the technology-focused plan for restoring IT systems, applications, networks, and data after a disruptive event. It is the IT execution arm of the broader Business Continuity Plan and must satisfy the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) that the BIA established. If the BIA says the ERP system has a 4-hour RTO and a 30-minute RPO, the DRP must demonstrate it can rebuild that system within four hours using data no more than 30 minutes old.
The DRP documents recovery teams and their tasks, the priority order for restoring systems (driven by BIA criticality), the location and method of backups, the alternate processing site, and the step-by-step technical procedures. It also defines success criteria so a recovery can be declared complete and validated, not merely "the server is up."
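The ERP example and the idea of explicit success criteria can be made concrete with a small check. The sketch below is illustrative only: the class names, field names, and the drill figures (3 h 20 min to restore, 45 minutes of data age) are assumptions, not part of any standard or of the plan described here.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable age of the restored data


@dataclass(frozen=True)
class DrillResult:
    time_to_restore: timedelta    # from disaster declaration to validated service
    restored_data_age: timedelta  # gap between the last recoverable data and the outage


def drill_failures(objectives: RecoveryObjectives, result: DrillResult) -> list:
    """Return the objectives the drill missed; an empty list means both were met."""
    failures = []
    if result.time_to_restore > objectives.rto:
        failures.append("RTO")
    if result.restored_data_age > objectives.rpo:
        failures.append("RPO")
    return failures


# BIA figures from the ERP example: 4-hour RTO, 30-minute RPO.
erp = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=30))

# Hypothetical drill: restored in time, but the data was 45 minutes old.
drill = DrillResult(
    time_to_restore=timedelta(hours=3, minutes=20),
    restored_data_age=timedelta(minutes=45),
)

print(drill_failures(erp, drill))  # ['RPO']
```

The point of the two separate checks is that "the server is up" within four hours still counts as a failed recovery if the restored data is older than the RPO allows.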
Backup Strategy Driven by RPO
RPO sets how often data must be captured. The smaller the tolerable data loss, the more frequent and more expensive the protection.
| Strategy | Typical RPO it supports | Relative cost |
|---|---|---|
| Daily full backup to tape | Up to ~24 hours of loss | Low |
| Periodic incremental + offsite | Hours | Moderate |
| Snapshots every few minutes | Minutes | Higher |
| Synchronous replication | Near zero | Highest |
A classic exam pairing: an RPO measured in minutes cannot be met by nightly backups; replication or frequent snapshots are required. Conversely, paying for synchronous replication on a system with a 24-hour RPO wastes money the BIA does not justify.
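The table can be read as a lookup: take the cheapest strategy whose worst-case data loss still fits inside the RPO. A minimal sketch, assuming concrete worst-case figures for the "Hours" and "Minutes" rows (4 hours and 5 minutes are illustrative assumptions, as is the function name):

```python
from datetime import timedelta

# (strategy, assumed worst-case data loss), ordered cheapest first as in the table.
# The 4-hour and 5-minute figures are illustrative assumptions, not values from the text.
STRATEGIES = [
    ("Daily full backup to tape", timedelta(hours=24)),
    ("Periodic incremental + offsite", timedelta(hours=4)),
    ("Snapshots every few minutes", timedelta(minutes=5)),
    ("Synchronous replication", timedelta(0)),
]


def cheapest_strategy_for(rpo: timedelta) -> str:
    """Pick the first (cheapest) strategy whose worst-case loss fits within the RPO."""
    for name, worst_case_loss in STRATEGIES:
        if worst_case_loss <= rpo:
            return name
    raise ValueError("RPO must be zero or positive")


print(cheapest_strategy_for(timedelta(hours=24)))    # Daily full backup to tape
print(cheapest_strategy_for(timedelta(minutes=15)))  # Snapshots every few minutes
```

Because the list is ordered by cost, the function never returns synchronous replication for a 24-hour RPO, which is the over-spending case described above.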
Recovery Sites and DRP Testing
Matching Site Strategy to RTO
The recovery site is selected to meet the RTO at acceptable cost. There is a direct trade-off: faster recovery costs more.
- Hot site: Live, fully configured duplicate; recovery in minutes to hours. Most expensive, for the shortest RTOs.
- Warm site: Hardware and connectivity in place; software and data must be loaded. Moderate cost, hours to a day.
- Cold site: Power, space, and HVAC only; everything else must be installed. Cheapest, days to recover.
- Mobile site: Transportable, prefitted unit for field or regional recovery.
- Cloud / DRaaS: On-demand failover with pay-as-you-go economics, increasingly common.
- Reciprocal agreement: Mutual-aid pact with another organization; low cost but unreliable capacity and confidentiality risk.
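For the three classic tiers, the selection logic above reduces to choosing the cheapest site that can still recover within the RTO. The sketch below is a simplification under stated assumptions: the 3-day and 1-day recovery times are illustrative values chosen to fit the ranges in the list, and mobile, cloud, and reciprocal options are left out because the text gives no recovery times for them.

```python
from datetime import timedelta

# Assumed worst-case recovery times (illustrative, not from the text):
COLD_SITE_RECOVERY = timedelta(days=3)   # "days to recover"
WARM_SITE_RECOVERY = timedelta(days=1)   # "hours to a day"


def cheapest_site_for(rto: timedelta) -> str:
    """Return the cheapest classic site tier that can meet the given RTO."""
    if rto >= COLD_SITE_RECOVERY:
        return "cold"
    if rto >= WARM_SITE_RECOVERY:
        return "warm"
    # Anything tighter than the warm site's recovery time needs a hot site.
    return "hot"


print(cheapest_site_for(timedelta(days=5)))      # cold
print(cheapest_site_for(timedelta(hours=36)))    # warm
print(cheapest_site_for(timedelta(minutes=30)))  # hot
```

In practice the thresholds would come from tested recovery times at each candidate site rather than from fixed constants.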
Levels of DRP Testing
ISACA expects testing rigor to escalate. Know the order from least to most disruptive:
| Test type | Description | Disruption |
|---|---|---|
| Checklist / desk review | Verify plan completeness and resources on paper | None |
| Structured walkthrough / tabletop | Team talks through roles and scenarios | None |
| Simulation | Acting out a scenario without touching production | Low |
| Parallel | Recovery systems run alongside production to validate | Moderate |
| Full interruption | Production is failed over to recovery; highest realism | High, risk to operations |
Worked judgment: A new DRP has never been validated and the business cannot risk an outage. The appropriate first test is a structured walkthrough or simulation, not a full-interruption test: you build confidence before stressing production. Jumping straight to full interruption on an unproven plan is the trap answer. After each test, findings feed plan updates, mirroring the lessons-learned discipline of incident management.
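The escalation logic can be sketched as a small eligibility filter over the test table. The gating rule used here is an assumption made for illustration (the text only says rigor should escalate): tests with no or low disruption are always eligible, while a moderate or high disruption test is eligible only after the level just below it has passed, and full interruption additionally requires that the business can tolerate an outage.

```python
# (test type, disruption) in the escalation order from the table above.
TESTS = [
    ("checklist", "none"),
    ("walkthrough", "none"),
    ("simulation", "low"),
    ("parallel", "moderate"),
    ("full interruption", "high"),
]


def eligible_tests(passed: set, outage_tolerable: bool) -> list:
    """List the tests that are reasonable to run next under the assumed gating rule."""
    eligible = []
    for index, (name, disruption) in enumerate(TESTS):
        if disruption in ("none", "low"):
            eligible.append(name)  # safe even for an unproven plan
            continue
        if disruption == "high" and not outage_tolerable:
            continue  # never risk production when an outage is unacceptable
        if TESTS[index - 1][0] in passed:
            eligible.append(name)  # previous level has already passed
    return eligible


# The worked judgment: brand-new plan, no outage tolerance.
print(eligible_tests(passed=set(), outage_tolerable=False))
# ['checklist', 'walkthrough', 'simulation']
```

Full interruption never appears for the unproven plan, which is why it is the trap answer.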
Recovery Strategy Trade-offs and Common Exam Pitfalls
The DRP turns BIA numbers into engineering and spending decisions, and CISM tests whether candidates can match the strategy to the requirement without over- or under-investing.
Cost Versus Recovery Speed
The governing relationship is that shorter RTOs and RPOs cost more. A near-zero RPO requires synchronous replication and redundant storage; a near-zero RTO requires a live hot site or active-active architecture. The security manager's role is to ensure the chosen solution is proportional to documented business impact. Spending hot-site money on a system the BIA rated low criticality wastes budget; relying on nightly tape for a system with a 15-minute RPO guarantees a failed recovery. The defensible answer always traces back to the BIA.
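The proportionality test can be expressed as a comparison between the BIA's RPO and how often data is actually captured, since the worst-case loss of an interval-based backup is roughly one full interval. A minimal sketch; the function name, the labels, and the factor of 10 used to flag over-investment are all illustrative assumptions rather than an ISACA rule.

```python
from datetime import timedelta


def assess_protection(rpo: timedelta, capture_interval: timedelta,
                      overspend_factor: int = 10) -> str:
    """Compare the data-capture interval against the BIA's RPO.

    overspend_factor is an assumed threshold: capturing data more than
    that many times as often as the RPO requires is flagged as over-investment.
    """
    if capture_interval > rpo:
        return "under-protected"  # worst-case loss exceeds the RPO
    if capture_interval * overspend_factor < rpo:
        return "over-invested"    # far more protection than the BIA justifies
    return "proportional"


# Nightly tape for a 15-minute RPO: guaranteed to miss the objective.
print(assess_protection(timedelta(minutes=15), timedelta(hours=24)))  # under-protected

# Synchronous replication (interval ~0) for a 24-hour RPO: wasted budget.
print(assess_protection(timedelta(hours=24), timedelta(0)))           # over-invested

# 5-minute snapshots for a 15-minute RPO.
print(assess_protection(timedelta(minutes=15), timedelta(minutes=5))) # proportional
```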
Backups Are Not a Recovery Plan by Themselves
Having backups is necessary but not sufficient. ISACA expects:
- Offsite or geographically separated copies so a single site disaster does not destroy both production and backups.
- Regular restore testing: an untested backup is an assumption, not a recovery capability. Backups that cannot be restored are a frequent root cause of failed recoveries.
- Protection of backups against ransomware, including immutable or air-gapped copies, since attackers target backups to force payment.
- Documented recovery procedures, so restoration does not depend on one person's memory.
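The four expectations above work as a checklist that a review can apply to each backup set. The sketch below is one hypothetical way to record that review; the class, field names, and gap messages are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPosture:
    offsite_copy: bool             # a copy exists away from the production site
    restore_tested: bool           # restores are exercised on a regular schedule
    immutable_or_air_gapped: bool  # at least one copy ransomware cannot alter
    procedures_documented: bool    # restoration steps are written down


# Field name -> message reported when that expectation is not met.
EXPECTATIONS = {
    "offsite_copy": "no offsite or geographically separated copy",
    "restore_tested": "no regular restore testing",
    "immutable_or_air_gapped": "no immutable or air-gapped copy",
    "procedures_documented": "no documented recovery procedures",
}


def backup_gaps(posture: BackupPosture) -> list:
    """Return a message for every expectation the backup set fails to meet."""
    return [message for field, message in EXPECTATIONS.items()
            if not getattr(posture, field)]


posture = BackupPosture(offsite_copy=True, restore_tested=False,
                        immutable_or_air_gapped=False, procedures_documented=True)
print(backup_gaps(posture))
# ['no regular restore testing', 'no immutable or air-gapped copy']
```

An empty list means the backups meet all four expectations; it does not by itself prove the RTO and RPO can be met, which still requires a recovery test.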
Frequent CISM Pitfalls
- Choosing a recovery site before the BIA exists: this is solving without requirements and is consistently the wrong answer.
- Confusing RTO and RPO when sizing the solution; RTO drives site choice, RPO drives backup frequency.
- Skipping straight to full-interruption testing on an unproven plan, risking the very outage the plan is meant to prevent.
- Treating the DRP as separate from the BCP; the DRP must support the continuity priorities the BCP sets, not run on its own logic.
- Forgetting people: recovery procedures assume trained staff are available; cross-training and documentation reduce key-person risk.
The synthesis CISM wants: the DRP is a business-driven, tested IT recovery capability sized to BIA objectives, with proven backups, an appropriately costed recovery site, and an escalating test program whose findings continuously improve the plan.
A critical application has a Recovery Time Objective (RTO) of 30 minutes. Which recovery site strategy is most appropriate?
An organization has never validated its newly written DRP and cannot tolerate any production outage. Which test should it run first?