8.5 Disaster Recovery, Backup, Availability, and Resilience
Key Takeaways
- Disaster recovery restores technology services, while business continuity keeps critical business processes operating.
- Recovery objectives should be based on business impact analysis, not technology preference.
- Backups are useful only when they are protected, restorable, tested, and aligned to RTO and RPO.
- Resilience combines redundancy, diversity, monitoring, exercises, supplier planning, and executive decision authority.
Recovery as a Business Decision
Disaster recovery and resilience are not only data center topics. They address the ability of the organization to continue critical processes during disruption and restore services to acceptable levels. Disruptions may include ransomware, cloud outages, network failures, power loss, fire, flood, supplier failure, regional disaster, human error, pandemic conditions, or destructive insider activity. The CISSP view starts with business impact, then selects technical and operational controls that match that impact.
Business continuity focuses on sustaining critical business processes. Disaster recovery focuses on restoring technology services that support those processes. The two must be connected. A trading platform, hospital scheduling system, payroll process, call center, and manufacturing control system may all need different recovery priorities. The recovery plan should not simply restore the easiest systems first. It should restore what the business needs first.
A business impact analysis (BIA) identifies critical processes, dependencies, maximum tolerable downtime (MTD), financial impact, legal impact, safety impact, customer impact, and resource requirements. It supports recovery time objective (RTO) and recovery point objective (RPO) decisions. RTO is the target time to restore a service after a disruption, and it must be shorter than the MTD so that recovery completes before severe harm occurs. RPO is the maximum acceptable data loss, measured as the time between the last recoverable copy and the disruption. These objectives should be approved by business owners because they drive cost and architecture.
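The relationship between these objectives can be checked mechanically. The following is a minimal sketch, not a standard tool: the `RecoveryObjectives` record, the `check_objectives` function, and all sample values are hypothetical and chosen only for illustration. It flags three common gaps: an RTO longer than the MTD, a backup interval that cannot meet the RPO, and a restore test that missed the RTO.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical record of BIA-derived targets for one service."""
    service: str
    mtd: timedelta  # maximum tolerable downtime
    rto: timedelta  # recovery time objective
    rpo: timedelta  # recovery point objective


def check_objectives(obj, backup_interval, measured_restore_time):
    """Return a list of gaps; an empty list means none of these checks failed."""
    gaps = []
    if obj.rto > obj.mtd:
        gaps.append(f"{obj.service}: RTO exceeds MTD")
    if backup_interval > obj.rpo:
        # Data written since the last backup is lost, so the interval bounds the loss.
        gaps.append(f"{obj.service}: backup interval cannot meet RPO")
    if measured_restore_time > obj.rto:
        gaps.append(f"{obj.service}: last restore test missed RTO")
    return gaps


# Illustrative values only.
order_entry = RecoveryObjectives(
    service="order-entry",
    mtd=timedelta(hours=4),
    rto=timedelta(hours=2),
    rpo=timedelta(minutes=15),
)

gaps = check_objectives(
    order_entry,
    backup_interval=timedelta(hours=1),
    measured_restore_time=timedelta(minutes=90),
)
print(gaps)  # ['order-entry: backup interval cannot meet RPO']
```

In this example the restore test passes, but hourly backups cannot support a fifteen-minute RPO, so either the backup design or the objective has to change.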
Backups are a major recovery control, but backup existence is not backup readiness. Backups must be complete, protected from alteration, retained long enough, encrypted where appropriate, isolated from compromised administrators, and tested through restoration. Ransomware has made immutable or offline backup copies especially important for critical data. A backup that can be deleted by the same compromised account as production data may not provide resilience.
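The isolation point can be expressed as a simple overlap check. This is a sketch under stated assumptions: the inventory format, the account names, and the `is_isolated` rule are all hypothetical, and a real review would draw permissions from the backup platform itself. A copy is treated as isolated here if it is immutable, offline, or not deletable by any production administrator.

```python
def shared_delete_principals(production_admins, backup_delete_principals):
    """Accounts that could destroy both production data and its backup."""
    return sorted(set(production_admins) & set(backup_delete_principals))


def is_isolated(copy, production_admins):
    """Simplified rule: immutable or offline copies pass; others need disjoint accounts."""
    if copy["immutable"] or copy["offline"]:
        return True
    return not shared_delete_principals(production_admins, copy["delete_principals"])


# Hypothetical accounts and backup copies, for illustration only.
production_admins = {"svc-prod-admin", "alice-admin"}

copies = [
    {"name": "nightly-disk", "immutable": False, "offline": False,
     "delete_principals": {"svc-prod-admin", "backup-operator"}},
    {"name": "object-lock-vault", "immutable": True, "offline": False,
     "delete_principals": set()},
    {"name": "dr-account-copy", "immutable": False, "offline": False,
     "delete_principals": {"backup-operator"}},
]

exposed = [c["name"] for c in copies if not is_isolated(c, production_admins)]
print(exposed)  # ['nightly-disk']
```

The flagged copy is the one a compromised production administrator could delete along with the production data.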
| Concept | Practical question | Example decision |
|---|---|---|
| BIA | What business process fails and how badly? | Customer payments cannot stop for more than four hours |
| RTO | How quickly must service return? | Restore order entry within two hours |
| RPO | How much data loss is acceptable? | Lose no more than fifteen minutes of transactions |
| MTD | What is the outer limit before severe harm? | Manual process can sustain one business day |
| DR test | Can the plan work under real constraints? | Restore from isolated backup in a timed exercise |
Availability architecture includes redundancy, clustering, load balancing, failover, replication, alternate sites, diverse network paths, alternate power, spare hardware, and cloud region design. Higher availability usually costs more and may add complexity. Active-active systems reduce downtime but require careful data consistency and monitoring. Cold sites cost less but take longer to activate. Hot sites are kept ready for rapid failover at the highest cost. Warm sites fall between the two in both cost and activation time. The right answer depends on business impact and risk appetite.
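A short calculation shows why redundancy improves availability and why dependencies erode it. This sketch uses the standard series and parallel availability formulas and assumes component failures are independent, which real systems only approximate (shared power, shared software defects, and shared administrators all break that assumption). The function names and the 99% figures are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def series_availability(*components):
    """Every component must be up, so availabilities multiply."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(availability, replicas):
    """The service is up if at least one independent replica is up."""
    return 1.0 - (1.0 - availability) ** replicas


def annual_downtime_hours(availability):
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99
pair = parallel_availability(single, 2)              # about 0.9999
chain = series_availability(0.99, 0.99, 0.99)        # about 0.9703

print(round(annual_downtime_hours(single), 1))  # about 87.6 hours per year
print(round(annual_downtime_hours(pair), 3))    # about 0.876 hours per year
print(round(annual_downtime_hours(chain), 1))   # about 260.2 hours per year
```

Under these assumptions a second replica cuts expected downtime by a factor of one hundred, while chaining three 99% components roughly triples it.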
Resilience also includes people and suppliers. A recovery plan may fail if only one administrator knows the procedure, if vendor contacts are outdated, if emergency procurement is blocked, if remote access depends on the failed identity system, or if fuel contracts do not cover a regional emergency. Plans should identify alternates for key roles, cross-train staff, document runbooks, and validate supplier commitments.
Testing is where plans become credible. Tabletop exercises test decision-making and coordination. Walkthroughs verify procedures and roles. Technical restore tests validate backup integrity. Parallel tests run recovery systems without stopping production. Full interruption tests provide strong evidence but carry risk and need executive approval. Test scope should grow as maturity improves. Findings should be tracked like other risk remediation work.
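A technical restore test can be as simple as comparing a restored file with a digest recorded at backup time. The sketch below uses only the Python standard library. The file names are hypothetical, and the copy step stands in for whatever restore procedure the backup product actually provides.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(recorded_digest, restored_path):
    """True when the restored file is byte-identical to what was backed up."""
    return sha256_of(restored_path) == recorded_digest


with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "orders.db"
    source.write_bytes(b"order-1\norder-2\n")
    recorded = sha256_of(source)  # stored alongside the backup

    restored = Path(workdir) / "orders.restored.db"
    restored.write_bytes(source.read_bytes())  # stands in for a real restore
    ok = restore_matches(recorded, restored)

    restored.write_bytes(b"order-1\n")  # simulate a truncated restore
    truncated_ok = restore_matches(recorded, restored)

print(ok, truncated_ok)  # True False
```

A matching digest shows the data survived the round trip. It does not show that the application starts or that the restore finished within the RTO, so timed, full-application tests are still needed.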
Backup and Recovery Control Checklist
- Classify systems by business process, data sensitivity, RTO, RPO, and dependency chain.
- Protect backup administration with separate credentials, MFA, logging, and least privilege.
- Maintain immutable, offline, or otherwise isolated copies for critical data and configurations.
- Test restoration of files, databases, identity services, cloud resources, and full applications.
- Document manual workarounds for critical processes when technology is unavailable.
- Review recovery plans after major system changes, supplier changes, exercises, and incidents.
Recovery sequencing matters. Identity, network, DNS, security tools, key management, and administrative access may need restoration before business applications can return. A plan that restores applications before authentication or network segmentation may not work. Dependency mapping helps define the order. It also helps leaders understand why a lower-profile platform may be a prerequisite for a high-profile service.
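Dependency mapping lends itself to a topological sort, which produces an order in which every prerequisite is restored before the services that need it. This is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The dependency map is hypothetical, and a real one would come from the organization's own architecture records.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each key lists what must be restored before it.
dependencies = {
    "network": set(),
    "dns": {"network"},
    "key-management": {"network"},
    "identity": {"network", "dns", "key-management"},
    "admin-access": {"identity"},
    "security-tools": {"identity"},
    "order-entry": {"identity", "dns", "security-tools"},
}

# static_order() yields each service only after all of its prerequisites.
recovery_order = list(TopologicalSorter(dependencies).static_order())
print(recovery_order)
```

Several valid orders may exist, so the exact output can vary, but `network` always comes first and `order-entry` always follows `identity`. If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is itself a useful planning finding.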
Crisis communication is part of resilience. Employees need instructions for alternate work locations, manual procedures, and safe communication channels. Customers may need service status. Regulators or contractual partners may require notice depending on impact. Executives need options such as degrade service, switch provider, invoke manual processing, or suspend a function. Communication templates should be prepared, but facts must be updated before use.
Resilience is not the same as perfection. Some downtime or data loss may be accepted for low-impact systems because the cost of near-zero downtime is not justified by the impact it prevents. For safety, regulated, revenue-critical, or mission-critical services, higher investment is appropriate. A CISSP should connect the cost of controls to the business impact they reduce, then verify that the chosen objectives are realistic through exercises and evidence.
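The cost comparison can be made concrete with simple arithmetic. The sketch below is a deliberately simplified model: the function names, outage frequencies, hourly impact figures, and control cost are all invented for illustration, and a real analysis would take them from the BIA and from supplier quotes.

```python
def annual_outage_cost(outages_per_year, hours_per_outage, impact_per_hour):
    """Expected yearly loss from downtime under a given recovery design."""
    return outages_per_year * hours_per_outage * impact_per_hour


def net_benefit(cost_before, cost_after, annual_control_cost):
    """Positive when the control saves more than it costs each year."""
    return cost_before - cost_after - annual_control_cost


WARM_SITE_COST = 150_000  # hypothetical yearly cost of the faster recovery option

# Revenue-critical service: recovery time drops from 24 hours to 4 hours.
critical = net_benefit(
    annual_outage_cost(2, 24, 10_000),
    annual_outage_cost(2, 4, 10_000),
    WARM_SITE_COST,
)

# Low-impact service: same improvement, far smaller hourly impact.
low_impact = net_benefit(
    annual_outage_cost(2, 24, 500),
    annual_outage_cost(2, 4, 500),
    WARM_SITE_COST,
)

print(critical)    # 250000
print(low_impact)  # -130000
```

With these assumed numbers the same investment pays for itself on the critical service and loses money on the low-impact one, which is the reasoning behind accepting longer recovery times for some systems.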
The strongest recovery programs are boring in a useful way. They know their critical services, protect their backups, practice the uncomfortable steps, and correct gaps before a disaster. When a disruption occurs, the organization does not need heroic improvisation. It uses rehearsed authority, tested data, known dependencies, and risk-based priorities.
A business owner says an order system can lose no more than 30 minutes of transactions during recovery. Which objective is being defined?
Which backup design best addresses ransomware that compromises production administrator accounts?
What should primarily determine disaster recovery priorities?