5.2 Core Workflows and Decision Points
Key Takeaways
- Incident management restores service fast (often with a workaround); problem management finds and removes the root cause to stop recurrence.
- Changes flow through a Request for Change (RFC), assessment, approval, testing, scheduled implementation, and post-implementation review.
- Configuration management maintains the CMDB of configuration items (CIs) so change impact can be assessed; release management deploys bundled changes to production.
- Recovery-site cost rises with readiness: cold (cheapest, slowest) → warm → hot (most expensive, fastest). Mobile sites sit in the middle on cost, with variable recovery time; reciprocal arrangements are cheapest of all but least reliable.
- RAID and clustering provide high availability against component failure but are not a substitute for off-site backups against site-level disasters.
Service Management Workflows (ITIL)
CISA leans on ITIL service-management vocabulary. The exam repeatedly tests the distinctions among the following processes, which candidates tend to blur together.
- Incident management — An incident is an unplanned interruption or reduction in quality of an IT service. The goal is to restore service as quickly as possible, often with a temporary workaround. Speed matters more than root cause here.
- Problem management — A problem is the underlying cause of one or more incidents. Problem management is proactive and analytical: it performs root-cause analysis to eliminate recurring incidents permanently. A known error with a documented workaround lives here.
- Change management — A change is the addition, modification, or removal of anything that could affect IT services. Changes flow through a Request for Change (RFC), risk/impact assessment, authorization (often by a Change Advisory Board), testing, scheduled implementation, and a post-implementation review.
- Configuration management — Maintains the Configuration Management Database (CMDB), a record of all configuration items (CIs) and their relationships, so change impact can be assessed accurately.
Release management packages approved changes into a controlled deployment to production while protecting the integrity of the existing environment.
Memory hook: Incident = restore now. Problem = find the cause. Change = control the modification. Configuration = know what you have. Release = deploy it safely.
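To make the link between the CMDB and change impact concrete, here is a minimal sketch of an impact lookup. The CI names and the `DEPENDENTS` mapping are hypothetical, and a real CMDB would hold far richer relationship types; the point is only that recorded relationships let you trace everything downstream of a changed CI.

```python
from collections import deque

# Hypothetical CMDB extract: each CI maps to the CIs that depend on it directly.
DEPENDENTS = {
    "db-server-01": ["billing-app", "reporting-app"],
    "billing-app": ["customer-portal"],
    "reporting-app": [],
    "customer-portal": [],
}


def impacted_cis(changed_ci, dependents):
    """Return every CI reachable from changed_ci through dependency links."""
    seen = set()
    queue = deque([changed_ci])
    while queue:
        ci = queue.popleft()
        for downstream in dependents.get(ci, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return sorted(seen)


print(impacted_cis("db-server-01", DEPENDENTS))
# ['billing-app', 'customer-portal', 'reporting-app']
```

If the relationships are missing or stale, the lookup silently under-reports impact, which is why auditors care about CMDB accuracy.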
Why the Distinctions Get Tested
The exam exploits the incident/problem boundary constantly. If a stem describes a server crash and asks for the immediate priority, the answer is to restore service (incident management), even if the cause is unknown. If the stem says the same outage keeps happening and asks for the best long-term action, the answer is root-cause analysis (problem management). Watch the timing cue: now versus recurring.
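The timing cue can be written as a small decision rule. This is only a study aid under simplifying assumptions: the function name, its two inputs, and the returned labels are all invented for illustration and are not ITIL terminology.

```python
def recommended_process(service_down, recurrence_count):
    """Map the timing cue in a question stem to the process it points at.

    service_down: True if the stem describes an outage happening right now.
    recurrence_count: how many times the same failure has already occurred.
    """
    if service_down:
        # "Now" cue: restore service first, even if the cause is unknown.
        return "incident management: restore service"
    if recurrence_count > 1:
        # "Recurring" cue: service is back, so look for the underlying cause.
        return "problem management: root-cause analysis"
    return "no action indicated by the stem"


print(recommended_process(service_down=True, recurrence_count=0))
print(recommended_process(service_down=False, recurrence_count=3))
```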
Emergency changes still require control: an emergency change must be authorized (sometimes after the fact by an emergency CAB) and documented, never left undocumented because it was urgent. The operational red flag the exam returns to most often is a developer or operator implementing a change directly in production without approval, testing, or segregation of duties. Any answer option that defends this practice is wrong.
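One way to picture these controls is as checks an auditor might run over change records. The sketch below uses a hypothetical `ChangeRecord` layout; the field names are assumptions, and treating pre-production testing as optional for emergency changes is a simplification, not a rule from ITIL or ISACA.

```python
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    # Hypothetical fields; a real change-management tool will differ.
    change_id: str
    emergency: bool
    authorized: bool
    tested: bool
    documented: bool
    implementer: str
    approver: str


def control_exceptions(change):
    """List control weaknesses found in a single change record."""
    findings = []
    if not change.authorized:
        findings.append("no authorization")
    if not change.documented:
        findings.append("not documented")
    # Assumption: emergency changes may compress testing; normal changes may not.
    if not change.tested and not change.emergency:
        findings.append("not tested before production")
    if change.implementer == change.approver:
        findings.append("segregation of duties: implementer approved own change")
    return findings


urgent_fix = ChangeRecord("CHG-1", True, True, False, True, "alice", "bob")
direct_to_prod = ChangeRecord("CHG-2", False, False, False, False, "dev1", "dev1")
print(control_exceptions(urgent_fix))      # []
print(control_exceptions(direct_to_prod))  # four findings
```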
Batch and Job Scheduling
Day-to-day operations also include job/batch scheduling. Auditors check for an automated scheduler with documented dependencies, exception alerts for failed jobs, restart/rerun procedures, and review of console/scheduler logs. Unmonitored failed batch jobs (for example, an overnight posting that silently aborts) are a classic operations finding.
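The silent-failure finding can be illustrated with a short check over scheduler output. The record layout (`job`, `status`, `alert_sent`) is a hypothetical simplification; real scheduler logs vary by product.

```python
def unmonitored_failures(job_runs):
    """Return the names of jobs that did not succeed and raised no alert."""
    return [
        run["job"]
        for run in job_runs
        if run["status"] != "success" and not run["alert_sent"]
    ]


# Hypothetical overnight run summary.
runs = [
    {"job": "nightly-posting", "status": "aborted", "alert_sent": False},
    {"job": "backup", "status": "success", "alert_sent": False},
    {"job": "interest-calc", "status": "failed", "alert_sent": True},
]
print(unmonitored_failures(runs))  # ['nightly-posting']
```

Only the aborted posting job is flagged: the failed interest calculation did raise an alert, so it is an incident to resolve rather than a monitoring gap.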
Recovery-Site Decision Spectrum
When a stem asks which alternate processing site is most appropriate, map the RTO and cost tolerance in the stem to this spectrum:
| Site | Equipment & data | Time to operational | Relative cost | Best for |
|---|---|---|---|---|
| Cold | Space, power, HVAC only; little/no hardware | Longest (days/weeks) | Lowest | Non-critical apps, long-term contracts |
| Warm | Hardware and connectivity present; software/data not fully current | Medium | Medium | Sensitive but not time-critical apps |
| Mobile | Pre-configured portable unit moved to the site | Variable | Medium | Regional outages, field deployment |
| Hot | Mirrors production; hardware, software, near-current data | Shortest (minutes/hours) | Highest | Vital, critical applications with low RTO |
| Reciprocal | Two organizations agree to host each other | Uncertain | Cheapest | Same-region offices; least reliable |
The core trade-off: shorter RTO → hotter site → higher cost. A hot site is justified when recovery must be fast and the application is critical. A reciprocal (mutual aid) agreement is the least expensive but the least dependable, because capacity, configuration compatibility, and the partner's own availability during a regional event are all uncertain.
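The trade-off can be sketched as a lookup from RTO to site type. The thresholds below (4 hours and 72 hours) are illustrative assumptions only; neither the exam nor any standard fixes these numbers, and a real decision would also weigh cost and criticality.

```python
def suggest_site(rto_hours, hot_max_hours=4, warm_max_hours=72):
    """Suggest a recovery-site type from the RTO alone.

    hot_max_hours and warm_max_hours are assumed cut-offs for illustration.
    """
    if rto_hours <= 0:
        raise ValueError("rto_hours must be positive")
    if rto_hours <= hot_max_hours:
        return "hot"
    if rto_hours <= warm_max_hours:
        return "warm"
    return "cold"


for rto in (2, 48, 240):
    print(rto, "hours ->", suggest_site(rto))
```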
High-availability techniques sit alongside sites: RAID (redundant disks), server clustering, and failover protect against component failure and reduce RTO for localized faults. They are not a substitute for off-site backups, which protect against site-level disasters, ransomware, and corruption that replication would simply copy to the standby.
RAID and High-Availability Building Blocks
CISA expects you to recognize the common RAID (Redundant Array of Independent Disks) levels and what each buys you, because they appear in availability and storage-operations questions.
| RAID level | Technique | Fault tolerance |
|---|---|---|
| RAID 0 | Striping only | None — any disk failure loses all data (performance only) |
| RAID 1 | Mirroring | Survives loss of one disk in a mirrored pair |
| RAID 5 | Block striping + distributed parity | Survives one disk failure; needs ≥3 disks |
| RAID 6 | Striping + double distributed parity | Survives two simultaneous failures; needs ≥4 disks |
| RAID 10 | Striping across mirrored pairs (RAID 1+0) | Survives one disk failure per mirrored pair; high performance; needs ≥4 disks |
The trap is RAID 0: it improves throughput but provides no redundancy, so it never satisfies an availability requirement. Beyond RAID, server clustering lets a standby node take over automatically (failover), and redundant power, network paths, and load balancers remove single points of failure. All of these reduce RTO for component-level faults but, again, do not replace geographically separated backups.
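The cost of each level's redundancy is easier to remember with a quick capacity calculation. The sketch below assumes equal-size disks, treats RAID 1 as a single two-disk mirror, and reports only the number of failures that is always survivable; the function name and return shape are invented for illustration.

```python
def raid_summary(level, disks, disk_tb):
    """Return (usable_tb, guaranteed_failures_tolerated) for equal-size disks."""
    # level: (minimum disks, data-disk count as a function of n, failures tolerated)
    rules = {
        0: (2, lambda n: n, 0),
        1: (2, lambda n: 1, 1),
        5: (3, lambda n: n - 1, 1),
        6: (4, lambda n: n - 2, 2),
        10: (4, lambda n: n // 2, 1),
    }
    if level not in rules:
        raise ValueError(f"unsupported RAID level: {level}")
    min_disks, data_disks, tolerated = rules[level]
    if disks < min_disks:
        raise ValueError(f"RAID {level} needs at least {min_disks} disks")
    if level == 1 and disks != 2:
        raise ValueError("this sketch models RAID 1 as one two-disk mirror")
    if level == 10 and disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")
    return data_disks(disks) * disk_tb, tolerated


for level in (0, 5, 6, 10):
    print(f"RAID {level}:", raid_summary(level, disks=4, disk_tb=2))
```

With four 2 TB disks, RAID 0 exposes all 8 TB with no protection, RAID 5 gives 6 TB, and RAID 6 and RAID 10 each give 4 TB. RAID 10 can survive a second failure only if it lands in a different mirrored pair, so the guaranteed figure is one.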
The Change Lifecycle in Detail
Walk a normal change through its stages so the exam's process questions become automatic: RFC raised → categorized and risk-assessed → authorized (Change Advisory Board for significant changes) → built and tested in a non-production environment → scheduled and implemented with a back-out plan → post-implementation review → CMDB updated. A back-out (rollback) plan is the control auditors look for before a change is promoted to production; its absence is a finding. Emergency changes compress this flow but still require authorization and documentation, typically reviewed retroactively by an emergency CAB.
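The stage sequence above can be modeled as a simple ordered workflow in which no stage may be skipped. This is a sketch under assumptions: the stage identifiers are shorthand invented here, and the single `has_backout_plan` gate stands in for the fuller set of checks a real change tool would enforce.

```python
# Shorthand stage names, in the order a normal change passes through them.
STAGES = [
    "rfc_raised",
    "assessed",
    "authorized",
    "tested",
    "implemented",
    "reviewed",
    "cmdb_updated",
]


def advance(current, has_backout_plan=True):
    """Return the next stage, refusing implementation without a back-out plan."""
    idx = STAGES.index(current)  # raises ValueError for an unknown stage
    if idx == len(STAGES) - 1:
        raise ValueError("change is already closed")
    next_stage = STAGES[idx + 1]
    if next_stage == "implemented" and not has_backout_plan:
        raise ValueError("back-out plan required before implementation")
    return next_stage


stage = "rfc_raised"
while stage != "cmdb_updated":
    stage = advance(stage)
    print(stage)
```

Because `advance` only ever moves one step forward, a change cannot jump from the RFC straight to implementation, which is the behavior the exam treats as a control failure.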
A critical application server fails during business hours and the cause is unknown. What is the immediate priority under ITIL?
An organization requires recovery of a vital, customer-facing application within two hours of any disruption. Which alternate-site strategy is most appropriate?
The same network outage has recurred three times this month, each restored by a manual workaround. What is the best long-term action?