6.2 Incident Management
Key Takeaways
- The purpose of incident management is to minimize the negative impact of incidents by restoring normal service operation as quickly as possible
- An incident is an unplanned interruption to a service, or reduction in the quality of a service
- Incidents are prioritized based on impact and urgency, not on a first-come-first-served basis
- Major incidents follow a separate procedure with dedicated resources, and swarming brings stakeholders together to resolve complex incidents
- Incident management relies on the service desk for logging and on escalation paths to specialist or supplier support teams
Purpose and Definition
Quick Answer: The purpose of the incident management practice is to minimize the negative impact of incidents by restoring normal service operation as quickly as possible. An incident is an unplanned interruption to a service, or reduction in the quality of a service.
Incident management is one of the seven practices examined in detail, so memorize the purpose word-for-word. The key idea is speed of restoration, not root cause. The job is to get the service working again — even with a temporary fix or workaround — and worry about why it happened later (that is problem management's job).
An incident must be unplanned. A scheduled maintenance window is not an incident. A server crash, a failed login service, or a printer that prints garbled output all qualify because they interrupt the service or reduce its quality below the agreed level. Every incident should be logged and managed to ensure it is resolved within target times that meet user expectations.
Incident management is one of the most visible practices because it directly affects users. ITIL 4 stresses that good incident management is as much about communication and collaboration as about technical fixes — keeping users informed of progress, setting realistic expectations, and recording accurate, complete incident records that other practices (like problem management) can later mine for patterns. A poorly documented incident is almost as bad as an unresolved one.
Categorization and Prioritization
When an incident is logged, it is categorized (assigning it a type so it can be routed and reported on) and prioritized. Prioritization is critical: incidents are not handled first-come-first-served. Priority is derived from impact and urgency.
- Impact — how badly the incident affects the business (how many users, how critical the service).
- Urgency — how quickly the incident needs resolution before impact grows.
These combine in a priority matrix to set a priority code that drives target resolution times defined in service level agreements (SLAs).
| | High urgency | Medium urgency | Low urgency |
|---|---|---|---|
| High impact | Priority 1 (Critical) | Priority 2 (High) | Priority 3 (Medium) |
| Medium impact | Priority 2 (High) | Priority 3 (Medium) | Priority 4 (Low) |
| Low impact | Priority 3 (Medium) | Priority 4 (Low) | Priority 5 (Planning) |
A Priority 1 incident affecting a revenue-critical service for all users gets immediate attention and a short target resolution time; a Priority 5 cosmetic issue can wait. Tools should automatically suggest a priority and match the incident against known errors to speed diagnosis.
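The matrix above can be read as a simple lookup from an (impact, urgency) pair to a priority code. The following is a minimal sketch of that lookup, not a feature of any particular tool; the names `PRIORITY_MATRIX` and `priority_for` are hypothetical and chosen only for illustration.

```python
# Priority matrix copied from the table above.
# Keys are (impact, urgency); values are (priority code, label).
PRIORITY_MATRIX = {
    ("high", "high"): (1, "Critical"),
    ("high", "medium"): (2, "High"),
    ("high", "low"): (3, "Medium"),
    ("medium", "high"): (2, "High"),
    ("medium", "medium"): (3, "Medium"),
    ("medium", "low"): (4, "Low"),
    ("low", "high"): (3, "Medium"),
    ("low", "medium"): (4, "Low"),
    ("low", "low"): (5, "Planning"),
}


def priority_for(impact, urgency):
    """Return (code, label) for an impact/urgency pair (hypothetical helper)."""
    key = (impact.strip().lower(), urgency.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    return PRIORITY_MATRIX[key]


if __name__ == "__main__":
    # A revenue-critical outage for all users: high impact, high urgency.
    print(priority_for("High", "High"))
    # A cosmetic issue: low impact, low urgency.
    print(priority_for("low", "low"))
```

Keeping the matrix as explicit data rather than a formula makes it easy to compare against the agreed table and to adjust if an organization defines its matrix differently.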
Because target resolution times are tied to priority, not to the order in which incidents arrive, a typical SLA might require Priority 1 incidents to be resolved within 2 hours and Priority 4 incidents within 5 business days. These targets give the service desk an objective basis for sequencing work and give the business a clear, measurable expectation. Categorization also feeds reporting: grouping incidents by category shows which services or components generate the most disruption, which is valuable input to proactive problem management and continual improvement.
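To make the example targets concrete (Priority 1 within 2 hours, Priority 4 within 5 business days), here is a minimal sketch of how a resolution deadline could be derived from the priority and the time the incident was logged. It rests on several assumptions: business days are taken to be Monday to Friday with no holiday calendar, only the two targets mentioned in the text are included, and the names `EXAMPLE_TARGETS`, `add_business_days`, and `resolution_deadline` are hypothetical.

```python
from datetime import datetime, timedelta

# Only the two example targets from the text are listed; a real SLA
# would define a target for every priority code.
EXAMPLE_TARGETS = {
    1: ("hours", 2),
    4: ("business_days", 5),
}


def add_business_days(start, days):
    """Advance `start` by `days` business days (assumes Mon-Fri, no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def resolution_deadline(priority, logged_at):
    """Return the target resolution time for an incident (hypothetical helper)."""
    if priority not in EXAMPLE_TARGETS:
        raise KeyError(f"no example target defined for priority {priority}")
    unit, amount = EXAMPLE_TARGETS[priority]
    if unit == "hours":
        return logged_at + timedelta(hours=amount)
    return add_business_days(logged_at, amount)


if __name__ == "__main__":
    logged = datetime(2024, 1, 1, 9, 0)  # a Monday morning
    print(resolution_deadline(1, logged))
    print(resolution_deadline(4, logged))
```

The point of the sketch is that the deadline depends only on priority and logging time, so two incidents logged minutes apart can have very different targets.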
Service Desk, Escalation, Swarming, and Major Incidents
Most incidents are logged through the service desk, the single point of contact for users. Low-complexity incidents can often be resolved at first contact. When the service desk cannot resolve an incident, it escalates it:
- Hierarchic escalation — raising the incident to managers/authority for decisions or resources (often for major incidents).
- Functional (technical) escalation — routing the incident to a specialist support group or supplier with the right expertise.
ITIL 4 promotes swarming as a modern alternative to rigid tiered escalation. In swarming, many stakeholders work together on an incident at the same time until it is clear who can best resolve it, then the others move on. This breaks down silos and speeds up complex diagnoses.
Major incidents
A major incident has very high impact and demands a separate, dedicated procedure with shorter timescales, greater urgency, and often a dedicated manager and team. Major incidents may trigger problem management to find the root cause afterward.
Tools and self-help
Modern incident management uses workflow and collaboration tools, automated logging, intelligent routing, and self-service portals. Simple incidents can be resolved by users through self-help or by automation, freeing specialists for high-priority work.
ITIL 4 explicitly recognizes different resolution paths by complexity. Low-complexity incidents may be handled by the service desk or even self-service automation. More complex incidents are routed to support teams aligned to specific technologies or products. The most complex incidents — and all major incidents — often need a temporary team drawn from several groups, which is exactly where swarming fits.
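The three resolution paths described above can be sketched as a small routing rule. This is an illustration only: the `Complexity` levels, the `resolution_path` function, and the returned labels are hypothetical, and the sketch simplifies "often need a temporary team" into "always" for the most complex and major incidents.

```python
from enum import Enum


class Complexity(Enum):
    """Hypothetical complexity levels used only for this illustration."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def resolution_path(complexity, is_major=False):
    """Pick a resolution path from complexity and major-incident status."""
    # Simplification: the text says these incidents "often" need a
    # temporary team; this sketch always routes them to one.
    if is_major or complexity is Complexity.HIGH:
        return "temporary cross-team swarm"
    if complexity is Complexity.MEDIUM:
        return "specialist support team"
    return "service desk or self-service"


if __name__ == "__main__":
    print(resolution_path(Complexity.LOW))
    print(resolution_path(Complexity.MEDIUM))
    print(resolution_path(Complexity.LOW, is_major=True))
```

Note that major-incident status overrides complexity in this sketch, reflecting the statement that all major incidents belong on the temporary-team path.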
A good incident management tool integrates with the service configuration management practice so technicians can see which configuration items are involved, and with the known error database so documented workarounds surface automatically. Effective incident management therefore depends on tools, partner support arrangements, and well-designed escalation rules working together. This is a reminder that a practice spans all four dimensions of service management (organizations and people, information and technology, partners and suppliers, value streams and processes) and is more than a single written procedure followed in isolation.
What is the purpose of the incident management practice?
How is the priority of an incident determined in ITIL 4?
In ITIL 4, what does 'swarming' describe?
An unplanned outage takes down a payment service used by all customers during peak hours. Which classification best fits?