Incident Management

Definition

"The practice of minimizing the negative impact of incidents by restoring normal service operation as quickly as possible."

To fulfill this purpose, organizations must:

  • Detect incidents early
  • Resolve incidents quickly and efficiently
  • Continually improve upon incident management

What is an Incident?

An incident is defined as "an unplanned interruption to a service or reduction in the quality of a service."

Incidents vary widely in impact and urgency. A major incident represents the highest business impact, requiring separate management procedures, dedicated teams, and faster escalation paths.

Key Distinction: Incidents are symptoms (something has gone wrong), while problems are the underlying causes. Known errors are documented problems with established workarounds. Incident management focuses on restoring service; problem management focuses on preventing recurrence.

Processes

Incident Handling and Resolution

The core process follows six steps:

  1. Detection: Identify incidents through monitoring tools, user reports, or automated alerts
  2. Registration: Log incidents with essential details (time, affected service, user, description)
  3. Classification: Categorize by type, service, priority, and urgency for proper routing
  4. Diagnosis: Investigate to determine resolution path; apply known solutions where available
  5. Resolution: Apply fix, workaround, or escalate to appropriate support team
  6. Closure: Confirm service restoration with user, document resolution, close record
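
The six steps can be read as a small state machine. The sketch below is one possible way to model them, not a prescribed ITIL data model: the `Stage` and `Incident` names are hypothetical, and the two backward transitions (re-classifying after diagnosis, re-diagnosing when the user does not confirm restoration) are assumptions added for illustration.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detected"
    REGISTERED = "registered"
    CLASSIFIED = "classified"
    DIAGNOSED = "diagnosed"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed transitions between stages. The backward edges are assumptions:
# DIAGNOSED -> CLASSIFIED models re-routing to another team, and
# RESOLVED -> DIAGNOSED models a fix the user did not confirm.
ALLOWED = {
    Stage.DETECTED: {Stage.REGISTERED},
    Stage.REGISTERED: {Stage.CLASSIFIED},
    Stage.CLASSIFIED: {Stage.DIAGNOSED},
    Stage.DIAGNOSED: {Stage.RESOLVED, Stage.CLASSIFIED},
    Stage.RESOLVED: {Stage.CLOSED, Stage.DIAGNOSED},
    Stage.CLOSED: set(),
}


class Incident:
    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.DETECTED
        self.history = [Stage.DETECTED]  # audit trail of stages visited

    def advance(self, target):
        """Move to the target stage, rejecting transitions that skip a step."""
        if target not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {target.value}"
            )
        self.stage = target
        self.history.append(target)
```

Rejecting skipped steps in code is a design choice: it guarantees, for example, that no record reaches Closure without passing through Registration and Classification.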

Periodic Incident Review

A continuous improvement process that includes:

  1. Review: Analyze incident trends, patterns, and improvement areas
  2. Improvement Initiation: Identify and prioritize improvement actions
  3. Communication of Updates: Share findings and improvements with stakeholders

Key Terms

Workaround: "a solution that reduces or eliminates the impact of an incident or problem for which a full resolution is not yet available."

Swarming: "a technique for solving various complex tasks, in which multiple people with different areas of expertise work together on a task until it becomes clear which competencies are the most relevant and needed."

Prioritization: "the action of selecting tasks to work on first when it is impossible to assign resources to all tasks in the backlog."

Incident Priority

Priority is derived from urgency (how quickly the business needs a resolution) and impact (the extent of the harm to business processes):

| Priority | Description | Typical Response Target |
|---|---|---|
| P1 (Critical) | Complete service outage for many users or business-critical function | Minutes |
| P2 (High) | Major degradation affecting significant user groups | Under 1 hour |
| P3 (Medium) | Service degradation for limited users, workaround available | Hours |
| P4 (Low) | Minor issue affecting individual users, no business impact | Days |
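
Deriving priority from impact and urgency is commonly done with a lookup matrix. The sketch below shows one such matrix. The three-level scale and every cell of the mapping are assumptions for illustration; each organization defines its own.

```python
# Hypothetical impact/urgency matrix: keys are (impact, urgency).
# The specific cell values are illustrative, not taken from ITIL.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("medium", "high"): "P2",
    ("high", "low"): "P3",
    ("medium", "medium"): "P3",
    ("low", "high"): "P3",
    ("medium", "low"): "P4",
    ("low", "medium"): "P4",
    ("low", "low"): "P4",
}


def priority(impact, urgency):
    """Look up the priority for an impact/urgency pair (case-insensitive)."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    return PRIORITY_MATRIX[key]
```

Keeping the matrix as data rather than nested conditionals makes it easy to review with business stakeholders and to change without touching logic.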

Recommendations for Practice Success

  • Look at incidents from the service consumer's perspective, not only from a technical view
  • Gather and reuse data: every resolved incident is a knowledge source
  • Understand, manage, and improve the incident resolution value stream, not only the practice itself
  • Develop the practice continually but avoid overcomplication
  • Adjust for complexity: simple incidents get simple processes, complex ones use swarming
  • Demonstrate business value by linking incident metrics to business outcomes

Key Metrics

| Metric | What It Measures |
|---|---|
| Time to Detect | How quickly incidents are identified after occurrence |
| Detection via Monitoring (%) | Proportion detected by tooling versus user reports |
| Diagnosis Time | Average time from registration to identifying a resolution path |
| Reassignments | Number of hand-offs between teams (fewer is better) |
| First-Time Resolution (%) | Proportion of incidents resolved on first contact |
| Resolved Automatically (%) | Proportion of incidents handled without human intervention |
| User Satisfaction | Post-incident satisfaction score |
| Known Solution Usage (%) | Proportion of incidents resolved using documented solutions |
| Trend Improvements | Reduction in incident volume and impact over time |
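
Several of these metrics reduce to simple aggregates over incident records. The following sketch computes four of them. The `IncidentRecord` fields are hypothetical names; a real ITSM export will use its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    # Field names are assumptions for this sketch.
    occurred_at: datetime
    detected_at: datetime
    detected_by_monitoring: bool
    reassignments: int
    resolved_on_first_contact: bool


def summarize(records):
    """Aggregate basic incident metrics over a non-empty list of records."""
    if not records:
        raise ValueError("no incident records to summarize")
    n = len(records)
    total_detect = sum(
        (r.detected_at - r.occurred_at for r in records), timedelta()
    )
    return {
        "mean_time_to_detect": total_detect / n,
        "detection_via_monitoring_pct":
            100 * sum(r.detected_by_monitoring for r in records) / n,
        "mean_reassignments": sum(r.reassignments for r in records) / n,
        "first_time_resolution_pct":
            100 * sum(r.resolved_on_first_contact for r in records) / n,
    }
```

Means are sensitive to outliers such as a single long-running major incident, so reporting a median alongside them is often worthwhile.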

Key Roles

  • Incident Manager: Coordinates the incident management process and owns escalation procedures
  • Major Incident Manager: Leads the response team for high-impact incidents and manages stakeholder communication

Software Tools

  • Monitoring and event management tools
  • Workflow and collaboration tools
  • Knowledge management tools, CMDB tools
  • Remote administration and software management tools
  • Ticketing and workflow systems
  • Reporting and survey tools

AI in Incident Management (ITIL v5)

| Capability | Description |
|---|---|
| Automated Detection | AIOps and anomaly detection identify incidents before user reports |
| Intelligent Classification | AI categorizes and prioritizes incidents based on historical data |
| Suggested Resolution | AI recommends fixes based on similar past incidents |
| Auto-Resolution | Routine incidents resolved automatically without human intervention |
| Predictive Analytics | AI identifies patterns that may lead to future incidents |
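
To make "Suggested Resolution" concrete, the sketch below matches a new incident description against past incidents. It is deliberately not a machine-learning model: it uses plain word overlap (Jaccard similarity) as a stand-in for whatever similarity method a real tool applies. The function name, the record keys, and the `min_score` threshold are all assumptions.

```python
def _tokens(text):
    """Lowercase a description and split it into a set of words."""
    return set(text.lower().split())


def suggest_resolution(description, past_incidents, min_score=0.3):
    """Return (resolution, score) for the most similar past incident.

    Similarity is the Jaccard index of the two word sets. Returns None
    when no past incident scores at least min_score.
    """
    query = _tokens(description)
    best, best_score = None, 0.0
    for past in past_incidents:
        candidate = _tokens(past["description"])
        union = query | candidate
        score = len(query & candidate) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = past, score
    if best is None or best_score < min_score:
        return None
    return best["resolution"], best_score
```

The threshold matters in practice: returning no suggestion is usually better than returning a confident-looking but irrelevant one.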

90-Day Implementation Checklist

Month 1: Foundation

  • Define incident priority levels (P1 through P4) with clear business impact criteria
  • Select and configure ticketing system (or configure existing ITSM platform)
  • Create incident record template with mandatory fields
  • Train service desk staff on incident classification and initial diagnosis
  • Define escalation paths: functional (L1 → L2 → L3) and hierarchical (to management)
  • Establish major incident process with war room procedures
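
Two of the Month 1 items lend themselves to a small illustration: checking a record against its mandatory fields, and walking the functional escalation path. In this sketch the field names are hypothetical stand-ins for the details named under Registration (time, affected service, user, description), and returning `None` past L3 is an assumed convention for "hand over to hierarchical escalation".

```python
# Hypothetical field names for the mandatory part of an incident record.
MANDATORY_FIELDS = ("reported_at", "affected_service", "reported_by", "description")

# Functional escalation path: L1 -> L2 -> L3.
FUNCTIONAL_TIERS = ("L1", "L2", "L3")


def missing_fields(record):
    """Return the mandatory fields that are absent or blank in a record dict."""
    return [
        name for name in MANDATORY_FIELDS
        if not str(record.get(name) or "").strip()
    ]


def next_tier(current):
    """Return the next functional tier, or None when L3 is already reached.

    None signals that only hierarchical escalation (to management) remains.
    """
    index = FUNCTIONAL_TIERS.index(current)  # raises ValueError if unknown
    if index + 1 < len(FUNCTIONAL_TIERS):
        return FUNCTIONAL_TIERS[index + 1]
    return None
```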

Month 2: Process Maturity

  • Implement SLA targets per priority level and configure alerting
  • Set up monitoring integration: alerts auto-create incident tickets
  • Create known error database (KEDB) with top 20 recurring issues
  • Establish major incident communication templates (internal + external)
  • Conduct first post-incident review (PIR) for a P1/P2 incident
  • Begin tracking metrics: MTTD (mean time to detect), MTTR (mean time to resolve), FCR (first contact resolution)
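
SLA alerting can start as a simple comparison of elapsed time against a per-priority target. The sketch below illustrates the idea. The concrete durations are placeholders chosen to fit the ranges in the priority table (minutes, under 1 hour, hours, days), not recommended values, and the 80% warning ratio is likewise an assumption.

```python
from datetime import datetime, timedelta

# Placeholder response targets per priority; real values come from the SLA.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=8),
    "P4": timedelta(days=3),
}


def sla_status(priority, opened_at, now, warn_ratio=0.8):
    """Classify an open incident as 'ok', 'at_risk', or 'breached'.

    'at_risk' means at least warn_ratio of the target has elapsed.
    """
    target = RESPONSE_TARGETS[priority]
    elapsed = now - opened_at
    if elapsed >= target:
        return "breached"
    if elapsed >= warn_ratio * target:
        return "at_risk"
    return "ok"
```

Passing `now` explicitly instead of calling `datetime.now()` inside the function keeps the check deterministic and easy to test.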

Month 3: Optimization

  • Analyze incident trends (top 10 categories, repeat incidents, peak times)
  • Feed trend data to Problem Management for root cause investigation
  • Publish first monthly incident report to management
  • Measure and report: MTTD, MTTR, FCR, user satisfaction with resolution
  • Identify automation candidates (auto-classification, auto-routing, self-healing)
  • Plan next quarter improvements using Continual Improvement Register
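
The trend analysis in Month 3 can begin with simple frequency counts before any dedicated reporting tool is in place. The sketch below counts top categories, repeat incidents, and peak hours. The record keys, the `repeat_threshold` default, and the sample data are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def trend_report(incidents, top_n=10, repeat_threshold=3):
    """Summarize incident trends from a list of incident dicts.

    A (service, category) pair counts as a repeat incident once it has
    occurred at least repeat_threshold times.
    """
    categories = Counter(i["category"] for i in incidents)
    pairs = Counter((i["service"], i["category"]) for i in incidents)
    hours = Counter(i["opened_at"].hour for i in incidents)
    return {
        "top_categories": categories.most_common(top_n),
        "repeat_incidents": sorted(
            pair for pair, count in pairs.items() if count >= repeat_threshold
        ),
        "peak_hours": hours.most_common(3),
    }


# Sample data, invented for illustration only.
sample = [
    {"service": "email", "category": "login", "opened_at": datetime(2024, 3, 4, 9, 15)},
    {"service": "email", "category": "login", "opened_at": datetime(2024, 3, 5, 9, 40)},
    {"service": "email", "category": "login", "opened_at": datetime(2024, 3, 6, 9, 5)},
    {"service": "vpn", "category": "network", "opened_at": datetime(2024, 3, 6, 14, 30)},
]
report = trend_report(sample)
```

The `repeat_incidents` list is the natural hand-off to Problem Management: each entry is a candidate for root cause investigation.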