Problem Management

Definition

"The practice of reducing the likelihood and impact of incidents by identifying actual and potential causes of incidents, and managing workarounds and known errors."

To fulfill the purpose, organizations need to:

  • Identify and understand the problems and their impact on services
  • Optimize problem resolution and mitigation

Problem management is about prevention, not reaction. While incident management restores service quickly, problem management investigates underlying causes to prevent recurring incidents. These practices work together but have fundamentally different objectives.

Key Terms

Problem: A cause or potential cause of one or more incidents.

Known error: A problem that has been analyzed but not resolved. A known error always has a documented workaround, a documented root cause, or both.

Workaround: A solution that reduces or eliminates the impact of an incident or problem for which a full resolution is not yet available.

Technical debt: The total backlog of rework that accumulates from choosing workarounds instead of systemic solutions.

Problem model: A repeatable approach to managing a particular type of problem.
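
The relationship between a problem and a known error can be sketched as a small data structure. This is a minimal illustration, not an ITIL-defined schema: the class name, field names, and record IDs are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    """A cause or potential cause of one or more incidents (hypothetical schema)."""
    problem_id: str
    summary: str
    incident_ids: List[str] = field(default_factory=list)  # empty for a potential cause
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    resolved: bool = False

    def is_known_error(self) -> bool:
        # Analyzed (workaround and/or root cause documented) but not yet resolved.
        analyzed = self.root_cause is not None or self.workaround is not None
        return analyzed and not self.resolved


p = Problem("PRB-1", "Nightly batch job exhausts database connections", ["INC-10", "INC-14"])
print(p.is_known_error())  # False: nothing documented yet
p.workaround = "Restart the connection pool before the batch window"
print(p.is_known_error())  # True: analyzed but not resolved
```

Modeling the known error as a state of the problem record, rather than a separate object, is one possible design choice; some ITSM tools keep separate records instead.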

Processes

Proactive Problem Identification

Finding problems before they cause incidents:

  1. Review submitted information: Analyze incident trends, monitoring data, stakeholder feedback
  2. Problem registration: Record the problem with supporting evidence
  3. Initial problem categorization and assignment: Classify and assign appropriately
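
As a rough illustration of reviewing incident trends, the sketch below flags categories whose weekly incident counts are rising. The category names, the weekly counts, and the `min_slope` threshold are all hypothetical, and a least-squares slope is only one simple way to detect a trend.

```python
from typing import Dict, List


def slope(values: List[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def rising_categories(weekly_counts: Dict[str, List[int]], min_slope: float = 1.0) -> List[str]:
    """Return categories whose incident count grows by at least min_slope per week."""
    return sorted(c for c, counts in weekly_counts.items() if slope(counts) >= min_slope)


# Hypothetical incident counts for the last four weeks, per category.
weekly = {
    "disk-space": [1, 2, 4, 5],
    "password-reset": [9, 8, 9, 8],
    "vpn": [3, 3, 3, 3],
}
print(rising_categories(weekly))  # ['disk-space']
```

A category flagged this way is only a candidate; it still needs to be registered and categorized as described in steps 2 and 3.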

Reactive Problem Identification

Finding problems after incidents occur:

  1. Problem registration: Record the problem triggered by an incident or a pattern of incidents
  2. Initial problem categorization and assignment: Classify and assign
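
One simple way to spot incident patterns worth registering as problems is to group incidents by service and category and flag groups that repeat. The sketch below assumes hypothetical incident records and an arbitrary threshold of three; real ITSM exports will have different field names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical incident records exported from an ITSM tool.
incidents = [
    {"id": "INC-1", "service": "email", "category": "login-failure"},
    {"id": "INC-2", "service": "email", "category": "login-failure"},
    {"id": "INC-3", "service": "vpn", "category": "timeout"},
    {"id": "INC-4", "service": "email", "category": "login-failure"},
    {"id": "INC-5", "service": "vpn", "category": "timeout"},
]


def problem_candidates(
    records: List[Dict[str, str]], threshold: int = 3
) -> Dict[Tuple[str, str], List[str]]:
    """Group incidents by (service, category); keep groups with at least threshold incidents."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for inc in records:
        groups[(inc["service"], inc["category"])].append(inc["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= threshold}


for (service, category), ids in problem_candidates(incidents).items():
    print(f"Candidate problem: {service}/{category} linked to {ids}")
```

With the sample data, only the email login failures reach the threshold; the two VPN timeouts do not.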

Problem Control

Investigating problems to find root causes:

  1. Problem investigation: Analyze to determine root cause
  2. Known error communication: Document findings and communicate known error and workaround

Error Control

Managing known errors toward resolution:

  1. Problem solution development: Design permanent fix
  2. Problem resolution initiation: Raise change request to implement fix
  3. Known error monitoring and review: Track impact and update workarounds
  4. Problem closure: Close problem record after successful resolution and verification
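
The problem control and error control steps above can be pictured as a small state machine. The state names and allowed transitions below are assumptions chosen for illustration, not statuses prescribed by ITIL or by any particular tool.

```python
from enum import Enum


class State(Enum):
    REGISTERED = "registered"
    INVESTIGATING = "investigating"
    KNOWN_ERROR = "known_error"      # root cause and/or workaround documented
    CHANGE_RAISED = "change_raised"  # permanent fix handed to change enablement
    CLOSED = "closed"


# Hypothetical transition rules: a failed fix returns the record to KNOWN_ERROR.
ALLOWED = {
    State.REGISTERED: {State.INVESTIGATING},
    State.INVESTIGATING: {State.KNOWN_ERROR},
    State.KNOWN_ERROR: {State.CHANGE_RAISED},
    State.CHANGE_RAISED: {State.KNOWN_ERROR, State.CLOSED},
    State.CLOSED: set(),
}


def advance(current: State, target: State) -> State:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = State.REGISTERED
for target in (State.INVESTIGATING, State.KNOWN_ERROR, State.CHANGE_RAISED, State.CLOSED):
    state = advance(state, target)
print(state.value)  # closed
```

Making closure reachable only from `CHANGE_RAISED` mirrors the idea that a problem record is closed after the fix has been implemented and verified.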

Incident, Problem, and Change Relationship

Step | From | Action | To
1 | Incident | Triggers investigation | Problem
2 | Problem | Identifies root cause | Known Error + Workaround
3 | Known Error | Raises change request | Change Enablement → Permanent Fix

Recommendations for Practice Success

  • Start logging problems immediately and ensure the right people are in the right roles
  • The problem manager or team will not fix every problem alone: the practice needs a "swarming" approach
  • Prioritize problems by business value; publish top business problems list
  • SLAs don't apply directly to problems, but effective problem management significantly improves service quality
  • Get user and customer input on their problems; they often have insight into patterns and impact
  • Use AI and automation tools where possible for pattern detection and root cause analysis
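
As an illustration of prioritizing by business value, the sketch below ranks a problem backlog and prints a short "top problems" list. The `business_impact` rating (1 to 5), the tie-break on incident volume, and the sample records are assumptions; each organization will define its own scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemSummary:
    problem_id: str
    title: str
    linked_incidents: int  # incidents attributed to this problem
    business_impact: int   # hypothetical rating: 1 (low) to 5 (critical)


def top_problems(problems: List[ProblemSummary], limit: int = 3) -> List[ProblemSummary]:
    """Rank by business impact first, then by incident volume, highest first."""
    ranked = sorted(
        problems,
        key=lambda p: (p.business_impact, p.linked_incidents),
        reverse=True,
    )
    return ranked[:limit]


backlog = [
    ProblemSummary("PRB-1", "Checkout timeouts", linked_incidents=12, business_impact=5),
    ProblemSummary("PRB-2", "Slow report export", linked_incidents=40, business_impact=2),
    ProblemSummary("PRB-3", "Badge reader resets", linked_incidents=7, business_impact=3),
    ProblemSummary("PRB-4", "Payroll sync failures", linked_incidents=4, business_impact=5),
]

for p in top_problems(backlog):
    print(p.problem_id, p.title)
```

Note that PRB-2 has the most incidents but falls off the list, because this ranking deliberately puts business impact ahead of raw volume.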

Key Metrics

Metric | What it measures
Incidents without known error | Problems not yet investigated or documented
Problems identified | Volume of new problems discovered
Incidents requiring urgent problem investigation | Severity-driven escalation rate
Incidents prevented by problem resolution | Value of proactive problem management
Incidents resolved by problem investigation | Reactive value contribution
Open known errors | Backlog of unresolved known errors
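
Some of these metrics can be derived directly from linked records. The sketch below computes two of them under assumed record shapes: the field names (`problem_id`, `known_error`, `resolved`) and the sample data are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical records: each incident optionally links to a problem.
incidents: List[Dict[str, Optional[str]]] = [
    {"id": "INC-1", "problem_id": "PRB-1"},
    {"id": "INC-2", "problem_id": "PRB-1"},
    {"id": "INC-3", "problem_id": None},
    {"id": "INC-4", "problem_id": "PRB-2"},
]
problems: Dict[str, Dict[str, bool]] = {
    "PRB-1": {"known_error": True, "resolved": False},
    "PRB-2": {"known_error": False, "resolved": False},
    "PRB-3": {"known_error": True, "resolved": True},
}


def incidents_without_known_error(incs, probs) -> int:
    """Count incidents with no linked problem, or whose problem is not yet a known error."""
    count = 0
    for inc in incs:
        problem = probs.get(inc["problem_id"]) if inc["problem_id"] else None
        if problem is None or not problem["known_error"]:
            count += 1
    return count


def open_known_errors(probs) -> int:
    """Count known errors that have not been resolved."""
    return sum(1 for p in probs.values() if p["known_error"] and not p["resolved"])


print(incidents_without_known_error(incidents, problems))  # 2 (INC-3 and INC-4)
print(open_known_errors(problems))                         # 1 (PRB-1)
```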

Key Roles

  • Problem manager: Coordinates problem management activities, prioritizes problem backlog
  • Problem coordinator: Supports investigation and communication activities

Software Tools

  • Monitoring and event management tools
  • Workflow management and collaboration tools
  • Knowledge management and CMDB tools
  • Analysis and reporting tools

AI in Problem Management (ITIL v5)

Capability | Description
Pattern detection | AI analyzes incident data to identify recurring patterns indicating problems
Automated RCA | AI-powered root cause analysis reduces investigation time
Trend prediction | Predictive analytics identify potential problems before causing incidents
Knowledge linking | AI connects related problems, incidents, and known errors across knowledge base
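
To make the idea of pattern detection concrete, the sketch below groups incidents whose summaries share most of their words. It uses plain Jaccard word overlap as a deliberately simple stand-in for an AI model; the 0.5 threshold and the sample summaries are assumptions, and production tools would typically use richer text similarity.

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings: |intersection| / |union|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def group_similar(summaries: Dict[str, str], threshold: float = 0.5) -> List[List[str]]:
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    clusters: List[List[str]] = []
    for inc_id, text in summaries.items():
        for cluster in clusters:
            if jaccard(text, summaries[cluster[0]]) >= threshold:
                cluster.append(inc_id)
                break
        else:
            clusters.append([inc_id])  # no similar cluster found, start a new one
    return clusters


# Hypothetical incident summaries.
summaries = {
    "INC-1": "vpn connection drops after login",
    "INC-2": "vpn connection drops during login",
    "INC-3": "printer out of toner",
}
print(group_similar(summaries))  # [['INC-1', 'INC-2'], ['INC-3']]
```

A cluster with several incidents is a hint that a single underlying problem may exist; it does not establish the root cause.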

90-Day Implementation Checklist

Month 1: Foundation

  • Define problem management scope and relationship to incident management
  • Establish Known Error Database (KEDB) structure in ITSM platform
  • Create problem record template with root cause analysis fields
  • Identify problem manager role and responsibilities
  • Review incident data: identify top 10 recurring incident categories
  • Create first 5 known error records from existing knowledge
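
A known error database can start as a very small structure. The sketch below shows one possible record layout and a keyword lookup; the field names, the matching rule (all symptom keywords must appear in the incident description), and the sample records are assumptions rather than a standard KEDB schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class KnownErrorRecord:
    error_id: str
    symptoms: Set[str]         # lowercase keywords that describe the symptom
    workaround: str
    root_cause: Optional[str]  # may still be unknown while a workaround exists


# Hypothetical starter records built from existing team knowledge.
kedb = [
    KnownErrorRecord("KE-1", {"vpn", "timeout"}, "Reconnect using the backup gateway", None),
    KnownErrorRecord("KE-2", {"printer", "offline"}, "Power-cycle the print server",
                     "Driver memory leak"),
]


def find_known_errors(description: str, records: List[KnownErrorRecord]) -> List[KnownErrorRecord]:
    """Return records whose symptom keywords all appear in the incident description."""
    words = set(description.lower().split())
    return [rec for rec in records if rec.symptoms <= words]


for match in find_known_errors("VPN timeout when connecting from home", kedb):
    print(match.error_id, "->", match.workaround)
```

Keeping the workaround as a required field reflects how the service desk would use the lookup: to resolve a matching incident quickly while the permanent fix is pending.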

Month 2: Reactive Problem Management

  • Conduct root cause analysis (RCA) on top 3 recurring incident categories
  • Document root causes, workarounds, and permanent resolution plans
  • Establish problem review meetings (bi-weekly or monthly)
  • Link problems to related incidents in ITSM tool
  • Begin measuring: number of known errors, incidents linked to problems, repeat incident reduction
  • Train L2/L3 teams on RCA techniques (5 Whys, fishbone diagram)
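
Repeat incident reduction can be expressed as a percentage change against a baseline period. The sketch below is one straightforward way to compute it; the function name and the sample counts are hypothetical.

```python
def repeat_incident_reduction(before: int, after: int) -> float:
    """Percentage reduction in repeat incidents relative to the baseline period."""
    if before <= 0:
        raise ValueError("baseline count must be positive")
    return 100.0 * (before - after) / before


# Hypothetical example: 40 repeat incidents in the baseline month, 28 after RCA work.
print(repeat_incident_reduction(40, 28))  # 30.0
```

A negative result means repeat incidents increased, which is worth reporting as well.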

Month 3: Proactive Problem Management

  • Analyze trends: identify problems before they cause incidents
  • Review change failure data for systemic issues
  • Conduct proactive infrastructure health reviews
  • Publish "Problem Management Impact Report" showing prevented incidents
  • Feed permanent resolution requirements into Change Enablement pipeline
  • Plan next quarter: target next 5 incident categories for RCA