Problem Management

Definition

"The practice of reducing the likelihood and impact of incidents by identifying actual and potential causes of incidents, and managing workarounds and known errors."

To fulfill this purpose, organizations need to:

Identify and understand the problems and their impact on services
Optimize problem resolution and mitigation

⚠️

Problem management focuses on preventing incidents, not on restoring service. While incident management restores service quickly, problem management investigates underlying causes to prevent recurring incidents. The two practices work together but have fundamentally different objectives.

Key Terms

Problem: A cause or potential cause of one or more incidents.

Known error: A problem that has been analyzed but not resolved. A known error always has a documented workaround and/or root cause.

Workaround: A solution that reduces or eliminates the impact of an incident or problem for which a full resolution is not yet available.

Technical debt: The total rework backlog accumulated by choosing workarounds instead of systemic solutions that would take longer to implement.

Problem model: A repeatable approach to managing a particular type of problem.
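
The relationship between a problem and a known error can be illustrated with a small data model. This is a minimal sketch, not an ITIL-defined schema: the class name `Problem` and the fields `root_cause`, `workaround`, and `resolved` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    """A cause, or potential cause, of one or more incidents."""
    problem_id: str
    summary: str
    incident_ids: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None   # filled in by problem investigation
    workaround: Optional[str] = None   # interim way to reduce impact
    resolved: bool = False             # True once a permanent fix is verified

    def is_known_error(self) -> bool:
        # Analyzed (root cause and/or workaround documented) but not yet resolved.
        analyzed = self.root_cause is not None or self.workaround is not None
        return analyzed and not self.resolved


p = Problem("PRB-001", "Nightly batch job exhausts DB connections",
            ["INC-101", "INC-117"])
print(p.is_known_error())  # False: not yet analyzed
p.workaround = "Restart the connection pool before the batch window"
print(p.is_known_error())  # True: analyzed but unresolved
```

In this sketch a known error is a state of a problem record rather than a separate record type. That is one possible modelling choice, and ITSM tools differ on it.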

Processes

Proactive Problem Identification

Finding problems before they cause incidents:

Review submitted information: Analyze incident trends, monitoring data, and stakeholder feedback
Problem registration: Record the problem with supporting evidence
Initial problem categorization and assignment: Classify and assign appropriately
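
As an illustration of the "analyze incident trends" step, the sketch below flags incident categories whose volume in the most recent window is well above the window before it. The function name `rising_categories`, the 7-day window, the minimum count, and the 2x ratio are all assumptions for the example, not ITIL guidance. Real thresholds depend on your incident volumes.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable, List, Tuple

Incident = Tuple[date, str]  # (opened_on, category)


def rising_categories(incidents: Iterable[Incident], today: date,
                      window_days: int = 7, min_count: int = 3,
                      ratio: float = 2.0) -> List[str]:
    """Return categories whose recent volume is at least `ratio` times
    the volume of the preceding window of the same length."""
    recent_start = today - timedelta(days=window_days)
    previous_start = today - timedelta(days=2 * window_days)
    recent: Counter = Counter()
    previous: Counter = Counter()
    for opened_on, category in incidents:
        if recent_start < opened_on <= today:
            recent[category] += 1
        elif previous_start < opened_on <= recent_start:
            previous[category] += 1
    flagged = []
    for category, count in recent.items():
        baseline = max(previous[category], 1)  # avoid division by zero
        if count >= min_count and count / baseline >= ratio:
            flagged.append(category)
    return sorted(flagged)


sample = [
    (date(2024, 3, 5), "vpn"),
    (date(2024, 3, 11), "vpn"), (date(2024, 3, 12), "vpn"),
    (date(2024, 3, 13), "vpn"), (date(2024, 3, 14), "vpn"),
    (date(2024, 3, 3), "email"), (date(2024, 3, 4), "email"),
    (date(2024, 3, 6), "email"), (date(2024, 3, 10), "email"),
    (date(2024, 3, 12), "email"), (date(2024, 3, 14), "email"),
    (date(2024, 3, 14), "printer"),
]
print(rising_categories(sample, today=date(2024, 3, 15)))  # ['vpn']
```

A flagged category is only a candidate. Someone still needs to register the problem with supporting evidence and categorize it.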

Reactive Problem Identification

Finding problems after incidents occur:

Problem registration: Record the problem triggered by incident patterns
Initial problem categorization and assignment: Classify and assign

Problem Control

Investigating problems to find root causes:

Problem investigation: Analyze the problem to determine its root cause
Known error communication: Document findings and communicate the known error and its workaround

Error Control

Managing known errors toward resolution:

Problem solution development: Design a permanent fix
Problem resolution initiation: Raise a change request to implement the fix
Known error monitoring and review: Track impact and update workarounds
Problem closure: Close the problem record after successful resolution and verification
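
The path from registration through problem control and error control to closure can be sketched as a small state machine. The state names and the allowed transitions below are assumptions made for illustration. In particular, the step back from "resolution initiated" to "known error" (for a change that fails verification) is a modelling choice, not something the process list above prescribes.

```python
from enum import Enum
from typing import Dict, Set


class Status(Enum):
    REGISTERED = "registered"
    UNDER_INVESTIGATION = "under investigation"
    KNOWN_ERROR = "known error"
    RESOLUTION_INITIATED = "resolution initiated"
    CLOSED = "closed"


# Hypothetical transition table: each state maps to the states it may move to.
ALLOWED: Dict[Status, Set[Status]] = {
    Status.REGISTERED: {Status.UNDER_INVESTIGATION},
    Status.UNDER_INVESTIGATION: {Status.KNOWN_ERROR},
    Status.KNOWN_ERROR: {Status.RESOLUTION_INITIATED},
    # A change that fails verification sends the record back to known error.
    Status.RESOLUTION_INITIATED: {Status.KNOWN_ERROR, Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return `target` if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"cannot move from {current.value!r} to {target.value!r}")
    return target


status = Status.REGISTERED
for step in (Status.UNDER_INVESTIGATION, Status.KNOWN_ERROR,
             Status.RESOLUTION_INITIATED, Status.CLOSED):
    status = advance(status, step)
print(status.value)  # closed
```

Encoding the transitions in one table keeps a record from being closed before a fix has been initiated and verified.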

Incident, Problem, and Change Relationship

Step	From	Action	To
1	Incident	Triggers investigation	Problem
2	Problem	Identifies root cause	Known Error + Workaround
3	Known Error	Raises change request	Change Enablement → Permanent Fix

Recommendations for Practice Success

Start logging problems immediately and ensure the right people are in the right roles
The problem manager or team won't fix every problem alone: the practice needs a "swarming" approach that pulls in the relevant specialists
Prioritize problems by business value and publish a list of the top business problems
SLAs don't apply directly to problems, but effective problem management significantly improves service quality
Get user and customer input on their problems; they often have insight into patterns and impact
Use AI and automation tools where possible for pattern detection and root cause analysis
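
One simple way to build a "top business problems" list is to rank problems by an estimate of their recurring business cost. The sketch below assumes two hypothetical inputs per problem, `incidents_per_month` and `cost_per_incident`, and multiplies them. How your organization estimates business value is up to you, and this is only one possible scoring rule.

```python
from typing import List, NamedTuple


class ProblemEntry(NamedTuple):
    problem_id: str
    summary: str
    incidents_per_month: int
    cost_per_incident: float  # hypothetical estimate of business cost


def monthly_cost(p: ProblemEntry) -> float:
    return p.incidents_per_month * p.cost_per_incident


def top_business_problems(problems: List[ProblemEntry],
                          n: int = 3) -> List[ProblemEntry]:
    """Return the n problems with the highest estimated monthly cost."""
    return sorted(problems, key=monthly_cost, reverse=True)[:n]


backlog = [
    ProblemEntry("PRB-A", "Slow VPN login", 10, 200.0),         # 2000
    ProblemEntry("PRB-B", "Payment gateway timeout", 2, 5000.0),  # 10000
    ProblemEntry("PRB-C", "Printer driver crash", 40, 20.0),      # 800
    ProblemEntry("PRB-D", "Report export failure", 5, 500.0),     # 2500
]
for entry in top_business_problems(backlog):
    print(entry.problem_id, monthly_cost(entry))
# PRB-B 10000.0
# PRB-D 2500.0
# PRB-A 2000.0
```

Note that the most frequent problem (PRB-C) does not make the list. Ranking by business cost and ranking by incident count can give different answers.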

Key Metrics

Metric	What it measures
Incidents without known error	Incidents with no linked known error, indicating problems not yet investigated or documented
Problems identified	Volume of new problems discovered
Incidents requiring urgent problem investigation	Severity-driven escalation rate
Incidents prevented by problem resolution	Value of proactive problem management
Incidents resolved by problem investigation	Reactive value contribution
Open known errors	Backlog of unresolved known errors
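
Two of these metrics can be derived directly from incident-to-problem links. The sketch below assumes a hypothetical export with two mappings: `incident_links` (incident ID to linked problem ID, or `None`) and `known_error_open` (problem ID of each known error to whether it is still unresolved). Real ITSM tools expose this data differently.

```python
from typing import Dict, Optional


def problem_metrics(incident_links: Dict[str, Optional[str]],
                    known_error_open: Dict[str, bool]) -> Dict[str, int]:
    """Compute two backlog metrics from incident-to-problem links."""
    # An incident counts as "without known error" if it has no linked problem,
    # or if the linked problem has not been documented as a known error.
    without_known_error = sum(
        1 for problem_id in incident_links.values()
        if problem_id is None or problem_id not in known_error_open
    )
    open_known_errors = sum(1 for is_open in known_error_open.values() if is_open)
    return {
        "incidents_without_known_error": without_known_error,
        "open_known_errors": open_known_errors,
    }


incident_links = {
    "INC-1": "PRB-1",
    "INC-2": "PRB-1",
    "INC-3": None,      # not linked to any problem
    "INC-4": "PRB-2",
    "INC-5": "PRB-3",   # problem exists but is not yet a known error
}
known_error_open = {"PRB-1": True, "PRB-2": False}

print(problem_metrics(incident_links, known_error_open))
# {'incidents_without_known_error': 2, 'open_known_errors': 1}
```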

Key Roles

Problem manager: Coordinates problem management activities, prioritizes problem backlog
Problem coordinator: Supports investigation and communication activities

Software Tools

Monitoring and event management tools
Workflow management and collaboration tools
Knowledge management and CMDB tools
Analysis and reporting tools

AI in Problem Management (ITIL v5)

Capability	Description
Pattern detection	AI analyzes incident data to identify recurring patterns indicating problems
Automated RCA	AI-assisted root cause analysis can reduce investigation time
Trend prediction	Predictive analytics identify potential problems before they cause incidents
Knowledge linking	AI connects related problems, incidents, and known errors across knowledge base

90-Day Implementation Checklist

Month 1: Foundation

Define problem management scope and relationship to incident management
Establish Known Error Database (KEDB) structure in ITSM platform
Create problem record template with root cause analysis fields
Identify problem manager role and responsibilities
Review incident data: identify top 10 recurring incident categories
Create first 5 known error records from existing knowledge
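
For the incident data review, a frequency count over exported incident categories is often enough to get started. The sketch below is a minimal example using Python's `collections.Counter`. The function name, the `min_occurrences` cutoff, and the sample categories are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_recurring_categories(categories: Iterable[str], n: int = 10,
                             min_occurrences: int = 2) -> List[Tuple[str, int]]:
    """Return up to n (category, count) pairs, most frequent first,
    keeping only categories seen at least `min_occurrences` times."""
    counts = Counter(categories)
    recurring = [(category, count)
                 for category, count in counts.most_common()
                 if count >= min_occurrences]
    return recurring[:n]


exported = ["password reset", "vpn", "vpn", "email",
            "password reset", "vpn", "printer"]
print(top_recurring_categories(exported))
# [('vpn', 3), ('password reset', 2)]
```

Categories that occur only once are dropped here because a single incident is not yet a recurrence. Lower `min_occurrences` to 1 if you want the full ranking.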

Month 2: Reactive Problem Management

Conduct root cause analysis (RCA) on top 3 recurring incident categories
Document root causes, workarounds, and permanent resolution plans
Establish problem review meetings (bi-weekly or monthly)
Link problems to related incidents in ITSM tool
Begin measuring: number of known errors, incidents linked to problems, repeat incident reduction
Train L2/L3 teams on RCA techniques (5 Whys, fishbone diagram)

Month 3: Proactive Problem Management

Analyze trends: identify problems before they cause incidents
Review change failure data for systemic issues
Conduct proactive infrastructure health reviews
Publish "Problem Management Impact Report" showing prevented incidents
Feed permanent resolution requirements into Change Enablement pipeline
Plan next quarter: target next 5 incident categories for RCA
