Problem Management
Definition
"The practice of reducing the likelihood and impact of incidents by identifying actual and potential causes of incidents, and managing workarounds and known errors."
To fulfill the purpose, organizations need to:
- Identify and understand the problems and their impact on services
- Optimize problem resolution and mitigation
Problem management focuses on preventing incidents rather than restoring service. Incident management restores service quickly; problem management investigates the underlying causes so that incidents do not recur. The two practices work together but have fundamentally different objectives.
Key Terms
Problem: A cause or potential cause of one or more incidents.
Known error: A problem that has been analyzed but not resolved. A known error always has a documented workaround and/or root cause.
Workaround: A solution that reduces or eliminates the impact of an incident or problem for which a full resolution is not yet available.
Technical debt: The total rework backlog accumulated by choosing workarounds instead of systemic solutions.
Problem model: A repeatable approach to managing a particular type of problem.
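The relationship between a problem and a known error can be sketched as a small record type. This is a minimal illustration, not an ITSM tool schema: the class `ProblemRecord`, its field names, and the sample IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Hypothetical problem record; field names are illustrative only."""
    problem_id: str
    summary: str
    analyzed: bool = False             # has the investigation been completed?
    resolved: bool = False             # has a permanent fix been verified?
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    incident_ids: List[str] = field(default_factory=list)

    def is_known_error(self) -> bool:
        # Analyzed but not resolved, with a workaround and/or root cause on record.
        documented = bool(self.root_cause or self.workaround)
        return self.analyzed and not self.resolved and documented


p = ProblemRecord("PRB-001", "Nightly batch job exhausts DB connections")
p.incident_ids += ["INC-101", "INC-117"]
print(p.is_known_error())   # False: not yet analyzed

p.analyzed = True
p.workaround = "Restart the connection pool before the batch window"
print(p.is_known_error())   # True: analyzed, unresolved, workaround documented
```

Deriving the known-error status from the other fields, rather than storing it as a separate flag, keeps the record consistent with the definition above.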
Processes
Proactive Problem Identification
Finding problems before they cause incidents:
- Review submitted information: Analyze incident trends, monitoring data, stakeholder feedback
- Problem registration: Record the problem with supporting evidence
- Initial problem categorization and assignment: Classify and assign appropriately
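As a rough sketch of the "analyze trends" step, the function below flags categories whose recent weekly event count has risen well above their earlier average. The function name, the two-week window, the 1.5 ratio, and the sample data are assumptions chosen for illustration; a real review would tune these against the organization's own monitoring data.

```python
from statistics import mean
from typing import Dict, List


def rising_categories(weekly_counts: Dict[str, List[int]],
                      recent_weeks: int = 2,
                      ratio: float = 1.5) -> List[str]:
    """Return categories whose recent weekly average is at least
    `ratio` times their earlier average."""
    flagged = []
    for category, counts in weekly_counts.items():
        if len(counts) <= recent_weeks:
            continue  # not enough history to compare
        baseline = mean(counts[:-recent_weeks])
        recent = mean(counts[-recent_weeks:])
        # Categories with a zero baseline are skipped in this simple sketch.
        if baseline > 0 and recent >= ratio * baseline:
            flagged.append(category)
    return sorted(flagged)


# Hypothetical weekly event counts, oldest week first.
history = {
    "disk-space-warning": [2, 3, 2, 3, 6, 7],
    "password-reset": [10, 9, 11, 10, 10, 9],
    "vpn-latency": [0, 0, 0, 0, 1, 0],
}
print(rising_categories(history))   # ['disk-space-warning']
```

Each flagged category is a candidate for problem registration, with the counts attached as supporting evidence.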
Reactive Problem Identification
Finding problems after incidents occur:
- Problem registration: Record the problem triggered by incident patterns
- Initial problem categorization and assignment: Classify and assign
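One simple way to spot the incident patterns mentioned above is to group incidents by category and affected configuration item, then register a problem for any group that recurs. The sketch below assumes incidents are plain dictionaries with hypothetical `id`, `category`, and `ci` keys, and uses an arbitrary threshold of three incidents.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def problem_candidates(incidents: List[dict],
                       min_incidents: int = 3) -> Dict[Tuple[str, str], List[str]]:
    """Group incidents by (category, configuration item) and keep groups that recur."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for inc in incidents:
        groups[(inc["category"], inc["ci"])].append(inc["id"])
    # Only groups at or above the threshold are kept as problem candidates.
    return {key: ids for key, ids in groups.items() if len(ids) >= min_incidents}


incidents = [
    {"id": "INC-201", "category": "email", "ci": "mail-gw-01"},
    {"id": "INC-204", "category": "email", "ci": "mail-gw-01"},
    {"id": "INC-209", "category": "network", "ci": "core-sw-02"},
    {"id": "INC-215", "category": "email", "ci": "mail-gw-01"},
    {"id": "INC-220", "category": "email", "ci": "mail-gw-02"},
]

for (category, ci), ids in problem_candidates(incidents).items():
    print(f"Register problem: {category} on {ci}, linked incidents {ids}")
```

The returned incident IDs double as the links between the new problem record and the incidents that triggered it.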
Problem Control
Investigating problems to find root causes:
- Problem investigation: Analyze to determine root cause
- Known error communication: Document findings and communicate the known error and its workaround
Error Control
Managing known errors toward resolution:
- Problem solution development: Design a permanent fix
- Problem resolution initiation: Raise a change request to implement the fix
- Known error monitoring and review: Track impact and update workarounds
- Problem closure: Close the problem record after successful resolution and verification
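The path from registration through problem control and error control to closure can be modelled as a small state machine. The status names and allowed transitions below are assumptions made for this sketch, not terms prescribed by ITIL; in particular, the step from `CHANGE_RAISED` back to `KNOWN_ERROR` is one possible way to represent a fix that failed verification.

```python
from enum import Enum


class Status(Enum):
    REGISTERED = "registered"
    INVESTIGATING = "investigating"
    KNOWN_ERROR = "known error"
    CHANGE_RAISED = "change raised"
    CLOSED = "closed"


# Hypothetical transition table: each status maps to the statuses it may move to.
ALLOWED = {
    Status.REGISTERED: {Status.INVESTIGATING},
    Status.INVESTIGATING: {Status.KNOWN_ERROR},
    Status.KNOWN_ERROR: {Status.CHANGE_RAISED},
    Status.CHANGE_RAISED: {Status.KNOWN_ERROR, Status.CLOSED},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = Status.REGISTERED
for step in (Status.INVESTIGATING, Status.KNOWN_ERROR,
             Status.CHANGE_RAISED, Status.CLOSED):
    status = advance(status, step)
    print(status.value)
```

Rejecting a direct jump from `REGISTERED` to `CLOSED` enforces the rule that a problem is closed only after resolution and verification.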
Incident, Problem, and Change Relationship
| Step | From | Action | To |
|---|---|---|---|
| 1 | Incident | Triggers investigation | Problem |
| 2 | Problem | Identifies root cause | Known Error + Workaround |
| 3 | Known Error | Raises change request | Change Enablement → Permanent Fix |
Recommendations for Practice Success
- Start logging problems immediately, and make sure the right people are in the right roles
- The problem manager or team cannot fix every problem alone; the practice needs a "swarming" approach
- Prioritize problems by business value and publish a list of the top business problems
- SLAs don't apply directly to problems, but effective problem management significantly improves service quality
- Get user and customer input on their problems; they often have insight into patterns and impact
- Use AI and automation tools where possible for pattern detection and root cause analysis
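Prioritizing by business value and publishing a top problems list could look like the sketch below. The 1 to 5 `business_impact` rating, the tie-break on incident volume, and the sample backlog are all hypothetical; the actual scoring scheme is something each organization agrees with its business stakeholders.

```python
from typing import List, NamedTuple


class OpenProblem(NamedTuple):
    problem_id: str
    title: str
    business_impact: int    # hypothetical rating: 1 (low) to 5 (critical)
    linked_incidents: int   # incidents attributed to this problem so far


def top_problems(problems: List[OpenProblem], n: int = 3) -> List[OpenProblem]:
    """Rank by business impact first, then by incident volume."""
    ranked = sorted(
        problems,
        key=lambda p: (-p.business_impact, -p.linked_incidents, p.problem_id),
    )
    return ranked[:n]


backlog = [
    OpenProblem("PRB-010", "Intermittent checkout timeouts", 5, 14),
    OpenProblem("PRB-011", "Printer driver crashes", 2, 40),
    OpenProblem("PRB-012", "Payroll export truncates records", 5, 3),
    OpenProblem("PRB-013", "Slow VPN login", 3, 22),
]

for rank, p in enumerate(top_problems(backlog), start=1):
    print(rank, p.problem_id, p.title)
```

Note that the printer problem has the most incidents but drops off the list, which is the point of ranking by business value rather than by volume alone.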
Key Metrics
| Metric | What it measures |
|---|---|
| Incidents without known error | Problems not yet investigated or documented |
| Problems identified | Volume of new problems discovered |
| Incidents requiring urgent problem investigation | Severity-driven escalation rate |
| Incidents prevented by problem resolution | Value of proactive problem management |
| Incidents resolved by problem investigation | Reactive value contribution |
| Open known errors | Backlog of unresolved known errors |
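
Three of these metrics can be derived directly from incident and problem records. The sketch below assumes a simplified data shape (dictionaries with hypothetical `id`, `status`, and `problem_id` keys) and treats "incidents without known error" as incidents not linked to an open known error; a real ITSM platform would expose this through its own reporting.

```python
from typing import Dict, List, Optional


def key_metrics(incidents: List[Dict[str, Optional[str]]],
                problems: List[Dict[str, str]]) -> Dict[str, int]:
    """Compute three of the metrics in the table from simple record lists."""
    open_known_errors = {p["id"] for p in problems if p["status"] == "known_error"}
    # An incident counts as "without known error" if it is unlinked, or linked
    # to a problem that is not currently an open known error.
    unlinked = [i for i in incidents if i["problem_id"] not in open_known_errors]
    return {
        "incidents_without_known_error": len(unlinked),
        "problems_identified": len(problems),
        "open_known_errors": len(open_known_errors),
    }


problems = [
    {"id": "PRB-1", "status": "known_error"},
    {"id": "PRB-2", "status": "investigating"},
    {"id": "PRB-3", "status": "closed"},
]
incidents = [
    {"id": "INC-1", "problem_id": "PRB-1"},
    {"id": "INC-2", "problem_id": "PRB-2"},
    {"id": "INC-3", "problem_id": None},
    {"id": "INC-4", "problem_id": "PRB-1"},
]
print(key_metrics(incidents, problems))
```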
Key Roles
- Problem manager: Coordinates problem management activities and prioritizes the problem backlog
- Problem coordinator: Supports investigation and communication activities
Software Tools
- Monitoring and event management tools
- Workflow management and collaboration tools
- Knowledge management and CMDB tools
- Analysis and reporting tools
AI in Problem Management (ITIL v5)
| Capability | Description |
|---|---|
| Pattern detection | AI analyzes incident data to identify recurring patterns indicating problems |
| Automated RCA | AI-powered root cause analysis reduces investigation time |
| Trend prediction | Predictive analytics identify potential problems before they cause incidents |
| Knowledge linking | AI connects related problems, incidents, and known errors across the knowledge base |
90-Day Implementation Checklist
Month 1: Foundation
- Define problem management scope and relationship to incident management
- Establish Known Error Database (KEDB) structure in ITSM platform
- Create problem record template with root cause analysis fields
- Identify problem manager role and responsibilities
- Review incident data: identify top 10 recurring incident categories
- Create first 5 known error records from existing knowledge
Month 2: Reactive Problem Management
- Conduct root cause analysis (RCA) on top 3 recurring incident categories
- Document root causes, workarounds, and permanent resolution plans
- Establish problem review meetings (bi-weekly or monthly)
- Link problems to related incidents in ITSM tool
- Begin measuring: number of known errors, incidents linked to problems, repeat incident reduction
- Train L2/L3 teams on RCA techniques (5 Whys, fishbone diagram)
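The "repeat incident reduction" measure can be expressed as a percentage change against a baseline period. The function name and the sample figures below are illustrative assumptions, not reported results.

```python
def repeat_incident_reduction(baseline: int, current: int) -> float:
    """Percentage drop in repeat incidents relative to a baseline period.

    A negative result means repeat incidents increased.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a reduction")
    return (baseline - current) * 100 / baseline


# Hypothetical figures: 40 repeat incidents in the month before RCA, 28 after.
print(f"{repeat_incident_reduction(40, 28):.1f}% fewer repeat incidents")
```

Comparing like-for-like periods (for example, the same incident categories over equal time windows) keeps the figure meaningful.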
Month 3: Proactive Problem Management
- Analyze trends: identify problems before they cause incidents
- Review change failure data for systemic issues
- Conduct proactive infrastructure health reviews
- Publish "Problem Management Impact Report" showing prevented incidents
- Feed permanent resolution requirements into Change Enablement pipeline
- Plan next quarter: target next 5 incident categories for RCA