Monitoring and Event Management
Definition
"The practice of systematically observing services and service components, recording and reporting selected changes of state identified as events."
To fulfil the purpose of this practice, an organization needs to:
- Establish and maintain event models and monitoring needs
- Provide timely, relevant monitoring data to stakeholders
- Detect, interpret, and respond to events promptly
Key Terms
Event: Any change of state that has significance for the management of a service or other Configuration Item (CI).
Monitoring: Repeated observation of a system, practice, process, service, or other entity to detect events and to ensure that the current status is known.
Threshold: The value of a metric that triggers a pre-defined response. Thresholds translate abstract service level targets into concrete, actionable monitoring rules.
Alert: A notification that an action needs to be taken, a threshold has been reached, something has changed, or a failure has occurred.
Event Types
| Type | Description | Response |
|---|---|---|
| Informational | A regular change of state with no action needed | Log for reference; may be useful for trend analysis |
| Warning | A change approaching a threshold or unusual pattern | Review and assess; may require proactive action |
| Exception | A threshold has been breached or a failure has occurred | Immediate response required; triggers incident management |
From monitoring to action: Monitoring generates raw data. Event management interprets that data and triggers the appropriate response. The quality of your event classification and correlation rules determines whether your monitoring investment creates value or just noise.
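
The link between thresholds and the three event types can be illustrated with a minimal sketch. The names `Threshold` and `classify`, the disk-usage metric, and the 80/95 values are assumptions made for illustration only; they do not come from ITIL, and the sketch assumes a metric where higher values are worse.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


@dataclass(frozen=True)
class Threshold:
    """Hypothetical thresholds for a metric where higher values are worse."""
    warning: float    # value at which the metric is approaching the limit
    exception: float  # value at which the limit counts as breached


def classify(value: float, threshold: Threshold) -> EventType:
    """Map a single metric reading to one of the three event types."""
    if value >= threshold.exception:
        return EventType.EXCEPTION
    if value >= threshold.warning:
        return EventType.WARNING
    return EventType.INFORMATIONAL


# Assumed example: disk usage in percent with illustrative threshold values.
disk_usage = Threshold(warning=80.0, exception=95.0)
for reading in (42.0, 86.5, 97.0):
    print(reading, classify(reading, disk_usage).value)
```

In practice the threshold values would be derived from the relevant service level targets rather than hard-coded.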
Processes
Monitoring Planning
Designing what, how, and why to monitor:
- Define the objective: Why we monitor this component
- Define what to monitor: What needs to be and can be monitored (feasibility assessment)
- Define event types: Types of events for the object of monitoring (informational, warning, exception)
- Define thresholds: Thresholds for the different types of events
- Define service health model: End-to-end event correlations
- Define event correlations: Correlations and rule sets (how events relate to each other)
- Define monitoring action plans: What happens when events occur
- Define tool capabilities: Required monitoring tool capabilities (tooling requirements)
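
One way to keep the planning outputs together is a single record per monitored component. The sketch below is only an illustration: `MonitoringPlan`, its field names, and the checkout example are hypothetical, and a real plan would usually live in a monitoring tool or CMDB rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringPlan:
    """Hypothetical record holding the outputs of monitoring planning."""
    objective: str                           # why we monitor this component
    monitored_metrics: list[str]             # what can and needs to be monitored
    event_types: list[str]                   # informational, warning, exception
    thresholds: dict[str, dict[str, float]]  # per metric, per event type
    health_model: dict[str, list[str]]       # service -> contributing components
    correlation_rules: list[str]             # how events relate to each other
    action_plans: dict[str, str]             # event type -> planned response
    tool_capabilities: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Return the names of planning elements that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Assumed example values for a fictional checkout service.
plan = MonitoringPlan(
    objective="Know when the checkout API risks missing its response-time target",
    monitored_metrics=["p95_latency_ms", "error_rate"],
    event_types=["informational", "warning", "exception"],
    thresholds={"p95_latency_ms": {"warning": 400.0, "exception": 800.0}},
    health_model={"checkout": ["checkout-api", "payments-db"]},
    correlation_rules=[],
    action_plans={"exception": "open an incident and page the on-call engineer"},
)
print(plan.missing_elements())
```

The `missing_elements` check is a simple completeness aid: it flags planning steps that have not yet produced any output for the component.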
Event Handling
Responding to detected events:
- Detect event: Event is identified by monitoring tools
- Log event: Record the event with timestamp, source, and context
- Filter and correlate event: Remove noise; link related events
- Classify event: Determine if informational, warning, or exception
- Select event response: Choose the appropriate action based on classification
- Notify and respond: Send notifications and execute the response plan
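
The handling steps above can be sketched as a small pipeline. Everything named here (`Event`, `handle_events`, `THRESHOLDS`, `RESPONSES`, the CPU metric and its values) is an assumption for illustration. The correlation step is deliberately crude: it keeps only the worst reading per source and metric, whereas real tools apply much richer rule sets.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Event:
    source: str
    metric: str
    value: float
    timestamp: datetime
    classification: str = "unclassified"


# Hypothetical thresholds (warning, exception) and response plans.
THRESHOLDS = {"cpu_percent": (75.0, 90.0)}
RESPONSES = {
    "informational": "log only",
    "warning": "review and assess",
    "exception": "raise incident",
}


def handle_events(detected: list[Event], event_log: list[Event]) -> list[tuple[Event, str]]:
    """Log, filter/correlate, classify, and select a response for detected events."""
    # Log: record every detected event before any filtering.
    event_log.extend(detected)

    # Filter and correlate: keep only the worst reading per (source, metric).
    worst: dict[tuple[str, str], Event] = {}
    for event in detected:
        key = (event.source, event.metric)
        if key not in worst or event.value > worst[key].value:
            worst[key] = event

    results = []
    for event in worst.values():
        # Classify against the thresholds for this metric.
        warning, exception = THRESHOLDS[event.metric]
        if event.value >= exception:
            event.classification = "exception"
        elif event.value >= warning:
            event.classification = "warning"
        else:
            event.classification = "informational"
        # Select the response based on the classification.
        results.append((event, RESPONSES[event.classification]))
    return results


now = datetime.now(timezone.utc)
detected = [
    Event("web-01", "cpu_percent", 78.0, now),
    Event("web-01", "cpu_percent", 93.0, now),
    Event("web-02", "cpu_percent", 40.0, now),
]
event_log: list[Event] = []
for event, response in handle_events(detected, event_log):
    # Notify and respond: printing stands in for a real notification channel.
    print(f"{event.source}: {event.classification} -> {response}")
```

All three readings are logged, but only two responses are selected, because the two `web-01` readings are treated as one correlated event.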
Monitoring and Event Management Review
Periodic review of the practice:
- Major events review: Analyse significant events for lessons learned
- Review of filtering and correlation analysis: Tune rules to reduce noise
- Review of service health models: Update models to reflect infrastructure changes
- Review of event response procedures and automation: Optimize response plans
- Review of monitoring and event tools: Assess tool capability and gaps
- Review of statistical information: Trend analysis and capacity planning
Recommendations for Practice Success
- Develop the monitoring strategy with proper tools and processes, and review it regularly
- Understand component purposes and stakeholder needs before designing monitoring
- Adjust monitoring based on event context and significance
- Avoid monitoring events of unknown significance unless there is a clear need; this reduces noise
- Review monitoring report usage and effectiveness regularly
- Collaborate post-incident to improve prevention through monitoring
- Use automation to assess event significance and respond appropriately
Key Metrics
| Metric | What it measures |
|---|---|
| Satisfaction with practice approach | Stakeholder confidence in monitoring strategy |
| Organizational adherence to the approach | Consistency of monitoring implementation |
| Unmet or unrealistic recommendations (%) | Quality of monitoring design |
| Satisfaction with monitoring data and presentation | Usefulness of monitoring output |
| Monitoring data quality | Accuracy and completeness of collected data |
| Impact of event management errors | Consequences of misclassified or missed events |
| Event communication noise | Volume of irrelevant alerts |
| Incidents/problems from poor event management | Failures attributable to monitoring gaps |
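
As an example of making one of these metrics measurable, event communication noise could be expressed as the share of alerts that required no action. This is one possible interpretation, not an ITIL formula; the function name `noise_ratio` and the `actionable` flag on each alert are assumptions.

```python
def noise_ratio(alerts: list[dict]) -> float:
    """Share of alerts that did not require action (0.0 when there are no alerts)."""
    if not alerts:
        return 0.0
    noisy = sum(1 for alert in alerts if not alert["actionable"])
    return noisy / len(alerts)


# Assumed sample: each alert was tagged during review as actionable or not.
sample_alerts = [
    {"id": 1, "actionable": False},
    {"id": 2, "actionable": True},
    {"id": 3, "actionable": False},
    {"id": 4, "actionable": False},
]
print(f"{noise_ratio(sample_alerts):.0%} of alerts were noise")
```

Tracking this ratio across review periods would show whether tuning of filtering and correlation rules is reducing noise.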
Key Roles
This practice does not define specific named roles. Monitoring and event management responsibilities are typically distributed across infrastructure, application, and service management teams.
Software Tools
- Monitoring and event management tools (including native and add-on tools)
- Workflow management and collaboration tools
- Knowledge management and CMDB tools
- Analysis and reporting tools
AIOps and Monitoring (ITIL v5)
| Capability | Description |
|---|---|
| Anomaly detection | AI identifies unusual patterns that rule-based monitoring would miss |
| Event correlation | AI links related events across infrastructure layers to reduce noise |
| Predictive alerts | AI forecasts potential issues before thresholds are breached |
| Auto-remediation | AI triggers automated responses to known event patterns |
| Capacity prediction | AI forecasts resource needs based on usage trends |
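
To make the idea of anomaly detection concrete, the sketch below flags readings that deviate strongly from a historical baseline. It is a plain statistical stand-in (a z-score check), not an AI model and not how any particular AIOps product works; `find_anomalies`, the `z_limit` default, and the sample values are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(history: list[float], recent: list[float], z_limit: float = 3.0) -> list[float]:
    """Return recent readings more than z_limit standard deviations from the historical mean.

    Requires at least two historical readings.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is unusual.
        return [x for x in recent if x != baseline]
    return [x for x in recent if abs(x - baseline) / spread > z_limit]


# Assumed sample: response times in milliseconds.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
recent = [101.0, 104.0, 112.0]
print(find_anomalies(history, recent))
```

Unlike a fixed threshold, the limit here adapts to how much the metric normally varies, which is the basic property that learning-based anomaly detection builds on.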