Monitoring and Event Management

Definition

"The practice of systematically observing services and service components, recording and reporting selected changes of state identified as events."

To fulfil the purpose, an organization needs to:

Establish and maintain event models and monitoring needs
Provide timely, relevant monitoring data to stakeholders
Detect, interpret, and respond to events promptly

Key Terms

Event: Any change of state that has significance for the management of a service or other Configuration Item (CI).

Monitoring: Repeated observation of a system, practice, process, service, or other entity to detect events and to ensure that the current status is known.

Threshold: The value of a metric that triggers a pre-defined response. Thresholds translate abstract service level targets into concrete, actionable monitoring rules.

Alert: A notification that an action needs to be taken, a threshold has been reached, something has changed, or a failure has occurred.

Event Types

Type	Description	Response
Informational	A regular change of state with no action needed	Log for reference; may be useful for trend analysis
Warning	A change approaching a threshold or unusual pattern	Review and assess; may require proactive action
Exception	A threshold has been breached or a failure has occurred	Immediate response required; triggers incident management
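
The three event types map naturally onto threshold checks. The following is a minimal sketch, assuming a single metric where higher values are worse. The `Thresholds` class, the `classify` function, and the 70/90 percent CPU figures are hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical thresholds for one metric where higher is worse."""
    warning: float    # value at which the metric is approaching its limit
    exception: float  # value at which the limit counts as breached


def classify(value: float, t: Thresholds) -> str:
    """Map a metric reading to one of the three event types."""
    if value >= t.exception:
        return "exception"      # threshold breached: immediate response
    if value >= t.warning:
        return "warning"        # approaching the threshold: review and assess
    return "informational"      # regular change of state: log only


# Illustrative values only: CPU utilization in percent.
cpu = Thresholds(warning=70.0, exception=90.0)
for reading in (35.0, 82.5, 97.0):
    print(reading, classify(reading, cpu))
```

In practice the threshold values would be derived from the service level targets agreed for the service, not hard-coded.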

💡

From monitoring to action: Monitoring generates raw data. Event management interprets that data and triggers the appropriate response. The quality of your event classification and correlation rules determines whether your monitoring investment creates value or just noise.

Processes

Monitoring Planning

Designing what, how, and why to monitor:

Define the objective: Why this component is monitored
Define what to monitor: What needs to be, and can be, monitored (feasibility assessment)
Define event types: Informational, warning, and exception events for the object of monitoring
Define thresholds: Threshold values for the different types of events
Define the service health model: End-to-end event correlations
Define event correlations and rule sets: How events relate to each other
Define monitoring action plans: What happens when events occur
Define tool capabilities: Required monitoring tool capabilities (tooling requirements)
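
One way to picture a service health model is as a dependency map that rolls component states up to the services that rely on them. The sketch below is an assumption-laden illustration, not a prescribed design: the service and component names, the `HEALTH_MODEL` map, and the "worst state wins" roll-up rule are all hypothetical.

```python
# Severity order used for roll-up; names follow the event types in this section.
SEVERITY = {"informational": 0, "warning": 1, "exception": 2}

# Hypothetical dependency map: service -> components it relies on.
HEALTH_MODEL = {
    "web-shop": ["load-balancer", "app-server", "database"],
    "reporting": ["database", "batch-scheduler"],
}


def service_health(component_status: dict[str, str]) -> dict[str, str]:
    """Roll component states up to services by taking the worst state.

    Components with no reported state are treated as informational
    (an assumption made to keep the sketch short).
    """
    result = {}
    for service, components in HEALTH_MODEL.items():
        states = [component_status.get(c, "informational") for c in components]
        result[service] = max(states, key=SEVERITY.__getitem__)
    return result


status = {"database": "warning", "app-server": "exception"}
print(service_health(status))
```

A real model would usually also account for redundancy, so that a single failed node behind a load balancer does not mark the whole service as failed.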

Event Handling

Responding to detected events:

Detect event: Event is identified by monitoring tools
Log event: Record the event with timestamp, source, and context
Filter and correlate event: Remove noise; link related events
Classify event: Determine if informational, warning, or exception
Select event response: Choose the appropriate action based on classification
Notify and respond: Send notifications and execute the response plan
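
The steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the `Event` fields, the `THRESHOLDS` and `RESPONSES` tables, and the 60-second correlation window are hypothetical, and correlation is reduced to collapsing repeated readings from the same source and metric into their peak value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary start
    source: str       # CI that raised the event
    metric: str
    value: float


THRESHOLDS = {"cpu_percent": (70.0, 90.0)}  # hypothetical (warning, exception)
RESPONSES = {                               # hypothetical response plan
    "informational": "log only",
    "warning": "notify owning team",
    "exception": "open incident",
}
WINDOW = 60.0  # seconds; repeats inside this window are correlated


def correlate(events: list[Event]) -> list[Event]:
    """Collapse repeats of one source and metric, keeping the peak reading."""
    groups: list[Event] = []
    index: dict[tuple[str, str], int] = {}    # key -> position in groups
    start: dict[tuple[str, str], float] = {}  # key -> time the group opened
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.metric)
        if key in index and event.timestamp - start[key] < WINDOW:
            if event.value > groups[index[key]].value:
                groups[index[key]] = event
        else:
            index[key] = len(groups)
            start[key] = event.timestamp
            groups.append(event)
    return groups


def classify(event: Event) -> str:
    warning, exception = THRESHOLDS[event.metric]
    if event.value >= exception:
        return "exception"
    if event.value >= warning:
        return "warning"
    return "informational"


def handle(detected: list[Event]):
    log = list(detected)             # log every detected event
    actions = []
    for event in correlate(log):     # filter and correlate
        kind = classify(event)       # classify
        actions.append((event.source, kind, RESPONSES[kind]))  # select response
    return log, actions


detected = [
    Event(0.0, "app-01", "cpu_percent", 75.0),
    Event(20.0, "app-01", "cpu_percent", 95.0),
    Event(30.0, "app-02", "cpu_percent", 40.0),
    Event(200.0, "app-01", "cpu_percent", 72.0),
]
log, actions = handle(detected)
for source, kind, response in actions:
    print(f"{source}: {kind} -> {response}")  # stand-in for sending a notification
```

Keeping the full log while acting only on the correlated events preserves the raw record for later review and trend analysis.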

Monitoring and Event Management Review

Periodic review of the practice:

Major events review: Analyse significant events for lessons learned
Review of filtering and correlation analysis: Tune rules to reduce noise
Review of service health models: Update models to reflect infrastructure changes
Review of event response procedures and automation: Optimize response plans
Review of monitoring and event tools: Assess tool capability and gaps
Review of statistical information: Trend analysis and capacity planning

Recommendations for Practice Success

Develop the monitoring strategy with proper tools and processes, and review it regularly
Understand component purposes and stakeholder needs before designing monitoring
Adjust monitoring based on event context and significance
Avoid monitoring events whose significance is unknown; doing so only adds noise
Review monitoring report usage and effectiveness regularly
Collaborate post-incident to improve prevention through monitoring
Use automation to assess event significance and respond appropriately

Key Metrics

Metric	What it measures
Satisfaction with practice approach	Stakeholder confidence in monitoring strategy
Organizational adherence to the approach	Consistency of monitoring implementation
Unmet or unrealistic recommendations (%)	Quality of monitoring design
Satisfaction with monitoring data and presentation	Usefulness of monitoring output
Monitoring data quality	Accuracy and completeness of collected data
Impact of event management errors	Consequences of misclassified or missed events
Event communication noise	Volume of irrelevant alerts
Incidents/problems from poor event management	Failures attributable to monitoring gaps
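
The table does not give formulas for these metrics. As one possible illustration, event communication noise could be approximated as the share of alerts that required no action. Both this definition and the `actionable` field are assumptions made for the sketch:

```python
def noise_ratio(alerts: list[dict]) -> float:
    """Share of alerts that needed no action (assumed definition of noise)."""
    if not alerts:
        return 0.0
    noisy = sum(1 for alert in alerts if not alert["actionable"])
    return noisy / len(alerts)


# Hypothetical alert records for one reporting period.
alerts = [
    {"id": 1, "actionable": True},
    {"id": 2, "actionable": False},
    {"id": 3, "actionable": False},
    {"id": 4, "actionable": True},
]
print(f"Noise: {noise_ratio(alerts):.0%}")
```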

Key Roles

💡

This practice does not define specific named roles. Monitoring and event management responsibilities are typically distributed across infrastructure, application, and service management teams.

Software Tools

Monitoring and event management tools (including native and add-on tools)
Workflow management and collaboration tools
Knowledge management and CMDB tools
Analysis and reporting tools

AIOps and Monitoring (ITIL v5)

Capability	Description
Anomaly detection	AI identifies unusual patterns that rule-based monitoring would miss
Event correlation	AI links related events across infrastructure layers to reduce noise
Predictive alerts	AI forecasts potential issues before thresholds are breached
Auto-remediation	AI triggers automated responses to known event patterns
Capacity prediction	AI forecasts resource needs based on usage trends
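
To make the contrast with fixed thresholds concrete, the sketch below flags readings that deviate sharply from their own recent history. It uses a rolling z-score, which is a simple statistical stand-in rather than the models an AIOps product would actually use. The function name, window size, limit, and latency figures are hypothetical.

```python
from statistics import mean, stdev


def anomalies(values: list[float], window: int = 5, z_limit: float = 3.0) -> list[int]:
    """Return indices whose value is more than z_limit standard deviations
    away from the mean of the preceding `window` readings."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: z-score is undefined, so skip
        if abs(values[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Illustrative response times in milliseconds with one obvious spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(anomalies(latency_ms))
```

No fixed threshold is configured here: the spike is flagged only because it is far outside what the preceding readings suggest is normal.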
