Operate

Stage 6 in the Lifecycle

Operate maintains live products and supporting systems in an "agreed working state" so delivery can proceed without unnecessary disruption. Customers rarely interact directly with operations, but they feel the impact when availability or performance fails.

What you should take away

State the official purpose of operate
Define event, incident, and change as introduced in this chapter
Explain SRE and observability at Foundation level
List the three workflow steps

Official purpose

The purpose of operate is to maintain and monitor digital products and supporting systems, ensuring they remain reliable and perform as agreed.

Operations work includes:

Running platforms and systems
Routine testing (continuity, security)
Backups and monitoring
Event handling
Policy and compliance maintenance

Good operations remain largely invisible; failures become visible through delivery and support channels.

Key facts

Question	Answer
Why?	Maintain and monitor products ensuring optimal performance and reliability
Who?	Product teams, IT operations teams, SRE teams
When?	Continually; triggered by transitioned solutions and onboarded suppliers
Key outputs?	Operating products/services, performance records and reports
Success metrics?	Monitoring coverage/effectiveness, reliability, incident impact, stakeholder satisfaction

Key definitions

Event: Any change of state that matters for managing a product, service, or configuration item
Incident: Unplanned interruption or quality reduction in service; causes include technology errors, human error, external factors, or unauthorized changes
Change: Addition, modification, or removal of anything affecting products or services

Reliability, SRE, and observability

Reliability = a product or system performing its intended function for the required time or number of cycles.

Site Reliability Engineering (SRE) applies software-engineering discipline to operations in order to build scalable, reliable systems.

Observability = inferring a system's internal state from its external signals (metrics, logs, traces). Products should be "designed for observability" so that they produce strong operational data.
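To make the three signal types concrete, here is a minimal sketch using only the Python standard library. The handler, the "/broken" path, and the in-memory stores are hypothetical illustrations, not part of the official guidance; a real product would send these signals to dedicated tooling.

```python
import json
import time
import uuid
from collections import Counter

# In-memory stand-ins for real telemetry backends (assumption for illustration).
metrics = Counter()   # metrics: numeric aggregates over many requests
logs = []             # logs: discrete structured records of what happened
spans = []            # traces: timed steps that share one trace_id


def handle_request(path, trace_id=None):
    """Hypothetical request handler that emits all three signal types."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = 500 if path == "/broken" else 200   # pretend one path fails
    duration_ms = (time.perf_counter() - start) * 1000

    metrics[("requests_total", status)] += 1
    logs.append(json.dumps({"trace_id": trace_id, "path": path, "status": status}))
    spans.append({"trace_id": trace_id, "name": "handle_request",
                  "duration_ms": duration_ms})
    return status
```

Because the log line and the span carry the same trace_id, an operator who sees the error count rise in the metrics can follow it to the matching log record and timing data without access to the handler's internals.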

Collaboration into Operate

Effective operations require early involvement from operations/SRE teams during design and transition phases for visibility, knowledge transfer, and feedback loops. Dedicated SRE teams coordinate with multiple product teams.

High-level workflow (three steps)

Assess transitioned solutions and operational requirements

Plan operational activities; confirm resource availability

Execute operational plans; report status to stakeholders

Triggers and outputs

Triggered by: Deployments to live environments, transitioned resources, onboarded suppliers

Outputs feed: Deliver and support activities. Stable products underpin delivery, while deviations trigger support. Operational data also informs discover, design, build, and transition.

Extended operations view

Monitoring and response

Real-time monitoring, alerting, dashboards
AIOps-style pattern recognition
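As a rough illustration of the difference between a fixed alert rule and pattern-based detection, the sketch below pairs a static threshold check with a simple rolling z-score check. The z-score is a deliberately basic stand-in, much simpler than real AIOps tooling; the function names, window size, and limits are assumptions chosen for the example.

```python
from statistics import mean, stdev


def threshold_alert(value, limit):
    """Static rule: alert when a reading exceeds a fixed limit."""
    return value > limit


def zscore_anomalies(readings, window=5, z_limit=3.0):
    """Return indexes of readings far from the mean of the preceding window."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat history (sigma == 0) to avoid dividing by zero.
        if sigma > 0 and abs(readings[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged
```

For example, with response times of roughly 100 ms followed by a single 250 ms spike, the z-score check would flag only the spike, whereas the static rule fires for any reading above whatever limit was configured.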

Incident and problem (operational lens)

Restore service quickly
Reduce repeat incidents through problem management

Related management practices

Practice	Role
Monitoring and Event Management	Observe and classify
Incident Management	Restore service
Problem Management	Reduce underlying causes
Infrastructure and Platform Management	Run platforms
Information Security Management	Operational security
Availability Management	Meet availability targets
Capacity and Performance Management	Performance and capacity

Inputs and outputs

Inputs: Live product, runbooks, SLAs, monitoring configuration

Outputs: Operational metrics, incidents/problems, reports, improvement signals

Metrics (examples)

Availability
MTTD / MTTR (mean time to detect / mean time to restore)
Incident volume and trends
Automation rate
Customer-impacting incidents
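Several of these metrics can be derived from basic incident records. The sketch below is one possible calculation, not an official formula: the Incident fields are hypothetical, incidents are assumed not to overlap, every incident is treated as full downtime, and MTTR is measured here from the start of the fault to restoration (some organizations measure from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime         # when the fault began
    detected: datetime        # when monitoring or a user noticed it
    restored: datetime        # when service returned to its agreed state
    customer_impacting: bool


def average(deltas):
    """Average a list of timedeltas; zero duration if the list is empty."""
    return sum(deltas, timedelta()) / len(deltas) if deltas else timedelta()


def summarize(incidents, period):
    """Compute example operate metrics for one reporting period."""
    downtime = sum((i.restored - i.started for i in incidents), timedelta())
    return {
        "availability": 1 - downtime / period,   # fraction of period in service
        "mttd": average([i.detected - i.started for i in incidents]),
        "mttr": average([i.restored - i.started for i in incidents]),
        "incident_volume": len(incidents),
        "customer_impacting": sum(i.customer_impacting for i in incidents),
    }
```

Calling summarize with a month of incidents and period=timedelta(days=30) returns availability as a fraction between 0 and 1, plus MTTD and MTTR as durations that can be trended over successive periods.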

