Operate
Stage 6 in the Lifecycle
Operate maintains live products and supporting systems in an "agreed working state" so delivery can proceed without unnecessary disruption. Customers rarely interact directly with operations, but they feel the impact when availability or performance falls short.
What you should take away
- State the official purpose of operate
- Define event, incident, and change as introduced in this chapter
- Explain SRE and observability at Foundation level
- List the three workflow steps
Official purpose
The purpose of operate is to maintain and monitor digital products and supporting systems, ensuring they remain reliable and perform as agreed.
Operations work includes:
- Running platforms and systems
- Routine testing (continuity, security)
- Backups and monitoring
- Event handling
- Policy and compliance maintenance
Good operations remain largely invisible; failures become visible through delivery and support channels.
Key facts
| Question | Answer |
|---|---|
| Why? | Maintain and monitor products ensuring optimal performance and reliability |
| Who? | Product teams, IT operations teams, SRE teams |
| When? | Continually; triggered by transitioned solutions and onboarded suppliers |
| Key outputs? | Operating products/services, performance records and reports |
| Success metrics? | Monitoring coverage/effectiveness, reliability, incident impact, stakeholder satisfaction |
Key definitions
- Event: Any change of state that matters for managing a product, service, or configuration item
- Incident: Unplanned interruption to a service or reduction in its quality; causes include technology errors, human error, external factors, or unauthorized changes
- Change: Addition, modification, or removal of anything affecting products or services
Reliability, SRE, and observability
Reliability = the ability to perform the intended function for the required time or number of cycles.
Site Reliability Engineering (SRE) applies software-engineering discipline to operations, building scalable, reliable systems.
Observability = the ability to infer a system's internal state from its external signals (metrics, logs, traces). Products should be "designed for observability" so that operations has the data it needs to understand their behavior.
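To make the three signal types concrete, here is a minimal sketch using only the Python standard library. The function `handle_request`, the metric names, and the in-memory stores are hypothetical stand-ins chosen for illustration; a real product would normally send these signals to a monitoring backend rather than keep them in lists.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # metrics: numeric counters aggregated over time
log_lines = []        # logs: structured records of individual occurrences
spans = []            # traces: timed steps linked by a shared trace id


def handle_request(order_id, trace_id=None):
    """Process a hypothetical order and emit all three signal types."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    ok = order_id > 0  # stand-in for real business logic
    duration_ms = (time.perf_counter() - start) * 1000

    metrics["requests_total"] += 1
    if not ok:
        metrics["requests_failed"] += 1
    log_lines.append(json.dumps({
        "level": "info" if ok else "error",
        "msg": "order processed" if ok else "order rejected",
        "order_id": order_id,
        "trace_id": trace_id,
    }))
    spans.append({"trace_id": trace_id, "name": "handle_request",
                  "duration_ms": duration_ms})
    return ok


handle_request(42, trace_id="t-1")
handle_request(-1, trace_id="t-2")
print(dict(metrics))  # {'requests_total': 2, 'requests_failed': 1}
```

Because every log line and span carries the same `trace_id`, an operator could start from a failure counted in the metrics and follow it to the specific request that caused it, which is the kind of inference observability is meant to support.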
Collaboration into Operate
Effective operations require early involvement from operations/SRE teams during design and transition phases for visibility, knowledge transfer, and feedback loops. Dedicated SRE teams coordinate with multiple product teams.
High-level workflow (three steps)
1. Assess transitioned solutions and operational requirements
2. Plan operational activities; confirm resource availability
3. Execute operational plans; report status to stakeholders
Triggers and outputs
Triggered by: Deployments to live environments, transitioned resources, onboarded suppliers
Outputs feed: Deliver and support activities. Stable products underpin delivery, and deviations trigger support. Operational data also informs discover, design, build, and transition.
Extended operations view
Monitoring and response
- Real-time monitoring, alerting, dashboards
- AIOps-style pattern recognition
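As an illustration of real-time alerting, the sketch below raises an alert when the average of the most recent latency samples exceeds a threshold. The class name `LatencyMonitor`, the 200 ms threshold, and the window of three samples are assumptions made for this example, not values from the chapter. Averaging over a window is one simple way to avoid alerting on a single spike; AIOps-style tooling would apply more sophisticated pattern recognition.

```python
from collections import deque
from statistics import mean


class LatencyMonitor:
    """Hypothetical rolling-window alert on average latency."""

    def __init__(self, threshold_ms, window=3):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # keeps only the newest samples

    def record(self, latency_ms):
        """Add a sample; return an alert message if the window average breaches the threshold."""
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        if window_full and mean(self.samples) > self.threshold_ms:
            return (f"ALERT: avg latency {mean(self.samples):.0f} ms "
                    f"> {self.threshold_ms} ms")
        return None


monitor = LatencyMonitor(threshold_ms=200, window=3)
alerts = [monitor.record(v) for v in [120, 150, 180, 400, 450]]
for alert in alerts:
    if alert:
        print(alert)
```

With these sample values, no alert fires until the fourth reading, when the window average first rises above 200 ms.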
Incident and problem (operational lens)
- Restore service quickly
- Reduce repeat incidents through problem management
Related management practices
| Practice | Role |
|---|---|
| Monitoring and Event Management | Observe and classify |
| Incident Management | Restore service |
| Problem Management | Reduce underlying causes |
| Infrastructure and Platform Management | Run platforms |
| Information Security Management | Operational security |
| Availability Management | Meet availability targets |
| Capacity and Performance Management | Performance and capacity |
Inputs and outputs
Inputs: Live product, runbooks, SLAs, monitoring configuration
Outputs: Operational metrics, incidents/problems, reports, improvement signals
Metrics (examples)
- Availability
- MTTD / MTTR (mean time to detect / mean time to restore)
- Incident volume and trends
- Automation rate
- Customer-impacting incidents
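Several of these metrics can be derived from basic incident records. The sketch below is one possible calculation under stated assumptions: the incident data is invented, the reporting window is 30 days, MTTD is measured from fault start to detection, MTTR is measured from detection to restoration, and downtime counts the full span from fault start to restoration. Organizations define these intervals differently, so treat the formulas as illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical incident records for a 30-day reporting window.
incidents = [
    {"started": datetime(2024, 1, 3, 10, 0),
     "detected": datetime(2024, 1, 3, 10, 10),
     "restored": datetime(2024, 1, 3, 11, 0),
     "customer_impacting": True},
    {"started": datetime(2024, 1, 17, 22, 0),
     "detected": datetime(2024, 1, 17, 22, 20),
     "restored": datetime(2024, 1, 17, 22, 50),
     "customer_impacting": False},
]
period = timedelta(days=30)


def mean_minutes(deltas):
    """Average a sequence of timedeltas and return the result in minutes."""
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


# Assumed definitions: detect = start -> detection, restore = detection -> restoration.
mttd = mean_minutes(i["detected"] - i["started"] for i in incidents)
mttr = mean_minutes(i["restored"] - i["detected"] for i in incidents)

# Availability = share of the period during which the service was not down.
downtime = sum((i["restored"] - i["started"] for i in incidents), timedelta())
availability = 1 - downtime / period

customer_impacting = sum(i["customer_impacting"] for i in incidents)

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
print(f"Availability: {availability:.3%}")
print(f"Incidents: {len(incidents)} ({customer_impacting} customer-impacting)")
```

For this sample data the two incidents add up to 110 minutes of downtime, which gives an availability of roughly 99.7% over the 30-day window.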