Case Study: Major Cloud Outage
Scenario
A mid-size SaaS company (500 employees, 50,000 active users) experienced a complete platform outage when their cloud provider's primary data centre failed. The incident lasted 4 hours 37 minutes, with full restoration taking 7 hours. Business impact included approximately $180,000 in revenue loss, three enterprise customer escalations, one SLA breach proceeding, and decreased employee morale.
Phase 1: Chaos (First 45 minutes)
What happened
- Multiple alerting systems fired simultaneously, overwhelming the on-call engineer
- No clear incident commander was identified for 20 minutes
- CEO messaged engineering Slack requesting status updates
- Two teams independently attempted different remediation actions
- Customer support received numerous tickets with no information to share
Complexity context: Chaotic
This situation exhibited no discernible cause-and-effect relationships; the priority was to act immediately and stabilize operations.
ITIL v5 analysis: what should have happened
| ITIL Practice | Expected Action | What Actually Happened |
|---|---|---|
| Incident Management | Invoke Major Incident procedure; assign Incident Commander | No clear escalation; 20-minute delay |
| Monitoring and Event Management | Consolidate alerts and suppress cascading failures (see the sketch after this table) | Raw alerts overwhelmed the team |
| Service Desk | Activate proactive communication to users | Awaited engineering status |
| Governance | Centralized rapid-response governance | Multiple conflicting instructions |
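A minimal sketch of the alert consolidation referred to in the Monitoring and Event Management row above. It assumes a hypothetical list of alert dicts (`service`, `severity`, `timestamp`) and a static dependency map; a real deployment would do this in the monitoring or AIOps layer rather than in application code.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical static dependency map: downstream service -> upstream it depends on.
DEPENDS_ON = {"checkout": "database", "api": "database", "reports": "api"}

def consolidate_alerts(alerts, window_minutes=5):
    """Group raw alerts by service and suppress alerts whose upstream
    dependency is already alerting (cascading-failure suppression)."""
    window = timedelta(minutes=window_minutes)
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        by_service[alert["service"]].append(alert)

    consolidated = []
    alerting_services = set(by_service)
    for service, service_alerts in by_service.items():
        upstream = DEPENDS_ON.get(service)
        if upstream in alerting_services:
            continue  # the upstream failure already explains this alert
        first = service_alerts[0]
        consolidated.append({
            "service": service,
            "severity": max(a["severity"] for a in service_alerts),
            "count": sum(1 for a in service_alerts
                         if a["timestamp"] - first["timestamp"] <= window),
            "first_seen": first["timestamp"],
        })
    return consolidated

if __name__ == "__main__":
    now = datetime.utcnow()
    raw = [
        {"service": "database", "severity": 3, "timestamp": now},
        {"service": "checkout", "severity": 2, "timestamp": now + timedelta(seconds=30)},
        {"service": "api", "severity": 2, "timestamp": now + timedelta(seconds=45)},
    ]
    for page in consolidate_alerts(raw):
        print(page)  # only the database alert pages the on-call engineer
```

Suppressing alerts whose upstream dependency is already firing is what keeps a single data-centre failure from producing hundreds of pages for the on-call engineer.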
Key lesson: in chaotic contexts, consensus-seeking creates paralysis; a pre-defined Incident Commander with authority to make immediate decisions is essential.
Phase 2: Stabilization (45 minutes to 3 hours)
What happened
- Senior engineer assumed Incident Commander role
- Team identified primary region down; disaster recovery (DR) region available
- DR failover was attempted, but it had never been tested under real load
- DR capacity provisioned for only 30% of production load
- Partial service restored at reduced capacity after 2 hours
Complexity context: transitioning from Chaotic to Complex
The situation evolved from chaotic to complex: the team could sense patterns (DR worked but was undersized), but the full picture only became clear through experimentation.
ITIL v5 analysis
| ITIL Practice | Analysis |
|---|---|
| Service Continuity Management | DR existed but was untested and under-provisioned; continuity plans require regular, realistic testing (see the sketch after this table) |
| Availability Management | Single-region deployment created a single point of failure |
| Risk Management | Cloud provider outage risk classified as "low probability"; impact was underestimated |
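A minimal sketch of an automated DR readiness check that Service Continuity Management could run before each failover drill. The `get_peak_production_rps` and `get_dr_capacity_rps` helpers, and the numbers inside them, are hypothetical placeholders for whatever metrics store and capacity API the team actually uses.

```python
import sys

# Hypothetical helpers: in practice these would query the metrics store
# and the cloud provider's capacity / autoscaling APIs.
def get_peak_production_rps() -> float:
    """Peak requests-per-second observed in production over the last 30 days."""
    return 12_000.0  # placeholder value

def get_dr_capacity_rps() -> float:
    """Throughput the DR region is currently provisioned to handle."""
    return 3_600.0  # placeholder (~30% of production, as in this incident)

def check_dr_readiness(required_ratio: float = 1.0) -> bool:
    """Fail unless DR capacity covers `required_ratio` of the production peak."""
    peak = get_peak_production_rps()
    dr = get_dr_capacity_rps()
    ratio = dr / peak
    print(f"Production peak: {peak:.0f} rps, DR capacity: {dr:.0f} rps ({ratio:.0%})")
    return ratio >= required_ratio

if __name__ == "__main__":
    # Run from CI or a scheduler before each failover drill; a non-zero exit
    # code blocks the drill and raises an entry in the improvement register.
    sys.exit(0 if check_dr_readiness() else 1)
```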
Phase 3: Resolution (3 to 7 hours)
What happened
- Primary region restored 4 hours 37 minutes into the incident
- Traffic migration from DR back to the primary region began
- Database replication lag caused data inconsistencies during failback (see the sketch after this list)
- Full restoration, including data reconciliation, completed at the 7-hour mark
- Customer communication was inconsistent across users
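A minimal sketch of the failback guard referenced in the list above: wait for replication lag to fall below a threshold before migrating traffic back to the primary region. The `get_replica_lag_seconds` helper and the thresholds are assumptions; the real check would query the database's own replication metrics.

```python
import time

MAX_LAG_SECONDS = 5          # assumed acceptable lag before failback may proceed
CHECK_INTERVAL_SECONDS = 30
TIMEOUT_SECONDS = 1_800      # give up after 30 minutes and escalate instead

def get_replica_lag_seconds() -> float:
    """Hypothetical helper: how far the primary-region database lags behind
    the DR region that has been taking writes during the outage."""
    return 2.0  # placeholder; replace with a real replication-metrics query

def wait_for_replication_catchup() -> bool:
    """Block failback until lag is below MAX_LAG_SECONDS, or return False
    if it does not converge within TIMEOUT_SECONDS."""
    deadline = time.monotonic() + TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        lag = get_replica_lag_seconds()
        print(f"replication lag: {lag:.1f}s")
        if lag <= MAX_LAG_SECONDS:
            return True
        time.sleep(CHECK_INTERVAL_SECONDS)
    return False

if __name__ == "__main__":
    if wait_for_replication_catchup():
        print("Safe to migrate traffic back to the primary region.")
    else:
        print("Lag did not converge; escalate to the Incident Commander before failback.")
```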
ITIL v5 analysis
| Practice | Finding |
|---|---|
| Change Enablement | Failover and failback procedures lacked documentation; emergency changes of this kind should be backed by pre-approved change records |
| Service Level Management | SLA breach notification was delayed 3 hours by multiple approval requirements |
| Relationship Management | Account managers were not notified until hour 3; they should have been included in the incident communication plan |
Phase 4: Post-Incident Review (2 weeks later)
Blameless post-mortem findings
| Contributing Factor | Category (Four Dimensions) | Root or Contributing? |
|---|---|---|
| No designated Incident Commander | Organizations and People | Contributing |
| DR never tested under load | Value Streams and Processes | Contributing |
| DR provisioned at 30% capacity | Information and Technology | Contributing |
| Single cloud region deployment | Partners and Suppliers | Contributing |
| Alert fatigue | Information and Technology | Contributing |
| No pre-written customer communication templates | Value Streams and Processes | Contributing |
Note: "This is typical of complex systems: failures emerge from the interaction of multiple contributing factors."
Improvement actions (registered in the Continual Improvement register)
| Action | Practice | Priority | Owner |
|---|---|---|---|
| Define and train Incident Commander role | Incident Management | P1 | VP Engineering |
| Monthly DR failover tests under realistic load | Service Continuity Management | P1 | SRE Manager |
| Increase DR capacity to 100% of production | Availability Management | P1 | Infrastructure Lead |
| Implement AIOps noise reduction | Monitoring and Event Management | P2 | Platform Team |
| Create pre-approved customer communication templates | Relationship Management | P2 | Service Desk Manager |
| Evaluate multi-region or multi-cloud architecture | Risk Management | P2 | CTO |
| Automate SLA breach notification (see the sketch after this table) | Service Level Management | P2 | Service Manager |
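A minimal sketch of the SLA breach notification automation in the last improvement action above. The 99.9% monthly availability target and the `notify_account_team` helper are assumptions standing in for the actual contractual terms and notification channel.

```python
from datetime import timedelta

# Assumed contractual terms; the real values live in each customer's SLA.
MONTHLY_UPTIME_TARGET = 0.999          # 99.9% availability
MINUTES_IN_MONTH = 30 * 24 * 60
ALLOWED_DOWNTIME = timedelta(minutes=(1 - MONTHLY_UPTIME_TARGET) * MINUTES_IN_MONTH)

def notify_account_team(message: str) -> None:
    """Hypothetical stand-in for the real notification channel
    (email, status page, account-manager paging, etc.)."""
    print(f"[SLA ALERT] {message}")

def check_sla_breach(downtime_this_month: timedelta, incident_downtime: timedelta) -> bool:
    """Notify immediately if the current incident pushes accumulated
    downtime past the contractual allowance."""
    total = downtime_this_month + incident_downtime
    if total > ALLOWED_DOWNTIME:
        notify_account_team(
            f"Monthly downtime {total} exceeds allowance {ALLOWED_DOWNTIME}; "
            "breach notification is due now, not after manual approvals."
        )
        return True
    return False

if __name__ == "__main__":
    # The case-study outage alone (4h37m) far exceeds a 99.9% monthly allowance (~43m).
    check_sla_breach(timedelta(0), timedelta(hours=4, minutes=37))
```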
ITIL v5 concepts demonstrated
| Concept | How It Appeared |
|---|---|
| Complexity contexts | Chaotic (outage) → Complex (stabilization) → Ordered (post-mortem) |
| Four Dimensions | All dimensions contributed to failure |
| Governance patterns | Centralized (chaotic) → Guided (complex) → Compliance-based (post-mortem) |
| Value co-creation | Customer communication failure prevented value co-creation |
| Guiding Principles | "Think and work holistically" emerged as key lesson |
| Continual Improvement | Post-mortem actions tracked to prevent recurrence |
Discussion questions
- Which ITIL guiding principle was most violated during the initial chaos?
- How would quarterly DR tests have altered the outcome?
- Using PESTLE analysis, what external factors should inform multi-cloud decisions?
- How should the Incident Commander role relate to normal governance patterns?
Related pages
- Complexity-Based Decisions (four complexity contexts)
- DevOps and SRE Integration (SRE practices for reliability)
- Incident Management (practice detail)
- Service Continuity Management (DR and BCP)