
Case Study: Major Cloud Outage

Scenario

A mid-size SaaS company (500 employees, 50,000 active users) experienced a complete platform outage when their cloud provider's primary data centre failed. The incident lasted 4 hours 37 minutes, with full restoration taking 7 hours. Business impact included approximately $180,000 in revenue loss, three enterprise customer escalations, one SLA breach proceeding, and decreased employee morale.


Phase 1: Chaos (First 45 minutes)

What happened

  • Multiple alerting systems fired simultaneously, overwhelming the on-call engineer
  • No clear incident commander was identified for 20 minutes
  • CEO messaged engineering Slack requesting status updates
  • Two teams independently attempted different remediation actions
  • Customer support received numerous tickets with no information to share

Complexity context: Chaotic

This situation exhibited no discernible cause-and-effect relationships; the priority was to act immediately and stabilize operations.

ITIL v5 analysis: what should have happened

ITIL Practice | Expected Action | What Actually Happened
Incident Management | Invoke Major Incident procedure; assign Incident Commander | No clear escalation; 20-minute delay
Monitoring and Event Management | Consolidate alerts, suppress cascading failures | Raw alerts overwhelmed team
Service Desk | Activate proactive communication to users | Awaited engineering status
Governance | Centralized rapid-response governance | Multiple conflicting instructions
⚠️ Key lesson: In chaotic contexts, consensus-seeking creates paralysis. Pre-defined Incident Commander authority for immediate decisions is essential.
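
To make the Monitoring and Event Management row concrete, here is a minimal sketch of alert consolidation: alerts that share a likely root cause are grouped within a short window, so the on-call engineer sees a handful of candidate incidents rather than a flood of raw notifications. The alert fields, grouping key, and window length are illustrative assumptions, not any specific monitoring tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alert:
    source: str        # emitting service, e.g. "payments-api"
    region: str        # e.g. "eu-west-1"
    symptom: str       # e.g. "connection_timeout"
    timestamp: float   # seconds since epoch

def consolidate(alerts: list[Alert], window_s: float = 120.0) -> list[list[Alert]]:
    """Group alerts that share a region and symptom and arrive within
    window_s seconds of the first alert in the group, so each group can
    be surfaced as one candidate incident instead of raw notifications."""
    groups: dict[tuple[str, str], list[list[Alert]]] = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        buckets = groups[(alert.region, alert.symptom)]
        if buckets and alert.timestamp - buckets[-1][0].timestamp <= window_s:
            buckets[-1].append(alert)   # part of the same cascading failure
        else:
            buckets.append([alert])     # new candidate incident
    return [bucket for buckets in groups.values() for bucket in buckets]
```

In this outage, the simultaneous alerts sharing a region and symptom would collapse into a single candidate incident, which is the behaviour the AIOps noise-reduction action in Phase 4 is intended to automate.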


Phase 2: Stabilization (45 minutes to 3 hours)

What happened

  • Senior engineer assumed Incident Commander role
  • Team identified primary region down; disaster recovery (DR) region available
  • DR failover attempted; the failover procedure had never been tested under real load
  • DR capacity provisioned for only 30% of production load
  • Partial service restored at reduced capacity after 2 hours

Complexity context: transitioning from Chaotic to Complex

The situation evolved from chaotic to complex: the team could sense patterns (DR worked but was undersized), but the full picture only became clear through experimentation.

ITIL v5 analysis

ITIL Practice | Analysis
Service Continuity Management | DR existed but was untested and under-provisioned; continuity plans require regular, realistic testing
Availability Management | Single-region deployment created a single point of failure
Risk Management | Cloud provider outage risk classified as "low probability"; impact was underestimated
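
One way to surface the under-provisioning finding before an incident is a routine DR readiness check that compares DR capacity against recent peak production load. The capacity fields, threshold, and function names below are illustrative assumptions; this is a sketch of the kind of check a regular DR test (see Phase 4 actions) could run, not an actual tool.

```python
from dataclasses import dataclass

@dataclass
class RegionCapacity:
    name: str
    compute_units: int       # whatever unit the platform sizes instances in
    db_read_replicas: int

def dr_readiness(primary_peak: RegionCapacity, dr: RegionCapacity,
                 required_ratio: float = 1.0) -> list[str]:
    """Return findings where the DR region cannot absorb the primary
    region's observed peak load. An empty list means 'ready'."""
    findings = []
    if dr.compute_units < required_ratio * primary_peak.compute_units:
        findings.append(
            f"compute: DR has {dr.compute_units} units, "
            f"needs {required_ratio * primary_peak.compute_units:.0f}"
        )
    if dr.db_read_replicas < primary_peak.db_read_replicas:
        findings.append(
            f"database: DR has {dr.db_read_replicas} read replicas, "
            f"primary peak used {primary_peak.db_read_replicas}"
        )
    return findings

# In this case study DR was sized at roughly 30% of production, so a
# check like this would have flagged the gap long before the failover.
print(dr_readiness(RegionCapacity("primary", 100, 4),
                   RegionCapacity("dr", 30, 1)))
```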

Phase 3: Resolution (3 to 7 hours)

What happened

  • Primary region restored at the 4-hour-37-minute mark
  • Traffic migration from DR to primary began
  • Database replication lag caused data inconsistencies
  • Full restoration, including data reconciliation, completed at hour 7
  • Customer communication was inconsistent across users

ITIL v5 analysis

Practice | Finding
Change Enablement | Failover/failback processes lacked documentation; emergency changes need pre-approved records
Service Level Management | SLA breach notification delayed 3 hours due to multiple approval requirements
Relationship Management | Account managers not notified until hour 3; should have been included in the incident communication plan
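
The data inconsistencies seen during failback are typical of cutting traffic back before replication has caught up. Below is a minimal sketch of the kind of guard a documented failback procedure could include; the lag callable, threshold, and the commented helper names are assumptions for illustration, not a specific database's API.

```python
import time

def wait_for_replication(get_lag_seconds, max_lag_s: float = 5.0,
                         timeout_s: float = 1800.0, poll_s: float = 30.0) -> bool:
    """Block failback until replication lag drops below max_lag_s.

    get_lag_seconds is any callable returning current replica lag in
    seconds (e.g. a query against the database's replication status view).
    Returns True when it is safe to cut traffic back, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        lag = get_lag_seconds()
        if lag <= max_lag_s:
            return True
        print(f"replication lag {lag:.1f}s > {max_lag_s}s, waiting...")
        time.sleep(poll_s)
    return False

# Failback step: only shift traffic once replicas have caught up.
# if wait_for_replication(lambda: query_replica_lag()):   # hypothetical helper
#     shift_traffic_to_primary()                          # hypothetical helper
# else:
#     escalate_to_incident_commander()                    # hypothetical helper
```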

Phase 4: Post-Incident Review (2 weeks later)

Blameless post-mortem findings

Contributing Factor | Category (Four Dimensions) | Root or Contributing?
No designated Incident Commander | Organizations and People | Contributing
DR never tested under load | Value Streams and Processes | Contributing
DR provisioned at 30% capacity | Information and Technology | Contributing
Single cloud region deployment | Partners and Suppliers | Contributing
Alert fatigue | Information and Technology | Contributing
No pre-written customer communication templates | Value Streams and Processes | Contributing
💡 Note: this is typical of complex systems, where failures emerge from the interaction of multiple contributing factors.

Improvement actions (registered in the Continual Improvement register)

Action | Practice | Priority | Owner
Define and train Incident Commander role | Incident Management | P1 | VP Engineering
Monthly DR failover tests under realistic load | Service Continuity Management | P1 | SRE Manager
Increase DR capacity to 100% of production | Availability Management | P1 | Infrastructure Lead
Implement AIOps noise reduction | Monitoring and Event Management | P2 | Platform Team
Create pre-approved customer communication templates | Relationship Management | P2 | Service Desk Manager
Evaluate multi-region or multi-cloud architecture | Risk Management | P2 | CTO
Automate SLA breach notification | Service Level Management | P2 | Service Manager
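
As a sketch of the last action, SLA breach notification can be triggered directly from incident timing rather than waiting on a chain of manual approvals. The SLA targets, incident fields, and the commented notification helper are illustrative assumptions, not the company's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    id: str
    started_at: datetime
    affected_tier: str   # e.g. "enterprise"

# Hypothetical SLA targets: maximum outage duration before a breach
# notification must go out, per customer tier.
SLA_BREACH_AFTER = {
    "enterprise": timedelta(hours=1),
    "standard": timedelta(hours=4),
}

def check_sla_breach(incident: Incident, now: datetime) -> bool:
    """Return True once the incident has run past the tier's SLA target.
    Intended to be polled by incident tooling so notification does not
    depend on multiple approval steps during the incident."""
    limit = SLA_BREACH_AFTER.get(incident.affected_tier)
    return limit is not None and now - incident.started_at > limit

incident = Incident("INC-1042", datetime(2024, 3, 1, 9, 0), "enterprise")
if check_sla_breach(incident, datetime(2024, 3, 1, 10, 30)):
    # notify_account_managers(incident)   # hypothetical helper
    print(f"{incident.id}: SLA breach notification due")
```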

ITIL v5 concepts demonstrated

Concept | How It Appeared
Complexity contexts | Chaotic (outage) → Complex (stabilization) → Ordered (post-mortem)
Four Dimensions | All dimensions contributed to failure
Governance patterns | Centralized (chaotic) → Guided (complex) → Compliance-based (post-mortem)
Value co-creation | Customer communication failure prevented value co-creation
Guiding Principles | "Think and work holistically" emerged as key lesson
Continual Improvement | Post-mortem actions tracked to prevent recurrence

Discussion questions

  1. Which ITIL guiding principle was most violated during the initial chaos?
  2. How would quarterly DR tests have altered outcomes?
  3. Using PESTLE analysis, what external factors should inform multi-cloud decisions?
  4. How should the Incident Commander role relate to normal governance patterns?
