Availability Management
Definition
"The practice of ensuring that services deliver the agreed levels of availability to meet the needs of customers and users."
To fulfill this purpose, organizations must:
- Establish a shared view of target service levels
- Identify availability requirements
- Measure, assess, and report service availability
- Treat service availability risks
Key Terms
Availability: The ability of an IT service or other configuration item to perform its agreed function when required.
Measuring Availability
Calculating Availability
Availability (%) = ((Agreed Service Time - Downtime) / Agreed Service Time) × 100
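As a rough illustration of the formula above, the sketch below computes availability for a single reporting period. The function name `availability_percent`, the choice of minutes as the unit, and the sample figures are assumptions made for this example only.

```python
def availability_percent(agreed_service_minutes: float, downtime_minutes: float) -> float:
    """Availability (%) = (agreed service time - downtime) / agreed service time x 100."""
    if agreed_service_minutes <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_minutes <= agreed_service_minutes:
        raise ValueError("downtime must be between 0 and the agreed service time")
    return (agreed_service_minutes - downtime_minutes) / agreed_service_minutes * 100


# Hypothetical example: a service agreed for 24x7 over a 30-day month
# (43,200 minutes) that was down for 90 minutes in total.
result = availability_percent(30 * 24 * 60, 90)
print(f"{result:.3f}%")  # prints 99.792%
```

Only downtime that falls inside the agreed service time should be counted, so planned maintenance outside agreed hours would normally be excluded from `downtime_minutes`.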
Common Availability Targets
| % Availability | Downtime/Month | Downtime/Year | Typical Use |
|---|---|---|---|
| 99% | ~7.3 hours | ~3.65 days | Internal systems, non-critical |
| 99.9% | ~43.8 minutes | ~8.76 hours | Business applications |
| 99.99% | ~4.4 minutes | ~52.6 minutes | E-commerce, financial services |
| 99.999% | ~26 seconds | ~5.3 minutes | Emergency services, critical infrastructure |
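The downtime figures in the table can be reproduced from the target percentage alone. The sketch below assumes a 365-day year (8,760 hours) and an average month of one twelfth of that (730 hours); the names `allowed_downtime_hours`, `HOURS_PER_YEAR`, and `HOURS_PER_MONTH` are illustrative, not part of any standard.

```python
HOURS_PER_YEAR = 365 * 24               # 8,760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 hours, an average month


def allowed_downtime_hours(target_percent: float, period_hours: float) -> float:
    """Return the downtime budget (in hours) implied by an availability target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target must be a percentage between 0 and 100")
    return period_hours * (1 - target_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    monthly_minutes = allowed_downtime_hours(target, HOURS_PER_MONTH) * 60
    yearly_hours = allowed_downtime_hours(target, HOURS_PER_YEAR)
    print(f"{target}%: {monthly_minutes:.1f} min/month, {yearly_hours:.2f} h/year")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why higher targets tend to cost disproportionately more to achieve.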
MTBF, MTRS, and MTBSI
| Metric | Abbreviation | Description |
|---|---|---|
| Mean Time Between Failures | MTBF | Average time a service runs without interruption, measured from restoration to the next failure |
| Mean Time to Restore Service | MTRS | Average time taken to restore service after a failure |
| Mean Time Between Service Incidents | MTBSI | Average time from the start of one incident to the start of the next |
Formulas
- MTBSI = MTBF + MTRS
- Availability = MTBF / (MTBF + MTRS)
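To make the relationship between the three metrics concrete, the following sketch derives them from a list of outages within one observation period. It assumes MTBF is total uptime divided by the number of failures and MTRS is total downtime divided by the number of failures; the `Outage` class, the function `reliability_metrics`, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    start_hour: float  # hours since the start of the observation period
    end_hour: float    # hour at which service was restored

    @property
    def duration(self) -> float:
        return self.end_hour - self.start_hour


def reliability_metrics(period_hours: float, outages: list[Outage]) -> dict[str, float]:
    """Compute MTBF, MTRS, MTBSI, and availability for one observation period."""
    if not outages:
        raise ValueError("at least one outage is needed to compute mean times")
    total_downtime = sum(o.duration for o in outages)
    failures = len(outages)
    mtrs = total_downtime / failures                   # mean time to restore service
    mtbf = (period_hours - total_downtime) / failures  # mean uptime per failure
    mtbsi = mtbf + mtrs                                # MTBSI = MTBF + MTRS
    availability = mtbf / (mtbf + mtrs)                # fraction between 0 and 1
    return {"mtbf": mtbf, "mtrs": mtrs, "mtbsi": mtbsi, "availability": availability}


# Hypothetical 30-day period (720 hours) with three outages totalling 6 hours.
sample = [Outage(100, 102), Outage(400, 401), Outage(650, 653)]
metrics = reliability_metrics(720, sample)
print(metrics["mtbf"], metrics["mtrs"], metrics["mtbsi"])  # prints 238.0 2.0 240.0
```

With these definitions, MTBF / (MTBF + MTRS) equals total uptime divided by the length of the period, so it agrees with the percentage formula when the agreed service time covers the whole period.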
Processes
Managing Product and Service Availability
- Analyze requirements to understand business availability needs
- Propose and verify solution design with appropriate controls
- Support and verify solution testing and implementation
- Support and verify monitoring and reporting
- Analyze data and initiate improvements
Measuring and Reporting Availability
- Analyze measurement and reporting needs and capabilities
- Agree availability measurement and reporting requirements
- Design availability measurements and reports
- Implement availability measurement and reporting
- Review availability measurement and reporting
Techniques to Improve Availability
| Technique | Description |
|---|---|
| Redundancy | Duplicate critical components to prevent single points of failure |
| Fault Tolerance | Design systems to continue operating when components fail |
| Active Monitoring | Detect problems before users notice them |
| Automated Failover | Automatically switch to backup systems during failures |
| Regular Testing | Verify that availability measures work as intended |
| Capacity Management | Prevent failures caused by resource exhaustion |
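The effect of redundancy on availability can be estimated with a simple model. The sketch below assumes component failures are statistically independent, which real systems often violate (shared power, shared software defects), so treat the results as an upper bound rather than a prediction. The function names are illustrative.

```python
import math


def series_availability(components: list[float]) -> float:
    """All components are required: the service is up only if every one is up."""
    return math.prod(components)


def parallel_availability(components: list[float]) -> float:
    """Redundant components: the service is up if at least one is up."""
    return 1 - math.prod(1 - a for a in components)


# Hypothetical example with two servers that are each 99% available.
print(f"{series_availability([0.99, 0.99]):.4f}")    # prints 0.9801
print(f"{parallel_availability([0.99, 0.99]):.4f}")  # prints 0.9999
```

Under this assumption, chaining two 99% components lowers availability to about 98%, while running them as a redundant pair raises it to about 99.99%.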
Recommendations for Practice Success
- Understand service consumer needs and expectations beyond technical metrics
- Understand legal and regulatory availability requirements
- Design for the availability that actual business needs require, not for theoretical maximums
- Continually improve service availability without significantly increasing costs
- Automate availability controls where practical
- Integrate the practice into organizational value streams
Key Metrics
| Metric | What It Measures |
|---|---|
| Products/services with documented availability criteria | Coverage of availability requirements |
| Critical products/services with SLA-based availability requirements | Alignment with business needs |
| Timely updates to availability requirements | Responsiveness to change |
| Products/services monitored for availability | Monitoring coverage |
| Minimum time between failures | System reliability |
| Number of service disruptions | Incident frequency |
| Total downtime over the period | Aggregate impact |
| Maximum service outage | Worst-case impact |
| MTRS (Mean Time to Restore Service) | Recovery effectiveness |
| Effective availability controls | Control quality |
| Ratio of actual losses to expected losses | Accuracy of risk assessment |
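Several of the outage-related metrics in the table can be derived from the same incident log. The sketch below is one possible way to do this, assuming each outage is recorded as a (start, restored) pair of timestamps and that outages do not overlap; `summarize_outages` and the sample records are hypothetical.

```python
from datetime import datetime, timedelta


def summarize_outages(outages: list[tuple[datetime, datetime]]) -> dict:
    """Summarize non-overlapping (start, restored) outage records for one period."""
    if not outages:
        raise ValueError("no outages recorded for the period")
    ordered = sorted(outages)
    durations = [restored - start for start, restored in ordered]
    # Uptime gaps: from each restoration to the start of the next outage.
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    total = sum(durations, timedelta())
    return {
        "disruptions": len(ordered),
        "total_downtime": total,
        "max_outage": max(durations),
        "mtrs": total / len(ordered),
        "min_time_between_failures": min(gaps) if gaps else None,
    }


# Hypothetical outage log for one month.
log = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 16, 30)),
    (datetime(2024, 1, 25, 22, 15), datetime(2024, 1, 25, 22, 30)),
]
summary = summarize_outages(log)
print(summary["disruptions"], summary["total_downtime"], summary["max_outage"])
```

Reporting the maximum outage alongside MTRS is useful because a single long outage can be hidden by an otherwise healthy average.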
Key Roles
- Availability Manager: Coordinates availability management activities, designs availability solutions, and reports on performance
Software Tools
- Availability and capacity modelling and management tools
- Automated testing tools
- Monitoring and event management tools
- Architecture management tools
- Analysis and reporting tools
- Service catalogue and CMDB tools
- Risk management tools