Reliability is a function of MTTF (mean time to failure) and MTTR (mean time to repair). For emergency response, MTTR is the most relevant metric — how quickly can the team bring the system back to health?
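
The relationship can be made concrete with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function name and all numbers are assumptions, not figures from the source.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is healthy."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails, on average, once every 500 hours.
slow_repair = availability(mttf_hours=500, mttr_hours=1.5)  # 90-minute repairs
fast_repair = availability(mttf_hours=500, mttr_hours=0.5)  # 30-minute repairs

print(f"90-minute MTTR: {slow_repair:.4%}")
print(f"30-minute MTTR: {fast_repair:.4%}")
```

With MTTF held constant, shortening MTTR is the only lever left, which is why emergency response focuses on it.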
## The 3x Playbook Effect
Thinking through and recording best practices ahead of time in a "playbook" produces roughly a **3x improvement in MTTR** compared to "winging it."
The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better.
## Humans Add Latency
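Every incident that waits on a person includes the time to page, respond, and diagnose before repair even starts. A system that can avoid emergencies requiring human intervention will therefore have higher availability than one that requires hands-on intervention, even if the automated system experiences more actual failures.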
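
A back-of-the-envelope sketch of this claim: annual downtime is roughly the number of failures multiplied by the mean time to repair each one. The class, field names, and every number below are hypothetical, chosen only to show how a system that fails more often but recovers automatically can still accumulate less downtime.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass(frozen=True)
class System:
    name: str
    failures_per_year: int
    mttr_minutes: float  # mean time to repair a single failure

    def downtime_hours(self) -> float:
        """Approximate yearly downtime as failures times repair time."""
        return self.failures_per_year * self.mttr_minutes / 60

    def availability(self) -> float:
        return 1 - self.downtime_hours() / HOURS_PER_YEAR


# Hypothetical: the automated system fails 4x as often but recovers in 2 minutes;
# the manual system fails less often but waits about 45 minutes on a paged human.
automated = System("automated failover", failures_per_year=40, mttr_minutes=2)
manual = System("manual recovery", failures_per_year=10, mttr_minutes=45)

for system in (automated, manual):
    print(f"{system.name}: {system.downtime_hours():.2f} h down per year, "
          f"{system.availability():.4%} available")
```

Under these assumed numbers the automated system is down far less in total, despite failing four times as often.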
## Postmortem Culture
- Postmortems are written for **all significant incidents**, regardless of whether they paged
- Postmortems for incidents that did not trigger a page are **even more valuable**, because they likely point to monitoring gaps
- Google operates under a **blame-free postmortem culture**
- Goal: expose faults and apply engineering to fix them, rather than avoiding or minimizing them
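
The observation that unpaged incidents point to monitoring gaps can be turned into a simple check over incident records. This is a minimal sketch: the `Incident` record, its fields, and the sample data are hypothetical, not a schema from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    id: str
    significant: bool  # did it meet the team's bar for a postmortem?
    paged: bool        # did monitoring page the on-call engineer?


def needs_postmortem(incidents):
    """All significant incidents get a postmortem, paged or not."""
    return [i for i in incidents if i.significant]


def monitoring_gaps(incidents):
    """Significant incidents that never paged suggest missing alerts."""
    return [i for i in incidents if i.significant and not i.paged]


# Hypothetical sample data.
incidents = [
    Incident("INC-1", significant=True, paged=True),
    Incident("INC-2", significant=True, paged=False),
    Incident("INC-3", significant=False, paged=True),
]

print([i.id for i in needs_postmortem(incidents)])
print([i.id for i in monitoring_gaps(incidents)])
```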
## Preparation Methods
- On-call playbooks with clear troubleshooting steps
- "Wheel of Misfortune" exercises (disaster role-playing)
## Cross-Domain Connections
- [[SRE Monitoring Output Types]] — alerts trigger emergency response
- [[SRE Change Management Seventy Percent Rule]] — changes cause most emergencies
- [[Incident Response 30-Minute Check-In Rhythm]] — operational incident cadence
---
*Source: Site Reliability Engineering, Chapter 1 (Treynor Sloss, 2016)*
*Extracted: 2026-03-25*