SRE Playbook MTTR Improvement - Nestor G Pestelos Jr (ngpestelos)

Reliability is a function of MTTF (mean time to failure) and MTTR (mean time to repair). For emergency response, MTTR is the most relevant metric: how quickly can the team bring the system back to health?

## The 3x Playbook Effect

Thinking through and recording best practices ahead of time in a "playbook" produces roughly a **3x improvement in MTTR** compared to "winging it." The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better.

## Humans Add Latency

A system that can avoid emergencies requiring human intervention will have higher availability than one that requires hands-on intervention, even if the automated system experiences more actual failures. Each automated recovery is so much faster than a human response that total downtime is still lower.

## Postmortem Culture

- Written for **all significant incidents**, regardless of whether they paged
- Postmortems for incidents that didn't trigger a page are **even more valuable**, because they point to monitoring gaps
- Google operates under a **blame-free postmortem culture**
- Goal: expose faults and apply engineering to fix them, not to avoid or minimize them

## Preparation Methods

- On-call playbooks with clear troubleshooting steps
- "Wheel of Misfortune" exercises (disaster role-playing)

## Cross-Domain Connections

- [[SRE Monitoring Output Types]]: alerts trigger emergency response
- [[SRE Change Management Seventy Percent Rule]]: changes cause most emergencies
- [[Incident Response 30-Minute Check-In Rhythm]]: operational incident cadence

---

*Source: Site Reliability Engineering, Chapter 1 (Treynor Sloss, 2016)*

*Extracted: 2026-03-25*
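
## Worked Example: MTTR and Availability

The relationship between MTTF, MTTR, and availability in this note can be sketched with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function name `availability` and every hour figure are hypothetical assumptions, not values from the source. Only the roughly 3x MTTR ratio between ad hoc and playbook-driven response comes from the note.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is healthy."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration.
MTTF_MANUAL_HOURS = 720.0    # assumed: one failure roughly every 30 days
MTTR_ADHOC_HOURS = 3.0       # assumed: on-call engineer "winging it"
MTTR_PLAYBOOK_HOURS = 1.0    # assumed: about 3x faster with a playbook

adhoc = availability(MTTF_MANUAL_HOURS, MTTR_ADHOC_HOURS)
playbook = availability(MTTF_MANUAL_HOURS, MTTR_PLAYBOOK_HOURS)

# "Humans add latency": an automated system that fails three times as
# often (assumed) but recovers in about three minutes (assumed).
MTTF_AUTOMATED_HOURS = 240.0
MTTR_AUTOMATED_HOURS = 0.05

automated = availability(MTTF_AUTOMATED_HOURS, MTTR_AUTOMATED_HOURS)

for label, value in [
    ("ad hoc response", adhoc),
    ("playbook response", playbook),
    ("automated recovery", automated),
]:
    print(f"{label:>20}: {value:.5%}")
```

Under these assumed numbers, the playbook case beats the ad hoc case, and the automated system beats both despite failing more often, because its much shorter repair time dominates the ratio.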