Error Budget Model - Nestor G Pestelos Jr (ngpestelos)

100% reliability is the wrong target for basically everything. No user can tell the difference between 100% and 99.999% availability; the marginal effort to close that gap gets lost in the noise of other systems' unavailability (laptop, WiFi, ISP, power grid).

**Error budget = 1 minus the availability target.** A service with a 99.99% availability target has a 0.01% error budget. Spend it on anything, ideally on shipping features fast via phased rollouts and 1% experiments.

## Why It Resolves Dev/Ops Conflict

The structural conflict: devs want to ship fast, ops want stability. Error budgets reframe the goal from "zero outages" to "spend the budget getting maximum feature velocity." An outage is no longer a failure; it's an expected cost of innovation that both teams manage rather than fear.

If the error budget is depleted, releases stop for the remainder of the quarter. This requires management support to enforce.

## Setting the Target

Not a technical question; it's a product question:

- What availability will users be happy with, given how they use the product?
- What alternatives exist for dissatisfied users?
- What happens to usage at different availability levels?

## The Cost/Benefit Math (Ch 3)

For a $1M revenue service: 99.9% → 99.99% = 0.09 percentage points of added availability = $1M × 0.0009 = $900 incremental value. If the cost to achieve it exceeds $900, it's not worth it. Each additional nine may cost ~100x more than the previous one.

ISP noise floor: 0.01% to 1% background error rate. Service errors below this floor are imperceptible to users.

## The Control Loop (Ch 3)

See [[Error Budget Release Governor]] for the full mechanism. In brief: budget remaining → ship. Budget depleted → halt releases. Network outages and datacenter failures consume the same budget as bad pushes, so responsibility for the budget is shared.

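
The budget arithmetic and the ship/halt rule above can be sketched in a few lines of Python. This is a minimal illustration, not the book's or Google's implementation: the names `error_budget`, `incremental_value`, `release_decision`, and `BudgetStatus` are hypothetical, and it assumes availability is measured as request success rate over a quarter, with every failed request counted against the budget regardless of cause.

```python
from dataclasses import dataclass


def error_budget(availability_target: float) -> float:
    """Error budget as a fraction: 1 minus the availability target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return 1.0 - availability_target


def incremental_value(revenue: float, current: float, proposed: float) -> float:
    """Revenue attributable to raising availability from current to proposed."""
    return revenue * (proposed - current)


@dataclass
class BudgetStatus:
    allowed_failures: float    # failed requests the budget permits this quarter
    remaining_failures: float  # what is left after failures so far
    can_release: bool          # True while any budget remains


def release_decision(
    availability_target: float,
    total_requests: int,
    failed_requests: int,
) -> BudgetStatus:
    """Ship while budget remains; halt once it is depleted.

    failed_requests includes every failure in the period (bad pushes,
    network outages, datacenter failures), since all of them draw on
    the same budget.
    """
    allowed = error_budget(availability_target) * total_requests
    remaining = allowed - failed_requests
    return BudgetStatus(allowed, remaining, remaining > 0)


if __name__ == "__main__":
    # The Ch 3 example: $1M revenue, 99.9% -> 99.99%.
    print(round(incremental_value(1_000_000, 0.999, 0.9999), 2))

    # Hypothetical quarter: 1M requests at a 99.99% target.
    print(release_decision(0.9999, 1_000_000, 40).can_release)
    print(release_decision(0.9999, 1_000_000, 150).can_release)
```

Counting failures by request rather than by source keeps the rule blind to who caused the outage, which is what makes the budget a shared responsibility. Floating-point rounding means values such as the 0.01% budget come out approximate, so compare them with a tolerance rather than with `==`.
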
## Cross-Domain Connections

- [[SRE Evolution Framework]]: broader SRE context
- [[SRE Fifty Percent Ops Cap]]: companion mechanism for engineering focus
- [[SRE Monitoring Output Types]]: how to detect budget consumption
- [[Error Budget Release Governor]]: the operational control loop
- [[Risk as a Reliability Continuum]]: the framing that error budgets operationalize
- [[Nonlinear Cost of Incremental Reliability]]: why each nine costs exponentially more
- [[Request Success Rate Availability]]: how the budget is measured at Google

---

*Source: Site Reliability Engineering, Chapters 1 and 3 (Treynor Sloss / Alvidrez / Roth, 2016)*

*Extracted: 2026-03-25, updated 2026-03-26*