Testing as Reliability Quantification - Nestor G Pestelos Jr (ngpestelos)

## Core Insight

Testing quantifies confidence in system reliability. Each test that passes both before and after a change reduces uncertainty about that change's impact. More test coverage enables more changes before reliability drops below acceptable levels. The ideal is zero-MTTR bugs: bugs caught by tests before reaching production, so the push is blocked entirely.

## Zero MTTR: The Ideal

A system-level test that detects the exact problem monitoring would catch makes it possible to block the push before production. The bug still needs a source code fix, but users never see it. More zero-MTTR bugs → higher MTBF → developers encouraged to release faster (a virtuous cycle).

## Test Hierarchy

**Traditional** (offline):

- **Unit**: cheapest (milliseconds, runs on a laptop); tests separable units
- **Integration**: assembled components with dependency injection (mocks)
- **System**: smoke (critical behavior), performance (no degradation), regression (gallery of rogue bugs)

**Production** (live):

- **Configuration tests**: diff checked-in config vs running production
- **Stress tests**: find catastrophic failure limits
- **Canary tests**: not really tests; structured user acceptance with exponential rollout (0.1% → 1% → 10% → 100%)

## Bug Order (U)

- **U=1**: the request hits code that is simply broken; scales linearly with traffic (most common, easiest to catch)
- **U=2**: the request randomly damages data that future requests may see
- **U=3**: the randomly damaged data is also a valid identifier to a previous request
- Higher-order bugs are critical to catch during release, because operational workload scales superlinearly

## Where to Start

- Business-critical code (billing)
- APIs other teams integrate against
- Convert every reported bug into a regression test
- Broken build = highest priority fix (stability drives agility)

## Config File Risk

- **MTTR-critical** (changed on failure only): some uncertainty acceptable
- **Release-state** (changes frequently): dominates site reliability if its test coverage isn't better than the app's
- Break-glass mechanism: push live before tests complete, but make noise (file a bug)

## Source

- [[Site Reliability Engineering - Chapter 17 - Testing for Reliability|SRE Ch 17: Testing for Reliability]] by Alex Perry and Max Luebbe

## Related Concepts

- [[Release Engineering Four Principles]]
- [[Error Budgets as Risk Management Currency]]
- [[Simplicity as Reliability Prerequisite]]
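
## Illustrative Sketch

Two of the production tests listed under Test Hierarchy are mechanical enough to sketch in code: the configuration test (diff checked-in config against what production is running) and the exponential canary rollout (0.1% → 1% → 10% → 100%). The Python below is a minimal sketch under stated assumptions, not the method from the SRE chapter. The function names `config_drift` and `canary_stages`, the `ABSENT` marker, and the representation of configs as flat dictionaries with string keys are all hypothetical choices made for illustration.

```python
# Hypothetical helpers illustrating two production tests from this note.
# Assumption: configs are already loaded into flat dicts with string keys.

ABSENT = "<absent>"  # marker for a key that exists on only one side


def config_drift(checked_in: dict, running: dict) -> dict:
    """Return {key: (checked_in_value, running_value)} for every key that differs."""
    drift = {}
    for key in sorted(checked_in.keys() | running.keys()):
        expected = checked_in.get(key, ABSENT)
        actual = running.get(key, ABSENT)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


def canary_stages(start_pct: float = 0.1, factor: float = 10.0,
                  full_pct: float = 100.0) -> list:
    """Return the rollout percentages, growing by `factor` until full rollout."""
    if start_pct <= 0 or factor <= 1:
        raise ValueError("start_pct must be positive and factor must exceed 1")
    stages = []
    pct = start_pct
    while pct < full_pct:
        stages.append(round(pct, 6))  # round away floating-point noise
        pct *= factor
    stages.append(full_pct)  # always finish at the full rollout
    return stages
```

A configuration test would fail whenever `config_drift(...)` returns a non-empty dictionary. With the default arguments, `canary_stages()` reproduces the 0.1, 1, 10, 100 schedule from the note; how long to hold at each stage is left out because the note does not specify it.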