## Core Insight
Testing quantifies confidence in system reliability. Each test that passes both before and after a change reduces uncertainty about that change's impact. The more thorough the test coverage, the more changes can ship before reliability drops below an acceptable level. The ideal is a zero-MTTR (mean time to repair) bug: one caught by tests before it reaches production, so the push is blocked entirely.
## Zero MTTR: The Ideal
When a system-level test detects the same problem that monitoring would have caught, the push can be blocked before the bug reaches production. The bug still needs a fix in the source code, but users never see it. More zero-MTTR bugs → higher MTBF (mean time between failures) → developers encouraged to release faster (a virtuous cycle).
## Test Hierarchy
**Traditional** (offline):
- **Unit** — cheapest (ms, laptop), test separable units
- **Integration** — assembled components with dependency injection (mocks)
- **System** — smoke (critical behavior), performance (no degradation), regression (gallery of rogue bugs)
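
To make the integration bullet concrete, here is a minimal sketch of dependency injection with a mock. `InvoiceService`, its `gateway` dependency, and the `charge` method are hypothetical names invented for illustration; only `unittest.mock` is a real library.

```python
from unittest import mock


class InvoiceService:
    """Hypothetical component that depends on an injected payment gateway."""

    def __init__(self, gateway):
        # The dependency is passed in rather than constructed here, so a
        # test can substitute a mock for the real gateway.
        self.gateway = gateway

    def bill(self, customer_id: str, amount_cents: int) -> bool:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        return self.gateway.charge(customer_id, amount_cents)


def test_bill_charges_gateway_once() -> None:
    # The mock stands in for the real (assumed) payment backend.
    gateway = mock.Mock()
    gateway.charge.return_value = True

    service = InvoiceService(gateway)

    assert service.bill("cust-42", 1999) is True
    gateway.charge.assert_called_once_with("cust-42", 1999)
```

Because the gateway is injected, the same test runs offline with no real backend, which is what keeps it in the "traditional" tier.
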
**Production** (live):
- **Configuration tests** — diff checked-in config vs running production
- **Stress tests** — find catastrophic failure limits
- **Canary tests** — not really tests; structured user acceptance with exponential rollout (0.1% → 1% → 10% → 100%)
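
The exponential rollout in the canary bullet can be sketched as a loop over traffic fractions. This is an illustration under assumptions: `run_canary` and the `is_healthy` callback are hypothetical, with the callback standing in for whatever monitoring signal decides whether a stage is acceptable. A real rollout would also wait between stages.

```python
from typing import Callable, Sequence

# Traffic fractions from the text: 0.1% -> 1% -> 10% -> 100%.
STAGES: Sequence[float] = (0.001, 0.01, 0.1, 1.0)


def run_canary(is_healthy: Callable[[float], bool],
               stages: Sequence[float] = STAGES) -> float:
    """Advance through the stages, stopping at the first unhealthy one.

    Returns the last traffic fraction that was healthy; 0.0 means the
    release failed at the very first stage.
    """
    last_good = 0.0
    for fraction in stages:
        if not is_healthy(fraction):
            break  # Halt the rollout; later stages are never exposed.
        last_good = fraction
    return last_good
```

For example, `run_canary(lambda f: f < 0.1)` stops when the 10% stage reports unhealthy and returns `0.01`, so at most 10% of traffic ever saw the fault.
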
## Bug Order (U)
- **U=1**: the request hits code that is simply broken; impact scales linearly with traffic (most common, easiest to catch)
- **U=2**: the request randomly damages data that a future request may see
- **U=3**: the randomly damaged data is also a valid identifier to a previous request
- Higher-order bugs are critical to catch during release: the operational workload they cause grows faster than traffic does
## Where to Start
- Business-critical code (billing)
- APIs other teams integrate against
- Convert every reported bug into a regression test
- Broken build = highest priority fix (stability drives agility)
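
As an illustration of turning a reported bug into a regression test, the sketch below assumes a hypothetical billing helper `to_cents` and a hypothetical bug report in which `"19.99"` was converted to the wrong number of cents. Neither comes from the source chapter.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(amount: str) -> int:
    """Convert a decimal currency string such as "19.99" to integer cents.

    Hypothetical fix: an earlier version used int(float(amount) * 100),
    which can truncate downward because of binary floating-point rounding.
    """
    cents = (Decimal(amount) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


def test_regression_to_cents_is_exact() -> None:
    # Pins the input from the (hypothetical) bug report so the fault
    # cannot return unnoticed.
    assert to_cents("19.99") == 1999
    assert to_cents("0.07") == 7
```

The test keeps the exact input from the report, so it joins the regression gallery and fails if the old behavior ever comes back.
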
## Config File Risk
- **MTTR-critical** (changed on failure only): some uncertainty acceptable
- **Release-state** (changes more often than the application is released): dominates site reliability negatively unless its test coverage is better than the app's
- Break-glass mechanism: push live before tests complete, but make noise (file a bug)
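
The break-glass idea can be sketched as a push gate with an emergency path. `ConfigPusher`, `run_tests`, and the in-memory `filed_bugs` list are all hypothetical stand-ins; a real system would call its own test runner and bug tracker.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConfigPusher:
    """Hypothetical push gate for a configuration file."""

    run_tests: Callable[[str], bool]
    filed_bugs: List[str] = field(default_factory=list)
    live_config: str = ""

    def push(self, config: str, break_glass: bool = False) -> bool:
        if break_glass:
            # Emergency path: go live before tests complete, but make noise.
            self.filed_bugs.append(
                "break-glass push: config went live untested; follow up"
            )
            self.live_config = config
            return True
        if not self.run_tests(config):
            return False  # Blocked: the bad config never reaches production.
        self.live_config = config
        return True
```

A normal push is gated on the tests, while `push(config, break_glass=True)` always goes live and always leaves a bug behind, so the shortcut stays visible.
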
## Source
- [[Site Reliability Engineering - Chapter 17 - Testing for Reliability|SRE Ch 17: Testing for Reliability]] by Alex Perry and Max Luebbe
## Related Concepts
- [[Release Engineering Four Principles]]
- [[Error Budgets as Risk Management Currency]]
- [[Simplicity as Reliability Prerequisite]]