Golden-Signal Capacity Evaluation - Nestor G Pestelos Jr (ngpestelos)

# Golden-Signal Capacity Evaluation

**Parent Topic**: [[Software/README]]

A **service-level** capacity frame built on the four golden signals ([[Four Golden Signals of Monitoring]]: latency, traffic, errors, saturation). It sits *above* the resource-class [[Metric Evaluation Runbook]]: that runbook answers "which resource saturates first"; this frame answers "how much more traffic can the **service** take before any SLO breaks, and which signal breaks first." Saturation is the seam where the two meet.

## 1. The reframe: 1 driver + 3 ceilings

The four signals are not four co-equal capacity inputs. For capacity planning they split by role:

| Signal | Role | Why |
|---|---|---|
| **Traffic** | **load axis** (independent variable) | the demand you forecast and plan headroom *against*; rarely the binding ceiling for compute-bound services (but PPS / connection-rate / conntrack limits are traffic-shaped saturation, so carry those under Saturation) |
| **Latency** | user-facing ceiling | breaches before the system looks full; SRE calls latency increases *often* a **leading indicator of saturation** |
| **Errors** | user-facing ceiling | correctness ceiling; at capacity it tends to be a **cliff, not a ramp** |
| **Saturation** | system ceiling | "how full" the binding resource is; expands into the [[Metric Evaluation Runbook]] |

Capacity is the **lowest traffic at which any of the three ceilings breaks**. This is the binding-constraint idea from [[Find Each Component's Red-Line Number]], lifted from per-component resources to the **minimum-viable** signal set. That gives broader coverage than a CPU-only metric ([[CPU Over QPS for Capacity Modeling]]), but it is *not* a completeness proof. SRE's claim is "if you can only measure four, measure these," not "these four are exhaustive": a ceiling can still hide off-axis (downstream dependency / quota exhaustion, data-correctness drift) and read as false headroom. **Golden-Signal Headroom (GSH, defined in §2) is exactly as complete as your three red-lines, no more.**

## 2. The metric: Golden-Signal Headroom (GSH)

> **GSH = (λ\* − λ_now) / λ_now**, where **λ\* = min( λ\*_latency, λ\*_errors, λ\*_saturation )**

- **λ\*ₛ** = the traffic level at which signal *s* crosses its red-line.
- **λ\*** = the lowest of the three: the service capacity ceiling.
- The output is **two things**: a headroom number (e.g. `0.6` = 60% more traffic) **and the binding signal label** (which ceiling you hit first). The label is the actionable half: it names what to fix to buy more capacity.
- Reporting as a **multiple** is `λ*/λ_now = 1 + GSH`; a 1.6× ceiling is **60%** headroom, not 160%. Keep the unit straight.

**Preconditions.** GSH is well-defined only inside these; state them next to the number:

1. **A breach must occur in the tested range.** If no ceiling breaks, each λ\*ₛ is undefined and GSH is a **censored lower bound** (`≥ (λ_max − λ_now)/λ_now`, where λ_max is the highest traffic level tested), not the headroom. Report it as a bound; don't pass it off as the ceiling.
2. **Negative GSH = already over capacity.** If a signal is over its red-line at λ_now, then λ\*ₛ < λ_now and GSH < 0 → stop forecasting, the service is already breaching. Don't let positive headroom on another signal mask it.
3. **Curves must be monotone in the operating band.** Latency/errors can dip then re-climb (JIT warmup, GC, autoscale kicking in). Take the **last stable crossing**, not the literal first, or a warmup transient understates capacity.
4. **λ is volume at a fixed request mix.** λ\*_latency under a read-heavy mix ≠ write-heavy; a single λ\* is undefined unless mix is held constant. Re-run GSH per representative mix.

GSH is a derived construct (synthesized here); the four signals it rests on are the SRE definition. It inherits the runbook's discipline (red-line-as-comparison, statistic-by-binding-type; see [[Apply a Safety Factor Above the Ceiling]]) at the service tier.

## 3. Per-signal evaluation table

Same shape as the [[Metric Evaluation Runbook]] §3 (metric · statistic · red-line) but bucketed by signal.
Thresholds are starting points; tune per environment.

| Signal | Metric(s) | Statistic | Red-line (tune) |
|---|---|---|---|
| **Traffic** (driver) | req/s, QPS, concurrency | p95 peak + trend | none; forecast it, it's the x-axis ([[Forecast Peak-Driven Resources by Trending Peaks]]) |
| **Latency** | p99 response time (success vs error latency separated) | p99 | > SLO target |
| **Errors** | error rate / error-budget burn | sum or avg over window | > error budget ([[SLA Nines Translate to Downtime Budgets]]) |
| **Saturation** | **→ [[Metric Evaluation Runbook]] §1–3** (per resource class) | per resource class | per-resource red-line |

The Latency + Errors rows at a single λ are exactly [[Goodput as Capacity Truth-Teller]]; GSH puts traffic on the x-axis and reads both ceilings off the same load curve.

## 4. Relationship to the Metric Evaluation Runbook

The runbook is **the expansion of one cell (Saturation)**, not a competitor:

| | Metric Evaluation Runbook | Golden-Signal Capacity Evaluation |
|---|---|---|
| Organized by | resource class (compute / storage / cache) | service signal (the four) |
| Direction | bottom-up: what breaks inside the box | top-down: what the SLO/user feels |
| Center of gravity | saturation (credits, I/O wait, memory) | the full driver + 3-ceiling set |

Mapping: the runbook's **user-facing row** *is* the Latency + Errors ceilings; **every other row** (compute / storage / cache / network) *is* Saturation. When the **Saturation** signal binds, drop into the runbook to find *which resource* and at what red-line. Don't restate its tables here.

**λ\*_saturation is itself a nested `min`** over resource classes: `min(CPU, memory, I/O wait, credits, cache, conntrack)`. The runbook hand-off resolves *which* resource binds, but you still drive load against all of them simultaneously to read that single number; the clean "one λ\* per signal" hides a second minimization.

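
The GSH definition from §2 and the nested minimum described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `golden_signal_headroom`, the `GshReport` type, the `None`-means-no-breach convention, and every traffic number in the usage example are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GshReport:
    gsh: float              # headroom as a fraction of current traffic
    binding: Optional[str]  # ceiling hit first; None when no breach was observed
    censored: bool          # True: gsh is only a lower bound (precondition 1)

    @property
    def over_capacity(self) -> bool:
        # Precondition 2: a negative, uncensored GSH means already breaching.
        return not self.censored and self.gsh < 0


def golden_signal_headroom(
    lam_now: float,
    lam_max_tested: float,
    lam_latency: Optional[float],
    lam_errors: Optional[float],
    lam_by_resource: Mapping[str, Optional[float]],
) -> GshReport:
    """Pass None for any ceiling that did not break in the tested range."""
    if lam_now <= 0:
        raise ValueError("lam_now must be positive")

    ceilings = {}
    if lam_latency is not None:
        ceilings["latency"] = lam_latency
    if lam_errors is not None:
        ceilings["errors"] = lam_errors

    # Inner min: the saturation ceiling is the lowest per-resource ceiling.
    breached = {name: lam for name, lam in lam_by_resource.items() if lam is not None}
    if breached:
        resource = min(breached, key=breached.get)
        ceilings[f"saturation ({resource})"] = breached[resource]

    if not ceilings:
        # Nothing broke: report the censored lower bound, not a ceiling.
        return GshReport((lam_max_tested - lam_now) / lam_now, None, True)

    # Outer min: the service ceiling and the label of the signal that binds.
    binding = min(ceilings, key=ceilings.get)
    return GshReport((ceilings[binding] - lam_now) / lam_now, binding, False)


# Hypothetical load-test results, all in req/s at one fixed request mix.
report = golden_signal_headroom(
    lam_now=1000,
    lam_max_tested=2500,
    lam_latency=1600,
    lam_errors=2100,
    lam_by_resource={"cpu": 1900, "conntrack": 1750, "memory": None},
)
print(report.binding, round(report.gsh, 2))
```

With these made-up inputs the saturation ceiling resolves to conntrack at 1750 req/s, but latency breaks earlier at 1600 req/s, so the report carries a headroom of 0.6 with `latency` as the binding label. Keeping the resource name inside the saturation label preserves the result of the inner minimization for the runbook hand-off.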

**Prior art.** This two-tier split mirrors **RED** (Rate / Errors / Duration; service tier; Tom Wilkie) handing off to **USE** (Utilization / Saturation / Errors; resource tier; Brendan Gregg). GSH's driver+latency+errors is RED; the [[Metric Evaluation Runbook]] is the USE expansion. Borrow USE's precision while you're there: **saturation = queue length / wait time, not just utilization**, which is a sharper red-line for λ\*_saturation than "% busy."

## 5. Two traps to bake in

- **"Saturation binds first" means different things by resource type.** For **degradation-type** resources, latency leads saturation ([[Define Ceilings by User-Facing Time Not System Metrics]]), so saturation binding first = latency/error SLOs too loose → tighten them. But for **cliff-type** resources (disk fill, credit depletion, OOM), saturation *should* bind first: the user-facing signal gives **no** leading warning right up to the cliff ([[Burstable CPU Utilization Masks Saturation]], [[Database Ceilings Have Hidden Cliffs]]). Don't "tighten SLOs" against a cliff; saturation is the only early ceiling there.
- **Neither curve extrapolates linearly near the ceiling.** Both latency and errors go non-linear approaching saturation (queueing theory: latency → ∞ as utilization → 1). Errors *additionally* jump discontinuously when load-shedding or a circuit-breaker trips ([[Database Ceilings Have Hidden Cliffs]]). Find both by test; don't trend into the saturation zone.

## 6. The run loop

1. **Pick the SLOs**: fix the latency and error red-lines *before* pulling data (the runbook's "methodology before data" rule).
2. **Drive traffic**: load-test, or observe, each ceiling signal as a function of λ.
3. **Find each λ\*ₛ**: the traffic where each ceiling breaks (last stable crossing, per §2 precondition 3).
4. **Take the min**: λ\* and its binding-signal label.
5. **Report GSH + label**: headroom against *forecast* traffic, not just current ([[Maintain Capacity Headroom]]); re-confirm the binding signal after any change ([[The Defining Metric Itself Can Change]]).

The binding signal tells you where to spend: Latency/Errors → architecture or SLO; Saturation → into the runbook for the resource fix.

---

*Source: synthesized from [[Site Reliability Engineering]] (Beyer, Jones, Petoff & Murphy, O'Reilly 2016), Ch 6 "Monitoring Distributed Systems" via [[Four Golden Signals of Monitoring]], and the Capacity Planning evaluation corpus ([[Metric Evaluation Runbook]], [[Goodput as Capacity Truth-Teller]], [[Find Each Component's Red-Line Number]]). GSH is a derived construct synthesized here.*
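
Step 3 of the run loop (finding each λ\*ₛ as the last stable crossing) can be sketched as below. This is one possible reading, stated as an assumption: "last stable crossing" is taken to mean the crossing after which every higher tested traffic level stays above the red-line, and the crossing point is estimated by linear interpolation between the two bracketing samples. The function name `last_stable_crossing` and the sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def last_stable_crossing(
    samples: Sequence[Tuple[float, float]],
    red_line: float,
) -> Optional[float]:
    """Estimate the traffic level where a signal crosses its red-line for good.

    samples: (traffic, signal_value) pairs from a load sweep, in any order.
    Returns None when the signal is not above the red-line at the highest
    tested traffic, i.e. no lasting breach was observed (censored case).
    """
    pts = sorted(samples)
    if not pts or pts[-1][1] <= red_line:
        return None

    # Walk down from the highest traffic while the signal is still breaching.
    i = len(pts) - 1
    while i > 0 and pts[i - 1][1] > red_line:
        i -= 1

    if i == 0:
        # Breaching across the whole sweep: the ceiling is at or below
        # the lowest tested traffic level.
        return pts[0][0]

    # Assumption: interpolate linearly between the last sample at or below
    # the red-line and the first sample of the final breaching run.
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return x0 + (red_line - y0) * (x1 - x0) / (y1 - y0)


# Hypothetical p99 latency sweep (req/s, ms). The first point is a warmup
# transient that sits above the 300 ms red-line before the curve settles.
sweep = [
    (200, 340), (400, 180), (600, 190), (800, 220),
    (1000, 260), (1200, 320), (1400, 410),
]
lam_latency = last_stable_crossing(sweep, red_line=300)
print(lam_latency)
```

A literal first-crossing search would stop at the 200 req/s warmup point and understate capacity; walking down from the top of the sweep skips that transient and places the crossing between the 1000 and 1200 req/s samples. Linear interpolation is only a convenience here, since the curves are non-linear near the ceiling; a finer sweep around the crossing is the safer way to tighten the estimate.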