# Golden-Signal Capacity Evaluation
**Parent Topic**: [[Software/README]]
A **service-level** capacity frame built on the four golden signals ([[Four Golden Signals of Monitoring]] — latency, traffic, errors, saturation). It sits *above* the resource-class [[Metric Evaluation Runbook]]: that runbook answers "which resource saturates first"; this frame answers "how much more traffic can the **service** take before any SLO breaks, and which signal breaks first." Saturation is the seam where the two meet.
## 1. The reframe — 1 driver + 3 ceilings
The four signals are not four co-equal capacity inputs. For capacity planning they split by role:
| Signal | Role | Why |
|---|---|---|
| **Traffic** | **load axis** (independent variable) | the demand you forecast and plan headroom *against* — rarely the binding ceiling for compute-bound services (but PPS / connection-rate / conntrack limits are traffic-shaped saturation — carry those under Saturation) |
| **Latency** | user-facing ceiling | breaches before the system looks full; the SRE book notes that latency increases are *often* a **leading indicator of saturation** |
| **Errors** | user-facing ceiling | correctness ceiling; at capacity it tends to be a **cliff, not a ramp** |
| **Saturation** | system ceiling | "how full" the binding resource is — expands into the [[Metric Evaluation Runbook]] |
Capacity is the **lowest traffic at which any of the three ceilings breaks**. This is the binding-constraint idea from [[Find Each Component's Red-Line Number]], lifted from per-component resources to the **minimum-viable** signal set — broader coverage than a CPU-only metric ([[CPU Over QPS for Capacity Modeling]]), but *not* a completeness proof. SRE's claim is "if you can only measure four, measure these," not "these four are exhaustive": a ceiling can still hide off-axis (downstream dependency / quota exhaustion, data-correctness drift) and read as false headroom. **GSH is exactly as complete as your three red-lines — no more.**
## 2. The metric — Golden-Signal Headroom (GSH)
> **GSH = (λ\* − λ_now) / λ_now**, where **λ\* = min( λ\*_latency, λ\*_errors, λ\*_saturation )**
- **λ\*ₛ** = the traffic level at which signal *s* crosses its red-line.
- **λ\*** = the lowest of the three — the service capacity ceiling.
- The output is **two things**: a headroom number (e.g. `0.6` = 60% more traffic) **and the binding signal label** (which ceiling you hit first). The label is the actionable half — it names what to fix to buy more capacity.
- Reported as a **multiple**, the ceiling is `λ*/λ_now = 1 + GSH`: a 1.6× ceiling is **60%** headroom, not 160%. Keep the unit straight.
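The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `golden_signal_headroom` and the ceiling values are hypothetical, and it assumes each λ\*ₛ has already been measured.

```python
from typing import Dict, Tuple


def golden_signal_headroom(lam_now: float, ceilings: Dict[str, float]) -> Tuple[float, str]:
    """Return (GSH, binding signal) given per-signal ceilings λ*_s.

    ceilings maps a signal name to the traffic level (same unit as
    lam_now, e.g. req/s) at which that signal crosses its red-line.
    """
    if lam_now <= 0:
        raise ValueError("lam_now must be positive")
    if not ceilings:
        raise ValueError("need at least one ceiling")
    binding = min(ceilings, key=ceilings.get)  # signal with the lowest λ*_s
    lam_star = ceilings[binding]               # λ* = min over the ceilings
    gsh = (lam_star - lam_now) / lam_now
    return gsh, binding


# Hypothetical ceilings read off a load test, in req/s.
ceilings = {"latency": 1600.0, "errors": 2100.0, "saturation": 1900.0}
gsh, binding = golden_signal_headroom(lam_now=1000.0, ceilings=ceilings)
print(f"GSH = {gsh:.0%}, multiple = {1 + gsh:.1f}x, binding = {binding}")
# prints: GSH = 60%, multiple = 1.6x, binding = latency
```

Returning the label alongside the number keeps the two halves of the output together, so a report cannot quote headroom without naming the ceiling it came from.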
**Preconditions** — GSH is well-defined only inside these; state them next to the number:
1. **A breach must occur in the tested range.** If no ceiling breaks, each λ\*ₛ is undefined and GSH is a **censored lower bound** (`≥ (λ_max − λ_now)/λ_now`, where λ_max is the highest traffic level tested), not the headroom. Report it as a bound; don't pass it off as the ceiling.
2. **Negative GSH = already over capacity.** If a signal is over its red-line at λ_now, then λ\*ₛ < λ_now and GSH < 0 → stop forecasting, the service is already breaching. Don't let positive headroom on another signal mask it: the `min` must include the breaching signal.
3. **Curves must be monotone in the operating band.** Latency/errors can dip then re-climb (JIT warmup, GC, autoscale kicking in). Take the **last stable crossing**, not the literal first, or a warmup transient understates capacity.
4. **λ is volume at a fixed request mix.** λ\*_latency under a read-heavy mix ≠ write-heavy; a single λ\* is undefined unless mix is held constant. Re-run GSH per representative mix.
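One way to make the preconditions mechanical is sketched below. It is an illustration under assumptions: the helper names (`last_stable_crossing`, `gsh_report`), the sample curve, and the choice to return a tested load level without interpolating are all hypothetical. "Last stable crossing" is interpreted here as the lowest tested λ from which the signal stays over its red-line through the top of the tested range.

```python
from typing import Dict, Optional, Sequence


def last_stable_crossing(lams: Sequence[float], values: Sequence[float],
                         red_line: float) -> Optional[float]:
    """Lowest tested λ from which the signal stays over red_line up to λ_max.

    Returns None when the signal is within its red-line at the highest
    tested λ, i.e. no sustained breach was observed (the censored case).
    No interpolation: the result is always one of the tested load levels.
    """
    crossing = None
    # Walk down from the highest load while the signal is still breaching.
    for lam, value in sorted(zip(lams, values), reverse=True):
        if value <= red_line:
            break
        crossing = lam
    return crossing


def gsh_report(lam_now: float, lam_max: float,
               crossings: Dict[str, Optional[float]]) -> dict:
    """Classify the result as headroom, over_capacity, or lower_bound."""
    observed = {s: c for s, c in crossings.items() if c is not None}
    if not observed:
        # Precondition 1: nothing broke, so only a bound can be reported.
        return {"kind": "lower_bound", "gsh": (lam_max - lam_now) / lam_now,
                "binding": None}
    binding = min(observed, key=observed.get)
    gsh = (observed[binding] - lam_now) / lam_now
    # Precondition 2: a negative value means the service is already breaching.
    kind = "over_capacity" if gsh < 0 else "headroom"
    return {"kind": kind, "gsh": gsh, "binding": binding}


# Hypothetical load-test curve at one fixed request mix (precondition 4).
lams = [200, 400, 600, 800, 1000, 1200]           # req/s
p99_ms = [340, 180, 190, 240, 320, 450]           # warmup transient at 200
err_rate = [0.0, 0.0, 0.001, 0.001, 0.002, 0.004]

crossings = {
    "latency": last_stable_crossing(lams, p99_ms, red_line=300),
    "errors": last_stable_crossing(lams, err_rate, red_line=0.01),
}
print(gsh_report(lam_now=500, lam_max=max(lams), crossings=crossings))
```

With this sample data the warmup spike at 200 req/s is ignored, latency's sustained breach starts at 1000 req/s, errors never breach, and the report is `headroom` of 1.0 with `latency` binding. A literal first-crossing rule would have put the ceiling at 200 req/s.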
GSH is a derived construct (synthesized here); the four signals it rests on are the SRE definition. It inherits the runbook's discipline — red-line-as-comparison, statistic-by-binding-type ([[Apply a Safety Factor Above the Ceiling]]) — at the service tier.
## 3. Per-signal evaluation table
Same shape as the [[Metric Evaluation Runbook]] §3 (metric · statistic · red-line) but bucketed by signal. Thresholds are starting points — tune per environment.
| Signal | Metric(s) | Statistic | Red-line (tune) |
|---|---|---|---|
| **Traffic** (driver) | req/s, QPS, concurrency | p95 peak + trend | none — forecast it; it's the x-axis ([[Forecast Peak-Driven Resources by Trending Peaks]]) |
| **Latency** | p99 response time (success vs error latency separated) | p99 | > SLO target |
| **Errors** | error rate / error-budget burn | sum or avg over window | > error budget ([[SLA Nines Translate to Downtime Budgets]]) |
| **Saturation** | **→ [[Metric Evaluation Runbook]] §1–3** (per resource class) | per resource class | per-resource red-line |
The Latency + Errors rows at a single λ are exactly [[Goodput as Capacity Truth-Teller]] — GSH puts traffic on the x-axis and reads both ceilings off the same load curve.
## 4. Relationship to the Metric Evaluation Runbook
The runbook is **the expansion of one cell (Saturation)**, not a competitor:
| | Metric Evaluation Runbook | Golden-Signal Capacity Evaluation |
|---|---|---|
| Organized by | resource class (compute / storage / cache / network) | service signal (the four) |
| Direction | bottom-up — what breaks inside the box | top-down — what the SLO/user feels |
| Center of gravity | saturation (credits, I/O wait, memory) | the full driver + 3-ceiling set |
Mapping: the runbook's **user-facing row** *is* the Latency + Errors ceilings; **every other row** (compute / storage / cache / network) *is* Saturation. When the **Saturation** signal binds, drop into the runbook to find *which resource* and at what red-line. Don't restate its tables here.
**λ\*_saturation is itself a nested `min`** over resource classes: `min(CPU, memory, I/O wait, credits, cache, conntrack)`. The runbook hand-off resolves *which* resource binds, but you still drive load against all of them simultaneously to read that single number; the clean "one λ\* per signal" hides a second minimization.
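The nested minimization can be made explicit so the binding label carries the resource name when saturation wins. The sketch below is illustrative only; `binding_label`, the resource keys, and every number are hypothetical.

```python
from typing import Dict, Tuple


def binding_label(signal_ceilings: Dict[str, float],
                  resource_ceilings: Dict[str, float]) -> Tuple[float, str]:
    """Resolve λ* and a label that names the resource when saturation binds."""
    # Inner min: which resource class saturates first.
    resource = min(resource_ceilings, key=resource_ceilings.get)
    ceilings = dict(signal_ceilings, saturation=resource_ceilings[resource])
    # Outer min: which signal breaks first.
    signal = min(ceilings, key=ceilings.get)
    label = f"saturation/{resource}" if signal == "saturation" else signal
    return ceilings[signal], label


# Hypothetical per-resource ceilings (req/s) from the same load test.
resources = {"cpu": 2400.0, "memory": 3100.0, "io_wait": 1900.0, "conntrack": 2750.0}
signals = {"latency": 2200.0, "errors": 2600.0}

lam_star, label = binding_label(signals, resources)
print(lam_star, label)  # 1900.0 saturation/io_wait
```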
**Prior art.** This two-tier split mirrors **RED** (Rate / Errors / Duration — service tier; Tom Wilkie) handing off to **USE** (Utilization / Saturation / Errors — resource tier; Brendan Gregg). GSH's driver+latency+errors is RED; the [[Metric Evaluation Runbook]] is the USE expansion. Borrow USE's precision while you're there: **saturation = queue length / wait time, not just utilization** — a sharper red-line for λ\*_saturation than "% busy."
## 5. Two traps to bake in
- **"Saturation binds first" means different things by resource type.** For **degradation-type** resources, latency leads saturation ([[Define Ceilings by User-Facing Time Not System Metrics]]), so saturation binding first = latency/error SLOs too loose → tighten them. But for **cliff-type** resources (disk fill, credit depletion, OOM), saturation *should* bind first — the user-facing signal gives **no** leading warning right up to the cliff ([[Burstable CPU Utilization Masks Saturation]], [[Database Ceilings Have Hidden Cliffs]]). Don't "tighten SLOs" against a cliff; saturation is the only early ceiling there.
- **Neither curve extrapolates linearly near the ceiling.** Both latency and errors go non-linear approaching saturation (queueing theory: latency → ∞ as utilization → 1). Errors *additionally* jump discontinuously when load-shedding or a circuit-breaker trips ([[Database Ceilings Have Hidden Cliffs]]). Find both by test — don't trend into the saturation zone.
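To see how badly a linear trend can miss near the ceiling, here is a small numeric sketch. It assumes an M/M/1 queue, where mean response time is `S / (1 − ρ)` for service time `S` and utilization `ρ`. That model is chosen purely for illustration; real services rarely match it exactly, and the 10 ms service time is made up.

```python
def mm1_response_time(service_time: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


S = 0.010  # hypothetical mean service time: 10 ms

# Two observations taken well below the ceiling.
rho_a, rho_b = 0.5, 0.8
w_a = mm1_response_time(S, rho_a)
w_b = mm1_response_time(S, rho_b)

# Straight-line trend through those two points, extended to 95% utilization.
target = 0.95
slope = (w_b - w_a) / (rho_b - rho_a)
linear_guess = w_b + slope * (target - rho_b)
actual = mm1_response_time(S, target)

print(f"linear extrapolation: {linear_guess * 1000:.0f} ms")
print(f"M/M/1 model:          {actual * 1000:.0f} ms")
```

Under these assumptions the trend line predicts about 65 ms at 95% utilization while the model gives about 200 ms, roughly a threefold underestimate. The gap widens as utilization approaches 1.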
## 6. The run loop
1. **Pick the SLOs** — fix the latency and error red-lines *before* pulling data (the runbook's "methodology before data" rule).
2. **Drive traffic** — load-test or observe each ceiling signal as a function of λ.
3. **Find each λ\*ₛ** — the traffic where each ceiling breaks (last stable crossing, per §2 precondition 3).
4. **Take the min** — λ\* and its binding-signal label.
5. **Report GSH + label** — headroom against *forecast* traffic, not just current ([[Maintain Capacity Headroom]]); re-confirm the binding signal after any change ([[The Defining Metric Itself Can Change]]).
The binding signal tells you where to spend: Latency/Errors → architecture or SLO; Saturation → into the runbook for the resource fix.
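Step 5 can be illustrated with a short calculation. It assumes "headroom against forecast" means substituting the forecast traffic level for λ_now in the GSH formula; the function name `headroom` and all numbers are hypothetical.

```python
def headroom(lam_star: float, lam_ref: float) -> float:
    """GSH measured against a reference traffic level (current or forecast)."""
    if lam_ref <= 0:
        raise ValueError("lam_ref must be positive")
    return (lam_star - lam_ref) / lam_ref


lam_star = 1600.0      # service ceiling from step 4, req/s
lam_now = 1000.0       # current peak traffic
lam_forecast = 1400.0  # hypothetical forecast peak for the planning horizon

print(f"vs current:  {headroom(lam_star, lam_now):.0%}")       # 60%
print(f"vs forecast: {headroom(lam_star, lam_forecast):.0%}")  # 14%
```

The same ceiling that looks comfortable against today's traffic leaves much less room against the forecast peak, which is why the run loop asks for the forecast figure.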
---
*Source: synthesized from [[Site Reliability Engineering]] (Beyer, Jones, Petoff & Murphy, O'Reilly 2016) — Ch 6 "Monitoring Distributed Systems" via [[Four Golden Signals of Monitoring]] — and the Capacity Planning evaluation corpus ([[Metric Evaluation Runbook]], [[Goodput as Capacity Truth-Teller]], [[Find Each Component's Red-Line Number]]). GSH is a derived construct synthesized here.*