# Metric Evaluation Runbook

**Parent Topic**: [[Software/README]]

A repeatable procedure for evaluating capacity/performance metrics so the verdict is **consistent regardless of who runs it**, not left to chance. The cure for inconsistency is to fix four variables *before* looking at any graph: **which metric, which statistic, which sampling interval, which threshold.** Everything below makes those four explicit. Thresholds are starting points to calibrate per environment.

## 1. Which metrics matter, by resource class

Governing rule: **measure the resource that saturates first, in the unit the user actually feels**; don't default to CPU utilization everywhere ([[Find Each Component's Red-Line Number]], [[Eliminate Healthy Resources to Find the Binding One]]).

| Resource class | Primary metric | Why |
|---|---|---|
| Fixed-performance compute | CPU utilization; memory | CPU is an honest saturation signal here |
| Burstable compute | credit balance + burn ratio (**not** CPU alone) | a throttled instance reads "healthy" at flat baseline CPU ([[Burstable CPU Utilization Masks Saturation]]) |
| Storage | consumption + disk **I/O wait** | I/O wait, not utilization, predicts the cliff ([[Disk IO Wait Predicts DB Lag Not Utilization]]) |
| Cache | hit ratio, LRU reference age, time-to-serve | request rate alone misleads once the working set overflows |
| User-facing | latency / time-to-serve | the real ceiling: system metrics can look fine while latency breaches ([[Define Ceilings by User-Facing Time Not System Metrics]]) |

State a **red-line ceiling** per metric (a margin below the failure point; see [[Apply a Safety Factor Above the Ceiling]]) so evaluation is a comparison, not a vibe.

## 2. Which statistic: average, max, or percentile

The wrong statistic is the most common way two evaluations silently diverge.

- **Peak-driven resources → p95/p99.** Average hides the peak you fail at; raw **Max** over-reacts to a single outlier.
  A percentile is "the peak without the one-off spike."
- **Accrual/budget resources (burstable credits) → average over a window.** Credits accrue against the baseline *earn rate*, so average-vs-baseline (or burn ratio) is the true signal, not the instantaneous peak.
- **Max** is a tripwire only ("did it ever touch the ceiling"), never the headline number.
- **Sum** applies to credit-style metrics when the evaluation period is longer than their native publish interval.

Rule of thumb: *peak-driven → percentile; accrual-driven → average; max only as a tripwire.*

## 3. The evaluation table (generic defaults; tune thresholds)

Columns: **metric · statistic · coarse interval · red-line (tune) · drill interval · retention floor**

**Fixed-performance compute**

```
CPU utilization | p95 (p99 tripwire) | 1h | p95>70% upgrade; p99~100% now | 1min | coarsens after 15d
Memory          | p95                | 1h | >75–80% (net of page cache)   | 1min | coarsens after 15d
```

**Burstable compute**

```
Credit balance          | Min + Avg  | 5min | trending to 0 / pinned low                    | native | n/a
Surplus credits charged | Sum        | 5min | >0 sustained (unlimited)                      | native | n/a
Credit burn ratio*      | Avg/window | 5min | >1.0                                          | native | n/a
CPU utilization         | context    | 1h   | flat-at-baseline is the tell, not a threshold | n/a    | n/a
```

\* burn ratio = credit usage ÷ per-period accrual. In *standard* mode, surplus stays 0 even when saturated, so trust the credit balance instead.

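As a minimal sketch of the burn-ratio footnote above, the snippet below averages per-period credit usage over a window and divides it by the per-period accrual. The function name `burn_ratio`, the usage samples, and the earn rate of 24 credits per hour are illustrative assumptions; substitute the earn rate documented for the instance type being evaluated.

```python
from statistics import mean


def burn_ratio(usage_per_period, accrual_per_period):
    """Average credit usage per period divided by credits earned per period.

    A sustained value above 1.0 means credits are spent faster than they accrue.
    """
    if accrual_per_period <= 0:
        raise ValueError("accrual_per_period must be positive")
    if not usage_per_period:
        raise ValueError("need at least one usage sample")
    return mean(usage_per_period) / accrual_per_period


# Hypothetical inputs: 24 credits earned per hour, evaluated at 5-minute periods.
accrual_per_5min = 24 / 12             # 2.0 credits earned per 5-minute period
usage_samples = [1.5, 2.5, 3.0, 3.0]   # credits used in each 5-minute period

ratio = burn_ratio(usage_samples, accrual_per_5min)
verdict = "over budget" if ratio > 1.0 else "within budget"
print(f"burn ratio {ratio:.2f}: {verdict}")
```

Averaging over the window before dividing matches the "Avg/window" statistic in the row above: one busy period does not trip the red-line, but a sustained overspend does.
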
**Storage**

```
Consumption (%/GB) | Max/last | 1h–daily | run-out date vs trend (forecast) | 1h ok | 1h retained long
Disk I/O wait      | p95      | 1h       | > tuned threshold                | 1min  | coarsens after 15d
```

**Cache**

```
Time-to-serve     | p95/p99 | 1h | > SLA target      | 1min  | coarsens after 15d
Hit ratio         | Avg     | 1h | below target      | finer | n/a
LRU reference age | Avg     | 1h | sustained decline | n/a   | n/a
```

**User-facing & network**

```
Request latency        | p95/p99 | 1h | > SLO          | 1min | coarsens after 15d
Error rate             | Sum/Avg | 1h | > error budget | 1min | coarsens after 15d
Bandwidth / device CPU | p95     | 1h | > tuned %      | 1min | coarsens after 15d
```

## 4. The run loop (coarse → fine)

1. **Coarse scan**: review a multi-week window (≈4 weeks) at the *coarse interval*, using the *statistic*. Surfaces recurring patterns and ranks subjects by red-line proximity ([[Match Metric Resolution to the Trend]]).
2. **Flag**: mark breaches; tag **repeat offenders** (chronic) vs single events.
3. **Drill**: reduce to the *drill interval* over each flagged window (see §5 constraint).
4. **Characterize**: the point of the drill. Record actual **duration** (a 5-minute blip vs a multi-hour sustained event), **magnitude**, **frequency** (breach count), and **impact** (did the user-facing ceiling or credit budget actually breach?).
5. **Decide**: chronic → fix (scale up / scale out / change family); one-off → possibly ride it out. The repeat-offender *count* drives the targeted-vs-fleet-wide call ([[Two-Measure Burstable Fleet Assessment]]).

## 5. The data-retention constraint (designs the drill)

Time-series backends progressively **aggregate old data**: high-resolution samples are kept only for a recent window, then rolled up to coarser resolution as they age (a deliberate bounded-storage tradeoff; see [[RRD Trades Old Detail for Bounded Storage]]). On CloudWatch specifically: 1-minute data is retained ~15 days, 5-minute ~63 days, 1-hour ~455 days, with automatic down-aggregation at each boundary.

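The retention tiers above can be sketched as a lookup that answers, for an event of a given age, the finest resolution still available. The tier values are the approximate CloudWatch figures quoted in this section; the function name `finest_resolution`, the `native_floor` parameter, and the `None` convention are illustrative assumptions.

```python
from datetime import timedelta

# (maximum age, finest period still retained), using the approximate figures above.
RETENTION_TIERS = [
    (timedelta(days=15), timedelta(minutes=1)),
    (timedelta(days=63), timedelta(minutes=5)),
    (timedelta(days=455), timedelta(hours=1)),
]


def finest_resolution(event_age, native_floor=timedelta(minutes=1)):
    """Return the finest period available for an event of this age, or None if expired.

    native_floor models metrics that never publish finer than some interval,
    for example 5 minutes for burstable credit metrics.
    """
    for max_age, period in RETENTION_TIERS:
        if event_age <= max_age:
            # A metric cannot be drilled finer than its own publish interval.
            return max(period, native_floor)
    return None


for days in (3, 30, 200, 500):
    print(days, "days old ->", finest_resolution(timedelta(days=days)))
```

A `None` result means the event has aged out of every tier, so no drill is possible at any resolution.
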
Consequence: a multi-week coarse scan can surface an event whose **fine-grained data no longer exists**. Design around it:

- Do the fine drill **within the high-resolution retention window**, or
- **Capture high-resolution up front** for events you must keep: a threshold alarm (M-of-N datapoints) or a scheduled metric export that snapshots fine data before it ages out, or
- Accept the coarser floor for older events and state it.

Some metrics have a hard native floor (e.g. burstable credit metrics publish at 5-minute only); "fine-grained" there means the native interval, full stop.

## 6. Two rules that keep it consistent

- **Statistic by binding type**: peak-driven → percentile; accrual-driven → average; max only as a tripwire.
- **Re-confirm the binding metric after any change**: thresholds are platform-specific starting points, and the *defining metric itself* can move (e.g. a faster disk shifts the limit to the network) ([[The Defining Metric Itself Can Change]], [[Adding Capacity Moves the Bottleneck]]).

The filled-in §3 table is the deliverable: two people running the same rows reach the same verdict. That table *is* the process.

---

*Source: synthesized from [[The Art of Capacity Planning]] (John Allspaw, O'Reilly 2008), [[First-Principles Capacity Analysis]], and the Capacity Planning measurement/forecasting notes; CloudWatch retention per AWS documentation.*