# Metric Evaluation Runbook
**Parent Topic**: [[Software/README]]
A repeatable procedure for evaluating capacity/performance metrics so the verdict is **consistent regardless of who runs it** — not left to chance. The cure for inconsistency is to fix four variables *before* looking at any graph: **which metric, which statistic, which sampling interval, which threshold.** Everything below makes those four explicit. Thresholds are starting points to calibrate per environment.
## 1. Which metrics matter — by resource class
Governing rule: **measure the resource that saturates first, in the unit the user actually feels** — don't default to CPU utilization everywhere ([[Find Each Component's Red-Line Number]], [[Eliminate Healthy Resources to Find the Binding One]]).
| Resource class | Primary metric | Why |
|---|---|---|
| Fixed-performance compute | CPU utilization; memory | CPU is an honest saturation signal here |
| Burstable compute | credit balance + burn ratio (**not** CPU alone) | a throttled instance reads "healthy" at flat baseline CPU ([[Burstable CPU Utilization Masks Saturation]]) |
| Storage | consumption + disk **I/O wait** | I/O wait, not utilization, predicts the cliff ([[Disk IO Wait Predicts DB Lag Not Utilization]]) |
| Cache | hit ratio, LRU reference age, time-to-serve | request rate alone misleads once the working set overflows |
| User-facing | latency / time-to-serve | the real ceiling — system metrics can look fine while latency breaches ([[Define Ceilings by User-Facing Time Not System Metrics]]) |
State a **red-line ceiling** per metric (a margin below the failure point — [[Apply a Safety Factor Above the Ceiling]]) so evaluation is a comparison, not a vibe.
## 2. Which statistic — average, max, or percentile
The wrong statistic is the most common way two evaluations silently diverge.
- **Peak-driven resources → p95/p99.** Average hides the peak you fail at; raw **Max** over-reacts to a single outlier. A percentile is "the peak without the one-off spike."
- **Accrual/budget resources (burstable credits) → average over a window.** Credits accrue against the baseline *earn rate*, so average-vs-baseline (or burn ratio) is the true signal, not the instantaneous peak.
- **Max** is a tripwire only ("did it ever touch the ceiling"), never the headline number.
- **Sum** for credit-style counters when the query period is longer than the metric's native publish interval, so the per-interval amounts add up instead of averaging out.
Rule of thumb: *peak-driven → percentile; accrual-driven → average; max only as a tripwire.*
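
To make the difference concrete, here is a minimal sketch comparing the three statistics on one hour of 1-minute CPU samples. The sample values are invented for illustration, and `percentile` is a hypothetical nearest-rank helper, not any monitoring backend's exact definition.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical hour of 1-minute CPU samples (%): a quiet baseline,
# a sustained 6-minute busy period, and a single one-off spike.
cpu = [20.0] * 53 + [85.0] * 6 + [100.0]

summary = {
    "avg": round(statistics.mean(cpu), 1),
    "p95": percentile(cpu, 95),
    "max": max(cpu),
}
print(summary)
```

On this data the average (27.8) reads as idle, Max (100.0) reports the single spike, and p95 (85.0) reports the sustained busy period, which is the level a peak-driven resource actually has to survive.
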
## 3. The evaluation table (generic defaults — tune thresholds)
Columns: **metric · statistic · coarse interval · red-line (tune) · drill interval · retention floor**
**Fixed-performance compute**
```
CPU utilization | p95 (p99 tripwire) | 1h | p95>70% upgrade; p99~100% now | 1min | coarsens after 15d
Memory | p95 | 1h | >75–80% (net of page cache) | 1min | coarsens after 15d
```
**Burstable compute**
```
Credit balance | Min + Avg | 5min | trending to 0 / pinned low | native | —
Surplus credits charged| Sum | 5min | >0 sustained (unlimited) | native | —
Credit burn ratio* | Avg/window | 5min | >1.0 | native | —
CPU utilization | context | 1h | flat-at-baseline is the tell, not a threshold | — | —
```
\* burn ratio = credit usage ÷ per-period accrual. In *standard* mode, surplus stays 0 even when saturated; trust the credit balance instead.
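
The burn ratio from the footnote can be sketched as follows. The earn rate of 24 credits per hour and the usage samples are illustrative assumptions; look up the real earn rate for the instance type being evaluated.

```python
def burn_ratio(usage_per_period, earned_per_hour, period_minutes=5):
    """Average credit usage per period divided by credits accrued per period.

    A ratio above 1.0 means credits are being spent faster than they are earned.
    """
    accrual_per_period = earned_per_hour * period_minutes / 60
    avg_usage = sum(usage_per_period) / len(usage_per_period)
    return avg_usage / accrual_per_period

# Hypothetical: twelve 5-minute credit-usage samples (one hour)
# for an instance assumed to earn 24 credits/hour (2.0 per 5 minutes).
usage = [1.5, 1.5, 2.5, 3.0, 3.0, 3.0, 2.5, 2.5, 3.0, 3.0, 2.5, 2.0]
ratio = burn_ratio(usage, earned_per_hour=24)
print(ratio)
```

Averaging over the window is deliberate: a few periods above the accrual rate are harmless if the window as a whole stays at or below 1.0.
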
**Storage**
```
Consumption (%/GB) | Max/last | 1h–daily | run-out date vs trend (forecast) | 1h ok | 1h retained long
Disk I/O wait | p95 | 1h | > tuned threshold | 1min | coarsens after 15d
```
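
For the consumption row, "run-out date vs trend" can be approximated with a straight-line fit. This is a minimal sketch assuming roughly linear growth and one sample per day; `days_until_full` and the sample figures are hypothetical, and real growth may need a different model.

```python
def days_until_full(daily_used_gb, capacity_gb):
    """Least-squares line through daily usage, extrapolated to capacity.

    Returns days remaining after the last sample, or None when there are
    too few samples or usage is flat or shrinking.
    """
    n = len(daily_used_gb)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_used_gb))
    slope = sxy / sxx  # GB per day
    if slope <= 0:
        return None
    fitted_last = mean_y + slope * ((n - 1) - mean_x)
    return (capacity_gb - fitted_last) / slope

used = [400, 404, 408, 412, 416, 420, 424]  # hypothetical week, growing 4 GB/day
days = days_until_full(used, capacity_gb=500)
print(days)
```

Comparing the forecast against the lead time needed to add storage turns the row into a yes/no check instead of a judgment about a percentage.
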
**Cache**
```
Time-to-serve | p95/p99 | 1h | > SLA target | 1min | coarsens after 15d
Hit ratio | Avg | 1h | below target | finer | —
LRU reference age | Avg | 1h | sustained decline | — | —
```
**User-facing & network**
```
Request latency | p95/p99 | 1h | > SLO | 1min | coarsens after 15d
Error rate | Sum/Avg | 1h | > error budget | 1min | coarsens after 15d
Bandwidth / device CPU | p95 | 1h | > tuned % | 1min | coarsens after 15d
```
## 4. The run loop (coarse → fine)
1. **Coarse scan** — review a multi-week window (≈4 weeks) at the *coarse interval*, using the *statistic*. Surfaces recurring patterns and ranks subjects by red-line proximity ([[Match Metric Resolution to the Trend]]).
2. **Flag** — mark breaches; tag **repeat offenders** (chronic) vs single events.
3. **Drill** — reduce to the *drill interval* over each flagged window (see §5 constraint).
4. **Characterize** — the point of the drill: actual **duration** (a 5-minute blip vs a multi-hour sustained event), **magnitude**, **frequency** (breach count), and **impact** (did the user-facing ceiling or credit budget actually breach?).
5. **Decide** — chronic → fix (scale up / scale out / change family); one-off → possibly ride. The repeat-offender *count* drives the targeted-vs-fleet-wide call ([[Two-Measure Burstable Fleet Assessment]]).
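
The flag, drill, and characterize steps can be sketched as plain functions. All names, the sample data, and the `CHRONIC_MIN` cutoff are illustrative assumptions; the red-line reuses the 70% p95 CPU default from §3.

```python
from collections import Counter

RED_LINE = 70.0   # p95 CPU red-line default from section 3; tune per environment
CHRONIC_MIN = 3   # hypothetical cutoff: this many breaches marks a repeat offender

def flag(coarse_rows, red_line=RED_LINE):
    """coarse_rows: (subject, hour_index, p95_value) tuples. Returns the breaching rows."""
    return [row for row in coarse_rows if row[2] > red_line]

def classify(breaches, chronic_min=CHRONIC_MIN):
    """Tag each breaching subject as chronic (repeat offender) or single-event."""
    counts = Counter(subject for subject, _hour, _value in breaches)
    return {subject: ("chronic" if n >= chronic_min else "single-event")
            for subject, n in counts.items()}

def characterize(fine_samples, red_line=RED_LINE):
    """Duration (longest consecutive run), frequency, and magnitude at the drill interval."""
    longest = current = 0
    for value in fine_samples:
        current = current + 1 if value > red_line else 0
        longest = max(longest, current)
    return {
        "longest_run": longest,
        "breach_count": sum(1 for v in fine_samples if v > red_line),
        "peak": max(fine_samples),
    }

coarse = [("web-1", 10, 82.0), ("web-1", 34, 79.0), ("web-1", 58, 91.0),
          ("web-2", 12, 74.0), ("db-1", 40, 66.0)]
verdicts = classify(flag(coarse))
detail = characterize([65.0, 72.0, 88.0, 91.0, 90.0, 68.0, 75.0, 60.0])
print(verdicts, detail)
```

Here `web-1` is tagged chronic and `web-2` single-event, and the drill shows a longest run of four consecutive 1-minute samples above the red-line. The impact check against the user-facing ceiling is left out of this sketch.
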
## 5. The data-retention constraint (designs the drill)
Time-series backends progressively **aggregate old data**: high-resolution samples are kept only for a recent window, then rolled up to coarser resolution as they age (a deliberate bounded-storage tradeoff; see [[RRD Trades Old Detail for Bounded Storage]]). On CloudWatch specifically: 1-minute data is retained for ~15 days, 5-minute data for ~63 days, and 1-hour data for ~455 days, with automatic down-aggregation at each boundary.
Consequence: a multi-week coarse scan can surface an event whose **fine-grained data no longer exists**. Design around it:
- Do the fine drill **within the high-resolution retention window**, or
- **Capture high-resolution up front** for events you must keep: a threshold alarm (M-of-N datapoints) that marks the event while its fine data still exists, or a scheduled metric export that snapshots fine data before it ages out, or
- Accept the coarser floor for older events and state it.
Some metrics have a hard native floor (e.g. burstable credit metrics publish at 5-minute only) — "fine-grained" there means the native interval, full stop.
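
A small lookup makes the constraint checkable before planning a drill. This sketch hard-codes the approximate CloudWatch tiers quoted above and treats the boundaries as inclusive, which is an assumption; `finest_period` is a hypothetical helper, not an AWS API.

```python
# (maximum data age in days, finest period in seconds still available)
RETENTION_TIERS = [(15, 60), (63, 300), (455, 3600)]

def finest_period(age_days, native_floor_s=60):
    """Finest period (seconds) available for data of the given age.

    native_floor_s is the metric's own publish interval, e.g. 300 for
    burstable credit metrics. Returns None once the data has expired.
    """
    for max_age_days, period_s in RETENTION_TIERS:
        if age_days <= max_age_days:
            return max(period_s, native_floor_s)
    return None

print(finest_period(3))                      # recent event
print(finest_period(20))                     # past the 1-minute window
print(finest_period(3, native_floor_s=300))  # credit metric, native 5-minute floor
```

Running this against the age of each flagged event shows which drills are still possible at 1-minute resolution and which must be reported at a coarser floor.
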
## 6. Two rules that keep it consistent
- **Statistic by binding type** — peak-driven → percentile; accrual-driven → average; max only as a tripwire.
- **Re-confirm the binding metric after any change** — thresholds are platform-specific starting points, and the *defining metric itself* can move (e.g. a faster disk shifts the limit to the network) ([[The Defining Metric Itself Can Change]], [[Adding Capacity Moves the Bottleneck]]).
The filled-in §3 table is the deliverable: two people running the same rows reach the same verdict. That table *is* the process.
---
*Source: synthesized from [[The Art of Capacity Planning]] (John Allspaw, O'Reilly 2008), [[First-Principles Capacity Analysis]], and the Capacity Planning measurement/forecasting notes; CloudWatch retention per AWS documentation.*