EC2 Capacity Planning for Burstable Fleets - Nestor G Pestelos Jr (ngpestelos)

# EC2 Capacity Planning for Burstable Fleets

**Parent Topic**: [[Software/README]]

Methodology for analyzing a mixed EC2 fleet (mostly T-family burstable plus some fixed-performance M/C/R) and recommending right-sizing or upgrades, including headroom for a forecasted near-term load increase. Generic methodology, no proprietary data. Thresholds are starting points to tune per environment.

**Core principle:** the analysis *outputs* both the lever (scale up / scale out / change family / reserve) and the targeted-vs-fleet-wide call; neither is an input. Don't pre-commit to "fleet-wide upgrade."

## The framing trap to avoid

"Which to upgrade, or upgrade the whole fleet" presupposes that the lever is **scale-up** and the choice is **targeted-vs-fleet**. Both should fall out of the data. Four levers exist: **scale up** (bigger instance), **scale out** (more instances behind an Auto Scaling Group), **change family** (burstable→fixed, or newer generation), **reserve** (RI / Savings Plan for steady baseline). Prefer reversible scale-out over a committed fleet-wide scale-up when load growth is uncertain.

## Step 0: Setup

- Enable **AWS Compute Optimizer** as an independent right-sizing cross-check. It ingests CloudWatch history and needs ≥30h minimum, **14 days** for the higher-confidence tier, which aligns with the ≥2-week window below. Validate against it; don't defer to it.
- Confirm whether the **CloudWatch agent** is deployed. Memory and disk are NOT in default EC2 metrics. Without the agent you can analyze only CPU/network, so declare memory an *unquantified risk* rather than assuming it's fine.

## Step 1: Pull metrics (window ≥ 2 weeks, covering a peak cycle)

**The burstable trap (dominant case):** low `CPUUtilization` does NOT mean healthy. A T-instance flat near its baseline % with a depleted `CPUCreditBalance` is **throttled and saturated**: it wants CPU it can't get. Always read CPU *next to* credit balance.
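As a rough illustration of reading CPU beside credit balance, the sketch below flags an instance only when both conditions hold together. It is a minimal example under stated assumptions: `BurstableWindow`, `looks_throttled`, and the `band_pct` / `credit_floor` tolerances are hypothetical names and values chosen for illustration, not AWS definitions, and the inputs are assumed to be window-level statistics already pulled from CloudWatch.

```python
from dataclasses import dataclass


@dataclass
class BurstableWindow:
    """Summary statistics for one T-family instance over the analysis window."""
    avg_cpu_pct: float         # Average CPUUtilization, in percent
    min_credit_balance: float  # Minimum CPUCreditBalance, in credits
    baseline_pct: float        # Instance baseline, e.g. 30.0 for t3.large


def looks_throttled(w, band_pct=3.0, credit_floor=5.0):
    """Return True when CPU sits near baseline AND credits are depleted.

    band_pct and credit_floor are illustrative tolerances, not AWS values.
    """
    near_baseline = abs(w.avg_cpu_pct - w.baseline_pct) <= band_pct
    credits_depleted = w.min_credit_balance <= credit_floor
    return near_baseline and credits_depleted


# Same "low" CPU reading, opposite verdicts once credits are read alongside it.
starved = BurstableWindow(avg_cpu_pct=29.5, min_credit_balance=0.0, baseline_pct=30.0)
healthy = BurstableWindow(avg_cpu_pct=29.5, min_credit_balance=640.0, baseline_pct=30.0)
print(looks_throttled(starved), looks_throttled(healthy))
```

The point of the two sample windows is that `CPUUtilization` alone cannot tell them apart; only the credit balance does.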
Credit metrics publish at 5-minute frequency only; if you query a period >5 min, use the **Sum** statistic, not Average.

**Check the credit mode first: it decides which signals are valid.** T3/T3a/T4g default to **unlimited**; T2 defaults to **standard**. In *unlimited* mode a saturated instance spends surplus credits and `CPUSurplusCreditsCharged` rises (you're paying for overage). In *standard* mode a saturated instance simply **throttles**: surplus is never charged, so `CPUSurplusCreditsCharged` stays 0 *even when saturated*. On a standard-mode (or mixed/older) fleet, surplus-charged is a false negative; rely on `CPUCreditBalance` pinned near zero instead.

T-family (the credit economy is the real signal):

| Metric | Statistic | Reading |
|--------|-----------|---------|
| `CPUCreditBalance` | Minimum + Avg | Trending to / pinned at zero = chronically over baseline |
| `CPUSurplusCreditsCharged` | Sum | >0 (unlimited mode) = already paying for sustained overage |
| `CPUUtilization` | p95, p99, Avg | Meaningful only beside credit balance |
| Memory (CWAgent) | p95 | Often the true binding constraint |

Fixed-performance (M/C/R), where CPU is honest:

| Metric | Statistic | Threshold |
|--------|-----------|-----------|
| `CPUUtilization` | p95 / p99 | p95 sustained >70% → upgrade; p99 ~100% → under-provisioned (latency-sensitive services; for batch/async workers p99 ~100% can be normal, so apply by workload class) |
| Memory (CWAgent) | p95 | >75–80% → memory-bound, upgrade. **But** `mem_used_percent` counts reclaimable page cache as used; net out cache (or use an available-memory metric) before thresholding, else healthy boxes false-positive |

Fleet-ranking queries (CloudWatch **Metrics Insights**):

```
-- Worst credit depletion (upgrade candidates surface at top)
SELECT AVG(CPUCreditBalance)
FROM SCHEMA("AWS/EC2", InstanceId)
GROUP BY InstanceId
ORDER BY AVG() ASC
LIMIT 25

-- Instances already paying for overage
SELECT SUM(CPUSurplusCreditsCharged)
FROM SCHEMA("AWS/EC2", InstanceId)
GROUP BY InstanceId
ORDER BY SUM() DESC
LIMIT 25
```

Metrics Insights aggregates are AVG/SUM/MAX/MIN/COUNT. For CPU **p95/p99**, use the metric's percentile statistic via the console graph or `GetMetricData` with `Stat=p95`, not a Metrics Insights `SELECT`.

**Metrics Insights window/scale limits:** the query window was historically capped at the most recent **3 hours**; AWS extended it to **2 weeks** (Sept 2025), which is enough for the analysis window above but a hard ceiling. A *monthly* peak cycle cannot be ranked in one query; fall back to `GetMetricData` over the longer span. A single query also scans ≤10,000 metrics (DPS quota), so very large fleets need pagination, not one ranking query.

## Step 2: Classify each instance

T-family decision rule:

| Pattern | Reading | Action |
|---------|---------|--------|
| Credit balance high/stable, CPU mostly below baseline, surplus = 0 | Healthy burstable | Keep |
| Credit balance trending to zero / pinned low (any mode), OR (*unlimited mode only*) surplus charged >0 consistently | Chronically over baseline | Upgrade: *sustained* load → move to fixed (M-class); *spiky* load → larger T-size |
| Credit balance maxed at accrual cap, CPU far below baseline | Over-provisioned | Downsize (cost saving) |

The headline insight: **sustained above-baseline load means the workload has outgrown burstable**. The fix is a family change to fixed-performance, not a bigger T.

Fixed-performance decision rule: p95 CPU (or memory) >70–75% → upgrade (scale up, or scale out if behind an ASG); p95 <20–30% on both → downsize candidate.

### Measuring spike frequency (chronic vs one-off)

The classifier above hinges on "chronically over baseline", but a single long event and many scattered blips both pin the credit balance low. Counting breach intervals separates them, and that count is what drives the targeted-vs-fleet call in Step 4.
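To make the distinction concrete, the sketch below computes both the breach count and the longest consecutive run from a raw 5-minute `CPUUtilization` series that has already been downloaded. It is a minimal offline illustration, not a CloudWatch feature: `breach_stats` and its result keys are hypothetical names, and the two sample series are synthetic.

```python
from itertools import groupby


def breach_stats(cpu_pct, baseline_pct, period_minutes=5):
    """Summarize how often and for how long a series exceeds the baseline.

    cpu_pct is a list of CPUUtilization datapoints in time order,
    one per period. Gaps in the series are assumed to be already handled.
    """
    flags = [value > baseline_pct for value in cpu_pct]
    # Lengths of each consecutive run of above-baseline datapoints.
    run_lengths = [sum(1 for _ in group) for above, group in groupby(flags) if above]
    breaches = sum(run_lengths)
    return {
        "breach_intervals": breaches,
        "minutes_above": breaches * period_minutes,
        "episodes": len(run_lengths),
        "longest_run_minutes": max(run_lengths, default=0) * period_minutes,
    }


# One 6-hour spike vs 72 scattered blips: same breach count, different shape.
one_long_event = [50.0] * 72 + [10.0] * 72
scattered_blips = [50.0, 10.0] * 72
print(breach_stats(one_long_event, baseline_pct=30.0))
print(breach_stats(scattered_blips, baseline_pct=30.0))
```

Both series breach the baseline for the same total time, so only the episode count and longest run separate a single event from chronic spiking.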
CloudWatch has no native spike count; build it with **metric math** in `GetMetricData` (per instance, over the ≥2-week window). `SUM(IF(...))` collapses the series to one number. If `GetMetricData` rejects a bare scalar as the final expression, wrapping it in `TIME_SERIES(...)` returns the same number as a flat series.

```
# Count above-baseline intervals (raw CPU)
m1 = CPUUtilization                  # period 300s
e1 = SUM(IF(m1 > BASELINE, 1, 0))    # scalar = number of 5-min intervals breaching
                                     # × 5 min = total time above baseline ("extended")
```

Prefer thresholding on **credit burn** for burstable instances; it counts episodes that actually saturate:

```
# accrual per 5-min period = credits_per_hour / 12 (t3.large = 36/12 = 3)
m1 = CPUCreditUsage
e1 = SUM(IF(m1 > 3, 1, 0))           # number of periods burning faster than earning
```

Use the instance's real baseline (t3.large 30%, etc.), not a flat 70%. Metrics Insights **cannot** do this (no conditional count), so loop `GetMetricData` across instances and sort the scalar offline to rank the worst spikers.

For *consecutive-run length* (one 6h spike vs 72 blips), pull the raw series and compute runs in pandas, or use a CloudWatch alarm with "N of M datapoints" + `DescribeAlarmHistory`.

**Read it as:** high count → chronic → upgrade / change family; low count + one long run → single event, possibly rideable.

## Step 3: Overlay the forecast (near-term load increase)

Required input: the **magnitude and source** of the expected increase. Known event → use its expected load multiple. General growth → extrapolate the trend slope from the 2-week window.

- `projected_p95 = current_p95 × (1 + expected_growth)`
- **Caveat: CPU rarely scales linearly with load.** This multiplier is a first-order estimate. Near saturation, systems often go *superlinear* (lock/GC/context-switch contention: 2× load can be >2× CPU exactly when the box can least absorb it) or *sublinear* (caching, batching). For a high-stakes event, validate the projection with a load test rather than trusting the multiplier.
- **Fixed instances:** target projected p95 ≤ **70%** (30% buffer for spikes + reaction time).
- **T-family: average-below-baseline is necessary, not sufficient.** The first test is whether projected *average* utilization stays **below the instance's baseline %** so credits still accrue. But credit balance is *finite* (max accrual = 24h × credits-earned-per-hour). A **sustained spike above baseline** drains a finite balance even when the long-run average stays under baseline. So also test the forecast's peak: does `(over-baseline burn rate × spike duration)` exceed the accrued credit ceiling? If yes, credits deplete mid-event regardless of the average, so pre-empt before the event. Baseline per size (**t3 shown; t2/t3a/t4g differ**): t3.medium 20%, t3.large 30%, t3.xlarge 40%; baseline = (credits-earned-per-hour ÷ vCPUs) ÷ 60.

## Step 4: Roll up, targeted vs fleet-wide

Tabulate every instance as Upgrade / Keep / Downsize, then:

- **Targeted** if the Upgrade set is a minority clustering by role.
- **Fleet-wide** if the *majority* of T-instances deplete credits under the forecast. That is a signal the whole tier has outgrown burstable, and a fleet T→M (or larger-T) move is justified. Cross-check cost: sum of targeted upgrades vs fleet-wide.
- Prefer **scale-out behind an ASG** over a committed fleet-wide scale-up when the increase is uncertain, because it's reversible.

## Step 5: Output

One row per instance:

```
id | family/size | p95 CPU | min credit balance | surplus charged | memory p95 | current verdict | projected-load verdict | recommended action | est. cost Δ
```

The targeted-vs-fleet recommendation and lever choice fall directly out of this table.

## Limitations & failure modes

The judgment layer (four levers, "read CPU beside credits", "outgrown burstable → change family") is robust. These are the mechanical traps that produce a *wrong* recommendation if ignored:

- **Credit mode flips the surplus signal.** Standard-mode instances throttle silently, so `CPUSurplusCreditsCharged` never fires. Confirm mode per instance; on standard mode trust only `CPUCreditBalance`. (Step 1.)
- **A finite credit balance can deplete even when average < baseline.** A long enough above-baseline spike drains the 24h-accrual ceiling. Test peak duration × burn rate, not just the average. (Step 3.)
- **CPU scaling is often non-linear.** The `× (1 + growth)` multiplier under-predicts near saturation. Load-test high-stakes events. (Step 3.)
- **`mem_used_percent` over-reports** (page cache counts as used). Net out cache before flagging memory-bound.
- **Metrics Insights caps at 2 weeks and ≤10,000 metrics/query.** Monthly cycles or very large fleets need `GetMetricData` + pagination. (Step 1.)
- **Baselines shown are t3 only.** t2 (and its standard-mode default), t3a, t4g differ; recompute via the formula per generation.
- **No CloudWatch agent → memory/disk are blind.** Declare them *unquantified risk*, never "fine." (Step 0.)

---

*Source: synthesized from [[Burstable CPU Utilization Masks Saturation]], [[EC2 Burstable Baseline Utilization]], [[CloudWatch Observability Primitives]], and AWS EC2 / CloudWatch documentation.*
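
The Step 3 peak test (burn rate × spike duration against the 24h accrual ceiling) can be worked through numerically. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, one CPU credit is taken as one vCPU at 100% for one minute, the instance is assumed to start the spike with a full balance unless told otherwise, and launch credits are ignored. In unlimited mode, "depleted" means surplus charges begin rather than throttling.

```python
def hours_until_depleted(vcpus, credits_per_hour, spike_cpu_pct, start_balance=None):
    """Hours a sustained spike can run before the credit balance reaches zero.

    Returns float('inf') when the spike is at or below baseline (credits
    are earned at least as fast as they are spent).
    """
    max_balance = 24 * credits_per_hour  # accrual ceiling from Step 3
    balance = max_balance if start_balance is None else min(start_balance, max_balance)
    # Credits spent per hour: vCPUs x 60 minutes x utilization fraction.
    spend_per_hour = vcpus * 60 * spike_cpu_pct / 100
    net_burn_per_hour = spend_per_hour - credits_per_hour
    if net_burn_per_hour <= 0:
        return float("inf")
    return balance / net_burn_per_hour


def spike_depletes_credits(vcpus, credits_per_hour, spike_cpu_pct, spike_hours,
                           start_balance=None):
    """True if the forecast spike outlasts the available credits."""
    runway = hours_until_depleted(vcpus, credits_per_hour, spike_cpu_pct, start_balance)
    return runway < spike_hours


# t3.large: 2 vCPUs, 36 credits/hour (30% baseline), forecast spike at 80% CPU.
runway = hours_until_depleted(vcpus=2, credits_per_hour=36, spike_cpu_pct=80)
print(f"runway at 80% CPU from a full balance: {runway:.1f} h")
print(spike_depletes_credits(2, 36, 80, spike_hours=24))  # spike longer than runway
print(spike_depletes_credits(2, 36, 80, spike_hours=6))   # spike shorter than runway
```

A partially drained starting balance shortens the runway proportionally, which is why the minimum credit balance from Step 1 is a better `start_balance` than the ceiling when the instance already runs hot.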