RDS CPU Saturation — Diagnose Before Right-Sizing - Nestor G Pestelos Jr (ngpestelos)

# RDS CPU Saturation — Diagnose Before Right-Sizing

**Parent Topic**: [[Software/README]]

Generic playbook for the common case: a database (RDS) outage caused by **CPU saturation traced to a specific API endpoint**, where the mandate is to *right-size instances while preventing outages*. No proprietary data — methodology only.

## The reframe that decides everything

"Right-size" and "prevent outages" pull in **opposite directions**:

- **Right-sizing** shrinks capacity to match demand — it *removes* headroom.
- **CPU saturation from an endpoint** is a **demand-side / SQL-efficiency problem**, not a sizing problem. Right-sizing on top of a pathological workload bakes the problem in at a smaller, *less forgiving* instance.

Correct order: **fix/contain the query → establish safe headroom → right-size against the corrected baseline.** Sizing is the last step, not the first.

RDS CPU saturation traced to one endpoint is almost always inefficient SQL: missing index, full table scan, a bad plan, lock contention, or an N+1 firing many fast queries at high volume.

## Step 1 — Confirm the query (two tools, same incident window)

- **RDS Performance Insights** (AWS-native, purpose-built). "Top SQL" plots **Average Active Sessions (AAS) vs. the vCPU line** — anything sustained above the vCPU line *is* the saturation, and it names the statement. Free at 7-day retention; enable it if it isn't on.
- **APM / distributed tracing**. Trace the endpoint → service → **database statements**: captures SQL text, execution count, time-per-call. Key pivot: **one slow query, or many fast queries × huge volume (N+1)?** Different fixes.

Cross-reference both at the incident timestamp before changing anything.

## Step 2 — The RDS constraint that reshapes the approach

You **cannot horizontally autoscale RDS primary write/CPU capacity** the way you can EC2 behind an Auto Scaling Group. Provisioned RDS changes CPU only by resizing the instance class — a failover with downtime.
(**Aurora Serverless v2** is the exception: it scales capacity in place, without an instance-class change.) **Read replicas offload reads only** — see [[Replicas Scale Reads Not Writes]].

→ For RDS, the demand-side fix and protection layers carry far more weight than sizing. Sizing is a blunt, expensive, downtime-incurring lever.

## Step 3 — Sequenced proactive measures

1. **Confirm + fix the query** — add index, rewrite, cache, or collapse the N+1.
2. **Protect** — a `statement_timeout` (PostgreSQL; MySQL's analogue for `SELECT` statements is `max_execution_time`) so no single runaway query can hold CPU indefinitely; connection pooling / RDS Proxy; a read replica for read-heavy endpoints; per-endpoint query budgets where feasible.
3. **Headroom + early alerting** — alert on CPU / Active Sessions *before* saturation, following the USE method (Utilization → Saturation → Errors). See [[Maintain Capacity Headroom]].
4. **Right-size last**, against the corrected steady-state baseline.

## Signals — what to watch (USE method + which queue where)

Detection should follow the **USE method** — Utilization → Saturation → Errors:

- **Utilization** (CPU %) is **lagging and ambiguous** — it flatlines at 100% and tells you *that* you're saturated, not *how badly* or how much demand is backed up.
- **Saturation** — **queue length is the leading signal**. It grows the moment arrival rate exceeds service rate, before latency blows up and well before the outage. This is the signal pure CPU-average right-sizing misses.
- **Errors** — timeouts, rejected connections, 5xx.

**Measure queue _time_, not raw depth.** Queue length is bimodal (near-zero, then it explodes), so a static depth threshold is fragile. Alert on **sojourn time** (how long work waits in the queue), which is normalized for both load and capacity. This is the **CoDel** approach from [[Queue Management Under Overload]]; [[Backpressure as System Self-Regulation]] is the mechanism that bounds the queue.

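
The sojourn-time rule above can be sketched as a small check: fire only when wait time has stayed above a target for a whole interval, so a brief spike does not page anyone. This is a simplified, CoDel-inspired illustration, not the CoDel algorithm itself. The class name `SojournAlert` and the threshold values are hypothetical; real values would come from the observed connection-pool wait time of the workload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SojournAlert:
    """Fire when queue wait time stays at or above target_s for interval_s.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    target_s: float = 0.050    # assumed acceptable wait for a DB connection
    interval_s: float = 10.0   # assumed time the wait must persist
    _above_since: Optional[float] = None

    def observe(self, now_s: float, sojourn_s: float) -> bool:
        """Record one wait-time sample; return True if the alert should fire."""
        if sojourn_s < self.target_s:
            # Wait time recovered: forget the earlier excursion.
            self._above_since = None
            return False
        if self._above_since is None:
            # First sample at or above target starts the clock.
            self._above_since = now_s
        return now_s - self._above_since >= self.interval_s


# Usage: feed (timestamp, wait time) samples from the pool's metrics.
alert = SojournAlert()
samples = [(0.0, 0.010), (1.0, 0.200), (5.0, 0.300), (11.0, 0.300)]
fired = [alert.observe(t, wait) for t, wait in samples]
```

Because the check keys on how long work waits rather than how many items are queued, the same thresholds stay meaningful when pool size or traffic changes.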

**Pick the right queue — for an RDS CPU stall the telling queue is downstream, not the front door:**

| Layer | Queue signal that matters |
|---|---|
| EC2 app tier | Worker / accept queue depth; LB surge & connection metrics (autoscale on this — [[Queue-Depth Auto-Scaling Control Loop]]) |
| **DB connection pool** | **Requests waiting for a free DB connection** — where an RDS CPU stall first surfaces as app-side backlog |
| RDS itself | Performance Insights **Average Active Sessions vs. vCPU line** — the DB's own work queued on CPU |

The cascade: RDS CPU saturates → queries slow → connections don't free → the pool's **wait queue** grows → the app tier backs up. So the two **earliest** warning signals are **DB connection-pool wait time** and **AAS-above-vCPU** — not the HTTP request queue, which can stay calm while the real backup forms deeper. (See [[Goodput as Capacity Truth-Teller]]: throughput looks fine while the queue and latency silently build.)

## Edge layer (CDN / edge platform in front of the web tier)

If the web servers sit behind a CDN/edge platform, the topology is **edge → EC2 → RDS**, and the edge is the *cheapest place* to both protect RDS and shrink EC2 — it shapes what reaches the origin before any query runs.

- **Protect at the edge.** Rate-limit or load-shed the offending endpoint at the edge (edge WAF / rate policies) — saturating requests never reach EC2 or RDS, so the query never runs. This is upstream of the front-door request queue: it caps *arrival rate* at the cheapest point. Turns the Step-3 "build app rate limiting" into "configure an existing edge control."
- **Cache offload.** If the endpoint's response is cacheable at all — even **micro-caching** (a few seconds' TTL) — the edge serves it and collapses origin + DB load. Often solves an expensive-read incident outright.
- **Right-size EC2 to origin (cache-miss) load, not total client traffic.** The edge absorbs cacheable hits; EC2 only sees misses.
  Sizing against total client traffic over-provisions. Raising cache-hit ratio is itself a right-sizing lever.
- **Edge logs = a third diagnostic source.** Request-volume-per-URL at the edge settles the one-slow-query vs N+1 vs volume question (alongside Performance Insights + APM): edge rate spiked → volume problem; edge rate flat but DB CPU pegged → query problem.

**Caveats:**

- Edge caching helps RDS **only to the extent the endpoint is cacheable** — uncacheable dynamic/auth'd queries still pass through; for those the lever is edge rate-limiting, not caching.
- Behind a CDN, EC2/RDS see the **edge's IPs**, not real clients — client attribution and any app-tier rate-limiting need `True-Client-IP` / `X-Forwarded-For`. This can go wrong in two ways: **forget the header** and all per-IP logic (rate limits, blocklists, geo, logs, WAF) collapses onto the CDN's IPs and is useless; **trust a client-supplied `X-Forwarded-For`** without verifying the request actually came from the CDN's published edge ranges and an attacker can **spoof** their IP to bypass rate limits and poison logs. Correct posture: trust only the header the CDN sets, only from its known edge IPs.
- Verify first: (1) is the problematic endpoint actually edge-fronted (APIs sometimes bypass the CDN on a separate hostname)? (2) is its response cacheable at all?
- Bot/abuse-driven spikes can be filtered at the edge (bot management) so they never reach the origin.

## Core principle

Model capacity in **CPU terms, not request count** ([[CPU Over QPS for Capacity Modeling]]) — a few endpoint calls can consume a disproportionate share of a core. Request count can look fine while CPU is on fire. Saturation, not throughput, is the outage signal.

## Topology to establish (for the EC2 half of the mandate)

Load balancer? Auto Scaling Group? RDS instance class + Multi-AZ + read replicas? APM service maps plus the AWS console reveal this quickly. The EC2 app tier *can* autoscale, so its right-sizing story differs from RDS.
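
To make the core principle (CPU terms, not request count) concrete, here is a minimal sketch that compares each endpoint's share of requests with its share of database CPU time. The function name `cpu_vs_request_share`, the endpoint names, and all numbers are invented for illustration; in practice the call counts and per-call CPU time would come from APM or Performance Insights.

```python
from typing import Dict, Tuple


def cpu_vs_request_share(
    endpoints: Dict[str, Tuple[int, int]],
) -> Dict[str, Tuple[float, float]]:
    """Return endpoint -> (share of requests, share of DB CPU time).

    `endpoints` maps a name to (call_count, mean_cpu_ms_per_call).
    Hypothetical helper for illustration only.
    """
    total_calls = sum(calls for calls, _ in endpoints.values())
    total_cpu_ms = sum(calls * cpu_ms for calls, cpu_ms in endpoints.values())
    shares = {}
    for name, (calls, cpu_ms) in endpoints.items():
        request_share = calls / total_calls if total_calls else 0.0
        cpu_share = calls * cpu_ms / total_cpu_ms if total_cpu_ms else 0.0
        shares[name] = (request_share, cpu_share)
    return shares


# Invented numbers: a rarely called report endpoint with an expensive query.
window = {
    "/search": (9000, 2),    # 9000 calls at 2 ms of DB CPU each
    "/report": (100, 820),   # 100 calls at 820 ms of DB CPU each
}
shares = cpu_vs_request_share(window)
```

In this made-up window `/report` is about 1% of requests yet accounts for most of the database CPU time, which is the pattern a request-count dashboard hides.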