Confirming a DB-Saturating Query — Performance Insights + APM

# Confirming a DB-Saturating Query — Performance Insights + APM **Parent Topic**: [[Software/README]] Companion runbook to [[RDS CPU Saturation — Diagnose Before Right-Sizing]]. Goal: identify the **exact SQL statement + triggering endpoint** behind an RDS CPU-saturation incident, and **classify** it so the fix is obvious. Generic methodology, no proprietary data. (Exact UI labels vary by tool/version — the concepts are stable.) ## Part A — RDS Performance Insights (is it the DB, and which SQL?) 1. RDS console → **Performance Insights** → select the instance → set the time range to the **incident window**. 2. Read the **Database load** chart: y-axis is **Average Active Sessions (AAS)**; a horizontal line marks **Max vCPU**. **AAS sustained above the vCPU line = CPU saturation.** That alone confirms the DB tier was the bottleneck. 3. Slice **"Top SQL"** — statements ranked by load contribution. The one(s) towering during the spike are the culprit. Capture the SQL digest/text and its share of load. 4. Slice **"Waits"** — what the load was *waiting on*: - **CPU** dominant → compute-bound: a scan / bad plan / missing index. - **IO** dominant → reads not served from memory. - **Lock / blocking** → contention, not raw query cost. ## Part B — APM / distributed tracing (which endpoint, and how is it calling the DB?) 1. Find the affected service → the request/endpoint (or open the incident/problem view and pivot to the service). 2. On the endpoint, open the **database-calls / response-time breakdown** — a good APM attributes DB calls to the endpoint with **execution count** and **time contribution**. 3. Open one request's **distributed trace (the call waterfall)** and expand the DB calls. This is the decisive view: - **Same statement called dozens/hundreds of times per request** → **N+1**. - **One call, large duration** → **single slow query**. 4. **Aggregate the database-statements view** across the window — `execution count × avg time = total DB time`. ## Part C — Classify (the table that picks your fix) | Signal | Diagnosis | Fix | |---|---|---| | One statement, high time/call, CPU waits | Slow query — missing index / bad plan / scan | `EXPLAIN (ANALYZE)`; add index or rewrite | | Same statement, huge call count, low time/call | N+1 | Batch / eager-load in the app | | Lock / blocking waits dominate | Contention | Tighten transaction scope; revisit isolation | | Query is fine, call rate tracks traffic | Demand / volume | Cache, rate-limit, offload to read replica | ## Part D — Confirm causation, not just correlation - **Align three timelines** over the outage: PI load spike, the APM's endpoint call-rate spike, and CloudWatch **`CPUUtilization`** on the RDS instance. All three overlapping = confirmed. - Run **`EXPLAIN (ANALYZE)`** on the suspect statement to prove the scan / bad plan rather than assume it. - If feasible, **reproduce in non-prod** before recommending the fix. ## Capture for the recommendation The statement · its triggering endpoint · classification (slow / N+1 / lock / volume) · `EXPLAIN` evidence · its AAS/CPU contribution · proposed fix + expected reduction. That package turns "I think it's a slow query" into a defensible root cause — and only then does right-sizing become a data-backed decision.