# Confirming a DB-Saturating Query — Performance Insights + APM
**Parent Topic**: [[Software/README]]
Companion runbook to [[RDS CPU Saturation — Diagnose Before Right-Sizing]]. Goal: identify the **exact SQL statement + triggering endpoint** behind an RDS CPU-saturation incident, and **classify** it so the fix is obvious. Generic methodology, no proprietary data. (Exact UI labels vary by tool/version — the concepts are stable.)
## Part A — RDS Performance Insights (is it the DB, and which SQL?)
1. RDS console → **Performance Insights** → select the instance → set the time range to the **incident window**.
2. Read the **Database load** chart: y-axis is **Average Active Sessions (AAS)**; a horizontal line marks **Max vCPU**. **AAS sustained above the vCPU line = CPU saturation.** That alone confirms the DB tier was the bottleneck.
3. Slice **"Top SQL"** — statements ranked by load contribution. The one(s) towering during the spike are the culprit. Capture the SQL digest/text and its share of load.
4. Slice **"Waits"** — what the load was *waiting on*:
- **CPU** dominant → compute-bound: a scan / bad plan / missing index.
- **IO** dominant → reads not served from memory.
- **Lock / blocking** → contention, not raw query cost.
## Part B — APM / distributed tracing (which endpoint, and how is it calling the DB?)
1. Find the affected service → the request/endpoint (or open the incident/problem view and pivot to the service).
2. On the endpoint, open the **database-calls / response-time breakdown** — a good APM attributes DB calls to the endpoint with **execution count** and **time contribution**.
3. Open one request's **distributed trace (the call waterfall)** and expand the DB calls. This is the decisive view:
- **Same statement called dozens/hundreds of times per request** → **N+1**.
- **One call, large duration** → **single slow query**.
4. **Aggregate the database-statements view** across the window — `execution count × avg time = total DB time`.
## Part C — Classify (the table that picks your fix)
| Signal | Diagnosis | Fix |
|---|---|---|
| One statement, high time/call, CPU waits | Slow query — missing index / bad plan / scan | `EXPLAIN (ANALYZE)`; add index or rewrite |
| Same statement, huge call count, low time/call | N+1 | Batch / eager-load in the app |
| Lock / blocking waits dominate | Contention | Tighten transaction scope; revisit isolation |
| Query is fine, call rate tracks traffic | Demand / volume | Cache, rate-limit, offload to read replica |
## Part D — Confirm causation, not just correlation
- **Align three timelines** over the outage: PI load spike, the APM's endpoint call-rate spike, and CloudWatch **`CPUUtilization`** on the RDS instance. All three overlapping = confirmed.
- Run **`EXPLAIN (ANALYZE)`** on the suspect statement to prove the scan / bad plan rather than assume it.
- If feasible, **reproduce in non-prod** before recommending the fix.
## Capture for the recommendation
The statement · its triggering endpoint · classification (slow / N+1 / lock / volume) · `EXPLAIN` evidence · its AAS/CPU contribution · proposed fix + expected reduction. That package turns "I think it's a slow query" into a defensible root cause — and only then does right-sizing become a data-backed decision.