Throughput Latency Tension - Nestor G Pestelos Jr (ngpestelos)

**Throughput** is how many requests (or units of work) a system handles per unit of time, measured in RPS, QPS, or records/hour. **Latency** is how long each individual request takes. These two are in fundamental tension: on fixed capacity, optimizing for one degrades the other.

## Why They Conflict

Higher throughput means more requests in the system at once. More requests → longer queues → each request waits longer → higher tail latency. Pushing a system toward maximum throughput pushes it toward maximum queuing delay.

Conversely, optimizing for low latency means keeping queues short, which means the system is often idle and capacity goes unused.

## The Bigtable Example

Same hardware, opposite needs:

- **Low-latency users** (serving live requests): want empty queues and immediate processing. Provision with slack capacity.
- **Throughput users** (batch/MapReduce): want full queues and workers that are never idle. Run hot, with less redundancy.

Success for one is literally failure for the other. Google solves this by partitioning into service tiers: throughput clusters cost 10-50% as much as low-latency clusters.

## Where Each Matters

| Service Type | Primary Concern |
|---|---|
| User-facing serving | Latency (users feel delay) |
| Pipelines / batch | Throughput (how fast data moves through) |
| Storage reads | Latency |
| Storage writes (bulk) | Throughput |

## The Queuing Insight

At high load, small increases in utilization cause disproportionate increases in queue wait time. This is why tail latency (p99) gets dramatically worse under load even when median latency looks fine. Averages hide this; use percentiles.

## Cross-Domain Connections

- [[Infrastructure Service Level Tiering]]: the Bigtable partition solution
- [[Percentiles Over Averages for Latency]]: queuing effects amplify tails
- [[SLI SLO SLA Hierarchy]]: throughput and latency are both common SLIs
- [[Optimistic Locking as Ledger Scalability Solution]]: trades latency for throughput

---

*Extracted: 2026-03-26*
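
## Sketch: Utilization vs. Latency

The queuing insight in this note can be made concrete with a small calculation. The sketch below assumes an M/M/1 queue (a single server, random arrivals, exponentially distributed service times). That model is an assumption made for illustration and does not come from the note, and real systems will deviate from it. The function name `mm1_latency` and the 100 requests/second service rate are hypothetical. Under this model, time in the system is exponentially distributed with rate `service_rate - arrival_rate`, which gives closed forms for the mean and the percentiles.

```python
import math


def mm1_latency(arrival_rate: float, service_rate: float) -> dict:
    """Latency figures for an M/M/1 queue (an assumed model, for illustration).

    arrival_rate: offered load in requests/second (this is the throughput).
    service_rate: requests/second the server can complete when saturated.
    """
    if not 0 <= arrival_rate < service_rate:
        raise ValueError("need 0 <= arrival_rate < service_rate for a stable queue")

    utilization = arrival_rate / service_rate
    spare_rate = service_rate - arrival_rate  # unused capacity, requests/second

    # In M/M/1, time in system is exponential with rate (service - arrival),
    # so the q-th quantile is -ln(1 - q) / spare_rate.
    return {
        "utilization": utilization,
        "mean_s": 1.0 / spare_rate,
        "p50_s": math.log(2) / spare_rate,
        "p99_s": math.log(100) / spare_rate,
    }


if __name__ == "__main__":
    SERVICE_RATE = 100.0  # hypothetical capacity: 100 requests/second

    for arrival_rate in (50.0, 80.0, 90.0, 95.0, 99.0):
        r = mm1_latency(arrival_rate, SERVICE_RATE)
        print(
            f"throughput={arrival_rate:5.1f} rps  "
            f"util={r['utilization']:.2f}  "
            f"mean={r['mean_s'] * 1000:7.1f} ms  "
            f"p50={r['p50_s'] * 1000:7.1f} ms  "
            f"p99={r['p99_s'] * 1000:7.1f} ms"
        )
```

With these assumed numbers, mean latency is 20 ms at 50% utilization, 100 ms at 90%, and 1000 ms at 99%. Moving from 90 to 99 requests/second buys 10% more throughput at the price of a tenfold increase in latency, which is the tension the note describes.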