Principles of Elastic Capacity Planning - Nestor G Pestelos Jr (ngpestelos)

NotebookLM: *What do I need to understand when planning capacity for a cloud-based software system?*

When planning capacity for a cloud-based software system, you are no longer dealing with the traditional constraints of fixed physical infrastructure. Instead, cloud elasticity shifts the question from "can we serve the load?" to **"what kind of user experience can we afford?"** This trade-off can be measured in **cost per visitor-second**: the goal is to find the sweet spot where the revenue generated by a fast, responsive user experience outweighs the cost of providing it.

To effectively plan, model, and maintain this capacity, you must understand several core principles across metrics, forecasting, scaling, and system architecture:

### 1. Choose the Right Metrics

- **Discard QPS for CPU-seconds:** "Queries per second" (QPS) is a poor proxy for capacity because the cost of individual queries can vary by 1,000x or more, and a simple software update can change that ratio overnight. Any model relying on QPS chases a moving target and silently degrades capacity planning. Instead, model capacity using actual resources, such as tracking customer quotas in normalized CPU-seconds per second.
- **Track Goodput over Throughput:** Raw throughput is a deceptive metric because a saturated system will continue accepting requests while increasingly failing or timing out. **Goodput** measures only the subset of requests handled successfully _and_ with low latency. When load testing or planning capacity headroom, use goodput as the true signal of capacity.
- **Understand the Throughput vs. Latency Tension:** Optimizing for one fundamentally degrades the other. Maximizing throughput fills up queues and spikes tail latency, whereas optimizing for low latency requires keeping queues short, which leaves the system frequently idle. You must provision user-facing serving systems with slack capacity for latency, while batch pipelines can run hot for throughput.

### 2. Forecasting and Growth Dynamics

- **The SRE Capacity Planning Triad:** Effective capacity planning requires three mandatory steps: accurately forecasting organic demand, incorporating inorganic demand (like marketing campaigns and feature launches), and conducting regular load testing.
- **Maintain a 6-Month Headroom:** Always try to maintain at least six months' worth of capacity headroom. This buffer absorbs unexpected traffic spikes and protects your system if a highly anticipated software replacement is delayed by bugs.
- **Plan for Exponential, Not Linear, Growth:** Modern application features like mobile usage and real-time polling compound load by orders of magnitude. Do not plan linearly; success compounds, and your current capacity is likely already insufficient.
- **Anticipate Launch Spikes:** Product launches are uniquely dangerous because public interest is notoriously unpredictable. Spikes can reach **15x to 25x normal peak traffic**. The launch traffic mix also differs from steady-state traffic, which invalidates standard load tests. To de-risk launches, execute gradual rollouts (e.g., one region at a time) and ensure redundant capacity is secured well in advance.

### 3. Scaling Strategies and Automation

- **Leverage Auto-Scaling Loops:** Use simple control loops, such as a queue-depth auto-scaler that launches or removes instances to keep worker queues empty. These systems not only handle traffic fluctuations but can also automatically absorb the impact of unexpected performance bugs until you fix them.
- **Adopt Intent-Based Planning:** Manual bin-packing using spreadsheets is brittle. Shift toward intent-based capacity planning, where you declare abstract requirements (e.g., "Run at 5 nines reliability" or "Meet demand with N+2 redundancy") and use computational optimization to autogenerate your resource allocations.
- **Avoid "Ops Mode":** Do not solve scaling by linearly adding more administrators as traffic grows. The goal is to write software and eliminate scalability bottlenecks so human headcount only increases with system _complexity_, not with load.
- **Understand Database Limits:** If using a relational database, you have four main levers: caching, query optimization, buying bigger hardware, and sharding. Remember that adding replicas only scales _reads_, not writes. A write-heavy application will eventually hit a hard limit on the master database's capacity.

### 4. Architectural Blind Spots

- **Apply Data-Plane Rigor to Control Planes:** A widespread failure in capacity planning is treating control planes (DNS records, feature flags, routing tables) like simple configuration files. Control planes require strict consensus mechanisms. Importantly, you must load-test the recovery path: when a global system heals, backlog arrival rates can be 10–100× steady state, turning an unthrottled recovery into its own devastating outage.
- **Justify Storage with Data:** Storage is one of the most expensive infrastructure components. Never request or approve storage capacity without strict, data-backed requirements, including workload patterns and 6/12/18-month capacity projections.
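
The 6/12/18-month projections mentioned under "Justify Storage with Data" can be sketched as a compound-growth calculation, which also reflects the advice to plan for exponential rather than linear growth. This is a minimal illustration only: it assumes a constant month-over-month growth rate, and the function names and the sample figures (40 TB today, 5% monthly growth) are hypothetical, not taken from the source.

```python
def project_storage_tb(current_tb: float, monthly_growth: float, months: int) -> float:
    """Project storage use, ASSUMING constant compound monthly growth."""
    return current_tb * (1.0 + monthly_growth) ** months


def projection_table(current_tb: float, monthly_growth: float,
                     horizons=(6, 12, 18)) -> dict:
    """Return {months_ahead: projected_tb} for each planning horizon."""
    return {m: round(project_storage_tb(current_tb, monthly_growth, m), 1)
            for m in horizons}


if __name__ == "__main__":
    # Hypothetical inputs: 40 TB in use today, growing 5% per month.
    for months, tb in projection_table(40.0, 0.05).items():
        print(f"{months:>2} months: {tb} TB")
```

A real request would replace the constant growth rate with one fitted to observed workload data, and would add inorganic demand such as launches on top.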
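
The distinction between throughput and goodput from section 1 can be made concrete with a small calculation. This is a rough sketch, assuming each request is recorded with a success flag and a latency; the `Request` type and the 200 ms latency threshold are illustrative assumptions rather than anything the source specifies.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # did the request succeed?
    latency_ms: float   # observed latency in milliseconds


def throughput(requests, window_s: float) -> float:
    """All handled requests per second, regardless of outcome."""
    return len(requests) / window_s


def goodput(requests, window_s: float, latency_slo_ms: float = 200.0) -> float:
    """Requests per second that succeeded AND met the latency threshold."""
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_slo_ms)
    return good / window_s


if __name__ == "__main__":
    # Hypothetical 2-second sample from a saturated service.
    sample = [Request(True, 50), Request(True, 500),
              Request(False, 20), Request(True, 200)]
    print("throughput:", throughput(sample, 2.0))
    print("goodput:   ", goodput(sample, 2.0))
```

In a load test, the point where goodput flattens or drops while throughput keeps rising is a reasonable candidate for the system's usable capacity.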
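
The queue-depth auto-scaler described in section 3 can be sketched as a single decision step in a control loop. This is one possible policy under stated assumptions, not a description of any particular product: the function name, the `drain_per_worker` parameter (jobs one worker clears per interval), and the "scale down one worker at a time" rule are all hypothetical choices made for illustration.

```python
import math


def desired_workers(queue_depth: int, current: int, drain_per_worker: int,
                    min_workers: int = 1, max_workers: int = 100) -> int:
    """One step of a queue-depth control loop.

    Scale up by enough workers to drain the backlog within one interval;
    scale down by a single worker when the queue is empty.
    """
    if queue_depth > 0:
        target = current + math.ceil(queue_depth / drain_per_worker)
    else:
        target = current - 1
    # Clamp to the allowed fleet size so a bug cannot scale without bound.
    return max(min_workers, min(max_workers, target))


if __name__ == "__main__":
    print(desired_workers(queue_depth=25, current=5, drain_per_worker=10))
    print(desired_workers(queue_depth=0, current=5, drain_per_worker=10))
```

The upper clamp matters in practice: a loop like this will absorb a performance regression by adding instances, so the cap bounds how much that regression can cost before someone notices.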