The Art of Capacity Planning - Nestor G Pestelos Jr (ngpestelos)

**Author**: John Allspaw
**Publisher**: O'Reilly Media, First Edition, September 2008 (© Yahoo! Inc.)
**Pages**: 154 (PDF); ~131 content pages + 3 appendices
**ISBN-13**: 978-0-596-51857-8

%%
**PDF**: `~/Library/CloudStorage/[email protected]/My Drive/Books/The Art of Capacity Planning.pdf`
**Page offset**: book page 1 = PDF page 17 (offset **+16**)
**Topic folder**: `2 Resources/Topics/Software/Engineering/Capacity Planning/`
**Status**: ✅ COMPLETE (5 chapters + 3 appendices / 68 notes)
**Added**: 2026-06-18
**Completed**: 2026-06-18
**Last activity**: 2026-06-18
%%

## Progressive Summary

*The Art of Capacity Planning* (John Allspaw, O'Reilly 2008) argues capacity planning is an **empirical, iterative discipline, not a theoretical one**: measure your own system's real usage, find each component's ceiling, forecast from history, deploy ahead of growth, then repeat. Written at Flickr/Yahoo! (it opens with the July 2005 London-bombing traffic spike that nearly took Flickr down), it predates and seeds the DevOps/SRE literature, and is the direct predecessor to Allspaw's *Web Operations* (2010).

Five chapters trace one loop. **Set goals** (requirements, SLAs, the performance-vs-capacity distinction) → **measure** (the ceiling-finding method: tie a primary-function metric to a hardware resource, find the red line via real production load) → **predict** (curve-fit history to the ceiling, separating peak-driven from consumption-driven resources, with a safety margin) → **deploy** (automation to shrink the one provisioning phase you control). Of the three appendices, two extend the loop: to virtualization/cloud (same process at finer granularity, with cost as a new variable) and to instantaneous-growth firefighting (pre-built load-shedding switches, baking pages, serving stale). The third catalogs capacity tools.
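The **predict** step above can be sketched in a few lines. This is a minimal illustration, not the book's own code: the names `Forecast`, `weekly_peaks`, `fit_line`, and `forecast_run_out` are hypothetical, the sample numbers are invented, and a straight-line least-squares fit stands in for whatever curve actually suits the data. The safety margin is modeled here, as an assumption, by planning against a red line set below the measured ceiling.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Forecast:
    slope: float                       # growth per day, in the metric's own unit
    intercept: float                   # fitted value at day 0
    red_line: float                    # measured ceiling reduced by the safety margin
    days_to_red_line: Optional[float]  # None when the trend is flat or shrinking


def weekly_peaks(daily_values: Sequence[float]) -> List[float]:
    """Reduce daily samples to one peak per complete 7-day window.

    Peak-driven resources are trended on their peaks, not their averages.
    A trailing partial week is dropped so it cannot understate the peak.
    """
    return [max(daily_values[i:i + 7]) for i in range(0, len(daily_values) - 6, 7)]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; no trend can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_run_out(
    days: Sequence[float],
    values: Sequence[float],
    ceiling: float,
    safety_margin: float = 0.15,  # hypothetical default, not a figure from the book
) -> Forecast:
    """Trend the history forward and report when it reaches the red line."""
    slope, intercept = fit_line(days, values)
    red_line = ceiling * (1.0 - safety_margin)
    if slope <= 0:
        return Forecast(slope, intercept, red_line, None)
    crossing_day = (red_line - intercept) / slope
    # Measure the remaining time from the most recent observation.
    days_left = max(0.0, crossing_day - days[-1])
    return Forecast(slope, intercept, red_line, days_left)


if __name__ == "__main__":
    # Invented example: four weekly peaks against a measured ceiling of 200.
    days = [0, 7, 14, 21]
    peaks = [100.0, 110.0, 120.0, 130.0]
    print(forecast_run_out(days, peaks, ceiling=200.0, safety_margin=0.10))
```

Subtracting procurement lead time from `days_to_red_line` gives the latest date an order can be placed. With few data points the fit is fragile, so the result is better read as a planning prompt than as a prediction.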
The single most consequential move is the **ceilings + history → forecast** loop grounded in measuring the *right* metric, together with the repeated discovery that the right metric is non-obvious: disk **I/O wait**, not utilization; the **photos-to-user ratio**, not raw counts; **credit balance**, not CPU, for burstables.

Read in 90 seconds six months from now: measure empirically, find each component's red line in the unit users actually feel, forecast peaks and consumption to that line with a safety margin, automate deployment, and re-run the loop forever. Capacity is a process, not an event.

## Top Takeaways for Rereading

**Process & model**
- [[The Four-Step Capacity Planning Cycle]]: measure → predict → deploy → iterate
- [[Forecasting Requires Ceilings Plus History]]: the two essential inputs
- [[The Ceiling-Finding Method]]: the repeatable per-component procedure

**Measurement**
- [[Eliminate Healthy Resources to Find the Binding One]]: rule out the comfortable; find the constraint
- [[Disk IO Wait Predicts DB Lag Not Utilization]]: the obvious gauge can lie
- [[Find the Application Metric That Predicts the Ceiling]]: the predictor is often a derived ratio

**Forecasting**
- [[Forecast Peak-Driven Resources by Trending Peaks]]: trend the peaks, not the average
- [[Apply a Safety Factor Above the Ceiling]]: borrow the structural-engineering margin
- [[Don't Over-Fit Your Capacity Forecast]]: context beats R²

**Economics & operations**
- [[Don't Buy Capacity Before You Need It]]: JIT + Moore's Law
- [[Work Backward From Run-Out Using Procurement Time]]: schedule from the run-out date
- [[Pre-Build Load-Shedding Switches]]: the cheapest emergency capacity is spend you can stop

**Cross-cutting**
- [[Peak-Driven Capacity Differs From Consumption-Driven]]: peak envelope vs depletion timeline
- [[The Defining Metric Itself Can Change]]: re-confirm the binding metric after any change

## About

Allspaw's foundational text on **web capacity planning**, written at Flickr/Yahoo! and opening with the July 2005 London-bombing traffic spike that nearly took Flickr down. Its core argument: capacity planning is **empirical, not theoretical**. Measure your own system's real usage, find each resource's ceiling, predict trends from history, and buy/deploy ahead of the curve. "Back-of-the-envelope" math and honest measurement beat elaborate simulation.

This is the **direct predecessor** to Allspaw's later [[Book Inventory/Web Operations|Web Operations]] (2010) and a primary source the SRE-era literature draws on.

Five chapters + three appendices:
- **Ch 1**: Goals, Issues, and Processes in Capacity Planning
- **Ch 2**: Setting Goals for Capacity
- **Ch 3**: Measurement: Units of Capacity (the longest, most technical chapter)
- **Ch 4**: Predicting Trends
- **Ch 5**: Deployment
- **App A**: Virtualization and Cloud Computing
- **App B**: Dealing with Instantaneous Growth
- **App C**: Capacity Tools

## Cross-Corpus Note (record)

This book originated much of a corpus the vault already held in derived form ([[Book Inventory/Web Operations|Web Operations]], [[Book Inventory/Site Reliability Engineering|Site Reliability Engineering]], Release It). Synthesis applied **primary-beats-derivative**: captured Allspaw's canonical version where this 2008 book is the origin, cross-linked rather than duplicated where Web Ops/SRE/Release It already covered it (notably Ch 5 Deployment, ~70% pre-covered, and App A/B). App C (2008 tool catalog) yielded no notes by the dated-survey rule.
## Synthesis Progress

| Chapter | Title | Pages (book) | Status | Atomic Notes |
| ------- | ----- | ------------ | ------ | ------------ |
| Ch 1 | Goals, Issues, and Processes in Capacity Planning | 1–10 | Complete (9 notes) | [[The Four-Step Capacity Planning Cycle]], [[Find Each Component's Red-Line Number]], [[Tie System Stats to Business Metrics]], [[Procurement Is a Capacity-Planning Step]], [[Accept Current Performance as the Planning Baseline]], [[Capacity Planning Is Empirical Not Theoretical]], [[Architecture Affects Capacity More Than Tuning]], [[User-Generated Content Makes Growth Unpredictable]], [[Use Quick-and-Dirty Capacity Math]] |
| Ch 2 | Setting Goals for Capacity | 11–22 | Complete (11 notes) | [[Define Requirements Before Planning Capacity]], [[Interpret Synthetic Monitoring Before Trusting It]], [[SLA Nines Translate to Downtime Budgets]], [[Downtime Does Not Equal Linearly Lost Revenue]], [[API Capacity Is a Business Contract]], [[Slow Pages Are Not Always a Capacity Problem]], [[Define Ceilings by User-Facing Time Not System Metrics]], [[Split Architecture into Measurable Components]], [[Match Hardware Profile to Each Role's Bound Resource]], [[Diagonal Scaling Upgrades Horizontal Nodes]], [[Disaster Recovery Multiplies Capacity Cost]] |
| Ch 3 | Measurement: Units of Capacity | 23–62 | Complete (19 notes) | [[Choose Measurement Tools by Capability Not Brand]], [[RRD Trades Old Detail for Bounded Storage]], [[Treat Logs as Past Metrics]], [[Networks Are a Finite Capacity Too]], [[Least-Connections Balancing Breaks for Databases]], [[Load Balancers Are Capacity Instruments]], [[The Ceiling-Finding Method]], [[Eliminate Healthy Resources to Find the Binding One]], [[Test Ceilings with Real Production Load]], [[Single-Machine Load Testing Has Limits]], [[Storage Capacity Has Two Dimensions]], [[Peak-Driven Capacity Differs From Consumption-Driven]], [[Match Metric Resolution to the Trend]], [[Disk IO Wait Predicts DB Lag Not Utilization]], [[Database Ceilings Have Hidden Cliffs]], [[Cache Only What Changes Slowly]], [[Cache Ceilings Use Hit Ratio Not Just Request Rate]], [[Isolate Resource Use on Multi-Use Servers]], [[Measure API Usage Per Key]] |
| Ch 4 | Predicting Trends | 63–92 | Complete (16 notes) | [[Forecasting Requires Ceilings Plus History]], [[Don't Buy Capacity Before You Need It]], [[Don't Over-Fit Your Capacity Forecast]], [[Forecast Run-Out by Trending to the Ceiling]], [[Find the Application Metric That Predicts the Ceiling]], [[Forecast Peak-Driven Resources by Trending Peaks]], [[Small Data Sets Make Forecasts Fragile]], [[Automate Curve-Fitting into a Recurring Job]], [[Apply a Safety Factor Above the Ceiling]], [[Work Backward From Run-Out Using Procurement Time]], [[Adding Capacity Moves the Bottleneck]], [[Traffic Patterns Widen as Audience Globalizes]], [[Size Each Data Center for Its Partner's Full Load]], [[Capacity Planning Must Talk to Product]], [[Recalibrate Forecasts on a Moving Window]], [[The Defining Metric Itself Can Change]] |
| Ch 5 | Deployment | 93–104 | Complete (5 notes; ~70% dedup vs Web Ops/Release It deployment corpus) | [[Automation Shrinks the Provisioning Time You Control]], [[Homogenize Hardware Types]], [[Compose Roles from Reusable Services]], [[Drive Deployment from Inventory as Source of Truth]], [[Bring Up a Data Center from Bare Metal via Automation]] |
| App A | Virtualization and Cloud Computing | 105–120 | Complete (4 notes) | [[Treat Cloud Capacity as the Same Process]], [[Forecast Even When Deployment Is Instant]], [[Cloud Cost Is a Capacity Variable]], [[Drive Cloud Autoscaling with a Capacity Feedback Loop]] |
| App B | Dealing with Instantaneous Growth | 121–126 | Complete (4 notes) | [[Adding Servers Can't Fix Architectural Limits]], [[Pre-Build Load-Shedding Switches]], [[Bake Pages or Serve Stale Under Load]], [[Host Your Status Page Outside Your Data Center]] |
| App C | Capacity Tools | 127–130 | Complete (0 notes: 2008 tool catalog; durable tool concepts already captured in Ch 3/Ch 5 per the dated-survey dedup rule) | none |

**Final tally:** 68 synthesis notes (Ch 1–5: 60, App A–B: 8; App C none by design). All 9/9 lint.