Find the Application Metric That Predicts the Ceiling - Nestor G Pestelos Jr (ngpestelos)

To forecast a system ceiling, find the *application-level* metric that actually predicts it; it may be a derived ratio rather than a raw count. Flickr's user databases hit replication lag at 40% disk I/O wait, but neither user count nor photo count predicted that wait: servers with 450,000 users sat at healthy I/O wait while others with 300,000 ran hot. The metric that tracked I/O wait was the **photos-to-user ratio**. An elbow appeared around a ratio of 85–90 (I/O wait jumping past 30%); the 40% ceiling corresponded to a ratio of ~110, so they kept databases in the 80–100 band. The payoff is twofold: the ratio is forecastable (uploads and registrations are tracked daily), and it is *controllable*, because federating (sharding) data lets you distribute high-volume users across databases to hold the ratio down. Tie a hardware ceiling to a measurable, steerable user-level metric and you can both predict and manage it.

---

*Source: [[The Art of Capacity Planning]] (John Allspaw, O'Reilly 2008), Ch. 4, Predicting Trends*
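
The ratio check and forecast described in this note can be sketched in a few lines of Python. This is a minimal illustration, not Flickr's tooling: the `Shard` class, the function names, and the sample photo counts and daily rates are all hypothetical. Only the thresholds (elbow near 85, band topping out at 100, ceiling near 110) come from the note. The forecast assumes constant daily uploads and registrations, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted in the note; the constant names are hypothetical.
ELBOW_RATIO = 85.0        # I/O wait starts jumping past 30% around 85-90
TARGET_MAX_RATIO = 100.0  # upper end of the 80-100 operating band
CEILING_RATIO = 110.0     # roughly where I/O wait reaches the 40% ceiling


@dataclass
class Shard:
    """One user database (hypothetical structure for illustration)."""
    name: str
    users: int
    photos: int

    @property
    def ratio(self) -> float:
        # The derived metric: photos per user on this database.
        return self.photos / self.users


def classify(shard: Shard) -> str:
    """Place a shard's photos-to-user ratio relative to the note's thresholds."""
    r = shard.ratio
    if r >= CEILING_RATIO:
        return "over ceiling"
    if r > TARGET_MAX_RATIO:
        return "above band"
    if r >= ELBOW_RATIO:
        return "near elbow"
    return "healthy"


def days_until_ratio(shard: Shard, daily_photos: float, daily_users: float,
                     threshold: float = TARGET_MAX_RATIO) -> Optional[float]:
    """Days until the ratio reaches `threshold`, assuming constant daily rates.

    Solves (photos + p*t) / (users + u*t) = threshold for t.
    Returns 0.0 if already at or past the threshold, and None if the
    ratio never reaches it at these rates.
    """
    gap = threshold * shard.users - shard.photos
    if gap <= 0:
        return 0.0
    closing_rate = daily_photos - threshold * daily_users
    if closing_rate <= 0:
        return None  # new users dilute the ratio as fast as uploads raise it
    return gap / closing_rate


# Illustrative data only: user counts echo the note, photo counts are invented.
shards = [
    Shard("db-a", users=450_000, photos=31_500_000),
    Shard("db-b", users=300_000, photos=33_000_000),
]
for s in shards:
    print(f"{s.name}: ratio={s.ratio:.1f} -> {classify(s)}")
```

In this made-up data the larger database is the healthier one, which mirrors the note's point that raw user count does not predict I/O wait. A shard flagged "above band" or with a short `days_until_ratio` would be a candidate for moving high-volume users elsewhere.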