Thundering Herd from Synchronized Scheduling - Nestor G Pestelos Jr (ngpestelos)

When people configure a "daily cron job," they commonly choose midnight. In a distributed system with thousands of teams, this creates massive simultaneous load spikes: 30+ MapReduce jobs spawning thousands of workers at the same instant.

## Google's Fix: Hash-Based Time Distribution

Google extended crontab with a `?` wildcard, meaning "any value is acceptable, system chooses." The system hashes the job configuration over the time range (e.g., 0-23 for hours) to distribute launches evenly. Users opt in by replacing specific times with `?`.

Despite this, load remains spiky because some jobs have hard temporal dependencies (e.g., must run after a daily data export completes at a specific time).

## The Deeper Pattern

Synchronized behavior in distributed systems creates correlated load spikes. The same problem appears in:

- **Cache stampede**: Many clients discover cache expiry simultaneously and all hit the backend
- **Fleet restarts**: Rolling restart with insufficient jitter causes periodic capacity dips
- **Retry storms**: Exponential backoff without jitter causes synchronized retry waves
- **Auto-scaling**: All instances scaling simultaneously based on the same metric threshold

The universal fix: add randomization (jitter) to break synchronization. Google's `?` wildcard is jitter for scheduling.
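
To make the two flavors of jitter concrete, here is a minimal Python sketch. It is not Google's implementation: the function names `resolve_wildcard` and `backoff_with_jitter`, the choice of SHA-256, and the base/cap defaults are all assumptions made for illustration. The first function picks a stable value for a `?` field by hashing the job's configuration over the allowed range. The second applies random jitter to exponential backoff so that retrying clients do not wake up in lockstep.

```python
import hashlib
import random
from typing import Optional


def resolve_wildcard(job_config: str, low: int, high: int) -> int:
    """Pick a stable value in [low, high] for a `?` field.

    Hypothetical helper: hashes the job's configuration text so the same
    job always lands on the same slot, while different jobs spread out
    roughly evenly across the range.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    digest = hashlib.sha256(job_config.encode("utf-8")).digest()
    span = high - low + 1
    # Use the first 8 bytes of the digest as an unsigned integer.
    return low + int.from_bytes(digest[:8], "big") % span


def backoff_with_jitter(
    attempt: int,
    base: float = 0.5,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a randomized delay in seconds for the given retry attempt.

    The exponential ceiling is base * 2**attempt, limited to `cap`; the
    actual delay is drawn uniformly between 0 and that ceiling so that
    clients failing together do not retry together.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # A job declared with `?` in the hour field gets a fixed hour in 0-23.
    hour = resolve_wildcard("team-a/daily-report: ? ? * * *", 0, 23)
    print(f"daily-report runs at hour {hour}")

    # Delays for the first few retries of one client.
    for attempt in range(4):
        print(f"retry {attempt}: sleep {backoff_with_jitter(attempt):.2f}s")
```

Hashing gives each job a deterministic slot, so its schedule does not move between runs. The random draw in the backoff helper is deliberately non-deterministic, because the goal there is only to decorrelate clients.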