# Thundering Herd Problem
## Core Concept
The thundering herd problem occurs when a large number of processes, threads, or clients simultaneously compete for the same resource — overwhelming the system that serves it. Originally an OS-level problem (all sleeping processes woken by a single event, only one can proceed), it now most commonly manifests in distributed systems as **cache stampede**: a popular cache key expires, and hundreds of concurrent requests bypass the cache and hit the database simultaneously, causing a load spike that can cascade into full system collapse.
The term also covers **retry storms**: after a brief service outage, well-intentioned client retry logic fires simultaneously, creating a second wave of load that prevents recovery.
## Mitigations
| Strategy | How it works | Trade-off |
|----------|-------------|-----------|
| **Exponential backoff + jitter** | Clients spread retries over time randomly | Adds latency to individual requests |
| **Distributed locking** | One request per cache key fetches from DB; others wait | Lock contention; need per-resource locks, not global |
| **Request coalescing** | Deduplicate in-flight requests for the same key | Requires coordination layer (e.g., singleflight in Go) |
| **Proactive cache refresh** | Refresh before expiration, not after | Wasted work if key is rarely accessed |
| **Staggered TTLs** | Add random jitter to cache expiration times | Complicates cache invalidation reasoning |
| **Circuit breakers** | Halt requests during detected overload | Requires tuning; too aggressive = unnecessary downtime |
| **Rate limiting + load shedding** | Server rejects excess requests with a `Retry-After` header | Clients must respect the header; not all do |
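The backoff-with-jitter row can be sketched as below. This is a minimal illustration using the "full jitter" variant (sleep a uniform random time up to an exponentially growing cap); `retry_with_backoff` and its parameter values are illustrative names, not from any source above:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `operation` on exception, sleeping a randomly jittered,
    exponentially growing interval between attempts so that many clients
    retrying at once spread out instead of re-synchronizing."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last failure
            # Full jitter: uniform random sleep in [0, cap], where the
            # cap doubles each attempt up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

The randomness is the point: a fixed exponential schedule still lets clients that failed together retry together; the uniform draw decorrelates them.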
## Key Evidence (from vault)
- **Rails cache stampede**: 24-hour `Rails.cache` expiration caused 25 simultaneous COUNT queries when caches expired together. Fixed with a batch preload via a database view — check the preloaded value *before* the cache lookup, eliminating the stampede trigger entirely.
- **Optimistic locking retries**: Exponential backoff + jitter prevents thundering herd during retry-on-conflict patterns. See [[Optimistic Locking as Ledger Scalability Solution]].
- **Google outage** (Cindy Sridharan): remediation itself caused a thundering herd because there was no exponential backoff in the recovery path.
- **Disney+ Hotstar**: jitter + caching + pre-warming as multi-layered defense for massive concurrent viewership events.
- **Facebook Video**: per-resource request queuing to prevent unrelated requests from blocking each other during stampedes.
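The request-coalescing and per-resource-queuing patterns above share one mechanism: at most one fetch per key is in flight, and concurrent callers wait for and share its result. A minimal thread-based sketch (the `Coalescer` class is an illustrative stand-in for Go's `singleflight`, not a real library API):

```python
import threading

class Coalescer:
    """Deduplicate concurrent fetches for the same key: the first caller
    (the leader) runs `fetch`; callers arriving while it is in flight
    wait for and share that result instead of hitting the backend."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, shared result holder)

    def get(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, holder = entry
        if leader:
            try:
                holder["value"] = fetch()
            except Exception as exc:
                holder["error"] = exc  # followers re-raise the same error
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
        else:
            event.wait()
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

Note the per-key granularity: contention is scoped to a single cache key, matching the "per-resource locks, not global" caveat in the mitigations table.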
## Connections
- [[Optimistic Locking as Ledger Scalability Solution]] — backoff + jitter is the shared mitigation pattern
- [[Structured Refeeding After Fasting]] — same structural pattern across domains (see below)
## Cross-Domain: Graduated Load Restoration
The thundering herd and structured refeeding share a deep structural parallel: **a system in low-activity mode cannot safely absorb full load instantaneously.**
| Dimension | Thundering herd | Structured refeeding |
|-----------|----------------|---------------------|
| **System state** | Cache empty / service just recovered | Digestive enzymes downregulated, migrating motor complex active |
| **Dangerous input** | Hundreds of simultaneous requests | Large meal, high-fat/sugar bolus |
| **Failure mode** | Cascading overload, system collapse | Blood sugar spike → crash, digestive distress |
| **Core mitigation** | Stagger, jitter, gradual traffic ramp | Dates+water → pause → balanced plate |
| **The pause matters** | Backoff interval lets system recover between retries | 15-30 min lets glucose absorb and enzymes activate |
| **Pre-warming** | Proactive cache refresh before expiration | Pre-fast meal (suhoor) loads slow-release fuel before downtime |
| **Wrong instinct** | "Retry immediately and harder" | "Eat everything now" (feast response) |
The failure in both domains comes from **correct behavior applied without pacing**. Retrying is correct. Eating after fasting is correct. The damage comes from doing it all at once. The Ramadan protocol's dates → pause → balanced plate maps to cache warm-up → gradual traffic ramp → full load.
The shared discipline: **resist the impulse to restore full capacity instantly.**
## Implications
The thundering herd is a failure mode, not a bug — it emerges from correct behavior (caching, retrying) under specific conditions (synchronized expiration, correlated failures). The most reliable defense is **preventing synchronization** in the first place: staggered TTLs, jitter on retries, proactive refresh. Reactive defenses (circuit breakers, rate limiting) are safety nets, not primary controls. The preload pattern applies broadly: if you can compute the value eagerly, check the preload *before* the cache — eliminating the stampede trigger entirely.
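The "preventing synchronization" principle is cheap to apply at write time. A minimal sketch of staggered TTLs, assuming a cache API that accepts a TTL in seconds; `jittered_ttl` and the ±10% spread are illustrative choices:

```python
import random

def jittered_ttl(base_ttl_seconds, spread=0.1):
    """Return a TTL randomized within +/- `spread` of the base value,
    so keys written in the same burst do not all expire in the same
    instant and trigger a stampede together."""
    return base_ttl_seconds * random.uniform(1 - spread, 1 + spread)

# e.g. cache.set(key, value, ttl=jittered_ttl(86400))  # ~24h, desynchronized
```

A fixed 24-hour TTL (as in the Rails incident above) guarantees correlated expiry; the jitter converts one synchronized spike into a smear of refreshes spread over hours.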