# Thundering Herd Problem

## Core Concept

The thundering herd problem occurs when a large number of processes, threads, or clients simultaneously compete for the same resource — overwhelming the system that serves it. Originally an OS-level problem (all sleeping processes are woken by a single event, but only one can proceed), it now most commonly manifests in distributed systems as **cache stampede**: a popular cache key expires, and hundreds of concurrent requests bypass the cache and hit the database simultaneously, causing a load spike that can cascade into full system collapse. The term also covers **retry storms**: after a brief service outage, well-intentioned client retry logic fires simultaneously, creating a second wave of load that prevents recovery.

## Mitigations

| Strategy | How it works | Trade-off |
|----------|-------------|-----------|
| **Exponential backoff + jitter** | Clients spread retries randomly over time | Adds latency to individual requests |
| **Distributed locking** | One request per cache key fetches from the DB; others wait | Lock contention; needs per-resource locks, not a global one |
| **Request coalescing** | Deduplicate in-flight requests for the same key | Requires a coordination layer (e.g., singleflight in Go) |
| **Proactive cache refresh** | Refresh before expiration, not after | Wasted work if the key is rarely accessed |
| **Staggered TTLs** | Add random jitter to cache expiration times | Complicates cache-invalidation reasoning |
| **Circuit breakers** | Halt requests during detected overload | Requires tuning; too aggressive = unnecessary downtime |
| **Rate limiting + load shedding** | API rejects excess requests with `Retry-After` | Clients must respect headers; not all do |

## Key Evidence (from vault)

- **Rails cache stampede**: a 24-hour `Rails.cache` expiration caused 25 simultaneous COUNT queries when caches expired together.
Fixed with batch preload via a database view — check the preloaded value *before* the cache lookup, eliminating the stampede trigger entirely.
- **Optimistic locking retries**: exponential backoff + jitter prevents a thundering herd during retry-on-conflict patterns. See [[Optimistic Locking as Ledger Scalability Solution]].
- **Google outage** (Cindy Sridharan): the remediation itself caused a thundering herd because there was no exponential backoff in the recovery path.
- **Disney+ Hotstar**: jitter + caching + pre-warming as a multi-layered defense for massive concurrent-viewership events.
- **Facebook Video**: per-resource request queuing to prevent unrelated requests from blocking each other during stampedes.

## Connections

- [[Optimistic Locking as Ledger Scalability Solution]] — backoff + jitter is the shared mitigation pattern
- [[Structured Refeeding After Fasting]] — same structural pattern across domains (see below)

## Cross-Domain: Graduated Load Restoration

The thundering herd and structured refeeding share a deep structural parallel: **a system in low-activity mode cannot safely absorb full load instantaneously.**

| Dimension | Thundering herd | Structured refeeding |
|-----------|----------------|---------------------|
| **System state** | Cache empty / service just recovered | Digestive enzymes downregulated, migrating motor complex active |
| **Dangerous input** | Hundreds of simultaneous requests | Large meal, high-fat/sugar bolus |
| **Failure mode** | Cascading overload, system collapse | Blood sugar spike → crash, digestive distress |
| **Core mitigation** | Stagger, jitter, gradual traffic ramp | Dates + water → pause → balanced plate |
| **The pause matters** | Backoff interval lets the system recover between retries | 15–30 min lets glucose absorb and enzymes activate |
| **Pre-warming** | Proactive cache refresh before expiration | Pre-fast meal (suhoor) loads slow-release fuel before downtime |
| **Wrong instinct** | "Retry immediately and harder" | "Eat everything now" (feast response) |

The failure in both domains comes from **correct behavior applied without pacing**. Retrying is correct. Eating after fasting is correct. The damage comes from doing it all at once. The Ramadan protocol's dates → pause → balanced plate maps to cache warm-up → gradual traffic ramp → full load. The shared discipline: **resist the impulse to restore full capacity instantly.**

## Implications

The thundering herd is a failure mode, not a bug — it emerges from correct behavior (caching, retrying) under specific conditions (synchronized expiration, correlated failures). The most reliable defense is **preventing synchronization** in the first place: staggered TTLs, jitter on retries, proactive refresh. Reactive defenses (circuit breakers, rate limiting) are safety nets, not primary controls. The preload pattern applies broadly: if you can compute the value eagerly, check the preload *before* the cache — eliminating the stampede trigger entirely.
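The backoff + jitter mitigation can be sketched in a few lines. This is a minimal illustration of the "full jitter" variant (retry delay drawn uniformly from zero up to the exponential cap); the function names are mine, not from any library:

```python
import random
import time

def full_jitter_delay(attempt, base=0.1, cap=10.0):
    """Delay in [0, min(cap, base * 2**attempt)] seconds.

    Randomizing over the full window decorrelates clients, so a fleet
    that failed together does not retry together.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def retry_with_backoff(operation, max_attempts=5):
    """Run `operation`, retrying with full-jitter backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(full_jitter_delay(attempt))
```

The key point is the `random.uniform` call: plain exponential backoff without jitter still lets clients that failed at the same instant retry at the same instant, just less often.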
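Staggered TTLs are even simpler to sketch: desynchronize expiration at write time. A minimal, hypothetical helper (the name and the `spread` parameter are mine):

```python
import random

def jittered_ttl(base_ttl_seconds, spread=0.1):
    """Return a TTL within +/- `spread` of the base, so keys written at the
    same moment (e.g., after a deploy or a cache flush) expire at different
    moments instead of all at once."""
    return base_ttl_seconds * random.uniform(1 - spread, 1 + spread)

# Hypothetical usage with a cache client:
#   cache.set(key, value, ttl=jittered_ttl(86400))  # ~24 h, +/- 10%
```

This directly targets the synchronized-expiration trigger from the Rails incident above: with a fixed 24-hour TTL all 25 caches expired together; with ±10% jitter their expirations spread over roughly a five-hour window.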
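Request coalescing — the singleflight pattern the mitigation table credits to Go — can also be sketched in Python: the first caller for a key becomes the leader and computes the value, while concurrent callers for the same key wait and share the leader's result. The class name and structure here are mine, a simplified sketch of the idea rather than a port of Go's `singleflight` package:

```python
import threading

class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> {"event", "result", "error"}

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"event": threading.Event(), "result": None, "error": None}
                self._inflight[key] = entry
        if leader:
            try:
                entry["result"] = fn()
            except Exception as exc:
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]  # later calls start a fresh flight
                entry["event"].set()         # release the waiting followers
        else:
            entry["event"].wait()
        if entry["error"] is not None:
            raise entry["error"]
        return entry["result"]
```

During a stampede on one hot key, the database sees one query instead of hundreds; per-key entries (rather than one global lock) avoid the contention trade-off noted in the table.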