Cascading Failure — Cracks Jump Layers - Nestor G Pestelos Jr (ngpestelos)

A **cascading failure** occurs when a crack in one layer triggers a crack in a *calling* layer — problems cascade *upward* through the dependency tree (drawn roots-to-sky). If a database cluster goes dark, every caller has a problem; whether that becomes a cascade depends entirely on how the caller is written. If integration points are the number-one *source* of cracks, **cascading failures are the number-one crack *accelerator*** — so "preventing cascading failures is the very key to resilience." Modern systems of dozens or hundreds of interlinked services give cascades enormous room to run; **high fan-in** services (many callers) spread their problems widest and deserve extra scrutiny. The failure "jumps the gap" through a transmission mechanism — most often **blocked threads** and **resource pools drained by a lower-layer failure** (integration points without timeouts are a surefire way to create one). Sometimes the mechanism is the *reverse*: an overly aggressive caller. The defenses are **Circuit Breaker** (stop calling the troubled point) and **Timeouts** (guarantee you can return from a call that does go out), plus resource pools that always bound how long a thread waits to check out. --- *Source: [[Release It Second Edition]] (Michael T. Nygard, Pragmatic Bookshelf 2018) — Ch 4 — Stability Antipatterns*