In the airline outage, the team had long ago written scripts to take thread dumps of every Java application and snapshots of the databases. Nygard calls this "the perfect balance": **it's not improvised, it does not prolong the outage, yet it aids postmortem analysis.** The tension in any incident is between recovering fast (which destroys the failure state) and understanding the failure (which needs that state preserved). Pre-built, one-command capture resolves it: run it reflexively at the start of an incident, then restart and recover. The forensic snapshot — thread states, lock owners, pool occupancy, DB state — is frozen for later analysis even though the live system is rebooted seconds later. A thread dump in particular is "an open book": it reveals an application's third-party libraries, thread pools and their occupancy, background processing, and protocols — even with no source access. Build the capture tooling *before* the incident; you will never have time to write it during one. --- *Source: [[Release It Second Edition]] (Michael T. Nygard, Pragmatic Bookshelf 2018) — Ch 2 — Case Study: The Exception That Grounded an Airline*