Automated Diagnostic Capture Balances Speed and Insight

In the airline outage, the team had long ago written scripts to take thread dumps of every Java application and snapshots of the databases. Nygard calls this "the perfect balance": **it's not improvised, it does not prolong the outage, yet it aids postmortem analysis.** The tension in any incident is between recovering fast (which destroys the failure state) and understanding the failure (which needs that state preserved). Pre-built, one-command capture resolves it: run it reflexively at the start of an incident, then restart and recover. The forensic snapshot (thread states, lock owners, pool occupancy, DB state) is frozen for later analysis even though the live system is rebooted seconds later. A thread dump in particular is "an open book": it reveals an application's third-party libraries, thread pools and their occupancy, background processing, and protocols, even with no source access. Build the capture tooling *before* the incident; you will never have time to write it during one.

---

*Source: [[Release It Second Edition]] (Michael T. Nygard, Pragmatic Bookshelf 2018), Ch 2, Case Study: The Exception That Grounded an Airline*
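
To make "pre-built, one-command capture" concrete, here is a minimal sketch of what such a script might look like. It is not taken from the book. It assumes the JDK tools `jps` and `jstack` are on the `PATH` of the affected host; the function names, the `incident-<timestamp>` directory layout, and the 10-second timeout are hypothetical choices made for illustration. Database snapshots are left out.

```python
import datetime
import pathlib
import subprocess


def list_jvm_pids(runner=subprocess.run):
    """Return the PIDs of local JVMs as reported by `jps -q` (JDK tool)."""
    result = runner(["jps", "-q"], capture_output=True, text=True, timeout=10)
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


def capture_thread_dumps(out_root, pids, runner=subprocess.run, timeout_s=10):
    """Write one `jstack -l` dump per PID into a new timestamped directory.

    `runner` is injectable so the function can be exercised without a JVM.
    The directory naming is an illustrative assumption.
    """
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    out_dir = pathlib.Path(out_root) / f"incident-{stamp}"
    out_dir.mkdir(parents=True, exist_ok=True)
    for pid in pids:
        target = out_dir / f"jstack-{pid}.txt"
        try:
            # `-l` adds lock information (owners of synchronizers) to the dump.
            result = runner(
                ["jstack", "-l", str(pid)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            target.write_text(result.stdout or result.stderr)
        except (OSError, subprocess.SubprocessError) as exc:
            # A failed capture must not prolong the outage: note it, move on.
            target.write_text(f"capture failed: {exc}\n")
    return out_dir
```

A small wrapper script would call `capture_thread_dumps("/var/tmp/diagnostics", list_jvm_pids())` (the path is a placeholder) and print the returned directory. The per-process timeout and the swallowed errors reflect the constraint in the note above: capture must never delay the restart, so a hung `jstack` is recorded and skipped rather than waited on.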