**Parent Topic**: [[System Design/README]]

On **February 27, 2011**, Gmail lost a significant amount of user data despite many internal safeguards. The recovery was the first large-scale use of **GTape**, Gmail's global offline (tape) backup system, to restore live customer data.

Outcome: Google delivered an estimate of restore time, **restored all affected accounts within several hours of that estimate, and recovered 99%+ of the data** before the estimated completion time. The accuracy wasn't luck: it came from planning and many prior *simulated* restores (dress rehearsals).

Two lessons:

- **Defense in depth across media.** The public was surprised that Google used *tape*, given its disks and fast network. But tape is a distinct layer guarding against (a) failure of the internal Gmail redundancy/backup subsystems, and (b) a widespread failure or zero-day vulnerability in a device driver or filesystem affecting the underlying disk medium. This loss exceeded what internal disk-based means could recover, so the offline layer mattered.
- **Rehearsed coordination.** Many teams, some unrelated to Gmail, pitched in; recovery succeeded because a central, choreographed plan existed, the product of regular dry runs. Google treats such failures as inevitable and plans for both foreseeable failures and random undifferentiated breakage.

*Source: [[Site Reliability Engineering]], Ch 26, Data Integrity: What You Read Is What You Wrote*