## Core Insight
There is a fundamental distinction between automated systems (external scripts replacing manual steps) and autonomous systems (self-managing systems that need no human intervention by design). Automated systems still require humans as a fallback; autonomous systems treat self-healing as an intrinsic feature. The transition requires treating infrastructure as a software problem — using APIs, data distribution, and distributed system principles rather than scripts and SSH.
## The Borg Example
Google evolved from machine-specific assignments → state-tracking databases → Borg, which treats machines as a managed sea of schedulable resources with API calls to a central coordinator. Result: thousands of machines are born, die, and go to repair daily with zero SRE effort. Machine lifecycle management became a no-op.
## The Decider Example
MySQL on Borg needed failovers every 1-2 weeks (Borg tasks move frequently). Manual failover: 30-90 minutes. Decider daemon: <30 seconds, 95% of the time. Results: 95% reduction in operational work, 60% hardware freed through bin-packing. The team shifted from "optimizing for lack of failover" to "embracing failure as inevitable, optimizing to recover quickly."
## The Human Skill Decay Problem
Highly effective automation progressively relieves humans of direct system contact. When automation eventually fails, humans can no longer operate the system — reactions lose fluidity, mental models diverge from reality. This argues for regular practice drills and maintaining operational awareness even with high automation.
## The Diskerase Cautionary Tale
Automation bug interpreted an empty set (correctly meaning "no machines left to wipe") as a special value meaning "everything." Nearly all CDN machines were wiped within minutes. Lessons: rate limiting, sanity checks, idempotent workflows, never rely on implicit safety signals.
## Design-Phase Recommendation
Autonomous operation is difficult to retrofit. Standard software engineering practices help: decoupled subsystems, APIs, minimized side effects. The highest leverage is building systems that don't need automation, not automating existing manual procedures.
## Source
- [[Site Reliability Engineering - Chapter 7 - The Evolution of Automation at Google|SRE Ch 7: The Evolution of Automation at Google]] by Niall Murphy
## Related Concepts
- [[Automation Value Hierarchy in SRE]]
- [[Automation Ownership Must Follow System Ownership]]
- [[Strategic Short-Term Availability Trade-offs]]