Autonomous Systems vs Automated Systems - Nestor G Pestelos Jr (ngpestelos)

## Core Insight There is a fundamental distinction between automated systems (external scripts replacing manual steps) and autonomous systems (self-managing systems that need no human intervention by design). Automated systems still require humans as a fallback; autonomous systems treat self-healing as an intrinsic feature. The transition requires treating infrastructure as a software problem — using APIs, data distribution, and distributed system principles rather than scripts and SSH. ## The Borg Example Google evolved from machine-specific assignments → state-tracking databases → Borg, which treats machines as a managed sea of schedulable resources with API calls to a central coordinator. Result: thousands of machines are born, die, and go to repair daily with zero SRE effort. Machine lifecycle management became a no-op. ## The Decider Example MySQL on Borg needed failovers every 1-2 weeks (Borg tasks move frequently). Manual failover: 30-90 minutes. Decider daemon: <30 seconds, 95% of the time. Results: 95% reduction in operational work, 60% hardware freed through bin-packing. The team shifted from "optimizing for lack of failover" to "embracing failure as inevitable, optimizing to recover quickly." ## The Human Skill Decay Problem Highly effective automation progressively relieves humans of direct system contact. When automation eventually fails, humans can no longer operate the system — reactions lose fluidity, mental models diverge from reality. This argues for regular practice drills and maintaining operational awareness even with high automation. ## The Diskerase Cautionary Tale Automation bug interpreted an empty set (correctly meaning "no machines left to wipe") as a special value meaning "everything." Nearly all CDN machines were wiped within minutes. Lessons: rate limiting, sanity checks, idempotent workflows, never rely on implicit safety signals. ## Design-Phase Recommendation Autonomous operation is difficult to retrofit. Standard software engineering practices help: decoupled subsystems, APIs, minimized side effects. The highest leverage is building systems that don't need automation, not automating existing manual procedures. ## Source - [[Site Reliability Engineering - Chapter 7 - The Evolution of Automation at Google|SRE Ch 7: The Evolution of Automation at Google]] by Niall Murphy ## Related Concepts - [[Automation Value Hierarchy in SRE]] - [[Automation Ownership Must Follow System Ownership]] - [[Strategic Short-Term Availability Trade-offs]]