Control Plane Needs Data Plane Rigor - Nestor G Pestelos Jr (ngpestelos)

> "Where have we assumed a control-plane concern is simpler than a data-plane concern? That assumption is where the next incident lives."

**Source**: Synthesis from the [AWS us-east-1 DynamoDB outage, October 19–20, 2025](https://aws.amazon.com/message/101925/). Research notes: [[20260415 AWS us-east-1 DynamoDB Outage Oct 2025 Post-Mortem]].

## The Meta-Principle

Engineers apply rigor to data planes (replication, consensus, quorums, staleness detection, versioning) because data correctness is visibly tied to business outcomes. Control planes (DNS records, routing tables, config distribution, feature flags, IAM policies, service-discovery state) routinely get treated as *configuration* and built with lower rigor.

The Oct 2025 AWS incident demonstrates the asymmetry is a fiction. The DynamoDB data plane performed flawlessly. The DNS control plane took the entire region down because two Enactor processes raced, a stale safety check didn't fire, and a cleanup routine deleted the active plan. A control-plane failure is indistinguishable from a data-plane failure from the customer's perspective, and possibly worse, because it can prevent recovery.

## The Three Recurring Shortfalls

1. **Control plane treated as configuration, not consensus.** "Which plan is currently active?" is a consensus question even when it looks like a config question. Implementing it with cached reads and best-effort cleanup fails under latency spikes.
2. **Recovery path treated as free.** Normal-operation throttling is sized to steady-state arrival rates. When a dependency heals, backlog arrival rates can be 10–100× steady state. Unthrottled recovery becomes its own outage.
3. **"Global" services with single-region control planes.** Data-plane isolation between regions is real; control-plane isolation is frequently fictitious. A "regional" outage of the home region becomes a worldwide outage for every global service that depends on it.
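
The first shortfall can be made concrete with a minimal sketch. It assumes a single authoritative store in which the version check and the write happen atomically. `PlanStore`, `apply`, `cleanup`, and `StalePlanError` are hypothetical names invented for illustration, the addresses come from the documentation range, and the in-process lock only stands in for a real consensus primitive. This is not a description of how AWS implements its DNS automation.

```python
import threading


class StalePlanError(Exception):
    """Raised when a writer tries to apply a plan that is not newer than the active one."""


class PlanStore:
    """Hypothetical in-memory stand-in for an authoritative, linearizable plan store."""

    def __init__(self):
        self.active_version = 0
        self.plans = {}
        self._lock = threading.Lock()

    def apply(self, version, records):
        # The comparison and the write share one critical section, so the
        # check is made against current state, not a value read earlier.
        with self._lock:
            if version <= self.active_version:
                raise StalePlanError(
                    f"plan {version} is not newer than active plan {self.active_version}"
                )
            self.plans[version] = records
            self.active_version = version

    def cleanup(self):
        # Remove plans older than the active one. The active version is read
        # under the same lock, so cleanup cannot delete the plan in use.
        with self._lock:
            doomed = sorted(v for v in self.plans if v < self.active_version)
            for v in doomed:
                del self.plans[v]
            return doomed


store = PlanStore()
store.apply(1, ["192.0.2.1"])
store.apply(3, ["192.0.2.3"])        # a fast writer applies the newest plan

rejected = None
try:
    store.apply(2, ["192.0.2.2"])    # a delayed writer arrives late with an older plan
except StalePlanError as exc:
    rejected = str(exc)

deleted = store.cleanup()            # only plans older than the active one are removed
```

The design choice worth noting is that both the staleness check and the cleanup decision read `active_version` inside the same critical section as the mutation. In a distributed setting that role would fall to a lease, a compare-and-set on a revision number, or a consensus log rather than a local lock.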

## Prevention Heuristic

Before every architecture review, ask each question once per control-plane component:

- **Consensus**: Is this a single-writer system with a lease, or a multi-writer system with proper versioning? If it's neither, it's a race waiting to happen.
- **Staleness**: Do safety checks verify against *authoritative current state at the moment of action*, or against state read seconds/minutes ago?
- **Capacity removal velocity**: When this system decides to remove capacity (un-register an endpoint, mark an instance unhealthy, delete a plan), can it do so faster than the system can replace the capacity?
- **Recovery path load**: Have we load-tested this system at the arrival rate produced by a 3-hour outage of its primary dependency?
- **Global vs regional honesty**: Is this component labeled "global" because it serves global traffic, or because it truly runs with consensus across multiple regions? If the former, document the single-region risk explicitly.

## Why Teams Skip This

- Control planes are often built earlier in a system's life, when traffic is low and races are rare.
- Control-plane incidents are less frequent than data-plane incidents, so organizational memory underweights them.
- Control-plane rigor costs more: consensus systems, full regional replicas, recovery-load testing all add complexity and expense.
- The incentive gradient favors shipping new features over hardening control-plane code that *seems* to work.

## What Changes When You Adopt the Principle

- Control-plane mutations go through a consensus primitive (Raft leader, etcd, ZooKeeper, or equivalent).
- Safety checks read authoritative state immediately before action, or fail closed if they can't.
- Recovery workflows have explicit throttles, often lower than normal-operations throttles.
- Capacity-removal actions have velocity limits; never remove faster than you can replace.
- Global services advertise their control-plane region explicitly, and customers can treat that disclosure as architectural input.

## Relationship to the Five-Challenge Frame

Maps onto three of Vitillo's five challenges simultaneously:

- **Coordination**: control planes are coordination systems even when not labeled as such.
- **Resiliency**: unthrottled recovery breaks the resiliency contract.
- **Maintainability**: control-plane rigor depends on operating the system as if it matters, not as if it were static config.

The fact that a single principle touches three of the five challenges is why it produces such outsized incidents when neglected.

## Related Concepts

- [[DNS Control Plane Race Condition]]
- [[Congestive Collapse on Recovery]]
- [[Global Control Plane Colocation Risk]]
- [[Five-Challenge Frame of Distributed Systems]]
- [[Ports and Adapters for Distributed Services]]
- [[Bezos API Mandate]]

## Tags

#system-design #distributed-systems #control-plane #resiliency #architecture-principle #aws #post-mortem