Reliable Abstractions on Unreliable Primitives - Nestor G Pestelos Jr (ngpestelos)

> "Chapter 2 describes how to build a reliable communication channel (TCP) on top of an unreliable one (IP), which can drop or duplicate data or deliver it out of order. Building reliable abstractions on top of unreliable ones is a common pattern we will encounter again in the rest of the book."

**Source**: Roberto Vitillo, *Understanding Distributed Systems* (2nd ed., 2022), Part I Communication introduction.

## The Pattern

Distributed systems are built by stacking reliable abstractions on top of unreliable primitives. Each layer makes the layer above simpler by absorbing failure modes of the layer below. The abstractions are never leak-free (Joel Spolsky's Law of Leaky Abstractions), and when they leak, debugging requires understanding the layer beneath.

## Canonical Stack

| Layer | Guarantees | Unreliability absorbed |
|-------|-----------|------------------------|
| IP | Best-effort packet delivery | Hardware, switching, routing, congestion |
| TCP | Ordered byte stream without gaps/duplicates | Packet loss, reordering, duplication |
| TLS | Confidentiality, authenticity, integrity | Eavesdropping, tampering, impersonation |
| HTTP | Request/response semantics over a reliable byte stream | Raw protocol design |
| Service interface | Business-logic operations | Transport-level details |

## Where the Pattern Recurs Beyond Networking

- **Exactly-once messaging** built on at-least-once delivery + deduplication
- **Strong consistency** built on eventually-consistent replicas + consensus
- **Durable state** built on volatile memory + write-ahead logs
- **Virtual memory** built on a mix of RAM + swap with page-fault handling
- **Cloud object storage** built on unreliable disks + replication + checksums
- **Cluster leaders** built on failure-prone nodes + election protocols

## Why It Matters

- **Design leverage**: naming the unreliable primitive forces you to state the failure modes the layer above must absorb.
  Skipping this step produces layers that promise more than they can deliver.
- **Debugging discipline**: when an abstraction leaks, you must drop down a layer. An engineer who only knows the top layer cannot fix leaks.
- **Cost awareness**: every reliability guarantee costs latency, bandwidth, or CPU. TCP pays for reliability with handshakes, acknowledgments, and retransmissions that UDP skips; TLS pays for security with handshake round trips and encryption work that plaintext avoids; strong consistency pays with coordination that eventual consistency avoids. Choosing the right layer means choosing which guarantees you need and which you can drop.

## Anti-Pattern: Treating the Abstraction as Magic

The anti-pattern is believing that "TCP is reliable" absolves you from thinking about network failures. The abstraction leaks: connection resets, TIME_WAIT exhaustion, half-open connections, head-of-line blocking. Reliable abstractions manage the unreliability of the layer below; they do not eliminate it globally.

## Related Concepts

- [[Five-Challenge Frame of Distributed Systems]]
- [[Ports and Adapters for Distributed Services]]
- [[DNS Control Plane Race Condition]]: a case where the layer above (service discovery) trusted a control plane (DNS management) that was more unreliable than advertised
- [[Mass Test and Treat as Pandemic Response Model]]: demonstrates building reliable abstractions (standardized kits) on unreliable primitives (decentralized field conditions)

## Tags

#system-design #distributed-systems #abstraction #networking #tcp #vitillo
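
## Appendix: A Minimal Sketch of the Pattern

The pattern this note describes can be illustrated with a toy simulation. This is a sketch under stated assumptions, not how TCP is actually implemented: `LossyChannel`, `ReliableReceiver`, and `send_reliably` are hypothetical names invented for this example, the drop and duplication rates are arbitrary, and acknowledgments are assumed to travel back without loss (a real protocol must handle lost acknowledgments too). The channel drops, duplicates, and reorders packets; sequence numbers, a reorder buffer, and retransmission absorb those failure modes so the caller sees an ordered stream with no gaps or duplicates.

```python
import random


class LossyChannel:
    """Hypothetical unreliable primitive: drops, duplicates, and reorders packets."""

    def __init__(self, seed=0, drop=0.3, dup=0.2):
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.drop = drop
        self.dup = dup
        self.in_flight = []

    def send(self, packet):
        if self.rng.random() < self.drop:
            return  # packet lost
        self.in_flight.append(packet)
        if self.rng.random() < self.dup:
            self.in_flight.append(packet)  # packet duplicated

    def deliver(self):
        """Hand over everything in flight, in shuffled (out-of-order) order."""
        self.rng.shuffle(self.in_flight)
        packets, self.in_flight = self.in_flight, []
        return packets


class ReliableReceiver:
    """Absorbs reordering and duplication using sequence numbers."""

    def __init__(self):
        self.next_seq = 0    # lowest sequence number not yet delivered
        self.buffer = {}     # out-of-order packets waiting for a gap to fill
        self.delivered = []  # what the layer above sees

    def on_packet(self, seq, payload):
        if seq >= self.next_seq:
            self.buffer[seq] = payload  # a duplicate overwrites the same slot
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq  # cumulative ack (assumed to arrive reliably)


def send_reliably(messages, channel, receiver, max_rounds=1000):
    """Retransmit unacknowledged messages until all are delivered in order."""
    acked = 0
    rounds = 0
    while acked < len(messages):
        if rounds == max_rounds:
            # The leak: the layer cannot hide a channel that never delivers.
            raise TimeoutError("gave up after max_rounds retransmission rounds")
        rounds += 1
        for seq in range(acked, len(messages)):
            channel.send((seq, messages[seq]))
        for seq, payload in channel.deliver():
            acked = max(acked, receiver.on_packet(seq, payload))
    return receiver.delivered


if __name__ == "__main__":
    print(send_reliably(["a", "b", "c", "d", "e"], LossyChannel(seed=42), ReliableReceiver()))
```

The `TimeoutError` branch is where the abstraction leaks: with `drop=1.0` no amount of retransmission helps, and the caller has to reason about the layer below. The `max_rounds` budget also makes the cost visible, since every retransmission round spends bandwidth and latency to buy the ordering guarantee.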