## Core Insight

The four golden signals — latency, traffic, errors, saturation — are the minimum viable monitoring for any user-facing system. If you can only measure four things, measure these. Together they cover user experience (latency), demand (traffic), correctness (errors), and capacity (saturation).

## The Four Signals

1. **Latency** — time to service a request. Must separate successful from failed request latency (a slow error is worse than a fast error)
2. **Traffic** — demand on the system. HTTP requests/sec for web services, network I/O for streaming, transactions/sec for storage
3. **Errors** — rate of failed requests. Three types: explicit (HTTP 500), implicit (200 with wrong content), policy-based (response exceeding SLO threshold)
4. **Saturation** — how "full" the service is. Systems degrade before 100% utilization. Latency increases are often a **leading indicator** of saturation

## Key Nuances

- **Tail latency matters more than averages**: 1% of requests at 50x the average can dominate frontend experience in multi-service architectures. The 99th percentile of one backend becomes the median of the frontend.
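The percentile-amplification effect can be sketched with a small simulation (the numbers here are illustrative, not from the source): a 100-way fan-out turns a 1% backend slow tail into the frontend's typical case.

```python
import random

random.seed(42)

def backend_latency_ms():
    # Hypothetical mostly-fast backend: ~99% of calls take 10 ms,
    # ~1% hit a slow path at 500 ms (50x the typical latency).
    return 500.0 if random.random() < 0.01 else 10.0

def percentile(samples, p):
    s = sorted(samples)
    return s[int(p / 100 * (len(s) - 1))]

# A frontend request fans out to 100 backends and waits for all of
# them, so its latency is the slowest of the 100 backend calls.
FANOUT = 100
frontend = [max(backend_latency_ms() for _ in range(FANOUT))
            for _ in range(2000)]
backend = [backend_latency_ms() for _ in range(20000)]

print("backend p50  :", percentile(backend, 50))    # the fast path
print("backend p99.5:", percentile(backend, 99.5))  # the slow tail
print("frontend p50 :", percentile(frontend, 50))   # the tail, now the median
```

With a 1% slow rate per backend, the chance that at least one of 100 fanned-out calls is slow is 1 − 0.99¹⁰⁰ ≈ 63%, so the frontend's *median* latency is the backend's *tail* latency.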
- **Use histogram buckets, not averages**: distribute bucket boundaries exponentially (e.g., 0-10ms, 10-30ms, 30-100ms, 100-300ms) to visualize the request distribution
- **Saturation includes predictions**: "database fills hard drive in 4 hours" is a saturation signal

## Cross-Domain Applications

- **Any service**: these four signals transfer to non-Google systems — they're fundamental, not Google-specific
- **Capacity planning**: saturation signals feed directly into provisioning decisions
- **Incident response**: start diagnosis by checking which golden signal is anomalous

## Source

- [[Site Reliability Engineering - Chapter 6 - Monitoring Distributed Systems|SRE Ch 6: Monitoring Distributed Systems]] by Rob Ewaschuk

## Related Concepts

- [[Symptom vs Cause Monitoring Distinction]]
- [[Service Level Objectives as Reliability Framework]]
- [[Strategic Short-Term Availability Trade-offs]]
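As a closing sketch, the exponential latency buckets described under Key Nuances can be implemented with a plain counter array (boundaries and sample values below are illustrative, not from the source):

```python
import bisect

# Exponential bucket boundaries in ms, roughly tripling each step,
# matching the 0-10 / 10-30 / 30-100 / 100-300 scheme from the note.
BOUNDS = [10, 30, 100, 300, 1000]

def bucket_counts(latencies_ms):
    """Count samples per bucket; the extra last slot catches >= 1000 ms."""
    counts = [0] * (len(BOUNDS) + 1)
    for x in latencies_ms:
        # bisect_right finds the first boundary strictly greater than x,
        # which is exactly the index of x's bucket.
        counts[bisect.bisect_right(BOUNDS, x)] += 1
    return counts

samples = [3, 8, 12, 45, 250, 999, 1500]
print(bucket_counts(samples))  # → [2, 1, 1, 1, 1, 1]
```

Unlike a single average, these counts reveal bimodal distributions (e.g., a fast path plus a slow tail) at a glance; this is the same idea behind cumulative histogram metrics in monitoring systems.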