Observability

Keywords

observability, sre, slo, sli, error budget, three pillars, metrics, logs, traces, alerting, golden signals, monitoring

Introduction

At 2 a.m. the pager went off, and the first thing the on-call engineer did was the thing everyone does: open the dashboards. They were green. Request rate, normal. CPU across the fleet, comfortable. Error rate, a flat line near zero. Every chart the team had ever thought to build said the system was healthy. And yet support tickets were piling up — real users, real money, checkout failing intermittently for reasons no graph could explain. The dashboards answered every question the team had anticipated, and the outage was none of those questions.

The system was twenty-odd microservices deep. A request entered at the gateway and fanned out to auth, cart, inventory, payments, a fraud check, each calling two or three more. The “error rate” the dashboard showed was an average across all of them, and the failure was a tail: a small fraction of requests that touched one downstream dependency timed out, retried, and returned a degraded-but-technically-200 response. No single service’s error gauge moved enough to cross a threshold. The aggregate hid the tail; the per-service charts hid the path. The team could see that something was off only because users told them — and then they were stuck, because nothing they had told the system to watch could tell them which of twenty services was the culprit.

What broke the impasse, an hour in, was someone pulling up a single failing request by its trace ID and reading it end to end. There it was: 380 milliseconds spent inside the payment service, waiting on a third-party API that was rate-limiting them. The trace localized in seconds what the dashboards could not localize at all. That gap — between the green dashboards and the one revealing trace — is the difference between monitoring and observability. The team had built monitoring: answers to questions they knew to ask in advance. What they lacked was observability: the ability to ask a question they had never anticipated about a running system and get an answer out of the telemetry it was already emitting.

This is the defining problem of running distributed systems at scale. You cannot predict every way a distributed system will fail, because the failure modes grow combinatorially with the number of services and their interactions. So you stop trying to predict them, and instead instrument the system richly enough that, when it fails in a way nobody foresaw, you can investigate it from the outside, reconstructing what happened from the data it left behind rather than from a chart you were lucky enough to have built.

The Core Insight

Monitoring and observability are often used interchangeably, and conflating them is exactly the mistake that produced the all-green outage. They are different in kind, not degree.

Monitoring is the practice of watching predefined signals for known failure modes. You decide in advance what matters — error rate, latency, queue depth — build dashboards and alerts on those signals, and the system tells you when one crosses a line you drew. Monitoring is necessary and it is good at what it does: it answers the questions you already knew to ask. Its limit is precisely that. A dashboard can only show you a failure you anticipated well enough to chart.

Observability is a property of the system rather than a set of dashboards: the degree to which you can understand its internal state from its external outputs. A system is observable when its telemetry is rich and structured enough that you can pose a new, ad hoc question — “why are requests from premium users in eu-west slow only on the recommendation path?” — and answer it without shipping new code to go look. The term comes from control theory, where a system is observable if its internal state can be inferred from its outputs alone; here you treat the running system as a black box and reconstruct what’s happening inside it from what it emits.

That telemetry rests on three pillars — metrics, logs, and traces — and the point of this chapter is that each answers a different question, and that the real leverage sits in two places the cookbook treatments miss. The first is correlation: a metric tells you something is wrong, a trace tells you where, a log tells you what — and you can only follow that chain if all three share an identifier tying them to the same request. The second is definition: deciding what “healthy” even means, as a number, so “is the system okay?” stops being a 2 a.m. judgment call and becomes a measurable target — a service-level objective with an error budget attached.

A mental model

Think of the three pillars as instruments on a flight deck, each reading the same aircraft but answering a different question. The altimeter and airspeed indicator are your metrics: continuous numeric readouts that tell you that something is wrong and how much — you’re descending, and at this rate. Cheap to read, always on, perfect for noticing trouble, useless for explaining it. The flight data recorder is your logs: the detailed, timestamped record of discrete events — what happened, exactly, at this moment, in this component. And the trace is the flight path drawn on a map of the whole route: it shows where, across every leg of the journey, the time went and the trouble started. A pilot with only the altimeter knows they’re in trouble but not where; one with only the recorder can reconstruct events but not see the shape of the flight. You need all three, synchronized to the same clock — which, in our world, is the same trace ID.

The instrument that turns flying into a discipline rather than a vibe is the service-level objective, which converts “is it okay?” into a number. Instead of arguing about whether 200 milliseconds is “slow,” you declare a target — 99.9% of checkout requests complete in under 500 ms over a 30-day window — and now “okay” is defined, measurable, and the same answer for everyone. The gap between that target and perfection is your error budget: the unreliability you are allowed to spend. It is the single most useful number in operations, because it turns reliability from an impossible absolute (nothing is 100%) into a quantity you can manage like any other.

What to measure: the golden signals and SLOs

The first question anyone asks when instrumenting a service is “what do I measure?” — and the second is “how do I avoid drowning in dashboards?” Google’s SRE practice answers both, and it has two halves.

The first half is the four golden signals, the minimum set that characterizes the health of any user-facing service:

Latency — how long requests take, as a distribution (p50, p95, p99), never an average, because the average hides the tail where users suffer.
Traffic — how much demand the service is under: requests per second, by endpoint.
Errors — the rate of failed requests, including silent ones (a 200 that returns the wrong answer is still an error).
Saturation — how full the service is: utilization of its most constrained resource (CPU, memory, connection pool), the signal that predicts trouble before latency and errors degrade.

Instrument those four and you have caught most of what will hurt you. (Two named recipes formalize this: RED for request-driven services, USE for resources — the same idea pointed at different things, expanded in the metrics section below.)

The second half is the chain that turns a signal into a decision: SLI → SLO → error budget → alert. A service-level indicator (SLI) is a carefully chosen metric that reflects user happiness — the proportion of good requests, say, where “good” means fast and successful. A service-level objective (SLO) is a target on that SLI over a window: 99.9% good over 30 days. The error budget is the remainder: 0.1% of requests, roughly 43 minutes per month, that you are permitted to fail. And the alert fires not when CPU is high or a single error appears, but when you are burning that budget too fast to make the month. That is the cardinal rule, and it is the lesson of the 2 a.m. outage in reverse: alert on symptoms, not causes — on the user-facing SLO breach, not on the internal metric you guessed might cause one. A high CPU alert pages you for a non-problem; a burn-rate alert pages you only when users are actually being hurt. Figure 44.1 shows how the three pillars feed this loop.

What you’ll learn

How observability differs from monitoring, and why the distinction decides whether you can debug a failure you never anticipated
What each of the three pillars — metrics, logs, traces — is uniquely good at, and how they correlate through a shared request/trace ID into one investigable story
How to choose what to measure using the four golden signals and the RED/USE methods, and why latency must always be a distribution, never an average
How to define reliability as a number — SLI, SLO, and error budget — and use the budget to balance release velocity against stability
Why distributed tracing is the only practical way to debug latency and causality across a fleet of services, and how context propagation makes it work
How to build alerting that pages on symptoms (SLO burn rate), not causes, and why alert fatigue is a reliability risk in its own right
Where the three pillars’ backends live (Prometheus/Grafana, a log store, a tracing backend) and what each costs as the system grows

Prerequisites

Python: Observability — the in-code, application-level instrumentation that produces the telemetry this chapter operates on: the Prometheus client, structured logging, and the OpenTelemetry SDK inside a service. This chapter is the platform-level view of where all that telemetry goes and what you do with it.
Distributed systems basics — services that call other services over the network, why partial failure and tail latency are the norm, and why a single request can touch a dozen processes.
Comfort reading a dashboard and a query language conceptually; you do not need to master PromQL here, but you should be willing to think in rates and percentiles.

The three pillars at system scale

The phrase “three pillars” is everywhere, and it usually arrives as a flat list: metrics, logs, traces, here is a tool for each. That framing buries the actual insight, which is that the three pillars are not interchangeable and not redundant — they answer three genuinely different questions, and an investigation works by pivoting between them. Hold onto the three questions and everything else falls into place: metrics tell you that something is wrong and how much; logs tell you what happened; traces tell you where, across services.

Metrics are aggregated numbers over time — counters, gauges, and distributions sampled at a fixed interval and stored as time series. Their superpower is that they are cheap and always on: a metric is a handful of numbers per scrape no matter how much traffic flows through, so you can keep months of history and ask “how does today’s p99 compare to last Tuesday’s?” for almost nothing. Their weakness is the flip side — a metric is an aggregate that has thrown away the individual events, so it can tell you the error rate jumped to 3% but not which requests failed or why. Metrics are how you notice trouble and measure SLOs; they are the worst of the three for diagnosing it. The backend is almost universally Prometheus for collection and storage and Grafana for visualization: Prometheus scrapes each service’s /metrics endpoint on an interval into a local time-series database, Grafana draws the dashboards, and Alertmanager fires the alerts. (The pull model, the four metric types, and PromQL are hands-on in Python: Observability; here it’s enough that metrics are scraped, aggregated, and cheap.)

Logs are the opposite trade-off: discrete, timestamped event records, one per thing that happened, carrying full context. A log line can say exactly what occurred — this user, this order, this exception, this downstream response code — with no aggregation loss. That richness is the cost: logs are voluminous and expensive to store and search, and at fleet scale you cannot keep or read all of them. The backend is a centralized log store — Grafana Loki for a Kubernetes-native, label-indexed, cost-conscious setup, or the ELK stack (Elasticsearch) when you need full-text search and arbitrary-field queries — fed by a shipping agent (Fluentd, Filebeat, Vector) that collects logs off every host. Logs are how you learn what exactly happened once a metric or trace has told you where to look.

Traces are the pillar distributed systems make indispensable. A trace follows one request across every service, recording each hop as a span with its own start time, duration, and parent; together the spans assemble into a tree that is, in effect, the call stack of a request that happens to run across many machines. It is the only artifact that captures causality across service boundaries: it shows that the gateway’s 420 ms was mostly time spent in the order service, which was in turn mostly the payment service waiting 380 ms on a third-party API. The backend is a dedicated tracing system (Jaeger or Grafana Tempo) that receives spans (usually via an OpenTelemetry collector), reassembles them by trace ID, and renders the per-request waterfall the 2 a.m. engineer finally read.

The pillars become a system — rather than three disconnected tools — through one discipline: correlation by a shared identifier. Every metric exemplar, log line, and span for a given request carries the same trace ID. That shared key is what lets you pivot: a burn-rate alert fires on a metric; from its exemplar you jump to a representative trace and see the time was spent in the payment service; you filter the logs by that trace ID and read that the service got three 429s from Stripe. Metric → trace → log, that → where → what, in three clicks instead of an hour of grepping. Without the shared ID you have three silos and a war room; with it, one reconstructable story. Figure 44.1 lays out the flow: one instrumented request emitting all three pillars, each tagged with the trace ID, each flowing to its own backend, all correlating back into a single picture you can interrogate from the outside.
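The pivot described above can be sketched with in-memory data. This is a minimal illustration, not a real integration: the Span and LogLine records, the exemplar dictionary, and the sample values are all hypothetical stand-ins for a metrics store, a tracing backend, and a log store, and duration_ms is assumed to mean time spent in that service itself.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for the three backends.

@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float  # assumed: time spent in this service itself

@dataclass
class LogLine:
    trace_id: str
    service: str
    message: str

# A metric sample carrying an exemplar: the trace ID of one request it counted.
slow_checkout_exemplar = {"metric": "checkout_latency_p99", "trace_id": "abc123"}

spans = [
    Span("abc123", "gateway", 40.0),
    Span("abc123", "orders", 30.0),
    Span("abc123", "payment", 350.0),
    Span("zzz999", "gateway", 12.0),
]

logs = [
    LogLine("abc123", "payment", "upstream returned 429"),
    LogLine("abc123", "payment", "upstream returned 429"),
    LogLine("abc123", "payment", "upstream returned 429"),
    LogLine("zzz999", "gateway", "request ok"),
]

def slowest_service(trace_id, spans):
    """Trace step: find where the time went for one request."""
    own = [s for s in spans if s.trace_id == trace_id]
    return max(own, key=lambda s: s.duration_ms).service

def logs_for(trace_id, service, logs):
    """Log step: read what happened in that service for that request."""
    return [l.message for l in logs
            if l.trace_id == trace_id and l.service == service]

trace_id = slow_checkout_exemplar["trace_id"]  # metric -> trace ("that")
culprit = slowest_service(trace_id, spans)     # trace  -> service ("where")
evidence = logs_for(trace_id, culprit, logs)   # logs   -> events ("what")
print(culprit, evidence)
```

Every join in this sketch is on trace_id; remove that field from any one of the three record types and the corresponding pivot becomes impossible.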

Note

The three pillars are increasingly seen as an implementation detail of a single goal rather than three separate products. OpenTelemetry — the vendor-neutral standard for generating and shipping all three signal types — exists precisely to unify them behind one instrumentation API and one trace ID, so that correlation is the default rather than an integration project. Think in terms of correlated telemetry about requests, not “the metrics tool, the logs tool, and the traces tool.”

Metrics and the golden signals

Metrics deserve a closer look because they are where most teams start and where most teams overspend. A metric is a time series: a named stream of numeric samples, each tagged with a set of labels (key-value dimensions like method, endpoint, status) and a timestamp. The labels are what make metrics multidimensional and queryable — http_requests_total{status="500", endpoint="/checkout"} is a different series from the same metric with status="200" — and they are also where the bodies are buried, as we will see.

The golden signals tell you which metrics to collect; RED and USE tell you how to organize them. For any request-handling service, RED — Rate, Errors, Duration (the latency distribution) — gives you roughly 80% of the operational insight for 20% of the effort. For any resource it depends on — a disk, a connection pool, a queue — USE — Utilization, Saturation, Errors — does the same. They are complementary: RED watches the service from the user’s side, USE watches resources from the system’s side, and saturation bridges them, because resources saturate before user-facing latency degrades.

The number most often gotten wrong is latency, which you must record as a distribution. An average is nearly useless: a service with a 50 ms average might serve half its users in 10 ms and half in 90 ms, or 99% in 20 ms and 1% in 3 seconds — and the second case is an outage for that 1%, invisible in the mean. So metrics systems use histograms, bucketing observations so you can compute percentiles (p50, p95, p99) after the fact, and you SLO on the tail (p99), not the average, because the tail is where users churn.
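A small sketch makes the point concrete. It uses a simple nearest-rank percentile over raw samples rather than histogram buckets, and the two latency samples are invented for illustration. The long-tail sample puts 2% of requests in the slow group (rather than the 1% in the prose) so that the p99 lands inside the tail.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(latencies_ms):
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }

# Two illustrative services with the same 50 ms mean.
even_split = [10.0] * 500 + [90.0] * 500    # half fast, half slower
long_tail = [20.0] * 980 + [1520.0] * 20    # 2% of requests take 1.5 seconds

print(summarize(even_split))
print(summarize(long_tail))
```

Both samples report a mean of 50 ms, yet the second has a p99 of 1,520 ms. Only the percentile columns distinguish a healthy service from one that is failing a slice of its users.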

The cost of metrics is cardinality, the single most common way teams break their metrics stack. Every unique combination of label values is a separate time series, stored and indexed independently. A metric with method (5 values), endpoint (50), and status (10) is 2,500 series — fine. Add a user_id label and you multiply by your user count; user_id × request_id is effectively unbounded, and the time-series database runs out of memory and falls over. The rule is firm: labels must have bounded, low cardinality — methods, status classes, endpoints, regions, yes; user IDs, request IDs, full URL paths, no. High-cardinality detail belongs in logs and traces, which are built for per-event data; putting it in metrics is the classic, expensive mistake (see the war story below).
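The arithmetic is worth running once. The sketch below multiplies label cardinalities to get a worst-case series count and adds a simple guard against unbounded labels. The user count and the denylist are hypothetical, and the product is an upper bound, since not every combination need occur in practice.

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case number of time series: the product of each label's distinct values."""
    return prod(label_cardinalities.values())

# Hypothetical denylist of labels whose value sets grow without bound.
UNBOUNDED_LABELS = {"user_id", "request_id", "url_path"}

def check_labels(label_names):
    """Reject metric definitions that use a known-unbounded label."""
    bad = sorted(set(label_names) & UNBOUNDED_LABELS)
    if bad:
        raise ValueError(f"unbounded labels not allowed on metrics: {bad}")

bounded = {"method": 5, "endpoint": 50, "status": 10}
print(series_count(bounded))      # 2500

# Assumed user count of one million, purely for illustration.
with_user = {**bounded, "user_id": 1_000_000}
print(series_count(with_user))    # 2500000000
```

A check like check_labels could run in code review or at metric registration time, which is cheaper than discovering the problem when the time-series database runs out of memory.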

SLIs, SLOs, and error budgets

This is the SRE core, where observability stops being a set of tools and becomes a way of making decisions. The motivating problem is that “reliability” sounds like an absolute — more is better, aim for 100% — and that framing is both wrong and ruinous. Wrong, because nothing is 100%: the network drops packets, dependencies fail, deploys have bugs, and chasing the last fraction of a nine costs exponentially more than the one before it. Ruinous, because a team that treats every error as unacceptable freezes all change to avoid risk, and a system that never changes is a dead system. The error budget cures both by making reliability a quantity to manage rather than an absolute to chase.

The chain is short. An SLI (service-level indicator) is a metric chosen to reflect user experience, most usefully expressed as a ratio of good events to total — the fraction of requests that were both successful and fast enough. An SLO (service-level objective) is a target on that SLI over a rolling window: “99.9% of requests good over the last 30 days.” (Don’t confuse it with an SLA, a service-level agreement — an external contract with financial or legal consequences. Your internal SLO should be stricter than any SLA, so you find out before your customers do.) The error budget is simply 1 − SLO over the window: at 99.9%, 0.1% of requests, which for a 30-day month is about 43 minutes of allowed unavailability.
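The budget arithmetic can be checked in a few lines. This is a sketch of the 1 − SLO calculation only; the function names are illustrative, and the time-based figure assumes the budget is spent as whole-service unavailability.

```python
def error_budget_fraction(slo):
    """The budget is whatever the SLO leaves over: 1 - SLO."""
    return 1.0 - slo

def error_budget_minutes(slo, window_days=30):
    """Budget expressed as minutes of full unavailability over the window."""
    return error_budget_fraction(slo) * window_days * 24 * 60

def allowed_bad_requests(slo, total_requests):
    """Budget expressed as a count of requests permitted to be bad."""
    return round(error_budget_fraction(slo) * total_requests)

for target in (0.995, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} minutes per 30 days")

print(allowed_bad_requests(0.999, 1_000_000))
```

At 99.9% the function returns about 43.2 minutes for a 30-day window, matching the figure in the text; at 99.99% it returns about 4.3.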

The error budget is a lever, and this is what changes how a team operates. While the budget is healthy, the team ships freely — takes risks, runs experiments, deploys on Friday — because the cost of a small mistake is covered. When the budget is burning fast, the policy flips: non-essential deploys freeze, focus shifts from features to stability, on-call gets paged. The budget becomes the shared, objective arbiter between two teams otherwise structurally at war — product, which wants to ship, and SRE, which wants stability. Instead of arguing about whether a deploy is “too risky,” they look at the budget: full, ship it; empty, fix it first. The number, not the loudest voice, decides.
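An error-budget policy of this kind can be written down as a small function. The sketch below is one possible shape, not a standard: the three outcomes and the 25% caution threshold are assumptions chosen for illustration, and a real policy would be agreed between product and SRE.

```python
def budget_remaining(good, total, slo):
    """Fraction of the window's error budget still unspent (negative once overspent)."""
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad

def deploy_policy(remaining, caution_below=0.25):
    """Map remaining budget to a release stance. Thresholds are hypothetical."""
    if remaining <= 0.0:
        return "freeze"    # budget spent: stability work only
    if remaining < caution_below:
        return "caution"   # nearly spent: essential deploys only
    return "ship"          # healthy: normal release cadence

# 1,000,000 requests at a 99.9% SLO allow roughly 1,000 bad ones.
print(deploy_policy(budget_remaining(999_600, 1_000_000, 0.999)))  # 400 bad
print(deploy_policy(budget_remaining(999_200, 1_000_000, 0.999)))  # 800 bad
print(deploy_policy(budget_remaining(998_900, 1_000_000, 0.999)))  # 1,100 bad
```

The value of encoding the policy is that the answer to "can we deploy?" comes from the same numbers for everyone, which is the arbitration role the text describes.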

One caution on the target: do not reach for 99.99% because more nines sound better. Each nine is an order of magnitude more expensive, and 99.99% leaves about 4 minutes of budget per month — often less than a single deploy or a brief dependency blip — so you live in permanent freeze, paging on noise. Start at 99.5% or 99.9%, measure actual user impact, and tighten only where the data shows users genuinely care.

Distributed tracing

In a monolith, when a request is slow you attach a profiler and read the call stack. In a fleet of microservices, that call stack is distributed across processes and machines, and no single profiler can see it: the request enters at the gateway, and from there the “stack” is a tree of network calls — to auth, to the order service, which calls inventory and payment, which calls a third-party API — each running in its own process, each timed only against its own clock. Distributed tracing reassembles that scattered call stack into one readable picture. It is not a nice-to-have past a handful of services; it is the only practical way to debug latency and causality across them, which is why the 2 a.m. outage was unsolvable until someone read a trace.

The model is small. A trace is one request’s whole journey, identified by a single trace ID. Each unit of work along the way — a service handling the request, a database query, an outbound call — is a span, with a start time, duration, status, attributes (the endpoint, the order ID), and a pointer to its parent. The spans form a tree, and rendered on a timeline they become the waterfall: horizontal bars showing what ran when and for how long, nested by parent-child, so the critical path — the longest chain that determined total latency — jumps out visually. You look at the waterfall and see that 380 of the request’s 420 ms were spent in one span deep in the payment service, and you have localized the problem.
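The span tree and its critical path can be sketched as follows. This is a simplification under stated assumptions: it walks from the root into the longest child at each level, which ignores concurrency and gaps between spans, and the span names and durations are invented to resemble the checkout example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    name: str
    duration_ms: float

def critical_path(spans):
    """Walk from the root, descending into the longest child at each level."""
    children = {}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children.setdefault(s.parent_id, []).append(s)
    path = [root]
    while children.get(path[-1].span_id):
        path.append(max(children[path[-1].span_id], key=lambda s: s.duration_ms))
    return path

# Hypothetical trace for one slow checkout request.
trace = [
    Span("a", None, "gateway", 420.0),
    Span("b", "a", "auth", 15.0),
    Span("c", "a", "orders", 400.0),
    Span("d", "c", "inventory", 12.0),
    Span("e", "c", "payment", 385.0),
    Span("f", "e", "third_party_api", 380.0),
]

for depth, span in enumerate(critical_path(trace)):
    print("  " * depth + f"{span.name}: {span.duration_ms:.0f} ms")
```

The indented output is a text version of the waterfall: each level narrows the search until the leaf span accounts for nearly all of the root's latency.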

The mechanism that makes this possible, and the one thing that most often breaks it, is context propagation. For spans from different services to assemble into one trace, each outbound call must carry the trace ID and current span ID forward so the receiver can attach its spans as children. That context rides in request metadata: an HTTP header (the traceparent header defined by the W3C Trace Context standard), gRPC metadata, message-broker headers. If any hop drops it, the trace fractures into disconnected single-service fragments, useless for debugging a cascade. That is why auto-instrumentation libraries exist: they inject propagation into every common framework so application code need not remember to forward headers on every call. (The in-code mechanics, such as configuring the OpenTelemetry SDK, creating spans, and calling inject on outbound requests, live in Python: Observability.)
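To show what actually travels between services, here is a hand-rolled sketch of the traceparent header format (version, 32-hex-character trace ID, 16-hex-character parent span ID, and flags, joined by hyphens). The inject, extract, and handle functions below are illustrative helpers written for this example; they are not the OpenTelemetry API, which a real service would use instead.

```python
import secrets

def inject(headers, trace_id, span_id, sampled=True):
    """Write the current context into outbound request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers

def extract(headers):
    """Read the caller's context from inbound headers, or None if absent or malformed."""
    value = headers.get("traceparent")
    if not value:
        return None
    parts = value.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16 or len(parts[3]) != 2:
        return None
    try:
        sampled = bool(int(parts[3], 16) & 1)
    except ValueError:
        return None
    return {"trace_id": parts[1], "parent_span_id": parts[2], "sampled": sampled}

def handle(incoming_headers):
    """Simulate one service hop: join the caller's trace, or start a new one."""
    ctx = extract(incoming_headers)
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    outgoing = inject({}, trace_id, span_id)
    return trace_id, span_id, outgoing

gateway_trace, gateway_span, to_orders = handle({})    # first hop starts the trace
orders_trace, orders_span, to_payment = handle(to_orders)
payment_trace, _, _ = handle({})                       # a hop that dropped the header

print(orders_trace == gateway_trace)    # same trace: spans will assemble
print(payment_trace == gateway_trace)   # different trace: the waterfall fractures
```

The last two lines show the failure mode from the text: the payment hop received no header, minted a fresh trace ID, and its spans can no longer be attached to the request that caused them.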

The last concept is sampling, forced on you by volume: a high-traffic system can generate tens of millions of spans per minute, far more than is worth storing in full, so you keep a fraction. Head-based sampling decides at trace start (a random 1%, say): cheap, but it may discard the rare slow request you most wanted. Tail-based sampling buffers the whole trace in the collector and decides after the outcome is known (keep all errors and slow requests plus a baseline of normal ones), which is smarter but needs collector memory. For most large systems, tail-based or adaptive sampling is the right default, precisely because the interesting traces are the rare ones a naive random sample throws away.
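The difference between the two strategies shows up clearly on a small synthetic workload. The sketch below is illustrative only: the 500 ms slow threshold, the 1% rates, and the trace dictionaries are assumptions, and a real tail sampler runs in the collector over buffered spans rather than over finished records.

```python
import random

def head_sample(rng, rate=0.01):
    """Decide at trace start, before anything is known about the outcome."""
    return rng.random() < rate

def tail_sample(trace, rng, slow_ms=500.0, baseline_rate=0.01):
    """Decide after completion: keep every error and slow trace, plus a random baseline."""
    if trace["error"] or trace["duration_ms"] >= slow_ms:
        return True
    return rng.random() < baseline_rate

rng = random.Random(42)  # seeded so the run is repeatable

# Synthetic workload: 995 normal traces, 4 slow ones, 1 fast error.
traces = [{"duration_ms": 40.0, "error": False} for _ in range(995)]
traces += [{"duration_ms": 3000.0, "error": False} for _ in range(4)]
traces += [{"duration_ms": 35.0, "error": True}]

def is_interesting(t):
    return t["error"] or t["duration_ms"] >= 500.0

kept_by_tail = [t for t in traces if tail_sample(t, rng)]
kept_by_head = [t for t in traces if head_sample(rng)]

print("interesting kept by tail:", sum(is_interesting(t) for t in kept_by_tail))
print("interesting kept by head:", sum(is_interesting(t) for t in kept_by_head))
```

Tail sampling keeps all five interesting traces by construction. Head sampling keeps each of them only with probability equal to its rate, so at 1% it will usually keep none.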

Alerting done right

An alert is a claim that a human must act now, and every alert that fires without meeting that bar erodes the system’s reliability — not the software’s, the team’s. This is the part of observability that is most often done backwards, and the consequences are human, which makes them easy to ignore until they are catastrophic.

The cardinal rule is alert on symptoms, not causes. A symptom is something a user feels: requests are slow, requests are failing, the SLO is burning. A cause is an internal condition you guessed might lead to a symptom: CPU is high, a queue is long, a disk is filling. Alerting on causes is seductive — they feel proactive — but it pages people for conditions that may never affect a user (high CPU on a service keeping up fine) while missing the failures you didn’t predict. Alerting on symptoms pages only when users are actually hurt, and catches every cause of that hurt, including the ones you never thought to chart. In SLO terms the right alert is a burn-rate alert: it fires when you are consuming error budget fast enough to threaten the SLO, and the best practice — multi-window, multi-burn-rate alerting — pairs a fast window (page now: you’ll exhaust the month’s budget in hours) with a slow window (ticket: a steady leak that needs attention but not a 2 a.m. wake-up). Severity follows the burn rate, not the metric.
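A multi-window burn-rate check can be sketched in a few lines. Burn rate is the observed error ratio divided by the budget fraction (1 − SLO), so a burn rate of 1 spends exactly the budget over the window. The window pairs and the 14.4 and 1.0 thresholds are assumed starting points of the kind commonly quoted for a 30-day SLO, not values from this chapter, and should be tuned per service.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_ratio / (1.0 - slo)

def classify(error_ratios, slo=0.999):
    """error_ratios: observed bad/total ratio keyed by lookback window name.

    Each severity requires a long window and a short window to agree, so the
    alert fires on sustained burn and clears quickly once the burn stops.
    """
    rates = {window: burn_rate(ratio, slo) for window, ratio in error_ratios.items()}
    # Assumed thresholds: 14.4x sustained for an hour spends about 2% of a 30-day budget.
    if rates["1h"] >= 14.4 and rates["5m"] >= 14.4:
        return "page"
    if rates["3d"] >= 1.0 and rates["6h"] >= 1.0:
        return "ticket"
    return "ok"

# A sharp outage: 2% of requests failing across every window.
print(classify({"5m": 0.02, "1h": 0.02, "6h": 0.02, "3d": 0.02}))
# A slow leak: 0.2% failing, twice the sustainable rate.
print(classify({"5m": 0.002, "1h": 0.002, "6h": 0.002, "3d": 0.002}))
# A blip: the last five minutes were bad, the hour was not.
print(classify({"5m": 0.05, "1h": 0.0005, "6h": 0.0005, "3d": 0.0005}))
```

Note that none of the inputs is CPU, queue depth, or any other cause. The only signal is the ratio of bad user-facing requests, which is what makes the alert a symptom alert.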

The second rule is that every alert must be actionable. If the response to a page is “acknowledge and go back to sleep, it’ll clear up,” that alert should not exist. The cost of getting this wrong is alert fatigue, a genuine reliability hazard rather than a quality-of-life complaint: a pager that cries wolf trains the team to ignore it, and the muscle memory of dismissing noise eventually dismisses the one page that mattered. The discipline is ruthless: fewer alerts, each tied to a symptom, each with a runbook, each tuned with a sustained-duration condition (the “for” clause of a Prometheus alerting rule) so it doesn’t flap on momentary spikes. An alert that has fired ten times this month and been actioned zero times isn’t protecting the system; it is degrading the people who protect the system.
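The effect of a sustained-duration condition can be sketched as a simple debounce. This is an approximation of the idea, not the evaluation logic of any particular alerting system: the condition must hold for a number of consecutive evaluations before the alert fires, and the sample series are invented.

```python
def fires(samples, threshold, for_count):
    """True if samples stay above threshold for for_count consecutive evaluations."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= for_count:
            return True
    return False

# Burn rate evaluated once a minute (hypothetical values).
momentary_spike = [0.5, 0.4, 20.0, 0.6, 0.5, 0.4]
sustained_burn = [0.5, 16.0, 18.0, 20.0, 19.0, 0.5]

print(fires(momentary_spike, threshold=14.4, for_count=3))  # False
print(fires(sustained_burn, threshold=14.4, for_count=3))   # True
```

With for_count set to 1 the spike would page; requiring three consecutive breaches suppresses it while still catching the sustained burn within three evaluations.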

War story: the pager that cried wolf, and the cardinality bill that followed

A platform team, stung by an earlier outage, resolved to “never be surprised again” and went on an alerting spree: a page for every service whose CPU crossed 80%, every queue past a threshold, every 5xx. Within a month the on-call was getting forty pages a night, almost all self-resolving — a batch job spiking CPU, a queue draining a beat late, a single transient error. The team did the rational thing and started ignoring the pager: filters, muted channels, alerts swiped away without reading. Then a real incident — a slow, steady error-budget burn on checkout — fired an alert that looked exactly like the noise, got swiped away with the rest, and ran for six hours until a customer escalation forced someone to look. The noisy alerting hadn’t made them safer; it had trained them to miss the one alert that mattered.

The cleanup made it worse first. To diagnose the missed burn, an engineer added user_id and request_id as labels on the request metrics — “so we can slice by user.” Cardinality exploded: every unique user and request became its own time series, the Prometheus instance’s memory climbed for two days and then OOM-killed itself mid-incident, taking the dashboards down at the worst moment, and the metrics bill tripled before anyone caught it. Two lessons, permanently linked: alert on symptoms (SLO burn), not on every cause, or you train your team to ignore the pager — and keep high-cardinality detail in logs and traces, never in metric labels, or you’ll pay for it in memory, money, and an outage of your own making.

A note on logs at scale

Logs are the pillar that scales worst and costs most, and a few platform-level disciplines keep them useful instead of ruinous. Structure them: emit JSON with consistent fields rather than free-form strings, so the log store can index and query by field (level, service, trace_id) instead of regex-scraping text. Correlate them: every line must carry the trace/request ID, which is what turns a pile of unrelated lines into a queryable timeline of one request’s journey and lets you pivot into logs from a trace. Centralize them: ship logs off every ephemeral container to a central store (Loki, Elasticsearch) before the container dies and takes its local logs with it. And sample and tier them: at high volume you cannot afford to keep everything, so log errors and a fraction of successes, push verbose per-request detail into traces instead of INFO logs, and set retention by value — days of hot, searchable logs and weeks of cheap cold archive, not months of everything at full fidelity. The failure mode here is mundane and expensive: a team that logs every request and response body at INFO discovers the logging bill has outgrown the compute bill, and the signal they need is buried under noise they are paying to store. Log deliberately.
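The first two disciplines, structure and correlation, can be sketched with the standard library alone. The JsonFormatter class and the field names (service, trace_id) are illustrative choices for this example; the logs are written to an in-memory stream here so the output can be inspected, where a real service would write to stdout for a shipping agent to collect.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent, queryable fields."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Fields passed via `extra` become attributes on the log record.
logger.info("upstream returned 429",
            extra={"service": "payment", "trace_id": "abc123"})

line = json.loads(stream.getvalue())
print(line["level"], line["service"], line["trace_id"], line["message"])
```

Because trace_id is a field rather than text buried in a message, a log store can filter on it directly, which is what makes the pivot from a trace into its logs a query instead of a grep.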

Build it → Observability as a product is the whole subject of Project 09: Data Observability, which implements freshness, volume, and quality SLIs over data pipelines — the direct analog of service SLOs applied to data. The FastAPI services in Project 05: SaaS Web Platform are a realistic multi-service target to instrument with the golden signals and trace across, and Project 49: AI Benchmark Suite shows the measurement discipline — latency distributions, percentiles, not averages — that underpins trustworthy metrics.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Define the golden signals for a service. Pick a user-facing service (an API, a checkout endpoint) and write down its four golden signals concretely: the latency distribution you’d record (which percentiles), the traffic metric, the error definition (remember the silent-200 case), and the one saturation signal that warns you first. Then sketch the dashboard — what panels, in what order, so an on-call engineer can tell in ten seconds whether the service is healthy — noting for each panel which pillar it draws from.
Level II — Write an SLO and the alert that protects it. For the same service, define an SLI as a ratio of good events to total, choose an SLO target (justify 99.9% over 99.99%), and compute the 30-day error budget in minutes. Then write two alerts in prose: a good symptom-based, multi-window burn-rate alert (a fast window that pages, a slow one that tickets), and a bad cause-based alert (“CPU > 80%”) — explaining why the bad one pages without user pain and why the good one catches failures you never anticipated.
Level III — Design end-to-end observability for a request path. Take a request crossing at least four services (gateway → orders → inventory + payment → third-party API). Describe how a single shared trace ID threads metrics, logs, and traces so that, given only a symptom (“checkout p99 is breaching the SLO”), an engineer can pivot metric → trace → log to localize an arbitrary failure: which labels are safe on the metrics (and which must stay in logs/traces to avoid a cardinality explosion), what context must propagate on each hop and what breaks if it doesn’t, and what sampling keeps the rare slow traces. Finally, write the error-budget policy — what the team may do with a healthy budget versus a fast-burning one — i.e. how the budget governs deploy cadence.

Summary

Observability is not a fancier word for monitoring. Monitoring answers the questions you knew to ask, with dashboards and alerts on predefined signals; observability is the property that lets you ask a question you never anticipated about a running system and answer it from the telemetry it already emits, which is the only thing that works when a system is too complex to predict every failure mode. That telemetry rests on three pillars that answer three different questions: metrics (that something is wrong and how much), logs (what happened), and traces (where, across services). The leverage is in correlating them through a shared trace ID and in defining “healthy” as a number. That number is the SLO: an objective on a user-reflecting SLI, with an error budget that converts reliability from an impossible absolute into a managed quantity and serves as the objective arbiter between shipping and stability. Alert on the symptom (the SLO burn rate), not on causes; keep alerts few and actionable so the pager stays trustworthy; and keep high-cardinality detail out of metrics and in logs and traces, where it belongs.

Key takeaways

Monitoring answers predefined questions; observability lets you ask new ones from the outside. At system scale you cannot predict every failure, so you instrument to investigate anything.
The three pillars are not redundant: metrics tell you that and how much, logs tell you what, traces tell you where. Their power comes from correlating all three via a shared trace/request ID.
Measure the four golden signals (latency, traffic, errors, saturation); record latency as a distribution and SLO on the tail, never the average.
An SLO turns “is it okay?” into a number; the error budget (1 − SLO) is the lever that governs release velocity versus stability and arbitrates between product and SRE.
Distributed tracing is the only practical way to debug latency and causality across a fleet — and it only works if context propagates on every hop.
Alert on symptoms (SLO burn rate), not causes; keep alerts actionable, because a noisy pager trains the team to miss the real one. Keep high-cardinality labels out of metrics.
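To make the "distribution, never the average" takeaway concrete, the sketch below compares the mean with nearest-rank percentiles on synthetic data. The sample (98% of requests at 20 ms, 2% at 3 seconds) is invented for illustration, and `percentile` is a hypothetical helper; production systems would usually derive percentiles from histogram buckets rather than raw samples.

```python
import statistics


def percentile(samples: list, q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 980 fast requests, 20 very slow ones.
latencies_ms = [20.0] * 980 + [3000.0] * 20

mean_ms = statistics.fmean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean = {mean_ms:.1f} ms")
print(f"p50  = {p50_ms:.1f} ms")
print(f"p99  = {p99_ms:.1f} ms")
```

The mean lands well under 100 ms and looks acceptable, the median is 20 ms, and only the p99 exposes the three-second experience that one in fifty users is getting. An SLO on the mean would never notice them.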

Connections to other chapters

Python: Observability (prerequisite): this chapter is the platform-level view; that one is the in-code how-to that produces the telemetry discussed here — the Prometheus client, structured logging, the OpenTelemetry SDK, and the inject call that propagates trace context. Read it for the instrumentation; read this for what to do with the resulting signals across a whole system.
Python: Microservices / Go: Web Services (related): the moment a system is more than a handful of services, distributed tracing stops being optional — it is the only way to follow a request across the boundaries those chapters introduce. The golden signals and context propagation here are what make a microservice fleet operable rather than just buildable.
CI/CD (sibling): a deploy is an event you watch. The metrics this chapter defines are exactly what a canary or progressive rollout reads to decide whether to proceed or roll back, and the error budget is what tells the pipeline whether the team has room to ship at all. Observability is what closes the loop on continuous delivery.
Kubernetes (extension, Part V): on a cluster, application telemetry and infrastructure telemetry meet — pod restarts, node saturation, and request latency in one place — and the autoscaler is a direct consumer of metrics, reading them to add or remove capacity. The golden signals you define here become the inputs to scheduling and scaling decisions there.

Beyer, Jones, Petoff & Murphy (eds.), Site Reliability Engineering (Google, O’Reilly, 2016) — the foundational text on SLIs, SLOs, error budgets, and the four golden signals; the “Monitoring Distributed Systems” chapter is the canonical statement of symptom-based alerting.
The Site Reliability Workbook (Google, O’Reilly, 2018) — the practical companion, with worked examples of defining SLOs and building multi-window, multi-burn-rate alerts.

Deep dives

Majors, Fong-Jones & Miranda, Observability Engineering (O’Reilly, 2022) — the modern case for observability as distinct from monitoring, high-cardinality event data, and debugging the unknown-unknowns.
OpenTelemetry documentation (opentelemetry.io) — the vendor-neutral standard that unifies metrics, logs, and traces behind one instrumentation API and one trace context.

Historical context

Sigelman et al., “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure” (Google, 2010) — the paper that introduced distributed tracing as we practice it, and the lineage behind every spans-and-trace-ID system since (Zipkin, Jaeger, Tempo).

--- title: "Observability" keywords: [observability, sre, slo, sli, error budget, three pillars, metrics, logs, traces, alerting, golden signals, monitoring] difficulty: intermediate prerequisites: [python-observability, distributed-systems-basics] estimated_time: "3-4 hours" --- ## Introduction At 2 a.m. the pager went off, and the first thing the on-call engineer did was the thing everyone does: open the dashboards. They were green. Request rate, normal. CPU across the fleet, comfortable. Error rate, a flat line near zero. Every chart the team had ever thought to build said the system was healthy. And yet support tickets were piling up — real users, real money, checkout failing intermittently for reasons no graph could explain. The dashboards answered every question the team had anticipated, and the outage was none of those questions. The system was twenty-odd microservices deep. A request entered at the gateway and fanned out to auth, cart, inventory, payments, a fraud check, each calling two or three more. The "error rate" the dashboard showed was an average across all of them, and the failure was a tail: a small fraction of requests that touched one downstream dependency timed out, retried, and returned a degraded-but-technically-200 response. No single service's error gauge moved enough to cross a threshold. The aggregate hid the tail; the per-service charts hid the *path*. The team could see that *something* was off only because users told them — and then they were stuck, because nothing they had told the system to watch could tell them *which of twenty services* was the culprit. What broke the impasse, an hour in, was someone pulling up a single failing request by its trace ID and reading it end to end. There it was: 380 milliseconds spent inside the payment service, waiting on a third-party API that was rate-limiting them. The trace localized in seconds what the dashboards could not localize at all. 
That gap — between the green dashboards and the one revealing trace — is the difference between *monitoring* and *observability*. The team had built monitoring: answers to questions they knew to ask in advance. What they lacked was observability: the ability to ask a question they had *never anticipated* about a running system and get an answer out of the telemetry it was already emitting. This is the defining problem of operating systems at scale. You cannot predict every way a distributed system will fail, because the failure modes grow combinatorially with the number of services and their interactions. So you stop trying to predict them, and instead instrument the system richly enough that, when it fails in a way nobody foresaw, you can investigate it from the outside — reconstructing what happened from the data it left behind, rather than from a chart you were lucky enough to have built. ### The Core Insight Monitoring and observability are often used interchangeably, and conflating them is exactly the mistake that produced the all-green outage. They are different in kind, not degree. **Monitoring** is the practice of watching *predefined* signals for *known* failure modes. You decide in advance what matters — error rate, latency, queue depth — build dashboards and alerts on those signals, and the system tells you when one crosses a line you drew. Monitoring is necessary and it is good at what it does: it answers the questions you already knew to ask. Its limit is precisely that. A dashboard can only show you a failure you anticipated well enough to chart. **Observability** is a *property of the system* rather than a set of dashboards: the degree to which you can understand its internal state from its external outputs. A system is observable when its telemetry is rich and structured enough that you can pose a new, ad hoc question — "why are requests from premium users in eu-west slow only on the recommendation path?" 
— and answer it without shipping new code to go look. The term comes from control theory, where a system is observable if its internal state can be inferred from its outputs alone; here you treat the running system as a black box and reconstruct what's happening inside it from what it emits. That telemetry rests on **three pillars — metrics, logs, and traces** — and the point of this chapter is that each answers a *different* question, and that the real leverage sits in two places the cookbook treatments miss. The first is *correlation*: a metric tells you something is wrong, a trace tells you where, a log tells you what — and you can only follow that chain if all three share an identifier tying them to the same request. The second is *definition*: deciding what "healthy" even means, as a number, so "is the system okay?" stops being a 2 a.m. judgment call and becomes a measurable target — a **service-level objective** with an error budget attached. ### A mental model Think of the three pillars as instruments on a flight deck, each reading the same aircraft but answering a different question. The **altimeter and airspeed indicator are your metrics**: continuous numeric readouts that tell you *that* something is wrong and *how much* — you're descending, and at this rate. Cheap to read, always on, perfect for noticing trouble, useless for explaining it. The **flight data recorder is your logs**: the detailed, timestamped record of discrete events — *what* happened, exactly, at this moment, in this component. And the **trace is the flight path drawn on a map of the whole route**: it shows *where*, across every leg of the journey, the time went and the trouble started. A pilot with only the altimeter knows they're in trouble but not where; one with only the recorder can reconstruct events but not see the shape of the flight. You need all three, synchronized to the same clock — which, in our world, is the same trace ID. 
The instrument that turns flying into a discipline rather than a vibe is the **service-level objective**, which converts "is it okay?" into a number. Instead of arguing about whether 200 milliseconds is "slow," you declare a target — *99.9% of checkout requests complete in under 500 ms over a 30-day window* — and now "okay" is defined, measurable, and the same answer for everyone. The gap between that target and perfection is your **error budget**: the unreliability you are *allowed* to spend. It is the single most useful number in operations, because it turns reliability from an impossible absolute (nothing is 100%) into a quantity you can manage like any other. ### What to measure: the golden signals and SLOs The first question anyone asks when instrumenting a service is "what do I measure?" — and the second is "how do I avoid drowning in dashboards?" Google's SRE practice answers both, and it has two halves. The first half is the **four golden signals**, the minimum set that characterizes the health of any user-facing service: 1. **Latency** — how long requests take, as a distribution (p50, p95, p99), *never* an average, because the average hides the tail where users suffer. 2. **Traffic** — how much demand the service is under: requests per second, by endpoint. 3. **Errors** — the rate of failed requests, including silent ones (a 200 that returns the wrong answer is still an error). 4. **Saturation** — how full the service is: utilization of its most constrained resource (CPU, memory, connection pool), the signal that predicts trouble *before* latency and errors degrade. Instrument those four and you have caught most of what will hurt you. (Two named recipes formalize this: **RED** for request-driven services, **USE** for resources — the same idea pointed at different things, expanded in the metrics section below.) The second half is the chain that turns a signal into a decision: **SLI → SLO → error budget → alert**. 
A *service-level indicator* (SLI) is a carefully chosen metric that reflects user happiness — the *proportion of good requests*, say, where "good" means fast and successful. A *service-level objective* (SLO) is a target on that SLI over a window: 99.9% good over 30 days. The *error budget* is the remainder: 0.1% of requests, roughly 43 minutes per month, that you are permitted to fail. And the alert fires not when CPU is high or a single error appears, but when you are *burning that budget too fast to make the month*. That is the cardinal rule, and it is the lesson of the 2 a.m. outage in reverse: **alert on symptoms, not causes** — on the user-facing SLO breach, not on the internal metric you guessed might cause one. A high CPU alert pages you for a non-problem; a burn-rate alert pages you only when users are actually being hurt. @fig-obs-pillars shows how the three pillars feed this loop. ### What you'll learn - How observability differs from monitoring, and why the distinction decides whether you can debug a failure you never anticipated - What each of the three pillars — metrics, logs, traces — is uniquely good at, and how they correlate through a shared request/trace ID into one investigable story - How to choose what to measure using the four golden signals and the RED/USE methods, and why latency must always be a distribution, never an average - How to define reliability as a number — SLI, SLO, and error budget — and use the budget to balance release velocity against stability - Why distributed tracing is the only practical way to debug latency and causality across a fleet of services, and how context propagation makes it work - How to build alerting that pages on symptoms (SLO burn rate), not causes, and why alert fatigue is a reliability risk in its own right - Where the three pillars' backends live (Prometheus/Grafana, a log store, a tracing backend) and what each costs as the system grows ### Prerequisites - **Python: Observability** — the in-code, 
application-level instrumentation that *produces* the telemetry this chapter operates on: the Prometheus client, structured logging, and the OpenTelemetry SDK inside a service. This chapter is the platform-level view of where all that telemetry goes and what you do with it. - **Distributed systems basics** — services that call other services over the network, why partial failure and tail latency are the norm, and why a single request can touch a dozen processes. - Comfort reading a dashboard and a query language conceptually; you do not need to master PromQL here, but you should be willing to think in rates and percentiles. --- ## The three pillars at system scale The phrase "three pillars" is everywhere, and it usually arrives as a flat list: metrics, logs, traces, here is a tool for each. That framing buries the actual insight, which is that the three pillars are not interchangeable and not redundant — they answer three genuinely different questions, and an investigation works by *pivoting between them*. Hold onto the three questions and everything else falls into place: **metrics tell you _that_ something is wrong and _how much_; logs tell you _what_ happened; traces tell you _where_, across services.** **Metrics** are aggregated numbers over time — counters, gauges, and distributions sampled at a fixed interval and stored as time series. Their superpower is that they are cheap and always on: a metric is a handful of numbers per scrape no matter how much traffic flows through, so you can keep months of history and ask "how does today's p99 compare to last Tuesday's?" for almost nothing. Their weakness is the flip side — a metric is an *aggregate* that has thrown away the individual events, so it can tell you the error rate jumped to 3% but not *which* requests failed or why. Metrics are how you *notice* trouble and measure SLOs; they are the worst of the three for *diagnosing* it. 
The backend is almost universally **Prometheus** for collection and storage and **Grafana** for visualization: Prometheus scrapes each service's `/metrics` endpoint on an interval into a local time-series database, Grafana draws the dashboards, and Alertmanager fires the alerts. (The pull model, the four metric types, and PromQL are hands-on in *Python: Observability*; here it's enough that metrics are scraped, aggregated, and cheap.) **Logs** are the opposite trade-off: discrete, timestamped event records, one per thing that happened, carrying full context. A log line can say *exactly* what occurred — this user, this order, this exception, this downstream response code — with no aggregation loss. That richness is the cost: logs are voluminous and expensive to store and search, and at fleet scale you cannot keep or read all of them. The backend is a centralized **log store** — Grafana Loki for a Kubernetes-native, label-indexed, cost-conscious setup, or the ELK stack (Elasticsearch) when you need full-text search and arbitrary-field queries — fed by a shipping agent (Fluentd, Filebeat, Vector) that collects logs off every host. Logs are how you learn *what exactly* happened once a metric or trace has told you *where* to look. **Traces** are the pillar distributed systems make indispensable. A trace follows *one* request across every service, recording each hop as a *span* with its own start time, duration, and parent — assembling into a tree that is, in effect, the call stack of a request that happens to run across many machines. It is the only artifact that captures *causality across service boundaries*: it shows that the gateway's 420 ms was mostly the order service's 370 ms, which was mostly the payment service waiting 210 ms on a third-party API. The backend is a dedicated **tracing system** — **Jaeger** or Grafana Tempo — that receives spans (usually via an OpenTelemetry collector), reassembles them by trace ID, and renders the per-request waterfall the 2 a.m. 
engineer finally read. The pillars become a *system* — rather than three disconnected tools — through one discipline: **correlation by a shared identifier**. Every metric exemplar, log line, and span for a given request carries the same **trace ID**. That shared key is what lets you pivot: a burn-rate alert fires on a *metric*; from its exemplar you jump to a representative *trace* and see the time was spent in the payment service; you filter the *logs* by that trace ID and read that the service got three 429s from Stripe. Metric → trace → log, *that → where → what*, in three clicks instead of an hour of grepping. Without the shared ID you have three silos and a war room; with it, one reconstructable story. @fig-obs-pillars lays out the flow: one instrumented request emitting all three pillars, each tagged with the trace ID, each flowing to its own backend, all correlating back into a single picture you can interrogate from the outside. ![The three pillars of observability at system scale: metrics (that something is wrong and how much) feed dashboards and SLO alerts, logs (what happened) feed a searchable store, and traces (where, across services) feed a per-request waterfall — all tagged with a shared trace id so an arbitrary failure can be reconstructed from the outside.](../assets/diagrams/rendered/obs_three_pillars.svg){#fig-obs-pillars .lightbox} ::: {.callout-note} The three pillars are increasingly seen as an implementation detail of a single goal rather than three separate products. OpenTelemetry — the vendor-neutral standard for generating and shipping all three signal types — exists precisely to unify them behind one instrumentation API and one trace ID, so that correlation is the default rather than an integration project. Think in terms of *correlated telemetry about requests*, not "the metrics tool, the logs tool, and the traces tool." 
::: ## Metrics and the golden signals Metrics deserve a closer look because they are where most teams start and where most teams overspend. A metric is a **time series**: a named stream of numeric samples, each tagged with a set of *labels* (key-value dimensions like `method`, `endpoint`, `status`) and a timestamp. The labels are what make metrics multidimensional and queryable — `http_requests_total{status="500", endpoint="/checkout"}` is a different series from the same metric with `status="200"` — and they are also where the bodies are buried, as we will see. The golden signals tell you *which* metrics to collect; RED and USE tell you how to organize them. For any request-handling service, **RED** — Rate, Errors, Duration (the latency distribution) — gives you roughly 80% of the operational insight for 20% of the effort. For any resource it depends on — a disk, a connection pool, a queue — **USE** — Utilization, Saturation, Errors — does the same. They are complementary: RED watches the service from the user's side, USE watches resources from the system's side, and saturation bridges them, because resources saturate *before* user-facing latency degrades. The number most often gotten wrong is **latency, which you must record as a distribution**. An average is nearly useless: a service with a 50 ms average might serve half its users in 10 ms and half in 90 ms, or 99% in 20 ms and 1% in 3 seconds — and the second case is an outage for that 1%, invisible in the mean. So metrics systems use *histograms*, bucketing observations so you can compute percentiles (p50, p95, p99) after the fact, and you SLO on the tail (p99), not the average, because the tail is where users churn. The cost of metrics is **cardinality**, the single most common way teams break their metrics stack. Every unique combination of label values is a *separate time series*, stored and indexed independently. A metric with `method` (5 values), `endpoint` (50), and `status` (10) is 2,500 series — fine. 
Add a `user_id` label and you multiply by your user count; `user_id` × `request_id` is effectively unbounded, and the time-series database runs out of memory and falls over. The rule is firm: **labels must have bounded, low cardinality** — methods, status classes, endpoints, regions, yes; user IDs, request IDs, full URL paths, no. High-cardinality detail belongs in logs and traces, which are *built* for per-event data; putting it in metrics is the classic, expensive mistake (see the war story below). ## SLIs, SLOs, and error budgets This is the SRE core, where observability stops being a set of tools and becomes a way of *making decisions*. The motivating problem is that "reliability" sounds like an absolute — more is better, aim for 100% — and that framing is both wrong and ruinous. Wrong, because nothing is 100%: the network drops packets, dependencies fail, deploys have bugs, and chasing the last fraction of a nine costs exponentially more than the one before it. Ruinous, because a team that treats every error as unacceptable freezes all change to avoid risk, and a system that never changes is a dead system. The error budget cures both by making reliability a *quantity to manage* rather than an absolute to chase. The chain is short. An **SLI** (service-level indicator) is a metric chosen to reflect user experience, most usefully expressed as a *ratio of good events to total* — the fraction of requests that were both successful and fast enough. An **SLO** (service-level objective) is a target on that SLI over a rolling window: "99.9% of requests good over the last 30 days." (Don't confuse it with an **SLA**, a service-level *agreement* — an external contract with financial or legal consequences. Your internal SLO should be *stricter* than any SLA, so you find out before your customers do.) The **error budget** is simply `1 − SLO` over the window: at 99.9%, 0.1% of requests, which for a 30-day month is about 43 minutes of allowed unavailability. 
The error budget is a *lever*, and this is what changes how a team operates. While the budget is healthy, the team ships freely — takes risks, runs experiments, deploys on Friday — because the cost of a small mistake is covered. When the budget is burning fast, the policy flips: non-essential deploys freeze, focus shifts from features to stability, on-call gets paged. The budget becomes the *shared, objective arbiter* between two teams otherwise structurally at war — product, which wants to ship, and SRE, which wants stability. Instead of arguing about whether a deploy is "too risky," they look at the budget: full, ship it; empty, fix it first. The number, not the loudest voice, decides. One caution on the target: do not reach for 99.99% because more nines sound better. Each nine is an order of magnitude more expensive, and 99.99% leaves about 4 minutes of budget *per month* — often less than a single deploy or a brief dependency blip — so you live in permanent freeze, paging on noise. Start at 99.5% or 99.9%, measure actual user impact, and tighten only where the data shows users genuinely care. ## Distributed tracing In a monolith, when a request is slow you attach a profiler and read the call stack. In a fleet of microservices, that call stack is *distributed across processes and machines*, and no single profiler can see it: the request enters at the gateway, and from there the "stack" is a tree of network calls — to auth, to the order service, which calls inventory and payment, which calls a third-party API — each running in its own process, each timed only against its own clock. Distributed tracing *reassembles* that scattered call stack into one readable picture. It is not a nice-to-have past a handful of services; it is the *only* practical way to debug latency and causality across them, which is why the 2 a.m. outage was unsolvable until someone read a trace. The model is small. 
A **trace** is one request's whole journey, identified by a single **trace ID**. Each unit of work along the way — a service handling the request, a database query, an outbound call — is a **span**, with a start time, duration, status, attributes (the endpoint, the order ID), and a pointer to its *parent*. The spans form a tree, and rendered on a timeline they become the **waterfall**: horizontal bars showing what ran when and for how long, nested by parent-child, so the critical path — the longest chain that determined total latency — jumps out visually. You look at the waterfall and *see* that 380 of the request's 420 ms were spent in one span deep in the payment service, and you have localized the problem. The mechanism that makes this possible — and the one thing that most often breaks it — is **context propagation**. For spans from different services to assemble into one trace, each outbound call must carry the trace ID and current span ID forward so the receiver can attach its spans as children. That context rides in request metadata: an HTTP header (the W3C `traceparent` standard), gRPC metadata, message-broker headers. If any hop drops it, the trace fractures into disconnected single-service fragments, useless for debugging a cascade — which is why auto-instrumentation libraries exist, injecting propagation into every common framework so application code need not remember to forward headers on every call. (The in-code mechanics — configuring the OpenTelemetry SDK, creating spans, calling `inject` on outbound requests — live in *Python: Observability*.) The last concept is **sampling**, forced on you by volume: a high-traffic system can generate tens of millions of spans per minute, more than the service they describe is worth storing, so you keep a fraction. *Head-based* sampling decides at trace start (a random 1%, say) — cheap, but it may discard the rare slow request you most wanted. 
*Tail-based* sampling buffers the whole trace in the collector and decides *after* the outcome is known — keep all errors and slow requests plus a baseline of normal ones — which is smarter but needs collector memory. For most large systems, tail-based or adaptive sampling is the right default, precisely because the interesting traces are the rare ones a naive random sample throws away. ## Alerting done right An alert is a claim that a human must act *now*, and every alert that fires without meeting that bar erodes the system's reliability — not the software's, the team's. This is the part of observability that is most often done backwards, and the consequences are human, which makes them easy to ignore until they are catastrophic. The cardinal rule is **alert on symptoms, not causes**. A symptom is something a user feels: requests are slow, requests are failing, the SLO is burning. A cause is an internal condition you *guessed* might lead to a symptom: CPU is high, a queue is long, a disk is filling. Alerting on causes is seductive — they feel proactive — but it pages people for conditions that may never affect a user (high CPU on a service keeping up fine) while missing the failures you didn't predict. Alerting on symptoms pages only when users are actually hurt, and catches *every* cause of that hurt, including the ones you never thought to chart. In SLO terms the right alert is a **burn-rate alert**: it fires when you are consuming error budget fast enough to threaten the SLO, and the best practice — *multi-window, multi-burn-rate* alerting — pairs a fast window (page now: you'll exhaust the month's budget in hours) with a slow window (ticket: a steady leak that needs attention but not a 2 a.m. wake-up). Severity follows the burn rate, not the metric. The second rule is that **every alert must be actionable**. If the response to a page is "acknowledge and go back to sleep, it'll clear up," that alert should not exist. 
The cost of getting this wrong is **alert fatigue**, a genuine reliability hazard rather than a quality-of-life complaint: a pager that cries wolf trains the team to ignore it, and the muscle memory of dismissing noise eventually dismisses the one page that mattered. The discipline is ruthless — fewer alerts, each tied to a symptom, each with a runbook, each tuned (a sustained-duration `for` clause) so it doesn't flap on momentary spikes. An alert that has fired ten times this month and been actioned zero times isn't protecting the system; it is degrading the people who protect the system. ::: {.callout-warning} ## War story: the pager that cried wolf, and the cardinality bill that followed A platform team, stung by an earlier outage, resolved to "never be surprised again" and went on an alerting spree: a page for every service whose CPU crossed 80%, every queue past a threshold, every 5xx. Within a month the on-call was getting forty pages a night, almost all self-resolving — a batch job spiking CPU, a queue draining a beat late, a single transient error. The team did the rational thing and started ignoring the pager: filters, muted channels, alerts swiped away without reading. Then a real incident — a slow, steady error-budget burn on checkout — fired an alert that looked exactly like the noise, got swiped away with the rest, and ran for six hours until a customer escalation forced someone to look. The noisy alerting hadn't made them safer; it had trained them to miss the one alert that mattered. The cleanup made it worse first. To diagnose the missed burn, an engineer added `user_id` and `request_id` as labels on the request metrics — "so we can slice by user." Cardinality exploded: every unique user and request became its own time series, the Prometheus instance's memory climbed for two days and then OOM-killed itself mid-incident, taking the dashboards down at the worst moment, and the metrics bill tripled before anyone caught it. 
Two lessons, permanently linked: **alert on symptoms (SLO burn), not on every cause, or you train your team to ignore the pager** — and **keep high-cardinality detail in logs and traces, never in metric labels, or you'll pay for it in memory, money, and an outage of your own making.** ::: ## A note on logs at scale Logs are the pillar that scales worst and costs most, and a few platform-level disciplines keep them useful instead of ruinous. **Structure them**: emit JSON with consistent fields rather than free-form strings, so the log store can index and query by field (`level`, `service`, `trace_id`) instead of regex-scraping text. **Correlate them**: every line must carry the trace/request ID, which is what turns a pile of unrelated lines into a queryable timeline of one request's journey and lets you pivot into logs from a trace. **Centralize them**: ship logs off every ephemeral container to a central store (Loki, Elasticsearch) before the container dies and takes its local logs with it. And **sample and tier them**: at high volume you cannot afford to keep everything, so log errors and a fraction of successes, push verbose per-request detail into traces instead of `INFO` logs, and set retention by value — days of hot, searchable logs and weeks of cheap cold archive, not months of everything at full fidelity. The failure mode here is mundane and expensive: a team that logs every request and response body at `INFO` discovers the logging bill has outgrown the compute bill, and the signal they need is buried under noise they are paying to store. Log deliberately. > **Build it →** Observability as a product is the whole subject of > [Project 09: Data Observability](https://github.com/jchu0/applied-cs-projects/tree/main/09-data-observability), > which implements freshness, volume, and quality SLIs over data pipelines — the direct > analog of service SLOs applied to data. 
> The FastAPI services in
> [Project 05: SaaS Web Platform](https://github.com/jchu0/applied-cs-projects/tree/main/05-saas-web-platform)
> are a realistic multi-service target to instrument with the golden signals and trace
> across, and
> [Project 49: AI Benchmark Suite](https://github.com/jchu0/applied-cs-projects/tree/main/49-ai-benchmark-suite)
> shows the measurement discipline — latency distributions, percentiles, not averages —
> that underpins trustworthy metrics.

---

## Practical exercise

**Difficulty:** Level I · Level II · Level III

1. **Level I — Define the golden signals for a service.** Pick a user-facing service (an API, a checkout endpoint) and write down its four golden signals concretely: the latency distribution you'd record (which percentiles), the traffic metric, the error definition (remember the silent-200 case), and the one saturation signal that warns you first. Then sketch the dashboard — what panels, in what order, so an on-call engineer can tell in ten seconds whether the service is healthy — noting for each panel which pillar it draws from.

2. **Level II — Write an SLO and the alert that protects it.** For the same service, define an SLI as a ratio of good events to total, choose an SLO target (justify 99.9% over 99.99%), and compute the 30-day error budget in minutes. Then write two alerts in prose: a *good* symptom-based, multi-window burn-rate alert (a fast window that pages, a slow one that tickets), and a *bad* cause-based alert ("CPU > 80%") — explaining why the bad one pages without user pain and why the good one catches failures you never anticipated.

3. **Level III — Design end-to-end observability for a request path.** Take a request crossing at least four services (gateway → orders → inventory + payment → third-party API).
   Describe how a *single shared trace ID* threads metrics, logs, and traces so that, given only a symptom ("checkout p99 is breaching the SLO"), an engineer can pivot metric → trace → log to localize an *arbitrary* failure: which labels are safe on the metrics (and which must stay in logs/traces to avoid a cardinality explosion), what context must propagate on each hop and what breaks if it doesn't, and what sampling keeps the rare slow traces. Finally, write the error-budget policy — what the team may do with a healthy budget versus a fast-burning one — i.e. how the budget governs deploy cadence.

## Summary

Observability is not a fancier word for monitoring. Monitoring answers the questions you knew to ask, with dashboards and alerts on predefined signals; observability is the property that lets you ask a question you *never anticipated* about a running system and answer it from the telemetry it already emits — which is the only thing that works when a system is too complex to predict every failure mode. That telemetry rests on three pillars that answer three different questions — metrics (*that* something is wrong and *how much*), logs (*what* happened), traces (*where*, across services) — and the leverage is in correlating them through a shared trace ID and in defining "healthy" as a number. That number is the SLO: an objective on a user-reflecting SLI, with an error budget that converts reliability from an impossible absolute into a managed quantity and serves as the objective arbiter between shipping and stability. Alert on the symptom — the SLO burn rate — not on causes, keep alerts few and actionable so the pager stays trustworthy, and keep high-cardinality detail out of metric labels and in logs and traces, where it belongs.

### Key takeaways

- Monitoring answers predefined questions; observability lets you ask new ones from the outside. At system scale you cannot predict every failure, so you instrument to investigate anything.
- The three pillars are not redundant: metrics tell you *that* and *how much*, logs tell you *what*, traces tell you *where*. Their power comes from correlating all three via a shared trace/request ID.
- Measure the four golden signals (latency, traffic, errors, saturation); record latency as a distribution and SLO on the tail, never the average.
- An SLO turns "is it okay?" into a number; the error budget (`1 − SLO`) is the lever that governs release velocity versus stability and arbitrates between product and SRE.
- Distributed tracing is the only practical way to debug latency and causality across a fleet — and it only works if context propagates on every hop.
- Alert on symptoms (SLO burn rate), not causes; keep alerts actionable, because a noisy pager trains the team to miss the real one. Keep high-cardinality labels out of metrics.

### Connections to other chapters

- **Python: Observability** (prerequisite): this chapter is the platform-level view; that one is the in-code how-to that *produces* the telemetry discussed here — the Prometheus client, structured logging, the OpenTelemetry SDK, and the `inject` call that propagates trace context. Read it for the instrumentation; read this for what to do with the resulting signals across a whole system.
- **Python: Microservices / Go: Web Services** (related): the moment a system is more than a handful of services, distributed tracing stops being optional — it is the only way to follow a request across the boundaries those chapters introduce. The golden signals and context propagation here are what make a microservice fleet operable rather than just buildable.
- **CI/CD** (sibling): a deploy is an event you watch. The metrics this chapter defines are exactly what a canary or progressive rollout *reads* to decide whether to proceed or roll back, and the error budget is what tells the pipeline whether the team has room to ship at all. Observability is what closes the loop on continuous delivery.
- **Kubernetes** (extension, Part V): on a cluster, application telemetry and infrastructure telemetry meet — pod restarts, node saturation, and request latency in one place — and the autoscaler is a *direct consumer* of metrics, reading them to add or remove capacity. The golden signals you define here become the inputs to scheduling and scaling decisions there.

## Further reading

### Essential

- Beyer, Jones, Petoff & Murphy (eds.), *Site Reliability Engineering* (Google, O'Reilly, 2016) — the foundational text on SLIs, SLOs, error budgets, and the four golden signals; the "Monitoring Distributed Systems" chapter is the canonical statement of symptom-based alerting.
- *The Site Reliability Workbook* (Google, O'Reilly, 2018) — the practical companion, with worked examples of defining SLOs and building multi-window, multi-burn-rate alerts.

### Deep dives

- Majors, Fong-Jones & Miranda, *Observability Engineering* (O'Reilly, 2022) — the modern case for observability as distinct from monitoring, high-cardinality event data, and debugging the unknown-unknowns.
- *OpenTelemetry* documentation (opentelemetry.io) — the vendor-neutral standard that unifies metrics, logs, and traces behind one instrumentation API and one trace context.

### Historical context

- Sigelman et al., *"Dapper, a Large-Scale Distributed Systems Tracing Infrastructure"* (Google, 2010) — the paper that introduced distributed tracing as we practice it, and the lineage behind every spans-and-trace-ID system since (Zipkin, Jaeger, Tempo).