Python: Observability

Keywords

observability, structured logging, metrics, distributed tracing, prometheus, opentelemetry, correlation id, instrumentation, slo

Introduction

It is 2:14 a.m. The pager says the checkout service is “down.” It is not down — the homepage loads, the health check is green, and one engineer can place a test order while another cannot. Somewhere between the load balancer and the database, some fraction of requests is failing, and nobody can say which fraction, on which endpoint, for whom, or why. The on-call engineer tails the log, and what scrolls past is a wall of prose: print-style lines like Processing order for user... done and Error in payment, with no timestamp worth the name, no request ID, no status code, no way to tell one user’s failing request from the thousand healthy ones interleaved with it. No graph shows the error rate climbing, so nobody knows whether this started ten minutes ago or ten hours ago. And when the trail leads into the payment service it simply stops — the two services share no thread of identity, so there is no way to follow a single request across the boundary. The team spends two hours grepping, guessing, and restarting things until the symptom goes away on its own. They never learn the cause.

That outage was not caused by a hard bug. It was caused by a blind system. The code was probably fine; the team’s ability to ask questions of the running system was not. You can’t operate what you can’t see, and a service that emits unstructured strings, no metrics, and no cross-service trace is a black box with a power cord. This chapter is about instrumenting a Python service so that the next 2 a.m. incident is a five-minute query instead of a two-hour archaeology dig — turning the system from something you poke at from the outside into something that tells you what it is doing.

The Core Insight

It is tempting to call this “monitoring,” but observability is a different and larger thing, and the distinction is the whole point. Monitoring answers questions you already knew to ask: you decide in advance that CPU above 80% matters, you build a dashboard and an alert for it, and the system tells you when that specific thing happens. It is a fixed set of pre-wired questions. The 2 a.m. outage was a failure of monitoring precisely because nobody had pre-wired the question “is a specific subset of POST /checkout requests failing for users in one region?” — and you cannot pre-wire every question, because the interesting failures are the ones you did not anticipate.

Observability is the property that lets you ask new questions about a running system from the outside, without shipping new code, using only the telemetry the system already emits. It is a property of the system, not a product you buy. A system is observable to the degree that its external outputs let you reconstruct its internal state — to the degree that, when something novel breaks, you can slice and pivot your way to the cause from data that is already there.

That property rests on three pillars, each answering a question the others cannot:

Logs — discrete, timestamped records of what happened. A log line is an event with context: this user, this order, this error, at this moment. Logs are how you explain a failure once you’ve found it.
Metrics — aggregated numbers measuring how much and how often: requests per second, error rate, latency distribution, queue depth. Metrics are cheap to store and fast to query over long windows, which makes them the substrate for dashboards and alerts. Metrics are how you notice a failure and see its trend.
Traces — the causal path of a single request as it moves across services, recording where the time went. A trace is how you find which component, in a chain of five, is the slow or failing one.

The slogan worth memorizing: metrics tell you that something is wrong, traces tell you where, and logs tell you why. No single pillar is sufficient. The 2 a.m. team had none of them.

A mental model

Picture the three pillars as three lenses pointed at the same stream of events. A single request generates one underlying reality — it arrived, did some work, called another service, succeeded or failed, took some number of milliseconds — and each lens captures a different projection of it. The log lens captures the narrative: the discrete things that happened, in order, with context. The metric lens captures the statistics: strip away the individuality and count, sum, and bucket the events into numbers you can chart. The trace lens captures the shape: the request as a tree of timed operations spanning service boundaries.

Or, in the operator’s framing: telemetry is the instrument panel of a machine you cannot open up. You cannot pull a running distributed system onto a workbench and watch the gears turn; what you have instead is the dashboard — the altimeter, the fuel gauge, the engine-temperature light — and your ability to fly through weather depends on whether those instruments exist and agree. An uninstrumented service is a cockpit with the windshield painted black. The work of this chapter is wiring up the gauges.

What to instrument (and what it costs)

Telemetry is not free: it costs CPU to emit, network to ship, storage to retain, and money to query, whether that money goes to a vendor or to running your own time-series database (TSDB). So instrumentation is a budgeting decision, and a few defaults make the budget go far. Figure 12.1 shows the shape: one instrumented request fanning out into three signals that share an identity.

Structure everything, log deliberately. Make every log line structured (JSON, with fields) rather than a string — the marginal cost is near zero and the payoff is that logs become queryable — but don’t log every line at every level in the hot path: pick levels that mean something and emit DEBUG sparingly in production.

Metrics for the golden signals. Google’s SRE practice names four signals that, for a request-serving service, catch most of what matters: latency (how long requests take), traffic (how many you’re getting), errors (how many fail), and saturation (how full the system is). Instrument those four first; they are cheap, aggregated, and exactly what an alert needs to watch.

Traces for cross-service latency. A trace is the only tool that answers “the request took 900 ms — where did the 900 ms go?” across a chain of services. Sample them: you rarely need every trace, just a representative slice plus all the errors and slow ones.
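The sampling rule described here (keep every error, keep every slow request, keep a small slice of the rest) can be sketched as one decision function. This is a minimal illustration under stated assumptions, not how any particular tracing backend implements it: `keep_trace`, the 500 ms slow threshold, and the 5% baseline rate are all invented for the example, and the decision is taken after the request has finished so that its duration and outcome are known.

```python
import hashlib


def keep_trace(
    trace_id: str,
    duration_ms: float,
    is_error: bool,
    slow_ms: float = 500.0,      # assumed threshold for "slow"
    sample_rate: float = 0.05,   # assumed baseline: keep about 5% of healthy traces
) -> bool:
    """Decide whether a finished trace is worth retaining."""
    # Errors and slow requests are the traces you will actually go looking for.
    if is_error or duration_ms >= slow_ms:
        return True
    # Map the trace ID to a stable number in [0, 1) so the baseline decision
    # is deterministic for a given ID rather than a fresh coin flip.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < sample_rate


kept = sum(keep_trace(f"trace-{i}", 20.0, False) for i in range(10_000))
print(f"kept {kept} of 10000 healthy traces")
```

Hashing the trace ID, rather than calling a random number generator, means any component that sees the same ID reaches the same baseline verdict, which avoids traces with missing pieces.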

The one caveat that turns telemetry from an asset into a liability is cardinality. A metric’s cost is roughly the number of distinct label-value combinations it can produce — its time series count. A label like endpoint has a handful of values and is fine. A label like user_id has millions, and attaching it to a metric quietly creates millions of time series, which can take a Prometheus instance down or turn a metrics bill into a budget incident. The discipline is simple and non-negotiable: metric labels must be bounded; high-cardinality identifiers belong in logs and traces, never in metric labels.
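The arithmetic behind that warning is a plain product. The sketch below uses made-up label counts (the numbers and the `series_count` helper are assumptions for illustration) and computes an upper bound, since not every combination of values necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Bounded labels: the product stays small.
bounded = {"method": 4, "endpoint": 25, "status": 5}
print(series_count(bounded))        # 500

# One unbounded label multiplies everything that came before it.
with_user = {**bounded, "user_id": 1_000_000}
print(series_count(with_user))      # 500000000
```

The label that looks harmless in a code review is a multiplier, not an addition, which is why a single unbounded label is enough to cause the problem.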

What you’ll learn

How observability differs from monitoring, and why the three pillars — logs, metrics, traces — each answer a question the others cannot
How to replace print-style logging with structured JSON logs carrying a request/correlation ID, using structlog or the standard library done right
How to choose log levels that mean something, so ERROR is signal and not noise
How to instrument the golden signals with the Prometheus client — counters, gauges, and histograms — and which metric type fits which measurement
Why cardinality discipline is the difference between a healthy metrics pipeline and a self-inflicted outage
How OpenTelemetry spans and context propagation let one trace follow a request across service boundaries, and why that’s the only way to debug distributed latency
How a single trace ID stitches logs, metrics, and traces into one story — the correlation that makes the three pillars more than the sum of their parts
How to turn the golden signals into an SLO and the alert that protects it

Prerequisites

Python: Web Development — you’ll instrument the request lifecycle: middleware, the request/response cycle, and async handlers are where this telemetry attaches.
Comfort with Python decorators, context managers, and async/await, since middleware and span instrumentation lean on all three.
A working mental model of a multi-service system: an API that calls other services, which is the setting in which tracing earns its keep.

Structured logging: from prose to data

The single highest-leverage change you can make to a service’s observability is to stop logging sentences and start logging records. A line like User 123 logged in from 192.168.1.1 is readable by exactly one consumer — a human reading it in order — and hostile to every other use. You cannot ask “show me all logins from this IP in the last hour” without a fragile regular expression, and the moment someone rewords the message your query breaks. The information is there, but trapped in prose.

A structured log carries the same information as fields: an event name plus typed key-value pairs, rendered as JSON. The same login becomes {"event": "user_login", "user_id": 123, "ip": "192.168.1.1", "level": "info", "timestamp": "..."}. Now it is data. Your log backend indexes the fields, and “all logins from this IP” is a filter, not a regex. This is the difference between a log you read and a log you query, and at scale you will almost always be querying.

In Python the cleanest way to get there is structlog, which lets you build a processor pipeline — a chain of small functions that each enrich or transform a log event before it is rendered. You configure it once at startup: pick the processors that add a timestamp, a level, exception formatting, and finally a JSON renderer. The illustrative shape, with the bound-logger ergonomics that make it pleasant to use:

import structlog

# Configure once at startup. ConsoleRenderer in dev (colored, human-readable),
# JSONRenderer in prod (machine-parseable). The pipeline order matters: each
# processor sees the event dict the previous one produced.
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,      # pull in per-request context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.format_exc_info,          # render exceptions as fields
        structlog.processors.JSONRenderer(),           # the final, queryable form
    ],
)

log = structlog.get_logger()
log = log.bind(service="order-service")                # context on every line from here
log.info("order_created", order_id=812, total=49.90, payment_method="card")

The standard library logging module can do this too — a JSON formatter on a handler gets you most of the way — but structlog’s processor pipeline and contextvars integration make request-scoped context (which we need next) far less awkward. Either way, the rule is the same: never print to log, and never build a log message with string formatting. A print has no level, no timestamp, no structure, and no way to be turned off; it is the 2 a.m. wall of prose by construction.
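For the standard-library route, a minimal sketch of a JSON formatter on a handler might look like the following. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields through `extra={"fields": {...}}` are assumptions of this example rather than anything the `logging` module prescribes; the output is written to an in-memory buffer only so the result can be inspected.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Convention of this sketch: structured fields arrive via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(stream) -> logging.Logger:
    """Return a logger whose only handler writes JSON lines to `stream`."""
    logger = logging.getLogger("order-service")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
stdlib_log = build_logger(buffer)
stdlib_log.info("order_created", extra={"fields": {"order_id": 812, "payment_method": "card"}})

record = json.loads(buffer.getvalue())
print(record["event"], record["order_id"])
```

The event name stays a fixed string and the variable parts travel as fields, which is the same discipline structlog enforces through its keyword arguments.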

Correlation IDs: a thread through every line

Structured fields make a single line queryable. But the question you actually have at 2 a.m. is “show me everything that happened for this one failing request,” and that requires every line emitted while handling a request to carry the same identifier. That identifier is the correlation ID (often called a request ID), and it is the spine the rest of this chapter hangs on.

The mechanism in Python is contextvars. A ContextVar is like a thread-local that also works correctly under async/await: set it at the start of a request and it is transparently available to every function called downstream, including across await boundaries, without threading it through every signature. You set it once, in middleware, and a structlog processor copies it onto every log line automatically.

import uuid
from contextvars import ContextVar
import structlog
from starlette.middleware.base import BaseHTTPMiddleware

# Readable by any downstream code that needs the ID outside of logging,
# for example to forward it on an outgoing call.
request_id_var: ContextVar[str] = ContextVar("request_id", default="")

class RequestContextMiddleware(BaseHTTPMiddleware):
    """Stamp every request with a correlation ID and bind it to the log context."""
    async def dispatch(self, request, call_next):
        # Start clean so nothing bound while handling an earlier request leaks in.
        structlog.contextvars.clear_contextvars()
        # Honor an upstream ID if present (so a caller's ID flows through us),
        # otherwise mint one. Either way, every log line below carries it.
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        request_id_var.set(request_id)
        structlog.contextvars.bind_contextvars(request_id=request_id)
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id   # hand it back to the caller
        return response

Two details earn their place. First, the middleware honors an incoming X-Request-ID before minting a new one — so when an upstream service already started a request, its ID flows through yours rather than being replaced, and the whole journey shares one identity. Second, it writes the ID back into the response header, so a client (or a load balancer’s access log) can report the exact ID to correlate against. Now a single query — request_id = "abc-123" — returns the complete, ordered story of one request, which is precisely what the 2 a.m. team could not get.
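The property that makes a ContextVar safe for this job, namely that concurrent requests never see each other's value, can be checked with the standard library alone. The sketch below stands in for a web server with plain asyncio tasks; `handle_request` and `charge_card` are hypothetical names, and the sleeps exist only to force the tasks to interleave.

```python
import asyncio
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


async def charge_card() -> str:
    """A downstream function that never receives the ID as an argument."""
    await asyncio.sleep(0)
    return request_id_var.get()


async def handle_request(request_id: str) -> tuple[str, str]:
    request_id_var.set(request_id)      # set once, at the edge
    await asyncio.sleep(0.01)           # yield so other requests interleave
    seen_downstream = await charge_card()
    return request_id, seen_downstream


async def main() -> list[tuple[str, str]]:
    # gather() runs each coroutine in its own task, and each task gets its
    # own copy of the context, so the five values cannot collide.
    return await asyncio.gather(*(handle_request(f"req-{i}") for i in range(5)))


results = asyncio.run(main())
for expected, seen in results:
    print(expected, seen)
```

Each pair printed has matching values even though all five requests were in flight together, and the value outside the tasks is still the default.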

Log levels that mean something

Structure and correlation are wasted if every line is logged at INFO and ERROR is sprinkled on anything mildly unusual. Levels are a contract with whoever is on call, and the contract is about actionability. A workable rule of thumb: if you’d want to be woken at 3 a.m. for it, it’s ERROR or CRITICAL; if it’s a notable business event worth a permanent record, it’s INFO; if it’s only useful while actively debugging, it’s DEBUG. The subtle one is WARNING: a declined payment is not an error — the system worked exactly as designed — so it is a WARNING (or even INFO) about a business outcome, not an ERROR about a system failure. Getting this distinction right is what keeps ERROR meaning “a human should look at this,” which is the only thing that keeps anyone looking.

War story: the unstructured logs that hid the outage

The 2 a.m. incident in the introduction traces to a logging decision made years earlier. The service logged with f-strings — logger.info(f"processing order {oid} for {uid}") — which felt fine in development, where you read a handful of lines in order. In production those lines arrived interleaved across thousands of concurrent requests, with no request ID to disentangle them and no fields to filter on. When a subset of /checkout requests started failing, the on-call engineer could not isolate the failures from the healthy traffic, group by endpoint or status, or follow any single request to its cause. The fix was not heroic: a structlog JSON pipeline, a request ID set in middleware, one line to bind it onto every log. The next incident of the same shape was diagnosed in four minutes with a single query filtering on status == "error" and grouping by endpoint. The lesson: unstructured logs are a debt you only feel during an outage — exactly when you can least afford to pay it.

Build it → A production-grade structured-logging-and-correlation setup across a real multi-service FastAPI stack lives in Project 05: SaaS Web Platform, where request IDs thread through the API, background jobs, and the data layer.

Metrics: the golden signals

Logs tell the story of individual events; metrics tell you the shape of all of them at once. A metric is a number, sampled over time, optionally split by a few labels — and because it is just numbers, a metrics backend can store years of it cheaply and answer “what was the p99 latency last Tuesday?” in milliseconds. That efficiency is why metrics, not logs, are what dashboards chart and what alerts watch: you do not alert on logs, you alert on metrics.

The Prometheus client library gives Python several metric types, and three of them cover the golden signals (it also offers others, such as a summary). Choosing the right one is not stylistic: the wrong type produces wrong data. A counter only ever goes up (you query its rate, not its raw value): use it for totals like requests served, errors, orders created. A gauge goes up and down: use it for a current value like in-flight requests, queue depth, or connection-pool usage. A histogram records the distribution of a value into buckets: use it for latency and sizes, so you can compute percentiles. The decision is mechanical: does it only go up? → counter; does it go up and down? → gauge; do you need percentiles? → histogram.
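Why a counter is queried as a rate rather than read raw is easier to see with numbers. The sketch below is a simplified stand-in for what a metrics backend does with scraped counter samples: `counter_increase` and `per_second_rate` are hypothetical helpers, the sample values are invented, and Prometheus's own `rate()` also extrapolates to the edges of the query window, which this version leaves out.

```python
def counter_increase(samples: list[float]) -> float:
    """Total growth across consecutive samples, treating a drop as a restart from zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The process restarted and the counter began again at zero,
            # so everything counted since the restart is new growth.
            total += current
    return total


def per_second_rate(samples: list[float], scrape_interval_s: float) -> float:
    """Average per-second rate over the span covered by the samples."""
    window_s = scrape_interval_s * (len(samples) - 1)
    return counter_increase(samples) / window_s


# Five scrapes, 10 seconds apart, with a process restart before the fourth.
scraped = [100, 130, 160, 10, 40]
print(counter_increase(scraped))        # 100.0
print(per_second_rate(scraped, 10.0))   # 2.5
```

The raw value fell from 160 to 10, which would look like a collapse on a chart, while the rate stays steady. A gauge has no such reset rule, which is one reason the two types are not interchangeable.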

Wired into request-handling middleware, the golden signals fall out of these three types almost for free: a counter gives you traffic and errors, a histogram gives you latency, a gauge gives you saturation.

from prometheus_client import Counter, Histogram, Gauge

# Traffic + errors live on one counter, sliced by a few BOUNDED labels.
requests_total = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"],          # all low-cardinality
)
# Latency as a distribution; buckets chosen around the SLO you care about.
request_duration = Histogram(
    "http_request_duration_seconds", "Request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
# Saturation: a value that rises and falls with concurrent load.
in_flight = Gauge("http_requests_in_progress", "In-flight requests", ["endpoint"])

The histogram is the quietly powerful one. It does not store every latency; it counts how many requests fell into each bucket, and from those counts Prometheus estimates any percentile at query time with histogram_quantile, interpolating within the bucket the percentile falls in. That is why you pick buckets around your SLO: with a 250 ms target, a bucket boundary at 0.25 lets you ask “what fraction of requests beat the SLO?” directly, and it keeps the estimate sharpest where you care most. Averages lie: a handful of 10-second requests vanish into a healthy-looking mean, which is why you instrument the distribution and watch p99, not the average.
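Both questions, the fraction of requests that beat the SLO and the estimated percentile, can be worked through by hand from cumulative bucket counts. The sketch below is a simplified model of that calculation using linear interpolation within a bucket; the counts are invented, and `fraction_within` and `estimate_quantile` are hypothetical helpers rather than part of any Prometheus library.

```python
INF = float("inf")

# (upper bound in seconds, cumulative count of requests at or below that bound)
BUCKETS = [
    (0.01, 50), (0.05, 400), (0.1, 700), (0.25, 940), (0.5, 980),
    (1.0, 990), (2.5, 995), (5.0, 998), (INF, 1000),
]


def fraction_within(buckets, boundary: float) -> float:
    """Share of requests at or below `boundary`, which must be a bucket bound."""
    total = buckets[-1][1]
    for upper, cumulative in buckets:
        if upper == boundary:
            return cumulative / total
    raise ValueError(f"{boundary} is not a bucket boundary")


def estimate_quantile(buckets, q: float) -> float:
    """Estimate the q-quantile by interpolating inside the bucket that holds it."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rank = q * buckets[-1][1]
    lower, below = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if upper == INF:
                return lower    # no finite upper edge to interpolate toward
            return lower + (upper - lower) * (rank - below) / (cumulative - below)
        lower, below = upper, cumulative
    return lower


print(fraction_within(BUCKETS, 0.25))
print(estimate_quantile(BUCKETS, 0.50))
print(estimate_quantile(BUCKETS, 0.99))
```

With these made-up counts, 94% of requests finish within 250 ms and the estimated p99 lands at about one second, even though the median is under 70 ms. A 99%-under-250-ms objective would be missed while the typical request looks fast.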

Cardinality discipline

Notice what is not a label above: no user_id, no order_id, no URL path with the IDs still in it. This is the cardinality rule from earlier made concrete, and it is the single most important habit in metrics. Prometheus stores one time series per distinct combination of label values. With method (a few), endpoint (dozens), and status (a handful), http_requests_total is a few hundred series — trivial. Add user_id and it becomes one series per user, growing without bound until the instance runs out of memory and falls over — taking your visibility down at the exact moment you’d want it.

The corollary is that dynamic path segments must be normalized before they become a label: /users/123/orders/456 and /users/789/orders/012 are both the route /users/{id}/orders/{id}, and it is the route, not the instance, that belongs in the endpoint label. The high-cardinality specifics — which user, which order — belong in logs and trace attributes, where they are searchable without exploding a time-series database. Metrics are for the aggregate; logs and traces are for the individual.
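A minimal sketch of that normalization, assuming IDs are either all digits or UUIDs, is shown below. `normalize_path` is a hypothetical helper; where a framework exposes the matched route template, that template is usually the more reliable source for the label, and a rule like this is a fallback.

```python
import re

# Matches a canonical UUID such as 123e4567-e89b-12d3-a456-426614174000.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with a placeholder so the label stays bounded."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/users/123/orders/456"))   # /users/{id}/orders/{id}
print(normalize_path("/users/789/orders/012"))   # /users/{id}/orders/{id}
print(normalize_path("/health"))                 # /health
```

Both concrete URLs collapse to one label value, so the number of series tracks the number of routes rather than the number of users or orders.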

War story: the label that blew up the bill

A team added a customer_id label to their request counter so they could “slice traffic by customer.” It worked beautifully in staging, where there were twelve test customers. In production there were 400,000. Each new customer minted a fresh time series on every endpoint-status combination, the active series count climbed past ten million within a day, and the managed-metrics bill arrived an order of magnitude over budget. Meanwhile the Prometheus remote-write queue backed up and dropped samples, degrading the very dashboards the label was meant to enrich. The fix was to delete the label and move per-customer analysis to the logging pipeline, where 400,000 distinct values are unremarkable. The rule that would have prevented it: a metric label’s value set must be small and bounded; if you can’t enumerate it, it doesn’t go in a label.

Build it → Golden-signal metrics, histogram bucket design, and cardinality guardrails applied to a real data pipeline are worked end-to-end in Project 09: Data Observability, which instruments freshness, volume, and error-rate signals over a streaming source. For the discipline behind the latency numbers — how percentiles are measured and how to trust the deltas you chart — Project 49: AI Benchmark Suite builds the measurement harness that produces them.

Distributed tracing: where the time went

Metrics tell you the p99 latency on /checkout jumped to 900 ms. They do not tell you why, because the moment a request leaves your service to call another, the single-service view goes dark. Was it your code, the payment service, the database, or the network between them? In a system of five services, the latency you measure at the edge is a sum of contributions you cannot see individually. Distributed tracing is the only tool that decomposes that sum.

A trace is the record of one request’s journey, modeled as a tree of spans. A span is a single timed operation — “handle the HTTP request,” “query the database,” “call the payment service” — with a start time, a duration, and attributes. The top-level span is the request, its children are the operations it performed, and crucially, when an operation crosses into another service, the child span is created inside that other service and still belongs to the same trace. View the tree and the latency budget is laid bare: 700 ms of the 900 ms was spent inside the payment service’s database call, three levels deep. No other signal can show you that.
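The span tree in that example can be modeled with a few lines of plain Python to show how the latency budget is read off it. This is a toy model, not the OpenTelemetry data structure: the `Span` class, the span names, and the durations are invented to match the 900 ms scenario, and the self-time calculation assumes a span's children run one after another rather than concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    service: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        """Time spent in this span itself, excluding its (sequential) children."""
        return self.duration_ms - sum(child.duration_ms for child in self.children)


def walk(span: Span, depth: int = 0):
    """Yield (depth, span) for every span in the tree, parents first."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def slowest_span(root: Span) -> tuple[int, Span]:
    """Return the (depth, span) pair with the largest self time."""
    return max(walk(root), key=lambda pair: pair[1].self_time_ms())


trace_root = Span("POST /checkout", "order-service", 900, [
    Span("load_cart", "order-service", 60),
    Span("POST /charge", "payment-service", 780, [
        Span("charge_card", "payment-service", 760, [
            Span("SELECT accounts", "payment-service", 700),
        ]),
    ]),
])

for depth, span in walk(trace_root):
    print(f"{'  ' * depth}{span.name} [{span.service}] {span.self_time_ms():.0f} ms self")

depth, culprit = slowest_span(trace_root)
print(culprit.name, culprit.service, depth)
```

Measured at the edge, all that is visible is the 900 ms root. Walking the tree attributes 700 ms of it to one database call three levels down in another service.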

The standard for this in Python is OpenTelemetry (OTel), a vendor-neutral API and SDK — you instrument once against the OTel API and can export to Jaeger, Tempo, or a commercial backend without touching your code. You create spans with a context manager, which guarantees the span is closed and its duration recorded even if the body raises:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def get_order(order_id: int) -> dict:
    # The context manager opens the span on entry, records duration and closes it
    # on exit — even on exception. Attributes make the span searchable later.
    with tracer.start_as_current_span("get_order") as span:
        span.set_attribute("order.id", order_id)        # searchable, not a metric label
        try:
            return await db.fetch_order(order_id)
        except Exception as exc:
            span.record_exception(exc)                   # attach the error to the span
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise

Note where order_id lives: as a span attribute, not a metric label. A trace is the right home for high-cardinality identity — you can search traces by order.id without any of the time-series explosion that the same value would cause in a metric. This is the division of labor again: aggregate in metrics, individual in traces and logs.

Context propagation: crossing the boundary

The magic — and the part people most often get wrong — is making one trace span two services. When service A calls service B over HTTP, B has no idea it is part of A’s trace unless A tells it. OpenTelemetry does this by injecting the trace context (trace ID, parent span ID, and sampling decision) into the outgoing HTTP headers — the W3C traceparent header — and B extracts that context and starts its spans as children of A’s. The trace ID is preserved across the wire, so both services’ spans land in the same tree.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# --- Service A, calling out: inject the current trace context into headers ---
async def call_payment(order):
    headers: dict[str, str] = {}
    inject(headers)                                  # adds the W3C `traceparent` header
    return await http_client.post("/charge", json=order, headers=headers)

# --- Service B, receiving: extract the context and continue the SAME trace ---
async def on_request(request, call_next):
    ctx = extract(dict(request.headers))             # read A's `traceparent`
    with tracer.start_as_current_span("POST /charge", context=ctx, kind=trace.SpanKind.SERVER):
        return await call_next(request)

In practice you rarely write this by hand for framework calls — OpenTelemetry’s auto-instrumentation for FastAPI, httpx, SQLAlchemy, and Redis injects and extracts context for you, so a fully traced request often needs only a few lines of setup at startup. But understanding the mechanism matters, because it is the thing that fails silently: forget propagation on one hop and your trace simply ends there — exactly the dead-end the 2 a.m. team hit. A trace that stops at a boundary is worse than no trace, because it looks complete.
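When a trace does end at a boundary, it helps to know what the header should look like so you can check whether it left the caller and arrived at the callee. The sketch below reads and writes the version 00 layout of traceparent: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and two hex digits of flags whose lowest bit carries the sampling decision. `parse_traceparent` and `format_traceparent` are hypothetical helpers written for illustration, handling only that version; in a real service the OpenTelemetry propagators do this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header, or return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header a service would send on its outgoing call."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

# The next hop keeps the trace ID and substitutes its own span ID as the parent.
outgoing = format_traceparent(incoming.trace_id, secrets.token_hex(8), incoming.sampled)
print(outgoing)
```

The trace ID is the only part that must survive every hop unchanged. If a service's outgoing requests carry no such header, or carry one with a different trace ID, that service is where propagation broke.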

Correlation: one ID, three views

Here is where the three pillars stop being three separate tools and become one instrument. Each pillar, on its own, answers part of the question. The power comes from being able to pivot between them — to see a spike on a metrics dashboard, click into an exemplar trace, find the slow span, and jump from that span to the exact log lines emitted while it ran. That pivot is only possible if all three signals carry a shared identity: the trace ID.

OpenTelemetry already mints a trace ID for every request. The move that ties everything together is to copy that trace ID into your logs and to expose it on your metrics’ exemplars, so the same string appears in all three places. In structlog this is one more processor:

from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor: stamp the active trace/span IDs onto every log line."""
    span = trace.get_current_span()
    if span.get_span_context().is_valid:
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")   # same ID the trace uses
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

With this in place, every log line emitted during a request carries the same trace_id as the trace itself, and the investigative loop closes: a metrics alert fires on error rate → you grab a trace_id from a failing exemplar → you view that trace and see the payment-service span is red → you filter the logs to that trace_id and read the exception that explains it. Metrics found it, traces located it, logs explained it — and a single ID carried you across all three without guessing. That is the difference between observability that works and three disconnected tools you alt-tab between hoping to spot the same incident in each.
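The last step of that loop, going from a trace ID to the log lines that explain it, is nothing more than a filter once the ID is a field. The sketch below runs it over a few invented JSON log lines; the field names follow the processor above, while `logs_for_trace`, the events, and the IDs are made up for illustration. A real log backend would run the equivalent query server-side.

```python
import json

raw_lines = [
    '{"timestamp": "2024-05-01T02:14:07.310Z", "level": "error", "event": "charge_failed", '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "error": "gateway_timeout"}',
    '{"timestamp": "2024-05-01T02:14:07.120Z", "level": "info", "event": "order_created", '
    '"trace_id": "0af7651916cd43dd8448eb211c80319c"}',
    '{"timestamp": "2024-05-01T02:14:07.050Z", "level": "info", "event": "checkout_started", '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
]


def logs_for_trace(lines: list[str], trace_id: str) -> list[dict]:
    """Return the records for one trace, ordered by timestamp."""
    records = (json.loads(line) for line in lines)
    matching = [record for record in records if record.get("trace_id") == trace_id]
    # ISO-8601 timestamps in one format and timezone sort correctly as strings.
    return sorted(matching, key=lambda record: record["timestamp"])


story = logs_for_trace(raw_lines, "4bf92f3577b34da6a3ce929d0e0e4736")
for record in story:
    print(record["timestamp"], record["level"], record["event"])
```

The interleaved line from the other request drops out, and what remains is the ordered story of the failing one, ending in the error that explains it.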

A note on scope: this chapter is the application view — how you instrument a Python service so it emits good telemetry. The platform view — running the collectors, retaining the data, building dashboards, managing alert routing and the SRE practice around SLOs — is the subject of the cross-cutting Observability chapter in Part IV. This chapter produces the signals; that one operates the system that consumes them. Your job here is to make the service speak; the platform’s job is to listen at scale.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Replace prose with records. Take a small FastAPI service that logs with print or f-strings. Configure structlog to emit JSON, add middleware that mints a correlation ID per request (honoring an incoming X-Request-ID), and bind it so every log line carries it. Fire ten concurrent requests, then prove the win: write the one query against your logs that returns the complete, ordered story of a single request and nothing else. Note how impossible that query was before.
Level II — Instrument the golden signals. Add Prometheus metrics to the same service: a request counter labeled by method, endpoint (normalized — strip the IDs out of the path), and status; and a latency histogram with buckets chosen around a 250 ms SLO. Expose /metrics, scrape it, and chart request rate, error rate, and p99 latency. Then write a short paragraph: name one label you were tempted to add that would have been a cardinality bomb, estimate how many time series it would have created at production scale, and say where that information belongs instead.
Level III — One trace, three views, one SLO. Split the work across two services (an API that calls a downstream service). Wire OpenTelemetry into both, propagate context across the HTTP boundary, and confirm that a single request produces one trace spanning both services. Add the structlog processor that stamps the trace_id onto every log line, and demonstrate the full pivot: from a metrics spike, to the offending trace, to the log lines for that exact trace_id. Finally, define a concrete SLO (e.g. “99% of requests under 250 ms over a rolling 30 days”) and write the Prometheus alert that protects it — a histogram_quantile over the latency buckets, with a for: duration so a momentary blip doesn’t page anyone. Explain why you alert on the SLO and not on raw CPU.

Summary

Observability is the property that lets you ask new questions of a running system from the outside — a strictly larger thing than monitoring’s pre-wired dashboards. It rests on three pillars that each answer a different question: logs (what happened), metrics (how much and how often), and traces (where the time went across services). Instrumenting a Python service means making every log line structured and carrying a correlation ID; measuring the golden signals with the right Prometheus metric type while ruthlessly bounding label cardinality; and using OpenTelemetry spans with context propagation so one trace follows a request across every boundary. The payoff is realized only when all three share a trace ID, so an investigation can pivot from a metric spike to the trace to the explaining log line — turning a two-hour 2 a.m. archaeology dig into a four-minute query.

Key takeaways

Observability ≠ monitoring: monitoring answers known questions; observability lets you ask new ones from telemetry the system already emits.
Metrics tell you that something is wrong, traces tell you where, logs tell you why — instrument all three, because no one of them is sufficient.
Structure every log line and stamp it with a correlation ID; never print and never build a log message with string formatting.
Instrument the golden signals — latency, traffic, errors, saturation — with counters, gauges, and histograms; pick the type by what you’re measuring, and pick histogram buckets around your SLO.
Cardinality is the trap: metric labels must be bounded; high-cardinality identifiers like user_id belong in logs and trace attributes, never in metric labels.
A trace is the only way to debug latency across services, and context propagation (W3C traceparent) is the load-bearing, silently-failing part — a trace that stops at a boundary looks complete and isn’t.
A shared trace ID across logs, metrics, and traces is the centerpiece: it turns three tools into one investigative loop.

Connections to other chapters

Python: Web Development (prerequisite): observability attaches to the request lifecycle taught there — middleware, the request/response cycle, async handlers are exactly where correlation IDs are set, metrics are recorded, and spans are opened. You can’t instrument a request you don’t yet understand.
Python: Microservices (extension): tracing is optional with one service and mandatory with several. The moment a request crosses a boundary, context propagation is the only thing standing between you and the dead-end trace from the introduction; that chapter is where the cross-service patterns introduced here become the default.
Observability (Cross-Cutting, Part IV) (extension): this chapter produces telemetry from one Python service; that chapter operates the platform that consumes it across a fleet — collectors, retention, dashboards, alert routing, and the SRE practice around SLOs and error budgets. Application instrumentation is the input to that platform view.
Orchestration with Kubernetes (Part V) (extension): in a cluster, your application telemetry meets the platform’s — pod and node metrics, the Prometheus Operator’s service discovery, liveness and readiness probes. The golden-signal metrics you expose here are what the cluster scrapes, alerts on, and autoscales against.

Further reading

Essential

Beyer et al., Site Reliability Engineering (Google, 2016), esp. the chapters on Monitoring Distributed Systems (the golden signals) and Service Level Objectives — the canonical source for latency/traffic/errors/saturation and for alerting on SLOs rather than on raw resource thresholds.
OpenTelemetry documentation — the vendor-neutral standard for traces, metrics, and logs; the Python SDK guide and the W3C traceparent propagation spec are the references for the tracing sections above.

Deep dives

Majors, Fong-Jones & Miranda, Observability Engineering (O’Reilly, 2022) — the book that sharpened the observability-vs-monitoring distinction and makes the case for high-cardinality, high-dimensionality event data and querying unknown-unknowns.
Prometheus documentation — instrumentation and best practices — histogram vs. summary, label cardinality, and histogram_quantile, the practical backbone of the metrics section.

Historical context

Sigelman et al., “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure” (Google, 2010) — the paper that introduced spans, trace context propagation, and sampling, and from which essentially every modern tracing system, OpenTelemetry included, descends.
“The Log: What every software engineer should know about real-time data’s unifying abstraction” (Jay Kreps, 2013) — the foundational case for treating logs as structured, append-only streams of events rather than human-readable prose.

--- title: "Python: Observability" keywords: [observability, structured logging, metrics, distributed tracing, prometheus, opentelemetry, correlation id, instrumentation, slo] difficulty: intermediate prerequisites: [python-web-development] estimated_time: "3-4 hours" --- ## Introduction It is 2:14 a.m. The pager says the checkout service is "down." It is not down — the homepage loads, the health check is green, and one engineer can place a test order while another cannot. Somewhere between the load balancer and the database, some fraction of requests is failing, and nobody can say which fraction, on which endpoint, for whom, or why. The on-call engineer tails the log, and what scrolls past is a wall of prose: `print`-style lines like `Processing order for user... done` and `Error in payment`, with no timestamp worth the name, no request ID, no status code, no way to tell one user's failing request from the thousand healthy ones interleaved with it. No graph shows the error rate climbing, so nobody knows whether this started ten minutes ago or ten hours ago. And when the trail leads into the payment service it simply stops — the two services share no thread of identity, so there is no way to follow a single request across the boundary. The team spends two hours grepping, guessing, and restarting things until the symptom goes away on its own. They never learn the cause. That outage was not caused by a hard bug. It was caused by a *blind* system. The code was probably fine; the team's ability to *ask questions of the running system* was not. You can't operate what you can't see, and a service that emits unstructured strings, no metrics, and no cross-service trace is a black box with a power cord. This chapter is about instrumenting a Python service so that the next 2 a.m. incident is a five-minute query instead of a two-hour archaeology dig — turning the system from something you poke at from the outside into something that *tells you what it is doing*. 
### The Core Insight It is tempting to call this "monitoring," but observability is a different and larger thing, and the distinction is the whole point. **Monitoring** answers *questions you already knew to ask*: you decide in advance that CPU above 80% matters, you build a dashboard and an alert for it, and the system tells you when that specific thing happens. It is a fixed set of pre-wired questions. The 2 a.m. outage was a failure of monitoring precisely because nobody had pre-wired the question "is a specific subset of POST `/checkout` requests failing for users in one region?" — and you cannot pre-wire every question, because the interesting failures are the ones you did not anticipate. **Observability** is the property that lets you ask *new* questions about a running system from the outside, without shipping new code, using only the telemetry the system already emits. It is a property of the system, not a product you buy. A system is observable to the degree that its external outputs let you reconstruct its internal state — to the degree that, when something novel breaks, you can slice and pivot your way to the cause from data that is already there. That property rests on **three pillars**, each answering a question the others cannot: 1. **Logs** — discrete, timestamped records of *what happened*. A log line is an event with context: this user, this order, this error, at this moment. Logs are how you explain a failure once you've found it. 2. **Metrics** — aggregated numbers measuring *how much and how often*: requests per second, error rate, latency distribution, queue depth. Metrics are cheap to store and fast to query over long windows, which makes them the substrate for dashboards and alerts. Metrics are how you *notice* a failure and see its trend. 3. **Traces** — the causal path of a single request as it moves *across services*, recording *where the time went*. A trace is how you find *which* component, in a chain of five, is the slow or failing one. 
The slogan worth memorizing: **metrics tell you that something is wrong, traces tell you where, and logs tell you why.** No single pillar is sufficient. The 2 a.m. team had none of them. ### A mental model Picture the three pillars as **three lenses pointed at the same stream of events**. A single request generates one underlying reality — it arrived, did some work, called another service, succeeded or failed, took some number of milliseconds — and each lens captures a different projection of it. The log lens captures the *narrative*: the discrete things that happened, in order, with context. The metric lens captures the *statistics*: strip away the individuality and count, sum, and bucket the events into numbers you can chart. The trace lens captures the *shape*: the request as a tree of timed operations spanning service boundaries. Or, in the operator's framing: telemetry is the **instrument panel of a machine you cannot open up**. You cannot pull a running distributed system onto a workbench and watch the gears turn; what you have instead is the dashboard — the altimeter, the fuel gauge, the engine-temperature light — and your ability to fly through weather depends on whether those instruments exist and agree. An uninstrumented service is a cockpit with the windshield painted black. The work of this chapter is wiring up the gauges. ### What to instrument (and what it costs) Telemetry is not free — it costs CPU to emit, network to ship, storage to retain, and money to a vendor or a TSDB to query. So instrumentation is a budgeting decision, and a few defaults make the budget go far. @fig-observability shows the shape: one instrumented request fanning out into three signals that share an identity. 
![The three pillars of observability from one instrumented request: structured logs (what happened), metrics (how much/how often), and traces (where the time went across services) — all tagged with a shared trace ID so they can be correlated into a single story.](../assets/diagrams/rendered/py_observability.svg){#fig-observability .lightbox} **Structure everything, log deliberately.** Make every log line structured (JSON, with fields) rather than a string — the marginal cost is near zero and the payoff is that logs become *queryable* — but don't log every line at every level in the hot path: pick levels that mean something and emit `DEBUG` sparingly in production. **Metrics for the golden signals.** Google's SRE practice names four signals that, for a request-serving service, catch most of what matters: **latency** (how long requests take), **traffic** (how many you're getting), **errors** (how many fail), and **saturation** (how full the system is). Instrument those four first; they are cheap, aggregated, and exactly what an alert needs to watch. **Traces for cross-service latency.** A trace is the *only* tool that answers "the request took 900 ms — where did the 900 ms go?" across a chain of services. Sample them: you rarely need every trace, just a representative slice plus all the errors and slow ones. The one caveat that turns telemetry from an asset into a liability is **cardinality**. A metric's cost is roughly the number of distinct label-value combinations it can produce — its *time series count*. A label like `endpoint` has a handful of values and is fine. A label like `user_id` has millions, and attaching it to a metric quietly creates millions of time series, which can take a Prometheus instance down or turn a metrics bill into a budget incident. 
The discipline is simple and non-negotiable: **metric labels must be bounded; high-cardinality identifiers belong in logs and traces, never in metric labels.** ### What you'll learn - How observability differs from monitoring, and why the three pillars — logs, metrics, traces — each answer a question the others cannot - How to replace `print`-style logging with **structured JSON logs** carrying a request/correlation ID, using `structlog` or the standard library done right - How to choose **log levels** that mean something, so `ERROR` is signal and not noise - How to instrument the **golden signals** with the Prometheus client — counters, gauges, and histograms — and which metric type fits which measurement - Why **cardinality discipline** is the difference between a healthy metrics pipeline and a self-inflicted outage - How **OpenTelemetry** spans and context propagation let one trace follow a request across service boundaries, and why that's the only way to debug distributed latency - How a single **trace ID** stitches logs, metrics, and traces into one story — the correlation that makes the three pillars more than the sum of their parts - How to turn the golden signals into an **SLO** and the alert that protects it ### Prerequisites - **Python: Web Development** — you'll instrument the request lifecycle: middleware, the request/response cycle, and async handlers are where this telemetry attaches. - Comfort with Python decorators, context managers, and `async`/`await`, since middleware and span instrumentation lean on all three. - A working mental model of a multi-service system: an API that calls other services, which is the setting in which tracing earns its keep. --- ## Structured logging: from prose to data The single highest-leverage change you can make to a service's observability is to stop logging *sentences* and start logging *records*. 
A line like `User 123 logged in from 192.168.1.1` is readable by exactly one consumer — a human reading it in order — and hostile to every other use. You cannot ask "show me all logins from this IP in the last hour" without a fragile regular expression, and the moment someone rewords the message your query breaks. The information is *there*, but trapped in prose. A **structured log** carries the same information as fields: an event name plus typed key-value pairs, rendered as JSON. The same login becomes `{"event": "user_login", "user_id": 123, "ip": "192.168.1.1", "level": "info", "timestamp": "..."}`. Now it is data. Your log backend indexes the fields, and "all logins from this IP" is a filter, not a regex. This is the difference between a log you read and a log you *query*, and at scale you will almost always be querying. In Python the cleanest way to get there is `structlog`, which lets you build a *processor pipeline* — a chain of small functions that each enrich or transform a log event before it is rendered. You configure it once at startup: pick the processors that add a timestamp, a level, exception formatting, and finally a JSON renderer. The illustrative shape, with the bound-logger ergonomics that make it pleasant to use: ```python import structlog # Configure once at startup. ConsoleRenderer in dev (colored, human-readable), # JSONRenderer in prod (machine-parseable). The pipeline order matters: each # processor sees the event dict the previous one produced. 
structlog.configure( processors=[ structlog.contextvars.merge_contextvars, # pull in per-request context structlog.processors.add_log_level, structlog.processors.TimeStamper(fmt="iso"), structlog.processors.format_exc_info, # render exceptions as fields structlog.processors.JSONRenderer(), # the final, queryable form ], ) log = structlog.get_logger() log = log.bind(service="order-service") # context on every line from here log.info("order_created", order_id=812, total=49.90, payment_method="card") ``` The standard library `logging` module can do this too — a JSON formatter on a handler gets you most of the way — but `structlog`'s processor pipeline and `contextvars` integration make request-scoped context (which we need next) far less awkward. Either way, the rule is the same: **never `print` to log, and never build a log message with string formatting.** A `print` has no level, no timestamp, no structure, and no way to be turned off; it is the 2 a.m. wall of prose by construction. ### Correlation IDs: a thread through every line Structured fields make a *single* line queryable. But the question you actually have at 2 a.m. is "show me everything that happened *for this one failing request*," and that requires every line emitted while handling a request to carry the same identifier. That identifier is the **correlation ID** (often called a request ID), and it is the spine the rest of this chapter hangs on. The mechanism in Python is `contextvars`. A `ContextVar` is like a thread-local that also works correctly under `async`/`await`: set it at the start of a request and it is transparently available to every function called downstream, including across `await` boundaries, without threading it through every signature. You set it once, in middleware, and a `structlog` processor copies it onto every log line automatically. 
```python import uuid from contextvars import ContextVar import structlog from starlette.middleware.base import BaseHTTPMiddleware request_id_var: ContextVar[str] = ContextVar("request_id", default="") log = structlog.get_logger() class RequestContextMiddleware(BaseHTTPMiddleware): """Stamp every request with a correlation ID and bind it to the log context.""" async def dispatch(self, request, call_next): # Honor an upstream ID if present (so a caller's ID flows through us), # otherwise mint one. Either way, every log line below carries it. request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4()) structlog.contextvars.bind_contextvars(request_id=request_id) response = await call_next(request) response.headers["X-Request-ID"] = request_id # hand it back to the caller return response ``` Two details earn their place. First, the middleware *honors an incoming* `X-Request-ID` before minting a new one — so when an upstream service already started a request, its ID flows through yours rather than being replaced, and the whole journey shares one identity. Second, it writes the ID back into the response header, so a client (or a load balancer's access log) can report the exact ID to correlate against. Now a single query — `request_id = "abc-123"` — returns the complete, ordered story of one request, which is precisely what the 2 a.m. team could not get. ### Log levels that mean something Structure and correlation are wasted if every line is logged at `INFO` and `ERROR` is sprinkled on anything mildly unusual. Levels are a *contract* with whoever is on call, and the contract is about actionability. A workable rule of thumb: **if you'd want to be woken at 3 a.m. 
for it, it's `ERROR` or `CRITICAL`; if it's a notable business event worth a permanent record, it's `INFO`; if it's only useful while actively debugging, it's `DEBUG`.** The subtle one is `WARNING`: a declined payment is *not* an error — the system worked exactly as designed — so it is a `WARNING` (or even `INFO`) about a business outcome, not an `ERROR` about a system failure. Getting this distinction right is what keeps `ERROR` meaning "a human should look at this," which is the only thing that keeps anyone looking. ::: {.callout-warning} ## War story: the unstructured logs that hid the outage The 2 a.m. incident in the introduction traces to a logging decision made years earlier. The service logged with f-strings — `logger.info(f"processing order {oid} for {uid}")` — which felt fine in development, where you read a handful of lines in order. In production those lines arrived interleaved across thousands of concurrent requests, with no request ID to disentangle them and no fields to filter on. When a subset of `/checkout` requests started failing, the on-call engineer could not isolate the failures from the healthy traffic, group by endpoint or status, or follow any single request to its cause. The fix was not heroic: a `structlog` JSON pipeline, a request ID set in middleware, one line to bind it onto every log. The *next* incident of the same shape was diagnosed in four minutes with a single query filtering on `status == "error"` and grouping by `endpoint`. The lesson: unstructured logs are a debt you only feel during an outage — exactly when you can least afford to pay it. ::: > **Build it →** A production-grade structured-logging-and-correlation setup across a > real multi-service FastAPI stack lives in > [Project 05: SaaS Web Platform](https://github.com/jchu0/applied-cs-projects/tree/main/05-saas-web-platform), > where request IDs thread through the API, background jobs, and the data layer. 
## Metrics: the golden signals Logs tell the story of individual events; **metrics** tell you the shape of *all* of them at once. A metric is a number, sampled over time, optionally split by a few labels — and because it is just numbers, a metrics backend can store years of it cheaply and answer "what was the p99 latency last Tuesday?" in milliseconds. That efficiency is why metrics, not logs, are what dashboards chart and what alerts watch: you do not alert on logs, you alert on metrics. The Prometheus client library gives Python three metric types, and choosing the right one is not stylistic — the wrong type produces wrong data. A **counter** only ever goes up (you query its *rate*, not its raw value): use it for totals like requests served, errors, orders created. A **gauge** goes up *and* down: use it for a current value like in-flight requests, queue depth, or connection-pool usage. A **histogram** records the *distribution* of a value into buckets: use it for latency and sizes, so you can compute percentiles. The decision is mechanical — *does it only go up?* → counter; *does it go up and down?* → gauge; *do you need percentiles?* → histogram. Wired into request-handling middleware, the golden signals fall out of these three types almost for free: a counter gives you traffic and errors, a histogram gives you latency, a gauge gives you saturation. ```python from prometheus_client import Counter, Histogram, Gauge # Traffic + errors live on one counter, sliced by a few BOUNDED labels. requests_total = Counter( "http_requests_total", "Total HTTP requests", ["method", "endpoint", "status"], # all low-cardinality ) # Latency as a distribution; buckets chosen around the SLO you care about. request_duration = Histogram( "http_request_duration_seconds", "Request latency", ["method", "endpoint"], buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0], ) # Saturation: a value that rises and falls with concurrent load. 
in_flight = Gauge("http_requests_in_progress", "In-flight requests", ["endpoint"]) ``` The histogram is the quietly powerful one. It does not store every latency; it counts how many requests fell into each bucket, and from those counts Prometheus computes any percentile at query time with `histogram_quantile`. That is why you pick buckets *around your SLO*: with a 250 ms target, a bucket boundary at 0.25 lets you ask "what fraction of requests beat the SLO?" directly. Averages lie — a handful of 10-second requests vanish into a healthy-looking mean — which is why you instrument the distribution and watch p99, not the average. ### Cardinality discipline Notice what is *not* a label above: no `user_id`, no `order_id`, no URL path with the IDs still in it. This is the cardinality rule from earlier made concrete, and it is the single most important habit in metrics. Prometheus stores one time series per distinct combination of label values. With `method` (a few), `endpoint` (dozens), and `status` (a handful), `http_requests_total` is a few hundred series — trivial. Add `user_id` and it becomes one series *per user*, growing without bound until the instance runs out of memory and falls over — taking your visibility down at the exact moment you'd want it. The corollary is that dynamic path segments must be **normalized** before they become a label: `/users/123/orders/456` and `/users/789/orders/012` are both the route `/users/{id}/orders/{id}`, and it is the route, not the instance, that belongs in the `endpoint` label. The high-cardinality specifics — *which* user, *which* order — belong in logs and trace attributes, where they are searchable without exploding a time-series database. Metrics are for the aggregate; logs and traces are for the individual. ::: {.callout-warning} ## War story: the label that blew up the bill A team added a `customer_id` label to their request counter so they could "slice latency by customer." 
It worked beautifully in staging, where there were twelve test customers. In production there were 400,000. Each new customer minted a fresh time series on every endpoint-status combination, the active series count climbed past ten million within a day, and the managed-metrics bill arrived an order of magnitude over budget — while the Prometheus remote-write queue backed up and dropped samples, degrading the very dashboards the label was meant to enrich. The fix was to delete the label and move per-customer analysis to the logging pipeline, where 400,000 distinct values are unremarkable. The rule that would have prevented it: **a metric label's value set must be small and bounded; if you can't enumerate it, it doesn't go in a label.** ::: > **Build it →** Golden-signal metrics, histogram bucket design, and cardinality > guardrails applied to a real data pipeline are worked end-to-end in > [Project 09: Data Observability](https://github.com/jchu0/applied-cs-projects/tree/main/09-data-observability), > which instruments freshness, volume, and error-rate signals over a streaming source. > For the discipline behind the latency *numbers* — how percentiles are measured and how > to trust the deltas you chart — > [Project 49: AI Benchmark Suite](https://github.com/jchu0/applied-cs-projects/tree/main/49-ai-benchmark-suite) > builds the measurement harness that produces them. ## Distributed tracing: where the time went Metrics tell you the p99 latency on `/checkout` jumped to 900 ms. They do *not* tell you why, because the moment a request leaves your service to call another, the single-service view goes dark. Was it your code, the payment service, the database, or the network between them? In a system of five services, the latency you measure at the edge is a sum of contributions you cannot see individually. **Distributed tracing** is the only tool that decomposes that sum. A **trace** is the record of one request's journey, modeled as a tree of **spans**. 
A span is a single timed operation — "handle the HTTP request," "query the database," "call the payment service" — with a start time, a duration, and attributes. The top-level span is the request, its children are the operations it performed, and crucially, when an operation crosses into another service, the child span is created *inside that other service* and still belongs to the same trace. View the tree and the latency budget is laid bare: 700 ms of the 900 ms was spent inside the payment service's database call, three levels deep. No other signal can show you that. The standard for this in Python is **OpenTelemetry** (OTel), a vendor-neutral API and SDK — you instrument once against the OTel API and can export to Jaeger, Tempo, or a commercial backend without touching your code. You create spans with a context manager, which guarantees the span is closed and its duration recorded even if the body raises: ```python from opentelemetry import trace tracer = trace.get_tracer(__name__) async def get_order(order_id: int) -> dict: # The context manager opens the span on entry, records duration and closes it # on exit — even on exception. Attributes make the span searchable later. with tracer.start_as_current_span("get_order") as span: span.set_attribute("order.id", order_id) # searchable, not a metric label try: return await db.fetch_order(order_id) except Exception as exc: span.record_exception(exc) # attach the error to the span span.set_status(trace.Status(trace.StatusCode.ERROR)) raise ``` Note where `order_id` lives: as a *span attribute*, not a metric label. A trace is the right home for high-cardinality identity — you can search traces by `order.id` without any of the time-series explosion that the same value would cause in a metric. This is the division of labor again: aggregate in metrics, individual in traces and logs. 
### Context propagation: crossing the boundary The magic — and the part people most often get wrong — is making one trace span *two* services. When service A calls service B over HTTP, B has no idea it is part of A's trace unless A *tells* it. OpenTelemetry does this by **injecting** the trace context (trace ID, parent span ID, and sampling decision) into the outgoing HTTP headers — the W3C `traceparent` header — and B **extracts** that context and starts its spans as children of A's. The trace ID is preserved across the wire, so both services' spans land in the same tree. ```python from opentelemetry.propagate import inject, extract # --- Service A, calling out: inject the current trace context into headers --- async def call_payment(order): headers: dict[str, str] = {} inject(headers) # adds the W3C `traceparent` header return await http_client.post("/charge", json=order, headers=headers) # --- Service B, receiving: extract the context and continue the SAME trace --- async def on_request(request, call_next): ctx = extract(dict(request.headers)) # read A's `traceparent` with tracer.start_as_current_span("POST /charge", context=ctx, kind=trace.SpanKind.SERVER): return await call_next(request) ``` In practice you rarely write this by hand for framework calls — OpenTelemetry's auto-instrumentation for FastAPI, `httpx`, SQLAlchemy, and Redis injects and extracts context for you, so a fully traced request often needs only a few lines of setup at startup. But understanding the mechanism matters, because it is *the* thing that fails silently: forget propagation on one hop and your trace simply ends there — exactly the dead-end the 2 a.m. team hit. A trace that stops at a boundary is worse than no trace, because it looks complete. ## Correlation: one ID, three views Here is where the three pillars stop being three separate tools and become one instrument. Each pillar, on its own, answers part of the question. 
The power comes from being able to *pivot between them* — to see a spike on a metrics dashboard, click into an exemplar trace, find the slow span, and jump from that span to the exact log lines emitted while it ran. That pivot is only possible if all three signals carry a *shared identity*: the **trace ID**. OpenTelemetry already mints a trace ID for every request. The move that ties everything together is to copy that trace ID into your logs and to expose it on your metrics' exemplars, so the same string appears in all three places. In `structlog` this is one more processor: ```python from opentelemetry import trace def add_trace_context(logger, method_name, event_dict): """structlog processor: stamp the active trace/span IDs onto every log line.""" span = trace.get_current_span() if span.get_span_context().is_valid: ctx = span.get_span_context() event_dict["trace_id"] = format(ctx.trace_id, "032x") # same ID the trace uses event_dict["span_id"] = format(ctx.span_id, "016x") return event_dict ``` With this in place, every log line emitted during a request carries the same `trace_id` as the trace itself, and the investigative loop closes: a metrics alert fires on error rate → you grab a `trace_id` from a failing exemplar → you view that trace and see the payment-service span is red → you filter the logs to that `trace_id` and read the exception that explains it. Metrics found it, traces located it, logs explained it — and a single ID carried you across all three without guessing. That is the difference between observability that *works* and three disconnected tools you alt-tab between hoping to spot the same incident in each. A note on scope: this chapter is the *application* view — how you instrument a Python service so it emits good telemetry. The *platform* view — running the collectors, retaining the data, building dashboards, managing alert routing and the SRE practice around SLOs — is the subject of the cross-cutting **Observability** chapter in Part IV. 
This chapter produces the signals; that one operates the system that consumes them. Your job here is to make the service speak; the platform's job is to listen at scale. --- ## Practical exercise **Difficulty:** Level I · Level II · Level III 1. **Level I — Replace prose with records.** Take a small FastAPI service that logs with `print` or f-strings. Configure `structlog` to emit JSON, add middleware that mints a correlation ID per request (honoring an incoming `X-Request-ID`), and bind it so every log line carries it. Fire ten concurrent requests, then prove the win: write the one query against your logs that returns the complete, ordered story of a *single* request and nothing else. Note how impossible that query was before. 2. **Level II — Instrument the golden signals.** Add Prometheus metrics to the same service: a request counter labeled by `method`, `endpoint` (normalized — strip the IDs out of the path), and `status`; and a latency histogram with buckets chosen around a 250 ms SLO. Expose `/metrics`, scrape it, and chart request rate, error rate, and p99 latency. Then write a short paragraph: name one label you were tempted to add that would have been a cardinality bomb, estimate how many time series it would have created at production scale, and say where that information belongs instead. 3. **Level III — One trace, three views, one SLO.** Split the work across two services (an API that calls a downstream service). Wire OpenTelemetry into both, propagate context across the HTTP boundary, and confirm that a single request produces *one* trace spanning both services. Add the `structlog` processor that stamps the `trace_id` onto every log line, and demonstrate the full pivot: from a metrics spike, to the offending trace, to the log lines for that exact `trace_id`. Finally, define a concrete SLO (e.g. 
"99% of requests under 250 ms over a rolling 30 days") and write the Prometheus alert that protects it — a `histogram_quantile` over the latency buckets, with a `for:` duration so a momentary blip doesn't page anyone. Explain why you alert on the SLO and not on raw CPU. ## Summary Observability is the property that lets you ask new questions of a running system from the outside — a strictly larger thing than monitoring's pre-wired dashboards. It rests on three pillars that each answer a different question: **logs** (what happened), **metrics** (how much and how often), and **traces** (where the time went across services). Instrumenting a Python service means making every log line *structured* and carrying a correlation ID; measuring the *golden signals* with the right Prometheus metric type while ruthlessly bounding label cardinality; and using OpenTelemetry spans with context propagation so one trace follows a request across every boundary. The payoff is realized only when all three share a **trace ID**, so an investigation can pivot from a metric spike to the trace to the explaining log line — turning a two-hour 2 a.m. archaeology dig into a four-minute query. ### Key takeaways - Observability ≠ monitoring: monitoring answers known questions; observability lets you ask new ones from telemetry the system already emits. - Metrics tell you *that* something is wrong, traces tell you *where*, logs tell you *why* — instrument all three, because no one of them is sufficient. - Structure every log line and stamp it with a correlation ID; never `print` and never build a log message with string formatting. - Instrument the golden signals — latency, traffic, errors, saturation — with counters, gauges, and histograms; pick the type by what you're measuring, and pick histogram buckets around your SLO. - Cardinality is the trap: metric labels must be bounded; high-cardinality identifiers like `user_id` belong in logs and trace attributes, never in metric labels. 
- A trace is the only way to debug latency across services, and context propagation (W3C `traceparent`) is the load-bearing, silently-failing part — a trace that stops at a boundary looks complete and isn't.
- A shared trace ID across logs, metrics, and traces is the centerpiece: it turns three tools into one investigative loop.

### Connections to other chapters

- **Python: Web Development** (prerequisite): observability attaches to the request lifecycle taught there — middleware, the request/response cycle, and async handlers are exactly where correlation IDs are set, metrics are recorded, and spans are opened. You can't instrument a request you don't yet understand.
- **Python: Microservices** (extension): tracing is optional with one service and *mandatory* with several. The moment a request crosses a boundary, context propagation is the only thing standing between you and the dead-end trace from the introduction; that chapter is where the cross-service patterns introduced here become the default.
- **Observability (Cross-Cutting, Part IV)** (extension): this chapter produces telemetry from one Python service; that chapter operates the platform that consumes it across a fleet — collectors, retention, dashboards, alert routing, and the SRE practice around SLOs and error budgets. Application instrumentation is the input to that platform view.
- **Orchestration with Kubernetes (Part V)** (extension): in a cluster, your application telemetry meets the platform's — pod and node metrics, the Prometheus Operator's service discovery, liveness and readiness probes. The golden-signal metrics you expose here are what the cluster scrapes, alerts on, and autoscales against.

## Further reading

### Essential

- Beyer et al., *Site Reliability Engineering* (Google, 2016), esp.
  the chapters on *Monitoring Distributed Systems* (the golden signals) and *Service Level Objectives* — the canonical source for latency/traffic/errors/saturation and for alerting on SLOs rather than on raw resource thresholds.
- *OpenTelemetry documentation* — the vendor-neutral standard for traces, metrics, and logs; the Python SDK guide and the W3C `traceparent` propagation spec are the references for the tracing sections above.

### Deep dives

- Majors, Fong-Jones & Miranda, *Observability Engineering* (O'Reilly, 2022) — the book that sharpened the observability-vs-monitoring distinction and makes the case for high-cardinality, high-dimensionality event data and querying *unknown-unknowns*.
- *Prometheus documentation — instrumentation and best practices* — histogram vs. summary, label cardinality, and `histogram_quantile`, the practical backbone of the metrics section.

### Historical context

- Sigelman et al., *"Dapper, a Large-Scale Distributed Systems Tracing Infrastructure"* (Google, 2010) — the paper that introduced spans, trace context propagation, and sampling, and from which essentially every modern tracing system, OpenTelemetry included, descends.
- *"The Log: What every software engineer should know about real-time data's unifying abstraction"* (Jay Kreps, 2013) — the foundational case for treating logs as structured, append-only streams of events rather than human-readable prose.