Benchmarking Systems
benchmarking, performance, latency, throughput, percentiles, tail latency, warmup, variance, profiling, measurement
Introduction
A platform team had a Java service that, by their numbers, was fast: mean response time of 8 milliseconds, posted on a dashboard, defended in capacity planning. They shipped a change that cut the mean to 7. The graph dipped, someone got a high-five, and the on-call pager went off four hours later. Latency had not improved; it had gotten worse for the people who mattered. The change had shaved a millisecond off the common case while doubling the cost of the rare slow path, and at their request volume, “rare” meant tens of thousands of users a day now waiting half a second instead of a quarter. The mean fell because the fast requests got slightly faster and there were far more of them. The tail rose because the slow requests got much slower. One number moved down; the experience got worse as the tail went up.
It got worse when they looked closer. The 8-millisecond figure itself was an artifact. The benchmark fired a request, waited for the response, then fired the next — a closed loop. When the server stalled, the load generator stalled with it, so the moments of high latency produced fewer samples, not more. The benchmark had been quietly averaging away the exact behavior it was supposed to catch. The team had optimized against a number wrong in two compounding ways: a mean where they needed a tail, measured by a harness that under-counted even the tail it captured.
This is the normal condition of benchmarks, not an unlucky one. A benchmark is an experiment, and most experiments people run on software are wrong in a handful of predictable ways. The discipline of this chapter is learning to run the experiment so the number you report is the number you meant to measure.
The Core Insight
The instinct is to treat a benchmark as a measurement — you point a stopwatch at the code and read off the truth. But software performance is not a single value waiting to be observed; it’s a distribution produced by a system whose state you are also, inadvertently, measuring. The number you actually want is rarely the mean. It’s the shape of the distribution, and above all its tail — because at any real scale, every request is somebody’s slow request.
Naive benchmarks go wrong in a small, recurring catalogue of ways. Learn the list and you can audit almost any performance claim, including your own:
- No warmup. The first runs pay for JIT compilation, cold caches, lazy initialization, and connection setup. Time them and you measure startup, not steady state.
- Measuring the wrong thing. The harness times interpreter startup, an unused library, the network round trip, or the load generator itself — anything but the code under test.
- Reporting an average over a tail. A mean collapses a distribution into one number and hides exactly the slow requests users feel. A good mean routinely sits on top of a brutal p99.
- Ignoring variance. One run proves nothing. Without repeated trials you can’t tell a real 5% improvement from the machine being slightly quieter that minute.
- An unrepresentative workload. A dataset that fits in RAM, an empty HTTP handler, a batch size of one — each measures something other than production and flatters whatever you’re comparing.
- Dead-code elimination and coordinated omission. The optimizer deletes work whose result you never use, so a real computation “runs in 0 ns.” And a closed-loop load generator stops sending requests precisely when the server is slow, omitting the tail in a coordinated way that can under-report p99 by orders of magnitude.
Every one of these is a way of measuring a different system than the one you care about. Disciplined benchmarking is the practice of closing each gap on purpose.
A mental model
Treat a benchmark as a controlled laboratory experiment, not a stopwatch. You isolate the one variable you’re testing, hold everything else fixed, warm the apparatus to steady state, take many readings, and report the uncertainty — not a lone number with an implied infinite precision. The threats to validity are the same ones any experimentalist worries about: a contaminated sample (the noisy machine), an instrument that perturbs what it measures (the load generator that becomes the bottleneck), and a result you can’t reproduce (the run with no recorded environment).
There’s a companion image worth keeping for the tail specifically. At scale, your slowest customer is your reputation. A service handling a million requests an hour serves its p99 to ten thousand of them every hour — not a rounding-error fringe but a steady stream of real people. A page that loads in 80 ms at the median and 2 seconds at p99.9 is, for one user in a thousand, simply a slow page. The mean tells you how the average request felt; the tail tells you how many users are quietly deciding your product is broken.
When (and what) to benchmark
Benchmark when a performance claim needs evidence — you’re choosing between implementations, validating that an optimization actually helped, or guarding against regressions over time. The first decision is which kind of benchmark answers your question. A micro-benchmark isolates one function or code path; it’s the right tool for validating a specific optimization, and the most dangerous one, because an isolated loop is exactly what a compiler loves to delete or a JIT loves to special-case in ways production never sees. A macro- or end-to-end benchmark drives a realistic workload through the whole system; it’s slower and noisier but it measures something you’ll actually ship.
Figure 49.1 shows the pipeline every honest benchmark follows, micro or macro: a representative workload, a discarded warmup, repeated measured trials, the full distribution collected and summarized by percentiles, a variance check to separate signal from noise, and a comparison against a baseline.
There’s a prior question, though: should you benchmark at all yet? A benchmark tells you how fast something is; a profiler tells you where the time goes. If you don’t yet know which 5% of the code accounts for 95% of the latency, you’ll benchmark the wrong thing precisely. Profile first to find the hot path, then benchmark it to measure and defend the fix. And sometimes the honest answer is not to benchmark: for a throwaway question, a full harness is more work than it’s worth, and a benchmark you can’t run twice the same way will mislead more than it informs.
What you’ll learn
- How to structure a benchmark as a controlled experiment — isolate the variable, warm to steady state, run repeated trials, and control the environment
- Why the tail of the latency distribution, not the mean, is the number that predicts user experience — and how p50/p95/p99/p99.9 each tell a different story
- How throughput and latency trade off, and how Little’s Law lets you reason about the two together
- How to tell a real difference from run-to-run noise using variance and repeated trials
- Where the four common benchmark domains — languages, web frameworks, databases, and ML — each lie to you, and the one control that fixes each
- How to spot the integrity failures (dead-code elimination, coordinated omission, cherry-picking, vendor framing) that turn a benchmark into propaganda
Prerequisites
- Software-engineering fundamentals: profiling intuition, compilers vs. interpreters, and how caches and memory hierarchies shape performance
- Basic statistics: mean, median, percentiles, standard deviation, and what it means for a difference to be statistically significant
Methodology: the experiment, not the stopwatch
The whole craft reduces to one sentence: measure steady-state work, in isolation, many times, on a quiet machine, and make sure the work can neither be skipped nor inflated. Every clause is there to defeat a specific failure mode.
Steady state means discarding a warmup phase. JIT runtimes spend their first calls interpreting bytecode and compiling hot paths; a GPU’s first kernels pay for autotuning and allocation that can cost ten times the steady-state call; a database’s first queries fill a cold buffer cache. Time those and you measure the climb, not the cruise. Many times means repeated trials, because a single run can’t distinguish a real change from a momentary lull in noise. On a quiet machine means controlling the environment: same hardware, fixed input, pinned versions, CPU frequency locked, turbo boost disabled, and thermal throttling ruled out (you cannot switch throttling off, only keep the machine cool enough that it never triggers). A warm laptop and a cold one produce different numbers for identical code.
The clause about work being skipped or inflated is the subtle one, and it has two faces. The skip is dead-code elimination: an optimizer that sees you never use a computation’s result is entitled to delete the computation. The inflation is timing the wrong boundaries — including process startup, an import, or a connection handshake in the interval you attribute to the algorithm.
The fix for the skip is to consume the result so the compiler can’t prove it’s dead. Most benchmarking libraries ship a “black box” helper for exactly this:
    // Illustrative: force the optimizer to keep the work it would otherwise delete.
    use std::hint::black_box;

    const N: u64 = 1_000_000; // placeholder iteration count

    // Placeholder for the real function under test.
    fn expensive(i: u64) -> u64 {
        i.wrapping_mul(i) ^ (i >> 3)
    }

    fn bench() {
        // Without black_box, an unused sum is provably dead and compiled away to "0 ns".
        let total: u64 = (0..N).map(|i| expensive(black_box(i))).sum();
        black_box(total); // the result is "observed", so the work must actually happen
    }

The fix for inflation is to time the function in process, after warmup, over many iterations, and never to shell out and time a whole process invocation when you meant to time a function. A purpose-built harness (hyperfine for CLIs, pytest-benchmark, Go’s testing.B, Criterion for Rust) handles the warmup loop and statistics for you; reaching for time.time() around a single call is how most bad benchmarks are born.
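As a rough sketch of what such a harness does internally, assuming nothing beyond the Python standard library, the loop below discards a warmup phase, times many in-process iterations, and keeps every sample instead of a running average. The names `work`, `bench`, `warmup`, and `trials` are hypothetical, and a real harness does considerably more (calibration, outlier handling, environment checks).

```python
import statistics
import time

def work(n: int = 2_000) -> int:
    """Hypothetical CPU-bound function under test."""
    return sum(i * i for i in range(n))

def bench(fn, warmup: int = 20, trials: int = 200) -> dict[str, float]:
    """Time fn in process: discard warmup runs, then keep every measured sample."""
    sink = 0
    for _ in range(warmup):            # warmup: results are used, timings discarded
        sink ^= fn()
    samples_us = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        sink ^= fn()                   # consume the result of every timed call
        samples_us.append((time.perf_counter_ns() - start) / 1_000)
    cuts = statistics.quantiles(samples_us, n=100)  # 99 cut points: cuts[k-1] ~ pk
    return {
        "trials": float(len(samples_us)),
        "p50_us": statistics.median(samples_us),
        "p95_us": cuts[94],
        "p99_us": cuts[98],
        "stdev_us": statistics.stdev(samples_us),
    }

result = bench(work)
print(result)
```

CPython will not delete an unused call, so consuming the result here is habit, not necessity; in a compiled or JIT-compiled language the same line is what keeps the measurement honest.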
Commit the benchmark harness and a machine specification next to every result. An unreproducible benchmark is an opinion, not evidence — six months later, “it was faster on my box” is unfalsifiable and therefore worthless.
Build it → A complete benchmarking harness with warmup, repeated trials, statistical analysis, and regression detection is implemented in Project 49: AI Benchmark Suite, the direct analog of this chapter for the ML domain.
Percentiles and the tail
If you take one thing from this chapter, take this: report the distribution, lead with the tail. The mean is the most natural summary and the most misleading, because latency distributions are not symmetric. They have a hard floor — nothing is faster than the fastest possible path — and a long right tail produced by garbage collection pauses, cache misses, lock contention, retries, and noisy neighbors. A single slow outlier drags the mean upward while leaving the median untouched, so the mean tells you neither what a typical request feels nor what a bad one feels.
Percentiles read the distribution at fixed points. The p50 (median) is the typical request; p95 and p99 are the tail, the experience of the unlucky 5% and 1%. The p99.9 is the deep tail, and it matters more than its name suggests because of fan-out: a page that makes 100 backend calls and waits for all of them experiences its slowest call, so the page’s median is governed by each service’s p99-ish behavior. Tail latency compounds across a request that touches many services. This is why teams at scale watch p99 and p99.9 the way the introduction’s team should have — and why a 15% improvement in the mean is not news until you’ve seen what happened to p99.
A short illustration of the gap, computed from raw samples rather than trusting a mean:
    # Illustrative: the mean hides the tail; percentiles expose it.
    import numpy as np

    def summarize(latencies_ms: list[float]) -> dict[str, float]:
        """Report the distribution, not a single number."""
        a = np.asarray(latencies_ms)
        return {
            "mean": float(a.mean()),
            "p50": float(np.percentile(a, 50)),
            "p99": float(np.percentile(a, 99)),
            "p99.9": float(np.percentile(a, 99.9)),
        }

Run that over a realistic sample and you routinely see a mean of, say, 12 ms sitting above a p50 of 8 ms and below a p99 of 70 ms: the average lands between the typical case and the tail, describing neither. The honest report shows all of them.
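The fan-out claim made earlier in this section can be checked with a line of arithmetic. The sketch below assumes the backend calls are independent, which is an idealization (real calls often share a slow dependency and are correlated); under that assumption, the chance that a page waits on at least one call beyond a given percentile is 1 − qⁿ. The function name `p_page_hits_tail` is hypothetical.

```python
def p_page_hits_tail(fanout: int, percentile: float) -> float:
    """Probability that at least one of `fanout` independent calls lands beyond `percentile`."""
    q = percentile / 100.0
    return 1.0 - q ** fanout

for n in (1, 10, 100):
    print(n, round(p_page_hits_tail(n, 99.0), 3))
```

With 100 calls, roughly 63% of pages wait on at least one call from the slowest 1%, which is why the median page tracks each backend’s p99 and not its median.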
The deepest tail trap has a name: coordinated omission, the bug that mangled the introduction’s benchmark. A closed-loop load generator sends a request, waits for the response, then sends the next. When the server is healthy this is fine. When the server stalls for 200 ms, the load generator stalls with it and simply doesn’t send the requests it would have sent during the stall — so the worst period of the test produces the fewest samples. The slow requests that should dominate your tail are silently never issued. The result can under-report p99 by 10× or 100×.
A team load-tested a service with a closed-loop client and reported a p99 of 12 ms. Production p99 was over a second. The client had been pausing during every GC stall instead of issuing requests through it, so the stalls were absent from the measured distribution. The fix was an open-loop generator that issues requests on a fixed schedule regardless of whether prior responses have returned. When the server stalls, the requests pile up and their full waiting time (including the time they sat in the queue) lands in the histogram. The reported p99 jumped overnight from 12 ms toward the second-plus that production had been showing all along; nothing about the service had changed except that the benchmark had stopped lying.
The general defense against coordinated omission is to measure latency from the moment a request should have been sent, not from when it actually was, and to prefer open-loop load generators (wrk2, vegeta) over closed-loop ones for any test where the tail matters.
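A small deterministic simulation makes the effect visible. This is a sketch under loudly stated assumptions: a single FIFO server with a fixed 1 ms service time, one 200 ms stall, and a target rate of one request every 10 ms. All function names and parameters are hypothetical. The closed-loop client measures from the actual send; the open-loop client measures from the intended send.

```python
def finish(start: float, service_ms: float, stall: tuple[float, float]) -> float:
    """Completion time for work that begins at `start`; the server is frozen during `stall`."""
    if stall[0] <= start < stall[1]:
        start = stall[1]
    return start + service_ms

def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
    return ordered[rank - 1]

def closed_loop(duration, interval, service_ms, stall):
    """Send, wait for the response, pause, send again: a stall suppresses sends."""
    latencies, t = [], 0.0
    think = interval - service_ms
    while t < duration:
        done = finish(t, service_ms, stall)
        latencies.append(done - t)          # measured from the actual send
        t = done + think
    return latencies

def open_loop(duration, interval, service_ms, stall):
    """Send on a fixed schedule; queued requests are charged their full wait."""
    latencies, server_free = [], 0.0
    for k in range(int(duration // interval)):
        intended = k * interval
        done = finish(max(intended, server_free), service_ms, stall)
        server_free = done
        latencies.append(done - intended)   # measured from the intended send
    return latencies

STALL = (1000.0, 1200.0)
closed = closed_loop(10_000.0, 10.0, 1.0, STALL)
opened = open_loop(10_000.0, 10.0, 1.0, STALL)
print(len(closed), percentile(closed, 99))   # 980 1.0
print(len(opened), percentile(opened, 99))   # 1000 111.0
```

Under these parameters both clients see the same server and the same stall, yet the closed-loop client records one slow sample and reports a p99 of 1 ms, while the open-loop client records the whole backlog and reports 111 ms.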
Throughput, latency, and the tradeoff
Latency is how long one operation takes; throughput is how many you complete per second. They are not the same axis, and optimizing one routinely costs the other. Batching is the canonical example: grouping work amortizes fixed costs and lifts throughput, but every item now waits for the batch to fill, so per-item latency rises. A GPU running inference at batch size 1 might serve 125 requests per second at 8 ms each while sitting 18% utilized; at batch size 32 it serves ten times the throughput at three times the latency, finally saturating the hardware. Neither is “better” — one is tuned for a latency budget, the other for a throughput budget, and the benchmark has to say which it’s measuring.
The two axes are linked, and Little’s Law is the link: the average number of requests in flight in a stable system equals the arrival rate times the average time each spends in the system (L = λ × W). The practical reading is that throughput, concurrency, and latency are three views of one system, not three independent dials. Push offered load up and latency stays flat — until some resource (a thread pool, a connection pool, CPU, a database) saturates, at which point requests start queueing and latency knees sharply upward while throughput plateaus. That knee is the most important point on the curve: it’s the real capacity of the system, the number that should drive autoscaling and provisioning. A benchmark that reports a single throughput figure without finding the knee has measured a point on a curve and called it the curve.
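Little’s Law can be checked against the batching numbers from the previous paragraph. The sketch below reads “three times the latency” as 24 ms and treats every request in a batch as spending that long in the system, which ignores time spent waiting for the batch to fill; the function name `in_flight` is hypothetical.

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * latency_s

# Batch size 1: 125 requests/s at 8 ms each.
l1 = in_flight(125.0, 0.008)

# Batch size 32 at 24 ms per batch: throughput is 32 requests every 24 ms.
rate32 = 32 / 0.024
l32 = in_flight(rate32, 0.024)

print(round(l1, 3), round(rate32, 1), round(l32, 3))
```

The in-flight count comes out to about 1 and about 32, matching the batch sizes, and the batched throughput of roughly 1,333 requests per second is a little over ten times the unbatched 125. Throughput, concurrency, and latency move together; fix any two and the third follows.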
Variance: one run proves nothing
A benchmark is a sample, and samples have noise — background processes, cache state, frequency scaling, the scheduler. The first job of a benchmark is to make a difference visible above that noise, which means you cannot conclude anything from a single run of each variant. You need repeated trials and a sense of the spread.
The cheap, robust habit is to run each variant many times and compare distributions, not point estimates. If variant A is 3% faster on the mean but the run-to-run standard deviation is 8%, a handful of runs cannot separate that difference from noise and you have learned nothing; if A is 3% faster and the standard deviation is 0.5%, you have a real effect. Tools that report min/mean/max and standard deviation (or run a proper significance test) give you what you need to make this judgment; eyeballing two single numbers does not.
    # Illustrative: a difference is only real if it clears the noise.
    import statistics as st

    def is_real(a: list[float], b: list[float]) -> bool:
        """Crude guard: the gap in means must exceed the combined run-to-run spread."""
        gap = abs(st.mean(a) - st.mean(b))
        noise = st.pstdev(a) + st.pstdev(b)
        return gap > noise  # in production, use a real t-test / confidence interval

That guard is deliberately blunt (real practice uses confidence intervals or a t-test), but it encodes the right instinct: a result you can’t distinguish from the machine’s mood is not a result. The cause of high variance is usually environmental (thermal throttling, a noisy neighbor, an unpinned CPU), so when the spread is large, fix the environment before you trust the comparison.
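One way to get a confidence interval without extra dependencies is a percentile bootstrap on the difference of means. This is a minimal sketch, not a substitute for a statistics library: the function name `bootstrap_diff_ci`, its defaults, and the sample timings are all hypothetical. If the interval excludes zero, the difference clears the noise at the chosen level.

```python
import random
import statistics as st

def bootstrap_diff_ci(a, b, resamples=2_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)           # seeded so the result is reproducible
    diffs = []
    for _ in range(resamples):
        ra = rng.choices(a, k=len(a))   # resample each variant with replacement
        rb = rng.choices(b, k=len(b))
        diffs.append(st.mean(ra) - st.mean(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * resamples)]
    hi = diffs[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi

# Hypothetical run times in ms for two variants, ten trials each.
a = [10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 10.1, 9.8, 10.2, 10.0]
b = [10.9, 11.2, 10.8, 11.0, 11.1, 10.7, 11.3, 10.9, 11.0, 11.1]

lo, hi = bootstrap_diff_ci(a, b)
verdict = "real difference" if hi < 0 or lo > 0 else "indistinguishable from noise"
print(lo, hi, verdict)
```

Here every trial of `a` is faster than every trial of `b`, so the interval sits entirely below zero. Feed the function two sets of runs from the same variant and the interval straddles zero, which is the signal to stop celebrating.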
Four lenses, four ways to be fooled
The same methodology applies everywhere, but each domain has a signature trap. Knowing the trap lets you read other people’s benchmarks — and design your own — with the right suspicion.
Languages. Cross-language comparisons are a minefield of unfairness. The single most common error is comparing a debug build of a compiled language against an optimized one: a Rust binary built without --release can run 20× slower than the same code optimized, and that one row poisons the whole table. The second is comparing non-idiomatic code — a naive Python loop against a SIMD-and-preallocation Rust implementation measures the author’s effort per language, not the languages. The third is forgetting warmup on a JIT runtime. And the fourth is forgetting what the library does: naive Python is 100× slower than NumPy on matrix math because NumPy drops into compiled BLAS — the gap measures the library, not the language. Read language benchmarks by workload class (CPU-bound vs. I/O-bound vs. memory-bound), never as a global ranking, because a language that wins on raw compute often loses on high-concurrency I/O.
Build it → The cross-language performance references these comparisons target are the Rust projects: Project 03: High-Performance Cache, Project 06: Async Runtime, and Project 20: SIMD Analytics Engine — each benchmarked with Criterion against scalar and managed-language baselines.
Web frameworks. The signature trap is benchmarking an empty “hello world” handler. Framework overhead is usually a small slice of a real request — a few milliseconds against a 30 ms database call — so a leaderboard showing one framework at 5× the requests-per-second of another is true and almost irrelevant for an app whose bottleneck is downstream. Do the budget math: at the load you actually expect, how many milliseconds does the framework cost per request versus your handler? Benchmark a handler that does representative work, keep HTTP keep-alive on (or you measure connection setup), and verify the load generator isn’t itself the bottleneck — if every framework posts identical numbers, your client saturated first.
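The budget math is short enough to write down. In this sketch the 30 ms handler comes from the paragraph above, while the two framework overheads are hypothetical values chosen only to produce a 5× gap on an empty handler.

```python
def end_to_end_ms(framework_overhead_ms: float, handler_ms: float) -> float:
    """Per-request latency when framework overhead and handler work run back to back."""
    return framework_overhead_ms + handler_ms

HANDLER_MS = 30.0          # the database-bound handler from the text
fast, slow = 0.5, 2.5      # hypothetical overheads: a 5x gap on "hello world"

ratio_empty = slow / fast
ratio_real = end_to_end_ms(slow, HANDLER_MS) / end_to_end_ms(fast, HANDLER_MS)
print(ratio_empty, round(ratio_real, 3))
```

A 5× difference on the leaderboard shrinks to under 7% once the handler does real work, which is the sense in which the headline number is true and almost irrelevant.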
Build it → The web stacks these load tests exercise end-to-end are Project 02: Microservice Platform (Go + gRPC behind a Kong gateway) and Project 05: SaaS Web Platform (a full FastAPI/ASGI stack under production-shaped load).
Databases. There is no fastest database, only a fit for a workload, and the trap is a dataset that fits in RAM. At 1 GB everything lives in the buffer cache, so a point-read benchmark measures memory and the network round trip rather than the engine — and the in-memory store “wins” in a way that says nothing about a disk-bound production load. Size the working set past RAM the way production does, and the gap between engines narrows sharply because both are now bound by disk I/O and buffer-pool efficiency. Always report working-set-versus-RAM, match the read/write mix to your real one, and pool connections exactly as the application would — without pooling, a short benchmark measures TCP and auth handshakes, not queries.
Build it → Storage engines built and benchmarked the same way live in Project 17: Columnar Query Engine (OLAP scan/aggregate throughput) and Project 52: Time-Series Database (write throughput and range-query p99 under load).
ML. GPU work is asynchronous, so the signature trap is timing a kernel launch instead of its execution: call the model, stop the clock, and you’ll report 2 ms for an 80 ms computation because the GPU is still working when you read the time. You must synchronize before stopping the timer, and warm up first to absorb cuDNN autotuning and allocation. Beyond that, wall-clock time alone is uninterpretable — capture GPU utilization alongside it. Low utilization with high wall-clock time means you’re data-loading- or transfer-bound, and no kernel optimization will help; the fix is batching and prefetching, not a faster matmul. Sweep batch size and report the curve, because a single number hides the entire latency/throughput tradeoff.
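The launch-versus-execution mistake can be reproduced without a GPU. The sketch below uses a worker thread as a stand-in for the device, which is an analogy, not GPU code: `submit` returns as soon as the work is queued, the way a kernel launch does, and `.result()` plays the role of the synchronize step (in PyTorch, for example, that step is `torch.cuda.synchronize()`). The `kernel` function and its 50 ms sleep are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def kernel() -> int:
    """Stand-in for device work that keeps running after the launch call returns."""
    time.sleep(0.05)
    return 42

with ThreadPoolExecutor(max_workers=1) as device:
    # Wrong: stop the clock right after the launch.
    t0 = time.perf_counter()
    pending = device.submit(kernel)
    launch_only_s = time.perf_counter() - t0

    pending.result()  # drain the queue before the second measurement

    # Right: wait for completion before stopping the clock.
    t0 = time.perf_counter()
    device.submit(kernel).result()
    synced_s = time.perf_counter() - t0

print(f"launch only: {launch_only_s * 1e3:.2f} ms, synchronized: {synced_s * 1e3:.2f} ms")
```

The first timing captures only the cost of queueing the work; the second captures the work itself. The same structure applies on a real accelerator, with the framework’s synchronize call in place of `.result()`.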
Build it → The ML-side analog of this whole chapter — a full benchmarking harness with statistical analysis and regression detection — is Project 49: AI Benchmark Suite, with the serving hot paths it measures in Project 44: Autoregressive Inference and Project 19: GPU Kernel Optimization.
Benchmark integrity
The final discipline is honesty, because benchmarks are unusually easy to weaponize. Report the full environment — hardware, versions, build flags, dataset size — so the result can be reproduced or falsified. Don’t cherry-pick the run that looks best; report the distribution across runs, including the variance. Be especially skeptical of vendor benchmarks, where the framing is chosen to favor the product: the workload, the competitor’s configuration, the metric, and the percentile are all degrees of freedom that a motivated author can tune. The tell is usually a single headline number with no distribution, no environment spec, and a workload that happens to be exactly what the product is best at. A benchmark that doesn’t tell you how to reproduce it isn’t evidence; it’s marketing with a chart.
Practical exercise
- Level I — Benchmark a function correctly. Take a CPU-bound function and time it naively with a single time.time() call around one invocation. Then rewrite it the right way: add a warmup phase, run N ≥ 100 trials in process, consume the result so it can’t be optimized away, and report p50/p95/p99, not just the mean. Show how the naive number differs from the steady-state distribution and explain each gap.
- Level II — Find the knee without coordinated omission. Build an open-loop load test (issue requests on a fixed schedule, not after each response returns) against a service. Sweep offered load upward and plot the full latency distribution at each level. Identify the knee where p99 latency turns sharply upward, name the resource that saturates there, and contrast your p99 against what a closed-loop client would have reported at the same load.
- Level III — Design a CI performance-regression gate. Specify the metric (which percentile, on which workload), the threshold (expressed as a multiple of measured run-to-run variance, not a fixed percentage), the number of trials per run, and the rollback policy when the gate trips. Then argue why it won’t flake: how does it distinguish a real regression from noise, and what does it do when the CI runner itself is a noisy, shared machine?
Summary
A benchmark is a controlled experiment, and most software benchmarks are wrong in a short list of predictable ways: no warmup, measuring the wrong thing, averaging over a tail, ignoring variance, an unrepresentative workload, and the twin distortions of dead-code elimination and coordinated omission. The number you almost always want is not the mean but the distribution, led by its tail — because at scale the tail is somebody’s whole experience. Discipline means isolating one variable, warming to steady state, running many trials on a controlled machine, making the work impossible to skip or inflate, finding the knee where throughput trades against latency, and proving a difference clears the noise before you believe it.
Key takeaways
- Report the distribution and lead with the tail (p50/p95/p99/p99.9); a good mean routinely hides a brutal p99.
- Warm up to steady state and control the environment, or you measure startup and background noise instead of the code.
- Defeat dead-code elimination (consume the result) and coordinated omission (use an open-loop load generator); either one can skew a result by orders of magnitude.
- One run proves nothing; a difference is real only when it clears the run-to-run variance.
- Throughput and latency trade off; find the knee, because that’s the system’s real capacity.
- Each domain has a signature trap — debug builds (languages), empty handlers (web), in-RAM datasets (databases), and unsynchronized async timing (ML).
Connections to other chapters
- Containerization with Docker (sibling): image size and cold-start time are things you benchmark, and they follow the same rules — warm up, run repeated trials, report the distribution rather than a single lucky pull.
- Orchestration with Kubernetes (sibling): autoscaling and capacity decisions are only as good as the load numbers behind them. The knee of the latency curve — found with an open-loop load test — is the input that should set your replica counts and HPA thresholds.
- Observability and Monitoring (extension, cross-cutting): production metrics are continuous benchmarking. Percentile dashboards, SLOs on p99, and latency histograms are the same statistics from this chapter, computed live on real traffic instead of in a harness.
- Software-Engineering Fundamentals / Performance (prerequisite and extension): profiling locates the hot path before you benchmark it. Benchmark without profiling first and you’ll measure the wrong code precisely; the two skills are a loop, not a sequence.
Further reading
Essential
- Gil Tene, How NOT to Measure Latency (talk, 2015) — the definitive treatment of coordinated omission and why your latency numbers are probably wrong.
- Brendan Gregg, Systems Performance: Enterprise and the Cloud (2nd ed., 2020) — the practitioner’s reference for profiling and benchmarking real systems.
Deep dives
- Raj Jain, The Art of Computer Systems Performance Analysis (1991) — the rigorous foundation: experimental design, statistics, and the catalogue of common mistakes.
- Neil Gunther, Guerrilla Capacity Planning — the Universal Scalability Law (USL), which models exactly where and why the throughput curve knees over under contention.
Historical context
- John D. C. Little, A Proof for the Queuing Formula L = λW (1961) — the law that ties throughput, concurrency, and latency into one relationship.
- Jim Gray (ed.), The Benchmark Handbook (1993) — the origin of disciplined, domain-specific system benchmarks (TPC and kin) and the case for representative workloads.