Production ML

Keywords

ml monitoring, data drift, concept drift, ab testing, shadow deployment, model optimization, quantization, pruning, distillation, online evaluation

Introduction

The model launched at 92% accuracy. The team had done everything right: a clean holdout set, cross-validation, a careful offline evaluation that everyone signed off on. It went to production, the dashboards were green, and the project moved on to the next thing. Six months later a customer complaint triggered an audit, and someone finally pulled the numbers. The model was running at 70%. It had been sliding for months — a percentage point here, two there — and nobody had noticed, because nothing had broken. Latency was fine. Error rates were zero. The service had not thrown a single exception. By every signal the operations team watched, the system was perfectly healthy. It was also, increasingly, wrong.

There was no bug to find. The code that scored 92% in March was byte-for-byte the code scoring 70% in September. What changed was the world: users shifted their behavior, an upstream source quietly altered the shape of a feature, and the population the model served in autumn no longer resembled the one it had learned from in spring. The model had not failed. Its assumptions had expired, and nothing in production was watching for that.

This is the failure mode that makes production ML different from production software, and it inverts an intuition every engineer carries. We are taught that software is stable if you leave it alone — a deployed service nobody touches keeps behaving as it did the day it shipped. Code doesn’t rot if you don’t touch it. Machine learning models rot precisely because you don’t. A model is a frozen snapshot of a relationship that held in its training data, deployed into a world that refuses to stay frozen; leave it alone and the world drifts out from under it. Maintaining a model in production is not a launch you finish — it is a feedback loop you keep running.

The Core Insight

Ordinary software has a clean definition of “correct”: given an input, it produces the output the specification demands, and you can test that the day you ship and trust it forever. A model has no such guarantee. Its correctness is conditional — it is right to the extent that the world it sees matches the world it learned from — and that condition silently weakens every day the world moves on. This produces failure modes ordinary software simply does not have, and three consequences follow directly.

  1. You must monitor the model’s behavior, not just the system’s. Latency, throughput, and error rate tell you the service is healthy; they say nothing about whether the predictions are any good. A model can answer in two milliseconds, never crash, and be confidently, expensively wrong. The thing that decays is invisible to every metric a traditional observability stack collects, so production ML adds a second layer: the input distributions, the prediction distributions, and — when it eventually arrives — the actual accuracy.
  2. You must evaluate new models on live traffic, because offline metrics don’t guarantee online wins. A candidate that beats the incumbent on your test set can still lose in production: the test set is yesterday’s data, it can’t capture how users react to the new model, and a small offline gain can hide a real-world regression. The only honest judge of a model in production is production — which is what A/B tests and shadow deployments are for.
  3. You must often optimize the model to fit production at all. The network that wins on the leaderboard may be too slow or too expensive to serve under a latency budget, so you trade a little accuracy for a lot of speed and cost — quantization, pruning, distillation — and then verify the trade was worth it.

Put together, production ML is a loop: serve, watch, detect decay, retrain, validate on live traffic, redeploy — and around again. The rest of this chapter is that loop, in order.

A mental model

Hold three images and you have the whole chapter.

The first is drift as the ground shifting under a building. A model is a structure engineered for the soil it was poured on — the distribution of its training data. The building doesn’t crack because someone hit it; it cracks because the ground beneath it slowly moves, and a structure that was sound on the original soil is now subtly, dangerously out of true. Monitoring is the network of strain gauges that tells you the ground is moving before the walls come down.

The second is live traffic as the only honest judge. Offline evaluation is a rehearsal in an empty theater: it tells you the actors know their lines, not whether the audience laughs. A/B testing and shadow deployment put the new model in front of the real audience under controlled conditions, so the verdict comes from reality, not from your own held-out copy of the past.

The third is optimization as a deliberate trade. There is a triangle — accuracy, latency, size — and you cannot maximize all three. Optimization is choosing, on purpose, to give up a sliver of accuracy for a large reduction in latency or cost, with the loss measured and bounded rather than discovered by a customer.

What to watch and how to ship safely

The loop gives you a decision framework. Monitor at three levels, ordered by how fast the signal arrives: the inputs first (has the feature distribution drifted? This is the earliest warning and needs no labels), the outputs next (have predictions shifted, are confidences collapsing? Still label-free, still early), and the outcomes last (is accuracy actually holding? This is the truth, but lagged by weeks, so it confirms rather than first-alarms). When the monitors fire and you have a fix to ship, roll it out in increasing order of exposure: shadow (scores live traffic but reaches no user), then canary (a small slice of real users), then a full A/B test (a controlled split that measures the causal effect), so that each stage can catch what the cheaper checks missed at a contained blast radius. Figure 39.1 shows the whole circuit: traffic, monitors, alert, retrain, staged validation, promote or roll back.

What you’ll learn

  • Why ML models decay in production even when the code never changes, and how to distinguish the kinds of decay — data drift, concept drift, and label drift
  • How to monitor a model’s behavior (input, prediction, and outcome distributions) rather than only its system health, and which statistical test fits which signal
  • Why ground-truth labels arrive late, and how to get early warning without them
  • Why offline metrics mislead, and how shadow deployments, canaries, and A/B tests validate a model on the only data that counts — live traffic
  • How to design a guarded online experiment: randomization unit, guardrail metrics, and why peeking inflates false positives
  • How quantization, pruning, and distillation shrink a model for latency and cost with a bounded, measured accuracy loss
  • How the pieces close into a retraining loop: detect drift, retrain, evaluate, ship safely, and repeat

Prerequisites

  • ML Systems: how a model is served and how training, the feature pipeline, and experiment tracking fit together — the infrastructure this chapter’s monitoring and retraining ride on top of.
  • ML Engineering Overview: offline model evaluation, train/test splits, the metrics (accuracy, precision/recall, AUC, RMSE) whose online behavior we now track.
  • Comfort with basic statistics: distributions, hypothesis testing, p-values, confidence intervals.

Why models decay: drift, and its kinds

Start with the centerpiece, because it motivates everything else: why does a model that was correct become wrong without anyone changing it? The answer is drift, and it comes in distinct kinds that demand different responses.

Data drift (also called covariate or feature drift) is a change in the distribution of the inputs. The features the model sees in production stop looking like the features it trained on — a sensor recalibrates, a new user segment arrives, a marketing campaign floods the funnel with a different demographic. The relationship the model learned may still hold, but it is now being asked about regions of the input space it saw little of in training, so its answers grow less reliable. Crucially, data drift is detectable without any labels: you have the training distribution and you have the live inputs, and you can compare them directly. That makes it your earliest, cheapest warning.

Concept drift is more insidious: the inputs may look identical, but the relationship between inputs and the target has changed. The same feature vector that predicted “low risk” last year predicts “high risk” now, because the world’s underlying rules moved: fraud tactics evolved, consumer preferences shifted, a competitor changed the game. This is the kind of decay that can take a 92% model to 70% with no visible symptom: nothing about the inputs need look wrong, yet the model’s mapping from input to output is stale. Concept drift is harder to catch because, by definition, you cannot see it in the inputs alone; you need outcomes, the actual labels, to know the mapping has broken.

Label drift is a shift in the distribution of the target itself — the base rate of the positive class changes. A fraud model trained when 2% of transactions were fraudulent behaves differently in a month when 8% are, even if every other relationship holds, because its calibrated thresholds and priors assumed the old balance.

The reason these distinctions earn their keep is that they tell you what kind of signal will detect each. Data drift lives in distributions you can observe immediately, and label drift usually shows up early in the mix of predictions, before the true base rate can be confirmed; concept drift hides until ground truth arrives. Which brings us to the hardest constraint in the whole discipline.

The ground-truth lag problem

The metric you actually care about — is the model right? — is usually the one you get last. A fraud model decides in milliseconds, but you do not learn whether a transaction was fraudulent until it is disputed, charged back, or quietly ages out as legitimate — days or weeks later. A loan-default model waits months. A medical-outcome model can wait years. The label is the truth, and the truth is lagged.

This lag is the reason monitoring is layered rather than a single accuracy gauge. If the only thing you watched were accuracy, you would be flying blind for the entire window in which a bad model is already serving millions of predictions, and you would learn of the disaster only after it had fully happened. So you build a hierarchy of leading indicators: watch the inputs (drift detectable now), watch the predictions (their distribution and confidence, detectable now), and treat lagged accuracy as the final, authoritative confirmation rather than the first alarm. An input distribution that is clearly shifting today is your warning that the lagged accuracy six weeks from now may disappoint, and your cue to investigate before it does.

ML monitoring in practice

General observability — metrics, logs, traces, the telemetry that tells you a service is up and fast — is covered in Part IV, and an ML service needs all of it. What follows is the ML-specific layer that sits on top: monitoring the model’s behavior, not the server’s. It maps directly onto the three drift levels.

Input monitoring compares the live feature distribution against a reference, usually the training set. The right comparison depends on the feature type, and getting this wrong is a classic mistake — a test built for continuous data will mislead you on categorical features. For continuous features, the Kolmogorov–Smirnov test measures the largest gap between two cumulative distributions; for categorical features, a chi-squared test compares category frequencies. A widely used third option, from credit-risk modeling, is the Population Stability Index (PSI), whose practical virtue is returning a magnitude rather than a yes/no, with rules of thumb that travel well across teams.

Rule of thumb. PSI below 0.1 means the distribution is stable; 0.1 to 0.2 means moderate change worth investigating; above 0.2 means a significant shift that often justifies retraining. Unlike a raw p-value — which, on enough data, will declare everything “significantly” drifted — PSI gives you an effect size you can threshold.

A monitor picks the appropriate test per feature and reports a magnitude you can alert on. The illustrative shape, with the test selected by column type:

import pandas as pd
from scipy.stats import ks_2samp


def feature_drift(reference: pd.Series, live: pd.Series) -> float:
    """Return a drift magnitude for one feature, test chosen by dtype.

    Continuous -> KS statistic; categorical -> a PSI-style divergence.
    A larger value means the live distribution has moved further from
    the training reference. Alert on magnitude, not just significance.
    """
    if pd.api.types.is_numeric_dtype(reference):
        # Two-sample Kolmogorov-Smirnov statistic: 0..1, higher = more drift.
        return ks_2samp(reference.dropna(), live.dropna()).statistic
    # population_stability_index is a helper this snippet assumes is defined
    # elsewhere; compare its result against the 0.1 / 0.2 PSI thresholds.
    return population_stability_index(reference, live)
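
The snippet above calls population_stability_index without defining it. One possible shape for that helper, as a minimal standard-library sketch: the floor value and the optional bin_edges parameter are assumptions made for illustration, not part of any particular library.

```python
import bisect
import math
from collections import Counter


def population_stability_index(reference, live, bin_edges=None, floor=1e-6):
    """PSI = sum over buckets of (live% - ref%) * ln(live% / ref%).

    Categorical values are bucketed by value. For a continuous feature,
    pass sorted bin_edges (for example, deciles of the reference).
    """
    def bucket(value):
        return value if bin_edges is None else bisect.bisect_right(bin_edges, value)

    ref_counts = Counter(bucket(v) for v in reference)
    live_counts = Counter(bucket(v) for v in live)
    ref_total = sum(ref_counts.values())
    live_total = sum(live_counts.values())

    psi = 0.0
    for key in set(ref_counts) | set(live_counts):
        # Floor empty buckets so the logarithm stays finite.
        ref_share = max(ref_counts[key] / ref_total, floor)
        live_share = max(live_counts[key] / live_total, floor)
        psi += (live_share - ref_share) * math.log(live_share / ref_share)
    return psi


reference = ["card"] * 50 + ["wallet"] * 50
print(population_stability_index(reference, ["card"] * 55 + ["wallet"] * 45))  # small shift, below 0.1
print(population_stability_index(reference, ["card"] * 90 + ["wallet"] * 10))  # large shift, above 0.2
```

Because every term is non-negative, PSI grows with the size of the shift rather than with the amount of data, which is what makes it usable as a threshold.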

A subtlety the per-feature view misses: drift can be multivariate — each feature looks individually fine while their joint distribution has moved, because correlations between features changed. The standard trick is a domain classifier: label training rows 0 and live rows 1, train a classifier to tell them apart, and look at its AUC. If a model can reliably distinguish “old” data from “new,” the distributions differ; an AUC near 0.5 means it cannot, so they are effectively the same.
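
The domain-classifier trick can be sketched without any ML library. In the sketch below a k-nearest-neighbour vote stands in for whatever classifier a team would actually use (gradient-boosted trees are a common choice), and the AUC is computed on a held-out half so the classifier is not scored on rows it memorized. The function names, k, and the synthetic data generator are all illustrative assumptions.

```python
import random


def pairwise_auc(positive_scores, negative_scores):
    """Probability that a live row outscores a reference row (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


def domain_classifier_auc(reference_rows, live_rows, k=15, seed=0):
    """Label reference rows 0 and live rows 1, then score a held-out half."""
    rng = random.Random(seed)
    labelled = [(row, 0) for row in reference_rows] + [(row, 1) for row in live_rows]
    rng.shuffle(labelled)
    half = len(labelled) // 2
    train, held_out = labelled[:half], labelled[half:]

    def live_score(row):
        # Fraction of the k nearest training rows that came from live traffic.
        nearest = sorted(
            train,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], row)),
        )[:k]
        return sum(label for _, label in nearest) / k

    live_scores = [live_score(row) for row, label in held_out if label == 1]
    ref_scores = [live_score(row) for row, label in held_out if label == 0]
    return pairwise_auc(live_scores, ref_scores)


def correlated_rows(n, rho, rng):
    """Synthetic data: two features, each standard normal, with correlation rho."""
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((z1, rho * z1 + (1 - rho ** 2) ** 0.5 * z2))
    return rows
```

Calling correlated_rows once with rho=0.8 and once with rho=-0.8 produces two datasets whose individual features have the same marginal distribution, so a per-feature KS test has nothing to find; only the joint view separates them.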

Prediction monitoring watches the model’s outputs without needing any labels. If a fraud model that historically flags 5% of transactions suddenly flags 15%, something has changed — the model, the input pipeline, or the world — and you want to know now, not when the chargebacks land. Two signals matter: the distribution of predicted classes (is the mix shifting?) and the confidence scores (are they collapsing toward uncertainty?). A sustained drop in average confidence frequently precedes an accuracy drop, which makes it one of the most useful early warnings you have.
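
A label-free prediction monitor can be as small as two sliding windows. The sketch below is illustrative only: the class name, the window size, and the two thresholds (a flag rate that doubles or halves, a ten-point confidence drop) are assumptions a real team would tune against its own baseline.

```python
from collections import deque


class PredictionMonitor:
    """Tracks flag rate and mean confidence over the most recent predictions."""

    def __init__(self, baseline_flag_rate, baseline_confidence,
                 window=1000, max_rate_ratio=2.0, max_confidence_drop=0.10):
        self.baseline_flag_rate = baseline_flag_rate
        self.baseline_confidence = baseline_confidence
        self.max_rate_ratio = max_rate_ratio
        self.max_confidence_drop = max_confidence_drop
        self._flags = deque(maxlen=window)
        self._confidences = deque(maxlen=window)

    def observe(self, flagged, confidence):
        """Record one prediction; return the names of any alerts now firing."""
        self._flags.append(1 if flagged else 0)
        self._confidences.append(confidence)
        if len(self._flags) < self._flags.maxlen:
            return []  # not enough data for a stable estimate yet

        alerts = []
        rate = sum(self._flags) / len(self._flags)
        mean_confidence = sum(self._confidences) / len(self._confidences)
        too_high = rate > self.baseline_flag_rate * self.max_rate_ratio
        too_low = rate < self.baseline_flag_rate / self.max_rate_ratio
        if too_high or too_low:
            alerts.append("flag_rate")
        if self.baseline_confidence - mean_confidence > self.max_confidence_drop:
            alerts.append("confidence")
        return alerts
```

With a 5% baseline and the default ratio, the monitor stays quiet at 5% and fires once the windowed flag rate passes 10%, which covers the jump to 15% described above.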

Outcome monitoring is the lagged truth: as labels arrive, compute real performance — accuracy, F1, AUC, RMSE — over a sliding window and compare it against the baseline the model launched with. The window matters because cumulative metrics average away recent decay; a thousand-prediction rolling window surfaces a problem the all-time average happily hides. When windowed performance falls below baseline by more than a threshold, the loop has its trigger.
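
The difference between a cumulative and a windowed metric is easy to see in a few lines. This is a minimal sketch; the class name, the window of 1,000, and the five-point tolerance are assumptions chosen for the illustration.

```python
from collections import deque


class WindowedAccuracy:
    """Keeps both the all-time accuracy and a rolling-window accuracy."""

    def __init__(self, baseline, window=1000, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self._recent = deque(maxlen=window)
        self._seen = 0
        self._correct = 0

    def record(self, prediction, label):
        hit = int(prediction == label)
        self._recent.append(hit)
        self._seen += 1
        self._correct += hit

    @property
    def cumulative(self):
        return self._correct / self._seen

    @property
    def rolling(self):
        return sum(self._recent) / len(self._recent)

    def degraded(self):
        """True once a full window sits below baseline by more than the tolerance."""
        full = len(self._recent) == self._recent.maxlen
        return full and self.rolling < self.baseline - self.tolerance


tracker = WindowedAccuracy(baseline=0.92)
for i in range(9000):            # nine thousand labels at 92% accuracy
    tracker.record(1 if i % 100 < 92 else 0, 1)
for i in range(1000):            # then a thousand at 70%
    tracker.record(1 if i % 100 < 70 else 0, 1)
print(tracker.cumulative, tracker.rolling, tracker.degraded())
```

After the last thousand labels the all-time figure has barely moved while the rolling figure has fallen to the recent rate, so only the windowed check trips.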

The discipline that ties these together is alerting that someone will actually act on, and the hard part is signal-to-noise. Alert on a bare p-value and, with enough traffic, you will page someone hourly for significant-but-meaningless wiggles until they mute the channel and miss the real fire. The fix: require both an effect size and significance (PSI over 0.2 and a low p-value, say), compare against a baseline rather than a magic absolute number, and scale monitoring intensity to the model’s stakes. A revenue or safety model earns real-time drift checks and auto-rollback; an internal tool earns a weekly digest.
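
To see why a bare p-value pages people for nothing, take a single binary feature and compare a large reference sample with a large live sample. The sketch below uses a pooled two-proportion z-test and a two-bucket PSI; the function names and the alpha of 0.01 are illustrative assumptions, and the shares must lie strictly between 0 and 1.

```python
import math


def two_proportion_p_value(ref_hits, ref_n, live_hits, live_n):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (ref_hits + live_hits) / (ref_n + live_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / ref_n + 1 / live_n))
    z = abs(ref_hits / ref_n - live_hits / live_n) / std_err
    return math.erfc(z / math.sqrt(2))


def binary_psi(ref_share, live_share):
    """PSI for a two-bucket feature; shares must be strictly inside (0, 1)."""
    buckets = [(live_share, ref_share), (1 - live_share, 1 - ref_share)]
    return sum((q - p) * math.log(q / p) for q, p in buckets)


def should_alert(ref_hits, ref_n, live_hits, live_n, psi_threshold=0.2, alpha=0.01):
    """Fire only when the shift is both statistically real and large."""
    p_value = two_proportion_p_value(ref_hits, ref_n, live_hits, live_n)
    psi = binary_psi(ref_hits / ref_n, live_hits / live_n)
    return psi > psi_threshold and p_value < alpha


# Half a point of movement on a million rows: tiny p-value, negligible PSI.
print(should_alert(500_000, 1_000_000, 505_000, 1_000_000))
# Twenty-five points of movement on a thousand rows: both gates open.
print(should_alert(500, 1000, 750, 1000))
```

The first case is exactly the significant-but-meaningless wiggle: the test is certain that something moved, and the effect size says it does not matter.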

War story: the model that decayed in silence

A fraud-detection model went live at 95% precision and was a quiet success — until, six months later, fraud losses spiked and an investigation found the model had been degrading the whole time. Fraud patterns had evolved (textbook concept drift), the model’s mapping had gone stale, and nothing was watching for it. The team monitored latency and error rate, both perfectly green, and had no input-drift detection, no prediction-distribution monitoring, and no windowed accuracy tracking against baseline. The decay was invisible to everything they looked at. The fix was not a better model but a better loop: drift detection on the inputs, distribution and confidence monitoring on the outputs, windowed accuracy on the lagged labels, and alerts gated on effect size so they meant something. A model nobody is watching is a model that is quietly failing; the only question is when you find out.

Online evaluation: A/B testing, shadow, and canary

You have detected decay and trained a replacement that scores better offline. Ship it? Not yet — because offline-better does not imply online-better, and the gap between them has burned many teams. Your test set is a frozen slice of past data, so it cannot show how users respond to the new model’s predictions, cannot capture feedback loops the new model creates, and cannot reflect the world as it is now rather than when the set was collected. The only way to know a model is better in production is to test it in production, under controls that isolate its effect.

The cleanest control is an A/B test: split comparable users at random, send group A to the current model (control) and group B to the new one (treatment), and compare a pre-chosen metric. Randomization is what makes the groups comparable, so any difference in outcomes is caused by the model and not by some confound. Three design decisions carry most of the weight.

First, the randomization unit must be the user, not the request: assign a user once and keep them on that variant, or the same person flips between models mid-session and you have measured noise. Deterministic hashing of the user ID gives you sticky, restart-surviving assignment for free.

Second, you must fix the sample size in advance from a power analysis: the smaller the effect you want to detect, the more users you need, and detecting a 5% lift takes far more traffic than a 20% one. This pins down how long to run, and it forbids the most common sin in experimentation: stopping the moment the result looks good.

Third, declare one primary metric and a set of guardrails before you start. The primary metric decides ship/no-ship; guardrails are the things that must not get worse even if the primary improves — latency, error rate, revenue, a fairness metric. A model that lifts click-through but tanks latency or quietly harms a user segment is not a win, and guardrails are how you catch that automatically.
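
The first two decisions are mechanical enough to sketch. Below, assign_variant hashes the user ID together with an experiment name so assignment is sticky and independent across experiments, and users_per_arm applies the standard normal-approximation formula for comparing two proportions. Both function names, the salt format, and the defaults (5% significance, 80% power) are assumptions for illustration.

```python
import hashlib
from math import ceil
from statistics import NormalDist


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministic, sticky assignment: same user and experiment, same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in each arm to detect the given relative lift."""
    treated_rate = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + treated_rate * (1 - treated_rate))
    effect = treated_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


print(assign_variant("user-42", "ranker-v2"))
print(users_per_arm(0.10, 0.05), users_per_arm(0.10, 0.20))
```

Because the required sample scales with the inverse square of the effect, the 5% lift needs well over ten times the users of the 20% lift at the same baseline. Dividing the required sample by daily eligible traffic gives the run length, fixed before the experiment starts.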

For ML specifically, two refinements matter. When both models can score the same users, a paired test (McNemar’s, for classification) is far more powerful than comparing two independent groups, because it controls for per-user difficulty. And the ground-truth lag strikes again: if your conversion signal takes a week to mature, analyze only mature cohorts or you will compare a fully-converted control against a half-converted treatment and conclude nonsense.
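
McNemar's test looks only at the cases where the two models disagree. A minimal sketch of the exact (binomial) form follows; the function name and the input format, two parallel lists of per-example correctness, are assumptions for illustration.

```python
from math import comb


def mcnemar_exact(control_correct, treatment_correct):
    """Exact two-sided McNemar test on paired per-example correctness.

    Returns (b, c, p_value), where b counts examples only the control got
    right and c counts examples only the treatment got right.
    """
    pairs = list(zip(control_correct, treatment_correct))
    b = sum(1 for ctrl, trt in pairs if ctrl and not trt)
    c = sum(1 for ctrl, trt in pairs if trt and not ctrl)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under the null, each disagreement is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# 100 scored examples: 75 where the models agree, 25 where they differ.
control = [True] * 60 + [False] * 15 + [True] * 5 + [False] * 20
treatment = [True] * 60 + [False] * 15 + [False] * 5 + [True] * 20
print(mcnemar_exact(control, treatment))
```

The 75 agreeing examples contribute nothing to the statistic, which is the sense in which the paired design controls for per-example difficulty.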

A/B testing is the gold standard, but it is not the first thing you reach for, because it does expose real users to the new model. The safe rollout climbs an exposure ladder. Shadow deployment runs the candidate alongside the champion on mirrored live traffic — it scores every real request, but its answers are logged and discarded, never shown to a user. This validates the model on production data with zero user risk: you confirm it doesn’t crash, that latency is acceptable, and that its prediction distribution looks sane, all before a single customer is affected. Canary is the next rung: route a small slice of real users — 1%, then 5% — to the new model, watch the guardrails, and widen only if they hold. A full A/B test is the top rung, committing enough traffic to measure the causal effect with statistical power. Shadow proves it works; canary proves it’s safe at small scale; A/B proves it’s better.
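
The defining property of a shadow deployment is that the candidate can neither change nor break what the user receives. A minimal sketch of that contract, with hypothetical names throughout (serve_with_shadow, the champion and candidate callables, and a list standing in for a real log sink); a production version would typically score the candidate asynchronously so it adds no latency to the request path.

```python
import time


def serve_with_shadow(request, champion, candidate, shadow_log):
    """Answer with the champion; score the candidate on the side and log it."""
    response = champion(request)
    try:
        started = time.perf_counter()
        shadow_prediction = candidate(request)
        shadow_log.append({
            "request": request,
            "champion": response,
            "candidate": shadow_prediction,
            "candidate_latency_ms": (time.perf_counter() - started) * 1000,
        })
    except Exception as exc:
        # A crashing candidate is a finding to record, never a user-facing error.
        shadow_log.append({"request": request, "champion": response, "error": repr(exc)})
    return response
```

The log then answers the three shadow-stage questions directly: the error entries show whether the candidate crashes, the latency field shows whether it fits the budget, and the candidate column gives its prediction distribution for comparison with the champion's.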

War story: the winner that wasn’t

A team watched their A/B dashboard every morning and shipped the new model the instant it crossed p < 0.05 — on day four of a planned two-week run. The lift evaporated the next quarter. They had peeked: checking a running experiment repeatedly and stopping at the first significant reading inflates the false-positive rate far above the nominal 5%, because with enough looks the metric will wander across the threshold by chance alone. The early crossing was noise dressed as signal. Had they instead fixed the sample size up front and analyzed once at the end — or used a sequential test designed for continuous monitoring, which spends its error budget across looks — they would have seen there was no real effect and kept the incumbent. The discipline is boring and non-negotiable: decide the stopping rule before you look, then honor it.
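
The inflation is easy to reproduce with a simulation of A/A tests, where both arms share the same true conversion rate and every "win" is by construction a false positive. This is a sketch under stated assumptions: 14 daily looks to mirror a two-week run, 100 users per arm per day, a 10% conversion rate, and a pooled z-test at each look. All of those numbers and names are illustrative.

```python
import random
from statistics import NormalDist

Z_CRITICAL = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def looks_significant(conv_a, conv_b, n_per_arm):
    """Pooled two-proportion z-test with equal arm sizes."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    std_err = (pooled * (1 - pooled) * 2 / n_per_arm) ** 0.5
    if std_err == 0:
        return False
    return abs(conv_a - conv_b) / n_per_arm / std_err > Z_CRITICAL


def false_positive_rates(experiments=400, looks=14, users_per_look=100,
                         rate=0.10, seed=7):
    """Return (rate when stopping at first crossing, rate when testing once at the end)."""
    rng = random.Random(seed)
    stopped_early = significant_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        crossed = significant = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant = looks_significant(conv_a, conv_b, n)
            crossed = crossed or significant
        stopped_early += crossed              # shipped at the first p < 0.05
        significant_at_end += significant     # analyzed once, on the last day
    return stopped_early / experiments, significant_at_end / experiments


peeking, fixed = false_positive_rates()
print(f"peeking: {peeking:.2f}  fixed horizon: {fixed:.2f}")
```

The fixed-horizon rate stays near the nominal 5%, while the stop-at-first-crossing rate comes out several times higher, even though no experiment in the simulation contains a real effect.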

Model optimization: trading accuracy for latency and cost

A model can be accurate and still be unshippable. If it cannot answer inside the latency budget, or if serving it at scale costs more than the feature earns, the best test-set number in the world is academic. Optimization is the deliberate move to relax the binding constraint — latency, memory, or cost — by spending a small, bounded amount of accuracy. Recall the triangle: you trade down one corner to win on another, on purpose, with the loss measured rather than discovered in the wild.

Quantization is usually the highest-leverage first move. Models train in 32-bit floating point, but inference rarely needs that precision; storing and computing weights in 16-bit (FP16) or 8-bit integers (INT8) shrinks the model roughly 2–4x and speeds up inference correspondingly, often for under 1–2% accuracy loss. The cheapest win, FP16 on a modern GPU, frequently costs essentially nothing. INT8 squeezes harder and needs calibration — running representative data through the model to learn the right value ranges before mapping the float range onto 256 integer levels; skip the calibration data and the loss balloons.
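
What calibration buys can be shown on a plain list of numbers standing in for a layer's activations. The sketch below uses the symmetric INT8 variant (codes from -127 to 127, one scale factor); the function names and the deliberately bad guessed range of plus or minus 100 are assumptions for the illustration, not a description of any particular toolkit.

```python
import random


def calibrate_scale(calibration_values):
    """Symmetric INT8: map the largest observed magnitude onto code 127."""
    return max(abs(v) for v in calibration_values) / 127


def quantize(values, scale):
    """Round to the nearest integer code and clip to the INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    return [code * scale for code in codes]


def mean_abs_error(values, scale):
    restored = dequantize(quantize(values, scale), scale)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


rng = random.Random(0)
calibration = [rng.gauss(0, 1) for _ in range(1000)]   # representative sample
live = [rng.gauss(0, 1) for _ in range(1000)]          # what the model sees later

calibrated = mean_abs_error(live, calibrate_scale(calibration))
guessed = mean_abs_error(live, 100 / 127)              # range guessed, not measured
print(f"calibrated: {calibrated:.4f}  guessed: {guessed:.4f}")
```

With a measured range, the integer codes are spread across the values that actually occur; with a guessed range that is far too wide, almost every value lands on one of three codes and the reconstruction error grows by more than an order of magnitude.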

Pruning removes weights or whole structures the model barely uses — most networks are over-parameterized enough to lose a large fraction of their weights. The non-negotiable rule is prune, then fine-tune: cutting weights and deploying immediately tanks accuracy, but pruning gradually and retraining to recover between rounds reclaims most of it. The speedup depends on the kind: structured pruning (removing whole channels or layers) helps any hardware; unstructured pruning (zeroing individual weights) only helps hardware that exploits sparsity.
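
The prune-then-fine-tune rhythm can be sketched on a flat list of weights. This is unstructured magnitude pruning, the simplest variant; the function names, the sparsity schedule, and the fine_tune callback are illustrative assumptions, and the callback is expected to leave already-pruned weights at zero.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    target = int(len(weights) * sparsity)
    if target == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[target - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < target:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


def prune_gradually(weights, schedule, fine_tune):
    """Alternate pruning rounds with recovery training.

    `fine_tune` is a caller-supplied function (hypothetical here) that
    retrains the surviving weights and keeps pruned ones at zero.
    """
    for sparsity in schedule:
        weights = magnitude_prune(weights, sparsity)
        weights = fine_tune(weights)
    return weights


layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, 0.1, -0.3]
# Identity stands in for real recovery training in this toy example.
print(prune_gradually(layer, schedule=[0.2, 0.4, 0.5], fine_tune=lambda w: w))
```

Zeroing individual weights like this shrinks the model on disk only if it is stored sparsely, and speeds it up only on hardware that skips zeros, which is the limitation of unstructured pruning noted above.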

Distillation is the most dramatic and the most work. Train a small “student” to mimic a large “teacher” — not just its final labels but its full output distribution, which carries far richer signal than a hard label. A well-distilled student can be many times smaller and faster while keeping most of the teacher’s accuracy, which is why it is the workhorse for edge and mobile deployment where the teacher will not fit at all.
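
The "full output distribution" idea comes down to a loss with two terms: match the teacher's temperature-softened probabilities, and still get the hard label right. A minimal single-example sketch in the spirit of Hinton et al.; the temperature of 4 and the 0.7 weighting are illustrative assumptions, not recommended settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[hard_label])
    # T^2 keeps the soft term's gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


teacher = [5.0, 2.0, 0.0]
print(softmax(teacher, 1.0))   # sharp: nearly all mass on class 0
print(softmax(teacher, 4.0))   # softened: the ranking of the other classes is visible
```

The softened distribution is the richer signal: it tells the student not only that class 0 is right, but that class 1 is a more plausible mistake than class 2.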

The one rule that governs all three: always validate after optimizing, and validate per segment, not just in aggregate. Headline accuracy can hold while a subgroup quietly collapses.
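
A per-segment check is a small amount of code for the protection it buys. The sketch below assumes evaluation rows of the form (segment, label, prediction) and a two-point tolerance; both the row format and the threshold are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_segment(rows):
    """rows: iterable of (segment, label, prediction) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for segment, label, prediction in rows:
        total[segment] += 1
        correct[segment] += int(label == prediction)
    return {segment: correct[segment] / total[segment] for segment in total}


def overall_accuracy(rows):
    rows = list(rows)
    return sum(int(label == prediction) for _, label, prediction in rows) / len(rows)


def segment_regressions(baseline_rows, optimized_rows, max_drop=0.02):
    """Segments where the optimized model lost more than max_drop accuracy."""
    before = accuracy_by_segment(baseline_rows)
    after = accuracy_by_segment(optimized_rows)
    return {
        segment: (before[segment], after[segment])
        for segment in before
        if segment in after and before[segment] - after[segment] > max_drop
    }


def make_rows(segment, n, n_correct):
    """Toy data: label is always 1; the first n_correct predictions are right."""
    return [(segment, 1, 1 if i < n_correct else 0) for i in range(n)]


baseline = make_rows("A", 1900, 1805) + make_rows("B", 100, 90)
optimized = make_rows("A", 1900, 1805) + make_rows("B", 100, 70)
print(overall_accuracy(baseline) - overall_accuracy(optimized))  # headline drop: one point
print(segment_regressions(baseline, optimized))                  # segment B lost twenty
```

In this toy data the headline number moves by a single point, inside the tolerance, while the small segment loses twenty; an aggregate-only gate would have passed the optimized model.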

Note

The quantization and distillation of large language models — GPTQ, AWQ, KV-cache tricks, and the rest of the LLM serving and evaluation stack — is the domain of the companion AI Engineering book. This chapter covers the general model-optimization techniques; the LLM-specific machinery lives there.

Build it → Model optimization made concrete: the AI Benchmark Suite (Project 49) measures the accuracy-versus-latency trade-offs that decide whether an optimization was worth it — the per-segment, percentile-aware validation this section insists on.

The retraining loop

Now close the circuit. Monitoring detects decay; online evaluation validates a fix; the two join into a loop that keeps a model healthy as the world moves.

A drift or quality alert — input drift past threshold, predictions shifting, or lagged accuracy falling below baseline — is the trigger. It kicks off retraining on fresh data that includes the new reality, producing a candidate. The candidate passes offline evaluation first as a cheap gate (if it can’t beat the incumbent on held-out data, stop here), and this is where optimization belongs: quantize or distill to fit the latency budget, then re-validate, before spending live traffic on it. The surviving candidate climbs the exposure ladder — shadow to prove it runs cleanly on production data, canary to prove it’s safe on a small slice, A/B to prove it’s genuinely better — and only then is it promoted to champion. If any stage fails its guardrails, you roll back and keep the incumbent. Then the loop runs again, because the world will drift again.
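
The control flow of that paragraph is a chain of gates that stops at the first failure, so that a candidate never reaches a more exposed stage than it has earned. A minimal sketch, with hypothetical stage names and placeholder gate functions standing in for the real checks:

```python
def run_promotion(stages):
    """Run (name, gate) pairs in order; stop and roll back at the first failure."""
    passed = []
    for name, gate in stages:
        if not gate():
            return {"decision": "rollback", "failed_stage": name, "passed": passed}
        passed.append(name)
    return {"decision": "promote", "failed_stage": None, "passed": passed}


# Placeholder gates: each would wrap a real check in practice.
stages = [
    ("offline_eval", lambda: True),         # beats the incumbent on held-out data
    ("optimize_and_revalidate", lambda: True),  # fits the latency budget, accuracy bounded
    ("shadow", lambda: True),               # runs cleanly on mirrored traffic
    ("canary", lambda: False),              # guardrails breached on the small slice
    ("ab_test", lambda: True),              # never reached in this example
]
print(run_promotion(stages))
```

Ordering the gates from cheapest to most exposed is the point of the design: a failure at the canary stage costs a small slice of traffic, and the A/B gate is never evaluated.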

This is the same shape as CI/CD for software — build, test, stage, promote, roll back — but with the ML-specific twist that the trigger is drift in the world rather than a code commit, and the test of record runs on live traffic rather than in a sandbox.

Build it → The orchestration that runs this loop: ML Training Orchestrator (Project 04) implements the retrain → evaluate → promote pipeline, and the Data Observability platform (Project 09) implements the drift and distribution monitoring that fires the trigger.


Practical exercise

Difficulty: Level I · Level II · Level III

  1. Level I — Watch a deployed model. For a model you have access to (or a simulated one), set up two monitors with no dependence on labels: an input monitor that computes a drift magnitude per feature against the training reference (KS for continuous, PSI for categorical), and a prediction monitor that tracks the predicted-class mix and average confidence over a sliding window. Define one drift alert with an explicit, justified threshold, and explain why you chose effect size plus significance rather than a bare p-value.
  2. Level II — Validate a new model on live traffic. Design the rollout for a candidate that beat the incumbent offline. Specify the shadow deployment (what you log and what you check before any user is exposed), then the A/B test: randomization unit, one primary metric, at least two guardrail metrics, the sample size from a power analysis, and a fixed stopping rule. Then write the paragraph you would give a skeptical stakeholder explaining why the offline win was not enough and what live traffic told you that the test set could not.
  3. Level III — Design the closed loop. Specify the full circuit end to end: the drift signals that trigger retraining and their thresholds; the retraining step; the offline evaluation gate; where model optimization (quantization or distillation) fits and how you keep its accuracy loss bounded and within the latency budget; the staged validation (shadow → canary → A/B) with the guardrails that gate each promotion; and the rollback path. State, for each stage, what failure looks like and what the loop does about it.

Summary

Production ML differs from production software in one decisive way: a model can be perfectly healthy by every system metric — fast, error-free, never crashing — and still be wrong, because its accuracy is conditional on the world matching its training data, and the world drifts. Code doesn’t rot if you leave it alone; models rot precisely because you do. So you monitor the model’s behavior, not just the server’s: input distributions (data drift, detectable now without labels), prediction distributions (early warning, also label-free), and lagged ground-truth accuracy (the truth, arriving last). When decay is detected, you ship the fix up an exposure ladder — shadow, then canary, then A/B test — because offline metrics don’t guarantee online wins, and live traffic is the only honest judge. Along the way you optimize the model — quantization, pruning, distillation — to meet latency and cost, trading a small, measured slice of accuracy for a large win. The pieces close into a retraining loop: detect, retrain, evaluate, validate safely, promote or roll back, repeat. Production ML is a feedback loop, not a launch.

Key takeaways

  • A model can be healthy by system metrics and wrong by world drift; monitor the model’s behavior, not only the service’s health.
  • Drift has kinds, and they dictate detection: data and label drift live in distributions you can watch now; concept drift hides until lagged labels arrive.
  • Layer monitoring by how fast the signal comes: inputs (earliest, label-free), predictions (early, label-free), outcomes (the truth, but lagged) — and gate alerts on effect size, not bare p-values.
  • Offline-better is not online-better; validate on live traffic up the exposure ladder (shadow → canary → A/B), with a user-level randomization unit, a pre-fixed sample size, one primary metric, and guardrails. Never peek.
  • Optimization trades a bounded, validated accuracy loss for latency and cost (quantize, prune + fine-tune, distill) — and you check per segment, not just the headline number.

Connections to other chapters

  • ML Systems (prerequisite): monitoring and retraining ride directly on the serving infrastructure, feature pipeline, and experiment tracking established there — this chapter is what keeps that system correct once it is live, rather than merely up.
  • Observability (Part IV): ML monitoring is a specialization of general observability. The metrics, logs, traces, and alerting machinery come from there; this chapter extends them from “is the service healthy?” to “is the model right?” — a question the standard telemetry cannot answer.
  • CI/CD (Part IV): the retraining loop is continuous delivery for models. The build–test–stage–promote–rollback shape is identical; what changes is that the trigger is drift in the world rather than a code commit, and the test of record runs on live traffic.
  • Data Quality and Testing (Part II): drift is fundamentally a data-quality problem surfacing at the model boundary. The distribution checks, schema validation, and data-contract thinking from there are the same discipline applied to a model’s inputs; input-drift monitoring is data-quality monitoring with the model as the consumer.
  • A note on scope: the serving, optimization, and evaluation of large language models — quantization schemes like GPTQ/AWQ, LLM-as-judge evaluation, prompt and RAG testing — belong to the companion AI Engineering book, which treats them at the depth they deserve. This chapter deliberately stops at the general case.

Further reading

Essential

  • Chip Huyen, Designing Machine Learning Systems — the chapters on monitoring, data and concept drift, and experimentation are the closest single reference to this chapter’s whole arc.
  • Kohavi, Tang, and Xu, Trustworthy Online Controlled Experiments — the definitive practical guide to A/B testing at scale: power, guardrails, SRM, peeking, and the organizational discipline experiments require.

Deep dives

  • Han, Mao, and Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding” — the foundational paper tying pruning and quantization together.
  • Hinton, Vinyals, and Dean, “Distilling the Knowledge in a Neural Network” — the original knowledge-distillation paper, and the source of the “learn the teacher’s soft outputs” idea.

Historical context

  • Gama et al., “A Survey on Concept Drift Adaptation” — a thorough map of the drift-detection literature and the vocabulary (data vs concept vs label drift) this chapter uses.
  • Sculley et al., “Hidden Technical Debt in Machine Learning Systems” — the paper that named the maintenance burden of production ML, and the reason a chapter like this one has to exist.