ML Systems: Tracking, Features & Serving

Keywords

mlops, experiment tracking, feature store, model serving, model registry, training-serving skew, reproducibility, online/offline features, inference

Introduction

The model worked. In the notebook it cleared the accuracy bar everyone had been chasing for a month — a clean validation curve, a confusion matrix the product team loved, a winning run the author screenshotted and pasted into Slack. Then the trouble started, and none of it was about the math.

First, nobody could reproduce the winning run. The author had tweaked the learning rate, swapped a feature, and re-pulled the dataset somewhere in a forty-cell notebook, and the exact combination that produced the magic number was gone. The metrics had been logged; the inputs that produced them had not — not the data snapshot, not the git SHA, not the seed. The best result the team had ever seen was, in the end, an anecdote.

Second, the feature that looked so good in training behaved differently in production. Offline, a “30-day spend” feature was computed in pandas over a historical table. Online, the serving service recomputed it in hand-written SQL against the live database — and the two drifted by a rounding convention and a timezone nobody noticed. The model was trained on one number and served another. Accuracy didn’t crash; it sagged, quietly, for weeks, until someone finally diffed the two code paths. This is training-serving skew, and it is the most expensive bug in machine learning precisely because it is silent.

Third, shipping the model was a white-knuckle ritual. There was no registry, no staged promotion, no rollback. Deploying meant someone SCP’d a pickle file onto a box, restarted the service, and watched the dashboards with their stomach in a knot. When the new model misbehaved, “rolling back” meant finding the previous pickle — if it still existed.

A trained model is not a product. The notebook artifact that scores well on a held-out split is the beginning, not the end. The gap between that artifact and a served, reproducible, maintainable model — one you can trace to its inputs, serve consistently to every caller, and replace without fear — is ML systems engineering. This chapter is about the three systems that close that gap.

The Core Insight

Three pieces of infrastructure turn a model into a reliable product, and each one answers a question the notebook left unanswered.

  1. Experiment tracking and a model registry answer “how did we get this model, and which one is in production?” Every model becomes traceable back to the data, code, parameters, and metrics that produced it. The winning run stops being an anecdote and becomes a record you can re-run.
  2. A feature store answers “are training and serving looking at the same numbers?” One definition of each feature is computed once and served consistently to both training and inference — which is what kills training-serving skew at the source rather than chasing it after the fact.
  3. Model serving answers “how do callers get predictions?” The model goes behind an API and inherits the full discipline of any production service: latency budgets, throughput targets, autoscaling, health checks, versioning.

The unifying theme is simple and demanding: ML in production needs the same rigor as software, plus the data-and-model-specific machinery that ordinary software never needed. A web service has code and config to version; an ML system also has data and models to version, features to keep consistent across two very different access patterns, and a non-deterministic artifact whose quality can drift even when the code is frozen. The three systems above are the machinery for exactly those extra problems.

Scope: systems, not the LLM evaluation lifecycle

This chapter is the platform view of MLOps — experiment tracking, feature stores, and serving infrastructure. The broader MLOps lifecycle and evaluation discipline — offline/online eval design, LLM judges, drift-driven retraining policy, prompt and model evaluation harnesses — overlaps a separate body of work and is treated as its own subject. Here we stay on the systems that move a model from a trained artifact to a served, reproducible one.

A mental model

Three analogies, one per system, make the whole chapter portable.

Experiment tracking is version control for experiments. Git versions code; it has nothing to say about which learning rate produced which accuracy on which dataset snapshot. Experiment tracking versions the whole experiment — params, code SHA, data version, metrics, and artifacts together — so “which change helped?” becomes a query against an immutable record rather than a test of someone’s memory. A run is a row you can sort, filter, and re-run.

A feature store is a shared, consistent feature API that spans two worlds. It is one catalog of feature definitions with two faces: an offline face for training (bulk, historical, point-in-time correct) and an online face for serving (single-entity, low-latency). The point is not the storage; it is that both faces are computed from the same definition, so the feature a model learns from and the feature it predicts on are the same feature.

Serving means your model is now a latency-bound microservice. The moment a model is behind an API, it stops being math and becomes a service with a p99, a queue, a warm pool, and a deploy pipeline. The model is rarely the bottleneck; the request path around it — preprocessing, batching, model loading, postprocessing — is where the incidents live.

The decisions

These systems come with real choices, and getting them wrong is expensive. A few decision frames to carry through the chapter:

  • What to track. At minimum: parameters, metrics (over steps, not just final values), artifacts, and the data version and code SHA. The first three are easy and the last two are the ones teams skip — and the ones that make a run reproducible.
  • Build vs. buy a feature store. A single model with a handful of features computed inside the serving app does not need a feature store; you would be paying real operational overhead to solve a problem you don’t have. Reach for one when you have training-serving skew bugs, features shared across multiple models, a need for point-in-time correctness, or a low-latency online requirement. Below that bar, direct queries are honest.
  • Online vs. offline features. Offline features feed training and batch scoring from a warehouse; online features feed real-time inference from a low-latency store. Most features need both faces — and the materialization that keeps the online face fresh is a first-class concern, not an afterthought.
  • Serving pattern. Online/real-time serving optimizes for latency (one request, answered now); batch serving optimizes for throughput (millions of rows, scored offline). The latency/throughput tradeoff, and where batching sits on it, is the central serving decision — see Figure 38.1 for how training and serving read from the two feature faces.

What you’ll learn

  • How experiment tracking turns a training run into an immutable, reproducible record — and why logging the data version and code SHA matters as much as the metrics
  • How a model registry promotes models through stages (champion/challenger) so the model in production is always a known, traceable version
  • What training-serving skew is, why it degrades production silently, and how one feature definition served to both training and inference eliminates it
  • What point-in-time correctness means and why getting it wrong inflates your offline metrics with future data
  • How online and offline feature stores differ in access pattern, and what materialization does between them
  • Why a served model is a latency-bound microservice, and where the latency actually goes (loading, batching, pre/post-processing — rarely the model math)
  • The difference between online and batch inference, and how dynamic batching trades a little latency for a lot of throughput
  • How MLOps glues these together into automated retrain/evaluate/deploy pipelines with a safe rollout and rollback path

Prerequisites

  • Deep Learning — you should be comfortable with the training loop, what a model artifact is, and metrics like loss and accuracy over steps.
  • The Data Engineering Landscape — feature stores sit on top of the data platform; warehouses, batch pipelines, and streaming all reappear here as the substrate features are computed from.
  • Comfort with HTTP services and the basic shape of a request/response API, since serving is web engineering with model-shaped constraints.

Experiment tracking and the model registry

The cheapest insurance in machine learning is also the one teams most often skip: recording how each model came to be. The core abstraction is the run — one training execution, captured as an immutable, comparable record. A run is not just its metrics; it is the bundle of inputs and outputs that, taken together, let you answer “what produced this number?” and, crucially, re-run it.

A useful run captures five things: the parameters (learning rate, batch size, architecture), the code (a git SHA and a clean/dirty flag), the data (a dataset version or content hash), the metrics (loss and validation accuracy logged over steps, so you have a curve and not a single endpoint), and the artifacts (the checkpoint, the plots, the preprocessing pipeline). The first time you need to compare run A against run B to find what moved a metric, the value of having logged all five becomes obvious — and the first time you’ve logged only metrics, the cost of skipping the other four becomes a war story.

In practice this is a thin wrapper around your training loop. A tracking library opens a run, you log the inputs up front and the metrics as they stream, and the library writes an immutable record you can query later.

import subprocess

import mlflow


def git_sha() -> str:
    """Commit SHA of the checked-out code."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


# config, dataset, loader, and train_step are defined by your training script.
# One run = one immutable record: inputs logged up front, metrics over steps.
with mlflow.start_run(run_name="bert-base"):
    mlflow.log_params(config)                       # lr, batch_size, ...
    mlflow.set_tag("git_commit", git_sha())         # code version
    mlflow.log_param("data_version", dataset.hash)  # the input nobody logs
    for step, batch in enumerate(loader):
        loss = train_step(batch)
        mlflow.log_metric("train_loss", float(loss), step=step)  # a curve, not a point
    mlflow.log_artifact("model.pt")                 # the checkpoint itself

The two lines that separate a reproducible run from an anecdote are the code version and the data version. Metrics tell you what happened; the params, SHA, and data hash tell you why, and let you make it happen again. Logging metrics with a step is the other quiet discipline: a single final number cannot show you that a run was still improving when it stopped, or that it diverged and recovered — the curve can.
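To make the "inputs tell you why" point concrete, here is a minimal sketch of comparing two runs by their logged inputs. The RunRecord class and diff_inputs function are hypothetical stand-ins for whatever your tracking tool stores and queries, and the SHAs, hashes, and metric values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical stand-in for a tracked run: inputs plus final metrics."""
    run_id: str
    git_sha: str
    data_version: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


def diff_inputs(a: RunRecord, b: RunRecord) -> dict:
    """Return every input that differs between two runs as {name: (a_value, b_value)}."""
    changed = {}
    if a.git_sha != b.git_sha:
        changed["git_sha"] = (a.git_sha, b.git_sha)
    if a.data_version != b.data_version:
        changed["data_version"] = (a.data_version, b.data_version)
    for key in sorted(set(a.params) | set(b.params)):
        left, right = a.params.get(key), b.params.get(key)
        if left != right:
            changed[f"params.{key}"] = (left, right)
    return changed


# Illustrative values only.
run_a = RunRecord("a", "3f2c1e9", "sha256:aa11", {"lr": 3e-4, "batch_size": 32}, {"val_acc": 0.91})
run_b = RunRecord("b", "3f2c1e9", "sha256:bb22", {"lr": 1e-4, "batch_size": 32}, {"val_acc": 0.94})

print(diff_inputs(run_a, run_b))
```

Here two inputs changed at once (the learning rate and the data snapshot), so the metric gain cannot be attributed to either alone. Without the logged data version, the diff would show only the learning rate and the comparison would quietly mislead.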

Tracking answers “how did we get this model.” The model registry answers the next question: “which model is in production, and how do we move a new one there safely?” A registry is a versioned catalog of trained models with a promotion workflow layered on top. Rather than juggling files named model_final_v2_real.pkl, you register a model, get a version number, and attach aliases that name roles rather than versions — a challenger under evaluation and a champion in production. Promotion is reassigning an alias; rollback is reassigning it back.

from mlflow import MlflowClient
client = MlflowClient()

# Promotion is an alias move — and so is rollback.
client.set_registered_model_alias("sentiment", "challenger", version=7)
# ...after the challenger passes evaluation:
client.set_registered_model_alias("sentiment", "champion", version=7)

This indirection is what makes deploys boring in the good way. The serving layer loads models:/sentiment@champion and never hard-codes a version; shipping a new model is a one-line alias move that the serving layer picks up, and rolling back is the same move in reverse. The scary SCP-a-pickle ritual from the introduction is replaced by a tracked, reversible state change. (The older stage model, with its Staging, Production, and Archived stages, is deprecated in current MLflow in favor of these aliases; if you see stage-based code, treat it as legacy.)
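The alias indirection is small enough to sketch in plain Python. The toy registry below is not MLflow's implementation; the AliasRegistry class, its method names, and the artifact URIs are all hypothetical, chosen only to show that promotion and rollback are the same pointer move.

```python
class AliasRegistry:
    """Toy in-memory registry: versions are immutable, aliases are movable pointers."""

    def __init__(self):
        self._versions = {}   # version number -> artifact URI
        self._aliases = {}    # alias name -> version number
        self._history = {}    # alias name -> versions it pointed at before

    def register(self, artifact_uri: str) -> int:
        version = len(self._versions) + 1
        self._versions[version] = artifact_uri
        return version

    def set_alias(self, alias: str, version: int) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        if alias in self._aliases:
            self._history.setdefault(alias, []).append(self._aliases[alias])
        self._aliases[alias] = version

    def rollback(self, alias: str) -> int:
        """Point the alias back at the version it held before the last move."""
        previous = self._history[alias].pop()
        self._aliases[alias] = previous
        return previous

    def resolve(self, alias: str) -> str:
        """What the serving layer calls: a role in, an artifact URI out."""
        return self._versions[self._aliases[alias]]


registry = AliasRegistry()
v1 = registry.register("s3://models/sentiment/1")   # hypothetical URIs
v2 = registry.register("s3://models/sentiment/2")
registry.set_alias("champion", v1)
registry.set_alias("champion", v2)      # promotion
registry.rollback("champion")           # rollback: the same kind of move, reversed
```

The serving code only ever calls resolve("champion"), so neither the promotion nor the rollback touches it. Because versions are never overwritten, the previous artifact is always still there to roll back to.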

Build it → A working registry-backed promotion and routing layer: Project 29: Model Routing Layer routes traffic between champion and challenger models behind one serving endpoint.

The feature store

The feature store is the centerpiece of this chapter because it solves the most insidious failure in the introduction: training-serving skew. Recall the shape of that bug. Offline, the “30-day spend” feature was computed one way (pandas over a historical table); online, it was recomputed another way (hand-written SQL against the live DB). The two code paths drifted, the model was trained on one number and served another, and accuracy sagged for weeks before anyone noticed. The defining property of this bug is that nothing throws — the system runs, the model returns predictions, and the only symptom is a metric that’s quietly worse than it should be.

The root cause is having two definitions of one feature. The fix is to have exactly one. A feature store is, at its core, a catalog of feature definitions — one source of truth — that is computed and then served two ways: an offline store (a warehouse or columnar files, optimized for bulk historical retrieval) feeds training and batch scoring, and an online store (a low-latency key-value store like Redis, optimized for single-entity lookups under 10ms) feeds real-time inference. Both faces are populated from the same definition, so the feature a model learns from and the feature it predicts on are guaranteed to be the same feature. Figure 38.1 shows the whole arrangement: one definition fanning out into two stores, training reading the offline face, the serving API reading the online face.

In code, the consequence is that training and serving make different calls — bulk historical retrieval versus single-entity lookup — but against the same feature definitions. That shared definition is the entire point.

from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a Feast feature repo in the current directory

# Training: bulk, historical, joined to labels at their event times.
training_df = store.get_historical_features(
    entity_df=labels_df,                              # entities + event_timestamp
    features=["customer_stats:spend_30d", "customer_stats:order_count_30d"],
).to_df()

# Serving: one entity, low latency, against the SAME feature definitions.
features = store.get_online_features(
    features=["customer_stats:spend_30d", "customer_stats:order_count_30d"],
    entity_rows=[{"customer_id": 1001}],
).to_dict()

The offline path carries a second subtlety that is just as easy to get wrong: point-in-time correctness. Training data pairs a label with the features as they existed at the moment the label was assigned — not as they exist today. If you join today’s “30-day spend” onto a label from three weeks ago, you’ve leaked the future into the past: the model trains on information it could not have had at prediction time, your offline metrics look fantastic, and production is a disappointment. A feature store does the point-in-time join for you: given an entity and an event timestamp, it returns the feature values as of that timestamp, never after it. This is why historical retrieval takes an entity dataframe with timestamps rather than a plain list of IDs — the timestamp is what makes the join honest.
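The "as of" join is easier to trust once you have seen it in miniature. The sketch below is a plain-Python illustration, not how any particular feature store implements it; the function names, the spend history, and the label row are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def value_as_of(history, event_time):
    """Latest feature value with timestamp <= event_time, or None if none exists.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx else None


def point_in_time_join(labels, feature_history):
    """Attach to each label the feature value that was known at the label's event time."""
    rows = []
    for entity_id, event_time, label in labels:
        value = value_as_of(feature_history.get(entity_id, []), event_time)
        rows.append({"entity_id": entity_id, "event_time": event_time,
                     "spend_30d": value, "label": label})
    return rows


# Illustrative data: three snapshots of one customer's 30-day spend.
spend_history = {
    1001: [(datetime(2024, 1, 1), 120.0),
           (datetime(2024, 1, 15), 340.0),
           (datetime(2024, 2, 1), 910.0)],   # computed AFTER the label below
}
labels = [(1001, datetime(2024, 1, 20), 1)]

rows = point_in_time_join(labels, spend_history)
```

The label dated January 20 is paired with the January 15 value (340.0). A naive join on customer_id alone would have picked up the latest value, 910.0, which did not exist when the label was assigned. That is the leak.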

The bridge between the two stores is materialization: the job that computes features and pushes the latest values into the online store so real-time lookups find fresh data. It is not a detail — a model serving stale features because a materialization job silently failed is a production incident, so feature freshness (and its TTL, the window a value stays valid) deserves the same monitoring as any other pipeline. Where sub-second freshness matters, the same definition can be fed from a stream instead of a batch job — but the principle is unchanged: one definition, kept consistent across both faces.
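A rough sketch of what "monitor freshness" can mean in practice follows. The OnlineStore class is a toy, not a real store's API; its method names and the injected clock are assumptions made so the example can simulate a stalled materialization job without actually waiting.

```python
import time


class OnlineStore:
    """Toy key-value online store that remembers when each value was written."""

    def __init__(self, ttl_seconds: float, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._rows = {}  # entity_id -> (value, written_at)

    def materialize(self, offline_rows: dict) -> None:
        """Push the latest offline values into the online store."""
        now = self._clock()
        for entity_id, value in offline_rows.items():
            self._rows[entity_id] = (value, now)

    def get(self, entity_id):
        """Return (value, is_fresh). A missing or expired value comes back as (None, False)."""
        if entity_id not in self._rows:
            return None, False
        value, written_at = self._rows[entity_id]
        if self._clock() - written_at > self.ttl_seconds:
            return None, False
        return value, True

    def max_staleness(self) -> float:
        """Age in seconds of the oldest row: a number to alert on when materialization stalls."""
        now = self._clock()
        return max((now - written_at for _, written_at in self._rows.values()), default=0.0)


fake_now = [1_000.0]                      # controllable clock for the demo
store = OnlineStore(ttl_seconds=3600, clock=lambda: fake_now[0])
store.materialize({1001: 340.0})
fake_now[0] += 7200                       # two hours pass and no materialization job runs
```

After the simulated two hours, store.get(1001) refuses to return the expired value and max_staleness() reports 7200 seconds. The design choice worth noticing is that expiry is explicit: the caller learns the feature is stale and can fall back or alert, rather than silently serving a two-hour-old number.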

War story: the one-term difference that cost three points of fraud-catch rate

A team shipped a fraud model whose offline features were computed in a nightly pandas pipeline and whose online features were recomputed in the serving service’s own SQL. The two looked identical in code review. They were not: the offline path applied log(amount + 1) and the online path applied log(amount), a difference of a few characters that a reviewer’s eye slid right over. Online fraud-catch rate trailed the offline evaluation by about three points, steadily, for two months. Nothing errored; dashboards were green; the model “worked.” It surfaced only when an engineer, chasing an unrelated bug, dumped the same transaction through both paths and saw two different feature vectors. The fix was not a smarter model; it was deleting the second definition. One feature definition, served to both training and inference, is the whole reason the feature store exists; the moment a feature has two implementations, you are one rounding convention away from this incident.
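The check that finally exposed the bug, running the same inputs through both paths and comparing, can be automated. The sketch below is one hedged way to do it; the function names and sample amounts are hypothetical, and the two feature functions simply mirror the transforms described in the story.

```python
import math


def offline_amount_feature(amount: float) -> float:
    return math.log1p(amount)        # log(amount + 1), as in the nightly pipeline


def online_amount_feature(amount: float) -> float:
    return math.log(amount)          # log(amount): the drifted second definition


def find_skew(samples, offline_fn, online_fn, tol=1e-9):
    """Return (input, offline_value, online_value) for every sample where the paths disagree."""
    mismatches = []
    for x in samples:
        off, on = offline_fn(x), online_fn(x)
        if not math.isclose(off, on, rel_tol=tol, abs_tol=tol):
            mismatches.append((x, off, on))
    return mismatches


samples = [1.0, 25.0, 4000.0]        # illustrative transaction amounts
skewed = find_skew(samples, offline_amount_feature, online_amount_feature)
clean = find_skew(samples, offline_amount_feature, offline_amount_feature)
```

Every sample disagrees between the two paths, while a path compared against itself yields no mismatches. A parity test like this is a useful tripwire when two implementations still exist, but it treats the symptom; the durable fix remains having one definition.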

Build it → The feature store as a production system: Project 50: Feature Engineering Platform implements offline/online stores, point-in-time joins, and materialization end to end.

Model serving

Once a model is registered and its features are consistent, the last system stands the model up behind an API — and that act changes what the model is. It stops being a function you call in a notebook and becomes a latency-bound microservice with all the obligations that implies: a latency budget, a throughput target, health checks, graceful degradation, and a deploy pipeline. The most important thing to internalize about serving is that a served model is never just model(x). It sits inside a stack — the request arrives at an API, gets preprocessed (tokenized, normalized), runs through the model runtime (possibly on a GPU, possibly batched), gets postprocessed (decoded, thresholded), and returns. Most production incidents live in that surrounding stack, not in the model math.

The first lesson every serving team learns the hard way is cold start. A model loaded from disk on the request path adds seconds to that request; loading it once at startup and keeping it warm in memory is the difference between a p99 of milliseconds and a p99 of seconds. The shape of the fix is a startup hook that loads the model before any traffic arrives.

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app):
    app.state.model = load_champion()   # load ONCE, at startup — never per request
    warm_up(app.state.model)            # dummy inference: trigger lazy JIT/kernel init
    yield

app = FastAPI(lifespan=lifespan)

That warm_up call is the second lesson hiding inside the first. On many runtimes (a torch.compile’d model, ONNX Runtime, TensorRT) the first inference triggers lazy graph optimization and kernel compilation, so the first real request after a deploy can be five to ten times slower than the rest. A few dummy inferences at startup pay that cost before users do. It is also why you measure latency in percentiles (p50, p95, p99) and not averages: a handful of slow requests barely moves the mean, so the average hides exactly the tail-latency spikes that cold starts and missing warmups produce.
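A small worked example shows how much an average can hide. The latency numbers below are invented for illustration, and the percentile helper uses the simple nearest-rank definition (monitoring systems often interpolate or estimate instead).

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 98 warm requests at 12 ms, 2 cold ones at 900 ms (made-up numbers).
latencies_ms = [12.0] * 98 + [900.0] * 2

mean_ms = statistics.fmean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.2f} p50={p50} p95={p95} p99={p99}")
```

The mean comes out just under 30 ms, which looks healthy, and even p95 is 12 ms. Only p99 surfaces the 900 ms requests that two in every hundred callers actually experienced.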

The central serving decision is online versus batch, and it is really a choice on the latency/throughput tradeoff. Online (real-time) serving answers one request as fast as possible — a user is waiting — and optimizes for latency. Batch serving scores millions of rows offline, where no one is waiting, and optimizes for throughput; it is the right tool for nightly recommendation refreshes or bulk scoring, and it is dramatically cheaper per prediction. Between them sits dynamic batching, the technique that makes online GPU serving efficient: the server holds incoming requests for a few milliseconds, groups them into a batch, and runs them together, because a GPU scoring 32 requests at once is far more efficient than scoring them one by one. It trades a little latency for a lot of throughput — a knob you tune against your latency budget, not a free lunch.
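The hold-and-group behavior of dynamic batching can be sketched with asyncio. This is a simplified illustration under stated assumptions, not how Triton or any other serving framework implements it: the DynamicBatcher class is hypothetical, the "model" is a lambda that doubles its inputs, and error handling is omitted.

```python
import asyncio


class DynamicBatcher:
    """Collect requests for up to max_wait_s or max_batch items, then run them as one batch."""

    def __init__(self, predict_batch, max_batch=32, max_wait_s=0.005):
        self.predict_batch = predict_batch    # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []                 # recorded only so the demo can inspect batching

    async def predict(self, item):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]              # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.predict_batch([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(6)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
```

Six concurrent requests are served in fewer than six model calls, and each caller still gets its own answer. The two knobs are visible in the constructor: max_wait_s is latency you spend on every batch that does not fill up, and max_batch caps how much work one model call takes on, so both have to be tuned against the latency budget.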

Because serving is web engineering, it inherits web engineering’s defenses: validate inputs at the boundary (a malformed feature vector returns a 422, not a crashed worker), set timeouts (a hanging prediction must not hang the request), bound in-memory queues so a spike doesn’t OOM the process, and stamp every response with the model version that produced it so a bad prediction is traceable. None of this is ML-specific; all of it separates a demo endpoint from a service. The serving frameworks — TorchServe, TensorFlow Serving, Triton — provide batching, versioning, and GPU scheduling out of the box, but the FastAPI service above is the right starting point and makes the constraints legible.
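Three of those defenses (boundary validation, a prediction timeout, and a version stamp on every response) fit in a short framework-free sketch. Everything named here is hypothetical: the handle function, the MODEL_VERSION string, the three-feature input shape, and the choice of 504 for a timeout. In a FastAPI service, request models and exception handlers would typically play these roles.

```python
import asyncio

MODEL_VERSION = "sentiment@champion:v7"   # hypothetical label stamped on every response
N_FEATURES = 3                            # hypothetical input width


class InvalidInput(ValueError):
    """Maps to an HTTP 422 at the API layer."""


def validate(features):
    if not isinstance(features, list) or len(features) != N_FEATURES:
        raise InvalidInput(f"expected a list of {N_FEATURES} numbers")
    if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in features):
        raise InvalidInput("features must be numeric")
    return [float(x) for x in features]


async def handle(features, predict, timeout_s=0.05):
    """Validate, predict under a deadline, and stamp the response with the model version."""
    try:
        clean = validate(features)
    except InvalidInput as exc:
        return {"status": 422, "error": str(exc), "model_version": MODEL_VERSION}
    try:
        score = await asyncio.wait_for(predict(clean), timeout_s)
    except asyncio.TimeoutError:
        return {"status": 504, "error": "prediction timed out", "model_version": MODEL_VERSION}
    return {"status": 200, "score": score, "model_version": MODEL_VERSION}


async def fast_model(x):
    return sum(x) / len(x)


async def hung_model(x):
    await asyncio.sleep(10)   # stands in for a stuck prediction
    return 0.0


ok = asyncio.run(handle([1, 2, 3], fast_model))
bad = asyncio.run(handle(["a", 2], fast_model))
slow = asyncio.run(handle([1, 2, 3], hung_model))
```

The malformed input and the hung prediction both come back as structured responses instead of a crashed worker or a hanging request, and all three responses carry the model version. Bounded queues are left out of this sketch; they belong in whatever component buffers requests ahead of the model.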

Build it → Real-time inference at the level where latency dominates: Project 44: Autoregressive Inference implements warm-pool loading, batching, and the request stack for token-by-token generation where every millisecond is on the critical path.

MLOps as the glue

Tracking, the feature store, and serving are three systems; MLOps is the pipeline that connects them into a loop. Where ordinary CI/CD builds, tests, and deploys code, MLOps does the same for models — and adds the steps software never needed: retrain on fresh data, evaluate the new model against the current champion, and only then promote it. The deployment itself reuses web-engineering patterns. A canary rollout sends a slice of traffic to the new model and watches its metrics (accuracy, latency, error rate) before widening; a blue-green deploy keeps the old version warm for an instant-switch rollback; a rolling update replaces instances gradually for zero downtime. The registry’s champion/challenger aliases make all three safe: the pipeline promotes a challenger only after it earns it, and rolls back by moving the alias.
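One common way to carve out the canary slice is to hash a stable request key, so the same caller always lands on the same model. The sketch below assumes a user ID is available as that key; the function names and the 5% fraction are illustrative, not a prescribed setup.

```python
import hashlib


def canary_bucket(request_key: str) -> float:
    """Map a stable key (e.g. a user ID) to a number in [0, 1) that never changes."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def route(request_key: str, canary_fraction: float) -> str:
    """Send a fixed slice of keys to the challenger; everyone else stays on the champion."""
    return "challenger" if canary_bucket(request_key) < canary_fraction else "champion"


keys = [f"user-{i}" for i in range(10_000)]
share = sum(route(k, 0.05) == "challenger" for k in keys) / len(keys)
```

Because the bucket depends only on the key, widening the canary from 5% to 20% keeps every user who was already on the challenger there, and setting the fraction to 0 is an immediate rollback. Which alias each role resolves to stays the registry's job; this function only decides who sees which role.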

The honest framing is maturity, not a binary. Most teams start at “automated pipelines” — scripted training and tracked experiments with a manual deploy — and climb toward continuous training, where data drift triggers retraining automatically. You build the bottom rung well, not the top on day one. The lifecycle and evaluation half of this — designing the evaluations a challenger must pass, detecting the drift that triggers retraining — is a deep subject treated as its own body of work; this chapter stays on the systems the pipeline orchestrates.

Versioning everything

The thread through all of this is one discipline ordinary software never had to take so far: version everything that affects a prediction. Code is the obvious input, and git handles it. But an ML system’s behavior is determined by three inputs — code, data, and the model artifact — and reproducibility means versioning all three together. A model is reproducible only if you can name the exact code SHA, dataset snapshot, and parameters that produced it; the registry versions the resulting artifact, and tracking ties it back to those inputs. Data-versioning tools (DVC and friends) give datasets the same git-like history code enjoys, so a run that says “trained on data v2.1” still means something months later. The payoff: for any model that ever served a prediction, you can say exactly how it was built and how to rebuild it — the difference between an auditable system and an anecdote with a confusion matrix.
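Versioning the three inputs together can be as simple as hashing them into one identifier. This sketch is an assumption-laden illustration rather than what DVC or any registry does internally: content_hash and lineage_id are hypothetical helpers, and the CSV snapshots and commit SHA are made up.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Content-address a dataset snapshot: same bytes, same version string."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def lineage_id(code_sha: str, data_version: str, params: dict) -> str:
    """One ID for the (code, data, params) triple; a change to any input changes it."""
    manifest = {"code_sha": code_sha, "data_version": data_version, "params": params}
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Illustrative snapshots: same schema, one changed value.
snapshot_v1 = b"customer_id,spend_30d\n1001,340.0\n"
snapshot_v2 = b"customer_id,spend_30d\n1001,910.0\n"

id_a = lineage_id("3f2c1e9", content_hash(snapshot_v1), {"lr": 3e-4, "batch_size": 32})
id_b = lineage_id("3f2c1e9", content_hash(snapshot_v1), {"batch_size": 32, "lr": 3e-4})
id_c = lineage_id("3f2c1e9", content_hash(snapshot_v2), {"lr": 3e-4, "batch_size": 32})
```

The first two IDs match because the manifest is serialized canonically, so parameter order does not matter. The third differs because one value in the data changed. Attaching such an ID to the registered model gives a quick check of whether a rebuild used the same inputs.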


Practical exercise

Difficulty: Level I · Level II · Level III

  1. Level I — Make a run reproducible. Take a small training script and instrument it with experiment tracking: log the parameters, the metrics over steps (not just the final value), the model artifact, and — the two everyone forgets — the git SHA and a data version hash. Then prove it works: from the logged run alone, reconstruct the exact configuration and re-run it. Write down what you would have been unable to recover if you had logged only the final accuracy.

  2. Level II — Define a feature once, serve it twice. Define a single feature (e.g. “30-day spend per customer”) with one definition. Serve it to an offline training job via a historical, point-in-time-correct retrieval, and to an online inference path via a low-latency single-entity lookup. Then explain, in writing: (a) how routing both paths through one definition prevents training-serving skew, and (b) what point-in-time correctness is — construct a concrete example where joining “today’s” feature value onto a past label leaks the future and inflates your offline metric.

  3. Level III — Design serving under a tight SLA. Design the serving architecture for a model with a p99 latency budget of, say, 50ms. Decide online versus batch (and where, if anywhere, batch scoring offloads work); specify dynamic batching and how you’d tune its wait window against the budget; describe warm-pool loading and warmup; and lay out a model rollout/rollback path using champion/challenger aliases (canary or blue-green). Then do the adversarial part: reason about where skew or staleness could still creep in despite the feature store — a materialization job that silently fails, a TTL shorter than the update cadence, an on-the-fly transformation in the serving code that bypasses the shared definition — and how you’d detect each.

Summary

A trained model is not a product; the gap between the notebook artifact and a served, reproducible, maintainable model is ML systems engineering, and it is closed by three systems. Experiment tracking and a model registry make every model traceable back to the data, code, parameters, and metrics that produced it, and turn deployment into a reversible alias move instead of a white-knuckle file copy. A feature store holds one definition of each feature and serves it consistently to both training (offline, point-in-time correct) and inference (online, low-latency), which eliminates training-serving skew at the source rather than chasing it after it has already cost you accuracy. Model serving stands the model up as a latency-bound microservice, where the incidents live in the request stack — cold starts, missing warmups, unbatched GPU calls — far more often than in the model math. MLOps is the pipeline that loops these together into automated retrain/evaluate/deploy with a safe rollout and rollback, and the discipline underneath all of it is to version everything that affects a prediction: code, data, and model alike.

Key takeaways

  • A run is reproducible only if you log the data version and code SHA alongside the metrics; the metrics tell you what happened, the inputs tell you why — and let you do it again.
  • A model registry’s champion/challenger aliases make promotion and rollback one reversible state change; the serving layer loads a role, never a hard-coded version.
  • Training-serving skew is silent — nothing errors, accuracy just sags — and its root cause is having two definitions of one feature; the cure is having exactly one, served to both faces.
  • Point-in-time correctness keeps the future out of your training data; a feature store’s historical retrieval joins features as of the label’s timestamp, which is why it needs timestamps, not just IDs.
  • A served model is a microservice: load once and warm up, batch on the GPU, measure p99 not the average, and validate/timeout/version at the boundary.
  • Version code and data and the model artifact — reproducibility needs all three, and software’s git-only habit is not enough.

Connections to other chapters

  • The Machine Learning Engineering Landscape (prerequisite): that chapter frames the model as a small fraction of a real ML system; this one builds out the systems that make up the rest — the tracking, features, and serving infrastructure that surrounds the weights.
  • The Data Engineering Landscape (prerequisite): a feature store sits directly on the data platform. The warehouses, batch pipelines, and streaming systems taught there are exactly the substrate a feature’s offline and online faces are computed from, and the feature-engineering platform project lives at that seam.
  • Python: Web Development and Go: Web Services & gRPC (siblings): model serving is a web service, with extra constraints. The request lifecycle, validation, timeouts, and API design from those chapters carry over directly; serving adds model loading, warmup, and batching on top of that same foundation.
  • CI/CD and Deployment Automation and Observability (extensions): MLOps is CI/CD plus monitoring applied to models — the canary/blue-green/rolling patterns and the metrics-and-alerting discipline come straight from those chapters, with model accuracy, feature freshness, and prediction drift added as new things to watch. (The deeper evaluation lifecycle — eval design, drift policy, LLM judges — is treated as its own subject and is intentionally out of scope here.)

Further reading

Essential

  • Chip Huyen, Designing Machine Learning Systems (O’Reilly) — the definitive book-length treatment of exactly this systems view: tracking, features, serving, and the loop around them.
  • MLflow documentation — the canonical reference for experiment tracking, the model registry, and alias-based promotion.
  • Feast documentation — the open-source feature store; the offline/online store split, point-in-time joins, and materialization in concrete form.

Deep dives

  • Google Cloud, “MLOps: Continuous delivery and automation pipelines in machine learning” — the maturity-level framing (manual → automated pipelines → CI/CD → continuous training) used in this chapter.
  • Uber Engineering, “Meet Michelangelo: Uber’s Machine Learning Platform” — the influential early articulation of an end-to-end ML platform, and the origin of much feature-store thinking.

Historical context

  • “Feature Stores for ML” writing (the Hopsworks/Tecton/Feast lineage) — how the feature store emerged as a named pattern from the training-serving skew problem.
  • D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NeurIPS, 2015) — the paper that named the “ML code is a tiny box in the middle” reality this entire chapter is a response to.