The Machine Learning Engineering Landscape

Keywords

ml engineering, ml systems, ml lifecycle, technical debt, training serving skew, mlops, model serving, reproducibility

Introduction

The notebook said 95%. After a month of careful work — clean features, honest cross-validation, a held-out test set the data scientist had not peeked at — the model predicted churn with an accuracy the business genuinely wanted. The demo went perfectly. Everyone agreed it should ship. That is where the easy part ended and the engineering began, and the engineering is where it nearly died.

The first problem surfaced when someone asked for the model to be rebuilt. Which version of the data had it trained on? Nobody could say — the CSV had been overwritten twice since. Which hyperparameters? They lived in a cell that had been edited in place, its history gone. Re-running the notebook top to bottom produced a different model, because the train/test split reshuffled and a dependency had bumped a minor version. The 95% was real, once, and irreproducible. The second problem was worse and quieter. The features the notebook computed — a customer’s average order value over the trailing ninety days, normalized against a mean baked into the notebook — were computed differently at serving time, by a separate service, against live data, with a different baseline. The model in production was being fed features that did not match the ones it had learned on. It was confidently wrong, and no error fired. Then, over the following weeks, even that drifted: customer behavior shifted, and the model’s accuracy rotted from 95% toward a coin flip — silently, because nothing was watching the predictions against reality. And when the team finally decided to retrain, a single retrain turned out to be a week of manual work: copy the notebook, hunt down current data, re-tune, re-export, hand the artifact to the serving team, hope.

None of these failures were about the model. The model was the easy part — a month of work that produced a good artifact. What buckled was everything around the model: the inability to reproduce it, the gap between how features were built for training versus serving, the absence of any monitoring to catch decay, and the lack of a repeatable path from new data to deployed model. This is the central truth of Part III, and it is worth saying plainly before anything else: in production machine learning, the model is the small part; the system around it is the engineering.

The core insight: ML code is a small part of an ML system

The most cited observation in this field is also the most clarifying. In “Hidden Technical Debt in Machine Learning Systems” (Sculley et al., NeurIPS 2015), a team of Google engineers drew a now-famous diagram: a sprawl of boxes — data collection, feature extraction, data verification, serving infrastructure, monitoring, configuration, process-management tools — surrounding one tiny box in the middle labeled ML Code. The point of the picture is the proportion. The mathematics everyone studies — the model, the loss function, the training loop — is a fraction of a real ML system. The overwhelming majority of the code, the effort, and the failures live in the infrastructure that feeds the model, ships it, and keeps it honest.

This reframes what the job actually is. ML engineering is not primarily about inventing architectures or squeezing another point of accuracy out of a benchmark; that is research and data science, and it is genuinely the smaller box. ML engineering is the discipline of building and operating the system around the model reliably — the data pipelines that deliver consistent inputs, the feature infrastructure that computes features the same way in training and serving, the experiment tracking that makes a result reproducible, the serving layer that turns a saved artifact into a low-latency API, the monitoring that notices when reality drifts away from the training distribution, and the retraining loop that closes back on itself. Figure 35.1 shows the whole system, with the training code as the one small box it really is.

Read the figure as a loop, not a pipeline. Data is ingested and versioned; features are engineered and stored; the training code consumes them and produces a candidate model, logging its parameters and metrics to experiment tracking as it goes; the candidate is evaluated against gates and, if it passes, promoted into a model registry; the registered model is served for online inference or batch scoring; and the serving layer is watched by monitoring, which compares live inputs and predictions against what the model was trained on. When monitoring detects drift — the world has moved, accuracy is sliding — it triggers a retrain, and the loop runs again. The small red box in the middle is the model code. The blue, orange, and green bands around it — the data platform, the feature store and tracking, the registry, serving, and monitoring — are most of an ML system, and most of this part of the book.
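One turn of that loop can be sketched in a few lines of Python. This is an illustration only, not the API of any particular MLOps tool: `Candidate`, `Registry`, `passes_gate`, `run_cycle`, `needs_retrain`, and every threshold below are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """A trained model's identity plus the metric the gate will judge."""
    version: str
    accuracy: float  # measured on a held-out evaluation set


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for a model registry."""
    production: Optional[Candidate] = None
    history: List[Candidate] = field(default_factory=list)

    def promote(self, candidate: Candidate) -> None:
        self.history.append(candidate)
        self.production = candidate


def passes_gate(candidate: Candidate, registry: Registry,
                min_accuracy: float = 0.90) -> bool:
    """Clear an absolute bar and do not regress against what is serving."""
    if candidate.accuracy < min_accuracy:
        return False
    current = registry.production
    return current is None or candidate.accuracy >= current.accuracy


def run_cycle(candidate: Candidate, registry: Registry) -> bool:
    """Evaluate, gate, promote. Returns True if the candidate shipped."""
    if passes_gate(candidate, registry):
        registry.promote(candidate)
        return True
    return False


def needs_retrain(live_accuracy: float, registry: Registry,
                  tolerance: float = 0.05) -> bool:
    """Monitoring check: has live accuracy slid too far below the registered figure?"""
    current = registry.production
    return current is not None and (current.accuracy - live_accuracy) > tolerance


registry = Registry()
run_cycle(Candidate("v1", 0.95), registry)      # passes the gate, promoted
run_cycle(Candidate("v2", 0.85), registry)      # fails the gate, v1 keeps serving
if needs_retrain(live_accuracy=0.80, registry=registry):
    run_cycle(Candidate("v3", 0.96), registry)  # retrained candidate re-enters the loop
```

The design point is that promotion and retraining are decisions made by code against recorded metrics, so the loop can run again without anyone copying a notebook by hand.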

What makes ML systems different

If ML systems were just ordinary software with a matrix multiply in the middle, none of this would need its own part of the book. They are different in four specific ways, and each one is a class of problem that conventional software engineering does not have.

Behavior depends on data, not just code. In ordinary software, behavior is fully determined by the source: read the code and you know what it does. An ML model’s behavior is determined by the data it was trained on — the same code, trained on different data, is a different system. Correctness can no longer be reasoned about from the code alone, which is why versioning data is as essential as versioning code, and why “what did this train on?” must always have an answer.
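A toy example makes the point concrete. The sketch below assumes a deliberately trivial "model" (a single cut-off halfway between the two class means); the function names and both datasets are invented for illustration. The training code is identical in both runs, and only the data differs.

```python
from statistics import mean


def fit_threshold(rows):
    """Learn a cut-off halfway between the mean of each class.

    rows: list of (value, label) pairs, where label is 0 or 1.
    """
    negatives = [x for x, y in rows if y == 0]
    positives = [x for x, y in rows if y == 1]
    return (mean(negatives) + mean(positives)) / 2


def predict(threshold, value):
    """Classify a value as 1 if it lies above the learned cut-off."""
    return int(value > threshold)


# Two hypothetical snapshots of the "same" dataset, taken at different times.
snapshot_a = [(10, 0), (20, 0), (60, 1), (70, 1)]
snapshot_b = [(40, 0), (50, 0), (90, 1), (100, 1)]

model_a = fit_threshold(snapshot_a)  # class means 15 and 65
model_b = fit_threshold(snapshot_b)  # class means 45 and 95

# Same code, same input, different answers.
same_input = 55
answer_a = predict(model_a, same_input)
answer_b = predict(model_b, same_input)
```

Reading `fit_threshold` tells you nothing about whether 55 is classified as 0 or 1; only knowing which snapshot it trained on does.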

Models decay because the world drifts. A correct program stays correct until something about it changes: its code, its dependencies, or its requirements. A correct model degrades on its own, because the world it was trained to predict keeps moving: customer behavior shifts, fraud patterns evolve, last year’s distribution stops describing this year’s traffic. This is drift, and it is why a deployed model is never finished. Software rots when someone changes something; a model rots even when nothing in your control changes at all, which makes monitoring and a retraining loop part of the system’s definition, not optional extras.
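A drift check can be as simple as comparing the distribution of a feature at training time with its distribution in live traffic. The sketch below uses the Population Stability Index, which is one common choice and not something this chapter prescribes. The bin count, the `eps` floor, and `ALERT_THRESHOLD` are assumptions to be tuned per feature, and the data is synthetic.

```python
import math
from typing import List, Sequence


def bin_shares(values: Sequence[float], lo: float, width: float,
               bins: int, eps: float) -> List[float]:
    """Fraction of values per bin; out-of-range values fall into the edge bins."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    # eps keeps empty bins from producing log(0) or a division by zero.
    return [max(c / len(values), eps) for c in counts]


def psi(reference: Sequence[float], live: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `live` measured against `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate reference: every value shares one bin
    ref = bin_shares(reference, lo, width, bins, eps)
    cur = bin_shares(live, lo, width, bins, eps)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


ALERT_THRESHOLD = 0.2  # assumption for this sketch, not a universal constant

training_values = [float(v) for v in range(100)]      # what the model saw
shifted_values = [v + 50.0 for v in training_values]  # the world has moved

stable = psi(training_values, training_values)
drifted = psi(training_values, shifted_values)
should_alert = drifted > ALERT_THRESHOLD
```

A check like this needs no labels, which matters because ground truth for live predictions often arrives late or not at all.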

Reproducibility requires more than code. To reproduce a software build you pin the code and its dependencies. To reproduce a model you must pin four things at once: the data (down to a version or hash), the parameters (every hyperparameter and seed), the code, and the environment (library versions, even CUDA and cuDNN, because numerical non-determinism is real). Miss any one and “the exact same run” produces a different model — exactly the failure in the opening story.
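The four pins can be recorded together as one manifest per training run. The following is a minimal sketch using only the standard library; `fingerprint`, the field names, the commit string, and the package versions are hypothetical placeholders, and a real project would read them from its version control and its environment.

```python
import hashlib
import json
import platform


def fingerprint(data_bytes: bytes, params: dict,
                code_version: str, packages: dict) -> dict:
    """Record data, parameters, code, and environment for one training run."""
    manifest = {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,              # every hyperparameter and the seed
        "code_version": code_version,  # e.g. a commit identifier
        "environment": {
            "python": platform.python_version(),
            "packages": packages,      # pinned library versions
        },
    }
    # Hash a canonical serialization so that equal inputs give an equal run_id.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = hashlib.sha256(canonical).hexdigest()[:12]
    return manifest


# Illustrative values only.
data = b"customer_id,avg_order_value\n1,52.0\n2,48.5\n"
params = {"learning_rate": 0.1, "max_depth": 6, "seed": 42}
packages = {"example-ml-lib": "1.2.3"}

original = fingerprint(data, params, "abc123", packages)
rerun = fingerprint(data, params, "abc123", packages)
overwritten = fingerprint(data + b"3,61.0\n", params, "abc123", packages)
```

Here `original` and `rerun` share a `run_id`, while `overwritten` does not, so an overwritten CSV shows up as a different run instead of silently producing a different model.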

Training-serving skew is a bug class software does not have. This is the subtle, expensive one. A model learns the relationship between features as they were computed during training and the target. If features are computed even slightly differently at serving time — a different normalization baseline, a different default for a missing value, a feature joined from a stale table — the model receives inputs from a distribution it never saw and produces quietly wrong predictions, with no error and no crash. Ordinary software has nothing analogous: there is no way for a sort() to be “skewed” between test and production. Eliminating skew by computing features once, the same way for both training and serving, is one of the central reasons feature stores exist.
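The normalization-baseline case can be reproduced in a few lines. This sketch assumes a single feature, average order value divided by a baseline mean; the function names and all numbers are invented for illustration.

```python
from statistics import mean


def fit_baseline(order_values):
    """Computed once, at training time, and saved alongside the model."""
    return mean(order_values)


def normalized_aov(order_values, baseline):
    """The one feature definition that both training and serving import."""
    return mean(order_values) / baseline


# Training time: the baseline comes from historical data.
training_orders = [40.0, 60.0, 50.0]
baseline = fit_baseline(training_orders)

customer_orders = [70.0, 80.0]

# Consistent path: serving reuses the stored training baseline.
feature_consistent = normalized_aov(customer_orders, baseline)

# Skewed path: a separate service recomputed its own baseline on live data.
live_baseline = 100.0
feature_skewed = normalized_aov(customer_orders, live_baseline)

# Neither call raises an error; only a parity check exposes the gap.
skew = abs(feature_consistent - feature_skewed)
```

The same customer is described to the model as 1.5 on one path and 0.75 on the other, and nothing fails. Sharing both the function and the stored baseline removes the second path entirely.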

How Part III is organized

Part III walks the ML system from the inside out — from the algorithms and frameworks at the core, through the systems that operate them, out to the hardware they run on. Six chapters, in this order:

  • ML Foundations — the smaller box, treated honestly: how learning algorithms work, how to engineer features that carry signal, and how to select and evaluate a model without fooling yourself. This is the conceptual ground everything else stands on.
  • Deep Learning Frameworks — PyTorch and TensorFlow, and the one idea that powers both: automatic differentiation. How tensors, autograd, and training loops turn the math into code you can actually run and scale.
  • ML Systems — the surrounding infrastructure made concrete: experiment tracking, feature stores, model registries, and model serving. This is where most of Figure 35.1 gets built.
  • Production ML — operating a deployed model: monitoring and drift detection, A/B testing for safe rollout, and optimization (quantization, pruning, distillation) to make inference fast and cheap. The part of the loop that runs after deployment.
  • Distributed Training — when one machine is not enough: data and model parallelism, and the frameworks that scale training across many GPUs and nodes.
  • GPU Programming & CUDA — the hardware floor: how GPUs are built, how to write kernels, and where the performance every layer above quietly depends on actually comes from.

One boundary is worth stating up front. This part stops at the level of ML engineering systems — training, serving, features, monitoring, scaling, and the hardware. It does not cover large language models, retrieval-augmented generation, agents, or LLM inference; that material — the AI-engineering layer built on top of these foundations — lives in the companion book AI Engineering. Where the underlying theory overlaps (a transformer is a neural network; an LLM is served like any other model), the foundations here carry over directly; the LLM-specific practices do not, and this part deliberately leaves them to that book.

What you’ll learn across Part III

  • Why an ML system is mostly not the model, and how the data, feature, serving, and monitoring infrastructure around the model determines whether it succeeds in production
  • How the core algorithms and deep-learning frameworks actually work — features, model selection, tensors, autograd, and training loops — well enough to reason about them, not just call them
  • How to make an ML result reproducible, by versioning data, parameters, code, and environment together rather than hoping a notebook re-runs the same way
  • How to recognize and eliminate training-serving skew, the bug class unique to ML, by computing features once for both training and serving
  • How to operate a deployed model — monitor it for drift, roll out new versions safely with A/B tests, and optimize it for latency and cost
  • How to scale training beyond a single machine and reach down to the GPU when the performance the whole stack depends on demands it

A quick orientation

Before the first chapter, spend a few minutes mapping these ideas onto a system you actually know. As in the rest of this book, the goal is a defensible answer, not a correct one — the habit of seeing the whole system, not just the model, is the point.

Difficulty: Level I · Level II · Level III

  1. Level I — Map the lifecycle. Pick an ML project or product you have seen, even at a distance. Walk the stages in Figure 35.1 — data pipeline, feature engineering, training, evaluation, registry, serving, monitoring, retraining — and mark which ones the project actually handles and which it neglects or does by hand. Most real projects have several blank boxes; naming them is the exercise.
  2. Level II — Find the skew. For that same project, trace where a feature is computed for training and where the same feature is computed for serving. Are they the same code path, or two? Name one concrete place where training-serving skew could creep in — a normalization baseline, a default for missing values, a join against a table that is fresh in one path and stale in the other.
  3. Level III — Argue the bottleneck. Make the case for which surrounding system — not the model — most often decides whether an ML project succeeds: the data pipeline, the feature infrastructure, the reproducibility tooling, the serving layer, or the monitoring loop. Defend your choice with a failure you have seen or can imagine, and state what evidence would change your mind. There is no single right answer; the reasoning is the deliverable.

Connections to other chapters

The six chapters of Part III pay off this opener in sequence. ML Foundations and Deep Learning Frameworks open the small box — the algorithms, features, and autograd that the rest of the system serves — so that when the surrounding infrastructure is built in ML Systems, you understand what it is wrapping. ML Systems and Production ML are where Figure 35.1 becomes real: experiment tracking and feature stores attack the reproducibility and training-serving-skew problems from the introduction directly, and monitoring and A/B testing close the loop that catches drift. Distributed Training and GPU Programming & CUDA are the answer to scale — what to do when the small box itself is too big or too slow for one machine.

Part III does not stand alone. It sits directly on top of Part II (Data Engineering): an ML system runs on the data platform, and the features that feed every model come from the pipelines, storage, and processing taught there — the opening story’s silent feature failure is, at root, a data-engineering failure that surfaced as an ML one. It also leans on Part IV (Cross-Cutting Concerns): ML systems need the same Observability and CI/CD as any production software, plus an ML-specific layer on top — monitoring that watches data distributions and prediction quality, and pipelines that validate data, gate on evaluation metrics, and promote models through a registry. Wherever this part says “monitor the model” or “ship it through CI,” the general machinery comes from Part IV and the ML-specific extension is built here.

Finally, the boundary again, as a connection rather than a wall: the LLM and AI-engineering layer — RAG, agents, LLM serving and inference — is built on these foundations but belongs to the companion book AI Engineering. A reader who finishes Part III has exactly the systems grounding that book assumes.

Further reading

Essential

  • Designing Machine Learning Systems (Chip Huyen) — the best single book on the systems-level view this part takes: data, features, training, deployment, and monitoring as one engineered whole. If you read one thing alongside Part III, read this.
  • Machine Learning Engineering (Andriy Burkov) — a practical, end-to-end treatment of the ML lifecycle from problem framing through deployment and maintenance; pairs well with the reproducibility and monitoring themes here.

Deep dives

  • Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NeurIPS 2015) — the source of this chapter’s core insight and its famous small-box diagram; the foundational argument that the model is the easy part.
  • Zinkevich, “Rules of Machine Learning: Best Practices for ML Engineering” (Google) — a field-tested checklist of hard-won rules, including several on training-serving skew and on not over-engineering the model before the system around it works.

Historical context

  • Breck et al., “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” (Google) — turns the technical-debt argument into a concrete, scorable rubric for whether an ML system is actually ready for production.
  • Polyzotis et al., “Data Management Challenges in Production Machine Learning” (SIGMOD 2017) — an early, thorough articulation of why the data around the model, not the model, is where production ML systems are won or lost.