Deep Learning Frameworks

Keywords

deep learning, pytorch, tensorflow, autograd, automatic differentiation, training loop, tensors, gpu, computational graph, backpropagation

Introduction

The researcher had a three-layer network on the whiteboard and a deadline. To train it she needed the gradient of the loss with respect to every weight — and in 2012, before frameworks were a given, that meant deriving it by hand. She worked backward through the chain rule layer by layer, filling pages with partial derivatives, then transcribed each one into a loop of array updates. The forward pass took twenty lines. The backward pass took two hundred, and it was where every bug lived: a transposed matrix here, a missing factor of two there, a sign she flipped while copying from the second page to the third. None of these errors crashed anything. The network ran. It just didn’t learn — the loss drifted sideways for a thousand iterations, because a gradient that is almost right points in almost the right direction, which is to say the wrong one. She found the bug eventually by recomputing each derivative numerically and comparing. The whole exercise had taken a day and produced one working gradient for one specific network. Change the architecture and she would do it all again.

That is the problem deep learning frameworks were built to abolish. The forward pass (the actual model, the part with the ideas in it) is short and easy to get right. The backward pass is long, mechanical, and unforgiving, and it has to be rederived every time the architecture changes. A second failure mode is just as common and just as quiet: the loop that runs cleanly but never improves because the gradients from the last batch were never cleared and silently accumulated, or the loop that runs cleanly but crawls because its tensors were built on the CPU and copied to the GPU inside the hot path, so every step paid for a needless transfer. In both cases nothing throws. The program is wrong in a way the type system cannot see. A deep learning framework exists to make one thing automatic, the gradient, and to move tensors fast on accelerators, so the part you get wrong by hand is the part you never write by hand.

The Core Insight

A deep learning framework is, at its core, two things bolted together: a tensor library that runs on accelerators and an automatic differentiation engine. Everything else — layers, optimizers, data loaders, the model zoo — is convenience built on those two.

The tensor library is the easy half to describe and the hard half to build. A tensor is an n-dimensional array, like a NumPy array, but with two extra facts attached: which device it lives on (CPU, GPU, TPU) and whether the framework should track gradients through it. Operations on tensors (matrix multiplies, convolutions, activations) dispatch to hand-optimized native kernels (CUDA on NVIDIA hardware) that run thousands of multiply-accumulates in parallel. That is why a framework is “NumPy that runs on the GPU”: the API feels like array math, but the arithmetic happens on silicon built for dense linear algebra, at trillions of floating-point operations per second.

The autodiff engine is the half that earns the name framework. As your forward computation runs, the engine records every operation into a graph — input feeds a matrix multiply, whose output feeds an activation, and so on to a scalar loss. You wrote only that forward path. To get gradients, the engine walks the recorded graph backward, applying the chain rule at each node to accumulate the derivative of the loss with respect to every parameter. This is reverse-mode automatic differentiation, and it is exactly backpropagation — the researcher’s two hundred lines become one method call. It is neither numerical approximation nor symbolic algebra on your source; it is the chain rule applied mechanically to a trace of the operations that actually ran, so it is exact and works for any forward pass you can express, loops and branches included.

The payoff is that the training loop is the same regardless of model. Forward to a loss, backward to gradients, optimizer step to update the weights, repeat — that skeleton does not change whether the model is a two-layer perceptron or a billion-parameter network. Everything that differs between models lives inside the forward pass, as a composition of differentiable building blocks. Learn the loop once and you have learned how to train anything.

A mental model

Three pictures, and the rest of this chapter is detail. A tensor is an n-dimensional array that knows which device it lives on — not just data, but data that remembers where it is and whether it matters to the gradient. The computational graph is a recording: as the forward pass runs, the framework writes down each operation and what fed it, then on .backward() plays the tape in reverse, so backpropagation is assigning blame — walking from the loss back toward the inputs, distributing responsibility for the error across every parameter, after which the optimizer nudges each in proportion to its blame. And the framework as a whole is one line: NumPy that runs on the GPU and remembers how to differentiate itself.

When to use deep learning, and which framework

Two decisions precede any code. The first is whether to use deep learning at all. The honest answer: when your data is unstructured (images, audio, text, raw signals) and there is a lot of it. Classical machine learning usually wins on small, tabular, well-feature-engineered datasets, where a gradient-boosted tree often matches or beats a neural network on accuracy while training faster and staying easier to interpret. Deep learning earns its considerable cost precisely where hand-engineering features is hopeless: you cannot write down the features that distinguish a cat from a dog in pixels, so you let the network learn the representation from raw input. That is the trade: you give up interpretability and a small-data comfort zone to gain features no human would engineer. Figure 37.1 shows the loop that makes that learning happen.

The second decision is the framework, and it turns on one design axis: eager (define-by-run) versus graph (define-then-run). In eager execution, which is PyTorch’s default and the research community’s, operations run the instant the interpreter reaches them and the graph is built on the fly; your code is the computation, so you can print inside the forward pass, set breakpoints, and branch on a tensor’s value with a plain Python if. In graph execution, which was TensorFlow 1.x’s original model and survives today in @tf.function (TensorFlow 2 otherwise runs eagerly, with tf.GradientTape recording operations for differentiation), you describe the whole computation as a static graph first, then hand it to a runtime to optimize and execute; that is harder to debug but easier to serialize and ship to a phone or TPU. The practical default for new work, especially research and anything dynamic, is eager-mode PyTorch: it is what most papers ship and what the largest talent pool knows. Reach for the graph-first tools when deployment to constrained targets (mobile via TensorFlow Lite, the browser via TensorFlow.js) or TPU training at scale is the priority, where TensorFlow’s lineage still leads. The distinction has softened either way: PyTorch compiles eager code with torch.compile, and the later Eager versus graph section returns to that.
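To make "branch on a tensor's value with a plain Python if" concrete, here is a minimal sketch, assuming PyTorch is installed. The function name `piecewise_loss` and the two input tensors are hypothetical, chosen only for illustration; the point is that autograd differentiates whichever branch actually ran.

```python
import torch


def piecewise_loss(x: torch.Tensor) -> torch.Tensor:
    # An ordinary Python `if` on a tensor's value: legal in eager mode,
    # because the value already exists when this line runs.
    if x.sum() > 0:
        return (x ** 2).sum()      # branch A: gradient is 2x
    return (-x).sum()              # branch B: gradient is -1 everywhere


pos = torch.tensor([1.0, 2.0], requires_grad=True)
neg = torch.tensor([-1.0, -2.0], requires_grad=True)

piecewise_loss(pos).backward()     # records and differentiates branch A
piecewise_loss(neg).backward()     # records and differentiates branch B

print(pos.grad)                    # tensor([2., 4.])
print(neg.grad)                    # tensor([-1., -1.])
```

Each call records a different graph, which is what define-by-run means in practice: the graph is a by-product of running the code, not a separate artifact you build first.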

What you’ll learn

How a tensor differs from a plain array — device placement, dtype, and gradient tracking — and the cardinal rule that keeps a training run from silently stalling
How reverse-mode automatic differentiation records a computational graph in the forward pass and replays it backward to compute every gradient
Why the training loop (zero-grad, forward, loss, backward, step) is universal, and what each step is actually doing
How models are assembled from differentiable building blocks, and how a custom layer hooks into the autograd engine
How the major architecture families (CNNs, RNNs) are just patterns of those blocks — and where the theory of transformers and LLMs lives instead
How eager and graph execution differ, and how compilation (torch.compile, XLA) buys back performance without giving up the define-by-run feel

Prerequisites

Machine Learning Foundations: what a loss function is, what gradient descent does, the train/validate split, and why overfitting is the enemy
Performance and Profiling: the cross-language chapter on why code is slow and how to speed it up — in particular why pure-Python loops are slow, what vectorization buys you, and the idea that a fast Python library is usually a thin front-end over native code
Comfort with linear algebra at the level of matrix multiplication and the chain rule, and with NumPy-style array operations

Tensors and devices

Start with the tensor, because everything else operates on it. A tensor is an n-dimensional array — the same idea as a NumPy array, with most of the same API: create from lists, fill with zeros or random values, index, slice, and run element-wise and matrix operations. The shape conventions recur everywhere and are worth memorizing: [batch, channels, height, width] for images, [batch, sequence_length, features] for sequences, [batch, features] for tabular data. The batch dimension comes first almost universally, because the framework processes many examples at once to keep the accelerator busy.

What makes a tensor more than an array is its metadata, and two pieces are load-bearing. The first is dtype: float32 is the default; float16 and bfloat16 halve the memory per element and, on GPUs with hardware support for them, can substantially raise throughput, which makes precision a real performance lever for large models. The second, the one that bites beginners, is device: every tensor lives somewhere specific, on the CPU or a particular GPU, with its data physically in that device’s memory.
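The metadata described above can be inspected directly. The following is a small sketch, assuming PyTorch with its default dtype left unchanged; the tensor names are illustrative only.

```python
import torch

# A batch of 8 RGB images, 32x32: [batch, channels, height, width].
images = torch.zeros(8, 3, 32, 32)

# Every tensor carries its shape, dtype, device, and gradient-tracking flag.
print(images.shape, images.dtype, images.device, images.requires_grad)

# Casting to bfloat16 halves the bytes stored per element.
half = images.to(torch.bfloat16)
print(images.element_size(), half.element_size())   # 4 2

# Gradient tracking is opt-in for tensors you create yourself.
weights = torch.randn(3, 3, requires_grad=True)
print(weights.requires_grad)                         # True
```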

This yields the single most important operational rule in the whole framework: keep everything on one device. A model’s parameters live on a device; the input batch must live on the same one. An operation between a CPU tensor and a GPU tensor does not quietly do the right thing: PyTorch typically raises a RuntimeError (expected all tensors to be on the same device, a message every practitioner learns to read on sight). The fix is to move data to the model’s device at the top of every step; the efficient habit is to create tensors on the target device rather than building them on the CPU and copying, because the CPU-to-GPU transfer is slow and a loop that crosses that boundary needlessly can spend more time shuffling bytes than computing. Below is the canonical device dance: pick the device once, move the model there, move each batch there as it arrives.

import torch

# Pick the accelerator if one exists; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)                       # parameters now live on `device`
for inputs, labels in loader:
    inputs = inputs.to(device)                 # batch must join the model
    labels = labels.to(device)                 # ...so must the labels

The same idea appears in TensorFlow, where placement is more automatic but the principle is identical. Device discipline is not skippable: it is the difference between a loop that runs and one that runs fast, and occasionally the difference between a loop that runs at all.
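The "create tensors on the target device" habit might look like the sketch below, assuming PyTorch. The tensor names are hypothetical, and the code falls back to the CPU when no GPU is present, in which case both patterns cost the same.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Slower pattern on a GPU machine: allocate on the CPU, then copy across.
copied = torch.zeros(1024, 1024).to(device)

# Preferred pattern: allocate directly in the target device's memory.
direct = torch.zeros(1024, 1024, device=device)

# Tensors derived from an existing tensor inherit its device.
like = torch.ones_like(direct)

print(copied.device, direct.device, like.device)
```

The `*_like` constructors are a convenient way to stay on the right device inside a forward pass without threading a `device` variable through every function.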

Automatic differentiation: the centerpiece

This is the heart of the framework and the reason it exists, so it is worth slowing down. Autodiff replaces the researcher’s hand-derived gradients with bookkeeping the framework does for you, and the mechanism is simpler than it sounds.

When you operate on a tensor marked to track gradients (in PyTorch, one with requires_grad=True, which model parameters have automatically) the framework records that operation into a directed graph: nodes are the operations, and edges carry the tensors that flow between them. The graph grows one node per operation as the forward pass runs, until you reach a scalar loss. Nothing has been differentiated yet; the framework has merely written down what happened, in order. This is the tape, or the computational graph, and building it is the only cost the forward pass pays to be differentiable.

The magic is in .backward(). Call it on the loss and the framework walks the recorded graph in reverse, from the loss back toward the inputs. At each node it knows the local derivative of that operation — of a matrix multiply, a ReLU, an addition, all in closed form — and multiplies these together along the way, which is precisely the chain rule. The result, accumulated into each parameter’s .grad, is the gradient of the loss with respect to that parameter. This is reverse-mode automatic differentiation; in deep learning it goes by the more familiar name backpropagation. A tiny example makes it concrete: square a tensor, ask for the gradient, and get back the analytic derivative without writing it.

x = torch.tensor([2.0], requires_grad=True)   # track gradients for x
y = x ** 2 + 3 * x + 1                         # forward pass; graph is recorded
y.backward()                                   # walk the graph backward
print(x.grad)                                  # tensor([7.]) — exactly dy/dx = 2x+3 at x=2

No derivative was derived. The framework knew the local rule for squaring, scaling, and addition, and composed them. Scale this from one scalar to a network’s millions of parameters and the loop is identical: the forward pass records the graph, .backward() replays it, every .grad is filled in. Three properties make this powerful enough to have remade a field. It is exact up to floating-point rounding: unlike a finite-difference approximation, it introduces no truncation error and has no step size to tune. It is general, working for any forward pass, including ones with if statements and for loops, because it differentiates the operations that actually ran rather than analyzing your source. And it is automatic: you write the forward computation and the gradient comes free.
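To see how little machinery "record, then replay with the chain rule" needs, here is a toy scalar engine in pure Python. It is a teaching sketch, not how PyTorch is implemented: the `Value` class and its fields are hypothetical, it supports only addition and multiplication, and it compares its result with a central finite difference.

```python
class Value:
    """A scalar that remembers which values produced it (toy example)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0
        self.parents = parents          # Values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        # Order the recorded graph so every node comes after its inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Replay in reverse: chain rule, accumulating into each parent.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


def f(x):
    return x * x + 3 * x + 1


x = Value(2.0)
y = f(x)                 # forward pass records the graph
y.backward()             # backward pass applies the chain rule

h = 1e-5
numeric = (f(2.0 + h) - f(2.0 - h)) / (2 * h)   # finite-difference estimate
print(y.data, x.grad, numeric)                  # x.grad is 7.0
```

Note that `parent.grad += ...` accumulates rather than overwrites, which is needed here because `x` feeds the graph through more than one path.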

Figure 37.1 puts the pieces in one frame — the forward pass building the graph, the backward pass walking it in reverse to compute gradients, the optimizer closing the loop — under one annotation worth tattooing on it: you write only the forward pass; the framework records the graph and differentiates it automatically.

TensorFlow expresses the same idea with different ergonomics: you record operations inside a tf.GradientTape context, then call tape.gradient(loss, vars). The tape is explicit rather than implicit and watches trainable tf.Variable objects by default; a common trap is asking for the gradient of a tf.constant and getting None because the tape was never told to watch it, which an explicit tape.watch(x) fixes. The surface differs; the algorithm underneath is the same reverse-mode autodiff.
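The tape behaviour and the tf.constant trap can be reproduced in a few lines. This is a sketch assuming TensorFlow 2 is installed; the variable names are illustrative, and the function is the same polynomial used in the PyTorch example.

```python
import tensorflow as tf

x = tf.constant(2.0)

# Trap: a constant is not watched, so the tape has nothing to differentiate.
with tf.GradientTape() as tape:
    y = x ** 2 + 3 * x + 1
unwatched = tape.gradient(y, x)

# Fix 1: tell the tape to watch the constant explicitly.
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x ** 2 + 3 * x + 1
watched = tape.gradient(y, x)

# Fix 2: use a tf.Variable, which is watched automatically when trainable.
v = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = v ** 2 + 3 * v + 1
from_variable = tape.gradient(y, v)

print(unwatched)               # None
print(float(watched))          # 7.0
print(float(from_variable))    # 7.0
```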

The training loop

With autodiff understood, the training loop is almost anticlimactic, which is the point. Once the batch is on the model’s device, it is the same five steps for every model. Zero the gradients from the previous step. Run the forward pass to get predictions. Compute the loss. Call .backward() to fill in the gradients. Call .step() to let the optimizer update the parameters. Loop. The order of forward, loss, backward, and step is fixed by their data dependencies, and the zeroing must happen somewhere between one .step() and the next .backward(); that one step is the source of the most infamous silent bug in deep learning, which we will come to.

model.train()                                  # enable dropout, batchnorm in train mode
for inputs, labels in loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()                      # clear last step's gradients (see below)
    outputs = model(inputs)                    # forward: build the graph
    loss = criterion(outputs, labels)          # how wrong are we?
    loss.backward()                            # backward: fill every .grad
    optimizer.step()                           # update params from .grad (plain SGD: param -= lr * param.grad)

Two lines deserve a closer look. loss.backward() is the autodiff engine invoked: it walks the graph built by model(inputs) and the loss and deposits a gradient into every parameter’s .grad. optimizer.step() is the learning itself: it reads each .grad and nudges the parameter in the loss-reducing direction, scaled by the learning rate. Adam and SGD differ only in how they turn gradients into updates; both consume the .grad fields backward() produced.
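To show that optimizer.step() only consumes the .grad fields, the sketch below performs one plain SGD update by hand and compares it with torch.optim.SGD. It assumes PyTorch and default SGD settings (no momentum, no weight decay); the helper `loss_fn` and the tensor names are hypothetical.

```python
import torch

torch.manual_seed(0)
lr = 0.1

# Two copies of the same starting weights.
w_manual = torch.randn(3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)
optimizer = torch.optim.SGD([w_optim], lr=lr)


def loss_fn(w: torch.Tensor) -> torch.Tensor:
    return (w ** 2).sum()


# Manual update: read .grad, step against it, outside of graph recording.
loss_fn(w_manual).backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# Optimizer update: the same arithmetic behind one method call.
loss_fn(w_optim).backward()
optimizer.step()

print(torch.allclose(w_manual, w_optim))   # True
```

Adam would fail this comparison, by design: it keeps running statistics of past gradients and scales each update by them, but it still reads the same .grad values.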

Now the bug. Gradients accumulate by default — each .backward() adds into .grad rather than replacing it. That is deliberate (it lets you accumulate across micro-batches to simulate a larger batch than fits in memory), but it means that if you forget optimizer.zero_grad(), the gradient from step one is still in .grad when step two adds to it, and your updates are computed from a polluted sum of every batch you have ever seen. Nothing crashes; the loss simply fails to descend, and you stare at it exactly as the researcher stared at her hand-coded gradient. A training loop can be syntactically perfect and semantically wrong, and the framework will not tell you. Zeroing gradients at the top of every step is the one habit that prevents the most common version of that failure.
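The accumulation behaviour is easy to observe on a single weight. This is a minimal sketch assuming PyTorch; the loss is a hypothetical one whose gradient is always 2, so any other value in .grad is leftover from earlier steps.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# Three backward passes with no zeroing: .grad holds the running sum.
for _ in range(3):
    (2 * w).sum().backward()
accumulated = w.grad.clone()
print(accumulated)        # tensor([6.])

# Zero before each backward: .grad holds only the latest gradient.
for _ in range(3):
    w.grad.zero_()
    (2 * w).sum().backward()
fresh = w.grad.clone()
print(fresh)              # tensor([2.])
```

In a real loop, optimizer.zero_grad() does the clearing for every parameter the optimizer owns.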

War story: the loop that ran for a week and learned nothing

A team kicked off a multi-day run on a fresh model and a slightly refactored loop. It ran (GPUs pegged, loss curves drew themselves, checkpoints landed on schedule) but validation accuracy hovered at chance. Forward pass correct, data correct, architecture matching a published baseline. The defect was one moved line: optimizer.zero_grad() had drifted outside the batch loop, so it ran once per epoch instead of once per step. Gradients kept accumulating across the epoch, and each update nudged the weights with a sum over every batch seen so far, up to thousands of them: enormous, stale, meaningless. No exception, no warning, no NaN; the only symptom was a flat accuracy curve, which looks exactly like “the task is just hard.” In deep learning the compiler cannot save you, because the program is correct and it is the math that is wrong. The defenses are discipline (zero the gradients, pin the device) and verification (confirm the loss drops on a single batch before launching the week-long run).
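The single-batch verification mentioned above can be sketched as follows, assuming PyTorch. The model, batch, step count, and the "loss must halve" threshold are all hypothetical choices for illustration; the idea is only that a healthy loop should be able to memorize one small fixed batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and one fixed batch (random data is fine for this check).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(16, 10)
labels = torch.randint(0, 2, (16,))

model.train()
losses = []
for _ in range(200):
    optimizer.zero_grad()                    # the line that must not drift
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
assert losses[-1] < 0.5 * losses[0], "loss did not drop on one batch: check the loop"
```

A check like this runs in seconds and fails loudly, which is exactly what the week-long run could not do.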

Build it → To see autograd and the loop with no framework underneath them, build the engine yourself in Project 35: Differentiable Programming, a from-scratch reverse-mode autodiff library — tensors, a recorded graph, backward(), and SGD/Adam, in pure Python with no PyTorch or TensorFlow underneath. It is the fastest way to make the mechanism in this section stop being magic.

Building models from differentiable blocks

A model is a composition of differentiable blocks, and the framework gives you a container for them. In PyTorch that container is nn.Module: you subclass it, declare layers as attributes in __init__, and define the forward pass in a forward method. The base class does the tedious bookkeeping — it registers every layer’s parameters so the optimizer finds them with model.parameters(), moves them all to a device with one model.to(device) call, switches the model between training and evaluation behavior, and serializes the weights. You never write a backward method; autograd derives it from your forward.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, in_features: int, hidden: int, n_classes: int) -> None:
        super().__init__()                     # sets up parameter registration; must run before layers are assigned
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))            # block 1: linear then nonlinearity
        return self.fc2(x)                     # block 2: linear to logits

The forward method reads like ordinary math because that is what it is: a pipeline of differentiable functions. Each nn.Linear holds a weight matrix and a bias, both nn.Parameter objects (the framework’s name for a learnable tensor: it tracks gradients and is collected by model.parameters()). The super().__init__() call is not ceremony; it installs the registration machinery, and omitting it makes the first layer assignment fail with an AttributeError. The quieter trap is keeping layers in a plain Python list instead of an nn.ModuleList: the layers exist, but their parameters are invisible to the optimizer, and you get a model that looks complete and trains nothing.
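Parameter registration can be checked by counting what model.parameters() returns. The sketch below assumes PyTorch; the three classes are hypothetical and define no forward method, since only registration is being demonstrated.

```python
import torch.nn as nn


class PlainList(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # A plain Python list hides the layers from nn.Module's bookkeeping.
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]


class Registered(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # nn.ModuleList registers each layer, and with it each parameter.
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])


class NoSuper(nn.Module):
    def __init__(self) -> None:
        # super().__init__() is missing, so assigning a layer fails.
        self.fc = nn.Linear(4, 2)


def count(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


print(count(PlainList()))     # 0
print(count(Registered()))    # 30

try:
    NoSuper()
    raised = False
except AttributeError as exc:
    raised = True
    print(f"AttributeError: {exc}")
```

Counting parameters right after constructing a model is a cheap sanity check: a total of zero, or one far below what the architecture implies, means something was never registered.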

When the built-in blocks do not cover what you need (a layer from a paper not yet in PyTorch, an operation with custom gradient behavior) you write a custom layer. Usually a custom nn.Module suffices: define new nn.Parameters, write the forward, let autograd handle the backward. Occasionally you must define the backward yourself, by subclassing torch.autograd.Function and implementing both forward and backward: a straight-through estimator, say, that passes a non-differentiable operation forward but supplies a surrogate gradient backward. This is the one place you re-enter the researcher’s world of hand-written gradients, and the one place her debugging trick is mandatory: gradient-check against a finite-difference estimate (torch.autograd.gradcheck, which expects double-precision inputs) before you trust it, because a subtly wrong custom backward trains silently and badly in exactly the way that cost her a day. For a deliberately surrogate gradient such as the straight-through estimator, finite differences measure the true derivative rather than the surrogate, so check the backward against the surrogate you intended instead.
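A hand-written backward and its gradient check might look like the sketch below, assuming PyTorch. `Cube` and `BuggyCube` are hypothetical examples with a true derivative (not a surrogate), so a finite-difference check applies; the second class carries a deliberate wrong factor to show what a failed check looks like.

```python
import torch


class Cube(torch.autograd.Function):
    """y = x^3 with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)          # keep what backward will need
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2   # chain rule: upstream * dy/dx


class BuggyCube(torch.autograd.Function):
    """Same forward, but the backward has the wrong constant."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x ** 2   # deliberate bug: should be 3


torch.manual_seed(0)
# gradcheck compares against finite differences, so use double precision.
x = torch.randn(4, dtype=torch.double, requires_grad=True)

good = torch.autograd.gradcheck(Cube.apply, (x,))
bad = torch.autograd.gradcheck(BuggyCube.apply, (x,), raise_exception=False)
print(good, bad)                          # True False
```

Both classes would train without raising anything; only the check distinguishes them.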

Build it → Eager execution and dynamic, build-by-run graphs — the define-by-run model from the inside — are the subject of Project 38: Dynamic Graph Execution, which implements a PyTorch-style framework where the graph is constructed on the fly as operations run, with autograd and graph optimization layered on top.

Architectures as composition

Once you see a model as a composition of differentiable blocks, the famous architectures stop being separate subjects and become patterns of composition, each carrying an inductive bias — an assumption about the data baked into the connectivity. A fully-connected network connects everything to everything and assumes no structure. A convolutional network (CNN) slides a small shared filter across a grid, assuming features are local and translation-invariant — the same edge detector works in any corner of an image — which is why CNNs dominate vision. A recurrent network (RNN), and its gated descendants LSTM and GRU, carries a hidden state forward through a sequence one step at a time, assuming the data is ordered and recent context matters. Each is the same machinery — differentiable blocks, autograd, the universal loop — wired into a different topology to match a different shape of data.
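The inductive biases above show up directly in shapes and parameter counts. This is a small sketch assuming PyTorch; the layer sizes are arbitrary illustrative choices, and `n_params` is a hypothetical helper.

```python
import torch
import torch.nn as nn


def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


# Vision: one small filter bank shared across every position of the image.
images = torch.randn(8, 3, 32, 32)                # [batch, channels, H, W]
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
dense = nn.Linear(3 * 32 * 32, 16)                # no structure assumed

print(conv(images).shape)                         # torch.Size([8, 16, 32, 32])
print(n_params(conv), n_params(dense))            # 448 49168

# Sequences: a hidden state carried forward one step at a time.
sequences = torch.randn(8, 5, 10)                 # [batch, seq_len, features]
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
outputs, last_hidden = gru(sequences)

print(outputs.shape)                              # torch.Size([8, 5, 20])
print(last_hidden.shape)                          # torch.Size([1, 8, 20])
```

The convolution produces sixteen full-resolution feature maps with a few hundred weights because the same 3x3 filters are reused at every position; the dense layer needs roughly a hundred times more to produce sixteen numbers.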

That is as far as this chapter goes into specific architectures, and the boundary is deliberate. The theory of transformers, attention, and large language models — what query-key-value attention computes, why it scales, how LLMs are trained and aligned — lives in the companion AI Engineering book, not here. This chapter is the framework-and-engineering view: how the machine that runs those architectures works. When you reach for a transformer, you will assemble it from the same differentiable blocks and train it with the same five-step loop you learned here; the architecture-specific theory is what the other book supplies.

Eager versus graph, and compilation

The Introduction sketched the eager/graph distinction; here is what it costs and buys. Eager execution is the better way to write and debug, because code runs line by line and you can inspect anything. But running operation-by-operation through the Python interpreter leaves performance on the table: each op dispatches separately, the interpreter sits in the hot path, and the framework cannot see far enough ahead to fuse adjacent operations.

Compilation reclaims that performance without forcing you back into define-then-run. PyTorch’s torch.compile traces your eager model into a graph, optimizes it — fusing operations, specializing for the shapes it sees — and runs the compiled version, often a substantial speedup, from a single added line. TensorFlow’s @tf.function traces a Python function the same way, and XLA (the compiler shared across TensorFlow and JAX, reachable from PyTorch) goes further with fusion especially potent on TPUs. The mental model is clean: develop eager where debugging is easy, compile for production where speed matters. The one trap is retracing — a compiled function rebuilds its graph on each new input signature, so calling it in a loop with varying Python (rather than tensor) arguments can silently recompile every iteration and run slower than eager. Feed it stable, tensor-typed inputs so it traces once and reuses the graph.
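The "single added line" might look like the sketch below, assuming PyTorch 2.x. The model is a hypothetical stand-in, and the try/except fallback is an assumption added for portability: torch.compile needs a working compiler backend, which not every environment provides. The first call to the compiled model is slow because that is when tracing and compilation happen.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
x = torch.randn(8, 16)

with torch.no_grad():
    eager_out = model(x)                         # ordinary eager execution
    try:
        compiled_model = torch.compile(model)    # the single added line
        compiled_out = compiled_model(x)         # first call traces and compiles
    except Exception as exc:
        # Assumption: fall back to eager if compilation is unavailable here.
        print(f"compilation unavailable, staying eager: {exc!r}")
        compiled_out = model(x)

# The compiled model should agree with eager up to small numerical noise.
print(torch.allclose(eager_out, compiled_out, atol=1e-4))
```

Keeping the input shape and dtype stable across calls lets the compiled graph be reused; changing them is what risks the recompilation described above.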

Build it → Scaling autograd past one device — synchronizing gradients across many GPUs — is what Project 40: Distributed Autograd implements: data-parallel (DDP) and sharded (FSDP) training and RPC-based autograd across network boundaries, the engine behind training a single model on a cluster.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Write the loop and explain it. Implement the canonical training loop for a small network on a toy dataset and get the loss to drop. Write one sentence per line explaining each of the five steps — zero_grad, forward, loss, backward, step — with attention to why zero_grad() comes first. Then delete the zero_grad() line, rerun, and describe concretely what goes wrong and why the loss curve changes as it does.
Level II — Verify autograd and move it to the GPU. Pick a small scalar function and compute its gradient two ways: with the framework’s autodiff (.backward()) and with a finite-difference estimate (f(x + h) - f(x - h)) / (2h). Show the two agree to several digits — the check the researcher used to find her bug. Then move your Level I loop onto a GPU correctly: model and every batch on the same device, no stray CPU/GPU transfers in the hot loop. Confirm by timing that step time improved and no device-mismatch error appears.
Level III — Write a custom differentiable layer, and reason about the graph. Implement a custom layer by subclassing torch.autograd.Function with your own forward and backward (a straight-through estimator or gradient-reversal layer is a good target) and verify its backward before trusting it: use torch.autograd.gradcheck where the backward is the true derivative, and a direct comparison against the intended surrogate where it is not, since finite differences only ever see the true derivative. Then analyze: which tensors does autograd record in the forward pass, what does it replay on the backward pass, and where, for this layer, does torch.compile (or graph mode) change performance versus eager execution, and why?

Summary

A deep learning framework is two things: a tensor library that runs on accelerators and an automatic differentiation engine. Tensors are device-aware n-dimensional arrays, and the cardinal rule is to keep them all on one device. Autodiff is the reason the framework exists: it records the forward pass as a computational graph and replays it backward to compute every gradient automatically and exactly up to floating-point rounding, so the long, error-prone backward pass once derived by hand is now a single .backward() call. That makes the training loop universal: zero-grad, forward, loss, backward, step is the same skeleton for every model, with the model’s individuality living inside a forward pass composed of differentiable blocks. Architectures (CNNs, RNNs, and beyond) are patterns of those blocks chosen to match the shape of the data, and the eager-versus-graph choice trades debuggability for speed, a trade-off that compilation now largely removes.

Key takeaways

A framework automates exactly one hard thing — the gradient — via reverse-mode autodiff (backpropagation): you write the forward pass, the framework records it and differentiates it for you.
A tensor is a device-aware array; a CPU/GPU device mismatch raises an error, while needless transfers between devices silently slow training, so keep model and data on the same device and create tensors there.
The training loop is universal (zero_grad, forward, loss, backward, step) and the order matters; gradients accumulate by default, so forgetting zero_grad() breaks training without raising an error.
Models compose from differentiable blocks (nn.Module, nn.Parameter); custom layers with hand-written backward passes must be gradient-checked against finite differences before you trust them.
Eager execution is for writing and debugging; graph compilation (torch.compile, XLA) is for speed — beware silent retracing from unstable, non-tensor inputs.

Connections to other chapters

Machine Learning Foundations (prerequisite): where loss functions and gradient descent come from, and the framing this chapter rests on — in classical ML you engineer features and the model is shallow; in deep learning the network learns the features, which is exactly what the autograd-driven loop makes possible.
Distributed Training (extension): everything here trains one model on one device. The next step scales the same loop across many GPUs and machines — data and model parallelism, gradient synchronization — the autograd engine of this chapter stretched across a network, the territory Project 40 explores in miniature.
GPU Programming & CUDA (foundation underneath): the matrix multiplies and convolutions that feel free here are hand-optimized CUDA kernels. This chapter consumes them; that chapter writes them, and explains why keeping tensors on the device matters so much.
Performance and Profiling (prerequisite): a framework is a Python front-end over native and CUDA code, which is why eager Python feels ergonomic while the arithmetic stays fast. The cross-language optimization instincts from that chapter — vectorize, keep the interpreter out of hot loops, push work into native kernels — are what device discipline and torch.compile operationalize.

A boundary worth restating: the theory of specific modern architectures — transformer attention, and the training and alignment of large language models — is deliberately out of scope here and belongs to the companion AI Engineering book. This chapter teaches the framework that runs them; that book teaches what they compute.

Goodfellow, Bengio & Courville, Deep Learning (MIT Press, 2016) — the standard reference; its backpropagation chapter is the conceptual backbone of everything here.
Stevens, Antiga & Viehmann, Deep Learning with PyTorch (Manning, 2020), plus the official PyTorch docs — the most direct path from these concepts to fluent framework code.

Deep dives

Baydin et al., “Automatic Differentiation in Machine Learning: A Survey” (JMLR, 2018) — the definitive map of the autodiff landscape, and why reverse mode is the right choice with many parameters and one scalar loss.
The TensorFlow and JAX documentation on tf.function, GradientTape, and XLA — the graph-execution and compilation side, in the frameworks that pioneered it.

Historical context

Rumelhart, Hinton & Williams, “Learning representations by back-propagating errors” (Nature, 1986) — the paper that put backpropagation on the map; the autodiff engine is this algorithm, automated.
Linnainmaa (1970) and Werbos (1974) — the earlier independent derivations of reverse-mode differentiation, for tracing backprop to its origins before it had its deep-learning name.

--- title: "Deep Learning Frameworks" keywords: [deep learning, pytorch, tensorflow, autograd, automatic differentiation, training loop, tensors, gpu, computational graph, backpropagation] difficulty: advanced prerequisites: [ml-foundations, performance-and-profiling] estimated_time: "4-5 hours" --- ## Introduction The researcher had a three-layer network on the whiteboard and a deadline. To train it she needed the gradient of the loss with respect to every weight — and in 2012, before frameworks were a given, that meant deriving it by hand. She worked backward through the chain rule layer by layer, filling pages with partial derivatives, then transcribed each one into a loop of array updates. The forward pass took twenty lines. The backward pass took two hundred, and it was where every bug lived: a transposed matrix here, a missing factor of two there, a sign she flipped while copying from the second page to the third. None of these errors crashed anything. The network ran. It just didn't learn — the loss drifted sideways for a thousand iterations, because a gradient that is *almost* right points in almost the right direction, which is to say the wrong one. She found the bug eventually by recomputing each derivative numerically and comparing. The whole exercise had taken a day and produced one working gradient for one specific network. Change the architecture and she would do it all again. That is the problem deep learning frameworks were built to abolish. The forward pass — the actual model, the part with the ideas in it — is short and easy to get right. The backward pass is long, mechanical, and unforgiving, and it has to be rederived every time the architecture changes. 
A second failure mode is just as common and just as quiet: the loop that runs cleanly but never improves because the gradients from the last batch were never cleared and silently accumulated, or because the model sat on the GPU while a batch of data stayed on the CPU and every step paid for a needless transfer. In both cases nothing throws. The program is *wrong* in a way the type system cannot see. A deep learning framework exists to make one thing automatic — the gradient — and to move tensors fast on accelerators, so the part you get wrong by hand is the part you never write by hand.

### The Core Insight

A deep learning framework is, at its core, two things bolted together: a **tensor library that runs on accelerators** and an **automatic differentiation engine**. Everything else — layers, optimizers, data loaders, the model zoo — is convenience built on those two.

The tensor library is the easy half to describe and the hard half to build. A tensor is an n-dimensional array, like a NumPy array, but with two extra facts attached: which *device* it lives on (CPU, GPU, TPU) and whether the framework should *track gradients* through it. Operations on tensors — matrix multiplies, convolutions, activations — dispatch to hand-optimized native kernels (CUDA on NVIDIA hardware) that run thousands of multiply-accumulates in parallel. That is why a framework is "NumPy that runs on the GPU": the API feels like array math, but the arithmetic happens on silicon built for dense linear algebra at terabytes per second.

The autodiff engine is the half that earns the name *framework*. As your forward computation runs, the engine **records every operation into a graph** — input feeds a matrix multiply, whose output feeds an activation, and so on to a scalar loss. You wrote only that forward path. To get gradients, the engine walks the recorded graph *backward*, applying the chain rule at each node to accumulate the derivative of the loss with respect to every parameter.
This is **reverse-mode automatic differentiation**, and it is exactly backpropagation — the researcher's two hundred lines become one method call. It is neither numerical approximation nor symbolic algebra on your source; it is the chain rule applied mechanically to a trace of the operations that actually ran, so it is exact and works for *any* forward pass you can express, loops and branches included.

The payoff is that the **training loop is the same regardless of model**. Forward to a loss, backward to gradients, optimizer step to update the weights, repeat — that skeleton does not change whether the model is a two-layer perceptron or a billion-parameter network. Everything that differs between models lives inside the forward pass, as a composition of differentiable building blocks. Learn the loop once and you have learned how to train anything.

### A mental model

Three pictures, and the rest of this chapter is detail. A **tensor** is an n-dimensional array that knows which device it lives on — not just data, but data that remembers where it is and whether it matters to the gradient. The **computational graph** is a recording: as the forward pass runs, the framework writes down each operation and what fed it, then on `.backward()` plays the tape in reverse, so backpropagation is *assigning blame* — walking from the loss back toward the inputs, distributing responsibility for the error across every parameter, after which the optimizer nudges each in proportion to its blame. And the **framework as a whole** is one line: NumPy that runs on the GPU and remembers how to differentiate itself.

### When to use deep learning, and which framework

Two decisions precede any code. The first is whether to use deep learning *at all*. The honest answer: when your data is **unstructured** — images, audio, text, raw signals — and there is a **lot** of it.
Classical machine learning wins on small, tabular, well-feature-engineered datasets, where a gradient-boosted tree beats a neural network on accuracy, training time, and interpretability at once. Deep learning earns its considerable cost precisely where hand-engineering features is hopeless: you cannot write down the features that distinguish a cat from a dog in pixels, so you let the network *learn* the representation from raw input. That is the trade — you give up interpretability and a small-data comfort zone to gain features no human would engineer. @fig-dl-training shows the loop that makes that learning happen.

The second decision is the framework, and it turns on one design axis: **eager / define-by-run** versus **graph / define-then-run**. In eager execution — PyTorch's default, and the research community's — operations run the instant the interpreter reaches them and the graph is built on the fly; your code *is* the computation, so you can `print` inside the forward pass, set breakpoints, and branch on a tensor's value with a plain Python `if`. In graph execution — TensorFlow 1.x's original model, alive today in `@tf.function` — you describe the whole computation as a static graph first, then hand it to a runtime to optimize and execute; harder to debug, easier to serialize and ship to a phone or TPU.

The practical default for new work, especially research and anything dynamic, is eager-mode PyTorch — it is what most papers ship and what the largest talent pool knows. Reach for the graph-first tools when deployment to constrained targets (mobile via TensorFlow Lite, the browser via TensorFlow.js) or TPU training at scale is the priority, where TensorFlow's lineage still leads. The distinction has softened either way: PyTorch compiles eager code with `torch.compile`, and the later *Eager versus graph* section returns to that.
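
The claim that eager code can branch on a tensor's value with a plain Python `if` can be illustrated with a minimal sketch. The function name `clipped_square` and its threshold are hypothetical, chosen only for illustration; the point is that autograd differentiates whichever branch actually ran.

```python
import torch


def clipped_square(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical toy function: branch on the tensor's runtime value.
    if x.item() > 1.0:       # plain Python `if`, evaluated eagerly
        return x ** 2        # recorded only when x > 1
    return 3 * x             # recorded otherwise


a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

clipped_square(a).backward()   # the x**2 branch ran: d/dx = 2x
clipped_square(b).backward()   # the 3*x branch ran:  d/dx = 3

print(a.grad, b.grad)
```
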
### What you'll learn

- How a tensor differs from a plain array — device placement, dtype, and gradient tracking — and the cardinal rule that keeps a training run from silently stalling
- How reverse-mode automatic differentiation records a computational graph in the forward pass and replays it backward to compute every gradient
- Why the training loop — forward, loss, zero-grad, backward, step — is universal, and what each step is actually doing
- How models are assembled from differentiable building blocks, and how a custom layer hooks into the autograd engine
- How the major architecture families (CNNs, RNNs) are just patterns of those blocks — and where the theory of transformers and LLMs lives instead
- How eager and graph execution differ, and how compilation (`torch.compile`, XLA) buys back performance without giving up the define-by-run feel

### Prerequisites

- **Machine Learning Foundations**: what a loss function is, what gradient descent does, the train/validate split, and why overfitting is the enemy
- **Performance and Profiling**: the cross-language chapter on why code is slow and how to speed it up — in particular why pure-Python loops are slow, what vectorization buys you, and the idea that a fast Python library is usually a thin front-end over native code
- Comfort with linear algebra at the level of matrix multiplication and the chain rule, and with NumPy-style array operations

---

## Tensors and devices

Start with the tensor, because everything else operates on it. A tensor is an n-dimensional array — the same idea as a NumPy array, with most of the same API: create from lists, fill with zeros or random values, index, slice, and run element-wise and matrix operations. The shape conventions recur everywhere and are worth memorizing: `[batch, channels, height, width]` for images, `[batch, sequence_length, features]` for sequences, `[batch, features]` for tabular data.
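
As a small illustration of the array-style API and the shape conventions just listed, here is a sketch in PyTorch. The sizes are arbitrary placeholders, not values taken from the chapter.

```python
import torch

images = torch.zeros(8, 3, 32, 32)               # [batch, channels, height, width]
sequences = torch.randn(8, 100, 64)              # [batch, sequence_length, features]
table = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # [batch, features], built from lists

first_image = images[0]                 # indexing drops the batch dimension
last_steps = sequences[:, -10:, :]      # slicing keeps the other dimensions
doubled = table * 2                     # element-wise operation
gram = table @ table.T                  # matrix multiplication

print(first_image.shape, last_steps.shape, gram.shape)
```
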
The batch dimension comes first almost universally, because the framework processes many examples at once to keep the accelerator busy.

What makes a tensor more than an array is its metadata, and two pieces are load-bearing. The first is **dtype** — `float32` is the default; `float16` and `bfloat16` halve the memory and can roughly double throughput on modern GPUs, a real performance lever for large models. The second, the one that bites beginners, is **device**: every tensor lives somewhere specific, on the CPU or a particular GPU, with its data physically in that device's memory.

This yields the single most important operational rule in the whole framework: **keep everything on one device.** A model's parameters live on a device; the input batch must live on the same one. An operation between a CPU tensor and a GPU tensor does not quietly do the right thing — at best it errors (`expected all tensors to be on the same device`, a message every practitioner learns to read on sight). The fix is to move data to the model's device at the top of every step; the efficient habit is to *create* tensors on the target device rather than building them on the CPU and copying, because the CPU-to-GPU transfer is slow and a loop that crosses that boundary needlessly spends more time shuffling bytes than computing. Below is the canonical device dance — pick the device once, move the model there, move each batch there as it arrives.

```python
import torch

# `model` and `loader` are assumed to be defined elsewhere.
# Pick the accelerator if one exists; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)              # parameters now live on `device`

for inputs, labels in loader:
    inputs = inputs.to(device)        # batch must join the model
    labels = labels.to(device)        # ...so must the labels
```

The same idea appears in TensorFlow, where placement is more automatic but the principle is identical.
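
The create-on-device habit described earlier in this section can be sketched as follows. The tensor sizes are arbitrary placeholders, and the snippet falls back to the CPU when no GPU is present, in which case both paths land on the same device and only the pattern is illustrated.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Slower habit: allocate on the CPU, then copy across to the device.
copied = torch.zeros(1024, 1024).to(device)

# Preferred habit: allocate directly on the target device, no copy needed.
direct = torch.zeros(1024, 1024, device=device)

# Tensors derived from an existing one inherit its device and dtype.
like = torch.ones_like(direct)

print(copied.device, direct.device, like.device)
```
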
Device discipline is not skippable: it is the difference between a loop that runs and one that runs *fast*, and occasionally the difference between a loop that runs at all.

## Automatic differentiation: the centerpiece

This is the heart of the framework and the reason it exists, so it is worth slowing down. Autodiff replaces the researcher's hand-derived gradients with bookkeeping the framework does for you, and the mechanism is simpler than it sounds.

When you operate on a tensor marked to track gradients — in PyTorch, one with `requires_grad=True`, which model parameters have automatically — the framework records that operation into a directed graph: nodes are tensors, edges are the operations that produced them. The graph grows one edge per operation as the forward pass runs, until you reach a scalar loss. Nothing has been differentiated yet; the framework has merely written down what happened, in order. This is the *tape*, or the *computational graph*, and building it is the only cost the forward pass pays to be differentiable.

The magic is in `.backward()`. Call it on the loss and the framework walks the recorded graph in reverse, from the loss back toward the inputs. At each node it knows the local derivative of that operation — of a matrix multiply, a ReLU, an addition, all in closed form — and multiplies these together along the way, which is precisely the chain rule. The result, accumulated into each parameter's `.grad`, is the gradient of the loss with respect to that parameter. This is reverse-mode automatic differentiation; in deep learning it goes by the more familiar name *backpropagation*. A tiny example makes it concrete: evaluate a small quadratic on a tensor, ask for the gradient, and get back the analytic derivative without writing it.
```python
import torch

x = torch.tensor([2.0], requires_grad=True)   # track gradients for x
y = x ** 2 + 3 * x + 1                        # forward pass; graph is recorded
y.backward()                                  # walk the graph backward
print(x.grad)                                 # tensor([7.]) — exactly dy/dx = 2x+3 at x=2
```

No derivative was derived. The framework knew the local rule for squaring, scaling, and addition, and composed them. Scale this from one scalar to a network's millions of parameters and the loop is identical: the forward pass records the graph, `.backward()` replays it, every `.grad` is filled in exactly.

Three properties make this powerful enough to have remade a field. It is **exact**, not a finite-difference approximation, so the gradient carries no truncation error, only ordinary floating-point rounding. It is **general**, working for any forward pass — including ones with `if` statements and `for` loops — because it differentiates the operations that *actually ran* rather than analyzing your source. And it is **automatic**: you write the forward computation and the gradient comes free.

@fig-dl-training puts the pieces in one frame — the forward pass building the graph, the backward pass walking it in reverse to compute gradients, the optimizer closing the loop — under one annotation worth tattooing on it: *you write only the forward pass; the framework records the graph and differentiates it automatically.*

![The deep-learning training loop: the forward pass runs input through the layers to a loss while the framework records every operation as a computational graph; reverse-mode automatic differentiation then walks that graph backward to compute each parameter's gradient (backpropagation); and the optimizer updates the parameters.
You write only the forward pass — the gradients are automatic — and the tensors run on an accelerator.](../assets/diagrams/rendered/dl_training_loop.svg){#fig-dl-training .lightbox}

TensorFlow expresses the same idea with different ergonomics: you record operations inside a `tf.GradientTape` context, then call `tape.gradient(loss, vars)`. The tape is explicit rather than implicit and watches `tf.Variable` objects by default — a common trap is asking for the gradient of a `tf.constant` and getting `None` because the tape was never told to watch it. The surface differs; the algorithm underneath is the same reverse-mode autodiff.

## The training loop

With autodiff understood, the training loop is almost anticlimactic — which is the point. Once the batch is on the device, it is the same five steps for every model. Zero the gradients from the previous step. Run the forward pass to get predictions. Compute the loss. Call `.backward()` to fill in the gradients. Call `.step()` to let the optimizer update the parameters. Loop. The order is not negotiable, and one step in particular is the source of the most infamous silent bug in deep learning, which we will come to.

```python
model.train()                                # enable dropout, batchnorm in train mode
for inputs, labels in loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()                    # clear last step's gradients (see below)
    outputs = model(inputs)                  # forward: build the graph
    loss = criterion(outputs, labels)        # how wrong are we?
    loss.backward()                          # backward: fill every .grad
    optimizer.step()                         # update params (plain SGD: param -= lr * param.grad)
```

Two lines deserve a closer look. `loss.backward()` is the autodiff engine invoked: it walks the graph built by `model(inputs)` and the loss and deposits a gradient into every parameter's `.grad`. `optimizer.step()` is the learning itself: it reads each `.grad` and nudges the parameter in the loss-reducing direction, scaled by the learning rate.
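
To make the description of `optimizer.step()` concrete, the sketch below applies one plain SGD update twice: once through `torch.optim.SGD` and once by hand. The variable names and the toy loss are illustrative assumptions, and the equivalence shown holds only for SGD without momentum or weight decay.

```python
import torch

torch.manual_seed(0)
w_manual = torch.randn(3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)   # identical starting point
x = torch.tensor([1.0, 2.0, 3.0])
lr = 0.1

# Optimizer path: backward fills .grad, step applies the update.
optimizer = torch.optim.SGD([w_optim], lr=lr)
optimizer.zero_grad()
loss = ((w_optim * x).sum() - 1.0) ** 2
loss.backward()
optimizer.step()

# Manual path: the same update written out by hand.
loss = ((w_manual * x).sum() - 1.0) ** 2
loss.backward()
with torch.no_grad():                  # the update itself must not be recorded
    w_manual -= lr * w_manual.grad     # plain SGD: param -= lr * grad

print(torch.allclose(w_manual, w_optim))
```
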
Adam and SGD differ only in *how* they turn gradients into updates; both consume the `.grad` fields `backward()` produced.

Now the bug. Gradients **accumulate** by default — each `.backward()` *adds* into `.grad` rather than replacing it. That is deliberate (it lets you accumulate across micro-batches to simulate a larger batch than fits in memory), but it means that if you forget `optimizer.zero_grad()`, the gradient from step one is still in `.grad` when step two adds to it, and your updates are computed from a polluted sum of every batch you have ever seen. Nothing crashes; the loss simply fails to descend, and you stare at it exactly as the researcher stared at her hand-coded gradient. A training loop can be syntactically perfect and semantically wrong, and the framework will not tell you. Zeroing gradients at the top of every step is the one habit that prevents the most common version of that failure.

::: {.callout-warning}
## War story: the loop that ran for a week and learned nothing

A team kicked off a multi-day run on a fresh model and a slightly refactored loop. It ran — GPUs pegged, loss curves drew themselves, checkpoints landed on schedule — but validation accuracy hovered at chance. Forward pass correct, data correct, architecture matching a published baseline. The defect was one moved line: `optimizer.zero_grad()` had drifted *outside* the batch loop, so gradients kept accumulating across the entire epoch, and every step nudged the weights with a running sum over every batch seen so far: enormous, stale, meaningless. No exception, no warning, no NaN; the only symptom was a flat accuracy curve, which looks exactly like "the task is just hard." In deep learning the compiler cannot save you, because the program is *correct* — it is the *math* that is wrong. The defenses are discipline (zero the gradients, pin the device) and verification (confirm the loss drops on a single batch before launching the week-long run).
:::

> **Build it →** To see autograd and the loop with no framework underneath them, build
> the engine yourself in
> [Project 35: Differentiable Programming](https://github.com/jchu0/applied-cs-projects/tree/main/35-differentiable-programming),
> a from-scratch reverse-mode autodiff library — tensors, a recorded graph, `backward()`,
> and SGD/Adam, in pure Python with no PyTorch or TensorFlow underneath. It is the fastest
> way to make the mechanism in this section stop being magic.

## Building models from differentiable blocks

A model is a composition of differentiable blocks, and the framework gives you a container for them. In PyTorch that container is `nn.Module`: you subclass it, declare layers as attributes in `__init__`, and define the forward pass in a `forward` method. The base class does the tedious bookkeeping — it *registers every layer's parameters* so the optimizer finds them with `model.parameters()`, moves them all to a device with one `model.to(device)` call, switches the model between training and evaluation behavior, and serializes the weights. You never write a `backward` method; autograd derives it from your `forward`.

```python
import torch
import torch.nn as nn


class SimpleNet(nn.Module):
    def __init__(self, in_features: int, hidden: int, n_classes: int) -> None:
        super().__init__()                 # sets up parameter registration; never skip this
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))        # block 1: linear then nonlinearity
        return self.fc2(x)                 # block 2: linear to logits
```

The forward method reads like ordinary math because that is what it is — a pipeline of differentiable functions. Each `nn.Linear` holds a weight matrix and a bias, both `nn.Parameter` objects (the framework's name for a learnable tensor: it tracks gradients and is collected by `model.parameters()`).
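
The registration bookkeeping can be observed directly. The sketch below uses a hypothetical `TinyNet` with arbitrary layer sizes as a stand-in for the model above, and lists what `named_parameters()` collects.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):                # hypothetical stand-in for the model above
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = nn.Linear(4, 8)       # weight [8, 4] plus bias [8]
        self.fc2 = nn.Linear(8, 3)       # weight [3, 8] plus bias [3]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))


net = TinyNet()
names = [name for name, _ in net.named_parameters()]
n_params = sum(p.numel() for p in net.parameters())

print(names)       # ['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias']
print(n_params)    # 4*8 + 8 + 8*3 + 3 = 67
```
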
The `super().__init__()` call is not ceremony; it installs the registration machinery, and omitting it makes the first layer assignment fail with an `AttributeError` (`cannot assign module before Module.__init__() call`). The quiet version of the same mistake is holding layers in a plain Python list instead of an `nn.ModuleList`: the layers exist, but their parameters are invisible to `model.parameters()` and therefore to the optimizer, so the model looks complete and part of it trains nothing.

When the built-in blocks do not cover what you need — a layer from a paper not yet in PyTorch, an operation with custom gradient behavior — you write a **custom layer**. Usually a custom `nn.Module` suffices: define new `nn.Parameter`s, write the `forward`, let autograd handle the backward. Occasionally you must define the backward *yourself*, by subclassing `torch.autograd.Function` and implementing both `forward` and `backward` — a straight-through estimator, say, that passes a non-differentiable operation forward but supplies a surrogate gradient backward. This is the one place you re-enter the researcher's world of hand-written gradients, and the one place her debugging trick is mandatory: gradient-check against a finite-difference estimate (`torch.autograd.gradcheck`) before you trust it. Returning the wrong number of gradients at least fails loudly; a subtly wrong custom backward trains silently and badly in exactly the way that cost her a day.

> **Build it →** Eager execution and dynamic, build-by-run graphs — the define-by-run
> model from the inside — are the subject of
> [Project 38: Dynamic Graph Execution](https://github.com/jchu0/applied-cs-projects/tree/main/38-dynamic-graph-execution),
> which implements a PyTorch-style framework where the graph is constructed on the fly as
> operations run, with autograd and graph optimization layered on top.

## Architectures as composition

Once you see a model as a composition of differentiable blocks, the famous architectures stop being separate subjects and become *patterns of composition*, each carrying an **inductive bias** — an assumption about the data baked into the connectivity. A fully-connected network connects everything to everything and assumes no structure.
A **convolutional network (CNN)** slides a small shared filter across a grid, assuming features are *local* and *translation-invariant* — the same edge detector works in any corner of an image — which is why CNNs dominate vision. A **recurrent network (RNN)**, and its gated descendants LSTM and GRU, carries a hidden state forward through a sequence one step at a time, assuming the data is ordered and recent context matters. Each is the same machinery — differentiable blocks, autograd, the universal loop — wired into a different topology to match a different shape of data.

That is as far as this chapter goes into specific architectures, and the boundary is deliberate. The theory of **transformers, attention, and large language models** — what query-key-value attention computes, why it scales, how LLMs are trained and aligned — lives in the companion *AI Engineering* book, not here. This chapter is the framework-and-engineering view: how the machine that *runs* those architectures works. When you reach for a transformer, you will assemble it from the same differentiable blocks and train it with the same five-step loop you learned here; the architecture-specific theory is what the other book supplies.

## Eager versus graph, and compilation

The Introduction sketched the eager/graph distinction; here is what it costs and buys. Eager execution is the better way to *write and debug*, because code runs line by line and you can inspect anything. But running operation-by-operation through the Python interpreter leaves performance on the table: each op dispatches separately, the interpreter sits in the hot path, and the framework cannot see far enough ahead to *fuse* adjacent operations.

Compilation reclaims that performance without forcing you back into define-then-run.
PyTorch's `torch.compile` traces your eager model into a graph, optimizes it — fusing operations, specializing for the shapes it sees — and runs the compiled version, often a substantial speedup, from a *single added line*. TensorFlow's `@tf.function` traces a Python function the same way, and **XLA** (the compiler shared across TensorFlow and JAX, reachable from PyTorch) goes further with fusion especially potent on TPUs. The mental model is clean: develop eager where debugging is easy, compile for production where speed matters. The one trap is **retracing** — a compiled function rebuilds its graph on each new input signature, so calling it in a loop with varying Python (rather than tensor) arguments can silently recompile every iteration and run *slower* than eager. Feed it stable, tensor-typed inputs so it traces once and reuses the graph.

> **Build it →** Scaling autograd past one device — synchronizing gradients across many
> GPUs — is what
> [Project 40: Distributed Autograd](https://github.com/jchu0/applied-cs-projects/tree/main/40-distributed-autograd)
> implements: data-parallel (DDP) and sharded (FSDP) training and RPC-based autograd
> across network boundaries, the engine behind training a single model on a cluster.

---

## Practical exercise

**Difficulty:** Level I · Level II · Level III

1. **Level I — Write the loop and explain it.** Implement the canonical training loop for a small network on a toy dataset and get the loss to drop. Write one sentence per line explaining each of the five steps — `zero_grad`, forward, loss, `backward`, `step` — with attention to *why* `zero_grad()` comes first. Then delete the `zero_grad()` line, rerun, and describe concretely what goes wrong and why the loss curve changes as it does.
2. **Level II — Verify autograd and move it to the GPU.** Pick a small scalar function and compute its gradient two ways: with the framework's autodiff (`.backward()`) and with a finite-difference estimate `(f(x + h) - f(x - h)) / (2h)`.
   Show the two agree to several digits — the check the researcher used to find her bug. Then move your Level I loop onto a GPU correctly: model and every batch on the same device, no stray CPU/GPU transfers in the hot loop. Confirm by timing that step time improved and no device-mismatch error appears.
3. **Level III — Write a custom differentiable layer, and reason about the graph.** Implement a custom layer by subclassing `torch.autograd.Function` with your own `forward` and `backward` — a straight-through estimator or gradient-reversal layer is a good target — and gradient-check it before trusting it. Then analyze: which tensors does autograd record in the forward pass, what does it replay on the backward pass, and *where*, for this layer, does `torch.compile` (or graph mode) change performance versus eager execution, and why?

## Summary

A deep learning framework is two things: a tensor library that runs on accelerators and an automatic differentiation engine. Tensors are device-aware n-dimensional arrays, and the cardinal rule is to keep them all on one device. Autodiff is the reason the framework exists — it records the forward pass as a computational graph and replays it backward to compute every gradient exactly and automatically, so the long, error-prone backward pass once derived by hand is now a single `.backward()` call. That makes the training loop universal: forward, loss, zero-grad, backward, step is the same skeleton for every model, with the model's individuality living inside a `forward` pass composed of differentiable blocks. Architectures (CNNs, RNNs, and beyond) are patterns of those blocks chosen to match the shape of the data, and the eager-versus-graph choice trades debuggability for speed — a trade compilation now largely lets you have both ways.
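
The zero-grad step in that skeleton exists because gradients accumulate. A minimal sketch of that behaviour on a single scalar parameter, with a toy function chosen only for illustration:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()             # d/dw of 3w is 3
first = w.grad.item()

(3 * w).backward()             # without clearing, the new 3 is added to the old 3
accumulated = w.grad.item()

# Clear the gradient by hand; optimizer.zero_grad() does the equivalent for
# every parameter (recent PyTorch versions reset .grad to None by default).
w.grad.zero_()
(3 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```
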
### Key takeaways

- A framework automates exactly one hard thing — the gradient — via reverse-mode autodiff (backpropagation): you write the forward pass, the framework records it and differentiates it for you.
- A tensor is a device-aware array; the most common silent failure is a CPU/GPU device mismatch, so keep model and data on the same device and create tensors there.
- The training loop is universal — forward, loss, `zero_grad`, backward, `step` — and the order matters; gradients accumulate by default, so forgetting `zero_grad()` breaks training without raising an error.
- Models compose from differentiable blocks (`nn.Module`, `nn.Parameter`); custom layers with hand-written backward passes must be gradient-checked against finite differences before you trust them.
- Eager execution is for writing and debugging; graph compilation (`torch.compile`, XLA) is for speed — beware silent retracing from unstable, non-tensor inputs.

### Connections to other chapters

- **Machine Learning Foundations** (prerequisite): where loss functions and gradient descent come from, and the framing this chapter rests on — in classical ML you *engineer* features and the model is shallow; in deep learning the network *learns* the features, which is exactly what the autograd-driven loop makes possible.
- **Distributed Training** (extension): everything here trains one model on one device. The next step scales the same loop across many GPUs and machines — data and model parallelism, gradient synchronization — the autograd engine of this chapter stretched across a network, the territory Project 40 explores in miniature.
- **GPU Programming & CUDA** (foundation underneath): the matrix multiplies and convolutions that feel free here are hand-optimized CUDA kernels. This chapter consumes them; that chapter writes them, and explains *why* keeping tensors on the device matters so much.
- **Performance and Profiling** (prerequisite): a framework is a Python front-end over native and CUDA code, which is why eager Python feels ergonomic while the arithmetic stays fast. The cross-language optimization instincts from that chapter — vectorize, keep the interpreter out of hot loops, push work into native kernels — are what `device` discipline and `torch.compile` operationalize.

A boundary worth restating: the *theory* of specific modern architectures — transformer attention, and the training and alignment of large language models — is deliberately out of scope here and belongs to the companion *AI Engineering* book. This chapter teaches the framework that runs them; that book teaches what they compute.

## Further reading

### Essential

- Goodfellow, Bengio & Courville, *Deep Learning* (MIT Press, 2016) — the standard reference; its backpropagation chapter is the conceptual backbone of everything here.
- Stevens, Antiga & Viehmann, *Deep Learning with PyTorch* (Manning, 2020), plus the official PyTorch docs — the most direct path from these concepts to fluent framework code.

### Deep dives

- Baydin et al., *"Automatic Differentiation in Machine Learning: A Survey"* (JMLR, 2018) — the definitive map of the autodiff landscape, and why reverse mode is the right choice with many parameters and one scalar loss.
- The TensorFlow and JAX documentation on `tf.function`, `GradientTape`, and XLA — the graph-execution and compilation side, in the frameworks that pioneered it.

### Historical context

- Rumelhart, Hinton & Williams, *"Learning representations by back-propagating errors"* (Nature, 1986) — the paper that put backpropagation on the map; the autodiff engine is this algorithm, automated.
- Linnainmaa (1970) and Werbos (1974) — the earlier independent derivations of reverse-mode differentiation, for tracing backprop to its origins before it had its deep-learning name.