Containerization with Docker

Keywords

docker, containers, images, dockerfile, multi-stage builds, layers, oci, registries, docker-compose

Introduction

The service passed every test on the engineer’s laptop, sailed through CI, and went out on a Friday. By Saturday it was crash-looping in production. The logs showed an undefined symbol deep inside a native dependency — a library that loaded fine in development and refused to load in prod. The two machines were, on paper, identical: same code, same version pin, same config. They differed in exactly one thing nobody had written down. The laptop ran a newer glibc than the production base image, and the dependency had been compiled against the newer one. The application had never been wrong; its environment had drifted, silently, and the test suite had no way to see it.

This is the oldest failure in deployment, and “works on my machine” is its catchphrase. The thing you tested and the thing you shipped were never the same thing. They shared source code but not a runtime — a different libc, a different Python patch release, a system package present here and absent there, an environment variable set in one shell and not the other. Version pins capture the names of your dependencies; they do not capture the operating system underneath them. The gap between “the code I wrote” and “the world that code runs in” is where deploys go to die. Containerization closes that gap by shipping the world along with the code.

The Core Insight

A language runtime, a requirements.txt, a list of pinned versions — these describe the application. They say nothing about the userland the application sits on: the C library, the system shared objects, the locale data, the CA certificates, the exact shape of /usr. That userland is the part that drifts, because it belongs to the host, and the host is whatever someone provisioned months ago. The insight behind containers is to stop treating the environment as a given and start treating it as an artifact — to package the application and its entire userland into one immutable image that runs identically wherever it lands.

A container does this without paying the price of a virtual machine. A VM gives every workload its own emulated hardware and a full guest kernel, which is heavy in three specific ways:

Memory and disk. Each VM carries a complete OS image — gigabytes of kernel, init system, and base packages — duplicated per instance.
Boot time. A guest kernel has to come up before your process can. Seconds to tens of seconds, every time you scale out.
Density. Hypervisor overhead and per-VM RAM floors cap how many workloads fit on a box, which is money on every node.

A container skips all of it by sharing the host kernel and isolating only what needs isolating. Linux namespaces give the process its own private view of process IDs, the filesystem mount tree, the network stack, and user IDs, so it sees itself as PID 1 in an empty world. Linux cgroups cap and account for the CPU, memory, and I/O it can consume. There is no second kernel and no emulated hardware. The container is just your process, fenced off and resource-limited, running on the same kernel as everything else. That is why it typically starts in well under a second, and why a lean image can weigh tens of megabytes instead of gigabytes.

A mental model

Think of a container image as a shipping container, and the analogy pays off in every direction. Before standardized containers, freight was loaded as loose cargo — barrels, crates, sacks — and every ship, crane, and dock had to handle every odd shape by hand. The steel box changed the world not by being clever but by being uniform: opaque, self-contained, and the same on the outside no matter what’s inside. A crane that can lift one can lift any of them; a ship that carries one carries thousands. The contents are the shipper’s business; the interface is standard.

A Docker image is that box for software. The runtime — Docker, containerd, a Kubernetes node — is the crane and the ship: it doesn’t know or care whether the box holds Python, a Go binary, or a database, only that it presents the standard interface. And the image itself is best understood as a photograph of a filesystem: a frozen, read-only snapshot of a complete userland, captured at build time, that you can copy endlessly and run anywhere. The snapshot never changes. When a container runs, it gets a thin writable layer on top of the photo; throw the container away and the photo is untouched, ready to stamp out the next identical instance.

When to use containers (and when not)

Containers are close to a default for shipping server software, but “default” is not “always.” Figure 47.1 shows the build-and-distribute flow they enable; the decision below is about whether you want that flow at all.

Reach for a container when reproducibility and environment parity are worth a small runtime cost: web services and APIs, microservices that each carry their own dependency set, anything that has to run identically across a developer’s laptop, CI, and several production regions. If your pain is “it behaves differently over there,” containers are the cure.

Reach for a VM or bare metal instead when you need a different kernel or hard isolation across a trust boundary. Containers share the host kernel, so a workload that must stay safe from a kernel-level exploit in a neighbor, or that needs its own kernel modules, wants a real VM. Latency-critical or hardware-bound workloads (high-frequency trading, specialized drivers) sometimes cannot afford even the small overhead that container networking and layered filesystems add.

The stateful-workload caveat deserves its own line. Containers are ephemeral by design — the photo is immutable and the writable layer dies with the container. That is a feature for stateless services and a trap for databases. You can run Postgres in a container, but the data must live on a persistent volume mounted from outside, never in the container’s own layer. The container is the running engine; the state lives elsewhere. Confuse the two and your first docker rm is a data-loss incident.

What you’ll learn

How images are built from stacked, content-addressed layers, and how the build cache turns layer ordering into a performance decision
How to write a Dockerfile that builds fast and rebuilds faster, by separating what rarely changes from what changes every commit
Why multi-stage builds are the single highest-leverage technique for shrinking images, and how the builder/runtime split works
How to make images small and safe: minimal bases, non-root users, distroless and scratch
Why Python, Go, and Rust produce wildly different image profiles, and what that means for your base-image choices
How to wire several containers together for local development with Docker Compose, and where its conveniences stop being appropriate for production

Prerequisites

Linux processes and filesystems: what a process is, what PID 1 means, how the mount tree and environment variables work (the Linux Processes material)
Software-engineering fundamentals: dependency management, build vs. runtime, and why reproducibility matters (the Software Engineering Fundamentals material)
Comfort at a shell: running commands, reading exit codes, following logs

Images, layers, and the build cache

An image is not a single blob; it is a stack of read-only layers, each one the filesystem delta produced by a build step that changes files (RUN, COPY, ADD). Metadata-only instructions such as ENV or CMD record configuration but add no filesystem content. FROM lays down the base layers; installing dependencies adds another; copying your code adds another. Stack them and you have the complete userland: the photograph of the filesystem. When a container runs, the runtime unions these read-only layers and adds one thin writable layer on top, copy-on-write, so a thousand containers from the same image share the underlying layers and pay only for what each one changes.

The reason this matters for building is the cache. Each layer is keyed by its inputs: the instruction itself, the layer it builds on, and, for COPY and ADD, the contents of the files being copied. Docker reuses a cached layer whenever those inputs are unchanged, but the moment one layer’s input changes, that layer and every layer after it are rebuilt. Cache invalidation cascades downward. This single fact dictates how you order a Dockerfile: put the steps that rarely change first and the steps that change constantly last. Dependencies before source code, always. Copy the lockfile and install dependencies as their own early step, then copy your application code as a later step. A typical commit touches only your code, so the expensive dependency layer stays cached and the rebuild is near-instant.

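# Illustrative: dependencies first (cached), code last (changes every commit).
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
# Cached unless requirements.txt changes.
RUN pip install --no-cache-dir -r requirements.txt
# The only layer a code edit busts. (Comments go on their own line: Dockerfile has no trailing comments.)
COPY . .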

The inverted version — COPY . . before the install — reinstalls every dependency on every one-line change, because copying the code first invalidates the cache for everything below it. That is the most common reason a build that should take five seconds takes five minutes.

One more habit goes with this: a .dockerignore. The build context is everything in the directory you hand to docker build; unless excluded, all of it is available to the builder and fair game for COPY . . to pick up. Without a .dockerignore, you sweep .git, local virtual environments, node_modules, logs, and test caches into the image, which bloats it and, worse, invalidates the cache whenever files that have nothing to do with the build change.

Writing a Dockerfile well

A good Dockerfile is mostly the discipline above plus a few defaults worth treating as non-negotiable. Pin your base image to a specific version, never latest — a floating tag means two builds a week apart can pull different base images, and your “reproducible” artifact quietly stops being reproducible (more on this in the war story below). Set an explicit USER so the process does not run as root. Prefer COPY over ADD unless you specifically need ADD’s URL-fetch or auto-extract behavior, because COPY does exactly one obvious thing. And remember that each RUN is its own layer: a package install whose cleanup happens in a separate later RUN doesn’t shrink anything, because the deleted files still exist in the earlier layer. Install and clean up in the same RUN so the deletion lands in the same layer as the creation.
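
The pinning and same-RUN cleanup rules above can be sketched in a few lines. This is a minimal illustration, not a complete image: the `libpq5` package is only a stand-in for whatever system library an application might need, and the `/app` working directory is an assumed layout.

```dockerfile
# Base pinned to a specific version tag rather than `latest`.
FROM python:3.12-slim

# Install and clean up in ONE RUN so the apt package lists never persist in a layer.
# `libpq5` is a stand-in for whatever system package the app actually needs.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libpq5 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
# COPY rather than ADD: a plain copy, with no URL fetch or archive extraction.
COPY . .
```

Splitting the `rm -rf` into its own later RUN would produce the same final filesystem but a larger image, because the package lists would still be stored in the layer that created them.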

The subtle trap is that a layer is permanent once it exists, even if a later layer deletes its contents. Layers are additive deltas; deleting a file in layer five doesn’t remove it from layer three, it just hides it. Anything ever written into a layer (a secret, a credentials file, a private key you COPY in and rm two lines later) remains fully recoverable from the image’s layers by anyone who can pull it. The same goes for values baked in with ARG or ENV, which are visible in the image’s history and metadata. Build secrets and runtime environment variables exist precisely so secrets never touch a layer.
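
One way to keep a credential out of every layer is a BuildKit secret mount. The sketch below assumes a private package index whose URL (with embedded token) is the secret; the secret id `pip_index_url` and the file name in the build command are hypothetical names chosen for illustration.

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .

# The secret is mounted at /run/secrets/pip_index_url for this RUN step only;
# it is not written into the resulting layer.
# Build with: docker build --secret id=pip_index_url,src=pip_index_url.txt .
RUN --mount=type=secret,id=pip_index_url \
    PIP_INDEX_URL="$(cat /run/secrets/pip_index_url)" \
    pip install --no-cache-dir -r requirements.txt
```

The value exists only in the environment of that single command, so it does not appear in the image's layers or in its recorded configuration.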

Multi-stage builds: the highest-leverage technique

Here is the central tension of image building. To build most software you need a heavy toolchain — a compiler, a language SDK, development headers, build-time dependencies. To run the result you need almost none of it. A single-stage Dockerfile fuses the two: whatever you installed to build the artifact ships in the final image right alongside the artifact, so a 30 MB Go binary arrives wrapped in a 900 MB Go SDK it will never use again. The build environment becomes permanent runtime baggage.

Multi-stage builds sever build-time from runtime. You write two (or more) FROM stages in one Dockerfile: a fat builder stage that has the full toolchain and produces the artifact, and a slim runtime stage that starts from a minimal base and copies only the finished artifact across with COPY --from=builder. Everything in the builder — the compiler, the headers, the intermediate object files, the source tree — is discarded. Nothing crosses the boundary except what you explicitly copy. The final image is the runtime base plus your artifact, and nothing else. This is the shape in Figure 47.1: source flows into the builder, only the artifact is copied into the runtime layers, and that slim image is what reaches the registry and the running containers.

The Go case is the dramatic one because the artifact is so self-sufficient. Compile a static binary in the builder, then copy it into the emptiest base there is:

# Stage 1 — builder: full Go SDK (~900 MB), discarded after this stage.
FROM golang:1.23 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /app ./cmd/server

# Stage 2 — runtime: nothing but the binary and CA certs.
FROM gcr.io/distroless/static-debian12
COPY --from=builder /app /app
USER 65532:65532
ENTRYPOINT ["/app"]

The final image is a few tens of megabytes — the binary, a handful of certificates, and no shell, package manager, or libc to attack. The 900 MB SDK never leaves stage one. Multi-stage builds are to image size what algorithmic complexity is to runtime performance: a single upstream decision that sets the ceiling for everything downstream. Get it right and the rest is tuning; get it wrong and no amount of base-image fiddling will save you.

Smaller and safer: bases, non-root, distroless

Image size and image security pull in the same direction, which is convenient: the smaller you make an image, the less there is to exploit. The lever is the base. A full python:3.12 image is roughly a gigabyte; python:3.12-slim is around 130 MB; an Alpine variant is smaller still but swaps glibc for musl, so wheels built for glibc do not apply and native extensions may have to be compiled from source or may misbehave in ways that cost you an afternoon. Distroless images go further: Google’s distroless bases contain your language runtime and its dependencies and nothing else. No shell, no package manager, no cat, no ls. There is no shell to drop into during an exploit and almost no CVE surface, because almost no software is present to have CVEs. For statically linked binaries, scratch (the empty base) is the logical endpoint: zero bytes of OS.

Security is the same minimalism applied to the running process. Run as non-root: the default container user is root, and unless user-namespace remapping or rootless mode is in use, a container escape from a root process lands as root on the host, so set an explicit unprivileged USER. For a scratch or distroless image with no user database, use a numeric UID like USER 65532. Then make scanning routine. Tools like Trivy read an image’s layers and flag known vulnerabilities in its packages, and wiring that into CI catches a vulnerable base before it ships rather than after. The smaller the image, the shorter the scan report, which is the whole point.
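
On a base that does have a user database, the non-root rule might look like the sketch below. The account name `appuser`, the UID/GID 10001, and the `python -m app` entry point are all illustrative assumptions, not values the chapter prescribes.

```dockerfile
FROM python:3.12-slim

# Create an unprivileged group and account; name and IDs are illustrative.
RUN groupadd --gid 10001 appuser \
    && useradd --uid 10001 --gid 10001 --no-create-home --shell /usr/sbin/nologin appuser

WORKDIR /app
# Hand the application files to that account as they are copied in.
COPY --chown=10001:10001 . .

# From here on, including the container's main process, everything runs as non-root.
USER 10001:10001
# `app` is a hypothetical module name.
CMD ["python", "-m", "app"]
```

Using the numeric form in USER keeps the image usable with runtimes that verify non-root status by UID rather than by name.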

War story: the 2 GB image and the rollback that wasn’t

A platform team’s flagship service shipped on a single-stage Dockerfile that grew, one “just add this tool” commit at a time, into a 2 GB image carrying the entire build toolchain, the test suite, and a stray .git directory. Deploys slowed as every node pulled two gigabytes; eventually a pull timed out mid-rollout and wedged the cluster. The image had two diseases at once. It was huge — fixed by a multi-stage build that copied only the compiled artifact into a slim base, cutting it to 80 MB. And it was tagged :latest. When the team tried to roll back to “the last good version,” there was no such thing: :latest had been overwritten on every build, so the tag that once meant the working image now meant the broken one. The lesson is two rules that always travel together — multi-stage to keep images small, and immutable version tags (or better, digests) so every build is a distinct, rollback-able artifact. A tag you can overwrite is not a version; it’s a moving target.

Build it → Tiny static-binary images in practice: the Rust services in Project 03: High-Performance Cache and Project 13: Service Mesh compile to slim runtime images, and the Go gRPC services in Project 02: Microservice Platform show the builder→scratch/distroless pattern across a multi-service stack.

The polyglot reality: not all images are equal

The same multi-stage discipline yields very different images depending on the language, because languages differ in how much userland the runtime truly needs. This is worth internalizing, because it’s why “just use distroless” is good advice for one stack and impossible for another.

Go and Rust compile to self-contained native binaries. With CGO_ENABLED=0 a Go program, or a Rust program built against musl, links statically — the binary carries everything, so the runtime stage can be scratch or distroless-static and the whole image is the binary plus CA certificates. Single-digit to low-tens of megabytes is normal. There is no interpreter to ship because the language is the binary.
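
The Rust counterpart of the earlier Go build might look like this. It is a sketch under stated assumptions: an x86_64 build host, a crate with no C dependencies that would need a musl C toolchain, and a binary named `server` (a hypothetical name).

```dockerfile
# Stage 1: builder with the full Rust toolchain, discarded after this stage.
FROM rust:1.82 AS builder
# Add the musl target so the binary links statically (assumes an x86_64 build host).
RUN rustup target add x86_64-unknown-linux-musl
WORKDIR /src
COPY . .
RUN cargo build --release --target x86_64-unknown-linux-musl

# Stage 2: runtime with only the binary and CA certificates.
FROM gcr.io/distroless/static-debian12
# `server` is a hypothetical binary name; substitute your crate's.
COPY --from=builder /src/target/x86_64-unknown-linux-musl/release/server /server
USER 65532:65532
ENTRYPOINT ["/server"]
```

The runtime stage is the same one the Go example uses, which is the point: once the binary is static, the language that produced it no longer matters to the base image.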

Python cannot do this. A Python application is source code that needs an interpreter, a standard library, and a tree of installed wheels — some of them compiled C extensions linked against system libraries — present at runtime. You cannot copy “just the binary” because there is no binary; you copy the interpreter plus the whole dependency tree. A well-built multi-stage Python image installs wheels in the builder and copies the resulting site-packages into a slim runtime, which trims the build toolchain but still drags an interpreter and its libraries. Expect 150–400 MB where a Go service would be 20. This is not a failure of your Dockerfile; it is the shape of an interpreted, dynamically-linked runtime. Knowing it stops you from chasing a 30 MB Python image that cannot exist, and tells you where the real wins are: a slim (not full) base, a clean .dockerignore, and never shipping build tools.
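
A multi-stage Python build along these lines could be sketched as follows. It uses a virtual environment as the unit that crosses the stage boundary, which is one common variation on copying site-packages directly; the `/opt/venv` path and the `python -m app` entry point are assumptions for illustration.

```dockerfile
# Stage 1: builder. Install dependencies into a self-contained virtual environment.
FROM python:3.12-slim AS builder
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY requirements.txt .
# Any compilers added to this stage for building C extensions stay behind in it.
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime. Same base, so the copied environment finds the same interpreter.
FROM python:3.12-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
# Arbitrary unprivileged numeric UID.
USER 10001:10001
# `app` is a hypothetical module name.
CMD ["python", "-m", "app"]
```

Both stages deliberately share one base image: the virtual environment refers to the interpreter by path, so the runtime stage needs the same Python at the same location.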

Build it → Compare the profiles directly. The FastAPI services in Project 05: SaaS Web Platform and Project 50: Feature Engineering Platform show the Python interpreter-plus-wheels profile, while the Rust Project 51: Message Queue and Project 52: Time-Series Database show the static-binary end of the spectrum.

Docker Compose for local development

One container is a process; a real system is several talking to each other — an API, a database, a cache, maybe a message broker. Wiring them up by hand with a dozen docker run flags for ports, networks, volumes, and dependency order is tedious and unrepeatable. Docker Compose declares the whole topology in one YAML file and brings it up with a single command. Compose puts the services on a shared network where each is reachable by its service name as a hostname — the API connects to the database at db:5432, not some IP — and that name-based wiring is most of what makes a multi-service stack manageable.

# Illustrative: an API that depends on a database, on a shared Compose network.
services:
  api:
    build: .
    ports: ["8000:8000"]
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/app   # 'db' = the service name
    depends_on:
      db: { condition: service_healthy }
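  db:
    image: postgres:16
    environment:                                          # dev-only credentials; must match DATABASE_URL
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: app
    volumes: ["pgdata:/var/lib/postgresql/data"]          # state lives outside the container
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d app"]
      interval: 5s
      retries: 10
volumes: { pgdata: }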

Two details earn their keep. depends_on with a service_healthy condition makes the API wait until the database is actually accepting connections, not merely started. Without it, the API races the database on startup and crashes on the first query. And the named volume is where the Postgres data lives, outside the container’s ephemeral layer, so docker compose down and back up doesn’t wipe the database (only docker compose down -v removes the named volume). That is the stateful-workload caveat from earlier, made concrete.

Compose is superb for local development and CI integration tests, where you want the whole stack on one machine, reproducibly, in one command. It is not a production orchestrator: it has no multi-host scheduling, no self-healing rescheduling, no rolling deploys or autoscaling. When you outgrow one machine, the same images move to Kubernetes — which is the subject of its own chapter, and the reason getting the image right here matters there.

Build it → Full Compose stacks that stand up databases, caches, and brokers for local dev: Project 07: Data Lakehouse and Project 08: Streaming Platform orchestrate full multi-service infrastructure with docker compose.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Containerize and measure. Take a small app (a FastAPI endpoint or a Go HTTP server) and write a straightforward single-stage Dockerfile. Build it, run it, hit the endpoint, then record the image size from docker images. This is your baseline — write the number down; you’ll beat it in the next task.
Level II — Multi-stage and report the delta. Convert the same app to a multi-stage build: a builder stage with the toolchain, a slim or distroless runtime stage that copies only the artifact. Report the before/after image size and the percentage cut. Then change one line of application code, rebuild, and explain from the build output exactly which layers were rebuilt and which were served from cache — and why, in terms of layer ordering.
Level III — Harden for a regulated deploy. Take the multi-stage image to production grade: non-root USER, a distroless or scratch runtime, and a base pinned by digest so the base layers are fixed byte-for-byte rather than floating with a tag. Run a vulnerability scan (e.g. Trivy) and generate an SBOM. Write a short paragraph on the supply-chain story you’d present at a compliance review: how a reviewer can verify exactly what is in the image, prove it hasn’t drifted since it was signed, and reproduce the build from source.

Summary

Containerization solves the deployment problem that version pins cannot: it packages the application and its entire userland into one immutable image, so the thing you tested is byte-for-byte the thing you ship. It does this cheaply by sharing the host kernel and isolating with namespaces and cgroups rather than virtualizing hardware: a fraction of a second to start and, for a lean image, tens of megabytes on disk, where a VM is seconds and gigabytes. The image is a photograph of a filesystem, built from stacked read-only layers whose ordering governs the build cache. The highest-leverage technique is the multi-stage build, which discards the heavy toolchain and ships only the artifact into a slim, non-root, minimal base. How slim depends on your language, with Go and Rust reaching scratch and Python necessarily dragging its interpreter.

Key takeaways

A container ships the environment, not just the code — that is the whole point, and the cure for “works on my machine.”
Containers isolate with kernel namespaces + cgroups, not a guest kernel; that is why they’re cheap, and also why hard isolation across a trust boundary still wants a VM.
Layer ordering is build performance: dependencies before code, clean up in the same RUN, and never assume a deleted file leaves the layer it was written into.
Multi-stage builds are the single biggest win for size and security; pair them with a minimal base, a non-root user, and immutable version tags or digests.
State belongs on a mounted volume, never in the container’s ephemeral writable layer.

Connections to other chapters

Software Engineering Fundamentals (prerequisite): containers are the concrete answer to the reproducibility and dependency-management problems framed there — an image is a build artifact taken to its logical, environment-inclusive conclusion.
Orchestration with Kubernetes (extension): Kubernetes is what runs the images you build here at scale — it schedules containers across many hosts, restarts the ones that die, and rolls out new versions. A small, well-formed image is the unit it schedules; the image work in this chapter is the input to that one.
Benchmarking Systems (sibling): image size and container cold-start time are measurable infrastructure costs, not vibes. The discipline for measuring them — and for trusting the deltas you report in the exercise above — comes from the benchmarking chapter.
Cross-Cutting CI/CD and Security (extension): images are built, scanned, and signed in CI. The supply-chain controls — vulnerability scanning, SBOMs, image signing, pinned digests — are where containerization meets the security and pipeline practices covered there.

Further reading

The Open Container Initiative (OCI) Image and Runtime Specifications — the open standards, decoupled from Docker, that define what an image and a running container actually are.
Merkel, “Docker: Lightweight Linux Containers for Consistent Development and Deployment” (Linux Journal, 2014) — the early articulation of the container value proposition.

Historical context

Linux namespaces and cgroups kernel documentation — the two primitives that make containers possible; everything above is orchestration over these.
Verma et al., “Large-scale cluster management at Google with Borg” (EuroSys, 2015) — the internal system whose lineage runs straight through containers to Kubernetes, and the reason the next chapter exists.

--- title: "Containerization with Docker" keywords: [docker, containers, images, dockerfile, multi-stage builds, layers, oci, registries, docker-compose] difficulty: intermediate prerequisites: [linux-processes, software-engineering-fundamentals] estimated_time: "3-4 hours" --- ## Introduction The service passed every test on the engineer's laptop, sailed through CI, and went out on a Friday. By Saturday it was crash-looping in production. The logs showed an `undefined symbol` deep inside a native dependency — a library that loaded fine in development and refused to load in prod. The two machines were, on paper, identical: same code, same version pin, same config. They differed in exactly one thing nobody had written down. The laptop ran a newer glibc than the production base image, and the dependency had been compiled against the newer one. The application had never been wrong; its *environment* had drifted, silently, and the test suite had no way to see it. This is the oldest failure in deployment, and "works on my machine" is its catchphrase. The thing you tested and the thing you shipped were never the same thing. They shared source code but not a runtime — a different libc, a different Python patch release, a system package present here and absent there, an environment variable set in one shell and not the other. Version pins capture the names of your dependencies; they do not capture the operating system underneath them. The gap between "the code I wrote" and "the world that code runs in" is where deploys go to die. Containerization closes that gap by shipping the world along with the code. ### The Core Insight A language runtime, a `requirements.txt`, a list of pinned versions — these describe the *application*. They say nothing about the userland the application sits on: the C library, the system shared objects, the locale data, the CA certificates, the exact shape of `/usr`. 
That userland is the part that drifts, because it belongs to the host, and the host is whatever someone provisioned months ago. The insight behind containers is to stop treating the environment as a given and start treating it as an *artifact* — to package the application **and its entire userland** into one immutable image that runs identically wherever it lands. A container does this without paying the price of a virtual machine. A VM gives every workload its own emulated hardware and a full guest kernel, which is heavy in three specific ways: 1. **Memory and disk.** Each VM carries a complete OS image — gigabytes of kernel, init system, and base packages — duplicated per instance. 2. **Boot time.** A guest kernel has to come up before your process can. Seconds to tens of seconds, every time you scale out. 3. **Density.** Hypervisor overhead and per-VM RAM floors cap how many workloads fit on a box, which is money on every node. A container skips all of it by **sharing the host kernel** and isolating only what needs isolating. Linux *namespaces* give the process its own private view of process IDs, the filesystem mount tree, the network stack, and user IDs — so it sees itself as PID 1 in an empty world. Linux *cgroups* cap and account for the CPU, memory, and I/O it can consume. There is no second kernel and no emulated hardware. The container is just your process, fenced off and resource-limited, running on the same kernel as everything else — which is why it starts in milliseconds and weighs tens of megabytes instead of gigabytes. ### A mental model Think of a container image as a **shipping container**, and the analogy pays off in every direction. Before standardized containers, freight was loaded as loose cargo — barrels, crates, sacks — and every ship, crane, and dock had to handle every odd shape by hand. The steel box changed the world not by being clever but by being *uniform*: opaque, self-contained, and the same on the outside no matter what's inside. 
A crane that can lift one can lift any of them; a ship that carries one carries thousands. The contents are the shipper's business; the *interface* is standard. A Docker image is that box for software. The runtime — Docker, containerd, a Kubernetes node — is the crane and the ship: it doesn't know or care whether the box holds Python, a Go binary, or a database, only that it presents the standard interface. And the image itself is best understood as a **photograph of a filesystem**: a frozen, read-only snapshot of a complete userland, captured at build time, that you can copy endlessly and run anywhere. The snapshot never changes. When a container runs, it gets a thin writable layer on top of the photo; throw the container away and the photo is untouched, ready to stamp out the next identical instance. ### When to use containers (and when not) Containers are close to a default for shipping server software, but "default" is not "always." @fig-docker-build shows the build-and-distribute flow they enable; the decision below is about whether you want that flow at all. **Reach for a container** when reproducibility and environment parity are worth a small runtime cost: web services and APIs, microservices that each carry their own dependency set, anything that has to run identically across a developer's laptop, CI, and several production regions. If your pain is "it behaves differently over there," containers are the cure. **Reach for a VM or bare metal** instead when you need a *different kernel* or hard isolation across a trust boundary — containers share the host kernel, so a kernel-level exploit or a workload that needs its own kernel modules wants a real VM. Latency-critical or hardware-bound workloads (high-frequency trading, specialized drivers) sometimes can't afford even the thin namespace overhead. **The stateful-workload caveat** deserves its own line. Containers are ephemeral by design — the photo is immutable and the writable layer dies with the container. 
That is a feature for stateless services and a trap for databases. You *can* run Postgres in a container, but the data must live on a persistent volume mounted from outside, never in the container's own layer. The container is the running engine; the state lives elsewhere. Confuse the two and your first `docker rm` is a data-loss incident. ### What you'll learn - How images are built from stacked, content-addressed layers, and how the build cache turns layer ordering into a performance decision - How to write a Dockerfile that builds fast and rebuilds faster, by separating what rarely changes from what changes every commit - Why **multi-stage builds** are the single highest-leverage technique for shrinking images, and how the builder/runtime split works - How to make images small *and* safe: minimal bases, non-root users, distroless and scratch - Why Python, Go, and Rust produce wildly different image profiles, and what that means for your base-image choices - How to wire several containers together for local development with Docker Compose, and where its conveniences stop being appropriate for production ### Prerequisites - Linux processes and filesystems: what a process is, what PID 1 means, how the mount tree and environment variables work (the *Linux Processes* material) - Software-engineering fundamentals: dependency management, build vs. runtime, and why reproducibility matters (the *Software Engineering Fundamentals* material) - Comfort at a shell: running commands, reading exit codes, following logs --- ## Images, layers, and the build cache An image is not a single blob; it is a stack of read-only **layers**, each one the filesystem delta produced by a single build step. `FROM` lays down a base layer; installing dependencies adds another; copying your code adds another. Stack them and you have the complete userland — the photograph of the filesystem. 
When a container runs, the runtime unions these read-only layers and adds one thin **writable layer** on top, copy-on-write, so a thousand containers from the same image share the underlying layers and pay only for what each one changes.

The reason this matters for *building* is the cache. Each layer is keyed by its inputs, and Docker reuses a cached layer whenever those inputs are unchanged — but the moment one layer's input changes, that layer and **every layer after it** are rebuilt. Cache invalidation cascades downward. This single fact dictates how you order a Dockerfile: put the steps that rarely change *first* and the steps that change constantly *last*. Dependencies before source code, always. Copy the lockfile and install dependencies as their own early step, then copy your application code as a later step. A typical commit touches only your code, so the expensive dependency layer stays cached and the rebuild is near-instant.

```dockerfile
# Illustrative: dependencies first (cached), code last (changes every commit).
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
# Cached unless requirements.txt changes.
RUN pip install --no-cache-dir -r requirements.txt
# The only layer a code edit busts.
COPY . .
```

The inverted version — `COPY . .` before the install — reinstalls every dependency on every one-line change, because copying the code first invalidates the cache for everything below it. That is the most common reason a build that should take five seconds takes five minutes.

One more habit goes with this: a `.dockerignore`. The build context is everything in the directory you hand to `docker build`, and it all gets shipped to the daemon and is fair game for `COPY . .`. Without a `.dockerignore`, you sweep `.git`, local virtual environments, `node_modules`, logs, and test caches into the image — bloating it and, worse, invalidating cache on files that have nothing to do with the build.

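
As a starting point, here is a minimal `.dockerignore` sketch for that habit. The entries mirror the kinds of paths named above; the exact names (`.venv`, `.pytest_cache`, and so on) are assumptions about a typical project layout, not a canonical list, so adjust them to your repository.

```dockerignore
# Version-control metadata: large, and it changes on every commit.
.git
# Local virtual environments and installed dependencies
# (assumed names; dependencies are reinstalled inside the image).
.venv
venv
node_modules
# Logs and test/bytecode caches.
*.log
.pytest_cache
**/__pycache__
```

Paths excluded this way never enter the build context, so editing them cannot invalidate the `COPY . .` layer, and they cannot end up in the image by accident.
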

## Writing a Dockerfile well

A good Dockerfile is mostly the discipline above plus a few defaults worth treating as non-negotiable. **Pin your base image** to a specific version, never `latest` — a floating tag means two builds a week apart can pull different base images, and your "reproducible" artifact quietly stops being reproducible (more on this in the war story below). **Set an explicit `USER`** so the process does not run as root. **Prefer `COPY` over `ADD`** unless you specifically need `ADD`'s URL-fetch or auto-extract behavior, because `COPY` does exactly one obvious thing. And remember that each `RUN` is its own layer: a package install whose cleanup happens in a *separate* later `RUN` doesn't shrink anything, because the deleted files still exist in the earlier layer. Install and clean up in the same `RUN` so the deletion lands in the same layer as the creation.

The subtle trap is that **a layer is permanent once it exists, even if a later layer deletes its contents.** Layers are additive deltas; deleting a file in layer five doesn't remove it from layer three — it just hides it. Anything ever written into a layer — a secret, a credentials file, a private key you `COPY`d in and `rm`d two lines later — remains fully recoverable from the image history by anyone who pulls it. Build secrets and runtime environment variables exist precisely so secrets never touch a layer.

## Multi-stage builds: the highest-leverage technique

Here is the central tension of image building. To *build* most software you need a heavy toolchain — a compiler, a language SDK, development headers, build-time dependencies. To *run* the result you need almost none of it. A single-stage Dockerfile fuses the two: whatever you installed to build the artifact ships in the final image right alongside the artifact, so a 30 MB Go binary arrives wrapped in a 900 MB Go SDK it will never use again. The build environment becomes permanent runtime baggage.

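
To make that baggage concrete, here is a single-stage sketch of the fused pattern. It assumes a hypothetical Go module whose entry point lives at `./cmd/server`; the layout and the `golang:1.23` tag are illustrative assumptions, not taken from a real project. It builds and runs correctly, which is part of why the pattern is so common. The cost shows up only in what the shipped image carries.

```dockerfile
# Single stage: the SDK image used to compile is also the image that ships.
FROM golang:1.23
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o /app ./cmd/server
# The compiler, the module cache, and the full source tree all remain
# in the layers of this final image.
ENTRYPOINT ["/app"]
```

Only `/app` is needed at runtime, yet every layer above the `ENTRYPOINT` is pulled by every node that runs the service.
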

**Multi-stage builds** sever build-time from runtime. You write two (or more) `FROM` stages in one Dockerfile: a fat **builder** stage that has the full toolchain and produces the artifact, and a slim **runtime** stage that starts from a minimal base and copies *only the finished artifact* across with `COPY --from=builder`. Everything in the builder — the compiler, the headers, the intermediate object files, the source tree — is discarded. Nothing crosses the boundary except what you explicitly copy. The final image is the runtime base plus your artifact, and nothing else.

This is the shape in @fig-docker-build: source flows into the builder, only the artifact is copied into the runtime layers, and that slim image is what reaches the registry and the running containers.

![A multi-stage Docker build: a heavy builder stage compiles the app, then only the artifacts are copied into a slim runtime image, which is pushed to a registry and pulled as immutable containers.](../assets/diagrams/rendered/infra_containerization.svg){#fig-docker-build .lightbox}

The Go case is the dramatic one because the artifact is so self-sufficient. Compile a static binary in the builder, then copy it into the emptiest base there is:

```dockerfile
# Stage 1 — builder: full Go SDK (~900 MB), discarded after this stage.
FROM golang:1.23 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /app ./cmd/server

# Stage 2 — runtime: nothing but the binary and CA certs.
FROM gcr.io/distroless/static-debian12
COPY --from=builder /app /app
USER 65532:65532
ENTRYPOINT ["/app"]
```

The final image is a few tens of megabytes — the binary, a handful of certificates, and no shell, package manager, or libc to attack. The 900 MB SDK never leaves stage one.

Multi-stage builds are to image size what algorithmic complexity is to runtime performance: a single upstream decision that sets the ceiling for everything downstream.
Get it right and the rest is tuning; get it wrong and no amount of base-image fiddling will save you.

## Smaller and safer: bases, non-root, distroless

Image size and image security pull in the same direction, which is convenient: the smaller you make an image, the less there is to exploit. The lever is the base. A full `python:3.12` image is roughly a gigabyte; `python:3.12-slim` is around 130 MB; an Alpine variant is smaller still but swaps glibc for musl, which can break binary wheels with native extensions in ways that cost you an afternoon. **Distroless** images go further — Google's distroless bases contain your language runtime and its dependencies and *nothing else*: no shell, no package manager, no `cat`, no `ls`. There is no shell to drop into during an exploit and almost no CVE surface, because almost no software is present to have CVEs. For statically linked binaries, `scratch` — the empty base — is the logical endpoint: zero bytes of OS.

Security is the same minimalism applied to the running process. Run as **non-root**: the default container user is root, and unless user namespaces remap it, a container escape from a root process lands as root on the *host*, so set an explicit unprivileged `USER`. For a `scratch` or distroless image with no user database, use a numeric UID like `USER 65532`. Then make scanning routine — tools like Trivy read an image's layers and flag known vulnerabilities in its packages, and wiring that into CI catches a vulnerable base before it ships rather than after. The smaller the image, the shorter the scan report, which is the whole point.

::: {.callout-warning}
## War story: the 2 GB image and the rollback that wasn't

A platform team's flagship service shipped on a single-stage Dockerfile that grew, one "just add this tool" commit at a time, into a 2 GB image carrying the entire build toolchain, the test suite, and a stray `.git` directory. Deploys slowed as every node pulled two gigabytes; eventually a pull timed out mid-rollout and wedged the cluster.
The image had two diseases at once. It was huge — fixed by a multi-stage build that copied only the compiled artifact into a slim base, cutting it to 80 MB. And it was tagged `:latest`. When the team tried to roll back to "the last good version," there was no such thing: `:latest` had been overwritten on every build, so the tag that once meant the working image now meant the broken one. The lesson is two rules that always travel together — **multi-stage to keep images small, and immutable version tags (or better, digests) so every build is a distinct, rollback-able artifact.** A tag you can overwrite is not a version; it's a moving target.

:::

> **Build it →** Tiny static-binary images in practice: the Rust services in
> [Project 03: High-Performance Cache](https://github.com/jchu0/applied-cs-projects/tree/main/03-high-performance-cache)
> and [Project 13: Service Mesh](https://github.com/jchu0/applied-cs-projects/tree/main/13-service-mesh)
> compile to slim runtime images, and the Go gRPC services in
> [Project 02: Microservice Platform](https://github.com/jchu0/applied-cs-projects/tree/main/02-microservice-platform)
> show the builder→scratch/distroless pattern across a multi-service stack.

## The polyglot reality: not all images are equal

The same multi-stage discipline yields very different images depending on the language, because languages differ in how much userland the *runtime* truly needs. This is worth internalizing, because it's why "just use distroless" is good advice for one stack and impossible for another.

**Go and Rust** compile to self-contained native binaries. With `CGO_ENABLED=0` a Go program, or a Rust program built against musl, links statically — the binary carries everything, so the runtime stage can be `scratch` or distroless-static and the whole image is the binary plus CA certificates. Single-digit to low-tens of megabytes is normal. There is no interpreter to ship because the language *is* the binary.

**Python** cannot do this.
A Python application is source code that needs an interpreter, a standard library, and a tree of installed wheels — some of them compiled C extensions linked against system libraries — present at runtime. You cannot copy "just the binary" because there is no binary; you copy the interpreter plus the whole dependency tree. A well-built multi-stage Python image installs wheels in the builder and copies the resulting site-packages into a `slim` runtime, which trims the build toolchain but still drags an interpreter and its libraries. Expect 150–400 MB where a Go service would be 20. This is not a failure of your Dockerfile; it is the shape of an interpreted, dynamically-linked runtime. Knowing it stops you from chasing a 30 MB Python image that cannot exist, and tells you where the real wins are: a `slim` (not full) base, a clean `.dockerignore`, and never shipping build tools.

> **Build it →** Compare the profiles directly. The FastAPI services in
> [Project 05: SaaS Web Platform](https://github.com/jchu0/applied-cs-projects/tree/main/05-saas-web-platform)
> and [Project 50: Feature Engineering Platform](https://github.com/jchu0/applied-cs-projects/tree/main/50-feature-engineering-platform)
> show the Python interpreter-plus-wheels profile, while the Rust
> [Project 51: Message Queue](https://github.com/jchu0/applied-cs-projects/tree/main/51-message-queue)
> and [Project 52: Time-Series Database](https://github.com/jchu0/applied-cs-projects/tree/main/52-time-series-database)
> show the static-binary end of the spectrum.

## Docker Compose for local development

One container is a process; a real system is several talking to each other — an API, a database, a cache, maybe a message broker. Wiring them up by hand with a dozen `docker run` flags for ports, networks, volumes, and dependency order is tedious and unrepeatable. **Docker Compose** declares the whole topology in one YAML file and brings it up with a single command.
Compose puts the services on a shared network where each is reachable by its service name as a hostname — the API connects to the database at `db:5432`, not some IP — and that name-based wiring is most of what makes a multi-service stack manageable.

```yaml
# Illustrative: an API that depends on a database, on a shared Compose network.
services:
  api:
    build: .
    ports: ["8000:8000"]
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/app  # 'db' = the service name
    depends_on:
      db: { condition: service_healthy }
  db:
    image: postgres:16
    environment:  # must match the credentials in DATABASE_URL above
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: app
    volumes: ["pgdata:/var/lib/postgresql/data"]  # state lives outside the container
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d app"]
      interval: 5s
      retries: 10
volumes: { pgdata: }
```

Two details earn their keep. `depends_on` with a `service_healthy` condition makes the API wait until the database is actually accepting connections, not merely started — without it, the API races the database on startup and crashes on the first query. And the named volume is where the Postgres data lives, *outside* the container's ephemeral layer, so `docker compose down` and back up doesn't wipe the database. That is the stateful-workload caveat from earlier, made concrete.

Compose is superb for local development and CI integration tests, where you want the whole stack on one machine, reproducibly, in one command. It is *not* a production orchestrator: it has no multi-host scheduling, no self-healing rescheduling, no rolling deploys or autoscaling. When you outgrow one machine, the same images move to Kubernetes — which is the subject of its own chapter, and the reason getting the image right here matters there.


> **Build it →** Full Compose stacks that stand up databases, caches, and brokers for
> local dev: [Project 07: Data Lakehouse](https://github.com/jchu0/applied-cs-projects/tree/main/07-data-lakehouse)
> and [Project 08: Streaming Platform](https://github.com/jchu0/applied-cs-projects/tree/main/08-streaming-platform)
> orchestrate full multi-service infrastructure with `docker compose`.

---

## Practical exercise

**Difficulty:** Level I · Level II · Level III

1. **Level I — Containerize and measure.** Take a small app (a FastAPI endpoint or a Go HTTP server) and write a straightforward single-stage Dockerfile. Build it, run it, hit the endpoint, then record the image size from `docker images`. This is your baseline — write the number down; you'll beat it in the next task.
2. **Level II — Multi-stage and report the delta.** Convert the same app to a multi-stage build: a builder stage with the toolchain, a slim or distroless runtime stage that copies only the artifact. Report the before/after image size and the percentage cut. Then change one line of application code, rebuild, and explain from the build output exactly which layers were rebuilt and which were served from cache — and *why*, in terms of layer ordering.
3. **Level III — Harden for a regulated deploy.** Take the multi-stage image to production grade: non-root `USER`, a distroless or scratch runtime, and a *pinned base by digest* so the base is fixed byte-for-byte rather than floating with a tag. Run a vulnerability scan (e.g. Trivy) and generate an SBOM. Write a short paragraph on the supply-chain story you'd present at a compliance review: how a reviewer can verify *exactly* what is in the image, prove it hasn't drifted since it was signed, and reproduce the build from source.

## Summary

Containerization solves the deployment problem that version pins cannot: it packages the application *and its entire userland* into one immutable image, so the thing you tested is byte-for-byte the thing you ship.
It does this cheaply by sharing the host kernel and isolating with namespaces and cgroups rather than virtualizing hardware — milliseconds to start, tens of megabytes on disk, where a VM is seconds and gigabytes. The image is a photograph of a filesystem, built from stacked read-only layers whose ordering governs the build cache. The highest-leverage technique is the multi-stage build, which discards the heavy toolchain and ships only the artifact into a slim, non-root, minimal base — and how slim depends on your language, with Go and Rust reaching `scratch` and Python necessarily dragging its interpreter.

### Key takeaways

- A container ships the environment, not just the code — that is the whole point, and the cure for "works on my machine."
- Containers isolate with kernel namespaces + cgroups, not a guest kernel; that is why they're cheap, and also why hard isolation across a trust boundary still wants a VM.
- Layer ordering *is* build performance: dependencies before code, clean up in the same `RUN`, and never assume a deleted file leaves the layer it was written into.
- Multi-stage builds are the single biggest win for size and security; pair them with a minimal base, a non-root user, and immutable version tags or digests.
- State belongs on a mounted volume, never in the container's ephemeral writable layer.

### Connections to other chapters

- **Software Engineering Fundamentals** (prerequisite): containers are the concrete answer to the reproducibility and dependency-management problems framed there — an image is a build artifact taken to its logical, environment-inclusive conclusion.
- **Orchestration with Kubernetes** (extension): Kubernetes is what *runs* the images you build here at scale — it schedules containers across many hosts, restarts the ones that die, and rolls out new versions. A small, well-formed image is the unit it schedules; the image work in this chapter is the input to that one.
- **Benchmarking Systems** (sibling): image size and container cold-start time are measurable infrastructure costs, not vibes. The discipline for measuring them — and for trusting the deltas you report in the exercise above — comes from the benchmarking chapter.
- **Cross-Cutting CI/CD and Security** (extension): images are built, scanned, and signed in CI. The supply-chain controls — vulnerability scanning, SBOMs, image signing, pinned digests — are where containerization meets the security and pipeline practices covered there.

## Further reading

### Essential

- *Docker docs — Best practices for writing Dockerfiles* — the canonical reference for layer ordering, multi-stage builds, and `.dockerignore`.
- *Distroless container images* (Google) — the rationale and catalog for minimal, shell-free runtime bases.

### Deep dives

- *The Open Container Initiative (OCI) Image and Runtime Specifications* — the open standards, decoupled from Docker, that define what an image and a running container actually are.
- Merkel, *"Docker: Lightweight Linux Containers for Consistent Development and Deployment"* (Linux Journal, 2014) — the early articulation of the container value proposition.

### Historical context

- Linux **namespaces** and **cgroups** kernel documentation — the two primitives that make containers possible; everything above is orchestration over these.
- Verma et al., *"Large-scale cluster management at Google with Borg"* (EuroSys, 2015) — the internal system whose lineage runs straight through containers to Kubernetes, and the reason the next chapter exists.