CI/CD and Deployment Automation

Keywords

ci/cd, continuous integration, continuous delivery, github actions, gitops, pipeline, deployment, canary, blue-green, rollback

Introduction

The deploy was scheduled for Friday at 5pm, after the traffic died down, the way it always was. One engineer — call her the only person who knew the runbook — SSH’d into the production box, pulled the new build, ran the four migration scripts in the order written on a wiki page she’d half-memorized, restarted the service by hand, and watched the logs scroll. For three years this had worked. At 6:02pm it did not. The third migration assumed a column that an earlier hotfix had already renamed, and it failed halfway — some rows migrated, some not, the schema now in a state no script anticipated. The service came up against the broken schema and started returning 500s on every write. There was no rollback. “Rollback” had always meant “redeploy the old build,” but the old build couldn’t read the new, half-migrated table either. The one person who understood the steps spent the next four hours reconstructing the database by hand while the site stayed down, and at 10pm she was still typing SQL into a production shell from memory, alone, on a Friday night.

Nothing in that story is a code bug. The application was fine. What failed was the process: a deployment that lived in one person’s head, ran by hand, in an order no machine enforced, with no tested way back. The release was a rare, high-stakes, irreversible event — a one-way door someone walked through holding their breath. And because it was rare, it was big: weeks of accumulated changes shipped at once, so when something broke there was no way to tell which of fifty commits did it. The whole catastrophe is what happens when shipping software is a heroic manual act instead of a boring automated one. CI/CD turns deployment from a risky event into a repeatable, reversible pipeline — one that builds and tests every change as it lands, ships it the same way every time, and can take it back in seconds when it’s wrong.

The Core Insight

The goal of CI/CD is to make integration and deployment continuous, automated, and reversible, so that shipping software is low-risk and frequent rather than rare and terrifying. Those three properties are the whole thing, and each fixes a specific failure in the Friday-night story.

Continuous means every change is integrated and tested the moment it lands, not batched up for a quarterly release. The opposite — long-lived branches that diverge for weeks and then merge in a painful big bang — is where integration bugs hide until they’re expensive. Continuous Integration (CI) keeps the mainline always-releasable by building and testing every commit, so the question “is main shippable?” always has the answer “yes, we just checked.” Automated means the path from a merged commit to running production code is executed by a machine following the same steps in the same order every time — no wiki page, no memory, no single point of human failure. Continuous Delivery/Deployment (CD) owns that path, with safety rails: staged rollout, health gates that block a bad release, and instant rollback when one slips through.

The deepest part of the insight is counterintuitive: small, frequent changes are safer than big, rare ones. Every instinct says the opposite — surely shipping less often, after more review, is more cautious. But a deploy carrying one change has a tiny blast radius and an obvious cause: if it breaks, you know exactly what broke it, because only one thing changed. A deploy carrying fifty changes after three weeks has fifty suspects and a debugging session. Frequency is not recklessness; it is risk reduction through small batches. The teams that deploy many times a day are not braver than the ones that deploy once a quarter — they have made each deploy so small and so reversible that it stopped being scary.

A mental model

Picture the pipeline as an assembly line with quality gates. A change enters at one end as a commit and moves down the line through stations — build, test, scan, deploy — and at each station there is an inspector who can stop the line. The build station won’t pass a change that doesn’t compile; the test station won’t pass one that fails its tests; the scan station won’t pass one with a critical vulnerability. Nothing reaches the next station until it clears the current one, and nothing reaches production until it clears them all. The point of the line is not speed for its own sake — it’s that every unit comes out the far end having passed exactly the same inspections, so what ships is never a surprise. Figure 43.1 shows this line end to end.

Figure 43.1. A CI/CD pipeline: every commit is built, tested, and scanned (CI), producing an artifact that flows through staged deployment with health gates and a progressive (canary then full) rollout (CD), with an automated rollback when a health check fails. In GitOps, the repository holds the desired state and a controller continuously reconciles production to it.

Two refinements make the model precise. First, a deploy is a reversible state change, not a one-way door. The Friday-night failure was an irreversible step — once the half-migration ran, there was no “undo.” A well-built pipeline treats every deploy as a move you can take back: the previous version is still there, still runnable, one command away. You don’t deploy carefully because you can’t undo it; you deploy freely because you can. Second, the most powerful form of CD inverts who’s in charge. In GitOps, the git repository holds the desired state of production — “this is what should be running” — and a controller running in the cluster continuously compares desired state to actual state and reconciles any difference. You don’t push to production; you change the repo, and the system pulls itself toward it. Production becomes a function of git, and the deploy becomes a git commit.

When to invest, and which strategy

CI is not a “when” question — adopt it on day one of any project with more than one contributor, before there’s anything worth deploying. The cost of retrofitting a test gate onto a codebase that grew without one is far higher than the cost of starting with an empty workflow that runs pytest on every push. CI is cheap insurance that gets more valuable, and harder to add, with every commit.

The interesting decision is the deployment strategy (how a new version replaces the old one), and it’s a trade of rollout speed against blast radius, shown in Figure 43.1 as the staged path from staging through canary to full. Recreate (stop the old, start the new) is simplest and incurs downtime; reach for it only when a brief outage is acceptable, as in batch jobs or internal tools. Rolling replaces instances a few at a time so there’s no downtime and no extra infrastructure; it’s the sensible default for most stateless services. Blue-green runs two full environments and switches traffic atomically, giving instant cutover and instant rollback at the cost of double the infrastructure during the switch; it is worth it when even a brief window of mixed-version traffic is unacceptable. Canary sends a small slice of real traffic to the new version, watches its metrics, and ramps up only if they stay healthy; it has the smallest blast radius of all and is the right choice for high-traffic services where a bad release would hurt many users fast. The rule of thumb: the higher the cost of a bad release reaching everyone, the more gradual the rollout you want.

What you’ll learn

Why the pipeline’s quality gates exist, what each one (build, test, scan) checks, and why failing fast — running the cheapest checks first — is the design principle that keeps a pipeline useful
How CI works in practice with GitHub Actions: the workflow/job/step model, dependency caching, matrix builds, and what separates a pipeline people trust from one they route around
How the four deployment strategies — recreate, rolling, blue-green, canary — trade rollout speed against blast radius, and how feature flags let you deploy code without releasing the feature
How GitOps makes git the source of truth and a controller the deployer, and why pull-based reconciliation gives you a free audit trail and self-healing
Why fast, automated rollback matters more than careful deployment, and how health gates turn rollback from a panic into a reflex
Where the older imperative CI model (Jenkins) fits, and what the modern declarative, cloud-native model changed

Prerequisites

Containerization with Docker (prerequisite): the pipeline’s central artifact is a container image — CI builds it, scans it, and ships it, so you’ll want to be comfortable with images, layers, and registries
Testing fundamentals: the test suite is the gate CI enforces; you should know the difference between unit, integration, and end-to-end tests and why a fast, reliable suite is the foundation everything else stands on
Comfort with git (branches, merges, pull requests) and a shell, plus a working mental model of “staging vs. production” environments

The pipeline and its quality gates

A pipeline is an ordered sequence of stages, each of which must pass before the next begins, that carries a change from commit to production. The canonical CI half is build → test → scan, producing an artifact; the CD half takes that artifact and runs deploy → health gate → progressive rollout. The stages are the assembly-line stations, and the discipline that makes the line worth having is that each is a gate — a hard stop, not a suggestion. A failing test doesn’t produce a warning that ships anyway; it fails the pipeline and the change does not advance.

The single most important design principle here is fail fast: order the gates so the cheapest, fastest checks run first, and the expensive ones run only on changes that have already survived. Linting takes seconds; unit tests take a minute; integration tests take five; building and pushing an image takes more; deploying to staging and running smoke tests takes more still. If a change has a syntax error, you want to know in ten seconds, not after a six-minute image build. So the line is ordered by cost, and a failure anywhere short-circuits everything after it. This is the same economics as the old “a bug caught in code review costs a tenth of one caught in production” — the earlier the gate, the cheaper the failure.
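
The ordering rule can be sketched in a few lines of Python. This is an illustration rather than how any particular CI system schedules work: the `Gate` type, the gate names, and the cost figures are all hypothetical, chosen to echo the rough timings above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Gate:
    name: str
    cost_seconds: float          # rough expected runtime (hypothetical figures)
    check: Callable[[], bool]    # returns True when the gate passes


def run_fail_fast(gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run gates cheapest-first and stop at the first failure.

    Returns (passed, names of the gates that actually ran).
    """
    ran: list[str] = []
    for gate in sorted(gates, key=lambda g: g.cost_seconds):
        ran.append(gate.name)
        if not gate.check():
            return False, ran    # short-circuit: costlier gates never start
    return True, ran


# Hypothetical pipeline in which the lint gate fails.
gates = [
    Gate("image-build", 360, lambda: True),
    Gate("unit-tests", 60, lambda: True),
    Gate("lint", 10, lambda: False),
    Gate("integration-tests", 300, lambda: True),
]
passed, ran = run_fail_fast(gates)
```

Here only `lint` runs: the failure costs about ten seconds of feedback time, and the six-minute image build never starts.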

The artifact in the middle deserves its own emphasis, because it’s where the “reproducible” promise lives. CI’s job is to build the deployable thing once — a container image, content-addressed by digest — and then that exact artifact, byte for byte, is what flows through staging and into production. You do not rebuild for production. Rebuilding reintroduces the “works in CI, breaks in prod” gap that the whole exercise exists to close: a rebuild can pull a different base image, a different transitive dependency, a different compiler, and now the thing you tested is not the thing you shipped. Build once, promote the same artifact everywhere. The image you scan in CI is the image that runs in production, and you can prove it by digest.
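
A small sketch of what "prove it by digest" means, assuming a simplified model in which the artifact is a byte string and its digest is a SHA-256 of those bytes. Real registries compute the digest over the image manifest, and `promote` is a hypothetical helper, not a registry or deployment API.

```python
import hashlib


def digest(artifact: bytes) -> str:
    """Content address of an artifact, in the 'sha256:<hex>' form registries use."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


def promote(artifact: bytes, tested_digest: str, environment: str) -> str:
    """Refuse to promote anything whose bytes differ from what CI tested."""
    actual = digest(artifact)
    if actual != tested_digest:
        raise ValueError(
            f"refusing to deploy to {environment}: {actual} != {tested_digest}"
        )
    return f"{environment} now runs {actual}"


built_once = b"app image built by CI"       # stand-in for the real image
tested = digest(built_once)                 # recorded when CI tests and scans it
staging_result = promote(built_once, tested, "staging")
```

A rebuild that differs by even one byte produces a different digest, so `promote` rejects it. That is the property that lets you say the scanned image and the running image are the same thing.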

CI in practice

GitHub Actions is the modern default for CI, and its model is worth knowing precisely because the vocabulary is shared across most modern systems. A workflow is a YAML file in .github/workflows/ triggered by an event — a push, a pull request, a tag. A workflow contains jobs, which run in parallel by default and can declare dependencies on each other with needs. Each job runs on a fresh runner and contains steps, which run in sequence and are either shell commands or reusable actions (uses: actions/checkout@v4). That’s the whole structure: events trigger workflows, workflows contain jobs, jobs contain steps.
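
To make the effect of needs concrete, here is a minimal Python model of how dependencies shape execution order. It is an approximation under stated assumptions: the job names are hypothetical, and a real scheduler starts each job as soon as its own dependencies finish rather than in strict waves.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: job name -> the jobs it lists under `needs`.
needs = {
    "lint": set(),
    "test": set(),
    "build": {"lint", "test"},
    "deploy": {"build"},
}


def execution_waves(needs: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into waves; jobs within a wave have no unmet dependencies."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()                      # raises CycleError on circular needs
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)               # mark the wave finished, unblocking dependents
    return waves


waves = execution_waves(needs)
```

With no needs declared, every job would land in the first wave, which is the "parallel by default" behavior; each needs entry pushes a job to a later wave.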

A minimal CI workflow is mostly this shape — check out the code, set up the runtime, install dependencies, run the gates:

# .github/workflows/ci.yml — runs on every push and PR to main.
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: "pip" }   # caches deps between runs
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .            # fast gate first
      - run: pytest --cov=src        # slower gate second

Two of those lines carry more weight than they look. The cache: "pip" turns dependency installation from a from-scratch download on every run into a near-instant cache restore — and pipeline speed is not a nicety, it’s the difference between a pipeline people wait for and one they learn to ignore. The ordering of ruff before pytest is the fail-fast principle in miniature: lint errors surface in seconds rather than after the test suite has run. When the same checks must run across several runtime versions or operating systems, a matrix expands one job definition into many parallel jobs — one per combination — so a single block tests Python 3.11, 3.12, and 3.13 at once. Matrices are powerful and also the easiest way to burn CI minutes, since a 3-by-3 matrix is nine jobs; constrain them to combinations you actually support.
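
The arithmetic behind that warning is a Cartesian product. The sketch below models matrix expansion in Python; the operating-system labels are an assumed example, and this is not the Actions implementation, only the counting rule.

```python
from itertools import product


def expand_matrix(matrix: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one job configuration per combination of matrix values."""
    keys = list(matrix)
    return [
        dict(zip(keys, combo))
        for combo in product(*(matrix[key] for key in keys))
    ]


# Assumed example matrix: three Python versions on three operating systems.
matrix = {
    "python-version": ["3.11", "3.12", "3.13"],
    "os": ["ubuntu-latest", "macos-latest", "windows-latest"],
}
jobs = expand_matrix(matrix)
job_count = len(jobs)   # 3 x 3 combinations
```

Adding a third axis with two values would double the count again, which is why trimming a matrix to supported combinations pays off quickly.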

What separates a good pipeline from a bad one is not its feature list. A good pipeline is fast (feedback in minutes, so developers stay in flow rather than context-switching away and forgetting the result), deterministic (the same commit produces the same result every time — flaky tests that pass and fail at random are corrosive, because they train people to re-run until green and stop reading failures), and required to merge (the gate is enforced by branch protection, not by hoping people check). A pipeline that’s slow, flaky, or optional is worse than no pipeline, because it costs time and builds false confidence. The goal is a green check that means something.
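
Determinism can be checked mechanically. One possible approach, sketched here with a hypothetical run history, is to flag any test that both passed and failed on the same commit, since nothing in the code changed between those runs.

```python
from collections import defaultdict


def find_flaky(runs: list[tuple[str, str, bool]]) -> set[str]:
    """Flag tests with mixed outcomes on a single commit.

    runs holds (commit_sha, test_name, passed) records.
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for sha, test, passed in runs:
        outcomes[(sha, test)].add(passed)
    # Two distinct outcomes for one (commit, test) pair means nondeterminism.
    return {test for (_, test), seen in outcomes.items() if len(seen) == 2}


# Hypothetical history: test_checkout flips on the same commit; test_login
# fails only on a different commit, which may be a real regression.
history = [
    ("abc123", "test_checkout", True),
    ("abc123", "test_checkout", False),
    ("abc123", "test_login", True),
    ("def456", "test_login", False),
]
flaky = find_flaky(history)
```

The distinction matters: a failure on a new commit is a signal to read, while mixed results on one commit are noise to fix.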

Build it → A full production CI/CD setup (multi-job workflows, caching, image build and push, staged deploys) lives in Project 05: SaaS Web Platform, which carries the DevOps and CI/CD concerns for a real multi-service app. For CI as a correctness gate on performance-sensitive code, Project 49: AI Benchmark Suite shows benchmarks wired into the pipeline so a regression fails the build.

Deployment strategies

Once CI has produced a trusted artifact, CD has to get it into production without breaking anything — and the how is the deployment strategy. They all answer the same question, “how does the new version replace the old?”, and they differ only in how they shift traffic, which is exactly the trade between rollout speed and blast radius.

Rolling updates replace instances in batches — take down a few old ones, bring up a few new ones, repeat until the fleet is new. There’s no downtime and no extra infrastructure, but for a window both versions serve traffic simultaneously, so the new version must be compatible with the old one’s database schema and API contracts. It’s the right default for most stateless services. Blue-green sidesteps the mixed-version window by running two complete environments — blue (live) and green (the new version, fully deployed but dark) — and flipping a router to send all traffic to green at once. Cutover is instant and, crucially, so is rollback: if green misbehaves, flip back to blue, which is still running untouched. The cost is carrying double the infrastructure during the switch, and you keep blue alive for a while after, as a hot standby, before scaling it down.
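
The mixed-version window of a rolling update is easy to see in a toy simulation. The sketch below assumes a fleet modeled as a list of version strings and a fixed batch size; an orchestrator would also wait for health checks between batches, which is omitted here.

```python
def rolling_update(fleet: list[str], new_version: str, batch_size: int) -> list[list[str]]:
    """Replace instances batch by batch; return the fleet state after each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(fleet)                       # do not mutate the caller's list
    states = []
    for start in range(0, len(fleet), batch_size):
        for i in range(start, min(start + batch_size, len(fleet))):
            fleet[i] = new_version
        states.append(list(fleet))            # snapshot after this batch
    return states


states = rolling_update(["v1"] * 4, "v2", batch_size=2)
# Snapshots in which both versions are serving traffic at once.
mixed = [state for state in states if len(set(state)) > 1]
```

Every snapshot in `mixed` is a moment when v1 and v2 both answer requests, which is why the new version has to tolerate the old schema and API contracts. A blue-green switch, by contrast, has no such intermediate state.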

Canary is the most cautious and the most powerful. Instead of switching all traffic at once, you route a small slice — 5%, say — to the new version and watch its metrics: error rate, latency, saturation. If they stay healthy through a bake time, you ramp to 25%, then 50%, then 100%, watching at each step; if they degrade, you abort and roll back, and only that 5% of users ever saw the bad version. The blast radius is bounded by the smallest slice you start with. This is what Figure 43.1 shows as the progressive path: canary first, then full, with a rollback arrow waiting on a breached metric. Canary trades rollout speed (it’s deliberately slow, minutes to hours) for the smallest possible blast radius, which is exactly the deal you want for a high-traffic service.
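
The ramp logic can be sketched as a short loop. This is a simplified model, not a rollout controller: `is_healthy` is a hypothetical callback standing in for watching metrics through the bake time at each step, and the ramp values are the ones used in the text.

```python
from typing import Callable


def run_canary(
    is_healthy: Callable[[int], bool],
    ramp: tuple[int, ...] = (5, 25, 50, 100),
) -> tuple[str, int]:
    """Walk the ramp and abort at the first unhealthy step.

    Returns (outcome, highest percent of traffic that reached the new version).
    """
    exposed = 0
    for percent in ramp:
        exposed = percent                 # traffic shifts before metrics can be read
        if not is_healthy(percent):
            return "rolled_back", exposed
    return "promoted", exposed


bad_from_start = run_canary(lambda percent: False)
healthy = run_canary(lambda percent: True)
```

A release that is bad from the first step is rolled back with only 5% of traffic exposed, while a healthy one reaches 100% after passing every step.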

There’s one more move that changes the game entirely: feature flags decouple deploy from release. A flag is a runtime switch that gates whether a code path is active. With flags, you can deploy the code for a new feature to production turned off, then turn it on for 1% of users, then 10%, then everyone — all without another deploy. Deploy becomes a low-stakes, mechanical act (ship the binary); release becomes a separate, controllable decision (flip the flag). And rollback for a flagged feature is instant and surgical: turn the flag off, no redeploy required. The deploy and the release stop being the same event, which is liberating precisely because it shrinks what any single deploy can break.
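
A percentage rollout is commonly implemented by hashing each user into a stable bucket. The sketch below shows one way to do that, assuming string user IDs and a hypothetical flag name; production flag systems add targeting rules and a remote configuration store on top of this idea.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for this flag."""
    hashed = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(hashed, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True when the user's bucket falls inside the current rollout percentage."""
    return bucket(flag, user_id) < rollout_percent


# Hypothetical flag: the code is deployed, but the feature is off for everyone.
dark = is_enabled("new-checkout", "user-42", 0)
# Releasing to everyone is a configuration change, not a deploy.
released = is_enabled("new-checkout", "user-42", 100)
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% keeps seeing it at 50%, and setting the percentage back to 0 turns it off for everyone without a redeploy.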

GitOps

Most of what we’ve described is push-based: CI finishes, then something — a pipeline step, an argocd call, a kubectl apply — reaches into the cluster and changes it. The pipeline holds the credentials to production and pushes changes in. GitOps inverts this. The desired state of production lives declaratively in a git repository — Kubernetes manifests, Helm values, the image tag to run — and a controller running inside the cluster continuously watches that repo, compares the declared desired state to the cluster’s actual state, and reconciles any difference. Nobody pushes to production. You change the repo; the controller pulls the change in. Figure 43.1 shows this as the git-is-desired-state loop the rollout converges on.

This pull-based model (Argo CD and Flux are the two dominant controllers) buys several things almost for free. The repository becomes an exact, auditable record of what is running and who changed it: every production change is a commit, with an author, a timestamp, a diff, and a review. There’s no “who ran what on the box” mystery, because there’s no box anyone runs things on; the controller is the only actor, and its instructions are all in git. The controller also self-heals: because it’s continuously reconciling toward the declared state, if someone makes a manual change to the cluster (a panicked 2am kubectl edit), the controller notices the drift and, when automatic sync and self-healing are enabled, reverts it to what git says. Configuration drift, the slow divergence between what you think is running and what actually is, can no longer accumulate silently: it is either corrected or surfaced as out-of-sync. And rollback becomes the most natural operation there is: it’s git revert. Revert the bad commit and the controller reconciles production back to the last good state. The deploy that took a Friday night and a runbook becomes a pull request.
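
The reconciliation loop at the heart of this model can be sketched in plain Python. This is a toy under stated assumptions: desired and actual state are dictionaries mapping a resource name to an image tag, the names are hypothetical, and the sketch always prunes undeclared resources, which real controllers typically make configurable.

```python
def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Bring actual in line with desired and return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}={spec}")
        elif actual[name] != spec:
            actions.append(f"update {name}: {actual[name]} -> {spec}")
        actual[name] = spec
    # Anything running that git does not declare is removed (pruning).
    for name in sorted(set(actual) - set(desired)):
        actions.append(f"delete {name}")
        del actual[name]
    return actions


desired = {"web": "app:abc123"}          # what the repository declares
cluster = {"web": "app:abc123"}          # what is actually running

cluster["web"] = "app:manual-hotfix"     # someone edits the cluster by hand
drift_fix = reconcile(desired, cluster)  # the next loop iteration undoes it
```

A rollback in this model is the same call with an older `desired`: reverting the commit changes the declared tag, and the next reconcile pass produces an update action back to the previous image.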

Rollback and safety

Here is the claim that reorganizes how you think about deployment: fast rollback beats careful deploy. You cannot, with any amount of testing, guarantee a release is good — production carries load, data, and traffic patterns that no staging environment fully reproduces, which is why “it worked in CI” and “it worked in prod” are different sentences. Since some fraction of releases will be bad, the question that actually governs your reliability is not “how do we stop bad releases” but “how fast can we recover from one.” A team that can roll back in thirty seconds can afford to deploy twenty times a day; a team whose rollback is a four-hour manual database rebuild — the Friday-night team — cannot afford to deploy at all. Recovery speed, not deploy caution, is what makes frequent shipping safe.

The mechanism that makes rollback automatic is the health gate: an explicit check that stands between deployment and promotion. After a deploy to staging or a canary slice, the pipeline runs smoke tests and reads health metrics, and promotion proceeds only if they pass. The same metrics power automated rollback — in a canary, you define a success condition (say, HTTP success rate at or above 99% over a five-minute window) and the rollout controller evaluates it at each step; if the canary breaches it, the controller aborts and reverts on its own, without a human in the loop, while the incident is still 5% of traffic instead of all of it. This is the rollback arrow in Figure 43.1: a failed health check or a breached metric routes straight back to the last good artifact. The deploy is, as promised, a reversible state change — and the reversal is something the system does for you, faster than you could decide to do it yourself.
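
A health gate of the kind described reduces to a threshold check over a window of samples. The sketch below assumes the metrics arrive as (successful, total) request counts per sample and uses the 99% example threshold from the text; the "hold" outcome for a window with no traffic is an added assumption, on the view that no data is not evidence of health.

```python
def gate_decision(window: list[tuple[int, int]], threshold: float = 0.99) -> str:
    """Decide what to do with a canary from (ok, total) counts in the window.

    Returns "promote", "rollback", or "hold" when there was no traffic to judge.
    """
    ok = sum(successes for successes, _ in window)
    total = sum(requests for _, requests in window)
    if total == 0:
        return "hold"                    # assumption: no data, no promotion
    return "promote" if ok / total >= threshold else "rollback"


# Hypothetical five-minute windows sampled once a minute.
good_window = [(998, 1000)] * 5          # 99.8% success
bad_window = [(998, 1000)] * 3 + [(900, 1000)] * 2   # errors spike late

good = gate_decision(good_window)
bad = gate_decision(bad_window)
```

Aggregating counts across the window, rather than averaging per-sample rates, keeps a low-traffic minute from dominating the decision.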

A note on Jenkins and the modern model

It’s worth placing the older world against the new, because you will meet both. Jenkins is the long-dominant CI server, and its model is imperative and server-centric: a long-lived Jenkins controller, configured through plugins, runs pipelines defined in a Jenkinsfile (a Groovy script of stage blocks) on a pool of build agents you operate. It’s enormously flexible — there is a plugin for everything — and that flexibility is also its cost: a Jenkins installation is infrastructure you own, patch, and secure, and its configuration tends to sprawl across the UI, plugins, and scripts in ways that drift from version control. The modern model — GitHub Actions, GitLab CI, and the GitOps controllers — is declarative and cloud-native: the pipeline is YAML that lives in the repo next to the code, runners are ephemeral and provisioned on demand, and there’s no server for you to keep alive. The shift is the same one containers made: from a long-lived, hand-tended machine to a declarative artifact that’s reproducible and disposable. Jenkins is far from dead — it runs an enormous amount of the world’s CI, especially on-premises and in regulated environments — but for a new project, the declarative cloud-native model is the default, and the reason is that the pipeline becomes code you review and version like everything else.

War story: the pipeline nobody trusted

A platform team’s CI pipeline took forty minutes to run and failed at random perhaps one run in five — a flaky integration test that depended on timing, a cache that occasionally corrupted, a third-party API that rate-limited the test suite. The pipeline was technically “required to merge,” but the team had quietly adapted: push, switch to another task, come back later, and when the run came back red, glance at it, decide “that looks like the flaky one,” and click re-run until it went green. Nobody read failures anymore, because most failures were noise. Then a real bug — a genuine regression that broke checkout — failed the pipeline, got the reflexive re-run treatment, happened to pass on the third try when the flaky test flipped green, and shipped. The outage cost a weekend.

The lesson is that a pipeline that isn’t trusted is worse than no pipeline at all. Not because it does nothing, but because it manufactures false confidence: a green check that means “the dice came up green,” not “this change is safe.” A slow pipeline trains people to stop watching; a flaky one trains them to stop believing; and a pipeline people route around — re-running until green, merging past a “known” failure — is providing exactly zero of the safety it appears to provide. Speed and determinism are not polish; they are the load-bearing properties. If your pipeline is slow, make it fast before you add a single feature to it. If it’s flaky, fixing the flake is the highest-priority work on the board, because every flaky failure is a lesson teaching your team to ignore the one failure that matters.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — A CI gate that means something. Write a GitHub Actions workflow that triggers on every push and pull request to main, checks out the code, sets up the runtime with dependency caching, and runs a linter followed by the test suite. Deliberately push a commit that fails a test and confirm the workflow goes red; then enable branch protection so the workflow is required to merge and confirm the failing change cannot be merged. Write one sentence explaining why you ran the linter before the tests.
Level II — Build, scan, and a staged deploy with a gate. Extend the workflow so that after tests pass it builds a container image, runs a vulnerability scan (e.g. Trivy) that fails the build on a critical finding, and pushes the image to a registry tagged by commit SHA (never latest). Add a deploy job that ships the image to a staging environment and runs a smoke test against it as a health gate — a failed smoke test must block promotion to production. Explain, from the build output, why the same image (by digest) must flow to staging and prod rather than being rebuilt.
Level III — Design a canary with automated rollback. For a high-traffic service, design a canary deployment on paper. Define: (a) the specific metric that triggers rollback (name it, give the threshold and the evaluation window — e.g. success rate below 99% over five minutes); (b) the bake time and ramp schedule (5% → 25% → 50% → 100%, and how long the canary bakes at each step before promoting); and (c) how feature flags let you separate the deploy from the release, so the new code can ship dark and be turned on for a fraction of users independently of the rollout. State what happens, step by step, when the metric breaches mid-ramp, and how a human would know it happened.

Summary

CI/CD turns deployment from a rare, manual, irreversible event into a continuous, automated, reversible pipeline. CI keeps the mainline always-releasable by building and testing every change behind quality gates ordered to fail fast — cheap checks first, expensive ones only on survivors — and produces a single immutable artifact, built once and promoted unchanged. CD carries that artifact to production behind safety rails: a deployment strategy chosen to trade rollout speed against blast radius (rolling for the common case, blue-green for instant cutover, canary for the smallest blast radius), health gates that block bad releases, and automated rollback that recovers faster than a human could decide to. GitOps makes git the source of truth and a controller the deployer, buying an audit trail, self-healing, and git revert as rollback. The thread through all of it is that small, frequent, reversible changes are safer than big, rare ones — and that a pipeline is only worth having if it’s fast, deterministic, and trusted enough that a green check actually means the change is safe to ship.

Key takeaways

The goal is continuous, automated, reversible deployment; small frequent changes have a smaller blast radius and an obvious cause, which makes them safer than big rare ones.
Order pipeline gates to fail fast — cheapest checks first — and build the deployable artifact once, then promote that exact artifact everywhere; never rebuild for prod.
Deployment strategies trade rollout speed for blast radius: rolling (default), blue-green (instant cutover and rollback, double infra), canary (smallest blast radius, slowest). Feature flags separate deploy from release.
GitOps inverts deployment: git holds desired state, a controller reconciles production to it — yielding an audit trail, self-healing against drift, and git revert as the rollback.
Fast rollback beats careful deploy: you can’t guarantee a release is good, so recovery speed governs reliability. Health gates make rollback automatic.
A slow or flaky pipeline is worse than none — it manufactures false confidence and trains people to stop reading failures.

Connections to other chapters

Containerization with Docker (prerequisite): the pipeline’s central artifact is the container image. CI builds it, scans it for vulnerabilities, and pushes it by digest; the “build once, promote everywhere” rule here is what makes the image’s reproducibility promise pay off all the way to production.
Testing fundamentals and the per-language Testing chapters (prerequisite): the test suite is the gate CI enforces. Everything in this chapter assumes a fast, reliable suite — the war story is about what happens when it isn’t, and the difference between unit, integration, and end-to-end tests is exactly the fail-fast ordering of the gates.
Observability (sibling): a deploy is an event you watch, and a canary is observability turned into a control loop — the automated-rollback metric (success rate, latency) is read straight from the same telemetry the observability chapter teaches you to emit and query.
Security (sibling): the scanning stage, SBOM generation, secret detection, and least-privilege pipeline credentials all live in the pipeline. CI is where supply-chain security is enforced — a vulnerable dependency or a leaked secret is caught at a gate before it ships, not after.

Further reading

Humble & Farley, Continuous Delivery — the foundational book; the deployment pipeline, build-once-deploy-many, and the discipline of keeping the mainline always releasable all come from here.
GitHub Actions documentation — the canonical reference for the workflow/job/step model, caching, matrices, environments, and reusable workflows.

Deep dives

Forsgren, Humble & Kim, Accelerate — the research behind the DORA metrics (deployment frequency, lead time, change-failure rate, time-to-restore) and the evidence that frequent, small deploys correlate with higher stability, not lower.
Argo CD and Flux documentation — the two dominant GitOps controllers; read these for the reconciliation loop, sync policies, self-healing, and progressive delivery with Argo Rollouts.

Historical context

John Allspaw & Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr” (Velocity, 2009) — the talk widely credited as the spark for the DevOps and continuous-deployment movement, where the idea that you should deploy many times a day went from heresy to aspiration.

--- title: "CI/CD and Deployment Automation" keywords: [ci/cd, continuous integration, continuous delivery, github actions, gitops, pipeline, deployment, canary, blue-green, rollback] difficulty: intermediate prerequisites: [containerization, testing-fundamentals] estimated_time: "3-4 hours" --- ## Introduction The deploy was scheduled for Friday at 5pm, after the traffic died down, the way it always was. One engineer — call her the only person who knew the runbook — SSH'd into the production box, pulled the new build, ran the four migration scripts in the order written on a wiki page she'd half-memorized, restarted the service by hand, and watched the logs scroll. For three years this had worked. At 6:02pm it did not. The third migration assumed a column that an earlier hotfix had already renamed, and it failed *halfway* — some rows migrated, some not, the schema now in a state no script anticipated. The service came up against the broken schema and started returning 500s on every write. There was no rollback. "Rollback" had always meant "redeploy the old build," but the old build couldn't read the new, half-migrated table either. The one person who understood the steps spent the next four hours reconstructing the database by hand while the site stayed down, and at 10pm she was still typing SQL into a production shell from memory, alone, on a Friday night. Nothing in that story is a code bug. The application was fine. What failed was the *process*: a deployment that lived in one person's head, ran by hand, in an order no machine enforced, with no tested way back. The release was a rare, high-stakes, irreversible event — a one-way door someone walked through holding their breath. And because it was rare, it was big: weeks of accumulated changes shipped at once, so when something broke there was no way to tell which of fifty commits did it. The whole catastrophe is what happens when shipping software is a heroic manual act instead of a boring automated one. 
CI/CD turns deployment from a risky event into a repeatable, reversible pipeline — one that builds and tests every change as it lands, ships it the same way every time, and can take it back in seconds when it's wrong. ### The Core Insight The goal of CI/CD is to make integration and deployment **continuous, automated, and reversible**, so that shipping software is low-risk and frequent rather than rare and terrifying. Those three properties are the whole thing, and each fixes a specific failure in the Friday-night story. *Continuous* means every change is integrated and tested the moment it lands, not batched up for a quarterly release. The opposite — long-lived branches that diverge for weeks and then merge in a painful big bang — is where integration bugs hide until they're expensive. **Continuous Integration (CI)** keeps the mainline always-releasable by building and testing every commit, so the question "is `main` shippable?" always has the answer "yes, we just checked." *Automated* means the path from a merged commit to running production code is executed by a machine following the same steps in the same order every time — no wiki page, no memory, no single point of human failure. **Continuous Delivery/Deployment (CD)** owns that path, with safety rails: staged rollout, health gates that block a bad release, and instant rollback when one slips through. The deepest part of the insight is counterintuitive: **small, frequent changes are *safer* than big, rare ones.** Every instinct says the opposite — surely shipping less often, after more review, is more cautious. But a deploy carrying one change has a tiny blast radius and an obvious cause: if it breaks, you know exactly what broke it, because only one thing changed. A deploy carrying fifty changes after three weeks has fifty suspects and a debugging session. Frequency is not recklessness; it is risk reduction through *small batches*. 
The teams that deploy many times a day are not braver than the ones that deploy once a quarter — they have made each deploy so small and so reversible that it stopped being scary.

### A mental model

Picture the pipeline as an **assembly line with quality gates**. A change enters at one end as a commit and moves down the line through stations — build, test, scan, deploy — and at each station there is an inspector who can stop the line. The build station won't pass a change that doesn't compile; the test station won't pass one that fails its tests; the scan station won't pass one with a critical vulnerability. Nothing reaches the next station until it clears the current one, and nothing reaches production until it clears them all. The point of the line is not speed for its own sake — it's that every unit comes out the far end having passed *exactly the same inspections*, so what ships is never a surprise. @fig-cicd-pipeline shows this line end to end.

![A CI/CD pipeline: every commit is built, tested, and scanned (CI), producing an artifact that flows through staged deployment with health gates and a progressive (canary then full) rollout (CD) — with an automated rollback when a health check fails. In GitOps, the repository holds the desired state and a controller continuously reconciles production to it.](../assets/diagrams/rendered/cicd_pipeline.svg){#fig-cicd-pipeline .lightbox}

Two refinements make the model precise. First, **a deploy is a reversible state change, not a one-way door.** The Friday-night failure was an irreversible step — once the half-migration ran, there was no "undo." A well-built pipeline treats every deploy as a move you can take back: the previous version is still there, still runnable, one command away. You don't deploy carefully because you can't undo it; you deploy freely *because* you can. Second, the most powerful form of CD inverts who's in charge.
In **GitOps**, the git repository holds the *desired state* of production — "this is what should be running" — and a controller running in the cluster continuously compares desired state to actual state and reconciles any difference. You don't push to production; you change the repo, and the system pulls itself toward it. Production becomes a function of git, and the deploy becomes a `git commit`.

### When to invest, and which strategy

CI is not a "when" question — adopt it on **day one** of any project with more than one contributor, before there's anything worth deploying. The cost of retrofitting a test gate onto a codebase that grew without one is far higher than the cost of starting with an empty workflow that runs `pytest` on every push. CI is cheap insurance that gets more valuable, and harder to add, with every commit.

The interesting decision is the *deployment strategy* — how a new version replaces the old one — and it's a trade of rollout speed against blast radius, shown in @fig-cicd-pipeline as the staged path from staging through canary to full. **Recreate** (stop the old, start the new) is simplest and incurs downtime; reach for it only when a brief outage is acceptable, as in batch jobs or internal tools. **Rolling** replaces instances a few at a time so there's no downtime and no extra infrastructure; it's the sensible default for most stateless services. **Blue-green** runs two full environments and switches traffic atomically, giving instant cutover *and* instant rollback at the cost of double the infrastructure during the switch — worth it when a few seconds of mixed-version traffic is unacceptable. **Canary** sends a small slice of real traffic to the new version, watches its metrics, and ramps up only if they stay healthy; it has the smallest blast radius of all and is the right choice for high-traffic services where a bad release would hurt many users fast.
The rule of thumb: the higher the cost of a bad release reaching everyone, the more gradual the rollout you want.

### What you'll learn

- Why the pipeline's quality gates exist, what each one (build, test, scan) checks, and why *failing fast* — running the cheapest checks first — is the design principle that keeps a pipeline useful
- How CI works in practice with GitHub Actions: the workflow/job/step model, dependency caching, matrix builds, and what separates a pipeline people trust from one they route around
- How the four deployment strategies — recreate, rolling, blue-green, canary — trade rollout speed against blast radius, and how feature flags let you deploy code without releasing the feature
- How GitOps makes git the source of truth and a controller the deployer, and why pull-based reconciliation gives you a free audit trail and self-healing
- Why fast, automated rollback matters more than careful deployment, and how health gates turn rollback from a panic into a reflex
- Where the older imperative CI model (Jenkins) fits, and what the modern declarative, cloud-native model changed

### Prerequisites

- **Containerization with Docker** (prerequisite): the pipeline's central artifact is a container image — CI builds it, scans it, and ships it, so you'll want to be comfortable with images, layers, and registries
- **Testing fundamentals**: the test suite is the gate CI enforces; you should know the difference between unit, integration, and end-to-end tests and why a fast, reliable suite is the foundation everything else stands on
- Comfort with git (branches, merges, pull requests) and a shell, plus a working mental model of "staging vs. production" environments

---

## The pipeline and its quality gates

A pipeline is an ordered sequence of stages, each of which must pass before the next begins, that carries a change from commit to production.
The canonical CI half is **build → test → scan**, producing an **artifact**; the CD half takes that artifact and runs **deploy → health gate → progressive rollout**. The stages are the assembly-line stations, and the discipline that makes the line worth having is that *each is a gate* — a hard stop, not a suggestion. A failing test doesn't produce a warning that ships anyway; it fails the pipeline and the change does not advance.

The single most important design principle here is **fail fast**: order the gates so the cheapest, fastest checks run first, and the expensive ones run only on changes that have already survived. Linting takes seconds; unit tests take a minute; integration tests take five; building and pushing an image takes more; deploying to staging and running smoke tests takes more still. If a change has a syntax error, you want to know in ten seconds, not after a six-minute image build. So the line is ordered by cost, and a failure anywhere short-circuits everything after it. This is the same economics as the old "a bug caught in code review costs a tenth of one caught in production" — the earlier the gate, the cheaper the failure.

The **artifact** in the middle deserves its own emphasis, because it's where the "reproducible" promise lives. CI's job is to build the deployable thing **once** — a container image, content-addressed by digest — and then *that exact artifact*, byte for byte, is what flows through staging and into production. You do not rebuild for production. Rebuilding reintroduces the "works in CI, breaks in prod" gap that the whole exercise exists to close: a rebuild can pull a different base image, a different transitive dependency, a different compiler, and now the thing you tested is not the thing you shipped. Build once, promote the same artifact everywhere. The image you scan in CI is the image that runs in production, and you can prove it by digest.
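
The two rules above, fail-fast ordering and build-once promotion, can be sketched as a GitHub Actions workflow. This is an illustrative outline, not a definitive setup: the job names, the `ghcr.io` registry, and the `./scripts/deploy.sh` helper are assumptions made for the example, and the image name assumes a lowercase `owner/repo`.

```yaml
# Hypothetical workflow: gates ordered cheapest-first, image built exactly once.
on:
  push: { branches: [main] }

jobs:
  lint:                                  # seconds: runs first
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install ruff
      - run: ruff check .

  test:                                  # minutes: only if lint passed
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: "pip" }
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest

  build:                                 # most expensive: only if tests passed
    needs: test
    runs-on: ubuntu-latest
    permissions: { contents: read, packages: write }
    outputs:
      digest: ${{ steps.push.outputs.digest }}   # sha256 digest of the pushed image
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}

  deploy-staging:                        # consumes the digest, never rebuilds
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # deploy.sh is a placeholder for whatever actually ships the image.
      - run: ./scripts/deploy.sh staging "ghcr.io/${{ github.repository }}@${{ needs.build.outputs.digest }}"
```

Because each job names its predecessor in `needs`, a lint failure stops the run before any image is built, and the deploy job refers to the image by the digest the build job reported rather than by a tag that could later point somewhere else.
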

## CI in practice

GitHub Actions is the modern default for CI, and its model is worth knowing precisely because the vocabulary is shared across most modern systems. A **workflow** is a YAML file in `.github/workflows/` triggered by an event — a push, a pull request, a tag. A workflow contains **jobs**, which run in parallel by default and can declare dependencies on each other with `needs`. Each job runs on a fresh runner and contains **steps**, which run in sequence and are either shell commands or reusable **actions** (`uses: actions/checkout@v4`). That's the whole structure: events trigger workflows, workflows contain jobs, jobs contain steps.

A minimal CI workflow is mostly this shape — check out the code, set up the runtime, install dependencies, run the gates:

```yaml
# .github/workflows/ci.yml — runs on every push and PR to main.
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12", cache: "pip" }   # caches deps between runs
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .          # fast gate first
      - run: pytest --cov=src      # slower gate second
```

Two of those lines carry more weight than they look. The `cache: "pip"` turns dependency installation from a from-scratch download on every run into a near-instant cache restore — and pipeline speed is not a nicety, it's the difference between a pipeline people wait for and one they learn to ignore. The ordering of `ruff` before `pytest` is the fail-fast principle in miniature: lint errors surface in seconds rather than after the test suite has run.

When the same checks must run across several runtime versions or operating systems, a **matrix** expands one job definition into many parallel jobs — one per combination — so a single block tests Python 3.11, 3.12, and 3.13 at once.
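
As a sketch of that expansion, the `test` job from the minimal workflow could declare a matrix along these lines. The version list comes from the sentence above; the rest mirrors the earlier workflow and is illustrative only.

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: true                    # cancel sibling jobs once one fails
      matrix:
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}   # one job per entry
          cache: "pip"
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest
```

`fail-fast: true` is already the default for a matrix; it is spelled out here only to make the behavior visible. Adding a second axis, such as an `os` list, multiplies the job count.
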
Matrices are powerful and also the easiest way to burn CI minutes, since a 3-by-3 matrix is nine jobs; constrain them to combinations you actually support.

What separates a good pipeline from a bad one is not its feature list. A good pipeline is **fast** (feedback in minutes, so developers stay in flow rather than context-switching away and forgetting the result), **deterministic** (the same commit produces the same result every time — flaky tests that pass and fail at random are corrosive, because they train people to re-run until green and stop reading failures), and **required to merge** (the gate is enforced by branch protection, not by hoping people check). A pipeline that's slow, flaky, or optional is worse than no pipeline, because it costs time and builds false confidence. The goal is a green check that *means* something.

> **Build it →** A full production CI/CD setup — multi-job workflows, caching, image
> build and push, staged deploys — lives in
> [Project 05: SaaS Web Platform](https://github.com/jchu0/applied-cs-projects/tree/main/05-saas-web-platform),
> which carries the DevOps and CI/CD concerns for a real multi-service app. For CI as a
> correctness gate on performance-sensitive code, the
> [Project 49: AI Benchmark Suite](https://github.com/jchu0/applied-cs-projects/tree/main/49-ai-benchmark-suite)
> shows benchmarks wired into the pipeline so a regression fails the build.

## Deployment strategies

Once CI has produced a trusted artifact, CD has to get it into production without breaking anything — and the *how* is the deployment strategy. They all answer the same question, "how does the new version replace the old?", and they differ only in how they shift traffic, which is exactly the trade between rollout speed and blast radius.

**Rolling** updates replace instances in batches — take down a few old ones, bring up a few new ones, repeat until the fleet is new.
There's no downtime and no extra infrastructure, but for a window both versions serve traffic simultaneously, so the new version must be compatible with the old one's database schema and API contracts. It's the right default for most stateless services.

**Blue-green** sidesteps the mixed-version window by running two complete environments — blue (live) and green (the new version, fully deployed but dark) — and flipping a router to send all traffic to green at once. Cutover is instant and, crucially, so is rollback: if green misbehaves, flip back to blue, which is still running untouched. The cost is carrying double the infrastructure during the switch, and you keep blue alive for a while after, as a hot standby, before scaling it down.

**Canary** is the most cautious and the most powerful. Instead of switching all traffic at once, you route a small slice — 5%, say — to the new version and watch its metrics: error rate, latency, saturation. If they stay healthy through a **bake time**, you ramp to 25%, then 50%, then 100%, watching at each step; if they degrade, you abort and roll back, and only that 5% of users ever saw the bad version. The blast radius is bounded by the smallest slice you start with. This is what @fig-cicd-pipeline shows as the progressive path: canary first, then full, with a rollback arrow waiting on a breached metric. Canary trades rollout *speed* (it's deliberately slow, minutes to hours) for the *smallest possible* blast radius, which is exactly the deal you want for a high-traffic service.

There's one more move that changes the game entirely: **feature flags decouple deploy from release.** A flag is a runtime switch that gates whether a code path is active. With flags, you can deploy the code for a new feature to production *turned off*, then turn it on for 1% of users, then 10%, then everyone — all without another deploy.
Deploy becomes a low-stakes, mechanical act (ship the binary); release becomes a separate, controllable decision (flip the flag). And rollback for a flagged feature is instant and surgical: turn the flag off, no redeploy required. The deploy and the release stop being the same event, which is liberating precisely because it shrinks what any single deploy can break.

## GitOps

Most of what we've described is *push-based*: CI finishes, then something — a pipeline step, an `argocd` call, a `kubectl apply` — reaches into the cluster and changes it. The pipeline holds the credentials to production and pushes changes in. **GitOps** inverts this. The desired state of production lives declaratively in a git repository — Kubernetes manifests, Helm values, the image tag to run — and a **controller** running *inside* the cluster continuously watches that repo, compares the declared desired state to the cluster's actual state, and reconciles any difference. Nobody pushes to production. You change the repo; the controller pulls the change in. @fig-cicd-pipeline shows this as the git-is-desired-state loop the rollout converges on.

This pull-based model (Argo CD and Flux are the two dominant controllers) buys several things almost for free. The repository becomes an exact, **auditable** record of what is running and who changed it: every production change is a commit, with an author, a timestamp, a diff, and a review. There's no "who ran what on the box" mystery, because there's no box anyone runs things on — the controller is the only actor, and its instructions are all in git. The controller also **self-heals**: because it's continuously reconciling toward the declared state, if someone makes a manual change to the cluster — a panicked 2am `kubectl edit` — the controller notices the drift and reverts it back to what git says. Configuration drift, the slow divergence between what you think is running and what actually is, stops being possible by construction.
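
For a concrete picture of the reconciliation loop, here is a minimal sketch of an Argo CD `Application`, assuming Argo CD is installed in an `argocd` namespace. The repository URL, path, and application name are hypothetical placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web                      # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-config.git   # placeholder repo
    targetRevision: main         # the branch that holds the desired state
    path: apps/web/production    # placeholder directory of manifests
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: web
  syncPolicy:
    automated:
      prune: true                # delete resources that were removed from git
      selfHeal: true             # revert manual changes made in the cluster
```

In Argo CD both automated sync and `selfHeal` are opt-in, so the drift-reverting behavior described above holds only when the sync policy enables it, as this sketch does.
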
And rollback becomes the most natural operation there is: it's `git revert`. Point the repo back at the previous commit and the controller reconciles production back to the last good state. The deploy that took a Friday night and a runbook becomes a pull request.

## Rollback and safety

Here is the claim that reorganizes how you think about deployment: **fast rollback beats careful deploy.** You cannot, with any amount of testing, guarantee a release is good — production carries load, data, and traffic patterns that no staging environment fully reproduces, which is why "it worked in CI" and "it worked in prod" are different sentences. Since some fraction of releases *will* be bad, the question that actually governs your reliability is not "how do we stop bad releases" but "how fast can we recover from one." A team that can roll back in thirty seconds can afford to deploy twenty times a day; a team whose rollback is a four-hour manual database rebuild — the Friday-night team — cannot afford to deploy at all. Recovery speed, not deploy caution, is what makes frequent shipping safe.

The mechanism that makes rollback automatic is the **health gate**: an explicit check that stands between deployment and promotion. After a deploy to staging or a canary slice, the pipeline runs smoke tests and reads health metrics, and promotion proceeds *only if they pass*. The same metrics power **automated rollback** — in a canary, you define a success condition (say, HTTP success rate at or above 99% over a five-minute window) and the rollout controller evaluates it at each step; if the canary breaches it, the controller aborts and reverts on its own, without a human in the loop, while the incident is still 5% of traffic instead of all of it. This is the rollback arrow in @fig-cicd-pipeline: a failed health check or a breached metric routes straight back to the last good artifact.
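
One way to express that success condition is with Argo Rollouts, which the further reading mentions for progressive delivery. The sketch below is an assumption-laden illustration: the image reference, the Prometheus address, and the `http_requests_total` metric with its labels are all hypothetical, and without a traffic router the weights are approximated by replica counts.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 20
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: registry.example.com/web:3f2a1c9   # hypothetical commit-SHA tag
  strategy:
    canary:
      analysis:                    # background analysis running during the ramp
        templates:
          - templateName: success-rate
        startingStep: 1            # begin once the first 5% slice is live
      steps:
        - setWeight: 5
        - pause: { duration: 5m }  # bake time at each step
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }  # after the last step the rollout goes to 100%
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      failureLimit: 0              # abort on the first failed measurement
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # placeholder address
          query: |
            sum(rate(http_requests_total{app="web",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="web"}[5m]))
```

If a measurement falls below 0.99, the analysis fails, the rollout is aborted, and traffic returns to the stable version while the canary is scaled down.
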
The deploy is, as promised, a reversible state change — and the reversal is something the system does for you, faster than you could decide to do it yourself.

## A note on Jenkins and the modern model

It's worth placing the older world against the new, because you will meet both. **Jenkins** is the long-dominant CI server, and its model is *imperative* and *server-centric*: a long-lived Jenkins controller, configured through plugins, runs pipelines defined in a `Jenkinsfile` (a Groovy script of `stage` blocks) on a pool of build agents you operate. It's enormously flexible — there is a plugin for everything — and that flexibility is also its cost: a Jenkins installation is infrastructure you own, patch, and secure, and its configuration tends to sprawl across the UI, plugins, and scripts in ways that drift from version control.

The modern model — GitHub Actions, GitLab CI, and the GitOps controllers — is *declarative* and *cloud-native*: the pipeline is YAML that lives in the repo next to the code, runners are ephemeral and provisioned on demand, and there's no server for you to keep alive. The shift is the same one containers made: from a long-lived, hand-tended machine to a declarative artifact that's reproducible and disposable. Jenkins is far from dead — it runs an enormous amount of the world's CI, especially on-premises and in regulated environments — but for a new project, the declarative cloud-native model is the default, and the reason is that the pipeline becomes code you review and version like everything else.

::: {.callout-warning}
## War story: the pipeline nobody trusted

A platform team's CI pipeline took forty minutes to run and failed at random perhaps one run in five — a flaky integration test that depended on timing, a cache that occasionally corrupted, a third-party API that rate-limited the test suite.
The pipeline was technically "required to merge," but the team had quietly adapted: push, switch to another task, come back later, and when the run came back red, glance at it, decide "that looks like the flaky one," and click *re-run* until it went green. Nobody read failures anymore, because most failures were noise. Then a real bug — a genuine regression that broke checkout — failed the pipeline, got the reflexive re-run treatment, happened to pass on the third try when the flaky test flipped green, and shipped. The outage cost a weekend.

The lesson is that **a pipeline that isn't trusted is worse than no pipeline at all.** Not because it does nothing, but because it manufactures false confidence: a green check that means "the dice came up green," not "this change is safe." A slow pipeline trains people to stop watching; a flaky one trains them to stop believing; and a pipeline people route around — re-running until green, merging past a "known" failure — is providing exactly zero of the safety it appears to provide. Speed and determinism are not polish; they are the load-bearing properties. If your pipeline is slow, make it fast before you add a single feature to it. If it's flaky, fixing the flake is the highest-priority work on the board, because every flaky failure is a lesson teaching your team to ignore the one failure that matters.
:::

---

## Practical exercise

**Difficulty:** Level I · Level II · Level III

1. **Level I — A CI gate that means something.** Write a GitHub Actions workflow that triggers on every push and pull request to `main`, checks out the code, sets up the runtime with dependency caching, and runs a linter followed by the test suite. Deliberately push a commit that fails a test and confirm the workflow goes red; then enable branch protection so the workflow is *required to merge* and confirm the failing change cannot be merged. Write one sentence explaining why you ran the linter before the tests.
2. **Level II — Build, scan, and a staged deploy with a gate.** Extend the workflow so that after tests pass it builds a container image, runs a vulnerability scan (e.g. Trivy) that fails the build on a critical finding, and pushes the image to a registry tagged by commit SHA (never `latest`). Add a deploy job that ships the image to a staging environment and runs a smoke test against it as a **health gate** — a failed smoke test must block promotion to production. Explain, from the build output, why the *same* image (by digest) must flow to staging and prod rather than being rebuilt.
3. **Level III — Design a canary with automated rollback.** For a high-traffic service, design a canary deployment on paper. Define: (a) the specific **metric** that triggers rollback (name it, give the threshold and the evaluation window — e.g. success rate below 99% over five minutes); (b) the **bake time** and ramp schedule (5% → 25% → 50% → 100%, and how long the canary bakes at each step before promoting); and (c) how **feature flags** let you separate the deploy from the release, so the new code can ship dark and be turned on for a fraction of users independently of the rollout. State what happens, step by step, when the metric breaches mid-ramp, and how a human would know it happened.

## Summary

CI/CD turns deployment from a rare, manual, irreversible event into a continuous, automated, reversible pipeline. **CI** keeps the mainline always-releasable by building and testing every change behind quality gates ordered to fail fast — cheap checks first, expensive ones only on survivors — and produces a single immutable artifact, built once and promoted unchanged.
**CD** carries that artifact to production behind safety rails: a deployment strategy chosen to trade rollout speed against blast radius (rolling for the common case, blue-green for instant cutover, canary for the smallest blast radius), health gates that block bad releases, and automated rollback that recovers faster than a human could decide to. **GitOps** makes git the source of truth and a controller the deployer, buying an audit trail, self-healing, and `git revert` as rollback. The thread through all of it is that small, frequent, reversible changes are *safer* than big, rare ones — and that a pipeline is only worth having if it's fast, deterministic, and trusted enough that a green check actually means the change is safe to ship.

### Key takeaways

- The goal is continuous, automated, *reversible* deployment; small frequent changes have a smaller blast radius and an obvious cause, which makes them safer than big rare ones.
- Order pipeline gates to fail fast — cheapest checks first — and build the deployable artifact once, then promote that exact artifact everywhere; never rebuild for prod.
- Deployment strategies trade rollout speed for blast radius: rolling (default), blue-green (instant cutover and rollback, double infra), canary (smallest blast radius, slowest). Feature flags separate deploy from release.
- GitOps inverts deployment: git holds desired state, a controller reconciles production to it — yielding an audit trail, self-healing against drift, and `git revert` as the rollback.
- Fast rollback beats careful deploy: you can't guarantee a release is good, so recovery speed governs reliability. Health gates make rollback automatic.
- A slow or flaky pipeline is worse than none — it manufactures false confidence and trains people to stop reading failures.

### Connections to other chapters

- **Containerization with Docker** (prerequisite): the pipeline's central artifact is the container image.
CI builds it, scans it for vulnerabilities, and pushes it by digest; the "build once, promote everywhere" rule here is what makes the image's reproducibility promise pay off all the way to production.
- **Testing fundamentals** and the per-language **Testing** chapters (prerequisite): the test suite *is* the gate CI enforces. Everything in this chapter assumes a fast, reliable suite — the war story is about what happens when it isn't, and the difference between unit, integration, and end-to-end tests is exactly the fail-fast ordering of the gates.
- **Observability** (sibling): a deploy is an event you watch, and a canary is observability turned into a control loop — the automated-rollback metric (success rate, latency) is read straight from the same telemetry the observability chapter teaches you to emit and query.
- **Security** (sibling): the scanning stage, SBOM generation, secret detection, and least-privilege pipeline credentials all live in the pipeline. CI is where supply-chain security is enforced — a vulnerable dependency or a leaked secret is caught at a gate before it ships, not after.

## Further reading

### Essential

- Humble & Farley, *Continuous Delivery* — the foundational book; the deployment pipeline, build-once-deploy-many, and the discipline of keeping the mainline always releasable all come from here.
- *GitHub Actions documentation* — the canonical reference for the workflow/job/step model, caching, matrices, environments, and reusable workflows.

### Deep dives

- Forsgren, Humble & Kim, *Accelerate* — the research behind the **DORA metrics** (deployment frequency, lead time, change-failure rate, time-to-restore) and the evidence that frequent, small deploys correlate with *higher* stability, not lower.
- *Argo CD* and *Flux* documentation — the two dominant GitOps controllers; read these for the reconciliation loop, sync policies, self-healing, and progressive delivery with Argo Rollouts.

### Historical context

- John Allspaw & Paul Hammond, *"10+ Deploys Per Day: Dev and Ops Cooperation at Flickr"* (Velocity, 2009) — the talk widely credited as the spark for the DevOps and continuous-deployment movement, where the idea that you *should* deploy many times a day went from heresy to aspiration.