Orchestration with Kubernetes
kubernetes, orchestration, control plane, reconciliation, pods, deployments, services, scheduler, etcd, declarative, self-healing
Introduction
A team had done containers right. Every service shipped as a slim image, builds were reproducible, and docker run behaved identically on a laptop and in staging. For a single VM, this was a quiet pleasure. Then the product grew. One service became eight, then twenty, spread across six machines because no single box could hold them all. The pleasure curdled into a second job nobody had been hired for: operating containers by hand.
The first real scare came at 3am. A node’s disk filled, the kernel started killing processes, and the four containers on that host simply stopped. Nothing noticed. No alert fired from the container layer because nothing was watching it; the containers that had been running were now containers that used to be running, and the machine had no opinion about the difference. Traffic black-holed until someone woke up, SSH’d in, and typed docker run four times from memory. The next scare came in daylight: a launch drove ten times the usual load, and scaling meant a person picking which hosts had spare capacity, copying the right docker run invocation, and updating the load balancer by hand — slowly, while the site buckled. And underneath both was a quieter rot. Each host had been tweaked live during some past incident, so no two were quite the same; the “config” was an oral history of who had logged in and changed what. There was no file you could point at and say this is what should be running.
The problem wasn’t containers. Containers solved packaging. The new problem was operating many containers across many machines, continuously, while things fail — and that problem doesn’t yield to more discipline or a better runbook. It yields to a different kind of system. Kubernetes is that system, and the shift it demands is less about new commands than about a new way of stating what you want.
The Core Insight: stop running containers, start declaring state
The team’s whole operating model was imperative: a human (or a script) issued commands — “run this container on that host,” “start two more,” “remove the dead one.” Every command assumed the operator knew the current state of the world and could compute the next correct action. That assumption breaks the instant the fleet is large enough that no one holds the whole picture, or fast-moving enough that the picture is stale by the time you act on it. A node dies and the imperative model has nothing to say, because nobody issued the command to notice.
Kubernetes inverts this. You don’t tell it to do things; you tell it what should be true. “Five replicas of this service, healthy, behind this address.” That declaration — the desired state — is recorded, and a set of background processes called controllers run a continuous reconciliation loop: observe the actual state of the cluster, compare it to the desired state, and take whatever action closes the gap. The shift from imperative to declarative is the entire game. Once you’ve described what should be true, every operational behavior you were doing by hand falls out of the same loop for free.
Hand-rolled orchestration costs you, specifically, all of these at once:
- No self-healing. A crashed container stays crashed until a human notices. Nothing reconciles “should be running” against “is running.”
- No rescheduling. When a node dies, its workloads die with it. There is no actor whose job is to find them new homes.
- No rollout safety. Deploying a new version means stopping the old and starting the new by hand, with no automatic health gating and no clean way back if it’s broken.
- No declarative source of truth. The real configuration lives in the union of every host’s drift and every operator’s memory, not in a file you can review, diff, and roll back.
A mental model
A thermostat is the cleanest mental model for Kubernetes. You don’t operate a thermostat by turning the furnace on and off; you set a target — 21°C — and walk away. The device reads the actual temperature, compares it to the target, and acts to close the gap, over and over, forever. It doesn’t care why the room got cold; an open window and a failed furnace produce the same corrective response, because the controller only ever asks one question: does reality match the spec? Kubernetes is a thermostat for your infrastructure, and its single setpoint is the desired state you declared. A pod dies, a node vanishes, you bump the replica count from five to eight — to the loop these are all the same event, a gap between desired and actual, and it answers them all the same way: act, observe, repeat. It is, in effect, a tireless operations team that never sleeps and only knows one instruction — make reality match the spec.
When to use Kubernetes (and when it’s overkill)
Kubernetes is not free. It is a distributed system you adopt to operate your distributed systems, and it brings real operational weight — a control plane to keep healthy, an object model to learn, and a long tail of ways to misconfigure it. That cost is justified by scale and by the failure modes that scale creates. Figure 48.1 shows the machinery you’re signing up to run; the decision below is whether you need it.
- Reach for Kubernetes when you run many services across a fleet of machines and genuinely need self-healing, autoscaling, and zero-downtime rollouts — when a node dying at 3am must be a non-event, not a page. It also earns its keep when you want a consistent, declarative API over heterogeneous or multi-cloud infrastructure.
- Avoid it when a single VM, a managed PaaS, or a serverless function would do. One small app at low traffic does not need a control plane; you’ll spend more time operating Kubernetes than the app would ever cost you to run by hand. Batch or event-driven work with no steady footprint is often better served by a queue or serverless than by a standing cluster.
The honest framing: Kubernetes trades a large fixed operational cost for a near-zero marginal cost per additional service. Below some fleet size that trade loses; above it, it wins decisively.
What you’ll learn
- How the control plane and data plane divide responsibility, and why the API server sits at the center of everything
- Why the reconciliation loop is the one idea that explains self-healing, rescheduling, scaling, and rollouts all at once
- How the core objects stack — Pod, ReplicaSet, Deployment — and why you almost never touch the lower two directly
- How Services and Ingress turn ephemeral pods into stable, routable endpoints
- When a workload needs a StatefulSet and a PersistentVolume instead of a Deployment — and the caveat that makes stateful workloads genuinely harder
- How liveness and readiness probes, used correctly, are what makes rollouts and self-healing safe instead of dangerous
- How the Horizontal Pod Autoscaler closes the loop on load, and where it misbehaves
Prerequisites
- Containers and images: what a container is, how images are built and layered (the Containerization with Docker chapter). Kubernetes schedules the images you built there; it does not replace them.
- Networking basics: DNS, ports, and load balancing — Kubernetes leans on all three to make services findable.
- Comfort reading YAML, the declarative format every Kubernetes object is written in.
The architecture: control plane and data plane
A Kubernetes cluster splits cleanly into two halves. The control plane is the brain: it holds the desired state and runs the loops that pursue it. The data plane is the muscle: worker nodes that actually run your containers. The split matters because it tells you where every responsibility lives, and where each kind of failure does its damage.
At the center of the control plane sits the API server. It is the single front door: every other component, every kubectl command, every controller talks only to the API server, never directly to each other. Behind it, etcd is a consistent, replicated key-value store that durably persists the cluster’s state: every object you’ve declared, along with the status most recently reported for it. The API server is the only thing that reads and writes etcd; everyone else reads and writes through the API server. This is why the API server acts as the source of truth for the rest of the cluster: not because it’s clever, but because it’s the one component everything funnels through.
Two more control-plane components do the actual reconciling. The scheduler watches for pods that have been created but not yet assigned to a node, and for each one decides where it should run: it filters nodes by available resources, affinity rules, and taints, then scores the survivors and binds the pod to the best fit. The controller manager runs the fleet of controllers, each responsible for one kind of object, each executing the same observe-diff-act loop: the ReplicaSet controller keeps the right number of pods alive, the Deployment controller manages ReplicaSets as versions change, the node controller notices when a node stops reporting, and so on.
The data plane is simpler. Each worker node runs a kubelet, an agent that takes its marching orders from the API server: “you should be running these pods.” The kubelet starts the containers via the container runtime, watches them, and continuously reports their actual status back up. Your application containers live inside pods — the smallest unit Kubernetes schedules. A pod is one or more containers that share a network namespace and storage, scheduled together as a unit; most of the time it holds a single container, with extra ones reserved for tightly-coupled sidecars.
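To make the pod concrete, here is a minimal sketch of a two-container pod: an application plus a log-shipping sidecar that shares a scratch volume with it. The pod name, the sidecar image (myreg/log-shipper), and all sizes are assumptions for illustration. The resource requests are the numbers the scheduler’s filtering step compares against each node’s free capacity.

```yaml
# Illustrative sketch: one pod, two containers, shared network and storage.
apiVersion: v1
kind: Pod
metadata:
  name: api-with-sidecar          # hypothetical name
  labels: { app: api }
spec:
  containers:
    - name: api
      image: myreg/api:v1.4.2
      ports:
        - containerPort: 8000
      resources:
        requests: { cpu: 250m, memory: 256Mi }   # the scheduler filters nodes on these
        limits: { memory: 512Mi }                # enforced on the node at runtime
      volumeMounts:
        - { name: logs, mountPath: /var/log/app }
    - name: log-shipper           # sidecar: shares localhost and the log volume with api
      image: myreg/log-shipper:0.3.1             # hypothetical image
      resources:
        requests: { cpu: 50m, memory: 64Mi }
      volumeMounts:
        - { name: logs, mountPath: /var/log/app, readOnly: true }
  volumes:
    - name: logs
      emptyDir: {}                # scratch space that lives and dies with the pod
```

In practice a pod shape like this would normally sit inside a higher-level object’s template rather than being applied on its own.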
The reconciliation loop is the whole system
Everything above is in service of one loop, and it is worth slowing down on, because once you see it you stop memorizing Kubernetes features and start deriving them. The loop is: the desired state lives in etcd; controllers and the scheduler watch the API server for changes and for drift; when actual state diverges from desired — because you changed the spec, or because something broke — they act to close the gap; the kubelets carry out the work and report the new actual state back; and the loop runs again, forever. There is no separate “self-healing feature” and no separate “rescheduling feature.” There is one loop, and those behaviors are what the loop does under different inputs.
Watch it absorb every operational task the team used to do by hand. You declare five replicas; the Deployment controller creates a ReplicaSet, the ReplicaSet controller sees zero pods where it wants five and creates five pod objects, the scheduler places them, and the kubelets start them. One container crashes; the kubelet restarts it in place. One pod is deleted or evicted; the ReplicaSet controller sees four where it wants five and creates a replacement. A node dies at 3am; the node controller marks it unreachable, its pods are evicted once their toleration for an unreachable node expires (300 seconds by default), the count drops below desired, and replacements are scheduled onto healthy nodes. The exact incident that woke someone up is now a thing that happens silently, within minutes under default settings and faster if you tune them. You bump the count to eight during a traffic spike; same loop, three new pods. The imperative work disappears not because Kubernetes automated each task, but because each task was only ever a special case of “actual ≠ desired.”
The single most important consequence: you describe the destination, never the journey. You don’t tell Kubernetes how to get from four pods to five, or how to recover from a node failure. You state the invariant you want held, and the loop holds it.
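The gap the loop closes is visible on the object itself: every object carries a spec (what you declared) and a status (what the controllers last observed). The sketch below is an abridged, hand-written illustration of how a Deployment might look just after one pod was lost; the values are assumed, not captured from a real cluster.

```yaml
# Illustrative and abridged: desired state (spec) next to observed state (status).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  generation: 3            # bumped each time the spec changes
spec:
  replicas: 5              # desired: what you declared
status:
  observedGeneration: 3    # the controller has seen the latest spec
  replicas: 5              # pods that exist, including the fresh replacement
  readyReplicas: 4         # actual: one pod is not serving yet
  availableReplicas: 4
  unavailableReplicas: 1   # the gap the loop is working to close
```

You never write the status block yourself; controllers do, and comparing it against the spec is a quick way to see what the loop still has left to do.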
The core objects, as a layered hierarchy
You don’t usually create pods directly. Pods are mortal: the kubelet will restart a crashed container inside a pod, but once the pod itself is deleted or its node is lost, that pod is gone and is never resurrected elsewhere. So Kubernetes wraps pods in higher-level objects that own them and maintain their count. The hierarchy is worth holding in your head because each layer adds exactly one capability over the one below.
A ReplicaSet owns a set of identical pods and keeps a target number of them running. A Deployment owns ReplicaSets and adds change management — it knows how to roll from one version to the next safely. You almost always work at the Deployment level; the ReplicaSet underneath is an implementation detail the Deployment manages for you. A Deployment is mostly a small declaration of intent:
```yaml
# Illustrative: keep 5 of this image healthy. The Deployment owns the rest.
apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 5                      # the desired-state setpoint the loop chases
  selector: { matchLabels: { app: api } }
  template:                        # the pod spec the ReplicaSet stamps out
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: myreg/api:v1.4.2  # pin the tag; never `latest`
```

Note what’s not there: no instructions for how to reach five replicas, no host names, no rollout steps. You declared the setpoint and the desired pod shape; the loop does the rest.
Pods are also ephemeral in their addresses — each gets an IP that vanishes when the pod does — which makes them useless to point a client at directly. A Service solves this by providing a stable virtual IP and DNS name that load-balances across whatever pods currently match its label selector. As pods come and go, the Service keeps a steady front door. For HTTP traffic from outside the cluster, an Ingress sits above Services, terminating TLS and routing requests by hostname and path to the right Service — one external entry point fronting many internal services.
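As a minimal sketch of that front door, here is a Service in front of pods labelled app: api, plus an Ingress routing one hostname to it. The hostname, the TLS Secret name, and the nginx ingress class are assumptions for illustration; an Ingress object only takes effect if an ingress controller is running in the cluster.

```yaml
# Illustrative sketch: a stable in-cluster address, then an HTTP entry point.
apiVersion: v1
kind: Service
metadata: { name: api }
spec:
  type: ClusterIP                 # the default: a virtual IP reachable inside the cluster
  selector: { app: api }          # routes to whichever ready pods carry this label
  ports:
    - port: 80                    # port clients connect to on the Service
      targetPort: 8000            # port the container listens on
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata: { name: api }
spec:
  ingressClassName: nginx         # assumption: an NGINX ingress controller is installed
  tls:
    - hosts: [api.example.com]    # hypothetical hostname
      secretName: api-tls         # hypothetical Secret holding the certificate and key
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port: { number: 80 }
```

Inside the cluster, other pods in the same namespace can reach the Service simply as api; from other namespaces the name is api.<namespace>.svc.cluster.local under the default cluster domain.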
Configuration and secrets get their own objects so they stay out of your images. A ConfigMap holds non-sensitive key-value configuration; a Secret holds sensitive values (credentials, tokens), mounted into pods at runtime rather than baked into the image. In production you typically back Secrets with an external store — AWS Secrets Manager, Vault — and sync them in, so the source of truth for credentials never sits in a Git repo.
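A minimal sketch of both objects and of how a container consumes them follows. The object names, keys, and values are hypothetical. Note that a Secret’s values are only base64-encoded, which is an encoding rather than encryption, so access control and encryption at rest still matter.

```yaml
# Illustrative sketch: configuration and credentials as their own objects.
apiVersion: v1
kind: ConfigMap
metadata: { name: api-config }        # hypothetical name
data:
  LOG_LEVEL: "info"
  CACHE_TTL_SECONDS: "300"            # values are always strings
---
apiVersion: v1
kind: Secret
metadata: { name: api-credentials }   # hypothetical name
type: Opaque
stringData:                           # plain text on write; stored base64-encoded under `data`
  DATABASE_PASSWORD: "replace-me"     # placeholder; never commit a real value
---
# Fragment, not an object to apply on its own: this sits inside the
# Deployment's pod template and exposes both as environment variables.
containers:
  - name: api
    image: myreg/api:v1.4.2
    envFrom:
      - configMapRef: { name: api-config }
      - secretRef: { name: api-credentials }
```

Changing the ConfigMap does not restart pods that read it through environment variables; they pick up new values only when they are recreated.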
Stateful workloads are the exception to all of this, and the exception is instructive. A Deployment treats its pods as interchangeable cattle — any pod can replace any other. A database can’t work that way: each replica has its own identity and its own data on disk. A StatefulSet gives pods stable, ordinal network identities (db-0, db-1) and, via PersistentVolume claims, durable storage that survives a pod restart and follows the identity, not the pod. The caveat that catches teams: a StatefulSet gives you stable identity and storage, but it does not make your application a cluster. Three Postgres pods in a StatefulSet with no replication configured are three isolated single-node databases, not a high-availability cluster. The clustering — replication, leader election, failover — is the application’s job (an operator like CloudNativePG, or a Patroni/Bitnami chart), and Kubernetes only provides the stable scaffolding it runs on.
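The scaffolding can be sketched as a headless Service plus a StatefulSet with a volume claim template. The Secret name, storage size, and image tag are assumptions for illustration. As the caveat above says, applying this yields three independent Postgres instances, not a replicated cluster.

```yaml
# Illustrative sketch: stable identity plus per-pod durable storage.
apiVersion: v1
kind: Service
metadata: { name: db }
spec:
  clusterIP: None                  # headless: each pod gets its own DNS record
  selector: { app: db }
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: db }
spec:
  serviceName: db                  # ties pod DNS names (db-0.db, db-1.db) to the Service
  replicas: 3
  selector: { matchLabels: { app: db } }
  template:
    metadata: { labels: { app: db } }
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef: { name: db-credentials, key: password }  # hypothetical Secret
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata   # a subdirectory of the mount
          volumeMounts:
            - { name: data, mountPath: /var/lib/postgresql/data }
  volumeClaimTemplates:            # one PersistentVolumeClaim per pod: data-db-0, data-db-1, ...
    - metadata: { name: data }
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests: { storage: 10Gi }
```

If db-1 is rescheduled to another node, it comes back with the same name and reattaches to the claim data-db-1, which is what “storage follows the identity” means in practice.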
Build it → A full multi-service deployment — services fronted by a Kong API gateway, wired together on Kubernetes — is built in Project 02: Microservice Platform. A stateful, production-shaped Kubernetes workload (the harder case above) is Project 50: Feature Engineering Platform.
Rollouts and self-healing: probes make them safe
The reconciliation loop will happily replace a dead pod and roll out a new version — but it can only act safely if it knows what “healthy” and “ready” mean, and that knowledge has to come from you, per container, in the form of probes.
A liveness probe answers is this container alive, or wedged? If it fails, the kubelet restarts the container. A readiness probe answers a different question: should this container receive traffic right now? If it fails, the Service stops routing requests to that pod — without restarting it. The two are constantly confused, and conflating them causes outages in both directions.
```yaml
# Illustrative: the two probes answer different questions.
# (Fragment: both keys belong to one entry of a pod's `containers:` list.)
livenessProbe:                 # restart if this fails (the container is wedged)
  httpGet: { path: /health, port: 8000 }
  initialDelaySeconds: 30
readinessProbe:                # withhold traffic if this fails (not ready yet)
  httpGet: { path: /ready, port: 8000 }
  initialDelaySeconds: 5
```

With correct probes, a rolling update becomes safe and automatic. When you change the image, the Deployment spins up new pods alongside the old, waits for each new pod’s readiness probe to pass before sending it traffic, and only then retires an old one, so no request hits a pod that isn’t ready and capacity stays within the bounds the rollout strategy allows (by default, at most 25% of pods unavailable and 25% extra surged at any moment; set maxUnavailable to 0 if capacity must never dip). If the new version’s pods never become ready, the rollout stalls instead of taking the service down, and you can roll back to the previous ReplicaSet in one command (kubectl rollout undo). Self-healing and zero-downtime deploys are, again, the same loop, now gated on the health signals you supplied.
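How much capacity a rollout may borrow or give up is itself declared on the Deployment. A minimal sketch, with values chosen for illustration rather than taken from the team’s setup:

```yaml
# Illustrative fragment of a Deployment spec: trade rollout speed for capacity.
spec:
  replicas: 5
  minReadySeconds: 10        # a new pod must stay ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most one extra pod above `replicas` during the rollout
      maxUnavailable: 0      # never drop below `replicas` available pods
  progressDeadlineSeconds: 300   # flag the rollout as stalled after 5 minutes without progress
```

With maxUnavailable at 0 the rollout proceeds one surged pod at a time, which is slower but never dips below the declared replica count. Exceeding the progress deadline only marks the Deployment as not progressing; rolling back remains a decision you make.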
A team shipped a service that took eight seconds to load its model into memory on startup. They had a liveness probe but no readiness probe. On every rolling update, the Deployment created new pods, and the Service — having no readiness signal — immediately routed traffic to them. For eight seconds per pod, real requests hit a process that hadn’t finished starting, returning 500s. Worse, the slow startup occasionally tripped the liveness probe’s early checks, so the kubelet restarted pods that were merely busy booting, turning a slow start into a restart loop. The fix was two lines: a readiness probe so traffic waited until the model was loaded, and a startupProbe so liveness didn’t fire during the boot window. The lesson is the distinction itself — liveness means restart-if-dead; readiness means withhold-traffic-until-ready — and a missing readiness probe is a silent way to serve errors straight through a “successful” deploy.
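A sketch of what that fix might look like on the container follows. The paths and port reuse the earlier probe example, the thresholds are assumptions, and it presumes /ready returns success only once the model is loaded.

```yaml
# Illustrative fragment for one container: three probes, three questions.
startupProbe:                  # has the process finished booting?
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 2
  failureThreshold: 15         # up to 2s x 15 = 30s to start before a restart
livenessProbe:                 # is it wedged? (held off until startup succeeds)
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:                # should it receive traffic right now?
  httpGet: { path: /ready, port: 8000 }
  periodSeconds: 5
```

The startup probe gives a slow boot its own generous budget, so the liveness probe can stay strict for the rest of the container’s life.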
Autoscaling: closing the loop on load
Manually scaling during a spike was one of the team’s daytime fires. The Horizontal Pod Autoscaler (HPA) makes it another instance of the reconciliation loop. You declare a target — say, keep average CPU around 70% — along with a minimum and maximum replica count, and a controller continuously measures actual utilization and adjusts the Deployment’s replica count to track the target.
```yaml
# Illustrative: hold CPU near 70%, between 3 and 20 replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
```

One sharp edge worth internalizing: autoscaling on memory is unreliable for garbage-collected runtimes (Python, the JVM). These rarely return freed memory to the OS after collection, so utilization ratchets up and never falls: the HPA scales up and never scales back down. Prefer CPU or a custom application metric (requests per second, queue depth) for those workloads. And HPA only scales pods; if every node is full, new pods sit Pending until a separate cluster autoscaler adds nodes. Scaling has two levels, and they’re easy to conflate.
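The adjustment the autoscaler makes follows a ratio rule described in the Kubernetes documentation. As a worked sketch with assumed numbers (5 replicas currently averaging 90% CPU against a 70% target):

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 5 \times \frac{90}{70} \right\rceil
  = \lceil 6.43 \rceil
  = 7
\]
```

The result is then clamped to the range between minReplicas and maxReplicas. Utilization here is measured as a percentage of each container’s CPU request, so a CPU-utilization target only works if the pod template declares CPU requests.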
The honest cost: a distributed system to run your distributed systems
It would be dishonest to end without naming the price. Everything above describes a distributed system — etcd quorum, controllers, schedulers, networking overlays, the API server — and it is now yours to operate. You traded the problem of operating containers by hand for the problem of operating Kubernetes. For a large fleet that’s a fantastic trade; the per-service marginal cost drops toward zero. For a small one it’s a terrible trade, which is exactly why the when-to-use question came first. Managed offerings (EKS, GKE, AKS) absorb the control-plane operation and make the trade far more attractive, but they don’t make the object model, the failure modes, or the conceptual surface area go away. Adopt Kubernetes because you have the scale that justifies it, not because it’s the default.
Practical exercise
Difficulty: Level I · Level II · Level III
- Level I: Watch the loop self-heal. On a local cluster (kind or minikube), write a Deployment with 3 replicas of any small HTTP image and a ClusterIP Service in front of it. Apply it, confirm three pods are running, then kubectl scale it to 5 and watch the new pods appear. Now kubectl delete pod on one of them and observe, without doing anything else, a replacement get created. Explain, in two sentences, which controller did that and what gap it was closing.
- Level II: Make a rollout boring. Add a liveness probe, a readiness probe, and an HPA to the Deployment above. Deliberately give the container an artificial 8-second startup delay, then perform a rolling update to a new image tag with kubectl rollout status watching. Demonstrate zero failed requests during the rollout (hit the Service in a loop while it happens), and explain how the readiness probe is what prevented errors. Then ship a deliberately broken image and show the rollout stalling rather than taking the service down, and roll back.
- Level III: Design a multi-tenant cluster. Sketch a topology for a shared cluster serving several teams: namespaces for isolation, ResourceQuotas to cap each tenant’s footprint, NetworkPolicies enforcing default-deny with explicit allows, and one stateful workload (a database) done correctly. Then write the failure analysis: what is the blast radius of a single node failure, of an etcd quorum loss, of a full control-plane outage? Argue which failures the data plane survives without the control plane, and which it doesn’t, and what that implies about how hard you must protect etcd.
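As one possible starting point for Level III, the sketch below sets up a single tenant: a namespace, a quota on its total footprint, and a default-deny network policy. The tenant name and every number are assumptions. NetworkPolicies are only enforced when the cluster’s network plugin supports them.

```yaml
# Illustrative starting point for one tenant; names and sizes are assumptions.
apiVersion: v1
kind: Namespace
metadata: { name: team-a }
---
apiVersion: v1
kind: ResourceQuota
metadata: { name: team-a-quota, namespace: team-a }
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny, namespace: team-a }
spec:
  podSelector: {}                    # empty selector: every pod in the namespace
  policyTypes: [Ingress, Egress]     # no rules listed, so all traffic is denied
```

Denying egress by default also blocks DNS lookups, so the explicit allows the exercise asks for will need to include the cluster’s DNS service before anything else works.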
Summary
Kubernetes is not a smarter way to run containers; it’s a declarative control system for keeping a fleet of them in the state you asked for. You describe desired state, and a continuous reconciliation loop — controllers and a scheduler watching the API server, kubelets carrying out work and reporting back — drives actual state toward it, forever. That one loop is the source of self-healing, rescheduling, autoscaling, and zero-downtime rollouts; they aren’t separate features but special cases of “make actual match desired.” The object hierarchy (Pod → ReplicaSet → Deployment, plus Services and Ingress for networking, and StatefulSets with PersistentVolumes for the harder stateful case) is the vocabulary for stating that desired state, and probes are the health signals that make the loop’s actions safe. All of this is bought with the real cost of operating a distributed system to operate your distributed systems — worth it at scale, overkill below it.
Key takeaways
- The imperative-to-declarative shift is the whole point: you state the destination, never the journey, and the reconciliation loop holds the invariant.
- Self-healing, rescheduling, scaling, and rollouts are all the same loop — learn the loop and you’ve learned the system.
- Work at the Deployment level; let it manage ReplicaSets and pods. Use Services for stable addressing and StatefulSets only when identity and durable storage truly matter — and remember a StatefulSet is scaffolding, not a cluster.
- Probes are not optional polish: liveness (restart-if-dead) and readiness (withhold-traffic-until-ready) are what make rollouts and self-healing safe.
- Kubernetes has a large fixed operational cost and a near-zero marginal one; adopt it for fleet scale, not by default.
Connections to other chapters
- Containerization with Docker (prerequisite): Kubernetes orchestrates the images you build there. The slim, non-root, pinned-tag image is the unit a pod runs — orchestration assumes packaging is already solved.
- Benchmarking Systems (sibling): cluster capacity, resource requests/limits, and HPA targets are only as good as the load measurements behind them. Sizing a request or choosing a scale-up threshold is a benchmarking question wearing an ops hat.
- Observability (extension): the cluster emits metrics, traces, and logs from every pod and control-plane component; turning those into dashboards, SLOs, and alerts is how you actually see the reconciliation loop doing its job — and how HPA gets its custom metrics.
- Security (extension): RBAC scopes who and what can talk to the API server, NetworkPolicies enforce zero-trust pod-to-pod traffic, Secrets keep credentials out of images, and a service mesh adds mTLS and identity between services — the multi-tenant design in the Level III exercise rests entirely on these.
Build it → Zero-trust service-to-service security on Kubernetes — mTLS and SPIFFE identities issued and rotated automatically — is implemented in Project 13: Service Mesh. A full-stack application targeting a Kubernetes deployment is Project 05: SaaS Web Platform.
Further reading
Essential
- Kubernetes Documentation — Concepts (kubernetes.io/docs/concepts) — the canonical reference for the object model and the control loop; start with “Cluster Architecture” and “Controllers.”
- Burns, Beda & Hightower, Kubernetes: Up and Running — the standard practical on-ramp, written in part by Kubernetes’ creators.
Deep dives
- Burns et al., Borg, Omega, and Kubernetes (2016) — the design lineage and the hard lessons that shaped Kubernetes’ API.
- Beyer et al., Site Reliability Engineering (Google, 2016) — the operational philosophy (error budgets, toil reduction, SLOs) that the reconciliation model is built to serve.
Historical context
- Verma et al., Large-scale cluster management at Google with Borg (2015) — the internal predecessor whose declarative, controller-driven design Kubernetes inherited.
- The control-theory lineage: declarative configuration and reconciliation loops are infrastructure’s rediscovery of closed-loop control — desired state, measured state, and a controller minimizing the error between them.