Cost Optimization

Keywords

cost optimization, finops, cloud cost, autoscaling, right-sizing, spot instances, resource management, utilization, elasticity, tradeoffs

Introduction

The bill had a shape nobody had drawn on purpose. A platform team pulled the cloud invoice apart after finance flagged a number that had crept past a threshold three months running. The biggest item was a fleet of compute instances chosen long ago, sized “one larger than we think we’ll need,” and never revisited. The dashboards told the rest: the fleet averaged eight percent CPU. Ninety-two percent of what the company paid for, every hour of every day, was heat: capacity that existed only to be idle. Beside it sat a forgotten cluster from a launch a year earlier, still running, still billing, owned by no one. And the nightly batch jobs, restartable and indifferent to interruption, paid full on-demand prices around the clock for work that would have run on spare capacity at a sixty-to-ninety-percent discount. None of this was a bug. Every instance was provisioned by a real engineer solving a real problem; the bill grew the way a junk drawer fills, one defensible item at a time, with nobody whose job it was to look at the whole.

That is the trap the cloud sets. It makes capacity feel infinite — ask for a bigger machine and it appears in seconds — and it makes the cost invisible at the moment you incur it; what replaces the datacenter’s friction is a monthly number that arrives long after the decisions that produced it. The correction is a shift in stance. Cost is not an accident you discover at month’s end; it is an engineering property you design for, the way you design for latency or correctness. A workload’s bill is the output of choices about how much you provision, how long it runs, and what you pay per unit — and engineers control all three. This chapter is about taking that control deliberately: measuring where the money goes, matching resources to actual usage, scaling capacity to demand instead of peak, and buying compute at the right price for its tolerance to risk.

The Core Insight

Strip a cloud bill to its arithmetic and it is almost embarrassingly simple. The cost of any workload is resources × time × price — how much you provision, multiplied by how long it runs, multiplied by the rate you pay per unit. Every optimization technique that exists is an attack on one of those three terms, and an engineer can move all three. That decomposition sorts the techniques into three levers, to be pulled in a specific order:

1. Right-size: attack the resources term. Most cloud waste is not exotic; it is over-provisioning. A workload using eight percent of a machine pays for the other ninety-two. Matching what you allocate to what you use is the highest-leverage move because it costs nothing in reliability and the savings start immediately.
2. Scale with demand: attack the time term. Static capacity is sized for the peak and pays for it even at three in the morning. Elasticity lets capacity track load, so you pay for what you use, not the worst minute of the day. Idle capacity scaled to zero costs nothing.
3. Buy smarter: attack the price term. The same compute sells at very different rates by commitment and guarantee: spare capacity (spot, preemptible) is sixty to ninety percent cheaper if you tolerate interruption; commitments (reserved instances, savings plans) discount steady baseline usage for a term promise; on-demand is the expensive, no-strings default for the genuinely unpredictable.
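As a rough illustration of the arithmetic, the sketch below multiplies the three terms and then pulls each lever in turn. The `Workload` class, the vCPU counts, the hours, and the hourly rates are all invented for the example; they are not taken from any provider's price list.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Workload:
    units: float                 # resources term, e.g. vCPUs provisioned
    hours: float                 # time term, hours billed per month
    price_per_unit_hour: float   # price term, dollars per vCPU-hour

    def monthly_cost(self) -> float:
        return self.units * self.hours * self.price_per_unit_hour


# Hypothetical starting point: 16 vCPUs, always on, at an assumed on-demand rate.
baseline = Workload(units=16, hours=730, price_per_unit_hour=0.05)

# Lever 1, right-size: shrink the resources term.
right_sized = replace(baseline, units=4)
# Lever 2, scale with demand: shrink the time term.
elastic = replace(right_sized, hours=365)
# Lever 3, buy smarter: shrink the price term.
discounted = replace(elastic, price_per_unit_hour=0.02)

for name, w in [("baseline", baseline), ("right-sized", right_sized),
                ("elastic", elastic), ("discounted", discounted)]:
    print(f"{name:12s} ${w.monthly_cost():8.2f}/month")
```

Because the terms multiply, the savings compound: each lever applies to a bill the previous one has already shrunk.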

The order matters because cost optimization has a tension at its heart, not just a checklist. You optimize cost by trading slack you don’t need, not reliability you do. The cost / reliability / performance triangle pulls in three directions: the cheapest system is also the most fragile and slowest; the most reliable, fastest system is the most expensive. The art is finding the operating point where you’ve trimmed the idle headroom, the over-provisioning, and the full-price-for-interruptible work — the slack — while keeping the redundancy, latency budget, and durability the business requires. Right-size before you commit, because committing to an oversized instance locks the waste in for years. Measure before you right-size, because optimizing the wrong thing is its own waste.

A mental model

The master metric is utilization — the fraction of what you bought that you actually use — and the simplest way to hold the subject in your head is to treat idle capacity as pure waste. Every dollar of capacity at near-zero load is a dollar with no return. Cost optimization is, almost entirely, the discipline of driving utilization up.

Two analogies make the levers concrete. Autoscaling is a thermostat. It doesn’t run the furnace full blast all day to guarantee the house is never cold; it senses temperature and matches output to demand, idling when the room is warm and ramping when it drops. An autoscaler does the same — it watches a signal that tracks load and adds or removes instances to hold a target, instead of provisioning for the hottest hour forever. And a thermostat that short-cycles wastes energy and wears out the furnace, just as an autoscaler tuned too aggressively thrashes; tuning is the whole game.

Spot instances are the airline’s cheap seats. An airline sells the same seat as a flexible full-fare ticket and as a deep-discount fare that comes with conditions — and the discount fare can occasionally get bumped. Spot capacity is the cloud’s discount fare: spare inventory, sold cheap, reclaimable on short notice (two minutes on AWS, thirty seconds on GCP). It is wonderful for work that can take the bump — batch jobs, checkpointable training, stateless workers behind a queue — and exactly wrong for the passenger who must be on this flight: the stateful database mid-write, the low-latency endpoint under an SLA. Knowing which seat a workload needs is the skill.

How to reason about the tradeoff

The levers come in a fixed order, and the order is the framework. Measure first, since roughly eighty percent of the savings live in twenty percent of the resources — sort by spend and start at the top. Then right-size, then autoscale, then buy smarter: right-sizing first because it’s free of reliability cost and shrinks the baseline every later decision compounds on; autoscaling second because once each instance is correctly sized you want the count to follow demand; buying smarter last because commitments and spot only make sense once you know the true shape of your baseline — commit to an un-right-sized fleet and you’ve frozen the waste into a contract. Underneath every step sits the cost / reliability / performance triangle, shown in Figure 46.1: each lever is a deliberate trade of slack for savings, bounded by the reliability and performance the workload genuinely needs.

What you’ll learn

Why cost decomposes into resources × time × price, and how each lever attacks one term
How to measure before you optimize — utilization, attribution, and finding the idle and over-provisioned resources that hold most of the waste
How right-sizing works, why over-provisioning is the default failure mode, and how requests, limits, and bin-packing turn measured usage into allocation
How autoscaling matches capacity to demand, what signal to scale on, and the cold-start-versus-thrash tradeoff tuning has to resolve
When spot and preemptible capacity is safe to use, and the techniques — checkpointing, diversification, drain handling — that make interruptible work survive interruption
How to match a pricing model to a workload’s shape: spot for the interruptible, commitments for the baseline, on-demand for the unpredictable rest
Why optimizing cost is trimming slack rather than cutting reliability, and how to find the operating point on the cost / reliability / performance triangle

Prerequisites

Containerization: images and containers, workloads as schedulable units, and the stateless-vs-stateful distinction (the Containerization material) — the foundation autoscaling and spot interruption-handling build on
Distributed-systems basics: replicas, load balancing, failure as a normal event, and why redundancy buys availability — the reliability half of the tradeoff
Comfort reading a metrics dashboard: utilization, percentiles (p95/p99), and the idea of a target value a controller drives toward

Measure first: utilization and attribution

Every cost-optimization effort that starts with action instead of measurement ends up optimizing the wrong thing. The canonical failure is the team that spent a sprint shaving fifty dollars a month off a function’s invocation count while a forgotten GPU fleet in another account quietly burned twenty thousand. The fix was not better engineering on the function; it was looking at the bill first. Cost is a metric, and like any metric it deserves a feedback loop — discover, optimize, operate — the same shape as observability’s monitor-alert-respond loop.

Two things must be visible before any lever is worth pulling. The first is utilization: for each significant resource, the fraction of what you provisioned that’s actually in use. An instance at eight percent CPU, a database at twenty percent of provisioned throughput, a GPU idle between training runs — these are the waste, invisible until you graph them against capacity. The second is attribution: which team, service, or environment owns each slice of spend. Without it, no one sees their own costs, so no one owns them, so no one optimizes them. The mechanism is tagging — every resource labeled by team, service, and environment — and the practice on top is showback (putting each team’s bill in front of them) or chargeback (actually billing it). Visibility is the precondition for accountability; a cost no one can see is a cost no one will cut.
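A minimal sketch of those two views, assuming the inventory has already been exported into plain records. The field names (`team`, `monthly_cost`, `provisioned`, `used`), the resource names, and the twenty-percent threshold are hypothetical choices for the example.

```python
from collections import defaultdict

# Hypothetical inventory rows; in practice these would come from billing and metrics exports.
RESOURCES = [
    {"id": "web-fleet", "team": "storefront", "monthly_cost": 9000.0, "provisioned": 400.0, "used": 32.0},
    {"id": "gpu-launch", "team": None, "monthly_cost": 20000.0, "provisioned": 16.0, "used": 0.0},
    {"id": "orders-db", "team": "payments", "monthly_cost": 3000.0, "provisioned": 1000.0, "used": 700.0},
]


def utilization(resource: dict) -> float:
    """Fraction of provisioned capacity actually in use."""
    return resource["used"] / resource["provisioned"]


def spend_by_team(resources: list[dict]) -> dict[str, float]:
    """Attribution: total spend per owning team, with unowned spend made visible."""
    totals: dict[str, float] = defaultdict(float)
    for r in resources:
        totals[r["team"] or "untagged"] += r["monthly_cost"]
    return dict(totals)


def waste_candidates(resources: list[dict], threshold: float = 0.2) -> list[str]:
    """Under-utilized resources, most expensive first, so work starts at the top of the bill."""
    idle = [r for r in resources if utilization(r) < threshold]
    return [r["id"] for r in sorted(idle, key=lambda r: r["monthly_cost"], reverse=True)]


print(spend_by_team(RESOURCES))
print(waste_candidates(RESOURCES))
```

Grouping unowned spend under an explicit `untagged` bucket is deliberate: it turns the forgotten cluster into a line item someone has to explain.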

A subtle, expensive detail: cost-attribution tags are usually not retroactive — turn on tag-based reporting today and it covers resources tagged from today forward, not the months behind you. Enforce tagging early, through policy and infrastructure-as-code defaults, so the data exists when you need it. Measurement also tells you where not to spend effort: below a certain scale the engineering time to optimize exceeds the savings, and a pre-product-market-fit team should spend its attention on the product.
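One way to enforce tagging early is a check that runs before anything is provisioned. The sketch below is a generic illustration, not any provider's policy engine; the required keys and the function names are assumptions.

```python
REQUIRED_TAGS = ("team", "service", "environment")  # assumed policy, adjust to taste


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def check_resource(name: str, tags: dict[str, str]) -> None:
    """Raise before provisioning if the resource would be unattributable."""
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"{name}: missing required tags {missing}")
```

Wired into a CI step or an infrastructure-as-code pre-apply hook, a check like this fails the change instead of producing months of unattributed spend.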

Right-sizing and resource management

Right-sizing is matching what you allocate to what the workload actually uses, and it is the first lever because it is the only free one — it trades away nothing but idle headroom. It is needed at all because of one near-universal human default: over-provisioning “to be safe.” Picking an instance size, an engineer reasons “I’d rather be too big than get paged at 2 a.m.,” chooses one size up, and never revisits it. Multiplied across a fleet and across years, that instinct is the eight-percent-utilization story from the introduction. Headroom you never use is not safety; it’s a standing payment for an emergency that isn’t happening.

The cure is to size from data, not intuition. Cloud providers expose recommendation engines (AWS Compute Optimizer and equivalents) that read days of real utilization and propose a smaller instance with the savings attached; the discipline is to act on the recommendation rather than file a ticket that ages into the backlog. The same logic governs deleting what costs money even while idle — unattached volumes, stale snapshots, orphaned gateways all bill around the clock — so a scheduled cleanup that removes truly-unused resources (with a tag-based safelist) recovers pure waste.
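The scheduled cleanup can be sketched as a pure selection step, kept separate from the deletion itself so the list can be reviewed first. The record shape, the `keep` safelist tag, and the fourteen-day idle cutoff are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def cleanup_candidates(resources, now, min_idle_days=14, safelist_tag="keep"):
    """IDs of resources that are unattached, idle past the cutoff, and not safelisted."""
    cutoff = now - timedelta(days=min_idle_days)
    return [
        r["id"]
        for r in resources
        if not r["attached"]
        and r["last_used"] < cutoff
        and r.get("tags", {}).get(safelist_tag) != "true"
    ]


# Hypothetical volumes, evaluated at a fixed point in time.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
volumes = [
    {"id": "vol-old", "attached": False, "last_used": now - timedelta(days=90), "tags": {}},
    {"id": "vol-kept", "attached": False, "last_used": now - timedelta(days=90), "tags": {"keep": "true"}},
    {"id": "vol-live", "attached": True, "last_used": now, "tags": {}},
    {"id": "vol-recent", "attached": False, "last_used": now - timedelta(days=2), "tags": {}},
]
print(cleanup_candidates(volumes, now))
```

Only `vol-old` is selected here: the safelisted, attached, and recently used volumes are all left alone.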

In a container orchestrator, right-sizing has a precise vocabulary: requests and limits. A container’s request is the CPU and memory it is guaranteed. The scheduler reserves exactly that amount on a node when it places the pod, so in packing terms the request, not the actual usage, is what you pay for. The limit is the hard ceiling: exceed the CPU limit and the container is throttled; exceed the memory limit and it’s killed. Teams waste money by setting requests far above real usage, reserving capacity that sits idle on the node. The loop is to observe actual usage (p95 or p99, not the average, to keep headroom for normal spikes), set the request to that plus a margin, apply, and re-observe. Set it too low and you get CPU contention, evictions, and out-of-memory kills; set it too high and you waste the node.
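The observe-then-set step might look like the following sketch. It uses a nearest-rank percentile and a fixed multiplicative margin; both the method and the fifteen-percent default are assumptions, and a real recommender would likely weight recent samples and treat CPU and memory differently.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(samples: list[float], pct: int = 95, margin: float = 0.15) -> float:
    """Request = observed high percentile plus a safety margin."""
    return percentile(samples, pct) * (1.0 + margin)


# Hypothetical CPU usage in millicores, one sample per scrape interval.
usage = [float(v) for v in range(1, 101)]
print(recommend_request(usage, pct=95, margin=0.2))
```

Sizing from the average of the same samples (50.5 millicores) would leave the container short for roughly half of its observed life, which is why the percentile is the input.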

Below the single workload sits bin-packing: fitting many right-sized workloads onto as few machines as possible. A node is a fixed-size bin, each pod’s request a box, the scheduler’s job to pack densely so you run fewer, fuller nodes instead of many half-empty ones. This is exactly why right-sizing comes first — accurate requests are what make tight packing possible. Inflated requests defeat the packer, because the scheduler honors the reservation even when the container isn’t using it, and you pay for nodes that are reserved-full but actually-empty.
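To make the packing effect concrete, here is a sketch using first-fit decreasing, a classic bin-packing heuristic. It is a stand-in chosen for brevity, not a description of how any real scheduler places pods, and it packs on a single dimension (CPU millicores) where a real scheduler also considers memory and placement constraints.

```python
def first_fit_decreasing(requests: list[int], node_capacity: int) -> list[list[int]]:
    """Pack requests (largest first) onto the first node with room; open a new node otherwise."""
    nodes: list[list[int]] = []
    free: list[int] = []
    for req in sorted(requests, reverse=True):
        if req > node_capacity:
            raise ValueError(f"request {req} exceeds node capacity {node_capacity}")
        for i, remaining in enumerate(free):
            if req <= remaining:
                nodes[i].append(req)
                free[i] -= req
                break
        else:
            # No existing node had room, so pay for another one.
            nodes.append([req])
            free.append(node_capacity - req)
    return nodes


# Eight identical pods on hypothetical 4000-millicore nodes.
inflated = first_fit_decreasing([2000] * 8, node_capacity=4000)
right_sized = first_fit_decreasing([600] * 8, node_capacity=4000)
print(len(inflated), "nodes with inflated requests")
print(len(right_sized), "nodes with right-sized requests")
```

The workload is the same in both runs; only the declared requests change, and the node count falls from four to two.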

Build it → These ideas under load: Project 46: Multi-Tenant GPU Scheduler implements fair-share scheduling, per-tenant resource quotas, GPU partitioning, and preemption — the machinery of driving utilization up on expensive, shared hardware — and Project 48: Multi-GPU Kernel Scheduler takes the same packing-and-utilization problem down to scheduling kernels across GPUs.

Autoscaling: matching capacity to demand

Right-sizing fixes how big each instance is; autoscaling fixes how many there are at any moment. It is the centerpiece because demand is almost never flat — traffic has a daily rhythm, a queue fills and drains, a batch wave passes — and static capacity sized for the peak pays for the peak continuously, at midnight and noon alike. An autoscaler closes that gap: a classical control loop that reads a signal, compares it to a target, computes the desired capacity, applies the change, and feeds the result back. Every autoscaler you’ll meet is the same machine with different knobs.
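One step of that loop can be sketched with the proportional rule the Kubernetes Horizontal Pod Autoscaler documents: desired replicas equal the current count scaled by the ratio of observed metric to target, rounded up. The function name, the replica bounds, and the sample values below are assumptions; the ten-percent tolerance mirrors the HPA's documented default.

```python
import math


def desired_replicas(
    current: int,
    metric: float,
    target: float,
    min_replicas: int = 1,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Compare the signal to its target and return the replica count that would restore it."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do nothing
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


print(desired_replicas(current=4, metric=90, target=60))  # overloaded
print(desired_replicas(current=4, metric=30, target=60))  # underused
print(desired_replicas(current=4, metric=62, target=60))  # within tolerance
```

The rule is agnostic about what `metric` measures, which is why the choice of signal matters so much. Note that a purely proportional rule cannot leave zero replicas on its own, since zero times any ratio is still zero; scale-from-zero needs a separate trigger.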

The first decision is horizontal versus vertical. Horizontal scaling changes the count of instances — add replicas when load rises, remove them when it falls — and suits stateless work (web services, API servers, queue consumers) because any replica handles any request, it happens in seconds, and there’s no per-instance ceiling. Vertical scaling changes the size of a single instance; it suits workloads that can’t spread across replicas — a stateful node, a memory-hungry job — but is slower (usually a restart) and hits the ceiling of the largest machine. The default for cost-elastic services is horizontal; vertical is for right-sizing the things that can’t go wide.

The second decision, the one teams most often get wrong, is what signal to scale on. The signal must track the thing that’s actually saturating. CPU is the reflexive default and often wrong: an inference service can bottleneck on a request queue growing to hundreds of items while CPU sits at forty percent, because the work is waiting, not computing. Scaling on queue depth, request rate, or p99 latency — the signal that genuinely reflects load — keeps the service healthy. Choosing the signal is choosing what “busy” means for your workload; get it wrong and the autoscaler is technically working and practically useless.

The most valuable behavior, for cost, is scaling to zero. A workload that drops to zero replicas when there’s no work consumes nothing — the purest “pay for what you use” — ideal for the spiky and intermittent: event-driven jobs, dev and test environments no one touches overnight, endpoints that go quiet between bursts. But it forces the central tradeoff open: cold start versus thrash. When a request hits a workload scaled to zero, something must spin up before it can be served — pull an image, start a process, load a model — and that latency is paid by the first requests after idle. A model that takes minutes to load is a brutal cold start; a slim stateless service is cheap. The mitigations also solve the deeper problem of scale-up lagging a traffic ramp: shrink the cold start (slimmer images, lazy init), keep headroom so you scale before saturation (target sixty percent, not ninety), pre-scale ahead of known patterns, and let a queue absorb the burst so the scaler has time to react.

Thrash is the opposite hazard: an autoscaler that reacts too eagerly scales up on a brief spike, back down the moment it passes, then up again — churning instances, paying their startup cost repeatedly, destabilizing the service. The defenses are a stabilization window (require the signal to persist before acting) and asymmetry — scale up fast because under-provisioning hurts users, scale down slow because holding a few extra instances briefly costs little next to removing them and immediately needing them back. Scale-up-fast, scale-down-slow is the single most useful default in autoscaler tuning.
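A minimal sketch of scale-up-fast, scale-down-slow: increases are applied immediately, while a decrease is only allowed down to the highest recommendation seen within a recent window. The `Stabilizer` class is hypothetical and measures its window in evaluation ticks rather than seconds.

```python
from collections import deque


class Stabilizer:
    """Applies scale-ups at once and delays scale-downs until the window agrees."""

    def __init__(self, window: int) -> None:
        self.recent: deque[int] = deque(maxlen=window)

    def decide(self, current: int, recommended: int) -> int:
        self.recent.append(recommended)
        if recommended > current:
            return recommended  # under-provisioning hurts users: act now
        # Scale down only as far as the highest recommendation still in the window.
        return min(current, max(self.recent))


stabilizer = Stabilizer(window=3)
replicas = 4
history = []
for recommendation in [8, 4, 4, 4]:  # a brief spike, then quiet
    replicas = stabilizer.decide(replicas, recommendation)
    history.append(replicas)
print(history)
```

The spike is answered on the first tick, but the extra replicas are released only once the spike has aged out of the window, so a second burst arriving shortly after finds the capacity still there.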

Spot and preemptible instances

Spot capacity is the provider’s spare inventory, sold sixty to ninety percent off on-demand, with one condition: it can be reclaimed on two minutes’ notice (AWS) or thirty seconds’ (GCP). That condition is the whole story. For the right workload, spot is the largest discount in cloud computing; for the wrong workload, it’s a data-loss incident waiting for a reclamation event.

The dividing line is interruption tolerance. Work that is stateless or checkpointable survives reclamation: a killed stateless worker just means its task goes back on the queue for another to pick up; a training job that checkpoints to durable storage resumes on a fresh instance, losing only the work since the last save. Batch processing, ETL, CI/CD build agents, hyperparameter sweeps, rendering: anything restartable and parallel is ideal. At the opposite end is anything stateful and critical: a database holding the only copy of state, a low-latency endpoint under an SLA, a stateful stream processor carrying in-flight state that a reclamation would lose or corrupt. Putting one of those on spot is the war story below.

Using spot safely is a small set of techniques. Checkpoint stateful work often enough that the compute lost to an interruption is acceptable — wasted work per interruption is roughly half the checkpoint interval plus the write time, and you want that to stay a small fraction of the uptime between interruptions. Diversify across many instance types and zones so a shortage in one pool doesn’t reclaim the whole fleet at once; capacity-optimized allocation exists to draw from the deepest pools. Handle the notice: poll for the reclamation warning and, when it arrives, use the grace window to checkpoint, drain connections, deregister from the load balancer, and exit cleanly while an orchestrator launches a replacement, falling back to on-demand if spot is scarce. Keep an on-demand base for resilient-but-not-batch work — running twenty to thirty percent of a service on guaranteed on-demand and the rest on spot means a mass reclamation degrades capacity instead of taking the service down.
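The checkpoint rule of thumb above can be turned into a quick calculation. This sketch takes the text's approximation at face value (expected loss is half the checkpoint interval plus the write time) and ignores the ongoing cost of writing the checkpoints themselves; the intervals and the eight-hour mean uptime are invented inputs.

```python
def expected_loss_minutes(checkpoint_interval_min: float, write_min: float) -> float:
    """Average work lost per interruption: half an interval, plus one checkpoint write."""
    return checkpoint_interval_min / 2 + write_min


def wasted_fraction(checkpoint_interval_min: float, write_min: float, mean_uptime_min: float) -> float:
    """Lost work as a fraction of the average uptime between interruptions."""
    return expected_loss_minutes(checkpoint_interval_min, write_min) / mean_uptime_min


# Hypothetical job: 2-minute checkpoint writes, interrupted about every 8 hours.
for interval in (30, 240):
    frac = wasted_fraction(interval, write_min=2, mean_uptime_min=480)
    print(f"checkpoint every {interval:3d} min -> {frac:.1%} of compute wasted")
```

With these inputs, checkpointing every thirty minutes wastes about 3.5 percent of the compute, while checkpointing every four hours wastes about a quarter of it, which eats a large share of the spot discount.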

War story: the stateful workload on the discount fare

A team running a stateful stream processor — the kind that keeps windowed aggregation state in memory and commits it periodically — moved it onto spot to capture the discount the batch jobs already enjoyed. For weeks it worked. Then a regional capacity crunch reclaimed a large fraction of the spot pool at once. The processors got their two-minute warning, but the application had no interruption handler: written for stable on-demand hardware, it neither checkpointed on the notice nor drained cleanly. Instances vanished mid-window, in-flight aggregation state with them, and because several died together there was no healthy peer to recover the partitions from. The pipeline produced wrong numbers, silently, until a downstream consumer noticed totals that didn’t reconcile — and recovery meant a manual replay from the source log. The lesson is the dividing line stated plainly: spot is for interruptible work, and a workload is only interruptible if its code actually handles interruption. A stateful service with no checkpoint and no drain handler is not interruption-tolerant just because you wish it were.
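What the missing handler amounts to can be sketched in a few lines. Everything here is a placeholder: `notice_pending` stands in for however the platform surfaces the reclamation warning, and `process_one`, `checkpoint`, and `drain` are hypothetical callbacks supplied by the application. The sketch assumes each unit of work is short relative to the grace window.

```python
from typing import Callable


def run_worker(
    notice_pending: Callable[[], bool],
    process_one: Callable[[], None],
    checkpoint: Callable[[], None],
    drain: Callable[[], None],
) -> None:
    """Process work until a reclamation notice appears, then shut down cleanly."""
    while not notice_pending():
        process_one()  # one bounded unit of work between polls
    # Notice received: use the grace window to persist state, then stop taking traffic.
    checkpoint()
    drain()


# Simulated run: the notice arrives on the third poll.
events: list[str] = []
polls = iter([False, False, True])
run_worker(
    notice_pending=lambda: next(polls),
    process_one=lambda: events.append("work"),
    checkpoint=lambda: events.append("checkpoint"),
    drain=lambda: events.append("drain"),
)
print(events)
```

The stream processor in the story had the equivalent of the `while` loop and nothing after it.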

Commitments and pricing models

The third lever attacks the price term directly: the same instance-hour sells at very different rates depending on what you promise, and the job is to match each pricing mode to the shape of demand it fits. On-demand is the no-commitment default — full price, full flexibility, available and droppable instantly — for demand that’s genuinely unpredictable or short-lived: a new workload whose shape you don’t yet know, a project too short to justify a commitment. You pay a premium for owing the provider nothing.

Commitments — reserved instances and savings plans — discount steady, predictable usage for a term promise, typically one or three years, for up to roughly seventy percent off. The trade is flexibility for price: you keep paying whether or not you use the capacity, so a commitment only pays off against a baseline you’re confident will persist for the whole term. (Reserved instances tie the discount to a specific instance family for the deepest savings; savings plans trade some discount for the freedom to change types.) The rule is absolute: right-size before you commit. Committing to an oversized instance locks the over-provisioning into a multi-year contract — your worst cost mistake made irreversible. Right-size, watch usage stabilize, then commit to the baseline you’ve measured.
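The "keep paying whether or not you use it" clause implies a simple break-even. Under the simplifying assumption that a commitment at discount d is billed for every hour of the term while on-demand is billed only for hours actually used, the commitment wins only when the used fraction of the term exceeds 1 − d. The function names are hypothetical, and real plans add details (upfront payment options, partial coverage) that this ignores.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the term the capacity must be in use for the commitment to pay off."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def commitment_savings(discount: float, utilization: float) -> float:
    """Savings versus on-demand, as a fraction of a fully used on-demand term (negative = loss)."""
    on_demand_cost = utilization      # pay only for hours used
    committed_cost = 1.0 - discount   # pay for every hour, at the discounted rate
    return on_demand_cost - committed_cost


print(break_even_utilization(0.40))
print(commitment_savings(0.40, utilization=1.0))
print(commitment_savings(0.40, utilization=0.5))
```

At a forty percent discount the capacity has to be busy more than sixty percent of the term. An oversized commitment that is right-sized halfway through turns the discount into a loss.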

Spot is the third mode, covered above: deepest discount, no guarantee, interruptible work only. The three compose into the canonical cost-optimized fleet: a committed baseline sized to the steady-state floor, spot for the fault-tolerant bulk, and on-demand absorbing the unpredictable peak. Run the same workload naively on all on-demand capacity and it typically costs roughly double. The point is not to pick one mode but to layer them so each slice of demand is served at the price its reliability requirements justify.
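The layering can be checked with a blended-rate calculation. The shares of instance-hours and the discounts below are illustrative assumptions chosen to fall inside the ranges quoted in this chapter; they are not measurements of any real fleet.

```python
# Each slice: (share of instance-hours, discount off on-demand). All numbers are assumptions.
FLEET = {
    "committed baseline": (0.50, 0.40),
    "spot bulk": (0.35, 0.70),
    "on-demand peak": (0.15, 0.00),
}


def blended_rate(fleet: dict[str, tuple[float, float]]) -> float:
    """Average price paid per instance-hour, as a fraction of the on-demand rate."""
    total_share = sum(share for share, _ in fleet.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * (1.0 - discount) for share, discount in fleet.values())


rate = blended_rate(FLEET)
print(f"blended rate: {rate:.3f} of on-demand")
print(f"all-on-demand would cost {1 / rate:.2f}x as much")
```

With this mix the fleet pays about 0.555 of the on-demand rate, so running everything on-demand would cost about 1.8 times as much. Shifting the shares or discounts moves that multiple, which is why the baseline has to be measured before the mix is chosen.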

The cost / reliability / performance tradeoff

It is tempting to read this chapter as a list of ways to spend less, but the honest framing is sharper: every lever is a trade, and what you must keep straight is what you’re trading away. The cost / reliability / performance triangle says you can’t maximize all three — drive cost to the floor and you sacrifice reliability, performance, or both — and what makes optimization safe rather than reckless is the distinction between slack and reliability. Slack is capacity that exists for no reason the business needs: the ninety-two percent of an idle instance, headroom sized for a peak that never comes, the on-demand premium on interruptible work, committed capacity that outlived its workload. Cutting slack is pure win. Reliability is capacity that exists for a reason: the redundant replica that survives a node failure, the on-demand base that holds when spot is reclaimed, the latency headroom that keeps p99 inside the SLA during a spike. Cutting that doesn’t optimize cost; it borrows against reliability and repays it, with interest, during the next incident.

So the operating point you’re aiming for is where all the slack is gone and all the necessary reliability remains — which requires knowing what the workload actually needs, looping back to measuring first. The autoscaler’s sixty-percent target rather than ninety, and the on-demand base under a spot fleet, both buy reliability with a little cost, deliberately. Those are not failures to optimize; they are optimization done right — every remaining dollar traceable to a reliability or performance requirement you can name.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Measure and right-size. Take a real workload and graph its actual CPU and memory utilization over a few days. Identify the gap between provisioned and used, propose a smaller instance size or lower resource requests, and report the projected savings as a percentage. Then write a paragraph diagnosing why the original was over-provisioned — the specific reasoning that led to the larger size — so you can recognize the pattern next time.
Level II — Autoscale on the right signal. Configure horizontal autoscaling for a service and choose the signal deliberately: argue, from the workload’s behavior, whether CPU, request rate, queue depth, or latency actually tracks its load, and why the obvious default would or wouldn’t work. Then reason about three knobs — the scale-up threshold (how much headroom you leave), the scale-down cooldown (how you avoid thrash), and the cold-start cost (what the first requests after a scale-up pay). State the tradeoff you chose at each and what it costs.
Level III — Design a cost-optimized mixed-workload architecture. Design the capacity strategy for a system with three tiers: a steady-state baseline, a fault-tolerant batch tier, and a spiky user-facing tier. Specify which parts run on spot and exactly how you make them interruption-safe (checkpoint cadence, diversification, drain handling, on-demand fallback); which run on committed capacity and how you sized the commitment; and how autoscaling handles the spiky tier including cold start. Then place yourself on the cost / reliability / performance triangle: quantify the cost at your chosen operating point versus naive all-on-demand, and name the reliability properties you preserved and the slack you cut to get there.

Summary

Cost optimization is engineering, not accounting: a cloud bill is the output of choices about how much you provision, how long it runs, and what you pay per unit, and engineers control all three. The arithmetic — resources × time × price — sorts every technique into three levers, pulled in order. Measure first, because optimizing the wrong thing is its own waste and most savings hide in a few line items. Right-size, because over-provisioning is the default failure mode and trimming idle capacity costs nothing in reliability. Autoscale, because static capacity pays for the peak forever while elastic capacity tracks demand and can scale to zero. Buy smarter, layering spot for the interruptible, commitments for the baseline, on-demand for the unpredictable rest. And underneath it all, hold the cost / reliability / performance triangle steady: you optimize by trimming slack you don’t need, never the reliability you do.

Key takeaways

Cost is resources × time × price; the three levers — right-size, autoscale, buy smarter — each attack one term, pulled in that order.
Measure before you act: utilization shows what’s idle, attribution shows who owns it, the bill’s 80/20 shows where to start. You can’t optimize what you can’t see.
Over-provisioning “to be safe” is the most common, most expensive default; size from real p95/p99 usage, and accurate requests are what make bin-packing dense.
Autoscaling is a control loop — scale on the signal that actually tracks load, scale up fast and down slow, and resolve cold-start-versus-thrash with headroom, stabilization windows, and pre-scaling.
Spot is the deepest discount but only for interruptible work, and interruptible means the code checkpoints and drains on the notice while the fleet is diversified across instance types and zones; never put stateful, critical work on it.
Optimization is trimming slack, not cutting reliability; the right operating point is where every remaining dollar is justified by a named reliability or performance need.

Connections to other chapters

Orchestration with Kubernetes (extension): this chapter’s levers live concretely in Kubernetes. Resource requests and limits are right-sizing; the Horizontal Pod Autoscaler and Cluster Autoscaler are the elasticity machinery; node pools mix spot and on-demand. Kubernetes is where you operate cost optimization day to day.
Benchmarking Systems (sibling): right-sizing and capacity decisions rest on real measurements, not guesses. The discipline for measuring utilization, latency percentiles, and the deltas you report after an optimization comes from the benchmarking chapter. Cost optimization is benchmarking with the dollar as the unit.
Observability (sibling): utilization and cost are metrics you watch continuously, not numbers you check once. The discover-optimize-operate loop here is the same shape as observability’s monitor-alert-respond loop — cost anomaly detection, budget alerts, and per-team dashboards are its cost-flavored instances.
Cross-Cutting CI/CD (sibling): scale-to-zero and ephemeral environments cut non-production cost dramatically — preview environments that live only for a pull request, build agents that run on spot and disappear, dev clusters scheduled to shut down overnight. The CI/CD chapter is where much of the non-prod cost lever gets pulled.

Further reading

AWS Well-Architected Framework — Cost Optimization Pillar (and the equivalent GCP guidance) — the providers’ own catalog of right-sizing, commitment, and elasticity practices, with the pricing-model details made concrete.
Spot/preemptible design patterns — provider documentation on checkpointing, diversification, capacity-optimized allocation, and interruption handling: the engineering that makes the deepest discount safe.

Historical context

Queueing-theory intuition for utilization versus latency — that latency climbs sharply as utilization approaches one is why you target sixty percent rather than ninety, and why “just run everything hot” trades reliability for cost in a way the math makes precise. Any treatment of the M/M/1 queue is the foundation.
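For reference, the standard M/M/1 result behind that intuition, with arrival rate λ, service rate μ, and utilization ρ. This is the textbook single-server model under Poisson arrivals and exponential service times, an idealization rather than a model of any particular service.

```latex
\rho = \frac{\lambda}{\mu}, \qquad
W = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}

% Mean time in system W, as a multiple of the bare service time 1/\mu:
\rho = 0.6 \;\Rightarrow\; W = 2.5 \cdot \frac{1}{\mu}, \qquad
\rho = 0.9 \;\Rightarrow\; W = 10 \cdot \frac{1}{\mu}
```

Moving the target from sixty to ninety percent utilization quadruples mean latency in this model, and the curve keeps steepening as ρ approaches one.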

--- title: "Cost Optimization" keywords: [cost optimization, finops, cloud cost, autoscaling, right-sizing, spot instances, resource management, utilization, elasticity, tradeoffs] difficulty: intermediate prerequisites: [containerization, distributed-systems-basics] estimated_time: "3-4 hours" --- ## Introduction The bill had a shape nobody had drawn on purpose. A platform team pulled the cloud invoice apart after finance flagged a number that had crept past a threshold three months running. The biggest item was a fleet of compute instances chosen long ago, sized "one larger than we think we'll need," and never revisited. The dashboards told the rest: the fleet averaged eight percent CPU. Ninety-two percent of what the company paid for, every hour of every day, was heat — capacity that existed only to be idle. Beside it sat a forgotten cluster from a launch a year earlier, still running, still billing, owned by no one. And the nightly batch jobs — restartable, indifferent to interruption — paid full on-demand prices around the clock for work that would have run on spare capacity at a seventy-to-ninety-percent discount. None of this was a bug. Every instance was provisioned by a real engineer solving a real problem; the bill grew the way a junk drawer fills, one defensible item at a time, with nobody whose job it was to look at the whole. That is the trap the cloud sets. It makes capacity feel infinite — ask for a bigger machine and it appears in seconds — and it makes the cost *invisible* at the moment you incur it; what replaces the datacenter's friction is a monthly number that arrives long after the decisions that produced it. The correction is a shift in stance. Cost is not an accident you discover at month's end; it is an engineering property you design for, the way you design for latency or correctness. A workload's bill is the output of choices about how much you provision, how long it runs, and what you pay per unit — and engineers control all three. 
This chapter is about taking that control deliberately: measuring where the money goes, matching resources to actual usage, scaling capacity to demand instead of peak, and buying compute at the right price for its tolerance to risk. ### The Core Insight Strip a cloud bill to its arithmetic and it is almost embarrassingly simple. The cost of any workload is **resources × time × price** — how much you provision, multiplied by how long it runs, multiplied by the rate you pay per unit. Every optimization technique that exists is an attack on one of those three terms, and an engineer can move all three. That decomposition sorts the techniques into three levers, to be pulled in a specific order: 1. **Right-size** — attack the *resources* term. Most cloud waste is not exotic; it is over-provisioning. A workload using eight percent of a machine pays for the other ninety-two. Matching what you allocate to what you use is the highest-leverage move because it costs nothing in reliability and the savings start immediately. 2. **Scale with demand** — attack the *time* term. Static capacity is sized for the peak and pays for it even at three in the morning. Elasticity lets capacity track load, so you pay for what you use, not the worst minute of the day. Idle capacity scaled to zero costs nothing. 3. **Buy smarter** — attack the *price* term. The same compute sells at very different rates by commitment and guarantee: spare capacity (spot, preemptible) is sixty to ninety percent cheaper if you tolerate interruption; commitments (reserved instances, savings plans) discount steady baseline usage for a term promise; on-demand is the expensive, no-strings default for the genuinely unpredictable. The order matters because cost optimization has a *tension* at its heart, not just a checklist. You optimize cost by trading **slack you don't need**, not **reliability you do**. 
The cost / reliability / performance triangle pulls in three directions: the cheapest system is also the most fragile and slowest; the most reliable, fastest system is the most expensive. The art is finding the operating point where you've trimmed the idle headroom, the over-provisioning, and the full-price-for-interruptible work — the slack — while keeping the redundancy, latency budget, and durability the business requires. Right-size before you commit, because committing to an oversized instance locks the waste in for years. Measure before you right-size, because optimizing the wrong thing is its own waste.

### A mental model

The master metric is **utilization** — the fraction of what you bought that you actually use — and the simplest way to hold the subject in your head is to treat idle capacity as pure waste. Every dollar of capacity at near-zero load is a dollar with no return. Cost optimization is, almost entirely, the discipline of driving utilization up.

Two analogies make the levers concrete.

**Autoscaling is a thermostat.** It doesn't run the furnace full blast all day to guarantee the house is never cold; it senses temperature and matches output to demand, idling when the room is warm and ramping when it drops. An autoscaler does the same — it watches a signal that tracks load and adds or removes instances to hold a target, instead of provisioning for the hottest hour forever. And a thermostat that short-cycles wastes energy and wears out the furnace, just as an autoscaler tuned too aggressively thrashes; tuning is the whole game.

**Spot instances are the airline's cheap seats.** An airline sells the same seat as a flexible full-fare ticket and as a deep-discount fare that comes with conditions — and the discount fare can occasionally get bumped. Spot capacity is the cloud's discount fare: spare inventory, sold cheap, reclaimable on short notice (two minutes on AWS, thirty seconds on GCP).
It is wonderful for work that can take the bump — batch jobs, checkpointable training, stateless workers behind a queue — and exactly wrong for the passenger who *must* be on this flight: the stateful database mid-write, the low-latency endpoint under an SLA. Knowing which seat a workload needs is the skill.

### How to reason about the tradeoff

The levers come in a fixed order, and the order is the framework. **Measure first**, since roughly eighty percent of the savings live in twenty percent of the resources — sort by spend and start at the top. Then **right-size, then autoscale, then buy smarter:** right-sizing first because it's free of reliability cost and shrinks the baseline every later decision compounds on; autoscaling second because once each instance is correctly sized you want the *count* to follow demand; buying smarter last because commitments and spot only make sense once you know the true shape of your baseline — commit to an un-right-sized fleet and you've frozen the waste into a contract. Underneath every step sits the cost / reliability / performance triangle, shown in @fig-cost-levers: each lever is a deliberate trade of slack for savings, bounded by the reliability and performance the workload genuinely needs.

![The cost levers, applied in order: measure utilization, then right-size resources to actual usage, then autoscale capacity to demand (scaling to zero when idle), then buy smarter — spot instances for interruptible work, commitments for the steady baseline, on-demand for the rest.
The governing constraint is the cost / reliability / performance tradeoff: you optimize by trimming slack you don't need, not reliability you do.](../assets/diagrams/rendered/cost_levers.svg){#fig-cost-levers .lightbox}

### What you'll learn

- Why cost decomposes into **resources × time × price**, and how each lever attacks one term
- How to measure before you optimize — utilization, attribution, and finding the idle and over-provisioned resources that hold most of the waste
- How right-sizing works, why over-provisioning is the default failure mode, and how requests, limits, and bin-packing turn measured usage into allocation
- How autoscaling matches capacity to demand, what signal to scale on, and the cold-start-versus-thrash tradeoff tuning has to resolve
- When spot and preemptible capacity is safe to use, and the techniques — checkpointing, diversification, drain handling — that make interruptible work survive interruption
- How to match a pricing model to a workload's shape: spot for the interruptible, commitments for the baseline, on-demand for the unpredictable rest
- Why optimizing cost is trimming slack rather than cutting reliability, and how to find the operating point on the cost / reliability / performance triangle

### Prerequisites

- Containerization: images and containers, workloads as schedulable units, and the stateless-vs-stateful distinction (the *Containerization* material) — the foundation autoscaling and spot interruption-handling build on
- Distributed-systems basics: replicas, load balancing, failure as a normal event, and why redundancy buys availability — the reliability half of the tradeoff
- Comfort reading a metrics dashboard: utilization, percentiles (p95/p99), and the idea of a target value a controller drives toward

---

## Measure first: utilization and attribution

Every cost-optimization effort that starts with action instead of measurement ends up optimizing the wrong thing.
The canonical failure is the team that spent a sprint shaving fifty dollars a month off a function's invocation count while a forgotten GPU fleet in another account quietly burned twenty thousand. The fix was not better engineering on the function; it was *looking at the bill first*. Cost is a metric, and like any metric it deserves a feedback loop — discover, optimize, operate — the same shape as observability's monitor-alert-respond loop.

Two things must be visible before any lever is worth pulling. The first is **utilization**: for each significant resource, the fraction of what you provisioned that's actually in use. An instance at eight percent CPU, a database at twenty percent of provisioned throughput, a GPU idle between training runs — these are the waste, invisible until you graph them against capacity. The second is **attribution**: which team, service, or environment owns each slice of spend. Without it, no one sees their own costs, so no one owns them, so no one optimizes them. The mechanism is tagging — every resource labeled by team, service, and environment — and the practice on top is *showback* (putting each team's bill in front of them) or *chargeback* (actually billing it). Visibility is the precondition for accountability; a cost no one can see is a cost no one will cut.

A subtle, expensive detail: cost-attribution tags are usually **not retroactive** — turn on tag-based reporting today and it covers resources tagged from today forward, not the months behind you. Enforce tagging early, through policy and infrastructure-as-code defaults, so the data exists when you need it. Measurement also tells you where *not* to spend effort: below a certain scale the engineering time to optimize exceeds the savings, and a pre-product-market-fit team should spend its attention on the product.
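
As a rough illustration of the measurement step, the sketch below ranks resources by idle spend and rolls spend up by owner. The `Resource` record, its field names, and the sample figures are hypothetical, not any provider's billing schema; in practice the inputs would come from a billing export joined with a metrics system.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical record joining one billing line item with its utilization."""
    name: str
    monthly_cost: float   # dollars per month
    utilization: float    # fraction of provisioned capacity in use, 0.0 to 1.0
    tags: dict = field(default_factory=dict)


def idle_spend(r: Resource) -> float:
    """Dollars per month paying for capacity that sits unused."""
    return r.monthly_cost * (1.0 - r.utilization)


def rank_by_idle_spend(resources):
    """Largest waste first, so effort starts at the top of the bill."""
    return sorted(resources, key=idle_spend, reverse=True)


def spend_by_team(resources):
    """Attribution: total spend per `team` tag; untagged spend is surfaced, not hidden."""
    totals = defaultdict(float)
    for r in resources:
        totals[r.tags.get("team", "UNTAGGED")] += r.monthly_cost
    return dict(totals)


# Sample data is invented for illustration.
fleet = [
    Resource("web-fleet", 12000.0, 0.08, {"team": "platform"}),
    Resource("gpu-cluster", 20000.0, 0.00),  # forgotten and untagged
    Resource("function-x", 50.0, 0.60, {"team": "payments"}),
]

for r in rank_by_idle_spend(fleet):
    print(f"{r.name:12s} idle ${idle_spend(r):>9,.0f}/mo at {r.utilization:.0%} utilization")
print(spend_by_team(fleet))
```

Sorting by idle spend rather than by utilization alone keeps a cheap, badly utilized function from outranking an expensive idle fleet, and reporting an explicit `UNTAGGED` bucket makes the attribution gap itself visible.
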

## Right-sizing and resource management

Right-sizing is matching what you allocate to what the workload actually uses, and it is the first lever because it is the only free one — it trades away nothing but idle headroom. It is needed at all because of one near-universal human default: **over-provisioning "to be safe."** Picking an instance size, an engineer reasons "I'd rather be too big than get paged at 2 a.m.," chooses one size up, and never revisits it. Multiplied across a fleet and across years, that instinct is the eight-percent-utilization story from the introduction. Headroom you never use is not safety; it's a standing payment for an emergency that isn't happening.

The cure is to size from data, not intuition. Cloud providers expose recommendation engines (AWS Compute Optimizer and equivalents) that read days of real utilization and propose a smaller instance with the savings attached; the discipline is to *act* on the recommendation rather than file a ticket that ages into the backlog. The same logic governs *deleting* what costs money even while idle — unattached volumes, stale snapshots, orphaned gateways all bill around the clock — so a scheduled cleanup that removes truly-unused resources (with a tag-based safelist) recovers pure waste.

In a container orchestrator, right-sizing has a precise vocabulary: **requests and limits.** A container's *request* is the CPU and memory it's guaranteed — the scheduler reserves exactly that when it places the pod, so the request is what you really pay for in packing terms. The *limit* is the hard ceiling: exceed the CPU limit and the container is throttled; exceed the memory limit and it's killed. Teams waste money by setting requests far above real usage, reserving capacity that sits idle on the node. The loop is to observe actual usage (p95 or p99, not the average, to keep headroom for normal spikes), set the request to that plus a margin, apply, re-observe.
Too low and you get evictions and out-of-memory kills; too high and you waste the node.

Below the single workload sits **bin-packing**: fitting many right-sized workloads onto as few machines as possible. A node is a fixed-size bin, each pod's request a box, the scheduler's job to pack densely so you run fewer, fuller nodes instead of many half-empty ones. This is exactly why right-sizing comes first — accurate requests are what make tight packing possible. Inflated requests defeat the packer, because the scheduler honors the reservation even when the container isn't using it, and you pay for nodes that are reserved-full but actually-empty.

> **Build it →** These ideas under load:
> [Project 46: Multi-Tenant GPU Scheduler](https://github.com/jchu0/applied-cs-projects/tree/main/46-multi-tenant-gpu-scheduler)
> implements fair-share scheduling, per-tenant resource quotas, GPU partitioning, and
> preemption — the machinery of driving utilization up on expensive, shared hardware —
> and [Project 48: Multi-GPU Kernel Scheduler](https://github.com/jchu0/applied-cs-projects/tree/main/48-multi-gpu-kernel-scheduler)
> takes the same packing-and-utilization problem down to scheduling kernels across GPUs.

## Autoscaling: matching capacity to demand

Right-sizing fixes how big each instance is; autoscaling fixes how *many* there are at any moment. It is the centerpiece because demand is almost never flat — traffic has a daily rhythm, a queue fills and drains, a batch wave passes — and static capacity sized for the peak pays for the peak continuously, at midnight and noon alike. An autoscaler closes that gap: a classical control loop that reads a signal, compares it to a target, computes the desired capacity, applies the change, and feeds the result back. Every autoscaler you'll meet is the same machine with different knobs.

The first decision is **horizontal versus vertical**.
Horizontal scaling changes the *count* of instances — add replicas when load rises, remove them when it falls — and suits stateless work (web services, API servers, queue consumers) because any replica handles any request, it happens in seconds, and there's no per-instance ceiling. Vertical scaling changes the *size* of a single instance; it suits workloads that can't spread across replicas — a stateful node, a memory-hungry job — but is slower (usually a restart) and hits the ceiling of the largest machine. The default for cost-elastic services is horizontal; vertical is for right-sizing the things that can't go wide.

The second decision, the one teams most often get wrong, is **what signal to scale on.** The signal must track *the thing that's actually saturating*. CPU is the reflexive default and often wrong: an inference service can bottleneck on a request queue growing to hundreds of items while CPU sits at forty percent, because the work is waiting, not computing. Scaling on queue depth, request rate, or p99 latency — the signal that genuinely reflects load — keeps the service healthy. Choosing the signal is choosing what "busy" means for your workload; get it wrong and the autoscaler is technically working and practically useless.

The most valuable behavior, for cost, is **scaling to zero.** A workload that drops to zero replicas when there's no work consumes nothing — the purest "pay for what you use" — ideal for the spiky and intermittent: event-driven jobs, dev and test environments no one touches overnight, endpoints that go quiet between bursts. But it forces the central tradeoff open: **cold start versus thrash.** When a request hits a workload scaled to zero, something must spin up before it can be served — pull an image, start a process, load a model — and that latency is paid by the first requests after idle. A model that takes minutes to load is a brutal cold start; a slim stateless service is cheap.
The mitigations also solve the deeper problem of scale-up lagging a traffic ramp: shrink the cold start (slimmer images, lazy init), keep headroom so you scale before saturation (target sixty percent, not ninety), pre-scale ahead of known patterns, and let a queue absorb the burst so the scaler has time to react.

Thrash is the opposite hazard: an autoscaler that reacts too eagerly scales up on a brief spike, back down the moment it passes, then up again — churning instances, paying their startup cost repeatedly, destabilizing the service. The defenses are a **stabilization window** (require the signal to persist before acting) and asymmetry — scale *up* fast because under-provisioning hurts users, scale *down* slow because holding a few extra instances briefly costs little next to removing them and immediately needing them back. Scale-up-fast, scale-down-slow is the single most useful default in autoscaler tuning.

## Spot and preemptible instances

Spot capacity is the provider's spare inventory, sold sixty to ninety percent off on-demand, with one condition: it can be reclaimed on two minutes' notice (AWS) or thirty seconds' (GCP). That condition is the whole story. For the right workload, spot is the largest discount in cloud computing; for the wrong workload, it's a data-loss incident waiting for a reclamation event.

The dividing line is **interruption tolerance.** Work that is stateless or checkpointable survives reclamation: a killed stateless worker just means its task goes back on the queue for another to pick up; a training job that checkpoints to durable storage resumes on a fresh instance, losing only the work since the last save. Batch processing, ETL, CI/CD build agents, hyperparameter sweeps, rendering — anything restartable and parallel — is ideal.
The opposite end is anything stateful and critical: a database holding the only copy of state, a low-latency endpoint under an SLA, a stateful stream processor carrying in-flight state reclamation would corrupt. Putting those on spot is the war story below.

Using spot *safely* is a small set of techniques. **Checkpoint** stateful work often enough that the compute lost to an interruption is acceptable — wasted work per interruption is roughly half the checkpoint interval plus the write time, and you want that to stay a small fraction of the uptime between interruptions. **Diversify** across many instance types and zones so a shortage in one pool doesn't reclaim the whole fleet at once; capacity-optimized allocation exists to draw from the deepest pools. **Handle the notice**: poll for the reclamation warning and, when it arrives, use the grace window to checkpoint, drain connections, deregister from the load balancer, and exit cleanly while an orchestrator launches a replacement, falling back to on-demand if spot is scarce. **Keep an on-demand base** for resilient-but-not-batch work — running twenty to thirty percent of a service on guaranteed on-demand and the rest on spot means a mass reclamation degrades capacity instead of taking the service down.

::: {.callout-warning}
## War story: the stateful workload on the discount fare

A team running a stateful stream processor — the kind that keeps windowed aggregation state in memory and commits it periodically — moved it onto spot to capture the discount the batch jobs already enjoyed. For weeks it worked. Then a regional capacity crunch reclaimed a large fraction of the spot pool at once. The processors got their two-minute warning, but the application had no interruption handler: written for stable on-demand hardware, it neither checkpointed on the notice nor drained cleanly.
Instances vanished mid-window, in-flight aggregation state with them, and because several died together there was no healthy peer to recover the partitions from. The pipeline produced *wrong numbers*, silently, until a downstream consumer noticed totals that didn't reconcile — and recovery meant a manual replay from the source log.

The lesson is the dividing line stated plainly: **spot is for interruptible work, and a workload is only interruptible if its code actually handles interruption.** A stateful service with no checkpoint and no drain handler is not interruption-tolerant just because you wish it were.
:::

## Commitments and pricing models

The third lever attacks the *price* term directly: the same instance-hour sells at very different rates depending on what you promise, and the job is to match each pricing mode to the *shape* of demand it fits.

**On-demand** is the no-commitment default — full price, full flexibility, available and droppable instantly — for demand that's genuinely unpredictable or short-lived: a new workload whose shape you don't yet know, a project too short to justify a commitment. You pay a premium for owing the provider nothing.

**Commitments** — reserved instances and savings plans — discount steady, predictable usage for a term promise, typically one or three years, for up to roughly seventy percent off. The trade is flexibility for price: you keep paying whether or not you use the capacity, so a commitment only pays off against a baseline you're confident will persist for the whole term. (Reserved instances tie the discount to a specific instance family for the deepest savings; savings plans trade some discount for the freedom to change types.) The rule is absolute: **right-size before you commit.** Committing to an oversized instance locks the over-provisioning into a multi-year contract — your worst cost mistake made irreversible. Right-size, watch usage stabilize, *then* commit to the baseline you've measured.
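
The arithmetic behind that rule can be sketched in a few lines. This assumes a simplified model in which committed capacity bills a discounted rate for every hour of the term whether or not it is used, so a commitment only beats on-demand when the capacity is used for more than `1 - discount` of the time. The function names, the rate, the discount, and the 730-hour month are illustrative assumptions, not any provider's pricing.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of committed hours that must be used for the commitment to pay off.

    Simplified model: committed capacity bills (1 - discount) * rate for every
    hour; on-demand bills the full rate only for the hours actually used.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def monthly_cost(units_needed: float, units_committed: float,
                 on_demand_rate: float, discount: float,
                 hours: float = 730.0) -> float:
    """Cost of a steady need of `units_needed`, with any overflow served on-demand."""
    committed = units_committed * on_demand_rate * (1.0 - discount) * hours
    overflow = max(units_needed - units_committed, 0.0) * on_demand_rate * hours
    return committed + overflow


# Illustrative numbers: a fleet provisioned at 10 units that really needs 4.
rate, discount = 1.00, 0.40
oversized = monthly_cost(units_needed=4, units_committed=10,
                         on_demand_rate=rate, discount=discount)
right_sized = monthly_cost(units_needed=4, units_committed=4,
                           on_demand_rate=rate, discount=discount)
no_commit = monthly_cost(units_needed=4, units_committed=0,
                         on_demand_rate=rate, discount=discount)

print(f"break-even utilization:    {break_even_utilization(discount):.0%}")
print(f"commit to oversized fleet: ${oversized:,.0f}/mo")
print(f"right-size, then commit:   ${right_sized:,.0f}/mo")
print(f"right-size, stay on-demand: ${no_commit:,.0f}/mo")
```

With these made-up inputs, committing to the oversized fleet costs more than running the right-sized fleet at full on-demand price, which is the sense in which the commitment freezes the waste in place.
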

**Spot** is the third mode, covered above: deepest discount, no guarantee, interruptible work only.

The three compose into the canonical cost-optimized fleet — a committed baseline sized to the steady-state floor, spot for the fault-tolerant bulk, on-demand absorbing the unpredictable peak. Run naively at all-on-demand, the same workload typically pays roughly double. The point is not to pick one mode but to *layer* them so each slice of demand is served at the price its reliability requirements justify.

## The cost / reliability / performance tradeoff

It is tempting to read this chapter as a list of ways to spend less, but the honest framing is sharper: every lever is a trade, and what you must keep straight is *what* you're trading away. The cost / reliability / performance triangle says you can't maximize all three — drive cost to the floor and you sacrifice reliability, performance, or both — and what makes optimization *safe* rather than reckless is the distinction between slack and reliability.

**Slack** is capacity that exists for no reason the business needs: the ninety-two percent of an idle instance, headroom sized for a peak that never comes, the on-demand premium on interruptible work, committed capacity that outlived its workload. Cutting slack is pure win. **Reliability** is capacity that exists for a reason: the redundant replica that survives a node failure, the on-demand base that holds when spot is reclaimed, the latency headroom that keeps p99 inside the SLA during a spike. Cutting *that* doesn't optimize cost; it borrows against reliability and repays it, with interest, during the next incident.

So the operating point you're aiming for is where all the slack is gone and all the necessary reliability remains — which requires knowing what the workload actually needs, looping back to measuring first.
The autoscaler's sixty-percent target rather than ninety, and the on-demand base under a spot fleet, both buy reliability with a little cost, deliberately. Those are not failures to optimize; they are optimization done right — every remaining dollar traceable to a reliability or performance requirement you can name.

---

## Practical exercise

**Difficulty:** Level I · Level II · Level III

1. **Level I — Measure and right-size.** Take a real workload and graph its actual CPU and memory utilization over a few days. Identify the gap between provisioned and used, propose a smaller instance size or lower resource requests, and report the projected savings as a percentage. Then write a paragraph diagnosing *why* the original was over-provisioned — the specific reasoning that led to the larger size — so you can recognize the pattern next time.
2. **Level II — Autoscale on the right signal.** Configure horizontal autoscaling for a service and choose the signal deliberately: argue, from the workload's behavior, whether CPU, request rate, queue depth, or latency actually tracks its load, and why the obvious default would or wouldn't work. Then reason about three knobs — the scale-up threshold (how much headroom you leave), the scale-down cooldown (how you avoid thrash), and the cold-start cost (what the first requests after a scale-up pay). State the tradeoff you chose at each and what it costs.
3. **Level III — Design a cost-optimized mixed-workload architecture.** Design the capacity strategy for a system with three tiers: a steady-state baseline, a fault-tolerant batch tier, and a spiky user-facing tier. Specify which parts run on spot and exactly how you make them interruption-safe (checkpoint cadence, diversification, drain handling, on-demand fallback); which run on committed capacity and how you sized the commitment; and how autoscaling handles the spiky tier including cold start.
Then place yourself on the cost / reliability / performance triangle: quantify the cost at your chosen operating point versus naive all-on-demand, and name the reliability properties you preserved and the slack you cut to get there.

## Summary

Cost optimization is engineering, not accounting: a cloud bill is the output of choices about how much you provision, how long it runs, and what you pay per unit, and engineers control all three. The arithmetic — **resources × time × price** — sorts every technique into three levers, pulled in order. Measure first, because optimizing the wrong thing is its own waste and most savings hide in a few line items. Right-size, because over-provisioning is the default failure mode and trimming idle capacity costs nothing in reliability. Autoscale, because static capacity pays for the peak forever while elastic capacity tracks demand and can scale to zero. Buy smarter, layering spot for the interruptible, commitments for the baseline, on-demand for the unpredictable rest. And underneath it all, hold the cost / reliability / performance triangle steady: you optimize by trimming slack you don't need, never the reliability you do.

### Key takeaways

- Cost is **resources × time × price**; the three levers — right-size, autoscale, buy smarter — each attack one term, pulled in that order.
- Measure before you act: utilization shows what's idle, attribution shows who owns it, the bill's 80/20 shows where to start. You can't optimize what you can't see.
- Over-provisioning "to be safe" is the most common, most expensive default; size from real p95/p99 usage, and accurate requests are what make bin-packing dense.
- Autoscaling is a control loop — scale on the signal that actually tracks load, scale up fast and down slow, and resolve cold-start-versus-thrash with headroom, stabilization windows, and pre-scaling.
- Spot is the deepest discount but only for interruptible work — and interruptible means the code checkpoints, diversifies, and drains on the notice; never stateful-critical.
- Optimization is trimming slack, not cutting reliability; the right operating point is where every remaining dollar is justified by a named reliability or performance need.

### Connections to other chapters

- **Orchestration with Kubernetes** (extension): this chapter's levers live concretely in Kubernetes. Resource requests and limits *are* right-sizing; the Horizontal Pod Autoscaler and Cluster Autoscaler *are* the elasticity machinery; node pools mix spot and on-demand. Kubernetes is where you operate cost optimization day to day.
- **Benchmarking Systems** (sibling): right-sizing and capacity decisions rest on real measurements, not guesses. The discipline for measuring utilization, latency percentiles, and the deltas you report after an optimization comes from the benchmarking chapter. Cost optimization is benchmarking with the dollar as the unit.
- **Observability** (sibling): utilization and cost are metrics you watch continuously, not numbers you check once. The discover-optimize-operate loop here is the same shape as observability's monitor-alert-respond loop — cost anomaly detection, budget alerts, and per-team dashboards are its cost-flavored instances.
- **Cross-Cutting CI/CD** (sibling): scale-to-zero and ephemeral environments cut non-production cost dramatically — preview environments that live only for a pull request, build agents that run on spot and disappear, dev clusters scheduled to shut down overnight. The CI/CD chapter is where much of the non-prod cost lever gets pulled.

## Further reading

### Essential

- *Cloud FinOps* (Storment & Fuller) — the standard practitioner text on running cost as a cross-functional discipline; source of the discover-optimize-operate framing.
- *The FinOps Foundation framework* — the open, vendor-neutral definition of FinOps capabilities, principles, and maturity stages.

### Deep dives

- *AWS Well-Architected Framework — Cost Optimization Pillar* (and the equivalent GCP guidance) — the providers' own catalog of right-sizing, commitment, and elasticity practices, with the pricing-model details made concrete.
- *Spot/preemptible design patterns* — provider documentation on checkpointing, diversification, capacity-optimized allocation, and interruption handling: the engineering that makes the deepest discount safe.

### Historical context

- Queueing-theory intuition for **utilization versus latency** — that latency climbs sharply as utilization approaches one is *why* you target sixty percent rather than ninety, and why "just run everything hot" trades reliability for cost in a way the math makes precise. Any treatment of the M/M/1 queue is the foundation.
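
For orientation, the textbook M/M/1 result can be written out. With arrival rate $\lambda$, service rate $\mu$, and utilization $\rho = \lambda / \mu < 1$, the mean time a request spends in the system (waiting plus service) is:

$$
W = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}
$$

At $\rho = 0.6$ this is $2.5$ times the bare service time $1/\mu$; at $\rho = 0.9$ it is $10$ times. M/M/1 is an idealization (Poisson arrivals, exponential service times, a single server), so a real service will show different numbers, but the steep climb as $\rho$ approaches one is the part that carries over.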