Concurrency and Parallelism Models

Keywords

concurrency, parallelism, threads, goroutines, async, await, event loop, csp, channels, gil, mutex, data races, memory model, structured concurrency, scheduler

Introduction

The same bug shipped four times that quarter, in four different languages, and each team thought it was unique to theirs. A Python service rebuilt on asyncio fell over under load because one synchronous database call, buried three frames deep, froze the single event-loop thread and serialized every in-flight request. A Go search endpoint leaked goroutines — two per request, blocked forever on a channel nobody would ever receive from — until the box was OOM-killed on a traffic spike. A Java reporting service sized its thread pool at 200 to keep cores busy through blocking calls, then watched one slow dependency occupy all 200 threads at once and reject healthy traffic to unrelated endpoints. And a C++ cache shipped a double-checked-locking optimization that passed review and ran for months on x86, then crashed on a fraction of a percent of ARM phones because the hardware was free to reorder two writes the code silently assumed were ordered.

Read those four incidents as bugs and they look unrelated. Read them as a family and a pattern jumps out: every one is a team reaching for a concurrency tool without first understanding the model underneath it. The Python team didn’t grasp that a cooperative event loop runs to completion between yield points. The Go team didn’t grasp that a cheap, freely-spawned goroutine still needs a guaranteed way to stop. The Java team was working around the price of an OS thread without realizing the price had a fix. The C++ team assumed a single global order of memory that hardware does not provide. The model was the missing piece in each case — and the deep, useful fact is that there are only a handful of models, shared across all six languages this chapter covers. Learn them once, comparatively, and the four bugs become one bug you can recognize anywhere.

The Core Insight

Concurrency is not parallelism, and conflating the two is where most of the confusion starts. Concurrency is a way of structuring a program so that independent pieces of work can be in progress at the same time: interleaved, suspended, resumed. Parallelism is executing multiple pieces of work literally simultaneously, which needs multiple cores. A single-core machine can be highly concurrent (an event loop juggling ten thousand connections) and not parallel at all. A program can also be parallel without being interestingly concurrent (a for loop split across cores). The two are orthogonal, and the first question for any workload is which one you actually need.

That question has a sharp answer because it tracks one physical distinction. Work that spends its time waiting — on the network, a disk, a database — is I/O-bound, and the win is concurrency: overlap the waits so the thread isn’t idle. Work that spends its time computing — parsing, hashing, transforming numbers — is CPU-bound, and the only win is parallelism: spread the arithmetic across cores. Reach for a concurrency model on CPU-bound work and you get no speedup; reach for raw parallelism on I/O-bound work and you pay for threads that mostly sleep. Every language in this chapter gives you tools for both, but the shape of those tools — and the safety they offer — differs enormously, and that variation is the whole subject.
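
The I/O-bound half of that claim is easy to check on a laptop. What follows is a minimal sketch, not a benchmark: fake_fetch is a hypothetical stand-in for a network call (time.sleep plays the wait), and the job count and delay are arbitrary assumptions. It runs the same eight waits one after another and then overlapped on a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(i: int) -> int:
    """Hypothetical I/O call: time.sleep stands in for a network wait."""
    time.sleep(0.05)
    return i * 2


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


jobs = range(8)

# One wait after another: roughly 8 x 0.05 s in total.
serial_result, serial_s = timed(lambda: [fake_fetch(i) for i in jobs])

# The same waits overlapped on 8 threads: roughly one 0.05 s wait in total.
with ThreadPoolExecutor(max_workers=8) as pool:
    overlapped_result, overlapped_s = timed(
        lambda: list(pool.map(fake_fetch, jobs))
    )

print(f"serial {serial_s:.2f}s, overlapped {overlapped_s:.2f}s")
```

Swap the sleep for a tight arithmetic loop and the gap closes on standard CPython, which is the CPU-bound half of the same rule.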

The deepest divergence is over shared mutable state. Two threads touching the same memory, at least one of them writing, with no synchronization between them, is a data race, and a data race is the most expensive bug in systems programming, because it is non-deterministic, load-dependent, immune to the debugger, and usually found in production. Six languages give five different answers to it: Python and JavaScript mostly sidestep it with a single thread; Go offers channels (don’t share) plus a runtime race detector; Java codifies a memory model with happens-before edges; C++ hands you the raw memory orderings and the obligation to use them; Rust makes the race a compile error. Concurrency is the axis on which programming languages differ most, and the model a language picks shapes every system built in it.

A mental model

Picture a restaurant, and you can place every concurrency model in this chapter on its floor. The slow work — cooking — is the I/O: the part where you wait. Serving tables is the computation: the part that needs hands.

A thread-per-request restaurant hires one full-time waiter per table. Each waiter walks an order to the kitchen and stands there until the food is plated. It is dead simple to reason about — one waiter, one story, top to bottom — but waiters are expensive, you can only afford a few dozen, and most of them are standing idle at a stove. That is the 1:1 OS thread model: capable, preemptively scheduled by a kitchen manager who can yank any waiter off the floor at any instant, but rationed because each one costs a salary (a megabyte of stack) whether working or not.

A single async waiter works the whole room alone. She drops an order at the kitchen and immediately moves to the next table rather than waiting; when a dish is ready a bell rings and she swings by to deliver it. One waiter keeps fifty tables moving precisely because the room is mostly waiting, and waiting needs no waiter. That is the event loop — cooperative, single-threaded, scaling I/O beautifully — with one fatal rule: if she ever stops to chop vegetables herself (a blocking call), the entire room freezes.

A crew of green-thread waiters is the modern answer: you write down a million orders for almost nothing, and a small permanent crew picks them up, setting each one aside the instant it says “waiting on the kitchen” and grabbing the next. Each order thinks it has its own dedicated waiter standing by — which keeps the code simple — but the crew is never idle. That is the M:N model: goroutines and JVM virtual threads, cheap units multiplexed onto a few real threads. And Rust’s futures are a fourth shape — a stack of order tickets that do nothing until a waiter picks one up and works it as far as it can, a deliberately lazy design that costs only the paper the ticket is written on.

When to use which model

The choice of model is not taste; it follows mechanically from two questions. Figure 2.1 maps the four units of concurrency onto the cores beneath them, and the two coordination styles that ride on top — and the decision walks straight through it.

First question: is the work CPU-bound or I/O-bound? CPU-bound work needs real parallelism — threads (or processes) spread across cores, ideally one worker per core, because more workers than cores just add context-switch overhead without adding compute. I/O-bound work needs cheap concurrency — and the more of it, the cheaper each unit must be. A handful of concurrent waits is fine on OS threads; tens of thousands of mostly-idle connections demands a model where an idle unit costs bytes, not a stack, which means an event loop, goroutines/virtual threads, or async futures.

Second question: how do the concurrent units coordinate? If data flows in one direction — a pipeline, a producer feeding workers, a result handed back — prefer message passing: a channel transfers ownership with the value, so there is nothing left to race over and no lock to forget. If state sits still and several units mutate it in place — a counter, a cache, a registry — prefer shared memory with a lock, the smaller and shorter-lived the critical section the better. “Flowing wants a channel; sitting still wants a lock” is the rule of thumb, and it holds in every language that offers both.

The languages then differ on what they make easy and what they make safe. Go and recent Java make cheap M:N concurrency the path of least resistance. Python and JavaScript hand you one cooperative event loop and (mostly) one thread. Rust gives you the cheapest async there is but makes you supply the runtime, and proves the absence of data races at compile time. C++ gives you the rawest control and the heaviest obligation. None is strictly best; each is a point in a design space, and the rest of this chapter is the map.

What you’ll learn

How to tell concurrency from parallelism, diagnose a workload as CPU- or I/O-bound, and pick the model that diagnosis implies
The four units of concurrency — OS threads, M:N green threads, event-loop tasks, and poll-based futures — and the scheduler each one rides on
Why cooperative schedulers (event loops, async runtimes) and preemptive ones (OS threads) fail in opposite ways, and the cardinal rule that protects every cooperative one
How Python’s GIL turns one question — I/O-bound or CPU-bound? — into the answer for the whole language, and how Java and Go reached “blocking code that scales”
The two coordination styles — shared memory with locks versus message passing over channels (CSP) — and when each is the simpler, faster, safer fit
How async/await works across languages, from JavaScript’s eager Promises to Rust’s lazy, zero-cost, poll-based futures with a bring-your-own runtime
Structured concurrency and cancellation — how to scope a group of tasks so failure and timeouts propagate instead of leaking
The five answers to data races — Rust’s Send/Sync, Go’s race detector, the Java Memory Model, C++’s std::atomic and memory orderings, and the single-thread sidestep

Prerequisites

Software Engineering Overview — what a process and a thread are, what it means for a call to block, and why reproducibility and resource limits matter
Go Fundamentals and Rust Fundamentals — this chapter draws its sharpest contrasts from Go’s goroutines-and-channels and Rust’s ownership model; comfort reading idiomatic Go and Rust will make the comparisons land
A working idea of what a data race and a deadlock are, even informally

The unit of concurrency: four shapes

Every concurrency model is, at bottom, a choice about the unit you spawn and how that unit maps onto the operating system’s threads. There are four shapes in wide use, and the rest of the chapter is variations on them. The cleanest way to see the design space is a single table comparing the unit each language gives you first.

Language	Default unit	Mapping to OS threads	Scheduling	Cost per unit
C++	`std::thread`	1:1	preemptive (kernel)	~1 MB stack
Rust (sync)	`std::thread`	1:1	preemptive (kernel)	~1 MB stack
Java (classic)	platform `Thread`	1:1	preemptive (kernel)	~1 MB stack
Java (Loom)	virtual `Thread`	M:N	cooperative + carrier preempt	~1 KB
Go	goroutine	M:N	cooperative + async preempt	~2 KB (grows)
Python	asyncio task / thread / process	1:1 thread (GIL-bound)	cooperative (loop)	~1 KB task
JavaScript / TS	Promise / async task	single thread	cooperative (event loop)	tiny
Rust (async)	`Future` task	M:N over a runtime	cooperative (executor)	bytes (state machine)

The table tells two stories at once. Read the mapping column and the world splits in two: the 1:1 languages, where each unit is an OS thread, and the M:N languages, where many cheap units share a few OS threads. Read the cost column and you see why that split exists at all — an OS thread reserves about a megabyte of stack whether or not it is doing anything, so a process tops out around the low tens of thousands of them, which is a hard ceiling for a server whose job is to hold many slow connections open. The M:N units cost kilobytes or less, so you can have millions.

The 1:1 model is the oldest and the simplest to reason about. A thread is a real OS thread: its own stack, scheduled by the kernel, running truly in parallel on another core. The kernel is preemptive — it can interrupt any thread at any instant to run another — so no thread can starve the others by refusing to yield, which is a genuine safety property. The cost is the megabyte and the kernel’s involvement in every context switch. This is what C++, Rust’s std::thread, and classic Java give you, and for CPU-bound work or modest concurrency it is exactly right: a handful of threads, one per core, crunching numbers in parallel.

The M:N model keeps the simple blocking programming style but removes the cost. The runtime keeps a small pool of OS threads (Go calls them Ms, the JVM calls them carriers) and multiplexes many cheap units onto them. When a unit blocks (on a channel, on I/O) the runtime parks it, unmounts it from its OS thread, and runs a different runnable unit there, remounting the parked one when its blocking call completes. The unit thinks it blocked; the OS thread never sat idle. This is the goroutine, and it is the Java virtual thread, and the two are the same idea reached more than a decade apart.

The event-loop model goes to one thread and makes the concurrency entirely cooperative. There is one call stack; only one thing runs at a time; slow work is handed to the runtime and its completion is queued; the loop runs queued continuations when the stack is clear. This is Python’s asyncio and all of JavaScript. It scales I/O superbly on a single thread and gives up parallelism entirely — many operations in flight, but only one line of your code running at any instant.

The poll-based future is Rust’s distinctive fourth shape, and it inverts an assumption the other three share. In every other model, spawning a unit starts it. A Rust future is lazy: calling an async fn builds a state machine and runs none of its body. Nothing happens until an executor polls it. That single difference — laziness — is what makes Rust async zero-cost and is the source of most of its surprises, and we will come back to it.

Schedulers: cooperative versus preemptive

The unit is half the model; the scheduler is the other half, and the axis that matters most is cooperative versus preemptive. A preemptive scheduler can interrupt a running unit at any point and switch to another; the OS does this to threads on a timer interrupt, typically many times a second. A cooperative scheduler can only switch when the running unit yields control voluntarily, at an explicit point. The two fail in exactly opposite ways, and knowing which one you are on tells you which failure to fear.

Preemptive scheduling is robust against a misbehaving unit. A thread stuck in an infinite CPU loop cannot freeze the others, because the kernel will preempt it and let everyone else run. The price is that a context switch can happen anywhere — between any two instructions, including in the middle of count++ — so shared mutable state can be corrupted at any interleaving, and you need locks to make multi-step operations atomic. Preemption gives you robustness against starvation and takes away the ability to reason about where you can be interrupted.
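
The count++ hazard and its cure fit in a few lines. This is a sketch under assumptions: the Counter class, the thread count, and the iteration count are all illustrative choices, not taken from any incident above. The increment is a read, an add, and a write, and the lock makes those three steps one indivisible operation no matter where the scheduler switches threads.

```python
import threading


class Counter:
    """Illustrative shared counter; the lock guards the read-add-write."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, a switch between the read and the write
        # can lose an update made by another thread.
        with self._lock:
            self.value += 1


def hammer(counter: Counter, times: int) -> None:
    for _ in range(times):
        counter.increment()


counter = Counter()
threads = [
    threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 4 threads x 10_000 increments, none lost
```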

Cooperative scheduling makes the opposite trade. Because a unit yields only at explicit points (await in Python/JS/Rust, a channel operation in Go’s older model, a blocking call on a virtual thread) you know exactly where control can leave your code, which makes whole categories of races evaporate: between two yield points, your code runs uninterrupted. The price is the cardinal rule that governs every cooperative runtime: a unit that never yields freezes everything. This is the single most important operational fact about event loops and async runtimes, and it caused the first of the four incidents in the introduction.

War story: the one synchronous call that serialized everything (Python)

A team rebuilt a flaky API on FastAPI and asyncio specifically to handle a flood of concurrent traffic, load-tested it to thousands of requests per second, and shipped it. In production, p99 latency was catastrophic: 20 ms requests took seconds, and throughput collapsed to roughly one request at a time. The handler was async, the framework was async, and buried three calls deep was a single legacy helper issuing a database query through a synchronous driver. Each time any request reached that helper, its blocking query pinned the one event-loop thread for the entire round trip, and every other in-flight request (hundreds of them) sat frozen until it returned. The async service was, in effect, single-threaded and synchronous, the worst of both. The fix was to stop blocking the loop: swap in an async driver so the query awaits, or push the sync call off the loop onto a thread pool. The same hazard has the same shape in JavaScript (a long synchronous loop stalls the page), in Rust async (a std::fs::read inside a task removes a worker from rotation), and even on Java’s virtual threads (through JDK 23, blocking inside synchronized pins the carrier). One rule, four languages: never block a cooperative scheduler.
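
The failure and the second fix can be reproduced in miniature. The sketch below is an assumption-laden stand-in, not the team's code: legacy_query is a hypothetical synchronous helper (time.sleep plays the database round trip), and heartbeat plays every other in-flight request. It counts how often the heartbeat gets to run while the query is in progress, first called directly on the loop and then pushed to a worker thread with asyncio.to_thread.

```python
import asyncio
import time


def legacy_query() -> str:
    """Hypothetical synchronous driver call; sleep stands in for the round trip."""
    time.sleep(0.2)
    return "row"


async def heartbeat(ticks: list[int]) -> None:
    """Stands in for every other request sharing the event loop."""
    while True:
        await asyncio.sleep(0.01)
        ticks.append(1)


async def measure(offload: bool) -> int:
    ticks: list[int] = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat reach its first await
    if offload:
        await asyncio.to_thread(legacy_query)  # loop stays free while a thread waits
    else:
        legacy_query()  # blocks the one loop thread for the whole call
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return len(ticks)


ticks_blocked = asyncio.run(measure(offload=False))
ticks_offloaded = asyncio.run(measure(offload=True))
print(ticks_blocked, ticks_offloaded)
```

The blocked run records almost no heartbeats; the offloaded run records one roughly every ten milliseconds. Exact counts depend on the machine, so none are shown here.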

The M:N models blur the line in a useful way. Go’s scheduler is mostly cooperative — a goroutine yields at channel operations and function-call preemption points — but since Go 1.14 it also has asynchronous preemption, so a goroutine in a tight CPU loop can still be interrupted, recovering the robustness of preemption. Java’s virtual threads are cooperative about blocking (they unmount and yield the carrier) but the carrier itself is a preemptible OS thread. The lesson is that “cooperative” and “preemptive” are ends of a spectrum, and the modern runtimes deliberately sit in the middle: cooperative enough to be cheap, preemptive enough not to be starved.

The GIL: when the model collapses to one question

No language makes the CPU-versus-I/O distinction more consequential than Python, because of one design decision: the Global Interpreter Lock. The GIL is a single mutex inside the standard CPython interpreter that a thread must hold to execute Python bytecode. Because there is exactly one, exactly one thread runs Python code at a time, no matter how many cores you have. It exists for a real reason — it makes CPython’s reference-counting memory management and its vast C-extension ecosystem simpler and faster in the common single-threaded case — but the consequence is stark: in CPython, threads do not give you CPU parallelism.

The crucial detail is when the lock is released. A thread drops the GIL whenever it is not running Python bytecode — most importantly, while it waits on the OS for I/O. A thread blocked on a socket holds no lock, so other threads run. That single asymmetry is the entire Python concurrency story, and it collapses every decision into one diagnostic question: is the work I/O-bound or CPU-bound?

I/O-bound work spends its time outside the interpreter, so the GIL is released and many waits overlap. Threads help here, and asyncio helps far more cheaply — a coroutine costs about a kilobyte where a thread costs megabytes. Use threading/concurrent.futures for modest concurrency or to keep synchronous libraries; use asyncio for very high concurrency.
CPU-bound work keeps the interpreter busy executing bytecode, so threads serialize on the GIL and you get no speedup — sometimes a slowdown, from the cost of handing the lock back and forth. The only path to real parallelism is multiprocessing: separate processes, each with its own interpreter and its own GIL, paying the cost of pickling everything that crosses the process boundary.

This is the report-generator bug in one paragraph: a team wrapped a CPU-bound hot loop in eight threads expecting an eight-times speedup and got eleven minutes and four seconds — one core pinned, seven asleep — because the GIL serialized them. The fix was a process pool, which woke the idle cores. The asterisk worth remembering is that heavy numerical libraries (NumPy, pandas) release the GIL while computing in C, so already-vectorized work can be parallelized by threads after all.
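
A process-pool fix of that kind might look like the sketch below. Everything in it is an assumption made for illustration: count_primes is a hypothetical CPU-bound stand-in for the hot loop, and count_in_parallel is one plausible shape for the pool wrapper, not the team's actual code.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit: int) -> int:
    """CPU-bound stand-in for the hot loop: trial division, pure bytecode."""
    return sum(
        1
        for n in range(2, limit)
        if all(n % d for d in range(2, math.isqrt(n) + 1))
    )


def count_in_parallel(limits: list[int], workers: int) -> list[int]:
    """Fan the chunks out to separate processes, one GIL each."""
    # Arguments and results are pickled across the process boundary,
    # so the chunks should be large enough to repay that cost.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))
```

A call such as count_in_parallel([20_000] * 4, workers=4) belongs under an if __name__ == "__main__": guard, because on platforms that start workers by launching a fresh interpreter the module is re-imported in every child.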

Set Python beside the other languages and the GIL’s specialness is obvious. Java, Go, C++, and Rust all run threads in genuine parallel across cores with no interpreter lock; for them, CPU-bound and I/O-bound both have in-language answers. Python alone must leave the process to parallelize computation. (The experimental free-threaded CPython builds in 3.13+ are beginning to lift this, but for code shipped today, assume the GIL is present and let the workload pick the model.)

Build it → A production distributed job queue that runs CPU-bound and I/O-bound tasks across worker pools — Celery alongside asyncio, the concrete form of “offload heavy work to processes, keep the front door responsive” — is Project 01: Distributed Job Queue.

Shared memory with locks: the default, and its perils

When several units must touch one piece of state in place, the classic answer is shared memory protected by a lock. The idea is universal — a mutex admits one unit into the critical section at a time — but each language wraps it differently, and the differences reveal how much each one trusts the programmer.

C++ gives you the rawest version and the heaviest obligation. A std::mutex sits beside the data it guards, and nothing connects them except your discipline: you must remember to take the lock, and you should never call lock()/unlock() by hand, because an exception thrown between them leaves the mutex locked forever. RAII wrappers are the cure: std::scoped_lock locks in its constructor and unlocks in its destructor, so it releases on every exit path including a throw, and it can acquire several mutexes at once without deadlock.

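// C++: the lock sits beside the data; RAII guarantees release; scoped_lock avoids deadlock.
#include <mutex>

struct Account { std::mutex mtx; long balance = 0; };   // illustrative type

void transfer(Account& from, Account& to, long cents) {
    if (&from == &to) return;                  // locking one mutex twice is undefined behavior
    std::scoped_lock lock(from.mtx, to.mtx);   // both mutexes, deadlock-free
    from.balance -= cents;
    to.balance   += cents;
}                                              // both released here, even on a throw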

Rust takes the same primitive and welds it to the data, which is the quiet masterstroke. A Mutex<T> contains the T; the only way to reach the value is .lock(), which returns a guard that releases automatically when it drops. You cannot touch the data without holding the lock, because the data is unreachable any other way — forgetting to lock is not a discipline you maintain but a state the type system will not let you express. Shared ownership across threads adds Arc, an atomically reference-counted handle, so the canonical pattern is Arc<Mutex<T>>.

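// Rust: the lock IS the data's container; Arc shares it; both verified by the compiler.
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let counter = Arc::new(Mutex::new(0));
    let mut handles = Vec::new();
    for _ in 0..10 {
        let counter = Arc::clone(&counter);          // a handle to the SAME value
        handles.push(thread::spawn(move || {
            let mut n = counter.lock().unwrap();     // unreachable except through the lock
            *n += 1;
        }));                                         // guard drops here, lock released
    }
    for h in handles { h.join().unwrap(); }          // wait until every increment has landed
    println!("{}", *counter.lock().unwrap());        // prints 10
}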

Java, like C++, keeps the lock separate from the data it protects, but layers a contract on top: the Java Memory Model, which we cover below. In practice, idiomatic Java reaches for a higher-level building block before a raw lock: a ConcurrentHashMap whose computeIfAbsent makes “look up, and create if missing” a single atomic step, or a BlockingQueue whose bounded capacity gives you backpressure for free. The throughline across the locking languages is the same as the tool-choice rule: pick the highest-level construct that fits, and drop to a hand-rolled lock only when nothing else expresses what you need.

The peril shared by every locking model is that a lock prevents data races but invites deadlocks. Two units that grab two locks in opposite orders wedge permanently — each holds one and waits for the other. No type system catches this, because lock ordering is a property of program logic, not types; even Rust’s compiler, which eliminates data races outright, will happily compile a deadlock. The defenses are the same everywhere: impose a global lock-ordering discipline, collapse two locks into one, or acquire multiple locks atomically (scoped_lock). A lock buys you synchronization safety; it does not buy you thinking.
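
The first of those defenses, a global lock order, can be sketched in a few lines. The Account class and the choice of id() as the ordering key are assumptions made for illustration; any stable, total order over the locks would serve. Two threads shuttle money in opposite directions, which is exactly the pattern that wedges when each side takes "its own" lock first.

```python
import threading


class Account:
    """Illustrative account: a balance and the lock that guards it."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, cents: int) -> None:
    if src is dst:
        return  # nothing to move, and the lock must not be taken twice
    # Global order: whichever account has the lower id() is locked first,
    # regardless of the direction of the transfer.
    first, second = sorted((src, dst), key=id)
    with first.lock, second.lock:
        src.balance -= cents
        dst.balance += cents


def shuttle(src: Account, dst: Account) -> None:
    for _ in range(2_000):
        transfer(src, dst, 1)


a, b = Account(1_000), Account(1_000)
t1 = threading.Thread(target=shuttle, args=(a, b))
t2 = threading.Thread(target=shuttle, args=(b, a))  # opposite direction
t1.start()
t2.start()
t1.join()
t2.join()

print(a.balance, b.balance)  # equal and opposite flows cancel out
```

Replacing the sorted line with first, second = src, dst restores the opposite-order acquisition, and the program can then hang instead of finishing.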

Message passing: don’t share, communicate

The other coordination style avoids shared state entirely. Instead of putting a value behind a lock and letting every unit reach in, you give one unit ownership of the value and let others hand it off over a channel — a typed pipe that carries values between units. When a value travels down a channel, ownership travels with it: the sender is done with it, the receiver now holds it, and there is no window where both touch it at once. The handoff is the synchronization, so no separate lock is needed. This is CSP — Communicating Sequential Processes, Tony Hoare’s 1978 idea — and its slogan is worth memorizing: don’t communicate by sharing memory; share memory by communicating.

Go builds its whole identity on this. A goroutine sends with ch <- job and another receives with job := <-ch, and the channel guarantees exactly one receiver gets each value. The buffered-versus-unbuffered distinction is the one to internalize: an unbuffered channel is a synchronous handoff (a send blocks until someone receives), while a buffered channel decouples the two and its capacity is your backpressure knob — a full buffer is the system telling the producer to slow down. A select waits on several channels at once, which is how timeouts and cancellation enter the model.

// Go: a value (and its ownership) flows from producer to worker over a typed channel.
func generate(nums ...int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)            // closing signals "no more values" to receivers
        for _, n := range nums {
            out <- n                // ownership of each value moves down the pipe
        }
    }()
    return out                      // receive-only type documents and enforces direction
}

Rust has the same idiom — std::sync::mpsc (multiple producer, single consumer) — but backs it with a stronger guarantee. Because Rust moves the value into the channel, the compiler rejects any code that tries to use the value after sending it. Go’s safety here comes from convention and a race detector you run afterward; Rust’s comes from the type system, statically. Same shape, stronger contract.

// Rust: send MOVES ownership; the compiler forbids touching the value afterward.
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for msg in ["work", "more", "done"] {
            tx.send(String::from(msg)).unwrap();   // the String moves out; can't be reused
        }
    });                                            // tx drops, channel closes, rx loop ends
    for received in rx {
        println!("got: {received}");
    }
}

The decision between the two styles is the rule from the introduction, applied per piece of state. Is the data flowing, or sitting still? Flowing wants a channel — pipelines, fan-out to workers, fan-in of results, a result handed back. Sitting still wants a lock — a counter, a cache, a config served to every request. Forcing flowing data through a lock means manual coordination where a channel would be the literal expression of the handoff; forcing sitting-still state through a channel is ceremony where lock(); x++; unlock() would do. Both Go and Rust offer both tools precisely so you can match the tool to the shape.
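
The flowing case, including a bounded buffer acting as backpressure, can also be sketched with Python's standard queue.Queue. Two caveats apply to this sketch. The capacity of 2, the None sentinel, and the squaring worker are illustrative assumptions. And unlike Rust, Python enforces nothing about ownership: the "sender is done with it" rule is a convention the code has to keep.

```python
import queue
import threading

jobs = queue.Queue(maxsize=2)  # capacity 2 is the backpressure knob

jobs.put(1)
jobs.put(2)
try:
    jobs.put_nowait(3)  # buffer is full; a plain put() would block here
    rejected = False
except queue.Full:
    rejected = True  # the system telling the producer to slow down

results = []


def worker() -> None:
    while True:
        job = jobs.get()  # blocks until a value arrives
        if job is None:  # sentinel plays the role of closing the channel
            return
        results.append(job * job)


t = threading.Thread(target=worker)
t.start()
for job in (3, 4, None):
    jobs.put(job)  # blocks whenever the worker falls two jobs behind
t.join()

print(rejected, results)
```

With a single worker the results come out in send order; with several workers each job would still be received exactly once, but the order of results would no longer be fixed.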

Build it → Channels carrying ownership between threads at production scale: the Rust Project 51: Message Queue is built around producer/consumer hand-off, and the Go services in Project 02: Microservice Platform use worker pools, streaming-RPC channels, and context propagation across a multi-service stack.

The async/await model: one idea, three implementations

async/await is the syntax that made event-loop concurrency mainstream, and this section follows it through three of our six languages: Python, JavaScript/TypeScript, and Rust, with Java taking a deliberately different road we cover next. The syntax looks identical across them: an async function, an await (or .await) that reads like a blocking wait but is actually a yield point where the function suspends and hands the thread back to the scheduler. Underneath, though, the implementations diverge on two axes that change how you reason about them: eager versus lazy, and built-in runtime versus bring-your-own.

In JavaScript, an async operation is eager. Calling a function that returns a Promise starts the work immediately; the Promise is already in flight, and await merely waits for a result already in motion. The model is one thread, one stack, and — the detail that explains the most otherwise-baffling behavior — two queues with strict priority. Promise continuations go on the microtask queue; timers and I/O callbacks go on the macrotask queue; and the loop drains the entire microtask queue before taking a single macrotask. That one rule predicts almost every “why did this run first?” puzzle:

console.log("A");
setTimeout(() => console.log("B"), 0);            // macrotask
Promise.resolve().then(() => console.log("C"));   // microtask
console.log("D");
// Output: A, D, C, B — synchronous first, then all microtasks, then one macrotask.

Python’s asyncio is the same cooperative single-thread idea, with gather to run a batch of coroutines concurrently and TaskGroup for structured concurrency. It differs from JavaScript in having no built-in micro/macro split exposed to you, and in coexisting with threads and processes (and the GIL) as one of several models rather than the only one.
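
A short gather sketch makes "run a batch concurrently" concrete. The fetch coroutine, its names, and its delays are hypothetical; asyncio.sleep stands in for real I/O. The events list records what ran when.

```python
import asyncio

events: list[str] = []


async def fetch(name: str, delay: float) -> str:
    """Hypothetical I/O-bound coroutine; asyncio.sleep plays the wait."""
    events.append(f"start {name}")
    await asyncio.sleep(delay)  # yield point: the loop runs other tasks here
    events.append(f"done {name}")
    return name.upper()


async def main() -> list[str]:
    # gather wraps each coroutine in a task and waits for all of them;
    # results come back in argument order, not completion order.
    return await asyncio.gather(
        fetch("a", 0.03), fetch("b", 0.01), fetch("c", 0.02)
    )


results = asyncio.run(main())
print(results)  # ['A', 'B', 'C']
```

All three coroutines start before any of them finishes, which is the overlap that a sequence of three separate awaits would not give you. On Python 3.11 and later, asyncio.TaskGroup offers the same fan-out with the structured scoping described above.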

Rust is the outlier on both axes, and the differences are the reason it can be so cheap. A Rust future is lazy — calling an async fn builds a state machine and runs none of it; the work starts only when an executor polls the future. And Rust ships no runtime at all: the language gives you async, await, and the Future trait, but not the executor or the reactor that drive them. You must bring your own, which in practice means Tokio.

// Rust: an async fn compiles to a lazy state machine that does NOTHING until polled.
async fn handle(id: u64) -> Vec<u8> {
    let conn = open(id).await;     // suspension point 1
    let data = conn.read().await;  // suspension point 2 — each .await is a pause/resume state
    process(data)
}

The compiler turns that function into an enum-like state machine with one state per .await, each holding exactly the locals that must survive the suspension. That is why it is zero-cost: the future is sized at compile time to its largest state, allocated inline, with no garbage collector and no green-thread runtime baked in — a parked future costs only the bytes to remember where it paused. The runtime that drives it is two halves: an executor that schedules and polls runnable futures over a small work-stealing thread pool, and a reactor that talks to the OS readiness API (epoll, kqueue, IOCP) and wakes a parked future when its I/O is ready. The table makes the divergence concrete:

Property	JavaScript	Python asyncio	Rust (Tokio)
Eager or lazy	eager (work starts on call)	lazy coroutine; eager once wrapped in a task	lazy (nothing until polled)
Runtime	built into the host (browser/Node)	built in (`asyncio`)	bring your own (Tokio)
Parallelism	none (one thread)	none (one thread, GIL)	yes (multi-thread executor)
Cost per task	tiny	~1 KB	bytes (state machine)
Data-race safety	no shared memory to race	no shared memory to race	`Send`/`Sync` checked at compile time
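
The eager-or-lazy row can be observed directly in Python, where a bare coroutine object and a task behave differently. This is a small sketch with a hypothetical work coroutine; the log list records the order in which things happen.

```python
import asyncio

log: list[str] = []
outcome: list[str] = []


async def work() -> str:
    log.append("body ran")
    return "result"


async def main() -> None:
    coro = work()  # builds a coroutine object; none of the body has run
    log.append("after call")
    task = asyncio.create_task(coro)  # scheduled on the loop, not yet started
    log.append("after create_task")
    await asyncio.sleep(0)  # yield once: the loop now runs the task
    log.append("after yield")
    outcome.append(await task)


asyncio.run(main())
print(log)
# ['after call', 'after create_task', 'body ran', 'after yield']
```

Calling the function runs nothing, which resembles a Rust future before it is polled. Wrapping the coroutine in a task hands it to the loop, which runs it at the next yield point whether or not anyone awaits it.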

The shared price of async everywhere is function coloring: async is contagious, because an async function can only be awaited from another async context, so async-ness propagates up the call stack and splits the world into “red” async functions and “blue” sync ones that don’t compose freely. This ergonomic tax is the strongest argument for not going async unless the concurrency profile — massive, idle, I/O-bound — actually warrants it.

Build it → Make the machinery real: Project 06: Async Runtime is a from-scratch Tokio-style runtime — an epoll-based reactor, a work-stealing scheduler, a timer wheel, and the waker mechanism — i.e. everything the async section describes, implemented. For async I/O under sustained load, the asyncio-based Project 08: Streaming Platform runs high-throughput event pipelines with bounded concurrency and backpressure.

M:N green threads: blocking code that scales

There is a fourth answer that sidesteps async’s function-coloring tax entirely: keep the simple, blocking, top-to-bottom programming style, but make the unit so cheap that you can spawn millions and let the runtime multiplex them. This is the M:N green-thread model, and Go has had it from day one while Java arrived at it in 2023, with Java 21: a striking case of two languages reaching the same destination by opposite routes.

Go’s goroutine is a function call with go in front of it. It begins life with a tiny stack (a couple of kilobytes that grows on demand), is created and switched in user space without a kernel system call, and is multiplexed by the runtime’s M:N scheduler onto a small pool of OS threads sized by GOMAXPROCS. When a goroutine blocks on a channel or I/O, the scheduler lifts it off its thread and runs another there. You write thousands of little sequential programs; the runtime makes a handful of cores do the work.

Java’s virtual thread (Project Loom, stable in Java 21) is the same idea retrofitted onto a 1:1 history. A virtual thread is an ordinary Thread to your code — same API, same blocking calls, same stack traces — but it is not an OS thread. The JVM keeps a small pool of OS carrier threads and mounts a virtual thread onto a carrier only while it runs; the instant it hits a blocking call, the runtime unmounts it, frees the carrier for another virtual thread, and remounts it when the call completes. The payoff is the headline: blocking is cheap again.

// Java: one virtual thread per request; ten thousand concurrent blocking calls is fine.
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    List<Future<Response>> futures = requests.stream()
        .map(req -> executor.submit(() -> handleBlocking(req)))   // blocks freely
        .toList();
    for (var f : futures) process(f.get());
}   // close() waits for every virtual thread to finish

That handleBlocking can call the database, sleep, and make three sequential HTTP calls, and the only cost of all that blocking is some parked virtual threads, which are nearly free. This is the payments gateway from countless real migrations: it threw away the reactive-callback rewrite that once seemed mandatory and went back to “call the bank, wait, call fraud, wait” — and it now scales further than the callback version ever did.

The contrast with async/await is the crux. Python, JavaScript, and Rust scale I/O by coloring functions async — the call sites change, and the coloring propagates. Go and Java scale I/O by keeping the call sites ordinary blocking calls and pushing the multiplexing entirely into the runtime, so it is invisible. Neither is strictly better: async gives you explicit, inspectable suspension points and (in Rust) zero-cost state machines; green threads give you the simple straight-line code and stack traces that survive a debugger. Both have the same sharp edge — a goroutine or virtual thread that blocks the wrong way (an unbuffered channel with no receiver; a synchronized block that pins the carrier) defeats the model — but the cheapness is the point in both.

War story: the leak that only load could find (Go)

A search endpoint queried three backends in parallel and returned whichever answered first. It spawned three goroutines, each sending its result onto an unbuffered channel, and the handler received exactly one value — the fastest — then returned. It passed every test and ran fine for a week. The bug was the two goroutines nobody received from: after the handler took the first result and returned, the other two were still blocked trying to send onto a channel with no receiver, and on an unbuffered channel a send blocks until someone receives. Those goroutines blocked forever, holding their stacks and buffers. At low traffic the leak was invisible — the scheduler parks blocked goroutines, so CPU looked healthy. Under a spike, two leaks per request became hundreds of thousands in minutes and the process was OOM-killed. Two fixes: give the channel a buffer of three so the slow senders can deposit and exit, and derive a context for the request with defer cancel() so the losers are cancelled the moment the handler has its answer. The lesson is the green-thread model in one sentence: cheap to spawn means easy to leak — every unit needs a guaranteed way to stop.
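
The same first-answer-wins shape can be sketched in Python's asyncio, where the analogue of the deferred cancel is cancelling whatever is still pending once a winner arrives. This is an illustration of the stop-signal idea, not the Go fix itself; the backend names and delays are made up.

```python
import asyncio


async def backend(name: str, delay: float, cancelled: list) -> str:
    try:
        await asyncio.sleep(delay)  # stand-in for a network call
        return name
    except asyncio.CancelledError:
        cancelled.append(name)  # record that this loser was told to stop
        raise


async def first_answer(cancelled: list) -> str:
    tasks = [
        asyncio.create_task(backend("fast", 0.01, cancelled)),
        asyncio.create_task(backend("slow", 5.0, cancelled)),
        asyncio.create_task(backend("slower", 10.0, cancelled)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the stop signal every loser is guaranteed to get
    # Wait until the losers have actually finished unwinding.
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


cancelled: list = []
winner = asyncio.run(first_answer(cancelled))
```

Without the cancel loop, the two slow tasks would keep running after the handler returned, which is the asyncio version of the leaked goroutines.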

Structured concurrency and cancellation

The Go leak above points at a deeper problem that every cheap-concurrency model shares: when you can spawn units freely, you can also lose track of them. A unit with no owner responsible for ending it is a leak waiting for load. The modern answer, converging across languages, is structured concurrency — the idea that a group of concurrent tasks should be bound to a lexical scope, so the scope does not exit until every task it started has finished, and if one task fails, its siblings are cancelled rather than left running.

The pattern is the same shape everywhere. Python’s asyncio.TaskGroup (Python 3.11 and later) is an async with block that owns the tasks created inside it; it won’t exit until they all finish, and if one raises it cancels the rest and propagates the error. Java’s StructuredTaskScope (a preview API in Java 21, so its exact shape may still change) forks each subtask as a virtual thread, joins them all, and, under its ShutdownOnFailure policy, interrupts the others on the first failure. Both give concurrency the same block-structured discipline that try-with-resources and RAII gave to resource cleanup.

# Python: a TaskGroup scopes its tasks — all finish together, one failure cancels the rest.
async def main(urls: list[str]) -> None:
    async with asyncio.TaskGroup() as tg:
        for url in urls:
            tg.create_task(fetch(url))   # one raise cancels siblings; error propagates here
    # on exit, every task is done

// Java (preview API in Java 21): a structured scope forks subtasks; the first failure cancels the others.
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    var user  = scope.fork(() -> fetchUser(id));    // each fork is a virtual thread
    var order = scope.fork(() -> fetchOrder(id));
    scope.join();              // wait for both, or for the first failure to shut the scope down
    scope.throwIfFailed();     // rethrow that first failure, if there was one
    return combine(user.get(), order.get());
}   // scope close guarantees no subtask is left running

Go expresses the same need with a different primitive: the context package. A context.Context carries a cancellation signal and an optional deadline, and it propagates down a call tree so that cancelling a parent cancels every child at once. The convention is rigid: ctx is the first parameter of any function that does I/O, every blocking operation selects on ctx.Done(), and you defer cancel() even on the success path to free resources. Context is Go’s structural cure for the leak: had the search handler derived a context from the request, passed it to all three backends, and deferred cancel(), the context would have been cancelled the moment the handler returned with its first answer, and the losing goroutines, selecting on ctx.Done() alongside their send, would have returned.

// Go: a timeout context cancels the whole subtree; defer cancel() is the leak fix made routine.
ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
select {
case body := <-doRequest(ctx, url): // doRequest must watch ctx too, or its goroutine outlives the timeout
    return body, nil
case <-ctx.Done():
    return "", ctx.Err()       // cancelled or timed out: stop now, don't leak
}

Rust’s runtimes carry the same idea: tokio::select! races a future against a timeout or a shutdown signal and drops the losing branches mid-flight, and dropping a future is how Rust cancels it. The unifying insight is that cancellation must be structural, not ad hoc — a timeout or a client disconnect should propagate to every task spawned under it, automatically, the way an exception unwinds a call stack. Structured concurrency is that propagation made into a language construct.

How each language stops a data race

We end on the axis where the six languages differ most sharply: what happens when two units touch the same memory, at least one writing, with no synchronization — the data race. The remarkable thing is that the same bug has five genuinely different fates depending on the language, and those fates form a spectrum from “your problem entirely” to “impossible to compile.”

Language	Mechanism	When a race is caught	What the race costs
Rust	`Send`/`Sync` traits + borrow checker	compile time	a red squiggle in your editor
Go	`-race` detector (runtime instrumentation)	when a test exercises it	a CI failure, if covered
Java	Java Memory Model (`happens-before`)	not caught — you must reason	a subtle “works on my laptop” bug
C++	`std::atomic` + memory orderings	not caught — undefined behavior	a crash on a compiler upgrade or new CPU
Python / JS	one event-loop thread	no shared-memory race exists	(sidestepped, not solved)

At the sidestep end, Python’s asyncio and JavaScript’s event loop run your code on a single thread, so two pieces of your code never execute simultaneously and there is no shared-memory hazard to begin with. This is real safety, but narrow: it is bought by giving up parallelism, and it does nothing for the data races that do occur in Python’s threading world, or wherever memory is explicitly shared between processes or workers.
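
A small sketch of what the sidestep buys, using a hypothetical shared counter: fifty asyncio tasks do an unsynchronized read-modify-write on the same variable, and the total is still exact, because another task can only run at an await.

```python
import asyncio

counter = 0


async def bump(times: int) -> None:
    global counter
    for _ in range(times):
        current = counter        # read
        counter = current + 1    # write: no await in between, so nothing interleaves
        await asyncio.sleep(0)   # yield point: other tasks run only here


async def main() -> int:
    await asyncio.gather(*(bump(1_000) for _ in range(50)))
    return counter


total = asyncio.run(main())
```

Move the await between the read and the write and the guarantee disappears, which is why the safety is real but narrow.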

At the detector end, Go ships the -race flag, which instruments memory accesses at runtime and reports any pair of goroutines that touched the same location without a happens-before relationship, with both stacks. It is close to definitive — but only on code paths your tests actually execute, so it is as good as your concurrent-path coverage. The discipline that pays off is running concurrency tests under -race in CI, every time.

go test -race ./...     # reports the data races your tests exercise, as a CI failure instead of a production incident

The memory-model languages, Java and C++, make you reason about visibility yourself, and they are where the deepest bugs live. The shared insight both encode is happens-before: absent an explicit ordering edge, one thread’s write may never become visible to another, because cores cache independently and both compiler and CPU reorder freely. Java’s JMM gives you volatile (visibility and ordering for a single variable, but no atomicity for compound updates such as count++) and synchronized (visibility + atomicity + ordering) to build those edges on a garbage-collected heap. C++ gives you the raw std::atomic with explicit memory orderings (seq_cst, acquire/release, relaxed) and the blunt rule that an unsynchronized conflicting access is undefined behavior, not merely a wrong value. The compiler may assume the race never happens and optimize on that basis, which is precisely why such a bug can “work” for two years and then break on a compiler upgrade.

// C++: a release-store publishes prior writes; a matching acquire-load sees them.
#include <atomic>
#include <cassert>
std::atomic<bool> ready{false};
int payload = 0;                                  // ordinary, non-atomic
void producer() {
    payload = 42;                                 // (1) ordinary write
    ready.store(true, std::memory_order_release); // (2) publishes (1)
}
void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // (3) spins until it reads true
    assert(payload == 42);                         // guaranteed: (2) synchronizes-with (3), so (1) happens-before this read
}

War story: the optimization that x86 hid and ARM exposed (C++)

A team shipped a cache with double-checked locking: the first thread built a singleton under a mutex and stashed a pointer; every thread after read the pointer with no lock. Textbook, review-approved, and flawless through months of load testing — all of it on x86. Then the mobile build went out, and a fraction of a percent of ARM devices crashed reading garbage fields after the pointer was non-null. The writer set the object’s fields and then the pointer, in source order — but nothing made those two writes become visible to other cores in that order. x86’s strongly-ordered hardware happened to preserve it; ARM’s weak memory model let the pointer store land in another core’s view before the field stores it pointed to, so a reader saw a non-null pointer to a half-built object. The reordering was legal — both hardware and compiler are allowed it — and the code silently assumed it wouldn’t happen. The portable contract is the C++ memory model (a release store paired with acquire loads), not the architecture you happened to test on. “It worked on x86” is a statement about hardware, not about correctness.

At the compile-error end stands Rust, and its move is the most elegant in the spectrum: it notices that a data race is exactly the borrow checker’s shared-XOR-mutable rule — many readers or one writer, never both — violated across a thread boundary. So it doesn’t need a separate mechanism; it extends the rule it already enforces with two marker traits the compiler tracks automatically. Send means a value is safe to move to another thread; Sync means a reference to it is safe to share across threads. thread::spawn demands Send; sharing a reference demands Sync. A non-thread-safe type like Rc (non-atomic reference count) implements neither, so the moment you try to send it across a thread boundary the compiler stops you by name:

error[E0277]: `Rc<Vec<i32>>` cannot be sent between threads safely
   = help: the trait `Send` is not implemented for `Rc<Vec<i32>>`

The compiler is not running your program and observing a race; it is reading your types and reasoning that a race is possible, and refusing on that basis. The class of bug that is undefined behavior in C++, a heisenbug in Java, and a CI gamble in Go is, in Rust, a build failure. The honest caveat — worth stating because Rust’s marketing oversells it — is that this guarantee is narrow: it eliminates data races, not deadlocks, livelocks, or logical races where every individual access is synchronized but the overall sequence is still wrong. The compiler buys synchronization safety; lock ordering and program logic remain yours, in Rust exactly as in C++.
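
The logical race named in that caveat is language-independent, and it is easiest to see where no data race is even possible. The sketch below uses single-threaded asyncio and a hypothetical Account class: each access is individually safe, yet a check followed by an await followed by an act lets two withdrawals both pass the check. Holding one lock across the whole sequence is one way to fix it.

```python
import asyncio


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = asyncio.Lock()


async def withdraw_racy(acct: Account, amount: int) -> None:
    if acct.balance >= amount:       # check
        await asyncio.sleep(0)       # e.g. a call to a fraud service
        acct.balance -= amount       # act: the check may be stale by now


async def withdraw_safe(acct: Account, amount: int) -> None:
    async with acct.lock:            # check and act as one critical section
        if acct.balance >= amount:
            await asyncio.sleep(0)
            acct.balance -= amount


async def run(withdraw) -> int:
    acct = Account(100)
    await asyncio.gather(withdraw(acct, 80), withdraw(acct, 80))
    return acct.balance


racy_balance = asyncio.run(run(withdraw_racy))
safe_balance = asyncio.run(run(withdraw_safe))
```

The racy version ends overdrawn even though no two accesses ever overlapped in time; the same mis-sequencing compiles cleanly in Rust if the lock is taken and released around each step separately.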

Build it → Shared state and message passing under real concurrency: Project 11: Distributed KV (Raft) guards replicated state machines behind Arc<Mutex<…>> and threads async networking through a consensus protocol, exercising both coordination styles at once.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Diagnose the bound, then pick the model. Take two workloads: one that hashes a large in-memory list (CPU-bound) and one that fetches many small URLs (I/O-bound). For each, write down whether it is CPU- or I/O-bound and which model you would use in two languages of your choice — and justify each choice in terms of the model, not the API. For a Python answer, explain in one sentence why the GIL makes threading useless for the hashing workload but fine for the fetching one. For a Go or Java answer, explain why you would not need that distinction. The deliverable is the reasoning, not running code.
Level II — Reproduce one failure mode and fix it. Pick one of the cooperative-scheduler hazards and demonstrate it end to end in the relevant language: (a) block a Python asyncio event loop with a synchronous time.sleep inside a coroutine and show throughput collapse, then fix it with asyncio.to_thread; or (b) leak a goroutine in Go by sending on an unbuffered channel with no receiver, prove the leak with runtime.NumGoroutine() before and after, then fix it with a buffer or a context. Either way, write a paragraph naming the model property that caused the failure (cooperative scheduling never preempts a blocking call; a cheap unit still needs a stop signal) and explain why the fix addresses the property, not just the symptom.
Level III — Build the same concurrent component twice, with different coordination styles. Implement a metrics aggregator that ingests events from several producers and maintains running per-key counts, once with shared state (a lock-protected map — Arc<Mutex<…>> in Rust, ConcurrentHashMap in Java, or sync.Mutex in Go) and once with message passing (producers send events down a channel to a single owner that holds the map alone, no lock). Compare them: which version has a lock and which transfers ownership; which would extend more cleanly to a multi-stage pipeline; and — for the shared-state version — name two failure modes the language does not protect you from (a deadlock from lock ordering, a logical race from a synchronized-but-mis-sequenced read-modify-write) and say how you would prevent each. Close with a sentence on where the language’s safety guarantee ends and your judgment begins.

Summary

Concurrency is structuring independent work to be in progress at once; parallelism is executing it simultaneously, which needs cores; the two are orthogonal. The first question for any workload is whether it is I/O-bound (wants cheap concurrency, and the more units there are, the cheaper each must be) or CPU-bound (wants real parallelism across cores), because that diagnosis picks the model. The unit of concurrency comes in four shapes: 1:1 OS threads (C++, Rust std, classic Java: capable, preemptive, rationed at a megabyte each); M:N green threads (goroutines, JVM virtual threads: cheap, blocking code that scales); event-loop tasks (Python asyncio, JavaScript: one cooperative thread, no parallelism); and Rust’s lazy, zero-cost, poll-based futures with a bring-your-own runtime. Schedulers split cooperative from preemptive, and every cooperative one obeys the cardinal rule: never block it. Python’s GIL collapses the whole language into the I/O-versus-CPU question. The two coordination styles, shared memory with locks (the wrong tool for flowing data) and message passing over channels (the wrong tool for state that sits still), are a choice you make per piece of state. And the five answers to the data race run from Rust’s compile error through Go’s detector and the Java/C++ memory models to Python and JavaScript’s single-thread sidestep: the sharpest axis on which these languages differ, and the one that most shapes the systems people build in them.

Key takeaways

Concurrency is not parallelism. Diagnose the workload — I/O-bound wants cheap concurrency, CPU-bound wants real parallelism — and the model follows from the diagnosis, not from taste.
Four units, two mappings. 1:1 OS threads are capable but cost a megabyte and are rationed; M:N green threads (goroutines, virtual threads), event-loop tasks, and poll-based futures all cost kilobytes or less, which is why they scale I/O.
Never block a cooperative scheduler. One synchronous call freezes an event loop or removes an async worker from rotation — the same bug in Python, JavaScript, Rust async, and pinned virtual threads.
Python’s GIL makes one question decide everything: I/O-bound wants threads or asyncio, CPU-bound needs multiprocessing — the only language here that must leave the process to parallelize computation.
Flowing data wants a channel; sitting-still state wants a lock. Message passing transfers ownership so there is nothing to race over; shared memory needs a lock and invites deadlocks no type system catches.
Five fates for one data race: Rust’s compile error (Send/Sync), Go’s -race detector, the Java Memory Model’s happens-before, C++’s atomics-or-undefined-behavior, and the single-thread sidestep of Python asyncio and JavaScript.

Connections to other chapters

Software Engineering Overview (prerequisite): the process/thread/block vocabulary and the reproducibility mindset framed there are the substrate for everything here — a data race is the reproducibility problem at its most vicious, non-deterministic and load-dependent.
Rust Ownership and Borrowing (prerequisite): fearless concurrency is literally the shared-XOR-mutable rule from that chapter applied across a thread boundary — Send/Sync, the move into a spawned thread, and the move through a channel are all the borrow checker reaching across threads. Read it first to see why a Rust data race is a compile error.
Memory and Resource Management (sibling): the C++ memory model, false sharing, and the reclamation hazard in lock-free code are lifetime-and-memory problems before they are concurrency problems; a Mutex<T> welding a lock to its data is RAII, and a leaked goroutine is a resource leak. The cost of an OS thread’s stack is a memory fact.
Error Handling (sibling): cancellation and structured concurrency are error handling for concurrent code — a TaskGroup propagating one task’s failure to its siblings is exactly the unwind-on-error discipline applied to a tree of tasks, and an unhandled Promise rejection is an error that escaped its scope.
Async Runtime (Project 06) and Data Streaming (extensions): the executor/reactor split, work-stealing scheduling, and bounded-concurrency backpressure described here are built in the async-runtime project and applied at scale in streaming pipelines, where the same patterns must also survive worker failure and retries across machines.
Distributed Training and GPU and CUDA (extensions): parallelism at the largest scale — splitting computation across many machines or thousands of GPU cores — is the CPU-bound branch of this chapter’s first question taken to its limit, where the coordination problem becomes the whole problem.

Rob Pike, “Concurrency Is Not Parallelism” (2012 talk) — the talk that crystallized the distinction at the heart of this chapter, and the clearest 30 minutes you can spend on the topic.
Katherine Cox-Buday, Concurrency in Go (O’Reilly) — goroutines, channels, select, and context built up rigorously; the best book-length treatment of the CSP model.
The Rust Programming Language (Klabnik & Nichols), the “Fearless Concurrency” chapter — the canonical introduction to Send/Sync, Arc<Mutex<T>>, and channels, and the source of the term.

Deep dives

Anthony Williams, C++ Concurrency in Action (2nd ed.) and Herb Sutter, “atomic Weapons: The C++ Memory Model and Modern Hardware” (talk) — the definitive accounts of memory ordering, acquire/release, and why weakly-ordered hardware exposes bugs strong hardware hides.
JEP 444: Virtual Threads and Ron Pressler’s Project Loom talks — why the JVM chose user-mode threads over an async/await language feature, and what M:N multiplexing onto carriers buys you.
The Tokio Tutorial and Asynchronous Programming in Rust (the “async book”) — how a lazy, poll-based, zero-cost future works and why Rust keeps the runtime out of the language.
Jake Archibald, “In The Loop” (JSConf talk) — the definitive walk-through of the JavaScript microtask/macrotask split and why ordering is what it is.

Historical context

C. A. R. Hoare, “Communicating Sequential Processes” (CACM, 1978): the foundational paper from which Go’s channels and goroutines descend.
Leslie Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs” (1979) — the origin of sequential consistency, the baseline the memory-ordering spectrum is defined against.
PEP 703, Making the Global Interpreter Lock Optional in CPython — the authoritative account of what the GIL costs, what it protects, and what removing it requires.
Jung et al., “RustBelt: Securing the Foundations of the Rust Programming Language” (POPL 2018) — the formal proof that Send and Sync are sound: the mathematics under “if it compiles, it has no data races.”

--- title: "Concurrency and Parallelism Models" keywords: [concurrency, parallelism, threads, goroutines, async, await, event loop, csp, channels, gil, mutex, data races, memory model, structured concurrency, scheduler] difficulty: advanced prerequisites: [software-engineering-overview, go-fundamentals, rust-fundamentals] estimated_time: "4-5 hours" --- ## Introduction The same bug shipped four times that quarter, in four different languages, and each team thought it was unique to theirs. A Python service rebuilt on asyncio fell over under load because one synchronous database call, buried three frames deep, froze the single event-loop thread and serialized every in-flight request. A Go search endpoint leaked goroutines — two per request, blocked forever on a channel nobody would ever receive from — until the box was OOM-killed on a traffic spike. A Java reporting service sized its thread pool at 200 to keep cores busy through blocking calls, then watched one slow dependency occupy all 200 threads at once and reject healthy traffic to unrelated endpoints. And a C++ cache shipped a double-checked-locking optimization that passed review and ran for months on x86, then crashed on a fraction of a percent of ARM phones because the hardware was free to reorder two writes the code silently assumed were ordered. Read those four incidents as bugs and they look unrelated. Read them as a family and a pattern jumps out: every one is a team reaching for a concurrency tool without first understanding the *model* underneath it. The Python team didn't grasp that a cooperative event loop runs to completion between yield points. The Go team didn't grasp that a cheap, freely-spawned goroutine still needs a guaranteed way to *stop*. The Java team was working around the price of an OS thread without realizing the price had a fix. The C++ team assumed a single global order of memory that hardware does not provide. 
The model was the missing piece in each case — and the deep, useful fact is that there are only a handful of models, shared across all six languages this chapter covers. Learn them once, comparatively, and the four bugs become one bug you can recognize anywhere. ### The Core Insight Concurrency is not parallelism, and conflating the two is where most of the confusion starts. **Concurrency is a way of *structuring* a program** so that independent pieces of work can be in progress at the same time — interleaved, suspended, resumed. **Parallelism is *executing* multiple pieces of work literally simultaneously**, which needs multiple cores. A single-core machine can be highly concurrent (an event loop juggling ten thousand connections) and not parallel at all. A program can also be parallel without being interesting-ly concurrent (a `for` loop split across cores). The two are orthogonal, and the first question for any workload is which one you actually need. That question has a sharp answer because it tracks one physical distinction. Work that spends its time **waiting** — on the network, a disk, a database — is *I/O-bound*, and the win is *concurrency*: overlap the waits so the thread isn't idle. Work that spends its time **computing** — parsing, hashing, transforming numbers — is *CPU-bound*, and the only win is *parallelism*: spread the arithmetic across cores. Reach for a concurrency model on CPU-bound work and you get no speedup; reach for raw parallelism on I/O-bound work and you pay for threads that mostly sleep. Every language in this chapter gives you tools for both, but the *shape* of those tools — and the safety they offer — differs enormously, and that variation is the whole subject. The deepest divergence is over **shared mutable state**. 
Two threads touching the same memory, at least one of them writing, with no synchronization between them, is a *data race* — and a data race is the most expensive bug in systems programming, because it is non-deterministic, load-dependent, immune to the debugger, and usually found in production. Six languages give six different answers to it: Python and JavaScript mostly sidestep it with a single thread; Go offers channels (don't share) plus a runtime race detector; Java codifies a memory model with `happens-before` edges; C++ hands you the raw memory orderings and the obligation to use them; Rust makes the race a *compile error*. Concurrency is the axis on which programming languages differ most, and the model a language picks shapes every system built in it. ### A mental model Picture a restaurant, and you can place every concurrency model in this chapter on its floor. The slow work — cooking — is the I/O: the part where you *wait*. Serving tables is the computation: the part that needs *hands*. A **thread-per-request** restaurant hires one full-time waiter per table. Each waiter walks an order to the kitchen and stands there until the food is plated. It is dead simple to reason about — one waiter, one story, top to bottom — but waiters are expensive, you can only afford a few dozen, and most of them are standing idle at a stove. That is the **1:1 OS thread** model: capable, preemptively scheduled by a kitchen manager who can yank any waiter off the floor at any instant, but rationed because each one costs a salary (a megabyte of stack) whether working or not. A **single async waiter** works the whole room alone. She drops an order at the kitchen and *immediately* moves to the next table rather than waiting; when a dish is ready a bell rings and she swings by to deliver it. One waiter keeps fifty tables moving precisely because the room is mostly *waiting*, and waiting needs no waiter. 
That is the **event loop** — cooperative, single-threaded, scaling I/O beautifully — with one fatal rule: if she ever stops to chop vegetables herself (a blocking call), the entire room freezes. A **crew of green-thread waiters** is the modern answer: you write down a million orders for almost nothing, and a small permanent crew picks them up, setting each one aside the instant it says "waiting on the kitchen" and grabbing the next. Each order *thinks* it has its own dedicated waiter standing by — which keeps the code simple — but the crew is never idle. That is the **M:N model**: goroutines and JVM virtual threads, cheap units multiplexed onto a few real threads. And **Rust's futures** are a fourth shape — a stack of order tickets that do *nothing* until a waiter picks one up and works it as far as it can, a deliberately lazy design that costs only the paper the ticket is written on. ### When to use which model The choice of model is not taste; it follows mechanically from two questions. @fig-concurrency-models maps the four units of concurrency onto the cores beneath them, and the two coordination styles that ride on top — and the decision walks straight through it. ![A comparative map of concurrency models: 1:1 OS threads, M:N goroutines over a work-stealing scheduler, single-threaded event loops with async tasks, and poll-based futures — and the two coordination styles, shared memory with locks versus message passing over channels.](../assets/diagrams/rendered/concurrency_models.svg){#fig-concurrency-models .lightbox} **First question: is the work CPU-bound or I/O-bound?** CPU-bound work needs real parallelism — threads (or processes) spread across cores, ideally one worker per core, because more workers than cores just add context-switch overhead without adding compute. I/O-bound work needs cheap concurrency — and the more of it, the cheaper each unit must be. 
A handful of concurrent waits is fine on OS threads; tens of thousands of mostly-idle connections demands a model where an idle unit costs bytes, not a stack, which means an event loop, goroutines/virtual threads, or async futures. **Second question: how do the concurrent units coordinate?** If data *flows* in one direction — a pipeline, a producer feeding workers, a result handed back — prefer **message passing**: a channel transfers ownership with the value, so there is nothing left to race over and no lock to forget. If state *sits still* and several units mutate it in place — a counter, a cache, a registry — prefer **shared memory with a lock**, the smaller and shorter-lived the critical section the better. "Flowing wants a channel; sitting still wants a lock" is the rule of thumb, and it holds in every language that offers both. The languages then differ on what they make *easy* and what they make *safe*. Go and recent Java make cheap M:N concurrency the path of least resistance. Python and JavaScript hand you one cooperative event loop and (mostly) one thread. Rust gives you the cheapest async there is but makes you supply the runtime, and proves the absence of data races at compile time. C++ gives you the rawest control and the heaviest obligation. None is strictly best; each is a point in a design space, and the rest of this chapter is the map. ### What you'll learn - How to tell **concurrency from parallelism**, diagnose a workload as CPU- or I/O-bound, and pick the model that diagnosis implies - The four **units of concurrency** — OS threads, M:N green threads, event-loop tasks, and poll-based futures — and the scheduler each one rides on - Why **cooperative** schedulers (event loops, async runtimes) and **preemptive** ones (OS threads) fail in opposite ways, and the cardinal rule that protects every cooperative one - How Python's **GIL** turns one question — I/O-bound or CPU-bound? 
  — into the answer for the whole language, and how Java and Go reached "blocking code that scales"
- The two coordination styles — **shared memory with locks** versus **message passing over channels (CSP)** — and when each is the simpler, faster, safer fit
- How `async`/`await` works across languages, from JavaScript's eager Promises to Rust's lazy, zero-cost, poll-based futures with a bring-your-own runtime
- **Structured concurrency** and **cancellation** — how to scope a group of tasks so failure and timeouts propagate instead of leaking
- The five answers to **data races** — Rust's `Send`/`Sync`, Go's race detector, the Java Memory Model, C++'s `std::atomic` and memory orderings, and the single-thread sidestep

### Prerequisites

- **Software Engineering Overview** — what a process and a thread are, what it means for a call to *block*, and why reproducibility and resource limits matter
- **Go Fundamentals** and **Rust Fundamentals** — this chapter draws its sharpest contrasts from Go's goroutines-and-channels and Rust's ownership model; comfort reading idiomatic Go and Rust will make the comparisons land
- A working idea of what a data race and a deadlock are, even informally

---

## The unit of concurrency: four shapes

Every concurrency model is, at bottom, a choice about the *unit* you spawn and how that unit maps onto the operating system's threads. There are four shapes in wide use, and the rest of the chapter is variations on them. The cleanest way to see the design space is a single table comparing the unit each language gives you first.

| Language | Default unit | Mapping to OS threads | Scheduling | Cost per unit |
|----------|-------------|----------------------|------------|---------------|
| C++ | `std::thread` | 1:1 | preemptive (kernel) | ~1 MB stack |
| Rust (sync) | `std::thread` | 1:1 | preemptive (kernel) | ~1 MB stack |
| Java (classic) | platform `Thread` | 1:1 | preemptive (kernel) | ~1 MB stack |
| Java (Loom) | virtual `Thread` | M:N | cooperative + carrier preempt | ~1 KB |
| Go | goroutine | M:N | cooperative + async preempt | ~2 KB (grows) |
| Python | asyncio task / thread / process | 1:1 thread (GIL-bound) | cooperative (loop) | ~1 KB task |
| JavaScript / TS | Promise / async task | single thread | cooperative (event loop) | tiny |
| Rust (async) | `Future` task | M:N over a runtime | cooperative (executor) | bytes (state machine) |

The table tells two stories at once. Read the **mapping** column and the world splits in two: the 1:1 languages, where each unit *is* an OS thread, and the M:N languages, where many cheap units share a few OS threads. Read the **cost** column and you see why that split exists at all — an OS thread reserves about a megabyte of stack whether or not it is doing anything, so a process tops out around the low tens of thousands of them, which is a hard ceiling for a server whose job is to hold many slow connections open. The M:N units cost kilobytes or less, so you can have millions.

The **1:1 model** is the oldest and the simplest to reason about. A thread is a real OS thread: its own stack, scheduled by the kernel, running truly in parallel on another core. The kernel is *preemptive* — it can interrupt any thread at any instant to run another — so no thread can starve the others by refusing to yield, which is a genuine safety property. The cost is the megabyte and the kernel's involvement in every context switch.
This is what C++, Rust's `std::thread`, and classic Java give you, and for CPU-bound work or modest concurrency it is exactly right: a handful of threads, one per core, crunching numbers in parallel.

The **M:N model** keeps the simple blocking *programming style* but removes the cost. The runtime keeps a small pool of OS threads (Go calls them Ms, the JVM calls them carriers) and multiplexes many cheap units onto them. When a unit blocks — on a channel, on I/O — the runtime *parks* it, unmounts it from its OS thread, and runs a different runnable unit there, remounting the parked one when its blocking call completes. The unit *thinks* it blocked; the OS thread never sat idle. This is the goroutine, and it is the Java virtual thread, and the two are the same idea reached more than a decade apart.

The **event-loop model** goes to one thread and makes the concurrency entirely cooperative. There is one call stack; only one thing runs at a time; slow work is handed to the runtime and its completion is queued; the loop runs queued continuations when the stack is clear. This is Python's asyncio and all of JavaScript. It scales I/O superbly on a single thread and gives up parallelism entirely — many operations in flight, but only one line of your code running at any instant.

The **poll-based future** is Rust's distinctive fourth shape, and it inverts an assumption the other three share. In every other model, spawning a unit *starts* it. A Rust future is **lazy**: calling an `async fn` builds a state machine and runs *none* of its body. Nothing happens until an executor *polls* it. That single difference — laziness — is what makes Rust async zero-cost and is the source of most of its surprises, and we will come back to it.

## Schedulers: cooperative versus preemptive

The unit is half the model; the scheduler is the other half, and the axis that matters most is **cooperative versus preemptive**.
A *preemptive* scheduler can interrupt a running unit at any point and switch to another — the OS does this to threads on a timer interrupt, dozens of times a second. A *cooperative* scheduler can only switch when the running unit *yields* control voluntarily, at an explicit point. The two fail in exactly opposite ways, and knowing which one you are on tells you which failure to fear.

Preemptive scheduling is robust against a misbehaving unit. A thread stuck in an infinite CPU loop cannot freeze the others, because the kernel will preempt it and let everyone else run. The price is that a context switch can happen *anywhere* — between any two instructions, including in the middle of `count++` — so shared mutable state can be corrupted at any interleaving, and you need locks to make multi-step operations atomic. Preemption gives you robustness against starvation and takes away the ability to reason about where you can be interrupted.

Cooperative scheduling makes the opposite trade. Because a unit yields only at explicit points — `await` in Python/JS/Rust, a channel operation in Go's older model, a blocking call on a virtual thread — you know *exactly* where control can leave your code, which makes whole categories of races evaporate: between two yield points, your code runs uninterrupted. The price is the cardinal rule that governs every cooperative runtime: **a unit that never yields freezes everything.** This is the single most important operational fact about event loops and async runtimes, and it caused the first of the four incidents in the introduction.

::: {.callout-warning}
## War story: the one synchronous call that serialized everything (Python)

A team rebuilt a flaky API on FastAPI and asyncio specifically to handle a flood of concurrent traffic, load-tested it to thousands of requests per second, and shipped it. In production, p99 latency was catastrophic — 20 ms requests took *seconds*, and throughput collapsed to roughly one request at a time.
The handler was `async`, the framework was async, and buried three calls deep was a single legacy helper issuing a database query through a *synchronous* driver. Each time any request reached that helper, its blocking query pinned the one event-loop thread for the entire round trip, and every other in-flight request — hundreds of them — sat frozen until it returned. The async service was, in effect, single-threaded *and* synchronous, the worst of both. The fix was to stop blocking the loop: swap in an async driver so the query `await`s, or push the sync call off the loop onto a thread pool.

The same hazard has the same shape in JavaScript (a long synchronous loop stalls the page), in Rust async (a `std::fs::read` inside a task removes a worker from rotation), and even on Java's virtual threads (blocking inside `synchronized` *pins* the carrier). One rule, four languages: **never block a cooperative scheduler.**
:::

The M:N models blur the line in a useful way. Go's scheduler is mostly cooperative — a goroutine yields at channel operations and function-call preemption points — but since Go 1.14 it also has *asynchronous preemption*, so a goroutine in a tight CPU loop can still be interrupted, recovering the robustness of preemption. Java's virtual threads are cooperative about *blocking* (they unmount and yield the carrier) but the carrier itself is a preemptible OS thread. The lesson is that "cooperative" and "preemptive" are ends of a spectrum, and the modern runtimes deliberately sit in the middle: cooperative enough to be cheap, preemptive enough not to be starved.

## The GIL: when the model collapses to one question

No language makes the CPU-versus-I/O distinction more consequential than Python, because of one design decision: the **Global Interpreter Lock**. The GIL is a single mutex inside the standard CPython interpreter that a thread must hold to execute Python bytecode.
Because there is exactly one, **exactly one thread runs Python code at a time, no matter how many cores you have.** It exists for a real reason — it makes CPython's reference-counting memory management and its vast C-extension ecosystem simpler and faster in the common single-threaded case — but the consequence is stark: in CPython, threads do not give you CPU parallelism.

The crucial detail is *when the lock is released*. A thread drops the GIL whenever it is not running Python bytecode — most importantly, while it waits on the OS for I/O. A thread blocked on a socket holds no lock, so other threads run. That single asymmetry is the entire Python concurrency story, and it collapses every decision into one diagnostic question: **is the work I/O-bound or CPU-bound?**

- **I/O-bound** work spends its time outside the interpreter, so the GIL is released and many waits overlap. Threads help here, and asyncio helps far more cheaply — a coroutine costs about a kilobyte where a thread costs megabytes. Use `threading`/`concurrent.futures` for modest concurrency or to keep synchronous libraries; use asyncio for very high concurrency.
- **CPU-bound** work keeps the interpreter busy executing bytecode, so threads serialize on the GIL and you get *no* speedup — sometimes a slowdown, from the cost of handing the lock back and forth. The only path to real parallelism is **multiprocessing**: separate processes, each with its own interpreter and its own GIL, paying the cost of pickling everything that crosses the process boundary.

This is the report-generator bug in one paragraph: a team wrapped a CPU-bound hot loop in eight threads expecting an eight-times speedup and got eleven minutes and four seconds — one core pinned, seven asleep — because the GIL serialized them. The fix was a process pool, which woke the idle cores.
The asterisk worth remembering is that heavy numerical libraries (NumPy, pandas) *release the GIL* while computing in C, so already-vectorized work can be parallelized by threads after all.

Set Python beside the other languages and the GIL's specialness is obvious. Java, Go, C++, and Rust all run threads in genuine parallel across cores with no interpreter lock; for them, CPU-bound and I/O-bound both have in-language answers. Python alone must leave the process to parallelize computation. (The experimental free-threaded CPython builds in 3.13+ are beginning to lift this, but for code shipped today, assume the GIL is present and let the workload pick the model.)

> **Build it →** A production distributed job queue that runs CPU-bound and I/O-bound tasks
> across worker pools — Celery alongside asyncio, the concrete form of "offload heavy work to
> processes, keep the front door responsive" — is [Project 01: Distributed Job Queue](https://github.com/jchu0/applied-cs-projects/tree/main/01-distributed-job-queue).

## Shared memory with locks: the default, and its perils

When several units must touch one piece of state in place, the classic answer is **shared memory protected by a lock**. The idea is universal — a mutex admits one unit into the critical section at a time — but each language wraps it differently, and the differences reveal how much each one trusts the programmer.

C++ gives you the rawest version and the heaviest obligation. A `std::mutex` sits *beside* the data it guards, and nothing connects them except your discipline — you must remember to take the lock, and never call `lock()`/`unlock()` by hand, because an exception between them leaks the lock forever. RAII wrappers are the cure: `std::scoped_lock` locks in its constructor and unlocks in its destructor, releasing on every exit path including a throw, and acquiring multiple mutexes deadlock-free.

```cpp
// C++: the lock sits beside the data; RAII guarantees release; scoped_lock avoids deadlock.
#include <mutex>

struct Account { std::mutex mtx; long balance = 0; };  // assumed shape, for illustration

void transfer(Account& from, Account& to, long cents) {
    if (&from == &to) return;                  // locking the same mutex twice is undefined
    std::scoped_lock lock(from.mtx, to.mtx);   // both mutexes, deadlock-free
    from.balance -= cents;
    to.balance   += cents;
}   // both released here, even on a throw
```

Rust takes the same primitive and welds it to the data, which is the quiet masterstroke. A `Mutex<T>` *contains* the `T`; the only way to reach the value is `.lock()`, which returns a guard that releases automatically when it drops. You cannot touch the data without holding the lock, because the data is unreachable any other way — forgetting to lock is not a discipline you maintain but a state the type system will not let you express. Shared ownership across threads adds `Arc`, an atomically reference-counted handle, so the canonical pattern is `Arc<Mutex<T>>`.

```rust
// Rust: the lock IS the data's container; Arc shares it; both verified by the compiler.
use std::sync::{Arc, Mutex};
use std::thread;

let counter = Arc::new(Mutex::new(0));
for _ in 0..10 {
    let counter = Arc::clone(&counter);       // a handle to the SAME value
    thread::spawn(move || {
        let mut n = counter.lock().unwrap();  // unreachable except through the lock
        *n += 1;
    });                                       // guard drops here, lock released
}
```

Java separates *which lock* from *what it protects* like C++, but layers a contract on top: the **Java Memory Model**, which we cover below. In practice, idiomatic Java reaches for a higher-level building block before a raw lock — a `ConcurrentHashMap` whose `computeIfAbsent` makes "look up, and create if missing" a single atomic step, or a `BlockingQueue` that hands backpressure to you for free. The throughline across all four locking languages is the same as the tool-choice rule: pick the highest-level construct that fits, and drop to a hand-rolled lock only when nothing else expresses what you need.

The peril shared by every locking model is that **a lock prevents data races but invites deadlocks**. Two units that grab two locks in opposite orders wedge permanently — each holds one and waits for the other.
No type system catches this, because lock ordering is a property of program logic, not types; even Rust's compiler, which eliminates data races outright, will happily compile a deadlock. The defenses are the same everywhere: impose a global lock-ordering discipline, collapse two locks into one, or acquire multiple locks atomically (`scoped_lock`). A lock buys you synchronization safety; it does not buy you thinking.

## Message passing: don't share, communicate

The other coordination style avoids shared state entirely. Instead of putting a value behind a lock and letting every unit reach in, you give one unit ownership of the value and let others *hand it off* over a **channel** — a typed pipe that carries values between units. When a value travels down a channel, ownership travels with it: the sender is done with it, the receiver now holds it, and there is no window where both touch it at once. The handoff *is* the synchronization, so no separate lock is needed. This is **CSP** — Communicating Sequential Processes, Tony Hoare's 1978 idea — and its slogan is worth memorizing: *don't communicate by sharing memory; share memory by communicating.*

Go builds its whole identity on this. A goroutine sends with `ch <- job` and another receives with `job := <-ch`, and the channel guarantees exactly one receiver gets each value. The buffered-versus-unbuffered distinction is the one to internalize: an unbuffered channel is a synchronous handoff (a send blocks until someone receives), while a buffered channel decouples the two and its capacity is your backpressure knob — a full buffer is the system telling the producer to slow down. A `select` waits on several channels at once, which is how timeouts and cancellation enter the model.

```go
// Go: a value (and its ownership) flows from producer to worker over a typed channel.
func generate(nums ...int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out) // closing signals "no more values" to receivers
        for _, n := range nums {
            out <- n // ownership of each value moves down the pipe
        }
    }()
    return out // receive-only type documents and enforces direction
}
```

Rust has the same idiom — `std::sync::mpsc` (multiple producer, single consumer) — but backs it with a stronger guarantee. Because Rust *moves* the value into the channel, the compiler rejects any code that tries to use the value after sending it. Go's safety here comes from convention and a race detector you run afterward; Rust's comes from the type system, statically. Same shape, stronger contract.

```rust
// Rust: send MOVES ownership; the compiler forbids touching the value afterward.
use std::sync::mpsc;
use std::thread;

let (tx, rx) = mpsc::channel();
thread::spawn(move || {
    for msg in ["work", "more", "done"] {
        tx.send(String::from(msg)).unwrap();  // the String moves out; can't be reused
    }
});                                           // tx drops, channel closes, rx loop ends
for received in rx {
    println!("got: {received}");
}
```

The decision between the two styles is the rule from the introduction, applied per piece of state. *Is the data flowing, or sitting still?* Flowing wants a channel — pipelines, fan-out to workers, fan-in of results, a result handed back. Sitting still wants a lock — a counter, a cache, a config served to every request. Forcing flowing data through a lock means manual coordination where a channel would be the literal expression of the handoff; forcing sitting-still state through a channel is ceremony where `lock(); x++; unlock()` would do. Both Go and Rust offer both tools precisely so you can match the tool to the shape.
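
As a rough illustration of matching the tool to the shape, the sketch below uses Python's standard library, with `queue.Queue` standing in for a channel and `threading.Lock` for the mutex. The function name `square_all`, the `_STOP` sentinel, and the worker count are hypothetical choices made for this example. Unlike Rust, a Python queue does not stop the sender from touching a value after handing it off, so that part remains convention.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel: tells a worker there is no more work


def square_all(numbers, worker_count=4):
    """Square each number on worker threads; return (sorted squares, jobs processed)."""
    jobs = queue.Queue(maxsize=8)      # flowing data: bounded, so a full queue is backpressure
    results = queue.Queue()            # flowing data: results handed back to the caller
    processed = 0                      # sitting-still state: one counter mutated in place
    processed_lock = threading.Lock()  # so it gets a lock with a tiny critical section

    def worker():
        nonlocal processed
        while True:
            n = jobs.get()             # blocks until a job (or the sentinel) arrives
            if n is _STOP:
                return
            results.put(n * n)         # hand the result off; the queue synchronizes it
            with processed_lock:       # read-modify-write on shared state needs the lock
                processed += 1

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for n in numbers:
        jobs.put(n)                    # blocks while the queue is full
    for _ in threads:
        jobs.put(_STOP)                # one sentinel per worker so every worker exits
    for t in threads:
        t.join()

    squares = []
    while not results.empty():         # safe here: all producers have already exited
        squares.append(results.get())
    return sorted(squares), processed
```

Calling `square_all([1, 2, 3])` returns the sorted squares together with the number of jobs processed. The jobs and results flow through queues with no explicit lock, while the counter, which sits still and is mutated in place, is the only state behind a lock.
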

> **Build it →** Channels carrying ownership between threads at production scale: the Rust
> [Project 51: Message Queue](https://github.com/jchu0/applied-cs-projects/tree/main/51-message-queue)
> is built around producer/consumer hand-off, and the Go services in
> [Project 02: Microservice Platform](https://github.com/jchu0/applied-cs-projects/tree/main/02-microservice-platform)
> use worker pools, streaming-RPC channels, and context propagation across a multi-service stack.

## The async/await model: one idea, three implementations

`async`/`await` is the syntax that made event-loop concurrency mainstream, and three of our six languages have it — Python, JavaScript/TypeScript, and Rust, with Java taking a deliberately different road we cover next. The syntax looks identical across them: an `async` function, an `await` (or `.await`) that *reads* like a blocking wait but is actually a **yield point** where the function suspends and hands the thread back to the scheduler. Underneath, though, the implementations diverge on two axes that change how you reason about them: **eager versus lazy**, and **built-in runtime versus bring-your-own**.

In **JavaScript**, an async operation is *eager*. Calling a function that returns a Promise starts the work immediately; the Promise is already in flight, and `await` merely waits for a result already in motion. The model is one thread, one stack, and — the detail that explains the most otherwise-baffling behavior — *two* queues with strict priority. Promise continuations go on the **microtask queue**; timers and I/O callbacks go on the **macrotask queue**; and the loop drains the *entire* microtask queue before taking a *single* macrotask. That one rule predicts almost every "why did this run first?"
puzzle:

```typescript
console.log("A");
setTimeout(() => console.log("B"), 0);            // macrotask
Promise.resolve().then(() => console.log("C"));   // microtask
console.log("D");
// Output: A, D, C, B — synchronous first, then all microtasks, then one macrotask.
```

**Python's** asyncio is the same cooperative single-thread idea, with `gather` to run a batch of coroutines concurrently and `TaskGroup` for structured concurrency. It differs from JavaScript in having no built-in micro/macro split exposed to you, and in coexisting with threads and processes (and the GIL) as one of several models rather than the only one.

**Rust** is the outlier on both axes, and the differences are the reason it can be so cheap. A Rust future is **lazy** — calling an `async fn` builds a state machine and runs none of it; the work starts only when an executor *polls* the future. And Rust ships **no runtime at all**: the language gives you `async`, `await`, and the `Future` trait, but not the executor or the reactor that drive them. You must bring your own, which in practice means Tokio.

```rust
// Rust: an async fn compiles to a lazy state machine that does NOTHING until polled.
async fn handle(id: u64) -> Vec<u8> {
    let conn = open(id).await;      // suspension point 1
    let data = conn.read().await;   // suspension point 2 — each .await is a pause/resume state
    process(data)
}
```

The compiler turns that function into an `enum`-like state machine with one state per `.await`, each holding exactly the locals that must survive the suspension. That is why it is **zero-cost**: the future is sized at compile time to its largest state, allocated inline, with no garbage collector and no green-thread runtime baked in — a parked future costs only the bytes to remember where it paused.
The runtime that drives it is two halves: an **executor** that schedules and polls runnable futures over a small work-stealing thread pool, and a **reactor** that talks to the OS readiness API (epoll, kqueue, IOCP) and wakes a parked future when its I/O is ready.

The table makes the divergence concrete:

| Property | JavaScript | Python asyncio | Rust (Tokio) |
|----------|-----------|----------------|--------------|
| Eager or lazy | eager (work starts on call) | eager (task scheduled) | lazy (nothing until polled) |
| Runtime | built into the host (browser/Node) | built in (`asyncio`) | bring your own (Tokio) |
| Parallelism | none (one thread) | none (one thread, GIL) | yes (multi-thread executor) |
| Cost per task | tiny | ~1 KB | bytes (state machine) |
| Data-race safety | no shared memory to race | no shared memory to race | `Send`/`Sync` checked at compile time |

The shared price of `async` everywhere is **function coloring**: `async` is contagious, because an `async` function can only be `await`ed from another `async` context, so async-ness propagates up the call stack and splits the world into "red" async functions and "blue" sync ones that don't compose freely. This ergonomic tax is the strongest argument for *not* going async unless the concurrency profile — massive, idle, I/O-bound — actually warrants it.

> **Build it →** Make the machinery real: [Project 06: Async Runtime](https://github.com/jchu0/applied-cs-projects/tree/main/06-async-runtime)
> is a from-scratch Tokio-style runtime — an epoll-based reactor, a work-stealing scheduler, a
> timer wheel, and the waker mechanism — i.e. everything the async section describes, implemented.
> For async I/O under sustained load, the asyncio-based [Project 08: Streaming Platform](https://github.com/jchu0/applied-cs-projects/tree/main/08-streaming-platform)
> runs high-throughput event pipelines with bounded concurrency and backpressure.
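
To make the yield-point behaviour of `await` concrete in one of these runtimes, here is a minimal asyncio sketch. It times five waits run through `asyncio.gather` in three variants: awaiting `asyncio.sleep`, calling the blocking `time.sleep` inside a coroutine, and pushing that blocking call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The helper names (`polite_wait`, `rude_wait`, `offloaded_wait`, `timed`, `measure`) and the 0.1-second delay are assumptions chosen for illustration, and exact timings will vary by machine.

```python
import asyncio
import time


async def polite_wait(delay: float) -> float:
    await asyncio.sleep(delay)                  # yield point: the loop runs other tasks meanwhile
    return delay


async def rude_wait(delay: float) -> float:
    time.sleep(delay)                           # blocking call: no yield, the whole loop stalls
    return delay


async def offloaded_wait(delay: float) -> float:
    await asyncio.to_thread(time.sleep, delay)  # same blocking call, moved onto a worker thread
    return delay


async def timed(make_coro, count: int = 5, delay: float = 0.1) -> float:
    """Run `count` coroutines concurrently and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(make_coro(delay) for _ in range(count)))
    return time.perf_counter() - start


def measure() -> dict:
    """Time all three variants on one event loop."""
    async def run_all() -> dict:
        return {
            "polite": await timed(polite_wait),
            "rude": await timed(rude_wait),
            "offloaded": await timed(offloaded_wait),
        }
    return asyncio.run(run_all())
```

On a typical machine the awaited and offloaded variants finish in roughly one delay, because the waits overlap, while the blocking variant takes about five delays, because each `time.sleep` holds the single event-loop thread until it returns.
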

## M:N green threads: blocking code that scales

There is a fourth answer that sidesteps `async`'s function-coloring tax entirely: keep the simple, blocking, top-to-bottom programming style, but make the *unit* so cheap that you can spawn millions and let the runtime multiplex them. This is the **M:N green-thread** model, and Go has had it from day one while Java arrived at it in 2023 — a striking case of two languages reaching the same destination by opposite routes.

Go's **goroutine** is a function call with `go` in front of it. It begins life with a tiny stack (a couple of kilobytes that grows on demand), is created and switched in user space without a kernel system call, and is multiplexed by the runtime's **M:N scheduler** onto a small pool of OS threads sized by `GOMAXPROCS`. When a goroutine blocks on a channel or I/O, the scheduler lifts it off its thread and runs another there. You write thousands of little sequential programs; the runtime makes a handful of cores do the work.

Java's **virtual thread** (Project Loom, stable in Java 21) is the *same idea* retrofitted onto a 1:1 history. A virtual thread is an ordinary `Thread` to your code — same API, same blocking calls, same stack traces — but it is not an OS thread. The JVM keeps a small pool of OS *carrier* threads and mounts a virtual thread onto a carrier only while it runs; the instant it hits a blocking call, the runtime unmounts it, frees the carrier for another virtual thread, and remounts it when the call completes. The payoff is the headline: **blocking is cheap again.**

```java
// Java: one virtual thread per request; ten thousand concurrent blocking calls is fine.
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    List<Future<Response>> futures = requests.stream()
        .map(req -> executor.submit(() -> handleBlocking(req)))   // blocks freely
        .toList();
    for (var f : futures) process(f.get());
}   // close() waits for every virtual thread to finish
```

That `handleBlocking` can call the database, sleep, and make three sequential HTTP calls, and the only cost of all that blocking is some parked virtual threads, which are nearly free. This is the payments gateway from countless real migrations: it threw away the reactive-callback rewrite that once seemed mandatory and went *back* to "call the bank, wait, call fraud, wait" — and it now scales further than the callback version ever did.

The contrast with `async`/`await` is the crux. Python, JavaScript, and Rust scale I/O by *coloring functions async* — the call sites change, and the coloring propagates. Go and Java scale I/O by keeping the call sites *ordinary blocking calls* and pushing the multiplexing entirely into the runtime, so it is invisible. Neither is strictly better: async gives you explicit, inspectable suspension points and (in Rust) zero-cost state machines; green threads give you the simple straight-line code and stack traces that survive a debugger. Both have the same sharp edge — a goroutine or virtual thread that blocks the *wrong* way (an unbuffered channel with no receiver; a `synchronized` block that pins the carrier) defeats the model — but the cheapness is the point in both.

::: {.callout-warning}
## War story: the leak that only load could find (Go)

A search endpoint queried three backends in parallel and returned whichever answered first. It spawned three goroutines, each sending its result onto an *unbuffered* channel, and the handler received exactly one value — the fastest — then returned. It passed every test and ran fine for a week.
The bug was the two goroutines nobody received from: after the handler took the first result and returned, the other two were still blocked trying to send onto a channel with no receiver, and on an unbuffered channel a send blocks until someone receives. Those goroutines blocked *forever*, holding their stacks and buffers. At low traffic the leak was invisible — the scheduler parks blocked goroutines, so CPU looked healthy. Under a spike, two leaks per request became hundreds of thousands in minutes and the process was OOM-killed.

Two fixes: give the channel a buffer of three so the slow senders can deposit and exit, and derive a `context` for the request with `defer cancel()` so the losers are cancelled the moment the handler has its answer. The lesson is the green-thread model in one sentence: **cheap to spawn means easy to leak — every unit needs a guaranteed way to stop.**
:::

## Structured concurrency and cancellation

The Go leak above points at a deeper problem that every cheap-concurrency model shares: when you can spawn units freely, you can also *lose track* of them. A unit with no owner responsible for ending it is a leak waiting for load. The modern answer, converging across languages, is **structured concurrency** — the idea that a group of concurrent tasks should be bound to a lexical scope, so the scope does not exit until every task it started has finished, and if one task fails, its siblings are cancelled rather than left running.

The pattern is the same shape everywhere. Python's `asyncio.TaskGroup` is an `async with` block that owns the tasks created inside it; it won't exit until they all finish, and if one raises it cancels the rest and propagates the error. Java's `StructuredTaskScope` (still a preview API as of Java 21) forks each subtask as a virtual thread, joins them all, and on the first failure interrupts the others. Both give concurrency the same block-structured discipline that `try`-with-resources and RAII gave to resource cleanup.

```python
# Python: a TaskGroup scopes its tasks — all finish together, one failure cancels the rest.
import asyncio  # TaskGroup requires Python 3.11+; `fetch` is an async function defined elsewhere

async def main(urls: list[str]) -> None:
    async with asyncio.TaskGroup() as tg:
        for url in urls:
            tg.create_task(fetch(url))   # one raise cancels siblings; error propagates here
    # on exit, every task is done
```

```java
// Java: a structured scope forks subtasks; the first failure cancels the others.
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    var user  = scope.fork(() -> fetchUser(id));    // each fork is a virtual thread
    var order = scope.fork(() -> fetchOrder(id));
    scope.join();                                   // wait for both
    scope.throwIfFailed();                          // propagate the first failure, cancelling the rest
    return combine(user.get(), order.get());
}   // scope close guarantees no subtask is left running
```

Go expresses the same need with a different primitive: the **`context`** package. A `context.Context` carries a cancellation signal and an optional deadline, and it propagates down a call tree so that cancelling a parent cancels every child at once. The convention is rigid — `ctx` is the first parameter of any function that does I/O, every blocking operation `select`s on `ctx.Done()`, and you `defer cancel()` even on the success path to free resources. Context is Go's structural cure for the leak: had the search handler tied a context to the request and passed it to all three backends, the disconnecting client would have cancelled it and the losing goroutines would have returned.

```go
// Go: a timeout context cancels the whole subtree; defer cancel() is the leak fix made routine.
ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()

select {
case body := <-doRequest(url):
    return body, nil
case <-ctx.Done():
    return "", ctx.Err() // cancelled or timed out — stop now, don't leak
}
```

Rust's runtimes carry the same idea: `tokio::select!` races a future against a timeout or a shutdown signal and *drops* the losing branches mid-flight, and dropping a future is how Rust cancels it. The unifying insight is that cancellation must be *structural*, not ad hoc — a timeout or a client disconnect should propagate to every task spawned under it, automatically, the way an exception unwinds a call stack. Structured concurrency is that propagation made into a language construct.

## How each language stops a data race

We end on the axis where the six languages differ most sharply: what happens when two units touch the same memory, at least one writing, with no synchronization — the data race. The remarkable thing is that the *same* bug has five genuinely different fates depending on the language, and those fates form a spectrum from "your problem entirely" to "impossible to compile."

| Language | Mechanism | When a race is caught | What the race costs |
|----------|-----------|----------------------|---------------------|
| Rust | `Send`/`Sync` traits + borrow checker | compile time | a red squiggle in your editor |
| Go | `-race` detector (runtime instrumentation) | when a test exercises it | a CI failure, if covered |
| Java | Java Memory Model (`happens-before`) | not caught; you must reason | a subtle "works on my laptop" bug |
| C++ | `std::atomic` + memory orderings | not caught; undefined behavior | a crash on a compiler upgrade or new CPU |
| Python asyncio / JS | one event-loop thread | no shared-memory race exists | (sidestepped, not solved) |

At the **sidestep** end, Python's asyncio and JavaScript's event loop run your code on a single thread, so two pieces of your code never execute simultaneously and there is no shared-memory hazard to begin with. This is real safety, but narrow: it is bought by giving up parallelism, and it does nothing for the data races that *do* occur once memory is genuinely shared, as in Python's `threading` and `multiprocessing` worlds or a JavaScript `SharedArrayBuffer` handed to workers.

At the **detector** end, Go ships the `-race` flag, which instruments memory accesses at runtime and reports any pair of conflicting accesses (same location, at least one a write) from different goroutines that are not ordered by a `happens-before` relationship, with both stacks. It is close to definitive, but only on code paths your tests actually execute, so it is as good as your concurrent-path coverage. The discipline that pays off is running concurrency tests under `-race` in CI, every time.

```bash
go test -race ./...   # reports the data races your tests exercise: a CI failure now instead of a production incident later
```

The **memory-model** languages, Java and C++, make you reason about visibility yourself, and they are where the deepest bugs live. The shared insight both encode is *happens-before*: absent an explicit ordering edge, one thread's write may never become visible to another, because the compiler may keep values in registers, cores buffer their stores, and both compiler and CPU reorder freely.
Java's JMM gives you `volatile` (visibility and ordering, but no atomicity for compound operations such as `count++`) and `synchronized` (mutual exclusion plus visibility and ordering) to build those edges on a garbage-collected heap. C++ gives you the raw `std::atomic` with explicit memory orderings (`seq_cst`, `acquire`/`release`, `relaxed`) and the blunt rule that an unsynchronized conflicting access is *undefined behavior*, not merely a wrong value. The compiler may assume the race never happens and optimize on that basis, which is precisely why such a bug can "work" for two years and then break on a compiler upgrade.

```cpp
// C++: a release-store publishes prior writes; a matching acquire-load sees them.
#include <atomic>
#include <cassert>

std::atomic<bool> ready{false};
int payload = 0;                                   // ordinary, non-atomic

void producer() {
    payload = 42;                                  // (1) ordinary write
    ready.store(true, std::memory_order_release);  // (2) publishes (1)
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // (3) spins until it reads true
    assert(payload == 42);  // guaranteed: (2) synchronizes-with (3), so (1) happens-before this read
}
```

::: {.callout-warning}
## War story: the optimization that x86 hid and ARM exposed (C++)

A team shipped a cache with double-checked locking: the first thread built a singleton under a mutex and stashed a pointer; every thread after read the pointer with no lock. Textbook, review-approved, and flawless through months of load testing, all of it on x86. Then the mobile build went out, and a fraction of a percent of ARM devices crashed reading garbage fields *after* the pointer was non-null.

The writer set the object's fields and *then* the pointer, in source order, but nothing made those two writes become visible to other cores in that order. x86's strongly-ordered hardware happened to preserve it; ARM's weak memory model let the pointer store land in another core's view *before* the field stores it pointed to, so a reader saw a non-null pointer to a half-built object. The reordering was legal (both hardware and compiler are allowed it), and the code silently assumed it wouldn't happen.

The portable contract is the C++ memory model (a `release` store paired with `acquire` loads), not the architecture you happened to test on. **"It worked on x86" is a statement about hardware, not about correctness.**
:::

At the **compile-error** end stands Rust, and its move is the most elegant in the spectrum: it notices that a data race is *exactly* the borrow checker's shared-XOR-mutable rule (many readers *or* one writer, never both) violated across a thread boundary. So it doesn't need a separate mechanism; it extends the rule it already enforces with two **marker traits** the compiler tracks automatically. `Send` means a value is safe to *move* to another thread; `Sync` means a reference to it is safe to *share* across threads. `thread::spawn` demands `Send`; sharing a reference demands `Sync`. A non-thread-safe type like `Rc` (non-atomic reference count) implements neither, so the moment you try to send it across a thread boundary the compiler stops you by name:

```text
error[E0277]: `Rc<Vec<i32>>` cannot be sent between threads safely
  = help: the trait `Send` is not implemented for `Rc<Vec<i32>>`
```

The compiler is not running your program and observing a race; it is reading your types and reasoning that a race is *possible*, and refusing on that basis. The class of bug that is undefined behavior in C++, a heisenbug in Java, and a CI gamble in Go is, in Rust, a build failure.

The honest caveat, worth stating because Rust's marketing oversells it, is that this guarantee is *narrow*: it eliminates data races in safe code, **not** deadlocks, livelocks, or logical races where every individual access is synchronized but the overall sequence is still wrong. The compiler buys data-race freedom; lock ordering and program logic remain yours, in Rust exactly as in C++.
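
As a minimal sketch of the other side of that error, the rejected `Rc<Vec<i32>>` can be swapped for `Arc<Mutex<Vec<i32>>>`, which the compiler accepts: `Arc` counts references atomically, so its clones may cross threads, and `Mutex` enforces one-writer-at-a-time at runtime. The function name `collect_from_threads` and the toy workload are hypothetical, chosen only for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: each of `workers` threads pushes its own index
/// into one shared Vec, and the sorted result is returned.
fn collect_from_threads(workers: i32) -> Vec<i32> {
    // Arc: atomic reference count, so clones are allowed to cross threads.
    // Mutex: only the thread holding the lock can mutate the Vec inside.
    let shared = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let shared = Arc::clone(&shared);
            // `move` transfers this clone's ownership into the new thread.
            thread::spawn(move || {
                shared.lock().unwrap().push(i);
            })
        })
        .collect();

    // Wait for every worker before reading the result.
    for handle in handles {
        handle.join().unwrap();
    }

    // Push order depends on scheduling, so copy the Vec out and sort it.
    let mut values = shared.lock().unwrap().clone();
    values.sort();
    values
}

fn main() {
    println!("{:?}", collect_from_threads(4));
}
```

Replacing `Arc` with `Rc` here reproduces the `E0277` error above. The caveat still applies: the same types would compile just as happily around two mutexes locked in opposite orders, and that deadlock is a design problem the type system does not see.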

> **Build it →** Shared state and message passing under real concurrency: [Project 11: Distributed
> KV (Raft)](https://github.com/jchu0/applied-cs-projects/tree/main/11-distributed-kv-raft) guards
> replicated state machines behind `Arc<Mutex<…>>` and threads async networking through a consensus
> protocol, exercising both coordination styles at once.

---

## Practical exercise

**Difficulty:** Level I · Level II · Level III

1. **Level I: Diagnose the bound, then pick the model.** Take two workloads: one that hashes a large in-memory list (CPU-bound) and one that fetches many small URLs (I/O-bound). For each, write down whether it is CPU- or I/O-bound and which model you would use *in two languages of your choice*, and justify each choice in terms of the model, not the API. For a Python answer, explain in one sentence why the GIL makes threading useless for the hashing workload but fine for the fetching one. For a Go or Java answer, explain why you would *not* need that distinction. The deliverable is the reasoning, not running code.

2. **Level II: Reproduce one failure mode and fix it.** Pick *one* of the cooperative-scheduler hazards and demonstrate it end to end in the relevant language: (a) block a Python asyncio event loop with a synchronous `time.sleep` inside a coroutine and show throughput collapse, then fix it with `asyncio.to_thread`; or (b) leak a goroutine in Go by sending on an unbuffered channel with no receiver, prove the leak with `runtime.NumGoroutine()` before and after, then fix it with a buffer or a `context`. Either way, write a paragraph naming the model property that caused the failure (cooperative scheduling never preempts a blocking call; a cheap unit still needs a stop signal) and explain why the fix addresses the *property*, not just the symptom.

3. **Level III: Build the same concurrent component twice, with different coordination styles.** Implement a metrics aggregator that ingests events from several producers and maintains running per-key counts, once with **shared state** (a lock-protected map: `Arc<Mutex<…>>` in Rust, `ConcurrentHashMap` in Java, or `sync.Mutex` in Go) and once with **message passing** (producers send events down a channel to a single owner that holds the map alone, no lock). Compare them: which version has a lock and which transfers ownership; which would extend more cleanly to a multi-stage pipeline; and, for the shared-state version, name two failure modes the language does *not* protect you from (a deadlock from lock ordering, a logical race from a synchronized-but-mis-sequenced read-modify-write) and say how you would prevent each. Close with a sentence on where the language's safety guarantee ends and your judgment begins.

## Summary

Concurrency is structuring independent work to be in progress at once; parallelism is executing it simultaneously, which needs cores, and the two are orthogonal. The first question for any workload is whether it is I/O-bound (wants cheap concurrency, and the more units in flight, the more the per-unit cost matters) or CPU-bound (wants real parallelism across cores), because that diagnosis picks the model. The unit of concurrency comes in four shapes: 1:1 OS threads (C++, Rust std, classic Java: capable, preemptive, rationed at roughly a megabyte of stack each); M:N green threads (goroutines, JVM virtual threads: cheap, blocking code that scales); event-loop tasks (Python asyncio, JavaScript: one cooperative thread, no parallelism); and Rust's lazy, zero-cost, poll-based futures with a bring-your-own runtime. Schedulers split cooperative from preemptive, and every cooperative one obeys the cardinal rule: never block it. Python's GIL collapses the whole language into the I/O-versus-CPU question.
The two coordination styles, shared memory with locks (flowing data's wrong tool) and message passing over channels (sitting-still data's wrong tool), are the choice you make per piece of state. And the five answers to the data race run from Rust's compile error through Go's detector and the Java/C++ memory models to Python and JavaScript's single-thread sidestep. That is the sharpest axis on which these languages differ, and the one that most shapes the systems people build in them.

### Key takeaways

- **Concurrency is not parallelism.** Diagnose the workload (I/O-bound wants cheap concurrency, CPU-bound wants real parallelism) and the model follows from the diagnosis, not from taste.
- **Four units, two mappings.** 1:1 OS threads are capable but cost roughly a megabyte of stack each and are rationed; M:N green threads (goroutines, virtual threads), event-loop tasks, and poll-based futures all cost kilobytes or less, which is why they scale I/O.
- **Never block a cooperative scheduler.** One synchronous call freezes an event loop or removes an async worker from rotation: the same bug in Python, JavaScript, Rust async, and pinned virtual threads.
- **Python's GIL makes one question decide everything.** I/O-bound wants threads or asyncio; CPU-bound needs multiprocessing. With the GIL in place, Python is the only language here that must leave the process to parallelize computation.
- **Flowing data wants a channel; sitting-still state wants a lock.** Message passing transfers ownership so there is nothing to race over; shared memory needs a lock and invites deadlocks no type system catches.
- **Five fates for one data race:** Rust's compile error (`Send`/`Sync`), Go's `-race` detector, the Java Memory Model's `happens-before`, C++'s atomics-or-undefined-behavior, and the single-thread sidestep of Python asyncio and JavaScript.

### Connections to other chapters

- **Software Engineering Overview** (prerequisite): the process/thread/block vocabulary and the reproducibility mindset framed there are the substrate for everything here. A data race is the reproducibility problem at its most vicious, non-deterministic and load-dependent.
- **Rust Ownership and Borrowing** (prerequisite): fearless concurrency is literally the shared-XOR-mutable rule from that chapter applied across a thread boundary. `Send`/`Sync`, the `move` into a spawned thread, and the move through a channel are all the borrow checker reaching across threads. Read it first to see *why* a Rust data race is a compile error.
- **Memory and Resource Management** (sibling): the C++ memory model, false sharing, and the reclamation hazard in lock-free code are lifetime-and-memory problems before they are concurrency problems; a `Mutex<T>` welding a lock to its data is RAII, and a leaked goroutine is a resource leak. The cost of an OS thread's stack is a memory fact.
- **Error Handling** (sibling): cancellation and structured concurrency are error handling for concurrent code. A `TaskGroup` propagating one task's failure to its siblings is exactly the unwind-on-error discipline applied to a tree of tasks, and an unhandled Promise rejection is an error that escaped its scope.
- **Async Runtime (Project 06)** and **Data Streaming** (extensions): the executor/reactor split, work-stealing scheduling, and bounded-concurrency backpressure described here are *built* in the async-runtime project and *applied* at scale in streaming pipelines, where the same patterns must also survive worker failure and retries across machines.
- **Distributed Training** and **GPU and CUDA** (extensions): parallelism at the largest scale, splitting computation across many machines or thousands of GPU cores, is the CPU-bound branch of this chapter's first question taken to its limit, where the coordination problem becomes the whole problem.

## Further reading

### Essential

- Rob Pike, *"Concurrency Is Not Parallelism"* (2012 talk): the talk that crystallized the distinction at the heart of this chapter, and the clearest 30 minutes you can spend on the topic.
- Katherine Cox-Buday, *Concurrency in Go* (O'Reilly): goroutines, channels, `select`, and `context` built up rigorously; the best book-length treatment of the CSP model.
- *The Rust Programming Language* (Klabnik & Nichols), the "Fearless Concurrency" chapter: the canonical introduction to `Send`/`Sync`, `Arc<Mutex<T>>`, and channels, and the source of the term.

### Deep dives

- Anthony Williams, *C++ Concurrency in Action* (2nd ed.) and Herb Sutter, *"atomic<> Weapons: The C++ Memory Model and Modern Hardware"* (talk): the definitive accounts of memory ordering, acquire/release, and why weakly-ordered hardware exposes bugs strong hardware hides.
- *JEP 444: Virtual Threads* and Ron Pressler's Project Loom talks: why the JVM chose user-mode threads over an `async`/`await` language feature, and what M:N multiplexing onto carriers buys you.
- *The Tokio Tutorial* and *Asynchronous Programming in Rust* (the "async book"): how a lazy, poll-based, zero-cost future works and why Rust keeps the runtime out of the language.
- Jake Archibald, *"In The Loop"* (JSConf talk): the definitive walk-through of the JavaScript microtask/macrotask split and why ordering is what it is.

### Historical context

- C. A. R. Hoare, *"Communicating Sequential Processes"* (CACM, 1978): the foundational paper behind the model that Go's channels and goroutines are built on.
- Leslie Lamport, *"How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs"* (1979): the origin of sequential consistency, the baseline the memory-ordering spectrum is defined against.
- **PEP 703**, *Making the Global Interpreter Lock Optional in CPython*: the authoritative account of what the GIL costs, what it protects, and what removing it requires.
- Jung et al., *"RustBelt: Securing the Foundations of the Rust Programming Language"* (POPL 2018): a formal soundness proof for a core subset of Rust and key standard-library types, including their `Send` and `Sync` behavior; the mathematics under "if it compiles, it has no data races."