GPU Programming & CUDA
gpu, cuda, simt, kernels, triton, memory hierarchy, coalescing, occupancy, parallelism, throughput, accelerators
Introduction
The job was a single, ordinary line of linear algebra: multiply two 4096×4096 matrices, the kind of thing that sits in the inner loop of every neural network. On the research team’s workstation the CPU took the better part of a minute per call, and the experiment needed thousands of calls. Someone moved the two arrays onto the GPU, called the same multiply, and watched one call fall from tens of seconds to a few milliseconds — a four-figure speedup from a one-line change. The arithmetic had not gotten easier; the hardware had simply stopped doing the work one number at a time and started doing thousands of multiplications at once.
That is the promise, and it is real. But the same team learned the other half of the lesson a week later, when a colleague wrote a custom kernel for an unusual elementwise transform the framework didn’t offer. It compiled, it ran, it produced correct numbers — and it was ten times slower than it had any right to be. Nothing was wrong with the math. The kernel read memory in a pattern the GPU hates: each thread reached for an address far from its neighbor’s, so every access that should have been one wide transaction became thirty-two separate ones, and the chip spent almost all its time waiting on DRAM instead of computing. The GPU had not failed; it had been used against its grain.
Both stories are the same story. A GPU delivers order-of-magnitude speedups for the right workload — and only if you respect how it actually executes. This chapter is about that “how”: what the machine is, why it wins and loses where it does, and how to write and optimize code that runs with the grain. It is the only place this book covers GPU kernel programming, so it aims to be complete enough that you can read framework performance, reason about a hotspot, and write a kernel when you need one.
The Core Insight
A GPU is not a fast CPU. A CPU is a latency machine: a few fat cores with deep caches, branch prediction, and out-of-order execution, each racing a single thread of control to completion as quickly as possible. A GPU is a throughput machine: thousands of simple cores that execute the same instruction across many data elements at once, trading per-thread speed for sheer parallel width. NVIDIA calls this model SIMT — Single Instruction, Multiple Threads — and it is the one fact from which every other fact in this chapter follows.
The consequence is that a GPU wins or loses on the shape of the work, not just its size. It wins decisively when the work is embarrassingly parallel — a million independent elements, each transformed the same way — and when memory access is regular, so neighboring threads touch neighboring addresses. It loses badly, often worse than a CPU, on work that is branchy (every element does something different), sequential (each step depends on the last), or scattered in memory (threads chase pointers all over DRAM). Hand a GPU a million identical additions and it is glorious; hand it a linked-list traversal and it is a thousand cores idle.
And here is the part that surprises people: GPU performance is a memory-bandwidth game, not a FLOPs game. Modern accelerators can issue arithmetic far faster than they can feed themselves data from global memory — the gap between fast and slow memory on the chip is a factor of hundreds. So the two levers that actually move the needle are both about memory, not math:
- Memory access patterns. Reads must be coalesced — adjacent threads reading adjacent addresses, so the hardware fuses thirty-two loads into one wide transaction — and hot data should live in the small, fast, on-chip shared memory rather than being fetched repeatedly from global DRAM.
- Occupancy. The GPU hides the long latency of a memory read by switching to other threads that are ready to compute. Keep enough threads in flight and the wait is invisible; let occupancy drop and the cores stall, waiting on DRAM with nothing else to do.
Get those two right and a kernel approaches the hardware’s limits. Get them wrong and no amount of clever arithmetic will save you — which is exactly what happened to the ten-times-slow kernel.
A mental model
Picture the GPU as an enormous crew of workers who can only move in lockstep. The foreman calls out one step — “multiply your number by two” — and all thirty-two workers in a squad do that step, each on their own piece of data, at the same instant. That squad is a warp, and lockstep is the whole trick: one instruction drives thirty-two results. It is wonderful for “apply this to a million elements” and terrible for “everyone do something different,” because the moment half the squad needs to take a different branch than the other half, the foreman has to run the two paths one after the other while the idle half waits. That stall is warp divergence, and it is why branchy code wilts on a GPU.
The second picture is the memory hierarchy as a ladder you climb for speed. At the bottom is global memory — gigabytes of it, but every access costs hundreds of cycles. One rung up is shared memory: a small on-chip scratchpad the workers in a block share, roughly a hundred times faster. At the top are registers, private to each thread and effectively free. The real performance work is moving data up the ladder once and reusing it, instead of trudging down to global memory again and again. Figure 41.1 shows both pictures together: the grid of blocks and warps on one side, the memory ladder on the other.
The third picture is the simplest. A kernel is just the function each worker runs on its own slice of the data. You write it once, as if for a single thread, and the hardware launches it across a whole grid of thousands of threads, each one computing its own index and handling its own element. “Write the body for one; the machine runs it for all” is the entire programming model in a sentence.
When (and what) to write yourself
The most important GPU skill is knowing when not to write a kernel. The framework you already use — PyTorch, JAX, TensorFlow — is, underneath, a vast library of CUDA kernels written by people who do nothing else, and standard operations (matrix multiply, convolution, attention) route to vendor libraries like cuBLAS and cuDNN tuned to within a few percent of the silicon’s limit. You will almost never beat them. So the decision framework is a ladder of escalating effort, climbed only as far as a measurement forces you:
- Start at the top. Use the framework’s built-in ops, turn on mixed precision and
torch.compile, and reach for pre-tuned libraries (cuBLAS, cuDNN, FlashAttention). For the overwhelming majority of work this is the end of the story. - Profile before you do anything else. If training or inference is slow, measure where the time goes with
nsys(Nsight Systems, for the timeline) andncu(Nsight Compute, for a single kernel). The bottleneck is very often not a kernel at all — it is a starved data pipeline, a host-to-device copy in a loop, or a GPU sitting at thirty percent utilization waiting on the CPU. - Drop one level only for a measured hotspot a library doesn’t cover. When profiling points at a specific op the framework handles inefficiently — typically several small operations that could be fused into one — write it in Triton, a Python dialect that gets you most of CUDA’s performance for a fraction of the effort.
- Drop to raw CUDA C++ only when Triton’s abstractions block you, or when you are squeezing the last five to fifteen percent out of a critical-path kernel at frontier scale. This is weeks of work and a maintenance burden; spend it deliberately.
The shape of that ladder — and the execution model that explains why each rung exists — is what the rest of the chapter builds (Figure 41.1).
What you’ll learn
- How the GPU executes code under the SIMT model: threads, warps, blocks, and the grid, how the scheduler maps them onto hardware, and why branch divergence destroys throughput
- Why the memory hierarchy — global, shared, registers — is the real performance ladder, and what coalesced access means in mechanical terms
- How to decompose a problem into a grid of blocks and threads, write a simple kernel, and account for the cost of moving data between host and device
- What occupancy is, why hiding memory latency depends on it, and why most real kernels are bound by memory bandwidth rather than compute
- How Triton lets you write fused, high-performance kernels in Python, and how custom ops slot in behind a framework’s autograd
- Why you rarely hand-write kernels at all — and how understanding the model still explains, and lets you debug, your framework’s performance
Prerequisites
- Performance and Profiling — the cross-language chapter’s cache and memory-hierarchy thinking, and the roofline idea of memory-bound versus compute-bound work. The GPU is that story turned up to eleven: the same ladder, a steeper cliff, and bandwidth as the dominant constraint.
- Deep Learning Frameworks — what a tensor is, what a forward and backward pass do, and why matrix multiply dominates the workload. Frameworks are the thing whose performance this chapter explains.
- Comfort reading C-family and Python code, and a working mental model of threads and synchronization from any prior concurrency exposure.
The GPU execution model
Everything starts with how a kernel becomes work on the chip. When you launch a kernel you specify a grid of thread blocks, each block a number of threads — for a million-element array, perhaps 4096 blocks of 256 threads, one thread per element with a little slack. The grid is your logical decomposition of the problem; the hardware’s job is to run it.
The hardware is a collection of Streaming Multiprocessors (SMs) — an A100 has 108, each with its own cores, registers, and shared-memory scratchpad. The scheduler hands whole blocks to SMs, and an SM interleaves as many blocks as its registers and shared memory allow. But the unit of execution is finer than the block: within a block, threads are grouped into warps of 32, and a warp is what actually runs — all 32 threads issue the same instruction at the same time, each on its own data. This is SIMT made concrete, and the grid-block-thread hierarchy you write is not API ceremony; it is a direct map onto device, SM, and warp in silicon.
This is also where the model’s one great vulnerability lives. Because a warp shares a single instruction pointer, a conditional that sends some lanes one way and the rest another cannot run both paths at once; the hardware serializes them, running the if branch with the else lanes masked off and idle, then the reverse. A warp that splits evenly runs at half throughput; a warp where each lane takes a different path collapses toward 1/32 of peak. This is warp divergence, the mechanical reason the GPU punishes branchy code. The fix is to structure data so threads within a warp take the same path — sort by branch condition, mask uniformly, replace data-dependent branches with arithmetic — and to remember that divergence across warps is free; only divergence within a warp costs you. The flip side of warps — how the machine stays fast despite its slow memory — is the second half of the model.
The memory hierarchy
A GPU spends most of its time waiting on memory, and the numbers are stark. A read from global memory — the gigabytes of HBM that hold your tensors — costs on the order of 500 cycles. Shared memory, the on-chip scratchpad each block owns, is roughly 100 times faster. Registers, private to each thread, are faster still and effectively free. Deciding where your data lives is therefore the single most consequential optimization you can make; it dwarfs almost any arithmetic change. This is the same hierarchy you met in Performance and Profiling, with a steeper cliff and one new, programmer-controlled rung.
The first lever is coalescing. When a warp’s 32 threads read global memory, the hardware inspects the 32 addresses. If they are contiguous — thread 0 reads address n, thread 1 reads n+1, and so on — it fuses them into one wide transaction and the warp pays once for all 32 values. If they are scattered or strided, it cannot fuse them, and the warp pays for as many transactions as there are distinct cache lines — up to 32 times the traffic for the same data. This is the introduction’s bug: row-major access (A[row * width + col], the thread index varying the fast dimension) coalesces; column-major access (A[col * height + row]) strides across rows and does not, and effective bandwidth collapses by an order of magnitude. The data is identical; only the order in which threads touch it differs, and that order is the whole game.
The second lever is shared-memory tiling, the move behind every fast matrix kernel. A naive matrix multiply has each thread loop over a full row of A and column of B straight from global memory, so the same values are fetched from DRAM hundreds of times. The tiled version flips this: the threads of a block cooperate to load a small tile of A and B into shared memory once, synchronize with __syncthreads() so everyone sees it, then do many multiply-adds reading from the fast scratchpad. Loading once and reusing many times turns a memory-bound kernel into a compute-bound one. (Tiling has a subtlety: shared memory is banked, and a tile whose stride collides with the bank count serializes accesses; the classic fix is padding a tile to [32][33] so successive rows fall in different banks.) The diagnostic question for any kernel is the pair from the roofline model — am I bandwidth-bound or compute-bound, and is my access coalesced? — and the answers come from ncu, not from staring at the source.
Writing a kernel
With the model in hand, a kernel is almost anticlimactic. You write the body for one thread; the launch runs it across the grid. The thread’s first job is to compute its own global index from the hierarchy variables, then guard against running off the end of the array, then do its work. A complete elementwise kernel — vector addition — is just that:
// One thread per element. blockIdx, blockDim, threadIdx are the
// hardware-provided coordinates that let each thread find its own slice.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x; // this thread's global index
if (i < n) { // last block is usually ragged
c[i] = a[i] + b[i]; // coalesced: thread i touches a[i]
}
}
The bounds check matters because you rarely have an exact multiple of the block size, so the final block launches a few threads past the end of the data; without the guard they read and write out of bounds, which on a GPU shows up not as a clean crash but as silent corruption or a deferred illegal memory access. The index arithmetic means thread i touches element a[i] — adjacent threads, adjacent addresses, perfectly coalesced.
What the tidy kernel hides is the cost that dominates real programs: moving the data. Before the kernel can run, a and b must be copied from host (CPU) memory across the PCIe bus into device memory, and afterward c copied back — a slow transfer of thousands of cycles, and the most common way a GPU program ends up slower than the CPU it replaced. The trap is a loop that ships one item to the GPU, computes, and ships the result back every iteration, so PCIe traffic dwarfs the compute. The cure is to keep tensors resident: copy the whole batch over once, do all the work on the GPU, copy back once — in a framework, the difference between calling .cuda() inside the inner loop and once before it. Catching exactly this is half of why profiling exists.
Occupancy and optimization
Why does a machine with 500-cycle memory latency run fast at all? Because it hides the latency with parallelism. When a warp issues a memory load and must wait, the SM does not stall — it switches to another warp that is ready to compute, and another, keeping the cores busy while the first warp’s data is in flight. The ratio of active warps to the maximum the hardware can hold is occupancy, and it is the mechanism by which memory latency becomes invisible. Too few warps in flight and the SM runs out of ready work during a memory wait, so the cores stall and throughput craters; enough warps and the waiting overlaps the computing and the latency disappears.
Occupancy is bounded by resources. Each thread needs registers and each block needs shared memory, and an SM has fixed budgets of both, so a kernel that uses too many registers per thread — or too much shared memory per block — fits fewer warps and caps its own occupancy (and register spilling, when the compiler pushes values out to slow memory, makes it worse). But — the recurring theme — higher occupancy only helps if you were latency-bound to begin with. Most ML kernels are not compute-bound; they are memory-bandwidth-bound, limited by how fast they stream data from HBM. For those, the win is not more warps or more FLOPs but less memory traffic: fuse operations so intermediates stay in registers instead of round-tripping through global memory, use mixed precision so every value is half the bytes (and so half the bandwidth), and coalesce every access. This is why “just add more compute” so often does nothing, and why the roofline framing from Performance and Profiling — plot arithmetic intensity against the hardware’s bandwidth and peak FLOPs, and see which ceiling you hit — is how you decide what to optimize. Decide with the profiler: ncu reports achieved occupancy, achieved bandwidth as a fraction of peak, and warp-stall reasons, and those three numbers tell you whether you are leaving performance on the table and, more importantly, where.
Triton and PyTorch integration
When profiling does point at a kernel worth writing — almost always a fusion of several small ops that should be one — the modern default is Triton, not raw CUDA. Triton is an OpenAI-built Python dialect that raises the abstraction from the thread to the block. In CUDA you write “what does thread i do?”; in Triton you write “what does this program do with a tile of data?” — a small contiguous chunk of a tensor, manipulated with NumPy-like vector ops. The compiler then takes over the low-level grind that consumes most CUDA effort: coalesced loads, shared memory placement, register allocation, bank-conflict avoidance, and Tensor Core instructions where shapes allow. A Triton elementwise kernel is the same idea as the CUDA one, one altitude up:
import triton
import triton.language as tl
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
pid = tl.program_id(0) # which tile this program owns
offs = pid * BLOCK + tl.arange(0, BLOCK) # a whole vector of indices at once
mask = offs < n # predicated, like the bounds guard
x = tl.load(x_ptr + offs, mask=mask) # one coalesced vector load
y = tl.load(y_ptr + offs, mask=mask)
tl.store(out_ptr + offs, x + y, mask=mask)Two things make Triton pay. First, autotuning: decorate a kernel with candidate configurations — tile sizes, warp counts, pipeline depths — and Triton benchmarks them on your actual input shapes and caches the winner, which is how a Triton matmul reaches roughly 95% of cuBLAS with no manual tuning. Second, fusion is natural: because you operate on whole tiles, combining a bias add, a GELU, and a scale into one kernel that touches global memory once instead of three times is a few lines, and on memory-bound ops that fusion is the entire speedup. The payoff is leverage — a kernel written in two hours at 95% efficiency routinely beats two weeks of CUDA chasing the last 5%. (The frontier kernels you have heard of, FlashAttention chief among them, are structurally exactly this: a tile-loop matmul nested inside a softmax, fused so the giant attention matrix never touches global memory. Those LLM-specific kernels are the subject of the companion AI Engineering book; the point here is that they are built from these primitives.)
Whichever language you choose, the kernel must enter the framework, and the mechanism is the same: wrap it as an autograd function with a forward and a backward pass, so the framework can fold it into the computation graph and backpropagate through it like any built-in. In raw CUDA you compile a C++/CUDA extension and bind it with PYBIND11; in Triton you allocate the output with torch.empty_like and launch the kernel on it. Either way you get a callable that behaves like a native op, gradients and all. And the highest rung — torch.compile — closes the loop by doing the fusion automatically: it traces your model, fuses chains of elementwise ops, and emits Triton kernels behind the scenes, often a 1.5–3× speedup on existing code with no kernel written at all. It is, literally, the framework reaching for the same tool you would.
The framework does most of this
The practical takeaway is almost paradoxical: you have just learned to write GPU kernels in order to understand why you usually shouldn’t. The framework is a mountain of expertly tuned CUDA, and its built-in ops, vendor libraries, and compiler out-perform almost anything you write by hand, almost all the time. The value of the execution model is not that it lets you replace the framework — it is that it lets you read it. When training is slow, the model says look at occupancy and the data pipeline before the kernels. When a model won’t fit, it says why mixed precision and fusion help. When a custom op is genuinely missing, it says reach for Triton and aim for a fusion, not raw CUDA and a rewrite of matmul. The deepest GPU skill is judgment about where the time actually goes — and that rests entirely on knowing how the machine executes.
A team needed a custom normalization that the framework didn’t offer, and an engineer wrote it as a faithful CUDA kernel. It passed every numerical test and shipped — and inference got slower. Profiling told a two-part story. First, the kernel read its input column-major: adjacent threads in a warp reached for addresses a full row apart, so every coalesced 128-byte transaction degenerated into 32 separate ones and effective bandwidth fell by roughly an order of magnitude. Transposing the access so threads walked memory contiguously recovered most of it. Second, the kernel was being called inside a loop that moved a tensor to the GPU and pulled the result back to the CPU every iteration, so the PCIe copy — thousands of cycles — dwarfed the compute it bracketed. Hoisting the transfer out of the loop and keeping the tensor resident on the device fixed the rest. The lesson is the chapter in miniature: a kernel can be perfectly correct and still leave 10× on the floor, because GPU performance is decided by how you touch memory — coalescing and host-device transfer — not by whether the arithmetic is right. Profile first; the bottleneck is rarely the FLOPs.
Build it → Go deeper where the kernels are real: Project 19: GPU Kernel Optimization is the direct analog of this chapter — coalescing, shared-memory tiling, and occupancy tuning measured against the roofline; Project 39: GPU Memory Manager implements the caching allocator and fragmentation handling that keep tensors resident on the device; and Project 48: Multi-GPU Kernel Scheduler shows how single-GPU kernels compose into a scheduled, multi-device workload.
Practical exercise
Difficulty: Level I · Level II · Level III
- Level I — Judge the fit. Take three workloads — a million-element elementwise activation, a depth-first traversal of an irregular tree, and a dense matrix multiply — and for each, decide whether it is a good GPU fit. Argue from the two criteria that matter: is the work embarrassingly parallel, and is its memory access regular? Then, for the activation, estimate the host-device transfer cost: if the array is 4 MB and PCIe moves ~16 GB/s, how long is the round-trip copy, and how does that compare to the microseconds of actual compute? State the implication for whether the GPU is worth it for a single call versus a resident batch.
- Level II — Make it coalesce, then tile. Given a kernel that reads a matrix column-major (
A[col * height + row]), explain in mechanical terms why a warp’s 32 reads become 32 transactions instead of one, and rewrite the index so the access is coalesced. Then take a naive matrix multiply where every thread reads its row and column straight from global memory, and describe the shared-memory tiling that loads each tile once and reuses it across the block. Quantify the bandwidth win: if the tile is reused byTILE_SIZEthreads, by what factor does global-memory traffic drop, and why does that move the kernel from bandwidth-bound toward compute-bound? - Level III — Decide and defend. You have profiled a model and found one hotspot: a sequence of three small elementwise ops (bias, activation, scale) that together account for 20% of inference time. Decide whether to leave it to
torch.compile, write a Triton kernel, or drop to raw CUDA — and justify the choice. Then reason about the kernel you’d write: would it be memory-bound or compute-bound (use arithmetic intensity and the roofline framing from Performance and Profiling), how does fusing the three ops change the global-memory traffic, and what would you check for occupancy and warp divergence to know you’d reached the bandwidth ceiling rather than left performance on the table?
Summary
A GPU is a throughput machine, not a fast CPU: thousands of simple cores running one instruction across many data elements in lockstep (SIMT). It delivers order-of-magnitude speedups on work that is embarrassingly parallel and regular in its memory access, and performs badly on branchy, sequential, or memory-scattered code. The execution model — a grid of blocks, each scheduled as 32-wide warps — explains both: divergence within a warp serializes and kills throughput. Performance is governed by the memory hierarchy, not by FLOPs. Coalesced access fuses a warp’s reads into one transaction; shared memory lets a block load data once and reuse it; occupancy keeps enough warps in flight to hide the hundreds of cycles a global-memory read costs. Most real kernels are memory-bandwidth-bound, so the wins come from less memory traffic — fusion, mixed precision, coalescing — not more compute. And you rarely write kernels yourself: the framework’s tuned CUDA wins almost always, and when a measured hotspot needs one, Triton fuses ops in Python at near-CUDA speed while raw CUDA is reserved for the last few percent. Understanding the model is what lets you read framework performance and know where the time goes.
Key takeaways
- A GPU trades per-thread latency for massive parallel throughput; it wins only on work that is parallel and regular in its memory access.
- SIMT means warps of 32 threads share one instruction — branch divergence within a warp serializes the paths and is the mechanical cause of slow branchy code.
- Performance is a bandwidth game: coalesced reads, shared-memory tiling, and high occupancy (to hide memory latency) matter far more than raw FLOPs.
- Most kernels are memory-bound, so fewer bytes and less traffic — fusion, mixed precision — beat adding compute; decide with a profiler and the roofline, never by reading source.
- Use the framework’s kernels first; drop to Triton for a measured, uncovered hotspot (usually a fusion), and to raw CUDA only for the last few percent.
Connections to other chapters
- Performance and Profiling (prerequisite): the cache and memory-hierarchy reasoning, and the roofline distinction between memory-bound and compute-bound work, taught comparatively across languages there, are the intellectual core of this chapter taken to the GPU — same ladder, a hundred-fold steeper cliff, and bandwidth as the dominant ceiling instead of one constraint among many.
- Deep Learning Frameworks (prerequisite): every framework is, underneath, the CUDA kernels and vendor libraries described here. This chapter is what lets you explain why a framework is fast or slow, why mixed precision and
torch.compilehelp, and where its abstractions bottom out in the execution model. - Distributed Training (extension): scaling a model across many GPUs builds directly on the single-GPU kernel. Once a kernel saturates one device, throughput comes from running it on many and overlapping the communication between them — the kernel is the unit that distributed training schedules and replicates.
- Rust: Unsafe Rust (sibling): a GPU kernel and an FFI boundary share a discipline — a small, carefully audited core where you trade the compiler’s guarantees for direct control of memory, fenced off so the unsafety doesn’t leak. Raw pointers, manual bounds checks, and “get this exactly right because nothing will catch you” are the same craft in both places.
A note on scope: the LLM-specific kernels this chapter gestures at — FlashAttention and its relatives — are built from exactly these primitives but live in the companion AI Engineering book, which treats attention and inference kernels in depth.
Further reading
Essential
- NVIDIA CUDA C++ Programming Guide — the canonical reference for the execution model, the memory hierarchy, coalescing rules, and occupancy; the source of record for everything in this chapter.
- Kirk & Hwu, Programming Massively Parallel Processors — the standard textbook, building the SIMT mental model and the tiling/reduction patterns from first principles.
Deep dives
- Tillet, Kung & Cox, “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations” (MAPL 2019) — the paper behind the block-level programming model, plus the official Triton documentation and tutorials.
- Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” — the definitive case study in turning a memory-bound op into a fused, bandwidth-aware kernel; the concrete payoff of this chapter’s ideas.
Historical context
- Drepper, “What Every Programmer Should Know About Memory” — the bandwidth and latency framing that makes the GPU memory hierarchy intelligible; written for CPUs, but the cliff is the same and steeper on the GPU.
- NVIDIA GPU architecture whitepapers (Volta, Ampere, Hopper) — how SMs, warps, Tensor Cores, and the memory subsystem changed generation to generation, and why optimizations are sometimes hardware-specific.