Data Processing Engines

Keywords

data processing, spark, dask, polars, duckdb, distributed computing, lazy evaluation, columnar, partitioning, shuffle, single-node

Introduction

The dataset was “big data,” everyone agreed, so the team did the responsible thing: they stood up a Spark cluster. Ten executors, a driver node, autoscaling, an on-call rotation for the overnight jobs. The pipeline computed a few dozen aggregations over a daily feed of event logs and wrote the results to a warehouse. It worked. It also cost several thousand dollars a month, took ninety seconds of cluster spin-up before the first byte was read, and broke in ways that required someone who understood shuffles, partition skew, and YARN to diagnose.

Then a new engineer, profiling the inputs to chase a bug, noticed something: the daily feed was eleven gigabytes. Uncompressed. The “big data” that justified a distributed cluster fit, with room to spare, in the RAM of a laptop. She rewrote the pipeline as a single columnar query — sixty lines, no cluster — and it finished in four seconds on her machine. The Spark job had not been solving a data-size problem. It had been solving an imagined one, and charging the company for distributed-systems complexity it never needed.

The opposite mistake is just as common and more painful. A pandas script that sailed through every test on sample data meets the full month of production data, tries to materialize all of it at once, and dies with a MemoryError at 3 a.m. — or worse, swaps to disk and crawls for six hours before failing anyway. One team paid for distribution they didn’t need; the other refused distribution — or even out-of-core processing — they did. Both got the first processing decision wrong. That decision is scale-up versus scale-out, and our industry’s reflex has been to reach for scale-out far too early.

The Core Insight

Here is the thing that the four engines in this chapter — Spark, Dask, Polars, DuckDB — would all agree on, despite being built by different people in different languages for different scales: fast data processing comes from a small set of shared ideas, and only one of them is “use more machines.” The other ideas matter more, and they’re what make a single machine astonishingly capable.

The first idea is columnar memory with vectorized execution. Storing a table as columns rather than rows, and processing each column in a tight loop over a contiguous array, lets the CPU do what it is good at: stream predictable data through its caches and apply the same operation to many values at once with SIMD instructions. The alternative — looping over rows as Python objects, one dict-like thing at a time — spends almost all its time in interpreter overhead and cache misses. Columnar-plus-vectorized is routinely ten to a hundred times faster on the same hardware, and it is the single biggest reason a modern engine on one laptop outruns a poorly-built cluster.

The second idea is lazy evaluation. Instead of executing each operation the instant you write it, the engine records what you want — a query plan — and runs nothing until you ask for results. That delay is the point: with the whole computation visible at once, an optimizer can rewrite it. It can push a filter down to the file reader so rows are discarded before they’re decoded, prune columns you never reference, and fuse adjacent operations into a single pass. You write the readable version; the engine runs the efficient one.

The third idea is partitioning and the shuffle. To use many cores or many machines, you split the data into partitions and process them in parallel. The catch is that some operations (grouping, joining, sorting) need rows that live in different partitions to meet. Moving them is a shuffle, and the shuffle is where distributed processing spends most of its time and almost all of its pain. The expensive part of going distributed is rarely the computation; it is usually the data movement between partitions.

Put those three together and the big shift of the last decade becomes obvious. Hardware grew — laptops with 64 GB of RAM, servers with terabytes — and single-node columnar engines learned to use it ruthlessly. The result is that the band of “big data” that genuinely requires a cluster has shrunk dramatically. A great deal of what teams still distribute out of habit now fits, comfortably, on one fast machine.

A mental model

Picture the choice as a fork in the road. Down the scale-up path you make one machine as powerful as possible and run everything in a single process: a columnar engine reading columns straight into vectorized loops, using every core you have. Down the scale-out path you spread the work across many machines coordinated by a driver, accepting network communication as the cost of admission. Scale-up is simpler, cheaper, and faster until the data genuinely exceeds what one machine can hold; scale-out is the only option past that line, but you pay for it in complexity at every step.

Lazy evaluation is best understood as the difference between doing and describing. An eager API does each step as you type it, like following a recipe one line at a time with no idea what comes next. A lazy API lets you describe the whole dish first, then hands it to a chef who reorders the prep, skips the ingredients you won’t use, and combines steps to save passes over the cutting board. This is exactly how SQL has always worked — you declare the result you want, and a query planner decides how to get it. The lazy DataFrame APIs in Polars and Spark are SQL’s planner-and-optimizer model wearing a Python coat.

And the shuffle is a tax on distribution. Within a partition, work is local and cheap. The moment a computation needs to regroup data so matching keys sit together — every GROUP BY, every non-broadcast join, every global sort — the engine must ship rows across the network from where they are to where they need to be. You can’t always avoid the tax, but you can minimize it, and recognizing that the shuffle is the cost center is most of what separates someone who can tune a distributed job from someone who just runs one.

When single-node vs distributed

The decision framework is refreshingly blunt, and Figure 30.1 draws the two paths side by side.

Start by asking the only question that really matters: does the data fit on one machine? Not “is it big” in some vibes-based sense, but concretely — can the working set fit in the RAM of a fat single node, or at least on its local disk for an out-of-core engine to stream through?

If it fits on one machine, stay on one machine. Reach for Polars when you want a fast DataFrame API with lazy evaluation, or DuckDB when you’d rather express the work as SQL. Both are in-process, columnar, vectorized, and zero-config — pip install and go. They are simpler to operate, faster on equal data, and dramatically cheaper than a cluster, because there is no cluster. This case is far larger than most teams assume: tens to low hundreds of gigabytes is squarely single-node territory today.

Reach for distributed only when the data genuinely exceeds one machine, or when you’re already standing in a cluster. Choose Spark at true scale (multiple terabytes and up), when you need a mature ecosystem with streaming and ML libraries, or in an organization already standardized on it. Choose Dask when your team lives in pandas and NumPy and wants to scale that familiar code across nodes with the least new API to learn. The honest signal that you’ve crossed the line: a single big machine can no longer hold the working set, and you’ve confirmed it by measuring, not guessing.

The anti-pattern to retire is “we might need to scale later, so let’s start distributed.” Premature distribution buys complexity now against a need that may never arrive, and the migration from single-node to distributed, when it does arrive, is rarely the hard part. Start on one machine. Move to a cluster when one machine actually runs out — and not a dataset sooner.
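
The framework above can be written down as a small function. This is a minimal sketch, not a sizing tool: choose_engine and its parameters are hypothetical names, the capacities are whatever you measured on your own hardware, and it ignores runtime targets entirely.

```python
def choose_engine(working_set_gb, ram_gb, local_disk_gb,
                  prefers_sql=False, already_on_cluster=False):
    """Return a coarse recommendation following the fit-on-one-machine test."""
    # If the organization already runs everything on a cluster, stay there.
    if already_on_cluster:
        return "distributed (Spark or Dask)"

    single_node = "DuckDB" if prefers_sql else "Polars"

    # Case 1: the working set fits in memory on one machine.
    if working_set_gb <= ram_gb:
        return f"single node, in memory ({single_node})"

    # Case 2: larger than RAM, but one machine can still stream it from disk.
    if working_set_gb <= local_disk_gb:
        return f"single node, out-of-core ({single_node})"

    # Case 3: one machine genuinely cannot hold the working set.
    return "distributed (Spark or Dask)"


laptop_feed = choose_engine(11, ram_gb=64, local_disk_gb=1000)
monthly_sql = choose_engine(200, ram_gb=64, local_disk_gb=1000, prefers_sql=True)
true_scale = choose_engine(5000, ram_gb=64, local_disk_gb=1000)
```

The point of writing it this way is that every branch takes a measured number as input; nothing in it asks whether the data feels big.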

What you’ll learn

Why columnar memory layout plus vectorized, SIMD-friendly execution beats row-by-row object processing by one to two orders of magnitude
How lazy evaluation turns your code into a query plan an optimizer can rewrite — and what predicate pushdown, projection pushdown, and operator fusion actually do
How partitioning enables parallelism, and why the shuffle that moves data between partitions dominates the cost — and the risk — of distributed jobs
How in-process single-node engines (Polars, DuckDB) handle most “big data” workloads that teams reflexively distribute
When a distributed engine (Spark, Dask) genuinely earns its coordination cost, and how to recognize that line instead of guessing at it
How out-of-core execution lets one machine process datasets larger than its RAM by streaming through them

Prerequisites

Performance and Profiling — the cross-language chapter on what makes code slow and how to measure it; in particular why native, vectorized execution is the escape hatch from interpreter overhead. Columnar engines are that escape hatch applied to whole tables.
The Data Engineering Landscape — where processing sits in a pipeline, the difference between batch and streaming, and the role of columnar file formats like Parquet.
Working SQL: SELECT, WHERE, JOIN, GROUP BY, window functions. The mental model of declaring a result and letting a planner execute it carries the whole chapter.

Columnar memory and vectorized execution

Everything fast about a modern data engine starts with how the table is laid out in memory. A row-oriented store keeps each record’s fields together: name, age, city, salary, then the next person’s name, age, city, salary, and so on. That is the natural shape for a transactional database that fetches one whole record at a time. It is the worst possible shape for analytics, which almost never wants whole records and almost always wants one operation applied across one column of millions of values — sum this, average that, filter on the other.

A columnar store flips the layout: all the names together, then all the ages, then all the salaries, each column a single contiguous array. The payoff is mechanical and large. When you compute SUM(salary), the engine reads only the salary array and never touches the other columns; on a fifty-column table where you asked about one, that alone is roughly a fiftyfold cut in bytes read, assuming columns of similar width. The salary values, being adjacent in memory, stream through the CPU’s caches with near-ideal locality and work with the hardware prefetcher instead of fighting it. And because every value in the array has the same type and a predictable stride, the engine can apply SIMD (Single Instruction, Multiple Data) instructions that add eight or sixteen values in one CPU operation. This is what vectorized execution means: processing a column as a batch in a tight native loop, rather than walking rows one boxed object at a time.
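
To make the two layouts concrete, here is a small pure-Python sketch. The table and the function names are invented for illustration, and the standard library’s array type stands in for the contiguous typed buffers a real engine would use; it shows the layout, not SIMD speed.

```python
from array import array

# Hypothetical three-row table used only for illustration.
records = [
    {"name": "Ada", "age": 36, "salary": 120_000.0},
    {"name": "Lin", "age": 41, "salary": 95_000.0},
    {"name": "Sam", "age": 29, "salary": 70_000.0},
]

# Row-oriented: each record keeps all of its fields together.
row_store = records

# Column-oriented: one contiguous, single-typed array per numeric column.
column_store = {
    "name": [r["name"] for r in records],
    "age": array("q", (r["age"] for r in records)),        # 64-bit integers
    "salary": array("d", (r["salary"] for r in records)),  # 64-bit floats
}


def sum_salary_rows(rows):
    """Visits every record and pulls one field out of each."""
    return sum(r["salary"] for r in rows)


def sum_salary_columns(columns):
    """Reads only the salary array; name and age are never touched."""
    return sum(columns["salary"])


# Bytes the columnar sum has to scan: one 8-byte float per row.
salary_bytes = column_store["salary"].itemsize * len(column_store["salary"])
```

Both functions return the same total; the difference is how much unrelated data each one has to walk past to get there.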

The contrast with the naive approach is stark, and it is the same lesson as the performance chapter, scaled up to tables. Consider a row-wise transform written the way a Python programmer reaches for first:

# Row-by-row in the interpreter: every iteration pays object and dispatch overhead.
rows = [{"price": 9.99, "qty": 3}, {"price": 4.50, "qty": 10}]  # stand-in for millions of records
results = []
for row in rows:                      # millions of Python-level iterations
    results.append(row["price"] * row["qty"])

Each iteration boxes values into Python objects, does a dictionary lookup, and dispatches a multiply through the interpreter — overhead that dwarfs the actual arithmetic. A columnar engine expresses the same computation as one operation over two arrays, and runs it as a single vectorized native loop with no per-row interpreter cost:

import polars as pl

df = pl.DataFrame({"price": [9.99, 4.50], "qty": [3, 10]})  # stand-in for a large table

# Whole-column expression: one native, vectorized multiply over contiguous arrays.
df = df.with_columns((pl.col("price") * pl.col("qty")).alias("revenue"))

The second version is not a little faster; it is routinely tens of times faster, and the gap widens with data size. The lesson generalizes: the way to make data processing fast is to stop describing it as a loop over rows and start describing it as operations over columns, then let the engine run those operations in native code.

There is one more reason the four engines interoperate so cleanly, and it is worth naming: Apache Arrow. Arrow is a standardized, language-agnostic columnar memory format — a common spec for how a column of integers or strings is laid out in RAM. Polars is built on Arrow; DuckDB speaks it; Spark and pandas can produce and consume it. Because they share the layout, handing a table from one to another is often a zero-copy operation — no serialization, no re-encoding, just a pointer to the same bytes. Arrow is the reason “use DuckDB for the SQL part and Polars for the DataFrame part” costs almost nothing: they are looking at the same memory.
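
Arrow’s own APIs are not shown here. As a rough analogy for what zero-copy means, the sketch below uses only Python’s standard library: a memoryview over an array shares the producer’s bytes, while a serialized round trip creates a detached copy. The variable names are illustrative.

```python
from array import array

# One contiguous column of 64-bit floats: the "producer" owns this buffer.
prices = array("d", [9.99, 4.50, 12.00, 7.25])

# A "consumer" takes a view instead of a copy: no bytes are duplicated.
view = memoryview(prices)

# Slicing a memoryview is also zero-copy: it points into the same buffer.
tail = view[2:]

# Contrast: a serialize-then-rebuild round trip produces a detached copy.
copied = array("d", prices.tobytes())

# Writes through the producer are visible through every view,
# which shows that they all share one block of memory...
prices[2] = 15.00
prices[3] = 1.00
shared_write_visible = (tail[0] == 15.00 and view[3] == 1.00)

# ...while the copy still holds the old value.
copy_is_detached = (copied[3] == 7.25)
```

A shared, standardized layout is what lets two engines behave like view and tail above instead of like copied.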

Lazy evaluation and query optimization

Columnar layout makes each operation fast. Lazy evaluation makes the sequence of operations fast, by refusing to run them one at a time. An eager API — pandas, or Polars’ read_* functions — executes each call immediately and returns a materialized result. A lazy API records the operation into a growing plan and runs nothing until you call a terminal action: .collect() in Polars, .compute() in Dask, an action like count() or write in Spark, or simply issuing the query in DuckDB. Nothing in the chain below touches data:

import polars as pl

# scan_* is lazy: this builds a plan, reads nothing yet.
plan = (
    pl.scan_parquet("events/*.parquet")   # lazy source
    .filter(pl.col("amount") > 100)        # recorded, not run
    .select(["customer_id", "amount"])     # recorded, not run
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total"))
)
result = plan.collect()                     # NOW the optimizer runs, then execution

The deferral is what unlocks optimization, because by the time you call .collect() the engine can see the whole computation and rewrite it before a single byte is read. Three rewrites do most of the work. Predicate pushdown takes your filter and pushes it as far toward the data source as possible — into the Parquet reader itself, which can skip entire row groups whose statistics show no value above 100, so those rows are never decoded. Projection pushdown does the same for columns: you selected customer_id and amount, so the reader fetches only those two columns from disk and ignores the rest of the file. Operator fusion collapses adjacent steps into a single pass so the data is walked once, not once per operation. You can see the result of all this by asking the engine to explain its plan rather than execute it:

print(plan.explain(optimized=True))   # shows pushed-down filters and pruned columns

This is not a Polars trick; it is the universal pattern, and it is why SQL has always been declarative. DuckDB’s SQL planner does exactly the same pushdowns when you query a Parquet file directly — SELECT customer_id, SUM(amount) FROM 'events/*.parquet' WHERE amount > 100 GROUP BY customer_id reads only two columns and only the qualifying row groups, with no hint from you. Spark’s Catalyst optimizer builds an analyzed logical plan, rewrites it into an optimized logical plan with the same pushdowns and reorderings, then chooses a physical plan and generates code. The names differ; the idea is one idea. You write the query for clarity; the optimizer rewrites it for speed; and the more of your computation you keep inside the lazy plan, the more the optimizer can do for you.
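
A toy model can make the two pushdowns tangible. The sketch below is an illustration under stated assumptions, not how Polars, DuckDB, or Spark are implemented: ROW_GROUPS is a hypothetical stand-in for a Parquet file with per-row-group min/max statistics, and LazyScan is an invented class that records a filter and a projection and applies both only when collect() runs.

```python
# Hypothetical stand-in for a Parquet file: row groups with min/max statistics.
ROW_GROUPS = [
    {"stats": {"amount": (5, 90)},
     "columns": {"customer_id": [1, 2], "amount": [5, 90], "note": ["a", "b"]}},
    {"stats": {"amount": (50, 400)},
     "columns": {"customer_id": [1, 3], "amount": [400, 50], "note": ["c", "d"]}},
]


class LazyScan:
    """Records a filter and a projection; reads nothing until collect()."""

    def __init__(self):
        self.min_amount = None   # predicate: amount > min_amount
        self.wanted = None       # projection: columns to return
        self.log = []            # what the toy "reader" actually touched

    def filter_amount_gt(self, threshold):
        self.min_amount = threshold
        return self

    def select(self, columns):
        self.wanted = list(columns)
        return self

    def collect(self):
        rows = []
        for i, group in enumerate(ROW_GROUPS):
            _, hi = group["stats"]["amount"]
            # Predicate pushdown: skip a whole row group using statistics alone.
            if self.min_amount is not None and hi <= self.min_amount:
                self.log.append(f"skipped row group {i}")
                continue
            # Projection pushdown: decode only the columns the plan needs.
            output = self.wanted or list(group["columns"])
            needed = set(output) | {"amount"}
            cols = {name: group["columns"][name] for name in needed}
            self.log.append(f"read {sorted(cols)} from row group {i}")
            for j, amount in enumerate(cols["amount"]):
                if self.min_amount is None or amount > self.min_amount:
                    rows.append({name: cols[name][j] for name in output})
        return rows


plan = LazyScan().filter_amount_gt(100).select(["customer_id", "amount"])
log_before_collect = list(plan.log)   # still empty: building the plan read nothing
result = plan.collect()
```

After collect(), plan.log records that the first row group was skipped from its statistics and that the note column was never read from the second.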

Partitioning, parallelism, and the shuffle

To go faster than one core, you split the data into partitions: independent chunks the engine can process in parallel. On a single machine, partitions map to CPU cores; across a cluster, they map to worker machines. As long as an operation acts on each partition independently (a filter, a projection, a row-wise transform), parallelism is nearly free: hand each partition to a worker, collect the results, done. This is the easy, embarrassingly parallel half of distributed processing, and it scales beautifully.
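
A minimal sketch of that partition-local case, using only the standard library. Threads stand in for cores or worker machines here (CPython threads will not actually speed up this pure-Python loop), and split and keep_large are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, n_partitions):
    """Cut a list into contiguous partitions of near-equal size."""
    size = -(-len(values) // n_partitions)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def keep_large(partition):
    """Partition-local work: no row needs to see any other partition."""
    return [v for v in partition if v > 100]


amounts = [20, 150, 99, 300, 101, 5, 250, 100]
partitions = split(amounts, 4)

# Each partition goes to a worker; map() returns results in partition order.
with ThreadPoolExecutor(max_workers=4) as pool:
    filtered = list(pool.map(keep_large, partitions))

# Combining is a plain concatenation because no rows had to be regrouped.
result = [v for part in filtered for v in part]
```

Nothing in keep_large refers to another partition, which is exactly the property that makes this half of the work scale.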

The trouble begins with operations that need data from across partitions to meet. To group by customer, every row for a given customer must end up in the same place — but those rows started scattered across many partitions. To join two tables on a key, matching keys from both sides must be colocated. To sort globally, the whole dataset has to be reordered across partition boundaries. Each of these forces a shuffle: the engine redistributes rows across the network so that, afterward, all the rows sharing a key live together.

# In Spark, this single line forces a shuffle: rows for each key must be colocated.
daily = events.groupBy("customer_id").sum("amount")
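
What the shuffle does to the rows can be modeled in a few lines of plain Python. This is a simplified sketch with invented names (shuffle_by_key, sum_within): it places integer keys by modulo, whereas real engines hash arbitrary keys and move the data over the network and through disk.

```python
from collections import defaultdict


def shuffle_by_key(partitions, key, n_out):
    """Redistribute rows so all rows with the same key land in one partition."""
    out = [[] for _ in range(n_out)]
    moved = 0
    for src, rows in enumerate(partitions):
        for row in rows:
            dst = row[key] % n_out      # simple deterministic placement
            if dst != src:
                moved += 1              # this row would cross the network
            out[dst].append(row)
    return out, moved


def sum_within(partition, key, value):
    """Partition-local aggregation, valid only once keys are colocated."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[key]] += row[value]
    return dict(totals)


# Rows for customers 1 and 2 start scattered across two partitions.
before = [
    [{"customer_id": 1, "amount": 10}, {"customer_id": 2, "amount": 5}],
    [{"customer_id": 1, "amount": 7}, {"customer_id": 2, "amount": 1}],
]
after, moved = shuffle_by_key(before, "customer_id", n_out=2)
totals = [sum_within(p, "customer_id", "amount") for p in after]
```

In this toy, half of the rows change partitions before a single sum is computed; the moved counter is the part a cluster pays for.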

The shuffle is the expensive heart of distributed processing. It means writing intermediate data to disk, transferring it between nodes over the network, and reading it back — orders of magnitude slower than the in-memory, in-partition work around it. A job’s runtime is dominated not by how much it computes but by how much it shuffles; a pipeline with five shuffles is, roughly, five times the pain of one. This is why the seasoned move is to reduce shuffles: filter before you join so less data crosses the network; pre-aggregate within partitions before the global aggregation; and, when one side of a join is small, broadcast it — send the small table in full to every worker so the join happens locally with no shuffle at all.
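
Two of those techniques can be sketched in plain Python to show why they help. The function names and sample data are hypothetical, and counting records is only a proxy for bytes crossing the network.

```python
from collections import Counter


def pre_aggregate(rows):
    """Partition-local partial sums: at most one record per key leaves."""
    partial = Counter()
    for customer_id, amount in rows:
        partial[customer_id] += amount
    return partial


def final_aggregate(partials):
    """Combine the partial sums after they have been shuffled by key."""
    total = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds values key by key
    return dict(total)


def broadcast_join(big_partitions, small_table):
    """Copy the small table to every partition and join locally, no shuffle."""
    lookup = dict(small_table)
    return [[(cid, amt, lookup.get(cid)) for cid, amt in rows]
            for rows in big_partitions]


partitions = [
    [(1, 10), (1, 20), (2, 5), (1, 1)],
    [(2, 7), (2, 3), (1, 4), (2, 2)],
]

# Without pre-aggregation every row is shuffled; with it, one record per key.
shuffled_naive = sum(len(rows) for rows in partitions)
partials = [pre_aggregate(rows) for rows in partitions]
shuffled_combined = sum(len(p) for p in partials)
totals = final_aggregate(partials)

# The small region table is sent whole to each partition of the big one.
joined = broadcast_join(partitions, [(1, "EU"), (2, "US")])
```

Here pre-aggregation halves the records that reach the shuffle, and the broadcast join never regroups the large side at all.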

Worse than an honest shuffle is a skewed one. Partitioning assumes keys are roughly balanced. When they aren’t — when one customer has ten million rows and the rest have a thousand each, or a null sentinel swallows 90% of the keys — the shuffle dumps almost everything into a single partition, and one unlucky worker grinds away while the other ninety-nine sit idle. The job is only as fast as its slowest task, so a single hot key can make a hundred-machine cluster run at the speed of one machine. Skew is the most common reason a distributed job that “should” be fast mysteriously hangs at 99% for hours, and detecting and mitigating it — salting the hot key, broadcasting, or filtering it out — is a core distributed-systems skill that single-node processing lets you skip entirely.
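
The arithmetic of skew and salting is easy to check on a toy example. The sketch below uses invented helper names and a deliberately simple modulo placement for integer keys; the salt is assigned round-robin so the result is reproducible, whereas real jobs typically use a random salt.

```python
from collections import Counter

N_PARTITIONS = 4
HOT_KEY = 0   # hypothetical sentinel value that dominates the key column


def skew_ratio(sizes):
    """Largest partition relative to the mean; about 1.0 means balanced."""
    return max(sizes) / (sum(sizes) / len(sizes))


def place(keys, salted):
    """Count rows per partition, optionally salting the hot key."""
    sizes = Counter()
    for i, key in enumerate(keys):
        # Salting: spread the hot key over several sub-keys.
        salt = i % N_PARTITIONS if (salted and key == HOT_KEY) else 0
        sizes[(key + salt) % N_PARTITIONS] += 1
    return [sizes[p] for p in range(N_PARTITIONS)]


# 9,000 rows share the hot key; 1,000 other keys have one row each.
keys = [HOT_KEY] * 9000 + list(range(1, 1001))

plain = place(keys, salted=False)
salted = place(keys, salted=True)
```

Salting balances the partitions, but the hot key’s partial results then sit in several places, so a second, much smaller aggregation is needed to combine them.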

Single-node engines: Polars and DuckDB

The practical upshot of columnar layout, vectorized execution, and lazy optimization is that a single process can now chew through datasets that a few years ago everyone assumed required a cluster. Polars and DuckDB are the two engines built to exploit exactly that, and they are more alike than their interfaces suggest: both are in-process (no server, no cluster; they run inside your Python program), both keep data in columnar form and exchange it through Arrow (Polars uses an Arrow-based memory layout, while DuckDB uses its own columnar vector format and converts to and from Arrow, often without copying), both execute vectorized over those columns using every core on the machine, and both are lazy optimizers under the hood. They differ mainly in the front door. Polars presents a DataFrame and an expression API; DuckDB presents SQL.

import duckdb

# DuckDB: the same pushdowns and parallelism, expressed as SQL, in-process.
df = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'events/*.parquet'
    WHERE amount > 100
    GROUP BY customer_id
""").df()

The strategic point is the one the chapter opened with: for a large fraction of real workloads, this is the right answer, and the cluster is not. There is no spin-up latency, no executor tuning, no shuffle to skew, no on-call rotation — just a query that runs in seconds on hardware you already have. Practitioners sometimes call this shift “big data is dead”: not that large data ceased to exist, but that the volume which genuinely demands distribution has receded far faster than most teams’ habits. Before you reach for scale-out, the honest first question is always whether one of these two engines on a bigger single machine would simply do the job — and the answer, surprisingly often, is yes.

Distributed engines: Spark and Dask

There is, of course, a real line past which one machine cannot go, and on the far side of it live Spark and Dask. Both follow the same architecture the diagram shows: a coordinator (Spark calls it the driver; Dask calls it the scheduler) builds a task graph from your code, splits the data into partitions, and dispatches tasks to a pool of workers that each process their partitions in parallel. Both are lazy — Spark waits for an action, Dask waits for .compute() — so the coordinator can optimize the whole graph before execution, the same principle as the single-node engines, now spread across a network.
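
The idea of a coordinator holding a task graph that runs only on request can be sketched in plain Python. This toy is loosely inspired by dictionary-style task graphs; the compute function and the key names are invented, and it resolves tasks sequentially in one process instead of dispatching them to workers.

```python
from operator import add


def compute(graph, key, cache=None):
    """Resolve one key of a task graph, running its dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Arguments that name other tasks are resolved recursively.
        resolved = [
            compute(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task   # literal data, for example one partition
    cache[key] = result
    return result


# Building the graph runs nothing: it only describes the work.
graph = {
    "part-0": [10, 200, 30],
    "part-1": [400, 5],
    "sum-0": (sum, "part-0"),          # per-partition task
    "sum-1": (sum, "part-1"),          # per-partition task
    "total": (add, "sum-0", "sum-1"),  # combine step
}

total = compute(graph, "total")   # execution starts only here
```

The two sum tasks do not depend on each other, which is what would let a real scheduler hand them to different workers.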

The two differ mostly in heritage and reach. Spark is the JVM-native, enterprise-grade incumbent: it scales to thousands of nodes and petabytes, ships mature streaming and ML libraries, and is what large organizations have standardized on. Its DataFrame and SQL APIs are optimized by Catalyst, and its tuning surface — executor memory, shuffle partitions, broadcast thresholds, adaptive query execution — is deep because the problems it solves are deep. Dask takes a gentler, Python-native path: parallel drop-in replacements for pandas DataFrames and NumPy arrays, so a team already fluent in those can scale existing code to a cluster with minimal relearning. Dask shines in the 10 GB–1 TB range and in scientific-Python workloads; Spark pulls ahead at the largest scales and where a mature ecosystem matters most.

What both buy you, true horizontal scale, they charge for in coordination cost. A distributed job has a driver that can become a bottleneck, partitions that must be sized correctly, shuffles that move data across the network, and skew that can sandbag the whole cluster. None of that complexity is wasted when the data genuinely requires it. All of it is pure overhead when the data does not. That asymmetry is the entire reason this chapter insists you answer the scale-up question first.

Out-of-core: bigger than RAM on one node

There is a middle ground between “fits in RAM” and “needs a cluster,” and missing it is how teams talk themselves into distribution they don’t need. A dataset can be larger than memory and still belong on a single machine, because a good columnar engine can process it out-of-core — streaming the data through RAM in batches rather than loading all of it at once. Memory usage stays roughly constant regardless of total size; the data flows past the computation instead of piling up behind it.

import polars as pl

# Streaming execution: process a 200 GB dataset in roughly constant memory on one machine.
(
    pl.scan_parquet("huge/*.parquet")
    .filter(pl.col("date") >= pl.date(2024, 1, 1))   # assumes "date" is a Date column
    .group_by("category")
    .agg(pl.col("value").sum())
    .sink_parquet("out.parquet")   # streams straight to disk, never fully materialized
)

DuckDB does the same automatically — its execution engine spills to disk when a query’s working set exceeds the memory limit you’ve set, so a query over files far larger than RAM simply runs, slower than a fully in-memory query but without falling over. The implication for the scale-up-versus-scale-out decision is important: the threshold for needing a cluster is not “bigger than RAM.” It is “bigger than what one machine can stream through in acceptable time.” That bar is much higher, and it pushes the genuine need for distribution further out than the naive in-memory limit suggests.
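
The streaming idea itself needs no special engine to demonstrate. The sketch below, with hypothetical function names and an in-memory string standing in for a large file, aggregates a CSV in fixed-size batches so that only one batch of rows is held at a time; memory still grows with the number of distinct groups.

```python
import csv
import io
from collections import defaultdict


def stream_batches(lines, batch_size):
    """Yield lists of parsed rows, holding at most batch_size rows at once."""
    reader = csv.DictReader(lines)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch   # final, possibly smaller batch


def streaming_sum(lines, batch_size=2):
    """Sum value per category without ever materializing the whole input."""
    totals = defaultdict(float)
    peak_rows_in_memory = 0
    for batch in stream_batches(lines, batch_size):
        peak_rows_in_memory = max(peak_rows_in_memory, len(batch))
        for row in batch:
            totals[row["category"]] += float(row["value"])
    return dict(totals), peak_rows_in_memory


# In-memory text stands in for a file far larger than RAM.
data = io.StringIO("category,value\na,1.5\nb,2\na,3\nb,4\na,0.5\n")
totals, peak = streaming_sum(data)
```

The input has five rows, yet no more than two are ever resident; scaling the input changes the runtime, not the peak.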

War story: the cluster that was slower than a laptop

A data team inherited a nightly Spark job that aggregated clickstream logs and had grown to take over an hour, occasionally hanging past its SLA. The instinct in the room was to add executors. Before spending the money, one engineer profiled the job and found two things. First, the inputs totaled fourteen gigabytes a day — a number that fit in the RAM of the very laptops the team was debugging on. Second, the Spark UI showed a single straggler task running for forty-five minutes while every other task finished in seconds: the groupBy was keyed on a column where a null sentinel accounted for most of the rows, so the shuffle funneled almost the entire dataset into one partition and one worker. The cluster wasn’t slow because it was too small; it was slow because it was a cluster at all — paying full shuffle and skew costs on data that never needed partitioning. Rewritten as a single DuckDB query against the same Parquet files, the job finished in under a minute on one node, with no shuffle to skew and no executors to tune. Adding machines would have made the honest part of the work marginally faster and left the skewed shuffle — the actual bottleneck — exactly as slow. The fix was not more scale-out. It was less.

Build it → A single-node columnar engine, from the inside: Project 17: Columnar Query Engine is a DuckDB-style OLAP engine implementing columnar storage, vectorized execution, and a pushdown-aware query optimizer — the three core ideas of this chapter, built rather than imported.

Build it → Processing at the pipeline scale: Project 07: Data Lakehouse runs columnar batch transforms over Parquet on object storage, and Project 50: Feature Engineering Platform turns these engines into a production transform stage that materializes features for ML.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Trade the row loop for a column. Take a transform written the pandas/row-wise way — a for loop over rows, or df.apply(..., axis=1) — that computes a derived column over a few million rows. Rewrite it as a single lazy, columnar expression in Polars (or as a DuckDB SQL query). Time both, report the speedup, and explain why it’s faster in terms of two specific mechanisms from this chapter: interpreter-versus-native execution and columnar memory locality. State which mattered more and how you’d tell.
Level II — Make the optimizer prove its work. Take a query that filters on one column and selects a handful of columns from a wide Parquet file (50+ columns, tens of millions of rows). Run it eagerly (read everything, then filter and select) and lazily (scan_parquet → filter → select → collect). Print the optimized plan with .explain(optimized=True) (or DuckDB’s EXPLAIN) and point to where predicate pushdown and projection pushdown appear. Then measure the bytes actually read from disk in each case and tie the difference back to the two pushdowns. Quantify how much less data the lazy version scanned.
Level III — Choose the architecture and find the bottleneck. Given a concrete workload (say, a daily join-and-aggregate over an event feed), a data size, and a monthly budget, write a one-page decision: scale-up (Polars/DuckDB on one large machine, possibly out-of-core) or scale-out (Spark/Dask). Justify the choice against the fit-on-one-machine test, including the out-of-core threshold, not just raw size. Then, assuming you had gone distributed, identify exactly where a shuffle would occur in your pipeline, which key could skew it, and what you’d do to detect and mitigate that skew. The deliverable is the reasoning, not the code.
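
For the timing parts of these exercises, a small harness along the following lines may help. It is one reasonable approach among several (the standard library’s timeit module is another), and best_of and speedup are illustrative names.

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)


def speedup(slow_fn, fast_fn, repeats=5):
    """How many times faster fast_fn is than slow_fn, by best-of timing."""
    return best_of(slow_fn, repeats) / best_of(fast_fn, repeats)


# Example workloads: a Python-level loop versus the built-in sum.
values = list(range(200_000))


def python_loop():
    total = 0
    for v in values:
        total += v
    return total


def builtin_sum():
    return sum(values)


ratio = speedup(python_loop, builtin_sum)
```

Report the ratio together with the data size and hardware, since the gap between the two versions usually changes with both.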

Summary

The first and most consequential data-processing decision is scale-up versus scale-out, and the industry’s reflex to distribute early is usually wrong. The reason it’s wrong is that processing performance comes mostly from three ideas, and only the last has anything to do with adding machines: columnar memory with vectorized execution, which processes contiguous typed columns in tight native loops with SIMD instead of boxing rows in the interpreter; lazy evaluation, which defers work into a query plan so an optimizer can push filters and projections down to the source and fuse operations; and partitioning, whose parallelism is nearly free until an operation needs cross-partition data and forces a shuffle, the network-and-disk tax that, especially when skewed, dominates distributed cost. Because hardware grew while these ideas matured, single-node columnar engines (Polars and DuckDB, in-process and Arrow-compatible) now handle most workloads teams still distribute out of habit, and out-of-core streaming pushes their reach past RAM. Spark and Dask remain the right answer past the line where one machine genuinely can’t keep up, but you cross that line by measuring, not by assuming.

Key takeaways

Decide scale-up vs scale-out first, and decide it by whether the working set fits (in RAM, or on disk for an out-of-core engine) on one machine — not by whether the data feels “big.”
Columnar layout plus vectorized, SIMD-friendly execution is the largest single performance lever; it’s why a laptop engine can beat a misused cluster.
Lazy evaluation exists to enable optimization: predicate pushdown, projection pushdown, and operator fusion all depend on the engine seeing the whole plan before it runs.
In a distributed job the shuffle is the cost center, and a skewed key — one partition holding most of the data — is the classic reason a big cluster runs at the speed of one node.
Out-of-core streaming means “bigger than RAM” is not the same as “needs a cluster”; the real distribution threshold is much higher than the in-memory limit.

Connections to other chapters

Performance and Profiling (prerequisite): the cross-language chapter on measuring and optimizing code supplies both halves of why a data engine is fast. At a high level, columnar vectorized execution is the same move as escaping the interpreter into NumPy or native code: pushing tight numeric loops out of Python and into compiled, batched operations, which is why row-wise apply is the analog of a slow Python loop and a column expression is the fast path. At the hardware level, columnar is fast because of cache-friendly, SIMD-friendly memory access: contiguous arrays, predictable strides, prefetcher-friendly streaming. The cache-hierarchy, vectorization, and data-layout mechanics covered there are the why underneath this chapter’s columnar what.
Data Warehousing (sibling): the same columnar storage and query optimization that power Polars and DuckDB also power the analytical warehouse. A warehouse is, in large part, these ideas turned into a managed service; this chapter is the engine, that chapter is the platform.
The Data Engineering Landscape (prerequisite/context): processing is the transform stage of a pipeline — the T between extracting raw data and loading curated results. This chapter is how that stage actually runs; the landscape chapter is where it sits among ingestion, storage, and orchestration.

Further reading

Polars User Guide and DuckDB Documentation — the two single-node engines, with the clearest practical treatment of lazy evaluation, expressions, and direct file querying.
Apache Spark Documentation — Spark SQL, DataFrames, and Tuning Guide — the canonical reference for distributed processing, Catalyst, shuffles, and skew-handling (adaptive query execution).

Deep dives

Kleppmann, Designing Data-Intensive Applications — the chapters on batch processing and column-oriented storage put partitioning, shuffles, and columnar layout in their full systems context.
The Apache Arrow Project — the columnar memory standard that lets these engines share data zero-copy; its documentation is the best explanation of why columnar layout is fast and interoperable.

Historical context

Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters” (OSDI, 2004) — the paper that defined the shuffle-centric, partition-and-combine model all distributed engines descend from.
Zaharia et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” (NSDI, 2012) — the Spark RDD paper, which made in-memory distributed processing and lazy lineage practical.
Tigani, “Big Data is Dead” (2023) — the essay arguing that the data sizes justifying distributed clusters are far rarer than the industry assumes; the intellectual backbone of this chapter’s scale-up-first stance.

The lazy DataFrame APIs in Polars and Spark are SQL's planner-and-optimizer model wearing a Python coat. And the shuffle is a **tax on distribution**. Within a partition, work is local and cheap. The moment a computation needs to regroup data so matching keys sit together — every `GROUP BY`, every non-broadcast join, every global sort — the engine must ship rows across the network from where they are to where they need to be. You can't always avoid the tax, but you can minimize it, and recognizing that the shuffle is the cost center is most of what separates someone who can tune a distributed job from someone who just runs one. ### When single-node vs distributed The decision framework is refreshingly blunt, and @fig-de-processing draws the two paths side by side. ![Two processing models: a single-node columnar engine (Polars, DuckDB) runs a vectorized query in-process, while a distributed engine (Spark, Dask) partitions data across workers under a coordinator — where the shuffle that moves data between partitions is the expensive step. Both build a lazy query plan, optimize it (predicate and projection pushdown), then execute. Most "big data" now fits the single-node path.](../assets/diagrams/rendered/de_processing.svg){#fig-de-processing .lightbox} Start by asking the only question that really matters: **does the data fit on one machine?** Not "is it big" in some vibes-based sense, but concretely — can the working set fit in the RAM of a fat single node, or at least on its local disk for an out-of-core engine to stream through? **If it fits on one machine, stay on one machine.** Reach for **Polars** when you want a fast DataFrame API with lazy evaluation, or **DuckDB** when you'd rather express the work as SQL. Both are in-process, columnar, vectorized, and zero-config — `pip install` and go. They are simpler to operate, faster on equal data, and dramatically cheaper than a cluster, because there is no cluster. 
This case is far larger than most teams assume: tens to low hundreds of gigabytes is squarely single-node territory today. **Reach for distributed only when the data genuinely exceeds one machine, or when you're already standing in a cluster.** Choose **Spark** at true scale (multiple terabytes and up), when you need a mature ecosystem with streaming and ML libraries, or in an organization already standardized on it. Choose **Dask** when your team lives in pandas and NumPy and wants to scale that familiar code across nodes with the least new API to learn. The honest signal that you've crossed the line: a single big machine can no longer hold the working set, *and* you've confirmed it by measuring, not guessing. The anti-pattern to retire is "we might need to scale later, so let's start distributed." Premature distribution buys complexity now against a need that may never arrive, and the migration from single-node to distributed, when it does arrive, is rarely the hard part. Start on one machine. Move to a cluster when one machine actually runs out — and not a dataset sooner. 
### What you'll learn - Why columnar memory layout plus vectorized, SIMD-friendly execution beats row-by-row object processing by one to two orders of magnitude - How lazy evaluation turns your code into a query plan an optimizer can rewrite — and what predicate pushdown, projection pushdown, and operator fusion actually do - How partitioning enables parallelism, and why the shuffle that moves data between partitions dominates the cost — and the risk — of distributed jobs - How in-process single-node engines (Polars, DuckDB) handle most "big data" workloads that teams reflexively distribute - When a distributed engine (Spark, Dask) genuinely earns its coordination cost, and how to recognize that line instead of guessing at it - How out-of-core execution lets one machine process datasets larger than its RAM by streaming through them ### Prerequisites - **Performance and Profiling** — the cross-language chapter on what makes code slow and how to measure it; in particular why native, vectorized execution is the escape hatch from interpreter overhead. Columnar engines are that escape hatch applied to whole tables. - **The Data Engineering Landscape** — where processing sits in a pipeline, the difference between batch and streaming, and the role of columnar file formats like Parquet. - Working SQL: `SELECT`, `WHERE`, `JOIN`, `GROUP BY`, window functions. The mental model of declaring a result and letting a planner execute it carries the whole chapter. --- ## Columnar memory and vectorized execution Everything fast about a modern data engine starts with how the table is laid out in memory. A row-oriented store keeps each record's fields together: name, age, city, salary, then the next person's name, age, city, salary, and so on. That is the natural shape for a transactional database that fetches one whole record at a time. 
It is the worst possible shape for analytics, which almost never wants whole records and almost always wants one operation applied across one column of millions of values — sum this, average that, filter on the other. A **columnar** store flips the layout: all the names together, then all the ages, then all the salaries, each column a single contiguous array. The payoff is mechanical and large. When you compute `SUM(salary)`, the engine reads only the salary array and never touches the other columns — on a fifty-column table where you asked about one, that alone is a fiftyfold cut in bytes read. The salary values, being adjacent in memory, stream through the CPU's caches with perfect locality and trigger the hardware prefetcher instead of fighting it. And because every value in the array has the same type and a predictable stride, the engine can apply **SIMD** instructions — Single Instruction, Multiple Data — that add eight or sixteen values in one CPU operation. This is what **vectorized execution** means: processing a column as a batch in a tight native loop, rather than walking rows one boxed object at a time. The contrast with the naive approach is stark, and it is the same lesson as the performance chapter, scaled up to tables. Consider a row-wise transform written the way a Python programmer reaches for first: ```python # Row-by-row in the interpreter: every iteration pays object and dispatch overhead. results = [] for row in rows: # millions of Python-level iterations results.append(row["price"] * row["qty"]) ``` Each iteration boxes values into Python objects, does a dictionary lookup, and dispatches a multiply through the interpreter — overhead that dwarfs the actual arithmetic. A columnar engine expresses the same computation as one operation over two arrays, and runs it as a single vectorized native loop with no per-row interpreter cost: ```python import polars as pl # Whole-column expression: one native, vectorized multiply over contiguous arrays. 
df = df.with_columns((pl.col("price") * pl.col("qty")).alias("revenue")) ``` The second version is not a little faster; it is routinely tens of times faster, and the gap widens with data size. The lesson generalizes: the way to make data processing fast is to stop describing it as a loop over rows and start describing it as operations over columns, then let the engine run those operations in native code. There is one more reason the four engines interoperate so cleanly, and it is worth naming: **Apache Arrow**. Arrow is a standardized, language-agnostic columnar memory format — a common spec for how a column of integers or strings is laid out in RAM. Polars is built on Arrow; DuckDB speaks it; Spark and pandas can produce and consume it. Because they share the layout, handing a table from one to another is often a *zero-copy* operation — no serialization, no re-encoding, just a pointer to the same bytes. Arrow is the reason "use DuckDB for the SQL part and Polars for the DataFrame part" costs almost nothing: they are looking at the same memory. ## Lazy evaluation and query optimization Columnar layout makes each operation fast. Lazy evaluation makes the *sequence* of operations fast, by refusing to run them one at a time. An eager API — pandas, or Polars' `read_*` functions — executes each call immediately and returns a materialized result. A lazy API records the operation into a growing plan and runs nothing until you call a terminal action: `.collect()` in Polars, `.compute()` in Dask, an action like `count()` or `write` in Spark, or simply issuing the query in DuckDB. Nothing in the chain below touches data: ```python import polars as pl # scan_* is lazy: this builds a plan, reads nothing yet. 
plan = ( pl.scan_parquet("events/*.parquet") # lazy source .filter(pl.col("amount") > 100) # recorded, not run .select(["customer_id", "amount"]) # recorded, not run .group_by("customer_id") .agg(pl.col("amount").sum().alias("total")) ) result = plan.collect() # NOW the optimizer runs, then execution ``` The deferral is what unlocks optimization, because by the time you call `.collect()` the engine can see the whole computation and rewrite it before a single byte is read. Three rewrites do most of the work. **Predicate pushdown** takes your `filter` and pushes it as far toward the data source as possible — into the Parquet reader itself, which can skip entire row groups whose statistics show no value above 100, so those rows are never decoded. **Projection pushdown** does the same for columns: you selected `customer_id` and `amount`, so the reader fetches only those two columns from disk and ignores the rest of the file. **Operator fusion** collapses adjacent steps into a single pass so the data is walked once, not once per operation. You can see the result of all this by asking the engine to explain its plan rather than execute it: ```python print(plan.explain(optimized=True)) # shows pushed-down filters and pruned columns ``` This is not a Polars trick; it is the universal pattern, and it is why SQL has always been declarative. DuckDB's SQL planner does exactly the same pushdowns when you query a Parquet file directly — `SELECT customer_id, SUM(amount) FROM 'events/*.parquet' WHERE amount > 100 GROUP BY customer_id` reads only two columns and only the qualifying row groups, with no hint from you. Spark's Catalyst optimizer builds an analyzed logical plan, rewrites it into an optimized logical plan with the same pushdowns and reorderings, then chooses a physical plan and generates code. The names differ; the idea is one idea. 
You write the query for clarity; the optimizer rewrites it for speed; and the more of your computation you keep inside the lazy plan, the more the optimizer can do for you. ## Partitioning, parallelism, and the shuffle To go faster than one core, you split the data into **partitions** — independent chunks each engine can process in parallel. On a single machine, partitions map to CPU cores; across a cluster, they map to worker machines. As long as an operation acts on each partition independently — a filter, a projection, a row-wise transform — parallelism is nearly free: hand each partition to a worker, collect the results, done. This is the easy, embarrassingly-parallel half of distributed processing, and it scales beautifully. The trouble begins with operations that need data from *across* partitions to meet. To group by customer, every row for a given customer must end up in the same place — but those rows started scattered across many partitions. To join two tables on a key, matching keys from both sides must be colocated. To sort globally, the whole dataset has to be reordered across partition boundaries. Each of these forces a **shuffle**: the engine redistributes rows across the network so that, afterward, all the rows sharing a key live together. ```python # In Spark, this single line forces a shuffle: rows for each key must be colocated. daily = events.groupBy("customer_id").sum("amount") ``` The shuffle is the expensive heart of distributed processing. It means writing intermediate data to disk, transferring it between nodes over the network, and reading it back — orders of magnitude slower than the in-memory, in-partition work around it. A job's runtime is dominated not by how much it computes but by how much it shuffles; a pipeline with five shuffles is, roughly, five times the pain of one. 
This is why the seasoned move is to *reduce* shuffles: filter before you join so less data crosses the network; pre-aggregate within partitions before the global aggregation; and, when one side of a join is small, **broadcast** it — send the small table in full to every worker so the join happens locally with no shuffle at all. Worse than an honest shuffle is a **skewed** one. Partitioning assumes keys are roughly balanced. When they aren't — when one customer has ten million rows and the rest have a thousand each, or a null sentinel swallows 90% of the keys — the shuffle dumps almost everything into a single partition, and one unlucky worker grinds away while the other ninety-nine sit idle. The job is only as fast as its slowest task, so a single hot key can make a hundred-machine cluster run at the speed of one machine. Skew is the most common reason a distributed job that "should" be fast mysteriously hangs at 99% for hours, and detecting and mitigating it — salting the hot key, broadcasting, or filtering it out — is a core distributed-systems skill that single-node processing lets you skip entirely. ## Single-node engines: Polars and DuckDB The practical upshot of columnar layout, vectorized execution, and lazy optimization is that a single process can now chew through datasets that a few years ago everyone assumed required a cluster. **Polars** and **DuckDB** are the two engines built to exploit exactly that, and they are more alike than their interfaces suggest: both are in-process (no server, no cluster — they run inside your Python program), both store data as Arrow-format columns, both execute vectorized over those columns using every core on the machine, and both are lazy optimizers under the hood. They differ mainly in the front door. Polars presents a DataFrame and an expression API; DuckDB presents SQL. ```python import duckdb # DuckDB: the same pushdowns and parallelism, expressed as SQL, in-process. 
df = duckdb.sql(""" SELECT customer_id, SUM(amount) AS total FROM 'events/*.parquet' WHERE amount > 100 GROUP BY customer_id """).df() ``` The strategic point is the one the chapter opened with: for a large fraction of real workloads, this *is* the right answer, and the cluster is not. There is no spin-up latency, no executor tuning, no shuffle to skew, no on-call rotation — just a query that runs in seconds on hardware you already have. Practitioners sometimes call this shift "big data is dead": not that large data ceased to exist, but that the volume which genuinely demands distribution has receded far faster than most teams' habits. Before you reach for scale-out, the honest first question is always whether one of these two engines on a bigger single machine would simply do the job — and the answer, surprisingly often, is yes. ## Distributed engines: Spark and Dask There is, of course, a real line past which one machine cannot go, and on the far side of it live **Spark** and **Dask**. Both follow the same architecture the diagram shows: a **coordinator** (Spark calls it the driver; Dask calls it the scheduler) builds a task graph from your code, splits the data into partitions, and dispatches tasks to a pool of **workers** that each process their partitions in parallel. Both are lazy — Spark waits for an action, Dask waits for `.compute()` — so the coordinator can optimize the whole graph before execution, the same principle as the single-node engines, now spread across a network. The two differ mostly in heritage and reach. **Spark** is the JVM-native, enterprise-grade incumbent: it scales to thousands of nodes and petabytes, ships mature streaming and ML libraries, and is what large organizations have standardized on. Its DataFrame and SQL APIs are optimized by Catalyst, and its tuning surface — executor memory, shuffle partitions, broadcast thresholds, adaptive query execution — is deep because the problems it solves are deep. 
**Dask** takes a gentler, Python-native path: parallel drop-in replacements for pandas DataFrames and NumPy arrays, so a team already fluent in those can scale existing code to a cluster with minimal relearning. Dask shines in the 10 GB–1 TB range and in scientific-Python workloads; Spark pulls ahead at the largest scales and where a mature ecosystem matters most. What both buy you — true horizontal scale — they charge for in coordination cost. A distributed job has a driver that can become a bottleneck, partitions that must be sized correctly, shuffles that move data across the network, and skew that can sandbag the whole cluster. None of that complexity is wasted *when the data genuinely requires it*. All of it is pure overhead when the data didn't. That asymmetry is the entire reason this chapter insists you answer the scale-up question first. ## Out-of-core: bigger than RAM on one node There is a middle ground between "fits in RAM" and "needs a cluster," and missing it is how teams talk themselves into distribution they don't need. A dataset can be larger than memory and still belong on a single machine, because a good columnar engine can process it **out-of-core** — streaming the data through RAM in batches rather than loading all of it at once. Memory usage stays roughly constant regardless of total size; the data flows past the computation instead of piling up behind it. ```python import polars as pl # Streaming execution: process a 200 GB file in constant memory on one machine. ( pl.scan_parquet("huge/*.parquet") .filter(pl.col("date") >= "2024-01-01") .group_by("category") .agg(pl.col("value").sum()) .sink_parquet("out.parquet") # streams straight to disk, never fully materialized ) ``` DuckDB does the same automatically — its execution engine spills to disk when a query's working set exceeds the memory limit you've set, so a query over files far larger than RAM simply runs, slower than a fully in-memory query but without falling over. 
The implication for the scale-up-versus-scale-out decision is important: the threshold for needing a cluster is not "bigger than RAM." It is "bigger than what one machine can stream through in acceptable time." That bar is much higher, and it pushes the genuine need for distribution further out than the naive in-memory limit suggests. ::: {.callout-warning} ## War story: the cluster that was slower than a laptop A data team inherited a nightly Spark job that aggregated clickstream logs and had grown to take over an hour, occasionally hanging past its SLA. The instinct in the room was to add executors. Before spending the money, one engineer profiled the job and found two things. First, the inputs totaled fourteen gigabytes a day — a number that fit in the RAM of the very laptops the team was debugging on. Second, the Spark UI showed a single straggler task running for forty-five minutes while every other task finished in seconds: the `groupBy` was keyed on a column where a null sentinel accounted for most of the rows, so the shuffle funneled almost the entire dataset into one partition and one worker. The cluster wasn't slow because it was too small; it was slow because it was a cluster at all — paying full shuffle and skew costs on data that never needed partitioning. Rewritten as a single DuckDB query against the same Parquet files, the job finished in under a minute on one node, with no shuffle to skew and no executors to tune. Adding machines would have made the *honest* part of the work marginally faster and left the skewed shuffle — the actual bottleneck — exactly as slow. The fix was not more scale-out. It was less. 
::: > **Build it →** A single-node columnar engine, from the inside: > [Project 17: Columnar Query Engine](https://github.com/jchu0/applied-cs-projects/tree/main/17-columnar-query-engine) > is a DuckDB-style OLAP engine implementing columnar storage, vectorized > execution, and a pushdown-aware query optimizer — the three core ideas of this > chapter, built rather than imported. > **Build it →** Processing at the pipeline scale: > [Project 07: Data Lakehouse](https://github.com/jchu0/applied-cs-projects/tree/main/07-data-lakehouse) > runs columnar batch transforms over Parquet on object storage, and > [Project 50: Feature Engineering Platform](https://github.com/jchu0/applied-cs-projects/tree/main/50-feature-engineering-platform) > turns these engines into a production transform stage that materializes > features for ML. --- ## Practical exercise **Difficulty:** Level I · Level II · Level III 1. **Level I — Trade the row loop for a column.** Take a transform written the pandas/row-wise way — a `for` loop over rows, or `df.apply(..., axis=1)` — that computes a derived column over a few million rows. Rewrite it as a single lazy, columnar expression in Polars (or as a DuckDB SQL query). Time both, report the speedup, and explain *why* it's faster in terms of two specific mechanisms from this chapter: interpreter-versus-native execution and columnar memory locality. State which mattered more and how you'd tell. 2. **Level II — Make the optimizer prove its work.** Take a query that filters on one column and selects a handful of columns from a wide Parquet file (50+ columns, tens of millions of rows). Run it eagerly (read everything, then filter and select) and lazily (`scan_parquet` → filter → select → collect). Print the optimized plan with `.explain(optimized=True)` (or DuckDB's `EXPLAIN`) and point to where predicate pushdown and projection pushdown appear. Then measure the bytes actually read from disk in each case and tie the difference back to the two pushdowns. 
Quantify how much *less* data the lazy version scanned. 3. **Level III — Choose the architecture and find the bottleneck.** Given a concrete workload (say, a daily join-and-aggregate over an event feed), a data size, and a monthly budget, write a one-page decision: scale-up (Polars/DuckDB on one large machine, possibly out-of-core) or scale-out (Spark/Dask). Justify the choice against the fit-on-one-machine test, including the out-of-core threshold, not just raw size. Then, *assuming you had gone distributed*, identify exactly where a shuffle would occur in your pipeline, which key could skew it, and what you'd do to detect and mitigate that skew. The deliverable is the reasoning, not the code. ## Summary The first and most consequential data-processing decision is scale-up versus scale-out, and the industry's reflex to distribute early is usually wrong. The reason it's wrong is that processing performance comes mostly from three ideas that have nothing to do with adding machines: **columnar memory with vectorized execution**, which processes contiguous typed columns in tight native loops with SIMD instead of boxing rows in the interpreter; **lazy evaluation**, which defers work into a query plan so an optimizer can push filters and projections down to the source and fuse operations; and **partitioning**, whose parallelism is nearly free until an operation needs cross-partition data and forces a **shuffle** — the network-and-disk tax that, especially when skewed, dominates distributed cost. Because hardware grew while these ideas matured, single-node columnar engines — **Polars** and **DuckDB**, in-process and Arrow-backed — now handle most workloads teams still distribute out of habit, and out-of-core streaming pushes their reach past RAM. **Spark** and **Dask** remain the right answer past the line where one machine genuinely can't keep up — but you cross that line by measuring, not by assuming. 
### Key takeaways - Decide scale-up vs scale-out *first*, and decide it by whether the working set fits (in RAM, or on disk for an out-of-core engine) on one machine — not by whether the data feels "big." - Columnar layout plus vectorized, SIMD-friendly execution is the largest single performance lever; it's why a laptop engine can beat a misused cluster. - Lazy evaluation exists to enable optimization: predicate pushdown, projection pushdown, and operator fusion all depend on the engine seeing the whole plan before it runs. - In a distributed job the shuffle is the cost center, and a skewed key — one partition holding most of the data — is the classic reason a big cluster runs at the speed of one node. - Out-of-core streaming means "bigger than RAM" is not the same as "needs a cluster"; the real distribution threshold is much higher than the in-memory limit. ### Connections to other chapters - **Performance and Profiling** (prerequisite): the cross-language chapter on measuring and optimizing code supplies both halves of why a data engine is fast. On the high level, columnar vectorized execution is the same move as escaping the interpreter into NumPy or native code — pushing tight numeric loops out of Python and into compiled, batched operations, which is why row-wise `apply` is the analog of a slow Python loop and a column expression is the fast path. On the hardware level, columnar is fast because of cache-friendly, SIMD-friendly memory access — contiguous arrays, predictable strides, prefetcher-friendly streaming. The cache-hierarchy, vectorization, and data-layout mechanics covered there are the *why* underneath this chapter's columnar *what*. - **Data Warehousing** (sibling): the same columnar storage and cost-based query planning that power Polars and DuckDB power the analytical warehouse. A warehouse is, in large part, these ideas turned into a managed service; this chapter is the engine, that chapter is the platform. 
- **The Data Engineering Landscape** (prerequisite/context): processing is the *transform* stage of a pipeline — the T between extracting raw data and loading curated results. This chapter is how that stage actually runs; the landscape chapter is where it sits among ingestion, storage, and orchestration. ## Further reading ### Essential - *Polars User Guide* and *DuckDB Documentation* — the two single-node engines, with the clearest practical treatment of lazy evaluation, expressions, and direct file querying. - *Apache Spark Documentation — Spark SQL, DataFrames, and Tuning Guide* — the canonical reference for distributed processing, Catalyst, shuffles, and skew-handling (adaptive query execution). ### Deep dives - Kleppmann, *Designing Data-Intensive Applications* — the chapters on batch processing and column-oriented storage put partitioning, shuffles, and columnar layout in their full systems context. - *The Apache Arrow Project* — the columnar memory standard that lets these engines share data zero-copy; its documentation is the best explanation of why columnar layout is fast and interoperable. ### Historical context - Dean & Ghemawat, *"MapReduce: Simplified Data Processing on Large Clusters"* (OSDI, 2004) — the paper that defined the shuffle-centric, partition-and-combine model all distributed engines descend from. - Zaharia et al., *"Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"* (NSDI, 2012) — the Spark RDD paper, which made in-memory distributed processing and lazy lineage practical. - Tigani, *"Big Data is Dead"* (2023) — the essay arguing that the data sizes justifying distributed clusters are far rarer than the industry assumes; the intellectual backbone of this chapter's scale-up-first stance.