Data Orchestration & Pipelines
orchestration, airflow, dagster, prefect, dag, scheduling, idempotency, backfill, data pipeline, retries, dependencies
Introduction
It started, as these things do, with a single cron line. A script pulled yesterday’s events from an API at 2am, wrote them to a file, and a second cron job an hour later loaded that file into the warehouse, and a third refreshed the dashboard the executives opened with their coffee. It worked. Then a fourth source was added, then a fifth, and the crontab grew into a column of timestamps chosen by guesswork — give the extract an hour, that should be enough — each job blindly trusting that the one before it had finished and left good data behind.
One Tuesday the API returned a 503 at 2am. The extract script caught the error, logged it to a file nobody read, and exited zero. An hour later the load job ran exactly on schedule, found yesterday’s stale file still on disk, and loaded it again — so the warehouse now held Monday’s numbers stamped with Tuesday’s date. The dashboard refreshed on time and looked perfect. No alert fired, because nothing had “failed” in a way the crontab could see. By the time someone noticed revenue was implausibly flat, the bad data had been double-counted into three downstream tables, with no way to re-run just the extract for that one day short of manually deleting rows and praying. The pipeline had no memory, no retries, no notion of “this step depends on that one having actually succeeded,” and no safe way to repair history.
This is the failure that orchestration exists to prevent. A chain of cron jobs is a pipeline only in the sense that water through a cracked pipe is plumbing: it moves, until it doesn’t, and then it floods the basement silently. Orchestration turns that fragile chain into a managed pipeline — one that knows what ran, what each step depends on, what failed and why, that retries transient errors on its own, alerts a human on real ones, and lets you safely re-run a single bad day. It is the difference between “run this at 2am and hope” and “ensure this data exists, is fresh, and is correct — and tell me the moment it isn’t.”
The Core Insight
The reframe is small and total: a data pipeline is a directed acyclic graph (DAG) of tasks with dependencies, and an orchestrator’s entire job is to run that graph in the right order — on a schedule or in response to an event — with retries, backfills, idempotency, and observability built into the substrate rather than bolted on. The crontab above was a DAG; it just lived in nobody’s head and the operating system had no way to reason about it. Make the graph explicit and the orchestrator can do what cron structurally cannot:
- Run independent steps in parallel and dependent steps in order, because it knows the edges. Cron knows only clock times, so it fakes dependencies with sleep-and-pray timestamps that drift out of sync the first time anything runs slow.
- Retry a failed task without re-running the successful ones. Cron’s unit of work is the whole script; if step three of five fails, you re-run all five by hand.
- Backfill history deliberately — replay last month one day at a time — instead of never repairing the past or hand-writing a loop that hammers the source.
- Make failure visible. A run is an event with a state, a log, and an owner to notify; a failed cron job is an exit code that fell in the forest.
There is a second, deeper shift underway in how we declare that graph. The first generation of orchestrators is task- and schedule-centric: you tell the system “run this task at 2am, then that task.” Airflow is the archetype — you describe verbs, steps, and a clock. The newer generation is asset- and data-aware: you declare “this dataset should exist and be fresh by 6am,” and the system figures out which steps make that true. Dagster’s software-defined assets are the clearest expression — you describe nouns, the data you want, not the procedure to get it. The Tuesday incident is, at root, a task-centric blind spot: the schedule fired and the task “succeeded,” but nobody had told the system that the asset “fresh Tuesday events” did not yet exist. An asset-aware orchestrator asks a question cron and even classic Airflow do not: is the data actually there, and is it current?
A mental model
Picture the pipeline as a graph you can walk, and the orchestrator as a reliable foreman standing in front of it. The foreman does not lay the bricks — the extract, the transform, the load are your code, the muscle. The foreman’s job is to read the plan, send each worker to their task only once its inputs are ready, watch for a worker who drops a tool, hand them the same task again, and ring the alarm if it fails three times running. A good foreman is judged not by how fast they work but by how little surprises them: every dependency on the board, every state tracked, nothing running on stale inputs because someone guessed at a timestamp.
The property that makes those re-runs safe is idempotency: re-running a task produces the same result as running it once. The load step in the disaster wasn’t idempotent — it appended, so a second run doubled the data. Make it delete-then-insert for the day, or upsert by key, and the foreman can re-run it as many times as needed; the table lands in exactly one correct state. Idempotency is not a nicety — it is the precondition that turns “retry” and “backfill” from dangerous operations into routine ones. Without it, every automatic retry is a coin-flip on data corruption; with it, the orchestrator can be aggressive about recovery because recovery can’t hurt you. Figure 29.1 shows the whole shape: a trigger fires a run, the orchestrator walks the DAG from its roots, retries what fails, and can replay a range of dates — every move safe precisely because the tasks are idempotent.
When to orchestrate, and which model
Not every job needs a foreman. The decision has two layers: do you need an orchestrator at all, and if so, task-centric or asset-aware.
Stay with cron (or a single script) when the work is genuinely trivial: one step, no downstream consumers, no real cost to a silent failure, no need to repair history. The moment a second step depends on the first, or you want retries, or someone asks “can you re-run just last Thursday?”, or there’s an SLA on freshness — reach for an orchestrator. The tell is timestamp guesswork: if your scheduling strategy is “give it an hour and hope the upstream finished,” you have already outgrown cron.
Choose a task/schedule-centric orchestrator (Airflow) when your model genuinely is “run these steps in this order on this cadence” — operator-heavy ETL, a mature integration ecosystem, an existing investment you shouldn’t rewrite. Choose an asset-aware one (Dagster’s assets, Prefect’s data-aware flows) when you care more about what data exists and how fresh it is than which tasks ran — when consumers ask “when was this last updated?”, when lineage matters for debugging or compliance, when you want to refresh one slice of the graph without re-running everything. A rule of thumb: if you find yourself building custom lineage or freshness monitoring on top of a task scheduler, you actually wanted an asset-aware tool. The three named systems are examples of these ideas, not religions; the concepts below outlive any of them.
What you’ll learn
- Why a pipeline is best modeled as a DAG, and what the orchestrator’s scheduler does with that graph that a crontab structurally cannot
- Why idempotency is the load-bearing property of any reliable pipeline, and the patterns (partition-by-date, upsert-not-append) that get you there
- How retries recover from transient failure without re-running successful work — and when a retry on a non-idempotent task is worse than the failure
- How scheduling, data intervals, catchup, and backfills let you run history deliberately rather than by accident
- How the task-centric and asset-aware models differ, using Airflow, Dagster, and Prefect as concrete illustrations
- How to treat pipeline runs as observable events you watch and alert on, like any production service
Prerequisites
- Python language features: functions, decorators, and type hints — every orchestrator in this chapter expresses pipelines as decorated Python functions
- The Data Engineering Landscape (Part II opener): where pipelines sit in the data lifecycle, and the ingest → transform → serve shape they sequence
- Comfort with the idea of a scheduled batch job and a relational table you can
DELETEfrom andINSERTinto
Pipelines as DAGs
Everything starts with the graph. A task is a unit of work — pull a file, run a transform, load a table. A dependency is an edge: this task may not start until that one has succeeded. Wire tasks and edges together and you have a directed acyclic graph, and each word earns its place. Directed: dependencies point one way, so the system knows extract comes before transform. Acyclic: no loops, because a task that depended on itself through a cycle could never start — the orchestrator rejects cycles at definition time, before they deadlock anything. Graph: tasks are nodes, dependencies are edges, and the whole pipeline is a structure the orchestrator can read, reason about, and walk.
In Airflow’s modern decorator syntax the graph is almost invisible — you write ordinary Python functions and the data flow is the dependency. Calling one task with another’s return value declares the edge:
# The dependency graph is implied by the data flow: transform() consumes
# what extract() returns, so the scheduler knows extract must finish first.
@dag(schedule="0 2 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def events_etl():
@task
def extract() -> str: ... # returns a path to the raw data
@task
def transform(path: str) -> str: ...
@task
def load(path: str) -> None: ...
load(transform(extract())) # edges: extract → transform → loadOnce the graph exists, the scheduler brings it to life. On a schedule or trigger it creates a run — one execution of the whole DAG for one logical time window — then walks the graph: it finds root tasks whose dependencies are satisfied, dispatches them to workers, and as each succeeds it unlocks the tasks downstream. Independent branches run in parallel automatically, because the scheduler sees they don’t depend on each other. This is the first thing cron cannot do: cron knows only “it is now 3am,” never “task B’s inputs are ready.” The scheduler’s view of the graph is the entire difference between a pipeline that coordinates itself and a column of timestamps hoping to stay in sync.
A practical corollary: tasks run in separate processes, often on separate machines, so they cannot pass a 500MB DataFrame in memory from one to the next. The discipline — true across all three tools — is to pass references, not payloads: a task writes its output to durable storage (S3, a warehouse table) and hands the next task a path or key, a few bytes of metadata. Airflow enforces this with a sharp edge called XCom, a small key-value channel backed by the metadata database; push a row count or an S3 path and you’re fine, push a DataFrame and you’ll bloat the database and learn the limit the painful way. The rule generalizes: the graph carries control and small metadata; the data itself lives in storage the tasks share.
Idempotency and retries
A pipeline that runs perfectly on its first try every day does not exist. Sources time out, networks blip, a worker gets OOM-killed mid-transform. So the orchestrator’s core move in the face of failure is the retry: when a task fails, wait and run it again, usually with exponential backoff so a struggling upstream service isn’t hammered. Most failures in data engineering are transient, and three attempts with growing delays quietly absorbs the vast majority of 3am incidents without waking anyone. Critically, the orchestrator retries only the failed task, not the whole pipeline — the successful extract and transform stay done, only the flaky load runs again. That selectivity is impossible when your unit of work is an entire shell script.
But retries are only safe if re-running a task is harmless, and that is exactly where the Tuesday disaster lived. Its load step appended rows:
# Non-idempotent: every run — including every retry — appends again.
# One retry after a partial failure and the day's data is double-counted.
def load(df, engine):
df.to_sql("daily_metrics", engine, if_exists="append")The fix is idempotency — engineering each task so that running it twice produces the same result as running it once. For a date-partitioned load, the canonical pattern is delete-then-insert (or its single-statement cousin, upsert): before writing the day’s data, remove whatever is already there for that day. Now the table converges to one correct state no matter how many times the task runs:
# Idempotent: scope the write to one logical date and clear it first,
# so a retry or a deliberate re-run lands the same single correct result.
def load(df, engine, run_date):
with engine.begin() as conn:
conn.execute(text("DELETE FROM daily_metrics WHERE date = :d"),
{"d": run_date}) # clear this partition
df.to_sql("daily_metrics", engine, if_exists="append") # then write itThe deeper principle is partitioned, incremental processing: rather than reprocessing the whole table wholesale, divide it along a dimension — usually date — and make each task responsible for exactly one partition, derived from the run’s logical time, never the wall clock. Dagster makes this first-class with partition definitions, so each day is an independently materializable unit and the UI shows which partitions are fresh, stale, or failed. But you don’t need a framework feature; you need the habit of scoping every write to a partition key. Once you have it, three things become free: failed partitions re-run surgically, only new data is processed each run, and — the next section’s payoff — history replays safely. Idempotency and partitioning are the twin properties that turn an orchestrator’s recovery machinery from a liability into a superpower.
Backfills and scheduling
Here is the subtlety that trips up everyone new to orchestration: a scheduled run is named for the interval it covers, not the instant it fires. A daily pipeline starting January 1st produces its first run at the end of January 1st — because only then does a full day of data exist to process. Airflow calls this the data interval; the run’s logical date (its ds, its execution_date in older parlance) is the interval it owns, and it is the value every idempotent task should key on. Your load deletes-and-inserts for the run’s date, not for today(). That is what lets the same code run for January 1st today and for a date last year during a backfill, and do the right thing both times. Hardcode today() and you’ve quietly broken the ability to reprocess history — the single most important capability the data interval gives you.
Which brings us to backfilling: running the pipeline over a range of past intervals to fill or repair history. New pipeline with two years of data to load? Backfill, one daily interval at a time. Transform bug that poisoned last month? Fix the code and backfill that month — and because each task is idempotent and partitioned, replaying those thirty runs overwrites the bad partitions rather than duplicating anything. This is the clean ending the Tuesday incident never got: with idempotent, date-scoped tasks, repairing the bad day is a one-command backfill of a single interval, not a manual DELETE-and-pray. Asset-aware tools lean in hardest — Dagster’s backfills are partition-aware from the UI, letting you rematerialize exactly the partitions you want.
The companion knob is catchup, and it is the most dangerous default in the orchestration world. Catchup controls what happens when you deploy a DAG whose start date is in the past: with catchup on, the scheduler immediately tries to run every interval between the start date and now.
A team shipped a new daily DAG with start_date set, reasonably enough, to the start of the data — about six months back — and left catchup at its default of True. On deploy, the scheduler did exactly what it was told: it queued roughly 180 backfill runs all at once, each opening connections to the same upstream operational database. The source’s connection pool exhausted in seconds, the database fell over under the synchronized load, and the “new pipeline” took down a system other live services depended on. Two lessons travel together. First, set catchup deliberately — almost always False for a fresh deploy — and when you do want history, run it as a throttled backfill with a cap on concurrent runs, not a stampede. Second, the stampede was survivable only because the tasks were idempotent; had they appended, those 180 simultaneous runs would have triple-loaded every row they touched on top of overloading the source. Idempotency lets you treat a botched backfill as “run it again, correctly” rather than “now also clean up the mess.”
Build it → Orchestrated ingest-to-serve pipelines in practice: Project 07: Data Lakehouse stands up a partitioned, backfill-aware lakehouse with idempotent table writes, and Project 10: Warehouse Semantic Layer shows scheduled transforms feeding a modeled, queryable serving layer downstream.
The orchestrator landscape
The three best-known orchestrators are not three answers to one question; they are three framings of what a pipeline is. Reading them as examples of the concepts above beats memorizing any one’s API.
Airflow is the incumbent task-scheduler — the system most teams picture when they hear “orchestration,” with the deepest ecosystem of pre-built operators for S3, Redshift, EMR, Spark, and nearly everything else a data team touches. Its model is squarely task/schedule-centric: define a DAG of operators, a cron-like schedule, and dependencies, and the scheduler runs the tasks. Its strengths — maturity, scale, operator coverage, first-class backfills — are exactly what operator-heavy batch ETL wants. Its costs are operational weight (a scheduler, a metadata database, workers) and a model that knows about tasks, not data, so lineage and freshness are bolted on rather than free.
Dagster optimizes for the asset-aware model. Its central abstraction is the software-defined asset: instead of “run this task,” you declare “this dataset exists, and here is the function that produces it,” with dependencies inferred from which assets a function consumes. The payoff is everything that flows from making data first-class — a queryable lineage graph, freshness the system tracks, partitions and backfills built into the model, strong typing, and a dagster dev story that runs the whole graph on your laptop. It is strongest when what data exists and how fresh it is matters more than the bare fact that a task ran.
Prefect optimizes for dynamic, Python-native flows. Its model is the lightest: any function becomes a task or flow with a decorator, with no explicit DAG object to construct — the graph emerges from ordinary control flow, including loops and conditionals evaluated at runtime. That fits pipelines whose shape isn’t known until execution (process whatever files landed), teams wanting minimal ceremony between a working script and a deployed pipeline, and hybrid execution where the same flow runs locally, on Kubernetes, or on a cloud worker pool. The trade: an implicit, dynamic graph is harder to reason about statically than an explicit one.
There is no winner, only fits. Heavy operator needs and an existing investment point to Airflow; data-product thinking, lineage, and freshness SLAs to Dagster; dynamic workflows and Python-native simplicity to Prefect. What they share matters more than what divides them — all three give you the DAG, the scheduler, the retries, the backfills, and the observability the crontab never could.
Observability and alerting
A pipeline you cannot see is a pipeline you cannot trust, and the Tuesday disaster was, at bottom, an observability failure: the data was wrong for hours and nothing said so. The fix is to treat every pipeline run as a first-class event with a state — queued, running, succeeded, failed, retrying — recorded, timestamped, and inspectable. This is what an orchestrator’s run database and UI give you and a crontab cannot: not “did the process exit zero” but “did the pipeline do what it was supposed to, and if not, where did it stop.” Every orchestrator here makes runs browsable, surfaces per-task logs, and colors the graph by state so a failure is a red node, not a mystery.
On top of that visibility sits alerting — turning a state change into a human’s attention. The pattern is the same across tools: register a callback (Airflow’s on_failure_callback, Prefect’s state handlers, Dagster’s sensors and hooks) that fires on failure and routes a message to Slack, PagerDuty, or email with enough context — which DAG, which task, which date, a link to the logs — to start debugging from the alert alone. The asset-aware tools add a sharper question: not just “did a task fail?” but “is this dataset stale?” — a freshness SLA that alerts when data hasn’t been updated within its promised window, catching the silent-staleness mode that killed the dashboard even though no task errored. Pipeline observability isn’t a separate discipline from production monitoring; it is the same practice of watching runs as events, applied to data, and it ties directly to the broader observability material in Part IV.
Practical exercise
Difficulty: Level I · Level II · Level III
Level I — Model a three-step ETL as a DAG. In the orchestrator of your choice, express a pipeline of three tasks — extract, transform, load — with explicit dependencies (
extract → transform → load) and a retry policy of three attempts with a delay between them. Run it, then make the transform raise on its first invocation and confirm from the run’s UI/logs that the orchestrator retried only the transform and did not re-run extract.Level II — Make a non-idempotent step idempotent, then backfill. Start with a load task that
appends rows. Demonstrate the bug: run it twice for the same logical date and show the duplicates. Then refactor it to scope its write to the run’s data interval — delete-then-insert (or upsert) by date — so re-running produces one correct copy. Finally, run a throttled backfill over a five-day historical range (cap concurrent runs) and verify each day’s partition is correct and singular. Note why the backfill would have been unsafe before your fix.Level III — Design an asset-aware pipeline with freshness SLAs and alerting. Design (and if you can, build) a small pipeline as assets rather than tasks: declare the datasets that should exist, let dependencies be inferred from what each asset consumes, and attach a freshness expectation (“no more than 24 hours stale”) plus a failure/staleness alert routed to a channel with run context. Then argue: for this pipeline, asset-aware or task-centric, and why? Name the concrete property (lineage, freshness, selective refresh, operator ecosystem, dynamic shape) that decides it.
Summary
Orchestration turns a fragile chain of scripts into a managed, observable, recoverable pipeline. The reframe at its center: a pipeline is a DAG of tasks with dependencies, and an orchestrator’s job is to walk that graph in the right order — on a schedule or a trigger — with retries, backfills, idempotency, and observability built into the substrate. Retries recover the transient failures that are inevitable at scale, but they are only safe because tasks are idempotent: scoped to a data interval, writing with delete-then-insert or upsert so running twice equals running once. That same idempotency, with partitioned processing and the data-interval concept, makes backfilling history a routine one-command operation rather than a manual rescue. The field is shifting from the task/schedule-centric model (Airflow: “run this at 2am”) to the asset/data-aware model (Dagster, Prefect: “this dataset should exist and be fresh”) — declaring the data you want, not just the steps. And across all of it, treating runs as observable, alertable events is what keeps a 3am failure from becoming a silent, downstream-poisoning catastrophe.
Key takeaways
- A pipeline is a DAG; the orchestrator’s scheduler walks it in dependency order, which is the thing cron structurally cannot do — cron knows clock times, not whether inputs are ready.
- Retries recover transient failures and re-run only the failed task — but a retry on a non-idempotent task is a data-corruption coin-flip, so idempotency comes first.
- Idempotency means re-running a task produces the same result; achieve it by partitioning on the run’s data interval and writing with delete-then-insert or upsert, never append.
- Backfills replay history one interval at a time and are safe because tasks are idempotent;
catchupis the dangerous default — set it deliberately and throttle real backfills. - The model is shifting from task/schedule-centric (Airflow) to asset/data-aware (Dagster, Prefect): declare the data you want and its freshness, not just the steps.
- Pipeline runs are observable events with states; alert on failures and on staleness so silent corruption surfaces before it poisons downstream data.
Connections to other chapters
- The Data Engineering Landscape (Part II opener, prerequisite): orchestration sequences the data lifecycle that chapter frames — the ingest → transform → serve flow doesn’t run itself; the orchestrator is the foreman that makes it happen on schedule, in order, and recoverably.
- Data Processing Engines (sibling): the heavy transforms an orchestrator triggers — the Spark jobs, the warehouse queries — run on the engines covered there. Orchestration decides when and in what order; the engine does the processing. Partners, not rivals.
- Data Quality (sibling): the
validatenode in our DAG is a quality gate, and orchestration is how quality checks become enforced steps rather than good intentions — a failing check fails the run, blocks downstream tasks, and fires an alert. - CI/CD (Part IV, extension): pipelines are code — DAGs, asset definitions, flows in a repo — and deploy through the same CI/CD machinery as any other code, with tests, review, and versioned releases. The idempotency you build for retries also makes a redeploy safe.
- Observability (Part IV, extension): pipeline runs are events you watch, and this chapter’s alerting is one application of the broader observability practice there — turning a state change into a human’s attention before users notice the breakage.
Further reading
Essential
- Apache Airflow documentation — Core Concepts (DAGs, Tasks, Scheduling) — the canonical reference for the task/schedule-centric model, data intervals, catchup, and backfills.
- Dagster documentation — Software-Defined Assets — the clearest articulation of the asset/data-aware model, partitions, and freshness, and the best way to internalize the task-vs-asset distinction.
- Prefect documentation — Flows, Tasks, and Deployments — the Python-native, dynamic-flow framing, and a gentle on-ramp to retries, caching, and scheduling.
Deep dives
- Reis & Housley, Fundamentals of Data Engineering — the “Orchestration” treatment places pipelines inside the full data lifecycle and is the best book-length grounding for this chapter’s concepts.
- The software-defined asset (Dagster’s foundational essays) — the argument for declaring data products rather than tasks, and why lineage and freshness fall out of it for free.
Historical context
- Airflow’s origin at Airbnb (Maxime Beauchemin’s writing on its design) — why a data-engineering team built a programmable, code-defined scheduler when cron and GUI schedulers ran out of road, and the “configuration as code” principle that shaped a generation of orchestrators.
- Luigi (Spotify) — the earlier task-and-target dependency framework whose ideas Airflow built on, useful for seeing how the DAG-of-tasks abstraction crystallized.