Data Quality & Testing
data quality, data validation, great expectations, soda, data contracts, data testing, lineage, freshness, schema drift, data observability
Introduction
The number on the executive dashboard had been wrong for three weeks before anyone noticed, and by then a hiring plan had been built on it. Daily active users had “grown” thirty percent overnight, and nobody questioned good news. The pipeline that produced the figure had run green every morning — every Airflow task succeeded, every log line said done, every alert stayed silent. What had actually happened was small and invisible: an upstream service team, cleaning up their event schema, had changed a field that used to report session minutes to report session seconds, and a downstream join started multiplying a metric by sixty. The code was correct. The SQL was correct. The job ran perfectly. The data was garbage, and “the job didn’t crash” had been quietly mistaken for “the data is right.”
This is the failure mode that catches teams who have done everything else well. They have tests — unit tests on every transformation, integration tests on every load — and all of them pass, because all of them assert things about code. None of them assert anything about the values flowing through that code. A test that checks calculate_dau() returns the right number for a fixed input fixture will pass forever, no matter what the real upstream sends tomorrow. The pipeline’s definition of success was operational (“the process completed”) when the definition that mattered was semantic (“the numbers are true”). Closing that gap is what this chapter is about: data, unlike code, needs tests that run against the data itself, continuously, in production.
The Core Insight
Testing code and testing data are different problems, and the difference is not one of degree. Code is static and yours: you write it, you control it, and once it is correct it stays correct until someone edits it. A unit test pins behavior at a single moment and that behavior does not drift on its own. You can test code once, before you ship, and trust the result.
Data has none of those properties. It is mutable, external, and time-degrading. You do not write the data — it arrives from systems other teams own, vendors change without telling you, and sources break in ways your code cannot see. It changes every hour: schemas drift as an upstream adds or renames a column, distributions shift as the business or the world changes underneath you, volumes spike and collapse, NULLs creep in where they never used to, a retried load doubles a metric. Yesterday’s perfectly valid dataset says nothing about today’s. So you cannot test data once. You must continuously validate it as it flows — asserting expectations about its schema, ranges, uniqueness, freshness, volume, and referential integrity at the boundaries where it enters and moves through your system, and failing loudly when an expectation breaks, ideally before the bad data reaches anyone who would act on it. Quality is not a property you verify at build time and forget. It is a property you enforce continuously, every batch, forever, because the thing you are checking will not hold still.
A mental model
Three images make the rest of this chapter concrete. First, expectations are unit tests for data. Where a code test asserts “this function returns 42 for this input,” a data expectation asserts “this column is never null,” “this row count is within ten percent of yesterday’s,” “every order_id is unique,” “this timestamp is no more than an hour old.” They are small, declarative, named assertions about a dataset’s properties, and like unit tests they are most valuable when they fail — telling you exactly which invariant broke and where.
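To make the first image concrete, here is a minimal sketch of expectations as small, named assertions over a batch of rows. It uses plain Python rather than any framework: `Expectation`, `run_suite`, and the sample rows are hypothetical names chosen for illustration, not part of Great Expectations or Soda.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict  # one record of the batch, e.g. {"order_id": 1}


@dataclass(frozen=True)
class Expectation:
    """A named, declarative assertion about a whole batch (hypothetical helper)."""
    name: str
    check: Callable[[list], bool]


def run_suite(rows: list, suite: list) -> list:
    """Run every expectation and return the names of the ones that failed."""
    return [e.name for e in suite if not e.check(rows)]


suite = [
    Expectation("order_id is never null",
                lambda rows: all(r.get("order_id") is not None for r in rows)),
    Expectation("order_id is unique",
                lambda rows: len({r.get("order_id") for r in rows}) == len(rows)),
]

batch = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}]
failed = run_suite(batch, suite)  # only the uniqueness expectation fails here
```

Because each expectation carries a name, the result reads as a list of broken invariants rather than a bare stack trace, which is the property the unit-test analogy depends on.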
Second, the validation gate is a circuit breaker. A circuit breaker stops current before a fault propagates and burns down the house. A validation gate sits in the pipeline and runs the expectations against each batch; if the batch fails, the gate quarantines it — holds it, does not load it — so the fault never spreads to the tables your dashboards and models read. Bad data caught at the gate is an incident contained; bad data loaded past the gate is an incident discovered three weeks later by an executive.
Third, data observability is monitoring applied to datasets. The same way you watch a service’s latency, error rate, and saturation, you watch a dataset’s freshness (did it arrive on time?), volume (is the row count normal?), schema (did the columns change?), and distribution (did the values shift?). Expectations catch the failures you thought to write down; observability catches the ones you didn’t — the anomaly no fixed rule anticipated.
What to validate, and where
Quality is usually described in six dimensions, and naming them helps you decide what to assert. Accuracy: do the values reflect reality? Completeness: are required fields populated, or are NULLs creeping in? Consistency: do related facts agree across tables — does the summary’s order count match the orders table? Timeliness: is the data fresh enough for the decisions riding on it? Validity: do values conform to format and domain — emails that look like emails, statuses from the allowed set? Uniqueness: are there duplicates where a key should be one-to-one? Every check serves one of these; if you can’t say which, you probably don’t need it.
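Consistency is the dimension that is least obvious to express as a check, so a small sketch may help. Assuming a hypothetical daily summary that stores an order count per day, the function below recomputes the count from the detail rows and reports the days where the two disagree. The function name and record shapes are illustrative only.

```python
from collections import Counter


def inconsistent_days(orders: list, summary: dict) -> list:
    """Return the days whose summary count disagrees with the detail rows."""
    actual = Counter(o["order_date"] for o in orders)
    days = set(actual) | set(summary)
    return sorted(d for d in days if actual.get(d, 0) != summary.get(d, 0))


orders = [
    {"order_id": 1, "order_date": "2024-05-01"},
    {"order_id": 2, "order_date": "2024-05-01"},
    {"order_id": 3, "order_date": "2024-05-02"},
]
summary = {"2024-05-01": 2, "2024-05-02": 5}  # the second day is overstated

mismatches = inconsistent_days(orders, summary)
```

The check looks at the union of days from both sides, so a day that is missing from either the summary or the detail table is reported as well.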
The where is as important as the what, and the rule is simple: validate at boundaries. The two that matter most are ingestion — the moment external data enters your system, where you catch a broken source before it touches anything — and transformation outputs, where you catch your own logic mangling good input into bad output. Figure 33.1 shows the canonical placement: a gate immediately after ingestion, with passing data flowing downstream and failing data peeled off into quarantine.
The last decision per check is block versus warn. A blocking check stops the pipeline and quarantines the batch; a warning logs and alerts but lets the data through. Block the things that are catastrophic and unambiguous — a missing primary key, a schema that no longer matches, a required field gone null. Warn on the things that are suspicious but survivable — a distribution that drifted a little, an optional field’s null rate creeping up. Get this calibration wrong in the strict direction and you train the team to ignore a pipeline that cries wolf; get it wrong in the loose direction and the gate becomes decoration. The art of a quality strategy is mostly the art of setting these thresholds.
What you’ll learn
- Why testing data is a fundamentally different problem from testing code, and what that difference forces you to do differently
- How to express data quality as declarative expectations — schema, not-null, range, uniqueness, freshness, and volume checks — using tools like Great Expectations and Soda
- How data contracts push quality upstream by making producers guarantee a schema and semantics to their consumers
- Where to place validation gates in a pipeline, how to quarantine a failing batch, and how to choose block-versus-warn per check
- What data observability monitors (freshness, volume, schema, distribution) and how lineage lets you trace impact and find root cause
- How to test the pipeline logic itself — the transformations — separately from the data flowing through it
Prerequisites
- Data Orchestration — how pipelines are structured as DAGs of tasks, since quality checks run as tasks and gates within those DAGs
- The Data Engineering Landscape — the data lifecycle (ingest, store, transform, serve) whose boundaries are where you place checks
- The shared testing mindset from Testing and Quality: fixtures, assertions, the testing pyramid. This chapter assumes you can write a test, as covered in that cross-language chapter, and focuses on what changes when the subject is data
Why testing data is different
Every practical decision in this chapter follows from one distinction. The Testing and Quality chapter teaches a mindset that transfers wholesale — small focused assertions, isolate the unit, run them automatically, treat a red test as a stop sign. What does not transfer is the assumption underneath it: that the thing you test is deterministic and yours.
A function is a closed world. Same input, same output; you own both. When a unit test fails, something changed in your code, and the diff will show you what. Data is an open world: the “input” is produced by systems you don’t own, on schedules you don’t set, with quality you can’t enforce at the source. When a data check fails, nothing in your repository changed — the world did. That is why the testing pyramid, applied to a pipeline, leans harder on one rung: schema and contract checks become the dominant interface tests, because the schema is the interface to an upstream you cannot control, and it is the thing most likely to break silently.
The second consequence: code tests run at build time and you are done, but data checks must run at runtime, on every batch, forever, because the data they guard is replaced on every run. A green suite this morning is no evidence about the batch landing this afternoon. This is why data-quality tools are built to be scheduled and operational — they live in the pipeline, not just in CI — and why “the job succeeded” and “the data is valid” are two separate claims.
Expectations: declarative validation
The most leveraged idea in the field is to stop writing imperative validation by hand and start declaring what good data looks like. The brittle approach — scattering assert len(df) > 0 and assert 'user_id' in df.columns through your code — is hard to read, hard to reuse, and silent about everything you forgot to assert. The declarative approach names each property of the data you care about and lets a framework run, record, and report all of them. The two open-source tools that dominate this space are Great Expectations (Python-first, programmatic, richly documented) and Soda (YAML-first, SQL-native, fast to stand up), and although they differ in surface, they express the same vocabulary.
That vocabulary is small and worth knowing by name, because it maps onto the quality dimensions above. A schema check asserts the columns exist, in the right types — the structural floor. A not-null (completeness) check asserts required fields are populated, often with a tolerance: zero nulls in a primary key, perhaps five percent in an optional field. A range / validity check asserts values fall in bounds or an allowed set — ages between 0 and 120, status in {pending, shipped, delivered}, emails matching a pattern. A uniqueness check asserts no duplicates in a key, single or compound. A freshness check asserts the newest record is recent enough — the most data-specific check of all, with no analogue in code testing. And a volume check asserts the row count is sane, often relative to history, since “within ten percent of the trailing average” catches a half-loaded batch that “more than zero rows” never would.
In Soda’s declarative YAML, a batch of these reads almost like a specification a non-engineer could review. Note that each check carries a name (so a failure is legible in an alert) and that some checks split into graduated thresholds:
    # Soda: an ingestion gate for the orders table, expressed declaratively.
    checks for orders:
      - row_count > 0:
          name: Orders table has data
      - freshness(created_at) < 1h:
          name: Orders arrived within the last hour
      - missing_count(order_id) = 0:
          name: "[CRITICAL] order_id never null (primary key)"
      - duplicate_count(order_id) = 0:
          name: "[CRITICAL] order_id is unique"
      - invalid_count(status) = 0:
          valid values: ['pending', 'shipped', 'delivered']
          name: status drawn from the allowed set
      - duplicate_percent(transaction_id):
          warn: when > 0.1%   # investigate
          fail: when > 1%     # block the pipeline
          name: transaction duplicate rate

Great Expectations expresses the same intent through a Python validator, suiting teams in pandas and Spark who want programmatic, version-controlled suites. The shape is identical (named assertions about properties), and the choice between the tools is mostly register: YAML for SQL-first analysts who want checks the whole team can read, Python for engineers who need custom logic and DataFrame-native validation. Freshness and volume deserve emphasis precisely because they have no code-testing equivalent: they assert when the data arrived and how much there is, the two signals that most reliably reveal a broken source while every schema check still passes.
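Freshness and volume can be sketched in a few lines of plain Python. This is an illustration of the logic only, not the Great Expectations or Soda API: `is_fresh`, `volume_ok`, the one-hour limit, and the ten-percent tolerance are assumptions chosen to mirror the examples in the text.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def is_fresh(newest: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the newest record must be no older than max_age."""
    return now - newest <= max_age


def volume_ok(row_count: int, history: list, tolerance: float = 0.10) -> bool:
    """Volume: today's count must be within tolerance of the trailing average."""
    baseline = mean(history)
    return abs(row_count - baseline) <= tolerance * baseline


now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
newest = datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc)  # 2.5 hours old
fresh = is_fresh(newest, now, max_age=timedelta(hours=1))

history = [10_000, 10_400, 9_800, 10_200]  # trailing daily row counts
half_loaded = volume_ok(5_100, history)    # would pass "row_count > 0"
normal = volume_ok(10_300, history)
```

Here the stale batch and the half-loaded batch both fail, even though each would satisfy a schema check and a bare "more than zero rows" check.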
A team guarding a critical events table had exactly one quality check on it: row_count > 0. It stayed green for months. Then an upstream schema migration quietly changed a column the table depended on, and one field went one hundred percent null overnight — but the rows still arrived, so the count check never blinked. Downstream models that keyed on that field silently degraded; a recommendation surface served garbage for over a week before a customer complaint surfaced it. Row count is a necessary check and a wildly insufficient one. It tells you data showed up, not that it is correct. The fix was not a better row-count check; it was adding not-null and validity checks on the columns the consumers actually depended on, plus a distribution monitor that would have flagged the null spike on day one. The lesson generalizes: a green check only protects you against the failure it was written to catch, and the failures that hurt are the ones nobody wrote a check for.
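The incident is easy to reproduce in miniature. The sketch below builds a batch shaped like the one in the story, with every row present but one field entirely null, and compares what the row-count check and a null-rate check each report. The column names and the one-percent tolerance are hypothetical.

```python
def null_rate(rows: list, column: str) -> float:
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0  # an empty batch is a volume problem, not a null problem
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)


# Rows still arrive, but the field the consumers key on is entirely null.
batch = [{"event_id": i, "item_id": None} for i in range(1000)]

row_count_passes = len(batch) > 0          # the only check the team had
item_null_rate = null_rate(batch, "item_id")
not_null_passes = item_null_rate <= 0.01   # hypothetical 1% tolerance
```

The row-count check stays green while the null-rate check fails at one hundred percent, which is the gap the team in the story fell into.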
Data contracts: shifting quality upstream
Every check so far is defensive — the consumer inspecting data after it arrives, hoping to catch what the producer broke. A data contract flips the posture. It is an explicit, versioned agreement in which the producer guarantees the schema and semantics of what it emits: these columns, these types, these meanings, this nullability, this is what status can be, this is what the units are. The consumer codes against the contract; the producer is on the hook to honor it or to version it deliberately when it must change.
The value is moving the catch upstream, to the cheapest place to fix a problem. The dashboard disaster that opened this chapter — minutes silently becoming seconds — is exactly a contract violation: a producer changed a field’s semantics without telling anyone, and no downstream schema check could see it, because the column kept the same name and numeric type. A contract makes that a breaking change at the boundary: the producer’s CI sees the field’s documented unit no longer matches the agreement, and the change is blocked or deliberately versioned before it ships, rather than surfacing three weeks later in a hiring plan.
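One way to picture the producer-side check is a contract that records units alongside types, compared against the proposed schema in CI. The sketch below is an assumption about how such a check could look, not a real contract tool: `FieldSpec`, `breaking_changes`, and the `session_length` field are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """One contract field: a type plus the semantics a type check cannot see."""
    dtype: str
    unit: Optional[str] = None
    nullable: bool = False


def breaking_changes(agreed: dict, proposed: dict) -> list:
    """List the ways a proposed schema breaks the agreed contract."""
    problems = []
    for name, spec in agreed.items():
        new = proposed.get(name)
        if new is None:
            problems.append(f"{name}: field removed")
        elif new.dtype != spec.dtype:
            problems.append(f"{name}: type {spec.dtype} -> {new.dtype}")
        elif new.unit != spec.unit:
            problems.append(f"{name}: unit {spec.unit} -> {new.unit}")
        elif new.nullable and not spec.nullable:
            problems.append(f"{name}: became nullable")
    return problems


agreed = {"session_length": FieldSpec("float", unit="minutes")}
proposed = {"session_length": FieldSpec("float", unit="seconds")}

problems = breaking_changes(agreed, proposed)
```

A CI step could fail the producer's build whenever the list is non-empty. The name and type are unchanged in this example, so only the recorded unit exposes the break.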
In practice a contract is often a schema with constraints and metadata, checked in CI on both sides, with a mismatch failing the build:
    # A contract as a checkable schema (pandera): types, nullability, value constraints.
    from pandera import Check, Column, DataFrameSchema

    orders_contract = DataFrameSchema({
        "order_id": Column(int, nullable=False, unique=True),
        "customer_id": Column(int, nullable=False),
        "total_amount": Column(float, nullable=False,
                               checks=Check.greater_than_or_equal_to(0)),
        "status": Column(str, nullable=False,
                         checks=Check.isin(["pending", "shipped", "delivered"])),
    })

Contracts and gates are not competitors; they are complementary layers. The contract keeps a producer from shipping a breaking change unannounced; the ingestion gate catches the breakage the contract missed: a source that violates its own contract, a vendor with no contract at all, a corruption introduced in transit. Defense at the boundary protects you from the world; contracts upstream shrink how much of the world can hurt you in the first place.
Where to enforce, and block versus warn
A check is only as useful as its placement and its consequence. Placement, as a rule, means boundaries — and the highest-value boundary is ingestion, because data is cheapest to reject before it has fanned out into a dozen downstream tables. The second is each transformation’s output, where you verify your own logic didn’t turn valid input into invalid output (the cancelled-order that still counts as revenue, the join that exploded row counts). Placing a gate inside the pipeline DAG is, mechanically, just another task: extract, then validate, then transform, then validate, then load. If the validate task fails, the tasks downstream of it never run — which is the whole point.
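The ordering described above can be sketched without any orchestrator. In this minimal illustration, plain functions stand in for DAG tasks and a raised exception stands in for a failed task; `ValidationError`, `validate`, and `run_pipeline` are hypothetical names, and a real scheduler would express the same dependencies in its own terms.

```python
class ValidationError(Exception):
    """Raised by a gate when a batch fails a blocking check."""


def validate(rows: list, stage: str) -> list:
    """Gate: pass the batch through unchanged, or raise to halt the run."""
    if any(r.get("order_id") is None for r in rows):
        raise ValidationError(f"{stage}: order_id is null")
    return rows


def run_pipeline(extract, transform, load) -> str:
    """extract -> validate -> transform -> validate -> load, in that order."""
    try:
        raw = validate(extract(), stage="ingestion")
        shaped = validate(transform(raw), stage="transform output")
    except ValidationError as err:
        return f"halted: {err}"  # nothing downstream of the gate has run
    load(shaped)
    return "loaded"


loaded = []
bad_source = lambda: [{"order_id": 1}, {"order_id": None}]
status = run_pipeline(bad_source, transform=lambda rows: rows, load=loaded.extend)
```

With the bad source, the run halts at the ingestion gate and `loaded` stays empty: the load step is never reached.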
The consequence of a failed check is the quarantine-versus-warn decision, where strategy lives. A blocking check stops the run and quarantines the batch — moves it aside, holds it, does not load it — so consumers keep reading the last known-good data instead of fresh garbage. This is the circuit breaker: a bad load is prevented, not merely reported. A warning check logs, alerts, and lets the data through, right for soft signals where blocking would do more harm than the anomaly. Graduated thresholds (the Soda warn/fail pair above) let one check do both: warn at “a human should look,” fail at “do not let this through.”
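A graduated threshold and a quarantine step fit in a short sketch. It assumes batches arrive as files and that "live" and "quarantine" are directories; both are assumptions for illustration, as are the function names. The default thresholds mirror the 0.1% and 1% values in the Soda example.

```python
import shutil
import tempfile
from pathlib import Path


def severity(duplicate_pct: float, warn_at: float = 0.1, fail_at: float = 1.0) -> str:
    """Graduated threshold: 'pass', 'warn' (alert, let through) or 'fail' (block)."""
    if duplicate_pct > fail_at:
        return "fail"
    if duplicate_pct > warn_at:
        return "warn"
    return "pass"


def gate(batch_file: Path, duplicate_pct: float,
         live_dir: Path, quarantine_dir: Path) -> str:
    """Promote the batch to live_dir unless the check fails; then quarantine it."""
    outcome = severity(duplicate_pct)
    target = quarantine_dir if outcome == "fail" else live_dir
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(batch_file), str(target / batch_file.name))
    return outcome


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    batch = root / "orders_batch.csv"
    batch.write_text("order_id\n1\n1\n")  # every order_id duplicated
    outcome = gate(batch, duplicate_pct=50.0,
                   live_dir=root / "live", quarantine_dir=root / "quarantine")
    quarantined = (root / "quarantine" / batch.name).exists()
    went_live = (root / "live" / batch.name).exists()
```

On a failure the live directory is never touched, so consumers keep reading whatever was last promoted there.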
The failure mode on each side is real. Over-blocking — a hard fail on every soft statistical check — pages at 3 a.m. for a two-percent distribution wobble, and a team that auto-acks the page is no better protected than one with no checks. Under-blocking — warning on everything, blocking on nothing — is a gate that has never stopped a bad load, which is no gate. Calibrate by consequence: block the catastrophic and unambiguous (no primary key, schema changed, required field null), warn on the suspicious-but-recoverable, and revisit the thresholds when reality proves the defaults wrong.
Data observability and lineage
Expectations and contracts share a blind spot: they only catch what someone thought to write down. The dangerous failures are the ones nobody anticipated — a distribution that quietly skews, a join key whose cardinality changes, a source that starts arriving two hours late. Data observability is the practice of catching those by monitoring datasets the way you monitor services. The chapter on Observability (Part IV) frames the three pillars for systems; data observability is that same instinct turned on data, and it watches four signals in particular.
Freshness: when did this dataset last update, and is that within cadence? A table that should refresh hourly but hasn’t moved in six is broken even if every value is valid. Volume: is the row count in its normal range? A statistical test — a z-score against the trailing window — flags the half-loaded batch and the duplicate-load spike an absolute threshold misses. Schema: did columns, types, or nullability change? Automated drift detection — the alarm for the silent upstream migration. Distribution: did the values shift? A Kolmogorov–Smirnov test against history catches the units-changed disaster and the slow rot no schema check sees, because the schema is fine and only the meaning moved.
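The volume signal can be sketched with the standard library alone. The threshold of three standard deviations and the seven-day window below are assumptions for illustration; a real monitor would tune both and would likely account for weekly seasonality.

```python
from statistics import mean, stdev


def volume_z_score(today: int, trailing: list) -> float:
    """How many standard deviations today's row count sits from the trailing mean."""
    mu, sigma = mean(trailing), stdev(trailing)
    if sigma == 0:
        # A perfectly flat history: any change at all is infinitely surprising.
        return 0.0 if today == mu else float("inf")
    return (today - mu) / sigma


def is_volume_anomaly(today: int, trailing: list, threshold: float = 3.0) -> bool:
    """Flag counts more than `threshold` standard deviations from the mean."""
    return abs(volume_z_score(today, trailing)) > threshold


trailing = [10_000, 10_200, 9_900, 10_100, 9_800, 10_000, 10_050]
half_loaded = is_volume_anomaly(5_000, trailing)     # far below the window
double_loaded = is_volume_anomaly(20_100, trailing)  # a retried load doubled the batch
ordinary = is_volume_anomaly(10_150, trailing)       # within normal variation
```

An absolute rule such as "more than zero rows" would pass all three batches. The relative test flags the first two and leaves the third alone.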
    # Distribution drift as a statistical test: did today's values shift from history?
    from scipy import stats

    def distribution_shifted(current, historical, alpha: float = 0.05) -> bool:
        """KS test; True means the distributions differ enough to investigate."""
        _, p_value = stats.ks_2samp(current, historical)
        return p_value < alpha  # reject "same distribution" -> drift detected

The companion to detection is lineage: the graph of which datasets derive from which, which jobs produce and consume each table, and which dashboards and models read from the end of the chain. It answers the two questions an anomaly always raises. Downstream: this table is bad, so what else is now suspect? Which reports, features, and decisions inherit the corruption? That is impact analysis: what to quarantine and whom to notify. Upstream: this metric is wrong, so where did the rot enter? Which source, which transformation? That is root-cause analysis, turning a multi-hour hand-trace into walking a graph. Detection tells you something is wrong; lineage tells you how far it spread and where it started.
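Both lineage questions reduce to graph traversal. The sketch below assumes lineage is available as a simple adjacency mapping from each dataset to the datasets derived from it; the dataset names are hypothetical, and real lineage tools expose richer metadata than this.

```python
from collections import deque

# Edges point from a dataset to the datasets derived from it (hypothetical names).
LINEAGE = {
    "raw_events": ["sessions"],
    "sessions": ["daily_active_users", "engagement_features"],
    "daily_active_users": ["exec_dashboard"],
    "engagement_features": ["churn_model"],
}


def reachable(graph: dict, start: str) -> set:
    """Breadth-first walk: every node that can be reached from start."""
    seen = set()
    queue = deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph: dict) -> dict:
    """Flip every edge so the same walk travels upstream instead of downstream."""
    upstream = {}
    for parent, children in graph.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    return upstream


impact = reachable(LINEAGE, "sessions")                          # what is now suspect
root_candidates = reachable(invert(LINEAGE), "exec_dashboard")   # where it could have entered
```

Impact analysis walks the edges forward from the bad table; root-cause analysis walks the inverted graph backward from the wrong metric.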
Build it → A working data-observability system — freshness/volume/schema/distribution monitors, anomaly detection, and lineage for impact and root cause — is exactly Project 09: Data Observability, the direct analog of this section. For quality applied to ML systems specifically — where “bad data” means a drifted benchmark or a regressed evaluation set — Project 49: AI Benchmark Suite builds standardized, reproducible workloads and regression detection for the inference stack.
Testing the pipeline logic itself
Everything above guards the data. There is a parallel obligation that guards the code that moves it — and the two must not be conflated, because they fail for different reasons and are caught by different tests. The transformation logic — the function that computes revenue, the SQL that aggregates by category — is ordinary code, and it gets ordinary code tests, drawing directly on the mindset from Testing and Quality.
A pure transformation function is the easiest thing in the pipeline to test well: feed it a small handcrafted fixture, assert the output, cover the edge cases (empty input, nulls, boundary values). SQL transformations are testable the same way using an in-memory engine like DuckDB — seed a few rows, run the query, assert the result, with no warehouse required. These tests are fast, deterministic, and run in CI on every commit, because the code is deterministic and yours — exactly the property the data lacks.
    # A code test for a transformation: deterministic input, deterministic assertion.
    # sample_orders (a fixture builder) and revenue_by_category (the transformation
    # under test) are assumed to be defined elsewhere in the project.
    def test_cancelled_orders_excluded_from_revenue():
        orders = sample_orders(statuses=["completed", "completed", "cancelled"])
        revenue = revenue_by_category(orders)
        assert revenue.total == 300.00  # the cancelled order contributes nothing

The clean division of labor: code tests run once in CI and prove the transformation is correct; data checks run on every batch in production and prove the data flowing through it is valid. A pipeline needs both. Correct code over garbage input still produces garbage output; that was the opening disaster, where the code was flawless and the data was a lie. Valid data through broken code produces garbage too. Test the logic with the testing chapters’ tools; validate the data with this chapter’s. Neither substitutes for the other.
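The seed, run, assert pattern for SQL transformations can be sketched as follows. To keep the example free of dependencies it uses the standard library's `sqlite3` as a stand-in for DuckDB; the pattern is the same with either engine. The table layout and query are hypothetical.

```python
import sqlite3

REVENUE_BY_CATEGORY = """
    SELECT category, SUM(amount) AS revenue
    FROM orders
    WHERE status <> 'cancelled'
    GROUP BY category
    ORDER BY category
"""


def run_revenue_query(rows: list) -> list:
    """Seed an in-memory database with rows, run the query, return the result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (category TEXT, amount REAL, status TEXT)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(REVENUE_BY_CATEGORY).fetchall()
    finally:
        conn.close()


def test_cancelled_orders_excluded_from_revenue():
    rows = [
        ("books", 100.0, "completed"),
        ("books", 200.0, "completed"),
        ("books", 999.0, "cancelled"),  # must not count toward revenue
        ("games", 50.0, "completed"),
    ]
    assert run_revenue_query(rows) == [("books", 300.0), ("games", 50.0)]
```

Each test builds its own in-memory database, so tests stay isolated from one another and need no warehouse connection or credentials.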
Practical exercise
Difficulty: Level I · Level II · Level III
- Level I: Write expectations and run them as a gate. Take a real dataset (a CSV of orders, events, or users). Using Soda or Great Expectations, write a suite of expectations covering the core checks: not-null on the key, a range or allowed-set check on a categorical, uniqueness on the primary key, and a freshness check on a timestamp. Run the suite, deliberately corrupt one row to break each check in turn, and confirm the failure is named and legible. You should be able to read the output and know exactly which invariant broke.
- Level II: Add a quarantining gate and calibrate block-versus-warn. Insert the suite as a validation task in a small pipeline (extract → validate → load). Make a failing batch get quarantined (moved aside, not loaded) while the last known-good data stays live for consumers, and an alert fires. Then go through your checks and assign each one block or warn with a written justification: which failures are catastrophic enough to stop the pipeline, which are suspicious-but-survivable, and where you’d set a graduated warn/fail threshold. Defend your calibration against both failure modes: alert fatigue and a gate that never gates.
- Level III: Design a quality strategy for a multi-source platform. You own a platform ingesting from three upstreams: a partner API (no contract, changes without notice), an internal service (your team can negotiate a contract), and a nightly vendor file dump. Design the full strategy: where data contracts go and with whom, where validation gates sit and what they block, what data observability monitors across datasets, and how lineage supports root-cause when something slips through anyway. The deliverable is an argument, not a config. For each layer, name a class of failure it catches that the others miss, such as the silent semantic change a contract stops, the corrupt batch a gate quarantines, the unanticipated drift only observability sees, or the blast radius only lineage can trace. Show why no single layer is sufficient.
Summary
Data quality is the trust that makes the rest of the data lifecycle usable, and it cannot be bought with code tests alone. Testing data is a different problem from testing code because data is mutable, external, and time-degrading: it arrives from systems you don’t control, changes every hour, and degrades on its own. So instead of testing once at build time, you express quality as declarative expectations — schema, not-null, range, uniqueness, freshness, volume — and run them continuously, at the boundaries where data enters and moves, with a validation gate that quarantines bad batches before they reach consumers. Data contracts push the catch upstream by making producers guarantee schema and semantics. Data observability monitors freshness, volume, schema, and distribution to catch the anomalies no fixed expectation anticipated, and lineage traces impact and root cause when something slips through. Underneath it all, the transformation code still needs ordinary code tests — proving the logic is correct is a separate claim from proving the data is valid, and a pipeline needs both.
Key takeaways
- “The job succeeded” and “the data is correct” are different claims; a pipeline can run green for weeks while serving garbage. Check them separately.
- Code is static and yours; data is mutable, external, and degrades over time — so you can’t test data once, you must validate it continuously, every batch.
- Express quality as named declarative expectations (Great Expectations, Soda) across the dimensions — schema, completeness, validity, uniqueness, freshness, volume — not as hand-rolled asserts.
- Validate at boundaries (ingestion and transformation outputs); block the catastrophic and unambiguous, warn on the suspicious-but-survivable, and calibrate to avoid both alert fatigue and a gate that never gates.
- Contracts shift quality upstream; observability catches the unanticipated; lineage traces the blast radius. They are complementary layers, each catching a class the others miss.
Connections to other chapters
- Testing and Quality (prerequisite mindset): the testing instincts — fixtures, assertions, the pyramid, red-means-stop — taught comparatively across languages there transfer wholesale to data’s harder problem, but data forces them to run continuously in production rather than once in CI, and pushes the weight of the pyramid onto schema and contract checks.
- Data Orchestration (sibling): quality checks don’t run in a vacuum — they run as tasks inside the orchestrated DAG, and the validation gate is a node whose failure stops the downstream nodes. Orchestration is the machinery that makes “validate every batch” operational.
- Observability (Part IV, extension): data observability is observability turned on datasets. The three-pillars instinct for systems — watch the signals, alert on anomalies, trace the cause — becomes freshness/volume/schema/distribution monitoring plus lineage for data.
- The Data Engineering Landscape (foundation): quality is the trust that makes the whole lifecycle — ingest, store, transform, serve — actually usable. A warehouse full of data nobody trusts is a liability, not an asset; this chapter is how the data earns its trust.
Further reading
Essential
- Great Expectations documentation — the canonical reference for declarative expectations, suites, checkpoints, and data docs in a Python-first workflow.
- Soda Core documentation and the SodaCL reference — the YAML/SQL-native approach to declarative checks, freshness, and pipeline gating with exit codes.
Deep dives
- Reis & Housley, Fundamentals of Data Engineering — the data-quality and trust chapters place validation in the full lifecycle and argue why quality is an undercurrent of the whole discipline, not a bolt-on.
- Chad Sanderson’s writing on data contracts (the Data Products / data-contracts essays) — the canonical articulation of shifting quality upstream by making producers accountable to consumers.
Historical context
- Moses, Gavish & Vorwerck, Data Quality Fundamentals — the book that named and codified data observability, including the freshness/volume/schema/distribution/lineage framing this chapter uses.
- The pandera and dbt tests project documentation — two influential takes on schema-as-contract and SQL-native data testing that shaped how the field thinks about validating data structurally.