Data Warehousing & Modeling

Keywords

data warehouse, dimensional modeling, star schema, facts, dimensions, dbt, elt, sql optimization, columnar storage, olap

Introduction

The two dashboards both said “revenue,” and they disagreed by eleven percent. One was finance’s, the other growth’s, and each had written its own SQL against the raw operational tables — SQL that had quietly diverged. Finance counted an order when it was paid; growth counted it when it was placed. Finance excluded refunds; growth netted them out a week later. Neither query was wrong, exactly; they encoded two different definitions of the same word. And because every analyst queried the raw tables directly, there were as many definitions of “revenue” as there were people who had ever needed the number. When the CFO asked which figure was real, nobody could say. The warehouse had become a swamp: a hundred slightly different truths, none authoritative, all slow.

The slowness was its own crisis. The same warehouse held a web_events table of a billion rows, unpartitioned, and the most common dashboard query — “sessions by country, last 7 days” — scanned all billion of them every run, because the engine had no way to know only the last week’s rows mattered. It took ninety seconds when it didn’t time out. Analysts learned to run it before lunch.

These are the two failure modes of an undisciplined warehouse, with a shared root: analytical data stored and queried as if it were transactional data. A warehouse needs a deliberate model — an agreed-upon shape that encodes the business’s definitions once — and transformation discipline, so the model is built reproducibly and queried efficiently. Without both, it decays into exactly this: inconsistent answers, delivered slowly. This chapter is about the model and the discipline that keep a warehouse from becoming a swamp.

The Core Insight

The mistake underneath both dashboards is treating one database design as if it could serve two opposite jobs. A transactional database — the one behind the application — is optimized for writes, and its winning design is normalization: every fact stored exactly once, spread across many narrow tables joined by keys, so an update touches one place and can never contradict itself. That shape is correct for an application and miserable for analytics. Asking “revenue by product category by quarter” against a normalized schema means joining five or six tables and scanning millions of rows to answer a question that touches only a handful of columns.

Analytical data wants the opposite shape, because the workload is the opposite. Analysts don’t update single rows; they scan and aggregate — sum, count, group, slice — across enormous numbers of rows but only a few columns at a time. The insight behind the data warehouse is to stop forcing this read-heavy, aggregate-heavy workload through a write-optimized schema and instead design for the way analysts actually query. That redesign has three layers, and the rest of the chapter is each one in turn:

Model. Reshape the data with dimensional modeling — measurable events (facts) surrounded by the descriptive context that gives them meaning (dimensions), arranged in a star schema that an analyst can slice and aggregate with one or two joins instead of six.
Transform. Build that model with transformation-as-code — dbt’s approach of version-controlled SQL, automated tests, and generated documentation, so the definition of “revenue” lives in one tested model that every dashboard references rather than in a hundred copy-pasted queries.
Optimize. Run it on a columnar OLAP engine whose speed comes not from cleverness but from physics — storing each column separately, partitioning the data, and pruning the irrelevant parts before they’re ever read, so the billion-row scan becomes a million-row scan.

A mental model

Picture a star. At the center sits one fact table — the measurable events, say sales, each row a thing that happened with numbers attached: a quantity, an amount. Radiating outward are the dimension tables — date, customer, product, store — the who, what, when, and where that turn a bare number into a meaningful one. A sale of $49.98 is data; a sale of $49.98 to a loyalty member, of a kitchen product, at the Boston store, on a holiday is analysis. The star’s geometry is the point: because the descriptive context is denormalized into a few wide dimension tables, a query reaches the fact at the center, joins out to the dimensions it needs, and slices or aggregates — fast, with shallow joins, in a shape a business user can read. Figure 31.1 draws this shape.

The second analogy is about where the transformation happens. The old way was ETL — Extract, Transform, Load — where data was reshaped on a separate cluster before it landed in the warehouse, because warehouse compute was scarce. The modern way is ELT — Extract, Load, Transform — load raw data into the warehouse first, then transform it in place using the warehouse’s own compute. ELT won because cloud warehouses made that compute cheap and elastic.

The third analogy names what makes the transformation trustworthy: dbt is software engineering for SQL. Before dbt, analytics SQL was the last code on earth that ran without version control, tests, or modularity — the very absences that diverged the two dashboards. dbt treats a transformation as code: each model is a SQL SELECT checked into Git, models reference each other (so a fix cascades), every model can carry tests, and the dependency graph and documentation are generated automatically. You declare what the table should be; dbt figures out how to build it.

When to model, and how

The default for an analytical workload on a modern stack is a short, opinionated recipe, and you should reach for it before considering anything more exotic.

Model dimensionally — a star schema — for analytics. For read-heavy, aggregate-heavy business intelligence (dashboards, ad-hoc analysis, reporting), a star schema is the proven default; its denormalized dimensions keep joins shallow and queries legible. The alternatives have narrow niches: a fully normalized snowflake schema (dimensions split into sub-tables) saves storage that columnar compression already reclaims, at the cost of more joins; a Data Vault earns its complexity only when integrating many volatile source systems with full auditability; a single denormalized one big table is fine for a simple dashboard or an ML feature set. When in doubt, start with the star — and never model dimensionally for an OLTP workload, the write-optimized job that wants normalization.

Use ELT on a cloud warehouse as the default transformation pattern. Load raw, transform in-warehouse with SQL. Reach for an external engine (Spark, Flink) only when the transformation genuinely needs Python/Scala logic or true streaming — dbt is batch and SQL.

Figure 31.1 shows the target shape: the star you’re modeling toward, and the ELT layering that builds it.

What you’ll learn

Why analytical data wants a different physical shape than transactional data, and how the star schema delivers it
How to identify the fact, define its grain in one sentence, and surround it with the right dimensions
How slowly changing dimensions preserve history, so a sale is analyzed against the customer segment that was true at the time of the sale
Why ELT displaced ETL, and how the raw → staging → marts layering organizes a warehouse
How dbt brings version control, testing, lineage, and documentation to SQL — and how a single tested model becomes the one definition of a metric
Why columnar storage, partitioning, and predicate pushdown make a billion-row scan avoidable, and how to read a query plan to prove it
Where the warehouse sits relative to the data lake and the lakehouse

Prerequisites

SQL basics: SELECT, JOIN, GROUP BY, and aggregate functions — enough to read an analytical query (the SQL Basics material)
The data-engineering landscape: where the warehouse sits in the ingest → store → serve pipeline, and the difference between operational and analytical systems (the Data Engineering Overview material)
Comfort with the idea of a relational schema: tables, primary and foreign keys

Dimensional modeling: facts, dimensions, and grain

Dimensional modeling is the centerpiece, because the model is what makes everything downstream possible — a clean star is what dbt builds toward and what the columnar engine queries fast. The model has exactly two kinds of table, and the discipline is knowing which is which.

A fact table holds the measurable events of a business process: one row per thing that happened, carrying measures (the numbers) and foreign keys to the dimensions (the context). A dimension table holds the descriptive attributes — the who, what, when, where — that you filter and group by. The asymmetry matters: fact tables are long and narrow (billions of rows, a handful of measures and keys), dimension tables short and wide (thousands of rows, dozens of descriptive columns). In a sales warehouse, fact_sales is the events and dim_date, dim_customer, dim_product, and dim_store are the context — exactly the star in Figure 31.1.

Before you create a single column, answer one question: what does one row of the fact table represent? That is the grain, the most important decision in the whole design. State it in a single sentence — “one row per line item per order” — and every other choice follows: the dimensions are the things that vary at that grain, the measures the numbers meaningful at that grain. If you cannot state the grain in one sentence, the design is wrong, and the consequences are not cosmetic. Mix two grains in one table — order-level totals alongside line-item rows — and every SUM silently double-counts, because the engine has no way to know some rows are summaries of others. A grain stated and enforced is the best defense against a class of bug that corrupts numbers without raising an error.

With the grain fixed, the fact table is a few keys and a few measures:

-- Grain: one row per line item per sale.
CREATE TABLE fact_sales (
    sale_key      BIGINT PRIMARY KEY,   -- surrogate key for this row
    date_key      INT  NOT NULL,        -- FK -> dim_date
    customer_key  INT  NOT NULL,        -- FK -> dim_customer
    product_key   INT  NOT NULL,        -- FK -> dim_product
    store_key     INT  NOT NULL,        -- FK -> dim_store
    quantity      INT  NOT NULL,        -- additive measure
    total_amount  DECIMAL(10,2) NOT NULL -- additive measure
);

Note the keys. A fact table joins to a dimension on a surrogate key — a meaningless integer the warehouse assigns — not on the business’s natural key (a customer ID, a SKU). This indirection is what makes history possible, which is the next idea.

Slowly changing dimensions: history without lying

Dimensions change. A customer moves cities; a product is reclassified; a loyalty tier is upgraded. The question a dimensional model has to answer is: when you analyze a sale from last year, do you want the customer’s segment as it is now, or as it was when the sale happened? Both are legitimate, and the technique for choosing is the slowly changing dimension (SCD).

The simplest policy, Type 1, just overwrites the old value — no history, the dimension always reflects “now.” That’s right for correcting a typo, and wrong for anything you’ll want to analyze historically, because it silently rewrites the past: last year’s sales suddenly look as if they were always made by Premium customers.

The workhorse is Type 2, which preserves full history by adding a new row each time an attribute changes, rather than overwriting. Each version of the customer gets its own surrogate key and a validity window:

-- A Type 2 dimension keeps every version of a customer as a separate row.
CREATE TABLE dim_customer (
    customer_key   INT PRIMARY KEY,     -- surrogate key, one per VERSION
    customer_id    VARCHAR(50),         -- natural key, stable across versions
    customer_segment VARCHAR(50),
    effective_date DATE NOT NULL,       -- this version became true on...
    end_date       DATE,                -- ...and stopped being true on (NULL = current)
    is_current     BOOLEAN DEFAULT TRUE
);

Now the surrogate key earns its keep. A fact row points at the version of the customer that was current when the event happened, so summing sales by customer_segment groups each sale under the segment it had at the time. This is why facts join on surrogate keys, not natural keys: the natural key (customer_id) is shared across all versions, but the surrogate key (customer_key) pins a fact to one specific version. (Types 3, 4, and 6 exist for narrower needs — a “previous value” column, a separate mini-dimension for rapidly changing attributes — but Types 1 and 2 cover the vast majority of real designs.)

ETL vs ELT, and the layered warehouse

For decades the pattern was ETL: extract from the sources, transform on a dedicated cluster, then load finished tables into the warehouse. The transform happened outside because warehouse storage and compute were expensive and fixed-capacity. Cloud warehouses inverted that economics — Snowflake, BigQuery, and their kin separate storage from compute and make both cheap and elastic — so the constraint that justified ETL evaporated. The modern pattern is ELT: extract, load the raw data in as-is, then transform in place with SQL on the warehouse’s own massively parallel compute. ELT is simpler (no second engine to operate), more flexible (the raw data is always there to re-transform when requirements change), and it puts the transformation in the language analysts already speak.

ELT comes with a layering convention that keeps the in-warehouse transformation from becoming its own swamp. Raw data lands untouched in a raw (or “bronze”) layer — a faithful copy of the source, so you can always rebuild from it. A staging layer does light, mechanical cleanup: rename columns to a consistent convention, cast types — one staging model per source table, no business logic. Finally a marts layer (the “gold” of the medallion metaphor) holds the business-ready facts and dimensions — the star schema itself, where joins and business definitions live. Each layer has one job, and data flows one direction: raw → staging → marts, exactly the side note in Figure 31.1. This is the structure that makes the next idea — dbt — natural.

Transformation as code: dbt

If ELT says transform in the warehouse with SQL, dbt is how you do that without recreating the original swamp. Recall the two dashboards: their definitions diverged because the SQL was copy-pasted, untested, and unowned. dbt fixes each of those by treating transformations as software.

The atom of dbt is a model: a single SELECT statement in a .sql file that defines one table or view. You never write CREATE TABLE; you write the query that produces the rows, and dbt handles materializing it. The thing that makes models compose is the ref() function — instead of hardcoding a table name, a model references another model by name:

-- models/marts/fct_orders.sql — one definition of "completed order", referenced everywhere.
select
    o.order_id,
    o.order_date,
    c.customer_segment,
    o.amount
from {{ ref('stg_orders') }} o          -- ref() builds the dependency graph
join {{ ref('stg_customers') }} c on o.customer_id = c.customer_id
where o.status = 'completed'            -- the business rule lives HERE, once

That ref() does two jobs at once. It lets dbt build the dependency graph (the DAG) automatically — it knows fct_orders depends on stg_orders and stg_customers, so it builds them in order — and it produces the lineage in the generated docs: a clickable graph from raw source to final mart. The business rule (status = 'completed') now lives in exactly one model that every dashboard references. The two definitions of “revenue” can no longer diverge, because there is only one definition.

The piece that would have caught the silent corruption is tests. In dbt a test is a SQL query that should return zero rows; if it returns any, the build fails. The two most valuable tests are declared in a line of YAML each:

# A unique + not_null pair on the primary key catches the most common data bug:
# a join that quietly changed the grain and started producing duplicate rows.
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests: [unique, not_null]

This is the test that would have failed the build the instant a join inflated the row count — the difference between catching corruption in CI and discovering it at quarterly reconciliation. dbt rounds out the story with documentation: descriptions written alongside the models, compiled (with the lineage graph) into a searchable site by dbt docs generate, so “what does this column mean?” has an answer that isn’t tribal knowledge. Version control, modularity, tests, docs — the practices that make application code trustworthy, finally applied to the SQL that defines the business’s numbers.

War story: the swamp, and the one definition that drained it

The eleven-percent dashboard discrepancy from the introduction was never one bug; it was the absence of a model. Every analyst queried the raw orders and payments tables directly, so every analyst encoded their own answer to “when is an order revenue?” — paid vs. placed, gross vs. net of refunds — and the answers drifted apart silently because nothing tied them together. The fix was not a smarter query. It was a dimensional model built as code: a single fct_orders mart encoding the canonical revenue definition once, a unique/not_null test on its key, and a rule that dashboards query the mart, never the raw tables. Within a sprint the two dashboards agreed, because they were reading the same number from the same tested model. Consistency is not a discipline you can ask analysts to maintain by hand across a hundred queries; it has to be built in, as one tested definition that everything references. A metric without a single owning model is a metric with as many values as it has authors.

Build it → A warehouse semantic layer turns these mart tables into governed, reusable metric definitions: Project 10: Warehouse Semantic Layer implements the “one definition of revenue” idea as a queryable layer over the star.

SQL and query optimization: why the scan was avoidable

A clean model on a slow engine still times out. The ninety-second dashboard from the introduction wasn’t slow because the SQL was bad; it was slow because the engine had no way to read less data. Optimization in an analytical warehouse is almost entirely about that — reading less — and three mechanisms do the work.

The first is columnar storage, the deepest reason analytical engines are fast. A transactional database stores data row by row: all of row 1’s columns together, then all of row 2’s — ideal when you fetch whole rows. But an analytical query touches a few columns across millions of rows; “sum total_amount” needs one column out of fifty. A columnar engine stores each column separately and contiguously, so that query reads exactly one column’s worth of bytes and ignores the other forty-nine. Storing a column together also makes it compress beautifully — adjacent values are similar — so there are fewer bytes to read. This is why SELECT * is a sin in a columnar warehouse and harmless in a row store: in columnar layout, every column you name is more data scanned, so you project only what you need.

The second mechanism is partitioning (and its finer-grained cousin, clustering). The billion-row table was unpartitioned, so “last 7 days” had to scan everything to find the recent rows. Partition the table by date and the warehouse physically segregates rows into per-day chunks; a query that filters WHERE order_date >= ... triggers partition pruning — the engine skips entire partitions it can prove are irrelevant, never reading them. A WHERE on the partition key is the difference between scanning a billion rows and scanning seven million. Cloud warehouses spell this differently — Redshift’s sort keys, Snowflake’s cluster keys, BigQuery’s PARTITION BY plus CLUSTER BY — but the idea is identical: organize the data so the engine can skip most of it.

The third is predicate pushdown: applying the filter as early as possible, so rows are eliminated before the expensive work (joins, aggregation). The optimizer does much of this for you, but you can defeat it. Wrap a partitioned column in a function — WHERE DATE(created_at) = '2024-03-15' — and the predicate becomes non-sargable: the engine can no longer match it against the column’s physical organization and falls back to a full scan. Keep the column bare on one side of the comparison (WHERE created_at >= '2024-03-15' AND created_at < '2024-03-16') and pruning works.

To diagnose any of this, read the query plan — EXPLAIN ANALYZE shows what the engine actually did. Watch for a few red flags: a sequential scan on a large table (no pruning), a partition count like “1 of 365 selected” (good) versus “365 of 365” (pruning failed), and a gap between estimated and actual rows of 10x or more (stale statistics — ANALYZE to refresh them often fixes the plan outright). The plan turns “the query is slow” into “the query scanned 365 partitions because the filter was non-sargable,” which is a problem you can fix.

Build it → Columnar storage and query optimization, implemented from scratch: Project 17: Columnar Query Engine builds a DuckDB-style engine with columnar storage, vectorized execution, and a query optimizer, and Project 52: Time-Series Database shows columnar compression and partition pruning tuned for time-stamped data.

Warehouse, lake, and lakehouse

The warehouse is one of three names for where analytical data lives, differing mostly in structure and timing. A data warehouse stores structured, modeled data — the star schemas of this chapter — optimized for fast SQL; you pay the modeling cost up front (schema-on-write) and get clean, fast queries in return. A data lake stores raw files of any shape (JSON, Parquet, logs, images) cheaply and without an imposed schema, deferring structure to read time (schema-on-read); flexible and cheap, but undisciplined on its own — the literal swamp risk made architectural. The lakehouse is the synthesis: lake-cheap file storage with warehouse-style structure, transactions, and SQL layered on via open table formats. For the discipline in this chapter the distinction barely matters — a star schema built with dbt is a star schema whether the marts live in a classic warehouse or a lakehouse table. The deeper treatment belongs to the Data Infrastructure material; here it’s enough that the warehouse is the modeled, query-optimized end of that spectrum.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Model a business process as a star. Pick a business process you understand — restaurant orders, gym check-ins, a streaming service’s plays. Identify the fact (the measurable event), state its grain in a single sentence (“one row per …”), then list the dimensions (the who/what/when/where you’d filter and group by) and the measures (the numbers). Sketch the star. Then write the one analytical query the model is meant to make easy — a SELECT that joins the fact to two dimensions, filters on one, and aggregates a measure — and confirm it needs only shallow joins.
Level II — Build a layered transformation with a test that catches an error. Design a raw → staging → marts pipeline (dbt-style, on paper or in dbt + a local warehouse) for your Level I process. Write the staging model (rename/cast only) and the mart model (the fact, with its business logic). Add a unique + not_null test on the fact’s key, then deliberately introduce a join that changes the grain and produces duplicates — and show that the test fails. In a short paragraph, explain ELT vs ETL: why the transformation happens in the warehouse, and what the raw layer buys you when requirements change.
Level III — Optimize a slow analytical query. Take a query that scans a large fact table filtered by date and grouped by a couple of dimensions. Choose a partitioning (or clustering) key and justify it from the query’s filter. Rewrite the query to enable predicate pushdown — in particular, fix any non-sargable predicate (a function wrapping the filter column) so partition pruning can fire. Read the EXPLAIN ANALYZE output before and after, point to the partition count and the scan type that changed, and explain in terms of columnar storage why the optimized query reads dramatically less data — both the columns it skips and the partitions it prunes.

Summary

Analytical data wants a different shape than transactional data, because the analytical workload — scan and aggregate over many rows but few columns — is the opposite of the transactional one. The data warehouse delivers that shape in three layers. You model dimensionally: measurable events (facts) at the center of a star, surrounded by descriptive dimensions, with the grain stated in one sentence and slowly changing dimensions preserving history truthfully. You transform with discipline: ELT loads raw data and reshapes it in-warehouse through a raw → staging → marts layering, and dbt brings version control, ref()-driven modularity, tests, lineage, and docs — so a metric has one tested definition instead of a hundred drifting copies. And you optimize by reading less: columnar storage scans only the columns a query names, partitioning prunes the rows it doesn’t, and predicate pushdown filters early — with the query plan as the instrument that turns “it’s slow” into a diagnosable fact.

Key takeaways

The fact/dimension split is the model; the grain — one sentence describing one row — is the decision everything else follows from, and mixing grains silently corrupts every aggregate.
Slowly changing dimensions (especially Type 2) let you analyze an event against the context that was true when it happened; surrogate keys are what make that work.
ELT beat ETL because cloud warehouse compute is cheap and elastic — load raw, transform in place, organize as raw → staging → marts.
dbt is software engineering for SQL: ref() builds the dependency graph and lineage, and a one-line unique/not_null test catches the grain bugs that otherwise reach dashboards. One tested model is the cure for the inconsistent-metric swamp.
Analytical speed is about reading less data: columnar storage (skip columns), partitioning/clustering (skip rows via pruning), and sargable predicates (so pruning can fire) — verified by reading the query plan.

Connections to other chapters

Data Processing Engines (sibling): the columnar storage, vectorized execution, and query-planning ideas at the end of this chapter are the same machinery a processing engine runs at scale — the warehouse is one consumer of that engine’s technology, and the optimization intuitions transfer directly.
Data Orchestration (sibling): a dbt transformation doesn’t run itself. In production it’s a step in an orchestrated pipeline — extract and load run first, dbt builds the marts, tests gate the result — which is where scheduling, dependencies, and retries from the orchestration chapter come in.
Benchmarking Systems (Part V): every claim in the optimization section — “the scan dropped from a billion rows to seven million,” “the partition count went from 365 to 1” — is a measurement. The discipline for trusting those deltas, and for not optimizing by vibes, comes from the benchmarking chapter.
The Data Engineering Landscape (prerequisite): that chapter frames the ingest → store → serve pipeline; warehousing is the store-and-serve stage, and the warehouse-vs-lake-vs-lakehouse distinction here is the detailed view of the “store” box drawn there.

Kimball & Ross, The Data Warehouse Toolkit — the canonical text on dimensional modeling; facts, dimensions, grain, and slowly changing dimensions are all defined here, with a catalog of real-world modeling patterns.
dbt documentation (getdbt.com) and Analytics Engineering with dbt — the practical reference for models, ref(), tests, sources, and the staging→marts project structure.

Deep dives

Kleppmann, Designing Data-Intensive Applications, ch. 3 (“Storage and Retrieval”) — the clearest explanation of why column-oriented storage and the star/snowflake schema make analytical queries fast, grounded in how the bytes are laid out.
Stonebraker et al., “C-Store: A Column-oriented DBMS” (VLDB, 2005) — the research system that made the case for columnar storage in analytics; the lineage runs straight to today’s cloud warehouses.
Melnik et al., “Dremel: Interactive Analysis of Web-Scale Datasets” (VLDB, 2010) — the engine behind BigQuery; columnar storage, partitioning, and massively parallel scans applied to trillions of rows.

Historical context

The Inmon vs. Kimball debate — Bill Inmon’s top-down, normalized enterprise-data-warehouse vision versus Ralph Kimball’s bottom-up, dimensional (star-schema) approach. Modern practice is overwhelmingly Kimball-flavored at the presentation layer, and reading the original disagreement explains why the star won for analytics.

--- title: "Data Warehousing & Modeling" keywords: [data warehouse, dimensional modeling, star schema, facts, dimensions, dbt, elt, sql optimization, columnar storage, olap] difficulty: intermediate prerequisites: [sql-basics, data-engineering-overview] estimated_time: "3-4 hours" --- ## Introduction The two dashboards both said "revenue," and they disagreed by eleven percent. One was finance's, the other growth's, and each had written its own SQL against the raw operational tables — SQL that had quietly diverged. Finance counted an order when it was paid; growth counted it when it was placed. Finance excluded refunds; growth netted them out a week later. Neither query was wrong, exactly; they encoded two different definitions of the same word. And because every analyst queried the raw tables directly, there were as many definitions of "revenue" as there were people who had ever needed the number. When the CFO asked which figure was real, nobody could say. The warehouse had become a swamp: a hundred slightly different truths, none authoritative, all slow. The slowness was its own crisis. The same warehouse held a `web_events` table of a billion rows, unpartitioned, and the most common dashboard query — "sessions by country, last 7 days" — scanned all billion of them every run, because the engine had no way to know only the last week's rows mattered. It took ninety seconds when it didn't time out. Analysts learned to run it before lunch. These are the two failure modes of an undisciplined warehouse, with a shared root: analytical data stored and queried as if it were transactional data. A warehouse needs a deliberate *model* — an agreed-upon shape that encodes the business's definitions once — and *transformation discipline*, so the model is built reproducibly and queried efficiently. Without both, it decays into exactly this: inconsistent answers, delivered slowly. This chapter is about the model and the discipline that keep a warehouse from becoming a swamp. ### The Core Insight The mistake underneath both dashboards is treating one database design as if it could serve two opposite jobs. A transactional database — the one behind the application — is optimized for *writes*, and its winning design is **normalization**: every fact stored exactly once, spread across many narrow tables joined by keys, so an update touches one place and can never contradict itself. That shape is correct for an application and miserable for analytics. Asking "revenue by product category by quarter" against a normalized schema means joining five or six tables and scanning millions of rows to answer a question that touches only a handful of columns. Analytical data wants the opposite shape, because the workload is the opposite. Analysts don't update single rows; they *scan and aggregate* — sum, count, group, slice — across enormous numbers of rows but only a few columns at a time. The insight behind the data warehouse is to stop forcing this read-heavy, aggregate-heavy workload through a write-optimized schema and instead design *for the way analysts actually query*. That redesign has three layers, and the rest of the chapter is each one in turn: 1. **Model.** Reshape the data with **dimensional modeling** — measurable events (*facts*) surrounded by the descriptive context that gives them meaning (*dimensions*), arranged in a **star schema** that an analyst can slice and aggregate with one or two joins instead of six. 2. **Transform.** Build that model with **transformation-as-code** — dbt's approach of version-controlled SQL, automated tests, and generated documentation, so the definition of "revenue" lives in one tested model that every dashboard references rather than in a hundred copy-pasted queries. 3. **Optimize.** Run it on a **columnar OLAP engine** whose speed comes not from cleverness but from physics — storing each column separately, partitioning the data, and *pruning* the irrelevant parts before they're ever read, so the billion-row scan becomes a million-row scan. ### A mental model Picture a **star**. At the center sits one fact table — the measurable events, say sales, each row a thing that happened with numbers attached: a quantity, an amount. Radiating outward are the dimension tables — date, customer, product, store — the *who, what, when, and where* that turn a bare number into a meaningful one. A sale of $49.98 is data; a sale of $49.98 *to a loyalty member, of a kitchen product, at the Boston store, on a holiday* is analysis. The star's geometry is the point: because the descriptive context is denormalized into a few wide dimension tables, a query reaches the fact at the center, joins out to the dimensions it needs, and slices or aggregates — fast, with shallow joins, in a shape a business user can read. @fig-de-dimensional draws this shape. The second analogy is about *where* the transformation happens. The old way was **ETL** — Extract, Transform, Load — where data was reshaped on a separate cluster *before* it landed in the warehouse, because warehouse compute was scarce. The modern way is **ELT** — Extract, *Load*, Transform — load raw data into the warehouse first, then transform it *in place* using the warehouse's own compute. ELT won because cloud warehouses made that compute cheap and elastic. The third analogy names what makes the transformation trustworthy: **dbt is software engineering for SQL**. Before dbt, analytics SQL was the last code on earth that ran without version control, tests, or modularity — the very absences that diverged the two dashboards. dbt treats a transformation as *code*: each model is a SQL `SELECT` checked into Git, models reference each other (so a fix cascades), every model can carry tests, and the dependency graph and documentation are generated automatically. You declare *what* the table should be; dbt figures out *how* to build it. ### When to model, and how The default for an analytical workload on a modern stack is a short, opinionated recipe, and you should reach for it before considering anything more exotic. **Model dimensionally — a star schema — for analytics.** For read-heavy, aggregate-heavy business intelligence (dashboards, ad-hoc analysis, reporting), a star schema is the proven default; its denormalized dimensions keep joins shallow and queries legible. The alternatives have narrow niches: a fully normalized **snowflake** schema (dimensions split into sub-tables) saves storage that columnar compression already reclaims, at the cost of more joins; a **Data Vault** earns its complexity only when integrating many volatile source systems with full auditability; a single denormalized **one big table** is fine for a simple dashboard or an ML feature set. When in doubt, start with the star — and never model dimensionally for an OLTP workload, the write-optimized job that wants normalization. **Use ELT on a cloud warehouse as the default transformation pattern.** Load raw, transform in-warehouse with SQL. Reach for an external engine (Spark, Flink) only when the transformation genuinely needs Python/Scala logic or true streaming — dbt is batch and SQL. @fig-de-dimensional shows the target shape: the star you're modeling toward, and the ELT layering that builds it. ### What you'll learn - Why analytical data wants a different physical shape than transactional data, and how the star schema delivers it - How to identify the **fact**, define its **grain** in one sentence, and surround it with the right **dimensions** - How **slowly changing dimensions** preserve history, so a sale is analyzed against the customer segment that was true *at the time of the sale* - Why **ELT** displaced ETL, and how the raw → staging → marts layering organizes a warehouse - How **dbt** brings version control, testing, lineage, and documentation to SQL — and how a single tested model becomes the one definition of a metric - Why **columnar storage**, **partitioning**, and **predicate pushdown** make a billion-row scan avoidable, and how to read a query plan to prove it - Where the warehouse sits relative to the data **lake** and the **lakehouse** ### Prerequisites - **SQL basics**: `SELECT`, `JOIN`, `GROUP BY`, and aggregate functions — enough to read an analytical query (the *SQL Basics* material) - **The data-engineering landscape**: where the warehouse sits in the ingest → store → serve pipeline, and the difference between operational and analytical systems (the *Data Engineering Overview* material) - Comfort with the idea of a relational schema: tables, primary and foreign keys --- ## Dimensional modeling: facts, dimensions, and grain Dimensional modeling is the centerpiece, because the model is what makes everything downstream possible — a clean star is what dbt builds toward and what the columnar engine queries fast. The model has exactly two kinds of table, and the discipline is knowing which is which. A **fact table** holds the measurable events of a business process: one row per thing that happened, carrying *measures* (the numbers) and *foreign keys* to the dimensions (the context). A **dimension table** holds the descriptive attributes — the who, what, when, where — that you filter and group by. The asymmetry matters: fact tables are long and narrow (billions of rows, a handful of measures and keys), dimension tables short and wide (thousands of rows, dozens of descriptive columns). In a sales warehouse, `fact_sales` is the events and `dim_date`, `dim_customer`, `dim_product`, and `dim_store` are the context — exactly the star in @fig-de-dimensional. ![A star schema: a central fact table of measurable events (sales, with its grain and measures) surrounded by dimension tables (date, customer, product, store) joined by keys — a shape optimized for the slice-and-aggregate queries analysts actually run, rather than for transactional writes. Modern warehouses build these with ELT layering: raw to staging to marts.](../assets/diagrams/rendered/de_dimensional.svg){#fig-de-dimensional .lightbox} Before you create a single column, answer one question: **what does one row of the fact table represent?** That is the *grain*, the most important decision in the whole design. State it in a single sentence — "one row per line item per order" — and every other choice follows: the dimensions are the things that vary at that grain, the measures the numbers meaningful at that grain. If you cannot state the grain in one sentence, the design is wrong, and the consequences are not cosmetic. Mix two grains in one table — order-level totals alongside line-item rows — and every `SUM` silently double-counts, because the engine has no way to know some rows are summaries of others. A grain stated and enforced is the best defense against a class of bug that corrupts numbers without raising an error. With the grain fixed, the fact table is a few keys and a few measures: ```sql -- Grain: one row per line item per sale. CREATE TABLE fact_sales ( sale_key BIGINT PRIMARY KEY, -- surrogate key for this row date_key INT NOT NULL, -- FK -> dim_date customer_key INT NOT NULL, -- FK -> dim_customer product_key INT NOT NULL, -- FK -> dim_product store_key INT NOT NULL, -- FK -> dim_store quantity INT NOT NULL, -- additive measure total_amount DECIMAL(10,2) NOT NULL -- additive measure ); ``` Note the keys. A fact table joins to a dimension on a **surrogate key** — a meaningless integer the warehouse assigns — not on the business's natural key (a customer ID, a SKU). This indirection is what makes history possible, which is the next idea. ### Slowly changing dimensions: history without lying Dimensions change. A customer moves cities; a product is reclassified; a loyalty tier is upgraded. The question a dimensional model has to answer is: when you analyze a sale from last year, do you want the customer's segment *as it is now*, or *as it was when the sale happened*? Both are legitimate, and the technique for choosing is the **slowly changing dimension** (SCD). The simplest policy, **Type 1**, just overwrites the old value — no history, the dimension always reflects "now." That's right for correcting a typo, and wrong for anything you'll want to analyze historically, because it silently rewrites the past: last year's sales suddenly look as if they were always made by Premium customers. The workhorse is **Type 2**, which preserves full history by *adding a new row* each time an attribute changes, rather than overwriting. Each version of the customer gets its own surrogate key and a validity window: ```sql -- A Type 2 dimension keeps every version of a customer as a separate row. CREATE TABLE dim_customer ( customer_key INT PRIMARY KEY, -- surrogate key, one per VERSION customer_id VARCHAR(50), -- natural key, stable across versions customer_segment VARCHAR(50), effective_date DATE NOT NULL, -- this version became true on... end_date DATE, -- ...and stopped being true on (NULL = current) is_current BOOLEAN DEFAULT TRUE ); ``` Now the surrogate key earns its keep. A fact row points at the *version of the customer that was current when the event happened*, so summing sales by `customer_segment` groups each sale under the segment it had *at the time*. This is why facts join on surrogate keys, not natural keys: the natural key (`customer_id`) is shared across all versions, but the surrogate key (`customer_key`) pins a fact to one specific version. (Types 3, 4, and 6 exist for narrower needs — a "previous value" column, a separate mini-dimension for rapidly changing attributes — but Types 1 and 2 cover the vast majority of real designs.) ## ETL vs ELT, and the layered warehouse For decades the pattern was **ETL**: extract from the sources, *transform* on a dedicated cluster, then *load* finished tables into the warehouse. The transform happened outside because warehouse storage and compute were expensive and fixed-capacity. Cloud warehouses inverted that economics — Snowflake, BigQuery, and their kin separate storage from compute and make both cheap and elastic — so the constraint that justified ETL evaporated. The modern pattern is **ELT**: extract, *load the raw data in as-is*, then *transform in place* with SQL on the warehouse's own massively parallel compute. ELT is simpler (no second engine to operate), more flexible (the raw data is always there to re-transform when requirements change), and it puts the transformation in the language analysts already speak. ELT comes with a layering convention that keeps the in-warehouse transformation from becoming its own swamp. Raw data lands untouched in a **raw** (or "bronze") layer — a faithful copy of the source, so you can always rebuild from it. A **staging** layer does light, mechanical cleanup: rename columns to a consistent convention, cast types — *one staging model per source table*, no business logic. Finally a **marts** layer (the "gold" of the medallion metaphor) holds the business-ready facts and dimensions — the star schema itself, where joins and business definitions live. Each layer has one job, and data flows one direction: raw → staging → marts, exactly the side note in @fig-de-dimensional. This is the structure that makes the next idea — dbt — natural. ## Transformation as code: dbt If ELT says *transform in the warehouse with SQL*, dbt is how you do that without recreating the original swamp. Recall the two dashboards: their definitions diverged because the SQL was copy-pasted, untested, and unowned. dbt fixes each of those by treating transformations as software. The atom of dbt is a **model**: a single `SELECT` statement in a `.sql` file that defines one table or view. You never write `CREATE TABLE`; you write the query that produces the rows, and dbt handles materializing it. The thing that makes models *compose* is the `ref()` function — instead of hardcoding a table name, a model references another model by name: ```sql -- models/marts/fct_orders.sql — one definition of "completed order", referenced everywhere. select o.order_id, o.order_date, c.customer_segment, o.amount from {{ ref('stg_orders') }} o -- ref() builds the dependency graph join {{ ref('stg_customers') }} c on o.customer_id = c.customer_id where o.status = 'completed' -- the business rule lives HERE, once ``` That `ref()` does two jobs at once. It lets dbt build the **dependency graph** (the DAG) automatically — it knows `fct_orders` depends on `stg_orders` and `stg_customers`, so it builds them in order — and it produces the **lineage** in the generated docs: a clickable graph from raw source to final mart. The business rule (`status = 'completed'`) now lives in exactly one model that every dashboard references. The two definitions of "revenue" can no longer diverge, because there is only one definition. The piece that would have caught the silent corruption is **tests**. In dbt a test is a SQL query that should return zero rows; if it returns any, the build fails. The two most valuable tests are declared in a line of YAML each: ```yaml # A unique + not_null pair on the primary key catches the most common data bug: # a join that quietly changed the grain and started producing duplicate rows. models: - name: fct_orders columns: - name: order_id tests: [unique, not_null] ``` This is the test that would have failed the build the instant a join inflated the row count — the difference between catching corruption in CI and discovering it at quarterly reconciliation. dbt rounds out the story with **documentation**: descriptions written alongside the models, compiled (with the lineage graph) into a searchable site by `dbt docs generate`, so "what does this column mean?" has an answer that isn't tribal knowledge. Version control, modularity, tests, docs — the practices that make application code trustworthy, finally applied to the SQL that defines the business's numbers. ::: {.callout-warning} ## War story: the swamp, and the one definition that drained it The eleven-percent dashboard discrepancy from the introduction was never one bug; it was the *absence of a model*. Every analyst queried the raw `orders` and `payments` tables directly, so every analyst encoded their own answer to "when is an order revenue?" — paid vs. placed, gross vs. net of refunds — and the answers drifted apart silently because nothing tied them together. The fix was not a smarter query. It was a *dimensional model built as code*: a single `fct_orders` mart encoding the canonical revenue definition once, a `unique`/`not_null` test on its key, and a rule that dashboards query the mart, never the raw tables. Within a sprint the two dashboards agreed, because they were reading the *same* number from the *same* tested model. Consistency is not a discipline you can ask analysts to maintain by hand across a hundred queries; it has to be *built in*, as one tested definition that everything references. A metric without a single owning model is a metric with as many values as it has authors. ::: > **Build it →** A warehouse semantic layer turns these mart tables into governed, > reusable metric definitions: > [Project 10: Warehouse Semantic Layer](https://github.com/jchu0/applied-cs-projects/tree/main/10-warehouse-semantic-layer) > implements the "one definition of revenue" idea as a queryable layer over the star. ## SQL and query optimization: why the scan was avoidable A clean model on a slow engine still times out. The ninety-second dashboard from the introduction wasn't slow because the SQL was bad; it was slow because the engine had no way to read *less data*. Optimization in an analytical warehouse is almost entirely about that — reading less — and three mechanisms do the work. The first is **columnar storage**, the deepest reason analytical engines are fast. A transactional database stores data *row by row*: all of row 1's columns together, then all of row 2's — ideal when you fetch whole rows. But an analytical query touches a few columns across millions of rows; "sum `total_amount`" needs one column out of fifty. A columnar engine stores each column *separately and contiguously*, so that query reads exactly one column's worth of bytes and ignores the other forty-nine. Storing a column together also makes it compress beautifully — adjacent values are similar — so there are fewer bytes to read. This is why `SELECT *` is a sin in a columnar warehouse and harmless in a row store: in columnar layout, every column you name is more data scanned, so you project only what you need. The second mechanism is **partitioning** (and its finer-grained cousin, clustering). The billion-row table was unpartitioned, so "last 7 days" had to scan everything to find the recent rows. Partition the table by date and the warehouse physically segregates rows into per-day chunks; a query that filters `WHERE order_date >= ...` triggers **partition pruning** — the engine skips entire partitions it can prove are irrelevant, never reading them. A `WHERE` on the partition key is the difference between scanning a billion rows and scanning seven million. Cloud warehouses spell this differently — Redshift's sort keys, Snowflake's cluster keys, BigQuery's `PARTITION BY` plus `CLUSTER BY` — but the idea is identical: organize the data so the engine can *skip* most of it. The third is **predicate pushdown**: applying the filter as early as possible, so rows are eliminated *before* the expensive work (joins, aggregation). The optimizer does much of this for you, but you can defeat it. Wrap a partitioned column in a function — `WHERE DATE(created_at) = '2024-03-15'` — and the predicate becomes *non-sargable*: the engine can no longer match it against the column's physical organization and falls back to a full scan. Keep the column bare on one side of the comparison (`WHERE created_at >= '2024-03-15' AND created_at < '2024-03-16'`) and pruning works. To diagnose any of this, read the **query plan** — `EXPLAIN ANALYZE` shows what the engine actually did. Watch for a few red flags: a *sequential scan* on a large table (no pruning), a *partition count* like "1 of 365 selected" (good) versus "365 of 365" (pruning failed), and a gap between *estimated and actual rows* of 10x or more (stale statistics — `ANALYZE` to refresh them often fixes the plan outright). The plan turns "the query is slow" into "the query scanned 365 partitions because the filter was non-sargable," which is a problem you can fix. > **Build it →** Columnar storage and query optimization, implemented from scratch: > [Project 17: Columnar Query Engine](https://github.com/jchu0/applied-cs-projects/tree/main/17-columnar-query-engine) > builds a DuckDB-style engine with columnar storage, vectorized execution, and a query > optimizer, and > [Project 52: Time-Series Database](https://github.com/jchu0/applied-cs-projects/tree/main/52-time-series-database) > shows columnar compression and partition pruning tuned for time-stamped data. ## Warehouse, lake, and lakehouse The warehouse is one of three names for where analytical data lives, differing mostly in structure and timing. A **data warehouse** stores *structured, modeled* data — the star schemas of this chapter — optimized for fast SQL; you pay the modeling cost up front (schema-on-write) and get clean, fast queries in return. A **data lake** stores *raw files* of any shape (JSON, Parquet, logs, images) cheaply and without an imposed schema, deferring structure to read time (schema-on-read); flexible and cheap, but undisciplined on its own — the literal swamp risk made architectural. The **lakehouse** is the synthesis: lake-cheap file storage with warehouse-style structure, transactions, and SQL layered on via open table formats. For the discipline in this chapter the distinction barely matters — a star schema built with dbt is a star schema whether the marts live in a classic warehouse or a lakehouse table. The deeper treatment belongs to the *Data Infrastructure* material; here it's enough that the warehouse is the *modeled, query-optimized* end of that spectrum. --- ## Practical exercise **Difficulty:** Level I · Level II · Level III 1. **Level I — Model a business process as a star.** Pick a business process you understand — restaurant orders, gym check-ins, a streaming service's plays. Identify the **fact** (the measurable event), state its **grain** in a single sentence ("one row per ..."), then list the **dimensions** (the who/what/when/where you'd filter and group by) and the **measures** (the numbers). Sketch the star. Then write the one analytical query the model is meant to make easy — a `SELECT` that joins the fact to two dimensions, filters on one, and aggregates a measure — and confirm it needs only shallow joins. 2. **Level II — Build a layered transformation with a test that catches an error.** Design a raw → staging → marts pipeline (dbt-style, on paper or in dbt + a local warehouse) for your Level I process. Write the staging model (rename/cast only) and the mart model (the fact, with its business logic). Add a `unique` + `not_null` test on the fact's key, then deliberately introduce a join that changes the grain and produces duplicates — and show that the test fails. In a short paragraph, explain **ELT vs ETL**: why the transformation happens in the warehouse, and what the raw layer buys you when requirements change. 3. **Level III — Optimize a slow analytical query.** Take a query that scans a large fact table filtered by date and grouped by a couple of dimensions. Choose a **partitioning** (or clustering) key and justify it from the query's filter. Rewrite the query to enable **predicate pushdown** — in particular, fix any non-sargable predicate (a function wrapping the filter column) so partition pruning can fire. Read the `EXPLAIN ANALYZE` output before and after, point to the partition count and the scan type that changed, and explain in terms of **columnar storage** *why* the optimized query reads dramatically less data — both the columns it skips and the partitions it prunes. ## Summary Analytical data wants a different shape than transactional data, because the analytical workload — scan and aggregate over many rows but few columns — is the opposite of the transactional one. The data warehouse delivers that shape in three layers. You **model** dimensionally: measurable events (*facts*) at the center of a star, surrounded by descriptive *dimensions*, with the *grain* stated in one sentence and *slowly changing dimensions* preserving history truthfully. You **transform** with discipline: **ELT** loads raw data and reshapes it in-warehouse through a raw → staging → marts layering, and **dbt** brings version control, `ref()`-driven modularity, tests, lineage, and docs — so a metric has one tested definition instead of a hundred drifting copies. And you **optimize** by reading less: **columnar storage** scans only the columns a query names, **partitioning** prunes the rows it doesn't, and **predicate pushdown** filters early — with the **query plan** as the instrument that turns "it's slow" into a diagnosable fact. ### Key takeaways - The fact/dimension split is the model; the **grain** — one sentence describing one row — is the decision everything else follows from, and mixing grains silently corrupts every aggregate. - **Slowly changing dimensions** (especially Type 2) let you analyze an event against the context that was true *when it happened*; surrogate keys are what make that work. - **ELT** beat ETL because cloud warehouse compute is cheap and elastic — load raw, transform in place, organize as raw → staging → marts. - **dbt** is software engineering for SQL: `ref()` builds the dependency graph and lineage, and a one-line `unique`/`not_null` test catches the grain bugs that otherwise reach dashboards. One tested model is the cure for the inconsistent-metric swamp. - Analytical speed is about reading less data: **columnar storage** (skip columns), **partitioning/clustering** (skip rows via pruning), and **sargable predicates** (so pruning can fire) — verified by reading the query plan. ### Connections to other chapters - **Data Processing Engines** (sibling): the columnar storage, vectorized execution, and query-planning ideas at the end of this chapter are the same machinery a processing engine runs at scale — the warehouse is one consumer of that engine's technology, and the optimization intuitions transfer directly. - **Data Orchestration** (sibling): a dbt transformation doesn't run itself. In production it's a step in an orchestrated pipeline — extract and load run first, dbt builds the marts, tests gate the result — which is where scheduling, dependencies, and retries from the orchestration chapter come in. - **Benchmarking Systems** (Part V): every claim in the optimization section — "the scan dropped from a billion rows to seven million," "the partition count went from 365 to 1" — is a measurement. The discipline for trusting those deltas, and for not optimizing by vibes, comes from the benchmarking chapter. - **The Data Engineering Landscape** (prerequisite): that chapter frames the ingest → store → serve pipeline; warehousing *is* the store-and-serve stage, and the warehouse-vs-lake-vs-lakehouse distinction here is the detailed view of the "store" box drawn there. ## Further reading ### Essential - Kimball & Ross, *The Data Warehouse Toolkit* — the canonical text on dimensional modeling; facts, dimensions, grain, and slowly changing dimensions are all defined here, with a catalog of real-world modeling patterns. - *dbt documentation* (getdbt.com) and *Analytics Engineering with dbt* — the practical reference for models, `ref()`, tests, sources, and the staging→marts project structure. ### Deep dives - Kleppmann, *Designing Data-Intensive Applications*, ch. 3 ("Storage and Retrieval") — the clearest explanation of why column-oriented storage and the star/snowflake schema make analytical queries fast, grounded in how the bytes are laid out. - Stonebraker et al., *"C-Store: A Column-oriented DBMS"* (VLDB, 2005) — the research system that made the case for columnar storage in analytics; the lineage runs straight to today's cloud warehouses. - Melnik et al., *"Dremel: Interactive Analysis of Web-Scale Datasets"* (VLDB, 2010) — the engine behind BigQuery; columnar storage, partitioning, and massively parallel scans applied to trillions of rows. ### Historical context - The Inmon vs. Kimball debate — Bill Inmon's top-down, normalized enterprise-data-warehouse vision versus Ralph Kimball's bottom-up, dimensional (star-schema) approach. Modern practice is overwhelmingly Kimball-flavored at the presentation layer, and reading the original disagreement explains *why* the star won for analytics.