Data Infrastructure

Keywords

data infrastructure, lakehouse, data lake, data warehouse, object storage, table formats, iceberg, delta, terraform, infrastructure as code

Introduction

Two teams, two opposite disasters, the same root cause.

The first team bought a data warehouse — a fast, expensive, columnar machine that ran SQL like a dream. It served the BI dashboards beautifully, right up until the day the ML team showed up with a few terabytes of raw clickstream JSON and a directory of product images they wanted to train on. The warehouse had no place to put any of it. It wanted clean, typed, modeled tables; it charged warehouse prices for every byte sitting in it; and it could not hold the messy, half-structured, not-yet-understood raw data that is the lifeblood of analytics and machine learning. So the raw data went somewhere else, and now there were two systems, two copies, and a nightly job to shuttle data between them that broke every other week.

The second team had learned that lesson and swore off warehouses entirely. They dumped everything into a data lake — just a big bucket of object storage, cheap as dirt, infinitely scalable, holding raw files in whatever shape they arrived. For about a year it was glorious. Then it rotted. Without a schema, nobody knew what was in any given file. Without transactions, a job that crashed halfway left the lake in a state where some files were updated and some weren’t, and a query reading mid-write got garbage. Without a catalog, finding the right dataset meant asking the one engineer who remembered where things were. The cheap, flexible lake had decayed into a data swamp: untrustworthy, unqueryable, ungoverned. The same SELECT that worked yesterday returned different rows today, and nobody could say why.

And here is the cause both disasters shared: a third team, in the same company, had clicked all of this infrastructure together by hand in a cloud console. The buckets, the warehouse, the IAM roles, the networking — each provisioned by someone navigating a web UI, one checkbox at a time, with no record of what they did. When a region had an outage and the platform needed to be rebuilt, there was no blueprint to rebuild it from. Nobody could say what the working configuration had been, because the working configuration lived only in the cloud account itself, and the cloud account was the thing that was down.

This chapter is about the platform layer that sits underneath your data and decides four things at once: whether your data is cheap to store, whether it is reliable enough to trust, whether it is queryable by the engines that need it, and whether you can rebuild the whole thing from a file in version control after it’s gone. Get this layer right and the warehouse-versus-lake fork stops being a fork. Get it wrong and you pick which of the disasters above you’d prefer.

The Core Insight

Two ideas, taken together, define modern data infrastructure. Both are resolutions of false choices that earlier generations of data teams thought they had to live with.

The first idea resolves the lake-versus-warehouse fork. For two decades you genuinely had to choose. A data lake — raw files on cheap object storage — gave you low cost, infinite scale, open formats, and the freedom to store anything in any shape. But it gave you none of the guarantees a database makes: no ACID transactions, so a half-finished write left readers seeing torn data; no enforced schema, so files drifted apart until nothing could read them uniformly; no efficient updates or deletes, which made even a GDPR “delete this user” request a nightmare. A data warehouse gave you exactly those guarantees — transactions, schemas, fast SQL — but it charged warehouse prices, locked your data inside a proprietary format, fused storage to compute so you paid for both even when idle, and refused anything that wasn’t already clean and structured. The lakehouse dissolves the choice. It keeps the lake’s cheap object storage and open columnar files, and it layers an open table format — Apache Iceberg, Delta Lake, or Apache Hudi — on top of those files. That table format is a metadata layer that adds back exactly the guarantees the bare lake lacked: ACID transactions, schema evolution, time travel. You get the lake’s cost and flexibility and the warehouse’s reliability, on one copy of the data, with compute engines decoupled from storage so each scales on its own.

The second idea is that this platform must be infrastructure as code. Every bucket, catalog, role, warehouse, and cluster that makes up the platform should be defined in declarative configuration files, kept in version control, and provisioned by a tool — Terraform is the standard — rather than clicked together by hand. The reason is the third team’s outage. Hand-built infrastructure is unauditable: there is no diff, no review, no record of who changed what. It is unreproducible: you cannot stamp out an identical staging environment, and you cannot rebuild production from scratch. And it drifts: reality slowly diverges from anyone’s mental model of it, until nobody knows the true state. Infrastructure as code makes the desired state of your platform a reviewable artifact and the act of provisioning a repeatable, recoverable operation.

A mental model

Hold two pictures in your head.

The first is storage and compute as decoupled layers, stacked. At the very bottom sits cheap, durable object storage — think of it as an effectively bottomless warehouse floor where you can drop boxes for almost nothing per box. The boxes are open columnar files. By themselves they are just boxes: you can read them, but there is no inventory, no guarantee that a box isn’t being repacked while you read it, no way to ask “what did this shelf look like last Tuesday.” The table format is the inventory system you bolt on top — it tracks which files belong to which table at which moment, makes a batch of changes appear all-at-once or not at all, and remembers every past state. With that inventory in place, any number of forklifts — a SQL engine, Spark, a streaming job, an ML training run — can work the same floor at the same time without colliding, and each can be sized for its own job. The storage doesn’t care how many engines read it; the engines don’t own the storage. That decoupling is the whole architecture.

The second picture is infrastructure as code as a thermostat for your platform. You don’t reach into the furnace and adjust the flame; you set a target temperature, and a controller continuously reconciles the room toward it. Terraform works the same way: you write down the desired state of your infrastructure, and the tool computes the difference between that and what actually exists, then makes reality match. This is the identical pattern you’ll meet again in GitOps and in Kubernetes — declare the desired state, let a reconciler converge to it — and recognizing it as one idea wearing three hats is most of what you need to understand all three.

Choosing the platform shape

Before any code, decide the shape of the platform. Figure 34.1 shows where the pieces land; the decisions below are which pieces you actually want.

Warehouse, lake, or lakehouse? Choose a pure warehouse (Snowflake, BigQuery, Redshift) when your data is overwhelmingly structured, your workload is SQL analytics and BI, your volumes are modest, and you value zero-tuning simplicity over storage cost; a SQL-first analytics team with no Spark or ML ambitions is the classic fit. A pure lake is almost never the right choice for a primary platform anymore: a bare lake without a table format is the swamp risk from the introduction, and it survives today mostly as a landing zone underneath a lakehouse. Choose a lakehouse when you have a mix of structured and semi-structured or unstructured data, when ML and SQL analytics share the same data, when storage cost matters at scale, or when you want to avoid locking your data inside one vendor’s format. The lakehouse is the default for a serious modern platform precisely because it refuses the original fork.

Managed or self-hosted? A managed platform (Snowflake, Databricks, BigQuery) trades money for operational burden — you write SQL and they run the machines, patch them, scale them, and page themselves at 3 a.m. Self-hosting open components (Trino or Spark over Iceberg on your own object storage) trades engineering effort for control, cost at scale, and freedom from lock-in. Small teams should almost always start managed; the operational savings dwarf the licensing premium until you’re large enough to amortize a platform team.

When does Kubernetes for data make sense? Running stateful data systems (Kafka, Spark, ClickHouse) on Kubernetes yourself buys you multi-framework consistency and multi-cloud portability at the price of real operational complexity — it is for platform teams who run several frameworks together and have the expertise to operate clusters. If you run a single framework, a managed service (EMR, Dataflow, Confluent Cloud, a managed warehouse) is almost always the better trade. We return to this tension in its own section.

What you’ll learn

Why the data-lake-versus-warehouse fork was real, and how the lakehouse dissolves it by layering an open table format over cheap object storage
How object storage plus an open table format (Iceberg, Delta, Hudi) gives a data lake the ACID transactions, schema evolution, and time travel a warehouse had
Why separating storage from compute is the central cost-and-flexibility lever of every cloud data platform
How cloud warehouses (Snowflake, BigQuery, Redshift) differ, and how to reason about managed versus self-hosted
How infrastructure as code with Terraform makes a data platform declarative, reproducible, and auditable — and what plan/apply and remote state actually give you
When stateful data workloads belong on Kubernetes versus a managed service
How catalogs and platform-level access control keep a lakehouse governed rather than letting it decay into a swamp

Prerequisites

Data Warehousing & Modeling — what a warehouse is, what dimensional models and columnar storage are, and why analytical queries differ from transactional ones. The lakehouse is where those models physically come to live.
Containerization — images, the immutable-artifact mindset, and the stateful-workload caveat (state lives on a mounted volume, never in the container). The Kubernetes-for-data section builds directly on it.
Comfort with cloud basics: object storage, IAM roles, and what “a region” means.

Storage: lake vs warehouse vs lakehouse

Start at the bottom of the stack, because the bottom is where the money and the trust are won or lost. Everything above it is, in a sense, a way of querying what sits here.

A data warehouse is a database tuned for analytics. It stores data in a proprietary, highly optimized columnar format, on storage that it manages and usually fuses to its compute. Within its walls it is excellent: transactions are ACID, schemas are enforced, SQL is fast, and you tune almost nothing. Its limitations are exactly the things it refuses. It refuses cheap storage — you pay warehouse rates for every byte, including cold raw data you touch twice a year. It refuses your formats — your data lives inside the vendor’s format, and getting it out is a project. And classically it refuses to separate storage from compute, so an idle warehouse still bills, and a storage-heavy, compute-light workload pays for compute it doesn’t use.

A data lake is the opposite bet. It is just object storage (S3, Google Cloud Storage, Azure Data Lake Storage) holding files in open formats, most importantly Parquet, the open columnar format that virtually every major query engine can read. Object storage is astonishingly cheap, effectively infinite, designed for eleven nines of durability, and decoupled from any compute by construction: the bytes sit in a bucket, and whatever wants to read them brings its own compute. You can store anything, in any shape, raw, for almost nothing. The catch is everything a database does for you that a bucket does not. A bucket has no transactions: if a Spark job writes a thousand Parquet files and dies after five hundred, a reader sees half a table, and there is no rollback. A bucket has no enforced schema: file number 900 can quietly add a column or change a type, and nothing stops it, so the dataset slowly becomes unreadable as a whole. A bucket has no efficient way to update or delete a single row buried in a multi-gigabyte file. These are not minor gaps; they are the difference between a data store you can trust a query against and the swamp from the introduction.

The lakehouse is the synthesis, and the synthesizing component is the open table format. Iceberg, Delta Lake, and Hudi are not storage and not engines; they are a metadata layer that sits between the raw files and the engines reading them. The format maintains, alongside your Parquet files, a transaction log and a manifest of which files constitute the table as of each moment. That single piece of bookkeeping is what buys back the warehouse guarantees:

ACID transactions. A write produces a new committed snapshot atomically. The thousand-file Spark job either commits all its files as one new table version or commits none — a reader never sees the half-written state, because “the table” is defined by the log, and the log only points at complete commits.
Schema evolution. Adding a column, renaming one, or widening its type (within the promotions each format allows) is a tracked metadata operation, not a rewrite of every file and not a silent drift. The format knows the schema of every snapshot, so old and new files coexist correctly under one logical table.
Time travel. Because every commit is a snapshot and the log remembers them, you can query the table as of a past timestamp or version, and you can roll a bad write back. The “why did this query return different rows today” mystery becomes a diffable history.
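The bookkeeping behind those three guarantees can be sketched in a few lines. What follows is a toy model, not how Iceberg, Delta, or Hudi are actually implemented: ToyTable, Snapshot, and the file names are all hypothetical, and a real format persists its log in the bucket and handles concurrent writers. The one idea it tries to show is that a file only becomes part of the table when a commit appends a snapshot that lists it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    version: int
    files: frozenset  # the data files that make up the table at this version


@dataclass
class ToyTable:
    """Hypothetical model: 'the table' is whatever the latest snapshot lists."""
    log: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    def current(self) -> Snapshot:
        return self.log[-1]

    def commit(self, add=(), remove=()) -> Snapshot:
        # Data files may already sit in the bucket; they join the table only
        # when this single append to the log happens.
        base = self.current()
        files = (base.files - frozenset(remove)) | frozenset(add)
        snapshot = Snapshot(base.version + 1, files)
        self.log.append(snapshot)
        return snapshot

    def as_of(self, version: int) -> Snapshot:
        # Time travel: earlier snapshots are retained, so they stay readable.
        return self.log[version]


table = ToyTable()
table.commit(add={"part-0001.parquet", "part-0002.parquet"})

# A writer uploads 500 files, then crashes before it ever calls commit().
orphaned = {f"crashed-{i:04d}.parquet" for i in range(500)}
visible_after_crash = table.current().files

# A later, successful write replaces one file in a single commit.
table.commit(add={"part-0003.parquet"}, remove={"part-0001.parquet"})
```

The crashed writer's files exist in storage but never appear in any snapshot, so readers never see them, and table.as_of(1) still returns the two-file state from before the replacement.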

And crucially, the lakehouse keeps the lake’s defining property: storage and compute stay decoupled. The data is one copy of open Parquet files in one bucket, wrapped by one table format, and many engines query it side by side — a SQL warehouse for BI, Spark for heavy ELT, a streaming engine for fresh data, an ML pipeline for training — each engine sized and scaled for its own job, none of them owning the storage. This is the architecture in Figure 34.1: cheap object storage at the base, the table format wrapping it with database semantics, decoupled engines on top all reading the same tables, and infrastructure as code provisioning the whole thing.

A small, concrete illustration of the difference the format makes. Against a bare lake, “update yesterday’s partition” means rewriting files and praying nothing reads mid-rewrite. Against a lakehouse table, it is a transaction:

-- An Iceberg/Delta table: a normal SQL transaction over files in a bucket.
-- Spark SQL dialect; other engines spell the MERGE shorthand and time travel differently.
-- The MERGE commits atomically; no reader sees a half-applied state.
MERGE INTO sales.orders t
USING staging.orders_2026_06_24 s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- And because every commit is a snapshot, you can look backwards:
SELECT * FROM sales.orders FOR SYSTEM_TIME AS OF '2026-06-23 00:00:00';

The SQL looks like a warehouse. The data underneath is cheap open files in a bucket you control. That is the lakehouse promise in one statement.

Cloud data platforms

The warehouse did not disappear; it evolved, and the evolution is itself the lesson. The defining move of the modern cloud warehouse was to separate storage from compute, the same decoupling the lakehouse makes architecturally explicit. Snowflake built its product around that separation, BigQuery was architected that way from the start, and Redshift adopted it with its RA3 generation. In a classic warehouse, storage and compute were welded together: more data forced more compute and vice versa, and an idle cluster still billed. Separated, your data sits in cheap object storage, and you spin compute up against it on demand, sizing and scaling each independently. This is the cost lever of cloud data platforms. It means a storage-heavy, query-light workload stops subsidizing compute it doesn’t use; it means two teams can run two differently-sized warehouses against one copy of the data without fighting for resources; and it means an idle platform costs storage prices, not compute prices.
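The arithmetic behind that cost lever is easy to work through. The sketch below compares a fused cluster that must stay up around the clock with a separated setup that pays for storage continuously and for compute only while queries run. Every number in it (hourly rates, storage price, node count, busy hours) is a made-up assumption chosen to keep the sums readable, not any vendor's price list.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def fused_monthly_cost(node_hourly_usd, nodes):
    # Fused: the cluster holds the data, so it bills every hour, busy or idle.
    return node_hourly_usd * nodes * HOURS_PER_MONTH


def separated_monthly_cost(storage_tb, usd_per_tb_month, compute_hourly_usd, busy_hours):
    # Separated: storage bills continuously, compute only while it runs.
    return storage_tb * usd_per_tb_month + compute_hourly_usd * busy_hours


# Hypothetical workload: 50 TB of data, queried for about 60 hours a month.
fused = fused_monthly_cost(node_hourly_usd=3.0, nodes=4)
separated = separated_monthly_cost(
    storage_tb=50, usd_per_tb_month=23.0, compute_hourly_usd=12.0, busy_hours=60
)
idle_month = separated_monthly_cost(
    storage_tb=50, usd_per_tb_month=23.0, compute_hourly_usd=12.0, busy_hours=0
)
```

Under these assumptions the fused cluster costs 8,760 dollars a month regardless of use, the separated setup costs 1,870, and a fully idle month drops to the 1,150 dollars of storage alone. The ratio will differ with real prices; the structure of the comparison is the point.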

The major managed platforms differ mostly in where they sit on a few axes. BigQuery is fully serverless and SQL-first: you write SQL, Google runs the machines, and you pay per terabyte scanned or for reserved slots — superb for SQL analytics teams who want zero operational surface, weakest when you need heavy Spark or strict cost predictability. Snowflake is the multi-cloud, data-sharing specialist: it runs on AWS, GCP, and Azure, and its secure data marketplace lets organizations share data without moving it — the choice for enterprise SQL analytics and cross-org sharing, less so for the heaviest ML. Databricks is the Spark-and-ML platform, built around Delta Lake and the lakehouse pattern itself: it shines for data science, streaming-plus-batch unification, and open formats, and is overkill for a team that only wants SQL. Redshift is the AWS-native warehouse — the natural pick inside an AWS shop, with its RA3 and serverless generations adopting the storage-compute separation that defines the category. The practical pattern many organizations land on is several of these against one open-format copy of the data: Databricks for engineering and ML, a SQL warehouse for BI, all reading the same Iceberg or Delta tables — which is only possible because the format and the storage are open and decoupled.

The managed-versus-self-hosted decision threads through all of this. Managed platforms trade money for operational burden: you stop running machines and start writing queries, and the premium you pay buys back the people you’d otherwise need to patch, scale, and babysit a cluster. Self-hosting the open stack (Trino or Spark over Iceberg on your own object storage) trades engineering effort for raw cost at scale, full control, and freedom from any single vendor’s format and pricing. The honest default for most teams is to start managed and earn your way into self-hosting only when your scale makes the licensing premium larger than a platform team’s salary. Managed does not mean cost-safe, though, and that is its own kind of war story: on a warehouse that bills by bytes scanned (BigQuery’s on-demand pricing is the familiar example), an analyst running SELECT * against a petabyte table with no partition filter can turn a single query into a four-figure invoice. Partition your tables, require partition filters, and set per-query cost guards before anyone touches the console.
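A per-query cost guard can be as simple as an estimate checked before the query runs. The sketch below is one hypothetical shape for it: TableStats, guard, the price per TiB, and the budget are all assumptions made for illustration, and it pretends partitions are evenly sized. Managed warehouses usually offer built-in equivalents (BigQuery, for instance, has a require-partition-filter table option and a maximum-bytes-billed query setting), which are preferable to a home-grown check.

```python
from dataclasses import dataclass

TIB = 2**40


class QueryRejected(Exception):
    """Raised when a query fails the guard before it runs."""


@dataclass(frozen=True)
class TableStats:
    total_bytes: int
    partition_column: str
    partition_count: int


def estimated_cost_usd(stats, partitions_read, usd_per_tib):
    # Assumes evenly sized partitions, which real tables rarely have.
    scanned = stats.total_bytes * partitions_read // stats.partition_count
    return scanned / TIB * usd_per_tib


def guard(stats, partitions_read, usd_per_tib, max_usd):
    """partitions_read=None models a query with no partition filter."""
    if partitions_read is None:
        raise QueryRejected(f"add a filter on {stats.partition_column}")
    cost = estimated_cost_usd(stats, partitions_read, usd_per_tib)
    if cost > max_usd:
        raise QueryRejected(f"estimated ${cost:.2f} exceeds ${max_usd:.2f}")
    return cost


# Hypothetical petabyte-scale table split into 1024 daily partitions.
events = TableStats(total_bytes=2**50, partition_column="event_date", partition_count=1024)
one_day = guard(events, partitions_read=1, usd_per_tib=5.0, max_usd=10.0)
```

Reading one partition passes at an estimated 5 dollars under these made-up rates; an unfiltered query or a ten-partition scan raises QueryRejected before any bytes are billed.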

Infrastructure as code

Now provision all of this — the buckets, the catalog, the warehouse, the IAM roles, the networking — without ever opening the cloud console. That is infrastructure as code, and Terraform is its standard.

The core idea is declarative reconciliation, the thermostat from the mental model. You do not write a script that says “create this bucket, then create that role, in this order.” You write down the desired end state of your infrastructure in HCL configuration files, and Terraform figures out the difference between that desired state and what currently exists, then makes the changes — in the right dependency order — to converge. A bucket that references a role makes Terraform create the role first; you never sequence it by hand. The desired state is the source of truth, and the world is reconciled to it.
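The dependency ordering described above is a topological sort over references between resources. The sketch below shows that idea with Python's standard graphlib; the resource names are hypothetical, and Terraform's real graph is built from the references in your HCL rather than from a hand-written dictionary like this one.

```python
from graphlib import TopologicalSorter

# Hypothetical desired state: each resource maps to the resources it references.
desired = {
    "iam_role.etl": [],
    "bucket.lakehouse": ["iam_role.etl"],
    "catalog.analytics": ["bucket.lakehouse"],
}


def creation_order(desired, existing):
    """Return the missing resources, dependencies first."""
    ordered = TopologicalSorter(desired).static_order()
    return [name for name in ordered if name not in existing]


from_scratch = creation_order(desired, existing=set())
role_already_exists = creation_order(desired, existing={"iam_role.etl"})
```

You declare only what references what; the order falls out of the graph. Starting from nothing, the role comes first, then the bucket, then the catalog. If the role already exists, only the other two are created.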

The workflow that makes this trustworthy is plan then apply. terraform plan computes and shows you the diff (every resource it will create, change, or destroy) before touching anything. terraform apply then makes the change. To be sure it executes exactly the plan you reviewed, save the plan with terraform plan -out=<file> and pass that file to terraform apply; a bare terraform apply computes a fresh plan and asks for approval. This is the single most important habit in the practice: you read the plan, you see “1 to add, 0 to change, 0 to destroy,” and only then do you apply. The plan is what a hand-clicked console can never give you: a reviewable preview of a change to production infrastructure, the same way a code review previews a change to production code.
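The separation between computing a diff and executing it can be modeled in a few lines. This is a simplified sketch, not Terraform's algorithm: plan, summarize, and apply are hypothetical functions over plain dictionaries, and the sketch collapses recorded state and live infrastructure into a single "actual" mapping.

```python
def plan(desired, actual):
    """Diff two {resource_name: config} mappings without changing either."""
    shared = desired.keys() & actual.keys()
    return {
        "add": sorted(desired.keys() - actual.keys()),
        "change": sorted(k for k in shared if desired[k] != actual[k]),
        "destroy": sorted(actual.keys() - desired.keys()),
    }


def summarize(p):
    return f"{len(p['add'])} to add, {len(p['change'])} to change, {len(p['destroy'])} to destroy"


def apply(p, desired, actual):
    """Execute a previously computed plan and return the new actual state."""
    result = dict(actual)
    for name in p["add"] + p["change"]:
        result[name] = desired[name]
    for name in p["destroy"]:
        del result[name]
    return result


actual = {"bucket.lakehouse": {"versioning": True}}
desired = {
    "bucket.lakehouse": {"versioning": True},
    "iam_role.etl": {"policy": "read-only"},
}
reviewed = plan(desired, actual)   # read-only: nothing has changed yet
preview = summarize(reviewed)      # "1 to add, 0 to change, 0 to destroy"
after = apply(reviewed, desired, actual)
```

Planning again after the apply yields an empty diff, which is the property that makes the operation safe to repeat.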

What ties the declaration to reality is state. Terraform records, in a state file, its understanding of which real resources correspond to which configuration. plan is a three-way comparison: what you want (the .tf files), what Terraform thinks exists (state), and what actually exists (the live cloud). State is also why the file must live in a remote backend with locking (an S3 bucket with a DynamoDB lock table is the classic setup, or Terraform Cloud) the moment more than one person is involved. Here is the war story that makes the rule visceral: two engineers ran terraform apply against the same unlocked state file at the same time, and corrupted it; the recorded state drifted out of sync with reality, infrastructure ended up half-changed, and untangling it was an outage. A remote backend with state locking serializes concurrent applies so this cannot happen. A shared, unlocked state file is a guaranteed incident, eventually.
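What locking buys can be shown with a toy backend. ToyBackend below is a hypothetical in-memory stand-in, not any real Terraform backend, and its lock check is not atomic the way a real one must be. It illustrates only the behavior that matters: a second apply fails fast while the first still holds the lock, instead of both writing state at once.

```python
class StateLocked(Exception):
    """Raised when another operation already holds the state lock."""


class ToyBackend:
    """Hypothetical in-memory stand-in for a remote backend with locking."""

    def __init__(self):
        self.state = {}
        self.lock_holder = None

    def acquire(self, who):
        # A real backend makes this check-and-set atomic; this sketch does not.
        if self.lock_holder is not None:
            raise StateLocked(f"state is locked by {self.lock_holder}")
        self.lock_holder = who

    def release(self, who):
        if self.lock_holder == who:
            self.lock_holder = None


def apply(backend, who, changes):
    backend.acquire(who)
    try:
        backend.state.update(changes)
    finally:
        backend.release(who)


backend = ToyBackend()
backend.acquire("alice")                 # Alice's apply is in flight
try:
    apply(backend, "bob", {"bucket.lakehouse": "v2"})
    bob_blocked = False
except StateLocked:
    bob_blocked = True                   # Bob fails fast instead of racing Alice
backend.release("alice")
apply(backend, "bob", {"bucket.lakehouse": "v2"})   # now it goes through
```

Bob's first attempt is rejected and leaves the state untouched; once Alice releases the lock, the same apply succeeds.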

The other two pieces are modules and the auditability the whole thing yields. A module is a reusable, parameterized bundle of infrastructure — a “data lake bucket” module, say, that always sets versioning, encryption, lifecycle tiering, and public-access blocking correctly — that you instantiate per environment instead of re-deriving by hand and getting it subtly wrong each time. And because every resource lives in version-controlled HCL, your infrastructure gains exactly what application code has: a full history of who changed what and why, pull-request review on changes before they apply, and the ability to rebuild the entire platform from the repository after a region goes down. The third team’s outage — the one with no blueprint to rebuild from — simply cannot happen when the blueprint is the repo.

# A small module instance: a governed data-lake bucket, declared once and
# reproducible across dev/staging/prod. The input names below are whatever
# the local ./modules/data-lake-bucket module chooses to expose.
module "lakehouse_storage" {
  source      = "./modules/data-lake-bucket"
  bucket_name = "acme-${var.environment}-lakehouse"
  versioning  = true          # guards against accidental overwrite or delete; table time travel comes from snapshots, not from this
  encryption  = "aws:kms"     # encrypted at rest, auditable key access
  lifecycle_tiers = {         # cheap storage gets cheaper as it cools
    infrequent_access_after_days = 30
    archive_after_days           = 90
  }
}

The plan for that change is a few lines you can read in a pull request; the apply is reproducible in any environment; and the configuration is the durable record of what your platform is. That is the entire value proposition over clicking in a console: a diff, a review, a reproduction, and a recovery — none of which a console gives you.

Kubernetes for data workloads

Sometimes you want to run the stateful data systems themselves — Kafka, Spark, Postgres, ClickHouse — rather than rent them as managed services. Kubernetes is where that happens, and it is also where the stateful-workload caveat from the Containerization chapter becomes a load-bearing rule rather than a footnote.

The honest framing is that Kubernetes for data buys multi-framework consistency and multi-cloud portability at the cost of real operational complexity. If you run Spark, Flink, and Kafka together and want one deployment substrate across clouds with no vendor lock-in, Kubernetes earns its keep. If you run a single framework, a managed service (EMR or Dataproc for Spark, Confluent Cloud or MSK for Kafka, a managed warehouse for SQL) is almost always the better trade, because the managed service absorbs the operational burden that you would otherwise carry yourself. Kubernetes for data is for platform teams with the expertise to operate clusters, not a default.

Two Kubernetes ideas do the heavy lifting for data, and both are about state. The first is the operator pattern. Kubernetes on its own does not know how to run a Kafka cluster correctly: how to add a broker, rebalance partitions, or recover a failed node without losing quorum. An operator encodes that application-specific knowledge as a controller: you declare what you want with a custom resource (a Kafka object asking for three brokers), and the operator continuously reconciles the cluster toward it. This is the same declarative-reconciliation idea as Terraform and GitOps, applied to a running stateful system. Strimzi does this for Kafka, CloudNativePG for Postgres, the Spark Operator for Spark jobs. The operator is how production-grade stateful data services run on Kubernetes at all.
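The reconcile loop at the heart of an operator can be sketched without any Kubernetes machinery. The code below is a toy, not Strimzi's logic or the Kubernetes controller API: next_action and reconcile are hypothetical, brokers are just an id-to-healthy mapping, and real operators also handle partition rebalancing, quorum, and failures partway through an action.

```python
def next_action(desired_count, brokers):
    """Pick one corrective step, or None when observed state matches desired.

    brokers maps broker id -> healthy (bool); a deliberately tiny model.
    """
    for broker_id in sorted(brokers):
        if not brokers[broker_id]:
            return ("replace", broker_id)
    if len(brokers) < desired_count:
        return ("add", max(brokers, default=-1) + 1)
    if len(brokers) > desired_count:
        return ("remove", max(brokers))
    return None


def reconcile(desired_count, brokers, max_steps=20):
    """Apply one action at a time until converged; return final state and log."""
    brokers = dict(brokers)
    log = []
    for _ in range(max_steps):
        action = next_action(desired_count, brokers)
        if action is None:
            break
        kind, broker_id = action
        if kind == "remove":
            del brokers[broker_id]
        else:
            # "add" and "replace" both end with a healthy broker at this id.
            brokers[broker_id] = True
        log.append(action)
    return brokers, log


observed = {0: True, 1: False}          # two brokers, one of them failed
final, actions = reconcile(3, observed)  # desired: three healthy brokers
```

Starting from two brokers with one failed, the loop replaces broker 1 and then adds broker 2. Running it again against the converged state produces no actions, which is what "continuously reconciles" means in practice.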

The second idea is persistent volumes, and it is where teams lose data. Containers are ephemeral; their writable layer dies with them, and Kubernetes will reschedule your pod to another node when you least expect it. Stateful data must therefore live on a PersistentVolumeClaim — durable storage that survives the pod, backed by cloud block storage — and never in the pod’s own ephemeral storage. The war story is one line of YAML: a database pod backed by emptyDir (scratch space) instead of a PVC looked fine until the first reschedule, at which point every byte of data vanished. This is the Containerization chapter’s “state lives on a mounted volume, never in the container” rule, made concrete and unforgiving. The general mechanics of pods, scheduling, and volumes belong to the Kubernetes chapter in Part V; what matters here is the data-specific consequence — stateful data systems need operators to run correctly and PersistentVolumeClaims to survive at all, and if you cannot commit to both, you want a managed service.
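That one-line mistake is cheap to catch before deployment. The sketch below is a hypothetical lint check over a pod spec expressed as a plain dictionary. The field names (volumes, emptyDir, persistentVolumeClaim, volumeMounts, mountPath) follow the Kubernetes pod spec, but the function itself, the example specs, and the idea of passing in a set of data paths are assumptions made for illustration.

```python
def ephemeral_data_mounts(pod_spec, data_paths):
    """Return (container, mountPath) pairs where a data path sits on emptyDir."""
    volume_kind = {}
    for volume in pod_spec.get("volumes", []):
        if "persistentVolumeClaim" in volume:
            volume_kind[volume["name"]] = "persistentVolumeClaim"
        elif "emptyDir" in volume:
            volume_kind[volume["name"]] = "emptyDir"
        else:
            volume_kind[volume["name"]] = "other"

    findings = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            on_scratch = volume_kind.get(mount["name"]) == "emptyDir"
            if mount["mountPath"] in data_paths and on_scratch:
                findings.append((container["name"], mount["mountPath"]))
    return findings


# Hypothetical specs: the same database pod, wired two different ways.
risky = {
    "containers": [{"name": "db", "volumeMounts": [{"name": "data", "mountPath": "/var/lib/db"}]}],
    "volumes": [{"name": "data", "emptyDir": {}}],
}
safe = {
    "containers": [{"name": "db", "volumeMounts": [{"name": "data", "mountPath": "/var/lib/db"}]}],
    "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "db-data"}}],
}

risky_findings = ephemeral_data_mounts(risky, {"/var/lib/db"})
safe_findings = ephemeral_data_mounts(safe, {"/var/lib/db"})
```

The risky spec is flagged and the PVC-backed one is not. A check like this belongs in CI, where it costs nothing, rather than in the postmortem.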

Governance

A lakehouse without governance drifts back toward the swamp, just more slowly. Two platform-level concerns keep it healthy.

The first is the catalog — the inventory system from the mental model, made real. A catalog (AWS Glue, Unity Catalog, a Hive Metastore, an Iceberg REST catalog) records what tables exist, what their schemas are, where their files live, and increasingly their lineage and ownership. It is what lets every decoupled engine agree on what “the orders table” is, and it is what turns “ask the one engineer who remembers” into a queryable directory. A lakehouse’s open table format gives each table its guarantees; the catalog is what makes the set of tables discoverable and consistent across engines.
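At its smallest, a catalog is a shared lookup from table name to location, schema, and owner. The sketch below is a toy registry, not the Glue, Unity Catalog, Hive Metastore, or Iceberg REST API. ToyCatalog, the bucket path, and the table and owner names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    location: str   # where the table's files and metadata live
    schema: tuple   # ((column_name, type_name), ...)
    owner: str


class ToyCatalog:
    """Hypothetical in-memory catalog: one shared answer to 'what is this table?'"""

    def __init__(self):
        self._tables = {}

    def register(self, name, entry):
        if name in self._tables:
            raise ValueError(f"{name} is already registered")
        self._tables[name] = entry

    def resolve(self, name):
        # Every engine asks the same question and gets the same entry back.
        return self._tables[name]

    def tables_with_column(self, column):
        return sorted(
            name for name, entry in self._tables.items()
            if any(col == column for col, _ in entry.schema)
        )


catalog = ToyCatalog()
catalog.register("sales.orders", TableEntry(
    location="s3://example-lakehouse/sales/orders",   # hypothetical path
    schema=(("order_id", "bigint"), ("customer_id", "bigint"), ("amount", "decimal")),
    owner="sales-data-team",
))
catalog.register("crm.customers", TableEntry(
    location="s3://example-lakehouse/crm/customers",  # hypothetical path
    schema=(("customer_id", "bigint"), ("email", "string")),
    owner="crm-team",
))

where_is_orders = catalog.resolve("sales.orders").location
has_customer_id = catalog.tables_with_column("customer_id")
```

The lookup replaces the one engineer who remembers: any engine resolves sales.orders to the same location, and a question like "which tables carry customer_id" becomes a query instead of a conversation.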

The second is access control over the platform, and it has two faces that are easy to conflate. There is data access control — who can read which tables, columns, and rows, increasingly expressed as policies in the catalog. And there is infrastructure access control — who can change the platform itself, which is exactly the IAM and least-privilege story that infrastructure as code makes auditable: roles defined in version-controlled HCL, reviewed before they apply, granting the minimum each workload needs. The two together are what “governed” means: the right people can read the right data, and changes to who-can-do-what are themselves diffable, reviewable history rather than a setting someone changed in a console at 2 a.m. and forgot.
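The data-access half can be sketched as a default-deny check over column grants. GRANTS, the role names, and denied_columns below are hypothetical, and real catalogs express this as policies with richer semantics (row filters, masking, inheritance). The sketch shows only the least-privilege shape: nothing is readable unless a grant says so.

```python
# Hypothetical grants: (role, table) -> columns that role may read.
# Kept as data so that changes to it can be reviewed like any other diff.
GRANTS = {
    ("analyst", "sales.orders"): {"order_id", "order_date", "amount"},
    ("ml_training", "sales.orders"): {"order_id", "amount"},
}


def denied_columns(role, table, requested):
    """Return the requested columns this role may NOT read (default deny)."""
    allowed = GRANTS.get((role, table), set())
    return sorted(set(requested) - allowed)


def can_read(role, table, requested):
    return not denied_columns(role, table, requested)


analyst_ok = can_read("analyst", "sales.orders", ["order_id", "amount"])
ml_denied = denied_columns("ml_training", "sales.orders", ["order_id", "order_date"])
unknown_role = denied_columns("intern", "sales.orders", ["order_id"])
```

The analyst's request passes, the training role is refused order_date, and a role with no grant at all is refused everything. Because the grants live in a file, widening access is a reviewed change rather than a console setting.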

War story: the swamp and the unreproducible console

A growth-stage company ran both halves of the introduction at once. Their analytics lake was a bare S3 bucket of Parquet with no table format, no catalog, and no access control beyond “the team can write anywhere.” A nightly Spark job updated the previous day’s partition by overwriting files in place; one night it crashed at file 600 of 1,100, and for the next eight hours every dashboard read a table that was 600 files in the future and 500 files in the past. There was no transaction to roll back, because there were no transactions — the “table” was whatever files happened to be in the prefix. The fix was not heroics; it was an open table format, which would have made that nightly update a single atomic commit that readers never saw mid-flight.

The deeper problem surfaced during the remediation. To rebuild the platform correctly, the team needed to know what the platform was — and they couldn’t, because every bucket, role, and policy had been clicked into the console by hand over two years by a rotating cast of engineers, none of it recorded anywhere. There was no source of truth to reconcile against, no way to stand up an identical staging copy to test the fix, and no way to prove the rebuilt platform matched the old one. They spent more time reverse-engineering their own infrastructure than fixing the data bug. Both halves of the lesson are the same lesson: the platform layer must be declared — the tables by a table format, the infrastructure by code — or you are trusting a state that lives nowhere you can read, review, or rebuild.

Build it → See these layers in working systems: Project 07: Data Lakehouse is the direct analog — object storage, an open table format, and decoupled engines over one copy of the data. Project 34: Distributed File System is the durable storage layer the whole stack rests on, built from first principles. Project 52: Time-Series Database is a stateful storage engine of exactly the kind you’d run on Kubernetes with an operator and persistent volumes.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Pick the storage shape and defend it. You are handed three workloads: (a) a BI team running scheduled SQL over a few hundred gigabytes of clean, structured sales data; (b) an ML team training on tens of terabytes of raw clickstream JSON plus a directory of images; (c) a platform that must serve both (a) and (b) from one copy of the data. For each, decide whether a warehouse, a bare lake, or a lakehouse fits, and write one paragraph defending each choice in terms of cost, the reliability guarantees the workload needs, and which engines must query the data. Name explicitly, for case (c), what the open table format buys you that a bare lake would not.
Level II — Sketch a module and explain plan/apply/state. Sketch (in HCL pseudocode, not a working deployment) a small Terraform module for one data-platform component — a governed lakehouse bucket, or a catalog database, or a warehouse — with at least three parameters and the safe defaults you’d bake in (versioning, encryption, least-privilege access). Then write a short explanation, as if onboarding a teammate, of what terraform plan, terraform apply, and remote state with locking give you that clicking the same resource into the cloud console does not — name the diff, the review, the reproducibility, and the concurrency-safety, and tie each to a failure it prevents.
Level III — Design a lakehouse and argue the win. Design an end-to-end lakehouse architecture for a company that today runs a classic fused-storage-and-compute warehouse it has outgrown. Specify the four layers — object storage, the open table format, the decoupled compute engines, and the infrastructure-as-code that provisions them — plus the governance layer (catalog and access control). Then make the core argument: explain why decoupling storage from compute delivers a cost-and-flexibility win the old warehouse structurally could not, walking through at least two concrete scenarios (an idle period; two differently-sized teams querying one dataset) where the separated architecture wins and explaining precisely why in each. Address the migration risk: what does the open table format protect you from that motivated leaving the proprietary warehouse in the first place?

Summary

Data infrastructure is the platform layer that decides, at once, whether your data is cheap, reliable, queryable, and rebuildable. The lake-versus-warehouse fork that defined a generation of data teams — pick cheap-and-flexible-but-untrustworthy, or reliable-but-expensive-and-rigid — is dissolved by the lakehouse: cheap object storage holding open columnar files, wrapped by an open table format (Iceberg, Delta, Hudi) that adds back the warehouse’s ACID transactions, schema evolution, and time travel, with compute engines decoupled from storage so each scales on its own. That same storage-compute separation is the central cost lever of every modern cloud warehouse. And the whole platform must be declared as infrastructure as code — provisioned by Terraform through a reviewable plan, reconciled to a version-controlled desired state, and recoverable from the repository — because a platform clicked together by hand is unauditable, unreproducible, and, after an outage, unrecoverable. Stateful data systems can run on Kubernetes when you need multi-framework, multi-cloud consistency, but only with operators and persistent volumes; otherwise managed services are the better trade. Catalogs and access control are what keep the lakehouse governed instead of decaying into a swamp.

Key takeaways

The lakehouse resolves the lake-vs-warehouse fork: object storage gives you cheap and open; an open table format gives you ACID, schema evolution, and time travel on top of it — the lake’s cost with the warehouse’s guarantees, on one copy of the data.
Separating storage from compute is the defining cost-and-flexibility lever — it lets storage-heavy and compute-heavy workloads stop subsidizing each other and lets many engines query one dataset independently.
Infrastructure as code turns a platform into a reviewable, reproducible, recoverable artifact: plan shows the diff, apply reconciles to the committed desired state, and remote state with locking keeps concurrent changes safe.
Terraform state belongs in a remote, locked backend, and stateful data on Kubernetes belongs on a PersistentVolumeClaim: the same “state never lives in the ephemeral thing” rule in two places.
A table format keeps each table trustworthy; a catalog plus access control keeps the platform governed. Without them, a lake decays into a swamp.
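
As a concrete anchor for the remote-state takeaway, here is a minimal sketch of a locked remote backend, assuming an AWS setup. The bucket name, state key, region, and lock-table name are all hypothetical placeholders, and the state bucket and lock table are assumed to exist already (the DynamoDB table needs a string partition key named `LockID`).

```hcl
# Sketch only: every name below is a hypothetical placeholder.
# Backend blocks cannot reference variables, so the values are literals.
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"             # assumed pre-existing state bucket
    key            = "lakehouse/prod/terraform.tfstate" # one state object per environment
    region         = "us-east-1"
    encrypt        = true                               # encrypt the state object at rest
    dynamodb_table = "acme-terraform-locks"             # lock table that serializes concurrent applies
  }
}
```

With a backend like this, a second `terraform apply` started while the first still holds the lock fails fast instead of writing over the same state. Recent Terraform releases also offer an S3-native lock file option (`use_lockfile`) in place of the DynamoDB table, so check the backend documentation for the version you run.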

Connections to other chapters

Data Warehousing & Modeling (prerequisite): that chapter teaches the dimensional models, columnar storage, and analytical query patterns; this chapter is where those models physically live and run. The lakehouse is the modern home for the schemas you learned to design there — the same star schemas, now sitting on open files instead of inside a proprietary warehouse.
Data Processing Engines (sibling): the decoupled compute engines that query this storage — Spark, Trino, streaming engines — are the subject of that chapter. The storage-compute separation here is precisely what lets those engines scale independently and share one copy of the data; the two chapters are the two halves of “where data rests” and “what computes over it.”
Containerization (Part V, prerequisite): the immutable-artifact mindset and the stateful-workload caveat (“state lives on a mounted volume, never in the container”) are the foundation this chapter builds on. The Kubernetes-for-data section is that rule taken to its data-platform conclusion — PersistentVolumeClaims are the mounted volume, made unforgiving.
Kubernetes (Part V, extension): the general mechanics of pods, scheduling, operators, and volumes belong there; this chapter takes only the data-specific consequences — when stateful data systems should run on Kubernetes at all, and what operators and persistent volumes mean for them.
CI/CD (Part IV, extension): infrastructure as code is deployed through pipelines — terraform plan on a pull request, terraform apply on merge — which is the identical declarative-reconcile idea as GitOps, applied to infrastructure instead of application deployments. The platform you declare here ships through the pipeline you build there.

Further reading

Apache Iceberg documentation and Delta Lake documentation — the two dominant open table formats; read both to see how a transaction log and manifest layer ACID, schema evolution, and time travel over plain files.
Reis & Housley, Fundamentals of Data Engineering (O’Reilly) — the canonical treatment of the data-platform layers, the storage-compute separation, and the lake/warehouse/lakehouse landscape.
Terraform documentation (HashiCorp) — the declarative model, plan/apply, state, and modules, from the source.

Deep dives

Brikman, Terraform: Up & Running (O’Reilly) — the practitioner’s book on modules, remote state with locking, environments, and the team workflows that make IaC safe at scale.
Armbrust et al., “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics” (Databricks / CIDR, 2021) — the paper that named and argued the lakehouse pattern; read it for the explicit case against the lake-vs-warehouse fork.

Historical context

The separation-of-storage-and-compute history — Snowflake’s “The Snowflake Elastic Data Warehouse” (SIGMOD, 2016) and the Amazon S3 design lineage — together explain how cheap, durable, decoupled object storage made everything in this chapter economically possible.
The Apache Parquet and Apache Hive Metastore projects — the open columnar format and the original catalog that the entire lakehouse stack still rests on.

--- title: "Data Infrastructure" keywords: [data infrastructure, lakehouse, data lake, data warehouse, object storage, table formats, iceberg, delta, terraform, infrastructure as code] difficulty: advanced prerequisites: [data-warehousing, containerization] estimated_time: "3-4 hours" --- ## Introduction Two teams, two opposite disasters, the same root cause. The first team bought a data warehouse — a fast, expensive, columnar machine that ran SQL like a dream. It served the BI dashboards beautifully, right up until the day the ML team showed up with a few terabytes of raw clickstream JSON and a directory of product images they wanted to train on. The warehouse had no place to put any of it. It wanted clean, typed, modeled tables; it charged warehouse prices for every byte sitting in it; and it could not hold the messy, half-structured, not-yet-understood raw data that is the lifeblood of analytics and machine learning. So the raw data went somewhere else, and now there were two systems, two copies, and a nightly job to shuttle data between them that broke every other week. The second team had learned that lesson and swore off warehouses entirely. They dumped everything into a *data lake* — just a big bucket of object storage, cheap as dirt, infinitely scalable, holding raw files in whatever shape they arrived. For about a year it was glorious. Then it rotted. Without a schema, nobody knew what was in any given file. Without transactions, a job that crashed halfway left the lake in a state where some files were updated and some weren't, and a query reading mid-write got garbage. Without a catalog, finding the right dataset meant asking the one engineer who remembered where things were. The cheap, flexible lake had decayed into a **data swamp**: untrustworthy, unqueryable, ungoverned. The same `SELECT` that worked yesterday returned different rows today, and nobody could say why. 
And here is the cause both disasters shared: a third team, in the same company, had clicked all of this infrastructure together by hand in a cloud console. The buckets, the warehouse, the IAM roles, the networking — each provisioned by someone navigating a web UI, one checkbox at a time, with no record of what they did. When a region had an outage and the platform needed to be rebuilt, there was no blueprint to rebuild it *from*. Nobody could say what the working configuration had been, because the working configuration lived only in the cloud account itself, and the cloud account was the thing that was down. This chapter is about the platform layer that sits underneath your data and decides three things at once: whether your data is **cheap** to store, whether it is **reliable** enough to trust, and whether it is **queryable** by the engines that need it — *and*, fourth, whether you can **rebuild the whole thing** from a file in version control after it's gone. Get this layer right and the warehouse-versus-lake fork stops being a fork. Get it wrong and you pick which of the disasters above you'd prefer. ### The Core Insight Two ideas, taken together, define modern data infrastructure. Both are resolutions of false choices that earlier generations of data teams thought they had to live with. The first idea resolves the lake-versus-warehouse fork. For two decades you genuinely had to choose. A **data lake** — raw files on cheap object storage — gave you low cost, infinite scale, open formats, and the freedom to store anything in any shape. But it gave you none of the guarantees a database makes: no ACID transactions, so a half-finished write left readers seeing torn data; no enforced schema, so files drifted apart until nothing could read them uniformly; no efficient updates or deletes, which made even a GDPR "delete this user" request a nightmare. 
A **data warehouse** gave you exactly those guarantees — transactions, schemas, fast SQL — but it charged warehouse prices, locked your data inside a proprietary format, fused storage to compute so you paid for both even when idle, and refused anything that wasn't already clean and structured. The **lakehouse** dissolves the choice. It keeps the lake's cheap object storage and open columnar files, and it layers an **open table format** — Apache Iceberg, Delta Lake, or Apache Hudi — on top of those files. That table format is a metadata layer that adds back exactly the guarantees the bare lake lacked: ACID transactions, schema evolution, time travel. You get the lake's cost and flexibility *and* the warehouse's reliability, on one copy of the data, with compute engines decoupled from storage so each scales on its own. The second idea is that this platform must be **infrastructure as code**. Every bucket, catalog, role, warehouse, and cluster that makes up the platform should be defined in declarative configuration files, kept in version control, and provisioned by a tool — Terraform is the standard — rather than clicked together by hand. The reason is the third team's outage. Hand-built infrastructure is *unauditable*: there is no diff, no review, no record of who changed what. It is *unreproducible*: you cannot stamp out an identical staging environment, and you cannot rebuild production from scratch. And it *drifts*: reality slowly diverges from anyone's mental model of it, until nobody knows the true state. Infrastructure as code makes the desired state of your platform a reviewable artifact and the act of provisioning a repeatable, recoverable operation. ### A mental model Hold two pictures in your head. The first is **storage and compute as decoupled layers, stacked**. At the very bottom sits cheap, durable object storage — think of it as an effectively bottomless warehouse floor where you can drop boxes for almost nothing per box. 
The boxes are open columnar files. By themselves they are just boxes: you can read them, but there is no inventory, no guarantee that a box isn't being repacked while you read it, no way to ask "what did this shelf look like last Tuesday." The **table format** is the inventory system you bolt on top — it tracks which files belong to which table at which moment, makes a batch of changes appear all-at-once or not at all, and remembers every past state. With that inventory in place, *any* number of forklifts — a SQL engine, Spark, a streaming job, an ML training run — can work the same floor at the same time without colliding, and each can be sized for its own job. The storage doesn't care how many engines read it; the engines don't own the storage. That decoupling is the whole architecture. The second picture is **infrastructure as code as a thermostat for your platform**. You don't reach into the furnace and adjust the flame; you set a target temperature, and a controller continuously reconciles the room toward it. Terraform works the same way: you write down the *desired state* of your infrastructure, and the tool computes the difference between that and what actually exists, then makes reality match. This is the identical pattern you'll meet again in GitOps and in Kubernetes — declare the desired state, let a reconciler converge to it — and recognizing it as one idea wearing three hats is most of what you need to understand all three. ### Choosing the platform shape Before any code, decide the shape of the platform. @fig-de-lakehouse shows where the pieces land; the decisions below are which pieces you actually want. **Warehouse, lake, or lakehouse?** Choose a **pure warehouse** (Snowflake, BigQuery, Redshift) when your data is overwhelmingly structured, your workload is SQL analytics and BI, your volumes are modest, and you value zero-tuning simplicity over storage cost — a SQL-first analytics team with no Spark or ML ambitions is the classic fit. 
Choose a **pure lake** almost never anymore for a primary platform: a bare lake without a table format is the swamp risk from the introduction, and it survives today mostly as a landing zone *underneath* a lakehouse. Choose a **lakehouse** when you have a mix of structured and semi-structured or unstructured data, when ML and SQL analytics share the same data, when storage cost matters at scale, or when you want to avoid locking your data inside one vendor's format. The lakehouse is the default for a serious modern platform precisely because it refuses the original fork. **Managed or self-hosted?** A managed platform (Snowflake, Databricks, BigQuery) trades money for operational burden — you write SQL and they run the machines, patch them, scale them, and page themselves at 3 a.m. Self-hosting open components (Trino or Spark over Iceberg on your own object storage) trades engineering effort for control, cost at scale, and freedom from lock-in. Small teams should almost always start managed; the operational savings dwarf the licensing premium until you're large enough to amortize a platform team. **When does Kubernetes for data make sense?** Running stateful data systems (Kafka, Spark, ClickHouse) on Kubernetes yourself buys you multi-framework consistency and multi-cloud portability at the price of real operational complexity — it is for platform teams who run several frameworks together and have the expertise to operate clusters. If you run a single framework, a managed service (EMR, Dataflow, Confluent Cloud, a managed warehouse) is almost always the better trade. We return to this tension in its own section. 
### What you'll learn - Why the data-lake-versus-warehouse fork was real, and how the lakehouse dissolves it by layering an open table format over cheap object storage - How object storage plus an open table format (Iceberg, Delta, Hudi) gives a data lake the ACID transactions, schema evolution, and time travel a warehouse had - Why separating storage from compute is the central cost-and-flexibility lever of every cloud data platform - How cloud warehouses (Snowflake, BigQuery, Redshift) differ, and how to reason about managed versus self-hosted - How infrastructure as code with Terraform makes a data platform declarative, reproducible, and auditable — and what `plan`/`apply` and remote state actually give you - When stateful data workloads belong on Kubernetes versus a managed service - How catalogs and platform-level access control keep a lakehouse governed rather than letting it decay into a swamp ### Prerequisites - **Data Warehousing & Modeling** — what a warehouse is, what dimensional models and columnar storage are, and why analytical queries differ from transactional ones. The lakehouse is where those models physically come to live. - **Containerization** — images, the immutable-artifact mindset, and the stateful-workload caveat (state lives on a mounted volume, never in the container). The Kubernetes-for-data section builds directly on it. - Comfort with cloud basics: object storage, IAM roles, and what "a region" means. --- ## Storage: lake vs warehouse vs lakehouse Start at the bottom of the stack, because the bottom is where the money and the trust are won or lost. Everything above it is, in a sense, a way of querying what sits here. A **data warehouse** is a database tuned for analytics. It stores data in a proprietary, highly optimized columnar format, on storage that it manages and usually fuses to its compute. Within its walls it is excellent: transactions are ACID, schemas are enforced, SQL is fast, and you tune almost nothing. 
Its limitations are exactly the things it refuses. It refuses cheap storage — you pay warehouse rates for every byte, including cold raw data you touch twice a year. It refuses your formats — your data lives inside the vendor's format, and getting it out is a project. And classically it refuses to separate storage from compute, so an idle warehouse still bills, and a storage-heavy, compute-light workload pays for compute it doesn't use. A **data lake** is the opposite bet. It is just object storage — S3, Google Cloud Storage, Azure Data Lake Storage — holding files in open formats, most importantly **Parquet**, the open columnar format that every query engine on earth can read. Object storage is astonishingly cheap, effectively infinite, durable to eleven nines, and decoupled from any compute by construction: the bytes sit in a bucket, and whatever wants to read them brings its own compute. You can store anything, in any shape, raw, for almost nothing. The catch is everything a database does for you that a bucket does not. A bucket has no transactions: if a Spark job writes a thousand Parquet files and dies after five hundred, a reader sees half a table, and there is no rollback. A bucket has no enforced schema: file number 900 can quietly add a column or change a type, and nothing stops it, so the dataset slowly becomes unreadable as a whole. A bucket has no efficient way to update or delete a single row buried in a multi-gigabyte file. These are not minor gaps — they are the difference between a data store you can trust a query against and the swamp from the introduction. The **lakehouse** is the synthesis, and the synthesizing component is the **open table format**. Iceberg, Delta Lake, and Hudi are not storage and not engines; they are a *metadata layer* that sits between the raw files and the engines reading them. The format maintains, alongside your Parquet files, a transaction log and a manifest of which files constitute the table *as of each moment*. 
That single piece of bookkeeping is what buys back the warehouse guarantees: - **ACID transactions.** A write produces a new committed snapshot atomically. The thousand-file Spark job either commits all its files as one new table version or commits none — a reader never sees the half-written state, because "the table" is defined by the log, and the log only points at complete commits. - **Schema evolution.** Adding, renaming, or retyping a column is a tracked metadata operation, not a rewrite of every file and not a silent drift. The format knows the schema of every snapshot, so old and new files coexist correctly under one logical table. - **Time travel.** Because every commit is a snapshot and the log remembers them, you can query the table *as of* a past timestamp or version, and you can roll a bad write back. The "why did this query return different rows today" mystery becomes a diffable history. And crucially, the lakehouse keeps the lake's defining property: **storage and compute stay decoupled.** The data is one copy of open Parquet files in one bucket, wrapped by one table format, and *many* engines query it side by side — a SQL warehouse for BI, Spark for heavy ELT, a streaming engine for fresh data, an ML pipeline for training — each engine sized and scaled for its own job, none of them owning the storage. This is the architecture in @fig-de-lakehouse: cheap object storage at the base, the table format wrapping it with database semantics, decoupled engines on top all reading the same tables, and infrastructure as code provisioning the whole thing. ![The lakehouse architecture: cheap object storage holds open columnar files at the base; an open table format (Iceberg, Delta) layers ACID transactions, schema evolution, and time travel on top, giving the lake the guarantees a warehouse had; and decoupled compute engines — SQL, Spark, streaming, ML — all query the same tables. 
Infrastructure as code provisions the whole platform declaratively and reproducibly.](../assets/diagrams/rendered/de_lakehouse.svg){#fig-de-lakehouse .lightbox} A small, concrete illustration of the difference the format makes. Against a bare lake, "update yesterday's partition" means rewriting files and praying nothing reads mid-rewrite. Against a lakehouse table, it is a transaction: ```sql -- An Iceberg/Delta table: a normal SQL transaction over files in a bucket. -- The MERGE commits atomically; no reader sees a half-applied state. MERGE INTO sales.orders t USING staging.orders_2026_06_24 s ON t.order_id = s.order_id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *; -- And because every commit is a snapshot, you can look backwards: SELECT * FROM sales.orders FOR SYSTEM_TIME AS OF '2026-06-23 00:00:00'; ``` The SQL looks like a warehouse. The data underneath is cheap open files in a bucket you control. That is the lakehouse promise in one statement. ## Cloud data platforms The warehouse did not disappear; it evolved, and the evolution is itself the lesson. The defining move of the modern cloud warehouse — Snowflake's original insight, and the thing BigQuery and Redshift's RA3 generation adopted — was to **separate storage from compute**, the same decoupling the lakehouse makes architecturally explicit. In a classic warehouse, storage and compute were welded together: more data forced more compute and vice versa, and an idle cluster still billed. Separated, your data sits in cheap object storage, and you spin compute up against it on demand, sizing and scaling each independently. This is *the* cost lever of cloud data platforms. It means a storage-heavy, query-light workload stops subsidizing compute it doesn't use; it means two teams can run two differently-sized warehouses against one copy of the data without fighting for resources; and it means an idle platform costs storage prices, not compute prices. 
The major managed platforms differ mostly in where they sit on a few axes. **BigQuery** is fully serverless and SQL-first: you write SQL, Google runs the machines, and you pay per terabyte scanned or for reserved slots — superb for SQL analytics teams who want zero operational surface, weakest when you need heavy Spark or strict cost predictability. **Snowflake** is the multi-cloud, data-sharing specialist: it runs on AWS, GCP, and Azure, and its secure data marketplace lets organizations share data without moving it — the choice for enterprise SQL analytics and cross-org sharing, less so for the heaviest ML. **Databricks** is the Spark-and-ML platform, built around Delta Lake and the lakehouse pattern itself: it shines for data science, streaming-plus-batch unification, and open formats, and is overkill for a team that only wants SQL. **Redshift** is the AWS-native warehouse — the natural pick inside an AWS shop, with its RA3 and serverless generations adopting the storage-compute separation that defines the category. The practical pattern many organizations land on is *several* of these against one open-format copy of the data: Databricks for engineering and ML, a SQL warehouse for BI, all reading the same Iceberg or Delta tables — which is only possible *because* the format and the storage are open and decoupled. The **managed-versus-self-hosted** decision threads through all of this. Managed platforms trade money for operational burden: you stop running machines and start writing queries, and the premium you pay buys back the people you'd otherwise need to patch, scale, and babysit a cluster. Self-hosting the open stack — Trino or Spark over Iceberg on your own object storage — trades engineering effort for raw cost at scale, full control, and freedom from any single vendor's format and pricing. 
The honest default for most teams is to start managed and earn your way into self-hosting only when your scale makes the licensing premium larger than a platform team's salary. One careless decision here, made the other way, is its own kind of war story: the cloud warehouse bills by *bytes scanned*, and an analyst running `SELECT *` against a petabyte table with no partition filter can turn a single query into a four-figure invoice. Partition your tables, require partition filters, and set per-query cost guards before anyone touches the console. ## Infrastructure as code Now provision all of this — the buckets, the catalog, the warehouse, the IAM roles, the networking — without ever opening the cloud console. That is infrastructure as code, and Terraform is its standard. The core idea is **declarative reconciliation**, the thermostat from the mental model. You do not write a script that says "create this bucket, then create that role, in this order." You write down the *desired end state* of your infrastructure in HCL configuration files, and Terraform figures out the difference between that desired state and what currently exists, then makes the changes — in the right dependency order — to converge. A bucket that references a role makes Terraform create the role first; you never sequence it by hand. The desired state is the source of truth, and the world is reconciled to it. The workflow that makes this trustworthy is `plan` then `apply`. `terraform plan` computes and *shows you* the diff — every resource it will create, change, or destroy — before touching anything. `terraform apply` executes that reviewed plan. This is the single most important habit in the practice: you read the plan, you see "1 to add, 0 to change, 0 to destroy," and only then do you apply. The plan is what a hand-clicked console can never give you — a reviewable preview of a change to production infrastructure, the same way a code review previews a change to production code. 
What ties the declaration to reality is **state**. Terraform records, in a state file, its understanding of which real resources correspond to which configuration. `plan` is a three-way comparison: what you want (the `.tf` files), what Terraform thinks exists (state), and what actually exists (the live cloud). State is also why the file must live in a **remote backend** — an S3 bucket with a lock table, or Terraform Cloud — the moment more than one person is involved. Here is the war story that makes the rule visceral: two engineers ran `terraform apply` against the same *local* state file at the same time, and corrupted it; the recorded state drifted out of sync with reality, infrastructure ended up half-changed, and untangling it was an outage. A remote backend with state locking serializes concurrent applies so this cannot happen. A shared, unlocked state file is a guaranteed incident, eventually. The other two pieces are **modules** and the auditability the whole thing yields. A module is a reusable, parameterized bundle of infrastructure — a "data lake bucket" module, say, that always sets versioning, encryption, lifecycle tiering, and public-access blocking correctly — that you instantiate per environment instead of re-deriving by hand and getting it subtly wrong each time. And because every resource lives in version-controlled HCL, your infrastructure gains exactly what application code has: a full history of who changed what and why, pull-request review on changes before they apply, and the ability to rebuild the entire platform from the repository after a region goes down. The third team's outage — the one with no blueprint to rebuild from — simply cannot happen when the blueprint *is* the repo. ```hcl # A small module instance: a governed data-lake bucket plus its table-format # catalog, declared once and reproducible across dev/staging/prod. 
module "lakehouse_storage" { source = "./modules/data-lake-bucket" bucket_name = "acme-${var.environment}-lakehouse" versioning = true # time-travel-friendly; required by the table format encryption = "aws:kms" # encrypted at rest, auditable key access lifecycle_tiers = { # cheap storage gets cheaper as it cools infrequent_access_after_days = 30 archive_after_days = 90 } } ``` The `plan` for that change is a few lines you can read in a pull request; the `apply` is reproducible in any environment; and the configuration is the durable record of what your platform *is*. That is the entire value proposition over clicking in a console: a diff, a review, a reproduction, and a recovery — none of which a console gives you. ## Kubernetes for data workloads Sometimes you want to run the stateful data systems themselves — Kafka, Spark, Postgres, ClickHouse — rather than rent them as managed services. Kubernetes is where that happens, and it is also where the stateful-workload caveat from the *Containerization* chapter becomes a load-bearing rule rather than a footnote. The honest framing is that Kubernetes for data buys **multi-framework consistency and multi-cloud portability at the cost of real operational complexity.** If you run Spark, Flink, and Kafka *together* and want one deployment substrate across clouds with no vendor lock-in, Kubernetes earns its keep. If you run a *single* framework, a managed service — EMR or Dataflow for Spark, Confluent Cloud or MSK for Kafka, a managed warehouse for SQL — is almost always the better trade, because the managed service absorbs the operational burden that you would otherwise carry yourself. Kubernetes for data is for platform teams with the expertise to operate clusters, not a default. Two Kubernetes ideas do the heavy lifting for data, and both are about state. The first is the **operator pattern**. 
A naked Kubernetes does not know how to run a Kafka cluster correctly — how to add a broker, rebalance partitions, or recover a failed node without losing quorum. An *operator* encodes that application-specific knowledge as a controller: you declare what you want with a custom resource (a `Kafka` object asking for three brokers), and the operator continuously reconciles the cluster toward it — the same declarative-reconciliation idea as Terraform and GitOps, applied to a running stateful system. Strimzi does this for Kafka, CloudNativePG for Postgres, the Spark Operator for Spark jobs. The operator is how production-grade stateful data services run on Kubernetes at all. The second idea is **persistent volumes**, and it is where teams lose data. Containers are ephemeral; their writable layer dies with them, and Kubernetes *will* reschedule your pod to another node when you least expect it. Stateful data must therefore live on a `PersistentVolumeClaim` — durable storage that survives the pod, backed by cloud block storage — and never in the pod's own ephemeral storage. The war story is one line of YAML: a database pod backed by `emptyDir` (scratch space) instead of a PVC looked fine until the first reschedule, at which point every byte of data vanished. This is the *Containerization* chapter's "state lives on a mounted volume, never in the container" rule, made concrete and unforgiving. The general mechanics of pods, scheduling, and volumes belong to the **Kubernetes** chapter in Part V; what matters *here* is the data-specific consequence — stateful data systems need operators to run correctly and PersistentVolumeClaims to survive at all, and if you cannot commit to both, you want a managed service. ## Governance A lakehouse without governance drifts back toward the swamp, just more slowly. Two platform-level concerns keep it healthy. The first is the **catalog** — the inventory system from the mental model, made real. 
A catalog (AWS Glue, Unity Catalog, a Hive Metastore, an Iceberg REST catalog) records what tables exist, what their schemas are, where their files live, and increasingly their lineage and ownership. It is what lets every decoupled engine agree on what "the orders table" *is*, and it is what turns "ask the one engineer who remembers" into a queryable directory. A lakehouse's open table format gives each table its guarantees; the catalog is what makes the *set* of tables discoverable and consistent across engines. The second is **access control over the platform**, and it has two faces that are easy to conflate. There is *data* access control — who can read which tables, columns, and rows, increasingly expressed as policies in the catalog. And there is *infrastructure* access control — who can change the platform itself, which is exactly the IAM and least-privilege story that infrastructure as code makes auditable: roles defined in version-controlled HCL, reviewed before they apply, granting the minimum each workload needs. The two together are what "governed" means: the right people can read the right data, and changes to who-can-do-what are themselves diffable, reviewable history rather than a setting someone changed in a console at 2 a.m. and forgot. ::: {.callout-warning} ## War story: the swamp and the unreproducible console A growth-stage company ran both halves of the introduction at once. Their analytics lake was a bare S3 bucket of Parquet with no table format, no catalog, and no access control beyond "the team can write anywhere." A nightly Spark job updated the previous day's partition by overwriting files in place; one night it crashed at file 600 of 1,100, and for the next eight hours every dashboard read a table that was 600 files in the future and 500 files in the past. There was no transaction to roll back, because there were no transactions — the "table" was whatever files happened to be in the prefix. 
The fix was not heroics; it was an open table format, which would have made that nightly update a single atomic commit that readers never saw mid-flight.

The deeper problem surfaced during the remediation. To rebuild the platform correctly, the team needed to know what the platform *was* — and they couldn't, because every bucket, role, and policy had been clicked into the console by hand over two years by a rotating cast of engineers, none of it recorded anywhere. There was no source of truth to reconcile against, no way to stand up an identical staging copy to test the fix, and no way to prove the rebuilt platform matched the old one. They spent more time reverse-engineering their own infrastructure than fixing the data bug.

Both halves of the lesson are the same lesson: **the platform layer must be declared — the tables by a table format, the infrastructure by code — or you are trusting a state that lives nowhere you can read, review, or rebuild.**
:::

> **Build it →** See these layers in working systems:
> [Project 07: Data Lakehouse](https://github.com/jchu0/applied-cs-projects/tree/main/07-data-lakehouse)
> is the direct analog — object storage, an open table format, and decoupled engines over one copy of the data.
> [Project 34: Distributed File System](https://github.com/jchu0/applied-cs-projects/tree/main/34-distributed-file-system)
> is the durable storage layer the whole stack rests on, built from first principles.
> [Project 52: Time-Series Database](https://github.com/jchu0/applied-cs-projects/tree/main/52-time-series-database)
> is a stateful storage engine of exactly the kind you'd run on Kubernetes with an operator and persistent volumes.

---

## Practical exercise

**Difficulty:** Level I · Level II · Level III

1. **Level I — Pick the storage shape and defend it.** You are handed three workloads: (a) a BI team running scheduled SQL over a few hundred gigabytes of clean, structured sales data; (b) an ML team training on tens of terabytes of raw clickstream JSON plus a directory of images; (c) a platform that must serve *both* (a) and (b) from one copy of the data. For each, decide whether a warehouse, a bare lake, or a lakehouse fits, and write one paragraph defending each choice in terms of cost, the reliability guarantees the workload needs, and which engines must query the data. Name explicitly, for case (c), what the open table format buys you that a bare lake would not.
2. **Level II — Sketch a module and explain plan/apply/state.** Sketch (in HCL pseudocode, not a working deployment) a small Terraform module for one data-platform component — a governed lakehouse bucket, or a catalog database, or a warehouse — with at least three parameters and the safe defaults you'd bake in (versioning, encryption, least-privilege access). Then write a short explanation, as if onboarding a teammate, of what `terraform plan`, `terraform apply`, and *remote state with locking* give you that clicking the same resource into the cloud console does not — name the diff, the review, the reproducibility, and the concurrency-safety, and tie each to a failure it prevents.
3. **Level III — Design a lakehouse and argue the win.** Design an end-to-end lakehouse architecture for a company that today runs a classic fused-storage-and-compute warehouse it has outgrown. Specify the four layers — object storage, the open table format, the decoupled compute engines, and the infrastructure-as-code that provisions them — plus the governance layer (catalog and access control).
   Then make the core argument: explain *why decoupling storage from compute* delivers a cost-and-flexibility win the old warehouse structurally could not, walking through at least two concrete scenarios (an idle period; two differently-sized teams querying one dataset) where the separated architecture wins and explaining precisely *why* in each. Address the migration risk: what does the open table format protect you from that motivated leaving the proprietary warehouse in the first place?

## Summary

Data infrastructure is the platform layer that decides, at once, whether your data is cheap, reliable, queryable, and rebuildable. The lake-versus-warehouse fork that defined a generation of data teams — pick cheap-and-flexible-but-untrustworthy, or reliable-but-expensive-and-rigid — is dissolved by the **lakehouse**: cheap object storage holding open columnar files, wrapped by an open table format (Iceberg, Delta, Hudi) that adds back the warehouse's ACID transactions, schema evolution, and time travel, with compute engines decoupled from storage so each scales on its own. That same storage-compute separation is the central cost lever of every modern cloud warehouse. And the whole platform must be declared as **infrastructure as code** — provisioned by Terraform through a reviewable `plan`, reconciled to a version-controlled desired state, and recoverable from the repository — because a platform clicked together by hand is unauditable, unreproducible, and, after an outage, unrecoverable. Stateful data systems can run on Kubernetes when you need multi-framework, multi-cloud consistency, but only with operators and persistent volumes; otherwise managed services are the better trade. Catalogs and access control are what keep the lakehouse governed instead of decaying into a swamp.
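
The declare, diff, reconcile loop that the summary attributes to `plan` and `apply` can be made concrete with a toy model. The sketch below is an illustration only, not how Terraform is implemented: the function names `plan` and `apply`, the `Change` type, and the resource names and settings are all hypothetical, and state is modeled as plain dictionaries rather than a remote, locked backend.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str  # "create", "update", or "delete"
    name: str    # hypothetical resource name


def plan(desired: dict, actual: dict) -> list[Change]:
    """Diff declared state against observed state without changing anything."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(Change("create", name))
        elif name not in desired:
            changes.append(Change("delete", name))
        elif desired[name] != actual[name]:
            changes.append(Change("update", name))
    return changes


def apply(desired: dict, actual: dict) -> dict:
    """Return a new observed state that has been reconciled to the declaration."""
    new_state = dict(actual)
    for change in plan(desired, actual):
        if change.action == "delete":
            del new_state[change.name]
        else:
            new_state[change.name] = desired[change.name]
    return new_state


# Declared in version control (illustrative values only).
desired = {
    "lake_bucket": {"versioning": True, "encryption": "kms"},
    "etl_role": {"policy": "read-write:lake_bucket"},
}
# What actually exists after someone clicked around in a console.
actual = {
    "lake_bucket": {"versioning": False, "encryption": "kms"},
    "scratch_bucket": {"versioning": False},
}

for change in plan(desired, actual):
    print(change.action, change.name)

reconciled = apply(desired, actual)
print(plan(desired, reconciled))  # an empty plan means nothing is left to change
```

The point of the sketch is the separation of concerns: `plan` is a pure diff that a reviewer can read before anything happens, and running `plan` again after `apply` yields no changes, which is what "reconciled to the committed desired state" means in practice.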
### Key takeaways

- The lakehouse resolves the lake-vs-warehouse fork: object storage gives you cheap and open; an open **table format** gives you ACID, schema evolution, and time travel on top of it — the lake's cost with the warehouse's guarantees, on one copy of the data.
- **Separating storage from compute** is the defining cost-and-flexibility lever — it lets storage-heavy and compute-heavy workloads stop subsidizing each other and lets many engines query one dataset independently.
- **Infrastructure as code** turns a platform into a reviewable, reproducible, recoverable artifact: `plan` shows the diff, `apply` reconciles to the committed desired state, and remote state with locking keeps concurrent changes safe.
- **State belongs in a remote, locked backend**, and stateful data on Kubernetes belongs on a **PersistentVolumeClaim** — the same "state never lives in the ephemeral thing" rule in two places.
- A table format keeps each *table* trustworthy; a **catalog plus access control** keeps the *platform* governed. Without them, a lake decays into a swamp.

### Connections to other chapters

- **Data Warehousing & Modeling** (prerequisite): that chapter teaches the dimensional models, columnar storage, and analytical query patterns; *this* chapter is where those models physically live and run. The lakehouse is the modern home for the schemas you learned to design there — the same star schemas, now sitting on open files instead of inside a proprietary warehouse.
- **Data Processing Engines** (sibling): the decoupled compute engines that query this storage — Spark, Trino, streaming engines — are the subject of that chapter. The storage-compute separation here is precisely what lets those engines scale independently and share one copy of the data; the two chapters are the two halves of "where data rests" and "what computes over it."
- **Containerization** (Part V, prerequisite): the immutable-artifact mindset and the stateful-workload caveat ("state lives on a mounted volume, never in the container") are the foundation this chapter builds on. The Kubernetes-for-data section is that rule taken to its data-platform conclusion — PersistentVolumeClaims are the mounted volume, made unforgiving.
- **Kubernetes** (Part V, extension): the general mechanics of pods, scheduling, operators, and volumes belong there; this chapter takes only the data-specific consequences — when stateful data systems should run on Kubernetes at all, and what operators and persistent volumes mean for them.
- **CI/CD** (Part IV, extension): infrastructure as code is *deployed through pipelines* — `terraform plan` on a pull request, `terraform apply` on merge — which is the same declarative-reconcile idea as GitOps, applied to infrastructure instead of application deployments. The platform you declare here ships through the pipeline you build there.

## Further reading

### Essential

- *Apache Iceberg documentation* and *Delta Lake documentation* — the two dominant open table formats; read both to see how a transaction log and manifest layer ACID, schema evolution, and time travel over plain files.
- Reis & Housley, *Fundamentals of Data Engineering* (O'Reilly) — the canonical treatment of the data-platform layers, the storage-compute separation, and the lake/warehouse/lakehouse landscape.
- *Terraform documentation* (HashiCorp) — the declarative model, `plan`/`apply`, state, and modules, from the source.

### Deep dives

- Brikman, *Terraform: Up & Running* (O'Reilly) — the practitioner's book on modules, remote state with locking, environments, and the team workflows that make IaC safe at scale.
- Armbrust et al., *"Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics"* (Databricks / CIDR, 2021) — the paper that named and argued the lakehouse pattern; read it for the explicit case against the lake-vs-warehouse fork.

### Historical context

- The separation-of-storage-and-compute history — Snowflake's *"The Snowflake Elastic Data Warehouse"* (SIGMOD, 2016) and the Amazon S3 design lineage — together explain how cheap, durable, decoupled object storage made everything in this chapter economically possible.
- The *Apache Parquet* and *Apache Hive Metastore* projects — the open columnar format and the original catalog that the entire lakehouse stack still rests on.