Streaming & Real-Time Data

Keywords

streaming, kafka, flink, event time, watermarks, windowing, exactly-once, the log, stream processing, real-time analytics

Introduction

The fraud team had a model that worked. Given a transaction — amount, merchant, location, the cardholder’s recent history — it flagged the suspicious ones with better-than-decent accuracy. The trouble was when it ran. Transactions landed in a warehouse all day, and every night a batch job scored the day’s haul and emailed a report. By the time an analyst opened it, the fraudulent charges were a day old: the money had moved, the card was maxed, the customer was already on the phone. The model wasn’t wrong. It was late. The entire value of a fraud score is that it arrives before the transaction settles, and a nightly batch pipeline structurally cannot deliver that — it was computing the insight hours after the only moment it mattered.

So they tried the obvious fix: process each transaction the instant it arrived. Score every event as it comes off the wire, act immediately. This worked in the demo and fell apart in production, in two ways batch had quietly protected them from. First, events did not arrive in order. A mobile reader in a basement buffered a sale and flushed it ten minutes later; a flaky gateway retried and delivered the same charge twice; a timestamp said 10:01 but the event showed up at 10:09. A “transactions in the last minute” counter built on arrival time was simply wrong, and nobody noticed until the numbers stopped reconciling. Second, when a worker crashed mid-batch and restarted, it reprocessed events it had already acted on — double-counting some charges, dropping others, depending on exactly when the commit landed. The naive system was fast and incorrect, which in fraud detection is worse than slow and correct.

This is the shape of the whole field. Streaming lets you act on data now instead of hours later — but the moment you stop processing a finished, bounded dataset and start processing an endless one, time and correctness get genuinely hard in ways batch never forced you to confront. This chapter is about the small set of ideas that make continuous processing both fast and right.

The Core Insight

Batch runs a query over a bounded dataset: here is yesterday’s data, all of it, sitting still; run the aggregation, get an answer, stop. Streaming inverts this. You run a standing computation over an unbounded, continuous stream — the data never stops arriving, so the computation never stops. There is no “all of it,” only “everything so far,” and the answer is something you continuously revise.

That inversion rests on one data structure: the log. Not a logfile — the log, an append-only, ordered, durable, replayable sequence of events. Producers write to the end; the log keeps events in order for as long as you configure; consumers read forward from where they left off. This is what Kafka is. The log’s superpower is decoupling: a producer writes without knowing who reads, and any number of consumers read the same events independently, each at its own pace, each able to replay history from any point.

On top sits stream processing — the standing computation, which Flink exemplifies. A stream processor maintains state (running counts, recent events, session data) and continuously folds new events into it. Here the two problems batch never had become unavoidable:

Event time versus processing time. Events carry a timestamp for when they happened, but they arrive late, out of order, in bursts. Correct analytics must reason about when an event occurred, not when you saw it — which means waiting, judiciously, for stragglers. The mechanism for that wait is the watermark.
Exactly-once semantics. Networks retry, workers crash, consumers rebalance. A correct pipeline must produce the right result despite duplicates and failures — not count a charge twice because a retry resent it, not lose one because a worker died before committing.

Get the log right and producers and consumers stop fighting. Get event time and exactly-once right and the answers are correct. Everything else in streaming is detail on top of these.

A mental model

Picture the log as a durable conveyor belt that moves in one direction and never throws anything away. Producers drop events onto the end; events ride along in the order they landed. Each consumer is a worker beside the belt with its own bookmark, reading at whatever speed it can manage. One worker can be at position 900 while another, slower or newly arrived, is back at 300 reading the same history — and because the belt keeps everything, a worker can move its bookmark backward and re-read the past. That is replay, the whole reason the log feels different from a queue, which deletes each item the instant someone takes it.

A window is how you get a finite answer from an infinite belt: a bounded slice — “all events between 10:00 and 10:05” — so you have something concrete to aggregate. And a watermark is the processor’s statement of confidence about time. When the watermark reaches T, it asserts “I believe I have seen all events with timestamp ≤ T,” so any window ending at or before T can safely close. The watermark is the judgment call at the heart of streaming: wait too little and you close windows before the stragglers arrive (losing data); wait too long and every result is needlessly delayed (losing the point of streaming).

When streaming vs batch

Streaming is not a strictly better batch; it is a different trade. It buys low latency at the cost of real complexity — out-of-order time, state to manage, failure modes batch never had. Figure 32.1 shows the machinery you take on when you choose it. The decision is about whether you need what that machinery buys.

Reach for batch when the work is periodic and latency-tolerant: a nightly revenue roll-up, a weekly reconciliation, training over a year of history. Batch is simpler to reason about (the data holds still), cheaper, and trivially re-runnable. If nothing will act on the data within minutes of arrival, real-time is over-engineering — a five-minute dashboard nobody watches twice a day needs no streaming pipeline behind it.

Reach for streaming when the value of an insight decays in seconds: fraud caught mid-transaction, an alert that must fire before damage spreads, a live dashboard, a feature that feeds a model at request time. The test: if a delay of minutes turns a useful answer useless, you need streaming, and its complexity is the price of admission.

Use both more often than either alone. Most mature platforms run a streaming path for fresh answers and a batch path for complete, authoritative ones — the tension the Lambda and Kappa architectures, later in this chapter, are trying to resolve.

What you’ll learn

Why the log — append-only, partitioned, ordered, replayable — decouples producers from consumers and makes replay possible
How Kafka’s topics, partitions, offsets, and consumer groups scale the log, and how partitioning trades global ordering for parallelism
What stateful stream processing is, how Flink keeps and recovers state, and why checkpointing is the spine of fault tolerance
The difference between event time and processing time, and why correct analytics must reason about when events happened
How windows (tumbling, sliding, session) carve a finite answer out of an unbounded stream, and how watermarks decide when a window is safe to close
What at-most-once, at-least-once, and exactly-once really mean, and why exactly-once is honestly “effectively once”
How real-time analytics serves low-latency queries on fresh data, and what the Lambda-versus-Kappa debate is actually about

Prerequisites

The Data Engineering Landscape — where batch and streaming sit in a platform, and the batch-versus-streaming split this chapter lives on one side of
Distributed systems basics — partitioning, replication, leaders and followers, and why “exactly-once” is hard the moment more than one machine is involved
Comfort with a consumer reading a queue, and with concurrency: many producers and consumers acting at once is the normal case here, not the exception

The log: streaming’s foundation

Everything starts with the log, so it is worth being precise. A log is an append-only sequence of records, each assigned a monotonically increasing offset — 0, 1, 2, 3 — that fixes its position forever. You only ever write to the end; you never modify what is already there. The four properties that follow are the entire reason it is powerful: append-only (writes are fast and contention-free), ordered (every record has a definite place), immutable (history doesn’t change, which makes it auditable and re-readable), and replayable (a reader can start from any offset, including the beginning).

Contrast a traditional message queue, where a message is delivered and then deleted. A queue couples consumers to consumption: once one worker takes a message it is gone, and a second system that wanted it is out of luck. The log refuses to delete. A record written to a Kafka topic stays for the configured retention — hours, days, or forever — and every consumer reads it independently. This single difference enables multiple independent consumers of the same data, replay from any point, event sourcing, and audit logs, all from one write.

Kafka organizes the log into topics — named streams of records (orders, clicks, transactions) — each split into partitions, which is where the log becomes distributed rather than a single file. Each partition is an independent ordered log on a broker; a topic with twelve partitions is twelve logs read and written in parallel across machines. Partitioning is the source of Kafka’s throughput, but it buys that by giving up one thing: ordering is guaranteed only within a partition, never across them. There is no global order across a topic, only per-partition order.

That constraint is why keys matter. When a producer attaches a key, Kafka hashes it to choose a partition, so every record with the same key lands in the same partition and is strictly ordered relative to its siblings. Key by user_id and every event for a user is ordered, while different users spread across partitions for parallelism. This reconciles ordering with scale: not global order (you rarely need it) but order within the key that matters. The common mistake is forcing a single partition to get global ordering — throttling the whole topic to one consumer’s throughput and throwing away the point of the log.

Consumers cooperate through consumer groups. Kafka assigns each partition to exactly one consumer in a group, dividing the topic’s partitions among the members for parallel processing. The arithmetic is unforgiving — one partition to one consumer per group, so if a topic has six partitions and a group has eight consumers, two sit idle. Partition count is the ceiling on a group’s parallelism. Each consumer tracks its position with a committed offset; that offset is what “where I am in the stream” means, and getting offset commits right is most of what reliable consumption is about. The full shape — producers appending by key, partitions giving parallelism, groups reading independently — is the left and center of Figure 32.1.

A producer’s reliability comes down to one setting and one guarantee. The setting is acks: acks=0 fires and forgets (fast, lossy); acks=1 waits for the partition leader only (a leader crash can lose the write); acks=all waits for all in-sync replicas, the durable choice for anything that matters. The guarantee is idempotence — an idempotent producer tags each record so the broker can discard a duplicate sent by a retry, letting you set retries > 0 without a network blip becoming a double-write. The production default is acks=all plus idempotence.

Build it → A working log and streaming substrate: Project 12: Distributed Log System implements the partitioned, replicated append-only log from the inside — offsets, replication, leader election — and Project 08: Streaming Platform wires producers, topics, and consumer groups into an end-to-end event pipeline.

Stream processing and state

The log moves and stores events; it does not compute on them. That is the stream processor’s job, and the leap from a plain consumer to a stream processor is state. A stateless transform — parse JSON, drop negative amounts, route by type — needs no memory; a Kafka consumer with a producer attached (read a topic, transform, write another) is enough. The interesting computations are stateful: count events per user, detect that this charge is ten times the user’s last, group a user’s actions into a session. These require the processor to remember something across events — and that memory is the hard part, because it must survive failure.

Flink is built around this. A Flink job is a graph of operators; a keyed operator — downstream of a keyBy(user_id) — gets its own slice of state per key, partitioned so all of one user’s events and state live together. State mirrors what you store: a value per key (the last transaction amount), a list per key (recent events), a map per key (counts by category). Small state lives on the JVM heap; large state — and production state reaches terabytes — uses an embedded RocksDB backend that spills to disk and supports incremental snapshots. Per event, continuously: read state, fold in the event, write state back, emit a result.

The mechanism that makes this survivable is checkpointing. Periodically the system injects markers called barriers into the streams at the sources; as a barrier flows through the operator graph, each operator snapshots its state to durable storage (S3, HDFS) and passes the barrier on. A completed checkpoint is a globally consistent snapshot: every operator’s state plus the exact source offsets it corresponds to. When a worker dies, recovery is mechanical — restore state from the last checkpoint, rewind the Kafka offsets it recorded, replay forward. Because the log is replayable, the events since the checkpoint are still there to re-read. This is the deep reason streaming demands a log and not a queue: fault tolerance is checkpoint-plus-replay, and replay needs durable, re-readable history. The right side of Figure 32.1 is this machinery.

Build it → Stateful processing at scale: Project 36: Distributed Streaming Analytics implements keyed state, windowed operators, and checkpoint-based recovery — the Flink model taken apart and rebuilt.

Event time, windows, and watermarks

Here is the hard part batch never makes you face. Every event has two times batch silently assumed were the same. Event time is when the event actually occurred — stamped at the source, embedded in the record. Processing time is when your system happened to handle it — wall-clock, at the operator. Between them sits everything that delays an event: a phone buffering clicks offline, a congested network, a retry, a consumer backlog. A mobile click at 10:00:01 might reach Flink at 10:00:09.

Why does the gap matter? Because almost every streaming analytic is “count / sum / average per time window,” and which time you bucket by changes the answer. Bucket that 10:00:01 click by processing time and it lands in the 10:00:05–10:00:10 window — the wrong minute. The “clicks per minute” chart is now wrong and, worse, not reproducible: re-run the same job on the same data on a slower day and the buckets shift, because processing time depends on when you ran it. Bucket by event time and the click lands in the minute it truly happened, every time. The rule is absolute: use event time for correctness; reach for processing time only when you want the lowest possible latency and can accept the inaccuracy.

A window is the bounded slice that turns the unbounded stream into something aggregable, and three shapes answer different questions. Tumbling windows are fixed-size and non-overlapping — [10:00–10:05) [10:05–10:10) — so each event belongs to exactly one; these are periodic aggregates (“sales per hour”). Sliding windows are fixed-size but overlap by a slide interval (a 10-minute window every 5 minutes), so each event falls into several; these give moving averages and smoothed trends. Session windows are dynamic, defined by a gap of inactivity: events cluster into a session, and when none arrives for, say, 30 minutes, it closes — how you reconstruct a user’s visit without arbitrary boundaries. (A caution: session state is per key, so high-cardinality keys all open at once can blow up memory.)

The window poses a question it cannot answer alone: when is it complete? The stream is infinite and events arrive late — so when is it safe to close [10:00–10:05) and emit its count? Close the instant the clock passes 10:05 and you drop every straggler; wait forever and you defeat the purpose. The watermark is the principled answer: a moving assertion in event time, “I am now confident I have seen all events with timestamp ≤ T.” A common strategy is “bounded out-of-orderness” — watermark = (maximum event time seen) − (allowed lateness). Set three minutes of allowed lateness and once the processor has seen an event stamped 10:05, its watermark sits at 10:02; the moment a watermark crosses a window’s end, that window fires. The allowed-lateness knob is the trade: too small and late events miss their window; too large and every result waits needlessly. You profile real data to set it — and events later than your tolerance can go to a side output (a separate late-data stream) rather than vanish, or, with allowedLateness, reopen and update an already-emitted window. The watermark feeding the window-close in Figure 32.1 is this mechanism.

War story: the window that counted by the wrong clock

A retail analytics team shipped a “purchases in the last 5 minutes” dashboard that looked perfect in staging and drifted in production. Test traffic was synthetic and arrived in order, so processing-time and event-time windows agreed. Real traffic did not: a sizable fraction of sales came from mobile point-of-sale readers that buffered transactions offline and flushed them in bursts — sometimes ten or fifteen minutes late. The dashboard bucketed by arrival time, so a 10:02 sale arriving at 10:14 was counted in the 10:14 window. Two buckets were wrong at once: 10:02 looked quiet (its real sales hadn’t “arrived”), and 10:14 showed a phantom spike of old transactions. Finance noticed when the real-time numbers refused to reconcile with the nightly batch totals, which — running over complete data — had been right all along. The fix was a config change: switch the windows to event time, keyed off the timestamp the reader stamped at the sale, with a watermark whose allowed lateness covered the mobile flush delay and the truly-late stragglers routed to a side output. The lesson, stated plainly: if the answer must be correct, window by when events happened, not by when you happened to see them — and the watermark is how you decide how long to wait.

Delivery and exactly-once

Failure is the normal case in a distributed stream, and how a pipeline behaves under it is captured by its delivery guarantee — three of them, differing in what happens on a retry or a crash. At-most-once: an event is processed zero or one times; a mid-flight failure simply loses it — fastest, lossiest, fine only when a dropped event is harmless. At-least-once: an event is processed one or more times; nothing is lost, but a crash between “processed” and “committed” reprocesses it on restart, so duplicates happen. This is the default for a Kafka consumer that commits offsets after processing — and exactly the trap from the introduction, where a rebalance re-delivers events the worker already acted on and a naive counter double-counts.

Exactly-once means every event affects the result precisely once, no losses and no duplicates, even across failures. Strictly, end-to-end, that is nearly impossible — which is why practitioners prefer the honest name effectively once: the observable effect is as if each event were processed once, achieved by two cooperating techniques.

The first is idempotency — design the operation so applying it twice equals applying it once. Key each result by a deterministic identifier (source partition and offset) so a re-delivered event overwrites rather than adds: an upsert, not an append; a set, not an increment. With idempotent writes, at-least-once becomes effectively-once at the sink, because duplicates land on top of themselves. The second is transactions, how Flink-plus-Kafka achieves exactly-once across the whole pipeline. The trick binds three things into one atomic unit: the operator’s checkpointed state, the source offsets it corresponds to, and the output written to the sink. Flink’s checkpoint coordinates with Kafka’s transactional producer so a checkpoint either fully commits — state, offsets, and output together — or rolls back entirely. Downstream consumers read only committed output (read_committed isolation), never the aborted, replayed-over duplicates. Recovery restores state, rewinds offsets, and replays; because the failed attempt’s output was never committed, the replay produces the result exactly once.

The takeaway is a layered defense, not a single switch: make the producer idempotent so retries don’t double-write; commit offsets only after processing so nothing is lost; make sink writes idempotent so duplicates are harmless; and for true exactly-once, lean on the framework’s checkpoint-plus-transaction machinery. “Exactly-once” is not a feature you toggle — it is a property you assemble from idempotency, transactions, and checkpointing, each contributing one piece.

Real-time analytics

Stream processing produces fresh results; real-time analytics is about serving them — answering low-latency queries over data seconds, not hours, old. The processing side computes windowed aggregates and rolling metrics; the serving side puts them somewhere queryable fast enough to back a live dashboard, an alert, or a user-facing analytics product. That layer is usually a specialized OLAP store — ClickHouse for ad-hoc SQL over logs, Druid or Pinot for high-concurrency dashboards at sub-200ms latency — with the last hop to a human often a WebSocket pushing updates as they land. The serving box on the right of Figure 32.1 is this layer.

Two techniques recur because exact computation over an unbounded stream is often too expensive. Incremental aggregation keeps a running summary instead of the raw events (a count, a sum, a sum-of-squares yielding mean and variance), so state stays O(1) per key however many events flow through. Approximate aggregation trades a sliver of accuracy for huge memory savings on high-cardinality questions: a HyperLogLog sketch estimates distinct counts (unique visitors, IPs) to within a couple of percent using about a kilobyte regardless of cardinality, where exact counting would demand a set of every value. For “unique users this hour,” a 2% error is free and the saving is the difference between feasible and not. Both share a merge operation — the same divide-and-combine that lets partial results compute in parallel per partition.

Above all this sits a long-running argument: Lambda versus Kappa. The Lambda architecture runs two pipelines side by side — a batch layer recomputing complete, authoritative results from all history (slow, accurate) and a speed layer serving fresh, approximate results from the stream (fast, eventually corrected) — merged at query time. You get accuracy and freshness, at the cost of maintaining the same logic twice in two codebases that must agree. The Kappa architecture rejects that duplication: if the log retains enough history and your processing is deterministic and replayable, you need no separate batch layer — you just replay the stream. To deploy new logic, spin up a job that reprocesses the log from the beginning and cut over when it catches up. One codebase; the price is a platform that can genuinely replay history at volume — which loops back to why a durable, replayable log is the foundation everything here is built on.

Practical exercise

Difficulty: Level I · Level II · Level III

Level I — Model a use case as a log. Take a concrete scenario — say, a ride-hailing app tracking driver location pings — and model it as producers writing to a partitioned Kafka topic, read by one or more consumer groups. Choose a partitioning key and justify it. Then explain how partitioning gives parallelism (more partitions, more concurrent consumers in a group) and how keying by driver_id preserves ordering within a driver’s pings while spreading load across partitions. State what you give up by not having global ordering, and why that’s acceptable here.
Level II — Window it, and handle a late event. Define a windowed aggregation — “average speed per driver over each 5-minute tumbling window,” keyed by event time. Now introduce a late ping: a driver’s 10:02 location that, because of a tunnel, arrives at 10:09. Walk through what happens under a processing-time window versus an event-time window with a watermark of three minutes’ allowed lateness: which window the ping lands in under each, which result is correct, and what the watermark does to make the event-time version right. Then explain what happens to a ping arriving later than your allowed lateness, and one option for not losing it.
Level III — Design an exactly-once pipeline. Design an end-to-end pipeline for the introduction’s fraud scenario: Kafka source → stateful Flink scoring (per-card state) → a sink that flags suspicious transactions. Specify where each exactly-once ingredient contributes — idempotency (producer and sink), transactions (binding state, offsets, and output into one atomic commit), and checkpointing (the consistent snapshot recovery restores from) — and trace a worker crashing mid-window: which state is restored, which offsets rewound, and why the replay doesn’t double-flag a transaction. Finally, justify streaming over batch here in one paragraph, in terms of when the fraud score has value.

Summary

Streaming inverts batch: instead of querying a bounded dataset that holds still, you run a standing computation over an unbounded, continuous stream. The foundation is the log — append-only, ordered, durable, replayable (Kafka) — which decouples producers from consumers, lets each consumer read independently at its own offset, and makes replay possible. Partitioning buys parallelism at the cost of global ordering, recovered where it matters by keying events. On the log sits stateful stream processing (Flink), made survivable by checkpointing: a consistent snapshot of operator state plus source offsets that recovery restores and replays. The two problems batch never had are event time — reasoning about when events happened, not when you saw them, with windows carving finite answers and watermarks deciding when a window is safe to close — and exactly-once delivery, honestly effectively once, assembled from idempotency, transactions, and checkpointing. Real-time analytics then serves those fresh results at low latency via incremental and approximate aggregation, with Lambda and Kappa offering two answers to how batch and streaming coexist.

Key takeaways

The log is the centerpiece: append-only, ordered, durable, replayable — it decouples producers from consumers and makes replay (and therefore recovery) possible.
Partitioning gives parallelism but only guarantees ordering within a partition; key your events to get ordering where it matters without throttling throughput.
Event time, not processing time, is what correct analytics must window by — results bucketed by arrival time are wrong and not even reproducible.
A watermark is a confidence statement about time (“seen everything up to T”); its allowed-lateness setting is the trade between dropping stragglers and delaying every result.
Exactly-once is effectively once — built from idempotent writes, transactions that bind state-offsets-output atomically, and checkpoint-based recovery; it is assembled, not toggled.

Connections to other chapters

The Data Engineering Landscape (prerequisite): that chapter frames the batch-versus-streaming split that defines a data platform; this one is the streaming half made concrete — the real-time path that runs beside the batch path, not instead of it.
Concurrency and Parallelism Models (foundation): the log and its processors are concurrent systems — many producers and consumers acting at once, backpressure when consumers fall behind, coordination under failure. That chapter’s comparison of concurrency and async models across languages gives you the primitives you reach for to build, not merely operate, a streaming system.
Data Orchestration (sibling): batch and streaming coexist in one platform; orchestration schedules the batch jobs, triggers backfills, and manages the dependency between a streaming sink and the batch layer that reconciles it — the operational glue around the Lambda/Kappa choice here.
Observability (Part IV, extension): a streaming system you cannot see is one you cannot trust. Consumer lag, throughput, watermark progression, and checkpoint duration are the vital signs — a growing lag or a stalling watermark is the early warning that the pipeline is falling behind reality, and catching it is that chapter’s discipline.

Kleppmann, Designing Data-Intensive Applications — the chapters on streams and the log are the single best conceptual grounding for everything here; start with the stream-processing and “unbundling the database” material.
Apache Kafka documentation and Apache Flink documentation — the canonical references for topics/partitions/consumer groups and for event time, watermarks, windows, state, and checkpointing, respectively.

Deep dives

Akidau, Chernyak, and Lax, Streaming Systems — the definitive treatment of event time, windowing, watermarks, and the “what / where / when / how” framing of stream processing.
Jay Kreps, “The Log: What every software engineer should know about real-time data’s unifying abstraction” — the essay that crystallized why the log is the foundational abstraction; short, foundational, and still the best motivation for the whole model.

Historical context

Akidau et al., “The Dataflow Model” (VLDB, 2015) — the paper that unified batch and streaming under one model of windows, watermarks, and triggers, and the intellectual ancestor of Flink’s and Beam’s time semantics.
Akidau et al., “MillWheel: Fault-Tolerant Stream Processing at Internet Scale” (VLDB, 2013) — Google’s earlier system that worked out low-watermarks and exactly-once state for unbounded, out-of-order streams, the lineage behind much of this chapter.

--- title: "Streaming & Real-Time Data" keywords: [streaming, kafka, flink, event time, watermarks, windowing, exactly-once, the log, stream processing, real-time analytics] difficulty: advanced prerequisites: [data-engineering-overview, distributed-systems-basics] estimated_time: "4-5 hours" --- ## Introduction The fraud team had a model that worked. Given a transaction — amount, merchant, location, the cardholder's recent history — it flagged the suspicious ones with better-than-decent accuracy. The trouble was *when* it ran. Transactions landed in a warehouse all day, and every night a batch job scored the day's haul and emailed a report. By the time an analyst opened it, the fraudulent charges were a day old: the money had moved, the card was maxed, the customer was already on the phone. The model wasn't wrong. It was *late*. The entire value of a fraud score is that it arrives before the transaction settles, and a nightly batch pipeline structurally cannot deliver that — it was computing the insight hours after the only moment it mattered. So they tried the obvious fix: process each transaction the instant it arrived. Score every event as it comes off the wire, act immediately. This worked in the demo and fell apart in production, in two ways batch had quietly protected them from. First, events did not arrive in order. A mobile reader in a basement buffered a sale and flushed it ten minutes later; a flaky gateway retried and delivered the same charge twice; a timestamp said 10:01 but the event showed up at 10:09. A "transactions in the last minute" counter built on *arrival* time was simply wrong, and nobody noticed until the numbers stopped reconciling. Second, when a worker crashed mid-batch and restarted, it reprocessed events it had already acted on — double-counting some charges, dropping others, depending on exactly when the commit landed. The naive system was fast and incorrect, which in fraud detection is worse than slow and correct. This is the shape of the whole field. Streaming lets you act on data *now* instead of hours later — but the moment you stop processing a finished, bounded dataset and start processing an endless one, **time and correctness get genuinely hard** in ways batch never forced you to confront. This chapter is about the small set of ideas that make continuous processing both fast and right. ### The Core Insight Batch runs a *query over a bounded dataset*: here is yesterday's data, all of it, sitting still; run the aggregation, get an answer, stop. Streaming inverts this. You run a **standing computation over an unbounded, continuous stream** — the data never stops arriving, so the computation never stops. There is no "all of it," only "everything so far," and the answer is something you continuously revise. That inversion rests on one data structure: **the log**. Not a logfile — *the* log, an append-only, ordered, durable, replayable sequence of events. Producers write to the end; the log keeps events in order for as long as you configure; consumers read forward from where they left off. This is what Kafka is. The log's superpower is **decoupling**: a producer writes without knowing who reads, and any number of consumers read the same events independently, each at its own pace, each able to *replay* history from any point. On top sits **stream processing** — the standing computation, which Flink exemplifies. A stream processor maintains *state* (running counts, recent events, session data) and continuously folds new events into it. Here the two problems batch never had become unavoidable: 1. **Event time versus processing time.** Events carry a timestamp for *when they happened*, but they arrive late, out of order, in bursts. Correct analytics must reason about when an event *occurred*, not when you *saw* it — which means waiting, judiciously, for stragglers. The mechanism for that wait is the **watermark**. 2. **Exactly-once semantics.** Networks retry, workers crash, consumers rebalance. A correct pipeline must produce the right result *despite* duplicates and failures — not count a charge twice because a retry resent it, not lose one because a worker died before committing. Get the log right and producers and consumers stop fighting. Get event time and exactly-once right and the answers are correct. Everything else in streaming is detail on top of these. ### A mental model Picture the log as a **durable conveyor belt** that moves in one direction and never throws anything away. Producers drop events onto the end; events ride along in the order they landed. Each consumer is a worker beside the belt with its own bookmark, reading at whatever speed it can manage. One worker can be at position 900 while another, slower or newly arrived, is back at 300 reading the same history — and because the belt keeps everything, a worker can move its bookmark *backward* and re-read the past. That is replay, the whole reason the log feels different from a queue, which deletes each item the instant someone takes it. A **window** is how you get a finite answer from an infinite belt: a bounded slice — "all events between 10:00 and 10:05" — so you have something concrete to aggregate. And a **watermark** is the processor's statement of confidence about time. When the watermark reaches T, it asserts *"I believe I have seen all events with timestamp ≤ T,"* so any window ending at or before T can safely close. The watermark is the judgment call at the heart of streaming: wait too little and you close windows before the stragglers arrive (losing data); wait too long and every result is needlessly delayed (losing the point of streaming). ### When streaming vs batch Streaming is not a strictly better batch; it is a different trade. It buys low latency at the cost of real complexity — out-of-order time, state to manage, failure modes batch never had. @fig-de-streaming shows the machinery you take on when you choose it. The decision is about whether you need what that machinery buys. **Reach for batch** when the work is periodic and latency-tolerant: a nightly revenue roll-up, a weekly reconciliation, training over a year of history. Batch is simpler to reason about (the data holds still), cheaper, and trivially re-runnable. If nothing will *act* on the data within minutes of arrival, real-time is over-engineering — a five-minute dashboard nobody watches twice a day needs no streaming pipeline behind it. **Reach for streaming** when the value of an insight decays in seconds: fraud caught mid-transaction, an alert that must fire before damage spreads, a live dashboard, a feature that feeds a model at request time. The test: if a delay of minutes turns a useful answer useless, you need streaming, and its complexity is the price of admission. **Use both** more often than either alone. Most mature platforms run a streaming path for fresh answers *and* a batch path for complete, authoritative ones — the tension the Lambda and Kappa architectures, later in this chapter, are trying to resolve. ### What you'll learn - Why **the log** — append-only, partitioned, ordered, replayable — decouples producers from consumers and makes replay possible - How Kafka's topics, partitions, offsets, and consumer groups scale the log, and how partitioning trades global ordering for parallelism - What **stateful stream processing** is, how Flink keeps and recovers state, and why checkpointing is the spine of fault tolerance - The difference between **event time and processing time**, and why correct analytics must reason about when events *happened* - How **windows** (tumbling, sliding, session) carve a finite answer out of an unbounded stream, and how **watermarks** decide when a window is safe to close - What **at-most-once, at-least-once, and exactly-once** really mean, and why exactly-once is honestly "effectively once" - How real-time analytics serves low-latency queries on fresh data, and what the Lambda-versus-Kappa debate is actually about ### Prerequisites - **The Data Engineering Landscape** — where batch and streaming sit in a platform, and the batch-versus-streaming split this chapter lives on one side of - **Distributed systems basics** — partitioning, replication, leaders and followers, and why "exactly-once" is hard the moment more than one machine is involved - Comfort with a consumer reading a queue, and with concurrency: many producers and consumers acting at once is the normal case here, not the exception --- ## The log: streaming's foundation Everything starts with the log, so it is worth being precise. A log is an append-only sequence of records, each assigned a monotonically increasing **offset** — 0, 1, 2, 3 — that fixes its position forever. You only ever write to the end; you never modify what is already there. The four properties that follow are the entire reason it is powerful: **append-only** (writes are fast and contention-free), **ordered** (every record has a definite place), **immutable** (history doesn't change, which makes it auditable and re-readable), and **replayable** (a reader can start from any offset, including the beginning). Contrast a traditional message queue, where a message is delivered and then *deleted*. A queue couples consumers to consumption: once one worker takes a message it is gone, and a second system that wanted it is out of luck. The log refuses to delete. A record written to a Kafka topic stays for the configured retention — hours, days, or forever — and every consumer reads it independently. This single difference enables multiple independent consumers of the same data, replay from any point, event sourcing, and audit logs, all from one write. Kafka organizes the log into **topics** — named streams of records (`orders`, `clicks`, `transactions`) — each split into **partitions**, which is where the log becomes *distributed* rather than a single file. Each partition is an independent ordered log on a broker; a topic with twelve partitions is twelve logs read and written in parallel across machines. Partitioning is the source of Kafka's throughput, but it buys that by giving up one thing: **ordering is guaranteed only within a partition, never across them.** There is no global order across a topic, only per-partition order. That constraint is why **keys** matter. When a producer attaches a key, Kafka hashes it to choose a partition, so every record with the same key lands in the same partition and is strictly ordered relative to its siblings. Key by `user_id` and every event for a user is ordered, while different users spread across partitions for parallelism. This reconciles ordering with scale: not global order (you rarely need it) but *order within the key that matters*. The common mistake is forcing a single partition to get global ordering — throttling the whole topic to one consumer's throughput and throwing away the point of the log. Consumers cooperate through **consumer groups**. Kafka assigns each partition to exactly one consumer in a group, dividing the topic's partitions among the members for parallel processing. The arithmetic is unforgiving — one partition to one consumer per group, so if a topic has six partitions and a group has eight consumers, two sit idle. Partition count is the ceiling on a group's parallelism. Each consumer tracks its position with a committed **offset**; that offset is what "where I am in the stream" means, and getting offset commits right is most of what reliable consumption is about. The full shape — producers appending by key, partitions giving parallelism, groups reading independently — is the left and center of @fig-de-streaming. ![The streaming model: producers append events to a partitioned, append-only log (Kafka), which decouples them from consumers that read independently at their own offsets and can replay history. A stateful stream processor (Flink) applies windowed aggregations keyed by event time, using a watermark to handle late and out-of-order events, then writes to a sink or a real-time serving layer.](../assets/diagrams/rendered/de_streaming.svg){#fig-de-streaming .lightbox} A producer's reliability comes down to one setting and one guarantee. The setting is `acks`: `acks=0` fires and forgets (fast, lossy); `acks=1` waits for the partition leader only (a leader crash can lose the write); `acks=all` waits for all in-sync replicas, the durable choice for anything that matters. The guarantee is **idempotence** — an idempotent producer tags each record so the broker can discard a duplicate sent by a retry, letting you set `retries > 0` without a network blip becoming a double-write. The production default is `acks=all` plus idempotence. > **Build it →** A working log and streaming substrate: > [Project 12: Distributed Log System](https://github.com/jchu0/applied-cs-projects/tree/main/12-distributed-log-system) > implements the partitioned, replicated append-only log from the inside — > offsets, replication, leader election — and > [Project 08: Streaming Platform](https://github.com/jchu0/applied-cs-projects/tree/main/08-streaming-platform) > wires producers, topics, and consumer groups into an end-to-end event pipeline. ## Stream processing and state The log moves and stores events; it does not *compute* on them. That is the stream processor's job, and the leap from a plain consumer to a stream processor is **state**. A stateless transform — parse JSON, drop negative amounts, route by type — needs no memory; a Kafka consumer with a producer attached (read a topic, transform, write another) is enough. The interesting computations are *stateful*: count events per user, detect that this charge is ten times the user's last, group a user's actions into a session. These require the processor to remember something across events — and that memory is the hard part, because it must survive failure. Flink is built around this. A Flink job is a graph of operators; a *keyed* operator — downstream of a `keyBy(user_id)` — gets its own slice of state per key, partitioned so all of one user's events and state live together. State mirrors what you store: a value per key (the last transaction amount), a list per key (recent events), a map per key (counts by category). Small state lives on the JVM heap; large state — and production state reaches terabytes — uses an embedded RocksDB backend that spills to disk and supports incremental snapshots. Per event, continuously: read state, fold in the event, write state back, emit a result. The mechanism that makes this *survivable* is **checkpointing**. Periodically the system injects markers called barriers into the streams at the sources; as a barrier flows through the operator graph, each operator snapshots its state to durable storage (S3, HDFS) and passes the barrier on. A completed checkpoint is a globally consistent snapshot: every operator's state plus the exact source offsets it corresponds to. When a worker dies, recovery is mechanical — restore state from the last checkpoint, rewind the Kafka offsets it recorded, replay forward. Because the log is replayable, the events since the checkpoint are still there to re-read. This is the deep reason streaming demands a *log* and not a queue: fault tolerance is checkpoint-plus-replay, and replay needs durable, re-readable history. The right side of @fig-de-streaming is this machinery. > **Build it →** Stateful processing at scale: > [Project 36: Distributed Streaming Analytics](https://github.com/jchu0/applied-cs-projects/tree/main/36-distributed-streaming-analytics) > implements keyed state, windowed operators, and checkpoint-based recovery — > the Flink model taken apart and rebuilt. ## Event time, windows, and watermarks Here is the hard part batch never makes you face. Every event has two times batch silently assumed were the same. **Event time** is when the event actually occurred — stamped at the source, embedded in the record. **Processing time** is when your system happened to handle it — wall-clock, at the operator. Between them sits everything that delays an event: a phone buffering clicks offline, a congested network, a retry, a consumer backlog. A mobile click at 10:00:01 might reach Flink at 10:00:09. Why does the gap matter? Because almost every streaming analytic is "count / sum / average per time window," and which time you bucket by changes the answer. Bucket that 10:00:01 click by *processing* time and it lands in the 10:00:05–10:00:10 window — the wrong minute. The "clicks per minute" chart is now wrong and, worse, *not reproducible*: re-run the same job on the same data on a slower day and the buckets shift, because processing time depends on when you ran it. Bucket by *event* time and the click lands in the minute it truly happened, every time. The rule is absolute: **use event time for correctness; reach for processing time only when you want the lowest possible latency and can accept the inaccuracy.** A **window** is the bounded slice that turns the unbounded stream into something aggregable, and three shapes answer different questions. **Tumbling** windows are fixed-size and non-overlapping — `[10:00–10:05) [10:05–10:10)` — so each event belongs to exactly one; these are periodic aggregates ("sales per hour"). **Sliding** windows are fixed-size but overlap by a slide interval (a 10-minute window every 5 minutes), so each event falls into several; these give moving averages and smoothed trends. **Session** windows are dynamic, defined by a gap of inactivity: events cluster into a session, and when none arrives for, say, 30 minutes, it closes — how you reconstruct a user's visit without arbitrary boundaries. (A caution: session state is per key, so high-cardinality keys all open at once can blow up memory.) The window poses a question it cannot answer alone: *when is it complete?* The stream is infinite and events arrive late — so when is it safe to close `[10:00–10:05)` and emit its count? Close the instant the clock passes 10:05 and you drop every straggler; wait forever and you defeat the purpose. The **watermark** is the principled answer: a moving assertion in event time, *"I am now confident I have seen all events with timestamp ≤ T."* A common strategy is "bounded out-of-orderness" — watermark = (maximum event time seen) − (allowed lateness). Set three minutes of allowed lateness and once the processor has seen an event stamped 10:05, its watermark sits at 10:02; the moment a watermark crosses a window's end, that window fires. The allowed-lateness knob *is* the trade: too small and late events miss their window; too large and every result waits needlessly. You profile real data to set it — and events later than your tolerance can go to a *side output* (a separate late-data stream) rather than vanish, or, with `allowedLateness`, reopen and update an already-emitted window. The watermark feeding the window-close in @fig-de-streaming is this mechanism. ::: {.callout-warning} ## War story: the window that counted by the wrong clock A retail analytics team shipped a "purchases in the last 5 minutes" dashboard that looked perfect in staging and drifted in production. Test traffic was synthetic and arrived in order, so processing-time and event-time windows agreed. Real traffic did not: a sizable fraction of sales came from mobile point-of-sale readers that buffered transactions offline and flushed them in bursts — sometimes ten or fifteen minutes late. The dashboard bucketed by *arrival* time, so a 10:02 sale arriving at 10:14 was counted in the 10:14 window. Two buckets were wrong at once: 10:02 looked quiet (its real sales hadn't "arrived"), and 10:14 showed a phantom spike of old transactions. Finance noticed when the real-time numbers refused to reconcile with the nightly batch totals, which — running over complete data — had been right all along. The fix was a config change: switch the windows to **event time**, keyed off the timestamp the reader stamped at the sale, with a watermark whose allowed lateness covered the mobile flush delay and the truly-late stragglers routed to a side output. The lesson, stated plainly: if the answer must be correct, window by when events *happened*, not by when you happened to see them — and the watermark is how you decide how long to wait. ::: ## Delivery and exactly-once Failure is the normal case in a distributed stream, and how a pipeline behaves under it is captured by its **delivery guarantee** — three of them, differing in what happens on a retry or a crash. **At-most-once**: an event is processed zero or one times; a mid-flight failure simply loses it — fastest, lossiest, fine only when a dropped event is harmless. **At-least-once**: an event is processed one or more times; nothing is lost, but a crash between "processed" and "committed" reprocesses it on restart, so duplicates happen. This is the default for a Kafka consumer that commits offsets after processing — and exactly the trap from the introduction, where a rebalance re-delivers events the worker already acted on and a naive counter double-counts. **Exactly-once** means every event affects the result precisely once, no losses and no duplicates, even across failures. Strictly, end-to-end, that is nearly impossible — which is why practitioners prefer the honest name **effectively once**: the *observable effect* is as if each event were processed once, achieved by two cooperating techniques. The first is **idempotency** — design the operation so applying it twice equals applying it once. Key each result by a deterministic identifier (source partition and offset) so a re-delivered event overwrites rather than adds: an upsert, not an append; a set, not an increment. With idempotent writes, at-least-once becomes effectively-once *at the sink*, because duplicates land on top of themselves. The second is **transactions**, how Flink-plus-Kafka achieves exactly-once across the whole pipeline. The trick binds three things into one atomic unit: the operator's checkpointed state, the source offsets it corresponds to, and the output written to the sink. Flink's checkpoint coordinates with Kafka's transactional producer so a checkpoint either *fully* commits — state, offsets, and output together — or rolls back entirely. Downstream consumers read only committed output (`read_committed` isolation), never the aborted, replayed-over duplicates. Recovery restores state, rewinds offsets, and replays; because the failed attempt's output was never committed, the replay produces the result exactly once. The takeaway is a layered defense, not a single switch: make the producer idempotent so retries don't double-write; commit offsets only *after* processing so nothing is lost; make sink writes idempotent so duplicates are harmless; and for true exactly-once, lean on the framework's checkpoint-plus-transaction machinery. "Exactly-once" is not a feature you toggle — it is a property you assemble from idempotency, transactions, and checkpointing, each contributing one piece. ## Real-time analytics Stream processing produces fresh results; **real-time analytics** is about *serving* them — answering low-latency queries over data seconds, not hours, old. The processing side computes windowed aggregates and rolling metrics; the serving side puts them somewhere queryable fast enough to back a live dashboard, an alert, or a user-facing analytics product. That layer is usually a specialized OLAP store — ClickHouse for ad-hoc SQL over logs, Druid or Pinot for high-concurrency dashboards at sub-200ms latency — with the last hop to a human often a WebSocket pushing updates as they land. The serving box on the right of @fig-de-streaming is this layer. Two techniques recur because exact computation over an unbounded stream is often too expensive. **Incremental aggregation** keeps a running summary instead of the raw events (a count, a sum, a sum-of-squares yielding mean and variance), so state stays O(1) per key however many events flow through. **Approximate aggregation** trades a sliver of accuracy for huge memory savings on high-cardinality questions: a HyperLogLog sketch estimates distinct counts (unique visitors, IPs) to within a couple of percent using about a kilobyte regardless of cardinality, where exact counting would demand a set of every value. For "unique users this hour," a 2% error is free and the saving is the difference between feasible and not. Both share a `merge` operation — the same divide-and-combine that lets partial results compute in parallel per partition. Above all this sits a long-running argument: **Lambda versus Kappa**. The Lambda architecture runs two pipelines side by side — a *batch* layer recomputing complete, authoritative results from all history (slow, accurate) and a *speed* layer serving fresh, approximate results from the stream (fast, eventually corrected) — merged at query time. You get accuracy and freshness, at the cost of maintaining the same logic twice in two codebases that must agree. The Kappa architecture rejects that duplication: if the log retains enough history and your processing is deterministic and replayable, you need no separate batch layer — you just *replay the stream*. To deploy new logic, spin up a job that reprocesses the log from the beginning and cut over when it catches up. One codebase; the price is a platform that can genuinely replay history at volume — which loops back to why a durable, replayable log is the foundation everything here is built on. --- ## Practical exercise **Difficulty:** Level I · Level II · Level III 1. **Level I — Model a use case as a log.** Take a concrete scenario — say, a ride-hailing app tracking driver location pings — and model it as producers writing to a partitioned Kafka topic, read by one or more consumer groups. Choose a partitioning key and justify it. Then explain how partitioning gives parallelism (more partitions, more concurrent consumers in a group) and how keying by `driver_id` preserves ordering *within* a driver's pings while spreading load across partitions. State what you give up by not having global ordering, and why that's acceptable here. 2. **Level II — Window it, and handle a late event.** Define a windowed aggregation — "average speed per driver over each 5-minute tumbling window," keyed by event time. Now introduce a late ping: a driver's 10:02 location that, because of a tunnel, arrives at 10:09. Walk through what happens under a **processing-time** window versus an **event-time** window with a watermark of three minutes' allowed lateness: which window the ping lands in under each, which result is correct, and what the watermark does to make the event-time version right. Then explain what happens to a ping arriving *later* than your allowed lateness, and one option for not losing it. 3. **Level III — Design an exactly-once pipeline.** Design an end-to-end pipeline for the introduction's fraud scenario: Kafka source → stateful Flink scoring (per-card state) → a sink that flags suspicious transactions. Specify where each exactly-once ingredient contributes — **idempotency** (producer and sink), **transactions** (binding state, offsets, and output into one atomic commit), and **checkpointing** (the consistent snapshot recovery restores from) — and trace a worker crashing mid-window: which state is restored, which offsets rewound, and why the replay doesn't double-flag a transaction. Finally, justify *streaming over batch* here in one paragraph, in terms of when the fraud score has value. ## Summary Streaming inverts batch: instead of querying a bounded dataset that holds still, you run a standing computation over an unbounded, continuous stream. The foundation is **the log** — append-only, ordered, durable, replayable (Kafka) — which decouples producers from consumers, lets each consumer read independently at its own offset, and makes replay possible. Partitioning buys parallelism at the cost of global ordering, recovered where it matters by keying events. On the log sits **stateful stream processing** (Flink), made survivable by checkpointing: a consistent snapshot of operator state plus source offsets that recovery restores and replays. The two problems batch never had are **event time** — reasoning about when events happened, not when you saw them, with **windows** carving finite answers and **watermarks** deciding when a window is safe to close — and **exactly-once delivery**, honestly *effectively once*, assembled from idempotency, transactions, and checkpointing. Real-time analytics then serves those fresh results at low latency via incremental and approximate aggregation, with Lambda and Kappa offering two answers to how batch and streaming coexist. ### Key takeaways - The log is the centerpiece: append-only, ordered, durable, replayable — it decouples producers from consumers and makes replay (and therefore recovery) possible. - Partitioning gives parallelism but only guarantees ordering *within* a partition; key your events to get ordering where it matters without throttling throughput. - Event time, not processing time, is what correct analytics must window by — results bucketed by arrival time are wrong and not even reproducible. - A watermark is a confidence statement about time ("seen everything up to T"); its allowed-lateness setting is the trade between dropping stragglers and delaying every result. - Exactly-once is *effectively once* — built from idempotent writes, transactions that bind state-offsets-output atomically, and checkpoint-based recovery; it is assembled, not toggled. ### Connections to other chapters - **The Data Engineering Landscape** (prerequisite): that chapter frames the batch-versus-streaming split that defines a data platform; this one is the streaming half made concrete — the real-time path that runs beside the batch path, not instead of it. - **Concurrency and Parallelism Models** (foundation): the log and its processors are concurrent systems — many producers and consumers acting at once, backpressure when consumers fall behind, coordination under failure. That chapter's comparison of concurrency and async models across languages gives you the primitives you reach for to *build*, not merely operate, a streaming system. - **Data Orchestration** (sibling): batch and streaming coexist in one platform; orchestration schedules the batch jobs, triggers backfills, and manages the dependency between a streaming sink and the batch layer that reconciles it — the operational glue around the Lambda/Kappa choice here. - **Observability** (Part IV, extension): a streaming system you cannot see is one you cannot trust. Consumer **lag**, throughput, watermark progression, and checkpoint duration are the vital signs — a growing lag or a stalling watermark is the early warning that the pipeline is falling behind reality, and catching it is that chapter's discipline. ## Further reading ### Essential - Kleppmann, *Designing Data-Intensive Applications* — the chapters on **streams and the log** are the single best conceptual grounding for everything here; start with the stream-processing and "unbundling the database" material. - *Apache Kafka documentation* and *Apache Flink documentation* — the canonical references for topics/partitions/consumer groups and for event time, watermarks, windows, state, and checkpointing, respectively. ### Deep dives - Akidau, Chernyak, and Lax, *Streaming Systems* — the definitive treatment of event time, windowing, watermarks, and the "what / where / when / how" framing of stream processing. - Jay Kreps, *"The Log: What every software engineer should know about real-time data's unifying abstraction"* — the essay that crystallized why the log is the foundational abstraction; short, foundational, and still the best motivation for the whole model. ### Historical context - Akidau et al., *"The Dataflow Model"* (VLDB, 2015) — the paper that unified batch and streaming under one model of windows, watermarks, and triggers, and the intellectual ancestor of Flink's and Beam's time semantics. - Akidau et al., *"MillWheel: Fault-Tolerant Stream Processing at Internet Scale"* (VLDB, 2013) — Google's earlier system that worked out low-watermarks and exactly-once state for unbounded, out-of-order streams, the lineage behind much of this chapter.