RAG Is an Architecture, Not a Feature

Where I Started

On why "just add RAG" is where the real work begins, not ends.

The pitch for retrieval-augmented generation sounds trivial: find the relevant documents, paste them into the prompt, let the model answer. People wire that up in an afternoon, watch it work on three test questions, and ship it. Then it meets real data and real users, and the cracks show — confident answers built on the wrong chunk, retrieval that misses the obvious document, latency that creeps up as the corpus grows.

The trivial version isn't wrong. It's just the first ten percent. RAG is a system with several moving parts, and each one is a design decision with real tradeoffs.

The Parts You Actually Have to Design

Chunking. How you split documents decides what can ever be retrieved. Chunks too small lose context; too large dilute relevance and waste tokens. This is the quiet decision that caps the quality of everything downstream.
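Embeddings. Which model you use, and whether you give it the input prefixes and vector normalization it expects (the E5 family, for example, is trained with "query: " and "passage: " prefixes). Get this wrong and your similarity scores are subtly meaningless.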
Vector search at the right scale. Exact search is fine for a few thousand vectors. Past that you reach for approximate indexes such as HNSW, and past a million or so you add quantization, which shrinks the index in memory at some cost in recall. The right answer depends on your corpus size, and on the latency and memory budget that come with it; there is no single "use this."
Hybrid search and reranking. Pure vector search misses exact terms; pure keyword search misses meaning. Combining them — and then reranking the candidates — is often where the biggest quality jump comes from.
Metadata. The thing everyone ignores until they need to filter by date, source, or permission, and discover they threw it away at indexing time.
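To make a few of these decisions concrete, here is a minimal sketch, not the chapter's implementation. It assumes word-based chunking with a fixed overlap, carries document metadata onto every chunk so it survives indexing, and fuses a vector ranking with a keyword ranking using reciprocal rank fusion, one common way to combine them. The names chunk_words, Chunk, and reciprocal_rank_fusion are hypothetical, and the sizes and k=60 are illustrative defaults rather than recommendations.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_words(text, size=200, overlap=40, metadata=None):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        # Copy document-level metadata onto every chunk and record its position,
        # so filters on source, date, or permission are still possible later.
        meta = {**(metadata or {}), "start_word": start}
        chunks.append(Chunk(" ".join(window), meta))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more score the higher it ranks in each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Toy usage: ids returned by a vector search and by a keyword search.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Fusing on ranks rather than raw scores sidesteps the fact that cosine similarities and keyword scores live on different scales. A reranker, if you add one, would then reorder the top of the fused list.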

The Failure Modes Rhyme

Most RAG systems that disappoint fail in the same few ways: chunks sized by accident rather than intent, metadata discarded, a reranker added because a blog post said to — without measuring whether it helped. The fixes aren't exotic. They come from treating retrieval as a pipeline you can evaluate stage by stage, not a black box you hope works.
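Evaluating stage by stage can start very small. The sketch below assumes you have a handful of queries with hand-labeled relevant document ids, and it compares recall@k for the retriever alone against the same candidates after reranking. The function names and the toy data are hypothetical, made up only to show the shape of the comparison.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant id")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(set(relevant))


def mean_recall_at_k(runs, labels, k):
    """Average recall@k over all labeled queries.

    `runs` maps query -> ranked list of ids; `labels` maps query -> relevant ids.
    """
    return sum(recall_at_k(runs[q], labels[q], k) for q in labels) / len(labels)


# Toy, hand-made example: the same candidates before and after a reranker.
labels = {"q1": ["d1"], "q2": ["d4", "d5"]}
retriever_only = {"q1": ["d2", "d3", "d1"], "q2": ["d4", "d9", "d5"]}
after_rerank = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d9"]}

before = mean_recall_at_k(retriever_only, labels, k=2)
after = mean_recall_at_k(after_rerank, labels, k=2)
print(f"recall@2 before rerank: {before:.2f}, after: {after:.2f}")
```

If the second number is not meaningfully higher on your own labeled queries, the reranker is adding latency without earning it.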

That's the shift I want to land: RAG isn't a feature you toggle on. It's an information-retrieval system with an LLM on the end, and the retrieval part is where engineers earn their keep.

This is a companion to Chapter 7: RAG Systems in my book, AI Engineering: Building Production-Ready LLM Applications. The full chapter goes deep on chunking, embeddings, vector-index selection by scale, hybrid search and reranking, and the common mistakes — with runnable code.

Read the full chapter free → jameshu.io/books/ai-engineering