
RAG in Production: What the Tutorials Don't Cover

Most RAG tutorials get you to a working demo in an afternoon. The jump from that demo to a system that reliably serves enterprise users at scale is where most implementations quietly fail. This is what that jump actually involves.

We have built and maintained RAG systems over a few hundred thousand documents for enterprise clients — contract repositories, policy libraries, technical manuals, financial records. The systems running in production today look almost nothing like what we built first. Not because the initial architecture was stupid. Because the problems that matter in production don't show up until you have real users, real documents, and real edge cases.

What follows is the gap between the tutorial and what we actually run. If you are evaluating a RAG implementation for enterprise use — whether you are building it or buying it — these are the things to ask about.

Your embedding model is the last thing that matters

Every conversation about RAG eventually becomes a conversation about which embedding model to use. OpenAI's ada-002 vs text-embedding-3-large vs a self-hosted Cohere model. It is the wrong conversation to be having first.

Embedding model quality matters. But in our experience it accounts for roughly 15% of retrieval quality in most enterprise settings. The other 85% comes from chunking strategy, metadata design, query preprocessing, and whether you are running hybrid search or pure semantic. We have seen systems running a mediocre embedding model with good chunking significantly outperform systems running the best available embedding model on poorly chunked data.

Pick a reasonable embedding model early (text-embedding-3-small works fine for most English enterprise content, Cohere's embed-multilingual-v3.0 if you need Arabic) and spend your time on the parts of the pipeline that actually determine whether users get good answers.

85% of retrieval quality comes from chunking strategy, metadata design, and hybrid search, not the embedding model everyone obsesses over

Chunking strategy is where you win or lose

The default chunking approach in every tutorial is fixed-size chunks with some overlap. Split at 512 tokens, 50-token overlap, done. This works well enough on the tutorial dataset and falls apart on real enterprise documents for a simple reason: enterprise documents are not uniformly structured.

A legal contract has dense definitional clauses followed by pages of boilerplate followed by schedules that reference the clauses. A technical manual has an overview section, detailed procedures, a troubleshooting section, and appendices. Fixed-size chunking cuts across these structures arbitrarily, splitting a clause mid-sentence or grouping a troubleshooting step with unrelated content from the next section.

What we use instead: structural chunking that respects the document's own hierarchy. For documents with clear heading structure, chunk at the section level, not the token level. For long sections, use semantic chunking that breaks on topic shifts rather than token counts. For tables and structured data, extract them separately — a table embedded in a 512-token chunk loses most of its information when flattened into text.
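A minimal sketch of section-level chunking, under stated assumptions: the input is text with Markdown-style `#` headings, word count stands in for token count, and oversized sections are split on paragraph breaks as a simple stand-in for true semantic chunking. The names `Chunk` and `chunk_by_headings` are hypothetical, not taken from any library.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # the section heading this chunk belongs to
    text: str


# Assumption: headings look like "# Title" through "###### Title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(document: str, max_words: int = 400) -> list[Chunk]:
    """Split at heading lines; split oversized sections on paragraph breaks."""
    sections: list[Chunk] = []
    heading, lines = "(preamble)", []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        if any(ln.strip() for ln in lines):
            sections.append(Chunk(heading, "\n".join(lines).strip()))

    for line in document.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()

    chunks: list[Chunk] = []
    for section in sections:
        if len(section.text.split()) <= max_words:
            chunks.append(section)
            continue
        # Oversized section: pack whole paragraphs up to the word budget,
        # so no paragraph is ever cut mid-sentence.
        current, count = [], 0
        for para in re.split(r"\n\s*\n", section.text):
            n = len(para.split())
            if current and count + n > max_words:
                chunks.append(Chunk(section.heading, "\n\n".join(current)))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append(Chunk(section.heading, "\n\n".join(current)))
    return chunks
```

Every chunk keeps its section heading, so a split section still carries its context into retrieval. Tables would need their own extraction path, which this sketch does not attempt.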

Our take

Design a chunking strategy per document type — not once globally. A policy document and an invoice are different objects that need different treatment. Fixed-size chunking is a tutorial shortcut, not a production architecture.

Pure semantic search is not enough. Hybrid search almost always is.

Semantic search finds documents with similar meaning. Keyword search finds documents with specific terms. Enterprise queries need both.

A user searching for "section 4.2 of the procurement policy" needs keyword matching — semantic search will find documents about procurement policies in general, not that specific section. A user searching for "what happens if a supplier delivers late" needs semantic search — the document probably uses words like "delivery delay" or "SLA breach" rather than the exact query terms.

We run hybrid search on almost everything now: BM25 for keyword matching, vector search for semantic similarity, and a reciprocal rank fusion step to combine the two result lists. The performance improvement over pure semantic is consistent across every enterprise corpus we have tried: typically a 12–18% improvement in answer relevance when measured against a human-rated eval set.
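The fusion step is small enough to sketch. The version below assumes each retriever returns an ordered list of document IDs, best first, and uses the commonly cited smoothing constant k = 60. The function name and the example IDs are illustrative, and nothing here is specific to a particular BM25 or vector-store library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, so the raw BM25 and
    similarity scores never need to be put on a common scale.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for one query.
bm25_hits = ["policy_4_2", "policy_4_1", "sla_terms"]
vector_hits = ["sla_terms", "policy_4_2", "late_delivery"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Here `policy_4_2` ranks first because both retrievers place it near the top, which is the behavior you want from fusion: agreement between keyword and semantic evidence outranks a strong result from only one side.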

If you are evaluating a RAG vendor and they are running pure vector search, ask why. The answer might be legitimate — there are cases where document structure makes hybrid search less valuable. But "we haven't implemented it yet" is a common reason.

Metadata is a first-class citizen, not an afterthought

Every chunk in a production RAG system should carry structured metadata: document type, source system, date, author, department, access permissions, version, and whatever domain-specific fields are relevant. This metadata does two things that retrieval alone cannot.

First, it enables filtered retrieval. A user in the legal team asking a question should get answers from legal documents, not from every department's documentation. "What is our policy on contractor payments" should retrieve from the legal and finance document libraries, filtered by access permissions, not from across the entire corpus including IT policies and HR handbooks.

Second, it enables cited answers. A user asking about a policy wants to know which version of the policy they are reading and when it was last updated. Without document-level metadata attached to every chunk, the answer "according to the procurement policy" tells the user nothing they can act on. With metadata, the system can say "Section 4, Procurement Policy v2.3, updated March 2026" — and that is a usable answer.
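Both uses can be sketched with one small record per chunk. The field names, `visible_to`, and `cite` below are hypothetical and cover only a subset of the metadata listed above. In a real deployment the filter would normally be pushed down into the search engine's own metadata filter rather than applied in application code after retrieval.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMeta:
    doc_title: str
    section: str
    version: str
    updated: date
    department: str
    allowed_groups: frozenset[str]  # groups permitted to read the source document


def visible_to(meta: ChunkMeta, user_groups: set[str], departments: set[str]) -> bool:
    """Filtered retrieval: keep a chunk only if it is in scope and readable."""
    in_scope = meta.department in departments
    permitted = bool(meta.allowed_groups & user_groups)
    return in_scope and permitted


def cite(meta: ChunkMeta) -> str:
    """Cited answers: render a citation the user can act on."""
    return (
        f"{meta.section}, {meta.doc_title} {meta.version}, "
        f"updated {meta.updated.strftime('%B %Y')}"
    )


meta = ChunkMeta(
    doc_title="Procurement Policy",
    section="Section 4",
    version="v2.3",
    updated=date(2026, 3, 1),
    department="legal",
    allowed_groups=frozenset({"legal", "finance"}),
)
```

With this record, a finance user querying the legal and finance libraries sees the chunk, an IT user does not, and `cite(meta)` produces the kind of citation described above.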

The freshness problem will surprise you at the worst time

RAG systems have a freshness problem that nobody designs for until it causes a real incident. A document gets updated in the source system. The RAG index still contains the old version. Users start getting answers based on superseded information, and there is no warning that the retrieval results are stale.

We have seen this cause real problems. A compliance policy update did not propagate to a RAG system for 11 days because the re-indexing pipeline only ran weekly. During that period, users were getting answers based on the old policy. Nobody noticed until an audit.

The fix is not complicated: implement a document change listener that triggers re-indexing within hours of a document update, not on a weekly schedule. Attach an "indexed at" timestamp to every chunk. Surface that timestamp in the UI so users can see how recent their answer is. For high-stakes content — policies, contracts, regulatory documents — consider treating any answer based on a chunk older than 30 days as requiring a freshness warning.
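The two checks involved are simple. The sketch below assumes the source system exposes a last-modified timestamp per document and that every chunk stores a timezone-aware "indexed at" timestamp. The function names are illustrative, and the change listener itself is out of scope because it depends on the source system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold for high-stakes content, taken from the 30-day figure above.
FRESHNESS_LIMIT = timedelta(days=30)


def needs_reindex(source_modified_at: datetime, indexed_at: datetime) -> bool:
    """True when the source document changed after the chunk was indexed."""
    return source_modified_at > indexed_at


def freshness_warning(
    indexed_at: datetime,
    now: Optional[datetime] = None,
    limit: timedelta = FRESHNESS_LIMIT,
) -> Optional[str]:
    """Return a warning to surface in the UI, or None if the chunk is fresh."""
    now = now or datetime.now(timezone.utc)
    age = now - indexed_at
    if age > limit:
        return f"Indexed {age.days} days ago; verify against the source document."
    return None
```

`needs_reindex` is what a change listener or a reconciliation job would call per document. `freshness_warning` runs at answer time, so a stale chunk is flagged even when the re-indexing pipeline has silently failed.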

Nobody wants to build the freshness pipeline. It is unglamorous infrastructure. Build it anyway.

The evaluation set you build in week one will save you in month six

Before you go live, build a hand-curated evaluation set: 50–100 question-answer pairs where you know what the right answer is and which document it comes from. This takes a day. It will pay for itself every time you change a chunking strategy, swap an embedding model, or tune retrieval parameters.

Without this eval set, every change is a guess. You change the chunk size from 512 to 800 tokens and "it seems better." With the eval set, you run it in ten minutes and know that retrieval precision went from 71% to 79% and recall stayed flat. One is engineering. The other is hope. Enterprise clients are not paying for hope.
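A minimal sketch of such a retrieval eval, assuming each case pairs a question with the set of document IDs known to contain the answer, and that the retriever is any callable returning ranked document IDs. `evaluate` and `fake_retriever` are hypothetical names, and the metric definitions (mean precision and recall over the top k results) are one reasonable choice, not necessarily how the figures quoted above were computed.

```python
from typing import Callable

# (question, IDs of the documents that contain the right answer)
EvalCase = tuple[str, set[str]]


def evaluate(
    retrieve: Callable[[str], list[str]],
    cases: list[EvalCase],
    k: int = 5,
) -> dict[str, float]:
    """Mean precision and recall of the top-k retrieved documents."""
    precisions, recalls = [], []
    for question, relevant in cases:
        retrieved = retrieve(question)[:k]
        hits = len(set(retrieved) & relevant)
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
        recalls.append(hits / len(relevant))
    n = len(cases)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}


# Toy retriever standing in for the real pipeline.
canned = {
    "late delivery penalty?": ["sla_terms", "hr_handbook"],
    "contractor payment terms?": ["finance_policy", "it_policy"],
}


def fake_retriever(question: str) -> list[str]:
    return canned.get(question, [])


cases: list[EvalCase] = [
    ("late delivery penalty?", {"sla_terms"}),
    ("contractor payment terms?", {"finance_policy", "legal_policy"}),
]
report = evaluate(fake_retriever, cases, k=2)
```

Run the same `cases` before and after every pipeline change and compare the two reports. The comparison is only meaningful if the eval set stays fixed between runs.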

The RAG systems that fail in production are not the ones with bad embedding models. They are the ones with no eval set, no freshness pipeline, and chunking that was never designed for the actual documents users query.
