Building a production RAG pipeline: a practical guide
What it actually takes to ship a retrieval-augmented generation system that survives contact with real users.
Retrieval-augmented generation is the most common pattern in production AI systems we ship. The 30-minute tutorial version of RAG (chunk, embed, query, generate) is famously easy to demo and famously hard to make reliable. This post is about the gap between demo and production.
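To make the contrast concrete, here is roughly what that tutorial version looks like, sketched in plain Python. The embed_fn and llm callables are placeholders for whatever embedding model and LLM client you use; nothing here is the production shape the rest of this post argues for.

```python
# A toy sketch of the tutorial pipeline: chunk, embed, query, generate.
# embed_fn and llm are placeholder callables, not real client code.
def naive_rag(document: str, question: str, embed_fn, llm, chunk_size: int = 500):
    # 1. Chunk: fixed-size slices, no awareness of document structure.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    # 2. Embed: one vector per chunk, plus one for the question.
    chunk_vecs = [embed_fn(c) for c in chunks]
    query_vec = embed_fn(question)

    # 3. Query: cosine similarity, keep the top few chunks.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    top = sorted(zip(chunks, chunk_vecs),
                 key=lambda cv: cosine(query_vec, cv[1]), reverse=True)[:3]
    # 4. Generate: stuff the chunks into a prompt and hope for the best.
    context = "\n\n".join(c for c, _ in top)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```

Every section below is about why one of these four steps breaks down once real documents and real users arrive.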
The five real problems
Demo RAG systems hide five problems that production RAG systems have to confront: chunking strategy, retrieval quality, prompt engineering for grounded generation, evaluation, and operational cost. Solve four out of five and the system will quietly fail on the fifth.
Chunking is harder than it looks
Default fixed-size chunking gives you adequate retrieval on adequate data. Real-world documents have structure (headings, tables, footnotes) that fixed-size chunking obliterates. We almost always end up with a custom chunker that respects document structure, plus a separate path for tabular data that goes nowhere near a chunker.
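As a rough illustration of what "respects document structure" means, here is a minimal chunker sketch for markdown-style input. The heading regex, the size budget, and the pipe-delimited table detection are illustrative assumptions, not the chunker we ship for any particular document type.

```python
# A minimal sketch of a structure-aware chunker for markdown-style text.
# Headings start new chunks, table rows are routed to a separate path,
# and oversized sections are split on the fly. All thresholds are illustrative.
import re

MAX_CHUNK_CHARS = 1200  # illustrative budget; tune per embedding model

def chunk_document(text: str):
    chunks, tables = [], []
    current_heading = ""
    buffer: list[str] = []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            # Prepend the heading so each chunk carries its own context.
            chunks.append(f"{current_heading}\n{body}".strip())
        buffer.clear()

    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):        # heading: start a new chunk
            flush()
            current_heading = line.strip()
        elif line.lstrip().startswith("|"):      # table row: separate path, no chunker
            tables.append((current_heading, line.strip()))
        else:
            buffer.append(line)
            if sum(len(l) for l in buffer) > MAX_CHUNK_CHARS:
                flush()                          # split oversized sections
    flush()
    return chunks, tables
```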
Hybrid retrieval beats vector-only
Pure vector search is excellent at semantic similarity and bad at exact-match queries. Pure keyword search is the opposite. Production RAG almost always runs both, fuses the two result lists, and applies a reranker on top of the merged candidates. The cost is one extra service in the stack; the quality difference is dramatic.
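To show the shape of the fusion step, here is a sketch using reciprocal rank fusion, one common way to merge the two ranked lists before a reranker. The vector_search, keyword_search, and rerank callables are placeholders, and the k constant is the conventional default rather than a tuned value.

```python
# A minimal sketch of hybrid retrieval: two ranked lists of document IDs are
# merged with reciprocal rank fusion, then a reranker scores the merged
# candidates against the query. The search and rerank callables are stand-ins.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query: str, vector_search, keyword_search, rerank, top_k: int = 5):
    vector_hits = vector_search(query)    # semantic similarity
    keyword_hits = keyword_search(query)  # exact-match / lexical
    fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
    # Rerank a wider candidate pool, then cut to the final top_k.
    return rerank(query, fused[: top_k * 4])[:top_k]
```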
Grounded generation requires explicit prompting
Telling a model to use the retrieved context is not enough. We use a structured prompt that distinguishes context from instructions, forces citation, and includes refusal patterns for when retrieval returns no relevant material. Hallucinations are mostly a prompt design problem at this stage of the stack, not a model problem.
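Here is a minimal sketch of what that structure looks like. The section layout, the [doc-N] citation format, and the exact refusal wording are illustrative; the point is that context, instructions, citation, and refusal are explicit rather than implied.

```python
# A minimal sketch of a grounded-generation prompt: context is separated from
# instructions, citation is required, and the refusal path is spelled out.
# Wording and citation format are illustrative, not a production template.
GROUNDED_PROMPT = """You are answering questions using ONLY the context below.

## Context
{context_blocks}

## Instructions
- Answer using only the context above.
- Cite the supporting chunk after each claim, e.g. [doc-3].
- If the context does not contain the answer, reply exactly:
  "I could not find this in the provided documents."

## Question
{question}
"""

def build_prompt(question: str, chunks: list[str]) -> str:
    # Label each chunk so citations can point back to a specific retrieval hit.
    context_blocks = "\n\n".join(
        f"[doc-{i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return GROUNDED_PROMPT.format(context_blocks=context_blocks, question=question)
```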
Evaluation is the difference between a demo and a system
Build an evaluation harness with at least 50 real user questions and expected answers. Run it on every change to retrieval, prompts, or models. Without evaluation, every improvement is a vibes-based decision; with it, you can move fast without breaking what already worked.
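A sketch of the harness idea, assuming a JSONL file of real questions with expected keywords and an answer_question function for the pipeline under test. The keyword-overlap scoring is a stand-in; real harnesses usually score with an LLM judge or semantic similarity, but the CI gate works the same way.

```python
# A minimal sketch of an eval harness: replay a versioned question set against
# the pipeline on every change and gate CI on the mean score. The file path,
# scoring rule, and threshold are all illustrative assumptions.
import json

def score(answer: str, expected_keywords: list[str]) -> float:
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)

def run_eval(answer_question, dataset_path: str = "eval/questions.jsonl",
             threshold: float = 0.8) -> bool:
    results = []
    with open(dataset_path) as f:
        for line in f:
            case = json.loads(line)  # {"question": ..., "expected_keywords": [...]}
            answer = answer_question(case["question"])
            results.append(score(answer, case["expected_keywords"]))
    mean = sum(results) / len(results)
    print(f"{len(results)} cases, mean score {mean:.2f}")
    return mean >= threshold  # fail the CI run if quality regressed
```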
Operational cost
Most teams worry about model inference cost. In our experience, retrieval and reranking costs are the bigger surprise as usage scales, especially if you are using a managed vector database with per-query pricing. Plan the cost model up front; it changes which architectures are viable.
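To make "plan the cost model up front" concrete, here is a back-of-the-envelope sketch. Every price below is a placeholder, not a quote from any provider; the point is to put per-query retrieval and reranking costs in the same model as token costs before committing to an architecture.

```python
# A back-of-the-envelope cost model with placeholder prices. It assumes
# per-query pricing for the vector database and reranker and per-token
# pricing for generation; plug in your own numbers before drawing conclusions.
def monthly_cost(queries_per_day: int,
                 vector_query_usd: float = 0.0005,   # placeholder per-query price
                 rerank_usd: float = 0.001,          # placeholder per-query price
                 prompt_tokens: int = 3000,
                 completion_tokens: int = 400,
                 usd_per_1k_prompt: float = 0.003,   # placeholder token prices
                 usd_per_1k_completion: float = 0.015) -> dict:
    per_query = {
        "retrieval": vector_query_usd + rerank_usd,
        "generation": (prompt_tokens / 1000) * usd_per_1k_prompt
                      + (completion_tokens / 1000) * usd_per_1k_completion,
    }
    monthly = {k: v * queries_per_day * 30 for k, v in per_query.items()}
    monthly["total"] = sum(monthly.values())
    return monthly

# Example: monthly_cost(50_000) — retrieval line items scale linearly with traffic.
```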
What we ship
A typical AI Advisory RAG build includes a custom chunker for the document type, hybrid retrieval with reranking, a structured prompt template with citation, an evaluation harness as part of CI, and a cost-aware deployment that distinguishes between cheap-and-fast and slow-and-precise paths. The pieces are not novel; the discipline of getting all of them right is.