AI · 31 May 2026 · 5 min read

Retrieval-Augmented Generation (RAG): A Practical Guide

How retrieval-augmented generation works, when to use it, common failure modes, and what a production RAG system actually looks like

By AI Advisory team

Retrieval-augmented generation, or RAG, is the architecture most production AI assistants run on. It is the reason ChatGPT-style interfaces can answer questions about your company handbook, your case law database, or last quarter's sales pipeline without anyone fine-tuning a model. If you are building anything that needs a language model to answer from a specific body of knowledge, RAG is almost certainly the pattern you want first.

This guide explains what RAG is, why it exists, how it works end to end, where it fails, and what production-grade RAG looks like once you move past a weekend prototype.

What RAG actually is

Retrieval-augmented generation is a technique that combines two things: a retrieval system that finds relevant information from a knowledge base, and a generative language model that writes an answer using that information. The model does not memorise your data. It reads the relevant snippets at query time and composes a response grounded in them.

The pattern was formalised in a 2020 paper by Lewis et al. at Facebook AI Research, titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The original motivation was simple: large language models hallucinate when asked about facts they were not trained on, and retraining them every time your data changes is expensive. RAG sidesteps both problems by injecting fresh, relevant context into the prompt.

If you have ever used a chatbot that cites its sources, or an internal assistant that answers questions about your own documents, you have used RAG.

Why RAG exists: the problems it solves

Language models on their own have four well-documented weaknesses that RAG addresses directly.

Hallucination. When a model does not know an answer, it often generates plausible-sounding fiction. OpenAI's own documentation acknowledges this as an inherent limitation of probabilistic generation. RAG reduces hallucination by giving the model real source material to ground its response.

Stale knowledge. A model trained with a cutoff in early 2024 does not know what happened last week. RAG decouples the knowledge from the model: update the index, and the assistant can draw on the new information from the next query onwards. No retraining required.

Private data. Foundation models are trained on public web data. They have never seen your CRM, your contracts, your Confluence space, or your support tickets. RAG is the cleanest way to make a model useful over private information without sending that information into anyone's training pipeline.

Attribution. Regulated industries need to know where an answer came from. A pure language model produces no citations. A RAG system can return the exact document, page, and paragraph that informed each part of its response, which matters for legal, healthcare, financial services, and any context where an auditor will ask.

How RAG works, end to end

A working RAG system has five stages. Skip any of them and quality drops sharply.

1. Ingestion and chunking

You start with a corpus: PDFs, web pages, database rows, transcripts, whatever your knowledge lives in. Each document gets split into chunks, typically 200 to 800 tokens, sometimes with overlap so context is not lost at boundaries. Chunking strategy matters more than people expect. Splitting a legal contract every 500 tokens regardless of structure will sever clauses mid-sentence; splitting it on section headings preserves meaning. Tools like LangChain and LlamaIndex offer structure-aware splitters for code, markdown, HTML, and PDFs.
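
To make the overlap idea concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words are a good enough stand-in for tokens, which a real tokenizer would replace, and the function name chunk_words is illustrative rather than taken from LangChain or LlamaIndex.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words.

    Words are a rough proxy for tokens here (an assumption of this sketch).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is still seen whole by at least one chunk. A structure-aware splitter would pick boundaries from headings or clauses instead of a fixed word count.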

2. Embedding

Each chunk is passed through an embedding model, which converts the text into a numeric vector, usually 768 to 3072 dimensions, that represents its semantic meaning. Chunks with similar meaning end up close together in vector space. OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source models like BGE and E5 are the common choices. The vectors get stored in a vector database such as Postgres with pgvector, Pinecone, Weaviate, Qdrant, or Chroma.
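
The mechanics of "close together in vector space" can be sketched without calling any provider. The toy_embed function below is a deliberately crude, hypothetical stand-in for a real embedding model: it hashes words into a fixed-length vector, so it captures shared vocabulary but not meaning. The cosine similarity function is the part that carries over to real embeddings.

```python
import math
import re
import zlib


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Hash each word into one of `dims` buckets and L2-normalise the counts.

    A toy substitute for a learned embedding model (assumption of this sketch).
    """
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query_vec = toy_embed("refund policy")
chunk_vec = toy_embed("Our refund policy covers returns within 30 days")
print(round(cosine(query_vec, chunk_vec), 3))
```

With a real model, the same cosine comparison would also rank a chunk about "money-back guarantees" highly for this query, which the hashing toy cannot do.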

3. Retrieval

When a user asks a question, the question itself is embedded into the same vector space and the system finds the nearest chunks, typically the top 5 to 20. This is semantic search: it returns results that mean the same thing as the query, not just results that contain the same keywords. Most serious systems combine vector search with traditional keyword search (BM25) in a hybrid retriever, because pure semantic search misses exact identifiers, product codes, and proper nouns. Microsoft's research on hybrid retrieval shows it consistently outperforms either method alone.
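
A hybrid retriever has to merge two ranked lists somehow, and the paragraph above does not say how. One widely used option is reciprocal rank fusion (RRF), sketched here as an illustration rather than as the method any particular product uses. The document IDs are hypothetical, and k=60 is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical results from vector search
keyword_hits = ["c", "a", "d"]  # hypothetical results from BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that both retrievers agree on ("a" and "c") rise above those found by only one, which is the behaviour you want when a product code matches exactly but is semantically unremarkable.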

4. Reranking

The top 20 results from retrieval are often noisy. A reranker model, such as Cohere Rerank or a cross-encoder, scores each result against the query more precisely and reorders them. You then pass the top 3 to 8 reranked chunks into the prompt. This step alone often lifts answer quality more than any other single change.
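
The shape of the rerank step is simple: score every candidate against the query, sort, and keep the best few. In this sketch the scorer is a toy word-overlap function, an assumption standing in for a cross-encoder or a hosted reranker, and is passed in as a parameter so a real model could replace it.

```python
import re
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Fraction of query words that appear in the passage (toy scorer)."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_n: int = 3,
) -> list[str]:
    """Reorder candidates by score_fn(query, candidate) and keep the top_n."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]


candidates = [
    "Holiday allowance is 25 days.",
    "The notice period for termination is 30 days.",
    "Termination requires written notice.",
]
print(rerank("notice period for termination", candidates, top_n=2))
```

The difference with a cross-encoder is that it reads the query and passage together, so it can judge relevance rather than word overlap, but the sort-and-truncate wrapper stays the same.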

5. Generation

The retrieved chunks are inserted into a prompt template along with the user's question and instructions to the model: "Answer using only the context below. If the answer is not in the context, say so." The language model (GPT-4, Claude, Llama, or whatever you have chosen) then produces the response. Good systems also return citations pointing back to the source chunks.
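
A prompt template of this kind might be assembled as follows. This is a minimal sketch: the Chunk type, the numbering scheme for citations, and the example sources are all illustrative assumptions, and the resulting string would be sent to whichever model you have chosen.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # where the text came from, e.g. a file name and page
    text: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Insert numbered context chunks and the question into a grounded prompt."""
    context = "\n\n".join(
        f"[{number}] ({chunk.source})\n{chunk.text}"
        for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the notice period?",
    [
        Chunk("handbook.pdf, p. 12", "The notice period is 30 days."),
        Chunk("handbook.pdf, p. 13", "Notice must be given in writing."),
    ],
)
print(prompt)
```

Numbering the chunks gives the model something short to cite, and lets the application map each bracketed number back to its source document when rendering the answer.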

Where RAG fails in practice

The pattern is simple. Production systems are not. Most RAG deployments that disappoint do so for predictable reasons.

Chunking that destroys context. Naive fixed-size chunking on a complex document set is the single most common cause of poor answers. If your knowledge base has tables, hierarchical structure, or cross-references, you need a strategy that respects them. Hierarchical chunking, parent-document retrieval, and structure-aware splitters all help.

Embedding model mismatch. Embeddings trained on general web text underperform on specialised vocabulary, legal language, medical terminology, code, or non-English content. Pick an embedding model that has seen text like yours, or fine-tune one if the cost is justified.

No reranking. Skipping the rerank step is the second most common quality killer. Vector similarity alone is a coarse signal. A reranker brings precision.

Retrieval that misses obvious things. Pure vector search will sometimes fail to find a document that uses the exact phrase in the query. Hybrid retrieval with BM25 catches these cases. Without it, users lose trust quickly because the system fails on "easy" questions.

No evaluation harness. Teams ship RAG systems with no way to measure quality. You need a golden dataset of question-answer pairs, automated scoring (Ragas, TruLens, or custom evals), and a regression test that runs on every change. Without this, you cannot tell whether your last "improvement" actually improved anything.

Weak refusal behaviour. If the model cannot find the answer in the retrieved context, it should say so. Many systems just hallucinate confidently. Explicit refusal prompts and grounding checks fix this.
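
A grounding check can start out very simple. The sketch below flags answer sentences whose words mostly do not appear in the retrieved context. It is a crude lexical heuristic, and the 0.6 threshold is an arbitrary assumption; production checks typically use an entailment model or a second LLM call, but the control flow is similar.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(
    answer: str, context_chunks: list[str], min_overlap: float = 0.6
) -> list[str]:
    """Return answer sentences with too few words found in the context."""
    context_tokens: set[str] = set()
    for chunk in context_chunks:
        context_tokens |= _tokens(chunk)

    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        support = len(words & context_tokens) / len(words)
        if support < min_overlap:
            flagged.append(sentence)
    return flagged


context = ["The notice period is 30 days."]
answer = "The notice period is 30 days. Employees also receive a company car."
print(unsupported_sentences(answer, context))
```

If the returned list is non-empty, the application can withhold the answer, strip the flagged sentences, or fall back to a refusal message.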

RAG vs fine-tuning vs long context

RAG is not the only way to make a model knowledgeable. The two alternatives, and where each fits:

Fine-tuning. This retrains a model on your data. It is the right choice when you need the model to learn a style, a format, or a specific reasoning pattern, not when you need it to know facts. Fine-tuning facts in is expensive, slow to update, and tends to produce confident hallucinations once the data drifts. Use fine-tuning for tone, structure, and task behaviour. Use RAG for knowledge.

Long-context windows. Models like Claude and Gemini now accept 200k to 2 million tokens of context. For some tasks, you can paste the whole knowledge base into the prompt. This works for small, stable corpora and one-off analyses. It does not scale to large knowledge bases, gets expensive fast (you pay per token on every query), and degrades in quality: research from multiple labs shows attention quality drops in the middle of very long contexts ("lost in the middle"). RAG remains more cost-effective and often more accurate for production use over substantial corpora.
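
The per-query cost gap can be estimated from figures already used in this guide. A rough sketch, assuming input cost scales linearly with token count and ignoring prompt caching, the question itself, and output tokens:

```python
def input_token_ratio(long_context_tokens: int, chunks: int, tokens_per_chunk: int) -> float:
    """How many times more input tokens a full-context prompt sends than a RAG prompt."""
    if chunks <= 0 or tokens_per_chunk <= 0:
        raise ValueError("chunks and tokens_per_chunk must be positive")
    return long_context_tokens / (chunks * tokens_per_chunk)


# Upper-end RAG prompt from the stages above: 8 reranked chunks of 800 tokens each,
# compared with filling a 200k-token window on every query.
ratio = input_token_ratio(200_000, chunks=8, tokens_per_chunk=800)
print(f"{ratio:.2f}x more input tokens per query")  # prints "31.25x more input tokens per query"
```

Under those assumptions the long-context approach sends roughly thirty times more input on every query, and the gap widens further with smaller chunks or larger windows.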

Most serious production systems combine all three: RAG for knowledge, fine-tuning for behaviour, and long context for handling complex retrieved material.

What production RAG looks like

A weekend prototype is a vector database, an embedding model, and a prompt. A production system has considerably more moving parts:

Ingestion pipeline that handles new documents, deletions, and updates, with deduplication and versioning.
Hybrid retrieval combining dense vectors and BM25, with metadata filters so users only see documents they have permission to access.
Reranking with a cross-encoder or commercial reranker.
Query rewriting for multi-turn conversations, so "what about last year?" gets expanded into a self-contained query before retrieval.
Citation rendering so users can verify any claim.
Evaluation harness with golden datasets, automated scoring, and regression tests.
Observability: every query, retrieved chunk, prompt, and response logged for debugging and improvement.
Guardrails: refusal patterns, prompt injection defences, PII redaction where required.
Access control: document-level permissions enforced at retrieval, not just in the UI.
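
As one illustration of the ingestion item above, change detection is often done by hashing document content, so unchanged documents are not re-chunked and re-embedded. This is a minimal in-memory sketch; the DocumentRegistry class and its return values are hypothetical, and a real pipeline would persist the hashes and trigger re-embedding on "added" or "updated".

```python
import hashlib


class DocumentRegistry:
    """Tracks a content hash per document ID to detect adds, updates, and no-ops."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> str:
        """Record the document and report whether it is new, changed, or unchanged."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        previous = self._hashes.get(doc_id)
        if previous == digest:
            return "unchanged"  # skip re-chunking and re-embedding
        self._hashes[doc_id] = digest
        return "updated" if previous is not None else "added"

    def delete(self, doc_id: str) -> bool:
        """Forget the document; returns False if it was not registered."""
        return self._hashes.pop(doc_id, None) is not None


registry = DocumentRegistry()
print(registry.upsert("handbook", "Notice period is 30 days."))  # added
print(registry.upsert("handbook", "Notice period is 30 days."))  # unchanged
print(registry.upsert("handbook", "Notice period is 60 days."))  # updated
```

A deletion should also remove the document's chunks from the vector index, otherwise the assistant keeps answering from content that no longer exists.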

For UK organisations, two additional considerations matter. First, the ICO's guidance on AI and data protection applies whenever your RAG system handles personal data, which most internal assistants do. You need a lawful basis, a DPIA for high-risk processing, and clear documentation of how data flows through embedding models and LLM providers. Second, if your LLM provider processes data outside the UK or EU, you need appropriate transfer mechanisms in place.

When RAG is the right call

RAG is the right pattern when:

You need answers grounded in a specific corpus (handbooks, contracts, product docs, support history, case files).
That corpus changes regularly and retraining a model on every change is impractical.
Users need attribution back to source documents.
The data is private and cannot be sent into a training pipeline.
You need to control which documents which users can query (permissions matter).

It is the wrong pattern, or at least not the only pattern, when:

The task is reasoning, planning, or transformation rather than fact retrieval. Use agents or fine-tuned models.
The knowledge is small and stable enough to fit in the system prompt. Just put it there.
You need the model to learn a behaviour or output format. Fine-tune.

For most internal AI assistants, customer support copilots, knowledge search systems, and document Q&A tools, RAG is the right starting architecture.

FAQ

How long does it take to build a production RAG system?

A working prototype over a clean corpus can take a week. A production system with hybrid retrieval, reranking, evaluation, access control, and observability typically takes 8 to 16 weeks for a first build, depending on the complexity of the source documents and integrations. The largest time sinks are almost always ingestion (handling messy real-world documents) and evaluation (building a golden dataset that genuinely reflects user queries). Teams that skip evaluation ship faster and regret it within a quarter, because they cannot tell whether changes are helping or hurting.

What does RAG typically cost to run?

Operating costs split into three buckets: embedding (one-off per document, plus re-embedding when you change models), retrieval infrastructure (the vector database, typically £100 to £2,000 per month depending on scale), and inference (the LLM calls, usually the largest line item). A mid-sized internal assistant serving 200 users with 50 queries each per day typically costs £500 to £3,000 per month in LLM and infrastructure spend, before engineering time. Caching common queries and using smaller models for simpler tasks reduces this materially.
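
It can help to translate the monthly figure into a per-query budget. The arithmetic below uses the numbers from the example above and assumes a 30-day month with usage every day; with usage on working days only, the per-query figure would be higher.

```python
USERS = 200
QUERIES_PER_USER_PER_DAY = 50
DAYS_PER_MONTH = 30  # assumption: the assistant is used every calendar day

monthly_queries = USERS * QUERIES_PER_USER_PER_DAY * DAYS_PER_MONTH


def cost_per_query(monthly_cost_gbp: float) -> float:
    """Average spend per query implied by a monthly bill."""
    return monthly_cost_gbp / monthly_queries


low = cost_per_query(500)
high = cost_per_query(3000)
print(f"{monthly_queries:,} queries per month")       # 300,000 queries per month
print(f"£{low:.4f} to £{high:.4f} per query")         # £0.0017 to £0.0100 per query
```

A budget of about a fifth of a penny to a penny per query is a useful sanity check when choosing a model and deciding how many chunks to include in each prompt.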

Can I build RAG with no-code tools?

For simple use cases, yes. Tools like LangFlow, Flowise, and n8n have RAG components that work for internal prototypes and low-stakes assistants. Vector databases like Pinecone and Supabase offer hosted services that remove infrastructure work. The limits show up when you need hybrid retrieval, custom rerankers, fine-grained access control, or domain-specific chunking strategies. Most no-code RAG stops being adequate once you cross a few thousand documents or need accuracy above 80%. At that point you typically need a custom pipeline in Python or TypeScript.

Does RAG eliminate hallucination?

No. It reduces hallucination substantially but does not eliminate it. The model can still misread retrieved context, combine sources incorrectly, or generate confident statements that are not in the source material. Mitigations include explicit grounding prompts ("answer only from the context"), citation rendering so users can verify, automated grounding checks that flag responses where claims do not appear in the retrieved chunks, and refusal patterns when retrieval returns nothing relevant. A well-built RAG system can get hallucination rates below 5%, but not to zero.

How does RAG handle access control and permissions?

Properly built RAG systems enforce permissions at retrieval time, not in the UI. Each document chunk in the vector index carries metadata about who can access it, and the retrieval query filters on the user's identity before returning results. This matters because if you only filter in the UI, the model still sees data it should not, and prompt injection can sometimes coax it out. For UK organisations under UK GDPR, retrieval-time filtering is also cleaner from a data protection standpoint: documents a user has no right to see never enter the prompt at all.
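
Retrieval-time filtering can be sketched as follows. The IndexedChunk type and the group names are hypothetical, and the filter runs in Python purely for illustration; in a real deployment the same condition would normally be pushed down into the vector database query as a metadata filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    text: str
    score: float                  # similarity to the query, higher is better
    allowed_groups: frozenset     # groups permitted to read the source document


def retrieve_for_user(
    candidates: list[IndexedChunk], user_groups: set[str], top_k: int = 5
) -> list[IndexedChunk]:
    """Drop chunks the user may not see, then return the best top_k of the rest."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


index = [
    IndexedChunk("Salary bands 2026", 0.92, frozenset({"hr"})),
    IndexedChunk("Holiday policy", 0.81, frozenset({"hr", "all-staff"})),
    IndexedChunk("Expense policy", 0.64, frozenset({"all-staff"})),
]
for chunk in retrieve_for_user(index, {"all-staff"}, top_k=2):
    print(chunk.text)
```

The highest-scoring chunk is excluded before ranking, so it can never reach the prompt, regardless of what the user types.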

Which vector database should I use?

For most mid-market builds, Postgres with the pgvector extension is the pragmatic default. You probably already run Postgres, it handles up to a few million vectors comfortably, and it keeps your operational footprint small. Pinecone, Weaviate, and Qdrant are stronger choices at larger scale or when you need advanced filtering and hybrid search out of the box. Chroma is useful for prototypes. The vector database is rarely the bottleneck in RAG quality: chunking, retrieval strategy, and reranking matter far more. Pick the simplest option that fits your scale and move on.

How do I know if my RAG system is actually working?

You need an evaluation harness, ideally before you ship. Build a golden dataset of 50 to 200 question-answer pairs that reflect real user queries. Score each generated answer on faithfulness (does it match the source?), relevance (does it answer the question?), and completeness. Tools like Ragas and TruLens automate parts of this. Run the eval on every change and track scores over time. Without this, you are guessing. With it, you can confidently say "reranking lifted faithfulness from 78% to 89%" and make engineering decisions on evidence.
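
The skeleton of such a harness is small. In this sketch the scoring is a plain substring match, a crude stand-in for the faithfulness and relevance scoring that Ragas or TruLens provide, and the golden pairs, canned answers, and 0.02 tolerance are hypothetical values chosen for illustration.

```python
from typing import Callable


def evaluate(answer_fn: Callable[[str], str], golden: list[tuple[str, str]]) -> float:
    """Fraction of golden questions whose answer contains the expected fact."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1
        for question, expected in golden
        if expected.lower() in answer_fn(question).lower()
    )
    return passed / len(golden)


def passes_regression(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the new score has not dropped more than `tolerance` below baseline."""
    return score >= baseline - tolerance


golden = [
    ("What is the notice period?", "30 days"),
    ("How many days of annual leave?", "25 days"),
    ("Who approves expenses?", "line manager"),
]
# A canned assistant stands in for the real RAG pipeline.
canned = {
    "What is the notice period?": "The notice period is 30 days.",
    "How many days of annual leave?": "Staff receive 25 days of annual leave.",
    "Who approves expenses?": "I could not find that in the provided context.",
}
score = evaluate(lambda q: canned[q], golden)
print(f"score={score:.2f}, passes={passes_regression(score, baseline=0.78)}")
```

Wired into continuous integration, the regression check turns "did that change help?" into a failing or passing build rather than a matter of opinion.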

Can RAG work over structured data like databases or spreadsheets?

Yes, but the pattern differs. For pure structured data, text-to-SQL or text-to-API is often better than RAG: the model generates a query, the database returns exact results, and the model formats the answer. For mixed corpora (documents that reference structured data, or spreadsheets embedded in reports), hybrid approaches work well: index the text descriptions and metadata, and let the model call tools to fetch precise numeric data when needed. Pure vector search over tables tends to perform poorly because the semantic signal in a row of numbers is weak.
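
The text-to-SQL flow can be sketched with an in-memory SQLite database. The table, its rows, and the generated_sql string are hypothetical; in a real system the SQL would come from the model, and the SELECT-only check shown here is a minimal guard, not a substitute for a read-only database role.

```python
import sqlite3


def run_generated_sql(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute model-generated SQL, refusing anything that is not a SELECT."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed in this sketch")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO deals VALUES (?, ?, ?)",
    [("North", "Q1", 1200.0), ("North", "Q2", 800.0), ("South", "Q1", 500.0)],
)

# Stand-in for what a model might produce for "What were Q1 sales by region?"
generated_sql = (
    "SELECT region, SUM(amount) FROM deals "
    "WHERE quarter = 'Q1' GROUP BY region ORDER BY region"
)
rows = run_generated_sql(conn, generated_sql)
print(rows)  # [('North', 1200.0), ('South', 500.0)]
```

The database returns exact figures, and the model's only remaining job is to phrase them, which avoids asking a vector index to reason over rows of numbers.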

Getting RAG right

RAG is not complicated as a concept, but the gap between a demo and a system your business can actually rely on is wide. The teams that close it are the ones that take chunking, retrieval, reranking, and evaluation seriously rather than treating them as afterthoughts. If you are scoping a RAG build and want a second pair of eyes on the architecture, or you have a prototype that works in demos but not in production, AI Advisory builds these systems for UK mid-market organisations end to end, from corpus design through to evaluation and operation.

Ready to put this into production? Book a discovery call.