What is RAG in LLMs? A Practical Explanation with Examples
RAG (retrieval-augmented generation) explained: how it works, a worked example, when to use it, and how it compares to fine-tuning
Retrieval-augmented generation (RAG) is one of the most common patterns in production LLM systems today. If you have ever asked ChatGPT a question about your company's policies and received a confidently wrong answer, RAG is the usual fix. It is the architecture behind most useful internal assistants, customer support bots, and document Q&A tools shipped in the last two years.
This article explains what RAG is, walks through a worked example with real prompts and retrieved chunks, covers the architecture choices that matter in production, and shows when RAG is the wrong answer.
What RAG actually is
A large language model like GPT-4, Claude, or Llama is trained on a fixed corpus. Once training stops, the model knows nothing new. It does not know your company's HR policy, your product documentation written last week, or the contract a customer signed yesterday. Ask it about any of those and you get one of two outcomes: a refusal, or a hallucination dressed up as fact.
RAG solves this by separating two jobs. Retrieval finds the relevant information from a knowledge base you control. Generation uses the LLM to write a natural-language answer grounded in what was retrieved. The model becomes a reasoning layer over your data rather than the source of truth itself.
The term was coined in a 2020 paper by Lewis et al. at Facebook AI Research (now Meta AI), with co-authors at University College London and New York University: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The concept has since become the default architecture for any LLM application that needs to answer questions about specific, private, or fresh information.
The four stages of a RAG pipeline
Every RAG system, no matter how complex, has the same four stages. Understanding them is the difference between a chatbot that works and one that quietly hallucinates.
1. Ingestion and chunking
You take your source documents (PDFs, web pages, Confluence, Notion, SharePoint, database records) and break them into smaller chunks. Typical chunk sizes are 200-1000 tokens, with some overlap between chunks to preserve context across boundaries. Chunking is unglamorous and disproportionately important: bad chunking is the single biggest cause of poor RAG performance in production.
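As a minimal sketch of fixed-size chunking with overlap, the function below slides a window over a list of tokens. It assumes the text has already been tokenised (whitespace-split words stand in for model tokens here), and the name chunk_with_overlap and its parameters are illustrative rather than taken from any library.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting is only a rough proxy for a real tokeniser.
words = "Partners are entitled to up to 6 weeks of fully paid partner leave".split()
for chunk in chunk_with_overlap(words, chunk_size=6, overlap=2):
    print(" ".join(chunk))
```

A structure-aware chunker would typically split on headings or sections first and apply a window like this only within each section.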
2. Embedding and indexing
Each chunk gets passed through an embedding model (OpenAI's text-embedding-3, Cohere's embed-v3, or open-source options like BGE or E5). The model converts text into a high-dimensional vector that encodes its semantic meaning. These vectors get stored in a vector database such as Pinecone, Weaviate, Qdrant, or pgvector on Postgres.
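To make the shape of this stage concrete, here is a rough sketch that turns chunks into fixed-length, normalised vectors and stores them alongside their text. The toy_embed function is a deliberate stand-in: it hashes words into buckets and carries no semantic meaning, whereas a real system would call an embedding model at that point. All names here are hypothetical.

```python
import hashlib
import math

DIM = 64  # toy dimension; real embedding models use hundreds or thousands


def toy_embed(text, dim=DIM):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(chunks):
    """Store each chunk with its vector; a vector database plays this role in production."""
    return [{"id": i, "text": c, "vector": toy_embed(c)} for i, c in enumerate(chunks)]


index = build_index(["Partner leave is six weeks", "Expenses must be filed monthly"])
print(len(index), len(index[0]["vector"]))
```

The important property to preserve when swapping in a real model is that documents and queries go through the same embedding function.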
3. Retrieval
When a user asks a question, the question is embedded using the same model and the vector database returns the top-k most similar chunks (typically k=3 to k=20). Production systems usually combine vector similarity with keyword search (BM25) in a hybrid approach, because pure semantic search misses exact-match queries like product codes, names, or acronyms.
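The vector half of retrieval reduces to ranking stored vectors by similarity to the query vector. The sketch below does this by brute force with cosine similarity over a small in-memory dictionary; the three-dimensional vectors and chunk ids are made up for illustration, and a vector database would use an approximate index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=5):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical pre-computed vectors, keyed by chunk id.
index = {
    "parental-leave": [1.0, 0.0, 0.0],
    "pay-during-leave": [0.8, 0.6, 0.0],
    "car-parking": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.0], index, k=2))
```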
4. Generation
The retrieved chunks get stitched into a prompt with the user's question and sent to the LLM. The model is instructed to answer using only the provided context and to say it does not know when the context is insufficient. The output is returned to the user, often with citations back to the source documents.
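Stitching the prompt together is plain string assembly. The sketch below is one possible layout, assuming each retrieved chunk is a dictionary with page, section, and text keys; the system prompt wording and the build_prompt name are illustrative, and the resulting string would then be sent to whichever LLM API you use.

```python
SYSTEM_PROMPT = (
    "Answer using ONLY the context below. If the context does not contain "
    "the answer, say you do not know. Always cite the section and page."
)


def build_prompt(question, chunks):
    """Combine instructions, labelled context chunks, and the user's question."""
    context_lines = [
        f"[Chunk {i}, page {c['page']}, {c['section']}]: {c['text']}"
        for i, c in enumerate(chunks, start=1)
    ]
    parts = [f"System: {SYSTEM_PROMPT}", "Context:", *context_lines, f"Question: {question}"]
    return "\n".join(parts)


chunks = [
    {"page": 47, "section": "Parental Leave", "text": "Partners are entitled to up to 6 weeks..."},
]
print(build_prompt("How much parental leave do I get?", chunks))
```

Labelling every chunk with its section and page is what lets the model cite sources in its answer.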
A worked example: an HR policy assistant
Let's make this concrete. A 400-person company wants an internal Slack bot that answers HR questions from a 180-page policy handbook. Here is how the four stages play out.
Ingestion. The PDF gets parsed and split into 600-token chunks with 100 tokens of overlap. The handbook produces about 320 chunks. Each chunk carries metadata: section title, page number, last-updated date.
Indexing. Each chunk is embedded using OpenAI's text-embedding-3-small (1536 dimensions, around $0.02 per million tokens per OpenAI's pricing). The vectors land in pgvector on Postgres, which is plenty for 320 chunks and lets us avoid a separate vector database.
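The one-off embedding cost for this handbook can be estimated with simple arithmetic. The sketch below treats every chunk as a full 600 tokens, which slightly overstates the total because the final chunk is usually shorter; the function name is hypothetical and the price is the figure quoted above.

```python
def embedding_cost_usd(num_chunks, tokens_per_chunk, usd_per_million_tokens):
    """Upper-bound estimate: every chunk is billed at its full token length."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens, total_tokens * usd_per_million_tokens / 1_000_000


tokens, cost = embedding_cost_usd(320, 600, 0.02)
print(f"{tokens} tokens, about ${cost:.5f}")
```

That works out to 192,000 tokens and well under one US cent, which is why embedding cost rarely drives the budget for a corpus of this size.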
Retrieval. A user asks: "How much parental leave do I get if my partner has just given birth?" The query is embedded and the database returns the top 5 most similar chunks. One of them is from page 47:
"Partners (including non-birthing parents in same-sex relationships and adoptive parents) are entitled to up to 6 weeks of fully paid partner leave, which must be taken within 12 months of the birth or placement. An additional 4 weeks of unpaid leave may be requested..."
Generation. The prompt sent to the LLM looks roughly like this:
System: You are an HR assistant. Answer using ONLY the context below. If the context does not contain the answer, say "I don't have that information - please contact hr@company.com." Always cite the section and page.
Context:
[Chunk 1, page 47, Parental Leave]: Partners (including non-birthing parents...)
[Chunk 2, page 48, Eligibility]: To qualify for partner leave, employees must have...
[Chunk 3, page 12, Pay during leave]: ...
Question: How much parental leave do I get if my partner has just given birth?
The model returns: "You are entitled to up to 6 weeks of fully paid partner leave, which must be taken within 12 months of the birth. You may also request an additional 4 weeks of unpaid leave (Parental Leave, page 47)."
That answer is grounded and cited, and the system can refuse when it should. If the user asks about something not in the handbook, like share options, the system says it does not know rather than inventing an answer.
Why RAG beats the alternatives for most use cases
There are three other ways to make an LLM aware of your data. Each has its place, but RAG wins for most knowledge-retrieval problems.
Stuffing everything into the prompt. Modern context windows are large (Claude 3.5 Sonnet handles 200k tokens, Gemini 1.5 Pro handles 1-2 million). It is tempting to skip retrieval and just dump the whole knowledge base in. This breaks down fast: cost scales linearly with input tokens, latency increases, and recall degrades on long contexts. Liu et al. (2023) showed that LLMs use information at the start and end of a long context more reliably than information buried in the middle (the "lost in the middle" problem).
Fine-tuning. You retrain the model on your data. This is the right tool for teaching style, format, or specialised reasoning patterns. It is the wrong tool for knowledge. Fine-tuned models still hallucinate, cannot cite sources, and need retraining every time the underlying data changes. OpenAI and Anthropic both recommend RAG as the default for knowledge-grounded applications, with fine-tuning reserved for behaviour and tone.
Function calling against a database. If your data is highly structured (orders, inventory, customer records), give the LLM a tool that queries the database directly. RAG is for unstructured or semi-structured text. The two patterns often coexist in production systems.
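As a rough illustration of the function-calling pattern, the sketch below defines a single tool that looks up an order in a database with a parameterised query. The orders table, its columns, and the get_order_status name are all hypothetical, and the step where the LLM decides to call the tool is omitted because it depends on the provider's API.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("A100", "shipped"), ("A101", "processing")],
    )
    return conn


def get_order_status(conn, order_id):
    """Tool the LLM could call: exact lookup, no embeddings involved."""
    row = conn.execute(
        "SELECT status FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    return {"order_id": order_id, "status": row[0] if row else None}


conn = make_demo_db()
print(get_order_status(conn, "A100"))
```

The model supplies only the order_id argument; the query itself is fixed, which keeps the lookup exact and avoids letting the model write arbitrary SQL.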
Where RAG goes wrong in production
Most RAG demos work. Most RAG systems that have been in production for six months do not, unless someone has kept maintaining them. The failure modes are predictable.
Bad chunking destroys context. Splitting mid-sentence, mid-table, or mid-clause makes retrieved chunks unintelligible. Tables, code blocks, and lists are particularly vulnerable. Production systems use document-aware chunking that respects structure (headings, sections, table boundaries).
Pure semantic search misses exact terms. If a user asks about "policy NHS-2024-A" and your embedding model has never seen that code, semantic search ranks the chunk containing it poorly. Hybrid retrieval (vector + BM25 keyword search, then re-ranking) is the production-grade pattern. Cohere's Rerank and similar cross-encoder models close most of the precision gap.
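One common way to merge the vector and keyword result lists before re-ranking is reciprocal rank fusion (RRF). The article does not prescribe a merge method, so treat this as one reasonable option rather than the required one; the chunk ids are made up, and k=60 is a conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c3", "c1", "c7"]   # hypothetical semantic search results
keyword_hits = ["c9", "c3", "c4"]  # hypothetical BM25 results
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

A chunk that appears in both lists (c3 here) rises to the top, while an exact keyword match that vector search missed (c9) still makes the merged list. RRF uses only ranks, so the two searches do not need comparable scores.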
Stale data. The handbook gets updated and nobody re-indexes. The bot now confidently quotes the old policy. Production RAG needs scheduled re-ingestion, change detection, and ideally a freshness signal in the metadata so the LLM can flag outdated content.
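Change detection can be as simple as comparing a content hash for each document against the hash recorded at the last indexing run. The sketch below assumes documents are available as a dictionary of id to text and that the index keeps a dictionary of id to hash; both structures and the plan_reindex name are illustrative.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Return (ids to re-embed, ids to delete) by comparing hashes."""
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_upsert, to_delete


indexed = {"leave": content_hash("6 weeks paid"), "expenses": content_hash("File monthly")}
current = {"leave": "8 weeks paid", "expenses": "File monthly", "remote": "Two days a week"}
print(plan_reindex(current, indexed))
```

Running a check like this on a schedule means only changed or new documents are re-embedded, and deleted documents are removed from the index instead of lingering.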
No evaluation harness. Most teams ship a RAG bot, demo it once, and never measure it again. You need a held-out set of questions with known correct answers, scored automatically on retrieval recall (did we get the right chunk?) and answer quality (did the LLM use it correctly?). Frameworks like RAGAS and TruLens make this tractable.
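The retrieval half of such a harness can start very small. The sketch below computes recall at k over a held-out question set, assuming each item records the id of the chunk that should be retrieved and that retrieve(question, k) returns a list of chunk ids; the fake retriever and its data are placeholders for a real pipeline.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected chunk appears in the top k results."""
    if not eval_set:
        return 0.0
    hits = sum(
        1 for item in eval_set
        if item["expected_chunk_id"] in retrieve(item["question"], k)
    )
    return hits / len(eval_set)


# Placeholder retriever with canned results, standing in for the real system.
CANNED = {
    "How much partner leave do I get?": ["c47", "c48", "c12"],
    "When are expenses due?": ["c90", "c12", "c03"],
}


def fake_retrieve(question, k):
    return CANNED.get(question, [])[:k]


eval_set = [
    {"question": "How much partner leave do I get?", "expected_chunk_id": "c47"},
    {"question": "When are expenses due?", "expected_chunk_id": "c91"},
]
print(recall_at_k(eval_set, fake_retrieve, k=3))
```

Tracking this number across changes to chunking or retrieval settings shows whether a change helped before any answer-quality scoring is involved.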
Permissions get ignored. If the knowledge base includes documents not everyone should see (salary data, board minutes, customer contracts), the retrieval layer must filter by the user's permissions. This is non-negotiable for any system handling personal data under UK GDPR - the ICO's guidance on AI and data protection is clear that access controls in upstream systems must persist through to LLM outputs.
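A minimal sketch of permission-aware retrieval is to filter candidate chunks by the user's groups before anything is ranked or placed in a prompt. It assumes each chunk carries an allowed_groups list copied from the source system at ingestion time; that field name and the group names are hypothetical.

```python
def filter_by_permission(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    groups = set(user_groups)
    return [c for c in chunks if groups & set(c["allowed_groups"])]


chunks = [
    {"id": "handbook-47", "allowed_groups": ["all-staff"]},
    {"id": "salary-bands", "allowed_groups": ["hr", "finance"]},
    {"id": "board-minutes", "allowed_groups": ["board"]},
]
visible = filter_by_permission(chunks, ["all-staff", "hr"])
print([c["id"] for c in visible])
```

Filtering before generation matters because once restricted text is in the prompt, no instruction to the model reliably keeps it out of the answer. In a vector database this would normally be expressed as a metadata filter on the query rather than a post-hoc list comprehension.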
The practical decision: when to use RAG
Use RAG when you need an LLM to answer questions about a body of text that is private, fresh, large, or changes frequently. Typical fits:
- Internal knowledge bases, wikis, runbooks, policy documents
- Customer support over product documentation
- Legal, regulatory, or compliance Q&A over a corpus
- Research assistants over a library of papers, reports, or transcripts
- Sales enablement over CRM notes, call transcripts, and battle cards
Skip RAG when the answer is in structured data (use function calling and SQL), when you need the model to learn a new behaviour rather than recall facts (consider fine-tuning), or when the corpus is small enough to fit reliably in context with room to spare (just stuff the prompt).
Conclusion
RAG is not magic. It is a sensible architecture that separates knowledge retrieval from language generation, which makes LLM applications grounded, citable, and updatable. The hard parts are not the LLM call - they are chunking, hybrid retrieval, evaluation, and permissions. Get those right and you have a system that holds up in production. Get them wrong and you have a confident-sounding hallucination engine.
If you are scoping a RAG build and want a sober view of what to do in-house versus what to outsource, AI Advisory works with mid-market teams on production RAG systems across customer support, internal search, and document Q&A. Talk to us if that is the problem on your desk.
Frequently asked questions
What does RAG stand for in LLMs?
RAG stands for retrieval-augmented generation. The term comes from a 2020 paper by Lewis et al. at Facebook AI Research (now Meta AI). The "retrieval" half refers to finding relevant chunks of text from a knowledge base, usually via vector search over embeddings. The "augmented generation" half refers to passing those retrieved chunks to a large language model alongside the user's question, so the model generates an answer grounded in the retrieved context rather than only its training data. RAG is now the default architecture for LLM applications that need to answer questions about private, specific, or recently updated information.
How is RAG different from fine-tuning?
RAG and fine-tuning solve different problems. Fine-tuning changes the model's weights so it learns a new behaviour, style, or specialised reasoning pattern, but it is a poor mechanism for teaching facts because the model still hallucinates and cannot cite sources. RAG leaves the model unchanged and instead supplies relevant information at inference time, so answers are grounded in source documents and the underlying knowledge can be updated by re-indexing rather than retraining. As a rule of thumb: fine-tune for behaviour and tone, use RAG for knowledge and facts. Many production systems use both.
What is a vector database and do I need one for RAG?
A vector database stores high-dimensional vectors (embeddings) and supports fast similarity search across millions of them. Popular options include Pinecone, Weaviate, Qdrant, and Milvus. For small to mid-sized corpora (under a few million chunks), you do not need a dedicated vector database - the pgvector extension for Postgres works well and avoids adding a new piece of infrastructure. For very large corpora, multi-tenant SaaS products, or where you need sophisticated filtering and hybrid search out of the box, a purpose-built vector database earns its place.
How much does it cost to run a RAG system?
Costs split into three buckets: embedding (one-off per document, plus per query), LLM inference (per query), and infrastructure (vector database, application hosting). For a typical mid-market internal assistant handling 5,000-10,000 queries per month over a 100-page corpus, expect total monthly costs of roughly £200-£800 on commodity APIs (OpenAI, Anthropic) plus hosting. The dominant variable is LLM inference cost, which depends on model choice and how much context you stuff into each prompt. Using a smaller model for retrieval reranking and a larger one only for final generation cuts cost significantly.
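To see why context size dominates, here is a back-of-envelope model of the LLM inference bucket. The per-million-token prices and token counts below are placeholder values chosen for round arithmetic, not any provider's actual rates, and the function name is hypothetical.

```python
def monthly_inference_cost(queries, input_tokens, output_tokens,
                           input_price_per_million, output_price_per_million):
    """Monthly LLM cost in whatever currency the prices are quoted in."""
    input_cost = queries * input_tokens * input_price_per_million / 1_000_000
    output_cost = queries * output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder figures: 10,000 queries, 300 output tokens, prices of 3 and 15 per million.
large_context = monthly_inference_cost(10_000, 4_000, 300, 3.0, 15.0)
small_context = monthly_inference_cost(10_000, 2_000, 300, 3.0, 15.0)
print(large_context, small_context)
```

With these placeholder numbers, halving the retrieved context from 4,000 to 2,000 tokens per query drops the monthly figure from 165 to 105, because input tokens make up most of the bill.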
Can RAG hallucinate?
Yes, but far less than an LLM answering from training data alone, and the hallucinations are diagnosable. The most common cause is the retrieval step failing to surface the right chunk, leaving the model to answer from prior knowledge or invent something plausible. The fix is a strict system prompt instructing the model to refuse when context is insufficient, combined with an evaluation harness that measures both retrieval recall (did we fetch the right chunk?) and answer faithfulness (did the model stick to what was retrieved?). Frameworks like RAGAS and TruLens provide automated scoring for both.
How long does a RAG implementation take?
A working prototype against a small corpus can be built in a few days using off-the-shelf tools like LangChain or LlamaIndex with OpenAI embeddings and pgvector. A production-grade system - with hybrid retrieval, reranking, permissions, evaluation, monitoring, and a maintained ingestion pipeline - typically takes 8-16 weeks for a mid-market deployment. The gap between prototype and production is usually underestimated by a factor of three to five, mostly because the prototype skips chunking quality, permissions, and evaluation, which are exactly the things that determine whether the system holds up under real use.
Is RAG compliant with UK GDPR?
RAG can be compliant, but it requires deliberate design. The ICO's guidance on AI and data protection applies: you need a lawful basis for processing, data minimisation in what gets indexed, access controls that persist from source systems through to retrieval, and clear processes for handling subject access and erasure requests against the vector store. If you use a third-party LLM API, you need a data processing agreement and clarity on whether prompts are used for training (most enterprise tiers from OpenAI, Anthropic, and Azure OpenAI now contractually exclude this). Self-hosting open-source models removes the third-party transfer question entirely.
What is hybrid retrieval and why does it matter?
Hybrid retrieval combines vector similarity search (good at semantic matching) with keyword search like BM25 (good at exact terms, product codes, names, and acronyms). Pure vector search alone misses queries where the exact word matters - a user searching for "error code E47" or "Section 12.3" needs lexical match, not semantic similarity. Production RAG systems typically run both searches in parallel, merge the results, and pass them through a reranker (a cross-encoder model like Cohere Rerank or a fine-tuned BERT variant) to produce a final ranked list. Hybrid retrieval is one of the highest-impact improvements over a basic RAG setup.