What Is RAG in Generative AI? A Practical Explainer
RAG (retrieval-augmented generation) explained: how it works, when to use it, architecture choices, costs, and common failure modes in production
Retrieval-augmented generation (RAG) is the pattern that turns a general-purpose large language model into something useful for your specific business. Instead of relying on whatever the model memorised during training, you fetch relevant documents from your own knowledge base at query time and pass them to the model as context. The model then answers from those documents rather than from its parametric memory.
That's the whole idea. The interesting part is everything around it: how you chunk documents, how you retrieve, how you handle conflicting sources, how you stop the model making things up when retrieval fails, and how you measure whether the system is getting better or worse over time. This article covers the practical mechanics so you can decide whether RAG is the right approach for your use case and what a sensible first build looks like.
What RAG actually is
RAG is a system architecture, not a model. It combines three components: a retriever that searches a knowledge store, a generator (the LLM) that produces the answer, and an orchestration layer that stitches them together. The pattern was formalised in a 2020 paper from Facebook AI Research (Lewis et al.), which introduced the term and showed that grounding generation in retrieved passages outperformed pure parametric models on knowledge-intensive tasks.
The mechanics at query time look like this. A user asks a question. The system converts that question into a vector embedding, searches a vector database (or a hybrid index combining vectors and keywords) for the most relevant chunks of source material, and assembles those chunks into a prompt alongside the original question. The LLM then generates an answer using the retrieved context. A good RAG system also returns citations pointing back to the source documents, so the user can verify the claim.
The reason RAG matters is that base models have three structural problems. They hallucinate when they don't know something. Their training data has a cutoff date, so they don't know about anything recent. And they have no access to private information - your contracts, your product documentation, your customer records, your internal wiki. RAG addresses all three by injecting fresh, private, verifiable content at inference time.
How a RAG pipeline is built
A production RAG pipeline has two halves: an ingestion pipeline that runs ahead of time, and a query pipeline that runs when a user asks something.
Ingestion takes raw source documents (PDFs, HTML, Notion pages, Confluence, SharePoint, ticket histories, transcripts) and turns them into a searchable index. The steps are: extract the text, normalise it, split it into chunks of typically 200-800 tokens with some overlap, generate an embedding vector for each chunk using a model like OpenAI's text-embedding-3-large or an open-source alternative such as BGE or E5, and store the chunks plus vectors plus metadata in a database. Postgres with the pgvector extension is a strong default for most mid-market builds. Pinecone, Weaviate, Qdrant, and Milvus are dedicated alternatives if you outgrow it.
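The chunking step is easier to reason about with a concrete sketch. The following is a minimal illustration, not a recommended implementation: the function name chunk_tokens and its defaults are assumptions made for the example, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means an idea that straddles a boundary appears whole in at least one chunk, at the price of some duplicated storage and embedding spend.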
Query time reverses the flow. The user's question is embedded with the same model, the database returns the top k most similar chunks (typically 5-20), an optional reranker reorders them by relevance, and the top results are formatted into a prompt template alongside the question. The LLM generates an answer, and the system returns the answer plus citations.
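As a rough sketch of the similarity search at the heart of that flow, the example below ranks chunks by cosine similarity in plain Python. The three-dimensional vectors and chunk IDs are made up for illustration; a real build would get vectors from the embedding model and let the database do the search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=5):
    """index is a list of (chunk_id, vector); returns [(chunk_id, score)], best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy index with hypothetical chunk IDs and hand-written vectors.
index = [
    ("refund-policy#2", [0.9, 0.1, 0.0]),
    ("shipping#1", [0.1, 0.9, 0.1]),
    ("refund-policy#3", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]
for chunk_id, score in top_k(query_vec, index, k=2):
    print(chunk_id, round(score, 3))
```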
That's the minimum viable pipeline. A serious build adds: query rewriting (turning a vague question into a better search query), hybrid retrieval (combining vector similarity with traditional keyword search via BM25), reranking with a cross-encoder model, metadata filtering (restrict to documents the user has permission to see), refusal logic when retrieval confidence is low, and an evaluation harness that scores answers against a fixed test set every time you change anything.
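Hybrid retrieval needs some way to merge the vector ranking and the keyword ranking into one list. The text above does not prescribe a method; reciprocal rank fusion (RRF) is one common choice, sketched here as an assumption. The chunk IDs are hypothetical, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; each list contributes 1 / (k + rank) per ID."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["c3", "c1", "c7"]   # best-first results from vector search
keyword_hits = ["c1", "c9", "c3"]  # best-first results from BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['c1', 'c3', 'c9', 'c7']
```

RRF uses only rank positions, so it sidesteps the problem that cosine scores and BM25 scores sit on different scales.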
When to use RAG and when not to
RAG is the right pattern when you need the model to answer from a defined corpus of knowledge that changes over time and is too large to fit in a prompt. Customer support assistants over a product knowledge base, internal Q&A bots over policy and HR documentation, legal research tools, technical documentation search, and sales enablement assistants over case studies and pricing all fit this shape.
RAG is not the right pattern when the task is reasoning or transformation rather than knowledge lookup. Summarising a document you already have, extracting structured data from an email, generating code from a specification, or writing marketing copy in a particular style are all jobs for prompting plus structured outputs, not RAG. Adding retrieval to a non-retrieval task just adds latency and failure modes.
Fine-tuning is the alternative that comes up most often. The honest answer is that fine-tuning and RAG solve different problems. Fine-tuning teaches a model a style, a format, or a narrow skill - it bakes behaviour into the weights. RAG gives the model access to facts. If your problem is "the model doesn't know our products," RAG is correct. If your problem is "the model doesn't write in our tone of voice" or "the model keeps missing the structured output format," fine-tuning helps. The two are complementary, not competing, and most production systems eventually use both: RAG for facts, light fine-tuning or few-shot prompting for format and tone.
The other comparison is long-context models. Modern frontier models from Anthropic, OpenAI, and Google handle context windows of 200,000 tokens or more, which tempts teams to skip RAG and just stuff the whole corpus into the prompt. This works at small scale and falls over at production scale for three reasons: cost (you pay for every input token on every query), latency (long contexts are slow), and accuracy (research by Liu et al. (2023) and others shows recall drops measurably in the middle of very long contexts, a problem sometimes called "lost in the middle"). RAG remains the more economical and accurate pattern once your corpus exceeds a few thousand pages.
What makes RAG hard in production
The 30-minute demo of RAG is easy. The production system that consistently gives correct, cited answers across thousands of real user queries is much harder. The failure modes cluster into four areas.
Retrieval failure. The most common cause of bad RAG answers is that the right chunk never made it into the prompt. Either the chunking strategy split a relevant idea across two chunks, the embedding model didn't capture the semantic match (this is common for acronyms, product codes, and domain jargon), or the top-k cutoff was too tight. Hybrid retrieval and reranking address most of this, but it requires deliberate measurement to know whether you have a retrieval problem or a generation problem.
Hallucination despite retrieval. Even with relevant context in the prompt, models sometimes ignore it and answer from parametric memory, especially when the context contradicts what they were trained on. Strong system prompts that explicitly instruct the model to answer only from the provided context, plus refusal patterns ("if the context does not contain the answer, say so"), reduce this significantly but do not eliminate it. The evaluation harness has to test for this specifically.
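A grounded prompt with an explicit refusal pattern might look something like the sketch below. The wording of the instructions, the REFUSAL string, and the build_prompt helper are all illustrative assumptions; the exact phrasing that works best depends on the model and should be tested in the evaluation harness.

```python
REFUSAL = "I can't answer that from the available documents."


def build_prompt(question, chunks):
    """chunks is a list of (source_id, text); returns (system, user) prompt strings."""
    system = (
        "Answer using only the numbered context passages. "
        "Cite passages as [n] after each claim. "
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"'
    )
    # Number each passage so the model can cite it and the UI can link back to it.
    context = "\n\n".join(
        f"[{i}] (source: {source_id})\n{text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return system, user


system, user = build_prompt(
    "How long do refunds take?",
    [("refund-policy.pdf", "Refunds are processed within 5 working days.")],
)
print(system)
print(user)
```

A fixed refusal string is a deliberate choice here: it makes refusals trivial to count in logs and in tests.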
Stale or conflicting sources. Real corpora are messy. The 2021 employee handbook contradicts the 2024 one. The product spec contradicts the marketing page. RAG will happily retrieve both and confuse the model. Production systems need document recency metadata, source authority weighting, and a deduplication strategy.
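One simple way to use recency metadata is to drop superseded versions before the chunks reach the prompt. The sketch below assumes each chunk carries two hypothetical metadata fields, "family" (which document lineage it belongs to) and "effective" (the version's effective date); neither name comes from any particular tool.

```python
from datetime import date


def keep_latest(chunks):
    """Keep only chunks that belong to the newest version of each document family."""
    newest = {}
    for c in chunks:
        fam = c["family"]
        if fam not in newest or c["effective"] > newest[fam]:
            newest[fam] = c["effective"]
    return [c for c in chunks if c["effective"] == newest[c["family"]]]


retrieved = [
    {"id": "hb-2021", "family": "employee-handbook", "effective": date(2021, 1, 1)},
    {"id": "hb-2024", "family": "employee-handbook", "effective": date(2024, 1, 1)},
    {"id": "spec-2023", "family": "product-spec", "effective": date(2023, 6, 1)},
]
print([c["id"] for c in keep_latest(retrieved)])
# ['hb-2024', 'spec-2023']
```

This handles versions of the same document. Conflicts between different documents, such as the spec and the marketing page, still need an authority weighting decided by someone who owns the content.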
Access control. If your assistant is answering questions for staff with different permission levels, retrieval has to enforce row-level security at query time. This is straightforward to design and easy to get wrong, and getting it wrong leaks confidential information. Under UK GDPR, the Information Commissioner's Office has been clear that organisations remain accountable for how AI systems handle personal data - see the ICO's guidance on AI and data protection at ico.org.uk for the current position.
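The principle can be shown in a few lines. The sketch below is illustrative only: the "allowed_groups" metadata field and the group names are assumptions, and in a real build the same condition would normally be pushed into the database query so that restricted chunks are never fetched at all.

```python
def allowed_chunks(chunks, user_groups):
    """Return only chunks whose ACL shares a group with the user (default deny)."""
    groups = set(user_groups)
    # A chunk with no "allowed_groups" metadata is treated as restricted.
    return [c for c in chunks if groups & set(c.get("allowed_groups", ()))]


chunks = [
    {"id": "holiday-policy", "allowed_groups": ["all-staff"]},
    {"id": "salary-bands", "allowed_groups": ["hr"]},
    {"id": "untagged-doc"},
]
print([c["id"] for c in allowed_chunks(chunks, ["all-staff", "engineering"])])
# ['holiday-policy']
```

Default deny matters: a document that was ingested without permissions metadata should be invisible, not visible to everyone.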
The cost and infrastructure picture
RAG costs split into three buckets: embedding generation, vector storage and search, and LLM inference. For a typical mid-market deployment with a corpus of 50,000-200,000 documents and a few thousand queries a day, the realistic numbers are something like this.
Embedding the corpus is a one-off plus an incremental cost as documents change. OpenAI's text-embedding-3-large is around $0.13 per million tokens at the time of writing - check openai.com/api/pricing for current rates. A 100,000-document corpus averaging 2,000 tokens per document is 200 million tokens, or roughly $26 to embed once. Re-embedding when you change models or chunking strategy multiplies that.
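The arithmetic above is simple enough to keep as a small helper so it can be re-run when prices or corpus size change. The function name is an assumption for the example, and the rate passed in is the figure quoted above, not a guaranteed current price.

```python
def embedding_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """One-off cost of embedding a corpus at a flat per-token rate."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


cost = embedding_cost_usd(100_000, 2_000, 0.13)
print(f"${cost:.2f}")  # $26.00
```

Chunk overlap increases the token count somewhat, so treat the result as a lower bound.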
Vector storage on managed services like Pinecone runs from around $70/month for small deployments into the low thousands for larger ones. Self-hosted Postgres with pgvector on a moderate VM is materially cheaper and is our default for most builds where the corpus fits comfortably in memory.
LLM inference dominates the running cost. A query that retrieves 10 chunks of 500 tokens each and generates a 300-token answer uses roughly 6,000 input tokens (5,000 of retrieved context plus the system prompt and question) and 300 output tokens. At GPT-4o pricing at the time of writing, that's on the order of a cent or two per query - check openai.com/api/pricing for current rates - and at scale across thousands of queries a day it adds up. Caching common queries, routing simple questions to cheaper models, and tightening retrieval to reduce context size are the standard cost-control techniques.
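To see how much tightening retrieval saves, it helps to model the per-query cost directly. In the sketch below the prices are illustrative placeholders rather than any provider's actual rates, and the 1,000-token overhead for the system prompt and question is an assumption; substitute your own figures.

```python
def query_cost_usd(chunks, tokens_per_chunk, overhead_tokens, output_tokens,
                   input_usd_per_m, output_usd_per_m):
    """Cost of one query: retrieved context plus prompt overhead in, answer out."""
    input_tokens = chunks * tokens_per_chunk + overhead_tokens
    return (input_tokens * input_usd_per_m
            + output_tokens * output_usd_per_m) / 1_000_000


# Placeholder prices in USD per million tokens (assumptions, not real rates).
INPUT_PRICE, OUTPUT_PRICE = 2.50, 10.00
QUERIES_PER_DAY = 3_000

for n_chunks in (10, 5):
    per_query = query_cost_usd(n_chunks, 500, 1_000, 300, INPUT_PRICE, OUTPUT_PRICE)
    per_month = per_query * QUERIES_PER_DAY * 30
    print(f"{n_chunks} chunks: ${per_query:.4f} per query, ${per_month:,.0f} per month")
```

Because input tokens dominate, halving the number of retrieved chunks cuts the bill by roughly a third in this example, which is why a good reranker often pays for itself.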
Build effort for a first production-grade RAG system typically runs 8-14 weeks. The first two weeks are corpus analysis and chunking experiments. Weeks three to eight are the retrieval pipeline, prompt engineering, and evaluation harness. The remaining time is integration with whatever channel the assistant lives in (Slack, Teams, intranet, customer-facing widget), access control, observability, and the inevitable round of fixes once real users start asking real questions.
How to evaluate a RAG system properly
The single biggest reason RAG projects underperform is that teams ship them without an evaluation harness and then have no way to tell whether each change is making things better or worse. A useful harness has three layers.
First, a fixed test set of 100-300 representative questions with known-good answers, curated with subject-matter experts. This is unglamorous work and there is no shortcut. Second, automated metrics: retrieval recall (did the right chunk make it into the top k?), answer correctness (does the generated answer match the reference?), faithfulness (does the answer only make claims supported by the retrieved context?), and citation accuracy. Frameworks like Ragas and TruLens provide reasonable starting implementations. Third, human review on a sample, because automated metrics miss subtle failures.
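Retrieval recall is the easiest of these metrics to compute yourself. The sketch below is a minimal version under stated assumptions: each test case lists the chunk IDs a subject-matter expert marked as relevant, and retrieve is whatever function returns ranked chunk IDs for a question. The field names and the fake retriever are hypothetical.

```python
def retrieval_recall(test_set, retrieve, k=5):
    """Share of test questions with at least one relevant chunk in the top k."""
    if not test_set:
        return 0.0
    hits = 0
    for case in test_set:
        top = retrieve(case["question"])[:k]
        if set(top) & set(case["relevant_ids"]):
            hits += 1
    return hits / len(test_set)


# A canned retriever standing in for the real pipeline.
fake_results = {
    "How long do refunds take?": ["refund#2", "shipping#1"],
    "What is the notice period?": ["holiday#4", "expenses#1"],
}
test_set = [
    {"question": "How long do refunds take?", "relevant_ids": ["refund#2"]},
    {"question": "What is the notice period?", "relevant_ids": ["contract#7"]},
]
print(retrieval_recall(test_set, lambda q: fake_results[q], k=2))  # 0.5
```

Measuring retrieval separately from generation is what tells you which half of the pipeline to fix when an answer is wrong.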
Run the harness on every significant change - new embedding model, new chunking strategy, new prompt, new LLM version - and gate deployment on the results. Without this, you are guessing.
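Gating can be as simple as comparing the candidate's scores with the last accepted baseline and blocking on any meaningful drop. The sketch below is one possible shape; the metric names, the scores, and the 0.02 tolerance are assumptions to be tuned for your own test set.

```python
def regressions(baseline, candidate, max_drop=0.02):
    """Return metrics that fell by more than max_drop; an empty dict means pass."""
    return {
        name: (baseline[name], candidate.get(name, 0.0))
        for name in baseline
        if baseline[name] - candidate.get(name, 0.0) > max_drop
    }


baseline = {"retrieval_recall": 0.90, "faithfulness": 0.95}
candidate = {"retrieval_recall": 0.91, "faithfulness": 0.90}

failed = regressions(baseline, candidate)
if failed:
    print("Blocked:", failed)
else:
    print("OK to deploy")
```

A metric missing from the candidate run counts as zero, so a broken harness fails the gate rather than passing silently.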
FAQs
Is RAG the same as a chatbot?
No. A chatbot is an interface; RAG is an architecture pattern that can sit behind a chatbot, a search box, an internal tool, or an API. Plenty of chatbots use no retrieval at all - they answer from the model's parametric knowledge or follow scripted flows. And plenty of RAG systems are not chatbots - they power semantic search, document Q&A, research assistants, and code search tools. The conflation is common because customer-facing RAG most often ships as a chat interface, but the two concepts are independent and should be evaluated separately.
How is RAG different from fine-tuning?
Fine-tuning changes the model's weights by training it on additional examples, which is how you teach it a style, a format, or a narrow skill. RAG leaves the weights alone and injects relevant information into the prompt at query time. Fine-tuning is the right answer when you need consistent tone, structured output, or specialised reasoning. RAG is the right answer when you need the model to know facts that change or that weren't in training data. Most production systems use both: RAG for the facts, light fine-tuning or few-shot prompting for format and behaviour. They are complementary, not competing.
What does a RAG system cost to run?
For a mid-market deployment with around 100,000 documents and a few thousand queries per day, expect monthly running costs in the low hundreds to low thousands of pounds, dominated by LLM inference. Embedding the initial corpus is typically £20-£200 as a one-off depending on size and model choice. Vector storage on managed services starts around £60/month; self-hosted Postgres with pgvector is materially cheaper. Build cost for a first production system is typically £30k-£120k depending on corpus complexity, access control requirements, and channel integration scope. Ongoing operation runs lighter, usually 1-3 days per month of engineering effort.
Can RAG eliminate hallucinations?
It reduces them significantly but does not eliminate them. Even with relevant context in the prompt, models occasionally answer from parametric memory or extrapolate beyond what the source supports. Strong system prompts that mandate answering only from provided context, explicit refusal patterns when retrieval confidence is low, citation-based responses that force the model to point at its sources, and a faithfulness metric in your evaluation harness all push the hallucination rate down. Production systems running properly evaluated RAG typically see hallucination rates in the low single digits per cent on in-scope questions, but the residual risk is real and matters most in regulated contexts.
What's the difference between RAG and an AI agent?
RAG is a single retrieval-then-generate step. An agent is a multi-step system that can plan, call tools, retrieve information, take actions, and iterate. RAG is usually one of the tools an agent has available. If your use case is "answer questions from a knowledge base," RAG alone is usually enough and more reliable. If your use case is "handle a customer support ticket end to end, including looking things up, querying systems, and taking actions," you want an agent that uses RAG as one component. Agents add capability and add failure modes; start with RAG and graduate to agents when the task genuinely requires multi-step reasoning.
Do we need a vector database, or will Postgres do?
Postgres with the pgvector extension handles corpora into the millions of vectors comfortably and is our default recommendation for most mid-market builds. Dedicated vector databases like Pinecone, Weaviate, and Qdrant become worthwhile when you need very high query throughput, advanced filtering at scale, or operational features like managed scaling. The decision is rarely about the vector search itself and more about how the rest of your stack looks. If you already run Postgres, sticking with it removes a moving part. If you have no relational database and want a managed service, a dedicated vector store may be simpler.
How long does a RAG project take to ship?
For a first production-grade system, expect 8-14 weeks from kickoff to live deployment. Two weeks for corpus analysis, chunking experiments, and evaluation set creation. Six to ten weeks for the retrieval pipeline, prompt engineering, access control, channel integration, and observability. One to two weeks for user acceptance testing and rollout. A proof of concept that demonstrates the pattern on a small corpus can be stood up in two to three weeks, but the gap between a demo and a system you'd put in front of customers is wider than it looks. The evaluation harness and access control are where most of the underestimation happens.
How does RAG handle GDPR and data residency?
RAG systems process personal data in two places: the corpus you index, and the queries users send. Under UK GDPR, you remain the data controller for both. Practical steps: keep the vector database and document store in-region (UK or EU), use enterprise LLM endpoints that contractually exclude your data from training (Azure OpenAI, AWS Bedrock, and Anthropic's enterprise tiers all support this), enforce row-level access control so retrieval only returns documents the user is entitled to see, log queries and responses for audit, and run a DPIA before launch. The ICO's guidance on AI and data protection at ico.org.uk sets out the current expectations and is the right starting reference.
Closing
RAG is the workhorse pattern of useful generative AI in business. It's not the most glamorous architecture, but it solves the practical problem of getting accurate, cited, up-to-date answers from a model that was trained on the open internet rather than on your business. The technology is mature enough that the differentiator is no longer whether you can build it - the differentiator is whether you build it with proper evaluation, sensible access control, and honest cost discipline. The teams that skip those steps end up with demos that impressed leadership and systems that quietly stop being used.
If you're scoping a RAG build and want a sober view on whether it's the right pattern for your use case, what it will cost to run, and what a realistic delivery plan looks like, AI Advisory runs short discovery engagements that produce a costed roadmap rather than a slide deck.
Ready to put this into production? Book a discovery call.