AI Workflow Agency

What is RAG? Retrieval-Augmented Generation Explained

A practitioner's guide to retrieval-augmented generation: how RAG works, when to use it, costs, pitfalls, and how it compares to fine-tuning

By AI Advisory team

Retrieval-augmented generation (RAG) is the architecture behind most production AI assistants that need to answer questions from a specific body of knowledge - your support docs, your contracts, your product catalogue, your internal wiki. It is the difference between a chatbot that makes things up and one that cites the relevant policy paragraph.

This article explains what RAG actually is, how it works end-to-end, where it fits versus fine-tuning, what it costs, and the practical failure modes that catch teams out on the first build. It assumes you have used ChatGPT or Claude but have not yet shipped a RAG system in production.

What RAG is, in one paragraph

A large language model on its own only knows what was in its training data, cut off at some date, and it has no access to your private information. RAG fixes this by adding a retrieval step before generation: when a user asks a question, the system first searches a knowledge base for the most relevant snippets, then passes those snippets to the model as context alongside the original question. The model generates an answer grounded in the retrieved material. The retrieval step is what makes the response accurate, current, and traceable back to a source.

The term was coined in a 2020 paper by Lewis et al. at Facebook AI Research, which introduced the pattern of combining a dense retriever with a sequence-to-sequence generator. The architecture has since become the default approach for enterprise AI assistants. Andreessen Horowitz's 2024 enterprise survey found that the majority of production generative AI deployments use retrieval rather than fine-tuning as their primary customisation method.

How RAG works, step by step

A working RAG system has two distinct phases: an offline indexing phase that runs whenever your knowledge base changes, and an online query phase that runs every time a user asks something.

Indexing (offline)

You take your source documents - PDFs, Confluence pages, Notion exports, database rows, transcripts - and break them into chunks of roughly 200 to 800 tokens. Each chunk is passed through an embedding model (OpenAI's text-embedding-3-large, Cohere's embed-v3, or an open-source option like BGE) which converts the text into a vector of anywhere from a few hundred to about 3,000 numbers, depending on the model. That vector represents the semantic meaning of the chunk in a high-dimensional space. You store the vector, the original text, and metadata (source URL, last-modified date, access permissions) in a vector database such as Pinecone, Weaviate, Qdrant, or Postgres with pgvector.
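As a rough illustration of the indexing step, the sketch below splits text into overlapping chunks and builds one record per chunk. It makes several simplifying assumptions: whitespace-separated words stand in for tokens, `toy_embed` is a hypothetical placeholder for a real embedding model, and a plain Python list stands in for the vector database. The names `ChunkRecord`, `chunk_words`, and `index_document` are ours, not from any library.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """One row in the index: the chunk text, its vector, and metadata."""
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def chunk_words(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Hypothetical placeholder: buckets words into a small count vector.
    A real pipeline would call an embedding model here instead."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dims] += 1.0
    return vec


def index_document(text: str, source_url: str, size: int = 400, overlap: int = 50) -> list[ChunkRecord]:
    """Chunk a document and attach a vector and metadata to every chunk."""
    return [
        ChunkRecord(text=chunk, vector=toy_embed(chunk),
                    metadata={"source_url": source_url, "chunk": i})
        for i, chunk in enumerate(chunk_words(text, size, overlap))
    ]
```

In a real build the list of records would be written to the vector store, and the indexing job would re-run whenever the source documents change.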

Querying (online)

When a user asks a question, the system embeds the question using the same embedding model. It then searches the vector store for the chunks whose vectors are closest to the question's vector - usually using cosine similarity - and returns the top five to twenty. Those chunks are inserted into a prompt template along with the user's question and instructions like "answer only from the context provided, and cite the source". The full prompt goes to the generation model (GPT-4, Claude, Llama 3, whatever you have selected). The model produces an answer. If you have built it properly, the answer includes citations pointing back to the source documents.
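The query loop can be sketched in a few lines. This is a minimal version under stated assumptions: the store is an in-memory list of `(chunk_id, vector, text)` tuples, the query vector is assumed to come from the same embedding model as the stored vectors, and the function names and prompt wording are illustrative rather than taken from any framework.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=5):
    """store: list of (chunk_id, vector, text). Returns (score, chunk_id, text), best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id, text)
              for chunk_id, vec, text in store]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


PROMPT_TEMPLATE = """Answer only from the context provided, and cite the source.

Context:
{context}

Question: {question}
"""


def build_prompt(question, hits):
    """Insert the retrieved chunks, labelled with their ids, into the prompt template."""
    context = "\n\n".join(f"[{chunk_id}] {text}" for _, chunk_id, text in hits)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Labelling each chunk with its id inside the prompt is what lets the model cite a source that you can later map back to a document.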

That is the minimum viable loop. Production systems add reranking (a second-stage model that reorders the retrieved chunks for relevance), hybrid search (combining vector similarity with traditional keyword search via BM25), query rewriting (using an LLM to expand or clarify the question before retrieval), and evaluation harnesses to measure accuracy over time.

Why RAG instead of just using a bigger model

Frontier models like GPT-4o and Claude 3.5 Sonnet are extraordinary general reasoners, but they have four limitations that retrieval addresses directly.

Training cutoffs. Models do not know about events after their training date. Your pricing changed last month, your product shipped a new feature last week, your compliance team updated the policy yesterday. RAG injects current information at query time.

Private data. The model was never trained on your contracts, your support history, or your internal playbooks. It cannot answer questions about them unless you give it access. RAG is the access mechanism.

Hallucination control. When a model is asked something it does not know, it tends to invent a plausible-sounding answer. Grounding the model in retrieved text and instructing it to refuse when the context does not contain the answer reduces hallucination dramatically. It does not eliminate it - we will get to that - but it moves the failure mode from confident fiction to "I do not have that information in my sources".

Auditability. Regulated industries need to show why the system gave a particular answer. RAG provides citations that point to the specific source chunk used. A fine-tuned model gives you an answer with no trail.

RAG versus fine-tuning

This is the question every technical buyer asks in week one, and the honest answer is that they solve different problems and are often combined.

Fine-tuning teaches a model new behaviours, styles, or formats. It is good for getting a model to consistently produce output in your house tone, to follow a specific reasoning pattern, or to handle a narrow classification task. It is not a good way to teach a model new facts, because facts in training data are diffuse, hard to update, and impossible to cite.

RAG teaches a model new facts at query time. It is good for question-answering over a body of knowledge that changes, for traceable answers, and for systems where the source documents are large or numerous.

A useful heuristic: if you would write the answer differently depending on which document you read, you need RAG. If you would write all answers in the same format regardless of which document, you might need fine-tuning. Most production systems use base models with RAG for facts and prompt engineering for style; fine-tuning enters the picture only when prompt engineering plateaus, which in our experience is rare.

OpenAI's own guidance, in its production documentation, recommends starting with prompt engineering, then adding retrieval, and only fine-tuning when both are insufficient.

What a good RAG system looks like in production

A first prototype is easy. A production system that answers reliably across thousands of questions is harder. The components that separate a demo from something a business actually relies on:

Chunking strategy that respects document structure. Naive fixed-size chunking splits sentences mid-clause and separates headings from the content beneath them. Better chunkers use document structure - markdown headings, PDF sections, semantic boundaries - so each chunk is self-contained. Anthropic's contextual retrieval research, published in 2024, showed that prepending a short LLM-generated description of where each chunk sits within its parent document reduces retrieval failures by around 35%.
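For markdown sources, structure-aware chunking can be as simple as splitting at headings and carrying the heading trail along with each chunk. The sketch below assumes ATX-style headings (`#` to `######`) and does not cap chunk size; `chunk_by_headings` is an illustrative name, not a library function.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split a markdown document at headings, keeping each heading trail with its body."""
    chunks = []
    path = []  # current heading trail, e.g. ["Refunds", "Exceptions"]
    body = []  # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Embedding the heading trail together with the text (for example "Refunds > Exceptions: Gift cards are non-refundable.") keeps a short chunk meaningful when it is retrieved on its own.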

Hybrid retrieval. Pure vector search misses exact-match queries (product codes, error messages, names). Pure keyword search misses paraphrases. Hybrid search runs both and merges results using reciprocal rank fusion or a weighted combination. Microsoft's research on Azure AI Search reports that hybrid plus semantic reranking outperforms either method alone on most enterprise corpora.
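Reciprocal rank fusion is short enough to show in full. Each document scores the sum of 1 / (k + rank) across the rankings it appears in, so items ranked well by both searches rise to the top. The sketch assumes each search returns an ordered list of document ids; the constant k = 60 is the commonly used default, and the function name is ours.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # ids from vector search, best first
keyword_hits = ["c", "a", "d"]  # ids from keyword (BM25) search, best first
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["a", "c", "b", "d"]
```

Because the method uses ranks rather than raw scores, it does not need the vector and keyword scores to be on the same scale.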

Reranking. After retrieving the top fifty candidates by vector similarity, a cross-encoder reranker like Cohere Rerank or BGE Reranker scores each candidate against the query in a more accurate (but slower) way, and you keep the top five. This is one of the highest-impact additions for accuracy.
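The two-stage shape looks roughly like this. The scorer is passed in as a function so that a real cross-encoder can be swapped in; `overlap_score` below is a deliberately crude, hypothetical stand-in used only to make the sketch runnable, and it is not how a cross-encoder works internally.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Score every candidate against the query and keep the best `keep`."""
    ordered = sorted(candidates, key=lambda text: score(query, text), reverse=True)
    return ordered[:keep]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0
```

The first stage can afford to be generous (fifty candidates) precisely because the second stage is there to cut the list back down.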

Refusal patterns. The system prompt must instruct the model to say "I do not have that information" when the retrieved context does not contain the answer. Without this, models confabulate. With it, you get clean failures you can monitor and fix.
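One way to make refusals monitorable is to fix the exact refusal wording in the system prompt and then count how often it appears in answers. The prompt text and helper names below are illustrative assumptions, not a recommended canonical prompt.

```python
REFUSAL = "I do not have that information"

SYSTEM_PROMPT = (
    "Answer using only the context supplied with the question. "
    f'If the context does not contain the answer, reply exactly: "{REFUSAL}". '
    "Do not guess."
)


def is_refusal(answer: str) -> bool:
    """True if the answer contains the agreed refusal phrase (case-insensitive)."""
    return REFUSAL.lower() in answer.lower()


def refusal_rate(answers: list[str]) -> float:
    """Share of answers that were refusals; 0.0 for an empty list."""
    if not answers:
        return 0.0
    return sum(is_refusal(a) for a in answers) / len(answers)
```

A rising refusal rate on a topic is usually a signal that the knowledge base is missing content, which is a far easier problem to fix than silent confabulation.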

An evaluation harness. A set of 100 to 500 question-answer pairs that you run on every change to chunking, retrieval, prompt, or model. Without this, you are flying blind - improvements in one area silently break another. RAGAS and TruLens are common open-source frameworks; many teams roll their own.
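A home-grown harness can start very small. The sketch below assumes each test case carries either a key phrase the answer must contain or `None` to mean the system should refuse, and that `answer_fn` is whatever function wraps your pipeline. Substring matching is a crude proxy for correctness; real harnesses usually add human or LLM grading on top.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "I do not have that information"


@dataclass
class EvalCase:
    question: str
    expected: Optional[str]  # phrase a correct answer must contain; None means "should refuse"


def run_eval(cases: list[EvalCase], answer_fn: Callable[[str], str]) -> dict:
    """Run every case through the pipeline and report accuracy plus the failing cases."""
    failures = []
    for case in cases:
        answer = answer_fn(case.question)
        refused = REFUSAL.lower() in answer.lower()
        if case.expected is None:
            passed = refused  # the right behaviour here is to refuse
        else:
            passed = not refused and case.expected.lower() in answer.lower()
        if not passed:
            failures.append((case.question, answer))
    total = len(cases)
    accuracy = (total - len(failures)) / total if total else 0.0
    return {"accuracy": accuracy, "failures": failures}
```

Running this on every change and comparing the accuracy figure and the failure list against the previous run is what turns "it feels better" into something you can measure.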

Permission filtering. If your knowledge base contains documents that not every user should see, the retrieval step must filter by user permissions before generation. This is non-negotiable in regulated sectors and frequently missed in prototypes.
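A minimal sketch of the idea, assuming each chunk's metadata carries an `allowed_groups` list (a hypothetical field name) and that the user's group memberships are known at query time. Chunks with no permission metadata are denied by default.

```python
def filter_by_permission(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only the chunks that at least one of the user's groups may read."""
    allowed = []
    for chunk in chunks:
        groups = set(chunk.get("allowed_groups", ()))
        if groups & user_groups:  # any shared group grants access
            allowed.append(chunk)
    return allowed
```

Most vector databases can apply this kind of metadata filter inside the search itself, which is preferable to filtering afterwards because the top results are then drawn only from documents the user is allowed to see.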

Where RAG breaks

The pattern is well understood, but the failure modes are subtle.

Bad source data. RAG cannot fix a knowledge base full of contradictions, outdated PDFs, and duplicate documents. Garbage in, confidently cited garbage out. Most RAG projects spend more time on document hygiene than on the AI itself.

Questions that require multi-hop reasoning. "Which of our clients in the Manchester office signed contracts longer than 24 months in 2023?" requires retrieving a list, filtering by date, and joining to contract terms. Standard RAG retrieves chunks and hopes the model can stitch them together. It often cannot. Solutions include agentic RAG (the model issues multiple retrieval calls), text-to-SQL for structured data, or graph-based retrieval.

Lost in the middle. Research from Stanford in 2023 ("Lost in the Middle" by Liu et al.) showed that models attend more to the beginning and end of long contexts than the middle. If you stuff twenty chunks into the prompt, the model may ignore chunks ten to fifteen. Reranking and using fewer, better chunks beats more chunks.

Embedding model drift. If you change embedding models, you must re-index everything. Mixing vectors from different models in the same store produces nonsense results. Pick your embedding model with at least an eighteen-month horizon in mind.
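A cheap safeguard is to record which embedding model an index was built with and reject anything else at write time. The class below is an in-memory illustration only; `VectorStore` and its methods are hypothetical and do not correspond to any particular database client.

```python
class VectorStore:
    """Toy store that refuses vectors from a different embedding model."""

    def __init__(self, embedding_model: str, dims: int):
        self.embedding_model = embedding_model
        self.dims = dims
        self.rows: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float], embedding_model: str) -> None:
        if embedding_model != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model}, got {embedding_model}; re-index instead"
            )
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dimensions, got {len(vector)}")
        self.rows[chunk_id] = vector
```

The same check belongs on the query path: a question embedded with a different model than the index will return results, but not meaningful ones.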

What RAG costs

The cost structure has four components: embedding (one-off per document, plus on updates), vector storage (monthly), retrieval (per query, usually negligible), and generation (per query, dominant).

For a typical mid-market deployment - a knowledge base of 50,000 documents averaging 5,000 tokens each, handling 10,000 queries a month - you are looking at roughly:

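- Initial embedding: about $33 for a single pass over the corpus (250 million tokens using OpenAI's text-embedding-3-large at $0.13 per million tokens); budget $50 to $200 one-off to allow for chunk overlap and re-indexing during development.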
- Vector storage: $70 to $300 a month depending on provider (Pinecone, Qdrant Cloud, or self-hosted pgvector).
- Generation: $200 to $2,000 a month depending on which model handles queries and how much context you pass. GPT-4o-mini and Claude Haiku are usually sufficient for retrieval-grounded answers and cost roughly a tenth of their flagship siblings.
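The arithmetic behind these figures is easy to reproduce. In the sketch below, the corpus size, query volume, and embedding price come from the example above; the per-query token counts and the generation prices are placeholder assumptions chosen for illustration, not quoted rates, so substitute your provider's current pricing.

```python
def embedding_cost(docs: int, tokens_per_doc: int, price_per_million: float) -> float:
    """One-off cost in dollars of embedding the whole corpus in a single pass."""
    return docs * tokens_per_doc / 1_000_000 * price_per_million


def generation_cost(queries: int, input_tokens: int, output_tokens: int,
                    input_price: float, output_price: float) -> float:
    """Monthly generation cost in dollars; prices are per million tokens."""
    per_query = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return queries * per_query


# 50,000 documents x 5,000 tokens = 250 million tokens at $0.13 per million.
one_off = embedding_cost(50_000, 5_000, 0.13)  # about $32.50

# Assumed: 8,000 context tokens and 300 answer tokens per query,
# at placeholder prices of $2.50 (input) and $10.00 (output) per million tokens.
monthly = generation_cost(10_000, 8_000, 300, 2.50, 10.00)  # about $230
```

A single embedding pass comes to about $32.50 at the quoted price; overlap between chunks and repeated re-indexing add to that. On the generation side, the context tokens dominate, which is why retrieving fewer, better chunks saves money as well as improving accuracy.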

Build cost for a production-grade first system, including ingestion pipeline, evaluation harness, refusal logic, and a usable interface, typically lands in the £25,000 to £80,000 range for UK mid-market projects, with ongoing operation at £2,000 to £8,000 a month depending on volume and how aggressively you iterate.

When to use RAG and when not to

RAG is the right tool when:

- You have a body of knowledge that is too large to fit in a prompt and changes more than once a quarter.
- Users ask natural-language questions whose answers exist somewhere in your documents.
- You need citations or audit trails.
- The cost of a wrong answer is high enough to justify retrieval infrastructure but not so high that you need deterministic rules.

RAG is the wrong tool when:

- Your data is highly structured and queries are predictable - use SQL or a search engine.
- You need exact, deterministic answers (compliance lookups, regulatory citations of fixed text) - use direct retrieval with templated responses, not generation.
- The corpus is small enough to fit in a long-context prompt (under 100,000 tokens) and rarely changes - just put it in the system prompt.
- Users want to perform actions, not retrieve information - you want an agent with tools, not a RAG pipeline.

Frequently asked questions

Is RAG the same as a vector database?

No. A vector database is one component of a RAG system - the place where document embeddings are stored and searched. RAG is the overall pattern of retrieving relevant context and feeding it to a language model. You can build RAG without a vector database (using keyword search, SQL, or a document store), and you can use a vector database for things other than RAG (recommendation systems, semantic deduplication, image search). Most production RAG systems do use a vector database because semantic similarity search is the most flexible way to find relevant chunks, but the architecture is broader than any single tool.

How long does it take to build a production RAG system?

For a well-scoped first deployment over a defined corpus, expect 8 to 14 weeks from kickoff to production. The first two weeks are discovery: understanding the source documents, the questions users actually ask, and the accuracy bar. Weeks three to eight cover ingestion, retrieval, prompt design, and evaluation harness. Weeks nine to fourteen are iteration against real questions, refusal tuning, and UI polish. Teams that rush to launch in four weeks usually spend the next six months rebuilding because they skipped evaluation and have no way to measure whether changes help or hurt.

Does RAG work with open-source models like Llama 3?

Yes, and increasingly well. The retrieval architecture is model-agnostic - you can swap GPT-4 for Llama 3.1 70B, Mistral Large, or Qwen 2.5 without changing the rest of the pipeline. Open-source models running on your own infrastructure are attractive when data residency matters, when query volume makes API costs uneconomic, or when you need full control over the model. The trade-off is operational complexity: you take on GPU hosting, model updates, and reliability engineering. For most UK mid-market deployments handling under a million queries a month, hosted API models remain more cost-effective once total cost of ownership is included.

How accurate is RAG compared to a human answering from the same documents?

On well-tuned production systems over clean corpora, we typically see answer accuracy in the 85 to 95% range against a held-out evaluation set, where accuracy means "factually correct and grounded in the source". Humans answering the same questions from the same documents land around 90 to 97%. The remaining gap is closed by escalation paths - if the model's confidence is low or the retrieved context is sparse, route the question to a human. The goal is rarely full automation; it is handling the 70 to 80% of routine questions automatically so humans can focus on the hard ones.
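An escalation rule can start as a simple threshold check. The sketch below uses the top retrieval similarity score as a rough proxy for confidence and the number of retrieved chunks as a proxy for sparse context; both thresholds are arbitrary assumptions that would need tuning against your own evaluation set.

```python
def route(top_score: float, n_chunks: int,
          min_score: float = 0.75, min_chunks: int = 2) -> str:
    """Return "human" when retrieval looks weak, otherwise "auto"."""
    if n_chunks < min_chunks or top_score < min_score:
        return "human"
    return "auto"
```

Logging every question routed to a human also builds a ready-made list of gaps to fix in the knowledge base.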

What about GDPR and data security?

RAG raises three distinct concerns under UK GDPR. First, lawful basis for processing personal data in your knowledge base - if documents contain personal data, you need a lawful basis to embed and retrieve them. Second, the choice of generation model: sending personal data to a US-based API may require a transfer mechanism such as the UK addendum to the EU SCCs, or you can deploy a UK or EU-hosted model. Third, access control: the retrieval step must respect document-level permissions so that users cannot retrieve content they would not normally see. The ICO's guidance on AI and data protection covers these obligations in detail and should be reviewed before any production rollout involving personal data.

Can RAG hallucinate?

Yes, but far less than an ungrounded model, and the hallucinations are different in character. With a strong refusal prompt and good retrieval, the model will mostly say "I do not have that information" when the context is missing. Hallucinations that do occur tend to be subtle - the model paraphrases the source in a way that changes the meaning, or stitches together information from two chunks in a way the source documents do not support. This is why an evaluation harness with adversarial questions matters: you measure not just whether the model answers, but whether it answers correctly and refuses appropriately.

Do I need a separate RAG system for each use case?

Usually yes, at least at the retrieval and prompt level. A support chatbot and an internal legal assistant might share infrastructure - same vector database, same embedding model, same orchestration framework - but their corpora, prompts, refusal patterns, and evaluation sets will differ substantially. Trying to build one universal RAG system across every use case in a business tends to produce something that does each job badly. The pragmatic pattern is a shared platform layer with use-case-specific configurations on top, which gives you reuse without forcing a single retrieval strategy onto incompatible problems.

Where to go from here

If you are exploring RAG for a specific use case - support automation, internal knowledge search, compliance Q&A, contract analysis - the highest-value next step is usually a two-week scoping exercise: map the source documents, sample the real questions, set an accuracy bar, and produce a costed build plan. AI Advisory runs this as a fixed-fee engagement and we are happy to discuss whether RAG is the right pattern for your problem or whether something simpler will do the job.
