AI | 14 June 2026 | 5 min read

What is RAG in AI and How Does It Work?

A practical explainer of Retrieval-Augmented Generation: what RAG is, how it works under the hood, where it fails, and when to use it

By AI Advisory team

Retrieval-Augmented Generation, almost always shortened to RAG, is the technique behind most production AI assistants that answer questions over a company's own data. If you have ever asked a chatbot about a policy document, a product catalogue, or an internal wiki and got a coherent answer with a citation, you were probably talking to a RAG system.

The idea is simple. Large language models are good at writing and reasoning but bad at remembering specific facts they were not trained on. RAG fixes that by retrieving relevant text from a source you control, then handing it to the model as context before it answers. The model stops guessing and starts reading.

This article walks through what RAG is, how the pipeline actually works step by step, where it tends to break, and when it is the right tool for the job.

The problem RAG solves

A foundation model like GPT-4, Claude, or Llama is trained on a snapshot of public text. It has no knowledge of your contract templates, your support tickets from last Tuesday, your product SKUs, or the regulatory guidance your compliance team published this morning. Ask it about any of those and you get one of three outcomes: a refusal, a generic answer, or a confident hallucination.

You have three options to fix that. You can fine-tune the model on your data, which is expensive, slow to update, and still prone to hallucination on specifics. You can stuff everything into the context window, which works for a handful of documents but collapses at scale and gets expensive fast - even with a million-token context, you pay for every token on every call. Or you can retrieve only the relevant chunks at query time and give those to the model. That third option is RAG.

Patrick Lewis and colleagues at Facebook AI Research (now Meta AI) coined the term in 2020, originally as a way to combine a parametric model (the LLM) with a non-parametric memory (a vector index of documents). The architecture has evolved, but the core trade has not: lower hallucination, fresher answers, and traceable citations, in exchange for the engineering work of building and maintaining a retrieval pipeline.

How a RAG pipeline actually works

A working RAG system has two phases: an offline indexing phase that runs whenever your source content changes, and an online query phase that runs every time a user asks something.

Indexing: turning documents into something searchable

Indexing starts with ingestion. You pull source content from wherever it lives - SharePoint, Confluence, a Postgres database, a folder of PDFs, a CRM, a ticketing system. Each source needs a connector that handles authentication, change detection, and rate limits. This is unglamorous plumbing and it is where most projects lose their first month.

Next comes parsing. PDFs need text extraction (tools like Unstructured or AWS Textract). HTML needs cleaning. Spreadsheets need flattening into a form the model can read. Tables and images are the hardest part; most teams ignore them initially and pay for it later when users ask about a chart.

Then chunking. You cannot store an entire 200-page document as one searchable unit, because retrieval becomes too coarse. You split each document into chunks of roughly 200 to 800 tokens, usually with some overlap so a concept that straddles a chunk boundary still gets caught. Chunking strategy matters more than people expect: splitting by semantic boundary (sections, paragraphs) generally outperforms naive character-count splits.
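As a rough illustration of the overlap idea, here is a minimal sketch of a sliding-window chunker. It is the naive baseline rather than a semantic splitter, and it assumes whitespace-separated words are a good enough stand-in for model tokens; the function names and default sizes are illustrative, not taken from any particular library.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into windows of chunk_size tokens.

    Each window shares `overlap` tokens with the previous one, so a
    concept that straddles a boundary appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text, chunk_size=400, overlap=50):
    # Assumption: whitespace-separated words approximate model tokens.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap)]
```

A semantic splitter would first cut on section or paragraph boundaries and only fall back to a window like this for passages that are still too long.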

Each chunk then gets embedded. An embedding model - OpenAI's text-embedding-3-large, Cohere Embed v3, or an open-source option like BGE - converts the text into a vector of typically 768 to 3072 floating-point numbers. Similar meanings produce similar vectors. The vectors and the original text get stored in a vector database (Pinecone, Weaviate, Qdrant, or pgvector inside Postgres) alongside metadata like source URL, document title, author, and last-updated timestamp.

Query: retrieving and generating

When a user asks a question, the system embeds the question using the same model used for the documents. It then runs a similarity search - usually cosine similarity or dot product - against the vector index and pulls back the top-k matching chunks, typically between 5 and 20.
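The similarity search can be sketched in a few lines. This is a brute-force version over a toy in-memory index with made-up three-dimensional vectors and chunk ids; a real vector database uses an approximate nearest-neighbour index rather than scoring every chunk, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=5):
    """index is a list of (chunk_id, vector) pairs.

    Returns the k best (chunk_id, score) pairs, highest score first.
    """
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical index: chunk ids and vectors are invented for illustration.
index = [
    ("refund-policy", [0.9, 0.1, 0.0]),
    ("holiday-policy", [0.1, 0.9, 0.0]),
    ("expenses", [0.5, 0.5, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

With these toy vectors, the query lands closest to "refund-policy", followed by "expenses".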

Pure vector search misses things. Vectors capture semantic similarity but struggle with exact matches: product codes, names, acronyms. Production systems almost always run hybrid retrieval, combining vector search with a keyword index like BM25, then fusing the two ranked lists. Anthropic's contextual retrieval write-up reported a 49% reduction in retrieval failures by combining embeddings with BM25 and contextualised chunks.
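One common way to fuse the two ranked lists is Reciprocal Rank Fusion (RRF), which needs only the rank positions and not the raw scores. The article does not say which fusion method any given system uses, so treat this as one reasonable option; the chunk ids are invented and k=60 is a commonly used default for the smoothing constant.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids into one ranking.

    Each id scores 1 / (k + rank) for every list it appears in,
    so ids ranked highly by both retrievers rise to the top.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here "c1" and "c3" appear in both lists and so outrank "c9" and "c7", which each appear in only one.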

The retrieved chunks are then optionally re-ranked by a cross-encoder model - Cohere Rerank, or a smaller open-source reranker - which scores each chunk against the question more accurately than the initial similarity search can. Re-ranking adds latency and cost but usually lifts answer quality noticeably.

Finally, the top chunks (after re-ranking, usually 3 to 8 of them) are inserted into a prompt template along with the user's question and instructions like "answer only from the provided context; if the answer is not present, say so." The LLM generates an answer, and the system returns it to the user, typically with citations pointing back to the source documents.
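The prompt assembly step might look something like the sketch below. The template wording, the numbered-source convention, and the chunk dictionary keys ("source", "text") are assumptions made for illustration; the resulting string would be sent to whichever LLM API you use.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    chunks is a list of dicts with 'source' and 'text' keys.
    Each chunk is numbered so the model can cite it as [1], [2], ...
    """
    context_blocks = []
    for i, chunk in enumerate(chunks, start=1):
        context_blocks.append(f"[{i}] Source: {chunk['source']}\n{chunk['text']}")
    context = "\n\n".join(context_blocks)

    return (
        "Answer only from the provided context; "
        "if the answer is not present, say so. "
        "Cite the bracketed source numbers you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunk.
prompt = build_prompt(
    "How many days of annual leave do I get?",
    [{"source": "hr/leave-policy.pdf", "text": "Employees accrue 25 days of annual leave."}],
)
```

Numbering the chunks gives the application a simple way to map citations in the answer back to source documents.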

Where RAG goes wrong in production

RAG looks straightforward in a demo. The first production deployment usually exposes a stack of failure modes that the demo never hit.

Retrieval failures. The model can only answer from what retrieval finds. If the chunk containing the answer never makes it into the top-k, the model either refuses or fabricates. This is the single most common cause of RAG underperformance. Causes include bad chunking, weak embeddings for domain-specific language, missing keyword fallback, or a question phrased very differently from the source text.

The standard diagnostic is to build an evaluation set of question-answer pairs where you know which document contains the answer, then measure retrieval recall at various k values. Ragas and TruLens are the open-source tools most teams reach for.
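Retrieval recall at k is simple enough to compute by hand before reaching for a framework. The sketch below assumes each evaluation question has exactly one known source document and that your retriever is exposed as a function returning ranked document ids; the fake retriever and ids are placeholders.

```python
def recall_at_k(eval_set, retrieve, k):
    """Fraction of questions whose known source document is in the top k.

    eval_set is a list of (question, expected_doc_id) pairs.
    retrieve(question) returns a ranked list of document ids.
    """
    if not eval_set:
        return 0.0
    hits = sum(
        1 for question, expected in eval_set
        if expected in retrieve(question)[:k]
    )
    return hits / len(eval_set)


# Hypothetical retriever: canned rankings stand in for a real pipeline.
fake_rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d5", "d4", "d6"],
    "q3": ["d9", "d8", "d7"],
}
eval_set = [("q1", "d1"), ("q2", "d4"), ("q3", "d0")]

recall_1 = recall_at_k(eval_set, fake_rankings.get, k=1)
recall_3 = recall_at_k(eval_set, fake_rankings.get, k=3)
```

In this toy set, one of three questions is answered at k=1 and two of three at k=3. The third question never retrieves its source, which is the kind of miss worth investigating first.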

Stale or duplicate content. If your indexing job runs nightly but the policy changed at 9am, users get the old answer. If three slightly different versions of the same document exist in SharePoint, the model picks one effectively at random and may contradict itself across sessions. Source hygiene matters as much as model choice.

Context window stuffing. Cramming 20 chunks into a prompt does not always improve quality. Research from Stanford on the "lost in the middle" problem showed that models attend most strongly to information at the start and end of their context, and miss things buried in the middle. More retrieved chunks can mean worse answers.

Hallucination despite retrieval. Even with good context, models sometimes ignore it and fall back on training data, especially when the context contradicts widely-held priors. Strong prompting ("answer only from the provided context") helps. Newer models hallucinate less than older ones. Citation requirements that force the model to quote source text help more.

Permissions and data leakage. If your vector index contains documents some users should not see, naive retrieval will surface them. Every chunk needs document-level access control metadata, and retrieval has to filter by the requesting user's permissions before generation. Skipping this step is how confidential salary information ends up in a sales rep's chatbot answer.
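A minimal sketch of the permission check follows, assuming each chunk carries an "allowed_groups" metadata field (a hypothetical name) and that the user's group memberships are known at query time. It filters after retrieval for clarity; in practice you would usually push the same condition into the vector database query as a metadata filter, so the top-k is filled only with chunks the user may see.

```python
def filter_by_permissions(chunks, user_groups):
    """Keep only chunks the requesting user is allowed to read.

    A chunk is kept when its allowed_groups overlap the user's groups.
    Chunks with no access metadata are dropped (fail closed).
    """
    user_groups = set(user_groups)
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", ()))
    ]


# Hypothetical retrieved chunks with access-control metadata.
retrieved = [
    {"id": "c1", "allowed_groups": ["sales", "all-staff"]},
    {"id": "c2", "allowed_groups": ["hr"]},
    {"id": "c3"},  # no metadata: treated as not visible
]
visible = filter_by_permissions(retrieved, ["sales"])
```

Failing closed on missing metadata means an ingestion bug hides a document rather than leaking it.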

RAG vs fine-tuning vs long context

The three approaches solve different problems and the right answer is usually a combination.

RAG is best when your knowledge changes often, when traceability matters (regulated industries, customer support), when you have a lot of source material, and when users ask varied factual questions. It is the default for question-answering over company data.

Fine-tuning is best when you need the model to adopt a specific style, follow a particular output format reliably, or perform a narrow task (classification, structured extraction) better than the base model. It is poor at teaching the model new facts; the research consistently shows fine-tuning is unreliable for factual recall compared to retrieval.

Long context (dropping all your documents into a million-token prompt) is increasingly viable for small or medium corpora and has the advantage of zero retrieval engineering. But it is expensive per query, slower, hits the lost-in-the-middle problem, and does not scale past the context limit. For a 50-page handbook, long context is fine. For 50,000 support tickets, RAG wins on cost and latency.

Most production systems combine them: a fine-tuned model that knows your tone and output format calls a RAG pipeline for facts and occasionally falls back to long-context summarisation for whole-document tasks.

When RAG is the right call

RAG is the right architecture when these conditions hold:

You have a defined corpus of text you control (docs, tickets, policies, product data).
The corpus is too big or changes too often to fit in a prompt.
Users will ask questions whose answers are in the corpus.
You can tolerate the engineering work of running a retrieval pipeline (or pay someone to run it for you).
You need to be able to point at the source of an answer.

It is the wrong call when the task is generative rather than factual (writing marketing copy, brainstorming), when there is no defined corpus, or when the questions require reasoning across hundreds of documents at once (an agent with tool use is usually a better fit there).

A reasonable first build for a mid-sized organisation - say, RAG over a 5,000-document knowledge base with hybrid retrieval, re-ranking, evaluation harness, and a simple chat interface - takes 8 to 14 weeks and £40k to £120k depending on integration complexity. Running costs are dominated by embedding model calls during indexing and LLM calls at query time; a system handling 10,000 queries a month typically runs £200 to £1,500 in model fees, plus vector database hosting.

FAQs

What does RAG stand for and who invented it?

RAG stands for Retrieval-Augmented Generation. The term and the original architecture come from a 2020 paper by Patrick Lewis and colleagues at Meta AI (then Facebook AI Research), published at NeurIPS. The original paper combined a dense passage retriever with a sequence-to-sequence generator and showed strong results on open-domain question answering. The core pattern - retrieve relevant text, condition generation on it - predates the paper in information retrieval research, but the 2020 work formalised it as a unified architecture for use with large language models, and the name stuck.

Is RAG the same as a vector database?

No. A vector database is one component of a RAG pipeline, used to store and search embeddings. RAG is the whole pattern: ingestion, parsing, chunking, embedding, retrieval (often hybrid, often re-ranked), prompt assembly, generation, and citation. You can build RAG without a vector database - using only keyword search (BM25) over a normal text index works for some use cases and is sometimes faster and cheaper than vector search. Conversely, you can use a vector database for things that are not RAG, like recommendation systems or duplicate detection. The pipeline matters more than any single tool in it.

How accurate is RAG compared to just asking ChatGPT?

For questions whose answers are in your private corpus, RAG is dramatically more accurate because the model has the relevant text in front of it. For general knowledge questions, RAG can actually hurt accuracy if your corpus is irrelevant or noisy. The honest answer is that accuracy depends almost entirely on retrieval quality. A well-tuned RAG system on a clean corpus typically achieves 80-95% answer accuracy on in-scope questions; a poorly-tuned one on a messy corpus can drop below 50%. The model is rarely the bottleneck; retrieval and source quality are.

Do I need a vector database, or will Postgres do?

For most projects, Postgres with the pgvector extension is sufficient and often the better choice. It keeps your embeddings next to your application data, simplifies operations, supports metadata filtering natively, and scales comfortably to tens of millions of vectors. Dedicated vector databases like Pinecone, Weaviate, and Qdrant earn their keep at hundreds of millions of vectors, when you need very low latency at high QPS, or when you want specific features like serverless scaling. Start with pgvector, move only if you hit a concrete limit. Premature optimisation in this layer is a common cost trap.

How do I stop RAG from hallucinating?

You reduce hallucination on several fronts. First, improve retrieval so the right context actually reaches the model: hybrid search, re-ranking, better chunking. Second, write a strict system prompt instructing the model to answer only from the provided context and to say "I don't know" otherwise. Third, require citations - make the model quote source text and link to the document. Fourth, evaluate continuously with a question-answer test set so you catch regressions. Fifth, choose a strong model; newer frontier models hallucinate noticeably less. You will not eliminate hallucination entirely, but you can reduce it to a rate comparable to a human researcher.
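If you require the model to quote source text, you can also check those quotes mechanically. The sketch below is one possible post-check, not a standard technique from any library: it assumes the model wraps quotations in straight double quotes and flags any quote that does not appear, after whitespace and case normalisation, in the retrieved chunks.

```python
import re


def extract_quotes(answer):
    """Return every passage wrapped in straight double quotes."""
    return re.findall(r'"([^"]+)"', answer)


def unsupported_quotes(answer, chunks):
    """List quotes in the answer that appear in none of the chunks."""
    def normalise(text):
        # Collapse whitespace and ignore case before comparing.
        return " ".join(text.split()).lower()

    sources = [normalise(chunk) for chunk in chunks]
    return [
        quote for quote in extract_quotes(answer)
        if not any(normalise(quote) in source for source in sources)
    ]


# Hypothetical answer and retrieved chunk.
answer = 'The policy says "refunds are issued within 14 days" and "shipping is free".'
chunks = ["Refunds are issued within 14 days of the return being received."]
flagged = unsupported_quotes(answer, chunks)
```

Here the second quote is flagged because nothing in the retrieved text supports it. A flagged quote is a signal to retry, suppress the answer, or show a warning.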

How long does it take to build a production RAG system?

A proof of concept on a small corpus takes a few days. A production system - with proper ingestion connectors, access control, evaluation, monitoring, hybrid retrieval, re-ranking, and a usable interface - typically takes 8 to 14 weeks for a mid-sized organisation. The build splits roughly into: two weeks on data ingestion and parsing, two to three weeks on retrieval and ranking, two weeks on the application and access control, and the remainder on evaluation, iteration, and deployment. Teams that skip the evaluation harness ship faster and pay for it later in user trust.

What does it cost to run RAG at scale?

Running costs break into three buckets. Embedding model fees during indexing (one-off per document, plus deltas on updates) are typically £0.02 to £0.13 per million tokens. LLM fees at query time depend on the model and context size; a typical query with 4,000 tokens of context and a 500-token answer costs £0.005 to £0.05 on current frontier models. Vector database hosting starts around £50 per month for hosted services or is effectively free on pgvector. A system serving 10,000 queries a month typically runs £200 to £1,500 in total model and infrastructure costs, before engineering time to maintain it.
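The arithmetic behind those figures is easy to reproduce for your own volumes. The sketch below uses the per-query and per-million-token ranges quoted above; the corpus size (5,000 documents averaging 2,000 tokens each) is an assumption for illustration only.

```python
def monthly_llm_fee_range(queries_per_month, low_per_query, high_per_query):
    """Return the (low, high) monthly LLM fee in pounds."""
    return queries_per_month * low_per_query, queries_per_month * high_per_query


def embedding_fee(corpus_tokens, price_per_million_tokens):
    """One-off fee in pounds to embed a corpus of the given size."""
    return corpus_tokens / 1_000_000 * price_per_million_tokens


# 10,000 queries a month at £0.005 to £0.05 per query.
llm_low, llm_high = monthly_llm_fee_range(10_000, 0.005, 0.05)

# Assumption: 5,000 documents averaging 2,000 tokens each.
corpus_tokens = 5_000 * 2_000
index_low = embedding_fee(corpus_tokens, 0.02)
index_high = embedding_fee(corpus_tokens, 0.13)
```

With these inputs, query-time LLM fees come to roughly £50 to £500 a month, and the one-off embedding pass over the assumed 10-million-token corpus costs roughly £0.20 to £1.30. Hosting, re-ranking, and re-indexing on updates come on top.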

Can RAG work with non-text data like images, audio, or tables?

Yes, with extra work. Images can be embedded with a multimodal embedding model such as CLIP, or described by a vision-capable model (GPT-4o, or Claude with image input) so that the description is embedded as text; either way, you can retrieve diagrams and screenshots by text query. Audio is usually transcribed first and then treated as text. Tables are the hardest case: naive text extraction loses structure, so production systems either convert tables to a structured form (JSON, markdown) before embedding, or use specialised table-aware parsers like Unstructured.io. Mixed-modal RAG is improving rapidly but adds complexity; most projects start text-only and add modalities once the text pipeline is solid.

Should I build RAG in-house or use an off-the-shelf tool?

It depends on how central the system is to your business. Off-the-shelf options like Glean, Sana, or Microsoft Copilot for enterprise search are fast to deploy and fine for general internal search. Custom RAG makes sense when you need specific integrations, regulated data handling, domain-tuned retrieval, your own model choice, or when retrieval quality is a competitive advantage (customer-facing assistants, legal or medical use cases). A useful middle path is a custom build on top of open frameworks like LangChain or LlamaIndex, which gives you control without writing every component from scratch.

Closing thoughts

RAG is not a magic wand. It is a sensible architecture for grounding language models in text you control, with well-understood failure modes and a maturing set of tools to manage them. If you have a body of knowledge worth more than the model's guesswork, RAG is almost certainly part of the answer.

The hard parts are not the model or the vector database. They are the data hygiene, the retrieval quality, the evaluation discipline, and the access control. Teams that treat those as first-class engineering concerns ship RAG systems users trust; teams that treat them as afterthoughts ship demos that quietly get abandoned. If you want help designing or building one, AI Advisory does this work end-to-end for UK mid-market organisations.
