AI · 7 June 2026 · 5 min read

Retrieval-Augmented Generation (RAG): How It Works and Why It Matters

How retrieval-augmented generation (RAG) works, why IBM and others back it, and how to build a production RAG system that actually answers correctly

By AI Advisory team

Retrieval-augmented generation (RAG) is the architecture that turned large language models from impressive demo toys into systems you can put in front of customers and regulators. The idea is simple: instead of relying on whatever a model memorised during training, you fetch relevant documents at query time and pass them to the model as context. The model then generates an answer grounded in those documents.

IBM has been one of the more visible enterprise voices on RAG, publishing primers and packaging the pattern inside its watsonx platform. But RAG is not an IBM invention or an IBM product - it is a general technique introduced in a 2020 paper by Lewis et al. at Facebook AI Research, and now implemented by every serious AI vendor and most in-house teams. This guide explains what RAG actually is, how the pieces fit together, where it goes wrong in production, and how to decide whether you need it.

What retrieval-augmented generation actually means

A standard large language model answers from its parametric memory - the weights learned during training. That memory has three problems. It is frozen at the training cut-off, so the model does not know about anything after that date. It does not include your private documents, so it cannot answer questions about your contracts, your product, or your customers. And it hallucinates - the model will confidently invent facts when it has no real information to draw on.

RAG addresses all three by adding a retrieval step before generation. When a user asks a question, the system:

1. Converts the question into a search query (often a vector embedding, sometimes a keyword query, usually both).
2. Searches a knowledge store - a vector database, a document index, a SQL database, or some combination - for the most relevant passages.
3. Passes those passages to the language model along with the original question and an instruction to answer using only the provided context.
4. Returns the model's answer, ideally with citations back to the source documents.
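
The four steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: word overlap stands in for real embedding or keyword search, `generate` is a hypothetical placeholder for whichever language model you call, and every name and document here is invented for the example.

```python
import re

# Toy knowledge store; in a real system these would be chunks in a search index.
KNOWLEDGE_STORE = [
    {"id": "policy-1", "text": "Refunds are available within 30 days of purchase."},
    {"id": "policy-2", "text": "Support is open Monday to Friday, 9am to 5pm."},
    {"id": "policy-3", "text": "Shipping to the UK takes three working days."},
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 2) -> list[dict]:
    """Steps 1-2: turn the question into a query and fetch the k best passages.

    Word overlap is a stand-in for vector or BM25 search.
    """
    query = tokenize(question)
    scored = [(len(query & tokenize(doc["text"])), doc) for doc in KNOWLEDGE_STORE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question: str, passages: list[dict]) -> str:
    """Step 3: combine passages, question and a context-only instruction."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context below and cite source ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def generate(prompt: str) -> str:
    """Hypothetical placeholder for the language model call."""
    return f"(model answer based on a prompt of {len(prompt)} characters)"


def answer(question: str) -> dict:
    """Step 4: return the answer together with the sources it was grounded in."""
    passages = retrieve(question)
    return {
        "answer": generate(build_prompt(question, passages)),
        "sources": [p["id"] for p in passages],
    }
```

Calling `answer("How many days do I have for refunds?")` returns the placeholder answer plus the ids of the passages that were passed to the model, which is the hook you need for citations.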

The model is still doing the generation. The retrieval step gives it the right facts to generate from. This is why IBM, AWS, Google, Microsoft and every other enterprise vendor push RAG as the default pattern for grounded question answering - it is the cheapest way to get an LLM to answer reliably about your specific data.

The original research and why it matters

The term retrieval-augmented generation was coined by Patrick Lewis and colleagues in the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, published at NeurIPS 2020. The original architecture combined a dense passage retriever with a seq2seq generator (BART), trained end-to-end. The headline result was that RAG models outperformed parametric-only baselines on open-domain question answering while being more factual and easier to update - swap the document index and the model knows new things, no retraining required.

That last property is what made RAG win in enterprise. Retraining or fine-tuning a frontier model on your company data is expensive, slow, and risky. Updating a vector index when your documentation changes is a normal ETL job. The economics are not close.

IBM's contribution has mostly been packaging and evangelism. The IBM Research blog and the IBM Technology YouTube channel have produced widely-cited explainers, and watsonx.ai bundles RAG primitives - embeddings, vector storage via Milvus, prompt templates, evaluation - inside a governance-heavy enterprise stack. If you have seen a clean whiteboard diagram of RAG circulating on LinkedIn, there is a decent chance it came from IBM's Marina Danilevsky.

The components of a production RAG system

A working RAG system has more parts than the four-step summary suggests. Here is what actually sits in the architecture diagram when you build one for a real client.

Document ingestion and chunking

You start with source documents - PDFs, Confluence pages, Notion, SharePoint, support tickets, contracts, product specs. These need to be parsed, cleaned, and split into chunks small enough to fit in the model's context window but large enough to contain a complete idea. Typical chunk sizes are 200-800 tokens with some overlap. Chunking strategy matters more than people expect; bad chunking is the single most common reason a RAG system gives bad answers.
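
As a minimal sketch of fixed-size chunking with overlap, the function below splits on whitespace and counts words. That is an assumption made to keep the example dependency-free: real token counts depend on the tokenizer of the model you use, and the function name and defaults are illustrative.

```python
def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Each chunk repeats the last `overlap` words of the previous one, so an idea
    that straddles a boundary appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, the ten words "0 1 2 ... 9" become three chunks, each starting with the last word of the previous one.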

Embeddings and vector storage

Each chunk is converted into a vector embedding - a numerical representation of meaning - using a model like OpenAI's text-embedding-3-large, Cohere's embed-v3, or an open-source option like BAAI's bge-large. Those vectors are stored in a vector database: Pinecone, Weaviate, Qdrant, Milvus, or pgvector inside Postgres. For most mid-market builds, pgvector is the right default because it keeps you on infrastructure you already operate.
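
To make the embed-and-store step concrete, here is a hedged in-memory sketch. `toy_embed` is a hypothetical stand-in that produces sparse word-count vectors; a real build would call an embedding model and write the vectors to one of the stores named above. The class and method names are assumptions for illustration, not any product's API.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Minimal store: keeps (chunk_id, text, vector) rows and searches by brute force."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, str, Counter]] = []

    def add(self, chunk_id: str, text: str) -> None:
        """Embed a chunk once at ingestion time and keep the vector."""
        self.rows.append((chunk_id, text, toy_embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[str, float]]:
        """Embed the query the same way and return the top-k (id, score) pairs."""
        q = toy_embed(query)
        scored = [(chunk_id, cosine(q, vec)) for chunk_id, _, vec in self.rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

The design point carries over to real systems: chunks and queries must go through the same embedding function, otherwise the similarity scores are meaningless.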

Retrieval

At query time, the user's question is embedded with the same model and the vector store returns the top-k most similar chunks. Pure vector search misses things keyword search would catch - acronyms, product codes, names - so production systems use hybrid retrieval: vector search plus BM25 keyword search, combined with a reranker (Cohere Rerank or a cross-encoder) that scores the candidate set more accurately than the initial retrieval can.
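
One common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The text above does not say which fusion method is used, so treat RRF as one illustrative option rather than the method. The chunk ids are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists: each id scores the sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_hits = ["c1", "c2", "c3"]   # hypothetical ids from vector search
keyword_hits = ["c3", "c1", "c4"]  # hypothetical ids from BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever. The fused list is the candidate set you would then hand to the reranker.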

Prompt assembly and generation

The retrieved chunks are inserted into a prompt template along with the user's question and instructions about how to answer - typically including a refusal clause ("if the answer is not in the provided context, say you do not know"). The assembled prompt goes to the generation model: GPT-4o, Claude Sonnet, Llama 3.3, Mistral Large, or whatever your stack runs. The model produces an answer and ideally citations.
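
A prompt template with a refusal clause might look like the sketch below. The wording, the `[id]` citation convention and the function name are assumptions for illustration; the right phrasing depends on your model and needs to be tested against your evaluation set.

```python
REFUSAL = "I do not know based on the provided documents."

PROMPT_TEMPLATE = """You are a question-answering assistant.
Answer using only the context below. Cite the source id in square brackets after each claim.
If the answer is not in the provided context, reply exactly: "{refusal}"

Context:
{context}

Question: {question}
Answer:"""


def assemble_prompt(question: str, chunks: list[dict]) -> str:
    """Insert retrieved chunks and the user's question into the template."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return PROMPT_TEMPLATE.format(
        refusal=REFUSAL,
        context=context or "(no documents retrieved)",
        question=question,
    )
```

Keeping the refusal string as a constant makes it easy to detect refusals downstream, for example to count how often the system declines to answer.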

Evaluation and guardrails

The part most teams skip and then regret. A production RAG system needs an evaluation harness - a set of representative questions with known-good answers, scored automatically on faithfulness (did the answer stay grounded in the retrieved context), answer relevance, and context precision. Frameworks like Ragas and TruLens exist for this. Without evaluation, you are shipping blind.
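
The shape of an evaluation harness can be sketched without any framework. The metrics below are deliberately simplified proxies (an unranked context precision and a hit rate over retrieval only) and are not the Ragas or TruLens implementations. The case format and function names are assumptions, and scoring faithfulness would additionally need a judge over the generated answers.

```python
from typing import Callable


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunk ids that are in the known-relevant set."""
    if not retrieved:
        return 0.0
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def evaluate(cases: list[dict], retriever: Callable[[str], list[str]]) -> dict:
    """Run every test case through the retriever and average the scores."""
    precisions, hits = [], 0
    for case in cases:
        retrieved = retriever(case["question"])
        relevant = set(case["relevant_ids"])
        precisions.append(context_precision(retrieved, relevant))
        hits += bool(relevant & set(retrieved))  # at least one relevant chunk found
    n = len(cases)
    return {
        "mean_context_precision": sum(precisions) / n if n else 0.0,
        "hit_rate": hits / n if n else 0.0,
    }
```

Run it on every change to chunking, embeddings or prompts, and track the numbers over time; the trend matters more than any single score.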

Where RAG goes wrong in production

Most failed RAG projects fail in predictable ways. The technique is sound, and the generation model is rarely the bottleneck. The problems sit upstream of it: in the documents, the chunking, the retrieval, and the prompt.

Document quality. If your source documents are inconsistent, outdated, or contradictory, RAG will faithfully retrieve and surface that mess. RAG is not a content strategy. Clean your knowledge base first.

Chunking that splits answers. If the answer to a common question spans three sections of a long document, naive chunking will split it across chunks and the retriever will only fetch one. Solutions include larger chunks with overlap, hierarchical chunking, or document-aware splitters that respect headings.
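
A document-aware splitter can be as simple as cutting on headings. The sketch below assumes markdown-style `#` headings, which is an assumption about the source format; PDFs, Confluence exports and similar sources need their own parsing before this step.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def split_by_headings(document: str) -> list[dict]:
    """Split a markdown-style document into one chunk per heading section."""
    sections, title, lines = [], "", []
    for line in document.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section, if it had any body text.
            if any(ln.strip() for ln in lines):
                sections.append({"heading": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(ln.strip() for ln in lines):
        sections.append({"heading": title, "text": "\n".join(lines).strip()})
    return sections
```

Storing the heading alongside the text lets you prepend it to the chunk at embedding time, which often helps the retriever. Sections longer than your chunk budget still need a second, size-based split.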

Embedding model mismatch. A general-purpose embedding model may not understand your domain vocabulary. Legal, medical and highly technical domains often need either a domain-tuned embedding model or aggressive query rewriting.

No reranking. Top-k vector search returns plausibly-related chunks, not necessarily the best ones. Adding a reranker over the top 20-50 candidates and keeping the best 5 routinely improves answer quality by 15-30% in our builds.
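
The rerank-and-keep pattern is easy to sketch once the scorer is abstracted away. Here `overlap_score` is a toy stand-in for a cross-encoder or a hosted rerank API, and the function names are hypothetical; only the control flow is the point.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every candidate against the query and keep the best `keep`."""
    scored = [(score(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: share of query words that appear in the passage."""
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / len(q) if q else 0.0
```

In a real build `candidates` would be the top 20-50 chunks from first-stage retrieval and `score` would call the reranking model, which reads the query and passage together rather than comparing precomputed vectors.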

Context window stuffing. Throwing 50 chunks into the prompt does not help - models get worse at finding the relevant information as context grows, a well-documented effect called the "lost in the middle" problem (Liu et al., 2023). Fewer, better chunks beat more, mediocre ones.
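
One way to enforce "fewer, better chunks" is a hard budget applied after ranking. The sketch below assumes the input is already sorted best-first and uses word counts as a rough proxy for tokens; the limits are illustrative defaults, not recommendations.

```python
def select_within_budget(
    ranked_chunks: list[str], max_chunks: int = 5, max_words: int = 1500
) -> list[str]:
    """Take chunks in rank order until either the count or the word budget is hit."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        size = len(chunk.split())
        if len(selected) == max_chunks or used + size > max_words:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += size
    return selected
```

Stopping at the first chunk that does not fit, rather than skipping it and continuing, keeps the selection strictly in rank order.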

No refusal pattern. If the model is not explicitly instructed to refuse when the context does not contain the answer, it will hallucinate to fill the gap. The prompt template is doing real work here and needs to be tested.

RAG versus fine-tuning versus long context

The three main ways to give an LLM access to your data are RAG, fine-tuning, and stuffing everything into a long context window. They are not interchangeable.

RAG is the right default when your knowledge changes regularly, when you need citations, when you have more data than fits in context, and when you need to control access by user (different users see different documents). It is also the cheapest to maintain.
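
Per-user access control usually means filtering on metadata before anything reaches the prompt. A minimal sketch, assuming each chunk carries a hypothetical `allowed_groups` field copied from the source system's permissions at ingestion time:

```python
def filter_by_access(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks that at least one of the user's groups may read."""
    return [c for c in chunks if user_groups & set(c["allowed_groups"])]


CHUNKS = [
    {"id": "hr-1", "text": "Salary bands for 2026.", "allowed_groups": ["hr"]},
    {"id": "kb-1", "text": "How to reset a password.", "allowed_groups": ["hr", "support"]},
]

visible_to_support = filter_by_access(CHUNKS, {"support"})
```

In a real vector store the same check is normally expressed as a metadata filter inside the search query, so restricted chunks are never retrieved in the first place.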

Fine-tuning is for changing the model's behaviour - tone, format, domain reasoning patterns - not for teaching it new facts. Teams who fine-tune to inject knowledge usually end up rebuilding as RAG within a year. Combine the two when it makes sense: fine-tune the style, retrieve the facts.

Long context (Gemini 1.5's 1-2 million tokens, Claude's 200k) tempts teams to skip retrieval and just paste the whole knowledge base in. This is expensive at scale, slow, and runs into the lost-in-the-middle problem. Long context is useful inside a RAG system - it lets you pass more retrieved chunks - but not as a replacement for retrieval.

For the vast majority of enterprise question-answering use cases, RAG is the correct architecture. The other two are complements, not alternatives.

Where RAG fits in the IBM watsonx and broader enterprise stack

IBM positions RAG as a core capability of watsonx.ai, with prebuilt connectors to Milvus for vector storage, Granite models for embeddings and generation, and watsonx.governance for evaluation and audit. That packaging matters for regulated industries - banks, insurers, healthcare providers - where the procurement story is as important as the technical one.

You do not need watsonx to build RAG. Most of our builds run on a stack of Postgres with pgvector, OpenAI or Anthropic for generation, Cohere for embeddings and reranking, and a thin Python or TypeScript application layer. For organisations already invested in IBM, watsonx removes some integration work and provides governance primitives that satisfy compliance teams. For everyone else, the open-source-plus-frontier-API path is faster and cheaper.

The bigger picture: RAG is now the assumed default for grounded AI applications. Microsoft's Azure AI Search, AWS Bedrock Knowledge Bases, Google's Vertex AI Search, Databricks' Mosaic AI Agent Framework, and every major LLM platform ship RAG primitives. The question is no longer whether to use RAG but how well you implement it.

What a good RAG build looks like

If you are scoping a RAG project, the things worth getting right at the start are: a tightly defined corpus (resist the urge to index everything on day one), an evaluation set of 50-100 real questions with known answers built before you write any code, a hybrid retrieval pattern with reranking, a refusal-first prompt template, and observability on what the system retrieved and what it answered for every query. Get those five right and you have a system you can iterate on. Skip any of them and you have a demo.
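
Observability can start as one structured log line per query. The record shape below is an assumption for illustration; the point is that what was retrieved and what was answered are stored together so a bad answer can be traced to its cause.

```python
import json
import time
import uuid


def build_trace(question: str, retrieved: list[dict], answer: str) -> str:
    """Serialise one query's retrieval results and answer as a single JSON line."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "question": question,
        # Log ids and scores only; the chunk text can be looked up from the id.
        "retrieved": [{"id": c["id"], "score": c["score"]} for c in retrieved],
        "answer": answer,
    }
    return json.dumps(record)
```

Appending these lines to a file or log pipeline is enough to answer the first debugging question on any complaint: did retrieval miss the right chunk, or did the model ignore it?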

Timelines for a first production RAG build typically run 6-12 weeks: two weeks of discovery and corpus prep, four weeks of build and evaluation iteration, two weeks of user testing and refinement, then go-live. Costs depend on volume, but a mid-market customer-facing assistant grounded in 5,000-50,000 documents usually lands in the £30k-£80k range for initial build, with running costs of £500-£3,000 per month in inference and infrastructure.

Frequently asked questions

Is RAG an IBM technology?

No. Retrieval-augmented generation was introduced in a 2020 research paper by Patrick Lewis and colleagues at Facebook AI Research (now Meta AI). It is a general technique implemented by every major AI vendor and most in-house teams. IBM has been an effective communicator about RAG through its IBM Technology channel and IBM Research blog, and it packages RAG inside watsonx.ai, but the technique itself is open and widely used outside IBM. Searching for "RAG IBM" will surface IBM's explainer content, which is genuinely good, but do not let that create the impression you need IBM software to build a RAG system.

How is RAG different from fine-tuning?

Fine-tuning changes the weights of the model itself, typically to alter its behaviour - tone, output format, domain-specific reasoning. RAG leaves the model unchanged and gives it access to external documents at query time. Fine-tuning is the right choice when you need the model to adopt a particular style or follow specific patterns; RAG is the right choice when you need it to answer questions about specific facts. Teams who try to fine-tune to inject factual knowledge typically rebuild as RAG within a year because retraining for every knowledge update is uneconomic. The two combine well: fine-tune for behaviour, retrieve for facts.

Do I need a vector database to do RAG?

Not necessarily. Vector search is one retrieval method among several, and many production RAG systems combine vector search with keyword (BM25) search for better recall. For small corpora (under 10,000 documents), in-memory vector search inside the application is fine. For larger systems, a dedicated vector store - Pinecone, Weaviate, Qdrant, Milvus - or pgvector inside an existing Postgres database works well. Our default for mid-market builds is pgvector because it keeps everything on infrastructure clients already operate. The vector database is a means to an end; what matters is retrieval quality, not which product stores the vectors.

How do I stop a RAG system from hallucinating?

You cannot eliminate hallucination entirely, but you can reduce it substantially. The main levers are: a refusal-first prompt template that instructs the model to say "I do not know" when the retrieved context does not contain the answer, hybrid retrieval with reranking to ensure the right chunks reach the model, citation requirements so every claim can be traced to a source, and an evaluation harness that scores faithfulness on a held-out test set. Together these typically get faithfulness above 90% on well-scoped corpora. The remaining failure modes usually trace back to corpus quality - if the answer is not in your documents, no amount of retrieval will produce it.
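
A citation requirement is only useful if it is checked. The sketch below assumes answers cite sources as `[chunk-id]` in square brackets, which is a convention chosen for this example. It flags any cited id that was not among the chunks actually retrieved.

```python
import re

CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Compare the ids cited in an answer with the ids that were retrieved."""
    cited = set(CITATION.findall(answer))
    return {
        "cited": cited,
        "unknown": cited - retrieved_ids,  # citations to chunks the model never saw
        "has_citation": bool(cited),
    }
```

This check confirms only that a citation points at a real retrieved chunk, not that the chunk supports the claim; the latter is what a faithfulness score is for.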

What does it cost to build a production RAG system?

For a mid-market customer-facing or internal assistant grounded in a typical knowledge base of 5,000-50,000 documents, initial build costs land in the £30k-£80k range when delivered by an agency, depending on complexity, integrations, and the depth of evaluation required. Ongoing inference and infrastructure costs usually run £500-£3,000 per month for moderate traffic, dominated by LLM API spend. Self-hosting open-source models on your own GPUs can reduce per-query cost but adds operational overhead that is rarely justified below significant query volumes - typically 100,000+ queries per month.
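
The LLM API share of the running cost is simple arithmetic once you know your traffic and prompt sizes. The sketch below shows the formula only. The function name is hypothetical, and the prices in the usage note are placeholders, not any vendor's actual rates.

```python
def monthly_llm_cost(
    queries_per_month: int,
    input_tokens: int,
    output_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Estimate monthly LLM API spend from per-query token counts and token prices."""
    per_query = (
        input_tokens * price_in_per_million + output_tokens * price_out_per_million
    ) / 1_000_000
    return queries_per_month * per_query
```

With placeholder inputs of 20,000 queries a month, 4,000 input tokens and 300 output tokens per query, and prices of 2 and 8 per million tokens, the estimate is about 208 per month. Input tokens dominate because retrieved chunks make up most of each prompt, which is another reason to pass fewer, better chunks.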

Can RAG work with my private data without sending it to OpenAI or Anthropic?

Yes, although only the third option below keeps it away from third-party providers altogether. You have three options. First, use a frontier model API with a data processing agreement that prohibits training on your data - both OpenAI and Anthropic offer this for enterprise customers. Your data still reaches the provider under this option, so check the retention terms as well as the training terms. Second, use a UK or EU-hosted deployment of a frontier model via Azure OpenAI or AWS Bedrock, which keeps data in-region and gives you contractual control. Third, run an open-weight model (Llama 3.3, Mistral Large, Qwen 2.5) on your own infrastructure or a sovereign cloud, which keeps data fully in your environment at the cost of higher operational complexity. For most UK mid-market clients, option two satisfies GDPR and information security requirements without the overhead of self-hosting.

How long does it take to build and deploy a RAG system?

A first production RAG build typically takes 6-12 weeks from kickoff to go-live. The breakdown is roughly two weeks of discovery and corpus preparation (often the most underestimated phase, and the one most likely to overrun), four weeks of build and evaluation iteration, two weeks of user testing with real questions and prompt refinement, and a final phase of deployment, monitoring setup, and handover. Projects that try to compress this timeline almost always cut evaluation, which is exactly the wrong corner to cut. Faster builds are possible for proof-of-concept work on a narrow corpus - two to three weeks is realistic - but production-grade systems with proper evaluation, observability, and refusal patterns need the full window.

What kinds of problems is RAG not suitable for?

RAG is the wrong tool when the task is not retrieval-based. Generating creative content, summarising a single known document, structured data extraction from forms, real-time decision-making against streaming data, and tasks requiring multi-step reasoning across many sources are all better served by other patterns - direct prompting, structured output models, agent frameworks, or traditional ML pipelines. RAG also struggles when answers require synthesising information from many documents simultaneously rather than from a few relevant passages. For those cases, agentic patterns that combine RAG with planning and tool use, or knowledge-graph-based approaches, tend to perform better.

Getting RAG right in production

RAG is not difficult to demo and it is not difficult to ship a version that mostly works. The hard part is the evaluation discipline, the corpus hygiene, and the retrieval quality work that turns a 70%-correct system into a 95%-correct one. That is where most projects either succeed or quietly underperform.

If you are scoping a RAG build and want a second opinion on architecture, evaluation strategy, or vendor selection, AI Advisory works with UK mid-market teams to design and ship production RAG systems end-to-end. Get in touch to talk through your use case.

Ready to put this into production? Book a discovery call.