AI | 7 June 2026 | 5 min read

RAG-Based LLMs: How Retrieval-Augmented Generation Actually Works

Retrieval-augmented generation explained: how RAG-based LLMs work, when to use them, what they cost, and where they fail in production

By AI Advisory team

A RAG-based LLM is a large language model that, before answering, fetches relevant text from a knowledge source you control and reads it as part of the prompt. The model still generates the answer in natural language, but the answer is grounded in passages retrieved at query time rather than relying solely on whatever the model absorbed during training. RAG stands for retrieval-augmented generation, a pattern introduced in a 2020 paper by Lewis et al. at Facebook AI Research (now Meta AI) and since adopted as the dominant architecture for production question-answering and assistant systems.

The shorthand: a plain LLM answers from memory. A RAG-based LLM answers from memory plus a library it can look things up in. That single change shifts what the system can do, what it costs to run, and how you measure whether it is working.

What problem RAG actually solves

LLMs have three well-known failure modes when used on their own: they hallucinate facts that sound plausible, they have a training cut-off so they do not know about recent events or your internal data, and they cannot cite sources because they do not have any. Fine-tuning helps with style and format but is a poor fix for knowledge - you cannot fine-tune a model every time a policy document changes.

RAG addresses all three. The retrieval step pulls in current, source-attributable content from a store you control: a product knowledge base, a contract library, a Confluence space, regulatory guidance, a CRM. The generation step uses that content as the basis for the answer, which means you can show the user exactly which paragraphs informed the response. When the retrieved context does not contain the answer, a well-designed RAG system refuses rather than guesses.

This is why RAG dominates internal assistants, customer support bots, legal and compliance tools, and any application where being wrong has a cost. According to Databricks research, retrieval-augmented setups consistently outperform long-context-only approaches on enterprise question-answering benchmarks, particularly as corpus size grows beyond what fits in a single context window.

How a RAG pipeline is built

A RAG system has two phases. The first runs offline, ahead of any user query. The second runs every time the user asks something.

Phase one: indexing

You take your source documents - PDFs, web pages, database rows, transcripts - and prepare them for retrieval. The typical steps are:

Parsing. Extract clean text from whatever format the source lives in. PDFs are the usual pain point; tools like Unstructured, LlamaParse, or Azure Document Intelligence handle the messy cases. For HTML and Markdown, simpler parsers work.

Chunking. Split the text into passages of roughly 200-800 tokens. Chunk too large and retrieval gets imprecise; chunk too small and you lose context. Sensible defaults are 500-token chunks with 50-token overlap, but real systems tune this against the document type. Legal contracts chunk differently from chat transcripts.

Embedding. Each chunk goes through an embedding model - OpenAI's text-embedding-3-large, Cohere Embed, or an open-source option like BGE or E5 - which converts it into a dense vector of typically 768 to 3,072 dimensions. Semantically similar passages end up close together in vector space.

Storing. Vectors and the original text go into a vector database. Postgres with the pgvector extension handles most mid-market workloads up to tens of millions of chunks. Dedicated stores like Pinecone, Weaviate, Qdrant, or Milvus are options at higher scale or when you need specific features like hybrid filtering. Many production systems also add a keyword index (BM25 via OpenSearch or Elasticsearch) alongside the vector index for hybrid retrieval.
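
The chunking step above can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes the text has already been tokenised into a list, uses the 500/50 defaults mentioned above, and ignores document structure such as headings and tables. The function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a list of tokens into fixed-size windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document,
        # otherwise the tail would be emitted again as a tiny chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: placeholder tokens stand in for real tokenizer output.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
chunk_lengths = [len(c) for c in chunks]
```

Each chunk repeats the last 50 tokens of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk. A real pipeline would count tokens with the embedding model's own tokenizer and split on document boundaries first.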

Phase two: query time

When a user asks a question, the system runs through roughly these steps:

Query processing. The raw question may be rewritten, expanded, or decomposed. A user asking "what changed since last quarter?" is often rewritten into a more retrievable form, or split into multiple sub-queries.

Retrieval. The processed query is embedded and used to find the top-k most similar chunks in the vector store, typically k=5 to k=20. Hybrid systems run a parallel keyword search and merge results using reciprocal rank fusion or a learned reranker.

Reranking. A second-stage model - often a cross-encoder like Cohere Rerank or BGE Reranker - scores each retrieved chunk against the query for true relevance. This step typically improves answer quality more than any other single change once basic retrieval is working.

Generation. The top reranked chunks are inserted into a prompt template alongside the user question and sent to an LLM (GPT-4o, Claude Sonnet, Gemini, Llama 3, or whichever model the application uses). The prompt instructs the model to answer using only the supplied context and to refuse if the answer is not present.

Post-processing. Citations are extracted, the response is checked against safety rules, and the answer is returned to the user with source links.
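
Two of the query-time steps above are simple enough to sketch directly: merging vector and keyword results with reciprocal rank fusion, and assembling the grounded prompt. This is an illustrative sketch under stated assumptions. The function names, the chunk ids, and the prompt wording are hypothetical, and k=60 is the conventional RRF smoothing constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first lists of chunk ids into one ranking.

    Each appearance of a chunk contributes 1 / (k + rank) to its score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def build_prompt(question, chunks):
    """Number the retrieved passages and wrap them in a grounded-answer prompt."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "Cite passages by number. If the context does not contain the "
        "answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Assumption: made-up result lists from a vector search and a keyword search.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])

prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are available within 30 days of purchase."],
)
```

Rank fusion needs only positions, not raw scores, which is why it suits merging a cosine-similarity list with a BM25 list whose scores are on different scales.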

Why retrieval is harder than it looks

The naive description above - embed everything, search, generate - works for a demo. It rarely survives contact with a real corpus. The common failure points:

Bad chunking destroys answers. If the answer to a question is split across two chunks, neither chunk on its own may score high enough to be retrieved. Tables, lists, and section headers get mangled by generic splitters. Document-aware chunking (preserving headings, keeping tables intact, respecting semantic boundaries) is one of the highest-impact engineering decisions.

Embeddings miss exact-match queries. Vector similarity is great for "what's our refund policy" but weak for "find the SKU starting with X-4471". This is why production systems use hybrid retrieval: dense vectors for semantic recall, sparse keyword indexes for exact terms, identifiers, and proper nouns.

Top-k is not enough. The right answer often sits at rank 15, not rank 3. Without a reranker, you either retrieve too few chunks (missing the answer) or too many (drowning the LLM in irrelevant context, which degrades quality and inflates cost).

Stale data. A RAG system is only as fresh as its index. Production pipelines need incremental updates triggered by source changes - webhooks, change-data-capture, scheduled crawls - not weekly full rebuilds.

Permissions. If your corpus contains documents some users should not see, retrieval has to filter by access control before generation. Bolting this on later is painful. According to ICO guidance on AI and data protection, organisations using personal data in AI systems must apply the same access controls and lawful-basis assessments they would to any other processing - retrieval is no exception.
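
The permissions point can be made concrete with a small sketch. It assumes each chunk was stored with the set of groups allowed to read it; the Chunk class, its field names, and the group names are hypothetical, not part of any particular vector store's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document


def filter_by_access(chunks, user_groups):
    """Keep only chunks the user may read, before anything reaches the LLM."""
    user_groups = frozenset(user_groups)
    # A chunk is visible if the user shares at least one group with it.
    return [c for c in chunks if c.allowed_groups & user_groups]


# Assumption: illustrative retrieval results and group names.
retrieved = [
    Chunk("hr-1", "Salary bands for 2026.", frozenset({"hr"})),
    Chunk("pub-1", "Office opening hours.", frozenset({"all-staff"})),
    Chunk("fin-1", "Quarterly budget summary.", frozenset({"finance"})),
]
visible = filter_by_access(retrieved, {"all-staff", "finance"})
visible_ids = [c.chunk_id for c in visible]
```

Filtering after retrieval, as here, can leave fewer than k chunks. Where the store supports it, applying the same condition as a metadata filter inside the search query avoids that, since the top-k is then computed over permitted chunks only.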

RAG versus fine-tuning versus long context

These three approaches solve overlapping but distinct problems, and teams often pick the wrong one.

RAG is right when the knowledge changes, when you need citations, when the corpus is too large to fit in a prompt, or when different users need different views of the same content. It is the default for assistants, search, and Q&A.

Fine-tuning is right when you need to teach the model a behaviour, format, or style - structured JSON outputs, a specific tone, a domain-specific classification task. It is poor at teaching new facts and expensive to keep current. Fine-tune for how the model should respond; RAG for what it should know.

Long-context prompting - stuffing up to a million tokens into the context window of a model like Gemini 1.5 or Claude - works for one-off analysis of a single document. It does not work as an architecture for a knowledge base: cost per query scales with corpus size, latency is poor, and recall degrades as context grows. Long context is a tool inside a RAG system (for processing retrieved chunks), not a replacement for one.

The right answer in practice is usually RAG plus light fine-tuning: RAG for the knowledge, a small fine-tune for the output format and refusal behaviour.

What it costs to run

A production RAG system has three cost lines: embedding (one-off plus incremental for new content), storage and retrieval infrastructure, and per-query LLM inference.

For a mid-market deployment - say a 50,000-document corpus serving 10,000 queries a month - realistic monthly costs land in the £400-£2,000 range depending on model choice. GPT-4o or Claude Sonnet at the generation step is the dominant cost; switching to a smaller model like GPT-4o-mini, Claude Haiku, or a self-hosted Llama 3 70B for the bulk of queries can cut that by 60-80% with careful evaluation.
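
The per-query inference line can be estimated with simple arithmetic. The sketch below is a rough model only. Every number in it is a placeholder assumption, including the prices, which are not any provider's actual rates. It covers a single generation call per query and leaves out embedding, reranking, storage, and any extra model calls.

```python
def monthly_generation_cost(
    queries_per_month,
    chunks_per_query,
    tokens_per_chunk,
    prompt_overhead_tokens,
    output_tokens,
    input_price_per_million,
    output_price_per_million,
):
    """Estimate monthly spend on the generation call alone."""
    # Input = retrieved context plus instructions and the user question.
    input_tokens = chunks_per_query * tokens_per_chunk + prompt_overhead_tokens
    per_query = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return per_query * queries_per_month


# Assumption: placeholder workload and prices, in whatever currency you bill in.
estimate = monthly_generation_cost(
    queries_per_month=10_000,
    chunks_per_query=8,
    tokens_per_chunk=500,
    prompt_overhead_tokens=300,
    output_tokens=400,
    input_price_per_million=2.50,
    output_price_per_million=10.00,
)
```

Because context tokens dominate the input, the number of chunks sent to the model is usually the most direct lever on this line, which is one reason reranking to a smaller, better set pays for itself.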

Self-hosting on infrastructure you already run (Postgres with pgvector on an existing RDS instance, for example) is dramatically cheaper than managed vector databases at small to mid scale. Pinecone or Weaviate Cloud start to make sense when you need sub-50ms retrieval at high QPS or specific features like multi-tenancy primitives.

Build cost for a first production system is typically £25,000-£80,000 for a focused use case, rising to £150,000+ for multi-corpus assistants with permissions, evaluation harnesses, and operational tooling. The cost driver is rarely the LLM call - it is the data plumbing, the evaluation work, and the iteration on retrieval quality.

How to know if it is working

The single biggest mistake teams make with RAG is shipping without an evaluation harness. "It looks good in demos" is not a measurement. You need a labelled set of questions, expected answers, and expected source documents, run automatically on every change to the pipeline.

Useful metrics, broadly grouped:

Retrieval quality. Recall@k (is the correct source in the top k results?), MRR (mean reciprocal rank), and context precision (what fraction of retrieved chunks are actually relevant).

Answer quality. Faithfulness (does the answer match the retrieved context?), answer relevance (does it address the question?), and citation accuracy (do the cited sources actually support the claim?). Frameworks like Ragas and TruLens automate much of this with LLM-as-judge scoring.

Operational metrics. P50 and P95 latency, cost per query, refusal rate, and human-feedback signals (thumbs up/down, escalations to a person).
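
The retrieval metrics above are straightforward to compute once you have a labelled set. A minimal sketch, assuming each test question comes with the list of chunk ids the retriever returned (best first) and the ids a human marked as relevant. The function names and sample ids are hypothetical, and context precision follows the simple definition given above rather than any framework's rank-weighted variant.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def context_precision(retrieved, relevant, k):
    """Fraction of the top k results that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant)
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant hit; 0 when there is none."""
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in results:
        relevant = set(relevant)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)


# Assumption: a made-up three-question evaluation set.
eval_set = [
    (["d4", "d2", "d9"], ["d2"]),  # correct source at rank 2
    (["d1", "d5", "d3"], ["d1"]),  # correct source at rank 1
    (["d7", "d8", "d6"], ["d0"]),  # correct source not retrieved
]
mrr = mean_reciprocal_rank(eval_set)
mean_recall_at_3 = sum(recall_at_k(r, rel, 3) for r, rel in eval_set) / len(eval_set)
```

With one expected source per question, recall@k reduces to the yes/no check described above: either the correct source is in the top k or it is not.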

A reasonable target for a production system in a low-stakes domain is 90%+ faithfulness and 85%+ recall@10. Higher-stakes domains - legal, medical, financial advice - need stricter thresholds, more human review, and conservative refusal behaviour. The UK Government's Generative AI Framework sets out useful principles on accountability and human oversight that apply equally to private-sector deployments.

When RAG is the wrong answer

RAG is overused. It is not the right tool when:

The task is generative, not retrieval-driven - writing copy, summarising a single supplied document, code generation from a spec.
The knowledge fits comfortably in a system prompt (a few thousand tokens) and rarely changes. Just put it in the prompt.
You need precise structured data - querying a database for revenue by region is a SQL job, not a RAG job. Text-to-SQL with a verified schema beats embedding your data warehouse.
The corpus is tiny and stable. Five FAQ entries do not need a vector store.

The best architectures often combine approaches: a router that sends factual lookups to SQL, knowledge questions to RAG, and conversational tasks straight to the model.
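
A router of this kind can start very small. The sketch below uses keyword rules purely for illustration; the patterns, the route names, and the function are all hypothetical, and a production router would more likely use a trained classifier or a cheap LLM call.

```python
import re

# Assumption: hand-picked hint words, chosen only to illustrate the idea.
SQL_HINTS = re.compile(
    r"\b(revenue|total|count|average|sum|by region|per month)\b", re.IGNORECASE
)
KNOWLEDGE_HINTS = re.compile(
    r"\b(policy|contract|guidance|procedure|how do i|what is our)\b", re.IGNORECASE
)


def route(query):
    """Return which backend should handle the query: 'sql', 'rag', or 'llm'."""
    if SQL_HINTS.search(query):
        return "sql"   # structured, numeric lookups go to the database
    if KNOWLEDGE_HINTS.search(query):
        return "rag"   # questions about documents go through retrieval
    return "llm"       # everything else goes straight to the model


routes = [
    route("Revenue by region for Q2"),
    route("What is our refund policy?"),
    route("Write a friendly welcome email"),
]
```

Whatever the routing method, it belongs in the evaluation set too: a misrouted question fails regardless of how good the downstream retrieval is.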

A short FAQ

What does RAG stand for and who invented it?

RAG stands for retrieval-augmented generation. The term was introduced in a 2020 paper by Patrick Lewis and colleagues at Meta AI Research (then Facebook AI), titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The original idea combined a dense passage retriever with a sequence-to-sequence generator. The pattern has since become the standard architecture for grounded LLM applications, and the term is now used loosely to cover any system that retrieves context at query time and feeds it to a generative model.

How is RAG different from a normal LLM chatbot?

A normal LLM chatbot answers from whatever the model learned during training. It has no access to your data, no way to cite sources, and a fixed knowledge cut-off date. A RAG-based chatbot adds a retrieval step: before answering, it searches a knowledge source you control and reads the most relevant passages. This means the bot can answer questions about your products, policies, or documents; it can cite the specific source for every claim; and it stays current as your content updates, without retraining the underlying model.

Do I need a vector database to build RAG?

Not always. Postgres with the pgvector extension handles vector search well up to tens of millions of chunks and is the pragmatic default if you already run Postgres. SQLite with sqlite-vec works for smaller deployments and local prototypes. Dedicated vector databases like Pinecone, Weaviate, Qdrant, or Milvus become worth the operational overhead when you need very high query throughput, multi-tenancy primitives, or specific hybrid-search features. For most mid-market projects, starting with pgvector and migrating only if scale forces it is the lower-risk path.

How long does it take to build a production RAG system?

A focused first version - one corpus, one user group, a clear use case like internal policy Q&A - typically takes 6-12 weeks from kickoff to production. The first two weeks cover data discovery and pipeline scoping. Weeks three to eight cover the build: ingestion, retrieval, generation, evaluation harness, and a usable interface. The final weeks cover tuning against the evaluation set and security review. Multi-corpus systems with role-based access, complex permissions, or regulated-industry requirements take longer - 4-6 months is realistic for that scope.

Will RAG keep my data private?

The data stays as private as the infrastructure you put it on. Self-hosted retrieval (your vector store, your servers) plus an LLM provider with a no-training data policy and EU or UK data residency - OpenAI's enterprise tier, Anthropic via AWS Bedrock, Azure OpenAI in a UK region - keeps your content out of model training and within acceptable jurisdictions for UK GDPR. For genuinely sensitive workloads (patient data, classified material), running an open-source model like Llama 3 or Mistral on your own GPUs removes the third-party processor entirely. The architecture choice should follow a documented data protection impact assessment.

How does RAG handle hallucinations?

RAG reduces hallucinations but does not eliminate them. The retrieval step grounds the model in real source content, and a well-designed prompt instructs the model to answer only from that content and refuse otherwise. Production systems add a faithfulness check - either a second model call or a deterministic check that the cited spans appear in the retrieved chunks - to catch cases where the model strays. Refusal behaviour is critical: a system that confidently makes things up when the answer is not in the corpus is worse than one that says "I don't know, here is who to ask".
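
The deterministic variant of that check can be sketched as follows. It assumes the model has been asked to return each citation as a chunk id plus the exact span it is quoting; that output format, the function names, and the sample data are assumptions for illustration.

```python
import re


def normalise(text):
    """Lower-case and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_citations(citations, retrieved_chunks):
    """Return the (chunk_id, span) pairs whose span is not found in that chunk."""
    failures = []
    for chunk_id, span in citations:
        source = retrieved_chunks.get(chunk_id)
        quoted = normalise(span)
        # Fail if the chunk was never retrieved, the quote is empty,
        # or the quoted text does not occur in the chunk.
        if source is None or not quoted or quoted not in normalise(source):
            failures.append((chunk_id, span))
    return failures


# Assumption: illustrative chunks and model-produced citations.
retrieved_chunks = {
    "c1": "Refunds are available within 30 days of purchase.",
    "c2": "Gift cards are non-refundable.",
}
citations = [
    ("c1", "within 30 days"),   # supported
    ("c2", "within 90 days"),   # span not in the chunk
    ("c5", "any time"),         # chunk was never retrieved
]
failures = unsupported_citations(citations, retrieved_chunks)
```

An exact-match check like this only catches misquoted or invented citations. It says nothing about whether the answer's own wording follows from the quoted span, which is where a second model call comes in.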

Can I use RAG with any LLM?

Yes. RAG is an architecture pattern, not a model feature. You can use it with GPT-4o, Claude, Gemini, Llama 3, Mistral, or any other instruction-following model. The model choice affects answer quality, cost, and latency but does not change the pattern. Many production systems route different query types to different models - a cheap fast model for simple lookups, a stronger model for complex reasoning - using the same retrieval layer underneath. Switching the generation model later is straightforward; switching the retrieval design is not, so invest there first.

What skills does my team need to maintain a RAG system?

Day-to-day operation needs a backend engineer comfortable with Python, vector stores, and API integrations, plus someone owning the content side - keeping sources clean, reviewing flagged answers, updating the evaluation set. You do not need ML researchers. The harder skill to find is someone who treats the evaluation harness as a first-class product surface and iterates on retrieval quality systematically rather than tweaking prompts and hoping. Budget for ongoing tuning: a RAG system is not a project you finish but a service you operate.

Where to go from here

RAG is the right starting point for almost any internal assistant, customer-facing knowledge bot, or document-heavy workflow where being wrong is expensive. The architecture is well understood, the tooling is mature, and the costs are reasonable. The difficulty is in execution: chunking, hybrid retrieval, reranking, evaluation, and the operational discipline of keeping an index fresh.

If you are scoping a RAG project and want a sober view of what it will cost, how long it will take, and which decisions matter most, AI Advisory builds production RAG systems for mid-market UK businesses. Get in touch for a scoping conversation.
