What is a RAG Pipeline? A Practitioner's Guide to Retrieval-Augmented Generation
A practical guide to RAG pipelines: how retrieval-augmented generation works, the components, the failure modes, and how to evaluate one
A RAG pipeline (retrieval-augmented generation) is a system that fetches relevant information from your own data before a large language model writes an answer. Instead of relying on whatever the model memorised during training, the model sees a small, curated set of passages pulled from your documents, databases, or APIs at query time, and grounds its response in that material.
If you have ever asked ChatGPT a question about your company's internal policy and got a confident but wrong answer, you have experienced the problem RAG was built to solve. The model knows English; it does not know your 400-page operations manual. RAG plugs that manual into the conversation.
This article walks through what a RAG pipeline actually contains, how the pieces fit together, where it tends to break, and what to measure to know whether it is working. It is aimed at people scoping or building one, rather than those still deciding whether RAG is worth their attention.
Why RAG exists: the problem with raw LLMs
Foundation models like GPT-4, Claude, and Llama are trained on a frozen snapshot of public data. They have three structural limitations that matter for business use.
First, they have a knowledge cut-off. A model trained on data up to mid-2024 does not know about your contract signed in March 2026, your product roadmap, or yesterday's support ticket. Second, they have no access to private data by default. Your CRM, your internal wiki, your case files, your product documentation - none of it is in the model. Third, they hallucinate. When asked a specific question they cannot answer, they often invent a plausible-sounding response rather than refuse. This is well-documented behaviour; Stanford's 2024 study of legal AI tools found hallucination rates of 17-33% even on systems specifically marketed as grounded.
Fine-tuning addresses some of this but is expensive, slow to update, and poorly suited to factual recall (it teaches style and structure better than facts). RAG addresses all three problems by separating knowledge from reasoning, although it reduces hallucination rather than eliminating it. The model handles language and inference. A retrieval system handles knowledge. You can update the knowledge layer in seconds without touching the model.
The anatomy of a RAG pipeline
A production RAG pipeline has six components. Toy demos collapse some of them; serious systems treat each as its own concern.
1. Ingestion and chunking
Source documents (PDFs, HTML pages, Confluence articles, database rows, transcripts) are parsed and split into chunks. Chunk size is a real engineering decision: too small and you lose context, too large and retrieval becomes imprecise and you waste tokens. Typical chunks are 200-800 tokens with a 10-20% overlap so concepts that span boundaries are not lost. Structured documents benefit from semantic chunking that respects headings and sections rather than naive character splits.
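The overlap idea can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function name chunk_tokens is hypothetical, whitespace-separated words stand in for real tokenizer output, and the 500/75 sizes are simply one point inside the ranges quoted above (a 15% overlap).

```python
def chunk_tokens(tokens, chunk_size=500, overlap=75):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: plain words stand in for the tokens a real tokenizer would produce.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words, chunk_size=500, overlap=75)
print([len(c) for c in chunks])  # [500, 500, 350]
```

The last 75 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still intact in at least one chunk. Semantic chunking would replace the fixed step with boundaries taken from headings and sections.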
2. Embedding
Each chunk is converted into a vector - a list of usually 768 to 3,072 numbers that represent its semantic meaning. Similar concepts produce similar vectors. Popular embedding models include OpenAI's text-embedding-3-large, Cohere Embed v3, and open-source options like BGE and E5. The choice matters: embedding quality directly caps retrieval quality, and re-embedding a large corpus is expensive, so it pays to evaluate before committing.
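To make "similar concepts produce similar vectors" concrete, here is a small sketch of cosine similarity. The four-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Assumption: hand-written toy vectors, not output from any real embedding model.
refund_policy = [0.9, 0.1, 0.0, 0.2]
money_back = [0.8, 0.2, 0.1, 0.3]
api_rate_limit = [0.0, 0.9, 0.7, 0.1]

print(round(cosine_similarity(refund_policy, money_back), 2))      # 0.98
print(round(cosine_similarity(refund_policy, api_rate_limit), 2))  # 0.1
```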
3. Storage in a vector database
Vectors and their source text are stored in a vector database. Options range from Postgres with the pgvector extension (our default for most mid-market builds) to dedicated systems like Pinecone, Weaviate, Qdrant, and Milvus. For corpora under a few million chunks, pgvector on a properly indexed Postgres instance is usually sufficient and avoids adding another moving part to your stack.
4. Retrieval
At query time, the user's question is embedded with the same model used for the chunks. The database returns the top-k nearest neighbours by cosine similarity (typically k=5 to 20 when the results go straight to the model, and 20 to 50 when a re-ranker follows). Naive vector search is rarely good enough on its own. Production systems use hybrid retrieval - combining dense vector search with sparse keyword search (BM25) - because vectors handle paraphrasing well but miss exact product codes, names, and acronyms that keyword search nails.
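The text does not say how the dense and sparse result lists are merged. One common option, shown here as an assumption rather than the method this guide prescribes, is reciprocal rank fusion (RRF): each chunk scores 1 / (constant + rank) in every list it appears in, and the scores are summed. The chunk IDs are invented, and the constant of 60 is a widely used default.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Assumption: hypothetical result lists from a vector search and a BM25 search.
dense_hits = ["auth-errors", "webhook-signing", "api-keys"]
sparse_hits = ["webhook-signing", "http-401", "auth-errors"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['webhook-signing', 'auth-errors', 'http-401', 'api-keys']
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Chunks found by both searches rise to the top.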
5. Re-ranking
The top 20-50 retrieved chunks are passed to a cross-encoder re-ranker (Cohere Rerank, BGE-reranker, or similar) that scores each chunk against the query more carefully than the initial similarity score. The top 3-8 survive. This step is often the single biggest quality lever in a RAG pipeline and is skipped in most tutorials.
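The retrieve-then-rerank shape looks roughly like this. The scoring function is pluggable: in a real system it would call a cross-encoder, whereas overlap_score below is a deliberately crude stand-in (share of query words found in the chunk) so the sketch runs without any model. All names and sample chunks are hypothetical.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Score every (chunk_id, text) candidate against the query; keep the best top_n IDs."""
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words present in the chunk."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(text.lower().split())) / len(query_words)


candidates = [
    ("billing-faq", "how to update your billing details"),
    ("webhook-401", "webhook returning 401 when the signature header is wrong"),
    ("api-keys", "create and rotate an api key"),
]
top = rerank("webhook returning 401 with valid api key", candidates, overlap_score, top_n=2)
print(top)  # ['webhook-401', 'api-keys']
```

Because the scorer sees the query and the chunk together, it can be far more careful than a similarity lookup, which is why it is only run on the short candidate list rather than the whole corpus.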
6. Generation
Surviving chunks are inserted into a prompt template along with the user's question and instructions about how to answer (cite sources, refuse if unsure, format as bullet points, etc.). The LLM generates the response. Production prompts include explicit refusal patterns - "if the context does not contain the answer, say you do not know" - because without them models will reach into training data and confabulate.
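A prompt template along these lines might look like the sketch below. The wording, the function name build_prompt, and the chunk ID are illustrative assumptions; only the refusal sentence is taken from the text above.

```python
REFUSAL = "If the context does not contain the answer, say you do not know."


def build_prompt(question, chunks):
    """Assemble a grounded prompt from (chunk_id, text) pairs and the user's question."""
    context = "\n\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    return (
        "Answer the question using only the context below. "
        "Cite the IDs of the chunks you used in square brackets. "
        f"{REFUSAL}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "Why is my webhook returning 401?",
    [("kb-101", "Webhook signature headers are case-sensitive.")],
)
print(prompt)
```

Tagging each chunk with its ID inside the context is what makes citation checkable later: any claim in the answer without a bracketed ID stands out.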
A worked example: customer support assistant
Concrete is clearer. Imagine a B2B SaaS company with 12,000 help-centre articles, an API reference, and three years of resolved support tickets. They want a customer-facing assistant that answers product questions.
The ingestion job runs nightly. It pulls articles from the CMS, parses Markdown into chunks of ~500 tokens with heading metadata preserved, embeds each chunk with text-embedding-3-large, and upserts into a pgvector table partitioned by source. Total corpus: roughly 180,000 chunks.
A customer asks: "Why is my webhook returning 401 even though my API key works in Postman?"
The query is embedded. Hybrid retrieval pulls 30 candidates: 20 from vector similarity (which catches conceptual matches like articles about "authentication errors") and 10 from BM25 (which catches the exact strings "401" and "webhook"). A Cohere reranker scores all 30 against the query and returns the top 6.
The top 6 chunks - probably containing a known issue about webhook signature verification headers being case-sensitive - are inserted into a prompt with instructions to cite article IDs and refuse if the answer is not in context. Claude or GPT-4 writes the response. The user sees an answer with two linked articles and a quoted code snippet.
None of that response existed in the LLM's training data. The model contributed language fluency and the ability to synthesise across six chunks. The pipeline contributed the facts.
Where RAG pipelines break
Most RAG projects that fail in production fail in predictable ways. Knowing the failure modes upfront saves months.
Retrieval misses the right chunk. If the relevant document is not in the top-k results, the LLM cannot use it. This is by far the most common cause of poor answers. Diagnose by manually checking, for a sample of bad responses, whether the right chunk was retrieved at all. If not, the problem is in chunking, embedding, or retrieval - not in the model.
The right chunk is retrieved but the LLM ignores it. Models can fixate on irrelevant context, especially when chunks contradict each other or when the prompt is poorly structured. Re-ranking and reducing the number of chunks passed to the model usually helps.
Chunks lack context. A 500-token chunk that says "The limit is 50 per minute" is useless without knowing what the limit applies to. Good ingestion prepends section headings, document titles, and metadata to each chunk so retrieved fragments make sense on their own.
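One simple way to do this, sketched under the assumption that the ingestion step already knows each chunk's document title and heading path, is to prepend a breadcrumb line before embedding. The function name and the sample titles are hypothetical.

```python
def contextualise(chunk_text, doc_title, headings):
    """Prefix a chunk with its document title and heading path."""
    breadcrumb = " > ".join([doc_title, *headings])
    return f"{breadcrumb}\n{chunk_text}"


text = contextualise(
    "The limit is 50 per minute.",
    doc_title="API Reference",
    headings=["Webhooks", "Rate limits"],
)
print(text)
# API Reference > Webhooks > Rate limits
# The limit is 50 per minute.
```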
Hallucination on edge cases. When the corpus does not contain the answer, weakly-prompted models invent one. The fix is explicit refusal patterns in the system prompt, plus evaluation tests that specifically probe questions outside the corpus to confirm the model says "I do not know" rather than guessing.
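An out-of-corpus probe can be as small as the sketch below. Here fake_pipeline is a stand-in for the real system, the refusal markers are an assumed convention that would need to match your own prompt wording, and the questions are invented.

```python
REFUSAL_MARKERS = ("i do not know", "i don't know")


def is_refusal(response):
    """True if the response contains one of the agreed refusal phrases."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(answer_fn, out_of_corpus_questions):
    """Share of unanswerable questions that the pipeline correctly declines."""
    refusals = sum(is_refusal(answer_fn(q)) for q in out_of_corpus_questions)
    return refusals / len(out_of_corpus_questions)


def fake_pipeline(question):
    # Stand-in for the real RAG pipeline; it "hallucinates" on pricing questions.
    if "pricing" in question:
        return "The enterprise plan costs 5,000 per month."
    return "I do not know based on the available documentation."


probes = [
    "What is your pricing for enterprise?",
    "Who won the 1998 World Cup?",
    "What is the CEO's favourite colour?",
]
rate = refusal_rate(fake_pipeline, probes)
print(f"{rate:.2f}")  # 0.67
```

String matching is a blunt detector; it is enough to catch regressions in a test suite, but borderline responses still deserve a manual look.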
Stale data. If ingestion runs weekly but your help articles update daily, customers will get out-of-date answers. Ingestion freshness needs to match the rate of source change. Event-driven ingestion (webhooks from your CMS) is better than scheduled batch jobs for fast-moving sources.
Permissions and data leakage. If your corpus contains documents only some users should see, retrieval must filter by user permissions before the LLM sees the chunks. Bolting this on later is painful; design it in from the start. Under UK GDPR, mishandling personal data inside a RAG pipeline is a processing breach like any other - the ICO's AI guidance is the relevant reference.
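The core of permission-aware retrieval is a filter that runs before any chunk reaches the prompt. The sketch below assumes each chunk carries the set of groups allowed to read it; the Chunk fields, group names, and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset


def filter_by_permission(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


corpus = [
    Chunk("hr-1", "Salary bands", frozenset({"hr"})),
    Chunk("kb-1", "How to reset your password", frozenset({"hr", "support"})),
]
visible = filter_by_permission(corpus, user_groups={"support"})
print([c.chunk_id for c in visible])  # ['kb-1']
```

In practice the same condition is usually better expressed inside the database query, so that the top-k results are drawn only from chunks the user may see rather than being thinned out afterwards.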
How to evaluate a RAG pipeline
You cannot improve what you do not measure. RAG evaluation has three layers.
Retrieval metrics. Build a labelled set of 100-300 questions where you know which chunks should be retrieved. Measure recall@k (did we get the right chunk in the top k?) and mean reciprocal rank (how high did it rank?). Recall@10 below 80% means your retrieval needs work before anything else matters.
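Both metrics are short to compute once the labelled set exists. This sketch uses the definition given above for recall@k (did any correct chunk appear in the top k?), sometimes called hit rate; the four labelled examples are invented.

```python
def hit_at_k(retrieved, relevant, k):
    """True if any relevant chunk ID appears in the first k retrieved IDs."""
    return any(chunk_id in relevant for chunk_id in retrieved[:k])


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(results, k=10):
    """Return (recall@k, mean reciprocal rank) over (retrieved, relevant) pairs."""
    n = len(results)
    recall = sum(hit_at_k(r, rel, k) for r, rel in results) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in results) / n
    return recall, mrr


# Assumption: toy labelled set; each entry is (retrieved IDs in order, correct IDs).
labelled = [
    (["c1", "c2", "c3"], {"c1"}),  # correct chunk at rank 1
    (["c4", "c5", "c6"], {"c6"}),  # correct chunk at rank 3
    (["c7", "c8", "c9"], {"c0"}),  # correct chunk never retrieved
    (["c2", "c3", "c1"], {"c3"}),  # correct chunk at rank 2
]
recall, mrr = evaluate(labelled, k=2)
print(recall, round(mrr, 3))  # 0.5 0.458
```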
Answer quality metrics. For a holdout set of questions with reference answers, measure faithfulness (does the response only claim things supported by the retrieved chunks?), answer relevance (does it actually address the question?), and context precision (how much of the retrieved context was useful?). Frameworks like Ragas and TruLens automate this with LLM-as-judge approaches. Treat the scores as directional, not absolute.
Production metrics. Log every query, the retrieved chunks, the response, and user feedback (thumbs up/down, escalation to human, follow-up question). Sample 50-100 real queries weekly and grade them manually. Patterns emerge quickly: certain query types fail, certain document categories under-retrieve, certain prompts cause specific bad behaviours.
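A log record for this purpose needs very little structure. The sketch below is one possible shape, with hypothetical field names: one JSON line per query, plus a helper that draws the weekly review sample.

```python
import json
import random
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class QueryLog:
    query: str
    retrieved_chunk_ids: List[str]
    response: str
    feedback: Optional[str] = None  # e.g. "up", "down", "escalated"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(log):
    """Serialise one record as a single JSON line for an append-only log file."""
    return json.dumps(asdict(log))


def sample_for_review(logs, n, seed=None):
    """Pick up to n records at random for manual grading."""
    rng = random.Random(seed)
    return rng.sample(logs, min(n, len(logs)))


log = QueryLog(
    query="Why is my webhook returning 401?",
    retrieved_chunk_ids=["kb-101", "kb-207"],
    response="Check the case of the signature header [kb-101].",
    feedback="up",
)
print(to_json_line(log))
```

Storing the retrieved chunk IDs alongside the response is what lets a reviewer tell a retrieval miss from a generation failure when grading a bad answer.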
When RAG is the right answer (and when it is not)
RAG is the right choice when you need an LLM to reason over a body of factual information that changes faster than you can fine-tune, when you need source citations for trust or compliance, and when the corpus is too large to fit in a context window even with long-context models.
RAG is the wrong choice for tasks that depend on style or format rather than facts (use fine-tuning or few-shot prompting), for highly structured queries that are really database lookups (use SQL with an LLM frontend), and for tasks where the entire reference document fits comfortably in context (just put it in the prompt). Long-context models with one-million-token windows do not eliminate RAG either - retrieval still beats stuffing the context window on cost, latency, and accuracy for most non-trivial corpora, as recent research from Google and others has shown.
A reasonable rule of thumb: if you have more than a few hundred pages of reference material and questions that span multiple documents, RAG earns its complexity. Below that, a simpler approach is often better.
What a production RAG build actually involves
For a mid-market organisation building its first RAG system, expect roughly the following shape. Discovery and corpus scoping take two to three weeks - identifying sources, access patterns, permissions, and freshness requirements. An initial pipeline with ingestion, retrieval, and a basic interface takes four to six weeks. The evaluation harness, re-ranking, refusal patterns, and the iteration cycle that takes accuracy from "demo-good" to "production-good" take another six to ten weeks. Total: 12-19 weeks for a system you can defend in front of users and auditors.
Ongoing operation matters more than people expect. Corpora drift. New document types arrive. Embedding models improve and become worth re-running. User questions reveal gaps you did not know existed. A RAG system without an owner and a monthly review cadence degrades within a year.
FAQ
How is RAG different from fine-tuning?
Fine-tuning adjusts the model's weights by training it on examples, which is good for teaching style, format, or domain-specific reasoning patterns. RAG leaves the model untouched and supplies fresh information at query time. Fine-tuning is poor at teaching specific facts (the model still hallucinates around them) and is slow and expensive to update. RAG handles facts well and updates instantly when you change the underlying documents. Most production systems combine both: fine-tune lightly for tone or output structure, use RAG for factual grounding. If you only do one, start with RAG - it solves the more common problem.
Do I need a vector database, or can I use Postgres?
For most mid-market RAG builds, Postgres with the pgvector extension is sufficient and preferable. It avoids adding a separate datastore, integrates with your existing backups and access controls, and handles up to several million chunks comfortably on reasonable hardware. Dedicated vector databases like Pinecone, Weaviate, or Qdrant become worthwhile at very high scale (tens of millions of vectors), when you need specific features like distributed sharding, or when query latency requirements are extreme. Start with pgvector. Migrate later if and when you hit real limits, which most organisations never do.
How much does a RAG pipeline cost to run?
Operating costs split across embeddings (one-off per chunk, then incremental on updates), vector storage (cheap), retrieval compute (cheap), and LLM generation (the dominant cost). For a system handling 10,000 queries per month with GPT-4-class generation and re-ranking, expect £300-£1,500 per month in API costs depending on context size and model choice. Self-hosting open-source models on your own infrastructure can reduce per-query cost substantially at the price of higher fixed cost and operational overhead. Build cost is separate and depends on scope - see the timing estimates above.
Is RAG safe under UK GDPR?
RAG can be GDPR-compliant but does not become so automatically. The pipeline processes personal data whenever the corpus contains it, which triggers the usual obligations: lawful basis, data minimisation, retention, subject access rights, and security. Specific risks include personal data being embedded into vectors (which are still personal data), retrieval surfacing information a user should not see, and LLM responses inadvertently disclosing data from one tenant to another. Mitigations include permission-aware retrieval, tenant isolation in the vector store, careful prompt design, and contractual controls on third-party model providers. The ICO's AI and data protection guidance covers the framework.
Can I build a RAG pipeline with no-code tools?
You can build a prototype with platforms like n8n, Flowise, or Langflow connected to a hosted vector database in a day or two, and for some internal use cases that is enough. Production systems that handle real volume, permissions, evaluation, and iteration need code. The no-code path is excellent for proving the concept and discovering what you actually need; it tends to hit limits when you need custom retrieval logic, fine-grained access control, robust evaluation, or anything that requires testing and version control. Treat no-code as a prototyping accelerator, not a production target.
How do I stop the LLM from hallucinating in my RAG system?
You cannot eliminate hallucination but you can reduce it substantially. Use explicit refusal instructions in the system prompt ("if the context does not contain the answer, say you do not know"). Reduce the number of chunks passed to the model so it is not distracted by irrelevant context. Add re-ranking to improve chunk quality. Require the model to cite specific source IDs in its response so unsourced claims are visible. Run a faithfulness evaluation that scores whether each claim in the response is supported by retrieved context. And monitor production responses with sampling - hallucinations cluster around specific query types you can then fix.
What is the difference between RAG and agentic AI?
RAG is a retrieval pattern: fetch relevant information, then generate. Agentic AI describes systems where an LLM plans and executes multi-step actions, often calling tools, querying APIs, and making decisions about what to do next. The two overlap - many agents use RAG as one of their tools - but they solve different problems. RAG answers "what does our knowledge base say about X?" An agent answers "investigate X by checking these sources, running these queries, and producing this output." Agents are more powerful and more brittle. If your problem is question-answering over a corpus, RAG alone is usually the right shape.
How long before a RAG system needs maintenance?
From day one. Sources change, new documents arrive, user queries reveal gaps, and embedding models improve. A reasonable cadence is weekly review of a sample of production queries, monthly review of evaluation metrics, quarterly review of corpus coverage and retrieval performance, and annual review of the embedding and generation models in use. Systems that go unattended for six months almost always show measurable accuracy decay, partly from corpus drift and partly because, without a maintained evaluation harness, regressions go unnoticed. Budget for ongoing operation; it is not a build-and-forget asset.
Building yours
RAG pipelines reward careful engineering more than clever prompting. The components are not exotic, but the failure modes are specific, the evaluation discipline is non-obvious, and the difference between a demo that impresses your CEO and a system that holds up in front of 10,000 real users is mostly in the boring parts: chunking, hybrid retrieval, re-ranking, refusal patterns, and an evaluation loop you actually run. If you are scoping a RAG build and want a sanity check on the architecture or the timeline, AI Advisory builds these systems for UK mid-market clients and is happy to walk through your specific case.