What is a RAG Pipeline in AI? A Practitioner's Explanation
A clear, technical explanation of RAG pipelines: how retrieval-augmented generation works, the components involved, and where teams get it wrong
Retrieval-augmented generation (RAG) is the architecture most teams reach for when they want a large language model to answer questions about their own data without retraining the model. The short definition: a RAG pipeline retrieves relevant passages from a knowledge source at query time and passes them to a language model as context, so the model can answer using grounded information rather than its parametric memory alone.
That definition gets repeated everywhere. What is less commonly explained is what actually sits inside the pipeline, why each component exists, and how the choices made at each stage determine whether the system is useful in production or quietly hallucinates its way through customer queries. This article walks through the full pipeline as it is built in practice, the trade-offs that matter, and the failure modes that catch teams out.
Why RAG exists at all
Large language models like GPT-4, Claude, and Llama are trained on a fixed corpus up to a cutoff date. They cannot answer questions about your internal documentation, last week's support tickets, or a contract signed yesterday. They also hallucinate confidently when asked about topics outside their training data, which is unacceptable in regulated industries or customer-facing applications.
There are three broad ways to give a model access to private or current information: fine-tuning, long-context prompting, and retrieval-augmented generation. Fine-tuning bakes knowledge into model weights, which is expensive, slow to update, and poor at handling factual recall. Long-context prompting stuffs everything into the prompt window, which is wasteful, expensive at scale, and degrades accuracy as context grows - a finding documented by Liu et al. in their 'Lost in the Middle' research, which showed model performance drops sharply when relevant information sits in the middle of long contexts.
RAG is the pragmatic middle path. The model stays general-purpose. Knowledge lives in a searchable store that can be updated continuously. At query time, only the relevant passages are retrieved and sent to the model. This keeps prompts small, costs predictable, and updates trivial - add a new document, re-index, and the system knows about it within minutes.
The five components of a RAG pipeline
A working RAG pipeline has five stages: ingestion, chunking, embedding and indexing, retrieval, and generation. Each stage has its own design decisions, and getting any of them wrong degrades the whole system.
1. Ingestion
Ingestion is the process of pulling source documents into the pipeline. In a corporate setting this means PDFs, Word documents, Confluence pages, Notion databases, support ticket exports, Slack archives, SharePoint files, transcripts, and structured database records. Each source needs a connector, and each connector needs to handle authentication, rate limits, and incremental updates.
The hard part of ingestion is not pulling the text - it is parsing it correctly. A PDF with tables, a Word document with embedded images, a Confluence page with macros, and a transcript with speaker labels all need different handling. Libraries like Unstructured, LlamaParse, and Azure Document Intelligence exist specifically because raw text extraction loses critical structure. A pipeline that ingests a financial report and discards the table headers will answer questions about that report wrongly, every time.
2. Chunking
Once documents are extracted, they are split into chunks - smaller passages that can be embedded and retrieved independently. Chunking strategy is one of the most undervalued decisions in RAG. Naive fixed-size chunking (say, every 500 tokens) breaks sentences mid-thought and separates context from the facts it explains.
Better strategies preserve semantic boundaries. Recursive character splitting respects paragraphs and sections. Sentence-window chunking embeds individual sentences but retrieves the surrounding context. Document-aware chunking respects markdown headings, code blocks, and table boundaries. For technical documentation, chunking by heading hierarchy tends to outperform fixed-size approaches by a meaningful margin in retrieval accuracy.
Chunk size matters too. Small chunks (100-300 tokens) give precise retrieval but lose context. Large chunks (1000+ tokens) give context but dilute the embedding signal. Most production systems land between 400 and 800 tokens with 10-20% overlap between adjacent chunks.
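As a rough illustration of boundary-preserving chunking with overlap, the sketch below packs whole paragraphs into chunks and repeats the last paragraph of each chunk at the start of the next. The function name `chunk_by_paragraph`, the use of whitespace-separated words as a stand-in for tokens, and the default sizes are all assumptions made for the example; a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_by_paragraph(text, max_words=600, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks of roughly max_words words.

    Words are a crude proxy for tokens (assumption for this sketch).
    A single paragraph longer than max_words is kept whole, so a chunk
    can exceed the budget rather than be cut mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words_so_far = sum(len(p.split()) for p in current)
        if current and words_so_far + len(para.split()) > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "a b c\n\nd e f\n\ng h i"
    for chunk in chunk_by_paragraph(sample, max_words=6, overlap_paragraphs=1):
        print(repr(chunk))
```

Overlapping by whole paragraphs rather than by a fixed token count keeps the qualifying sentence next to the fact it qualifies, at the cost of less predictable chunk sizes.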
3. Embedding and indexing
Each chunk is converted into a vector - a list of numbers (typically 768 to 3072 dimensions) that represents its semantic meaning. Embedding models like OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source alternatives like BGE and E5 do this conversion. Chunks with similar meanings end up close together in vector space, which is what makes semantic search possible.
The vectors are stored in a vector database. Postgres with the pgvector extension is a strong default for teams already using Postgres. Dedicated vector stores like Pinecone, Weaviate, Qdrant, and Milvus offer better performance at scale and more sophisticated indexing options. For most mid-market builds under 10 million chunks, pgvector handles the load comfortably and avoids adding a new piece of infrastructure.
The index itself uses approximate nearest neighbour algorithms - HNSW being the most common - to make similarity search fast. A brute-force scan compares the query against every stored vector, so its cost grows linearly with the size of the collection; HNSW typically returns results in milliseconds with a small accuracy trade-off.
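To make "close together in vector space" concrete, here is a minimal sketch of the exhaustive search that an approximate index such as HNSW approximates: cosine similarity between a query vector and every chunk vector, followed by a top-k sort. The three-dimensional vectors and chunk IDs are invented toy values; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Exact (brute-force) search: score every chunk, keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(cid, score) for score, cid in scored[:k]]


if __name__ == "__main__":
    # Toy vectors, purely illustrative.
    chunk_vecs = {
        "refund-policy": [0.9, 0.1, 0.0],
        "shipping-times": [0.1, 0.9, 0.0],
        "office-hours": [0.0, 0.1, 0.9],
    }
    for cid, score in top_k([0.8, 0.2, 0.0], chunk_vecs, k=2):
        print(cid, round(score, 3))
```

This version touches every vector on every query, which is why it stops being practical as the collection grows and an approximate index takes over.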
4. Retrieval
At query time, the user's question is embedded with the same model used for the chunks, and the vector database returns the top-k most similar chunks. This is where teams discover that pure semantic search is not enough.
Semantic search excels at conceptual similarity but fails on exact-match queries. A user searching for 'invoice number INV-2847' wants that specific string, not chunks about invoicing in general. The fix is hybrid retrieval: combine vector search with keyword search (BM25), then merge the results. Most production systems use hybrid retrieval by default.
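The merge step can be done in several ways. One widely used option is reciprocal rank fusion (RRF), which combines ranked lists using only each item's position, so the vector and BM25 scores never need to be put on the same scale. The sketch below assumes RRF as the merge method (the choice is an assumption of this example, not something prescribed above), and the chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per item; k=60 is the
    conventional default constant. Ties are broken by chunk ID so
    the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    vector_hits = ["c7", "c2", "c9"]   # hypothetical semantic-search results
    keyword_hits = ["c4", "c7", "c1"]  # hypothetical BM25 results
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
    # prints ['c7', 'c4', 'c2', 'c1', 'c9']
```

A chunk that appears in both lists (c7 here) rises to the top, which is the behaviour you want for a query like an invoice number that matches both lexically and semantically.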
The next refinement is reranking. The initial retrieval returns, say, 20 candidate chunks. A cross-encoder reranker (Cohere Rerank, BGE reranker) scores each chunk against the query with more precision than the bi-encoder embedding model, and the top 3-5 are passed to the language model. Reranking often improves answer quality noticeably (gains in the region of 10-30% on benchmarks are commonly cited, though results vary by dataset and domain), at a modest latency cost.
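The two-stage shape of retrieve-then-rerank can be sketched without committing to a particular reranker. In the example below, `rerank` takes any scoring function; `overlap_score` is a deliberately crude word-overlap stand-in so the code runs on its own, and is not how a cross-encoder works. In a real system the scoring function would call a cross-encoder model or a reranking API.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score (chunk_id, text) candidates against the query and keep the best top_n."""
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query, text):
    """Placeholder scorer: fraction of query words found in the text.

    Hypothetical stand-in for a cross-encoder, used only to make the
    sketch runnable.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


if __name__ == "__main__":
    candidates = [
        ("a", "shipping times for europe"),
        ("b", "refund policy for damaged items"),
        ("c", "how to request a refund"),
    ]
    print(rerank("refund policy damaged", candidates, overlap_score, top_n=2))
```

Keeping the scorer as a parameter makes it straightforward to compare rerankers against the same evaluation set.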
Other retrieval refinements include query rewriting (the model rewrites ambiguous queries before retrieval), HyDE (hypothetical document embeddings, where the model drafts a fake answer and embeds that for retrieval), and metadata filtering (restricting retrieval to documents matching attributes like date, department, or document type).
5. Generation
The retrieved chunks are inserted into a prompt template along with the user's question, and the language model generates an answer. The prompt usually instructs the model to answer only using the provided context, to cite which chunks it used, and to refuse if the context does not contain the answer.
The refusal pattern is what separates a useful RAG system from a dangerous one. A model that confidently fabricates an answer when the retrieval missed is worse than no system at all. Production prompts explicitly tell the model: if the answer is not in the context, say so. Citation patterns - where the model returns the source chunk IDs alongside the answer - let users verify claims and build trust.
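A prompt template along these lines might look like the following sketch. The exact wording, the `REFUSAL` string, and the bracketed chunk-ID convention are assumptions for illustration; production prompts are usually tuned against an evaluation set rather than written once.

```python
REFUSAL = "I don't know based on the provided documents."


def build_prompt(question, chunks):
    """Assemble a grounded prompt from (chunk_id, text) pairs.

    Each chunk is labelled with its ID so the model can cite it.
    """
    context = "\n\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    return (
        "Answer the question using only the context below.\n"
        "Cite the IDs of the chunks you used in square brackets.\n"
        f"If the context does not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt(
        "What is the refund window?",
        [("c1", "Refunds are accepted within 30 days.")],
    ))
```

Using a fixed refusal string makes refusals easy to detect in code, which helps when measuring refusal behaviour later.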
Evaluation: the part most teams skip
You cannot improve what you do not measure. A RAG pipeline has at least three things worth evaluating independently: retrieval quality, answer quality, and refusal behaviour.
Retrieval quality is measured with metrics like recall@k (did the correct chunk appear in the top k?) and mean reciprocal rank. You need a labelled evaluation set - a list of questions with the chunks that should be retrieved for each. Building this set takes effort, but without it you are tuning blind.
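Both metrics are simple to compute once the labelled set exists. The sketch below assumes each evaluation item is a pair of (retrieved chunk IDs in ranked order, set of relevant chunk IDs); that data shape and the function names are choices made for this example. When every question has exactly one relevant chunk, recall@k reduces to the hit-or-miss check described above.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set, k=5):
    """Average recall@k and mean reciprocal rank over a labelled set."""
    n = len(eval_set)
    if n == 0:
        return {"recall@k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(ret, rel, k) for ret, rel in eval_set) / n
    mrr = sum(reciprocal_rank(ret, rel) for ret, rel in eval_set) / n
    return {"recall@k": recall, "mrr": mrr}


if __name__ == "__main__":
    eval_set = [
        (["a", "b", "c"], {"b"}),  # relevant chunk found at rank 2
        (["x", "y", "z"], {"q"}),  # relevant chunk missed entirely
    ]
    print(evaluate(eval_set, k=2))
    # prints {'recall@k': 0.5, 'mrr': 0.25}
```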
Answer quality is measured against ground-truth answers. Frameworks like Ragas and TruLens automate this with LLM-as-judge scoring across dimensions like faithfulness (does the answer match the retrieved context?), answer relevance, and context precision.
Refusal behaviour is measured with adversarial questions - things outside the knowledge base, ambiguous questions, and questions designed to provoke hallucination. A system that scores well on faithfulness but answers everything is fragile; a system that refuses too aggressively is useless. The balance matters.
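One way to put numbers on that balance is to track two rates side by side: how often the system refuses questions it should refuse, and how often it refuses questions it could have answered. The sketch below assumes each test result has already been labelled as a pair of booleans (was the question answerable from the knowledge base, did the system refuse); the metric names are illustrative rather than standard.

```python
def refusal_metrics(results):
    """Summarise refusal behaviour from (answerable, refused) pairs."""
    unanswerable = [refused for answerable, refused in results if not answerable]
    answerable = [refused for answerable, refused in results if answerable]
    return {
        # Share of out-of-scope questions that were correctly refused.
        "correct_refusal_rate": sum(unanswerable) / len(unanswerable) if unanswerable else 0.0,
        # Share of in-scope questions that were refused unnecessarily.
        "over_refusal_rate": sum(answerable) / len(answerable) if answerable else 0.0,
    }


if __name__ == "__main__":
    results = [
        (True, False),   # answerable, answered
        (True, True),    # answerable, refused (over-refusal)
        (False, True),   # out of scope, refused (correct)
        (False, True),   # out of scope, refused (correct)
    ]
    print(refusal_metrics(results))
    # prints {'correct_refusal_rate': 1.0, 'over_refusal_rate': 0.5}
```

A low correct-refusal rate points to a system that answers everything; a high over-refusal rate points to one that is too cautious to be useful.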
Where RAG breaks in production
The most common failure mode is not the model - it is the data. RAG pipelines surface the quality of the underlying documentation ruthlessly. Outdated procedures, contradictory policies, duplicate versions of the same document, and gaps in coverage all become visible the moment users start asking questions. Many RAG projects stall not because the technology is wrong but because the source content needs editorial work first.
The second common failure is chunking. A pipeline that chunks naively will retrieve fragments that lack context. The answer the model generates is technically grounded in retrieved text but missing the qualifying sentence that lived in the next paragraph.
The third is evaluation drift. A system that worked at launch degrades silently as content grows, query patterns shift, and embedding models are updated. Production RAG needs ongoing monitoring - sampled output review, regression tests against the evaluation set, and alerts when retrieval scores drop.
The fourth, often underestimated, is governance. RAG systems read everything they index. Access control needs to be enforced at retrieval time, not just at the document store. A user querying the chatbot should not see chunks from documents they cannot access in the source system. This is especially material under UK GDPR and the ICO's guidance on AI and data protection, which expects organisations to maintain the same controls over derived systems as over the source data.
When RAG is the right answer, and when it is not
RAG is the right architecture when you have a body of knowledge that changes over time, when answers need to be grounded in citable sources, when the knowledge volume exceeds what fits in a context window, and when access controls matter. Customer support assistants, internal knowledge bases, legal and compliance search, technical documentation chatbots, and analyst research tools are all natural RAG applications.
RAG is the wrong answer when the task is reasoning rather than retrieval - planning, multi-step problem solving, or open-ended generation. It is also wrong when the knowledge is small enough to fit in the prompt window directly (under ~50 pages of text), where the engineering overhead is not justified. And it is wrong when the task is genuinely procedural - workflow automation, data transformation, structured extraction - where a deterministic pipeline beats an LLM with retrieval.
Hybrid architectures are increasingly common. An agent might use RAG to look up policy details, then call a deterministic tool to execute the action. The retrieval pipeline is one component among several, not the whole system.
Frequently asked questions
How long does it take to build a production RAG pipeline?
A first working version takes 4-6 weeks for a focused use case with clean source content. A production-grade system with evaluation harness, monitoring, access control, and editorial workflow typically takes 10-16 weeks. The variation comes from data preparation, not the technical build. If the source content needs cleaning, deduplication, or restructuring, that work often takes longer than the pipeline itself. Teams that try to compress this timeline by skipping evaluation almost always regret it within three months, when users start reporting hallucinations and there is no systematic way to debug what went wrong.
What does a RAG pipeline cost to run?
Running costs break down into embedding (one-off per document, then incremental), vector storage, retrieval compute, and LLM inference. For a knowledge base of around 100,000 chunks serving 10,000 queries per month, expect £200-£800 per month in API costs (OpenAI or Anthropic), £50-£200 in vector database hosting if using a managed service, and modest infrastructure costs for the application itself. Self-hosting embeddings with open-source models like BGE can cut API costs by 60-80% at the expense of running GPU infrastructure. The biggest cost variable is whether you use frontier models (GPT-4 class) or smaller models for generation.
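The inference part of that estimate is plain arithmetic, sketched below. Every number in the example call is a placeholder assumption, not a quoted price: the per-million-token rates are invented, and the token counts assume roughly five retrieved chunks plus instructions per query. Substitute your provider's current pricing and your own measured token counts.

```python
def monthly_llm_cost(queries, input_tokens_per_query, output_tokens_per_query,
                     price_in_per_million, price_out_per_million):
    """Estimate monthly generation cost from query volume and token prices."""
    input_cost = queries * input_tokens_per_query / 1_000_000 * price_in_per_million
    output_cost = queries * output_tokens_per_query / 1_000_000 * price_out_per_million
    return input_cost + output_cost


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cost = monthly_llm_cost(
        queries=10_000,
        input_tokens_per_query=3_500,   # retrieved chunks + instructions + question
        output_tokens_per_query=300,
        price_in_per_million=8.0,       # placeholder price in GBP
        price_out_per_million=24.0,     # placeholder price in GBP
    )
    print(f"Estimated monthly LLM cost: £{cost:.2f}")
```

Because input tokens dominate, the number of chunks passed to the model and their size usually move the bill more than answer length does.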
Do I need a vector database, or can I use Postgres?
Postgres with the pgvector extension handles most mid-market RAG workloads comfortably - up to roughly 10 million vectors with sub-100ms query latency on modest hardware. For teams already running Postgres, this avoids adding new infrastructure and keeps embeddings in the same database as the rest of the application data. Dedicated vector stores like Pinecone, Weaviate, and Qdrant become worthwhile at higher scale, when you need advanced features like multi-tenancy isolation, or when query latency of 100ms or more is unacceptable. Start with pgvector. Migrate later if you actually hit its limits.
Is RAG better than fine-tuning?
For factual recall and knowledge access, yes - almost always. Fine-tuning is expensive, slow to update, and poor at preventing hallucination because the model cannot distinguish what it learned during fine-tuning from what it learned during pre-training. RAG keeps knowledge in a separate, updatable store and forces the model to ground its answers in retrieved passages. Fine-tuning is the right choice for changing model behaviour - tone, format, refusal patterns, output structure - not for teaching it facts. Most production systems use both: RAG for knowledge, light fine-tuning or prompt engineering for behaviour.
How do I handle access control in a RAG system?
Enforce permissions at retrieval time, not just at ingestion. Each chunk should carry metadata about which users, groups, or roles can access it. At query time, the retrieval filter restricts results to chunks the requesting user is authorised to see. Inheriting permissions from the source system (SharePoint, Confluence, Google Drive) keeps the model in sync with whatever access changes happen upstream. This is non-negotiable under UK GDPR for personal data and for any system handling client-confidential or commercially sensitive content. Auditing - logging which chunks were retrieved for which user - is the other half of the control.
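The pattern can be sketched as a permission filter plus an audit record. The field name `allowed_groups`, the chunk dictionaries, and the in-memory audit list are assumptions for this example. In practice the filter is usually pushed into the vector query itself as a metadata filter, so that restricted chunks never leave the store and the top-k is not depleted after the fact.

```python
def filter_by_access(candidates, user_groups):
    """Keep only chunks whose allowed_groups overlap the user's groups."""
    allowed = set(user_groups)
    return [chunk for chunk in candidates if allowed & set(chunk["allowed_groups"])]


def retrieve_for_user(user_id, user_groups, candidates, audit_log):
    """Apply the permission filter and record what the user was shown."""
    visible = filter_by_access(candidates, user_groups)
    audit_log.append({"user": user_id, "chunks": [chunk["id"] for chunk in visible]})
    return visible


if __name__ == "__main__":
    # Hypothetical chunks with permissions inherited from the source system.
    candidates = [
        {"id": "hr-1", "text": "Salary bands ...", "allowed_groups": ["hr"]},
        {"id": "kb-1", "text": "Holiday policy ...", "allowed_groups": ["all-staff", "hr"]},
    ]
    audit_log = []
    visible = retrieve_for_user("u1", ["all-staff"], candidates, audit_log)
    print([chunk["id"] for chunk in visible])
    print(audit_log)
```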
What about hallucination - can RAG eliminate it?
RAG reduces hallucination significantly but does not eliminate it. The model can still misread retrieved context, conflate two chunks, or generate plausible-sounding additions. The mitigations are prompt design (explicit instructions to use only provided context and refuse when it is insufficient), citation requirements (the model must point to which chunk supports each claim), and evaluation (sampled review of outputs against retrieved context). Faithfulness scoring with tools like Ragas can catch many hallucinations in testing, but it is not exhaustive. For high-stakes applications - legal, medical, financial advice - add human-in-the-loop review for any output that the system flags as low-confidence.
Can I build a RAG pipeline with no-code tools?
You can build a prototype quickly with tools like Flowise, Langflow, or n8n's AI nodes wired to a vector database. These are excellent for proving the concept and demonstrating value to stakeholders. They are less suited to production at scale, where you need fine control over chunking, hybrid retrieval, reranking, evaluation, and access control. Most teams that ship serious RAG systems end up writing code - typically Python with LangChain or LlamaIndex, or direct API calls when those frameworks add more abstraction than value. No-code is a great starting point and a poor finishing point.
What is the difference between RAG and an AI agent?
RAG is a retrieval pattern - it gives a model access to external knowledge at inference time. An agent is a control pattern - it lets a model decide what actions to take, including when to retrieve, what tools to call, and how to chain steps together. The two are complementary. A customer support agent might use RAG to look up policy, call an API to check order status, and then generate a response combining both. RAG without an agent is a question-answering system. An agent without RAG is a tool-using assistant with no access to your knowledge base.
Closing thought
RAG pipelines are conceptually simple and operationally demanding. The five-stage architecture is well understood, but the engineering between stages - chunking strategy, hybrid retrieval, reranking, evaluation, access control, monitoring - is where production systems are won or lost. Teams that treat RAG as a weekend prototype tend to ship something that works in demos and fails under real query volume. Teams that invest in evaluation from day one ship systems that improve continuously and earn user trust.
If you are scoping a RAG build and want a pragmatic view on architecture, evaluation, and the data work that needs to happen first, AI Advisory builds and operates RAG systems for UK mid-market teams. Start a conversation and we will walk you through what a sensible first project looks like.
Ready to put this into production? Book a discovery call.