AI Workflow Agency

RAG Pipeline Development: A Practical Build Guide

How to design, build and evaluate a retrieval-augmented generation pipeline that works in production

By AI Advisory team

Most retrieval-augmented generation (RAG) projects fail in the same place: the demo works, then real users ask real questions and the answers drift. The model invents policy numbers, cites the wrong document, or refuses confidently when the answer is right there in the corpus. The fix is rarely a better model. It is a better pipeline.

This guide walks through how to build a RAG pipeline that survives contact with production traffic. It covers the four decisions that matter (ingestion, chunking, retrieval, generation), the evaluation harness you need before launch, and the operational costs people forget to budget for.

What RAG actually is, and when it is the right tool

Retrieval-augmented generation is a pattern, not a product. A user query goes to a retriever, which fetches relevant chunks from a knowledge base. Those chunks are passed to a language model along with the query, and the model generates an answer grounded in the retrieved text. The retriever is usually a vector search (semantic similarity), often combined with keyword search (BM25) for hybrid retrieval.

RAG is the right tool when three conditions hold: the knowledge changes often enough that fine-tuning is impractical, the answers need to be traceable to source documents, and the corpus is large enough that you cannot fit it in a single prompt. If your corpus fits in 200k tokens and rarely changes, just stuff it into the context window of a long-context model. If you need the model to learn a style or a structured output format, fine-tune. RAG is for grounded recall over a body of text that updates.

The pattern is well documented in the original 2020 paper from Lewis et al. at Facebook AI Research, and the architecture has not fundamentally changed since. What has changed is the quality of embedding models, the maturity of vector stores, and the evaluation tooling around it. Those are where most of the engineering work now lives.

Ingestion: the unglamorous half of the project

Roughly 60% of the effort on a RAG build goes into ingestion: getting documents out of their source systems, normalising them, extracting clean text, and keeping the index in sync as the source changes. The retrieval and generation code is the visible part. The plumbing is where projects slip.

For document-heavy corpora, you need to handle PDFs (including scanned PDFs that need OCR), Office documents, HTML pages, Confluence or Notion exports, ticket systems, and often a SharePoint or Google Drive feed. Each has its own failure mode. PDFs lose table structure unless you use a layout-aware parser like Unstructured, LlamaParse, or AWS Textract. HTML needs boilerplate stripping. Office documents need conversion that preserves headings, because headings are signal.

Three ingestion decisions to make explicitly:

  • Source of truth. Where does the canonical version live? If it is SharePoint, your pipeline polls or subscribes to webhooks there. Do not let two teams maintain two copies.
  • Update cadence. Real-time, hourly, daily, or weekly? Most B2B knowledge bases tolerate hourly. Anything claiming real-time should be challenged: it usually means daily plus a manual refresh button.
  • Deletion handling. When a document is removed at source, your index needs to know. Soft-delete in the vector store and reconcile against the source listing on each sync. Stale documents in the index are a top cause of bad answers post-launch.
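The deletion-handling step can be sketched in a few lines. This is a minimal illustration, assuming the index is represented as an in-memory dictionary and the source system can return the set of document IDs it currently holds; `IndexedDoc` and `reconcile_deletions` are hypothetical names, not part of any vector store's API.

```python
from dataclasses import dataclass


@dataclass
class IndexedDoc:
    doc_id: str
    deleted: bool = False  # soft-delete flag; retrieval should skip deleted docs


def reconcile_deletions(index: dict[str, IndexedDoc], source_ids: set[str]) -> list[str]:
    """Soft-delete every indexed document that no longer exists at source.

    Returns the IDs newly marked as deleted during this sync.
    """
    newly_deleted = []
    for doc_id, doc in index.items():
        if doc_id not in source_ids and not doc.deleted:
            doc.deleted = True
            newly_deleted.append(doc_id)
    return sorted(newly_deleted)


# Hypothetical example: "policy-2023" was removed from the source system.
index = {d: IndexedDoc(d) for d in ["policy-2023", "policy-2024", "faq"]}
removed = reconcile_deletions(index, source_ids={"policy-2024", "faq"})
print(removed)  # ['policy-2023']
```

Soft-deleting rather than removing rows outright keeps the sync reversible if the source listing was incomplete on a given run.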

Track metadata aggressively: source URL, last-modified timestamp, document owner, access-control tags, document type. You will need every field later for filtering, citation, and security.

Chunking: where most RAG projects bleed quality

Chunking is splitting documents into the pieces that get embedded and retrieved. It sounds trivial. It is the single biggest lever on answer quality.

The naive approach is fixed-size chunks (say, 512 tokens with 50 tokens of overlap). It works for homogeneous text. It fails on anything structured: contracts split mid-clause, tables get cut in half, code blocks lose their context. The output is a retriever that returns plausible-looking chunks that do not actually contain the answer.
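For reference, the fixed-size baseline looks roughly like this. It is a sketch that assumes the text has already been tokenised into a list (a real pipeline would use the embedding model's tokenizer); `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Stand-in tokens for illustration only.
tokens = [str(i) for i in range(1000)]
chunks = fixed_size_chunks(tokens)
print([len(c) for c in chunks])  # [512, 512, 76]
```

Nothing in this function knows about clauses, tables or code blocks, which is exactly why it cuts through them.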

Better strategies, in rough order of sophistication:

  • Recursive character splitting respecting paragraph and sentence boundaries. The default in LangChain and a sensible baseline.
  • Structure-aware chunking that respects headings, list items, and table boundaries. For Markdown or well-tagged HTML, this is straightforward. For PDFs, you need a layout parser first.
  • Semantic chunking that uses an embedding model to find natural topic shifts within a document and splits there.
  • Parent-child chunking (sometimes called small-to-big retrieval): embed small chunks for retrieval precision, but pass the larger parent section to the generator for context. This is usually the highest-impact pattern for technical documentation.
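The parent-child pattern is easier to see in code. The sketch below makes two simplifying assumptions: children are single sentences, and matching uses plain word overlap as a stand-in for embedding similarity. All names here (`Child`, `build_children`, `retrieve_parent`) are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent section this sentence came from


def build_children(parents: list[str]) -> list[Child]:
    """Split each parent section into sentence-sized child chunks."""
    children = []
    for parent_id, section in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", section.strip()):
            if sentence:
                children.append(Child(sentence, parent_id))
    return children


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_parent(query: str, parents: list[str], children: list[Child]) -> str:
    """Match on the small child chunk, but return the whole parent section."""
    q = words(query)
    best = max(children, key=lambda c: len(q & words(c.text)))
    return parents[best.parent_id]


parents = [
    "Refunds are issued within 14 days. Contact support to start a return.",
    "API keys rotate every 90 days. Rotation is triggered from the admin console.",
]
children = build_children(parents)
context = retrieve_parent("how often do API keys rotate", parents, children)
print(context)  # the full second section, not just the matching sentence
```

The generator receives the sentence about the admin console as well, even though only the first sentence of that section matched the query.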

Tune chunk size to your content. Legal and policy text tolerates larger chunks (800-1200 tokens) because clauses are interdependent. FAQs and product descriptions work better with smaller chunks (200-400 tokens). Run the same evaluation set across three chunk configurations before you commit. Differences of 10-15 points on retrieval recall are common.

Retrieval: hybrid by default

Pure vector search is good at semantic matching ("how do I cancel my subscription" finds "account termination procedure"). It is bad at exact-match recall (product codes, names, acronyms, version numbers). Pure keyword search (BM25) is the opposite. The fix is hybrid retrieval: run both, then combine.

The combination is usually reciprocal rank fusion (RRF), which merges two ranked lists without needing to calibrate the score scales. Most production-grade vector stores now support hybrid out of the box: Weaviate, Qdrant, Elasticsearch, OpenSearch, and pgvector with the right extensions all do this. For a default mid-market stack, Postgres with pgvector plus a tsvector full-text index is hard to beat - one database, no extra infrastructure, comfortable to 10 million chunks.
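Reciprocal rank fusion is simple enough to write by hand. The sketch below scores each document as the sum of 1 / (k + rank) across the ranked lists; k = 60 is the constant commonly used in the literature, and the document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs using only rank positions, not raw scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
keyword_hits = ["c", "a", "d"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise above those that appear in only one, without any need to compare cosine similarities against BM25 scores.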

Above the retriever, add a reranker. A cross-encoder model (Cohere Rerank, BAAI bge-reranker, or Voyage) re-scores the top 20-50 results and returns the top 5-10 to the generator. Rerankers typically lift answer quality by 10-20% on standard benchmarks and are cheap to run because they only see a small candidate set. If you add only one optimisation to your first build, make it this one.

Other retrieval patterns worth knowing:

  • Metadata filtering. Restrict retrieval by document type, date range, or access-control tag before the vector search runs. Essential for any multi-tenant or permissioned corpus.
  • Query rewriting. Use a small LLM to expand or decompose the user query before retrieval. A question like "compare our 2024 and 2025 returns policies" needs to become two retrievals, not one.
  • HyDE (hypothetical document embeddings). Generate a hypothetical answer with the LLM, embed it, and search with that embedding. Useful when user queries are short and your corpus is long-form.
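Of the patterns above, metadata filtering is the one most often implemented in the wrong order. A minimal sketch, assuming each chunk carries a set of access-control tags and that similarity is plain cosine over small in-memory vectors; `Chunk` and `filtered_search` are hypothetical names, and a real vector store would apply the filter inside its own query.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    embedding: list[float]
    acl_tags: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec: list[float], chunks: list[Chunk],
                    user_tags: set[str], top_k: int = 5) -> list[str]:
    """Drop chunks the user may not see BEFORE ranking, then rank by similarity."""
    allowed = [c for c in chunks if c.acl_tags & user_tags]
    ranked = sorted(allowed, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return [c.chunk_id for c in ranked[:top_k]]


chunks = [
    Chunk("hr-1", [1.0, 0.0], {"hr"}),
    Chunk("eng-1", [0.9, 0.1], {"engineering"}),
    Chunk("eng-2", [0.0, 1.0], {"engineering"}),
]
results = filtered_search([1.0, 0.0], chunks, user_tags={"engineering"})
print(results)  # ['eng-1', 'eng-2']
```

"hr-1" is the closest match to the query but never reaches the ranking step, so it cannot leak into the prompt.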

Generation, grounding, and refusal

The generation step is the easiest to get superficially right and the hardest to get robustly right. The prompt template needs to instruct the model to answer only from the provided context, cite the source chunks, and refuse when the context does not contain the answer.
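One way to express those three instructions in a prompt template is sketched below. The exact wording, the `[chunk_id]` citation format and the `build_prompt` helper are illustrative assumptions; treat them as a starting point to test against your own evaluation set.

```python
REFUSAL = "I do not have information about that in the current knowledge base."


def build_prompt(question: str, chunks: dict[str, str]) -> str:
    """Assemble a grounded prompt from retrieved chunks keyed by chunk ID."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks.items())
    return (
        "Answer the question using only the context below.\n"
        "Cite the chunk ID in square brackets after every claim.\n"
        f'If the context does not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How long do refunds take?",
    {"c1": "Refunds are issued within 14 days.", "c2": "Returns are free."},
)
print(prompt)
```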

Three guardrails that matter:

  • Explicit refusal patterns. Tell the model what to say when retrieval returns nothing relevant. "I do not have information about that in the current knowledge base" beats a hallucinated guess every time. Test this with deliberately out-of-scope queries.
  • Citation enforcement. Require the model to cite chunk IDs inline. Post-process the output to render those as links back to source documents. If a sentence has no citation, flag it.
  • Groundedness checks. A second LLM call (or a smaller classifier) verifies that each claim in the answer is supported by the retrieved chunks. This catches the residual hallucination rate, which for a well-built RAG pipeline on GPT-4 class models sits around 2-5% on adversarial test sets.
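Citation enforcement can be checked mechanically before any second model call. The sketch below assumes citations look like `[c1]` and sit before the sentence's closing punctuation; both the format and the `check_citations` helper are hypothetical.

```python
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict[str, list[str]]:
    """Flag sentences with no citation and citations that point at unknown chunks."""
    uncited, unknown = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = CITATION.findall(sentence)
        if not cited:
            uncited.append(sentence)
        unknown.extend(c for c in cited if c not in retrieved_ids)
    return {"uncited_sentences": uncited, "unknown_ids": unknown}


answer = "Refunds are issued within 14 days [c1]. Returns are free [c9]. Contact support to start."
report = check_citations(answer, retrieved_ids={"c1", "c2"})
print(report)
# {'uncited_sentences': ['Contact support to start.'], 'unknown_ids': ['c9']}
```

A check like this only proves that citations exist and resolve to retrieved chunks. Whether the cited chunk actually supports the claim is the job of the groundedness check.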

Model choice depends on latency and cost budgets. For most internal-facing assistants, a mid-tier model (GPT-4o-mini, Claude Haiku, Gemini Flash) handles RAG well because the heavy lifting is in the context, not the parametric knowledge. Reserve the flagship models for queries that require multi-step reasoning over the retrieved evidence.

Evaluation: the part teams skip and regret

You cannot improve what you do not measure, and RAG quality is not measurable by eyeballing. Build an evaluation harness before you build the assistant. The pattern that works:

  1. Collect 100-300 real queries from users, support tickets, sales calls, or domain experts. Label each with the expected answer and the source document(s) that contain it.
  2. Run the pipeline against this set on every change. Track three metrics: retrieval recall (did the right document appear in top-k?), answer correctness (judged by an LLM-as-judge against the labelled answer), and groundedness (is every claim supported?).
  3. Add adversarial cases: out-of-scope questions, ambiguous queries, queries that require combining two documents, queries with typos or jargon.
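The retrieval-recall metric from step 2 needs very little code. This sketch counts a case as a hit when at least one labelled source document appears in the top k results; `EvalCase`, `recall_at_k` and the fake retriever are hypothetical stand-ins for your labelled set and real pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    expected_doc_ids: set[str]  # documents labelled as containing the answer


def recall_at_k(cases: list[EvalCase], retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Fraction of cases where a labelled source document appears in the top k."""
    if not cases:
        return 0.0
    hits = sum(1 for case in cases if case.expected_doc_ids & set(retrieve(case.query)[:k]))
    return hits / len(cases)


# Fake retriever with canned rankings, for illustration only.
canned = {"q1": ["d1", "d7"], "q2": ["d3", "d4"], "q3": ["d9", "d5"]}
cases = [EvalCase("q1", {"d1"}), EvalCase("q2", {"d2"}), EvalCase("q3", {"d5"})]

recall_top2 = recall_at_k(cases, lambda q: canned[q], k=2)
recall_top1 = recall_at_k(cases, lambda q: canned[q], k=1)
print(round(recall_top2, 2), round(recall_top1, 2))  # 0.67 0.33
```

Running the same function across chunking or retrieval configurations gives a like-for-like comparison, which is what a CI gate needs.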

Tools like Ragas, DeepEval, and LangSmith automate most of this. Run the suite in CI so a chunking change or prompt edit cannot regress quality silently. The teams who treat this as optional spend their first six months in production firefighting; the teams who build it on day one ship confidently.

Operating costs and the budget conversation

RAG is not free to run. For a 5 million chunk corpus serving 1000 queries a day, expect roughly:

  • Embedding model calls: one-off cost for initial indexing (a few hundred pounds for the whole corpus on OpenAI text-embedding-3-large or similar), plus ongoing for updates.
  • Vector store hosting: £100-£500 per month for a managed service at this scale, or the cost of a Postgres instance you already run.
  • Generation: the dominant cost. At 4000 tokens in and 500 tokens out per query on a mid-tier model, expect 1-3p per query, so £300-£900 per month at 1000 queries a day.
  • Reranker: typically a tenth of a penny per query.
  • Evaluation runs: if you run the full suite on every deploy, budget for it. A 300-question suite costs a few pounds per run on a flagship model as judge.
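The generation estimate above is straightforward arithmetic, shown here so you can substitute your own traffic and per-query figures. The 30-day month and the `monthly_cost_pounds` helper are assumptions made for the illustration.

```python
def monthly_cost_pounds(queries_per_day: int, pence_per_query: float, days: int = 30) -> float:
    """Monthly spend in pounds for a cost that is charged per query."""
    return queries_per_day * days * pence_per_query / 100


# Generation at 1-3p per query, 1000 queries a day.
generation_low = monthly_cost_pounds(1000, 1)
generation_high = monthly_cost_pounds(1000, 3)
# Reranking at a tenth of a penny per query.
reranker = monthly_cost_pounds(1000, 0.1)

print(generation_low, generation_high)  # 300.0 900.0
print(round(reranker, 2))               # 30.0
```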

The per-query costs (generation and reranking) scale linearly with traffic; embedding and hosting costs track corpus size and update volume instead. The trap is treating RAG as a one-off build cost; in practice, ongoing model spend, monitoring, and quarterly re-evaluation of chunking and retrieval strategy account for most of the lifetime cost of the system.

A reference stack for a first build

If you are building your first production RAG pipeline and want a defensible default, this stack will not embarrass you:

  • Ingestion: Unstructured or LlamaParse for documents, custom connectors for SaaS sources, orchestrated in n8n or Airflow.
  • Storage: Postgres with pgvector for vectors and metadata; S3 or equivalent for raw documents.
  • Embeddings: OpenAI text-embedding-3-large or Voyage voyage-3 for English; multilingual-e5 if you need open-weights.
  • Retrieval: hybrid (pgvector + Postgres full-text), top 30 candidates, reranked to top 6 with Cohere Rerank or bge-reranker-v2.
  • Generation: Claude Haiku, GPT-4o-mini, or Gemini Flash by default; flagship model for complex queries via a router.
  • Orchestration: LlamaIndex or LangChain for the retrieval graph; FastAPI for the service layer.
  • Evaluation: Ragas in CI, LangSmith for production traces.

This is not the only valid stack, but every component is mature, well-documented, and replaceable. Start here, run for three months, then optimise based on what your evaluation data tells you.

Frequently asked questions

How long does a RAG pipeline take to build?

A first production-grade RAG assistant for a single corpus typically takes 8-14 weeks from kickoff. The first two weeks are discovery, source-system access, and assembling the evaluation set. Weeks three to six cover ingestion, chunking experiments, and the baseline retrieval pipeline. Weeks seven to ten add reranking, refusal patterns, citation, and the production UI. The final weeks are user testing, evaluation-driven tuning, and rollout. Simple internal knowledge-base assistants can ship faster (4-6 weeks) if the source is clean Confluence or Notion. Multi-source assistants with PDF-heavy corpora and access controls take longer.

Should we fine-tune a model instead of using RAG?

For grounded recall over a changing knowledge base, no. Fine-tuning teaches a model style, format, or behaviour - not facts. Facts in fine-tuned weights cannot be cited and cannot be updated without retraining, and the fine-tuned model still hallucinates. The two patterns combine well: use RAG for the knowledge and fine-tune only if you also need a specific output structure or domain tone that prompting cannot achieve reliably. In our experience, fewer than one in five RAG projects benefits from any fine-tuning, and almost none need it before the second iteration of the system.

How accurate is RAG in production?

A well-built RAG pipeline with hybrid retrieval, reranking, and refusal patterns typically achieves 85-95% answer correctness on in-domain queries, with hallucination rates of 2-5% on adversarial tests. Accuracy depends heavily on the quality of the source corpus: if the answer is not in the documents, no pipeline can find it. The biggest gains usually come from cleaning the corpus and improving chunking, not from changing the model or the vector store. Track correctness over time on a stable evaluation set, because retrieval quality drifts as the corpus grows.

What about data security and GDPR?

For UK and EU deployments, the main concerns are data residency, processor agreements, and access controls within the retrieval layer. Use UK or EU-region endpoints for embedding and generation models (OpenAI, Anthropic, Google, AWS Bedrock all offer these). Sign DPAs with each provider. Tag every chunk with the access-control metadata of its source document and enforce filtering at retrieval time, so a user only retrieves chunks they are entitled to see. The Information Commissioner's Office has published guidance on AI and data protection that covers the controller/processor analysis for systems like these.

How do we prevent hallucinations?

Layered defences, not a single fix. First, retrieval has to actually return the right context - measure retrieval recall and improve it before touching the generator. Second, prompt the model to answer only from context and to refuse when context is insufficient. Third, require inline citations and post-process to verify they exist. Fourth, run a groundedness check on outputs, either with a smaller verifier model or with structured claim extraction. Fifth, monitor production traces and feed failure cases back into the evaluation set. Hallucination rates below 5% are realistic; zero is not, and any vendor claiming zero is selling something.

Can we build this with no-code tools?

For a proof of concept on a small corpus, yes. Tools like n8n, Flowise, and the various managed RAG platforms can stand up a working pipeline in a day or two. For production at any scale, no-code hits limits quickly: evaluation tooling is weak, chunking strategies are fixed, hybrid retrieval is often unavailable, and you cannot reproduce a customer-specific bug because you cannot inspect the pipeline state. A common pattern is to prototype in no-code to validate the use case, then rebuild the production version in code with the same retrieval logic.

Vector database: which one should we use?

For most mid-market builds, Postgres with pgvector is the right starting point: you almost certainly already run Postgres, it adds almost no operational burden, and it scales to tens of millions of chunks. Move to a dedicated vector store (Qdrant, Weaviate, Pinecone) when you need sub-50ms p99 latency at very high QPS, advanced filtering at scale, or features like multi-tenancy isolation that pgvector handles less elegantly. Do not start with the dedicated tool; start with what you can operate, and switch when measurements tell you to.

How do we keep the index in sync with source systems?

Event-driven where possible, scheduled where not. SharePoint, Google Drive, Confluence, and most modern SaaS sources expose webhooks or change feeds; subscribe to these and process deltas. For systems without change feeds, schedule a daily or hourly poll that compares last-modified timestamps and re-indexes only what changed. Always handle deletions explicitly: on each sync, reconcile your index against the source's current document list and remove anything no longer present. Build a small admin view that shows index health (document count, last sync time, recent failures) so non-engineers can spot drift.
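For sources without change feeds, the scheduled poll reduces to comparing two listings. A minimal sketch, assuming both the source and the index can report a last-modified timestamp per document ID; `plan_sync` is a hypothetical helper, and the IDs and dates are invented.

```python
from datetime import datetime, timezone


def plan_sync(source: dict[str, datetime], indexed: dict[str, datetime]) -> dict[str, list[str]]:
    """Decide what to re-index and what to delete by comparing last-modified times."""
    reindex = sorted(
        doc_id for doc_id, modified in source.items()
        if doc_id not in indexed or modified > indexed[doc_id]
    )
    delete = sorted(doc_id for doc_id in indexed if doc_id not in source)
    return {"reindex": reindex, "delete": delete}


jan = datetime(2025, 1, 1, tzinfo=timezone.utc)
feb = datetime(2025, 2, 1, tzinfo=timezone.utc)

source = {"a": feb, "b": jan, "c": jan}   # current listing at the source system
indexed = {"a": jan, "b": jan, "d": jan}  # what the index saw at the last sync

plan = plan_sync(source, indexed)
print(plan)  # {'reindex': ['a', 'c'], 'delete': ['d']}
```

Document "b" is untouched, so it is neither re-embedded nor re-written, which keeps the ongoing embedding cost proportional to what actually changed.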

Closing

RAG is a mature pattern with well-understood failure modes. The difference between a pipeline that demos well and one that ships is rarely about model choice; it is about ingestion discipline, chunking strategy, hybrid retrieval, and an evaluation harness that runs on every change. Get those four right and the rest is plumbing.

If you are scoping a RAG build and want a second opinion on the architecture, the evaluation strategy, or the realistic budget, AI Advisory designs and ships these systems for mid-market businesses across the UK. Get in touch to talk through your corpus and use case.
