AI Workflow Agency
RAG Architecture Explained: Components, Patterns, and Production Tradeoffs

A practitioner's guide to RAG architecture: retrieval, embeddings, vector stores, generation, evaluation, and the patterns that actually work in production

By AI Advisory team

Retrieval-Augmented Generation (RAG) is the architecture pattern that lets a large language model answer questions using your own data without retraining the model. It is the dominant pattern behind production AI assistants, internal knowledge bots, and grounded customer-facing chat systems. It is also the pattern most teams get wrong on the first attempt, because the failure modes hide in the retrieval layer rather than the model itself.

This article walks through what RAG architecture actually is, the components involved, the common patterns (naive, hybrid, agentic, graph), the evaluation work that separates a demo from a production system, and the tradeoffs you will face when you build one. It assumes you are technically literate but have not yet shipped a RAG system to production.

What RAG actually is, and why it exists

A large language model like GPT-4, Claude, or Llama 3 has two limitations that matter for business use. First, its knowledge stops at a training cutoff date. Second, it has never seen your private data - your contracts, support tickets, product docs, policies, or CRM records. Fine-tuning can address some of this, but it is expensive, slow to update, and a poor fit for facts that change often.

RAG sidesteps both problems. At query time, the system retrieves the most relevant chunks of your data and inserts them into the model's context window alongside the user's question. The model then generates an answer grounded in those retrieved chunks rather than from its parametric memory alone. The pattern was first formalised in Lewis et al. (Facebook AI Research, now Meta AI, 2020), and has since become the default architecture for enterprise AI assistants.

The mental model worth holding: RAG is a search engine bolted to a writer. The search engine finds relevant passages, the writer synthesises them into an answer. If the search engine returns garbage, the writer produces garbage in fluent prose. Most failed RAG projects fail at search, not generation.

The core components of a RAG system

Every RAG architecture, regardless of vendor or stack, has the same six components. Understanding each one is the difference between a system that works and a system that hallucinates confidently.

1. Document ingestion and chunking

Source documents - PDFs, Confluence pages, Notion exports, database rows, transcripts - are loaded and split into smaller passages called chunks. Chunk size is a critical hyperparameter: too small and you lose context, too large and you dilute retrieval relevance. A typical starting point is 500-1000 tokens per chunk with 10-20% overlap, but the right answer depends on your content. Legal contracts need bigger chunks than support FAQs.
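The overlap idea is easier to see in code. This is a minimal sketch, not a production chunker: it assumes the text has already been tokenised into a list, and the function name chunk_tokens and the defaults (800 tokens, 120 overlap, i.e. 15%) are illustrative choices inside the range quoted above.

```python
def chunk_tokens(tokens, chunk_size=800, overlap=120):
    """Split a token list into windows of chunk_size that share `overlap` tokens.

    Illustrative defaults only; tune both values against your own content.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokeniser here.
    words = ("lorem ipsum " * 1000).split()
    pieces = chunk_tokens(words)
    print(len(pieces), [len(p) for p in pieces])
```

Fixed-size windows are the simplest baseline. Splitting on headings or paragraphs first, then applying a window like this inside each section, is usually a better fit for structured documents.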

This step is also where most data quality problems are introduced. PDFs with multi-column layouts, scanned images, tables, and footnotes routinely produce garbled text. Tools like Unstructured, LlamaParse, and Azure Document Intelligence exist because raw text extraction is genuinely hard.

2. Embeddings

Each chunk is converted into a vector - typically 768 to 3072 floating-point numbers - using an embedding model. The vector represents the semantic meaning of the chunk such that semantically similar text produces nearby vectors. OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source options like BGE and Nomic are common choices. The MTEB leaderboard tracks current performance.

Embedding choice matters more than people expect. A bad embedding model puts unrelated content near your query in vector space, and no amount of clever prompting downstream will rescue retrieval. Domain-specific embeddings (medical, legal, code) can outperform general-purpose ones by 10-20% on relevant benchmarks.
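"Nearby" is most often measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and the variable names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors, invented for this example.
    query = [0.9, 0.1, 0.0]
    refund_chunk = [0.8, 0.2, 0.1]
    shipping_chunk = [0.1, 0.2, 0.9]
    print(cosine_similarity(query, refund_chunk))
    print(cosine_similarity(query, shipping_chunk))
```

With a good embedding model the refund chunk scores far higher than the shipping chunk for a refund question. With a poor one the gap narrows, which is exactly the failure described above.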

3. Vector store

The vectors and their source text are stored in a vector database optimised for similarity search. Options range from Postgres with the pgvector extension (our default for most mid-market builds) to dedicated stores like Pinecone, Weaviate, Qdrant, and Milvus. For small corpora under a few million chunks, pgvector is usually sufficient and avoids adding a new piece of infrastructure. For high-throughput or filtered search at scale, a dedicated store earns its place.

4. Retriever

At query time, the user's question is embedded using the same model used at ingestion, and the vector store returns the top-k most similar chunks (typically k=5 to 20). This is the heart of RAG. Pure vector (dense) retrieval often misses exact-match queries - product codes, names, acronyms - so production systems almost always combine dense retrieval with keyword (sparse) search like BM25. This is called hybrid retrieval.
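As a rough illustration of the dense half, here is top-k retrieval as an exhaustive scan over an in-memory index. The dict-based index, chunk ids, and vectors are invented for the example; a real vector store typically replaces the full scan with an approximate nearest-neighbour index.

```python
import heapq
import math


def _normalise(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, index, k=5):
    """Return the k most similar (chunk_id, score) pairs, best first.

    `index` is a plain dict of chunk_id -> vector, a stand-in for a vector store.
    """
    q = _normalise(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, _normalise(vec))), chunk_id)
        for chunk_id, vec in index.items()
    )
    best = heapq.nlargest(k, scored)  # compares on score first, then chunk_id
    return [(chunk_id, score) for score, chunk_id in best]


if __name__ == "__main__":
    toy_index = {
        "refund-policy": [0.8, 0.2, 0.1],
        "returns-faq": [0.6, 0.5, 0.1],
        "shipping-times": [0.1, 0.2, 0.9],
    }
    print(top_k([0.9, 0.1, 0.0], toy_index, k=2))
```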

5. Re-ranker (optional but usually worth it)

Top-k retrieval is fast but imprecise. A re-ranker - a smaller cross-encoder model like Cohere Rerank or BGE-reranker - takes the top 20-50 candidates from retrieval and reorders them by genuine relevance to the query. Adding a re-ranker typically improves answer quality by 10-30% in our internal evaluations, at a latency cost of 100-300ms per query. For anything customer-facing, it is usually a worthwhile trade.
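The two-stage shape looks roughly like this. The scoring function is deliberately pluggable: toy_overlap_score below is a crude word-overlap stand-in so the sketch runs without a model, and it is not how a cross-encoder scores. In practice you would pass a function that calls your re-ranking model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_n, best first."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable for ties
    return scored[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Fraction of query words found in the passage. Placeholder scorer only."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    first_stage = [
        "shipping takes five days",
        "refund requests are accepted within 30 days",
        "our office is in Leeds",
    ]
    print(rerank("how do refund requests work", first_stage, toy_overlap_score, top_n=2))
```

Because the re-ranker scores each query-passage pair separately, its cost grows with the number of candidates. That is why it is run on the top 20-50 rather than the whole corpus.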

6. Generator

The final step. The top re-ranked chunks are inserted into a prompt template alongside the user's question and any system instructions, and sent to the LLM for generation. The prompt typically instructs the model to answer only from the provided context, to cite which chunk supported each claim, and to refuse if the answer is not present. Good refusal behaviour is what separates a trustworthy RAG system from a confident liar.
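A prompt template along those lines might look like the sketch below. The wording, the refusal string, and the chunk dict shape (id and text keys) are assumptions for illustration, not a recommended canonical prompt; expect to iterate on the wording against your evaluation set.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    `chunks` is a list of dicts with 'id' and 'text' keys (a hypothetical shape).
    Each chunk is tagged with its id so the model can cite it.
    """
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer the question using only the context below.\n"
        "Cite the chunk id in square brackets after each claim.\n"
        'If the context does not contain the answer, reply exactly: "I do not know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    retrieved = [
        {"id": "doc-1", "text": "Refunds are accepted within 30 days of purchase."},
        {"id": "doc-2", "text": "Refunds are paid to the original payment method."},
    ]
    print(build_prompt("What is the refund window?", retrieved))
```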

The common architectural patterns

RAG has evolved from a single pattern into a family of patterns, each suited to different problem shapes.

Naive RAG

The textbook implementation: chunk, embed, retrieve top-k, stuff into prompt, generate. It works for narrow domains with clean documents and simple questions. It breaks on multi-hop questions ("what was our refund policy in 2023 versus now?"), ambiguous queries, and large heterogeneous corpora. Most proof-of-concept RAG demos are naive RAG; most of them do not survive contact with real users.

Hybrid retrieval RAG

Dense vector search combined with BM25 keyword search, results merged via reciprocal rank fusion or a learned ranker. This handles exact-match cases (product SKUs, error codes, names) that pure embeddings struggle with. Almost every production system we ship uses hybrid retrieval as the baseline. Elasticsearch, OpenSearch, and Weaviate all support hybrid natively; with pgvector you combine it with Postgres full-text search.
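Reciprocal rank fusion is simple enough to sketch in a few lines: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k=60 is the value commonly used since the original RRF paper; the chunk ids below are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first.

    `rankings` is a list of lists, each ordered best-first (e.g. one from
    dense search, one from BM25). Ties are broken alphabetically by id.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense = ["refund-policy", "returns-faq", "shipping-times"]
    sparse = ["sku-4471", "refund-policy", "returns-faq"]
    print(reciprocal_rank_fusion([dense, sparse]))
```

RRF only looks at ranks, never raw scores, so it avoids the problem of cosine similarities and BM25 scores living on different scales. Documents that both retrievers rank well rise to the top, while an exact-match hit found only by BM25 still makes the merged list.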

Query transformation RAG

The user's raw query is rewritten before retrieval. Patterns include HyDE (generate a hypothetical answer, embed that, retrieve against it), multi-query expansion (generate three variations of the question and retrieve for each), and step-back prompting (generate a more abstract version of the question first). These help when user queries are short, ambiguous, or use different vocabulary than the source documents.
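Multi-query expansion, for instance, reduces to a small loop. In this sketch generate_variants and retrieve are hypothetical callables you would supply (an LLM call that rewrites the question, and your retriever); the function only shows how the results are merged and de-duplicated.

```python
from typing import Callable, Iterable, List


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], Iterable[str]],
    retrieve: Callable[[str, int], List[str]],
    per_query_k: int = 5,
) -> List[str]:
    """Retrieve for the original question plus its rewrites, de-duplicated.

    Chunk ids are kept in first-seen order, so results for the original
    question come before results found only by a rewrite.
    """
    queries = [question] + list(generate_variants(question))
    seen = set()
    merged = []
    for query in queries:
        for chunk_id in retrieve(query, per_query_k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


if __name__ == "__main__":
    # Stubs standing in for an LLM rewriter and a real retriever.
    canned = {
        "what is the refund window?": ["c1", "c2"],
        "refund rules": ["c2", "c3"],
        "money back policy": ["c4"],
    }
    print(multi_query_retrieve(
        "what is the refund window?",
        lambda q: ["refund rules", "money back policy"],
        lambda q, k: canned.get(q, [])[:k],
    ))
```

First-seen ordering is the simplest merge. A fusion step over the per-query rankings, or a re-ranker over the merged set, is a common refinement.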

Agentic RAG

Instead of a fixed retrieve-then-generate pipeline, an agent decides what to retrieve, when, and how many times. It can issue multiple sub-queries, search different indexes, call tools, and reason over intermediate results. Frameworks like LangGraph and LlamaIndex's agent modules, together with native function calling in model APIs such as OpenAI's, make this practical. Agentic RAG is more powerful for complex questions but slower, more expensive, and harder to evaluate. Reserve it for cases where naive or hybrid RAG demonstrably underperforms.

Graph RAG

Documents are pre-processed into a knowledge graph of entities and relationships, and retrieval traverses the graph rather than purely matching vectors. Microsoft Research's GraphRAG paper (2024) showed substantial gains on questions requiring synthesis across many documents. The trade is significant: graph construction is expensive and brittle to schema changes. Worth considering for research, intelligence, and analyst-style use cases where questions span the whole corpus.

Evaluation: the part most teams skip

If you cannot measure your RAG system, you cannot improve it, and you certainly cannot defend it to a sceptical compliance officer. Evaluation has two layers.

Retrieval evaluation measures whether the right chunks were returned for a given question. The metrics are precision@k, recall@k, and mean reciprocal rank. You need a labelled set of question-to-chunk mappings - typically 100-500 examples produced by subject matter experts, or bootstrapped with an LLM and human-verified.
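The three retrieval metrics are short enough to write by hand, which also makes their definitions concrete. This is a minimal sketch with invented chunk ids: `retrieved` is the ranked list the system returned, and `relevant` is the labelled set of correct chunks for that question.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant hit per question (0 if none).

    `results` is a list of (retrieved_list, relevant_set) pairs.
    """
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in results:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)


if __name__ == "__main__":
    retrieved = ["c1", "c7", "c3", "c9"]
    relevant = {"c3", "c4"}
    print(precision_at_k(retrieved, relevant, 3))
    print(recall_at_k(retrieved, relevant, 3))
    print(mean_reciprocal_rank([(retrieved, relevant), (["c4", "c2"], {"c4"})]))
```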

Generation evaluation measures whether the final answer is faithful (grounded in retrieved context, no hallucinations), relevant (answers what was asked), and complete (covers the salient points). Frameworks like RAGAS, TruLens, and DeepEval automate much of this using LLM-as-judge patterns, but you still need a human-reviewed golden set as the ground truth.

Our rule of thumb: spend at least 25% of build time on evaluation infrastructure. The teams that ship reliable RAG systems are the ones with an evaluation harness running on every change. The teams that ship hallucinating RAG systems are the ones who eyeballed it on five questions and pushed to production.

Production concerns that determine whether it works

The architecture diagram is the easy part. These are the questions that decide whether a RAG system survives in production.

Freshness. How quickly do document changes propagate to the vector store? Webhook-driven re-indexing is the right pattern; nightly batch re-embedding is a smell. For sources like Confluence, Notion, or your CRM, build event-driven ingestion from day one.
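One way to keep event-driven re-indexing cheap is to hash each chunk and re-embed only what changed. The sketch below shows that planning step; the function names and the idea of storing a hash per chunk id alongside the vector are assumptions for illustration, and the webhook handler that would call it is not shown.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_chunks, indexed_hashes):
    """Work out which chunks need re-embedding and which should be deleted.

    current_chunks: dict of chunk_id -> text, as the document looks now.
    indexed_hashes: dict of chunk_id -> hash stored when it was last indexed.
    Returns (to_embed, to_delete) as sorted lists of chunk ids.
    """
    to_embed = sorted(
        chunk_id
        for chunk_id, text in current_chunks.items()
        if indexed_hashes.get(chunk_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(
        chunk_id for chunk_id in indexed_hashes if chunk_id not in current_chunks
    )
    return to_embed, to_delete


if __name__ == "__main__":
    stored = {
        "page-1#0": content_hash("old intro"),
        "page-1#1": content_hash("unchanged body"),
        "page-1#2": content_hash("removed appendix"),
    }
    now = {
        "page-1#0": "new intro",
        "page-1#1": "unchanged body",
        "page-1#3": "added section",
    }
    print(plan_reindex(now, stored))
```

Handling deletions matters as much as handling edits: a chunk that lingers in the index after its source was removed will keep being retrieved and cited.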

Access control. If user A cannot see document X in the source system, the RAG system must not surface document X to user A. This means storing per-chunk ACLs in the vector store and filtering at retrieval time. Building this in retrospectively is painful; design it in from the start. The ICO's guidance on AI and data protection is the relevant reference for UK deployments.
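The filtering rule itself is small. This sketch assumes a group-based permission model and an invented Chunk type; your source system's ACL model may be richer (per-user grants, deny rules, inheritance).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: FrozenSet[str]  # copied from the source system at ingestion


def filter_by_acl(candidates: Iterable[Chunk], user_groups: Iterable[str]) -> List[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    groups = set(user_groups)
    return [c for c in candidates if c.allowed_groups & groups]


if __name__ == "__main__":
    chunks = [
        Chunk("hr-001", "Salary bands for 2024", frozenset({"hr"})),
        Chunk("kb-014", "How to reset your password", frozenset({"all-staff", "hr"})),
    ]
    visible = filter_by_acl(chunks, ["all-staff"])
    print([c.chunk_id for c in visible])
```

In a real deployment the same condition is better applied inside the vector store query, as a metadata filter or SQL WHERE clause, so that the top-k is computed over permitted chunks only. Filtering after retrieval, as here, can leave a user with fewer than k results.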

Citations and traceability. Every generated answer should cite the source chunks it was based on, with deep links back to the original document. This is the single biggest driver of user trust we see in deployed systems, and it makes auditing tractable.

Refusal behaviour. The system must reliably say "I do not know" when the answer is not in the retrieved context. Naive prompting often fails this; you need to test refusal explicitly with adversarial questions, and tune the prompt and retrieval threshold together.
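A retrieval threshold can be sketched as a gate in front of the generator: if nothing scores above the cut-off, refuse without calling the model at all. The 0.5 threshold is a placeholder, not a recommendation, and `generate` is a hypothetical callable wrapping your LLM call.

```python
REFUSAL = "I do not know."


def select_context(scored_chunks, min_score=0.5, max_chunks=5):
    """Keep chunks scoring at least min_score, best first, capped at max_chunks.

    `scored_chunks` is a list of (chunk_id, score) pairs from retrieval.
    """
    kept = [(chunk_id, score) for chunk_id, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_chunks]


def answer_or_refuse(scored_chunks, generate, min_score=0.5):
    """Refuse when retrieval found nothing above the threshold; otherwise generate."""
    context = select_context(scored_chunks, min_score)
    if not context:
        return REFUSAL
    return generate(context)


if __name__ == "__main__":
    def fake_generate(context):
        return "answer based on " + ", ".join(cid for cid, _ in context)

    print(answer_or_refuse([("c1", 0.82), ("c2", 0.31)], fake_generate))
    print(answer_or_refuse([("c9", 0.12)], fake_generate))
```

The threshold only catches questions where retrieval comes back empty-handed. The prompt-level refusal instruction is still needed for cases where retrieved chunks look relevant but do not contain the answer.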

Cost. A RAG query typically costs 5-50x a plain LLM call because of the retrieved context tokens. At scale, embedding storage, re-ranker latency, and inference costs compound. Caching frequent queries, using smaller models for routing, and tuning chunk size all matter. Budget for monthly running costs to be 20-40% of build costs in the first year.
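The multiplier is mostly input-token arithmetic. The sketch below compares the input tokens of a RAG call with the same call minus the retrieved context; all the numbers are hypothetical, and it ignores output tokens, re-ranking, and embedding costs.

```python
def input_token_multiplier(base_prompt_tokens, chunk_tokens, num_chunks):
    """Ratio of RAG input tokens to plain-call input tokens.

    base_prompt_tokens: system instructions plus the user's question.
    chunk_tokens: average size of one retrieved chunk.
    num_chunks: how many chunks are inserted into the prompt.
    """
    if base_prompt_tokens <= 0:
        raise ValueError("base_prompt_tokens must be positive")
    rag_tokens = base_prompt_tokens + chunk_tokens * num_chunks
    return rag_tokens / base_prompt_tokens


if __name__ == "__main__":
    # Hypothetical figures: a 400-token prompt plus 8 chunks of 750 tokens each.
    print(input_token_multiplier(400, 750, 8))
```

Halving the number of chunks, or the chunk size, roughly halves the context share of the bill. That is why re-ranking down to a small, high-quality context pays for itself at volume.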

Multi-modal content. If your documents contain tables, charts, and images that carry meaning, text-only RAG will miss them. Vision-language models like GPT-4o and Claude 3.5 Sonnet can read images passed into the prompt alongside text; ColPali and similar approaches index page images directly. Worth evaluating if your source material is visually dense.

When RAG is not the right answer

RAG is the default for grounded question-answering over private corpora. It is not the right tool for everything.

If the task is generation in a specific style or format (legal drafting, brand voice writing), fine-tuning may serve better. If the data is small and stable enough to fit in the model's context window directly, long-context prompting with caching (Anthropic and Google both support this) can outperform RAG with less infrastructure. If the task is mathematical or logical reasoning, tool use and code execution beat retrieval. If the task is structured data lookup, generating SQL against a database beats embedding the database.

The honest framing: RAG is one pattern in a toolkit. Build a clear evaluation of the alternatives before committing.

Frequently asked questions

What is the difference between RAG and fine-tuning?

RAG keeps the model unchanged and supplies relevant information at query time through retrieval. Fine-tuning modifies the model's weights using training examples. RAG is better for factual knowledge that changes (policies, product information, support content) because you update the document store, not the model. Fine-tuning is better for style, format, and behaviour - teaching the model how to respond, not what to know. Most production systems we build use RAG as the foundation and add light fine-tuning only where a specific output format or refusal pattern needs to be locked in. The two are complementary, not competing.

How much does it cost to build a production RAG system?

For a focused first build - one corpus, one user group, one channel - expect £25,000 to £80,000 for the build phase, with running costs of £500 to £5,000 per month depending on query volume and model choice. Enterprise-scale builds with multiple data sources, complex access control, and high availability requirements run £80,000 to £250,000. The major cost drivers are data ingestion complexity (clean APIs versus scraping PDFs), evaluation harness development, and integration with existing systems. Avoid vendors quoting under £15,000 for a real production system; that price point only works for thin wrappers around a vendor tool.

Which vector database should I choose?

For most mid-market builds, Postgres with the pgvector extension is the right starting point. It avoids adding new infrastructure, supports hybrid search via full-text indexing, handles tens of millions of vectors comfortably, and lets you keep ACLs and metadata in the same database. Move to a dedicated vector store - Pinecone, Weaviate, Qdrant, or Milvus - when you need sub-100ms latency at very high query volumes, complex filtered search across hundreds of millions of vectors, or features like multi-tenancy isolation. Choosing a dedicated store on day one without these requirements adds operational overhead you do not need.

How do I stop a RAG system from hallucinating?

Hallucinations in RAG come from two places: the model ignoring the provided context, or the retrieval returning irrelevant context that the model then synthesises plausibly. Fix retrieval first with hybrid search, re-ranking, and chunk-size tuning - most hallucinations are actually retrieval failures dressed up. Then tighten generation with a strict prompt that instructs refusal when context is insufficient, low temperature settings, and per-claim citation requirements. Finally, run an evaluation harness with faithfulness scoring (RAGAS or similar) on every deployment. You will not eliminate hallucinations entirely, but a well-built RAG system should hold faithfulness above 95% on in-domain questions.

How long does it take to build a RAG system?

A working prototype with one data source takes one to three weeks. A production system with proper evaluation, access control, monitoring, and integration takes 8 to 16 weeks. The discovery and data preparation phase is usually the longest: getting clean access to source documents, understanding the access control model, and building the labelled evaluation set with subject matter experts. The actual retrieval and generation pipeline is often the fastest part. Be sceptical of timelines that promise a production system in under six weeks unless the corpus is small and clean and the use case is genuinely narrow.

Is RAG GDPR-compliant for UK deployments?

RAG itself is a technical pattern and is not inherently compliant or non-compliant. Compliance depends on how you handle personal data within it. The key considerations under UK GDPR are: having a lawful basis for processing personal data in source documents; ensuring retrieval honours data subject access rights and deletion requests (you must be able to remove a person's data from the vector store, not just the source); avoiding personal data being sent to LLM providers that process it outside the UK or EU without adequate safeguards; and maintaining audit logs for accountability. The ICO has published specific guidance on AI and data protection that is the right reference point. For sensitive deployments, self-hosted models and on-premise vector stores remove the cross-border processing concern entirely.

Can I use RAG with open-source models instead of OpenAI or Anthropic?

Yes, and for many use cases this is the right choice. Llama 3.1, Mistral, Qwen, and DeepSeek now offer commercial-friendly licences and quality competitive with proprietary models for many tasks. Self-hosting removes the data residency and lock-in concerns, and at high query volumes can be substantially cheaper. The trade is operational: you take on model serving, GPU infrastructure, and the engineering work to keep up with model releases. For most mid-market builds we recommend starting with a hosted frontier model to validate the use case, then evaluating self-hosted alternatives once usage patterns are clear and the cost case is real.

How does RAG handle non-text content like tables and images?

Standard text-based RAG struggles with tables (where layout carries meaning) and ignores images entirely. Three approaches address this. First, table-aware extraction tools like Unstructured and Azure Document Intelligence convert tables into structured markdown or HTML that preserves relationships. Second, vision-language models like GPT-4o and Claude 3.5 Sonnet can process images directly when retrieved alongside text. Third, document-image retrieval approaches like ColPali index page-level images and let the model see the original layout. For documents where charts, diagrams, and tables are central to meaning - financial reports, scientific papers, engineering specs - one of these approaches is essential, not optional.

Closing thoughts

RAG is the architecture that finally makes large language models useful on private data, but it rewards careful engineering and punishes shortcuts. The components are simple individually; the work is in chunking decisions, retrieval quality, evaluation rigour, access control, and refusal behaviour. Get those right and you have a trustworthy system. Skip them and you have a confident liar.

If you are scoping a RAG build and want a second opinion on architecture choices, evaluation strategy, or vendor selection, AI Advisory runs short discovery engagements specifically for this. Get in touch to discuss your use case.
