AI Workflow Agency

RAG in Business: What Retrieval-Augmented Generation Actually Does

A practical explainer of retrieval-augmented generation for business: how RAG works, where it pays back, what it costs, and how to deploy it safely

By AI Advisory team

Retrieval-augmented generation, or RAG, is the architecture most businesses end up using when they want a language model to answer questions about their own data. It is not a product, not a model, and not a vendor. It is a pattern: fetch relevant content from a source you control, hand it to a language model along with the question, and have the model write an answer grounded in what you fetched.

That definition sounds simple, and the basic version is. What makes RAG worth understanding in detail is everything that sits around it: how you chunk and index documents, how retrieval actually works, where it breaks, what it costs to run, and how it interacts with GDPR and information security obligations. This article covers each of those, with enough specificity to make a build-or-buy decision.

What RAG actually is

A large language model like GPT-4, Claude, or Llama 3 only knows what was in its training data. Ask it about your company's refund policy, your contract templates, your supplier list, or last quarter's board pack, and it will either refuse or hallucinate. RAG fixes that by inserting a retrieval step before generation.

The flow looks like this. A user asks a question. The system converts the question into a vector embedding - a numerical representation of meaning. It searches a vector database for chunks of your documents whose embeddings are mathematically close to the question's embedding. It pulls the top matches, typically five to twenty short passages, and inserts them into a prompt along with the original question. The language model then generates an answer using those passages as evidence.
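
The retrieval half of that flow can be sketched in a few lines. This is a toy illustration, not a production design: the `embed` function below is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample chunks are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks whose vectors are closest to the question's."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical corpus, for illustration only.
chunks = [
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Our office is closed on bank holidays.",
    "Suppliers are reviewed every quarter.",
]
top = retrieve("What is the refund policy?", chunks, k=1)
```

In a real system the top passages would then be inserted into the prompt; a learned embedding model matches on meaning rather than on shared words, which this toy version cannot do.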

The result is an answer that cites your data rather than the open internet. Done well, the model will refuse to answer when the retrieved passages do not contain the information, which is the behaviour you want in any business context where wrong answers cost money or attract regulatory attention.

RAG sits between two alternatives. On one side is prompting alone, which is cheap but ignorant of your data. On the other is fine-tuning, which bakes knowledge into the model weights but is expensive, slow to update, and a poor fit for facts that change weekly. RAG occupies the middle ground: cheaper than fine-tuning, more accurate than prompting alone, and refreshable in near real time because you just re-index documents.

Where RAG pays back in a business context

RAG is not a general productivity boost. It pays back in specific places where three conditions hold: you have a corpus of documents that people need to query, the questions are repetitive enough to justify the build, and accuracy matters enough that hallucinations are not acceptable.

The most common production use cases we see in UK mid-market:

Customer support deflection. A RAG assistant indexed over your help centre, product documentation, and historical ticket resolutions handles tier-one questions and escalates the rest. McKinsey's 2024 State of AI report flagged customer service as one of the highest-ROI generative AI applications, with deployments reportedly reducing handle time by 20-40%. The architecture is straightforward and the payback period is short because support volume is measurable.

Internal knowledge assistants. Sales teams asking which case studies apply to a given prospect. Legal teams checking precedent clauses across hundreds of contracts. Operations teams finding the right SOP. These are document-heavy teams whose institutional knowledge sits in SharePoint, Confluence, Google Drive, or a CRM, where nobody can find it. A RAG layer over those sources can save hours per person per week, and the productivity gain compounds.

According to Gartner research, knowledge workers spend roughly 20% of their week searching for information. RAG does not eliminate that, but it materially reduces it for repeat questions.

Compliance and policy Q&A. Heavily regulated firms - financial services, insurance, life sciences - have policy libraries that staff must follow but rarely read. A RAG assistant answers "can I do X under our AML policy" with a quoted, traceable answer that points at the source document. The audit trail is the feature, not just the convenience.

Onboarding and training. New joiners can ask a RAG bot the questions they would otherwise interrupt colleagues with. The corpus is HR handbooks, technical docs, codebase READMEs, and meeting notes. The productivity gain is concentrated in the first three months of tenure but it is real.

Sales enablement. Reps asking "what objection responses do we have for this competitor?" or "which case study fits a Series B fintech?" get faster, more consistent answers than digging through Salesforce attachments.

The pattern across all of these: high-frequency, low-to-medium-stakes questions against a defined corpus, where the alternative is a human searching SharePoint or interrupting a colleague.

How a production RAG system is built

The toy version of RAG fits on a slide. The production version has half a dozen components that each need attention.

Ingestion and chunking. Documents are pulled from source systems (SharePoint, Confluence, S3, Google Drive, your CMS) and split into chunks. Chunk size matters: too small and you lose context, too large and you dilute relevance. A typical starting point is 500-1000 tokens per chunk with 10-20% overlap. PDFs, tables, and images need specific handlers because naive text extraction loses structure.
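
A minimal sketch of fixed-size chunking with overlap, assuming the document has already been tokenised into a list. The function name and defaults are illustrative; the 500-token size and 10% overlap simply mirror the starting point above, and a real pipeline would count tokens with the embedding model's own tokeniser.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 500, overlap: int = 50
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Placeholder "document" of 1,200 numbered tokens.
tokens = [str(i) for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=500, overlap=50)
```

Here the 1,200 tokens become three chunks, and the last 50 tokens of each chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in one of them.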

Embedding. Each chunk is converted into a vector using an embedding model. OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source options like BGE or E5 are the common choices. Embedding is cheap, typically under £0.10 per million tokens, but you pay it again every time you re-index.

Vector storage. Embeddings live in a vector database. Postgres with the pgvector extension handles workloads up to roughly 10 million chunks comfortably and is our default for mid-market clients because it sits in the same database as the rest of the application. Above that scale, dedicated stores like Pinecone, Weaviate, Qdrant, or Milvus become worth the operational overhead.

Retrieval. The query is embedded, the vector store returns nearest neighbours, and a reranker often re-orders the candidates for relevance. Hybrid retrieval - combining vector similarity with traditional keyword search (BM25) - consistently outperforms vector-only retrieval, particularly for queries that include product codes, names, or jargon. Anthropic's contextual retrieval research reported that combining contextual embeddings with contextual BM25 cut the retrieval failure rate by 49% in its tests.
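
One common way to merge the vector and keyword result lists is reciprocal rank fusion (RRF). The sketch below assumes RRF as the fusion method, which is our choice for illustration rather than something hybrid retrieval requires; the document ids are made up, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores sit on different scales.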

Generation. Retrieved chunks are inserted into a prompt template along with instructions about how to behave: cite sources, refuse if the answer is not in the context, format the response a certain way. The model generates an answer. Choice of model matters less than people assume - GPT-4o, Claude Sonnet, and Llama 3.1 70B all perform comparably for well-retrieved RAG. Quality of retrieval dominates quality of generation.
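
A prompt template along those lines might look like the sketch below. The wording, the `build_prompt` name, and the `(source_id, text)` passage shape are all assumptions for illustration; real templates are tuned against an evaluation set.

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (source_id, text) passages."""
    context = "\n\n".join(f"[{source_id}] {text}" for source_id, text in passages)
    return (
        "Answer the question using only the context below.\n"
        "Cite the source id in square brackets after each claim.\n"
        'If the context does not contain the answer, reply exactly: "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical passage and question.
prompt = build_prompt(
    "What is the refund window?",
    [("policy-7", "Refunds are accepted within 30 days of purchase.")],
)
```

Tagging each passage with its source id is what lets the model cite, and a fixed refusal phrase makes refusals easy to detect in evaluation.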

Evaluation and observability. This is where most internal builds fail. Without an evaluation harness measuring retrieval recall, answer faithfulness, and refusal accuracy on a labelled test set, you have no way to know whether your changes are improving or degrading the system. Tools like Ragas, TruLens, and LangSmith make this tractable. Plan for it from day one or accept that you are flying blind.

What it costs to run

Cost varies more than vendors admit, but the components are predictable.

Build cost for a production-grade RAG system covering a single corpus typically runs £25k-£80k for the first version, depending on data quality, integration complexity, and the evaluation depth required. Heavily regulated deployments with audit, refusal patterns, and human-in-the-loop review run higher.

Running cost has three lines. Embedding costs are negligible at typical scale - under £100 per month for most corpora. Inference costs depend on volume and model choice; £500-£5,000 per month is the normal range for a mid-market deployment serving thousands of queries. Hosting (vector database, application servers, monitoring) adds another £200-£1,000 per month.

Maintenance cost is the line people forget. Source documents change, evaluation sets need updating, models get deprecated, and retrieval quality drifts. Budget 0.5-1 day per week of engineering attention for an active deployment, or contract it out.

Compared with fine-tuning the same knowledge into a custom model - which runs into six figures for the training and recurs every time the source data changes meaningfully - RAG is dramatically cheaper to operate. Compared with buying an off-the-shelf assistant that connects to your data, custom RAG costs more up front but gives you control over retrieval quality, refusal behaviour, and data residency.

Where RAG breaks

RAG fails in predictable ways and they are worth knowing before you commission a build.

Bad source data. If your documents are out of date, contradictory, or duplicated across systems, RAG will confidently surface the wrong answer. The model has no way to know which version is current. Document hygiene is a prerequisite, not an afterthought.

Questions that need reasoning across documents. RAG is good at "what does the policy say about X" and poor at "summarise how our policy has changed across three revisions." Multi-hop and aggregation queries need agentic patterns or pre-computed summaries.

Tabular and numerical data. Vector search treats tables as text and loses structure. If your queries depend on "how many customers did we churn in Q3," the answer is a SQL query against your warehouse, not a RAG lookup. Text-to-SQL is a different pattern.
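
To make the contrast concrete, here is the kind of query that answers a churn-count question directly. The `customers` table, its columns, the sample rows, and the use of an in-memory SQLite database are all hypothetical stand-ins for a real warehouse; Q3 is assumed to mean July to September.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, churned_on TEXT)")
conn.executemany(
    "INSERT INTO customers (id, churned_on) VALUES (?, ?)",
    [(1, "2024-07-15"), (2, "2024-09-30"), (3, "2024-11-02"), (4, None)],
)


def churned_between(conn: sqlite3.Connection, start: str, end: str) -> int:
    """Count customers whose churn date falls in [start, end)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE churned_on >= ? AND churned_on < ?",
        (start, end),
    ).fetchone()
    return row[0]


q3_churn = churned_between(conn, "2024-07-01", "2024-10-01")
```

The count is exact because the database aggregates over every row, whereas a vector search would only surface a handful of text passages that happen to mention churn.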

Refusal calibration. Out of the box, models try to be helpful and will answer even when retrieval returns nothing useful. You need explicit prompt instructions and evaluation against "unanswerable" questions to get refusal behaviour right. This is the single most important quality lever for any compliance-adjacent deployment.
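
Refusal behaviour can be scored with a small labelled set. The sketch below assumes the prompt tells the model to reply with a fixed phrase when it cannot answer, and that each test case records whether the question is answerable from the corpus; the class and function names are illustrative.

```python
from dataclasses import dataclass

REFUSAL_TEXT = "i don't know"  # assumed refusal phrase set in the prompt


@dataclass
class EvalCase:
    question: str
    answerable: bool  # label: can the corpus answer this question?
    model_answer: str


def is_refusal(answer: str) -> bool:
    return REFUSAL_TEXT in answer.lower()


def refusal_metrics(cases: list[EvalCase]) -> dict[str, float]:
    """Share of unanswerable questions refused, and answerable ones wrongly refused."""
    unanswerable = [c for c in cases if not c.answerable]
    answerable = [c for c in cases if c.answerable]
    correct = sum(is_refusal(c.model_answer) for c in unanswerable)
    wrong = sum(is_refusal(c.model_answer) for c in answerable)
    return {
        "refusal_accuracy": correct / len(unanswerable) if unanswerable else 0.0,
        "false_refusal_rate": wrong / len(answerable) if answerable else 0.0,
    }


# Hypothetical evaluation cases.
cases = [
    EvalCase("What is the refund window?", True, "30 days [policy-7]"),
    EvalCase("Who approves expenses?", True, "Line managers [fin-2]"),
    EvalCase("What is the CEO's salary?", False, "I don't know."),
    EvalCase("When is the next acquisition?", False, "Probably next year."),
]
metrics = refusal_metrics(cases)
```

Tracking both numbers matters: a system that refuses everything scores perfectly on refusal accuracy and is useless, so the false refusal rate acts as the counterweight.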

Permissions and data leakage. If a user can query the RAG system, they can effectively read everything indexed. Document-level access control needs to be enforced at retrieval time - filter the candidate set by the user's permissions before generation. Skipping this is how people accidentally expose HR data to the whole company.
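
A minimal sketch of that retrieval-time filter, assuming each indexed chunk carries the set of groups allowed to read its source document. The `Chunk` shape and group names are hypothetical; in practice the permissions would be synced from the source system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]  # groups permitted to read the source document


def filter_by_permissions(candidates: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks the user may read, preserving retrieval order."""
    return [c for c in candidates if c.allowed_groups & user_groups]


# Hypothetical retrieval candidates.
candidates = [
    Chunk("Salary bands for 2024 ...", "hr/pay.docx", frozenset({"hr"})),
    Chunk("Annual leave is 25 days ...", "handbook.pdf", frozenset({"all-staff", "hr"})),
]
visible = filter_by_permissions(candidates, user_groups={"all-staff"})
```

Where the vector store supports metadata filters, applying the same condition inside the query is preferable, because restricted chunks then never leave the database.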

RAG, GDPR, and information security in the UK

RAG systems process personal data when the source documents contain personal data, which most internal corpora do. That triggers UK GDPR obligations.

The ICO's guidance on AI and data protection is the primary reference. The practical implications for RAG: you need a lawful basis for processing, a data protection impact assessment for higher-risk deployments, a clear retention policy for query logs, and a way to handle subject access and erasure requests against indexed content.

The model choice has data residency implications. OpenAI and Anthropic offer EU/UK data processing through their enterprise tiers; the standard API tiers typically route through US infrastructure. For sensitive deployments, self-hosted open-weight models like Llama 3.1 or Mistral on UK or EU infrastructure are the safer default. Self-hosted inference is slower to set up but removes a category of compliance argument.

Information security obligations - ISO 27001, SOC 2, sector-specific frameworks - apply to the RAG system in the same way they apply to any application processing the same data. Treat it as a regulated workload from the start rather than retrofitting controls later.

Build, buy, or wait

For most mid-market businesses, the decision tree is shorter than the vendor noise suggests.

Buy if your use case is generic customer support over public help content and you do not need control of retrieval quality or refusal behaviour. Intercom Fin, Zendesk's AI agent, and similar packaged products are mature and cheap to deploy. The trade-off is limited customisation and pricing that scales with volume.

Build if you have a defined internal corpus, specific accuracy or compliance requirements, or use cases that span multiple systems. A custom RAG build gives you control of every quality lever and keeps data inside your boundary. Plan for the evaluation harness and the maintenance cost, not just the initial deployment.

Wait if your source data is a mess. Six weeks cleaning up SharePoint will produce better RAG outcomes than six months of clever engineering on bad inputs. The system can only retrieve what is well-organised and current.

The mistake we see most often is treating RAG as a side project rather than an operational system. It needs an owner, an evaluation cadence, and a budget for ongoing iteration. Treat it like a product and it pays back; treat it like a hackathon output and it decays.

Frequently asked questions

How is RAG different from fine-tuning?

Fine-tuning changes the model's weights by training it on your data, which is expensive, slow, and difficult to update when the underlying facts change. RAG leaves the model untouched and inserts your data into the prompt at query time, so updates are as simple as re-indexing a document. For knowledge that changes more than once a quarter - which is almost all business knowledge - RAG is the correct default. Fine-tuning is appropriate for teaching the model a style, a structured output format, or a specialised reasoning pattern, not for teaching it facts.

How long does it take to deploy a production RAG system?

A focused single-corpus deployment - say, customer support over your help centre - typically takes 6-10 weeks from kickoff to production. That includes ingestion pipelines, retrieval tuning, evaluation harness, refusal behaviour, and integration into your support channel. Internal knowledge assistants spanning multiple source systems usually run 10-16 weeks because the integration work dominates. The first two to three weeks are discovery and data assessment, which is where the timeline either holds or slips depending on the state of the source documents.

Do we need to use OpenAI or can we self-host?

You can self-host. Open-weight models like Llama 3.1 70B, Mistral Large, and Qwen 2.5 perform competitively for well-retrieved RAG, and inference cost on your own infrastructure is predictable. The trade-off is operational overhead: GPU provisioning, model serving, monitoring, and scaling. For deployments with strict data residency requirements, or with query volumes high enough that API costs become painful (typically above 100k queries per month), self-hosting pays back. Below that, a hosted API from OpenAI, Anthropic, or AWS Bedrock usually has the lower total cost of ownership.

How do we stop the system from making things up?

Three controls in combination. First, retrieval quality: if the right passage is in the top results, the model rarely hallucinates. Invest in hybrid retrieval, reranking, and chunking strategy. Second, prompt design: explicit instructions to answer only from the provided context and to say "I don't know" when the context is insufficient. Third, evaluation: a labelled test set including "unanswerable" questions, scored regularly, so you catch regressions. With those three in place, hallucination rates on well-scoped RAG systems typically run below 2% on production traffic. Without them, expect 10-20%.

What does RAG cost to run per month?

For a mid-market deployment handling thousands of queries per week, total running cost is usually £1,000-£6,000 per month. Inference (the language model API or self-hosted GPU) is the largest line at £500-£5,000. Vector database hosting and supporting infrastructure adds £200-£1,000. Embedding costs are negligible at typical scale. The variability is driven mostly by query volume and model choice - GPT-4-class models cost roughly 10x what Llama 3 8B costs to serve, and for many RAG workloads the smaller model is fine because retrieval is doing the heavy lifting.

Can RAG work with our existing SharePoint and Confluence?

Yes, and these are the most common source systems we integrate with. Microsoft Graph API gives access to SharePoint and OneDrive; Confluence has a stable REST API; Google Drive, Notion, and most modern document systems expose what you need. The work is not the connection itself but handling permissions correctly - the RAG system needs to know which documents a given user is allowed to see and filter retrieval accordingly. Skipping that step is the single most common cause of data leakage incidents in internal RAG deployments, so it should be part of the initial architecture, not a later addition.

Who runs and maintains a RAG system after launch?

It needs an owner with enough technical depth to interpret evaluation results and tune retrieval. In smaller organisations that is usually a senior engineer or data scientist with 0.5-1 day per week dedicated to it. In larger deployments it becomes a small team covering data engineering, ML engineering, and product. The ongoing work is updating evaluation sets as new question patterns emerge, monitoring quality metrics, refreshing source documents, managing model deprecations, and handling user feedback. If you do not have that capacity internally, a retainer with the agency that built it is the standard arrangement and typically runs 2-5 days per month.

How do we measure whether RAG is actually working?

Three metric families. Retrieval metrics: recall@k (is the right passage in the top results) and mean reciprocal rank. Generation metrics: faithfulness (does the answer match the retrieved context), answer relevance, and refusal accuracy on unanswerable questions. Business metrics: deflection rate for support deployments, time-to-answer for internal assistants, user satisfaction scores. Run the technical metrics weekly against a labelled test set; track business metrics monthly against a baseline. A deployment that improves on technical metrics but not business metrics usually has a problem with surfacing or workflow integration rather than the RAG itself.
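
The two retrieval metrics are simple enough to compute by hand. A minimal sketch, assuming each test question is labelled with the ids of the passages that answer it (function names and sample ids are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant passage, per query.

    A query with no relevant passage in its results contributes 0.
    """
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


# Hypothetical labelled test set: (retrieved ids in order, relevant ids).
results = [
    (["a", "b", "c"], {"b"}),  # first hit at rank 2
    (["x", "y", "z"], {"x"}),  # first hit at rank 1
    (["p", "q"], {"r"}),       # no hit
]
mrr = mean_reciprocal_rank(results)
recall = recall_at_k(["a", "b", "c"], {"b", "d"}, k=2)
```

Recall@k tells you whether the evidence reached the prompt at all; reciprocal rank tells you how near the top it landed, which matters when only the first few passages are used.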

Getting started

RAG is the right default for any business application where a language model needs to answer questions grounded in your data. The technology is mature, the patterns are well-understood, and the cost profile suits mid-market budgets. What separates deployments that pay back from those that decay is operational discipline: a clean corpus, an evaluation harness, refusal calibration, and an owner with time allocated to maintain it.

If you are weighing a RAG build against an off-the-shelf assistant or a fine-tuning project, AI Advisory runs two-week assessments that produce a costed recommendation grounded in your specific data and use cases. Get in touch to scope one.
