AI Workflow Agency
AI5 min read

RAG vs Fine-Tuning: How to Choose (and When to Combine Them)

A practical comparison of retrieval-augmented generation and fine-tuning

By AI Advisory team

The question "should we use RAG or fine-tune the model?" gets asked in roughly half of all AI build conversations, and the honest answer is almost always "RAG first, fine-tune later if you must, and quite possibly both." The two techniques solve different problems. Treating them as competitors is the source of a lot of wasted budget.

This guide explains what each approach actually does, where each one earns its keep, what they cost to run, and how to decide between them for a real production system. The framing is aimed at teams shipping AI into operations rather than research labs, so the trade-offs are practical: latency, hosting costs, evaluation, and the awkward moment six months in when your underlying model gets deprecated.

What each technique actually does

Retrieval-augmented generation (RAG) is an architecture pattern. You keep the language model as-is, and at inference time you fetch relevant documents from a vector store (or a hybrid of vector plus keyword search), stuff them into the prompt, and ask the model to answer grounded in that context. The model's weights never change. The knowledge lives in the retrieval index, which you can update any time you like.

Fine-tuning modifies the model itself. You take a base model (open-source like Llama 3, Mistral, or Qwen, or a hosted model via OpenAI's or Anthropic's fine-tuning APIs) and continue training it on labelled examples of inputs and desired outputs. The knowledge - or more accurately, the behaviour - is baked into the weights. To update it, you retrain.

The crucial point most teams miss: RAG is for knowledge, fine-tuning is for behaviour. If your problem is "the model doesn't know our product catalogue," that is a knowledge problem, and RAG solves it. If your problem is "the model won't reliably output structured JSON in our schema" or "the tone is wrong for our brand," that is a behaviour problem, and fine-tuning is the right tool. Many production systems need both.

When RAG is the right answer

RAG should be your default starting point for any system where the model needs to reason over information it wasn't trained on. That covers most enterprise use cases: internal documentation assistants, customer support bots grounded in a help centre, sales enablement tools that pull from CRM and product docs, legal and compliance assistants reading policy documents, research tools sitting over a corpus of reports.

The arguments for starting with RAG are straightforward. First, your knowledge changes - product specs get updated, policies get revised, prices move. With RAG you re-index; with fine-tuning you retrain. Second, you can cite sources. Every answer can come back with the document IDs or URLs the model used, which is non-negotiable for regulated sectors and a strong trust signal everywhere else. Third, you can swap the underlying model without touching the knowledge layer. When GPT-5 or Claude 4 lands, your RAG pipeline still works; only the generation call changes.

RAG also handles the GDPR and data residency conversation more cleanly. Under UK GDPR, if a customer exercises their right to erasure, you delete the relevant document from your index and they are gone from the system. With a fine-tuned model, that data is now in the weights, and the ICO's guidance on AI and data protection makes clear this is a meaningful problem - the regulator expects you to be able to honour data subject rights, and "the model has memorised it" is not a defence.

When fine-tuning earns its keep

Fine-tuning becomes the right answer when prompting alone cannot reliably produce the behaviour you need. The clearest signals:

  • Format compliance. You need the model to produce a specific structured output (a particular JSON schema, a specific markup format, a domain-specific language) and few-shot prompting still drifts. Fine-tuning on a few hundred examples typically lifts schema compliance from ~90% to ~99%+.
  • Tone and style. Your brand voice is distinctive enough that prompt instructions feel like an arms race. A few thousand labelled examples teach the model what "on-brand" looks like better than a 2,000-token system prompt ever will.
  • Latency and cost at scale. A fine-tuned 8B or 13B open-source model running on your own infrastructure can be 5-10x cheaper than calling GPT-4-class models for high-volume use cases, and often faster.
  • Narrow classification or extraction tasks. Sentiment, intent detection, entity extraction, document classification - tasks with clear labels and limited output spaces - fine-tune brilliantly and often beat much larger general-purpose models.

Fine-tuning is not the answer when you want the model to "know" new facts. This is the most common and most expensive mistake. People fine-tune on a thousand internal documents hoping the model will absorb the content, and what they get is a model that has vaguely heard of the topics and confabulates plausibly when asked. Research from Anthropic and others has repeatedly shown that fine-tuning is poor at injecting new factual knowledge - it adjusts behaviour, not the underlying world model. If you want the model to answer questions about your documents, retrieve those documents at inference time.

The cost and latency picture

The economics matter more than most decision-makers realise, and they cut both ways.

RAG costs are dominated by three things: the vector database (Pinecone, Weaviate, Qdrant, or pgvector on your existing Postgres), the embedding compute when you index documents, and the inflated prompt size at inference time. That third one bites: if you stuff 4,000 tokens of retrieved context into every call to a frontier model, you are paying for 4,000 input tokens on every single query. At scale this dominates the bill. The mitigations are aggressive chunking strategy, reranking to keep only the top 3-5 chunks, and using a smaller model where retrieval quality lets you get away with it.

Fine-tuning costs split into training and inference. Training a LoRA adapter on a 7B-13B open-source model is genuinely cheap now - a few hundred pounds of GPU time on Runpod, Modal, or similar, plus your labelling effort. Hosted fine-tuning via OpenAI is more expensive per training run but removes the infrastructure work. The bigger ongoing cost is inference hosting if you self-host: a single A100 or H100 instance running 24/7 to serve a fine-tuned model is £1,500-£4,000/month depending on provider and commitment. That only makes sense at meaningful query volume.

A rule of thumb: if you are under ~50,000 queries per month, hosted models with RAG are almost certainly cheaper overall. Above ~500,000 queries per month with predictable patterns, a fine-tuned open-source model on dedicated infrastructure usually wins. Between those two, it depends on the specifics.

Latency is a quieter consideration but often decides production fitness. RAG adds 100-400ms for retrieval before generation even starts. A fine-tuned smaller model can return a full response in less time than a frontier model takes to start streaming. For voice agents and real-time interfaces this is decisive.

The case for combining both

The most robust production systems we build use both techniques. The pattern looks like this: a fine-tuned model handles the behavioural layer - output structure, tone, refusal patterns, classification of incoming queries - and RAG handles the knowledge layer.

A worked example. A financial services client wanted an internal assistant for their advice teams. The requirements: cite source documents on every answer, refuse to give regulated advice, output in a specific structured format their CRM could ingest, and stay current as policy documents updated weekly.

The architecture: a LoRA-fine-tuned Llama 3 8B handles the output structure and refusal behaviour - it has learned what a compliant response looks like, what to refuse, and how to format citations. RAG over their policy corpus (pgvector + BM25 hybrid retrieval, reranked with a cross-encoder) supplies the actual content. The fine-tuned model was trained on 1,800 labelled examples of good and bad responses; the RAG index updates nightly from their document management system.

Neither approach alone would have worked. Pure RAG with a frontier model could not reliably hit the structured output and refusal requirements without an enormous system prompt that still drifted. Pure fine-tuning could not stay current with weekly policy changes. Together they shipped a system that has handled tens of thousands of queries with no compliance incidents.

This pattern - fine-tune for behaviour, retrieve for knowledge - is now the default architecture for any non-trivial production system. The McKinsey State of AI 2024 report notes that organisations achieving measurable EBIT impact from generative AI are overwhelmingly those running customised systems rather than off-the-shelf chat interfaces, and "customised" almost always means this kind of hybrid.

A decision framework

Work through these questions in order. The first "yes" usually points to your answer.

  1. Does the model need to reason over information that changes more than monthly, or that you do not control? If yes, you need RAG. This is non-negotiable.
  2. Can you achieve acceptable behaviour with prompting and few-shot examples? Test this properly before deciding to fine-tune. Modern frontier models with well-designed prompts and 5-10 in-context examples solve a remarkable amount. Fine-tuning is not free engineering complexity.
  3. Is your query volume high enough to justify dedicated infrastructure? If you are under 50k queries/month, stick with hosted models and good prompting. The cost case for self-hosted fine-tuned models needs scale to pencil out.
  4. Do you have, or can you create, 500+ high-quality labelled examples? Fine-tuning needs data. If you cannot generate the training set, you cannot fine-tune well. The work of labelling is usually 70% of a fine-tuning project's effort.
  5. Are latency or per-query cost critical constraints? If yes, a fine-tuned smaller model becomes attractive even at moderate volume.
  6. Do you have a behavioural problem prompting cannot solve? Structured output drift, tone, classification accuracy - these are the cases where fine-tuning genuinely adds value over good prompting.

For roughly 70% of the build conversations we have, the answer is RAG-only with a frontier model, careful prompt engineering, and a proper evaluation harness. For about 20%, it is RAG plus a fine-tuned smaller model serving the behavioural layer. For maybe 10%, it is fine-tuning alone - typically narrow classification or extraction tasks where there is no external knowledge to retrieve.

Evaluation: the part everyone skips

Whichever approach you choose, the thing that separates a system that works from a system that ships and then quietly degrades is your evaluation harness. You need a held-out test set of 100-500 representative inputs with known good outputs, automated scoring (exact match, semantic similarity, LLM-as-judge with a strong rubric), and a CI step that runs the eval on every change to prompts, retrieval logic, model version, or fine-tuning data.

Without this, you have no way to know whether your last "improvement" actually improved anything. With it, you can confidently swap models, refactor prompts, and iterate on retrieval - all the things you will inevitably need to do as the field moves.

The eval matters more for fine-tuned systems because the cost of getting it wrong is higher: a bad fine-tune is days of work to redo, where a bad prompt change is minutes to revert. Build the eval first. It is the single highest-impact piece of engineering on any serious AI build.

Frequently asked questions

Is RAG always cheaper than fine-tuning?

No, but it usually is at moderate scale. RAG's costs are dominated by inflated prompt sizes at inference, which scale linearly with query volume. Fine-tuning has a higher fixed cost (training, evaluation, hosting) but lower per-query cost once running. The crossover point depends on your query volume, prompt size, and whether you self-host. As a rough heuristic: under 50,000 queries per month, RAG with hosted models almost always wins on total cost. Above 500,000 queries per month with predictable patterns, a self-hosted fine-tuned model often wins. Between those, model the specific numbers - do not guess.

Can you fine-tune a model to know your company's documents?

Technically yes, practically no. Fine-tuning on documents will produce a model that has vaguely absorbed the topics and will confabulate plausibly when questioned, which is worse than not knowing - it sounds confident while being wrong. Research consistently shows that fine-tuning is poor at injecting reliable factual knowledge. For "the model should know our documents," the correct architecture is RAG: keep the documents in a retrieval index, fetch the relevant ones at query time, and have the model answer grounded in that retrieved context. This also gives you citations and lets you update documents without retraining.

How much training data do I need to fine-tune effectively?

For LoRA fine-tuning on behaviour tasks (format compliance, tone, narrow classification), 500-2,000 high-quality labelled examples is usually enough to see meaningful improvement over a strong base model with good prompting. Full fine-tuning or harder tasks can need 10,000+. The quality of the examples matters far more than the quantity - 500 carefully curated examples beat 5,000 mediocre ones every time. Budget 60-70% of your fine-tuning project effort on data preparation: gathering examples, labelling, reviewing edge cases, and building a held-out evaluation set. The training run itself is usually a few hours.

What is the GDPR position on fine-tuned models?

The ICO's guidance treats fine-tuned models as containing personal data when personal data was in the training set. This means data subject rights (access, erasure, rectification) apply to the model itself, which is technically very hard to honour - you typically have to retrain to remove an individual's data. RAG sidesteps most of this because personal data lives in the retrieval index, where deletion is a database operation. For any system processing personal data of UK or EU residents, the default recommendation is to keep personal data in retrievable storage rather than baking it into model weights, and to document this choice in your DPIA.

Should I use OpenAI's fine-tuning or self-host an open-source model?

Depends on volume, latency requirements, and data sensitivity. OpenAI's fine-tuning is straightforward, removes infrastructure work, and gives you a strong base model - good for getting started and for cases where you do not have the scale to justify dedicated infrastructure. Self-hosting an open-source model (Llama 3, Mistral, Qwen) on Runpod, Modal, AWS, or your own GPUs gives you lower per-query costs at scale, full control over the data path (important for some regulated sectors), and freedom to use techniques like quantisation and speculative decoding. The break-even is typically around 200,000-500,000 queries per month, but the data sensitivity argument can override the cost argument.

How do I know if my RAG system is actually working?

Build an evaluation set of 100-500 questions with known correct answers and known correct source documents. Measure two things separately: retrieval quality (did the system fetch the right documents?) and generation quality (given the right documents, did it produce the right answer?). Retrieval is usually scored with precision@k and recall@k. Generation is scored with a combination of exact match where applicable, semantic similarity, and LLM-as-judge against a rubric. Run this evaluation on every change. The most common failure mode is retrieval - your model is fine, but the right document never made it into context. If you only measure end-to-end answer quality, you will not see this.

What about combining RAG with prompting before fine-tuning?

This is the right order. Start with RAG plus a strong frontier model and careful prompt engineering. Build the evaluation harness. Get to the best result you can with this stack. Only then consider whether fine-tuning addresses a specific remaining gap - usually behaviour, format, latency, or cost. Skipping the prompt-engineering step and jumping to fine-tuning is the most common mistake, because it bakes the wrong assumptions into model weights that are then expensive to change. A good rule: if you cannot articulate what specific behaviour fine-tuning will fix that prompting cannot, you are not ready to fine-tune.

How long does it take to build a production RAG system?

For a well-scoped first build, 6-12 weeks from kickoff to production is realistic. Two weeks for discovery, content audit, and retrieval design. Four to six weeks of build and iteration - chunking strategy, embedding model selection, hybrid retrieval, reranking, prompt engineering, and the evaluation harness. Two to four weeks of hardening: guardrails, monitoring, refusal patterns, security review, and load testing. Adding a fine-tuned behavioural layer on top adds another 4-6 weeks, mostly for data preparation. Anyone promising a production-grade RAG system in two weeks is shipping a demo, not a system that will hold up in front of real users.

Getting the architecture right the first time

The cost of choosing the wrong architecture is not just the wasted build - it is the six months of suboptimal results before the team admits the approach is not working, and the political difficulty of changing direction once a decision has been defended. Most teams would benefit from spending more time on the decision and less on the build. RAG-first, prompt-engineered carefully, with a proper evaluation harness, solves more problems than people expect. Fine-tuning, deployed surgically against specific behavioural gaps, completes the picture for the cases that need it.

If you are weighing this decision for a real project and want a second opinion grounded in production builds rather than vendor talking points, the team at AI Advisory runs free 30-minute architecture reviews where we walk through your specific use case and give you a straight answer on which approach fits. No deck, no pitch - just the trade-offs as they apply to your situation.

Further reading

Sources referenced for context not directly cited in the body:

Ready to put this into production? book a discovery call.

Get started

Ready to automate your operations?

Walk away with a prioritised list of automation and AI wins, costed, sequenced, and yours. The call is 30 minutes, free, and binds you to nothing. The shortest path to knowing whether AI Workflow Agency is the right fit.