RAG Analysis: How to Evaluate Retrieval-Augmented Generation Systems
A practitioner's guide to RAG analysis - what to measure, how to test retrieval and generation, and how to spot failure modes before production
RAG analysis is the practice of measuring how well a retrieval-augmented generation system actually works - how accurate its retrieval is, how grounded its answers are, and where it fails. It is the difference between a demo that wows a stakeholder and a system you can put in front of customers without lying awake at night.
Most teams skip it. They build a RAG pipeline, run ten queries that look fine, and ship. Three months later the support team is fielding complaints about hallucinated policy answers and nobody can explain why retrieval missed the obvious document. This article walks through what RAG analysis means in practice, what to measure, how to test each stage of the pipeline, and the failure modes that recur across builds.
What RAG analysis actually means
Retrieval-augmented generation combines a retrieval system (usually vector search over a chunked document store, often with keyword search alongside) with a language model that generates answers grounded in the retrieved context. RAG analysis is the structured evaluation of that combined system.
It splits into three questions:
- Is retrieval finding the right context? Given a user query, does the system return the documents that actually contain the answer?
- Is generation faithful to the retrieved context? Does the model use what was retrieved, or does it improvise from its pretraining?
- Is the end-to-end answer useful? Does the response actually resolve the user's question, in the right tone, with appropriate caveats?
You need to analyse each layer separately. A system can have 95% retrieval accuracy and still generate rubbish if the prompt template invites the model to embellish. Conversely, a beautifully tuned generation prompt cannot rescue retrieval that misses the relevant chunk.
This is why generic LLM benchmarks are useless for RAG. MMLU and HellaSwag tell you nothing about whether your support assistant will correctly answer a question about your refund policy. RAG analysis has to be done on your data, with queries that look like your users' queries.
The metrics that matter
There is no shortage of metrics in the RAG evaluation literature. The ones that earn their place in a working evaluation harness are below.
Retrieval metrics
Recall@k: of the documents that are genuinely relevant to a query, what fraction appear in the top k results? If your generator sees the top 5 chunks and the right chunk is at position 12, the model has no chance. Recall@k is the single most important retrieval metric because it sets the ceiling on everything downstream.
Precision@k: of the top k results, what fraction are actually relevant? Low precision means the generator wastes context window on noise and is more likely to be distracted.
Mean Reciprocal Rank (MRR): how high up the list does the first relevant document appear? Useful when there is one correct answer per query.
nDCG: normalised discounted cumulative gain, which rewards getting the most relevant results into the top positions. Worth tracking when relevance is graded rather than binary.
Generation metrics
Faithfulness (sometimes called groundedness): does every factual claim in the answer trace back to the retrieved context? This is the hallucination metric. Frameworks like Ragas and TruLens score it by decomposing the answer into atomic claims and checking each against the retrieved chunks, usually with an LLM judge.
Answer relevance: does the answer actually address the question that was asked? An answer can be perfectly faithful to the context and still miss the point.
Context relevance: how much of the retrieved context is actually used in the answer? If you retrieve 4,000 tokens and the answer only uses one sentence, your retrieval is bloated and your costs are higher than they need to be.
End-to-end metrics
Eventually you have to grade whole answers. A simple 1-5 rubric covering correctness, completeness, and tone, applied to a few hundred representative queries, gives you the headline number you report to the business. Pair it with a binary "would you send this to a customer?" gate for the harshest cut.
Building an evaluation dataset
You cannot do RAG analysis without a test set. Building one is unglamorous and usually skipped, which is precisely why teams that do it ship better systems.
Aim for 150-300 query-answer pairs to start. More is better, but 150 well-chosen examples beats 2,000 scraped ones. The set should cover:
- Known-answer questions: queries where you know exactly which document and which passage contains the answer. These let you measure retrieval recall directly.
- Multi-hop questions: queries that require combining information from two or more documents. RAG systems often fail here because retrieval finds one source and the model confidently answers from incomplete context.
- Adversarial queries: questions that look answerable but are not, ambiguous queries, queries with false premises. A good system refuses or asks for clarification. A poor one confabulates.
- Out-of-scope queries: things your knowledge base does not cover. The system should decline cleanly. If it answers anyway, you have a refusal problem.
- Real user queries: once you have any production traffic, sample from it. Real users phrase things in ways your team will not predict.
Synthetic generation has a role here. Tools like Ragas can produce candidate questions from your corpus, which is useful for bulk coverage. Always have a human review the synthetic set before relying on it - LLM-generated test sets contain plenty of malformed or trivially answerable questions.
Testing the retrieval layer in isolation
Before you ever evaluate generated answers, fix retrieval. This is the highest-impact stage of RAG analysis because almost every system has retrieval problems and almost every team underestimates them.
Run your evaluation queries through the retriever only. For each query, log the top 10 chunks returned, then compare against the known-relevant chunks. You will find some predictable failure modes:
Chunking is wrong. Fixed-size chunks split mid-sentence or mid-table. Headings end up in one chunk and the content they describe in the next. The fix is usually semantic chunking that respects document structure, or a parent-document retriever that returns larger context windows once a small chunk matches.
Embeddings miss the query intent. Dense vector search struggles with rare terms, product codes, and exact phrases. A query for "error code SRV-4471" often retrieves general troubleshooting content rather than the specific page. Hybrid search - dense vectors plus BM25 keyword search, with reciprocal rank fusion - usually fixes this. The Microsoft research on hybrid search and reranking is the practical reference here (VERIFY: https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview).
Reranking is missing. Vector search gives you candidates; a cross-encoder reranker (Cohere Rerank, BGE reranker, or a small fine-tuned model) reorders them by genuine relevance. Adding reranking typically lifts recall@5 by 10-20 percentage points on real datasets.
Metadata filters are absent. If your corpus contains documents from multiple business units, product versions, or time periods, vector search will happily return the wrong version. Metadata filters at retrieval time are not optional for any non-trivial knowledge base.
Testing the generation layer
Once retrieval is solid, evaluate generation by feeding the model gold-standard context (the chunks you know contain the answer) and grading the output. This isolates generation behaviour from retrieval noise.
Look for three patterns:
Confabulation under good context. The right information is in the prompt and the model still invents something. Usually a prompt problem - the instructions are not strict enough about staying within the provided context, or the model is being asked to be "helpful" in ways that encourage filling gaps.
Refusal failure. You feed the model context that does not answer the question and it answers anyway. The fix is an explicit refusal instruction and few-shot examples of correct refusals in the prompt. Test this hard - it is the most common production failure.
Citation drift. The model claims a fact and cites chunk 3, but chunk 3 does not actually support it. If your UI shows citations, you need to verify them programmatically, not just trust that the model labelled correctly.
Then repeat with realistic retrieval (not gold-standard) and compare. The gap between gold-context performance and real-retrieval performance is your retrieval improvement budget.
LLM-as-judge: useful but not infallible
Most modern RAG evaluation uses an LLM to grade answers. Ragas, TruLens, DeepEval, and Arize Phoenix all rely on it. This is reasonable - human grading does not scale and LLM judges correlate well with human judgement on faithfulness and relevance tasks, as documented in the original LLM-as-judge research from the LMSYS team (VERIFY: https://arxiv.org/abs/2306.05685).
It has limits worth knowing. LLM judges have known biases: they prefer longer answers, they prefer answers that match their own style, and they are inconsistent on borderline cases. Mitigations:
- Use a different model family as judge than as generator. If you generate with GPT-4, judge with Claude or Gemini.
- Calibrate with human-labelled examples. Hand-grade 50 outputs yourself, then check that the LLM judge agrees on at least 80% of them.
- Use binary or three-point scales rather than 1-10. Judges are more reliable at coarse decisions.
- Track judge agreement over time. If you change the judge model or prompt, expect the absolute numbers to shift even if the underlying system is unchanged.
Failure modes to test for explicitly
Across the RAG builds we have shipped, the same failure modes recur. Include each as a category in your test set:
- Stale information: the corpus contains both old and current versions of a policy and retrieval picks the old one.
- Conflicting sources: two documents disagree and the model picks one without flagging the conflict.
- Numeric drift: figures, dates, and prices get subtly wrong in the answer even though the source is correct.
- Tone mismatch: the system answers a customer query in internal jargon, or vice versa.
- Prompt injection: a document in the corpus contains text like "ignore previous instructions and..." and the model obeys. Test with deliberately poisoned chunks.
- Personally identifiable information leakage: under GDPR, retrieving and surfacing personal data that was not relevant to the query is a problem. The ICO's guidance on AI and data protection is the relevant reference (ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/).
Making analysis continuous
A one-off evaluation is better than nothing but it dates fast. Models change, corpora grow, user behaviour shifts. Build the evaluation harness into CI so every change to chunking strategy, embedding model, retriever configuration, or prompt template triggers a rerun against the test set. Any regression on recall@5 or faithfulness should block the deploy.
In production, log enough to keep analysing. For each query, store the retrieved chunk IDs, the prompt sent to the model, the response, and any user feedback signal (thumbs, follow-up question, escalation). Once a week, sample 50 logged queries and grade them. New failure modes show up in production that your test set will miss, and the test set should grow to cover them.
This is the loop that turns a RAG demo into a system that holds up. Without it you are guessing.
Frequently asked questions
How is RAG analysis different from regular LLM evaluation?
Regular LLM evaluation measures a model's knowledge and reasoning in isolation - benchmarks like MMLU test what the model already knows. RAG analysis measures a system, not a model. The question is not "does the model know X?" but "given the documents the retriever surfaced, does the system produce a correct, grounded answer?" That requires evaluating retrieval and generation as separate stages, using your corpus and your queries, with metrics like recall@k and faithfulness that have no counterpart in standard LLM benchmarks.
Which RAG evaluation framework should I use?
Ragas is the most established for faithfulness, answer relevance, and context metrics, and integrates well with LangChain and LlamaIndex pipelines. TruLens is strong for production monitoring with its feedback function model. DeepEval is closer to a unit-testing experience and fits well in CI. Arize Phoenix is the best of the open-source observability options. For most mid-market builds, start with Ragas for offline evaluation and add Phoenix or TruLens once you have production traffic worth observing. The frameworks are not mutually exclusive.
How big should my evaluation dataset be?
Start with 150-300 query-answer pairs covering the categories described above (known-answer, multi-hop, adversarial, out-of-scope, real-user). That is enough to detect meaningful regressions between system versions. Grow it as you find production failure modes - every bug report should add at least one test case. Teams running mature RAG systems typically maintain test sets of 1,000-5,000 examples, but you do not need that on day one. Quality and coverage matter more than raw size, and a small set you actually run on every change beats a large one that gathers dust.
Can I use an LLM to judge my RAG outputs reliably?
Yes, with caveats. LLM judges correlate well with human grading on faithfulness and relevance tasks and are the only way to scale evaluation beyond a few hundred examples. Use a different model family for judging than for generating to reduce self-preference bias. Calibrate by hand-grading 30-50 outputs and checking the LLM judge agrees on at least 80%. Prefer binary or three-point scales over 1-10. And treat judge scores as a relative signal for tracking changes over time, not as an absolute measure of quality - the numbers shift when you change the judge.
What does it cost to run RAG evaluation in CI?
For a 200-example test set evaluated with GPT-4-class judge models, expect £2-£8 per full run depending on context length and how many judge calls you make per example (faithfulness alone usually needs 2-4 calls). That is per change, not per day, so monthly cost on an actively developed system might be £100-£400. You can cut this substantially by using cheaper judges (GPT-4o-mini or Claude Haiku) for routine runs and reserving the strongest judge for release candidates. Compared to the cost of shipping a regression to production, it is trivial.
How do I evaluate RAG systems handling sensitive or regulated data?
The evaluation harness itself needs the same data controls as the production system. Run evaluation in the same security boundary as production, with the same access controls and logging. Do not send sensitive context to third-party judge APIs unless your data agreement permits it - use a self-hosted judge model or a contractually compliant API. Add explicit test categories for PII leakage, regulated-data refusal, and out-of-scope queries that should not be answered. The ICO's AI guidance and your own DPIA should drive what gets tested. Treat evaluation logs as production data for retention and access purposes.
How often should I rerun RAG analysis?
Run the full evaluation suite on every change to retrieval configuration, prompt templates, embedding models, or the underlying LLM. Run a lighter smoke-test suite (20-30 examples) on every deploy. Sample and grade 30-50 production queries weekly to catch drift the test set misses. Run a full re-grade quarterly to check that the judge itself has not drifted as judge models are updated. If your corpus changes frequently - daily ingestion of new documents - add a retrieval-only check after each ingestion to confirm the new content is findable.
When is RAG the wrong approach entirely?
RAG is the wrong tool when the task does not require external knowledge - summarisation, translation, and creative writing usually do not. It is wrong when answers must be exact and deterministic, like calculating tax owed or generating legal contract clauses, where a rules engine or a code-generation approach is safer. It is wrong when the knowledge fits comfortably in a system prompt and never changes. And it is often wrong when fine-tuning on a stable, well-defined task would give better latency and lower cost. Analyse the problem before defaulting to RAG.
Closing thought
RAG analysis is not glamorous work but it is the difference between a system that survives contact with real users and one that quietly embarrasses you. The teams that ship reliable RAG in production are the teams that invested in the evaluation harness before they invested in the third clever retrieval trick. Start with a 200-example test set, separate retrieval and generation evaluation, automate it in CI, and grow the set from production failure modes.
At AI Advisory we build RAG systems with the evaluation harness as a first-class deliverable, not an afterthought - because clients who can see how their system performs are the clients who keep us on retainer. If you are scoping a RAG build or trying to fix one that is underperforming, we can help.
Ready to put this into production? book a discovery call.