AI7 June 20265 min read

CAG vs RAG: Cache-Augmented Generation and Retrieval-Augmented Generation Compared

Cache-augmented generation vs retrieval-augmented generation - how they work, when each wins, and how to choose for production LLM systems

By AI Advisory team

Retrieval-augmented generation (RAG) became the default pattern for grounding large language models in private data because, in 2023, context windows were small and inference was expensive. Both constraints have loosened. Frontier models from Anthropic, Google and OpenAI now ship with context windows between 200,000 and 2,000,000 tokens, and prompt caching has cut the cost of reusing long prompts by 75-90% at the major providers. That has opened the door to a simpler alternative: cache-augmented generation (CAG), where you preload the entire knowledge base into the model's context once and reuse the cached state across queries.

The choice between CAG and RAG is not a fashion question. It changes your latency profile, your monthly inference bill, the freshness of answers, the engineering surface you have to maintain, and the failure modes you have to defend against. This article explains how each pattern works, the tradeoffs that actually matter in production, and a decision framework for picking the right one - or, more often, the right hybrid.

How RAG works, briefly

RAG splits the problem of "answer using my data" into two systems. A retrieval layer indexes documents as embeddings in a vector database (Pinecone, Weaviate, pgvector, Qdrant), often combined with keyword search (BM25) for hybrid retrieval. At query time, the user's question is embedded, the top-k most relevant chunks are pulled from the index, and those chunks are stitched into the prompt as context. The generation model - GPT-4o, Claude Sonnet, Gemini, an open-weights model - then answers using only the retrieved material.

The pattern works because it sidesteps two old problems. Models do not need to memorise your corpus through fine-tuning, and you do not pay to send the entire corpus on every request. You pay for retrieval, then for generation over a small slice. The cost is engineering complexity: you now own a chunking strategy, an embedding model choice, an index, a reranker, an evaluation harness, and a refresh pipeline. The Stanford 2024 RAG survey catalogued more than 30 distinct sub-components that production RAG systems typically include.

How CAG works

Cache-augmented generation skips the retrieval step. You take your entire knowledge base - or the slice relevant to a given assistant - and place it directly in the model's context window at the start of every conversation. The model's attention mechanism then operates over the whole corpus on each query, with no separate retrieval system in the loop.

The naive version of this is wildly expensive: paying to process 500,000 tokens of context on every request would bankrupt most use cases. What makes CAG viable is prompt caching. Anthropic's prompt caching, OpenAI's automatic prompt caching, and Google's context caching all let you process a long static prefix once, store the model's internal key-value state, and reuse it across subsequent requests at 10-25% of the original input cost. Anthropic publishes cache writes at 1.25x the base input price and cache reads at 0.1x, a 12.5x cost ratio between cached and uncached tokens (VERIFY: anthropic.com/pricing).

The term CAG was popularised by a 2024 paper from National Chengchi University ("Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks", arXiv:2412.15605). The paper showed that for bounded knowledge tasks fitting within context limits, CAG matched or beat RAG on answer quality while removing retrieval latency entirely.

The honest comparison

Both patterns ground a model in external data. They differ on five axes that matter in production.

Corpus size

RAG scales to corpora of effectively any size - millions of documents, hundreds of millions of chunks. CAG is bounded by the model's context window. Claude Sonnet's 200,000 tokens hold roughly 150,000 words, or 300-500 pages of dense documentation. Gemini 1.5 Pro's 2,000,000-token window holds around 1.5 million words. If your knowledge base fits, CAG is on the table. If it does not, you either chunk by domain into multiple cached contexts or use RAG.

Latency

RAG adds a retrieval round-trip before generation: embed the query, search the index, optionally rerank, then call the model. Well-tuned systems add 100-400ms. CAG adds nothing - the cache is already warm, the model just generates. First-token latency on a cached 100k-token prefix is typically 30-50% faster than the equivalent RAG call, because there is no retrieval step and the model's attention has already been computed over the prefix.

Cost per query

RAG is cheaper per query when the corpus is large and the queries are diverse. You pay for embedding the query (cheap), retrieval (cheap), and generation over maybe 4,000-16,000 tokens of context. CAG cost depends on cache hit rate. If you can amortise the cache write across thousands of queries within the cache's lifetime, CAG is cheaper per query than RAG. If queries are sparse and the cache expires between them, CAG is dramatically more expensive. Anthropic's caches have a default 5-minute TTL, with a 1-hour option at higher write cost.

Freshness

RAG updates are cheap. Index a new document, it is queryable immediately. CAG updates invalidate the cache - any change to the prefix forces a fresh cache write at full input price. For knowledge bases that change hourly (support tickets, inventory, news), RAG wins comfortably. For knowledge bases that change weekly or monthly (product documentation, policy manuals, legal corpora), CAG handles updates fine.

Engineering surface

RAG requires you to own a vector database, an embedding pipeline, a chunking strategy, a reranker, and an evaluation harness for retrieval quality. The 2024 Databricks State of Data + AI report found that average production RAG stacks involved 4-6 distinct services. CAG requires you to own a prompt assembly script and a cache key strategy. That difference matters more than people admit when you cost out the first year of running each.

When CAG wins

CAG is the right default when four conditions hold. First, the knowledge base fits comfortably in the context window with headroom for queries and answers - typically under 150,000 tokens for a 200k model, or under 1.5 million for a 2M model. Second, the corpus is relatively static, changing on a weekly cadence or slower. Third, query volume is high enough to amortise the cache write - several hundred queries per cache TTL is a reasonable floor. Fourth, latency matters - customer-facing chat, voice assistants, agent loops with multiple model calls.

Concrete examples we see working in production: internal policy assistants over a 50,000-token employee handbook, customer support bots grounded in 80,000 tokens of product documentation, legal-clause checkers operating over a 200,000-token contract template library, and developer assistants grounded in a single repository's documentation. In each case, the corpus is bounded, updates are infrequent, and queries are frequent enough that the cache stays warm.

When RAG wins

RAG remains the right choice when any one of these conditions holds. The corpus exceeds available context - most enterprise document estates fall in this category, where SharePoint, Confluence, Google Drive and email between them produce gigabytes of indexable text. Updates are continuous - ticketing systems, CRM records, inventory, news feeds. Queries are highly diverse and span unrelated domains, so most cached context would be wasted on any given call. Citation and provenance requirements are strict, which is easier to satisfy when retrieval explicitly returns the source chunks the answer is grounded in. Or you need fine-grained access control, where different users should see different slices of the corpus - far easier in a retrieval layer than in a shared cached prefix.

RAG also wins when you genuinely need hybrid search behaviour - combining vector similarity with structured filters (date, author, customer ID, region). Pushing structured filtering into a CAG setup is awkward; in RAG it is a standard query parameter.

The hybrid pattern most production systems end up at

In practice, most mature systems run both. The pattern looks like this: a stable, frequently-referenced "core context" - product documentation, common policies, glossaries, brand guidelines - is loaded into the prompt cache. A retrieval layer handles the long tail - customer-specific records, recent tickets, dynamic pricing, anything that changes between sessions. The model receives the cached core plus the retrieved slice on every call.

This pattern shows up in support automation, sales-assistant tools, and internal knowledge bots. It captures most of CAG's latency and cost benefits for the 60-80% of queries that hit the core corpus, while keeping RAG's flexibility for the long tail. It is also operationally honest - you stop pretending that any single pattern handles every query type well.

A second hybrid pattern worth knowing: CAG for the system prompt and few-shot examples, RAG for the user-specific data. Anthropic's documentation calls this "caching the static portion" and it is the cheapest way to get prompt caching working with an existing RAG stack. You do not have to commit to CAG end-to-end to benefit from cache mechanics.

A decision framework

Run these five questions in order. Stop at the first "no".

1. Does the corpus fit in a single model's context window with at least 30% headroom for queries, answers and conversation history? If not, you need RAG or a chunked-by-domain CAG setup. The 30% headroom rule exists because attention degradation at the tail of a fully-packed context window is real - the "lost in the middle" effect documented by Liu et al. (2023, Stanford) means accuracy drops on facts placed 60-80% through a context.

2. Is the corpus stable on a weekly or slower cadence? If not, the cache write costs eat your savings.

3. Can you generate enough query volume to keep the cache warm? Run the maths: cache write cost divided by per-query savings versus RAG. If the breakeven is more queries than you will get in the cache TTL, CAG loses on cost.

4. Are latency or simplicity strong drivers? If you are happy with 300-500ms first-token latency and willing to own a retrieval stack, RAG is fine. If you need sub-200ms or have a small team, CAG looks better.

5. Do you have strict citation, access-control, or per-user filtering requirements? If yes, RAG or the hybrid pattern is easier to operate and audit.

If you answered yes to all five, build CAG first. Otherwise, default to RAG with prompt caching on the static portions and revisit as your corpus and traffic patterns stabilise.

Frequently asked questions

Is CAG just RAG with a bigger context window?

No, but the relationship is closer than the marketing suggests. CAG removes the retrieval system entirely - there is no vector database, no embeddings, no chunking strategy, no reranker. You just put everything in context and let the model's attention do the work. RAG with a big context window still runs retrieval and still picks chunks; it just picks more of them. The operational difference matters: CAG is one fewer system to maintain, monitor, and evaluate. The cost difference matters too - CAG only works economically because of prompt caching, which is a separate mechanism from large context windows.

What does CAG cost compared to RAG in practice?

It depends on cache hit rate. As a rule of thumb: at 100+ queries per cache TTL against a 100,000-token corpus, CAG with Anthropic's prompt caching runs roughly 40-60% cheaper per query than RAG over the same corpus on the same model. Below 20 queries per TTL, CAG is more expensive because the cache write dominates. Above 1,000 queries per TTL, CAG can be 70-80% cheaper. Build a small spreadsheet with your actual query volume, corpus size, and provider pricing before committing - the breakeven point moves with all three.

Does CAG suffer from the "lost in the middle" problem?

Yes, it can. The Stanford research on context-window accuracy showed that facts placed in the middle 40-60% of a long context are recalled less reliably than facts at the start or end, sometimes by 10-15 percentage points. This affects both CAG and RAG-with-large-contexts. Mitigations are similar: structure the corpus with clear section headers, put the most important reference material at the start or end, and use evaluation harnesses to measure recall on middle-of-context facts before going live. Newer models (Claude 3.5 Sonnet onwards, Gemini 1.5 Pro) have substantially reduced but not eliminated this effect.

How do I handle access control with CAG?

This is one of CAG's genuine weak points. If different users should see different slices of the corpus - a customer-support assistant that should never expose internal pricing notes, or a multi-tenant SaaS where each customer has their own data - you have two options. Build one cached context per user or tenant, which inflates cache write costs proportionally to user count. Or use the hybrid pattern: cache only the shared corpus, retrieve user-specific data per query through a permissioned retrieval layer. For any system with non-trivial access control, plan on the hybrid.

What happens when my corpus grows past the context window?

Three options. Split the corpus into domain-specific caches and route queries to the right one - good for products with clear topical separation, harder when queries cross domains. Move to a model with a larger window - Gemini's 2M-token window buys time but not unlimited room. Or move to a hybrid pattern with RAG handling the overflow. Most production CAG systems we have seen eventually become hybrids as the corpus grows. Plan for that transition from day one - structure your prompt assembly so adding a retrieval step later is a small change, not a rewrite.

Is CAG compatible with agents and tool use?

Yes, and it is often a strong fit. Agent loops make multiple model calls per user request - planning, tool selection, tool-result interpretation, response drafting. Each call benefits from cached context, which is why agent frameworks tend to favour CAG or hybrid patterns. The economics improve sharply when one user request triggers 5-10 model calls against the same cached prefix. Anthropic, OpenAI and Google have all published agent reference architectures that assume prompt caching is in use.

Do I need a vector database at all if I use CAG?

Not for the core knowledge base. You may still want one for ancillary uses: semantic search across the corpus for analytics, similarity search for deduplication, or to support the retrieval half of a hybrid pattern. If pure CAG covers your use case and you have no other need for embeddings, you can genuinely skip the vector database - that simplification is one of CAG's main appeals. For most mid-market deployments, though, expect to want embeddings for at least one adjacent use case within the first year.

Which providers support prompt caching well enough for CAG?

As of early 2026: Anthropic offers explicit prompt caching with 5-minute and 1-hour TTLs and clear pricing. OpenAI offers automatic prompt caching on long prompts with no explicit cache controls but transparent discount on repeated prefixes. Google offers context caching on Gemini 1.5 and 2.0 models with explicit TTL controls. Open-weights deployments (Llama, Mistral, Qwen) can implement KV-cache reuse manually via vLLM or TensorRT-LLM, which is more work but eliminates per-token provider fees. For most teams, start with whichever frontier model your stack already uses, then evaluate open-weights once usage patterns stabilise.

How do I evaluate whether CAG or RAG produces better answers on my data?

Build an evaluation set of 100-300 representative questions with known correct answers from your corpus. Run both patterns against the same set, measure answer accuracy, citation accuracy if relevant, p50 and p95 latency, and cost per query. The 2024 Chengchi paper used standard QA benchmarks (HotpotQA, SQuAD) and found CAG matched or exceeded RAG on bounded corpora. Your results will depend on your corpus structure, query patterns, and model choice. Do not pick the pattern based on Twitter discourse - run the eval.

Closing

The CAG vs RAG debate is, beneath the hype, a question about your corpus shape, your update cadence, and your query volume. RAG remains correct for large, dynamic, multi-tenant corpora. CAG is increasingly correct for bounded, stable, high-traffic knowledge bases - which describes more mid-market use cases than people expected when the pattern emerged in late 2024. Most production systems end up running a hybrid, and that is the right answer for most readers of this article.

If you are weighing CAG, RAG, or a hybrid for a production system and want a costed recommendation grounded in your actual corpus and traffic, AI Advisory runs two-week technical assessments that produce a working prototype of both patterns on your data with measured cost and latency. Get in touch to scope one.

Ready to put this into production? book a discovery call.