AI | 7 June 2026 | 5 min read

What is Chunking in RAG? A Practical Guide to Splitting Documents for Retrieval

How chunking works in RAG pipelines, which strategies actually retrieve well, and the chunk sizes, overlap and metadata patterns that production systems use

By AI Advisory team

Chunking is the step in a Retrieval-Augmented Generation pipeline where you split source documents into smaller pieces before embedding and indexing them. It sounds trivial. It is not. The way you chunk determines what the retriever can find, what context the LLM sees, and ultimately whether the answers are grounded or hallucinated. Most RAG systems that underperform are not let down by the model or the vector database. They are let down by chunking choices made in the first afternoon of the build and never revisited.

This guide explains what chunking is, why it matters, the main strategies in use, and the concrete parameters that tend to work in production. It is written for engineers and technical leads building or auditing a RAG system, not for a general audience.

What chunking actually is

A RAG pipeline has three logical stages: ingest, retrieve, generate. During ingest you take documents - PDFs, Confluence pages, HTML, Word files, transcripts, database rows - and you produce embeddings stored in a vector index. The model used to produce those embeddings has a fixed input size, and even when it does not, embeddings of very long passages lose discriminative power. So you split each document into chunks first, embed each chunk, and store the vector alongside the chunk text and metadata.

At query time the user's question is also embedded, the vector index returns the top-k chunks by similarity, and those chunks are pasted into the prompt as context. The LLM then answers using that context. The retrieval quality is bounded by what your chunks contain. If the answer to a question is split across two chunks and neither contains enough signal on its own, the retriever will miss it. If a chunk contains three unrelated topics, its embedding becomes a blurry average and matches nothing well.

Good chunking produces passages that are semantically coherent, self-contained enough to be useful out of context, and small enough to embed cleanly but large enough to carry real information. That is the entire job.

Why chunk size matters more than people think

The dominant failure mode in early RAG builds is chunks that are either too large or too small.

Too large - say 2,000 tokens per chunk - and each embedding becomes an average of multiple topics. A question about refund policy retrieves a chunk that is 30% refund policy and 70% shipping terms, because the embedding sits in the middle. The LLM then gets a noisy context window and either answers from the wrong section or refuses. You also waste context budget and money: a top-5 retrieval at 2,000 tokens per chunk is 10,000 tokens before you have written the system prompt.

Too small - say 100 tokens - and chunks lose the surrounding context that makes them meaningful. A sentence that says "This must be completed within 14 days" is useless without the preceding sentence telling you what "this" refers to. Retrieval recall drops because the discriminative terms are scattered across many tiny chunks, and the top-k might return five chunks from the same paragraph rather than five different relevant sources.

The sweet spot for most prose documents is 256-512 tokens per chunk, with 10-20% overlap between adjacent chunks. For dense reference material like API documentation or legal clauses, 512-1,024 tokens often works better. For conversational transcripts and chat logs, 128-256 tokens with speaker turns preserved. These are starting points, not laws. You should measure.
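As a rough illustration, the starting points above can be kept as a small lookup table with a helper that turns the 10-20% overlap guideline into token counts. The names `STARTING_POINTS` and `overlap_range` are hypothetical, and applying the same overlap percentages to every document type is an assumption; the guideline above is stated for prose.

```python
# Starting token ranges per document type, taken from the paragraph above.
STARTING_POINTS = {
    "prose": (256, 512),
    "reference": (512, 1024),
    "transcript": (128, 256),
}


def overlap_range(chunk_tokens: int, low: float = 0.10, high: float = 0.20) -> tuple[int, int]:
    """Return the (min, max) overlap in tokens for a given chunk size."""
    return round(chunk_tokens * low), round(chunk_tokens * high)


for kind, (smallest, largest) in STARTING_POINTS.items():
    print(kind, smallest, largest, overlap_range(largest))
```

Treat the table as the first row of an experiment grid rather than a final configuration.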

The main chunking strategies

There are five strategies you will encounter, in roughly increasing order of effort and sophistication.

Fixed-size chunking

Split the document into N-token (or N-character) chunks with a fixed overlap. Simple, fast, and the default in libraries like LangChain's CharacterTextSplitter. The downside is that it cuts mid-sentence and mid-paragraph, which damages embedding quality. Use it for prototyping and for genuinely unstructured text where nothing else applies.
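A minimal sketch of fixed-size chunking with overlap, assuming the text has already been tokenised. Whitespace splitting stands in for a real tokeniser here, and `fixed_size_chunks` is an illustrative name, not a library function.

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "a b c d e f g h i j".split()
for chunk in fixed_size_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
```

Note that the window ignores sentence boundaries entirely, which is exactly the weakness described above.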

Recursive character chunking

The same idea, but the splitter tries a hierarchy of separators - double newline, single newline, full stop, space - and only falls back to a harder cut when the chunk is still too long. LangChain's RecursiveCharacterTextSplitter is the most common implementation. This is the sensible default for most prose. It respects paragraph and sentence boundaries while still giving you predictable chunk sizes.
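The separator hierarchy can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: it measures length in characters, adds no overlap, and drops the separator at chunk boundaries (so a chunk that ends on a sentence split loses its full stop).

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]


def recursive_split(text: str, max_chars: int, separators: list[str] = SEPARATORS) -> list[str]:
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, max_chars, finer)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # The part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph has two sentences. It is a little longer than the first one."
)
for piece in recursive_split(sample, max_chars=60):
    print(repr(piece))
```

On the sample, the first paragraph survives intact and the second is cut at its sentence boundary rather than mid-word.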

Document-structure-aware chunking

If your source has structure - Markdown headings, HTML tags, PDF sections, code blocks - use it. Split on H2 boundaries, keep tables intact, never split a code block. For Confluence and Notion exports this is the single biggest quality win, because the document author has already told you where the semantic boundaries are. Tools like unstructured.io and LlamaIndex's node parsers do this well.
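For Markdown sources, a structure-aware splitter can be approximated in a few lines. The sketch below splits on H1 and H2 headings and never treats a line inside a fenced code block as a heading; the function name and the returned dictionary shape are assumptions for illustration, and dedicated parsers handle far more cases (tables, nested lists, setext headings).

```python
import re

HEADING = re.compile(r"^(#{1,2})\s+(.*)$")
FENCE = "`" * 3  # the Markdown code-fence marker


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Return one section per H1/H2 heading, keeping code fences intact."""
    sections = []
    heading, lines = "", []
    in_code_fence = False

    def flush():
        text = "\n".join(lines).strip()
        if heading or text:
            sections.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_code_fence = not in_code_fence
        match = None if in_code_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


doc = "\n".join([
    "# Handbook", "Intro text.",
    "## Annual Leave", "You get 25 days.",
    FENCE, "# not a heading", FENCE,
    "## Sick Pay", "Report by 9am.",
])
for section in split_markdown_by_heading(doc):
    print(section["heading"], "->", len(section["text"]), "chars")
```

Sections that are still too long after this pass can be handed to a recursive splitter, with the heading carried along as metadata.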

Semantic chunking

Embed each sentence, then group adjacent sentences whose embeddings are close, splitting where the cosine distance between consecutive sentences spikes. This produces chunks that track topic shifts in the document rather than arbitrary length. It is slower at ingest time and the gains are real but modest - typically a few percentage points on retrieval benchmarks. Worth it for high-stakes corpora, overkill for a customer support knowledge base.
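The mechanism can be sketched with a toy embedding. Here `toy_embed` is a bag-of-words counter standing in for a real sentence-embedding model, and the fixed distance threshold is an assumption; practical implementations usually detect "spikes" relative to the document's own distance distribution, for example with a percentile cut-off.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a sentence-embedding model: lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(sentences: list[str], threshold: float = 0.9, embed=toy_embed) -> list[str]:
    """Start a new chunk wherever consecutive sentences are far apart."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Refunds are issued within 14 days.",
    "Refunds are paid to the original card.",
    "Shipping takes three working days.",
    "Shipping costs depend on weight.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

On this sample the only split falls between the refund sentences and the shipping sentences, because that is the one adjacent pair with no shared vocabulary.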

Agentic or LLM-assisted chunking

Send the document to an LLM and ask it to propose chunk boundaries, or to produce a summary and a set of standalone passages. Expensive at ingest, but useful for documents with awkward structure - complex contracts, research papers, mixed-format reports. Anthropic's contextual retrieval pattern, published in September 2024, falls in this family: it uses an LLM to prepend a short context blurb to each chunk before embedding. Anthropic reports that this cuts the retrieval failure rate by 35% on its own and by 49% when the same contextualised chunks also feed a BM25 index (see Anthropic's contextual retrieval write-up).

Overlap, metadata and the things people forget

Three details matter as much as the strategy itself.

Overlap. Adjacent chunks should share 10-20% of their tokens. This stops a relevant sentence from falling exactly on a chunk boundary and being effectively invisible. Too much overlap inflates your index size and creates near-duplicate retrievals; too little creates blind spots.

Metadata. Every chunk should carry the source document ID, the section heading, the page number or URL, the document date, and any access control tags. Metadata does two jobs: it lets you filter retrievals (only show chunks from documents this user can see, or only chunks from the last 12 months), and it lets you cite sources in the final answer. Without source citation, the user cannot verify the answer and you cannot debug retrieval failures. Postgres with pgvector handles this well because you get proper SQL filtering alongside vector search.
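A sketch of what a chunk record and a metadata filter might look like in application code. The field names are illustrative rather than a required schema, and in a real system the filter would normally run inside the store (for example as a SQL WHERE clause) rather than in Python. Chunks without any access tag are hidden from everyone here, which is a deny-by-default assumption.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    doc_id: str
    section: str
    source_url: str
    doc_date: date
    acl_tags: frozenset = field(default_factory=frozenset)


def visible_chunks(chunks: list[Chunk], user_groups: set[str], not_before: date) -> list[Chunk]:
    """Keep chunks the user may see and that are recent enough."""
    return [
        c for c in chunks
        if c.doc_date >= not_before and (c.acl_tags & user_groups)
    ]


corpus = [
    Chunk("Leave is 25 days.", "hr-001", "Annual Leave", "https://example.com/hr-001",
          date(2025, 3, 1), frozenset({"staff"})),
    Chunk("Board pay review.", "fin-009", "Remuneration", "https://example.com/fin-009",
          date(2025, 4, 1), frozenset({"board"})),
    Chunk("Old leave policy.", "hr-000", "Annual Leave", "https://example.com/hr-000",
          date(2019, 1, 1), frozenset({"staff"})),
]
allowed = visible_chunks(corpus, user_groups={"staff"}, not_before=date(2024, 1, 1))
print([c.doc_id for c in allowed])  # ['hr-001']
```

The same `doc_id`, `section` and `source_url` fields are what the generation step needs in order to cite its sources.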

Chunk-level context. A chunk pulled from page 47 of a policy document means nothing without knowing it is from the policy document. Prepend a short header to each chunk before embedding - something like "Document: Employee Handbook 2025, Section: Annual Leave" - so the embedding captures the document context, not just the local sentence. This is cheap and consistently improves retrieval.
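The header prepending described above is a one-line transformation applied just before embedding. A minimal sketch, with `with_context_header` as a hypothetical helper name:

```python
def with_context_header(chunk_text: str, document_title: str, section: str) -> str:
    """Return the text to embed: a short context header followed by the chunk."""
    header = f"Document: {document_title}, Section: {section}"
    return f"{header}\n\n{chunk_text}"


text_to_embed = with_context_header(
    "This must be completed within 14 days.",
    document_title="Employee Handbook 2025",
    section="Annual Leave",
)
print(text_to_embed)
```

One design choice to settle early is whether to store the header with the chunk text or only use it for the embedding; keeping the raw chunk text separately makes it easier to change the header format later without re-extracting documents.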

How to measure whether your chunking is working

You cannot eyeball chunking quality. Build an evaluation harness early, before you tune anything.

Start with 50-200 question-answer pairs grounded in your real documents. For each question, record which chunk or chunks contain the correct answer. Then for each chunking configuration you want to test, measure two things: recall@k (the share of questions whose top-k retrieval includes at least one chunk containing the answer) and mean reciprocal rank (the average of 1/rank of the first correct chunk, so higher is better). A retrieval that returns the right chunk at position 1 is meaningfully better than one that returns it at position 8: the chunk survives a tighter top-k cut-off, and LLMs tend to use material near the start of the context more reliably than material buried further down.
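Both metrics are a few lines of code once you have, for each question, the ranked chunk IDs the retriever returned and the set of chunk IDs known to contain the answer. A minimal sketch, with illustrative chunk IDs:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of questions with at least one relevant chunk in the top k."""
    hits = sum(1 for ranked, gold in zip(retrieved, relevant) if gold & set(ranked[:k]))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none is retrieved)."""
    total = 0.0
    for ranked, gold in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Three questions: answer found at rank 1, at rank 3, and not found at all.
retrieved = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
relevant = [{"c1"}, {"c9"}, {"c0"}]

print(round(recall_at_k(retrieved, relevant, k=1), 3))      # 0.333
print(round(recall_at_k(retrieved, relevant, k=3), 3))      # 0.667
print(round(mean_reciprocal_rank(retrieved, relevant), 3))  # 0.444
```

Note that gold chunk IDs change whenever the chunking changes, so it is usually easier to label the answer span in the source document and derive the relevant chunk IDs per configuration.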

For end-to-end quality, run the full RAG pipeline on the same questions and score the final answers. Frameworks like Ragas and TruLens automate this with metrics for faithfulness (does the answer stick to the retrieved context), answer relevance, and context precision. Re-run the eval every time you change chunking, embedding model, or retriever. This is how you avoid the trap of "we changed three things and quality got worse, which one was it."

The UK's National Cyber Security Centre has published guidance on building secure machine learning systems that is worth reading alongside the eval work, particularly around prompt injection risk in retrieved content (NCSC machine learning principles).

A practical recipe for most teams

If you are starting a RAG build today and want a defensible default, this is where to begin.

Use document-structure-aware chunking where the source has structure (Markdown, HTML, well-formed PDF). Fall back to recursive character chunking for everything else. Target 400-500 tokens per chunk for prose, 800-1,000 for reference material. Set overlap at 50-75 tokens. Prepend each chunk with document title and section heading before embedding. Store chunks in pgvector or a managed vector store like Pinecone or Weaviate, with full metadata for filtering and citation.

Use a strong embedding model - OpenAI's text-embedding-3-large, Voyage's voyage-3, or Cohere's embed-v3 are the current credible defaults. Add a reranker on top of vector retrieval: retrieve 20-30 candidates by vector similarity, then let a reranker such as Cohere Rerank or one of Voyage's rerankers reorder them and keep the top 5. This two-stage pattern consistently lifts precision.
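The retrieve-then-rerank flow can be sketched independently of any vendor. In the example below the vector scores are supplied by hand and `keyword_overlap` is a toy stand-in for a real reranking model; both are assumptions for illustration only.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[tuple[str, float]],      # (passage, vector similarity)
    rerank_score: Callable[[str, str], float],
    fetch_k: int = 25,
    final_k: int = 5,
) -> list[str]:
    """Shortlist by vector score, then reorder the shortlist with a reranker."""
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:fetch_k]
    reranked = sorted(shortlist, key=lambda c: rerank_score(query, c[0]), reverse=True)
    return [passage for passage, _ in reranked[:final_k]]


def keyword_overlap(query: str, passage: str) -> float:
    """Toy reranker: fraction of query words that appear in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    ("shipping terms and delivery", 0.9),
    ("refund policy for returns", 0.8),
    ("office opening hours", 0.7),
]
top = retrieve_then_rerank("refund policy", candidates, keyword_overlap, final_k=1)
print(top)  # ['refund policy for returns']
```

The vector stage put the shipping passage first; the reranking stage corrects the order, which is the effect the two-stage design is meant to produce.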

Build the eval harness in the first week, not the third month. Tune chunk size, overlap and prepended context against the eval, in that order. Only reach for semantic or LLM-assisted chunking if recall@k on that eval is still below 90% after the basics are sound.

Most teams will not need anything more exotic than this. The teams that do - heavily regulated content, multilingual corpora, code-heavy documentation - will know from the eval results that they have a specific problem to solve, rather than chasing chunking fashion.

Frequently asked questions

What chunk size should I use for RAG?

For most prose documents, 256-512 tokens per chunk with 10-20% overlap is a sensible starting point. Reference material like API docs or legal clauses can go to 512-1,024 tokens because each section is denser and more self-contained. Conversational transcripts work better at 128-256 tokens with speaker turns preserved. These are starting points, not final answers. Build an evaluation set of 50-200 question-answer pairs from your actual documents and measure recall@k against different chunk sizes. The right size for your corpus is the one that maximises recall and rank position on your eval, which is rarely the size the first tutorial suggested.

What is the difference between chunking and embedding?

Chunking is the act of splitting documents into smaller passages. Embedding is the act of converting each chunk into a numeric vector that represents its meaning, using a model like OpenAI's text-embedding-3-large or Cohere embed-v3. Chunking happens first; embedding is applied to the output of chunking. The two interact closely because embedding models have a maximum input length (about 8,000 tokens for text-embedding-3-large, but only 512 for Cohere embed-v3, so check the limit for your model) and their semantic resolution degrades on very long inputs. So chunking choices effectively constrain what the embedding model can usefully represent. You tune them together against retrieval metrics, not in isolation.

Do I need overlap between chunks?

Yes, in almost all cases. Overlap of 10-20% of chunk length stops relevant information from falling exactly on a chunk boundary and being lost between two adjacent chunks. Without overlap, a sentence that ends one chunk and the related sentence that begins the next get embedded separately and may both fail to retrieve. Too much overlap, above 30%, inflates index size and creates near-duplicate results in the top-k. A practical default is 50-75 tokens of overlap for chunks of 400-500 tokens. The exact figure matters less than having some overlap configured.

Is semantic chunking worth the extra complexity?

Sometimes. Semantic chunking, where you split on sentence-embedding distance rather than fixed boundaries, typically lifts retrieval metrics by a few percentage points on well-structured prose. It is slower at ingest and harder to debug. The honest answer for most teams: get recursive character chunking with document-structure awareness working first, prepend section headings to each chunk, and build a reranker into the retrieval step. These three steps usually close most of the gap to semantic chunking. Only adopt semantic chunking if your evaluation shows recall is still capped after the basics are sound, which is uncommon for typical knowledge-base RAG.

How does chunking affect cost?

In two places. At ingest you pay to embed every chunk. Embedding is priced per token and is cheap (around $0.13 per million tokens for text-embedding-3-large), so smaller chunks mean more chunks, more vectors to store and more API calls, but roughly the same embedding bill; the total only rises with the share of tokens duplicated by overlap. At query time you pay for the chunks sent to the LLM as context: a top-5 retrieval at 1,000 tokens each is 5,000 input tokens per query before the system prompt and question. Halving chunk size roughly doubles the number of vectors in the index but lets you retrieve more chunks within the same context budget. The bigger cost lever is usually how many chunks you put in the prompt, not their size.
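The two cost components are simple arithmetic. The sketch below uses the embedding price quoted above; the 10-million-token corpus size is a made-up figure for illustration, and LLM input pricing is left out because it varies by model.

```python
EMBED_PRICE_PER_MILLION_USD = 0.13  # text-embedding-3-large, as quoted above


def embedding_cost_usd(tokens_embedded: int, price_per_million: float = EMBED_PRICE_PER_MILLION_USD) -> float:
    """One-off cost of embedding a given number of tokens."""
    return tokens_embedded / 1_000_000 * price_per_million


def context_tokens_per_query(top_k: int, chunk_tokens: int) -> int:
    """Retrieved-context tokens sent to the LLM, before system prompt and question."""
    return top_k * chunk_tokens


corpus_tokens = 10_000_000  # hypothetical corpus size
print(f"ingest: ${embedding_cost_usd(corpus_tokens):.2f}")
print("per query:", context_tokens_per_query(top_k=5, chunk_tokens=1_000), "context tokens")
```

The ingest figure is paid once per re-index, while the per-query figure is multiplied by every request, which is why the number of chunks in the prompt tends to dominate.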

How should I handle tables, code and images?

Treat them as atomic units. Never split a table or code block across chunks - retrieval that returns half a table is worse than not retrieving it at all. For tables, consider extracting them separately and storing both a structured representation (JSON or Markdown) and a natural-language summary, embedding the summary for retrieval. For code, chunk on function or class boundaries using a language-aware parser like tree-sitter. For images in PDFs, use a vision model to generate descriptions and embed those alongside the surrounding text. Libraries like unstructured.io and LlamaParse handle a lot of this automatically and are worth evaluating before you build custom extractors.
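To illustrate chunking code on function and class boundaries, the sketch below uses Python's built-in `ast` module rather than tree-sitter, so it only handles Python source; tree-sitter plays the same role across many languages. The function name is hypothetical, and decorators and module-level statements are not captured in this simplified version.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                # Exact source text of the definition, including its body.
                "code": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''
import os

def load(path):
    return open(path).read()

class Index:
    def add(self, chunk):
        self.items.append(chunk)
'''

for chunk in chunk_python_source(SAMPLE):
    print(chunk["name"], "starts at line", chunk["start_line"])
```

Each class stays whole here; for very large classes you would descend one level and chunk per method, keeping the class name as metadata.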

Should chunking change when I switch embedding models?

Often, yes. Different embedding models have different optimal input lengths and different tolerances for noisy chunks. A model trained on short queries and short passages will degrade faster on 1,000-token chunks than a model trained on longer contexts. When you change embedding model, re-run your evaluation across at least three chunk-size configurations before declaring the new model better or worse. The same applies to reranking models: a reranker that excels at scoring 200-token passages may underperform on 800-token ones. Treat chunking, embedding and reranking as a coupled system that gets tuned together.

How do I stop the LLM hallucinating from retrieved chunks?

Three layers help. First, ensure every chunk carries enough context to stand alone - prepended section headings, source title, date. Second, instruct the model in the system prompt to answer only from the supplied context and to say it does not know when the context is insufficient; include a refusal pattern with examples. Third, cite sources in the final answer with links back to the source document, which both helps the user verify and creates a feedback loop where bad retrievals get reported. None of this fully eliminates hallucination, but together they reduce it to a level where the system is safe to deploy in production support and internal-knowledge use cases.
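The second and third layers largely come down to how the prompt is assembled. A minimal sketch, assuming each retrieved chunk is a dictionary with `title`, `url` and `text` keys; the wording of the instruction and the refusal sentence are examples to adapt, not a tested prompt.

```python
REFUSAL = "I do not know based on the provided documents."


def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Number each source so the model can cite it as [n]."""
    sources = "\n\n".join(
        f"[{i}] {c['title']} ({c['url']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them as [n]. "
        f'If the sources do not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "How many days of annual leave do staff get?",
    [
        {"title": "Employee Handbook 2025, Annual Leave",
         "url": "https://example.com/handbook#leave",
         "text": "Full-time staff receive 25 days of annual leave."},
    ],
)
print(prompt)
```

Because the source numbers map back to chunk metadata, the application can turn each [n] in the answer into a link to the original document.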

Closing thought

Chunking is unglamorous work that decides whether your RAG system is useful or embarrassing. Spend a week getting it right at the start - structure-aware splitter, sensible chunk size, overlap, metadata, prepended context, evaluation harness - and you will spend less time later explaining why the chatbot cannot find things that are clearly in the documents. If you want a second pair of eyes on a RAG build that is not retrieving what it should, AI Advisory runs RAG audits and builds production retrieval pipelines for UK mid-market teams.