AI | 7 June 2026 | 5 min read

Embeddings in RAG: What They Are and Why They Matter

A practitioner's guide to embeddings in RAG: how they work, which models to pick, chunking, evaluation, and the failure modes that bite in production

By AI Advisory team

If you are building a retrieval-augmented generation (RAG) system, embeddings are the component that decides whether the model answers from the right context or hallucinates confidently from the wrong one. They are also the part most teams treat as a black box, picking whatever the tutorial used and moving on. That is usually fine until retrieval quality drops in production and nobody knows where to look.

This article explains what embeddings actually are inside a RAG pipeline, how they are produced, which models are worth considering in 2026, and the practical decisions (chunk size, dimensionality, hybrid search, re-ranking) that separate a demo from a system you can put in front of customers.

What an embedding actually is

An embedding is a fixed-length vector of floating-point numbers that represents the meaning of a piece of text. A sentence like "the invoice is overdue" becomes something like a 1,536-dimensional vector: [0.0123, -0.0456, 0.0789, ...]. The numbers themselves are meaningless to a human. What matters is the geometry between them.

Two pieces of text with similar meaning end up close together in that vector space, even if they share no words. "The invoice is overdue" and "payment hasn't arrived yet" should sit near each other. "The invoice is overdue" and "the warehouse is closed on Sundays" should not. Similarity is normally measured with cosine similarity, which compares the angle between two vectors and returns a score between -1 and 1.
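To make the geometry concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions and come from a model, not from hand-picked numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (not produced by any model).
overdue_invoice = [0.9, 0.1, 0.0]
late_payment = [0.8, 0.2, 0.1]
warehouse_hours = [0.0, 0.1, 0.9]

print(cosine_similarity(overdue_invoice, late_payment))     # close to 1
print(cosine_similarity(overdue_invoice, warehouse_hours))  # close to 0
```

Many vector stores normalise vectors to unit length at write time, in which case the dot product alone gives the same ranking.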

The embedding model is what produces these vectors. It is a neural network (almost always a transformer) trained on hundreds of millions of text pairs so that semantically related pairs land close together and unrelated pairs land far apart. OpenAI's text-embedding-3-large, Cohere's embed-v3, Voyage's voyage-3, and open-source models like BGE, E5, and Nomic Embed all do this same job with different trade-offs in quality, dimensionality, language coverage and cost.

Where embeddings fit in a RAG pipeline

RAG has two phases: indexing (done once, then incrementally) and query (done every time a user asks something). Embeddings appear in both.

At indexing time:

Source documents (PDFs, Confluence pages, HubSpot tickets, whatever) are extracted to plain text.
That text is split into chunks of a few hundred to a couple of thousand tokens.
Each chunk is sent to the embedding model, which returns a vector.
The vector is stored in a vector database (Pinecone, Weaviate, Qdrant, pgvector in Postgres, Azure AI Search) alongside the original chunk text and metadata like source URL, last-modified date, and access permissions.

At query time:

The user's question is sent to the same embedding model, producing a query vector.
The vector database performs an approximate nearest-neighbour search (usually HNSW or IVF) to return the top k chunks whose vectors are closest to the query vector.
Those chunks are passed into the LLM's prompt as context, along with the original question.
The LLM generates an answer grounded in the retrieved chunks.

The critical point: you must embed queries with the same model you used for the documents. Vectors from different models are not comparable, even if they happen to have the same dimensionality. Swapping embedding models means re-embedding the entire corpus.
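The two phases can be sketched end to end in a few lines. Everything here is a stand-in: toy_embed is a hypothetical hashed bag-of-words function that only captures word overlap, and the "index" is a Python list scanned by brute force. A real build would call an embedding model and a vector database in their place, but the shape of the flow is the same, including the rule that the question goes through the same embedding function as the chunks.

```python
import hashlib
import math

DIMS = 256  # toy dimensionality chosen for this sketch


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIMS
    for raw in text.lower().split():
        word = raw.strip(".,?!\"'")
        if not word:
            continue
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIMS] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product == cosine


def build_index(chunks: list[dict]) -> list[dict]:
    """Indexing phase: embed each chunk, keep text and metadata beside it."""
    return [{**chunk, "vector": toy_embed(chunk["text"])} for chunk in chunks]


def retrieve(index: list[dict], question: str, k: int = 2) -> list[dict]:
    """Query phase: embed the question with the SAME function, then rank.

    This is an exact brute-force scan; a vector database would use an
    approximate index such as HNSW instead.
    """
    q = toy_embed(question)
    ranked = sorted(
        index,
        key=lambda row: sum(a * b for a, b in zip(q, row["vector"])),
        reverse=True,
    )
    return ranked[:k]


index = build_index([
    {"text": "The invoice is overdue by 30 days.", "source": "billing.md"},
    {"text": "The warehouse is closed on Sundays.", "source": "ops.md"},
    {"text": "Refunds are processed within 5 working days.", "source": "billing.md"},
])
question = "Which invoice is overdue?"
top = retrieve(index, question, k=1)

# The retrieved text becomes the context section of the LLM prompt.
context = "\n".join(row["text"] for row in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```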

Choosing an embedding model

The decision usually comes down to four axes: retrieval quality, dimensionality, cost, and where the model can run. The MTEB leaderboard on Hugging Face is the standard reference for quality benchmarks across retrieval, classification, clustering and reranking tasks.

Hosted commercial models are the default for most production builds. OpenAI's text-embedding-3-large (3,072 dimensions, configurable down to 256 via Matryoshka representation learning) is the workhorse. Cohere's embed-v3 family is strong on multilingual content and supports input-type hints (query vs document) that improve retrieval. Voyage AI's models have regularly ranked at or near the top of the MTEB retrieval benchmarks. Pricing sits roughly between $0.02 and $0.13 per million tokens depending on model and tier, which is trivial compared with generation costs for most workloads.

Open-source models matter when data cannot leave your infrastructure or when embedding volume is very high. BGE-M3, E5-Mistral-7B-instruct, and Nomic Embed are credible options that match or beat commercial models on specific benchmarks. The cost is operational: you need GPU inference (or accept slow CPU inference) and someone to keep it running. For most mid-market builds the hosted model wins on total cost of ownership unless there is a hard data-residency constraint.

Dimensionality matters more than people realise. A 3,072-dimensional vector takes 12KB at float32. A million chunks at full dimensionality is 12GB of vectors before any index overhead. Models like text-embedding-3-large support truncation to 1,024 or 512 dimensions with modest quality loss, which can cut storage and query latency by 3-6x. Test it on your data before committing.
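The storage arithmetic is easy to reproduce, and so is the truncation step. The sketch below assumes float32 storage (4 bytes per value) and ignores index overhead and metadata. The truncation helper keeps the leading dimensions and rescales to unit length, which only makes sense for models trained to be shortened this way; the function names are illustrative, not from any library.

```python
import math


def vector_storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only: no index overhead, no metadata."""
    return num_vectors * dims * bytes_per_value


def truncate_and_renormalise(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length.

    Only meaningful for Matryoshka-style models; truncating vectors from an
    arbitrary model will usually damage retrieval quality.
    """
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in head]


for dims in (3072, 1024, 512):
    gb = vector_storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {gb:.1f} GB per million chunks")
# 3072 dims: 12.3 GB per million chunks
# 1024 dims: 4.1 GB per million chunks
# 512 dims: 2.0 GB per million chunks
```

If you truncate, apply exactly the same truncation to query vectors, and re-run your retrieval evaluation before and after.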

Chunking: the decision that breaks more RAG systems than the embedding model

An embedding represents one chunk of text. If the chunk is the wrong shape, retrieval fails no matter how good the model is.

Three common mistakes:

Chunks too large. A 2,000-token chunk that contains five distinct facts produces an averaged vector that is a weak match for any single fact. Retrieval surfaces it for everything and nothing.
Chunks too small. A 50-token chunk often lacks the context needed to be meaningful on its own. "The deadline is 31 March" is useless without knowing what deadline.
Splitting mid-thought. Fixed-size splitters that cut at character count will split a table down the middle, separate a heading from its content, or break a code block.

Practical defaults that work for most document types: 400-800 token chunks with 10-15% overlap, split on semantic boundaries (paragraphs, headings, markdown structure) rather than character count. For structured documents like contracts or technical manuals, splitting by section with the section title prepended to each chunk often beats any generic strategy. For chat logs or tickets, keep one conversation per chunk and prepend metadata (customer, date, product).
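As a rough illustration of those defaults, the sketch below packs whole paragraphs into chunks, carries a short overlap forward, and prepends the section title to every chunk. It approximates tokens with whitespace-separated words, which is an assumption made to keep the example dependency-free; a real pipeline would count with the embedding model's own tokenizer. The function name and parameters are illustrative.

```python
def chunk_section(title: str, body: str, max_tokens: int = 600,
                  overlap_tokens: int = 60) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    Paragraphs are never cut: a paragraph longer than max_tokens is kept
    whole, and the overlap carried forward can push a chunk slightly over
    the budget.
    """
    paragraphs = [p.split() for p in body.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    for words in paragraphs:
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap_tokens:] if overlap_tokens > 0 else []
        current = current + words
    if current:
        chunks.append(current)
    # Prepend the section title so each chunk carries its own context.
    return [f"{title}\n{' '.join(words)}" for words in chunks]


body = (
    "Invoices are due within 30 days.\n\n"
    "Late invoices accrue interest at 2% per month.\n\n"
    "Disputes must be raised in writing."
)
# Tiny limits so the behaviour is visible on a three-paragraph example.
demo_chunks = chunk_section("Payment terms", body, max_tokens=12, overlap_tokens=2)
for chunk in demo_chunks:
    print(chunk)
    print("---")
```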

A pattern worth knowing: contextual retrieval, published by Anthropic in September 2024, prepends a short LLM-generated description of how each chunk fits into the full document before embedding it. Their published results showed a 35% reduction in top-20 retrieval failures from contextual embeddings alone, rising to 49% when combined with contextual BM25 and to 67% with re-ranking added. The extra LLM call runs once per chunk at indexing time, so the cost is paid up front rather than on every query.

Hybrid search and re-ranking: embeddings alone are not enough

Pure vector search has a known weakness: it is bad at exact matches. If a user searches for an order number, a SKU, a person's name, or a specific error code, dense embeddings will often miss the obvious result because the query is dominated by semantically common words.

The fix is hybrid search: run a vector search and a keyword search in parallel, then fuse the results. The keyword side is usually BM25, the default ranking function in Elasticsearch (Postgres's built-in full-text search ranks with ts_rank rather than BM25, so it needs an extension for true BM25). Reciprocal Rank Fusion (RRF) is the standard merging algorithm and works well with no tuning. Most production vector databases now support hybrid search natively, including Weaviate, Qdrant and Azure AI Search; on Postgres, pgvector can be paired with the pg_search extension.
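Reciprocal Rank Fusion is simple enough to sketch directly. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed; k = 60 is the constant commonly used with RRF. The document ids below are made up, and the two input lists stand in for the output of a vector search and a keyword search.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids (each list is best-first).

    score(d) = sum over lists of 1 / (k + rank of d in that list),
    with ranks starting at 1. Documents missing from a list add nothing.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense-search results
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical keyword-search results

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF works on ranks rather than raw scores, it avoids having to put cosine similarities and BM25 scores on a common scale.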

The second upgrade is re-ranking. After hybrid retrieval returns the top 20-50 candidates, a cross-encoder model (Cohere Rerank, Voyage rerank-2, BGE reranker) scores each candidate against the query directly and reorders them. Cross-encoders are slower than bi-encoder embeddings (they have to process query and document together) but far more accurate. You only run them on a small candidate set, so latency stays acceptable. In practice, re-ranking is the single highest-impact improvement you can make to a RAG system after getting chunking right.
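The retrieve-wide, re-rank-narrow pattern looks roughly like this. The scoring function is deliberately pluggable: toy_overlap_score is a hypothetical stand-in that counts shared words, used here only so the sketch runs without a model. In a real system that slot would be filled by a cross-encoder that reads the query and the candidate together.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, text), text) for text in candidates]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query words present."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Pretend these came back from hybrid retrieval, in first-stage order.
candidates = [
    "warehouse opening hours",
    "error code e42 on login",
    "login page design",
]
best = rerank("error code e42", candidates, toy_overlap_score, top_n=2)
print(best[0])  # error code e42 on login
```

The cost of this stage grows with the number of candidates, which is why it runs on the top 20-50 rather than on the whole corpus.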

Evaluation: how you know your embeddings are working

Most teams ship RAG with no evaluation harness and discover problems through customer complaints. Don't be that team.

At minimum, build a golden set of 50-200 real queries with the correct source chunks identified. Re-run retrieval against this set every time you change the embedding model, chunking strategy, or index settings. Track three metrics:

Recall@k: of the queries where a correct chunk exists, how often does it appear in the top k results? This is the ceiling on answer quality.
Mean Reciprocal Rank (MRR): where in the results does the first correct chunk appear? Higher is better.
Faithfulness: does the final generated answer actually reflect what was in the retrieved chunks? This catches cases where retrieval succeeds but generation hallucinates anyway. Frameworks like Ragas and DeepEval automate this with an LLM-as-judge approach.
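The two retrieval metrics are a few lines each, so there is little excuse not to track them. In the sketch below, results holds the ranked chunk ids returned for each golden query and relevant holds the set of correct chunk ids for the same query; the ids and the three-query golden set are invented for illustration. Faithfulness is not shown because it needs an LLM judge.

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of queries with at least one correct chunk in the top k."""
    hits = sum(
        1 for ranked, gold in zip(results, relevant) if gold & set(ranked[:k])
    )
    return hits / len(results)


def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1 / rank of the first correct chunk (0 if none was returned)."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)


# Hypothetical golden set of three queries.
results = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
relevant = [{"c1"}, {"c4"}, {"c0"}]

print(round(recall_at_k(results, relevant, k=3), 3))      # 0.667
print(round(recall_at_k(results, relevant, k=1), 3))      # 0.333
print(round(mean_reciprocal_rank(results, relevant), 3))  # 0.444
```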

Run evaluation in CI. Every change to chunking, embedding model, or prompt should produce a measurable delta. Without this, you are guessing.

Production realities: cost, latency, freshness, security

Cost. Embedding generation is cheap at query time (one API call per query) and one-off at index time. The hidden cost is re-embedding: every time you change models, chunking strategy, or document preprocessing, you pay to re-embed the whole corpus. For a million chunks averaging around 1,000 tokens each, text-embedding-3-large at $0.13 per million tokens comes to roughly $130, plus engineering time. At the 400-800 token chunk sizes recommended above, the API bill is proportionally lower.
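The re-embedding estimate is plain arithmetic: chunks times average tokens per chunk times price per token. The $130 figure is reproduced below on the assumption of about 1,000 tokens per chunk at $0.13 per million tokens; both inputs are assumptions you should replace with your own corpus statistics and current pricing.

```python
def embedding_cost_usd(num_chunks: int, avg_tokens_per_chunk: float,
                       usd_per_million_tokens: float) -> float:
    """API cost of embedding a corpus once, ignoring retries and overlap."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed inputs: one million chunks, $0.13 per million tokens.
print(round(embedding_cost_usd(1_000_000, 1000, 0.13), 2))  # 130.0
print(round(embedding_cost_usd(1_000_000, 600, 0.13), 2))   # 78.0
```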

Latency. Embedding a query adds 50-200ms. Vector search on a well-indexed million-chunk corpus is 20-100ms. Re-ranking 50 candidates adds 200-500ms. Budget around 500-800ms for retrieval before the LLM generation call, which itself is the dominant latency.

Freshness. Plan for incremental indexing from day one. New documents need to be embedded and added; deleted documents need to be removed; updated documents need their old vectors purged before the new ones go in. A queue-based indexer (n8n, Temporal, or a simple cron + worker) handles this. Stale indexes are a top cause of RAG drift.
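The purge-before-insert rule is the part that is easiest to get wrong, so here is a minimal sketch of it. InMemoryIndex is a hypothetical stand-in for a vector store, keyed by document id and chunk number; real stores expose delete-by-filter or delete-by-id operations that play the same role.

```python
from typing import Callable


class InMemoryIndex:
    """Hypothetical stand-in for a vector store, keyed by (doc_id, chunk_no)."""

    def __init__(self) -> None:
        self.rows: dict[tuple[str, int], dict] = {}

    def delete_document(self, doc_id: str) -> int:
        """Remove every chunk of a document; returns how many were removed."""
        stale = [key for key in self.rows if key[0] == doc_id]
        for key in stale:
            del self.rows[key]
        return len(stale)

    def upsert_document(self, doc_id: str, chunks: list[str],
                        embed: Callable[[str], list[float]]) -> None:
        """Replace a document's chunks with a fresh set.

        Purging first matters: if the new version has fewer chunks than the
        old one, overwriting by position alone would leave stale vectors.
        """
        self.delete_document(doc_id)
        for chunk_no, text in enumerate(chunks):
            self.rows[(doc_id, chunk_no)] = {"text": text, "vector": embed(text)}


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding so the sketch runs without a model."""
    return [float(len(text))]


idx = InMemoryIndex()
idx.upsert_document("policy", ["intro", "rules", "appendix"], fake_embed)
idx.upsert_document("policy", ["rewritten policy"], fake_embed)  # shorter version
print(len(idx.rows))  # 1
```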

Security and access control. Embeddings inherit no permissions. If your indexer ingests a confidential HR document, the retriever will happily return it to any user whose query is similar enough. The standard pattern is to store access metadata (user IDs, group IDs, document classification) alongside each vector and filter at query time. The UK ICO's guidance on AI and data protection is clear that purpose limitation and access control still apply to AI-processed data. Building this in retrospectively is painful; build it in upfront.
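A minimal sketch of query-time filtering, assuming each stored row carries an allowed_groups set next to its vector. The field name, the group names and the two-dimensional vectors are all invented for this example. The important property is that filtering happens before ranking, so a restricted document can never occupy a slot in the results.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_for_user(index: list[dict], query_vector: list[float],
                      user_groups: set[str], k: int = 5) -> list[dict]:
    """Drop rows the user may not see, then rank what is left."""
    visible = [row for row in index if row["allowed_groups"] & user_groups]
    ranked = sorted(visible, key=lambda row: dot(query_vector, row["vector"]),
                    reverse=True)
    return ranked[:k]


index = [
    {"text": "Redundancy plan for Q3", "vector": [1.0, 0.0],
     "allowed_groups": {"hr"}},
    {"text": "Public holiday calendar", "vector": [0.6, 0.8],
     "allowed_groups": {"hr", "sales"}},
]

# The HR document is the closest match, but a sales user never sees it.
sales_view = retrieve_for_user(index, [1.0, 0.0], user_groups={"sales"})
print([row["text"] for row in sales_view])  # ['Public holiday calendar']
```

In a real vector database the same condition would normally be passed as a metadata filter on the search call, so that the filtering happens inside the index rather than in application code.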

Common failure modes worth knowing

Embedding drift across model versions. When a provider releases a new version, do not assume vectors are compatible. They are not. Pin model versions and plan migrations explicitly.

Multilingual mismatches. If your corpus is English but users query in French, you need a multilingual embedding model. Otherwise the French query vector lands nowhere near the English document vectors.

Acronyms and jargon. Domain-specific terms ("S/4HANA", "IR35", "MiFID II") often embed poorly because the base model has not seen them in context. Fine-tuning the embedding model or, more practically, hybrid search with BM25 catches these.

The "lost in the middle" problem. Even if retrieval surfaces the right chunk, LLMs are demonstrably worse at using context that appears in the middle of a long prompt versus the start or end. Keep retrieved context tight (5-10 chunks max) and put the most relevant material first.

Frequently asked questions

Do I need to fine-tune an embedding model for my domain?

Usually no. For most mid-market use cases, a strong off-the-shelf model combined with good chunking, hybrid search and re-ranking will outperform a fine-tuned embedding on a weak pipeline. Fine-tuning makes sense when you have a large volume of domain-specific terminology that base models handle poorly (specialist legal, medical, or industrial vocabulary), and when you have at least a few thousand query-document pairs to train on. Otherwise the engineering and evaluation overhead rarely pays back. Start with the strongest hosted model, instrument retrieval quality, and only consider fine-tuning when you have evidence that semantic mismatch (not chunking or retrieval strategy) is the bottleneck.

What is the difference between embeddings and vector search?

Embeddings are the numerical representations of text; vector search is the algorithm that finds the closest embeddings to a query vector. You need both, and they are separable concerns. The embedding model decides what "similar" means semantically. The vector search engine (the index) decides how quickly you can find similar vectors at scale. A bad embedding model with a fast index gives you fast bad results. A great embedding model with a slow brute-force search gives you accurate results that take seconds per query. Production RAG needs both a strong embedding model and an efficient index, usually HNSW-based, with hybrid keyword search layered on top for exact-match queries.

How much does a production RAG system cost to run?

For a mid-market deployment with around a million chunks and 10,000 queries per day, running costs typically land between £400 and £1,500 per month. Breakdown: embedding queries at roughly £30-£100, vector database hosting at £150-£500 (managed services like Pinecone, or self-hosted Qdrant on a small VM), re-ranking API costs around £100-£300, and LLM generation £150-£600 depending on model choice. Generation is usually the largest single line item, although the retrieval components together can match or exceed it. Initial indexing is a one-off of £50-£200 for the embedding API call. Engineering and operational cost (people keeping it running) usually exceeds infrastructure cost by 3-5x.

Can I use embeddings without a vector database?

For small corpora, yes. Up to about 50,000 chunks, you can store vectors in Postgres with the pgvector extension and run perfectly good HNSW-indexed searches. For up to a few thousand vectors, even a NumPy array in memory works. The reason to adopt a dedicated vector database is operational: managed services handle index rebuilds, replication, hybrid search, metadata filtering and access control out of the box. Most of our mid-market builds start on pgvector (because the client already runs Postgres) and only migrate to Pinecone, Weaviate or Qdrant when scale, multi-tenancy, or hybrid search requirements justify it.

How often should I re-embed my documents?

Embed new and updated documents incrementally, ideally within minutes of the source changing. Full re-embedding of the entire corpus only happens when you change the embedding model, chunking strategy, or preprocessing pipeline. Plan for re-embedding every 12-18 months as a matter of course, because embedding models improve meaningfully on that cadence. Set up your indexer so a full rebuild can run in the background against a shadow index, then atomically swap the live index when complete. This avoids downtime and lets you A/B test new embedding models against your golden evaluation set before committing.

Are embeddings safe under UK GDPR?

The embedding vector itself is derived from personal data and inherits the same protection. The ICO's position is that you cannot treat embeddings as anonymised data, because the source text is recoverable in combination with the chunk store, and even alone the vectors can leak information through inversion attacks. Practical implications: log a lawful basis for embedding personal data, apply the same retention rules to vectors as to the source documents, support deletion requests by removing both the chunk and its vector, and apply access controls at query time. If you use a hosted embedding API, you also need a data processing agreement with the provider and clarity on data residency.

What is the best embedding model in 2026?

There is no single best model; there is a best model for your data, language coverage, latency budget, and deployment constraints. The current strong default for English mid-market RAG is OpenAI's text-embedding-3-large or Voyage's voyage-3-large, both of which score strongly on MTEB retrieval benchmarks. For multilingual content, Cohere's embed-multilingual-v3 and BGE-M3 are the leaders. For self-hosted, BGE-M3 and Nomic Embed v1.5 are credible. Run your top two or three candidates against your own evaluation set with your own chunks. Benchmark scores correlate with real-world performance but do not predict it.

How do I migrate from one embedding model to another?

Plan for a dual-index period. Stand up a new index alongside the existing one, re-embed all chunks with the new model into the new index, and run queries against both for a period to compare results on your evaluation set. Only when the new index is demonstrably equal or better do you switch traffic over. Never try to mix vectors from two models in the same index; cosine similarity between them is meaningless. Budget for both the API cost of re-embedding the full corpus and the engineering time for the index rebuild. For a million-chunk corpus the API bill is the small part; most of the total, which typically runs to a few thousand pounds, is around a week of engineering time.
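The switch-over rule can be expressed as a small gate. IndexAlias is a hypothetical pointer to whichever index is live, and evaluate is any function that maps an index to a quality score, for example recall@k on your golden set. The dictionaries below stand in for real indexes and their scores are invented.

```python
from typing import Any, Callable


class IndexAlias:
    """Hypothetical pointer to whichever index currently serves traffic."""

    def __init__(self, live_index: Any) -> None:
        self.live = live_index

    def promote_if_better(self, shadow_index: Any,
                          evaluate: Callable[[Any], float]) -> bool:
        """Swap in the shadow index only if it scores at least as well."""
        if evaluate(shadow_index) >= evaluate(self.live):
            # One reference assignment: callers see the old index or the
            # new one, never a mixture of vectors from both models.
            self.live = shadow_index
            return True
        return False


# Stand-in indexes with made-up evaluation scores.
old_index = {"model": "model-v1", "recall_at_5": 0.81}
new_index = {"model": "model-v2", "recall_at_5": 0.86}

alias = IndexAlias(old_index)
switched = alias.promote_if_better(new_index, lambda ix: ix["recall_at_5"])
print(switched, alias.live["model"])  # True model-v2
```

Keeping the old index around for a while after the swap gives you a cheap rollback path if production traffic behaves differently from the evaluation set.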

Where to take this next

Embeddings are the quiet workhorse of RAG. Get the model choice, chunking strategy, hybrid search and evaluation harness right, and the rest of the pipeline becomes much easier to reason about. Get them wrong and you will spend months tuning prompts and re-ranking models to compensate for retrieval that was never going to work. The order of work matters: chunking first, then hybrid search, then re-ranking, then model selection, then (rarely) fine-tuning. Always with evaluation running in the background.

If you are scoping or rebuilding a RAG system and want a second pair of eyes on the architecture before you commit to a stack, AI Advisory runs short technical reviews that produce a costed plan you can hand to your engineering team.

Ready to put this into production? Book a discovery call.
