AI · 7 June 2026 · 5 min read

Cosine Similarity in RAG: How It Works and When It Fails

A practitioner's guide to cosine similarity in RAG systems: the maths, why it dominates vector search, where it breaks, and what to use instead

By AI Advisory team

Cosine similarity is the default scoring function in almost every retrieval-augmented generation (RAG) system shipped in the last three years. If you have built a chatbot grounded in your own documents, queried Pinecone, Weaviate, or Qdrant, or run pgvector with the <=> operator, you have almost certainly used cosine similarity, whether you noticed or not.

It is also the source of a lot of quiet retrieval failure. Engineers ship a RAG pipeline, the demo looks great, and three weeks later support is fielding complaints that the assistant keeps surfacing the wrong document. Nine times out of ten, the root cause is not the LLM. It is the assumption that cosine similarity between embeddings is a reliable proxy for semantic relevance. Sometimes it is. Often it is not.

This article covers what cosine similarity actually measures, how it is used inside a RAG pipeline, why it is the default, where it breaks down, and what to do about it in production.

What cosine similarity actually measures

Cosine similarity is a measure of the angle between two vectors. Given two vectors A and B, it computes:

cos(θ) = (A · B) / (||A|| × ||B||)

That is the dot product of the two vectors divided by the product of their magnitudes. The result is bounded between -1 and 1. A score of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. In practice, when comparing embeddings from modern text encoders like OpenAI's text-embedding-3-small or Cohere's embed-v3, scores typically fall between 0.0 and 0.9, rarely going negative.

The critical property is that cosine similarity ignores vector magnitude. It only cares about direction. Two vectors of wildly different lengths can have a cosine similarity of 1 if they point the same way. This is deliberate: in text embedding spaces, magnitude often correlates with frequency or document length, neither of which should dominate semantic matching. By stripping out magnitude, cosine similarity isolates the directional signal that (in theory) encodes meaning.
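As a minimal sketch of the formula above, the following Python computes cosine similarity with only the standard library. The function name and the three-dimensional example vectors are illustrative assumptions; real embeddings have hundreds or thousands of dimensions, but the arithmetic is the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]      # same direction, ten times the magnitude
orthogonal = [3.0, 0.0, -1.0]    # dot product with a is 3 + 0 - 3 = 0
opposite = [-1.0, -2.0, -3.0]    # same line, reversed direction

print("scaled:    ", cosine_similarity(a, scaled))
print("orthogonal:", cosine_similarity(a, orthogonal))
print("opposite:  ", cosine_similarity(a, opposite))
```

The scaled vector scores the same as an exact copy would, which is the magnitude-invariance described above: only the direction contributes to the score.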

Contrast this with Euclidean distance, which measures straight-line distance between two points and is sensitive to both direction and magnitude. Or the dot product, which combines direction and magnitude. Cosine similarity is, mathematically, the dot product of L2-normalised vectors. If you normalise your embeddings to unit length at ingestion, dot product and cosine similarity produce identical rankings - which is why production systems often use dot product under the hood for speed.
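To make the comparison concrete, here is a small sketch with made-up vectors (the document names and values are assumptions chosen for illustration). It ranks three documents by raw dot product, then by dot product and Euclidean distance after L2-normalisation.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.2, 0.9, 0.1]
docs = {
    "short_on_topic": [0.1, 0.5, 0.05],   # well aligned, small magnitude
    "long_off_topic": [4.0, 1.0, 3.0],    # poorly aligned, large magnitude
    "long_on_topic": [1.0, 5.0, 1.0],     # well aligned, large magnitude
}

# Raw dot product: magnitude leaks into the ranking.
raw_dot_rank = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)

# After normalisation, dot product is cosine similarity.
unit_query = l2_normalise(query)
unit_docs = {name: l2_normalise(vec) for name, vec in docs.items()}
unit_dot_rank = sorted(
    unit_docs, key=lambda name: dot(unit_query, unit_docs[name]), reverse=True
)
# Smaller distance is better, so sort ascending.
unit_l2_rank = sorted(
    unit_docs, key=lambda name: euclidean(unit_query, unit_docs[name])
)

print("raw dot product:   ", raw_dot_rank)
print("normalised dot:    ", unit_dot_rank)
print("normalised L2 dist:", unit_l2_rank)
```

On unit vectors the squared Euclidean distance equals 2 - 2·cos(θ), so the two normalised rankings agree. The raw dot product, by contrast, promotes the long off-topic vector above the short on-topic one purely because of its length.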

How cosine similarity fits into a RAG pipeline

A standard RAG pipeline has five stages: chunking, embedding, indexing, retrieval, and generation. Cosine similarity lives in the retrieval stage, but its effectiveness is determined by decisions made earlier.

At ingestion, source documents are split into chunks (typically 200-800 tokens), each chunk is passed through an embedding model that returns a fixed-length vector (1,536 dimensions for OpenAI's text-embedding-3-small, 1,024 for Cohere's embed-v3), and those vectors are stored in a vector database alongside the original text and metadata.

At query time, the user's question is embedded using the same model, and the vector database performs a nearest-neighbour search using cosine similarity (or a fast approximation of it) to return the top k chunks - usually k = 4 to 20. Those chunks are concatenated into the LLM's context window along with the question, and the model generates a grounded answer.

The cosine similarity score is doing one specific job here: ranking which chunks are most likely to contain the answer. Everything else - prompt construction, refusal logic, citation handling - is downstream. If retrieval ranks the wrong chunk first, the LLM will either hallucinate or politely tell the user it cannot find the answer, even when the answer is sitting in chunk number seven.
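The retrieval step described above can be sketched as an exact (brute-force) top-k search. Everything here is a stand-in: the `store` dictionary plays the role of a vector database, and the three-dimensional vectors are hand-written substitutes for real embeddings. A production system would call an embedding model and an ANN index instead.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical store: chunk id -> (toy vector, original chunk text).
store = {
    "refund-policy": ([0.9, 0.1, 0.0], "Enterprise refunds are issued within 30 days."),
    "pricing-page": ([0.6, 0.7, 0.1], "Plans start at the Team tier."),
    "office-locations": ([0.0, 0.2, 0.9], "We have offices in London and Leeds."),
}

def top_k(query_vec, store, k):
    """Score every chunk against the query and keep the k best, highest first."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, (vec, _text) in store.items()
    )
    return heapq.nlargest(k, scored)

query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded user question
results = top_k(query_vec, store, k=2)

# The retrieved texts are concatenated into the context passed to the LLM.
context = "\n\n".join(store[chunk_id][1] for _score, chunk_id in results)

for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```

Anything that does not make the top k never reaches the model, which is why a ranking error at this stage cannot be repaired by prompt engineering alone.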

Why cosine similarity became the default

There are three practical reasons cosine similarity dominates RAG retrieval, and one historical one.

First, it is cheap to compute. With pre-normalised vectors, cosine similarity reduces to a dot product - a single SIMD-friendly operation per comparison. Modern approximate nearest neighbour (ANN) indexes like HNSW, IVF, and ScaNN are heavily optimised for this. A well-tuned HNSW index can return top-10 results from a million-vector corpus in under 10 milliseconds on a single CPU core.

Second, modern text encoders are explicitly trained to produce cosine-friendly embeddings. Models like Sentence-BERT, E5, BGE, and the OpenAI embedding family use contrastive learning objectives - typically InfoNCE or triplet loss - that pull semantically similar text closer together in cosine space and push dissimilar text further apart. The encoder and the similarity function are co-designed. Using Euclidean distance on these embeddings usually produces worse retrieval, because the model was never asked to make magnitude meaningful.

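Third, it is bounded and interpretable. A score between -1 and 1 (in practice roughly 0 to 1 for most text embeddings, which rarely produce negative scores) is easy to threshold, log, and reason about. Engineers can set a cutoff, say 0.75, below which a chunk is discarded as irrelevant, provided the value is calibrated against the score distribution of the embedding model in use. That is much harder with unbounded distances.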

The historical reason is that information retrieval has used cosine similarity since the 1970s, when it was applied to TF-IDF vectors in the SMART system at Cornell. The intuition - that the angle between term frequency vectors captures topical overlap better than raw counts - carried over directly when dense embeddings replaced sparse ones in the late 2010s.

Where cosine similarity quietly fails

The defaults look clean in a benchmark. They get messy in production. Here are the failure modes that show up repeatedly in real RAG deployments.

Surface-form bias

Embedding models are trained on web text. They learn that questions and answers often share vocabulary - which means chunks that lexically mirror the query rank high, even when they do not contain the answer. Ask "what is our refund policy for enterprise customers?" and you may get back a marketing page that uses all those words but never states the policy, ranked above the actual policy document that phrases things differently.

The needle-in-a-haystack problem

If the answer is one sentence buried in a 600-token chunk that otherwise covers a different topic, the chunk's overall embedding gets dominated by the surrounding content. The relevant sentence is invisible to cosine similarity at the chunk level. This is why chunking strategy matters more than embedding model choice for most production systems.
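A toy model makes the dilution effect visible. The sketch below assumes, purely for illustration, that a chunk's embedding behaves like the average of its sentence vectors; real encoders are more complicated, but the direction of the effect is similar.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise average: a simplifying stand-in for chunk-level pooling."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

query = [1.0, 0.0, 0.0]
answer_sentence = [1.0, 0.0, 0.0]   # points the same way as the query
filler_sentence = [0.0, 1.0, 0.0]   # unrelated topic, orthogonal to the query

# One answer sentence plus one filler sentence.
focused_chunk = mean_vector([answer_sentence, filler_sentence])
# The same answer sentence buried among five filler sentences.
diluted_chunk = mean_vector([answer_sentence] + [filler_sentence] * 5)

print("sentence alone:", cosine_similarity(query, answer_sentence))
print("focused chunk: ", cosine_similarity(query, focused_chunk))
print("diluted chunk: ", cosine_similarity(query, diluted_chunk))
```

Under this assumption the score drops steadily as unrelated sentences are added, even though the answer sentence itself has not changed.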

Negation and antonyms

Cosine similarity in dense embedding space frequently rates "the policy allows X" and "the policy does not allow X" as highly similar - often above 0.9. The negation is a small directional perturbation in a 1,536-dimensional space. For factual or compliance use cases this is dangerous, and re-ranking with a cross-encoder is usually required to catch it.

Domain drift

General-purpose embedding models are trained on general-purpose text. Drop them into legal contract review, clinical notes, or industrial maintenance logs and retrieval quality degrades sharply. The vocabulary, syntax, and discourse structure of specialist domains push relevant chunks away from queries in cosine space. The fix is domain-adapted embeddings (fine-tuning E5 or BGE on your corpus) or a hybrid retrieval setup, not a different similarity metric.

The popularity collapse

In large corpora, certain chunks - boilerplate footers, common introductions, frequently repeated definitions - end up with embeddings that sit near the centre of the embedding space and score moderately well against almost any query. They become noisy false positives at every k. Dedup and metadata filtering are the practical answer.
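One way to find such chunks before deduplicating or filtering them is to count how often each chunk appears in the top k across a sample of unrelated queries. This is a suggested diagnostic rather than something a vector database provides out of the box; the function name, the 50% cutoff, and the sample data are all assumptions.

```python
from collections import Counter

def find_hub_chunks(results_per_query, max_share=0.5):
    """Return chunk ids that appear in the top-k of more than max_share of queries."""
    n_queries = len(results_per_query)
    counts = Counter(
        chunk_id
        for retrieved in results_per_query.values()
        for chunk_id in set(retrieved)   # count each chunk once per query
    )
    return sorted(
        chunk_id for chunk_id, count in counts.items()
        if count / n_queries > max_share
    )

# Hypothetical top-3 results for four unrelated queries.
results_per_query = {
    "refund policy": ["policy-3", "footer-legal", "faq-12"],
    "office address": ["contact-1", "footer-legal", "about-2"],
    "api rate limits": ["api-docs-8", "footer-legal", "faq-12"],
    "holiday allowance": ["hr-handbook-4", "footer-legal", "about-2"],
}

hubs = find_hub_chunks(results_per_query)
print(hubs)
```

Chunks flagged this way are candidates for removal at ingestion or exclusion via a metadata flag; a human should still check that they are boilerplate and not simply broadly useful content.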

What to do about it in production

Cosine similarity is not the enemy. It is a fast, well-understood first-pass ranker. The mistake is treating it as the final answer. Production RAG systems that actually work tend to layer several techniques on top.

Hybrid retrieval. Combine dense cosine retrieval with sparse keyword search (BM25 via Elasticsearch or OpenSearch, or PostgreSQL's built-in full-text search over a tsvector column). Dense retrieval catches semantic paraphrases; sparse retrieval catches exact terminology - product codes, names, jargon - that dense models often miss. Most serious vector databases now expose hybrid search natively. Weaviate, Qdrant, and Pinecone all support it; pgvector users typically build it manually, using a tsvector column alongside the embedding column. Microsoft's research on Azure AI Search showed hybrid retrieval with re-ranking outperformed pure vector retrieval on most internal benchmarks.
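The text above does not prescribe how to merge the dense and sparse result lists. One common choice, shown here as a sketch, is reciprocal rank fusion (RRF), which combines rankings without needing the two scoring scales to be comparable. The chunk ids are made up, and k = 60 is the conventional constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from the two retrievers, best first.
dense_ranking = ["faq-12", "policy-3", "blog-7"]
sparse_ranking = ["policy-3", "sku-sheet-9", "faq-12"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Chunks that both retrievers agree on rise to the top, while a chunk found by only one retriever (such as an exact product-code match from the sparse side) still makes the fused list.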

Re-ranking with a cross-encoder. After cosine similarity returns the top 50-100 candidates, pass them through a cross-encoder like Cohere Rerank, BGE-reranker, or a custom-trained model. Cross-encoders look at the query and candidate together (rather than encoding each independently), which means they actually read the candidate in the context of the question. They are too slow for first-pass retrieval over millions of chunks, but excellent for re-ranking a shortlist. The latency cost is usually 100-300ms; the relevance gain is often substantial.
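The two-stage shape can be sketched as follows. The scoring function here is deliberately a stand-in that returns made-up numbers from a lookup table; in a real system `score_pair` would call a cross-encoder model or a hosted re-ranking API. The candidate texts and ids are also hypothetical.

```python
def rerank(query, candidates, score_pair, top_n):
    """Re-score first-pass candidates with a (query, candidate) scorer; keep the best top_n."""
    scored = [(score_pair(query, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [candidate for _score, candidate in scored[:top_n]]

# Hypothetical first-pass shortlist, in cosine-similarity order.
first_pass = [
    {"id": "marketing-page", "text": "Enterprise customers love our flexible plans."},
    {"id": "policy-doc", "text": "Annual contracts may be cancelled for a full reimbursement within 30 days."},
    {"id": "blog-post", "text": "Five tips for onboarding enterprise teams."},
]

# Made-up relevance scores standing in for a cross-encoder's output.
MADE_UP_SCORES = {"marketing-page": 0.21, "policy-doc": 0.93, "blog-post": 0.05}

def stand_in_cross_encoder(query, candidate):
    # A real cross-encoder would read the query and candidate["text"] jointly.
    return MADE_UP_SCORES[candidate["id"]]

reranked = rerank(
    "what is our refund policy for enterprise customers?",
    first_pass,
    stand_in_cross_encoder,
    top_n=2,
)
print([candidate["id"] for candidate in reranked])
```

Keeping the scorer as a plain callable makes it easy to swap models and to measure the re-ranker's contribution in isolation.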

Chunking that respects structure. Most retrieval failures attributed to "the embedding model" are actually chunking failures. Split on semantic boundaries (sections, headings, paragraphs) rather than fixed token counts. Add a short summary or title to each chunk before embedding so the dense vector reflects the chunk's purpose, not just its prose. Overlap chunks by 10-15% to avoid splitting answers across boundaries.
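A rough sketch of structure-aware chunking, under simplifying assumptions: paragraphs are separated by blank lines, size is measured in words rather than tokens, and overlap is expressed as whole paragraphs carried into the next chunk rather than a percentage. The function name and parameters are illustrative.

```python
def chunk_by_paragraph(title, text, max_words=120, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks and prefix each chunk with the title."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for paragraph in paragraphs:
        candidate = current + [paragraph]
        too_big = sum(len(p.split()) for p in candidate) > max_words
        if current and too_big:
            chunks.append(current)
            # Carry the tail of the finished chunk forward as overlap.
            carried = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current = carried + [paragraph]
        else:
            current = candidate
    if current:
        chunks.append(current)
    # The title is prepended so the embedding reflects what the chunk is about.
    return [f"{title}\n\n" + "\n\n".join(parts) for parts in chunks]

document_text = (
    "Refunds are available.\n\n"
    "Enterprise plans differ.\n\n"
    "Contact your manager.\n\n"
    "Processing takes days."
)

chunks = chunk_by_paragraph("Refund policy", document_text, max_words=6)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch a chunk can exceed the budget when the carried-over paragraph and the new one are both long, so treat `max_words` as a soft target and check the resulting size distribution on your own corpus.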

Metadata filtering. Cosine similarity does not know that a 2019 policy document is stale or that one chunk belongs to a different business unit. Filter on document date, source, language, and access control before ranking, not after. Every production vector database supports this; not every team uses it.
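The filter-then-rank order can be sketched with plain Python. The field names (`updated`, `unit`), the dates, and the two-dimensional vectors are assumptions for illustration; a vector database would apply the equivalent filter inside the query.

```python
import math
from datetime import date

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical chunks with toy vectors and metadata.
chunks = [
    {"id": "policy-2019", "vector": [1.0, 0.0], "updated": date(2019, 3, 1), "unit": "enterprise"},
    {"id": "policy-2025", "vector": [0.8, 0.4], "updated": date(2025, 1, 15), "unit": "enterprise"},
    {"id": "retail-faq", "vector": [0.9, 0.2], "updated": date(2025, 2, 1), "unit": "retail"},
    {"id": "sla-2024", "vector": [0.2, 0.9], "updated": date(2024, 6, 1), "unit": "enterprise"},
]

def rank_ids(query_vec, candidates, k):
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_vec, c["vector"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:k]]

def filtered_top_k(query_vec, chunks, k, updated_since, unit):
    """Apply metadata constraints first, then rank only the eligible chunks."""
    eligible = [
        c for c in chunks
        if c["updated"] >= updated_since and c["unit"] == unit
    ]
    return rank_ids(query_vec, eligible, k)

query_vec = [1.0, 0.05]
unfiltered = rank_ids(query_vec, chunks, k=2)
filtered = filtered_top_k(
    query_vec, chunks, k=2, updated_since=date(2023, 1, 1), unit="enterprise"
)
print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```

Without the filter, the stale 2019 document and the wrong business unit's FAQ occupy both slots; filtering after ranking would have left the top 2 empty of usable results.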

Evaluation harness. The single biggest gap in most RAG projects is the absence of a measurable retrieval evaluation. Build a set of 50-200 representative questions with known correct chunks, and measure recall@k, MRR, and nDCG every time you change the embedding model, chunking strategy, or re-ranker. Without this, you are tuning blind. Frameworks like Ragas, TruLens, and DeepEval cover the basics.
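For teams that want to start before adopting a framework, the three metrics named above fit in a few lines. This sketch assumes binary relevance (a chunk is either correct or not) and uses a made-up two-question evaluation set; the chunk ids are hypothetical.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical evaluation set: ranked retrieval output and known-correct chunks.
eval_set = [
    {"retrieved": ["c3", "c1", "c7"], "relevant": {"c1"}},
    {"retrieved": ["c2", "c9", "c4"], "relevant": {"c2", "c5"}},
]

k = 3
n = len(eval_set)
mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"], k) for q in eval_set) / n
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in eval_set) / n
mean_ndcg = sum(ndcg_at_k(q["retrieved"], q["relevant"], k) for q in eval_set) / n

print(f"recall@{k}: {mean_recall:.3f}  MRR: {mrr:.3f}  nDCG@{k}: {mean_ndcg:.3f}")
```

Re-running the same script after each change to the embedding model, chunking, or re-ranker turns "it feels better" into a number that can be tracked.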

Choosing a similarity metric in practice

For text embeddings from modern encoders, cosine similarity (or equivalently, dot product on normalised vectors) is the right default. The question is rarely "should I use cosine or Euclidean". The question is what to do after cosine similarity returns its first-pass ranking.

Use Euclidean distance only when working with embeddings explicitly trained for it - some image encoders, some older models. Use dot product (unnormalised) when magnitude carries meaning, which is rare in text. Use Manhattan distance essentially never for embeddings; it has niche uses in high-dimensional sparse spaces but is not relevant to standard RAG.

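In pgvector, the three most commonly used operators are <=> for cosine distance, <-> for Euclidean (L2) distance, and <#> for negative inner product. Note that <=> returns a distance, so cosine similarity is 1 minus the result. The metric is fixed when you create the index, not the table: HNSW and IVFFlat indexes are built with a metric-specific operator class (vector_cosine_ops, vector_l2_ops, or vector_ip_ops), and a query that uses a different operator cannot use that index and falls back to a sequential scan. Switching metric means building a new index.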

FAQ

Is cosine similarity the same as the dot product?

Not quite, but very close in practice. Cosine similarity is the dot product divided by the product of vector magnitudes, which makes it a measure of direction only. If you L2-normalise your embeddings to unit length at ingestion - which most production pipelines do - then cosine similarity and dot product produce identical rankings, and dot product is faster to compute. Most vector databases internally use dot product on normalised vectors and call it cosine similarity. The distinction matters only if you are working with unnormalised embeddings, in which case dot product will favour longer documents.

What is a good cosine similarity score for RAG retrieval?

It depends entirely on the embedding model. For OpenAI's text-embedding-3-small, relevant matches typically score 0.4 to 0.8, and a threshold of around 0.35-0.45 is a reasonable cutoff for discarding obvious irrelevance. For Cohere's embed-v3 or BGE models, the score distributions differ. Do not transplant thresholds across models. The right approach is to build a labelled evaluation set, plot the score distribution for known-relevant and known-irrelevant pairs, and pick the threshold where they separate. Absolute scores are less useful than relative ranking; in most RAG systems, you keep the top k regardless of score and let the LLM handle borderline cases.
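The threshold-picking step can be sketched like this. The scores are invented for illustration, and the selection rule (maximise true-positive rate minus false-positive rate) is one reasonable criterion among several; a team that cares more about recall than precision would pick differently.

```python
def best_threshold(relevant_scores, irrelevant_scores):
    """Try each observed score as a cutoff; return the one that best separates the sets."""
    candidates = sorted(set(relevant_scores) | set(irrelevant_scores))
    best_t, best_gap = None, -1.0
    for t in candidates:
        # Share of known-relevant pairs kept at this cutoff.
        tpr = sum(s >= t for s in relevant_scores) / len(relevant_scores)
        # Share of known-irrelevant pairs wrongly kept at this cutoff.
        fpr = sum(s >= t for s in irrelevant_scores) / len(irrelevant_scores)
        if tpr - fpr > best_gap:
            best_t, best_gap = t, tpr - fpr
    return best_t, best_gap

# Made-up cosine scores from a hypothetical labelled evaluation set.
relevant_scores = [0.62, 0.55, 0.71, 0.48, 0.66, 0.39]
irrelevant_scores = [0.31, 0.22, 0.41, 0.28, 0.44, 0.18]

threshold, gap = best_threshold(relevant_scores, irrelevant_scores)
print(f"threshold={threshold}, separation={gap:.3f}")
```

Because the two distributions overlap, no cutoff is perfect: here the chosen value still discards one relevant pair. That residual error is the reason to prefer top-k ranking over hard thresholds where possible.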

Should I use cosine similarity or learn a custom similarity function?

For the vast majority of RAG systems, cosine similarity on a good off-the-shelf embedding model is the correct choice. The engineering effort to train a custom similarity function rarely pays back compared to other improvements - better chunking, hybrid retrieval, cross-encoder re-ranking, or domain-adapting the embedding model itself. Custom similarity functions make sense for high-stakes domains (legal discovery, clinical decision support) with large labelled datasets and dedicated ML teams. For everything else, fix retrieval by improving inputs to cosine similarity, not by replacing the metric.

Why does my RAG system retrieve irrelevant chunks even with high cosine scores?

High cosine similarity does not mean semantic relevance - it means directional alignment in embedding space. Common causes: chunks that share vocabulary with the query but not meaning (surface-form bias), chunks where the relevant sentence is diluted by surrounding unrelated content, negations that embed similarly to their positive form, and boilerplate chunks that sit near the centre of embedding space and score moderately well against everything. The fixes are structural: better chunking, hybrid retrieval combining dense and sparse signals, a cross-encoder re-ranker on the shortlist, and metadata filtering to exclude stale or out-of-scope documents before ranking.

How does cosine similarity scale to millions of documents?

Exact cosine similarity comparison is linear in corpus size and becomes too slow for interactive use somewhere above roughly 100,000 vectors, depending on dimensionality and hardware. Production systems use approximate nearest neighbour (ANN) indexes - HNSW (the default in most vector databases), IVF, and ScaNN - that trade a small amount of recall for orders-of-magnitude speedup. A well-tuned HNSW index can serve top-10 queries from a 10-million-vector corpus in 10-30 milliseconds on commodity hardware. The tradeoff is memory: HNSW typically needs 1.5-2x the raw vector size in RAM. For corpora above 100 million vectors, quantisation (product quantisation, scalar quantisation) reduces memory at the cost of further recall loss.
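The memory tradeoff is easy to estimate up front. This back-of-the-envelope sketch assumes vectors are stored as 32-bit floats (4 bytes per dimension) and applies the 1.5-2x HNSW overhead quoted above; actual usage depends on the index implementation and its parameters.

```python
GIB = 1024 ** 3

def raw_vector_bytes(n_vectors, dims, bytes_per_value=4):
    """Storage for the vectors alone, before any index overhead."""
    return n_vectors * dims * bytes_per_value

n_vectors = 10_000_000
dims = 1536  # dimensionality of text-embedding-3-small

raw = raw_vector_bytes(n_vectors, dims)
hnsw_low, hnsw_high = raw * 1.5, raw * 2.0

# Scalar quantisation to one byte per dimension, assumed here as int8.
quantised = raw_vector_bytes(n_vectors, dims, bytes_per_value=1)

print(f"raw float32 vectors: {raw / GIB:.1f} GiB")
print(f"HNSW estimate:       {hnsw_low / GIB:.1f} to {hnsw_high / GIB:.1f} GiB")
print(f"int8 quantised:      {quantised / GIB:.1f} GiB")
```

Running the numbers before choosing an index type shows quickly whether a corpus fits on one machine or needs quantisation or sharding.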

Does cosine similarity work for multilingual RAG?

Only if your embedding model is multilingual. Cosine similarity is just a measure on whatever vector space the encoder produces. If you embed English and German text with a monolingual English model, the German vectors will be incoherent and cosine scores will be meaningless. Use a model explicitly trained for multilingual or cross-lingual retrieval - Cohere's embed-multilingual-v3, BGE-M3, or multilingual E5. These embed text from different languages into a shared space, so a French query can retrieve a relevant German chunk. Test cross-lingual retrieval quality explicitly; it is usually weaker than monolingual retrieval in any single language.

When should I move beyond cosine similarity entirely?

When you have exhausted the cheaper improvements and still have a retrieval quality gap that matters commercially. The usual order: fix chunking, add hybrid retrieval, add a cross-encoder re-ranker, add metadata filtering, fine-tune the embedding model on your domain, then consider more exotic retrieval architectures like ColBERT (late-interaction retrieval) or learned sparse retrieval (SPLADE). Each step adds operational complexity. Most teams find that hybrid retrieval plus a re-ranker closes 80% of the gap, at which point further work has diminishing returns and the LLM's prompt and grounding logic become the binding constraint.

Conclusion

Cosine similarity is the workhorse of RAG retrieval for sound reasons: it is fast, mathematically clean, well-matched to how modern text encoders are trained, and easy to reason about. It is also a first-pass filter, not a final relevance judgement. Production systems that perform well treat it as one signal in a stack that also includes sparse retrieval, re-ranking, structured chunking, and metadata filtering, all measured against a labelled evaluation set. Teams that ship a vanilla cosine-only pipeline and call it done are the ones running incident reviews three months later.

If you are building a RAG system and want a retrieval architecture that holds up in production rather than just in the demo, AI Advisory builds and operates these pipelines for UK mid-market teams. Get in touch to discuss your use case.
