AI · 7 June 2026 · 5 min read

Knowledge Graphs in RAG: What They Are and When to Use One

What a knowledge graph adds to a RAG pipeline, how GraphRAG works, when it beats vector search, and what it costs to build and run in production

By AI Advisory team

Retrieval-augmented generation has a known failure mode: ask a question that requires joining facts across multiple documents, and the model either hallucinates the link or returns a confident answer based on the one chunk it happened to retrieve. Knowledge graphs are the most common fix. They give the retriever something richer than cosine similarity to work with - explicit relationships between entities - and they let the generator reason over those relationships rather than guessing.

This article explains what a knowledge graph is in the context of RAG, how GraphRAG pipelines actually work, when the added complexity is worth it, and what the build looks like in practice. It assumes you already understand the basics of vector retrieval and embeddings.

What a knowledge graph actually is

A knowledge graph is a structured representation of entities (nodes) and the relationships between them (edges), with both nodes and edges carrying typed properties. "Acme Ltd acquired Beta Holdings in 2023" becomes two nodes (Acme Ltd, Beta Holdings) connected by an edge (acquired) with a property (year: 2023). Stack millions of these triples together and you have a queryable graph of facts.
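As a minimal sketch of that structure, the example sentence can be held as a typed edge with properties and indexed by its subject node. The Edge type and the facts_about helper are illustrative names, not part of any particular graph database's API.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    subject: str      # source node
    predicate: str    # relationship type
    obj: str          # target node
    properties: dict  # typed properties carried by the edge


edges = [
    Edge("Acme Ltd", "acquired", "Beta Holdings", {"year": 2023}),
]

# Adjacency index: node -> edges leaving that node.
outgoing = defaultdict(list)
for edge in edges:
    outgoing[edge.subject].append(edge)


def facts_about(node):
    """Return (predicate, object, properties) for every edge leaving a node."""
    return [(e.predicate, e.obj, e.properties) for e in outgoing.get(node, [])]
```

A real graph database adds indexing, a query language, and persistence on top of this idea, but the underlying unit is still the triple.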

The format has been around for decades - the term was popularised by Google's 2012 announcement of its own Knowledge Graph, but the underlying ideas go back to semantic networks in the 1960s and RDF/OWL standards from the W3C in the early 2000s. What changed in 2024 was the realisation that LLMs are good at two things graphs needed: extracting structured triples from messy prose, and reasoning over the resulting graph in natural language.

In a RAG context, a knowledge graph sits alongside (or replaces) the vector store. Where a vector index answers "which chunks of text are semantically similar to this query," a graph answers "which entities are connected to this entity, and how." The two are complementary - most production GraphRAG systems use both.

Why vector-only RAG breaks on certain questions

Standard RAG with a vector database works well for single-hop, locally-answerable questions. Ask "what is our refund policy for enterprise customers" and a well-chunked knowledge base will return the relevant paragraph. The model summarises it. Done.

It breaks down on three classes of question:

Multi-hop queries. "Which of our suppliers are owned by companies that have been sanctioned in the last two years?" requires joining a supplier list, an ownership graph, and a sanctions list. No single chunk contains the answer. Top-k retrieval returns plausible-looking but unconnected fragments, and the model fills in the gaps with guesswork.

Global or thematic queries. "What are the main themes across all 400 customer interviews we ran last year?" cannot be answered by retrieving the top 10 most similar chunks. The answer requires summarising the whole corpus. Microsoft's GraphRAG paper (Edge et al., 2024) called these "query-focused summarisation" tasks and showed vector RAG performs particularly badly on them.

Relationship questions. "How is Person A connected to Project B?" depends entirely on the path between two entities. Vector similarity does not encode paths.
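To make the relationship case concrete, here is a small sketch of path finding over triples using breadth-first search. The entities, predicates, and the find_path helper are hypothetical; a graph database would normally do this with a built-in shortest-path query.

```python
from collections import deque

# Hypothetical triples: (subject, predicate, object).
EDGES = [
    ("Person A", "member_of", "Team X"),
    ("Team X", "owns", "Project B"),
    ("Person C", "member_of", "Team Y"),
]


def build_adjacency(edges):
    """Index edges in both directions so paths can follow either way."""
    adjacency = {}
    for s, p, o in edges:
        adjacency.setdefault(s, []).append((p, o))
        adjacency.setdefault(o, []).append((f"inverse of {p}", s))
    return adjacency


def find_path(edges, start, goal):
    """Breadth-first search; returns the shortest list of hops, or None."""
    adjacency = build_adjacency(edges)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, predicate, neighbour)]))
    return None
```

Calling find_path(EDGES, "Person A", "Project B") returns the two hops through Team X, which is exactly the information a similarity score cannot express.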

The fix in all three cases is to make the relationships first-class citizens in your retrieval index, not implicit features the embedding model might or might not have captured.

How GraphRAG pipelines work

A production GraphRAG pipeline has four stages: ingestion, graph construction, retrieval, and generation.

Ingestion and entity extraction

Documents come in (PDFs, web pages, transcripts, database rows). The pipeline chunks them, then runs each chunk through an LLM with a prompt that extracts entities and relationships in a structured format - typically JSON triples of the form (subject, predicate, object) with source citations attached. The Microsoft GraphRAG implementation does this with a single prompt that returns both entities and claims; LlamaIndex's KnowledgeGraphIndex takes a similar approach.
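Whatever prompt is used, the model's output should be validated before it reaches the graph. The sketch below assumes the model has been asked to return a JSON array of objects with subject, predicate, object, and source keys. That shape is an assumption for illustration, not the format used by Microsoft GraphRAG or LlamaIndex.

```python
import json

# Assumed keys for each extracted triple (hypothetical schema).
REQUIRED_KEYS = ("subject", "predicate", "object", "source")


def parse_triples(raw_response):
    """Parse a model response and keep only well-formed triples."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # malformed output: caller can retry or log
    if not isinstance(items, list):
        return []
    triples = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Every required key must be a non-empty string.
        if all(isinstance(item.get(k), str) and item[k].strip() for k in REQUIRED_KEYS):
            triples.append({k: item[k].strip() for k in REQUIRED_KEYS})
    return triples


SAMPLE_RESPONSE = """[
  {"subject": "Acme Ltd", "predicate": "acquired",
   "object": "Beta Holdings", "source": "doc-1"},
  {"subject": "Acme Ltd", "predicate": "acquired"}
]"""
valid = parse_triples(SAMPLE_RESPONSE)  # the incomplete second item is dropped
```

Dropping incomplete items rather than guessing the missing field keeps unsourced facts out of the graph.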

Extraction quality is the single biggest determinant of downstream performance. Cheap models (GPT-4o-mini, Claude Haiku) work for clean prose but miss nuance on legal or technical documents. You typically want a stronger model for extraction even if generation uses something cheaper.

Graph construction and enrichment

Extracted triples are merged into a graph database. Neo4j is the most common choice; Memgraph, ArangoDB, Amazon Neptune, and TigerGraph are credible alternatives. Postgres with the AGE extension works for smaller graphs and keeps everything in one database.

Two enrichment steps matter:

Entity resolution. "Acme Ltd", "Acme Limited" and "ACME" need to collapse to one node. This is the boring 80% of the build and is usually done with a mix of fuzzy matching, embedding similarity, and LLM adjudication for ambiguous cases.
Community detection. Algorithms like Leiden cluster the graph into communities of densely-connected entities. The Microsoft GraphRAG approach then generates an LLM summary for each community at multiple levels of hierarchy. These community summaries are what enable the global thematic queries vector RAG cannot handle.
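The entity resolution step can be sketched with the standard library alone. This version covers only name normalisation and fuzzy string matching; the embedding similarity and LLM adjudication mentioned above are left out. The suffix list and the 0.9 threshold are assumptions to tune on real data.

```python
import re
from difflib import SequenceMatcher

# Legal suffixes to ignore when comparing names (illustrative, not exhaustive).
SUFFIXES = {"ltd", "limited", "inc", "plc", "llc"}


def normalise(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in SUFFIXES]
    return " ".join(kept or tokens)


def resolve(names, threshold=0.9):
    """Map each surface form to the first-seen name it matches."""
    canonical = {}  # normalised key -> canonical surface form
    mapping = {}
    for name in names:
        key = normalise(name)
        match = next(
            (k for k in canonical
             if SequenceMatcher(None, key, k).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical[key] = name
            match = key
        mapping[name] = canonical[match]
    return mapping


mapping = resolve(["Acme Ltd", "Acme Limited", "ACME", "Beta Holdings"])
```

Here all three Acme variants map to "Acme Ltd". Comparing every new name against every canonical key is quadratic, so a production system would typically block candidates first (for example by first token) before scoring.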

Retrieval

At query time, GraphRAG does not just retrieve chunks. It typically does some combination of:

Entity linking - identify which entities in the graph the query refers to.
Subgraph extraction - pull the local neighbourhood around those entities (usually 1-2 hops).
Community selection - for global queries, retrieve the relevant pre-computed community summaries.
Vector retrieval over the original chunks - still useful for grounding the answer in source text.

The retrieved context now contains structured relationships plus raw text plus pre-built summaries. That is a lot more for the generator to work with than a flat list of top-k chunks.
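The first two retrieval steps can be sketched as follows. Entity linking here is a naive substring match and the graph is a plain list of triples. Both are simplifying assumptions; real systems usually link entities with embeddings or an LLM and traverse inside the graph database.

```python
def link_entities(query, known_entities):
    """Naive entity linking: entity names that appear verbatim in the query."""
    lowered = query.lower()
    return [e for e in known_entities if e.lower() in lowered]


def neighbourhood(edges, seeds, hops=2):
    """Collect every edge within `hops` steps of the seed entities."""
    frontier = set(seeds)
    visited = set(seeds)
    selected = []
    for _ in range(hops):
        next_frontier = set()
        for s, p, o in edges:
            touches_frontier = s in frontier or o in frontier
            if touches_frontier and (s, p, o) not in selected:
                selected.append((s, p, o))
                for node in (s, o):
                    if node not in visited:
                        visited.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return selected


# Hypothetical graph for the supplier and sanctions question.
EDGES = [
    ("Supplier S", "owned_by", "Parent P"),
    ("Parent P", "sanctioned_by", "Regulator R"),
    ("Regulator R", "based_in", "Country C"),
    ("Other O", "owned_by", "Other Parent"),
]
ENTITIES = ["Supplier S", "Parent P", "Regulator R", "Country C"]

seeds = link_entities("Is Supplier S owned by a sanctioned company?", ENTITIES)
subgraph = neighbourhood(EDGES, seeds, hops=2)
```

With two hops the subgraph contains the ownership edge and the sanctions edge but stops before the regulator's location, which shows how the hop limit bounds context size.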

Generation

The final LLM call takes the query and the assembled context and produces an answer with citations back to source documents. Nothing exotic - the value was added earlier in the pipeline.
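A minimal sketch of the context assembly for that final call is shown below. The section labels, instruction wording, and build_prompt name are illustrative assumptions rather than a prescribed template.

```python
def build_prompt(query, facts, passages, summaries=()):
    """Assemble graph facts, summaries, and source passages into one prompt.

    facts:     iterable of (subject, predicate, object, source_id)
    passages:  iterable of (source_id, text)
    summaries: iterable of pre-computed community summary strings
    """
    sections = []
    if facts:
        sections.append("Graph facts:\n" + "\n".join(
            f"- {s} {p} {o} [{src}]" for s, p, o, src in facts))
    if summaries:
        sections.append("Community summaries:\n" + "\n".join(
            f"- {summary}" for summary in summaries))
    if passages:
        sections.append("Source passages:\n" + "\n".join(
            f"[{doc_id}] {text}" for doc_id, text in passages))
    instructions = (
        "Answer using only the context below. Cite source ids in square "
        "brackets. If the context does not contain the answer, say so."
    )
    return "\n\n".join([instructions, *sections, f"Question: {query}"])


prompt = build_prompt(
    "Who acquired Beta Holdings?",
    facts=[("Acme Ltd", "acquired", "Beta Holdings", "doc-1")],
    passages=[("doc-1", "Acme Ltd acquired Beta Holdings in 2023.")],
)
```

Carrying the source id on every fact and passage is what lets the generator cite back to documents.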

When a knowledge graph is worth the extra complexity

GraphRAG is not free. You add a graph database to your stack, the extraction step roughly doubles ingestion cost, and retrieval logic gets considerably more complex to debug. Build one when at least two of these are true:

Your queries are inherently relational. Compliance, due diligence, fraud investigation, customer 360, scientific literature review, and anything involving organisational hierarchies all benefit. If your typical user question contains the words "connected to", "owned by", "caused by", or "related to", that is a strong signal that a graph will help.

You need explainable answers. Regulated industries (finance, healthcare, legal) often require an auditor to see exactly which facts the system used to reach a conclusion. A graph traversal is far easier to explain than "these were the top 8 cosine-similar chunks." The ICO's guidance on explaining AI decisions (ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence) puts real weight on this for any system making decisions about people.

Your corpus has implicit structure already. Product catalogues, employee directories, case law databases, scientific papers with citations - these are already graph-shaped. Building a graph is largely a matter of making the implicit explicit, and the lift is much smaller than building one from pure prose.

You need thematic / global synthesis. Research, market intelligence, customer feedback analysis. Microsoft's benchmarks on GraphRAG showed substantial wins over baseline RAG on holistic questions about a corpus: in LLM-judged head-to-head comparisons, GraphRAG answers were preferred roughly 70-80% of the time on comprehensiveness and diversity (Edge et al., "From Local to Global", arXiv:2404.16130, 2024).

Skip the graph when your use case is single-document Q&A, customer support over a stable FAQ, or anywhere a well-tuned hybrid search (BM25 + dense vectors + reranker) gives acceptable answers. That covers most chatbots.

What the build actually looks like

A realistic GraphRAG build for a mid-market client typically breaks down as follows. These are our rough numbers for a project covering a corpus of 5,000-50,000 documents.

Discovery and schema design (1-2 weeks). Decide what entity and relationship types matter. This is product work, not engineering - the schema reflects how the business thinks about its domain. A legal team's schema looks nothing like a manufacturer's.

Extraction pipeline (2-3 weeks). Prompt engineering for entity extraction, batching, retry logic, cost controls. Run it on a 5% sample, evaluate, iterate. Plan for the LLM extraction cost - at GPT-4o prices, expect roughly £0.30-£1.50 per thousand pages depending on chunk size and prompt design.

Graph storage and entity resolution (2-3 weeks). Neo4j or Postgres+AGE, plus the deduplication logic. This is where most of the engineering depth lives.

Retrieval and generation layer (2-3 weeks). Hybrid retrieval combining graph traversal and vector search, prompt assembly, citation handling, refusal patterns when the graph cannot answer.

Evaluation harness (1-2 weeks, ongoing). A test set of representative queries with expected answer characteristics. Without this you cannot tell whether a prompt change improved or regressed quality.

Total: 8-13 weeks for a first production system, with ongoing iteration. Budget for graph database hosting (Neo4j AuraDB starts around £55/month for small graphs and scales into the thousands per month for large ones), LLM costs (highly variable, dominated by extraction), and the engineering retainer to keep extraction quality high as new document types arrive.
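The total is simply the sum of the phase ranges above, assuming the phases run one after another with no overlap. A short sketch of that arithmetic, useful if you adjust individual phases for your own project:

```python
# Phase estimates in weeks (low, high), taken from the breakdown above.
PHASES = {
    "Discovery and schema design": (1, 2),
    "Extraction pipeline": (2, 3),
    "Graph storage and entity resolution": (2, 3),
    "Retrieval and generation layer": (2, 3),
    "Evaluation harness": (1, 2),
}


def total_weeks(phases):
    """Sum the low and high ends, assuming phases run sequentially."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


low, high = total_weeks(PHASES)  # (8, 13)
```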

Stack choices and trade-offs

A few decisions matter more than the rest:

Graph database. Neo4j is the default for a reason - the largest ecosystem, mature Cypher query language, good Python and JavaScript drivers, and a free Community Edition for self-hosting. Memgraph is faster for streaming workloads. If you already have Postgres, the AGE extension lets you avoid adding a database. For very large graphs (hundreds of millions of edges) consider TigerGraph or Neptune.

Framework. LlamaIndex has the most mature graph-aware retrieval abstractions (PropertyGraphIndex). LangChain's graph integrations work but feel less polished. Microsoft's open-source GraphRAG library is excellent if its opinionated pipeline matches your use case but harder to customise. For anything bespoke, writing the retrieval logic directly against the graph driver is often cleaner than wrestling with a framework.

Extraction model. GPT-4o, Claude Sonnet, or Gemini 1.5 Pro all extract well. Cheap models save money on the bulk of extractions but produce noisier graphs. A common pattern is a two-pass approach: a cheap model does the first pass, a strong model re-extracts the chunks where the cheap model produced low-confidence output.

Hosting and data residency. For UK clients with GDPR-sensitive data, self-hosting Neo4j and using an LLM provider with EU/UK data residency (Azure OpenAI in UK South, AWS Bedrock in eu-west-2, or a self-hosted open model) is usually the right call. The ICO has been clear in its guidance on international data transfers that the controller carries the risk, so simplifying the question by keeping data in-region pays for itself.

Common failure modes

Three failures account for most of the bad GraphRAG projects we see:

Over-engineering the schema. Teams spend six weeks designing a perfect ontology and never ship. Start with five entity types and five relationship types, ship something, then extend. The schema should be discovered, not designed.

Skipping entity resolution. Without deduplication the graph fills with near-duplicate nodes and queries return fragmented neighbourhoods. This is the single most common reason GraphRAG underperforms its potential in production.

No evaluation harness. If you cannot measure quality, every prompt change is a guess. Build the test set before the first prompt is written. Use a mix of single-hop, multi-hop, and global queries, and grade with both deterministic checks (was the right entity in the answer?) and LLM-as-judge scoring for free-text quality.
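The deterministic half of such a harness can be very small. The sketch below grades answers by checking that required entities are mentioned and reports pass rates per query type. The test cases and the answer_fn callable are placeholders for your own queries and pipeline, and LLM-as-judge scoring is not shown.

```python
# Hypothetical test cases; real ones come from representative user queries.
TEST_SET = [
    {"kind": "single-hop", "query": "Who acquired Beta Holdings?",
     "must_mention": ["Acme Ltd"]},
    {"kind": "multi-hop", "query": "Which suppliers have sanctioned owners?",
     "must_mention": ["Supplier S", "Parent P"]},
]


def grade(answer, must_mention):
    """Deterministic check: every required entity appears in the answer."""
    lowered = answer.lower()
    return all(term.lower() in lowered for term in must_mention)


def run_eval(answer_fn, test_set):
    """Run every case through the pipeline and tally results per query kind."""
    by_kind = {}
    for case in test_set:
        passed = grade(answer_fn(case["query"]), case["must_mention"])
        stats = by_kind.setdefault(case["kind"], {"passed": 0, "total": 0})
        stats["total"] += 1
        stats["passed"] += int(passed)
    return by_kind


def stub_pipeline(query):
    """Stand-in for the real GraphRAG pipeline."""
    return "Acme Ltd acquired Beta Holdings in 2023."


report = run_eval(stub_pipeline, TEST_SET)
```

Splitting the report by query kind makes it visible when a prompt change helps single-hop questions but regresses multi-hop ones.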

FAQ

How is GraphRAG different from regular RAG?

Regular RAG retrieves text chunks based on semantic similarity to the query, then asks an LLM to answer using those chunks. GraphRAG adds a knowledge graph - an explicit, structured representation of entities and their relationships - to the retrieval step. The system can traverse relationships, pull in connected entities, and use pre-computed community summaries for thematic queries. The result is much better performance on multi-hop questions, relationship queries, and global synthesis tasks, at the cost of a more complex pipeline and higher ingestion costs.

Do I need to replace my vector database with a graph database?

No. Most production GraphRAG systems use both. The graph handles relationships and structured traversal; the vector store handles semantic similarity over the original text. At query time, retrieval typically combines a graph traversal (to find connected entities) with a vector lookup (to ground the answer in source passages). If you already have a working vector RAG system, adding a graph alongside it is the usual path, not ripping out what works.

How long does it take to build a GraphRAG system?

For a first production system on a corpus of 5,000-50,000 documents, plan for 8-13 weeks from kickoff to a working pipeline. Discovery and schema design take the first one to two weeks. Extraction, graph construction, and entity resolution dominate the middle. Retrieval logic and an evaluation harness fill the final weeks. Expect ongoing iteration after launch - new document types and new query patterns will keep extraction prompts and schema evolving for the first six months.

What does it cost to run?

The dominant costs are LLM calls for ingestion (one-time per document, but rerun whenever extraction prompts change) and graph database hosting (ongoing). Extraction at GPT-4o prices typically lands between £0.30 and £1.50 per thousand pages. Neo4j AuraDB starts around £55/month for small graphs and scales into the thousands for large ones; self-hosted Neo4j Community Edition is free but requires infrastructure work. Query-time LLM costs are similar to standard RAG. For a mid-sized corpus, total monthly run cost is usually a few hundred to a few thousand pounds.
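A rough estimator for the ingestion side is sketched below. Every input is a placeholder: token counts per page depend on your chunking and prompt design, and the prices in the example call are round illustrative numbers, not any provider's actual rates.

```python
def extraction_cost(pages, input_tokens_per_page, output_tokens_per_page,
                    input_price_per_million, output_price_per_million, runs=1):
    """Estimate LLM extraction cost in the currency the prices are quoted in.

    input_tokens_per_page should include the extraction prompt overhead.
    runs counts full re-extractions, e.g. after a prompt change.
    """
    input_cost = pages * input_tokens_per_page / 1_000_000 * input_price_per_million
    output_cost = pages * output_tokens_per_page / 1_000_000 * output_price_per_million
    return runs * (input_cost + output_cost)


# Illustrative values only: 1,000 pages, placeholder token counts and prices.
one_run = extraction_cost(1000, 600, 100, 1.0, 4.0)
three_runs = extraction_cost(1000, 600, 100, 1.0, 4.0, runs=3)
```

The runs parameter matters more than it looks: each change to the extraction prompt that forces a full re-ingest multiplies the one-off cost.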

Can I use open-source models for the extraction step?

Yes, with caveats. Llama 3.1 70B, Qwen 2.5, and Mistral Large all produce usable triples, particularly when given few-shot examples in the prompt. The trade-off is extraction quality - on complex documents (legal contracts, scientific papers, multi-party transcripts) frontier closed models still pull ahead. A pragmatic approach is to use open-source for the bulk of extraction and reserve a frontier model for low-confidence chunks identified by a first pass. This keeps costs down while preserving graph quality where it matters.
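The two-pass routing can be sketched as below. The extractors are passed in as callables returning (triples, confidence). How that confidence is produced (self-reported score, schema validation rate, or something else) and the 0.7 threshold are assumptions, and the stub extractors stand in for real model calls.

```python
def two_pass_extract(chunks, cheap_extract, strong_extract, min_confidence=0.7):
    """Extract with the cheap model first; escalate low-confidence chunks."""
    results = {}
    escalated = []
    for chunk_id, text in chunks.items():
        triples, confidence = cheap_extract(text)
        if confidence < min_confidence:
            # Second pass: re-extract this chunk with the stronger model.
            triples, _ = strong_extract(text)
            escalated.append(chunk_id)
        results[chunk_id] = triples
    return results, escalated


# Stub extractors for illustration; real ones would call an LLM.
def cheap_stub(text):
    confidence = 0.9 if "clean" in text else 0.4
    return [("cheap", text)], confidence


def strong_stub(text):
    return [("strong", text)], 1.0


chunks = {"c1": "clean prose", "c2": "dense legal clause"}
results, escalated = two_pass_extract(chunks, cheap_stub, strong_stub)
```

Tracking which chunks were escalated also gives a running measure of how much of the corpus the cheaper model can handle on its own.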

How does GraphRAG handle data updates?

Incremental updates are one of GraphRAG's harder problems. When a document changes, you need to identify which triples in the graph came from it, remove them, re-extract, and re-merge. Community summaries and entity resolution may need to recompute. In practice, teams either accept a delay (batch reprocessing nightly or weekly) or maintain document-to-triple lineage so individual documents can be updated cleanly. For high-churn corpora, this overhead is real and should factor into the build-vs-buy decision.
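Document-to-triple lineage can be sketched as a mapping from each triple to the set of documents that assert it. The LineageGraph class is a hypothetical in-memory illustration; in a graph database the same idea is usually a source property or a link to a document node. Recomputing community summaries and entity resolution after a change is not covered here.

```python
class LineageGraph:
    """Tracks which documents support each triple so updates stay local."""

    def __init__(self):
        self.sources = {}  # triple -> set of document ids asserting it

    def add_document(self, doc_id, triples):
        for triple in triples:
            self.sources.setdefault(triple, set()).add(doc_id)

    def remove_document(self, doc_id):
        # A triple is deleted only when no other document still supports it.
        for triple in list(self.sources):
            self.sources[triple].discard(doc_id)
            if not self.sources[triple]:
                del self.sources[triple]

    def update_document(self, doc_id, new_triples):
        self.remove_document(doc_id)
        self.add_document(doc_id, new_triples)

    def triples(self):
        return set(self.sources)


graph = LineageGraph()
acquired = ("Acme Ltd", "acquired", "Beta Holdings")
graph.add_document("doc-1", [acquired, ("Acme Ltd", "based_in", "Leeds")])
graph.add_document("doc-2", [acquired])
graph.update_document("doc-1", [("Acme Ltd", "based_in", "London")])
```

After the update, the acquisition fact survives because doc-2 still supports it, while the outdated location from doc-1 is replaced.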

Is GraphRAG worth it for a customer support chatbot?

Usually not. Customer support typically asks single-hop questions answerable from one or two passages in a stable knowledge base. Well-tuned hybrid search (BM25 + dense vectors + a reranker) handles this at a fraction of the complexity. Consider GraphRAG for support only when you have a complex product with many interacting components, customers asking compatibility or configuration questions across that ecosystem, and a knowledge base structured around those relationships. Otherwise the engineering effort is better spent on better chunking, better prompts, and a good evaluation harness.

What skills do I need on the team to maintain this?

One engineer who is comfortable with Python, an LLM SDK, and a graph database (Cypher for Neo4j) can maintain a GraphRAG system once it is live. The trickier skill is the ongoing prompt engineering and evaluation work - someone needs to look at extraction outputs, identify systematic errors, and adjust prompts or schema. This is closer to a data quality role than a pure engineering one. For mid-market clients, retaining the agency that built it for ongoing iteration is usually cheaper than hiring a specialist.

Where to go from here

Knowledge graphs are not a silver bullet for RAG, and they are not always the right answer. They are the right answer when your questions are relational, your corpus has structure to exploit, or your users need to see how an answer was reached. When those conditions hold, GraphRAG produces qualitatively different results from vector-only RAG - results that change what users can ask, not just how well the system answers what they already ask.

If you are weighing up whether a knowledge graph fits your use case, the fastest way to find out is a two-week discovery on your actual corpus and query patterns rather than a generic proof-of-concept. AI Advisory runs these as fixed-fee engagements and ships a costed build plan at the end. Get in touch to scope one.