00. Knowledge graph retrieval — First-principles overview

Module 08_rag_system_design taught you to retrieve relevant chunks of text. Module 09_advanced_rag_patterns taught you to rerank and refine those chunks. This module covers the questions for which chunk retrieval is structurally insufficient: questions whose answers live in the path between facts, not in any single fact.

A platform engineer at a Bengaluru legal-research SaaS company audits the AI assistant's failure cases. The model is good, the prompt is tuned, and chunk retrieval returns plausibly relevant text for every question. Yet the answers are wrong on one specific class of question: "Which judges from the Bombay High Court issued rulings on data-protection cases citing the 2017 Puttaswamy judgment?" The chunks the retriever finds discuss judges, they discuss the Bombay High Court, and they discuss Puttaswamy. None of them says which judges did all three. The answer requires joining facts: judge ⇒ ruling ⇒ court ⇒ topic ⇒ cited-precedent. The chunks contain the facts; the chunks do not contain the join. Three months later, the team has built a knowledge graph over its case corpus, with judges, rulings, and courts as entities and citations as typed edges, and the same question is answered in milliseconds by a graph traversal that the chunk retriever would have needed luck to approximate.

That structural difference — facts as nodes, relationships as typed edges, answers as paths — is what this module is about. Each chapter is one surface of the graph retrieval discipline. The opening incident is what happens when the answer is structural and the retrieval is flat; the rest of the module is what to build instead.
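The join in the opening incident can be sketched as a traversal over typed edges. This is a minimal in-memory illustration, not the team's actual system: the identifiers, the edge labels (`authored`, `issued_by`, `has_topic`, `cites`), and the function names are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical toy data: every identifier and edge label here is invented.
EDGES = [
    ("judge:A", "authored", "ruling:1"),
    ("judge:B", "authored", "ruling:2"),
    ("judge:C", "authored", "ruling:3"),
    ("ruling:1", "issued_by", "court:bombay_hc"),
    ("ruling:2", "issued_by", "court:bombay_hc"),
    ("ruling:3", "issued_by", "court:delhi_hc"),
    ("ruling:1", "has_topic", "topic:data_protection"),
    ("ruling:2", "has_topic", "topic:tax"),
    ("ruling:3", "has_topic", "topic:data_protection"),
    ("ruling:1", "cites", "case:puttaswamy_2017"),
    ("ruling:2", "cites", "case:puttaswamy_2017"),
    ("ruling:3", "cites", "case:puttaswamy_2017"),
]


def build_incoming_index(edges):
    """Map (edge label, target node) to the set of source nodes."""
    index = defaultdict(set)
    for src, label, dst in edges:
        index[(label, dst)].add(src)
    return index


def judges_matching(edges, court, topic, precedent):
    """Return judges who authored a ruling satisfying all three constraints."""
    incoming = build_incoming_index(edges)
    # Intersect the rulings reachable from each anchor node: this is the join
    # that no single text chunk contains.
    rulings = (
        incoming[("issued_by", court)]
        & incoming[("has_topic", topic)]
        & incoming[("cites", precedent)]
    )
    judges = set()
    for ruling in rulings:
        # One more hop: from each surviving ruling back to its author.
        judges |= incoming[("authored", ruling)]
    return sorted(judges)


print(judges_matching(
    EDGES, "court:bombay_hc", "topic:data_protection", "case:puttaswamy_2017"
))
```

In the toy data each ruling satisfies at least one constraint on its own, which is the situation in which similarity search returns plausible text for all of them; only the intersection isolates the ruling, and therefore the judge, that satisfies all three.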

What knowledge graph retrieval is, in one sentence

Knowledge graph retrieval is the discipline of representing facts as typed nodes and edges and answering questions through structured traversal, so that multi-hop, relational, and aggregation queries are computed within production latency budgets rather than approximated by similarity over flat chunks.

Read the sentence right to left.

Approximated by similarity over flat chunks — vector retrieval finds passages that look like the question, not passages that connect to the answer. For relational questions, "looks like" misses.
Multi-hop, relational, aggregation queries — the question's answer is a path, an intersection, or a count, not a passage.
Structured traversal — the answer is computed by walking the graph, not by embedding-distance.
Typed nodes and edges — facts have identity (a node is a thing) and connections have meaning (an edge has a label).
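To make "typed nodes and edges" concrete, here is a minimal sketch of a graph that rejects edges its schema does not allow. The class names, node types, and the `SCHEMA` contents are assumptions for illustration; later chapters of the module may define the data model differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str    # identity: a node is a thing
    node_type: str  # type: what kind of thing it is


@dataclass(frozen=True)
class Edge:
    src: Node
    label: str      # meaning: an edge has a label
    dst: Node


# Hypothetical schema: the allowed (source type, label, target type) triples.
SCHEMA = {
    ("Judge", "authored", "Ruling"),
    ("Ruling", "issued_by", "Court"),
    ("Ruling", "cites", "Ruling"),
}


class TypedGraph:
    def __init__(self, schema):
        self.schema = schema
        self.edges = set()

    def add_edge(self, src, label, dst):
        """Add an edge only if its type signature appears in the schema."""
        signature = (src.node_type, label, dst.node_type)
        if signature not in self.schema:
            raise ValueError(f"edge {signature} is not allowed by the schema")
        edge = Edge(src, label, dst)
        self.edges.add(edge)
        return edge
```

Rejecting ill-typed edges at write time is one possible design choice; it keeps traversal code from having to guard against, say, a judge that was "issued by" a court.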

If a team has shipped retrieval-augmented generation and is seeing systematic failures on questions that require joining facts, the question is not whether the model needs different prompting. The question is whether the retrieval substrate can answer the question's shape at all.

The six surfaces of graph retrieval

Every production knowledge-graph retrieval system has six load-bearing surfaces.

Surface	One-liner	Pressure it answers
The data model	Nodes, edges, properties, types — the schema	structure: facts need identity and labelled connections
The construction pipeline	Extracting graph structure from raw text and structured sources	scale: graphs grow with the corpus, not by hand
The storage engine	Graph database (property graph or RDF triple store)	traversal: pointer-following must be fast
The entity linker	Resolving text mentions to canonical nodes	consistency: "Apple" the company and "Apple" the fruit must be different nodes
The query layer	Traversal queries, hybrid graph+vector, sub-graph extraction	latency: questions become traversal plans, not full scans
The maintenance layer	Updates, deletions, drift, evaluation	freshness: the world changes, the graph must follow
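The entity-linker row can be illustrated with the "Apple" example from the table. The sketch below is deliberately naive: it picks among candidate nodes by counting context words that overlap with the sentence. The catalogue, entity IDs, and context vocabularies are invented for illustration, and production linkers typically rely on learned representations and popularity priors rather than word overlap.

```python
import re

# Hypothetical catalogue of canonical nodes; all IDs and word lists are made up.
CATALOGUE = {
    "org:apple_inc": {
        "aliases": {"apple", "apple inc"},
        "context": {"iphone", "company", "shares", "ceo"},
    },
    "food:apple": {
        "aliases": {"apple", "apples"},
        "context": {"fruit", "orchard", "pie", "eat"},
    },
}


def link_mention(mention, sentence, catalogue=CATALOGUE):
    """Resolve a text mention to a canonical entity ID, or None if ambiguous."""
    mention_key = mention.lower().strip()
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best_id, best_score = None, 0
    for entity_id, entry in catalogue.items():
        if mention_key not in entry["aliases"]:
            continue  # not a candidate for this surface form
        score = len(words & entry["context"])
        if score > best_score:
            best_id, best_score = entity_id, score
    # With no supporting context, refuse to guess rather than merge two entities.
    return best_id


print(link_mention("Apple", "Apple shares rose after the company announced a new iPhone"))
print(link_mention("apple", "She baked an apple pie with fruit from the orchard"))
```

Returning `None` when nothing disambiguates the mention reflects the consistency pressure in the table: a wrong link silently fuses two different things into one node, which is usually worse than a missing link.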

The module's twelve chapters develop each surface in turn, then synthesise. The final file is the honest admission of what graph retrieval still cannot do.

What this module is not about

General retrieval-augmented generation. That is 08_rag_system_design. This module assumes you know flat retrieval and shows when it fails.
Vector retrieval infrastructure. Covered in 02_ai_infrastructure/03_vector_retrieval_infrastructure. This module uses vectors as one input to hybrid retrieval, not as the whole substrate.
Pure database theory. Cypher and SPARQL syntax are introduced where they teach the concept; this is not a database manual.
Knowledge representation as a research discipline. Description logics, ontologies, OWL reasoning — sketched lightly; the focus is production retrieval.

The recurring vocabulary

These terms appear in every chapter.

Name	Surface	What it is
the knowledge graph	Data model	the full set of typed nodes and edges encoding the corpus's facts
the entity	Data model	a named, identified node — a person, organisation, concept, event
the relationship	Data model	a typed edge connecting two entities, carrying a label like `authored`, `cites`, `works_at`
the multi-hop query	Query	a question whose answer requires traversing two or more edges
the property graph	Storage	a node-and-edge model where nodes and edges both carry key/value properties
the triple store	Storage	an RDF model where facts are subject-predicate-object triples
the entity linker	Entity linker	the component that resolves a text mention to a canonical entity ID
the graph embedding	Query	a learned vector for a node or sub-graph, trained so that distance between vectors approximates proximity in the graph
the community	Query	a densely connected sub-graph used for summarisation and broad-scope queries
the hybrid retriever	Query	the combiner that uses both graph traversal and vector similarity per query class
the graph schema	Data model	the typed catalogue: which node types exist, which edges connect which types
the construction pipeline	Construction	the ETL that turns raw text into nodes and edges
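The difference between the property graph and the triple store in the table above can be sketched by writing one fact both ways. The `PropertyEdge` class, the `to_triples` helper, and the `subject`/`predicate`/`object` labels are assumptions for this example; they imitate a reification-style pattern rather than reproducing any particular engine's or standard's vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyEdge:
    """A property-graph edge: a typed connection that carries its own key/values."""
    src: str
    label: str
    dst: str
    properties: dict = field(default_factory=dict)


def to_triples(edge, statement_id):
    """Flatten one property-graph edge into subject-predicate-object triples.

    A plain triple has no slot for edge properties, so this sketch hangs them
    off an intermediate statement node that points back at the original fact.
    """
    triples = [(edge.src, edge.label, edge.dst)]
    if edge.properties:
        triples.append((statement_id, "subject", edge.src))
        triples.append((statement_id, "predicate", edge.label))
        triples.append((statement_id, "object", edge.dst))
        for key, value in sorted(edge.properties.items()):
            triples.append((statement_id, key, value))
    return triples


# Hypothetical fact: all identifiers are invented.
edge = PropertyEdge("person:asha", "works_at", "org:acme", {"since": 2019})
for triple in to_triples(edge, "stmt:1"):
    print(triple)
```

The fact itself stays a single triple; it is the edge property that costs extra statements, which is one practical consideration when choosing between the two storage models.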

The journey: when the graph is needed, then how it works

This module has three acts, followed by a synthesis.

Act 1 — Diagnose the failure (file 01). Why flat retrieval fails the questions a knowledge graph answers easily.

Act 2 — Build the graph (files 02–06). Data model, construction, storage, entity linking, embeddings. By file 06 the graph exists as a queryable production substrate.

Act 3 — Retrieve from the graph (files 07–12). Graph-RAG architecture, communities, multi-hop reasoning, hybrid retrieval, maintenance, evaluation.

Synthesis (file 13). Honest admission of what graph retrieval still cannot do.

Memory map

#	File	Surface	What it adds
01	flat-retrieval-failure	—	the case that forces graph retrieval to exist
02	graph-data-model	Data model	nodes, edges, properties, triples, schema design
03	knowledge-graph-construction	Construction	extracting graph from text and structured sources
04	graph-databases	Storage	property graph vs. triple store; the engines that traverse
05	entity-linking	Entity linker	resolving mentions to canonical entities
06	graph-embeddings	Query	learned vectors over graph structure
	— milestone: graph is queryable —
07	graph-rag-architecture	Query	the question-to-context pipeline
08	community-detection	Query	partitioning the graph into summarisable sub-graphs
09	multi-hop-reasoning	Query	scoring and constraining longer paths
10	hybrid-graph-vector	Query	combining graph traversal with vector similarity
11	graph-maintenance	Maintenance	updates, deletions, drift, freshness
12	graph-evaluation	Maintenance	measuring graph quality and answer quality
	— milestone: graph is operable —
13	honest-admission	Boundaries	what graph retrieval still cannot do

Three traversal paths use this map. Prerequisite path — top to bottom. Failure path — when a multi-hop query fails, match the failure to a surface (construction, linking, query, maintenance). Synthesis path — pick two surfaces and ask how they compose (construction + linking = how the same entity gets one canonical node across the corpus).

How this module relates to its neighbours

08_rag_system_design — flat retrieval; the substrate this module supplements.
09_advanced_rag_patterns — reranking and refinement; complements graph retrieval at the candidate-set layer.
02_ai_infrastructure/03_vector_retrieval_infrastructure — the vector substrate hybrid retrieval depends on.
06_evidence_data_pipelines — the data pipelines feeding both flat and graph retrieval.
15_reasoning_routing_verification — the routing decision (flat vs. graph vs. hybrid) per query class.
17_schema_driven_generation — structured outputs that may be backed by graph queries.

Top resources

Microsoft GraphRAG — https://github.com/microsoft/graphrag — Microsoft's open-source GraphRAG pipeline, a widely used reference implementation of graph-based RAG.
Neo4j Cypher manual — https://neo4j.com/docs/cypher-manual/ — Cypher language and patterns.
Wikidata — https://www.wikidata.org/ — practical reference for entity identifiers and statements.
HotpotQA — https://hotpotqa.github.io/ — multi-hop benchmark useful for seeing where flat retrieval breaks.
W3C SPARQL — https://www.w3.org/TR/sparql11-query/ — RDF triple store query language.
Stanford CS520 Knowledge Graphs notes — overview of triples, schema, and construction pipelines.

What's coming

01-flat-retrieval-failure.md — Why nearest-chunk retrieval fails on multi-hop questions.
02-graph-data-model.md — Nodes, edges, properties, triples, schema design.
03-knowledge-graph-construction.md — Extracting graph structure from text and structured sources.
04-graph-databases.md — Property graphs, triple stores, the engines that traverse.
05-entity-linking.md — Resolving mentions to canonical entities.
06-graph-embeddings.md — Learned vectors over graph structure.
07-graph-rag-architecture.md — Question-to-context pipeline.
08-community-detection.md — Partitioning the graph into summarisable sub-graphs.
09-multi-hop-reasoning.md — Scoring and constraining longer paths.
10-hybrid-graph-vector.md — Combining graph traversal with vector similarity.
11-graph-maintenance.md — Updates, deletions, drift, freshness.
12-graph-evaluation.md — Measuring graph and answer quality.
13-honest-admission.md — What graph retrieval still cannot do.

Bridge. Before designing nodes, edges, or queries, we need to see why flat retrieval fails on a class of questions. The first chapter is that diagnosis. → 01-flat-retrieval-failure.md