Skip to content

04. Graph Databases — The engine that knows how to follow relationships

~14 min read. Index-free adjacency is the trick that makes graph traversal blazing fast.

Continues from the first-principles overview in 00-first-principles.md. The graph query engine needs an engine under the hood — a database that stores relationships as direct pointers rather than forcing a full-table scan every time you want to follow a connection.


1) The core problem with relational databases and graphs

Picture a 3-hop query: find all colleagues of colleagues of a given person. In SQL this is three nested JOINs on a large PERSON_KNOWS table.

SELECT p3.name
FROM   person_knows pk1
JOIN   person_knows pk2 ON pk1.knows_id = pk2.person_id
JOIN   person_knows pk3 ON pk2.knows_id = pk3.person_id
WHERE  pk1.person_id = ?

Each JOIN scans the entire table. At 1 M rows, that's three nested scans: cost grows with table size N, not hop count. Graph databases use index-free adjacency: each node stores direct pointers to its neighbouring nodes. No table scan. Each hop is one pointer follow.


2) Index-free adjacency: the key insight

Relational (B-tree index)           Graph (index-free adjacency)
─────────────────────────           ────────────────────────────
query: find neighbours of A         query: find neighbours of A
  → look up A in B-tree index         → go to A's node in memory
  → scan matching rows                → follow edge list pointer
  → filter                            → return neighbours directly
  cost: O(log N + k)                  cost: O(k)
  where N = table size                where k = degree of A only
  degrades with table growth          stays constant as graph grows

See. At 1-hop the difference is small. At 5 hops it's the difference between milliseconds and hours. The graph query engine is only fast because the graph engine stores the knowledge graph as a network of direct pointers, not a flat table.


3) Cypher: the graph query language (Neo4j)

Cypher reads like ASCII art of the graph you're looking for.

Find all people who work at tech companies:

MATCH (p:Person)-[:WORKS_AT]->(c:Company {sector: "tech"})
RETURN p.name, c.name

The -[:WORKS_AT]-> is the relationship pattern. (p:Person) and (c:Company) are entity type filters.

Two-hop: find co-workers of Sundar Pichai:

MATCH (sundar:Person {name: "Sundar Pichai"})
      -[:WORKS_AT]->(:Company)
      <-[:WORKS_AT]-(colleague:Person)
RETURN colleague.name

The graph query engine resolves this using two pointer follows. No JOIN. No full scan.


4) Worked numerical example: traversal cost comparison

Scenario: social network, 1 M nodes, average degree 50.

┌──────────────────────┬──────────────────────┬──────────────────┐
│  Operation           │  SQL (nested JOINs)  │  Neo4j (hops)    │
├──────────────────────┼──────────────────────┼──────────────────┤
│  1-hop neighbours    │  ~1 000 rows checked │  50 ptr follows  │
│  2-hop neighbours    │  ~50 000 rows        │  2 500 follows   │
│  3-hop neighbours    │  ~2 500 000 rows     │  125 000 follows │
│  Wall time (3-hop)   │  ~8 s                │  ~14 ms          │
└──────────────────────┴──────────────────────┴──────────────────┘

At 3 hops, graph DB is ~570× faster. The gap widens further at deeper traversal.


5) Major graph database options and RDF querying

┌─────────────────┬──────────────┬────────────────────────────────┐
│  Database       │  Model       │  Best for                      │
├─────────────────┼──────────────┼────────────────────────────────┤
│  Neo4j          │  Property    │  Cypher, OLTP traversal        │
│  Amazon Neptune │  RDF + Prop  │  AWS integrations, compliance  │
│  TigerGraph     │  Property    │  GSQL, analytics at scale      │
│  ArangoDB       │  Multi-model │  Document + graph hybrid       │
│  Memgraph       │  Property    │  In-memory, streaming graphs   │
└─────────────────┴──────────────┴────────────────────────────────┘

For Graph RAG: Neo4j and Neptune are most common in production. Neo4j's vector index (since v5) lets you run both graph traversal and ANN search inside one database — the graph query engine and graph embedding in the same engine.


SPARQL for RDF graphs

When the knowledge graph is stored as RDF (e.g., Wikidata), SPARQL is the query language.

Find all companies headquartered in California:

SELECT ?company ?name WHERE {
  ?company rdf:type  schema:Organization .
  ?company schema:name ?name .
  ?company schema:addressLocality "California" .
}

SPARQL pattern triples match the (subject, predicate, object) format directly. The graph query engine here is the SPARQL engine's triple-pattern join processor. Yes?


Where this lives in the wild

  • Neo4j at NASA JPL — spacecraft component dependency graphs; engineers query multi-hop failure propagation paths that SQL JOINs could not handle in real time.
  • Amazon Neptune at Capital One — fraud graph where account → transaction → device chains are traversed in milliseconds for real-time transaction scoring.
  • TigerGraph at UnitedHealth Group — member-provider-claim graphs at 200 M nodes for risk analysis; GSQL traversal runs under 100 ms per patient query.
  • Memgraph at Pex — streaming social media graph where new edges are inserted at 1 M/sec and graph query engine queries run on live data without batch lag.
  • Wikidata SPARQL endpoint — public interface to 100 M+ RDF triples; data engineers federate it with local graphs for enrichment pipelines.

Pause and recall

  1. What is index-free adjacency and why does it matter at 5 hops?
  2. In Cypher, what symbol represents a directed relationship between two entities?
  3. At 3-hop depth, how many times faster was Neo4j than SQL in the worked example?
  4. What does Neo4j's vector index add to the graph query engine's capabilities?

Interview Q&A

Q: Why does SQL performance degrade with hop depth but graph DBs do not? A: SQL JOIN cost depends on table size N, not just degree k. Each hop adds a full table scan proportional to table size. Graph traversal cost depends only on local degree at each node — topology-bound, not data-volume-bound.

Common wrong answer to avoid: "Graph DBs cache more aggressively" — caching is orthogonal; the fundamental difference is the traversal algorithm, not the cache.

Q: Why would you still keep a relational DB alongside a graph DB? A: Relational DBs excel at aggregations, reporting, and tabular data. Graph DBs excel at traversal. The knowledge graph answers "how are things connected" while the relational DB answers "how many" and "what is the total".

Common wrong answer to avoid: "Graph DBs fully replace relational DBs" — they are complementary; graph DBs are poor at cross-table aggregations.

Q: Why does Cypher use ASCII-art path patterns instead of SQL-style syntax? A: The path pattern (a)-[:R]->(b) maps directly to the visual structure of the graph. This makes query construction intuitive for graph-shaped problems and reduces the impedance mismatch between the query language and the data model.

Common wrong answer to avoid: "It's just a syntax preference" — the pattern-matching mental model prevents a whole class of query design errors.

Q: Why add a vector index inside a graph database instead of a separate vector store? A: Keeping graph embedding (vector) search and graph query engine (graph) traversal in the same system avoids round-trip latency and lets the query planner combine them in a single execution plan — critical for sub-100 ms Graph RAG latency.

Common wrong answer to avoid: "Separate systems are more scalable" — operational complexity and network latency are real costs that in-database integration avoids.


Apply now (5 min)

Exercise. Write a two-hop Cypher query for a domain you know (e.g., author → book → publisher). Estimate how many row comparisons the equivalent SQL would need. Then count the pointer-follows the graph query would use instead.

Sketch from memory. Draw the index-free adjacency structure for a node with three neighbours. Show a B-tree index scan next to it for contrast. Label the cost of each approach at hop 1 and hop 3.


Bridge. The database engine is ready. But it's only useful if we can reliably identify which entity in the graph a real-world text mention refers to. That's the entity linking problem. → 05-entity-linking.md