02. Graph Data Model: Nodes, edges, and the grammar of facts

~13 min read. The triple is the atom. Everything else is built from it.

Continues from the first-principles overview in 00-first-principles.md. Each entity in the knowledge graph is a node, each relationship is a typed edge, and together they form the grammar that makes relational queries possible.

1) Picture before everything

A knowledge graph is a labelled, directed multigraph. Three components: nodes (entities), edges (relations), and labels (types).

┌─────────────┐   CEO_OF    ┌─────────────┐
│  Sundar     │────────────▶│  Google LLC │
│  Pichai     │             └──────┬──────┘
└─────────────┘                    │
                            PART_OF│
                                   ▼
                           ┌─────────────┐
                           │  Alphabet   │
                           │  Inc.       │
                           └─────────────┘

The entity "Sundar Pichai" connects to the entity "Google LLC" via the relationship labelled CEO_OF. "Google LLC" connects to "Alphabet Inc." via PART_OF. That is a complete two-hop chain, with "Google LLC" as the multi-hop junction: the intermediate node that links the first hop to the second.
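The two-hop chain in the diagram can be sketched in a few lines of plain Python. This is only an illustration, not a graph database API; the names `EDGES` and `follow_path` are invented for this example.

```python
# Labelled, directed multigraph stored as a list of (source, label, target) edges.
# A list, rather than a dict keyed by node pair, allows parallel edges between
# the same two nodes.
EDGES = [
    ("Sundar Pichai", "CEO_OF", "Google LLC"),
    ("Google LLC", "PART_OF", "Alphabet Inc."),
]


def follow_path(start, labels):
    """Follow one edge label per hop; return every complete path as a list of nodes."""
    frontier = [[start]]
    for label in labels:
        next_frontier = []
        for path in frontier:
            for source, edge_label, target in EDGES:
                if source == path[-1] and edge_label == label:
                    next_frontier.append(path + [target])
        frontier = next_frontier
    return frontier


paths = follow_path("Sundar Pichai", ["CEO_OF", "PART_OF"])
print(paths)  # [['Sundar Pichai', 'Google LLC', 'Alphabet Inc.']]
```

The middle element of the returned path is the junction node. Scanning the full edge list on every hop is fine for a toy graph; a real engine would index edges by source node.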

2) The triple: smallest unit of graph knowledge

Every fact is a triple: (subject, predicate, object) — also called SPO.

subject          predicate        object
──────────────   ──────────────   ──────────────
Sundar Pichai    CEO_OF           Google LLC
Google LLC       PART_OF          Alphabet Inc.
Alphabet Inc.    FOUNDED_IN       Mountain View
Google LLC       PRODUCT          Google Search

The graph query engine operates on these triples. To answer "Where was Alphabet founded?", it chains:

- Find Alphabet Inc.
- Follow FOUNDED_IN edge.
- Return Mountain View.

One triple = one fact. A graph = a set of triples.
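The lookup steps above can be sketched with the four example triples held in a Python set. The helper `match` is a hypothetical name for this illustration; it treats `None` as a wildcard, which is roughly how a triple pattern works in a query language.

```python
# The example facts as (subject, predicate, object) tuples.
# A set is used, so storing the same fact twice has no effect.
TRIPLES = {
    ("Sundar Pichai", "CEO_OF", "Google LLC"),
    ("Google LLC", "PART_OF", "Alphabet Inc."),
    ("Alphabet Inc.", "FOUNDED_IN", "Mountain View"),
    ("Google LLC", "PRODUCT", "Google Search"),
}


def match(subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern, sorted; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# "Where was Alphabet founded?": fix subject and predicate, leave the object open.
answer = [o for _, _, o in match("Alphabet Inc.", "FOUNDED_IN")]
print(answer)  # ['Mountain View']
```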

3) RDF vs property graph: two flavours

RDF (Resource Description Framework)

- Strict: nodes and edges are identified by URIs, except that the object of a triple may also be a literal value (a string, number, or date).
- No properties on edges.
- Triples only.
- Used in Wikidata, DBpedia, public linked data.

Property graph

- Nodes and edges both carry property bags.
- More flexible for engineering.
- Used in Neo4j, Amazon Neptune, TigerGraph.

  RDF triple                    Property graph
  ─────────────────────         ─────────────────────────────
  (subject, pred, obj)          node: {id, type, props…}
  node = URI or literal         edge: {type, props…}
  edge = URI only               edge can carry weight, date…

  (Pichai, CEO_OF, Google)      Pichai ──[CEO_OF since:2015]──▶ Google

Property graphs are usually easier for engineers to query, with languages such as Cypher. RDF is better when you need cross-dataset federation. The node-and-edge picture holds in both: each entity is a node, each relationship is a typed edge.
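To make the contrast concrete, here is a minimal sketch of the property-graph side using Python dataclasses. The `Node` and `Edge` classes and their field names are assumptions made for this example and do not mirror any particular database's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    type: str
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str   # id of the start node
    type: str     # edge label, e.g. CEO_OF
    target: str   # id of the end node
    props: dict = field(default_factory=dict)


pichai = Node("pichai", "Person", {"name": "Sundar Pichai"})
google = Node("google", "Organisation", {"name": "Google LLC"})

# Property graph: the start year sits directly on the edge.
ceo_edge = Edge(pichai.id, "CEO_OF", google.id, {"since": 2015})

# Plain triple: only (subject, predicate, object), so there is no slot for `since`.
ceo_triple = (pichai.id, ceo_edge.type, google.id)

print(ceo_edge.props["since"])  # 2015
print(ceo_triple)               # ('pichai', 'CEO_OF', 'google')
```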

4) Worked numerical example: storage comparison

Small example: 5 entities, 5 relationships.

Adjacency matrix (5×5 = 25 cells, most empty):

     A  B  C  D  E
A  [ 0  1  0  1  0 ]
B  [ 0  0  1  0  0 ]
C  [ 0  0  0  0  1 ]
D  [ 0  0  0  0  1 ]
E  [ 0  0  0  0  0 ]

25 cells stored, only 5 are 1: 80% wasted space. Worse, the adjacency matrix loses the edge labels.

Triple store (only what exists):

(A, rel1, B)
(A, rel2, D)
(B, rel3, C)
(C, rel4, E)
(D, rel5, E)

5 triples, 0 wasted cells. As graphs grow larger and sparser (millions of nodes, each linked to only a tiny fraction of the others), the saving multiplies.
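The counts in this example can be checked with a short script. It is a sketch built from the five triples listed above; the variable names are chosen for this illustration.

```python
nodes = ["A", "B", "C", "D", "E"]
triples = [
    ("A", "rel1", "B"),
    ("A", "rel2", "D"),
    ("B", "rel3", "C"),
    ("C", "rel4", "E"),
    ("D", "rel5", "E"),
]

# Build the N x N adjacency matrix: 1 where an edge exists, 0 elsewhere.
index = {name: i for i, name in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for subject, _label, obj in triples:
    matrix[index[subject]][index[obj]] = 1  # the edge label is dropped here

cells = len(nodes) ** 2
used = sum(sum(row) for row in matrix)
wasted_fraction = (cells - used) / cells

print(cells, used, wasted_fraction)  # 25 5 0.8
print(len(triples))                  # 5
```

Note that the loop has to discard `_label` to fill the matrix, which is the label loss mentioned above.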

5) Ontology: the schema of the knowledge graph

An ontology defines what types of entities exist and what types of relationships are allowed.

┌─────────────────────────────────────────────────┐
│  Ontology (schema)                               │
│                                                  │
│  Person ──[CEO_OF]──▶ Organisation               │
│  Organisation ──[PART_OF]──▶ Organisation        │
│  Organisation ──[FOUNDED_IN]──▶ Place            │
└─────────────────────────────────────────────────┘

Without an ontology, any edge label is allowed and the graph becomes a jungle. With an ontology, the graph query engine knows which relationships are valid for each entity type. That constraint makes traversal faster and answers more trustworthy.
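One way to picture the ontology is as a lookup table that every incoming triple is checked against. The sketch below assumes a deliberately simple design: one allowed (subject type, object type) pair per predicate, with entity types stored in a dictionary. `ONTOLOGY`, `ENTITY_TYPES`, and `is_valid` are hypothetical names.

```python
# Allowed (subject type, object type) per predicate, mirroring the ontology box.
ONTOLOGY = {
    "CEO_OF": ("Person", "Organisation"),
    "PART_OF": ("Organisation", "Organisation"),
    "FOUNDED_IN": ("Organisation", "Place"),
}

# Type assignments for the running example (assumed for this illustration).
ENTITY_TYPES = {
    "Sundar Pichai": "Person",
    "Google LLC": "Organisation",
    "Alphabet Inc.": "Organisation",
    "Mountain View": "Place",
}


def is_valid(triple):
    """Accept a triple only if its predicate is known and both end types match."""
    subject, predicate, obj = triple
    if predicate not in ONTOLOGY:
        return False
    subject_type, object_type = ONTOLOGY[predicate]
    return (
        ENTITY_TYPES.get(subject) == subject_type
        and ENTITY_TYPES.get(obj) == object_type
    )


print(is_valid(("Sundar Pichai", "CEO_OF", "Google LLC")))  # True
print(is_valid(("Mountain View", "CEO_OF", "Google LLC")))  # False
print(is_valid(("Google LLC", "LIKES", "Alphabet Inc.")))   # False
```

The same table can be used at query time to skip edges that cannot lead to the type being searched for.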

Where this lives in the wild

Wikidata — 100 M+ items modelled as RDF triples; powers sidebar facts in Wikipedia.
Google Knowledge Graph — property graph backing entity cards in Search; engineers query typed edges to assemble fact panels.
LinkedIn's entity graph — nodes for Person, Company, Job; edges for WORKS_AT, CONNECTED_TO, HAS_SKILL; powers recruiter search and "People Also Viewed."
Neo4j at eBay — product taxonomy graph where PART_OF and COMPATIBLE_WITH edges let recommendation engineers find substitutable items.
Amazon Neptune at financial firms — compliance graphs model regulatory relationships as typed edges; legal engineers traverse ownership and control chains.

Pause and recall

What are the three components of a triple, and which one becomes the relationship?
Why does an adjacency matrix waste space compared to a triple store?
What does an ontology add to a property graph that raw triples lack?
In the diagram, which entity is the multi-hop junction between Sundar Pichai and Alphabet?

Interview Q&A

Q: Why prefer a property graph over RDF for most production systems? A: Property graphs allow edge properties (timestamps, weights, confidence scores) without creating extra nodes. RDF requires reification — a verbose pattern that complicates queries. For most engineering tasks, property graphs are simpler to build and query.

Common wrong answer to avoid: "RDF is outdated" — RDF is still the right choice for public linked data federation; the tradeoff is expressiveness vs. interoperability.
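The reification overhead mentioned in the answer can be seen by writing the same fact both ways. The sketch uses plain tuples rather than an RDF library. The `rdf:` terms are the standard reification vocabulary; `stmt1`, the `since` predicate, and the helper function are invented for this example.

```python
# Property graph: one edge, with the qualifier stored on it.
property_edge = ("Pichai", "CEO_OF", "Google", {"since": 2015})

# Classic RDF reification: the statement becomes a node of its own ("stmt1"),
# described by four triples, plus one more triple per qualifier.
reified = [
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "Pichai"),
    ("stmt1", "rdf:predicate", "CEO_OF"),
    ("stmt1", "rdf:object", "Google"),
    ("stmt1", "since", "2015"),
]


def since_from_reified(triples, subject, predicate, obj):
    """Locate the statement node for (subject, predicate, obj), then read `since`."""
    by_stmt = {}
    for s, p, o in triples:
        by_stmt.setdefault(s, {})[p] = o
    wanted = (subject, predicate, obj)
    for fields in by_stmt.values():
        found = (
            fields.get("rdf:subject"),
            fields.get("rdf:predicate"),
            fields.get("rdf:object"),
        )
        if found == wanted:
            return fields.get("since")
    return None


print(property_edge[3]["since"])                                  # 2015
print(since_from_reified(reified, "Pichai", "CEO_OF", "Google"))  # 2015
print(len(reified))                                               # 5
```

One edge with one property on the left becomes five triples and a two-step lookup on the right.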

Q: Why store a sparse graph as triples instead of an adjacency matrix? A: Real-world knowledge graphs are extremely sparse: average degree is on the order of 10, out of millions of possible connections per node. The adjacency matrix is O(N²) in memory; the triple store is O(E), where E is the number of actual edges. At billion-node scale the matrix would need 10¹⁸ cells, which is impractical to store.

Common wrong answer to avoid: "Matrices are faster for lookup" — pointer-based graph DBs achieve O(1) per hop through index-free adjacency; matrices are only faster for dense linear algebra.
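The O(N²) versus O(E) gap is easy to put numbers on. The figures below assume one billion nodes, the average degree of 10 quoted in the answer, and an optimistic one bit per matrix cell.

```python
def matrix_cells(num_nodes):
    """Cells in a dense N x N adjacency matrix."""
    return num_nodes ** 2


def triple_count(num_nodes, avg_degree):
    """Stored edges when each node has avg_degree outgoing edges on average."""
    return num_nodes * avg_degree


n = 10**9     # one billion nodes
degree = 10   # average degree, as assumed above

cells = matrix_cells(n)          # 10**18
edges = triple_count(n, degree)  # 10**10

# Even at one bit per cell, the matrix needs 10**18 / 8 bytes.
matrix_petabytes = cells / 8 / 10**15

print(cells // edges)    # 100000000
print(matrix_petabytes)  # 125.0
```

Under these assumptions the matrix holds a hundred million times more cells than there are edges to record.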

Q: Why does the ontology matter for the graph query engine's correctness? A: Without a schema, the graph query engine cannot prune invalid paths early. It may traverse a FOUNDED_IN edge expecting a Place and land on a Date, causing downstream errors. The ontology constrains traversal to well-typed paths.

Common wrong answer to avoid: "Ontologies are only for documentation". Where the store supports it (Neo4j constraints or SHACL validation on RDF data, for example), type constraints can enforce data integrity at write time.

Q: Why does adding edge properties matter for knowledge graph construction? A: Facts have temporal scope and confidence. An edge WORKS_AT without a since property cannot model job changes. Without confidence, the graph cannot rank contradicting triples.

Common wrong answer to avoid: "Properties on edges complicate queries" — modern graph query languages handle edge properties cleanly; the modelling benefit outweighs query cost.
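A small sketch shows how `since`, `until`, and `confidence` properties answer questions that a bare WORKS_AT edge cannot. All companies, dates, and scores here are invented for illustration, and `employer_on` is a hypothetical helper.

```python
from datetime import date

# Hypothetical WORKS_AT edges for one person. The last two contradict each other.
edges = [
    {"type": "WORKS_AT", "target": "Company A",
     "since": date(2015, 1, 1), "until": date(2019, 6, 30), "confidence": 0.9},
    {"type": "WORKS_AT", "target": "Company B",
     "since": date(2019, 7, 1), "until": None, "confidence": 0.8},
    {"type": "WORKS_AT", "target": "Company C",
     "since": date(2019, 7, 1), "until": None, "confidence": 0.3},
]


def employer_on(day):
    """Return the most confident WORKS_AT target whose time window contains `day`."""
    active = [
        e for e in edges
        if e["type"] == "WORKS_AT"
        and e["since"] <= day
        and (e["until"] is None or day <= e["until"])
    ]
    if not active:
        return None
    return max(active, key=lambda e: e["confidence"])["target"]


print(employer_on(date(2017, 3, 1)))  # Company A
print(employer_on(date(2021, 3, 1)))  # Company B
print(employer_on(date(2010, 1, 1)))  # None
```

The time window handles the job change, and the confidence score breaks the tie between the two contradicting edges.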

Apply now (5 min)

Exercise. Pick any three facts about a company you know. Write them as SPO triples with typed predicates. Then draw the mini graph with proper typed edges. Count: how many entities and how many relationships did you create?

Sketch from memory. Draw a property graph node for a Person with three properties. Draw an edge from that node to an Organisation node. Label the edge type and add one edge property (e.g., since).

Bridge. We know what a knowledge graph looks like. Now the hard question: how do you build one automatically from raw text? That's the construction pipeline. → 03-knowledge-graph-construction.md