03. Knowledge Graph Construction — From raw text to typed triples

~15 min read. The pipeline that turns sentences into a traversable knowledge graph.

Continues from the first-principles overview in 00-first-principles.md. The knowledge graph doesn't exist until someone surveys every entity, draws every relationship between them, and labels each line with what kind of connection it carries.

1) The four-stage construction pipeline

Raw text contains facts in prose form. The pipeline converts prose into explicit triples.

Raw text
   │
   ▼
┌──────────────────────┐
│  1. NER              │  extract entity spans
│     Named Entity     │  (stations on the map)
│     Recognition      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  2. Co-reference     │  "He" → "Sundar Pichai"
│     Resolution       │  collapse aliases
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  3. Relation         │  extract typed edges
│     Extraction       │  (relationships between stations)
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  4. Schema           │  align to ontology;
│     Alignment        │  merge duplicate stations
└──────────┬───────────┘
           │
           ▼
Knowledge graph triples

Each stage has its own failure mode, and errors compound downstream: a missed entity in stage 1 leaves the relationship in stage 3 with a dangling end.
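The dangling-end failure can be detected mechanically. The sketch below is illustrative only: the `Triple` type and `find_dangling` helper are hypothetical names, and the "stage 1 missed Alphabet" scenario is an assumed failure, not output from a real extractor.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def find_dangling(entities: set[str], triples: list[Triple]) -> list[Triple]:
    """Return triples whose subject or object was never extracted as an entity."""
    return [t for t in triples if t.subject not in entities or t.obj not in entities]


# Assumed failure: stage 1 found two entities but missed "Alphabet".
entities = {"Sundar Pichai", "Google"}
triples = [
    Triple("Sundar Pichai", "LEADS", "Google"),
    Triple("Google", "SUBSIDIARY_OF", "Alphabet"),
]

dangling = find_dangling(entities, triples)
print(dangling)  # the SUBSIDIARY_OF triple has no node for its object
```

A check like this can run between stages 3 and 4, so the gap is flagged before it reaches the graph.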

2) Worked example: extracting triples from three sentences

Input text:

Sentence A: "Sundar Pichai leads Google, which is a subsidiary of Alphabet."
Sentence B: "He joined the company in 2004."
Sentence C: "Alphabet is headquartered in Mountain View, California."

Stage 1 — NER:

Sundar Pichai  → PERSON
Google         → ORGANISATION
Alphabet       → ORGANISATION
2004           → DATE
Mountain View  → LOCATION
California     → LOCATION
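A toy version of this stage fits in a few lines. The sketch below assumes a hand-written gazetteer (`GAZETTEER`) and a year regex, both hypothetical stand-ins for a trained NER model, and reproduces the entity list above.

```python
import re

# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {
    "Sundar Pichai": "PERSON",
    "Google": "ORGANISATION",
    "Alphabet": "ORGANISATION",
    "Mountain View": "LOCATION",
    "California": "LOCATION",
}
# Four-digit years in the 1900s or 2000s are tagged as DATE.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return unique (span, type) pairs in order of first appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), name, label))
    for match in YEAR.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    ordered = [(span, label) for _, span, label in sorted(found)]
    # dict.fromkeys drops repeats (e.g. the second "Alphabet") and keeps order.
    return list(dict.fromkeys(ordered))


text = (
    "Sundar Pichai leads Google, which is a subsidiary of Alphabet. "
    "He joined the company in 2004. "
    "Alphabet is headquartered in Mountain View, California."
)
entities = tag_entities(text)
for span, label in entities:
    print(f"{span:<14} -> {label}")
```

A dictionary lookup like this has the high-precision, low-recall profile of rule-based extraction: it never finds an entity that is not already in the gazetteer.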

Stage 2 — Co-reference: "He" resolves to "Sundar Pichai." "The company" resolves to "Google."

Stage 3 — Relation extraction:

(Sundar Pichai, LEADS,          Google)
(Google,        SUBSIDIARY_OF,  Alphabet)
(Sundar Pichai, JOINED,         Google, since:2004)
(Alphabet,      HQ_IN,          Mountain View)
(Mountain View, LOCATED_IN,     California)

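Five triples from three sentences. Each entity is a node and each typed connection is a relationship; the date 2004 is stored as a property (since:2004) on the JOINED edge rather than as a separate node.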
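To make "traversable" concrete, here is a minimal sketch that loads the five triples into an adjacency list and walks it. The 4-tuple layout (with a trailing property dict) and the helper names `build_graph` and `reachable` are assumptions for illustration, not a fixed storage format.

```python
from collections import defaultdict

# (subject, predicate, object, edge properties)
TRIPLES = [
    ("Sundar Pichai", "LEADS", "Google", {}),
    ("Google", "SUBSIDIARY_OF", "Alphabet", {}),
    ("Sundar Pichai", "JOINED", "Google", {"since": 2004}),
    ("Alphabet", "HQ_IN", "Mountain View", {}),
    ("Mountain View", "LOCATED_IN", "California", {}),
]


def build_graph(triples):
    """Map each subject to its outgoing (predicate, object, properties) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj, props in triples:
        graph[subject].append((predicate, obj, props))
    return graph


def reachable(graph, start):
    """Collect every node reachable from start by following outgoing edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target, _ in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


graph = build_graph(TRIPLES)
nodes = {s for s, _, _, _ in TRIPLES} | {o for _, _, o, _ in TRIPLES}
print(len(nodes))                                 # 5
print(sorted(reachable(graph, "Sundar Pichai")))  # every other node
```

Starting from "Sundar Pichai", the walk reaches California in four hops, a fact that no single input sentence states.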

Stage 4 — Schema alignment: Is "LEADS" the same as "CEO_OF"? That depends on the ontology: map "LEADS" → "CEO_OF" only if the ontology defines them as synonyms. Duplicate nodes are merged here too, so "Google" in Sentence A and "the company" in Sentence B (already resolved in stage 2) end up as a single node.
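Predicate alignment can be sketched as a lookup against the ontology. Everything below is assumed for illustration: the `CANONICAL` predicate set, the `SYNONYMS` table, and the policy of routing unknown predicates to review instead of guessing.

```python
# Hypothetical ontology: the predicates the target schema accepts.
CANONICAL = {"CEO_OF", "SUBSIDIARY_OF", "JOINED", "HQ_IN", "LOCATED_IN"}
# Hypothetical synonym table: extractor predicate -> canonical predicate.
SYNONYMS = {"LEADS": "CEO_OF"}


def align(triples):
    """Split triples into schema-aligned ones and ones needing human review."""
    aligned, unmapped = [], []
    for subject, predicate, obj in triples:
        canonical = SYNONYMS.get(predicate, predicate)
        if canonical in CANONICAL:
            aligned.append((subject, canonical, obj))
        else:
            # Unknown predicates are kept aside, never silently collapsed.
            unmapped.append((subject, predicate, obj))
    return aligned, unmapped


aligned, unmapped = align([
    ("Sundar Pichai", "LEADS", "Google"),
    ("Google", "SUBSIDIARY_OF", "Alphabet"),
    ("Google", "PART_OF", "Alphabet"),
])
print(aligned)
print(unmapped)
```

"PART_OF" lands in the review list because the synonym table does not vouch for it; treating it as "SUBSIDIARY_OF" would be exactly the over-collapsing that schema alignment is meant to avoid.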

3) Precision vs recall tradeoff in extraction

High-precision extractors use strict patterns and miss rare phrasings. High-recall extractors use broader classifiers and introduce noise.

Worked numbers:

Extractor type   Precision   Recall   F1
─────────────────────────────────────────
Rule-based       0.91        0.55     0.69
ML classifier    0.78        0.81     0.79
LLM prompt       0.72        0.88     0.79

Rule-based: great for well-defined domains (finance, legal). ML classifier: balanced for general text. LLM prompt: highest recall, lowest precision; needs post-filtering.
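The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short sketch that recomputes it from the precision and recall figures in the table (the `f1` helper name is ours):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table above.
extractors = {
    "Rule-based": (0.91, 0.55),
    "ML classifier": (0.78, 0.81),
    "LLM prompt": (0.72, 0.88),
}

scores = {name: round(f1(p, r), 2) for name, (p, r) in extractors.items()}
for name, score in scores.items():
    print(f"{name:<15} F1 = {score:.2f}")
```

Because the harmonic mean is pulled toward the smaller value, the rule-based extractor's low recall drags its F1 well below its 0.91 precision.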

The graph query engine downstream suffers most from low precision: a wrong relationship can confidently lead traversal to the wrong entity, whereas a missing relationship only leaves a gap.

4) Entity co-reference and alias resolution

"Apple" can be the fruit, the company, or a music label. Two sentences about "Apple" may refer to entirely different entities.

Co-reference resolution assigns each mention to exactly one entity.

Mention        Candidates                 Chosen station
─────────────────────────────────────────────────────────
"Apple"        Apple Inc., apple (fruit)  Apple Inc.  (by context: stock price)
"the firm"     Apple Inc.                 Apple Inc.  (referent in prior sentence)
"Tim Cook"     Person entity              Tim Cook

Without this step the knowledge graph has duplicate entities: "Apple," "the firm," and "AAPL" become three separate nodes with disconnected relationships. The graph query engine can't navigate between them.
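The node-merging effect can be shown with a small alias table. This is a sketch under assumptions: `ALIASES` is hand-written, the three raw triples are illustrative, and disambiguation (company vs. fruit) is presumed to have happened already, so "Apple" here always means Apple Inc.

```python
# Hypothetical alias table: lower-cased mention -> canonical entity name.
ALIASES = {
    "apple": "Apple Inc.",
    "the firm": "Apple Inc.",
    "aapl": "Apple Inc.",
    "tim cook": "Tim Cook",
}


def canonicalise(mention: str) -> str:
    """Map a mention to its canonical entity; unknown mentions pass through."""
    return ALIASES.get(mention.strip().lower(), mention)


def merge_aliases(triples):
    """Rewrite both ends of every triple and drop exact duplicates."""
    merged = {(canonicalise(s), p, canonicalise(o)) for s, p, o in triples}
    return sorted(merged)


def node_count(triples) -> int:
    return len({s for s, _, _ in triples} | {o for _, _, o in triples})


# Illustrative raw triples, each using a different alias for the same company.
raw = [
    ("Tim Cook", "CEO_OF", "Apple"),
    ("the firm", "HQ_IN", "Cupertino"),
    ("AAPL", "LISTED_ON", "NASDAQ"),
]
merged = merge_aliases(raw)
print(node_count(raw), "->", node_count(merged))  # 6 -> 4
```

Before merging, Tim Cook has no path to Cupertino; afterwards both edges hang off the single "Apple Inc." node.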

5) LLM-assisted construction: gains and risks

Modern pipelines use LLMs to extract triples directly from text. Prompt: "Extract (subject, predicate, object) triples from this paragraph."

Gain: high recall, handles paraphrase. Risk: hallucinated triples with no textual support.

┌────────────────────────────────────────┐
│  LLM extraction output                 │
│                                        │
│  (Pichai, BORN_IN, Madurai) ← correct  │
│  (Pichai, STUDIED_AT, MIT)  ← wrong!   │
│    (Pichai went to IIT, not MIT)       │
└────────────────────────────────────────┘

The fix is a verification pass: check each extracted triple against its source sentences. A triple with no textual evidence gets a low confidence score or is dropped.
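A crude form of that verification pass is sketched below, reusing the three sentences from the worked example in section 2. The evidence test (subject and object both appear in one sentence), the 0.1 fallback score, and the 0.5 cut-off are all assumptions chosen for illustration; a production check would also verify the predicate and resolve pronouns first.

```python
SENTENCES = [
    "Sundar Pichai leads Google, which is a subsidiary of Alphabet.",
    "He joined the company in 2004.",
    "Alphabet is headquartered in Mountain View, California.",
]


def has_evidence(triple, sentences) -> bool:
    """True if a single sentence mentions both the subject and the object."""
    subject, _, obj = triple
    return any(
        subject.lower() in s.lower() and obj.lower() in s.lower()
        for s in sentences
    )


def verify(triples, sentences, unsupported_confidence=0.1):
    """Attach a confidence score: 1.0 with evidence, a low score without."""
    return [
        (t, 1.0 if has_evidence(t, sentences) else unsupported_confidence)
        for t in triples
    ]


candidates = [
    ("Sundar Pichai", "LEADS", "Google"),
    ("Sundar Pichai", "STUDIED_AT", "MIT"),  # no sentence mentions MIT
]
scored = verify(candidates, SENTENCES)
kept = [t for t, confidence in scored if confidence >= 0.5]
print(kept)
```

Note the limits of substring matching: a true triple such as (Sundar Pichai, JOINED, Google) would fail this check, because its only evidence is Sentence B, which says "He" and "the company". Running the check on co-reference-resolved text avoids that false rejection.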

Where this lives in the wild

IBM Watson Knowledge Studio — annotation teams label NER and relation schemas; ML models trained on those labels build domain-specific KGs.
Diffbot's Knowledge Graph — automated web scraping plus NLP pipeline extracts 400 M+ entities from web pages into a live commercial graph.
Meta's Unified Data Entity — co-reference resolution across Facebook, Instagram, and WhatsApp event streams links user activity to a single entity graph.
BioGPT at Microsoft Research — biomedical relation extraction pipeline builds protein-gene-disease graphs from PubMed to power drug discovery queries.
Bloomberg's enterprise KG — financial NLP pipeline extracts issuer-event-date triples from earnings reports and news, powering analyst workstation search.

Pause and recall

What happens to relationships if co-reference resolution fails in stage 2?
In the worked example, which two triples capture Sundar Pichai's relationship to Google?
Why does low precision hurt the graph query engine more than low recall?
What check prevents an LLM from adding hallucinated triples to the knowledge graph?

Interview Q&A

Q: Why not skip co-reference resolution and just extract triples from each sentence independently? A: Cross-sentence facts get split across disconnected mentions. "He founded X" and "The founder is Y" never link to the same entity. The knowledge graph ends up with floating nodes and broken relationships that the graph query engine can't traverse end-to-end.

Common wrong answer to avoid: "Modern LLMs handle co-reference automatically" — they improve on it but still fail at long-document pronoun resolution and alias chains.

Q: Why prefer ML relation extraction over pure rule-based extraction for general domains? A: Rule-based systems enumerate patterns manually; they miss paraphrase. ML classifiers generalise from annotated examples to unseen phrasings, achieving higher recall at acceptable precision cost.

Common wrong answer to avoid: "Rules are always more precise" — for narrow domains that's true, but rule maintenance cost explodes as domain breadth grows.

Q: Why is schema alignment the hardest stage in construction pipelines? A: Different extractors, languages, and sources produce different predicate names for the same relationship (LEADS, CEO_OF, HEADS, IS_CHIEF_OF). Mapping them without over-collapsing distinct relations requires human or semi-supervised judgment.

Common wrong answer to avoid: "String matching solves synonym alignment" — "PART_OF" and "SUBSIDIARY_OF" are not synonyms; shallow matching creates data quality bugs.

Q: Why add a confidence score to extracted triples instead of accepting everything? A: Downstream graph query engine decisions based on uncertain triples silently produce wrong answers. Confidence scores let the system acknowledge uncertainty, filter low-quality paths, or flag answers for review rather than presenting them as facts.

Common wrong answer to avoid: "Confidence scores slow down retrieval" — they are stored as edge properties at write time, so traversal pays at most one cheap property comparison per edge.
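A minimal sketch of confidence as an edge property, filtered at query time. The edge list, the scores, and the `neighbours` helper are hypothetical; real graph databases expose this as a property predicate in the query language.

```python
# (subject, predicate, object, confidence) with made-up scores for illustration.
EDGES = [
    ("Sundar Pichai", "CEO_OF", "Google", 0.95),
    ("Google", "SUBSIDIARY_OF", "Alphabet", 0.90),
    ("Sundar Pichai", "STUDIED_AT", "MIT", 0.10),
]


def neighbours(edges, node, min_confidence=0.0):
    """Outgoing (predicate, object) pairs whose confidence meets the threshold."""
    return [
        (predicate, obj)
        for subject, predicate, obj, confidence in edges
        if subject == node and confidence >= min_confidence
    ]


print(neighbours(EDGES, "Sundar Pichai"))                      # both edges
print(neighbours(EDGES, "Sundar Pichai", min_confidence=0.5))  # trusted edge only
```

Keeping the low-confidence edge in storage and filtering at read time lets different consumers pick their own threshold, or surface the edge flagged for review.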

Apply now (5 min)

Exercise. Take one paragraph from any article. Run it mentally through the four stages: NER → co-reference → relation extraction → schema alignment. Write out the resulting triples. Count how many you are confident in vs. uncertain about.

Sketch from memory. Draw the four-stage pipeline as a vertical flow diagram with box-drawing characters. Label each stage's output and its most common failure mode.

Bridge. We can now build a knowledge graph from text. But to query it at scale, we need a database engine that understands graph traversal natively. → 04-graph-databases.md