05. Entity Linking: Which station does "Apple" mean?

~13 min read. The disambiguation step that maps ambiguous text mentions to exact graph nodes.

Continues from the first-principles overview in 00-first-principles.md. Every entity on the knowledge graph has a unique ID. Entity linking is the process of matching a noisy text mention to the correct entity so the graph query engine can start from the right node.

1) The problem: one name, many stations

"Apple" appears in text. It could be:

- Apple Inc. (technology company)
- Apple Records (music label)
- apple (the fruit)

The knowledge graph has all three as separate entities. The graph query engine must start from the exact one the writer meant. Getting this wrong propagates through every downstream hop.

text mention: "Apple released a new chip today"
                    │
                    ▼
         ┌──────────────────────┐
         │  Candidate generation│
         │  (string match +     │
         │   alias lookup)      │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Disambiguation      │
         │  (context scoring)   │
         └──────────┬───────────┘
                    │
                    ▼
              Apple Inc.  ✓

2) Stage 1: Candidate generation

Given a surface form (the mention text), generate a shortlist of candidate entities.

Three methods:

1. Exact match: "Apple Inc." → Apple Inc. node directly.
2. Alias lookup: an alias table maps "AAPL", "Apple", and "Apple Computer" to Apple Inc.
3. Embedding similarity: encode the mention, then find the nearest entity embeddings.

Alias table (example)
─────────────────────
"AAPL"          → Q312
"Apple Inc."    → Q312
"Apple Records" → Q401
"Apple Corp"    → Q401
"apple"         → Q89

Alias tables are hand-curated or mined from sources such as Wikipedia redirect pages and Wikidata alias labels. They cover most mentions. Embedding similarity handles novel mentions that are not in the alias table.
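A minimal sketch of candidate generation, assuming a small in-memory alias table and toy 3-dimensional entity embeddings. The `generate_candidates` name, the vectors, and the extra ambiguous "Apple" entry are illustrative assumptions, not part of any particular library. Exact match and alias lookup share one table here, and embedding similarity is used only as a fallback.

```python
import math

# Alias table from the example above. The ambiguous "Apple" entry
# (one surface form, several entities) is added for illustration.
ALIASES = {
    "AAPL": ["Q312"],
    "Apple Inc.": ["Q312"],
    "Apple Records": ["Q401"],
    "Apple Corp": ["Q401"],
    "apple": ["Q89"],
    "Apple": ["Q312", "Q401", "Q89"],
}

# Toy entity embeddings (hypothetical values, 3 dimensions for readability).
ENTITY_VECTORS = {
    "Q312": [0.9, 0.1, 0.0],
    "Q401": [0.1, 0.9, 0.0],
    "Q89": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generate_candidates(mention, mention_vector=None, top_k=2):
    """Return a shortlist of entity IDs for a surface form."""
    # Methods 1 and 2: exact match and alias lookup.
    if mention in ALIASES:
        return list(ALIASES[mention])
    # Method 3: nearest entity embeddings, only if a mention vector is given.
    if mention_vector is None:
        return []
    ranked = sorted(
        ENTITY_VECTORS,
        key=lambda qid: cosine(mention_vector, ENTITY_VECTORS[qid]),
        reverse=True,
    )
    return ranked[:top_k]
```

Calling `generate_candidates("Apple")` returns all three entities, which is exactly the shortlist the disambiguation stage has to rank. In a real system the mention vector would come from a text encoder and the nearest-neighbour search from an approximate index rather than a full sort.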

3) Stage 2: Disambiguation (choosing the right station)

Given candidates, rank them by fit to context.

Approach: score each candidate using context features.

feature                              weight
───────────────────────────────────────────
entity type matches context           +0.4
prior probability (entity freq.)      +0.3
context window embedding similarity   +0.2
entity description match              +0.1

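Worked numerical example. Mention: "Apple released a new chip today." Context window: "semiconductor", "M4", "performance", "laptop." Each cell in the table is that feature's weighted contribution to the total; the description-match feature is left out of the table.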

Candidate	Type match	Prior	Emb sim	Total
Apple Inc.	0.4	0.30	0.20	0.90
Apple Records	0.0	0.05	0.02	0.07
apple (fruit)	0.0	0.10	0.01	0.11

Apple Inc. wins with score 0.90. The graph query engine now has the correct starting entity: Q312.
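The totals in the table can be reproduced by summing each candidate's weighted contributions. A minimal sketch, assuming the table cells are already weighted and that the description-match feature contributes 0.0 (the table does not list it); the function and variable names are hypothetical.

```python
# Weighted contributions copied from the worked example table.
# Description match is not listed in the table, so it is omitted here,
# which is equivalent to assuming it contributes 0.0.
CONTRIBUTIONS = {
    "Apple Inc.": {"type_match": 0.4, "prior": 0.30, "embedding_sim": 0.20},
    "Apple Records": {"type_match": 0.0, "prior": 0.05, "embedding_sim": 0.02},
    "apple (fruit)": {"type_match": 0.0, "prior": 0.10, "embedding_sim": 0.01},
}


def total_score(parts):
    """Sum the weighted feature contributions for one candidate."""
    # Rounding hides floating-point noise such as 0.8999999999999999.
    return round(sum(parts.values()), 2)


def disambiguate(contributions):
    """Return the best candidate and the score of every candidate."""
    scores = {name: total_score(parts) for name, parts in contributions.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = disambiguate(CONTRIBUTIONS)
print(best, scores)
```

The linear weighted sum is only one way to combine features. The same interface would work with a learned ranker, as long as it returns one comparable score per candidate.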

4) Entity linking at query time vs build time

Build time (KG construction): Link every mention in source documents to a canonical entity. Store the resolved entity ID in the triple. The knowledge graph contains unambiguous node IDs, not raw text strings.

Query time (Graph RAG): Extract entities from the user query. Link them to entities in the knowledge graph. Start traversal from those stations.

user query: "What products did the company Jobs co-founded ship in 2023?"
                                        │
                       entity extraction│
                                        ▼
                              "Jobs" → Steve Jobs (Q5765)
                              "company co-founded" → Apple Inc. (Q312)
                                        │
                            graph traversal│
                                        ▼
                              Apple Inc. ──[SHIPPED_IN:2023]──▶ ...

A wrong link at query time sends the graph query engine to the wrong entity. Every hop after that is wrong too.
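To make the failure mode concrete, here is a small sketch of build-time storage and query-time traversal over a toy graph. It assumes triples are stored as `(subject, predicate, object)` tuples keyed by entity ID; the `CO_FOUNDED` predicate and the `P_A`, `P_B`, `P_C` product IDs are hypothetical placeholders, not real graph data.

```python
from collections import defaultdict

# Build time: triples store resolved entity IDs, not raw strings.
TRIPLES = [
    ("Q5765", "CO_FOUNDED", "Q312"),     # Steve Jobs -> Apple Inc.
    ("Q312", "SHIPPED_IN:2023", "P_A"),  # placeholder product IDs
    ("Q312", "SHIPPED_IN:2023", "P_B"),
    ("Q401", "SHIPPED_IN:2023", "P_C"),  # belongs to Apple Records
]


def build_index(triples):
    """Index triples by (subject, predicate) for one-hop lookups."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].append(obj)
    return index


def traverse(index, start, predicate):
    """Follow one edge type from a start node; empty list if none."""
    return index.get((start, predicate), [])


index = build_index(TRIPLES)

# Query time: link "Jobs" to Q5765, hop to the company, then to products.
company = traverse(index, "Q5765", "CO_FOUNDED")[0]
print(traverse(index, company, "SHIPPED_IN:2023"))  # correct start node
print(traverse(index, "Q401", "SHIPPED_IN:2023"))   # wrong link, wrong answers
```

Starting from `Q401` instead of `Q312` does not raise an error. The traversal simply returns a different, plausible-looking result, which is why a wrong link fails silently.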

5) Nil linking and low-confidence handling

Not every mention has a matching entity on the knowledge graph. New entities, abbreviations, typos — these may produce no good candidate.

The question is what to do. Two strategies:

1. Nil prediction: predict "this mention doesn't link to anything known." Safe but loses coverage.
2. Threshold-based fallback: if the best candidate score is below 0.3, return NIL; otherwise link.

┌─────────────────────────────────────────────┐
│  score ≥ 0.7  →  link with high confidence  │
│  0.3 ≤ score < 0.7  →  link with warning    │
│  score < 0.3  →  NIL; flag for review       │
└─────────────────────────────────────────────┘

The Graph RAG system can then decide: use the linked entity or fall back to plain vector search.
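The three confidence bands can be sketched as a small routing function. The 0.7 and 0.3 thresholds come from the diagram; the `route` name and the returned dictionary shape are assumptions made for illustration.

```python
HIGH_CONFIDENCE = 0.7  # at or above: link with high confidence
NIL_THRESHOLD = 0.3    # below: treat the mention as NIL


def route(entity_id, score):
    """Map the best candidate and its score to a linking decision."""
    if score >= HIGH_CONFIDENCE:
        return {"entity": entity_id, "status": "linked"}
    if score >= NIL_THRESHOLD:
        return {"entity": entity_id, "status": "linked_with_warning"}
    # No trustworthy candidate: the caller can flag the mention for review
    # or fall back to plain vector search.
    return {"entity": None, "status": "nil"}


print(route("Q312", 0.90))
print(route("Q401", 0.07))
```

In practice the thresholds would be tuned on labelled data, since scores from different scoring schemes are not directly comparable.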

Where this lives in the wild

Google's BERT-based entity linker: maps every news article mention to Freebase/Knowledge Graph IDs at search indexing time; powers entity cards in Search results.
Wikidata's EntityLinker: Wikipedia editors use automated tools to link article mentions to Wikidata Q-IDs, keeping the knowledge graph internally consistent.
Bloomberg's NLP pipeline: financial entity linker maps "BofA," "Bank of America," and "BAC" to the same issuer entity, enabling cross-ticker analysis.
Amazon Product Graph: product attribute linker maps "iPhone 15 Pro" in seller descriptions to the canonical product entity, avoiding duplicate listings.
Microsoft Academic Graph: author name disambiguation links "J. Smith" in papers to the correct researcher node using co-author networks as context features.

Pause and recall

Why does a wrong entity link at query time ruin every downstream graph hop?
In the worked example, what feature gave Apple Inc. the highest single boost?
When should you predict NIL instead of linking to the best candidate?
What is the difference between build-time linking and query-time linking?

Interview Q&A

Q: Why not just match entity names exactly and skip the disambiguation model?
A: Exact matching fails on aliases, abbreviations, and polysemous names like "Apple" or "Amazon." A production system encounters all three regularly. Without disambiguation, the graph query engine lands on the wrong entity silently.

Common wrong answer to avoid: "Aliases are rare" — in financial, legal, and product data, the alias problem is more common than the clean-name case.

Q: Why use prior probability as a disambiguation feature?
A: Most mentions of "Apple" in tech news refer to Apple Inc. The prior captures this base rate. Without it, a rare homonymous entity can win on weak or incidental context signals. The prior grounds the model in global entity frequency.

Common wrong answer to avoid: "Prior makes the model biased toward popular entities" — that's true and it's usually the correct behaviour; rare entities need strong context evidence to override the prior.

Q: Why is entity linking harder at query time than at document indexing time?
A: Queries are short, often 5-15 words. Short context means embedding similarity is less discriminative. Also, query entities are often abbreviations or informal forms that don't appear in alias tables. Build-time linking has full paragraphs for context.

Common wrong answer to avoid: "Query time is the same as document time" — query brevity is a fundamentally different disambiguation challenge.

Q: Why does a knowledge graph use canonical Q-IDs instead of storing raw text strings?
A: Strings are ambiguous and vary across sources. Q-IDs (or similar canonical identifiers) are globally unique: all mentions of Apple Inc. across documents, languages, and time collapse to one entity, making traversal and deduplication correct by construction.

Common wrong answer to avoid: "String matching is good enough at small scale" — even small graphs accumulate alias conflicts that break the graph query engine silently.

Apply now (5 min)

Exercise. Pick three company names that are also common words or have famous namesakes. For each, list two candidate entities and decide which wins given a technology news context. Calculate a rough score using the four features from the worked example.

Sketch from memory. Draw the two-stage entity linking pipeline: candidate generation → disambiguation. Label the inputs, outputs, and failure mode at each stage.

Bridge. Now we can link mentions to entities. But working with the raw graph topology is slow for some queries. The fix is to learn vector representations of the graph — embeddings that make similarity between entities computable without traversal. → 06-graph-embeddings.md