13. Honest Admission — Where the knowledge graph still breaks

~15 min read. A strong engineer knows the failure modes as well as the capabilities.

Continues from the first-principles overview in 00-first-principles.md. Even the best knowledge graph has missing entities, wrong relationships, stale multi-hop paths, and queries the graph query engine cannot answer without guessing. These are not bugs to hide; they are engineering constraints to admit and design around.


1) Extraction quality is a ceiling, not a floor

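Everything downstream depends on what went into the knowledge graph. F1 alone does not say how many facts are missing or wrong; precision and recall do. With precision 0.75 (as in our earlier example), a triple extraction pipeline with F1 = 0.70 has recall ≈ 0.66, so about 34% of real facts are missing and 25% of claimed facts are wrong.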
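
F1 is the harmonic mean of precision and recall, so a given F1 and precision pin down the recall. The sketch below works that arithmetic for the F1 = 0.70, precision = 0.75 example; the function name `recall_from_f1` is illustrative, not part of any library.

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denom


precision = 0.75
f1 = 0.70
recall = recall_from_f1(f1, precision)

missing_share = 1 - recall    # real facts that were never extracted
wrong_share = 1 - precision   # extracted facts that are incorrect

print(f"recall={recall:.3f}  missing={missing_share:.1%}  wrong={wrong_share:.1%}")
```

Reporting the missing share and the wrong share separately is usually more useful than F1 alone, because the two failure modes hurt different parts of the pipeline.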

That is the ceiling for the graph query engine and the answer layer. No amount of clever traversal recovers a fact that was never extracted.

extraction F1 = 0.70
  max achievable KG quality = 0.70
  graph query engine accuracy ≤ 0.70
  answer correctness ≤ 0.70

You cannot out-engineer a bad knowledge graph foundation.


2) Scalability is still an open problem

Large knowledge graphs at web scale (Google, Wikidata, Meta) have billions of nodes. Three unresolved challenges:

Challenge 1: Traversal at billion-node scale. Even with index-free adjacency, a 5-hop query over nodes with an average out-degree of 50 can reach up to 50⁵ ≈ 312 M nodes at the final hop, so it needs aggressive pruning. Current production systems prune hard, and pruning can cut off the correct path.

Challenge 2: Embedding retraining. When new entities and relationships are added daily, transductive models (TransE) have no embedding for an unseen entity and typically need retraining to cover it. Inductive models (GraphSAGE) can embed unseen nodes, but they do not by themselves handle new relation types.

Challenge 3: Full validation. Worked example from file 11: a graph with 1 B edges at 2 M validations/hour takes 500 hours. No organisation validates everything continuously.

graph scale   1 M edges   100 M edges   1 B edges
─────────────────────────────────────────────────
full pass     0.5 hrs      50 hrs        500 hrs
viable?       yes          maybe         no
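
The table follows from one division: edges divided by validation rate. The sketch below reproduces it and adds a second, hypothetical question the table does not answer: what share of the graph a fixed time budget can cover. The 24-hour budget and both function names are assumptions made for illustration.

```python
VALIDATIONS_PER_HOUR = 2_000_000  # rate quoted in the worked example


def full_pass_hours(edge_count: int, rate_per_hour: int = VALIDATIONS_PER_HOUR) -> float:
    """Hours needed to validate every edge once."""
    return edge_count / rate_per_hour


def coverage_in_budget(edge_count: int, budget_hours: float,
                       rate_per_hour: int = VALIDATIONS_PER_HOUR) -> float:
    """Fraction of edges that can be validated within the time budget (capped at 1.0)."""
    return min(1.0, budget_hours * rate_per_hour / edge_count)


for edges in (1_000_000, 100_000_000, 1_000_000_000):
    hours = full_pass_hours(edges)
    covered = coverage_in_budget(edges, budget_hours=24)  # hypothetical daily budget
    print(f"{edges:>13,} edges: full pass {hours:>6.1f} h, 24 h covers {covered:.1%}")
```

At 1 B edges a daily budget covers under 5% of the graph, which is why sampling or prioritised validation replaces full passes at that scale.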

Scale forces approximation, sampling, and accepted incompleteness.
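
Challenge 1 can be made concrete with a toy example. The sketch below computes the worst-case frontier size and then runs a simple beam-pruned traversal over a five-node graph. The graph, its edge scores, and the function names are hypothetical, and beam pruning stands in for whatever pruning a production engine actually uses.

```python
def worst_case_frontier(branching: int, hops: int) -> int:
    """Upper bound on nodes reached at the final hop with a fixed out-degree."""
    return branching ** hops


# Hypothetical toy graph: node -> list of (neighbour, edge score).
GRAPH = {
    "A": [("B", 0.9), ("C", 0.4)],
    "B": [("D", 0.8)],
    "C": [("TARGET", 0.95)],
    "D": [],
    "TARGET": [],
}


def beam_reachable(start: str, hops: int, beam_width: int) -> set:
    """Nodes visited when only the top `beam_width` edges survive each hop."""
    frontier = [start]
    seen = {start}
    for _ in range(hops):
        candidates = [
            (score, neighbour)
            for node in frontier
            for neighbour, score in GRAPH[node]
        ]
        candidates.sort(reverse=True)  # highest-scoring edges first
        frontier = [neighbour for _, neighbour in candidates[:beam_width]]
        seen.update(frontier)
    return seen


print(worst_case_frontier(50, 5))                # 312500000
print("TARGET" in beam_reachable("A", 2, 1))     # False: the A -> C edge was pruned
print("TARGET" in beam_reachable("A", 2, 2))     # True: the wider beam keeps it
```

The correct path runs through a low-scoring first edge, so the narrow beam discards it before the high-scoring second edge is ever seen. That is the trade-off the section describes: a tighter beam costs less and misses more.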


3) Ontology alignment is a human problem

Two teams model the same relationship differently. "Subsidiary" vs "owned by" vs "controlled by." "Employee" vs "contractor" vs "consultant."

Automated alignment tools exist (embedding-based, rule-based). None of them replace human judgment for ambiguous cases.

At large organisations with multiple business units, ontology drift is common:

Team A: (Person, WORKS_AT, Company)
Team B: (Person, EMPLOYED_BY, Organisation)
Team C: (Staff, HAS_ROLE_AT, Entity)

All three mean the same thing but create three disconnected sub-graphs. The graph query engine crossing between them fails silently. Fixing requires an ontology governance process — not just more engineering.
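
Once a governance process has decided that the three schemas are equivalent, applying the decision is mechanical. The sketch below normalises the three team schemas to one canonical form using alias tables. The tables themselves, including the choice of `WORKS_AT` and `Organisation` as canonical names and the mapping of `Entity` to `Organisation`, are assumptions standing in for a reviewed policy.

```python
# Hypothetical alias tables; in practice these encode a reviewed governance decision.
PREDICATE_ALIASES = {
    "WORKS_AT": "WORKS_AT",
    "EMPLOYED_BY": "WORKS_AT",
    "HAS_ROLE_AT": "WORKS_AT",
}
TYPE_ALIASES = {
    "Person": "Person",
    "Staff": "Person",
    "Company": "Organisation",
    "Organisation": "Organisation",
    "Entity": "Organisation",
}


def normalise(triple: tuple) -> tuple:
    """Map a (subject type, predicate, object type) schema triple to canonical names."""
    subject_type, predicate, object_type = triple
    try:
        return (
            TYPE_ALIASES[subject_type],
            PREDICATE_ALIASES[predicate],
            TYPE_ALIASES[object_type],
        )
    except KeyError as missing:
        # Fail loudly instead of silently creating another disconnected sub-graph.
        raise ValueError(f"unmapped term {missing}; needs a human decision") from None


team_schemas = [
    ("Person", "WORKS_AT", "Company"),        # Team A
    ("Person", "EMPLOYED_BY", "Organisation"),  # Team B
    ("Staff", "HAS_ROLE_AT", "Entity"),       # Team C
]

canonical = {normalise(schema) for schema in team_schemas}
print(canonical)  # a single canonical schema triple
```

Raising on an unmapped term is deliberate: an unknown predicate is routed to a person rather than guessed at, which keeps the alias table a record of decisions instead of a heuristic.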


4) When knowledge graphs are overkill

Not every problem needs a knowledge graph. Use a graph when:

  • Multi-hop relational queries are frequent.
  • The same entities appear across many documents.
  • Provenance and auditability matter.
  • Explicit typed relations carry meaning that flat text loses.

Do not use a graph when:

┌─────────────────────────────────────────────────────┐
│  Situation                    │  Better tool        │
├───────────────────────────────┼─────────────────────┤
│  One-hop local lookups only   │  Plain vector RAG   │
│  Entities don't repeat        │  BM25 + vector      │
│  Unstructured opinions/prose  │  Summarisation      │
│  Real-time streaming events   │  Stream processor   │
└─────────────────────────────────────────────────────┘

Plain embedding retrieval (pure vector search) may be sufficient and simpler to maintain. Building a graph to answer questions that plain retrieval handles is over-engineering.


5) What we honestly don't know

Open problem 1: Evaluation benchmarks don't reflect production. FB15k-237 and WebQSP are clean, curated graphs. Real production graphs are noisy, inconsistently labelled, and partially stale. High benchmark scores don't predict production performance.

Open problem 2: LLM + graph interaction. LLMs sometimes ignore the retrieved subgraph and use parametric memory instead. We don't yet have reliable methods to force strict faithfulness to graph context.

Open problem 3: Automated ontology learning. We cannot reliably learn a correct ontology from raw text without human supervision. The boundary between "relation type A" and "relation type B" is often judgement-dependent.

Open problem 4: Confidence calibration. Graph confidence scores are not well-calibrated probabilities. A triple with confidence 0.85 does not mean an 85% chance of correctness. Calibration methods for KG extraction remain an active research area.
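
Open problem 4 can at least be measured on a hand-labelled sample. The sketch below bins triples by stated confidence, compares each bin's mean confidence with its observed accuracy, and summarises the gap as an expected calibration error. The sample data is invented for illustration, and the function names are not from any library.

```python
def reliability_table(scored_triples, n_bins=5):
    """Group (confidence, is_correct) pairs into equal-width confidence bins.

    Returns rows of (bin_low, bin_high, count, mean_confidence, accuracy),
    skipping empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in scored_triples:
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((confidence, is_correct))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_confidence, accuracy))
    return rows


def expected_calibration_error(scored_triples, n_bins=5):
    """Count-weighted average gap between stated confidence and observed accuracy."""
    total = len(scored_triples)
    return sum(
        count / total * abs(mean_confidence - accuracy)
        for _, _, count, mean_confidence, accuracy in reliability_table(scored_triples, n_bins)
    )


# Hypothetical hand-labelled sample: ten triples scored 0.85, six of them correct.
sample = [(0.85, True)] * 6 + [(0.85, False)] * 4
ece = expected_calibration_error(sample)
print(f"ECE = {ece:.2f}")  # stated 0.85 vs observed 0.60
```

A gap like this says the scores rank triples but should not be read as probabilities. Measuring the gap does not fix it; it only shows how far the scores can be trusted.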

┌────────────────────────────────────────────────────┐
│  Honest position                                   │
│                                                    │
│  strong at:  multi-hop factual Q&A                 │
│              structured provenance                 │
│              entity disambiguation                 │
│                                                    │
│  weak at:    broad, fuzzy, or opinion queries      │
│              billion-scale freshness               │
│              cross-ontology alignment              │
│              calibrated confidence                 │
└────────────────────────────────────────────────────┘

Where this lives in the wild

  • Google Knowledge Graph team — engineers openly acknowledge coverage gaps and prioritise filling them by entity type; completeness is never claimed globally.
  • Wikidata community — openly tracks "constraint violations" where relationships connect wrong entity types; thousands of violations exist at any given time.
  • Microsoft GraphRAG research paper — explicitly benchmarks both local and global modes, acknowledges latency and cost as barriers to wide deployment.
  • Diffbot commercial graph — SLA documentation states extraction precision per entity type; some rare entity types have precision as low as 0.60.
  • Enterprise AI practitioners — experienced teams document when they abandoned a graph approach and reverted to plain retrieval after maintenance cost exceeded value.

Pause and recall

  1. If extraction F1 is 0.70, what is the maximum possible answer correctness?
  2. Why does a high benchmark score on FB15k-237 not predict production performance?
  3. Name two situations where plain vector RAG beats a knowledge graph.
  4. What is the "calibration" problem for KG confidence scores?

Interview Q&A

Q: Why admit limits in an interview? Won't that hurt your evaluation? A: Interviewers at senior levels value calibrated judgment over uncritical enthusiasm. Admitting "KGs are overkill when entities don't repeat" shows design maturity. A candidate who claims graphs solve everything signals they haven't run them in production.

Common wrong answer to avoid: "Always recommend the most sophisticated approach" — recommending a simpler tool when it fits better is a sign of engineering seniority.

Q: Why can't better LLMs fully replace explicit graph traversal? A: Better LLMs improve generalisation, not factual grounding. An LLM without a knowledge graph is prone to hallucinating intermediate entities at multi-hop depth when the facts are not in its weights or context. Reliable grounding requires an explicit external store, not more parameters.

Common wrong answer to avoid: "GPT-5 will solve this": scaling alone has not been shown to remove the need for retrieval grounding on structured multi-hop factual questions.

Q: Why is ontology alignment a governance problem, not just an engineering problem? A: Automated tools can propose alignments but cannot resolve semantic ambiguity. Whether "contractor" should merge into "WORKS_AT" depends on legal, HR, and business context — judgment that must be codified in policy, not inferred from text co-occurrence.

Common wrong answer to avoid: "Embeddings can align ontologies automatically" — they cluster similar terms but cannot make the business rule decision about what's the same.

Q: Why does the LLM sometimes ignore the retrieved graph context? A: LLMs are trained to be helpful and fluent. When graph context is sparse or formatted awkwardly, the LLM completes from parametric memory — which can be wrong. Mitigation requires formatting context clearly, using citation prompts, and measuring faithfulness.

Common wrong answer to avoid: "More graph context always helps" — overloading the context window with unstructured graph dumps can make faithfulness worse, not better.
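
"Measuring faithfulness" in the last answer can start very simply. The sketch below scores an answer by the share of its claimed triples that appear in the retrieved subgraph. It assumes the answer's claims have already been parsed into triples by some upstream step; the entity names and the function name are hypothetical.

```python
def faithfulness(answer_triples, retrieved_subgraph):
    """Return (supported share, unsupported claims) for an answer.

    A claim counts as supported only if the exact triple is in the retrieved subgraph.
    """
    claimed = set(answer_triples)
    if not claimed:
        return 1.0, set()  # nothing claimed, so nothing contradicts the context
    supported = claimed & set(retrieved_subgraph)
    return len(supported) / len(claimed), claimed - supported


# Hypothetical retrieved context and parsed answer claims.
retrieved = {
    ("AcmeCo", "SUBSIDIARY_OF", "HoldCo"),
    ("HoldCo", "HEADQUARTERED_IN", "Berlin"),
}
answer = [
    ("AcmeCo", "SUBSIDIARY_OF", "HoldCo"),
    ("HoldCo", "HEADQUARTERED_IN", "Munich"),  # not in the retrieved subgraph
]

score, unsupported = faithfulness(answer, retrieved)
print(score)        # 0.5
print(unsupported)  # the Munich claim
```

Exact-match checking is strict and will miss paraphrased or aliased entities, but it gives a number to track while testing changes to context formatting and citation prompts.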


Apply now (5 min)

Exercise. Take a real project you know. Argue both sides: why a graph would add value, and why it might be overkill. List two specific open problems from this file that would affect your scenario. Propose one honest limitation you would include in any design document.

Sketch from memory. Draw the ceiling diagram from Section 1: extraction F1 flowing down through graph query engine accuracy to answer correctness. Label each layer with a realistic score from a production system you've read about.


Bridge. You now see graphs as they are: powerful for relational reasoning, limited by extraction quality, scale, and ontology cost. Next, the curriculum moves to a neighbouring domain: making structured data — tables, schemas, databases — available to LLMs through code generation. → 00-eli5.md