12. Graph Evaluation: Measuring the map, the route, and the answer separately

~15 min read. Evaluating only the final answer hides which layer actually broke.

This file continues from the first-principles overview in 00-first-principles.md. Knowledge graph quality, graph query engine accuracy, and final answer quality must be measured as separate layers, because a correct-looking answer can hide a broken map underneath.

1) Three evaluation layers

Layer 1: Knowledge graph quality
    │  Is the knowledge graph accurate? Are stations and relationships correct?
    ▼
Layer 2: Retrieval quality
    │  Does the graph query engine find the right path?
    ▼
Layer 3: Answer quality
       Is the LLM answer faithful to the retrieved context?

Each layer can independently pass or fail. A lucky answer on a bad graph means you can't reproduce it. A perfect graph with bad retrieval still produces wrong answers. Measuring only the final answer tells you that something is broken — not what or where.

2) Layer 1: KG quality metrics

The gold standard for KG evaluation: a human-annotated set of ground-truth triples.

gold triple set:   100 true triples
predicted set:     80 triples
correct among predicted: 60

Precision = correct / predicted = 60 / 80  = 0.75
Recall    = correct / gold      = 60 / 100 = 0.60
F1        = 2 × P × R / (P + R)
          = 2 × 0.75 × 0.60 / (0.75 + 0.60)
          = 0.90 / 1.35
          = 0.667

So F1 ≈ 0.67: the knowledge graph captures 60% of the gold facts, and 75% of its extracted triples are correct. As a rule of thumb for production, aim for precision above 0.85 so that the graph query engine is not flooded with wrong relationships.
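The arithmetic above can be reproduced with a small set-based sketch. It assumes triples are exact-match `(head, relation, tail)` tuples; the function name `triple_prf` and the synthetic entity names are illustrative, not part of any library.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail), compared by exact match


def triple_prf(gold: Set[Triple], predicted: Set[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted triples against a gold set."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Synthetic data that reproduces the counts in the example:
# 100 gold triples, 80 predicted, 60 of the predicted are in the gold set.
gold = {(f"e{i}", "REL", f"t{i}") for i in range(100)}
predicted = {(f"e{i}", "REL", f"t{i}") for i in range(60)} | {
    (f"x{i}", "REL", f"y{i}") for i in range(20)
}

precision, recall, f1 = triple_prf(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.75 0.6 0.667
```

Exact matching is the strictest choice. Real pipelines often normalize entity names before comparing, which is outside this sketch.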

3) Layer 2: retrieval quality metrics

Does the graph query engine traverse the right path?

Hit@K: is the correct answer entity in the top-K retrieved nodes?

K=1:  is the answer the top-1 retrieved entity?
K=3:  is the answer in the top-3?
K=10: is the answer in the top-10?
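A minimal sketch of Hit@K over a batch of queries, assuming each query yields a ranked list of entity names and has exactly one gold answer. The function names and the city entities are hypothetical.

```python
from typing import List, Sequence


def hit_at_k(ranked: Sequence[str], gold: str, k: int) -> bool:
    """True if the gold entity appears among the first k retrieved entities."""
    return gold in ranked[:k]


def hit_rate_at_k(rankings: List[Sequence[str]], golds: List[str], k: int) -> float:
    """Fraction of queries whose gold entity is in the top k."""
    hits = sum(hit_at_k(r, g, k) for r, g in zip(rankings, golds))
    return hits / len(golds)


# Three hypothetical queries; the gold entity sits at rank 1, 3 and 2.
rankings = [
    ["Paris", "Lyon", "Nice"],
    ["Lyon", "Nice", "Paris"],
    ["Nice", "Paris", "Lyon"],
]
golds = ["Paris", "Paris", "Paris"]

for k in (1, 2, 3):
    print(k, round(hit_rate_at_k(rankings, golds, k), 2))
```

Hit@K can only rise as K grows, so a large gap between Hit@1 and Hit@10 suggests the engine finds the answer but ranks it poorly.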

MRR (Mean Reciprocal Rank): average of 1/rank for the correct answer.

Query 1: correct answer at rank 1  → 1/1 = 1.0
Query 2: correct answer at rank 3  → 1/3 = 0.33
Query 3: correct answer at rank 2  → 1/2 = 0.50
MRR = (1.0 + 0.33 + 0.50) / 3 = 0.61
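The same calculation as a short sketch. It assumes one gold answer per query and, as a convention chosen here, scores a query as 0 when the gold answer is not retrieved at all. The entity names are placeholders.

```python
from typing import List, Sequence


def mean_reciprocal_rank(rankings: List[Sequence[str]], golds: List[str]) -> float:
    """Average of 1/rank of the gold answer; a missing answer contributes 0."""
    total = 0.0
    for ranked, gold in zip(rankings, golds):
        if gold in ranked:
            total += 1.0 / (list(ranked).index(gold) + 1)  # ranks are 1-based
    return total / len(golds)


# Gold answer "A" at rank 1, rank 3 and rank 2, as in the three queries above.
rankings = [["A", "B", "C"], ["B", "C", "A"], ["B", "A", "C"]]
golds = ["A", "A", "A"]

mrr = mean_reciprocal_rank(rankings, golds)
print(round(mrr, 2))  # 0.61
```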

Hop accuracy: did the graph query engine traverse the correct number of hops and follow the correct sequence of relationships?

A system with high Hit@1 but low hop accuracy answers correctly but for wrong reasons — a red flag for reliability.
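One way to make hop accuracy concrete is to compare the relation sequence the engine followed against an annotated gold path. The sketch below assumes such gold paths exist and that the engine logs its path per query; `hop_metrics` and the relation names other than WORKS_AT and SUBSIDIARY_OF are hypothetical.

```python
from typing import List, Tuple

Path = List[str]  # ordered relation names followed during traversal


def hop_metrics(gold_paths: List[Path], pred_paths: List[Path]) -> Tuple[float, float]:
    """Return (hop-count accuracy, exact path accuracy) over all queries."""
    n = len(gold_paths)
    pairs = list(zip(gold_paths, pred_paths))
    count_ok = sum(len(g) == len(p) for g, p in pairs)  # right number of hops
    path_ok = sum(g == p for g, p in pairs)  # right relations in the right order
    return count_ok / n, path_ok / n


gold_paths = [
    ["WORKS_AT", "SUBSIDIARY_OF"],
    ["FOUNDED_BY"],
    ["LOCATED_IN", "PART_OF"],
]
pred_paths = [
    ["WORKS_AT", "SUBSIDIARY_OF"],
    ["FOUNDED_BY"],
    ["LOCATED_IN", "BORDERS"],  # right hop count, wrong second relation
]

count_acc, path_acc = hop_metrics(gold_paths, pred_paths)
print(round(count_acc, 2), round(path_acc, 2))  # 1.0 0.67
```

Reporting both numbers separates "wrong depth" failures from "right depth, wrong relation" failures.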

4) Layer 3: answer quality metrics

Faithfulness: is every claim in the answer supported by the retrieved context?

┌─────────────────────────────────────────────────────────────────┐
│  Claim in answer: "Google Cloud revenue is $8B per quarter"     │
│  Supported by retrieved context? YES → faithful                 │
│                                                                 │
│  Claim: "Meta acquired Instagram for $2B in 2014"               │
│  Supported by retrieved context? NO → unfaithful                │
│   (context says $1B in 2012)                                    │
└─────────────────────────────────────────────────────────────────┘

Correctness: does the answer match the gold answer?

Both are needed. A faithful answer repeats the retrieved context — but the context might be wrong. A correct answer matches ground truth — but may be a lucky hallucination. You want both: faithful AND correct.
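The two metrics combine into four outcomes, which the sketch below spells out. It assumes the per-claim support judgments already exist (from a human reviewer or an LLM judge, neither of which is implemented here). Scoring faithfulness as the fraction of supported claims is one common convention, not the only one, and the function names are illustrative.

```python
from typing import List


def faithfulness_rate(claim_supported: List[bool]) -> float:
    """Fraction of answer claims judged as supported by the retrieved context."""
    if not claim_supported:
        return 0.0  # convention chosen for this sketch: no claims scores 0
    return sum(claim_supported) / len(claim_supported)


def diagnose(faithful: bool, correct: bool) -> str:
    """Map the faithful/correct combination to a likely failure location."""
    if faithful and correct:
        return "ok: grounded in context and matches the gold answer"
    if faithful:
        return "context is wrong: inspect retrieval and the graph"
    if correct:
        return "lucky hallucination: right answer, not grounded in context"
    return "neither grounded nor correct: start with the generation layer"


# The two claims from the box above: the first is supported, the second is not.
rate = faithfulness_rate([True, False])
print(rate)  # 0.5
print(diagnose(faithful=True, correct=False))
```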

5) Benchmarks and the dashboard you actually watch

┌─────────────────┬────────────────────────────────────────────────┐
│  Dataset        │  What it tests                                 │
├─────────────────┼────────────────────────────────────────────────┤
│  FB15k-237      │  Link prediction on Freebase; tests embedding  │
│                 │  models for triple completion                  │
├─────────────────┼────────────────────────────────────────────────┤
│  WebQSP         │  Multi-hop Q&A over Freebase; 2-hop max;       │
│                 │  tests graph traversal accuracy                │
├─────────────────┼────────────────────────────────────────────────┤
│  HotpotQA       │  Multi-hop Q&A over Wikipedia passages;        │
│                 │  tests retrieval-and-reason systems            │
├─────────────────┼────────────────────────────────────────────────┤
│  MuSiQue        │  Stricter multi-hop; answers require 2-4 hops; │
│                 │  tests chain completeness                      │
└─────────────────┴────────────────────────────────────────────────┘

Different benchmarks stress different layers. Use FB15k-237 to evaluate graph embedding models (link prediction). Use WebQSP and HotpotQA to evaluate the graph query engine (traversal and multi-hop retrieval). Use faithfulness metrics to evaluate the LLM generation layer.

Building an evaluation dashboard

Keep three separate dashboards.

┌─────────────────────────────────────────────────┐
│  Dashboard 1: Graph quality                     │
│    extraction precision, recall, F1             │
│    per-relation breakdown                       │
│    entity coverage rate                         │
├─────────────────────────────────────────────────┤
│  Dashboard 2: Retrieval quality                 │
│    Hit@1, Hit@3, MRR                            │
│    hop accuracy per query type                  │
│    multi-hop junction resolution rate           │
├─────────────────────────────────────────────────┤
│  Dashboard 3: Answer quality                    │
│    faithfulness rate                            │
│    correctness vs gold answers                  │
│    failure mode breakdown by error type         │
└─────────────────────────────────────────────────┘

When answer quality drops, you look first at Dashboard 3 (did the LLM ignore context?), then Dashboard 2 (did the graph query engine return the right subgraph?), then Dashboard 1 (is the knowledge graph itself missing the key relationship?).
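The top-down triage order can be sketched as a simple check against a baseline. The metric names, the baseline values, and the 0.02 tolerance are all assumptions made for illustration; a real dashboard would track several metrics per layer.

```python
from typing import Dict, Optional

# Inspection order from the text: answer layer, then retrieval, then graph.
# One representative metric per layer (hypothetical metric keys).
TRIAGE_ORDER = [
    ("answer", "faithfulness"),
    ("retrieval", "hit_at_1"),
    ("graph", "extraction_f1"),
]


def first_regressed_layer(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02
) -> Optional[str]:
    """Return the first layer, in triage order, whose metric dropped beyond tolerance."""
    for layer, metric in TRIAGE_ORDER:
        if baseline[metric] - current[metric] > tolerance:
            return layer
    return None  # no tracked metric regressed


baseline = {"faithfulness": 0.95, "hit_at_1": 0.80, "extraction_f1": 0.85}
current = {"faithfulness": 0.94, "hit_at_1": 0.65, "extraction_f1": 0.84}

print(first_regressed_layer(baseline, current))  # retrieval
```

Here faithfulness is within tolerance, so the check moves on and flags retrieval before it ever looks at the graph.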

Where this lives in the wild

Microsoft GraphRAG evaluation suite — faithfulness, recall@K, and community summary quality are tracked per query type on enterprise document corpora.
Wikidata data quality team — precision/recall of triple extraction from Wikipedia edits is monitored continuously; anomaly alerts trigger human review.
Google's Research Evaluation (KGQA benchmark) — internal evaluation on Freebase drives decisions about embedding model updates vs. coverage improvements.
Diffbot's data quality dashboard — extraction F1 per entity type is tracked against monthly human spot-check samples to maintain commercial SLAs.
Enterprise AI teams using Ragas (an open-source evaluation library, often run alongside LangChain): faithfulness and correctness metrics evaluate Graph RAG pipelines before production deployment.

Pause and recall

In the F1 example, what does 0.67 tell you about the state of the knowledge graph?
Why is a system with high Hit@1 but low hop accuracy a reliability risk?
What is the difference between faithfulness and correctness?
Which benchmark tests link prediction on embedding models: FB15k-237 or HotpotQA?

Interview Q&A

Q: Why evaluate graph quality and answer quality as separate layers? A: A correct answer on a bad graph means you got lucky. A wrong answer on a good graph means retrieval or generation failed — not the graph. Only separating layers tells you which component to fix.

Common wrong answer to avoid: "Users only care about the final answer" — operational reliability requires knowing why each failure occurred, not just that it occurred.

Q: Why is faithfulness distinct from correctness in RAG evaluation? A: Faithfulness measures whether the answer is supported by the retrieved context; it can be 1.0 even if that context is factually wrong. Correctness measures whether the final answer matches the truth. A faithful wrong answer points to bad context (retrieval or the graph itself); an unfaithful correct answer points to LLM hallucination.

Common wrong answer to avoid: "If the answer is correct, faithfulness doesn't matter" — an unfaithful-but-correct answer will fail on the next similar query when context differs.

Q: Why include hop accuracy alongside Hit@K in retrieval evaluation? A: Hit@K tells you the answer was retrieved. Hop accuracy tells you the graph query engine followed the correct relationship chain. A system can guess the right answer by a wrong path — it will fail on structurally similar questions.

Common wrong answer to avoid: "Hit@K is sufficient for multi-hop systems" — correctness of path matters for robustness, not just for the current answer.

Q: Why should per-relation precision be tracked separately from overall precision? A: Overall precision averages over relation types. A high-precision common relation (WORKS_AT) can mask a low-precision rare relation (SUBSIDIARY_OF). The rare but important relations are exactly what break complex multi-hop queries.

Common wrong answer to avoid: "Overall F1 is sufficient for monitoring" — silent failures on rare relations are the hardest bugs to find without per-relation tracking.
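The masking effect described in the last answer is easy to show with synthetic numbers. In the sketch below the counts are invented for illustration: 90 WORKS_AT predictions with 87 correct, and 10 SUBSIDIARY_OF predictions with 4 correct. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def per_relation_precision(gold: Set[Triple], predicted: Set[Triple]) -> Dict[str, float]:
    """Precision of predicted triples, grouped by relation type."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for triple in predicted:
        relation = triple[1]
        total[relation] += 1
        if triple in gold:
            correct[relation] += 1
    return {rel: correct[rel] / total[rel] for rel in total}


# Synthetic data: the common relation is accurate, the rare one is not.
gold = {(f"p{i}", "WORKS_AT", f"c{i}") for i in range(87)} | {
    (f"s{i}", "SUBSIDIARY_OF", f"parent{i}") for i in range(4)
}
predicted = {(f"p{i}", "WORKS_AT", f"c{i}") for i in range(90)} | {
    (f"s{i}", "SUBSIDIARY_OF", f"parent{i}") for i in range(10)
}

overall = len(gold & predicted) / len(predicted)
by_relation = per_relation_precision(gold, predicted)
print(round(overall, 2))  # 0.91
print(round(by_relation["WORKS_AT"], 2), round(by_relation["SUBSIDIARY_OF"], 2))  # 0.97 0.4
```

An overall precision of 0.91 looks healthy, yet more than half of the SUBSIDIARY_OF edges are wrong, which is the kind of gap a per-relation breakdown is meant to expose.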

Apply now (5 min)

Exercise. Take a gold set of 10 triples. Simulate an extractor that predicts 8 triples, 6 of which are correct. Compute precision, recall, and F1 by hand. Then design one faithfulness question that would fail even if F1 is high.

Sketch from memory. Draw the three evaluation layers as a vertical stack. Label the primary metric for each layer. Draw an arrow from "answer quality drops" through the diagnostic path down to the most likely root cause.

Bridge. We can evaluate all three layers. But honest engineering also means admitting what graph systems still can't do well. The final file faces those limits directly. → 13-honest-admission.md