12. Memory Evaluation — Measure recall, not vibes¶
~18 min read. It is easy for a memory system to feel useful, but production quality needs evidence, not anecdotes.
Built on the ELI5 in 00-eli5.md. The librarian — the selector of memories — must be evaluated, otherwise the desk-note may fill with convincing nonsense.
1) What we are actually evaluating¶
See the stack.
memory quality
├── was the right memory stored?
├── was it retrieved when needed?
├── was it injected clearly?
└── did the answer improve?
If you only check the final answer, you may miss where the failure happened. Storage may be wrong.
Retrieval may be wrong. Prompt packing may be wrong.
Or the model may ignore good memory. So memory evaluation should be layered.
Simple, no? We need both offline and online checks.
Offline tells us component quality. Online tells us user impact.
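The layered idea above can be sketched as a small diagnostic helper. This is a minimal illustration, not a prescribed design: the `LayerChecks` class, its field names, and `first_failing_layer` are hypothetical, and it assumes each layer has already been judged pass/fail for a given eval case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerChecks:
    """Pass/fail outcome per layer for one eval case (hypothetical names)."""
    stored: bool     # was the right memory stored?
    retrieved: bool  # was it retrieved when needed?
    injected: bool   # was it injected clearly into the prompt?
    improved: bool   # did the answer improve?


def first_failing_layer(checks: LayerChecks) -> Optional[str]:
    """Return the earliest layer that failed, or None if every layer passed."""
    for name in ("stored", "retrieved", "injected", "improved"):
        if not getattr(checks, name):
            return name
    return None


# The memory was stored correctly but never retrieved.
case = LayerChecks(stored=True, retrieved=False, injected=False, improved=False)
failing = first_failing_layer(case)  # "retrieved"
```

Checking layers in order matters: a retrieval failure makes the later layers fail too, so only the earliest failure localizes the problem.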
2) Core metrics¶
Useful metrics include:
- recall@k for retrieval
- precision@k for retrieval
- memory hit rate
- answer improvement rate
- contradiction rate
- stale-memory rate
Picture before formulas.
question ──→ relevant memories exist? ── yes/no
│
├── retrieved? ───────────────→ recall view
├── useful among retrieved? ──→ precision view
└── helped final answer? ─────→ product view
If 10 relevant memories exist and you retrieved 7 of them, recall = 7/10 = 0.70. If you retrieved 10 memories and 6 were useful,
precision = 6/10 = 0.60. Both matter.
High recall with terrible precision floods the desk-note. High precision with low recall misses important context.
That is why one metric never tells the full story.
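The two retrieval metrics can be written in a few lines. This is a sketch under stated assumptions: memory ids are unique strings, `retrieved` is ranked best-first, and precision is divided by the number of items actually returned in the top k (some teams divide by k itself). The function names and ids are illustrative only.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant memories that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for memory_id in top_k if memory_id in relevant)
    return hits / len(top_k)


relevant = {"m1", "m2", "m3", "m4"}
retrieved = ["m1", "x1", "m2", "x2", "m3"]  # ranked, best first

r = recall_at_k(retrieved, relevant, k=5)     # 3 of 4 relevant found -> 0.75
p = precision_at_k(retrieved, relevant, k=5)  # 3 of 5 retrieved useful -> 0.6
```

Here recall looks acceptable while precision shows that two of five injected memories are noise, which is the trade-off described above.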
3) How to build evaluation datasets¶
A good eval set contains questions that truly need memory.
Not generic knowledge questions. Examples:
- "What tone does this user prefer?"
- "What happened in the previous outage attempt?"
- "Which account restriction applies here?"
Each eval case should include:
- the current query
- the gold relevant memory items
- the desired answer behaviour
- negative memories that should not be retrieved
Without those negatives, your eval may reward noisy retrieval. Now what is the problem?
Gold memory sets are costly to label. Yes.
Still, without them, you are mostly measuring vibes.
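One possible shape for an eval case, covering the four fields listed above. The class name, field names, and memory ids are hypothetical; a real schema would depend on how your memory store identifies items.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MemoryEvalCase:
    """A single labelled eval case (all names here are illustrative)."""
    query: str
    gold_ids: Set[str]                                    # memories that should be retrieved
    forbidden_ids: Set[str] = field(default_factory=set)  # negatives that should not be
    desired_behaviour: str = ""                           # what a good answer does


def score_case(case: MemoryEvalCase, retrieved_ids: List[str]) -> Dict[str, List[str]]:
    """Compare one retrieval result against the labels of one case."""
    got = set(retrieved_ids)
    return {
        "gold_found": sorted(got & case.gold_ids),
        "gold_missed": sorted(case.gold_ids - got),
        "forbidden_retrieved": sorted(got & case.forbidden_ids),
    }


case = MemoryEvalCase(
    query="What tone does this user prefer?",
    gold_ids={"pref_tone_user_a"},
    forbidden_ids={"pref_tone_user_b"},
    desired_behaviour="answers in the user's preferred tone",
)
result = score_case(case, ["pref_tone_user_a", "pref_tone_user_b"])
```

In this example the gold memory is found, but `forbidden_retrieved` is non-empty, so a recall-only score would hide the fact that another user's preference also reached the prompt.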
4) Worked example: retrieval eval for preferences¶
Suppose we test 5 user queries. For each query, there are 2 gold memories.
So total gold memories = 10. The system retrieves 12 memories overall.
Among them, 8 are gold. Recall = 8/10 = 0.80.
Precision = 8/12 = 0.67. Now inspect failure cases.
Two missed memories were old but still relevant. So recency was over-weighted.
Four retrieved memories were semantically close but useless. So the librarian needs stronger task filtering.
See. Metrics gave the signal.
Inspection gave the reason. That pair is important.
Numbers alone are not enough.
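The arithmetic in this worked example can be reproduced with pooled (micro-averaged) counts across the five queries. The per-query data below is invented to match the stated totals (10 gold, 12 retrieved, 8 hits); the ids are placeholders, not real memories.

```python
# Each query maps to (gold memory ids, retrieved memory ids).
cases = {
    "q1": ({"a1", "a2"}, ["a1", "a2", "n1"]),
    "q2": ({"b1", "b2"}, ["b1", "b2"]),
    "q3": ({"c1", "c2"}, ["c1", "n2", "n3"]),
    "q4": ({"d1", "d2"}, ["d1", "d2", "n4"]),
    "q5": ({"e1", "e2"}, ["e1"]),
}

hits = total_gold = total_retrieved = 0
missed, noise = [], []

for gold, retrieved in cases.values():
    got = set(retrieved)
    hits += len(got & gold)
    total_gold += len(gold)
    total_retrieved += len(retrieved)
    missed += sorted(gold - got)  # gold memories the system failed to retrieve
    noise += sorted(got - gold)   # retrieved memories that were not gold

recall = hits / total_gold          # 8 / 10 = 0.80
precision = hits / total_retrieved  # 8 / 12, about 0.67
```

The `missed` and `noise` lists are the inspection half of the pair: the numbers say something is off, and reading those specific memories is what reveals causes such as over-weighted recency.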
5) Product-level evaluation¶
Memory systems ultimately serve user outcomes.
So also measure:
- fewer repeated questions
- lower task completion time
- fewer contradiction complaints
- higher user satisfaction on continuity
- lower rate of privacy mistakes
A/B testing helps. Task-based human review helps.
Red-team tests help too. The cleanup-bell should also be evaluated.
Did expired memories still influence output? Did deleted profile facts reappear?
That is part of memory quality. Not a side issue.
Look. Good evaluation makes memory boring.
Boring is what production needs.
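A cleanup check of the kind described above might look like the following sketch. The `Memory` record and its `expires_at` and `deleted` fields are assumptions made for illustration; the check simply flags any retrieved memory that should no longer be usable at evaluation time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Memory:
    """Hypothetical memory record with lifecycle fields."""
    id: str
    text: str
    expires_at: Optional[datetime] = None  # None means it never expires
    deleted: bool = False


def leaked_memories(retrieved: List[Memory], now: datetime) -> List[str]:
    """Ids of retrieved memories that are deleted or already expired."""
    return [
        m.id
        for m in retrieved
        if m.deleted or (m.expires_at is not None and m.expires_at <= now)
    ]


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
retrieved = [
    Memory("m1", "prefers short answers"),
    Memory("m2", "temporary shipping address",
           expires_at=datetime(2025, 1, 1, tzinfo=timezone.utc)),
    Memory("m3", "old phone number", deleted=True),
]
leaks = leaked_memories(retrieved, now)  # ["m2", "m3"]
```

An eval run could treat any non-empty result as a failure for that case, regardless of how helpful the final answer looked.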
Where this lives in the wild¶
- OpenAI and Anthropic product teams: applied scientists evaluate whether memory retrieval actually improves continuity and usefulness across sessions.
- Intercom AI support teams: product analysts can measure whether agents repeat fewer questions after adding case memory.
- GitHub Copilot agent teams: engineers need task-success and contradiction metrics when memory influences coding workflows.
- Enterprise search assistants: ML engineers evaluate recall and precision of retrieved memories before prompt injection.
- Healthcare assistant teams: safety reviewers need deletion and stale-memory tests, not only helpfulness scores.
Pause and recall¶
- Why should memory evaluation be layered instead of only answer-level?
- What does recall@k tell you that precision@k does not?
- Why should eval cases include forbidden memory ids?
- In the worked example, what tuning mistake did the missed memories reveal?
Interview Q&A¶
Q: Why evaluate retrieval separately from final answer quality? A: A bad answer can come from bad storage, retrieval, injection, or model usage. Separate evaluation localizes the failure.
Common wrong answer to avoid: "Because retrieval metrics are easier to compute" — convenience is not the main reason; diagnosability is.
Q: Why are recall and precision both necessary for memory evaluation? A: Recall measures what you missed. Precision measures how much noise you added. Good systems need both.
Common wrong answer to avoid: "Because one is for offline and one is for online" — both can be measured offline; they capture different failure modes.
Q: Why include negative or forbidden memories in the eval set? A: Retrieval quality is not only about finding some relevant item. It is also about excluding tempting but wrong context.
Common wrong answer to avoid: "Only to make the benchmark harder" — the point is realism and precision measurement.
Q: Why test deletion and expiry in memory evals? A: A memory system that recalls deleted or expired data is failing, even if helpfulness looks high.
Common wrong answer to avoid: "Because compliance teams ask for it"; it is a correctness issue as much as a policy issue.
Apply now (5 min)¶
Exercise: Write three eval cases for a memory-enabled agent. For each, list the query, the gold memory, one forbidden memory, and the expected answer trait.
Sketch from memory: Draw the evaluation stack from storage quality to retrieval quality to injection quality to final answer impact.
Bridge. We can now measure memory systems honestly. Good. The last step is to admit what still remains uncertain, unsafe, and genuinely unresolved. → 13-honest-admission.md