09. Memory Retrieval Patterns — Ask the cabinet at the right time¶
~18 min read. Stored memory has zero value until retrieval policy decides what comes back into the prompt.
Built on the ELI5 in 00-eli5.md. The librarian — the selector of useful notes — decides whether the desk-note gets the right memory or distracting junk.
1) Retrieval is a trigger problem first¶
Look. You do not fetch all memory on every turn. That would flood the desk-note. So the first question is,
"Should retrieval happen now at all?"
new user turn
│
├── fully answerable from desk-note? ─ yes ─→ skip retrieval
│
├── missing stable fact? ───────────── yes ─→ address-book lookup
│
├── missing prior event? ───────────── yes ─→ diary-page lookup
│
├── missing related knowledge? ─────── yes ─→ filing-cabinet lookup
│
└── memory not needed ──────────────── yes ─→ answer directly
If retrieval fires on every turn, precision drops. The model sees extra noise. So what to do? Use lightweight classifiers or rules.
Detect references like:
- "as usual"
- "last time"
- "my preference"
- "same customer"
- "continue that plan"
These are retrieval hints.
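A rule-based trigger can be sketched in a few lines. This is a minimal sketch, not a production classifier: the `CUE_RULES` table, the store names, and the mapping from each cue phrase to a store are all assumptions made for illustration.

```python
import re

# Hypothetical rule table: cue phrase pattern -> store it hints at.
# The phrase-to-store mapping is an assumption for illustration only.
CUE_RULES = [
    (re.compile(r"\b(as usual|my preference)\b", re.I), "address_book"),
    (re.compile(r"\b(last time|continue that plan)\b", re.I), "diary_page"),
    (re.compile(r"\bsame customer\b", re.I), "address_book"),
]


def retrieval_triggers(user_turn: str) -> list[str]:
    """Return the stores to query; an empty list means skip retrieval."""
    stores: list[str] = []
    for pattern, store in CUE_RULES:
        if pattern.search(user_turn) and store not in stores:
            stores.append(store)
    return stores


print(retrieval_triggers("Book it as usual, like last time"))
print(retrieval_triggers("What is 2 + 2?"))
```

An empty result corresponds to the "skip retrieval" branch of the tree. A learned classifier could replace the rule table without changing the calling code.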
2) Relevance, recency, and importance¶
Once retrieval triggers, ranking begins.
A simple mental model uses three signals: relevance, recency, and importance.
Picture the trade-off first.
high relevance, old      low relevance, fresh
┌────────────────────┐   ┌────────────────────┐
│ maybe retrieve     │   │ probably ignore    │
└────────────────────┘   └────────────────────┘
high relevance, fresh    high importance, medium match
┌────────────────────┐   ┌────────────────────┐
│ strong candidate   │   │ may still retrieve │
└────────────────────┘   └────────────────────┘
score = 0.5×relevance + 0.3×recency + 0.2×importance
Memory A (relevance 0.9, recency 0.2, importance 0.8): 0.45 + 0.06 + 0.16 = 0.67
Memory B (relevance 0.7, recency 0.9, importance 0.5): 0.35 + 0.27 + 0.10 = 0.72
Memory C (relevance 0.6, recency 0.4, importance 1.0): 0.30 + 0.12 + 0.20 = 0.62
So B wins. See. Freshness helped B beat A. But importance kept C competitive.
The exact weights depend on product needs. The lesson is deeper. Nearest meaning alone is not enough.
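The same arithmetic can be checked in code. This is a small sketch using the weights and the three memories from the example. The `WEIGHTS` dictionary and the `score` function are names invented here.

```python
# Weights from the worked example: relevance, recency, importance.
WEIGHTS = {"relevance": 0.5, "recency": 0.3, "importance": 0.2}


def score(memory: dict[str, float]) -> float:
    """Weighted sum of the three signals, each expected in [0, 1]."""
    return sum(weight * memory[signal] for signal, weight in WEIGHTS.items())


memories = {
    "A": {"relevance": 0.9, "recency": 0.2, "importance": 0.8},
    "B": {"relevance": 0.7, "recency": 0.9, "importance": 0.5},
    "C": {"relevance": 0.6, "recency": 0.4, "importance": 1.0},
}

# Highest score first.
ranked = sorted(memories, key=lambda name: score(memories[name]), reverse=True)
for name in ranked:
    print(name, round(score(memories[name]), 2))
```

Changing the weights changes the winner. With relevance weighted at 0.8 and the other two at 0.1 each, A would outrank B.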
3) Hybrid retrieval patterns¶
Production systems often combine retrieval modes.
For example:
- exact filter by user or org
- semantic search over filing-cabinet chunks
- rule lookup for profile fields
- time-window lookup for diary-pages
- rerank before prompt injection
This hybrid approach is strong. The librarian uses the right tool for the right store.
Do not force vector search onto everything. A permission bit should be exact. A last-meeting event should be time-aware. A similar prior bug may be semantic.
Simple, no? Memory architecture improves when retrieval respects data type.
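One way to picture per-store retrieval is a small dispatcher with one lookup function per store. This is a hedged sketch: the `Memory` record, the store names, and every function here are hypothetical. Word overlap stands in for real embedding similarity, and the final step only truncates to a budget where a real system would rerank.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    store: str  # "address_book", "diary_page", or "filing_cabinet"
    user_id: str
    text: str
    timestamp: datetime


def exact_lookup(memories, user_id):
    """Address-book: exact match on user, no fuzzy scoring."""
    return [m for m in memories
            if m.store == "address_book" and m.user_id == user_id]


def time_window_lookup(memories, user_id, now, window):
    """Diary-pages: only episodes inside the recent time window."""
    return [m for m in memories
            if m.store == "diary_page" and m.user_id == user_id
            and now - m.timestamp <= window]


def semantic_lookup(memories, user_id, query):
    """Filing-cabinet: word overlap as a stand-in for vector similarity."""
    query_words = set(query.lower().split())
    hits = []
    for m in memories:
        if m.store == "filing_cabinet" and m.user_id == user_id:
            overlap = len(query_words & set(m.text.lower().split()))
            if overlap > 0:
                hits.append((overlap, m))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in hits]


def hybrid_retrieve(memories, user_id, query, now,
                    window=timedelta(days=7), budget=3):
    """Merge per-store candidates, then cut to the prompt budget."""
    candidates = (exact_lookup(memories, user_id)
                  + time_window_lookup(memories, user_id, now, window)
                  + semantic_lookup(memories, user_id, query))
    # A real reranker would reorder candidates here; this only truncates.
    return candidates[:budget]
```

Note that the user filter is applied exactly in every store, including the semantic one. Similarity never decides who is allowed to see a memory.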
4) Worked example: "continue the proposal"¶
A user says, "Continue the proposal we were drafting for the Northstar deal."
The librarian sees several needs.
Address-book lookup:
- user likes concise bullets
Diary-page lookup:
- last session ended after pricing section draft
Filing-cabinet search:
- retrieved chunk about Northstar's compliance requirements
Recent desk-note:
- current ask is to continue the proposal
The final memory pack might include:
- concise bullet preference
- last episode: pricing section completed, risks pending
- compliance chunk for Northstar
Not included:
- unrelated dinner preference
- old bug discussion about invoices
- stale brainstorm from a different account
See the principle. Retrieval is selective assembly. Not blind replay.
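The assembly step for this example can be sketched as a filter followed by a render. Everything here is illustrative: the hand-written `tags` on each candidate and the `task_tags` set are assumptions standing in for whatever relevance signal a real librarian would compute.

```python
# Candidates from the example, with hypothetical hand-labeled topic tags.
candidates = [
    {"store": "address_book", "text": "concise bullet preference",
     "tags": {"writing"}},
    {"store": "diary_page",
     "text": "last episode: pricing section completed, risks pending",
     "tags": {"northstar", "proposal"}},
    {"store": "filing_cabinet", "text": "compliance chunk for Northstar",
     "tags": {"northstar"}},
    {"store": "address_book", "text": "unrelated dinner preference",
     "tags": {"food"}},
    {"store": "diary_page", "text": "old bug discussion about invoices",
     "tags": {"invoices"}},
    {"store": "filing_cabinet",
     "text": "stale brainstorm from a different account",
     "tags": {"other_account"}},
]


def build_memory_pack(candidates, task_tags):
    """Keep only candidates that share at least one tag with the task."""
    return [c for c in candidates if c["tags"] & task_tags]


def render_pack(pack):
    """Format the pack as a short block for the desk-note."""
    lines = ["Relevant memory:"]
    lines += [f"- [{c['store']}] {c['text']}" for c in pack]
    return "\n".join(lines)


# Assumed tags for "continue the proposal for the Northstar deal".
task_tags = {"northstar", "proposal", "writing"}
pack = build_memory_pack(candidates, task_tags)
print(render_pack(pack))
```

The three excluded items never reach the prompt. The render step keeps each kept item to one labeled line, so the desk-note stays short.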
5) Common mistakes¶
Mistake one: always retrieve top-k. That ignores query type.
Mistake two: never retrieve unless asked explicitly. Then subtle continuity cues are lost.
Mistake three: inject raw retrieved memory without compression. The desk-note gets noisy.
Mistake four: rank without feedback. You never learn what helped.
Mistake five: use one score for every memory type. The address-book, diary-page, and filing-cabinet behave differently.
So what to do? Have per-store retrieval logic. Rerank after initial candidate generation. Measure whether retrieved items were actually used. And let the cleanup-bell reduce low-value candidates over time.
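Usage feedback can start as two counters per memory. This is a minimal sketch with a hypothetical `UsageTracker` class. How `was_used` gets decided (for instance, whether the answer drew on the memory) is left as an assumption outside the sketch, and the thresholds are arbitrary.

```python
class UsageTracker:
    """Counts how often each memory was injected and how often it helped."""

    def __init__(self) -> None:
        self.injected: dict[str, int] = {}
        self.used: dict[str, int] = {}

    def record(self, memory_id: str, was_used: bool) -> None:
        """Log one injection; was_used comes from an external judgement."""
        self.injected[memory_id] = self.injected.get(memory_id, 0) + 1
        if was_used:
            self.used[memory_id] = self.used.get(memory_id, 0) + 1

    def use_rate(self, memory_id: str) -> float:
        """Fraction of injections where the memory was actually used."""
        shown = self.injected.get(memory_id, 0)
        return self.used.get(memory_id, 0) / shown if shown else 0.0

    def low_value(self, min_injections: int = 3,
                  max_rate: float = 0.2) -> list[str]:
        """Memories injected often enough to judge, but rarely used."""
        return sorted(
            memory_id
            for memory_id, shown in self.injected.items()
            if shown >= min_injections and self.use_rate(memory_id) <= max_rate
        )
```

The `low_value` list is the kind of signal the cleanup-bell could consume. The `min_injections` floor avoids judging a memory on a single unlucky turn.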
Where this lives in the wild¶
- Gmail Smart Compose assistants: a knowledge worker needs profile lookup and recent thread state, not brute-force retrieval from every old email.
- ChatGPT Memory: OpenAI selectively injects durable preferences when the current query benefits from them.
- Intercom Fin: a support rep needs ticket event memory, customer profile data, and semantic retrieval over help content in one response flow.
- Salesforce Einstein Copilot: a seller blends CRM facts, last-meeting events, and relevant account documents when preparing next-step suggestions.
- GitHub Copilot coding agents: an engineer needs recent tool traces, repo facts, and semantically similar prior fixes without flooding the prompt.
Pause and recall¶
- Why is retrieval a trigger problem before it is a ranking problem?
- What three signals were combined in the worked scoring example?
- Why should a permission bit not be fetched with plain semantic search?
- In the proposal example, which memories were correctly excluded?
Interview Q&A¶
Q: Why use hybrid retrieval instead of one universal retriever?
A: Different memory types have different shapes and guarantees. Exact profile fields, timed episodes, and semantic documents need different lookup methods.
Common wrong answer to avoid: "Because hybrid systems are more accurate by default." They help because memory types require different retrieval semantics.
Q: Why can aggressive retrieval reduce answer quality?
A: Extra memory increases prompt noise and can bias the model toward irrelevant details. More memory is not the same as better context.
Common wrong answer to avoid: "Because models dislike long prompts." The deeper issue is relevance competition, not prompt length alone.
Q: Why rerank after initial candidate generation?
A: Initial search is optimized for recall. Reranking applies task-specific judgement, policy, and prompt-budget constraints.
Common wrong answer to avoid: "Only because vector search is approximate." Even exact search still needs task-aware final selection.
Q: Why measure whether retrieved memory was used?
A: Retrieval quality is not only about matching. You need to know whether the injected memory improved the answer or just consumed space.
Common wrong answer to avoid: "Because unused memory is expensive." Cost matters, but utility measurement is the main reason.
Apply now (5 min)¶
Exercise: Take one user query. List three retrieval triggers it should activate. Then score three candidate memories using relevance, recency, and importance. Choose the final two to inject.
Sketch from memory: Draw the hybrid retrieval flow from query to exact lookup, diary lookup, semantic search, merge, and rerank.
Bridge. Retrieval decides what comes back into the prompt. Fine. But every memory system also needs a way to delete, decay, and forget. → 10-forgetting-strategies.md