
09. Memory Retrieval Patterns — Ask the cabinet at the right time

~18 min read. Stored memory has zero value until retrieval policy decides what comes back into the prompt.

Built on the ELI5 in 00-eli5.md. The librarian — the selector of useful notes — decides whether the desk-note gets the right memory or distracting junk.


1) Retrieval is a trigger problem first

Look. You do not fetch all memory on every turn. That would flood the desk-note. So the first question is,

"Should retrieval happen now at all?"

new user turn
      ├── fully answerable from desk-note? ─ yes ─→ skip retrieval
      ├── missing stable fact? ───────────── yes ─→ address-book lookup
      ├── missing prior event? ───────────── yes ─→ diary-page lookup
      ├── missing related knowledge? ─────── yes ─→ filing-cabinet lookup
      └── memory not needed ──────────────── yes ─→ answer directly

This trigger step saves cost. It also protects relevance. If the librarian fetches constantly, precision drops. The model sees extra noise. So what to do? Use lightweight classifiers or rules.

Detect references like:

  • "as usual"
  • "last time"
  • "my preference"
  • "same customer"
  • "continue that plan"

These are retrieval hints.
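
The rule-based trigger can be sketched in a few lines. This is a minimal illustration, not a production classifier: the mapping from each cue phrase to a store is an assumption made for the example, and `TRIGGER_RULES` and `retrieval_triggers` are hypothetical names.

```python
import re

# Hypothetical cue phrases per store. The phrase-to-store mapping is an
# assumption for illustration; a real system would tune or learn it.
TRIGGER_RULES = {
    "address_book": [r"\bas usual\b", r"\bmy preference\b", r"\bsame customer\b"],
    "diary_page": [r"\blast time\b", r"\bcontinue that plan\b"],
}

def retrieval_triggers(user_turn: str) -> list[str]:
    """Return the stores whose cue phrases appear in the turn.

    An empty list means: skip retrieval and answer from the desk-note.
    """
    text = user_turn.lower()
    return [
        store
        for store, patterns in TRIGGER_RULES.items()
        if any(re.search(pattern, text) for pattern in patterns)
    ]

print(retrieval_triggers("Book the same hotel as usual"))   # ['address_book']
print(retrieval_triggers("What did we decide last time?"))  # ['diary_page']
print(retrieval_triggers("What is 2 + 2?"))                 # []
```

Rules like these are cheap and easy to audit. A lightweight classifier could replace the regex list once there is enough labeled traffic to train one.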

2) Relevance, recency, and importance

Once retrieval triggers, ranking begins.

A simple mental model is three signals.

final score = relevance + recency + importance

Picture the trade-off first.

high relevance, old           low relevance, fresh
┌──────────────────┐          ┌──────────────────┐
│ maybe retrieve   │          │ probably ignore  │
└──────────────────┘          └──────────────────┘

high relevance, fresh         high importance, medium match
┌──────────────────┐          ┌──────────────────┐
│ strong candidate │          │ may still retrieve│
└──────────────────┘          └──────────────────┘

Now a simple worked score. Suppose we compute:

score = 0.5×relevance + 0.3×recency + 0.2×importance

Each memory is listed as (relevance, recency, importance):

  • Memory A: 0.9, 0.2, 0.8 gives 0.45 + 0.06 + 0.16 = 0.67
  • Memory B: 0.7, 0.9, 0.5 gives 0.35 + 0.27 + 0.10 = 0.72
  • Memory C: 0.6, 0.4, 1.0 gives 0.30 + 0.12 + 0.20 = 0.62

So B wins. See. Freshness helped B beat A. But importance kept C competitive.

The exact weights depend on product needs. The lesson is deeper. Nearest meaning alone is not enough.
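
The worked score can be reproduced in a few lines. This sketch uses the weights and signal values from the example; the names `WEIGHTS`, `score`, and `memories` are illustrative only.

```python
# Weights from the worked example; tune them per product.
WEIGHTS = {"relevance": 0.5, "recency": 0.3, "importance": 0.2}

def score(signals: dict[str, float]) -> float:
    """Weighted sum of the three ranking signals."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

memories = {
    "A": {"relevance": 0.9, "recency": 0.2, "importance": 0.8},
    "B": {"relevance": 0.7, "recency": 0.9, "importance": 0.5},
    "C": {"relevance": 0.6, "recency": 0.4, "importance": 1.0},
}

# Highest score first.
ranked = sorted(memories, key=lambda name: score(memories[name]), reverse=True)
for name in ranked:
    print(name, round(score(memories[name]), 2))
# B 0.72
# A 0.67
# C 0.62
```

Changing the weights changes the winner. With importance weighted at 0.5 and relevance at 0.2, for instance, C would rank first.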


3) Hybrid retrieval patterns

Production systems often combine retrieval modes.

For example:

  • exact filter by user or org
  • semantic search over filing-cabinet chunks
  • rule lookup for profile fields
  • time-window lookup for diary-pages
  • rerank before prompt injection

query
 ├── address-book exact lookup
 ├── diary-page time lookup
 ├── filing-cabinet semantic search
 └── merge + rerank ──→ final memory pack

This hybrid approach is strong. The librarian uses the right tool for the right store.

Do not force vector search onto everything. A permission bit should be exact. A last-meeting event should be time-aware. A similar prior bug may be semantic.

Simple, no? Memory architecture improves when retrieval respects data type.
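
One way to picture per-store retrieval is the sketch below. It makes several assumptions: the three stores are tiny in-memory structures, the fixed scores of 1.0 and 0.8 for exact and time-based hits are placeholders, and word overlap stands in for real embedding search. All names and data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    store: str
    text: str
    score: float

# Toy in-memory stores (illustrative data only).
ADDRESS_BOOK = {("u1", "can_export"): "can_export=true"}
DIARY = [(3, "kickoff meeting held"), (9, "last meeting: agreed to ship export fix")]
CABINET = ["export job times out on large files", "login page css glitch"]

def exact_lookup(user_id: str, field: str) -> list[Candidate]:
    """Address-book: exact key match, no fuzziness."""
    value = ADDRESS_BOOK.get((user_id, field))
    return [Candidate("address-book", value, 1.0)] if value else []

def time_lookup(since_day: int) -> list[Candidate]:
    """Diary-page: keep events inside the time window."""
    return [Candidate("diary-page", event, 0.8) for day, event in DIARY if day >= since_day]

def semantic_search(query: str) -> list[Candidate]:
    """Filing-cabinet: word overlap (Jaccard) as a stand-in for embeddings."""
    query_words = set(query.lower().split())
    hits = []
    for chunk in CABINET:
        chunk_words = set(chunk.lower().split())
        overlap = len(query_words & chunk_words) / len(query_words | chunk_words)
        if overlap > 0:
            hits.append(Candidate("filing-cabinet", chunk, overlap))
    return hits

def memory_pack(query: str, user_id: str, since_day: int, budget: int = 3) -> list[Candidate]:
    """Merge candidates from every store, rerank by score, cut to the prompt budget."""
    candidates = exact_lookup(user_id, "can_export") + time_lookup(since_day) + semantic_search(query)
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:budget]

pack = memory_pack("export times out again", "u1", since_day=7)
for item in pack:
    print(item.store, "|", item.text)
```

The point of the design is that each store keeps its own lookup function. Only the merge step sees candidates from all three, so the rerank policy can change without touching how a permission bit or a diary event is fetched.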


4) Worked example: "continue the proposal"

A user says, "Continue the proposal we were drafting for the Northstar deal."

The librarian sees several needs.

Address-book lookup:

  • user likes concise bullets

Diary-page lookup:

  • last session ended after pricing section draft

Filing-cabinet search:

  • retrieved chunk about Northstar's compliance requirements

Recent desk-note:

  • current ask is to continue the proposal

The final memory pack might include:

  • concise bullet preference
  • last episode: pricing section completed, risks pending
  • compliance chunk for Northstar

Not included:

  • unrelated dinner preference
  • old bug discussion about invoices
  • stale brainstorm from a different account

See the principle. Retrieval is selective assembly. Not blind replay.
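
The include and exclude decision above can be sketched as a scope filter. The `scope` tags are an assumption added for illustration; the lesson does not say how each memory is labeled, and a real system might use account IDs or a reranker score instead.

```python
# Candidate memories from the Northstar example, each with a hypothetical scope tag.
candidates = [
    {"text": "User prefers concise bullets", "scope": "user"},
    {"text": "Last episode: pricing section completed, risks pending", "scope": "northstar"},
    {"text": "Northstar compliance requirements", "scope": "northstar"},
    {"text": "Dinner preference", "scope": "personal"},
    {"text": "Old bug discussion about invoices", "scope": "billing"},
    {"text": "Brainstorm from a different account", "scope": "other-account"},
]

def assemble_pack(items: list[dict], allowed_scopes: set[str]) -> list[str]:
    """Keep only memories whose scope matches the current task."""
    return [item["text"] for item in items if item["scope"] in allowed_scopes]

pack = assemble_pack(candidates, {"user", "northstar"})

# Render the pack as a compact block for the desk-note.
prompt_block = "Relevant memory:\n" + "\n".join(f"- {line}" for line in pack)
print(prompt_block)
```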


5) Common mistakes

  • Mistake one: always retrieve top-k. That ignores query type.
  • Mistake two: never retrieve unless asked explicitly. Then subtle continuity cues are lost.
  • Mistake three: inject raw retrieved memory without compression. The desk-note gets noisy.
  • Mistake four: rank without feedback. You never learn what helped.
  • Mistake five: use one score for every memory type. The address-book, diary-page, and filing-cabinet behave differently.

So what to do?

  • Have per-store retrieval logic.
  • Rerank after initial candidate generation.
  • Measure whether retrieved items were actually used.
  • Let the cleanup-bell reduce low-value candidates over time.
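
Usage measurement can start as simple counters. The sketch below assumes some downstream signal already says whether an injected memory was used (for example, a judge model or a citation check); how that signal is produced is outside this example. `UsageTracker` and its thresholds are hypothetical.

```python
from collections import defaultdict

class UsageTracker:
    """Count how often each memory is retrieved and how often it is actually used."""

    def __init__(self) -> None:
        self.retrieved = defaultdict(int)
        self.used = defaultdict(int)

    def record(self, memory_id: str, was_used: bool) -> None:
        self.retrieved[memory_id] += 1
        if was_used:
            self.used[memory_id] += 1

    def usage_rate(self, memory_id: str) -> float:
        count = self.retrieved[memory_id]
        return self.used[memory_id] / count if count else 0.0

    def low_value(self, min_retrievals: int = 3, max_rate: float = 0.2) -> list[str]:
        """Memories retrieved often but rarely used: candidates for the cleanup-bell."""
        return [
            memory_id
            for memory_id, count in self.retrieved.items()
            if count >= min_retrievals and self.usage_rate(memory_id) <= max_rate
        ]

tracker = UsageTracker()
for _ in range(3):
    tracker.record("concise-bullets", was_used=True)
    tracker.record("old-invoice-bug", was_used=False)

print(tracker.low_value())  # ['old-invoice-bug']
```

The `min_retrievals` guard keeps a memory from being flagged after a single unlucky retrieval.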


Where this lives in the wild

  • Gmail Smart Compose assistants: a knowledge worker needs profile lookup and recent thread state, not brute-force retrieval from every old email.
  • ChatGPT Memory: OpenAI selectively injects durable preferences when the current query benefits from them.
  • Intercom Fin: a support rep needs ticket event memory, customer profile data, and semantic retrieval over help content in one response flow.
  • Salesforce Einstein Copilot: a seller blends CRM facts, last-meeting events, and relevant account documents when preparing next-step suggestions.
  • GitHub Copilot coding agents: an engineer needs recent tool traces, repo facts, and semantically similar prior fixes without flooding the prompt.

Pause and recall

  1. Why is retrieval a trigger problem before it is a ranking problem?
  2. What three signals were combined in the worked scoring example?
  3. Why should a permission bit not be fetched with plain semantic search?
  4. In the proposal example, which memories were correctly excluded?

Interview Q&A

Q: Why use hybrid retrieval instead of one universal retriever? A: Different memory types have different shapes and guarantees. Exact profile fields, timed episodes, and semantic documents need different lookup methods.

Common wrong answer to avoid: "Because hybrid systems are more accurate by default" — they help because memory types require different retrieval semantics.

Q: Why can aggressive retrieval reduce answer quality? A: Extra memory increases prompt noise and can bias the model toward irrelevant details. More memory is not the same as better context.

Common wrong answer to avoid: "Because models dislike long prompts" — the deeper issue is relevance competition, not prompt length alone.

Q: Why rerank after initial candidate generation? A: Initial search is optimized for recall. Reranking applies task-specific judgement, policy, and prompt-budget constraints.

Common wrong answer to avoid: "Only because vector search is approximate" — even exact search still needs task-aware final selection.

Q: Why measure whether retrieved memory was used? A: Retrieval quality is not only about matching. You need to know whether the injected memory improved the answer or just consumed space.

Common wrong answer to avoid: "Because unused memory is expensive" — cost matters, but utility measurement is the main reason.

Apply now (5 min)

Exercise: Take one user query. List three retrieval triggers it should activate. Then score three candidate memories using relevance, recency, and importance. Choose the final two to inject.

Sketch from memory: Draw the hybrid retrieval flow from query to exact lookup, diary lookup, semantic search, merge, and rerank.


Bridge. Retrieval decides what comes back into the prompt. Fine. But every memory system also needs a way to delete, decay, and forget. → 10-forgetting-strategies.md