10. Memory-layer bugs: the fourth suspect lies through what it remembers
~14 min read. Stale state, leaked tenants, drifted embeddings — when recall poisons the decision.
Built on the ELI5 in 00-eli5.md. The fourth of our suspects — the agent's memory and retrieval layer — must now stand in the lineup.
Start with the picture
The prompt was clean, the tool returned correctly, and the loop did not spin — yet the agent recommended the wrong product. The reason is that the agent did not act on this turn alone; it acted on what it remembered, and what it remembered was wrong. Memory is a stack of layers, and each layer can lie in its own way.
                   ┌─────────────────────────────────┐
    turn buffer    │ last user message, last reply   │  ← short-lived
                   └─────────────────────────────────┘
                                    │
                                    ▼
                   ┌─────────────────────────────────┐
    conv summary   │ "user wants enterprise pricing" │  ← compressed
                   │ (rewritten every 20 turns)      │
                   └─────────────────────────────────┘
                                    │
                                    ▼
                   ┌─────────────────────────────────┐
    user           │ {tier: pro, segment: gaming,    │  ← rarely written
    profile        │  region: in, last_updated: 6w}  │
                   └─────────────────────────────────┘
                                    │
                                    ▼
                   ┌─────────────────────────────────┐
    retrieval      │ vector store of past tickets +  │  ← shared across users
    pool           │ docs + ephemeral facts          │
                   └─────────────────────────────────┘
Four floors. Seven crimes. We walk each one through the lineup.
Pattern 1: Stale state
The summary was written three days ago. The user has since changed preferences. Summary still says "user prefers gaming GPUs." Today the user shops for enterprise racks.
Trace signature: prompt holds an older fact that contradicts the current turn. The model trusts the summary because it looks authoritative.
Elimination test (per lineup): clear the summary, rerun. If the bug disappears, the confession is stale state.
Fix: add last_updated to every summary, re-summarize when the user contradicts it, and set a TTL on each fact according to its importance.
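The fix can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the Summary record, the usable_summary helper, and the TTL values are all hypothetical, and the sketch assumes that more important facts should be re-validated sooner.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed policy: more important facts are re-validated sooner.
# The tiers and durations are illustrative, not recommendations.
TTL_BY_IMPORTANCE = {
    "high": timedelta(days=1),
    "medium": timedelta(days=7),
    "low": timedelta(days=30),
}


@dataclass
class Summary:
    text: str
    importance: str  # "high" | "medium" | "low"
    last_updated: datetime


def is_stale(summary: Summary, now: datetime) -> bool:
    """True once the summary has outlived the TTL for its importance tier."""
    return now - summary.last_updated > TTL_BY_IMPORTANCE[summary.importance]


def usable_summary(summary: Summary, now: datetime, user_contradicted: bool) -> Optional[str]:
    """Return the summary text only if it is fresh and uncontradicted.

    None tells the caller to re-summarize before assembling the prompt.
    """
    if user_contradicted or is_stale(summary, now):
        return None
    return summary.text
```

With this shape, the three-day-old "user prefers gaming GPUs" summary never reaches the prompt: the caller gets None and has to rebuild the summary from recent turns.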
Pattern 2: Cross-session leakage
Tenant A asks "what is our renewal date?" Tenant B asks the same minutes later. B sees A's date.
This is the multi-tenancy crime first warned about in module 16 chapter 12. The retrieval pool was queried without a tenant filter, or the cache key omitted the tenant.
Trace signature: the case file for B has a span whose retrieved chunk carries a different tenant_id. Smoking gun.
Elimination test: clear all state, replay B's turn with tenant filter enforced. If the wrong fact vanishes, leakage is confirmed.
Fix: tag every write with tenant_id, filter every read by tenant_id, separate indexes where stakes are high, add a contract test that fails on mismatch.
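A rough sketch of the read-side filter and the contract check. The in-memory pool stands in for a real vector store, and Chunk, retrieve, and assert_single_tenant are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant_id: str
    score: float


class TenantLeakError(RuntimeError):
    """Raised when a retrieved chunk belongs to another tenant."""


def retrieve(pool: List[Chunk], tenant_id: str, k: int = 3) -> List[Chunk]:
    """Filter by tenant first, then rank. The filter is not optional."""
    scoped = [c for c in pool if c.tenant_id == tenant_id]
    return sorted(scoped, key=lambda c: c.score, reverse=True)[:k]


def assert_single_tenant(chunks: List[Chunk], tenant_id: str) -> None:
    """Contract check: fail loudly on any tenant mismatch."""
    for c in chunks:
        if c.tenant_id != tenant_id:
            raise TenantLeakError(
                f"{c.doc_id} belongs to {c.tenant_id}, not {tenant_id}"
            )
```

Running assert_single_tenant on every retrieval result in tests (and, if affordable, in production) turns a silent leak into a hard failure that shows up in the trace.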
Pattern 3: Retrieval drift
Embedding model swapped from text-embedding-3-small to text-embedding-3-large. Top-K shifted. Documents that ranked #1 now rank #7.
Trace signature: same query, same corpus, different top-K vs last week. The cold case flag — no single trace looks wrong, but aggregate behavior shifted.
Elimination test: pin the old embedding model, rerun failing examples. Behavior should return.
Fix: version embeddings explicitly, store embedding_model_id per document, re-embed the entire corpus on model change, run a retrieval-relevance eval before any swap.
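One way to make the version check concrete, assuming each indexed document carries an embedding_model_id as the fix suggests. IndexedDoc and both helper functions are hypothetical; a real store would run this as a metadata query rather than a Python loop.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    embedding_model_id: str


def docs_needing_reembed(index: List[IndexedDoc], query_model_id: str) -> List[str]:
    """Ids of documents embedded with a different model than the query embedder."""
    return [d.doc_id for d in index if d.embedding_model_id != query_model_id]


def check_index_consistent(index: List[IndexedDoc], query_model_id: str) -> None:
    """Refuse to serve queries against a mixed or mismatched index."""
    mismatched = docs_needing_reembed(index, query_model_id)
    if mismatched:
        raise ValueError(
            f"{len(mismatched)} document(s) embedded with another model, "
            f"for example {mismatched[:5]}"
        )
```

Calling check_index_consistent at deploy time, before the new embedder takes traffic, catches the swap before it becomes a cold case.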
Pattern 4: Embedding staleness
Catalog updated Friday. Prices changed. Vector store still holds Tuesday's embeddings. Agent retrieves the old description and quotes the wrong price.
Trace signature: the retrieved document is older than the source-of-truth. A timestamp comparison reveals it instantly.
Elimination test: force a re-index, rerun. If the answer corrects itself, staleness is the confession.
Fix: re-embed pipeline on source change, alert if indexed_at < updated_at, TTL re-index for volatile docs.
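The indexed_at < updated_at alert is a one-line comparison once both timestamps are recorded. A minimal sketch, assuming a hypothetical DocTimestamps record that joins the source of truth with the vector store's metadata:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class DocTimestamps:
    doc_id: str
    updated_at: datetime  # when the source of truth last changed
    indexed_at: datetime  # when the vector store last embedded the document


def stale_docs(docs: List[DocTimestamps]) -> List[str]:
    """Ids of documents whose embedding predates the latest source change."""
    return [d.doc_id for d in docs if d.indexed_at < d.updated_at]


# Example: catalog changed on a Friday, index still holds Tuesday's embedding.
catalog = [
    DocTimestamps("sku_1", updated_at=datetime(2026, 5, 15), indexed_at=datetime(2026, 5, 12)),
    DocTimestamps("sku_2", updated_at=datetime(2026, 5, 10), indexed_at=datetime(2026, 5, 12)),
]
needs_reindex = stale_docs(catalog)  # only sku_1 is behind its source
```

Anything this returns is a candidate for the re-embed pipeline, and a non-empty result on a schedule is a reasonable alert condition.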
Pattern 5: Memory pollution
Turn 4 the agent hallucinated: "user is in Germany." The memory writer dutifully stored it. Turns 5, 6, 7, 8 — all think the user is in Germany now. One bad turn poisoned the well.
Trace signature: a memory record whose source span is an LLM output, not a verified tool call. Later turns cite this record as fact.
Elimination test: delete the polluted record, then replay from the suspect turn. If the bug vanishes, pollution is the confession.
Fix: write to long-term memory only from verified sources — tool outputs, user statements, confirmed actions. Never auto-promote a model-extracted fact. Tag every record with source: tool | user | model.
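A sketch of a writer that enforces this policy. MemoryRecord and LongTermMemory are hypothetical, and the pending queue is one possible way to hold model-extracted facts until something verifies them.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Only these sources may write straight to long-term memory.
TRUSTED_SOURCES = {"tool", "user"}


@dataclass(frozen=True)
class MemoryRecord:
    key: str
    value: str
    source: str       # "tool" | "user" | "model"
    source_span: str  # trace span that produced the fact


@dataclass
class LongTermMemory:
    records: Dict[str, MemoryRecord] = field(default_factory=dict)
    pending: List[MemoryRecord] = field(default_factory=list)

    def write(self, record: MemoryRecord) -> bool:
        """Store trusted facts; park model-extracted facts for verification.

        Returns True if the record was stored, False if it was parked.
        """
        if record.source in TRUSTED_SOURCES:
            self.records[record.key] = record
            return True
        self.pending.append(record)
        return False
```

With this gate, the hallucinated "user is in Germany" lands in pending with its source span attached, and turns 5 through 8 never see it.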
Pattern 6: Wrong-scope memory
Short-term buffer holds what should be long-term — "user is a doctor." Buffer rolls over after 20 turns and the fact disappears. Next session, agent forgets and gives generic advice.
Reverse case: long-term store holds what should be short-term — "user is buying a gift for sister." Six months later, agent still asks about the sister.
Trace signature: a fact that should persist is missing in a new session, or a fact that should die is still alive.
Elimination test: read the turn buffer, user profile, and retrieval pool side by side and check where the fact actually lives. A fact stored at the wrong level is the confession.
Fix: decide write-level by fact type, not recency. Identity facts → profile. Active task state → session. Episodic events → diary with timestamp.
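Routing by fact type can be as small as a lookup table. The sketch below is illustrative: FactType, WRITE_LEVEL, and route_fact are hypothetical names, and classifying a raw statement into a fact type is assumed to happen upstream.

```python
from datetime import datetime
from enum import Enum


class FactType(Enum):
    IDENTITY = "identity"      # e.g. "user is a doctor"
    TASK_STATE = "task_state"  # e.g. "user is buying a gift for sister"
    EPISODE = "episode"        # a dated event worth remembering


# Write level is chosen by fact type, never by how recently the fact appeared.
WRITE_LEVEL = {
    FactType.IDENTITY: "profile",
    FactType.TASK_STATE: "session",
    FactType.EPISODE: "diary",
}


def route_fact(fact_type: FactType, text: str, now: datetime) -> dict:
    """Build a write request addressed to the store that matches the fact type."""
    record = {"store": WRITE_LEVEL[fact_type], "text": text}
    if fact_type is FactType.EPISODE:
        record["timestamp"] = now.isoformat()  # episodic events carry their own time
    return record
```

The point of the table is that "user is a doctor" goes to the profile even if it was said one turn ago, and the gift for the sister stays in the session even if it dominated the whole conversation.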
Pattern 7: Context-window overflow
Prompt is 130k tokens. Window is 128k. The framework silently drops the oldest 5k. That dropped chunk held the user's stated constraint. Agent answers without it.
Trace signature: input length in the witness note is near the model limit. Older messages missing from the assembled prompt.
Elimination test: shrink the conversation, send the same turn, see if the right answer returns.
Fix: track prompt length as an evidence tag, alert at 80% of window, summarize before truncating, pin critical facts outside the rolling buffer.
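A sketch of prompt assembly that pins critical facts and reports what it dropped instead of truncating silently. Everything here is hypothetical, and approx_tokens is a deliberately crude stand-in; a real system would count with the model's own tokenizer.

```python
from typing import Callable, List, Tuple


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def assemble_prompt(
    pinned: List[str],
    history: List[str],
    window: int,
    alert_ratio: float = 0.8,
    count: Callable[[str], int] = approx_tokens,
) -> Tuple[List[str], int, int, bool]:
    """Keep every pinned fact, then as many of the newest turns as still fit.

    Returns (messages, used_tokens, dropped_turns, alert). A nonzero
    dropped_turns tells the caller to summarize the dropped turns rather
    than lose them silently.
    """
    budget = window - sum(count(p) for p in pinned)
    if budget < 0:
        raise ValueError("pinned facts alone exceed the context window")
    kept: List[str] = []
    for turn in reversed(history):  # walk newest to oldest
        cost = count(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    used = window - budget
    dropped = len(history) - len(kept)
    alert = used >= alert_ratio * window
    return pinned + kept, used, dropped, alert
```

Because the user's stated constraint lives in pinned rather than in the rolling history, it survives no matter how long the conversation grows.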
The "memory dump at suspicious turn" diagnostic
When in doubt, dump everything the agent could see at turn N.
    === memory dump @ session=sess_881 turn=14 ===
    turn_buffer:
      - turn 12: user: "show me enterprise GPU racks"
      - turn 13: agent: "you mentioned gaming earlier..."
      - turn 14: user: "no, that was last quarter"
    conversation_summary (updated 3d ago):
      "User maya prefers gaming GPUs."        ← stale
    user_profile (updated 6w ago):
      segment: "gaming", tier: "pro"          ← stale
      last_updated: 2026-04-01
    retrieval_pool (top 3, k=3):
      doc_91  tenant=acme  relevance=0.81
      doc_44  tenant=zeta  relevance=0.79     ← wrong tenant!
      doc_12  tenant=acme  relevance=0.77
    prompt_token_count: 7,221 / 128,000
    embedding_model: text-embedding-3-large (changed 6d ago)
One dump tells the story. Stale summary, stale profile, cross-tenant leak, recent embedding change. Four crimes visible in one snapshot — every line is an evidence tag in disguise. Make this dump a single button on your trace view, attached to every case file.
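The same reading can be automated so the dump arrives pre-annotated. A minimal sketch, assuming the dump has already been parsed into a dictionary; the field names, the flag_dump helper, and every threshold are hypothetical.

```python
from datetime import timedelta
from typing import List


def flag_dump(
    dump: dict,
    tenant: str,
    max_summary_age: timedelta = timedelta(days=1),
    max_profile_age: timedelta = timedelta(weeks=4),
    embedding_watch: timedelta = timedelta(days=14),
) -> List[str]:
    """Return one finding per suspicious line in a parsed memory dump."""
    findings = []
    if dump["summary_age"] > max_summary_age:
        findings.append("stale summary")
    if dump["profile_age"] > max_profile_age:
        findings.append("stale profile")
    for doc_id, doc_tenant in dump["retrieved"]:
        if doc_tenant != tenant:
            findings.append(f"cross-tenant chunk {doc_id}")
    if dump["embedding_changed_ago"] < embedding_watch:
        findings.append("recent embedding change")
    if dump["prompt_tokens"] >= 0.8 * dump["window"]:
        findings.append("near context limit")
    return findings


# The dump for sess_881 turn 14, expressed in the assumed dictionary shape.
EXAMPLE_DUMP = {
    "summary_age": timedelta(days=3),
    "profile_age": timedelta(weeks=6),
    "retrieved": [("doc_91", "acme"), ("doc_44", "zeta"), ("doc_12", "acme")],
    "embedding_changed_ago": timedelta(days=6),
    "prompt_tokens": 7221,
    "window": 128000,
}
```

Calling flag_dump(EXAMPLE_DUMP, "acme") reports the same four findings a human reads off the snapshot; the prompt is far below the window, so overflow is not flagged.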
Worked example: the personalization that swapped product lines
A B2B SaaS agent suddenly began recommending gaming SDKs to enterprise customers. The complaint slip arrived with three trace IDs. The lineup ran. Prompt clean. Tools healthy. Loop normal. Now memory.
The engineer pulled the memory dump for "maya@bigco.com."
    user_profile:
      segment: "gaming enthusiast"
      last_updated: 2026-04-02 (6 weeks ago)
      source_span: span_31a7
Pulling span_31a7, the source emerged.
Six weeks ago, Maya had browsed public docs without logging in.
A cookie heuristic guessed her segment as "gaming."
The profile writer trusted the heuristic.
Maya later signed in as an enterprise admin.
The profile was never updated.
Every session since started with segment=gaming baked into context.
Recommendations followed.
The fix had two parts:
- Staleness TTL. Each profile attribute carries expires_at. Heuristic-sourced attributes expire in 14 days. User-confirmed attributes last a year.
- Source-aware confidence. Attributes carry confidence and source. Retrieval prefers high-confidence, recent attributes over stale heuristics.
The lock — a regression eval — was added. It plays back: anonymous browse → sign-in as enterprise → make request. The agent must recommend enterprise SKUs. The cold case that haunted the team for two weeks was closed.
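The two-part fix and the replay can be sketched together. The TTLs come from the fix above; the Attribute record, the resolve helper, and the confidence numbers are hypothetical and chosen only to illustrate the ordering.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

TTL_BY_SOURCE = {
    "heuristic": timedelta(days=14),
    "user_confirmed": timedelta(days=365),
}
# Illustrative confidence levels; only their relative order matters here.
CONFIDENCE_BY_SOURCE = {"heuristic": 0.3, "user_confirmed": 0.9}


@dataclass(frozen=True)
class Attribute:
    name: str
    value: str
    source: str
    written_at: datetime

    @property
    def expires_at(self) -> datetime:
        return self.written_at + TTL_BY_SOURCE[self.source]

    @property
    def confidence(self) -> float:
        return CONFIDENCE_BY_SOURCE[self.source]


def resolve(attrs: List[Attribute], name: str, now: datetime) -> Optional[str]:
    """Pick the unexpired value with the highest confidence, newest on ties."""
    live = [a for a in attrs if a.name == name and a.expires_at > now]
    if not live:
        return None
    best = max(live, key=lambda a: (a.confidence, a.written_at))
    return best.value


# Replay: anonymous browse, then sign-in as enterprise, then a request.
t0 = datetime(2026, 4, 2)
profile = [Attribute("segment", "gaming", "heuristic", t0)]
profile.append(Attribute("segment", "enterprise", "user_confirmed", t0 + timedelta(days=2)))
segment_at_request = resolve(profile, "segment", t0 + timedelta(days=3))
```

In this replay the confirmed enterprise attribute wins as soon as it exists, and even without it the heuristic guess would have expired long before the six-week mark.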
Agent-memory failures across shipped products
- Notion AI workspace memory: per-workspace memory boundaries; every retrieval is scoped by workspace_id, so a query in Acme's workspace cannot reach Beta's pages no matter how relevant. The role is making pattern 2 (cross-session leakage) structurally impossible at the index level.
- Cursor's memory bank: project-scoped persistent memory written to a .cursor directory; the role is teaching that the lock for stale state can be a file in the repo that the user can read and edit.
- Mem0 memory layer: explicit memory-extraction layer with user_id and agent_id keys; the role is decoupling memory from any specific framework and making writes auditable per-entity.
- LangMem: LangChain's managed memory store with namespaces and TTL; the role is converting "what is the TTL of this fact?" from a design question into a per-write parameter.
- LangChain ConversationSummaryMemory: rolling summary buffer that compresses old turns into a paragraph; classic pattern 1 source when the summary is rewritten with a stale assumption baked in and never re-validated against later turns.
- ChatGPT memory feature: known staleness pain; an outdated remembered fact ("I am a beginner") biases answers months after the user has become expert. Explicit memory editing and per-conversation toggles were added partly to mitigate this.
- ChatGPT custom instructions: separate from memory but the same failure mode: a stale persona baked into every turn until the user notices and rewrites it.
- Anthropic Claude Projects context: project-scoped knowledge that lives outside the conversation; the role is preventing wrong-scope writes by separating "facts about this project" from "facts about this turn."
- Letta (formerly MemGPT) persistent memory: hierarchical memory (core, archival, recall) has a known class of bugs where facts written to the wrong tier behave like wrong-scope memory; short-tier eviction silently loses what should have been archival.
- Mem.ai: documented memory-drift cases where the same query returns different recalled notes week to week as the personal corpus grows, requiring periodic relevance recalibration. Canonical pattern 3 (retrieval drift) in a consumer product.
- GitHub Copilot session memory: chat session memory scoped to a workspace; the role is showing that even a single-tenant product needs explicit scope to avoid cross-repo bleeding when the user has multiple folders open.
- Sourcegraph Cody context tiers: query-time context (file, repo, graph) layered explicitly so the agent knows which tier a fact came from; the role is making source tags first-class so pattern 5 (pollution) can be filtered out at retrieval.
- Coral by Cohere: session memory is explicit and reset-able; engineers can clear and replay to isolate stale-state bugs without touching long-term stores.
- OpenAI Assistants thread memory: thread-scoped messages and vector_store attachments; pattern 2 leakage shows up when the same assistant is used across users without per-user thread isolation.
- Slack AI channel memory: channel-scoped retrieval ensures DMs do not leak into channel summaries; the role is the same as workspace boundaries but at a finer grain.
- Vectara HHEM faithfulness checks: runs against retrieved chunks to catch pattern 5 (pollution) when a hallucinated fact has been promoted into the retrieval pool.
- Pinecone namespaces: per-namespace isolation in the vector index; the role is making pattern 2 (cross-tenant leakage) a one-line schema decision instead of a per-query filter the developer might forget.
Recall: memory layers, leakage, and the cold-case embedding swap
- Which memory layer is most often responsible for "the agent acts like it knows old facts about me that are no longer true"?
- What is the trace signature of a cross-tenant leak, and what is the one-line fix?
- Why does an embedding-model swap qualify as a cold case rather than a single-trace bug?
- In the memory-dump diagnostic, which four pieces of evidence appear in one view?
Interview Q&A
Q: Your agent works fine in test but recommends wrong products in production. Traces show clean prompts, tools, and loop. Where do you look next?
A: The memory layers. Dump everything the agent could see at the suspicious turn — turn buffer, conversation summary, user profile, retrieved chunks. Compare each to ground truth. Check last_updated and source tags. Stale state and wrong-scope writes hide here.
Common wrong answer to avoid: "Bump the model version" — switching models hides the memory bug by changing how the wrong context gets interpreted. The wrong context is still in the prompt. The bug returns.
Q: How do you debug a suspected cross-tenant leak without exposing other tenants' data to your debugger?
A: Use synthetic tenants. Provision tenant_X and tenant_Y in staging with fake data. Replay the failing flow with both. If a query for tenant_Y returns chunks tagged tenant_X, the leak is reproduced without touching real customer data. Then enforce a hard tenant filter at query time and add a contract test.
Common wrong answer to avoid: "Look at the production traces for both tenants" — that itself violates tenancy isolation. The investigation must be reproducible on synthetic data.
Q: You changed your embedding model and retrieval quality dropped. The vector store still has old embeddings for most documents. What is the correct response?
A: Never mix embedding models in one index. Each produces a different geometry; cosine similarity across geometries is meaningless. Either reindex everything with the new model or roll back. The "partial reindex" interim state is itself the bug.
Common wrong answer to avoid: "Reindex only popular documents and leave the rest" — this creates an inconsistent vector space where ranking is unstable. Long-tail queries get unpredictable mixes of old and new geometry.
Q: A memory record was written from a hallucinated model output and now poisons subsequent turns. How do you prevent this class of bug structurally?
A: Restrict who can write to long-term memory. Tool outputs, user statements, and confirmed actions can write. Model-extracted facts cannot write directly — they must be verified before promotion. Tag every record with source so the failure mode is at least greppable.
Common wrong answer to avoid: "Lower the model's temperature" — temperature does not stop hallucinated facts from being written. The writer's policy is the bug, not the model's sampling.
Apply now (10 min)
Step 1 — model the exercise. Here is the memory-dump table I would build for a "the agent recommended a product the user already returned last month" complaint, on the chapter's four memory layers:
| Layer | Evidence pulled | Last-updated | Source tag | Suspected role |
|---|---|---|---|---|
| Turn buffer | last 6 turns of the chat | this turn | user | clean |
| Conversation summary | "user prefers Brand X" | 38 days ago | model-extracted | stale, possibly hallucinated |
| User profile | preferred-brand: Brand X | 38 days ago | profile-write from summary | stale, polluted by summary |
| Retrieval pool | "returns rarely affect preference" | static doc | KB | clean |
Two layers point at one source: a model-extracted summary written 38 days ago that has since been promoted into the user profile. The fix is to require human or tool confirmation before any model extraction writes to profile-level memory.
Step 2 — your turn. Pick an agent you have built. List its memory layers from top to bottom. For each, answer in one line: what is the TTL of facts here, and who is allowed to write? Then take the most recent unexplained behavior you saw and ask: could a six-week-old fact in any of these layers explain it? Run the memory-dump diagnostic in your head.
Step 3 — reproduce from memory. Draw the four-floor memory diagram — turn buffer, conversation summary, user profile, retrieval pool. Mark which floor is most prone to staleness, which to leakage, which to drift, which to pollution. If you can do this cold, you carry the chapter.
What you should remember
This chapter explained why a clean prompt, a clean tool call, and a clean loop can still produce a confidently wrong agent. The case file has four memory layers — turn buffer, conversation summary, user profile, retrieval pool — and each one fails in a different way. Staleness lives in the user profile when an old fact stays past its truth date. Cross-tenant leakage lives in the retrieval pool when the namespace filter is absent. Pollution lives wherever model-extracted "facts" can write back to long-term memory without verification. The diagnostic that resolves all of these is the same: dump every memory layer the agent could see at the suspicious turn, and compare each piece of evidence to ground truth.
You also learned why an embedding-model swap is a cold case rather than a single-trace bug. The bad answer at turn 47 looks identical to every prior bad answer; the failure is in the geometry of the index, not in this conversation. The trace alone cannot reveal it. Only checking the embedder version against the index version exposes the cause.
Carry this diagnostic forward: when a clean lineup leaves only the model standing accused, dump memory first. Memory is the suspect that hides behind the model's accent and gets the model blamed for its sins. If last_updated is older than the fact's truth window, the bug is here.
Remember:
- Four memory layers, four failure shapes. Always dump all four side by side before naming the suspect.
- Tag every memory record with source and last_updated. Untagged memory is a cold case waiting to happen.
- Model-extracted facts must not write to long-term memory without verification. The writer's policy is the bug, not the model's sampling.
- Cross-tenant leaks reproduce on synthetic tenants. Never debug them on production traces.
- An embedding swap that left old vectors in the index is a geometry bug, not a retrieval bug. Reindex fully or roll back.
Bridge. Memory cleared the lineup. Prompt, tool, loop, memory — all alibis verified. Only one suspect remains. The model itself. Version regressions, capability cliffs, sudden refusals. The last interrogation begins now. → 11-model-layer-bugs.md