06. What the agent forgets — Memory tiers, retrieval as tool, and the cost of remembering¶
~18 min read. A long-running agent that forgets its own last move is not a long-running agent — it is the same short-running agent paying full price six times in a row. Memory is the difference between an agent that compounds knowledge across turns and one that pays for amnesia on every iteration.
The support agent that quoted the wrong country's policy¶
Tuesday afternoon. A customer-support agent handles Priya's chat: "It has been 14 days, my refund has not arrived. Is something wrong?" Priya is in India. The agent's system prompt says it has access to the policy knowledge base. It searches, gets eight chunks back, and three of them mention "refund window: 7 days." The agent tells Priya her refund is overdue.
Except the 7-day window is US policy. The Indian window is 21 days. Priya's refund is on track.
Two hours later, a different user asks the same agent about a file it read three turns ago. The agent re-reads the file from scratch — it has no memory of ever reading it. Three thousand tokens wasted. The same fact, re-derived, re-paid.
Both failures share a root: the agent has no architecture for what it knows, when to fetch more, and where knowledge lives between turns. It is functional. It forgets.
What we know so far¶
File 05 established that the agent now has a loop, tools, composition patterns, and a standard protocol (MCP). But every request ends, and when it does the agent's context window empties. Nothing survives unless someone builds a place for it to live.
What this file solves¶
This file answers three questions that determine whether an agent compounds knowledge or re-derives it:
- Where does knowledge live? Three tiers — context window, session memory, persistent memory — each with different cost, durability, and failure modes.
- How does knowledge get there? Two strategies — always-inject (pre-load context before the agent thinks) vs retrieval-as-tool (the agent decides to fetch).
- What does remembering cost? Every token in the prompt is paid on every turn. Memory is a cost center, and the architect's job is to make that cost proportional to value.
Three tiers of agent memory¶
An agent's knowledge lives in layers. Conflating them produces the dump-everything-into-chat-history failure that every team discovers around turn six.
┌─────────────────────────┐
│ TIER 1: CONTEXT WINDOW │ Free to read, ephemeral, dies when the request ends.
│ system prompt │ What the agent sees RIGHT NOW.
│ tool schemas │ ~5-8k tokens.
│ current user message │ Every token here is paid on every turn.
│ retrieved chunks │
└─────────────────────────┘
▼ survives within a conversation
┌─────────────────────────┐
│ TIER 2: SESSION MEMORY │ Persists across turns within one conversation.
│ structured scratchpad │ Goal, files read, open questions, rejected paths.
│ summarised history │ ~1-3k tokens of curated state.
│ tool-result cache │ Avoids re-reading the same file.
└─────────────────────────┘
▼ survives across conversations
┌─────────────────────────┐
│ TIER 3: PERSISTENT │ Survives across sessions. Vector store, database,
│ MEMORY │ or structured key-value.
│ user preferences │ Queried on demand, not loaded by default.
│ codebase facts │ Subject to staleness, bloat, and cross-tenant leak.
│ prior resolutions │ Requires explicit eviction policy.
└─────────────────────────┘
Each tier answers a different question. Tier 1: what does the agent need right now? Tier 2: what did the agent learn this session? Tier 3: what did the agent (or its predecessors) learn over many sessions?
The architectural job is not "do we have memory?" — every system has some memory, even if it is just chat history. The job is: what lives in which tier, how long it stays, and what triggers promotion or eviction.
The cost of amnesia — a concrete comparison¶
Same coding agent. Same model. Same bug: auth_handler.login() throws AttributeError on empty passwords. Two memory designs.
Design A — stateless (Tier 1 only).
TURN 1 read auth_handler.py (~3,000 tokens)
note: login() at line 47 calls is_blank()
TURN 2 read auth_handler.py AGAIN (re-discovery: 3,000 tokens)
read utils.py (1,800 tokens)
TURN 3 read auth_handler.py AGAIN (re-discovery: 3,000 tokens)
re-derive same facts
TURN 4 read auth_handler.py AGAIN (re-discovery: 3,000 tokens)
run failing test
TURN 5 read auth_handler.py AGAIN (re-discovery: 3,000 tokens)
propose fix
TURN 6 write patch
Total input: ~22,000 tokens. Four re-reads of the same file — pure waste.
Design B — session scratchpad (Tier 1 + Tier 2).
TURN 1 read auth_handler.py
scratchpad ← "login() at L47 calls is_blank() before .lower()"
TURN 2 [scratchpad in prompt — skip re-read]
read utils.py
scratchpad ← prior + "is_blank returns None on None input"
TURN 3 [scratchpad in prompt — skip both re-reads]
run failing test
scratchpad ← prior + "NoneType has no .lower() at L51"
TURN 4 propose fix from full scratchpad
TURN 5 write patch, confirm green
Total input: ~12,000 tokens. Same fix. 45% cheaper. One fewer iteration (five turns instead of six). The reasoning is auditable end-to-end.
Design C — session scratchpad + persistent memory (all three tiers). Last Tuesday's session learned that is_blank() returns None on None input. That fact is now in the persistent store. Wednesday's session queries the store, skips reading utils.py entirely, and fixes the bug in three turns.
The lesson: agents that "feel dumb" in production are usually smart agents starving for memory.
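The tool-result cache from Tier 2 is what removes the re-reads in Design A. The following is a minimal sketch, assuming a hypothetical `ToolResultCache` keyed by file path and a content hash. The class, its method names, and the note format are illustrative assumptions, not part of any specific agent framework. The idea is that the file tool can check the hash locally and hand the model a one-line note instead of the full file when nothing has changed.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


def fingerprint(content: str) -> str:
    """Stable hash of file content, used to detect edits between turns."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class CachedRead:
    digest: str  # fingerprint of the content at the time of the full read
    note: str    # short fact the agent recorded in its scratchpad
    turn: int    # turn on which the full read happened


@dataclass
class ToolResultCache:
    """Hypothetical session-scoped cache: path -> last full read."""
    entries: Dict[str, CachedRead] = field(default_factory=dict)

    def store(self, path: str, content: str, note: str, turn: int) -> None:
        self.entries[path] = CachedRead(fingerprint(content), note, turn)

    def lookup(self, path: str, content: str) -> Optional[str]:
        """Return the short note if the file is unchanged, else None (full read needed)."""
        cached = self.entries.get(path)
        if cached is not None and cached.digest == fingerprint(content):
            return f"[unchanged since turn {cached.turn}] {cached.note}"
        return None
```

Hashing the content rather than trusting the path alone means an edited file falls through to a full read, which guards against the stale-fact failure discussed later in this file.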
Why long context is not a substitute for memory architecture¶
The temptation: "give the model 1M tokens and skip the tiers." Three reasons this fails:
- Attention is non-uniform. Models tend to attend weakly to the middle of long prompts, so a fact buried at position 400k can be functionally invisible no matter how large the window is.
- Every token is paid every turn. A 500k-token chat history is a 500k-token bill every iteration. Memory architecture is the discipline of paying only for what drives the next decision.
- No eviction, no freshness. A longer window accumulates stale facts without removing them. The model cannot distinguish "current reality" from "something that was true three hours ago."
Memory tiers are not a workaround for small contexts. They are what lets you reason about cost, freshness, and auditability separately for each kind of fact.
Retrieval as a tool — the agent decides when to remember¶
Two strategies for getting knowledge into the agent's context. Both legitimate. Very different cost shapes.
Always-inject (pre-load before the agent thinks)¶
Every user turn triggers a retrieval call. The results go into the prompt before the model generates. Simple. Predictable. Expensive.
Decide-to-retrieve (the agent chooses)¶
The model decides whether retrieval is needed. For greetings, clarifications, follow-ups answerable from prior turns — the agent skips retrieval. For substantive questions, it retrieves. Cheaper. Riskier.
| Property | Always-inject | Decide-to-retrieve |
|---|---|---|
| Per-turn retrieval calls | 1.0 | 0.3 – 0.6 |
| Added latency | +200 – 600 ms every turn | 0 ms or +200 – 600 ms |
| Token cost (4 chunks × 600 tok) | +2,400 every turn | 0 or +2,400 |
| Failure mode | Wasted tokens on small talk | Agent skips retrieval when it was needed → hallucinated grounding |
| Best for | Support agents (wrong-with-confidence = trust event) | Coding agents (user corrects easily, most turns are navigation) |
The decision is product-shaped, not technical. When wrong-with-confidence destroys customer trust, always-inject. When the user can fluently correct the agent and 70% of turns are navigation steps, decide-to-retrieve.
A common hybrid: always-inject for "answer" intents, decide-to-retrieve for "navigate" intents. A lightweight classifier picks the intent and routes accordingly.
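The hybrid routing can be sketched in a few lines. This is only an illustration: the keyword heuristic below stands in for a real lightweight classifier, and `classify_intent`, `retrieval_mode`, and the marker list are assumed names rather than anything defined in this series.

```python
from typing import Literal

Intent = Literal["answer", "navigate"]

# Placeholder heuristic standing in for a trained lightweight classifier (assumption).
NAVIGATE_MARKERS = ("open ", "show ", "go to ", "list ", "thanks", "hello")


def classify_intent(message: str) -> Intent:
    """Label a user message as a navigation step or a question needing a grounded answer."""
    text = message.strip().lower()
    if any(text.startswith(marker) for marker in NAVIGATE_MARKERS):
        return "navigate"
    return "answer"


def retrieval_mode(message: str) -> str:
    """'always_inject' pre-loads chunks; 'decide_to_retrieve' leaves the call to the model."""
    if classify_intent(message) == "answer":
        return "always_inject"
    return "decide_to_retrieve"
```

Defaulting unknown messages to "answer" is a deliberate bias in this sketch: a misrouted navigation turn wastes tokens, while a misrouted answer turn risks an ungrounded reply.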
The retrieval tool schema — where most teams fail¶
A retrieval tool is not a search bar. A search bar serves a human who reads three results and picks. A retrieval tool serves a model that picks the first chunk and quotes it. The schema is the only defense against confident hallucination.
Vague schema (Tool A):
The agent searches "refund not arrived 14 days". Gets 8 chunks from mixed regions. Quotes US policy to an Indian customer. Disaster.
Typed schema (Tool B):
{
  "name": "search_policy_docs",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "minLength": 3, "maxLength": 500},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 10, "default": 4},
      "region_filter": {"type": "string", "enum": ["IN", "US", "EU", "APAC", "ANY"]},
      "time_range": {"type": "string", "enum": ["current", "last_quarter", "all"]},
      "return": {"type": "string", "enum": ["chunks", "summary", "ids", "chunks_with_metadata"]}
    },
    "required": ["query", "region_filter"]
  }
}
Every parameter is a reasoning dimension the model commits to before retrieving. Strip any one and you ship the failure mode that parameter was preventing.
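To make the "commit before retrieving" point concrete, here is a hedged sketch of a server-side check that mirrors the Tool B constraints in plain Python. `validate_search_args` is a hypothetical helper; a production system would more likely hand the schema to a JSON Schema validator, but the effect is the same: a call without a region never reaches the index.

```python
from typing import Any, Dict, List

REGIONS = {"IN", "US", "EU", "APAC", "ANY"}
TIME_RANGES = {"current", "last_quarter", "all"}
RETURN_SHAPES = {"chunks", "summary", "ids", "chunks_with_metadata"}


def validate_search_args(args: Dict[str, Any]) -> List[str]:
    """Return validation errors for a search_policy_docs call; empty means it may proceed."""
    errors: List[str] = []

    query = args.get("query")
    if not isinstance(query, str) or not 3 <= len(query) <= 500:
        errors.append("query: required string, 3-500 characters")

    if args.get("region_filter") not in REGIONS:
        errors.append("region_filter: required, one of " + ", ".join(sorted(REGIONS)))

    top_k = args.get("top_k", 4)  # default from the schema
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 10:
        errors.append("top_k: integer between 1 and 10")

    # Optional parameters are only checked when the model supplied them.
    if "time_range" in args and args["time_range"] not in TIME_RANGES:
        errors.append("time_range: one of " + ", ".join(sorted(TIME_RANGES)))
    if "return" in args and args["return"] not in RETURN_SHAPES:
        errors.append("return: one of " + ", ".join(sorted(RETURN_SHAPES)))

    return errors
```

With this check in place, the Tool A query from the Priya incident is rejected with a message the model can act on, instead of returning mixed-region chunks.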
What the response must include:
Without version, the agent cannot cite which policy applied. Without confidence, it cannot decide whether to trust the result. Without region echoed back, it cannot verify the filter was actually applied by the backend.
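The response example is not reproduced in this section, so the shape below is a hypothetical reconstruction from the three fields the paragraph names. Field names such as `applied_region_filter` and the helper `scope_is_verified` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PolicyChunk:
    text: str
    version: str       # policy revision the text comes from, so the agent can cite it
    confidence: float  # retrieval score in [0, 1]
    region: str        # region the chunk belongs to


@dataclass(frozen=True)
class SearchResponse:
    chunks: List[PolicyChunk]
    applied_region_filter: str  # echo of the filter the backend actually applied


def scope_is_verified(requested_region: str, response: SearchResponse) -> bool:
    """True only if the backend echoed the requested filter and every chunk matches it."""
    if response.applied_region_filter != requested_region:
        return False
    if requested_region == "ANY":
        return True
    return all(chunk.region == requested_region for chunk in response.chunks)
```

Checking the echo and the per-chunk region separately catches two different backend faults: a filter that was silently dropped, and a filter that was applied but leaked chunks anyway.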
The three retrieval failure shapes¶
1. Empty result. Tool returns []. Most agents handle this correctly — they escalate. Unambiguous signal.
2. Wrong scope. Chunks come back, but from the wrong region/version/tenant. Looks authoritative. Agent quotes them. This is the Priya failure. Fix: required scope filter in schema + applied-filter echo in response.
3. Confidently-wrong content. Chunks are on-topic but factually outdated or contradicted by a newer doc outside top_k. Hardest to catch. Three structural fixes:
- Confidence floor. If no chunk crosses 0.7, escalate. Catches "topically relevant but distant" matches.
- Recency boost. Tool internally biases toward newer versions. Older docs still appear for audit, but score lower.
- Contradiction detection. If two top chunks contradict each other on the same topic, tool returns contradiction_detected: true. Agent's instructions: contradiction → escalate, never paper over.
ITER 1
search_policy_docs(query="partial refund process", region_filter="IN", ...)
→ chunk 1 (v2026-03-01, confidence=0.71): "14 days"
→ chunk 2 (v2025-09-01, confidence=0.68): "7 days"
→ contradiction_detected=true
ITER 2
escalate_to_human(reason="contradicting policy chunks")
→ "Priya, I've escalated to a human (ticket ESC-4912)."
The agent admits uncertainty. That is the correct behaviour, and the only way to get it is to surface contradictions at the tool boundary, not pray the model notices inside its prompt.
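The escalation rules above can be expressed as a small decision function that runs outside the model. This is a sketch under assumptions: `next_step` is a hypothetical name, the 0.7 floor comes from the text, and a score of exactly 0.7 is treated here as passing.

```python
from typing import List, Tuple

CONFIDENCE_FLOOR = 0.7  # threshold named in the text


def next_step(confidences: List[float], contradiction_detected: bool) -> Tuple[str, str]:
    """Map a retrieval result to (action, reason) before the model drafts an answer."""
    if not confidences:
        return ("escalate", "empty result")
    if contradiction_detected:
        return ("escalate", "contradicting policy chunks")
    if max(confidences) < CONFIDENCE_FLOOR:
        return ("escalate", "no chunk above confidence floor")
    return ("answer", "grounded")
```

Applied to the trace above, the 0.71 chunk clears the floor on its own, so it is the contradiction flag that forces the escalation.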
Session memory — the scratchpad that makes the agent learn within a conversation¶
A scratchpad without structure is just another transcript. The way to get benefits without failure modes is a fixed key schema:
goal — the immutable target for this session
files_read — what has been read and what it contained (summarised)
open_questions — what still needs answering before the agent can act
rejected_paths — things already tried that did not work (do NOT repeat)
next_action — the agent's own statement of intent for the next turn
These five keys turn the workbench into something you can read during an incident. When a customer reports "the agent kept re-trying the same broken fix," open the trace, look at rejected_paths over time, and see whether the agent was updating that key correctly.
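One way to pin the five keys down is a typed record that is serialised into the prompt in a fixed order. The sketch below is an assumption about how such a scratchpad might be coded; the `Scratchpad` class and its helper methods are illustrative, and `goal` is immutable by convention only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Scratchpad:
    goal: str                                                  # set once per session
    files_read: Dict[str, str] = field(default_factory=dict)   # path -> summary of contents
    open_questions: List[str] = field(default_factory=list)
    rejected_paths: List[str] = field(default_factory=list)    # do NOT repeat these
    next_action: str = ""

    def reject(self, path: str) -> None:
        """Record an approach that did not work, without duplicating entries."""
        if path not in self.rejected_paths:
            self.rejected_paths.append(path)

    def already_rejected(self, path: str) -> bool:
        return path in self.rejected_paths

    def render(self) -> str:
        """Serialise the five keys in declaration order for inclusion in the prompt."""
        return json.dumps(asdict(self), indent=2)
```

Because `render` always emits the same keys in the same order, two traces from different turns can be diffed directly during an incident review.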
Summarisation as state management¶
By turn 8, even a scratchpad design can blow the budget. The fix is summarisation as part of the state system:
| Component | Before curation | After curation |
|---|---|---|
| Last two chat turns (full) | 1,400 | 1,400 |
| Summary of turns 1-6 | — | 400 |
| Recent tool logs | 5,200 | 1,000 |
| Summary of older tool logs | — | 200 |
| Scratchpad | 800 | 800 |
Summing the rows above, the per-turn prompt drops from 7,400 to 3,800 tokens, a reduction of roughly 49% with no loss of decision-driving facts. For any session longer than four turns, summarisation is what keeps the agent affordable.
The summariser uses a fixed template: "user asked X, agent did Y, outcome Z, learned W." Facts that must survive verbatim go into the scratchpad keys. Everything else compresses.
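A minimal sketch of that template-driven compression follows, assuming each turn has already been reduced to the four template slots (in practice a model call would fill them). `Turn`, `summarise`, and `curate` are hypothetical names, and keeping the last two turns in full mirrors the table above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    user_asked: str
    agent_did: str
    outcome: str
    learned: str


def summarise(turn: Turn) -> str:
    """Fixed template: user asked X, agent did Y, outcome Z, learned W."""
    return (
        f"user asked {turn.user_asked}, agent did {turn.agent_did}, "
        f"outcome {turn.outcome}, learned {turn.learned}"
    )


def curate(history: List[Turn], keep_full: int = 2) -> Tuple[List[str], List[Turn]]:
    """Compress everything older than the last `keep_full` turns; keep those verbatim."""
    split = max(len(history) - keep_full, 0)
    summaries = [summarise(turn) for turn in history[:split]]
    return summaries, history[split:]
```

The fixed template matters more than the summariser's quality: it guarantees that every compressed turn still answers the same four questions, so nothing decision-relevant disappears silently.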
Persistent memory — when facts should survive across sessions¶
Session memory dies when the session ends. Usually a feature — fresh sessions do not want last session's confusion bleeding in. But some facts should survive.
Three questions before adding a persistent layer:
- Does the fact apply across sessions? "User prefers concise replies" → yes. "User said hello at 09:14" → no. Test: does the fact survive a week of inactivity?
- What is the freshness window? A codebase fact from Tuesday is useful Wednesday. From six months ago — stale. Persistent memory needs expiry or recency-weighting.
- Is retrieval cheaper than re-derivation? Querying a vector store: ~120ms, ~150 tokens. Re-reading utils.py: ~30ms, ~1,800 tokens. For frequently-needed facts, persistent wins.
If any condition fails, just re-derive each session. Persistent memory is not free.
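The freshness question translates naturally into an expiry on each stored fact. The sketch below uses an in-memory dictionary as a stand-in for the real store; `PersistentFacts`, its methods, and the per-fact `ttl` are assumptions chosen for illustration, and the caller passes `now` explicitly so the behaviour is easy to test.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class Fact:
    value: str
    learned_at: datetime
    ttl: timedelta  # freshness window chosen per kind of fact


class PersistentFacts:
    """In-memory stand-in for a durable fact store with expiry."""

    def __init__(self) -> None:
        self._facts: Dict[str, Fact] = {}

    def remember(self, key: str, value: str, now: datetime, ttl: timedelta) -> None:
        self._facts[key] = Fact(value, now, ttl)

    def recall(self, key: str, now: datetime) -> Optional[str]:
        """Return the fact if still fresh; evict and return None once it has expired."""
        fact = self._facts.get(key)
        if fact is None:
            return None
        if now - fact.learned_at > fact.ttl:
            del self._facts[key]  # expired facts are removed so they cannot resurface
            return None
        return fact.value
```

A `None` from `recall` is the signal to re-derive the fact in the current session, which is the fallback the paragraph above recommends.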
Multi-tenant isolation in persistent memory¶
For multi-tenant agents, persistent memory is where the strictest isolation rules apply:
- tenant_id is a required column on every durable write, enforced at the storage layer, never at the model boundary.
- Vector embeddings are namespaced per tenant, so similarity search over tenant A's namespace cannot return tenant B's vectors.
- Audit logs record every persistent read with tenant_id, retained for the contractual window.
A single cross-tenant leak in persistent memory contaminates every future session that hits it. The model is the attack surface, not the access-control layer.
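These rules can be illustrated with a store whose API makes a tenant-less call impossible. The class below is a hypothetical in-memory stand-in, not a real database client; the key assumption is that `tenant_id` is supplied by the backend from the authenticated session and never taken from model output.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class TenantScopedStore:
    """In-memory stand-in for a durable store; tenant_id is mandatory on every call."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, Dict[str, str]] = defaultdict(dict)
        self.audit_log: List[Tuple[str, str, str]] = []  # (operation, tenant_id, key)

    @staticmethod
    def _require_tenant(tenant_id: str) -> None:
        if not isinstance(tenant_id, str) or not tenant_id:
            raise ValueError("tenant_id is required on every durable operation")

    def write(self, tenant_id: str, key: str, value: str) -> None:
        self._require_tenant(tenant_id)
        self._namespaces[tenant_id][key] = value

    def read(self, tenant_id: str, key: str) -> Optional[str]:
        """Look up a key inside one tenant's namespace only, and record the read."""
        self._require_tenant(tenant_id)
        self.audit_log.append(("read", tenant_id, key))
        return self._namespaces.get(tenant_id, {}).get(key)
```

There is deliberately no method that searches across namespaces, so even a prompt-injected model has no call available that could return another tenant's data.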
Three failure modes that memory introduces¶
Stateless systems do not have these. The moment you add memory, you accept all three:
Stale. The scratchpad says "test FAIL" but the agent already applied a fix and the test passes. If the scratchpad was not updated, the next turn reasons from a fact that is no longer true. Rule: every state-mutating tool result triggers a scratchpad write. A tool call that does not update state is an event that did not happen.
Bloated. Forty scratchpad entries by turn six. The relevant rejected_paths are buried in noise. The model re-tries options it already rejected. Fix: eviction. Completed sub-goals collapse to one-line "done." Rejected paths older than three turns move to a compact slot. A 12-entry scratchpad with the current situation beats a 40-entry scratchpad with the entire history.
Leaky. The scratchpad captured a comment from line 73 of auth_handler.py containing an internal Slack channel name. That channel name now lives in the prompt of every subsequent turn — including turns where the agent calls external tools. If the persistent store kept it and a different customer triggers a retrieval, the channel name leaks across tenants. Memory is the surface where secrets accumulate; the architect scrubs at the boundary.
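The eviction rule for bloat can be sketched as a pure function over scratchpad entries. The `Entry` shape, the `evict` name, and the `max_age` parameter are assumptions; the two rules themselves (completed sub-goals collapse to a one-line "done", rejected paths older than three turns move to a compact slot) come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    kind: str   # "sub_goal" or "rejected_path"
    text: str
    turn: int   # turn on which the entry was written
    done: bool = False


def evict(entries: List[Entry], current_turn: int, max_age: int = 3) -> Tuple[List[Entry], List[str]]:
    """Return (active entries, compact slot) after applying the two eviction rules."""
    active: List[Entry] = []
    compact: List[str] = []
    for entry in entries:
        if entry.kind == "sub_goal" and entry.done:
            first_line = entry.text.split("\n", 1)[0]
            active.append(Entry(entry.kind, f"done: {first_line}", entry.turn, True))
        elif entry.kind == "rejected_path" and current_turn - entry.turn > max_age:
            compact.append(entry.text)  # still visible, but out of the main list
        else:
            active.append(entry)
    return active, compact
```

Old rejected paths are moved rather than deleted, so the model can still be told not to repeat them without their detail crowding the current situation.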
The cost accounting — memory as a budget line¶
Every token in the prompt is a cost centre. Memory architecture is the art of making that cost proportional to value.
Six-turn session, always-inject, top_k=4, 600 tok/chunk:
System prompt + schemas: 3,000 tokens (fixed)
Retrieved chunks (4 × 600 × 6): 14,400 tokens (grows with turns)
Chat history (growing): 3,000 tokens by turn 6
Scratchpad: 800 tokens (curated)
─────────────────────────────────────────────
Total by turn 6: ~21,200 tokens
Savings from decide-to-retrieve (only 3 of 6 turns retrieve):
Retrieved chunks (4 × 600 × 3): 7,200 tokens
Total by turn 6: ~14,000 tokens (~34% cheaper)
Savings from switching to "summary" return on follow-up turns:
Summary (200 tok × 3 turns): 600 tokens
Total by turn 6: ~7,400 tokens (~65% cheaper)
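The three totals above can be reproduced with a few lines of arithmetic. The constants are the figures from the breakdown; the function and variable names are illustrative.

```python
SYSTEM_AND_SCHEMAS = 3_000
CHAT_HISTORY_BY_TURN_6 = 3_000
SCRATCHPAD = 800
TOP_K, CHUNK_TOKENS, SUMMARY_TOKENS = 4, 600, 200


def session_total(retrieval_tokens: int) -> int:
    """Total tokens by turn 6 for a given amount of retrieved content."""
    return SYSTEM_AND_SCHEMAS + retrieval_tokens + CHAT_HISTORY_BY_TURN_6 + SCRATCHPAD


def saving_pct(baseline: int, candidate: int) -> float:
    """Percentage saved relative to the baseline, rounded to one decimal place."""
    return round(100 * (baseline - candidate) / baseline, 1)


always_inject = session_total(TOP_K * CHUNK_TOKENS * 6)       # retrieval on all 6 turns
decide_to_retrieve = session_total(TOP_K * CHUNK_TOKENS * 3)  # retrieval on 3 of 6 turns
summary_return = session_total(SUMMARY_TOKENS * 3)            # 3 retrievals, summary shape

print(always_inject, decide_to_retrieve, summary_return)
print(saving_pct(always_inject, decide_to_retrieve), saving_pct(always_inject, summary_return))
```

Both percentages are measured against the always-inject baseline of 21,200 tokens, which is why the second saving is not simply double the first.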
Three knobs the architect turns:
- Chunk size. 400-600 tokens is the Q&A sweet spot. Smaller for codebase search (200-400) where tokens-per-fact is denser.
- top_k. Lower (2-3) → tight, may miss. Higher (6-10) → broader recall, model may quote wrong one. Production default: top_k=4 with confidence floor.
- Return shape. chunks_with_metadata for grounding-critical turns. summary for follow-ups. ids when the agent will selectively fetch full docs.
Choices and tradeoffs — the design matrix¶
| Decision | Option A | Option B | Deciding factor |
|---|---|---|---|
| Session memory format | Chat transcript | Structured scratchpad | Scratchpad wins past 3 turns (cost, auditability) |
| Retrieval strategy | Always-inject | Decide-to-retrieve | Is wrong-with-confidence a trust event? |
| Persistent memory | Add from day one | Add when compounding is proven | One-shot tasks: skip. Repeat-codebase tasks: add. |
| Summarisation trigger | Every N turns | When budget threshold hit | Budget-threshold is more adaptive |
| Eviction policy | Time-based | Relevance-based | Relevance-based catches stale facts faster |
| Tenant isolation | Model-enforced | Backend-enforced | Always backend. The model is the attack surface. |
Real-world recognition¶
The split in the wild:
Memory-less / Tier 1 only:
- OpenAI function-calling demos (one shot, no state between turns)
- Slack /summarize bots (one call per invocation)
- Basic LangChain LLMChain with no memory= parameter
Session memory (Tier 1 + 2):
- Claude Code — session state + CLAUDE.md as durable per-repo memory
- Cursor agent mode — session scratchpad + project index
- Aider — chat history per repo with explicit /clear
Full three-tier with retrieval-as-tool:
- GitHub Copilot coding agent: task state across issue → branch → PR lifecycle
- Intercom Fin 2: case-state survives across customer turns; KB retrieval as a typed tool
- Harvey: case memory per legal matter; multi-corpus retrieval with citation discipline
- Salesforce Agentforce: CRM retrieval as tool + case memory across multi-step flows
- MemGPT / Letta: explicit tiered store (core, archival, recall) as a research architecture
The pattern across production agents: explicit state schema, retrieval as a first-class tool with typed parameters, persistent layer with isolation rules.
Interview Q&A¶
Q1. Why is a long-context model not a substitute for a memory architecture?
A. Three reasons. Attention is non-uniform: the middle of a 500k-token prompt gets weak attention and can be functionally invisible. Every token is paid every turn: a 500k-token chat history means a 500k-token bill per iteration. And long context provides no eviction or freshness mechanism, so stale facts accumulate without removal. Memory architecture decides what deserves to live where, with explicit summarisation and eviction.
Wrong answer to avoid: "Long context replaces memory because the model can see everything." It can see everything in principle and attend to a slice of it in practice.
Q2. When would you choose always-inject over decide-to-retrieve?
A. When wrong-with-confidence is a customer-trust event: support agents, legal agents, medical-adjacent agents. Cost: ~1.5-3× per-turn tokens, +200-600ms latency. Payoff: the agent never answers without retrieved context in its prompt, so it cannot skip retrieval and confabulate a policy from its weights. Decide-to-retrieve fits coding agents where users correct easily and 70% of turns are navigation.
Wrong answer to avoid: "Always-inject is always better." For a coding agent where most turns are small navigation steps, it burns money on nothing.
Q3. What are the three failure modes memory introduces, and how do you prevent each?
A. Stale (scratchpad has outdated facts) — fix: every state-mutating tool result triggers a scratchpad write. Bloated (too many entries, model ignores relevant ones) — fix: eviction; completed goals collapse, old rejected paths compact. Leaky (secrets accumulate in memory and surface in wrong contexts) — fix: scrub at the boundary, tenant-namespace everything durable.
Wrong answer to avoid: "Memory only has upsides." Memory adds three new failure shapes the stateless system did not have.
Q4. How would you design durable memory for a multi-tenant product so one customer's facts cannot leak into another's session?
A. Four moves. tenant_id as required column on every durable write. Isolation enforced at storage layer, not model boundary. Vector embeddings namespaced per tenant. Audit logs recording every durable read with tenant_id. The model is the attack surface, not the access-control layer.
Wrong answer to avoid: "We trust the model to filter by tenant." The model is where prompt injection happens.
Q5. What is the difference between a retrieval tool in an agent and a RAG pipeline?
A. A RAG pipeline runs once: retrieve → augment → generate. A retrieval tool is something the model chooses to invoke, with arguments it picks, possibly multiple times per turn, refining queries based on what came back. RAG is a special case of retrieval-as-tool where the loop length is one.
Wrong answer to avoid: "They are the same thing." They share the retrieval engine but the control flow — and therefore the schema requirements — are fundamentally different.
Q6. Your agent's scratchpad has 40 entries by turn 6 and the model keeps re-trying rejected approaches. Diagnosis?
A. Scratchpad bloat. With 40 entries, rejected_paths are buried under noise — the model technically sees them but functionally ignores them under attention pressure. Fix: eviction. Completed sub-goals → one-line "done." Old rejected paths → compact separate slot. A 12-entry scratchpad with the current situation outperforms a 40-entry history.
Wrong answer to avoid: "Switch models." The model operates on the workbench it was given. A junk-drawer workbench produces junk-drawer behaviour regardless of model.
Apply now (10 min)¶
Step 1 — audit a real agent. Pick any agent you have used recently (Cursor, Claude Code, Copilot, your team's internal one). Answer three questions:
| Question | What I observe | Green or red? |
|---|---|---|
| What state survives between turn 3 and turn 4? | | |
| What happens when a tool result invalidates a prior fact? | | |
| What facts survive across sessions, and what evicts them? | | |
Green = explicit state design. Red = transcript-as-memory.
Step 2 — sketch the retrieval schema. From memory, write the typed retrieval tool schema (Tool B from this file). Include all five parameters and the response shape. If you can produce it cold, the architect's view of retrieval is yours.
Step 3 — cost model. For a six-turn support agent session with always-inject (top_k=4, 600 tokens/chunk), calculate total retrieval token cost. Then calculate the savings from switching three of those turns to summary return shape. Write the numbers.
Operational memory¶
This file explained why agent memory is a three-tier architecture — context window, session scratchpad, persistent store — not one giant chat transcript, and why retrieval-as-tool (the agent decides when to fetch context) is the mechanism that makes memory cost-proportional rather than cost-explosive.
The core tension is cost vs grounding. More memory means better-grounded answers. More tokens means higher cost, slower responses, and a larger surface for cross-tenant leakage. The architect's job is to put the right fact in the right tier at the right time — and evict it when it stops being worth its token cost.
Remember:
- Three tiers: context window (no fetch cost, but ephemeral and billed on every turn), session memory (persists within a conversation), persistent memory (survives across sessions). Each has different cost, durability, and failure modes.
- Session scratchpad needs a schema (goal, files_read, open_questions, rejected_paths, next_action) — without keys it degenerates into a transcript.
- Retrieval-as-tool means the agent decides to fetch; always-inject means it always fetches. Choose based on the product cost of wrong-with-confidence.
- Every retrieval schema parameter is a reasoning dimension the model commits to before retrieving. Missing dimensions produce predictable failures.
- Three memory failure modes — stale, bloated, leaky — appear the moment memory exists. Eviction and scrubbing are the price of admission.
- Multi-tenant persistent memory: tenant_id required at storage layer. The model is the attack surface, never the access-control layer.
- Memory is a cost center. Summarisation, eviction, and return-shape switching are how you keep the cost proportional to value.
Bridge. The agent can now reason, act, compose, remember, and speak a standard protocol. It is functional. But functional is not safe. A single bad tool call — a refund with one extra zero, a database query without a WHERE clause, a deployment to production instead of staging — can destroy more value in 200 milliseconds than the agent created in a month. The next question: what is the worst thing one bad call can destroy? The answer determines how tight the leash must be. → 07-blast-radius-approval-gates.md