02. Context Window Management — Your desk is finite¶
~15 min read. Memory starts with one boring truth: the live prompt has hard limits.
Built on the ELI5 in 00-eli5.md. The desk-note — the visible sticky-note area — is finite, so every new token pushes some other token out.
1) What the context window really is¶
See the picture first.
┌──────────────────────────────────────────────┐
│ system rules 3,000 tokens │
├──────────────────────────────────────────────┤
│ tools and schemas 6,000 tokens │
├──────────────────────────────────────────────┤
│ recent conversation 8,000 tokens │
├──────────────────────────────────────────────┤
│ retrieved memory 4,000 tokens │
├──────────────────────────────────────────────┤
│ user turn right now 2,000 tokens │
├──────────────────────────────────────────────┤
│ model answer budget 5,000 tokens │
└──────────────────────────────────────────────┘
total live desk-note 28,000 tokens
2) What fits, and what must move out¶
Look at the decision flow.
new token arrives
│
├── essential for this turn? ── yes ──→ keep on desk-note
│
├── useful later across turns? ─ yes ─→ move to filing-cabinet
│
├── dated event? ─────────────── yes ─→ diary-page
│
├── stable user fact? ────────── yes ─→ address-book
│
└── low value or stale? ──────── yes ─→ cleanup-bell
3) Truncation strategies¶
There are several common strategies. Each fails in its own way.
A. Last-N turns¶
Keep only the last few turns. This is simple. Latency is low. But a crucial promise from turn two may vanish.
B. Token-budget clipping¶
Keep content until the budget fills. This is still simple. But it is blind to semantics. A verbose joke may stay. A tiny legal constraint may disappear.
C. Role-aware retention¶
Keep system rules first. Keep tool results second. Keep recent user turns third. Compress old assistant chatter early. This is usually better.
D. Task-aware packing¶
Keep items linked to the current goal. Drop unrelated branches. Pull in only relevant memory. This is best in principle. But it needs a good librarian. So what to do? In real systems, we combine strategies. Hard floor for system rules. Reserved headroom for output. Role-aware truncation. Then retrieval for anything pushed out.
4) Worked example: budgeting a 32k window¶
Suppose your model has a safe budget of 32,000 tokens. You reserve 4,000 for the answer. That leaves 28,000 for input. Now allocate: - system prompt: 2,500 - tool definitions: 7,500 - current user turn: 1,800 - latest tool outputs: 6,200 - recent conversation: 8,000 - retrieved memory: 2,000 Add them. 2,500 + 7,500 + 1,800 + 6,200 + 8,000 + 2,000 = 28,000. Perfect fit. Now a new log dump arrives. It is 5,000 tokens. You cannot just append it. Options: 1. Shrink recent conversation from 8,000 to 3,000. 2. Compress latest tool outputs from 6,200 to 2,500. 3. Drop retrieved memory this turn. 4. Ask the user to narrow the log. A good policy might do this: - keep system prompt at 2,500 - keep tool definitions at 7,500 - keep current user turn at 1,800 - compress tool outputs to 3,000 - compress conversation to a 2,500-token summary-card - skip retrieval this turn - include 4,700 tokens of the new log Now total is: 2,500 + 7,500 + 1,800 + 3,000 + 2,500 + 0 + 4,700 = 22,000. Still safe. The desk-note survives. The cost is that older detail moved out. So later we may need the filing-cabinet again.
5) Practical rules that save teams¶
Rule one. Always budget output first. Rule two. Treat tools as permanent tenants on the desk-note. Rule three. Compress assistant verbosity before user constraints. Rule four. Summaries should preserve commitments, facts, and open loops. Rule five. When truncating, log what was dropped. That makes debugging possible. See why this matters. If the agent forgets a tone preference, the issue may be truncation. If it misses a prior error code, the issue may be packing order. Context bugs often look like model bugs. They are not. They are desk-note bugs.
Where this lives in the wild¶
- Claude Projects — knowledge worker must pack long project instructions, uploaded files, and fresh chat into one limited live context.
- Perplexity Copilot — researcher must budget retrieved web snippets against the current question and answer length.
- GitHub Copilot Workspace — software engineer must fit repo context, tool output, and the latest coding task into one prompt budget.
- Glean Assistant — enterprise employee must balance company docs, meeting context, and fresh user asks without drowning the model.
- Cursor Chat — developer must decide whether to include full files, diffs, or compressed summaries when the repo context grows.
Pause and recall¶
- Why does a 32k model not give you 32k tokens of conversation history?
- What is the main weakness of a plain last-N strategy?
- In the worked example, why was retrieved memory skipped first?
- Which placeholder helps shrink old turns without losing everything?
Interview Q&A¶
Q: Why reserve answer headroom instead of using the full window for input? A: The model needs room to think and respond. If input consumes the whole budget, generation may fail, truncate, or force aggressive hidden compression at the worst moment. Common wrong answer to avoid: "Because APIs require a fixed split" — the deeper reason is response safety and predictable prompt packing. Q: Why is semantic truncation better than token clipping? A: Token clipping is blind. Semantic truncation can preserve goals, constraints, and commitments while discarding chatter. That improves correctness at the same token count. Common wrong answer to avoid: "Because semantic methods always use fewer tokens" — the gain is quality of what survives, not guaranteed size reduction. Q: Why can larger context windows still need summaries and retrieval? A: More room helps, but noisy context still hurts relevance, cost, and latency. Bigger desks still need organization. Common wrong answer to avoid: "Once context is large enough, retrieval is obsolete" — selection remains necessary even with more capacity. Q: Why log dropped context during truncation? A: Without drop logs, memory failures look random. With logs, you can trace which fact disappeared and why the assistant changed behaviour. Common wrong answer to avoid: "It is only useful for analytics" — it is mainly for debugging correctness and trust issues.
Apply now (5 min)¶
Exercise: Pick a model window size. Reserve 15 percent for output. Then allocate the rest across system rules, tools, recent turns, and retrieved memory. Force yourself to cut 20 percent. Write what you would compress first. Sketch from memory: Draw the token budget box from this file. Then draw the routing flow that sends content to desk-note, filing-cabinet, diary-page, address-book, or cleanup-bell.
Bridge. We have a finite desk-note. Good. But what exactly do we keep from prior turns: raw turns, summaries, or both? → 03-conversation-history.md