02. Autoregressive decode cost — why one more token keeps getting more expensive¶
~16 min read. This is the core arithmetic behind slow generation and the prefill-versus-decode split.
Built on the ELI5 in 00-eli5.md. The order ticket grows one spoon at a time, so the kitchen keeps revisiting old context unless we give it memory.
1) Picture first: the answer keeps dragging the past behind it¶
Imagine the model writing a sentence. At token 1, it looks at the prompt. At token 2, it looks at the prompt plus token 1. At token 3, it looks at the prompt plus tokens 1 and 2. The window keeps growing.
step 1: [prompt] + [y1]
step 2: [prompt y1] + [y2]
step 3: [prompt y1 y2] + [y3]
step 4: [prompt y1 y2 y3] + [y4]
See. If we recompute attention over the whole visible prefix every step, old work returns again and again. The kitchen is not only cooking the new token. It keeps reopening old pans. That is the structural reason naive decode hurts. Picture that before any formula.
2) The O(t²) growth with a worked count¶
Now do the math gently. Suppose we generate T new tokens after the prompt. Step 1 attends over 1 generated position. Step 2 attends over 2 generated positions. Step 3 attends over 3. So total self-attention work across generated tokens is:
-
1 + 2 + 3 + ... + T
-
= T(T + 1) / 2
Worked example. Let T = 8. Then total position visits are:
-
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
-
= 36
If T doubles to 16:
-
1 + 2 + ... + 16
-
= 136
Tokens doubled. Work almost quadrupled. Simple, no? That is the O(t²) pain of naive autoregressive decoding.
3) Prompt length changes the baseline too¶
Real serving is worse than the clean toy sum. Why? Because every generated token also attends over the prompt prefix.
Suppose prompt length P = 2,000. Suppose output length T = 200. At decode step k, visible length is P + k - 1. Total attention positions touched become:
- 2,000 + 2,001 + ... + 2,199
- average = 2,099.5
- total = 2,099.5 × 200
- = 419,900
Only 200 new tokens were emitted. Yet the model examined about 420k visible positions. That is why long prompts plus long outputs are dangerous. The order ticket does not stay small while decoding. It carries its whole history behind it.
4) Why prefill is compute-bound but decode becomes memory-bound¶
Look at the kitchen from above. Prefill has many tokens available together. Decode has only one fresh token per sequence.
prefill decode
┌──────────────────────┐ ┌──────────────────────┐
│ many token rows │ │ one new token row │
│ large GEMMs │ │ tiny repeated GEMMs │
│ tensor cores busy │ │ weight reads dominate│
└──────────────────────┘ └──────────────────────┘
In prefill, large matrix multiplies keep compute units occupied. In decode, arithmetic per step is small, but the full layer stack must still load weights and cache. So decode often becomes memory-bound. The kitchen has many chefs. The problem is getting ingredients to them fast enough. That is why a FLOPs-only story misleads.
5) Why this naturally motivates KV cache¶
So what to do? We cannot avoid serial generation completely. Each next token depends on previous ones. But we can avoid recomputing old keys and values.
Think of the prep station. If onions are already chopped, we should not chop them again for every spoon. The same logic applies here. Store reusable attention state once. Then append only the new token’s state. Naive O(t²) decode does not disappear magically, but a huge repeated chunk of work goes away. That is the point of KV cache. Next we open that prep station and count exactly what lives there.
Where this lives in the wild¶
-
OpenAI and Anthropic chat APIs — long-form completions expose decode scaling far more than short classification prompts.
-
Cursor code generation — a 4,000-token code context plus a long patch makes quadratic growth visible very quickly.
-
Perplexity citation-rich answers — both long prompts and long outputs compound naive decode cost.
-
Character.AI roleplay chats — multi-turn sessions keep growing context, so decode cost rises turn by turn.
-
Llama.cpp local serving — single-user generation speed shows the prefill-versus-decode split very clearly on consumer hardware.
Pause and recall¶
-
Why does naive generated-token work grow like 1 + 2 + ... + T instead of linearly?
-
In the 2,000-prompt and 200-output example, how many visible positions were touched?
-
Why can prefill be compute-bound while decode is memory-bound on the same model?
-
What part of the repeated decode work does KV cache try to remove?
Interview Q&A¶
Q: Why does decode grow quadratically in the naive view, not linearly?
A: Because each new token re-attends over all previously visible positions. The per-step cost grows with sequence length, so the sum over steps becomes triangular.
Common wrong answer to avoid: "Because transformers have many layers." Layers multiply cost, but the quadratic pattern comes from growing attention length.
Q: Why is prompt length still important after generation starts?
A: Because every decode step usually attends over the prompt prefix as well. A long prompt increases the baseline cost of every later token.
Common wrong answer to avoid: "Prompt cost ends after prefill." The prompt keeps participating in attention during decode.
Q: Why not judge performance only by tokens per second?
A: Because tokens per second hides whether the slowdown came from long prompts, memory pressure, or bad scheduling. TTFT and phase-specific metrics matter too.
Common wrong answer to avoid: "Generation speed alone tells the whole story." It ignores prompt ingestion and queueing.
Q: Why is the first big inference optimization usually KV cache, not faster networking?
A: Because naive decode wastes work inside the model loop itself. Removing repeated K and V computation changes the inner algorithm, not only the plumbing around it.
Common wrong answer to avoid: "Serving pain is mostly API overhead." Inner-model repetition is the larger cost.
Apply now (5 min)¶
Pick T = 32 and compute 1 + 2 + ... + T by hand. Then add a prompt of length 1,024 and compute 1,024 + ... + 1,055. Notice how the prompt changes every later step. Sketch from memory:
-
the growing-prefix picture,
-
the prefill-vs-decode comparison,
-
and the triangular-sum idea.
Bridge. We now know what repeated work is hurting us. Next we study the prep station itself: what keys and values are stored, how much memory they consume, and why the savings come with a new budget problem. → 03-kv-cache-mechanics.md