12. Prefill vs decode — same weights, different workload

~8 min read. Prompt processing and token generation stress the model differently.

Built on the ELI5 in 00-eli5.md.

The exam rule still holds.

Prefill enforces it with a mask.

Decode often enforces it by cache structure.


1) One model, two operating modes

Serving is not one uniform action.

It has a prompt phase and a generation phase.

The weights stay fixed while the workload changes.

The bottleneck changes, and so does the best scheduling strategy.

That is why inference engineers separate them mentally.

In both phases, the projection lens still makes Q, K, and V.

In both phases, the parallel graders still run head by head.

The difference is which tokens exist at that moment.

┌─── PREFILL ──────────────┐    ┌─── DECODE ───────────────┐
│ prompt tokens [T_p]      │    │ new token [1]            │
│      │                   │    │      │                   │
│      ▼                   │    │      ▼                   │
│ Q: [B,H,T_p,d]           │    │ Q: [B,H,1,d]             │
│ K: [B,H,T_p,d]           │    │ K: [B,H,T_cache,d]       │
│ mask: lower-triangular   │    │ mask: not needed         │
│                          │    │                          │
│ compute-bound            │    │ memory-bound             │
│ (big matmul)             │    │ (read KV cache)          │
└──────────────────────────┘    └──────────────────────────┘

Prefill sees a full prompt tensor.

Decode sees one fresh query and a stored past.
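The two boxes differ only in the query length and the key length. A minimal sketch of that bookkeeping follows. The helper name `attention_shapes` and all sizes are hypothetical, chosen only for illustration.

```python
def attention_shapes(batch, heads, q_len, kv_len, head_dim):
    """Return the shapes of Q, K/V, and the score matrix for one attention call."""
    return {
        "Q": (batch, heads, q_len, head_dim),
        "KV": (batch, heads, kv_len, head_dim),
        "scores": (batch, heads, q_len, kv_len),
    }

# Hypothetical sizes, not taken from any real model.
B, H, d, T_p = 2, 8, 64, 16

# Prefill: every prompt token is both a query and a key.
prefill = attention_shapes(B, H, q_len=T_p, kv_len=T_p, head_dim=d)

# Decode: one query against the cache. Assumes the new token's own key
# was appended before scoring, so the cache holds T_p + 1 entries.
decode = attention_shapes(B, H, q_len=1, kv_len=T_p + 1, head_dim=d)

print(prefill["scores"])  # (2, 8, 16, 16)
print(decode["scores"])   # (2, 8, 1, 17)
```

The weights never appear in these shapes. Only the token axes change between phases.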

2) Prefill processes the prompt in parallel

During prefill, every prompt token is already loaded.

So we can build all prompt queries together.

We can build all prompt keys together.

We can build all prompt values together.

That makes prefill look like training.

Large matrix multiplies run well on GPUs.

But the exam rule still matters.

Token t1 must not read token t5.

Both tokens exist in memory.

So future positions must be blocked explicitly.

We apply the causal mask before softmax.

That lower triangle keeps early prompt tokens honest.

Without it, prompt positions can peek rightward.

That gives illegal information during scoring.

A short shape view helps.

Q,K,V [B,H,T_prompt,d]

scores [B,H,T_prompt,T_prompt]

mask lower-triangle over [T_prompt,T_prompt]

So prefill is parallel.

It is not bidirectional.

That distinction matters in interviews and production.
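The masking step can be sketched for a single head in plain Python. This is an illustration under simple assumptions, not a production kernel. The function names and the dummy uniform scores are made up for the example.

```python
import math

def causal_mask(t):
    """mask[i][j] is True when query i may read key j, that is, when j <= i."""
    return [[j <= i for j in range(t)] for i in range(t)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which blocked entries behave like -inf logits."""
    out = []
    for row, allowed in zip(scores, mask):
        logits = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        peak = max(logits)  # subtract the row max for numerical stability
        exps = [math.exp(x - peak) for x in logits]  # exp(-inf) is 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

T = 4
scores = [[1.0] * T for _ in range(T)]  # dummy scores: every pair looks equally good
weights = masked_softmax(scores, causal_mask(T))

for row in weights:
    print([round(w, 3) for w in row])
```

Every entry above the diagonal comes out as exactly zero. Each row still sums to one over its legal columns.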

3) Decode generates one new token at a time

After prefill, we store past keys and values.

That stored history is the memory shortcut.

At decode step t, we usually build one new query, plus that token's key and value, which are appended to the cache.

So the query axis shrinks to length one.

The cache still holds the legal past.

A compact shape view looks like this.

Q [B,H,1,d]

K,V [B,H,T_cache,d]

scores [B,H,1,T_cache]

Now ask the key question.

What future token needs masking here?

Usually none.

It does not exist yet.

The cache contains only past context.

So causality is already enforced by data structure.

That is why decode does not strictly need a triangular mask.

Some codepaths still pass a tiny mask.

That keeps APIs uniform.

Correctness does not depend on it.

The memory shortcut already removed the future columns.
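A small check makes the claim concrete. The sketch below compares masked prefill attention with unmasked attention over an append-only cache. `KVCache` and `attend` are hypothetical names, and the vectors are made-up toy numbers for one head.

```python
import math

def attend(q, keys, values, visible=None):
    """Softmax attention for one query. Keys with visible[j] == False get -inf."""
    scale = math.sqrt(len(q))
    logits = []
    for j, k in enumerate(keys):
        score = sum(a * b for a, b in zip(q, k)) / scale
        blocked = visible is not None and not visible[j]
        logits.append(-math.inf if blocked else score)
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    """Hypothetical append-only cache for a single head."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

# Toy per-token projections (made-up numbers).
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ks = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [1.0, 2.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
T = len(qs)

# Prefill style: all keys are in memory, so each row needs an explicit mask.
prefill_out = [attend(qs[i], ks, vs, visible=[j <= i for j in range(T)])
               for i in range(T)]

# Decode style: tokens arrive one at a time and no mask is passed.
cache = KVCache()
decode_out = []
for q, k, v in zip(qs, ks, vs):
    cache.append(k, v)  # the new token's own key and value join the cache first
    decode_out.append(attend(q, cache.keys, cache.values))

same = all(math.isclose(a, b)
           for p_row, d_row in zip(prefill_out, decode_out)
           for a, b in zip(p_row, d_row))
print(same)  # True
```

Both paths give the same outputs. In the second path the future columns were never built, so there was nothing to mask.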

4) One worked example and the systems lesson

Take one head with d = 2.

So sqrt(d) = 1.414.

Prefill row

Let q1 = [2, 0].

Let the prompt keys be k0 = [1,0], k1 = [2,0], k2 = [3,0].

Raw scores are [2, 4, 6].

Token t1 cannot see t2.

So masked logits become [2, 4, -inf].

Scale them to [1.414, 2.828, -inf].

Exponentials are [4.113, 16.919, 0].

The sum is 21.032.

Probabilities become [0.196, 0.804, 0.000].

The future column gets exactly zero mass.

Decode step

Now let the new query be q3 = [2,1].

Let cached keys be k0 = [1,0], k1 = [1,1], k2 = [0,2].

Raw scores are [2, 3, 2].

Scaled scores are [1.414, 2.121, 1.414].

Exponentials are [4.113, 8.342, 4.113].

The sum is about 16.569.

Probabilities become [0.248, 0.503, 0.248].

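There is no masked entry.

No future key was loaded.

In a real step the new token's own key k3 is also appended before scoring; it is omitted here to keep the arithmetic short.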
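Both rows can be recomputed in a few lines. This is a sketch for checking the arithmetic by hand, using the vectors from the example. The helper name `attention_probs` is hypothetical.

```python
import math

def attention_probs(q, keys, visible=None):
    """Scaled dot-product softmax for one query; hidden keys get -inf logits."""
    scale = math.sqrt(len(q))
    logits = []
    for j, k in enumerate(keys):
        score = sum(a * b for a, b in zip(q, k)) / scale
        blocked = visible is not None and not visible[j]
        logits.append(-math.inf if blocked else score)
    exps = [math.exp(x) for x in logits]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Prefill row for t1: k2 sits in memory, so it must be masked explicitly.
prefill = attention_probs([2, 0], [[1, 0], [2, 0], [3, 0]],
                          visible=[True, True, False])

# Decode step: only cached keys exist, so no mask is passed at all.
decode = attention_probs([2, 1], [[1, 0], [1, 1], [0, 2]])

print([round(p, 3) for p in prefill])
print([round(p, 3) for p in decode])
```

The prefill row carries a third entry that is exactly zero. The decode row has three entries and all of them are positive.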

The same shape difference also explains hardware behavior.

Prefill uses large matmuls, so it is usually compute-bound.

Decode does little arithmetic per step but must reread the weights and a growing KV cache.

So decode is usually memory-bound.
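A back-of-envelope estimate shows the gap. The sketch below counts FLOPs per byte moved for one square projection matmul. It assumes 2-byte values, a hypothetical hidden size of 4096, and a single sequence. It ignores the KV cache, attention itself, and any on-chip reuse, so treat the numbers as rough.

```python
def matmul_intensity(tokens, d_model, bytes_per_value=2):
    """FLOPs per byte for one [tokens, d_model] x [d_model, d_model] matmul."""
    flops = 2 * tokens * d_model * d_model  # one multiply and one add per weight per token
    weight_bytes = d_model * d_model * bytes_per_value
    activation_bytes = 2 * tokens * d_model * bytes_per_value  # read input, write output
    return flops / (weight_bytes + activation_bytes)

d_model = 4096  # hypothetical hidden size

prefill_intensity = matmul_intensity(tokens=2048, d_model=d_model)
decode_intensity = matmul_intensity(tokens=1, d_model=d_model)

print(prefill_intensity)           # 1024.0
print(round(decode_intensity, 3))  # just under 1 FLOP per byte
```

Under these assumptions, prefill does about a thousand FLOPs for every byte it moves. Decode does about one, so the math units mostly wait on memory. Batching more sequences into one decode step raises the token count and narrows that gap.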


Where this lives in the wild

  • vLLM continuous batching — schedulers track prefill-heavy and decode-heavy work separately.
  • TensorRT-LLM inflight batching — one batch can mix prefill requests and decode requests efficiently.
  • Anthropic-style long prompts — chunked prefill keeps prompt processing manageable at large context lengths.
  • SGLang prefix reuse — shared prompts reuse cached KV state instead of repeating prefill.
  • Together AI serving stacks — cache layout matters far more during decode than during prefill.

Pause and recall

  • Why must prefill block future prompt positions explicitly?
  • Why does decode often stay causal without a triangular mask?
  • Which phase is usually compute-bound, and which is memory-bound?
  • What changes in the score tensor shape between prefill and decode?

Interview Q&A

Q. Why does prefill resemble training more than decode does?

A. It processes many prompt positions together with the same causal visibility pattern.

Common wrong answer to avoid: “Because prefill is bidirectional like encoder attention.”

Q. Does decode require a causal mask?

A. Not strictly, because the cache usually contains only legal past tokens.

Common wrong answer to avoid: “Yes, every attention call always needs the same lower triangle.”

Q. Why can decode feel slower per token than prefill suggests?

A. Decode reads a growing KV cache, so memory traffic dominates latency.

Common wrong answer to avoid: “Because the model recomputes the whole prompt every token.”


Apply now (5 min)

Take a prompt of length 5.

Write the prefill score shape.

Write the decode score shape for one new token.

Then explain, in one sentence, why only the first case needs an explicit triangular mask.

Next, sketch from memory:

  • the prefill-versus-decode ASCII diagram,
  • the two score-shape lines,
  • and one sentence defining the memory shortcut.

Bridge. Good. We know when masking is explicit and when cache structure already enforces causality.

The next step is debugging the many ways attention code still fails.

13-debugging-attention.md