06. Speculative decoding — how a sous chef guesses ahead so the main chef moves in chunks

~17 min read. This is the draft-verify trick that attacks serial decode directly.

Built on the ELI5 in 00-eli5.md. The sous chef can prep likely next spoons cheaply, while the main kitchen checks whether those guesses are safe to serve.

1) Picture first: draft, then verify

Imagine two chefs.

The small one is fast but less reliable.

The big one is slower but trusted.

Instead of waiting for the big chef to choose one token at a time, we ask the sous chef to draft several.

Then the big chef verifies them in one pass.

prompt ──→ sous chef drafts: t1 t2 t3 t4
                    │
                    ▼
             main chef verifies all four
                    │
        accept prefix and maybe reject tail

See.

If most drafted tokens are correct, we move multiple steps per expensive target-model pass.

That is the whole idea.

2) The basic algorithm step by step

A simple speculative loop works like this.

Run the draft model for k tokens.
Feed those candidate tokens to the large model.
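Compare the target's predictions with the drafted tokens, position by position.
Accept the longest prefix of drafted tokens that the target agrees with.
At the first token that fails verification, discard the rest of the draft and take the target model's own token for that position.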
Continue from the new confirmed position.
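
A minimal sketch of the steps above, assuming greedy decoding, where "verification" simply means the drafted token equals the target's top choice. The function `speculative_decode`, the `NextToken` interface, and the two toy models are made up for illustration. A real target model would score all k drafted positions in one batched forward pass; this sketch only simulates that with a loop.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: given the tokens so far, return the next token.
NextToken = Callable[[Sequence[str]], str]


def speculative_decode(
    prompt: Sequence[str],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
    max_new_tokens: int,
) -> Tuple[List[str], int]:
    """Greedy draft-verify loop. Returns (new tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # Step 1: the draft model proposes k tokens, one cheap step each.
        draft: List[str] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # Steps 2-3: one target pass checks every drafted position.
        # (Simulated position by position here; a real model batches this.)
        target_passes += 1
        confirmed: List[str] = []
        for i, token in enumerate(draft):
            target_token = target_next(out + draft[:i])
            if token == target_token:
                confirmed.append(token)         # Step 4: extend the accepted prefix.
            else:
                confirmed.append(target_token)  # Step 5: the target's token replaces the miss.
                break                           # The rest of the draft is discarded.

        # Step 6: continue from the new confirmed position.
        out.extend(confirmed)

    new_tokens = out[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, target_passes


# Toy models (assumptions): the target recites a fixed sentence,
# and the draft agrees with it except at every 4th position.
TARGET_TEXT = "the quick brown fox jumps over the lazy dog".split()


def target_next(seq: Sequence[str]) -> str:
    return TARGET_TEXT[len(seq) % len(TARGET_TEXT)]


def draft_next(seq: Sequence[str]) -> str:
    pos = len(seq)
    return "???" if pos % 4 == 3 else TARGET_TEXT[pos % len(TARGET_TEXT)]


generated, passes = speculative_decode([], draft_next, target_next, k=4, max_new_tokens=9)
print(" ".join(generated), "| target passes:", passes)
```

Every emitted token is either confirmed by `target_next` or produced by it, so the output matches plain target decoding. Only the number of target passes changes.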

The order ticket still gets one valid answer stream.

The trick is internal. The customer never sees the rejected draft. The plating line only emits confirmed tokens. Simple, no?
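
The list above says to compare target probabilities with the draft path. When tokens are sampled rather than picked greedily, a commonly used rule is: accept a drafted token x with probability min(1, p(x) / q(x)), where p is the target's distribution and q is the draft's. On rejection, resample from the positive part of p − q, renormalized. The sketch below shows that single check. The function name `verify_token` and the dict-based distributions are assumptions for illustration, not any library's API.

```python
import random
from typing import Dict, Tuple


def verify_token(
    token: str,
    p_target: Dict[str, float],
    q_draft: Dict[str, float],
    rng: random.Random,
) -> Tuple[bool, str]:
    """Check one drafted token. Returns (accepted, token to emit)."""
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # > 0, because the draft model sampled this token

    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token

    # Rejected: resample from max(0, p - q), renormalized.
    # Rejection implies p < q for this token, so the residual mass is positive.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
p = {"a": 1.0}            # toy target: always "a"
q = {"a": 0.5, "b": 0.5}  # toy draft: undecided between "a" and "b"
print(verify_token("a", p, q, rng))  # target likes "a" more than the draft did
print(verify_token("b", p, q, rng))  # target gives "b" zero probability
```

Either way, the token that reaches the plating line is one the target model stands behind, which is why the customer never sees a rejected draft.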

3) Worked example: expected speed from acceptance rate

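Suppose the sous chef proposes k = 4 tokens each round. Suppose the average accepted prefix length is 3 tokens. Suppose one target-model verification pass costs about the same as one normal decode step, which is a fair approximation when decode is limited by reading the weights from memory rather than by arithmetic.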

Without speculation:

12 output tokens need 12 target passes

With speculation:

each round confirms 3 tokens on average
12 tokens need about 12 / 3 = 4 target passes

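Now count draft work too. Suppose the draft model is 6 times cheaper than the target. Four rounds need 4 × 4 = 16 draft-token steps, which cost roughly 16 / 6 ≈ 2.7 target steps. Total equivalent cost ≈ 4 + 2.7 = 6.7 target steps, about 1.8 times faster than the 12 passes of plain decoding. The win depends on acceptance: if only 1 token were confirmed per round, the same 12 tokens would need 12 target passes plus 48 / 6 = 8 target steps' worth of draft work, which is slower than not speculating at all.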
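
The same arithmetic as a small function, so the numbers can be varied. This is a sketch under the worked example's assumptions (a verification pass costs one target step, a draft step costs 1 / `draft_cost_ratio` of a target step); the function and parameter names are made up for illustration.

```python
def equivalent_target_steps(
    n_tokens: int,
    k: int,
    accepted_per_round: float,
    draft_cost_ratio: float,
) -> float:
    """Total cost of speculative decoding, measured in target-model steps."""
    rounds = n_tokens / accepted_per_round   # one target pass per round
    draft_steps = rounds * k                 # k cheap draft steps per round
    return rounds + draft_steps / draft_cost_ratio


# Numbers from the worked example: 12 tokens, k = 4, 3 accepted, draft 6x cheaper.
cost = equivalent_target_steps(n_tokens=12, k=4, accepted_per_round=3, draft_cost_ratio=6)
speedup = 12 / cost
print(f"cost = {cost:.1f} target steps, speedup = {speedup:.1f}x")

# Same setup, but acceptance collapses to 1 token per round.
bad_cost = equivalent_target_steps(n_tokens=12, k=4, accepted_per_round=1, draft_cost_ratio=6)
print(f"cost = {bad_cost:.1f} target steps (plain decoding needs 12)")
```

Changing `accepted_per_round` while holding the rest fixed shows how quickly the gain shrinks as the draft and target drift apart.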

4) When the trick works, and when it disappoints

Speculative decoding shines when:

the draft model is much cheaper,
the draft matches the target often,
the target can verify chunks efficiently,
output style is predictable enough for high acceptance.

It disappoints when:

draft and target disagree often,
long contexts make verification memory-heavy,
acceptance collapses on code or niche domains,
batching policies conflict with the chunk sizes.

So what to do? Measure real acceptance rate by workload, not only on toy prompts.
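
One way to do that measurement, sketched under the assumption that the serving stack can log, for each speculation round, a workload label, how many tokens were drafted, and how many were accepted. The log format, the function name, and the sample numbers below are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def acceptance_by_workload(
    rounds: Iterable[Tuple[str, int, int]],
) -> Dict[str, Dict[str, float]]:
    """Aggregate (workload, drafted, accepted) records into per-workload stats."""
    drafted = defaultdict(int)
    accepted = defaultdict(int)
    n_rounds = defaultdict(int)
    for workload, n_drafted, n_accepted in rounds:
        drafted[workload] += n_drafted
        accepted[workload] += n_accepted
        n_rounds[workload] += 1
    return {
        w: {
            # Fraction of drafted tokens that survived verification.
            "acceptance_rate": accepted[w] / drafted[w],
            # Average accepted prefix length, the quantity used in the worked example.
            "mean_accepted_per_round": accepted[w] / n_rounds[w],
        }
        for w in drafted
    }


# Made-up sample log: (workload, tokens drafted, tokens accepted) per round.
sample_log = [
    ("support", 4, 4), ("support", 4, 3),
    ("code", 4, 1), ("code", 4, 2),
]
stats = acceptance_by_workload(sample_log)
for workload, s in stats.items():
    print(workload, s)
```

Splitting by workload matters because a single blended average can hide one traffic class where speculation costs more than it saves.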

5) Why this is not the last infrastructure problem

Even great speculation does not solve everything. The target model may still be enormous. One GPU may not hold it. Or one GPU may be too slow for required throughput. Then we must split the kitchen itself across devices. That means communication between GPUs, not only smarter decode on one GPU. Next we study tensor parallelism, where one big layer is divided across many kitchens at once.

Where this lives in the wild

NVIDIA TensorRT-LLM deployments — speculative decoding can raise tokens per second when draft-target pairing is tuned well.
vLLM serving experiments — draft-target pipelines are increasingly used for chat workloads with predictable continuations.
GitHub Copilot-style code assistants — speculation is attractive, but acceptance rate can fall on tricky repository-specific code.
Chatbot backends for customer support — templated responses often yield higher acceptance and bigger wins for draft-verify schemes.
Local laptop assistants using a small helper model — a cheap draft model can hide some of the latency of a larger target model.

Pause and recall

What is the central idea of speculative decoding in one sentence?
In the worked example, why did 12 tokens need only about 4 target passes?
Which workload property matters most for whether speculation helps?
Why are rejected draft tokens never sent on the plating line?

Interview Q&A

Q: Why use speculative decoding instead of only tuning kernels harder?

A: Because speculation changes the algorithmic unit of progress. If multiple tokens get confirmed per target pass, serial decode pressure drops in a way kernel polishing alone cannot achieve.

Common wrong answer to avoid: "Because the draft model makes the target model unnecessary." The target still decides what is valid.

Q: Why is acceptance rate more important than draft-model raw speed?

A: Because cheap wrong guesses still create verification work. The win comes from confirmed tokens per expensive target pass, not from drafting nonsense quickly.

Common wrong answer to avoid: "Any tiny draft model will help." Mismatch can erase the gain.

Q: Why must the client receive only verified tokens, not draft tokens optimistically?

A: Because rejected draft tokens would violate correctness and create awkward rollbacks in the user stream. The plating line should emit confirmed output only.

Common wrong answer to avoid: "Users will not notice if we revise later." Revisions break trust and protocol simplicity.

Q: Why can speculative decoding still need multi-GPU serving?

A: Because the target model may remain too large or too throughput-hungry for one device. Draft-verify reduces the number of target passes, but it does not shrink the target's weights or KV cache at all, and the draft model adds memory of its own.

Common wrong answer to avoid: "Speculation removes the need for distributed serving." Often it just delays it.

Apply now (5 min)

Assume a draft model proposes 5 tokens each round and average accepted prefix length is 2.5. Estimate how many target passes are needed for a 20-token answer. Then compare with plain decoding. Think about how the answer changes if acceptance rises to 4. Sketch from memory:

the draft-verify diagram,
the accept-prefix idea,
and the target-pass calculation.

Bridge. We made one kitchen smarter, but some models are still too large or too slow for one device. Next we split a single model layer across many GPUs and pay the communication bill that comes with it. → 07-tensor-parallelism.md