04. Continuous batching: how modern engines keep the burners busy every decode step
~16 min read. This is the scheduling trick that turns live traffic into throughput.
Built on the ELI5 in 00-eli5.md. The batch window — the moment we group order tickets — should not open once and freeze, because finished tickets leave empty burners unless the window keeps moving.
1) Static batches waste live capacity
Picture four order tickets entering together. Two are short. Two are long. A static batch launches all four. Then the short ones finish early. Their slots stay empty until the long ones end.
decode step → 1 2 3 4 5 6 7 8 9 10
request A █ █ █ ░ ░ ░ ░ ░ ░ ░
request B █ █ █ █ ░ ░ ░ ░ ░ ░
request C █ █ █ █ █ █ █ █ █ █
request D █ █ █ █ █ █ █ █ █ █
▲ empty capacity after A and B finish
See. The kitchen still carries the old tray shape. New short tickets outside the batch must wait. The batch window opened once, then became a prison.
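The wasted capacity in the picture can be counted directly. Here is a minimal sketch, assuming each request emits one token per decode step and the batch keeps all its slots until the longest request ends. `static_batch_idle` is a hypothetical helper name, not part of any engine.

```python
def static_batch_idle(lengths):
    """Return (total_slot_steps, idle_slot_steps) for one static batch.

    Assumption: one token per request per decode step, and every slot is
    held until the longest request in the batch finishes.
    """
    duration = max(lengths)            # the batch lasts as long as its longest request
    total = duration * len(lengths)    # slot-steps reserved by the batch
    busy = sum(lengths)                # slot-steps that actually produce tokens
    return total, total - busy


# The four tickets drawn above: A=3, B=4, C=10, D=10 decode steps.
total, idle = static_batch_idle([3, 4, 10, 10])
print(f"{idle} of {total} slot-steps are idle")
```

For the four tickets drawn above, 13 of the 40 reserved slot-steps produce nothing.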
2) Continuous batching thinks in iterations, not requests
So what to do? Schedule at each decode iteration. When one request finishes, its slot immediately becomes available. A waiting ticket can enter on the next step.
step k scheduler
│
├── remove finished tickets
├── admit waiting tickets
├── build current micro-batch
└── launch one decode step
This is iteration-level scheduling. The batch window is tiny and constantly reopening. That keeps the kitchen fuller. It also lowers average waiting time for short tickets. The order ticket no longer waits for the whole tray to reset. Simple, no?
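The four scheduler steps above can be sketched as a small loop. This is an illustration under simple assumptions, not any engine's real code: `Request`, `run_scheduler`, and `max_slots` are hypothetical names, every request has at least one token to generate, and the "decode step" is just a counter decrement standing in for a model forward pass.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    remaining: int  # tokens still to generate (assumed >= 1)


def run_scheduler(requests, max_slots):
    """Iteration-level scheduling. Returns {request name: step it finished on}."""
    waiting = deque(requests)
    active = []
    finished_at = {}
    step = 0
    while True:
        # 1. remove finished tickets (they completed on the previous step)
        active = [r for r in active if r.remaining > 0]
        # 2. admit waiting tickets into the freed slots
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        if not active:
            return finished_at
        # 3. build the current micro-batch and 4. launch one decode step
        step += 1
        for r in active:
            r.remaining -= 1  # stand-in for generating one token
            if r.remaining == 0:
                finished_at[r.name] = step


result = run_scheduler(
    [Request("a", 2), Request("b", 1), Request("c", 1)], max_slots=2
)
print(result)
```

With two slots, ticket `c` takes over `b`'s slot on step 2 instead of waiting for `a` to finish.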
3) Worked example: static versus continuous
Suppose decode can process four active streams at once. Suppose request lengths are 4, 4, 10, and 10 tokens. Suppose two more 4-token requests are waiting outside.
Under static batching:
- initial batch runs for 10 steps
- short requests finish at step 4
- two slots sit idle for steps 5 through 10
- waiting requests start only after step 10
- final completion time = 10 + 4 = 14 steps
Under continuous batching:
- initial short requests finish at step 4
- two waiting requests enter at step 5
- they finish at step 8 while long requests continue
- final completion time stays 10 steps
Same kitchen. Same model. Smarter batch window. Four steps saved.
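The arithmetic above can be checked with a tiny simulation. This is a sketch, assuming one token per request per step, first-come-first-served admission, and no scheduling overhead. The function names are made up for this example.

```python
def static_makespan(lengths, slots):
    """Fixed batches of `slots` requests; each batch lasts as long as its longest member."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_makespan(lengths, slots):
    """A freed slot is refilled on the very next decode step."""
    queue = list(lengths)
    active = []  # remaining tokens for each running request
    step = 0
    while queue or active:
        # admit waiting requests into any free slots
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        step += 1
        # one decode step: every active request emits a token; finished ones leave
        active = [n - 1 for n in active if n - 1 > 0]
    return step


lengths = [4, 4, 10, 10, 4, 4]  # initial four tickets, then the two waiting ones
print("static:", static_makespan(lengths, slots=4))          # 14
print("continuous:", continuous_makespan(lengths, slots=4))  # 10
```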
4) Scheduler knobs create tradeoffs
Continuous batching is not magic. It needs policies. Common knobs include:
- maximum active sequences,
- maximum total tokens in one step,
- fairness between short and long tickets,
- TTFT (time to first token) targets for new arrivals,
- memory headroom for the prep station.
If you admit too many long contexts, you may run out of memory. If you admit only tiny tickets, large tenants starve. If you wait too long to collect a fuller batch, TTFT worsens. So what to do? Pick the product objective first. Then tune the scheduler to that goal.
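Two of these knobs, maximum active sequences and maximum tokens per step, can be sketched as an admission check. Everything here is a hypothetical illustration: `SchedulerConfig`, `admit`, and the field names are invented for this example. The token accounting is also an assumption: each already-active sequence costs 1 token this step, and each newly admitted request costs its full prompt length.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    max_active_seqs: int      # cap on sequences decoding at once
    max_tokens_per_step: int  # cap on tokens processed in one step


def admit(waiting_prompt_lens, num_active, config):
    """Return how many queued requests to admit this step, in FIFO order.

    Assumed accounting: an active sequence costs 1 token per step;
    a newly admitted request costs its whole prompt length.
    """
    tokens = num_active
    admitted = 0
    for prompt_len in waiting_prompt_lens:
        if num_active + admitted >= config.max_active_seqs:
            break  # no free slot left
        if tokens + prompt_len > config.max_tokens_per_step:
            break  # token budget for this step is spent
        tokens += prompt_len
        admitted += 1
    return admitted


config = SchedulerConfig(max_active_seqs=8, max_tokens_per_step=512)
print(admit([100, 300, 200], num_active=2, config=config))  # 2
```

Stopping at the first ticket that does not fit keeps strict arrival order, so a long ticket is never bypassed forever. The price is that short tickets behind it wait too. Letting small tickets skip ahead would improve their TTFT and push the fairness knob the other way.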
5) Why batching leads straight into memory management
Now what is the problem? Requests are entering and leaving every step. Their caches have different lengths. Contiguous memory regions become awkward fast. You free a middle chunk here. You append a long chat there. Fragmentation appears.
The prep station is now dynamic. That is wonderful for throughput. It is painful for allocation. A modern serving engine therefore needs two ideas together: continuous batching for scheduling, and paged storage for KV cache blocks. One without the other becomes messy. Next we study the memory side in detail.
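The fragmentation problem can be seen in a toy model. This is a sketch, assuming the cache is one contiguous arena of equal-sized cells and each request needs a single unbroken run of them. The arena layout and the helper name are invented for illustration.

```python
def largest_free_run(arena):
    """Length of the longest unbroken run of free (None) cells."""
    best = run = 0
    for cell in arena:
        run = run + 1 if cell is None else 0
        best = max(best, run)
    return best


# Four requests, three cells each, packed back to back.
arena = list("AAABBBCCCDDD")

# Requests A and C finish and release their cells.
arena = [None if cell in ("A", "C") else cell for cell in arena]

free_cells = arena.count(None)
print("free cells:", free_cells)                      # 6
print("largest free run:", largest_free_run(arena))   # 3
```

Six cells are free, yet a new request that needs five contiguous cells cannot be placed. That gap between free memory and usable memory is the mess described above.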
Where this lives in the wild
- vLLM-based API endpoints: continuous batching is the main reason many mixed-length requests can share one GPU efficiently.
- Hugging Face Text Generation Inference: schedulers admit and retire live requests step by step to keep throughput high.
- Anthropic-style chat workloads: short and long conversations mix constantly, so iteration-level fairness matters.
- GitHub Copilot chat backends: quick coding questions should not wait behind one giant refactor request.
- Perplexity answer clusters: streaming many simultaneous answers rewards schedulers that refill empty slots immediately.
Pause and recall
- Why do static batches leave empty GPU capacity even when the original batch was full?
- What is iteration-level scheduling in one sentence?
- In the worked example, why did continuous batching finish in 10 steps instead of 14?
- Which tradeoff gets worse if you wait too long to collect a fatter batch?
Interview Q&A
Q: Why is continuous batching better than one large fixed batch for mixed request lengths?
A: Because it reclaims finished slots immediately and admits new work on the next iteration. Fixed batches strand empty capacity until the longest request ends.
Common wrong answer to avoid: "Because the GPU likes larger batches only." The real win is dynamic slot reuse.
Q: Why not optimize only for maximum throughput and ignore TTFT?
A: Because users feel the first delay sharply. A scheduler that packs batches aggressively can improve throughput while making new arrivals feel sluggish.
Common wrong answer to avoid: "Throughput automatically implies good UX." High throughput can coexist with terrible first-token latency.
Q: Why does continuous batching make memory allocation harder, not easier?
A: Because request state becomes highly dynamic. Different-length caches appear and disappear every step, which fragments contiguous memory layouts.
Common wrong answer to avoid: "Dynamic scheduling only affects compute." It changes memory layout pressure too.
Q: Why should scheduler policy match product goals instead of one universal rule?
A: Because chat, code completion, and offline batch jobs value fairness, TTFT, and throughput differently. The right admission rule depends on what the product must protect.
Common wrong answer to avoid: "There is one best scheduler for all traffic." Objectives differ.
Apply now (5 min)
List five request lengths from a real product log or an imagined mix. Simulate them in a 4-slot static batch. Then allow new work to enter whenever a slot frees. Compare total steps and waiting time for short tickets. Sketch from memory:
- the holey static batch,
- the per-step scheduler loop,
- and the 14-step versus 10-step example.
Bridge. Dynamic scheduling fixed empty burners, but it made the prep station messy. Next we study paged attention, where KV cache is stored in smaller blocks so memory fragmentation stops ruining the plan. → 05-paged-attention.md