01. Inference bottleneck: why the obvious server disappoints instantly

~15 min read. This is the opening failure that makes every later optimization necessary.

Builds on the ELI5 in 00-eli5.md. The kitchen (the GPU doing the cooking) is fast, but a bad serving loop still keeps the burners idle and the waiters unhappy.

1) First picture: a naive serving loop

A naive inference server feels simple. Take one order ticket. Run the full model. Return the answer. Then pick the next ticket.

order ticket A          order ticket B          order ticket C
      │                       │                       │
      ▼                       ▼                       ▼
┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│ full prompt  │       │ waits idle   │       │ waits idle   │
│ full decode  │       │ for A        │       │ for A        │
└──────┬───────┘       └──────────────┘       └──────────────┘
       │
       ▼
   answer A only

Now what is the problem? The kitchen is not slow in the abstract. The schedule is slow. Short tickets get stuck behind long tickets. That is head-of-line blocking. The server cannot share the GPU between tickets. The queue grows even while GPU utilization looks bursty instead of saturated. See? Serving is a systems problem before it is a math problem.
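
The one-ticket-at-a-time loop can be sketched in a few lines of Python. This is a minimal simulation, not a real server: the `Ticket` class, the `serial_fifo_waits` helper, and the service times are all hypothetical, and every ticket is assumed to arrive at time 0.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    name: str
    service_s: float  # hypothetical time to run the full model for this ticket


def serial_fifo_waits(tickets):
    """Return {name: seconds spent queued} for a strictly one-at-a-time server."""
    clock = 0.0
    waits = {}
    for ticket in tickets:          # arrival order, no sharing of the GPU
        waits[ticket.name] = clock  # all tickets arrived at t=0, so wait == start time
        clock += ticket.service_s   # the server is busy with this ticket only
    return waits


# Hypothetical load: one long ticket ahead of two short ones.
tickets = [Ticket("A", 10.0), Ticket("B", 1.0), Ticket("C", 1.0)]
waits = serial_fifo_waits(tickets)
print(waits)
```

With these made-up numbers, tickets B and C each need one second of work but wait 10 and 11 seconds. Nothing about the model changed. Only the schedule did.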

2) Prefill and decode behave very differently

Look at one request in two phases. First we ingest the whole prompt. That phase is called prefill. Then we generate one token at a time. That phase is decode.

prompt arrives
    │
    ├──→ prefill ──→ build internal state for all prompt tokens
    │
    └──→ decode  ──→ emit token 1, token 2, token 3 ...

Prefill runs fat matrix multiplies over many prompt positions at once. Decode runs tiny repeated steps, one per output token, and each step needs the token before it. That means the kitchen sees two workloads in one order ticket. One is wide and parallel. One is skinny and serial. If your server treats them identically, you waste either compute or memory bandwidth. Simple, no? The order ticket is one API call, but the engine sees two very different chores.
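
A rough way to see "wide" versus "skinny" is to count arithmetic per byte of weights read. The sketch below rests on loud assumptions that are not from the text above: a dense model, about 2 FLOPs per parameter per token processed, fp16 weights (2 bytes per parameter), weights read once per forward pass, and attention and cache traffic ignored. The function name is hypothetical.

```python
def flops_per_weight_byte(tokens_in_pass, bytes_per_param=2):
    """Rough arithmetic intensity of one forward pass over the weights.

    Assumes ~2 FLOPs (one multiply, one add) per parameter per token,
    and that each parameter is read from memory once per pass.
    """
    flops_per_param = 2 * tokens_in_pass
    return flops_per_param / bytes_per_param


# Prefill: many prompt positions share one read of the weights.
prefill_intensity = flops_per_weight_byte(tokens_in_pass=1000)

# Decode: a single stream processes one new token per read of the weights.
decode_intensity = flops_per_weight_byte(tokens_in_pass=1)

print(prefill_intensity, decode_intensity)
```

Under these assumptions a 1,000-token prefill does about a thousand times more arithmetic per byte of weights than a single-stream decode step. The real ratio depends on the model and kernels, but the direction is the point: prefill keeps the math units busy, while decode mostly waits on memory.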

3) Worked example: where the wait time comes from

Suppose four users arrive together. Their prompt lengths are 500, 1,000, 1,000, and 4,000 tokens. Suppose answers are 100, 100, 300, and 700 tokens. A naive one-by-one server handles them serially.

If prefill speed is 20,000 prompt tokens per second, then prompt time is:

Request 1: 500 / 20,000 = 0.025 s
Request 2: 1,000 / 20,000 = 0.050 s
Request 3: 1,000 / 20,000 = 0.050 s
Request 4: 4,000 / 20,000 = 0.200 s

Suppose decode speed is 50 output tokens per second. Then decode time is:

Request 1: 100 / 50 = 2.0 s
Request 2: 100 / 50 = 2.0 s
Request 3: 300 / 50 = 6.0 s
Request 4: 700 / 50 = 14.0 s

Total serial service time is 0.325 s of prefill plus 24.0 s of decode: 24.325 seconds. Request 2 waits for Request 1 even though both are short. Request 1 spends most of its time in decode, not prefill. That is the first bottleneck clue.
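
The arithmetic above can be replayed in Python. The speeds and token counts come straight from the example. The `serial_timeline` helper and its field names are illustrative, and all four requests are assumed to arrive at time 0.

```python
PREFILL_TOK_PER_S = 20_000  # prompt tokens per second, from the example
DECODE_TOK_PER_S = 50       # output tokens per second, from the example

# (prompt_tokens, output_tokens) for Requests 1 to 4.
requests = [(500, 100), (1000, 100), (1000, 300), (4000, 700)]


def serial_timeline(requests):
    """Serve requests one by one and record when each waits, runs, and finishes."""
    clock = 0.0
    rows = []
    for prompt_tokens, output_tokens in requests:
        prefill_s = prompt_tokens / PREFILL_TOK_PER_S
        decode_s = output_tokens / DECODE_TOK_PER_S
        wait_s = clock                   # time spent behind earlier requests
        clock += prefill_s + decode_s    # the server is busy until this request ends
        rows.append({"wait_s": wait_s, "prefill_s": prefill_s,
                     "decode_s": decode_s, "finish_s": clock})
    return rows


rows = serial_timeline(requests)
total_s = rows[-1]["finish_s"]
decode_share = sum(r["decode_s"] for r in rows) / total_s

for i, r in enumerate(rows, start=1):
    print(f"Request {i}: wait {r['wait_s']:.3f} s, finish {r['finish_s']:.3f} s")
print(f"total {total_s:.3f} s, decode share {decode_share:.1%}")
```

Request 2 needs only 2.05 s of service but sits behind Request 1 for 2.025 s first. Request 4 waits 10.125 s before it starts. Decode accounts for 24.0 of the 24.325 seconds, which is nearly all of it.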

4) Why naive batching also disappoints

So what to do? You may say, “Batch everything together.” Good instinct. Badly done, it still hurts.

static batch starts
   ├── short request finishes at step 40  ──┐
   ├── short request finishes at step 60  ──┤ idle slots
   └── long request keeps going to step 400 ┘

A fixed batch window can fill the kitchen once. After that, finished requests leave holes. Those holes do no useful work. New tickets outside the batch must wait. Head-of-line blocking returns in another form. The kitchen looks full on paper. Real burners are half idle. This is why serving engines talk about iteration-level scheduling, not only batch size. The batch window must reopen continuously. We will get there soon.
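
The holes in the picture above can be counted. This is a minimal sketch that assumes one slot per request, one decode step per slot per iteration, and no refilling of finished slots. The step counts 40, 60, and 400 come from the diagram. The function name is hypothetical.

```python
def static_batch_utilization(finish_steps):
    """Fraction of slot-steps doing useful work when the batch never refills."""
    batch_steps = max(finish_steps)                   # the batch lasts as long as its longest request
    total_slot_steps = batch_steps * len(finish_steps)
    useful_slot_steps = sum(finish_steps)             # a slot is useful only until its request ends
    return useful_slot_steps / total_slot_steps


util = static_batch_utilization([40, 60, 400])
print(f"useful slot-steps: {util:.1%}")
```

For this batch, 500 of 1,200 slot-steps do useful work, a little under 42%. The other slots sit reserved but empty while new tickets wait outside.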

5) The hardware reason underneath the pain

One more picture matters. The model weights are huge. Every decode step must touch them again.

new token step
    │
    ▼
read weights from HBM ──→ run kernels ──→ write one token's logits

Suppose your model weights occupy 14 GB in fp16. Suppose the kitchen emits 40 tokens per second for one stream. Then the engine effectively rereads about 14 GB × 40 = 560 GB of weights each second, which is 560 GB/s of memory traffic for a single stream. The arithmetic per step is not the only cost. Moving weights and cache through memory dominates. That is why naive decode becomes memory-bound quickly. The prep station helps. Smarter batching helps. Better kernels help. But first we need to understand why token-by-token decoding scales badly at all. Yes?
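
The same estimate can be turned around to ask how fast one stream could possibly decode. The sketch assumes every decode step reads all weights exactly once and ignores cache traffic and compute time. The 14 GB and 40 tokens per second figures are from the paragraph above. The 1,000 GB/s bandwidth is a hypothetical round number, not the spec of any particular GPU, and both function names are illustrative.

```python
def weight_traffic_gb_per_s(weights_gb, tokens_per_s):
    """Memory traffic from rereading all weights once per generated token."""
    return weights_gb * tokens_per_s


def decode_ceiling_tokens_per_s(weights_gb, hbm_bandwidth_gb_per_s):
    """Upper bound on single-stream decode speed if weight reads were the only cost."""
    return hbm_bandwidth_gb_per_s / weights_gb


traffic = weight_traffic_gb_per_s(weights_gb=14, tokens_per_s=40)

HYPOTHETICAL_HBM_GB_PER_S = 1000  # assumed for illustration only
ceiling = decode_ceiling_tokens_per_s(14, HYPOTHETICAL_HBM_GB_PER_S)

print(f"{traffic} GB/s of weight reads, ceiling about {ceiling:.0f} tokens/s")
```

At the assumed bandwidth, a single stream tops out near 71 tokens per second no matter how many FLOPs the GPU advertises. Memory sets the limit, not arithmetic.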

Where this lives in the wild

ChatGPT-style assistant backends — long answers make decode dominate even when prompt ingestion is fast.
GitHub Copilot inline completion — short tickets suffer if long code-generation jobs occupy the same kitchen too rigidly.
Perplexity answer generation — retrieval-heavy prompts create large prefills followed by slow token streaming.
Character.AI chat sessions — many simultaneous short conversations expose queueing and batching waste immediately.
Intercom Fin support replies — one slow tenant with long contexts can block faster tickets without careful scheduling.

Pause and recall

Why is one inference request really two workloads instead of one?
In the worked example, which phase dominated total time: prefill or decode?
Why can static batching still leave GPU slots idle?
Why does reading weights from memory matter even when the GPU has huge FLOP numbers?

Interview Q&A

Q: Why does a powerful GPU still feel slow under naive serving, even when the model itself is not “bad”?
A: Because the serving loop can serialize unrelated tickets, leave batch holes, and force repeated weight reads during decode. Hardware capability does not rescue poor scheduling.
Common wrong answer to avoid: "If latency is high, the model architecture must be weak." Serving policy often dominates.

Q: Why separate prefill and decode instead of tracking only total latency?
A: Because they stress the kitchen differently. Prefill is wide and parallel. Decode is serial and often memory-bound. The best fix for one phase may not help the other.
Common wrong answer to avoid: "Latency is one number, so one fix is enough." The phases behave very differently.

Q: Why is static batching not enough for production traffic?
A: Because requests finish at different times. Static batches strand empty capacity until the longest request completes, which hurts both throughput and fairness.
Common wrong answer to avoid: "Any batching is automatically efficient." Bad batching can still waste most of the batch.

Q: Why is decode usually the visible pain point, not prompt ingestion?
A: Prompt ingestion is one large parallel pass. Decode repeats the full stack for every new token. User-visible waiting accumulates there.
Common wrong answer to avoid: "The prompt is bigger, so it must be slower." Bigger once is often cheaper than tiny repeated work many times.

Apply now (5 min)

Take one endpoint you know well. Split its latency into queue time, prefill time, and decode time. Estimate output tokens per request. Then estimate how much of total time sits in decode.

Sketch from memory:
- the serial-ticket diagram,
- the prefill-vs-decode split,
- and the static-batch hole picture.

Bridge. The obvious failure is repeated work inside decode itself. Next we zoom in on autoregressive generation and show exactly why naive decoding scales so poorly. → 02-autoregressive-decode-cost.md