
11. Streaming token delivery — how the plating line makes latency feel shorter than the full answer time

~15 min read. Good serving is not only fast compute; it is fast visible progress.

Built on the ELI5 in 00-eli5.md. The plating line is what the diner actually sees, so even a strong kitchen feels slow if nothing leaves the pass for too long.


1) Picture first: visible progress versus silent waiting

Suppose an answer takes five seconds in total.

One server waits, then sends everything at once.

Another server sends the first token in 700 ms, then keeps streaming.

The second feels much faster.

silent response:   [........5.0s........] final answer
streaming:         [0.7s] token token token token ... final answer

See the difference?

The full work may be similar.

Perceived latency is different.

The plating line changes the user experience dramatically.


2) SSE, WebSocket, and plain chunked HTTP

Common transport choices are:

  • Server-Sent Events for simple one-way token streams,

  • WebSocket for bidirectional low-latency interaction,

  • chunked HTTP responses for simpler streaming cases.

SSE is often enough for standard assistant output.

WebSocket becomes attractive when the client sends live interrupts, audio chunks, or tool state updates.

Plain chunked HTTP can work, but you have to define your own message framing and reconnection handling, which SSE already provides.

Simple, no?

Pick the transport based on interaction shape, not hype.
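To make the SSE option concrete, here is a minimal sketch of the wire format: each message is one or more `data:` lines followed by a blank line. The helper names `sse_event` and `stream_tokens`, and the `done` event name, are assumptions for illustration, not part of any particular server framework.

```python
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events message as text.

    Each payload line gets its own "data:" field, and a blank line
    terminates the message.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be split across data: fields.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE message per token, then a hypothetical 'done' event."""
    for token in tokens:
        yield sse_event(token)
    # The "done" event name is an assumption; real APIs pick their own.
    yield sse_event("", event="done")


if __name__ == "__main__":
    for message in stream_tokens(["Hel", "lo", " world"]):
        print(repr(message))
```

A web framework would write each yielded string to the response body and flush it. The client side in a browser is typically an `EventSource` or a streaming `fetch` reader.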


3) Worked example: TTFT versus total completion time

Suppose TTFT is 700 ms.

Suppose generation rate after that is 40 tokens per second.

Suppose the final answer has 200 tokens.

Total time to complete is:

  • 0.7 s + 200 / 40 s

  • = 0.7 + 5.0

  • = 5.7 s

Without streaming, the user sees nothing until 5.7 s.

With streaming, the user sees progress at 0.7 s.

That is a 5.0-second difference in time to first visible output.

The kitchen work did not change.

The plating line did.
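The same arithmetic as a small sketch, so you can plug in other numbers. It assumes the simple model used above (a fixed TTFT, then a constant decode rate); the function names are made up for illustration.

```python
def completion_time(ttft_s: float, num_tokens: int, tokens_per_s: float) -> float:
    """Total time = TTFT + decode time, as in the worked example."""
    return ttft_s + num_tokens / tokens_per_s


def first_visible_time(ttft_s: float, num_tokens: int,
                       tokens_per_s: float, streaming: bool) -> float:
    """When the user first sees any output."""
    if streaming:
        return ttft_s  # the first token is shown as soon as it exists
    # Without streaming, nothing is shown until the whole answer is done.
    return completion_time(ttft_s, num_tokens, tokens_per_s)


if __name__ == "__main__":
    total = completion_time(0.7, 200, 40.0)
    gap = (first_visible_time(0.7, 200, 40.0, streaming=False)
           - first_visible_time(0.7, 200, 40.0, streaming=True))
    print(f"total: {total:.1f} s, perception gap: {gap:.1f} s")
    # prints: total: 5.7 s, perception gap: 5.0 s
```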


4) Operational details people forget

Streaming creates new design questions:

  • How often do you flush?

  • Do you send every token on its own, or buffer small chunks?

  • How do you moderate partial output?

  • What happens if the client disconnects?

  • How do you stream JSON safely?
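One possible answer to the flush and buffering questions is to coalesce tokens into small chunks and flush on whichever comes first: a token count or a maximum wait. The sketch below is an assumption-heavy illustration. The name `coalesce` and the thresholds are invented, and it takes pre-recorded `(timestamp, token)` pairs so the behaviour is deterministic. A real server would flush on a timer, not only when the next token arrives.

```python
from typing import Iterable, Iterator, List, Tuple


def coalesce(timed_tokens: Iterable[Tuple[float, str]],
             max_tokens: int = 4,
             max_wait_s: float = 0.05) -> Iterator[str]:
    """Group tokens into chunks, flushing on size or age.

    timed_tokens: (arrival_time_in_seconds, token_text) pairs in order.
    """
    buffer: List[str] = []
    first_ts = 0.0  # arrival time of the oldest token in the buffer
    for ts, token in timed_tokens:
        if buffer and ts - first_ts >= max_wait_s:
            # The oldest buffered token has waited long enough: flush first.
            yield "".join(buffer)
            buffer = []
        if not buffer:
            first_ts = ts
        buffer.append(token)
        if len(buffer) >= max_tokens:
            # The size limit was reached: flush immediately.
            yield "".join(buffer)
            buffer = []
    if buffer:
        # End of generation: flush whatever is left.
        yield "".join(buffer)


if __name__ == "__main__":
    arrivals = [(0.00, "a"), (0.01, "b"), (0.02, "c"),
                (0.03, "d"), (0.04, "e"), (0.20, "f")]
    print(list(coalesce(arrivals)))
```

Smaller chunks give a smoother cadence at the cost of more writes; larger chunks do the opposite. The right setting depends on your transport and UI.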

Look. A great backend can still feel broken if the client renders chunks awkwardly, or if proxies buffer the stream unexpectedly. End-to-end streaming means transport, server, and UI all cooperate.
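One common guard against unexpected proxy buffering is to send response headers that ask intermediaries not to cache or buffer the stream. The sketch below assumes an SSE response. The function name is hypothetical, and `X-Accel-Buffering: no` is a hint honored by nginx specifically; other proxies and CDNs need their own configuration.

```python
from typing import Dict


def sse_response_headers() -> Dict[str, str]:
    """Headers commonly sent with an SSE stream (a sketch, not a full list)."""
    return {
        # Tells the client this is a Server-Sent Events stream.
        "Content-Type": "text/event-stream",
        # Discourages caches from storing or replaying the stream.
        "Cache-Control": "no-cache",
        # nginx-specific hint to disable response buffering for this reply.
        "X-Accel-Buffering": "no",
    }


if __name__ == "__main__":
    for name, value in sse_response_headers().items():
        print(f"{name}: {value}")
```

Headers alone do not prove the path is unbuffered. Measure arrival times at the real client to confirm it.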


5) Metrics that matter for streaming UX

Track at least these:

  • time to first token,

  • inter-token latency,

  • total completion time,

  • disconnect rate,

  • client render delay.
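The first three metrics can be derived from per-token arrival timestamps. A minimal sketch follows, assuming you record the request send time and each token's arrival time on the same clock; the function and key names are made up for illustration. Disconnect rate and client render delay need server logs and client-side instrumentation, so they are not covered here.

```python
from statistics import mean
from typing import Dict, List


def stream_metrics(request_ts: float, token_ts: List[float]) -> Dict[str, float]:
    """Compute TTFT, inter-token latency, and total time for one response.

    request_ts: when the request was sent (seconds).
    token_ts:   arrival time of each token, in order (seconds).
    """
    if not token_ts:
        raise ValueError("no tokens were received")
    # Gaps between consecutive tokens; empty for a one-token answer.
    gaps = [later - earlier for earlier, later in zip(token_ts, token_ts[1:])]
    return {
        "ttft_s": token_ts[0] - request_ts,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "max_itl_s": max(gaps) if gaps else 0.0,
        "total_s": token_ts[-1] - request_ts,
    }


if __name__ == "__main__":
    print(stream_metrics(10.0, [10.5, 10.75, 11.0, 12.0]))
```

The maximum gap is worth tracking next to the mean, because a single long stall is what the user notices.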

A serving team that reports only total latency misses the entire plating-line story. Streaming success is about cadence, not only final finish time. Next we ask how to benchmark all this honestly, so we do not confuse one nice demo with real serving quality.


Where this lives in the wild

  • ChatGPT web responses — SSE-style token streaming is central to making long answers feel alive.

  • Anthropic Messages streaming API — clients receive structured partial events, not only one final text blob.

  • GitHub Copilot chat — fast first visible tokens help developers keep flow while the full explanation arrives.

  • Perplexity answer UI — progressive text and citations make a long answer feel responsive.

  • Realtime support copilots — supervisors value immediate partial drafts even before the final wording settles.


Pause and recall

  • Why can two answers with identical total latency feel very different to the user?

  • When is SSE usually enough, and when might WebSocket be better?

  • In the worked example, what was the total completion time and the first visible time?

  • Why is inter-token cadence a real metric, not a cosmetic detail?


Interview Q&A

Q: Why stream tokens instead of only optimizing total completion time?

A: Because users experience silence as slowness. Lower TTFT and steady cadence often improve satisfaction even when final completion time changes little.

Common wrong answer to avoid: "Only total latency matters." Human perception cares strongly about early feedback.

Q: Why choose SSE over WebSocket in many text assistant cases?

A: Because token delivery is often a simple server-to-client stream. SSE is lighter operationally when you do not need continuous bidirectional signaling.

Common wrong answer to avoid: "WebSocket is always more advanced, so it is always better." Simpler protocols are often enough.

Q: Why is streaming JSON harder than streaming plain text?

A: Because partially emitted structures may be invalid until the full object arrives. Clients and validators must handle fragments carefully.

Common wrong answer to avoid: "A token stream is a token stream." Structured outputs need stricter handling.
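One way to handle fragments carefully is to frame the stream as newline-delimited JSON and parse only complete lines. This is a sketch of that one approach, not the only option. The framing choice, the function name, and the `delta` field are assumptions for illustration.

```python
import json
from typing import Any, Iterable, Iterator


def parse_ndjson_stream(chunks: Iterable[str]) -> Iterator[Any]:
    """Yield parsed objects from text chunks that may split a JSON line."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are safe to parse; keep the trailing fragment.
        *complete, buffer = buffer.split("\n")
        for line in complete:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        # The stream ended without a final newline: parse what remains.
        yield json.loads(buffer)


if __name__ == "__main__":
    chunks = ['{"delta": "Hel', 'lo"}\n{"del', 'ta": " world"}\n']
    for obj in parse_ndjson_stream(chunks):
        print(obj)
```

The chunk boundaries in the example fall in the middle of an object on purpose. Calling `json.loads` on each raw chunk would fail.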

Q: Why can a good stream still feel janky?

A: Because proxies may buffer, the UI may render badly, or chunking cadence may be uneven. Serving UX is an end-to-end property.

Common wrong answer to avoid: "If the backend emits tokens, the job is done." Client plumbing matters too.


Apply now (5 min)

Take a 150-token answer and assume 600 ms TTFT with 30 tokens per second after that. Compute total completion time. Then write what the user sees at 0.6 s, 2 s, and 5 s. Sketch from memory:

  • the silent-versus-streaming timeline,

  • the transport choices,

  • and the TTFT equation.


Bridge. Once tokens are flowing, the next question is measurement. Next we study load testing and benchmarking so throughput, TTFT, and percentile claims mean something trustworthy. → 12-load-testing-benchmarking.md