05. Streaming Responses: send tokens through the pass window

~15 min read. Chat feels alive when the first token arrives early, not when the full essay finally lands.

Builds on the ELI5 in 00-eli5.md. The pass window, where dishes leave the kitchen, is the perfect picture for SSE token streaming.

First picture: one-way flow beats silent waiting

Look at the shape first. In a normal JSON response, the client waits for the whole body. In streaming, the server sends chunks as they become ready.

normal response
request ──→ work ──→ full body ready ──→ send once

streaming response
request ──→ work ──→ chunk 1 ──→ chunk 2 ──→ chunk 3 ──→ done
                        │
                        └── user sees progress early

For AI chat, this is huge. The model may need seconds to finish. But users judge responsiveness by first-token latency, the time until the first piece of output appears. If the pass window opens in 200 milliseconds, people feel the system is fast. Simple, no?

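SSE means Server-Sent Events. It is one-way streaming over plain HTTP. Browsers support it natively through the EventSource API. Infra support is usually easier than for WebSockets, because the stream is an ordinary long-lived HTTP response with no protocol upgrade. For token-by-token output, SSE is often enough.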

SSE framing in one tiny example

Before code, see the wire shape. An SSE stream is text. Each event is a short run of field: value lines, most often event: and data:. A blank line ends the event.

HTTP response body

event: token
data: Hel

event: token
data: lo

event: done
data: [DONE]

That is it. No complex binary framing. No bidirectional channel. Just a server pushing events down an open response.
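The framing rule fits in one small helper. This is a minimal sketch, and format_sse is a hypothetical name, not something FastAPI provides. It also handles the one subtle case: a payload that contains newlines must be split across several data: lines, or the blank-line rule would end the event early.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one SSE event as text (hypothetical helper for illustration)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # One "data:" line per payload line, so embedded newlines cannot
    # be mistaken for the blank line that terminates the event.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # The trailing blank line marks the end of the event.
    return "\n".join(lines) + "\n\n"


single = format_sse("Hel", event="token")
multi = format_sse("line one\nline two")
```

Here single is exactly the first event in the wire example above. multi carries two data: lines, which an SSE client joins back with a newline.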

Now a FastAPI example.

import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

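async def fake_token_stream():
    # Each yielded string is one complete SSE event: fields, then a blank line.
    for token in ["Hel", "lo", " ", "world"]:
        yield f"event: token\ndata: {token}\n\n"
        await asyncio.sleep(0.2)  # simulate the gap between model tokens
    # A final sentinel event tells the client the stream finished cleanly.
    yield "event: done\ndata: [DONE]\n\n"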

@app.get("/stream")
async def stream() -> StreamingResponse:
    return StreamingResponse(fake_token_stream(), media_type="text/event-stream")

See. The generator is the line cook. Each yield pushes another plate to the pass window. The browser can render incrementally.
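To see what the receiving side does with those frames, here is a simplified parser sketch in Python. It is a stand-in for the browser's own parsing, not a full implementation: it handles only the event and data fields, and parse_sse is a hypothetical name.

```python
from typing import Iterable, Iterator, Tuple


def _field_value(line: str, name: str) -> str:
    """Return the text after 'name:', dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from SSE text lines (simplified sketch)."""
    event, data = "message", []  # "message" is the default event type
    for line in lines:
        if line == "":
            # Blank line: the event is complete, so dispatch it.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = _field_value(line, "event")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data"))
        # Other lines (comments, id, retry) are ignored in this sketch.


body = "event: token\ndata: Hel\n\nevent: token\ndata: lo\n\nevent: done\ndata: [DONE]\n\n"
events = list(parse_sse(body.split("\n")))
```

Each pair comes out as soon as its blank line arrives. That is why the UI can paint "Hel" before "lo" has even left the kitchen.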

Bridging vendor token streams to your client

Real systems rarely invent tokens locally. They receive streamed chunks from an LLM provider, then relay them to the browser. That means your API is both a streaming client and a streaming server.

Picture that bridge.

browser SSE client
      ▲
      │  token events
┌─────┴─────────────┐
│ your FastAPI app  │
└─────┬─────────────┘
      │  provider chunks
      ▼
LLM provider stream

Now the important rule. Do not buffer the whole provider output first. That defeats streaming. As soon as the upstream token arrives, transform it if needed, then forward it.

Worked example in words. OpenAI or Anthropic sends delta chunks. Your route awaits each chunk. Your generator yields an SSE event immediately. The pass window stays warm. The user sees typing. That is the product feel you want.
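The same relay as a minimal sketch. fake_provider_stream is a hypothetical stand-in for a vendor SDK stream, since real SDK chunk shapes differ by provider. The point to notice is that relay_as_sse yields inside the loop and never collects deltas into a list first.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_provider_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a vendor SDK stream of text deltas."""
    for delta in ["Str", "eam", "ing"]:
        await asyncio.sleep(0.01)  # pretend the model is still generating
        yield delta


async def relay_as_sse(upstream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Forward each upstream delta as one SSE event the moment it arrives."""
    async for delta in upstream:
        yield f"event: token\ndata: {delta}\n\n"
    yield "event: done\ndata: [DONE]\n\n"


async def collect() -> List[str]:
    # Collecting is only for this demo; a real route would hand the
    # generator straight to the streaming response.
    return [frame async for frame in relay_as_sse(fake_provider_stream())]


frames = asyncio.run(collect())
```

In a route, relay_as_sse(...) would take the place of fake_token_stream() from the earlier FastAPI example.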

You may also add metadata events. One event for message_start. One for each token delta. One final event for usage, finish reason, or citations. That keeps the client logic clean.
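Here is a sketch of why typed events keep client logic clean, written in Python as a stand-in for frontend code. The event names message_start and usage come from the paragraph above. The name delta and the JSON field names (id, text, output_tokens) are assumptions for illustration.

```python
import json


def apply_event(state: dict, event: str, data: str) -> dict:
    """Update client-side state from one typed event (illustrative only)."""
    payload = json.loads(data)
    if event == "message_start":
        state["id"] = payload["id"]
    elif event == "delta":
        state["text"] += payload["text"]  # grow the visible answer
    elif event == "usage":
        state["usage"] = payload
    return state


state = {"id": None, "text": "", "usage": None}
stream = [
    ("message_start", '{"id": "msg_1"}'),
    ("delta", '{"text": "Hel"}'),
    ("delta", '{"text": "lo"}'),
    ("usage", '{"output_tokens": 2}'),
]
for event, data in stream:
    apply_event(state, event, data)
```

One branch per event type. The client never has to guess whether a chunk is answer text or bookkeeping.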

Backpressure, disconnects, and buffering traps

Now what is the problem? Streaming looks easy in local demos. Production adds hidden traps.

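Trap one is proxy buffering. A reverse proxy may hold chunks and flush them late, so your code streams but the user still sees nothing. So what to do? Disable buffering for the streaming route in the proxy configuration (nginx, for example, has a proxy_buffering directive for this). Then test through the full production path.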
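The application can also help from its side with response headers. A small sketch follows, assuming nginx sits in front of the app. Other proxies and CDNs have their own switches, so treat this as one example and not a complete fix. SSE_HEADERS is a hypothetical name.

```python
# Response headers commonly sent with an SSE stream (a sketch, not a guarantee
# that every proxy on the path will stop buffering).
SSE_HEADERS = {
    "Cache-Control": "no-cache",  # ask caches not to store or replay the stream
    "X-Accel-Buffering": "no",    # nginx: do not buffer this proxied response
}

# Assumed usage with the /stream route shown earlier:
# StreamingResponse(
#     fake_token_stream(),
#     media_type="text/event-stream",
#     headers=SSE_HEADERS,
# )
```

Headers only express a request. The end-to-end test through the real path is still what proves the pass window is open.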

Trap two is client disconnect. The browser tab closes. But your upstream model stream keeps running. Now money burns for nobody. The cancel bell must ring. Your generator should stop when the request is cancelled, and it should close the upstream stream on its way out. We will study that in depth in the timeout file.
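A minimal sketch of the cancel bell follows. It assumes the server closes or cancels the response generator when the client goes away, and that the upstream stream exposes an explicit close method. FakeUpstream and its aclose method are hypothetical stand-ins, and the demo simulates the disconnect by closing the generator by hand.

```python
import asyncio


class FakeUpstream:
    """Hypothetical provider stream that must be closed to stop billing."""

    def __init__(self) -> None:
        self.closed = False

    async def deltas(self):
        for i in range(1000):
            await asyncio.sleep(0)
            yield f"tok{i}"

    async def aclose(self) -> None:
        self.closed = True


async def relay(upstream: FakeUpstream):
    try:
        async for delta in upstream.deltas():
            yield f"event: token\ndata: {delta}\n\n"
    finally:
        # Runs on normal completion, on cancellation, and on early close.
        await upstream.aclose()


async def demo():
    upstream = FakeUpstream()
    gen = relay(upstream)
    first = await gen.__anext__()  # client receives one token
    await gen.aclose()             # simulate the stream being dropped
    return first, upstream.closed


first, closed = asyncio.run(demo())
```

The try/finally is the whole trick. Whatever ends the stream, the upstream gets told to stop.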

Trap three is event shape drift. Frontend expects event: token. Backend sends raw JSON lines. Now the UI parser breaks mid-stream. Treat streaming contracts like any API schema.
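One way to hold the contract steady is to build every frame through a single function that knows the agreed event types. This is a sketch with a hypothetical event list. A real project might share a schema file between backend and frontend instead.

```python
import json

# Hypothetical contract: the only event types the frontend agreed to parse.
ALLOWED_EVENTS = {"start", "token", "usage", "error", "done"}


def sse_event(event: str, payload: dict) -> str:
    """Build one frame, refusing event types outside the contract."""
    if event not in ALLOWED_EVENTS:
        raise ValueError(f"unknown SSE event type: {event!r}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


good = sse_event("token", {"text": "Hi"})

try:
    sse_event("tok", {"text": "Hi"})  # typo in the event name
    rejected = False
except ValueError:
    rejected = True
```

A typo now fails loudly on the server in tests, not silently in the user's browser mid-stream.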

Trap four is over-chatty flushing. Every event carries its own framing bytes and costs a write, so sending one character per event adds overhead without helping the reader. Chunking by word or small token batch is often better. Latency and efficiency both matter.
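A simple size-based batcher as a sketch. batch_tokens and the min_chars threshold are illustrative choices, not a recommendation for any particular value.

```python
from typing import Iterable, Iterator


def batch_tokens(tokens: Iterable[str], min_chars: int = 8) -> Iterator[str]:
    """Group small tokens into chunks of at least min_chars characters."""
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= min_chars:
            yield buffer
            buffer = ""
    if buffer:
        # Flush whatever is left so the tail of the answer is never lost.
        yield buffer


chars = ["S", "t", "r", "e", "a", "m", "i", "n", "g", "!"]
batches = list(batch_tokens(chars, min_chars=4))
```

Ten one-character events become three. Size alone can stall the last few tokens when the upstream is slow, so a production version would likely also flush on a short timer.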

A production-minded pattern

A good streaming endpoint often does this. Validate request. Start upstream stream. Yield a start event. Relay token deltas. Catch provider errors. Yield a structured error event if safe. Always finish with a done event or a clean close.

request
  │
  ├── validate body
  ├── open upstream stream
  ├── send start event
  ├── relay token events
  ├── send usage event
  └── send done event

Look. The front desk still matters. Bad inputs should fail before the stream opens. The pass window should carry well-defined event types. The cancel bell should stop wasted upstream work quickly. That is the full shape.
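Here is the full shape as one sketch. Everything named here is an assumption for illustration: fake_upstream stands in for a provider SDK, the event names follow the diagram above, and the payload fields are invented. Validation runs before the generator starts, so bad input can still fail as a normal HTTP error.

```python
import asyncio
import json
from typing import AsyncIterator, List


def frame(event: str, payload: dict) -> str:
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def validate(prompt: str) -> str:
    """Front desk: reject bad input before any stream opens."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return prompt


async def fake_upstream(prompt: str, fail: bool = False) -> AsyncIterator[str]:
    """Hypothetical provider stream; a real SDK would replace this."""
    for delta in ["Hi", " there"]:
        await asyncio.sleep(0)
        yield delta
    if fail:
        raise RuntimeError("provider dropped the connection")


async def chat_events(prompt: str, fail: bool = False) -> AsyncIterator[str]:
    count = 0
    yield frame("start", {"prompt_chars": len(prompt)})
    try:
        async for delta in fake_upstream(prompt, fail):
            count += 1
            yield frame("token", {"text": delta})
        yield frame("usage", {"deltas": count})
    except RuntimeError:
        # Generic message only; provider internals stay on the server.
        yield frame("error", {"message": "upstream failed"})
    yield frame("done", {})


async def run(prompt: str, fail: bool = False) -> List[str]:
    return [f async for f in chat_events(validate(prompt), fail)]


ok = asyncio.run(run("hello"))
failed = asyncio.run(run("hello", fail=True))
```

Both runs end with a done event. The failing run swaps the usage event for a structured error, so the client always knows how the stream ended.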

SSE versus WebSockets in brief.

If you only need server-to-client updates, SSE is often simpler. If you need both sides speaking continuously, WebSockets fit better. We will cover WebSockets in the file after next. For token streaming, SSE is the practical default.

Where this lives in the wild

OpenAI ChatGPT web app — backend engineer: SSE-style token delivery makes GPT responses feel instant even when full generation takes seconds.
Anthropic Console — product engineer: streamed completions let users watch Claude answer live instead of waiting for a silent full payload.
Perplexity answer UI — frontend platform engineer: partial citations and answer text arrive incrementally for better perceived speed.
GitHub Copilot Chat web surfaces — API engineer: token streaming keeps editor and browser experiences responsive during long generations.
Customer support AI console — full-stack engineer: agents see draft answers grow live, so they can interrupt or edit earlier.

Pause and recall

Why does SSE improve perceived performance even if total generation time stays similar?
What is the basic text framing rule for one SSE event?
Why is buffering the upstream model output before relaying it a design mistake?
In the kitchen analogy, what exactly is the pass window doing during token streaming?

Interview Q&A

Q: Why choose SSE over a normal JSON response for chat generation?
A: Because users value early partial output, and SSE lets the server push incremental tokens over ordinary HTTP with minimal protocol complexity.
Common wrong answer to avoid: "Because SSE makes the model generate faster."

Q: Why can a streaming demo work locally but fail in production?
A: Reverse proxies, CDN buffering, and client parsing assumptions can delay or reshape chunks, so end-to-end testing matters more than handler logic alone.
Common wrong answer to avoid: "If yield works in FastAPI, production streaming is automatically correct."

Q: Why should a relay endpoint forward upstream chunks immediately instead of buffering?
A: Immediate forwarding preserves low first-token latency and interactive feel, while buffering throws away the main product benefit of streaming.
Common wrong answer to avoid: "Buffering is safer because the client only needs one final event."

Q: Why is disconnect handling essential for streamed LLM responses?
A: Because once the client leaves, continued upstream generation wastes tokens, money, and worker capacity unless cancellation propagates.
Common wrong answer to avoid: "The TCP connection closing automatically stops every upstream dependency."

Apply now (5 min)

Exercise. Write a tiny async generator that yields three SSE events. One start, one token, and one done. Then describe what the browser would receive line by line.

Sketch from memory. Draw the bridge from provider stream, through your app, out to the pass window. Label where buffering would hurt.

Bridge. Streaming helps short interactive work. But some jobs should leave the request path entirely. That takes us to background tasks and queues. → 06-background-tasks.md