02. Latency and streaming UX¶

Trust is the destination. The first concrete pattern on the way there is latency UX: AI is often slow, and perceived speed is a design outcome shaped by streaming, indicators, and expectation-setting.

A platform engineer at a Bengaluru SaaS company adds an AI feature. The model returns in 4 seconds; the UX shows a blank screen for 4 seconds then a block of text. Users perceive the feature as broken; many click away before the response arrives. The engineer adds streaming — the response appears as it generates, token by token. Median time-to-first-token is 600ms; users see motion within a second; perceived latency drops dramatically even though total time is similar. Adoption climbs.

This chapter covers the latency UX discipline.

What latency UX is¶

Latency UX is the design discipline that manages user perception of AI response time through streaming, indicators, and expectation-setting — recognising that perceived speed is design, not just measurement.

The total time matters; the perceived time matters more. UX patterns shift the perception.

Streaming responses¶

The single highest-leverage latency UX pattern. The response appears as it is generated, token by token, rather than as a block when complete.

Benefits:

Fast time-to-first-token. The user sees motion within ~600ms; the AI is responding.
Reading-in-parallel. The user reads as the response generates; total wait time for the user to finish reading is less than for a block presentation.
Engagement. The motion holds attention; users do not click away.

Tradeoffs:

Implementation complexity. Streaming requires server-sent events or websockets; SDK and UI changes.
Errors mid-stream. A streaming response that fails partway is harder to handle than a complete one.
No final-form preview. The user does not see the complete response until the stream finishes, and a token-by-token reveal can be jarring for short responses.

For most chat-style interactions, streaming is the default. Non-streaming is reserved for short responses where streaming adds no value.
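The gap between time-to-first-token and total time can be made measurable. The following is a minimal sketch, assuming the response arrives as an iterable of text fragments; `render_stream`, `StreamTiming`, and the `on_fragment` callback are hypothetical names for illustration, not part of any particular SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StreamTiming:
    time_to_first_token: float  # seconds until the first fragment was shown
    total_time: float           # seconds until the stream finished
    text: str                   # the full response as rendered


def render_stream(
    tokens: Iterable[str],
    on_fragment: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> StreamTiming:
    """Show each fragment as it arrives and record both latency numbers."""
    start = clock()
    first = None
    parts: List[str] = []
    for token in tokens:
        if first is None:
            # The moment the user first sees motion.
            first = clock() - start
        parts.append(token)
        on_fragment(token)  # hypothetical UI hook: append the fragment to the view
    total = clock() - start
    # An empty stream has no first token; report the total in that case.
    return StreamTiming(first if first is not None else total, total, "".join(parts))
```

Tracking both numbers separately matters because streaming improves the first while leaving the second roughly unchanged.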

Loading indicators¶

For non-streaming responses or for the pre-first-token window of streaming responses, indicators communicate "the system is working":

Skeleton states. Placeholder UI that shows the shape of what is coming.
Spinner with context. "Searching documents..."; "Analysing request..." — the indicator describes what is happening.
Progress bar. For long-running tasks with knowable stages.

Avoid:

Generic spinners with no context. "Loading..." for 4 seconds feels broken.
Indeterminate spinners that don't move. Worse than no spinner.
Optimistic UI without recovery. Showing a "done" state before completion produces confusion when the actual response arrives.
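One way to keep indicators contextual is to derive the label from named stages rather than a fixed string. A minimal sketch, assuming the feature can report which stage it is in; `indicator_label` and the stage names are illustrative assumptions.

```python
from typing import Sequence


def indicator_label(stages: Sequence[str], current: int) -> str:
    """Return a contextual label for the stage currently running."""
    if not stages:
        # A bare "Loading..." is exactly what this helper is meant to avoid.
        raise ValueError("name at least one stage")
    if not 0 <= current < len(stages):
        raise IndexError("current stage is out of range")
    label = f"{stages[current]}..."
    if len(stages) > 1:
        # Knowable stages let the indicator double as a coarse progress bar.
        label += f" (step {current + 1} of {len(stages)})"
    return label
```

For example, `indicator_label(["Searching documents", "Analysing request"], 0)` yields a label that says what is happening and how far along the work is.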

Expected latency communication¶

For features with known long latency (deep research, batch processing), communicate the expected wait time:

"This usually takes about 30 seconds..."
"Generating your report..."
For very long tasks (minutes): "We'll notify you when it's ready."

The communication sets expectations. A 30-second wait that was advertised feels different from a 30-second wait that surprised.
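The choice of message can be driven by the expected duration. A minimal sketch follows; the 5-second and 120-second thresholds are assumptions chosen for illustration, not values from this chapter, and `expectation_message` is a hypothetical helper.

```python
from typing import Optional

# Assumed thresholds, in seconds; tune them per product.
ANNOUNCE_FROM_S = 5.0     # below this, an ordinary indicator is enough
NOTIFY_FROM_S = 120.0     # at or above this, stop asking the user to wait


def expectation_message(expected_seconds: float) -> Optional[str]:
    """Pick the expectation-setting copy for a task of known typical duration."""
    if expected_seconds < ANNOUNCE_FROM_S:
        return None
    if expected_seconds < NOTIFY_FROM_S:
        return f"This usually takes about {round(expected_seconds)} seconds..."
    return "We'll notify you when it's ready."
```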

Latency budgets per workload¶

Different workloads have different acceptable latency:

Workload	Acceptable time-to-first-token	Acceptable total time
Interactive chat	<1s	<5s
Inline suggestion	<300ms	<2s
Summary or explanation	<1s	<10s
Deep research / report	<2s	<60s (with progress)
Batch task	N/A	Notification when done

The latency budget informs both the technical design (which model, what canary tolerances) and the UX (when to show progress, when to background, when to require notification).
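The table can be encoded as data so that measurements are checked against it. A minimal sketch, assuming the workload keys and the `within_budget` helper below, which are illustrative names; the numbers are the ones from the table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LatencyBudget:
    ttft_s: Optional[float]   # acceptable time-to-first-token; None = not applicable
    total_s: Optional[float]  # acceptable total time; None = notify when done


BUDGETS: Dict[str, LatencyBudget] = {
    "interactive_chat": LatencyBudget(1.0, 5.0),
    "inline_suggestion": LatencyBudget(0.3, 2.0),
    "summary": LatencyBudget(1.0, 10.0),
    "deep_research": LatencyBudget(2.0, 60.0),  # total assumes progress is shown
    "batch": LatencyBudget(None, None),
}


def within_budget(workload: str, ttft_s: float, total_s: float) -> bool:
    """True when both measurements are under the workload's budget."""
    budget = BUDGETS[workload]
    ttft_ok = budget.ttft_s is None or ttft_s < budget.ttft_s
    total_ok = budget.total_s is None or total_s < budget.total_s
    return ttft_ok and total_ok
```

The same measurement can pass for one workload and fail for another: a 600ms first token is fine for chat but too slow for an inline suggestion.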

What to do when latency exceeds budget¶

If the AI is taking longer than expected:

Show progress. Update the indicator: "Still working... we're checking a few more sources."
Offer cancellation. A cancel button gives the user control.
Time out with grace. At some upper bound (e.g., 60s for an interactive chat), surface "this is taking longer than expected; would you like to try a simpler question?"
Background and notify. For tasks expected to take long, move to background after a threshold; notify when done.

The discipline is to never leave the user staring at an unmoving indicator.
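The escalation steps above can be sketched as a pure function of elapsed time. This is a minimal illustration under stated assumptions: the `WaitAction` names, the `can_background` flag, and the rule that the hard limit triggers either backgrounding or a timeout are choices made for the example, not a prescribed design.

```python
from enum import Enum


class WaitAction(Enum):
    SHOW_INDICATOR = "show the normal contextual indicator"
    PROGRESS_AND_CANCEL = "update progress copy and offer a cancel button"
    BACKGROUND_AND_NOTIFY = "move the task to the background and notify when done"
    TIME_OUT = "surface a graceful timeout with a suggested next step"


def wait_action(
    elapsed_s: float,
    budget_s: float,
    hard_limit_s: float,
    can_background: bool = False,
) -> WaitAction:
    """Decide what the UI should be doing at this point in the wait."""
    if elapsed_s >= hard_limit_s:
        # Past the upper bound: never keep the user on an unmoving indicator.
        return WaitAction.BACKGROUND_AND_NOTIFY if can_background else WaitAction.TIME_OUT
    if elapsed_s > budget_s:
        return WaitAction.PROGRESS_AND_CANCEL
    return WaitAction.SHOW_INDICATOR
```

A UI loop would call this on a timer and re-render whenever the returned action changes.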

Latency for the user's mental model¶

Users intuit AI latency in three buckets:

<1s — instant. The AI feels responsive; integrated into flow.
1-5s — visible wait. The AI feels considered; acceptable for complex tasks.
>5s — interrupting. The AI feels slow; user mental focus breaks; context switch likely.

Design for the right bucket per workload. An autocomplete that takes 3s is in the wrong bucket; a research report that returns in 3s is also wrong, because too fast feels untrustworthy for deep work and users expect deep work to take time.
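The three buckets translate directly into a classifier that can label measured latencies in dashboards or tests. A minimal sketch; `latency_bucket` is a hypothetical helper, and the treatment of exactly 1s and exactly 5s as "visible wait" is an assumption, since the buckets above leave the boundaries open.

```python
def latency_bucket(seconds: float) -> str:
    """Map a measured latency to the user's perceived bucket."""
    if seconds < 1:
        return "instant"
    if seconds <= 5:
        return "visible wait"
    return "interrupting"
```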

Streaming gotchas¶

Some streaming-specific concerns:

Streaming token boundaries. Tokens may break mid-word; the UI must handle fragments correctly.
Stop sequences. The stream should respect stop sequences and end cleanly.
Error mid-stream. Convert to a clean error state with the partial response preserved if useful.
Streaming with markdown. Render incrementally; some markdown elements (tables) cannot render until complete.
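Several of these concerns can be handled in one consumer. The following is a minimal sketch, assuming fragments arrive as strings from an iterable that may raise partway through; `consume_stream`, `StreamResult`, and the status values are hypothetical names. It accumulates fragments in a buffer so a stop sequence split across fragments is still detected, and it preserves the partial text when the stream fails.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamResult:
    text: str                    # everything received that is safe to keep
    status: str                  # "complete", "stopped", or "error"
    error: Optional[str] = None  # message for the clean error state, if any


def consume_stream(fragments: Iterable[str], stop: Optional[str] = None) -> StreamResult:
    """Accumulate fragments, honour a stop sequence, and survive mid-stream errors."""
    buffer = ""
    try:
        for fragment in fragments:
            buffer += fragment
            if stop:
                # Search the whole buffer: the stop sequence may span fragments.
                cut = buffer.find(stop)
                if cut != -1:
                    return StreamResult(buffer[:cut], "stopped")
    except Exception as exc:
        # Keep the partial response so the UI can show it next to the error.
        return StreamResult(buffer, "error", str(exc))
    return StreamResult(buffer, "complete")
```

A UI that renders live would additionally need to hold back the last `len(stop) - 1` characters until the next fragment arrives, so that a half-received stop sequence is never shown; that detail is omitted here to keep the sketch short.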

Common mistakes¶

Block presentation for chat. Users perceive the feature as broken.

Spinners without context. Generic loading; users abandon.

No expected-latency communication for long tasks. Surprise wait.

Streaming without error handling. Mid-stream failures crash the UI.

Wrong latency bucket. An autocomplete that takes 3s is too slow; a research report that returns in 0.5s is so fast it feels untrustworthy.

Interview Q&A¶

Q1. Why is streaming the highest-leverage latency UX pattern? Because it shifts perceived latency dramatically without requiring faster models. Time-to-first-token can be <1s even when total response is 5s. The user sees motion, reads in parallel, and engages — the wait feels productive. The total time is similar but the user experience is different. The implementation cost is real (streaming infrastructure, error handling) but the UX payoff is large. Wrong-answer notes: "make the model faster" misses the design lever.

Q2. Walk through the right latency UX for an autocomplete suggestion vs a research report. Autocomplete: time-to-first-token <300ms, total <2s; no spinner (too short); inline rendering. Research report: time-to-first-token <2s; show "Researching..." indicator with progress; total acceptable up to 60s with progress communication; background and notify for longer. Different buckets; different patterns. Autocomplete is integrated-into-flow; research is considered-work. The UX matches the user's mental model of the work type. Wrong-answer notes: uniform UX across both produces mismatch.

Q3. The AI is taking longer than budget on an interactive query. What does the UX do? Show progress with context ("Still searching..."); offer cancellation; if approaching a hard upper bound, surface "this is taking longer than expected" with options. Never leave the user staring at an unmoving indicator. The communication acknowledges the wait and gives the user control. Wrong-answer notes: "let the spinner keep spinning" produces abandonment.

Q4. The platform considers skipping streaming "to simplify." What is the tradeoff? The implementation simplification is real (no streaming infrastructure; no mid-stream error handling). The UX cost is the chapter-opening 4-second-blank-then-block-of-text experience. Users perceive the AI as broken; adoption suffers. For chat-style interactions, streaming is essentially required. For short responses (single-sentence, classifications), non-streaming is fine. The decision is per workload; "skip streaming everywhere" is the wrong simplification. Wrong-answer notes: "simpler is better" without considering UX cost.

What to do differently after reading this¶

Stream chat responses by default.
Use loading indicators with context, not generic spinners.
Communicate expected latency for long tasks.
Match latency budget to workload bucket.
Handle latency-exceeded gracefully (progress, cancellation, escalation).

Bridge. Latency UX manages the wait. The next concern is what the AI says about its own certainty — uncertainty surfacing that helps users calibrate trust. → 03-uncertainty-surfacing.md