01. Sync vs Async: why one slow spoon can stall the whole kitchen

~12 min read. This is the failure mode that makes AI APIs feel randomly slow under load.

Builds on the ELI5 in 00-eli5.md. The line cook (the worker doing active work) becomes a bottleneck if it waits badly.

One slow table can block ten fast ones

Look at the picture first. A synchronous service behaves like one cook finishing one order fully. Only then does the next ticket move. That feels simple. It also wastes time.

sync kitchen

request A ──→ start ──→ wait on LLM ──→ finish
                               │
request B ─────────────────────┘ waits in line
request C ─────────────────────┘ waits in line

async kitchen

request A ──→ start ──→ await LLM ─┐
                                   │ yields
request B ──→ start ──→ await DB ──┤ back to kitchen lane
                                   │
request C ──→ start ──→ send reply ┘

See. The key problem is not speed of code execution alone. The problem is wasted waiting time. AI services wait all the time. They wait for model APIs. They wait for databases. They wait for vector stores. They wait for object storage.

Now what is the problem? If one request holds the whole worker during waiting, concurrency collapses. The front desk keeps accepting work. But the order tickets do not move. Users feel queueing delay, not compute delay.

A tiny example makes this obvious. Suppose three requests arrive together. Request A needs 2 seconds from an LLM. Request B needs 50 milliseconds from Redis. Request C needs 80 milliseconds from Postgres.

In a synchronous path, B and C wait behind A. So perceived latencies become roughly: A = 2.00 seconds. B = 2.05 seconds. C = 2.13 seconds.

In an async path, A starts and yields while it waits. The kitchen lane then starts B and C, so all three waits overlap. Now perceived latencies become roughly: B = 0.05 seconds. C = 0.08 seconds. A = 2.00 seconds.

Simple, no? The total wall clock did not become magic. But fast requests stopped standing behind slow waits. That is the win.
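The three-request example can be replayed in a few lines. This is a minimal sketch, not a benchmark. The waits are scaled down 10x so it runs quickly, plain sleeps stand in for the LLM, Redis, and Postgres calls, and the names WAITS, run_sync, and run_async are made up for illustration.

```python
import asyncio
import time

# Waits from the example above, scaled down 10x (assumption: sleeps stand in for real I/O).
WAITS = {"A": 0.200, "B": 0.005, "C": 0.008}


def run_sync() -> dict[str, float]:
    """One worker handles requests strictly one after another."""
    start = time.perf_counter()
    latencies: dict[str, float] = {}
    for name, wait in WAITS.items():
        time.sleep(wait)  # blocking wait: nothing else can run meanwhile
        latencies[name] = time.perf_counter() - start
    return latencies


async def handle(name: str, wait: float, start: float, latencies: dict[str, float]) -> None:
    await asyncio.sleep(wait)  # yields to the event loop while waiting
    latencies[name] = time.perf_counter() - start


async def run_async() -> dict[str, float]:
    """One event loop starts all three requests and lets their waits overlap."""
    start = time.perf_counter()
    latencies: dict[str, float] = {}
    await asyncio.gather(*(handle(n, w, start, latencies) for n, w in WAITS.items()))
    return latencies


sync_latencies = run_sync()
async_latencies = asyncio.run(run_async())

for name in WAITS:
    print(f"{name}: sync {sync_latencies[name]:.3f}s  async {async_latencies[name]:.3f}s")
```

In the sync run, B and C inherit A's wait. In the async run, they finish almost immediately, and A takes about as long as before.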

Blocking code hides inside innocent-looking functions

Many teams say, "We use FastAPI, so we are async." Not enough. A blocking call inside a request can still freeze progress.

Picture it like this.

request handler
      │
      ▼
┌──────────────────────┐
│ parse request        │
├──────────────────────┤
│ call requests.get()  │  ◀── blocking network call
├──────────────────────┤
│ call time.sleep()    │  ◀── blocking timer
├──────────────────────┤
│ run huge CPU loop    │  ◀── blocking compute
└──────────────────────┘

Each block above ties up the line cook. Nothing else moves on that worker meanwhile. That is why mixed stacks are dangerous. One bad library call ruins the promise of async.

See this worked example.

import time
from fastapi import FastAPI

app = FastAPI()

@app.get("/bad")
def bad_route() -> dict:
    time.sleep(2)  # blocking: parks the calling thread for the full 2 seconds
    return {"status": "done"}

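This code sleeps for two seconds. Because the route is a plain def, FastAPI runs it in a threadpool thread, and during that sleep the thread is parked uselessly. The pool is finite, so if enough requests do this, queues form fast. Put the same time.sleep(2) inside an async def route and it gets worse: the event loop itself freezes, and every request on that worker waits.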

Now compare the async version.

import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.get("/better")
async def better_route() -> dict:
    await asyncio.sleep(2)  # yields to the event loop while waiting
    return {"status": "done"}

Now the order ticket yields while waiting. The kitchen lane can run another ticket. That is cooperative waiting. Not parallel CPU work. Important difference.
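You can watch the difference without a web server. The sketch below is an illustration only, using nothing but asyncio. A hypothetical heartbeat task ticks every 10 milliseconds. We then wait 0.2 seconds, once with time.sleep and once with asyncio.sleep, and measure the longest silence between ticks.

```python
import asyncio
import time


async def heartbeat(ticks: list[float], interval: float = 0.01) -> None:
    """Record a timestamp every `interval` seconds, whenever the loop is free."""
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(interval)


async def measure_max_gap(use_blocking_sleep: bool) -> float:
    """Return the longest pause between heartbeat ticks during a 0.2 s wait."""
    ticks: list[float] = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)  # let the heartbeat get going

    if use_blocking_sleep:
        time.sleep(0.2)  # blocks the event loop: the heartbeat cannot run
    else:
        await asyncio.sleep(0.2)  # yields: the heartbeat keeps ticking

    await asyncio.sleep(0.05)  # let the heartbeat record a few more ticks
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


blocking_gap = asyncio.run(measure_max_gap(use_blocking_sleep=True))
yielding_gap = asyncio.run(measure_max_gap(use_blocking_sleep=False))
print(f"longest silence with time.sleep:    {blocking_gap:.3f}s")
print(f"longest silence with asyncio.sleep: {yielding_gap:.3f}s")
```

With the blocking sleep, the heartbeat goes silent for the whole wait. With the awaited sleep, it keeps ticking. The heartbeat plays the role of every other request on that worker.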

Async is best for waiting, not for heavy compute

Now a common confusion. People hear async and think, "Everything becomes faster." No. Async mainly improves throughput for I/O-heavy waiting.

Look at this comparison.

good fit for async                 bad fit for async alone
┌──────────────────────────┐       ┌──────────────────────────┐
│ HTTP call to LLM         │       │ giant PDF OCR loop       │
│ database query           │       │ image embedding on CPU   │
│ Redis cache lookup       │       │ NumPy-heavy scoring loop │
│ token streaming wait     │       │ video transcoding        │
└──────────────────────────┘       └──────────────────────────┘

Why? Because async helps when the program can yield. Network waiting can yield. Disk waiting can yield too, typically through a thread pool or an async file library. A hot CPU loop usually does not yield helpfully.

So what to do with CPU-heavy work? Move it to worker processes. Use task queues. Use separate services. Use thread or process pools carefully. The prep shelf exists for a reason. Do not make the request path knead dough for thirty minutes.

A worked example again. Suppose one endpoint chunks a 500-page PDF and computes embeddings locally. That may consume CPU for 15 seconds. If you do that inline, a worker becomes unavailable. Other requests suffer. Better flow: accept file, store metadata, return job id, push heavy work to the prep shelf.
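The "accept, return job id, push to the prep shelf" flow might look roughly like this. Treat it as a sketch under loud assumptions. The jobs dict, heavy_embed, submit, and worker are all hypothetical names. An in-memory asyncio.Queue stands in for a real task queue. A worker thread via asyncio.to_thread keeps the sketch portable, though for truly CPU-bound Python a process pool or a separate worker service is the more usual choice.

```python
import asyncio
import uuid

# Hypothetical in-memory job store; a real service would use a database or Redis.
jobs: dict[str, dict] = {}


def heavy_embed(text: str) -> int:
    """Stand-in for CPU-heavy work such as chunking and embedding a PDF."""
    return sum(ord(ch) for ch in text * 1000)


async def submit(queue: asyncio.Queue, text: str) -> str:
    """Request path: record the job, enqueue it, and return a job id right away."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    await queue.put((job_id, text))
    return job_id


async def worker(queue: asyncio.Queue) -> None:
    """Prep shelf: pull jobs and run the heavy part off the event loop."""
    while True:
        job_id, text = await queue.get()
        jobs[job_id]["status"] = "running"
        result = await asyncio.to_thread(heavy_embed, text)
        jobs[job_id].update(status="done", result=result)
        queue.task_done()


async def demo() -> tuple[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue))

    job_id = await submit(queue, "page text")
    status_at_submit = jobs[job_id]["status"]  # the caller already has an id here

    await queue.join()  # demo only: wait for the shelf to empty
    worker_task.cancel()
    return status_at_submit, jobs[job_id]["status"]


status_at_submit, final_status = asyncio.run(demo())
print(status_at_submit, "->", final_status)
```

The request path never does the kneading. It hands back a ticket, and the client polls or gets notified later.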

Why this matters more for AI than ordinary CRUD

Classic CRUD apps often do short database calls. AI apps stack several waits. Prompt fetch. User auth. Policy lookup. Vector search. LLM call. Moderation call. Usage logging. Streaming output.

One request can touch five or more systems. Each adds waiting. That makes blocking mistakes more expensive.

chat request
    │
    ├──→ auth service
    ├──→ Redis session
    ├──→ vector DB search
    ├──→ LLM generation
    └──→ billing write

If all of those are awaited properly, one worker can juggle many chats. If even two of them block synchronously, tail latency spikes. The front desk looks open. The kitchen is actually jammed.

See a concrete timeline. Ten users send messages together. Each message spends 1.8 seconds waiting on an LLM. CPU work per request is only 40 milliseconds.

With blocking workers, capacity is dominated by the 1.8-second wait. With async workers, the waits overlap, so the 40 milliseconds of CPU become the limit. That is why async APIs can serve far more concurrent chat sessions on the same machine.
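The numbers are worth working out once. This is back-of-envelope arithmetic, assuming a single worker, ten requests arriving together, no other overhead, and CPU slices that run one at a time. The variable names are only for illustration.

```python
LLM_WAIT_S = 1.8  # time each request spends waiting on the LLM
CPU_S = 0.040     # CPU work per request
USERS = 10        # requests arriving together

# Blocking worker: each request holds the worker for wait + CPU.
blocking_per_request_s = LLM_WAIT_S + CPU_S           # about 1.84 s
blocking_rps = 1 / blocking_per_request_s             # about 0.54 requests/s
blocking_last_user_s = USERS * blocking_per_request_s  # about 18.4 s for the tenth user

# Async worker: the waits overlap, only the CPU slices run one at a time.
async_ceiling_rps = 1 / CPU_S                          # about 25 requests/s upper bound
async_last_user_s = LLM_WAIT_S + USERS * CPU_S         # about 2.2 s for the tenth user

print(f"blocking: {blocking_rps:.2f} req/s, last user waits {blocking_last_user_s:.1f}s")
print(f"async:    up to {async_ceiling_rps:.0f} req/s, last user waits {async_last_user_s:.1f}s")
```

Same machine, same LLM. The tenth user waits about 18 seconds in one design and about 2 seconds in the other.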

The real mental model to keep

Do not memorize slogans. Keep one picture in your head. A blocking function keeps the cook staring at boiling water. An async function puts the ticket down and helps another table.

That does not remove physics. The LLM still takes its time. The DB still takes its time. But the service becomes fairer and more responsive. Fast requests stop inheriting slow neighbors. That is the big systems win.

So what to do? Audit every waiting point. Network call? Use an async client. Sleep? Use await asyncio.sleep. Database driver? Choose one that integrates with asyncio. Heavy compute? Offload it. Simple, no?
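Sometimes the audit finds a blocking call with no async replacement yet. One stopgap is to push it onto a worker thread with asyncio.to_thread. The sketch below assumes a hypothetical legacy_lookup function standing in for a sync-only SDK call. It is a bridge, not a cure, since the thread pool is still finite.

```python
import asyncio
import time


def legacy_lookup(key: str) -> str:
    """Hypothetical blocking client call (imagine a sync-only SDK)."""
    time.sleep(0.1)  # stands in for a blocking network round trip
    return f"value-for-{key}"


async def lookup(key: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(legacy_lookup, key)


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # Three blocking calls overlap in separate threads instead of running back to back.
    results = await asyncio.gather(*(lookup(k) for k in ("a", "b", "c")))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Run back to back, the three calls would take about 0.3 seconds. Overlapped, they take roughly as long as one.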

Where this lives in the wild

OpenAI API — platform engineer: non-blocking request handling keeps short moderation checks from waiting behind long GPT-4 generations.
Perplexity search backend — retrieval engineer: concurrent web fetches and ranking calls prevent one slow source from freezing the answer path.
Notion AI — backend engineer: quick autocomplete requests stay snappy even when document summarization calls run longer.
Slack AI assistant — infrastructure engineer: short channel-help prompts should not queue behind giant enterprise search requests.
Anthropic Messages API — API engineer: streaming chats benefit when waiting on model tokens does not block other conversations.

Pause and recall

Why can async improve latency for short requests even when long requests stay equally slow?
Which three kinds of operations commonly block a Python API worker?
Why is async a poor cure for CPU-heavy embedding or OCR work?
In the kitchen analogy, what exactly is the failure of a blocking line cook?

Interview Q&A

Q: Why use async endpoints for LLM calls instead of just adding more threads? A: Async makes waiting explicit and cheap, so thousands of sockets can stay alive without one thread per wait. Threads help too, but thread-per-wait scales worse in memory and coordination. Common wrong answer to avoid: "Async is always faster than threads for every workload."

Q: Why does blocking I/O inside FastAPI hurt even if the framework supports async? A: The framework can only schedule around code that yields. A blocking library call holds the worker during the wait, so concurrency falls back toward serialized handling. Common wrong answer to avoid: "FastAPI automatically converts blocking calls into non-blocking ones."

Q: Why not run heavy embedding generation inline in the request if users can wait? A: Inline CPU-heavy work ties up request capacity and harms unrelated users. Better to return a job id and move that workload to the prep shelf. Common wrong answer to avoid: "If one user accepts the delay, the design is fine."

Q: Why is async especially important for AI products compared with simple CRUD apps? A: AI requests often chain several remote systems and long waits, so blocking time multiplies quickly and tail latency explodes under concurrency. Common wrong answer to avoid: "Because AI models only work inside async functions."

Apply now (5 min)

Exercise. Pick one API route you know. List every place it waits. Mark each as network, disk, sleep, or CPU. Then ask which ones can yield cleanly.

Sketch from memory. Draw two timelines. One for sync. One for async. Use three requests. Show where the kitchen lane lets another order ticket move.

Bridge. Fine, async avoids wasted waiting. But who exactly decides when a paused ticket resumes? That takes us to the event loop and coroutines. → 02-event-loop-coroutines.md