07. Concurrency Patterns: let many tickets move without kitchen fights

~16 min read. Async alone is not enough; you also need the right control patterns for external services and shared limits.

Built on the ELI5 in 00-eli5.md. The kitchen lane — the shared scheduling path — needs rules so too many order tickets do not rush the same stove together.

First picture: concurrency without control becomes a stampede

Look at the picture first. Many tasks can run concurrently. That sounds good. Until all of them hit the same dependency together.

100 order tickets
      │
      ▼
┌──────────────────────┐
│   kitchen lane       │
└──────┬───────────────┘
       ▼
   vector DB / LLM / Redis
       │
       ├── healthy with limits
       └── overloaded without limits

See. Concurrency is power. Unbounded concurrency is chaos. AI systems often fan out. Chunk embeddings. Parallel rerank calls. Multiple tool calls. Several cache misses at once. We need patterns.

`gather` for independent waits

The first pattern is asyncio.gather. Use it when multiple tasks are independent and all should start now.

results = await asyncio.gather(
    fetch_profile(user_id),
    fetch_usage(user_id),
    fetch_limits(user_id),
)

This is better than sequential waits. One request can fetch three resources together. That cuts latency.

Picture the gain.

sequential
profile 0.2s + usage 0.2s + limits 0.2s = 0.6s

concurrent
max(0.2s, 0.2s, 0.2s) ≈ 0.2s
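The timing picture above can be reproduced with a minimal sketch. Here asyncio.sleep stands in for the three network calls. The name fake_fetch and the 0.2-second delays are illustrative assumptions, not a real client API.

```python
import asyncio
import time


async def fake_fetch(name: str, delay: float = 0.2) -> str:
    # Stand-in for a network call: sleeping yields control to the event loop.
    await asyncio.sleep(delay)
    return name


async def sequential() -> float:
    start = time.perf_counter()
    for name in ("profile", "usage", "limits"):
        # Each wait starts only after the previous one has finished.
        await fake_fetch(name)
    return time.perf_counter() - start


async def concurrent() -> float:
    start = time.perf_counter()
    # All three waits overlap, so the total is close to the slowest one.
    await asyncio.gather(
        fake_fetch("profile"),
        fake_fetch("usage"),
        fake_fetch("limits"),
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {asyncio.run(sequential()):.2f}s")
    print(f"concurrent: {asyncio.run(concurrent()):.2f}s")
```

The gain comes from overlapping waits, not from faster CPU work. If the three calls were CPU-bound, gather would not help.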

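But do not overuse it blindly. By default, gather raises the first exception to the caller right away, while the other awaitables keep running in the background unobserved. If you want every result, failures included, set return_exceptions=True or manage tasks separately. If tasks depend on each other, plain gather is the wrong shape. Simple, no?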
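Here is a small sketch of the return_exceptions=True option mentioned above. The helpers ok and broken are hypothetical stand-ins for real fetches, and mapping failures to None is one possible policy, not the only one.

```python
import asyncio


async def ok(value: str) -> str:
    await asyncio.sleep(0.01)
    return value


async def broken() -> str:
    await asyncio.sleep(0.01)
    raise RuntimeError("usage service unavailable")


async def load_dashboard() -> dict[str, object]:
    # With return_exceptions=True, exceptions come back as items in the
    # result list, in argument order, instead of being raised.
    profile, usage, limits = await asyncio.gather(
        ok("profile"),
        broken(),
        ok("limits"),
        return_exceptions=True,
    )
    results = {"profile": profile, "usage": usage, "limits": limits}
    # Replace failures with None so the caller can degrade gracefully.
    return {
        key: (None if isinstance(value, BaseException) else value)
        for key, value in results.items()
    }


if __name__ == "__main__":
    print(asyncio.run(load_dashboard()))
```

The caller must check each item. Forgetting the isinstance check is the usual way an exception object leaks into a response.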

Semaphores stop fan-out explosions

Now what is the problem? Suppose you must embed 1,000 chunks. Starting 1,000 upstream requests together may blow straight through your rate limits. That is where semaphores help.

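import asyncio

# At most 10 embedding calls may be in flight at once.
semaphore = asyncio.Semaphore(10)

async def embed_one(text: str) -> list[float]:
    # Callers beyond the limit wait here until a slot frees up.
    # embed_client is assumed to be an async client created elsewhere.
    async with semaphore:
        return await embed_client.embed(text)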

This allows only ten active embedding calls at once. The rest wait politely. The kitchen lane stays orderly.

Worked example. Each embedding call takes 300 milliseconds. Unbounded fan-out sends 1,000 concurrent calls. The vendor starts returning 429 errors. With a semaphore of 10, you process roughly 10 at a time: about 100 waves of 300 milliseconds, so close to 30 seconds in total. That is slower than the ideal unbounded run, but success rate improves dramatically. That is the adult tradeoff.
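The semaphore gate can be checked with a short sketch that counts how many calls are active at once. The function fake_embed is a hypothetical stand-in for the vendor call, and its 1 ms sleep is shortened from the 300 ms in the example so the sketch runs quickly.

```python
import asyncio

MAX_IN_FLIGHT = 10  # matches the semaphore size used in the text


async def embed_all(texts: list[str]) -> tuple[list[list[float]], int]:
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    in_flight = 0
    peak = 0

    async def fake_embed(text: str) -> list[float]:
        # Hypothetical vendor call; tracks how many calls overlap.
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)
            in_flight -= 1
            return [float(len(text))]

    # All tasks are created up front, but only MAX_IN_FLIGHT pass the gate.
    vectors = await asyncio.gather(*(fake_embed(t) for t in texts))
    return list(vectors), peak


if __name__ == "__main__":
    chunks = [f"chunk {i}" for i in range(1000)]
    vectors, peak = asyncio.run(embed_all(chunks))
    print(len(vectors), "vectors, peak concurrency", peak)
```

Note that the tasks still exist in memory while they wait. For very large inputs, feeding work through a queue in batches may be a better shape than one big gather.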

Rate limiting and connection pools are different tools

Semaphores limit local concurrency. Rate limiting controls request rate over time. Connection pools manage reusable network connections. These are related, but not identical.

tool                     protects against
┌──────────────────────┬──────────────────────────┐
│ semaphore            │ too many active tasks    │
│ rate limiter         │ too many requests/sec    │
│ connection pool      │ too many open sockets    │
└──────────────────────┴──────────────────────────┘

See. You may need all three. For example, a retrieval route can fire many small vector DB queries. A pool reuses TCP connections. A semaphore caps in-flight calls. A rate limiter ensures tenant fairness.
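To make the difference concrete, here is a minimal sketch that combines a rate limiter with a semaphore. The class IntervalRateLimiter is a hypothetical helper written for this illustration, not a library API. It simply spaces out call starts; production systems often use a token bucket or a dedicated library instead.

```python
import asyncio
import time


class IntervalRateLimiter:
    """Spaces out call starts so at most `rate_per_sec` begin each second."""

    def __init__(self, rate_per_sec: float) -> None:
        self._interval = 1.0 / rate_per_sec
        self._next_slot = 0.0

    async def wait(self) -> None:
        # No await between reading and updating _next_slot, so concurrent
        # tasks on one event loop cannot reserve the same slot.
        now = time.monotonic()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        if slot > now:
            await asyncio.sleep(slot - now)


async def call_vendor(
    limiter: IntervalRateLimiter, semaphore: asyncio.Semaphore
) -> float:
    await limiter.wait()  # rate: when a call may start
    started = time.monotonic()
    async with semaphore:  # concurrency: how many calls run at once
        await asyncio.sleep(0.01)  # hypothetical vendor call
    return started


async def main() -> list[float]:
    limiter = IntervalRateLimiter(rate_per_sec=100)
    semaphore = asyncio.Semaphore(5)
    return await asyncio.gather(
        *(call_vendor(limiter, semaphore) for _ in range(10))
    )


if __name__ == "__main__":
    starts = asyncio.run(main())
    print(f"10 calls started over {starts[-1] - starts[0]:.3f}s")
```

The limiter answers "how often". The semaphore answers "how many at once". A slow vendor can fill the semaphore even when the rate is low, which is why both gates are useful.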

Using httpx.AsyncClient or async DB clients with pooling matters here. Creating a new client per request wastes sockets and TLS handshakes. Reuse clients when possible. That is boring, but powerful.

Structured concurrency keeps failures understandable

Another senior idea matters. When tasks are created, who owns them? Who cancels siblings if one fails? Who gathers results? This is why structured concurrency thinking matters.

parent request
    │
    ├── task A: fetch docs
    ├── task B: fetch profile
    └── task C: fetch policy

If the parent request is cancelled, these child tasks should usually cancel too. If task B fails critically, should A and C continue? Your design must say so. Do not scatter detached tasks everywhere. The order ticket should own its sub-work clearly.

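In Python 3.11 and later, asyncio.TaskGroup helps express this. When one child task fails, the group cancels the remaining children and then raises the failures together, which gives a clearer failure boundary than loose task creation. You do not need it always. But the idea is important. Related work should share a parent lifecycle.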

A practical pattern for AI fan-out.

Consider a retrieval-augmented answer route. It needs: a cache read, a user profile fetch, three search shards, and one policy lookup.

A good pattern is this. Start independent reads with gather. Use semaphores for shard or vendor fan-out. Reuse one client per dependency. Set deadlines for each external call. Then join the results.

request
  │
  ├── gather(profile, cache, policy)
  │
  └── gather(search shard 1..3) under semaphore
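The diagram can be sketched in code as follows. Everything named here is an assumption for illustration: fake_call stands in for real clients, and SHARD_LIMIT and CALL_DEADLINE are made-up values. Dropping a timed-out shard is one possible policy, and your route may prefer to fail instead.

```python
import asyncio
from typing import Optional

SHARD_LIMIT = 2      # assumed cap on concurrent shard queries
CALL_DEADLINE = 0.2  # assumed per-call deadline in seconds


async def fake_call(name: str, delay: float = 0.01) -> str:
    # Hypothetical stand-in for a cache, profile, policy, or shard client.
    await asyncio.sleep(delay)
    return name


async def with_deadline(coro, fallback=None):
    # Bound each external call; a timeout degrades to the fallback value.
    try:
        return await asyncio.wait_for(coro, timeout=CALL_DEADLINE)
    except asyncio.TimeoutError:
        return fallback


async def search_shard(
    semaphore: asyncio.Semaphore, shard: int, delay: float
) -> Optional[str]:
    # The deadline starts after the gate, so queue time is not counted.
    async with semaphore:
        return await with_deadline(fake_call(f"shard-{shard}", delay))


async def answer(shard_delays: list[float]) -> dict[str, object]:
    semaphore = asyncio.Semaphore(SHARD_LIMIT)
    # Independent reads and the shard fan-out all start together.
    reads = asyncio.gather(
        with_deadline(fake_call("profile")),
        with_deadline(fake_call("cache")),
        with_deadline(fake_call("policy")),
    )
    shards = asyncio.gather(
        *(search_shard(semaphore, i, d)
          for i, d in enumerate(shard_delays, start=1))
    )
    (profile, cache, policy), hits = await asyncio.gather(reads, shards)
    return {
        "profile": profile,
        "cache": cache,
        "policy": policy,
        "hits": [h for h in hits if h is not None],  # drop timed-out shards
    }


if __name__ == "__main__":
    # The second shard is slower than the deadline and gets dropped.
    print(asyncio.run(answer([0.01, 1.0, 0.01])))
```

In a real route, each fake_call would go through one long-lived client per dependency so that connections are reused across requests.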

Look. The kitchen lane is not only about movement. It is also about discipline. Without patterns, async code becomes a noisy pile of awaits. With patterns, it becomes a predictable system.

Where this lives in the wild

Perplexity answer pipeline — retrieval engineer: parallel shard queries plus concurrency caps keep search fast without hammering sources.
OpenAI embeddings ingestion service — platform engineer: semaphores protect upstream embedding APIs from bursty document fan-out.
Slack AI assistant — backend engineer: profile, permission, and channel-context lookups run concurrently to reduce command latency.
Anthropic eval harness — infra engineer: connection pools and rate limits keep many async experiments from exhausting sockets or vendor quotas.
Enterprise RAG gateway — API engineer: structured child tasks make cancellations and partial failures easier to reason about under multi-step orchestration.

Pause and recall

When should you use asyncio.gather, and when is it the wrong tool?
What problem does a semaphore solve that a connection pool does not?
Why are rate limiting and concurrency limiting not the same thing?
In the analogy, why does the kitchen lane need rules around a shared stove?

Interview Q&A

Q: Why use a semaphore when the upstream provider already rate-limits you?
A: Because local concurrency spikes can still waste sockets, memory, and retries even before the provider enforces limits, so proactive shaping protects your own system too.
Common wrong answer to avoid: "A provider rate limit makes local concurrency control unnecessary."

Q: Why prefer gather for independent reads instead of sequential awaits?
A: Independent waits can overlap, so latency approaches the slowest call rather than the sum of all calls.
Common wrong answer to avoid: "Because gather always guarantees faster CPU execution."

Q: Why is connection pooling not a substitute for concurrency control?
A: Pools reuse sockets efficiently, but they do not decide how many requests you launch at once or whether you exceed fair usage limits.
Common wrong answer to avoid: "A large connection pool means you can safely fan out without bounds."

Q: Why does structured concurrency matter in request handlers?
A: It makes ownership, cancellation, and failure propagation explicit, which keeps complex async request trees understandable under errors and shutdown.
Common wrong answer to avoid: "Detached tasks are better because they reduce coupling."

Apply now (5 min)

Exercise. Take one route that does three independent fetches. Rewrite the plan using gather. Then add one semaphore if those fetches hit the same vendor.

Sketch from memory. Draw ten order tickets approaching one shared upstream box. Show where the semaphore gate sits. Write one line on why a pool is different from that gate.

Bridge. Concurrency helps only when failure stays controlled. So next we need a clean strategy for retries, exception types, and circuit breakers. → 08-error-handling-retries.md