12. Deployment and Production — run the kitchen without drama

~17 min read. Async code is only half the story; the service must also start, scale, stay healthy, and shut down gracefully.

Built on the ELI5 in 00-eli5.md. The front desk — now the full deployed service — must keep the kitchen lane healthy through load, restarts, and shutdown.


First picture: one app process is not the whole system

Look at the production shape first. Your FastAPI app usually sits behind a proxy. It may run multiple workers. It may run inside a container. Health checks and shutdown signals matter.

client
load balancer / ingress
uvicorn workers
  ├── FastAPI app
  ├── connection pools
  └── graceful shutdown hooks

See. Running uvicorn main:app --reload on your laptop is not production. Production needs process management, resource limits, metrics, and safe deploy behavior. Simple, no?

Uvicorn workers and process shape

FastAPI apps commonly run on Uvicorn. For more CPU parallelism, you may run multiple worker processes. Each worker has its own event loop, its own memory, and its own connection pools.

That is important. If you run four workers, you do not have one giant shared async kitchen. You have four separate kitchens. Each one handles its own order tickets. That affects memory planning and client pool sizing.

Worked example. Suppose one worker can handle 1,000 mostly-idle SSE connections. Four workers may handle about 4,000, subject to memory, upstream limits, and load-balancer behavior. But if each worker opens huge DB pools, you may overrun your database. So size pools per worker, not only per service.
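The pool arithmetic above can be sketched in a few lines. This is only an illustration: the function names, the replica count, and the numbers (a database limit of 100 connections, 10 held back for admin use) are assumptions for the example, not values from any real deployment.

```python
def total_connections(pool_size: int, workers: int, replicas: int = 1) -> int:
    """Worst-case open connections: every worker in every replica fills its pool."""
    return pool_size * workers * replicas


def max_pool_size_per_worker(
    db_max_connections: int, workers: int, replicas: int = 1, reserved: int = 0
) -> int:
    """Largest per-worker pool that keeps the whole service under the DB limit."""
    if workers < 1 or replicas < 1:
        raise ValueError("workers and replicas must be at least 1")
    usable = db_max_connections - reserved  # keep some connections for admin tasks
    return max(usable // (workers * replicas), 0)


# Hypothetical numbers: a "modest" pool of 20 looks harmless per worker...
print(total_connections(pool_size=20, workers=4, replicas=3))  # 240

# ...but a database that allows 100 connections (10 reserved) can only afford this:
print(max_pool_size_per_worker(100, workers=4, replicas=3, reserved=10))  # 7
```

The point of the sketch is the multiplication. The pool size you type into config is per kitchen, and the database sees the sum of all kitchens.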

Containers, health checks, and readiness

A good container image should start fast, log to stdout, and expose clear health endpoints. Usually you want at least two: /health/live for liveness and /health/ready for readiness.

liveness  ──→ is the process alive?
readiness ──→ can this instance serve safely now?

Readiness is not the same as liveness. The process may be alive but warming caches, waiting for DB connectivity, or draining during shutdown. In that state, it should fail readiness. The front desk is open physically, but not ready to take orders.
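The difference between the two checks can be sketched as plain functions. This is a minimal sketch, assuming a hypothetical InstanceState object that the app updates during startup, dependency checks, and shutdown. The field names and response bodies are made up for illustration. In a FastAPI app, functions like these would sit behind the /health/live and /health/ready routes and supply the status code.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical per-worker state, updated by startup and shutdown code."""
    started: bool = False   # warmup finished (caches loaded, clients created)
    db_ok: bool = False     # last dependency check succeeded
    draining: bool = False  # shutdown has begun


def liveness(state: InstanceState) -> tuple[int, dict]:
    # If the process can answer at all, it is alive.
    return 200, {"status": "alive"}


def readiness(state: InstanceState) -> tuple[int, dict]:
    # Ready only when every condition for serving traffic safely holds.
    reasons = []
    if not state.started:
        reasons.append("warming up")
    if not state.db_ok:
        reasons.append("database unreachable")
    if state.draining:
        reasons.append("draining")
    if reasons:
        return 503, {"status": "not ready", "reasons": reasons}
    return 200, {"status": "ready"}


state = InstanceState()
print(liveness(state)[0], readiness(state)[0])  # 200 503
state.started = state.db_ok = True
print(readiness(state)[0])  # 200
state.draining = True
print(readiness(state))  # (503, {'status': 'not ready', 'reasons': ['draining']})
```

Notice that liveness never looks at dependencies. A failing liveness check usually gets the process restarted, and restarting does not fix a database outage.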

Docker is just packaging. Kubernetes or another orchestrator handles rollout. Do not confuse image build with runtime health design.

Graceful shutdown matters for streams and jobs

Now what is the problem? A new deploy starts. Old pods get SIGTERM. If they die instantly, active streams cut off, in-flight writes may fail, and child resources may leak.

Picture the desired flow.

SIGTERM
  ├── fail readiness
  ├── stop accepting new requests
  ├── let current requests finish within budget
  └── close pools and streams cleanly

That is graceful shutdown. The cancel bell rings for the process itself. But it should be a polite bell, not a power cut.

For AI services, this is especially important. Long streams may need a short drain window. Background workers may need to checkpoint job progress. Connection pools need closing. Detached tasks need review, because nothing awaits or cancels them by default. See. Shutdown is where messy ownership shows up.
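The drain flow can be sketched with plain asyncio. This is a toy model under stated assumptions: the Drainer class, its method names, and the timings are hypothetical, and a real server (Uvicorn, for example) does its own connection tracking. The sketch only shows the ownership idea: stop taking new work, wait within a budget, then cancel what is left.

```python
import asyncio


class Drainer:
    """Tracks in-flight work so shutdown can wait for it within a budget."""

    def __init__(self) -> None:
        self.draining = False
        self._tasks: set[asyncio.Task] = set()

    def submit(self, coro) -> asyncio.Task:
        # Step 1 of the flow: refuse new work once draining has started.
        if self.draining:
            coro.close()  # avoid a "coroutine was never awaited" warning
            raise RuntimeError("draining: not accepting new work")
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self, budget_s: float) -> tuple[int, int]:
        """Returns (finished, cancelled) counts for work in flight at shutdown."""
        self.draining = True
        pending = set(self._tasks)
        if not pending:
            return 0, 0
        # Step 2: let current work finish, but only within the budget.
        done, late = await asyncio.wait(pending, timeout=budget_s)
        # Step 3: the polite bell is over; cancel what is still running.
        for task in late:
            task.cancel()
        await asyncio.gather(*late, return_exceptions=True)
        return len(done), len(late)


async def demo() -> tuple[int, int]:
    drainer = Drainer()
    drainer.submit(asyncio.sleep(0.01))  # a short request
    drainer.submit(asyncio.sleep(10))    # a long stream that exceeds the budget
    return await drainer.shutdown(budget_s=0.2)


print(asyncio.run(demo()))  # (1, 1)
```

In a real service, closing pools and flipping readiness would happen around the shutdown call, and the budget must fit inside whatever grace period the orchestrator allows before it force-kills the process.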

Production checklist for AI APIs

Set request timeouts at proxy and app layers. Tune keep-alive carefully. Reuse async clients. Export latency and error metrics. Trace upstream calls. Bound concurrency to vendors. Protect secrets. Version your schemas.

healthy service checklist
├── bounded timeouts
├── readiness and liveness
├── structured logs and traces
├── pool sizing per worker
├── graceful shutdown
└── alerting on tail latency and error rate
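Two checklist items, bounded timeouts and bounded concurrency to vendors, fit in one small helper. This is a minimal sketch using only asyncio. The helper name call_with_limits, the fake vendor call, and the limits (2 concurrent calls, 1 second timeout) are assumptions for illustration. A real service would wrap its actual HTTP client call instead.

```python
import asyncio


async def call_with_limits(sem: asyncio.Semaphore, make_call, timeout_s: float):
    """Run make_call() with a concurrency cap and a per-call timeout."""
    async with sem:  # at most N calls to this vendor at once, per worker
        # The timeout covers the call itself, not time spent waiting for a slot.
        return await asyncio.wait_for(make_call(), timeout=timeout_s)


async def demo() -> tuple[int, list]:
    sem = asyncio.Semaphore(2)  # hypothetical vendor limit for this worker
    active = 0
    peak = 0

    async def fake_vendor_call() -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)  # record the highest concurrency observed
        try:
            await asyncio.sleep(0.01)
            return "ok"
        finally:
            active -= 1

    results = await asyncio.gather(
        *(call_with_limits(sem, fake_vendor_call, timeout_s=1.0) for _ in range(6))
    )
    return peak, results


print(asyncio.run(demo()))  # (2, ['ok', 'ok', 'ok', 'ok', 'ok', 'ok'])
```

Remember the worker multiplication here too. A semaphore of 2 in each of four workers allows up to 8 concurrent vendor calls for the service as a whole.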

Also watch streaming infrastructure. Some proxies buffer responses by default, which delays or batches streamed chunks. Some cloud platforms have idle timeout quirks. Test the full path, not only the app code.

For Docker, keep images lean. Install only needed system packages. Use non-root when feasible. Pin runtime versions deliberately. The serving kitchen should be predictable.

Scaling is multi-dimensional

Do not ask only, "How many requests per second?" AI services scale across several axes: concurrent streams, outbound provider limits, memory per context, CPU for serialization and middleware, and background queue depth.

The front desk may look calm while the prep shelf is overloaded. Or the reverse. So what to do? Measure each bottleneck separately. Scale API workers, worker queues, and vendor quotas independently, each with its own dashboard. That is production thinking.
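Measuring bottlenecks separately can start as simply as comparing each axis to its own limit. The sketch below is illustrative only: the axis names and all numbers are invented, and in practice the current values would come from your metrics system.

```python
def utilization(current: dict[str, float], limits: dict[str, float]) -> dict[str, float]:
    """Fraction of each limit currently in use, per axis."""
    return {axis: current[axis] / limits[axis] for axis in limits}


def tightest_axis(current: dict[str, float], limits: dict[str, float]) -> tuple[str, float]:
    """The axis closest to its limit, which is the one to scale or protect first."""
    usage = utilization(current, limits)
    axis = max(usage, key=usage.get)
    return axis, usage[axis]


# Hypothetical snapshot: the front desk looks calm, the vendor quota does not.
current = {"concurrent_streams": 300, "vendor_requests_per_min": 450, "queue_depth": 40}
limits = {"concurrent_streams": 1000, "vendor_requests_per_min": 500, "queue_depth": 200}

print(tightest_axis(current, limits))  # ('vendor_requests_per_min', 0.9)
```

Here adding API workers would not help at all. The streams axis sits at 30 percent while the vendor quota is at 90 percent, which is why each axis deserves its own view.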


Where this lives in the wild

  • OpenAI API edge service — SRE: graceful draining protects active streamed completions during rolling deploys.
  • Anthropic enterprise gateway — platform engineer: per-worker pool sizing prevents database and Redis exhaustion when worker count changes.
  • Perplexity answer backend — infrastructure engineer: readiness checks hold instances out of rotation until retrieval and model dependencies are healthy.
  • GitHub Copilot cloud service — reliability engineer: structured traces across editor requests and provider calls make async tail latency debuggable.
  • Enterprise document AI platform — DevOps engineer: API workers and queue workers scale separately because serving traffic and indexing throughput peak differently.

Pause and recall

  • Why is one deployed service often many separate event loops instead of one?
  • What is the difference between readiness and liveness?
  • Why does graceful shutdown matter more for streamed AI responses than for many simple CRUD routes?
  • In the analogy, what happens when the front desk stops taking new orders before closing the kitchen?

Interview Q&A

Q: Why increase worker count for a FastAPI service instead of relying on one event loop? A: Multiple workers improve process-level parallelism, resilience, and multicore usage, but they also multiply memory and pool usage, so sizing must stay intentional. Common wrong answer to avoid: "One event loop can fully replace multiple worker processes for every production case."

Q: Why is readiness distinct from liveness in containerized deployments? A: A process can be alive yet temporarily unfit to serve, such as during warmup, dependency outage, or graceful drain, so traffic routing needs a stricter signal. Common wrong answer to avoid: "If a pod is alive, it is ready by definition."

Q: Why is graceful shutdown especially important for async streaming APIs? A: Because abrupt termination cuts active streams, skips in-flight cleanup, and can leak upstream resources held by long-lived connections. Common wrong answer to avoid: "HTTP servers can stop instantly because clients will just retry."

Q: Why must connection pools be sized per worker rather than only per service? A: Each worker owns separate pool instances, so total open connections scale with worker count and can overwhelm dependencies if you ignore that multiplication. Common wrong answer to avoid: "A pool size of 20 means 20 connections total, no matter how many workers run."


Apply now (5 min)

Exercise. Write a mini production checklist for one FastAPI AI service. Include worker count, health endpoints, timeouts, and shutdown behavior. Then note one metric you would alert on.

Sketch from memory. Draw client → load balancer → workers. Add one readiness gate and one graceful-drain arrow. Label where the cancel bell rings during deploy.


Bridge. We can now ship a solid service. The last step is honesty: what still remains hard, surprising, or unsolved in async Python itself? → 13-honest-admission.md