10. Design ML Inference Platform¶
⏱️ Estimated time: 20 min | Level: advanced
ELI5 callback: You are on the stage, pitching the city council with a blueprint. Follow the choreography, reason aloud, and admit every honest gap.
Step 1: Requirements & Constraints¶
Start wide. Then narrow. See. Scope first, technology later.
Functional requirements¶
- Serve online model predictions through a common API layer.
- Support multiple model versions, rollback, and controlled rollout.
- Fetch features online or accept them directly from callers.
- Batch compatible requests dynamically for accelerator efficiency.
- Autoscale workers based on queue depth, latency, and hardware usage.
- Split traffic for A/B tests, canaries, and shadow evaluations.
Non-functional requirements¶
- Keep p99 latency within product SLA even during bursty load.
- Use GPUs or specialised hardware efficiently because they are costly.
- Support safe rollback of bad models within minutes.
- Isolate noisy models so one tenant does not starve others.
- Capture prediction logs for debugging, fairness, and offline evaluation.
- Prefer graceful degradation over total outage when features are missing.
Clarifying questions to ask¶
- Is the use case ranking, generation, classification, or embedding retrieval?
- What is the latency budget from API ingress to final response?
- Can the model tolerate stale features or missing features?
- Do we need CPU fallback when GPUs are unavailable?
- How many model versions must run simultaneously for experiments?
- Do we need tenant isolation or workload classes?
What to say on the whiteboard¶
- State the user action, core data, and critical latency target.
- Split must-have features from nice-to-have features immediately.
- Name one honest gap before locking assumptions. Simple, no?
- Ask what failure hurts most: money, freshness, or user trust.
- Confirm whether single-region launch is acceptable for round one.
- Summarise the scope before you move to numbers. Now watch.
Step 2: Scale Estimation¶
Do rough math. Clean math beats fancy math. So what to do? Pick clear assumptions and keep them verbal.
Assumptions¶
- Assume 200 thousand prediction requests per second at peak.
- Assume average input payload is 8 KB.
- Assume average output payload is 1 KB.
- Assume dynamic batching groups 16 requests on average.
- Assume one GPU worker sustains 2 thousand requests per second effective throughput.
- Assume 15 percent of traffic goes to experiments or shadow paths.
Back-of-envelope math¶
- Peak ingress is about 1.6 GB per second (200k requests × 8 KB) before headers and retries.
- At 200k RPS and 16-request batches, the scheduler handles 12.5k batches per second.
- If one GPU worker sustains 2k RPS, we need about 100 active GPU workers.
- Add 30 percent headroom for model skew, failures, and noisy bursts.
- Provision about 130 workers across zones for steady safety.
- Prediction logs at 9 KB per request (8 KB input plus 1 KB output) create about 1.8 GB per second of raw data.
- Shadow traffic adds compute cost without user-visible benefit, so cap it carefully.
- Feature store reads may exceed prediction count if some models fan out for context.
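The arithmetic above can be rechecked in a few lines. This is a minimal sketch that only restates the Step 2 assumptions; the function name `estimate_capacity`, the decimal units (1 GB = 10^6 KB), and expressing headroom as an integer percent are choices made here for illustration.

```python
import math


def estimate_capacity(peak_rps=200_000, input_kb=8, output_kb=1,
                      avg_batch=16, worker_rps=2_000, headroom_pct=30):
    """Recompute the back-of-envelope numbers from the stated assumptions."""
    kb_per_gb = 1_000_000  # decimal units, good enough for whiteboard math
    base_workers = math.ceil(peak_rps / worker_rps)
    return {
        "ingress_gb_per_s": peak_rps * input_kb / kb_per_gb,
        "batches_per_s": peak_rps / avg_batch,
        "base_workers": base_workers,
        # Headroom covers model skew, failures, and noisy bursts.
        "provisioned_workers": math.ceil(base_workers * (100 + headroom_pct) / 100),
        "log_gb_per_s": peak_rps * (input_kb + output_kb) / kb_per_gb,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value}")
```

Changing one input, for example `worker_rps`, shows quickly which assumption the worker count is most sensitive to.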
Interview cue¶
- Say the biggest number first, then derive storage and bandwidth.
- Round aggressively. Nobody wants calculator theatre on the board.
- Mention peak-to-average ratio and why it changes capacity planning.
- Keep one reserve factor for retries, bursts, and replays.
- Remember the stage is interactive, so sanity-check assumptions aloud.
- End with the two numbers that drive architecture choice.
Step 3: High-Level Design¶
Now place the big boxes. Your blueprint should fit in one glance.
+--------+   +-----------+   +---------------+   +--------------+
| Client |-->| API Layer |-->| Traffic Router|-->| Batch Queue  |
+--------+   +-----------+   +---------------+   +--------------+
                   |                 |                  |
                   v                 v                  v
             +----------+      +-----------+     +-------------+
             | Feature  |      | Model Reg |     | Autoscaler  |
             | Store    |      | / Config  |     | Control Loop|
             +----------+      +-----------+     +-------------+
                   |                 |
                   v                 v
             +---------------------------------------+
             | Model Servers on CPU/GPU Worker Pool  |
             +---------------------------------------+
Main components¶
- API layer authenticates callers and normalises request schema.
- Traffic router chooses model version, experiment bucket, and serving class.
- Feature store provides online features with freshness metadata.
- Model registry stores versioned artifacts, signatures, and rollout config.
- Batch queue collects compatible requests within tiny time windows.
- Model servers load artifacts and execute inference on CPU or GPU workers.
- Autoscaler watches queue lag, latency, and accelerator utilisation.
- Prediction logger streams outputs for debugging and offline evaluation.
Request path¶
- Client submits inference request to the shared API.
- Router attaches experiment decision and target model version.
- If needed, online features are fetched or defaulted carefully.
- Batch queue groups requests with the same model and shape.
- Worker loads batch, executes inference, and emits predictions.
- Response returns immediately while logs flow asynchronously.
- Autoscaler reacts if queue wait or GPU saturation rises.
- Offline evaluators compare experiment outcomes against baseline behaviour.
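The "fetched or defaulted carefully" step in the request path can be sketched as below. This is an illustration under assumptions, not a real feature-store client: `FeatureValue`, `resolve_features`, and the `max_age_s` staleness budget are hypothetical names, and the feature store response is modelled as a plain dict.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FeatureValue:
    value: float
    age_s: float  # freshness metadata: seconds since the last update


def resolve_features(required: List[str],
                     fetched: Dict[str, FeatureValue],
                     defaults: Dict[str, float],
                     max_age_s: float) -> Tuple[Dict[str, float], List[str]]:
    """Use fresh values where possible, fall back to defaults otherwise.

    Returns the resolved features plus the names that were defaulted, so the
    caller can tag the prediction as degraded in its logs.
    """
    resolved: Dict[str, float] = {}
    degraded: List[str] = []
    for name in required:
        fetched_value = fetched.get(name)
        if fetched_value is not None and fetched_value.age_s <= max_age_s:
            resolved[name] = fetched_value.value
        elif name in defaults:
            resolved[name] = defaults[name]
            degraded.append(name)
        else:
            # No safe default: fail this request rather than guess.
            raise KeyError(f"feature {name!r} is missing or stale and has no default")
    return resolved, degraded
```

Returning the list of defaulted names keeps degradation visible: the response still goes out, but offline evaluation can separate degraded predictions from healthy ones.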
Design narration¶
- Start with ingress, then routing, then state, then async work.
- Separate control plane decisions from data plane traffic early.
- Show where metadata lives and where heavy payloads travel.
- Mark caches, queues, and databases with their exact job.
- Point out one synchronous dependency you may later relax.
- Pause and let the interviewer choose the next zoom-in area.
Step 4: Deep Dive¶
Pick two parts that actually matter. Depth without structure becomes noise. See.
Component A — Dynamic batching and serving runtime¶
- Batch only requests with compatible model, shape, and latency class.
- Use very small waiting windows, often a few milliseconds.
- Stop batching early when latency SLO is about to be violated.
- Keep warm model replicas for the hottest versions.
- Pin large models to nodes with enough memory and accelerator capacity.
- Preload tokenisers or feature transforms so batch time is not wasted.
- Separate real-time traffic from bulk traffic with distinct queues.
- Measure queue wait, batch size, and compute time independently.
- Allow CPU fallback only for models where degraded latency is acceptable.
- Evict cold models carefully so reload storms do not hurt hot paths.
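A minimal sketch of the batching rule above, assuming a batch is released either when it is full or when its oldest request has waited out a small window. The class name `DynamicBatcher`, its parameters, and the explicit `now_ms` clock argument are hypothetical; real serving runtimes implement this inside the server, and this sketch leaves out the per-request deadline check that would stop batching early.

```python
from typing import Any, Dict, Hashable, List, Optional, Tuple


class DynamicBatcher:
    """Groups requests by compatibility key, e.g. (model, shape, latency class)."""

    def __init__(self, max_batch_size: int, max_wait_ms: float):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        # key -> list of (arrival time in ms, request), oldest first
        self._pending: Dict[Hashable, List[Tuple[float, Any]]] = {}

    def add(self, key: Hashable, request: Any, now_ms: float) -> Optional[List[Any]]:
        """Queue a request; return a full batch if this request completed one."""
        queue = self._pending.setdefault(key, [])
        queue.append((now_ms, request))
        if len(queue) >= self.max_batch_size:
            return self._take(key)
        return None

    def flush_due(self, now_ms: float) -> List[Tuple[Hashable, List[Any]]]:
        """Release partial batches whose oldest request has waited long enough."""
        due = [key for key, queue in self._pending.items()
               if now_ms - queue[0][0] >= self.max_wait_ms]
        return [(key, self._take(key)) for key in due]

    def _take(self, key: Hashable) -> List[Any]:
        return [request for _, request in self._pending.pop(key)]
```

Passing the clock in explicitly keeps the sketch deterministic. It also makes the two metrics the section asks for easy to record: queue wait is `now_ms` minus arrival time, and batch size is the length of the returned list.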
Component B — Autoscaling and traffic splitting¶
- Scale on queue depth and latency, not only CPU or GPU utilisation.
- Use target batch occupancy as a signal for right-sized capacity.
- Split traffic by hash of request or user to keep experiments consistent.
- Support canary rollout percentages with instant rollback toggles.
- Shadow traffic should not affect user response paths or quotas.
- Protect baseline models with reserved capacity during experiments.
- Keep per-model quotas so one viral model does not starve others.
- Store rollout config centrally and propagate it quickly to routers.
- If experiments need ground truth, log each prediction with a request ID so it can be joined with labels that arrive later.
- Use hysteresis in autoscaling so workers do not flap every minute.
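Hash-based splitting with a rollback toggle can be sketched as follows. The names `bucket_of` and `choose_variant`, the basis-point granularity, and the choice of SHA-256 are assumptions for illustration; any stable hash would serve.

```python
import hashlib

BUCKETS = 10_000  # basis points: 100 buckets equal 1 percent of traffic


def bucket_of(experiment: str, unit_id: str) -> int:
    """Map a user or request ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def choose_variant(experiment: str, unit_id: str, canary_bps: int,
                   kill_switch: bool = False) -> str:
    """Return "canary" for the first canary_bps buckets, otherwise "baseline"."""
    if kill_switch:  # instant rollback: every caller goes back to baseline
        return "baseline"
    return "canary" if bucket_of(experiment, unit_id) < canary_bps else "baseline"


if __name__ == "__main__":
    # The same user always lands in the same bucket for a given experiment.
    print(choose_variant("ranker-v2-canary", "user-42", canary_bps=500))
```

Salting the hash with the experiment name keeps buckets independent across experiments. Because the canary takes the lowest buckets, raising `canary_bps` only adds users to the canary and never reshuffles existing ones.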
Deep-dive cue¶
- Keep your spoken reasoning clean while you zoom in.
- Explain data model, hot path, and one ugly edge case.
- Tie each deep dive back to a requirement you already named.
- If numbers change the design, say that directly.
- If one choice is uncertain, park it as research, not panic.
- Return to the overall system before you get lost in detail.
Step 5: Tradeoffs & Failure Modes¶
Now show judgment. Interviewers hire the tradeoff thinker, not the diagram artist.
Key tradeoffs¶
- Large batch windows increase throughput, but can break latency SLOs.
- GPU serving is efficient, but cold starts and capacity planning are harder.
- CPU fallback improves resilience, but latency rises, and product quality may drop if the fallback has to be a smaller model.
- Shared platform reduces duplication, but noisy neighbours become a real risk.
- A/B routing improves learning, but logging and metric attribution get more complex.
- Preloading models cuts latency, but wastes memory for cold versions.
- Shadow mode is safe for users, but doubles some compute and storage cost.
- Aggressive autoscaling saves money, but can amplify cold-start pain.
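One way to contain the noisy-neighbour risk named above is a per-model admission quota. The sketch below uses a token bucket, which is a technique chosen here for illustration and not something the design prescribes; `TokenBucket`, `ModelQuotas`, and the limits shown in the test values are hypothetical.

```python
from typing import Dict, Tuple


class TokenBucket:
    """Allows `rate_per_s` requests per second with bursts up to `burst`."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate_per_s = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last_s = 0.0

    def allow(self, now_s: float) -> bool:
        # Refill for the time elapsed since the previous call, capped at burst.
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last_s = now_s
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ModelQuotas:
    """Keeps one bucket per model so a hot model cannot drain another's quota."""

    def __init__(self, limits: Dict[str, Tuple[float, float]]):
        self._buckets = {model: TokenBucket(rate, burst)
                         for model, (rate, burst) in limits.items()}

    def admit(self, model: str, now_s: float) -> bool:
        bucket = self._buckets.get(model)
        return bucket.allow(now_s) if bucket is not None else False
```

Rejected requests would be shed or routed to a lower serving class before they reach the batch queue, so the shared GPU pool never sees the overload.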
Failure modes to discuss¶
- Feature store outage can make inference impossible unless defaults exist.
- Bad rollout config can shift too much traffic to a broken model.
- Cold model loads can spike latency when traffic changes suddenly.
- Autoscaler lag can build queue backlog faster than workers appear.
- Prediction logs can fall behind and blind offline evaluation teams.
- GPU fragmentation can strand capacity even when total utilisation looks low.
- One malformed request shape can poison batching efficiency.
- Experiment mis-tagging can invalidate weeks of online metrics.
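The bad-rollout-config failure can be guarded with a validation step before config reaches the routers. This is a minimal sketch under assumptions: weights are in basis points, `validate_rollout` and `max_step_bps` are hypothetical names, and only the stable version may jump in one step so that rollback stays instant.

```python
from typing import Dict, List, Set

TOTAL_BPS = 10_000  # weights are basis points and must sum to 100 percent


def validate_rollout(new_weights: Dict[str, int],
                     old_weights: Dict[str, int],
                     known_versions: Set[str],
                     stable_version: str,
                     max_step_bps: int = 1_000) -> List[str]:
    """Return a list of problems; an empty list means the config may ship."""
    errors: List[str] = []
    if sum(new_weights.values()) != TOTAL_BPS:
        errors.append(f"weights sum to {sum(new_weights.values())}, expected {TOTAL_BPS}")
    for version, weight in new_weights.items():
        if version not in known_versions:
            errors.append(f"unknown model version {version!r}")
            continue
        if weight < 0:
            errors.append(f"negative weight for {version!r}")
        increase = weight - old_weights.get(version, 0)
        # Rollback to the stable version is never rate-limited.
        if version != stable_version and increase > max_step_bps:
            errors.append(f"{version!r} grows by {increase} bps, limit is {max_step_bps}")
    return errors
```

A config that fails validation is rejected in the control plane, so routers keep serving the last good weights.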
Close the answer strongly¶
- Say what breaks first under sudden load and how you contain it.
- Compare the chosen design against one simpler alternative.
- Mention operational metrics, not only code-level correctness.
- Admit where future scale may require redesign. Honest and sharp.
- Offer a phased rollout plan if the company is early-stage.
- Finish with latency, reliability, and cost in one sentence.
Interview Q&A¶
Q1. Why not run one endpoint per model team?¶
A. That explodes operational duplication and makes routing, experiments, and scaling inconsistent. A shared platform standardises serving while preserving team ownership of models.
Common wrong answer to avoid: Separate endpoints are always simpler at scale.
Q2. What is dynamic batching actually buying us?¶
A. It improves hardware utilisation by amortising overhead across many requests. The trick is balancing throughput gain against queueing delay.
Common wrong answer to avoid: Bigger batches are always better for every model.
Q3. How do you roll out a risky model safely?¶
A. Use canary percentages, fast rollback, and quality metrics tied to the experiment bucket. Shadow traffic helps validate behaviour before users depend on it.
Common wrong answer to avoid: Deploy to 100 percent and watch the dashboard closely.
Q4. What should trigger autoscaling first?¶
A. Queue lag and latency are the most direct user-impact signals. Resource metrics matter, but they lag behind the customer experience.
Common wrong answer to avoid: Only GPU utilisation should drive scaling.
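A queue-wait-driven scaling decision with hysteresis might look like the sketch below. All thresholds are made-up defaults, `ScalePolicy` and `desired_workers` are hypothetical names, and the proportional scale-up rule is one simple heuristic among many.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    target_wait_ms: float = 5.0   # queue wait we are willing to tolerate
    up_ratio: float = 1.2         # scale up above 120 percent of target
    down_ratio: float = 0.6       # scale down below 60 percent of target
    max_up_factor: float = 2.0    # never more than double in one step
    cooldown_s: float = 120.0     # minimum quiet time before scaling down
    min_workers: int = 2
    max_workers: int = 200


def desired_workers(current: int, queue_wait_ms: float,
                    seconds_since_change: float,
                    policy: ScalePolicy = ScalePolicy()) -> int:
    """Scale up fast on queue wait, scale down slowly and only after cooldown."""
    ratio = queue_wait_ms / policy.target_wait_ms
    if ratio > policy.up_ratio:
        target = math.ceil(current * min(ratio, policy.max_up_factor))
    elif ratio < policy.down_ratio and seconds_since_change >= policy.cooldown_s:
        target = current - max(1, current // 10)  # shed about 10 percent
    else:
        target = current  # inside the dead band: hold steady
    return max(policy.min_workers, min(policy.max_workers, target))
```

The gap between `up_ratio` and `down_ratio` is the dead band that stops flapping, and the cooldown applies only to scale-down because removing GPU workers too eagerly brings back cold-start pain.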
Apply now (5 min)¶
- Run the full choreography with a two-minute timer per step.
- Rework the platform for one giant model plus many tiny models.
- State when CPU fallback is acceptable and when it is dangerous.
- Pick one metric for canary health and explain why it matters.
- Explain how you would keep experiment buckets sticky across requests.
- Name three logs you need for debugging wrong predictions.
- Say one simplification if only one model team exists today.
Bridge. Inference served. Now personalization — a recommendation engine. → 11