13. Honest admission: the open questions and messy tradeoffs in inference optimization

~17 min read. Trustworthy engineers know where the serving playbook stops being an exact science.

Built on the ELI5 in 00-eli5.md. The kitchen can be disciplined, instrumented, and fast, but real service still faces changing menus, uneven order tickets, and limited prep stations.

1) Picture first: neat lab versus messy production

A benchmark chart looks clean.

Production traffic does not.

lab setup                          production
┌──────────────────────┐          ┌──────────────────────┐
│ fixed prompts        │          │ drifting prompts     │
│ fixed model version  │          │ version churn        │
│ known concurrency    │          │ bursty traffic       │
│ stable cache ratio   │          │ tenant mix shifts    │
└──────────────────────┘          └──────────────────────┘

See.

We can optimize a lot.

We cannot freeze reality.

That is why serving work remains empirical.

2) Open problems that still matter

Several live questions remain hard:

predicting speculative-decoding acceptance before runtime,
scheduling fairly across tenants with very different context lengths,
choosing the right batch window under bursty traffic,
deciding optimal KV precision and eviction policy,
balancing latency, cost, and quality with one controller,
serving very long contexts without crushing memory.

Look.

These are not footnotes.

They shape daily operations in large deployments.

They are why “just use framework X” is never a complete answer.
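One way to see why long contexts and KV precision sit on that list is to put rough numbers on KV memory. The sketch below uses the standard per-token accounting: one key vector and one value vector per layer per KV head. The model shape (32 layers, 8 KV heads, head dimension 128) and the names `KVShape`, `kv_bytes_per_token`, and `kv_gib_per_sequence` are illustrative assumptions, not taken from any particular model or framework. It also ignores allocator overhead and fragmentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Hypothetical decoder shape; the defaults are illustrative only."""
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128


def kv_bytes_per_token(shape: KVShape, bytes_per_element: int) -> int:
    # Each layer stores one key and one value vector per KV head,
    # hence the leading factor of 2.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_element


def kv_gib_per_sequence(shape: KVShape, context_tokens: int, bytes_per_element: int) -> float:
    # KV memory grows linearly with the number of cached tokens.
    return kv_bytes_per_token(shape, bytes_per_element) * context_tokens / 2**30


if __name__ == "__main__":
    shape = KVShape()
    for context_tokens in (4_096, 32_768, 131_072):
        fp16 = kv_gib_per_sequence(shape, context_tokens, bytes_per_element=2)
        fp8 = kv_gib_per_sequence(shape, context_tokens, bytes_per_element=1)
        print(f"{context_tokens:>7} tokens: fp16={fp16:.2f} GiB, fp8={fp8:.2f} GiB")
```

With this hypothetical shape, FP16 costs 128 KiB per token, so a single 8,192-token sequence holds 1 GiB of KV cache. Halving the precision halves that footprint, but the quality cost has to be measured rather than assumed, which is part of why the precision question stays open.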

3) Worked uncertainty example

Suppose your benchmark mix averages 1,000 prompt tokens and 150 output tokens.

On that mix, the cluster sustains 40 requests per second.

Then production changes.

Average prompt length rises to 2,500 tokens.

Average output rises to 240 tokens.

The speculative-decoding acceptance rate also falls from 80% to 55%.

Now what happens?

Even without exact recomputation, you should expect:

bigger prefills,
larger KV caches,
lower effective decode speed,
lower achievable concurrency,
higher tail latency.
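A back-of-envelope sketch makes those directions concrete. It uses the numbers from the example above plus two labeled assumptions: the acceptance figure is read as a per-token probability that is independent across drafted tokens, and the draft length of 4 tokens (`DRAFT_LEN`) is hypothetical. It estimates ratios only. The new requests-per-second figure depends on hardware and scheduler behavior and has to be measured.

```python
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes each drafted token is accepted independently with probability
    `acceptance`; real acceptance is usually correlated and prompt-dependent.
    """
    if acceptance >= 1.0:
        return float(draft_len + 1)
    return (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)


# Numbers from the worked example.
bench = {"prompt": 1000, "output": 150, "acceptance": 0.80}
prod = {"prompt": 2500, "output": 240, "acceptance": 0.55}

DRAFT_LEN = 4  # assumption: tokens drafted per speculative step

# Prefill work per request scales at least linearly with prompt tokens.
prefill_growth = prod["prompt"] / bench["prompt"]

# Peak cached tokens per request is prompt plus generated output.
kv_growth = (prod["prompt"] + prod["output"]) / (bench["prompt"] + bench["output"])

# Fewer accepted draft tokens means fewer tokens per verification step.
spec_before = expected_tokens_per_step(bench["acceptance"], DRAFT_LEN)
spec_after = expected_tokens_per_step(prod["acceptance"], DRAFT_LEN)
spec_ratio = spec_after / spec_before

if __name__ == "__main__":
    print(f"prefill tokens per request: x{prefill_growth:.2f}")
    print(f"peak KV tokens per request: x{kv_growth:.2f}")
    print(f"tokens per verification step: {spec_before:.2f} -> {spec_after:.2f}")
```

Under these assumptions, prefill work per request grows 2.5x, the peak KV footprint per request grows about 2.4x, and each verification step yields roughly 2.1 tokens instead of about 3.4. Every lever moves in the wrong direction at once, and each request also generates 1.6x more output tokens.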

The old benchmark number is not wrong.

It is no longer sufficient.

That is the honest point.

4) What a trustworthy senior answer sounds like

A weak answer says:

“Use vLLM, quantize, batch, done.”

A stronger answer says: “We separate prefill from decode, track TTFT and tails, budget KV memory per token, batch continuously, consider a paged KV cache, use speculation where acceptance supports it, and benchmark against real traffic mixes. But fairness, drift, and long-context economics remain moving targets.”

That answer is believable because it names both the levers and the uncertainty.
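"Track TTFT and tails" from the stronger answer can start very small. The sketch below computes nearest-rank percentiles over time-to-first-token samples. The function names `percentile` and `summarize_ttft` and the sample values are hypothetical, and a production system would more likely use streaming histograms than sorting raw samples.

```python
import math
from statistics import fmean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize_ttft(ttft_ms: list[float]) -> dict[str, float]:
    # Report the median alongside the tails; the mean alone hides stragglers.
    return {
        "mean": fmean(ttft_ms),
        "p50": percentile(ttft_ms, 50),
        "p95": percentile(ttft_ms, 95),
        "p99": percentile(ttft_ms, 99),
    }


if __name__ == "__main__":
    # Hypothetical samples: 98 fast requests and 2 slow ones, in milliseconds.
    samples = [100.0] * 98 + [2000.0] * 2
    print(summarize_ttft(samples))
```

In the hypothetical sample, the mean is 138 ms and both p50 and p95 are 100 ms, while p99 is 2,000 ms. Only the tail metric shows that some users waited twenty times longer than the typical request.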

5) Why this bridges naturally into data engineering

Serving is only half the story. The shape of prompts, cacheability of prefixes, retrieval chunk sizes, and traffic mix all come from data systems upstream. If your data pipeline is sloppy, your serving engine inherits that chaos.

So what should you do next? Move one layer outward. Study how AI products prepare, move, validate, and refresh data before inference ever begins. That is the next module.

Where this lives in the wild

Large multi-tenant chat platforms — fairness and long-context cost remain active tuning problems even with strong frameworks.
GitHub Copilot-scale code assistance — workload drift across repositories makes any one benchmark incomplete.
Perplexity-style answer systems — retrieval length and citation structure keep changing the serving profile.
Enterprise internal copilots — tenant isolation, compliance prompts, and custom tools make serving mixes unstable.
Edge-and-cloud hybrid assistants — routing between local and remote inference creates objectives that conflict by design.

Pause and recall

Why does a clean benchmark stop being enough once prompt and output mixes drift?
Name three open serving problems that remain genuinely messy in production.
What makes the “use framework X and done” answer weak?
Why does this module naturally lead into data engineering for AI systems?

Interview Q&A

Q: Why is serving optimization a moving target instead of a one-time architecture decision?

A: Because traffic mix, model versions, prompt templates, acceptance rates, and product goals keep changing. The optimal serving policy drifts with them.

Common wrong answer to avoid: "Once we pick the best engine, the problem is solved." Operations keep moving.

Q: Why can a benchmarked speedup disappear in production?

A: Because the production mix may have longer prompts, different tenants, a lower cache hit rate, or worse speculative acceptance than the test mix.

Common wrong answer to avoid: "If the benchmark was honest, it should generalize automatically." Honest benchmarks still depend on workload match.

Q: Why must senior answers mention uncertainty instead of only techniques?

A: Because leadership decisions require understanding risk, drift, and failure modes. A list of tricks without limits sounds shallow and unsafe.

Common wrong answer to avoid: "Confidence means sounding certain." Real confidence includes boundaries.

Q: Why does data engineering sit right after inference serving in the learning path?

A: Because serving behavior depends on the shape and movement of data upstream: prompts, retrieval corpora, feature stores, and logs. Bad data flow becomes bad serving economics.

Common wrong answer to avoid: "Serving and data pipelines are separate concerns." In production they are tightly coupled.

Apply now (5 min)

Write the three biggest serving uncertainties for one product you know. Then write which upstream data choice makes each uncertainty better or worse. Finally, rewrite a shallow serving answer into an honest senior answer.

Sketch from memory:

the neat-lab versus messy-production comparison,
the open-problems list,
and the serving-to-data-engineering bridge.

Bridge. We now leave the kitchen floor and walk upstream to the pantry, delivery trucks, and inventory systems. Next: data engineering for AI, where the quality and movement of data start shaping everything the serving engine must handle. → ../06_evidence_data_pipelines/00-eli5.md