12. Production multimodal systems — reality begins when the bill arrives

~12 min read. The demo smiles. The latency chart does not.

Built on the ELI5 in 00-eli5.md. The eye (the vision encoder that sees raw pixels and outputs numbers) is only the opening cost; production pain usually appears after those numbers hit the language stack.


1) First see the serving line before any pricing table

A multimodal product is a queue, not only a model. User sends image. Preprocessing resizes it. A vision encoder converts pixels to visual tokens. Then a language model reads those tokens and starts decoding text. Look. The first surprise is speed. The vision encoder is often not the bottleneck. A CLIP-ViT-L style encoder can run in about 5 ms on a good GPU.

Nice. But answer generation may take much longer. If the LLM decodes 150 output tokens autoregressively, that tail dominates latency. So the eye is fast enough more often than beginners expect. The translator and the language decoder usually eat the bigger share.

user image
resize / normalize
vision encoder ──→ visual tokens ──→ multimodal projector ──→ LLM decode ──→ answer
   │                    ~5 ms                small bridge         often biggest tail
GPU memory spent early, latency spent late
That system view changes decisions. You stop obsessing only over image encode speed. You start watching total end-to-end latency.

Yes?

2) Visual token budget is quality, cost, and memory together

Suppose one image becomes 576 visual tokens. More tokens usually preserve more detail. Good. But they also lengthen the sequence seen by the LLM. That raises attention cost. That raises memory use. That raises price if billing follows tokens. So what do companies do? They cap resolution. They crop smartly. They use dynamic resolution.

A simple rule is bucketed image sizes. Small images get fewer patches. Large images may be downsampled or tiled. This is where the patch becomes a budget unit, not only a vision unit. A 224 × 224 image cut into 14 × 14 pixel patches gives 16 × 16 = 256 patches. A 448 × 448 image with the same patching gives 32 × 32 = 1,024 patches if fully dense. That is a 4× jump from doubling each side.

See the danger? Token count grows with area, so with the square of the side length. Cost follows. Many products therefore use dynamic resolution strategies. High-detail regions may keep finer tokens. Background regions may be compressed harder. The user feels quality. Finance feels the bill.
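The patch arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes square images, non-overlapping patches, and no extra tokens (no class token, no separators); `dense_patch_count` is an illustrative helper name, not something from a specific library.

```python
def dense_patch_count(side_px, patch_px=14):
    """Patches for a square image split into non-overlapping square patches.

    Assumes the side is an exact multiple of the patch size and that no
    extra tokens (class token, separators) are added.
    """
    if side_px % patch_px != 0:
        raise ValueError("side must be a multiple of the patch size")
    per_side = side_px // patch_px
    return per_side * per_side


small = dense_patch_count(224)   # 16 x 16 grid
large = dense_patch_count(448)   # 32 x 32 grid

print(small, large, large // small)
```

Doubling the side doubles the grid in both directions, so the count goes up by four. Under the same assumptions a 336 × 336 input gives 24 × 24 = 576 patches, which happens to match the 576-token figure used in this section.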

3) Batching and caching become awkward with images

Text requests batch nicely. Images do not. Why?

Because image sizes vary. One request may be 1024 × 1024. Another may be a narrow phone screenshot. Another may contain four panels. If you batch them naively, you pad to the largest shape. Then smaller requests waste compute. So production teams bucket by size. Maybe one bucket for 224. One for 336. One for 448. Better. Still imperfect.

Now caching. Text chats love KV cache reuse across turns. Visual tokens are harder. A new uploaded image usually means a fresh encoding pass. And cross-request reuse is rare because user images differ. So the eye does work that often cannot be amortized across unrelated users. That is one reason image uploads are limited. Another reason is moderation and storage overhead. A third reason is cost predictability. Simple, no?
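To make the padding argument concrete, here is a small sketch comparing naive padding against size buckets. The bucket sides (224, 336, 448) come from the text; the bucket-selection rule, the sample image sizes, and the helper names `pick_bucket` and `padding_waste` are assumptions for illustration, not a description of any real serving stack.

```python
BUCKETS = (224, 336, 448)  # bucket sides mentioned in the text


def pick_bucket(width, height, buckets=BUCKETS):
    """Smallest bucket covering the longer edge (assumed policy).

    Images larger than every bucket fall into the largest one, which
    implies they would be downsampled before encoding.
    """
    longest = max(width, height)
    for side in buckets:
        if longest <= side:
            return side
    return buckets[-1]


def padding_waste(sizes, padded_sides):
    """Fraction of pixels in the padded batch that are pure padding."""
    used = sum(w * h for w, h in sizes)
    total = sum(side * side for side in padded_sides)
    return 1.0 - used / total


# Hypothetical request mix; all images fit inside the largest bucket.
sizes = [(200, 180), (224, 224), (300, 330), (448, 400)]

# Naive batching: pad every image to the largest side in the batch.
largest = max(max(w, h) for w, h in sizes)
naive_waste = padding_waste(sizes, [largest] * len(sizes))

# Bucketed batching: pad each image only up to its own bucket.
bucketed_waste = padding_waste(sizes, [pick_bucket(w, h) for w, h in sizes])

print(f"naive: {naive_waste:.1%}  bucketed: {bucketed_waste:.1%}")
```

Bucketing does not remove padding. It only shrinks it, which is the "Better. Still imperfect." point above.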

4) Worked example: cost per image query in a production VLM

Assume a VLM charges by processed tokens. Suppose visual tokens are priced at $0.000002 each. Suppose output text tokens are priced separately, but ignore them first.

Take one image query. The image is resized into 576 visual tokens. Prompt text adds 120 tokens. The answer outputs 180 tokens. We only want the visual share first. Visual cost is 576 × 0.000002 dollars. That equals $0.001152. Good. Now include the prompt text, assuming text input has the same price for simplicity.

Text input cost is 120 × 0.000002 = $0.00024. Total input cost becomes $0.001152 + $0.00024 = $0.001392. Now suppose output tokens cost $0.000006 each. Output cost is 180 × 0.000006 = $0.00108. Full query cost becomes $0.001392 + $0.00108 = $0.002472. Now find the visual fraction of total cost: 0.001152 / 0.002472 ≈ 0.466. So visual tokens are about 46.6% of total cost. That is not tiny.

Now double image detail to 1,152 visual tokens. Visual cost becomes 1,152 × 0.000002 = $0.002304. New total cost is $0.002304 + $0.00024 + $0.00108 = $0.003624. Now visual share is 0.002304 / 0.003624 ≈ 63.6%. See what happened? Without changing the answer length, image detail became the majority cost. That is why many companies limit images per request. And why they sometimes downsample uploads.
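The same arithmetic as a small Python sketch, so the numbers can be re-run with other token counts. The prices are the illustrative ones from this worked example, not any provider's real rates, and `query_cost` is a hypothetical helper name.

```python
INPUT_PRICE = 0.000002   # dollars per input token (visual or text), example value
OUTPUT_PRICE = 0.000006  # dollars per output token, example value


def query_cost(visual_tokens, prompt_tokens, output_tokens,
               input_price=INPUT_PRICE, output_price=OUTPUT_PRICE):
    """Cost breakdown for one image query under flat per-token pricing."""
    visual = visual_tokens * input_price
    text_in = prompt_tokens * input_price
    output = output_tokens * output_price
    total = visual + text_in + output
    return {
        "visual": visual,
        "text_in": text_in,
        "output": output,
        "total": total,
        "visual_share": visual / total,
    }


base = query_cost(576, 120, 180)      # the original query
detailed = query_cost(1152, 120, 180)  # same query, doubled visual tokens

for name, cost in (("base", base), ("detailed", detailed)):
    print(f"{name}: total=${cost['total']:.6f} visual_share={cost['visual_share']:.1%}")
```

Only the visual token count changes between the two calls, which isolates the effect of image detail on the bill.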

request arrives
   ├── image resize bucket
   ├── vision encode
   ├── projector to LLM tokens
   ├── LLM prefill with text + visual tokens
   ├── autoregressive decode
   └── logging / billing / safety checks
Now add latency arithmetic. Suppose vision encode takes 5 ms, multimodal prefill takes 35 ms, and decode takes 220 ms. Then the total is 5 + 35 + 220 = 260 ms before network overhead. Decode is 220 / 260 ≈ 84.6% of model time. So the bottleneck is obvious. Do not optimize the wrong stage.
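A quick sketch of that latency breakdown, useful for plugging in measured numbers from your own traces. The stage timings are the assumed values from this example, and `stage_shares` is an illustrative helper name.

```python
def stage_shares(stage_ms):
    """Return each stage's fraction of total model time."""
    total = sum(stage_ms.values())
    return {name: ms / total for name, ms in stage_ms.items()}


# Example timings from the text, in milliseconds (network overhead excluded).
stages = {"vision_encode": 5.0, "prefill": 35.0, "decode": 220.0}

total_ms = sum(stages.values())
shares = stage_shares(stages)
bottleneck = max(shares, key=shares.get)

print(f"total: {total_ms:.0f} ms")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("optimize first:", bottleneck)
```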

5) Practical product consequences

This is why user-facing limits exist. One image per turn. Maybe four at most. Maybe lower resolution for free tier. Maybe OCR-heavy pages go to a special path. Maybe screenshots are tiled. Maybe long chats keep text KV cache but re-encode images. The product menu is shaped by cost math. Not by model beauty alone.

Look. A mature team watches four dashboards together. Latency. GPU memory. Token mix. Conversion or answer quality. If one rises, the others answer back. That is production multimodality.


Where this lives in the wild

  • OpenAI GPT-4o vision API — product limits on image count and size reflect visual-token cost and serving trade-offs.

  • Anthropic Claude vision — image understanding quality is strong, but long outputs still make language decode the visible bottleneck.

  • Google Gemini 1.5 multimodal serving — large context is powerful, yet visual tokens still compete with latency and memory budgets.

  • Microsoft Copilot Vision features — screenshots and page images require size bucketing and OCR-aware handling in production flows.

  • Amazon Bedrock multimodal endpoints — enterprise billing makes the visual-token fraction of total cost a direct product decision.


Pause and recall

  • Why is the LLM decode often the latency bottleneck even when a vision encoder is present?

  • How does doubling image side length change dense patch token count?

  • Why do variable image sizes create padding waste in batching?

  • In the worked example, when did visual tokens become the majority of total query cost?


Interview Q&A

Q: Why optimize decode latency and not only the vision encoder?

A: Because production delay is often dominated by multimodal prefill and autoregressive text generation, not by the few milliseconds spent encoding pixels.

Common wrong answer to avoid: Vision is the slow part because images are bigger than text.

Q: Why use bucketed batching and not fully dynamic shapes for every request?

A: Bucketing reduces padding waste while preserving enough regularity for efficient kernels and predictable throughput.

Common wrong answer to avoid: Dynamic shapes are always free, so padding does not matter.

Q: Why are visual tokens often not cacheable across requests while text KV cache is valuable?

A: Because each uploaded image is usually new input, so its visual embeddings cannot be reused the way repeated text prefixes can.

Common wrong answer to avoid: Once a model has seen one image, future image requests are basically free.

Q: Why do companies limit image uploads per prompt and not just charge a little more?

A: Because visual tokens consume memory, latency budget, moderation work, and cost in ways that quickly destabilize serving economics.

Common wrong answer to avoid: Upload limits are mostly arbitrary product choices with no systems reason behind them.


Apply now (5 min)

Quick exercise. Take 3 images with 384, 576, and 960 visual tokens. At $0.000002 per visual token, compute total visual input cost. Then add 200 output tokens at $0.000006 each and find the visual share.

Sketch from memory the serving stack from image upload to final answer. Under the sketch, mark which stage you would optimize first for latency.


Bridge. Production is running. But some problems have no clean fix yet. Time to be honest. → 13-honest-admission.md