09. ONNX Runtime optimization — graph fusion and portable serving beyond one narrow stack¶
~16 min read. This is the graph-level view of making inference lighter across many targets.
Built on the ELI5 in 00-eli5.md. The kitchen is still cooking the same meal, but ONNX Runtime rearranges stations so the work flows with fewer unnecessary handoffs.
1) Picture first: export the graph, then simplify the route¶
Think of a model graph as a kitchen route.
One operator writes a tray.
Another operator reads it back.
A third operator adds bias.
A fourth applies GELU.
If these handoffs can be fused, we save memory traffic and launch overhead.
See: ONNX Runtime first gives you an explicit graph. Then it can fold constants, remove dead nodes, and fuse compatible operator chains. That matters on both GPU and CPU.
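To make constant folding and dead-node removal concrete, here is a toy sketch in plain Python. The `Node` type, the op names, and `optimize` are hypothetical and exist only for illustration; they are not ONNX Runtime APIs, and the real passes operate on ONNX graph structures with many more rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str            # name of the tensor this node produces
    op: str              # "input", "const", "add", or "mul"
    inputs: tuple = ()   # names of the tensors this node reads
    value: float = 0.0   # used only by "const" nodes


OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def optimize(nodes, outputs):
    """Fold constant subexpressions, then drop nodes no output depends on."""
    by_name = {}
    for n in nodes:  # nodes are assumed to be topologically sorted
        ins = [by_name[i] for i in n.inputs]
        if n.op in OPS and all(i.op == "const" for i in ins):
            # Constant folding: compute the result once, at optimization time.
            n = Node(n.name, "const", (), OPS[n.op](*(i.value for i in ins)))
        by_name[n.name] = n

    # Dead-node removal: keep only nodes reachable from the requested outputs.
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(by_name[name].inputs)
    return [by_name[n.name] for n in nodes if n.name in live]


graph = [
    Node("x", "input"),
    Node("two", "const", value=2.0),
    Node("three", "const", value=3.0),
    Node("six", "mul", ("two", "three")),  # foldable: both inputs are constants
    Node("unused", "add", ("x", "two")),   # dead: nothing reads it
    Node("y", "mul", ("x", "six")),
]
optimized = optimize(graph, outputs=["y"])
print([(n.name, n.op) for n in optimized])
# [('x', 'input'), ('six', 'const'), ('y', 'mul')]
```

Six nodes become three: the two source constants vanish because their product was computed ahead of time, and the unused `add` is dropped because no output reads it.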
2) The main optimization families¶
Practical ONNX Runtime optimization usually means:
- constant folding,
- operator fusion,
- layout simplification,
- kernel selection by execution provider,
- memory planning,
- reduced copies between host and device.
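Several of these families are switched on through session options rather than written by hand. The sketch below shows one way to request a graph optimization level with the ONNX Runtime Python API. The `LEVELS` mapping, the `make_session` helper, and the file names in the usage note are hypothetical; the comments on what each level covers are a rough summary of the ONNX Runtime documentation and may vary by version.

```python
# Short, hypothetical level names mapped to the enum member names on
# onnxruntime.GraphOptimizationLevel.
LEVELS = {
    "none": "ORT_DISABLE_ALL",
    "basic": "ORT_ENABLE_BASIC",        # e.g. constant folding, redundant-node removal
    "extended": "ORT_ENABLE_EXTENDED",  # adds more complex operator fusions
    "all": "ORT_ENABLE_ALL",            # adds layout optimizations
}


def make_session(model_path, level="all", save_optimized_to=None):
    """Create an InferenceSession with the requested graph optimization level."""
    import onnxruntime as ort  # imported lazily; requires the onnxruntime package

    opts = ort.SessionOptions()
    opts.graph_optimization_level = getattr(ort.GraphOptimizationLevel, LEVELS[level])
    if save_optimized_to is not None:
        # Write the optimized graph to disk so it can be inspected or reused.
        opts.optimized_model_filepath = save_optimized_to
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=["CPUExecutionProvider"]
    )
```

A call such as `make_session("model.onnx", save_optimized_to="model.opt.onnx")` would load a placeholder model and save the optimized graph next to it, which is a convenient way to see which nodes were actually fused.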
Look. These are graph and runtime decisions, not new model training tricks. The model’s meaning stays the same. The kitchen path becomes shorter and cleaner. That is why ONNX is attractive for cross-platform deployment.
3) Worked example: why fusion saves traffic¶
Suppose a hidden layer produces 4,096 activations for one token. A naive path does:
- MatMul writes 4,096 fp16 values
- Add reads and writes them again
- GELU reads and writes them again
Each write or read is:
- 4,096 × 2 bytes = 8,192 bytes = 8 KB
Ignoring caches, the unfused path moves roughly:
- MatMul write: 8 KB
- Add read + write: 16 KB
- GELU read + write: 16 KB
- total activation traffic ≈ 40 KB
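Fuse Add and GELU into the producer path, and most of that intermediate traffic disappears: in the ideal case a single fused kernel writes only the final 8 KB, so about 32 KB of the 40 KB never touches memory. One small layer saves only kilobytes. Across many layers and many tokens, it adds up to real time.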
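The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch under the same assumptions as the text: caches are ignored, MatMul's own input reads are not counted, and the fused case is the idealized one where only the final result is written. The function name and the stage tuples are made up for this example.

```python
def activation_traffic_bytes(n_values, bytes_per_value, stages):
    """Sum bytes moved, given (name, reads, writes) full-tensor passes per stage."""
    per_pass = n_values * bytes_per_value
    return sum((reads + writes) * per_pass for _, reads, writes in stages)


N, FP16_BYTES = 4096, 2

unfused = [("MatMul", 0, 1), ("Add", 1, 1), ("GELU", 1, 1)]
# Idealized fusion: Add and GELU run inside the MatMul epilogue, so only the
# final result is written out.
fused = [("MatMul+Add+GELU", 0, 1)]

unfused_bytes = activation_traffic_bytes(N, FP16_BYTES, unfused)
fused_bytes = activation_traffic_bytes(N, FP16_BYTES, fused)

print(unfused_bytes // 1024, "KB unfused")                  # 40 KB unfused
print(fused_bytes // 1024, "KB fused")                      # 8 KB fused
print((unfused_bytes - fused_bytes) // 1024, "KB saved")    # 32 KB saved
```

Changing `N` or the stage list lets you repeat the estimate for any other operator chain.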
4) Execution providers decide where the graph runs¶
ONNX Runtime itself is not one hardware backend. It dispatches to execution providers. Examples include CPU, CUDA, DirectML, and other vendor paths.
That means the same exported model can target different kitchens. A laptop GPU path may differ from a server GPU path. A CPU fallback may still exist. Simple, no? This portability is a real engineering advantage when you need more than one deployment environment.
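One common pattern is to keep a preference list of providers and filter it by what the current machine offers. The sketch below assumes the provider name strings used by the ONNX Runtime Python package; `pick_providers` and `PREFERRED` are hypothetical helpers, and the "available" list is hard-coded so the example runs without the library installed.

```python
def pick_providers(available, preferred):
    """Return the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" in available and "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # keep a CPU fallback at the end
    return chosen


PREFERRED = ["CUDAExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]

# Pretend this machine has DirectML but no CUDA build. In real use the list
# would come from onnxruntime.get_available_providers().
example_available = ["DmlExecutionProvider", "CPUExecutionProvider"]

print(pick_providers(example_available, PREFERRED))
# ['DmlExecutionProvider', 'CPUExecutionProvider']
```

The resulting list can be passed as the `providers` argument of `onnxruntime.InferenceSession`, which tries the providers in the order given.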
5) The limits for modern LLM serving¶
So what is the problem? Large autoregressive serving still has dynamic behaviors. Continuous batching, paged KV cache, and specialized decode kernels are not always captured cleanly by a plain exported graph alone.
So what should you do? Use ONNX Runtime where graph portability and fusion are the main goals, especially for smaller models or edge targets. For the hottest large-scale decode path, you may still prefer a serving engine specialized for text generation.
Next we add another major lever that both worlds care about: quantized serving.
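To see why decoding is awkward for a fixed-shape graph, it helps to watch the KV cache grow. The sketch below uses the usual size estimate (keys plus values, per layer, per head, per position). The model shape is hypothetical and chosen only to make the numbers concrete.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed for the keys and values of one sequence at a given length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, fp16 cache entries.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

prompt_len = 100
sizes = [kv_cache_bytes(prompt_len + step, **shape) for step in range(4)]

for step, size in enumerate(sizes):
    print(f"decode step {step}: {size // 1024} KB of KV cache")
print("growth per generated token:", (sizes[1] - sizes[0]) // 1024, "KB")
```

Every generated token changes the cache shape, and every request in a batch sits at a different length. Managing that is the job that paged KV caches and continuous batching take on in dedicated serving engines.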
Where this lives in the wild¶
- ONNX Runtime GenAI with Phi-family demos: portable chat deployment across CPU and GPU targets is the main draw.
- Windows and DirectML model apps: ONNX graphs let one application target many consumer GPUs without rewriting kernels per vendor.
- Azure and enterprise edge deployments: exported graphs simplify shipping compact models to varied hardware fleets.
- Hugging Face Optimum pipelines: teams use ONNX export plus runtime optimization when broad deployability matters.
- Smaller local assistants on laptops: graph fusion and provider choice often matter more than massive cluster scheduling tricks.
Pause and recall¶
- What does operator fusion save besides raw arithmetic?
- In the worked example, roughly how much activation traffic did the unfused path move?
- Why are execution providers important to the ONNX story?
- Why might ONNX Runtime not be the full answer for the hottest large-scale decode path?
Interview Q&A¶
Q: Why optimize an exported graph instead of only tuning the original framework runtime?
A: Because an explicit graph exposes constant folding, fusion, and provider-specific execution planning in a portable form. It creates a cleaner deployment artifact.
Common wrong answer to avoid: "Export is only for compatibility." Optimization is a major reason too.
Q: Why does operator fusion help latency even when FLOPs barely change?
A: Because memory traffic, kernel launches, and intermediate writes shrink. Serving speed depends on data movement, not only on the operation count.
Common wrong answer to avoid: "Fusion matters only if it reduces operations." Reduced movement is often the bigger win.
Q: Why can ONNX Runtime be a strong choice for edge deployment but not always for giant chat clusters?
A: Because portability and provider coverage are excellent, while the most advanced large-scale text-serving tricks may live in engines built specifically for autoregressive decoding.
Common wrong answer to avoid: "Portable means best everywhere." Portability and peak cluster performance are different goals.
Q: Why are execution providers more than a backend detail?
A: Because they determine which kernels and memory paths actually execute the graph on each device class. They shape real deployment behavior.
Common wrong answer to avoid: "ONNX Runtime speed is one universal number." It depends heavily on provider choice.
Apply now (5 min)¶
Pick one tiny operator chain from a model you know: MatMul, Add, GELU is enough. List the intermediate tensors in the unfused path. Count approximate bytes written and reread for one token row. Sketch from memory:
- the fused-chain diagram,
- the optimization family list,
- and the provider concept.
Bridge. Graph cleanup helps, but weight size still dominates many deployments. Next we study quantized serving, where weights shrink dramatically and new accuracy-versus-speed tradeoffs appear. → 10-quantized-serving.md