05. Triton Inference Server: the serving layer above the engine
~21 min read. A compiled engine runs one forward pass fast. Production needs more: tokenize, run a safety classifier, run the 70B, run an embedding model, route the request, batch at the edge, version the deployment, share four GPUs across all of it. The engine doesn't do any of that. The server does.
Built on TensorRT-LLM compilation, which gave us a fast engine, and on the whole stack below it. The invariant is still "feed the beast," now at the request boundary: an idle GPU between requests is the same waste as an idle SM between tokens. This file introduces dynamic batching (batching at the server edge), model ensembles (server-side pipelines), multi-framework backends, and concurrent model execution: the levers that keep GPUs fed when you serve more than one model.
What the engine does and what it refuses to do
File 04 gave us a TensorRT-LLM engine: it takes a batch of token IDs, runs the forward pass with fused kernels and in-flight batching, and returns logits, fast. That's all it does. It doesn't know about HTTP. It doesn't tokenize text. It doesn't run the safety classifier that has to vet every prompt, or the embedding model the retrieval step needs, or the postprocessing that turns logits into a streamed response. It doesn't version itself, route requests, or share a GPU with a second model. It is a function, not a service.
Production needs the service. A single user request to our chat endpoint is really a small pipeline: tokenize the prompt, embed it for retrieval, run a safety check, run the 70B engine, detokenize and stream the answer. Some of those steps are PyTorch, some are TensorRT, some are plain Python. They need to run together, server-side, with batching at the request edge and many models sharing the same four GPUs without stepping on each other. This file is the layer that does that — NVIDIA Triton Inference Server (folded into the Dynamo platform in 2025 and now called Dynamo-Triton) — and the four mechanisms it adds on top of the engine: dynamic batching, ensembles, multi-framework backends, and concurrent execution.
What this file solves
Our 70B engine is fast in isolation, but the real endpoint is slow and the GPUs are under-fed because each request bounces between separate microservices — a tokenizer service, an embedding service, a safety service, the LLM — over the network, and each model gets its own under-utilized GPU. The naive read is "give each model its own box." The real cause is that splitting the pipeline across network hops adds latency and leaves every GPU partly idle. This file teaches you to host the whole pipeline on one server: batch requests at the edge (dynamic batching), chain the steps server-side (ensembles), run mixed frameworks side by side (backends), and pack multiple models onto the same GPUs (concurrent execution) so none sits idle.
Why a server, and not a bag of microservices
The instinct from web engineering is to make each model a microservice: a tokenizer service, an embedding service, a safety-classifier service, an LLM service, each behind its own HTTP endpoint, wired together by an orchestrator. It's clean on an architecture diagram. On GPUs it bleeds.
Two costs appear. First, every hop between services is a network round-trip — serialize the tensor, send it, deserialize it — and for a five-stage pipeline that's several round-trips per request, often dwarfing the compute. Second, each service owns its own GPU(s), and most models don't fill a GPU: the safety classifier might use 5% of a card, the embedding model 10%, and they each hold a whole H100 to do it. You're paying for cards that sit nearly idle while the request waits on network hops between them.
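The two costs can be put into rough numbers. The sketch below is a back-of-envelope model only: the 1 ms hop cost is an assumed figure, the utilization fractions are the illustrative ones from the paragraph above, and the function names are hypothetical.

```python
# Back-of-envelope model of the two microservice costs.
# Every number here is an illustrative assumption, not a measurement.

HOP_MS = 1.0  # assumed cost of one internal hop: serialize + round-trip + deserialize

# Fraction of a whole GPU each small model actually uses (figures from the text).
SMALL_MODEL_USE = {"safety": 0.05, "embed": 0.10}


def hop_overhead_ms(num_services: int, hop_ms: float = HOP_MS) -> float:
    """Latency added by internal hops in a linear chain of services."""
    return max(num_services - 1, 0) * hop_ms


def idle_gpu_equivalents(use_by_model: dict) -> float:
    """GPU capacity left idle when every small model holds a card of its own."""
    return sum(1.0 - used for used in use_by_model.values())


if __name__ == "__main__":
    print(f"hop overhead for 5 services: {hop_overhead_ms(5):.1f} ms")
    print(f"idle capacity: {idle_gpu_equivalents(SMALL_MODEL_USE):.2f} GPUs")
```

Under these assumptions a five-service chain pays four internal hops per request, and the two small models alone strand almost two full cards of capacity.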
A serving server collapses this. It hosts all the models in one process on one set of GPUs, runs the pipeline server-side (no network hops between stages, tensors passed in shared memory), batches incoming requests across clients, and schedules multiple models to share the same GPUs by interleaving their execution. The request edge becomes the place to keep the GPUs fed — the same "don't let the hardware idle" pressure as the roofline, one layer up.
Teacher voice. Every layer of this module is the same fight against idle hardware, at a different time scale. The roofline fought idle SMs between weight-reads (nanoseconds). Fusion fought idle SMs between kernels (microseconds). In-flight batching fought idle GPUs between tokens (milliseconds). The serving server fights idle GPUs between requests (milliseconds to seconds) and idle GPUs held by under-filling models. Same enemy, bigger clock.
The naive deployment that idles four GPUs
A team deploys the chat pipeline as five microservices, one stage each. The 70B is tensor-parallel across two GPUs; the tokenizer, embedder, safety classifier, and detokenizer each get a GPU of their own. That is six GPUs, and the dashboard shows the four small stages at single-digit utilization while requests crawl.
The visible break: a trace shows each request making four network hops, with serialization/deserialization at each, and the small-model GPUs mostly idle between requests because traffic is bursty and one request at a time doesn't fill them. Latency is dominated by the hops and the idle gaps, not the math. Adding GPUs makes it worse — more idle silicon.
So the real problem is not that the models are slow; it is that the pipeline is scattered across GPUs and network, so most cards idle while requests pay round-trip taxes. The fix isn't more GPUs — it's putting the pipeline on one server that keeps the GPUs fed by batching at the edge and packing models together.
So how do we serve many models and a multi-step pipeline without idle GPUs and network hops between every stage?
When four models could share two cards
Take the smallest case. A safety classifier uses ~5% of an H100; the chat engine's two cards are busy. Run them as separate services and the classifier holds its own whole card at 5% — 95% wasted — while requests hop over the network to reach it. Co-host both on the same server: the classifier shares a GPU with other small models (concurrent execution), and its step runs server-side right before the LLM step in one request (an ensemble), no network hop. Same two cards now do the work that previously took three or four, and the request makes zero internal network hops.
Rule: keep every GPU fed by batching at the edge and packing models together
The serving rule. Above the engine, throughput comes from never letting a GPU idle waiting on requests. Batch arriving requests into one forward pass (dynamic batching); run multi-step pipelines server-side in one request with no network hops (ensembles); host models from different frameworks in one server (backends); and run multiple model instances concurrently so a model that can't fill a GPU shares it (concurrent execution). The request edge is where idle-GPU time is created or eliminated.
Why the rule exists. The primitive is the same as the roofline's: a GPU delivers value only when fed, and the engine's in-flight batching only helps if requests actually reach it batched and the GPU isn't idling between unrelated steps. The constraint is that production has many models, multiple frameworks, multi-step pipelines, and bursty traffic — none of which a single engine handles. Microservice-per-model breaks because it scatters the pipeline across network and idles every GPU that one model can't fill. The server packs the work so the hardware stays busy.
1) Dynamic batching: batching at the door
The engine's in-flight batching (file 04) batches inside the model's generation loop. Dynamic batching is one level up: it batches incoming requests before they enter the model. The server holds arriving requests in a queue for a short, bounded window (microseconds to a few milliseconds), then dispatches them as one batch.
WITHOUT DYNAMIC BATCHING            WITH DYNAMIC BATCHING (max_queue_delay)
────────────────────────            ──────────────────────────────────────
req → model (batch 1)               req ┐
req → model (batch 1)               req ┤ wait ≤ delay, then dispatch
req → model (batch 1)               req ┘ → model (batch 3)
each forward pass serves 1          one forward pass serves 3
GPU re-reads weights per request    weight-read amortized across the batch
The knob is the maximum queue delay: how long the server waits to accumulate a batch. Wait longer and batches get bigger (more throughput) but each request pays the wait (more latency). Wait less and latency is tighter but batches are smaller. This is the latency-vs-throughput dial of the whole serving layer, and it's the roofline's batching lever applied to request arrival: more requests per forward pass = more tokens per weight-read.
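The queue-delay trade-off can be seen in a toy model. This is a simplified sketch, not Triton's scheduler: it ignores preferred batch sizes and instance availability, and `dynamic_batches` is a hypothetical helper that only groups arrival timestamps.

```python
from typing import List


def dynamic_batches(arrivals_us: List[int], max_batch: int,
                    max_delay_us: int) -> List[List[int]]:
    """Group request arrival times (microseconds) into batches.

    A batch is dispatched when it reaches max_batch, or when a new request
    arrives after the oldest queued request has already waited max_delay_us.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_us):
        # The window opened by the oldest queued request has expired.
        if current and t - current[0] > max_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch without waiting any longer.
        if len(current) == max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


ARRIVALS = [0, 100, 200, 5000, 5100]  # assumed bursty arrival pattern
WIDE = dynamic_batches(ARRIVALS, max_batch=8, max_delay_us=500)
NONE = dynamic_batches(ARRIVALS, max_batch=8, max_delay_us=0)
```

With a 500 µs window the five requests collapse into two forward passes; with no window each request runs alone, which is the "batches of 1" case.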
For an LLM with in-flight batching, the engine's continuous batching does most of the heavy lifting, so dynamic batching matters most for the non-LLM models in the pipeline — the embedder, the classifier — which don't have their own continuous batching and benefit from batching at the door. For classic vision/embedding models, dynamic batching is the primary throughput lever.
Mini-FAQ. "If TensorRT-LLM already does in-flight batching, why do I need dynamic batching too?" They operate at different boundaries. In-flight batching keeps the LLM's generation loop full across the lifetime of requests already inside the engine. Dynamic batching decides how requests arrive at a model — essential for the stateless models in the pipeline (embedder, classifier) and for non-LLM workloads. For the LLM step itself, you lean on in-flight batching; for the rest of the ensemble, dynamic batching does the work.
2) The picture: scattered microservices vs one server
The mental model that lands this file is the pipeline drawn two ways: scattered across the network, and collapsed into one server.
SCATTERED MICROSERVICES                 ONE SERVING SERVER (ensemble)
───────────────────────                 ─────────────────────────────
client                                  client
  │ http                                  │ http (one request)
  ▼                                       ▼
[tokenizer svc] GPU0 (idle 90%)         ┌───────────── Triton / Dynamo-Triton ──────────────┐
  │ net hop + (de)serialize             │ ensemble pipeline, tensors in shared memory:      │
  ▼                                     │   tokenize(py) → embed(PyTorch) → safety(ONNX)    │
[embed svc] GPU1 (idle 85%)             │   → LLM(TensorRT-LLM engine) → detokenize(py)     │
  │ net hop                             │                                                   │
  ▼                                     │ concurrent execution: small models share GPUs;    │
[safety svc] GPU2 (idle 95%)            │ dynamic batching at the edge; 70B is TP-2.        │
  │ net hop                             └──────────────────────┬────────────────────────────┘
  ▼                                                            │
[LLM svc] GPU3-4 (busy)                                        ▼
  │                                                    streamed response
  ▼
several net hops, 4+ GPUs, most idle    one request, no internal hops, GPUs packed
The scattered version pays a network hop and a serialize/deserialize at every stage, and dedicates a whole GPU to each small model that can't fill it. The server version runs the same five stages as one ensemble — a server-side pipeline where the output of one step is passed to the next in shared memory, no network — and uses concurrent execution to pack the small models onto shared GPUs while the 70B keeps its tensor-parallel pair. Same models, fewer GPUs, no internal hops.
3) The 70B endpoint as an ensemble: the running example
Our endpoint isn't just the 70B; it's tokenize → embed-for-retrieval → safety-check → 70B generate → detokenize. We define this as a Triton ensemble: a directed graph of models where the server routes tensors from each step's output to the next step's input, entirely server-side.
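As a rough sketch of what such a graph looks like on disk, the Python below renders the `ensemble_scheduling` block of an ensemble `config.pbtxt` and sanity-checks the wiring first. The model names, tensor names, and the pass-through safety step are all hypothetical; a real ensemble config also needs `name`, `platform: "ensemble"`, and `input`/`output` declarations, which are omitted here. The ordering check is stricter than Triton requires, since the server resolves steps by data flow rather than by listing order.

```python
from typing import Dict, List, NamedTuple


class Step(NamedTuple):
    model_name: str
    input_map: Dict[str, str]   # model input name  -> ensemble tensor name
    output_map: Dict[str, str]  # model output name -> ensemble tensor name


def check_wiring(steps: List[Step], ensemble_inputs: List[str],
                 ensemble_outputs: List[str]) -> None:
    """Raise if a step reads a tensor nothing upstream produces."""
    available = set(ensemble_inputs)
    for step in steps:
        missing = set(step.input_map.values()) - available
        if missing:
            raise ValueError(f"{step.model_name} reads undefined {sorted(missing)}")
        available |= set(step.output_map.values())
    unwritten = set(ensemble_outputs) - available
    if unwritten:
        raise ValueError(f"ensemble outputs never written: {sorted(unwritten)}")


def render_scheduling(steps: List[Step]) -> str:
    """Render the ensemble_scheduling block of a config.pbtxt."""
    blocks = []
    for s in steps:
        lines = [f'      model_name: "{s.model_name}"', "      model_version: -1"]
        lines += [f'      input_map {{ key: "{k}" value: "{v}" }}'
                  for k, v in s.input_map.items()]
        lines += [f'      output_map {{ key: "{k}" value: "{v}" }}'
                  for k, v in s.output_map.items()]
        blocks.append("    {\n" + "\n".join(lines) + "\n    }")
    return "ensemble_scheduling {\n  step [\n" + ",\n".join(blocks) + "\n  ]\n}"


# Hypothetical model and tensor names for the five-stage pipeline.
STEPS = [
    Step("tokenizer", {"TEXT": "prompt"}, {"IDS": "ids"}),
    Step("embedder", {"IDS": "ids"}, {"EMBEDDING": "query_vec"}),
    Step("safety", {"IDS": "ids"}, {"SAFE_IDS": "safe_ids"}),
    Step("llm_70b", {"INPUT_IDS": "safe_ids"}, {"OUTPUT_IDS": "out_ids"}),
    Step("detokenizer", {"IDS": "out_ids"}, {"TEXT": "reply"}),
]
check_wiring(STEPS, ensemble_inputs=["prompt"],
             ensemble_outputs=["reply", "query_vec"])
CONFIG = render_scheduling(STEPS)
```

Each `input_map` key is the composing model's own tensor name and each value is the ensemble-level tensor it is wired to; that mapping is what lets the server hand one step's output to the next without leaving the process.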
The 70B is the TensorRT-LLM engine from file 04, served by Triton's TensorRT-LLM backend, tensor-parallel-2 on NVLink (file 03's placement). The tokenizer and detokenizer are Python backend steps. The embedder is a PyTorch model; the safety classifier is an ONNX model. Triton hosts all four frameworks in one process. The small models — embedder, classifier — run as multiple concurrent instances sharing a GPU via concurrent execution, with dynamic batching at their edges. The whole pipeline answers a request without a single internal network hop, and the previously-idle GPUs that the small models monopolized are reclaimed.
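For the small models, instance count and batching window come down to a few lines of per-model configuration. The helper below is a minimal sketch that renders such a fragment; the model names, instance counts, GPU index, and delays are assumptions for illustration, and a complete `config.pbtxt` would normally also declare the model's inputs and outputs.

```python
def render_small_model_config(name: str, backend: str, max_batch_size: int,
                              instances: int, gpu: int,
                              max_queue_delay_us: int) -> str:
    """Render a config.pbtxt fragment for a small, stateless model."""
    return (
        f'name: "{name}"\n'
        f'backend: "{backend}"\n'
        f"max_batch_size: {max_batch_size}\n"
        "instance_group [\n"
        # Several instances pinned to one GPU so requests can overlap there.
        f"  {{ count: {instances} kind: KIND_GPU gpus: [ {gpu} ] }}\n"
        "]\n"
        "dynamic_batching {\n"
        # Upper bound on how long a request waits for batch-mates.
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
    )


# Assumed values: both small models share GPU 2, away from the 70B's pair.
SAFETY_CFG = render_small_model_config(
    "safety", "onnxruntime", max_batch_size=32,
    instances=4, gpu=2, max_queue_delay_us=500)
EMBEDDER_CFG = render_small_model_config(
    "embedder", "pytorch", max_batch_size=32,
    instances=4, gpu=2, max_queue_delay_us=500)
```

Pointing both models at the same GPU index is what "sharing a card" amounts to in configuration terms; the right counts and delay depend on measured footprints and the latency budget.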
This is where the running example stops being "a fast 70B" and becomes "a fast endpoint": the engine's tokens/sec from file 04 is necessary but not sufficient; the serving layer is what removes the network and idle-GPU taxes that were keeping the endpoint slow even though the model was fast. Combined with the engine's in-flight batching, the endpoint now sustains its target throughput with the small models packed onto shared GPUs instead of hoarding their own.
Mini-FAQ. "Why not call the models from my own Python orchestrator instead of an ensemble?" You can, and people do — but then you're back to network hops between your orchestrator and each model server, plus serialization at each boundary. The ensemble keeps the tensors in the server's shared memory and the routing in C++, so a five-stage pipeline costs roughly one request's worth of overhead, not five. Move orchestration into the server when the hops dominate; keep it outside when you need complex branching the ensemble graph can't express.
4) Why a multi-framework server and not one-engine-per-service?
The plausible alternative is to compile everything into TensorRT and serve each as its own optimized service. Why use a general server that hosts PyTorch, ONNX, and Python alongside the TensorRT-LLM engine?
Because under our workload the pipeline is genuinely heterogeneous and changes at different rates. The 70B is frozen and worth compiling (file 04). The safety classifier gets retrained monthly and lives happily as ONNX. The tokenizer is plain Python. The embedder is a PyTorch model someone iterates on. Forcing all of them through one compiler would be enormous rebuild churn for the parts that change often (file 04's rigidity cost), and some — the Python tokenizer — don't compile to an engine at all. A multi-framework server lets each model use the runtime that fits its change rate and shape, while still co-hosting them so the GPUs stay packed and the pipeline stays hop-free. You compile what's stable and hot (the 70B); you keep the rest flexible.
Why this instead of all-TensorRT, under our workload? Our pipeline mixes a frozen hot model with several models that change often or don't compile. The server's value is hosting that mix in one process with shared-memory pipelining and GPU packing — getting the engine's speed where it matters without paying compilation's rigidity where it doesn't.
5) The property that decides packing: how well each model fills a GPU
The one dimension that decides concurrent-execution strategy is how much of a GPU each model actually uses. Pack models whose footprints add up to a card; isolate models that need a whole card.
| Model | GPU footprint | Serving strategy |
|---|---|---|
| 70B chat (TP-2) | spans 2 full GPUs | dedicated GPUs, tensor-parallel on NVLink; do not share |
| Embedding model | ~10% of a GPU | multiple concurrent instances sharing one GPU + dynamic batching |
| Safety classifier (ONNX) | ~5% of a GPU | co-located with embedder on shared GPU; concurrent execution |
| Tokenizer (Python, CPU-ish) | negligible GPU | runs on CPU or shares trivially |
The asymmetry: a 70B cannot share a card — it needs more than one whole GPU, so concurrent execution and (looking ahead) MIG are irrelevant to it. The small models should share, because each alone wastes 90%+ of a card. Concurrent execution and instance groups let you run several copies of the small models on one GPU so its utilization climbs from single digits toward full. Getting this packing right is the difference between six GPUs and three for the same pipeline.
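The packing decision can be sketched as first-fit-decreasing bin packing over per-instance footprints. This is a deliberate simplification: it treats a model's footprint as one additive fraction of a card (real footprints have separate memory and compute components), and the 20% headroom and instance counts are assumed values.

```python
from typing import Dict, List


def pack_models(footprints: Dict[str, float],
                headroom: float = 0.2) -> List[List[str]]:
    """First-fit-decreasing packing of fractional GPU footprints onto cards."""
    capacity = 1.0 - headroom  # leave slack so co-hosted models do not thrash
    gpus: List[List[str]] = []
    loads: List[float] = []
    ordered = sorted(footprints.items(), key=lambda kv: kv[1], reverse=True)
    for name, frac in ordered:
        if frac > capacity:
            raise ValueError(f"{name} needs dedicated GPUs; it cannot share")
        for i, load in enumerate(loads):
            if load + frac <= capacity + 1e-9:
                gpus[i].append(name)
                loads[i] += frac
                break
        else:
            # No existing card has room: open a new one.
            gpus.append([name])
            loads.append(frac)
    return gpus


# Assumed: four instances each of the embedder (~10%) and classifier (~5%).
INSTANCES = {f"embedder#{i}": 0.10 for i in range(4)}
INSTANCES.update({f"safety#{i}": 0.05 for i in range(4)})
SHARED = pack_models(INSTANCES)
TOTAL_GPUS = len(SHARED) + 2  # plus the 70B's dedicated tensor-parallel pair
```

Under these assumptions all eight small instances fit on one shared card, so the pipeline needs three GPUs in total, which is the "six versus three" gap described above.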
6) The failure walked through: the ensemble that serialized itself
A team builds the ensemble, expecting the steps to overlap across requests. Throughput is poor and a trace shows the GPU mostly idle. They're confused — everything's co-hosted now, so why is it slow?
Trace it. Each model in the ensemble had a single instance (instance_group count 1), so the server could run only one copy of each step at a time. While request 1 held the only safety-classifier instance, request 2's safety check could not start; it waited in that model's queue even though the GPU had room for a second copy. The same thing happened at every cheap stage: at most one request in flight per stage, so a burst drained one request at a time and the GPUs idled around the small kernels each lone instance launched. The fix was to raise the instance count on the cheap stages (multiple embedder and classifier instances) and enable dynamic batching, so the server overlaps many requests' stages and keeps every GPU busy. The lesson: co-hosting alone doesn't keep GPUs fed. You must give the server enough instances to overlap work, or it serializes.
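A toy model of one stage draining a burst shows why instance count matters. It assumes every request arrives at once, each takes the same service time, and instances run fully in parallel; that last assumption only holds while the GPU has headroom, and the numbers are illustrative.

```python
import heapq


def stage_makespan_ms(num_requests: int, service_ms: float,
                      instances: int) -> float:
    """Time for one stage to drain a burst that arrives all at once."""
    # Min-heap of the times at which each instance next becomes free.
    free_at = [0.0] * instances
    heapq.heapify(free_at)
    for _ in range(num_requests):
        start = heapq.heappop(free_at)       # earliest available instance
        heapq.heappush(free_at, start + service_ms)
    return max(free_at)


BURST = 8          # assumed burst size
SERVICE_MS = 2.0   # assumed per-request time in the safety stage
ONE = stage_makespan_ms(BURST, SERVICE_MS, instances=1)
FOUR = stage_makespan_ms(BURST, SERVICE_MS, instances=4)
```

With one instance the burst takes eight service times back to back; with four instances it takes two. Dynamic batching shortens this further by serving several queued requests in one execution.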
7) Cost movement: what the server buys and what it costs
- What it fixes: removes network hops between pipeline stages (shared-memory ensembles), reclaims idle GPUs by packing models (concurrent execution), batches non-LLM models at the edge (dynamic batching), and lets each model use the right framework (backends) — so the endpoint is fast, not just the model.
- What it costs: configuration complexity (model repository layout, config.pbtxt per model, ensemble graphs, instance groups, batching windows) and a coordination point: the server is now a shared resource whose misconfiguration (too few instances, wrong batch delay) serializes or idles the fleet.
- Which subsystem pays: the serving/MLOps team owns the model repository, the ensemble definitions, and the batching/instance tuning. The reward lands in GPU count (fewer cards for the same pipeline) and endpoint latency (no internal hops). The new pressure: the server's tuning knobs (queue delay, instance count) are now load-bearing; get them wrong and you idle exactly the hardware you co-located to keep busy.
For the running example: co-hosting the five-stage pipeline on Triton turns six under-utilized GPUs into a handful of well-packed ones, removes four network hops per request, and lets the frozen 70B stay compiled while the often-changing safety model stays ONNX — at the cost of a model repository and batching config the team must own and tune.
8) Signals: healthy, first to degrade, and the liar
- Healthy: GPU utilization high across all co-hosted models (no card idling at single digits); ensemble end-to-end latency close to the sum of compute (little hop/queue overhead); dynamic batch sizes for the small models near their configured max; per-model queue times low.
- First metric to degrade: per-model queue time climbs (requests waiting for a free instance) or dynamic batch size collapses to 1 — both mean too few instances or too-short batch windows, so the server can't overlap or batch work. Tokens/sec sags before latency obviously spikes.
- The misleading metric: the LLM engine's own tokens/sec looks great in isolation while the endpoint is slow — because the bottleneck moved to a starved small-model instance or a network hop the engine metric can't see. Measuring only the engine hides a serving-layer stall.
- The graph an expert opens first: Triton's per-model metrics, nv_inference_queue_duration_us (time requests wait for an instance) and nv_inference_compute_infer_duration_us per model, plus per-GPU utilization across all models. A model with high queue duration and a GPU idling next to it means too few instances; a long ensemble latency with low compute means hop/serialization overhead.
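Triton exposes these signals as cumulative Prometheus counters (for example `nv_inference_queue_duration_us`, `nv_inference_request_success`, `nv_inference_count`, and `nv_inference_exec_count`), so the useful numbers are ratios. The sketch below parses a metrics dump and derives average queue time and average batch size per model. The sample text and its values are made up for illustration, and the parser assumes one version per model.

```python
import re
from collections import defaultdict

# Made-up sample in the Prometheus text format served by the metrics endpoint.
SAMPLE = """\
nv_inference_request_success{model="embedder",version="1"} 1000
nv_inference_queue_duration_us{model="embedder",version="1"} 4000000
nv_inference_count{model="embedder",version="1"} 1000
nv_inference_exec_count{model="embedder",version="1"} 1000
nv_inference_request_success{model="safety",version="1"} 1000
nv_inference_queue_duration_us{model="safety",version="1"} 50000
nv_inference_count{model="safety",version="1"} 1000
nv_inference_exec_count{model="safety",version="1"} 125
"""

LINE = re.compile(r"^(nv_inference_\w+)\{([^}]*)\}\s+(\S+)$")
MODEL = re.compile(r'model="([^"]+)"')


def per_model_counters(text: str) -> dict:
    """Collect nv_inference_* counters keyed by model name."""
    stats = defaultdict(dict)
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip comments, blank lines, and other metric families
        model = MODEL.search(m.group(2))
        if model:
            stats[model.group(1)][m.group(1)] = float(m.group(3))
    return dict(stats)


def summarize(stats: dict) -> dict:
    """Average queue time per request and average batch size per execution."""
    return {
        name: {
            "avg_queue_us": s["nv_inference_queue_duration_us"]
            / s["nv_inference_request_success"],
            "avg_batch": s["nv_inference_count"] / s["nv_inference_exec_count"],
        }
        for name, s in stats.items()
    }


SUMMARY = summarize(per_model_counters(SAMPLE))
```

In this made-up sample the embedder waits 4 ms per request with batches of 1, the "too few instances or too-short window" signature, while the classifier queues briefly and runs batches of 8.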
9) Boundary: where the server shines and where it doesn't
Triton shines for heterogeneous, multi-model, multi-stage serving on shared GPUs — exactly our pipeline of a frozen LLM plus several small models in different frameworks. It packs GPUs, removes hops, and lets each model use its best runtime. It's also where you get production essentials: model versioning, A/B/canary via version policies, metrics, and health.
It becomes overhead when you serve a single model at massive scale with no pipeline — there, the server's generality is weight you don't use, and a thin purpose-built deployment (or NIM, file 06) may be simpler. It can also become a bottleneck if mis-tuned: a single coordination point with too few instances serializes the fleet (section 6). And for truly distributed, disaggregated LLM serving across many nodes (separating prefill and decode onto different GPUs), the newer Dynamo platform extends beyond classic Triton — Triton is one piece of it. The scale limit that invalidates naive intuition: "co-hosting always saves GPUs" is false if you under-provision instances — then co-hosting serializes and you idle more than you saved.
10) Wrong model: "the server is just an HTTP wrapper around the model"
The seductive wrong idea is that a serving server is a thin shim — it takes HTTP, calls the model, returns JSON — so any web framework would do. That misses everything that makes GPU serving hard: batching at the edge, packing models, server-side pipelines, multi-framework hosting, versioning.
Replace it with: the server is the layer that keeps expensive GPUs fed across requests and across models. Its job is scheduling — deciding which model instance runs which batched requests on which GPU, when — so no card idles. A plain HTTP wrapper gives you the network plumbing and none of the scheduling, which is where the GPU economics live. The server is a GPU scheduler with an HTTP front, not an HTTP server with a model attached.
11) Other failure shapes to recognize
- Single-instance serialization. One instance per model, so requests can't overlap and GPUs idle (section 6). Fix: raise instance_group counts on cheap stages.
- Batch-window mistuning. Queue delay too long (latency spikes) or too short (batches of 1, no throughput). Fix: tune max_queue_delay_microseconds to the latency budget.
- Compiling the wrong thing. Forcing the often-retrained safety model through TensorRT and eating rebuild churn. Fix: keep volatile models in flexible backends (ONNX/PyTorch); compile only the stable hot model.
- Orchestrating outside when you should ensemble. External Python orchestrator adds hops the ensemble would remove. Fix: move hop-heavy pipelines into a server-side ensemble.
- Trying to share a card with a model that needs the whole thing. Co-locating something on the 70B's GPUs and starving it. Fix: dedicate full GPUs to large models; share only with small ones.
- Version skew across the pipeline. Tokenizer/model/detokenizer versions drift, producing subtly wrong outputs. Fix: version the ensemble as a unit; use Triton version policies.
12) Pattern transfer
- Same idle-hardware fight as the roofline, one clock slower. The roofline kept SMs fed across nanosecond weight-reads; the server keeps GPUs fed across millisecond requests and across models. Dynamic batching is the roofline's "amortize the weight-read" lever applied to request arrival.
- Same hop-elimination as kernel fusion. Fusion (file 02) refused to round-trip intermediates through HBM between kernels; the ensemble refuses to round-trip tensors through the network between models. "Don't cross the expensive boundary between stages" recurs from kernels to microservices.
- Same scheduling shape as OS process scheduling. Concurrent execution interleaving model instances on shared GPUs is the OS scheduler interleaving processes on shared cores — over-subscribe a little, keep the resource busy, but over-subscribe too much and you thrash. The scheduling tradeoff recurs.
13) Design test: five questions before you deploy a multi-model endpoint
- Is your pipeline running as a server-side ensemble (shared-memory, no hops), or scattered across microservices with a network hop per stage?
- For each non-LLM model, is dynamic batching on, with a queue delay tuned to your latency budget?
- Have you given each cheap stage enough instances to overlap requests, or will a single instance serialize the pipeline?
- Are you packing small models onto shared GPUs (concurrent execution) and dedicating full GPUs only to models that need them (the 70B)?
- Are you compiling only the stable hot model and leaving volatile or non-compilable models in flexible backends?
Where this appears in production
The server and its mechanisms
- NVIDIA Triton / Dynamo-Triton — the multi-framework inference server; hosts TensorRT, PyTorch, ONNX, OpenVINO, Python, and FIL backends in one process. Folded into the Dynamo platform in 2025.
- Dynamic batching — batches arriving requests within a queue-delay window; the primary throughput lever for stateless non-LLM models (embedders, classifiers, vision).
- Model ensembles — server-side pipelines that chain models with shared-memory tensor passing, eliminating network hops between stages.
- Concurrent model execution / instance groups — run multiple instances of a model (same or different GPUs) so small models share a card and requests overlap.
- TensorRT-LLM backend — hosts the file-04 engine inside Triton, exposing its in-flight batching through the server.
- Python / BLS backend — runs tokenizers, custom logic, and business-logic scripting that chains models with branching the static ensemble graph can't express.
Where it shows up
- NVIDIA NIM — wraps Triton + a TensorRT-LLM engine in a container with an OpenAI-compatible API (file 06); the packaged form of this server.
- NVIDIA Dynamo — the broader platform for distributed, disaggregated LLM serving (separating prefill and decode across GPUs) that extends Triton beyond a single node.
- Amazon SageMaker / Azure ML / GCP Vertex — offer Triton as a managed serving backend for multi-model GPU endpoints.
- Recommendation and ranking systems — use Triton's FIL backend to serve gradient-boosted trees and concurrent model execution to pack many small ranking models per GPU.
- Computer-vision pipelines (detection→tracking→classification) — classic ensemble use: chain stages server-side with dynamic batching on each.
- RAG systems — embed→retrieve→rerank→generate as an ensemble, co-hosting the embedder, reranker, and LLM to remove hops.
- A/B and canary model rollouts — Triton version policies serve multiple model versions concurrently for traffic splitting without redeploying.
- Multi-tenant model-hosting platforms — pack many customers' small models onto shared GPUs via concurrent execution to make per-model economics work.
Pause and recall
- Name three things a compiled engine does not do that a serving server does.
- What two costs does the microservice-per-model deployment pay on GPUs?
- How does dynamic batching differ from the engine's in-flight batching, and which models benefit most from each?
- What is a Triton ensemble, and what does it eliminate compared to an external orchestrator?
- Why can the 70B not share a GPU via concurrent execution, while the safety classifier should?
- Why does a single instance per model serialize the pipeline?
- What does the batch-queue-delay knob trade off?
- The LLM engine's tokens/sec looks great but the endpoint is slow. Where do you look?
Interview Q&A
Q1. Your chat pipeline is five microservices on six GPUs, three nearly idle, and latency is high. How do you redesign it?
A. Collapse the pipeline into one serving server as an ensemble: tokenize→embed→safety→LLM→detokenize, passing tensors in shared memory so there are no internal network hops. Pack the small models (embedder, classifier) onto shared GPUs with concurrent execution and dynamic batching, and dedicate full GPUs only to the tensor-parallel 70B. This removes the hops and reclaims the idle cards: fewer GPUs, lower latency.
Common wrong answer to avoid: "Give each service more GPUs to speed it up." The small models already waste their cards; more GPUs add idle silicon, not throughput.
Q2. If TensorRT-LLM already does in-flight batching, why does Triton's dynamic batching still matter?
A. They batch at different boundaries. In-flight batching keeps the LLM's generation loop full for requests already inside the engine. Dynamic batching decides how requests arrive at a model, and it is the primary throughput lever for the stateless non-LLM models in the pipeline (embedder, safety classifier) that have no continuous batching of their own. You use in-flight batching for the LLM step and dynamic batching for the rest of the ensemble.
Common wrong answer to avoid: "In-flight batching makes dynamic batching redundant." They cover different models and different boundaries; the non-LLM stages still need edge batching.
Q3. Why host multiple frameworks in one server instead of compiling everything to TensorRT?
A. The pipeline is heterogeneous and changes at different rates: the 70B is frozen and worth compiling, but the safety classifier retrains monthly (ONNX), the embedder is iterated in PyTorch, and the tokenizer is plain Python that doesn't compile to an engine. A multi-framework server lets each model use the runtime matching its change rate while co-hosting them for GPU packing and hop-free pipelining. Compiling the volatile parts would mean constant rebuild churn for little gain.
Common wrong answer to avoid: "Compile it all for maximum speed." You'd pay TensorRT-LLM's rigidity on models that change often or can't compile, for no net win.
Q4. You co-hosted everything but the GPUs still idle and throughput is poor. What's the most likely cause?
A. Too few instances per model. With an instance_group count of 1, the server can't run a second copy of any stage, so requests queue behind the lone instance, the pipeline serializes across requests, and GPUs idle. Raise instance counts on the cheap stages and enable dynamic batching so the server overlaps and batches work, filling the cards.
Common wrong answer to avoid: "Co-hosting must not help, go back to separate services." Co-hosting helps only with enough instances to overlap; the fix is more instances, not more services.
Q5. When is a general serving server the wrong choice?
A. When you serve a single model at massive scale with no pipeline: the server's multi-model, multi-framework generality is unused weight, and a thin purpose-built deployment or a prepackaged NIM container is simpler. Also when you need distributed, disaggregated LLM serving across many nodes, where the broader Dynamo platform extends beyond classic single-node Triton.
Common wrong answer to avoid: "Always use the full server." For one model with no pipeline, its generality is overhead you don't need.
Q6. (Cumulative.) An endpoint is slow. How do you tell whether it's an engine problem (file 04) or a serving-layer problem (this file)?
A. Measure the engine's own tokens/sec in isolation first. If the engine is slow on its own (low average batch, KV starvation), it's a file-04 problem; fix in-flight batching / KV pool. If the engine is fast in isolation but the endpoint is slow, the bottleneck is the serving layer: check per-model queue duration and per-GPU utilization across the ensemble for a starved instance, an under-batched small model, or hop/serialization overhead. The split is "is the model fast but the service slow, or is the model itself slow?"
Common wrong answer to avoid: "Slow endpoint means the LLM needs optimizing." The LLM can be at peak while a starved classifier instance or a network hop gates the endpoint.
Design/debug exercise (10 min)
Step 1 — Model it. Take the five-stage pipeline. As microservices: 4 network hops, each ~1 ms serialize + round-trip, plus 4 small-model GPUs at <10% utilization. As an ensemble: 0 internal hops (shared memory), small models packed onto 1 shared GPU via concurrent execution. Tally the per-request overhead removed (~4 ms of hops) and the GPUs reclaimed (from 6 down to ~3). Write both layouts side by side.
Step 2 — Your turn. For our 70B endpoint, the embedder uses ~10% of a GPU and the safety classifier ~5%. Decide how many concurrent instances of each to run on one shared GPU to push it toward full utilization, and set a dynamic-batching queue delay that fits a (say) 50 ms latency budget for those stages. Argue why the 70B stays on its own tensor-parallel pair and is excluded from sharing. Tie the packing back to the roofline's "keep the hardware fed" invariant.
Step 3 — Reproduce from memory. Redraw the scattered-vs-server diagram from section 2, labeling the network hops removed, the frameworks each stage uses, and which GPUs are shared vs dedicated. Then state in one sentence how this file connects to file 04 (the server hosts the engine and exposes its in-flight batching) and to file 02 (the ensemble eliminates inter-stage round-trips the way fusion eliminated inter-kernel ones).
Operational memory
This chapter explained why a fast engine still yields a slow endpoint: the pipeline gets scattered across microservices, paying a network hop at every stage and dedicating a whole GPU to each small model that uses a fraction of it. The important idea is that above the engine, throughput is about never letting a GPU idle waiting on requests or held by an under-filling model — and four mechanisms keep them fed: dynamic batching at the request edge, ensembles that run the pipeline server-side with no hops, multi-framework backends so each model uses its best runtime, and concurrent execution that packs small models onto shared GPUs.
You learned to host the whole pipeline on one server (Triton / Dynamo-Triton): compile and dedicate GPUs to the frozen 70B, keep volatile models in flexible backends, chain the steps as a shared-memory ensemble, and give cheap stages enough concurrent instances to overlap requests. That solves the opening failure because it removes the network hops and reclaims the idle cards, making the endpoint fast — not just the model.
Carry this diagnostic forward: when the endpoint is slow but the engine is fast in isolation, look at per-model queue duration and per-GPU utilization across the ensemble before touching the model. If a GPU idles next to a model with high queue time, add instances; if a small model's batches are size 1, tune its dynamic-batching window.
Remember:
- A compiled engine is a fast function; the serving server is the GPU scheduler that keeps cards fed across requests and across models.
- Dynamic batching batches requests at the door (key for non-LLM models); in-flight batching keeps the LLM's loop full inside the engine.
- Ensembles run multi-step pipelines server-side in shared memory — no network hops between stages.
- Pack small models onto shared GPUs (concurrent execution); dedicate full GPUs to models too big to share (the 70B).
- Co-hosting without enough instances serializes the pipeline and idles GPUs — the opposite of the intended win.
- Next pressure: configuring and tuning all of this is real work; sometimes you'd rather take a prepackaged, pre-optimized container than build the engine and wire the server yourself.
Bridge. Triton plus a TensorRT-LLM engine gives a fast, well-packed endpoint — but getting there meant building the engine (file 04), wiring the model repository, defining ensembles, and tuning instances and batch windows. That's a lot of expert work to serve a popular open model that thousands of teams serve identically. If the optimization is the same every time, why rebuild it every time? The next file is the packaged answer — NVIDIA NIM ships the engine, server, and an OpenAI-compatible API as one container — and forces the real question of this module's upper layers: buy or build your inference stack? → 06-nim-inference-microservices.md