08. Serving frameworks: vLLM, TGI, TensorRT-LLM, and Triton in practical contrast

~18 min read. Framework choice decides how much performance, flexibility, and pain you inherit.

Built on the ELI5 in 00-eli5.md. The kitchen now needs a manager, storekeeper, and traffic controller, not just chefs; that full management stack is the serving framework.

1) What a serving engine actually contains

People say “we serve the model” as if it were one loop.

Look closely. A real engine contains many subsystems.

HTTP / gRPC front door
          │
          ▼
request queue ──→ scheduler ──→ KV manager ──→ kernels ──→ tokenizer / detokenizer
          │                         │
          └──────── metrics and tracing ───────┘

The framework decides how the batch window behaves. It decides how the prep station is laid out. It decides kernel implementations. It exposes metrics on the kitchen itself. It may also handle multi-GPU groups, quantization plugins, and streaming responses. So framework choice is architecture choice. Simple, no?
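To make the diagram concrete, here is a toy sketch of those subsystems in one class. It is not how any real framework is written. `ToyEngine`, `Request`, and every field name are hypothetical, the "kernel" is a stand-in that emits fake tokens, and the KV manager is just a counter of free token slots.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    prompt_tokens: int   # prompt length in tokens
    max_new_tokens: int  # decode budget for this request
    generated: list = field(default_factory=list)


class ToyEngine:
    """Hypothetical engine: queue -> scheduler -> KV manager -> kernel -> detokenizer."""

    def __init__(self, kv_token_budget: int, max_batch: int):
        self.queue = deque()             # request queue
        self.running = []                # requests in the current batch
        self.kv_free = kv_token_budget   # KV manager: free token slots
        self.max_batch = max_batch
        self.metrics = {"steps": 0, "tokens": 0, "max_queue_depth": 0}

    def submit(self, req: Request) -> None:
        self.queue.append(req)
        depth = len(self.queue)
        self.metrics["max_queue_depth"] = max(self.metrics["max_queue_depth"], depth)

    def _schedule(self) -> None:
        # Admit waiting requests (FIFO) while batch slots and KV space allow.
        while self.queue and len(self.running) < self.max_batch:
            need = self.queue[0].prompt_tokens + self.queue[0].max_new_tokens
            if need > self.kv_free:
                break
            self.kv_free -= need
            self.running.append(self.queue.popleft())

    def step(self) -> list:
        """Run one decode iteration and return the requests that finished."""
        self._schedule()
        if not self.running and self.queue:
            raise RuntimeError("head-of-queue request exceeds the KV budget")
        finished = []
        for req in self.running:
            token_id = len(req.generated)         # stand-in for a kernel call
            req.generated.append(f"t{token_id}")  # stand-in for detokenization
            self.metrics["tokens"] += 1
            if len(req.generated) == req.max_new_tokens:
                finished.append(req)
        for req in finished:
            self.running.remove(req)
            self.kv_free += req.prompt_tokens + req.max_new_tokens  # release KV
        self.metrics["steps"] += 1
        return finished

    def run(self) -> list:
        done = []
        while self.queue or self.running:
            done.extend(self.step())
        return done


engine = ToyEngine(kv_token_budget=64, max_batch=2)
engine.submit(Request("a", prompt_tokens=10, max_new_tokens=2))
engine.submit(Request("b", prompt_tokens=10, max_new_tokens=5))
engine.submit(Request("c", prompt_tokens=10, max_new_tokens=3))
finished_order = [r.req_id for r in engine.run()]
```

Notice that request "c" joins the batch as soon as "a" finishes, without waiting for "b". Even in a toy, the scheduler and the KV accounting shape the behavior more than the front door does.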

2) The four names you will hear most often

Here is the mental map.

vLLM            → strongest mindshare for paged KV + continuous batching
TGI             → Hugging Face server with practical ops surface
TensorRT-LLM    → NVIDIA-heavy path for max tuned performance
Triton          → NVIDIA Triton Inference Server: broad server for multi-model orchestration (not the Triton kernel language)

vLLM shines when fast open-model text serving is the goal. TGI shines when you want a packaged Hugging Face serving stack. TensorRT-LLM shines when you control NVIDIA hardware and want deep kernel tuning. Triton shines when one server must coordinate many model types and backends. None is universally best. The question is always: best for what workload?

3) Worked selection example

Suppose you must serve a 70B chat model on 8 NVIDIA GPUs. You need strong throughput, streaming tokens, and good mixed-length traffic handling.

Option A: vLLM. You gain paged attention and strong continuous batching quickly.

Option B: TensorRT-LLM. You may gain lower latency and better kernel specialization, but you accept a more hardware-specific path.

Option C: Triton plus backend integration. You gain broader orchestration, but likely more integration work for the text path.

Now suppose the workload shifts to many model families, including rerankers and vision models. Suddenly Triton becomes more attractive. See how the answer changed? Framework choice follows system shape.
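Before comparing options for the 70B case, it helps to run the memory arithmetic, because it tells you how much room any framework has for KV cache. The sketch below is back-of-envelope only. Everything except "70B" and "8 GPUs" is an assumption for illustration: FP16 weights, a Llama-2-70B-style layout (80 layers, 8 KV heads, head dimension 128), 80 GiB per GPU, a 0.9 usable-memory fraction, and an even tensor-parallel split. It ignores activations, framework overhead, and fragmentation.

```python
GIB = 1024 ** 3

# Assumptions for illustration only; substitute your model's real config.
N_PARAMS = 70e9          # "70B" from the example
N_GPUS = 8               # from the example
BYTES_PER_ELEM = 2       # FP16 weights and KV cache (assumed)
N_LAYERS = 80            # assumed architecture
N_KV_HEADS = 8           # assumed (grouped-query attention)
HEAD_DIM = 128           # assumed
GPU_MEM_GIB = 80         # assumed GPU size
USABLE_FRACTION = 0.9    # assumed share of memory the engine may use


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # Factor 2 covers one key vector and one value vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


weights_per_gpu_bytes = N_PARAMS * BYTES_PER_ELEM / N_GPUS
weights_per_gpu_gib = weights_per_gpu_bytes / GIB

kv_per_token = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
kv_per_token_per_gpu = kv_per_token / N_GPUS  # even split across GPUs (assumed)

kv_budget_per_gpu = USABLE_FRACTION * GPU_MEM_GIB * GIB - weights_per_gpu_bytes
token_capacity = int(kv_budget_per_gpu // kv_per_token_per_gpu)

print(f"weights per GPU : {weights_per_gpu_gib:.1f} GiB")
print(f"KV per token    : {kv_per_token / 1024:.0f} KiB across all GPUs")
print(f"KV token budget : about {token_capacity:,} tokens in flight")
```

Under these assumptions the weights take roughly 16 GiB per GPU and each cached token costs 320 KiB across the group. The token budget is shared by every concurrent request, which is why KV-cache efficiency and scheduling matter so much for mixed-length traffic.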

4) Tradeoffs that matter more than marketing

Evaluate these dimensions clearly.

scheduler quality under mixed request lengths,
KV-cache efficiency,
quantization support,
tensor-parallel and pipeline-parallel support,
streaming support and protocol ergonomics,
observability,
operational complexity,
hardware lock-in.
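One way to keep this checklist honest is to turn it into a small weighted score. The sketch below is only a scaffold: the criterion keys, the weights, and the 1-to-5 ratings are placeholders, and `framework_x` and `framework_y` are deliberately fictional so that no real framework is rated here. Ratings should come from your own measurements.

```python
def score_framework(weights: dict, ratings: dict) -> float:
    """Weighted average of ratings over the criteria named in `weights`."""
    missing = [c for c in weights if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank(weights: dict, candidates: dict) -> list:
    """Return candidate names, best score first."""
    return sorted(candidates,
                  key=lambda name: score_framework(weights, candidates[name]),
                  reverse=True)


# Placeholder ratings on a 1-5 scale for two fictional candidates.
candidates = {
    "framework_x": {"scheduler_quality": 5, "hardware_portability": 2},
    "framework_y": {"scheduler_quality": 3, "hardware_portability": 5},
}

# Same candidates, two different workloads.
throughput_first = {"scheduler_quality": 3, "hardware_portability": 1}
portability_first = {"scheduler_quality": 1, "hardware_portability": 3}

ranking_a = rank(throughput_first, candidates)
ranking_b = rank(portability_first, candidates)
```

The ratings never change between the two rankings, only the weights do, and the winner flips. That is the "best for what workload?" question in numeric form.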

Now what is the problem? Teams often benchmark one happy path. Production contains long prompts, short prompts, batch bursts, GPU failures, and upgrades. A framework that looks slower in a toy test may prove easier to operate reliably at scale.
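A benchmark that reflects this needs a mixed request set and tail percentiles, not one prompt and an average. Here is a minimal sketch of both pieces. The traffic classes, shares, and token ranges in `MIX` are invented for illustration, and the sketch does not send requests anywhere; it only builds the workload and summarizes latencies you would collect yourself.

```python
import math
import random

# Hypothetical traffic mix: share of requests and prompt-length range per class.
MIX = {
    "short_chat": {"share": 0.6, "prompt_tokens": (16, 256), "max_new_tokens": 128},
    "long_context": {"share": 0.3, "prompt_tokens": (2000, 8000), "max_new_tokens": 256},
    "batch_burst": {"share": 0.1, "prompt_tokens": (256, 1024), "max_new_tokens": 512},
}


def make_workload(n: int, mix: dict, seed: int = 0) -> list:
    """Build n synthetic requests drawn from the mix; a fixed seed keeps runs comparable."""
    rng = random.Random(seed)
    kinds = list(mix)
    shares = [mix[k]["share"] for k in kinds]
    workload = []
    for _ in range(n):
        kind = rng.choices(kinds, weights=shares, k=1)[0]
        low, high = mix[kind]["prompt_tokens"]
        workload.append({
            "kind": kind,
            "prompt_tokens": rng.randint(low, high),
            "max_new_tokens": mix[kind]["max_new_tokens"],
        })
    return workload


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


workload = make_workload(200, MIX, seed=42)
example_latencies_ms = list(range(1, 101))  # stand-in data, not a measurement
p50 = percentile(example_latencies_ms, 50)
p99 = percentile(example_latencies_ms, 99)
```

Replay the same seeded workload against each candidate and compare p50 and p99 per traffic class. Failure and upgrade behavior still need separate drills; a load generator cannot show you those.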

5) How to think like a senior engineer here

A good answer is not “vLLM is the fastest” or “TensorRT-LLM is always the best.” A good answer names the workload, the hardware, the model family, and the team’s operational appetite.

If you need portable deployment across CPU, NVIDIA GPU, and edge devices, your answer changes. If you need only one NVIDIA cluster with maximum tokens per second, your answer changes again. That is why the framework discussion flows naturally into ONNX Runtime next. We now leave pure GPU-cluster thinking and ask how graph-level optimization and portability alter the decision.

Where this lives in the wild

Anyscale Endpoints — widely associated with vLLM-style high-throughput serving for open LLMs.
Hugging Face Inference Endpoints — TGI is a natural fit when the workflow is already centered on Hugging Face models and tooling.
NVIDIA NIM and enterprise NVIDIA stacks — TensorRT-LLM under the hood targets aggressive NVIDIA-specific optimization.
Multi-model platforms using Triton — one server can host LLMs, rerankers, and embedding models behind one inference control plane.
Self-hosted open-model chat services — teams often choose between vLLM simplicity and TensorRT-LLM maximum hardware tuning.

Pause and recall

Name three non-kernel responsibilities a serving framework usually owns.
Which framework is most associated with paged attention for open-model text serving?
Why might Triton become more attractive when many model types must share one platform?
Why is a one-prompt benchmark not enough for framework selection?

Interview Q&A

Q: Why choose a serving framework instead of a thin custom loop around the model?

A: Because production text serving needs scheduling, cache management, streaming, metrics, and often distributed execution. Rebuilding that stack repeatedly is slow and error-prone.

Common wrong answer to avoid: "A custom loop is simpler." It is simpler only until traffic and failures appear.

Q: Why might vLLM beat a generic server on text workloads, if not because of better HTTP routing?

A: Because its core advantages live inside the inference loop: paged KV management and strong continuous batching for variable-length decoding.

Common wrong answer to avoid: "It is just a nicer API wrapper." The engine internals are the point.
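To see why paged KV management counts as an engine internal, compare two reservation schemes. The sketch below is a toy model of the idea, not vLLM's implementation. `BlockAllocator` and both helper functions are hypothetical names, and the sequence lengths, block size, and maximum length are made-up inputs.

```python
import math


def contiguous_slots(seq_lens: list, max_seq_len: int) -> int:
    # Naive scheme: every sequence reserves room for the longest allowed sequence.
    return len(seq_lens) * max_seq_len


def paged_slots(seq_lens: list, block_size: int) -> int:
    # Paged scheme: each sequence holds only as many fixed-size blocks as it needs.
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


class BlockAllocator:
    """Toy block table: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> physical block ids, in logical order
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or the last block is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        # Finished sequences hand their blocks back for immediate reuse.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


seq_lens = [30, 500, 1200]  # made-up current lengths, in tokens
reserved_contiguous = contiguous_slots(seq_lens, max_seq_len=2048)
reserved_paged = paged_slots(seq_lens, block_size=16)

alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")
blocks_held = len(alloc.tables["a"])
alloc.release("a")
```

With these inputs the contiguous scheme reserves 6144 token slots for 1730 tokens actually in use, while the paged scheme reserves 1744. The freed slots are what let the scheduler admit more concurrent sequences, so the gain shows up as throughput rather than as a nicer API.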

Q: Why might TensorRT-LLM win on one cluster but be the wrong choice elsewhere?

A: Because its strongest gains often depend on NVIDIA-specific optimization paths. If portability or heterogeneous hardware matters more, the trade changes.

Common wrong answer to avoid: "Fastest on one benchmark means best everywhere." Deployment constraints matter.

Q: Why is Triton not simply a slower version of text-specialized engines?

A: Because its value is broader orchestration across models and backends. The right comparison depends on whether you need a text engine or a platform server.

Common wrong answer to avoid: "All inference servers solve the same problem." Scope differs.

Apply now (5 min)

List the top three requirements for a deployment you know. Score vLLM, TGI, TensorRT-LLM, and Triton against those needs. Be explicit about hardware lock-in and scheduler needs. Sketch from memory:

the engine stack diagram,
the four-framework mental map,
and the evaluation checklist.

Bridge. Framework comparison raises a broader question: what if you want strong graph optimization and deployment portability beyond one CUDA stack? Next we study ONNX Runtime optimization. → 09-onnx-runtime-optimization.md