11. On-prem vs managed economics: when owning the kitchen pays

~18 min read. The temptation is the same one every restaurant feels eventually — stop buying ingredients from the supplier, start growing your own. The spreadsheet says it saves money at scale. The spreadsheet does not show the twelve months of engineering, the on-call load, the GPU driver upgrades, or the moment your team realises model lifecycle is now their problem. This chapter is the honest total-cost-of-ownership math.

Builds on 10-model-upgrades-in-production.md. Managed APIs are renting the cook by the hour. Self-hosting is owning the kitchen, the stove, the freezer, the supplier relationship, the cook's training, and the cook's pension. Same dish on the plate. Very different bill.

1) Hook: the spreadsheet that lied and the migration that did not happen

A growth-stage analytics company runs an internal-tools assistant on a managed frontier API. Monthly token volume — 480 million input, 110 million output. Monthly bill — roughly $46,000 on Sonnet 4.6.

The platform team builds a TCO model for self-hosting. The model uses Llama 3.3 70B on rented H100s. The math says — three H100s at $2.80/hour, running 80% utilised, can serve the workload at $14,500/month in hardware. Net annual saving — $378,000. The platform lead presents the proposal.

The applied AI lead asks four questions.

Who runs the inference engine in production? No one yet.
Who is on call when the GPU node has a kernel panic at 2 a.m.? No one yet.
Who owns the quality regression when Llama 3.3 70B trails Sonnet 4.6 on the harder slices? No one yet.
Who maintains the prompt portability when we re-tune from Sonnet's prompt style to Llama's? No one yet.

The platform team's revised TCO: three H100s at $14,500/month ($174,000 a year), plus three senior engineers' time for year one ($210,000 fully loaded each, so $630,000), plus a 4-6 point quality drop on the harder slices that translates into measurable downstream cost, plus a six-month build during which the managed API still serves traffic and still bills. Hardware and engineering alone come to roughly $804,000 in year one, against roughly $552,000 for a year on the managed line, and that is before pricing the quality drop or the overlapping managed bills.

The proposal does not ship. Two years later, when the workload has grown to 4 billion input tokens a month and the team has hired two infrastructure engineers for other reasons, the proposal returns with a different shape and ships. The crossover finally arrived. The lesson was — the crossover was real, but the year it arrived was not the year someone first ran the arithmetic on the GPU lease alone.

2) The metaphor: owning the kitchen vs renting the cook

When a restaurant uses a contract caterer, the bill is per dish. Predictable. Includes the cook, the cook's training, the cook's manager, the cook's sick days, the kitchen, and the supplier relationships. The restaurant just plates and serves.

When a restaurant runs its own kitchen, the bill is — rent, equipment, supplier contracts, training, retention, insurance, fire safety, kitchen manager's salary, sous chef's salary, line cook's salary, dishwasher, on-call plumber, the gas company. Per-dish cost goes down at scale. But the restaurant is now in the kitchen-running business, not just the restaurant-running business.

Self-hosting a model is exactly that shift. The managed API is the contract caterer. Self-hosting is owning the kitchen. The arithmetic is not just "per-dish cost." It is "per-dish cost plus everything we now have to also do."

3) The temptation: why it is true at scale, false at low volume

The headline arithmetic is genuinely compelling at high volume.

MANAGED API                       SELF-HOSTED
───────────                       ────────────
Pay per token                     Pay per GPU-hour
Bill scales linearly with use     Bill is roughly fixed regardless of use
Frontier quality available        Open-weight quality, usually one tier down
Vendor handles inference          You handle inference
Vendor handles lifecycle          You handle lifecycle
Vendor handles capacity           You handle capacity

The crossover is real. At low volume, managed wins because you do not pay for GPUs sitting idle. At high volume, self-hosted wins because the GPU is amortised across many tokens. The question is — where is the crossover for your specific workload, and what does the math under it look like.

The shape of the crossover curve.

Cost
 ▲
 │                     /  Managed API
 │                  /
 │───────────────X──────────  Self-hosted
 │            /
 │         /
 │      /
 │   /
 └───────────────┴──────────▶ Volume
             crossover

Under crossover volume — managed is cheaper. Over crossover volume — self-hosted is cheaper. The crossover point depends on which model tier you are comparing, what GPU you are using, what utilization you achieve, and what you count in the self-hosted column.
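The crossover point can be written down directly. The following is a minimal sketch, assuming the managed bill is linear in volume and the self-hosted bill is a fixed monthly amount plus an optional per-token marginal cost. The function name and the example figures are illustrative assumptions, not any provider's price list.

```python
def crossover_volume_m_tokens(
    managed_price_per_m: float,
    selfhost_fixed_monthly: float,
    selfhost_marginal_per_m: float = 0.0,
) -> float:
    """Monthly volume (millions of tokens) at which both bills are equal.

    Managed cost     = managed_price_per_m * volume
    Self-hosted cost = selfhost_fixed_monthly + selfhost_marginal_per_m * volume
    """
    saving_per_m = managed_price_per_m - selfhost_marginal_per_m
    if saving_per_m <= 0:
        # Self-hosting never catches up if each token costs as much or more.
        return float("inf")
    return selfhost_fixed_monthly / saving_per_m


# Illustrative inputs (assumptions): a blended managed price of $5 per
# million tokens and a self-hosted fixed bill of $25,000/month
# covering GPUs and people.
volume = crossover_volume_m_tokens(5.0, 25_000.0)
print(f"Crossover at about {volume:,.0f}M tokens per month")
```

In practice the self-hosted line is a staircase rather than a flat line, because each step up in volume eventually adds another GPU. Treat the result as a first estimate and recompute it for each fleet size you would actually run.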

4) The TCO components for self-hosting

The arithmetic the spreadsheet usually shows is the GPU lease. The arithmetic that actually applies has seven components.

┌──────────────────────────────────┬─────────────────────────────────────┐
│ COMPONENT                        │ TYPICAL ANNUAL COST (rough 2026)    │
├──────────────────────────────────┼─────────────────────────────────────┤
│ 1. GPU lease (H100/H200/B200)    │ $20k - $40k per GPU per year        │
├──────────────────────────────────┼─────────────────────────────────────┤
│ 2. Inference engineering         │ 1-3 senior FTE for year one;        │
│    (vLLM, TGI, SGLang)           │ 0.5-1 FTE ongoing                   │
├──────────────────────────────────┼─────────────────────────────────────┤
│ 3. Reliability engineering       │ Multi-AZ failover, autoscaling,     │
│    (SRE-grade on-call)           │ on-call rotation. 0.5-1 FTE.        │
├──────────────────────────────────┼─────────────────────────────────────┤
│ 4. Eval and observability        │ Tooling + the engineer running it.  │
│                                  │ Same as managed, but you own more   │
├──────────────────────────────────┼─────────────────────────────────────┤
│ 5. Model lifecycle               │ Weight updates, fine-tuning, eval   │
│                                  │ regressions on new versions.        │
├──────────────────────────────────┼─────────────────────────────────────┤
│ 6. Security and compliance       │ Driver patches, kernel patches,     │
│                                  │ image scanning, audit logging.      │
├──────────────────────────────────┼─────────────────────────────────────┤
│ 7. Quality gap mitigation        │ Open-weight quality often trails    │
│                                  │ frontier closed-weight. The gap is  │
│                                  │ paid in user-facing cost.           │
└──────────────────────────────────┴─────────────────────────────────────┘

Components 1 and 2 are the ones the spreadsheet usually captures. Components 3 through 7 are the ones that surprise teams in the second quarter of operation.

Realistic 2026 GPU economics. H100 80GB hourly rate on hyperscalers — $3-4. On second-tier providers (Lambda, CoreWeave, RunPod) — $2-3. Spot pricing on AWS / GCP — $1.50-$2.50, with availability that comes and goes. H200 is ~30% more expensive but offers more memory and faster inference. B200, when available in 2026, is significantly more expensive but offers ~2-3x the throughput of H100 for many workloads. The hourly number is just the start.

Realistic throughput. A 70B parameter model on a single H100 with quantization serves 30-60 output tokens per second per concurrent request, with batching pushing aggregate throughput to 200-400 tokens per second across the GPU at realistic utilisation. A 405B model does not fit on one or two 80GB cards even when quantized (the weights alone are roughly 200GB at 4-bit and 400GB at 8-bit), so plan for 4-8 H100s for inference and correspondingly slower per-request serving. Frontier-quality closed-weight models do not have publicly available comparable numbers because they are not self-hostable.

Realistic utilisation. The naive math assumes 100% utilisation. Real production runs at 50-80% on average, because traffic has peaks and troughs, and you must provision for peaks. Below 50%, the per-token math breaks. Above 80%, the latency tail explodes during peaks.
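The effect of utilisation on per-token cost is easy to see in numbers. The sketch below assumes the $2.80/hour H100 rate and a 300 tokens-per-second aggregate throughput (the midpoint of the 200-400 range above). It counts generated tokens only and ignores prompt processing, so it is a rough lower bound rather than a capacity plan. The function names are hypothetical.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, matching the chapter's arithmetic


def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second, utilisation):
    """Effective dollars per million generated tokens on one always-on GPU."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hourly_usd / (tokens_per_hour / 1_000_000)


def gpus_needed(monthly_output_tokens, tokens_per_second, utilisation):
    """GPUs required to serve the monthly volume at a target average utilisation."""
    per_gpu_monthly = tokens_per_second * SECONDS_PER_MONTH * utilisation
    return math.ceil(monthly_output_tokens / per_gpu_monthly)


# The lease is paid whether or not the GPU is busy, so cost per token
# rises as utilisation falls.
for u in (1.0, 0.8, 0.5, 0.3):
    cost = cost_per_million_tokens(2.80, 300, u)
    print(f"utilisation {u:.0%}: ${cost:.2f} per million tokens")

print(gpus_needed(200_000_000, 300, 0.65), "GPU(s) for 200M output tokens per month")
```

The same GPU costs roughly twice as much per token at 50% utilisation as at 100%, which is why a spreadsheet built on spec-sheet throughput understates the bill.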

5) Worked example: the 100M token workload

Start with realistic numbers for a workload at 100 million input tokens per month and 20 million output tokens per month.

Managed API cost (Sonnet 4.6 at 2026 prices, roughly).

Input  : 100M tokens × $3 per million  = $300
Output :  20M tokens × $15 per million = $300
Monthly: ~$600

(Numbers illustrative; actual prices change. The point is the shape.)

That workload is too small to be worth optimising: at $600 a month, no self-hosting plan can pay for itself. Scale it up tenfold.

Managed API cost (10x the above — 1B input, 200M output per month).

Input  : 1000M tokens × $3 per million  = $3,000
Output :  200M tokens × $15 per million = $3,000
Monthly: ~$6,000
Annual : ~$72,000

Self-hosted cost (Llama 3.3 70B on two H100s for redundancy).

Hardware: 2 H100s × $2.80/hour × 24 × 30 = $4,032/month
         Annual : ~$48,400
Year-one engineering: 1 senior FTE @ $250k loaded = $250,000
Year-one observability + lifecycle: ~$30,000
Year-one total: ~$328,400

Year-two-plus (engineering settles to 0.5 FTE):
Hardware     : ~$48,400
Engineering  : ~$125,000
Lifecycle    : ~$30,000
Year-two total: ~$203,400

Year-one comparison. Managed: $72,000. Self-hosted: $328,400. Managed wins by $256,400.

Year-two comparison. Managed: $72,000. Self-hosted: $203,400. Managed still wins.

This workload does not cross over. The crossover for Llama 3.3 70B self-hosted vs Sonnet 4.6 managed sits roughly at 5-10x this volume — and that math also assumes you accept the quality gap. Frontier closed-weight cooks have no self-hosted alternative at all in 2026.
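The arithmetic for the 1B-input, 200M-output workload can be reproduced in a few lines. This is a sketch using the chapter's illustrative prices and a 30-day month; the function names are hypothetical. The totals differ from the text by a few dollars because the text rounds hardware to $48,400.

```python
HOURS_PER_MONTH = 24 * 30  # the chapter's 30-day month


def managed_annual(input_m, output_m, in_price=3.0, out_price=15.0):
    """Annual managed bill; volumes are in millions of tokens per month."""
    return 12 * (input_m * in_price + output_m * out_price)


def self_hosted_annual(gpus, gpu_hourly, engineering, lifecycle):
    """Annual self-hosted bill: always-on GPU lease plus people and tooling."""
    hardware = gpus * gpu_hourly * HOURS_PER_MONTH * 12
    return hardware + engineering + lifecycle


managed = managed_annual(1000, 200)
year_one = self_hosted_annual(2, 2.80, engineering=250_000, lifecycle=30_000)
year_two = self_hosted_annual(2, 2.80, engineering=125_000, lifecycle=30_000)

print(f"Managed            : ${managed:,.0f}")
print(f"Self-hosted, year 1: ${year_one:,.0f}")
print(f"Self-hosted, year 2: ${year_two:,.0f}")
print(f"Year-one gap       : ${year_one - managed:,.0f} in favour of managed")
```

Changing `gpus`, `engineering`, or the token volumes shows how sensitive the result is to the people line: hardware is the smallest of the three self-hosted inputs at this volume.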

A different worked example. 5 billion input, 1 billion output per month.

Managed ($3 per million in + $15 per million out): $30,000 monthly, $360,000 annual.

Self-hosted (Llama 3.3 70B on a small fleet — say 8 H100s for throughput and redundancy):

Hardware       : 8 × $2.80/hour × 24 × 30 = $16,128/month, ~$194,000 annual
Engineering    : 2 senior FTE year one  = $500,000
Lifecycle / obs: $40,000
Year-one total: ~$734,000

Year two (1 FTE):
Hardware       : ~$194,000
Engineering    : $250,000
Lifecycle / obs: $40,000
Year-two total: ~$484,000

Year three (1 FTE, lower lifecycle):
Hardware       : ~$194,000
Engineering    : $250,000
Lifecycle / obs: below $40,000
Year-three total: ~$444,000 to ~$484,000

Managed annual: $360,000. Self-hosted year three: at least $444,000. Still above. On these assumptions the crossover sits at a higher volume than this one. The frontier-managed cook is durably cheaper for many workloads even at very high volume, because the inference engineering cost does not vanish.

Where self-hosted clearly wins. Open-weight models that match the workload's quality bar, very high consistent volume (tens of billions of tokens per month), and an engineering team that already has GPU operations expertise. That combination is rare. When it exists, the savings are real.

6) The hosted-open-weight middle ground

Between managed-closed-weight and self-hosted-open-weight sits a third option: open-weight models served by someone else. Examples include Together AI, Fireworks AI, Anyscale, Replicate, Hugging Face Inference Endpoints, Groq, SambaNova, and Cerebras.

The economics. Per-token pricing, but typically lower than closed-weight managed APIs of similar tier. Llama 3.3 70B on Together AI in 2026 might price at $0.50 per million input and $0.80 per million output — significantly cheaper than closed-weight frontier rates. You do not run the kitchen.

The trade-offs. You inherit the host's reliability, capacity, and version lifecycle (covered in chapter 09's vendor risk discussion). You do not get the latest frontier quality, because the frontier remains closed-weight in 2026. You get prompt portability — you can switch hosts on the same weights in days, not weeks, because the model file is the same.

                  COST          QUALITY    OPERATIONAL BURDEN
                  ────          ───────    ──────────────────
Managed closed    Highest       Highest    Lowest
Hosted open       Middle        Middle     Low
Self-hosted open  Lowest at     Middle     Highest
                  high volume

Most teams arrive at "hosted open" as the practical middle ground. It saves money over managed closed when the quality gap is acceptable. The operational burden is low because the host runs the GPUs. The portability is high because the weights are public.

The decision tree that works for most teams. Step one — can the workload tolerate open-weight quality? Run a bake-off (chapter 03). If yes, step two — what is the monthly cost on a hosted-open-weight provider? If that cost is acceptable, stop there. The self-hosted question only opens if the hosted-open-weight cost is also unacceptable, and that usually means the workload is in the tens-of-billions-of-tokens-per-month range.
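The decision tree above is short enough to encode as a function. This is a minimal sketch; the function name, parameter names, and return strings are hypothetical, and the inputs are judgments you supply from your own bake-off and budget.

```python
def choose_serving_option(
    open_weight_passes_bakeoff: bool,
    hosted_open_monthly_cost: float,
    acceptable_monthly_budget: float,
) -> str:
    """Walk the two-step decision tree and return a recommendation."""
    # Step one: can the workload tolerate open-weight quality?
    if not open_weight_passes_bakeoff:
        return "managed closed-weight API"
    # Step two: is the hosted-open-weight bill acceptable?
    if hosted_open_monthly_cost <= acceptable_monthly_budget:
        return "hosted open-weight provider"
    # Only now does the self-hosting question open.
    return "open the self-hosting question with a full TCO"


print(choose_serving_option(False, 700.0, 5_000.0))
print(choose_serving_option(True, 700.0, 5_000.0))
print(choose_serving_option(True, 90_000.0, 50_000.0))
```

The ordering matters: quality is checked before cost, so a cheap option that fails the bake-off is never considered.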

7) The hidden costs of self-hosting

The TCO table named seven components. Three of them deserve their own moment because they are the ones spreadsheets miss.

Model lifecycle is now yours. When Llama 4 ships, you decide whether to upgrade, when to upgrade, and you run the upgrade playbook from chapter 10. The managed API would have done this for you, with the trade-off that the upgrade happens on the supplier's schedule. Self-hosting gives you control and the work that comes with it.

Driver and kernel patches. NVIDIA driver upgrades, CUDA version changes, kernel patches, container base image updates. None individually huge. Collectively, roughly one engineer-week per quarter for a steady fleet. More during a major NVIDIA driver transition.

Security posture. The model weights, inference logs, and prompts now live on your hardware. You inherit responsibility for at-rest encryption, audit logging, vulnerability scanning of the inference engine and base image, and incident response when something goes wrong. Managed APIs offload most of this to the supplier.

The cumulative effect — self-hosting adds roughly 1-2 FTE of ongoing load to a team beyond the spreadsheet's hardware cost. For a team where that capacity exists and the volume crosses over, it is worth it. For a team without that capacity, the apparent savings evaporate into engineer attention that should have been spent on the product.

Mid-content recall

What seven components belong in a self-hosting TCO calculation, and which two does the spreadsheet usually capture?
Why does utilisation matter so much for the per-token economics of self-hosting, and what is a realistic utilisation number?
What is the hosted-open-weight middle ground, and when does it win?

8) When self-hosting is forced, and the math changes shape

There are workloads where the managed/self-hosted decision is not an economic question. The decision is made by policy or by physics.

Data residency requirements. Some regulators require that user data never leave a specific geography. If the managed API does not offer a region in that geography, you self-host. The economics do not matter — the rule matters.

Air-gapped deployments. Defense contractors, classified workloads, some healthcare deployments. The system must not reach the public internet. Self-hosting is the only option. The cost question becomes — how do we do this efficiently, not whether to do it at all.

Regulated industries with very strict third-party risk frameworks. Some banks and insurers cannot pass model inputs containing customer data to a third party without contractual structures that take a year to negotiate. Self-hosting bypasses the negotiation.

Latency requirements that managed APIs cannot meet. Some real-time trading systems, voice interfaces, and gaming use cases need sub-100ms end-to-end latency that managed APIs over WAN cannot guarantee. Local inference on dedicated GPUs eliminates the WAN hop.

When self-hosting is forced, the cost arithmetic shifts from "is it cheaper" to "how do we do it well enough to compete on quality." The managed-API benchmark is still the quality bar — your users implicitly compare to ChatGPT and Claude even when they cannot use those products. You may pay more on a per-dish basis and still have to ship the dish.

9) Failure modes: where the self-hosting case quietly fails

LEAK                                     FIX
────────────────────────────────────     ──────────────────────────────────
Spreadsheet shows GPU cost only          Add components 2-7 from the TCO
                                         table; the spreadsheet is not the
                                         TCO

Assuming 100% GPU utilisation            Plan for 50-80%; under 50% the math
                                         breaks; over 80% latency tails
                                         explode at peaks

Ignoring the quality gap                 Open-weight tiers are often one tier
                                         below closed-weight frontier; run
                                         the bake-off before committing

Skipping the hosted-open-weight option   Hosted open often wins the
                                         middle-volume range without the
                                         operational burden

Year-one engineering invisible           Self-hosting needs 1-3 senior FTE
                                         year one; settles to 0.5-1 FTE
                                         ongoing

Treating model lifecycle as free         New model versions ship; you run
                                         the upgrade playbook every time;
                                         that capacity has a cost

Forgetting the security / compliance     Weights, logs, prompts on your
load                                     hardware shift the audit burden to
                                         you

Counting savings before quality parity   The saving only counts if the user
                                         experience is the same or better;
                                         otherwise it is a tax on the user

Eight common ways the self-hosting argument quietly fails. The thread — the spreadsheet is necessary but not sufficient. The TCO is the arithmetic of the whole kitchen, not the hourly cost of the stove.

Where this lives in the wild

The decision plays out across many infrastructure choices.

Anthropic API, OpenAI API, Gemini API — managed closed-weight frontier. Highest quality, highest per-token price, lowest operational burden.
Mistral La Plateforme — managed access to open-weight Mistral models; the company itself is the supplier.
DeepSeek API — managed open-weight from the lab that trained them; unusual hybrid.
AWS Bedrock, Azure OpenAI Service, Vertex AI — managed access to multiple suppliers' cooks; one bill, one TOS, one IAM model.
Together AI — hosted open-weight inference at competitive per-token prices; the middle-ground reference point for many teams.
Fireworks AI - the same hosted open-weight approach as Together AI; emphasises low-latency inference engines.
Anyscale — Ray-based open-weight hosting; appeals to teams already on Ray.
Replicate — broad open-weight catalog including community-tuned variants; longer-tail catalog.
Modal — serverless GPU; bring your own model and inference engine. The middle ground between "fully managed" and "fully self-hosted."
OctoAI - formerly a managed open-weight inference host with an enterprise focus; acquired by NVIDIA in 2024, after which its public inference service was wound down.
Groq — extreme-throughput open-weight hosting on LPU hardware; niche for latency-sensitive workloads.
SambaNova — RDU-based open-weight inference; longer cook lifetimes per model.
Cerebras — wafer-scale inference; curated open-weight menu.
Hugging Face Inference Endpoints — managed inference on your choice of open-weight model; the middle ground for teams with model-specific needs.
vLLM — the dominant open-source inference engine; what teams self-hosting on commodity GPUs typically run.
TGI (Text Generation Inference) — Hugging Face's inference engine; alternative to vLLM with different optimisation trade-offs.
SGLang — newer inference engine with strong structured-output support; gaining adoption in 2026.
llama.cpp — CPU and small-GPU inference engine; used for on-device and edge deployments rather than data-centre serving.
Ollama — developer-friendly wrapper around llama.cpp; popular for local development and on-device.
CoreWeave, Lambda, RunPod, Crusoe — second-tier GPU clouds typically cheaper than hyperscalers.
AWS p5 instances, GCP A3 instances, Azure ND H100 v5 — hyperscaler GPU instances at hyperscaler prices.
NVIDIA NIM — containerised inference microservices; the official NVIDIA path for self-hosting frontier-class open-weight models.

Pause and recall

What is the shape of the cost-vs-volume curve for managed API vs self-hosted, and what does it imply about low-volume workloads?
List the seven TCO components for self-hosting; which two does a naive spreadsheet usually contain?
Why does the hosted-open-weight middle ground often win on middle-volume workloads?
What is a realistic utilisation assumption for a self-hosted inference fleet, and why?
What four conditions force self-hosting regardless of the economic answer?
How much engineering capacity does year-one self-hosting typically need, and how does it settle in year two?
Why is the quality gap a hidden cost in the self-hosting case?

Interview Q&A

Q1. When does self-hosting pay off? A. When three conditions are simultaneously true. Volume is high enough to amortise GPU costs across enough tokens (typically billions per month for the workload type). Open-weight quality is acceptable for the workload. The team has GPU operations capacity, either existing or worth hiring for. If any of the three is missing, the managed or hosted options are usually cheaper after a full TCO accounting. Trap: Quoting "self-hosting is cheaper at scale" as if scale alone were the trigger. Scale plus quality acceptance plus operations capacity is the trigger.

Q2. Your CFO sees a spreadsheet showing $46k/month on a managed API and asks why you are not self-hosting. How do you respond? A. Walk through the TCO components beyond the GPU lease. Year-one engineering for inference engineering, reliability engineering, model lifecycle, security and compliance. Quality-gap mitigation if the self-hosted cook trails the managed cook. Realistic utilisation assumptions. Often the conclusion is that managed wins by a wide margin in year one even if hardware looks cheaper. Frame the conversation around year-three total cost, not year-one hardware cost. Trap: Defending managed without numbers. Bring the TCO arithmetic.

Q3. What is the hosted-open-weight middle ground and when does it win? A. Providers like Together AI, Fireworks AI, Anyscale, Replicate, Hugging Face Inference Endpoints, Groq, SambaNova, Cerebras. They serve open-weight models (Llama, Mistral, Qwen, DeepSeek) at per-token prices typically lower than closed-weight managed APIs. They win when the workload tolerates open-weight quality, the volume is meaningful but not extreme, and the team does not want GPU operations burden. For most middle-volume workloads, hosted open beats both managed closed and self-hosted. Trap: Treating the choice as binary between managed-closed and self-hosted. The middle ground is usually the right answer.

Q4. What does realistic GPU utilisation look like and why does it matter? A. 50-80% on average. The naive arithmetic assumes 100%. Below 50%, the per-token cost rises until managed APIs are cheaper. Above 80%, the latency tail at peak load gets worse than users tolerate. You provision for peaks (so average is below peak) and provision for some redundancy (so capacity is below total hardware). The 50-80% range is where most steady production fleets land. Trap: Citing GPU spec sheet throughput as actual production throughput.

Q5. When is self-hosting forced regardless of economics? A. Four typical drivers. Data residency rules that managed APIs cannot satisfy. Air-gapped deployments. Regulated industries with third-party risk frameworks slower than the build. Latency requirements (sub-100ms) that managed APIs over WAN cannot meet. When forced, the question shifts from "is it cheaper" to "how do we do it well enough to compete on quality." Trap: Assuming forced self-hosting is also economically justified. You may pay more and still have to ship.

Q6. What is the typical year-one engineering cost for self-hosting? A. 1-3 senior FTE for the build. 0.5-1 FTE ongoing. Components include inference engineering (vLLM/TGI/SGLang), reliability engineering (HA, autoscaling, on-call), eval and observability, model lifecycle. At a loaded cost of $200-300k per senior engineer, that is $200-900k of year-one engineering on top of hardware. Spreadsheets that exclude this number routinely conclude self-hosting saves money when in reality it spends a year of engineering attention to break even later. Trap: Treating engineering attention as free. It is the most expensive line item in most TCO calculations.

Q7. How does the upgrade playbook change for self-hosted vs managed models? A. Self-hosted upgrades involve a model weight swap, an inference engine configuration change, possibly a new GPU layout if the model size changes. Stages of the playbook are the same — eval, shadow, canary, rollback — but the implementation involves infrastructure changes beyond a config flip. Rollback is harder because rolling back to a previous model may require rolling back the inference engine version too. Managed upgrades, by contrast, are usually a model name change. Trap: Assuming the upgrade playbook from chapter 10 applies unchanged. The stages do; the implementation differs.

Q8. Does an open-weight model have zero vendor risk? A. No. The weights have zero EOL risk. But every other piece of vendor risk is still present, just shifted. The host (Together, Fireworks, Anyscale, Replicate) is a supplier and can change pricing, capacity, or terms. The GPU cloud is a supplier (CoreWeave, Lambda, AWS) with its own risks. The inference engine (vLLM, TGI) is software you depend on and that can deprecate features. Self-hosting eliminates supplier risk on the weights and shifts it to suppliers of everything else. Trap: Equating "open weights" with "no vendor risk." The cook is yours; the kitchen is still rented.

Apply now (5 min)

Step 1: score your monthly bill. Open your managed API invoices. Take the trailing twelve months of spend as the annual figure, and divide it by 12 if you want a stable monthly number. The annual figure is the managed-side comparison anchor.

Step 2: build the honest self-hosted TCO. Write the seven components down: GPU lease, inference engineering, reliability, eval / observability, model lifecycle, security / compliance, quality-gap mitigation. Estimate each in dollars per year. Sum. Compare to step 1.

Step 3: check the hosted-open-weight option. Look up the price for Llama 3.3 70B or Mistral Large 2 or a comparable open-weight cook on Together AI or Fireworks AI. Multiply by your monthly token volume, then by 12 to match the annual anchor. That is the third leg of the comparison most teams forget to run.
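The three steps can be laid side by side in a short script. This is a sketch with placeholder inputs: the token volumes and managed prices reuse the chapter's illustrative figures, the hosted prices reuse the hypothetical $0.50 / $0.80 rates, and every self-hosted component value is an invented placeholder to be replaced with your own estimate. The helper names are hypothetical.

```python
REQUIRED_COMPONENTS = (
    "gpu_lease", "inference_engineering", "reliability", "eval_observability",
    "model_lifecycle", "security_compliance", "quality_gap",
)


def annual_token_cost(input_m, output_m, in_price, out_price):
    """Annual per-token bill; volumes are in millions of tokens per month."""
    return 12 * (input_m * in_price + output_m * out_price)


def self_hosted_tco(components):
    """Sum the seven annual components; refuse to run if any is missing."""
    missing = [name for name in REQUIRED_COMPONENTS if name not in components]
    if missing:
        raise ValueError(f"TCO is incomplete, missing: {missing}")
    return sum(components[name] for name in REQUIRED_COMPONENTS)


# Placeholder inputs: replace every number with your own.
legs = {
    "managed closed": annual_token_cost(1000, 200, in_price=3.00, out_price=15.00),
    "hosted open": annual_token_cost(1000, 200, in_price=0.50, out_price=0.80),
    "self-hosted open": self_hosted_tco({
        "gpu_lease": 48_000,
        "inference_engineering": 250_000,
        "reliability": 125_000,
        "eval_observability": 15_000,
        "model_lifecycle": 15_000,
        "security_compliance": 10_000,
        "quality_gap": 20_000,
    }),
}

# Cheapest first.
for name, cost in sorted(legs.items(), key=lambda item: item[1]):
    print(f"{name:<17} ${cost:>10,.0f} per year")
```

Raising an error on a missing component is deliberate: it makes the GPU-only spreadsheet fail loudly instead of producing an optimistic number.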

Bridge. Whether you pay per-token to a managed supplier, per-token to a hosted-open-weight host, or per-GPU-hour to yourself, the bill arrives with structure. Input is priced differently from output. Cached is priced differently from uncached. Batch is priced differently from realtime. The next chapter is reading the invoice properly.

→ 12-pricing-anatomy.md