05. Closed-weight vs open-weight — which kind of supplier do you trust?¶

~18 min read. Closed-weight suppliers rent you a finished kitchen. Open-weight suppliers ship you the recipes and the burners. Both can feed the restaurant. The right pick depends on volume, residency, regulation, and how much of the kitchen you want to own. The matching habit on this page is choosing between renting and owning the cook.

Builds on 04-task-to-tier-mapping.md. The routing matrix tells you which cook a ticket needs. This chapter zooms out one level — which supplier should that cook come from, and on whose premises should they work.

1) Hook — the same workload, three suppliers, three very different bills¶

A mid-sized fintech runs an AI document-review pipeline. Eighty million tokens a day flow through it. The work is mostly mid-tier — structured extraction from KYC documents, summarization of compliance reports, classification of suspicious-transaction memos. The team's lead has three credible plans on her desk.

Plan A — run everything on Anthropic Sonnet 4.6 via the Anthropic API. The prompts are already tuned, the strict-mode reliability is well understood, the latency is acceptable. The bill comes to roughly $3 input + $15 output per million tokens, blended at about $5/M for this workload. At 80M tokens a day, that is **$12,000 a month**. Closed-weight, vendor-managed, low operational burden.

Plan B — switch to DeepSeek-V3 on the DeepSeek API. Closed-weight in the sense that DeepSeek runs the inference, but the underlying weights are downloadable. Pricing at $0.27 input + $1.10 output per million tokens brings the bill down to roughly **$1,200 a month**. The catch — quality on the team's eval set drops from 89% to 84% on the complex extraction sub-task, and DeepSeek's data-residency story does not pass the company's EU compliance review.

Plan C — self-host Llama 3.3 70B on four H100 GPUs through a managed provider like Fireworks AI or Together AI. Pricing at roughly $0.90 input + $0.90 output per million tokens lands the bill at **$2,400 a month**. Quality on the team's eval set sits at 86% — between Sonnet and DeepSeek, above DeepSeek, comfortably above their 80% floor. Data residency can be guaranteed through the provider's EU regions.

Plan A is the safe default — 12K a month for a frontier closed-weight supplier on a managed API. Plan B saves 90% but trades quality and compliance. Plan C splits the difference — 80% savings, acceptable quality, controllable residency.

The lead chooses Plan C for the high-volume routine work and keeps Plan A warm for the 5% of complex cases the bake-off shows need frontier quality. Hybrid. Two suppliers, two cooks. The matching habit applied at the supplier level.

This chapter is about how she made that decision and how you make yours.

2) The metaphor — renting a finished kitchen vs buying the ovens¶

When the cook is closed-weight, you are renting a finished kitchen. The supplier owns the ovens, owns the prep stations, owns the cooking. You bring the order and they cook it. You pay per dish. The kitchen comes fully staffed with the latest equipment — strict-mode reliability, prompt caching, computer use, the newest reasoning modes. When the supplier releases a better stove, you get it without lifting a finger. The kitchen also has a locked door — you cannot walk in, you cannot inspect the burner, and if the supplier raises prices or shuts down a service line, you have no leverage except to find another rented kitchen.

When the cook is open-weight, you are buying the ovens. The recipe — the model weights — is published. You download it, you run it on GPUs you rent or own, and the cook works on premises you control. Capital cost is higher up front. Operations cost is higher every day. But the kitchen door is yours. You see exactly what is cooking. You can move the kitchen to a different building. You can fine-tune a recipe to your house style. You can serve customers in jurisdictions that forbid letting data leave the premises. And once you have crossed a volume threshold, the per-dish cost is meaningfully lower because you stopped paying the rent markup.

The supplier choice is not which cook is better. Frontier closed-weight cooks are usually a notch above the best open-weight cooks on hard tasks in 2026. The choice is which kitchen layout fits your restaurant. Volume, residency, regulation, and operational appetite decide. The matching habit on this page is figuring out which layout fits.

3) The split — what closed-weight and open-weight actually mean¶

Let's get the categories precise before the tradeoffs.

┌───────────────────────────────────────────────────────────────────────┐
│                       THE SUPPLIER LANDSCAPE                          │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  CLOSED-WEIGHT (API-only)                                             │
│  ────────────────────────                                             │
│  Anthropic   — Opus 4.7, Sonnet 4.6, Haiku 4.5                       │
│  OpenAI      — GPT-5, GPT-4o, GPT-4o-mini                            │
│  Google      — Gemini 2.5 Pro, Flash, Flash-Lite                     │
│                                                                       │
│  You get an API endpoint. You do not get the weights.                 │
│  Vendor controls everything — quality, pricing, uptime, deprecation.  │
│                                                                       │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  OPEN-WEIGHT (downloadable)                                           │
│  ───────────────────────────                                          │
│  Meta        — Llama 3.3 70B, Llama 3.3 405B                         │
│  Mistral     — Mistral Large 2, Mixtral 8x22B                        │
│  Alibaba     — Qwen 2.5 72B, Qwen 2.5 Coder                          │
│  DeepSeek    — DeepSeek-V3 (weights public, also runs an API)        │
│                                                                       │
│  You can download the weights. You can run them anywhere.             │
│  Licensing varies — some commercial-friendly, some restricted.        │
│                                                                       │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  HOSTED OPEN-WEIGHT (best of both)                                    │
│  ───────────────────────────────────                                  │
│  Together AI, Fireworks AI, Anyscale, Replicate, OctoAI,             │
│  Groq, SambaNova, Cerebras, Hugging Face Inference Endpoints         │
│                                                                       │
│  Someone else runs the open-weight model on their GPUs.               │
│  You get an API, the vendor takes the ops burden, you keep the       │
│  weight portability.                                                  │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

The third row is where most production teams actually live. Pure self-hosted open-weight is a real option for the largest deployments and the strictest residency regimes, but for most workloads, hosted open-weight is the cheaper and saner middle ground.

4) When the closed-weight kitchen wins¶

Five forces push you toward a closed-weight supplier.

Frontier quality. On hard reasoning, hard tool selection, hard multimodal tasks, the top closed-weight models in 2026 — Opus 4.7, GPT-5, Gemini 2.5 Pro — still lead public benchmarks and most internal bake-offs by a meaningful margin. Llama 3.3 405B is the strongest open-weight contender and lands roughly 3-7 points behind the frontier closed-weights on most hard evaluations. If your workload sits at the top of the quality curve and those points matter, closed-weight is the cheaper outcome despite the higher per-token price.

Low operational burden. Closed-weight is just an HTTPS endpoint. No GPU provisioning, no model serving stack, no autoscaling, no kernel-version matrix, no inference-engine upgrades. A two-engineer team can run a production LLM workload on Anthropic or OpenAI indefinitely without ever hiring an MLOps specialist. Self-hosted open-weight requires that specialist. Hosted open-weight is closer to the closed-weight burden but still has more failure modes — model-version mismatches, provider quirks, inference-engine differences across vendors.

Fast iteration on new features. When Anthropic ships prompt caching, computer use, an extended-thinking mode, or a new tool-use envelope, you get it the day it ships. Open-weight ecosystems lag by months — the inference engines need to add the feature, the providers need to roll out the engines, and your code needs to update. In 2026, computer use, native parallel tool calling, and adaptive reasoning depth are all closed-weight exclusives or closed-weight-first.

Strict-mode reliability. OpenAI's strict: true and Anthropic's tool-use validation guarantee schema-conformant output at ~99.5%+. Open-weight models without specialized inference-engine support (vLLM's structured-output mode, TGI's guidance integration) sit at ~92-95%. Five points sounds small until you compound it across an eight-step agent — strict reaches 97% end-to-end, non-strict drops to 67%.

Latest research features. Reasoning modes, extended thinking, native multi-modal vision, audio, computer use — closed-weight suppliers ship these first. By the time the open-weight world catches up, the closed-weight world has moved another step.

5) When the open-weight kitchen wins¶

Five different forces push you the other way.

Data residency and air-gapped deployments. If your data cannot leave a specific jurisdiction, a specific VPC, or a specific physical datacenter, closed-weight APIs are usually disqualified. The Anthropic API runs in specific AWS regions; the OpenAI API runs in OpenAI-controlled infrastructure; Gemini runs in Google's cloud. None of them can deploy inside your VPC, your on-prem datacenter, or your air-gapped facility. Open-weight on self-hosted GPUs can. Defense contractors, healthcare systems with HIPAA-restricted data, financial institutions with data-sovereignty rules, and any organization with EU GDPR concerns about US-based suppliers often have no choice but open-weight on infrastructure they control.

Fine-tuning freedom. Closed-weight suppliers offer fine-tuning, but the process is gated. You upload data, they train and host the fine-tuned model, they keep the weights. With open-weight, you fine-tune on your own GPUs, you keep the weights, you control the training data lineage, you can run custom training recipes (LoRA, QLoRA, DPO, RLHF). For specialized domains — medical NER, legal contract analysis, code in a private DSL — open-weight fine-tuning often produces a model that beats the closed-weight frontier on your narrow task, at a fraction of the inference cost.

Fixed cost at scale. This is the big one. Closed-weight pricing is linear with usage — every token costs the same as the last token. Self-hosted open-weight has fixed cost (the GPU cluster) and near-zero marginal cost above the fixed capacity. Above a crossover point — usually somewhere between 50M and 500M tokens per month depending on workload, model size, and GPU choice — self-hosted open-weight is cheaper. For workloads at a billion tokens a month, the savings often exceed 70%.

No rate limits. Closed-weight APIs impose rate limits — RPM and TPM caps — that throttle bursts. Self-hosted open-weight has the rate limit you deploy, which can be 10x what the closed-weight provider would give you on the same enterprise tier. For ingestion pipelines, batch processing, and research workloads that need to chew through terabytes of text quickly, this is decisive.

Regulatory and auditability requirements. Some regulators want to know exactly which model produced an output, what data it was trained on, and whether the model has been modified. Closed-weight suppliers cannot fully answer these questions — their training data is proprietary. Open-weight suppliers publish papers, datasheets, and sometimes training data manifests. For regulated industries that need to demonstrate model provenance, open-weight is often the only defensible choice.

6) The cost crossover — worked numbers¶

Let's put real numbers on the crossover point. Take a workload that needs mid-tier quality on routine extraction and summarization. Compare three options at four volume levels.

Option A — Anthropic Sonnet 4.6 (closed-weight, vendor API)
  Pricing: $3 input + $15 output per million tokens
  Blended at 60/40 input/output split: ~$7.80/M tokens

Option B — Llama 3.3 70B on Fireworks AI (hosted open-weight)
  Pricing: $0.90 input + $0.90 output per million tokens
  Blended: $0.90/M tokens

Option C — Llama 3.3 70B self-hosted on 4 H100 GPUs (Anyscale, Modal, your VPC)
  Fixed: 4 × $3/hour × 730 hours = $8,760/month
  Capacity: ~120M tokens/hour at 70B model on 4 H100s
  Marginal cost: ~$0/M above fixed capacity until cluster saturates

At four volume points, the monthly bill looks like this.

                       OPTION A      OPTION B       OPTION C
                       (Sonnet API)  (Fireworks)    (Self-host)
                       ─────────────────────────────────────────
10M tokens/month       $78           $9             $8,760 (way over)
100M tokens/month      $780          $90            $8,760 (still way over)
1B tokens/month        $7,800        $900           $8,760 (now close)
10B tokens/month       $78,000       $9,000         $8,760 (winning big)
50B tokens/month       $390,000      $45,000        $8,760 + $8,760 × N clusters

The crossover for self-hosting against hosted open-weight is around 9-10 billion tokens per month. The crossover for hosted open-weight against closed-weight API is essentially zero — hosted open-weight is cheaper from the first token. The crossover for self-hosting against closed-weight is at about 1.1 billion tokens per month for this workload.

These numbers shift if you change the model size, the GPU choice, the utilization rate, the engineer time to operate the cluster (which is real money even if it does not appear in the table), and the quality delta. A frontier closed-weight model that is 5 points better on a quality-critical workload may be worth the cost premium even at high volume. Volume alone does not decide.

Mid-content recall¶

Why is hosted open-weight often the practical middle ground rather than pure self-hosted?
Name three forces that push you toward closed-weight and three that push you toward open-weight.
Roughly where does the crossover from hosted open-weight to self-hosted open-weight land for a mid-tier workload?

7) The open-weight contenders — who's who in 2026¶

The open-weight ecosystem in 2026 is not a single market. Four families matter and each has a distinct shape.

Llama 3.3 (Meta). The 70B and 405B models. Largest ecosystem, broadest tooling, most-tested fine-tuning recipes, best inference-engine support across vLLM, TGI, and Ollama. Llama is the default open-weight choice if you need predictability — every hosted provider runs it, every fine-tuning guide assumes it. License is commercial-friendly with attribution. The 70B trades roughly 6-8 quality points against Sonnet 4.6 on mid-tier tasks; the 405B closes most of that gap but doubles the inference cost.

Mistral Large 2 (Mistral AI). EU-based supplier, strong on European languages, strong on compliance positioning, hosted on Mistral La Plateforme as well as downloadable. Mistral's 2026 lineup includes Mistral Large 2 (frontier-ish), Mistral Medium, Mistral Small, and the Mixtral mixture-of-experts models. Strongest pick when EU data residency and the EU AI Act are board-level concerns. Slightly weaker than Llama on non-European languages.

Qwen 2.5 (Alibaba). Strongest open-weight model on Chinese-language tasks, very strong on code (Qwen 2.5 Coder is often within a hair of Claude on coding benchmarks), competitive on English reasoning. The hosted Alibaba Cloud endpoints raise the same residency concerns in reverse for US/EU customers as Anthropic raises for Chinese customers, but the downloadable weights are usable anywhere. Often the default choice for Chinese-market applications and for coding-heavy workloads where the cost savings vs frontier closed-weight matter.

DeepSeek-V3 (DeepSeek AI). The cost-per-quality leader in 2026. DeepSeek-V3 hits Sonnet-tier quality on many benchmarks at roughly one-tenth the inference cost via the DeepSeek API, and the weights are downloadable for self-hosting. The catch — the DeepSeek API runs on Chinese infrastructure, raising residency concerns for many Western enterprises. Self-hosted DeepSeek is the workaround when the cost-quality ratio matters and residency is non-negotiable.

┌──────────────────┬──────────────────┬──────────────────────────────┐
│ FAMILY           │ STRENGTH         │ WATCH OUT FOR                │
├──────────────────┼──────────────────┼──────────────────────────────┤
│ Llama 3.3 70B    │ Ecosystem,        │ ~6-8 pts behind Sonnet on   │
│                  │ tooling,          │ hard reasoning              │
│                  │ predictability    │                              │
├──────────────────┼──────────────────┼──────────────────────────────┤
│ Llama 3.3 405B   │ Closes gap to     │ 4x inference cost of 70B;   │
│                  │ closed-weight     │ needs 8+ H100s              │
├──────────────────┼──────────────────┼──────────────────────────────┤
│ Mistral Large 2  │ EU compliance,    │ Weaker on non-European      │
│                  │ EU languages      │ languages                   │
├──────────────────┼──────────────────┼──────────────────────────────┤
│ Qwen 2.5 72B     │ Chinese, code,    │ Alibaba Cloud residency for │
│                  │ cost              │ hosted; self-host if Western│
├──────────────────┼──────────────────┼──────────────────────────────┤
│ DeepSeek-V3      │ Cost-per-quality  │ Chinese-API residency;       │
│                  │ leader            │ newer ecosystem; self-host  │
│                  │                   │ for control                  │
└──────────────────┴──────────────────┴──────────────────────────────┘

8) The hybrid pattern — most production systems run two suppliers¶

The single-supplier story is the exception, not the norm. Most serious production AI systems in 2026 run a hybrid — open-weight for the high-volume routine work, closed-weight for the cases that need frontier quality.

                INCOMING TICKET
                       │
                       ▼
                ┌──────────────┐
                │ ROUTER       │  ← Haiku-tier classifier
                │ (closed)     │
                └──────┬───────┘
                       │
       ┌───────────────┼────────────────┐
       ▼               ▼                ▼
   "simple"        "standard"       "complex"
       │               │                │
       ▼               ▼                ▼
  Llama 3.3 70B   Llama 3.3 70B    Opus 4.7
  on Fireworks   on Fireworks      via Anthropic
       │               │                │
       ▼               ▼                ▼
   $0.0008/req    $0.003/req        $0.05/req
       (62%)          (30%)            (8%)

The cost math is decisive. A 3M-tickets-per-month workload at all-Sonnet might run $30,000 a month. The hybrid pattern above lands at roughly $8,000 a month — a 73% reduction, with the frontier closed-weight cook preserved for the cases that genuinely need her.

The pattern relies on three things. One — the bake-off must confirm that the open-weight cook clears the quality floor for the routine cases. Two — the router must correctly identify which 8% are complex. Three — the operational glue (failover, observability, prompt-portability work) must be in place so both suppliers can be operated cleanly. The next two chapters cover the third item.

9) Failure modes — supplier choices that backfire¶

SIGNAL                                FIX
──────                                ───
"open-weight is always cheaper"     → run the math at YOUR volume; at low
                                       volume closed-weight wins on TCO

"self-host because we control it"   → unless volume justifies it,
                                       hosted open-weight gives 90% of the
                                       control at 10% of the ops cost

skipping the quality bake-off       → DeepSeek-V3 looks Sonnet-tier on
 because pricing looks great          benchmarks; on YOUR workload it may
                                       not — measure first

ignoring data residency until        → board-level compliance review will
 launch                                kill closed-weight choices late if
                                       residency was not designed in

assuming open-weight features        → strict mode, prompt caching, computer
 match closed-weight                   use, native tool envelopes lag in
                                       open-weight by months

fine-tuning before bake-off          → fine-tuning is expensive; only commit
                                       after the base open-weight model
                                       fails a bake-off you ran

choosing 405B when 70B was enough    → 4x inference cost without a
                                       proportional quality gain on most
                                       workloads

operating self-hosted without        → MLOps headcount is a real cost;
 dedicated headcount                   budget the engineer-month, not just
                                       the GPU bill

forgetting license restrictions      → some open-weight models have
                                       commercial-use clauses; read the
                                       license before committing

single-supplier dependency           → even closed-weight teams should run
                                       a hot-warm secondary; see chapter 7

The deepest pattern across these — the supplier choice is a system-design decision, not a procurement decision. It interacts with volume, residency, team shape, quality requirements, and operational maturity. Reduce it to a single dimension and you will pick wrong.

Where this lives in the wild¶

The closed-vs-open-weight choice shows up in every part of the production AI ecosystem.

Anthropic API — Opus 4.7, Sonnet 4.6, Haiku 4.5; closed-weight, vendor- managed, available via direct API and AWS Bedrock and Vertex AI.
OpenAI API — GPT-5, GPT-4o, GPT-4o-mini; closed-weight, vendor-managed, available via direct API and Azure OpenAI Service.
Gemini API — Gemini 2.5 Pro, Flash, Flash-Lite; closed-weight, vendor- managed, available via Vertex AI and direct AI Studio.
AWS Bedrock — multi-vendor closed-weight access (Anthropic, AI21, Cohere, Meta, Mistral, Stability) on shared-tenancy AWS infrastructure.
Azure OpenAI Service — closed-weight OpenAI models on Azure infrastructure with enterprise compliance.
Vertex AI — Google's multi-vendor closed-weight gateway with Gemini, Anthropic, Mistral, and others.
Mistral La Plateforme — Mistral's own API for open-weight and hosted- closed-weight models with EU residency.
DeepSeek API — direct API for DeepSeek-V3 at cost-leading prices.
Together AI — hosted open-weight (Llama, Mistral, Qwen, DeepSeek) on Together's GPUs with OpenAI-compatible API.
Fireworks AI — hosted open-weight with low-latency serving and fine-tuning support.
Anyscale — Ray-based hosted open-weight serving and training.
Replicate — open-weight model gallery with per-call pricing and cold-start tolerated workloads.
Modal, Banana — serverless GPU platforms for self-deployed open-weight models.
Groq — proprietary inference hardware running open-weight models at very high tokens-per-second.
SambaNova, Cerebras — specialized inference hardware for open-weight models at high throughput.
OctoAI — hosted open-weight serving (acquired by NVIDIA in 2024, continues to operate).
Hugging Face Inference Endpoints — dedicated open-weight serving with per-deployment GPU allocation.
vLLM — high-throughput open-source inference engine for open-weight models, used by most hosted providers under the hood.
TGI (Text Generation Inference) — Hugging Face's inference engine for open-weight serving.
llama.cpp server — CPU-and-Apple-Silicon-friendly open-weight serving for low-resource and edge deployments.
Ollama — local open-weight runtime for developer workstations and small deployments.
LiteLLM — open-source proxy that abstracts over closed- and open-weight suppliers with a uniform API.
OpenRouter — multi-supplier gateway for both closed and open weights, with per-request supplier selection.
Vercel AI SDK — multi-supplier abstraction for both closed and open weights at the framework level.
Helicone, Langfuse, LangSmith, Braintrust — observability platforms that let you compare supplier choices in production traces.
Vellum, PromptLayer, Pezzo — prompt management with per-supplier routing configuration.

Pause and recall¶

What is the practical difference between closed-weight, open-weight, and hosted open-weight?
State the five forces that push toward closed-weight and the five that push toward open-weight.
Roughly where is the cost crossover from closed-weight API to hosted open-weight? From hosted open-weight to self-hosted?
Name the four major open-weight families and one distinguishing strength of each.
Why is the hybrid pattern more common than single-supplier?
What three pieces must be in place for the hybrid pattern to work in production?
Which closed-weight-exclusive features lag months in the open-weight ecosystem?

Interview Q&A¶

Q1. How would you decide between closed-weight and open-weight for a new production workload? A. Five axes. Volume — closed-weight wins at low volume on total cost of ownership because the ops cost is absorbed by the vendor; open-weight wins above the crossover. Residency — closed-weight is disqualified when data cannot leave specific infrastructure. Quality — frontier closed-weight still leads on the hardest workloads; if your bake-off shows open-weight clears the quality floor, the cost case usually wins. Operational appetite — closed-weight is a few API calls; self-hosted open-weight is a real ops investment. Feature timeline — strict mode, prompt caching, computer use, extended reasoning launch closed-weight first. Walk through each axis with the interviewer; whichever axis is binding decides the choice. Trap: Picking on price alone. Open-weight is not always cheaper, and even when it is, the ops cost can erase the savings.

Q2. The team says "self-host because we control it." What's missing? A. Three things. One — the engineer-month cost of running the cluster. MLOps for production LLM serving is not free; budget the headcount, not just the GPU bill. Two — the feature lag. Computer use, strict mode, and native parallel tool calling are months behind in open-weight. Three — the hybrid alternative. Hosted open-weight (Fireworks, Together, Anyscale) gives 90% of the control — model choice, fine-tuning, residency — at 10% of the ops cost. Pure self-hosted is justified by volume above 5-10B tokens a month, strict air-gap requirements, or in-house GPU capacity already paid for. Trap: Conflating "control" with "self-hosted." You can have control via hosted open-weight without owning the infrastructure.

Q3. Walk me through the cost crossover math for a 1B-tokens-per-month workload. A. Take Sonnet 4.6 blended at $7.80/M; that is $7,800 a month at 1B tokens. Hosted Llama 3.3 70B on Fireworks at $0.90/M is $900 a month. Self-hosted Llama on 4 H100s at $8,760/month fixed has enough capacity for 1B tokens and is roughly tied with Fireworks at this volume — the self-host advantage shows up above 5-10B. Add a quality delta — if Llama clears your quality floor, switching saves $7,000 a month; if it falls short on the complex 8%, run a hybrid and capture $5,000 a month with the frontier preserved for the hard cases. The math is volume × per-token-savings minus the ops/engineering cost of operating the second supplier. Trap: Reciting prices without doing the engineer-month math. Ops cost is real money.

Q4. Why is hosted open-weight often the right middle ground? A. Three reasons. One — you keep weight portability, which means you can move providers if pricing or terms change, fine-tune on your data, and deploy to regions the closed-weight vendors do not serve. Two — you offload the ops work (GPU provisioning, autoscaling, inference-engine upgrades) to a provider whose core business is doing it. Three — you get OpenAI-compatible APIs, so the migration cost from closed-weight to hosted open-weight is mostly prompt-portability work rather than infrastructure work. Pure self-host is only worth it above 5-10B tokens a month or under strict air-gap requirements. Trap: Treating self-hosted as the only "real" open-weight. Hosted open-weight is where most production workloads should sit.

Q5. A team claims DeepSeek-V3 is Sonnet-tier at one-tenth the price. How do you evaluate? A. Three checks. One — what benchmarks? Public benchmarks are not your workload; run a bake-off on your evals. Two — what residency? DeepSeek's hosted API runs on Chinese infrastructure; if that is a concern, the real comparison is self-hosted DeepSeek vs Sonnet, which shifts the cost math. Three — what features? DeepSeek's tool-use, strict-mode, caching, and reasoning-mode support may not match Sonnet's; the "Sonnet-tier" claim may be on raw completion quality, not feature parity. If your bake-off confirms quality on your workload and residency works, DeepSeek can be a credible primary or a hybrid component. If not, the headline number is misleading. Trap: Treating leaderboard scores as decisive. Leaderboards are not your workload.

Q6. Why might you choose Mistral Large 2 over Llama 3.3 70B for a European fintech? A. Two main reasons. One — Mistral is EU-based, hosted on EU infrastructure, and positioned around the EU AI Act compliance story. For a European fintech, the procurement and compliance review goes faster with an EU supplier. Two — Mistral Large 2 is stronger on European languages, particularly French, German, and Spanish, which matters for a multi-country deployment. Llama is the stronger ecosystem choice but compliance and language often win the supplier argument in European fintech. Trap: Choosing on pure benchmark numbers when compliance is binding.

Q7. When does the "open-weight is always cheaper" claim break down? A. Three breakdowns. One — at low volume, the operational and fixed costs of self-hosting exceed the closed-weight API bill; under 50-100M tokens a month, closed-weight is usually cheaper on TCO. Two — when quality is load-bearing, the closed-weight model may save more in downstream costs (re-work, customer impact) than the per-token premium. Three — when the workload needs features only available in closed-weight (strict mode at 99.5%+, native computer use, prompt caching, extended reasoning), the "cheaper" open-weight ends up requiring extra engineering to approximate those features. Trap: Reducing the choice to per-token pricing.

Q8. A regulated healthcare workload needs frontier-tier reasoning quality on patient-record analysis. Closed-weight or open-weight? A. Probably hybrid with self-hosted open-weight as the primary path. Patient-record data typically cannot leave HIPAA-compliant infrastructure, which disqualifies most closed-weight APIs in their default configuration. Some closed-weight vendors (Anthropic via Bedrock with BAA, OpenAI via Azure with BAA) offer HIPAA-compliant deployments — those are the only closed-weight options. Otherwise the path is self-hosted Llama 3.3 405B or DeepSeek-V3 on HIPAA-compliant infrastructure, accepting a 3-7 quality point gap against frontier closed-weight and compensating with a stronger human-in-the-loop review layer for the high-stakes outputs. Trap: Suggesting an unencumbered closed-weight API for HIPAA data. Default closed-weight contracts do not include BAA; you must negotiate.

Apply now (5 min)¶

Step 1 — categorize your current suppliers. List every model your system calls today. For each, mark it closed-weight (vendor API), hosted open-weight (Together, Fireworks, etc), or self-hosted open-weight. If you only see one category, your hybrid pattern is missing.

Step 2 — find the wrong-supplier rows. For each model in step 1, write the workload volume in monthly tokens and the per-token price. Mark red any row where (a) volume × price exceeds $1,000/month and (b) the workload is routine enough that an open-weight cook would clear the quality floor. Each red row is a candidate for a hybrid move.

Step 3 — run a 50-call shadow test. Pick one red row. Send the same 50 prompts to your current closed-weight supplier and to a hosted open-weight equivalent (Llama 3.3 70B on Fireworks is a safe first try). Compute quality on your eval rubric and cost per call. The result is either evidence to migrate or evidence to stop arguing about it.

Bridge. Picking a different supplier sounds simple — change the endpoint, change the API key. It is not. The next chapter is the hidden tax of switching — prompt portability, tokenizer math, schema drift, tool-call format. Move one prompt across suppliers without understanding the tax and quality drops by ten points overnight.

→ 06-switching-cost-anatomy.md