Skip to content

06. Switching-cost anatomy — the hidden tax of moving between suppliers

~18 min read. Two suppliers can have identical APIs and you can still lose ten quality points porting a prompt overnight. Prompt portability, tokenizer math, schema drift, tool-call format, strict-mode semantics, streaming shapes, caching mechanics — each one is a small tax, and they compound. This chapter is the inventory of every tax you owe when you change suppliers, and the playbook for paying them deliberately.

Builds on 05-closed-vs-open-weight.md. The previous chapter helped you pick which kitchen to rent. This one is the moving-day reality — what breaks when you actually try to walk a recipe from one kitchen to another.


1) Hook — the team that lost twelve points overnight

A consumer-tech team runs a customer-support summarization workload on GPT-4o via the OpenAI API. The prompts are tuned. The eval set scores 87% on a faithfulness-and-completeness rubric. The bill is $18,000 a month.

Anthropic releases Sonnet 4.6 with stronger long-context performance on exactly this kind of workload. The team's lead reads the announcement, runs a quick test on five prompts, sees the outputs look fine, and ships the migration over a weekend. Monday morning, the same eval set drops to 75%. Customer-support agents start flagging summaries that miss key points or invent unrelated details. The team rolls back by Wednesday.

What went wrong? Six taxes, paid by accident.

THE BILL FOR AN UNPLANNED MIGRATION
───────────────────────────────────
1. System-prompt placement     Sonnet uses a dedicated `system` parameter;
                               the team had pasted system content into the
                               first user message. Sonnet treated the
                               instructions as user input, not directive.

2. XML-tag conventions         The original prompt used markdown headers
                               (## Context, ## Task). Anthropic models
                               score meaningfully higher when context is
                               wrapped in XML tags (<context>, <task>).

3. Few-shot format             OpenAI prompts used JSON-like example
                               formatting. Sonnet does better with prose-
                               narrative examples for this task type.

4. Tokenizer count             The original 12K-token context exceeded
                               14K when re-tokenized by Anthropic's
                               tokenizer; truncation was silently
                               clipping the most important section.

5. Strict-mode semantics       OpenAI's `strict: true` was on; Sonnet's
                               tool-use validation has a different feature
                               surface. Some schema constraints became
                               soft.

6. Output format               GPT-4o emitted clean JSON 99% of the time
                               at strict; Sonnet at 96% — but the team
                               had no retry layer, so 4% of outputs broke
                               downstream parsing.

A twelve-point quality drop is not bad luck. It is six unpaid taxes adding up. The team had no shadow eval, no per-tax checklist, and no rollback plan tighter than "revert the env var." This chapter is the inventory of the taxes and the playbook for paying them deliberately.


2) The metaphor — moving a recipe between kitchens

A chef who runs a recipe in one kitchen and walks it into another kitchen discovers something quickly. The oven temperatures are calibrated differently. The flour measures by weight here and by volume there. The sauce-pan bottoms conduct heat differently. The order in which you add the ingredients matters more in one kitchen because the burner runs hotter. The same recipe, the same ingredients, the same intent — and the dish comes out different.

A professional chef does not just hand the recipe to the new kitchen and hope. She reads the recipe, walks the new kitchen, adjusts what needs adjusting, runs a test plate, adjusts again, and only then puts the dish on the menu. The matching habit on this page is treating every supplier migration like that test-plate process — never the "weekend switch."

Each tax on this page is one of those kitchen-specific quirks. Prompt portability is the recipe rewrite. Tokenizer math is the new measuring cups. Schema drift is the different pan shapes. Tool-call format is the different ladle handles. Streaming format is how the dish is plated. Each quirk is small. Together they are why a careful chef never assumes a recipe ports for free.


3) The seven taxes — what changes when the supplier changes

┌──────────────────────────────────────────────────────────────────────┐
│                THE SWITCHING-COST ANATOMY                            │
├──────────────────────────────────────────────────────────────────────┤
│ 1. Prompt portability    — system prompt placement, role formatting, │
│                             XML/markdown conventions, few-shot shape │
│ 2. Tokenizer math        — same text, different token counts,        │
│                             different costs, different truncation    │
│ 3. Schema drift          — same JSON Schema, different adherence     │
│                             rates and feature support                │
│ 4. Tool-call format      — function-call envelopes differ; parallel  │
│                             tool calling differs                     │
│ 5. Strict-mode semantics — what "strict" guarantees varies by vendor │
│ 6. Streaming format      — SSE event shapes and content fields       │
│                             differ; chunk boundaries shift           │
│ 7. Prompt caching        — different mechanisms, different cache key │
│                             behaviors, different price structures    │
└──────────────────────────────────────────────────────────────────────┘

Each one is small in isolation. Skipped, they compound. Paid deliberately, the migration goes clean. Let's walk through each tax.


4) Tax #1 — prompt portability

The same instructions land differently on different models.

System-prompt placement. OpenAI uses role-based messages — a role: "system" message at the start of the chat. Anthropic has a dedicated system parameter outside the messages array. Gemini calls it systemInstruction. Mistral and most open-weight serving frameworks follow OpenAI's role-based convention.

OpenAI / Mistral / vLLM:           Anthropic:
─────────────────────              ──────────
messages = [                       messages = [
  {role: "system",                   {role: "user", content: "..."}
   content: "You are..."},         ]
  {role: "user", content: "..."}   system = "You are..."
]

Forget this and on Anthropic, your system content lands inside the user message — the model treats it as user input, not directive. Quality drop of 3-8 points on instruction-following workloads.

Role formatting and turn structure. OpenAI accepts arbitrary alternation of user and assistant. Anthropic enforces strict alternation (user, assistant, user, assistant) and rejects two user-messages in a row. Gemini's structure is closer to Anthropic. Llama-family chat templates have model-specific role tokens (<|start_header_id|>) baked in.

XML tags vs markdown. Anthropic's docs are explicit — wrap context in XML tags. The model has been trained on these conventions and prompts using <context>...</context> <task>...</task> <examples>...</examples> score consistently better. OpenAI is more agnostic — markdown headers, JSON-like structures, and prose all work. Gemini sits closer to OpenAI's tolerance.

Few-shot formatting. Examples can be embedded as a chat history (role-based) or as inline prose examples in the system prompt. The choice matters. Anthropic recommends prose examples in XML tags. OpenAI docs lean toward role-based history. Get this wrong and the model either ignores the examples or treats them as the active conversation.

The prompt-portability tax is the largest of the seven. A naive port loses 5-12 points on quality, sometimes more on instruction-heavy workloads. The fix is to rewrite the prompt for the target supplier using that supplier's documented conventions, then run a shadow test before the cutover.


5) Tax #2 — tokenizer math

The same English sentence yields different token counts across vendors, because each vendor's tokenizer was trained on different corpora with different subword units.

TEXT: "The quick brown fox jumps over the lazy dog. " × 100 (1,000 words)

TOKENIZER                  TOKEN COUNT
─────────                  ───────────
OpenAI cl100k (GPT-4)       ~745
OpenAI o200k (GPT-4o)       ~700
Anthropic (Claude 3+)       ~850
Google (Gemini)             ~900
Llama 3 (sentencepiece)     ~790
Mistral                     ~780

The variance is roughly 20-25% across vendors for English. For non-English text, particularly Chinese, Japanese, and Indic languages, the variance is much larger — Anthropic's tokenizer is often 1.5-2x more efficient on Devanagari script than older OpenAI tokenizers, for example.

Three operational consequences.

Cost math shifts. A workload that costs $X per million tokens on OpenAI does not cost the same amount when you re-bill the same content on Anthropic, even at the same per-token price. If your content tokenizes 14% larger on Anthropic, your effective price-per-task is 14% higher than the headline rate suggests.

Context-window math shifts. A document that fits in 100K tokens on GPT-4o may overflow a 100K budget on Anthropic. Silent truncation kills quality without raising an obvious error. Always re-measure context fit in the target supplier's tokenizer before migrating.

Prompt caching math shifts. Cache savings depend on token counts. A prompt cached at $0.30 per million tokens on Anthropic at 12K tokens saves $X/request; the same prompt at 14K Anthropic tokens caches at a different absolute amount even if the per-token price is identical.

EXAMPLE: 50,000-token document workload
─────────────────────────────────────
OpenAI GPT-4o tokenizer count    50,000 tokens → $0.25 input cost
Anthropic Sonnet 4.6 tokenizer   57,000 tokens → $0.171 input cost
                                  (input price $3/M × 0.057)

Per-task input cost is LOWER on Anthropic despite the larger token
count, because Sonnet's input price is lower than GPT-4o's. The
tokenizer math is non-trivial; assume nothing.

Always re-tokenize a representative sample on the target supplier's tokenizer before claiming a cost number for the migration. Both Anthropic and OpenAI publish their tokenizers (or compatible libraries) for exactly this purpose.


6) Tax #3 — schema drift

Same JSON Schema. Different adherence rates. Different feature support.

OpenAI's strict: true provides constrained decoding — the model emits schema-conformant output at ~99.5%+ reliability, with some schema features disallowed (overlapping anyOf, optional fields without defaults, certain oneOf patterns).

Anthropic's tool-use validation is different. The model produces JSON, the SDK validates against the schema, retry-on-failure is the standard pattern. Reliability is high (95-98%) but not at the constrained-decoding guarantee level. The feature support is broader — most JSON Schema features work.

Gemini supports schemas with propertyOrdering — the order of properties in the schema becomes the order in the output. This matters for chain-of-thought patterns where you want reasoning fields before conclusion fields.

Open-weight inference engines vary widely. vLLM with guided decoding gets to ~95% on simple schemas, lower on complex nested structures. TGI's constrained generation is similar. Without specialized inference-engine support, raw open-weight models hit ~85-92% schema adherence — workable with retries but not strict-mode equivalent.

SCHEMA ADHERENCE RATES (approximate, varies by schema complexity)
─────────────────────────────────────────────────────────────────
OpenAI strict: true             99.5-99.9%
Anthropic tool-use validation   95-98%
Gemini structured output        96-99%
vLLM guided decoding            93-96%
Open-weight without guidance    85-92%

The migration risk — same schema, different adherence rate, and your retry logic was tuned for the original supplier's failure rate. If your downstream parser assumes 99.5%+ valid JSON, dropping to 95% means 5x more retries, 5x more downstream errors, and likely an SLA breach.

The fix during migration — add a retry-on-validation-failure layer sized for the new supplier's failure rate, and add a downstream parser that fails gracefully on the rare invalid output.


7) Tax #4 — tool-call format

Every supplier wraps tool calls slightly differently.

OpenAI function-call envelope:
{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_abc",
      "type": "function",
      "function": {
        "name": "lookup_order",
        "arguments": "{\"order_id\":\"4481\"}"  // string, JSON-encoded
      }
    }
  ]
}

Anthropic tool-use envelope:
{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_xyz",
      "name": "lookup_order",
      "input": {"order_id": "4481"}  // object, parsed
    }
  ]
}

Gemini function-call:
{
  "candidates": [{
    "content": {
      "parts": [{
        "functionCall": {
          "name": "lookup_order",
          "args": {"order_id": "4481"}  // object, parsed
        }
      }]
    }
  }]
}

Three differences that bite.

Arguments are a string in OpenAI, an object in Anthropic and Gemini. Code that does json.loads(call.arguments) works for OpenAI and breaks on Anthropic. Code that does call.input works for Anthropic and breaks on OpenAI. Frameworks like LiteLLM, the Vercel AI SDK, and LangChain abstract this — if you wrote raw SDK code, the migration is more work.

Tool-result envelope shape varies. When you return the tool output to the model, OpenAI expects a message with role: "tool" and tool_call_id. Anthropic expects a tool_result content block inside a user message with tool_use_id. Gemini wraps it as a functionResponse part. Same concept, three shapes.

Parallel tool calling support differs. OpenAI supports parallel tool calls — the model can issue multiple tool_calls in one response. Anthropic supports parallel as well in 2026. Gemini supports parallel function calls. Open-weight serving frameworks vary — vLLM's parallel tool-call support depends on the model and the version. If your agent relies on parallel calls for latency, verify the target supplier supports them at production quality.

The migration impact — code that uses raw SDKs needs rewriting for each supplier. The fix is to abstract early, either through a library (LiteLLM, LangChain, Vercel AI SDK) or through your own thin wrapper that normalizes the envelopes.


8) Tax #5 — strict-mode semantics

"Strict" means different things in different places.

OpenAI's strict: true is constrained decoding — the model emits schema-conformant tokens at decode time. The output is guaranteed valid (modulo a tiny tail of edge cases). The cost is some schema features are disallowed and decoding adds ~50-200ms latency.

Anthropic does not have a strict flag in the same sense. Anthropic's tool-use validation is post-hoc — the SDK validates the model's output against the schema and you retry on failure. Reliability is high but not constrained-decoding-high.

Gemini's structured output mode is closer to constrained decoding, particularly with responseSchema and responseMimeType: application/json.

Open-weight engines have their own constrained-decoding implementations (vLLM's guided decoding, TGI's grammar support, outlines library). Each has its own feature support and quirks.

"STRICT" ACROSS SUPPLIERS
─────────────────────────
OpenAI strict:true           constrained decoding (decode-time)
Anthropic tool validation    post-hoc validation (retry on fail)
Gemini responseSchema        constrained decoding (decode-time)
vLLM guided_decoding         constrained decoding (grammar-based)
Open-weight no guidance      no constraint; ~85-92% adherence

The migration risk — assuming "strict mode" exists with identical semantics. It does not. The mitigation is to add a retry-on-failure layer regardless of supplier and to measure the actual schema-adherence rate on your workload before the cutover.


9) Tax #6 — streaming format

Streaming responses use Server-Sent Events (SSE), but the event shapes differ.

OpenAI streams chunks with delta.content containing partial text and delta.tool_calls containing partial tool-call arguments. Anthropic streams content_block_delta events with text_delta or input_json_delta types. Gemini streams candidates[0].content.parts incrementally. Each format requires its own parser.

Chunk boundaries also differ. OpenAI chunks are typically 1-3 tokens. Anthropic chunks are similar in size but the event metadata (start, delta, stop events for each content block) creates a different parse shape. Gemini chunks can be larger and less frequent.

For token-by-token UI rendering, the impact is on the streaming parser — your code needs to know the event shape of the active supplier and route each chunk to the appropriate UI handler. The fix is to abstract this in a single streaming-parser module that the rest of your app does not touch.


10) Tax #7 — prompt caching

Prompt caching is the cheapest tax to forget about and the most expensive to pay.

OpenAI's prompt caching is automatic for prefixes ≥1024 tokens that appear in repeated requests within ~5-10 minutes. The cache discount is 50% on input tokens (cached tokens are billed at half rate). No explicit cache_control needed.

Anthropic's prompt caching is explicit — you mark cache_control: {"type": "ephemeral"} on specific blocks. Cache writes cost 25% more than uncached input; cache reads cost 10% of uncached input. Cache TTL is ~5 minutes by default, with a 1-hour option at higher cost.

Gemini's context caching (introduced 2024, expanded 2025) requires explicit cache creation via the API. Cache reads are heavily discounted. Cache management is more involved — you create, reference, and explicitly destroy caches.

PROMPT CACHING ACROSS SUPPLIERS
───────────────────────────────
OpenAI       automatic, 50% read discount, 5-10 min TTL
Anthropic    explicit cache_control, 10% read discount,
             25% write premium, 5 min TTL (1 hr available)
Gemini       explicit cache creation/destruction, large read
             discount, explicit TTL
Open-weight  varies by serving framework; not always
             available

The migration risk — porting a workload that depended on automatic caching on OpenAI to Anthropic without adding cache_control markers, and watching the cost go up because you lost the cache savings without realizing it. The fix is to audit cache hit rates on the new supplier post-migration and add explicit cache markers where the workload's repeated-prefix pattern justifies them.


Mid-content recall

  1. State the seven switching-cost taxes. Which is usually largest?
  2. Why does tokenizer math change the cost calculus of a migration even when the per-token price is identical?
  3. What is the practical difference between OpenAI's strict: true and Anthropic's tool-use validation?

11) The migration playbook — paying the taxes deliberately

A clean supplier migration has five stages. Skipping any of them is how the twelve-point hook scenario happens.

Stage 1 — eval baseline on the current supplier. Lock the eval set, the rubric, the judge model, and the run procedure. Run the baseline three times to establish noise floor. Anything inside the noise is not a regression; anything outside is.

Stage 2 — prompt rewrite for the target supplier. Apply tax #1 (prompt portability) deliberately. Move system content into the right parameter, rewrap context in the target's preferred convention (XML for Anthropic, markdown for OpenAI/Gemini), re-format few-shot examples. Do not just copy-paste the prompt.

Stage 3 — shadow test. Send the same N requests to both suppliers in parallel for a week or two weeks. Compare quality on the eval rubric, cost per call, latency, schema-adherence rate, and downstream success rate. Capture the deltas in a table the team can review.

Stage 4 — partial cutover. Route 5%, then 25%, then 50% of production traffic to the new supplier behind a feature flag. Monitor production metrics on both paths. If the new supplier underperforms on real traffic, the cutover stays partial until the gap closes.

Stage 5 — full cutover with rollback ready. Switch the remaining traffic. Keep the original supplier's credentials, prompts, and routing warm for at least 30 days. The "warm" part matters — chapter 7 covers it in depth.

THE MIGRATION TIMELINE
──────────────────────
Week 1   eval baseline locked
Week 2   prompt rewrite for target; smoke tests
Week 3-4 shadow test in parallel; deltas captured
Week 5   partial cutover (5% → 25%)
Week 6   partial cutover (25% → 50%)
Week 7   partial cutover (50% → 100%)
Week 8+  monitor; original supplier kept warm 30 days

The hook scenario at the top of this chapter compressed all of this into a weekend. The result was predictable — six unpaid taxes, twelve points of quality lost, three days of customer impact, and a rollback. A clean migration takes weeks, not days.


12) Failure modes — migrations that backfire

SIGNAL                                FIX
──────                                ───
"the API is OpenAI-compatible,        → OpenAI-compat is the envelope;
 just change the endpoint"               prompt portability is the content.
                                         Always rewrite the prompt.

shadow test skipped because         → without shadow data, you cannot
 "the new model is better"             tell quality regressions from
                                        prompt-portability bugs

prompt rewritten but eval set         → if the eval set is not locked
 not locked                             before the migration, you cannot
                                        measure the delta

tokenizer count assumed equal        → re-tokenize representative samples
 across vendors                         on the target tokenizer; cost and
                                        context-fit can shift 20%+

retry logic sized for old             → schema-adherence rates differ;
 supplier's failure rate                 resize the retry budget for the
                                        new supplier's actual rate

strict-mode features assumed         → "strict" means different things;
 to be portable                          test the actual schema features
                                        you depend on

streaming code coupled to             → abstract streaming parse in one
 specific event shape                     module; the rest of the app
                                        should not know the shape

prompt caching savings lost           → audit cache hit rates on new
 silently                               supplier; add explicit cache
                                        markers where appropriate

full cutover with no rollback          → keep the original supplier warm
 plan                                    for 30 days; rollback should be
                                        a single env var flip

migration framed as "save money"    → frame as "evaluate alternative
 not "validate quality"                  supplier with quality
                                        non-negotiable"; cost is the
                                        secondary metric

The pattern across these — the supplier-switch is treated as a configuration change when it is actually a system change. Every tax unpaid is a future incident.


Where this lives in the wild

The switching-cost anatomy shows up in tooling, abstractions, and production playbooks across the ecosystem.

  • LiteLLM — unified API across closed and open suppliers; normalizes most of tax #4 (tool-call format) and tax #6 (streaming format).
  • OpenRouter — multi-supplier gateway with per-request supplier selection; useful for A/B and shadow tests.
  • Vercel AI SDK — provider abstraction for tools, streaming, and schemas; absorbs taxes #4 and #6 for most workloads.
  • LangChain, LangGraph — provider abstractions with model-class routing; absorbs taxes #1 (partially), #4, and #6.
  • Helicone — proxy with token counting per supplier; surfaces tax #2 (tokenizer math) in production metrics.
  • Langfuse, LangSmith, Braintrust — eval platforms for running shadow tests with the same eval set across suppliers.
  • Vellum, PromptLayer, Pezzo — prompt management with version history per supplier; structured way to pay tax #1.
  • Anthropic APIsystem parameter, XML-tag conventions, explicit cache_control; the canonical Anthropic prompt-portability surface.
  • OpenAI API — role-based messages, strict: true, automatic prompt caching, function-call envelope.
  • Gemini APIsystemInstruction, propertyOrdering, responseSchema, explicit context caching.
  • Mistral La Plateforme — OpenAI-compat envelope, but prompt-portability quirks (system-message placement, few-shot shape) still apply.
  • DeepSeek API — OpenAI-compat envelope; tokenizer differs from OpenAI's by a few percent.
  • AWS Bedrock Converse API — provider-agnostic envelope that absorbs some tax #4; cross-supplier prompt portability still your job.
  • Azure OpenAI Service — OpenAI envelope on Azure; tax #1 mostly zero, tax #7 (caching) requires Azure-specific configuration.
  • Vertex AI — Google's multi-supplier gateway; per-supplier prompt portability still applies.
  • Together AI, Fireworks AI — OpenAI-compat envelope on open-weight models; tax #1 (prompt portability) lighter because OpenAI conventions often work, but model-specific prompt tuning still pays off.
  • vLLM, TGI — open-weight inference engines with their own guided decoding semantics; tax #5 (strict-mode) varies by engine version.
  • tiktoken — OpenAI's tokenizer library, for tax #2 cost analysis.
  • Anthropic's tokenizer — published via SDK for tax #2 analysis.
  • Hugging Face tokenizers — SentencePiece-based tokenizers for Llama, Mistral, and most open-weight models.
  • BAML, Pydantic AI, instructor, Marvin — structured-output libraries that absorb tax #3 (schema drift) with cross-supplier validation and retry layers.
  • Ollama, llama.cpp server — local open-weight runtimes with their own envelope quirks; full prompt-portability work required when migrating from a hosted closed-weight to local Ollama.
  • Replicate — model-specific endpoints; each model is essentially its own API surface, so tax #4 is high.
  • Hugging Face Inference Endpoints — dedicated open-weight serving with model-specific prompt-portability work per endpoint.

Pause and recall

  1. Name the seven switching-cost taxes and one consequence of skipping each.
  2. Why does Anthropic's system parameter create a portability bug for prompts written for OpenAI?
  3. Roughly how much does tokenizer count vary across vendors for English text?
  4. What is the practical difference between strict mode and post-hoc validation? What does each guarantee?
  5. State the five-stage migration playbook. What is the shadow-test stage protecting you from?
  6. Why is "keep the original supplier warm for 30 days" non-negotiable after a full cutover?
  7. Which observability platforms let you measure tokenizer differences and schema-adherence rates across suppliers in production traces?

Interview Q&A

Q1. Walk me through how you would migrate a production workload from GPT-4o to Sonnet 4.6. A. Five stages over six to eight weeks. One — lock the eval baseline on GPT-4o (eval set, rubric, judge model, three baseline runs for noise floor). Two — rewrite the prompt for Anthropic conventions (system parameter, XML tags for context, prose-formatted few-shot examples). Three — shadow test in parallel for two weeks, capturing quality, cost, latency, schema-adherence, and downstream success rate. Four — partial cutover at 5%, 25%, 50% behind a feature flag with production monitoring. Five — full cutover with the GPT-4o path kept warm for 30 days. Frame it as a system change, not a configuration change. Trap: "Just change the endpoint and the API key." That is how twelve-point regressions happen.

Q2. A teammate says "the API is OpenAI-compatible, so we can swap suppliers for free." What's wrong? A. OpenAI-compatibility is the envelope — the request and response shapes — not the content. The seven switching-cost taxes still apply. Prompt portability is the largest tax and is independent of the envelope. Tokenizer math, schema-adherence rates, strict-mode semantics, streaming chunk patterns, and caching mechanics all differ regardless of whether the envelope says "OpenAI-compatible." The compat layer is useful — it reduces the abstraction work — but it does not make the underlying model interchangeable. Trap: Treating envelope compatibility as supplier compatibility.

Q3. Your shadow test shows the new supplier is 3 points better on quality but 2x more expensive. What do you do? A. Three checks. One — is 3 points materially better for your workload? On a customer-facing high-stakes task, 3 points is usually worth real money; on a routine summarization, it may not be. Two — what is the volume? At low volume, 2x of a small number is still a small number; at high volume, 2x is a major budget decision. Three — can the quality gain be captured selectively? Route the hard cases (the 10-20% that the cheaper supplier misses) to the new supplier and keep the routine cases on the old one — the hybrid pattern. The answer is rarely "all or nothing." Trap: Treating the choice as binary.

Q4. The same prompt yields different token counts on different suppliers. Why does this matter for cost analysis? A. Three reasons. One — billed input is the supplier's tokenizer count, not the source character count. A workload that costs $X/M on OpenAI billed against OpenAI tokens does not cost the same when re-billed on Anthropic at the same per-token rate, because the same content re-tokenizes 10-20% larger on Anthropic. Two — context-window math also uses the supplier's tokenizer. A document that fits at 95K on GPT-4o may overflow Anthropic's 100K budget. Three — cache savings depend on token counts, so cache cost projections need re-tokenized inputs to be accurate. Always re-tokenize a representative sample on the target supplier's tokenizer before quoting a migration cost. Trap: Quoting a migration cost using the old supplier's tokenizer.

Q5. Why is "strict mode" not a portable feature across suppliers? A. Because the underlying mechanism differs. OpenAI's strict: true is constrained decoding — the model emits schema-conformant tokens at decode time, with disallowed schema features. Anthropic does not have the same flag; tool-use validation is post-hoc, with retry on failure. Gemini's responseSchema is closer to constrained decoding but with different feature support and ordering semantics. Open-weight engines vary. "Strict" gives you ~99.5% on OpenAI, ~96% post-hoc on Anthropic, ~95-99% on Gemini, ~85-95% on open-weight depending on engine. The migration mitigation is to add a retry-on-failure layer sized for the new supplier's actual failure rate, not the old supplier's. Trap: Assuming the strict-mode contract is identical across suppliers.

Q6. A team did a shadow test for three days and shipped the migration. What did they miss? A. Three things. One — three days is not enough to capture weekday versus weekend traffic patterns, holiday spikes, or the long tail of unusual queries that emerge over weeks. Two — three days may not include enough rate-limit incidents, partial outages, or edge-case schema failures to know the new supplier's reliability on your workload. Three — three days does not let you measure cache hit rates maturing, which can shift the cost picture significantly. A reasonable shadow test runs two weeks minimum, four weeks for low-volume workloads where you need enough samples per task type. Trap: Treating shadow tests as a sanity check rather than a statistical experiment.

Q7. What does "warm rollback" mean in the context of supplier migration, and why is it non-negotiable? A. After full cutover, you keep the original supplier's credentials, prompts, routing config, and a small fraction of canary traffic alive for 30 days. Warm means you could flip a feature flag and route 100% of traffic back within minutes. Non-negotiable because production incidents reveal regressions that the shadow test missed — schema drift on rare query types, latency spikes during traffic peaks, quality regressions on edge cases. Without warm rollback, your only options when these surface are "wait for the vendor to fix it" or "absorb the customer impact." With warm rollback, you flip back and resume the migration with a fix in place. Trap: Decommissioning the old supplier on cutover day. The 30-day warm period is cheap insurance.

Q8. How would you abstract the seven switching-cost taxes in your codebase to make future migrations easier? A. Six layers. One — prompt management with per-supplier templates (Vellum, PromptLayer, or in-house templating with supplier-specific variants). Two — a tokenizer-aware cost estimator that uses the target supplier's tokenizer (tiktoken for OpenAI, Anthropic SDK for Claude, etc). Three — a structured-output library (BAML, instructor, Pydantic AI) that absorbs schema-drift differences. Four — a tool-call abstraction (LiteLLM, Vercel AI SDK, or a thin in-house wrapper) that normalizes envelope shapes. Five — a strict-mode policy that always adds a retry-on-validation-failure layer regardless of supplier. Six — a streaming-parser module that the rest of the app uses without knowing the supplier-specific event shape. The seventh tax (caching) is workload-specific and harder to abstract — instrument cache hit rates per supplier and tune explicit cache markers as workload patterns emerge. Trap: Trying to abstract everything in one layer. The taxes live at different layers of the stack.


Apply now (5 min)

Step 1 — audit your current prompt. Pick one prompt in your system. List every supplier-specific assumption — system-prompt placement, XML vs markdown conventions, few-shot format, strict-mode flag, expected schema adherence rate, cache markers, tool-call envelope. The list is your tax bill if you ever migrate.

Step 2 — re-tokenize one representative input. Take one real input to your system. Tokenize it on your current supplier's tokenizer. Tokenize it on at least one alternative supplier's tokenizer (Anthropic SDK has one; tiktoken handles OpenAI; Hugging Face tokenizers handle Llama/Mistral). Compute the percentage difference. That is your tax #2 for this workload.

Step 3 — sketch the migration playbook for one alternative supplier. For one candidate alternative (Sonnet 4.6 if you are on GPT-4o; Llama 3.3 70B if you are on closed-weight; etc), write the five-stage plan in pseudocode. Eval baseline, prompt rewrite, shadow test, partial cutover, full cutover with warm rollback. Estimate the calendar time per stage honestly.


Bridge. Even with a clean migration playbook, you do not want to be migrating during an incident. The right time to validate the alternative supplier is before you need it. The next chapter is the dual-sourcing pattern — keeping a second supplier warm so the migration can happen in minutes when the primary supplier fails.

07-dual-sourcing-fallback-chains.md