09. Deployment strategies — change the model without shaking the factory¶

~15 min read. Production trust depends on how you upgrade, not only what you upgrade.

Builds on the ELI5 in 00-eli5.md. The upgrade without downtime, the factory swap where the new machine enters safely, is how teams change models without gambling on all users at once.

A deployment strategy is a risk strategy¶

See.

Teams often ask, "Which deployment strategy is best?" Better question: which failure are you most afraid of?

Are you afraid the new model crashes the service? Are you afraid it stays up but quietly hurts output quality?

Are you afraid it costs far more than expected under real traffic? Different fears call for different deployment strategies.

That is why the upgrade without downtime is a design choice, not a checkbox.

Picture first.

new model ready
      │
      ▼
┌─────────────────────┐
│ choose risk pattern │
└──────┬──────┬───────┘
       │      │
       │      ├── blue-green
       │      ├── canary
       │      ├── shadow
       │      └── percentage rollout
       ▼
observe evidence
       │
       ▼
promote / pause / rollback

Simple, no?

A strong team does not say, "We deployed." It says how it deployed and why.

That language matters because blue-green, canary, shadow, and rollout are not synonyms.

Champion-challenger thinking helps here. The champion is the current production model. The challenger is the new candidate trying to earn trust.

The strategy decides how the challenger meets reality.

Blue-green, canary, shadow, and percentage rollout are different tools¶

Blue-green means you keep two full production environments ready.

Blue serves users now. Green is the new version waiting beside it.

When ready, traffic switches from blue to green in one clear move.

Rollback is crisp because you can switch back quickly.

This is great when you fear deployment breakage and want fast reversal.
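
A minimal sketch of the blue-green switch, assuming traffic is steered by a single pointer. The class name `BlueGreenSwitch` and its methods are hypothetical, chosen for illustration; a real setup would flip a load balancer or DNS target instead of a Python attribute.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenSwitch:
    """Tracks which of two full environments receives all traffic."""

    live: str = "blue"   # environment serving users now
    idle: str = "green"  # new version waiting beside it

    def cut_over(self) -> str:
        # One clear move: the idle environment becomes live, and the old
        # live environment stays ready as the rollback target.
        self.live, self.idle = self.idle, self.live
        return self.live

    def rollback(self) -> str:
        # Rollback is the same swap in reverse, which is why it is fast.
        return self.cut_over()


switch = BlueGreenSwitch()
after_cutover = switch.cut_over()    # green now serves all traffic
after_rollback = switch.rollback()   # blue serves all traffic again
```

The point of the sketch is that there is no partial state: all traffic is on one environment or the other.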

Canary means a small live percentage sees the new version first.

Maybe 1 percent, then 5 percent, then 20 percent, and later 100 percent.

You collect live evidence before full exposure.

This is strong when you fear quality regressions that offline tests may miss.
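
One common way to pick the canary slice is stable hashing of a user id, sketched below. The function names and the 10,000-bucket choice are assumptions for illustration, not a prescribed method. The useful property is that a user who saw the challenger at 1 percent still sees it at 5 percent.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 10_000) -> int:
    """Map a user id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def sees_challenger(user_id: str, canary_percent: float) -> bool:
    """True if this user falls inside the current canary slice."""
    # With 10,000 buckets, 1 percent is buckets 0..99, 5 percent is 0..499.
    threshold = int(canary_percent * 100)
    return bucket_for(user_id) < threshold


users = [f"user-{i}" for i in range(20_000)]
at_1_percent = {u for u in users if sees_challenger(u, 1)}
at_5_percent = {u for u in users if sees_challenger(u, 5)}
```

Because the threshold only grows, `at_1_percent` is always a subset of `at_5_percent`, so raising the percentage never flips an existing canary user back to the champion.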

Shadow means the new model receives production requests but its outputs do not affect users.

It runs beside the champion so you can observe behavior safely.

This is useful when you fear unknown quality shifts or cost surprises.
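
A rough sketch of shadow serving, assuming both models are plain callables. The names `serve_with_shadow`, `champion`, and `challenger`, and the toy ticket classifiers, are hypothetical. The user always receives the champion's answer; the challenger's answer is only logged.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]


def serve_with_shadow(
    request: str,
    champion: Model,
    challenger: Model,
    shadow_log: List[Dict[str, object]],
) -> str:
    """Return the champion's answer and record the challenger's for review."""
    served = champion(request)
    try:
        shadowed = challenger(request)
        shadow_log.append(
            {"request": request, "served": served,
             "shadow": shadowed, "agree": served == shadowed}
        )
    except Exception as exc:
        # A challenger failure is evidence to log, never a user-facing error.
        shadow_log.append(
            {"request": request, "served": served, "error": repr(exc)}
        )
    return served


def champion(text: str) -> str:
    return "billing" if "invoice" in text else "general"


def challenger(text: str) -> str:
    if not text:
        raise ValueError("empty request")
    return "billing" if ("invoice" in text or "refund" in text) else "general"


log: List[Dict[str, object]] = []
requests = ["invoice overdue", "refund please", ""]
answers = [serve_with_shadow(r, champion, challenger, log) for r in requests]
```

In this sketch the challenger runs inline for simplicity. A production setup would usually mirror the request asynchronously so the shadow call cannot add latency to the user's response.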

Percentage rollout is the broader pattern of gradually increasing real user traffic.

Canary is a specific early stage of that idea. The terms overlap in conversation, but they are not identical.

Look at the comparison.

blue-green   = switch whole traffic between full environments
canary       = send a small real slice to challenger first
shadow       = mirror traffic, but do not serve challenger output
rollout      = increase real traffic in controlled percentages over time

Yes?

Language discipline matters because wrong words cause wrong expectations.

If someone says shadow but means canary, the team may think users are safe when they are not.

If someone says rollout but means blue-green, rollback assumptions may be wrong.

Use the strategy that matches the failure you fear¶

Look.

Blue-green is good when you want crisp rollback and clear environment separation.

It is especially useful for infrastructure or packaging changes where breakage may be immediate.

Canary is good when you need live evidence on quality, latency, or cost before wider exposure.

Shadow is good when you want observation without user impact, especially for output comparison.

Percentage rollout is good when traffic scaling itself is the risk and you want controlled expansion.

Here is a tiny example.

Suppose your new support classifier reduces offline error by 3 percentage points.

You still fear unseen enterprise tickets because those are expensive mistakes. A shadow phase can compare outputs silently on live enterprise traffic.

Then a canary can expose 2 percent of real users. Then a percentage rollout can climb through 10, 25, 50, and 100 percent.

That sequence matches the fear: first unknown quality, then controlled live impact.
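
The sequence above can be written down as a staged plan. This is a sketch under assumptions: `STAGES`, `run_plan`, and the `gate` callback are hypothetical names, and the gate stands in for whatever evidence review the team actually uses.

```python
from typing import Callable, List, Tuple

# (stage name, percent of real users who see challenger output)
STAGES: List[Tuple[str, int]] = [
    ("shadow", 0),     # compare silently, no user sees challenger output
    ("canary", 2),     # first real exposure
    ("rollout", 10),
    ("rollout", 25),
    ("rollout", 50),
    ("rollout", 100),
]


def run_plan(gate: Callable[[str, int], bool]) -> List[Tuple[str, int]]:
    """Walk the stages in order and stop at the first one whose gate fails."""
    completed: List[Tuple[str, int]] = []
    for name, percent in STAGES:
        if not gate(name, percent):
            break
        completed.append((name, percent))
    return completed


# Hypothetical gate: the evidence looks fine until the 25 percent stage.
completed = run_plan(lambda name, percent: percent < 25)
```

With that gate the plan stops after the 10 percent stage, so exposure never grows past the last stage that had acceptable evidence.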

Now another example. Suppose the model logic is unchanged, but the serving stack moved to a new GPU image.

Your biggest fear is environment breakage, not model semantics. Blue-green may be the cleanest answer because rollback speed is the key need.

Simple, no?

Choose for the failure mode you fear most, not for the trendiest term. The upgrade without downtime becomes much easier when the fear is named clearly.

Never deploy straight from notebooks¶

This line should sound obvious, but teams still violate it under pressure.

A notebook is for exploration. Production needs versioning, evaluation, registry records, and repeatable deployment paths.

If you deploy straight from a notebook, you break traceability and rollback discipline.

You also invite mystery state, hidden dependencies, and unreviewed feature logic.

So what to do?

Notebook produces insight. Pipeline produces artifacts. Deployment uses approved artifacts only.

That is the safe chain.

The challenger should come from the registry, not from whatever state someone's laptop happens to be in.

The upgrade without downtime depends on that discipline because strategy is useless if the artifact itself is untrustworthy.
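
A small sketch of the "approved artifacts only" rule, assuming the registry is a dictionary keyed by version. `RegistryEntry`, `NotDeployable`, and `resolve_for_deploy` are hypothetical names, and the hash strings are placeholders; a real registry would be a service, not an in-memory dict.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RegistryEntry:
    version: str
    artifact_sha256: str  # hash recorded by the pipeline that built it
    approved: bool        # set by the evaluation and review step


class NotDeployable(Exception):
    """Raised when an artifact has no trustworthy registry record."""


def resolve_for_deploy(
    registry: Dict[str, RegistryEntry], version: str, artifact_sha256: str
) -> RegistryEntry:
    entry = registry.get(version)
    if entry is None:
        raise NotDeployable(f"{version} is not in the registry")
    if not entry.approved:
        raise NotDeployable(f"{version} has not been approved")
    if entry.artifact_sha256 != artifact_sha256:
        raise NotDeployable(f"{version} does not match the registered artifact")
    return entry


registry = {
    "v12": RegistryEntry("v12", "abc123", approved=True),
    "v13": RegistryEntry("v13", "def456", approved=False),
}
entry = resolve_for_deploy(registry, "v12", "abc123")
```

Anything built in a notebook fails the first check because it was never registered, which is exactly the behavior the safe chain asks for.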

One more small numerical example helps.

Suppose a canary starts at 5 percent of traffic.

Champion handles 95,000 daily requests. Challenger handles 5,000 daily requests.

Error rate for the champion stays at 1.8 percent on billing-related tickets.

Error rate for the challenger lands at 4.6 percent on the same billing-related tickets.

That is enough to stop the rollout even if overall averages still look acceptable.

Look at the compact table.

metric                     champion   challenger   action
daily requests (95 / 5)    95,000     5,000        watch
billing error %            1.8        4.6          stop

See how live evidence changes the decision.

This is why champion-challenger framing is useful. The new model must beat reality, not only notebooks.
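
The arithmetic in this example can be checked with a few lines. The 100,000 daily total follows from the 95,000 and 5,000 figures above. The function `slice_decision` and its 1.0-point tolerance are assumptions for illustration; the text does not state a threshold.

```python
DAILY_REQUESTS = 100_000
CANARY_PERCENT = 5

challenger_requests = DAILY_REQUESTS * CANARY_PERCENT // 100
champion_requests = DAILY_REQUESTS - challenger_requests


def slice_decision(
    champion_error_pct: float,
    challenger_error_pct: float,
    max_regression_pts: float = 1.0,  # assumed tolerance, not from the text
) -> str:
    """Return 'stop' when the challenger is worse on this slice by more
    than the allowed number of percentage points, otherwise 'watch'."""
    regression = challenger_error_pct - champion_error_pct
    return "stop" if regression > max_regression_pts else "watch"


billing_action = slice_decision(1.8, 4.6)  # regression of about 2.8 points
```

The decision is made per slice on purpose: a billing regression of about 2.8 points can hide inside an overall average that still looks acceptable.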

Keep the terms clean in team conversations¶

Canary is not shadow.

Shadow is not percentage rollout.

Blue-green is not the same thing as gradual rollout.

If the team mixes these terms casually, incident response becomes sloppy.

Write the traffic path clearly in runbooks and release notes.

State whether challenger output reaches users. State what percentage of traffic is live. State how rollback works.
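
Those three statements fit in a small record that a runbook or release note can carry. This is a sketch: `ReleaseNote` and its field names are hypothetical, and the single validation rule only covers the shadow case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseNote:
    strategy: str                          # blue-green, canary, shadow, rollout
    challenger_output_reaches_users: bool
    live_traffic_percent: int
    rollback: str                          # how to reverse, in plain words

    def __post_init__(self) -> None:
        # Catch the "says shadow but means canary" mix-up at write time.
        if self.strategy == "shadow" and (
            self.challenger_output_reaches_users or self.live_traffic_percent != 0
        ):
            raise ValueError("shadow must not expose challenger output to users")

    def summary(self) -> str:
        reach = "reaches users" if self.challenger_output_reaches_users else "does not reach users"
        return (
            f"{self.strategy}: challenger output {reach}; "
            f"{self.live_traffic_percent}% of traffic live; "
            f"rollback: {self.rollback}"
        )


note = ReleaseNote("canary", True, 5, "set canary percent to 0")
line = note.summary()
```

Writing the record forces the author to answer the three questions before the release, not during the incident.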

That sounds simple because it is simple. Clarity is a systems feature.

Yes?

The best deployment strategy is the one whose behavior every operator can explain under pressure.

That is the standard worth keeping.

Where this lives in the wild¶

Google Search ranking experiments — search quality engineer: uses champion-challenger rollouts because silent quality regressions matter more than green builds.
Stripe risk models — ML platform engineer: prefers controlled canaries and hard rollback rules for fraud systems with asymmetric business risk.
Uber pricing services — reliability engineer: uses blue-green patterns when infrastructure changes could break live request handling quickly.
LinkedIn feed ranking — experimentation engineer: relies on percentage rollouts to expand exposure while monitoring member engagement and fairness slices.
OpenAI or assistant platforms — inference engineer: uses shadow traffic to compare candidate outputs safely before user-visible promotion.

Pause and recall¶

Why is deployment strategy really a risk-management choice?
How do blue-green, canary, shadow, and percentage rollout differ?
When is champion-challenger framing useful?
Why should approved artifacts come from pipelines and registries, not notebooks?

Interview Q&A¶

Q: Why is blue-green deployment attractive for some model releases?

A: It keeps two full environments ready, so traffic can switch cleanly and rollback stays fast when infrastructure or packaging risk is the main fear.

Common wrong answer to avoid: "Because blue-green always gives the best model evaluation." It is primarily a traffic-switching and rollback strategy.

Q: Why is shadow traffic not the same thing as canary traffic?

A: Shadow sends live requests to the challenger without affecting user responses, while canary exposes a real subset of users to challenger outputs. The safety level is very different.

Common wrong answer to avoid: "Both just mean a small rollout." Shadow serves challenger output to zero users; canary serves it to a real subset.

Q: How should teams choose between canary and blue-green?

A: Choose based on the failure mode you fear most. Canary is stronger for gathering live evidence on quality, while blue-green is stronger for crisp rollback when environment breakage is the concern.

Common wrong answer to avoid: "Canary is always more advanced." Different tools solve different risks.

Q: Why is deploying straight from notebooks a bad production habit?

A: It breaks repeatability, traceability, approval flow, and rollback discipline. Production models should come from versioned pipeline artifacts, not ad hoc local state.

Common wrong answer to avoid: "Because notebooks are slow." Speed is not the core problem.

Apply now (5 min)¶

Exercise. Pick one model release scenario and write the main failure you fear most. Then choose blue-green, canary, shadow, or percentage rollout and justify it in two lines.

Next, write one sentence explaining why the other strategies are weaker for that exact fear.

Sketch from memory. Draw the upgrade without downtime tree and label which strategies expose users, which only observe, and which give the crispest rollback.

Bridge. A safe deployment gets the new version live, but live systems still drift and degrade after release. So next we study monitoring and drift detection. → 10-monitoring-drift.md