10. Deployment Strategy — Staging, Canary, Rollback, CI/CD for AI Systems

~11 min read. Deployment is where the whole house meets real weather — and real weather is nothing like the controlled lab.

Built on the ELI5 in 00-eli5.md. The move-in day — deployment — is the moment the house becomes real for the people living in it. One broken pipe on move-in day harms real users. Plan it as carefully as you planned the build.


Why AI deployment is different from traditional software deployment

See. Traditional software deployment is deterministic. You test a function. Same input → same output. If it passes, you ship the new version.

AI deployment is probabilistic. Same input → similar but not identical output. What you tested in staging is not identical to what runs in production. The real user distribution is not your test set.

This creates a new challenge: how do you gain confidence in a system that cannot be fully verified before it runs on real traffic?

Answer: progressive exposure. You release to a small slice first. Observe. Then expand.
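One common way to pick the "small slice" is deterministic bucketing: hash each user ID into a fixed bucket and expose only the buckets below a threshold. The sketch below is a minimal illustration under assumptions. The function names, the salt, and the 10,000-bucket resolution are hypothetical, not taken from any specific platform.

```python
import hashlib


def in_rollout(user_id: str, rollout_percent: float, salt: str = "v2-rollout") -> bool:
    """Return True if this user falls inside the exposed slice.

    Hypothetical helper: the salt and the 10,000-bucket resolution are
    illustrative choices, not part of any specific platform.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000      # stable bucket in 0..9999
    return bucket < rollout_percent * 100      # 10% -> buckets 0..999


def choose_version(user_id: str, rollout_percent: float) -> str:
    """Route the exposed slice to the new version, everyone else to stable."""
    return "v2" if in_rollout(user_id, rollout_percent) else "v1"


# Expanding the slice keeps earlier users exposed: a user inside the 5% slice
# is also inside the 20% slice, because only the threshold moves.
exposed = sum(in_rollout(f"user-{i}", 10) for i in range(10_000))
print(f"{exposed / 100:.1f}% of 10,000 simulated users see v2")
```

Hashing instead of random sampling means the same user sees the same version on every request, which keeps the canary comparison clean.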


The four environments

Every production AI system needs four environments.

┌──────────────────────────────────────────────────────────────┐
│  Local Dev                                                   │
│  Developer's machine. Mocked LLM calls. Fast iteration.      │
│  No real data. No real cost.                                 │
├──────────────────────────────────────────────────────────────┤
│  Staging                                                     │
│  Full system. Real LLM calls. Synthetic or anonymised data.  │
│  Runs the full eval suite (the inspection) automatically.    │
├──────────────────────────────────────────────────────────────┤
│  Canary (Gradual Rollout)                                    │
│  Real production data. 5–10% of traffic. Full monitoring.    │
│  Serves real users. Quality compared vs. stable baseline.    │
├──────────────────────────────────────────────────────────────┤
│  Production                                                  │
│  100% of traffic. All monitoring active. Alert thresholds.   │
└──────────────────────────────────────────────────────────────┘

Local → Staging → Canary → Production. Never skip canary for an AI system. One bad prompt version reaching 100% of traffic before you catch it is costly.
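The "never skip" rule can be enforced in tooling instead of by convention. A minimal sketch, assuming a hypothetical `Env` enum and `can_promote` helper (neither comes from a real deployment tool):

```python
from enum import IntEnum


class Env(IntEnum):
    """The four environments, in promotion order (hypothetical names)."""
    LOCAL = 0
    STAGING = 1
    CANARY = 2
    PRODUCTION = 3


def can_promote(current: Env, target: Env) -> bool:
    """Allow a promotion only to the immediately following environment."""
    return target == current + 1


print(can_promote(Env.STAGING, Env.CANARY))      # one step forward
print(can_promote(Env.STAGING, Env.PRODUCTION))  # tries to skip canary
```

A deploy script that calls a check like this before every promotion makes skipping canary a deliberate override, not an accident.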


Canary deployment: the mechanics

A canary releases a new version to a small fraction of traffic. The rest continues on the stable version.

Traffic split:
  Stable version (v1):  90% of requests
  Canary version (v2):  10% of requests

Observation period:     48 hours minimum (catch slow-degrading failures)
Success criteria:
  - Quality score on canary ≥ quality score on stable - 0.02
  - Latency p95 on canary ≤ latency p95 on stable × 1.1
  - Error rate on canary ≤ error rate on stable × 1.2

If all three criteria pass after 48 hours: Promote canary to 100%. Retire stable v1.

If any criterion fails: Rollback immediately. Investigate. Do not promote.
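The three criteria above translate directly into a gate function. This is a sketch under assumptions: the `Metrics` type and function names are hypothetical, and how you compute the quality score is left to your eval setup. The thresholds (0.02, 1.1, 1.2) are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    quality: float      # quality score, higher is better
    latency_p95: float  # p95 latency in seconds
    error_rate: float   # fraction of failed requests


def canary_failures(stable: Metrics, canary: Metrics) -> list[str]:
    """Return the names of the criteria the canary fails (empty list = pass)."""
    failures = []
    if canary.quality < stable.quality - 0.02:
        failures.append("quality")
    if canary.latency_p95 > stable.latency_p95 * 1.1:
        failures.append("latency_p95")
    if canary.error_rate > stable.error_rate * 1.2:
        failures.append("error_rate")
    return failures


def decide(stable: Metrics, canary: Metrics) -> str:
    """Any failed criterion means rollback; otherwise promote."""
    return "rollback" if canary_failures(stable, canary) else "promote"


stable = Metrics(quality=0.90, latency_p95=2.0, error_rate=0.010)
print(decide(stable, Metrics(quality=0.89, latency_p95=2.1, error_rate=0.011)))
print(decide(stable, Metrics(quality=0.85, latency_p95=2.1, error_rate=0.011)))
```

Keeping the thresholds in code that is committed before the canary launches is one way to make sure the criteria are written before the results are seen.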

Look. The numbers in the success criteria matter. Write them before the deployment, not after you see the results. Post-hoc criteria are always met.


Rollback strategy

Rollback must be faster than the problem spreads. For AI systems, rollback means reverting to the previous stable version.

Pre-deployment rollback checklist:

  • Previous stable version is tagged and kept accessible.
  • Rollback is a one-command operation (not a manual process).
  • Rollback does not require re-deploying the database or vector index.
  • The previous prompt version is stored and recoverable.

Deploy new version  →  Monitor for 48 h  →  Problem detected
         │                                          │
         │                                          ▼
         │                              Execute rollback command:
         │                              $ deploy.sh --version=v1-stable
         │                              30-second rollback. ✓
Continue if healthy
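Rollback can be this fast when it is a pointer flip, not a rebuild. The sketch below models that idea with a hypothetical `ReleasePointer` class. It is an illustration of the pattern, not the internals of the `deploy.sh` script shown above.

```python
class ReleasePointer:
    """Tracks the version serving traffic and the last known-good version.

    Hypothetical model: in a real system the "pointer" might be a router
    rule, a config value, or a container tag.
    """

    def __init__(self, stable: str) -> None:
        self.stable = stable   # last version that passed observation
        self.active = stable   # version currently serving traffic

    def deploy(self, version: str) -> None:
        """Start serving a new version; the stable one stays available."""
        self.active = version

    def promote(self) -> None:
        """Mark the active version as the new stable after it proves healthy."""
        self.stable = self.active

    def rollback(self) -> str:
        """Flip traffic back to the stable version. Nothing is rebuilt."""
        self.active = self.stable
        return self.active


pointer = ReleasePointer("v1-stable")
pointer.deploy("v2")
print(pointer.rollback())
```

The design choice is that `deploy` never overwrites `stable`. Only `promote` does, and only after the observation period.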

Never treat rollback as a failure. Rollback is a safety mechanism, like a circuit breaker. Using it correctly is a sign of a mature engineering team.


CI/CD for AI systems

CI/CD (Continuous Integration / Continuous Deployment) works differently for AI.

Traditional CI/CD:

Code commit → Build → Unit tests → Deploy

AI CI/CD:

Code commit → Build → Unit tests → Eval suite (the inspection) → Staging deploy
           → Regression check (compare eval score to baseline) → Canary deploy
           → Quality monitor (48 h observation) → Full production deploy

The key additions:

  • Eval gate: eval score must meet baseline before staging deploy proceeds.
  • Regression check: new version must not regress on the locked baseline score.
  • Quality monitor: automated monitoring after canary launch, not just error monitoring.
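The eval gate and regression check both reduce to comparing an aggregate eval score against a locked baseline. A minimal sketch, assuming per-case scores between 0 and 1 and a simple mean as the aggregate. Both are assumptions, and your eval suite may aggregate differently.

```python
def mean_score(case_scores: list[float]) -> float:
    """Aggregate per-case eval scores into one number (simple mean)."""
    if not case_scores:
        raise ValueError("eval suite produced no scores")
    return sum(case_scores) / len(case_scores)


def passes_eval_gate(case_scores: list[float], baseline: float,
                     tolerance: float = 0.0) -> bool:
    """True if the new version does not regress below the locked baseline.

    `tolerance` is a hypothetical knob for noise in probabilistic evals;
    it defaults to zero, meaning the baseline must be met exactly or beaten.
    """
    return mean_score(case_scores) >= baseline - tolerance


print(passes_eval_gate([1.0, 0.5], baseline=0.70))
print(passes_eval_gate([0.5, 0.5], baseline=0.70))
```

In a CI job, a failed gate would typically end the run with a non-zero exit code so that later pipeline stages never start.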

Worked pipeline timing:

CI run:          10 minutes (build + unit tests)
Eval gate:       25 minutes (150-case eval suite on staging)
Canary launch:   2 minutes
Observation:     48 hours
Full rollout:    2 minutes
Total time:      ~48 hours 39 minutes
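The total is just the sum of the stage durations. A quick check, using the stage times listed above (the dictionary keys are hypothetical labels):

```python
from datetime import timedelta

# Stage durations from the worked pipeline timing above.
stages = {
    "ci_run": timedelta(minutes=10),
    "eval_gate": timedelta(minutes=25),
    "canary_launch": timedelta(minutes=2),
    "observation": timedelta(hours=48),
    "full_rollout": timedelta(minutes=2),
}

total = sum(stages.values(), timedelta())
hours, remainder = divmod(int(total.total_seconds()), 3600)
minutes = remainder // 60
print(f"Total: {hours} h {minutes} min")
```

The observation window dominates. Shaving minutes off CI barely moves the total.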

This feels slow compared to traditional software CI/CD. It is slow. It is appropriate for a probabilistic system serving real users.


Prompt version as a deployment artefact

Prompts are not code. But they must be treated as deployment artefacts.

Deployment manifest for v2:
  Model version:       gpt-4o-mini-2024-07-18
  System prompt:       prompts/v3_support_assistant.txt  (sha: a4f2c1)
  Retrieval config:    configs/retriever_v2.yaml  (top-k: 3, min_score: 0.72)
  Embedding model:     text-embedding-3-small (fixed, do not upgrade without eval)
  Chunk config:        350 tokens, 50-token overlap

Every field in this manifest is pinned. A model upgrade by the provider is a deployment, not a background change. If the provider silently changes the model's behaviour, your eval suite is what catches it, provided it is re-run on a schedule and not only at deploy time.
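Pinning only helps if something checks the pins. The sketch below represents part of the manifest as a frozen dataclass and verifies that the prompt text actually deployed matches the recorded hash. The `Manifest` type, the helper names, and the choice of a 6-character SHA-256 prefix are assumptions. The manifest above does not say which hash produced `a4f2c1`.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Pinned deployment artefacts (hypothetical structure)."""
    model_version: str
    prompt_path: str
    prompt_sha: str       # assumed: first 6 hex chars of the prompt's SHA-256
    embedding_model: str
    top_k: int
    min_score: float


def prompt_digest(prompt_text: str, length: int = 6) -> str:
    """Short content hash of the prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:length]


def prompt_matches(manifest: Manifest, prompt_text: str) -> bool:
    """True if the prompt being deployed is the one the manifest pinned."""
    return prompt_digest(prompt_text) == manifest.prompt_sha


# Illustrative prompt text, not the real contents of the file.
prompt = "You are a support assistant."
manifest = Manifest(
    model_version="gpt-4o-mini-2024-07-18",
    prompt_path="prompts/v3_support_assistant.txt",
    prompt_sha=prompt_digest(prompt),
    embedding_model="text-embedding-3-small",
    top_k=3,
    min_score=0.72,
)
print(prompt_matches(manifest, prompt))
print(prompt_matches(manifest, prompt + " Be brief."))
```

Because the dataclass is frozen, changing any pinned field means creating a new manifest, which is exactly the "every change is a deployment" discipline.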


Where this lives in the wild

  • Duolingo — AI feature rollouts use gradual exposure (5% → 20% → 100%) with quality A/B comparison before full release.
  • Airbnb — ML model deployments go through shadow mode (no production decisions, only logging) before canary.
  • GitHub Copilot — prompt changes are deployed as versioned artefacts; rollback is a one-command operation.
  • Stripe Radar — fraud model canary compares false positive and false negative rates against baseline, not just accuracy.
  • Intercom Fin — every prompt version is stored with its eval score; revert is instant because old prompts are never deleted.

Pause and recall

  1. Name the four deployment environments and their key properties.
  2. What are the three success criteria for promoting a canary to full production?
  3. How long should the canary observation period be at minimum?
  4. What fields belong in a deployment manifest for an AI system?

Interview Q&A

Q: "How would you deploy a new version of an LLM-based feature safely?"

A: I stage the deployment through four environments: local dev, staging with the full eval suite (the inspection), canary at 10% traffic for 48 hours, then full rollout. I define success criteria for the canary before deployment. I ensure rollback is a one-command operation.

Common wrong answer to avoid: "I test it in staging and deploy to production." Staging cannot reproduce the real user distribution. You need canary on real traffic.


Q: "How do you prevent a prompt change from breaking production?"

A: Treat the prompt as a versioned deployment artefact. Every prompt change goes through the CI/CD eval gate — it must meet the baseline score before staging deploy. It then goes through canary. Rollback means reverting to the previous prompt version, which is always preserved.

Common wrong answer to avoid: "I update the prompt directly in the production config." Direct production prompt changes bypass all quality gates.


Q: "What is canary deployment and why is it especially important for AI systems?"

A: Canary deployment exposes a new version to a small fraction of real traffic before full rollout. For AI systems, it is especially important because AI behaviour cannot be fully verified in synthetic environments — the real user distribution always contains surprises. Canary limits blast radius and gives you 48 hours of real-traffic quality data.

Common wrong answer to avoid: "Staging is sufficient for AI systems." Staging uses synthetic or anonymised data, not the real user distribution. Edge cases and distribution shifts only appear with real traffic.


Q: "What does a rollback look like for an AI system?"

A: For the model code: redeploy the previous container version. For the prompt: revert to the previous prompt file (always preserved). For the embedding model: a simple revert is not possible, because an index built with the new embedding model cannot be queried with the old one. That is why you never change the embedding model without rebuilding the entire index. Rollback should complete in under two minutes.

Common wrong answer to avoid: "Delete the new deployment and redeploy the old one from the repository." This can take 30+ minutes. Rollback must be instant. Keep the previous version warm.


Apply now (5 min)

Write a deployment plan for your capstone. Define the four environments and what runs in each. Write three success criteria for your canary promotion. Specify what your deployment manifest contains (model, prompt version, retrieval config).

Sketch from memory: Draw the four-environment stack with arrows showing the deployment flow. Add the AI CI/CD additions (eval gate, regression check, quality monitor) in the correct positions.


Bridge. The house is built and occupied. Now you need to show someone the house. Interviewers, portfolio reviewers, and potential employers will judge the capstone — learn how. → 11-presentation-portfolio.md