04. Canary rollouts
Gates are the precondition. The canary is the rollout discipline that observes per-call effects on subsets of traffic before broad release. The canary catches what the gates miss; the discipline is the steps, the duration, the monitoring at each step.
A platform engineer at a Pune SaaS company canaries a prompt change: 5% on day one, 25% on day three, 100% on day five. At 25%, the team compares the canary's feedback profile against the baseline (the 75% of traffic still on the previous prompt) and finds no regression, so it promotes to 100%. The team keeps monitoring for 48 hours after promotion; no issues appear, and the change is considered mature. The discipline produced a safe rollout. The contrast: a previous platform shipped a similar change directly to 100% and rolled back after an hour with measurable customer impact.
This chapter is the canary discipline. Steps, monitoring, decision points, the timing of progression.
The canary steps
A reasonable default for prompt changes:
| Step | Traffic | Duration | What to monitor |
|---|---|---|---|
| 1 | 5% | 24h | Feedback profile vs baseline; eval-on-canary-traffic |
| 2 | 25% | 24-48h | Same; broader stratum coverage |
| 3 | 50% | 24-48h | Same; near-half-traffic for confidence |
| 4 | 100% | 24-48h post-promotion | Continued monitoring; no automatic rollback |
The percentages and durations are tuned per platform. Smaller percentages catch issues earlier; longer durations confirm stability.
For model changes (higher blast), the steps may be slower: 1% → 5% → 25% → 50% → 100% over weeks.
For data changes, the steps may be by partition (tenant, region) rather than by percentage.
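A minimal sketch of how the default schedule and a percentage split could be represented. The names `CanaryStep`, `PROMPT_SCHEDULE`, `bucket`, and `in_canary` are hypothetical, and hashing the user ID is one assumed way to split traffic, not something this chapter prescribes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryStep:
    traffic_pct: int  # share of traffic served the new version
    min_hours: int    # minimum time to hold at this step


# Default prompt-change schedule from the table above (minimum durations).
PROMPT_SCHEDULE = [
    CanaryStep(5, 24),
    CanaryStep(25, 24),
    CanaryStep(50, 24),
    CanaryStep(100, 24),
]


def bucket(user_id: str, change_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for this change."""
    digest = hashlib.sha256(f"{change_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def in_canary(user_id: str, change_id: str, traffic_pct: int) -> bool:
    """True if this user is served the canary at the given traffic share."""
    return bucket(user_id, change_id) < traffic_pct
```

Because each user's bucket is fixed for a given change, users who saw the canary at 5% stay on it at 25% and 50%, which keeps multi-turn conversations on one version as the step widens.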
What to monitor at each step
At each canary step, compare the canary's metrics to the baseline (the rest of traffic, still on the prior version).
| Metric | What to watch |
|---|---|
| Negative-feedback rate | Should hold within tolerance |
| Implicit-signal rates (abandonment, repeat-ask, etc.) | Should hold |
| Eval-on-canary-traffic score | Production-traffic eval scores; should hold |
| Calibration agreement | Should hold; sudden drift suggests judge interaction |
| Latency p95 | Should hold (a more verbose prompt may slow responses) |
| Cost per call | Should hold (a longer prompt costs more in tokens) |
| Error rate | Should hold |
The comparison is canary-vs-baseline, not canary-vs-historical. The baseline tracks current production conditions.
A regression on any metric is a signal; the team investigates before promoting.
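The canary-vs-baseline comparison can be sketched as a per-metric tolerance check. The metric names and tolerance values below are hypothetical placeholders; real tolerances are tuned per platform, and calibration agreement is left out for brevity.

```python
# Hypothetical relative tolerances (fraction of the baseline value).
TOLERANCES = {
    "negative_feedback_rate": 0.10,
    "abandonment_rate": 0.10,
    "eval_score": 0.02,
    "latency_p95_ms": 0.10,
    "cost_per_call": 0.10,
    "error_rate": 0.10,
}
# Every metric above is "lower is better" except these.
HIGHER_IS_BETTER = {"eval_score"}


def regressions(canary: dict, baseline: dict) -> list[str]:
    """Return the metrics where the canary is worse than baseline beyond tolerance."""
    flagged = []
    for metric, tol in TOLERANCES.items():
        base, cand = baseline[metric], canary[metric]
        # Signed worsening: positive means the canary is worse than baseline.
        worsening = base - cand if metric in HIGHER_IS_BETTER else cand - base
        # Tolerance is relative to the baseline; with a zero baseline,
        # any worsening at all is flagged.
        if worsening > tol * abs(base):
            flagged.append(metric)
    return flagged
```

Both inputs are measured over the same time window, so a platform-wide shift (a traffic spike, an upstream slowdown) moves canary and baseline together and does not show up as a regression.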
How long is each step
The duration must cover:
- Enough traffic to be statistically meaningful (lower traffic means longer step).
- Enough time to catch slow signals (some implicit signals only appear after multi-turn conversations).
- Enough time for cyclical patterns to surface (a daytime-only canary misses evening behaviour).
For high-traffic platforms: 24-48 hours per step. For lower traffic: 48-72 hours per step. The principle is "enough data to be confident."
Too-short canaries miss signals; too-long canaries delay shipping. The tuning is per platform.
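One way to turn "enough data to be confident" into a number is the standard two-proportion sample-size approximation, rounded up to whole days so each step covers a full day/night cycle. This is a sketch under assumptions: the function names, the 5% significance level, and the 80% power are illustrative choices, and the equal-arm formula is conservative whenever the baseline arm is at least as large as the canary.

```python
import math
from statistics import NormalDist


def canary_calls_needed(base_rate: float, detectable_rate: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate canary calls needed to detect base_rate -> detectable_rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = (base_rate * (1 - base_rate)
                + detectable_rate * (1 - detectable_rate))
    effect = detectable_rate - base_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def step_hours(calls_per_hour: float, traffic_fraction: float,
               calls_needed: int, min_hours: int = 24) -> int:
    """Hours to hold a step: enough canary calls, in whole days."""
    raw_hours = calls_needed / (calls_per_hour * traffic_fraction)
    days = max(math.ceil(raw_hours / 24), math.ceil(min_hours / 24))
    return days * 24
```

For a negative-feedback rate moving from 4% to 5%, this comes to roughly 6,700 canary calls. At a 5% step, a platform with 20,000 calls per hour clears that within the first day, while one with 2,000 calls per hour needs about three days, which matches the longer steps suggested for lower traffic.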
What stops promotion
The canary holds (does not promote) when:
- A monitored metric regresses beyond tolerance.
- A new failure pattern appears in the canary's feedback.
- A correlated change in the broader system happens (an incident in a different system; the canary's signal is muddied).
- The team's review identifies a concern not captured by metrics.
The hold is a manual or automated step. Some platforms automate the promotion based on metrics; others require human approval per step.
For high-blast changes (model migrations, irreversible-class changes), human approval is the default.
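The hold conditions can be sketched as a gate that returns its reasons, so an automated promoter and a human approver read the same record. `StepReview` and its fields are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StepReview:
    regressed_metrics: list = field(default_factory=list)
    new_failure_pattern: bool = False
    correlated_incident: bool = False   # incident elsewhere muddies the signal
    reviewer_concern: bool = False      # concern not captured by metrics
    high_blast: bool = False            # e.g. model migration
    human_approved: bool = False


def hold_reasons(review: StepReview) -> list:
    """Return every reason the canary should hold; empty means it may promote."""
    reasons = []
    if review.regressed_metrics:
        reasons.append("metric regression: " + ", ".join(review.regressed_metrics))
    if review.new_failure_pattern:
        reasons.append("new failure pattern in canary feedback")
    if review.correlated_incident:
        reasons.append("correlated change or incident in the broader system")
    if review.reviewer_concern:
        reasons.append("reviewer concern not captured by metrics")
    if review.high_blast and not review.human_approved:
        reasons.append("high-blast change awaiting human approval")
    return reasons


def may_promote(review: StepReview) -> bool:
    return not hold_reasons(review)
```

Returning the full list rather than the first failing condition gives the investigating engineer the whole picture in one place.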
What rolls back
The canary rolls back when:
- A monitored metric degrades beyond a rollback threshold (typically a larger degradation than the hold tolerance, so a canary holds before it rolls back).
- A customer-impact incident is correlated with the canary.
- A senior engineer or platform lead decides the canary is too risky to continue.
The rollback is fast: the canary weight drops to 0 and the change is gone from production. The discipline is in chapter 05.
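A sketch of the continue/hold/rollback decision, assuming the rollback threshold marks a larger degradation than the hold tolerance. The threshold values (5% and 15% relative degradation) and the names `Action`, `decide`, and `weight_after` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    HOLD = "hold"
    ROLLBACK = "rollback"


def decide(degradation: float, hold_at: float = 0.05, rollback_at: float = 0.15,
           customer_incident: bool = False, lead_override: bool = False) -> Action:
    """Classify a canary step from its worst relative metric degradation."""
    if customer_incident or lead_override or degradation >= rollback_at:
        return Action.ROLLBACK
    if degradation >= hold_at:
        return Action.HOLD
    return Action.CONTINUE


def weight_after(action: Action, current_weight: int) -> int:
    """Rollback drops the canary weight to 0; other actions leave it unchanged."""
    return 0 if action is Action.ROLLBACK else current_weight
```

A customer-impact incident or a lead's decision rolls back regardless of what the metrics show, mirroring the second and third conditions in the list above.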
Cohort canaries
Some changes are canaried by cohort rather than by percentage:
- Per tenant. A specific tenant gets the change first; once stable, broader rollout. Useful for tenant-specific changes or for piloting with a willing customer.
- Per segment. The change goes to free-tier users first; if it holds, then premium. The discipline limits blast on the most valuable segment.
- Per region. The change goes to one region first; once stable, broader. Useful for region-specific regulatory considerations.
Cohort canaries are slower but produce sharper signals (the cohort's response is unambiguous, not a sample-of-many).
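Cohort routing can be sketched as a membership check instead of a percentage split. The `Request` fields and the `CohortCanary` class are hypothetical; a real platform would read these attributes from its own request context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    tenant_id: str
    segment: str  # e.g. "free" or "premium"
    region: str


@dataclass(frozen=True)
class CohortCanary:
    tenants: frozenset = frozenset()
    segments: frozenset = frozenset()
    regions: frozenset = frozenset()

    def serves(self, req: Request) -> bool:
        """True if the request belongs to any cohort currently on the canary."""
        return (req.tenant_id in self.tenants
                or req.segment in self.segments
                or req.region in self.regions)


# Pilot with one willing tenant, then widen to the free tier once stable.
pilot = CohortCanary(tenants=frozenset({"tenant-a"}))
widened = CohortCanary(tenants=frozenset({"tenant-a"}),
                       segments=frozenset({"free"}))
```

Widening the rollout means replacing the cohort definition with a larger one, which keeps each stage explicit and easy to audit.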
When to skip canary
A canary is the default; the exception is documented.
- Emergency fixes. Critical security patches, regulatory compliance fixes. Chapter 10 covers emergency discipline.
- Reverting a previously-canaried change. Going back to a known-good version does not need a canary; it is restoring known state.
- Non-user-facing changes. Eval changes (not user-facing); some agent-code changes that do not change behaviour.
Skipping canary requires justification; the bypass discipline from chapter 03 applies.
What canary does not solve
- Long-tail rare cases. A 1-in-10,000 failure mode is unlikely to surface in a 5% canary; the long tail emerges at higher traffic percentages or in production.
- Slow-developing patterns. Some user reactions emerge over weeks; canary durations of days do not catch them.
- Cross-platform correlations. A change that interacts with another system's change; canary in isolation misses the interaction.
The defences: post-promotion monitoring (monitoring continues after the canary ends); rollback if production reveals issues; postmortem for what the canary missed.
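The long-tail point can be made concrete with a short calculation. If a failure occurs independently on each call with probability p, the chance of seeing it at least once in n calls is 1 - (1 - p)^n. The traffic figure of 100,000 calls per day is a hypothetical example.

```python
def p_at_least_one(failure_rate: float, calls: int) -> float:
    """Probability that a per-call failure appears at least once in `calls` calls."""
    return 1 - (1 - failure_rate) ** calls


DAILY_CALLS = 100_000                 # hypothetical platform traffic
canary_calls = DAILY_CALLS * 5 // 100  # calls reaching a 5% canary in one day

p_canary = p_at_least_one(1 / 10_000, canary_calls)
p_full = p_at_least_one(1 / 10_000, DAILY_CALLS)
```

Under these assumptions a one-day 5% canary has about a 39% chance of hitting the 1-in-10,000 case even once, while a day of full traffic is almost certain to hit it. A single occurrence is also unlikely to register as a pattern, which is why post-promotion monitoring carries the long tail.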
Common mistakes
Skipping canary. All-at-once rollout; the chapter-opening contrast.
Canary too short. Insufficient data; missed signals; broader rollout proceeds with risk.
Canary with no monitoring. Traffic shifts; no comparison; promotion happens regardless.
Automatic canary promotion without human review. For high-blast changes, the automation may promote based on metrics that look fine while a human reviewer would have flagged a concern.
No cohort-based canary for tenant-specific changes. Generic percentage canary misses tenant-specific issues.
Interview Q&A
Q1. Walk through a canary for a prompt change. The change is at 5% canary on day one. The canary's feedback profile is compared to the baseline (95% on the prior prompt). At 24 hours, no regression; promote to 25%. At day three, monitor; no regression; promote to 50%. At day four, monitor; no regression; promote to 100%. Post-promotion monitoring for 24-48 hours. Throughout, any regression on negative feedback, implicit signals, eval-on-canary, latency, or cost triggers hold or rollback. The discipline produces safe rollout; the contrast is all-at-once rollouts that fail catastrophically. Wrong-answer notes: missing the per-step comparison or the rollback path.
Q2. The canary at 25% shows the negative-feedback rate is 1.5σ above baseline. What do you do? Investigate before promoting. 1.5σ is a hold signal: not a clear regression, but a concerning trend. Pull recent negative-feedback cases on the canary; read them; understand what the change is doing differently. If the cases reveal a real issue, roll back or refine; if the cases look fine and the signal is noise, extend the canary step duration to gather more data. The 1.5σ is between "noise" and "regression"; the case-level investigation distinguishes the two. Wrong-answer notes: "promote anyway, it's below 2σ" misses the warning signal; "rollback immediately" may be premature.
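For Q2, one common way to express "σ above baseline" is a two-proportion z-score on the negative-feedback counts. This is an assumed reading of the figure, not the only one, and the counts below are made up for illustration.

```python
import math


def feedback_z(canary_neg: int, canary_total: int,
               base_neg: int, base_total: int) -> float:
    """Two-proportion z-score: how far the canary's rate sits above baseline."""
    p_canary = canary_neg / canary_total
    p_base = base_neg / base_total
    pooled = (canary_neg + base_neg) / (canary_total + base_total)
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / canary_total + 1 / base_total))
    return (p_canary - p_base) / std_err


# Hypothetical counts at the 25% step: 118 of 2,500 canary calls drew
# negative feedback (4.72%) against 300 of 7,500 baseline calls (4.0%).
z = feedback_z(118, 2500, 300, 7500)
```

With these counts the score lands a little above 1.5, in the band where reading the individual cases is more informative than the number itself.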
Q3. The team's canary is at 50%; metrics look fine for two days. Should the team promote to 100%? Yes, assuming the 50% step's duration covered cyclical patterns (day/night, weekday/weekend) and the metrics held. Promote to 100%. Continue monitoring post-promotion for 24-48 hours. The post-promotion monitoring catches issues that emerge only at full traffic (cross-tenant interactions, rare cases). The promotion is not the end; the rollback path stays in place. Wrong-answer notes: "leave at 50% indefinitely" stalls the rollout without reason; "promote immediately on positive metrics" without considering coverage misses cycles.
Q4. The platform's changes go to a single tenant first as a canary. What is the discipline? Cohort canary. The tenant is the canary cohort; the rest of the platform remains on the prior version. The tenant's feedback is the signal; if positive, the change rolls out to more tenants. Cohort canaries produce sharper signals than percentage canaries (the cohort's response is unambiguous). They are slower; appropriate for changes where the cohort's reaction is the primary risk (tenant-specific changes, pilot programmes). Wrong-answer notes: "percentage canary is always better" misses when cohort canary is the right fit.
What to do differently after reading this
- Canary every AI change by default; document exceptions.
- Steps: 5% → 25% → 50% → 100% with 24-48h per step for prompt changes; slower for model changes.
- Monitor canary-vs-baseline on multiple metrics at each step.
- Hold or roll back on regression; investigate before promoting.
- Continue monitoring post-100% promotion.
Bridge. Canary is the gradual rollout. Rollback is the discipline when the canary or production shows the change is wrong. The next chapter is the rollback that must be tested, fast, and rehearsed. → 05-rollback-discipline.md