
03. Release gates

With the change types in hand, the next discipline is the release gates: the preconditions a change must meet to be eligible to ship. There is an eval gate, a feedback gate, and sometimes additional gates per change type. The gates are what distinguish "ready to ship" from "deployed."


A platform engineer at a Mumbai SaaS company has a CI process: a PR is reviewed, tests pass, the change merges, the next deploy ships it. Six months in, she audits the AI changes: three-quarters of them shipped without the eval being run against the change. The CI did not require it; engineers did not voluntarily run it. The unevaluated changes that happened not to regress were lucky; the ones that did regress did so silently in production until customer complaints surfaced. The fix is the eval gate: the CI for prompt and model changes requires the eval to be run, against the regression set, with no regression on any critical stratum, before the change can merge.

This chapter covers the gate discipline: eval and feedback are the preconditions, CI enforces them, and bypass is a documented exception.


The two core gates

| Gate | Required before | Signal |
| --- | --- | --- |
| Eval gate | Merge / first canary step | Score on the regression set holds or improves per stratum |
| Feedback gate | Promotion past each canary step | Feedback profile in canary holds against baseline |

Each gate has thresholds; passing is the precondition for the next step.


The eval gate

Built on the regression eval set (module 01_dataset_golden_set_operations). For a prompt or model change:

  • Run the eval against the change.
  • Compare to the current production baseline.
  • Pass criteria: overall score holds within tolerance; no critical stratum regresses; specific failure-mode coverage holds.

The gate is CI-enforced: a PR that changes a prompt or model alias triggers the eval; if the gate fails, the PR cannot merge.
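The three pass criteria above can be sketched as one check that CI calls after the eval run. This is a minimal sketch, not a definitive implementation: the `EvalResult` shape, the stratum and failure-mode names, and the `tolerance` default are all assumptions made for illustration; real values come from the platform's own regression set and tuning.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    """Scores from one eval run (hypothetical shape, for illustration only)."""
    overall: float                      # aggregate score in [0, 1]
    by_stratum: dict[str, float]        # stratum name -> score
    failure_modes_handled: set[str] = field(default_factory=set)


def eval_gate(baseline, candidate, critical_strata, tolerance=0.01):
    """Return the reasons the gate fails; an empty list means the gate passes."""
    reasons = []

    # Criterion 1: the overall score holds within tolerance.
    if candidate.overall < baseline.overall - tolerance:
        reasons.append(
            f"overall score fell from {baseline.overall:.3f} to {candidate.overall:.3f}"
        )

    # Criterion 2: no critical stratum regresses. A stratum missing from the
    # candidate run is treated as a regression rather than silently skipped.
    for stratum in sorted(critical_strata):
        base_score = baseline.by_stratum[stratum]
        cand_score = candidate.by_stratum.get(stratum)
        if cand_score is None or cand_score < base_score:
            reasons.append(f"critical stratum regressed: {stratum}")

    # Criterion 3: every failure mode the baseline handled is still handled.
    lost = baseline.failure_modes_handled - candidate.failure_modes_handled
    for mode in sorted(lost):
        reasons.append(f"failure mode no longer handled: {mode}")

    return reasons


# Illustrative numbers only: the overall score improves, but a critical stratum drops.
baseline = EvalResult(0.90, {"billing": 0.95, "smalltalk": 0.80}, {"prompt_injection"})
candidate = EvalResult(0.91, {"billing": 0.92, "smalltalk": 0.85}, {"prompt_injection"})
failures = eval_gate(baseline, candidate, critical_strata={"billing"})
```

In a CI job, a non-empty `failures` list would translate into a non-zero exit code so that the merge is blocked. Note that an improved overall score does not rescue the change here: the critical-stratum check fails on its own.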

Two sub-cases:

  • Stratum-level gates. Per chapter 04 of 01_dataset_golden_set_operations, per-stratum scores are gated; a regression on any critical stratum blocks.
  • Specific failure-mode coverage. A change that drops the system's handling of a known failure mode is blocked; the eval set has cases for the mode.

The gate's strictness is tuned: too strict produces false positives that block good changes; too loose produces regressions slipping through. The tuning is per platform; chapter 11 of 01_dataset_golden_set_operations covers the false-positive vs real-regression discipline.


The feedback gate

Built on production feedback (module 02_telemetry_feedback_loops). For each canary step (chapter 04):

  • Monitor the feedback profile of the canary traffic vs the baseline (rest of traffic).
  • Pass criteria: negative-feedback rate, implicit-signal rates, calibration scores all hold within tolerance.

The gate is monitored continuously during canary; if signals degrade, the canary is held or rolled back before promotion.
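One possible way to make "holds within tolerance" concrete for the negative-feedback rate is a two-proportion z-score between canary and baseline traffic. The sketch below is an assumption about how a platform might implement the comparison, not the method this module prescribes; the `hold_at` and `rollback_at` thresholds and the traffic counts are hypothetical.

```python
import math


def negative_rate_z(canary_neg, canary_total, base_neg, base_total):
    """Two-proportion z-score for the negative-feedback rate.

    Positive values mean the canary receives more negative feedback than the
    baseline. Assumes both totals are non-zero.
    """
    p_canary = canary_neg / canary_total
    p_base = base_neg / base_total
    pooled = (canary_neg + base_neg) / (canary_total + base_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / base_total))
    if se == 0:
        return 0.0  # no variation at all (e.g. zero negatives on both sides)
    return (p_canary - p_base) / se


def feedback_gate(z, hold_at=2.0, rollback_at=4.0):
    """Map the z-score to a canary decision (thresholds are assumptions)."""
    if z >= rollback_at:
        return "rollback"
    if z >= hold_at:
        return "hold"
    return "promote"


# Illustrative counts: 6% negative feedback in the canary vs 4% in the baseline.
z = negative_rate_z(canary_neg=60, canary_total=1000, base_neg=760, base_total=19000)
decision = feedback_gate(z)
```

With these counts the z-score is a little above 3, so the canary is held rather than promoted. The same comparison would be repeated for each monitored signal (implicit-signal rates, calibration scores), with the gate passing only if all of them hold.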

Unlike the eval gate, which applies pre-merge, the feedback gate applies during rollout. It is the production signal that the change is actually behaving well.


Additional gates per change type

Some change types have additional gates.

Model changes add a cost gate: the new model's cost per call must be within the platform's budget. A cheaper model is favoured; a more expensive model needs justification.
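A minimal sketch of the cost gate, assuming costs are tracked per call in a single currency unit. The function name, the parameters, and the rule that a written justification is enough to pass a more expensive model are illustrative assumptions.

```python
def cost_gate(new_cost, current_cost, budget, justification=""):
    """Return (passed, reason). All costs are per call, in the same unit."""
    # A model that exceeds the platform budget is blocked outright.
    if new_cost > budget:
        return False, "cost per call exceeds the platform budget"
    # Within budget but more expensive than the current model: needs justification.
    if new_cost > current_cost and not justification.strip():
        return False, "more expensive than the current model; justification required"
    return True, "within budget"


# Illustrative figures only.
unjustified = cost_gate(new_cost=0.004, current_cost=0.003, budget=0.005)
justified = cost_gate(0.004, 0.003, 0.005, justification="needed for long-context accuracy")
cheaper = cost_gate(0.002, 0.003, 0.005)
```

A cheaper model passes without further input; a more expensive one passes only when it stays within budget and carries a recorded justification.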

Agent code changes add integration gates: pact tests against the model gateway, integration tests of tool calls, schema validation.

Eval changes add a calibration gate: if the rubric changed, the judge calibration agreement must hold (chapter 06 of 02_telemetry_feedback_loops).
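As a rough illustration of the calibration gate, the sketch below measures how often the judge agrees with human labels on a calibration set and checks that the agreement has not dropped after a rubric change. Raw percent agreement, the pass/fail label format, and the `tolerance` default are assumptions; a platform may use a different agreement statistic.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of calibration cases where the judge matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def calibration_gate(new_agreement, previous_agreement, tolerance=0.02):
    """Pass if agreement under the new rubric holds within tolerance."""
    return new_agreement >= previous_agreement - tolerance


# Illustrative calibration set of ten cases; the judge disagrees on one.
human = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]
new_agreement = agreement_rate(judge, human)
gate_passes = calibration_gate(new_agreement, previous_agreement=0.9)
```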

Data changes add a schema-drift gate and a content review gate.
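The schema-drift half of this gate can be sketched as a comparison between the expected schema and the schema observed in the changed data. The field names and the representation of a schema as a mapping from field name to type name are hypothetical; the content review gate is a human step and is not modelled here.

```python
def schema_drift(expected, observed):
    """Compare two schemas given as {field_name: type_name} mappings."""
    shared = expected.keys() & observed.keys()
    return {
        "missing": sorted(expected.keys() - observed.keys()),
        "added": sorted(observed.keys() - expected.keys()),
        "retyped": sorted(k for k in shared if expected[k] != observed[k]),
    }


def schema_gate(expected, observed):
    """Pass only when no field is missing, added, or retyped."""
    drift = schema_drift(expected, observed)
    return not any(drift.values()), drift


# Illustrative schemas: one field changes type and one new field appears.
expected = {"doc_id": "str", "body": "str", "updated_at": "datetime"}
observed = {"doc_id": "int", "body": "str", "updated_at": "datetime", "lang": "str"}
passed, drift = schema_gate(expected, observed)
```

Whether an added field should block or only warn is a platform decision; the sketch treats any drift as blocking.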

Each is the type's discipline applied to the gate concept.


Bypass and exceptions

Sometimes a gate must be bypassed: an urgent security fix that cannot wait for eval; a regulatory change that supersedes normal discipline; an experimental rollout to a single tenant for testing.

Bypass is allowed but disciplined:

  • The bypass is documented; the change ticket records the gate skipped and the reason.
  • The bypass is approved by a senior engineer or platform lead.
  • Post-bypass, the eval is run as soon as possible; the result is recorded.
  • A pattern of bypasses triggers a discipline review.

On most platforms, bypasses account for fewer than 5% of releases. A higher rate suggests the gates are too strict or the discipline has decayed.
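The bypass rules and the rate check can be sketched together. The `Release` record, its field names, and the helper functions are hypothetical; the 5% review threshold mirrors the figure above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Release:
    """One shipped change (hypothetical record shape)."""
    change_id: str
    bypassed_gate: Optional[str] = None   # e.g. "eval"; None means no bypass
    bypass_reason: str = ""
    approved_by: str = ""


def undocumented_bypasses(releases):
    """IDs of bypasses missing a recorded reason or an approver."""
    return [
        r.change_id
        for r in releases
        if r.bypassed_gate and not (r.bypass_reason.strip() and r.approved_by.strip())
    ]


def bypass_rate(releases):
    """Fraction of releases that skipped at least one gate."""
    if not releases:
        return 0.0
    return sum(1 for r in releases if r.bypassed_gate) / len(releases)


# Illustrative history: 40 releases, 3 bypasses, one of them without an approver.
releases = [Release(f"chg-{i}") for i in range(37)] + [
    Release("chg-37", "eval", "urgent security fix", "platform-lead"),
    Release("chg-38", "eval", "regulatory deadline", "senior-eng"),
    Release("chg-39", "eval", "single-tenant experiment", ""),
]
rate = bypass_rate(releases)
needs_review = rate > 0.05
missing_docs = undocumented_bypasses(releases)
```

Here the rate is 7.5%, above the 5% mark, so a discipline review is warranted, and `chg-39` is flagged because no approver was recorded.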


What the gates do not guarantee

The gates raise the floor; they do not eliminate risk.

  • An eval gate passes if the change is good on the cases in the eval set; cases not in the set may regress.
  • A feedback gate passes if the canary feedback holds; broader population may respond differently.
  • Both gates can pass and the change can still produce production issues at scale.

The canary discipline (chapter 04) and the rollback discipline (chapter 05) are the defences for what the gates miss.


Common mistakes

Eval gate optional. CI does not require it; engineers skip it; regressions slip through.

Eval gate that always blocks. Too strict; team learns to bypass routinely; gate erodes.

No feedback gate during canary. The canary runs without quality monitoring; promotion happens regardless.

Bypass without documentation. The discipline decays silently.

Gates per type not differentiated. All changes use the prompt change gate; model changes lack the cost gate; data changes lack the schema-drift gate.


Interview Q&A

Q1. What is the eval gate, and why is it CI-enforced?

The eval gate is the precondition that a prompt or model change must pass the regression eval before it can merge. CI-enforced means the CI pipeline runs the eval and blocks the merge if it fails. Without CI enforcement, the gate is voluntary; engineers skip it under time pressure; regressions slip into production. The CI is what makes the gate the discipline.

Wrong-answer notes: "we encourage engineers to run the eval" without CI enforcement produces the chapter-opening pattern.

Q2. Walk through the feedback gate during a canary.

The change is at 5% canary traffic. The monitoring compares the canary's feedback profile (negative-feedback rate, implicit-signal rates, calibration agreement) to the rest of the traffic (the baseline). Pass criteria: each metric within tolerance (typically 2σ). If all metrics pass, the canary is promoted to the next step (25%). If a metric degrades, the canary is held or rolled back. The feedback gate applies during rollout, not pre-merge; it is the production signal of actual user response.

Wrong-answer notes: "promote on schedule" misses the gate's role.

Q3. The team has a 10% bypass rate on eval gates. What does that suggest?

The gate may be too strict (producing false positives); the discipline may have decayed; or genuine emergencies are routine (suggesting upstream problems). Investigate: Are the bypasses justified? Do they follow a pattern (always the same type of change)? Are the false positives the issue covered in chapter 11 of 01_dataset_golden_set_operations? The 10% rate is a signal worth investigating, not just accepting.

Wrong-answer notes: "tighten the bypass policy" without understanding why may misdiagnose.

Q4. The eval gate and the feedback gate both pass. The change ships. Production has issues. What is happening?

The gates measure samples (the eval set; the canary traffic). The broader population may respond differently. The change may exercise cases not in the eval set; the canary may not have been representative of the production diversity. The gates raise the floor; they do not eliminate risk. The defences for what the gates miss are the canary's continued monitoring at higher traffic percentages, the rollback discipline if production degrades, and the post-promotion feedback monitoring.

Wrong-answer notes: "the gates failed" misses that the gates are not omniscient; the discipline includes what to do when production reveals gaps.


What to do differently after reading this

  • CI-enforce the eval gate for prompt and model changes.
  • Monitor the feedback gate during every canary step.
  • Add type-specific gates (cost for model, schema for data, calibration for eval).
  • Track bypass rate; investigate when it exceeds a small percentage.
  • Recognise the gates as raising the floor, not guaranteeing safety.

Bridge. Gates are the precondition. The canary is the rollout discipline that observes per-call effects on subsets before broad release. → 04-canary-rollouts.md