04. Review gates — who can edit a prompt, and what must happen first

~14 min read. A prompt change is a production change. Production changes need reviewers and a green CI run. Prompts deserve the same gate as code, and for the same reason — silent breakage is what kills you.

Builds on 03-versioning-and-rollback.md. The recipe book now has SHAs and rollbacks. Today we decide who is allowed to add a new SHA, and what must happen before that SHA reaches a customer's plate.

1) Hook — the 11pm Tuesday paste

A PM has been hearing complaints. The bot "feels too corporate" with one large customer. She opens Langfuse, finds the customer-success greeter prompt, edits the second paragraph to soften the tone, and clicks Save. The change is live. It is 11:04pm.

By 11:30pm, the bot is greeting every customer — not just the one — with the softer tone. By 8am, support has 14 complaints. Three customers say the bot "feels unprofessional" and "doesn't take our enterprise relationship seriously." One customer threatens to cancel.

The PM did not do anything malicious. She thought she was making a targeted fix for one customer. She did not know that this prompt was global. She did not know that there was no review. She did not know that there was no eval before deploy. She had edit access because somebody, six months ago, decided that "prompts should be editable in the dashboard for quick fixes." Now that decision is the root cause of a Wednesday morning customer crisis.

A review gate would have caught this in three places: at edit time (the PM cannot promote directly to production), at eval time (a multi-tenant test would have flagged the global change), and at review time (a reviewer would have asked whether the change was meant for one customer or for all of them).

This chapter is what those three gates look like.

2) The metaphor — the head baker and the chalkboard

In the recipe book model, anyone can propose a new recipe. Anyone can write a draft on the chalkboard at the back of the bakery. But before that recipe gets transcribed into the recipe book and onto the kitchen line, a head baker reads it. The head baker checks four things — what changed, why, whether the taste test passed, and whether the change targets the right customers.

Only after the head baker's signature does the recipe move from the chalkboard to the line. Anyone can draft. Only the head baker can ship.

The review gate is the head baker. It is not paperwork. It is the difference between a bakery that runs and a bakery that gets shut down by the health inspector.

3) The anatomy — three gates, three signatures

┌─────────────────────────────────────────────────────────┐
│ PROMPT CHANGE LIFECYCLE                                 │
├─────────────────────────────────────────────────────────┤
│ 1. DRAFT      author writes new version → SHA assigned  │
│ 2. AUTO-CHECK lint, eval, drift check (CI)              │
│ 3. REVIEW     human reviewer reads diff and intent      │
│ 4. APPROVE    reviewer signs; version becomes deployable│
│ 5. DEPLOY     deployer promotes to a ramp stage         │
│ 6. AUDIT      every step logged with who/when/why       │
└─────────────────────────────────────────────────────────┘

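Rows 2 through 5 of this diagram are the gates. Each gate has an owner. Each gate produces an artifact: a CI status, a reviewer's signature, a deployment record. If any gate fails, the change does not move. Rows 1 and 6 bracket the gates: the draft creates the SHA, and the audit log records every step.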

The single most common mistake is collapsing gates into one. "Edit and save" is one button that performs steps 1-5 in one click. That is the 11pm-Tuesday button. Senior teams never have that button.
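One way to make "no single button" concrete is to model the lifecycle as a small state machine in which every transition moves exactly one stage and must carry its own artifact. The sketch below is illustrative only; `PromptChange`, `advance`, `GateError`, and the stage names are hypothetical and do not come from any particular registry.

```python
from dataclasses import dataclass, field

# Stages mirror steps 1-5 of the lifecycle diagram; step 6 (audit) is `history`.
STAGES = ["draft", "auto_checked", "reviewed", "approved", "deployed"]


class GateError(Exception):
    """Raised when a change tries to skip a gate or lacks its artifact."""


@dataclass
class PromptChange:
    sha: str
    stage: str = "draft"
    history: list = field(default_factory=list)  # (stage, actor, artifact) tuples

    def advance(self, to_stage: str, actor: str, artifact: str) -> None:
        """Move exactly one stage forward, recording who did it and the evidence."""
        current = STAGES.index(self.stage)
        if to_stage not in STAGES or STAGES.index(to_stage) != current + 1:
            raise GateError(f"cannot go from {self.stage!r} to {to_stage!r}")
        if not artifact:
            raise GateError(f"stage {to_stage!r} requires an artifact")
        self.stage = to_stage
        self.history.append((to_stage, actor, artifact))


change = PromptChange(sha="abc1234")  # placeholder SHA
change.advance("auto_checked", actor="ci", artifact="ci-run: all checks green")
try:
    # The "edit and save" button: jumping straight to production.
    change.advance("deployed", actor="pm", artifact="clicked save")
except GateError as err:
    print(err)  # cannot go from 'auto_checked' to 'deployed'
```

Because `advance` only accepts the next stage, there is no code path that performs several steps in one call, which is the property the diagram is asking for.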

4) Role-based edit permissions

Three roles, three powers.

ROLE            CAN DRAFT?    CAN APPROVE?    CAN DEPLOY?
──────────────────────────────────────────────────────────
prompt_author      yes            no              no
prompt_reviewer    yes            yes             no
prompt_deployer    yes            yes             yes

The author role belongs to anyone who needs to propose changes — engineers, PMs, customer-success leads, support agents. The reviewer role belongs to people who understand the prompt's downstream effects — usually a senior engineer or applied AI lead. The deployer role belongs to a smaller set who control the rollout — usually a tech lead or on-call engineer.

A common mistake is collapsing reviewer and deployer into one role. The separation matters because the reviewer asks "is this change correct?" and the deployer asks "is now the right time to ramp this?" Two different questions, two different defenders.

A second common mistake is making the author role too narrow. If only engineers can draft, PMs and support route every change through an engineer, who becomes a bottleneck. The right setting is broad authoring with strict review.
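The role table translates almost directly into a lookup. The following is a minimal sketch, assuming roles are plain strings and user IDs are available at approval time; `ROLE_POWERS`, `can`, and `can_approve` are hypothetical names, and the no-self-approval rule is an added safeguard rather than something the table itself states.

```python
from enum import Enum


class Action(Enum):
    DRAFT = "draft"
    APPROVE = "approve"
    DEPLOY = "deploy"


# Mirrors the role table: each role holds every power of the role above it.
ROLE_POWERS = {
    "prompt_author": {Action.DRAFT},
    "prompt_reviewer": {Action.DRAFT, Action.APPROVE},
    "prompt_deployer": {Action.DRAFT, Action.APPROVE, Action.DEPLOY},
}


def can(role: str, action: Action) -> bool:
    """Return True if the role holds the power; unknown roles get nothing."""
    return action in ROLE_POWERS.get(role, set())


def can_approve(role: str, approver_id: str, author_id: str) -> bool:
    """Approval needs the power and a second person: no self-approval."""
    return can(role, Action.APPROVE) and approver_id != author_id


print(can("prompt_author", Action.DEPLOY))            # False
print(can_approve("prompt_reviewer", "u-2", "u-1"))   # True
print(can_approve("prompt_reviewer", "u-1", "u-1"))   # False
```

Defaulting unknown roles to an empty set keeps the check fail-closed: a typo in a role name removes access instead of granting it.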

5) Automated checks before review

Before a human reviewer reads the diff, the change runs a battery of automated checks. The reviewer should never burn time on something a CI step could catch.

AUTO-CHECK BATTERY
──────────────────
1. LINT       length within bounds, no banned tokens, valid template variables
2. EVAL       golden eval suite passes at or above the prior SHA's score
3. DRIFT      output distribution comparison against the prior SHA
4. SCHEMA     downstream parser still validates outputs
5. PII        no PII or secrets in the prompt text
6. COST       estimated cost-per-call within budget

Each check is a gate of its own. A change that fails any check goes back to the author. Only changes with no failures reach the human reviewer, and any warnings travel with them so the reviewer sees them.
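The routing rule for the battery can be sketched in a few lines. This is a toy illustration under stated assumptions: `CheckResult`, `run_battery`, and `route` are hypothetical names, the two checks shown are crude stand-ins (the `"sk-"` test is not a real secret scanner), and a real battery would also cover eval, drift, schema, and cost.

```python
from dataclasses import dataclass
from typing import Callable

PASS, WARN, FAIL = "pass", "warn", "fail"


@dataclass
class CheckResult:
    name: str
    status: str  # one of PASS, WARN, FAIL
    detail: str = ""


def run_battery(prompt: str, checks: list[Callable[[str], CheckResult]]) -> list[CheckResult]:
    """Run every check so the author sees all problems in one pass."""
    return [check(prompt) for check in checks]


def route(results: list[CheckResult]) -> str:
    """Any FAIL sends the change back; warnings still reach the reviewer."""
    if any(r.status == FAIL for r in results):
        return "back_to_author"
    return "to_reviewer"


def lint_length(prompt: str, limit: int = 500) -> CheckResult:
    """Length-bound lint; the 500-character limit is an assumed default."""
    status = PASS if len(prompt) <= limit else FAIL
    return CheckResult("lint", status, f"{len(prompt)} chars (limit {limit})")


def no_secrets(prompt: str) -> CheckResult:
    """Crude stand-in for a secret scanner: flags an API-key-like prefix."""
    leaked = "sk-" in prompt
    return CheckResult("pii", FAIL if leaked else PASS)


results = run_battery("Greet the user by name.", [lint_length, no_secrets])
print(route(results))  # to_reviewer
```

Running all checks instead of stopping at the first failure is a deliberate choice here: the author gets one complete report per attempt.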

The eval check is the heaviest — typically 5-15 minutes to run a 50-200 example suite. It belongs in CI, not in the dashboard, because dashboard edits invite the temptation to skip evals.

6) Worked example — a prompt PR

A senior engineer opens a PR to update the customer-support greeter from SHA a8c3f9... to SHA b1d7e4.... The diff:

  Greet the user by name and acknowledge their request.
- Keep responses warm but professional.
+ Keep responses warm, conversational, and concise.
+ For enterprise customers, prefer "Hi [name]" over "Hello [name]".

Automated checks run:

LINT       ✓ length 142 → 178 chars (under 500 limit)
EVAL       ✓ 96/100 → 97/100 on golden set (+1, within noise)
DRIFT      ⚠ avg response length: 82 → 71 tokens (-13%)
SCHEMA     ✓ downstream parser validates 100/100 outputs
PII        ✓ none detected
COST       ✓ -2% estimated cost-per-call

The drift check flagged a 13% length decrease — not a failure, but worth a human reviewer's attention. The reviewer reads the diff, sees the "concise" addition, recognizes the cause, and asks the author: "Is this length drop intentional? Customers in segment X have asked for more detail in the past."
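The arithmetic behind the drift line is a relative change: (71 - 82) / 82 ≈ -13%. A minimal sketch of how such a check might classify the result follows; the 10% warn and 30% fail thresholds are illustrative assumptions, not values from the example above, and `relative_change` and `drift_status` are hypothetical names.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from the prior SHA's value to the new one."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


def drift_status(before: float, after: float,
                 warn_at: float = 0.10, fail_at: float = 0.30) -> str:
    """Classify drift by magnitude; both thresholds are assumed defaults."""
    magnitude = abs(relative_change(before, after))
    if magnitude >= fail_at:
        return "fail"
    if magnitude >= warn_at:
        return "warn"
    return "pass"


change = relative_change(82, 71)  # average response length in tokens
print(f"{change:+.0%}", drift_status(82, 71))  # -13% warn
```

Using the magnitude means a sharp increase in length is flagged the same way as a sharp decrease; a team that only cares about one direction could compare the signed value instead.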

This conversation happens in the PR. The author either justifies the change with data, or restores some verbosity and updates the prompt. The PR cycles. When approved, it merges. The deployer schedules a ramp.

That workflow — automated checks, human review with diff visibility, approval, ramped deployment — is what a review gate gives you. The 11pm-Tuesday paste does not survive any step of it.

Mid-content recall

What three roles separate authoring from approving from deploying?
Which automated check usually takes the longest, and why does it belong in CI rather than in a dashboard?
Why would a drift check produce a warning even when the eval check passes?

7) Non-engineer prompt editing — the dashboard with teeth

PMs, customer-success folks, support engineers, and AI-product designers all have legitimate reasons to edit prompts. They know the user's voice. They hear complaints first. They sometimes know the right phrasing better than the engineer who wrote the prompt last quarter.

A mature setup gives non-engineers a UI that runs the same review gate as a PR. The flow looks like this:

┌──────────────────────────────────────────────────────────┐
│ PM opens prompt in dashboard                             │
│ ↓                                                        │
│ edits text, clicks "Propose change"                      │
│ ↓                                                        │
│ system creates draft version → SHA assigned              │
│ ↓                                                        │
│ auto-checks run (eval, lint, drift)                      │
│ ↓                                                        │
│ status posted to PM and to reviewer Slack                │
│ ↓                                                        │
│ reviewer approves in dashboard or rejects with comment   │
│ ↓                                                        │
│ deployer schedules ramp                                  │
└──────────────────────────────────────────────────────────┘

The dashboard is not a shortcut around the gate. It is a different surface for the same gate. The PM proposes. The reviewer reviews. The deployer ramps. The 11pm paste does not exist because there is no button that does all three.

Langfuse, Vellum, Pezzo, and Braintrust all support this kind of dashboard-with-teeth flow. Some teams build their own on top of GitHub PR review — the dashboard creates a PR, the eval check is a CI job, the reviewer approves on GitHub. Same gate, different surfaces.
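What "Propose change" does behind the button can be sketched as creating an immutable draft and nothing more. The example below assumes, purely for illustration, that a version's SHA is a truncated SHA-256 of the prompt text; the vendors named above may assign version identifiers differently. `Draft` and `propose_change` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Draft:
    prompt_name: str
    sha: str
    text: str
    author: str
    status: str = "pending_checks"  # never "live": proposing cannot deploy


def propose_change(prompt_name: str, new_text: str, author: str) -> Draft:
    """Create an immutable draft whose SHA is derived from its content."""
    digest = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    return Draft(prompt_name=prompt_name, sha=digest[:12], text=new_text, author=author)


draft = propose_change("greeter", "Keep responses warm and concise.", author="pm-1")
print(draft.sha, draft.status)
```

The function returns a draft and has no side effect on production, so the dashboard button built on it cannot be the 11pm-Tuesday button no matter who clicks it.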

8) What a reviewer actually checks

A good reviewer spends time on six things, not on the prompt's prose.

REVIEWER'S CHECKLIST
────────────────────
1. INTENT     does the diff match the stated intent in the PR description?
2. SCOPE      does this change affect tenants beyond the stated audience?
3. EVAL       did the eval pass on the actual cases the change targets?
4. DRIFT      do any drift signals deserve a follow-up conversation?
5. ROLLBACK   is the prior SHA known-good and easy to revert to?
6. TIMING     is this the right ramp window — quiet day, business hours, oncall available?

The intent check is the most underrated. PRs that "fix tone" sometimes also restructure few-shot examples or add new constraints. The reviewer's job is to catch "this PR does more than its description says" before the change ships.

The scope check is what would have caught the 11pm paste. A change to a global prompt that the author thought was tenant-specific should be flagged at review.
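Part of the scope check can be automated so the reviewer starts with the mismatch already on screen. The sketch below assumes the system knows which tenants a prompt serves and that the change description names its intended audience; `scope_warnings` and the tenant names are hypothetical.

```python
def scope_warnings(prompt_tenants: set[str], stated_audience: set[str]) -> list[str]:
    """List tenants the prompt serves that the author did not mention.

    prompt_tenants: every tenant currently served by this prompt.
    stated_audience: tenants named in the change description.
    """
    unintended = sorted(prompt_tenants - stated_audience)
    return [f"change also affects tenant {t!r}" for t in unintended]


# The 11pm paste: a global prompt edited with one customer in mind.
warnings = scope_warnings(
    prompt_tenants={"acme", "globex", "initech"},
    stated_audience={"acme"},
)
for w in warnings:
    print(w)
# change also affects tenant 'globex'
# change also affects tenant 'initech'
```

An empty list means the stated audience covers everyone the prompt reaches; any other result is a question for the author, not an automatic rejection.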

9) Failure modes

Signal	Likely cause	Fix
Reviewers rubber-stamp every PR	Review fatigue, ratio of reviewers to authors too low	Add reviewers, rotate, or invest in better auto-checks
Eval check skipped because "it's an emergency"	No incident playbook with explicit fast-path	Add emergency rollback (no eval needed) + fast-path forward-fix (lighter eval)
Non-engineer editors find the gate annoying and ask to bypass	Dashboard does not surface auto-check results clearly	Show auto-check results inline; make the gate feel collaborative
Author can self-approve their own change	Misconfigured permissions	Require approval from a different user; CODEOWNERS-style enforcement
Drift warnings ignored	Reviewers do not understand what drift means	Train reviewers; link drift signal to a concrete past incident
Audit log gaps	Deployment happens via API without going through review surface	Block direct API deploys; force the gate
Reviews take days, slowing iteration	Too few reviewers, no async expectations	Add reviewers; SLA on PR turnaround
Reviewer approves without reading	Diff too large or noisy	Cap diff size; require PR description to explain intent

The pattern in all eight rows is the same — the gate degrades when the gate feels heavier than the change. Senior teams invest in making the gate feel lightweight while keeping its protections strong.
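The emergency fix in the table (a rollback path plus a lighter forward-fix path) works best when it is written down as policy before the incident. A minimal sketch, assuming three named paths; the exact gate lists, the `eval_smoke` name, and `gates_for` are illustrative assumptions.

```python
# Required gates per path. "standard" is the full lifecycle; the two emergency
# paths follow the failure-mode table. The exact gate lists are assumptions.
REQUIRED_GATES = {
    "standard": ["lint", "eval", "drift", "schema", "pii", "cost", "review"],
    "forward_fix": ["lint", "eval_smoke", "schema", "pii", "review"],
    "rollback": [],  # the target must already have shipped; checked below
}


def gates_for(path: str, target_sha: str, previously_deployed: set[str]) -> list[str]:
    """Return the gates a change must clear on the chosen path."""
    if path not in REQUIRED_GATES:
        raise ValueError(f"unknown path {path!r}")
    if path == "rollback" and target_sha not in previously_deployed:
        # A "rollback" to something that never shipped is a forward-fix in disguise.
        raise ValueError("rollback target was never deployed; use forward_fix")
    return list(REQUIRED_GATES[path])


print(gates_for("rollback", "sha-old", previously_deployed={"sha-old"}))  # []
print(gates_for("forward_fix", "sha-new", previously_deployed={"sha-old"}))
```

The rollback path skips checks only because the target has shipped before; the guard stops someone from labelling a brand-new version as a rollback to avoid review.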

10) The audit trail — what to log

Every change produces a row in an audit log. The row has:

Prompt name
Previous SHA, new SHA
Author (user ID)
Reviewer (user ID), approval timestamp
Deployer (user ID), deployment timestamp
Auto-check results (eval score, drift score, lint pass/fail)
Intent (the PR description)
Ramp stage at each step

The audit log is not regulatory paperwork. It is a debugging tool. Six months from now, when a customer asks "why does our bot sound different than last quarter," the audit log answers in 30 seconds.

For SOC2, HIPAA, or EU AI Act compliance, the audit log is also the regulatory artifact. Treat it as both.
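The fields listed above map onto a single record type. The following is a sketch only, assuming one JSON line per change in an append-only log; `AuditRecord`, its field names, and all sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditRecord:
    prompt_name: str
    previous_sha: str
    new_sha: str
    author_id: str
    reviewer_id: str
    approved_at: str     # ISO 8601 timestamp
    deployer_id: str
    deployed_at: str     # ISO 8601 timestamp
    check_results: dict  # e.g. {"eval": 0.97, "drift": -0.13, "lint": "pass"}
    intent: str          # the PR description
    ramp_history: list = field(default_factory=list)  # ramp stage at each step

    def to_json_line(self) -> str:
        """Serialize as one JSON line so the log stays append-only and filterable."""
        return json.dumps(asdict(self), sort_keys=True)


record = AuditRecord(
    prompt_name="greeter",
    previous_sha="old-sha-placeholder",
    new_sha="new-sha-placeholder",
    author_id="u-author",
    reviewer_id="u-reviewer",
    approved_at="2024-01-01T10:00:00Z",
    deployer_id="u-deployer",
    deployed_at="2024-01-01T14:00:00Z",
    check_results={"eval": 0.97, "drift": -0.13, "lint": "pass"},
    intent="Soften tone; keep responses concise.",
    ramp_history=["5%", "25%"],
)
print(record.to_json_line())
```

Keeping author, reviewer, and deployer as three separate fields makes a self-approved change visible in the log as two identical IDs.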

Where this lives in the wild

Langfuse — prompt management with approval flows, audit log, role-based permissions.
Vellum — UI-first prompt review, eval gates baked in.
Pezzo — open-source registry + review.
Braintrust — eval-first review gate with PR-style flow.
PromptLayer — registry + review surface.
LangSmith — prompt management within LangChain ecosystem, review flows.
Helicone — observability with prompt management, lighter review.
OpenAI Playground (stored prompts) — versioned prompts, evolving review surface.
Anthropic Workbench — prompt versioning, integration points for review.
GitHub PR review — most common DIY backing for prompt review when stored in repo.
GitLab Merge Requests — equivalent for GitLab-hosted repos.
CODEOWNERS — enforces reviewer requirements per file path.
Phabricator (Differential) — historical reviewer flow, still used at some scale companies.
ReviewBoard — alternative review tooling.
dbt — analytics tooling whose review model (PR + CI + eval) is widely copied by AI teams.
LaunchDarkly — feature flag system used to ramp prompts after review.
Statsig — same, with stronger experimentation focus.
Split.io — feature flag system with audit log compliance.
Flagsmith — open-source feature flag with review workflows.
ConfigCat — feature flag system focused on simple deploy gates.
Optimizely — experimentation platform, often used for prompt A/B.
HashiCorp Vault — for prompt secrets and credentials.
AWS Parameter Store / Secrets Manager — for runtime prompt config with audit.
Doppler — secrets management with audit trails.
Auth0 / Okta — RBAC backing for role-based prompt permissions.

Pause and recall

What three roles separate prompt change responsibilities?
What six items belong on a reviewer's checklist?
Why does collapsing reviewer and deployer into one role weaken the gate?
How does a "dashboard with teeth" differ from a "dashboard as shortcut"?
What does the audit log buy you six months later?
Which automated check would have caught the 11pm-Tuesday paste?
Why does broad authoring with strict review beat narrow authoring with loose review?

Interview Q&A

Q1. How do you let non-engineers edit prompts without losing safety? A. Build a dashboard surface that runs the same review gate as a PR. PMs draft, auto-checks run, reviewers approve, deployers ramp. The dashboard is a different surface for the same gate. Langfuse, Vellum, Pezzo, and Braintrust all support this model. Trap: "Just give them edit access." That is the 11pm-Tuesday button. The next regression is a matter of time.

Q2. What should an auto-check battery cover before human review? A. Six things — lint, eval, drift, schema, PII, cost. Each is a gate of its own. The eval check is the heaviest and belongs in CI. Drift catches behavior changes that pass the eval but shift production behavior. PII catches operator mistakes that leak data. Trap: "We just run the eval." Eval alone misses drift, schema, PII, and cost.

Q3. Why separate reviewer from deployer? A. They answer different questions. The reviewer asks "is this change correct?" The deployer asks "is now the right time?" A single role pressures the holder to answer both at once, which often defaults to "yes, ship it." Two roles defend two different failure modes. Trap: "Same person, faster." Faster at shipping, slower at recovering when both judgments were rushed.

Q4. How do you keep review from becoming a bottleneck? A. Three levers — expand the reviewer pool, invest in auto-checks so reviewers spend time on intent and scope rather than mechanics, and set an async-review SLA (e.g., reviewer responds within 4 business hours). The gate stops working when authors batch changes into mega-PRs because review is slow. Trap: "Just have one senior engineer review everything." That engineer becomes the bottleneck and the rubber stamp simultaneously.

Q5. A reviewer approves a change without reading it. How do you fix the culture? A. Two interventions. (1) Cap PR/diff size so review is feasible. (2) Tie one observed regression back to the rubber-stamp approval in a postmortem. Once the team feels the cost of a missed review, the next month's reviews tighten naturally. Cultural fixes follow visible incidents. Trap: "Add more reviewers." Adding reviewers does not fix the rubber-stamp problem; it spreads the same habit across more people.

Q6. What goes in the audit log, and why does it matter six months later? A. Per change: prompt name, prior SHA, new SHA, author, reviewer, deployer, timestamps, eval/drift scores, intent description, ramp stage history. Six months later it answers "what changed and why" without anyone needing to remember. For SOC2/HIPAA/EU AI Act, the same log is the regulatory artifact. Trap: "We log it in Slack." Slack is not a structured log; it ages out, gets edited, and cannot be filtered.

Q7. Your team needs to ship an emergency prompt fix. The gate slows you down. What do you do? A. Two paths. Rollback is fast: no eval gate and no review, because it reverts to a known-good SHA that already passed the gate and ran in production. Forward-fix has a lighter gate: auto-checks run, a single reviewer approves, and the ramp is narrow. Build both paths in advance so emergencies do not become "skip the gate" exceptions. Trap: "Skip the gate just this once." Just-this-once normalizes to always.

Apply now (5 min)

Step 1 — audit your current edit surface. For one prompt in production, list every way it can be changed (code PR, dashboard, API, runtime config). For each path, identify the gates — or note their absence.

Step 2 — find the weakest path. Which path has the least review? Is it ever used? Is the temptation to use it high (e.g., Friday afternoon, incident)?

Step 3 — close one gap. Pick the weakest path and add one gate to it this week. Even a basic eval check or a Slack-bot review-request can move a path from "no review" to "some review."

The discipline is closing gaps one at a time and tightening over time. An imperfect gate built this quarter beats a perfect gate planned forever.
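The inventory from steps 1 and 2 fits in a dictionary, which makes the weakest path easy to spot. A small sketch, assuming "weakest" simply means "fewest gates"; the path names, gate names, and `weakest_path` are hypothetical.

```python
def weakest_path(paths: dict[str, list[str]]) -> tuple[str, list[str]]:
    """Return the edit path with the fewest gates (ties broken by name)."""
    name = min(paths, key=lambda p: (len(paths[p]), p))
    return name, paths[name]


# Hypothetical inventory for one production prompt.
edit_paths = {
    "code_pr": ["ci_eval", "review", "ramp"],
    "dashboard": ["review"],
    "api": [],
    "runtime_config": ["review"],
}

name, gates = weakest_path(edit_paths)
print(name, gates)  # api []
```

Counting gates is a blunt measure, since one strong gate can beat two weak ones, but it is enough to pick where the first fix should go.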

Bridge. The gate decides which versions are allowed to reach production. The next chapter is the question of how much traffic a version gets when it gets there — shadow runs that compare without affecting users, and A/B splits that ramp confidence in stages. → 05-shadow-and-ab-testing.md