10. Prompt feature flags — the dial that ramps and the switch that kills

~15 min read. The same flag machinery that gates a code rollout can gate a prompt rollout. The dial moves slowly from one percent to a hundred. The kill switch flips back in one click. Everything in between is bucketing, eval gates, and good hygiene.

Builds on 05-shadow-and-ab-testing.md and 09-multi-tenant-prompts.md. The recipe book holds the versions. The flag system decides which version each customer actually tastes today.


1) Hook — two ways to ship a prompt change

You finished v18 of the support prompt. The taste test passes. You want to ship.

Path A — the all-or-nothing deploy. Merge the PR. The new SHA goes live for everyone the moment it deploys. If v18 turns out to nudge replies in a colder direction that the eval did not catch, every customer feels it at once. When the complaint surge starts you push a revert PR, wait for CI, wait for deploy, and serve the bug for forty minutes.

Path B — the feature-flagged ramp. Merge the PR. The new SHA is registered, but the live flag value still points to v17. You toggle the flag in LaunchDarkly — "1% of users to v18." You watch the dashboards. The complaint rate holds steady. You move to 5%. Then 25%. Then 50%. At each step, eval metrics from production traffic confirm or deny. If complaints spike at 5%, you flip the flag back to v17. Two clicks. Forty seconds.

Path A is the bakery from the ELI5 chapter. Path B is the bakery you want. The only difference is that the which-recipe-runs decision has been pulled out of code and into a runtime flag value. The flag is the dial. The dial is the difference.

This chapter is about turning the prompt-rollout decision into a flag, and not making the standard flag-system mistakes along the way.


2) The metaphor — the dial above the oven

Picture the head baker's bench. Two recipe sheets sit ready — v17, the safe one, and v18, the new one. Above the oven hangs a single dial marked from zero to a hundred. The dial does not say "replace the recipe." It says "of every hundred loaves baked today, how many use v18?"

At dawn the dial reads one. Ninety-nine percent of customers eat v17. One percent eat v18. The baker watches the complaints book. No spike. He turns the dial to five. He waits. He turns to twenty-five. Still calm. He turns to a hundred. Now everyone eats v18.

If at any point the complaints book lights up, the baker reaches up and slams the dial back to zero. Within minutes the next batch out of the oven is v17 again. Nobody had to print a new recipe. Nobody had to rip pages out of the binder. The recipe never left. Only the dial moved.

That dial is the feature flag. The two recipes are two SHAs. The decision of which recipe each loaf comes from is bucketing — a small piece of fast math that runs every time an order arrives.

The whole chapter is the engineering behind that dial.


3) Anatomy — flag, SHA, bucketing, killswitch

A prompt feature flag has four moving parts.

┌────────────────────────────────────────────────────────────────┐
│  PROMPT FEATURE FLAG — support_prompt_rollout                  │
├────────────────────────────────────────────────────────────────┤
│  variants:                                                     │
│    control:    customer_support_agent@v17  (SHA a8c3f9...)     │
│    treatment:  customer_support_agent@v18  (SHA b1d7e4...)     │
│                                                                │
│  default:      control                                         │
│                                                                │
│  rules:                                                        │
│    1. tenant_id in [acme, internal] → treatment                │
│    2. user_id hash % 100 < 5         → treatment               │
│    3. otherwise                       → control                │
│                                                                │
│  killswitch:   force_control = false                           │
└────────────────────────────────────────────────────────────────┘

Four parts.

Variants map flag values to prompt SHAs. Two is the common case — control and treatment. Three is allowed when you A/B/C test. More than four is usually a smell.

Default is what the system serves when the flag service is unreachable. Always the known-safe SHA. Never the new one. This is the most-forgotten rule in flag systems.

Rules decide which user gets which variant. Rules run top-down. The first match wins. The two common shapes are targeted rules — internal tenants, beta accounts, specific user IDs — and percentage rules with deterministic bucketing.

Killswitch is a flat override. When set, every request returns the control variant regardless of the rules. The killswitch is what you flip during an incident. It is the one boolean in the flag that needs protecting with operational habit, not just permissions.

The flow at request time looks like this.

                  ┌──────────────────────┐
   request ──▶    │  build flag context  │
                  │  user_id, tenant_id  │
                  └──────────┬───────────┘
                  ┌──────────▼───────────┐
                  │  flag.evaluate()     │
                  └──────────┬───────────┘
                  ┌──────────▼───────────┐
                  │  flag value: v17/v18 │
                  └──────────┬───────────┘
                  ┌──────────▼───────────┐
                  │  load prompt by SHA  │
                  └──────────┬───────────┘
                  ┌──────────▼───────────┐
                  │  send to model       │
                  │  log SHA in trace    │
                  └──────────────────────┘

The trace logs the SHA that actually ran, not just the flag value. When an incident hits, the trace tells you whether this user was on v17 or v18 without consulting the flag system at all. That separation matters during the incident — flag systems can be slow, dashboards can lag, traces are the ground truth.
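The four parts and the request flow can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's SDK: the PromptFlag class, the evaluate and resolve helpers, and the use of SHA-256 as a stand-in hash are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PromptFlag:
    name: str
    variants: dict                           # variant name -> prompt SHA
    default: str = "control"                 # known-safe variant
    target_tenants: frozenset = frozenset()  # rule 1: targeted tenants
    rollout_percent: int = 0                 # rule 2: percentage ramp
    force_control: bool = False              # the killswitch


def bucket(user_id: str, flag_name: str) -> int:
    # SHA-256 is a stdlib stand-in for a fast hash such as murmurhash3.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def evaluate(flag: PromptFlag, user_id: str, tenant_id: str) -> str:
    if flag.force_control:                   # killswitch beats every rule
        return "control"
    if tenant_id in flag.target_tenants:     # first match wins, top-down
        return "treatment"
    if bucket(user_id, flag.name) < flag.rollout_percent:
        return "treatment"
    return flag.default                      # otherwise


def resolve(flag: PromptFlag, user_id: str, tenant_id: str) -> dict:
    variant = evaluate(flag, user_id, tenant_id)
    # The trace record carries the SHA that will actually run.
    return {"flag": flag.name, "variant": variant,
            "prompt_sha": flag.variants[variant]}


flag = PromptFlag(
    name="support_prompt_rollout",
    variants={"control": "a8c3f9", "treatment": "b1d7e4"},
    target_tenants=frozenset({"acme", "internal"}),
    rollout_percent=5,
)
trace = resolve(flag, user_id="user_4481", tenant_id="acme")
```

Because acme is a targeted tenant, the trace record here names the treatment SHA. Setting force_control to True returns the control SHA for the same request without touching the rules.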


4) Deterministic bucketing — same user, same recipe

Random bucketing is a beginner's mistake. If user 4481 sees v18 on Monday and v17 on Tuesday, your eval signal is muddy and the user experience is jarring. Deterministic bucketing means the same user always lands in the same bucket, until the rollout percentage moves.

The standard trick is a fast hash.

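import mmh3  # third-party MurmurHash3 binding; one possible choice of fast hash

def bucket(user_id: str, flag_name: str) -> int:
    # Salting with the flag name gives each flag its own bucket assignment.
    h = mmh3.hash(f"{flag_name}:{user_id}", signed=False)  # unsigned 32-bit
    return h % 100   # bucket 0..99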

The flag name gets mixed into the hash. Without that mix, every percentage flag puts the same users into the low buckets — users in bucket 0 see every new feature first, users in bucket 99 see them last. That biases your A/B reads. Mixing the flag name decorrelates the buckets across flags.

USER ID    +    FLAG NAME          HASH        BUCKET
─────────────────────────────────────────────────────
user_4481      support_prompt      0x9a2f...    47
user_4481      pricing_prompt      0x142e...    62
user_4481      ranking_v2          0xcc01...    08

Same user, three different buckets across three different flags. That is the property you want.
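To make the decorrelation claim concrete, here is a small sketch that measures how many of one flag's first-5% users are also in another flag's first 5%. It assumes SHA-256 as a stdlib stand-in for the fast hash, and the early_overlap helper and synthetic user IDs are made up for the illustration.

```python
import hashlib


def bucket(user_id: str, salt: str) -> int:
    # An empty salt models the "hash(user_id) alone" mistake.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def early_overlap(users, salt_a: str, salt_b: str, percent: int = 5) -> float:
    """Share of flag A's early users who are also early users of flag B."""
    first_a = {u for u in users if bucket(u, salt_a) < percent}
    first_b = {u for u in users if bucket(u, salt_b) < percent}
    return len(first_a & first_b) / max(len(first_a), 1)


users = [f"user_{i}" for i in range(10_000)]

# Same (empty) salt for both flags: the early groups are identical.
unsalted = early_overlap(users, "", "")

# Flag name as salt: the early groups overlap only by chance.
salted = early_overlap(users, "support_prompt", "pricing_prompt")
```

Without the salt the overlap is total, since both flags compute the same bucket for every user. With the flag name mixed in, the overlap should sit near the rollout percentage itself, which is what independent bucketing would predict.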

Rollout percentage works against the bucket. At 5%, users with bucket < 5 see treatment. At 25%, users with bucket < 25. The same user stays in their bucket forever. When you ramp from 5% to 25%, the new users entering treatment are buckets 5..24 — users who were on control yesterday and treatment today. Everyone in 0..4 was already there. Everyone in 25..99 is still on control. No one whiplashes.
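The no-whiplash property can be checked directly. The sketch below again assumes SHA-256 in place of murmurhash3 and uses synthetic user IDs; in_treatment is a hypothetical helper name.

```python
import hashlib

FLAG = "support_prompt_rollout"


def bucket(user_id: str, flag_name: str) -> int:
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_treatment(user_id: str, flag_name: str, percent: int) -> bool:
    # A user is in treatment when their fixed bucket falls under the dial.
    return bucket(user_id, flag_name) < percent


users = [f"user_{i}" for i in range(10_000)]
at_5 = {u for u in users if in_treatment(u, FLAG, 5)}
at_25 = {u for u in users if in_treatment(u, FLAG, 25)}

# Users who joined treatment when the dial moved from 5 to 25.
newly_added = at_25 - at_5
```

Every user in at_5 is also in at_25, and every newly added user has a bucket between 5 and 24. Turning the dial up only ever adds users to treatment; turning it down only ever removes them.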

Mini-FAQ. "Do I bucket by user_id or session_id?" For most prompt rollouts, user_id — same person, same recipe across sessions. For anonymous traffic, session_id is the fallback. For tenant-wide rollouts, tenant_id.


5) Worked example — Statsig-driven ramp from v17 to v18

Walk through one realistic rollout. The team uses Statsig. The prompt is customer_support_agent. v17 is in production. v18 just passed CI evals.

Day 0 — register the flag.

flag: support_prompt_rollout
variants:
  control:   { sha: a8c3f9, label: v17 }
  treatment: { sha: b1d7e4, label: v18 }
default: control
killswitch: false
rules:
  - if tenant_id in [internal, acme_staging] → treatment
  - else → control

The flag is live but no production user sees treatment. Internal tenants and the Acme staging account get v18 immediately so the team and one friendly customer can taste the new bread.

Day 1 — 1% ramp.

The team adds the bucket rule. Csat dashboards are wired to a downstream metric. A complaint-rate-per-thousand-resolutions chart sits in Grafana with a red line.

rules:
  - if tenant_id in [internal, acme_staging] → treatment
  - if bucket(user_id, "support_prompt_rollout") < 1 → treatment
  - else → control

About one percent of production users — selected deterministically, stable across sessions — now taste v18. The team watches for thirty-six hours. Complaint rate holds steady. The trace volume on v18 is small but enough to see the most common reply shapes.

Day 3 — 5%.

The percentage moves from 1 to 5. The team is now watching csat per variant — the prompt observability layer from chapter 7 splits the metric. v18's csat sits within a fraction of a point of v17. Acceptable. Sample size still small.

Day 5 — 25%.

Now the comparison has real signal. A confidence interval starts to form. v18 is roughly equal to v17, perhaps a hair better on conciseness. No regression on the regression-suite metrics.

Day 7 — 50%.

The team holds at 50% until day 10 because the weekend traffic profile is different. They want both halves to cover a weekend.

Day 10 — 100%.

The percentage moves to 100. Everyone is on v18. The flag rule still has the old control branch wired up.

Day 17 — flag retirement.

A week passes without incident. The team retires the flag — they change the call sites to load v18 directly, remove the flag from Statsig, and delete the rollout config. The flag has a removal date in its metadata; a Slack reminder fires when the date hits.

The whole ramp took ten days. The riskiest moment was day 1, when one percent of users met v18 for the first time in production. By day 5 the team had production evidence v18 was at least equal. By day 10 the rollout was complete. At no point was there a moment where a regression could not be reversed by flipping the killswitch.
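The "confirm or deny at each step" decision can be written down as a gate. The following is one possible sketch, not the team's actual tooling: it compares complaint rates between variants with a normal-approximation 95% confidence interval, and the gate function, the tolerance value, and the three-way advance/hold/kill outcome are all assumptions chosen for illustration.

```python
import math

RAMP_STEPS = [1, 5, 25, 50, 100]


def rate_diff_ci(bad_c: int, n_c: int, bad_t: int, n_t: int, z: float = 1.96):
    """CI for (treatment complaint rate - control complaint rate)."""
    p_c, p_t = bad_c / n_c, bad_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


def gate(current_percent: int, bad_c: int, n_c: int, bad_t: int, n_t: int,
         tolerance: float = 0.005):
    """Return (decision, next_percent) for the ramp."""
    if n_c == 0 or n_t == 0:
        return "hold", current_percent       # no data yet, do not move
    low, high = rate_diff_ci(bad_c, n_c, bad_t, n_t)
    if low > 0:
        return "kill", 0                     # treatment is clearly worse
    if high < tolerance:
        idx = RAMP_STEPS.index(current_percent)
        return "advance", RAMP_STEPS[min(idx + 1, len(RAMP_STEPS) - 1)]
    return "hold", current_percent           # interval still too wide


# Small sample at 5%: same observed rate, but not enough evidence yet.
early = gate(5, bad_c=200, n_c=20_000, bad_t=10, n_t=1_000)

# Larger sample at 25%: the interval is tight enough to move on.
later = gate(25, bad_c=2_000, n_c=200_000, bad_t=500, n_t=50_000)
```

The design choice worth noting is the "hold" outcome. Equal observed rates on a small sample are not evidence of safety, which matches the day 3 note that the sample size is still small.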


Mid-content recall

  1. Why is bucketing by hash(flag_name + user_id) better than hash(user_id) alone?
  2. What does the killswitch do that a percentage rollback does not?
  3. Why does the trace log the SHA, not the flag value?

6) The killswitch — what it must do, fast

The killswitch is a single flag attribute that, when flipped, forces every request to control regardless of rules. Two properties matter.

Speed of propagation. When you flip the killswitch in the flag console, the new value must reach every server inside thirty seconds, ideally inside ten. LaunchDarkly's server-side SDKs stream flag updates over a server-sent-events connection, so propagation is typically sub-second. Statsig and Split offer streaming or short-interval polling depending on SDK and configuration; confirm which mode yours runs in. If your flag library polls every minute, your killswitch is a minute-switch. Replace it.

Resistance to flag-service failure. If the flag service is down, every request still falls back to the default — which is control. The killswitch and the failure mode point in the same direction. This is not an accident. The default is configured for the worst case.
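The fall-back-to-default behavior is usually a thin wrapper around the flag client. A minimal sketch, assuming a hypothetical fetch callable that stands in for whatever your flag SDK exposes:

```python
from typing import Callable, Tuple

VARIANTS = {"control": "a8c3f9", "treatment": "b1d7e4"}
DEFAULT_VARIANT = "control"   # the known-safe SHA, never the new one


def safe_variant(fetch: Callable[[], str]) -> Tuple[str, str]:
    """Return (variant, reason). Any failure resolves to the default."""
    try:
        variant = fetch()
    except Exception:
        # Flag service unreachable, timed out, or the client raised.
        return DEFAULT_VARIANT, "flag_service_error"
    if variant not in VARIANTS:
        # A value this code does not know how to serve is also a failure.
        return DEFAULT_VARIANT, "unknown_variant"
    return variant, "evaluated"


def flag_service_down() -> str:
    raise TimeoutError("flag service unreachable")


variant, reason = safe_variant(flag_service_down)
prompt_sha = VARIANTS[variant]
```

Logging the reason alongside the SHA lets the trace distinguish "this user was bucketed into control" from "the flag service was down, so everyone got control".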

The operational habit around the killswitch is the part most teams underinvest in.

┌────────────────────────────────────────────────────────────┐
│  KILLSWITCH RUNBOOK — support_prompt_rollout               │
├────────────────────────────────────────────────────────────┤
│  Step 1. Open Statsig console → support_prompt_rollout     │
│  Step 2. Toggle killswitch ON                              │
│  Step 3. Confirm complaint rate normalizes in dashboard    │
│  Step 4. Open #incident-prompt-rollout, post first update  │
│  Step 5. File incident ticket with the trace IDs that      │
│          triggered the kill                                │
│                                                            │
│  Owner: support-ai-team                                    │
│  Pager: PD service "support-ai-rollout"                    │
└────────────────────────────────────────────────────────────┘

A runbook beats memory at 3 a.m. The runbook lives next to the flag, not in a wiki you cannot find under pressure.

Mini-FAQ. "Do I need a separate killswitch per flag?" For prompt rollouts, yes. A blanket "kill all AI flags" switch sounds safe but disables features that were not the problem. Per-flag killswitches let you act surgically.


7) Targeted rollouts and the internal-first habit

Percentage ramps are the headline. Targeted rules are the unsung work.

The internal-first habit: every prompt rollout starts with internal users only, before any production percentage moves. The flag rule for that is only a few lines.

rules:
  - if user_email endswith "@yourcompany.com" → treatment
  - if tenant_id in [internal_staging, internal_dogfood] → treatment
  - else → control

This catches catastrophic failures — prompts that produce empty replies, prompts that violate the safety policy, prompts that the eval suite did not cover — before any real customer sees them. The cost is nearly zero. The risk-reduction is high.

After internal, targeted customer pilots. "Acme has asked to try this. Bayer is on standby." Two friendly customers, ramping ahead of the general rollout, often catch the rough patches the eval missed. Their feedback loop is direct. When something breaks, you hear about it from a known relationship, not from a Twitter thread.

Geography targeting matters when prompts touch language. "v18 is English-only. Roll out to EN tenants first; HI/AR/ZH tenants stay on v17 until the multi-lingual eval passes." Geography rules also surface regulatory boundaries — "the EU footer change rolls only inside the EU; US tenants are untouched."

Segment targeting is the last common one. "Pro and Enterprise tenants get the longer-context prompt. Free tenants stay on the shorter, cheaper one." This is where cost optimization and prompt rollouts intersect — the flag is also a routing decision.

ROLLOUT ORDER
──────────────────────────────────────────────────────────
1. internal staff           (0%-100% in 1 day)
2. internal dogfood tenants (0%-100% in 1 day)
3. friendly customer pilots (2-3 named tenants)
4. percentage ramp          (1% → 5% → 25% → 50% → 100%)
5. flag retirement          (~1 week after 100%)

Five stages. Each stage is a chance for a different class of bug to surface. Skipping a stage is how the incidents from chapter 11 start.


8) Multi-flag interactions — when flags compound

This is the part that bites at scale.

A real production request often consults three or four flags. The prompt flag picks v17 or v18. The model flag picks claude-sonnet or claude-opus. The retrieval flag picks bm25-only or hybrid. The temperature flag picks 0.0 or 0.2.

Four flags, two values each: sixteen combinations. Your eval suite tested one: v18 with sonnet with hybrid with 0.0. The combination some user actually receives might be v18 with opus with bm25 with 0.2. You never evaluated it. If the flags ramp independently, a large share of production traffic can land in combinations nobody tested.

The mitigation has two parts.

Bundle related flags. Prompt and model and retrieval often change together. Instead of three flags, define a single bundle flag with named variants: bundle_v1 (v17 + sonnet + hybrid), bundle_v2 (v18 + sonnet + hybrid), bundle_v3 (v18 + opus + hybrid), each evaluated at temperature 0.0. Now each variant is a tested combination.

Monitor variant coverage. Even with bundles, some combinations slip through. A dashboard that counts requests per variant combination catches drift — if 12% of requests are landing in a combination your eval never tested, you know to add coverage or kill that combination.

SEEN COMBINATIONS THIS WEEK
─────────────────────────────────────────────────
bundle_v1 + temp 0.0     45%   tested
bundle_v2 + temp 0.0     38%   tested
bundle_v2 + temp 0.2     12%   ← untested!
bundle_v3 + temp 0.0      5%   tested

The 12% slice is the bug breeding ground. Either fold it into the eval, or remove the combination from the rule set.
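A coverage monitor of this kind is little more than a counter over trace records. The sketch below assumes each request's trace yields a (bundle, temperature) pair and that the tested set is maintained by hand; both are assumptions for the example, as is the untested_share helper.

```python
from collections import Counter

# Combinations the eval suite has actually covered (assumed, kept by hand).
TESTED = {("bundle_v1", 0.0), ("bundle_v2", 0.0), ("bundle_v3", 0.0)}


def untested_share(requests, tested=TESTED) -> dict:
    """Map each untested combination to its share of total traffic."""
    counts = Counter(requests)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {combo: n / total
            for combo, n in counts.items() if combo not in tested}


# One week of traffic in the same proportions as the table above.
week = (
    [("bundle_v1", 0.0)] * 45
    + [("bundle_v2", 0.0)] * 38
    + [("bundle_v2", 0.2)] * 12
    + [("bundle_v3", 0.0)] * 5
)
gaps = untested_share(week)
```

Here gaps contains a single entry, bundle_v2 at temperature 0.2, carrying 12% of traffic. An alert threshold on that share is a natural next step.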

Mini-FAQ. "What about overrides for support engineers debugging a ticket?" Flag systems support per-user overrides for staff. They are valuable. Just make sure the trace logs whether an override was active. "It worked when I tested it" is meaningless if the engineer was on a different variant than the customer.


9) Flag hygiene — the dead-flag pile

Flags accumulate. The team ramps v18 to a hundred percent in week one. Week three the v18-v17 flag is still in the code, still callable. Week sixteen someone toggles it to control to "see what changes," and serves v17 to half the traffic for an hour.

Every flag needs four metadata fields at creation time.

flag: support_prompt_rollout
owner: support-ai-team
created: 2026-04-12
removal_target: 2026-05-10            # 4 weeks after creation
status: active                        # active | rolled_out | dead

A nightly job lists every flag past its removal target with status not dead. The list goes to the owning team. After two weeks of grace, the flag's call sites are removed and the flag is deleted.
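The nightly job can be sketched as a pure function over flag metadata. The FlagMeta class and the audit function are hypothetical names; a real job would read the metadata from the flag service or a config repo.

```python
from dataclasses import dataclass
from datetime import date, timedelta

GRACE = timedelta(weeks=2)


@dataclass
class FlagMeta:
    name: str
    owner: str
    created: date
    removal_target: date
    status: str          # active | rolled_out | dead


def audit(flags, today: date):
    """Return (overdue, past_grace) lists of flag names."""
    overdue, past_grace = [], []
    for f in flags:
        if f.status == "dead" or today <= f.removal_target:
            continue                         # retired, or not yet due
        if today > f.removal_target + GRACE:
            past_grace.append(f.name)        # call sites should be removed
        else:
            overdue.append(f.name)           # notify the owning team
    return overdue, past_grace
```

Passing today in as a parameter, rather than calling date.today() inside, keeps the function easy to unit-test against fixed dates.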

Tools enforce this. LaunchDarkly has "stale flag" detection. Statsig has the same. The discipline is using it.

DEAD-FLAG SYMPTOMS
────────────────────────────────────
+ flag fully rolled out for 30+ days
+ no removal_target set
+ owner left the team 6 months ago
+ flag toggled accidentally during incident
+ code path on the control side is dead

Five signs the flag should be gone. Each one is a small incident waiting.


10) Failure modes — where prompt flags leak

SYMPTOM                                  ROOT CAUSE                          FIX
─────────────────────────────────────────────────────────────────────────────────────
"It worked when I tested it"             Engineer override active            Log override in trace
Killswitch took 90 seconds to act        Flag client polls, not streams      Switch to streaming SDK
User saw v17 Monday, v18 Tuesday         Random bucketing, not deterministic Hash on user_id + flag_name
1% ramp shows huge regression            Eval gate skipped, prompt was bad   Block flag flip on failing eval
Flag flipped back on Friday, broke       Default not set; flag-service down  Set default = known-safe SHA
prompt for everyone                      treated as no-decision
Dashboards still show v17 a week         Trace logs flag value, not SHA      Log SHA, not flag value
after full rollout
50% ramp held for an hour, then          Flag service had stale rule cache   Add health-check on rule freshness
reverted to 1% by itself
Six prompt flags interact in untested    No bundle abstraction               Bundle related flags into one
combinations
Dead flag toggled, caused incident       No flag removal discipline          Removal target + nightly audit
1% ramp had 0 production traffic         Bucket on user_id, but most calls   Bucket on tenant_id for B2B
                                         are tenant_id-scoped
Trace shows treatment, but config        Flag context built incorrectly      Unit-test context-builder
shows user was in control bucket
Killswitch flipped, csat did not         Killswitch propagation delayed by   Add metric on flag-update lag
recover for 10 minutes                   CDN cache

Twelve leaks. The shape: flag systems are reliable when you respect their model (deterministic bucketing, streaming updates, defaults, killswitches) and unreliable when you treat them as magic. They are not magic. They are a fast key-value store with a rules engine, hardened.


Where this lives in the wild

Flag systems and the prompt-rollout patterns built on top of them.

  • LaunchDarkly — the most widely deployed feature-flag service; native support for percentage rollouts and SSE-based propagation.
  • Statsig — flags, experiments, and Pulse metric monitoring tightly integrated; popular for AI features.
  • Split.io (Harness FME) — feature flags with strong experimentation analytics.
  • Flagsmith — open-source flag system with self-host option.
  • Optimizely Feature Experimentation — flags layered with the experimentation suite.
  • ConfigCat — lightweight flag service used by smaller teams.
  • GrowthBook — open-source flag and experiment platform.
  • Unleash — open-source feature flags with self-host emphasis.
  • AWS AppConfig — flag-style dynamic config used to gate prompt SHAs on Bedrock-backed apps.
  • Cloudflare Workers KV / Durable Objects — used as a fast flag store at the edge.
  • OpenAI Assistants API — version pinning gated by application-layer flags.
  • Anthropic message-batches — frequently rolled out as a flagged path before becoming default.
  • GitHub Copilot rollout patterns — staff-first, then enterprise-tenant ramps mediated by feature flags.
  • Cursor, Lovable, v0 — gated rollouts of new model defaults and prompt revisions across users.
  • Notion AI, Slack AI, Atlassian Rovo — workspace-segment ramps for prompt and model changes.
  • Vercel AI SDK — model-routing flags wrapping prompt versions for staged rollouts.
  • Datadog Feature Flags / dynamic instrumentation — used to gate prompt traces and logging volume.
  • Sentry feature flags integration — links flag state to error reports for AI-feature regressions.
  • Langfuse prompt labels — production, staging, experiment labels acting as a lightweight flag surface.
  • PromptLayer release groups — staged-release tagging for prompt SHAs.
  • Braintrust — experiment groups gating prompt variants in production.
  • Vellum environments — production / staging / experiment scoping per prompt.
  • Helicone properties — runtime tagging used as a soft flag for routing.
  • AWS Parameter Store, Hashicorp Consul, Doppler — config stores commonly repurposed for prompt-SHA gating.
  • GitHub Actions / GitLab CI / CircleCI — pipelines that block flag flips on failing eval suites before percentages move.
  • PagerDuty / Opsgenie — paging surfaces wired to flag-flip events for fast escalation.
  • New Relic, Datadog — dashboards split by flag variant for csat and complaint metrics.
  • Stripe Capital experimental rollouts — same shape applied to ML decision logic at percentage ramps.
  • Booking.com, Airbnb, Etsy — long-standing internal flag systems used to gate model and prompt changes for years.
  • Microsoft Copilot Studio — environment-based gating of prompt revisions per tenant.

Pause and recall

  1. What four parts make up a prompt feature flag?
  2. Why does bucketing mix the flag name into the hash?
  3. What is the default value of a prompt flag during a flag-service outage?
  4. Why log the SHA in the trace rather than the flag value?
  5. What is the cost of holding a 50% ramp across a weekend?
  6. Why do dead flags become incidents?
  7. What is the simplest mitigation for combinatorial flag interactions?

Interview Q&A

Q1. How do you safely roll out a new prompt version? A. Register the new SHA. Wire a feature flag with control = old SHA, treatment = new SHA, default = control. Ramp through internal users, then targeted pilot customers, then 1% → 5% → 25% → 50% → 100% of production over a week, watching csat and complaint metrics at each stage. Keep a killswitch that flips the flag back to control instantly. Retire the flag after a quiet week at 100%. Trap: "We deploy it on Tuesday morning." That is the bakery from chapter 0.

Q2. Why is deterministic bucketing important, and how does it work? A. Same user, same variant across sessions. The bucket comes from hash(flag_name + user_id) % 100. Mixing the flag name into the hash decorrelates buckets across flags — user 4481 might be in bucket 4 for the prompt flag and bucket 88 for the retrieval flag. Without that mix, the same users see every new feature first, which biases analysis. Trap: "We bucket randomly per request." That destroys A/B signal and creates jarring UX.

Q3. A prompt rollout incident is happening right now. Walk through the response. A. Open the flag console. Toggle the killswitch ON. Confirm the trace SHA returns to the control value within thirty seconds. Watch the complaint-rate dashboard normalize. Post in the incident channel. File the incident ticket with the trace IDs from the regression. The full procedure is in a runbook attached to the flag. Trap: "We revert the PR." That takes minutes you do not have. The flag is the fast path.

Q4. How do you prevent dead flags from accumulating? A. Every flag has metadata at creation — owner, created date, removal target, status. A nightly audit lists flags past their removal target. After two weeks of grace, the call sites are deleted and the flag is removed from the service. Tools like LaunchDarkly and Statsig provide stale-flag detection; the discipline is acting on it. Trap: "We clean them up quarterly." Quarterly is when the dead-flag incidents happen.

Q5. Six flags interact in production. How do you reason about that? A. Two moves. First, bundle related flags — prompt + model + retrieval often change together; combine into one bundle flag with named variants, each variant being a tested combination. Second, monitor variant coverage — a dashboard counting requests per combination. Any combination with significant traffic and no eval is either folded into the eval or removed from the rule set. Trap: Treating flags as independent. They are not. Their combinations are the production surface.

Q6. What is the default value of a prompt flag when the flag service is unreachable? A. The control variant — the known-safe SHA. The system must serve a response when the flag service is down, and the response must come from the safe version. This means the flag and its failure mode point the same direction, which is the design intent. Trap: "We fail open to treatment." That ships untested behavior during an outage.

Q7. The 1% ramp shows a regression. The eval did not catch it. What do you do? A. Flip the killswitch. Then write the failing case from the production trace into the eval suite — the "fix the eval, not just the prompt" rule. Make the eval fail on the bad behavior. Iterate on the prompt until the eval passes. Restart the rollout. Without the new eval case, the same regression will ship next time. Trap: "We tweak the prompt and re-ramp." If the eval did not catch it once, it will not catch it next time. The eval is the asset.

Q8. Why log the SHA in the trace, not the flag value? A. The flag value is a label that can be re-mapped. The SHA is content-addressed and immutable. Three months from now, when you debug a complaint, you want to know exactly what text was sent to the model, not what label the flag service had assigned that day. Flag systems also fail; traces should remain interpretable without them. Trap: Logging only the flag value. Months later, when the flag is retired, you have lost the link.


Apply now (5 min)

Step 1 — model first. Sketch a flag definition for a hypothetical pricing_prompt_rollout. Two variants. A default. A rule list with internal targeting and a 1% bucket. A killswitch field. The four parts.

Step 2 — your turn. Pick one prompt in your system that has shipped a change without a flag. Write the flag definition that would have wrapped the change. Identify what the percentage ramp schedule would have looked like and what the kill criterion would have been (csat drop, complaint rate, latency p99).

Step 3 — sketch from memory. Draw the request flow from section 3 — request comes in, flag evaluates, SHA loads, trace logs the SHA. If you can draw it without looking, you have the model.


Bridge. Flags give you control over the rollout. But sometimes the prompt change is bad and the metrics catch it. Now what? How fast can you actually roll back, and what habits make the postmortem produce a stronger eval suite instead of just a sigh of relief? Next. → 11-prompt-incidents-and-rollback.md