09. Multi-tenant prompts — one recipe book, many customers¶
~16 min read. A hundred enterprise customers each want their prompt slightly different. You do not fork the recipe a hundred times. You patch a controlled base, resolve at runtime, and eval each customer's effective prompt as its own thing.
Builds on 08-prompt-eval-suites.md. The recipe book now serves many bakeries. Each bakery wants its own tweak — a different greeting, a banned phrase, an extra compliance line. The recipe stays one. The tweaks live on top.
1) Hook — the hundred-forks trap¶
Look. You sell a support agent to enterprises. The base prompt is good. Then the customers arrive.
Tenant A — "Always sign off as 'Team Acme'. Never use the word 'unfortunately'."
Tenant B — "Add a HIPAA disclosure when the topic touches health. Use British English."
Tenant C — "Match our brand voice — warm, direct, no exclamation marks."
Tenant D — "Refuse anything that mentions our competitors by name."
The lazy answer is to fork the prompt. Copy the base, edit it, save it as support_prompt_acme.md, support_prompt_bayer.md, support_prompt_cigna.md. Three months later you have a hundred copies of nearly the same recipe. Then you find a bug in the base. You must fix it a hundred times. You will miss four. Those four customers ship the bug for another year.
The grown-up answer is the customer's recipe as a patch. The base lives once. The patches live per tenant. At runtime, you resolve — base plus patches in order — and get the effective prompt for that tenant. Same machinery as configuration overlays in Kubernetes, or theme overrides in a design system. One source, many faces.
This chapter is how that resolution works, and where it breaks.
2) The metaphor — base recipe, sticky notes per bakery¶
Picture the head chef's recipe binder. On the binder's main page sits the master recipe — flour, water, salt, yeast, time, temperature. That page never gets crossed out. Instead, every franchise location keeps a small stack of sticky notes clipped to the binder. The Mumbai bakery's note says "reduce salt by 10%." The Bengaluru note says "swap white sugar for jaggery." The Dubai note says "add cardamom to the dough."
When the chef walks into a kitchen, the routine is the same every time. Read the master page. Apply this kitchen's sticky notes in order. Bake the resulting recipe.
The master page is the base prompt. The sticky notes are tenant patches. The thing the chef actually bakes from is the effective prompt — the resolved result. The customer never sees the master, never sees the patches. They just taste the bread.
Three properties matter.
The master is the only thing the recipe team edits. The sticky notes are owned by the franchise — the franchise can change them whenever, but only inside the lanes the master allows. And the franchise cannot rewrite the master from inside a sticky note. "Replace all flour with sand" is not a sticky note. It is a fork. Forks live in their own binder and pay their own cost.
That last property is what stops the hundred-forks problem from coming back through the side door.
3) Anatomy — the section-aware prompt and three patch types¶
For patches to behave, the base prompt has to be structured. Raw text does not patch cleanly. Structured text does.
A section-aware prompt looks like this.
┌──────────────────────────────────────────────────────────┐
│ BASE PROMPT — customer_support_agent v23 │
├──────────────────────────────────────────────────────────┤
│ [section: role] │
│ You are a customer support agent. │
│ │
│ [section: tone] │
│ Warm. Direct. One short paragraph per reply. │
│ │
│ [section: constraints] │
│ Never make up policy. Cite the docs section if used. │
│ Never promise refunds without policy_check. │
│ │
│ [section: banned_phrases] │
│ (empty) │
│ │
│ [section: signoff] │
│ "Is there anything else I can help with?" │
│ │
│ [section: footer] │
│ (empty) │
└──────────────────────────────────────────────────────────┘
Six named sections. Each section is the unit a patch can touch. A patch never edits raw bytes. It edits a section.
Three patch types cover most of what tenants ask for.
┌──────────────┬───────────────────────────────────────────┐
│ PATCH TYPE │ WHAT IT DOES │
├──────────────┼───────────────────────────────────────────┤
│ append │ Add lines to a section │
│ replace │ Swap a section's content │
│ redact │ Remove specific lines or phrases │
└──────────────┴───────────────────────────────────────────┘
append is the friendliest. "Add 'Team Acme' to the signoff." It cannot remove anything. The base's promises still hold. replace is sharper. "Replace the tone section with: formal, never use contractions." The patch reviewer must check it does not violate the base's contract. redact is for removals. "Remove the line about offering refunds." Same review needed.
The shape of a patch in the registry looks like this.
tenant_id: acme_corp
base_prompt: customer_support_agent@v23
patches:
- section: signoff
op: replace
value: "Best regards, Team Acme."
- section: banned_phrases
op: append
value: "- unfortunately"
- section: tone
op: append
value: "Match Acme house style — no exclamation marks."
Three patches. Each targets a section. Each names its operation. The patch file is short, reviewable, diffable. When Acme wants to add a new banned phrase next month, they edit one line. The base never moves.
4) Resolution — how runtime turns base plus patches into a prompt¶
The resolution loop runs once per request, for the tenant making the request.
┌─────────────────────┐
tenant_id ──▶ │ load patch list │
└──────────┬──────────┘
│
┌──────────▼──────────┐
base SHA ──▶ │ load base prompt │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ apply patches in │
│ defined order │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ effective prompt │
│ + effective SHA │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ send to model │
└─────────────────────┘
Three notes about this loop.
First, the effective SHA is computed from hash(base_sha + patches_in_order). That is the SHA that lands in the trace. When a complaint arrives, the trace tells you the exact effective prompt that ran, not just the base. Without an effective SHA, you can roll back the base and still be serving the bad patched version.
Second, patch order is declared, not implicit. The patch file lists patches in the order they apply. The order matters when two patches touch the same section. A tenant can have "append: be formal" followed by "replace tone: be casual" — the second wins. Declared order is the only sane rule.
Third, resolution caching. For a hot tenant making a thousand calls a second, you do not recompute the effective prompt every time. You cache by (tenant_id, base_sha, patches_version). Cache invalidates when any of those three change. The miss path runs in single-digit milliseconds. The hit path is a dictionary lookup.
Mini-FAQ. "Why not just inline the patches into the trace?" Because the trace has to be small. Storing the effective SHA plus a pointer to the patch file is cheaper and lets you re-resolve later if a patch file gets versioned.
5) Worked example — one base, three tenants, three effective prompts¶
Start with the base from section 3 — customer_support_agent@v23.
Three tenant patch files.
# tenants/acme/patches.yaml
base_prompt: customer_support_agent@v23
patches:
- section: signoff
op: replace
value: "Best regards, Team Acme."
- section: banned_phrases
op: append
value: |
- unfortunately
- regrettably
# tenants/bayer/patches.yaml
base_prompt: customer_support_agent@v23
patches:
- section: tone
op: replace
value: "Formal. British English. One paragraph per reply."
- section: footer
op: replace
value: |
If your question relates to a medical product,
please consult a qualified healthcare professional.
# tenants/cigna/patches.yaml
base_prompt: customer_support_agent@v23
patches:
- section: constraints
op: append
value: |
Refuse to discuss any competitor by name.
If asked, redirect to product capabilities.
- section: tone
op: append
value: "No exclamation marks anywhere."
After resolution, the three effective prompts look like this — same base, three different faces.
ACME BAYER CIGNA
role agent agent agent
tone warm, direct formal, British EN warm, no '!'
constraints base base base + no competitors
banned_phrases unfortunately, (empty) (empty)
regrettably
signoff "Best regards, base base
Team Acme."
footer (empty) HIPAA-style note (empty)
Three customers. One base. Three patch files. Zero forks.
When the base team fixes a bug in constraints next month — say, tightening the policy-check rule — all three tenants inherit the fix the moment the new base ships. No hundred edits. No four that get missed.
Mid-content recall¶
- Why does a section-aware prompt patch better than raw text?
- What does the effective SHA capture that the base SHA does not?
- If a tenant adds "append: formal" and then "replace tone: casual" — which wins, and why?
6) Conflict handling — when base and patch disagree¶
The hard cases are not the friendly appends. They are the conflicts.
The base says "refuse to make policy promises." A tenant patch says "always confirm refunds are processed." The patch is asking the agent to make a policy promise. The base forbids it. Who wins?
The rule that scales is the base owns invariants, the patch owns flavor. Some sections are declared flavor — tone, signoff, banned phrases, footer. Tenants can patch these freely within bounds. Other sections are declared invariant — safety constraints, escalation rules, policy-check requirements. Tenants cannot patch these at all. A patch that targets an invariant section is rejected at registry-write time, not at runtime.
# base prompt metadata
sections:
role: { patchable: false }
tone: { patchable: true, ops: [append, replace] }
constraints: { patchable: true, ops: [append] } # append only
banned_phrases: { patchable: true, ops: [append] }
signoff: { patchable: true, ops: [replace] }
footer: { patchable: true, ops: [replace] }
safety_rules: { patchable: false } # locked
Six sections, each with declared rules. constraints is append-only — tenants can add policies, never remove them. safety_rules is locked — no patches allowed. tone and signoff accept replace.
This is the cheap version of policy. The expensive version adds a lint step where every patch runs through a content rule — a small model or a regex pack — that flags patches like "never refuse" or "always agree." The lint step catches what the section metadata cannot.
Mini-FAQ. "Can two tenants have patches that conflict with each other?" They cannot. Each tenant resolves in isolation. The only conflict surface is between one tenant's patches and the base.
7) Eval per tenant — the silent break¶
This is the part most teams discover the hard way.
A base prompt change ships. Internal evals pass on the base. The change rolls out. Three days later, customer Bayer complains — their footer disclosure is gone. The team looks. The base change touched a section adjacent to footer, and the patch resolver, due to a name collision, dropped Bayer's footer patch silently.
The base eval suite would never catch this. The base does not have Bayer's footer.
The rule — every tenant gets its own eval suite, gated on its own effective prompt. When the base changes, the CI run does not just eval the base. It re-resolves every tenant's effective prompt and runs each tenant's eval set against it. If any tenant's eval regresses, the base change blocks.
┌───────────────────┐
base change PR ──▶ │ resolve effective │ ──▶ acme.eval
│ prompt per tenant │ ──▶ bayer.eval
│ (all tenants) │ ──▶ cigna.eval
└───────────────────┘ ──▶ ... 97 more
│
▼
any tenant eval fails
│
▼
block merge
At a hundred tenants this is slow. You parallelize by tenant. You cache eval inputs. You sample to a representative slice on every commit and run the full slate nightly. The point is not perfect coverage on every keystroke. It is that no base change reaches production without each tenant's eval set running against the tenant's actual effective prompt.
The other half — when a tenant patch changes, only that tenant's evals need to run. Tenant patches do not affect other tenants.
8) Failure modes — where multi-tenant prompts leak¶
SYMPTOM ROOT CAUSE FIX
───────────────────────────────────────────────────────────────────────────────────
"We forked the prompt for one tenant" Patch system not in place yet Build sections + patches first
Tenant's customization mysteriously Patch order not declared, Make patch order explicit;
gone after base update base edit broke section name version section names
Base bug fix did not reach 18 customers Some forks never re-synced Move forks back onto base+patch
Tenant added "ignore all safety" No section lock on safety_rules Declare patchable: false
clause through a patch and lint patches at write time
Trace shows base SHA but bad output Effective SHA not captured Hash base+patches; log that
A patch broke when the base added a Patches keyed to text, not section Move all patches to section keys
new line above
Eval pass on base, fail in production No per-tenant eval suite One eval set per tenant;
gate base change on all
Hot tenant burns CPU on resolution No cache for effective prompt Cache by tenant_id + base_sha
Two patches both target same section, No order rule Declared order in patch file
unclear winner
Tenant patch file becomes its own Patches accumulate without review Quarterly tenant patch audit;
forked prompt over years prune dead lines
Ten leaks. The common shape — patches need structure (sections), order (declared), bounds (section metadata), identity (effective SHA), and eval coverage (per tenant). Drop any one and the system slides back toward forks.
Where this lives in the wild¶
Multi-tenant prompt patching shows up across applied AI platforms, sometimes named, often hidden.
- Salesforce Einstein — per-customer prompt overrides applied as layered config on top of a managed base prompt for the Service Cloud agent.
- Microsoft Copilot Studio — tenant-scoped prompts and plugin manifests sit above a shared system prompt.
- Intercom Fin — per-workspace tone, banned phrases, and signoff overrides on top of a shared customer-support backbone.
- Zendesk AI agents — brand-voice overrides resolved at runtime per brand inside an account.
- Glean — per-customer system prompts patched onto a shared enterprise-search assistant base.
- Notion AI — workspace-level customization layered above the global instruction set.
- Slack AI — workspace-tuned summary and search prompts inheriting from a shared base.
- Atlassian Rovo — per-site prompt customization for Confluence and Jira agents.
- HubSpot AI Assistants — portal-level voice and compliance patches over the base support prompt.
- Drift / Conversica — account-level persona overlays on top of a shared sales-bot backbone.
- Twilio Verified AI — operator-scoped prompts derived from a managed base.
- AWS Bedrock Agents — agent-level instructions layered with Action Group prompts.
- Azure OpenAI on-behalf-of patterns — customer-tenant system messages composed before request time.
- Vercel v0 / Cursor / Lovable — per-project rules and conventions appended to a base coding agent prompt.
- GitHub Copilot for Business / Enterprise — org-scoped custom instructions stacking on the base Copilot prompt.
- Anthropic Claude Projects / OpenAI custom GPTs — user-scoped instructions resolved against the model's base.
- Langfuse prompt management — variants and labels used to express per-tenant prompts.
- PromptLayer — tagged prompt variants per tenant with a shared template ancestor.
- Vellum — base prompt plus environment-scoped overrides per customer.
- Braintrust — prompt variants gated by attributes; effective prompt logged with the trace.
- LangSmith — versioned prompt with metadata-based variant selection.
- Helicone — prompt template plus runtime substitution per request.
- Pezzo — environment-scoped prompts with hierarchical inheritance.
- Doppler / AWS Parameter Store / Hashicorp Consul — config stores commonly repurposed for tenant prompt patches.
- GitHub-hosted YAML registries — tenant-patch files reviewed by PR, resolved at deploy.
- Stripe Radar rule overlays — same shape applied to fraud rules; per-account overlays on a shared base.
- Cloudflare Workers AI — account-level system prompts composed at the edge before invoking models.
- Twilio Studio / IBM watsonx Assistant — tenant-scoped flow-level overrides on shared base flows.
Pause and recall¶
- Why does forking the prompt per customer fail at the hundred-tenant scale?
- What is the difference between a
replacepatch and anappendpatch? - Why must each section declare which patch operations are allowed?
- What is the effective SHA, and why does the trace need it?
- When the base prompt changes, whose eval suites run?
- Why does patch order need to be declared, not implicit?
- What is the one section type that should never accept patches?
Interview Q&A¶
Q1. A new customer asks for a tone change. How do you handle it without forking the prompt?
A. Add a tenant patch file. It references the base prompt's SHA and applies a replace on the tone section. The base never moves. The effective prompt resolves at runtime as base plus patch. The tenant's eval suite is added and gated on every base change.
Trap: "We copy the prompt and edit it for them." That is a fork. It dies at scale.
Q2. How do you prevent a tenant from patching their way around safety rules?
A. Section metadata declares which sections are patchable and which are locked. Safety-critical sections are patchable: false and the registry rejects any patch that targets them. A lint step on patch write catches patches that try to negate base invariants — "ignore prior instructions" and similar.
Trap: "We trust the tenant." Trust does not survive an integration tested by an adversary.
Q3. Tenant Acme's customization is silently gone after a base update. Diagnose. A. Most likely a section rename in the base broke the patch's section key. Patches must reference sections by stable identifiers, not text. The fix has two parts. Add the effective SHA to traces so this kind of silent drop is visible. Run per-tenant evals on every base change so the regression blocks before deploy. Trap: "Just don't rename sections." Renames happen. The system has to be robust to them.
Q4. How does the effective SHA differ from the base SHA? A. Base SHA hashes the base prompt content alone. Effective SHA hashes the base SHA combined with the tenant's patch list in order. Two tenants with the same base have different effective SHAs because their patches differ. The trace logs the effective SHA so debugging a customer complaint points to the exact prompt that ran for that customer. Trap: Logging only base SHA. You will roll back the base and still serve the broken patched version.
Q5. Patch resolution adds latency on every call. How do you make it fast?
A. Cache by (tenant_id, base_sha, patches_version). The cache miss path resolves in single-digit milliseconds. The hit path is a hash lookup. Invalidate when any key component changes. For hottest tenants, prewarm the cache on patch write.
Trap: Recomputing resolution on every call. At a thousand QPS per tenant, it shows up as CPU.
Q6. A tenant has a hundred patches accumulated over three years. What is the operational risk? A. The tenant's patch file has become its own shadow prompt. The base team cannot reason about its effective prompt. Quarterly audits should prune dead patches, consolidate overlapping appends, and re-review patches that touch sensitive sections. If a tenant's patch count exceeds a threshold, treat it as a candidate for a promoted variant — a sanctioned named base derived from the standard one. Trap: Letting patch files grow unbounded. They become unrecoverable forks-in-disguise.
Q7. Two tenants want exactly opposite tone instructions. How does the system handle that? A. It is not a conflict. Each tenant resolves in isolation. Tenant A's effective prompt has Tenant A's tone. Tenant B's effective prompt has Tenant B's. There is no shared mutable state between them. The only conflict surface is between one tenant and the base, and that is governed by section permissions. Trap: Assuming patches share scope. They never do.
Q8. How do per-tenant evals stay tractable at a hundred customers? A. Three levers. Parallelize by tenant — each tenant's eval is independent. Sample a representative tenant slice on every commit and run the full slate nightly. Cache eval inputs and prompt-render outputs aggressively. The goal is not perfect coverage per keystroke. It is that no base change reaches production unbounded by the worst-case tenant regression. Trap: "We only eval the base." That is exactly how customer X's footer disappears.
Apply now (5 min)¶
Step 1 — model first. Take a base support prompt with five sections — role, tone, constraints, signoff, footer. Write a tenant patch file with one replace on tone, one append on banned_phrases, one replace on signoff. Resolve mentally — write out the effective prompt.
Step 2 — your turn. Pick one prompt in your system that has at least two tenants. Identify which lines belong to the base and which to per-tenant overrides. Sketch the section list. Mark each section patchable: true or false with allowed operations.
Step 3 — sketch from memory. Redraw the resolution diagram from section 4 — tenant_id and base SHA flow in, effective prompt and effective SHA flow out, cache sits beside it. If you can draw it without looking, you have the model.
Bridge. Patches give you fine-grained per-tenant control. But you still need a coarse-grained way to ramp a prompt change across all tenants — first 1% of traffic, then 5%, then 25%, with a kill switch on standby. That is the feature-flag layer. Next. → 10-prompt-feature-flags.md