13. Honest admission: what prompt ops still does not solve
~12 min read. Mature prompt ops engineers sound calm because they have already named the eight problems they cannot solve. Honesty about boundaries is the senior signal.
Built on the ELI5 in 00-eli5.md. We still have the recipe, the recipe book, the SHA, the rollback, the taste test, the trial bake, the bakery log, the customer's recipe. This final lesson is an honest accounting of where those tools still fail.
1) Prompt drift across model upgrades is unsolved
Your prompt has a SHA. The eval suite passes against the current model. Then the supplier ships a minor update to the model — same name, same tier, slightly different behavior. The prompt SHA is unchanged. The eval suite may still pass. But the production behavior shifts in ways the eval did not measure.
The honest position — re-evaluating every prompt on every minor model update is the right answer, and almost nobody does it because it costs too much. The compromise is targeted re-evaluation of the highest-stakes prompts and continuous monitoring on the rest. The compromise leaks behavior changes through the cracks. There is no clean solution.
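One way to make the targeted re-evaluation compromise concrete is to record which model snapshot each prompt's eval last passed against, then split stale prompts by stakes. The sketch below is a minimal illustration, not a reference implementation. `PromptRecord`, the `stakes` tiers, and the snapshot strings are all hypothetical. It also assumes the supplier exposes a snapshot identifier that changes when behavior changes, which is exactly what a silent update under the same name does not give you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRecord:
    prompt_id: str
    sha: str
    stakes: str        # "high" or "low" (hypothetical tiers)
    evaluated_on: str  # model snapshot the eval suite last passed against


def plan_reevaluation(records, current_model):
    """Split stale prompts into 're-evaluate now' and 'monitor only' buckets."""
    reevaluate, monitor = [], []
    for record in records:
        if record.evaluated_on == current_model:
            continue  # eval result is still tied to the live model
        bucket = reevaluate if record.stakes == "high" else monitor
        bucket.append(record.prompt_id)
    return reevaluate, monitor


records = [
    PromptRecord("refund-agent", "a1b2c3", "high", "model-2024-05"),
    PromptRecord("ticket-summary", "d4e5f6", "low", "model-2024-05"),
    PromptRecord("greeting", "0a0b0c", "low", "model-2024-08"),
]
reevaluate, monitor = plan_reevaluation(records, "model-2024-08")
print(reevaluate)  # ['refund-agent']
print(monitor)     # ['ticket-summary']
```

Everything in the monitor bucket is where behavior changes can still leak through. The sketch organizes the compromise; it does not remove it.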
2) Multi-modal prompts make versioning much harder
A prompt that takes text and one image is one thing. A prompt that takes text and seven images and a system instruction and a tool schema becomes a multi-headed thing to version. The SHA covers the text. But what about the images? Their content, their order, their resolution, their alt-text?
Some teams version the prompt + the image manifest together. Others treat images as runtime inputs and accept that the same SHA can produce different outputs depending on which images arrived. Both compromises have real failure modes.
Audio prompts and video prompts compound the problem. The recipe book was designed for text. Multi-modal recipes break the abstraction.
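The "version the prompt plus the image manifest together" option can be sketched as a single SHA over a canonical manifest. This is a minimal sketch under stated assumptions: `prompt_config_sha` and the manifest layout are hypothetical, the byte strings stand in for real image files, and only assets known at authoring time are covered. Runtime images supplied by users are outside its reach.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def prompt_config_sha(text: str, assets: list[tuple[str, bytes]], tool_schema: dict) -> str:
    """Hash the prompt text, the ordered asset list, and the tool schema together."""
    manifest = {
        "text_sha": sha256_hex(text.encode("utf-8")),
        # List order is preserved, so reordering the images changes the SHA.
        "assets": [{"name": name, "sha": sha256_hex(blob)} for name, blob in assets],
        "tool_schema_sha": sha256_hex(
            json.dumps(tool_schema, sort_keys=True).encode("utf-8")
        ),
    }
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return sha256_hex(canonical)


text = "Compare the two diagrams."
img_a, img_b = b"placeholder-bytes-a", b"placeholder-bytes-b"
schema = {"name": "report", "parameters": {"type": "object"}}

sha_ab = prompt_config_sha(text, [("a.png", img_a), ("b.png", img_b)], schema)
sha_ba = prompt_config_sha(text, [("b.png", img_b), ("a.png", img_a)], schema)
print(sha_ab == sha_ba)  # False: same text, same images, different order
```

Resolution and alt-text would need their own manifest fields if they affect output. The sketch hashes raw bytes and names only.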
3) Long-context prompts interact with caching in ways nobody has fully solved
Prompt caching can cut the cost of the cached portion sharply (up to about 90% on some suppliers; the discount varies). To benefit, the cached portion must be stable: same prefix, same tokens, same model. Now consider a multi-tenant prompt where the base is stable but the per-tenant patch differs. Cache hit rate depends on the order in which you assemble the parts.
Some teams structure the prompt for cache stability: put the global base first, the tenant override next, and the user input at the very end. This works for some patch types and breaks for others. There is no clean rule because cache mechanics differ across suppliers.
Long-context prompts (50K+ tokens) make this worse — the bigger the cache, the bigger the cost of a cache miss, the bigger the incentive to optimize. But optimizing for cache locks you into prompt structures that may not be optimal for clarity or maintainability.
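The ordering effect is easy to see by measuring how long a prefix two tenants' assembled prompts share. The sketch below is a toy illustration with assumptions called out: whitespace splitting stands in for a real tokenizer, the prompt strings are invented, and real caches add supplier-specific rules (minimum cacheable length, block boundaries, expiry) that this ignores.

```python
def assemble(parts: list[str]) -> list[str]:
    """Concatenate prompt parts; whitespace splitting stands in for a tokenizer."""
    tokens = []
    for part in parts:
        tokens.extend(part.split())
    return tokens


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count leading tokens that are identical in both sequences."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


base = "You are a support assistant. Answer briefly and cite the policy."
tenant_a = "Use a formal tone."
tenant_b = "Use a casual tone."
user = "Where is my order?"

base_first = shared_prefix_len(
    assemble([base, tenant_a, user]), assemble([base, tenant_b, user])
)
tenant_first = shared_prefix_len(
    assemble([tenant_a, base, user]), assemble([tenant_b, base, user])
)
print(base_first)    # 13: the whole base plus the start of the override
print(tenant_first)  # 2: the prefix diverges inside the override
```

With the base first, every tenant shares the base as a cacheable prefix. With the override first, the shared prefix ends almost immediately, so the base is re-processed per tenant.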
4) Prompt search and reuse: there is no good answer
Your registry has 800 prompts across 40 services. Someone needs a prompt for "summarize a long support ticket." Is there already one? Probably yes, two or three. Which one? Hard to say. Most registries do not support semantic search across prompts. Even the ones that do struggle to keep the index fresh.
The result — duplicate prompts proliferate. Service A's summarization prompt evolves separately from Service B's near-identical one. Bug fixes do not propagate. Evals are duplicated. Reviewers split attention.
There is no clean solution. Some teams enforce a manual catalog. Some build a semantic-search-over-prompts feature. Neither scales past a few hundred prompts without ongoing curation work.
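A cheap first pass at finding duplicates does not need embeddings. The sketch below uses character-level similarity from the standard library's `difflib` as a stand-in for semantic search. It catches near-copies, not paraphrases. The prompt IDs, texts, and the 0.8 threshold are illustrative assumptions. Comparing every pair is quadratic, which is acceptable for a few hundred prompts and a poor fit beyond that.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(prompts: dict[str, str], threshold: float = 0.8):
    """Return (id_a, id_b, similarity) for prompt pairs at or above the threshold."""
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(prompts.items(), 2):
        ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((id_a, id_b, round(ratio, 2)))
    return pairs


prompts = {
    "svc-a/summarize": "Summarize the support ticket below in three bullet points.",
    "svc-b/summarize": "Summarize the support ticket below in three short bullet points.",
    "svc-c/translate": "Translate the following text into French.",
}
for id_a, id_b, score in near_duplicates(prompts):
    print(id_a, id_b, score)  # flags the two summarization prompts only
```

The output is a review queue for a human curator, not an automatic merge list. Two prompts can look alike and still serve different intents.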
5) AI assistants editing prompts inside IDEs bypass the registry
A developer is pair-programming with Claude or Copilot. They ask the AI assistant to "fix the prompt in this file." The assistant edits the prompt string in the Python source. The edit lands in the next commit. The registry is bypassed. The review gate is bypassed. The SHA is whatever the new string hashes to, but nobody notices.
This is the modern version of the 11pm-Tuesday paste — except the editor is an AI. The fix is to move prompts out of source code entirely (into the registry, referenced by ID). Most teams have not made that migration yet. Many never will, because the cost of refactoring is real and the benefit feels abstract until the first incident.
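Until the migration happens, a CI check can at least surface prompts that still live in source. The sketch below walks a Python file's syntax tree and flags long string literals assigned to names containing "prompt". It is a heuristic under assumptions: the name-matching rule, the 40-character cutoff, and the sample source are invented for illustration, and it misses f-strings, annotated assignments, and prompts passed inline as arguments.

```python
import ast


def find_inline_prompts(source: str, min_chars: int = 40):
    """Return (line, name) for long string literals assigned to *prompt* names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            continue
        if len(value.value) < min_chars:
            continue  # short strings are more likely IDs than prompt bodies
        for target in node.targets:
            if isinstance(target, ast.Name) and "prompt" in target.id.lower():
                findings.append((node.lineno, target.id))
    return findings


sample = '''
SUMMARY_PROMPT = "You are a helpful assistant. Summarize the ticket in three bullets."
PROMPT_ID = "support/summary@v12"
timeout = 30
'''
findings = find_inline_prompts(sample)
print(findings)  # [(2, 'SUMMARY_PROMPT')]
```

The registry reference on the second line of the sample passes the check because it is short. That is the shape the structural fix aims for: code holds an ID, the registry holds the text.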
6) Cross-team coordination is harder than the tooling pretends
Two teams share a base prompt. Team A wants to make it more formal. Team B wants to make it more concise. Both teams' customers have legitimately different needs. The prompt registry has version control, but version control does not resolve conflicting product requirements.
The resolution is organizational, not technical — a prompt owner, a steering meeting, a documented intent for the prompt. Tools do not solve it. The teams that handle this well have a "prompt-owner" role for every shared prompt, similar to a DRI for shared infrastructure.
Even with the role, the resolution is slow and political. The tooling does not help.
7) Prompt evals are themselves prompts: turtles all the way down
LLM-as-judge evals are prompts. The judge prompt is also versioned, also evaluated, also subject to drift. The eval-for-the-eval is itself a judge prompt that has the same biases as the original judge.
The pragmatic stop — human-labeled calibration sets ground the system. Quarterly re-calibration catches judge drift. But the chain has weak links, and a sufficiently sophisticated regression can hide in the judge-of-judges layer.
Nobody has fully solved this. The honest position — your eval system has uncertainty you cannot eliminate, only manage.
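Grounding the judge in human labels comes down to a simple agreement check. The sketch below computes raw judge-human agreement on a gold set and flags the judge when it falls under a threshold. The labels are invented, the 0.8 default is an illustrative assumption rather than a standard, and raw agreement ignores chance agreement, which a statistic such as Cohen's kappa would correct for.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of gold-set examples where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_needs_recalibration(judge_labels, human_labels, threshold=0.8):
    """True when agreement with the human gold set drops below the threshold."""
    return agreement_rate(judge_labels, human_labels) < threshold


human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass", "fail", "pass", "fail", "pass"]

print(agreement_rate(judge, human))             # 0.7
print(judge_needs_recalibration(judge, human))  # True
```

The check is only as good as the gold set. If the human labels go stale, the measurement drifts along with the judge.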
8) Multi-tenant prompt fragmentation can grow faster than ops can manage
The patches-on-base model from the multi-tenant chapter (09-multi-tenant-prompts.md) works well at 50 tenants and starts to crack at 500. By 2000 tenants with personalized prompts, even the patch model produces a combinatorial mess. Patches overlap. Conflicts emerge. Evals must run per tenant.
Some teams cap the number of tenants who can have custom prompts. Others build automated patch analysis to detect overlap. Some accept the cost and run eval suites per-tenant nightly. There is no clean answer at scale.
The honest position — multi-tenant prompts at scale require ongoing curation work that grows with the customer base. It does not commoditize.
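Automated patch analysis can start small. The sketch below assumes a hypothetical patch format in which each tenant patch maps a base-prompt section name to override text. Real patch models may be diffs or templates, and the tenant names are invented. It reports identical overrides shared by several tenants (consolidation candidates) and sections where tenants diverge (where per-tenant evals matter most).

```python
from collections import defaultdict


def analyze_patches(patches: dict[str, dict[str, str]]):
    """Group tenant patches by the base section they override."""
    by_section = defaultdict(lambda: defaultdict(list))
    for tenant, patch in patches.items():
        for section, override in patch.items():
            by_section[section][override].append(tenant)

    consolidate, divergent = [], []
    for section, variants in by_section.items():
        for tenants in variants.values():
            if len(tenants) > 1:
                # Same override text used by several tenants: merge candidate.
                consolidate.append((section, sorted(tenants)))
        if len(variants) > 1:
            # Tenants disagree on this section: needs per-variant evaluation.
            divergent.append(section)
    return consolidate, sorted(divergent)


patches = {
    "acme": {"tone": "Be formal.", "signoff": "Regards, Acme Support"},
    "globex": {"tone": "Be formal."},
    "initech": {"tone": "Be casual."},
}
consolidate, divergent = analyze_patches(patches)
print(consolidate)  # [('tone', ['acme', 'globex'])]
print(divergent)    # ['tone']
```

Exact-match grouping only finds byte-identical overrides. Detecting patches that mean the same thing in different words is the hard part, and this sketch does not attempt it.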
Where this lives in the wild
- Anthropic model deprecation notices — surface drift problem (#1).
- OpenAI's model spec evolution — same.
- Google Gemini quarterly updates — same.
- Langfuse multi-modal traces — surfaces but does not solve image/audio versioning (#2).
- Anthropic prompt caching docs — explicit on cache stability constraints (#3).
- OpenAI prompt caching — similar mechanics with different cache key behavior.
- Vellum's prompt library — example of curated central catalog (#4 partial mitigation).
- PromptHub — community prompt sharing, surfaces the search problem.
- GitHub Copilot's prompt editing — example of AI-assistant editing surfacing inside IDEs (#5).
- Cursor's prompt mode — same.
- Anthropic Workbench — partial answer to AI-edits-prompts by keeping prompts in a managed surface.
- Notion AI's prompt customization — multi-tenant customization at consumer scale (#8).
- Glean's tenant-specific prompts — enterprise-scale multi-tenant fragmentation.
- Harvey's customer-specific legal prompts — same in legal vertical.
- OpenAI's evals repo — judge prompts are themselves committed and versioned (#7).
- Braintrust's calibration sets — quarterly recalibration practice.
- Promptfoo's drift detection — production drift comparison.
- LangSmith's continuous eval — production trace eval to catch silent drift.
- The OpenAI Forum, Anthropic Discord — where prompt-ops practitioners share what is not working.
- r/LocalLLaMA, r/MachineLearning — practitioner-community surfacing of all eight admissions.
- EU AI Act audit requirements — push for stronger prompt audit trails, intersect with #5 and #6.
- NIST AI RMF — risk management framework that touches drift and audit.
- SOC2 controls for AI systems — emerging area; many controls map to admissions #5 and #6.
Pause and recall
- Why is re-evaluating every prompt on every model update right in theory and rarely done in practice?
- What makes multi-modal prompt versioning harder than text-only?
- How do caching mechanics interact with multi-tenant prompts in ways that create trade-offs?
- Why does the proliferation of similar prompts across services not solve itself?
- What is the "AI-assistant editing prompts in IDEs" problem and what is the structural fix?
- Why is the prompt-owner role organizational, not technical?
- What does it mean to say "prompt evals are themselves prompts"?
Interview Q&A
Q1. What does mature prompt ops not solve? A. Eight things. (1) Drift on minor model updates. (2) Multi-modal prompt versioning. (3) Long-context caching interactions. (4) Prompt search and reuse. (5) AI-assistant edits bypassing the registry. (6) Cross-team coordination on shared prompts. (7) The recursive judge-of-judges problem. (8) Multi-tenant fragmentation at scale. Naming these honestly is what makes a senior candidate sound senior. Trap: "Modern tooling has solved all of this." It has not. Saying so signals lack of production experience.
Q2. Your model upgraded silently and behavior changed. How do you detect it? A. Three layers. (1) Continuous eval on production traces: daily score distribution, with alerts on drops. (2) Output-shape monitoring: length, JSON validity, tool-call rate. (3) User-facing metric monitoring: CSAT, complaint rate, conversion. Each layer catches different drifts. None catches all. Trap: "We re-run the eval suite on every model update." That works in theory; almost nobody does it in practice because of cost.
Q3. How do you version a multi-modal prompt? A. Version the text + the manifest of accompanying assets (images, audio, schema). The asset content has its own SHA. The combined SHA covers the full prompt configuration. The compromise: runtime inputs (user-supplied images) cannot be SHA'd in advance, so the prompt SHA does not cover them; their effect on output is unbounded. Mature teams accept this and monitor output drift continuously. Trap: "Just SHA the text." Image content can shift output behavior dramatically.
Q4. Your registry has 500 prompts and developers cannot find the right one. What do you do? A. Three interventions, in order. (1) Audit the registry for duplicates and merge what can be merged. (2) Add a manual catalog with tags and use cases. (3) Build a semantic-search-over-prompts feature using embeddings. The first two are cheap and catch most of the pain. The third pays off only after the team commits to ongoing index maintenance. Trap: "Just have everyone search the registry." Without curation, the registry becomes write-only.
Q5. AI assistants are editing prompts inside source code. How do you contain it? A. The structural fix is to move prompts out of source code entirely. Prompts live in the registry; code references them by ID. The AI assistant can suggest a registry change, but the change goes through the review gate. If the structural fix is too big, the interim fix is CODEOWNERS rules that force review on prompt-string changes anywhere in source. Trap: "Train developers to be careful." Cultural fixes do not scale; structural fixes do.
Q6. Two teams want different versions of a shared prompt. How do you resolve? A. Organizationally. Assign a prompt-owner (DRI) for the shared prompt. They are accountable for the prompt's evolution. Teams with different needs propose changes; the owner evaluates against the prompt's stated intent. If the conflict is fundamental, fork the prompt (one for team A, one for team B) and accept the maintenance cost. Trap: "Tooling will solve this." Tooling surfaces the conflict; resolution is still political.
Q7. The judge model has been drifting silently. How do you detect and fix? A. Quarterly re-calibration against a human-labeled gold set of 50-100 examples. If judge-human agreement drops below a threshold (commonly 80%), recalibrate or replace the judge. Use an ensemble of 3 judges across model families to reduce single-judge drift. Accept that some residual uncertainty in the eval pipeline is unavoidable. Trap: "Use a stronger judge." Strength does not protect against drift; the calibration practice does.
Q8. Multi-tenant fragmentation is becoming unmanageable. What do you do? A. Three options. (1) Cap the number of tenants who can have custom prompts; offer a "tiers" model where only enterprise tenants get customization. (2) Build automated patch analysis to flag overlap and suggest consolidations. (3) Accept the cost and run per-tenant evals nightly. Most teams use some combination of (1) and (3). Pure (2) is hard and not yet a solved tooling problem. Trap: "We will let every tenant have any customization they want." That works at 50 tenants and breaks at 500.
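The output-shape layer from Q2 is the cheapest of the three to prototype. The sketch below compares mean length and JSON validity between a baseline window and a current window of outputs. It is a minimal illustration with assumptions: the function names, the sample outputs, and the 20% relative tolerance are invented, and tool-call rate is left out because it depends on the supplier's response format.

```python
import json


def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def output_shape(outputs: list[str]) -> dict:
    """Summarize a window of model outputs by length and JSON validity."""
    return {
        "mean_chars": sum(len(o) for o in outputs) / len(outputs),
        "json_valid_rate": sum(is_json(o) for o in outputs) / len(outputs),
    }


def shape_alerts(baseline: dict, current: dict, tolerance: float = 0.2):
    """Return metric names whose relative change exceeds the tolerance."""
    alerts = []
    for key, base_value in baseline.items():
        if base_value == 0:
            continue  # relative change is undefined; handle separately
        if abs(current[key] - base_value) / base_value > tolerance:
            alerts.append(key)
    return alerts


baseline = output_shape(['{"answer": "ok"}', '{"answer": "yes"}'])
current = output_shape(['{"answer": "ok"}', "Sure! Here is the answer: ok"])
print(shape_alerts(baseline, current))  # ['mean_chars', 'json_valid_rate']
```

Shape metrics catch a model that starts wrapping JSON in chatty prose. They say nothing about whether well-formed answers got worse, which is why the other two layers exist.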
Apply now (5 min)
Step 1 — name your top three. Of the eight admissions in this chapter, which three are most acute for your team's current state? Be honest — name them.
Step 2 — rate the cost of each. For each of your top three, estimate the worst-case incident cost. A drift in a critical agent prompt? A multi-tenant fragmentation that adds a manual hour per week per tenant? An AI-assistant edit that lands a regression?
Step 3 — pick one to address this quarter. Not three. One. Schedule the work. The other two go on the risk register.
The mature stance is knowing what you cannot fix and choosing which one to chip at next. The teams that pretend they have solved all eight are the ones who get burned.
Bridge. Prompt ops gives you control over the most editable surface in your AI system. But control is not the same as quality. The next module is about how you actually measure whether the system is working — evals, but at production scale. → ../../04_ai_product_evals/00_ai_evals_release_gates/00-eli5.md