01. Prompts as code — why the string in your source file is a production asset

~14 min read. A prompt is not a string literal. It is load-bearing config that runs every time the model speaks for your company. The moment you accept this, half of prompt ops becomes obvious.

Builds on 00-eli5.md. The tired cook at 2 a.m. crosses out "two pinches salt" and writes "two tablespoons". The bread is ruined. The next chapters explain why this is a tooling problem, not a discipline problem.


1) Hook — two versions of the same line, two very different Fridays

Look. Here is a file from a real support-bot codebase. The team is six engineers; the system serves about forty thousand users a day. The prompt that drives the greeter agent looks like this:

# app/agents/support.py

def build_system_prompt(user_name: str) -> str:
    return f"""You are a helpful support agent for Acme.
Always greet the user by name. The user's name is {user_name}.
Always end with "Is there anything else I can help with?"
"""

A junior engineer is fixing a bug in pagination two screens away. While he is in the file, he notices the greeter "feels too formal" — he deletes the line about greeting by name. The PR is titled "fix: pagination off-by-one on inbox page". The reviewer sees pagination changes, sees a one-line deletion in the agent prompt, assumes it is intentional cleanup, approves. Merge to main. Deploy at 17:42 on a Friday.

By Monday, complaints arrive. "The bot stopped using my name." "It feels like talking to a robot." The CSAT score for the week is down four points. Nobody on the engineering team remembers the prompt change. The PR title said pagination. Nobody thought to grep for it.

Now imagine the same line lived in a registry, behind a review gate, behind an eval suite. The deletion would have been a separate change request. Two reviewers would have looked at it. The greeter eval suite would have flagged a twelve-percent drop in tone scores. The change would never have shipped.

Same engineer. Same intent. Two very different Fridays. The thing that differs is not skill or care. The thing that differs is whether the recipe lives inside the source file or inside the recipe book.


2) The metaphor — config that does not look like config

Most engineers know the rule about config. Database URLs do not live in source. Feature flags do not live in source. API keys do not live in source. All of these get pulled out into environment variables, secret stores, flag systems. The reason is the same in every case — they have a different lifecycle from the code. They can change without a deploy. They can vary by environment. They are edited by people who are not the engineers shipping the code.

A prompt has all three properties.

It changes without a deploy when product wants to soften the tone for an enterprise customer. It varies by environment when staging runs a more verbose version for debugging. It is edited by people who are not engineers — product managers, customer success leads, even legal when a disclaimer needs adding.

And yet the prompt sits in app/agents/support.py as an f-string. The mismatch is not academic. Every time someone wants to change the prompt, the only path available is to open the Python file, change the string, raise a PR, wait for review, merge, and deploy. That path is so slow that smart, frustrated people start working around it: pasting the new prompt into the production database, hot-patching the running process, editing the file on the production server during an incident. The shortcut becomes the norm. The audit trail dies.

A prompt is the recipe. The recipe must live somewhere a non-engineer can read it without spelunking into source control, somewhere every edit leaves a trace, somewhere the production version is a known SHA that any trace can carry. That somewhere is the recipe book, which is the topic of the next chapter. For now, the point is narrower — the recipe does not belong baked into the oven.


3) The "but it's just a string" objection

Senior engineers see prompt-ops tooling and ask the obvious question. Why is this not just a constant? The answer is uncomfortable and worth saying plainly.

A prompt is load-bearing in a way other config is not. A wrong database URL crashes the app at startup, loud and instant. A wrong feature flag flips a button on or off, visible and obvious. A wrong prompt produces the wrong answer to a customer question, silently, indistinguishably from a correct one. The service returns 200 OK. The latency is normal. The error rate is unchanged. Only the content is wrong, and content is not something your alerting cares about.

This makes prompts the most dangerous kind of config — config that fails silently. The blast radius is wide. The detection lag is long. The diagnostic trail is often missing, because the trace logged the model's output but not the SHA of the prompt that produced it.

That is why "just a string" is wrong. A string with no version is a fire hazard you cannot smell.


4) The four laws — what makes a prompt a managed asset

A prompt becomes a production asset when four properties hold. Fewer than four, you have a string in a file with a fancy name. All four, you have something you can run a service on.

┌───────────────────────────────────────────────────────────┐
│ THE FOUR LAWS OF PROMPTS-AS-CODE                          │
├───────────────────────────────────────────────────────────┤
│ 1. IDENTITY     every prompt has a stable name            │
│ 2. VERSION      every change produces a new SHA           │
│ 3. REVIEWER     every change has a human who approved it  │
│ 4. EVAL         every change passed a known eval suite    │
└───────────────────────────────────────────────────────────┘

Identity sounds trivial until you do not have it. A prompt without a stable name has nowhere to live in metrics, no row in a dashboard, no key in the bakery log. When a complaint trace says "the answer was rude", you cannot answer "which prompt produced this" without one. The naming convention does not need to be fancy — service.role.purpose is plenty, e.g. support.agent.greeter or billing.classifier.refund_intent. What matters is that the name is stable across versions. The greeter is the greeter whether it is v1 or v97.
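A naming convention only helps if something enforces it. The following is a minimal sketch, assuming names are exactly three lowercase dot-separated segments; the regex and the function name is_valid_prompt_name are illustrative and not taken from any registry product.

```python
import re

# Assumed convention: service.role.purpose, each segment lowercase
# letters, digits, or underscores, starting with a letter.
_PROMPT_NAME = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2}$")


def is_valid_prompt_name(name: str) -> bool:
    """Return True if `name` has exactly three well-formed segments."""
    return _PROMPT_NAME.fullmatch(name) is not None


print(is_valid_prompt_name("support.agent.greeter"))             # True
print(is_valid_prompt_name("billing.classifier.refund_intent"))  # True
print(is_valid_prompt_name("support.agent.greeter.v97"))         # False
```

Rejecting a fourth segment is a deliberate choice in this sketch: it keeps version numbers out of the name, so the name stays stable while the version changes.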

Version is the SHA. A content hash of the prompt text plus its metadata. Two distinct contents produce two distinct SHAs for all practical purposes, since collisions in a modern cryptographic hash are vanishingly rare. The same content produces the same SHA forever. The SHA is what gets stamped on every trace, so when a regression appears, the lookup from trace to exact text is a single query. Versioning is the topic of chapter three and the reason rollback is possible at all.
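One way to compute such a content hash is sketched below. It assumes SHA-256 over a canonical JSON encoding of the text and metadata; the chapter does not prescribe an algorithm, and the function name prompt_sha and the metadata fields are hypothetical.

```python
import hashlib
import json


def prompt_sha(text: str, metadata: dict) -> str:
    """Hash the prompt text together with its metadata.

    Canonical JSON (sorted keys, fixed separators) keeps the hash
    independent of the order in which metadata keys were written.
    """
    payload = json.dumps(
        {"text": text, "metadata": metadata},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# "example-model" is a placeholder metadata value, not a real model name.
v1 = prompt_sha("Always greet the user by name.", {"model": "example-model"})
v1_again = prompt_sha("Always greet the user by name.", {"model": "example-model"})
v2 = prompt_sha("Greet the user.", {"model": "example-model"})

print(v1 == v1_again)  # True: same content, same SHA
print(v1 == v2)        # False: different content, different SHA
```

Hashing the metadata along with the text means a change to, say, the target model also produces a new version, which is usually what a trace lookup needs.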

Reviewer is the human gate. Every change to a deployed prompt must have a named approver. Not "the team", not "the channel", but a person whose name is in the audit log. The reviewer's job is not to spell-check. It is to ask what behavior this change alters. Reviewers are covered in chapter four.

Eval is the automated gate. Before a new SHA can be promoted, it must pass a known suite of test cases — the taste test. The eval scores from before and after are recorded against the change. Eval suites are module 24's job; in this module we treat them as a gate that already exists.
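As a rough illustration of what "gate" means here, the sketch below compares the scores of the deployed version against a candidate and blocks promotion on a regression. The EvalResult shape, the metric names, and the 0.02 tolerance are all assumptions made for the example, not part of any eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    prompt_sha: str
    scores: dict[str, float]  # metric name -> score in [0, 1]


def may_promote(baseline: EvalResult, candidate: EvalResult,
                max_drop: float = 0.02) -> tuple[bool, list[str]]:
    """Allow promotion only if no baseline metric drops by more than max_drop."""
    failures = []
    for metric, before in baseline.scores.items():
        after = candidate.scores.get(metric)
        if after is None:
            failures.append(f"{metric}: missing from candidate run")
        elif before - after > max_drop:
            failures.append(f"{metric}: {before:.2f} -> {after:.2f}")
    return (not failures, failures)


baseline = EvalResult("sha-old", {"tone": 0.91, "format": 0.99})
candidate = EvalResult("sha-new", {"tone": 0.79, "format": 0.99})
ok, reasons = may_promote(baseline, candidate)
print(ok)       # False
print(reasons)  # ['tone: 0.91 -> 0.79']
```

Both results carry a prompt_sha, so the before and after scores can be recorded against the change as the paragraph above describes.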

Hold all four, the system survives a Friday-evening deploy. Drop any one, you are back in the bakery with the tired cook.


5) The strings-in-source antipattern, and the refactor

Here is the antipattern in the wild. Pulled from a real codebase, names changed.

# app/agents/refund.py  (BEFORE)

REFUND_SYSTEM = """You are the refund agent.
Confirm the order ID, the amount, and the reason before issuing a refund.
Never refund more than the order total.
If the user is angry, apologize first."""

REFUND_FEW_SHOT = [
    {"role": "user", "content": "I want a refund for order 4481"},
    {"role": "assistant", "content": "I can help with that. Can you confirm..."}
]

def build_messages(user_msg: str) -> list[dict]:
    return [
        {"role": "system", "content": REFUND_SYSTEM},
        *REFUND_FEW_SHOT,
        {"role": "user", "content": user_msg},
    ]

The problems are dense. The system prompt is a triple-quoted string with no version. The few-shot examples live in a separate constant and might drift from the system prompt without notice. There is no name on either constant beyond the Python identifier — no prompt_name that survives outside this file. If someone changes one line of REFUND_SYSTEM, the diff is buried inside a PR that probably also changes routing logic. Every trace this code produces will log the model's output, but nothing that ties back to which version of the prompt ran.

Now the same module after the smallest reasonable refactor.

# app/agents/refund.py  (AFTER)

from app.prompts import registry

def build_messages(user_msg: str) -> list[dict]:
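    # load() returns an object exposing .sha, .system, and .few_shot
    prompt = registry.load("billing.agent.refund")
    return [
        # "metadata" carries the SHA for tracing. A provider API may reject
        # unknown message keys, so strip it before the model call if needed.
        {"role": "system", "content": prompt.system,
         "metadata": {"prompt_sha": prompt.sha}},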
        *prompt.few_shot,
        {"role": "user", "content": user_msg},
    ]

Four things change. The text moves out of the Python module entirely. The identity is now a name string, billing.agent.refund, that survives across versions. The version is a SHA returned by the registry. The trace carries the SHA so any future complaint can be linked to the exact text that ran. As a consequence, a wording change no longer has to ship in the same PR as the refund logic.

The refactor is small. The implications are large. Every change to the prompt now flows through one chokepoint, the registry, and a chokepoint is where identity, version, reviewer, and eval gates can be installed.
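To make the registry.load call above concrete, here is a toy in-memory registry. It is a sketch under stated assumptions: the class names Prompt and InMemoryRegistry, the publish method, and the SHA-256 choice are all hypothetical, and a real registry would add storage, version history, and the reviewer and eval gates.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    name: str
    sha: str
    system: str
    few_shot: tuple  # tuple of {"role": ..., "content": ...} dicts


class InMemoryRegistry:
    """Toy registry: one deployed version per stable name."""

    def __init__(self) -> None:
        self._deployed: dict[str, Prompt] = {}

    def publish(self, name: str, system: str, few_shot: list[dict]) -> Prompt:
        # System text and few-shot examples are hashed together, so they
        # cannot drift apart under one SHA.
        payload = json.dumps({"system": system, "few_shot": few_shot},
                             sort_keys=True)
        sha = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        prompt = Prompt(name, sha, system, tuple(few_shot))
        self._deployed[name] = prompt
        return prompt

    def load(self, name: str) -> Prompt:
        # KeyError on an unknown name is deliberate: fail loudly.
        return self._deployed[name]


registry = InMemoryRegistry()
registry.publish(
    "billing.agent.refund",
    system="You are the refund agent.",
    few_shot=[{"role": "user", "content": "I want a refund for order 4481"}],
)

prompt = registry.load("billing.agent.refund")
messages = [
    {"role": "system", "content": prompt.system},
    *prompt.few_shot,
    {"role": "user", "content": "Where is my refund?"},
]
print(prompt.sha[:12], len(messages))
```

Treating the system text and the few-shot block as one artifact with one SHA addresses the drift problem called out in the BEFORE version.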


6) The hidden invariants nobody documents

A prompt looks like prose. It is not. Buried inside that prose are invariants that downstream code depends on, often without anyone realising. The reason prompt edits go wrong so often is that the editor sees only the prose and edits like prose, while the downstream parser is reading like code.

A small list of invariants that have caused real production incidents:

The output format the downstream parser expects. If the prompt says "reply with a JSON object with keys 'intent' and 'confidence'", the parser is doing json.loads(response).get("intent"). The day someone rewords this to "tell me the intent and how confident you are", the parser breaks. The output went from JSON to prose. The string change looks innocent. The system breaks at the parsing layer with a JSONDecodeError, or worse, the parser silently returns None and the caller misroutes the ticket.

The few-shot examples that anchor behavior. Few-shot examples are not decorative. They teach the model the exact tone, format, and decision shape you want. Deleting one because "the prompt is getting long" can shift behavior by many percentage points on the eval suite. The deleter never knew the example was load-bearing.

The tool descriptions the agent needs. If the prompt enumerates the agent's tools — "you have access to lookup_order, issue_refund, and escalate" — removing one of those names from the prose can make the model stop calling that tool, even though the tool is still wired in. The agent's behavior changed; the code did not.

The negative constraints. The line "never refund more than the order total" is doing work. Remove it because it feels redundant with the business logic, and the model will, sometimes, propose a refund larger than the order. The business logic will catch it. The customer will see the model's wrong suggestion before the catch. CSAT drops without an alert firing.

None of these invariants live in code review checklists. They live in engineers' heads, and they leave with the engineers. A registry with eval gates makes the invariants enforceable rather than tribal.
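Two of the invariants above can be checked mechanically. The sketch below shows a strict parser that raises instead of silently returning None, and a lint that flags wired tools the prompt prose no longer mentions. The function names are hypothetical, and the substring check is deliberately crude.

```python
import json


def parse_intent(response: str) -> dict:
    """Parse the model reply strictly; raise instead of returning None."""
    data = json.loads(response)  # raises json.JSONDecodeError on prose replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"intent", "confidence"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def missing_tool_mentions(prompt_text: str, wired_tools: list[str]) -> list[str]:
    """Return wired tool names that the prompt prose no longer mentions."""
    return [tool for tool in wired_tools if tool not in prompt_text]


print(parse_intent('{"intent": "refund", "confidence": 0.93}'))

edited = "You have access to lookup_order and escalate."
print(missing_tool_mentions(edited, ["lookup_order", "issue_refund", "escalate"]))
# ['issue_refund']
```

Run against a candidate prompt before promotion, checks like these turn a tribal invariant into a failing test.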


Mid-content recall

  1. Why does a wrong prompt fail more dangerously than a wrong database URL?
  2. Which of the four laws (identity, version, reviewer, eval) are missing when a prompt is a Python f-string?
  3. What is the smallest change to a few-shot block that can move eval scores by several percentage points?

7) The PR-bundling mistake — and why it kills postmortems

Of all the antipatterns this chapter could warn against, one matters more than the others — bundling a prompt edit with a code refactor in the same PR. The opening hook was an example. The reason it deserves its own section is that even teams who have moved prompts into a registry keep bundling.

The shape of the mistake is always the same. An engineer is in a file for reason X. While there, she also fixes a typo in a prompt, or shortens a sentence, or adds a few-shot example she has been meaning to add. The PR ships both. The next time the eval suite catches a regression, the bisection is broken — the bad commit contains two changes, and there is no easy way to tell which one caused the regression.

The rule that experienced AI teams converge on is sharp. Prompt changes are their own PRs. No exceptions for typos. No exceptions for "while I am in this file". The cost of a separate PR is small. The cost of an unbisectable postmortem is large.

The corollary is that prompt changes go through a different review path. Code changes go through code reviewers who care about types, performance, and architecture. Prompt changes go through prompt reviewers who care about tone, eval scores, and downstream parser invariants. The two audiences are not the same. The two review templates are not the same.
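The separate-PR rule can be enforced rather than remembered. The sketch below assumes prompt artifacts are files under a top-level prompts/ directory in the repository, which is an assumption of this example and not something the chapter requires; the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: prompt artifacts live under a top-level prompts/ directory.
PROMPT_ROOT = "prompts"


def is_prompt_file(path: str) -> bool:
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] == PROMPT_ROOT


def check_not_bundled(changed_paths: list[str]) -> list[str]:
    """Return error messages; an empty list means the change set is clean."""
    prompt_files = [p for p in changed_paths if is_prompt_file(p)]
    other_files = [p for p in changed_paths if not is_prompt_file(p)]
    if prompt_files and other_files:
        return [
            "prompt and code changes must ship in separate PRs: "
            f"prompts={prompt_files} code={other_files}"
        ]
    return []


print(check_not_bundled(["app/inbox/pagination.py"]))             # []
print(check_not_bundled(["prompts/support.agent.greeter.yaml"]))  # []
print(len(check_not_bundled(["app/inbox/pagination.py",
                             "prompts/support.agent.greeter.yaml"])))  # 1
```

In CI, the list of changed paths could come from something like git diff --name-only against the target branch, and the same split could decide which review template and reviewers a PR is routed to.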


8) Failure modes — where strings-in-source leaks

SYMPTOM                                  FIX
───────                                  ───
"who changed this?" with no answer   →   identity + reviewer in registry
trace shows wrong answer, no SHA     →   stamp prompt_sha on every trace
prompt edit shipped in pagination PR →   prompt changes are their own PRs
few-shot drifted from system prompt  →   one prompt artifact, one SHA
edit-on-server during incident       →   registry edit, audit-logged
non-engineer cannot read the prompt  →   move prompt out of source
PR diff shows only the new wording   →   diff against last deployed SHA
silent regression after "small fix"  →   eval gate before promote

The fixes are not heroic. They are the four laws applied. Identity plus a named reviewer gives you the answer to "who changed this". Version stamps the SHA on every trace. A separate review path makes the wording diff visible. Eval gates catch the "small fix" before it ships.
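Stamping the SHA on a trace can be as small as two extra fields on the log record. The sketch below is illustrative only: the field names and the build_trace_record helper are assumptions, and the SHA value is a placeholder for whatever the registry returns.

```python
import json
import time
import uuid


def build_trace_record(prompt_name: str, prompt_sha: str,
                       user_msg: str, model_output: str) -> dict:
    """Build one log record that ties an output to the exact prompt version."""
    return {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_name": prompt_name,  # stable across versions
        "prompt_sha": prompt_sha,    # exact text that ran
        "input": user_msg,
        "output": model_output,
    }


record = build_trace_record(
    prompt_name="support.agent.greeter",
    prompt_sha="<sha from the registry>",  # placeholder, not a real hash
    user_msg="Where is my order?",
    model_output="Hi Dana, let me check that for you.",
)
print(json.dumps(record, indent=2))
```

With both fields present, "which prompt produced this answer" becomes a filter on prompt_name and prompt_sha instead of an archaeology exercise.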


Where this lives in the wild

The shift from strings-in-source to managed-prompt is visible across many products and platforms. Each one solves a slice of the same problem.

  • Langfuse — open-source prompt management with versioned prompts, eval hooks, and trace-to-prompt linking baked into the SDK.
  • PromptLayer — wraps OpenAI and Anthropic clients to log every request against a named, versioned prompt template.
  • Pezzo — prompt management with environments (dev/staging/prod) and a visual diff between versions.
  • Helicone — observability platform with prompt versioning and replay against historical traces.
  • Vellum — prompt registry and eval workflow for enterprise teams.
  • Braintrust — prompt-and-eval pairing where every prompt change runs the attached eval suite.
  • LangSmith (LangChain) — hub of named prompts with version pins, callable from the LangChain SDK by name.
  • PromptHub — Git-backed prompt versioning with CI integration.
  • OpenAI Playground / Stored Prompts — first-party prompt storage with version pins callable from the API by ID.
  • Anthropic Workbench — prompt iteration UI with version history per workspace.
  • Vercel AI SDK — typed prompt definitions that can be wrapped behind a registry layer.
  • GitHub Actions for prompt repos — workflows that lint and eval prompt files on every PR.
  • GitLab Merge Requests — used as the review surface for YAML-backed prompt repos.
  • Phabricator — older review tooling still used at large companies for prompt-as-code review.
  • ReviewBoard — niche but in use at teams treating prompts as part of the code review surface.
  • dbt — analytics teams have adopted a similar pattern for SQL models; prompts borrow the same versioned-asset mental model.
  • HashiCorp Vault — used to store the secret half of prompts (API keys referenced from within prompts) separately from the prose half.
  • AWS Parameter Store — common bootstrap home for prompt strings before a team adopts a real registry.
  • AWS Secrets Manager — used for prompt-adjacent credentials with audit trail.
  • ConfigCat — flag platform some teams (mis)use as a prompt store; works for tiny apps, falls over at scale.
  • LaunchDarkly — flag platform with JSON values, sometimes used to gate prompt versions per cohort.
  • Statsig — experiment platform used to A/B test prompts when paired with a real registry.
  • Flagsmith — open-source flag platform used the same way.
  • Optimizely — experimentation surface for prompt A/Bs in larger orgs.
  • Split.io — flag-and-experiment platform for prompt rollouts.
  • GitHub Copilot for PRs — when used carelessly, suggests prompt edits as part of unrelated code changes — the exact bundling antipattern this chapter warns against.

The point of listing this many products is not that you need them all. The point is that somebody, somewhere, has built a tool to solve every slice of this problem. Strings-in-source is not the only option, and it has stopped being the default at any team that has lived through one bad Friday.


Pause and recall

  1. State the four laws of prompts-as-code in order.
  2. Why is "prompts are just strings" the wrong mental model?
  3. Give two examples of hidden invariants buried inside a prompt's prose.
  4. What is the single PR rule that makes postmortems bisectable?
  5. Which of the four laws does a Python f-string fail on first?
  6. Why is a wrong prompt more dangerous than a wrong feature flag?
  7. What is the smallest refactor that moves a prompt from antipattern to managed asset?

Interview Q&A

Q1. Why are prompts not just strings in source?
A. Prompts have a different lifecycle from code: they change without a deploy, they vary by environment, they are edited by non-engineers. They are also load-bearing in a way other strings are not, because a wrong prompt fails silently with a 200 response and the wrong content. Treating them as config gives them identity, version, review, and eval gates, the four properties that make incidents survivable.
Trap: Saying "we use constants". Constants give you at best a file-local name, and none of the other three.

Q2. Walk me through a real prompt-edit incident you would design against.
A. The bundling case. Engineer fixes pagination, also deletes a line from the greeter prompt while in the file. PR title says pagination. Reviewer misses the wording change. CSAT drops on Monday, and nobody bisects to it because the PR title points at pagination. The design fix is that prompt changes are their own PRs, plus an eval gate that would have caught the tone drift on the wording PR before merge.
Trap: Blaming the engineer. The system allowed the bundle; the system is the fix.

Q3. What are the four laws of prompts-as-code?
A. Identity (every prompt has a stable name), version (every change produces a new SHA), reviewer (every change has a human approver), eval (every change passed a known suite). Drop any one and you have a string with extra steps.
Trap: Listing only versioning. Versioning without review still lets a tired cook ship a v18 with no second pair of eyes.

Q4. How is a load-bearing prompt different from a feature flag?
A. Both are config. Both have a lifecycle independent of code. The difference is failure mode. A wrong flag flips a visible button; the failure is loud. A wrong prompt produces wrong content with a 200 OK; the failure is silent and needs eval suites and trace SHAs to detect.
Trap: "Use a feature flag system to store prompts." Possible for small systems; loses the eval-gate and diff-review story for any real product.

Q5. What goes wrong when prompt changes share a PR with code changes?
A. Bisection breaks. When eval scores drop or CSAT regresses, the offending commit contains two unrelated changes. The reviewer at PR time also splits attention, because the code reviewer is not the prompt reviewer. The fix is to mandate separate PRs and route prompt changes through prompt reviewers.
Trap: "It's a one-line change, it's fine." One-line prompt changes can ship outsized regressions; the opening hook was a one-line deletion.

Q6. What invariants in a prompt should an editor know about before changing it?
A. The output format the downstream parser expects, the few-shot examples that anchor tone and behavior, the tool names the agent reaches for, and any negative constraints ("never refund more than..."). All four can break silently when the prompt is edited as if it were only prose.
Trap: Treating prompts as documentation. The model reads them as instructions; the parser reads the output as code; both audiences enforce shape.

Q7. Why does identity (a stable name) matter before versioning?
A. Because traces and metrics need a stable key to attach to. If the prompt's "name" is its Python identifier in a source file, that key dies the moment you rename the variable. A registry-given name like billing.agent.refund survives across refactors and gives dashboards a stable column to chart.
Trap: Inventing a UUID for each version with no shared name. You end up with versions and no series.

Q8. When is "strings-in-source" actually fine?
A. Prototypes, demos, internal tooling with one user, and any system where a wrong prompt has no business consequence. The moment a real customer sees the output, the four laws need to start applying. The migration is small; the trigger is when the first incident is one prompt edit away.
Trap: "We are still small." Most teams ship the antipattern long past the point where it starts to bite.



Apply now (5 min)

Step 1 — find the prompts in your codebase. Grep for triple-quoted strings, f-strings passed to client.messages.create or client.chat.completions.create, and constants named *_PROMPT, *_SYSTEM, SYSTEM_*. Make a list. Count how many of the four laws each one satisfies.

Step 2 — pick the worst offender. Usually it is the one that produces customer-visible output. Write down its name (give it a service.role.purpose name if it does not have one), its current text, and the SHA of that text.

Step 3 — write the refactor. Sketch the two-file diff — the agent file that now calls registry.load("..."), and the registry entry that holds the text. You do not have to ship it yet. The next chapter is the registry itself.
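For the constant-name part of Step 1, a few lines of Python can stand in for grep. This is a rough sketch: the regex covers only module-level assignments to names matching *_PROMPT, *_SYSTEM, or SYSTEM_*, and it will miss prompts built inline, so treat its output as a starting list and not an inventory.

```python
import re

# Matches module-level constants named *_PROMPT, *_SYSTEM, or SYSTEM_*,
# with or without a type annotation before the "=".
PROMPT_CONSTANT = re.compile(
    r"^(?P<name>[A-Z][A-Z0-9_]*_(?:PROMPT|SYSTEM)|SYSTEM_[A-Z0-9_]+)\s*(?::[^=]+)?=",
    re.MULTILINE,
)


def find_prompt_constants(source: str) -> list[str]:
    """Return names of likely prompt constants in one file's source text."""
    return [m.group("name") for m in PROMPT_CONSTANT.finditer(source)]


sample = '''
REFUND_SYSTEM = """You are the refund agent."""
SYSTEM_GREETER = "You are a helpful support agent."
MAX_RETRIES = 3
GREETER_PROMPT: str = "Always greet the user by name."
'''
print(find_prompt_constants(sample))
# ['REFUND_SYSTEM', 'SYSTEM_GREETER', 'GREETER_PROMPT']
```

To scan a repository, the same function could be applied to the text of each .py file, for example by walking pathlib.Path(".").rglob("*.py") and reading each file.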


Bridge. The refactor in section 5 hand-waves over one thing — where does the registry live? Postgres? A YAML file? A SaaS product? Each answer has different tradeoffs for who can edit, how immutable versions are, and how the SHA gets computed. The next chapter is the recipe book itself.

02-the-prompt-registry.md