
05. Prompts out of code

Sequencing is settled. The first structural work begins. Prompts are the system's spec — and in an inherited system they are usually buried in source files, scattered across modules, with near-duplicates that disagree. This chapter is the migration: prompts into a registry, without breaking the running system.


An engineer at a Bengaluru travel-tech company opens the legacy itinerary-explanation feature. The prompts live in three places: a prompts.py with twelve string constants, inline f-strings in two of the request handlers, and one prompt in a Notion page that an engineer copies into code every release "for the latest version." Three of the constants are near-duplicates of each other, edited at different times for different bug fixes; nobody is sure which one is currently used in production. The audit identified the registry migration as a week-three structural item. The engineer's job is to consolidate the prompts, version them, move them out of source, and connect the running system to read from the registry — all without the production behaviour changing on the way.

The eval backstop from chapter 03 is the safety net. The migration is the careful work above it.


What "out of code" means, and why

A prompt registry is a versioned store that owns prompts as artefacts. Each prompt has a name, a version, a body (often a template), parameters, an owner, and a changelog. The running system reads prompts from the registry by name; the code does not contain the prompt text.

Three reasons to do this:

1. Behaviour change without code change. Changing a prompt is changing behaviour. When the prompt lives in code, every behaviour change is a deploy. The registry lets prompt changes happen on their own cadence with their own review.

2. Versioning and rollback. Without a registry, prompts have no history beyond the git log of the code file. A registry records every change explicitly, with an audit trail, and supports rollback per prompt rather than per deploy.

3. Eval gating. Module 13 (prompt lifecycle) and module 04_ai_product_evals build the discipline of prompts-as-config with eval gates. The registry is the implementation surface for that discipline. Without it, the gates have nowhere to live.

For a deep treatment of the production discipline, see module 13. This chapter is the modernisation move that makes the discipline applicable.


The migration plan

Five steps, executed in order over one to two weeks.

Step 1 — Inventory every prompt

Find every place a prompt is constructed. This is harder than it sounds. Prompts appear as:

  • Module-level string constants
  • f-strings inside functions
  • Multi-string concatenations built across files
  • YAML/JSON files loaded at startup
  • Documentation pages copied at deploy time
  • Test fixtures that have become source of truth

The day-one audit (chapter 02) already produced a partial list. Extend it. For each prompt:

  • Where in the code it lives
  • What it does (the intent)
  • What system it talks to (which model, which alias)
  • When it was last changed (git log) and by whom

The result is a table. For the chapter-opening case it might look like:

prompt_name              file:line                  intent                      model
itinerary_explanation_v1 prompts.py:42              explain itinerary           claude-3-opus
itinerary_explanation_v2 handlers/explain.py:128    same, edited for tone       claude-3-opus
itinerary_explanation_v3 prompts.py:67              same, edited for length     claude-3-opus
itinerary_disclaimer     prompts.py:89              attach disclaimer text      n/a (concat)

Three near-duplicates of the same intent, edited at different times. Step 2 disambiguates.
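
Grouping inventory rows by intent is easier when near-duplicates are flagged mechanically. A minimal sketch, assuming the inventoried prompt texts have been collected into a dict keyed by name: it uses `difflib.SequenceMatcher` from the standard library to score each pair. The 0.8 threshold is an arbitrary starting point, not a recommendation from this chapter, and a high score only says the texts are similar; it says nothing about which one production uses.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(prompts, threshold=0.8):
    """Return (name_a, name_b, ratio) for prompt pairs whose similarity meets the threshold."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(prompts.items()), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            pairs.append((name_a, name_b, round(ratio, 2)))
    return pairs
```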

Step 2 — Pick the canonical version

For each intent (each conceptual prompt), determine which version is actually used in production. Methods:

  • Read the code carefully — which function does each handler call?
  • Check the production log: does any debug log capture which prompt was used?
  • Run the system in a controlled environment and observe which path executes

Often the answer is "v2 is used by handler A; v3 is used by handler B; v1 is dead code." Document this.

Pick the canonical prompt — usually the one in production. If two are in use for related purposes, you may need two canonical prompts. Resist the urge to merge them now; the migration's first pass is to extract, not to refactor.
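
If no existing log records which prompt was used, one low-risk way to find out is to log a short fingerprint of the prompt template at the call site and match it against the inventory. The sketch below assumes you can add a single log line to production and that the template can be hashed before variable substitution; `fingerprint` and `match_candidates` are hypothetical helper names, and the 12-character truncation is only for log readability.

```python
import hashlib


def fingerprint(template):
    """Short, stable identifier for a prompt template, computed before substitution."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def match_candidates(logged_fingerprint, candidates):
    """Names of inventory candidates whose fingerprint equals one seen in the logs."""
    return sorted(
        name for name, text in candidates.items()
        if fingerprint(text) == logged_fingerprint
    )
```

A candidate whose fingerprint never appears in the logs over a representative period is a strong hint that it is dead code, though it is worth confirming by reading the call paths.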

Step 3 — Move to the registry

Create the registry entries. The format depends on what you build, but a reasonable minimal shape:

prompts/itinerary_explanation.yaml
---
name: itinerary_explanation
version: 1.0.0      # first version in the registry; matches current production
owner: travel-platform
intent: |
  Given a customer's itinerary, produce a plain-language explanation
  in 2-3 paragraphs, suitable for inclusion in a confirmation email.
model_alias: smart-summariser    # decoupled from concrete model
parameters:
  temperature: 0.2
  max_output_tokens: 600
template: |
  You are a friendly travel assistant.

  Given the following itinerary, produce a plain-language explanation
  suitable for inclusion in a confirmation email...

  Itinerary: {itinerary}
  ...

Three rules:

  • Match production exactly. The first registry entry is byte-equal to what production runs today, whitespace included. The migration's safety property is "no behaviour change"; any deviation, even whitespace normalisation, is a future change to be verified by the eval, not part of the migration.
  • Use a model alias, not a concrete version. Even if production uses claude-3-opus-20240229, the registry entry refers to a model alias the gateway resolves. The migration retires the hardcoded version separately (chapter 08).
  • Parameters are explicit. Temperature, max_output_tokens, stop sequences — all named in the registry entry. Whatever the code did, the registry now declares.
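
The byte-equality rule is easy to check mechanically. A minimal sketch, assuming you can obtain both the original production string and the template as loaded from the registry: `first_difference` (a hypothetical helper) returns the first byte offset where they diverge. One thing to watch for in the YAML shape above: a literal block scalar written with `|` keeps a final newline, while `|-` strips it, so a template that looks identical on screen may still differ by one trailing byte.

```python
def first_difference(production, registry):
    """Return None if the strings are byte-equal, else (offset, prod_bytes, registry_bytes)."""
    prod_bytes = production.encode("utf-8")
    reg_bytes = registry.encode("utf-8")
    if prod_bytes == reg_bytes:
        return None
    limit = min(len(prod_bytes), len(reg_bytes))
    # First index where the two differ; if one is a prefix of the other, the shorter length.
    offset = next((i for i in range(limit) if prod_bytes[i] != reg_bytes[i]), limit)
    return offset, prod_bytes[offset:offset + 20], reg_bytes[offset:offset + 20]
```

Comparing encoded bytes rather than text also surfaces differences such as `\r\n` versus `\n` line endings that a visual diff can hide.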

Step 4 — Connect the running system

The code that used to construct the prompt now fetches it by name:

# before: prompt text and concrete model version are hardcoded at the call site
prompt = "You are a friendly travel assistant.\n...\nItinerary: " + itinerary
response = client.messages.create(model="claude-3-opus-20240229", ...)

# after: the prompt is fetched by name; alias and parameters come from the registry entry
prompt_spec = prompt_registry.get("itinerary_explanation", version="1.0.0")
response = gateway.call(
    model_alias=prompt_spec.model_alias,
    messages=[{"role": "user", "content": prompt_spec.template.format(itinerary=itinerary)}],
    parameters=prompt_spec.parameters,
)

Two patterns work for the connection:

A. Synchronous registry read. The code asks the registry for the prompt on each call. The registry must be highly available; in practice the read is usually served from a small in-memory cache in the calling process, refreshed from the registry.

B. Bundled-at-deploy. Prompts are bundled into the deployed artefact at build time; the running code reads from local files. Changes to prompts require a deploy. Less responsive but simpler.

A common choice is pattern A with a fallback to bundled prompts if the registry is unreachable. The choice depends on how often prompts are expected to change without a code deploy.
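
Pattern A with a bundled fallback can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `CachedRegistry` is a hypothetical class, `fetch` stands in for whatever call reaches your registry, `bundled` is a dict of entries shipped with the deploy, and the 60-second TTL is arbitrary. On a fetch failure it prefers a stale cached copy over the bundled one, on the assumption that the most recently seen registry version is closer to intended behaviour.

```python
import time


class CachedRegistry:
    """Read-through cache over a registry, with a bundled copy as the last resort."""

    def __init__(self, fetch, bundled, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable(name, version) -> spec; may raise
        self._bundled = bundled    # {(name, version): spec}, shipped with the deploy
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # {(name, version): (expires_at, spec)}

    def get(self, name, version):
        key = (name, version)
        cached = self._cache.get(key)
        if cached and cached[0] > self._clock():
            return cached[1]
        try:
            spec = self._fetch(name, version)
        except Exception:
            # Registry unreachable: use a stale cached copy if any, else the bundled copy.
            # A real client would catch narrower exceptions and log the fallback.
            if cached:
                return cached[1]
            return self._bundled[key]
        self._cache[key] = (self._clock() + self._ttl, spec)
        return spec
```

If the bundled copies are generated from the registry at build time, the fallback cannot drift from the registry the way a hand-maintained hardcoded prompt can.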

Step 5 — Verify with the eval

Run the eval set (chapter 03) against the migrated system. The expectation: scores match the pre-migration baseline within tolerance. If they do, the migration is verified — production behaviour is unchanged, prompts are now in the registry, future changes are gated by the eval.

If scores diverge, the migration has introduced a regression. The most common causes are whitespace differences in the prompt body, parameter mismatches (temperature, max_output_tokens), and a missing fallback for an edge case that the code's f-string handled but the template does not.
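
The "within tolerance" check can be made explicit so that it runs the same way for every migrated prompt. A minimal sketch, assuming the eval produces one score per metric as a dict; the metric names and the 0.01 tolerance are placeholders, since the right tolerance depends on how noisy your eval is. It flags drift in either direction, because a byte-equal migration has no reason to improve scores either.

```python
def compare_to_baseline(baseline, migrated, tolerance=0.01):
    """Return {metric: (baseline_score, migrated_score)} for metrics that drifted or are missing."""
    drifted = {}
    for metric, base_score in baseline.items():
        new_score = migrated.get(metric)
        if new_score is None or abs(base_score - new_score) > tolerance:
            drifted[metric] = (base_score, new_score)
    return drifted
```

An empty result means the migration step is verified for that prompt; anything else is a reason to stop and compare the prompt body and parameters before widening the rollout.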


Doing this without breaking production

The safety properties throughout:

  • Pre-migration eval baseline. Established before any change. Without it, you cannot verify the migration.
  • Match production exactly in the first registry entry. Migration is not a refactor.
  • Feature-flag the registry read. Roll out 1% → 10% → 50% → 100% behind a flag. Watch metrics.
  • Keep the old code path alive briefly. The flag can route back to the hardcoded prompt if the registry has a problem.
  • One prompt at a time. Migrate prompts one by one, not all at once. Each migration is independently verifiable.
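
The percentage rollout can be sketched without a feature-flag service, assuming each request carries some stable key (a user or booking identifier; `request_key` is a hypothetical name). Hashing the key into one of 100 buckets makes the decision deterministic, so the same key stays on the same path at a given percentage, and raising the percentage only ever adds keys to the registry path.

```python
import hashlib


def bucket(request_key):
    """Map a stable key to a bucket in the range 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_registry(request_key, rollout_percent):
    """True if this request should read its prompt from the registry."""
    return bucket(request_key) < rollout_percent
```

Setting `rollout_percent` back to 0 routes every request to the old hardcoded path, which is the rollback described above.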

What changes after the registry exists

The registry unlocks several disciplines that were impossible before.

Per-prompt versioning. A change to one prompt is a version bump on that prompt, not a deploy of the whole system. Module 13 builds this.

Eval-gated prompt changes. Before promoting a prompt change from version X to version Y, run the eval; gate on no regression. Module 13 builds this too.

Prompt review. Prompts can be reviewed as artefacts, separately from code. The reviewer can be a domain expert who would not be reviewing code changes.

A/B testing. Two prompts can serve in parallel for a slice of traffic, with outcomes compared.

The registry is the substrate. Modules 13 and 04_ai_product_evals operate on the substrate.


What goes in the registry beyond prompts

A practical registry holds more than the prompt template. A reasonable schema:

Field               Purpose
name                Stable identifier
version             Versioned, semver-ish
owner               Team handle
intent              One paragraph: what is this prompt for?
model_alias         The alias to use (resolved by the gateway)
parameters          Temperature, max_output_tokens, stop sequences, tools
template            The prompt body, possibly templated
template_variables  Schema for the variables the template expects
eval_set_ref        The eval set this prompt is verified against
changelog           Per-version changes, including who reviewed
deprecation         Sunset date if any

Even if your registry starts as a directory of YAML files, having these fields is the discipline. The implementation can be lightweight; the schema is the load-bearing piece.
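
Even a YAML-directory registry can enforce the schema with a small check run in CI. The sketch below assumes each entry has already been loaded into a dict, and that `template_variables` is a mapping from variable name to a type description. Which fields count as required is a choice made here for illustration, not something this chapter prescribes. It uses `string.Formatter().parse` from the standard library to list the placeholders the template actually uses and compares them with the declared variables.

```python
from string import Formatter

# Assumed minimum for illustration; extend with eval_set_ref, changelog, etc. as needed.
REQUIRED_FIELDS = {"name", "version", "owner", "intent", "model_alias", "parameters", "template"}


def template_placeholders(template):
    """Names of the {placeholders} used in a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def validate_entry(entry):
    """Return a list of human-readable problems; an empty list means the entry passes."""
    problems = ["missing field: " + f for f in sorted(REQUIRED_FIELDS - set(entry))]
    declared = set(entry.get("template_variables", {}))
    used = template_placeholders(entry.get("template", ""))
    problems += ["undeclared variable: " + v for v in sorted(used - declared)]
    problems += ["unused variable: " + v for v in sorted(declared - used)]
    return problems
```

A mismatch between declared and used variables is the kind of edge case that an f-string in code would have surfaced as an error at runtime; checking it at review time keeps that safety once the prompt has left the code.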


Common mistakes

Migrating all prompts at once. A big-bang migration loses the per-prompt verification. Migrate one prompt, verify with the eval, then the next.

Refactoring during migration. "While I'm in here, let me improve the wording" produces a behaviour change disguised as a migration. The migration is byte-equal; the improvements come after.

Skipping the eval verification. "It looks the same to me" is not verification. The eval is the only objective check.

Building an over-engineered registry. A directory of YAML files in a git repo is a usable registry. Start there. Graduate to a service when you have multiple prompts changing daily.

Leaving the hardcoded prompts in the code. After verification, delete the hardcoded prompts. Leaving them as "fallback" produces drift between code and registry. The fallback should be a registry feature (e.g., a "previous version" entry), not a code feature.


Interview Q&A

Q1. Why does a registry matter? Can't we just keep prompts in code as we do today?

Prompts in code conflate two cadences: behaviour change and code change. Every prompt edit becomes a deploy; every deploy might bundle a prompt edit nobody noticed. The registry separates them: prompts have their own lifecycle, their own review, their own versioning, and their own eval gates. The system gains the ability to change behaviour at prompt cadence without code-deploy cost, and to review behaviour changes as the artefacts they are.

Wrong-answer notes: "for cleanliness" misses the cadence-decoupling point.

Q2. Walk through migrating a prompt that exists as three near-duplicates in the code.

Inventory: find all three; identify which is in production via code-read and log inspection. Pick the canonical one. Create the registry entry, byte-equal to the production prompt, with parameters explicit and model_alias substituted for any hardcoded model string. Connect the system to read from the registry behind a feature flag at 1% → 100%. Run the eval at each stage; the baseline must hold. Delete the in-code prompts once verified. The two non-canonical versions are now removed unless they served distinct intents, in which case they get their own registry entries.

Wrong-answer notes: picking "the latest" or "the best-looking" version without verifying which is in production produces a behaviour change disguised as a migration.

Q3. The eval shows a small score drop after the migration. The prompt body matches; the parameters match. What is the likely cause?

Whitespace normalisation. Templates often normalise trailing whitespace or line endings; production may have run with the exact whitespace the original string had. Re-check the template byte-by-byte against the original. The other common cause: a template variable formatting subtly differently (e.g., a list-rendering helper that joined with ", " in code and "," in the template). Find the difference; restore exact behaviour.

Wrong-answer notes: "small drops are fine" defeats the migration's safety property.

Q4. The system uses ten prompts. Should you migrate them all in one PR or one at a time?

One at a time. Each prompt has its own surface, its own eval coverage, its own potential failure modes. One PR per prompt makes each migration independently verifiable and reversible. A big-bang PR is harder to review, harder to test, and harder to roll back. The total time is roughly the same; the risk is much lower.

Wrong-answer notes: "all at once is faster" misses that the verification cost is per-prompt regardless of PR grouping.


What to do differently after reading this

  • Inventory every prompt before migrating any. Resolve the production-canonical version per intent.
  • Build a minimal registry (a YAML directory is enough). Add the schema fields even if you do not use them all yet.
  • Migrate one prompt at a time, behind a feature flag, with eval verification at each step.
  • Delete the in-code prompts after verification. No "fallback" copies in source.
  • Plan the connection to module 13's prompt-lifecycle discipline after the registry is in place.

Bridge. Prompts are in the registry. The next thing the system needs is a way to be observed. Many inherited systems have no traces, no audit log, no dashboards for the AI module. The next chapter is observability retrofitted onto a running system without breaking it. → 06-observability-retrofitted.md