02. The prompt registry — the recipe book that every trace can name

~15 min read. The registry is where prompts live, get versions, and earn an audit trail. Pick the wrong backing store and you inherit the wrong failure mode. Pick the right one and most of prompt ops becomes a query.

Builds on 01-prompts-as-code.md. The four laws say a prompt needs identity, version, review, and eval. The registry is the physical thing that holds the first two and makes the last two possible.

1) Hook — the moment a registry earns its keep

Look. It is 14:32 on a weekday. The on-call dashboard shows the refund agent's "correct intent extraction" score has fallen from 96% to 78% in the last fifteen minutes. The on-call engineer pulls up a failed trace.

trace_id     : 4481_t99a2c
ts           : 2026-02-11T14:31:18Z
service      : billing
agent        : refund_classifier
prompt_name  : billing.classifier.refund_intent
prompt_sha   : b1d7e4c2
output       : {"intent": "warranty_claim", "confidence": 0.42}
expected     : {"intent": "refund_request", "confidence": >0.8}

The trace carries the SHA. The on-call engineer types one command into the registry CLI.

$ promptctl history billing.classifier.refund_intent

SHA       parent    deployed_at           author          eval_score
b1d7e4c2  a8c3f971  2026-02-11T14:18:00Z  ria@acme.com    91.4
a8c3f971  e5f02d8b  2026-02-10T09:12:00Z  sundar@acme.com 96.7
e5f02d8b  ...       2026-02-04T11:55:00Z  sundar@acme.com 96.5

There it is. A new SHA went live at 14:18, fourteen minutes before the on-call engineer opened the dashboard and right at the start of the fifteen-minute window in which the score fell. The author is named. The parent SHA is named. The eval score on promotion was 91.4, lower than the previous 96.7, and somebody promoted it anyway. The rollback to a8c3f971 is one command. Metrics recover three minutes later.

None of this works without a registry. The trace would have logged the model's output, but nothing that ties the failure to a specific edit. The bisection would have been "what shipped today?", the search would have been a git log, and the recovery would have been measured in hours, not minutes.

This chapter is about what a registry is, what it stores, what backs it, and the tradeoffs in each choice.

2) The metaphor — the recipe book on the kitchen wall

The bakery has thirty bakers across three shifts. They cannot all be reading the same paper recipe on the counter. Not safely. Not without somebody crossing out "two pinches salt" the way the tired cook did.

So the bakery installs the recipe book. It is a bound volume on the wall. Every page is a recipe. Every recipe has a name on the spine — bread.sourdough, pastry.croissant, cake.birthday_chocolate. Every page has a number, and every revision of that page gets pasted in as a new entry with its own identifier — never erased, never edited in place. When a baker wants tomorrow's shift to use a new croissant recipe, she does not edit the existing page. She writes a new page, gets the head baker to sign it, files it as pastry.croissant — entry 17, and only then changes the pointer on the front desk that says "today's croissant is entry 17".

That whole construct — the named spine, the immutable revisions, the head baker's signature, the pointer that says what is live — is the prompt registry.

The registry holds three different things, and most teams confuse them at first.

The content is the recipe text itself, content-addressed by its SHA. Once written, never edited. New text gets a new entry.

The identity is the spine label — billing.classifier.refund_intent. Stable across versions. The same name today and a year from now.

The deployment pointer is the front-desk note — "production is billing.classifier.refund_intent at SHA a8c3f971". The pointer changes. The content never does.

Confusing content with identity is what causes the "v17 was edited last week" problem we will return to. Confusing identity with the deployment pointer is what causes deploys-by-rename. Keep all three separate in your head, and the registry's structure follows from there.

3) Anatomy of a registry entry

A useful registry stores, at minimum, the following per version. Strip any of these out and you lose a capability later.

┌──────────────────────────────────────────────────────────────┐
│ PROMPT REGISTRY ENTRY                                        │
├──────────────────────────────────────────────────────────────┤
│ name             billing.classifier.refund_intent            │
│ sha              b1d7e4c2... (sha256 of the content blob)    │
│ parent_sha       a8c3f971...                                 │
│ content          { system, few_shot, tools, format }         │
│ created_at       2026-02-11T14:14:00Z                        │
│ created_by       ria@acme.com                                │
│ status           draft | in_review | approved | deployed |   │
│                  deprecated | rolled_back                    │
│ eval_suite       refund_classifier_eval_v3                   │
│ eval_score       91.4                                        │
│ approvers        [aman@acme.com, priya@acme.com]             │
│ tags             {service: billing, owner: payments_squad}   │
│ change_note      "softened apology in fallback example"      │
└──────────────────────────────────────────────────────────────┘

Every field earns its place. The SHA is the content hash; the same text always produces the same one, and for practical purposes two different texts never collide on it. The parent SHA gives you the chain: every version knows the one it replaced, which is how diffs and rollbacks work without a separate history table. The content is itself a structured object, not a single string, because a prompt is rarely just a system message; it usually carries few-shot examples, tool descriptions, and an expected output format. The status field is what separates a draft an author is iterating on from the version production is actually running. The eval score and approvers are the audit trail for the four laws.
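The parent-SHA chain can be walked with a few lines of code. The following is a minimal sketch, assuming an in-memory index from SHA to parent SHA; the names PARENTS, lineage, and rollback_target are hypothetical, and the three SHAs are the ones shown in the history listing in the hook. The oldest entry's parent is elided in that listing, so None is used here only to end the walk.

```python
from typing import Dict, List, Optional

# Hypothetical index: sha -> parent_sha. None only marks the end of the
# walk; the source listing elides the oldest entry's real parent.
PARENTS: Dict[str, Optional[str]] = {
    "b1d7e4c2": "a8c3f971",
    "a8c3f971": "e5f02d8b",
    "e5f02d8b": None,
}


def lineage(sha: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return [sha, parent, grandparent, ...] by following parent links."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = sha
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent chain at {current}")
        if current not in parents:
            raise KeyError(f"unknown sha {current}")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain


def rollback_target(sha: str, parents: Dict[str, Optional[str]]) -> Optional[str]:
    """The version a given SHA replaced, i.e. the natural rollback target."""
    return parents[sha]


history = lineage("b1d7e4c2", PARENTS)
```

Because each version stores only its parent, the full history is derived on demand rather than kept in a separate table.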

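The status field deserves a second look. The six statuses form a small state machine; the diagram adds a seventh terminal state, abandoned, for drafts and rejected reviews that never ship.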

  draft  ─submit─▶  in_review  ─approve─▶  approved  ─promote─▶  deployed
    │                   │                      │                    │
    │                   │                      │                    ▼
    │                   └──reject──┐           └──supersede─▶  deprecated
    │                              ▼                                │
    └─────────────────────────▶  abandoned                          ▼
                                                              rolled_back

A draft is the author's playground — edit freely, run trial evals, but not yet visible to production. In review is awaiting human approval. Approved has passed review and eval gates but is not yet live. Deployed is the one the deployment pointer is currently pointing at. Deprecated is an older version superseded by a new deploy. Rolled back is a previously deployed version that production has reverted to — kept as a distinct state because "this version was once live, then unlived, and is now live again" is worth knowing in incident timelines.
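The state machine can be enforced with a small transition table. This is a sketch of one reading of the diagram, not a prescribed implementation: the names TRANSITIONS, IllegalTransition, and transition are hypothetical, and rolled_back and abandoned are treated as terminal only because the diagram draws no outgoing arrows from them.

```python
from typing import Dict, Set

# Allowed status transitions, transcribed from the arrows in the diagram.
TRANSITIONS: Dict[str, Set[str]] = {
    "draft": {"in_review", "abandoned"},
    "in_review": {"approved", "abandoned"},   # abandoned == rejected review
    "approved": {"deployed", "deprecated"},   # promote or supersede
    "deployed": {"deprecated"},
    "deprecated": {"rolled_back"},
    "rolled_back": set(),                     # no outgoing arrow in the diagram
    "abandoned": set(),
}


class IllegalTransition(ValueError):
    """Raised when a status change is not an arrow in the diagram."""


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if current not in TRANSITIONS:
        raise IllegalTransition(f"unknown status: {current}")
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current} -> {target} is not allowed")
    return target


status = transition("draft", "in_review")
status = transition(status, "approved")
status = transition(status, "deployed")
```

A registry API that routes every status write through a function like this cannot skip review: a jump from draft straight to deployed raises instead of succeeding.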

4) Content-addressed identity — why SHA beats v17

The instinct most teams have is to number versions. v1, v2, v3. It is the mental model from semver. It also gets you into trouble fast in prompt land, and the reason is worth being explicit about.

Numbered versions are names. The thing they name can change. If two engineers both add a new version called "v17" on different branches, you have a name collision and have to renumber. Worse, in tools that let you edit a numbered version in place (and many do), "v17" today is not "v17" last week. The trace from last week says it ran v17. Pull up v17 today. It is different text. The link from trace to text is broken.

Content-addressed identity solves both. The version is sha256(content). The same content always produces the same SHA, and a SHA-256 collision between two distinct contents is so improbable that registries treat it as impossible. There is no "renaming": the SHA is the content's fingerprint. If two engineers both make the same one-character typo fix, they get the same SHA and the system de-duplicates. If two engineers make different edits, they get different SHAs, and one of them is going to lose the race to deploy, which is how it should be.

NUMBERED VERSIONS                CONTENT-ADDRESSED
────────────────                 ─────────────────
v17 today may differ             SHA b1d7e4c2 is the same text
  from v17 last week               today, tomorrow, always

two authors collide on v17       same edit by two authors →
  → renumber, both lose history    same SHA, deduped for free

editing v17 in place is          editing in place is impossible
  possible by accident             — new text = new SHA

"which v17"                      SHA answers itself

The downside is human-readability. b1d7e4c2... is not memorable. Mature registries solve this by displaying short prefixes (the first 8 characters) and pairing SHAs with optional human-readable tags (refund-classifier-2026-feb-launch). The tag is mutable, the SHA is not. The trace logs the SHA. The dashboard displays the tag. The two layers do not interfere.
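Computing sha256(content) for a structured prompt needs a stable byte encoding first. The sketch below assumes canonical JSON (sorted keys, fixed separators) as that encoding; the source only says the SHA is the hash of the content blob, so the serialization choice and the function names content_sha and short_sha are assumptions for illustration.

```python
import hashlib
import json


def content_sha(content: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the content."""
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def short_sha(sha: str, length: int = 8) -> str:
    """Display prefix for humans; traces should still log the full SHA."""
    return sha[:length]


# Same content, different key order: canonicalization makes them identical.
a = {"system": "Classify the message.", "few_shot": [], "tools": [], "format": "json_object"}
b = {"format": "json_object", "tools": [], "few_shot": [], "system": "Classify the message."}
# A one-character edit produces a different fingerprint.
c = dict(a, system="Classify the message!")

sha_a, sha_b, sha_c = content_sha(a), content_sha(b), content_sha(c)
```

Without the canonical step, two authors saving the same prompt with different key order or whitespace would get different SHAs, and the free de-duplication described above would be lost.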

Mini-FAQ. "What if I want both — SHA and a friendly number?" Many registries store an auto-incremented version_number alongside the SHA, as a display convenience. The number is for humans; the SHA is for the machines. Production traces should always log the SHA, never the number, because the number can be reassigned across systems and environments.

5) Worked example — a YAML-backed registry entry

The simplest registry that satisfies the four laws is a YAML file in a git repo. Below is one entry as a team would actually write it.

# prompts/billing/refund_classifier.yaml
name: billing.classifier.refund_intent
# sha and parent_sha are abbreviated in this example; a full SHA-256 hex digest has 64 characters
sha: b1d7e4c2a3f048e7b9c1d2e3f4a5b6c7
parent_sha: a8c3f97132e9b5a8d4f6e7c8b9a0c1d2
created_at: 2026-02-11T14:14:00Z
created_by: ria@acme.com
status: deployed
eval_suite: refund_classifier_eval_v3
eval_score: 91.4
approvers:
  - aman@acme.com
  - priya@acme.com
tags:
  service: billing
  owner: payments_squad
  language: en
change_note: |
  Softened the apology in the fallback example. The previous wording
  ("I'm sorry, I can't help with that.") was triggering a tone-drop in
  the conversational eval. New wording is "Let me get someone who can
  help with that." Behavior on positive cases unchanged.
content:
  system: |
    You are the refund-intent classifier for Acme Billing.
    Given a customer message, classify into one of:
      - refund_request
      - warranty_claim
      - cancellation
      - other
    Return strict JSON: {"intent": ..., "confidence": 0.0..1.0}
  few_shot:
    - role: user
      content: "I want my money back for order 4481"
    - role: assistant
      content: '{"intent": "refund_request", "confidence": 0.95}'
    - role: user
      content: "The blender broke after two weeks, what do I do?"
    - role: assistant
      content: '{"intent": "warranty_claim", "confidence": 0.88}'
  tools: []
  expected_format: json_object

The entry holds everything the four laws asked for. The name gives identity. The SHA gives version. The approvers and change_note give the human review trail. The eval_suite and eval_score give the automated trail. The content block carries everything the runtime needs to load — system message, few-shot turns, tool list, expected format.

A new edit by a different author produces a sibling YAML file at a different SHA, with this one's SHA as its parent. The deployment pointer (a separate small file, prompts/deployments/billing.yaml, that maps name → currently-deployed SHA per environment) is what gets updated to switch traffic. The content files themselves are append-only.
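The deployment pointer can be sketched as a small mapping that is the only mutable part of the registry. This is an illustrative in-memory version, not the file format of prompts/deployments/billing.yaml; the class DeploymentPointers and its method names are hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (prompt name, environment)


class DeploymentPointers:
    """Maps (name, env) -> sha and remembers previous SHAs for rollback."""

    def __init__(self) -> None:
        self._current: Dict[Key, str] = {}
        self._history: Dict[Key, List[str]] = {}

    def promote(self, name: str, env: str, sha: str) -> None:
        """Point an environment at a SHA; content is never touched."""
        key = (name, env)
        if key in self._current:
            self._history.setdefault(key, []).append(self._current[key])
        self._current[key] = sha

    def rollback(self, name: str, env: str) -> str:
        """Flip the pointer back to the previously deployed SHA."""
        key = (name, env)
        previous = self._history.get(key)
        if not previous:
            raise LookupError(f"no earlier deployment recorded for {key}")
        self._current[key] = previous.pop()
        return self._current[key]

    def resolve(self, name: str, env: str) -> str:
        """What the runtime calls to learn which SHA to load."""
        return self._current[(name, env)]


pointers = DeploymentPointers()
name = "billing.classifier.refund_intent"
pointers.promote(name, "production", "a8c3f971")
pointers.promote(name, "production", "b1d7e4c2")  # the 14:18 deploy
live_before = pointers.resolve(name, "production")
restored = pointers.rollback(name, "production")  # the one-command fix
```

Promotion and rollback only move the pointer, which is why the content files can stay append-only.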

6) The four backing stores — and what each one buys you

Teams pick from roughly four ways to back the registry. Each one trades off the same axes — who can edit, how strong the audit trail is, how fast reads are, and how multi-tenancy works. Knowing the four is enough to evaluate any SaaS pitch you will hear.

Option A — Postgres (or any relational DB)

The registry is a set of tables — prompts (name, owner), prompt_versions (sha, name, content, parent_sha, created_by, status, eval_score), deployments (name, env, sha). Reads are a join. Writes go through an API the team owns. Audit trail comes from row history or a separate events table.

What you buy. Full control of schema. Fast reads at scale. Plays well with existing infra. Easy to add fields specific to your business.

What you pay. You build the editor UI. You build the diff view. You build the eval-gate plumbing. There is no UI until you make one. Most teams start here and then either build a real internal tool or move to one of the SaaS options below.
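The three tables described for Option A, and the join that answers "what is production running?", might look like the sketch below. SQLite from the Python standard library stands in for Postgres so the example runs anywhere; the column types and the helper live_version are assumptions, and a real Postgres schema would likely use richer types such as jsonb and timestamptz.

```python
import sqlite3

SCHEMA = """
CREATE TABLE prompts (
    name  TEXT PRIMARY KEY,
    owner TEXT NOT NULL
);
CREATE TABLE prompt_versions (
    sha        TEXT PRIMARY KEY,
    name       TEXT NOT NULL REFERENCES prompts(name),
    content    TEXT NOT NULL,
    parent_sha TEXT,
    created_by TEXT NOT NULL,
    status     TEXT NOT NULL,
    eval_score REAL
);
CREATE TABLE deployments (
    name TEXT NOT NULL REFERENCES prompts(name),
    env  TEXT NOT NULL,
    sha  TEXT NOT NULL REFERENCES prompt_versions(sha),
    PRIMARY KEY (name, env)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

NAME = "billing.classifier.refund_intent"
conn.execute("INSERT INTO prompts VALUES (?, ?)", (NAME, "payments_squad"))
conn.execute(
    "INSERT INTO prompt_versions VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("a8c3f971", NAME, '{"system": "..."}', "e5f02d8b",
     "sundar@acme.com", "deployed", 96.7),
)
conn.execute("INSERT INTO deployments VALUES (?, ?, ?)", (NAME, "production", "a8c3f971"))


def live_version(conn: sqlite3.Connection, name: str, env: str):
    """Join the pointer to the version row it references."""
    return conn.execute(
        """
        SELECT v.sha, v.created_by, v.eval_score
        FROM deployments AS d
        JOIN prompt_versions AS v ON v.sha = d.sha
        WHERE d.name = ? AND d.env = ?
        """,
        (name, env),
    ).fetchone()


row = live_version(conn, NAME, "production")
```

The primary key on (name, env) in deployments is what guarantees one live SHA per prompt per environment.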

Option B — Git repo of YAML files

Each prompt is a YAML file in a prompts/ directory. The deployment pointer is a separate file. Edits go through pull requests. The Git history is the audit trail. CI runs lint and eval gates on the PR.

What you buy. Versioning, diff, review, audit — all free, all inherited from the tooling engineers already use. Code review patterns transfer. Easy to inspect, easy to grep.

What you pay. Only engineers can comfortably edit. Non-engineers (PMs, support leads) need either a UI on top or the patience to learn GitHub. Hot edits in production are slow — every change needs a PR, a CI run, and a deploy. The audit trail is excellent for what changed but weak for who viewed it and what eval ran.

Option C — Hosted SaaS (Langfuse, Pezzo, Vellum, Braintrust, etc.)

You install an SDK. You write prompts in the SaaS UI. The SaaS stores versions, runs evals, exposes a deployment pointer, gives non-engineers a clean interface to edit and review.

What you buy. A real UI day one. A real audit trail day one. Eval workflows built in. Non-engineer-friendly. Mature features for diffs, A/Bs, and rollbacks. Multi-tenancy and SSO if you pay enough.

What you pay. Vendor lock-in to some degree (most have export). Latency overhead if you hit the SaaS on every request — most SDKs cache, but you inherit the cache logic. A line item in the budget. Data residency questions if your prompts contain regulated content.

Option D — DIY S3 + a manifest

The cheap, durable option. Prompt content blobs in S3 (or any object store), keyed by SHA. A manifest file — JSON or YAML — that holds the name-to-SHA mapping per environment. A tiny service or a Lambda updates the manifest. Audit trail is S3 access logs plus the manifest's git history.

What you buy. Almost free at any scale. Immutability can be enforced by the object store itself, provided write-once settings such as S3 Object Lock are turned on. Reads are cheap and globally cacheable. Survives a registry-service outage because there is barely a service.

What you pay. Tooling poverty. No UI. No diff view. No eval gate without building one. Workable for a small backend team; painful for product-led prompt editing.
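The shape of Option D can be tried locally before touching a cloud account. The sketch below uses a temporary directory as a stand-in for the object store; the layout (a blobs/ prefix keyed by SHA plus a manifest.json) and the function names put_blob, set_pointer, and load_prompt are assumptions for illustration, not an S3 API.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def put_blob(root: Path, content: bytes) -> str:
    """Store content under its SHA-256; an existing key is never rewritten."""
    sha = hashlib.sha256(content).hexdigest()
    path = root / "blobs" / sha
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        # Mode "xb" fails if the file already exists (write-once).
        with open(path, "xb") as handle:
            handle.write(content)
    return sha


def set_pointer(root: Path, name: str, env: str, sha: str) -> None:
    """Update the manifest: env -> name -> sha."""
    manifest_path = root / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest.setdefault(env, {})[name] = sha
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))


def load_prompt(root: Path, name: str, env: str) -> bytes:
    """Resolve name -> sha through the manifest, then read the blob."""
    manifest = json.loads((root / "manifest.json").read_text())
    return (root / "blobs" / manifest[env][name]).read_bytes()


root = Path(tempfile.mkdtemp())
text = b"You are the refund-intent classifier for Acme Billing."
sha = put_blob(root, text)
set_pointer(root, "billing.classifier.refund_intent", "production", sha)
loaded = load_prompt(root, "billing.classifier.refund_intent", "production")
```

Everything the table lists as missing (UI, diff view, eval gate) would have to be built around these three functions.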

                       NON-ENGINEER   AUDIT      READ      BUILD
                       EDITING        TRAIL      LATENCY   EFFORT
                       ────────────   ─────      ───────   ──────
A — Postgres           medium         strong     low       high
B — Git/YAML           low            strong     low       medium
C — Hosted SaaS        high           strong     medium    low
D — S3 + manifest      low            medium     very low  medium

Most teams end up in B for a while, then move to C as the non-engineer-editing pressure grows. A and D are reasonable end states for teams with unusual constraints, such as heavy custom workflows or strict cost ceilings.

Mid-content recall

Why does content-addressed identity beat numbered versions?
Name three fields a registry entry must carry beyond name and content.
What is the practical difference between deployed and approved in the status state machine?

7) Naming conventions — the part that becomes load-bearing

A registry full of well-versioned prompts named prompt_1, prompt_2, new_prompt, new_prompt_actually_final will fail you faster than no registry at all. Names are the index that every other tool reads. They go on dashboards, in eval reports, in alert messages, in customer-success documentation. Pick them with the care you would pick a public API surface.

The convention that survives contact with real teams is service.role.purpose. Three dotted segments, lowercase, snake_case within each segment.

billing.classifier.refund_intent
billing.agent.refund_resolver
support.agent.greeter
support.summarizer.ticket_close
onboarding.checker.email_verification
moderation.classifier.spam
search.rewriter.query_expansion

The first segment is the service or domain. The second is the kind of prompt — agent, classifier, summarizer, rewriter, extractor, grader. The third is the specific purpose. The result is a flat namespace that sorts, greps, and dashboards cleanly.
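A lint for the service.role.purpose shape fits in one regular expression. This is a minimal sketch: the pattern NAME_RE and the function is_valid_prompt_name are hypothetical, and allowing digits after the first letter of a segment is an assumption the convention above does not spell out.

```python
import re

# Exactly three dot-separated segments; each is lowercase snake_case
# starting with a letter.
NAME_RE = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2}")


def is_valid_prompt_name(name: str) -> bool:
    """True if the name has the service.role.purpose shape."""
    return NAME_RE.fullmatch(name) is not None


good = [
    "billing.classifier.refund_intent",
    "support.agent.greeter",
    "search.rewriter.query_expansion",
]
bad = [
    "prompt_1",                      # one segment
    "new_prompt_actually_final",     # one segment
    "billing.refund_intent",         # two segments
    "Billing.Classifier.Refund",     # uppercase
    "prompt_v2_final_FINAL.yaml",    # uppercase, two segments
]

results = {name: is_valid_prompt_name(name) for name in good + bad}
```

A check like this can run in CI on every new registry entry. It validates the shape only, not whether the role segment is one of the agreed kinds.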

Names should be stable. Once billing.classifier.refund_intent exists, renaming it costs you every dashboard, alert, and eval that referenced the old name. Add a new prompt with a new name instead. Deprecate the old one in the registry's status field.

Mini-FAQ. "Can we use slashes instead of dots?" You can. Dots match Python module conventions and read cleanly in logs. Slashes match path conventions and read cleanly in file trees. Pick one, stick to it.

8) Immutability — the rule that makes everything else work

The single rule that the registry must enforce above all others is this. A version, once created, cannot be edited. Not by the author. Not by an admin. Not by a database query. Edit = new version. New SHA.

The rule sounds severe. It is the reason every other piece of the system holds together. Every trace logged a SHA. If the content behind that SHA can change, the trace's link to the truth is broken. Every eval reported a score against a SHA. If the SHA's content can change, the score is a lie. Every rollback target is a SHA. If the SHA's content can change, the rollback is gambling.

Implementation matters. In a Postgres backing, immutability is enforced by a trigger that rejects updates to the content column. In a Git/YAML backing, it is enforced by the convention that each version is a separate file at a SHA-named path and CI rejects modifications to existing files. In a hosted SaaS, it is enforced by the platform. In an S3 backing, it is enforced by the object store itself once write-once-read-many protection (for example S3 Object Lock) is turned on; a plain bucket will overwrite an existing key on a second write.
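The trigger approach can be demonstrated with SQLite from the Python standard library as a stand-in for Postgres. The trigger name and the helper try_edit_in_place are hypothetical. A Postgres version would use a trigger function rather than SQLite's RAISE(ABORT, ...), so treat this as a sketch of the idea, not as Postgres syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prompt_versions (
        sha     TEXT PRIMARY KEY,
        name    TEXT NOT NULL,
        content TEXT NOT NULL
    );

    -- Reject any UPDATE that touches the content column.
    CREATE TRIGGER prompt_versions_content_immutable
    BEFORE UPDATE OF content ON prompt_versions
    BEGIN
        SELECT RAISE(ABORT, 'prompt content is immutable; insert a new version');
    END;
    """
)
conn.execute(
    "INSERT INTO prompt_versions VALUES (?, ?, ?)",
    ("a8c3f971", "billing.classifier.refund_intent", "original text"),
)


def try_edit_in_place(conn: sqlite3.Connection, sha: str, new_content: str) -> bool:
    """Attempt an in-place edit; return False if the database refuses."""
    try:
        conn.execute(
            "UPDATE prompt_versions SET content = ? WHERE sha = ?",
            (new_content, sha),
        )
    except sqlite3.DatabaseError:
        return False
    return True


edited = try_edit_in_place(conn, "a8c3f971", "sneaky edit")
stored = conn.execute(
    "SELECT content FROM prompt_versions WHERE sha = ?", ("a8c3f971",)
).fetchone()[0]
```

Putting the rule in the database rather than in the API layer is what closes the "not by a database query" loophole: an ad hoc UPDATE from a console fails the same way.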

The corollary is that a draft is different from a version. A draft is mutable. An author can iterate on a draft text freely. The moment the draft is submitted to review, it is hashed into a version, and the version is immutable from that point on. Drafts can carry the same name as the live version; what differs is the status. Only versions that an environment's deployment pointer references (status deployed, or rolled_back once production has reverted to them) are reachable by the production runtime.

9) Failure modes — where registries leak

SYMPTOM                                  FIX
───────                                  ───
"v17 was different last week"        →   content-addressed SHAs, no in-place
                                         edits
two authors collide on the same      →   immutable versions, parent SHA
  version name                           lineage
trace logs prompt name but no SHA    →   stamp SHA on every trace (chapter 7)
non-engineer cannot edit             →   add a UI layer over backing store
PR review is the only audit trail    →   add deployment events and access logs
prompts/ directory has 4000 files    →   namespace by service in subdirs
prompt_v2_final_FINAL.yaml           →   enforce naming convention in lint
deployment pointer drift between     →   single deployments file per env,
  environments                           reviewed like prompts themselves

The two leaks worth highlighting are the deployment pointer drift — staging runs SHA X, production runs SHA Y, and nobody knows because the pointers live in different files — and the trace-without-SHA leak, which is the reason chapter seven exists.
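Closing the trace-without-SHA leak usually means wrapping the model call so the SHA cannot be forgotten. The sketch below is illustrative only: TRACES stands in for a real trace sink, fake_model stands in for a real model client, and call_with_trace is a hypothetical helper, not any tracing library's API.

```python
import time
from typing import Callable, Dict, List

TRACES: List[Dict[str, object]] = []  # stand-in for a real trace sink


def call_with_trace(
    prompt_name: str,
    prompt_sha: str,
    model_call: Callable[[str], str],
    user_message: str,
) -> str:
    """Run the model call and record name and SHA next to the output."""
    started = time.monotonic()
    output = model_call(user_message)
    TRACES.append(
        {
            "prompt_name": prompt_name,
            "prompt_sha": prompt_sha,  # the field that makes bisection possible
            "latency_ms": round((time.monotonic() - started) * 1000, 3),
            "output": output,
        }
    )
    return output


def fake_model(message: str) -> str:
    """Placeholder for a real chat-completion call."""
    return '{"intent": "refund_request", "confidence": 0.95}'


result = call_with_trace(
    "billing.classifier.refund_intent",
    "b1d7e4c2",
    fake_model,
    "I want my money back for order 4481",
)
```

Making prompt_sha a required argument, rather than an optional tag, means a call site that omits it fails at development time instead of during an incident.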

Where this lives in the wild

Each of these products has staked out a slice of the registry problem.

Langfuse — open-source platform; named prompts, numbered versions, environment-scoped deployment labels (production, staging), Python and TypeScript SDKs that resolve a name to its currently-labelled version.
PromptLayer — SaaS registry with template variables, version diffing, and a "release" concept that maps to deployment pointers.
Pezzo — open-source registry with environments, role-based edit permissions, and visual diff between versions.
Helicone — observability-first product that ties traces to prompt versions and lets you replay a trace against a different SHA.
Vellum — enterprise registry with structured prompt blocks (system / user / examples / tools), eval workflows, and approval gates.
Braintrust — registry plus eval store; the registry's primary value is that every prompt is paired with its eval suite.
LangSmith (LangChain) — hub of named prompts callable by name from the LangChain SDK; versions are pinned by hash.
PromptHub — git-backed registry with a UI layer; mixes options B and C.
OpenAI Stored Prompts — first-party named, versioned prompts callable by ID from the API; the deployment pointer concept lives inside the OpenAI platform.
Anthropic Workbench — prompt iteration UI with per-workspace history and ability to copy a SHA-equivalent identifier.
Vercel AI SDK — typed prompt-definition surface that can be backed by any of the four storage options.
GitHub (PR review + repo) — when used as the registry, supplies the audit trail, diff, and review surface for option B.
GitLab Merge Requests — the GitLab-equivalent surface for the same pattern.
Phabricator — the older review tool some large companies still use as the gate over a Git-backed registry.
ReviewBoard — niche but in use for the same pattern.
dbt — analytics teams treat SQL models as versioned, named, eval-gated assets — the same mental model adapted for SQL.
Hashicorp Vault — used to store credentials referenced from inside prompts; the secret half of a registry, not the prose half.
AWS Parameter Store — a popular bootstrap option D backing; cheap, S3-like, but you build everything around it.
AWS Secrets Manager — paired with Parameter Store for the secret half.
ConfigCat — flag platform some teams misuse as a registry; works for toy systems, struggles at scale.
LaunchDarkly — gate prompt deployment pointers per cohort or rollout.
Statsig — same role, paired with experimentation features.
Flagsmith — open-source equivalent for the same pattern.
Optimizely — used for prompt A/B experiments where the pointer maps to a variant SHA.
Split.io — flag-and-experiment surface for the deployment pointer.
MLflow — model-registry framework whose pattern (model + version + stage) maps cleanly onto prompts; some teams reuse it.

The point is not that one of these is right and the others are wrong. The point is that the four backing-store options have been productised many times over, and you are picking which trade-off curve your team wants to ride.

Pause and recall

Name the three things a registry actually stores, kept conceptually separate.
What are the five (or six, with rolled-back) statuses in the version state machine?
Why is immutability the rule that makes every other capability work?
List the four backing-store options and the main thing each one buys.
What is the convention for prompt names that survives contact with real teams?
Why is "v17 in place" worse than "a new SHA"?
Where does the deployment pointer live, and why is it a separate artifact from the content?

Interview Q&A

Q1. What does a prompt registry actually store? A. Three logically distinct things: the content (immutable, content-addressed), the identity (a stable name across versions), and the deployment pointer (which SHA each environment is currently running). Each version row also carries metadata: parent SHA, author, eval score, approvers, status, change note. Trap: Conflating identity with version. The name billing.classifier.refund_intent is the identity. The SHA is the version. Production points one at the other.

Q2. Why content-address prompts instead of numbering them? A. Numbers are mutable names: v17 today can differ from v17 last week if the system lets anyone edit in place. SHAs are content fingerprints: the same text always produces the same one, and a collision between two different texts is improbable enough to ignore. Traces, evals, and rollbacks all need a stable link to a specific text. Only SHAs provide that. Trap: "We use semver." Fine for human display, but the trace should still log a SHA.

Q3. Walk me through choosing between Postgres, Git, hosted SaaS, and S3 for the registry. A. Postgres for full control with effort. Git/YAML if engineers are the only editors and you want the audit trail for free. Hosted SaaS (Langfuse, Vellum, Braintrust) once non-engineers need to edit. S3 + manifest for cost-ceiling cases with no UI requirement. Most teams go B first, then C as the edit pressure grows. Trap: Picking SaaS too early, so you inherit lock-in before you know your edit shape. Or staying on Git too long when product wants to edit weekly.

Q4. Why is immutability the rule the registry enforces hardest? A. Every other capability depends on it. Traces log a SHA; if the SHA's content can change, the trace is a lie. Evals score a SHA; same problem. Rollbacks target a SHA; same problem. Immutability is the foundation contract. Drafts are mutable; versions are not. Trap: Allowing "admin override" edits. The audit trail breaks on the first override.

Q5. How do you name prompts? A. service.role.purpose — three dotted segments, lowercase. The first is the service or domain (billing, support, search). The second is the role (agent, classifier, summarizer). The third is the specific purpose. Stable across versions. Lints enforce shape. Trap: Naming by author or by sprint. The name has to survive ownership changes and outlive the engineer who created it.

Q6. What is the deployment pointer and why is it a separate artifact? A. The deployment pointer is the small file or row that maps name → SHA per environment; for example, production points billing.classifier.refund_intent at SHA a8c3f971. Keeping it separate from the content means deploys are pointer changes, not content changes. Rollbacks are pointer flips. Promotions across environments are pointer flips. The content store stays append-only. Trap: Embedding "is_deployed" as a flag on the content row. Then promote and rollback both mean updating immutable rows.

Q7. How does a registry support multi-environment workflows? A. The deployment pointer is environment-scoped — production → SHA X, staging → SHA Y, dev → SHA Z. Promotion is "flip the staging-pointing SHA to also be the production-pointing SHA". Rollback is "flip production back to the previous SHA". The content store does not care which environment is reading. Trap: Per-environment content copies. You end up with drift between staging and production because somebody edited staging in place.

Q8. What is the most common registry leak you have seen? A. Trace-without-SHA. The trace logs the model's output but not the SHA of the prompt that produced it. Every incident becomes an archaeology project. Fix is to wrap the runtime so every request carries prompt_sha into the trace alongside model and latency. Cost is one extra header. Value is every postmortem from then on. Trap: Logging the prompt name but not the SHA. The name does not tell you which version was live at the time of the failed request.

Apply now (5 min)

Step 1 — sketch your registry schema. On paper or whiteboard, draw the three tables (or YAML files) — prompts (name, owner), prompt_versions (sha, name, content, parent_sha, created_by, status, eval_score), deployments (name, env, sha). Fill in two real prompts from your codebase.

Step 2 — pick a backing store. Look at the four options. Ask which two properties matter most to your team in the next year — non-engineer edits, hot edits without deploy, cost, audit-trail strength. Pick.

Step 3 — name twenty prompts. Open your codebase. Find every f-string that lands in a chat-completion call. Give each one a service.role.purpose name. Lay them out as a flat list. Notice which service segments are over-represented — that is where your next refactor goes.

Bridge. The registry holds versions. The next question is what to do with them — how to diff, how to promote, how to roll back fast when the 14:32 incident in the hook lands on your screen. The rollback is its own discipline.

→ 03-versioning-and-rollback.md