02. The prompt registry — the recipe book that every trace can name¶
~15 min read. The registry is where prompts live, get versions, and earn an audit trail. Pick the wrong backing store and you inherit the wrong failure mode. Pick the right one and most of prompt ops becomes a query.
Builds on 01-prompts-as-code.md. The four laws say a prompt needs identity, version, review, and eval. The registry is the physical thing that holds the first two and makes the last two possible.
1) Hook — the moment a registry earns its keep¶
Look. It is 14:32 on a weekday. The on-call dashboard shows the refund agent's "correct intent extraction" score has fallen from 96% to 78% in the last fifteen minutes. The on-call engineer pulls up a failed trace.
trace_id : 4481_t99a2c
ts : 2026-02-11T14:31:18Z
service : billing
agent : refund_classifier
prompt_name : billing.classifier.refund_intent
prompt_sha : b1d7e4c2
output : {"intent": "warranty_claim", "confidence": 0.42}
expected : {"intent": "refund_request", "confidence": >0.8}
The trace carries the SHA. The on-call engineer types one command into the registry CLI.
$ promptctl history billing.classifier.refund_intent
SHA       parent    deployed_at           author           eval_score
b1d7e4c2  a8c3f971  2026-02-11T14:18:00Z  ria@acme.com     91.4
a8c3f971  e5f02d8b  2026-02-10T09:12:00Z  sundar@acme.com  96.7
e5f02d8b  ...       2026-02-04T11:55:00Z  sundar@acme.com  96.5
There it is. A new SHA went live at 14:18 — fourteen minutes before the metric
dropped. The author is named. The parent SHA is named. The eval score on
promotion was 91.4, lower than the previous 96.7, and somebody promoted it
anyway. The rollback to a8c3f971 is one command. Metrics recover three
minutes later.
None of this works without a registry. The trace would have logged the model's
output, but nothing that ties the failure to a specific edit. The bisection
would have been "what shipped today?", the search would have been a git
log, and the recovery would have been measured in hours, not minutes.
This chapter is about what a registry is, what it stores, what backs it, and the tradeoffs in each choice.
2) The metaphor — the recipe book on the kitchen wall¶
The bakery has thirty bakers across three shifts. They cannot all be reading the same paper recipe on the counter. Not safely. Not without somebody crossing out "two pinches salt" the way the tired cook did.
So the bakery installs the recipe book. It is a bound volume on the wall.
Every page is a recipe. Every recipe has a name on the spine — bread.sourdough,
pastry.croissant, cake.birthday_chocolate. Every page has a number, and
every revision of that page gets pasted in as a new entry with its own
identifier — never erased, never edited in place. When a baker wants tomorrow's
shift to use a new croissant recipe, she does not edit the existing page. She
writes a new page, gets the head baker to sign it, files it as
pastry.croissant — entry 17, and only then changes the pointer on the front
desk that says "today's croissant is entry 17".
That whole construct — the named spine, the immutable revisions, the head baker's signature, the pointer that says what is live — is the prompt registry.
The registry holds three different things, and most teams confuse them at first.
The content is the recipe text itself, content-addressed by its SHA. Once written, never edited. New text gets a new entry.
The identity is the spine label — billing.classifier.refund_intent.
Stable across versions. The same name today and a year from now.
The deployment pointer is the front-desk note — "production is
billing.classifier.refund_intent at SHA a8c3f971". The pointer changes.
The content never does.
Confusing content with identity is what causes the "v17 was edited last week" problem we will return to. Confusing identity with the deployment pointer is what causes deploys-by-rename. Keep all three separate in your head, and the registry's structure follows from there.
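The separation can be sketched in a few lines of Python. This is a toy in-memory model, not any product's API: the `Registry` class, its three dictionaries, and the method names are all hypothetical, and the prompt texts are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical in-memory registry that keeps the three things apart."""

    # Content: SHA -> prompt text. Entries are added, never overwritten.
    contents: dict[str, str] = field(default_factory=dict)
    # Identity: stable name -> every SHA ever filed under that name.
    versions: dict[str, list[str]] = field(default_factory=dict)
    # Deployment pointer: (name, environment) -> SHA currently live.
    pointers: dict[tuple[str, str], str] = field(default_factory=dict)

    def file_version(self, name: str, sha: str, text: str) -> None:
        """File a new entry under a stable name; existing entries stay as-is."""
        if sha in self.contents and self.contents[sha] != text:
            raise ValueError(f"{sha} already holds different content")
        self.contents[sha] = text
        filed = self.versions.setdefault(name, [])
        if sha not in filed:
            filed.append(sha)

    def point(self, name: str, env: str, sha: str) -> None:
        """Deploy or roll back. Only the pointer moves."""
        if sha not in self.versions.get(name, []):
            raise KeyError(f"{sha} was never filed under {name}")
        self.pointers[(name, env)] = sha

    def live_text(self, name: str, env: str) -> str:
        """Resolve name -> pointer -> content for one environment."""
        return self.contents[self.pointers[(name, env)]]


reg = Registry()
name = "billing.classifier.refund_intent"
reg.file_version(name, "a8c3f971", "prompt text, earlier revision")
reg.file_version(name, "b1d7e4c2", "prompt text, later revision")
reg.point(name, "production", "b1d7e4c2")  # deploy
reg.point(name, "production", "a8c3f971")  # rollback is the same operation
```

Notice that deploy and rollback are the same call with a different SHA. Nothing in `contents` changes when traffic moves.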
3) Anatomy of a registry entry¶
A useful registry stores, at minimum, the following per version. Strip any of these out and you lose a capability later.
┌──────────────────────────────────────────────────────────────┐
│                    PROMPT REGISTRY ENTRY                     │
├──────────────────────────────────────────────────────────────┤
│ name        billing.classifier.refund_intent                 │
│ sha         b1d7e4c2... (sha256 of the content blob)         │
│ parent_sha  a8c3f971...                                      │
│ content     { system, few_shot, tools, format }              │
│ created_at  2026-02-11T14:14:00Z                             │
│ created_by  ria@acme.com                                     │
│ status      draft | in_review | approved | deployed |        │
│             deprecated | rolled_back                         │
│ eval_suite  refund_classifier_eval_v3                        │
│ eval_score  91.4                                             │
│ approvers   [aman@acme.com, priya@acme.com]                  │
│ tags        {service: billing, owner: payments_squad}        │
│ change_note "softened apology in fallback example"           │
└──────────────────────────────────────────────────────────────┘
Every field earns its place. The SHA is the content hash; for practical purposes two different texts never collide on the same SHA, and the same text always produces the same one. The parent SHA gives you the chain: every version knows the one it replaced, which is how diffs and rollbacks work without a separate history table. The content is itself a structured object, not a single string, because a prompt is rarely just a system message; it usually carries few-shot examples, tool descriptions, and an expected output format. The status field is what separates a draft an author is iterating on from the version production is actually running. The eval score and approvers are the audit trail for the four laws.
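To see why the parent SHA removes the need for a separate history table, here is a minimal sketch of the walk a history command could perform. The `VERSIONS` dictionary and both function names are hypothetical. The sample data treats e5f02d8b as the first version purely for illustration; the hook's listing elides its parent.

```python
from __future__ import annotations

# Hypothetical version rows keyed by SHA, with only the fields the walk needs.
# Assumption: e5f02d8b is treated as the root (parent_sha None) for this demo.
VERSIONS = {
    "b1d7e4c2": {"parent_sha": "a8c3f971", "created_by": "ria@acme.com"},
    "a8c3f971": {"parent_sha": "e5f02d8b", "created_by": "sundar@acme.com"},
    "e5f02d8b": {"parent_sha": None, "created_by": "sundar@acme.com"},
}


def history(sha: str, versions: dict[str, dict]) -> list[str]:
    """Return the lineage from `sha` back to the first version, newest first."""
    chain: list[str] = []
    seen: set[str] = set()
    current: str | None = sha
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent chain at {current}")
        if current not in versions:
            raise KeyError(f"unknown SHA {current}")
        seen.add(current)
        chain.append(current)
        current = versions[current]["parent_sha"]
    return chain


def rollback_target(sha: str, versions: dict[str, dict]) -> str | None:
    """The version that `sha` replaced, or None for the first version."""
    return versions[sha]["parent_sha"]
```

A diff between a version and its `rollback_target` is then just a diff of two content blobs looked up by SHA.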
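The status field deserves a second look. The five core states, or six counting rolled_back, form a small state machine. The diagram also includes abandoned, a terminal state for work that is rejected or dropped, which the status list above omits.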
draft ─create─▶ in_review ─approve─▶ approved ─promote─▶ deployed
  │                 │                   │                   │
  │                 │                   │                   ▼
  │                 └──reject──┐        └──supersede─▶ deprecated
  │                            ▼                            │
  └────────────────────────▶ abandoned                      ▼
                                                       rolled_back
A draft is the author's playground — edit freely, run trial evals, but not yet visible to production. In review is awaiting human approval. Approved has passed review and eval gates but is not yet live. Deployed is the one the deployment pointer is currently pointing at. Deprecated is an older version superseded by a new deploy. Rolled back is a previously deployed version that production has reverted to — kept as a distinct state because "this version was once live, then unlived, and is now live again" is worth knowing in incident timelines.
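A state machine like this is easiest to enforce as a lookup table. The sketch below encodes one reading of the diagram and the paragraph above; the exact edge set, the `ALLOWED` table, and the `transition` function are assumptions for illustration, and a real registry may permit other moves (for example, a rolled-back version being superseded again).

```python
# Allowed status moves, as one reading of the diagram. Hypothetical table.
ALLOWED = {
    "draft": {"in_review", "abandoned"},
    "in_review": {"approved", "abandoned"},  # approve, or reject
    "approved": {"deployed", "deprecated"},  # promote, or superseded before going live
    "deployed": {"deprecated"},              # superseded by a newer deploy
    "deprecated": {"rolled_back"},           # production reverted to this version
    "rolled_back": set(),
    "abandoned": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not in the table."""
    if current not in ALLOWED:
        raise ValueError(f"unknown status {current!r}")
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


status = "draft"
for step in ("in_review", "approved", "deployed"):
    status = transition(status, step)
```

The useful property is the refusals: a draft cannot jump straight to deployed, so the review and eval gates cannot be skipped by a status edit.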
4) Content-addressed identity — why SHA beats v17¶
The instinct most teams have is to number versions. v1, v2, v3. It is the mental model from semver. It also gets you into trouble fast in prompt land, and the reason is worth being explicit about.
Numbered versions are names. The thing they name can change. If two engineers both add a new version called "v17" on different branches, you have a name collision and have to renumber. Worse, in tools that let you edit a numbered version in place (and many do), "v17" today is not "v17" last week. The trace from last week says it ran v17. Pull up v17 today. It is different text. The link from trace to text is broken.
Content-addressed identity solves both. The version is sha256(content). For
practical purposes no two distinct contents share a SHA (finding a SHA-256
collision is computationally infeasible), and the same content always produces
the same SHA. There is no "renaming": the SHA is the content's fingerprint.
If two engineers both make the same one-character typo fix, they get the same
SHA and the system de-duplicates. If they make different edits, they get
different SHAs, and one of them is going to lose the race to deploy, which is
how it should be.
NUMBERED VERSIONS                    CONTENT-ADDRESSED
─────────────────                    ─────────────────
v17 today may differ                 SHA b1d7e4c2 is the same text
from v17 last week                   today, tomorrow, always

two authors collide on v17           same edit by two authors →
→ renumber, both lose history        same SHA, deduped for free

editing v17 in place is              editing in place is impossible
possible by accident                 — new text = new SHA

"which v17"                          SHA answers itself
The downside is human-readability. b1d7e4c2... is not memorable. Mature
registries solve this by displaying short prefixes (the first 8 characters)
and pairing SHAs with optional human-readable tags
(refund-classifier-2026-feb-launch). The tag is mutable, the SHA is not. The
trace logs the SHA. The dashboard displays the tag. The two layers do not
interfere.
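A minimal sketch of content addressing, using only the Python standard library. One detail is an assumption: the chapter says the version is sha256(content) but not how a structured content object is turned into bytes. Here it is serialized as JSON with sorted keys so that key order cannot change the hash. The function names `content_sha` and `short` are hypothetical.

```python
import hashlib
import json


def content_sha(content: dict) -> str:
    """Hash a structured prompt. Assumes canonical JSON as the byte form."""
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def short(sha: str, length: int = 8) -> str:
    """Display prefix for humans. Traces should still log the full SHA."""
    return sha[:length]


original = {"system": "Classify the message.", "tools": []}
reordered = {"tools": [], "system": "Classify the message."}  # same content
edited = {"system": "Classify the message!", "tools": []}     # one character differs

same_edit_dedupes = content_sha(original) == content_sha(reordered)
different_edit_diverges = content_sha(original) != content_sha(edited)
```

Whatever serialization you choose, fix it once and never change it. A change to the byte form changes every SHA, which breaks the link from old traces to their text.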
Mini-FAQ. "What if I want both — SHA and a friendly number?" Many registries store an auto-incremented version_number alongside the SHA, as a display convenience. The number is for humans; the SHA is for the machines. Production traces should always log the SHA, never the number, because the number can be reassigned across systems and environments.
5) Worked example — a YAML-backed registry entry¶
The simplest registry that satisfies the four laws is a YAML file in a git repo. Below is one entry as a team would actually write it.
# prompts/billing/refund_classifier.yaml
name: billing.classifier.refund_intent
sha: b1d7e4c2a3f048e7b9c1d2e3f4a5b6c7
parent_sha: a8c3f97132e9b5a8d4f6e7c8b9a0c1d2
created_at: 2026-02-11T14:14:00Z
created_by: ria@acme.com
status: deployed
eval_suite: refund_classifier_eval_v3
eval_score: 91.4
approvers:
- aman@acme.com
- priya@acme.com
tags:
service: billing
owner: payments_squad
language: en
change_note: |
Softened the apology in the fallback example. The previous wording
("I'm sorry, I can't help with that.") was triggering a tone-drop in
the conversational eval. New wording is "Let me get someone who can
help with that." Behavior on positive cases unchanged.
content:
system: |
You are the refund-intent classifier for Acme Billing.
Given a customer message, classify into one of:
- refund_request
- warranty_claim
- cancellation
- other
Return strict JSON: {"intent": ..., "confidence": 0.0..1.0}
few_shot:
- role: user
content: "I want my money back for order 4481"
- role: assistant
content: '{"intent": "refund_request", "confidence": 0.95}'
- role: user
content: "The blender broke after two weeks, what do I do?"
- role: assistant
content: '{"intent": "warranty_claim", "confidence": 0.88}'
tools: []
expected_format: json_object
The entry holds everything the four laws asked for. The name gives identity. The SHA gives version. The approvers and change_note give the human review trail. The eval_suite and eval_score give the automated trail. The content block carries everything the runtime needs to load — system message, few-shot turns, tool list, expected format.
A new edit by a different author produces a sibling YAML file at a different
SHA, with this one's SHA as its parent. The deployment pointer (a separate
small file, prompts/deployments/billing.yaml, that maps name →
currently-deployed SHA per environment) is what gets updated to switch
traffic. The content files themselves are append-only.
6) The four backing stores — and what each one buys you¶
Teams pick from roughly four ways to back the registry. Each one trades off the same axes — who can edit, how strong the audit trail is, how fast reads are, and how multi-tenancy works. Knowing the four is enough to evaluate any SaaS pitch you will hear.
Option A — Postgres (or any relational DB)¶
The registry is a set of tables — prompts (name, owner),
prompt_versions (sha, name, content, parent_sha, created_by, status,
eval_score), deployments (name, env, sha). Reads are a join. Writes go
through an API the team owns. Audit trail comes from row history or a
separate events table.
What you buy. Full control of schema. Fast reads at scale. Plays well with existing infra. Easy to add fields specific to your business.
What you pay. You build the editor UI. You build the diff view. You build the eval-gate plumbing. There is no UI until you make one. Most teams start here and then either build a real internal tool or move to one of the SaaS options below.
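A minimal sketch of the three tables, using Python's built-in sqlite3 module as a stand-in so the example runs anywhere. The column types, the primary keys, and the trigger name are assumptions, not a prescribed schema. Postgres trigger syntax differs (it needs a separate trigger function), so treat the trigger here as the idea rather than portable DDL.

```python
import sqlite3

SCHEMA = """
CREATE TABLE prompts (
    name  TEXT PRIMARY KEY,
    owner TEXT NOT NULL
);
CREATE TABLE prompt_versions (
    sha        TEXT PRIMARY KEY,
    name       TEXT NOT NULL REFERENCES prompts(name),
    content    TEXT NOT NULL,
    parent_sha TEXT REFERENCES prompt_versions(sha),
    created_by TEXT NOT NULL,
    status     TEXT NOT NULL,
    eval_score REAL
);
CREATE TABLE deployments (
    name TEXT NOT NULL REFERENCES prompts(name),
    env  TEXT NOT NULL,
    sha  TEXT NOT NULL REFERENCES prompt_versions(sha),
    PRIMARY KEY (name, env)
);
-- Reject any rewrite of stored content. Status updates are still allowed.
CREATE TRIGGER versions_content_immutable
BEFORE UPDATE OF content ON prompt_versions
BEGIN
    SELECT RAISE(ABORT, 'prompt content is immutable, insert a new version');
END;
"""

name = "billing.classifier.refund_intent"
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO prompts VALUES (?, ?)", (name, "payments_squad"))
conn.execute(
    "INSERT INTO prompt_versions VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("a8c3f971", name, "placeholder prompt text", None,
     "sundar@acme.com", "deployed", 96.7),
)
conn.execute("INSERT INTO deployments VALUES (?, ?, ?)", (name, "production", "a8c3f971"))


def live_content(db: sqlite3.Connection, prompt_name: str, env: str):
    """The read path: join the deployment pointer to the version it targets."""
    row = db.execute(
        "SELECT v.content FROM deployments d "
        "JOIN prompt_versions v ON v.sha = d.sha "
        "WHERE d.name = ? AND d.env = ?",
        (prompt_name, env),
    ).fetchone()
    return row[0] if row else None


try:
    conn.execute("UPDATE prompt_versions SET content = 'edited' WHERE sha = 'a8c3f971'")
    edit_rejected = False
except sqlite3.DatabaseError:
    edit_rejected = True  # the trigger aborted the in-place edit
```

The read really is one join, and the in-place edit fails at the database rather than relying on every caller to behave.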
Option B — Git repo of YAML files¶
Each prompt is a YAML file in a prompts/ directory. The deployment pointer
is a separate file. Edits go through pull requests. The Git history is the
audit trail. CI runs lint and eval gates on the PR.
What you buy. Versioning, diff, review, audit — all free, all inherited from the tooling engineers already use. Code review patterns transfer. Easy to inspect, easy to grep.
What you pay. Only engineers can comfortably edit. Non-engineers (PMs, support leads) need either a UI on top or the patience to learn GitHub. Hot edits in production are slow — every change needs a PR, a CI run, and a deploy. The audit trail is excellent for what changed but weak for who viewed it and what eval ran.
Option C — Hosted SaaS (Langfuse, Pezzo, Vellum, Braintrust, etc.)¶
You install an SDK. You write prompts in the SaaS UI. The SaaS stores versions, runs evals, exposes a deployment pointer, gives non-engineers a clean interface to edit and review.
What you buy. A real UI day one. A real audit trail day one. Eval workflows built in. Non-engineer-friendly. Mature features for diffs, A/Bs, and rollbacks. Multi-tenancy and SSO if you pay enough.
What you pay. Vendor lock-in to some degree (most have export). Latency overhead if you hit the SaaS on every request — most SDKs cache, but you inherit the cache logic. A line item in the budget. Data residency questions if your prompts contain regulated content.
Option D — DIY S3 + a manifest¶
The cheap, durable option. Prompt content blobs in S3 (or any object store), keyed by SHA. A manifest file — JSON or YAML — that holds the name-to-SHA mapping per environment. A tiny service or a Lambda updates the manifest. Audit trail is S3 access logs plus the manifest's git history.
What you buy. Almost free at any scale. Immutability can be enforced by the object store itself, for example with S3 Object Lock in write-once-read-many mode. Reads are cheap and globally cacheable. Survives a registry-service outage because there is barely a service.
What you pay. Tooling poverty. No UI. No diff view. No eval gate without building one. Workable for a small backend team; painful for product-led prompt editing.
                    EDIT-BY-NON-   AUDIT    LATENCY    DIY
                    ENGINEER?      TRAIL    ON READ    BUILD
                    ────────────   ─────    ───────    ─────
A — Postgres        medium         strong   low        high
B — Git/YAML        low            strong   low        medium
C — Hosted SaaS     high           strong   medium     low
D — S3 + manifest   low            medium   very low   medium
Most teams end up in B for a while, then move to C as the non-engineer-editing pressure grows. A and D are reasonable end states for teams with unusual constraints — heavy custom workflows or strict cost ceilings.
Mid-content recall¶
- Why does content-addressed identity beat numbered versions?
- Name three fields a registry entry must carry beyond name and content.
- What is the practical difference between deployed and approved in the status state machine?
7) Naming conventions — the part that becomes load-bearing¶
A registry full of well-versioned prompts named prompt_1, prompt_2,
new_prompt, new_prompt_actually_final will fail you faster than no
registry at all. Names are the index that every other tool reads. They go on
dashboards, in eval reports, in alert messages, in customer-success
documentation. Pick them with the care you would pick a public API surface.
The convention that survives contact with real teams is
service.role.purpose. Three dotted segments, lowercase, snake_case within
each segment.
billing.classifier.refund_intent
billing.agent.refund_resolver
support.agent.greeter
support.summarizer.ticket_close
onboarding.checker.email_verification
moderation.classifier.spam
search.rewriter.query_expansion
The first segment is the service or domain. The second is the kind of
prompt — agent, classifier, summarizer, rewriter, extractor,
grader. The third is the specific purpose. The result is a flat namespace
that sorts, greps, and dashboards cleanly.
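A convention only holds if a lint enforces it. Here is a minimal sketch of such a check. The regular expression and the `lint_name` function are hypothetical, and the role list is the one named above plus `checker`, which appears in the example names.

```python
import re

# Three dotted segments, each lowercase snake_case starting with a letter.
NAME_RE = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2}")

# Roles listed in the text, plus "checker" from the example names. Assumed list.
KNOWN_ROLES = {
    "agent", "classifier", "summarizer", "rewriter",
    "extractor", "grader", "checker",
}


def lint_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name passes."""
    if not NAME_RE.fullmatch(name):
        return ["expected service.role.purpose in lowercase snake_case"]
    _service, role, _purpose = name.split(".")
    if role not in KNOWN_ROLES:
        return [f"unknown role {role!r}"]
    return []


examples = [
    "billing.classifier.refund_intent",
    "support.agent.greeter",
    "onboarding.checker.email_verification",
]
bad = ["prompt_v2_final_FINAL", "billing.refund_intent", "Billing.Agent.Greeter"]
```

Run it in CI over every name in the registry, so a bad name is rejected at review time instead of discovered on a dashboard.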
Names should be stable. Once billing.classifier.refund_intent exists,
renaming it costs you every dashboard, alert, and eval that referenced the
old name. Add a new prompt with a new name instead. Deprecate the old one in
the registry's status field.
Mini-FAQ. "Can we use slashes instead of dots?" You can. Dots match Python module conventions and read cleanly in logs. Slashes match path conventions and read cleanly in file trees. Pick one, stick to it.
8) Immutability — the rule that makes everything else work¶
The single rule that the registry must enforce above all others is this. A version, once created, cannot be edited. Not by the author. Not by an admin. Not by a database query. Edit = new version. New SHA.
The rule sounds severe. It is the reason every other piece of the system holds together. Every trace logged a SHA. If the content behind that SHA can change, the trace's link to the truth is broken. Every eval reported a score against a SHA. If the SHA's content can change, the score is a lie. Every rollback target is a SHA. If the SHA's content can change, the rollback is gambling.
Implementation matters. In a Postgres backing, immutability is enforced by a
trigger that rejects updates to the content column. In a Git/YAML backing,
it is enforced by the convention that each version is a separate file at a
SHA-named path and CI rejects modifications to existing files. In a hosted
SaaS, it is enforced by the platform. In an S3 backing, it is enforced by
the object store itself, provided a write-once-read-many control such as S3
Object Lock is turned on; a plain bucket allows overwrites.
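For the Git/YAML backing, the CI rule can be sketched as a check over the output of `git diff --name-status`, which prints one status letter and path per changed file. The directory layout (`prompts/versions/` for version files) and the function name are assumptions for illustration. Only additions under that directory are allowed.

```python
def immutability_violations(
    name_status: str, versions_prefix: str = "prompts/versions/"
) -> list[str]:
    """Flag any change to an existing version file.

    `name_status` is the text printed by `git diff --name-status`:
    a status code, a tab, then the path (renames list old and new paths).
    """
    violations = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        kind, paths = parts[0][0], parts[1:]
        if kind in {"A", "C"}:
            continue  # a new file (added or copied) leaves old versions intact
        old_path = paths[0]  # for renames, git lists the old path first
        if old_path.startswith(versions_prefix):
            violations.append(f"{kind} {old_path}")
    return violations


# Hypothetical diff: one new version, one pointer change, one illegal edit.
sample = "\n".join([
    "A\tprompts/versions/billing.classifier.refund_intent/b1d7e4c2.yaml",
    "M\tprompts/deployments/billing.yaml",
    "M\tprompts/versions/billing.classifier.refund_intent/a8c3f971.yaml",
])
found = immutability_violations(sample)
```

The pointer file under prompts/deployments/ is meant to change, so the check leaves it alone. A non-empty result should fail the build.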
The corollary is that draft is different from version. A draft is
mutable. An author can iterate on a draft text freely. The moment the draft
is submitted to review, it is hashed into a version, and the version is
immutable from that point on. Drafts can carry the same name as the live
version; what differs is the status. Only versions with status deployed
are reachable by production runtime.
9) Failure modes — where registries leak¶
SYMPTOM                                 FIX
───────                                 ───
"v17 was different last week"         → content-addressed SHAs, no in-place
                                        edits
two authors collide on the same       → immutable versions, parent SHA
version name                            lineage
trace logs prompt name but no SHA     → stamp SHA on every trace (chapter 7)
non-engineer cannot edit              → add a UI layer over backing store
PR review is the only audit trail     → add deployment events and access logs
prompts/ directory has 4000 files     → namespace by service in subdirs
prompt_v2_final_FINAL.yaml            → enforce naming convention in lint
deployment pointer drift between      → single deployments file per env,
environments                            reviewed like prompts themselves
The two leaks worth highlighting are the deployment pointer drift — staging runs SHA X, production runs SHA Y, and nobody knows because the pointers live in different files — and the trace-without-SHA leak, which is the reason chapter seven exists.
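Pointer drift is cheap to surface once the pointers are data. A minimal sketch, assuming the per-environment pointer files have already been loaded into a dictionary of environment → {name → SHA}. The function name and the sample SHA c0ffee12 are hypothetical.

```python
from __future__ import annotations


def pointer_drift(
    deployments: dict[str, dict[str, str]]
) -> dict[str, dict[str, str | None]]:
    """Report every prompt name whose SHA differs across environments.

    A name missing from an environment shows up with None for that environment.
    """
    names = set().union(*(pointers.keys() for pointers in deployments.values()))
    report = {}
    for name in sorted(names):
        shas = {env: pointers.get(name) for env, pointers in deployments.items()}
        if len(set(shas.values())) > 1:
            report[name] = shas
    return report


deployments = {
    "production": {
        "billing.classifier.refund_intent": "a8c3f971",
        "support.agent.greeter": "c0ffee12",
    },
    "staging": {
        "billing.classifier.refund_intent": "b1d7e4c2",
        "support.agent.greeter": "c0ffee12",
    },
}
drift = pointer_drift(deployments)
```

A difference is not always a bug, since staging often runs ahead of production on purpose. The point is that somebody sees it, instead of nobody knowing.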
Where this lives in the wild¶
Each of these products has staked out a slice of the registry problem.
- Langfuse — open-source platform; named prompts, numbered versions, environment-scoped deployment labels (production, staging), Python and TypeScript SDKs that resolve a name to its currently-labelled version.
- PromptLayer — SaaS registry with template variables, version diffing, and a "release" concept that maps to deployment pointers.
- Pezzo — open-source registry with environments, role-based edit permissions, and visual diff between versions.
- Helicone — observability-first product that ties traces to prompt versions and lets you replay a trace against a different SHA.
- Vellum — enterprise registry with structured prompt blocks (system / user / examples / tools), eval workflows, and approval gates.
- Braintrust — registry plus eval store; the registry's primary value is that every prompt is paired with its eval suite.
- LangSmith (LangChain) — hub of named prompts callable by name from the LangChain SDK; versions are pinned by hash.
- PromptHub — git-backed registry with a UI layer; mixes options B and C.
- OpenAI Stored Prompts — first-party named, versioned prompts callable by ID from the API; the deployment pointer concept lives inside the OpenAI platform.
- Anthropic Workbench — prompt iteration UI with per-workspace history and ability to copy a SHA-equivalent identifier.
- Vercel AI SDK — typed prompt-definition surface that can be backed by any of the four storage options.
- GitHub (PR review + repo) — when used as the registry, supplies the audit trail, diff, and review surface for option B.
- GitLab Merge Requests — the GitLab-equivalent surface for the same pattern.
- Phabricator — the older review tool some large companies still use as the gate over a Git-backed registry.
- ReviewBoard — niche but in use for the same pattern.
- dbt — analytics teams treat SQL models as versioned, named, eval-gated assets — the same mental model adapted for SQL.
- Hashicorp Vault — used to store credentials referenced from inside prompts; the secret half of a registry, not the prose half.
- AWS Parameter Store — a popular bootstrap option D backing; cheap, S3-like, but you build everything around it.
- AWS Secrets Manager — paired with Parameter Store for the secret half.
- ConfigCat — flag platform some teams misuse as a registry; works for toy systems, struggles at scale.
- LaunchDarkly — gate prompt deployment pointers per cohort or rollout.
- Statsig — same role, paired with experimentation features.
- Flagsmith — open-source equivalent for the same pattern.
- Optimizely — used for prompt A/B experiments where the pointer maps to a variant SHA.
- Split.io — flag-and-experiment surface for the deployment pointer.
- MLflow — model-registry framework whose pattern (model + version + stage) maps cleanly onto prompts; some teams reuse it.
The point is not that one of these is right and the others are wrong. The point is that the four backing-store options have been productised many times over, and you are picking which trade-off curve your team wants to ride.
Pause and recall¶
- Name the three things a registry actually stores, kept conceptually separate.
- What are the five (or six, with rolled-back) statuses in the version state machine?
- Why is immutability the rule that makes every other capability work?
- List the four backing-store options and the main thing each one buys.
- What is the convention for prompt names that survives contact with real teams?
- Why is "v17 in place" worse than "a new SHA"?
- Where does the deployment pointer live, and why is it a separate artifact from the content?
Interview Q&A¶
Q1. What does a prompt registry actually store?
A. Three logically distinct things — the content (immutable,
content-addressed), the identity (a stable name across versions), and the
deployment pointer (which SHA each environment is currently running). Each
version row also carries metadata — parent SHA, author, eval score,
approvers, status, change note.
Trap: Conflating identity with version. The name
billing.classifier.refund_intent is the identity. The SHA is the version.
Production points one at the other.
Q2. Why content-address prompts instead of numbering them?
A. Numbers are mutable names — v17 today can differ from v17 last week if the system lets anyone edit in place. SHAs are content fingerprints — two different texts can never share one, and the same text always produces the same one. Traces, evals, and rollbacks all need a stable link to a specific text. Only SHAs provide that.
Trap: "We use semver." Fine for human display, but the trace should still log a SHA.
Q3. Walk me through choosing between Postgres, Git, hosted SaaS, and S3 for the registry.
A. Postgres for full control with effort. Git/YAML if engineers are the only editors and you want the audit trail for free. Hosted SaaS (Langfuse, Vellum, Braintrust) once non-engineers need to edit. S3 + manifest for cost-ceiling cases with no UI requirement. Most teams go B first, then C as the edit pressure grows.
Trap: Picking SaaS too early — you inherit lock-in before you know your edit shape. Or staying on Git too long when product wants to edit weekly.
Q4. Why is immutability the rule the registry enforces hardest?
A. Every other capability depends on it. Traces log a SHA; if the SHA's content can change, the trace is a lie. Evals score a SHA; same problem. Rollbacks target a SHA; same problem. Immutability is the foundation contract. Drafts are mutable; versions are not.
Trap: Allowing "admin override" edits. The audit trail breaks on the first override.
Q5. How do you name prompts?
A. service.role.purpose — three dotted segments, lowercase. The first is
the service or domain (billing, support, search). The second is the role
(agent, classifier, summarizer). The third is the specific purpose. Stable
across versions. Lints enforce shape.
Trap: Naming by author or by sprint. The name has to survive ownership
changes and outlive the engineer who created it.
Q6. What is the deployment pointer and why is it a separate artifact?
A. The deployment pointer is the small file or row that maps name → SHA
per environment — production points billing.classifier.refund_intent at SHA
a8c3f971. Keeping it separate from the content means deploys are pointer
changes, not content changes. Rollbacks are pointer flips. Promotions across
environments are pointer flips. The content store stays append-only.
Trap: Embedding "is_deployed" as a flag on the content row. Then promote
and rollback both mean updating immutable rows.
Q7. How does a registry support multi-environment workflows?
A. The deployment pointer is environment-scoped — production → SHA X,
staging → SHA Y, dev → SHA Z. Promotion is "flip the staging-pointing
SHA to also be the production-pointing SHA". Rollback is "flip production
back to the previous SHA". The content store does not care which
environment is reading.
Trap: Per-environment content copies. You end up with drift between
staging and production because somebody edited staging in place.
Q8. What is the most common registry leak you have seen?
A. Trace-without-SHA. The trace logs the model's output but not the SHA of
the prompt that produced it. Every incident becomes an archaeology project.
Fix is to wrap the runtime so every request carries prompt_sha into the
trace alongside model and latency. Cost is one extra header. Value is every
postmortem from then on.
Trap: Logging the prompt name but not the SHA. The name does not tell
you which version was live at the time of the failed request.
Apply now (5 min)¶
Step 1 — sketch your registry schema. On paper or whiteboard, draw the
three tables (or YAML files) — prompts (name, owner), prompt_versions
(sha, name, content, parent_sha, created_by, status, eval_score),
deployments (name, env, sha). Fill in two real prompts from your codebase.
Step 2 — pick a backing store. Look at the four options. Ask which two properties matter most to your team in the next year — non-engineer edits, hot edits without deploy, cost, audit-trail strength. Pick.
Step 3 — name twenty prompts. Open your codebase. Find every f-string
that lands in a chat-completion call. Give each one a
service.role.purpose name. Lay them out as a flat list. Notice which
service segments are over-represented — that is where your next refactor
goes.
Bridge. The registry holds versions. The next question is what to do with them — how to diff, how to promote, how to roll back fast when the 14:32 incident in the hook lands on your screen. The rollback is its own discipline.