07. Prompt observability — tracing a bad answer back to the exact recipe that ran

~17 min read. A complaint arrived at 14:32. The user got a bad answer. The bakery's job is to know — in under two minutes — which recipe was on the counter at 14:32, what inputs were poured into it, what the model said, and which downstream tools fired. If the bakery log cannot answer that, the bakery is flying blind.

Builds on 06-prompt-drift-detection.md. Drift detection tells you when behavior changes; observability tells you which recipe produced which output for which user at which moment.

1) Hook — the complaint at 14:32

Friday, 14:32. A support ticket arrives. "I asked the chatbot for a refund. It told me to email a department that doesn't exist. Trace ID 4481."

Two scenarios, same complaint, very different Fridays.

Scenario A — no prompt observability. The team opens trace 4481. The trace shows the LLM call, the input, and the output. The team sees the bad answer but cannot tell which prompt produced it. The prompt registry has v17 and v18 of refund_handler; the team does not know which one was live at 14:32. The on-call engineer digs through git log, runtime config, and deployment timestamps. Forty-five minutes later the team concludes v18 was probably live. They pull v18, read the change, find the bug, write the fix. By the time the rollback ships, the bad answer has gone to 200 more users.

Scenario B — prompt observability. The team opens trace 4481. The trace shows the LLM call, the input, the output — and the span carries prompt.name = refund_handler, prompt.sha = b1d7e4..., prompt.version = v18. One click pulls up the rendered prompt content, with the user's name and order ID already interpolated. Another click pulls up the diff between v18 and v17 in the prompt registry. The bug is in v18's instruction line that mentions a now-decommissioned support email. The rollback to v17 is one command. Two minutes from complaint to rollback.

The difference is not magic. It is span tags. The prompt SHA on every LLM span, queryable in both directions — from trace to prompt, from prompt to traces. This page is how that works.

2) The metaphor — the bakery log

Every croissant that leaves the bakery carries a tiny invisible label. Which recipe was on the counter when this croissant was baked. Which version of that recipe. What flour was used. Which oven. What time. The label is sealed in. The customer never sees it. But when a customer comes back two days later holding a stale croissant, the bakery reads the label and knows everything — exact recipe SHA, exact inputs, exact oven, exact moment.

That sealed label is the trace. The recipe SHA on every croissant is the prompt SHA on every span. The bakery log is the trace store. Without it, every complaint is a forty-five-minute hunt. With it, every complaint is a two-minute lookup.

The same label lets the question run the other way. "We just rolled back v18. Show me every croissant baked from v18 in the last six hours." The bakery scans the log for SHA = b1d7e4 between T-6h and T-now. That set of croissants is the impact radius of the regression. The bakery can decide whether to refund customers proactively, send an apology email, or just note it.

3) The anatomy — what gets tagged on every span

┌───────────────────────────────────────────────────────────────────┐
│ TRACE STRUCTURE FOR ONE PROMPT RUN                                │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ROOT SPAN — request                                              │
│  ├─ trace_id          = abc123...                                 │
│  ├─ user_id           = u_5012 (hashed)                           │
│  ├─ tenant_id         = t_acme                                    │
│  │                                                                │
│  ├── CHILD SPAN — prompt.template.render                          │
│  │    ├─ prompt.name       = refund_handler                       │
│  │    ├─ prompt.sha        = b1d7e4...                            │
│  │    ├─ prompt.version    = v18                                  │
│  │    ├─ prompt.variant    = control      (or A/B variant)        │
│  │    └─ prompt.template_inputs = {order_id, customer_name, ...}  │
│  │                                                                │
│  ├── CHILD SPAN — llm.call                                        │
│  │    ├─ genai.system       = anthropic                           │
│  │    ├─ genai.model.name   = claude-4-7-sonnet                   │
│  │    ├─ genai.model.id     = claude-4-7-sonnet-20260301          │
│  │    ├─ genai.usage.input_tokens  = 1240                         │
│  │    ├─ genai.usage.output_tokens = 180                          │
│  │    ├─ genai.usage.cost   = 0.0028                              │
│  │    └─ rendered_prompt    = (full prompt content, or pointer)   │
│  │                                                                │
│  └── CHILD SPAN — output.parse                                    │
│       ├─ output.format       = json                               │
│       ├─ output.valid        = true                               │
│       └─ tool_calls          = [issue_refund]                     │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

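The five attributes that matter most are prompt.name, prompt.sha, prompt.version, prompt.variant, and genai.model.id. Every LLM span carries them. The OpenTelemetry GenAI semantic conventions formalize the model-side attributes under the gen_ai.* namespace (gen_ai.system, gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens); this page writes the prefix as genai.* for short. The conventions do not define the prompt.* metadata, so treat those attributes as custom extensions added on top. The snapshot-level ID this page calls genai.model.id corresponds most closely to gen_ai.response.model, which typically carries the exact model the provider reports.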

Why both prompt.sha and prompt.version? The SHA is the content hash — immutable, unforgeable, the actual identity of the prompt that ran. The version is the human-readable name — v17, v18, "after the refund-policy fix." Operators query by SHA because SHA is what guarantees correctness. Humans read version because version is what they remember.

Why prompt.variant? Because during an A/B (chapter 05) two SHAs can be live simultaneously. The variant tag tells the team which arm of the experiment this trace belongs to.
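The split between SHA and version can be sketched in a few lines. This is a minimal illustration, assuming the SHA is a SHA-256 over the raw template text; the helper names (prompt_sha, span_attributes), the template wording, and the in-memory TEMPLATES dict are hypothetical stand-ins for a real registry.

```python
import hashlib


def prompt_sha(template: str) -> str:
    """Content hash of the raw template text (assumed scheme: SHA-256, hex)."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


# Hypothetical registry contents: human-readable version -> template text.
TEMPLATES = {
    "v17": "For large refunds, ask {customer_name} to email {support_email}.",
    "v18": "For large refunds, tell {customer_name} to email {support_email}.",
}

# The version label is assigned by people; the SHA follows from the content.
VERSION_TO_SHA = {version: prompt_sha(text) for version, text in TEMPLATES.items()}


def span_attributes(name: str, version: str, variant: str = "control") -> dict:
    """Attributes for the prompt.template.render span of one request."""
    return {
        "prompt.name": name,
        "prompt.sha": VERSION_TO_SHA[version],
        "prompt.version": version,
        "prompt.variant": variant,
    }


attrs = span_attributes("refund_handler", "v18")
```

Any edit to the template text, even a single word, produces a different SHA, while the version label changes only when someone decides to bump it. That is why queries key on the SHA and dashboards display the version.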

4) The joined view — both directions

The point of tagging is bidirectional lookup.

Trace → prompt. From a trace ID, surface the prompt SHA. From the SHA, surface the prompt content in the registry. From the content plus template_inputs, render the exact prompt the model saw. This is the path for incident response — "trace 4481 ran v18; here is the rendered prompt; here is the bug in line 7."

Prompt → traces. From a prompt SHA, surface every trace that used it. Aggregate over time — how many requests, what cost, what latency, what error rate. This is the path for impact analysis and for monitoring — "v18 has been live for 6 hours; here are the 12,400 traces that used it; here are the 38 traces that failed."

The "joined view" is the UX where these two paths are one click apart. Langfuse, LangSmith, Braintrust, and Phoenix all build this UX directly. Datadog and Helicone build it via tags and dashboards. OpenLLMetry builds it via OpenTelemetry attribute conventions and lets the backend (Honeycomb, Datadog, Grafana Tempo) handle the queries.
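Both directions of the lookup reduce to two indexes over the same span records. The sketch below assumes spans have already been exported as flat dicts; the record shape, the REGISTRY dict, and the function names are illustrative assumptions, not the query API of any of the tools named above.

```python
from collections import defaultdict

# Hypothetical flattened span records, as a trace backend might return them.
SPANS = [
    {"trace_id": "4481", "prompt.sha": "b1d7e4", "prompt.version": "v18", "error": False},
    {"trace_id": "4482", "prompt.sha": "b1d7e4", "prompt.version": "v18", "error": True},
    {"trace_id": "4390", "prompt.sha": "a8c3f9", "prompt.version": "v17", "error": False},
]

# Hypothetical registry: SHA -> template text.
REGISTRY = {"b1d7e4": "v18 template text", "a8c3f9": "v17 template text"}


def build_indexes(spans):
    """Index spans by trace ID (trace -> prompt) and by SHA (prompt -> traces)."""
    by_trace = {s["trace_id"]: s for s in spans}
    by_sha = defaultdict(list)
    for s in spans:
        by_sha[s["prompt.sha"]].append(s)
    return by_trace, by_sha


def trace_to_prompt(by_trace, registry, trace_id):
    """Incident-response path: which prompt content did this trace run?"""
    span = by_trace[trace_id]
    return span["prompt.version"], registry[span["prompt.sha"]]


def prompt_to_traces(by_sha, sha):
    """Impact-analysis path: how many requests and errors did this SHA serve?"""
    spans = by_sha.get(sha, [])
    return {"requests": len(spans), "errors": sum(s["error"] for s in spans)}


by_trace, by_sha = build_indexes(SPANS)
```

Here trace_to_prompt(by_trace, REGISTRY, "4481") returns the v18 template, and prompt_to_traces(by_sha, "b1d7e4") returns the request and error counts for that SHA. A real backend does the same thing with indexed attribute queries rather than in-memory dicts.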

5) Sampling vs full capture — the cost decision

At scale, capturing every trace at full fidelity is expensive. A single Claude call can produce 10-50 KB of trace data when the rendered prompt and full output are included. At a million requests a day, that is 10-50 GB of new trace data every day. Multiplied over weeks of retention, the trace store bill becomes real.
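The arithmetic behind those figures, as a small sketch. The function name and the 30-day retention window are assumptions chosen for illustration; the per-trace sizes and request volume are the ones quoted above.

```python
def trace_storage_gb(requests_per_day, kb_per_trace, retention_days):
    """Steady-state trace store size in GB (decimal units: 1 GB = 1,000,000 KB)."""
    return requests_per_day * kb_per_trace * retention_days / 1_000_000


daily_low = trace_storage_gb(1_000_000, 10, 1)      # 10.0 GB per day
daily_high = trace_storage_gb(1_000_000, 50, 1)     # 50.0 GB per day
monthly_high = trace_storage_gb(1_000_000, 50, 30)  # 1500.0 GB held at 30-day retention
```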

The standard answer is tiered sampling.

TIER                         CAPTURE                       RATE
────                         ───────                       ────
default                      span tags only                100%
                             (SHA, model, tokens, cost)

normal                       + rendered prompt             1-10%
                             + full output

triggered                    + everything                  100%
                             on:
                             - error
                             - high latency tail
                             - user complaint
                             - explicit flag (debug=true)
                             - new prompt SHA (first 24-72h)

Always tag the SHA on the span — that is cheap and the most valuable signal. Tag tokens, cost, latency — also cheap. Capture the full rendered prompt and full output only on a sample, plus a triggered rule that captures everything for the cases the team actually needs. The triggered rule is what catches the 14:32 complaint — if the trigger fires on any error or high-complaint-likelihood pattern, the team has the full data exactly when needed.

A new prompt SHA in the first 24-72 hours of rollout deserves 100% capture. That is the highest-risk window and the cost of full capture is finite (one prompt, one ramp window).
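The tier table can be read as a single decision function. The sketch below is one way to express it, assuming span records are dicts; the 5% sample rate, the 5000 ms latency threshold, the 72-hour window, and the field names (error, latency_ms, user_complaint, debug) are illustrative assumptions to be tuned per system.

```python
import random
from datetime import datetime, timedelta, timezone

FULL, TAGS_ONLY = "full", "tags_only"


def capture_level(span, sha_first_seen, now, sample_rate=0.05,
                  latency_slo_ms=5000, new_sha_window=timedelta(hours=72),
                  rng=random.random):
    """Decide whether to store the rendered prompt and output for this span.

    Span tags (SHA, model, tokens, cost) are always kept; this only gates
    the heavy payload.
    """
    # An unseen SHA is treated as brand new, so it is captured in full.
    first_seen = sha_first_seen.get(span["prompt.sha"], now)
    triggered = (
        span.get("error", False)
        or span.get("latency_ms", 0) > latency_slo_ms
        or span.get("user_complaint", False)
        or span.get("debug", False)
        or now - first_seen < new_sha_window
    )
    if triggered:
        return FULL
    # Otherwise fall back to the "normal" tier: a uniform random sample.
    return FULL if rng() < sample_rate else TAGS_ONLY


now = datetime(2026, 3, 20, 14, 32, tzinfo=timezone.utc)
first_seen = {
    "b1d7e4": now - timedelta(hours=2),   # rolled out two hours ago
    "a8c3f9": now - timedelta(days=30),   # long-lived prompt
}
```

The rng parameter exists so the sampling branch can be made deterministic in tests; production code would leave it at the default.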

6) PII and the rendered prompt

Prompts often have user data interpolated — name, email, order details, the user's literal question. Storing the rendered prompt in the trace means storing PII. That has legal and operational consequences.

Three patterns to manage it.

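Pattern one — store the template, not the rendered prompt. Save the prompt SHA (which points to the template) and save the template inputs as a structured dict, but do not store the fully concatenated string. The team can re-render at debug time. For the trace store to hold no PII, the inputs themselves must be stored as references (order ID, customer ID) or kept in a separate access-controlled store, since literal values such as a name or the user's question are still PII inside a dict.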

Pattern two — redact at write. Run a PII redactor on the rendered prompt before it enters the trace store. Replace emails with <EMAIL>, phone numbers with <PHONE>, names with <NAME>. Reduces utility a little (the team cannot see the exact wording the user used) but eliminates the storage risk.

Pattern three — encrypt and TTL. Store rendered prompts encrypted, with a short TTL (30-90 days), and gate decryption on incident response with audit logs. Highest utility, highest operational cost.

Pattern one is the cheapest and the most defensible. Patterns two and three are common in regulated industries (healthcare, finance). The decision depends on the data, the regulator, and the team's risk appetite — but pretending the issue does not exist is not an option, and observability vendors increasingly default to redaction on capture.
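Patterns one and two can be sketched with the standard library alone. The regular expressions below are deliberately simple assumptions: they catch common email and phone shapes but are not a complete PII detector, and name redaction (which needs entity recognition) is left out. The function names are hypothetical.

```python
import re

# Simplified patterns; a production redactor would use a dedicated PII library.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Pattern two: replace emails and phone numbers before the trace write."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def rerender(template: str, inputs: dict) -> str:
    """Pattern one: rebuild the rendered prompt at debug time from template + inputs."""
    return template.format(**inputs)


stored_template = "Customer {customer_name} asks about order {order_id}."
debug_view = rerender(stored_template, {"customer_name": "Aarti", "order_id": "4481"})
safe_text = redact("Call +91 98765 43210 or mail aarti@example.com")
```

With pattern one, the inputs passed to rerender would be fetched from wherever the access-controlled values live, so the trace itself carries only the SHA and the references.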

7) Worked example — root-causing trace 4481

The Friday 14:32 complaint, walked through with full observability.

T+0:00 (14:32) — complaint arrives. User reports bad answer, supplies trace ID 4481.

T+0:15 — open trace. The team's observability tool (Langfuse, say) shows trace 4481. The root span is a customer-support request. The LLM span carries:

prompt.name      = refund_handler
prompt.sha       = b1d7e4f2a8c9...
prompt.version   = v18
prompt.variant   = control
genai.model.id   = claude-4-7-sonnet-20260301

T+0:30 — open the rendered prompt. One click in the UI surfaces the prompt template content for SHA b1d7e4. Another click overlays the template inputs from the trace (order_id = 4481, customer_name = "Aarti", complaint_text = "..."). The team reads the rendered prompt and spots the offending line — "For refunds over ₹10,000, ask the customer to email refunds-large@oldcorp.example." That email address was decommissioned in March. The line was edited in v18 to "update the contact" and the editor used a stale address.

T+0:45 — confirm impact. The team queries the trace store for prompt.sha = b1d7e4 over the last 24 hours. 14,200 traces. Filter for ones where the model output contained "refunds-large@oldcorp" — 142 traces, ~1% of v18 traffic. Those are the affected customers.

T+1:30 — roll back v18 to v17. One command in the prompt registry. New traces immediately start carrying prompt.sha = a8c3f9... for v17.

T+1:45 — verify. Check the most recent traces. SHA on the LLM span is a8c3f9. v17 is live. Time-to-recovery: under two minutes from "open trace" to "rollback issued." Cleanup of the 142 affected customers follows asynchronously.

Without the SHA tags, this whole walk would have been a forty-five-minute git log dive. With the tags, the team is back to baseline before the engineering manager finishes her coffee.
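The T+0:45 impact query is a filter on SHA followed by a filter on output text. A minimal sketch over synthetic data follows; the record shape and the impact_radius helper are assumptions, and the 200 synthetic traces are scaled down from the 14,200 in the walkthrough while keeping the same 1% ratio.

```python
def impact_radius(spans, sha, needle):
    """Count traces for one prompt SHA and list those whose output contains `needle`."""
    for_sha = [s for s in spans if s["prompt.sha"] == sha]
    affected = [s["trace_id"] for s in for_sha if needle in s.get("output", "")]
    share = len(affected) / len(for_sha) if for_sha else 0.0
    return len(for_sha), affected, share


# Synthetic data: 200 traces on one SHA, 2 of which quote the stale address.
spans = [
    {"trace_id": str(i), "prompt.sha": "b1d7e4", "output": "Your refund is on its way."}
    for i in range(198)
]
spans += [
    {"trace_id": "4481", "prompt.sha": "b1d7e4",
     "output": "Please email refunds-large@oldcorp.example."},
    {"trace_id": "4502", "prompt.sha": "b1d7e4",
     "output": "Write to refunds-large@oldcorp.example for this."},
    {"trace_id": "4390", "prompt.sha": "a8c3f9", "output": "Your refund is on its way."},
]

total, affected, share = impact_radius(spans, "b1d7e4", "refunds-large@oldcorp")
```

The affected list is the cleanup queue: those trace IDs map back to the customers who need a corrected answer.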

Mid-content recall

Why does an LLM span carry both prompt.sha and prompt.version, and what does each enable?
What is the "joined view" and which two directions of lookup does it support?
What is the standard rule for capture rate on a newly-rolled-out prompt SHA?

8) Same SHA, different model — the case observability still has to handle

A team rolls a prompt change. v18 ships, the SHA is b1d7e4. Two weeks later, the model provider releases a new minor version of the underlying model — claude-4-7-sonnet-20260315 replaces claude-4-7-sonnet-20260301. The deployment pulls the new model. The prompt SHA has not changed. The behavior has.

Users complain. The team opens a trace. The prompt SHA is unchanged from two weeks ago. Confused on-call. "We haven't touched this prompt — why is behavior different?"

This is why genai.model.id (not just genai.model.name) goes on every span. The model name is claude-4-7-sonnet — too coarse. The model ID is claude-4-7-sonnet-20260301 — the specific snapshot. When behavior changes without a prompt change, the model ID tag is the signal. The team queries the trace store: "show me model IDs used for refund_handler over time." A clean step change at 2026-03-15 reveals the model update. The complaint correlates. The team either pins the old model snapshot or re-evals the prompt against the new model.

Tag the model ID, the prompt SHA, and the toolbelt SHA (if the toolbelt is also versioned). All three can drift independently. Each one tagged independently lets the team isolate which axis changed.
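The "model IDs over time" query amounts to sorting spans by timestamp and reporting where the tag changes. A sketch under the assumption that each span record carries an ISO-8601 UTC timestamp in a ts field; the field name, the helper, and the sample timestamps are illustrative.

```python
def model_id_changes(spans):
    """Return (timestamp, old_id, new_id) for each change in genai.model.id over time."""
    # ISO-8601 strings in the same timezone sort correctly as plain strings.
    ordered = sorted(spans, key=lambda s: s["ts"])
    changes = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev["genai.model.id"] != cur["genai.model.id"]:
            changes.append((cur["ts"], prev["genai.model.id"], cur["genai.model.id"]))
    return changes


spans = [
    {"ts": "2026-03-15T09:00:00Z", "prompt.sha": "b1d7e4",
     "genai.model.id": "claude-4-7-sonnet-20260315"},
    {"ts": "2026-03-14T10:00:00Z", "prompt.sha": "b1d7e4",
     "genai.model.id": "claude-4-7-sonnet-20260301"},
    {"ts": "2026-03-16T11:00:00Z", "prompt.sha": "b1d7e4",
     "genai.model.id": "claude-4-7-sonnet-20260315"},
]

changes = model_id_changes(spans)
```

A single entry in changes with an unchanged prompt.sha on both sides is the signature of a model-side shift. The same function applied to a toolbelt SHA field would isolate the third axis.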

9) Failure modes — where observability quietly fails

FAILURE MODE                              FIX
────────────                              ───
logging only the final output             →   tag the prompt SHA, rendered prompt,
                                              and model ID on every LLM span
logging the rendered prompt only          →   also tag the SHA so SHA-based queries work
prompt.name without prompt.sha            →   SHA is mandatory; name is for humans
no model ID, only model name              →   tag the full snapshot ID
no variant tag during A/B                 →   tag prompt.variant on every span
storing PII in rendered prompts           →   redact at write, or store template + inputs
                                              separately
sampling that drops error traces          →   triggered capture on error, regardless
                                              of sample rate
SHA tag missing on the FIRST call         →   instrument at the prompt template render
of a new prompt                                step; SDKs auto-tag if integrated
retention too short to investigate slow   →   30-90 days minimum; longer for regulated
complaints                                    contexts
no audit log on PII decryption            →   gate decryption with audit trail
trace store has tags but UI cannot        →   the bidirectional joined view is what
do bidirectional lookup                       makes tags useful — pick tooling that
                                              supports it

The single deepest failure mode is the first one. A surprising number of LLM systems log only the model's final output, not the rendered prompt or its SHA. Those systems can answer "what did the model say?" but cannot answer "what did we ask?" — and the second question is what root-causing requires.
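Several rows in the table are missing-attribute problems, which a small check in CI or at span export can catch early. The sketch below assumes span attributes arrive as a dict; the REQUIRED tuple mirrors the five attributes named in section 3, and the function name is hypothetical.

```python
REQUIRED = (
    "prompt.name",
    "prompt.sha",
    "prompt.version",
    "prompt.variant",
    "genai.model.id",
)


def missing_attributes(span_attrs: dict) -> list:
    """Return the mandatory attributes that are absent or empty on an LLM span."""
    return [key for key in REQUIRED if not span_attrs.get(key)]


good_span = {
    "prompt.name": "refund_handler",
    "prompt.sha": "b1d7e4",
    "prompt.version": "v18",
    "prompt.variant": "control",
    "genai.model.id": "claude-4-7-sonnet-20260301",
}

# A span that logs only the output and the coarse model name.
output_only_span = {"genai.model.name": "claude-4-7-sonnet", "output": "..."}
```

Running missing_attributes over a sample of production spans gives the gap list directly: an empty result means the span supports lookup in both directions.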

10) The integration model — how SDKs wire this up

The instrumentation does not have to be hand-written. Modern prompt-ops tooling provides SDKs that auto-tag spans.

The Langfuse SDK has a prompt fetch call (langfuse.prompt.get("refund_handler") in the JS/TS SDK, langfuse.get_prompt("refund_handler") in Python; exact names vary by SDK version) that pulls the template and returns it as a wrapped object. When that object is used to render a prompt and then passed to an LLM client, the SDK auto-tags the resulting span with prompt.name, prompt.sha, prompt.version, and template_inputs. LangSmith's client.pull_prompt() does the same. Braintrust's SDK wraps the same pattern. Helicone tags via headers: the prompt service sets Helicone-Prompt-Id and Helicone-Prompt-Version on the LLM request, and the proxy records them on the span.

For systems on OpenTelemetry directly, OpenLLMetry instrumentation libraries auto-tag the GenAI semantic-convention attributes on LLM calls, and the team adds prompt.* attributes manually at the template render step. The trace backend (Datadog, Honeycomb, Grafana Tempo, New Relic) picks up the attributes and indexes them.

The point is — the team should not be writing trace tagging code by hand. The SDK does it. The team's job is to make sure the SDK is integrated at the prompt registry boundary so that every prompt fetch carries its SHA into the resulting span.
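To make "integrated at the prompt registry boundary" concrete, here is a toy version of what such an SDK does internally. Everything in it is a hypothetical stand-in (PromptHandle, InMemoryRegistry, the SHA-256 scheme); it is not the API of Langfuse, LangSmith, or any other product, only the shape of the pattern.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptHandle:
    """Wrapped template returned by the registry; carries its own identity."""
    name: str
    version: str
    template: str

    @property
    def sha(self) -> str:
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()

    def render(self, **inputs):
        """Return the rendered text plus the span attributes to attach."""
        attrs = {
            "prompt.name": self.name,
            "prompt.sha": self.sha,
            "prompt.version": self.version,
            # Keys only; values may contain PII and are handled separately.
            "prompt.template_inputs": sorted(inputs),
        }
        return self.template.format(**inputs), attrs


class InMemoryRegistry:
    """Stand-in for a real prompt registry client (hypothetical)."""

    def __init__(self):
        self._live = {}

    def publish(self, name, version, template):
        self._live[name] = PromptHandle(name, version, template)

    def get(self, name) -> PromptHandle:
        return self._live[name]


registry = InMemoryRegistry()
registry.publish("refund_handler", "v17", "Hello {customer_name}, order {order_id}.")
handle = registry.get("refund_handler")
text, attrs = handle.render(customer_name="Aarti", order_id="4481")
```

Because render returns the attributes alongside the text, the caller cannot produce a prompt without also producing its tags. A real SDK goes one step further and attaches them to the active span automatically.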

Where this lives in the wild

The observability layer is one of the more mature pieces of prompt ops tooling. Many products converge on similar shapes.

Langfuse — open-source LLM observability with built-in prompt registry; SDK auto-tags spans with prompt SHA and template inputs; joined view from trace to prompt to traces.
LangSmith — LangChain's hosted observability; auto-tags prompt name, version, and template inputs via the SDK.
Helicone — proxy-based observability; tags via headers; built-in prompt diff and request explorer.
Braintrust — eval-and-observability platform with paired diff views and prompt-version-aware trace search.
Phoenix (Arize) — open-source LLM observability built on OpenTelemetry; supports prompt-version attributes natively.
PromptLayer — request tracking by prompt template version; web UI shows full prompt history per request.
Vellum — observability and eval with prompt-version tagging across requests.
Pezzo — prompt management with built-in observability; tags requests with prompt ID and version.
Galileo — production LLM observability with prompt-version metric breakdowns.
Datadog LLM Observability — span-level capture with genai.* attributes plus custom prompt.* tags; dashboards by prompt version.
New Relic AI Monitoring — similar shape; trace-and-prompt correlation in the request explorer.
Dynatrace AI Observability — OpenTelemetry-based, prompt version as a custom dimension.
OpenLLMetry — open-source OpenTelemetry instrumentation for LLM SDKs; emits GenAI semantic conventions.
OpenTelemetry GenAI conventions — the spec that defines the gen_ai.* attributes (written genai.* on this page), such as gen_ai.system, gen_ai.request.model, and gen_ai.usage.*; the lingua franca for vendor-agnostic LLM traces.
Honeycomb — backend for OpenTelemetry GenAI spans; high-cardinality queries by prompt SHA.
Grafana Tempo + Loki — trace and log backends that index prompt-version attributes for joined queries.
Sentry AI — error-first observability with prompt-version tagging on captured exceptions.
Patronus AI — production observability with prompt-version regression detection.
TruLens — feedback functions logged against prompt version metadata.
OpenAI Evals + Trace — paired eval-and-trace tooling for prompt-version comparison.
Promptfoo eval reports — link to upstream observability platforms by prompt version.
GitHub Copilot internal telemetry — known to tag prompt version and acceptance rate per request.
Anthropic Console (logging) — tagged metadata per request for prompt-version correlation.
OpenAI Logs — request-level metadata, including user field and custom tags for prompt-version tagging.
AWS Bedrock model invocation logs — capture full prompt and output; team adds prompt version via input tags.

The pattern is convergent across the industry — span tags for prompt SHA, model ID, and template inputs on every LLM call. The vendor differences are in the joined view UX and the retention defaults, not in the data model.

Pause and recall

What five attributes must every LLM span carry to enable trace-to-prompt lookup?
What is the difference between prompt.sha and prompt.version, and which one is mandatory for correctness?
Why must genai.model.id be tagged separately from genai.model.name?
What is the standard rule for trace capture during the first 24-72 hours of a new prompt rollout?
Name three patterns for handling PII in rendered prompts in observability storage.
What is the bidirectional joined view, and why does it make tags useful?
Why is the "log only the final output" approach insufficient for root-causing?

Interview Q&A

Q1. A complaint arrives at 14:32 with a trace ID. Walk me through getting to a rollback in under two minutes. A. Open the trace. The LLM span carries prompt.sha, prompt.version, prompt.variant, and genai.model.id. One click pulls the prompt content from the registry by SHA. Overlay the trace's template_inputs to see the rendered prompt the model actually saw. Spot the bug. Query the trace store for the same SHA over the last 24 hours to scope impact. Issue the rollback to the prior SHA via the registry CLI. Verify by checking the SHA on the next traces. Two minutes if all the tags are in place. Trap: "We can git blame the prompt." Git blame tells you who edited; it does not tell you which SHA was live at 14:32 in production, which is the question.

Q2. What are the must-have span attributes for an LLM call? A. prompt.name, prompt.sha, prompt.version, prompt.variant (during A/B), genai.system, genai.model.name, genai.model.id, input and output token counts, latency, cost, and the parsed output structure. The OpenTelemetry GenAI semantic conventions cover the model-side attributes. Prompt attributes are added at the template render step. Trap: Listing only model attributes. Without prompt attributes, you cannot correlate behavior to recipe.

Q3. Why log the SHA and the rendered prompt? Isn't the SHA enough? A. The SHA points to the template — you can always re-render from SHA plus template inputs. So if storage cost matters, store SHA + inputs and skip the rendered string. But many teams store the rendered prompt directly during the first 24-72 hours of a new SHA, for fast inspection without re-rendering. The decision is cost vs convenience. Both flows must work — the SHA alone is the floor; the rendered prompt is the convenience layer. Trap: "Log the rendered prompt, skip the SHA." Without SHA, you cannot do "show me all traces from prompt b1d7e4" — the reverse-lookup is broken.

Q4. How do you handle PII in the rendered prompt at scale? A. Three options. (1) Store the SHA and template inputs separately, do not store the concatenated rendered string — re-render at debug time, no PII at rest. (2) Run a redactor on the rendered prompt before write — replace emails, phones, names with placeholders. (3) Encrypt the rendered prompt with a short TTL and audit-log decryption. The right choice depends on regulation, data sensitivity, and how often the team needs to inspect raw prompts. Trap: "We just won't store it" — then the team cannot root-cause. The choice is how to store, not whether.

Q5. The prompt SHA hasn't changed, but behavior is different. What do you check? A. The model ID. Provider model updates can ship under the same model name with a new snapshot ID. Tag genai.model.id on every span and query for step changes over time. Also check the toolbelt: if tool schemas are versioned, tag the toolbelt SHA. Also check retrieval: if the corpus or embedding model changed, the inputs to the prompt changed even though the prompt did not. All four axes (prompt, model, toolbelt, retrieval) can drift independently; instrument all four. Trap: "Behavior changed means the prompt changed." Not always.

Q6. How do you sample traces at scale without losing the ones that matter? A. Two-layer sampling. Always tag every span — SHA, model ID, tokens, cost, latency. That is cheap and constant cost. Capture rendered prompts and full outputs on a sample (1-10%) plus on triggers — every error, every high-latency tail event, every user-flagged complaint, every trace from a new prompt SHA in its first 24-72 hours. Trigger rules capture the rare cases the team actually needs without paying the full storage bill on uneventful traffic. Trap: "Sample uniformly at 1%." You will miss the error you need 99% of the time.

Q7. How do prompt-management SDKs (Langfuse, LangSmith) integrate with observability? A. The SDK wraps the prompt fetch. client.prompt.get("refund_handler") returns a wrapped template object. When the template is rendered and the result is passed to an LLM client, the SDK auto-tags the resulting span with prompt.name, prompt.sha, prompt.version, and template_inputs. The team does not write tagging code; the SDK does it at the registry boundary. For OpenTelemetry-native setups, OpenLLMetry instrumentation does the same for genai.* attributes and the team adds prompt.* manually. Trap: "We'll hand-roll the tagging." That works for a 5-engineer team. It does not survive scale.

Q8. What is the retention policy you would recommend for LLM traces? A. Span tags (SHA, model ID, tokens, cost, latency): 6-12 months for trend analysis and incident lookback. Rendered prompts and outputs: 30-90 days, with PII redaction or encryption. Triggered captures (errors, complaints, debug flags): 6-12 months because those are the cases incident review needs. Decryption of any encrypted prompt store must be audit-logged. Regulated industries push these numbers up; budget pressure pushes them down. The floor is "long enough that a complaint discovered three weeks later can still be root-caused." Trap: "Two weeks is enough." Slow-surfacing complaints (refund cycles, churn) need months.
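The retention answer in Q8 can be written down as a small policy table plus a check. The specific day counts below are assumptions picked from inside the ranges given in the answer, and the names (RETENTION_DAYS, still_investigable) are hypothetical.

```python
from datetime import date, timedelta

# Day counts chosen from within the ranges in Q8; tune per regulator and budget.
RETENTION_DAYS = {
    "span_tags": 365,          # within the 6-12 month range
    "rendered_payload": 90,    # within the 30-90 day range
    "triggered_capture": 365,  # within the 6-12 month range
}


def expires_on(kind: str, written: date) -> date:
    """Last day on which a record of this kind is still retained."""
    return written + timedelta(days=RETENTION_DAYS[kind])


def still_investigable(kind: str, written: date, complaint_day: date) -> bool:
    """Can a complaint raised on `complaint_day` still reach this record?"""
    return complaint_day <= expires_on(kind, written)


written = date(2026, 3, 1)
three_weeks_later = still_investigable("rendered_payload", written, date(2026, 3, 22))
four_months_later = still_investigable("rendered_payload", written, date(2026, 7, 1))
tags_four_months_later = still_investigable("span_tags", written, date(2026, 7, 1))
```

The third check is the reason span tags get the longest window: even after the rendered payload has expired, the SHA on the span plus the registry is enough to reconstruct which template ran.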

Apply now (5 min)

Step 1 — model first. Open one production LLM call in your codebase. Identify where the prompt is rendered. List the attributes that should be tagged on the span at that point — prompt.name, prompt.sha, prompt.version, prompt.variant, template_inputs, genai.model.id. Check which of those are currently being tagged. The missing ones are your gap list.

Step 2 — your turn. Pick one trace from production. Walk through the lookup as if you were responding to a complaint. Can you find the rendered prompt in under 60 seconds? The model ID? The trace count for the same SHA over the last 24 hours? Each "no" is a backlog item.

Step 3 — sketch from memory. Redraw the trace-structure diagram from section 3 with the five mandatory attributes labeled. Then sketch the bidirectional joined view — trace → prompt, prompt → traces — with the lookup time budget written next to each direction.

Bridge. Observability tells you what ran and what happened. It does not tell you whether what happened was acceptable. For that, the bakery needs a taste test that runs every time the recipe is edited — a panel of judges with rubrics, an answer key, and a verdict. That is the eval suite, and it is the gate that keeps bad recipes off the counter in the first place.

→ 08-prompt-eval-suites.md