05. Reproducing the failure: freeze the scene before the trail goes cold
~13 min read. A detective cannot solve a crime they cannot re-enact. Replay is the spine of agent debugging.
Built on the ELI5 in 00-eli5.md. The case file — the full trail of one agent run — is only useful if we can press play again and watch the same crime unfold step by step.
The picture before the details
A user complains. You open the trace. You re-run the same input. The agent answers perfectly. What happened?
Non-determinism. Temperature. A different model snapshot. A tool that returned different data. A retrieval index updated overnight. Many small drifts. The original crime becomes impossible to recreate.
      Tuesday 3:14 PM              Wednesday 10:02 AM
┌─────────────────────────┐    ┌─────────────────────────┐
│ same user input         │    │ same user input         │
│ model = sonnet-4-2510   │    │ model = sonnet-4-2511   │
│ temp = 0.7              │    │ temp = 0.7              │
│ tool returns order #88  │    │ tool returns order #91  │
│ result: WRONG ORDER     │    │ result: looks fine      │
└─────────────────────────┘    └─────────────────────────┘
           crime                     no crime to see
The detective is stuck. The case file exists but the scene has changed. The discipline is to freeze the scene at capture time and replay against it — same inputs, same tools, same model snapshot — so the crime re-enacts on demand instead of slipping back into the dark.
The non-determinism budget
Every agent run has hidden variables. Each can flip the outcome. List them honestly.
non-determinism budget
├── temperature, top_p — sampling randomness in the model
├── seed — supported by some providers, not all
├── model version drift — sonnet-4-2510 → sonnet-4-2511 silently
├── provider routing — regional endpoints answer slightly differently
├── tool-call latency — race conditions between parallel tools
├── tool-result drift — same query, different rows after a write
├── retrieval freshness — vector index updated since the trace
├── time-of-day load — 429 rate limit at peak, 200 at midnight
└── clock-based logic — "due today" depends on now()
Each item must be captured or controlled. You do not need to eliminate randomness; you need to record enough to recreate it. A trace that ignores model version is half a case file, and half a case file rarely closes.
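One budget item that is easy to control in code is clock-based logic. The sketch below is a minimal illustration, not the chapter's implementation: it assumes the agent's "due today" check takes the clock as an argument, so a replay can hand it the timestamp stored in the trace. The names `is_due_today`, `live_clock`, and `frozen_clock`, and the example timestamp, are hypothetical.

```python
from datetime import date, datetime, timezone
from typing import Callable

Clock = Callable[[], datetime]

def is_due_today(due: date, now: Clock) -> bool:
    """Clock-dependent logic: the answer depends on whatever `now` returns."""
    return due == now().date()

def live_clock() -> datetime:
    """Clock used in production runs."""
    return datetime.now(timezone.utc)

def frozen_clock(recorded_iso: str) -> Clock:
    """Build a clock that always returns the timestamp stored in the trace."""
    recorded = datetime.fromisoformat(recorded_iso)
    return lambda: recorded

# Hypothetical recorded start time of the original run.
clock = frozen_clock("2025-01-14T15:14:00+00:00")
print(is_due_today(date(2025, 1, 14), clock))  # True on every replay
print(is_due_today(date(2025, 1, 15), clock))  # False on every replay
```

Passing the clock in rather than calling `datetime.now()` inside the function is the design choice that matters here: the live run and the replay share the same code path and differ only in which clock they receive.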
What MUST be captured to enable replay
A replay-ready trace is more than spans. It is a full forensic kit.
Replay kit per trace
┌──────────────────────────────────────────────────────────────┐
│ run_id, parent_span_id │
│ user_id, tenant_id, session_id │
│ input message (raw, byte-exact) │
│ system prompt (full text, not a name) │
│ tool schemas registered at that moment (versioned) │
│ retrieval results (chunk ids + content snapshots) │
│ model id + provider + region │
│ temperature, top_p, seed (if supported) │
│ all tool requests AND tool responses, byte-exact │
│ timestamps (start, first_token, end) per span │
│ agent code commit hash │
│ feature flags active for this user │
└──────────────────────────────────────────────────────────────┘
These become the evidence tags and witness notes of the case file. Skip the system prompt — you cannot replay. Skip the tool schema version — the same tool name may have a different shape next week. Skip the model id — the provider may have rolled forward silently. Capture is cheap. Missing capture is a missed confession.
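A replay kit is easier to enforce when it is a typed record with a completeness check. The following is a rough sketch under assumptions: the `ReplayKit` class, its field names, and `missing_fields` are illustrative and cover only a subset of the kit listed above.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ReplayKit:
    """Subset of the per-trace replay kit; None means 'not captured'."""
    run_id: str
    input_message: Optional[str] = None
    system_prompt: Optional[str] = None   # full text, not a template name
    tool_schemas: Optional[dict] = None   # schemas as registered at run time
    model_id: Optional[str] = None
    temperature: Optional[float] = None
    tool_io: Optional[list] = None        # recorded tool requests and responses
    commit_hash: Optional[str] = None
    feature_flags: Optional[dict] = None

def missing_fields(kit: ReplayKit) -> list[str]:
    """Names of fields never captured. Empty values such as 0.0 or {} still count as captured."""
    return [f.name for f in fields(kit) if getattr(kit, f.name) is None]

kit = ReplayKit(
    run_id="r_8821",
    input_message="where is my order?",
    model_id="sonnet-4-2510",
    temperature=0.7,
)
print(missing_fields(kit))
# ['system_prompt', 'tool_schemas', 'tool_io', 'commit_hash', 'feature_flags']
```

A check like this could run when a trace is written, so gaps show up as a metric on the day of capture instead of on the day of the incident.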
Frozen fixtures: killing tool non-determinism
Replay against live tools and you replay against a moving target. Order #88 from Tuesday has shipped, the DB now says "delivered," and the same query returns a different answer with the crime gone. The fix is to snapshot every tool response at trace time and store it with the run. On replay, the tool is a stub that returns the frozen response, so the agent sees exactly the bytes it saw the first time.
            Live run                           Replay run
┌─────────┐  call  ┌───────────┐    ┌─────────┐  call  ┌───────────┐
│  agent  │───────→│  real DB  │    │  agent  │───────→│  fixture  │
│         │←───────│ live data │    │         │←───────│  stored   │
└─────────┘  resp  └───────────┘    └─────────┘  resp  └───────────┘
                                                        same bytes
This is the frozen-fixtures pattern. The tool layer becomes deterministic, and only model sampling remains random. Pin temperature = 0, plus a seed where the provider supports one, and the run becomes near-identical.
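The record-then-stub idea can be sketched in a few lines. This is a minimal in-memory illustration, not a production recorder: `FixtureStore`, `recording`, `stubbed`, and the toy `live_lookup` tool are all hypothetical names, and a real store would persist to disk and also key on call order so that repeated identical requests can return different recorded responses.

```python
import hashlib
import json
from typing import Callable

def request_key(tool: str, args: dict) -> str:
    """Stable key for a tool request: same tool and same args give the same key."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class FixtureStore:
    """In-memory stand-in for the fixtures store."""

    def __init__(self) -> None:
        self._responses: dict[str, bytes] = {}

    def record(self, tool: str, args: dict, response: bytes) -> None:
        self._responses[request_key(tool, args)] = response

    def replay(self, tool: str, args: dict) -> bytes:
        key = request_key(tool, args)
        if key not in self._responses:
            # Fail loudly: a missing fixture means the replay left the recorded path.
            raise KeyError(f"no fixture recorded for {tool}({args})")
        return self._responses[key]

def recording(tool: str, fn: Callable[..., bytes], store: FixtureStore) -> Callable[..., bytes]:
    """Live mode: call the real tool and keep the exact response bytes."""
    def wrapper(**args) -> bytes:
        response = fn(**args)
        store.record(tool, args, response)
        return response
    return wrapper

def stubbed(tool: str, store: FixtureStore) -> Callable[..., bytes]:
    """Replay mode: never touch the real tool, return the frozen bytes."""
    def wrapper(**args) -> bytes:
        return store.replay(tool, args)
    return wrapper

# Toy "database" standing in for a live tool backend.
db = {"88": "processing"}

def live_lookup(order_id: str) -> bytes:
    return json.dumps({"order_id": order_id, "status": db[order_id]}).encode("utf-8")

store = FixtureStore()
seen_live = recording("lookup_order", live_lookup, store)(order_id="88")
db["88"] = "delivered"                 # the world moves on after the trace
seen_replay = stubbed("lookup_order", store)(order_id="88")
print(seen_replay == seen_live)        # True: same bytes as the original run
print(live_lookup("88") == seen_live)  # False: the live tool has drifted
```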
Time-of-day failures: the hidden non-determinism
A senior trap. The bug happens at 3 PM every day, you debug at 10 AM, and reproduction quietly refuses to cooperate. At 3 PM a downstream pricing API rate-limits and returns 429 with a partial body; your parser sees half-JSON and hallucinates a price. At 10 AM the same API answers cleanly, replay sees no 429, the ticket gets closed, and 3 PM hits again the next day.
This is why tool-response payloads must be captured byte-exact. Replay the recorded 429 body, not a live call, and the crime returns. The confession turns out to be "downstream 429 body broke our parser," a sentence no amount of model-staring would have produced.
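To make the 3 PM scenario concrete, here is a small sketch that replays a recorded 429 body through two parsers. Everything in it is an assumption for illustration: the fixture contents, the `lenient_price` function standing in for the buggy path, and the `strict_price` function standing in for a fix.

```python
import json
import re

# Hypothetical recorded fixture: status code and raw body bytes as captured at 3 PM.
# The body is cut off mid-number, which is what a partial response looks like.
FIXTURE = {"status": 429, "body": b'{"sku": "A-100", "price": 19'}

def lenient_price(status: int, body: bytes) -> float:
    """Stand-in for the buggy path: ignores the status and scrapes any number it finds."""
    match = re.search(rb'"price":\s*([0-9.]+)', body)
    return float(match.group(1).decode("ascii")) if match else 0.0

def strict_price(status: int, body: bytes) -> float:
    """Stand-in for the fix: refuse non-200 responses, then parse the JSON strictly."""
    if status != 200:
        raise RuntimeError(f"pricing API returned {status}")
    return float(json.loads(body)["price"])

# Replaying the recorded bytes reproduces the wrong price at any time of day.
print(lenient_price(FIXTURE["status"], FIXTURE["body"]))  # 19.0, a truncated price

try:
    strict_price(FIXTURE["status"], FIXTURE["body"])
except RuntimeError as err:
    print(err)  # pricing API returned 429
```

Because the fixture carries the status code and the exact bytes, the same test fails before the fix and passes after it, without waiting for the next peak hour.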
The "happened-once" problem
Sometimes the trace is incomplete: trace sampling dropped spans, tool responses were not stored, or the system prompt changed three deploys later. Full replay is no longer possible. The disciplined response is to triage with partial evidence and instrument so the next occurrence is fully replayable.
- Mark the trace as cold case — known unreproducible.
- Capture every detail still visible — input, output, model id, span timings.
- Add a metric to count similar events.
- Add the missing capture fields now so the next occurrence is replayable.
- If patterns emerge across cold cases — same model, same hour, same tenant — that is your confession without replay.
You do not throw the trace away. You instrument so the next one is fully replayable.
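Spotting patterns across cold cases can be as simple as counting shared attribute values. The sketch below assumes each cold case is stored as a small dict; the field names, the sample data, and the `recurring_patterns` helper are hypothetical.

```python
from collections import Counter

# Hypothetical cold cases: only the fields that were still visible.
cold_cases = [
    {"run_id": "r_1", "model_id": "m-a", "hour_utc": 15, "tenant_id": "t_1"},
    {"run_id": "r_2", "model_id": "m-b", "hour_utc": 15, "tenant_id": "t_2"},
    {"run_id": "r_3", "model_id": "m-a", "hour_utc": 15, "tenant_id": "t_3"},
    {"run_id": "r_4", "model_id": "m-b", "hour_utc": 10, "tenant_id": "t_4"},
]

def recurring_patterns(cases, keys=("model_id", "hour_utc", "tenant_id"), min_count=3):
    """For each key, report the values shared by at least `min_count` cold cases."""
    patterns = {}
    for key in keys:
        counts = Counter(c[key] for c in cases if c.get(key) is not None)
        common = {value: n for value, n in counts.items() if n >= min_count}
        if common:
            patterns[key] = common
    return patterns

print(recurring_patterns(cold_cases))  # {'hour_utc': {15: 3}}
```

In this toy data, three of four cold cases share the same hour, which points at a time-of-day dependency even though none of the runs can be replayed.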
The capture → store → replay pipeline
   LIVE TRAFFIC
         │
         ▼
┌────────────────┐
│ agent run      │
│ prompt build   │── span ─┐
│ llm.generate   │── span ─┤
│ tool.call_a    │── span ─┤──► fixture recorder
│ tool.call_b    │── span ─┘    (raw req + resp bytes)
└────────┬───────┘
         ▼
┌────────────────┐       ┌────────────────────┐
│ trace store    │◀──────│ fixtures store     │
│ (spans + tags) │       │ (tool snapshots)   │
└────────┬───────┘       └─────────┬──────────┘
         └────────────┬────────────┘
                      ▼
              ┌────────────────┐
              │ replay harness │
              │ load run       │
              │ stub tools     │
              │ pin model      │
              │ diff output    │
              └────────────────┘
This pipeline is the spine of the rest of the module. Every later chapter assumes you can replay. Without replay, the lineup of suspects is a guessing game.
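The four harness steps (load run, stub tools, pin model, diff output) can be sketched as one function. This is a toy under stated assumptions: the captured-run dict layout, `replay_run`, and `toy_agent` are hypothetical, and the toy agent makes no model call, so the pinned settings are passed through but unused.

```python
import difflib
from typing import Callable

Agent = Callable[[str, dict, dict], str]

def replay_run(run: dict, agent: Agent) -> list[str]:
    """Re-run one captured run against its fixtures and diff the final outputs."""
    def make_stub(recorded):
        # The stub ignores its arguments and returns what the tool returned originally.
        return lambda **_args: recorded

    tools = {name: make_stub(resp) for name, resp in run["tool_fixtures"].items()}
    settings = {"model": run["model_id"], "temperature": 0}  # pin the model
    replayed = agent(run["input"], tools, settings)
    return list(difflib.unified_diff(
        run["output"].splitlines(), replayed.splitlines(),
        fromfile="original", tofile="replay", lineterm="",
    ))

def toy_agent(message: str, tools: dict, settings: dict) -> str:
    # Stand-in for the real agent loop: it just takes the first candidate.
    candidates = tools["lookup_order"](query=message)
    return f"Order #{candidates[0]} is shipped."

captured = {
    "run_id": "r_demo",
    "input": "where is my order?",
    "model_id": "sonnet-4-2510",
    "tool_fixtures": {"lookup_order": [88, 91]},
    "output": "Order #88 is shipped.",
}

diff = replay_run(captured, toy_agent)
print(diff)  # [] means the replay matched the original output
```

An empty diff says the failure re-enacted faithfully; a non-empty diff after a code change shows exactly which lines of the answer moved.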
Worked example: the Tuesday order-lookup mess
An agent helps users check order status. On Tuesday at 3 PM, user u_7710 asks "where is my order?" The agent answers with details for someone else's order. A real privacy hit.
The complaint slip points to run r_8821. Open the case file.
run r_8821 — order-lookup agent
├── span prompt.build 180 ms
├── span llm.generate 1,420 ms tool_request = lookup_order(query="my order")
├── span tool.lookup_order 240 ms returned 3 candidate orders
├── span llm.generate 900 ms chose order #4421 (wrong user)
└── final response "Order #4421 is shipped."
Step one — naive replay. Same input, live tools. The tool returns 1 candidate now, not 3. The agent answers correctly. Crime gone.
Step two — inspect what we captured.
captured for r_8821
✓ input message, system prompt, model id + temperature
✓ tool schemas at run time
✓ tool response bytes for lookup_order ← three candidates
✗ user_id missing from the lookup_order request ← THE BUG
✗ tenant_id not propagated to the tool span
Step three: frozen replay. Load the fixture for tool.lookup_order. Stub the tool. Pin the model to the snapshot recorded in the trace (sonnet-4-2510, not whatever the provider serves today) and set temperature = 0. Run.
replay: agent calls lookup_order("my order")
fixture: returns 3 candidates (the same 3)
agent: picks first match → order #4421 (wrong user)
The crime re-enacts. Confession: lookup_order was called without user_id, so it returned every order matching the query string rather than only this user's orders. The agent picked the top hit.
Step four — what was missing, what we added.
gaps closed for future runs
+ tool request must include user_id (schema-enforced)
+ tool span captures user_id as evidence tag
+ tenant_id propagated through every span
+ fixture recorder enabled for top-10 tools; regression test r_8821 added
The lock is the eval that loads run r_8821 as a fixture, pins the model, and asserts the agent never returns an order belonging to a different user. Without replay, the team would have rolled back the model "just in case." With it, the real bug — a missing user_id parameter — is found in twenty minutes.
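A regression test along those lines might look like the sketch below. It is a simplified stand-in, not the team's actual eval: the fixture rows, the owner ids other than u_7710, the order ids other than 4421, and the `pick_order` selection rule are all assumptions, and the model call is replaced by "take the top hit", which is how the original run behaved.

```python
# Fixture modelled on run r_8821: three candidates, only one owned by u_7710.
FIXTURE = [
    {"order_id": 4421, "user_id": "u_1203"},
    {"order_id": 4475, "user_id": "u_7710"},
    {"order_id": 4490, "user_id": "u_3318"},
]

def lookup_order(query: str, user_id: str) -> list[dict]:
    """Fixed tool stub: user_id is required and scopes the candidates."""
    if not user_id:
        raise ValueError("lookup_order requires a user_id")
    return [order for order in FIXTURE if order["user_id"] == user_id]

def pick_order(candidates: list[dict]):
    """Stand-in for the agent's choice: take the top hit, as the original run did."""
    return candidates[0] if candidates else None

def test_r_8821_never_crosses_users() -> None:
    chosen = pick_order(lookup_order("my order", user_id="u_7710"))
    assert chosen is not None
    assert chosen["user_id"] == "u_7710"   # never someone else's order
    assert chosen["order_id"] != 4421      # the order returned in the original failure

test_r_8821_never_crosses_users()
print("regression r_8821 passed")
```

The assertion is on ownership rather than on a specific answer string, so the test keeps holding when the prompt or model changes but the privacy rule does not.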
Reproducing agent failures in shipped tooling
- LangSmith — replay button: open any trace, click "replay," it re-runs the same prompt against the same model with stored fixtures, useful for prompt-template debugging.
- Braintrust — experiments + replay: pulls historical traces into a dataset, replays each one against a new prompt or model, diffs the outputs row by row.
- Arize Phoenix — experiments framework: snapshots LLM and tool spans, runs deterministic replays against new versions, surfaces output drift per example.
- Helicone — replay endpoint: the proxy stores raw request and response bodies, exposes a replay URL that re-sends the exact recorded payload to compare provider behaviour across versions.
- Anthropic cookbook (custom replay harness): pattern of storing messages, system, tools, and tool-result fixtures as JSON, then running a small Python script that re-invokes the SDK with those exact bytes and temperature 0 for near-deterministic replay.
- Cursor (bug-repro flow): the editor captures the exact prompt, file context, and tool calls behind a complaint and bundles them into a shareable repro file so the agent team can replay the failure against the same model snapshot.
- LangFuse — session replay: stores full session traces with prompt and tool I/O; the replay view re-runs a session against a new prompt or model and diffs each step.
- Comet Opik — dataset replay: turns failing traces into eval rows so the same byte-exact input is re-run on every model candidate, surfacing regression at the failure point rather than the average.
- OpenAI Evals (graders + replay) — historical conversations are loaded as test cases and replayed against new model versions with deterministic seeds where supported.
- Anthropic Console — message replay: lets you reopen a Claude conversation, edit one variable, and re-run with the same tools registered, useful for isolating prompt drift versus model drift.
- rr (record-and-replay debugger): deterministic record-and-replay for native code, originally developed at Mozilla. Agent replay applies the same philosophy at the model layer: record once, replay many times against the same scene.
- Replay.io — record-and-replay for web sessions; agent teams use it to capture the front-end interactions that produced an LLM call so the upstream environment can be re-enacted too.
- PromptLayer — request history + replay: stores every prompt/response pair with metadata and exposes a "rerun" action against the same or a new model.
- Galileo — failure clustering with replay: groups similar failures and replays representatives against fixes to confirm the cluster collapses.
- Weights & Biases Weave — trace replay: Weave traces store inputs and outputs at every step; replays re-run the same op graph for diffing.
- Cleric.io / agent-debugging tools — ingest production traces, classify failures, and run replays against candidate fixes before they ship.
- OpenTelemetry GenAI conventions: by standardising the names of captured attributes, OTel makes replay more portable across vendors; in principle, a frozen-fixture run captured by one tool can be replayed in another.
- AWS Bedrock invocation logs + Step Functions replay: captured invocation payloads can be re-fed to the same model id and region, so incident review reruns the byte-exact request.
- Azure AI Foundry traces — capture full prompt, tool, and response history; the studio's "rerun with changes" action is the managed equivalent of a replay harness.
- GCP Vertex AI Model Garden replay — stored prediction requests can be replayed against pinned model versions for regression analysis.
- Seed handling in OpenAI and Mistral SDKs: OpenAI exposes a seed parameter and Mistral a random_seed parameter. Combined with frozen fixtures, a pinned seed brings replay closer to the original output, though seeded determinism is best effort rather than guaranteed.
- vcrpy / nock / WireMock: HTTP recorders that pin tool responses to disk; the classical pattern beneath most modern LLM replay harnesses.
Recall: can you reconstruct the replay kit cold?
- Name five items in the non-determinism budget that a replay kit must capture or pin.
- Why does replaying against live tools usually fail to reproduce the original bug?
- What is the frozen-fixtures pattern, and which layer of the agent does it make deterministic?
- When you cannot reproduce a "happened-once" failure, what is the disciplined next step?
Interview Q&A
Q: Why is capturing the input message and model id not enough to replay an agent failure? A: Tool responses, retrieval results, system prompt version, and tool schemas all change between trace time and replay time. Without those, the same input produces a different downstream path. You need the full forensic kit — system prompt text, tool schemas at that moment, retrieved chunks, tool-response bytes, model id, temperature, seed, and timestamps.
Common wrong answer to avoid: "Set temperature to zero and replay" — temperature only controls model sampling. Tool responses and retrieval still drift. The agent walks a different path even at temp=0.
Q: Why use frozen fixtures for tools instead of just calling the tools again on replay? A: Tools are stateful. The order database, the user record, the vector index — all keep changing. Live calls give different bytes than the original run. The agent's behavior depends on those bytes, so the bug disappears or mutates. Frozen fixtures replay the exact bytes the agent originally saw, isolating model and prompt behavior.
Common wrong answer to avoid: "Frozen fixtures are about test speed" — speed is a side effect. The real reason is determinism. Live tool calls reintroduce non-determinism that defeats replay.
Q: A bug happens only at 3 PM and you cannot reproduce in the morning. Most likely cause, and how does replay help? A: A downstream dependency behaves differently under load — a rate-limited tool returns a partial body, a 429, or a slower response. The agent's parser or loop handles that path poorly. Replay against the recorded 3 PM tool-response bytes — not live calls — reproduces it instantly.
Common wrong answer to avoid: "The model behaves differently at peak load": the model weights do not change with the time of day. What changes under load is how your dependencies behave (rate limits, partial bodies, slow responses), and that is where the variance almost always lives.
Q: What is the correct response to a "happened-once" failure you cannot reproduce? A: Mark it as a cold case, preserve every detail you still have, and immediately add the missing capture fields so the next occurrence is fully replayable. Add a metric to count similar events. Patterns across cold cases — same tenant, same tool, same hour — can yield a confession even without full replay.
Common wrong answer to avoid: "Close it and wait for it to reproduce" — without instrumentation changes, the next occurrence will be just as unreproducible. The fix is to make the next one debuggable, not to hope.
Apply now (10 min)
Step 1 — model the exercise. Here is the replay-kit checklist I would build for the chapter's Tuesday order-lookup failure, sorted into the three buckets:
| Bucket | Field | Already captured? | If not — fix |
|---|---|---|---|
| Input | full prompt + system message | yes | — |
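| Input | user_id, tenant_id | partial (on the run, but missing from the lookup_order request and tool span) | make user_id schema-required; propagate tenant_id to every span |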
| Input | conversation history | partial — only last 6 turns | extend retention to 30 turns |
| Environment | model version + provider | yes | — |
| Environment | agent code commit hash | NO | inject at start_run |
| Environment | embedding model version | NO | log on every retrieval span |
| Environment | RNG seed | yes | — |
| Environment | timestamp (UTC) | yes | — |
| Side-effects | every tool call's input + output (frozen) | partial — outputs only | freeze inputs too, store as fixture |
| Side-effects | retrieval chunk IDs and scores | yes | — |
Two missing fields (commit hash, embedding model version) are the cheapest fixes that close the most cold cases. Tool inputs not being captured turned what should have been a 20-minute replay into a multi-day mystery last quarter.
Step 2 — your turn. Take an agent you know. List every field you would need to capture to replay one run end-to-end. Sort into the same three buckets. Mark which fields your current tracing already captures and which are missing. Pick the cheapest two to add and write a one-line plan for each.
Step 3 — reproduce from memory. Draw the capture → store → replay pipeline. Label where fixtures are recorded, where they are stored, and where the replay harness stubs the tool calls. Write one line on why the agent code commit hash must be part of the case file.
What you should remember
This chapter explained why most agent failures cannot be re-run on the second try. The world the agent ran in has already changed: the model snapshot rolled forward, a tool returned different data, a date moved past midnight, the retrieval index was rebuilt. The diagnostic move is not heroic re-running but disciplined capture: every input, every environment detail, every side-effect frozen at the moment of the original run, stored as a fixture the case file can reload weeks later.
You also learned that non-determinism has a budget. Some sources you accept (LLM temperature > 0), some you eliminate (frozen tool fixtures, pinned model versions, fixed timestamps in tests). The replay kit makes the budget visible. Anything the kit cannot reproduce is something the team has implicitly accepted as never debuggable.
Carry this diagnostic forward: when a bug feels like "it only happened once", the system did not have a non-determinism problem — it had an instrumentation problem. Add the missing capture field before closing the ticket, even if the immediate bug is already fixed. The next "happened once" failure deserves a debugger, not a shrug.
Remember:
- A trace alone is not a replay kit. You also need inputs, environment fingerprints, and frozen side-effects.
- The agent code commit hash belongs in the case file. Without it, replay drifts with every refactor.
- Tool calls must be frozen as fixtures, not re-run live. Live re-runs hit different data and lie about reproducibility.
- Time-of-day failures need explicit timestamp pinning in fixtures. UTC midnight is a real bug class.
- "Happened-once" failures are usually instrumentation failures. Add the missing field before closing the ticket.
Bridge. We can re-enact the crime. The scene replays cleanly. But which suspect broke — prompt, tool, loop, memory, or model? We do not guess. We line them up and eliminate one at a time. → 06-layer-isolation-lineup.md