Phase 3 — Survive production

Covers chapters 16–18. By the end, the Phase 2 agent survives a mid-trajectory crash without losing state or double-charging the customer, the on-call engineer can read a trace at 3 AM and pinpoint the failing layer in under five minutes, and the topology's first crack is named before it ships rather than after the first outage.

What you will add this phase

Three layers, narrower than Phase 2 but each one essential before launch:

State checkpointing with idempotent resume — when a pod dies between acting and checking, the next worker picks up exactly where the previous one stopped, with no duplicate writes.
Topology-aware alarm wiring — your agent's first crack is one of six known shapes; the alarm panel must catch its first signal in production.
Span-level observability designed at architecture time — every span carries the tags an on-call engineer would need at 3 AM, every LLM call carries cost, every tool call carries latency and status.

Phase 3 is shorter than Phase 2 in number of layers but matches it in difficulty because every bug you catch here is one you would otherwise catch at 3 AM with a customer waiting.

Chapters to read first

16-state-recovery-checkpointing.md — the resume-vs-restart trade
17-failure-mode-by-topology.md — six topologies, six first cracks
18-observability-by-design.md — span schemas and the 3 AM rubric

Re-read chapter 08's idempotency section if you skimmed it in Phase 2; Phase 3's resume semantics depend heavily on the keys you wired then.

The build

Step 1 — Persist the scratchpad after every state transition

Phase 2's scratchpad lived in SQLite, keyed by session_id. Phase 3 makes the persistence happen after every observation, not at end-of-turn. The reason is the chapter-16 picture: a pod can die between the try firing and the check recording the result. If the scratchpad write happens only at end-of-turn, that gap is unrecoverable; the agent restarts and may re-fire the tool.

Implement a record_observation() function that:

1. Receives the tool name, input, and result.
2. Writes a row to a state_transitions table with (session_id, iteration, tool_name, tool_input_hash, idempotency_key, result_summary, timestamp).
3. Updates the in-memory scratchpad.
4. Returns control to the loop.

The write is synchronous and durable. If the pod dies after the table write completes, the next pod sees the transition in the table and knows it already happened.
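The function described above can be sketched as follows. This is a minimal illustration, assuming the standard-library sqlite3 module, a scratchpad held as a plain list, and a hypothetical open_store() helper; the primary key on (session_id, iteration) is also an assumption, added so that a replayed transition fails loudly instead of being recorded twice.

```python
import hashlib
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS state_transitions (
    session_id      TEXT    NOT NULL,
    iteration       INTEGER NOT NULL,
    tool_name       TEXT    NOT NULL,
    tool_input_hash TEXT    NOT NULL,
    idempotency_key TEXT,
    result_summary  TEXT    NOT NULL,
    timestamp       REAL    NOT NULL,
    PRIMARY KEY (session_id, iteration)
)
"""


def open_store(path):
    """Open the checkpoint store (hypothetical helper, not from the chapters)."""
    conn = sqlite3.connect(path)
    # Ask SQLite to sync to disk on every commit.
    conn.execute("PRAGMA synchronous = FULL")
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def record_observation(conn, scratchpad, session_id, iteration,
                       tool_name, tool_input, result, idempotency_key=None):
    """Persist one transition, then update the in-memory scratchpad."""
    input_hash = hashlib.sha256(
        json.dumps(tool_input, sort_keys=True).encode("utf-8")
    ).hexdigest()
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO state_transitions VALUES (?, ?, ?, ?, ?, ?, ?)",
            (session_id, iteration, tool_name, input_hash,
             idempotency_key, json.dumps(result), time.time()),
        )
    # Memory is updated only after the durable write has succeeded.
    scratchpad.append(
        {"iteration": iteration, "tool_name": tool_name, "result": result}
    )
    return scratchpad
```

Ordering the durable write before the in-memory update means a failed write never leaves the scratchpad ahead of the table.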

Step 2 — Wire the resume path

When a new pod picks up a session, it should:

1. Load all state transitions for the session_id, ordered by iteration.
2. Reconstruct the scratchpad from the transitions.
3. Re-validate each transition's idempotency_key against the backend — for write tools, query the backend by key to confirm the side effect happened. For read tools, trust the cached result.
4. Continue from the iteration after the last successful transition.

The subtle case: the pod died after issue_refund returned refund_id=rf_77 but before the transition was written. On resume, the agent retries issue_refund with the same key; the backend recognises the duplicate and returns the same refund_id. The resume path inserts the transition and continues. Idempotency made the retry safe; the checkpoint made resume possible at all.
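A rough sketch of the resume path, assuming the state_transitions table from Step 1. The WRITE_TOOLS set and the lookup_by_key callable (which asks the backend whether a side effect exists for a given idempotency key) are hypothetical names for illustration, and raising when the backend cannot confirm a recorded write is an assumed policy, not something the chapters prescribe.

```python
import json
import sqlite3

# Assumed classification of tools with side effects.
WRITE_TOOLS = {"issue_refund", "send_customer_email"}


def resume_session(conn, session_id, lookup_by_key):
    """Rebuild the scratchpad and return (scratchpad, next_iteration)."""
    rows = conn.execute(
        "SELECT iteration, tool_name, idempotency_key, result_summary "
        "FROM state_transitions WHERE session_id = ? ORDER BY iteration",
        (session_id,),
    ).fetchall()

    scratchpad = []
    for iteration, tool_name, key, result_json in rows:
        if tool_name in WRITE_TOOLS:
            # Write tools: confirm with the backend that the effect happened.
            if lookup_by_key(tool_name, key) is None:
                raise RuntimeError(
                    f"iteration {iteration}: {tool_name} is recorded but the "
                    f"backend has no record for key {key!r}"
                )
        # Read tools: trust the cached result.
        scratchpad.append({
            "iteration": iteration,
            "tool_name": tool_name,
            "result": json.loads(result_json),
        })

    # Iterations are assumed to start at 1, matching the lab's traces.
    next_iteration = rows[-1][0] + 1 if rows else 1
    return scratchpad, next_iteration
```

The subtle case needs no special branch here: the refund that returned but was never recorded is simply absent from the table, so the loop re-issues it at next_iteration with the same key and the backend deduplicates.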

For the hands_on_lab, simulate a crash by killing the Python process partway through Suresh's flow (after the policy check but before the refund — the natural checkpoint location). Restart from the same session_id. Verify the trace shows the agent resuming at iteration 4, not restarting from iteration 1.

Save the crash-and-resume trace as runs/phase-3/crash-resume.json. The diff between this trace and Phase 2's clean-trace should be small — that is the point.

Step 3 — Name your topology and its first crack

Your agent today is a ReAct loop with parallel reads and chained writes — a mostly-ReAct topology with a small fan-out optimization. Chapter 17's table says ReAct's signature first crack is runaway iteration — same tool, similar args, no progress, but the loop keeps running because the model thinks it's making progress.

Write down in design-notes.md: "The signature first crack of our topology is runaway iteration. The first signal is the no-progress detector firing twice within a 5-minute window. The alarm response is to page the on-call and freeze the agent for the affected tenant."
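The signal in that design note can be sketched as a small detector. Everything here is an assumption for illustration: the NoProgressAlarm class name, the threshold of three consecutive repeats, and the use of exact-match arguments as a stand-in for "similar args". Only the "twice within a 5-minute window" rule comes from the note itself.

```python
import json
from collections import deque


class NoProgressAlarm:
    """Pages when the no-progress detector fires twice within a window."""

    def __init__(self, repeat_threshold=3, window_s=300.0, fires_to_page=2):
        self.window_s = window_s
        self.fires_to_page = fires_to_page
        # The last N tool-call signatures; N identical entries means no progress.
        self._recent_calls = deque(maxlen=repeat_threshold)
        self._fire_times = deque()

    def observe(self, tool_name, args, now_s):
        """Record one tool call; return True when on-call should be paged."""
        signature = (tool_name, json.dumps(args, sort_keys=True))
        self._recent_calls.append(signature)
        detector_fired = (
            len(self._recent_calls) == self._recent_calls.maxlen
            and len(set(self._recent_calls)) == 1
        )
        if not detector_fired:
            return False

        # Reset so the next firing needs a fresh run of repeats.
        self._recent_calls.clear()
        self._fire_times.append(now_s)
        # Drop firings that have aged out of the window.
        while self._fire_times and now_s - self._fire_times[0] > self.window_s:
            self._fire_times.popleft()
        return len(self._fire_times) >= self.fires_to_page
```

In a real agent the similarity check would likely be fuzzier than exact equality, and the page would be paired with the per-tenant freeze the design note calls for.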

If you have evolved to a hybrid topology (plan-execute skeleton with ReAct inside each action), name the second crack — stale plans — and its signal too.

The exercise is not academic. Phase 4's eval gates depend on you knowing which failure modes to test against; if you don't name them, your eval set will miss them.

Step 4 — Design the span schema

Phase 1 and 2 generated traces, but they were debugging dumps, not engineered spans. Phase 3 promotes the trace to a first-class artifact with chapter 18's required tags.

Every span must carry these tags as required fields:

trace_id — one per agent run.
parent_span_id — for the tree structure.
tenant_id — required by chapter 15's discipline; required again here for filtering.
session_id — for multi-turn conversation context.
step_index — which iteration of the loop.
timestamp_ms — wall clock at span start.
span_kind — one of agent_step, llm_call, tool_call, retry, hitl_approval.

LLM-call spans add: model_id, model_version, prompt_version_hash, input_tokens, output_tokens, temperature, cost_usd.

Tool-call spans add: tool_name, tool_version, args_redacted (after PII scrubbing), result_status (success/error/timeout), latency_ms.

Write a Span Python dataclass; make every span emission go through one helper function so the required tags can never be forgotten. Emit spans to JSONL on disk (real production would emit to OTel / Datadog / Langfuse; JSONL is fine for the hands_on_lab because the schema discipline is what matters).
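One possible shape for the dataclass and the single emission helper is sketched below. The emit_span name, the attrs dictionary for kind-specific tags, and the explicit span_id field (which appears in the worked example but not in the required list) are assumptions made for this sketch.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional

SPAN_KINDS = {"agent_step", "llm_call", "tool_call", "retry", "hitl_approval"}

# Extra tags each span kind must carry, per the lists above.
KIND_REQUIRED = {
    "llm_call": {"model_id", "model_version", "prompt_version_hash",
                 "input_tokens", "output_tokens", "temperature", "cost_usd"},
    "tool_call": {"tool_name", "tool_version", "args_redacted",
                  "result_status", "latency_ms"},
}


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None only for the root span
    tenant_id: str
    session_id: str
    step_index: int
    timestamp_ms: int
    span_kind: str
    attrs: Dict[str, Any] = field(default_factory=dict)


def emit_span(path, *, trace_id, span_id, parent_span_id, tenant_id,
              session_id, step_index, span_kind, timestamp_ms=None, **attrs):
    """Validate required tags, then append one JSON line to the trace file."""
    required = {"trace_id": trace_id, "span_id": span_id,
                "tenant_id": tenant_id, "session_id": session_id}
    empty = [name for name, value in required.items() if not value]
    if empty:
        # No defaults: a missing tag fails loudly instead of becoming None.
        raise ValueError(f"missing required span tags: {empty}")
    if span_kind not in SPAN_KINDS:
        raise ValueError(f"unknown span_kind: {span_kind!r}")
    missing = KIND_REQUIRED.get(span_kind, set()) - set(attrs)
    if missing:
        raise ValueError(f"{span_kind} span missing tags: {sorted(missing)}")

    span = Span(
        trace_id=trace_id, span_id=span_id, parent_span_id=parent_span_id,
        tenant_id=tenant_id, session_id=session_id, step_index=step_index,
        timestamp_ms=int(time.time() * 1000) if timestamp_ms is None else timestamp_ms,
        span_kind=span_kind, attrs=attrs,
    )
    record = asdict(span)
    record.update(record.pop("attrs"))  # flatten kind-specific tags
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return span
```

Making every argument keyword-only keeps call sites readable and means the root span has to pass parent_span_id=None explicitly rather than by omission.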

Step 5 — Walk one trace end to end

Run Priya's full refund through the agent and capture the trace. The expected span tree:

trace_id=abc-123
└── agent_step (step_index=0)
    ├── llm_call (cost_usd=0.004)
    ├── tool_call (find_customer_by_email)
    │   └── (latency=120ms, status=success)
    └── agent_step (step_index=1)
        ├── llm_call (cost_usd=0.005)
        ├── tool_call (list_orders)        ┐ parallel
        ├── tool_call (retrieve_policy)    ┘
        └── agent_step (step_index=2)
            ├── llm_call (cost_usd=0.006)
            ├── tool_call (issue_refund)   ← Class 4
            └── agent_step (step_index=3)
                ├── llm_call (cost_usd=0.003)
                ├── tool_call (send_customer_email)
                └── final_response (latency_total=4200ms, cost_total=0.018)

Every span at every level carries tenant_id=acme, session_id=..., trace_id=abc-123. The parallel-tool fan-out shows two sibling tool-call spans at the same step_index. The Class-4 refund call is visible by tool name; if you wanted, you could annotate it with blast_radius_class=4.

Save the trace as runs/phase-3/priya-trace.jsonl.
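To check a saved trace against the expected tree, the spans can be re-assembled from their parent_span_id links. The render_tree function below is a hypothetical helper, assuming each span is a dict carrying span_id, parent_span_id, span_kind, step_index, and timestamp_ms as in Step 4.

```python
from collections import defaultdict


def render_tree(spans):
    """Return indented lines showing the span tree, siblings in time order."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent_span_id")].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["timestamp_ms"])

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            # Label tool calls by tool name and LLM calls by model, if present.
            label = span.get("tool_name") or span.get("model_id") or ""
            head = f"{span['span_kind']} {label}".rstrip()
            lines.append("  " * depth + f"{head} (step_index={span['step_index']})")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines
```

If a span is missing from the output, its parent_span_id points at a span that was never emitted, which is itself a useful finding.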

Step 6 — Build the four on-call queries

The trace is only as good as the queries you can run on it. Build a small CLI (query_traces.py) supporting four queries:

--tenant <id> --since <duration> --sort-by latency — find slow sessions for one tenant.
--tool <name> --status error — find failing tool calls.
--group-by model_version — see whether a model version change correlates with errors.
--group-by prompt_version_hash — see whether a prompt change correlates with cost or latency drift.

These four queries are the bare minimum. Production observability platforms support hundreds; the discipline here is that every query you can imagine running at 3 AM should be answerable from the span tags. If you can't write the query because the tag doesn't exist, the tag was missing at architecture time.

Test the queries against the Priya and Suresh traces. If any query returns nothing meaningful, the underlying tag is missing or the trace is incomplete.
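A compact sketch of what query_traces.py might look like, using only the standard library. Several choices here are assumptions: the positional path argument, duration strings such as 30m, 24h, or 7d, session latency computed as the sum of latency_ms over that session's spans (only tool-call spans carry it in the Step 4 schema), and group-by output that joins on trace_id to count traces containing an errored tool call.

```python
import argparse
import json
import time
from collections import Counter

UNITS_MS = {"m": 60_000, "h": 3_600_000, "d": 86_400_000}


def parse_duration_ms(text):
    """Turn '24h' into milliseconds (assumed format: integer plus m/h/d)."""
    return int(text[:-1]) * UNITS_MS[text[-1]]


def load_spans(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def group_spans(spans, key):
    """Per group value: span count, total cost, traces with a tool error."""
    error_traces = {s["trace_id"] for s in spans
                    if s.get("result_status") == "error"}
    groups = {}
    for s in spans:
        if key not in s:
            continue
        g = groups.setdefault(s[key], {"spans": 0, "cost_usd": 0.0, "traces": set()})
        g["spans"] += 1
        g["cost_usd"] += s.get("cost_usd", 0.0)
        g["traces"].add(s["trace_id"])
    return {
        value: {"spans": g["spans"], "cost_usd": round(g["cost_usd"], 6),
                "traces_with_errors": len(g["traces"] & error_traces)}
        for value, g in groups.items()
    }


def run_query(spans, args, now_ms=None):
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    if args.tenant:
        spans = [s for s in spans if s["tenant_id"] == args.tenant]
    if args.since:
        cutoff = now_ms - parse_duration_ms(args.since)
        spans = [s for s in spans if s["timestamp_ms"] >= cutoff]
    if args.group_by:
        return group_spans(spans, args.group_by)
    if args.tool:
        spans = [s for s in spans if s.get("tool_name") == args.tool]
    if args.status:
        spans = [s for s in spans if s.get("result_status") == args.status]
    if args.sort_by == "latency":
        totals = Counter()
        for s in spans:
            totals[s["session_id"]] += s.get("latency_ms", 0)
        return totals.most_common()  # slowest session first
    return spans


def build_parser():
    p = argparse.ArgumentParser(prog="query_traces.py")
    p.add_argument("path", help="JSONL trace file")
    p.add_argument("--tenant")
    p.add_argument("--since")
    p.add_argument("--sort-by", choices=["latency"])
    p.add_argument("--tool")
    p.add_argument("--status", choices=["success", "error", "timeout"])
    p.add_argument("--group-by", choices=["model_version", "prompt_version_hash"])
    return p


def main(argv=None):
    args = build_parser().parse_args(argv)
    print(json.dumps(run_query(load_spans(args.path), args), indent=2))
```

Calling main() under an `if __name__ == "__main__":` guard turns this into the CLI; keeping run_query separate from argument parsing makes the four queries testable against the Priya and Suresh traces without a shell.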

Step 7 — Per-tenant cost dashboard

Aggregate cost_usd from the LLM spans by tenant_id and by feature (where feature is the agent's top-level workflow name, e.g. refund_handling; the Step 4 schema does not list a feature tag, so add one to the span before aggregating on it). Output a per-tenant per-day cost summary as runs/phase-3/cost-summary.csv. This is the "where did the bill go?" view chapter 18 promised.

For the hands_on_lab, you only have two tenants (Acme, Initech) and a handful of sessions. The point is to wire the aggregation now; in production, the same code over a real trace store produces the real numbers.
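The aggregation might look like the sketch below. It assumes spans are dicts loaded from the JSONL trace, that days are bucketed in UTC, and that each LLM span carries a feature tag; that tag is not in the Step 4 required list, so the code falls back to "unknown" when it is absent.

```python
import csv
from collections import defaultdict
from datetime import datetime, timezone

FIELDS = ["tenant_id", "feature", "day", "cost_usd"]


def rollup_cost(spans):
    """Sum cost_usd over LLM spans by (tenant_id, feature, UTC day)."""
    totals = defaultdict(float)
    for span in spans:
        if span.get("span_kind") != "llm_call":
            continue
        day = datetime.fromtimestamp(
            span["timestamp_ms"] / 1000, tz=timezone.utc
        ).date().isoformat()
        # "feature" is an assumed tag; missing values are grouped as "unknown".
        key = (span["tenant_id"], span.get("feature", "unknown"), day)
        totals[key] += span["cost_usd"]
    return [
        {"tenant_id": t, "feature": f, "day": d, "cost_usd": round(c, 6)}
        for (t, f, d), c in sorted(totals.items())
    ]


def write_cost_summary(rows, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Indexing span["tenant_id"] and span["cost_usd"] directly, rather than with a default, is deliberate: a span without them should break the roll-up instead of silently under-counting.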

Step 8 — Capture and emit eval data

Every completed agent trajectory is a candidate eval case for chapter 19. Phase 3 starts capturing the structured data Phase 4 will need.

For every session that ends (success or graceful failure), emit a runs/phase-3/eval_candidates.jsonl row containing:

input — the original customer message and tenant context.
final_output — the agent's final reply.
tool_calls — the ordered list of tool calls and results.
user_feedback — placeholder for thumbs_up/thumbs_down if you wire user feedback later.
hitl_override — whether the approval gate fired, and what the reviewer did.

Phase 4 turns a labelled subset of this file into the capability eval and the regression set. Start the capture now or there will be no data to label later.
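A minimal sketch of the capture step, assuming one JSON object per line and a hypothetical emit_eval_candidate() helper called once when a session ends. The nested shape of input is an assumption; the five top-level field names come from the list above.

```python
import json


def emit_eval_candidate(path, *, customer_message, tenant_id, final_output,
                        tool_calls, hitl_override=None, user_feedback=None):
    """Append one finished session to the eval candidates file."""
    row = {
        "input": {"customer_message": customer_message, "tenant_id": tenant_id},
        "final_output": final_output,
        "tool_calls": list(tool_calls),  # ordered calls with their results
        "user_feedback": user_feedback,  # placeholder until feedback is wired
        "hitl_override": hitl_override,  # None when the approval gate never fired
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return row
```

Writing the placeholder fields as explicit nulls keeps every row the same shape, which makes the Phase 4 labelling pass simpler.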

Worked example

The span emitted for the issue_refund tool call on Priya's case, redacted form:

{
  "trace_id": "abc-123",
  "parent_span_id": "agent_step_2",
  "span_id": "tool_call_2_1",
  "tenant_id": "acme",
  "session_id": "sess_88",
  "step_index": 2,
  "timestamp_ms": 1716371232123,
  "span_kind": "tool_call",
  "tool_name": "issue_refund",
  "tool_version": "1.0.0",
  "args_redacted": {
    "order_id": "448100",
    "amount_inr": 6400,
    "reason": "delay",
    "idempotency_key": "refund_448100_delay"
  },
  "result_status": "success",
  "result_summary": {"refund_id": "rf_77"},
  "latency_ms": 247,
  "blast_radius_class": 4
}

Notice the idempotency_key is visible — not redacted — because it is an audit identifier, not PII. The actual amount and order ID are also visible because for a refund agent they are the audit-critical facts. If this were a healthcare agent, the equivalent fields would be redacted at source per chapter 18's discipline. The point is to choose what to redact deliberately; the framework should never decide for you by accident.
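One way to make that choice deliberate in code is an explicit allow-list and deny-list with no fallthrough. The key sets and the redact_args() name below are assumptions for a refund agent; a different domain would classify its fields differently.

```python
# Fields kept in the clear because they are audit identifiers or audit facts.
AUDIT_KEYS = {"order_id", "refund_id", "idempotency_key", "amount_inr", "reason"}
# Fields scrubbed before the span is written (assumed PII set).
PII_KEYS = {"email", "name", "phone", "address"}


def redact_args(args):
    """Return a copy of tool args that is safe to store in a span."""
    redacted = {}
    for key, value in args.items():
        if key in AUDIT_KEYS:
            redacted[key] = value
        elif key in PII_KEYS:
            redacted[key] = "[REDACTED]"
        else:
            # Unclassified fields are rejected so nothing is kept or
            # dropped by accident; someone has to make the call.
            raise ValueError(f"unclassified tool argument: {key!r}")
    return redacted
```

Raising on an unknown field is stricter than silently redacting it, but it surfaces new tool arguments at development time instead of in the trace store.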

Acceptance check

Before Phase 4:

Kill the process between policy-check and refund-issue for Suresh's case (post-approval). Restart. Show me the trace. The resume must pick up at iteration 4, not iteration 1. The issue_refund retry must carry the same idempotency key. The backend must recognise the duplicate and return the original refund ID. If the agent re-issues the refund, the checkpoint or the idempotency key is broken.
Name your topology's first crack and the first signal that catches it. If the answer is hand-wavy, chapter 17's table needs another read.
Show me a query that finds every session in the last 24 hours where the issue_refund tool returned an error. If you can't write it, your tool-call spans are missing status or tool_name. Fix the schema, not the query.
Roll up cost by tenant for the last week. The number is small in the hands_on_lab (synthetic traffic), but the roll-up code should produce it correctly. If it can't, your LLM spans don't carry cost_usd or tenant_id.
Open one trace and walk an on-call engineer through what the agent did, by reading only the spans (no chat history). If the trace doesn't carry enough structure to follow the agent's reasoning, the span design failed.

Common stumbles

Stumble 1 — end-of-turn checkpointing instead of post-observation. Symptom: the pod dies mid-turn, resume picks up at the start of the turn and re-fires list_orders, which looks fine because list_orders is read-only. Then the pod dies mid-turn after issue_refund; resume re-fires issue_refund, but the idempotency key catches the duplicate, so it still looks fine. Eventually, a crash between two writes corrupts state in a way idempotency cannot fix. Fix: checkpoint after every observation, not at end-of-turn.

Stumble 2 — missing tenant_id on a span. Symptom: the on-call's tenant-filter query returns the wrong number of sessions because some spans are missing the tag. Diagnosis: a span helper that defaults tenant_id to None instead of failing loudly. Fix: required tag, no defaults, helper raises if it isn't set.

Stumble 3 — confusing trace_id with session_id. They are different. session_id lives across many trace_ids: one session is one conversation, possibly many runs. When a user comes back to the conversation tomorrow, the session_id is the same but the trace_id is new. Get this wrong and resume cannot find prior state.

Stumble 4 — redaction at query time instead of source. Symptom: raw PII is sitting in the trace store, and the team plans to mask it "at query time" via a dashboard filter. A bug in the dashboard exposes raw data; a developer with debug access can read everything. Fix: redact at source. Keep a tightly-scoped raw store with 24-hour retention only if absolutely necessary, and even then assume it will leak.

Stumble 5 — observability bolted on after the fact. Symptom: a week into Phase 3 you realise some tool call sites don't emit spans because they're in code paths you forgot. Fix: route every tool call through a single dispatcher; the dispatcher emits the span; coverage becomes structural rather than incidental.
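The single-dispatcher fix can be sketched as below. The ToolDispatcher class and its emit callable are hypothetical; emit stands in for whatever span helper the project uses, and context is assumed to carry the required tags (trace_id, tenant_id, session_id, step_index, and so on).

```python
import time


class ToolDispatcher:
    """Routes every tool call through one place that always emits a span."""

    def __init__(self, tools, emit):
        self._tools = dict(tools)  # tool name -> callable
        self._emit = emit          # callable accepting span tags as keywords

    def call(self, tool_name, args, *, context):
        start = time.monotonic()
        status = "error"  # assume failure until the call returns
        try:
            result = self._tools[tool_name](**args)
            status = "success"
            return result
        except TimeoutError:
            status = "timeout"
            raise
        finally:
            # Runs on success, error, and timeout alike, so no code path
            # can invoke a tool without leaving a span behind.
            self._emit(
                span_kind="tool_call",
                tool_name=tool_name,
                result_status=status,
                latency_ms=int((time.monotonic() - start) * 1000),
                **context,
            )
```

Because the lookup of the tool happens inside the try block, even a call to an unregistered tool name produces an error span before the exception propagates.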

Reflection prompts

Walk through what happens at 3 AM if Acme reports "the refund agent is slow and sometimes wrong." How quickly do you find the failing layer? Open the trace store, write four queries, identify the bad model+prompt pair, flip the kill switch (Phase 4 will build this). If the path takes more than ten minutes, the span design is incomplete.
Your args_redacted field names what got redacted. What didn't get redacted, and why is that the right choice? The discipline is to scrub PII (email, name, contact info) but preserve audit identifiers (order ID, refund ID, idempotency key). Justify each choice in design-notes.md.
If you swapped the underlying model from Sonnet to Haiku, would your span schema notice? The model_id and model_version tags should make the swap immediately visible in any roll-up. If they don't, the schema is missing the change-detection that Phase 4's eval baseline depends on.
What's the slowest span in Priya's trace, and is it the model or the tool? The chapter-18 mini-FAQ about "optimising model latency while tool latency dominates" lives or dies on whether you can answer this from one query.

Continue to phase-4-ship-with-discipline.md.