Skip to content

11. Logging & tracing — when the A/B winner has a 9% bug nobody can name

~18 min read. An A/B tells you arm B lost CSAT four points. It does not tell you which conversation, which retrieved doc, or which tool call did the losing. This chapter is about the instrument that turns "the experiment regressed" into "row 4,182 fell here, for this reason, at this span."

Builds on 10-ab-testing.md. The shift change told us which version is better on a metric. The kitchen log — logging and tracing — tells us why one specific request went wrong inside the agent: which retrieval call, which rerank, which judge, which retry. Without it, an A/B regression becomes a mystery instead of a diagnosis.


What the A/B left unanswered

The refund-bot A/B from the previous chapter settled the headline. Arm B was crisper, more policy-correct on the rubric, slightly higher on self-serve resolution, and four points worse on CSAT. The decision was clear: iterate the prompt, restore the empathy cue, re-test. But sitting underneath that decision were 4,190 individual treatment chats. Some of them were the wins that drove the +6pp policy lift. Some of them were the losses that pulled CSAT down. The aggregate said something was off; it could not say what, where, or which chat to look at first.

A teammate pulled three low-CSAT conversations the next morning to read them. Two looked fine to her eyes — competent, terse, no obvious mistake. The third was clearly broken: the bot cited a refund window that did not exist in the policy. She forwarded it to engineering with the question every team eventually asks: which step inside this conversation produced that hallucinated window? She had the request payload and the final response. Between them sat seven invisible operations — a query rewrite, an embedding call, a vector retrieval, a reranker, the main generation, a citation pass, and a judge score. None of them were written down. The investigation stalled at the first question the answer needed.

Modules 01-10 of this module taught how to measure quality. This chapter teaches how to see inside one measurement when it fails. The A/B told us a regression exists. Drift detection will tell us when it appears. Alerts will wake someone up. But until the team can reconstruct one specific failing chat at full resolution — every prompt, every retrieved document, every latency hop, every token cost — every other instrument in this module is reporting symptoms it cannot explain.

What this file solves

When an LLM application is a single model call, request and response logs are nearly enough. When it is an agent — retrieval, rerank, multi-step reasoning, tool calls, judges, retries — the request and response capture only the two visible ends of a hidden chain. This chapter shows how to turn that hidden chain into a trace: one parent span per user request, child spans for every internal operation, each carrying input, output, latency, tokens, cost, and metadata. By the end you can take the refund-bot's mystery hallucination, open trace 48f, and point at exactly which span injected the bad context. The signature artifact this file produces is one full OpenTelemetry-shaped JSON trace for one refund turn that you should be able to redraw from memory.

Why request/response logs stop teaching once the agent has steps

In a single-call setup, a log line of {prompt, completion, latency, tokens} answers almost every operational question. The model is the only moving part. If the answer is wrong, the prompt and completion are the whole story; you reread them and find the mistake.

The refund bot is not single-call. One user turn produces seven internal operations on a busy day, ten on a complex one. The user types "I want a refund for order 4481"; the system rewrites the query into search form, embeds it, hits a vector store, reranks the candidates, builds the final prompt with retrieved policy text, generates an answer, runs a citation-faithfulness judge, and sometimes retries when the judge fails. The request/response pair captures the user message and the final reply. Everything in between is invisible to the log.

When the visible reply is wrong, the failure could live in any of those seven steps. Query-rewrite turned "4481" into a search for "order 4481 refund window" and pulled the wrong policy section. The embedder ranked a six-month-old draft above the current one. Retrieval returned three docs but the reranker put a stale one first. The judge passed an answer that did not actually cite the retrieved doc. Reading the user message and the final response, you cannot tell which it was. The kitchen log in its full form — one trace, structured as parent and child spans — is the instrument that makes the difference between the seven hypotheses visible.

The naive repair and the visible break

The first repair a smart team tries is to bolt on print statements at every internal step and pipe them to a file. "Add logs." This works for the first incident, fails by the third. The logs are unstructured strings — "retrieved 3 docs", "reranked", "called model". They cannot be queried. They cannot be joined to the user request. They do not carry timing. When the team wants the question "what was the p95 retrieval latency for failing chats yesterday?", the answer is locked inside grep.

The second repair is structured JSON logs — same string-per-step idea, but with consistent keys. Better. But still missing the relationship between events. Two lines of {"step": "retrieve", "duration_ms": 220} and {"step": "generate", "duration_ms": 1800} do not say they belonged to the same request, do not say which was the parent, and do not say which ran in parallel. A complex agent run produces dozens of these lines per second across hundreds of concurrent users; the team is left to reconstruct the tree by timestamp guessing.

Not a "we need more logs" problem. Not a "we need a fancier log database" problem. A structure problem — the data model has to encode the parent/child relationship between operations within one user request, or the cost of asking "what happened inside this chat?" will always exceed the value of the answer. So the natural question becomes: what is the smallest data structure that captures one request as a tree of operations, not a stream of lines?

The answer is the trace, borrowed almost wholesale from general-purpose distributed tracing.

When 400 logged lines become one inspectable trace

Strip the model to its bones. One refund chat. Seven internal operations. Naive structured logging produces seven independent lines:

{"ts": 12:01:03.120, "step": "rewrite",   "ms": 35}
{"ts": 12:01:03.155, "step": "embed",     "ms": 18}
{"ts": 12:01:03.173, "step": "retrieve",  "ms": 220, "k": 3}
{"ts": 12:01:03.393, "step": "rerank",    "ms": 45}
{"ts": 12:01:03.438, "step": "generate",  "ms": 1800, "in": 1870, "out": 180}
{"ts": 12:01:05.238, "step": "cite",      "ms": 110}
{"ts": 12:01:05.348, "step": "judge",     "ms": 380}

Same information, structured as a trace:

trace 48f (root: refund_chat, total 2,608 ms)
└─ rewrite        (35 ms)
└─ embed          (18 ms)
└─ retrieve       (220 ms, k=3)   ─┐
   └─ rerank      (45 ms)          │ child of retrieve
└─ generate       (1,800 ms, 1,870 in / 180 out)
└─ cite           (110 ms)
└─ judge          (380 ms)

The second form costs the same to write and pays back every time someone asks a question that crosses operations: what is the p95 latency from request to first generated token?, for chats where the judge failed, what was the average retrieved-doc count?, did rerank actually change top-1 in the failing chats?. The flat-log form answers none of these without expensive joins. The trace form answers all of them as ordinary queries over typed fields. The kitchen log is not a log file with more fields — it is a shape that mirrors how the agent actually thinks.

The rule: a trace is one user request decomposed into named spans with input, output, timing, and cost

State it plainly. Each user-facing request produces exactly one trace. Each meaningful internal operation produces exactly one span inside that trace. Each span carries its input, its output, its start, its duration, its tokens, its cost, and the metadata needed to ask questions across spans. Everything else — sampling, redaction, retention, dashboards — is operational discipline around this primitive.

The corollary that makes the rule load-bearing: if you cannot replay a specific failure from logs alone, your tracing is incomplete. This is the replay-ability test. Take a randomly chosen failing chat from yesterday, open its trace, and try to reconstruct exactly what the model saw, what it returned, and what each tool did. If a step is missing or its input/output is not recorded, that gap is a future investigation you will not be able to finish.

Teacher voice. Span boundaries should follow the boundaries where you would want to ask "is this where it went wrong?". Not every function call deserves a span. Every component a human might suspect during an incident does.


1) Picture before details — the trace tree

The mental model is a tree of timed boxes. The root is the user-facing request. Each child is an internal operation with its own duration, its own input and output, and its own potential to fail. Some children have grandchildren — a retrieval span often nests a rerank span inside it, because rerank only happens once retrieval returns candidates.

                  TRACE (one user turn)
                  ────────────────────
            ┌────────────────────────────────┐
            │ root: refund_chat              │
            │ user_id, session_id, variant_B │
            │ total: 2,608 ms / $0.0042      │
            └────────┬───────────────────────┘
        ┌────────────┼──────────────┬──────────────┬────────────┐
        ▼            ▼              ▼              ▼            ▼
   ┌─────────┐  ┌─────────┐    ┌─────────┐   ┌──────────┐  ┌─────────┐
   │ rewrite │  │  embed  │    │retrieve │   │ generate │  │  judge  │
   │  35 ms  │  │  18 ms  │    │ 220 ms  │   │ 1,800 ms │  │ 380 ms  │
   │   LLM   │  │ vector  │    │  k=3    │   │   LLM    │  │   LLM   │
   └─────────┘  └─────────┘    └────┬────┘   └─────┬────┘  └─────────┘
                                    │              │
                                    ▼              ▼
                              ┌─────────┐    ┌─────────┐
                              │ rerank  │    │  cite   │
                              │  45 ms  │    │ 110 ms  │
                              └─────────┘    └─────────┘

Each box owns: a name, a kind (LLM call, vector op, tool call, evaluator), an input payload, an output payload, a start timestamp, a duration, token counts where relevant, cost, and a status (ok / error / retried). Each parent edge encodes a causality claim: this child ran because that parent invoked it. The tree is the picture; the JSON below is the artifact you actually store.

The shape is the same whether you use OpenTelemetry with the GenAI semantic conventions, LangSmith's SDK, LangFuse, Arize Phoenix, or roll your own. The vocabulary differs; the parent-child-with-attributes structure does not.

2) The signature artifact — one OpenTelemetry-shaped trace for trace 48f

Here is trace 48f, the chat the teammate flagged. The user asked about a refund window. The bot answered with a window that does not exist. Eight spans, one root.

{
  "trace_id": "48f9c3e1b7a64f2c9d2e0a1b3c4d5e6f",
  "trace.attributes": {
    "service.name": "refund-bot",
    "deployment.env": "prod",
    "session.id": "sess_91ab",
    "user.id_hash": "h_38d2",
    "experiment.id": "exp_prompt_terser_v3",
    "experiment.variant": "B",
    "app.prompt.version": "v17",
    "app.tenant.tier": "paid",
    "app.locale": "en-IN"
  },
  "spans": [
    {
      "span_id": "01_root",
      "parent_id": null,
      "name": "refund_chat.turn",
      "kind": "SERVER",
      "start_ms": 1716370863120,
      "duration_ms": 2608,
      "status": "OK",
      "attributes": {
        "http.route": "/v1/chat",
        "user.message": "i want refund for order 4481, it was late by 2 weeks",
        "assistant.message": "You are eligible for a refund of ₹1,250 on order 4481. Our refund window is 14 days from delivery; since your delivery was late, this is processed under the courtesy clause.",
        "outcome.policy_correct": false,
        "outcome.csat_response": 2
      }
    },
    {
      "span_id": "02_rewrite",
      "parent_id": "01_root",
      "name": "llm.query_rewrite",
      "kind": "CLIENT",
      "start_ms": 1716370863125,
      "duration_ms": 35,
      "status": "OK",
      "attributes": {
        "gen_ai.system": "openai",
        "gen_ai.request.model": "gpt-4o-mini",
        "gen_ai.usage.input_tokens": 84,
        "gen_ai.usage.output_tokens": 22,
        "gen_ai.cost_usd": 0.000071,
        "input.preview": "rewrite user msg as search query...",
        "output": "refund window late delivery courtesy clause order 4481"
      }
    },
    {
      "span_id": "03_embed",
      "parent_id": "01_root",
      "name": "vector.embed",
      "kind": "CLIENT",
      "start_ms": 1716370863162,
      "duration_ms": 18,
      "status": "OK",
      "attributes": {
        "gen_ai.request.model": "text-embedding-3-small",
        "gen_ai.usage.input_tokens": 13,
        "gen_ai.cost_usd": 0.0000003,
        "vector.dim": 1536
      }
    },
    {
      "span_id": "04_retrieve",
      "parent_id": "01_root",
      "name": "vector.retrieve",
      "kind": "CLIENT",
      "start_ms": 1716370863182,
      "duration_ms": 220,
      "status": "OK",
      "attributes": {
        "vectordb.system": "pinecone",
        "vectordb.index": "policy-v4",
        "retrieval.k": 3,
        "retrieval.doc_ids": ["pol_courtesy_v2_draft", "pol_refund_v7", "pol_late_delivery_v3"],
        "retrieval.scores": [0.871, 0.842, 0.811]
      }
    },
    {
      "span_id": "05_rerank",
      "parent_id": "04_retrieve",
      "name": "rerank.cross_encoder",
      "kind": "CLIENT",
      "start_ms": 1716370863404,
      "duration_ms": 45,
      "status": "OK",
      "attributes": {
        "rerank.model": "cohere-rerank-3",
        "rerank.top1_id": "pol_courtesy_v2_draft",
        "rerank.top1_score": 0.94,
        "rerank.changed_top1": true
      }
    },
    {
      "span_id": "06_generate",
      "parent_id": "01_root",
      "name": "llm.generate",
      "kind": "CLIENT",
      "start_ms": 1716370863449,
      "duration_ms": 1800,
      "status": "OK",
      "attributes": {
        "gen_ai.system": "anthropic",
        "gen_ai.request.model": "claude-3-5-sonnet-20241022",
        "gen_ai.usage.input_tokens": 1870,
        "gen_ai.usage.output_tokens": 180,
        "gen_ai.cost_usd": 0.00831,
        "gen_ai.response.id": "msg_01H4r7K",
        "prompt.system_tokens": 420,
        "prompt.retrieval_tokens": 1100,
        "prompt.user_tokens": 90,
        "prompt.tool_tokens": 260,
        "prompt.composition_hash": "sha256:7c1e..."
      }
    },
    {
      "span_id": "07_cite",
      "parent_id": "01_root",
      "name": "tool.citation_check",
      "kind": "INTERNAL",
      "start_ms": 1716370865249,
      "duration_ms": 110,
      "status": "OK",
      "attributes": {
        "citation.claims_extracted": 2,
        "citation.claims_supported": 1,
        "citation.unsupported": ["14-day refund window"]
      }
    },
    {
      "span_id": "08_judge",
      "parent_id": "01_root",
      "name": "llm.judge.policy_correct",
      "kind": "CLIENT",
      "start_ms": 1716370865359,
      "duration_ms": 380,
      "status": "OK",
      "attributes": {
        "gen_ai.request.model": "gpt-4o",
        "judge.rubric_version": "v3.2",
        "judge.score": 1,
        "judge.label": "policy_correct",
        "judge.rationale": "Mentions courtesy clause; correct on amount."
      }
    }
  ]
}

Read it once and you can already point at the bug. Span 04_retrieve returned pol_courtesy_v2_draft as one of its candidates. That document is a draft — its ID makes that explicit. Span 05_rerank then promoted the draft to top-1, beating the current pol_refund_v7. Span 06_generate ingested the draft as 1,100 of the 1,870 prompt tokens. Span 07_cite flagged the "14-day refund window" claim as unsupported. Span 08_judge scored it policy-correct anyway because the judge's rubric looked for "mentions courtesy clause" — and the draft mentioned it.

Three lessons in one artifact. The retrieval index is serving stale drafts. The rerank made the failure worse by promoting them. The judge's rubric is blind to the cite-check signal. None of those three lessons survive in a {prompt, completion} log line. All three are inspectable in the trace. This is what the kitchen log earns its complexity for.

3) Why OpenTelemetry won

LLM observability could have invented its own format. It largely did not. The reason is operational: every serious production system already runs distributed tracing for non-LLM components — databases, caches, HTTP services, message queues — and OpenTelemetry (OTel) is the standard those traces speak. When a refund chat crosses an LLM, a Postgres call, and a payments service, the team wants one trace that spans all three, not three disconnected logs.

OpenTelemetry's GenAI semantic conventions (the gen_ai.* attribute namespace shown above) standardise the field names: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens. LangSmith, LangFuse, Arize Phoenix, Helicone, and Datadog all import or export OTel traces. Anthropic's request_id and OpenAI's response.id slot into span attributes so a span can be cross-referenced with the provider's own logs. The lesson the industry learned from non-LLM tracing applied directly: the gain from one shared schema across vendors dwarfs the loss from any individual format being slightly off-optimal.

Mini-FAQ. "Do I have to use OTel?" No. A bespoke trace format works fine if the data model has parent/child/attributes. But once a second team or a second vendor needs to read your traces, OTel pays back the conformance cost. Most LLM-observability vendors emit OTel by default now.

4) Sampling — 100% of errors, 10% of success, 100% of long-tail latency

Storing every span for every request gets expensive fast. A refund chat with eight spans averaging 3 KB compressed is ~25 KB per trace. At 8,400 chats/week the bill is modest; at the scale of an Intercom Fin deployment (millions of chats/week) the storage and indexing cost forces a sampling decision.

The wrong way to sample is uniform — 10% of all traces, head-sampled at the SDK before the span tree is known. Uniform sampling means you keep 10% of your boring chats and 10% of your failures. The failures are exactly what you need at full resolution; uniform sampling throws away 90% of the most valuable evidence.

The right way is tail sampling — the SDK buffers the full trace and the sampling decision is made after the trace is complete, based on the outcome:

  • 100% of traces where any span errored, the judge failed, the citation check failed, or the user CSAT was ≤ 2.
  • 100% of traces in the long-tail latency bucket (p99+) — slow chats are where infrastructure bugs hide.
  • 10% of successful, ordinary traces — enough for population-level analysis.
  • 100% of any trace marked by an A/B variant under active investigation, regardless of outcome, so the experimental arm is fully observable.

The cost arithmetic, for the refund bot at one million traces/day:

Strategy Traces kept/day Storage hot 7d Warm 30d Cold 365d Total/month
Keep all, uniform retention 1,000,000 175 GB 750 GB 9.1 TB ~$1,200
100% errors + 10% success + tail ~150,000 26 GB 113 GB 1.4 TB ~$180
Errors only ~30,000 5 GB 23 GB 274 GB ~$36

(Numbers are order-of-magnitude on a typical S3-hot + S3-IA-warm + Glacier-cold setup, ~$0.023/GB hot, $0.0125 warm, $0.004 cold; index costs at OpenSearch or ClickHouse roughly double the hot tier. Stack-specific.)

The middle row is the working choice for most teams. Errors-only feels frugal until the first investigation needs population context — "is this kind of chat unusually slow today, or just slow always?" — and the data to answer is not there. The team learns this once and moves to tail sampling.

5) Pure-text logs vs structured logs vs traces — when each one is enough

Tracing is not free. The instrumentation, the storage, and the team-attention cost are real. The question is when a less expensive shape is genuinely enough.

Pure-text logs — a single line per event, free-form string. Works for: a small CLI tool, a script you run once a week, a single-call LLM application where the prompt and completion are the whole story. Fails when: anyone needs to ask a quantitative question, the agent has more than one internal step, or two events need to be correlated.

Structured logs — JSON-per-event with consistent keys. Works for: a single-call LLM service with modest traffic, a system where every event is self-contained and ordering matters less than searchability. Fails when: events need to be joined by request, parent-child causality matters, or you need to ask "how long did the whole request take, decomposed by step?"

Trace-based observability — parent/child spans with full attributes. Works for: any multi-step agent, anywhere retrieval and tool calls and judges are involved, anywhere two systems share one request. Costs more to instrument; pays back the first time someone investigates an A/B regression by opening one bad chat.

The escalation rule, in plain operational terms:

  • one model call, no tools, no retrieval → structured logs are fine.
  • one model call plus retrieval → start adding spans for the retrieval; trace-light is sufficient.
  • multi-step agent, any tool calls, any judge or eval in the loop, any A/B running → full tracing is no longer optional. Skipping it means accepting that no individual failure will be diagnosable.

The refund bot crossed this boundary the day it added retrieval. Most production LLM applications cross it within their first quarter. The kitchen log was a nice-to-have on day one and the only debug instrument by day ninety.

6) Privacy and PII — redact at the span, not after

Traces capture user messages, tool arguments, retrieved documents, and assistant replies. All four routinely contain personal data — order numbers, email addresses, names, sometimes payment fragments. A trace store that holds raw user content is a privacy surface; treating it as just "logs" is how teams end up with PII in a system that was never reviewed for it.

The right discipline is to redact at the span emission point, before the span leaves the application, not in a downstream pipeline. Three patterns work in practice. Field allowlisting — only specific attribute keys are allowed to carry free-text content, and every other field is structured. Inline PII detection — a lightweight detector (regex for emails/phones/cards plus a small classifier for names) tokenizes detected entities into placeholders before the span is exported: "the customer priya@acme.com asked..." becomes "the customer <email:e_4f81> asked...". Tier-based retention — full content for 24 hours for live debugging, structured-only after that, hashed-only after 30 days.

Mini-FAQ. "Why not just store everything and audit access?" Access audits work for small teams. They fail under regulatory scrutiny (GDPR right-to-be-forgotten, India DPDP, HIPAA) because deletion is hard once data has been replicated to backups, dashboards, and analytics warehouses. Redaction at emission cuts the chain at the source.

Trace 48f above already shows the pattern in miniature: user.id_hash instead of user_id. The user's actual ID never enters the trace; the hash is enough for joining across spans without exposing the identity.

7) Operational signals — what tells you tracing is healthy or rotting

A healthy tracing system has three signatures. Span coverage is stable — every code path that touches an LLM, a vector store, or a tool emits a span, and the count of distinct span names matches the number of components in the architecture. Per-span p95 latency is visible and roughly stable from week to week; spikes show up as outlier traces, not aggregate creep. Error rate per span name is computed and alerted on independently — a 3% error rate on rerank and a 0.1% error rate on generate are different stories, and an aggregate "request error rate" hides both.

The first signal of decay is span count drift on the same request shape. Yesterday a refund chat produced 8 spans; today it produces 6 because a code change stopped emitting two of them. The trace still looks fine — short and fast — but two operations have gone invisible. The fix is a contract test in CI that asserts the expected span names for a canonical request.

The misleading metric a beginner watches: total trace count. It rises and falls with traffic and tells nothing about quality. The metric an experienced operator watches: the ratio of full-resolution traces to errors — if errors are climbing but full-resolution traces are not (sampling is throwing them away), the team is flying blind on the most important slice. The graph an expert opens first: per-span p95 latency over the last seven days, faceted by span name and by deployment variant. That one chart catches a slow retriever, a regressing reranker, a flaky tool, and an A/B-variant-specific latency tax in one screen.

Diagnose, do not guess: not a model regression, an indexing regression — when generate p95 is flat but retrieve p95 doubled, the user-felt slowness is in the vector store, not the LLM.

8) Boundary of applicability — when minimal logs suffice, when full tracing is mandatory

Tracing earns its cost when the system has internal complexity and a non-trivial failure rate and an investigative obligation. Where one of those three is absent, lighter instrumentation can be honest.

A weekend script that calls one model with one prompt and prints the output does not need spans. A research notebook running a one-off comparison does not need spans. An internal tool with five users and a known-low blast radius can run on structured logs for months without pain.

Where tracing becomes mandatory: any LLM agent shipped to external users, any retrieval-augmented system in production, any pipeline running judges or eval-in-the-loop, anywhere an A/B test is live, anywhere a regulator might ask "reconstruct what the model saw on this date for this user". Casetext CoCounsel after the Mata v. Avianca incident is the canonical case — citation hallucination is a legal exposure, and the only defense against "how did this happen?" is a trace that shows exactly which retrieved doc the model was conditioned on.

The pathology at the other extreme is over-tracing — every function call becomes a span, every span carries the full input and output, and the storage bill quietly doubles each quarter. The fix is the span-boundary discipline from section 1: every span must be a suspect. Functions that no human would ever investigate during an incident do not need their own spans; they ride inside their parent.

9) The wrong mental model — "request and response are enough"

The seductive belief: "if I log the user message and the assistant message, I have the conversation." This is true for a single-call system and dangerously false for an agent.

The conversation has at least three definitions, and the wrong-model collapses them into one. The user-facing conversation is the visible turns — what the user typed and what the assistant said. The model-facing conversation is what the LLM actually saw — system prompt plus retrieved context plus tool outputs plus user turn. The operational conversation is the full execution graph — every internal call, every retrieval, every retry. A request/response log captures only the first. A failure analysis usually needs the second. A latency or cost investigation needs the third.

Three concrete consequences when the wrong model is in place:

  • A hallucinated answer is impossible to attribute to retrieval vs the model. The team blames whichever they suspect most strongly, and frequently fixes the wrong layer.
  • A token-cost spike is impossible to attribute to a longer system prompt vs more retrieved context vs a longer user message. The team rewrites the prompt and the cost stays the same because the retrieval context grew.
  • A latency regression is impossible to attribute to a slow tool vs a slow model vs a network hop. The team upgrades the model and the latency stays the same because the slowness was in the vector store.

Replace the wrong model with the right one: the user's request is the visible tip; the trace is the iceberg. Operational truth lives below the waterline. The kitchen log is the instrument that makes the underwater shape visible.

10) Six more failure shapes tracing reveals (and lack of tracing hides)

  • Silent retry storms. A tool call fails, the agent retries three times silently, the user sees a 6-second response. Without spans, this looks like a slow model. With spans, you see three failed tool spans before the successful one.
  • Context truncation. Retrieved docs exceed the model's context window and are silently truncated. Without spans, the answer is fluent and wrong. With spans, the prompt.retrieval_tokens attribute is suspiciously round (the truncation ceiling) on every failure.
  • Tool-argument hallucination. The model called a tool with arguments the schema rejected. Without spans, the tool just "didn't help." With spans, the tool span shows status=ERROR and the validation message.
  • Judge-rubric drift. The judge passes answers a human would fail. Without spans, the eval looks healthy. With spans, you can replay the judge call on a sample of human-failed traces and see the rubric is blind to a dimension.
  • Per-variant cost asymmetry. Arm B costs 22% more per chat. Without spans, you do not know why. With spans, the per-span cost attributes show arm B's retrieval pulls 1.4x more tokens because its query rewrite is more verbose.
  • Cross-tenant leakage. A retrieval span for tenant X returned a document from tenant Y. Without spans, you might never notice; the answer looks reasonable. With spans, doc IDs are auditable and the leak is one query away.

Every one of these is a story a team tells about "the day we wished we had tracing". Each becomes a routine query once tracing exists.

11) Cross-topic reinforcement — same shape, different chapter

  • Same invariant as judge calibration. 08-judge-calibration.md said "the rubric must be measured for honesty, not assumed honest". Tracing extends the same invariant: the agent's internal steps must be measured for what they did, not assumed correct. Both fight the same pressure — the gap between what the system claims to do and what it actually did.
  • Pressure relieved that A/B exposed. 10-ab-testing.md gave the team the headline that arm B regressed CSAT. Tracing answers which chats and which steps, which the headline cannot. The A/B is necessary; the trace is what makes the A/B actionable.
  • Same failure geometry as confident-wrong RAG. The hallucinated refund window in trace 48f is the same failure shape as 08_rag_system_design/01-confident-wrong-answer.md — fluent output, ungrounded claim. Tracing is the operational instrument that catches it after the fact and feeds the next iteration of retrieval design.
  • Forward pressure into alerting. A trace store is a haystack until alerting and dashboards turn specific signal patterns into pages. The next chapter is what makes the trace pile actionable in real time, not only during post-hoc investigation.

A self-test before you call tracing done

  • Can you take a randomly chosen failing chat from yesterday and reconstruct exactly which retrieved docs the model saw?
  • For each major component (retrieval, rerank, generate, tool, judge), is there a span with a name that would let a new engineer search for it?
  • Is at least 100% of errored traces sampled, even at the highest traffic volume?
  • Does every span carry token counts and cost where the operation is paid?
  • Does no span attribute carry raw user PII that has not passed a redaction pass?

Four yeses out of five is acceptable. Three or fewer means the next investigation will start with "we don't have that data".

Where this lives in the wild

The LLM-observability market is essentially a race to turn traces into product. The teams below differ on UI, vendor lock-in, and how aggressively they integrate eval and trace into one surface.

  • LangSmith — every chain run produces a hierarchical trace with prompts, retrieved docs, tool I/O, and evaluator outputs in one view; the "compare runs" UI is built on the trace primitive.
  • LangFuse — open-source LangSmith-shaped product; OTel-compatible exports; popular when teams want to self-host trace storage.
  • Arize Phoenix — open-source LLM tracing with built-in eval and drift overlays; the span tree is the central object every other view derives from.
  • Helicone — proxy-based tracing for OpenAI/Anthropic; lower instrumentation cost because it sits on the wire, narrower depth because it cannot see what the application does between provider calls.
  • Comet Opik — trace and prompt-management as one product; emphasises versioned prompts joined to traces for reproducibility.
  • Honeycomb LLM traces — general-purpose OTel-native observability extended with gen_ai.* semantic conventions; the queryable trace store is the differentiator, not the LLM-specific UI.
  • Datadog LLM Observability — adds LLM spans on top of the existing APM trace; the value is that the LLM call sits inside the same trace as the upstream HTTP request and the downstream Postgres query.
  • New Relic AI monitoring — same pattern, mature APM repurposed for LLM spans with built-in PII redaction at agent emission.
  • OpenTelemetry GenAI semantic conventions — the upstream spec that names gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens; the schema every vendor above is converging on.
  • Anthropic request_id — present on every API response; the canonical join key from a trace span to Anthropic's own server logs during a vendor incident.
  • OpenAI response.id — same role; teams who store it in the generate span can ask OpenAI Support about a specific call by ID.
  • AWS Bedrock CloudWatch traces — model invocation logs with request/response payloads; OTel-exportable so they can join an application's own traces.
  • Azure OpenAI logging via Application Insights — request/response capture wired into the Microsoft observability stack; teams in regulated industries lean on it for audit trails.
  • Google Vertex AI request logging — equivalent capture on the GCP side; integrates with Cloud Trace for end-to-end span trees.
  • GitHub Copilot's telemetry — internal trace pipeline that joins repo-context-gathering, suggestion generation, and acceptance feedback into one trace; the survival metric A/Bs run on top of it.
  • Intercom Fin — every ticket produces a trace combining retrieval, generation, and policy-tool spans; the "why did this ticket escalate?" investigation is one trace open.
  • Cursor / Continue — capture local LLM call traces with prompt-and-completion structure to support post-hoc eval over real developer interactions.
  • Notion AI — internal trace pipeline so workspace Q&A failures can be re-run against the same retrieved blocks.
  • Vectara HHEM, Patronus AI, Galileo — eval products that consume traces from the vendors above as their input; their existence presumes the trace already exists.
  • Casetext CoCounsel — citation traces are the legal-defensibility artifact post-Mata v. Avianca; the trace is what proves the retrieved authority actually existed.
  • W&B Weave — Weights & Biases' LLM trace product; tight integration with model-development workflows and dataset versioning.
  • Traceloop / OpenLLMetry — open-source instrumentation library that emits OTel-native LLM spans from common SDKs; the cheapest way to get started.
  • Microsoft Copilot for M365 — graph-aware tracing so per-user organisational context is visible in the audit trail without leaking to other tenants.

The pattern across the list is consistent: every team running serious LLM applications either built this layer in-house or paid a vendor to do it. The ones who did neither are the ones whose post-mortems still start with "we don't have logs for that".

Pause and recall

  1. Why are request/response logs insufficient once an agent has internal steps?
  2. State the rule: what is the minimum data captured in one span?
  3. In trace 48f, name the three spans that together explain the hallucinated refund window.
  4. Why is uniform 10% sampling worse than tail sampling at 100% errors + 10% success?
  5. What is the replay-ability test, and how does it audit whether tracing is complete?
  6. Name three patterns for PII redaction at the trace boundary.
  7. When is structured logging genuinely sufficient and tracing optional?
  8. What is the misleading metric a beginner watches, and which one does an expert open first?

Interview Q&A

Q1. The A/B regressed CSAT by 4 points. The team has prompt-and-completion logs only. What is the first thing you tell them they cannot answer, and what do you build first?

A. They cannot answer which step inside any failing chat produced the regression. The prompt and completion show the visible ends; the regression is almost certainly in the middle (retrieval, rerank, tool call, or judge). The first build is a trace tree per request with parent/child spans for every internal operation, sampling 100% of low-CSAT traces. With that, the team can pull twenty low-CSAT chats from arm B and read the spans side by side until the pattern emerges. Without it, the iteration is guesswork. Common wrong answer to avoid: "Add more logging." Adding flat lines to the same log file does not give parent/child structure; the team will still be unable to ask cross-step questions.

Q2. Why is OpenTelemetry's GenAI semantic convention worth following over a bespoke schema?

A. Because the LLM call is rarely the only thing in the trace. A refund chat crosses an LLM, a vector store, a Postgres lookup, and a payments service. OTel is already what the non-LLM systems speak, so a gen_ai.* span slots into the same trace as a Postgres span and a payments span. The team gets one cross-system view with one schema. The bespoke alternative wins on short-term flexibility and loses every time someone needs to join an LLM trace to a database trace. Common wrong answer to avoid: "OTel is the industry standard." That answer is true but content-free; the real reason is cross-system join.

Q3. You see a refund chat with eight spans. Span 04 (retrieve) returned three docs. Span 06 (generate) used 1,100 retrieval tokens. The answer cites a window that does not exist. Where do you start the investigation?

A. Start at span 04 — read the three doc IDs and confirm whether any one of them contains the hallucinated window. If yes, the failure is upstream of the model: the retrieval indexed a doc it should not have, or rerank promoted the wrong one. If no, the failure is in span 06 — the model invented a clause not present in its context, which is a grounding regression. The trace narrows the suspect set from "everything" to "one of two places" in under a minute. The fix differs by which one — bad retrieval data needs an index refresh; ungrounded generation needs a stricter prompt or a faithfulness judge. Common wrong answer to avoid: "Re-run the chat in the playground." Re-running may not reproduce; the trace is the recorded evidence, and you should read it before you experiment.

Q4. Your trace storage bill is growing 30% month-over-month. You are sampling 100% of all traces uniformly. What is the right move?

A. Switch to tail sampling — keep 100% of errored or low-CSAT traces, 100% of long-tail latency traces (p99+), 100% of any traces in an active A/B variant, and 10% of healthy traces. That preserves the full-resolution data on the slices that need investigation and trims the boring middle. Combine with tiered retention — hot for 7 days, warm for 30, cold for a year. The two together typically cut cost 5–10x while increasing the share of failures kept at full resolution. Common wrong answer to avoid: "Drop the sampling to 5% uniform." That throws away 95% of failures, which is exactly the slice the team is paying for.

Q5. Cumulative — drift detection (chapter 9) fires on retrieval-doc-distribution. The A/B is also running. The dashboards say everything is "fine". How do traces help?

A. Open the trace store with a filter for the drift's signature — say, traces where the retrieval.doc_ids set contains an ID that was rare yesterday and common today. Compare per-arm. If the drift is concentrated in arm B, the new prompt is asking different questions and pulling different docs — the experiment is causing the drift, not just observing it. If the drift is uniform across arms, an external traffic shift is hitting both arms and the A/B comparison is still valid but the absolute levels are misleading. Traces are the only artifact that can disentangle drift-caused-by-experiment from drift-caused-by-world. Common wrong answer to avoid: "Stop the A/B and investigate drift first." That assumes they are independent. Traces let you check whether they are.

Q6. Why redact PII at span emission time rather than at the trace store?

A. Because raw PII that enters the store has to be deleted on request, replicated to backups, scrubbed from dashboards, and removed from analytics warehouses — and each of those is a leak vector for the GDPR/DPDP/HIPAA scope. Redacting at emission means the PII never leaves the application boundary in trace form. The cost is a small CPU overhead for the redactor; the upside is that the trace store is not a regulated data surface. Audit-only access controls work for small teams; redaction-at-source works for any team that will be audited. Common wrong answer to avoid: "We have access controls on the trace store." Access controls do not satisfy right-to-be-forgotten; deletion across replicas is the hard part.

Q7. What is the difference between a span emitting status=ERROR and a trace where the assistant's answer was wrong but every span was OK?

A. A span error is a control-flow failure — a tool crashed, an API returned 500, validation rejected an argument. Every span being OK with a wrong answer is a semantic failure — every operation executed correctly and the model still produced a wrong claim. The two failures require different debugging instruments: span errors are caught by status filters and standard error-rate alerts; semantic failures are caught by judge scores, citation checks, or downstream user signals (CSAT, escalation). A team that watches only span errors will think the system is healthy while it confidently hallucinates. Common wrong answer to avoid: "If all spans are OK, the trace is fine." This is exactly the wrong-mental-model the chapter dismantles.

Q8. Cumulative — your judge says the answer is policy-correct, the citation-check span says one claim is unsupported, and the user CSAT is 2. Three signals, three different answers. What does the trace tell you and which signal is the lesson?

A. The trace tells you all three signals exist on the same chat, which is the rare case where you can compare them directly. The judge's policy_correct label is honest about its rubric (chapter 8) — it scored the courtesy-clause mention. The citation-check is honest about its dimension — one specific claim is unsupported. The user CSAT is honest about the user's experience. The lesson is that the judge's rubric is incomplete: it needs to consume the citation-check signal or the dimension that the user reacted to. The fix lives upstream of the trace — rubric extension, then re-calibration, then a re-test. The trace is what made the diagnosis possible at all; without it, the three signals would have lived in three different places and the contradiction would never have surfaced on one screen. Common wrong answer to avoid: "The judge is wrong." The judge is doing its job. The rubric needs another dimension. Misdiagnosing this as a judge bug sends the team to the wrong layer.

Apply now (10 min)

Step 1 — model the exercise. Take trace 48f from this chapter. Here is the diagnostic table I would put on the incident review:

Span Suspect signal Evidence in trace
04_retrieve served a draft as a candidate retrieval.doc_ids[0] = pol_courtesy_v2_draft
05_rerank promoted the draft to top-1 rerank.changed_top1 = true, top1_id = same draft
06_generate 1,100 of 1,870 prompt tokens were retrieval prompt.retrieval_tokens
07_cite flagged the 14-day window as unsupported citation.unsupported = ["14-day refund window"]
08_judge passed it as policy-correct judge.score = 1, ignored cite signal
Root cause draft doc in index + judge rubric blind to citation

Three components share blame, and the trace makes the hands_on_lab auditable. The fix list — purge drafts from the index, re-test the reranker on the cleaned index, extend the judge rubric to consume the citation-check — emerges from the table.

Step 2 — your turn. Take one LLM workflow in your own system. Sketch the trace tree: what is the root span, what are the child spans, what attributes does each span need to carry? For each span, write the one investigation question it should be able to answer. If you cannot name the question, the span is decoration.

Step 3 — reproduce from memory. Without scrolling up, redraw the trace tree from section 1 and reproduce the eight spans of trace 48f as a flat list with span name, parent, and one critical attribute each. Then state the replay-ability test in one sentence. If you can do all three cold, you carry the kitchen log.

What you should remember

This chapter explained why the headline from an A/B regression is useless without an instrument that can open one failing conversation at full resolution. The refund chatbot's mystery hallucination — a refund window that does not exist in the policy — was invisible in the request/response log and obvious in trace 48f. Three spans together produced the failure: retrieval returned a draft document, rerank promoted it, and the judge's rubric was blind to the citation-check signal. None of those three lessons survive in {prompt, completion} logging. All three are inspectable in a span tree.

You learned the rule: each user request produces one trace; each meaningful internal operation produces one span; each span carries input, output, timing, tokens, cost, and metadata. The replay-ability test audits whether tracing is done — pick a random failing chat and try to reconstruct exactly what the model saw. If a step is missing, the next investigation will start with "we don't have that data". You learned OpenTelemetry's GenAI semantic conventions are the schema the industry is converging on because the LLM call rarely lives alone in a trace. You learned tail sampling — 100% of errors, 10% of success, 100% of long-tail latency — is how a million-trace-per-day system stays affordable without throwing away the slices that matter. And you learned to redact PII at the span boundary, not in the warehouse, because access controls do not satisfy deletion.

Carry this diagnostic forward: when a chat goes wrong, the first move is open the trace, read the spans in order, name the suspect span before naming the fix. When token cost spikes, decompose by prompt.*_tokens attribute before rewriting the system prompt. When latency regresses, read per-span p95 before upgrading the model. The kitchen log is the instrument that turns "the experiment regressed" into "row 4,182 fell at span 05, for this reason". Vibes belong on possibility, offline evals on candidate ranking, A/Bs on live effects, traces on individual failures. Use each where its question lives.

Remember:

  • One user request, one trace; one internal operation, one span. The tree is the data model.
  • The replay-ability test: if you cannot reconstruct a specific failing chat from logs alone, your tracing is incomplete.
  • Tail sampling beats uniform sampling — 100% of errors, 100% of long-tail latency, 10% of healthy traffic. Errors are the slice you are paying to keep.
  • Span boundaries follow suspect boundaries. If a human would never investigate it during an incident, it does not deserve its own span.
  • Per-span p95 latency, faceted by span name and variant, is the chart an expert opens first. Total trace count tells you nothing.
  • Redact PII at the span, not in the warehouse. Deletion across replicas is the hard part of regulated data.

Bridge. A trace store is a haystack. It answers any question you ask it, but it does not ask any questions itself. The team that opens it only after a user complains has already lost the day. The next chapter is the layer that turns the trace pile into a pager — which signals from spans deserve a dashboard tile, which deserve an alert, which deserve to wake someone up at 2 a.m., and which deserve to wait until Monday's review. We can dissect failures one at a time; now we need the system that surfaces the failures we have not noticed yet.

12-alerting-dashboards.md