03. Reading a trace — anatomy of the case file¶

~14 min read. The case file is a tree. Read it spine-first, branches-second, gaps-third. That order is the difference between a one-hour debug and a one-week guess.

Built on the ELI5 in 00-eli5.md. The case file, the full trail of one request, only debugs well when each witness note inside it carries the same trace ID and the right parent-child links. This chapter shows how a senior engineer actually reads that file.

Start with the picture¶

A user sees one answer, but the system may run many steps to produce it. The gateway receives the message, a retriever queries the vector store, a ranker sorts chunks, a tool fetches billing data, the LLM answers, and a streaming layer sends the response back. Without distributed tracing those steps live in separate systems, each team sees only its own box, and root cause becomes politics. With tracing, one request gets one trace ID, every step becomes a span, and parent-child links preserve the shape of the work.

trace_id = tr_84a1

user request
    │
    ▼
┌────────────────────────── gateway span ──────────────────────────┐
│ span_id=s1                                                       │
│                                                                  │
│   ┌──────── retriever span ────────┐   ┌─────── llm span ──────┐ │
│   │ span_id=s2  parent=s1          │   │ span_id=s4  parent=s1 │ │
│   │                                │   │                       │ │
│   │   ┌── vector DB span ───────┐  │   │  ┌── stream span ──┐  │ │
│   │   │ span_id=s3 parent=s2    │  │   │  │ parent=s4       │  │ │
│   │   └─────────────────────────┘  │   │  └─────────────────┘  │ │
│   └────────────────────────────────┘   └───────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

That whole drawing is the case file, and each box inside it is a witness note. Now the questions a debugger actually needs become askable: who called whom, which child slowed the parent, what branch failed first.

The three core objects¶

First is the trace ID. It names the full request journey. Every service touched by the request should carry it forward, and a new trace should start only when a genuinely new request begins.

Second is the span. A span represents one unit of work. It has a start time and end time. It also has status, metadata, and events. Examples are retrieve.docs, tool.sql.query, or llm.generate.

Third is the parent-child relationship. A parent span starts child work. The child may run inside the same service or another one. This relation gives the tree shape. That shape is what plain logs miss.

Now what is the senior detail? Spans are not just timing wrappers. They are causal wrappers. They answer, "this work happened because that parent asked for it." That is why distributed tracing beats timestamp guessing.
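The three objects can be sketched as a small data structure. This is a minimal illustration, not any vendor's schema: the `Span` fields, the `render_tree` helper, and the timings are assumptions made up for the example, reusing the span IDs from the diagram above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # names the whole request journey
    span_id: str              # names this one unit of work
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    status: str = "ok"


def render_tree(spans: list[Span]) -> list[str]:
    """Return indented lines: parents before children, siblings by start time."""
    children = defaultdict(list)
    for s in spans:
        children[s.parent_id].append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in sorted(children[parent_id], key=lambda c: c.start_ms):
            lines.append(f"{'  ' * depth}{s.name} ({s.end_ms - s.start_ms} ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical timings; spans arrive in arbitrary order, as they would from an exporter.
spans = [
    Span("tr_84a1", "s3", "s2", "vector_db", 20, 90),
    Span("tr_84a1", "s1", None, "gateway", 0, 900),
    Span("tr_84a1", "s4", "s1", "llm", 130, 880),
    Span("tr_84a1", "s2", "s1", "retriever", 10, 120),
]
tree_lines = render_tree(spans)
print("\n".join(tree_lines))
```

The tree is rebuilt purely from `parent_id`, not from timestamps. Timestamps only order siblings, which is the sense in which spans are causal wrappers rather than timing wrappers.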

Propagation is the real battle¶

Creating a trace ID once is easy. Keeping it alive across boundaries is harder. The gateway must pass it downstream. The retriever must attach it to its RPC call. The tool service must read it and continue the same case file. Async queues must preserve it too. Otherwise the tree breaks.

Common failure pattern. The web tier traces nicely. Then it pushes a job into a queue. The worker starts a fresh trace. Now the user request and the slow tool call look unrelated. Your detective room splits one case into two. Bad move.

The fix is to standardize propagation: headers for HTTP, metadata for gRPC, message attributes for queues, and context objects inside code. Done consistently, every witness note stays inside the same case file regardless of which subsystem produced it.
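One way to picture propagation is the W3C Trace Context `traceparent` value, which packs version, trace ID, parent span ID, and flags into one string that fits in an HTTP header or a queue message attribute. The sketch below uses only the standard library; `inject`, `extract`, and `handle_message` are hypothetical helper names, and a real system would normally rely on its tracing SDK's propagators instead.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 16 hex characters


def inject(carrier: dict, trace_id: str, span_id: str) -> None:
    """Write the current trace context into outgoing headers or message attributes."""
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-01"


def extract(carrier: dict):
    """Return (trace_id, parent_span_id), or None if the context is missing or malformed."""
    match = TRACEPARENT.match(carrier.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None


def handle_message(attributes: dict) -> dict:
    """Hypothetical queue worker: continue the caller's trace if one was propagated."""
    ctx = extract(attributes)
    if ctx is None:
        # No usable context: a fresh trace is the only option, and the tree splits here.
        trace_id, parent_id = secrets.token_hex(16), None
    else:
        trace_id, parent_id = ctx
    return {"trace_id": trace_id, "span_id": new_span_id(), "parent_id": parent_id}


# Web tier: start a trace, then enqueue a job with the context attached.
trace_id, gateway_span = secrets.token_hex(16), new_span_id()
message_attributes: dict = {}
inject(message_attributes, trace_id, gateway_span)

linked = handle_message(message_attributes)  # stays in the same case file
orphan = handle_message({})                  # attributes dropped: unrelated new trace
```

The `orphan` result is the failure pattern described above: the worker did real work for the same user request, but nothing in its span ties it back to the gateway.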

Worked example: a fan-out retrieval path¶

Suppose one chat request searches three corpora. Policy docs. User notes. Past tickets. Each search runs in parallel. Then the system merges results.

trace tr_7007
│
├── gateway.chat_request                  35 ms
│
├── retrieve.all_sources                 190 ms
│   ├── search.policy_docs                80 ms
│   ├── search.user_notes               120 ms
│   └── search.past_tickets             160 ms
│
├── merge.results                         18 ms
└── llm.answer                         2,400 ms

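What do we learn immediately? The parent retrieval span took 190 ms. That makes sense: the searches ran in parallel, so the parent waited for the slowest child rather than for the sum of all three. The slowest child was search.past_tickets at 160 ms; the remaining 30 ms is presumably the parent's own overhead for dispatching the searches and collecting results. So if retrieval becomes slow, start with search.past_tickets.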
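The arithmetic behind that reading can be checked in a few lines. This sketch assumes the three searches start at roughly the same moment, which the tree itself does not record; the variable names are illustrative.

```python
# Child durations copied from the tr_7007 tree (milliseconds).
children_ms = {
    "search.policy_docs": 80,
    "search.user_notes": 120,
    "search.past_tickets": 160,
}
parent_ms = 190  # retrieve.all_sources

slowest_name = max(children_ms, key=children_ms.get)
slowest_ms = children_ms[slowest_name]

# Parallel children: the parent waits for the slowest one, not for the sum.
parent_overhead_ms = parent_ms - slowest_ms
sequential_ms = sum(children_ms.values())

print(f"critical child: {slowest_name} ({slowest_ms} ms)")
print(f"parent overhead: {parent_overhead_ms} ms")
print(f"if run one after another: {sequential_ms} ms")
```

If the parent had taken close to 360 ms instead of 190 ms, that would suggest the searches were running sequentially despite the intended fan-out.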

Now add one more issue. search.user_notes failed once and retried. The child span shows an event. Its status stayed okay after retry. The parent still finished successfully. This is important. Without span structure, you might misread the incident. You would only see some failure log and panic. Tracing keeps causality clean.
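A rough sketch of how a retry can live inside one span as an event, so the span's final status still reflects the outcome. The `Span` class and `call_with_retry` helper are hypothetical, written only for this illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    status: str = "unset"
    events: list = field(default_factory=list)
    duration_ms: float = 0.0


def call_with_retry(name, fn, max_attempts=3):
    """Run fn inside one span; each failed attempt is recorded as a span event."""
    span = Span(name)
    started = time.perf_counter()
    try:
        for attempt in range(1, max_attempts + 1):
            try:
                result = fn()
                span.status = "ok"
                return result, span
            except Exception as exc:
                span.events.append(
                    {"name": "attempt_failed", "attempt": attempt, "error": repr(exc)}
                )
        span.status = "error"  # every attempt failed
        return None, span
    finally:
        # Duration covers all attempts: the user waited through every one of them.
        span.duration_ms = (time.perf_counter() - started) * 1000


calls = {"n": 0}


def flaky_search():
    """Stand-in for search.user_notes: fails on the first call, succeeds after."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("user_notes index timed out")
    return ["note-17"]


result, span = call_with_retry("search.user_notes", flaky_search)
print(span.status, len(span.events))
```

The failure is still visible as an event, but it is attached to a span that ended in `ok`, so a reader can tell a recovered hiccup from the cause of an incident.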

Reading a trace like a systems engineer¶

When you open a trace, ask four questions. What is the root span? Which branch is slowest? Which child failed first? Which tags separate this request from healthy ones? That is enough to start well.

Look at duration nesting. If the parent is 8 seconds and one child is 7.7 seconds, that child dominates. If all children are tiny but the parent is large, you may have hidden work or bad instrumentation. If a retry child appears twice, count it as real cost. The user waited through both.
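Duration nesting can be made concrete with a "self time" calculation: the part of a parent's duration not covered by any child. The helper below is a sketch with an assumed representation (spans as `(start_ms, end_ms)` tuples), and the example numbers are invented.

```python
def self_time_ms(parent, children):
    """Parent duration minus the time covered by at least one child.

    Overlapping children are merged as we go, so parallel work is not
    subtracted twice.
    """
    p_start, p_end = parent
    covered = 0
    cursor = p_start  # everything before cursor is already accounted for
    for start, end in sorted(children):
        start, end = max(start, cursor), min(end, p_end)
        if end > start:
            covered += end - start
            cursor = end
    return (p_end - p_start) - covered


# One child dominates an 8-second parent.
dominated = self_time_ms((0, 8000), [(100, 7800)])

# Tiny children under a large parent: most of the time is unexplained.
hidden = self_time_ms((0, 8000), [(0, 50), (60, 100)])

# Three parallel children: overlap is counted once.
parallel = self_time_ms((0, 190), [(10, 90), (10, 130), (10, 170)])

print(dominated, hidden, parallel)
```

A large self time on a span that is supposed to be a thin coordinator is the "hidden work or bad instrumentation" signal described above.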

Also inspect missing spans. A blank area in the case file is itself a clue. Maybe a service was never instrumented. Maybe propagation broke. Maybe sampling dropped a critical branch. Senior engineers notice absent witness notes, not just red ones.
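One kind of absent witness note can be detected mechanically: a span whose parent ID points at a span that was never recorded. The sketch below assumes spans are exported as plain dicts with `span_id` and `parent_id` keys. It cannot say why the parent is missing, only that it is.

```python
def find_orphans(spans):
    """Return span IDs whose parent_id points at a span that was never recorded."""
    known = {s["span_id"] for s in spans}
    return [
        s["span_id"]
        for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in known
    ]


# Hypothetical export: s3 claims parent s9, which is absent from the trace.
recorded = [
    {"span_id": "s1", "parent_id": None, "name": "gateway"},
    {"span_id": "s2", "parent_id": "s1", "name": "retriever"},
    {"span_id": "s3", "parent_id": "s9", "name": "vector_db"},
]
orphans = find_orphans(recorded)
roots = [s["span_id"] for s in recorded if s["parent_id"] is None]
print(orphans, roots)
```

An orphan like `s3` might mean the parent service is uninstrumented, or that sampling dropped the parent. More than one root in a single trace is a similar hint that a boundary started over instead of continuing the context.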

Trace design for AI pipelines¶

LLM systems need spans at useful boundaries. Not every line of code. That becomes noise. But not only one giant chat_request span either. That becomes fog.

Good default boundaries are these. Request entry. Authentication. Retrieval. Reranking. Tool call. Safety check. LLM generation. Streaming. Persistence. Human handoff.

Why this granularity? Each boundary is a real operational decision. Each one can fail alone, each one may have a different owner, and each one deserves its own witness note so a future debugger can localize the bug without re-running the request.

Trace-reading across LLM observability stacks¶

LangSmith trace UI — span tree per request with input/output preview at every node; the role is making parent-child relationships clickable instead of inferred from timestamps.
LangFuse — open-source span/trace model with token and cost rolled up per span; the role is reading "where did the budget go in this trace?" without aggregating logs by hand.
Arize Phoenix — trace + eval co-located so a low-scored output points at its trace tree directly; the role is closing the loop from "this answer was wrong" to "this retrieval step returned the wrong chunk."
Helicone trace explorer — request/response with retry chain, latency breakdown, and cost; the role is exposing hidden retries that double the user-visible bill.
Comet Opik — trace UI with eval scoring on each span; the role is enabling per-span quality regressions, not just per-request.
Honeycomb LLM observability — BubbleUp diff highlights anomalous spans; the role is finding the one slow span in 10K traces without staring at p99 graphs.
Datadog APM for LLM apps — parent-child spans with the existing service-map view; the role is letting an SRE who already reads Datadog traces inherit the LLM trace for free.
OpenTelemetry GenAI semantic conventions — standardised span attribute names (gen_ai.system, gen_ai.request.model, etc.); the role is preventing vendor-locked trace schemas so a team can swap observability tools without rewriting instrumentation.
AWS Bedrock CloudWatch traces — span trees for Bedrock agents and tool calls; the role is letting AWS shops reuse CloudWatch dashboards for AI workloads.
Azure OpenAI logging — request/response capture with optional content storage; the role is meeting compliance requirements while still preserving debug-ability.
GCP Cloud Logging for Vertex AI — structured logs with trace correlation; the role is feeding the same trace into BigQuery for long-horizon analysis.
Anthropic console (trace viewer) — per-request input/output with tool-call breakdown for Claude API users; the role is giving a first-party trace view without third-party instrumentation.
OpenAI usage dashboard — request-level token counts and latency; the role is the minimal trace data when no instrumentation exists.
Vercel AI SDK traces — built-in tracing for streamText and tool calls; the role is making tracing the default for Next.js LLM apps.
LlamaIndex Observability — span hooks per retrieval, rerank, and synthesis step; the role is letting RAG pipelines emit traces without manual instrumentation.
LangGraph trace UI — node-by-node graph traversal in the LangSmith UI; the role is reading multi-agent graphs as nested span trees instead of flat sequences.
BAML observability hooks — typed-DSL tracing with retry and parse-error spans; the role is keeping retries and validation visible at trace level.
Pydantic AI logfire integration — typed agent tracing into Logfire; the role is making structured logs and traces share one schema.
Sourcegraph Cody tracing — per-context-fetch spans for code retrieval; the role is exposing where code context bloats the prompt.
Cursor's internal trace replay — checkpoints per agent step that can be replayed locally; the role is shrinking the bug-to-repro gap from days to minutes.
OpenInference (Arize project) — open trace schema for LLM/agent runs; the role is the common substrate Phoenix and others build on.
Stripe Docs AI assistant — platform engineer: propagates trace IDs across edge API, vector search, and billing-data tool services.
Mercor recruiting assistant — reliability engineer: uses trace trees to spot queue workers that accidentally start new traces and break causality.

Recall — span trees, propagation, and witness notes¶

What is the job of a trace ID versus a span ID?
Why are parent-child links more useful than timestamps alone?
In the fan-out example, why did the parent retrieval span take 190 ms?
What kinds of boundaries should usually become spans in an AI pipeline?

Interview Q&A¶

Q: Why use parent-child spans and not just log every timestamped step?
A: Parent-child links preserve causality, parallelism, and waiting relationships that raw timestamps cannot reconstruct reliably.
Common wrong answer to avoid: "Because traces look nicer in vendor dashboards."

Q: Why is propagation the hardest part of distributed tracing and not span creation itself?
A: Creating spans locally is easy, but preserving one continuous trace across services, queues, and async hops is what keeps the story intact.
Common wrong answer to avoid: "Because generating unique IDs is computationally expensive."

Q: Why should an AI team avoid one giant top-level span for a request?
A: A single coarse span hides which subsystem caused latency, failure, retries, or cost.
Common wrong answer to avoid: "Because vendors charge by number of traces."

Q: Why do missing spans matter during debugging?
A: Missing spans can indicate broken propagation, absent instrumentation, or sampling gaps, which are themselves root causes of observability blindness.
Common wrong answer to avoid: "If a span is missing, that branch probably did not run."

Apply now (10 min)¶

Step 1 — model the exercise. Here is the span tree I would draw for one refund-bot request, with propagation notes:

trace_id: 7f3a...
└── root span: HTTP /chat                    ← Edge API
    ├── span: load conversation             ← DB (propagation: trace header)
    ├── span: route_intent (LLM call)       ← Anthropic (propagation: HTTP header)
    ├── span: retrieve_policy (parent)      ← fans out
    │   ├── span: bm25_query                ← parallel
    │   └── span: dense_query               ← parallel
    ├── span: rerank                        ← Cohere API
    ├── span: generate_answer (LLM call)
    └── span: emit_to_queue                 ← propagation: queue header carries trace_id

Each child can fail alone. Each one carries its own witness note. The retrieve fan-out is the one that surprised me last time: both children needed propagation because the two legs lived in different services.

Step 2 — your turn. Take one AI request path from your own product. Write a root span and four child spans. Mark which child could fan out. Note exactly where the trace ID is propagated — HTTP header, queue message attribute, or function argument.

Step 3 — reproduce from memory. Draw the span tree from the chapter's diagram. Label the case file and at least three witness notes. Add one sentence on why propagation keeps the detective board honest.

What you should remember¶

This chapter explained why a trace is not a log file. The case file is a tree, not a timeline. Spans encode parent-child causality, parallelism, and waiting relationships that timestamps alone cannot recover. When a debugger reads a trace, the structure of the tree tells them where to look: fan-outs reveal which leg was slow, missing children reveal broken propagation or absent instrumentation, and orphan traces reveal boundaries where context was silently dropped.

You also learned why granularity decisions matter. One coarse span hides which subsystem caused latency, failure, or cost. The right boundaries are the ones that map to real operational decisions: a model call is a span because the team needs to attribute generation cost; a tool call is a span because failure routing depends on which tool failed; a queue boundary is a span because async hops are the place propagation breaks. Each boundary becomes a witness note future debuggers can read.

Carry this diagnostic forward: when a trace looks "small", suspect missing instrumentation before suspecting a clean run. Most short traces are not honest reports of a simple request; they are honest reports of a request you did not instrument.

Remember:

A trace is a tree of spans, not a list of timestamps. The shape carries causality.
Parent-child links, once propagated, preserve causality across parallelism and async hops; timestamps alone cannot.
Missing spans usually point to gaps in instrumentation, propagation, or sampling, not to a quiet code path.
Every operational boundary worth owning becomes a span. Tool calls, model calls, retries, queue hops.
Trace IDs propagate across HTTP, queues, async boundaries, and worker pools. One break and the case file splits.

Bridge. We have the tree now. But LLM calls need special visibility inside that tree, because prompt size, token count, and generation latency create new failure modes. → 04-llm-specific-traces.md