04. LLM-specific traces: watch the model like a subsystem, not a magic box

~14 min read. Generic tracing is not enough once tokens, prompts, and streams enter the picture.

Built on the ELI5 in 00-eli5.md. The witness note — one step in the case — must become richer for LLM calls because the model hides cost and delay inside one box.

A normal span is too shallow for model calls

A database span gets by with method, table, and duration. That works because the cost and meaning of the call are already implicit in those three fields. An LLM call hides much more inside one box: prompt and completion sizes, input and output token counts, model name and version, temperature, cache status, safety filter status, and streaming start time. Every one of these moves cost, latency, or correctness. A generic span captures none of them, and that is why generic tracing feels incomplete the moment a model call appears in your stack.

generic service span                  useful LLM span
┌──────────────────────┐              ┌────────────────────────────┐
│ name = external_call │              │ name = llm.generate        │
│ duration = 2400 ms   │              │ model_name = gpt-4.1-mini  │
│ status = ok          │              │ prompt_tokens = 1880       │
└──────────────────────┘              │ completion_tokens = 420    │
                                      │ first_token_ms = 780       │
                                      │ total_ms = 2400            │
                                      │ prompt_version = faq-v9    │
                                      │ cache_hit = false          │
                                      └────────────────────────────┘

The second span is a real witness note; the first is just a shrug. The discipline is to treat the LLM call as its own observability domain, with span fields that match how the call actually fails and costs money.

What an LLM span should capture

Start with identity — provider, model, deployment region, prompt template version, and experiment bucket. Without these evidence tags, two spans cannot even be compared, because you do not know whether you are looking at the same system twice.

Then capture request size: prompt tokens, retrieved context tokens, system message length, and tool-result token share. These numbers explain latency, cost, and truncation failures at a glance. A 12k-token prompt that gets cut off at an 8k context limit is not a model bug; it is a failure that stays invisible until you record the span field you forgot.

Then capture response behaviour: completion tokens, stop reason, refusal flag, JSON parse success, tool-call request count, streaming first-token latency, and streaming total latency. Together they answer whether the answer was fast, valid, and usable — three different questions that a single status = ok flattens into one misleading green tick.

Finally capture money — input token cost, output token cost, total estimated cost. When you run many models against many tenants, this is the only way the daily bill maps back to specific prompts. Your crime statistics roll up from these span fields; if the fields are missing, the statistics are fiction.
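The four buckets above can be sketched as one record type. This is a minimal sketch, assuming a plain Python dataclass is an acceptable stand-in for your tracing library's span attributes. The field names follow this chapter; the provider name, the stop_reason value, the retrieved-context share, and both per-million-token prices are hypothetical placeholders, not any vendor's real values.

```python
from dataclasses import dataclass, asdict


@dataclass
class LLMSpan:
    # Identity: which system produced this span.
    provider: str
    model_name: str
    prompt_version: str
    # Request size.
    prompt_tokens: int
    retrieved_context_tokens: int
    # Response behaviour.
    completion_tokens: int
    stop_reason: str
    first_token_ms: float
    total_ms: float
    # Money: filled in by price() below.
    input_cost_usd: float = 0.0
    output_cost_usd: float = 0.0

    @property
    def total_cost_usd(self) -> float:
        return self.input_cost_usd + self.output_cost_usd


def price(span: LLMSpan, usd_per_m_input: float, usd_per_m_output: float) -> LLMSpan:
    """Fill the money bucket from token counts and caller-supplied prices."""
    span.input_cost_usd = span.prompt_tokens * usd_per_m_input / 1_000_000
    span.output_cost_usd = span.completion_tokens * usd_per_m_output / 1_000_000
    return span


span = price(
    LLMSpan(
        provider="example-provider",    # hypothetical
        model_name="mini-answer-v1",
        prompt_version="faq-v9",
        prompt_tokens=1200,
        retrieved_context_tokens=700,   # hypothetical share of the prompt
        completion_tokens=180,
        stop_reason="stop",             # hypothetical value
        first_token_ms=320,
        total_ms=1280,
    ),
    usd_per_m_input=1.0,    # hypothetical price
    usd_per_m_output=2.5,   # hypothetical price
)
print(asdict(span))
print(round(span.total_cost_usd, 5))  # 0.00165
```

Computing cost at span-write time, rather than later from a billing export, is what lets the daily bill be grouped by prompt_version or tenant in the same query that shows latency.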

Worked example: slow but cheap versus fast but expensive

Suppose your assistant can answer with two models. Model A is small; Model B is larger. Product wants low latency, finance wants low cost, and the argument has been running for two weeks on vibes. With proper span fields, the argument ends in one query.

Trace sample one (Model A).

span: llm.generate
model_name = mini-answer-v1
prompt_tokens = 1200
completion_tokens = 180
first_token_ms = 320
total_ms = 1280
input_cost = $0.0012
output_cost = $0.0005
cache_hit = false

Trace sample two (Model B).

span: llm.generate
model_name = premium-answer-v3
prompt_tokens = 1200
completion_tokens = 210
first_token_ms = 910
total_ms = 2460
input_cost = $0.0090
output_cost = $0.0042
cache_hit = false

Model A is faster and cheaper: its first token arrives in 320 ms against 910 ms, and the call costs $0.0017 against $0.0132, roughly an eighth of the price. That is only half the question. Quality may differ, so we tag both spans with eval_bucket, task_type, and user_segment and let answer quality be compared later against the same slice. The case file by itself does not pick the winner, but without these span fields the comparison never even starts.
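The "one query" can be sketched as a group-by over span records. This is a rough illustration, assuming spans are available as plain dictionaries; the two records reuse the sample values above, while the task_type value "faq" and the summarise helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

spans = [
    {"model_name": "mini-answer-v1", "task_type": "faq", "first_token_ms": 320,
     "total_ms": 1280, "input_cost": 0.0012, "output_cost": 0.0005},
    {"model_name": "premium-answer-v3", "task_type": "faq", "first_token_ms": 910,
     "total_ms": 2460, "input_cost": 0.0090, "output_cost": 0.0042},
]


def summarise(spans, task_type):
    """Average latency and cost per model for one task_type slice."""
    groups = defaultdict(list)
    for s in spans:
        if s["task_type"] == task_type:
            groups[s["model_name"]].append(s)
    return {
        model: {
            "n": len(rows),
            "mean_first_token_ms": mean(r["first_token_ms"] for r in rows),
            "mean_total_ms": mean(r["total_ms"] for r in rows),
            "mean_cost_usd": mean(r["input_cost"] + r["output_cost"] for r in rows),
        }
        for model, rows in groups.items()
    }


summary = summarise(spans, "faq")
for model, stats in summary.items():
    print(model, stats)
```

Filtering by slice before grouping matters: comparing models across different task types would mix prompt sizes and make the latency and cost columns meaningless.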

Prompt-to-completion tracing matters

A common mistake, even among senior engineers, is to trace only the API round trip and treat prompt assembly as a free preamble. The result is a span that blames the model for latency the model never spent. Prompt assembly can be expensive: retrieval may add ten chunks, formatting may serialise tool state, and guardrails may rewrite the prompt before it ever leaves your service. Give that preparation work its own span and the actual model call finally gets a clean witness note.

trace tr_2210
│
├── prompt.build                 410 ms
│   ├── load.system_prompt        12 ms
│   ├── attach.history            48 ms
│   └── inject.retrieved_docs    310 ms
│
└── llm.generate               2,180 ms
    ├── first_token_ms          740 ms
    └── completion_tokens       388

A total of 2.59 seconds no longer reads as a model problem once the case file shows 410 ms vanished into prompt building before the request even left the service. Engineers stop blaming the model unfairly, and retrieval and guardrails finally appear in the latency budget they have always been quietly consuming.
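The split between prompt.build and llm.generate can be sketched with two sibling timers. This is a minimal sketch, assuming a hand-rolled Trace class instead of a real tracing SDK; the class, its clock parameter, and the fake tick values are hypothetical and exist only to replay the tr_2210 timings deterministically.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (name, duration_ms) for sibling spans in one request."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so timings can be replayed
        self.spans = []

    @contextmanager
    def span(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the wrapped work raises.
            self.spans.append((name, (self.clock() - start) * 1000))

    def breakdown(self):
        """Percentage of total recorded time per span name."""
        total = sum(ms for _, ms in self.spans)
        if total == 0:
            return {}
        return {name: round(100 * ms / total, 1) for name, ms in self.spans}


# Replay the tr_2210 timings with a fake clock (seconds) instead of real work.
ticks = iter([0.0, 0.410, 0.410, 2.590])
trace = Trace(clock=lambda: next(ticks))
with trace.span("prompt.build"):
    pass  # retrieval, history, and formatting would run here
with trace.span("llm.generate"):
    pass  # the provider call would run here

print(trace.breakdown())  # {'prompt.build': 15.8, 'llm.generate': 84.2}
```

In this replay about 16% of the request is spent before the model is ever called, which is the share a single external_call span would have attributed to the model.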

Streaming, retries, and tool requests need events

LLM behaviour is rarely one clean request. Streaming begins before the answer ends, a retry may fire on a rate limit, and the model may ask for two tools mid-run. Each of these should appear as a span event or child span, not as a footnote in the final duration.

Consider a user who feels the answer is instant even though total generation took 4.6 seconds. The first token arrived at 280 ms and streaming carried the perception across the gap. If you only record total latency, you miss that product truth and end up "optimising" something users were already happy with. The opposite case is just as misleading: a first LLM attempt hit a rate limit at 50 ms, the retry succeeded after 2.1 seconds, and the final response is fine, but the user waited longer than the successful call alone explains. When the failed attempt is a timeout or a broken stream rather than an up-front rejection, the input tokens are typically billed again and the cost of the request can double. The witness note must record retry count and cause, or finance sees the bill climb without an explanation.
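Both cases can be sketched in a few lines. This is a rough sketch under stated assumptions: the token stream is any Python iterable, RateLimited is a hypothetical exception standing in for a provider's rate-limit error, and backoff between attempts is left out for brevity.

```python
import time


def stream_with_timing(token_iter, clock=time.perf_counter):
    """Consume a token stream and record first-token and total latency."""
    start = clock()
    first_token_ms = None
    tokens = []
    for token in token_iter:
        if first_token_ms is None:
            first_token_ms = (clock() - start) * 1000
        tokens.append(token)
    total_ms = (clock() - start) * 1000
    return {
        "text": "".join(tokens),
        "first_token_ms": first_token_ms,  # None if the stream was empty
        "total_ms": total_ms,
    }


class RateLimited(Exception):
    """Hypothetical error type standing in for a provider's rate-limit response."""


def call_with_retries(call, max_attempts=3):
    """Run call(); record every failed attempt as an event on the span."""
    events = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = call()
        except RateLimited as exc:
            events.append({"name": "retry", "attempt": attempt, "cause": str(exc)})
            continue
        return {"result": result, "retry_count": len(events), "events": events}
    raise RuntimeError(f"gave up after {max_attempts} attempts: {events}")


# Simulated provider: the first attempt is rate limited, the second streams.
attempts = iter([RateLimited("rate_limit"), ["Refunds ", "are ", "possible."]])


def flaky_call():
    outcome = next(attempts)
    if isinstance(outcome, Exception):
        raise outcome
    return stream_with_timing(iter(outcome))


span = call_with_retries(flaky_call)
print(span["retry_count"], span["events"][0]["cause"], span["result"]["text"])
# 1 rate_limit Refunds are possible.
```

Keeping the retry events on the same record as first_token_ms and total_ms means one span answers both questions: how the request felt to the user and why it cost what it did.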

A good LLM span schema avoids future pain

Pick a stable schema early and use consistent field names: model_name, never model in one service and deployment in another; prompt_version, never template_ver in one service; prompt_tokens and completion_tokens, never ten near-duplicates. When the schema drifts, dashboards break, investigations slow down, and cross-team queries become brittle. The case board depends on consistent evidence tags, and observability turns out to be partly a naming discipline.
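One way to enforce the naming rule is to normalise aliases at ingestion time. This is a minimal sketch, assuming spans arrive as dictionaries; the ALIASES table covers only the drifted names mentioned above, and the normalise helper is hypothetical.

```python
# Drifted name -> canonical name. Extend this as new aliases are discovered.
ALIASES = {
    "model": "model_name",
    "deployment": "model_name",
    "template_ver": "prompt_version",
}


def normalise(raw: dict) -> dict:
    """Rename known aliases to canonical names; reject conflicting duplicates."""
    out = {}
    for key, value in raw.items():
        canonical = ALIASES.get(key, key)
        if canonical in out and out[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        out[canonical] = value
    return out


print(normalise({"deployment": "mini-answer-v1", "template_ver": "faq-v9", "prompt_tokens": 1200}))
# {'model_name': 'mini-answer-v1', 'prompt_version': 'faq-v9', 'prompt_tokens': 1200}
```

Raising on conflicting values, instead of silently keeping one, surfaces the case where a service sends both model and deployment with different contents.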

LLM-trace patterns across observability tools

Notion AI — ML platform engineer: tags every answer span with prompt_version, workspace_id, and retrieved_context_tokens to debug stale summaries.
Perplexity — search product engineer: compares first-token latency and final latency separately for citation-heavy answers.
Khanmigo — education platform engineer: traces refusal flags, tool-call counts, and completion tokens for math tutor sessions.
Harvey legal assistant — reliability engineer: inspects stop reasons and JSON-parse success on structured drafting flows.
Replit Agent — infra engineer: records retry_count, tool_request_count, and output_tokens when code generation sessions get expensive.
LangSmith — trace UI built around LLM spans; every run displays prompt, completion, token counts, and cost in one collapsible view so a reviewer can read the witness note without leaving the browser.
LangFuse — open-source LLM observability; stores prompt-version, model, and latency as first-class span attributes and lets you group runs by prompt revision to spot regressions.
Arize Phoenix — combines trace capture with eval scoring on the same span, so a failed answer shows latency, tokens, and rubric score side by side.
Helicone — proxy-based LLM analytics that captures raw request and response bodies, exposes token-by-token streaming timing, and rolls cost up per tenant.
Comet Opik — eval + trace platform aimed at LLM apps; surfaces stop reasons and refusal flags as filterable facets.
Honeycomb for LLM observability — uses wide events to carry prompt version, model, token counts, and latency on one row; lets you BubbleUp the prompt version behind a latency spike in seconds.
Datadog LLM Observability — bolts LLM-specific facets onto APM traces: prompt template, model, tokens, cost, and refusal class, joined to the rest of your service map.
Splunk for AI traces — ingests OTel-shaped LLM spans alongside ordinary application logs; the value is correlating model behaviour with infrastructure events in one query.
OpenTelemetry GenAI semantic conventions: define standard attribute names (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens) so spans are portable across vendors, ending the model vs deployment naming wars.
OTel Collector LLM receivers — accept vendor SDK telemetry and normalise it into the GenAI conventions before forwarding, so a polyglot stack still produces consistent evidence tags.
AWS Bedrock + CloudWatch — Bedrock emits invocation logs with model id, input/output token counts, and guardrail trace, which become CloudWatch metrics you can alert on.
Azure OpenAI logging — diagnostic settings stream prompt, completion, and token usage to Log Analytics; the schema is fixed, which is exactly why dashboards survive code refactors.
GCP Cloud Logging for Vertex AI — captures prediction requests with model version and token counts; pairs with Cloud Trace so prompt-build latency and model latency live in the same trace tree.
Anthropic Console trace viewer — shows tool use, stop reasons, and per-message token usage for Claude apps; the canonical place to inspect a single agent run end to end.
OpenAI usage dashboard — rolls token spend per API key and model; coarse, but it is the cross-check that catches when your own span schema undercounts cost.
Vercel AI SDK traces — emits OTel spans for streamText and generateObject calls with first-token latency baked in, so streaming-perception data is captured without custom instrumentation.
LlamaIndex Observability — span instrumentation for retrieval and synthesis layers; lets you separate retrieval token cost from generation token cost on RAG pipelines.
LangGraph trace UI — visualises graph node transitions with the LLM span fields attached to each node, useful when an agent's failure is a routing decision rather than a model call.
BAML observability — captures structured-output schema, parse success, and retry count on every call, because BAML's whole pitch is that JSON parse failure is a first-class span field.
Pydantic Logfire (with Pydantic AI): Python-native LLM tracing with token counts, model id, and structured-output validation results on each span; the Pydantic types double as the schema.
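To make the portability point concrete, here is a small sketch that maps this chapter's internal field names onto GenAI-convention attribute names before export. It assumes the attribute names listed above plus gen_ai.usage.output_tokens; the conventions are still evolving, so check the version you target. The app.llm. prefix for unmapped fields and the to_otel_attributes helper are hypothetical.

```python
# Internal field name -> OpenTelemetry GenAI attribute name.
GEN_AI_ATTRS = {
    "provider": "gen_ai.system",
    "model_name": "gen_ai.request.model",
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "completion_tokens": "gen_ai.usage.output_tokens",
}


def to_otel_attributes(span: dict, prefix: str = "app.llm.") -> dict:
    """Map known fields to gen_ai.* names; namespace everything else."""
    return {GEN_AI_ATTRS.get(key, prefix + key): value for key, value in span.items()}


print(to_otel_attributes({"model_name": "mini-answer-v1", "prompt_tokens": 1200, "prompt_version": "faq-v9"}))
# {'gen_ai.request.model': 'mini-answer-v1', 'gen_ai.usage.input_tokens': 1200, 'app.llm.prompt_version': 'faq-v9'}
```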

Recall: can you reconstruct an LLM span cold?

Why is a generic external_call span too weak for LLM debugging?
Which LLM span fields explain latency and cost most directly?
Why should prompt building have its own span?
What product truth does first-token latency reveal that total latency hides?

Interview Q&A

Q: Why trace prompt assembly separately from model generation?
A: Because retrieval injection, history assembly, and guardrail rewriting can add significant delay that would otherwise be blamed on the model.
Common wrong answer to avoid: "Because vendors bill prompt assembly separately."

Q: Why collect token counts on spans and not just on billing reports?
A: Span-level token counts let you connect cost to exact prompts, tenants, failure paths, and latency outliers.
Common wrong answer to avoid: "Because token counts are only useful for finance teams."

Q: Why is first-token latency often more important than total latency for chat UX?
A: Users feel responsiveness when the stream starts, even if the full completion takes longer.
Common wrong answer to avoid: "Because first-token latency equals model compute time."

Q: Why should retry count live on the LLM span?
A: Retries change both user latency and cost, so hiding them outside the span obscures the true request behavior.
Common wrong answer to avoid: "Retries are a transport detail and never matter to product analysis."

Apply now (10 min)

Step 1: model the exercise. Here is the span schema I would draft for an llm.generate call on a refund-policy assistant. It covers the four buckets, plus retry and tenancy fields.

Identity: provider, model_name, model_version, region, prompt_version, experiment_bucket.
Request: prompt_tokens, retrieved_context_tokens, system_message_tokens, tool_result_tokens.
Response: completion_tokens, stop_reason, refusal_flag, json_parse_success, tool_call_count, first_token_ms, total_ms.
Money: input_cost_usd, output_cost_usd, total_cost_usd.
Retries: retry_count, retry_cause.
Tenancy: tenant_id, user_segment, eval_bucket.

That is one span; every dashboard in the rest of the module reads from it.

Step 2 — your turn. Take one LLM call in your own product. Write out the same four-bucket schema for it — identity, request size, response behaviour, money. Add two product-specific fields, name the case file they feed, and mark which fields are missing from your current tracing.

Step 3 — reproduce from memory. Close this file. On a blank page, draw the split between prompt.build and llm.generate, label where the witness note ends and where the evidence tags live, and write one sentence on how the crime statistics roll up from these span fields into the dashboards a finance reviewer reads. If you can sketch this cold, including the streaming first-token branch, you carry the chapter.

What you should remember

This chapter explained why a generic external_call span is too shallow once a model call enters your stack. The four-bucket schema — identity, request size, response behaviour, money — is the difference between a witness note that lets a reviewer reproduce the failure and a green tick that hides the actual cost movement. Cost, latency, and correctness all leak through different fields; flattening them into one duration loses all three.

You learned to split prompt assembly from generation, to record first-token latency separately from total latency, and to put retry count and cause on the span where they belong. Each of these fixes one of the failures described earlier: engineers blaming the model for time it never spent, finance seeing mystery cost, product missing the streaming-perception win. The fix is naming discipline, not new infrastructure: stable field names that survive dashboards, deploys, and team handoffs.

Carry this diagnostic forward: when an LLM call looks slow or expensive, do not start with the model. Open the case file and ask which span fields are missing. If first_token_ms, prompt_tokens, retry_count, or stop_reason is absent, the question itself is unanswerable until the span gets richer.
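That first diagnostic step can be sketched as a check that runs before any investigation. This is a minimal sketch, assuming spans are dictionaries; the DIAGNOSTIC_FIELDS tuple mirrors the four fields named above, and the missing_fields helper is hypothetical.

```python
DIAGNOSTIC_FIELDS = ("first_token_ms", "prompt_tokens", "retry_count", "stop_reason")


def missing_fields(span: dict, required=DIAGNOSTIC_FIELDS) -> list:
    """Return required fields that are absent or None, in declaration order."""
    return [name for name in required if span.get(name) is None]


span = {"model_name": "premium-answer-v3", "prompt_tokens": 1200, "total_ms": 2460, "retry_count": 0}
print(missing_fields(span))  # ['first_token_ms', 'stop_reason']
```

The check compares against None rather than truthiness, so a legitimate retry_count of 0 counts as recorded.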

Remember:

An LLM span without prompt_tokens, completion_tokens, first_token_ms, and model_name cannot be used to debug latency, cost, or correctness. Fix the witness note before fixing the model.
Prompt assembly deserves its own span; mixing it with generation hides retrieval and guardrail latency inside a number the model gets blamed for.
First-token latency is the chat-UX truth; total latency, read together with token counts, is the capacity and finance truth. Capture both.
Retries and tool requests belong as span events or child spans, not as silent contributors to the bill.
Stable field names across services are not cosmetic — the case board breaks the moment one service writes model and another writes deployment.

Bridge. Fine. We can read a trace and we know what an LLM span should carry. But reading is not solving. To debug, we must re-run the failure in a controlled scene — capture the inputs, the seed, the model version, the time-of-day variance. Reproduction comes next. → 05-reproducing-the-failure.md