02. From complaint to trace — the first move when a user reports a bug

~14 min read. Debugging starts the moment a complaint slip can open the exact case file. Not before.

Built on the ELI5 in 00-eli5.md. The complaint slip — a human report of something painful — is useless to a debugger until it points at one exact case file. This chapter is about making that pointer cheap and reliable.


A user says, "Your bot gave the wrong answer." That sentence is emotionally clear and operationally vague. Which session? Which turn? Which model? Which prompt version? Which retrieved documents? Without a link from the complaint slip to one exact trace ID, the support team becomes a detective without a case number — flipping through screenshots, paraphrased timelines, and best-guess timestamps to reconstruct a story that telemetry could have handed them in one click.

complaint only
┌────────────────────────────────┐
│ "Wrong refund answer"          │
│ user: maybe Acme?              │
│ time: sometime this morning    │
│ screenshot: partial            │
└────────────────────────────────┘
      many possible traces

complaint + trace link
┌────────────────────────────────┐
│ ticket_id = cs_441             │
│ session_id = sess_918          │
│ trace_id = tr_9021             │
│ turn_id = 07                   │
└────────────────────────────────┘
        one exact case file

A complaint slip should arrive as a pointer, not a riddle.
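The contrast in the two boxes can be sketched in a few lines of Python. This is a minimal illustration, not a real telemetry query: the TRACES list, the neighbouring trace IDs, the start times, and the five-minute tolerance are all hypothetical values chosen to mirror the diagram.

```python
from datetime import datetime, timedelta

# Hypothetical trace index; only tr_9021 comes from the chapter's example.
TRACES = [
    {"trace_id": "tr_9019", "workspace_id": "acme_eu", "started_at": datetime(2026, 5, 21, 9, 12, 5)},
    {"trace_id": "tr_9020", "workspace_id": "globex", "started_at": datetime(2026, 5, 21, 9, 14, 0)},
    {"trace_id": "tr_9021", "workspace_id": "acme_eu", "started_at": datetime(2026, 5, 21, 9, 14, 33)},
    {"trace_id": "tr_9024", "workspace_id": "acme_eu", "started_at": datetime(2026, 5, 21, 9, 16, 48)},
    {"trace_id": "tr_9102", "workspace_id": "acme_eu", "started_at": datetime(2026, 5, 21, 11, 2, 0)},
]


def candidates_by_time(traces, workspace_id, reported_at, tolerance):
    """Complaint-only path: guess from a workspace and an approximate time."""
    return [
        t["trace_id"]
        for t in traces
        if t["workspace_id"] == workspace_id
        and abs(t["started_at"] - reported_at) <= tolerance
    ]


def lookup_by_trace_id(traces, trace_id):
    """Complaint-plus-link path: an exact match on the stored pointer."""
    return [t for t in traces if t["trace_id"] == trace_id]


guesses = candidates_by_time(
    TRACES, "acme_eu", datetime(2026, 5, 21, 9, 15, 0), timedelta(minutes=5)
)
exact = lookup_by_trace_id(TRACES, "tr_9021")
print(len(guesses), "candidate traces by time;", len(exact), "by trace ID")
```

With these sample rows the time-window search returns three plausible traces, while the stored trace ID returns exactly one.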

What must be linked

At minimum, keep join keys. Conversation ID. Session ID. Message ID or turn ID. User or workspace ID if policy allows. Timestamp. Surface name. These let support find the right case file.

Better still, store the trace ID directly in the support record. When a user clicks thumbs down, attach the current trace ID. When a support ticket opens from chat, persist the session and turn IDs. When an email complaint arrives, map its message to the original interaction record. Now the detective path is short.

Also keep version context. Prompt version. Model deployment. Experiment bucket. If the complaint rate rises after a rollout, these evidence tags matter immediately.
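One way to make the join keys and version context concrete is a small typed record. The sketch below is an assumption about shape, not a prescribed schema: the class name FeedbackRecord, the surface value "chat_widget", and the helper missing_join_keys are hypothetical, and which keys count as required is a choice your team would make.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FeedbackRecord:
    # Join keys: enough to find exactly one interaction.
    ticket_id: str
    session_id: str
    turn_id: str
    trace_id: str
    # Store the workspace only if policy allows it.
    workspace_id: Optional[str]
    # Surface name (hypothetical example value: "chat_widget").
    surface: str
    # Version context, the "evidence tags".
    prompt_version: str
    model_deployment: str
    experiment_bucket: str
    # ISO 8601 timestamp in UTC.
    captured_at: str


# Assumption: these four keys are treated as mandatory at intake.
REQUIRED_KEYS = ("ticket_id", "session_id", "turn_id", "trace_id")


def missing_join_keys(record: FeedbackRecord) -> List[str]:
    """Return the names of required join keys that are empty."""
    data = asdict(record)
    return [key for key in REQUIRED_KEYS if not data[key]]


record = FeedbackRecord(
    ticket_id="cs_441",
    session_id="sess_918",
    turn_id="07",
    trace_id="",  # deliberately empty to show the check
    workspace_id="acme_eu",
    surface="chat_widget",
    prompt_version="refund-v3",
    model_deployment="claude-2025-04-01",
    experiment_bucket="control",
    captured_at="2026-05-21T09:14:33Z",
)
print(missing_join_keys(record))
```

An intake path could refuse or flag any record for which missing_join_keys returns a non-empty list, so the gap is caught when the feedback is filed rather than during triage.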

Worked example: one bad policy answer

Suppose a user in workspace acme_eu clicks "bad answer." They write, "Agent said annual contracts are refundable anytime." The UI stores this:

  • ticket_id = cs_441
  • workspace_id = acme_eu
  • session_id = sess_918
  • turn_id = 07
  • trace_id = tr_9021
  • feedback_label = wrong_policy_answer

Support opens the complaint slip. One click opens trace tr_9021. Inside the case file, they see:

trace tr_9021
├── retrieve.policy_docs      ok   doc_version = v17
├── tool.contract_lookup      ok   contract_type = annual
├── llm.answer                ok   prompt_version = refund-v3
└── output review                  cited clause = stale_refunds.md#2

What is the root cause? The retrieval span used a stale policy page. The tool correctly said the contract was annual. The prompt template was normal. So support can answer with confidence. Engineering can fix sync freshness. Product can quantify all complaints with doc_version=v17. One complaint slip now drives three teams off the same pointer.
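The "quantify all complaints with doc_version=v17" step can be sketched as a join from complaint records to trace tags. Everything besides cs_441 and tr_9021 below is hypothetical sample data, and the in-memory dictionaries stand in for whatever store actually holds tickets and span attributes.

```python
from collections import Counter

# Hypothetical complaint records; only cs_441 / tr_9021 come from the example.
COMPLAINTS = [
    {"ticket_id": "cs_441", "trace_id": "tr_9021"},
    {"ticket_id": "cs_442", "trace_id": "tr_9050"},
    {"ticket_id": "cs_443", "trace_id": "tr_9077"},
    {"ticket_id": "cs_444", "trace_id": "tr_9999"},  # trace no longer retained
]

# Hypothetical tags read from each trace's retrieval span.
TRACE_TAGS = {
    "tr_9021": {"doc_version": "v17"},
    "tr_9050": {"doc_version": "v17"},
    "tr_9077": {"doc_version": "v18"},
}


def complaints_by_doc_version(complaints, trace_tags):
    """Count complaints per retrieved doc_version via the stored trace_id."""
    counts = Counter()
    for complaint in complaints:
        tags = trace_tags.get(complaint["trace_id"])
        if tags is None:
            # Keep unlinked complaints visible instead of dropping them.
            counts["<trace missing>"] += 1
        else:
            counts[tags["doc_version"]] += 1
    return counts


print(complaints_by_doc_version(COMPLAINTS, TRACE_TAGS))
```

Counting the missing traces as their own bucket is a deliberate choice in this sketch: it shows how much of the complaint volume cannot be attributed at all.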

A mistake even senior teams make: trying to reconstruct complaints later from screenshots. This is painful. Clocks drift. Users paraphrase. Sessions overlap. Logs expire. Do not do this.

Instead, attach the trace link at interaction time. When the answer renders, store the trace ID beside the message. When the user copies the answer, keep the turn ID. When they thumbs-down, snapshot the relevant metadata. This makes the case board accountable to real user pain.
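A minimal sketch of interaction-time capture, assuming an in-memory store: the class InteractionLog and its method names are hypothetical, and a real product would persist these rows in its own database. The point is only the ordering: the trace ID is written when the answer renders, so the thumbs-down handler never has to search for it.

```python
from typing import Dict, List, Tuple


class InteractionLog:
    """Hypothetical store that keeps the trace ID beside each rendered message."""

    def __init__(self) -> None:
        self._messages: Dict[Tuple[str, str], dict] = {}
        self.feedback: List[dict] = []

    def on_answer_rendered(self, session_id: str, turn_id: str,
                           trace_id: str, tags: dict) -> None:
        # Written at render time, before anyone has complained.
        self._messages[(session_id, turn_id)] = {
            "trace_id": trace_id,
            "tags": dict(tags),
        }

    def on_thumbs_down(self, ticket_id: str, session_id: str,
                       turn_id: str, complaint_text: str) -> dict:
        # Raises KeyError if the turn was never recorded, which surfaces
        # a capture gap instead of silently guessing.
        message = self._messages[(session_id, turn_id)]
        record = {
            "ticket_id": ticket_id,
            "session_id": session_id,
            "turn_id": turn_id,
            "trace_id": message["trace_id"],
            "complaint_text": complaint_text,
            **message["tags"],  # snapshot of the version context
        }
        self.feedback.append(record)
        return record


log = InteractionLog()
log.on_answer_rendered("sess_918", "07", "tr_9021", {"prompt_version": "refund-v3"})
ticket = log.on_thumbs_down(
    "cs_441", "sess_918", "07",
    "Agent said annual contracts are refundable anytime.",
)
print(ticket["trace_id"], ticket["prompt_version"])
```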

Another good pattern. Let support search by natural business keys. Order ID. Workspace ID. Conversation URL. Then internally resolve to trace IDs. Support should not memorize telemetry schemas. The system should do that mapping.
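The business-key lookup can be sketched with an in-memory SQLite table. The table name interactions, its columns, the order IDs, and the example.com URLs are all assumptions made for illustration; the allowlist is there because the column name is interpolated into the SQL string.

```python
import sqlite3

# Business keys a support agent is allowed to search by (assumed set).
ALLOWED_KEYS = {"workspace_id", "order_id", "conversation_url"}


def build_demo_db() -> sqlite3.Connection:
    """Create a hypothetical interactions table with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE interactions ("
        " trace_id TEXT PRIMARY KEY,"
        " workspace_id TEXT,"
        " order_id TEXT,"
        " conversation_url TEXT)"
    )
    conn.executemany(
        "INSERT INTO interactions VALUES (?, ?, ?, ?)",
        [
            ("tr_9019", "acme_eu", "ord_12", "https://app.example.com/c/sess_901"),
            ("tr_9021", "acme_eu", "ord_77", "https://app.example.com/c/sess_918"),
            ("tr_9300", "globex", "ord_90", "https://app.example.com/c/sess_950"),
        ],
    )
    return conn


def resolve_trace_ids(conn: sqlite3.Connection, key: str, value: str) -> list:
    """Translate one business key into the matching trace IDs."""
    if key not in ALLOWED_KEYS:
        raise ValueError(f"unsupported business key: {key!r}")
    # The column name is safe to interpolate only because of the allowlist;
    # the value itself is always passed as a bound parameter.
    rows = conn.execute(
        f"SELECT trace_id FROM interactions WHERE {key} = ? ORDER BY trace_id",
        (value,),
    )
    return [row[0] for row in rows]


conn = build_demo_db()
print(resolve_trace_ids(conn, "order_id", "ord_77"))
print(resolve_trace_ids(conn, "workspace_id", "acme_eu"))
```

A narrow key such as an order ID resolves to one trace, while a broad key such as a workspace returns a short list the agent can narrow further.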

Complaint-linked traces improve prioritization

What goes wrong if we skip this? Engineers may optimize visible charts while missing user pain. A metric can look healthy in aggregate while one premium customer hits the same failure repeatedly. Complaint-linked traces expose high-value failures early.

They also improve bug triage quality. Instead of "customer said bot was weird," you get "premium workspace, prompt version sales-v4, retrieval index 2025-02-01, tool timeout on CRM lookup." That is a serious bug report. That is action-ready.

And yes, this helps evaluation too. You can sample complaint-linked traces for offline review. Now your labeling set comes from real pain, not random traffic. The humble complaint slip becomes a gold data source.
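Sampling complaint-linked traces for offline review might look like the sketch below. The function name sample_for_review, the per-label cap, the second label, and the sample records are hypothetical; the fixed seed is one possible way to make a review batch reproducible.

```python
import random
from collections import defaultdict


def sample_for_review(complaints, per_label, seed=0):
    """Pick up to per_label distinct traces for each feedback label."""
    by_label = defaultdict(set)
    for complaint in complaints:
        # A set removes duplicates when several tickets share one trace.
        by_label[complaint["feedback_label"]].add(complaint["trace_id"])

    rng = random.Random(seed)  # seeded so the same batch can be redrawn
    sample = {}
    for label in sorted(by_label):
        trace_ids = sorted(by_label[label])
        k = min(per_label, len(trace_ids))
        sample[label] = sorted(rng.sample(trace_ids, k))
    return sample


# Hypothetical complaint records; only tr_9021 comes from the worked example.
COMPLAINT_LOG = [
    {"trace_id": "tr_9021", "feedback_label": "wrong_policy_answer"},
    {"trace_id": "tr_9021", "feedback_label": "wrong_policy_answer"},
    {"trace_id": "tr_9050", "feedback_label": "wrong_policy_answer"},
    {"trace_id": "tr_9077", "feedback_label": "wrong_policy_answer"},
    {"trace_id": "tr_9140", "feedback_label": "tool_timeout"},
]

batch = sample_for_review(COMPLAINT_LOG, per_label=2)
print(batch)
```

Sampling per label rather than across the whole log keeps rare failure types in the review set instead of letting the most common complaint crowd them out.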

Minimum workflow for support teams

Make the workflow dead simple. Feedback event creates a record. Record stores trace ID and key tags. Support UI shows the answer, the complaint text, and a trace deep link. Engineers can pivot from the complaint to all similar traces by tag. That is enough to start strong.

user feedback
    ↓
store complaint record
    ├── ticket_id
    ├── trace_id
    ├── session_id
    ├── feature_name
    └── complaint_label
    ↓
support console ──→ trace detail ──→ similar traces query

This is not flashy. It is deeply practical. It reduces mean time to understanding. It also builds trust with customers because replies become evidence-based.
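The last hop of the workflow, pivoting from one complaint's trace to similar traces by tag, can be sketched as a filter over a tag index. The TRACE_INDEX dictionary and every trace other than tr_9021 are hypothetical; a real deployment would run the equivalent query against its tracing backend.

```python
# Hypothetical tag index keyed by trace ID.
TRACE_INDEX = {
    "tr_9021": {"prompt_version": "refund-v3", "doc_version": "v17"},
    "tr_9050": {"prompt_version": "refund-v3", "doc_version": "v17"},
    "tr_9077": {"prompt_version": "refund-v3", "doc_version": "v18"},
    "tr_9101": {"prompt_version": "refund-v2", "doc_version": "v17"},
}


def similar_traces(trace_index, seed_trace_id, tag_names):
    """Return other traces whose values match the seed trace on every named tag."""
    seed_tags = trace_index[seed_trace_id]
    return sorted(
        trace_id
        for trace_id, tags in trace_index.items()
        if trace_id != seed_trace_id
        and all(tags.get(name) == seed_tags.get(name) for name in tag_names)
    )


# Widen or narrow the pivot by choosing which tags must match.
print(similar_traces(TRACE_INDEX, "tr_9021", ["doc_version"]))
print(similar_traces(TRACE_INDEX, "tr_9021", ["doc_version", "prompt_version"]))
```

Matching on fewer tags casts a wider net; adding tags narrows the result to traces that share the full suspected cause.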


From complaint to trace in shipped products

  • Notion AI — support lead: opens a thumbs-down event and jumps straight to the retrieval span that pulled stale workspace content.
  • Intercom Fin — customer support engineer: links each inbox-copilot complaint to the exact conversation trace and prompt version.
  • GitHub Copilot Enterprise — success engineer: maps an IDE feedback event to the specific completion trace and extension version.
  • Klarna assistant — operations manager: connects a refund complaint ticket to the policy-answer trace for that user session.
  • Shopify Sidekick — product support engineer: searches by store_id and order_id to resolve one merchant's complaint into a trace-backed root cause.
  • Anthropic request ID pattern: every API response carries a request-id header that customers paste into support tickets, so Anthropic's on-call can fetch the exact server-side trace without any backchannel coordination.
  • OpenAI response.id field: the response object's id is logged by SDK wrappers (LangChain, LlamaIndex, Vercel AI SDK) so a screenshot containing the ID is enough to reopen the case file.
  • OpenTelemetry GenAI semantic conventions: standardize attribute names (gen_ai.system, gen_ai.request.model, gen_ai.response.id) so the same complaint-to-trace mapping works across vendors.
  • Zendesk AI agents: every assistant turn writes the trace_id into the ticket metadata, so when a customer replies "this answer was wrong" three days later, the ticket already points at the exact case file.
  • LangSmith user-feedback API: client.create_feedback(run_id, key, score=...) attaches user thumbs-down to a specific run, turning the dashboard into a queue of complaint-linked traces.
  • LangFuse score endpoint: scores carry both trace_id and observation_id, so feedback can point at one span inside a trace, not only the whole conversation.
  • Arize Phoenix annotations: human review annotations are stored with the trace, so the next reviewer sees both the original failure and the previous diagnosis.
  • Helicone request IDs: every proxy hop returns a helicone-id that the product can store in the support row, even when the underlying model SDK does not surface its own ID cleanly.
  • Sentry session replay for AI apps: replay events carry the trace ID, so reproducing the exact UI state that produced the bad answer becomes a one-click navigation.
  • Datadog LLM Observability — error from logs: the trace explorer accepts a free-form support-ticket link as a search filter, mapping ticket→trace inside the UI.
  • Grafana Tempo: stores traces long enough that an enterprise complaint filed two weeks later can still be opened; retention is a debug feature, not a cost line.
  • Stripe Docs AI: every answer in the assistant footer carries a copyable "request reference" string; customers who pasted it into bug reports cut triage time roughly in half.
  • PagerDuty incident triage workflow: the first action on a customer-impact alert is "attach the trace link," before any human discussion of root cause.
  • Linear bug template for AI features: the bug-report form has a mandatory trace_id field, refusing to submit without one, which forces the join at intake.
  • Vercel AI SDK telemetry hooks: experimental_telemetry writes a traceId to OpenTelemetry, so a Next.js app's customer support can pivot from server logs to chat-completion span without writing custom plumbing.
  • AWS Bedrock CloudWatch query: customers query aws-customer-message-id to find the exact invocation trace from a downstream complaint, even when the complaint came in through a third-party support tool.
  • GCP Cloud Logging trace correlation: when the Vertex AI logger and the chat-app logger share a trace field, one query joins user feedback to model invocation without manual reconciliation.
  • Anthropic console trace viewer: displays the same request from the model-vendor side, so when a customer escalates and shares their request_id, the engineer can compare client-side trace with model-side trace to localize the bug.

Recall — turn a complaint into a pointer

  • Why is a vague complaint operationally weak without join keys?
  • Which identifiers should usually connect complaints to traces?
  • In the worked example, what actual component caused the bad refund answer?
  • Why should trace links be stored at interaction time, not reconstructed later?

Interview Q&A

Q: Why store trace IDs directly on feedback events and not reconstruct them later from timestamps?
A: Reconstruction is error-prone in distributed systems with overlapping sessions, retries, and clock drift, while a direct link preserves exact evidence.
Common wrong answer to avoid: "Because timestamps are too expensive to store."

Q: Why is complaint-linked observability better than watching aggregate metrics alone?
A: Aggregate metrics show population patterns, but complaint-linked traces show concrete user harm and exact causal paths.
Common wrong answer to avoid: "Because complaints are always more important than metrics."

Q: Why should support consoles resolve business keys to traces instead of exposing raw telemetry IDs only?
A: Support teams work in customer language like workspace or order IDs, so the system must translate operationally.
Common wrong answer to avoid: "Because support teams cannot learn technical tools."

Q: Why did the complaint example point to retrieval freshness rather than the contract lookup tool?
A: The trace showed the tool returned the correct contract type, while the retrieved policy document was outdated.
Common wrong answer to avoid: "Any policy complaint is caused by the tool because tools touch business data."


Apply now (10 min)

Step 1 — model the exercise. Here is the feedback record I would design for the acme_eu worked example, before I write any UI code:

{
  "ticket_id": "cs_441",
  "complaint_text": "Agent said annual contracts are refundable anytime.",
  "workspace_id": "acme_eu",
  "session_id": "sess_918",
  "turn_id": "07",
  "trace_id": "tr_9021",
  "feedback_label": "wrong_policy_answer",
  "prompt_version": "refund-v3",
  "model_deployment": "claude-2025-04-01",
  "experiment_bucket": "control",
  "captured_at": "2026-05-21T09:14:33Z"
}

Five join keys (ticket, workspace, session, turn, trace), three evidence tags (prompt_version, model_deployment, experiment_bucket), one label, the complaint text, and one timestamp: eleven fields in all. Support resolves "customer email → ticket → trace" by joining their CRM on workspace_id and pivoting; no engineer is paged for the lookup.

Step 2 — your turn. Design one feedback record schema for your own product. Include complaint text, session key, turn key, trace ID, and one business identifier (order_id, repo_id, store_id, whichever your domain uses). Then write the exact query a support agent would run from "customer email" to that record. If the query needs more than two joins, your schema is missing a key.

Step 3 — reproduce from memory. Draw the flow from complaint slip to case file with the support console in the middle. Label every join key on the arrows. Add one sentence on why storing the link at interaction time — not at escalation time — reduces guessing, and connect it back to the eight-class taxonomy from the previous chapter: which failure types are impossible to triage without a pre-stored trace link, and why?

What you should remember

This chapter explained why a complaint slip is operationally useless until it carries a trace pointer, and why that pointer has to be stored at the moment of interaction rather than reconstructed later from screenshots and clocks. The opening failure — "your bot gave the wrong answer" — sits unresolvable in a support queue until five join keys (ticket, workspace, session, turn, trace) plus a handful of evidence tags (prompt version, model deployment, experiment bucket) turn that vague pain into a one-click path to a specific case file.

The worked example showed the payoff: the acme_eu refund complaint resolved to one trace, the trace pointed at one stale retrieval span, and three different teams (support, engineering, product) could act on the same pointer instead of inventing their own narratives. Without the link, that same complaint produces a Slack debate. With it, the suspect layer is named in minutes.

Carry this diagnostic forward: when a support process feels slow, ask one question — "is the trace ID stored on the feedback event, or is someone reconstructing it from a screenshot?" If reconstruction is happening, the bug is not in the model or the prompt; it is in the intake schema. Fix the schema first, then debug the agent. Reconstruction tax compounds: every unlinked complaint costs time today and silently destroys the labeling set you will want tomorrow when building regression evals.

Remember:

  • The complaint slip must arrive as a pointer, not a riddle. Store the trace ID at interaction time, never after.
  • Five join keys (ticket, workspace, session, turn, trace) plus three evidence tags (prompt, model, experiment) are the minimum viable schema.
  • Reconstruction from timestamps is a tax that compounds. Clocks drift, sessions overlap, logs expire.
  • Complaint-linked traces are your highest-signal labeling set — they come from real pain, not random traffic.
  • Support should search by business keys (order, workspace, email); the system resolves to telemetry IDs internally.

Bridge. The complaint now points at one trace ID. Good. But a trace is a tree of spans — and spans are not all equal. To debug fast, we must read the case file the way a detective reads a witness statement: spine first, branches second, gaps third. → 03-reading-a-trace.md