15. Debugging tools workflow: LangSmith, Phoenix, Braintrust in practice

~13 min read. The right detective desk does not solve the case for you, but the wrong one makes the lineup unworkable. Tools matter; selection logic matters more.

Built on the ELI5 in 00-eli5.md. The lineup — walking the suspects (prompt, tool, loop, memory, model) one at a time — is the work. The tools in this chapter are what let you do that work in minutes rather than days. Pick the desk that supports the lineup, not the desk with the prettiest demo.

Start from the job, not the brand

Teams typically ask "should we use LangSmith?" or "should we use Phoenix?" That is slightly backwards. First ask what problem you must solve — quick LLM trace visualisation, prompt playgrounds, dataset and eval links, full OpenTelemetry integration, or private deployment control. The answer changes the tool choice, and the tool choice changes how the lineup runs.

need
│
├── fast LLM tracing + prompt inspection ──→ LangSmith often fits
├── open evaluation + trace analysis ─────→ Phoenix often fits
├── company-wide telemetry standard ──────→ custom or OTel-first stack
└── deep product-specific workflows ──────→ custom overlays anyway

The job decides the desk. Choose the case board that matches your operational maturity, not the one with the nicest demo.
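The decision tree above can be sketched as a small function. This is only an illustration: the `Needs` flags and the `suggest_desks` name are hypothetical, and the returned strings simply repeat the branches of the tree rather than giving a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical flags a team fills in before looking at vendors."""
    fast_llm_tracing: bool = False            # quick traces + prompt inspection
    open_eval_analysis: bool = False          # evals and trace analysis together
    org_wide_telemetry: bool = False          # company-wide telemetry standard
    product_specific_workflows: bool = False  # deep, product-specific views


def suggest_desks(needs: Needs) -> list[str]:
    """Return every branch of the tree that the stated needs trigger."""
    suggestions = []
    if needs.fast_llm_tracing:
        suggestions.append("LangSmith often fits")
    if needs.open_eval_analysis:
        suggestions.append("Phoenix often fits")
    if needs.org_wide_telemetry:
        suggestions.append("custom or OTel-first stack")
    if needs.product_specific_workflows:
        suggestions.append("custom overlays anyway")
    # No flag set means the job has not been stated yet.
    return suggestions or ["clarify the job before picking a tool"]
```

More than one branch can fire at once, which is an early hint that a single tool may not cover every need.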

Where LangSmith helps

LangSmith is strong when your stack already uses LangChain or adjacent patterns. It makes LLM traces visible quickly. Prompt inspection is easy. Nested runs are readable. Dataset and eval workflows sit in the same product, next to the traces. Together, that reduces time-to-first-observability.

For a fast-moving product team, this is valuable. You instrument chains. You inspect runs. You compare prompts. You share a trace URL with support or PMs. The case file becomes accessible to non-infra people.

Now what is the tradeoff? You may adopt a vendor-specific model of runs and datasets. Cross-stack telemetry may still need another home. If your company already standardizes on OpenTelemetry, you may need bridging work. That is normal. LangSmith is not wrong. It is optimized for debugging LLM applications, not for org-wide telemetry.

Where Phoenix helps

Arize Phoenix is strong for open, evaluation-friendly analysis. Teams like it for inspecting traces, prompts, retrieval quality, and experiments. It often fits well when you want observability tied to evaluation and dataset review. The interface encourages quality analysis, not just uptime.

This matters for LLM products. A request can be technically successful but semantically bad. Phoenix-style workflows help teams inspect output quality beside trace context. That makes the case board closer to both engineering and ML evaluation.
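A minimal sketch of the "technically successful but semantically bad" case, using only the standard library. The `TraceRecord` fields, the `silent_failures` helper, and the 0.7 threshold are assumptions for illustration; they are not Phoenix's data model or API.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    trace_id: str
    status: str        # transport-level outcome, e.g. "ok" or "error"
    eval_score: float  # quality score in [0, 1] from some eval (assumed scale)


def silent_failures(records, min_score=0.7):
    """Return IDs of traces that succeeded technically but scored below min_score."""
    return [r.trace_id for r in records if r.status == "ok" and r.eval_score < min_score]


records = [
    TraceRecord("t1", "ok", 0.92),
    TraceRecord("t2", "ok", 0.31),    # request succeeded, answer was poor
    TraceRecord("t3", "error", 0.0),  # already visible to uptime alerting
]
print(silent_failures(records))  # ['t2']
```

Only `t2` is returned: uptime monitoring would never flag it, which is why the eval score has to sit beside the trace.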

Tradeoff? You still need discipline around instrumentation and metadata. No tool can infer good evidence tags from chaos. Also, if you need broad org-wide logs, metrics, infra traces, and compliance controls, one specialized tool may not be enough.

When custom tracing wins

Custom does not mean starting from nothing. Usually it means building on OpenTelemetry, Datadog, Honeycomb, Grafana, or your internal platform. Why go custom? Because your AI system may cross many non-LLM services. You may need one standard case file for web, queues, tools, billing, search, and models. You may also need strict data residency or redaction rules.

Custom wins when the company already has telemetry maturity. It wins when you need full control over sampling. It wins when every service already emits traces in a shared schema. It wins when support workflows, incident tooling, and audits must integrate tightly.

But look. Custom also costs more. Schema design. UI gaps. Training users. Maintenance forever. So do not build custom just to feel sophisticated. Build it when the requirements actually demand it.
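To make "control over sampling" and "redaction rules" concrete, here is a standard-library sketch of what a custom pipeline has to own. The required keys, the email-only redaction, and the function names are all hypothetical; a real stack would use its own schema and a broader redaction policy.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

REQUIRED_KEYS = {"trace_id", "service", "name"}  # hypothetical shared schema


def redact(text: str) -> str:
    """Mask email addresses before the span leaves the service."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic head sampling: every service reaches the same decision per trace."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def prepare_span(span: dict, sample_rate: float = 1.0):
    """Validate, sample, and redact one span; return None if it is sampled out."""
    missing = REQUIRED_KEYS - span.keys()
    if missing:
        raise ValueError(f"span missing required keys: {sorted(missing)}")
    if not keep_trace(span["trace_id"], sample_rate):
        return None
    cleaned = dict(span)  # copy so the caller's span is left untouched
    if "prompt" in cleaned:
        cleaned["prompt"] = redact(cleaned["prompt"])
    return cleaned
```

Hashing the trace ID rather than calling a random number generator keeps a trace either fully kept or fully dropped across services. Every line here is also code someone must maintain, which is the cost the paragraph above warns about.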

Worked example: choosing for a mid-stage AI product

Suppose a startup has one chat assistant. The app uses Python services, a vector DB, and two model providers. Team size is eight engineers. Support wants trace links. ML wants prompt comparison. Infra maturity is medium.

Option one. Adopt LangSmith quickly. Instrument chains in a week. Gain fast run inspection. Ship support links to traces. Later export core metrics to Datadog. This is a good path if speed matters most.

Option two. Adopt Phoenix plus OpenTelemetry. Use it for trace analysis, retrieval inspection, and eval comparisons. Keep broader infra telemetry in existing tools. This is strong if evaluation culture is already central.

Option three. Build directly on OpenTelemetry. Send traces to a general telemetry backend. Add custom pages for prompt details and complaint linking. This is best only if the company already has observability engineers and shared telemetry standards.

There is no single right answer. The senior answer is fit-for-purpose. The case board must help real investigators, not impress a conference audience.

A practical hybrid pattern

Many teams end up hybrid. Specialized LLM tool for rapid prompt and trace debugging. General observability stack for org-wide metrics, logs, and alerting. Maybe even a support console that deep-links into both. This is healthy.

user complaint
    │
    ▼
support console ──→ trace link ──→ LLM tool view
    │                                  │
    └─────────────→ infra dashboard ───┘

See the shape? One complaint slip can open the specialized case file. The same incident can open the general case board for regional impact. You do not need one tool to be perfect at everything.
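The deep-linking in the diagram can be sketched in a few lines. Both base URLs and the query parameter names are placeholders invented for this example; substitute whatever link format your tools actually expose.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URLs, not real endpoints of any product.
LLM_TOOL_BASE = "https://llm-traces.example.com/traces"
INFRA_DASHBOARD_BASE = "https://dashboards.example.com/d/llm-overview"


def complaint_links(trace_id: str, region: str) -> dict:
    """Build the two links a support console would show for one complaint."""
    query = urlencode({"region": region, "trace_id": trace_id})
    return {
        "llm_tool_view": f"{LLM_TOOL_BASE}/{quote(trace_id, safe='')}",
        "infra_dashboard": f"{INFRA_DASHBOARD_BASE}?{query}",
    }


links = complaint_links("tr-123", "eu-west-1")
print(links["llm_tool_view"])
print(links["infra_dashboard"])
```

The only shared piece of state is the trace ID, so the pattern works as long as both tools receive the same identifier.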

Selection checklist

Before choosing, ask six questions. How quickly can we instrument? Can support and product use the trace UI? Can we connect evals to traces? Can we enforce redaction and retention? Can traces span non-LLM services too? What migration pain appears if we outgrow it?

If a tool answers these well, adopt it. If not, do not force it. Tools are multipliers; they do not replace observability thinking.
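One way to keep the checklist honest is to write it down as data and refuse to produce a verdict until every question has an answer. The `review` function and its True/False answer format are assumptions made for this sketch.

```python
CHECKLIST = [
    "How quickly can we instrument?",
    "Can support and product use the trace UI?",
    "Can we connect evals to traces?",
    "Can we enforce redaction and retention?",
    "Can traces span non-LLM services too?",
    "What migration pain appears if we outgrow it?",
]


def review(tool: str, answers: dict) -> str:
    """answers maps each checklist question to True (answered well) or False."""
    unanswered = [q for q in CHECKLIST if q not in answers]
    if unanswered:
        raise ValueError(f"answer every question first; missing {len(unanswered)}")
    gaps = [q for q in CHECKLIST if not answers[q]]
    if not gaps:
        return f"{tool}: adopt"
    return f"{tool}: do not force it; gaps: " + "; ".join(gaps)
```

Calling `review("tool-a", {q: True for q in CHECKLIST})` returns `"tool-a: adopt"`; a single False answer names the gap instead of hiding it behind an average score.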

Debugging-workflow tools across the LLM-ops landscape

Khan Academy Khanmigo — uses an LLM-focused tracing UI to compare tutor prompt versions across lesson flows; the role is making per-lesson regression debuggable without engineer involvement.
Notion AI — pairs a specialised prompt-trace tool with broader Datadog dashboards; the role is the hybrid pattern as a deliberate choice.
Ramp AI assistant — chooses OpenTelemetry-first instrumentation; the role is meeting finance compliance requirements without dual-tool overhead.
Glean — uses Phoenix-style trace inspection for retrieval-failure analysis alongside offline relevance evals; the role is closing the loop from eval failure to trace.
Intercom Fin — shares trace URLs from the AI tooling view while checking company-wide alerting in the main observability stack; the role is the support-operator-facing case board.
LangSmith debug UI — span-tree + prompt-diff + dataset link in one screen; the role is making the lineup walk a one-screen UI.
LangFuse self-hosted — open-source trace + eval + cost; the role is enabling on-prem deployments where vendor SaaS is forbidden.
Arize Phoenix dev workflow — local trace + eval iteration via arize-phoenix Python package; the role is making the case board runnable in a Jupyter notebook before deploying.
Cursor's debug-replay — checkpoints replayed inside the editor with code context; the role is collapsing the bug-to-repro loop into the IDE.
Anthropic console debug mode — workbench-style replay with modified prompts; the role is first-party prompt-isolation without instrumentation.
OpenAI Evals debug mode — verbose run with intermediate output capture; the role is making suspect-elimination scripts portable across model providers.
Promptfoo CI — assertion-driven prompt regression in CI; the role is suspect-1 elimination as a GitHub Actions check.
Braintrust eval debugger — trace + eval + diff in one view; the role is comparing two prompt versions on the same dataset visually.
Pytest plugins for LLM (e.g., pytest-asyncio for streaming tests) — fits LLM tests into existing unit-test culture; the role is lowering the adoption barrier for teams without LLM-ops tooling.
Vellum's prompt-debug tooling — staged prompt rollout with replay; the role is treating prompt changes as deployments.
BAML's playground — typed prompt + tool isolation; the role is shifting suspect-1 elimination to compile time.
Helicone session debug — per-session trace with retry chain; the role is exposing hidden retries that change the case file shape.
Comet Opik dashboards — eval-correlated trace inspection; the role is correlating regression signals with specific traces.
Honeycomb's BubbleUp on LLM spans — anomaly-attribution UI; the role is identifying which span attribute correlates with the regression.
Datadog APM service map for LLM apps — request flow with model/tool/queue services; the role is fitting LLM tracing into existing SRE workflow.
OpenInference + OpenTelemetry GenAI — open trace schema; the role is the common substrate observability vendors build on.
MCP server inspector — protocol-level tool I/O visibility; the role is making MCP tool calls debuggable independent of agent framework.
AgentOps multi-agent trace — per-agent timeline view; the role is exposing handoff bugs in multi-agent runs.
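Several entries above share one idea: run two prompt versions against the same dataset and flag what the new version breaks. The sketch below shows that idea with a stub in place of a real model call. It does not use the API of any tool in the list, and every name in it is hypothetical.

```python
def find_regressions(render_old, render_new, model, dataset, passes):
    """Return inputs that pass with the old prompt but fail with the new one."""
    regressions = []
    for case in dataset:
        old_ok = passes(model(render_old(case["input"])), case["expected"])
        new_ok = passes(model(render_new(case["input"])), case["expected"])
        if old_ok and not new_ok:
            regressions.append(case["input"])
    return regressions


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: echoes the last word in upper case."""
    return prompt.split()[-1].upper()


def old_prompt(text: str) -> str:
    return f"Answer briefly: {text}"


def new_prompt(text: str) -> str:
    return f"{text} Answer briefly"  # reordering changes what the stub echoes


def contains(output: str, expected: str) -> bool:
    return expected in output


dataset = [{"input": "paris", "expected": "PARIS"}]
print(find_regressions(old_prompt, new_prompt, stub_model, dataset, contains))  # ['paris']
```

In CI, the same call wrapped in `assert find_regressions(...) == []` would fail the build whenever a prompt change breaks a previously passing case.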

Recall: choose the desk, walk the lineup

Why should teams start with operational needs instead of vendor names?
What kind of team often benefits most from LangSmith-like tooling?
When does a custom tracing stack become justified?
Why do many mature teams choose a hybrid setup?

Interview Q&A

Q: Why choose a specialized LLM observability tool and not only a general APM stack?
A: Specialized tools expose prompt structure, retrieval context, and output review workflows that general APM stacks usually do not model well.
Common wrong answer to avoid: "Because general APM tools cannot trace HTTP calls."

Q: Why might an enterprise still prefer custom tracing over an LLM-native platform?
A: Enterprises often need shared telemetry standards, privacy controls, and cross-service traces that span far beyond the model layer.
Common wrong answer to avoid: "Because custom tools are always cheaper."

Q: Why is a hybrid observability setup often the practical answer?
A: One tool can be excellent for prompt-level debugging while another handles org-wide dashboards, alerts, and infrastructure context.
Common wrong answer to avoid: "Because teams could not decide, so they duplicated tools accidentally."

Q: Why is fast instrumentation a serious selection criterion for AI teams?
A: LLM products change rapidly, so a tool that delays visibility by months increases incident risk and slows learning loops.
Common wrong answer to avoid: "Because fast instrumentation looks better for demos."

Apply now (10 min)

Step 1 — model the exercise. Here is the three-column selection worksheet I would build for a mid-stage AI product team adopting their first case board:

Need                          │ Nice to have                 │ Must integrate
──────────────────────────────┼──────────────────────────────┼─────────────────────────────────────
span tree for every request   │ prompt playground in same UI │ existing Datadog APM
trace shareable via URL       │ dataset linked to traces     │ existing PagerDuty alerts
eval scores attached to spans │ A/B prompt diff              │ existing OTel collector
PII redaction at ingest       │ per-tenant cost rollup       │ existing AWS Bedrock invocation logs

Verdict: LangSmith or Phoenix covers the first two columns; Datadog/PagerDuty integration covers the third. The honest answer is hybrid: LangSmith for prompt-trace inspection, Datadog for org-wide alerts, with OTel as the bridge.

Step 2 — your turn. Take your product. Fill the same three columns honestly. Then mark whether a specialised tool, a custom stack, or a hybrid seems best. If the columns disagree, the hybrid is forced.
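The "columns disagree, so hybrid is forced" rule is simple enough to write down. In this sketch each worksheet column is labelled with the kind of stack that best satisfies it; the labels `"specialised"` and `"custom"` and the `verdict` function are assumptions for illustration, and assigning a label to a column remains your own judgement.

```python
ALLOWED_FITS = {"specialised", "custom"}


def verdict(column_fit: dict) -> str:
    """column_fit maps each worksheet column to 'specialised' or 'custom'."""
    fits = set(column_fit.values())
    if not fits or fits - ALLOWED_FITS:
        raise ValueError("label every column as 'specialised' or 'custom'")
    # One label across all columns: that stack is enough. Two labels: hybrid.
    return "hybrid" if len(fits) > 1 else next(iter(fits))


# One reading of the Step 1 worksheet, labelled by hand.
step_one = {
    "Need": "specialised",
    "Nice to have": "specialised",
    "Must integrate": "custom",
}
print(verdict(step_one))  # hybrid
```

The function adds no insight of its own; its value is forcing you to label every column before a vendor demo does it for you.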

Step 3 — reproduce from memory. Draw the small hybrid diagram. Label the case board, the case file, and the complaint slip path. Write one line on why tool choice follows the job.

What you should remember

This chapter explained why "which observability tool should we buy?" is the wrong question. The right question is: what does the lineup require of the desk? Different teams answer it differently. A LangChain-first product team often gets the fastest time-to-trace from LangSmith. An OpenTelemetry-standardised enterprise often gets more leverage from a custom stack paired with Phoenix. A multi-agent system often forces the hybrid pattern: one tool for the case file, another for org-wide alerts and the case board.

You also learned that tool selection is not a one-time decision. Operational maturity changes the answer. The fast-moving team starts with the specialised tool that minimises time-to-first-observability and inherits the rest from existing infra. The mature enterprise inverts the priority — telemetry standardisation first, LLM-specific UX second.

Carry this diagnostic forward: when a tool decision feels stuck, write the selection checklist before the vendor demo. The vendor demo is optimised to sell you the demo, not your product's actual workflow. Without the checklist in front of you, the prettiest UI wins.

Remember:

Start from the job, not the brand. Six selection questions before any vendor demo.
LangSmith fits LangChain-shaped stacks fastest; Phoenix fits open-eval and OTel-standardised stacks; custom fits enterprise telemetry mandates.
Hybrid is the common production answer. One tool for the case file, another for the case board.
Adoption velocity matters more than feature completeness in the first six months.
Tools are multipliers, not replacements for observability thinking. Walk the lineup discipline first; pick the desk that supports it.

Bridge. Whatever tool you pick, the case file becomes searchable only when each witness note carries the right evidence tags. A pretty UI cannot rescue a span with no tenant, no model, no prompt version. So next we study tagging for debugging. → 16-span-tagging-for-debugging.md