12. Tooling landscape — what each prompt ops tool actually solves

~14 min read. Many tools, four real jobs. Knowing which tool does which job — and which gaps you must fill yourself — is the difference between a stack that runs and a stack that drowns.

Builds on 11-prompt-incidents-and-rollback.md. The previous eleven chapters described the practices. This chapter is the map of who sells what.


1) Hook — three companies, three stacks

A pre-Series-A startup with one AI engineer runs prompts as YAML files in a git repo, evaluated by Promptfoo in GitHub Actions, observed by Helicone, with feature flags via Statsig. Total monthly tooling cost — $80. The setup works because the team is small and disciplined.

A Series-B SaaS company with five AI engineers and 200 enterprise customers runs prompts in Langfuse (self-hosted), evaluated by Braintrust, observed by Langfuse + Datadog, feature-flagged by LaunchDarkly. Total monthly tooling cost — $4500. The setup works because the team needs multi-tenant prompt management, audit trails, and an eval surface non-engineers can use.

A late-stage company with 30 AI engineers and 5000 customers runs prompts in a homegrown registry on top of Postgres + S3, evaluated by a custom eval runner, observed by Langfuse self-hosted, with feature flags built on top of LaunchDarkly. Tooling cost — $25K/month plus two engineer-years to build and maintain. The setup works because the team has scale-specific needs that off-the-shelf tools cannot meet.

All three stacks are correct for their stage. The wrong stack at any stage is what bleeds money and slows the team. This chapter is how to pick the right stack for where you are.


2) The metaphor — four jobs, many vendors

THE FOUR JOBS OF PROMPT OPS
───────────────────────────
1. REGISTRY      versioned storage of prompts, with SHAs and audit
2. EVAL          run a suite, score outputs, gate ship/no-ship
3. OBSERVABILITY traces, prompt SHA in spans, cost and latency dashboards
4. ROLLOUT       feature flags, ramp percentages, kill switches

Some vendors do one job well. Some attempt all four and do most of them well enough. Some do two jobs and leave the others to integrators. Knowing which vendor maps to which jobs lets you compose a stack instead of buying a suite.

A common mistake is buying one tool to do all four jobs when you only have budget or maturity for one. The other three slots become "we will figure it out later," and "later" is usually after the next incident.


3) The category-by-vendor map

                    REGISTRY  EVAL  OBSERVE  ROLLOUT
Langfuse              ✓✓✓     ✓✓    ✓✓✓       ─
LangSmith             ✓✓      ✓✓    ✓✓✓       ─
Helicone              ✓✓      ✓     ✓✓✓       ─
Braintrust            ✓       ✓✓✓   ✓         ─
Pezzo                 ✓✓✓     ✓✓    ✓✓        ─
PromptLayer           ✓✓✓     ✓✓    ✓✓        ─
Vellum                ✓✓✓     ✓✓    ✓         ─
Phoenix (Arize)       ✓       ✓✓    ✓✓✓       ─
Promptfoo             ─       ✓✓✓   ─         ─
DeepEval              ─       ✓✓✓   ─         ─
OpenAI Evals          ─       ✓✓    ─         ─
Patronus AI           ─       ✓✓✓   ✓         ─
Galileo               ✓       ✓✓✓   ✓✓        ─
LaunchDarkly          ─       ─     ─         ✓✓✓
Statsig               ─       ─     ─         ✓✓✓
Split.io              ─       ─     ─         ✓✓✓
Flagsmith             ─       ─     ─         ✓✓
ConfigCat             ─       ─     ─         ✓✓
GrowthBook            ─       ─     ─         ✓✓

Reading the map: ✓✓✓ marks a primary strength of the tool; ✓✓ marks a strong supporting capability; ✓ means present but thin; ─ means not offered.

Three takeaways from the map. First, no single vendor is ✓✓✓ across all four. The rollout job in particular is universally outsourced to feature-flag vendors. Second, the registry job is dominated by Langfuse, Pezzo, PromptLayer, and Vellum. Third, the eval job has a strong specialist tier (Braintrust, Promptfoo, DeepEval, Patronus, Galileo) that the registry-first tools cannot fully match.


4) Tool-by-tool — what each actually does

Langfuse

Open-core (MIT-licensed core that you can self-host, plus a paid SaaS). Strong on traces, prompt management, and evals as one product. The traces are the killer feature: every LLM call is captured, tagged with the prompt SHA, and shown with cost and latency. The eval suite integrates cleanly with the registry. Self-hosting on a Postgres + ClickHouse stack is the choice for teams that need data residency or want to avoid recurring SaaS spend.

Best for — mid-stage teams that want one tool for registry, observability, and eval. Audience is engineers; non-engineer surfaces are functional but not the strongest.

Weakness — rollout job is not their lane; you bring your own feature flags.

LangSmith

Closed-source SaaS built by LangChain (self-hosting is offered only on the enterprise plan). Strong on chain debugging: if your stack is LangChain or LangGraph, the integration is the best in the market. Tracing is mature. Eval is solid. Prompt management is functional.

Best for — teams deeply invested in LangChain. Less compelling if your codebase does not use LangChain.

Weakness — LangChain-flavored idioms leak into the product; non-LangChain teams find it awkward.

Helicone

Open-source and observability-first. Runs as a proxy in front of your LLM provider: you point your client at Helicone's URL and traces flow automatically. Cost and latency dashboards are best-in-class. Prompt management is present but not the primary focus.

Best for — teams whose first pain point is cost and observability, and who plan to layer registry and eval on top later. Good gateway tool.

Weakness — eval surface is thin; you will pair Helicone with Promptfoo or Braintrust.

Braintrust

Closed SaaS, eval-first. Pairwise judging is excellent. CI integration is mature — eval runs gate PRs cleanly. Strong support for regression testing and human-labeled calibration.

Best for — teams whose eval discipline is the bottleneck. Pairs well with Langfuse or Helicone for observability.

Weakness — registry capabilities are functional but secondary; observability is light.

Pezzo

Open-source registry plus observability. Apache 2.0 licensed. Good for teams that want to self-host without the Langfuse footprint. Smaller community.

Best for — small teams or open-source-first organizations. DIY-friendly.

Weakness — community size and feature pace lag Langfuse; you may be filling gaps.

PromptLayer

Closed SaaS, registry-focused with observability. One of the older players in the space. Good for teams whose AI workflows are primarily prompt-engineering-heavy and less code-heavy.

Best for — teams where PMs and prompt engineers (non-coders) own most prompt iteration.

Weakness — engineering workflows feel less native than Langfuse or LangSmith.

Vellum

Closed SaaS, low-code UI-first prompt management. Strong choice for teams with significant non-engineer prompt editors. Eval and observability surfaces are designed for non-engineers.

Best for — companies where product managers, customer success, and AI designers do most prompt work.

Weakness — engineering integration is functional but less native than code-first tools.

Phoenix (Arize)

Open-source LLM observability, with eval and tracing. Strong on RAG-specific debugging — surfaces retrieval quality, context window utilization, and chain-of-thought traces.

Best for — RAG-heavy stacks where retrieval debugging matters as much as prompt debugging.

Weakness — registry is light; pair with a registry tool.

Promptfoo

Open-source eval framework, declarative YAML config. CLI and CI-friendly. Best for engineers who want eval-as-code with version control.

Best for — teams whose eval philosophy is "tests live next to code." Pairs well with any registry.

Weakness — UI is minimal; non-engineers will not use it.

DeepEval

Open-source Python-first eval framework. RAG, agent, and standard LLM metrics built in. Pytest-style API.

Best for — Python shops that want pytest-style eval in CI. Strong RAG metric library.

Weakness — registry and observability are out of scope; combine with other tools.

OpenAI Evals

Open-source eval framework, originally for OpenAI internal model gating. Flexible but lower-level than Promptfoo or DeepEval.

Best for — teams that want maximum flexibility and are willing to write more glue code.

Weakness — lower-level than commercial alternatives.

Patronus AI

Closed SaaS, eval-focused. Adversarial test generation and red-team evals are differentiators. Strong on RAG faithfulness and agent reliability metrics.

Best for — teams with safety, compliance, or adversarial robustness concerns.

Weakness — pricing aimed at enterprise; smaller teams may find it overkill.

Galileo

Closed SaaS, eval + observability with focus on RAG and agents. Continuous monitoring built in.

Best for — teams that want eval + monitoring + alerting in one product, especially for RAG.

Weakness — registry capabilities are thin.

Feature-flag vendors (LaunchDarkly, Statsig, Split.io, Flagsmith, ConfigCat, GrowthBook)

These are the rollout layer. Flag value points to a prompt SHA. Deterministic bucketing routes users. Kill switches roll back. None of these are AI-aware, but they do not need to be — they treat prompt SHAs like any other config value.

LaunchDarkly — enterprise standard, strong audit, deep integrations. Statsig — experimentation-first, strong on stats and rollouts. Split.io — strong on compliance and audit logging. Flagsmith — open-source option. ConfigCat — simpler, cheaper, common at smaller scale. GrowthBook — open-source with strong experimentation features.

Pick by team familiarity. The choice rarely affects the AI side.
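The rollout mechanics described above (a flag value that points to a prompt SHA, deterministic bucketing, and a kill switch) can be sketched in a few lines. This is a minimal illustration, not any vendor's SDK. The SHAs, the flag name, and the function names are hypothetical, and a real feature-flag service would own the ramp percentage and the kill switch state.

```python
import hashlib

# Hypothetical prompt SHAs; in a real setup these come from your registry.
STABLE_SHA = "a1b2c3d"
CANDIDATE_SHA = "e4f5a6b"
FLAG_KEY = "support-prompt-rollout"  # hypothetical flag name


def bucket(user_id: str, flag_key: str = FLAG_KEY) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def resolve_prompt_sha(user_id: str, ramp_percent: int, kill_switch: bool = False) -> str:
    """Return the prompt SHA this user should receive under the current ramp."""
    if kill_switch:
        # Rollback path: everyone gets the known-good prompt.
        return STABLE_SHA
    if bucket(user_id) < ramp_percent:
        return CANDIDATE_SHA
    return STABLE_SHA
```

Because the bucket depends only on the flag key and the user ID, a user who sees the candidate prompt at a 10% ramp keeps seeing it at 50%, which keeps their experience stable while the ramp widens.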


Mid-content recall

  1. Which four jobs does prompt ops cover, and which one do prompt ops vendors universally outsource?
  2. Why does the open-source nature of Langfuse and Pezzo matter at certain stages?
  3. When is Vellum the better choice than LangSmith?

5) The picking matrix

TEAM STAGE          PRIMARY PAIN         RECOMMENDED STACK
──────────────────  ──────────────────   ─────────────────────────────────
Solo founder        cost visibility       Helicone + Promptfoo + git + ConfigCat
1-3 AI engineers    eval discipline       Langfuse SaaS + Statsig + GitHub
3-10 engineers      multi-tenant prompts  Langfuse self-host + Braintrust + LaunchDarkly
10-30 engineers     audit + compliance    Langfuse self-host + Braintrust + LaunchDarkly + custom RBAC
30+ engineers       scale-specific gaps   Homegrown registry + Langfuse for traces + custom eval
PM-heavy team       non-engineer editors  Vellum or PromptLayer + Statsig
RAG-heavy team      retrieval debugging   Phoenix (Arize) + Promptfoo + Statsig
Compliance-heavy    audit + redteam       Patronus + Langfuse + LaunchDarkly + custom audit

The matrix is a starting point, not a contract. The right stack depends on the team's culture (Python vs TypeScript, open-source-first vs SaaS-first), the customers (regulated vs not), and the AI workload (RAG vs agents vs simple gen).

A useful sanity check — if your monthly tooling cost exceeds 10% of your monthly LLM cost, you are over-tooled for your stage. The most common rebalance is replacing two overlapping SaaS tools with one self-hosted one, or dropping a tool whose job your team has not yet matured into.
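The sanity check above is simple arithmetic, and it can be written as a small helper. The function names and the 0.10 default are just a restatement of the rule of thumb, and the LLM spend figures in the usage note are hypothetical because the chapter does not state them.

```python
def tooling_ratio(monthly_tooling_usd: float, monthly_llm_usd: float) -> float:
    """Tooling spend as a fraction of LLM spend for the same month."""
    if monthly_llm_usd <= 0:
        raise ValueError("monthly LLM spend must be positive")
    return monthly_tooling_usd / monthly_llm_usd


def is_over_tooled(
    monthly_tooling_usd: float, monthly_llm_usd: float, threshold: float = 0.10
) -> bool:
    """Apply the rule of thumb: tooling above 10% of LLM spend is a warning sign."""
    return tooling_ratio(monthly_tooling_usd, monthly_llm_usd) > threshold
```

For example, $4,500 of tooling against a hypothetical $30,000 of LLM spend is a ratio of 0.15 and trips the check, while $80 against a hypothetical $2,000 is 0.04 and does not.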


6) Build vs buy

The build-vs-buy decision recurs at every stage.

BUILD WHEN                              BUY WHEN
──────────────────────────              ──────────────────────────
your scale exceeds tool ceilings        you are still proving the workflow
you have engineers to maintain          ops budget > engineer-time
you have unusual requirements           your needs are standard
data residency or air-gap forces it     SaaS data policies are acceptable
the tool's lock-in is unacceptable      lock-in is a fair trade for speed

Most teams should buy for the first two stages and consider building for stage three onward. Building too early means you spend engineer-years on infrastructure when those engineers could be shipping features. Buying too late means you outgrow the SaaS and find migration painful.

The middle path — build on open-source. Langfuse self-hosted gives you the registry and observability surface for the cost of running a Postgres + ClickHouse stack. Adding a custom eval runner on top is straightforward. This middle path lets you defer the "fully custom" decision until you actually need it.


7) The integration surface

Every prompt ops tool wants to wrap your LLM client. They want to be the SDK you import. The integration is what makes traces flow automatically, what makes prompts resolve at runtime, what makes the eval suite know which prompt to test.

INTEGRATION PATTERNS
────────────────────
1. SDK WRAP        your code imports the tool's SDK; it wraps client calls
2. PROXY           your code talks to the tool's proxy URL; tool talks to the provider
3. HOOK / DECORATOR your code uses decorators or callbacks; tool intercepts
4. OPENTELEMETRY   tool reads OTel spans your code emits; no tool-specific SDK

OpenTelemetry (specifically OTel GenAI conventions) is the cleanest integration — your code emits standard spans, and any tool that consumes OTel can see them. It is also the slowest-moving. Most tools today still want their own SDK or proxy.
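As a rough illustration of what "standard spans" means here, the sketch below builds the attribute dictionary you might attach to a span for one LLM call. The `gen_ai.*` keys follow the OTel GenAI semantic conventions as commonly documented, but those conventions are still evolving, so check the current spec before relying on exact names. The `app.prompt.sha` key and the function name are hypothetical additions for carrying the prompt SHA.

```python
def genai_span_attributes(
    model: str, prompt_sha: str, input_tokens: int, output_tokens: int
) -> dict:
    """Build span attributes for one chat call, keyed by OTel-style names."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom attribute: not part of the OTel conventions.
        "app.prompt.sha": prompt_sha,
    }
```

With the OpenTelemetry SDK installed, each key and value would be passed to `span.set_attribute(key, value)`. The point of the pattern is that nothing in this function names an observability vendor.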

The trade-off — SDK lock-in is real. Migrating off Langfuse means rewriting every LLM call site. Migrating off Helicone (which is a proxy) means changing one URL. Choose your tool's integration mode with eventual exit in mind.
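To make the "changing one URL" claim concrete, here is a minimal sketch of proxy-mode configuration. Both URLs, the environment variable names, and the `/chat/completions` path are placeholders chosen for illustration, not any real vendor's endpoints.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Placeholder URLs for illustration only.
DIRECT_URL = "https://api.provider.example/v1"
PROXY_URL = "https://proxy.tool.example/v1"


@dataclass(frozen=True)
class LLMEndpoint:
    base_url: str
    api_key: str


def load_endpoint(env: Mapping[str, str] = os.environ) -> LLMEndpoint:
    """Read the endpoint from configuration; default to the provider directly."""
    return LLMEndpoint(
        base_url=env.get("LLM_BASE_URL", DIRECT_URL),
        api_key=env.get("LLM_API_KEY", ""),
    )


def chat_url(endpoint: LLMEndpoint) -> str:
    """Every call site builds its URL from the configured base."""
    return f"{endpoint.base_url.rstrip('/')}/chat/completions"
```

Adopting or leaving the proxy is then a change to `LLM_BASE_URL` rather than to call sites, which is the exit property the paragraph above describes.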


8) Failure modes

SIGNAL                                        LIKELY CAUSE                                    FIX
────────────────────────────────────────────  ──────────────────────────────────────────────  ────────────────────────────────────────
Three SaaS tools, overlapping features        Bought without mapping to the four jobs         Audit, pick one tool per job, cancel the rest
Tool changes break every release              Vendor releases unannounced API changes         Pin SDK versions; subscribe to changelog; have a contract for breaking changes
Cannot reproduce prompt outside the tool      Prompt only exists in vendor's storage          Export to git as a backup; treat tool as cache, not source of truth
Tool's eval differs from your interpretation  Vendor's default rubric mismatches your domain  Use custom rubric; do not rely on vendor presets
Costs growing faster than usage               SaaS pricing scales on traces, not users        Sample traces; self-host the heavy ingestion tier
Tool's RBAC is one role for everyone          Tool predates production scale                  Wrap with your own RBAC layer; require approvals outside the tool
Latency added by SDK wrap                     Tool's SDK adds non-trivial inline cost         Batch sends; use async wrappers; consider proxy mode
Migration cost surprises                      SDK-coupled to many call sites                  Wrap LLM calls in your own thin abstraction that calls the tool's SDK

The last row is the cheapest insurance. A single internal abstraction (one function, llm_call(prompt_ref, inputs, model)) is what makes future tool migrations bounded.
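A minimal sketch of that abstraction follows, keeping the `llm_call(prompt_ref, inputs, model)` signature from the text. Everything else is an assumption made for illustration: the `Tracer` protocol, the in-memory tracer, the fake provider, and the `name@sha` registry keys. A real version would put the vendor SDK behind the tracer and the model API behind the provider.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Tracer(Protocol):
    """Anything that can record one LLM call; a vendor SDK adapter would implement this."""

    def record(self, prompt_ref: str, model: str, inputs: dict, output: str) -> None: ...


@dataclass
class InMemoryTracer:
    """Stand-in tracer for the sketch; replace with an adapter around your tool's SDK."""

    events: list = field(default_factory=list)

    def record(self, prompt_ref: str, model: str, inputs: dict, output: str) -> None:
        self.events.append(
            {"prompt_ref": prompt_ref, "model": model, "inputs": inputs, "output": output}
        )


# (model, rendered_prompt) -> completion text
Provider = Callable[[str, str], str]

# Hypothetical registry contents, keyed by "name@sha".
PROMPTS = {"support/greeting@a1b2c3d": "Greet {name} politely."}


def echo_provider(model: str, rendered_prompt: str) -> str:
    """Fake provider so the sketch runs offline; a real one would call the model API."""
    return f"[{model}] {rendered_prompt}"


_tracer: Tracer = InMemoryTracer()
_provider: Provider = echo_provider


def configure(tracer: Tracer, provider: Provider) -> None:
    """The only place that knows which tool and provider are in use."""
    global _tracer, _provider
    _tracer, _provider = tracer, provider


def llm_call(prompt_ref: str, inputs: dict, model: str) -> str:
    """Resolve the prompt, call the provider, and record a trace."""
    rendered = PROMPTS[prompt_ref].format(**inputs)
    output = _provider(model, rendered)
    _tracer.record(prompt_ref, model, inputs, output)
    return output
```

Application code imports only `llm_call`. Swapping observability tools means writing a new `Tracer` adapter and changing the `configure` call, which is what keeps the migration bounded.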


Where this lives in the wild

  • Langfuse — open-core registry + observability + eval.
  • LangSmith — LangChain ecosystem, closed SaaS.
  • Helicone — open-source observability-first.
  • Braintrust — eval-first SaaS.
  • Pezzo — open-source registry + observability.
  • PromptLayer — registry + observability SaaS.
  • Vellum — low-code prompt management.
  • Phoenix (Arize) — open-source RAG observability + eval.
  • Promptfoo — open-source eval framework.
  • DeepEval — Python eval framework.
  • OpenAI Evals — flexible eval framework.
  • Patronus AI — adversarial eval SaaS.
  • Galileo — RAG + agent eval SaaS.
  • LaunchDarkly — enterprise feature flags.
  • Statsig — experimentation + flags.
  • Split.io — feature flags with audit.
  • Flagsmith — open-source flags.
  • ConfigCat — simpler flags.
  • GrowthBook — open-source experimentation.
  • Datadog LLM Observability — observability for teams already on Datadog.
  • New Relic AI Monitoring — observability for New Relic shops.
  • OpenTelemetry GenAI conventions — vendor-neutral integration standard.
  • OpenLLMetry — OpenTelemetry instrumentation for LLM apps.
  • AWS Parameter Store — runtime prompt config storage.
  • Hashicorp Consul — distributed config with audit, sometimes used for prompts.
  • Doppler — secrets and config management.
  • GitHub Actions — common CI runner for eval suites.
  • GitLab CI — same.
  • CircleCI — same, larger scale.

Pause and recall

  1. What are the four jobs of prompt ops, and which one do the registry-first tools universally outsource?
  2. What is the build-vs-buy decision and when does each side win?
  3. Why is the SDK-wrap integration mode more migration-painful than proxy mode?
  4. What is the 10%-of-LLM-cost sanity check for tooling spend?
  5. When does Phoenix (Arize) beat Langfuse?
  6. What is the "thin internal LLM-call abstraction" pattern and what does it protect against?
  7. Which integration mode is the cleanest long-term, even if it is the slowest-moving?

Interview Q&A

Q1. How do you choose a prompt ops stack for a new team? A. Map your needs to the four jobs — registry, eval, observability, rollout. Pick the simplest tool per job. Start with one or two SaaS tools, not five. Use open-core for the registry/observability tier (Langfuse self-hosted is a strong default), add a specialist eval tool if eval is your primary pain (Braintrust, Promptfoo), and use whichever feature-flag system your team already knows. Trap: "Buy the most-featured tool." Most teams over-buy. The four-job map prevents that.

Q2. When do you build instead of buy? A. Three triggers. (1) Your scale exceeds the tool's tier ceilings (millions of traces, hundreds of customers, data residency requirements). (2) Your team has the engineers to maintain a custom stack. (3) Your requirements are unusual (regulated industry, air-gap, multi-cloud). Until then, buy or self-host open-source. Trap: "Build for differentiation." Prompt ops is rarely a competitive moat. Spend engineer-time on product, not on rebuilding Langfuse.

Q3. Your team uses LangChain. Which stack is the best fit? A. LangSmith is the strongest integration. Tracing is native, prompt management is built in, eval is solid. Pair with LaunchDarkly or Statsig for rollouts. The downside is LangSmith lock-in — if you ever migrate off LangChain, you are also migrating off LangSmith. Mitigate with the thin LLM-call abstraction. Trap: "We are on LangChain so we need LangSmith." You can also pair Langfuse with LangChain — slightly more setup, less lock-in.

Q4. How do you prevent tool lock-in? A. Two patterns. (1) A thin internal abstraction: a single function or class that wraps every LLM call. Your abstraction calls into the tool's SDK, so application code never imports the SDK directly and you can swap tools by changing one file. (2) Export your prompt registry to git as a backup, so the tool is a cache rather than the source of truth. Trap: "Use the vendor SDK everywhere." Migration becomes a multi-month project.

Q5. Your tooling cost is growing faster than your LLM cost. What do you investigate? A. (1) Most prompt-ops SaaS pricing scales on traces, not users — high-traffic apps see disproportionate trace volume. (2) Audit which features each tool actually delivers — many teams pay for capabilities they do not use. (3) Consider self-hosting the heavy ingestion tier (Langfuse self-hosted) while keeping specialist SaaS for eval (Braintrust). (4) Sample traces in production — 100% trace capture is usually unnecessary. Trap: "We will eat the cost." A tooling cost that exceeds 10% of LLM cost compounds; rebalance before it becomes a budget fight.

Q6. When do you need a non-engineer-friendly prompt management surface? A. When PMs, customer-success leads, or AI designers own most prompt iteration. Vellum is the strongest in this niche; PromptLayer is a close second. Langfuse has a functional non-engineer surface but feels engineer-first. The decision is cultural — if your AI team is mostly engineers, the dashboard surface matters less. Trap: "Engineering should own prompts." That can be the right call, but it bottlenecks teams where domain experts (legal, medical, support) are the ones with the right phrasing.

Q7. What is OpenTelemetry GenAI and why does it matter? A. A vendor-neutral standard for instrumenting LLM apps. Your code emits standard spans (with gen_ai.* attributes), and any tool that consumes OTel can read them. It matters because it decouples instrumentation from observability vendors. Today it is slower-moving than vendor SDKs; over the next 2-3 years it will likely become the default integration mode. Trap: "OTel is academic." OTel underpins most cloud-native observability already; AI is catching up.


Apply now (5 min)

Step 1: map your stack. For each of the four jobs (registry, eval, observability, rollout), name the tool you use today. Leave a cell empty when no tool covers that job, even if the gap is intentional.

Step 2: find the overlap. If any tool spans two jobs, ask whether it is doing both well or one well and one poorly. Most multi-job tools have one strong job and one tagalong.

Step 3: find the gap. Empty cells are where the next incident comes from. Pick one and decide whether to fill it this quarter, and with what.

The map is the artifact. Carry it to your next AI infra planning meeting.
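If you prefer to keep the map as data rather than a slide, the three steps above can be sketched as a small audit function. The function name and the dictionary shape are assumptions made for illustration.

```python
JOBS = ("registry", "eval", "observability", "rollout")


def audit_stack(stack: dict[str, list[str]]) -> dict:
    """Given job -> tools, report uncovered jobs and tools that span several jobs."""
    # Step 3: jobs with no tool at all.
    gaps = [job for job in JOBS if not stack.get(job)]

    # Step 2: tools that appear under more than one job.
    tool_jobs: dict[str, list[str]] = {}
    for job in JOBS:
        for tool in stack.get(job, []):
            tool_jobs.setdefault(tool, []).append(job)
    multi_job = {tool: jobs for tool, jobs in tool_jobs.items() if len(jobs) > 1}

    return {"gaps": gaps, "multi_job_tools": multi_job}
```

Feeding it the Series-B stack from the hook (Langfuse for registry and observability, Braintrust for eval, Datadog alongside Langfuse, LaunchDarkly for rollout) reports no gaps and flags Langfuse as the one multi-job tool to scrutinize.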


Bridge. Tools are the instruments. The next chapter is the final accounting — what prompt ops still does not solve, even with the best instruments in the world. → 13-honest-admission.md