03. Success metrics: Separating product outcomes from model scores
~18 min read. A team that measures only model quality is a team that ships confidently into user disappointment. This page draws the line between "the model is good" and "the feature works" — and shows what breaks when you confuse the two.
Builds on 00-first-principles.md. The recurring pressures apply here: AI-fit routing told us which tasks belong to a model; job framing told us what the user actually needs done. Now we need a way to know whether we delivered. Every metric choice is a bet on what "delivered" means — and most teams bet wrong.
1) What this file solves
A search-quality team at a large Indian bank shipped a wiki assistant with 0.92 BLEU on held-out QA pairs and 0.95 cosine similarity on retrieval. The model dashboard was green across every checkpoint. Three weeks post-launch, internal support tickets had not dropped. Employees still called the helpdesk. The assistant answered — but it cited outdated policy documents, gave technically correct text that required 4 minutes of cross-referencing to trust, and never surfaced the one-line answer that a human agent would give in 8 seconds.
Not a model problem. A measurement problem. A retrieved chunk can score 0.94 on cosine similarity and still come from an outdated policy document. Users don't measure cosine similarity. They measure whether the answer was trustworthy and current.
This file teaches you to build a metrics stack that catches divergence between model scores and user outcomes before you ship.
2) What framing and fit-routing taught us, and what still breaks
Chapter 01 gave us jobs. We decomposed "employee finds policy answer" into atomic tasks. Chapter 02 routed those tasks — retrieval here, generation there, human-in-the-loop for edge cases. But we still cannot answer: "Is this feature working?"
We can answer "is the model behaving?" — that is what BLEU, ROUGE, cosine similarity, and perplexity measure. But model behavior is not user outcome. A retrieval system can return the right document (model metric: green) while the user still fails (product metric: red) because the document is 47 pages long and the answer is on page 38.
This chapter fixes the gap. After it, you will be able to define success at four layers, detect when layers diverge, and avoid the trap of optimising a proxy that stopped correlating with the outcome you actually need.
3) The dashboard that said "green" while users switched back to manual search
Real scenario. The fintech wiki assistant from chapters 01-02. Week one post-launch:
- Retrieval recall@5: 0.91 (green)
- Answer BLEU vs gold set: 0.87 (green)
- Cosine similarity, query ↔ retrieved chunk: 0.93 (green)
- Latency p95: 1.2s (green)
The team celebrated. Then the product manager pulled usage logs:
- Daily active users: dropped 34% from day 3 to day 14
- Repeat query rate: 41% of users re-asked the same question within 10 minutes (signal: first answer didn't resolve)
- Fallback to helpdesk: 28% of users who got an assistant answer still opened a ticket
- Time-to-resolution (end-to-end): 6.4 minutes — worse than the 5.1 minutes before the assistant existed
Every model metric was green. Every product metric was red.
The root cause: the model retrieved correct chunks from outdated documents. The retrieval system had no recency signal. Policy documents from 2021 ranked above 2024 updates because they had richer text and more keyword overlap. The model metric measured similarity. The user needed currency.
4) One support interaction measured four different ways
A single query: "What is the reimbursement limit for domestic travel?"
| Layer | Metric | Value | Verdict |
|---|---|---|---|
| Model | Cosine similarity (query ↔ retrieved chunk) | 0.94 | ✅ Green |
| Task | Answer correctness (matches current policy) | ❌ Cited 2022 limit (₹1,500/day) instead of 2024 limit (₹2,200/day) | 🔴 Red |
| Product | User accepted answer without filing ticket | No — user opened helpdesk ticket | 🔴 Red |
| Business | Ticket deflected (cost saved) | ₹0 saved — ticket still created | 🔴 Red |
One interaction. Four measurements. Three of the four say failure. The one that says success (model layer) is the one the ML team was watching.
Callout — the visibility trap. Model metrics are easy to compute, cheap to automate, and live inside the ML team's tooling. Product metrics require instrumentation across the full user journey. Teams measure what is easy, not what matters.
5) The rule: product metrics measure user outcomes; model metrics measure model behavior
State it plainly:
- Model metrics answer: "Did the model produce output that resembles the reference?" (BLEU, ROUGE, cosine similarity, perplexity, F1 on NER, exact match)
- Task metrics answer: "Did the system complete the task correctly?" (answer correctness, tool-call success rate, end-to-end accuracy)
- Product metrics answer: "Did the user get what they needed?" (task completion rate, time-to-resolution, repeat query rate, fallback rate, NPS on feature)
- Business metrics answer: "Did the organisation benefit?" (ticket deflection, cost per resolution, revenue per feature interaction, retention lift)
These four layers can diverge. When they do, the higher layer (user/business) is the truth. The lower layer (model) is a proxy — useful only while it correlates with the higher layer.
The moment a proxy stops correlating, optimising it makes things worse. You get a faster, more confident model that answers the wrong question with higher cosine similarity.
6) The metrics stack: from model internals to business outcomes
┌─────────────────────────────────────────────────────────────────┐
│ THE METRICS STACK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ │
│ │ BUSINESS │ ← revenue, cost, retention, NPS │
│ │ OUTCOMES │ (lagging; weeks-months to move) │
│ └───────┬───────┘ │
│ │ ← can diverge here: product works but │
│ │ business case was wrong │
│ ┌───────▼───────┐ │
│ │ PRODUCT │ ← task completion, adoption, fallback │
│ │ METRICS │ (leading; days-weeks to signal) │
│ └───────┬───────┘ │
│ │ ← can diverge here: task correct but │
│ │ not what the user needed │
│ ┌───────▼───────┐ │
│ │ TASK │ ← end-to-end correctness, tool success │
│ │ METRICS │ (near-real-time; per-request) │
│ └───────┬───────┘ │
│ │ ← MOST COMMON DIVERGENCE POINT │
│ │ model scores well but task fails │
│ ┌───────▼───────┐ │
│ │ MODEL │ ← BLEU, cosine, perplexity, F1 │
│ │ METRICS │ (instant; per-inference) │
│ └───────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Each connector between boxes marks a boundary where adjacent layers can diverge. Teams that measure only at the bottom are blind to failures at every layer above.
7) Mapping wiki-assistant metrics at each layer
Back to our fintech wiki assistant. Here is the full metrics stack, populated:
| Layer | Metric | Target | Current | Status |
|---|---|---|---|---|
| Model | Retrieval recall@5 | ≥ 0.90 | 0.91 | ✅ |
| Model | Answer BLEU vs gold | ≥ 0.80 | 0.87 | ✅ |
| Model | Cosine sim (query ↔ chunk) | ≥ 0.88 | 0.93 | ✅ |
| Task | Answer cites current policy (< 6 months old) | ≥ 0.95 | 0.61 | 🔴 |
| Task | Answer resolves query without follow-up | ≥ 0.80 | 0.59 | 🔴 |
| Product | Repeat-query rate (same user, < 10 min) | ≤ 0.10 | 0.41 | 🔴 |
| Product | Helpdesk fallback rate | ≤ 0.15 | 0.28 | 🔴 |
| Product | DAU retention (day 14 / day 1) | ≥ 0.70 | 0.46 | 🔴 |
| Business | Ticket deflection rate | ≥ 0.40 | 0.12 | 🔴 |
| Business | Cost per resolution | ≤ ₹45 | ₹112 | 🔴 |
The divergence point: model → task. The retrieval system finds similar text but does not verify recency. Everything above that layer inherits the failure.
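The table above can be read mechanically. The following is a minimal sketch, not the team's actual tooling: it takes the targets and current values from the table, marks each layer green only if every metric in it meets its target, and reports the lowest green layer that sits directly under a red one. The `Metric` class, the metric names, and the `higher_is_better` flag are illustrative assumptions.

```python
from dataclasses import dataclass

LAYERS = ["model", "task", "product", "business"]  # bottom to top


@dataclass
class Metric:
    layer: str
    name: str                      # hypothetical identifier for the table row
    target: float
    current: float
    higher_is_better: bool = True  # False for "≤" targets such as fallback rate

    def is_green(self) -> bool:
        if self.higher_is_better:
            return self.current >= self.target
        return self.current <= self.target


def layer_status(metrics):
    """Map each measured layer to True (all metrics green) or False (any red)."""
    status = {}
    for layer in LAYERS:
        in_layer = [m for m in metrics if m.layer == layer]
        if in_layer:
            status[layer] = all(m.is_green() for m in in_layer)
    return status


def divergence_point(metrics):
    """Return (lower, upper) for the lowest green layer directly below a red one."""
    status = layer_status(metrics)
    measured = [layer for layer in LAYERS if layer in status]
    for lower, upper in zip(measured, measured[1:]):
        if status[lower] and not status[upper]:
            return (lower, upper)
    return None


# Values copied from the table above.
wiki_assistant = [
    Metric("model", "retrieval_recall_at_5", 0.90, 0.91),
    Metric("model", "answer_bleu", 0.80, 0.87),
    Metric("model", "cosine_sim", 0.88, 0.93),
    Metric("task", "cites_current_policy", 0.95, 0.61),
    Metric("task", "resolves_without_follow_up", 0.80, 0.59),
    Metric("product", "repeat_query_rate", 0.10, 0.41, higher_is_better=False),
    Metric("product", "helpdesk_fallback_rate", 0.15, 0.28, higher_is_better=False),
    Metric("product", "dau_retention_d14", 0.70, 0.46),
    Metric("business", "ticket_deflection_rate", 0.40, 0.12),
    Metric("business", "cost_per_resolution_inr", 45, 112, higher_is_better=False),
]

print(layer_status(wiki_assistant))
print(divergence_point(wiki_assistant))  # ('model', 'task')
```

A layer with no metrics is simply absent from the result, which is itself a useful signal: an unmeasured layer cannot be green.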
Callout — fix at the lowest divergent layer. The product team wanted to add a "Was this helpful?" button (product-layer fix). The actual fix was a recency filter on retrieval (model-layer fix that repaired the task-layer metric). Always diagnose downward, fix at the root, verify upward.
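One way a recency filter could look is sketched below. It is an illustration under stated assumptions, not the team's implementation: it assumes every chunk carries an `effective_date` and a `policy_key` that groups versions of the same policy, and that the existing retriever already supplies a `similarity` score. All field names, document ids, and the 0.89 and 0.41 scores are hypothetical; the 0.94 score and the two reimbursement limits come from the example in section 4.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    doc_id: str
    text: str
    similarity: float     # assumed: score from the existing retriever
    effective_date: date  # assumed: freshness metadata stored with each document
    policy_key: str       # assumed: shared by all versions of one policy


def keep_latest_versions(chunks):
    """Drop chunks that come from a superseded version of a policy."""
    latest = {}
    for c in chunks:
        if c.policy_key not in latest or c.effective_date > latest[c.policy_key]:
            latest[c.policy_key] = c.effective_date
    return [c for c in chunks if c.effective_date == latest[c.policy_key]]


def retrieve_current(chunks, k=5):
    """Filter out stale versions first, then rank the rest by similarity."""
    current = keep_latest_versions(chunks)
    return sorted(current, key=lambda c: c.similarity, reverse=True)[:k]


candidates = [
    Chunk("travel-policy-2022", "Domestic travel limit: ₹1,500/day",
          0.94, date(2022, 4, 1), "travel-reimbursement"),
    Chunk("travel-policy-2024", "Domestic travel limit: ₹2,200/day",
          0.89, date(2024, 4, 1), "travel-reimbursement"),
    Chunk("leave-policy-2023", "Leave carry-forward rules",
          0.41, date(2023, 1, 1), "leave"),
]

top = retrieve_current(candidates)
print([c.doc_id for c in top])  # ['travel-policy-2024', 'leave-policy-2023']
```

The filter only sees versions that reach the candidate set. If the retriever never returns the 2024 document at all, the stale chunk survives, so the task-layer metric (answer cites current policy) still has to be verified after the change.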
8) Leading vs lagging indicators: which metrics tell you early
| Indicator type | Example | Signal speed | Action speed |
|---|---|---|---|
| Leading | Repeat-query rate spikes | Hours | Same day |
| Leading | Thumbs-down rate on answers | Hours | Same day |
| Leading | Retrieval confidence drops below threshold | Real-time | Automated alert |
| Lagging | Monthly ticket deflection | Weeks | Next sprint |
| Lagging | Support cost reduction | Months | Quarterly review |
| Lagging | NPS lift | Months | Strategy change |
The operational rule: alert on leading indicators, report on lagging indicators. If you only watch lagging metrics, you discover failure at the quarterly business review — 8 weeks after users gave up.
For the wiki assistant, the first leading signal was repeat-query rate hitting 0.41 on day 3. The team had not instrumented it. They found out on day 14, when the product manager pulled raw logs. Eleven days of user frustration stayed invisible because the alerts were wired to the wrong layer.
Callout — the 48-hour rule. If a new AI feature does not have at least one leading indicator alerting within 48 hours of a quality regression, it is flying blind. Instrument before you launch.
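A leading indicator such as repeat-query rate can be computed from plain event logs. The sketch below is a rough illustration, assuming events arrive as `(user_id, query, timestamp)` tuples and that "same question" means an exact match after lowercasing and whitespace cleanup (a real system would likely need fuzzy or semantic matching). It counts the share of users who re-asked a question within 10 minutes and compares it to the 0.10 target from the metrics table. The function names and sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
ALERT_THRESHOLD = 0.10  # target from the metrics table: repeat-query rate <= 0.10


def normalize(query: str) -> str:
    """Crude stand-in for 'same question': lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


def repeat_query_rate(events):
    """Share of users who re-asked the same question within WINDOW."""
    by_user = defaultdict(list)
    for user_id, query, ts in events:
        by_user[user_id].append((ts, normalize(query)))
    if not by_user:
        return 0.0

    repeaters = 0
    for history in by_user.values():
        history.sort()      # chronological order per user
        last_seen = {}      # normalized query -> time it was last asked
        repeated = False
        for ts, q in history:
            if q in last_seen and ts - last_seen[q] < WINDOW:
                repeated = True
                break
            last_seen[q] = ts
        repeaters += repeated
    return repeaters / len(by_user)


def should_alert(rate, threshold=ALERT_THRESHOLD):
    return rate > threshold


t0 = datetime(2024, 6, 3, 9, 0)  # hypothetical sample data
events = [
    ("u1", "Reimbursement limit for domestic travel?", t0),
    ("u1", "reimbursement limit for  domestic travel?", t0 + timedelta(minutes=4)),
    ("u2", "How do I reset my VPN token?", t0),
    ("u3", "Leave carry-forward rules", t0),
    ("u3", "Leave carry-forward rules", t0 + timedelta(minutes=30)),
]

rate = repeat_query_rate(events)
print(round(rate, 2), should_alert(rate))  # 0.33 True
```

Running a check like this on an hourly window of logs is one way to meet the 48-hour rule. The per-user definition follows the "41% of users" phrasing from section 3; a per-query definition would be an equally reasonable choice as long as the target is set on the same basis.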
9) Why teams over-index on model metrics
Two dashboards. Same feature. Different stories.
Dashboard A — Model score dashboard (what ML teams build first)
┌─────────────────────────────────────────────────┐
│ Wiki Assistant — Model Quality │
│ │
│ BLEU: 0.87 ████████████████░░ ✅ │
│ Cosine sim: 0.93 ██████████████████░ ✅ │
│ Recall@5: 0.91 █████████████████░░ ✅ │
│ Latency p95: 1.2s ███████░░░░░░░░░░░ ✅ │
│ │
│ Status: ALL GREEN │
└─────────────────────────────────────────────────┘
Dashboard B — Product outcome dashboard (what PMs need to see)
┌─────────────────────────────────────────────────┐
│ Wiki Assistant — User Outcomes │
│ │
│ Task resolved (no follow-up): 59% 🔴 │
│ Repeat-query rate: 41% 🔴 │
│ Helpdesk fallback: 28% 🔴 │
│ DAU retention (d14): 46% 🔴 │
│ Avg time-to-resolution: 6.4m 🔴 │
│ │
│ Status: FEATURE FAILING │
└─────────────────────────────────────────────────┘
Why does Dashboard A get built first?
- Tooling gravity. ML frameworks emit model metrics by default. Product metrics require custom instrumentation across frontend, backend, and support systems.
- Ownership gap. The ML team owns the model. The product metric lives across teams — frontend, support ops, data eng. Nobody owns it fully.
- Eval set availability. Gold-standard QA pairs exist from the training pipeline. Product ground truth (did the user actually succeed?) requires event tracking that often doesn't exist at launch.
- Review incentives. Model papers report BLEU and F1. Launch reviews ask "what are your model metrics?" Teams optimise for what the review asks.
- Comfort bias. Model metrics are deterministic, reproducible, and run in CI. Product metrics are noisy, delayed, and require statistical reasoning.
None of these are bad reasons. They are structural forces. You counteract them by making the product dashboard the launch gate — not a nice-to-have that arrives in sprint 3.
10) Signals that your metrics are lying
Operational signals that your metric stack has a divergence you haven't caught:
| Signal | What it suggests | Example |
|---|---|---|
| Model metrics stable but user complaints rising | Task or product layer broken | Cosine sim = 0.93 but users say "answers are wrong" |
| High accuracy on eval set but low adoption | Eval set doesn't represent real queries | 95% accuracy on 200 curated questions; 60% on the long tail |
| Improving model scores over time but flat business metrics | Optimising a decoupled proxy | BLEU went 0.82 → 0.89 but ticket deflection stayed at 12% |
| Users engage but don't complete | UX or trust problem, not quality | Users read the answer but still call the helpdesk to confirm |
| A/B test shows no lift despite better model | Bottleneck is elsewhere | Better retrieval, but the answer format is unusable on mobile |
When you see these signals, do not tune the model. Diagnose the stack. Walk upward from model to business. Find the divergence point. Fix there.
Callout — the "users are wrong" trap. When metrics diverge, some teams conclude users are not using the feature correctly. This is almost never the diagnosis. If users are failing, the product is failing. Redefine success from the user's frame, not the model's frame.
11) Where proxy metrics stop working: Goodhart's law in AI products
Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
In AI products this shows up with brutal regularity:
| Proxy metric targeted | What actually happened | Why it broke |
|---|---|---|
| Cosine similarity maximised | Retrieval started returning long, generic documents that score as loosely similar to almost any query | Broad documents touch many topics, so they can score moderately well against many queries without answering any of them |
| BLEU maximised | Answers became verbose, repeating the reference phrasing even when a shorter answer was clearer | BLEU rewards n-gram overlap, not clarity |
| Response time minimised | System started returning cached stale answers instead of computing fresh ones | Cache hit = fast; cache miss = slow. Optimiser learned to cache aggressively |
| Thumbs-up rate maximised | Answers became sycophantic — agreeing with the user's premise even when wrong | Users thumbs-up answers that confirm their belief |
| Ticket deflection maximised | Assistant started saying "I've resolved this" without actually resolving, discouraging users from opening tickets | Deflection measured whether ticket was opened, not whether problem was solved |
The defense: never optimise a single proxy. Monitor the full stack. When a lower-layer metric improves but a higher-layer metric doesn't follow, the proxy has decoupled. Stop optimising it.
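The decoupling check described above can be reduced to a small comparison between two releases. This is a simplified sketch: the function name and both gain thresholds are hypothetical, it assumes higher is better for both metrics, and a single before/after pair can be noise, so in practice the comparison would be repeated across several releases. The example values are the BLEU and ticket-deflection figures from the signals table in section 10.

```python
def proxy_decoupled(proxy_before, proxy_after,
                    outcome_before, outcome_after,
                    min_proxy_gain=0.02, min_outcome_gain=0.01):
    """True when the proxy clearly improved but the outcome did not follow.

    Both thresholds are illustrative; pick them from the normal
    release-to-release noise of each metric.
    """
    proxy_gain = proxy_after - proxy_before
    outcome_gain = outcome_after - outcome_before
    return proxy_gain >= min_proxy_gain and outcome_gain < min_outcome_gain


# BLEU rose from 0.82 to 0.89 while ticket deflection stayed at 12%.
print(proxy_decoupled(0.82, 0.89, 0.12, 0.12))  # True

# Same BLEU gain, but deflection also moved: the proxy still tracks the outcome.
print(proxy_decoupled(0.82, 0.89, 0.12, 0.19))  # False
```

A `True` result is a prompt to stop tuning the proxy and look for the divergence point, not proof on its own that the proxy is useless.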
12) Wrong assumption: "high model accuracy = product success"
This is the most expensive wrong assumption in AI product development.
The assumption: If the model scores well on the eval set, the feature will succeed with users.
Why it is wrong:
- The eval set is not the real distribution. Real queries are messier, more ambiguous, and more diverse than any curated set.
- "Correct" is necessary but not sufficient. A correct answer that takes 3 minutes to parse can serve the user worse than a response that admits uncertainty and routes them to a human in 30 seconds.
- Trust is not a model property. A user who received one wrong answer will distrust the next five correct ones. Trust compounds over time — accuracy is measured per-request.
- Context matters. A correct answer at the wrong moment (user is on a phone call, user is on mobile, user needed a one-line summary not a paragraph) fails despite being correct.
The wiki-assistant version: Model accuracy on the eval set was 89%. But 31% of "correct" answers cited policy documents older than 12 months. Users learned that the assistant sometimes gives outdated information. Within two weeks, power users — the ones who would have driven adoption — stopped using it. They told new employees: "Don't trust the wiki bot, just ask Priya on the support team." The model was 89% accurate. The feature was dead.
Callout — trust decays faster than accuracy improves. One wrong answer erodes trust that took twenty right answers to build. Your metrics must capture trust dynamics, not just per-request accuracy.
13) Cross-topic connections
Metrics do not live in isolation. They connect to:
- Evals (Module 04 — AI Product Evals): Evals are the mechanism that computes task-layer metrics. A metric without an eval is a wish. An eval without a metric is busywork.
- Acceptance tests (Chapter 04): Acceptance tests turn metrics into pass/fail gates. A metric says "we are at 59% task resolution." An acceptance test says "we do not ship until task resolution ≥ 80%."
- Risk (Chapter 05): Risk metrics are a special case — they measure what can go wrong, not what should go right. A feature can pass all success metrics and still carry unacceptable risk (e.g., 0.3% hallucination rate on medical advice).
- Fit-routing (Chapter 02): If a task was routed to AI that should have stayed with a human, no amount of model metric tuning will fix the product metric. The routing decision precedes the metric.
- Job framing (Chapter 01): If the job was decomposed wrong — if you're answering a question the user didn't actually ask — then even perfect task accuracy on the wrong task produces zero product value.
Wiki assistant metrics: what each layer costs to measure
| Layer | Metric | Instrumentation cost | Data source | Refresh cadence |
|---|---|---|---|---|
| Model | BLEU / cosine sim | Low — runs in eval pipeline | Eval set + model output | Per-deploy |
| Model | Retrieval recall@k | Low — offline eval | Eval set + retrieval logs | Per-deploy |
| Task | Answer cites current doc | Medium — needs doc freshness metadata | Doc store + answer trace | Daily |
| Task | Answer correctness (human judge) | High — requires human annotation | Sampled interactions | Weekly |
| Product | Repeat-query rate | Medium — needs session tracking | Event logs | Real-time |
| Product | Helpdesk fallback rate | Medium — needs cross-system join | Assistant logs + ticketing system | Daily |
| Product | DAU retention | Low — standard product analytics | Analytics platform | Daily |
| Business | Ticket deflection | Medium — causal attribution needed | Ticketing + assistant logs | Weekly |
| Business | Cost per resolution | High — needs full cost model | Finance + ops + engineering | Monthly |
The pattern: model metrics are cheap and fast. Business metrics are expensive and slow. This is why teams over-index on the cheap layer. The fix is not to stop measuring model metrics — it is to also instrument the expensive layers before launch, accepting that the first measurement will be imperfect.
Recall: what should stick
- What are the four layers of the metrics stack, bottom to top?
- At which layer does the most common divergence occur in AI products?
- Why did the wiki assistant's model metrics show green while the product was failing?
- What is the difference between a leading indicator and a lagging indicator? Give one example of each from this chapter.
- State Goodhart's law in one sentence and give one AI-product example.
- Why is "high model accuracy = product success" wrong? Name two reasons.
- What is the 48-hour rule for leading indicators?
- When a proxy metric decouples from the outcome it was meant to measure, what should you do?
Interview Q&A
Q1. A team reports 95% accuracy on their eval set but users are complaining. What is your diagnosis framework? A. Walk the metrics stack upward. Check: (1) Does the eval set represent real query distribution? (2) Are task-layer metrics (correctness on live traffic) also at 95%? (3) Are product-layer metrics (resolution rate, fallback rate) healthy? (4) Is there a trust/UX gap — correct answers that are hard to consume? The divergence point tells you where to fix. Common wrong answer to avoid: "The eval set needs more examples." More data at the wrong layer doesn't fix a measurement gap.
Q2. How do you decide which metrics to alert on vs which to report quarterly? A. Alert on leading indicators — metrics that move within hours of a regression (repeat-query rate, confidence drops, thumbs-down spikes). Report on lagging indicators — metrics that aggregate over weeks (ticket deflection, cost per resolution, NPS). Leading indicators drive immediate action. Lagging indicators drive strategy. Common wrong answer to avoid: "Alert on everything." Alert fatigue makes every alert meaningless.
Q3. A product manager asks you to optimise for ticket deflection rate. What is the risk? A. Goodhart's law. If you optimise deflection directly, the system can learn to discourage ticket creation (sycophantic answers, false confidence) rather than actually resolving problems. You need a paired metric: deflection rate and re-contact rate within 48 hours. If deflection rises but re-contact also rises, you are suppressing tickets, not solving problems. Common wrong answer to avoid: "Ticket deflection is a clean business metric, no risk." Every single metric is gameable.
Q4. Your BLEU score improved from 0.82 to 0.89 after a model update, but ticket deflection stayed flat. Explain possible causes. A. The BLEU improvement is in a dimension users don't care about — perhaps longer, more reference-like phrasing that doesn't improve resolution. Or the eval set covers easy queries where BLEU tracks correctness, but the deflection gap is on hard queries not in the eval set. Or the bottleneck is upstream (retrieval freshness) or downstream (UX: user doesn't trust the answer format). Common wrong answer to avoid: "We need more training data." This assumes the model layer is still the bottleneck when the evidence says otherwise.
Q5. How do you measure trust in an AI feature? A. Trust is a product-layer metric, not a model-layer metric. Proxies: (1) Does the user act on the answer without seeking confirmation elsewhere? (2) Does the user return for subsequent queries? (3) Does the user recommend the feature to colleagues? (4) Does repeat-query rate stay low? Trust compounds — measure it longitudinally, not per-request. Common wrong answer to avoid: "Add a thumbs-up button." Self-reported satisfaction ≠ trust. Users who don't trust a feature simply stop using it — they don't leave feedback.
Q6. What is the minimum metric set you would require before launching an AI feature? A. At minimum: one model metric (confirms model is performing), one task metric (confirms end-to-end correctness on live-like traffic), one product metric (confirms users succeed), and one leading indicator with an alert threshold. Business metrics can lag by one sprint but must be defined pre-launch with a target. Common wrong answer to avoid: "Just model metrics and we will add product metrics in v2." V2 arrives after users have already churned.
Q7. A team wants to use cosine similarity as their primary success metric for a RAG system. What do you tell them? A. Cosine similarity measures how close the retrieved chunk is to the query in vector space (lexically for sparse vectors, semantically for embeddings). It does not measure whether the chunk contains the correct answer, whether the answer is current, whether the user understood it, or whether it resolved their need. Use cosine similarity as a model-layer diagnostic, not as a success metric. The success metric belongs at the task or product layer. Common wrong answer to avoid: "Cosine similarity above 0.9 means the system is working." It means the system is retrieving text that resembles the query. That is a necessary condition, not a sufficient one.
Q8. How do you handle the case where model metrics and product metrics are both green, but the business metric is red? A. The product works for users but the business case was wrong. Either: (1) the feature serves a need that doesn't translate to business value (users love it but it doesn't reduce cost or drive revenue), or (2) the audience is too small to move business metrics, or (3) the business metric has a longer lag than expected. This is a strategy problem, not an engineering problem. Escalate to product leadership with the data showing the product-business divergence. Common wrong answer to avoid: "Improve the model further." The model and product layers are already green. The problem is above them.
Design / debug: apply this now
Step 1 — Map your stack. Take the AI feature closest to your current work (or the wiki assistant if you don't have one). Write down one metric at each of the four layers. For each metric, write what it tells you and what it hides. If you cannot fill a layer, that is the layer where you are flying blind.
Step 2 — Find the divergence. Look at your current dashboard. Is there a layer where the metric is green but the layer above is red (or unmeasured)? If unmeasured: that is your highest-priority instrumentation task. If red: diagnose the gap. What is the model doing "right" that the user experiences as "wrong"?
Step 3 — Set a leading indicator alert. Pick one metric from your stack that would move within 48 hours of a regression. Define the threshold. Wire the alert. Until this alert exists, your feature is flying blind to production quality changes.
Operational memory
This chapter explained why model metrics and product metrics can diverge — and why that divergence is the most common failure mode in AI product development. The important idea is that the metrics stack has four layers (model → task → product → business), each layer can show green while the layer above shows red, and the higher layer is always the truth because it measures what users and the business actually experience.
You learned that the wiki assistant had 0.93 cosine similarity (model: green) but 41% repeat-query rate (product: red) because retrieval optimised for textual similarity, not document recency. The fix was at the lowest divergent layer (adding a recency filter to retrieval), not at the product layer (adding a feedback button). This diagnostic addresses the "dashboard said green but users left" failure from section 3 because it pushes you to instrument and alert on the right layer before launch.
Carry this diagnostic forward: when someone says "the model metrics look great," ask what the product metrics say. If nobody knows, the feature is shipping blind.
Remember:
- Model metrics measure model behavior; product metrics measure user outcomes. They can diverge. When they do, the product metric is the truth.
- The most common divergence point is model → task: the model produces textually good output that fails to complete the user's actual task.
- Alert on leading indicators (repeat-query rate, thumbs-down, confidence drops) that move within hours. Report on lagging indicators (ticket deflection, NPS, cost) that aggregate over weeks.
- Never optimise a single proxy metric in isolation. When a lower-layer metric improves but the higher-layer metric doesn't follow, the proxy has decoupled — stop optimising it.
- Trust decays faster than accuracy improves. One wrong answer erodes trust that took twenty right answers to build.
- A metric without an eval is a wish. An eval without a metric is busywork. A launch without a product-layer metric is blind.
Bridge. We now have success metrics at every layer. But a metric is not a launch gate. A metric says "we are at 59%." A launch gate says "we do not ship until we are at 80%." Next: turning metrics into acceptance tests — concrete pass/fail criteria that block shipping until the system proves it works.