11. Measuring AI UX
Accessibility extends the design. Measurement tells the team whether the design is working. The discipline is to choose metrics that reflect AI UX quality — trust, repair, calibration — rather than engagement metrics that an engaging-but-misleading product would also win.
A platform engineer at a Gurgaon SaaS company reviews the AI feature's quarterly metrics. Sessions are up; queries per session are up; thumbs-up rate is high. The product team is celebrating. The platform engineer looks deeper: queries per session is up because users are re-asking the same question with different phrasings — they cannot find what they want. Thumbs-up rate is high because users default-tap thumbs-up after every response by habit; the actual usefulness signal is noise. The team replaces the dashboard: time-to-correct-answer, repair rate, escalation rate, abandonment-after-failure. The new metrics tell a different story — the feature is engaging but not useful. Six weeks later, with the new metrics steering iteration, usefulness scores climb.
This chapter is the measurement discipline.
What measuring AI UX is
Measuring AI UX is the discipline of selecting metrics that capture whether users are getting useful, trustworthy, recoverable AI experiences — not whether they are spending time in the product.
Engagement metrics — sessions, queries per session, time-on-feature — measure activity. Activity is not the goal. Usefulness is. The two diverge in AI products more than in traditional products because friction in an AI product looks indistinguishable from engagement on the surface.
The metric families
Six families:
| Family | What it measures | Example |
|---|---|---|
| Outcome | Did the user get what they wanted? | task completion, time-to-correct-answer |
| Trust | Does the user act on the AI's output? | confirmation rate, override rate, citation-click rate |
| Repair | When the AI is wrong, can the user recover? | correction rate, retry success, escalation rate |
| Calibration | Does the AI's confidence match reality? | accuracy by confidence band |
| Adoption | Are users returning? | weekly active by cohort, depth-of-use |
| Safety | Are harms being avoided? | refusal accuracy, distress-handling rate |
A balanced dashboard pulls from all six. A dashboard heavy on adoption alone misses the difference between engaging and useful.
Outcome metrics
The most important and the hardest to measure.
Task completion. Did the user accomplish the underlying goal — file the claim, find the document, write the email? Requires knowing what the user came to do, which is hard for open-ended AI surfaces. Usable signals: explicit completion flags ("Got it" button), behavioural proxies (user closed the session after a satisfied-looking response), follow-up absence (user did not come back with the same question).
Time-to-correct-answer. From the user's first query to the response they accept. Includes retries and corrections within the same task. Penalises long-tail repair loops.
First-attempt success. Fraction of tasks where the first AI response was accepted without correction or retry. A strong signal when measurable.
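The outcome metrics above can be sketched over an attempt log. This is a minimal illustration, not a reference implementation: the `Attempt` record, its field names, and the idea that queries are already grouped into tasks by a `task_id` are all assumptions made for the example. In practice, grouping queries into tasks is the hard part.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Attempt:
    task_id: str        # hypothetical: groups the queries that pursue one goal
    asked_at: float     # seconds, when the user sent the query
    answered_at: float  # seconds, when the AI response arrived
    accepted: bool      # True if the user accepted this response


def outcome_metrics(attempts):
    """Task completion, first-attempt success, median time-to-correct-answer."""
    by_task = {}
    for a in sorted(attempts, key=lambda a: a.asked_at):
        by_task.setdefault(a.task_id, []).append(a)

    times, first_ok, completed = [], 0, 0
    for task_attempts in by_task.values():
        accepted = next((a for a in task_attempts if a.accepted), None)
        if accepted is None:
            continue  # the task never reached an accepted response
        completed += 1
        # Clock runs from the first query, so retries count against the metric.
        times.append(accepted.answered_at - task_attempts[0].asked_at)
        if task_attempts[0].accepted:
            first_ok += 1

    n = len(by_task)
    return {
        "task_completion": completed / n if n else 0.0,
        "first_attempt_success": first_ok / n if n else 0.0,
        "median_time_to_correct_answer_s": median(times) if times else None,
    }


attempts = [
    Attempt("t1", 0, 5, False), Attempt("t1", 20, 24, False),
    Attempt("t1", 40, 45, True),    # accepted on the third try
    Attempt("t2", 100, 103, True),  # accepted on the first try
    Attempt("t3", 200, 204, False), # never accepted
]
print(outcome_metrics(attempts))
```

Tasks that never reach an accepted response are excluded from the time median but still count in the denominators, so an unresolved task lowers completion rather than silently disappearing.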
The trap is treating session length as outcome. A long session can mean either rich engagement or a frustrated user fighting with the AI. The metric cannot tell them apart.
Trust metrics
How much do users actually rely on the AI's output?
Confirmation rate. When the AI suggests an action and the user must confirm, what fraction confirm without modification? A high rate means users trust the suggestion; a very high rate may mean rubber-stamping, which is its own risk.
Override rate. Same idea inverted. Useful when paired with the disposition of overrides — were the overrides correct?
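One way to compute confirmation and override rates together with the disposition of overrides is sketched below. The record shape (an `action` string and an optional `override_correct` flag filled in by later review) is a hypothetical logging format, not one the chapter prescribes.

```python
def trust_metrics(suggestions):
    """suggestions: list of dicts with
       'action' in {'confirmed', 'modified', 'rejected'} and, for overrides
       that have been reviewed, 'override_correct' (bool)."""
    total = len(suggestions)
    if total == 0:
        return {"confirmation_rate": None, "override_rate": None,
                "override_correct_rate": None}

    confirmed = sum(1 for s in suggestions if s["action"] == "confirmed")
    overrides = [s for s in suggestions if s["action"] != "confirmed"]
    # Only overrides with a reviewed disposition can be scored.
    reviewed = [s for s in overrides if s.get("override_correct") is not None]
    correct = sum(1 for s in reviewed if s["override_correct"])

    return {
        "confirmation_rate": confirmed / total,
        "override_rate": len(overrides) / total,
        "override_correct_rate": correct / len(reviewed) if reviewed else None,
    }


suggestions = (
    [{"action": "confirmed"}] * 6
    + [{"action": "modified", "override_correct": True},
       {"action": "modified", "override_correct": False},
       {"action": "rejected", "override_correct": True},
       {"action": "rejected"}]  # not yet reviewed
)
print(trust_metrics(suggestions))
```

A high `override_correct_rate` suggests users are right to distrust the suggestions; a low one suggests the AI was right and the trust problem lies elsewhere.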
Citation-click rate. When the AI provides citations, do users open them? A low click rate means either that users trust the AI fully (good) or that the citations are decorative (bad); pair it with qualitative review to tell the two apart.
Tone-of-engagement. Free-text channel sentiment. Users who push back politely have a different trust profile than users who push back angrily.
Trust is high-signal but easy to misread. Always pair with calibration to know whether trust is warranted.
Repair metrics
When something goes wrong, how well does the system support recovery?
Correction rate. Fraction of responses corrected by the user. Counter-intuitively, a moderate correction rate is healthy: it means users are pushing back and the system is capturing the signal. A zero correction rate often means the affordance is too hidden or users have given up.
Retry success rate. Of corrected responses, what fraction of retries produced an acceptable answer?
Escalation rate to human. How often does the AI hand off, and what is the disposition of those handoffs?
Abandonment-after-failure. Did the user leave the feature within five minutes of a failed AI response, without escalating? The worst signal — the user neither corrected nor escalated, just left.
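The five-minute rule for abandonment-after-failure can be sketched as follows. The `Session` fields are hypothetical, and using "sessions with at least one failed response" as the denominator is an assumption of this example; a team might reasonably choose a different base.

```python
from dataclasses import dataclass
from typing import Optional

FIVE_MINUTES_S = 300  # window taken from the definition above


@dataclass
class Session:
    last_failure_at: Optional[float]  # seconds; None if no response failed
    left_at: float                    # seconds; when the user left the feature
    corrected_after_failure: bool
    escalated_after_failure: bool


def abandonment_after_failure(sessions):
    """Share of failed sessions where the user left within five minutes
    of the failure without correcting or escalating."""
    failed = [s for s in sessions if s.last_failure_at is not None]
    if not failed:
        return 0.0
    abandoned = [
        s for s in failed
        if not s.corrected_after_failure
        and not s.escalated_after_failure
        and s.left_at - s.last_failure_at <= FIVE_MINUTES_S
    ]
    return len(abandoned) / len(failed)


sessions = [
    Session(100, 160, False, False),   # left after 60 s: abandoned
    Session(100, 900, False, False),   # stayed well past five minutes
    Session(100, 150, True, False),    # corrected before leaving
    Session(None, 50, False, False),   # no failure in this session
]
print(abandonment_after_failure(sessions))
```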
Calibration metrics
Does the AI's expressed confidence match its actual accuracy?
Accuracy by confidence band. Slice predictions by stated confidence; compute accuracy within each band. A well-calibrated AI says "90% confident" on questions where it is right 90% of the time. A miscalibrated AI says "90% confident" and is right 60% of the time.
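A minimal sketch of accuracy by confidence band, assuming a labelled eval set that yields `(stated_confidence, was_correct)` pairs. The band edges are illustrative defaults, not recommended values.

```python
def accuracy_by_band(predictions, edges=(0.0, 0.5, 0.7, 0.9, 1.0)):
    """predictions: iterable of (stated_confidence in [0, 1], was_correct).
    Returns one row per non-empty band with accuracy and the calibration gap."""
    bands = list(zip(edges, edges[1:]))
    stats = {band: [0, 0, 0.0] for band in bands}  # right, total, conf_sum

    for conf, correct in predictions:
        for lo, hi in bands:
            # Bands are half-open [lo, hi); the last band also includes 1.0.
            if lo <= conf < hi or (hi == edges[-1] and conf == hi):
                s = stats[(lo, hi)]
                s[0] += int(correct)
                s[1] += 1
                s[2] += conf
                break

    report = []
    for band, (right, total, conf_sum) in stats.items():
        if total == 0:
            continue
        accuracy = right / total
        mean_conf = conf_sum / total
        report.append({"band": band, "n": total, "accuracy": accuracy,
                       "gap": mean_conf - accuracy})  # > 0 means overconfident
    return report


# Mirrors the example above: "90% confident" but right 60% of the time.
predictions = ([(0.9, True)] * 6 + [(0.9, False)] * 4
               + [(0.6, True)] * 2 + [(0.6, False)] * 2)
for row in accuracy_by_band(predictions):
    print(row)
```

Bands with few samples produce unstable accuracies, so it is worth reporting `n` alongside each band rather than the accuracy alone.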
Hedging accuracy. When the AI hedges ("I'm not sure, but..."), what fraction of those responses are actually wrong? The error rate on hedged responses should be substantially higher than the base error rate.
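The comparison between hedged and base error rates might look like this, assuming each labelled record carries a hypothetical `(hedged, was_wrong)` pair. How a response gets classified as "hedged" is left out of scope here.

```python
def hedging_error_rates(records):
    """records: iterable of (hedged, was_wrong) pairs from a labelled eval set.
    Returns the error rate on hedged responses next to the base error rate."""
    records = list(records)
    if not records:
        return {"base_error_rate": None, "hedged_error_rate": None}

    hedged = [wrong for is_hedged, wrong in records if is_hedged]
    base = sum(1 for _, wrong in records if wrong) / len(records)
    hedged_rate = sum(hedged) / len(hedged) if hedged else None
    return {"base_error_rate": base, "hedged_error_rate": hedged_rate}


records = ([(True, True)] * 3 + [(True, False)]
           + [(False, True)] + [(False, False)] * 15)
print(hedging_error_rates(records))
```

If the hedged error rate sits close to the base rate, the hedges carry no information and users will learn to ignore them.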
Refusal precision and recall. Of refused queries, what fraction should have been refused? Of refusal-worthy queries, what fraction were refused?
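Refusal precision and recall follow the standard definitions, sketched here over hypothetical `(was_refused, should_refuse)` pairs drawn from a labelled eval set.

```python
def refusal_precision_recall(records):
    """records: iterable of (was_refused, should_refuse) pairs."""
    tp = fp = fn = 0
    for was_refused, should_refuse in records:
        if was_refused and should_refuse:
            tp += 1   # correct refusal
        elif was_refused and not should_refuse:
            fp += 1   # false refusal
        elif not was_refused and should_refuse:
            fn += 1   # missed refusal
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"precision": precision, "recall": recall}


refusals = ([(True, True)] * 8 + [(True, False)] * 2
            + [(False, True)] * 4 + [(False, False)] * 20)
print(refusal_precision_recall(refusals))
```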
Calibration requires a labelled eval set. Without one, calibration metrics are guesses dressed as numbers.
Adoption metrics
The traditional metrics, in their place.
Weekly active users by cohort. Segment by onboarding cohort, role, account tenure, prior AI exposure.
Depth-of-use. Number of distinct flows the user engages with, not just sessions.
Time-to-second-session. Did the user come back? When? Faster returns are usually better, with the exception of high-stakes infrequent-use features where returning fast may indicate a problem with the prior session.
Cohort retention curves. The Nth-week return rate by acquisition cohort. Reveals the effect of onboarding and capability changes more reliably than aggregate figures.
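A cohort retention curve can be sketched from two lookups. Both input shapes (`first_week` and `active_weeks`, keyed by user and using integer week indices) are assumptions for the example.

```python
from collections import defaultdict


def retention_curves(first_week, active_weeks, horizon=4):
    """first_week:   {user: week index of the user's first session}
    active_weeks: {user: set of week indices with any activity}
    Returns {cohort_week: [share active in week +0, +1, ..., +horizon]}."""
    cohorts = defaultdict(list)
    for user, week in first_week.items():
        cohorts[week].append(user)

    curves = {}
    for cohort_week, users in sorted(cohorts.items()):
        curves[cohort_week] = [
            sum(1 for u in users
                if cohort_week + n in active_weeks.get(u, set())) / len(users)
            for n in range(horizon + 1)
        ]
    return curves


first_week = {"a": 0, "b": 0, "c": 1}
active_weeks = {"a": {0, 1, 2}, "b": {0}, "c": {1, 2}}
print(retention_curves(first_week, active_weeks, horizon=2))
```

Weeks beyond the observed data show up as zero, so in practice the curve for a recent cohort should be truncated at the last fully observed week.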
Safety metrics
Often understaffed in AI UX dashboards.
Refusal accuracy. Both directions — false refusals and missed refusals.
Distress-handling rate. When users show signs of distress (sentiment signals, explicit keywords), did the AI route them appropriately?
Harm-event count. Severity-classified. Track even when low; the absolute count matters more than the rate.
Compliance metric per regulated surface. Whatever the regulation requires; track on the same dashboard so it does not become a separate-team problem.
The misleading-metric trap
Two specific metrics deserve scepticism:
Thumbs-up rate. High thumbs-up rates can mean usefulness, but they can also mean habituated tapping, social-desirability bias, or selection (users who would tap thumbs-down abandon before tapping anything). Use as a coarse signal, not as a target.
Queries per session. High queries-per-session can mean rich engagement or frustrated retry. Pair with task completion or time-to-correct-answer to disambiguate.
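Pairing queries-per-session with task completion can be as simple as splitting the average by outcome. The `(query_count, task_completed)` pairs are a hypothetical session summary.

```python
from statistics import mean


def queries_per_session_by_outcome(sessions):
    """sessions: list of (query_count, task_completed) pairs."""
    done = [q for q, ok in sessions if ok]
    not_done = [q for q, ok in sessions if not ok]
    return {
        "overall": mean(q for q, _ in sessions) if sessions else None,
        "completed": mean(done) if done else None,
        "not_completed": mean(not_done) if not_done else None,
    }


sessions_qps = [(2, True), (3, True), (9, False), (10, False)]
print(queries_per_session_by_outcome(sessions_qps))
```

An overall average of 6 looks like healthy engagement here, but the split shows the long sessions are the ones that failed, which points to frustrated retry rather than rich use.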
The discipline is to treat any single metric as a hypothesis to verify, not a number to optimise.
How to build the dashboard
Three-tier structure:
- North-star (one or two). Time-to-correct-answer or task completion rate. The number the team rallies behind.
- Diagnostic (six to ten). One or two from each metric family. Used to interpret north-star moves.
- Safety floor (three to five). Hard thresholds that cannot regress regardless of north-star gains.
The dashboard is reviewed weekly at minimum. Major model, prompt, or UX changes are evaluated against the diagnostic layer, not just the north-star.
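The three tiers and the safety-floor check can be expressed as plain configuration plus a gate. The metric names and threshold values below are hypothetical placeholders chosen for illustration; real floors depend on the product and its regulatory context.

```python
DASHBOARD = {
    "north_star": ["time_to_correct_answer_s", "task_completion_rate"],
    "diagnostic": [
        "first_attempt_success", "confirmation_rate", "correction_rate",
        "retry_success_rate", "accuracy_by_confidence_band",
        "weekly_active_by_cohort",
    ],
    # (kind, limit): "min" means the value must not fall below the limit,
    # "max" means it must not rise above it. Values are illustrative only.
    "safety_floor": {
        "refusal_accuracy": ("min", 0.95),
        "distress_handling_rate": ("min", 0.98),
        "harm_event_count": ("max", 0),
    },
}


def safety_floor_violations(metrics, floors):
    """Return a human-readable list of floor breaches; empty means all clear."""
    violations = []
    for name, (kind, limit) in floors.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")  # unmeasured counts as a breach
        elif kind == "min" and value < limit:
            violations.append(f"{name}: {value} is below floor {limit}")
        elif kind == "max" and value > limit:
            violations.append(f"{name}: {value} is above ceiling {limit}")
    return violations


current = {"refusal_accuracy": 0.97, "distress_handling_rate": 0.99,
           "harm_event_count": 1}
violations = safety_floor_violations(current, DASHBOARD["safety_floor"])
print(violations or "safety floors hold")
```

Treating a missing metric as a violation is a deliberate choice in this sketch: a floor that is not being measured cannot be shown to hold.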
Qualitative signal
Numbers are necessary and insufficient. Qualitative signal — user research interviews, support ticket review, sales-call mentions, social-media mentions — adds the texture quantitative metrics miss.
Practical cadence:
- Weekly: ten support tickets reviewed for AI-specific friction.
- Monthly: five user research interviews focused on the AI surface.
- Quarterly: a deep review of the qualitative themes against the quantitative trends.
The combination catches what either alone misses.
Common mistakes
Engagement-only dashboards. Sessions and queries-per-session as the primary metrics.
No calibration measurement. The team has no idea whether confidence indicators match reality.
Thumbs-up as a target. The team optimises for a noisy signal and ends up rewarding habituated tapping.
No abandonment tracking. The most-failed users are invisible because they leave.
Safety on a separate dashboard. Safety regression is not seen until it is an incident.
No qualitative cadence. Numbers without texture.
Interview Q&A
Q1. The team's dashboard shows rising sessions, rising queries-per-session, and a high thumbs-up rate. The product team is celebrating. What is the platform engineer's view?
Sceptical. Queries-per-session can mean engagement or frustrated retry; thumbs-up rate is noisy and habituated. Without outcome metrics (task completion, time-to-correct-answer) and abandonment tracking, the dashboard cannot tell whether users are succeeding or fighting with the AI. Add outcome and repair metrics; review whether the celebrated trends survive the addition. Often they do not. Wrong-answer note: "engagement is the goal" mistakes activity for outcome.
Q2. Walk through how to measure calibration.
Slice predictions by the AI's stated confidence ("high," "medium," "low" or numeric bands) and compute accuracy within each. A well-calibrated AI is right at the stated rate within each band. An overconfident AI is right less often than its confidence implies, and an underconfident one is right more often; the gap between stated confidence and observed accuracy is the calibration error. This requires a labelled eval set (see 01_dataset_golden_set_operations). Without ground truth, calibration metrics are estimates dressed as facts. Wrong-answer note: "trust the AI's self-reported confidence" treats the very quantity under test as ground truth.
Q3. The correction rate is near zero. Is that good or bad?
Almost always bad. Zero correction means one of three things: the affordance is too hidden (users cannot easily push back), the users have given up (they are not pushing back because they have stopped trying), or the AI is genuinely flawless (rare). The diagnostic is to check the abandonment rate at the same time. If abandonment is low and correction is zero, the AI may be working. If abandonment is high and correction is zero, the correction affordance is broken or users have stopped trying. Wrong-answer note: "zero correction is the goal" ignores that correction is a healthy signal.
Q4. What is a north-star metric for an AI customer-support feature?
A composite of task resolution rate and time-to-resolution, weighted toward resolution. Sessions per user and queries per session are diagnostic, not north-star. CSAT or NPS is a secondary outcome metric, qualitative-leaning but still numeric. The discipline is to pick a north-star the team cannot game without genuinely improving the user's outcome. Wrong-answer note: "CSAT alone" can be gamed by selection effects in who responds to surveys.
Q5. How do you integrate safety metrics so they are not a separate-team afterthought?
Place them on the same primary dashboard as adoption and outcome. Define hard floors (refusal accuracy ≥ X, harm-event count below Y) that cannot regress regardless of north-star gains. Make the safety review part of the weekly review, not a quarterly compliance ceremony. The aim is that a safety regression is visible to the product team at the same speed as an adoption regression. Wrong-answer note: "safety is the security team's job" treats safety as adjacent rather than integral.
What to do differently after reading this
- Build a dashboard with outcome, trust, repair, calibration, adoption, and safety families.
- Pick a north-star that is hard to game without improving the user's outcome.
- Treat thumbs-up rate and queries-per-session as diagnostic, not as targets.
- Pair quantitative review with weekly support-ticket triage and monthly user research.
- Track abandonment-after-failure; the worst signal is the invisible one.
- Put safety on the primary dashboard with hard floors that cannot regress.
Bridge. Measurement closes the operating loop. The architect checklist condenses the module into the twenty items a designer or engineer can run through on any AI surface to catch the obvious failures before they ship. The next chapter is that checklist. → 12-architect-checklist.md