01. The feedback-loop problem
Before designing capture or pipelines, we feel the cost of operating without feedback: eval drift, label staleness, and prompt aging, the conditions that compound when the team has no production signal to inform iteration.
A platform engineer at a Bengaluru SaaS company finds her team's AI feature performing well by every internal metric. The eval score is 0.86; the model has not changed; the prompt has not changed; the audit shows steady performance. Meanwhile, the customer-success team has been quietly accumulating concerns. Customer-renewal conversations include "the assistant is helpful but I usually have to ask twice." Support tickets mention "the bot got my question wrong even though I asked clearly." None of this reaches the AI team. The audit cannot detect "user had to ask twice": that information lives in conversational patterns the team is not capturing. The customer-success team's concerns are individual; without systematic capture, they are anecdotes, not data. The eval stays at 0.86; the customers churn at a rate the team cannot diagnose.
This is the feedback-loop absence. The team operates on internal metrics; reality operates on user response; the two diverge.
What goes wrong without feedback loops
Five pathologies, each documented in production:
1. Eval drifts away from user perception
The team's eval scores cases against the team's labels. The team's labels reflect what the team thinks is correct. Without feedback, the team's notion of correct can drift from the user's notion. The eval stays high; users are unhappy; the team is confused.
2. Labels age silently
Cases labelled six months ago reflect the team's understanding then. Product policy may have changed; user expectations may have evolved; rubrics may have implicitly shifted. Without signal from real users, the team does not know the labels are stale.
3. Prompts age without renewal
The prompt that worked at launch may not be the best prompt now. New failure modes appear; user vocabulary changes; the model itself can shift (providers may update behaviour even behind a version the team treats as pinned). Without production signal, the team does not know which prompt revisions would help.
4. Model decisions ossify
The team picked the model at launch based on then-known eval performance. New models become available; their fit for the current workload is unknown. Without feedback signal on the current model, there is no baseline against which to compare a new one.
5. Anecdotes substitute for data
The customer-success team has impressions; the engineering team has metrics; the gap between them is filled by anecdotes. Decisions are driven by whoever is loudest in the last meeting; systematic learning is absent.
The five pathologies compound. A team with stale labels, an aging prompt, and an ossified model decision discovers all three only when a sustained customer-impact event forces an audit.
What feedback specifically provides
Each pathology has a matching feedback type that addresses it.
| Pathology | Feedback type |
|---|---|
| Eval drifts from user perception | Explicit feedback (thumbs, ratings) on production responses |
| Labels age silently | Production cases reaching customer support reveal mis-labels |
| Prompts age | Implicit signals (repeat-ask, abandonment) flag confusing prompts |
| Model decisions ossify | A/B test data with feedback shows the current model's relative performance |
| Anecdotes substitute for data | Structured feedback aggregates anecdotes into signal |
The feedback closes the loop. The eval set absorbs new cases; the labels are refreshed against current user expectations; the prompts iterate against measured user response; the model selection re-evaluates against current data.
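As a rough illustration of the first row of the table, the gap between the eval score and user perception can be measured once thumbs are joined to eval results. The sketch below is a minimal example under stated assumptions: `ScoredCase`, its fields, and `perception_gap` are hypothetical names, and it assumes each production response has already been scored against the team's labels and may or may not carry a thumbs rating.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ScoredCase:
    """Hypothetical record joining an eval result with explicit user feedback."""
    case_id: str
    eval_pass: bool              # response passed against the team's label
    user_thumb: Optional[bool]   # True = thumbs up, False = thumbs down, None = no rating


def perception_gap(cases: List[ScoredCase]) -> Optional[Tuple[float, float, List[str]]]:
    """Compare eval pass rate with thumbs-up rate on the cases users actually rated.

    Returns (eval_pass_rate, thumbs_up_rate, ids the eval passed but the user
    rejected), or None when no case carries a rating.
    """
    rated = [c for c in cases if c.user_thumb is not None]
    if not rated:
        return None
    eval_rate = sum(c.eval_pass for c in rated) / len(rated)
    user_rate = sum(bool(c.user_thumb) for c in rated) / len(rated)
    disagreements = [c.case_id for c in rated if c.eval_pass and not c.user_thumb]
    return eval_rate, user_rate, disagreements
```

The disagreement list is the useful output: those are the cases where the team's notion of correct and the user's notion have diverged, and they are natural candidates for label review. The rates are computed only over rated cases, so they inherit whatever bias the responders carry.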
Why teams skip feedback loops
Three patterns.
"We don't want to bother the user." Explicit feedback (thumbs) feels intrusive. The team avoids adding it. The user has no way to register a complaint short of customer support — which the AI team does not see.
"The metrics look fine." The eval score is 0.86; the latency is good; the cost is in budget. The team focuses on these. The signals that would reveal user dissatisfaction are not collected.
"We'll add feedback later." Like the eval backstop in module 01, the discipline is deferred. Six months later, the team has six months of operations without the signal that would have informed iteration.
Each pattern is intuitive in the short term and expensive in the medium term. The chapter-opening churn is the cost of all three combined.
What "feedback" actually is
Two broad classes.
Explicit feedback. The user actively provides a signal — thumbs up/down, a star rating, a comment, a structured "was this helpful?" response. Low volume (typically 1-5% of users respond); high specificity (the user is telling you what they think).
Implicit signals. The user's behaviour reveals their satisfaction without them stating it. Engaging further or abandoning, asking follow-up questions, repeating the same question, copying the response, sharing it. High volume (every user produces signal); lower specificity (interpretation required).
Both matter. Explicit alone is biased toward extreme reactions; implicit alone misses what the user thinks. Together they triangulate user perception.
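The "repeating the same question" signal is one of the easier implicit signals to sketch. The example below is a minimal illustration, not a recommended detector: it uses plain lexical similarity from Python's standard `difflib`, the function names and the 0.8 threshold are assumptions, and it will miss paraphrased repeats that a semantic comparison might catch.

```python
from difflib import SequenceMatcher
from typing import List


def is_repeat_ask(previous: str, current: str, threshold: float = 0.8) -> bool:
    """True when two consecutive user messages are near-identical.

    Messages are lower-cased and whitespace-normalised before comparison.
    The threshold is an illustrative assumption, not a tuned value.
    """
    a = " ".join(previous.lower().split())
    b = " ".join(current.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


def repeat_ask_rate(conversations: List[List[str]], threshold: float = 0.8) -> float:
    """Fraction of conversations containing at least one repeated question.

    Each conversation is the ordered list of the user's messages only.
    """
    if not conversations:
        return 0.0
    flagged = 0
    for user_messages in conversations:
        consecutive = zip(user_messages, user_messages[1:])
        if any(is_repeat_ask(prev, cur, threshold) for prev, cur in consecutive):
            flagged += 1
    return flagged / len(conversations)
```

Tracked over time, a rate like this turns "I usually have to ask twice" from an anecdote into a number, though interpreting it still requires care: a user may repeat a question for reasons unrelated to answer quality.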
The signal-to-eval pipeline
The feedback's destination is not just a dashboard. The signals feed back into the team's core artefacts:
- The eval set absorbs new cases (chapter 05).
- The judge is calibrated against user perception (chapter 06).
- The prompts iterate based on observed user struggle (chapter 10).
- Model selection is re-evaluated against current data (chapter 10).
The pipeline is the discipline that makes feedback actionable. Without the pipeline, feedback sits in a table; with it, feedback shapes the platform.
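One way to picture "feedback shapes the platform" is a small router that decides which artefact each feedback record should feed. This is a sketch under assumptions, not the pipeline the later chapters describe: `FeedbackRecord`, the queue names, and the routing rules are all hypothetical, and model selection is left out because it would need per-variant aggregation rather than per-record routing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FeedbackRecord:
    """Hypothetical feedback record for one production response."""
    request_id: str
    thumb: Optional[bool] = None   # explicit: True = up, False = down, None = absent
    repeat_ask: bool = False       # implicit: user repeated the question
    abandoned: bool = False        # implicit: user left without engaging further


def route(record: FeedbackRecord) -> List[str]:
    """Return the (hypothetical) artefact queues this record should feed."""
    queues = []
    if record.thumb is False:
        # Negative explicit feedback: candidate case for the eval set.
        queues.append("eval_set_candidates")
    if record.thumb is not None:
        # Any explicit rating can serve as a reference point for the judge.
        queues.append("judge_calibration")
    if record.repeat_ask or record.abandoned:
        # Signs of user struggle: input for prompt iteration.
        queues.append("prompt_review")
    return queues


def build_queues(records: List[FeedbackRecord]) -> Dict[str, List[str]]:
    """Group request ids by the artefact queue they were routed to."""
    queues: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for name in route(record):
            queues[name].append(record.request_id)
    return dict(queues)
```

The point of the sketch is the shape, not the rules: every record either lands in a queue that someone owns or is deliberately dropped, so nothing sits unread in a table by default.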
When feedback loops fail
Two failure modes for the loops themselves.
Captured but unused. Feedback is collected and stored, but never routed to the pipeline and never informs decisions. The chapter-opening case is partly this: thumbs collected but unread.
Used without bias awareness. Feedback acted on naively; the bias (chapter 07 covers this) distorts the decisions. The team's product decisions follow loud minority responses, not the broader distribution.
The disciplines in chapters 02-11 address both.
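To make the "loud minority" risk concrete, the sketch below contrasts a raw thumbs-down rate with a rate in which each user counts once. It is an illustration of one bias only, under assumed names and an assumed "mostly negative" rule; it does nothing about the self-selection of who responds at all.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each event is (user_id, thumbs_up). The shape is an assumption for illustration.
Event = Tuple[str, bool]


def raw_down_rate(events: List[Event]) -> float:
    """Share of all feedback events that are thumbs down."""
    if not events:
        return 0.0
    return sum(1 for _, up in events if not up) / len(events)


def per_user_down_rate(events: List[Event]) -> float:
    """Share of users whose own feedback is mostly negative; each user counts once."""
    by_user: Dict[str, List[bool]] = defaultdict(list)
    for user_id, up in events:
        by_user[user_id].append(up)
    if not by_user:
        return 0.0
    unhappy = sum(
        1 for thumbs in by_user.values() if sum(thumbs) < len(thumbs) / 2
    )
    return unhappy / len(by_user)
```

For example, if one user submits six thumbs down and four other users submit one rating each (three up, one down), the raw rate is 0.7 while the per-user rate is 0.4. Acting on the first number alone would let a single prolific user steer the decision.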
Interview Q&A
Q1. Why is "the eval score is 0.86" not a sufficient quality signal?
Because the eval scores against the team's labels and the cases the team selected; it does not capture user perception or production patterns not yet in the set. The score can be high while users are unhappy if the set's labels diverge from user expectation or if production cases are not represented in the set. Production feedback (explicit and implicit) is the ground-truth signal of user perception; the eval and the feedback must both be operated.
Wrong-answer notes: trusting the eval alone is the chapter-opening pathology.
Q2. Walk through how the five pathologies compound.
The eval drifts from user perception (the team's labels diverge from what users want). Labels age silently (no signal to refresh them). Prompts age without renewal (no signal to revise). Model decisions ossify (no signal to re-evaluate). Anecdotes substitute for data (no systematic capture to ground decisions). Six months in, the team has high eval scores, no new prompt iterations, the same model from launch, and decisions driven by whoever spoke last in the meeting. The gap between team perception and user reality is large.
Wrong-answer notes: treating the pathologies as independent misses the compounding.
Q3. The team avoids adding thumbs feedback because "it bothers users." How do you respond?
Two moves. One: implicit signals (chapter 03) capture user satisfaction without active prompting; the team has feedback even without thumbs. Two: explicit thumbs, if designed minimally (one click, optional comment), are not perceived as intrusive by users; the response rate is 1-5% but each response is informative. The cost of bothering is real but small; the cost of operating blind is the chapter-opening churn. The decision is "explicit + implicit", not "neither."
Wrong-answer notes: agreeing to skip both produces the operating-blind state.
Q4. What is the difference between collecting feedback and using it?
Collecting is the start; the signal is in a table. Using is routing the signal into the team's artefacts: the eval set absorbs new cases, the judge is calibrated, prompts iterate, model selection is re-evaluated. The chapter-opening case had thumbs collected and unread; the loop was unclosed. The pipeline (chapter 05 onward) is what closes it. A platform that collects without using is operating blind, plus a database nobody reads.
Wrong-answer notes: "we have the data" without the pipeline is the failure mode this module addresses.
What to do differently after reading this
- Treat operating without feedback as a known risk, not the default.
- Collect both explicit and implicit signals; each addresses different pathologies.
- Build the pipeline that converts signals into artefact updates, not just dashboards.
- Read the raw feedback on a regular cadence (monthly): what is the user telling us that we are not catching?
- Tie product and engineering decisions to the feedback signal, not to anecdotes.
Bridge. Diagnosis in hand, the next move is concrete: how do you capture explicit feedback well? Thumbs, ratings, comments — each has trade-offs. The next chapter is the discipline of explicit feedback capture. → 02-explicit-feedback-capture.md