00. Telemetry and feedback loops: First-principles overview
Module 01 of this category taught you to build and operate the golden set. This module is the discipline that keeps the set fed: production telemetry, user feedback capture, and the signal-to-eval pipeline that converts production reality into improvements to the eval, the prompts, and the model selection.
A platform engineer at a Pune SaaS company has built a strong eval set. Six months in, the set's score holds steady at 0.86, and customer-impact metrics are also steady. The team is pleased. An audit then finds two quieter problems. First, a small but persistent fraction of users repeatedly ask the same question across multiple turns ("but what about my actual case?"), a signal of misunderstanding that the platform never reads. Second, thumbs-down feedback is collected by the UI but not routed anywhere; the data sits in a table nobody queries. The team is operating blind to the production signal that would tell them what to improve. The fix is the feedback loop: capture the signals systematically, route them to the eval pipeline, and refresh the set with cases that represent what real users actually struggle with.
This module teaches that discipline. The signals are abundant; the question is how to capture, store, route, debias, and act on them. The eval set (module 01) is a workbench; feedback loops are how the workbench grows from what the system actually meets in production.
What feedback loops are for
Telemetry and feedback loops are the production-to-eval pipeline that keeps the team's quality discipline grounded in what the system actually does, for whom, with what user response.
Three concrete uses.
Surface failure modes the eval did not predict. Every novel user complaint, every implicit signal of struggle, every explicit thumbs-down is a candidate case for the eval set.
Calibrate the judge. Production user reactions are the ground truth that LLM-as-judge scores should align with. Feedback closes the gap between judge scores and user perception.
Drive prompt and model decisions. The system's behaviour on real production traffic is the most accurate signal of what to change next.
Without feedback loops, the eval drifts; the prompts age; the model choice ossifies; the team operates on confidence that is increasingly disconnected from reality.
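The judge-calibration use can be made concrete with a simple agreement check. The following is a minimal sketch, not a prescribed method: it assumes each case carries a judge score in [0, 1] and an explicit thumbs signal, and it treats a score at or above a hypothetical threshold of 0.5 as a judge "pass". The names `CalibrationCase`, `judge_agreement`, and `disagreements` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationCase:
    """One response with both a judge score and a user reaction (hypothetical schema)."""
    case_id: str
    judge_score: float  # assumed to lie in [0, 1]
    thumbs_up: bool     # explicit user feedback


def judge_agreement(cases: list[CalibrationCase], threshold: float = 0.5) -> float:
    """Fraction of cases where the judge's pass/fail verdict matches the user's thumb."""
    if not cases:
        raise ValueError("need at least one calibration case")
    matches = sum((c.judge_score >= threshold) == c.thumbs_up for c in cases)
    return matches / len(cases)


def disagreements(cases: list[CalibrationCase], threshold: float = 0.5) -> list[str]:
    """IDs of cases where judge and user disagree; candidates for rubric review."""
    return [c.case_id for c in cases if (c.judge_score >= threshold) != c.thumbs_up]


cases = [
    CalibrationCase("a", 0.9, True),
    CalibrationCase("b", 0.8, False),  # judge passes, user rejects
    CalibrationCase("c", 0.3, False),
    CalibrationCase("d", 0.7, True),
]
print(judge_agreement(cases))  # 0.75
print(disagreements(cases))    # ['b']
```

Raw agreement is the crudest possible alignment measure. Because explicit feedback is itself biased, the disagreement list is probably more useful as a reading queue than the headline number is as a metric.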
The six feedback surfaces
| Surface | One-liner | Pressure it answers |
|---|---|---|
| Explicit feedback | Thumbs, ratings, comments captured directly from users | direct user voice; small N but high signal per response |
| Implicit signals | Engagement, follow-up, abandonment, repeat-ask | large N; lower signal per case; representative of broader user behaviour |
| Storage and schema | Per-event structured records with provenance | retrievability: signals must be queryable and joinable |
| Pipeline to eval | Routing signals into the eval set's refresh process | conversion: signals are inputs, the eval set is the output |
| Bias awareness | Selection bias, response bias, sycophancy in feedback | reliability: not all feedback is unbiased |
| Privacy and retention | The signals are user data; same disciplines apply | governance: feedback storage is regulated too |
A seventh concern — incident response when feedback signals show problems — runs across the others and is its own chapter.
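Of the implicit signals in the table, repeat-ask is one that can be approximated cheaply from conversation logs. The sketch below assumes token-overlap (Jaccard) similarity between consecutive user turns and a hypothetical threshold of 0.5; both are illustrative choices, and a real system might use embedding similarity instead. The function names are not from any library.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a user turn."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two turns; 0.0 if either turn has no tokens."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def count_repeat_asks(user_turns: list[str], threshold: float = 0.5) -> int:
    """Count user turns that closely resemble the immediately preceding user turn."""
    return sum(
        jaccard(prev, cur) >= threshold
        for prev, cur in zip(user_turns, user_turns[1:])
    )


session = [
    "How do I export my invoices to CSV?",
    "How do I export my invoices to CSV for last year?",
    "Thanks, that works.",
]
print(count_repeat_asks(session))  # 1
```

A lexical measure like this misses paraphrased repeats ("but what about my actual case?") and will flag legitimate refinements, so the count is better treated as a triage signal than as a verdict.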
The recurring vocabulary
| Name | Surface | What it is |
|---|---|---|
| the explicit feedback | Explicit | Direct rating or comment from a user |
| the implicit signal | Implicit | Behaviour-derived signal (repeat-ask, abandonment, follow-up) |
| the feedback event | Storage | The per-call record capturing one signal |
| the conversion pipeline | Pipeline | The process that turns signals into eval cases |
| the judge calibration set | Pipeline | Cases with both user feedback and judge scores; the alignment substrate |
| the bias mitigation | Bias | The discipline that accounts for who responds and how |
| the feedback retention window | Privacy | The bounded period feedback is kept |
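To make "the feedback event" and "the conversion pipeline" less abstract, here is a minimal sketch of one possible shape. Every field name (`trace_id`, `prompt_version`, and so on), the set of negative signal labels, and the selection rule are assumptions made for illustration; the module's later chapters define the actual schema and process.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FeedbackEvent:
    """One captured signal with enough provenance to join back to the model call."""
    event_id: str
    trace_id: str         # hypothetical key linking to the logged request/response
    signal: str           # e.g. "thumbs_up", "thumbs_down", "repeat_ask", "abandonment"
    captured_at: datetime
    prompt_version: str   # provenance: which prompt produced the response
    model: str            # provenance: which model produced the response


# Assumed labels for signals that suggest the user struggled.
NEGATIVE_SIGNALS = {"thumbs_down", "repeat_ask", "abandonment"}


def eval_candidates(events: list[FeedbackEvent]) -> list[str]:
    """Trace IDs with at least one negative signal, deduplicated, oldest first."""
    seen: set[str] = set()
    ordered: list[str] = []
    for event in sorted(events, key=lambda e: e.captured_at):
        if event.signal in NEGATIVE_SIGNALS and event.trace_id not in seen:
            seen.add(event.trace_id)
            ordered.append(event.trace_id)
    return ordered


def _at(day: int) -> datetime:
    return datetime(2025, 1, day, tzinfo=timezone.utc)


events = [
    FeedbackEvent("e1", "t1", "thumbs_down", _at(10), "v3", "model-a"),
    FeedbackEvent("e2", "t1", "repeat_ask", _at(11), "v3", "model-a"),
    FeedbackEvent("e3", "t2", "thumbs_up", _at(12), "v3", "model-a"),
    FeedbackEvent("e4", "t3", "abandonment", _at(9), "v2", "model-a"),
]
print(eval_candidates(events))  # ['t3', 't1']
```

The output is a list of candidates, not eval cases: under this sketch a human or a further pipeline stage would still need to review each trace, write the expected behaviour, and apply the retention and privacy rules before anything enters the golden set.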
The journey
This module has two acts, followed by a short synthesis.
Act 1: Capture (files 01–05). What feedback signals exist; how to capture explicit and implicit signals; how to store them under a schema; how to convert them into eval and prompt feedback.
Act 2: Use (files 06–11). Calibrating the judge, accounting for bias, privacy in feedback, the cadence of looking at signals, closing the loop on prompts and models, and what to do when feedback reveals problems.
Synthesis (files 12–13). Architect checklist and honest admission.
Memory map
| # | File | Surface | What it adds |
|---|---|---|---|
| 01 | the-feedback-loop-problem | — | the cost of operating without production signals |
| 02 | explicit-feedback-capture | Explicit | thumbs, ratings, comments, structured prompts |
| 03 | implicit-signals | Implicit | engagement, follow-up, abandonment, repeat-ask |
| 04 | feedback-storage-and-schema | Storage | structured records with provenance |
| 05 | from-signal-to-eval | Pipeline | converting signals into eval cases |
| — milestone: the signal is captured — | |||
| 06 | judge-and-rubric-calibration | Pipeline | aligning judge scores with user perception |
| 07 | bias-in-feedback | Bias | selection, response, sycophancy |
| 08 | privacy-in-feedback | Privacy | governance applied to feedback data |
| 09 | feedback-cadence | Cross | the rhythm of looking and acting |
| 10 | closing-the-loop | Pipeline | feeding prompts, models, set decisions |
| 11 | feedback-incident-response | All | what to do when signals show systemic problems |
| — milestone: the loop is operational — | |||
| 12 | architect-checklist | Synthesis | 20 items |
| 13 | honest-admission | Boundaries | what feedback loops cannot solve |
How this module relates to its neighbours
- 01_dataset_golden_set_operations: chapter 02 there discussed sourcing from production; this module is the pipeline that does that systematically.
- 00_ai_evals_release_gates: covers eval as release gate; the feedback loop calibrates the judge that makes those gates meaningful.
- 03_ai_release_management: release decisions use both the gate (golden set + judge) and the feedback (production signal); this module produces the latter.
- 13_prompt_lifecycle_operations: prompts iterate based on the signals captured here.
- 03_ai_security_safety/03_data_access_governance: chapters 05, 06, 09 apply to feedback data.
Top resources
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" — https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
- Microsoft — Responsible AI feedback loop — https://learn.microsoft.com/en-us/azure/machine-learning/concept-responsible-ml
- Anthropic — building with feedback — https://docs.anthropic.com/en/docs/build-with-claude/
- OpenAI — collecting user feedback — https://platform.openai.com/docs/guides/
What's coming
- 01-the-feedback-loop-problem.md — Why eval drift and model staleness happen without production feedback.
- 02-explicit-feedback-capture.md — Thumbs, ratings, comments, structured forms.
- 03-implicit-signals.md — Engagement, follow-up, abandonment, repeat-ask.
- 04-feedback-storage-and-schema.md — Per-event structured records with provenance.
- 05-from-signal-to-eval.md — Converting signals into eval cases and prompt iterations.
- 06-judge-and-rubric-calibration.md — Aligning judge scores with user perception.
- 07-bias-in-feedback.md — Selection bias, response bias, sycophancy.
- 08-privacy-in-feedback.md — Governance on feedback data.
- 09-feedback-cadence.md — The rhythm of looking and acting.
- 10-closing-the-loop.md — Feeding prompts, models, eval set decisions.
- 11-feedback-incident-response.md — Systemic signals; what to do.
- 12-architect-checklist.md — Twenty items.
- 13-honest-admission.md — What feedback loops cannot solve.
Bridge. Before designing capture or pipelines, we feel the cost of operating without feedback. The first chapter is the diagnosis — eval drift, label staleness, prompt aging — that the rest of the module addresses. → 01-the-feedback-loop-problem.md