00. Telemetry and feedback loops: First-principles overview

Module 01 of this category taught you to build and operate the golden set. This module is the discipline that keeps the set fed: production telemetry, user feedback capture, and the signal-to-eval pipeline that converts production reality into improvements to the eval, the prompts, and the model selection.

A platform engineer at a Pune SaaS company has built a strong eval set. Six months in, the set's score holds at 0.86 and customer-impact metrics are also steady. The team is pleased. An audit then finds two quieter things. First, a small but persistent fraction of users repeatedly ask the same question across multiple turns ("but what about my actual case?"), a signal of misunderstanding that the platform never reads. Second, thumbs-down feedback is collected by the UI but routed nowhere; the data sits in a table nobody queries. The team is blind to the production signal that would tell them what to improve. The fix is the feedback loop: capture the signals systematically, route them to the eval pipeline, and refresh the set with cases that represent what real users actually struggle with.

This module teaches that discipline. The signals are abundant; the question is how to capture, store, route, debias, and act on them. The eval set (module 01) is a workbench; feedback loops are how the workbench grows from what the system actually meets in production.

What feedback loops are for

Telemetry and feedback loops are the production-to-eval pipeline that keeps the team's quality discipline grounded in what the system actually does, for whom, with what user response.

Three concrete uses.

Surface failure modes the eval did not predict. Every novel user complaint, every implicit signal of struggle, every explicit thumbs-down is a candidate case for the eval set.

Calibrate the judge. Production user reactions are the closest available ground truth for LLM-as-judge scores to align with. Comparing the two on the same cases exposes the gap between judge scores and user perception so that it can be closed.

Drive prompt and model decisions. The system's behaviour on real production traffic is the most accurate signal of what to change next.

Without feedback loops, the eval drifts; the prompts age; the model choice ossifies; the team operates on confidence that is increasingly disconnected from reality.
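
To make the judge-calibration use concrete, here is a minimal sketch. It assumes each calibration case pairs a binary judge verdict with a binary user reaction (thumbs-up or thumbs-down). The function name `judge_user_agreement` and the pair format are hypothetical, chosen for illustration; the sketch reports raw agreement and Cohen's kappa, which discounts the agreement expected by chance.

```python
from typing import Iterable, Tuple


def judge_user_agreement(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for (judge_pass, user_thumbs_up) pairs.

    Hypothetical helper: the pair format is an assumption, not a prescribed schema.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one calibration case")

    # Fraction of cases where the judge and the user gave the same verdict.
    observed = sum(1 for judge, user in pairs if judge == user) / n

    # Agreement expected by chance, from each rater's marginal positive rate.
    judge_pos = sum(1 for judge, _ in pairs if judge) / n
    user_pos = sum(1 for _, user in pairs if user) / n
    expected = judge_pos * user_pos + (1 - judge_pos) * (1 - user_pos)

    if expected == 1.0:
        # Both raters are constant and identical; kappa is undefined (0/0).
        # Convention chosen here: report 0.0, since the data carries no signal.
        return observed, 0.0

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


if __name__ == "__main__":
    cases = [(True, True), (True, True), (True, False), (False, False)]
    agreement, kappa = judge_user_agreement(cases)
    print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

A high raw agreement with a low kappa usually means one label dominates (for example, almost everything passes), which is worth checking before trusting the judge.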

The six feedback surfaces

Surface	One-liner	Pressure it answers
Explicit feedback	Thumbs, ratings, comments captured directly from users	direct user voice; small N but high signal per response
Implicit signals	Engagement, follow-up, abandonment, repeat-ask	large N; lower signal per case; representative of broader user behaviour
Storage and schema	Per-event structured records with provenance	retrievability: signals must be queryable and joinable
Pipeline to eval	Routing signals into the eval set's refresh process	conversion: signals are inputs, the eval set is the output
Bias awareness	Selection bias, response bias, sycophancy in feedback	reliability: not all feedback is unbiased
Privacy and retention	The signals are user data; same disciplines apply	governance: feedback storage is regulated too

A seventh concern — incident response when feedback signals show problems — runs across the others and is its own chapter.
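
As an illustration of the implicit-signals surface, the sketch below flags a session as a repeat-ask when two user turns are near-duplicates. It assumes user turns arrive as plain strings in order, and it uses token-set Jaccard similarity with a hypothetical threshold of 0.6. The names `has_repeat_ask` and `jaccard` are illustrative; a production detector would more likely use embeddings and a tuned threshold.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def has_repeat_ask(user_turns: List[str], threshold: float = 0.6) -> bool:
    """Return True if any two user turns in the session are near-duplicates.

    The 0.6 threshold is an assumption for illustration, not a tuned value.
    """
    token_sets = [_tokens(turn) for turn in user_turns]
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                return True
    return False


if __name__ == "__main__":
    session = [
        "How do I export my invoices to CSV?",
        "Thanks",
        "but how do I export my invoices to CSV for my account?",
    ]
    print(has_repeat_ask(session))
```

The pairwise comparison is quadratic in the number of turns, which is acceptable for a single session but would need batching or indexing across a full traffic log.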

The recurring vocabulary

Name	Surface	What it is
the explicit feedback	Explicit	Direct rating or comment from a user
the implicit signal	Implicit	Behaviour-derived signal (repeat-ask, abandonment, follow-up)
the feedback event	Storage	The per-event record capturing one signal
the conversion pipeline	Pipeline	The process that turns signals into eval cases
the judge calibration set	Pipeline	Cases with both user feedback and judge scores; the alignment substrate
the bias mitigation	Bias	The discipline that accounts for who responds and how
the feedback retention window	Privacy	The bounded period feedback is kept
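
The vocabulary above can be made concrete with a small sketch of a feedback event and one step of the conversion pipeline. Every field name (`event_id`, `trace_id`, `signal_name`, and so on), the 90-day retention default, and the `needs_review` status are assumptions chosen for illustration; this overview does not prescribe a schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class FeedbackEvent:
    """One captured signal with provenance (hypothetical field names)."""
    event_id: str
    trace_id: str          # joins the signal to the model call that produced the response
    signal_type: str       # "explicit" or "implicit"
    signal_name: str       # e.g. "thumbs_down", "repeat_ask", "abandonment"
    captured_at: datetime
    prompt_version: str    # provenance: which prompt was live
    model_id: str          # provenance: which model answered
    comment: Optional[str] = None

    def expires_at(self, retention: timedelta = timedelta(days=90)) -> datetime:
        """End of the retention window; the 90-day default is an assumption."""
        return self.captured_at + retention


# Signals treated as evidence of user struggle in this sketch.
NEGATIVE_SIGNALS = {"thumbs_down", "repeat_ask", "abandonment"}


def to_eval_candidate(event: FeedbackEvent, user_input: str) -> Optional[dict]:
    """Turn a negative signal into a candidate eval case awaiting human review."""
    if event.signal_name not in NEGATIVE_SIGNALS:
        return None
    return {
        "input": user_input,
        "source_event_id": event.event_id,
        "source_trace_id": event.trace_id,
        "reason": event.signal_name,
        "status": "needs_review",  # candidates are reviewed before joining the set
    }


if __name__ == "__main__":
    event = FeedbackEvent(
        event_id="e1",
        trace_id="t1",
        signal_type="explicit",
        signal_name="thumbs_down",
        captured_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
        prompt_version="p-v3",
        model_id="model-a",
    )
    print(to_eval_candidate(event, "but what about my actual case?"))
```

Keeping the source event and trace identifiers on the candidate preserves the join back to the original record, so a reviewer can see the response the user reacted to.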

The journey

This module has two acts and a short synthesis.

Act 1: Capture (files 01–05). What feedback signals exist; how to capture explicit feedback and implicit signals; how to store them under a schema; how to convert them into eval cases and prompt iterations.

Act 2: Use (files 06–11). Calibrating the judge, accounting for bias, privacy in feedback, the cadence of looking at signals, closing the loop on prompts and models, and what to do when feedback reveals problems.

Synthesis (files 12–13). Architect checklist and honest admission.

Memory map

#	File	Surface	What it adds
01	the-feedback-loop-problem	—	the cost of operating without production signals
02	explicit-feedback-capture	Explicit	thumbs, ratings, comments, structured prompts
03	implicit-signals	Implicit	engagement, follow-up, abandonment, repeat-ask
04	feedback-storage-and-schema	Storage	structured records with provenance
05	from-signal-to-eval	Pipeline	converting signals into eval cases
	— milestone: the signal is captured —
06	judge-and-rubric-calibration	Pipeline	aligning judge scores with user perception
07	bias-in-feedback	Bias	selection, response, sycophancy
08	privacy-in-feedback	Privacy	governance applied to feedback data
09	feedback-cadence	Cross	the rhythm of looking and acting
10	closing-the-loop	Pipeline	feeding prompts, models, set decisions
11	feedback-incident-response	All	what to do when signals show systemic problems
	— milestone: the loop is operational —
12	architect-checklist	Synthesis	20 items
13	honest-admission	Boundaries	what feedback loops cannot solve

How this module relates to its neighbours

01_dataset_golden_set_operations — chapter 02 there discussed sourcing from production; this module is the pipeline that does that systematically.
00_ai_evals_release_gates — covers eval as release gate; the feedback loop calibrates the judge that makes those gates meaningful.
03_ai_release_management — release decisions use both the gate (golden set + judge) and the feedback (production signal); this module produces the latter.
13_prompt_lifecycle_operations — prompts iterate based on the signals captured here.
03_ai_security_safety/03_data_access_governance — chapters 05, 06, 09 apply to feedback data.

Top resources

Sculley et al., "Hidden Technical Debt in Machine Learning Systems" — https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
Microsoft — Responsible AI feedback loop — https://learn.microsoft.com/en-us/azure/machine-learning/concept-responsible-ml
Anthropic — building with feedback — https://docs.anthropic.com/en/docs/build-with-claude/
OpenAI — collecting user feedback — https://platform.openai.com/docs/guides/

What's coming

01-the-feedback-loop-problem.md — Why eval drift and model staleness happen without production feedback.
02-explicit-feedback-capture.md — Thumbs, ratings, comments, structured forms.
03-implicit-signals.md — Engagement, follow-up, abandonment, repeat-ask.
04-feedback-storage-and-schema.md — Per-event structured records with provenance.
05-from-signal-to-eval.md — Converting signals into eval cases and prompt iterations.
06-judge-and-rubric-calibration.md — Aligning judge scores with user perception.
07-bias-in-feedback.md — Selection bias, response bias, sycophancy.
08-privacy-in-feedback.md — Governance on feedback data.
09-feedback-cadence.md — The rhythm of looking and acting.
10-closing-the-loop.md — Feeding prompts, models, eval set decisions.
11-feedback-incident-response.md — Systemic signals; what to do.
12-architect-checklist.md — Twenty items.
13-honest-admission.md — What feedback loops cannot solve.

Bridge. Before designing capture or pipelines, we feel the cost of operating without feedback. The first chapter is the diagnosis — eval drift, label staleness, prompt aging — that the rest of the module addresses. → 01-the-feedback-loop-problem.md