00. Telemetry and feedback loops — First-principles overview

Module 01 of this category taught you to build and operate the golden set. This module is the discipline that keeps the set fed: production telemetry, user feedback capture, and the signal-to-eval pipeline that converts production reality into improvements to the eval, the prompts, and the model selection.


A platform engineer at a Pune SaaS company has built a strong eval set. Six months in, the set's score holds at 0.86 and customer-impact metrics are also steady. The team is pleased. An audit then finds two quieter problems. First, a small but persistent fraction of users repeat the same question across multiple turns ("but what about my actual case?"), a signal of misunderstanding that the platform never reads. Second, a subset of feedback (thumbs-down) is collected by the UI but not routed anywhere; the data sits in a table nobody queries. The team is blind to the production signal that would tell them what to improve. The fix is the feedback loop: capture the signals systematically, route them to the eval pipeline, and refresh the set with cases that represent what real users actually struggle with.

This module teaches that discipline. The signals are abundant; the question is how to capture, store, route, debias, and act on them. The eval set (module 01) is a workbench; the feedback loops are how the workbench grows from what the system actually meets in production.


What feedback loops are for

Telemetry and feedback loops are the production-to-eval pipeline that keeps the team's quality discipline grounded in what the system actually does, for whom, with what user response.

Three concrete uses.

Surface failure modes the eval did not predict. Every novel user complaint, every implicit signal of struggle, every explicit thumbs-down is a candidate case for the eval set.

Calibrate the judge. Production user reactions are the ground truth that LLM-as-judge scores should align with. Feedback closes the gap between judge scores and user perception.

Drive prompt and model decisions. The system's behaviour on real production traffic is the most accurate signal of what to change next.

Without feedback loops, the eval drifts; the prompts age; the model choice ossifies; the team operates on confidence that is increasingly disconnected from reality.
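The judge-calibration use above can be made concrete with a small agreement check. This is a minimal sketch, not a prescribed method: the `CalibrationCase` record, the `pass_threshold` of 0.7, and the choice of plain agreement rate as the metric are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationCase:
    """One response that has both a judge score and explicit user feedback (hypothetical shape)."""
    case_id: str
    judge_score: float      # assumed to lie in [0, 1]
    user_thumbs_up: bool    # explicit feedback on the same response


def judge_user_agreement(cases, pass_threshold=0.7):
    """Fraction of cases where the judge's pass/fail verdict matches the user's thumbs."""
    if not cases:
        raise ValueError("need at least one calibration case")
    matches = sum((c.judge_score >= pass_threshold) == c.user_thumbs_up for c in cases)
    return matches / len(cases)


def disagreements(cases, pass_threshold=0.7):
    """Case ids where judge and user disagree; these are the ones worth reading by hand."""
    return [c.case_id for c in cases
            if (c.judge_score >= pass_threshold) != c.user_thumbs_up]


# Illustrative data only; not taken from any real system.
cases = [
    CalibrationCase("a", 0.90, True),
    CalibrationCase("b", 0.80, False),   # judge passes, user disagrees
    CalibrationCase("c", 0.40, False),
    CalibrationCase("d", 0.75, True),
]
agreement = judge_user_agreement(cases)
to_review = disagreements(cases)
```

A falling agreement rate is a prompt to re-read the rubric, not proof the judge is wrong: the users who leave feedback are a biased sample, which the bias surface below addresses.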


The six feedback surfaces

| Surface | One-liner | Pressure it answers |
|---|---|---|
| Explicit feedback | Thumbs, ratings, comments captured directly from users | direct user voice; small N but high signal per response |
| Implicit signals | Engagement, follow-up, abandonment, repeat-ask | large N; lower signal per case; representative of broader user behaviour |
| Storage and schema | Per-event structured records with provenance | retrievability: signals must be queryable and joinable |
| Pipeline to eval | Routing signals into the eval set's refresh process | conversion: signals are inputs, the eval set is the output |
| Bias awareness | Selection bias, response bias, sycophancy in feedback | reliability: not all feedback is unbiased |
| Privacy and retention | The signals are user data; same disciplines apply | governance: feedback storage is regulated too |

A seventh concern — incident response when feedback signals show problems — runs across the others and is its own chapter.
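As an illustration of the implicit surface, here is one way a repeat-ask could be detected within a single conversation. It is a sketch under stated assumptions: token-overlap (Jaccard) similarity and the 0.6 threshold are stand-ins chosen for simplicity, and the function names are hypothetical. A production system might use embeddings or a classifier instead.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens of a user turn."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two turns, from 0.0 (disjoint) to 1.0 (identical sets)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def repeat_ask_turns(user_turns, threshold=0.6):
    """Indices of user turns that closely repeat any earlier turn in the same conversation."""
    flagged = []
    for i in range(1, len(user_turns)):
        if any(jaccard(user_turns[i], prev) >= threshold for prev in user_turns[:i]):
            flagged.append(i)
    return flagged


# Illustrative conversation; only the user's turns are shown.
turns = [
    "How do I export invoices for last quarter?",
    "Thanks, and where is the billing page?",
    "But how do I export invoices for my last quarter?",
]
flagged = repeat_ask_turns(turns)
```

Here the third turn shares 8 of 10 distinct tokens with the first, so it is flagged as a repeat-ask while the unrelated second turn is not.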


The recurring vocabulary

| Name | Surface | What it is |
|---|---|---|
| the explicit feedback | Explicit | Direct rating or comment from a user |
| the implicit signal | Implicit | Behaviour-derived signal (repeat-ask, abandonment, follow-up) |
| the feedback event | Storage | The per-call record capturing one signal |
| the conversion pipeline | Pipeline | The process that turns signals into eval cases |
| the judge calibration set | Pipeline | Cases with both user feedback and judge scores; the alignment substrate |
| the bias mitigation | Bias | The discipline that accounts for who responds and how |
| the feedback retention window | Privacy | The bounded period feedback is kept |
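Three of these terms (the feedback event, the conversion pipeline, the feedback retention window) fit together in a few lines. The sketch below is illustrative only: every field name, the 90-day window, and the set of signals treated as negative are assumptions, not a schema this module prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FeedbackEvent:
    """One signal with enough provenance to join back to the call (hypothetical fields)."""
    event_id: str
    request_id: str        # joins the signal to the model call that produced the response
    kind: str              # "explicit" or "implicit"
    signal: str            # e.g. "thumbs_down", "thumbs_up", "repeat_ask"
    prompt_version: str
    model: str
    created_at: datetime   # timezone-aware


RETENTION = timedelta(days=90)                                  # assumed window
NEGATIVE_SIGNALS = {"thumbs_down", "repeat_ask", "abandonment"}  # assumed set


def within_retention(event, now, window=RETENTION):
    """True while the event is still inside the retention window."""
    return now - event.created_at <= window


def eval_candidates(events, now):
    """Negative signals still in retention; each is a candidate eval case, not yet a case."""
    return [e for e in events
            if e.signal in NEGATIVE_SIGNALS and within_retention(e, now)]


# Illustrative data only.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
events = [
    FeedbackEvent("e1", "r1", "explicit", "thumbs_down", "v3", "model-a",
                  datetime(2025, 5, 20, tzinfo=timezone.utc)),
    FeedbackEvent("e2", "r2", "explicit", "thumbs_up", "v3", "model-a",
                  datetime(2025, 5, 25, tzinfo=timezone.utc)),
    FeedbackEvent("e3", "r3", "implicit", "repeat_ask", "v2", "model-a",
                  datetime(2025, 1, 1, tzinfo=timezone.utc)),   # past the window
]
candidates = eval_candidates(events, now)
```

Keeping `request_id`, `prompt_version`, and `model` on every event is what makes a signal joinable later; a thumbs-down without them cannot be traced to the prompt or model that caused it.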

The journey

This module has two acts.

Act 1: Capture (files 01–05). What feedback signals exist; how to capture explicit and implicit signals; how to store them under a schema; how to convert them into eval and prompt feedback.

Act 2: Use (files 06–11). Calibrating the judge, accounting for bias, privacy in feedback, the cadence of looking at signals, closing the loop on prompts and models, and what to do when feedback reveals problems.

Synthesis (files 12–13). Architect checklist and honest admission.


Memory map

| # | File | Surface | What it adds |
|---|---|---|---|
| 01 | the-feedback-loop-problem | | the cost of operating without production signals |
| 02 | explicit-feedback-capture | Explicit | thumbs, ratings, comments, structured prompts |
| 03 | implicit-signals | Implicit | engagement, follow-up, abandonment, repeat-ask |
| 04 | feedback-storage-and-schema | Storage | structured records with provenance |
| 05 | from-signal-to-eval | Pipeline | converting signals into eval cases |
| | milestone: the signal is captured | | |
| 06 | judge-and-rubric-calibration | Pipeline | aligning judge scores with user perception |
| 07 | bias-in-feedback | Bias | selection, response, sycophancy |
| 08 | privacy-in-feedback | Privacy | governance applied to feedback data |
| 09 | feedback-cadence | Cross | the rhythm of looking and acting |
| 10 | closing-the-loop | Pipeline | feeding prompts, models, set decisions |
| 11 | feedback-incident-response | All | what to do when signals show systemic problems |
| | milestone: the loop is operational | | |
| 12 | architect-checklist | Synthesis | 20 items |
| 13 | honest-admission | Boundaries | what feedback loops cannot solve |

How this module relates to its neighbours


Top resources

  • Sculley et al., "Hidden Technical Debt in Machine Learning Systems" — https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
  • Microsoft — Responsible AI feedback loop — https://learn.microsoft.com/en-us/azure/machine-learning/concept-responsible-ml
  • Anthropic — building with feedback — https://docs.anthropic.com/en/docs/build-with-claude/
  • OpenAI — collecting user feedback — https://platform.openai.com/docs/guides/

What's coming

  1. 01-the-feedback-loop-problem.md — Why eval drift and model staleness happen without production feedback.
  2. 02-explicit-feedback-capture.md — Thumbs, ratings, comments, structured forms.
  3. 03-implicit-signals.md — Engagement, follow-up, abandonment, repeat-ask.
  4. 04-feedback-storage-and-schema.md — Per-event structured records with provenance.
  5. 05-from-signal-to-eval.md — Converting signals into eval cases and prompt iterations.
  6. 06-judge-and-rubric-calibration.md — Aligning judge scores with user perception.
  7. 07-bias-in-feedback.md — Selection bias, response bias, sycophancy.
  8. 08-privacy-in-feedback.md — Governance on feedback data.
  9. 09-feedback-cadence.md — The rhythm of looking and acting.
  10. 10-closing-the-loop.md — Feeding prompts, models, eval set decisions.
  11. 11-feedback-incident-response.md — Systemic signals; what to do.
  12. 12-architect-checklist.md — Twenty items.
  13. 13-honest-admission.md — What feedback loops cannot solve.

Bridge. Before designing capture or pipelines, we feel the cost of operating without feedback. The first chapter is the diagnosis — eval drift, label staleness, prompt aging — that the rest of the module addresses. → 01-the-feedback-loop-problem.md