05. From signal to eval
Captured signals sit in storage until something converts them into the team's artefacts. The signal-to-eval pipeline is the discipline that turns raw feedback into eval cases, judge calibration data, and prompt-iteration inputs.
A platform engineer at a Chennai SaaS company runs the feedback dashboard weekly. Negative thumbs are at 27%; the dashboard shows the count. Six months in, no eval case has been added based on a thumbs-down. The team has the signal and no path from signal to action. The fix is a weekly conversion process: a small batch of negative-thumbs cases (10-30 per week) goes through review — the response_id retrieves the audit; the case is read by a domain expert; if the case represents a failure mode worth preventing, it enters the eval set with the appropriate label. After three months, the eval set has grown by 200 cases, all sourced from real production negative feedback; the failure modes the eval catches now closely match the failure modes users encounter.
This chapter is that conversion process. The pipeline turns signal into artefact; without the pipeline, the signal is dashboard-only.
What the pipeline does
The pipeline converts production feedback into three artefacts:
| Artefact | What feedback contributes |
|---|---|
| Eval set cases | Cases representing failure modes that need regression-prevention |
| Judge calibration data | Production cases with explicit feedback used to calibrate LLM-as-judge against user perception |
| Prompt iteration inputs | Patterns in feedback that suggest prompt revisions |
Each artefact is fed by its own sub-pipeline, with its own inputs and cadence.
The eval-case pipeline
This is the most important of the three pipelines. The flow:
1. Filter feedback. Last week's negative thumbs + comments + critical implicit signals.
2. Sample. 30-60 events per week (more if volume warrants).
3. Retrieve. For each event, the response_id pulls the audit.
4. Triage. A reviewer (the labeller role from chapter 03 of module 01) reads the case and the user's reaction. Is this a failure mode worth preventing?
5. Label. If yes, label the expected behaviour per chapter 03 of module 01.
6. Stratum tag. Assign failure mode, segment, input shape tags.
7. Add to eval set. Per the contribution flow of chapter 10 of module 01.
8. Re-baseline. The case enters the next eval run; the baseline is recorded.
The pipeline produces 10-30 new eval cases per week for an active platform. The eval set grows from production reality, not from imagination.
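Steps 1 to 3 of the flow above (filter, sample, retrieve) are mechanical and can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the `FeedbackEvent` shape, the `created_at` field, and the `fetch_audit` callable are assumptions made for illustration, and the filter covers only negative thumbs, not the implicit signals the flow also mentions.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass(frozen=True)
class FeedbackEvent:
    # Hypothetical record shape; adapt to the real feedback store.
    response_id: str
    signal_value: str              # "up" or "down"
    comment_text: Optional[str]
    created_at: datetime


def weekly_review_batch(
    events: list,
    fetch_audit: Callable[[str], dict],   # hypothetical audit lookup by response_id
    now: datetime,
    sample_size: int = 30,
    seed: int = 0,
) -> list:
    """Filter last week's negative thumbs, sample them, and attach each audit."""
    cutoff = now - timedelta(days=7)
    candidates = [
        e for e in events
        if e.signal_value == "down" and cutoff <= e.created_at <= now
    ]
    rng = random.Random(seed)  # seeded so the same batch can be reproduced
    sampled = rng.sample(candidates, k=min(sample_size, len(candidates)))
    return [(e, fetch_audit(e.response_id)) for e in sampled]
```

The returned pairs are what the reviewer reads in the triage step; everything after this point is human judgement.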
What to triage and what to skip
Not every piece of negative feedback becomes an eval case.
Skip:
- The user's expectation was unreasonable (the system was correct).
- The case is a duplicate of an existing eval case (the failure mode is covered).
- The case is a one-off random failure (no pattern).
- The case is from a sandbox or test session.
Add:
- A failure mode the eval set does not currently cover.
- A previously-fixed failure mode that has recurred.
- A user expectation the system did not meet that aligns with product intent.
- A new segment whose patterns the eval set under-represents.
The triage is the human judgement step; usually a domain expert in a 1-hour weekly session covers 30-60 cases.
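One way to keep triage outcomes auditable is to record the reviewer's decision explicitly and only build an eval case when the decision is "add". The sketch below assumes hypothetical names throughout (`Triage`, `EvalCase`, `to_eval_case`, and the three tag keys); the skip reasons mirror the skip list above, and the decision itself still comes from the human reviewer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical tag keys for failure mode, segment, and input shape.
REQUIRED_TAGS = ("failure_mode", "segment", "input_shape")


class Triage(Enum):
    ADD = "add"
    SKIP_UNREASONABLE_EXPECTATION = "unreasonable_expectation"
    SKIP_DUPLICATE = "duplicate"
    SKIP_ONE_OFF = "one_off"
    SKIP_SANDBOX = "sandbox"


@dataclass(frozen=True)
class EvalCase:
    response_id: str
    expected_behaviour: str
    tags: dict
    provenance: str


def to_eval_case(response_id: str, decision: Triage, expected_behaviour: str,
                 tags: dict, feedback_date: str) -> Optional[EvalCase]:
    """Return an eval case only when the reviewer chose ADD; skips yield None."""
    if decision is not Triage.ADD:
        return None
    missing = [t for t in REQUIRED_TAGS if t not in tags]
    if missing:
        raise ValueError(f"missing stratum tags: {missing}")
    return EvalCase(
        response_id=response_id,
        expected_behaviour=expected_behaviour,
        tags=dict(tags),
        provenance=f"production feedback, {feedback_date}",
    )
```

Recording the skip reason as well as the add decision makes it possible to see later why the set did not grow in a given week.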
The judge calibration pipeline
The judge (LLM-as-judge from 00_ai_evals_release_gates) needs calibration against user perception.
The flow:
1. Identify cases with both explicit user feedback AND eval set membership.
   - These are cases that have both a user rating and a judge score.
2. For each case, compare:
   - Judge score on the case
   - User feedback (thumbs up/down, rating)
3. Aggregate: how aligned is the judge with the user?
   - Agreement >0.80: judge is calibrated.
   - Agreement 0.60-0.80: judge is reasonable; refine rubric.
   - Agreement <0.60: judge is misaligned; investigate and recalibrate.
4. For systematically-misaligned cases, refine the rubric.
   - Cases where the judge said "good" but the user said "bad" reveal what the rubric is missing.
   - Cases where the judge said "bad" but the user said "good" reveal where the rubric is too strict.
5. Run the refined judge against the calibration set; re-measure agreement.
A monthly calibration cycle keeps the judge aligned with user perception. Without it, judge drift (chapter 11 of module 01) goes undetected.
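The aggregation in step 3 can be sketched as follows. The chapter does not define "agreement" precisely; this sketch assumes it is the plain fraction of cases where the judge and the user reach the same good/bad verdict, with both signals already reduced to booleans. The names `CalibrationCase` and `calibration_report` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationCase:
    case_id: str
    judge_good: bool   # judge score already thresholded to good/bad (assumption)
    user_good: bool    # thumbs up -> True, thumbs down -> False


def calibration_report(cases: list) -> dict:
    """Compute judge/user agreement and split the disagreements by direction."""
    if not cases:
        raise ValueError("calibration set is empty")
    agreement = sum(c.judge_good == c.user_good for c in cases) / len(cases)
    if agreement > 0.80:
        band = "calibrated"
    elif agreement >= 0.60:
        band = "reasonable; refine rubric"
    else:
        band = "misaligned; investigate and recalibrate"
    return {
        "agreement": agreement,
        "band": band,
        # Judge said good, user said bad: what the rubric is missing.
        "rubric_missing": [c.case_id for c in cases if c.judge_good and not c.user_good],
        # Judge said bad, user said good: where the rubric is too strict.
        "rubric_too_strict": [c.case_id for c in cases if not c.judge_good and c.user_good],
    }
```

The two disagreement lists are the reading material for step 4; the agreement number alone does not say which way the rubric should move.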
The prompt-iteration pipeline
Patterns in feedback inform prompt revisions.
The flow:
1. Group negative feedback by prompt version and failure mode.
2. For each (prompt, failure mode) combination with sufficient volume, pull a sample of cases.
3. Read the cases. What is the prompt doing or failing to do?
4. Propose a prompt revision that addresses the pattern.
5. Test the revision against the regression eval set (module 01).
6. If the revision passes, canary it (chapter 05 of `13_prompt_lifecycle_operations`).
7. Monitor the feedback rate on the revised prompt.
The pipeline is human-led; an LLM may suggest revisions, but the human judgement on what to ship is essential.
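Steps 1 and 2 of this flow (grouping and sampling) can be automated so the human time goes into reading. A minimal sketch follows, assuming each negative-feedback event is a dict with hypothetical keys `prompt_version`, `failure_mode`, and `response_id`. The chapter does not say what counts as "sufficient volume", so `min_volume` is a placeholder to tune.

```python
import random
from collections import defaultdict


def groups_to_review(events, min_volume=5, sample_size=10, seed=0):
    """Group negative feedback by (prompt_version, failure_mode) and sample large groups."""
    groups = defaultdict(list)
    for event in events:
        key = (event["prompt_version"], event["failure_mode"])
        groups[key].append(event["response_id"])

    rng = random.Random(seed)  # seeded so the same sample can be reproduced
    result = {}
    # Largest groups first, so the reviewer starts with the most common pattern.
    for key, ids in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
        if len(ids) >= min_volume:
            result[key] = rng.sample(ids, k=min(sample_size, len(ids)))
    return result
```

The output is a reading list per (prompt, failure mode) pair; proposing and shipping a revision remains a human decision.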
Throughput and the human bottleneck
The pipeline is bottlenecked by human review. A team that captures 1,000 feedback events per week but reviews 0 produces no artefact changes. A team that reviews 30 cases per week, with 10-15 becoming eval cases and 5-10 driving prompt iterations, makes meaningful progress.
The throughput is set by the team's reviewer capacity, not the feedback volume. The discipline is to allocate weekly time for the review; without allocation, the queue grows and the pipeline stalls.
A reasonable allocation: 2-4 hours per week for review and labelling, distributed across reviewers.
Integration with the existing artefacts
The pipeline does not create new artefacts; it feeds existing ones:
- New eval cases enter the set via the contribution flow (chapter 10 of module 01).
- Judge refinements update the existing judge prompt (versioned per `13_prompt_lifecycle_operations`).
- Prompt revisions update the existing prompt registry (chapter 05 of `14_legacy_ai_modernization` and `13_prompt_lifecycle_operations`).
The pipeline is the source of changes; the existing disciplines apply for how the changes ship.
When the pipeline reveals systemic problems
Sometimes a sustained pattern in feedback indicates something larger than per-case iteration:
- Repeated negative feedback on a specific feature → the feature's design may be wrong, not just the prompt.
- High repeat-ask rate across many prompts → the system's understanding of the domain may be wrong, not just specific responses.
- Sudden drop in positive feedback after a model migration → the migration may have regressed despite the eval gates.
These are chapter 11 territory — feedback incident response.
Common mistakes
Dashboard without pipeline. Feedback is visible, never converted.
Pipeline without triage. Every feedback event added to the eval set; the set bloats with noise.
Triage without domain experts. Engineers labelling cases they do not have the context for.
One-time pipeline run. The pipeline runs once; the team feels good; the discipline lapses.
Judge calibration skipped. The judge drifts from user perception silently.
Interview Q&A
Q1. The dashboard shows 27% negative thumbs but no eval case has been added in six months. What is happening?
The team has captured signal but has not converted it into artefacts. The pipeline that connects feedback to the eval set has not been built or run. The team's iteration loop is still operating on the original eval set; the production feedback is informational only. The fix is to establish the weekly conversion process: filter, sample, retrieve, triage, label, add. After three months, the eval set reflects production reality.
Wrong-answer notes: "the team needs better dashboards" misses that the dashboard is operating; the missing piece is the pipeline.
Q2. Walk through how a weekly negative-thumbs review becomes new eval cases.
Filter: last week's events with signal_value = down and (optionally) comment_text present. Sample: 30-60 events. For each, retrieve the audit by response_id; read the input, the system's response, the user's comment. Triage: skip if the user's expectation was wrong, if it's a duplicate, if it's a one-off. Add if it represents a failure mode the eval should prevent. Label: expected behaviour per chapter 03 of module 01. Stratum tag. Add to eval set with provenance noting the source as "production feedback, date Y." Re-baseline. The pipeline produces 10-30 new cases per week for an active platform.
Wrong-answer notes: "every negative thumbs becomes a case" produces eval bloat.
Q3. How do you calibrate the judge against user perception?
Find cases that have both user feedback (thumbs/rating) and a judge score on the eval. For each, compare the judge's score to the user's signal. Aggregate the agreement: above 0.80 is calibrated, 0.60-0.80 is reasonable, below 0.60 is misaligned. For misaligned cases, refine the rubric: cases where the judge said "good" but the user said "bad" reveal what the rubric misses; cases where the judge said "bad" but the user said "good" reveal where it is too strict. Re-measure agreement after refinement. A monthly cycle keeps the judge aligned.
Wrong-answer notes: "the judge is a separate concern" misses that user feedback is the ground truth the judge should approximate.
Q4. The feedback volume is 1,000 events per week; the team reviews 0. What is the constraint?
Reviewer time. The pipeline is bottlenecked by human attention, not by data. Capturing more does not help if no one reviews. Allocate 2-4 hours per week for review and labelling. Distribute across team members. With allocation, 30-60 cases are reviewed weekly; 10-15 become eval cases. Without allocation, the queue grows and the pipeline stalls. The pipeline is a process discipline, not a tool.
Wrong-answer notes: "we need a better tool" misses that the tool is fine; the time is the gap.
What to do differently after reading this
- Build the weekly conversion process — schedule it, allocate reviewer time, track its output.
- Triage cases per the skip/add discipline; do not add everything.
- Calibrate the judge against user feedback monthly.
- Use feedback patterns to drive prompt iteration, not just to monitor.
- When patterns are systemic, escalate to chapter 11's incident response.
Bridge. The signal feeds the artefacts. The judge calibration is one of the artefacts that benefits most directly. The next chapter is the discipline of using feedback to keep the judge aligned with user perception over time. → 06-judge-and-rubric-calibration.md