03. Labelling discipline

Sourced cases are the inputs; labels are the spec. The labels are the team's accumulated judgement of what the system should do. If the labels are wrong, the eval scores the system against something other than what the team believes it is measuring. Labelling discipline is about getting them right.

A platform engineer at a Chennai legal-tech company labels eval cases for a contract-review agent alone, within a single sprint. The set runs against the system; the score is 0.91. Three months later a senior lawyer reviews the set as part of a routine audit. Of the 100 cases, the lawyer disagrees with the labels on 23: sometimes about what "correct" means in legal terms, sometimes about what "complete" means for a contract review, sometimes about acceptable phrasing in a regulated communication. The 0.91 score was honest about what it measured; what it measured was the engineer's judgement, not the domain expert's. The set is partially rebuilt; the engineer learns that labelling alone, without a domain expert, produced a number with less authority than anyone realised.

This chapter describes the discipline that prevents that outcome: the right labellers, the right process, and the calibration that aligns multiple judgements.

Who labels

The right labeller is whoever decides what the system should do in production. For most systems:

Domain experts — lawyers for legal-tech, doctors for healthcare, support specialists for customer support, financial advisors for finance. They know what correct means in the domain.
Product managers — for cases where the judgement is product-policy (what should the agent say in this situation? what should it refuse?).
Engineers — for cases where the judgement is technical (does the agent invoke the correct tool with valid arguments?).

A single labeller is rarely enough. Two labellers with calibration (below) produce more reliable labels than one. For high-stakes domains, three labellers resolving each case by consensus or majority vote is the norm.

What a label looks like

Three styles, picked per case type (chapter 07 deepens this).

Exact-match. The expected output is one specific value. Used for structured outputs (classifications, extractions, decisions).

input: { customer_query: "what is the status of order 12345?" }
expected:
  intent: "order_status_inquiry"
  order_id: "12345"
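A minimal sketch of how an exact-match label might be scored, assuming both the expected label and the system's output are flat dictionaries. The function name `exact_match` and the choice to ignore extra output fields are illustrative assumptions, not part of any particular eval framework.

```python
def exact_match(expected: dict, actual: dict) -> bool:
    """Return True only if every expected field is present with the same value.

    Assumption: extra fields in `actual` are ignored. A stricter check
    would require `expected == actual`.
    """
    return all(
        key in actual and actual[key] == value
        for key, value in expected.items()
    )


expected = {"intent": "order_status_inquiry", "order_id": "12345"}

good = {"intent": "order_status_inquiry", "order_id": "12345"}
bad = {"intent": "order_status_inquiry", "order_id": "12354"}  # digits transposed

print(exact_match(expected, good))  # True
print(exact_match(expected, bad))   # False
```

Exact match is unforgiving by design: a transposed digit in an extracted order ID fails the case, which is usually what the team wants for structured outputs.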

Reference output. A known-good output. The system's output is compared by similarity (semantic, rubric-graded).

input: { ... }
expected_reference: "The order is currently in transit, expected delivery on..."
similarity_threshold: 0.80
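A sketch of the threshold check, under loud assumptions. The similarity function is pluggable; the default here is a character-level ratio from Python's `difflib`, which is only a lexical stand-in and not a semantic measure. A real set would more plausibly plug in an embedding-based or rubric-graded scorer. The names `passes_reference` and `lexical_similarity` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]. A stand-in, not a semantic measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def passes_reference(
    output: str,
    reference: str,
    threshold: float = 0.80,
    similarity: Callable[[str, str], float] = lexical_similarity,
) -> bool:
    """Pass when the output is at least `threshold` similar to the reference."""
    return similarity(output, reference) >= threshold


reference = "The order is currently in transit."

print(passes_reference("The order is currently in transit.", reference))  # True
print(passes_reference("I cannot help with that.", reference))            # False
```

The threshold is part of the label: moving it from 0.80 to 0.70 changes what the set accepts, so it deserves the same review as the reference text itself.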

Rubric. A set of criteria the output must satisfy. Most flexible; most common for open-ended outputs.

input: { ... }
rubric:
  - "Mentions the actual delivery date or estimated range"
  - "Acknowledges the customer's order number"
  - "Avoids promising specific times not in the data"
  - "Is under 100 words"
must_not:
  - "Discloses internal logistics information"
  - "Promises a refund without authorisation"
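One way a rubric label could be graded, sketched under assumptions: every `rubric` criterion must hold and no `must_not` criterion may hold. The judge is a caller-supplied function (in practice a human grader or an LLM grader); `RubricLabel`, `RubricResult`, `grade`, and the toy `keyword_judge` are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable

# A judge answers: "does this output satisfy this criterion?"
Judge = Callable[[str, str], bool]


@dataclass
class RubricLabel:
    rubric: list[str]
    must_not: list[str] = field(default_factory=list)


@dataclass
class RubricResult:
    failed_criteria: list[str]  # required criteria the output did not meet
    violations: list[str]       # forbidden criteria the output triggered

    @property
    def passed(self) -> bool:
        return not self.failed_criteria and not self.violations


def grade(output: str, label: RubricLabel, judge: Judge) -> RubricResult:
    failed = [c for c in label.rubric if not judge(output, c)]
    violated = [c for c in label.must_not if judge(output, c)]
    return RubricResult(failed_criteria=failed, violations=violated)


label = RubricLabel(
    rubric=["Acknowledges the customer's order number", "Is under 100 words"],
    must_not=["Promises a refund without authorisation"],
)


def keyword_judge(output: str, criterion: str) -> bool:
    """Toy judge for the demo; real judgement needs a human or a graded model."""
    checks = {
        "Acknowledges the customer's order number": "12345" in output,
        "Is under 100 words": len(output.split()) < 100,
        "Promises a refund without authorisation": "refund" in output.lower(),
    }
    return checks[criterion]


print(grade("Order 12345 is in transit.", label, keyword_judge).passed)         # True
print(grade("We will refund order 12345 today.", label, keyword_judge).passed)  # False
```

Returning the failed criteria and violations, rather than a bare pass or fail, keeps the result useful during calibration: labellers can see which criterion a disagreement is about.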

The label style should fit the case. Mixed-style sets are common; one style across the whole set is rarely optimal.

The labelling process

A reasonable workflow.

1. Brief the labellers. A short document explaining the system's intent, the labelling style for each case type, the criteria for "good" outputs, examples of accepted and rejected outputs. The brief is the calibration substrate.

2. Initial label pass. Each labeller works through the cases independently. Disagreement is fine; the process catches it next.

3. Calibration session. All labellers meet, synchronously or asynchronously. Cases with disagreement are discussed. Often the discussion reveals an ambiguity in the brief or the rubric; the brief is updated, and labellers re-label affected cases.

4. Consensus. The final label is agreed by consensus on each case. For domains where consensus is hard (subjective judgement), majority of three is acceptable. Cases with no consensus are excluded from the set — they are not stable labels.

5. Sign-off. The set's owner reviews the final labels and signs off. Cases are added to the set with the labellers' names recorded as provenance.

The process is heavier than ad-hoc labelling by one person. The payoff is labels the team trusts.
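The resolution rule in step 4 can be sketched as follows. This records the outcome after calibration; it does not replace the discussion. The function name `resolve_label` and the `allow_majority` flag are assumptions made for illustration: unanimity is the default, and a strict majority is accepted only when explicitly allowed.

```python
from collections import Counter
from typing import Optional


def resolve_label(labels: list[str], allow_majority: bool = False) -> Optional[str]:
    """Return the final label, or None when the case should be excluded.

    Assumed rule: unanimous labels always resolve. With `allow_majority`,
    a label held by more than half of the labellers also resolves.
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    if votes == len(labels):
        return label
    if allow_majority and votes > len(labels) / 2:
        return label
    return None


post_calibration = {
    "case-001": ["accept", "accept", "accept"],
    "case-002": ["accept", "accept", "reject"],
    "case-003": ["accept", "reject", "escalate"],
}

final = {
    case_id: label
    for case_id, votes in post_calibration.items()
    if (label := resolve_label(votes, allow_majority=True)) is not None
}
print(final)  # {'case-001': 'accept', 'case-002': 'accept'}
```

`case-003` has no majority and drops out of the set, matching the rule that cases without consensus are not stable labels.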

Calibration

The calibration session is the heart of the process. Without it, multiple labellers diverge silently.

What a calibration session looks like:

60–90 minutes.
Labellers walk through cases where their labels disagreed.
For each disagreement, each labeller explains their reasoning and the group discusses. The outcome is one of: (a) the case has a clear right answer that the other labellers accept, (b) the case is ambiguous and the rubric needs to be sharper, or (c) the case is genuinely subjective and the set should exclude it.
The brief or rubric is updated based on the calibration.
All labellers re-label affected cases.

The session itself is documentation: notes on which cases were discussed, what was decided, what the rubric change was. The notes become the basis for labelling the next batch.
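Building the agenda for a session amounts to finding the cases where independent labels differ. A minimal sketch, assuming labels are stored per labeller as a mapping from case ID to label; the function name `disagreements` and the labeller names are hypothetical.

```python
def disagreements(
    labels_by_labeller: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Map case_id -> {labeller: label} for every case where labels differ.

    Only cases labelled by every labeller are compared.
    """
    labellers = list(labels_by_labeller)
    if not labellers:
        return {}
    shared = set.intersection(*(set(labels_by_labeller[n]) for n in labellers))
    agenda = {}
    for case_id in sorted(shared):
        votes = {n: labels_by_labeller[n][case_id] for n in labellers}
        if len(set(votes.values())) > 1:
            agenda[case_id] = votes
    return agenda


first_pass = {
    "lawyer_a": {"c1": "compliant", "c2": "non_compliant", "c3": "compliant"},
    "lawyer_b": {"c1": "compliant", "c2": "compliant", "c3": "compliant"},
}

print(disagreements(first_pass))
# {'c2': {'lawyer_a': 'non_compliant', 'lawyer_b': 'compliant'}}
```

Keeping each labeller's vote next to the case ID gives the session notes a natural starting point: the decision and any rubric change can be recorded against the same key.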

Inter-labeller agreement

A metric that quantifies how well labellers agree, corrected for the agreement expected by chance. Cohen's kappa (for two labellers) and Krippendorff's alpha (for any number of labellers) are standard. Values:

>0.80 — strong agreement; the labels are reliable.
0.60–0.80 — substantial agreement; usable with attention to disagreement cases.
<0.60 — moderate to poor; the rubric is too ambiguous or the domain is genuinely subjective; the set's reliability is in question.

A platform that measures inter-labeller agreement on every labelling round knows whether the labels are stable. A platform that does not measure has labels with unknown reliability.
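For two labellers and categorical labels, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each labeller's own label frequencies. A small pure-Python sketch follows; the function name `cohens_kappa` is an assumption, and a production platform might prefer a library implementation.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two labellers over the same cases, in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("both labellers must label the same non-empty set of cases")
    n = len(a)
    # Observed agreement: fraction of cases with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two labellers' marginal frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both labellers used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


labeller_a = ["yes"] * 5 + ["no"] * 5
labeller_b = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]

print(round(cohens_kappa(labeller_a, labeller_b), 2))  # 0.6
```

In this example the labellers agree on 8 of 10 cases, yet kappa is only 0.6 because half of that agreement would be expected by chance. Raw agreement overstates reliability, which is why the chance-corrected metric is the one worth tracking.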

When labels are wrong

Labels are decisions. Decisions can be wrong. Three cases:

The label was wrong at labelling time. The labeller misread the case or misapplied the rubric. Discovery happens during calibration, on review by the owner, or later when a system change produces a "regression" that on inspection is actually correct. Fix the label.

The label was right at labelling time but is now wrong. The product policy changed; what was "good" is no longer good. Fix the label as part of the set's refresh (chapter 05).

The label is genuinely ambiguous. Different reasonable labellers would label differently. Move the case to a "needs review" status and discuss in the next calibration; either resolve or remove.

The process for fixing labels: a PR-like change with a reviewer, with both the old label and the new label recorded in the changelog. The change is auditable: "this case used to be labelled X, now it's Y, because Z."
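A changelog entry of this kind might be modelled as below. The record fields, the `apply_change` helper, and the rule that author and reviewer must differ are illustrative assumptions about what "PR-like" means here; the people named are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LabelChange:
    case_id: str
    old_label: str
    new_label: str
    reason: str
    author: str
    reviewer: str
    changed_on: date


def apply_change(
    labels: dict[str, str],
    changelog: list[LabelChange],
    change: LabelChange,
) -> None:
    """Apply a reviewed label change and record it in the changelog."""
    if change.author == change.reviewer:
        raise ValueError("a label change needs an independent reviewer")
    if labels.get(change.case_id) != change.old_label:
        raise ValueError("old_label does not match the current label; change is stale")
    labels[change.case_id] = change.new_label
    changelog.append(change)


labels = {"case-042": "X"}
changelog: list[LabelChange] = []

apply_change(
    labels,
    changelog,
    LabelChange(
        case_id="case-042",
        old_label="X",
        new_label="Y",
        reason="Z: rubric clarified during calibration",
        author="engineer_1",
        reviewer="lawyer_1",
        changed_on=date(2024, 1, 15),
    ),
)
print(labels["case-042"], len(changelog))  # Y 1
```

Checking `old_label` against the current value guards against two people correcting the same case from different starting points.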

LLM-assisted labelling

A practical accelerator. An LLM produces a suggested label; a human reviews and confirms or corrects.

input: { ... }
llm_suggested_label:
  intent: "order_status_inquiry"
  confidence: 0.92
human_confirmed: true
human_corrected: false
final_label:
  intent: "order_status_inquiry"

The LLM supplies speed; the human supplies correctness. Together they label 5–10× faster than a human alone. The risk: the human becomes a rubber stamp, and the LLM's biases enter the labels. Discipline:

A random subset is double-labelled by humans who do not see the LLM's suggestion, and agreement between those blind labels and the LLM's suggestions is monitored. If agreement is high, the workflow is healthy; if it is low, the LLM's suggestions are wrong often enough to bias any label a reviewer simply confirms.
High-stakes cases (regulated, sensitive) are always human-labelled without LLM assist.
The brief is shared with the LLM via system prompt; the LLM's output is calibrated to the team's intent.

Labelling is the most expensive part of operating a large eval set, and LLM assistance is often the difference between sustainable and impossible.
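The blind spot-check can be sketched in two steps: draw a reproducible random subset, then compare the blind human labels with the LLM's suggestions on that subset. The function names, the 10% fraction, and the use of a raw agreement rate (rather than a chance-corrected metric) are assumptions for illustration.

```python
import random


def sample_for_blind_review(case_ids: list[str], fraction: float, seed: int) -> list[str]:
    """Pick a reproducible random subset to be labelled without LLM suggestions."""
    rng = random.Random(seed)
    k = min(len(case_ids), max(1, round(len(case_ids) * fraction)))
    return sorted(rng.sample(case_ids, k))


def agreement_rate(suggested: dict[str, str], blind: dict[str, str]) -> float:
    """Fraction of blind-labelled cases where the LLM suggestion matches the human."""
    shared = suggested.keys() & blind.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    return sum(suggested[c] == blind[c] for c in shared) / len(shared)


case_ids = [f"case-{i:03d}" for i in range(200)]
subset = sample_for_blind_review(case_ids, fraction=0.10, seed=7)
print(len(subset))  # 20

# Hypothetical labels for two of the sampled cases.
llm_suggested = {"case-a": "order_status_inquiry", "case-b": "refund_request"}
blind_human = {"case-a": "order_status_inquiry", "case-b": "complaint"}
print(agreement_rate(llm_suggested, blind_human))  # 0.5
```

Fixing the seed makes the sample auditable: a reviewer can confirm later which cases were meant to be labelled blind.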

Common mistakes

One labeller, no calibration. Labels reflect the single labeller's judgement, with biases and gaps.

Labellers without domain expertise. Engineers labelling legal cases produce labels lawyers disagree with.

No calibration session. Disagreement persists silently; labels are inconsistent.

No inter-labeller agreement metric. Reliability is unknown; the team has no signal when the rubric is too ambiguous.

Labels never revised. Labels age; policies change; the spec encoded in the labels diverges from the current product intent.

LLM-assist without human-only spot-checks. The LLM's biases are quietly absorbed.

Interview Q&A

Q1. Why does labelling require domain experts? Because the label is the spec: what the system should do. The spec is a domain question (what is a correct legal review, a correct medical response, correct financial advice). Engineers can label structural correctness (did the tool fire? was the schema right?); the substantive judgement of "is this a good answer for this domain" requires the domain expert. Labels from non-experts encode the non-expert's model of the domain; the eval scores against the wrong spec. Wrong-answer notes: "engineers can label anything" misses the domain dimension.

Q2. Walk through how you calibrate two labellers. Both label the same batch of 50–100 cases independently. The cases where they disagreed are surfaced. In a calibration session, each explains their reasoning; the group decides that (a) one is right and the other is mistaken, (b) the case is ambiguous and the rubric needs sharpening, or (c) the case is genuinely subjective and should be excluded. The rubric is updated; affected cases are re-labelled. The process is documented. After a few calibration rounds, agreement reaches >0.80 on most batches; new disagreements are the exception, not the norm. Wrong-answer notes: averaging labels without calibration produces noise, not consensus.

Q3. The team uses an LLM to suggest labels and a human to confirm. What is the risk and how do you mitigate? The human becomes a rubber stamp; the LLM's biases enter the labels uncritically. Mitigation: a random subset is double-labelled by humans without seeing the LLM's suggestion; the agreement with the LLM is tracked. If the agreement drops, the workflow is off; investigate whether the LLM is producing systematic errors. For high-stakes cases (regulated, sensitive), use human-only labelling without LLM assist. The accelerator is appropriate when calibrated, not when blind. Wrong-answer notes: "the human always corrects the LLM" assumes attention that does not survive scale.

Q4. A label is found to be wrong six months after the case entered the set. What is the right response? Update the label via a PR-equivalent change with the old and new labels recorded in the set's changelog. Re-run the set against current model versions; the score may shift, which is expected — the prior score was based on the wrong label. The change is auditable; future investigations can see when and why the label changed. The set is a living artefact; label changes are part of its evolution, not failures of the labelling process. Wrong-answer notes: "leave the wrong label to preserve historical comparability" preserves wrong scores.

What to do differently after reading this

Recruit domain experts for labelling; engineers alone are not enough for domain judgement.
Use multiple labellers with calibration sessions, and measure inter-labeller agreement.
Make the brief and the rubric living documents updated through calibration.
Use LLM-assist for scale; spot-check with human-only labelling.
Treat label changes as a normal lifecycle event with audit.

Bridge. Cases and labels exist. The next question is whether they cover what matters. A set that scores well on a narrow slice of the input distribution is the chapter-1 trap. The next chapter is coverage and stratification — designing the set to represent the failure modes and segments the team cares about. → 04-coverage-and-stratification.md