03. Fairness metrics — competing jury instructions for one courtroom¶
~16 min read. Different fairness metrics answer different moral questions, and they often cannot all be satisfied together.
Built on the ELI5 in 00-eli5.md. The jury instructions — the rules telling the judge what counts as fair — can point in different directions even on the same evidence file.
First picture: fairness is not one dial¶
Imagine the judge asking the jury, "What exactly do you want from me?" Do you want equal approval rates? Do you want equal error rates? Do you want the same meaning for a score across groups? These are different instructions.
That is why fairness metrics exist. They translate vague concern into measurable checks. But they do not collapse into one universal number. Simple, no? One metric may praise a model that another metric rejects.
same judge, same scores
           │
           ▼
┌──────────────────────┐
│ jury instruction A   │ equal selection rate?
├──────────────────────┤
│ jury instruction B   │ equal TPR and FPR?
├──────────────────────┤
│ jury instruction C   │ same score meaning?
└──────────────────────┘
Look. Fairness work gets confused when people say, "We checked fairness." Checked which one? For what harm model? Under what base rates? For what intervention cost? Without that, the courtroom is speaking vague English while the model behaves in precise numbers.
Demographic parity and equalized odds¶
Demographic parity asks a simple question. Does the judge approve groups at equal rates? If Group A receives positive verdicts 50% of the time and Group B 30% of the time, parity is broken. This metric cares about outcome distribution. It does not ask whether underlying qualification rates differ.
Equalized odds asks a different question. Conditional on the truth, does the model make errors at equal rates across groups? That means matching true positive rate and false positive rate. This metric cares about error symmetry. It is often more appropriate when stakes attach to mistakes, not only to output totals.
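Written as conditions, the two instructions look like this. The symbols are assumptions introduced here for illustration, not notation from the original text: $\hat{Y}$ is the judge's verdict (1 = approve), $Y$ is the truth (1 = qualified), and $G$ is the group.

```latex
\begin{align}
\text{Demographic parity:}\quad
  & P(\hat{Y}=1 \mid G=a) = P(\hat{Y}=1 \mid G=b) \\
\text{Equalized odds:}\quad
  & P(\hat{Y}=1 \mid Y=y,\, G=a) = P(\hat{Y}=1 \mid Y=y,\, G=b),
  \quad y \in \{0, 1\}
\end{align}
```

With $y = 1$ the second line says the true positive rates match; with $y = 0$ it says the false positive rates match. The first line never conditions on $Y$, which is why demographic parity ignores differences in qualification rates.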
Use this example. Group A has 100 applicants. True qualified = 60. True unqualified = 40. Predicted positive = 50. Among predicted positive, 42 are truly qualified, so 8 are unqualified. So Group A has:
- TPR = 42 / 60 = 70%
- FPR = 8 / 40 = 20%
- Selection rate = 50 / 100 = 50%
Group B has 100 applicants. True qualified = 20. True unqualified = 80. Predicted positive = 30. Among predicted positive, 18 are truly qualified, so 12 are unqualified. So Group B has:
- TPR = 18 / 20 = 90%
- FPR = 12 / 80 = 15%
- Selection rate = 30 / 100 = 30%
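See the picture. Demographic parity says unfair: selection rate differs by 20 points (50% vs 30%). Equalized odds also says unfair: TPR differs by 20 points (70% vs 90%) and FPR differs by 5 points (20% vs 15%). The same verdicts violate both instructions, but in different ways. Group A is approved more often overall, while qualified applicants in Group B are approved more reliably.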
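The arithmetic for the two groups can be checked with a few lines of Python. This is a minimal sketch: the class name `GroupCounts` and the helper `gaps` are illustrative, and the confusion-matrix counts are derived from the worked example (for instance, Group A has 60 qualified and 42 approved among them, so 18 false negatives).

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Confusion-matrix counts for one group (illustrative names)."""
    tp: int  # qualified and approved
    fp: int  # unqualified but approved
    fn: int  # qualified but denied
    tn: int  # unqualified and denied

    def selection_rate(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.fp) / total

    def tpr(self) -> float:
        return self.tp / (self.tp + self.fn)

    def fpr(self) -> float:
        return self.fp / (self.fp + self.tn)


def gaps(a: GroupCounts, b: GroupCounts) -> dict:
    """Absolute between-group gaps for the three rates."""
    return {
        "selection_rate_gap": abs(a.selection_rate() - b.selection_rate()),
        "tpr_gap": abs(a.tpr() - b.tpr()),
        "fpr_gap": abs(a.fpr() - b.fpr()),
    }


# Counts derived from the worked example in the text.
group_a = GroupCounts(tp=42, fp=8, fn=18, tn=32)
group_b = GroupCounts(tp=18, fp=12, fn=2, tn=68)

report = gaps(group_a, group_b)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

The selection-rate gap is what demographic parity inspects; the TPR and FPR gaps together are what equalized odds inspects. How large a gap counts as a violation is a policy choice the code does not make.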
Calibration: what does a score mean?¶
Now imagine the judge outputs a score, not just yes or no. Calibration asks whether the same score means the same empirical likelihood across groups. Among all applicants who receive risk score 0.8, do about 80% truly default, regardless of group? That is the idea.
Picture a thermometer. Calibration means 38°C should mean the same heat for everyone. If 0.8 means very different realities across groups, the score is misleading.
Worked example. Suppose score bucket 0.8 contains 10 applicants from Group A. Eight actually default. Bucket default rate = 8 / 10 = 80%. Now score bucket 0.8 also contains 10 applicants from Group B. Eight default there too. Good. That bucket is calibrated across groups.
But thresholding can still create disparity. Suppose most Group A applicants receive risk scores above 0.8, while most Group B applicants cluster near 0.6. If the bank approves only applicants scoring below 0.7, Group A sees many more denials. So calibration alone does not guarantee equal selection or equal errors.
score meaning check
0.8 score bucket
├── Group A: 8 default / 10 = 80%
└── Group B: 8 default / 10 = 80%
same meaning, different distribution of scores
Look. Calibration answers, "Is this score honest?" It does not answer, "Are decisions balanced?" Different jury instructions again.
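A small sketch makes the gap between "honest score" and "balanced decisions" concrete. The bucket counts below are hypothetical, chosen so that the 0.8 bucket matches the worked example (8 defaults out of 10 in each group) and so that each bucket's observed default rate equals its score. The function names and the rule "approve only if risk score is below 0.7" are assumptions for illustration.

```python
# Hypothetical bucketed data: (risk score, applicants, defaults) per group.
buckets = {
    "A": [(0.8, 10, 8), (0.6, 5, 3)],    # most of Group A scores high
    "B": [(0.8, 10, 8), (0.6, 30, 18)],  # most of Group B scores near 0.6
}
APPROVE_BELOW = 0.7  # assumed policy: approve only if risk score < 0.7


def observed_default_rates(rows):
    """Observed default rate in each score bucket."""
    return {score: defaults / n for score, n, defaults in rows}


def denial_rate(rows, approve_below):
    """Share of applicants whose score is at or above the cutoff."""
    total = sum(n for _, n, _ in rows)
    denied = sum(n for score, n, _ in rows if score >= approve_below)
    return denied / total


calibration = {g: observed_default_rates(rows) for g, rows in buckets.items()}
denials = {g: denial_rate(rows, APPROVE_BELOW) for g, rows in buckets.items()}

for group in buckets:
    print(group, calibration[group], f"denial rate = {denials[group]:.2f}")
```

Both groups show the same default rate in every bucket, so the score passes the calibration check. The denial rates still differ sharply, because the group whose risk scores sit above the cutoff absorbs most of the denials.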
Why the impossibility result appears¶
Now what is the hard part? When base rates differ across groups, you generally cannot satisfy calibration and equalized odds simultaneously unless the model predicts perfectly. Only equal base rates or perfect prediction remove the conflict. This is the famous tension people call the impossibility result.
Do not treat it like mystical theorem worship. See the geometry. If Group A has many more truly qualified cases than Group B, calibrated scores must reflect that different reality. But a common threshold over those scores will usually create different selection and error patterns. To force equalized odds, you may need different thresholds or score distortions. Then calibration can break.
Use the earlier numbers. Group A base qualification rate = 60%. Group B base qualification rate = 20%. Suppose your score is calibrated in both groups. Then high scores will naturally be more common in Group A. A single threshold will approve more of Group A. Now adjust thresholds by group to equalize TPR and FPR. The accepted 0.7 score in Group A may now be treated differently from 0.7 in Group B. The score meaning at the decision boundary stops lining up.
Simple, no? The jury instructions are not just hard to optimize. They sometimes conflict structurally. That is why fairness is not a checkbox. It is a policy choice about what kind of harm to minimize.
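One way to see the structural conflict is through a confusion-matrix identity that follows directly from the definitions: FPR = p / (1 - p) × (1 - PPV) / PPV × TPR, where p is the base rate and PPV is the share of approved applicants who are truly qualified. The sketch below uses PPV as a thresholded stand-in for calibration (often called predictive parity), which is an assumption of this example rather than the full calibration condition. The function name `implied_fpr` is hypothetical.

```python
def implied_fpr(base_rate: float, ppv: float, tpr: float) -> float:
    """FPR forced by the identity FPR = p/(1-p) * (1-PPV)/PPV * TPR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr


# Sanity check against the worked example: the identity reproduces
# the FPR values computed by hand for each group.
fpr_a = implied_fpr(base_rate=0.60, ppv=42 / 50, tpr=0.70)
fpr_b = implied_fpr(base_rate=0.20, ppv=18 / 30, tpr=0.90)
print(f"worked example: FPR_A = {fpr_a:.2f}, FPR_B = {fpr_b:.2f}")

# Now force PPV and TPR to be equal across groups (hypothetical targets)
# while keeping the different base rates from the text.
forced_a = implied_fpr(base_rate=0.60, ppv=0.80, tpr=0.70)
forced_b = implied_fpr(base_rate=0.20, ppv=0.80, tpr=0.70)
print(f"equal PPV and TPR: FPR_A = {forced_a:.4f}, FPR_B = {forced_b:.4f}")
```

Once PPV and TPR are pinned to the same values in both groups, the base rate is the only term left that can vary, so the FPR values must differ. Matching all three requires equal base rates or a predictor with no false positives at all.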
Choosing metrics in real systems¶
So what to do? Start with the harm model. If a false denial blocks needed care or credit, inspect TPR and false negative disparities. If a platform controls exposure, inspect selection rate and ranking exposure. If the score will be reused by downstream humans, calibration matters a lot. If the system is a recommender or generator, representation metrics may matter more than binary confusion matrices.
Also separate measurement from rhetoric. Never say, "We are fair." Say, "For this hiring screen, we monitor false negative disparity by group, calibration by score band, and selection rate shifts after threshold changes." That sentence is honest. It names the actual jury instructions.
The courtroom becomes manageable only when the rules are explicit. Otherwise teams argue morals with no instrumentation. And the appeal process has nothing measurable to inspect.
Where this lives in the wild¶
- Zest AI credit models — model risk manager: compares approval-rate disparity, default-rate calibration, and error tradeoffs when lenders adjust thresholds.
- Stripe Radar review queues — fraud operations lead: cares about false positive disparities because innocent customers suffer when payments are blocked.
- LinkedIn candidate ranking tools — talent product analyst: may inspect exposure and selection-rate fairness, not just binary acceptance.
- Hospital triage models — clinical ML governance lead: needs calibrated risk scores so clinicians interpret predicted deterioration consistently across groups.
- Content moderation systems at YouTube — trust and safety scientist: often compares false positive and false negative disparities because both over-removal and missed harm matter.
Pause and recall¶
- What question does demographic parity answer that equalized odds does not?
- Why can a calibrated score still lead to unfair binary decisions?
- What role do base rates play in the impossibility tension?
- Why should fairness metric choice start from harm, not from a generic checklist?
Interview Q&A¶
Q: Why optimize equalized odds and not demographic parity in a lending denial setting? A: Because lending harm often concentrates in mistaken approvals and denials, so error-rate symmetry is usually more relevant than matching raw approval totals alone. Common wrong answer to avoid: "Because demographic parity is mathematically obsolete."
Q: Why is calibration valuable even when it does not guarantee fair decisions? A: Because downstream operators need the same score to mean the same empirical risk, otherwise the score itself becomes deceptive. Common wrong answer to avoid: "Because calibration automatically fixes group disparity after thresholding."
Q: Why do fairness metrics conflict when base rates differ? A: Because honest score meaning, matched error rates, and matched outcome rates pull the decision surface in incompatible directions unless prediction is perfect or the base rates are equal. Common wrong answer to avoid: "Because one of the fairness formulas must be implemented incorrectly."
Q: Why should product teams state metric choice explicitly in documentation? A: Because fairness is a policy decision about which harms to prioritize, and silent metric choice hides that judgment from reviewers and affected users. Common wrong answer to avoid: "Because regulators only care about how many metrics appear in the appendix."
Apply now (5 min)¶
Exercise. Create two tiny confusion matrices for two groups. Compute selection rate, true positive rate, and false positive rate. Then ask which jury instructions would call the system acceptable and which would reject it.
Sketch from memory. Draw three boxes labeled demographic parity, equalized odds, and calibration. Under each, write one short sentence answering, "Fair in what sense?" Then add one arrow to show where the instructions can conflict.
Bridge. Once the jury instructions are explicit, the next job is operational: how do we run an appeal process that actually measures these disparities in production data? → 04-bias-detection-auditing.md