02. Sources of bias: how the evidence file gets crooked before training starts
~14 min read. Bias usually enters before the optimizer ever sees the first batch.
Built on the ELI5 in 00-eli5.md. The evidence file — the stack of examples handed to the judge — can already be slanted by sampling, labels, and social history.
First picture: bias enters from many doors
Look at a courtroom clerk assembling the file. One witness is missing. One form is filled carelessly. One neighborhood is overrepresented. One past policy already punished some people. Then the judge trains on that file as if it were truth.
That is bias in ML. Not one thing. Many entry points. Data bias. Selection bias. Measurement bias. Historical bias. Proxy bias. Feedback loops.
real world cases
│
├── missing groups ──────────────┐
├── bad labels ──────────────────┤
├── skewed sampling ─────────────┤
├── historical policy residue ───┤
└── proxy features ──────────────┘
                                 ▼
                      ┌──────────────────────┐
                      │    evidence file     │
                      └──────────┬───────────┘
                                 ▼
                      ┌──────────────────────┐
                      │        judge         │
                      └──────────────────────┘
See. The model can be mathematically correct relative to the file. And still be socially wrong relative to reality. So fairness work starts upstream. Before model choice. Before loss tuning. Before threshold setting.
Selection bias: the file is not the world
Selection bias means the examples you collected are not representative of the population where the verdict will be used. Simple, no? If only previously approved loan applicants enter the training set, you learn from an already filtered world. If only urban English speakers appear in voice logs, the system underlearns rural accents and code-mixed speech. If abuse reports come mostly from power users, moderation labels skew toward their norms.
Here is a small worked example. Suppose two groups each have 100 applicants. Within each group:
- 50 are low-risk and repay at 98%
- 50 are higher-risk and repay at 82%
So both groups have the same true repayment rate. True rate = (49 + 41) / 100 = 90%.
But historical approvals were selective. Group A's approved sample contains 40 low-risk and 10 higher-risk people. Observed repayment in that sample = (40 × 0.98 + 10 × 0.82) / 50 = (39.2 + 8.2) / 50 = 94.8%. Group B's approved sample contains 10 low-risk and 40 higher-risk people. Observed repayment in that sample = (10 × 0.98 + 40 × 0.82) / 50 = (9.8 + 32.8) / 50 = 85.2%.
Same real world. Different sample. The evidence file now tells the judge that Group A is safer. That belief was created by old selection, not by true behavior.
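The arithmetic above can be checked in a few lines. This is a minimal sketch using only the numbers from the worked example; the names `REPAY_RATE` and `repayment_rate` are made up for illustration.

```python
# Repayment rates per risk tier, taken from the worked example.
# They are identical for Group A and Group B.
REPAY_RATE = {"low_risk": 0.98, "higher_risk": 0.82}


def repayment_rate(counts):
    """Expected repayment rate for a given mix of risk tiers."""
    total = sum(counts.values())
    repaid = sum(n * REPAY_RATE[tier] for tier, n in counts.items())
    return repaid / total


# The full applicant pool looks the same in both groups.
population = {"low_risk": 50, "higher_risk": 50}

# Historical approvals filtered the two groups differently.
approved_a = {"low_risk": 40, "higher_risk": 10}
approved_b = {"low_risk": 10, "higher_risk": 40}

true_rate = repayment_rate(population)
observed_a = repayment_rate(approved_a)
observed_b = repayment_rate(approved_b)

print(f"true rate, either group: {true_rate:.1%}")
print(f"observed in approved A:  {observed_a:.1%}")
print(f"observed in approved B:  {observed_b:.1%}")
```

The per-tier rates never change between groups. Only the mix inside the approved sample changes, and that alone produces the gap the judge sees.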
Measurement bias: the label is a distorted ruler
Now what is measurement bias? The thing you record is not the thing you truly care about. The ruler itself is crooked.
Healthcare teams often want "clinical need." But the label in the evidence file may be historical spend. Spend is not the same as need. Communities with lower access can have lower spend while being sicker. The judge learns the wrong surrogate.
Take a tiny example. Suppose two groups each have average illness severity 8 out of 10. Group A historically spent $1000 per patient. Group B historically spent $600 per patient because access was worse. If the training label is spend, the model learns:
- Group A looks high need.
- Group B looks lower need.
But the true need is equal.
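The same tiny example as a sketch in code. The helper `relative_need` is hypothetical; it simply scales each group against the highest value of whichever label you pick, which is enough to show how the choice of ruler changes the picture.

```python
# Group averages from the tiny example above.
groups = {
    "A": {"avg_severity": 8, "avg_spend": 1000},
    "B": {"avg_severity": 8, "avg_spend": 600},
}


def relative_need(groups, label):
    """Scale each group's value for `label` against the highest group."""
    top = max(g[label] for g in groups.values())
    return {name: g[label] / top for name, g in groups.items()}


# The concept we care about: illness severity.
need_by_severity = relative_need(groups, "avg_severity")

# The label actually recorded in the evidence file: past spend.
need_by_spend = relative_need(groups, "avg_spend")

print("by severity:", need_by_severity)
print("by spend:   ", need_by_spend)
```

Measured by severity, the two groups are tied. Measured by spend, Group B appears to need only 60% as much as Group A. Nothing about the patients changed, only the ruler.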
Measurement bias also shows up in moderation and policing. Arrest count is not the same as crime. Reported abuse is not the same as actual abuse. Complaint volume is not the same as harmfulness. If the label is shaped by who was watched, the verdict inherits surveillance patterns.
 true concept wanted             recorded label used
┌──────────────────────┐        ┌──────────────────────┐
│ clinical need        │ ──X──▶ │ past spend           │
│ true risk            │ ──X──▶ │ arrest count         │
│ actual toxicity      │ ──X──▶ │ report volume        │
└──────────────────────┘        └──────────────────────┘
Historical bias and proxy bias: society leaks into the file
Historical bias means the world itself was unequal before the model arrived. The evidence file captures that residue. The model compresses it. Then the judge scales it.
Suppose past hiring favored one school network. Now school name looks predictive. But what it predicts may partly be privilege, not merit. Suppose past loan officers denied some zip codes aggressively. Then default data for approved borrowers from those zip codes becomes sparse. The model learns from the survivors of that process.
Proxy bias is similar. A protected trait may be absent. But a correlated feature stands in. Zip code may proxy race. Shopping basket may proxy religion. Browser language may proxy nationality. A work gap may proxy caregiving burden. Removing the protected column does not magically remove the social signal.
Here is the picture.
protected trait
│
├── correlated with zip code
├── correlated with school name
├── correlated with device type
└── correlated with work gap
│
▼
┌──────────────────────┐
│ model sees proxy │
└──────────────────────┘
Yes? That is why "we removed race from the table" is never a complete fairness argument. The courtroom still whispers through other fields.
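One rough way to check for a whisper is to ask how well a single field guesses the protected trait. The sketch below uses made-up counts (two hypothetical zip codes `Z1` and `Z2`, two hypothetical groups `X` and `Y`) and a simple majority-vote guess per zip code. It is an illustration of the idea, not a complete proxy audit.

```python
from collections import Counter, defaultdict

# Hypothetical audit rows: (zip_code, protected_group).
# The protected column is used only for this check; the model never sees it.
rows = (
    [("Z1", "X")] * 90 + [("Z1", "Y")] * 10
    + [("Z2", "X")] * 20 + [("Z2", "Y")] * 80
)


def proxy_accuracy(rows):
    """Accuracy of guessing the group from the feature by majority vote."""
    by_feature = defaultdict(Counter)
    for feature, group in rows:
        by_feature[feature][group] += 1
    # For each feature value, the best guess is its most common group.
    correct = sum(c.most_common(1)[0][1] for c in by_feature.values())
    return correct / len(rows)


def baseline_accuracy(rows):
    """Accuracy of always guessing the overall most common group."""
    groups = Counter(group for _, group in rows)
    return groups.most_common(1)[0][1] / len(rows)


with_zip = proxy_accuracy(rows)
without_zip = baseline_accuracy(rows)

print(f"guess group without zip code: {without_zip:.0%}")
print(f"guess group from zip code:    {with_zip:.0%}")
```

With these invented counts, zip code lifts the guess from 55% to 85%. A large lift like that suggests the field carries much of the protected signal even though the protected column is gone.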
Feedback loops make the file worse over time
Some systems do not just read the world. They change it. Then the changed world returns as future data. That is a feedback loop.
A predictive policing model sends more patrols to one area. More patrols find more incidents there. The next evidence file now records that area as even riskier. A recommendation system shows some creators less often. They get less engagement. Future labels say they are less engaging. A loan model denies a group more often. That group has fewer chances to build formal credit history. Future data becomes even thinner.
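The patrol example can be sketched as a toy simulation. Everything here is an assumption made for illustration: two areas with the same true incident rate, a small initial skew in recorded counts, and a hypothetical `sharpness` parameter that makes patrol allocation favor the area with more recorded incidents. Real deployments will differ.

```python
def simulate_patrol_loop(
    rounds=5,
    true_rate=0.1,               # incidents found per patrol, same in both areas
    patrols=100,                 # patrols available each round
    initial_counts=(11.0, 9.0),  # recorded incidents so far: area A, area B
    sharpness=2.0,               # >1 means patrols concentrate on the "hotter" area
):
    """Return area A's share of recorded incidents after each round."""
    counts = list(initial_counts)
    history = []
    for _ in range(rounds):
        # Allocate patrols based on what the file already says.
        weights = [c ** sharpness for c in counts]
        total_weight = sum(weights)
        allocation = [patrols * w / total_weight for w in weights]
        # More patrols find more incidents, even though true rates are equal.
        found = [a * true_rate for a in allocation]
        counts = [c + f for c, f in zip(counts, found)]
        history.append(counts[0] / sum(counts))
    return history


shares = simulate_patrol_loop()
for round_number, share in enumerate(shares, start=1):
    print(f"round {round_number}: area A holds {share:.1%} of recorded incidents")
```

Area A starts with 55% of recorded incidents and its share climbs every round, although both areas behave identically. The growth comes entirely from where the patrols were sent.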
So what to do? Audit collection pathways. Ask who is missing. Ask what label stands in for the real concept. Ask which proxies might carry protected information. Ask whether the model changes future data collection. This is upstream fairness work. Without it, downstream jury instructions are fighting a crooked file.
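The "ask who is missing" step can start as a simple comparison between the file and the place where the verdict will be used. This is a minimal sketch; the function name `representation_gaps` and all of the numbers are hypothetical.

```python
def representation_gaps(sample_counts, population_shares):
    """Sample share minus population share per group.

    Negative values mean the group is underrepresented in the file.
    """
    total = sum(sample_counts.values())
    return {
        group: sample_counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical numbers for illustration only.
voice_log_counts = {"urban": 900, "rural": 100}    # rows in the evidence file
deployment_shares = {"urban": 0.6, "rural": 0.4}   # who will actually use it

gaps = representation_gaps(voice_log_counts, deployment_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%}")
```

A check like this only covers the first question. Labels, proxies, and feedback effects each need their own audit.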
Bias is rarely a single bug. It is usually a pipeline shape. That is why teams need a taxonomy before metrics.
Where this lives in the wild
- Google Photos face grouping — computer vision data curator: underrepresentation in the evidence file can degrade recognition quality for darker skin tones and varied lighting conditions.
- Optum risk scoring workflow — population health analyst: past healthcare spending behaves as a biased ruler for clinical need.
- TikTok recommendation systems — trust and safety data scientist: complaint and watch-time signals can reflect uneven reporting and exposure, not pure user value.
- Voice assistants like Alexa — speech ML lead: training logs dominated by certain accents or household settings produce poorer recognition for others.
- Marketplace lending platforms — credit model validator: previously approved applicants create a filtered training set that differs from the future applicant pool.
Pause and recall
- What is the difference between selection bias and measurement bias?
- Why does removing a protected attribute not remove proxy bias?
- In the worked example, why did the two groups look different even though their true repayment rates matched?
- How do feedback loops make fairness problems worse after deployment?
Interview Q&A
Q: Why inspect data collection pathways and not only model weights when diagnosing unfairness?
A: Because the judge can only learn from the evidence file, and upstream sampling or labeling distortions often create the disparity before training begins.
Common wrong answer to avoid: "Because model architecture never matters for fairness."

Q: Why is historical spend a poor label for clinical need?
A: Because spending reflects access, pricing, and prior treatment patterns, not just underlying illness severity.
Common wrong answer to avoid: "Because monetary features are always too noisy for healthcare models."

Q: Why does dropping a protected column not settle proxy discrimination?
A: Because correlated features can carry much of the same social information into the prediction pipeline.
Common wrong answer to avoid: "Because the regulator requires all sensitive columns to stay in the model."

Q: Why are feedback loops especially dangerous in fairness-sensitive systems?
A: Because the model's own verdict changes who gets observed, helped, or watched next, which then contaminates future training data.
Common wrong answer to avoid: "Because feedback loops only happen in recommender systems."
Apply now (5 min)
Exercise. Pick one label from a product you know. Ask, "What true concept do we actually care about?" Then list two ways the recorded label could be a crooked ruler inside the evidence file.
Sketch from memory. Draw four arrows into the evidence file. Label them selection bias, measurement bias, historical bias, and proxy bias. Under the picture, write one sentence on how the judge could look fair mathematically while the file is already distorted.
Bridge. Once you know where bias enters, the next problem is harder: which jury instructions should the judge obey when different fairness goals conflict? → 03-fairness-metrics.md