04. Bias detection & auditing — building the appeal process for a deployed judge

~15 min read. Fairness metrics matter only if you can measure them reliably on real slices, real data, and real time windows.

Built on the ELI5 in 00-eli5.md. The appeal process — the way people challenge and review verdicts — becomes real through slice analysis, disparity tests, and repeatable audits.


Picture first: an appeal process is a pipeline, not a meeting

Many teams say they did an audit. What they mean is one spreadsheet meeting. That is not enough. A real appeal process is operational. It defines which slices matter. It tracks which harms matter. It tests whether gaps are large enough to act on. And it records what changed after mitigation.

raw model outcomes
           ▼
┌──────────────────────┐
│ slice the population │  group, region, age, language, device
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ compute disparities  │  rate gaps, ratios, score shifts
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ statistical check    │  noise or real signal?
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ investigate causes   │  data, threshold, proxy, drift
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ record in case file  │  owner, decision, follow-up
└──────────────────────┘

See. The appeal process is not only ethics language. It is observability plus domain judgment. Without instrumentation, a fairness review becomes vibes.

Slice analysis: do not audit only the average person

Slice analysis means splitting outcomes into meaningful subpopulations: group, language, geography, age band, device type, traffic source, time of day, disability proxy, new versus returning users. The right slices depend on harm pathways, meaning the ways the model's errors could plausibly hurt a specific set of people.

Why is this necessary? Because aggregate fairness can still hide intersectional pain. A model may look acceptable for women overall and acceptable for older users overall, yet fail badly for older women in one region. The judge acts on combinations, not on summary rows.
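A minimal sketch of that effect in Python, using made-up counts. The field names (`gender`, `age_band`, `rejected`) and every number are hypothetical, chosen only so that the marginal rates look moderate while one intersection does not.

```python
from collections import defaultdict

def rate_by_slice(rows, keys, outcome="rejected"):
    """Return {slice_tuple: (rate, n)} for the given slice keys."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [outcome count, total]
    for row in rows:
        slice_id = tuple(row[k] for k in keys)
        counts[slice_id][0] += row[outcome]
        counts[slice_id][1] += 1
    return {s: (hits / n, n) for s, (hits, n) in counts.items()}

def make_rows(gender, age_band, n, n_rejected):
    """Build n hypothetical qualified applicants, n_rejected of them rejected."""
    return [
        {"gender": gender, "age_band": age_band, "rejected": int(i < n_rejected)}
        for i in range(n)
    ]

# Hypothetical data: every row is a qualified applicant,
# so "rejected" here means a false negative.
rows = (
    make_rows("woman", "under_50", 450, 36)
    + make_rows("woman", "50_plus", 50, 15)
    + make_rows("man", "under_50", 350, 35)
    + make_rows("man", "50_plus", 150, 9)
)

by_gender = rate_by_slice(rows, ["gender"])
by_age = rate_by_slice(rows, ["age_band"])
by_both = rate_by_slice(rows, ["gender", "age_band"])

for name, table in [("gender", by_gender), ("age", by_age), ("both", by_both)]:
    for slice_id, (rate, n) in sorted(table.items()):
        print(f"{name:>6} {slice_id}: FNR={rate:.1%} n={n}")
```

With these invented counts, women overall sit near 10% and users aged 50 and over sit at 12%, yet women aged 50 and over are rejected 30% of the time. Note that the worst slice is also the smallest (n=50), which is exactly why the sample size belongs next to every rate.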

Start with obvious protected or sensitive slices. Then add operational slices. Cold-start users. Manual-review cases. Users from one acquisition channel. High-confidence versus low-confidence score bins. Review queue escalations. Those often reveal where the verdict becomes brittle.

Look. Slice explosion is real. You cannot test every possible combination. So what to do? Prioritize slices by exposure, harm severity, and plausible causal pathway. Write that prioritization rationale into the case record, so reviewers can see which slices were in scope and why. Audits need scope discipline.

Worked example: is this disparity just noise?

Now bring numbers. Suppose we audit a hiring screen and look only at applicants who were truly qualified. We compare false negative rates, the share of qualified people the model wrongly rejects. Group A has 100 qualified people. The model falsely rejects 10. Group B has 100 qualified people. The model falsely rejects 25.

So the false negative rates are:

  • Group A FNR = 10 / 100 = 10%
  • Group B FNR = 25 / 100 = 25%
  • Absolute disparity = 25% - 10% = 15 points

Good. But maybe someone says, "Small sample noise." So we do a rough significance check.

For a proportion, standard error is roughly sqrt(p(1-p)/n). For the difference of two independent proportions, add the variances:

SE ≈ sqrt(0.10×0.90/100 + 0.25×0.75/100)
   = sqrt(0.0009 + 0.001875)
   = sqrt(0.002775)
   ≈ 0.0527

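Now z-score ≈ difference / SE = 0.15 / 0.0527 ≈ 2.85. That is above 1.96, the usual cutoff for a two-sided test at the 5% level, so the gap is large enough to take seriously. A rough 95% interval on the gap is 15 points ± 1.96 × 5.27 points, or 15 ± 10.3 points. So the interval is about [4.7, 25.3] percentage points. Even the low end is still meaningful in many products.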
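The same arithmetic can be sketched in a few lines of Python. The function name `fnr_gap_check` is hypothetical, and the calculation uses the normal approximation with unpooled variances, as in the worked example. With very small counts an exact test would be a safer choice.

```python
import math

def fnr_gap_check(fn_a, n_a, fn_b, n_b, z_crit=1.96):
    """Rough z-score and confidence interval for a gap between two rates.

    fn_a, fn_b: false negatives in each group
    n_a, n_b:   qualified people in each group
    z_crit:     1.96 gives a rough 95% interval
    """
    p_a = fn_a / n_a
    p_b = fn_b / n_b
    gap = p_b - p_a
    # Variances of independent proportions add.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        raise ValueError("standard error is zero; both rates are 0 or 1")
    margin = z_crit * se
    return {
        "gap": gap,
        "se": se,
        "z": gap / se,
        "ci": (gap - margin, gap + margin),
    }

result = fnr_gap_check(fn_a=10, n_a=100, fn_b=25, n_b=100)
print(f"gap={result['gap']:.3f} se={result['se']:.4f} z={result['z']:.2f}")
print(f"rough 95% interval: [{result['ci'][0]:.3f}, {result['ci'][1]:.3f}]")
```

Running it on the numbers above reproduces the hand calculation: a gap of 0.15, a standard error near 0.0527, a z-score near 2.85, and an interval of roughly 0.047 to 0.253.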

Simple, no? Statistics does not make the decision for you. It helps the appeal process separate likely signal from random wobble. If harm is high, even modest evidence may justify action. If samples are tiny, you may widen the review window or collect more data.

Beyond one test: auditing habits that work in production

A good audit does more than compute one p-value. It compares rates over time. It checks calibration by group. It inspects feature distributions for drift. It reviews adverse examples manually. It reruns on backfilled labels when ground truth arrives later.
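As one illustration of the habits above, here is a rough sketch of a calibration check by group: compare the mean predicted score with the observed positive rate inside each group. The record layout `(group, predicted_probability, observed_label)` and the sample values are assumptions made for this example. A fuller check would also bin by score range.

```python
from collections import defaultdict

def calibration_gap_by_group(records):
    """records: iterable of (group, predicted_probability, observed_label).

    Returns, per group, the mean score, the observed positive rate,
    and their difference. A positive gap means the model over-predicts.
    """
    sums = defaultdict(lambda: [0.0, 0, 0])  # group -> [score sum, label sum, n]
    for group, score, label in records:
        acc = sums[group]
        acc[0] += score
        acc[1] += label
        acc[2] += 1
    report = {}
    for group, (score_sum, label_sum, n) in sums.items():
        mean_score = score_sum / n
        observed_rate = label_sum / n
        report[group] = {
            "mean_score": mean_score,
            "observed_rate": observed_rate,
            "gap": mean_score - observed_rate,
            "n": n,
        }
    return report

# Hypothetical scored cases with ground-truth labels.
records = [
    ("A", 0.2, 0), ("A", 0.4, 0), ("A", 0.6, 1), ("A", 0.8, 1),
    ("B", 0.2, 0), ("B", 0.4, 0), ("B", 0.6, 0), ("B", 0.8, 1),
]

report = calibration_gap_by_group(records)
for group, stats in sorted(report.items()):
    print(group, {k: round(v, 3) for k, v in stats.items()})
```

In this toy data both groups receive the same scores, but the scores match outcomes for group A and overstate them for group B. Rerunning the same function on each time window, or again once backfilled labels arrive, gives the over-time comparison.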

Disparity testing also needs practical thresholds. A tiny statistically significant difference at huge scale may be operationally irrelevant. A large but noisy gap in a high-stakes domain may still demand intervention. That is where the jury instructions and domain risk come back. Numbers help. Governance still decides.
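One way to make that distinction concrete is a small triage rule that looks at both the interval around the gap and a practical threshold. This is only a sketch: the function `triage_gap`, its three labels, and every threshold value below are hypothetical placeholders that a governance process would have to set.

```python
def triage_gap(gap, se, practical_threshold, high_stakes=False, z_crit=1.96):
    """Classify a disparity as 'escalate', 'monitor', or 'ignore'.

    gap:                 observed difference in rates (e.g. 0.15)
    se:                  standard error of that difference
    practical_threshold: smallest gap the team treats as operationally relevant
    high_stakes:         if True, act on a large point estimate even when noisy
    """
    size = abs(gap)
    low = size - z_crit * se
    high = size + z_crit * se
    if low >= practical_threshold:
        return "escalate"   # even the low end of the interval matters
    if high < practical_threshold:
        return "ignore"     # too small to matter, even if statistically significant
    if high_stakes and size >= practical_threshold:
        return "escalate"   # noisy, but the point estimate is large and harm is severe
    return "monitor"        # inconclusive: widen the window or collect more data

# The worked example from earlier: clear and large.
print(triage_gap(0.15, 0.0527, practical_threshold=0.03))
# Tiny gap at huge scale: z is about 8, yet operationally irrelevant.
print(triage_gap(0.004, 0.0005, practical_threshold=0.03))
# Large but noisy gap, ordinary domain versus high-stakes domain.
print(triage_gap(0.12, 0.08, practical_threshold=0.03))
print(triage_gap(0.12, 0.08, practical_threshold=0.03, high_stakes=True))
```

The rule only orders the queue. Whether a "monitor" case deserves action is still a judgment call for the owner.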

Counterfactual testing can help too. Hold everything fixed except a sensitive attribute or proxy text. Does the verdict change sharply? For resume screening, swap first names. For moderation, paraphrase dialect while keeping meaning constant. For vision, vary lighting or skin-tone representation in otherwise similar scenes. These tests are not perfect causal proof. They are sharp probes.
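A minimal sketch of the first-name swap, assuming the model can be called as a plain function on a record. `toy_screen` is a deliberately flawed stand-in written for this example, not a real screening model, and the placeholder names `NAME_A` and `NAME_B` are hypothetical.

```python
def counterfactual_flip_rate(predict, records, field, swaps):
    """Share of records whose decision changes when only `field` is swapped.

    predict: callable taking a record dict and returning a decision
    swaps:   mapping from an original value to its counterfactual value
    """
    flips = 0
    tested = 0
    for record in records:
        original = record[field]
        if original not in swaps:
            continue  # no counterfactual defined for this value
        variant = dict(record)           # copy so the input is not mutated
        variant[field] = swaps[original]
        tested += 1
        if predict(record) != predict(variant):
            flips += 1
    if tested == 0:
        raise ValueError("no records had a value covered by `swaps`")
    return flips / tested

def toy_screen(resume):
    """Stand-in scorer with a planted name bias so the probe has something to find."""
    score = resume["years_experience"]
    if resume["first_name"] == "NAME_A":
        score += 2
    return score >= 5

resumes = [
    {"first_name": "NAME_A", "years_experience": 4},
    {"first_name": "NAME_B", "years_experience": 4},
    {"first_name": "NAME_A", "years_experience": 7},
    {"first_name": "NAME_B", "years_experience": 1},
]
swaps = {"NAME_A": "NAME_B", "NAME_B": "NAME_A"}

flip_rate = counterfactual_flip_rate(toy_screen, resumes, "first_name", swaps)
print(f"decision flipped on {flip_rate:.0%} of swapped resumes")
```

Here the two borderline resumes flip and the two clear-cut ones do not, so the flip rate is 50%. On a real system, a nonzero flip rate is a lead for manual review rather than proof of cause.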

Yes? A mature appeal process combines:

  • slice metrics
  • significance checks
  • counterfactual probes
  • manual case review
  • documented owner decisions

Without documentation, the same disparity gets rediscovered every quarter. With a living case record, teams learn faster. The audit becomes cumulative, not episodic.

What auditors should write down every time

Write down the slice definition. Write down the metric. Write down the sample size. Write down the disparity threshold that triggers action. Write down whether the issue is likely data, threshold, or feature related. Write down the owner. Write down the planned follow-up.
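That checklist maps naturally onto a small structured record. The sketch below is one possible shape, not a standard schema: the class name `AuditRecord` and the fields `observed_disparity` and `review_date` are additions assumed for this example, and the sample values are invented.

```python
import json
from dataclasses import dataclass, asdict

ALLOWED_CAUSES = {"data", "threshold", "feature", "unknown"}

@dataclass
class AuditRecord:
    slice_definition: str       # how the slice was defined
    metric: str                 # which metric was compared
    sample_size: int            # rows in the smallest compared group
    disparity_threshold: float  # gap that triggers action
    observed_disparity: float   # assumed extra field: the gap actually measured
    suspected_cause: str        # "data", "threshold", "feature", or "unknown"
    owner: str                  # who is accountable for the decision
    follow_up: str              # planned next step
    review_date: str            # assumed extra field: ISO date of the review

    def __post_init__(self):
        if self.suspected_cause not in ALLOWED_CAUSES:
            raise ValueError(f"suspected_cause must be one of {sorted(ALLOWED_CAUSES)}")
        if self.sample_size <= 0:
            raise ValueError("sample_size must be positive")

    def needs_action(self):
        """True when the observed gap meets or exceeds the recorded threshold."""
        return abs(self.observed_disparity) >= self.disparity_threshold

record = AuditRecord(
    slice_definition="qualified applicants, group B vs group A",
    metric="false negative rate",
    sample_size=100,
    disparity_threshold=0.05,
    observed_disparity=0.15,
    suspected_cause="threshold",
    owner="hiring-screen audit owner",
    follow_up="re-check after threshold review with next month's labels",
    review_date="2024-01-15",
)

serialized = json.dumps(asdict(record), indent=2)
print(serialized)
print("needs action:", record.needs_action())
```

Storing each review as one such record, appended to a log rather than overwritten, is what lets the next audit compare against the last one.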

That is boring. Good. Boring systems scale. The courtroom does not stay fair through inspiration. It stays fair through repeatable clerical discipline around the appeal process.


Where this lives in the wild

  • Stripe Radar review dashboards — fraud analytics lead: monitors chargeback, decline, and manual-review disparities across geography, issuer, and customer segments.
  • LinkedIn talent recommendation audits — responsible AI reviewer: checks ranking exposure gaps across job seeker groups and industry slices.
  • Uber driver safety and dispatch models — marketplace scientist: compares error rates across city, shift, and rider-driver segment combinations.
  • YouTube moderation review tooling — policy enforcement analyst: runs slice audits by language, region, and policy category to catch uneven removals.
  • CV screening products like Eightfold — fairness program manager: combines group metrics with counterfactual resume tests and reviewer escalation logs.

Pause and recall

  • Why is slice analysis necessary even when global fairness metrics look acceptable?
  • In the worked example, what did the standard error help you decide?
  • Why is one statistical test alone not a complete audit?
  • What should always be recorded in the case record after a disparity review?

Interview Q&A

Q: Why use slice analysis and not only protected-group averages in fairness audits?
A: Because harm often concentrates in intersections or operational cohorts that disappear inside coarse averages.
Common wrong answer to avoid: "Because protected groups are too politically sensitive to measure directly."

Q: Why compute confidence intervals or significance checks instead of reacting to every raw gap?
A: Because some disparities come from sampling noise, and you need a disciplined way to distinguish unstable fluctuations from likely real patterns.
Common wrong answer to avoid: "Because statistical tests tell you whether something is morally acceptable."

Q: Why pair counterfactual probes with historical log analysis?
A: Because logs reveal realized harm patterns while counterfactual probes stress-test the judge on controlled near-duplicate cases.
Common wrong answer to avoid: "Because counterfactual tests replace the need for real-world outcome data."

Q: Why must audits be documented rather than handled as one-off investigations?
A: Because fairness work is cumulative, and repeatable records let future teams track unresolved issues, mitigations, and ownership.
Common wrong answer to avoid: "Because regulators mainly care about how long the audit report looks."


Apply now (5 min)

Exercise. Take one binary metric you care about. Split it by two slices. Compute the gap and a rough standard error. Would your appeal process escalate this gap immediately, monitor it, or ignore it?

Sketch from memory. Draw the audit pipeline from raw outcomes to case record. Label slice analysis, disparity calculation, statistical check, and owner decision. Then add one note on why the verdict must be audited over time, not once.


Bridge. Auditing tells you that the judge is uneven. The next step is to ask why the judge behaves that way at all by opening its internal logic. → 05-interpretability-basics.md