09. Fairness in LLMs: when the verdict is language, not only a score

~16 min read. Large language models create harms through what they say, whom they represent, and how their outputs shape downstream opportunities.

Built on the ELI5 in 00-eli5.md. The verdict — here a generated sentence or ranking — can harm people even when there is no simple approve-or-deny button.

Picture first: language models are judges who also write the courtroom narrative

A tabular classifier often gives a score. An LLM gives text. That changes the fairness surface. The judge is not only deciding. It is also describing people, summarizing them, recommending actions, and shaping what others believe.

So harms split into at least three families.
Representation harms: who appears how often, in what roles, and with what stereotypes.
Allocation harms: who gets opportunities, visibility, or assistance because of the model's output.
Interaction harms: who gets more refusals, more toxicity, or worse service quality.
Simple, no?

LLM output harm map
├── representation harm ──→ stereotype, erasure, disrespect
├── allocation harm ──────→ ranking, screening, routing, access
└── interaction harm ─────→ worse answers, harsher refusals, uneven safety flags

See. The jury instructions for LLM fairness cannot be only confusion-matrix parity. We need benchmark prompts, qualitative review, paired tests, and downstream product analysis.

Stereotype benchmarks and representation harms

A stereotype benchmark probes what associations the model tends to produce. Ask for professions. Descriptions. Traits. Family roles. Danger or competence cues. Then inspect whether certain groups are described with narrower or more harmful patterns.

Worked example. Suppose you prompt the LLM with, "Write one sentence about a nurse," 100 times for each gender cue. When the prompt implies a woman, 78 outputs mention caring or softness. When it implies a man, only 34 do. Now do the same with, "Write one sentence about an engineer." When the prompt implies a man, 72 outputs mention brilliance or leadership. When it implies a woman, 41 do.

Look at the gap. Engineer leadership association gap = 72% - 41% = 31 points. That is representation harm. No one was denied a loan directly. Still, the verdict keeps reinforcing who belongs where. Repeated at scale, these patterns shape expectations.
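The gap arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming 100 generations per prompt variant as the percentages in the worked example imply; the function name `association_gap` is made up for this sketch and does not come from any library.

```python
def association_gap(hits_a: int, hits_b: int, n_a: int = 100, n_b: int = 100) -> float:
    """Return the difference between two association rates, in percentage points."""
    rate_a = 100.0 * hits_a / n_a
    rate_b = 100.0 * hits_b / n_b
    return rate_a - rate_b


# Counts taken from the worked example (assumed: 100 generations per variant).
nurse_gap = association_gap(78, 34)     # caring/softness: woman-implied vs man-implied
engineer_gap = association_gap(72, 41)  # brilliance/leadership: man-implied vs woman-implied

print(f"nurse caring gap: {nurse_gap:.0f} points")
print(f"engineer leadership gap: {engineer_gap:.0f} points")
```

Reporting the gap in percentage points, rather than as a ratio, keeps it directly comparable with the numbers in the text.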

Benchmarks also test dialect and identity mentions. Prompts written in African American Vernacular English (AAVE) may be flagged as toxic more often than comparable Standard English prompts. Some religions may be associated with violence more often. Some nationalities may get more corruption or scam cues. The courtroom narrative itself becomes skewed.
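One way to check the dialect concern is to compare how often benign text gets flagged as toxic in each group. The sketch below is only an illustration: the records are invented toy data, the group labels are placeholders, and `false_positive_rates` is a hypothetical helper, not a benchmark API.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Per-group share of benign texts that the model flagged as toxic.

    records: iterable of (group, human_says_toxic, model_flags_toxic).
    """
    benign = defaultdict(int)
    flagged = defaultdict(int)
    for group, is_toxic, model_flag in records:
        if not is_toxic:          # only benign texts count toward false positives
            benign[group] += 1
            if model_flag:
                flagged[group] += 1
    return {group: flagged[group] / benign[group] for group in benign}


# Hypothetical toy records, invented purely for illustration.
records = [
    ("dialect_a", False, False), ("dialect_a", False, False),
    ("dialect_a", False, True),  ("dialect_a", True, True),
    ("dialect_b", False, True),  ("dialect_b", False, True),
    ("dialect_b", False, False), ("dialect_b", True, True),
]

rates = false_positive_rates(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f} of benign texts flagged")
```

A real audit would need human labels from annotators who know the dialect, since mislabeled ground truth would hide exactly the gap being measured.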

Allocation harms: text changes who gets access next

Now what is allocation harm? An LLM output may feed a downstream decision. Resume summary. Lead-priority note. Support urgency tag. Scholarship ranking explanation. The generated text influences humans or other systems. Then fairness becomes material.

Suppose a recruiting assistant summarizes resumes for a recruiter: 100 from men and 100 from women, matched on qualifications. Men receive interview-worthy wording 60 times. Women receive it 45 times. Gap = 60% - 45% = 15 percentage points. Even if the final human chooses, the LLM has tilted the queue. That is an allocation effect. The judge is acting through language rather than a final button.

resume text
   │
   ▼
┌──────────────────────┐
│ LLM summary layer    │  "strong leader", "supportive", "risky gap"
└──────────┬───────────┘
           ▼
 recruiter attention shifts
           ▼
 interview access shifts

See the danger. The system may claim, "We do not automate hiring decisions." True, formally. Still, biased summaries steer scarce human attention. That is enough to matter.
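A natural follow-up question is whether a 60 versus 45 gap could be noise. The sketch below runs a pooled two-proportion z-test with the standard library. It assumes 100 resumes per group and treats the groups as independent samples; if you have pair-level outcomes for matched resumes, a paired test such as McNemar's would fit better. The function name is illustrative.

```python
import math


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (rate gap, z statistic, two-sided p-value) under a normal approximation."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Counts from the recruiting example (assumed: 100 resumes per group).
gap, z, p_value = two_proportion_z(60, 100, 45, 100)
print(f"gap={gap:.2f}, z={z:.2f}, p={p_value:.3f}")
```

With these counts z comes out a little above 2, so the gap is unlikely to be pure sampling noise, though a single test on one prompt set is weak evidence either way.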

Practical evaluation and mitigation for LLM fairness

So what to do? Use paired prompts. Swap names, dialect markers, or identity references while keeping task-relevant content fixed. Measure output sentiment, helpfulness, refusal rate, ranking position, and stereotype cues. Then review both quantitative gaps and qualitative patterns.
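The paired-prompt idea can be sketched as follows. Everything here is a stand-in: the template, the name pairs, the cue word list, and `fake_generate` (a canned, deliberately skewed replacement for a real model call) are all hypothetical and exist only so the sketch runs offline.

```python
from typing import Callable

TEMPLATE = "Summarize for a recruiter: {name} has six years of backend experience."
NAME_PAIRS = [("James", "Emily"), ("Michael", "Sarah")]  # hypothetical identity cues
STRONG_CUES = {"leader", "strong", "exceptional"}        # hypothetical cue list


def build_pairs(template: str, name_pairs):
    """Fill the same template with each name so only the identity cue differs."""
    return [(template.format(name=a), template.format(name=b)) for a, b in name_pairs]


def cue_score(text: str) -> int:
    """Count how many distinct strong-recommendation cue words appear in the text."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return len(words & STRONG_CUES)


def paired_gap(pairs, generate: Callable[[str], str], score=cue_score) -> float:
    """Mean per-pair difference score(output_a) - score(output_b)."""
    diffs = [score(generate(a)) - score(generate(b)) for a, b in pairs]
    return sum(diffs) / len(diffs)


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call; returns canned, deliberately skewed text."""
    if "James" in prompt or "Michael" in prompt:
        return "A strong leader with solid backend skills."
    return "A supportive teammate with solid backend skills."


pairs = build_pairs(TEMPLATE, NAME_PAIRS)
gap = paired_gap(pairs, fake_generate)
print(f"mean cue-score gap: {gap}")
```

Keeping the difference per pair, instead of comparing two group averages, is the point of the design: each pair holds task-relevant content fixed, so a nonzero difference traces back to the identity cue. In practice you would swap in a real model call and sample several generations per prompt.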

Benchmark suites help. But do not overtrust them. A benchmark can catch known stereotypes and still miss subtle harms in your product workflow. Fairness in LLMs is highly task-shaped. A writing assistant, coding assistant, support bot, and screening tool have different jury instructions.

Mitigation may include prompt design, retrieval grounding, post-generation filtering, dataset curation, and targeted red teaming. Sometimes the right move is product scoping. Do not let the LLM summarize protected traits at all. Do not let it rank candidates automatically. Do not let it infer missing demographic information from proxies. The case record should state these restrictions clearly. Context is the hidden variable in many LLM fairness debates.
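Scoping restrictions are easier to enforce when they are written down in a form code can check. The sketch below is one possible shape for such a record, not a standard: the `UsePolicy` class, its field names, and the action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsePolicy:
    """Hypothetical restrictions record for one LLM feature; everything is off by default."""
    may_summarize_protected_traits: bool = False
    may_rank_candidates: bool = False
    may_infer_demographics: bool = False


# Hypothetical mapping from a requested action to the policy flag that governs it.
ACTION_FLAGS = {
    "summarize_protected_traits": "may_summarize_protected_traits",
    "rank_candidates": "may_rank_candidates",
    "infer_demographics": "may_infer_demographics",
}


def is_allowed(policy: UsePolicy, action: str) -> bool:
    """Return True only if the action is known and its flag is enabled."""
    flag = ACTION_FLAGS.get(action)
    if flag is None:
        return False  # unknown actions are denied by default
    return getattr(policy, flag)


screening_policy = UsePolicy()  # a screening tool with every risky use switched off
print(is_allowed(screening_policy, "rank_candidates"))
```

Denying by default means a new capability has to be argued into the record before it ships, which matches the intent of stating the restrictions up front.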

Why LLM fairness remains hard

LLMs are broad. They operate across languages, cultures, and open-ended tasks. What counts as respectful or biased wording can shift by context. The same phrase can be supportive in one setting and patronizing in another. That makes fairness evaluation slower than simple score auditing.

Yes? The courtroom analogy still works. The judge is now also writing courtroom opinions, side comments, and summaries. So fairness is about both decisions and narratives. That is why classic fairness metrics are necessary but not sufficient here.

Where this lives in the wild

GitHub Copilot code completion (developer experience researcher): checks whether language, naming, or docstring suggestions encode stereotypes or vary in quality across user contexts.
LinkedIn recruiter assistants (talent product reviewer): inspects whether resume summaries or candidate recommendations allocate attention unevenly by gender-coded or school-coded cues.
Intercom Fin support bot (CX quality lead): watches for harsher refusals, less empathy, or lower helpfulness for certain names, dialects, or regions.
Perplexity and answer-engine products (information quality analyst): evaluates whether generated summaries represent communities accurately instead of repeating biased stereotypes from sources.
Content moderation copilots (trust and safety scientist): compares toxicity judgments across dialects, identity mentions, and reclaimed language patterns.

Pause and recall

What is the difference between representation harm and allocation harm in LLM systems?
In the stereotype example, which gap quantified the representation problem?
Why can an LLM affect hiring fairness even if a human makes the final decision?
Why are classic fairness metrics alone insufficient for LLM fairness work?

Interview Q&A

Q: Why evaluate paired prompts and not only average helpfulness for LLM fairness?
A: Because fairness harms often appear only when identity references change while task-relevant content stays constant, and averages can hide those directional shifts.
Common wrong answer to avoid: "Because average helpfulness no longer matters once fairness enters the discussion."

Q: Why can generated summaries create allocation harm even without final automated decisions?
A: Because text shapes human attention, queue order, and perceived credibility, which changes who gets scarce opportunities next.
Common wrong answer to avoid: "Because any human in the loop automatically removes fairness risk."

Q: Why are stereotype benchmarks useful but insufficient?
A: Because they capture known patterns in controlled prompts, while real product harms depend on workflow, retrieval, and downstream action chains.
Common wrong answer to avoid: "Because benchmarks are only for academic papers and have no product value."

Q: Why should product scoping be considered a fairness mitigation for LLMs?
A: Because some use cases expose identity-sensitive judgments that current models cannot support responsibly, so limiting authority can reduce harm more than tuning alone.
Common wrong answer to avoid: "Because scoping means the model failed technically."

Apply now (5 min)

Exercise. Write two paired prompts that differ only by identity cue. List three output properties you would compare, for example sentiment, helpfulness, and recommendation strength. Which property best captures a harmful verdict in your imagined product?

Sketch from memory. Draw three boxes labeled representation, allocation, and interaction harm. Under each, write one concrete product example. Then add one note on how the case record should restrict risky LLM uses.

Bridge. Once fairness harms become product reality, external rulebooks matter. The next step is to see how regulators and standards bodies describe responsibilities around these courtroom decisions. → 10-regulatory-landscape.md