
07. Model cards & documentation — writing the case record before trouble starts

~14 min read. A strong model without honest documentation is a courtroom with missing paperwork.

Built on the ELI5 in 00-eli5.md. The case record — the written record of what the judge is for and where it fails — keeps teams from shipping mystery verdicts.


Picture first: the case record is not marketing copy

Imagine a courtroom where nobody wrote down jurisdiction, evidence rules, known limitations, or appeal conditions. Chaos, no? People would overtrust the judge. Operators would not know when to escalate. Auditors would repeat the same questions. That is what missing model documentation feels like.

A model card is the system's case record. It says what the model does. Who should use it. Who should not use it. What data shaped it. How it performs overall and by slice. What limitations are already known. What monitoring or human review is required.

case record
   ├── intended use
   ├── out-of-scope use
   ├── training/eval data notes
   ├── performance by slice
   ├── limitations and harms
   └── monitoring and owners

See. A real case record makes downstream use safer. It does not make the model fair automatically. It makes the organization honest. That is already a huge gain.

What belongs inside a strong model card

Start with the job description. What exact prediction or generation task does the judge perform? What is the unit of decision? What are the high-stakes versus low-stakes contexts?

Then write intended users. Risk analysts? Support agents? Clinicians? Consumers directly? These are different deployment surfaces. The jury instructions may differ across them.

Then record the data story. Where did the evidence file come from? What groups may be underrepresented? What labels are proxies rather than direct truth? What preprocessing steps matter? If a reader cannot answer these, they cannot judge reliability.

Then write evaluation results. Overall metrics. Slice metrics. Known failure modes. Confidence intervals if relevant. Threshold choices. Human-override expectations. Version number and owner. Yes? That is the minimum serious paperwork.
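The sections above can be held in a small structured record so that empty ones are easy to spot. This is a minimal sketch, not a standard schema: the `ModelCard` and `SliceMetric` classes, their field names, and the `missing_sections` helper are all hypothetical, and the values in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass
class SliceMetric:
    """One row of a slice table: a named subgroup and one metric value."""
    slice_name: str
    metric: str
    value: float


@dataclass
class ModelCard:
    """Hypothetical minimal schema; field names are illustrative only."""
    task: str                       # what the model predicts or generates
    unit_of_decision: str           # e.g. one application, one ticket
    intended_users: list[str]
    out_of_scope_uses: list[str]
    data_notes: list[str]           # sources, proxy labels, underrepresented groups
    overall_metrics: dict[str, float]
    slice_metrics: list[SliceMetric]
    known_limitations: list[str]
    threshold: float
    human_override: str             # when a person must review or may overrule
    version: str
    owner: str

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [name for name, value in vars(self).items()
                if value in ("", [], {}, None)]


# Placeholder values only; a real card would carry real content.
draft = ModelCard(
    task="placeholder task",
    unit_of_decision="placeholder unit",
    intended_users=["placeholder user group"],
    out_of_scope_uses=[],           # not written yet
    data_notes=["placeholder data caveat"],
    overall_metrics={"recall": 0.5},
    slice_metrics=[SliceMetric("placeholder slice", "recall", 0.4)],
    known_limitations=["placeholder limitation"],
    threshold=0.5,
    human_override="placeholder override rule",
    version="0.1",
    owner="",                       # not assigned yet
)
print(draft.missing_sections())     # ['out_of_scope_uses', 'owner']
```

A structured record does not make the content honest. It only makes gaps visible to a reviewer or a script.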

Worked example: a tiny fairness table in the case record

Suppose a resume-screening model predicts interview likelihood. Its model card includes this slice table.

Overall precision = 0.78. Overall recall = 0.74. Threshold = 0.65.

Now by slice:

  • Age 18-25 recall = 0.68
  • Age 26-55 recall = 0.77
  • Age 56+ recall = 0.70
  • Career-gap applicants recall = 0.61
  • No career gap recall = 0.76

Look at the gap. Career-gap recall disparity = 0.76 - 0.61 = 0.15. That is 15 percentage points. An interview is a benefit, so lower recall means qualified career-gap applicants are screened out more often than their peers. The case record should not hide that in an appendix. It should surface it plainly.
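The same arithmetic can be scripted so the gap is recomputed every time the slice table changes. This is a minimal sketch using the recall values from this worked example. The function names and the 0.10 tolerance are assumptions for illustration, not recommended settings.

```python
# Recall values copied from the worked example's slice table.
slice_recall = {
    "age 18-25": 0.68,
    "age 26-55": 0.77,
    "age 56+": 0.70,
    "career gap yes": 0.61,
    "career gap no": 0.76,
}
OVERALL_RECALL = 0.74


def recall_gap(recalls: dict[str, float], group: str, reference: str) -> float:
    """Reference-group recall minus the group's recall, rounded to 2 places."""
    return round(recalls[reference] - recalls[group], 2)


def slices_below(recalls: dict[str, float], overall: float,
                 tolerance: float) -> list[str]:
    """Slices whose recall trails the overall recall by more than `tolerance`."""
    return sorted(name for name, value in recalls.items()
                  if round(overall - value, 2) > tolerance)


gap = recall_gap(slice_recall, "career gap yes", "career gap no")
print(f"career-gap recall disparity: {gap:.2f}")    # 0.15

# 0.10 is an illustrative tolerance chosen for this sketch.
print(slices_below(slice_recall, OVERALL_RECALL, tolerance=0.10))
# ['career gap yes']
```

Rounding before comparing avoids floating-point noise such as 0.15000000000000002 tipping a slice over the line.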

Now add limitation text. "This model under-recalls applicants with long caregiving or illness-related career gaps. Human review is required for low-score candidates in this slice until retraining is complete." That one sentence changes deployment behavior. Documentation can prevent blind automation.
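One way to make that limitation sentence operational is a routing rule. This is a minimal sketch under stated assumptions: scores at or above the documented threshold of 0.65 advance, a `has_career_gap` flag is available for each candidate, and the route names are hypothetical. None of that comes from a real system.

```python
from dataclasses import dataclass

THRESHOLD = 0.65  # decision threshold stated in the model card


@dataclass
class Candidate:
    candidate_id: str
    score: float
    has_career_gap: bool  # hypothetical flag; how it is derived is out of scope here


def route(candidate: Candidate, threshold: float = THRESHOLD) -> str:
    """Apply the documented mitigation: low scores in the flagged slice go to a person."""
    if candidate.score >= threshold:
        return "advance"
    if candidate.has_career_gap:
        # Documented limitation: the model under-recalls this slice,
        # so a low score is not trusted without human review.
        return "human_review"
    return "reject"


print(route(Candidate("c-1", 0.40, has_career_gap=True)))   # human_review
print(route(Candidate("c-2", 0.40, has_career_gap=False)))  # reject
```

The rule is temporary by design. The limitation text ties it to retraining, so the card should say who removes it and when.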

slice table excerpt
├── overall recall ............ 0.74
├── age 18-25 ................. 0.68
├── age 26-55 ................. 0.77
├── age 56+ ................... 0.70
├── career gap yes ............ 0.61
└── career gap no ............. 0.76

Simple, no? The case record becomes actionable when it names slices, numbers, and operational consequences.

Datasheets, evaluation cards, and living documentation

Model cards are not the only paperwork. Datasheets describe datasets. System cards describe broader product behavior. Evaluation cards describe benchmarks and known blind spots. These documents complement each other.

Why do teams still skip them? Because documentation feels slow. Because product launch pressure is real. Because nobody wants to memorialize ugly limitations. But that is exactly why documentation is governance, not decoration. The case record protects future operators from present optimism.

So what to do? Make documentation part of the release gate. No card, no launch. No slice table, no launch. No owner, no launch. No out-of-scope section, no launch. Boring, yes. Necessary, also yes.
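The gate itself can be a few lines in a release script. This is a minimal sketch, assuming the card is loaded as a plain dictionary. The key names (`slice_table`, `owner`, `out_of_scope`) and the `release_blockers` function are hypothetical.

```python
from typing import Optional

# Sections the gate insists on, mirroring the "no X, no launch" rules.
REQUIRED_SECTIONS = ("slice_table", "owner", "out_of_scope")


def release_blockers(card: Optional[dict]) -> list[str]:
    """Return reasons to block launch; an empty list means the gate passes."""
    if card is None:
        return ["no model card"]
    # A section that is absent or empty counts as missing.
    return [f"missing {name}" for name in REQUIRED_SECTIONS if not card.get(name)]


# Placeholder draft: the slice table has not been filled in yet.
draft = {
    "owner": "placeholder-team",
    "slice_table": [],
    "out_of_scope": ["placeholder excluded use"],
}
print(release_blockers(None))    # ['no model card']
print(release_blockers(draft))   # ['missing slice_table']
```

In a pipeline, a non-empty list would fail the build. The check is crude on purpose: it tests that the paperwork exists, not that it is good.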

What good documentation changes in practice

Good documentation changes user expectations. It changes on-call behavior. It changes procurement reviews. It changes internal blame patterns. When trouble appears, the team can ask, "Did we violate the documented intended use?" That is a much stronger conversation than, "I thought the model probably handled that."

The case record also supports the appeal process. Auditors can compare observed harm against documented assumptions. If actual drift exceeds what was recorded, that is a concrete signal. If a use case was marked out of scope but deployed anyway, that is a governance breach, not a mysterious model bug.
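An audit of this kind can be sketched as a comparison between what the card recorded and what is observed. This is one possible reading: "drift" is treated here as a slice's recall falling more than a tolerance below the documented value. The function name, the tolerance, and every input in the usage example are made up for illustration.

```python
def audit_findings(documented_out_of_scope: set[str],
                   deployed_uses: set[str],
                   documented_recall: dict[str, float],
                   observed_recall: dict[str, float],
                   tolerance: float) -> list[str]:
    """Compare live behavior against the case record and list discrepancies."""
    findings = []
    # A use marked out of scope but deployed anyway is a governance breach.
    for use in sorted(deployed_uses & documented_out_of_scope):
        findings.append(f"governance breach: out-of-scope use deployed: {use}")
    # A slice whose recall dropped past the tolerance is a drift signal.
    for name, documented in documented_recall.items():
        observed = observed_recall.get(name)
        if observed is not None and round(documented - observed, 4) > tolerance:
            findings.append(
                f"drift: {name} recall fell from {documented:.2f} to {observed:.2f}")
    return findings


# Hypothetical inputs; the observed value is invented for this sketch.
for line in audit_findings(
        documented_out_of_scope={"final hiring decisions"},
        deployed_uses={"interview screening", "final hiring decisions"},
        documented_recall={"career gap yes": 0.61},
        observed_recall={"career gap yes": 0.52},
        tolerance=0.05):
    print(line)
```

The point of the sketch is the shape of the comparison. Without documented values on one side, there is nothing to compare against.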

Look. Documentation does not slow serious teams down. It prevents them from running fast in the wrong direction.


Where this lives in the wild

  • Google model cards for vision and language releases — model release manager: publish intended use, eval slices, and known limitations so downstream teams do not overclaim capability.
  • Hugging Face model cards — open-model maintainer: use repository-native documentation to note training data, risks, and unsupported use cases.
  • OpenAI system cards — safety reviewer: summarize testing, limitations, and mitigations around model deployment decisions.
  • Anthropic model and system documentation — responsible scaling lead: records behavior constraints, evaluation scope, and known gaps for users and enterprise buyers.
  • Enterprise procurement of AI vendors — model risk officer: relies on documentation quality before allowing a third-party judge into regulated workflows.

Pause and recall

  • Why is a model card more like a case record than a marketing page?
  • What minimum sections should appear in serious model documentation?
  • In the worked example, which slice demanded an operational mitigation?
  • Why should documentation be part of the release gate rather than a later cleanup task?

Interview Q&A

Q: Why require a model card before launch and not document later if issues appear?
A: Because intended use, slice results, and limitations shape deployment decisions upfront, and missing them leads to avoidable overreach from day one.
Common wrong answer to avoid: "Because documentation mainly helps junior engineers learn the codebase."

Q: Why include out-of-scope uses explicitly instead of only listing strengths?
A: Because users often overextend capable systems, and written boundaries reduce silent misuse far better than optimistic silence.
Common wrong answer to avoid: "Because every model is legally required to list every imaginable bad use."

Q: Why surface subgroup metrics in the main card and not bury them in appendices?
A: Because slice disparities often determine real-world harm, and operators need them visible when deciding how much authority to give the judge.
Common wrong answer to avoid: "Because fairness issues disappear once documentation exists."

Q: Why pair model cards with dataset documentation?
A: Because the case record for the judge is incomplete without the story of how the evidence file was assembled and labeled.
Common wrong answer to avoid: "Because model cards cannot mention data at all."


Apply now (5 min)

Exercise. Take one model or AI feature you know. Write five lines for its case record: intended use, out-of-scope use, key data caveat, one slice metric, and one required human check.

Sketch from memory. Draw a file folder labeled case record. Inside it, list six tabs. Include intended use, limitations, slice metrics, and owners. Then circle the tab you think most teams skip first.


Bridge. Once the case record reveals the harms, the obvious next question appears: what can we actually change in the data, training, or threshold to reduce those disparities? → 08-debiasing-techniques.md