09. Privacy in the golden set

Cost is one operational concern; privacy is another. The set carries data, sometimes personal data, and is subject to the same privacy discipline as production stores: PII handling, retention, and right-to-be-forgotten (RTBF). This is the discipline from 03_ai_security_safety/03_data_access_governance applied to the workbench.

A platform engineer at a Mumbai healthcare-tech company reviews the eval set after a year of operation. The set has 300 cases. Roughly 150 of them include patient identifiers sampled from production six months ago and never re-anonymised. The set lives in the team's git repo; every engineer has read access; the repo is replicated to multiple developer machines. The audit lead reads this back as a breach in waiting: regulated patient data, in source control, with broad read access, indefinite retention. The set is rebuilt from scratch with synthetic substitutes; the discipline of "no real PII in the set" is added to the team's process; the original set is purged from git history. The whole effort takes three weeks.

This chapter covers the discipline that prevents that rebuild: PII handling at source intake, synthetic substitution, retention policies on the set itself, and participation in right-to-be-forgotten.

The set is a data store

Some teams treat the set as "test data" that only incidentally contains personal data because it was sampled from production. From the privacy perspective, the set is a data store. The same disciplines apply:

Classification per field (chapter 02 of 03_ai_security_safety/03_data_access_governance).
Per-call scope when reading (less relevant for an internal set, though access logging still applies).
PII redaction or synthetic substitution at intake (chapter 05 there).
Retention windows (chapter 06).
Participation in RTBF (chapter 09).

The set's intuitive "it's test data" framing is the failure mode. The set is data.

Synthetic substitution at intake

When sampling from production, direct identifiers are replaced with synthetic substitutes that preserve the structure of the input.

original_input:
  customer_email: "ravi@example.com"
  customer_name: "Ravi Kumar"
  account_number: "1234567890"
  query: "I cannot access my account, my email is ravi@example.com"

eval_case_input:
  customer_email: "[SYNTH:email_001]"
  customer_name: "[SYNTH:name_001]"
  account_number: "[SYNTH:account_001]"
  query: "I cannot access my account, my email is [SYNTH:email_001]"

The substitutes are stable within the case (the email mentioned in the query matches the email field) and unique across cases (different cases use different SYNTH IDs). The structure that affects the system's behaviour is preserved; the personal data is gone.
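A minimal sketch of this substitution, assuming the intake step already knows which fields are direct identifiers. The `IDENTIFIER_FIELDS` mapping and the `substitute_case` function are hypothetical names for illustration, not part of any existing tool; the placeholder format follows the example above.

```python
# Hypothetical mapping of direct-identifier fields to placeholder kinds.
# A real intake step would take this from the platform's field classification.
IDENTIFIER_FIELDS = {
    "customer_email": "email",
    "customer_name": "name",
    "account_number": "account",
}


def substitute_case(raw: dict, case_no: int) -> dict:
    """Replace direct identifiers with [SYNTH:kind_NNN] placeholders.

    The same original value always maps to the same placeholder inside one
    case (stable), and case_no makes placeholders differ between cases.
    """
    mapping = {}
    for field, kind in IDENTIFIER_FIELDS.items():
        value = raw.get(field)
        if value:
            mapping[value] = f"[SYNTH:{kind}_{case_no:03d}]"

    out = {}
    for field, value in raw.items():
        if isinstance(value, str):
            # Longest originals first, so a short identifier cannot clobber
            # part of a longer one before the longer one is replaced.
            for original in sorted(mapping, key=len, reverse=True):
                value = value.replace(original, mapping[original])
        out[field] = value
    return out


raw_case = {
    "customer_email": "ravi@example.com",
    "customer_name": "Ravi Kumar",
    "account_number": "1234567890",
    "query": "I cannot access my account, my email is ravi@example.com",
}
eval_case = substitute_case(raw_case, case_no=1)
```

Exact string replacement only catches identifiers that appear verbatim. Variants such as a first name alone or a reformatted account number need the pattern-driven scrubbing discussed next.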

For some cases the substitution requires care:

A free-text input containing PII needs pattern-driven scrubbing (chapter 05 of 03_ai_security_safety/03_data_access_governance).
A long document with embedded PII (a contract, a medical record) may need a fully synthetic substitute generated to preserve the document's structure and intent.
A case where the PII is load-bearing for the system's behaviour (e.g., Aadhaar validation) needs a synthetic identifier that passes validation but is not a real Aadhaar number.
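The last point can be sketched for the simplest case: an account number whose final digit is a checksum. The sketch below assumes a Luhn check digit and a hypothetical reserved prefix `99` that the issuer never assigns to real accounts; both are assumptions for illustration. Aadhaar, for instance, uses the Verhoeff scheme rather than Luhn, so the checksum function would be swapped for the one the real validator applies.

```python
import random


def luhn_check_digit(payload: str) -> str:
    """Return the digit that makes payload + digit pass the Luhn test."""
    total = 0
    # Walk right to left; double every other digit, starting with the rightmost.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Luhn test over a full number, check digit included."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def synthetic_account(seed: int, length: int = 10, prefix: str = "99") -> str:
    """Deterministic fake account number that passes the checksum.

    The prefix is an assumed reserved range; without such a range there is
    no guarantee the generated number does not collide with a real account.
    """
    rng = random.Random(seed)
    body_len = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randrange(10)) for _ in range(body_len))
    return body + luhn_check_digit(body)
```

Seeding the generator per case keeps the substitute stable across regenerations of the set, which matters for reproducibility of historical scores.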

What "real PII in the set" causes

Three categories of harm.

Direct breach risk. The set is in source control, replicated to developer machines, sometimes backed up, sometimes shared. Each copy is a potential leak vector.

RTBF compliance. A data subject exercises the right to be forgotten. The set contains their data. The RTBF workflow (chapter 09 of 03_ai_security_safety/03_data_access_governance) must reach the set; the set's storage becomes a stop on the workflow.

Eval poisoning by personal data. A case containing real PII may bias the eval if the model has seen that data in training (rare for production data, more likely for public-figure data). Synthetic substitution removes this confound.

The discipline is "no real PII in the set." Synthetic substitution at intake; never load raw production cases unchanged.

Sample store separation

Recall from chapter 06 of 01_model_gateway_provider_ops: a small sample of production traffic is captured with full content (redacted) for review and drift detection. The sample store is different from the eval set.

Sample store. High-fidelity production captures; redacted; restricted access; short retention (90 days typical).
Eval set. Curated cases with synthetic substitutes; lower-fidelity; broader access; long retention.

The two should not be confused. Production samples may inform the eval set (cases are sourced from production), but the transformation — anonymisation, labelling, classification — happens at the boundary between them. A case enters the eval set only after the transformation.
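One way to enforce that boundary is a gate that refuses any case still carrying identifier-shaped strings. The sketch below is an assumption-laden illustration: the two patterns (email-like strings, runs of nine or more digits) and the function names `residual_pii` and `admit` are hypothetical, and a real gate would use the platform's own detectors.

```python
import re

# Illustrative patterns only; a production gate would use a fuller detector set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS_RE = re.compile(r"\b\d{9,}\b")


def residual_pii(case: dict) -> list[str]:
    """List the string fields that still look like they hold identifiers."""
    findings = []
    for field, value in case.items():
        if not isinstance(value, str):
            continue
        if EMAIL_RE.search(value):
            findings.append(f"{field}: email-like string")
        if LONG_DIGITS_RE.search(value):
            findings.append(f"{field}: long digit run")
    return findings


def admit(case: dict) -> dict:
    """Return the case unchanged if it is clean; raise if anything remains."""
    findings = residual_pii(case)
    if findings:
        raise ValueError("case rejected at intake: " + "; ".join(findings))
    return case
```

A set that stores format-valid synthetic values rather than placeholders would need an allowlist for its reserved ranges, otherwise the digit pattern rejects them.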

Retention on the set

The active set is current; older versions (chapter 06) are historical. Both have retention.

Active set. No retention limit per se; the set evolves through versions.
Historical versions. Retained per the platform's reproducibility policy; typically aligned with audit retention (1-7 years).
Retired cases. Kept in historical versions; removed from active sets.

If the historical versions contained real PII (a legacy state), the RTBF workflow on a data subject requires reaching those versions too. This is one reason synthetic-only is the right discipline from the start — RTBF on an eval set with real PII is operationally painful.
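The retention rule for historical versions can be sketched as a simple cutoff check. The three-year window and the shape of the input (version tag mapped to the date it was superseded) are assumptions for illustration; the actual window comes from the platform's audit retention policy. The active set has no superseded date and so never appears in the input.

```python
from datetime import date, timedelta

# Assumed window; the text gives 1-7 years as the typical range.
RETENTION_DAYS = 3 * 365


def expired_versions(superseded_on: dict[str, date], today: date) -> list[str]:
    """Return historical version tags whose retention window has passed."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(tag for tag, when in superseded_on.items() if when < cutoff)


history = {
    "v1": date(2021, 6, 1),
    "v2": date(2023, 3, 1),
}
due_for_purge = expired_versions(history, today=date(2025, 1, 1))
```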

Right-to-be-forgotten and the set

If, despite the discipline, the set has some real PII:

The data subject's identifiers are searched in the set's contents and metadata.
Matches are replaced (with synthetic) or removed (cases retired).
The set version is bumped; the changelog notes the RTBF case.
Historical versions are similarly purged or re-marked.

For a clean (synthetic-only) set, RTBF requires only verifying that no real identifiers are present — a fast check.
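The search step, and the fast check for a clean set, can be sketched as one scan over every version. The layout assumed here (a dict of version tag to list of case dicts, each with an `id` key and arbitrarily nested content and metadata) is hypothetical, as are the function names.

```python
def _strings(obj, path=""):
    """Yield (path, text) for every string nested inside dicts and lists."""
    if isinstance(obj, str):
        yield path, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from _strings(value, f"{path}.{key}" if path else str(key))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from _strings(value, f"{path}[{i}]")


def find_subject(versions: dict, identifiers: set) -> list:
    """Return (version, case_id, path) for each string holding an identifier.

    Matching is case-insensitive substring search; an empty result is the
    "no real identifiers present" check for a synthetic-only set.
    """
    needles = [ident.lower() for ident in identifiers if ident]
    hits = []
    for version, cases in versions.items():
        for case in cases:
            for path, text in _strings(case):
                lowered = text.lower()
                if any(needle in lowered for needle in needles):
                    hits.append((version, case.get("id", "?"), path))
    return hits
```

Scanning historical versions in the same pass is the point: a hit in an old version is still a hit, even when the active set is clean.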

Access to the set

The set is in source control or a small service. Read access is typically broad (engineers need to see cases); write access is restricted (the owner approves changes per chapter 06 versioning).

For sets containing any potentially sensitive content (even synthetic, even after substitution), read access should be limited to engineers who need it. The eval set is not a public artefact within the company.

If the set contains sensitive cases, audit reads of the set itself. Most platforms log access to the set's repository or service.

Eval-derived data

Eval runs produce data — scores, judge reasoning, per-case results. This data is also subject to privacy considerations:

If the judge reasoning includes the case's content (and the case has any sensitive content), the reasoning is sensitive too.
Score logs that include case IDs are not sensitive unless the case IDs map to real subjects.
Per-case rerun stores (for investigation) hold the system's output, which may include sensitive content depending on the input.

The discipline: eval-derived data inherits the sensitivity of the input case. Synthetic-only input cases mean synthetic-only outputs and reasoning, with lower sensitivity overall.
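The inheritance rule can be sketched as a small labelling function. The three sensitivity levels, the record kinds, and all names below are hypothetical; the only rule taken from the text is that a derived record carrying case content (or a case ID that maps to a real subject) takes the case's level, while a bare score does not.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    # Assumed levels for illustration; use the platform's own classification.
    SYNTHETIC = 0
    INTERNAL = 1
    PERSONAL = 2


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    sensitivity: Sensitivity


@dataclass(frozen=True)
class DerivedRecord:
    case_id: str
    kind: str  # e.g. "score", "judge_reasoning", "rerun_output"
    sensitivity: Sensitivity


def derive(case: EvalCase, kind: str, includes_content: bool,
           id_maps_to_subject: bool = False) -> DerivedRecord:
    """Label a derived record with the sensitivity it inherits from its case."""
    if includes_content or id_maps_to_subject:
        level = case.sensitivity
    else:
        # A score keyed by an opaque case ID carries no case content.
        level = Sensitivity.SYNTHETIC
    return DerivedRecord(case.case_id, kind, level)
```

With a synthetic-only set every case sits at the lowest level, so every derived record does too, which is the payoff the paragraph above describes.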

Common mistakes

Production cases imported as-is. "Convenience" produces a privacy debt that compounds.

Synthetic substitution that breaks structure. A substitute that does not preserve the input's shape produces unrealistic cases; scores do not reflect production reality.

Forgetting historical versions. RTBF is applied to the active set while historical versions silently retain the data.

Sample store and eval set confused. Production samples used as eval cases without anonymisation.

Eval runs producing PII-rich logs. Judge reasoning logs include the case content unredacted.

Interview Q&A

Q1. Why is "no real PII in the eval set" a hard discipline rather than a soft preference? Because the set has broad read access, lives in source control, is replicated to developer machines, persists across versions, and may be exposed by routine engineering operations (copying for analysis, sharing in code review, putting in test environments). Each copy is a leak vector. The cost of synthetic substitution at intake is small; the cost of cleaning up real PII after the fact is multi-week and may not be complete. The discipline is "synthetic from the start" because the alternative does not scale. Wrong-answer notes: "we trust our engineers" misses the structural exposure surface.

Q2. Walk through synthetic substitution for a case where the input is "I cannot access account 1234567890, my email is ravi@example.com." Replace the account number with a synthetic that has the same format and validation properties: e.g., a stable synthetic ID [SYNTH:acct_001] mapped to a fake account number that satisfies the account-number checksum. Replace the email with a synthetic in the same format: [SYNTH:email_001] → fake.user@synth.example.com. Within the case, every occurrence of the same identifier uses the same substitute (the email in the free-text query matches the email field). The transformation produces a case the system can still process (validation passes; structure is real) but no real data is held. Wrong-answer notes: "remove the email" breaks the input's structure; the substitution preserves the testable shape.

Q3. The team discovers that 50 cases in the eval set contain real customer emails (from a refresh six months ago). What is the remediation? First: scope the affected data (which customers, and which versions of the set contain it). Second: replace with synthetic substitutes in the active set; bump the set version; record the remediation in the changelog. Third: purge or re-mark historical versions containing the data; depending on retention obligations, either delete the affected historical versions or note that they contain real data and restrict access further. Fourth: notify the customers per applicable regulation if the incident is a notifiable breach. Fifth: tighten the intake process so the same incident does not recur. Wrong-answer notes: "fix the active set only" misses historical exposure.

Q4. How does the eval set participate in right-to-be-forgotten? For a synthetic-only set, the participation is fast: verify no real identifiers of the data subject are present (usually nothing to do). For a set with real PII, the participation is heavier: search the set's contents and metadata for the subject's identifiers; replace or remove matches; bump the version; verify; address historical versions per retention policy; notify the subject of the action. The synthetic-only discipline pays off here: RTBF on the set is a check, not a workflow. Wrong-answer notes: "the set is internal; RTBF doesn't apply" misreads the regulation, which applies to all stores of personal data.

What to do differently after reading this

Treat the set as a data store; apply the privacy discipline.
Synthetic substitution at intake; never load raw production cases.
Limit read access to the set appropriately; audit access where the content is sensitive.
Participate in RTBF; the synthetic-only set makes this fast.
The eval-derived data (scores, reasoning, rerun stores) inherits the case's sensitivity.

Bridge. Privacy is one cross-cutting concern. Ownership is another: who decides what enters the set, what gets retired, when refreshes happen. The next chapter covers cross-team governance. → 10-cross-team-ownership.md