00. Data access governance — First-principles overview

The prompt-injection security module taught you to defend the agent from instructions in untrusted input. This module is the matching discipline for the data the agent is allowed to read, retrieve, write, or surface in the first place.

A security engineer at a Mumbai healthcare-tech company runs a quarterly review of the agent platform and finds something quietly catastrophic. A doctor's assistant agent, scoped to "read patient records for the active consultation," has occasionally surfaced details from a different patient when the doctor's question was ambiguous about whose record they meant. The audit log shows seven cases in the past quarter. The credential the agent ran under had patient:read across the entire clinic — the data layer enforced no per-patient scoping. The model did exactly what it was authorised to do; the authority was wider than the operation required.

This is a data-access-governance failure. The agent was not hacked; it was not injected. It simply had access to data it should not have used, and a routine ambiguity in user input was enough to cause the leak. The fix is not a smarter model or a tighter prompt. The fix is per-query scope — every read, write, and retrieval bound to the smallest set of records the operation needs, enforced by the data layer, audited per call.
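A minimal sketch of what per-query scope could look like at the data layer. It assumes a hypothetical in-memory record store and a hypothetical `ConsultationContext` that names the one patient the active consultation covers. None of these names come from a real library; they only illustrate that the check lives outside the model, so an ambiguous question cannot widen the read.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """Raised when a call reaches outside its per-call scope."""


@dataclass(frozen=True)
class ConsultationContext:
    # Hypothetical: resolved from the doctor's session before the agent runs.
    doctor_id: str
    active_patient_id: str


# Hypothetical stand-in for the clinic's record store.
RECORDS = {
    "patient-001": {"notes": "record for patient-001"},
    "patient-002": {"notes": "record for patient-002"},
}


def read_patient_record(ctx: ConsultationContext, patient_id: str) -> dict:
    """Return a record only if it belongs to the active consultation."""
    if patient_id != ctx.active_patient_id:
        # A clinic-wide credential would have allowed this read;
        # the per-call scope does not.
        raise ScopeViolation(f"{patient_id} is outside the active consultation")
    return RECORDS[patient_id]


ctx = ConsultationContext(doctor_id="doctor-17", active_patient_id="patient-001")
allowed = read_patient_record(ctx, "patient-001")

try:
    read_patient_record(ctx, "patient-002")
    refused = False
except ScopeViolation:
    refused = True
```

Whatever the model asks for, the second read fails at the data layer. In a real system the refusal would also be written to the access audit.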

This module is that discipline: classification, purpose binding, scope resolution, PII handling, retention, audit, leak detection, and the incident response for when data does get out.

What data access governance is, in one sentence

Data access governance is the production discipline that ensures every read, write, retrieval, and surfacing of data by an AI agent is classified, purpose-bound, narrowly scoped, audited, retained per policy, and recoverable when something leaks.

Read the sentence left to right.

Classified — every data field has a known sensitivity tier.
Purpose-bound — every access is for a named purpose that the data layer recognises.
Narrowly scoped — access is to the smallest set of records that satisfies the purpose.
Audited — every access produces a per-call record sufficient for incident review.
Retained per policy — access logs and the data itself are retained per regulatory and contractual requirements, then deleted.
Recoverable — when data leaks, the discipline supports detection, containment, notification, and remediation.
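Purpose binding and narrow scoping can be sketched together. The following is a minimal illustration, assuming a hypothetical purpose registry in which each named purpose maps to a resolver that returns the smallest record set that purpose justifies. The registry, the `resolve_scope` function, and the shape of the context dictionary are assumptions made for this example; the two purpose names are the ones used in the vocabulary of this overview.

```python
from typing import Callable

# Hypothetical registry: purpose name -> resolver that returns the record IDs
# this purpose justifies for the given caller context.
PURPOSES: dict[str, Callable[[dict], set[str]]] = {
    "consultation:read_active_patient": lambda ctx: {ctx["active_patient_id"]},
    "support:read_own_orders": lambda ctx: set(ctx["own_order_ids"]),
}


def resolve_scope(purpose: str, ctx: dict) -> set[str]:
    """Refuse purposes the data layer does not recognise; otherwise narrow."""
    if purpose not in PURPOSES:
        raise PermissionError(f"unrecognised purpose: {purpose!r}")
    return PURPOSES[purpose](ctx)


scope = resolve_scope(
    "consultation:read_active_patient",
    {"active_patient_id": "patient-001"},
)

try:
    resolve_scope("analytics:read_everything", {})
    unknown_purpose_refused = False
except PermissionError:
    unknown_purpose_refused = True
```

The design choice worth noticing is that the caller never supplies the record set directly. It supplies a purpose, and the data layer derives the scope from it.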

If you remember one thing from this module: the model is not the security boundary; the data layer is. The model can be confused, prompt-injected, or simply wrong; the data layer's enforcement is what bounds the worst case.

The six governance surfaces

Every production data-access governance programme has exactly six surfaces. Memorise them once.

| Surface | One-liner | Pressure it answers |
| --- | --- | --- |
| The classification | Every field has a sensitivity tier (public, internal, sensitive, regulated) | language: what is this data, and how do we treat it? |
| The purpose binding | Every access has a named purpose the data layer recognises | intent: why is the agent reading this? |
| The scope | Per-call resolution of the smallest record set the purpose justifies | least privilege: the model has access only to what the operation needs |
| The PII discipline | Detection, redaction, hashing, minimisation of personally-identifying data | privacy: what leaks if logs are exposed? |
| The audit trail | Per-access record: what, by whom, on whose behalf, for what purpose, with what outcome | accountability: who did what, when, and was it allowed? |
| The retention policy | Time-bounded storage of both data and access logs; lawful deletion on schedule | obligation: regulators and customers require limits |

A seventh concern — incident response when a leak happens — runs across all six and is its own chapter (11) rather than a surface.
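The classification surface can be made concrete with a small sketch. It assumes the four tiers are ordered from least to most sensitive (the order in which they are listed above) and uses hypothetical field names and a hypothetical `visible_fields` helper. Fields without a label are dropped, which is one possible fail-closed choice rather than a rule stated by this module.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Assumed ordering: a higher value means more sensitive.
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


# Hypothetical field labels for a clinic record.
FIELD_TIERS = {
    "clinic_name": Tier.PUBLIC,
    "appointment_slot": Tier.INTERNAL,
    "phone_number": Tier.SENSITIVE,
    "diagnosis": Tier.REGULATED,
}


def visible_fields(record: dict, ceiling: Tier) -> dict:
    """Keep fields labelled at or below the ceiling; drop unlabelled fields."""
    return {
        name: value
        for name, value in record.items()
        if name in FIELD_TIERS and FIELD_TIERS[name] <= ceiling
    }


record = {
    "clinic_name": "Example Clinic",
    "appointment_slot": "10:30",
    "phone_number": "000-0000",
    "diagnosis": "example diagnosis",
    "unlabelled_field": "never classified",
}

internal_view = visible_fields(record, Tier.INTERNAL)
```

Here `internal_view` contains only `clinic_name` and `appointment_slot`. The unlabelled field is excluded even at the highest ceiling, because a field nobody classified cannot be assumed safe.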

The recurring vocabulary

These terms appear in every chapter.

| Name | Surface | What it is |
| --- | --- | --- |
| the data tier | Classification | the sensitivity label on each field: public, internal, sensitive, regulated |
| the purpose | Purpose binding | the named reason for the access, such as `consultation:read_active_patient` or `support:read_own_orders` |
| the per-call scope | Scope | the resolved set of records this specific call may touch |
| the PII filter | PII discipline | the layer that detects, redacts, or hashes personal data before storage or transmission |
| the access audit | Audit | the per-call record: who accessed what, by what scope, for what purpose, with what outcome |
| the retention window | Retention | the time-bound period during which data and audit records are kept; deletion at the boundary |
| the data subject | Cross-cutting | the person the data is about; their rights drive much of the policy |
| the breach detector | Incident | the monitor that catches unusual access patterns suggesting a leak |
| the right-to-be-forgotten | Retention | the discipline that supports lawful deletion on request |
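Three of these terms (the access audit, the PII filter, and the retention window) meet in a single audit record. The sketch below is an illustration under stated assumptions: the field names, the `pseudonymise` helper, the placeholder key, and the 365-day window are all hypothetical, not values this module prescribes. It uses a keyed hash from the Python standard library so the log can correlate accesses to one data subject without storing the raw identifier.

```python
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone

# Hypothetical values: a real key would come from a secrets manager, and the
# window from the applicable retention policy.
AUDIT_HASH_KEY = b"replace-with-a-managed-secret"
AUDIT_RETENTION = timedelta(days=365)


def pseudonymise(identifier: str) -> str:
    """Keyed SHA-256 hash of an identifier, as a hex string."""
    return hmac.new(AUDIT_HASH_KEY, identifier.encode(), hashlib.sha256).hexdigest()


def audit_record(actor: str, on_behalf_of: str, subject_id: str,
                 purpose: str, outcome: str, now: datetime) -> dict:
    """Build one per-call audit entry with a deletion date attached."""
    return {
        "at": now.isoformat(),
        "actor": actor,
        "on_behalf_of": on_behalf_of,
        "data_subject": pseudonymise(subject_id),  # never the raw ID
        "purpose": purpose,
        "outcome": outcome,
        "delete_after": (now + AUDIT_RETENTION).isoformat(),
    }


entry = audit_record(
    actor="agent:doctor-assistant",
    on_behalf_of="doctor-17",
    subject_id="patient-001",
    purpose="consultation:read_active_patient",
    outcome="allowed",
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
serialised = json.dumps(entry)
```

The serialised entry answers who, on whose behalf, for what purpose, and with what outcome, while the raw patient identifier does not appear in it.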

The journey: build the discipline, then operate it

This module has two acts.

Act 1 — Build the discipline (files 01–07). The data-access problem with agents, classification, purpose binding, per-call scope, PII handling, retention, and audit. By file 07 the discipline exists as a coherent set of policies and mechanisms.

Act 2 — Operate it under pressure (files 08–11). Leak detection, right-to-be-forgotten, cross-tenant and cross-region considerations, incident response. The discipline does not become more permissive; it becomes survivable.

Synthesis (files 12–13). Architect checklist and honest admission.

Memory map

| # | File | Surface | Pressure answered | What it adds |
| --- | --- | --- | --- | --- |
| 01 | the-data-access-problem-with-agents | (none) | why this matters more for AI than for ordinary APIs | reframes the boundary |
| 02 | classification-and-data-tiers | Classification | what data is this, and how do we treat it | sensitivity tiers and labels |
| 03 | purpose-binding | Purpose | why is the agent reading this? | named purposes; binding access to them |
| 04 | per-call-scope-resolution | Scope | tenant-wide credentials are too broad | per-call narrowing to the operation's needs |
| 05 | pii-detection-and-redaction | PII | personal data in prompts, responses, logs | detection, redaction, hashing, minimisation |
| 06 | retention-and-jurisdiction | Retention | regulators and contracts impose limits | time-bounded storage; lawful deletion |
| 07 | access-audit | Audit | who did what, on whose behalf, with what outcome | per-call record sufficient for review |
| | milestone: discipline is in place | | | |
| 08 | leak-detection | Incident | leaks happen; how do we notice? | anomaly detection on access patterns |
| 09 | right-to-be-forgotten | Retention | data subjects have rights | erasure workflows that touch all data and audit |
| 10 | cross-tenant-and-cross-region | Scope | multi-tenant agents and global data | tenant isolation; regional residency |
| 11 | incident-response-data-breach | All | when a leak happens, what do you do? | containment, notification, remediation |
| | milestone: discipline is operable | | | |
| 12 | architect-checklist | Synthesis | completeness | 20-item design / build / launch / operate |
| 13 | honest-admission | Boundaries | humility | what governance still cannot prevent |

How this module relates to its neighbours

00_safety_guardrail_design — that module is the input/output guard layer; this module is the data layer. Both are required for end-to-end safety.
01_prompt_injection_security — that module defends against malicious instructions in untrusted input; this module bounds the data the agent can reach even when instructions are followed.
19_tool_integration_contracts — chapter 06 of that module covered scopes and credentials at the tool boundary; this module is the deeper data-level discipline behind those scopes.
02_ai_infrastructure/01_model_gateway_provider_ops — that module governs which provider sees data; this module governs which data is sent at all.
04_ai_product_evals — evals can include data-handling checks; the discipline here is the substrate they evaluate against.

Top resources

NIST SP 800-122 — Guide to Protecting PII — https://csrc.nist.gov/publications/detail/sp/800-122/final
GDPR Article 25 — Data protection by design and by default — https://gdpr-info.eu/art-25-gdpr/
OWASP Top 10 for LLM Applications — https://owasp.org/www-project-top-10-for-large-language-model-applications/
DPDP Act (India) 2023 — https://www.meity.gov.in/data-protection-framework
MITRE ATT&CK — data exfiltration techniques — https://attack.mitre.org/tactics/TA0010/
AWS — encrypting and tokenising in pipelines — https://aws.amazon.com/blogs/architecture/

What's coming

01-the-data-access-problem-with-agents.md — Why agents change the data-access problem compared to ordinary APIs.
02-classification-and-data-tiers.md — Public, internal, sensitive, regulated — and what changes at each tier.
03-purpose-binding.md — Every access has a purpose the data layer recognises; access without a purpose is refused.
04-per-call-scope-resolution.md — Per-call narrowing of access to the smallest set the operation needs.
05-pii-detection-and-redaction.md — Detection, redaction, hashing, minimisation across prompts, responses, and logs.
06-retention-and-jurisdiction.md — Time-bounded storage; regulatory regimes; lawful deletion.
07-access-audit.md — The per-call record that makes accountability possible.
08-leak-detection.md — Anomaly detection on access patterns; the leading signals of breach.
09-right-to-be-forgotten.md — Erasure workflows that touch live data, audit, backups, and embeddings.
10-cross-tenant-and-cross-region.md — Multi-tenant agents and the regional residency the model gateway enforces.
11-incident-response-data-breach.md — Containment, notification, remediation.
12-architect-checklist.md — Twenty items.
13-honest-admission.md — Where governance has no defensible answer.

Bridge. Before we design the classification or the purpose binding, we have to feel why agents change the data-access problem. Ordinary APIs are called by ordinary clients with ordinary intent. Agents are also called by ordinary users, but the LLM that sits between the user and the data can be confused, manipulated, or simply surprising. The first chapter is that reframe. → 01-the-data-access-problem-with-agents.md