00. Data access governance: First-principles overview
The prompt-injection security module taught you to defend the agent from instructions in untrusted input. This module is the matching discipline for the data the agent is allowed to read, retrieve, write, or surface in the first place.
A security engineer at a Mumbai healthcare-tech company runs a quarterly review of the agent platform and finds something quietly catastrophic. A doctor's assistant agent, scoped to "read patient records for the active consultation," has occasionally surfaced details from a different patient when the doctor's question was ambiguous about whose record they meant. The audit log shows seven cases in the past quarter. The credential the agent ran under had patient:read across the entire clinic — the data layer enforced no per-patient scoping. The model did exactly what it was authorised to do; the authority was wider than the operation required.
This is a data-access-governance failure. The agent was not hacked; it was not injected. It simply had access to data it should not have used, and a routine ambiguity in user input was enough to cause the leak. The fix is not a smarter model or a tighter prompt. The fix is per-query scope — every read, write, and retrieval bound to the smallest set of records the operation needs, enforced by the data layer, audited per call.
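
A minimal sketch of what per-query scope could look like at the data layer, using the clinic scenario above. Everything here is hypothetical and for illustration only: `CallScope`, `resolve_scope`, `read_patient_record`, and the in-memory `RECORDS` store are assumed names, not part of any system this module describes.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """Raised when a call reaches outside its resolved scope."""


@dataclass(frozen=True)
class CallScope:
    purpose: str
    allowed_patient_ids: frozenset


# Hypothetical in-memory store standing in for the clinic's data layer.
RECORDS = {
    "p-001": {"name": "Patient A", "allergies": ["penicillin"]},
    "p-002": {"name": "Patient B", "allergies": []},
}


def resolve_scope(active_patient_id: str) -> CallScope:
    """Bind the call to the single patient in the active consultation."""
    return CallScope(
        purpose="consultation:read_active_patient",
        allowed_patient_ids=frozenset({active_patient_id}),
    )


def read_patient_record(scope: CallScope, patient_id: str) -> dict:
    """Return a record only if it falls inside the resolved scope."""
    if patient_id not in scope.allowed_patient_ids:
        raise ScopeViolation(f"{patient_id} is outside scope for {scope.purpose}")
    return RECORDS[patient_id]


scope = resolve_scope("p-001")
print(read_patient_record(scope, "p-001")["name"])  # Patient A
try:
    # The model misread an ambiguous question and asked for the wrong patient.
    read_patient_record(scope, "p-002")
except ScopeViolation as exc:
    print("refused:", exc)
```

The design choice is that the refusal happens in the data layer, so a confused model produces an error instead of a leak.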
This module teaches that discipline: classification, purpose binding, scope resolution, PII handling, retention, audit, leak detection, and incident response for when data does get out.
What data access governance is, in one sentence
Data access governance is the production discipline that ensures every read, write, retrieval, and surface of data by an AI agent is classified, purpose-bound, narrowly scoped, audited, retained per policy, and recoverable when something leaks.
Read the sentence left to right.
- Classified — every data field has a known sensitivity tier.
- Purpose-bound — every access is for a named purpose that the data layer recognises.
- Narrowly scoped — access is to the smallest set of records that satisfies the purpose.
- Audited — every access produces a per-call record sufficient for incident review.
- Retained per policy — access logs and the data itself are retained per regulatory and contractual requirements, then deleted.
- Recoverable — when data leaks, the discipline supports detection, containment, notification, and remediation.
If you remember one thing from this module: the model is not the security boundary; the data layer is. The model can be confused, prompt-injected, or simply wrong; the data layer's enforcement is what bounds the worst case.
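
To make "the data layer is the boundary" concrete, here is a small sketch in which the data layer, not the model, decides which fields come back. It assumes a hypothetical field classification (`FIELD_TIERS`) and purpose registry (`PURPOSE_MAX_TIER`); the field names and the `scheduling:read_appointments` purpose are invented for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


# Hypothetical classification: every field carries a sensitivity tier.
FIELD_TIERS = {
    "clinic_name": Tier.PUBLIC,
    "appointment_time": Tier.INTERNAL,
    "phone": Tier.SENSITIVE,
    "diagnosis": Tier.REGULATED,
}

# Hypothetical purpose registry: the highest tier each purpose may read.
PURPOSE_MAX_TIER = {
    "consultation:read_active_patient": Tier.REGULATED,
    "scheduling:read_appointments": Tier.INTERNAL,
}


def fetch_fields(purpose: str, requested: list, row: dict) -> dict:
    """Return only the requested fields that the purpose is allowed to see."""
    if purpose not in PURPOSE_MAX_TIER:
        # Access without a recognised purpose is refused outright.
        raise PermissionError(f"unrecognised purpose: {purpose!r}")
    ceiling = PURPOSE_MAX_TIER[purpose]
    # Unclassified fields default to REGULATED so the check fails closed.
    return {
        name: row[name]
        for name in requested
        if name in row and FIELD_TIERS.get(name, Tier.REGULATED) <= ceiling
    }


row = {
    "clinic_name": "Clinic X",
    "appointment_time": "10:30",
    "phone": "000-0000",
    "diagnosis": "example diagnosis",
}
# The model asks for more than the purpose justifies; the data layer trims it.
print(fetch_fields("scheduling:read_appointments",
                   ["clinic_name", "diagnosis", "appointment_time"], row))
# {'clinic_name': 'Clinic X', 'appointment_time': '10:30'}
```

Whatever the model requests, the worst case is bounded by the purpose's ceiling and the field labels.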
The six governance surfaces
Every production data-access governance programme has exactly six surfaces. Memorise them once.
| Surface | One-liner | Pressure it answers |
|---|---|---|
| The classification | Every field has a sensitivity tier (public, internal, sensitive, regulated) | language: what is this data, and how do we treat it? |
| The purpose binding | Every access has a named purpose the data layer recognises | intent: why is the agent reading this? |
| The scope | Per-call resolution of the smallest record set the purpose justifies | least privilege: the model has access only to what the operation needs |
| The PII discipline | Detection, redaction, hashing, minimisation of personally-identifying data | privacy: what leaks if logs are exposed? |
| The audit trail | Per-access record: what, by whom, on whose behalf, for what purpose, with what outcome | accountability: who did what, when, and was it allowed? |
| The retention policy | Time-bounded storage of both data and access logs; lawful deletion on schedule | obligation: regulators and customers require limits |
A seventh concern — incident response when a leak happens — runs across all six and is its own chapter (11) rather than a surface.
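
As an illustration of the audit-trail row in the table, the sketch below builds one per-access record with the fields the table names: what, by whom, on whose behalf, for what purpose, and with what outcome. The `AccessAudit` type, the `audit_line` helper, and the identifiers in the usage lines are assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessAudit:
    actor: str          # by whom: the agent or service identity
    on_behalf_of: str   # on whose behalf: the human user
    purpose: str        # for what purpose
    record_ids: tuple   # what was touched (identifiers, not contents)
    outcome: str        # "allowed" or "refused"
    at: str             # ISO-8601 UTC timestamp


def audit_line(actor, on_behalf_of, purpose, record_ids, outcome, now=None) -> str:
    """Serialise one access as a JSON line suitable for an append-only log."""
    now = now or datetime.now(timezone.utc)
    entry = AccessAudit(actor, on_behalf_of, purpose,
                        tuple(record_ids), outcome, now.isoformat())
    return json.dumps(asdict(entry), sort_keys=True)


print(audit_line(
    actor="agent:doctor-assistant",
    on_behalf_of="user:dr-rao",
    purpose="consultation:read_active_patient",
    record_ids=["p-002"],
    outcome="refused",
))
```

Logging record identifiers instead of record contents keeps the audit trail useful for review without turning the log itself into a second copy of the sensitive data.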
The recurring vocabulary
These terms appear in every chapter.
| Name | Surface | What it is |
|---|---|---|
| the data tier | Classification | the sensitivity label on each field — public, internal, sensitive, regulated |
| the purpose | Purpose binding | the named reason for the access — consultation:read_active_patient, support:read_own_orders |
| the per-call scope | Scope | the resolved set of records this specific call may touch |
| the PII filter | PII discipline | the layer that detects, redacts, or hashes personal data before storage or transmission |
| the access audit | Audit | the per-call record: who accessed what, by what scope, for what purpose, with what outcome |
| the retention window | Retention | the time-bound period during which data and audit records are kept; deletion at boundary |
| the data subject | Cross-cutting | the person the data is about; their rights drive much of the policy |
| the breach detector | Incident | the monitor that catches unusual access patterns suggesting a leak |
| the right-to-be-forgotten | Retention | the discipline that supports lawful deletion on request |
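
For a feel of what "the PII filter" in the table might do before text is stored or transmitted, here is a deliberately simple sketch. The two regular expressions are crude assumptions that catch only obvious email addresses and phone-like digit runs; `redact` and `pseudonymise` are hypothetical helper names, and a production filter would need far broader detection.

```python
import hashlib
import hmac
import re

# Crude illustrative patterns; real PII detection needs much more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def redact(text: str) -> str:
    """Replace detected emails and phone numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def pseudonymise(value: str, key: bytes) -> str:
    """Keyed hash: the same input maps to the same token, but the token
    cannot be recomputed without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


print(redact("Call +91 98765 43210 or mail a.b@example.com"))
# Call [PHONE] or mail [EMAIL]

key = b"example-key-for-illustration-only"
token = pseudonymise("a.b@example.com", key)
print(token == pseudonymise("a.b@example.com", key))  # True
```

Redaction suits free text headed for logs; keyed hashing suits identifiers that still need to be joined or counted without exposing the raw value.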
The journey: build the discipline, then operate it
This module has two acts.
Act 1 — Build the discipline (files 01–07). The data-access problem with agents, classification, purpose binding, per-call scope, PII handling, retention, and audit. By file 07 the discipline exists as a coherent set of policies and mechanisms.
Act 2 — Operate it under pressure (files 08–11). Leak detection, right-to-be-forgotten, cross-tenant and cross-region considerations, incident response. The discipline does not become more permissive; it becomes survivable.
Synthesis (files 12–13). Architect checklist and honest admission.
Memory map
| # | File | Surface | Pressure answered | What it adds |
|---|---|---|---|---|
| 01 | the-data-access-problem-with-agents | — | why this matters more for AI than for ordinary APIs | reframes the boundary |
| 02 | classification-and-data-tiers | Classification | what data is this, and how do we treat it | sensitivity tiers and labels |
| 03 | purpose-binding | Purpose | why is the agent reading this? | named purposes; binding access to them |
| 04 | per-call-scope-resolution | Scope | tenant-wide credentials are too broad | per-call narrowing to the operation's needs |
| 05 | pii-detection-and-redaction | PII | personal data in prompts, responses, logs | detection, redaction, hashing, minimisation |
| 06 | retention-and-jurisdiction | Retention | regulators and contracts impose limits | time-bounded storage; lawful deletion |
| 07 | access-audit | Audit | who did what, on whose behalf, with what outcome | per-call record sufficient for review |
| milestone: discipline is in place | | | | |
| 08 | leak-detection | Incident | leaks happen; how do we notice? | anomaly detection on access patterns |
| 09 | right-to-be-forgotten | Retention | data subjects have rights | erasure workflows that touch all data and audit |
| 10 | cross-tenant-and-cross-region | Scope | multi-tenant agents and global data | tenant isolation; regional residency |
| 11 | incident-response-data-breach | All | when a leak happens, what do you do? | containment, notification, remediation |
| milestone: discipline is operable | | | | |
| 12 | architect-checklist | Synthesis | completeness | 20-item design / build / launch / operate |
| 13 | honest-admission | Boundaries | humility | what governance still cannot prevent |
How this module relates to its neighbours
- 00_safety_guardrail_design: that module is the input/output guard layer; this module is the data layer. Both are required for end-to-end safety.
- 01_prompt_injection_security: that module defends against malicious instructions in untrusted input; this module bounds the data the agent can reach even when instructions are followed.
- 19_tool_integration_contracts: chapter 06 of that module covered scopes and credentials at the tool boundary; this module is the deeper data-level discipline behind those scopes.
- 02_ai_infrastructure/01_model_gateway_provider_ops: that module governs which provider sees data; this module governs which data is sent at all.
- 04_ai_product_evals: evals can include data-handling checks; the discipline here is the substrate they evaluate against.
Top resources
- NIST SP 800-122 — Guide to Protecting PII — https://csrc.nist.gov/publications/detail/sp/800-122/final
- GDPR Article 25 — Data protection by design and by default — https://gdpr-info.eu/art-25-gdpr/
- OWASP Top 10 for LLM Applications — https://owasp.org/www-project-top-10-for-large-language-model-applications/
- DPDP Act (India) 2023 — https://www.meity.gov.in/data-protection-framework
- MITRE ATT&CK — data exfiltration techniques — https://attack.mitre.org/tactics/TA0010/
- AWS — encrypting and tokenising in pipelines — https://aws.amazon.com/blogs/architecture/
What's coming
- 01-the-data-access-problem-with-agents.md — Why agents change the data-access problem compared to ordinary APIs.
- 02-classification-and-data-tiers.md — Public, internal, sensitive, regulated — and what changes at each tier.
- 03-purpose-binding.md — Every access has a purpose the data layer recognises; access without a purpose is refused.
- 04-per-call-scope-resolution.md — Per-call narrowing of access to the smallest set the operation needs.
- 05-pii-detection-and-redaction.md — Detection, redaction, hashing, minimisation across prompts, responses, and logs.
- 06-retention-and-jurisdiction.md — Time-bounded storage; regulatory regimes; lawful deletion.
- 07-access-audit.md — The per-call record that makes accountability possible.
- 08-leak-detection.md — Anomaly detection on access patterns; the leading signals of breach.
- 09-right-to-be-forgotten.md — Erasure workflows that touch live data, audit, backups, and embeddings.
- 10-cross-tenant-and-cross-region.md — Multi-tenant agents and the regional residency the model gateway enforces.
- 11-incident-response-data-breach.md — Containment, notification, remediation.
- 12-architect-checklist.md — Twenty items.
- 13-honest-admission.md — Where governance has no defensible answer.
Bridge. Before we design the classification or the purpose binding, we have to feel why agents change the data-access problem. Ordinary APIs are called by ordinary clients with ordinary intent; agents are called by ordinary users but the LLM in between can be confused, manipulated, or surprising. The first chapter is that reframe. → 01-the-data-access-problem-with-agents.md