00. Data access governance — First-principles overview

The prompt-injection security module taught you to defend the agent from instructions in untrusted input. This module is the matching discipline for the data the agent is allowed to read, retrieve, write, or surface in the first place.


A security engineer at a Mumbai healthcare-tech company runs a quarterly review of the agent platform and finds something quietly catastrophic. A doctor's assistant agent, scoped to "read patient records for the active consultation," has occasionally surfaced details from a different patient when the doctor's question was ambiguous about whose record they meant. The audit log shows seven cases in the past quarter. The credential the agent ran under had patient:read across the entire clinic — the data layer enforced no per-patient scoping. The model did exactly what it was authorised to do; the authority was wider than the operation required.

This is a data-access-governance failure. The agent was not hacked; it was not injected. It simply had access to data it should not have used, and a routine ambiguity in user input was enough to cause the leak. The fix is not a smarter model or a tighter prompt. The fix is per-query scope — every read, write, and retrieval bound to the smallest set of records the operation needs, enforced by the data layer, audited per call.
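To make "per-query scope, enforced by the data layer" concrete, here is a minimal sketch of the clinic scenario. Everything in it is hypothetical: `CallScope`, `PatientStore`, `ScopeViolation`, and the patient IDs are illustrative names, and the in-memory dictionary stands in for a real database. The point it tries to show is only that the refusal happens in the data layer, whatever the model asked for.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """Raised when a call asks for a record outside its resolved scope."""


@dataclass(frozen=True)
class CallScope:
    # Hypothetical per-call scope: the only patient IDs this call may touch.
    purpose: str
    allowed_patient_ids: frozenset


class PatientStore:
    """Toy in-memory data layer that checks scope on every read."""

    def __init__(self, records):
        self._records = dict(records)
        self.audit_log = []

    def read(self, patient_id, scope, actor):
        allowed = patient_id in scope.allowed_patient_ids
        # Audit every call, including the refused ones.
        self.audit_log.append({
            "actor": actor,
            "patient_id": patient_id,
            "purpose": scope.purpose,
            "allowed": allowed,
        })
        if not allowed:
            raise ScopeViolation(f"{patient_id} is outside the scope of this call")
        return self._records[patient_id]


store = PatientStore({"p-101": {"name": "A"}, "p-102": {"name": "B"}})

# The scope is resolved for the active consultation only, not the whole clinic.
scope = CallScope("consultation:read_active_patient", frozenset({"p-101"}))

record = store.read("p-101", scope, actor="assistant-agent")

try:
    store.read("p-102", scope, actor="assistant-agent")
    refused = False
except ScopeViolation:
    refused = True
```

With a scope like this, an ambiguous question can still confuse the model, but a read of a second patient's record is refused and logged rather than returned.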

This module teaches that discipline: classification, purpose binding, scope resolution, PII handling, retention, audit, leak detection, and the incident response for when data does get out.


What data access governance is, in one sentence

Data access governance is the production discipline that ensures every read, write, retrieval, and surfacing of data by an AI agent is classified, purpose-bound, narrowly scoped, audited, retained per policy, and recoverable when something leaks.

Read the sentence left to right.

  • Classified — every data field has a known sensitivity tier.
  • Purpose-bound — every access is for a named purpose that the data layer recognises.
  • Narrowly scoped — access is to the smallest set of records that satisfies the purpose.
  • Audited — every access produces a per-call record sufficient for incident review.
  • Retained per policy — access logs and the data itself are retained per regulatory and contractual requirements, then deleted.
  • Recoverable — when data leaks, the discipline supports detection, containment, notification, and remediation.

If you remember one thing from this module: the model is not the security boundary; the data layer is. The model can be confused, prompt-injected, or simply wrong; the data layer's enforcement is what bounds the worst case.
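The first two properties, classified and purpose-bound, can be sketched as a check the data layer runs before any read. This is an assumption-laden illustration, not a prescribed design: the `Tier` ordering, the field names, and the `PURPOSE_MAX_TIER` registry are all hypothetical, and a real policy would likely be richer than "highest tier per purpose".

```python
from enum import IntEnum


class Tier(IntEnum):
    # Assumed ordering: a higher value means more sensitive.
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


# Hypothetical classification: every field carries a known tier.
FIELD_TIERS = {
    "clinic_name": Tier.PUBLIC,
    "order_status": Tier.INTERNAL,
    "phone_number": Tier.SENSITIVE,
    "diagnosis": Tier.REGULATED,
}

# Hypothetical purpose registry: the highest tier each named purpose may read.
PURPOSE_MAX_TIER = {
    "consultation:read_active_patient": Tier.REGULATED,
    "support:read_own_orders": Tier.INTERNAL,
}


def is_access_allowed(purpose, field):
    """Allow only recognised purposes reading classified fields within their tier."""
    max_tier = PURPOSE_MAX_TIER.get(purpose)
    field_tier = FIELD_TIERS.get(field)
    if max_tier is None or field_tier is None:
        # Fail closed: an unknown purpose or an unclassified field is refused.
        return False
    return field_tier <= max_tier
```

The fail-closed branch is the design choice that matters here: a confused or injected model that invents a purpose, or asks for a field nobody classified, gets a refusal rather than a default allow.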


The six governance surfaces

Every production data-access governance program has the same six surfaces. Memorise them once.

| Surface | One-liner | Pressure it answers |
|---|---|---|
| The classification | Every field has a sensitivity tier (public, internal, sensitive, regulated) | language: what is this data, and how do we treat it? |
| The purpose binding | Every access has a named purpose the data layer recognises | intent: why is the agent reading this? |
| The scope | Per-call resolution of the smallest record set the purpose justifies | least privilege: the model has access only to what the operation needs |
| The PII discipline | Detection, redaction, hashing, minimisation of personally-identifying data | privacy: what leaks if logs are exposed? |
| The audit trail | Per-access record: what, by whom, on whose behalf, for what purpose, with what outcome | accountability: who did what, when, and was it allowed? |
| The retention policy | Time-bounded storage of both data and access logs; lawful deletion on schedule | obligation: regulators and customers require limits |

A seventh concern — incident response when a leak happens — runs across all six and is its own chapter (11) rather than a surface.
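As an illustration of the audit trail and the PII discipline working together, the sketch below builds a per-access record with the fields the audit surface names (what, by whom, on whose behalf, for what purpose, with what outcome) while keeping the raw data-subject identifier out of the log. The `AccessAudit` shape, the `pseudonymise` helper, and the demo key are assumptions made for the example; only `hmac`, `hashlib`, `json`, and `dataclasses` from the Python standard library are real.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest so logs never hold the raw identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class AccessAudit:
    actor: str         # by whom: the agent or service making the call
    on_behalf_of: str  # on whose behalf: the human user
    resource: str      # what: the kind of record touched
    subject_ref: str   # pseudonymised data-subject identifier
    purpose: str       # for what purpose
    outcome: str       # with what outcome: "allowed" or "refused"
    timestamp: str     # ISO 8601 string supplied by the caller


def audit_line(entry: AccessAudit) -> str:
    """Serialise one audit record as a single JSON line."""
    return json.dumps(asdict(entry), sort_keys=True)


# Assumption: a real key would come from a secret store, not source code.
KEY = b"demo-key-not-for-production"

entry = AccessAudit(
    actor="assistant-agent",
    on_behalf_of="doctor-17",
    resource="patient_record",
    subject_ref=pseudonymise("p-101", KEY),
    purpose="consultation:read_active_patient",
    outcome="allowed",
    timestamp="2025-01-15T09:30:00+00:00",
)
line = audit_line(entry)
```

A keyed hash (rather than a plain one) is used so that someone who obtains the log cannot confirm a guessed patient ID without also holding the key, while reviewers with the key can still correlate accesses to the same data subject.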


The recurring vocabulary

These terms appear in every chapter.

| Name | Surface | What it is |
|---|---|---|
| the data tier | Classification | the sensitivity label on each field: public, internal, sensitive, regulated |
| the purpose | Purpose binding | the named reason for the access, e.g. consultation:read_active_patient, support:read_own_orders |
| the per-call scope | Scope | the resolved set of records this specific call may touch |
| the PII filter | PII discipline | the layer that detects, redacts, or hashes personal data before storage or transmission |
| the access audit | Audit | the per-call record: who accessed what, by what scope, for what purpose, with what outcome |
| the retention window | Retention | the time-bound period during which data and audit records are kept; deletion at boundary |
| the data subject | Cross-cutting | the person the data is about; their rights drive much of the policy |
| the breach detector | Incident | the monitor that catches unusual access patterns suggesting a leak |
| the right-to-be-forgotten | Retention | the discipline that supports lawful deletion on request |
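The retention window is the easiest of these terms to pin down in code. The sketch below is a hedged illustration only: the `RETENTION` durations are invented for the example and are not regulatory guidance, and the record shape (`id`, `tier`, `created_at`) is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows per data tier (illustrative values only).
RETENTION = {
    "internal": timedelta(days=365),
    "regulated": timedelta(days=30),
}


def is_expired(created_at, tier, now):
    """True once the record has reached the end of its retention window."""
    return now >= created_at + RETENTION[tier]


def purge_expired(records, now):
    """Split records into those still kept and the IDs due for deletion."""
    kept, deleted_ids = [], []
    for rec in records:
        if is_expired(rec["created_at"], rec["tier"], now):
            deleted_ids.append(rec["id"])
        else:
            kept.append(rec)
    return kept, deleted_ids


utc = timezone.utc
records = [
    {"id": "r1", "tier": "regulated", "created_at": datetime(2025, 1, 1, tzinfo=utc)},
    {"id": "r2", "tier": "regulated", "created_at": datetime(2025, 1, 15, tzinfo=utc)},
    {"id": "r3", "tier": "internal", "created_at": datetime(2024, 6, 1, tzinfo=utc)},
]

kept, deleted_ids = purge_expired(records, now=datetime(2025, 2, 1, tzinfo=utc))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the deletion decision reproducible when it is reviewed later.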

The journey: build the discipline, then operate it

This module has two acts.

Act 1 — Build the discipline (files 01–07). The data-access problem with agents, classification, purpose binding, per-call scope, PII handling, retention, and audit. By file 07 the discipline exists as a coherent set of policies and mechanisms.

Act 2 — Operate it under pressure (files 08–11). Leak detection, right-to-be-forgotten, cross-tenant and cross-region considerations, incident response. The discipline does not become more permissive; it becomes survivable.

Synthesis (files 12–13). Architect checklist and honest admission.


Memory map

| # | File | Surface | Pressure answered | What it adds |
|---|---|---|---|---|
| 01 | the-data-access-problem-with-agents | | why this matters more for AI than for ordinary APIs | reframes the boundary |
| 02 | classification-and-data-tiers | Classification | what data is this, and how do we treat it | sensitivity tiers and labels |
| 03 | purpose-binding | Purpose | why is the agent reading this? | named purposes; binding access to them |
| 04 | per-call-scope-resolution | Scope | tenant-wide credentials are too broad | per-call narrowing to the operation's needs |
| 05 | pii-detection-and-redaction | PII | personal data in prompts, responses, logs | detection, redaction, hashing, minimisation |
| 06 | retention-and-jurisdiction | Retention | regulators and contracts impose limits | time-bounded storage; lawful deletion |
| 07 | access-audit | Audit | who did what, on whose behalf, with what outcome | per-call record sufficient for review |
| | milestone: discipline is in place | | | |
| 08 | leak-detection | Incident | leaks happen; how do we notice? | anomaly detection on access patterns |
| 09 | right-to-be-forgotten | Retention | data subjects have rights | erasure workflows that touch all data and audit |
| 10 | cross-tenant-and-cross-region | Scope | multi-tenant agents and global data | tenant isolation; regional residency |
| 11 | incident-response-data-breach | All | when a leak happens, what do you do? | containment, notification, remediation |
| | milestone: discipline is operable | | | |
| 12 | architect-checklist | Synthesis | completeness | 20-item design / build / launch / operate |
| 13 | honest-admission | Boundaries | humility | what governance still cannot prevent |
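The map describes leak detection (file 08) as anomaly detection on access patterns. One very simple pattern, sketched here under stated assumptions, is a session that touches more distinct data subjects than its purpose should need, which is exactly the shape of the clinic incident in the opening story. The event fields (`session_id`, `subject_ref`) and the threshold are hypothetical; a production detector would likely use richer signals.

```python
from collections import defaultdict


def flag_sessions(audit_events, max_subjects_per_session=1):
    """Return IDs of sessions that touched more distinct data subjects than expected."""
    subjects = defaultdict(set)
    for event in audit_events:
        subjects[event["session_id"]].add(event["subject_ref"])
    return sorted(
        session_id
        for session_id, seen in subjects.items()
        if len(seen) > max_subjects_per_session
    )


# Hypothetical audit events: one consultation session should map to one patient.
events = [
    {"session_id": "c-1", "subject_ref": "s-a"},
    {"session_id": "c-1", "subject_ref": "s-a"},
    {"session_id": "c-2", "subject_ref": "s-a"},
    {"session_id": "c-2", "subject_ref": "s-b"},
]

flagged = flag_sessions(events)
```

A check like this only works if the access audit already records a per-call data-subject reference, which is one reason the audit chapter comes before the leak-detection chapter.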

How this module relates to its neighbours

  • 00_safety_guardrail_design — that module is the input/output guard layer; this module is the data layer. Both are required for end-to-end safety.
  • 01_prompt_injection_security — that module defends against malicious instructions in untrusted input; this module bounds the data the agent can reach even when instructions are followed.
  • 19_tool_integration_contracts — chapter 06 of that module covered scopes and credentials at the tool boundary; this module is the deeper data-level discipline behind those scopes.
  • 02_ai_infrastructure/01_model_gateway_provider_ops — that module governs which provider sees data; this module governs which data is sent at all.
  • 04_ai_product_evals — evals can include data-handling checks; the discipline here is the substrate they evaluate against.

Top resources

  • NIST SP 800-122 — Guide to Protecting PII — https://csrc.nist.gov/publications/detail/sp/800-122/final
  • GDPR Article 25 — Data protection by design and by default — https://gdpr-info.eu/art-25-gdpr/
  • OWASP Top 10 for LLM Applications — https://owasp.org/www-project-top-10-for-large-language-model-applications/
  • DPDP Act (India) 2023 — https://www.meity.gov.in/data-protection-framework
  • MITRE ATT&CK — data exfiltration techniques — https://attack.mitre.org/tactics/TA0010/
  • AWS — encrypting and tokenising in pipelines — https://aws.amazon.com/blogs/architecture/

What's coming

  1. 01-the-data-access-problem-with-agents.md — Why agents change the data-access problem compared to ordinary APIs.
  2. 02-classification-and-data-tiers.md — Public, internal, sensitive, regulated — and what changes at each tier.
  3. 03-purpose-binding.md — Every access has a purpose the data layer recognises; access without a purpose is refused.
  4. 04-per-call-scope-resolution.md — Per-call narrowing of access to the smallest set the operation needs.
  5. 05-pii-detection-and-redaction.md — Detection, redaction, hashing, minimisation across prompts, responses, and logs.
  6. 06-retention-and-jurisdiction.md — Time-bounded storage; regulatory regimes; lawful deletion.
  7. 07-access-audit.md — The per-call record that makes accountability possible.
  8. 08-leak-detection.md — Anomaly detection on access patterns; the leading signals of breach.
  9. 09-right-to-be-forgotten.md — Erasure workflows that touch live data, audit, backups, and embeddings.
  10. 10-cross-tenant-and-cross-region.md — Multi-tenant agents and the regional residency the model gateway enforces.
  11. 11-incident-response-data-breach.md — Containment, notification, remediation.
  12. 12-architect-checklist.md — Twenty items.
  13. 13-honest-admission.md — Where governance has no defensible answer.

Bridge. Before we design the classification or the purpose binding, we have to feel why agents change the data-access problem. Ordinary APIs are called by ordinary clients with ordinary intent. Agents are also invoked by ordinary users, but the LLM that sits between the user and the data can be confused, manipulated, or simply surprising. The first chapter is that reframe. → 01-the-data-access-problem-with-agents.md