02. Classification and data tiers¶

The data layer is the boundary. The first move in operating that boundary is to know what data you have and how strict each piece needs to be. Not every field needs the strictest treatment; not every field can have the loosest. The data tier is the label that drives every downstream rule.

A platform engineer at a Chennai marketplace inherits an agent system without a data classification. Every field is treated the same. The audit shows that customer email addresses, internal product taxonomy codes, vendor pricing margins, and customer Aadhaar identifiers are all in the same store, retrieved by the same query patterns, logged with the same retention, and surfaced to the agent under the same scope. The fix is tedious but standard: classify the fields, tier them, apply different handling per tier. The work takes three weeks. The outcome: the most sensitive fields are protected per regulation, the public fields are usable freely, and the middle ground has explicit handling rules. The agent's behaviour does not visibly change, but the platform's posture for the next audit, the next incident, and the next compliance review is materially different.

The discipline is unglamorous and structural. Without it, every other chapter in this module collapses; with it, the rest of the work has a foundation.

What classification is for¶

Classification labels each data field with a sensitivity tier. The tier drives:

What scope can read the field (chapter 04)
What PII handling applies in logs and audit (chapter 05)
What retention applies (chapter 06)
What audit detail is captured (chapter 07)
What jurisdiction the data may be processed in (chapter 06, chapter 10)
What incident response triggers if it leaks (chapter 11)

Without classification, every field gets either the strictest handling (impractical) or the loosest (unsafe). The classification is the lever that lets you treat different data appropriately.

The four-tier model¶

Most platforms work well with four tiers. The exact names vary; the substance is similar.

Tier	Definition	Example fields	Default handling
Public	Information already public or freely shareable	Product catalogue, public docs, marketing copy	No special handling; retention by business need; standard audit
Internal	Information for internal use; not sensitive but not for public release	Internal taxonomy codes, internal metrics, anonymised aggregates	Access by employees and authorised systems; retention per business need; standard audit
Sensitive	Information whose unauthorised disclosure would cause harm or breach trust	Customer email, address, order history, support transcripts	Access narrowly scoped; redacted in non-essential logs; retention per data-subject contract; full audit
Regulated	Information governed by external regulations	Aadhaar/SSN, payment card data, medical records, GDPR-special-category	Access strictly purposed; encryption in addition to access controls; retention per regulatory minimum; signed audit; jurisdiction constraints

Some platforms add a fifth tier — secret — for credentials, API keys, encryption keys, and the like. These are not data the agent reads as part of its work; they are platform secrets. They live in a vault (module 19 chapter 06 and 02_ai_infrastructure/01 chapter 06 govern this). Treating them as a tier in this taxonomy can confuse the boundary; better to keep them separate.
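One way to make the tier a value that code can compare is an ordered enumeration. The sketch below is illustrative only: the names `Tier`, `parse_tier`, and `strictest` are assumptions, not part of any platform described here, and secrets are deliberately left out, as the paragraph above recommends.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Sensitivity tiers, ordered from loosest to strictest handling.

    Hypothetical names for illustration; secrets are kept out of this taxonomy.
    """
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


def parse_tier(label: str) -> Tier:
    """Turn a label such as 'Sensitive' into a Tier; unknown labels raise."""
    try:
        return Tier[label.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown tier label: {label!r}") from None


def strictest(*tiers: Tier) -> Tier:
    """Return the tier whose rules apply when data of several tiers is combined."""
    if not tiers:
        raise ValueError("at least one tier is required")
    return max(tiers)
```

Making the tiers orderable means "is this at least sensitive?" becomes a plain comparison, which the later rules (scope, redaction, retention) can rely on.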

Classifying fields, not records¶

The right granularity is the field, not the record.

A customer record contains:

customer_id: internal
email: sensitive
name: sensitive
phone: sensitive
aadhaar_number: regulated (India)
birth_date: sensitive (in many jurisdictions, regulated)
account_status: internal
order_history_summary: internal
loyalty_tier: internal

Classified as a whole, the record would have to carry the tier of its most restricted field: regulated, because of aadhaar_number. Per-field handling lets the agent retrieve just customer_id and loyalty_tier (both internal) without invoking the strict handling required for aadhaar_number. Without field-level classification, every read of any customer field triggers the strictest handling, which is operationally expensive and often unworkable.

The implementation: every field has a tier label in the schema or in a separate metadata layer. Reads, writes, and exposures respect the tier.
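As a minimal sketch of a separate metadata layer, the customer record above can be expressed as a field-to-tier mapping, with the tier of a read derived from the fields it actually touches. `CUSTOMER_FIELD_TIERS` and `required_tier` are hypothetical names, and birth_date is set to sensitive here purely as an assumption; some jurisdictions would make it regulated.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


# Hypothetical metadata layer: one tier label per field of the customer record.
CUSTOMER_FIELD_TIERS = {
    "customer_id": Tier.INTERNAL,
    "email": Tier.SENSITIVE,
    "name": Tier.SENSITIVE,
    "phone": Tier.SENSITIVE,
    "aadhaar_number": Tier.REGULATED,
    "birth_date": Tier.SENSITIVE,  # assumption: regulated in some jurisdictions
    "account_status": Tier.INTERNAL,
    "order_history_summary": Tier.INTERNAL,
    "loyalty_tier": Tier.INTERNAL,
}


def required_tier(fields, field_tiers=CUSTOMER_FIELD_TIERS) -> Tier:
    """Tier of handling a read needs: the strictest tier among requested fields."""
    fields = list(fields)
    if not fields:
        raise ValueError("no fields requested")
    unknown = [f for f in fields if f not in field_tiers]
    if unknown:
        # An unlabelled field is a classification gap, not a free pass.
        raise KeyError(f"unclassified fields: {unknown}")
    return max(field_tiers[f] for f in fields)
```

A read of `["customer_id", "loyalty_tier"]` resolves to internal, while adding `aadhaar_number` to the same read raises it to regulated, which is the distinction a whole-record label cannot express.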

How classification flows through the system¶

A field's tier label propagates through every system that touches it.

Source of truth. The primary store (database, document store) tags each field with its tier. Schema migrations are reviewed for tier changes.

Replicas and derived stores. Vector indexes, search indexes, caches, analytics warehouses — each must preserve the tier label. A regulated field that lands in an internal-tagged warehouse has been quietly downgraded.

Logs and audit. When data is logged, the redaction policy (chapter 05) reads the tier and applies the rule for that tier (e.g., regulated → fully redact; sensitive → redact value, keep field name; internal → keep value).

Outputs to the agent. When the data is fetched into the agent's context, the tier travels with it. The agent platform's audit records what tier of data entered the context.

Outputs to users. When the agent surfaces data in a response, the tier informs whether and how to surface it. Regulated data is rarely surfaced directly; sensitive data is usually surfaced only to the user it belongs to.

A platform that does not propagate tier labels through every layer has classification on paper but not in enforcement.
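The log-redaction rule described above (regulated fully redacted, sensitive value redacted with the field name kept, internal value kept) could look roughly like this. It is a sketch under assumptions: `redact_for_log` is a hypothetical helper, "fully redact" is interpreted as dropping the field entirely, and unlabelled fields are treated as regulated so that a missing label fails closed.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


REDACTED = "[REDACTED]"


def redact_for_log(record: dict, field_tiers: dict) -> dict:
    """Return a copy of `record` that is safe to write to application logs."""
    safe = {}
    for name, value in record.items():
        # Assumption: a field with no label is handled as regulated (fail closed).
        tier = field_tiers.get(name, Tier.REGULATED)
        if tier >= Tier.REGULATED:
            continue  # fully redact: neither the name nor the value is logged
        if tier >= Tier.SENSITIVE:
            safe[name] = REDACTED  # keep the field name, hide the value
        else:
            safe[name] = value  # public and internal values are kept
    return safe
```

For example, a record with customer_id, email, and aadhaar_number would be logged with the customer_id intact, the email replaced by the placeholder, and the aadhaar_number absent.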

How to classify a system today¶

Procedure to classify an existing data store (the initial pass is a one-off effort; the review in the last step repeats yearly):

Inventory the fields. Walk the schemas. List every field across every table or document type.
Assign a tier per field. Use the four tier definitions above (public, internal, sensitive, regulated). Disagreements get resolved with the data owner and the legal/compliance team.
Validate against use cases. For each business use of the data, check whether the proposed tier supports the use (e.g., a marketing team's analytics use of email must work even at sensitive tier).
Apply labels. Update the schema or the metadata layer. Most modern data stores support column-level labels (Postgres comments, BigQuery labels, Snowflake tags). Use the native mechanism.
Propagate to derived stores. Tag the same fields in vector indexes, search indexes, and warehouses.
Update the access mediator. Chapter 03 (purpose binding) and chapter 04 (per-call scope) start enforcing tier-aware rules.
Audit yearly. New fields land; old fields get re-purposed; tiers should be reviewed.
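For the "apply labels" step on Postgres, one possible approach is to generate `COMMENT ON COLUMN` statements from the field-to-tier mapping. The `tier=<name>` comment format and the helper `tier_comment_sql` are assumptions made for illustration; a real platform would pick its own convention.

```python
import re

# Conservative identifier check so table/column names can be interpolated safely.
_IDENT = re.compile(r"[a-z_][a-z0-9_]*")
TIERS = ("public", "internal", "sensitive", "regulated")


def tier_comment_sql(table: str, field_tiers: dict[str, str]) -> list[str]:
    """Build one COMMENT ON COLUMN statement per field, in column-name order."""
    if not _IDENT.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    statements = []
    for column, tier in sorted(field_tiers.items()):
        if not _IDENT.fullmatch(column):
            raise ValueError(f"unsafe column name: {column!r}")
        if tier not in TIERS:
            raise ValueError(f"unknown tier {tier!r} for {table}.{column}")
        # Hypothetical label convention: the comment text is "tier=<name>".
        statements.append(f"COMMENT ON COLUMN {table}.{column} IS 'tier={tier}';")
    return statements
```

Note that Postgres stores a single comment per column, so this convention would overwrite any existing column comment; a store with richer tagging (for instance Snowflake tags) avoids that trade-off.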

For a moderate system, this is a two-to-four-week effort. It is not glamorous; it is the foundation.
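Two checks fall out of the procedure above and are cheap to automate: finding fields that have no tier yet, and finding fields whose label in a derived store is missing or weaker than at the source. The sketch below assumes labels are available as plain dictionaries; `unclassified` and `downgraded` are hypothetical helper names.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


def unclassified(schema_fields: list[str], labels: dict[str, Tier]) -> list[str]:
    """Fields present in the schema that have no tier label yet."""
    return sorted(f for f in schema_fields if f not in labels)


def downgraded(source: dict[str, Tier],
               derived: dict[str, Optional[Tier]]) -> list[str]:
    """Fields whose label in a derived store is missing or weaker than the source.

    Fields that exist only in the derived store are ignored here; they would
    show up in an inventory of that store instead.
    """
    return sorted(
        field
        for field, tier in derived.items()
        if field in source and (tier is None or tier < source[field])
    )
```

Run against a warehouse where aadhaar_number is tagged internal, `downgraded` would report that field, which is the quiet downgrade the propagation step is meant to prevent.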

What changes per tier in practice¶

A concrete table of what changes at each tier in a typical platform.

Property	Public	Internal	Sensitive	Regulated
Access scope per call	Open	Authenticated	Per-call narrow	Per-call narrow + purpose-bound
Encryption in transit	Standard	Standard	Standard	Standard + envelope encryption
Encryption at rest	Standard	Standard	Standard	Standard + field-level encryption
Logged in audit	Standard	Standard	Full	Full + signed
Logged in app logs	Yes	Yes	Redacted	Never
Surfaced in error messages	Yes	Yes	Redacted	Never
Cached in agent context	Yes	Yes	With redaction policy	Never beyond the immediate call
Retained in audit	Standard	Standard	Per-contract	Per-regulation (often years)
Right-to-be-forgotten erasure	N/A	Standard	Required	Required + verification
Cross-region travel	Allowed	Allowed	Per residency policy	Refused unless explicit
Eval set inclusion	Allowed	Allowed	Synthetic substitute	Synthetic substitute only

The right-most two columns demand the discipline; the left two enjoy looser handling. The classification is what tells the platform which rules to apply.
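A table like this is easiest to enforce when it exists as data the platform can look up. The following sketch encodes three of the rows (app logs, error messages, eval set inclusion) as a per-tier policy object; `Handling`, `HANDLING`, and `handling_for` are hypothetical names, and the remaining rows would be added the same way.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


@dataclass(frozen=True)
class Handling:
    app_logs: str             # "yes", "redacted" or "never"
    error_messages: str       # "yes", "redacted" or "never"
    real_data_in_evals: bool  # False means a synthetic substitute is required


# Hypothetical encoding of three rows of the table above.
HANDLING = {
    Tier.PUBLIC: Handling("yes", "yes", True),
    Tier.INTERNAL: Handling("yes", "yes", True),
    Tier.SENSITIVE: Handling("redacted", "redacted", False),
    Tier.REGULATED: Handling("never", "never", False),
}


def handling_for(tier: Tier) -> Handling:
    """Look up the handling rules that apply to a tier."""
    return HANDLING[tier]
```

Keeping the policy in one structure means a change to a tier's rules is a reviewed change to one place, rather than edits scattered across loggers, error handlers, and eval tooling.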

Classification mistakes to avoid¶

Over-classifying. Treating everything as regulated is operationally infeasible. The platform becomes too slow, too expensive, and the team starts skipping rules. Classification has to be honest: only regulated data is regulated.

Under-classifying. Treating sensitive data as internal because "we know our employees are trustworthy" misreads the regulatory and contractual posture. The classification is for the worst plausible reader, not the typical one.

Classifying records, not fields. Mentioned above. A whole-record classification overstates the protection needed for low-tier fields and produces operational pain that drives the team toward shortcuts.

Stale classification. Fields drift. A field added as internal becomes sensitive when the business starts using it for personalisation. The yearly review catches this; the lack of one lets the drift compound.

Classification without enforcement. A spreadsheet of tier labels that the platform does not actually read is theatre. Tiers have to be in the schema metadata and consulted by the access mediator.
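To illustrate what "consulted by the access mediator" might mean in code, here is a deliberately small sketch. It assumes a per-call scope expressed as a set of field names and an optional purpose string; `mediate_read` and `AccessDenied` are hypothetical, authentication for internal data is omitted, and the later chapters on purpose binding and per-call scope define the real rules.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    REGULATED = 3


class AccessDenied(Exception):
    """Raised when a read is not permitted for the requested fields."""


def mediate_read(requested: list[str], field_tiers: dict[str, Tier],
                 scope: set[str], purpose: Optional[str] = None) -> list[str]:
    """Check every requested field against its tier before the read runs."""
    for field in requested:
        tier = field_tiers.get(field)
        if tier is None:
            raise AccessDenied(f"{field}: unclassified field")
        # Sensitive and regulated fields must be named in the per-call scope.
        if tier >= Tier.SENSITIVE and field not in scope:
            raise AccessDenied(f"{field}: not in per-call scope")
        # Regulated fields additionally need a stated purpose.
        if tier is Tier.REGULATED and not purpose:
            raise AccessDenied(f"{field}: purpose required")
    return list(requested)
```

The point of the sketch is only that the label is read on the request path; a spreadsheet of the same labels would make none of these refusals happen.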

Interview Q&A¶

Q1. The team wants to "treat all data as sensitive to be safe." Why is that the wrong default?

Because operational realities push the team toward shortcuts when the rules are too strict. Public marketing copy treated as sensitive means audit overhead on every read, redaction in logs that makes debugging hard, and retention rules that conflict with the data's intended use. Within months, engineers start labelling things as "internal" to bypass the sensitive rules, and the actual sensitive data loses its differentiation. Honest classification per field (public, internal, sensitive, regulated) produces a system where each tier's rules are followed because they are appropriate, not bypassed because they are oppressive.

Wrong-answer notes: "more strict is always safer" misses the bypass dynamics that strict-everywhere produces.

Q2. Walk through how you would classify a customer table.

Walk the fields. customer_id: internal (an identifier, but not sensitive by itself). email: sensitive (PII; identifies a person). name: sensitive. phone: sensitive. aadhaar_number: regulated (Indian statute). birth_date: sensitive or regulated depending on jurisdiction. loyalty_tier: internal (business metric). account_status: internal. order_history_summary: internal (aggregated, no direct PII). The classification is per field. Validate the assignments with the data owner and a compliance review. Apply labels in the schema; propagate to the vector index and the analytics warehouse.

Wrong-answer notes: classifying the table as one tier loses the per-field discipline.

Q3. The platform has labels in the schema but the agent does not consult them. What is missing?

Enforcement at the access layer. The labels are metadata; the access mediator (chapter 03 onward) must read them and apply tier-specific rules per call. The data layer's storage engine should also enforce: a regulated field accessed through a credential without the right purpose is refused at the storage layer. Without enforcement, the labels are documentation only; the protection they imply is theatrical.

Wrong-answer notes: "we will train people to respect the labels" is documentation, not security.

Q4. A new field is added. Whose job is it to classify it?

The data owner, with a default tier suggested by a classification policy and reviewed by compliance for any field that might be sensitive or regulated. The default suggestion in most platforms is "sensitive unless proven otherwise" for any field that touches user data, "internal" for business metrics, and "public" only for fields that are already public. The schema migration that adds the field includes the tier label; CI rejects migrations without one.

Wrong-answer notes: "the engineer who adds it picks" lacks review; "compliance classifies everything" creates a bottleneck that produces shortcut behaviour.
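The CI rule mentioned in Q4 can be approximated with a small check over the columns a migration adds. The migration format here (a list of dictionaries with `table`, `column`, and `tier` keys) and the function `migration_errors` are assumptions for illustration; a real check would parse whatever migration tool the platform uses.

```python
VALID_TIERS = {"public", "internal", "sensitive", "regulated"}


def migration_errors(added_columns: list[dict]) -> list[str]:
    """Return one error per added column that lacks a valid tier label.

    Hypothetical input shape: {"table": ..., "column": ..., "tier": ...}.
    A CI job would fail the build when the returned list is non-empty.
    """
    errors = []
    for col in added_columns:
        name = f"{col['table']}.{col['column']}"
        tier = col.get("tier")
        if tier is None:
            errors.append(f"{name}: missing tier label")
        elif tier not in VALID_TIERS:
            errors.append(f"{name}: unknown tier {tier!r}")
    return errors
```

A migration that adds `customers.referral_code` without a tier would produce one error and block the merge, which pushes the classification decision to the moment the field is created.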

What to do differently after reading this¶

Classify every field in every store. Do not classify records — fields are the right granularity.
Use the four-tier model unless a domain reason demands more or fewer.
Propagate classification through every derived store, log, and audit.
Enforce at the access layer; labels alone are documentation.
Make classification a required field in schema migrations; CI blocks omissions.

Bridge. With fields classified, the next discipline is the purpose of each access. A query that just says "give me this customer's email" is half a story; the access mediator should know why. The next chapter is purpose binding — the discipline that turns access from authority into intent. → 03-purpose-binding.md