07. Honest admission: What requirements still cannot predict about AI products

~18 min read. The brief is signed. The team ships. Six months later the system fails in ways nobody wrote a requirement for — not because the team was lazy, but because the failure mode did not exist until users arrived.

This chapter builds on the first-principles overview in 00-first-principles.md. The full module walked the brief, the user model, the capability envelope, the evaluation contract, the safety boundary, and the cost frame. Every one of those mechanisms assumed the product's behaviour could be specified in advance. This chapter asks what happens when it cannot, and how you build requirements that survive contact with the unknowable.

1) What this file solves

A "complete" requirements document for an AI product is a comforting fiction. Traditional software does what you coded. AI products do what a statistical model infers — and that inference shifts with distribution, prompt, context length, model version, and user behaviour. The gap between "specified" and "actual" is not a bug to fix. It is a structural property of the medium.

This file names the five gaps no requirements document can close, shows how each gap manifests post-launch, and offers the only honest response: requirements that specify how the system reacts to unexpected behaviour rather than pretending to enumerate all possible behaviours.

This is not an incomplete-spec problem. It is a fundamentally-unpredictable-behaviour problem. You cannot write requirements for capabilities you have not observed yet, but you can write requirements for how the system responds to unexpected behaviour.

2) What the module taught and what remains genuinely hard

Chapters 01 through 06 built a complete requirements practice:

| Chapter | What it locked down |
| --- | --- |
| 01. First principles | The brief structure: who, what, how-well, how-safe, how-much |
| 02. User needs | Personas, jobs-to-be-done, failure cost per persona |
| 03. Capability spec | What the model must do, boundaries of what it must not attempt |
| 04. Evaluation contract | Metrics, thresholds, eval sets, regression gates |
| 05. Safety boundary | Harm taxonomy, guardrails, escalation rules |
| 06. Cost frame | Token budgets, latency targets, infrastructure caps |

That is a strong foundation. It is also insufficient. What remains hard:

Behaviour the model exhibits that nobody specified or tested for.
User strategies that emerge only after the product ships.
Model upgrades that silently change capability boundaries.
Compositions of safe components that produce unsafe wholes.
Gradual distribution drift that invalidates day-one eval sets.

These are not signs of poor requirements work. They are the structural reality of building on a substrate whose behaviour is learned, not programmed.

3) The requirement that was correct on day one and wrong by month three

The fintech wiki assistant launched with a clear requirement: "The assistant shall answer questions about internal company policies using only the approved knowledge base. It shall refuse questions outside scope."

Day one: perfect. The assistant answers policy questions, refuses off-topic requests, and passes all eval gates.

Month three: a team lead discovers that phrasing a policy question as a hypothetical ("If our refund policy were different, what would it say?") causes the assistant to generate plausible but fabricated policy text. The requirement said "answer from the knowledge base," and every eval case confirmed that it did. No eval case used a conditional framing, so nobody saw that such a framing bypasses the retrieval grounding. The system still passes every gate the team wrote while violating the requirement's intent.

This is not a requirements failure. It is a requirements boundary. No finite set of rules or eval cases can enumerate all phrasings that trick a language model into generating ungrounded text. The requirement was correct. The input distribution shifted.

Callout: shifting baselines. Google's Bard launched with a factual accuracy requirement that passed internal evals. Its February 2023 launch demo nevertheless included a confident, incorrect claim about the James Webb Space Telescope, which outside observers caught within days. The requirement was not wrong; the failing input was simply not in the eval set. Bing Chat's Sydney persona emerged from long conversational patterns that no pre-launch requirement anticipated.

4) Three categories of unknowable requirements

Some requirements cannot be written because the behaviour they would govern has not been observed yet. Three categories:

Emergent behaviour. The model produces outputs that follow from its training but were never specified or tested. Researchers have reported that GPT-4 solves theory-of-mind style tasks, although that was never an explicit training objective. Safety evaluations of Claude have surfaced multi-step deceptive behaviour that researchers had not originally planned to test for. These capabilities arrive silently.

Adversarial use. Users intentionally probe boundaries that requirements assumed would hold. DAN jailbreaks on ChatGPT. Prompt injection through user-uploaded documents in retrieval systems. Indirect prompt injection via web content in Bing Chat. The attack surface is the input language itself — unbounded and creative.

Capability drift. The model vendor ships an update. Behaviour changes. In late 2023, users reported that GPT-4 Turbo returned shorter, less complete outputs than its predecessor, and OpenAI acknowledged the reports; products built to specific output-length requirements could regress without any code change. Anthropic's Claude model updates occasionally shift refusal boundaries: a request that was previously refused may start working, or vice versa.

5) The rule: AI requirements are living contracts, not fixed specifications

Traditional requirements freeze at sign-off. AI requirements must breathe. The rule:

Every AI product requirement must specify both the desired behaviour AND the monitoring, escalation, and revision protocol for when that behaviour drifts.

A requirement without a drift-detection clause is a promise without enforcement. It will be violated silently, and nobody will know until a user reports damage.

This does not mean requirements are vague. It means they are layered:

- Layer 1: What the system should do (static).
- Layer 2: How we detect when it stops doing that (monitoring).
- Layer 3: What happens when drift is detected (escalation).
- Layer 4: How the requirement itself gets revised (governance).
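The four layers can be captured in a small record so that a requirement with an empty layer is easy to spot. This is a minimal sketch, not a prescribed schema: the class name `LivingRequirement`, its field names, and the example requirement text are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LivingRequirement:
    """One requirement expressed in the four layers (hypothetical schema)."""
    req_id: str
    behaviour: str        # Layer 1: what the system should do (static)
    monitoring: str = ""  # Layer 2: how we detect when it stops doing that
    escalation: str = ""  # Layer 3: what happens when drift is detected
    governance: str = ""  # Layer 4: how the requirement itself gets revised

    def missing_layers(self) -> list[str]:
        """Return the names of the layers that are still empty."""
        layers = {
            "monitoring": self.monitoring,
            "escalation": self.escalation,
            "governance": self.governance,
        }
        return [name for name, text in layers.items() if not text.strip()]

    def is_enforceable(self) -> bool:
        """A requirement counts as enforceable only when all layers are filled."""
        return not self.missing_layers()


# Illustrative requirement, loosely based on the wiki assistant example.
scope_req = LivingRequirement(
    req_id="REQ-001",
    behaviour="Answer policy questions using only the approved knowledge base.",
    monitoring="Sample answers weekly and check each against retrieved sources.",
)
print(scope_req.missing_layers())
```

A review checklist could refuse sign-off for any requirement where `is_enforceable()` is false, which turns "a requirement without a drift-detection clause" into something a tool can flag.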

Callout — living requirements in practice. Stripe's AI fraud detection system reviews and revises its requirements quarterly based on observed attack pattern evolution. Each revision is a formal document with sign-off — not a casual edit. The requirement lives, but it lives under governance.

6) The five honest gaps in AI product requirements

THE GAP OVER TIME

Specified       Actual
behaviour       behaviour
    │               │
    │    ┌──────┐   │
    │    │ DAY 1│   │
    │    │ gap  │   │
    │    │ tiny │   │
    │    └──────┘   │
    │               │
    │  ┌──────────┐ │
    │  │ MONTH 3  │ │
    │  │ gap grows│ │
    │  │ users    │ │
    │  │ adapt    │ │
    │  └──────────┘ │
    │               │
    │┌────────────┐ │
    ││ MONTH 6    │ │
    ││ gap wide   │ │
    ││ model      │ │
    ││ updated    │ │
    ││ new failure│ │
    ││ modes      │ │
    │└────────────┘ │
    ▼               ▼
  time →

Gap 1: Emergent behaviour (the model does things you did not specify)

The model's training gives it capabilities that surface only under specific input distributions. You did not require it. You did not test for it. Users find it.

Real-world: GitHub Copilot was built for code completion. Users discovered it could generate entire unit test suites, write documentation, and translate between programming languages — none of which were in the original product requirements. The team had to retroactively write requirements for behaviour that users had already adopted.

Real-world: ChatGPT's ability to roleplay characters emerged from general language modelling. No requirement specified it. The behaviour created both a beloved feature and a safety surface (persona manipulation) that the original safety requirements did not cover.

Real-world: Midjourney's models developed a consistent "Midjourney aesthetic" that was never a requirement — users came to expect it, and model updates that changed the aesthetic triggered user revolt.

Gap 2: Capability cliffs (works perfectly until a slight input variation breaks it)

The system passes all benchmarks. A user changes one word. The output collapses. Requirements specified the happy path. The cliff was one token away.

Real-world: GPT-4's reported bar exam result (around the 90th percentile) is a headline score that can hide a boundary: performance on exam-style questions can fall sharply when they are lightly paraphrased with uncommon vocabulary. The capability exists but has a sharp edge that no requirement anticipated.

Real-world: Autonomous driving systems such as Waymo's perform well in mapped territories and can fail badly in construction zones where lane markings contradict map data, a cliff between "trained distribution" and "slightly outside."

Real-world: Retrieval-augmented systems frequently show high accuracy on benchmark queries but fail on queries where the relevant chunk sits at position 15+ in the retrieval results — a position-sensitivity cliff.

Gap 3: User adaptation (users change behaviour once they see the AI)

The product ships. Users learn what it can do. They change how they ask. The new asking patterns are not in the eval set.

Real-world: When Stack Overflow integrated AI answers, users began prefixing questions with "ignore previous answers and just give me code" — a prompt-injection-like behaviour that emerged from user adaptation to the AI interface.

Real-world: After ChatGPT launched, users rapidly learned "chain-of-thought prompting" without any product guidance — a user-discovered interaction pattern that altered the distribution of inputs far from what the product was evaluated on.

Real-world: Customer support chatbots at multiple banks report that within three months of launch, users develop "bot-friendly" language — shorter sentences, explicit keywords — which changes the input distribution enough to invalidate pre-launch accuracy measurements.
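One way to notice this kind of shift is to compare the shape of incoming queries with the shape of the pre-launch eval set. The sketch below uses the population stability index (PSI) over query word counts. The bucket edges, the sample data, and the helper names `bucket_shares` and `population_stability_index` are assumptions for illustration, and word count is only one of many features a team might track.

```python
import math
from collections import Counter


def bucket_shares(word_counts, edges):
    """Share of queries per length bucket.

    `edges` are inclusive upper bounds; the final bucket is open-ended.
    """
    if not word_counts:
        raise ValueError("need at least one query")
    counts = Counter()
    for n in word_counts:
        idx = next((i for i, edge in enumerate(edges) if n <= edge), len(edges))
        counts[idx] += 1
    total = len(word_counts)
    return [counts[i] / total for i in range(len(edges) + 1)]


def population_stability_index(baseline, current, floor=1e-4):
    """PSI = sum((c - b) * ln(c / b)); shares are floored to avoid log(0)."""
    psi = 0.0
    for b, c in zip(baseline, current):
        b, c = max(b, floor), max(c, floor)
        psi += (c - b) * math.log(c / b)
    return psi


# Hypothetical query lengths (in words) before launch and in month three.
edges = [5, 10, 20]
prelaunch = bucket_shares([12, 15, 9, 20, 14, 11, 18, 16], edges)
month_three = bucket_shares([4, 5, 3, 6, 4, 12, 5, 3], edges)
print(round(population_stability_index(prelaunch, month_three), 2))
```

A PSI above roughly 0.25 is a commonly quoted rule of thumb for a significant shift; the right alert threshold for a given product is a judgment call rather than a fixed constant.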

Gap 4: Model drift (the underlying model changes capabilities over time)

The vendor ships a patch. Your product requirement said "summarise in under 100 words." The new model version averages 140. Nobody told you.

Real-world: In July 2023, researchers documented that GPT-4's accuracy on identifying prime numbers dropped from 97.6% to 2.4% between March and June versions. Products depending on mathematical reasoning silently broke.

Real-world: Anthropic's Claude model updates periodically shift the refusal boundary — queries that triggered refusal in one version may pass in the next, creating silent safety regressions in products that relied on model-level refusal as a guardrail.

Real-world: OpenAI's gpt-4-0613 snapshot, which introduced function calling, behaved differently from gpt-4-0314 on existing prompts. Teams reported that agent workflows tuned against the older snapshot broke, even though the changelog did not mention the specific failure modes.
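The "summarise in under 100 words" case from the start of this gap is one of the easier drifts to catch automatically. The following sketch checks a batch of summaries against a word limit; the function name `length_regression` and the 5% tolerated violation rate are assumptions, not values taken from any vendor or standard.

```python
from statistics import mean


def length_regression(summaries, max_words=100, max_violation_rate=0.05):
    """Compare summary lengths with a word limit and report whether the gate passes."""
    if not summaries:
        raise ValueError("need at least one summary")
    lengths = [len(text.split()) for text in summaries]
    violations = sum(1 for n in lengths if n > max_words)
    rate = violations / len(lengths)
    return {
        "mean_words": mean(lengths),
        "violation_rate": rate,
        "passes": rate <= max_violation_rate,
    }


# Hypothetical outputs sampled after a vendor update: one short, one too long.
sample = ["word " * 90, "word " * 140]
print(length_regression(sample))
```

Running the same check on a fixed prompt set before and after each model version change gives a direct comparison, so a jump in the mean becomes visible the day it happens rather than at quarterly review.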

Gap 5: Compositional risk (safe components compose into unsafe systems)

Each component passes its safety requirements individually. Combined, they produce behaviour that no single requirement covers.

Real-world: A retrieval system (safe) connected to a summariser (safe) connected to an email sender (safe) can be manipulated to retrieve a document containing injection instructions, summarise those instructions as "the user's request," and email the result to an attacker-specified address. Each component did exactly what its requirements specified.

Real-world: Microsoft's Tay chatbot combined a learning component (requirements: learn from user interactions) with a response component (requirements: be engaging and responsive). Neither requirement was wrong. Combined, the system learned and enthusiastically reproduced toxic content within 16 hours.

Real-world: AutoGPT-style systems combine planning, web browsing, and code execution — each individually sandboxed and safe — but the composition allows the planner to instruct the browser to fetch malicious instructions that the code executor then runs.

Callout: compositional risk is the hardest gap. You cannot test every composition. A system with 10 components has 720 ordered three-step interaction paths (10 × 9 × 8) and about 3.6 million orderings (10!) that chain all ten components. Requirements for individual components are necessary but nowhere near sufficient.
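The counts behind this kind of claim are easy to check directly. For ten components, ordered three-step paths number in the hundreds, while a figure of roughly 3.6 million corresponds to 10!, the number of ways to chain all ten components once each. The variable names below are illustrative.

```python
import math

components = 10

# Ordered three-step paths that never revisit a component: 10 * 9 * 8.
three_step_paths = math.perm(components, 3)

# Three-step paths when a component may appear more than once: 10 ** 3.
three_step_with_repeats = components ** 3

# Orderings that chain all ten components exactly once: 10!.
full_orderings = math.factorial(components)

print(three_step_paths, three_step_with_repeats, full_orderings)
# → 720 1000 3628800
```

Either way the conclusion holds: the number of paths grows much faster than the number of components, so exhaustive composition testing stops being practical very quickly.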

7) How the wiki assistant hit each gap post-launch

The fintech wiki assistant — six months in production:

| Gap | What happened | When detected | Detection method |
| --- | --- | --- | --- |
| Emergent behaviour | Assistant began explaining the reasoning behind policies, not just citing them; users loved it, but some explanations were fabricated | Month 2 | User reported a "policy explanation" that contradicted the actual policy document |
| Capability cliff | Questions about policies updated within the last 72 hours consistently returned stale answers because the retrieval index lagged by 72 hours, which was never tested | Month 1 | New-hire onboarding surfaced the gap when a policy changed during their first week |
| User adaptation | Power users learned to ask "what should our policy be?" instead of "what is our policy?" and the assistant began generating policy recommendations | Month 3 | Compliance audit found "AI-generated policy suggestions" being circulated as official |
| Model drift | Vendor model update changed how long contexts were handled; long policy documents that were previously processed in full began being truncated | Month 4 | Accuracy metrics dropped 12% overnight with no code change |
| Compositional risk | Wiki assistant + Slack integration + @channel mention = the assistant could be tricked into broadcasting fabricated policy to all-company channels | Month 5 | Security red-team exercise |

Every one of these was a requirement that could not have been written at launch. The team had no evidence these failures were possible until they happened. The correct response was not "we should have written better requirements" — it was "we should have written requirements for how to detect and respond to unknown failure modes."

8) Requirements as living documents vs requirements as launch gates

Two schools of practice. Both have merit. Neither is complete.

School A: Requirements as launch gates.

- Requirements are written, signed off, and frozen.
- The product ships only when all requirements pass.
- Post-launch changes require a formal change request.
- Strength: accountability, traceability, regulatory compliance.
- Weakness: requirements become fiction within months as behaviour drifts.

School B: Requirements as living documents.

- Requirements are versioned, monitored, and revised quarterly.
- The product ships when requirements pass at that moment.
- Post-launch monitoring triggers requirement revision automatically.
- Strength: stays honest about actual system behaviour.
- Weakness: scope creep, lack of accountability, "requirements" that describe rather than prescribe.

Real-world: Regulatory AI products (medical devices, financial services) must use School A for audit trails. But internally, the best teams layer School B monitoring underneath the frozen School A document — they know when the frozen requirements no longer describe reality, even if formal revision takes time.

Real-world: Anthropic's Constitutional AI approach treats the constitution (requirements) as a living document — principles are added, revised, and tested iteratively. This is School B applied to safety requirements.

Real-world: Tesla's FSD requirements start as launch gates (School A) but the system's capabilities evolve through continuous training — creating a tension where the frozen requirements describe a less capable system than what is deployed.

Callout — the pragmatic answer. Use School A for external-facing requirements (contracts, regulatory filings, user-facing safety promises). Use School B for internal engineering requirements (capability tracking, drift monitoring, eval-set evolution). Two documents. Same system. Different audiences. Different update cadences.

9) Why "agile requirements" is not the same as "no requirements"

A dangerous conflation: "requirements should evolve" becomes "we do not need requirements."

| Claim | What it actually means | Why it fails |
| --- | --- | --- |
| "We'll discover requirements through prototyping" | No upfront analysis; ship and see | You discover failure modes by harming users |
| "Agile means no spec" | Iterate without a target | You cannot measure regression without a baseline |
| "The model is the spec" | Whatever the model does is correct | You have no recourse when it does something harmful |
| "Requirements slow us down" | Accountability is overhead | Accountability is the only thing between you and a safety incident |
| "AI is too unpredictable for requirements" | Abdicate responsibility | The unpredictability is exactly why you need requirements for response protocols |

The correct position: write requirements for what you can predict. Write response protocols for what you cannot. Never write nothing.

Real-world: Replika shipped intimate conversation capabilities without written safety requirements because the team believed the model would "naturally" be appropriate. It was not. Concerns about harm to users, including minors, led to regulatory action in Italy, where the data protection authority ordered the company in February 2023 to stop processing Italian users' data.

Real-world: Character.ai initially treated model outputs as the de facto specification. After multiple incidents involving vulnerable users, the company retrofitted safety requirements — at far greater cost and with far more damage than upfront requirements would have imposed.

10) Signals that requirements need updating

Requirements do not announce their own obsolescence. You must instrument for it.

Signal 1: Eval accuracy drops without code changes. The model or distribution shifted. Your accuracy requirement (e.g., "90% correctness on the benchmark") is still technically the requirement — but the system no longer meets it, and nobody noticed until quarterly review.

Signal 2: User workaround patterns emerge. Users begin phrasing requests in unnatural ways to get better results. This means the natural input distribution has diverged from what the system handles well. Your "user experience" requirements are stale.

Signal 3: Support tickets cluster around a new category. A failure mode that your requirements did not anticipate is now the top user complaint. The requirement gap is the ticket cluster.

Signal 4: Red-team finds a new attack class. A jailbreak technique that did not exist at requirements time now bypasses your safety guardrails. The safety requirement is technically intact — the guardrail spec is unchanged — but the threat model has evolved past it.

Signal 5: A model version change requires prompt rewriting. If a model update forces prompt changes, it means the model's interpretation of your instructions shifted. Requirements that assumed stable model behaviour need revision.

Real-world: Notion AI monitors a "prompt rewrite frequency" metric — how often the team must change prompts to maintain quality. When this metric spikes, they trigger a requirements review. The prompt is the canary.
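Signal 1 is the most mechanical of the five, so it is a natural first alert to automate. The sketch below flags days where eval accuracy fell below the requirement threshold with no code deploy in the preceding days. The function name, the three-day lookback window, and the sample data are assumptions for illustration.

```python
from datetime import date, timedelta


def unexplained_accuracy_drops(daily_accuracy, deploy_dates, threshold=0.90,
                               lookback_days=3):
    """Dates where accuracy fell below threshold with no deploy in the lookback window."""
    flagged = []
    for day, accuracy in sorted(daily_accuracy.items()):
        if accuracy >= threshold:
            continue
        window_start = day - timedelta(days=lookback_days)
        recent_deploy = any(window_start <= d <= day for d in deploy_dates)
        if not recent_deploy:
            # No code change explains the drop: suspect model or input drift.
            flagged.append(day)
    return flagged


# Hypothetical eval history: one drop follows a deploy, one does not.
history = {
    date(2024, 1, 1): 0.93,
    date(2024, 1, 2): 0.81,
    date(2024, 1, 10): 0.85,
}
deploys = [date(2024, 1, 9)]
print(unexplained_accuracy_drops(history, deploys))
```

Separating "drop after a deploy" from "drop with no deploy" matters because the two route to different owners: the first is an engineering regression, the second is a candidate requirements review.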

11) What structured uncertainty looks like in practice

You cannot eliminate the five gaps. You can structure your response to them.

| Gap | What you CAN mitigate | What you CANNOT mitigate | Residual risk |
| --- | --- | --- | --- |
| Emergent behaviour | Monitor output diversity; flag novel output patterns | Predict which capabilities will emerge next | A capability surfaces that your monitoring does not flag because it looks like normal output |
| Capability cliffs | Adversarial eval sets; fuzz testing inputs | Enumerate all cliff edges | A user finds a cliff between your fuzz patterns |
| User adaptation | Track input distribution drift; refresh eval sets quarterly | Predict how users will adapt before they do | Users adapt faster than eval sets update |
| Model drift | Pin model versions; test before upgrading; regression gates | Control vendor model changes | Vendor deprecates your pinned version with 30-day notice |
| Compositional risk | Integration testing; red-teaming composed systems | Test all possible compositions | A novel composition path bypasses tested paths |

The honest answer: structured uncertainty means you accept residual risk explicitly, document it, assign an owner, set a review cadence, and tell stakeholders what you cannot guarantee. That is not failure. That is engineering maturity.
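Accepting residual risk explicitly is easier to sustain when each risk is a record with an owner and a review date rather than a sentence in a document. This is a minimal sketch of such a register; the `ResidualRisk` fields, owner names, and cadences are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ResidualRisk:
    gap: str                # which of the five gaps this risk belongs to
    description: str        # what cannot be guaranteed
    owner: str              # who is accountable for the review
    review_every_days: int  # review cadence
    last_reviewed: date

    def next_review(self) -> date:
        return self.last_reviewed + timedelta(days=self.review_every_days)


def overdue(risks, today):
    """Risks whose review date has passed, most overdue first."""
    late = [r for r in risks if r.next_review() < today]
    return sorted(late, key=lambda r: r.next_review())


# Illustrative register entries.
register = [
    ResidualRisk("Model drift", "Vendor deprecates the pinned version",
                 "platform-lead", 30, date(2024, 1, 1)),
    ResidualRisk("Capability cliffs", "A cliff sits between fuzz patterns",
                 "eval-lead", 90, date(2024, 1, 1)),
]
for risk in overdue(register, today=date(2024, 2, 15)):
    print(risk.gap, risk.owner, risk.next_review())
```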

Callout: the "unknown unknowns" budget. Anthropic, OpenAI, and Google DeepMind all maintain teams that red-team and evaluate frontier models for behaviours that nobody asked about. This is the organisational equivalent of a requirements line item that says "we do not know what else this model can do, and we are actively looking."

12) Wrong assumption: "once we write good requirements, the hard part is engineering"

This assumption is seductive because it is half true. Engineering is hard. But requirements for AI products are never "done" in the way requirements for a bridge are done.

A bridge's requirements can be frozen because steel does not spontaneously develop new properties. A language model's behaviour shifts with every training run, every prompt variation, every new user strategy. The requirements surface is alive.

The honest framing:

Requirements are the starting hypothesis, not the final answer.
Engineering is building the system that satisfies the hypothesis today.
Operations is detecting when the hypothesis no longer matches reality.
Governance is updating the hypothesis under change control.

All four are continuous. None is "the hard part." The hard part is maintaining coherence across all four as the system drifts.

Real-world: Cruise's autonomous vehicle programme discovered that their requirement "stop for pedestrians" was correct but insufficient — the system also needed to understand that stopping in the middle of an intersection while satisfying the pedestrian requirement created a different safety hazard. The requirement was right. The composition with the road context was wrong. They learned this in production.

13) Active industry debates

Debate 1: How much to specify upfront vs discover through prototyping

Position A: Specify thoroughly before building. Reduces rework. Creates accountability. Enables eval-gated development. Cost of upfront analysis is lower than cost of post-launch failures.

Position B: Prototype first, specify what you learn. AI products have too many unknowns for upfront specification to be valuable. The prototype reveals requirements that imagination cannot.

Current state: Most mature teams do both: a lightweight brief (1-2 pages), then a prototype sprint, then full requirements informed by prototype learnings. Pure Position A misses emergent insights. Pure Position B ships unsafe products.

Debate 2: Whether requirements should include model capability assumptions

Position A: Yes — requirements should state "assumes GPT-4-class reasoning" so that model changes trigger requirement review. Without capability assumptions, requirements are disconnected from their implementation substrate.

Position B: No — requirements should be model-agnostic so that model upgrades do not require requirement revision. A good requirement states what the system must do, not how it achieves it.

Current state: Unresolved. Teams that pin model versions tend toward Position A. Teams that use model routing tend toward Position B. Both have production successes. Both have production failures.

Debate 3: How to require safety for behaviours you have not imagined yet

Position A: Require behavioural testing frameworks (red-teaming cadence, adversarial eval evolution, incident-triggered test expansion) rather than specifying all safe behaviours. Safety is a process requirement, not a behaviour enumeration.

Position B: Require a closed-world assumption — enumerate permitted behaviours, deny everything else. If it is not explicitly allowed, it is a violation. More restrictive but more auditable.

Current state: Position A dominates in general-purpose AI products (ChatGPT, Claude). Position B dominates in regulated domains (medical AI, financial AI). The gap between them is where most safety incidents live.

Debate 4: Whether requirements should constrain or merely observe model selection

Position A: Requirements should specify capabilities without constraining which model delivers them. Let engineering choose the implementation. This is classical separation of concerns.

Position B: Model selection has safety implications that requirements must acknowledge. A requirement that says "summarise accurately" means different things for GPT-4 vs a fine-tuned Llama-7B. Requirements without model awareness are incomplete.

Current state: Increasingly, teams are adopting a middle position — requirements specify capability thresholds (accuracy, latency, safety) and engineering proves which models meet them via eval gates. The requirement does not name the model, but the eval gate implicitly constrains selection.
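The middle position can be sketched as a gate that knows only thresholds and measured metrics, never model names. The threshold values, metric names, and candidate models below are hypothetical; the point is that the requirement lives in `THRESHOLDS` while model selection falls out of the eval results.

```python
# Requirement layer: capability thresholds only (illustrative values).
THRESHOLDS = {
    "accuracy": ("min", 0.90),
    "latency_s": ("max", 2.0),
    "safety_pass_rate": ("min", 0.995),
}


def failed_thresholds(metrics, thresholds=THRESHOLDS):
    """Names of the thresholds that the measured metrics do not meet."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(name)
    return failures


def eligible_models(candidates, thresholds=THRESHOLDS):
    """Models whose eval results satisfy every threshold."""
    return [name for name, metrics in candidates.items()
            if not failed_thresholds(metrics, thresholds)]


# Engineering layer: measured eval results for hypothetical candidates.
candidates = {
    "model-a": {"accuracy": 0.93, "latency_s": 1.4, "safety_pass_rate": 0.997},
    "model-b": {"accuracy": 0.95, "latency_s": 2.6, "safety_pass_rate": 0.998},
}
print(eligible_models(candidates))
print(failed_thresholds(candidates["model-b"]))
```

If `eligible_models` comes back empty, that result is the evidence for renegotiating the thresholds rather than quietly relaxing them.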

Recall (check yourself)

What are the five honest gaps in AI product requirements?
Why is a "complete" requirements document for an AI product a structural impossibility rather than a sign of poor analysis?
What is the difference between requirements as launch gates and requirements as living documents?
How does user adaptation invalidate pre-launch eval sets?
What does "compositional risk" mean and why can individual component testing not catch it?
What is the four-layer structure of a living AI requirement (static behaviour, monitoring, escalation, governance)?
What five operational signals indicate that requirements need updating?
Why is "agile requirements means no requirements" a dangerous conflation?

Interview-ready Q&A

Q1. Your AI product passed all requirements at launch. Three months later, users report failures that no requirement covers. What happened? A. The five gaps — emergent behaviour, capability cliffs, user adaptation, model drift, and compositional risk — all produce post-launch failures that cannot be anticipated by pre-launch requirements. The correct response is to build monitoring for novel failure patterns, trigger requirement revision when new failure classes are detected, and accept that AI requirements are hypotheses, not guarantees. Common wrong answer to avoid: "The requirements were poorly written." They were correct at launch. The system drifted.

Q2. How would you write a requirement for a behaviour you cannot predict? A. You cannot require the behaviour itself. You require the response protocol: monitoring that detects anomalous outputs, escalation paths when anomalies exceed thresholds, rollback mechanisms, and a governance cadence that revises requirements based on observed failures. The requirement is not "the system shall not hallucinate" — it is "the system shall detect and suppress outputs that diverge from retrieved sources by more than X, and escalate to human review when detection confidence is below Y." Common wrong answer to avoid: "You can't — just ship and see." Abdicating response protocols is negligence, not pragmatism.
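The response protocol in this answer can be sketched with a deliberately crude detector. The code below scores how much of an answer's vocabulary appears in the retrieved sources and routes the answer accordingly. As a simplification, a single token-overlap score stands in for both the divergence measure (X) and the detector confidence (Y); the two thresholds, the function names, and the sample policy text are assumptions, and a production system would likely use a stronger grounding check.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def support_score(answer, sources):
    """Fraction of answer tokens that also appear somewhere in the sources."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 1.0
    source_vocab = set()
    for source in sources:
        source_vocab.update(_tokens(source))
    supported = sum(1 for tok in answer_tokens if tok in source_vocab)
    return supported / len(answer_tokens)


def route(answer, sources, suppress_below=0.5, review_below=0.8):
    """Suppress weakly supported answers; send borderline ones to human review."""
    score = support_score(answer, sources)
    if score < suppress_below:
        return "suppress"
    if score < review_below:
        return "human_review"
    return "deliver"


sources = ["Refunds are issued within 14 days of purchase."]
print(route("Refunds are issued within 14 days.", sources))
print(route("Refunds are issued within 30 days after manager approval", sources))
print(route("Employees receive unlimited vacation", sources))
```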

Q3. A model vendor update silently breaks your product. How should requirements have protected you? A. Requirements should include: model version pinning policy, mandatory regression testing before model version adoption, rollback procedure to previous model version, and monitoring for accuracy/safety metric drops that correlate with version changes. The requirement is not "use GPT-4" — it is "any model change must pass the regression gate within 24 hours or roll back automatically." Common wrong answer to avoid: "Use open-source models so you control updates." This solves vendor drift but not the general problem of model-behaviour change.
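The "pass the regression gate within 24 hours or roll back" rule is a small state decision. A minimal sketch follows, assuming the caller records when the new model version was adopted and when (if ever) the regression gate passed; the function name and return labels are illustrative.

```python
from datetime import datetime, timedelta


def model_change_decision(adopted_at, gate_passed_at, now,
                          deadline=timedelta(hours=24)):
    """Decide what to do with a newly adopted model version.

    gate_passed_at is None until the regression gate has passed.
    """
    if gate_passed_at is not None and gate_passed_at - adopted_at <= deadline:
        return "keep"
    if now - adopted_at > deadline:
        # Gate never passed, or passed too late: revert to the pinned version.
        return "rollback"
    return "pending"


adopted = datetime(2024, 1, 1, 9, 0)
print(model_change_decision(adopted, adopted + timedelta(hours=3),
                            now=adopted + timedelta(hours=4)))
print(model_change_decision(adopted, None, now=adopted + timedelta(hours=25)))
```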

Q4. How do you handle the tension between "requirements should be model-agnostic" and "model capabilities determine what requirements are achievable"? A. Separate the layers. Requirements specify capability thresholds (accuracy ≥ 90%, latency ≤ 2s, safety pass rate ≥ 99.5%). Eval gates prove which models meet thresholds. Engineering chooses the model. If no model meets all thresholds, requirements are renegotiated with evidence. The requirement does not name the model, but the eval gate implicitly constrains selection. Common wrong answer to avoid: "Just pick the best model and require whatever it can do." That is circular — you are describing existing behaviour, not specifying desired behaviour.

Q5. Your safety requirements passed at launch. A new jailbreak technique bypasses them. Is this a requirements failure? A. It is a requirements boundary, not a failure. The requirement correctly specified the safety behaviour against the known threat landscape at the time. The correct response is: detect the new technique via monitoring, add it to adversarial eval sets, revise the safety requirement to cover the new attack class, and deploy updated guardrails. A mature requirements practice includes "adversarial eval evolution cadence" as a requirement itself. Common wrong answer to avoid: "Yes — the requirements should have anticipated all attacks." That is impossible against an open input space.

Q6. How would you explain to a non-technical stakeholder why AI product requirements cannot guarantee behaviour? A. Traditional software follows coded instructions: if you write the rule, it obeys. AI products follow statistical patterns: they do what training made probable, not what a rule specified. The gap between "what we asked for" and "what it does" is non-zero and grows over time as users adapt, models change, and new input patterns emerge. Requirements specify the target and the monitoring; they cannot guarantee the hit. Common wrong answer to avoid: "AI is just non-deterministic." This is partly true but misses the structural point about learned vs programmed behaviour.

Q7. Should AI product requirements specify the model to use? A. Generally no — requirements should specify capabilities (what the system must achieve) and constraints (latency, cost, safety thresholds). Model selection is an engineering decision validated through eval gates. However, in regulated domains, requirements may need to acknowledge model-class assumptions because capability levels vary dramatically between model tiers. Common wrong answer to avoid: "Always specify the model for reproducibility." This creates brittleness — model deprecation breaks your requirements.

Apply now (15 min)

Step 1 — model the exercise. Take the wiki assistant's compositional risk incident (wiki + Slack + @channel). I will map the gap-to-response for this one failure:

| Layer | Existing requirement | Gap exposed | Revised requirement |
| --- | --- | --- | --- |
| Static behaviour | "Respond only from knowledge base" | Did not cover downstream broadcast | "Respond only from knowledge base AND outputs must not trigger broadcast actions without human confirmation" |
| Monitoring | Accuracy metric on Q&A pairs | Did not detect broadcast events | Add: "alert on any output that reaches >10 users simultaneously" |
| Escalation | Accuracy drop triggers review | Broadcast not an accuracy issue | Add: "any output reaching all-company channel triggers immediate suppression and incident review" |
| Governance | Quarterly requirement review | Took 5 months to discover | Add: "integration-path red-team exercise monthly" |
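The revised static, monitoring, and escalation rows translate into a small decision function. This sketch assumes the integration layer can report the recipient count and whether the target is an all-company channel; the function name, the action labels, and the choice to treat any message with more than one recipient as a "broadcast action" are illustrative assumptions.

```python
def broadcast_decision(recipient_count, is_all_company_channel, human_confirmed,
                       alert_above=10):
    """Map one outgoing assistant message to the actions in the revised requirement."""
    actions = []
    if recipient_count > alert_above:
        # Monitoring row: alert on any output that reaches more than 10 users.
        actions.append("alert")
    if is_all_company_channel:
        # Escalation row: immediate suppression and incident review.
        actions += ["suppress", "open_incident_review"]
    elif recipient_count > 1 and not human_confirmed:
        # Static row: no broadcast action without human confirmation.
        actions.append("hold_for_confirmation")
    return actions or ["send"]


print(broadcast_decision(1, False, False))
print(broadcast_decision(25, False, False))
print(broadcast_decision(500, True, True))
```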

Step 2 — your turn. Pick one of the other four gaps the wiki assistant experienced (from section 7). Write the same four-layer response: what the original requirement said, what gap it exposed, and what the revised requirement should add at each layer.

Step 3 — stress test. Now imagine a sixth gap that has not happened yet for the wiki assistant. Invent a plausible failure mode that fits none of the five named gaps. Write the monitoring signal that would detect it. If you can imagine a failure outside the five gaps, you understand why this list is incomplete by definition.

Operational memory

This chapter explained why AI product requirements are structural hypotheses rather than fixed guarantees, and why the gap between specified and actual behaviour grows over time through five mechanisms: emergent behaviour, capability cliffs, user adaptation, model drift, and compositional risk. The important idea is that you cannot write requirements for capabilities or failures you have not observed — but you can and must write requirements for how the system detects, responds to, and escalates unexpected behaviour.

You learned the four-layer structure of a living requirement (static behaviour, monitoring, escalation, governance) and the five operational signals that indicate requirements are stale (accuracy drop without code change, user workaround patterns, new support ticket clusters, novel attack classes, prompt rewrite frequency spikes). You learned why "agile requirements" does not mean "no requirements" — it means requirements with built-in revision protocols rather than requirements that pretend the world is frozen.

The module is complete. Every chapter added a layer: the brief gave structure, the user model gave direction, the capability spec gave boundaries, the eval contract gave measurement, the safety boundary gave protection, the cost frame gave feasibility, and this chapter gave honesty. Together they form the strongest starting position requirements can offer. What they cannot offer is certainty. That is not a flaw in the practice. That is the nature of the medium.

Remember:

AI requirements are hypotheses validated by monitoring, not guarantees enforced by code.
The five gaps — emergent behaviour, capability cliffs, user adaptation, model drift, compositional risk — will manifest in every AI product; the question is detection speed, not prevention.
A requirement without a drift-detection clause is a promise without enforcement.
"Living requirements" does not mean "no requirements" — it means requirements with built-in revision protocols under governance.
Every requirement needs four layers: static behaviour, monitoring, escalation, governance.
The five operational signals of stale requirements: accuracy drop without code change, user workarounds, new ticket clusters, novel attacks, prompt rewrite spikes.
Compositional risk is the hardest gap because individual component testing cannot detect it — only integration-path testing and red-teaming can.

Bridge. Requirements are now as complete as they can be before engineering begins. The next module takes the architecture brief and asks the first engineering question: what shape should this system take? Single model call, ReAct loop, multi-agent crew, or orchestrator? That is the leash decision, and everything else follows from it. → ../01_agentic_system_design/00-first-principles.md