05. Risk boundaries: What the system must never do, even when the model wants to

~18 min read. Your wiki assistant answered a compliance question wrong six weeks ago. You found out yesterday — from a lawyer's email. By the end of this page you will know how to classify every query by failure severity and impose architectural constraints that make the worst outcomes structurally impossible.

Built on the first-principles overview in 00-first-principles.md. Acceptance tests verified the system against known-good answers. AI-fit routing separated jobs that belong to a model from jobs that do not. Success metrics defined what "working" looks like. But none of those artifacts answer: what happens when the system is confidently wrong about something that matters? That question — the risk question — is the bridge between requirements and architecture. It determines where you need hard blocks, where you need human gates, and where you can tolerate graceful degradation.

1) The assistant gave wrong tax advice

Three months after launch the fintech wiki assistant is handling 4,200 queries per week. Satisfaction scores are 4.3/5. The team is planning v2 features.

Then an email arrives from Legal.

"An employee asked the assistant whether stock option exercises trigger AMT. The assistant said no. The employee filed accordingly. The IRS disagreed. We are now in a remediation conversation with outside counsel. Please disable the tax-advice capability immediately."

The assistant was not hallucinating randomly. It retrieved a two-year-old internal FAQ that predated a policy change. It synthesized confidently. The user trusted it — because every previous answer had been correct. The acceptance tests never covered this case because nobody anticipated the specific query.

One wrong answer. Six-figure legal exposure. A capability shutdown that took three days to scope because nobody had classified which queries carried regulatory risk.

This is not a model reliability problem. It is a classification problem. The team treated all queries equally, so a wrong PTO answer and a wrong tax-eligibility answer had the same (zero) safeguards. One costs an employee five minutes. The other costs the company a lawsuit.

2) What acceptance tests protect, and what they cannot catch

You built acceptance tests in 04-acceptance-tests.md. They verify:

Known questions produce correct answers (regression)
Source citations point to real documents (grounding)
Latency stays under the bar (performance)
Edge-case phrasings still route correctly (robustness)

What acceptance tests cannot catch:

Novel queries the test suite never anticipated
Stale sources that were correct at test-write time but changed since
Compositional answers that combine two correct facts into a wrong conclusion
Confidence without correctness — the model sounds right, the user trusts it, and no test fires because the specific combination was never enumerated

Acceptance tests are a floor. Risk boundaries are a ceiling. The floor tells you "the system works for cases we thought of." The ceiling tells you "the system cannot cause harm above this level, even for cases we did not think of."

Callout — The asymmetry. Acceptance tests scale linearly with the cases you write. Risk boundaries scale with the classes of harm you identify. Ten risk rules can protect against thousands of novel queries. Ten thousand acceptance tests still miss the one query that matters.

3) The compliance email that arrived eight weeks after launch

Timeline of the wiki assistant failure:

Week 0   Launch. 200 queries/day. All categories treated identically.
Week 1   User asks "Am I eligible for COBRA continuation?" 
         Assistant answers correctly (source doc is current).
Week 2   HR updates COBRA policy. Source doc is replaced.
         Embedding index is not re-run (scheduled monthly).
Week 3   User asks same question. Assistant retrieves stale doc.
         Answers with outdated eligibility window.
Week 6   Employee misses enrollment deadline based on answer.
Week 8   Legal contacts engineering. Capability frozen.
Week 12  Post-incident review. Team realizes no risk classification
         existed. No human gate. No staleness check on regulated topics.

The breakage pattern: a system that works perfectly on day one accumulates invisible risk as the world changes beneath its index. Without risk classification, there is no trigger to impose freshness guarantees, human review, or hard blocks on the queries where staleness has legal consequences.
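The missing staleness check can be sketched as follows. This is a minimal illustration, assuming each indexed document records when its source was last edited and when it was last embedded. The names `IndexedDoc`, `REGULATED_TOPICS`, `is_stale`, and `may_answer_from`, and the dates used, are hypothetical and not taken from any real library or from the incident itself.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical topic labels; a real system would derive these from its risk classification.
REGULATED_TOPICS = {"cobra", "fmla", "tax"}


@dataclass
class IndexedDoc:
    doc_id: str
    topic: str
    source_modified_at: datetime  # last edit to the source document
    embedded_at: datetime         # last time the document was embedded into the index


def is_stale(doc: IndexedDoc) -> bool:
    """A document is stale if its source changed after it was last embedded."""
    return doc.source_modified_at > doc.embedded_at


def may_answer_from(doc: IndexedDoc) -> bool:
    """Refuse to answer from a stale document when the topic is regulated."""
    return not (doc.topic in REGULATED_TOPICS and is_stale(doc))


# The Week 3 situation: the policy was replaced after the last index run.
# Dates are illustrative only.
cobra = IndexedDoc(
    doc_id="cobra-policy",
    topic="cobra",
    source_modified_at=datetime(2024, 1, 15),
    embedded_at=datetime(2024, 1, 1),
)
kitchen = IndexedDoc(
    doc_id="office-map",
    topic="facilities",
    source_modified_at=datetime(2024, 1, 15),
    embedded_at=datetime(2024, 1, 1),
)
```

With this check in place, `may_answer_from(cobra)` is false, so the Week 3 query would fall back to a redirect, while the equally stale but unregulated `kitchen` document can still be used.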

4) Four failure scenarios ranked by severity

Before building a framework, feel the gradient:

#	Query	Wrong answer	Consequence	Cost
1	"Where is the kitchen?"	"Second floor" (it's third floor)	Employee walks to wrong floor	30 seconds
2	"How many PTO days do I have?"	"18" (actually 15)	Employee plans wrong, HR corrects later	1 hour of confusion
3	"Am I eligible for FMLA leave?"	"No" (actually yes)	Employee does not apply; misses protected leave	Potential lawsuit
4	"Show me customer Jane Doe's SSN"	Retrieves and displays PII	Privacy violation, regulatory breach	$50K–$2M fine + reputation

Same system. Same model. Same retrieval pipeline. Four queries, with failure costs that range from 30 seconds of walking to a seven-figure fine. Treating them identically is an architecture bug, not a model bug.

Callout — The "usually right" trap. The model gets scenario 1 correct 99.5% of the time. It gets scenario 3 correct 97% of the time. Both numbers sound good. But 3% of 200 FMLA queries per year is 6 wrong answers — any one of which is a six-figure problem. Accuracy percentage means nothing without failure-cost weighting.

5) Rule: risk class determines architectural constraints, not the other way around

Teams frequently make this mistake:

Pick an architecture (RAG with no guardrails)
Build it
Get a compliance incident
Bolt on a filter
Get another incident in a different category
Bolt on another filter
End up with a patchwork that nobody can reason about

The correct sequence:

Classify every job by risk class
For each risk class, define the constraint (human review, hard block, audit trail)
The constraints dictate the architecture — not the other way around

This is the same "requirements before architecture" principle from the entire module, applied specifically to failure modes. You do not choose RAG and then ask "how do we make it safe." You identify what safety means for each query class and then choose the architecture that satisfies those constraints.

Callout: The irreversibility test. Ask: "If this answer is wrong, can the user recover without external help?" If yes → Class 1-2. If no → Class 3+. If a wrong answer causes harm to a third party → Class 4.
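The irreversibility test is small enough to write down as a function. This is a sketch only; the function name and its two boolean parameters are hypothetical, and the answers to both questions still come from people, not from code.

```python
def irreversibility_class(user_can_self_recover: bool, harms_third_party: bool) -> str:
    """Apply the irreversibility test and return a coarse risk band."""
    # Third-party harm is checked first because it overrides the other answer.
    if harms_third_party:
        return "Class 4"
    if not user_can_self_recover:
        return "Class 3+"
    return "Class 1-2"
```

For example, a wrong kitchen location gives `irreversibility_class(True, False)`, and a wrong FMLA eligibility answer gives `irreversibility_class(False, False)`.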

6) The risk classification framework

Four classes. Each one imposes progressively harder architectural constraints.

              ╔═════════════════════════════════════════╗
              ║           RISK CLASS PYRAMID            ║
              ╠═════════════════════════════════════════╣
              ║                                         ║
              ║   ┌─────────────────────────────────┐   ║
              ║   │  CLASS 4: SAFETY / LEGAL        │   ║
              ║   │  Hard block. Zero tolerance.    │   ║
              ║   │  PII, regulatory, harmful.      │   ║
              ║   │  Constraint: deterministic gate │   ║
              ║   ├─────────────────────────────────┤   ║
              ║   │  CLASS 3: FINANCIAL / ADVISORY  │   ║
              ║   │  Human review before delivery.  │   ║
              ║   │  Wrong advice = real harm.      │   ║
              ║   │  Constraint: approval gate      │   ║
              ║   ├─────────────────────────────────┤   ║
              ║   │  CLASS 2: PRODUCTIVITY          │   ║
              ║   │  Wrong = wasted time.           │   ║
              ║   │  Needs accuracy bar + citation. │   ║
              ║   │  Constraint: confidence thresh  │   ║
              ║   ├─────────────────────────────────┤   ║
              ║   │  CLASS 1: COSMETIC              │   ║
              ║   │  Wrong format, slightly off.    │   ║
              ║   │  Fix in next sprint.            │   ║
              ║   │  Constraint: monitoring only    │   ║
              ║   └─────────────────────────────────┘   ║
              ╚═════════════════════════════════════════╝

Class 1: Cosmetic

Failure shape: Wrong formatting, slightly off tone, minor factual error that the user immediately notices and self-corrects.

Examples: "The holiday party is on the 15th" (actually the 16th). Bullet list instead of numbered list. Slightly formal tone when casual was requested.

Architectural constraint: Monitoring and feedback loop. No gate. Fix in next sprint based on user reports.

Cost of wrong answer: < $10 equivalent. User time: seconds.

Class 2: Productivity

Failure shape: Wrong answer that wastes meaningful user time or causes a correctable downstream error.

Examples: Wrong PTO balance (HR corrects within a day). Wrong meeting room location (user walks back). Incorrect process steps (user discovers at step 3 and restarts).

Architectural constraint: Source citation required. Confidence threshold — if retrieval score is below bar, show "I'm not sure" instead of guessing. Accuracy target: 95%+ on sampled evaluations.

Cost of wrong answer: $10–$500 equivalent. User time: minutes to hours.

Class 3: Financial / Advisory

Failure shape: Wrong answer that causes the user to take an action with financial, legal, or health consequences that are difficult to reverse.

Examples: Wrong FMLA eligibility. Incorrect tax implications. Wrong insurance coverage advice. Incorrect vesting schedule information.

Architectural constraint: Human-in-the-loop before delivery. Answer is generated but held for review by a qualified person (HR specialist, legal, compliance). Freshness guarantee on source documents, with the window set per category (< 24 hours for the most time-sensitive regulated content). Audit trail of every answer delivered.

Cost of wrong answer: $5K–$500K. Remediation time: weeks to months. Potential litigation.

Class 4: Safety / Legal (hard blocks)

Failure shape: Answer that, if delivered, causes immediate regulatory violation, exposes protected data, or could cause physical/psychological harm.

Examples: Displaying another employee's SSN. Providing medical dosage advice. Revealing salary data to unauthorized users. Generating content that violates securities regulations.

Architectural constraint: Deterministic block — not model-based filtering but hard-coded rules that cannot be bypassed by prompt manipulation. Data isolation — the model never sees the data it should not surface. Access-scoped retrieval — query-time auth check before any document enters the context. Output validation — regex/rule-based scan of every response before delivery. Complete audit log.

Cost of wrong answer: $50K–$10M+. Regulatory investigation. Criminal liability in some jurisdictions.

7) Classifying every wiki-assistant job by risk

Back to our fintech wiki assistant. In 01-jobs-to-be-done.md we identified the user jobs. In 02-ai-fit-routing.md we decided which ones the AI handles. Now we classify each by risk:

Job	Risk class	Failure example	Constraint imposed
"Where is the printer on floor 3?"	Class 1	Wrong floor	Monitoring only
"What's our PTO policy?"	Class 2	Wrong number of days	Cite source doc, accuracy bar 95%
"How do I submit an expense report?"	Class 2	Wrong portal link	Citation + link validation
"What's the 401k match percentage?"	Class 2	Outdated percentage	Citation + freshness check (monthly)
"Am I eligible for FMLA?"	Class 3	Wrong eligibility advice	Human review before answer
"What are the tax implications of exercising my options?"	Class 3	Wrong AMT guidance	Human review + disclaimer + freshness < 7 days
"What's the customer's account number?"	Class 4	PII exposure to unauthorized user	Hard block — deterministic auth check
"Show me [employee]'s salary band"	Class 4	Compensation data leak	Hard block — role-based access gate
"What medication should I take for [condition]?"	Class 4	Medical advice	Hard block — topic classifier rejects query

Notice: the risk class is a property of the query category, not the model's confidence on a specific answer. A model that is 99% confident about an FMLA answer still goes through human review — because the 1% failure costs more than the 99% accuracy saves.
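The table can be expressed as a lookup in which the gate depends only on the query category. This is a sketch under assumptions: the category keys, gate names, and `gate_for` function are hypothetical labels chosen for illustration, and the mapping mirrors a few rows of the table, not all of them.

```python
from enum import IntEnum


class RiskClass(IntEnum):
    COSMETIC = 1
    PRODUCTIVITY = 2
    FINANCIAL_ADVISORY = 3
    SAFETY_LEGAL = 4


# Hypothetical gate names, one per class.
GATE_BY_CLASS = {
    RiskClass.COSMETIC: "monitor",
    RiskClass.PRODUCTIVITY: "confidence_gate",
    RiskClass.FINANCIAL_ADVISORY: "human_review",
    RiskClass.SAFETY_LEGAL: "hard_block",
}

# Hypothetical category keys for a few rows of the job table.
CATEGORY_CLASS = {
    "office_locations": RiskClass.COSMETIC,
    "pto_policy": RiskClass.PRODUCTIVITY,
    "expense_reports": RiskClass.PRODUCTIVITY,
    "fmla_eligibility": RiskClass.FINANCIAL_ADVISORY,
    "option_tax": RiskClass.FINANCIAL_ADVISORY,
    "customer_account_data": RiskClass.SAFETY_LEGAL,
    "medical_advice": RiskClass.SAFETY_LEGAL,
}


def gate_for(category: str, model_confidence: float) -> str:
    """Return the gate for a category.

    model_confidence is accepted but deliberately ignored: the gate is a
    property of the category, not of the model's confidence in one answer.
    """
    return GATE_BY_CLASS[CATEGORY_CLASS[category]]
```

Calling `gate_for("fmla_eligibility", 0.99)` and `gate_for("fmla_eligibility", 0.50)` returns the same gate, which is the point of the paragraph above.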

Callout: Classification is a product decision, not an engineering decision. The product manager, legal counsel, and compliance officer classify risk. Engineering implements the constraints. If engineering classifies alone, they will under-classify because they optimize for system simplicity. If legal classifies alone, they will over-classify because they optimize for zero risk. The correct classification requires both perspectives at the table.

8) Risk boundaries → architectural constraints mapping

This is the mechanism that converts risk classification into engineering requirements.

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐     ┌──────────────────────┐
│  RISK CLASS     │────▶│  CONSTRAINT          │────▶│  GATE TYPE      │────▶│  ARCHITECTURE        │
│  (from product) │     │  (from risk)         │     │  (from constr.) │     │  DECISION            │
└─────────────────┘     └──────────────────────┘     └─────────────────┘     └──────────────────────┘

Class 1 ──────────────▶ Monitor ─────────────────▶ None ──────────────────▶ Log + dashboard
Class 2 ──────────────▶ Accuracy bar ────────────▶ Confidence gate ───────▶ Threshold check + citation
Class 3 ──────────────▶ Human approval ──────────▶ Async review queue ────▶ Hold-and-route architecture
Class 4 ──────────────▶ Hard block ──────────────▶ Deterministic filter ──▶ Data isolation + auth layer

Class 3 → Human-in-the-loop before delivery

The system generates the answer. It does not deliver it. Instead:

Answer enters a review queue
Qualified reviewer (HR specialist, compliance officer) sees query + generated answer + source docs
Reviewer approves, edits, or rejects
Only approved answers reach the user
Rejected answers trigger a "please contact [department] directly" fallback

Latency implication: Class 3 answers take hours, not seconds. This is acceptable because the alternative — instant wrong advice on a regulated topic — is not acceptable at any latency.
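The hold-and-route flow above can be sketched as a small state machine. This is an illustration only: `HeldAnswer`, `ReviewStatus`, `review`, and `deliverable_text` are hypothetical names, and a real system would persist the queue and notify reviewers instead of holding items in memory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class HeldAnswer:
    query: str
    draft: str            # model-generated answer, not yet delivered
    sources: list[str]    # documents shown to the reviewer
    status: ReviewStatus = ReviewStatus.PENDING
    final_text: Optional[str] = None


FALLBACK = "Please contact {department} directly."


def review(item: HeldAnswer, approve: bool, edited_text: Optional[str] = None) -> None:
    """Record the reviewer's decision; an approval may carry an edited answer."""
    if approve:
        item.status = ReviewStatus.APPROVED
        item.final_text = edited_text if edited_text is not None else item.draft
    else:
        item.status = ReviewStatus.REJECTED


def deliverable_text(item: HeldAnswer, department: str) -> Optional[str]:
    """Return what the user may see: nothing while pending, never the raw draft."""
    if item.status is ReviewStatus.PENDING:
        return None
    if item.status is ReviewStatus.APPROVED:
        return item.final_text
    return FALLBACK.format(department=department)
```

The design choice worth noting is that `deliverable_text` has no code path that returns `draft` for a pending or rejected item, so an unreviewed answer cannot reach the user by accident.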

Class 4 → Content filtering + output validation + audit log

Three layers, all deterministic:

Input classifier — before retrieval, a rule-based (not model-based) classifier detects query categories that are hard-blocked (medical advice, PII requests from unauthorized users, securities guidance). If triggered → immediate rejection with redirect.
Retrieval scoping — even if the classifier misses, the retrieval layer enforces document-level access control. An intern's query cannot retrieve executive compensation documents regardless of phrasing.
Output validator — regex and rule-based scan of the final response. Catches PII patterns (SSN, account numbers), medical dosage patterns, and other hard-block content. If triggered → response is replaced with safe fallback.

Why three layers? Defense in depth. Each layer has a different failure mode. The classifier fails on novel phrasings. The retrieval scoping fails if documents are miscategorized. The output validator fails on formats it does not recognize. Because the layers fail for different reasons, a harmful response has to slip past all three, which is far less likely than slipping past any single one.
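The first and third layers can be sketched with plain regular expressions. This is a minimal illustration, not a complete PII detector: the patterns, `SAFE_FALLBACK` text, and function names are hypothetical, the SSN pattern only matches the dashed `123-45-6789` format, and retrieval scoping (the second layer) is left out here.

```python
import re
from typing import Callable

# Layer 3 pattern: dashed US SSN format only (an assumption for this sketch).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Layer 1 patterns: hypothetical hard-blocked query categories.
BLOCKED_QUERY_PATTERNS = [
    re.compile(r"\bssn\b|\bsocial security number\b", re.IGNORECASE),
    re.compile(r"\b(dosage|dose|medication)\b", re.IGNORECASE),
]

SAFE_FALLBACK = "I can't help with that request. Please contact the relevant department directly."


def input_blocked(query: str) -> bool:
    """Layer 1: rule-based check that runs before any retrieval."""
    return any(p.search(query) for p in BLOCKED_QUERY_PATTERNS)


def validate_output(response: str) -> str:
    """Layer 3: replace the response if it contains a blocked pattern."""
    return SAFE_FALLBACK if SSN_PATTERN.search(response) else response


def answer(query: str, generate: Callable[[str], str]) -> str:
    """Run the input check, then generation, then the output check."""
    if input_blocked(query):
        return SAFE_FALLBACK
    return validate_output(generate(query))
```

Here `generate` stands in for the whole retrieval-and-generation pipeline. Even if a rephrased query gets past `input_blocked`, a response containing a dashed SSN is still replaced by `validate_output`.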

PII risk → Data isolation + access scoping

Not a filter on output — a restriction on input. The model never sees data it should not surface. This means:

Document-level access control in the vector store
Query-time authentication: user identity determines which document partitions are searchable
No "admin mode" that retrieves everything and filters after

If the data never enters the context window, the model cannot leak it. This is structurally safer than "retrieve everything, filter the output."
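A sketch of scoping before search, assuming documents carry a partition label and roles map to allowed partitions. The role names, partition names, and the keyword match standing in for vector search are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    partition: str
    text: str


# Hypothetical role-to-partition mapping; unknown roles get no partitions.
ROLE_PARTITIONS = {
    "intern": {"general"},
    "hr_specialist": {"general", "hr"},
    "comp_admin": {"general", "hr", "compensation"},
}


def searchable_docs(role: str, corpus: list[Doc]) -> list[Doc]:
    """Restrict the corpus to partitions the role may search."""
    allowed = ROLE_PARTITIONS.get(role, set())
    return [d for d in corpus if d.partition in allowed]


def retrieve(query: str, role: str, corpus: list[Doc]) -> list[Doc]:
    """Scope first, then search; a naive keyword match stands in for vector search."""
    candidates = searchable_docs(role, corpus)
    terms = query.lower().split()
    return [d for d in candidates if any(t in d.text.lower() for t in terms)]


corpus = [
    Doc("map", "general", "Printer locations by floor"),
    Doc("exec-comp", "compensation", "Executive compensation bands"),
]
```

Because filtering happens before matching, `retrieve("executive compensation", "intern", corpus)` returns nothing regardless of how the query is phrased, and there is no post-hoc filter to bypass.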

Compliance → Deterministic boundary for regulated outputs

For topics where regulatory compliance applies (tax advice, benefits eligibility, securities guidance):

The boundary is deterministic: a topic classifier routes these queries to the human-review path
The classifier combines rule-based keyword matching with a separate category model; it does not depend on the generative model's judgment
False positives (over-routing to human review) are acceptable; false negatives (missing a regulated query) are not
The system is tuned for high recall on regulated topics, accepting lower precision
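The keyword half of such a router can be sketched as below. This is an assumption-laden illustration: the keyword lists and topic names are hypothetical and deliberately broad, and the category model mentioned above is not shown.

```python
import re

# Hypothetical keyword lists, kept broad on purpose to favor recall over precision.
REGULATED_KEYWORDS = {
    "tax": ["tax", "amt", "withholding", "stock option"],
    "benefits_eligibility": ["fmla", "cobra", "eligible", "eligibility", "enrollment"],
    "securities": ["insider", "trading window", "securities"],
}


def regulated_topics(query: str) -> set[str]:
    """Return every regulated topic whose keywords appear as whole words in the query."""
    text = query.lower()
    return {
        topic
        for topic, keywords in REGULATED_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(kw) + r"\b", text) for kw in keywords)
    }


def route(query: str) -> str:
    """Any regulated-topic hit sends the query to the human-review path."""
    return "human_review" if regulated_topics(query) else "standard"
```

The broad lists produce false positives by design: "Am I eligible for the gym discount?" is routed to human review because of the word "eligible". That is the accepted cost of not missing "Am I eligible for FMLA?".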

9) Why "the model is usually right" is not a risk mitigation

Teams argue: "GPT-4 gets FMLA questions right 97% of the time in our evals. Adding human review costs $12 per query and adds 4 hours of latency. The ROI doesn't make sense."

The math:

Metric	Without human review	With human review
Accuracy on FMLA queries	97%	99.8% (reviewer catches model errors)
Queries per year	200	200
Wrong answers per year	6	0.4
Cost per wrong answer (avg litigation)	$85,000	$85,000
Expected annual loss from errors	$510,000	$34,000
Cost of human review	$0	$2,400/year (200 × $12)
Net risk-adjusted cost	$510,000	$36,400

The human review gate costs $2,400/year and saves $476,000 in expected loss. "The model is usually right" is correct — and irrelevant. The question is not "how often is it right?" The question is "what does it cost when it's wrong?"

Callout — Expected value, not accuracy. Risk mitigation decisions are expected-value calculations, not accuracy comparisons. A 97% accurate system with $85K failure cost is more dangerous than a 90% accurate system with $50 failure cost. The constraint comes from cost × probability, not probability alone.
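The table's arithmetic can be reproduced with a few lines. This is a sketch; the function name and parameter names are hypothetical, and the inputs are the figures from the table above.

```python
def expected_annual_cost(
    queries_per_year: int,
    accuracy: float,
    cost_per_error: float,
    review_cost_per_query: float = 0.0,
) -> float:
    """Expected loss from wrong answers plus the cost of reviewing every query."""
    wrong_answers = queries_per_year * (1 - accuracy)
    return wrong_answers * cost_per_error + queries_per_year * review_cost_per_query


# Figures from the table: 200 FMLA queries per year, $85,000 per wrong answer.
without_review = expected_annual_cost(200, 0.97, 85_000)
with_review = expected_annual_cost(200, 0.998, 85_000, review_cost_per_query=12)

# The callout's comparison at the same volume: 90% accurate, $50 per failure.
low_stakes = expected_annual_cost(200, 0.90, 50)
```

Rounded to whole dollars, `without_review` is 510,000 and `with_review` is 36,400, matching the last row of the table, while `low_stakes` comes to 1,000 despite the lower accuracy.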

10) Signals that risk boundaries are miscalibrated

Your risk classification is wrong if you see these patterns:

Under-classified (risk class too low):

Users report taking real-world actions based on answers that turned out wrong
Legal or compliance contacts engineering about specific answers
Users forward assistant answers as authoritative sources in official communications
Post-incident review reveals the query category had no special handling

Over-classified (risk class too high):

Human reviewers approve 99%+ of answers without edits (the gate adds latency but no value)
Users abandon the assistant for Class 3 topics because latency is too high, then get worse answers from random Slack messages
Review queue backlog grows faster than reviewers can process
The system routes 40%+ of queries to human review (classification is too broad)

Miscalibrated signals to watch:

Signal	Likely problem	Action
Reviewer approval rate > 99% for a category	Over-classified	Consider downgrading to Class 2 with citation
Any query in category causes real harm	Under-classified	Immediately upgrade to Class 3+
Users circumvent the system for a category	Gate friction too high	Check if class is correct; if yes, reduce review latency
Novel query causes harm in unclassified category	Missing classification	Add category to framework, classify, impose constraints

Recalibrate quarterly. Risk classification is not static — it changes as policies change, as the user base grows, as new query categories emerge.
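A quarterly recalibration pass might start from a check like the one below. This is a sketch: `CategoryStats` and both functions are hypothetical, and the thresholds (approval rate above 99%, 40% or more of queries routed to review) are the ones named in this section.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    reviewed: int                 # answers that went through human review
    approved_without_edits: int   # of those, approved unchanged
    harm_incidents: int           # answers known to have caused real harm


def calibration_signal(stats: CategoryStats) -> str:
    """Map one category's stats to a signal from the table above."""
    # Any real harm outranks every other signal.
    if stats.harm_incidents > 0:
        return "under-classified"
    if stats.reviewed > 0 and stats.approved_without_edits / stats.reviewed > 0.99:
        return "possibly over-classified"
    return "ok"


def review_share_too_broad(total_queries: int, routed_to_review: int) -> bool:
    """True when 40% or more of all queries are routed to human review."""
    return total_queries > 0 and routed_to_review / total_queries >= 0.40
```

The output is a prompt for the product, legal, and engineering review, not an automatic reclassification.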

11) Where risk classification breaks: novel failures and compositional risk

The framework above works for anticipated query categories. It fails in two ways:

Novel failure modes

A query that does not match any classified category. The user asks something you never anticipated, the topic classifier does not trigger, no gate fires, and the model answers freely. If that novel query happens to carry Class 3+ risk, the system has no protection.

Mitigation: Default-high classification for unrecognized query categories. If the topic classifier cannot confidently assign a category, route to human review rather than answering freely. This creates friction for genuinely benign novel queries — that friction is the cost of safety.
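Default-high classification can be sketched as a wrapper around whatever the topic classifier returns. The function name, the `DEFAULT_HIGH_CLASS` constant, and the 0.8 confidence floor are hypothetical; the floor in particular is a placeholder that would need tuning.

```python
from typing import Optional

# Unrecognized queries are treated as Class 3, which routes them to human review.
DEFAULT_HIGH_CLASS = 3


def effective_risk_class(
    predicted_class: Optional[int],
    classifier_confidence: float,
    min_confidence: float = 0.8,  # placeholder threshold, an assumption
) -> int:
    """Fail safe: no category or low confidence never yields a class below 3."""
    if predicted_class is None:
        return DEFAULT_HIGH_CLASS
    if classifier_confidence < min_confidence:
        # Never lower a high prediction; only raise a low one.
        return max(predicted_class, DEFAULT_HIGH_CLASS)
    return predicted_class
```

A low-confidence Class 4 prediction stays Class 4, while a low-confidence Class 1 prediction is raised to Class 3, so uncertainty can only add protection.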

Compositional risk

Two Class 2 queries whose answers, combined, produce Class 4 harm:

Query A: "What is the salary band for Senior Engineer?" (Class 2 — public info on levels page)
Query B: "Which Senior Engineers are in the NYC office?" (Class 2 — org chart info)
Combined: attacker now has salary + identity → compensation data leak (Class 4)

Individual query classification misses this. The risk is in the sequence, not the individual query.

Mitigation: Session-level risk accumulation. Track what information has been disclosed in a session. If cumulative disclosure crosses a threshold (e.g., role + identity + compensation data in the same session), trigger a review or block further queries in that category.

This is hard to implement perfectly. The minimum viable version: rate-limit queries in sensitive-adjacent categories and flag sessions that combine people-data with compensation-data within a short window.
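The minimum viable version described above, flagging sessions that combine people-data with compensation-data within a short window, might look like this. The class name, category labels, and the 600-second window are hypothetical; timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict

# Hypothetical category combinations that are risky together but not alone.
RISKY_COMBINATIONS = [frozenset({"people_data", "compensation_data"})]


class SessionRiskTracker:
    """Flag sessions whose recent queries together cover a risky combination."""

    def __init__(self, window_seconds: float = 600.0):
        self.window_seconds = window_seconds
        # session_id -> list of (timestamp, category) in arrival order
        self._events = defaultdict(list)

    def record(self, session_id: str, category: str, timestamp: float) -> bool:
        """Record one query; return True if the session should now be flagged."""
        events = self._events[session_id]
        events.append((timestamp, category))
        cutoff = timestamp - self.window_seconds
        recent = {cat for ts, cat in events if ts >= cutoff}
        return any(combo <= recent for combo in RISKY_COMBINATIONS)
```

In the salary-band example, recording `"compensation_data"` for Query A returns `False`, and recording `"people_data"` for Query B a minute later in the same session returns `True`, at which point the system can trigger a review or block further queries.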

Callout — Compositional risk is the frontier problem. Most production systems today handle individual-query risk. Few handle compositional risk well. If you are building a system where this matters (HR, finance, healthcare), invest in session-level tracking early. Retrofitting it is expensive.

12) Wrong assumption: "Guardrails can be added later"

The most expensive sentence in AI product development: "Let's ship the MVP without guardrails and add them when we see problems."

Why this fails:

User trust is asymmetric. Users who receive one wrong answer on a sensitive topic do not return to the system after guardrails are added. They tell colleagues. Trust destruction is fast; trust rebuilding is slow.

Architecture does not accommodate late-stage gates. A system built to return answers synchronously cannot easily add an async human-review queue. The UI expects immediate responses. The backend has no queue infrastructure. The user flow has no "pending review" state. Adding these after launch is a rewrite, not a feature.

Compliance exposure accumulates during the "pre-guardrail" period. Every wrong answer delivered before guardrails exist is a liability that does not disappear when guardrails are added. The lawyer's email comes for the answer you gave in month 2, not the answer you prevented in month 6.

The incident that forces guardrails determines the guardrail design. Instead of thoughtful risk classification, the team reacts to the specific incident — over-correcting for that one failure mode while leaving others unprotected. This produces the patchwork described in Section 5.

Real example: A healthcare company shipped an internal assistant without risk classification. Month 3: an employee asked about medication interactions and the assistant answered. The company added a blanket block on all health-related queries. This broke legitimate queries about health insurance benefits (Class 2) because the team used incident response instead of risk classification. Took 4 months to untangle.

13) Pattern transfer

Risk boundaries connect forward and backward across the curriculum:

Blast radius (module 01, ch08): Risk class determines how large the blast radius of a failure is. Class 4 = maximum blast radius. Architecture must contain blast radius proportional to risk class.
Approval gates (module 02, ch08): Human-in-the-loop is a specific workflow pattern. Risk class 3+ triggers this pattern. The gate design (sync vs. async, reviewer routing, timeout behavior) comes from the durable workflows module.
Safety track (03_ai_security_safety): That track covers the full depth of content filtering, adversarial robustness, and compliance frameworks. Risk classification on this page is the requirements-level version; the safety track covers implementation-level detail.
Observability (module 03): Risk-classified queries need different monitoring. Class 3+ queries should have 100% audit logging, not sampled. Anomaly detection should fire on risk class distribution shifts.
Evaluation pipelines (module 08-09): RAG evaluation should weight errors by risk class. A wrong answer on a Class 3 query should fail the eval even if overall accuracy is 96%.

Recall

What is the difference between acceptance tests and risk boundaries? (Floor vs. ceiling — tests cover anticipated cases, boundaries protect against unanticipated harm.)
Name the four risk classes and one example of each. (Cosmetic/formatting, Productivity/wrong PTO days, Financial/wrong FMLA advice, Safety-Legal/PII exposure.)
Why is "97% accuracy" not a sufficient risk mitigation for Class 3 queries? (Because cost × probability matters: a 3% error rate on 200 queries per year is 6 wrong answers, and 6 × $85K = $510K expected annual loss.)
What three layers provide defense in depth for Class 4 risk? (Input classifier, retrieval scoping, output validator.)
What is compositional risk? (Two individually safe queries whose combined answers produce high-severity harm.)
Why does "add guardrails later" fail architecturally? (Systems built for sync responses cannot easily add async review queues; the UI, backend, and user flow all assume immediate answers.)
What signal suggests a query category is over-classified? (Reviewer approval rate > 99% without edits.)
Who should participate in risk classification and why? (Product, legal, and engineering — each alone systematically mis-classifies in a predictable direction.)

Interview questions

Q1: Your team is building an internal AI assistant that answers HR policy questions. How would you approach safety?

Strong answer: "I'd classify every query category by failure severity — cosmetic, productivity, financial, safety. Then impose architectural constraints proportional to class: citation requirements for Class 2, human review gates for Class 3, hard blocks for Class 4. The classification determines the architecture, not the other way around."

Wrong-answer notes: Answers that jump straight to "add a content filter" or "fine-tune for accuracy" without classification. Filters and accuracy are mechanisms — without knowing which queries need which protection level, you cannot design them correctly.

Q2: A stakeholder says "the model is 98% accurate, we don't need human review." How do you respond?

Strong answer: "Accuracy alone doesn't determine whether human review is needed. I'd ask: what's the cost when the 2% happens? If it's $50 of wasted time, maybe 98% is fine. If it's $85K in legal exposure, then at 200 queries a year a 2% error rate means 4 wrong answers and $340K in expected annual losses. The review gate probably costs a fraction of that."

Wrong-answer notes: Agreeing because 98% "sounds high." Disagreeing purely on principle without doing the expected-value math. The answer should show cost-weighted reasoning, not vibes.

Q3: How would you handle a query category your risk classification didn't anticipate?

Strong answer: "Default-high. If the topic classifier can't confidently assign a category, route to human review rather than answering freely. False positives (over-routing benign queries) are cheaper than false negatives (missing a dangerous one). Then I'd add the new category to the classification framework within the sprint."

Wrong-answer notes: Answers that say "the model will figure it out" or "we'll add it when something goes wrong." The point is that novel categories must fail safe, not fail open.

Q4: A PM asks you to remove the human review gate because it adds 4 hours of latency and users are complaining. What do you do?

Strong answer: "I'd check the reviewer approval rate. If reviewers approve 99%+ without edits, the category might be over-classified and we can downgrade to Class 2 with citation. If reviewers catch real errors, I'd keep the gate and instead invest in reducing review latency — better tooling, more reviewers, or async notification so users aren't blocked."

Wrong-answer notes: Immediately removing the gate to satisfy the PM. Refusing to remove it without checking whether it's actually providing value. Both extremes miss the calibration question.

Q5: How do you prevent compositional risk — where individually safe queries combine to expose sensitive information?

Strong answer: "Session-level tracking. I'd monitor what categories of information have been disclosed within a session. If a session accumulates queries that combine to cross a risk threshold — like role + identity + compensation data — I'd trigger a review or rate-limit. This is hard to do perfectly, so the minimum viable version is category-pair alerts."

Wrong-answer notes: Answers that only discuss individual query classification. Compositional risk is specifically about sequences, and the candidate should acknowledge it requires session-state awareness.

Q6: What's the difference between a model-based content filter and a deterministic safety gate?

Strong answer: "A model-based filter uses another LLM or classifier to judge whether output is safe — it's probabilistic and can be bypassed by adversarial prompts. A deterministic gate uses hard-coded rules: regex for PII patterns, allowlists for retrievable documents, role-based access checks. For Class 4 risk, you need deterministic gates because you cannot accept the probabilistic failure rate of model-based filtering."

Wrong-answer notes: Treating them as interchangeable. Not recognizing that model-based filters have adversarial vulnerability. Not connecting gate type to risk class.

Design / debug exercise

Scenario: Your company's internal wiki assistant has been live for 2 months. You receive this report:

3 employees cited the assistant's answer in benefits enrollment decisions
1 employee enrolled in the wrong health plan tier based on the assistant's comparison
The assistant did not flag the answer as uncertain — retrieval confidence was 0.91
No human review existed for benefits-comparison queries
The wrong enrollment costs the employee $2,400/year and requires HR exception process to fix

Step 1 — Classify. What risk class is "benefits plan comparison" and why? What constraint should have existed?

Answer: Class 3 (Financial/Advisory). A wrong answer causes the user to take a financial action that is difficult to reverse and has real monetary consequences. The constraint: human review before delivering benefits comparison answers. Additionally, freshness guarantee on benefits documents (enrollment rules change annually).

Step 2 — Diagnose the root cause. Why did existing safeguards miss this?

Answer: The team likely classified all "benefits" queries as one category. "What's my deductible?" (Class 2 — correctable, low cost) was treated the same as "Which plan should I choose?" (Class 3 — enrollment decision, $2,400 consequence). The risk classification was too coarse. Sub-categories within "benefits" carry different risk levels.

Step 3 — Design the fix. What changes to architecture, classification, and process?

Answer: (1) Split "benefits" into sub-categories: factual lookup (Class 2) vs. comparison/recommendation (Class 3). (2) Add topic sub-classifier that detects decision-support queries vs. factual queries. (3) Route comparison queries to human review (benefits specialist). (4) Add disclaimer to all benefits answers: "Verify with HR before making enrollment decisions." (5) Add freshness rule: benefits docs must be re-indexed within 48 hours of any update. (6) Retroactive: notify the 3 employees that assistant answers should not be sole basis for enrollment decisions.

Operational memory

Remember:

Risk class is determined by failure cost, not by model accuracy — a 97%-accurate system is dangerous if the 3% costs $85K each
Classification is a product + legal + engineering decision — none of these three should classify alone
The irreversibility test: "If this answer is wrong, can the user recover without external help?" — if no, it's Class 3+
Deterministic gates for Class 4 — model-based filters are probabilistic and bypassable; hard rules are not
Default-high for unrecognized queries — novel categories should fail safe (to human review), not fail open (to unguarded model response)
"Add guardrails later" is the most expensive architectural assumption — trust destruction is fast, trust rebuilding is slow, and sync systems cannot easily add async gates
Risk boundaries are the bridge between requirements and architecture — they convert "what must never happen" into concrete engineering constraints

Bridge

We now have five requirements artifacts for the wiki assistant:

User jobs — what people actually need the system to do (01-jobs-to-be-done.md)
AI-fit routing — which jobs belong to a model and which do not (02-ai-fit-routing.md)
Success metrics — what "working" looks like, quantified (03-success-metrics.md)
Acceptance tests — the pass/fail conditions before launch (04-acceptance-tests.md)
Risk boundaries — what the system must never do, with constraints by severity (this file)

These five artifacts are the requirements. Next: packaging them into the architecture brief — the single document that engineering uses to choose RAG, agents, fine-tuning, or workflow automation. The brief is not a new analysis; it is a compression of everything above into the format that makes architectural decisions tractable.

→ 06-requirements-to-architecture-brief.md