01. Problem framing: turning "add AI" into concrete user jobs

Before any model is chosen, any prompt is drafted, or any retrieval pipeline is wired, the requirement must name the user's job, not the technology.

This is the first topic file in the AI Product Requirements module. The recurring pressures introduced in 00-first-principles.md (the user job, the failure cost, the task decomposition, the evidence need, and the fit decision) all originate here. Every later chapter assumes you did this work first.

What this file solves

Teams that skip problem framing build expensive AI features that answer the wrong question. A product manager says "add an AI assistant," an engineer spins up a RAG pipeline, and three months later the feature answers 40% of queries incorrectly because nobody asked what the user actually needs to accomplish, what evidence each task requires, or what happens when the answer is wrong. This chapter teaches you to decompose vague AI requests into concrete user jobs, sub-tasks, evidence needs, and failure costs — so the engineering work that follows solves a real problem at an acceptable error rate.

1) Why vague AI requests create expensive failures

A request that says "add AI" is not a requirement. It is a solution searching for a problem. It tells you which hammer to use before you have looked at what is broken.

Here is what actually happens when a team takes "add AI" at face value:

An engineer picks a model and a technique (usually RAG, because it is the current default).
The team builds a pipeline shaped by the technique, not by the user's need.
Three months later, the feature is live. Users ask questions the pipeline cannot answer — not because the model is weak, but because the pipeline retrieves the wrong documents, or the question requires reasoning the retrieval step was never designed to support.
The team debugs by trying bigger models, better embeddings, more documents. Costs climb. Accuracy does not.

The root cause is not the model. Not the embeddings. Not the chunk size. The root cause is that nobody decomposed the user's actual work into tasks with known evidence requirements and failure costs. The team optimized a pipeline nobody specified.

Teacher voice. "Add AI" has the same diagnostic value as "make it faster." It tells you a stakeholder is excited. It tells you nothing about what to build.

2) The visible break when teams skip framing

The scenario. A product manager at a mid-size fintech says: "We need an AI assistant for our internal support team — something that answers policy questions so agents don't have to search 200+ wiki pages."

The team hears "RAG over wiki pages" and starts building:

Week 1:  Embed all 200+ wiki pages into a vector store
Week 2:  Wire a chat interface that retrieves top-5 chunks and feeds them to GPT-4
Week 3:  Demo to stakeholders — looks great on easy queries
Week 4:  Internal pilot — support agents start using it
Week 6:  Complaints: "It told a customer they could get a refund after 90 days.
          Our policy is 60 days. The wiki page it cited was from 2022 and
          was never archived."
Week 8:  Accuracy audit: 38% of policy answers contain at least one error.
          Feature is pulled from production.

The team's mistake was not technical incompetence. They built a competent RAG pipeline. The mistake was building any pipeline before asking:

What specific jobs do support agents perform when they search the wiki?
Which of those jobs tolerate approximate answers, and which require exact policy language?
What is the cost of a wrong answer for each job type?
Which documents are authoritative, and which are stale drafts?

Without those answers, no amount of engineering can produce a reliable system. Not a bigger model. Not a reranker. Not a fine-tune. The failure is upstream of all of them.

Mini-FAQ. "But we did talk to users — they said they wanted faster answers." Faster answers to what? "Faster" is a quality of the solution, not a description of the job. The job is "confirm the exact refund window for a subscription cancellation initiated after the billing cycle closes." That job has evidence requirements, failure costs, and latency expectations that "faster answers" never names.

3) A support query that needs three different capabilities

A support agent types: "Customer cancelled their Pro subscription mid-billing-cycle and wants a prorated refund. They signed up under the 2023 promotion. What's our policy?"

This single query decomposes into three distinct tasks:

Sub-task	What it requires	Failure cost if wrong
Identify the applicable policy version	Document retrieval + temporal reasoning (2023 promo terms, not current terms)	Agent quotes wrong refund amount → financial loss + compliance risk
Determine proration calculation rules	Exact extraction from policy doc (not paraphrase)	Overpay: ~$40–200 per case. Underpay: customer escalation + regulatory complaint
Confirm any exceptions for mid-cycle cancellation	Boolean lookup across multiple policy documents	Agent gives false "no exceptions" answer → customer churns; or false "yes" → revenue leak

A single RAG call that retrieves "top-5 similar chunks" cannot reliably handle all three. The first needs temporal filtering. The second needs exact quotation, not summarization. The third needs exhaustive search across documents, not similarity-based retrieval.

This is why framing matters. The shape of the user's work determines the shape of the system. Not the reverse.
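As a minimal sketch of the first sub-task's "temporal filtering" requirement, the snippet below picks the policy version in effect on a given date instead of ranking by similarity. The `PolicyVersion` type, the document IDs, and the dates are all invented for illustration, and the rule "the version in effect at signup applies" is an assumption, not something the wiki scenario specifies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PolicyVersion:
    doc_id: str           # hypothetical identifier for a policy document
    effective_from: date  # first day this version applies


def version_in_effect(versions: list[PolicyVersion], on: date) -> Optional[PolicyVersion]:
    """Return the latest version whose effective date is on or before `on`."""
    candidates = [v for v in versions if v.effective_from <= on]
    return max(candidates, key=lambda v: v.effective_from, default=None)


# Invented example data: three versions of one policy document.
versions = [
    PolicyVersion("refund-policy-2022", date(2022, 1, 1)),
    PolicyVersion("refund-policy-2023-promo", date(2023, 1, 1)),
    PolicyVersion("refund-policy-2024", date(2024, 1, 1)),
]

signup = date(2023, 3, 15)
selected = version_in_effect(versions, signup)
print(selected.doc_id if selected else "no version applies")
```

A similarity search over the same three documents has no reason to prefer the 2023 terms over the current ones, which is the gap this filter closes.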

4) Rule: every AI feature decomposes into user jobs, not model capabilities

The unit of analysis is the user job — not the model capability, not the technique, not the API.

A user job is: "Find the exact refund policy that applies to this customer's contract vintage and calculate the prorated amount." It is not: "Do RAG." It is not: "Summarize documents." It is not: "Use function calling."

Why this rule carries weight. Model capabilities change every quarter. User jobs change when the business changes — which is slower and more predictable. If you specify requirements in terms of user jobs, your system survives a model upgrade. If you specify in terms of model capabilities, every model change is a potential regression you cannot predict because the requirement never said what success looks like for the user.

The decomposition hierarchy:

User request (what the person types)
  └── User job (what they need to accomplish)
        └── Sub-tasks (discrete steps within the job)
              └── Evidence needs (what data each step requires)
                    └── Failure costs (what goes wrong when each step fails)
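One way to keep the hierarchy honest is to hold it as data rather than prose. The sketch below is an assumption about how a team might model it; the class and field names (`UserRequest`, `UserJob`, `SubTask`, `is_fully_specified`) are hypothetical and not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:
    name: str
    evidence_need: str   # what data the step requires
    failure_cost: str    # what goes wrong when the step fails


@dataclass
class UserJob:
    description: str
    sub_tasks: list[SubTask] = field(default_factory=list)


@dataclass
class UserRequest:
    text: str            # what the person typed
    job: UserJob         # what they need to accomplish


def is_fully_specified(job: UserJob) -> bool:
    """True only when the job has sub-tasks and each one names both
    its evidence need and its failure cost."""
    return bool(job.sub_tasks) and all(
        st.evidence_need.strip() and st.failure_cost.strip() for st in job.sub_tasks
    )


job = UserJob(
    description="Confirm refund eligibility and calculate amount for this specific case",
    sub_tasks=[
        SubTask("Look up customer contract vintage", "CRM query, exact lookup", "medium"),
        SubTask("Check for promotional exceptions", "Promo terms archive", ""),  # cost not assigned yet
    ],
)
print(is_fully_specified(job))  # False: one sub-task has no failure cost
```

The check is deliberately strict: a job with an empty failure-cost field is treated as not ready to build against.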

5) Job-task decomposition walkthrough: how it works step by step

The framework has five steps. Each one produces a concrete artifact — not a philosophical discussion, but a table or list that engineers can build against.

Step 1 — List observed user requests. Collect 30–50 real queries from the target users. Not hypothetical queries. Real ones. Pull them from chat logs, ticket systems, search histories.

Step 2 — Cluster into user jobs. Group the queries by what the user is trying to accomplish, not by topic or keyword. "What's the refund policy?" and "Can this customer get their money back?" are the same job. "What's the refund policy?" and "Who owns the refund policy doc?" are different jobs.

Step 3 — Decompose each job into sub-tasks. For each job, ask: what discrete steps does a human expert perform to answer this? Write them down. A typical job has 2–5 sub-tasks.

Step 4 — Identify evidence needs per sub-task. For each sub-task: what data source does the expert consult? Is the answer a direct quote, a calculation, a yes/no lookup, or a judgment call? Does the data source have versions, and does the version matter?

Step 5 — Assign failure costs. For each sub-task: what happens when the answer is wrong? Quantify in dollars, compliance risk, customer impact, or operational cost. This number determines your error tolerance.

The output is a table. Not a slide deck. Not a PRD paragraph. A table that an engineer can implement against and a QA team can write test cases from.
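Steps 1 and 2 can be sketched as a simple tally once a reviewer has labeled each sampled query with a job. The clustering itself is a human judgment; the code only counts. The labels and the `job_frequencies` helper are hypothetical, and the sample queries are the ones quoted in Step 2 plus one from a support context.

```python
from collections import Counter

# Hypothetical hand-labeled sample: (query text, job label assigned by a reviewer).
labeled_queries = [
    ("What's the refund policy?", "refund eligibility"),
    ("Can this customer get their money back?", "refund eligibility"),
    ("Who owns the refund policy doc?", "document ownership"),
    ("What's our response-time guarantee for Enterprise tier?", "SLA commitments"),
]


def job_frequencies(labeled: list[tuple[str, str]]) -> list[tuple[str, int, float]]:
    """Return (job, count, share of sample) rows, most frequent job first."""
    counts = Counter(label for _, label in labeled)
    total = sum(counts.values())
    return [(job, n, n / total) for job, n in counts.most_common()]


# Emit tab-separated rows that can be pasted into the decomposition table.
for job, n, share in job_frequencies(labeled_queries):
    print(f"{job}\t{n}\t{share:.0%}")
```

Note that the first two queries share a label even though they share almost no keywords, which is the point of clustering by job rather than by topic.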

6) The mental model: user job → sub-tasks → evidence needs → failure costs

┌─────────────────────────────────────────────────────────────────────┐
│                     VAGUE REQUEST                                    │
│         "We need an AI assistant for support"                       │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      USER JOBS (5–15 per product)                    │
│                                                                     │
│  Job 1: Confirm refund eligibility for a specific case              │
│  Job 2: Find the SLA commitment for a customer's plan tier          │
│  Job 3: Determine escalation path for a compliance question         │
│  Job 4: Look up internal process steps for account closure          │
│  ...                                                                │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│              SUB-TASKS per job (2–5 each)                            │
│                                                                     │
│  Job 1 → [ identify policy version,                                 │
│            extract refund window,                                    │
│            check exceptions,                                        │
│            calculate proration ]                                     │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│            EVIDENCE NEEDS per sub-task                               │
│                                                                     │
│  "identify policy version"                                          │
│     → needs: customer contract date, policy version history         │
│     → type: temporal lookup (not similarity search)                 │
│                                                                     │
│  "extract refund window"                                            │
│     → needs: exact text from identified policy version              │
│     → type: verbatim extraction (not paraphrase)                    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│            FAILURE COSTS per sub-task                                │
│                                                                     │
│  Wrong policy version cited    → $200 avg overpayment per case      │
│  Paraphrased instead of exact  → compliance audit flag              │
│  Missed exception clause       → customer escalation (est. $50/case)│
│  Wrong proration calc          → $40–200 financial exposure         │
└─────────────────────────────────────────────────────────────────────┘

This diagram is the chapter's core artifact. Every AI product requirement should bottom out in a version of this hierarchy. If you cannot fill in the bottom two boxes (evidence needs and failure costs), you are not ready to choose a model or a technique.

7) Running the framework on the wiki assistant

Back to our fintech PM's request. Here is what the decomposition produces when applied rigorously.

Observed user requests (sampled from 3 weeks of support chat logs, n=847):

Cluster	Frequency	Example query
Refund eligibility	31%	"Pro user cancelled mid-cycle under 2023 promo — can they get prorated refund?"
SLA / uptime commitments	18%	"What's our response-time guarantee for Enterprise tier?"
Escalation routing	14%	"Customer threatening legal action over data deletion — who handles this?"
Account lifecycle	12%	"Steps to close a dormant business account with outstanding balance?"
Feature entitlements	11%	"Does Starter plan include API access after the Jan 2024 change?"
Internal process	9%	"How do I submit a manual credit above $500?"
Compliance / regulatory	5%	"Are we required to disclose this under DPDP Act?"

Job-task decomposition for top job (refund eligibility):

Sub-task	Evidence source	Evidence type	Failure cost	Latency target
Identify customer's contract vintage	CRM system (structured data)	Exact lookup	Medium — wrong vintage = wrong policy	< 2s
Retrieve applicable policy version	Policy doc store (versioned)	Temporal filter + exact retrieval	High — $200 avg overpayment	< 3s
Extract refund window and conditions	Retrieved policy document	Verbatim extraction	High — compliance risk	< 2s
Check for promotional exceptions	Promo terms archive	Exhaustive boolean search	Medium — $50 per missed exception	< 2s
Calculate prorated amount	Billing system + policy rules	Deterministic calculation	High — direct financial impact	< 1s

Notice what this reveals: sub-task 5 (calculate prorated amount) should not use a language model at all. It is arithmetic over known inputs, so a deterministic function is cheaper, faster, and can be unit-tested to exact correctness. Sub-task 1 is a database lookup. Only sub-tasks 2–4 are candidates for AI, and even there "AI" means different things: temporal retrieval, verbatim extraction, and exhaustive search are three distinct capabilities.

Teacher voice. The decomposition did not tell us to use RAG. It told us we need a temporal document filter, a verbatim extractor, an exhaustive searcher, a CRM integration, and a calculator. The architecture follows the jobs, not the technique.
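To make the "calculator" component concrete, here is a minimal sketch of sub-task 5 as a deterministic function. The day-based proration rule, the function name `prorated_refund`, and the example figures are assumptions for illustration; the real rule must come from the applicable policy version.

```python
from decimal import Decimal, ROUND_HALF_UP


def prorated_refund(plan_price: Decimal, cycle_days: int, days_used: int) -> Decimal:
    """Refund the unused share of the billing cycle, rounded to cents.

    Assumed rule (illustrative only): refund = price * unused days / cycle days.
    """
    if cycle_days <= 0 or not 0 <= days_used <= cycle_days:
        raise ValueError("days_used must be between 0 and cycle_days")
    unused_share = Decimal(cycle_days - days_used) / Decimal(cycle_days)
    return (plan_price * unused_share).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Invented inputs: a $60.00 plan, 30-day cycle, cancelled after 12 days.
print(prorated_refund(Decimal("60.00"), cycle_days=30, days_used=12))  # 36.00
```

`Decimal` is used instead of `float` so that cent-level rounding is explicit and repeatable, which matters when the output is a financial amount.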

8) Why task decomposition beats feature-list thinking

Teams that skip decomposition default to feature-list thinking: "The AI should be able to answer questions, summarize documents, and search the wiki." This sounds reasonable but produces systems that are mediocre at everything and excellent at nothing.

The comparison:

Dimension	Feature-list approach	Task-decomposition approach
Success metric	"% of queries answered" (ambiguous)	"% of refund-eligibility queries answered correctly with exact policy citation" (testable)
Error handling	Uniform — same fallback for all failures	Per-task — wrong policy version triggers different alert than slow response
Architecture	Monolithic RAG pipeline	Composed system: lookup + retrieval + extraction + calculation
Cost of wrong answer	Unknown until incident	Quantified per sub-task before build starts
Eval design	"Does it sound right?"	"Does sub-task 2 return the 2023 policy when customer signed up in 2023?"
Model upgrade safety	Unknown regression surface	Test suite covers each sub-task independently
Build time (initial)	~4 weeks	~6 weeks
Time to acceptable accuracy	~16 weeks (debugging blind)	~8 weeks (targeted fixes per sub-task)

The feature-list approach is faster to start. The task-decomposition approach is faster to finish. Teams that ship the feature-list version spend months debugging accuracy problems they cannot localize because nothing in the system maps to a specific user need.

Mini-FAQ. "Isn't this just waterfall requirements gathering?" No. Waterfall gathers requirements once and freezes them. Task decomposition is iterative — you sample queries weekly, discover new job types, and update the table. The artifact is a living document, not a spec that goes stale. The difference: waterfall asks "what should it do?" once. Decomposition asks "what are users actually trying to do?" continuously.

9) Signals that framing is working vs failing

How do you know your problem framing is producing useful output versus generating busywork?

Healthy signals:

Engineers can point to a specific row in the decomposition table and say "I'm building this sub-task"
QA can write test cases directly from the evidence-needs column
Product can prioritize jobs by failure cost without re-reading the PRD
When accuracy drops, the team can localize the failure to a specific sub-task within hours
Model upgrades can be tested against the sub-task eval suite before deploy

First signal of degradation:

The decomposition table has not been updated in 4+ weeks, but new query patterns keep appearing in logs
Engineers are building capabilities nobody mapped to a user job ("let's add summarization!")

Misleading signal people watch:

Overall answer rate (% of queries that get any response). A system can answer 95% of queries and still be dangerous if 15% of those answers are wrong on high-cost sub-tasks.

What experienced engineers inspect first:

Failure cost distribution: are the highest-cost sub-tasks also the ones with the highest error rates? If yes, the system is failing where it hurts most. That is a framing problem — you built for coverage instead of for risk.

10) Where framing breaks down: ambiguous user populations, shifting requirements

Task decomposition is not free, and it does not solve every problem. It breaks when:

The user population is ambiguous. If you cannot identify who the users are, you cannot sample their real queries, and the decomposition is built on imagination instead of evidence. This happens with B2B products where the buyer is not the user, or internal tools that serve three teams with different needs.

Mitigation: Pick one user segment. Decompose for them. Ship. Observe what the other segments actually do. Decompose again.

Requirements shift faster than you can decompose. If the product pivots every 6 weeks, a detailed decomposition table becomes stale before it pays off. This is rare for AI features (which have long build cycles) but real for experimental products.

Mitigation: Do lightweight framing — top 3 jobs, top 1 sub-task per job, rough failure costs. Enough to prevent the worst mistakes without becoming a planning bottleneck.

The job is genuinely novel. If nobody has ever done this task before (not even manually), you cannot observe how experts do it. Creative tasks, research tasks, and generative tasks sometimes resist decomposition.

Mitigation: Frame what you can (inputs, outputs, constraints, failure modes), and explicitly mark the remainder as "exploratory — needs user testing to decompose further."

Mini-FAQ. "What if stakeholders refuse to do this work?" Then you do it yourself. Sample 50 queries from logs. Cluster them. Decompose the top 3. Present the failure-cost column. Stakeholders who see "$200 average overpayment per wrong refund answer × 150 refund queries/week = up to $30K/week exposure" suddenly have time for framing.

11) The tempting wrong model: "the model is the product"

The most common flawed assumption in AI product work is that the model is the product and everything else is plumbing.

This assumption leads to a specific failure pattern:

Team evaluates models by benchmark score or vibe check
Team picks the "best" model and builds a thin wrapper around it
Product quality equals model quality — if the model is good, the product is good
When quality drops, the only lever is "try a better model" or "write a better prompt"

The correct mental model: the language model is one component in a system that serves user jobs. The system includes retrieval, filtering, routing, calculation, validation, fallback, and human escalation. The model's contribution varies by sub-task: sometimes 80% of the value, sometimes 5%.

graph TD
    A["'The model is the product' mindset"] --> B[Monolithic pipeline]
    B --> C[Single point of failure]
    C --> D[Cannot localize errors]
    D --> E["Only lever: bigger model"]
    E --> F["Costs climb, accuracy plateaus"]

    G["'User jobs drive the system' mindset"] --> H[Composed architecture]
    H --> I[Each sub-task independently testable]
    I --> J[Errors localized to specific sub-task]
    J --> K["Fix: better retrieval, deterministic logic, or a better model, depending on sub-task"]
    K --> L["Costs controlled, accuracy improves per component"]

Not a model quality problem. Not a prompt engineering problem. A requirements problem. The team never specified what each component should accomplish, so they cannot tell which component is failing.

12) Other failure shapes in problem framing

Beyond "the model is the product," framing fails in predictable ways:

Anchoring on the demo query. The team builds the system to nail 5 demo queries shown to stakeholders. Real query distribution looks nothing like the demo. Accuracy in production is 30 points below the demo.
Confusing frequency with importance. The most common query type gets all the engineering attention. The rare-but-expensive query type (compliance, legal, financial) gets none. One wrong answer on a rare query can cost more than 1,000 wrong answers on a common query.
Decomposing into model capabilities instead of user tasks. "The system should do summarization, Q&A, and classification." These are model capabilities. What user job does "classification" serve? If you cannot answer, you cannot test it.
Treating all sub-tasks as equally tolerant of error. Uniform accuracy targets (e.g., "95% correct across all queries") hide the fact that some sub-tasks have 10× higher failure costs. A 95% rate on refund calculations with $200 exposure per error means an expected loss of $10 per query (5% × $200). The same 5% error rate may be fine on SLA lookups and unacceptable on financial calculations.
Skipping the "what should NOT be AI" question. Framing should reveal sub-tasks that belong to deterministic logic, database lookups, or human escalation. If every sub-task gets routed to a language model, you over-spent on inference and under-invested in reliable components.
Single-user-type assumption. The PM says "support agents" as if they are one population. In reality: L1 agents handle simple queries (and need quick answers with citations), L2 agents handle escalations (and need exhaustive policy analysis), and team leads audit answers (and need provenance trails). Three populations with different jobs.
No latency target per job. "Fast" is not a latency target. L1 agents need < 5s. L2 agents will wait 15s for exhaustive analysis. Without per-job latency targets, you cannot make architecture tradeoffs.
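The uniform-accuracy-target failure shape in the list above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the $200 refund exposure comes from the text, while the $2 cost for an SLA lookup error and the $1 loss budget per query are invented for contrast, and both helper names are hypothetical.

```python
def expected_loss_per_query(accuracy: float, cost_per_error: float) -> float:
    """Expected dollars lost per query at a given accuracy."""
    return (1.0 - accuracy) * cost_per_error


def required_accuracy(cost_per_error: float, max_loss_per_query: float) -> float:
    """Accuracy needed to keep expected loss per query within a budget."""
    if cost_per_error <= 0:
        return 0.0
    return max(0.0, 1.0 - max_loss_per_query / cost_per_error)


# Cost per error by sub-task; the SLA figure is a made-up placeholder.
cost_per_error = {"refund calculation": 200.0, "SLA lookup": 2.0}

for name, cost in cost_per_error.items():
    loss = expected_loss_per_query(0.95, cost)
    target = required_accuracy(cost, max_loss_per_query=1.0)
    print(f"{name}: loss at 95% = ${loss:.2f}/query, accuracy for $1 budget = {target:.1%}")
```

Under these assumptions the same loss budget implies very different accuracy targets per sub-task, which is why a single uniform target hides risk.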

Where problem framing lives in production systems

Products built on visible task decomposition:

Stripe Radar — decomposes "is this transaction fraudulent?" into sub-signals: velocity, device fingerprint, behavioral pattern, merchant category risk — each scored independently before combination
Notion AI — separates "help me write" into distinct jobs: summarize existing content, generate from prompt, edit for tone, translate — each with different evidence needs and error tolerance
GitHub Copilot — decomposes "help me code" into completion (inline, low-latency, tolerates imprecision), chat (conversational, higher-latency, needs accuracy), and review (batch, highest accuracy requirement)
Intercom Fin — separates customer support into jobs the bot handles (FAQ, status lookup) vs jobs that route to humans (billing disputes, account recovery) based on failure-cost analysis
Linear's AI triage — decomposes "assign this issue" into: detect team from keywords, estimate priority from severity signals, suggest assignee from workload — each a distinct sub-task
Ramp's expense categorization — separates "categorize this expense" from "flag policy violations" — one tolerates ~10% error, the other needs near-zero false negatives
Superhuman's AI compose — distinguishes "draft from scratch" (generative, high tolerance) from "reply in my voice" (constrained, needs style matching, low tolerance for tone errors)
Slack AI search — decomposes "find that message" into: semantic search (recall-optimized), permission filtering (deterministic, zero-tolerance), and answer synthesis (summarization)
Figma's AI features — separates "rename layers" (batch, tolerates errors, user reviews) from "generate design" (interactive, user steers) — different autonomy levels from different failure costs
Plaid's transaction enrichment — decomposes merchant identification into: name normalization (deterministic rules), category classification (ML), and logo matching (embedding similarity) — three techniques for three sub-tasks
Zendesk's answer bot — routes by job type: factual lookups get retrieval, process questions get step-by-step extraction, complaints get routed to humans
Grammarly — decomposes "fix my writing" into: grammar (rule-based, near-zero tolerance), clarity (ML, moderate tolerance), tone (generative, high tolerance) — architecture follows the decomposition
Cursor — separates tab completion (must be fast, can be wrong), cmd-K editing (can be slower, must be precise), and chat (conversational, exploratory) — three products in one IDE
Canva's Magic Design — decomposes "design this for me" into: layout selection (template matching), content placement (constraint solver), and style transfer (generative model) — only the last one is "AI" in the LLM sense
Dovetail's AI analysis — separates "find themes in user research" into: clustering (unsupervised, exploratory), tagging (classification, needs consistency), and summarization (generative, needs faithfulness)

Products that visibly failed from missing decomposition:

Early chatbot deployments (2023) — wrapped GPT-4 in a chat UI, launched without job decomposition, pulled within weeks when hallucinated answers caused customer harm
Enterprise "AI search" products — shipped unified RAG without distinguishing exact-answer jobs from exploratory-search jobs, resulting in systems that were mediocre at both
AI coding tools pre-2022 — treated "help with code" as one job, producing tools that completed syntax well but could not reason about architecture — decomposition into distinct job types (completion vs explanation vs refactoring) improved each independently

Recall questions

Why is "add AI" not a requirement? What is it missing that a real requirement needs?
What are the five outputs of the task-decomposition framework?
In the wiki assistant example, which sub-task should explicitly NOT use a language model, and why?
What is the difference between a user job and a model capability? Give one example of each for the same feature.
Why does "95% accuracy across all queries" hide a critical product risk?
What is the first signal that framing is degrading — before accuracy drops?
Name two conditions under which detailed task decomposition is not worth the investment.
Why does the feature-list approach take longer to reach acceptable accuracy than the decomposition approach, despite starting faster?

Interview Q&A

Q1: A PM asks you to "add AI to our customer support tool." What is your first move — and what do you explicitly NOT do first?

First move: sample 30–50 real support queries from logs and cluster them into user jobs. Explicitly do not: evaluate models, pick a technique (RAG, fine-tuning, agents), or estimate timelines. You cannot estimate what you have not decomposed. The decomposition determines the architecture, and the architecture determines the timeline.

Common wrong answer to avoid: "I'd start by evaluating which model to use and whether RAG or fine-tuning is more appropriate." This puts the solution before the problem. You cannot choose a technique until you know what tasks the system must perform and what failure costs each task carries.

Q2: How do you convince a skeptical stakeholder that problem framing is worth 1–2 weeks before building?

Show them the failure-cost math. Sample real queries, identify the highest-cost job type, estimate the per-error cost (financial loss, compliance risk, customer churn), multiply by expected error rate of an unframed system (typically 15–40% on complex queries), multiply by weekly query volume. The resulting dollar figure per week of exposure usually exceeds the cost of 2 weeks of framing by 10×.

Common wrong answer to avoid: "Explain that it's best practice" or "Show them case studies from other companies." Stakeholders respond to their own numbers, not industry wisdom. Abstract arguments about process quality lose to concrete financial exposure in their own product.
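The failure-cost math in the answer to Q2 is a three-factor product. The sketch below reuses figures that appear elsewhere in this chapter ($200 per wrong refund answer, 150 queries per week, and the 15–40% error range); the function name `weekly_exposure` is hypothetical, and the 100% row is only a worst-case bound.

```python
def weekly_exposure(cost_per_error: float, error_rate: float, weekly_queries: int) -> float:
    """Expected dollars at risk per week for one job type."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return cost_per_error * error_rate * weekly_queries


# Low estimate, high estimate, and the worst case where every answer is wrong.
for rate in (0.15, 0.40, 1.00):
    exposure = weekly_exposure(200.0, rate, 150)
    print(f"error rate {rate:.0%}: ${exposure:,.0f}/week")
```

Presenting a range rather than a single number makes it clear which inputs are measured and which are estimates.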

Q3: What is the difference between decomposing into user jobs vs decomposing into model capabilities? Why does the distinction matter for system longevity?

User jobs are stable — they change when the business changes. "Confirm refund eligibility for this case" is a job that exists regardless of whether you use GPT-4, Claude, or a future model. Model capabilities are volatile — they change every quarter. "Do RAG" or "use function calling" are capabilities that may be superseded. If requirements are specified in job terms, you can swap models without rewriting the spec. If specified in capability terms, every model upgrade requires re-evaluation of whether the new model still supports the specified capability in the same way.

Common wrong answer to avoid: "They're basically the same thing — user jobs map 1:1 to model capabilities." They do not. A single user job often requires multiple capabilities (retrieval + extraction + calculation), and a single capability serves multiple jobs. The mapping is many-to-many, not 1:1.

Q4: You have a decomposition table with 8 user jobs and 30 sub-tasks. How do you decide what to build first?

Rank by: failure cost × frequency × current error rate. The sub-task that is expensive when wrong, asked often, and currently handled poorly is the highest-priority build target. Secondary factor: dependency structure — some sub-tasks share evidence sources or infrastructure, so building one unlocks others cheaply.

Common wrong answer to avoid: "Build the easiest one first to show quick wins." Quick wins on low-cost sub-tasks do not reduce risk. A system that perfectly answers low-stakes questions while failing on high-stakes ones is worse than a system that handles fewer queries but gets the expensive ones right.
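The ranking rule from Q4 (failure cost × frequency × current error rate) is easy to compute once the table has numbers in it. All figures below are invented placeholders, and `SubTaskStats` and `build_order` are hypothetical names; dependency structure, the secondary factor, is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class SubTaskStats:
    name: str
    failure_cost: float     # dollars per wrong answer
    weekly_frequency: int   # how often the sub-task runs per week
    error_rate: float       # current share of wrong answers, 0..1

    @property
    def priority(self) -> float:
        return self.failure_cost * self.weekly_frequency * self.error_rate


def build_order(stats: list[SubTaskStats]) -> list[SubTaskStats]:
    """Highest expected weekly cost of failure first."""
    return sorted(stats, key=lambda s: s.priority, reverse=True)


# Invented numbers for three sub-tasks.
stats = [
    SubTaskStats("SLA lookup", failure_cost=5.0, weekly_frequency=400, error_rate=0.10),
    SubTaskStats("Retrieve applicable policy version", 200.0, 150, 0.20),
    SubTaskStats("Check for promotional exceptions", 50.0, 150, 0.30),
]

for s in build_order(stats):
    print(f"{s.name}: {s.priority:,.0f}")
```

In this made-up data the most frequent sub-task ranks last, because its cost per error is low.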

Q5: When should you explicitly decide that a sub-task should NOT use AI?

When the sub-task has: (a) zero tolerance for error AND deterministic inputs (use a function), (b) a well-defined lookup with structured data (use a database query), (c) regulatory requirements for auditability that probabilistic systems cannot satisfy, or (d) costs that make model inference uneconomical for the accuracy required (a $0.003 model call that needs 99.99% accuracy on arithmetic is worse than a free function that gives 100%).

Common wrong answer to avoid: "Every sub-task should at least try AI first, and we can fall back to deterministic logic if it doesn't work." This wastes build time and creates maintenance burden. If the sub-task's requirements clearly point to deterministic logic, choosing AI is not "trying" — it is choosing the wrong tool and then discovering that fact expensively.
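Criteria (a) through (c) from the answer to Q5 can be written down as an explicit routing rule, so the "should this be AI?" decision is recorded per sub-task instead of assumed. This is a sketch: the `Route` values, the `SubTaskProfile` fields, and the order of checks are assumptions, and criterion (d), the cost comparison, is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    FUNCTION = "deterministic function"
    DB_QUERY = "database query"
    AUDITABLE_LOGIC = "auditable rules or human review"
    MODEL_CANDIDATE = "candidate for a model"


@dataclass
class SubTaskProfile:
    name: str
    zero_error_tolerance: bool
    deterministic_inputs: bool
    structured_lookup: bool
    requires_audit_trail: bool


def route(p: SubTaskProfile) -> Route:
    if p.zero_error_tolerance and p.deterministic_inputs:
        return Route.FUNCTION          # criterion (a)
    if p.structured_lookup:
        return Route.DB_QUERY          # criterion (b)
    if p.requires_audit_trail:
        return Route.AUDITABLE_LOGIC   # criterion (c)
    return Route.MODEL_CANDIDATE


# Profiles are illustrative guesses for three wiki-assistant sub-tasks.
proration = SubTaskProfile("Calculate prorated amount", True, True, False, False)
vintage = SubTaskProfile("Identify customer's contract vintage", False, True, True, False)
extraction = SubTaskProfile("Extract refund window and conditions", False, False, False, False)

for p in (proration, vintage, extraction):
    print(f"{p.name} -> {route(p).value}")
```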

Q6: How does task decomposition change your eval strategy compared to a feature-list approach?

With decomposition, each sub-task gets its own eval set with its own accuracy target calibrated to failure cost. "Extract refund window" is tested on 200 examples with exact-match scoring. "Calculate proration" is tested on 100 examples with numerical equality. With feature-list, you get one aggregate metric ("overall accuracy") that cannot tell you which component is failing or whether a model upgrade helped sub-task A while hurting sub-task B.

Common wrong answer to avoid: "You just test the overall system end-to-end." End-to-end testing cannot localize failures. When accuracy drops from 82% to 78%, end-to-end metrics cannot tell you whether retrieval degraded, extraction regressed, or a new query type appeared that no component handles. Per-sub-task evals can.
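A per-sub-task eval suite, as described in the answer to Q6, pairs each sub-task with its own scorer and its own target. The sketch below uses two tiny invented eval sets (a real suite would use the hundreds of examples mentioned in that answer), and the targets and helper names are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Callable

Scorer = Callable[[str, str], bool]


def exact_match(predicted: str, expected: str) -> bool:
    """Verbatim extraction: text must match exactly after trimming whitespace."""
    return predicted.strip() == expected.strip()


def numeric_equal(predicted: str, expected: str) -> bool:
    """Calculation: values must be numerically equal; unparseable output fails."""
    try:
        return Decimal(predicted) == Decimal(expected)
    except InvalidOperation:
        return False


def accuracy(pairs: list[tuple[str, str]], scorer: Scorer) -> float:
    """Fraction of (predicted, expected) pairs the scorer accepts."""
    return sum(scorer(p, e) for p, e in pairs) / len(pairs)


# sub-task -> (scorer, hypothetical target, invented (predicted, expected) pairs)
suite = {
    "extract refund window": (exact_match, 0.99, [("60 days", "60 days"), ("about 60 days", "60 days")]),
    "calculate proration": (numeric_equal, 1.00, [("36.0", "36.00"), ("12.50", "12.50")]),
}

report = {}
for name, (scorer, target, pairs) in suite.items():
    acc = accuracy(pairs, scorer)
    report[name] = (acc, acc >= target)
    print(f"{name}: accuracy={acc:.0%}, meets target={acc >= target}")
```

Because each sub-task reports separately, a paraphrased extraction fails its own check without being averaged away by correct calculations.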

Q7: A team shows you their "AI product requirements" document. It says: "The system should achieve 90% accuracy, respond in under 5 seconds, and handle 1000 queries per day." What is missing, and what risk does the gap create?

Missing: what user jobs the system serves, what sub-tasks each job requires, what evidence each sub-task needs, and what the failure cost is per sub-task. The risk: 90% accuracy is an average. If the system is 99% accurate on easy jobs (60% of volume) and 75% accurate on hard jobs (40% of volume), the volume-weighted average is 0.60 × 99% + 0.40 × 75% ≈ 89.4%, close to target, yet one in four hard jobs fails, a rate that may be unacceptable given their failure costs. Without decomposition, you cannot detect or fix this.

Common wrong answer to avoid: "Those are reasonable requirements — just add more detail about the types of queries." Adding query types without failure costs and evidence needs is still a feature list. The requirements must connect each type to what happens when it fails and what the system needs to answer it correctly.
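
The arithmetic behind Q7 is worth running once. This sketch uses the volume shares and accuracies from the answer above and the 1000 queries per day from the requirements document; the variable and function names are illustrative.

```python
# Segments from the Q7 answer: share of query volume and accuracy per segment.
segments = [
    {"name": "easy jobs", "volume_share": 0.60, "accuracy": 0.99},
    {"name": "hard jobs", "volume_share": 0.40, "accuracy": 0.75},
]
QUERIES_PER_DAY = 1000  # from the requirements document in Q7


def blended_accuracy(segs) -> float:
    """Volume-weighted average accuracy across segments."""
    return sum(s["volume_share"] * s["accuracy"] for s in segs)


def daily_failures(seg, queries_per_day: int) -> int:
    """Expected number of wrong answers per day within one segment."""
    return round(queries_per_day * seg["volume_share"] * (1 - seg["accuracy"]))


overall = blended_accuracy(segments)
print(f"blended accuracy: {overall:.1%}")
for seg in segments:
    print(f"{seg['name']}: {daily_failures(seg, QUERIES_PER_DAY)} failures per day")
```

Under these numbers the hard jobs produce roughly 100 wrong answers per day against about 6 for the easy jobs, which the single blended figure does not show.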

Q8: How do you handle a user job where you cannot determine the failure cost because it has never gone wrong before?

Estimate by analogy and bound it. Find the closest job type where failures have occurred and use that cost as a starting estimate. Set up monitoring to detect the first real failure and update the estimate. In the meantime, treat unknown-cost jobs as medium-high priority (not low — "we've never seen it fail" often means "we've never measured it failing," not "it cannot fail").

Common wrong answer to avoid: "If we can't measure the failure cost, we should deprioritize it." Unknown cost ≠ low cost. It means unmeasured cost. The jobs most likely to cause catastrophic failures are often the ones nobody has systematically tracked — because they were rare enough to be handled manually until now.
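
One way to keep an unmeasured cost from silently becoming a low one is to record where the estimate came from. The sketch below is an assumption-laden illustration: `FailureCostEstimate`, the $100 threshold, the analog pairing, and the observed costs are all hypothetical. It covers only the "start from an analog, then update on real failures" part of the answer.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class FailureCostEstimate:
    """Failure cost for a job, seeded from the closest job with known failures."""
    job: str
    analog_job: str
    analog_cost: float  # average failure cost observed on the analog job
    observed_costs: List[float] = field(default_factory=list)

    def record_failure(self, cost: float) -> None:
        """Called by monitoring when a real failure on this job is detected."""
        self.observed_costs.append(cost)

    @property
    def is_measured(self) -> bool:
        return len(self.observed_costs) > 0

    @property
    def cost(self) -> float:
        """Measured average once failures exist, otherwise the analog estimate."""
        return mean(self.observed_costs) if self.is_measured else self.analog_cost

    def priority(self, high_cost_threshold: float = 100.0) -> str:
        # Unknown cost is never treated as low: default to medium-high.
        if not self.is_measured:
            return "medium-high"
        return "high" if self.cost >= high_cost_threshold else "low"


# Hypothetical pairing: an SLA job seeded from the refund policy job's $200 average.
sla = FailureCostEstimate(job="SLA commitments",
                          analog_job="refund policy version", analog_cost=200.0)
print(sla.cost, sla.priority())   # analog estimate, unmeasured

sla.record_failure(350.0)
sla.record_failure(450.0)
print(sla.cost, sla.priority())   # updated from observed failures
```

The design choice that matters is the default branch in `priority`: a job with no recorded failures can never drop to "low" until monitoring has produced real data.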

Design/debug exercise (10 min)

Step 1 — Modeled example. Here is a complete decomposition for one user job from the wiki assistant:

Layer	Content
User request	"Customer on 2023 promo wants prorated refund after mid-cycle cancellation"
User job	Confirm refund eligibility and calculate amount for this specific case
Sub-task 1	Look up customer contract vintage → CRM query → exact lookup → failure cost: medium
Sub-task 2	Retrieve applicable policy version → versioned doc store → temporal filter → failure cost: high ($200 avg)
Sub-task 3	Extract refund conditions → policy doc → verbatim extraction → failure cost: high (compliance)
Sub-task 4	Check promotional exceptions → promo archive → exhaustive boolean search → failure cost: medium ($50)
Sub-task 5	Calculate prorated amount → billing data + policy rules → deterministic function → failure cost: high (financial)
Architecture implication	Sub-tasks 1, 5: no LLM. Sub-tasks 2–4: retrieval + extraction (distinct capabilities).
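
The table above can also be held as data, which lets the architecture implication be derived rather than asserted. This is a sketch under stated assumptions: `SubTaskRow`, `NO_LLM_EVIDENCE_TYPES`, and `architecture_implication` are hypothetical names, and the rule "exact lookups and deterministic functions need no LLM" is a simplification of the reasoning in this chapter.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SubTaskRow:
    number: int
    name: str
    evidence_source: str
    evidence_type: str
    failure_cost: str


# The five sub-tasks from the modeled refund example.
REFUND_JOB = [
    SubTaskRow(1, "Look up customer contract vintage", "CRM query",
               "exact lookup", "medium"),
    SubTaskRow(2, "Retrieve applicable policy version", "versioned doc store",
               "temporal filter", "high ($200 avg)"),
    SubTaskRow(3, "Extract refund conditions", "policy doc",
               "verbatim extraction", "high (compliance)"),
    SubTaskRow(4, "Check promotional exceptions", "promo archive",
               "exhaustive boolean search", "medium ($50)"),
    SubTaskRow(5, "Calculate prorated amount", "billing data + policy rules",
               "deterministic function", "high (financial)"),
]

# Assumption: these evidence types are served by traditional software alone.
NO_LLM_EVIDENCE_TYPES = {"exact lookup", "deterministic function"}


def architecture_implication(rows: List[SubTaskRow]) -> Dict[str, List[int]]:
    """Split sub-task numbers into 'no LLM' and 'retrieval + extraction' groups."""
    no_llm = [r.number for r in rows if r.evidence_type in NO_LLM_EVIDENCE_TYPES]
    retrieval = [r.number for r in rows if r.evidence_type not in NO_LLM_EVIDENCE_TYPES]
    return {"no_llm": no_llm, "retrieval_extraction": retrieval}


print(architecture_implication(REFUND_JOB))
```

The same structure can be reused for the job you decompose in the next step: fill in the rows and check whether at least one lands in the `no_llm` group.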

Step 2 — Your turn. Pick one of the other job types from the wiki assistant (SLA commitments, escalation routing, or account lifecycle). Decompose it into the same layers: user job, sub-tasks, evidence source per sub-task, evidence type, failure cost, and architecture implication. You should find at least one sub-task that does not need a language model.

Step 3 — Reproduce from memory. Close this file. Draw the core decomposition hierarchy (vague request → user jobs → sub-tasks → evidence needs → failure costs) from memory. For each layer, write one sentence explaining what it produces and why the layer below it is necessary. Check your version against section 6.

Operational memory

Problem framing converts "add AI" — which is not a requirement — into a concrete artifact that engineers can build against and QA can test against. The artifact is a decomposition table: user jobs, sub-tasks per job, evidence needs per sub-task, and failure costs per sub-task. Without this table, teams build pipelines that optimize for the wrong thing, cannot localize failures, and take 2–3× longer to reach acceptable accuracy.

The mechanism works because user jobs are stable while model capabilities are volatile. Specifying requirements in job terms means the system survives model upgrades, technique changes, and architectural pivots. Specifying in capability terms means every change is a regression risk.

The decomposition also reveals which sub-tasks should not use AI at all. Deterministic calculations, structured data lookups, and zero-error-tolerance steps belong to traditional software. AI belongs where evidence is unstructured, reasoning is required, and some error rate is acceptable given the failure cost.

The diagnostic question to carry forward: for any proposed AI feature, can you fill in a row that says "user job → sub-task → evidence source → evidence type → failure cost → latency target"? If not, the requirement is not ready for engineering.
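
The diagnostic question can be turned into a mechanical readiness check. The following is a minimal sketch, assuming a row is kept as a plain dictionary; the field keys, `missing_fields`, and the example values (including the latency figure) are illustrative and not prescribed by this chapter.

```python
from typing import Dict, List

# The six columns named in the diagnostic question.
REQUIRED_FIELDS = (
    "user_job", "sub_task", "evidence_source",
    "evidence_type", "failure_cost", "latency_target",
)


def missing_fields(row: Dict[str, str]) -> List[str]:
    """Return the required columns that are absent or blank in this row."""
    return [f for f in REQUIRED_FIELDS if not str(row.get(f) or "").strip()]


def ready_for_engineering(row: Dict[str, str]) -> bool:
    return not missing_fields(row)


vague = {"user_job": "add an AI assistant"}
framed = {
    "user_job": "Confirm refund eligibility and calculate amount",
    "sub_task": "Retrieve applicable policy version",
    "evidence_source": "versioned doc store",
    "evidence_type": "temporal filter",
    "failure_cost": "high ($200 avg)",
    "latency_target": "under 5 seconds",  # hypothetical value
}

print(ready_for_engineering(vague), missing_fields(vague))
print(ready_for_engineering(framed), missing_fields(framed))
```

A check like this only confirms that every column is filled in; whether the entries are correct still needs review by someone who knows the user job.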

Remember:

"Add AI" is a solution without a problem. Convert it to user jobs before any architecture discussion.
The unit of decomposition is the user job, not the model capability. Jobs are stable; capabilities change quarterly.
Every sub-task has a failure cost. If you cannot name it, you cannot set an accuracy target or prioritize engineering work.
Not every sub-task needs AI. Deterministic logic, database lookups, and human escalation are often cheaper, faster, and more reliable for specific sub-tasks.
A system built from decomposed sub-tasks is independently testable, independently improvable, and survives model upgrades.
The first sign of framing failure: engineers building capabilities nobody mapped to a user job.
Unknown failure cost ≠ low failure cost. It means unmeasured.

Bridge. We now have user jobs decomposed into sub-tasks with evidence needs and failure costs. But not every sub-task should be solved by a model — some belong to deterministic logic, some to database queries, some to human experts. The next chapter tackles the fit decision: when AI is the right tool for a sub-task, when it is not, and how to make that call before committing engineering resources. → 02-ai-fit-decision.md