02. AI fit decision — When the model earns its cost and when it doesn't

~16 min read. Most AI features fail not because the model is bad but because the task never needed a model. This chapter teaches a four-question decision tree that routes each user job to the cheapest, fastest, most reliable implementation — model, rule engine, lookup, or workflow automation — before any prompt is written.

Built on 01-problem-framing.md. User jobs, task decomposition, and failure-cost awareness are the recurring concepts here. If the job list from ch01 is not yet stable, this chapter's decisions will drift as the jobs change.


1) What this chapter solves

Teams reach for a language model because it feels like the universal tool. It parses, it classifies, it generates. The result: a system that costs 200× more per request than a SQL query, runs 50× slower, and hallucinates on inputs that a regex handles deterministically.

This chapter prevents that mistake. It gives you a repeatable decision function — not intuition, not "let's prototype and see" — that assigns each decomposed task to the right execution layer before you spend a single GPU-second.

The output is a scored task matrix: every job from your ch01 decomposition gets a row, four boolean columns, and a routing verdict.


2) What problem framing revealed and what it cannot decide

Chapter 01 decomposed the fintech wiki assistant into discrete user jobs:

  • Parse a natural-language question into intent + entities
  • Retrieve the relevant policy document section
  • Generate a human-readable explanation of a rule
  • Check whether the user's account status matches an eligibility criterion
  • Format a date range from free-text input ("last quarter") into YYYY-MM-DD
  • Route the user to the correct support queue based on topic

Problem framing tells you what the user needs. It does not tell you how to build it. A job like "format a date range" looks simple enough for a model — and the model will get it right 97% of the time. But a two-line date parser gets it right 100% of the time, in 0.2ms, at zero marginal cost.
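A minimal sketch of the deterministic alternative for the "last quarter" case. It assumes quarters are calendar quarters and handles only that one phrase; `last_quarter_range` is a hypothetical helper name, not something defined in ch01.

```python
from datetime import date, timedelta

def last_quarter_range(today: date) -> tuple[str, str]:
    """Return (start, end) of the previous calendar quarter as YYYY-MM-DD strings."""
    # First month of the current quarter: 1, 4, 7, or 10.
    q_start_month = 3 * ((today.month - 1) // 3) + 1
    current_q_start = date(today.year, q_start_month, 1)
    # The previous quarter ends the day before the current quarter starts.
    end = current_q_start - timedelta(days=1)
    # end.month is always 3, 6, 9, or 12, so subtracting 2 stays in the same year.
    start = date(end.year, end.month - 2, 1)
    return start.isoformat(), end.isoformat()

print(last_quarter_range(date(2025, 2, 14)))  # ('2024-10-01', '2024-12-31')
```

Passing `today` in explicitly keeps the function pure, so the same input always yields the same range and the behavior can be pinned down with ordinary unit tests.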

What still breaks: teams treat the job list as a shopping list for model capabilities. Every job becomes a prompt. The decision tree in this chapter interrupts that reflex.


3) The expensive lesson: when a model replaced a SQL query

A payments team built an LLM-powered "transaction summarizer." One feature: tell the user their total spend in a category last month. The model received raw transaction rows, counted them, and returned a summary sentence.

Latency: 2.3 seconds. Cost: $0.004 per request. Accuracy: 94.2% (the model occasionally miscounted rows or hallucinated a rounding).

The deterministic alternative: SELECT SUM(amount) FROM transactions WHERE category = ? AND month = ?. Latency: 12ms. Cost: $0.000002. Accuracy: 100%.
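To make the comparison concrete, here is a sketch of that query running against an in-memory SQLite database. The schema (a `month` column stored as `YYYY-MM` text, amounts as `REAL`) and the sample rows are assumptions for illustration; a real ledger would likely store integer cents and a proper date column.

```python
import sqlite3

# Hypothetical schema: the source only names the columns amount, category, month.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (amount REAL, category TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (12.50, "groceries", "2025-03"),
        (40.00, "groceries", "2025-03"),
        (9.99, "streaming", "2025-03"),
        (25.00, "groceries", "2025-02"),
    ],
)

def category_spend(conn, category, month):
    """Total spend for one category in one month; 0.0 when no rows match."""
    row = conn.execute(
        # SUM over zero rows yields NULL, so COALESCE maps that case to 0.0.
        "SELECT COALESCE(SUM(amount), 0.0) FROM transactions "
        "WHERE category = ? AND month = ?",
        (category, month),
    ).fetchone()
    return row[0]

total = category_spend(conn, "groceries", "2025-03")
print(f"You spent ${total:.2f} on groceries last month.")
# You spent $52.50 on groceries last month.
```

The summary sentence is a format string around the aggregate, so the number shown to the user is always the number the database computed.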

The model was not bad. It was wrong-tool. The input space is bounded (structured transaction records), the output requires no judgment (a number), and the failure cost of a wrong sum in a financial product is severe.

This is a wrong-tool problem, not a model-quality problem: the LLM was summing rows that a SQL aggregate totals in 12ms with zero hallucination risk.


4) Three support queries with different AI-fit profiles

Consider three real queries hitting the wiki assistant:

Query A: "What's the wire transfer cutoff time?" → Exact answer lives in one row of a policy table. Deterministic lookup. No model needed.

Query B: "I sent a wire yesterday but it hasn't arrived — what happened?" → Requires reasoning across account state, transaction status, holiday calendar, correspondent bank rules. Judgment needed. Model-fit.

Query C: "Route this ticket to the right team." → Topic classification against a fixed taxonomy of 12 queues. Bounded input space, bounded output space. A fine-tuned classifier or keyword router handles this; a general LLM is overkill.

Three queries, three different verdicts. The decision tree formalizes why.


5) The rule: use a model only when the input space is unbounded and the output requires judgment

This is the single sentence that saves months of misallocation:

Deploy a language model when the input cannot be enumerated in advance AND the correct output requires synthesis, judgment, or generation — not retrieval of a known fact.

If either condition fails, a cheaper tool dominates:

  • Bounded input + known output → lookup table or rule engine
  • Bounded input + generated output → template with slot-filling
  • Unbounded input + known output → search index with exact-match return
  • Unbounded input + generated output → model territory

Teams misfire most often on the third case: unbounded input tricks them into thinking a model is required, but if the output is a known fact, retrieval plus extraction wins.
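The four cases above amount to a two-key lookup, which can be sketched as follows. The names `QUADRANT_TOOL` and `tool_for` are illustrative, not part of any library.

```python
# Keys are (input_is_bounded, output_is_generated); values are the tool class
# named for each of the four cases in this section.
QUADRANT_TOOL = {
    (True, False): "lookup table or rule engine",
    (True, True): "template with slot-filling",
    (False, False): "search index with exact-match return",
    (False, True): "language model",
}

def tool_for(input_is_bounded: bool, output_is_generated: bool) -> str:
    """Map the two conditions of the rule to the cheapest tool that fits."""
    return QUADRANT_TOOL[(input_is_bounded, output_is_generated)]

# The commonly misjudged case: free-text input, but the answer is a known fact.
print(tool_for(input_is_bounded=False, output_is_generated=False))
# search index with exact-match return
```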


6) The AI-fit decision tree — four questions that route each task

┌─────────────────────────────────────────────────────────────┐
│              AI-FIT DECISION TREE                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Q1: Is the input space bounded and enumerable?             │
│      │                                                      │
│      ├── YES → Rule engine / lookup table / SQL query       │
│      │         (stop here)                                  │
│      │                                                      │
│      └── NO ↓                                               │
│                                                             │
│  Q2: Does the output require natural-language generation    │
│      or judgment?                                           │
│      │                                                      │
│      ├── NO → Structured query / workflow automation /      │
│      │        search index (stop here)                      │
│      │                                                      │
│      └── YES ↓                                              │
│                                                             │
│  Q3: Is the failure cost of a wrong answer < cost of        │
│      human review per instance?                             │
│      │                                                      │
│      ├── NO → Human-in-the-loop required                   │
│      │        (model assists, human decides)                │
│      │                                                      │
│      └── YES ↓                                              │
│                                                             │
│  Q4: Does the evidence needed to answer exist and           │
│      is it accessible to the system?                        │
│      │                                                      │
│      ├── NO → Cannot ship. Fix the data layer first.       │
│      │                                                      │
│      └── YES → MODEL-DRIVEN TASK ✓                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Each question is binary. No "it depends." If you cannot answer a question with confidence, that itself is a signal — you need more problem framing before you build.
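Because every question is binary and the tree stops at the first disqualifying answer, it can be written as a short chain of early returns. This is a sketch of the diagram above, with hypothetical function and parameter names; the verdict strings are copied from the tree.

```python
def ai_fit_verdict(
    bounded_input: bool,                   # Q1
    needs_generation_or_judgment: bool,    # Q2
    failure_cost_below_review_cost: bool,  # Q3
    evidence_accessible: bool,             # Q4
) -> str:
    """Walk Q1 to Q4 in order and stop at the first answer that rules out a model."""
    if bounded_input:
        return "rule engine / lookup table / SQL query"
    if not needs_generation_or_judgment:
        return "structured query / workflow automation / search index"
    if not failure_cost_below_review_cost:
        return "human-in-the-loop (model assists, human decides)"
    if not evidence_accessible:
        return "cannot ship: fix the data layer first"
    return "model-driven task"

# Explaining an eligibility rule in plain language: NO, YES, YES, YES.
print(ai_fit_verdict(False, True, True, True))
# model-driven task

# Formatting "last quarter" stops at Q1, so the later answers are never read.
print(ai_fit_verdict(True, False, False, False))
# rule engine / lookup table / SQL query
```

Note that Q1 is the one question where YES routes away from the model; the other three must all be YES to reach the model verdict.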


7) Running the decision tree on five wiki-assistant tasks

| Task | Q1: Bounded input? | Q2: Needs generation/judgment? | Q3: Failure cost < review cost? | Q4: Evidence accessible? | Verdict |
|---|---|---|---|---|---|
| Parse natural-language question into intent | NO | YES | YES | YES | Model |
| Retrieve policy section for "wire cutoff" | NO (free text) | NO (return exact doc) | YES | - | Search index |
| Explain an eligibility rule in plain language | NO | YES | YES | YES | Model |
| Format "last quarter" → date range | YES (finite patterns) | NO | - | - | Rule engine |
| Route ticket to support queue | YES (12 categories) | NO | - | - | Classifier / rules |

The table is the inspectable artifact. Pin it to your design doc. Revisit it when scope changes.

Every routing decision names a pressure it relieves (cost, latency, hallucination risk) and a new pressure it creates (maintenance of rules, staleness of search index, coverage gaps in the classifier).


8) Why the decision tree beats "just try GPT on everything"

The alternative approach — prototype every task with a general-purpose LLM and keep what works — has three failure modes:

  1. Survivorship bias. You keep the tasks where the model demo'd well on 10 examples. You miss the 2% failure rate that surfaces at 10,000 requests/day.

  2. Cost creep. Each task is cheap individually. Multiplied across all tasks and all users, the bill compounds. A team at a logistics company found their "summarize shipment status" feature cost $14,000/month — replaced by a template engine for $40/month on the same infra.

  3. Latency floor. Every model call adds 400ms–2s. Chain three model calls for a single user action and you hit 1.2–6 seconds. Users leave. A rule-engine + search-index path delivers the same answer in 80ms.

The decision tree is not anti-AI. It is pro-correct-tool. It protects the AI budget for tasks where the model actually earns its latency and cost.


9) Cost comparison: model call vs rule engine vs search index

| Dimension | LLM API call | Rule engine | Search index | Template engine |
|---|---|---|---|---|
| Latency (p50) | 800ms | 2ms | 15ms | 1ms |
| Latency (p99) | 3.2s | 8ms | 45ms | 4ms |
| Cost per 1K requests | $3.50 | $0.001 | $0.02 | $0.0005 |
| Accuracy (bounded input) | 97.1% | 100% | 99.8% | 100% |
| Accuracy (unbounded input) | 89–94% | N/A | 78% | N/A |
| Maintenance burden | Prompt versioning, eval sets | Rule updates | Index freshness | Template updates |
| Failure mode | Hallucination, refusal | Missing rule → hard error | Stale index → wrong doc | Missing slot → blank |

These numbers come from production systems at fintech scale (50K–500K requests/day). The model wins exactly one row: accuracy on unbounded input. Everywhere else, simpler tools dominate.

If your task lands in a cell where the model does not win, you are paying a tax for no return. The decision tree prevents that tax from compounding silently.
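To see how the per-request gap compounds, the cost-per-1K figures from the table can be projected to a monthly bill at the stated volumes. The sketch below assumes a 30-day month and flat pricing with no volume discounts; `monthly_cost` is an illustrative helper.

```python
# USD per 1,000 requests, taken from the comparison table in this section.
COST_PER_1K = {
    "LLM API call": 3.50,
    "Rule engine": 0.001,
    "Search index": 0.02,
    "Template engine": 0.0005,
}

def monthly_cost(tool: str, requests_per_day: int, days: int = 30) -> float:
    """Projected spend in USD, assuming flat per-request pricing."""
    return COST_PER_1K[tool] * requests_per_day * days / 1000

for tool in COST_PER_1K:
    low = monthly_cost(tool, 50_000)
    high = monthly_cost(tool, 500_000)
    print(f"{tool:<16} ${low:>10,.2f} to ${high:>10,.2f} per month")
```

At 50K requests/day this works out to roughly $5,250 per month for the LLM path against about $1.50 for the rule engine, which is the 3,500× ratio implied by the table.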


10) Signals that you chose wrong — the model is a glorified lookup

Watch for these in production:

  1. Output entropy is near zero. The model returns the same 4–5 answers for 90%+ of requests. You built a $3.50/1K lookup table.

  2. Prompt contains the answer. You stuffed the policy text into context and asked the model to "find the relevant sentence." That is retrieval with extra steps.

  3. Eval shows 100% accuracy on a bounded test set. If perfect accuracy is achievable, a deterministic path achieves it cheaper.

  4. Latency is the top user complaint and the model call is the bottleneck. If the answer is deterministic, you are paying 800ms for something a cache serves in 2ms.

  5. The error cases are all formatting errors, not reasoning errors. The model is generating structured output that a template engine handles natively.

  6. You are fine-tuning to reduce hallucination on a fixed fact set. You are training the model to memorize — use a database instead.

When you see these signals, do not tune the prompt. Replace the tool.
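Signal 1 can be measured directly from response logs. The sketch below assumes responses have already been normalized so that identical answers compare equal as strings, and that the log is non-empty; the 90% threshold and top-5 cutoff mirror the wording of signal 1, while the function names and sample data are made up for illustration.

```python
import math
from collections import Counter

def output_entropy_bits(responses):
    """Shannon entropy (bits) of the empirical response distribution (non-empty input)."""
    counts = Counter(responses)
    total = len(responses)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def top_k_share(responses, k=5):
    """Fraction of requests answered by the k most frequent responses."""
    counts = Counter(responses)
    return sum(c for _, c in counts.most_common(k)) / len(responses)

def looks_like_lookup(responses, k=5, share_threshold=0.90):
    """True when a handful of answers covers most traffic."""
    return top_k_share(responses, k) >= share_threshold

# Made-up log: three answers dominate, plus five one-off responses.
log = ["answer A"] * 60 + ["answer B"] * 25 + ["answer C"] * 10
log += [f"one-off {i}" for i in range(5)]

print(round(top_k_share(log), 2))  # 0.97
print(looks_like_lookup(log))      # True
```

Top-k share is often easier to act on than raw entropy, because the k dominant answers are exactly the entries a cache or lookup table would need to hold.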


11) Where the boundary blurs — tasks that start deterministic and drift toward AI

Some tasks genuinely migrate:

  • FAQ lookup starts as keyword search. After 6 months, users ask increasingly varied phrasings. Search recall drops. Semantic retrieval + generation becomes justified.

  • Ticket routing starts as a 12-category classifier. The taxonomy grows to 60 categories with overlapping definitions. A model with chain-of-thought routing outperforms the rule set.

  • Compliance checking starts as a rule engine against a stable regulation. The regulation changes quarterly and the rule updates lag. An LLM that reads the raw regulation text closes the lag.

The decision tree is not a one-time gate. Re-run it quarterly or whenever:

  • Input variety grows beyond what rules cover
  • Accuracy of the deterministic path drops below threshold
  • A new data source makes evidence accessible that was not before

The pressure this creates: you need monitoring on the deterministic paths, not just the model paths. A rule engine that silently falls behind is worse than a model that visibly hallucinates — at least hallucination triggers investigation.
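One way to keep a deterministic path from falling behind silently is to track how often it fails to match and flag the task for a decision-tree re-run when that rate crosses a threshold. This is a sketch only: the class name, the window size, and the 5% threshold are assumptions, not values from the chapter.

```python
from collections import deque

class RulePathMonitor:
    """Tracks the miss rate of a deterministic path over a sliding window."""

    def __init__(self, window: int = 1000, max_miss_rate: float = 0.05):
        # True = a rule matched the request, False = no rule covered it.
        self.outcomes = deque(maxlen=window)
        self.max_miss_rate = max_miss_rate

    def record(self, matched: bool) -> None:
        self.outcomes.append(matched)

    def miss_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def needs_review(self) -> bool:
        """True when misses exceed the threshold, i.e. time to re-run the tree."""
        return self.miss_rate() > self.max_miss_rate

monitor = RulePathMonitor(window=100, max_miss_rate=0.05)
for i in range(100):
    monitor.record(i % 10 != 0)  # every tenth request matches no rule

print(monitor.miss_rate())     # 0.1
print(monitor.needs_review())  # True
```

The bounded `deque` drops the oldest outcome automatically, so the rate reflects recent traffic rather than the whole history.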


12) Wrong assumption: "AI is always better because it handles edge cases"

This belief sounds reasonable. Models are flexible; rules are brittle. But flexibility has a cost:

  • Edge-case handling is not free. The model handles the 3% edge cases and introduces 2% novel errors on the 97% that rules handle perfectly.

  • Edge cases may not matter. If the edge case is "user typed the date in French" and your user base is English-only, you are paying for coverage you do not need.

  • Edge-case volume determines ROI. If 500 requests/day hit the edge case, model cost is justified. If 2 requests/week hit it, a human-in-the-loop fallback is cheaper.

The correct frame: calculate the cost of handling edge cases with a model vs the cost of letting them fall to a human queue. If human handling is cheaper at current volume, ship the rule engine with a human fallback and revisit when volume changes.
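That comparison can be written down as a small break-even check. The sketch assumes the model-only path pays the model price on every request, while the rules path pays a human-review price only on edge cases; every number in the usage example is hypothetical except the $3.50 per 1K model price from the cost table.

```python
def cheaper_path(
    total_requests: int,
    edge_case_rate: float,
    model_cost_per_request: float,
    human_cost_per_review: float,
    rule_cost_per_request: float = 0.0,
):
    """Compare model-on-everything against rules plus a human queue for edge cases."""
    model_total = total_requests * model_cost_per_request
    edge_cases = total_requests * edge_case_rate
    rules_total = (
        total_requests * rule_cost_per_request
        + edge_cases * human_cost_per_review
    )
    verdict = "rules + human fallback" if rules_total < model_total else "model"
    return verdict, model_total, rules_total

# Hypothetical month: 100,000 requests, $2.00 per human review.
print(cheaper_path(100_000, 0.001, 0.0035, 2.00)[0])  # rules + human fallback
print(cheaper_path(100_000, 0.014, 0.0035, 2.00)[0])  # model
```

The verdict flips with the edge-case rate alone, which is why the comparison has to be redone when volume or the share of edge cases changes.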

Real example: An insurance company deployed an LLM to parse claim descriptions because "some claims have unusual phrasing." The unusual cases were 1.4% of volume. The model cost $8,200/month. A rule engine + human queue for edge cases cost $1,100/month total. The model was retired in week 3.


13) Failure catalog — six ways teams misuse AI

| # | Pattern | What they built | What they should have built | Cost of the mistake |
|---|---|---|---|---|
| 1 | LLM as calculator | Model sums transaction amounts | SQL aggregate | $4,200/mo wasted, 94% accuracy vs 100% |
| 2 | LLM as router | GPT-4 classifies tickets into 8 fixed buckets | Logistic regression on TF-IDF | 200× latency, same accuracy |
| 3 | LLM as template | Model generates the same confirmation email with slot values | Jinja template | $1,800/mo for zero creative value |
| 4 | LLM as search | Model "finds" the answer from 3 candidate paragraphs stuffed into prompt | BM25 + extractive highlight | 600ms added, no quality gain |
| 5 | LLM as validator | Model checks if a phone number is valid | Regex: ^\+?[1-9]\d{1,14}$ | Hallucinated "valid" on malformed inputs |
| 6 | LLM as state machine | Model decides next workflow step from fixed transitions | Finite state machine | Non-deterministic transitions broke SLAs |

Each entry is a real production system observed between 2023 and 2025. None failed because the model was incapable. All failed because a simpler tool was ignored.

The most expensive line of code in AI engineering is not the model call. It is the missing if statement that would have prevented the model call.
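A sketch of what that if statement can look like: check the deterministic source first and call the model only on a miss. `POLICY_FACTS`, its placeholder value, and the `call_model` callback are all hypothetical stand-ins; a real system would read from the policy table and wrap whatever model client it uses.

```python
from typing import Callable

# Hypothetical fact table; the value is a placeholder, not a real policy.
POLICY_FACTS = {
    "wire transfer cutoff time": "<cutoff time from the policy table>",
}

def normalize(question: str) -> str:
    """Lowercase and trim surrounding spaces and punctuation for exact-match lookup."""
    return question.lower().strip(" ?!.")

def answer(question: str, call_model: Callable[[str], str]) -> str:
    key = normalize(question)
    # The guard: known facts are served without touching the model.
    if key in POLICY_FACTS:
        return POLICY_FACTS[key]
    return call_model(question)

calls = []

def fake_model(question: str) -> str:
    """Stub standing in for a model client; records each call it receives."""
    calls.append(question)
    return "model-generated answer"

print(answer("Wire transfer cutoff time?", fake_model))   # served from the table
print(answer("Why hasn't my wire arrived?", fake_model))  # falls through to the model
print(len(calls))  # 1
```

Injecting the model as a callback keeps the guard testable without network access and makes the number of model calls easy to count.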


Recall

  1. State the single-sentence rule for when to deploy a language model.
  2. List the four questions in the AI-fit decision tree in order.
  3. For the wiki assistant's "format last quarter to date range" task, which question stops the tree and what is the verdict?
  4. Name three production signals that indicate a model is acting as a glorified lookup.
  5. What is the cost-per-1K-requests difference between an LLM API call and a rule engine at fintech scale?
  6. Why does "just try GPT on everything" create survivorship bias?
  7. Under what condition does a task that starts deterministic justify migration to a model?
  8. In the insurance company example, what was the monthly cost difference between the LLM and the rule-engine-plus-human alternative?

Interview Q&A

Q1: How do you decide whether a task should use an LLM or a deterministic system? Apply the four-question decision tree: (1) Is input bounded? (2) Does output need generation/judgment? (3) Is the failure cost of a wrong answer lower than the cost of human review? (4) Is evidence accessible? The task warrants a model only when the answers are NO, YES, YES, YES in that order. Wrong-answer note: Saying "we prototype with GPT and keep what works" skips the cost and reliability analysis. Interviewers want a framework, not trial-and-error.

Q2: Give an example where using an LLM was the wrong choice. Transaction summing: a payments team used GPT to sum amounts from raw rows. A SQL query does it in 12ms at 100% accuracy. The model added 2.3s latency, cost 2000× more per request, and hallucinated totals 5.8% of the time. Wrong-answer note: Do not cite a case where the model was simply inaccurate. The point is that a simpler tool dominates — accuracy is one dimension, not the whole story.

Q3: When would you migrate a deterministic system to an LLM? When input variety outgrows what rules cover, deterministic accuracy drops below threshold, or new evidence becomes accessible. Re-evaluate quarterly. The migration is justified by volume × error-cost math, not by intuition. Wrong-answer note: Saying "when the rule engine gets too complex" is necessary but insufficient — complexity alone does not justify the latency/cost/hallucination trade-offs of a model.

Q4: How do you handle edge cases that a rule engine misses? Calculate: (edge-case volume × cost-per-human-review) vs (total volume × model-cost-per-request). If human fallback is cheaper, ship rules + human queue. Revisit when volume crosses the break-even point. Wrong-answer note: "Use the LLM because it handles edge cases" ignores that the model introduces novel errors on the non-edge majority.

Q5: What signals tell you a deployed model should be replaced with a simpler system? Near-zero output entropy, prompt containing the answer, 100% eval accuracy on bounded test sets, formatting-only errors, and latency complaints where the model call is the bottleneck. Wrong-answer note: Saying "low accuracy" points the wrong direction — the issue is that the model is too good at a task that does not need it, wasting cost for marginal benefit.

Q6: How do you prevent cost creep in a multi-task AI system? Route each task through the decision tree independently. Aggregate cost per task. Set per-task cost budgets. Alert when a task's model spend exceeds what a deterministic alternative would cost. Replace tools that fail the cost test. Wrong-answer note: "Negotiate volume discounts with the API provider" treats the symptom. The structural fix is routing tasks to the cheapest capable tool.

Q7: A PM says "let's just use AI for everything — it's more future-proof." How do you respond? Present the cost comparison table. Show that bounded-input tasks cost 3,500× more per request with a model. Frame it as: "We use AI where it earns its cost — four tasks out of nine. The other five ship faster, cheaper, and more reliably without it." Wrong-answer note: Do not argue against AI philosophically. Use numbers. PMs respond to cost, latency, and reliability data.

Q8: What is the relationship between the AI-fit decision and success metrics? The decision tree determines which tasks get a model. Success metrics determine whether the model-driven tasks are working. You cannot define success metrics for tasks that should not use a model — that conflates tool selection with outcome measurement. Wrong-answer note: Conflating "the model works" with "the feature works" is the core error this chapter prevents.


Design / debug exercise

Scenario: You are building a customer support bot for a SaaS product. The PM has identified 8 user jobs. Your task: route each job through the decision tree.

Step 1 — Score the jobs. Take these 8 jobs and fill in the 4-question matrix (template below). For each, write YES/NO for Q1–Q4 and assign a verdict.

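| Job | Q1 | Q2 | Q3 | Q4 | Verdict |
|---|---|---|---|---|---|
| Answer "what's my plan?" | | | | | |
| Explain a pricing tier difference | | | | | |
| Reset a password | | | | | |
| Diagnose why an integration failed | | | | | |
| Generate a cancellation-save offer | | | | | |
| Check if a feature exists | | | | | |
| Summarize a user's usage history | | | | | |
| Escalate to a human agent | | | | | |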

Step 2 — Identify the mis-routed job. One of the 8 jobs is commonly shipped as an LLM task but should be deterministic. Identify it, explain why, and propose the simpler implementation.

Step 3 — Design the monitoring signal. For one model-routed job, define the production signal that would tell you the model has degraded into a glorified lookup (i.e., it should be replaced). Specify the metric, threshold, and action.


Operational memory

Remember:

  • A model earns its place only when input is unbounded AND output requires judgment — if either condition fails, a cheaper tool dominates.
  • The four-question decision tree: bounded input? (NO) → needs generation? (YES) → failure cost tolerable? (YES) → evidence accessible? (YES) → model task.
  • Near-zero output entropy means the model is a $3.50/1K lookup table — replace with rules or cache.
  • Re-run the decision tree quarterly; tasks migrate from deterministic to AI-fit as input variety grows.
  • The cost of a wrong tool compounds silently — monitor deterministic paths for accuracy decay, not just model paths for hallucination.
  • Every routing decision has a pressure trade-off: the model relieves input-variety pressure but creates cost/latency/hallucination pressure.

Bridge

We now know which tasks belong to the model. But "use AI" is still not a launch criterion. Next: defining the success signal — what measurable outcome proves the feature works, separated from what the model metrics say.

03-success-metrics.md