07. AI Technical Debt: the silent bill below the deck
~13 min read. Demos can look healthy while hidden debt keeps growing.
Builds on the ELI5 in 00-eli5.md. The ship's log, our reminder of shared memory, shows why hidden debt hurts the whole voyage.
1) Debt grows quietly because demos lie
Look. A ship can move forward while the hull slowly weakens. People on deck still clap because the motion looks smooth. AI systems behave like that. A demo can work while debt keeps growing below.
The course still appears straight. The compass still seems calm. The crew still feels productive while the weather check still looks green. But the hidden bill is already forming. That is the danger.
See. Prompt patches pile up, data provenance gets fuzzy, and evals stay tiny and stale. Workflow branches grow without ownership, ops shortcuts become normal, and key reasoning stays in one head. Nothing crashes immediately. That delay creates false confidence.
Simple, no? Debt is not only bad code. Debt is anything that makes later change slower, riskier, or more expensive. AI systems collect this debt faster because their behavior is probabilistic, so weak evidence can look convincing for longer. The failure may appear only under scale, turnover, or incident pressure. That is why the bill arrives late.
Here is the picture first.
┌──────────────────────────┐
│ Demo day looks polished  │
├──────────────────────────┤
│ Prompt patches           │
│ Data confusion           │
│ Weak eval coverage       │
│ Workflow sprawl          │
│ Missing rollback paths   │
│ Tribal knowledge         │
└──────────────────────────┘
             ▼
 Scale, turnover, or failure
             ▼
      Debt bill arrives
The top box is what leaders usually see. The lower boxes are what operators eventually feel. Yes? That mismatch is the whole problem.
2) The six debt types worth naming
So what to do? Name the debt clearly. Once named, it becomes discussable and manageable. That is why classification helps.
Prompt debt
Prompt debt appears when prompts grow by patching, not design. A huge prompt may still work today, then one tiny edit causes a strange regression. Nobody knows which sentence mattered. The course starts drifting because intent is buried.
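One way to keep intent from getting buried is to build the prompt from named sections and fingerprint each one, so a regression can be tied to the sentence that changed. This is a minimal sketch, not a prescribed method. `PromptSection`, its `reason` field, and the sample prompt text are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSection:
    name: str    # short label, e.g. "tone"
    text: str    # the instruction itself
    reason: str  # why this section exists (hypothetical field)


def build_prompt(sections: list[PromptSection]) -> str:
    """Join sections in order into the final prompt string."""
    return "\n\n".join(s.text for s in sections)


def fingerprints(sections: list[PromptSection]) -> dict[str, str]:
    """Map each section name to a short hash of its text."""
    return {
        s.name: hashlib.sha256(s.text.encode("utf-8")).hexdigest()[:12]
        for s in sections
    }


def changed_sections(old: list[PromptSection], new: list[PromptSection]) -> list[str]:
    """Names of sections that were added, removed, or edited."""
    before, after = fingerprints(old), fingerprints(new)
    names = sorted(set(before) | set(after))
    return [n for n in names if before.get(n) != after.get(n)]


# Hypothetical prompt versions for illustration only.
v1 = [
    PromptSection("role", "You are a claims assistant.", "sets the domain"),
    PromptSection("tone", "Answer briefly.", "support asked for short replies"),
]
v2 = [
    PromptSection("role", "You are a claims assistant.", "sets the domain"),
    PromptSection("tone", "Answer briefly and cite the policy.", "audit request"),
]
print(changed_sections(v1, v2))  # ['tone']
```

The `reason` field is the cheap part that pays off later: when one tiny edit causes a strange regression, the team can see which section moved and why it was there.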
Data debt
Data debt appears when labels, sources, or freshness are unclear. Maybe retrieval mixes outdated and new documents, or edge cases never got labeled. During calm weeks, nobody notices. During failures, everyone notices.
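A small freshness split makes this kind of debt visible before a failure does. The sketch below assumes each retrieved document carries an `updated_at` date, and treats a missing date as unknown provenance. The `Doc` type, the file names, and the 180-day default are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Doc:
    source: str
    updated_at: Optional[date]  # None means provenance is unknown


def split_by_freshness(docs: list[Doc], today: date, max_age_days: int = 180):
    """Sort documents into fresh, stale, and unknown-provenance buckets."""
    fresh, stale, unknown = [], [], []
    for d in docs:
        if d.updated_at is None:
            unknown.append(d)
        elif (today - d.updated_at).days > max_age_days:
            stale.append(d)
        else:
            fresh.append(d)
    return fresh, stale, unknown


# Hypothetical documents for illustration only.
docs = [
    Doc("policy-2024.pdf", date(2024, 5, 1)),
    Doc("policy-2021.pdf", date(2021, 3, 15)),
    Doc("wiki-export.txt", None),
]
fresh, stale, unknown = split_by_freshness(docs, today=date(2024, 6, 1))
print([d.source for d in fresh])    # ['policy-2024.pdf']
print([d.source for d in stale])    # ['policy-2021.pdf']
print([d.source for d in unknown])  # ['wiki-export.txt']
```

Keeping the unknown bucket separate from the stale one matters: a document with no date is a provenance gap, not just an old document.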
Eval debt
Eval debt appears when the benchmark is small, stale, or biased. Teams keep shipping because scores still look nice, even after the task changed. Then confidence becomes fake. The weather check fails because the measuring tool is weak.
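Two of those three symptoms, small and stale, can be checked mechanically, along with whether the task changed after the last refresh. Bias needs human review and is not covered here. This is a rough sketch: the `EvalSet` fields and the thresholds (50 cases, 90 days) are assumptions, not recommended values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EvalSet:
    name: str
    n_cases: int
    last_refreshed: date
    task_last_changed: date  # when the product task or scope last changed


def eval_debt_flags(e: EvalSet, today: date,
                    min_cases: int = 50, max_age_days: int = 90) -> list[str]:
    """Return human-readable warnings for an eval set."""
    flags = []
    if e.n_cases < min_cases:
        flags.append("small")
    if (today - e.last_refreshed).days > max_age_days:
        flags.append("stale")
    if e.task_last_changed > e.last_refreshed:
        flags.append("task changed since refresh")
    return flags


# Hypothetical eval set for illustration only.
claims_eval = EvalSet("claims-qa", 20, date(2024, 1, 10), date(2024, 3, 1))
print(eval_debt_flags(claims_eval, today=date(2024, 6, 1)))
# ['small', 'stale', 'task changed since refresh']
```

A nice score on an eval set that raises any of these flags is weak evidence, which is the point of the paragraph above.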
Workflow debt
Workflow debt appears when agent steps keep multiplying without rules. Retry loops, fallback branches, and wider permissions appear quietly. No one owns the stop conditions. Costs rise and failures become harder to trace.
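Stop conditions are easier to own when they are written down in one place. The sketch below wraps a step function with a step limit and a cost budget, and always reports why the loop stopped. `run_with_limits`, `RunResult`, and the step signature are hypothetical names, and the cost unit is whatever the team already tracks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    done: bool
    steps: int
    cost: float
    stop_reason: str


def run_with_limits(step: Callable[[int], tuple[bool, float]],
                    max_steps: int = 5, max_cost: float = 1.0) -> RunResult:
    """Call step(i) until it finishes, the budget runs out, or steps run out.

    step returns (done, cost_of_this_step).
    """
    cost = 0.0
    for i in range(1, max_steps + 1):
        done, step_cost = step(i)
        cost += step_cost
        if done:
            return RunResult(True, i, cost, "finished")
        if cost >= max_cost:
            return RunResult(False, i, cost, "cost budget exhausted")
    return RunResult(False, max_steps, cost, "step limit reached")


def flaky_step(i: int) -> tuple[bool, float]:
    # Hypothetical step: never finishes and costs 0.5 per attempt.
    return False, 0.5


result = run_with_limits(flaky_step, max_steps=10, max_cost=1.0)
print(result.stop_reason, result.steps)  # cost budget exhausted 2
```

Recording `stop_reason` on every run is what makes failures traceable later: a retry loop that quietly hits its ceiling shows up in the data instead of in the invoice.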
Ops debt
Ops debt appears when rollback, alerting, and kill switches stay half-done. The happy path works, but the bad path becomes chaos. During incidents, even strong engineers waste time finding basics.
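The bad path can be small if it is built in from the start. The sketch below keeps a version history for rollback and a kill switch that returns a safe fallback. It is an in-memory illustration only: `Release`, `answer`, the version labels, and the fallback message are hypothetical, and a real system would persist this state and alert on changes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Release:
    active: str
    history: list[str] = field(default_factory=list)
    kill_switch: bool = False

    def deploy(self, version: str) -> None:
        """Switch to a new version and remember the old one."""
        self.history.append(self.active)
        self.active = version

    def rollback(self) -> str:
        """Return to the most recent previous version."""
        if not self.history:
            raise RuntimeError("no previous version to roll back to")
        self.active = self.history.pop()
        return self.active


FALLBACK = "The assistant is paused. A human will follow up."


def answer(release: Release, question: str,
           models: dict[str, Callable[[str], str]]) -> str:
    """Route to the active version unless the kill switch is on."""
    if release.kill_switch:
        return FALLBACK
    return models[release.active](question)


# Hypothetical stand-ins for two deployed versions.
models = {"v1": lambda q: f"[v1] {q}", "v2": lambda q: f"[v2] {q}"}
release = Release(active="v1")
release.deploy("v2")
print(answer(release, "status?", models))  # [v2] status?
release.rollback()
print(answer(release, "status?", models))  # [v1] status?
release.kill_switch = True
print(answer(release, "status?", models))  # The assistant is paused. A human will follow up.
```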
Knowledge debt
Knowledge debt appears when only one person knows why something exists. The crew depends on memory instead of written reasoning. Without the ship's log, turnover becomes a technical problem too. That slows every fix.
3) The bill arrives on Friday
See the common scene. Friday evening, a regression hits production. Answers are slower, costlier, and slightly less faithful. Nobody knows whether the cause is prompt, retrieval, model, or routing. That is the debt bill arriving.
Now the important principle. Ordered investigation is itself an engineering discipline. Panic, guessing, and changing five things at once are not processes. The compass matters here because order protects clarity.
Look at one sane investigation path.
┌──────────────┐
│ 1. Stabilize │  freeze risky changes, protect users
└──────┬───────┘
       ▼
┌──────────────┐
│ 2. Compare   │  what changed in prompt, model, data, routing?
└──────┬───────┘
       ▼
┌──────────────┐
│ 3. Trace     │  latency, sources, failures, cost spikes
└──────┬───────┘
       ▼
┌──────────────┐
│ 4. Re-run    │  unit tests, evals, and rollback checks
└──────┬───────┘
       ▼
┌──────────────┐
│ 5. Record    │  write the fix and debt note
└──────────────┘
First, stabilize the surface. Protect users before chasing elegance. Then compare versions and recent changes. Then inspect traces and live signals. Then re-run unit tests and evals in the changed area. Only then choose rollback or repair, and record the fix and the debt note. Simple, no?
Why this order? Because technical debt hides causality. If you investigate randomly, you create new confusion. If you investigate in order, you recover cause and effect. That helps the weather check stay honest even under stress. The ship's log should capture the learning, not only the fix.
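The compare step is the easiest one to mechanize. The sketch below assumes each release is described by a small manifest covering prompt, model, data, and routing, and lists which entries differ between the last good release and the current one. The manifest keys and values are hypothetical.

```python
def diff_release(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every manifest entry that changed."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical manifests for illustration only.
last_good = {"prompt": "p-14", "model": "model-a",
             "index": "2024-05-01", "routing": "default"}
current = {"prompt": "p-15", "model": "model-a",
           "index": "2024-05-20", "routing": "default"}

changes = diff_release(last_good, current)
print(changes)
# {'index': ('2024-05-01', '2024-05-20'), 'prompt': ('p-14', 'p-15')}

if len(changes) > 1:
    print("More than one thing changed: isolate each before blaming one.")
```

When the diff shows two or more changes, cause and effect are already tangled, which is a useful thing to write into the ship's log alongside the fix.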
4) Pay debt before it becomes identity
Debt becomes culture when nobody names it. Soon people say, "This system is just complicated." Often that means the system is under-documented and over-patched. Complexity is sometimes real. Sometimes it is unpaid debt wearing a clever costume.
So what to do? Keep debt visible in sprint work. Track prompt cleanup, eval refresh, data hygiene, rollback work, and ownership gaps. If you never schedule cleanup, you are choosing interest payments.
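A debt register does not need special tooling. The sketch below models one entry per shortcut, tagged with one of the six debt types, and surfaces entries whose revisit date has passed so they can be pulled into sprint planning. `DebtItem`, its fields, and the sample entries are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class DebtType(Enum):
    PROMPT = "prompt"
    DATA = "data"
    EVAL = "eval"
    WORKFLOW = "workflow"
    OPS = "ops"
    KNOWLEDGE = "knowledge"


@dataclass
class DebtItem:
    kind: DebtType
    summary: str
    why_taken: str    # the reason for the shortcut
    owner: str
    revisit_by: date  # when the shortcut should be reconsidered
    paid: bool = False


def due_for_review(items: list[DebtItem], today: date) -> list[DebtItem]:
    """Unpaid items whose revisit date has arrived, oldest first."""
    due = [i for i in items if not i.paid and i.revisit_by <= today]
    return sorted(due, key=lambda i: i.revisit_by)


# Hypothetical register entries for illustration only.
register = [
    DebtItem(DebtType.EVAL, "eval set predates new claim types",
             "launch deadline", "ana", date(2024, 5, 15)),
    DebtItem(DebtType.OPS, "no tested rollback for router config",
             "single-region pilot", "raj", date(2024, 4, 1)),
    DebtItem(DebtType.PROMPT, "tone patch appended to system prompt",
             "hotfix", "li", date(2024, 9, 1)),
]
for item in due_for_review(register, today=date(2024, 6, 1)):
    print(item.kind.value, "-", item.summary)
# ops - no tested rollback for router config
# eval - eval set predates new claim types
```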
No team should need heroics every Friday. The system should stay understandable after one person leaves. Risk review should surface danger before launch, not after harm. Written notes should explain why a shortcut was taken and when to revisit it. Yes? That is what healthy engineering looks like.
A good team does not promise zero debt. That is fantasy. A good team keeps debt legible, payable, and revisitable. The compass should make tradeoffs explicit. Then shortcuts stay temporary instead of becoming architecture. That difference matters a lot.
5) A quick debt smell test
Ask five short questions.
- Can a new engineer explain the prompt stack in ten minutes?
- Can the team rerun meaningful evals after a change?
- Can someone trace the source of a bad answer quickly?
- Can production be rolled back without drama?
- Can the next person understand why the current design exists?
If several answers are no, debt is already present. Look. The absence of pain today does not prove health. It may only prove that stress has not arrived yet. That is why demos are weak evidence on their own. See. The bill prefers to arrive when the team is tired.
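The smell test can be kept as a tiny checklist that a team fills in at each review. In the sketch below, "several" is read as two or more "no" answers. That threshold is an assumption, and so are the function and constant names.

```python
SMELL_QUESTIONS = [
    "A new engineer can explain the prompt stack in ten minutes",
    "The team can rerun meaningful evals after a change",
    "Someone can trace the source of a bad answer quickly",
    "Production can be rolled back without drama",
    "The next person can understand why the current design exists",
]


def smell_test(answers: list[bool], threshold: int = 2):
    """Return (failed_questions, debt_likely) for five yes/no answers."""
    if len(answers) != len(SMELL_QUESTIONS):
        raise ValueError("expected one answer per question")
    failed = [q for q, ok in zip(SMELL_QUESTIONS, answers) if not ok]
    return failed, len(failed) >= threshold


# Hypothetical answers for illustration only.
failed, debt_likely = smell_test([True, False, True, False, True])
print(debt_likely)  # True
for q in failed:
    print("NO:", q)
```

The list of failed questions is more useful than the boolean: each "no" maps to a debt type that can go straight into the backlog.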
Where this lives in the wild
- Insurance claims copilot — AI lead fights prompt debt, stale evals, and fragile escalation workflows.
- Enterprise search assistant — platform engineer manages source freshness debt and retrieval-debugging gaps.
- Internal coding agent — staff engineer pays down tool-permission debt and rollback weakness.
- Voice support bot — product engineer addresses latency-workflow debt and missing incident notes.
- Legal drafting assistant — applied scientist reduces knowledge debt around citation rules and edge cases.
Pause and recall
- Why can a polished demo still hide serious AI debt?
- Which debt type grows when evals stay tiny and stale?
- Why is ordered investigation a principle, not just a tactic?
- How does the ship's log reduce knowledge debt during turnover?
Interview Q&A
Q1. What makes AI technical debt different from ordinary software debt?
A. More behavior is probabilistic, so weak evidence can look convincing longer. Debt often hides until scale, turnover, or failure applies pressure.
Common wrong answer to avoid: "AI debt is just messy prompt text."

Q2. What is eval debt?
A. It is the gap between current task reality and what your benchmark still measures. Scores can stay high even while true quality drifts downward.
Common wrong answer to avoid: "Eval debt only matters after model fine-tuning."

Q3. Why is Friday regression handling part of engineering principles?
A. Because ordered investigation protects causality, reduces panic, and speeds recovery. The investigation method is part of system quality.
Common wrong answer to avoid: "Just roll back immediately and inspect later."

Q4. How would you reduce knowledge debt in an AI team?
A. Write decision notes, update runbooks, and connect changes to eval evidence. Make reasoning visible beyond one expert.
Common wrong answer to avoid: "Hire stronger people so they can infer the missing context."
Apply now (5 min)
Exercise: Choose one AI system you know. Write one example each of prompt debt, data debt, eval debt, workflow debt, ops debt, and knowledge debt. Then mark which debt would hurt most during Friday regression triage.
Sketch from memory: Draw the demo box on top and the hidden debt layers below. Add arrows showing how scale, turnover, and failure make the bill arrive. Write one note for what should enter the incident notes after the fix.
Bridge. Debt grows fastest when changes are hard to examine clearly. Next, see why strong review culture is a defense against silent decay. → 08-code-review-ai.md