13. Honest Admission: What We Do Not Fully Understand About Building AI Systems
~11 min read. The most dangerous engineer is the one who does not know what they do not know; this file closes that gap.
Built on the ELI5 in 00-eli5.md. The blueprint was drawn, the foundation was poured, the plumbing was run, the inspection was done, the move-in day happened. Now we sit in the house and admit: some walls are load-bearing in ways we cannot see.
Why "best practices" keep shifting
See. Every six months, the best practice changes. In 2022: "fine-tune BERT for every task." In 2023: "prompt GPT-4, fine-tuning is not necessary." In 2024: "fine-tune a small model for efficiency; prompting alone has limits."
This is not the field maturing smoothly. This is the field not yet having stable ground truth. When best practices reverse in 18 months, they were not best practices. They were working hypotheses.
The honest position: treat every "best practice" as provisional. Use it if it works for your measured use case. Do not treat it as permanent truth.
The instinct to reach for the latest tool — RAG when RAG is trending, agents when agents are trending — is cargo cult engineering. The cargo cult builds the airstrip. The planes do not come.
The blueprint must drive choices, not the current trend.
Simple, no? Measure first. Adopt later.
The evaluation problem: no ground truth for open-ended tasks
This is the deepest unsolved problem in AI engineering today.
For classification: you have labels. Accuracy is unambiguous. For retrieval: you have relevance labels (with caveats). Precision@k is defensible. For open-ended generation: what is the ground truth for "a good answer"?
              ┌──────────────┐
              │ User answer  │
              └──────┬───────┘
                     ▼
       ┌─────────────┼─────────────┐
       ▼             ▼             ▼
  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │  Human   │  │   LLM    │  │ Product  │
  │  judge   │  │  judge   │  │  metric  │
  └────┬─────┘  └────┬─────┘  └────┬─────┘
       └─────────────┼─────────────┘
                     ▼
             approximate truth
Question: "Explain the trade-off between precision and recall."
Answer A: "Precision measures correctness of positive predictions.
           Recall measures coverage of actual positives. Higher
           precision means fewer false positives. Higher recall
           means fewer false negatives. You trade one for the other."
Answer B: "These two metrics are in tension. Increasing one usually
           lowers the other. The F1 score balances them."
Both answers are correct. Neither is complete. Neither is the other. Which should your eval prefer? How do you encode that preference?
The LLM-as-judge approach helps but inherits the judge's biases. Human annotation is the gold standard but does not scale. Pairwise comparison (A vs B) is more reliable than absolute scoring but is still subjective.
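One commonly reported judge bias is a preference for whichever answer appears in a particular position. A minimal sketch of a pairwise check that guards against this, assuming a judge you supply yourself: `Judge`, `pairwise_verdict`, and the "first"/"second" return convention are hypothetical names chosen for illustration, not part of any library.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

# Hypothetical judge signature (an assumption for this sketch):
# (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> Verdict:
    """Ask the judge twice with the answer order swapped.

    A preference counts only if it survives the swap. Anything else is
    reported as a tie, which is a cue to route the pair to a human.
    """
    first_pass = judge(question, answer_a, answer_b)   # A shown first
    second_pass = judge(question, answer_b, answer_a)  # B shown first

    if first_pass == "first" and second_pass == "second":
        return "A"
    if first_pass == "second" and second_pass == "first":
        return "B"
    return "tie"


# Toy judges standing in for a real LLM or human call.
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_longer(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(pairwise_verdict(always_first, "q", "long answer", "short"))
print(pairwise_verdict(prefers_longer, "q", "long answer", "short"))
```

The swap does not remove subjectivity. It only stops one specific bias from being counted as a preference; a judge that favours longer answers, as the second toy judge does, still passes the check.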
Look. There is no clean solution today. The best current practice: use multiple evaluation signals in combination. Treat any single eval metric as an approximation, not as truth.
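A minimal sketch of what "multiple signals in combination" could look like, assuming each signal has already been rescaled to the range 0 to 1. The `Signals` fields, the `combine` function, and the 0.3 spread threshold are illustrative assumptions, not values from any standard.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Signals:
    human: float      # human rating, rescaled to 0..1
    llm_judge: float  # LLM judge score, rescaled to 0..1
    product: float    # product metric, e.g. a thumbs-up rate


def combine(signals: Signals, max_spread: float = 0.3) -> tuple[float, bool]:
    """Return (average score, disagreement flag).

    The average is only an approximation of quality. The flag marks
    answers where the signals disagree by more than max_spread, so a
    person can look at them instead of trusting the average.
    """
    values = (signals.human, signals.llm_judge, signals.product)
    spread = max(values) - min(values)
    return mean(values), spread > max_spread


score, needs_review = combine(Signals(human=0.9, llm_judge=0.9, product=0.3))
print(f"score={score:.2f} needs_review={needs_review}")
```

The disagreement flag is the useful part. When the judges agree, the average is a reasonable approximation; when they do not, the average hides exactly the cases worth reading.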
Why AI systems degrade silently over time
We covered monitoring in file 08. But the deeper issue is structural.
AI systems are trained on a snapshot of the world. The world changes. The system does not.
Model trained:   January 2024
Model deployed:  March 2024
World changes:
  - New regulations introduced:   June 2024
  - Market conditions shift:      August 2024
  - User vocabulary evolves:      Ongoing
  - KB articles drift in style:   Ongoing
System accuracy on original test set:  Stable (frozen queries)
System accuracy on live traffic:       Degrading (moving target)
The eval suite catches regressions in the model. It does not catch the widening gap between the model and the world, because your test set becomes less representative over time.
Silent degradation happens when:
1. The distribution of user queries shifts (new topics, new vocabulary).
2. The knowledge base evolves but the embedding model's representation does not.
3. The model provider updates the model silently (this happens; read the changelogs).
4. External APIs or tools called by agents change behaviour.
The inspection must be refreshed continuously. Add new real queries from production to the test set every month. Never rely on a frozen test set as the sole quality signal.
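One cheap early-warning signal for the first source above (shifting query vocabulary) is the share of words in live queries that never appear in the test set. This is a rough sketch under simplifying assumptions: `tokens` and `unseen_token_rate` are hypothetical helpers, tokenisation is a plain lowercase word split, and the right alert threshold is something you would have to calibrate on your own traffic.

```python
import re
from typing import Iterable


def tokens(text: str) -> set[str]:
    """Lowercase word tokens of one query (a deliberately crude tokeniser)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unseen_token_rate(test_queries: Iterable[str], live_queries: Iterable[str]) -> float:
    """Fraction of live-query tokens that never occur in the test set.

    A rising value suggests the test set is drifting away from real traffic.
    """
    known: set[str] = set()
    for query in test_queries:
        known |= tokens(query)

    live = [tok for query in live_queries for tok in tokens(query)]
    if not live:
        return 0.0
    return sum(tok not in known for tok in live) / len(live)


test_set = ["reset my password", "update billing address"]
last_month = ["reset my passkey", "billing address change"]
print(f"unseen token rate: {unseen_token_rate(test_set, last_month):.2f}")
```

A vocabulary check like this only catches new words, not old words used with new meanings, so it complements rather than replaces adding real production queries to the test set.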
The gap between demo and production reliability
This is the gap that costs careers.
A demo is:
- Best-case queries selected to work.
- A single run with no load.
- No concurrency, no edge cases, no adversarial users.
- Fixed context that you chose.
Production is:
- Every query users can imagine, many of which you did not.
- Concurrent users generating unpredictable load patterns.
- Adversarial users who try to break the system.
- Context that changes over time (stale data, schema changes, API changes).
Reliability in demo: 9.5 / 10
Reliability in production: 6.5 / 10 (estimate for typical first deployment)
The gap of 3.0 points is not random. It comes from specific sources:
- Edge cases (20% of production queries are near-edge-case).
- Load-related failures (p95 latency deteriorates under concurrency).
- Adversarial prompts (prompt injection, jailbreak attempts).
- Data drift (KB staleness, topic distribution shift).
Honest engineering acknowledges this gap upfront and budgets time to close it. "Production-ready" means you have closed the gap to an acceptable margin, not eliminated it.
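To make the load-related source concrete, here is a small sketch that compares median and p95 latency for a demo run and a run under concurrent load. The `percentile` helper uses the nearest-rank definition, and both latency lists are made-up numbers for illustration, not measurements from any system.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in seconds, for illustration only.
demo_run = [0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3]
under_load = [0.9, 1.0, 1.1, 1.2, 1.4, 1.6, 2.0, 2.8, 4.5, 7.9]

for name, samples in (("demo", demo_run), ("under load", under_load)):
    print(f"{name}: p50={percentile(samples, 50)}s p95={percentile(samples, 95)}s")
```

In this toy data the medians are close while the p95 values are far apart, which is why an average taken from a demo run says little about what the slowest production users will experience.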
What rigorous AI engineering actually looks like
The field is converging on a set of practices. They are not magic. They reduce but do not eliminate uncertainty.
Practice 1: Eval before you claim.
  No metric → no claim. No held-out test set → no deployment.

Practice 2: Measure everything, trust nothing.
  LLM judge scores are approximations. Human labels are approximations.
  Use multiple signals. Flag disagreement.

Practice 3: Assume degradation.
  Build freshness monitoring into every production system from day one.
  Refresh the test set monthly with real production queries.

Practice 4: Be honest about confidence.
  "Our eval shows 0.78 precision on a set of 150 queries."
  NOT "our system is highly accurate."
  The first statement is defensible. The second is marketing.

Practice 5: Document what you do not know.
  Every system should have a "known limitations" section.
  This is not a weakness. It is evidence of honest engineering.
See. The honest engineer is more trusted, not less. Admitting limitation shows you know where the edges are. An engineer who does not know their system's limits is the dangerous one.
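Practice 4 can be taken one step further by reporting how much the number could move. The sketch below computes a Wilson score interval for a proportion. It assumes that "0.78 precision on 150 queries" means 117 correct out of 150 independent trials, which is a simplification made for this example; `wilson_interval` is a hypothetical helper written here, not a library function.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")

    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - margin, center + margin


# Assumption for illustration: 117 of 150 results judged correct (0.78).
low, high = wilson_interval(117, 150)
print(f"precision 0.78 on 150 queries, 95% interval roughly {low:.2f} to {high:.2f}")
```

With 150 queries the interval runs from roughly 0.71 to 0.84. Saying so is more defensible than quoting 0.78 alone, and it shows why a small test set cannot settle a three-point difference between two systems.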
Where this lives in the wild
- Shreya Shankar's research: examines how ML systems degrade in production without anyone noticing, and argues for continuous evaluation as a corrective practice.
- Anthropic's model cards: document limitations and evaluation results alongside capabilities for released models.
- Hamel Husain's evaluation critique: argues that many LLM evals measure the wrong things; there is no settled consensus on correct evaluation.
- Meta's Llama model documentation: explicitly acknowledges that the model may produce incorrect factual claims and documents the scope of such failures.
- OpenAI's GPT-4 system card: devotes extensive space to limitations and risks; often treated as a reference point for honest capability documentation.
Pause and recall
- Why should you treat "best practices" as provisional rather than permanent?
- What is the core problem with evaluating open-ended AI generation tasks?
- Name four sources of silent degradation in production AI systems.
- What is the difference between "reliability in demo" and "reliability in production"?
Interview Q&A
Q: "What is the biggest unsolved problem in AI engineering today?"
A: Evaluation of open-ended tasks. We do not have ground truth for what a "good answer" means across diverse user needs. LLM-as-judge is a reasonable approximation but it is not unbiased. Human annotation scales poorly. This means every quality metric we use is an approximation with unknown error bounds.
Common wrong answer to avoid: "Hallucination is the biggest problem." Hallucination is a symptom. The inability to reliably detect it at scale is the problem.
Q: "How do you handle the fact that your eval suite becomes stale over time?"
A: I treat the eval suite as a living artefact. Every month I add 10–20 real production queries (anonymised) to the held-out test set. I mark old test cases with a date. If the query distribution shifts, older test cases become less representative and should be retired or reweighted. An eval suite that never changes is measuring a frozen world.
Common wrong answer to avoid: "The eval suite is stable once written." Stable eval, moving world = you are measuring the wrong thing.
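The "living artefact" answer above can be sketched as a small refresh step. This is one possible shape, not a prescribed one: `EvalCase`, `refresh_suite`, and the 365-day retirement window are assumptions for illustration, and the sketch retires old cases outright rather than reweighting them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EvalCase:
    query: str      # anonymised production query
    expected: str   # reference answer or label
    added_on: date  # when the case entered the suite


def refresh_suite(
    suite: list[EvalCase],
    new_cases: list[EvalCase],
    today: date,
    max_age_days: int = 365,
) -> list[EvalCase]:
    """Retire cases older than max_age_days, then append new cases
    whose query is not already covered by a kept case."""
    cutoff = today - timedelta(days=max_age_days)
    kept = [case for case in suite if case.added_on >= cutoff]
    seen = {case.query for case in kept}
    fresh = [case for case in new_cases if case.query not in seen]
    return kept + fresh


suite = [
    EvalCase("old q", "a", date(2023, 1, 1)),
    EvalCase("recent q", "b", date(2024, 5, 1)),
]
this_month = [
    EvalCase("recent q", "b", date(2024, 6, 1)),  # duplicate, skipped
    EvalCase("new q", "c", date(2024, 6, 1)),
]
refreshed = refresh_suite(suite, this_month, today=date(2024, 6, 1))
print([case.query for case in refreshed])
```

Keeping `added_on` on every case is the design choice that matters: it lets you report scores per cohort and see whether the system is doing worse on recent queries than on old ones.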
Q: "A stakeholder asks why your AI system sometimes gives wrong answers. How do you explain it?"
A: Honestly. "This system's retrieval is correct on 78% of queries in our test set. The 22% it misses includes edge cases not in our training set, and queries where the knowledge base does not contain the answer. We are monitoring this in production and refreshing the test set monthly to catch new failure patterns early."
Common wrong answer to avoid: "We are still improving the model." This is vague and does not communicate the nature or frequency of failures. Stakeholders need numbers, not reassurance.
Q: "What does 'production-ready' mean for an AI system?"
A: It means the system meets its defined success criteria on the eval suite, has been through canary deployment with real-traffic quality monitoring, has a monitoring dashboard with alert thresholds, and has a rollback mechanism. It does not mean perfect. It means the failure modes are understood, measured, and within acceptable bounds.
Common wrong answer to avoid: "Production-ready means it works correctly." Nothing works correctly 100% of the time. The question is whether the failures are understood and acceptable.
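The definition of "production-ready" given in the answer above can be written down as an explicit gate. The sketch is a hedged illustration: `ReleaseEvidence`, its field names, and `blocking_gaps` are hypothetical, and a real gate would pull these values from your own eval and deployment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseEvidence:
    eval_score: float        # measured on the held-out eval suite
    eval_target: float       # the success criterion defined up front
    canary_passed: bool      # canary deployment with real-traffic quality monitoring
    alerts_configured: bool  # monitoring dashboard with alert thresholds
    rollback_tested: bool    # rollback mechanism exists and has been exercised


def blocking_gaps(evidence: ReleaseEvidence) -> list[str]:
    """List what still blocks release. An empty list means 'production-ready'
    in the sense used here: failures understood and bounded, not absent."""
    gaps: list[str] = []
    if evidence.eval_score < evidence.eval_target:
        gaps.append("eval score below defined success criterion")
    if not evidence.canary_passed:
        gaps.append("canary deployment not passed")
    if not evidence.alerts_configured:
        gaps.append("monitoring alerts not configured")
    if not evidence.rollback_tested:
        gaps.append("rollback mechanism not tested")
    return gaps


evidence = ReleaseEvidence(
    eval_score=0.78,
    eval_target=0.75,
    canary_passed=True,
    alerts_configured=False,
    rollback_tested=True,
)
print(blocking_gaps(evidence))
```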
Apply now (5 min)
Write a "known limitations" section for your capstone project. Be specific: what query types does your system handle poorly? What would cause quality to degrade over time for your specific system? Write one sentence on how you would detect that degradation in production.
Sketch from memory: Write the five rigorous AI engineering practices from this file. Write one sentence of justification for each. No looking.
Bridge. This module is complete. The next module asks: given these limitations, what principles should guide every engineering decision? → ../20_engineering_leadership_judgment/00-eli5.md