08. Honest admission¶

⏱️ Estimated time: 19 min | Level: advanced

ELI5 callback: In our chain, the kitchen trains, the prep station prepares, the recipe book stores, the serving counter serves, and the quality inspector checks. Same restaurant chain, different platform layer. See.

LLM evaluation is still fuzzy in too many important cases¶

Now let us say the uncomfortable thing directly. LLM evaluation is still messy for open-ended tasks. Reference answers help sometimes and mislead often, because a good open-ended answer can look nothing like the reference. See. The quality inspector does not have one universal ruler here. The kitchen can train or fine-tune endlessly without solving that gap. The prep station may improve context quality yet leave judgment unresolved. The recipe book can store many versions without proving one is truly best. The serving counter can stream polished text that still misses the goal. So what to do? Be explicit about evaluation limits. Use task-specific rubrics, human review, and adversarial test sets together. Track failure categories, not only overall scores. Simple, no? Confidence should match measurement quality. Do not oversell a metric that barely captures user value.
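
The advice to track failure categories, not only overall scores, can be sketched in a few lines. This is a minimal illustration, not a real evaluation framework: the `EvalRecord` fields, the `summarize` helper, the 0.7 pass threshold, and the `missed_goal` category are all hypothetical names and values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    prompt_id: str
    rubric_score: float              # 0.0 to 1.0, from a task-specific rubric
    failure_category: Optional[str]  # None when no failure was tagged


def summarize(records, pass_threshold=0.7):
    """Report the overall score next to a per-category failure breakdown."""
    if not records:
        raise ValueError("need at least one evaluation record")
    n = len(records)
    mean_score = sum(r.rubric_score for r in records) / n
    pass_rate = sum(r.rubric_score >= pass_threshold for r in records) / n
    failures = Counter(
        r.failure_category for r in records if r.failure_category is not None
    )
    return {
        "mean_score": mean_score,
        "pass_rate": pass_rate,
        "failures_by_category": dict(failures),
    }


# Hypothetical records: two clean passes and two outputs that missed the goal.
records = [
    EvalRecord("p1", 1.0, None),
    EvalRecord("p2", 0.5, "missed_goal"),
    EvalRecord("p3", 0.5, "missed_goal"),
    EvalRecord("p4", 1.0, None),
]
report = summarize(records)
print(report)
```

A mean score of 0.75 looks acceptable on its own. The category breakdown is what shows that every failure here is the same kind of failure.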

prompt → model output
   │         │
   ├── rubric score
   ├── human review
   ├── edge-case tests
   └── still some ambiguity
             ↓

Open-ended quality often needs layered evaluation.
Human review remains expensive and necessary.
Rubrics reduce ambiguity but do not erase it.
Unknowns should be documented, not hidden.
Honest evaluation limits create better release decisions.
Now watch. Reproducibility has its own truth problem.

Training reproducibility is better than before and still imperfect¶

People say "reproducible training" as if it were a switch. Reality is rougher. Random seeds help, but nondeterministic hardware kernels, data order, and library versions still shift outcomes. Distributed training adds more nondeterminism through timing and communication; floating-point sums, for example, can change slightly when the order of reduction changes. See. Exact reruns can be harder than teams admit. You may reproduce within a tolerance, not bit for bit. That is often enough operationally, but say it clearly. So what to do? Capture code, data, environment, and configuration rigorously. Then define what level of reproducibility you actually require. For some tasks, metric parity is enough. For regulated paths, tighter reproducibility expectations may matter. Also test recovery from interrupted runs. Now watch. The platform promise should match physical reality.
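
Reproducing "within a tolerance" can be made concrete with a small check. The sketch below assumes a run context can be written as a plain dictionary and that the team has agreed on a per-metric tolerance. The helper names (`fingerprint`, `within_tolerance`), the context keys, and every number are hypothetical.

```python
import hashlib
import json


def fingerprint(run_context: dict) -> str:
    """Hash the captured context so two runs can be compared at a glance."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(run_context, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def within_tolerance(baseline: dict, rerun: dict, tolerances: dict) -> dict:
    """Return, per metric, whether the rerun landed inside the agreed band."""
    return {
        name: abs(baseline[name] - rerun[name]) <= tol
        for name, tol in tolerances.items()
    }


# Hypothetical captured context: same content, different key order.
context_a = {"code_commit": "abc123", "data_snapshot": "2024-01-01", "seed": 7}
context_b = {"seed": 7, "data_snapshot": "2024-01-01", "code_commit": "abc123"}
same_context = fingerprint(context_a) == fingerprint(context_b)

# Hypothetical metrics and tolerances for a baseline run and a rerun.
check = within_tolerance(
    baseline={"auc": 0.912, "loss": 0.301},
    rerun={"auc": 0.915, "loss": 0.330},
    tolerances={"auc": 0.005, "loss": 0.01},
)
print(same_context, check)
```

The tolerance values are the contract. Writing them down is what turns "close result?" into a question with a yes-or-no answer.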

same code + same data
        │
        ├── same env?
        ├── same kernels?
        ├── same ordering?
        └── close result?
               ↓

Reproducibility is usually a band, not a point.
Capture enough context to explain meaningful variance.
Match reproducibility guarantees to domain risk.
Interrupted-run recovery deserves explicit testing.
Honest contracts beat magical claims.
Simple, no? Say what “reproduce” really means.

Continuous retraining has real cost and uncertain payoff¶

Retraining sounds modern. It is not always wise. Every retrain consumes compute, data engineering time, and review time. Sometimes the model barely improves. Sometimes it gets worse because labels are noisy or delayed. See. More frequent change can reduce trust if governance is weak. LLM fine-tuning can be especially expensive relative to actual gain. Labeling and evaluation pipelines may become the dominant cost. So what to do? Compare expected lift against full retraining cost. Use trigger policies tied to monitored evidence, not calendar vanity. Also ask whether thresholds, prompts, or features can solve the issue cheaper. Not every problem needs a fresh model. That sentence saves serious money. Now watch. Platform maturity means choosing not to retrain sometimes.
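
Comparing expected lift against full retraining cost can be written as an explicit decision rule. This is a sketch under simple assumptions: lift and costs are expressed in the same currency unit, and the 1.5 safety margin is an arbitrary placeholder. `RetrainEstimate` and `decide` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class RetrainEstimate:
    expected_lift_value: float  # estimated value of the improvement
    compute_cost: float
    labeling_cost: float
    review_cost: float

    @property
    def total_cost(self) -> float:
        return self.compute_cost + self.labeling_cost + self.review_cost


def decide(drift_detected: bool, cheaper_fix_available: bool,
           est: RetrainEstimate, required_margin: float = 1.5) -> str:
    """Retrain only on monitored evidence, and only after cheaper fixes lose."""
    if not drift_detected:
        return "no action"
    if cheaper_fix_available:
        # Data, threshold, or prompt fixes compete before a fresh model does.
        return "try cheaper fix first"
    if est.expected_lift_value >= required_margin * est.total_cost:
        return "retrain candidate"
    return "hold and keep monitoring"


# Hypothetical numbers: total cost is 6000, so the lift must reach 9000.
strong = RetrainEstimate(10_000, 2_000, 3_000, 1_000)
weak = RetrainEstimate(8_000, 2_000, 3_000, 1_000)
print(decide(True, False, strong))
print(decide(True, False, weak))
```

The margin exists because lift estimates are uncertain. Including labeling and review cost in `total_cost` keeps the comparison honest.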

drift signal
    │
    ├── data fix?
    ├── threshold fix?
    ├── prompt fix?
    └── retrain candidate?
             ↓

Retraining should compete against simpler remedies.
Include review and labeling cost in the decision.
Measure actual lift after retraining; do not rely only on the estimate made before it.
Fast loops still need economic discipline.
Stable systems often retrain less than people boast.
See. Fresh is not automatically better.

Human feedback loops are powerful and messy¶

Many AI systems depend on human review, preference signals, or moderation queues. Those loops are valuable and biased at the same time. Review quality can vary by expert, shift, or incentive. Feedback may lag reality or reflect policy changes. See. Human labels are not pure ground truth falling from the sky. Yet they remain essential for many domains. So what to do? Measure reviewer agreement and disagreement patterns. Version your rubrics, policies, and annotation guidelines. Keep feedback provenance visible in the platform. Also protect reviewers from overload, because tired humans create noisy labels. This is operations, ethics, and system design together. Now watch. Responsible platforms budget for human judgment explicitly. That budget belongs in architecture conversations, not only staffing plans.
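
One common way to measure reviewer agreement is Cohen's kappa, which corrects raw agreement between two reviewers for the agreement expected by chance. The text does not prescribe a metric, so treat this as one reasonable choice. The labels below are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both reviewers match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each reviewer labeled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both reviewers used one identical label everywhere. The formula is
        # undefined here; this sketch returns 1.0 as a convention (assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two reviewers on the same four outputs.
reviewer_1 = ["ok", "ok", "bad", "bad"]
reviewer_2 = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(kappa)
```

Raw agreement here is 75 percent, but kappa is lower because half of that agreement would be expected by chance. Tracking the value per rubric version helps separate a guideline change from a real shift in reviewer behavior.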

model output
    │
    ├── user feedback
    ├── reviewer label
    ├── rubric version
    └── training signal?
            ↓

Human feedback is signal with variance, not perfect truth.
Track reviewer agreement and rubric changes over time.
Protect the human pipeline from overload and ambiguity.
Provenance matters when labels drive retraining.
Good systems respect both humans and math.
Simple, no? Review queues are part of the platform.

Senior engineers earn trust by naming limits early¶

The final lesson is not pessimism. It is clarity. Good platform design includes honest boundaries in the proposal itself. Say which metrics are weak. Say which costs are uncertain. Say which automation still needs human approval. See. Credibility rises when the story includes constraints. Interviewers and stakeholders both trust that tone more. So what to do? End designs with explicit unknowns and next tests. List what you can guarantee today and what needs further validation. That makes iteration plans more intelligent. It also prevents false certainty from entering production plans. Now watch. Honest admission is not weakness; it is operating maturity. The strongest platforms are careful about what they do not promise. That discipline protects teams, users, and budgets together.

design proposal
    │
    ├── guarantees
    ├── assumptions
    ├── unknowns
    └── next validation
             ↓

Name limits before production names them for you.
Unknowns should lead to tests, not embarrassment.
Clear boundaries improve prioritization and trust.
Mature teams prefer truth over glossy certainty.
Honest admission is part of strong system design.
See. Clarity is a senior engineering skill.

Where this lives in the wild¶

An LLM product team keeps human rubric review because automated judges miss subtle task quality failures.
A platform group documents reproducibility as metric tolerance because exact reruns are unrealistic on some stacks.
A fraud team retrains only when monitored lift justifies labeling and review cost.
A moderation workflow tracks reviewer agreement because policy shifts can look like model drift.
A mature platform proposal ends with explicit unknowns and validation plans instead of fake certainty.

Pause and recall¶

Why is LLM evaluation still a hard open problem for many tasks?
What does honest reproducibility mean in distributed training?
Why can continuous retraining be wasteful or risky?
How does naming limits early actually strengthen design credibility?

Interview Q&A¶

Q: What is still unsolved in AI platform design?
A: Open-ended LLM evaluation, perfect reproducibility, and economically justified continuous retraining all remain partially unresolved.
Common wrong answer to avoid: The main problems are solved now; it is mostly implementation detail.

Q: Why is exact training reproducibility hard?
A: Distributed execution, hardware kernels, data ordering, and environment drift create variance even with strong tracking practices.
Common wrong answer to avoid: Because engineers forget to set the random seed.

Q: Should platforms retrain continuously by default?
A: No. Retraining should compete against simpler fixes and must justify compute, labeling, review, and risk costs.
Common wrong answer to avoid: Yes. More frequent retraining is always modern and better.

Q: Why should you mention unknowns in a system design answer?
A: Because clear unknowns improve trust, focus next experiments, and prevent false guarantees from shaping bad decisions.
Common wrong answer to avoid: Because it lowers expectations so nobody can judge the design.

Apply now (5 min)¶

Take one AI system you admire and list three things it probably still does not know well. For each unknown, write one measurement or experiment that would reduce uncertainty. Then decide which unknown is most dangerous if ignored. That is how honest platform thinking begins.

Bridge. AI platform built. Time to practice — live interview case studies. → ../13_interview_case_studies/00-eli5.md