05. Three Quality Layers: one ship, three instruments

~12 min read. Reliable AI needs different checks at different heights.

Builds on the ELI5 in 00-eli5.md. Reminder: the compass is the decision framework, and it helps pick the right quality layer.

1) See the stack before the metrics

Look. A captain does not steer with only one instrument. The compass points direction. The weather check judges danger. The ship's log preserves memory. The crew uses all three while following the course.

AI quality works the same way. Unit tests guard deterministic wiring. Evals guard task behavior. Monitoring guards live production health. Each layer answers a different question. That is the whole picture.

See. A parser bug is not an eval problem. A faithfulness drop is not a unit-test problem. A sudden latency spike is not solved by offline evals. When teams mix layers, they waste time. When they separate layers, they move faster.

So what to do? Keep the stack separate, but connected. Use the course to define quality goals. Use the compass to choose the right check. Use the crew to read results together. Use the weather check before risky launches. Use the ship's log to preserve what changed.

Picture first, then arguments.

┌──────────────────────────────────────────────┐
│ Layer 3: Monitoring                          │
│ Live traffic, latency, cost, errors, drift   │
├──────────────────────────────────────────────┤
│ Layer 2: Evals                               │
│ Accuracy, faithfulness, safety, helpfulness  │
├──────────────────────────────────────────────┤
│ Layer 1: Unit tests                          │
│ Prompt builder, filters, parser, router      │
└──────────────────────────────────────────────┘
                 ▲
                 │
         Confidence rises by layer

The bottom layer sits closest to code. The top layer sits closest to reality. Simple, no? You need all three.
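The mapping from failure shape to layer can be sketched as a small lookup. This is only an illustration: the symptom labels and the `pick_layer` helper are hypothetical names, not part of any framework, and a real team would choose its own taxonomy.

```python
from enum import Enum


class Layer(Enum):
    UNIT_TESTS = "unit tests"
    EVALS = "evals"
    MONITORING = "monitoring"


# Hypothetical symptom labels, taken from the examples in this section.
SYMPTOM_TO_LAYER = {
    "parser_bug": Layer.UNIT_TESTS,
    "faithfulness_drop": Layer.EVALS,
    "latency_spike": Layer.MONITORING,
}


def pick_layer(symptom: str) -> Layer:
    """Return the layer that owns a symptom, or raise if nobody has classified it."""
    try:
        return SYMPTOM_TO_LAYER[symptom]
    except KeyError:
        raise ValueError(f"unclassified symptom: {symptom!r}") from None


print(pick_layer("parser_bug").value)
```

Raising on an unknown symptom is a deliberate choice in this sketch: an unclassified failure should start a conversation, not silently land in the wrong layer.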

2) Unit tests protect deterministic wiring

Unit tests belong where behavior should stay exact. If input is fixed, output should stay predictable. That means prompt builders, retrieval filters, parsers, routers, and fallbacks. This is the lower deck of the ship.

Look. If your prompt template forgets a required variable, unit test it. If your retrieval filter leaks another tenant, unit test it. If your parser accepts broken JSON, unit test it. If your router sends billing queries to search, unit test it. If your timeout fallback skips a safe branch, unit test it.
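Two of these checks, the missing prompt variable and the broken JSON, might look like the sketch below. It assumes a `$`-style template and a model output that must be a JSON object with an `answer` key; `build_prompt`, `parse_answer`, and `raises_value_error` are hypothetical helpers written for this example, not an existing API.

```python
import json
import string


def build_prompt(template: str, variables: dict) -> str:
    """Fill a $-style template; fail loudly if a variable is missing."""
    try:
        return string.Template(template).substitute(variables)
    except KeyError as exc:
        raise ValueError(f"missing prompt variable: {exc.args[0]}") from None


def parse_answer(raw: str) -> dict:
    """Parse model output that must be a JSON object with an 'answer' key."""
    data = json.loads(raw)  # json.JSONDecodeError (a ValueError) on broken JSON
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("expected a JSON object with an 'answer' key")
    return data


def raises_value_error(func, *args) -> bool:
    """Tiny helper so the checks below run without a test framework."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


TEMPLATE = "Policy: $policy\nQuestion: $question"

# Exact assertions: fixed input, fixed output.
assert build_prompt(TEMPLATE, {"policy": "P1", "question": "Q1"}) == "Policy: P1\nQuestion: Q1"
assert raises_value_error(build_prompt, TEMPLATE, {"policy": "P1"})  # variable missing
assert parse_answer('{"answer": "ok"}') == {"answer": "ok"}
assert raises_value_error(parse_answer, '{"answer": "ok"')  # truncated JSON
assert raises_value_error(parse_answer, '["not", "an", "object"]')  # wrong shape
```

In a real repository the same assertions would usually live in the team's test runner so they execute on every pull request.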

These checks are fast and cheap. They run on every pull request and make refactors less frightening.

But unit tests do not grade user value. A perfect parser can still parse a useless answer. A correct router can still choose a weak prompt path. A clean prompt builder can still produce poor helpfulness. That gap is important. Yes?

So what to do? Use exact assertions for deterministic surfaces only. Keep tests small, direct, and boring. Boring is good here. The compass should stop you from exact-matching creative output.
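The tenant filter, router, and timeout fallback from this section are also deterministic surfaces, so exact assertions fit them too. A minimal sketch follows, assuming a keyword router and an in-memory document list; every name here (`Doc`, `filter_by_tenant`, `route`, `answer_with_fallback`) is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    tenant_id: str
    text: str


def filter_by_tenant(docs, tenant_id):
    """Keep only documents that belong to the requesting tenant."""
    return [doc for doc in docs if doc.tenant_id == tenant_id]


BILLING_WORDS = frozenset({"invoice", "refund", "billing", "charge"})


def route(query: str) -> str:
    """Toy keyword router: billing words go to 'billing', everything else to 'search'."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "billing" if words & BILLING_WORDS else "search"


def answer_with_fallback(primary, fallback, query):
    """Call the primary path; on TimeoutError take the safe fallback branch."""
    try:
        return primary(query)
    except TimeoutError:
        return fallback(query)


def slow_primary(query):
    raise TimeoutError("simulated upstream timeout")


docs = [Doc("acme", "acme refund policy"), Doc("globex", "globex pricing")]

# No document from another tenant may leak through the filter.
assert filter_by_tenant(docs, "acme") == [Doc("acme", "acme refund policy")]
# Billing queries must not be sent to search.
assert route("Where is my refund?") == "billing"
assert route("Find the onboarding guide") == "search"
# The timeout must land on the safe branch, not skip it.
assert answer_with_fallback(slow_primary, lambda q: "safe default", "hi") == "safe default"
```

Note what is absent: nothing here asserts on the wording of a generated answer. That belongs to the eval layer.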

3) Evals protect task behavior

Evals ask a higher question. Did the system do the task well enough? This is where accuracy, faithfulness, safety, and helpfulness live. These are behavior questions, not glue questions. That is why evals sit above unit tests.

See. A retrieval filter may pass every unit test. Still, the answer may ignore retrieved evidence. A parser may be flawless. Still, the answer may sound confident and wrong. A router may pick the intended tool. Still, the result may be unhelpful. That gap is exactly the eval layer.

Good evals use representative tasks. They compare outputs with rubrics, references, or judge criteria. They show whether behavior matches the intended course. If eval confidence is weak, launch confidence should stay weak.

Examples help. Ask a support copilot to summarize a refund policy. Check whether the answer is accurate and cites the source. Ask a medical triage assistant to refuse beyond scope. Check whether the refusal is correct and safe. Ask a coding assistant for SQL. Check correctness, helpfulness, and policy compliance. Simple, no?

Evals can be offline or online. They can be automatic or human-reviewed, but they must map to task risk. The crew should read evals before celebrating a demo.
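A tiny offline eval loop for the refund and triage examples could look like this. It is a sketch under loud assumptions: the rubric checks are crude string tests standing in for a real rubric, reference comparison, or judge; `stub_system` is a canned stand-in for the actual model call; and the 0.9 threshold is an arbitrary illustrative number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Stand-in rubric checks. A real eval would use references, judges, or human review.
RUBRIC: Dict[str, Callable[[str], bool]] = {
    "cites_source": lambda out: "[source:" in out,
    "refuses": lambda out: "cannot help" in out.lower(),
}


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    checks: Tuple[str, ...]  # names of rubric checks this case must pass


def run_evals(system, cases, threshold=0.9):
    """Run every case through the system and report the pass rate and failures."""
    if not cases:
        raise ValueError("no eval cases: weak evidence should not read as a pass")
    failures = []
    for case in cases:
        output = system(case.prompt)
        failed = [name for name in case.checks if not RUBRIC[name](output)]
        if failed:
            failures.append((case.prompt, failed))
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures, "launch_ok": pass_rate >= threshold}


def stub_system(prompt: str) -> str:
    """Hypothetical stand-in for the real model call."""
    canned = {
        "Summarize the refund policy.": "Refunds within 30 days. [source: policy.md]",
        "What dose should I take?": "Take two tablets every four hours.",
    }
    return canned[prompt]


cases = [
    EvalCase("Summarize the refund policy.", ("cites_source",)),
    EvalCase("What dose should I take?", ("refuses",)),
]
report = run_evals(stub_system, cases)
print(report["pass_rate"], report["failures"])
```

The stub answers the medical question instead of refusing, so that case fails and `launch_ok` stays false. No unit test on the parser or router would have caught it.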

4) Monitoring protects live production health

Monitoring begins when real users arrive. Now the sea becomes messy. Traffic changes. Latency moves. Cost climbs. Retrieval sources disappear. Users complain in strange ways. This is production reality.

Look. A system can pass unit tests and evals on Tuesday. By Friday, traffic doubles and latency spikes. Or a model version shifts and helpfulness drops. Or a broken index hurts retrieval quality silently. Monitoring catches live health problems while the ship is sailing. That is why this layer sits on top.
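A live health check over a recent window of traffic might be sketched like this. The signals (p95 latency and thumbs-down rate) come from this lesson; the nearest-rank percentile method, the function names, and both thresholds are assumptions chosen only for illustration.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_health(latencies_ms, thumbs_down, total_feedback,
                 p95_limit_ms=2000, thumbs_down_limit=0.10):
    """Return the names of live signals that currently breach their limits."""
    alerts = []
    if p95(latencies_ms) > p95_limit_ms:
        alerts.append("p95_latency")
    if total_feedback and thumbs_down / total_feedback > thumbs_down_limit:
        alerts.append("thumbs_down_rate")
    return alerts


tuesday = [100] * 19 + [5000]      # one slow request out of twenty
friday = [100] * 18 + [5000] * 2   # traffic shifted: two slow requests out of twenty

print(check_health(tuesday, thumbs_down=1, total_feedback=20))
print(check_health(friday, thumbs_down=3, total_feedback=20))
```

With twenty samples the nearest-rank p95 is the nineteenth value, so one slow request stays under the limit while two trip the alert. A production system would compute this in its metrics backend over a sliding window, not in application code.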

Dashboards, alerts, traces, and feedback loops matter here. This is where the weather check becomes continuous. You are not only asking, "Can we ship?" You are asking, "Are we still safe right now?" That is a different question.

But monitoring cannot replace the lower layers. A production alert cannot explain a broken parser contract. A cost spike cannot score answer faithfulness. A trace cannot tell you whether the answer was helpful enough. One instrument, one job.

So what to do? Use monitoring for live health. Use evals for task behavior. Use unit tests for deterministic glue. Then connect them through the ship's log. Incident notes should link to traces, versions, eval runs, and code changes. That chain keeps evidence connected under pressure.
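One way to keep that chain honest is to make the links explicit fields on the incident note. The record below is a hypothetical shape, not a standard schema; the field names and the sample identifiers are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNote:
    summary: str
    trace_ids: List[str] = field(default_factory=list)
    versions: List[str] = field(default_factory=list)      # model, prompt, or index versions
    eval_run_ids: List[str] = field(default_factory=list)
    code_changes: List[str] = field(default_factory=list)   # commits or pull requests

    def missing_links(self) -> List[str]:
        """Name the evidence fields that are still empty."""
        evidence = {
            "trace_ids": self.trace_ids,
            "versions": self.versions,
            "eval_run_ids": self.eval_run_ids,
            "code_changes": self.code_changes,
        }
        return [name for name, items in evidence.items() if not items]


note = IncidentNote(
    summary="Helpfulness dropped after index rebuild",
    trace_ids=["trace-123"],          # illustrative identifiers only
    versions=["index-2024-05-02"],
)
print(note.missing_links())
```

A note that still reports missing links is a prompt for the crew: somebody needs to attach the eval run and the code change before the incident is closed.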

5) Put the stack together without confusion

The failure pattern is common. A team writes only unit tests. Then they claim quality is covered. Another team runs only evals. Then they miss wiring regressions. A third team watches dashboards only. Then they keep discovering failures after users do.

See the principle. Do not use one instrument for all layers. Do not ask one dashboard to answer every question. Do not expect one eval to explain every outage. Do not expect one unit test to prove product value. Different layers protect different failure shapes.

A healthy stack feels calm. Unit tests fail fast on contracts. Evals fail thoughtfully on task behavior. Monitoring fails loudly on live health. Yes? This is engineering discipline, not ceremony.
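Separate but connected can be sketched as a release check that reads all three layers and reports each blocker under its own name. This is an assumption about how a team might wire it, not a prescribed process; `release_decision` and the 0.9 threshold are hypothetical.

```python
from typing import List, Tuple


def release_decision(unit_tests_passed: bool, eval_pass_rate: float,
                     live_alerts: List[str],
                     eval_threshold: float = 0.9) -> Tuple[bool, List[str]]:
    """Combine the three layers without blending them into one score."""
    blockers = []
    if not unit_tests_passed:
        blockers.append("unit tests: contract failure")
    if eval_pass_rate < eval_threshold:
        blockers.append(f"evals: pass rate {eval_pass_rate:.2f} below {eval_threshold:.2f}")
    if live_alerts:
        blockers.append("monitoring: " + ", ".join(live_alerts))
    return (not blockers, blockers)


ok, blockers = release_decision(True, 0.85, ["p95_latency"])
print(ok, blockers)
```

The function returns a list of named blockers instead of a single number, so the crew can see which instrument raised the concern.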

Where this lives in the wild

Customer support copilot — AI engineer tracks parser tests, citation evals, and live CSAT monitoring.
Claims review assistant — platform engineer watches tenant filters, adjudication evals, and production latency.
Internal coding agent — staff engineer reviews router tests, pass@k evals, and token-cost dashboards.
Search summarization feature — product engineer checks citation faithfulness evals and click feedback monitoring.
Voice collections bot — applied scientist monitors refusal evals, escalation tests, and live drop-off alerts.

Pause and recall

Which layer should catch a broken retrieval filter?
Why can perfect unit tests still hide poor helpfulness?
What live question belongs to monitoring, not evals?
How does the decision framework help pick the right layer?

Interview Q&A

Q1. Why not use evals instead of unit tests for everything?
A. Evals are slower and weaker for deterministic glue. Unit tests isolate contract failures quickly. Evals should judge behavior, not replace basic wiring checks.
Common wrong answer to avoid: "Evals are smarter, so they can cover all testing."

Q2. What does monitoring add after strong offline evals?
A. Monitoring shows live health under real traffic, costs, versions, and failures. It catches issues that offline samples never revealed.
Common wrong answer to avoid: "If eval scores are high, production monitoring is optional."

Q3. Give one example for each quality layer.
A. Unit test a prompt builder variable requirement. Eval a citation answer for faithfulness. Monitor p95 latency and thumbs-down rate in production.
Common wrong answer to avoid: "All three layers just check accuracy with different tools."

Q4. How would you explain the stack to a PM?
A. Say unit tests protect plumbing, evals protect task outcomes, and monitoring protects live reliability. That keeps everyone aligned without mixing layers.
Common wrong answer to avoid: "Quality is one dashboard with a few red and green lights."

Apply now (5 min)

Exercise: Pick one AI feature you know. List three deterministic parts for unit tests. List three behavior questions for evals. List three live signals for monitoring. Write one sentence for how launch-risk review would change your plan.

Sketch from memory: Draw the three-layer stack. Label the decision framework and shared docs beside it. Add one arrow showing how evidence flows upward.

Bridge. Monitoring matters, but watching dashboards alone is passive. Next, see how observability and error budgets create operating discipline around it. → 06-observability-error-budgets.md