02. The day-one audit¶

With the diagnosis framework in hand, the next move is to examine the system methodically. The audit is the first two weeks of work. Its output is a map you will plan against for the next quarter. This chapter covers what you inspect, in what order, and what you write down.

A new lead at a Mumbai consumer SaaS company inherits a recommendation pipeline. Her first instinct is to read the code top to bottom. By Friday of week one she has read approximately a fifth of the system and feels overwhelmed. She switches strategies. The following Monday she walks the system not by code but by behaviour: what does it actually do in production right now? She pulls a sample of last week's recommendations from the logs. She finds the prompt template that produced them. She finds the model version. She finds the input distribution. She finds the most common failure shapes by sampling customer-complaint tickets. By Friday of week two, she has a map: inputs, prompts, models, outputs, failure modes, scale, cost. The code is still mostly unread, but she now knows what to read it for.

The day-one audit is this behaviour-first walk. The code comes later. The system's behaviour, scale, and shape come first.

What you produce by the end of two weeks¶

The audit output is a small artefact. Twelve sections, two to three paragraphs each. By the end of two weeks, you have:

1.  System purpose             — what business problem it solves; what it does not solve
2.  Input shape                — what comes in, from where, at what rate
3.  Output shape               — what goes out, to whom, with what consumer expectations
4.  Prompts and templates      — the artefacts that encode "what the system is trying to do"
5.  Models in use              — provider, version, region, parameters, fallback if any
6.  Code architecture          — major modules, entry points, where the model is called
7.  Observability inventory    — what is logged, what is traced, what is in the audit
8.  Eval coverage              — what tests exist, what evals exist (usually: none)
9.  Failure modes              — the top 5–10 failure shapes from customer complaints
10. Scale and cost             — calls per day, cost per day, latency distribution
11. Owner map                  — who knows what; who is on call; who left
12. Risk inventory             — known risks, undocumented dependencies, recent incidents
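One way to keep the artefact honest is to hold the twelve sections in a small structure that can report what is still unwritten. The sketch below is an illustration only: the `AuditArtefact` class and its methods are hypothetical names, not part of any tool the chapter assumes.

```python
from dataclasses import dataclass, field

# Section titles copied from the twelve-item list above.
SECTION_TITLES = [
    "System purpose",
    "Input shape",
    "Output shape",
    "Prompts and templates",
    "Models in use",
    "Code architecture",
    "Observability inventory",
    "Eval coverage",
    "Failure modes",
    "Scale and cost",
    "Owner map",
    "Risk inventory",
]


@dataclass
class AuditArtefact:
    """Hypothetical container for the audit write-up; not a standard tool."""

    system_name: str
    sections: dict = field(default_factory=dict)  # title -> written text

    def missing(self) -> list:
        """Titles that have no text yet, in artefact order."""
        return [t for t in SECTION_TITLES if not self.sections.get(t, "").strip()]

    def render(self) -> str:
        """Plain-text rendering; unwritten sections are marked TODO."""
        lines = [f"Audit: {self.system_name}", ""]
        for number, title in enumerate(SECTION_TITLES, start=1):
            body = self.sections.get(title, "").strip() or "TODO"
            lines += [f"{number}. {title}", body, ""]
        return "\n".join(lines)
```

Calling `missing()` at the end of each day gives a quick view of which sections the remaining days still have to cover.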

The artefact is not the destination — it is the substrate for the next quarter's plan. Keep it short enough to read in twenty minutes; otherwise it ages out before it is useful.

How to do the audit, in order¶

The order is deliberate. Earlier steps unblock later steps.

Days 1–2 — System purpose and inputs/outputs¶

Before reading code, talk to people: the PM, the on-call engineer, the customer-success person who handles complaints. Ask three open questions:

What does this system do for the user?
When does it work well? When does it disappoint?
What were the last three incidents about?

The answers give you the system's intent before the implementation. The intent is the lens you read the implementation through.

Then: pull a sample of the system's I/O from production logs (if available) or from the application's storage. Twenty to fifty real input-output pairs. Skim them. Note the input shape; note the output shape; note any obvious patterns.
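If the logs are large, the sample can be drawn in one pass without loading everything into memory. The sketch below assumes the logs are JSON Lines with `input` and `output` fields; that format and those field names are assumptions for illustration, and your system's logs will likely differ.

```python
import json
import random


def iter_pairs(lines):
    """Yield records that have both an input and an output; skip anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines rather than abort the walk
        if isinstance(record, dict) and "input" in record and "output" in record:
            yield record


def reservoir_sample(items, k, seed=None):
    """Uniform random sample of up to k items from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

With a hypothetical log file, usage would look like `reservoir_sample(iter_pairs(open("calls.jsonl")), 50)`. Fixing `seed` makes the sample reproducible, which helps when you later cross-check prompts and model strings against the same pairs.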

By end of day two you can write the first three sections of the audit artefact.

Days 3–4 — Prompts and templates¶

The prompts are the spec. Find them. They are usually:

In a prompts.py or prompts/ directory
In a YAML/JSON config
Inline as string literals in the code that calls the model
In a Notion or Confluence page (often outdated)

Often they are in two or three places, contradicting each other. Find every place a prompt is constructed; pick the one actually used in production by checking against the I/O sample.
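Checking candidates against the I/O sample can be done mechanically when the rendered prompt is logged. The sketch below is a rough illustration that assumes templates use `{name}` placeholders and that the logs contain the full rendered prompt text; both are assumptions, and the function names are hypothetical.

```python
import re

# Matches simple {name} placeholders; other templating syntaxes are not handled.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def template_to_regex(template):
    """Turn a template into a regex where each placeholder matches any text."""
    literal_parts = PLACEHOLDER.split(template)
    pattern = ".*?".join(re.escape(part) for part in literal_parts)
    return re.compile(pattern, re.DOTALL)


def match_counts(templates, logged_prompts):
    """For each candidate template, count how many logged prompts it explains."""
    compiled = {name: template_to_regex(text) for name, text in templates.items()}
    return {
        name: sum(1 for prompt in logged_prompts if regex.fullmatch(prompt))
        for name, regex in compiled.items()
    }
```

A candidate that matches none of the logged prompts is probably stale. If no candidate matches, the production prompt is being built somewhere you have not found yet.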

Read each prompt carefully. Take notes on the intent it expresses. Note any references to things the prompt assumes (terms of service, internal policies, business rules). These are the dependencies you did not see in the code.

By end of day four you can write the prompt section, and you have a much sharper sense of what the system is trying to do.

Day 5 — Models in use¶

Find every model= string in the codebase. List them. For each, note:

Provider
Concrete version
Region
Parameters (temperature, max_output_tokens, tools, stop sequences)
Where in the code this model is invoked
Whether the version is pinned or "latest"

Cross-reference with each provider's deprecation calendar. Note any model that is approaching retirement.

This is the section that usually surprises. The team often thinks "we use Claude" and finds out they actually use four different models across the system, two of which are months from retirement.
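A first pass at the model inventory can be scripted. The sketch below is a heuristic, not a complete scanner: the regex only catches quoted values assigned to names starting with `model`, the file suffixes are a guess at where such strings live, and the pinned/floating classification (a dated suffix versus the word "latest") is an assumption about naming conventions that you should verify against each provider's documentation.

```python
import re
from pathlib import Path

# Quoted values assigned to model, model_name, "model": and similar.
MODEL_RE = re.compile(r"""\bmodel\w*["']?\s*[=:]\s*["']([^"']+)["']""")
# A trailing date such as 20240229 or 2024-02-29 is treated as a pinned version.
DATED_SUFFIX = re.compile(r"(\d{8}|\d{4}-\d{2}-\d{2})$")
SCANNED_SUFFIXES = {".py", ".yaml", ".yml", ".json", ".toml"}


def classify(model_name):
    """Heuristic only: confirm against the provider's own version list."""
    if "latest" in model_name:
        return "floating"
    if DATED_SUFFIX.search(model_name):
        return "pinned"
    return "check manually"


def find_model_strings(text):
    """Return (line number, model string, classification) for each hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MODEL_RE.finditer(line):
            name = match.group(1)
            hits.append((lineno, name, classify(name)))
    return hits


def scan_tree(root):
    """Scan source and config files under root for model strings."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SCANNED_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, name, status in find_model_strings(text):
                results.append((str(path), lineno, name, status))
    return results
```

Unquoted YAML values and model names assembled at runtime will be missed, so treat the output as a starting list to confirm by hand against the production I/O sample.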

Days 6–8 — Code architecture¶

Now read the code. Start at the entry points; trace one or two complete request paths through the system. Note the major modules and their responsibilities. Note where the model is called and how its output is parsed.

Resist the urge to refactor as you read. The reading is the work; the changes come after the eval backstop is in place.

By end of day eight, you have read enough code to write section 6.

Days 9–10 — Observability inventory¶

What is logged today? What can you find about a specific historical request? Run the exercise: pick a customer complaint from last week, try to find the model call that produced the bad output, trace the path. Note where the trail goes cold.

The gaps in this exercise are your observability roadmap.
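The tracing exercise can be roughed out in code if call records are available in any structured form. The sketch below assumes each record is a dict with a `user_id`, an ISO-8601 `ts` without a timezone suffix, and a handful of trace fields; every one of those field names is an assumption for illustration.

```python
from datetime import datetime, timedelta

# Fields you would want on a model-call record to reconstruct what happened.
# These names are assumptions for illustration.
TRACE_FIELDS = ("request_id", "prompt_version", "model", "output")


def find_candidate_calls(records, user_id, reported_at, lookback=timedelta(hours=24)):
    """Return this user's calls in the window before the complaint, oldest first."""
    start = reported_at - lookback
    hits = []
    for record in records:
        if record.get("user_id") != user_id:
            continue
        called_at = datetime.fromisoformat(record["ts"])
        if start <= called_at <= reported_at:
            hits.append((called_at, record))
    hits.sort(key=lambda pair: pair[0])
    return [record for _, record in hits]


def trail_gaps(record):
    """Fields that are absent or empty: the points where the trail goes cold."""
    return [name for name in TRACE_FIELDS if not record.get(name)]
```

An empty result from `find_candidate_calls` is itself a finding: the complaint cannot be tied to any logged call. A non-empty `trail_gaps` list names what section 7 should record as missing.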

Days 11–12 — Eval coverage¶

Look at the test suite. Note which tests exercise the AI path versus mock it. Look for any existing eval artefacts — sometimes there is a small CSV or JSON file with sample inputs labelled by an engineer who tried to be careful but did not formalise it.

Usually there is no eval. Document that clearly.

Days 13–14 — Failure modes, scale, owners, risks¶

Pull the customer-complaint tickets for the last quarter. Cluster them by failure shape — wrong answer, refusal of a valid request, hallucinated detail, slow response, etc. List the top 5–10 with frequencies.
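A first-pass clustering does not need anything sophisticated. The sketch below tags tickets with keyword rules and counts the results; the keywords are invented for illustration and would need tuning against your own tickets, and the first matching rule wins, so rule order matters.

```python
from collections import Counter

# Failure shapes from the text above; the keywords are illustrative guesses.
RULES = {
    "refusal of a valid request": ("refused", "won't answer", "declined"),
    "slow response": ("slow", "timeout", "took too long"),
    "hallucinated detail": ("made up", "does not exist", "invented"),
    "wrong answer": ("wrong", "incorrect", "irrelevant"),
}


def tag_ticket(text):
    """Return the first failure shape whose keywords appear in the ticket."""
    lowered = text.lower()
    for shape, keywords in RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return shape
    return "unclassified"


def failure_shape_counts(tickets):
    """Failure shapes with frequencies, most common first."""
    return Counter(tag_ticket(ticket) for ticket in tickets).most_common()
```

Read the "unclassified" bucket by hand. It is where new failure shapes show up, and a large bucket means the rules need another pass before the frequencies can be trusted.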

Pull operational metrics: calls per day, cost per day (the provider invoice, prorated to this system), latency distribution if available.
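These numbers are simple arithmetic once the raw data is in hand. The sketch below assumes ISO-8601 timestamp strings and latencies in any consistent unit, and it prorates the invoice by share of calls. That proration rule is an assumption: share of tokens is usually a better basis if token counts are logged.

```python
import math
from collections import Counter


def calls_per_day(timestamps):
    """Count calls per calendar day from ISO-8601 timestamp strings."""
    return dict(Counter(ts[:10] for ts in timestamps))


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def prorated_daily_cost(invoice_total, system_calls, total_calls, days):
    """Daily cost attributed to this system, prorated by its share of calls."""
    return invoice_total * (system_calls / total_calls) / days
```

For example, an invoice of 3000 over 30 days, with this system responsible for 250,000 of 1,000,000 calls, prorates to 3000 × 0.25 / 30 = 25 per day. Report p50 and p95 latency together, since the mean hides the slow tail that complaint tickets tend to describe.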

Identify the people: who has been on call, who handles complaints, who knows the data pipeline, who is leaving or has left.

Note any known risks: a deprecation deadline, an undocumented dependency, an upcoming compliance review.

By end of week two, the artefact is complete.

What you are looking for as you audit¶

While doing each step, watch for these signals. They are what you will use to plan the rest of the work.

Signal	What it tells you
Inputs vary widely in shape	Prompt templates may be brittle; eval set needs to span the variation
Outputs are mostly free text, parsed by keyword	Hidden behavioural debt — change the prompt and the parser may break
Multiple models in use, some near retirement	Migration work is on the critical path
No idempotency on retries	Duplicate calls or outputs are likely; look for them in the logs and record what you find
Customer complaints cluster around a few patterns	High-leverage fix candidates
Cost per day surprises the PM	Visibility problem; budget surprises coming
One person has been on call for everything	Bus factor risk
The on-call playbook does not mention the AI module	Operations are blind to this part of the system
Logs say something the prompt does not predict	Drift has happened; prompt and behaviour are out of sync

The signals do not become the plan immediately. They become the inputs to the next chapter (the eval backstop) and chapter 04 (stop-bleeding-vs-do-right).

What to do when something is missing¶

An inherited system is often missing artefacts you would expect to find: a canonical prompt, an eval set, production traces. The audit records the missing items as findings, not as failures.

No production traces. Section 7 says "no traces." This becomes a chapter 06 line item.
No eval set. Section 8 says "no eval coverage." This becomes the chapter 03 priority.
Model hardcoded in three places. Section 5 lists all three. This becomes a chapter 08 line item.

The audit is a fair record of the state; it is not a verdict on the team.

How to do the audit with limited access¶

Sometimes you cannot run the system, read production logs, or talk to former engineers. Adapt:

If you cannot read production logs, sample from whatever data is available — QA environment, batch outputs, customer-impact tickets.
If the team is dispersed, do a lightweight written interview by email: three to five questions, asked of three to five people.
If you cannot run the system, treat the code as your source: simulate inputs you can reasonably construct.

The audit can be lower-resolution and still be useful. A blurry map is better than no map.

How long does this really take?¶

For a system of moderate complexity, two weeks of focused work. For a small system, one week. For a large multi-component system, three to four weeks, but you should still produce a first draft in two weeks and iterate.

Resist the temptation to extend the audit to a month. After two weeks the returns diminish. The remaining clarity comes from doing the work in chapters 03–07, not from more reading.

Common mistakes during the audit¶

Reading code first. You spend a week reading code without context. The behaviour-first walk is faster and produces better questions.
Assuming the documentation is accurate. Confluence pages older than six months are usually wrong about the AI parts. Verify against logs and code.
Skipping the customer-complaint review. Complaints are the real-world feedback closest to ground truth about failures. Skipping them means designing fixes for what you imagine is wrong.
Trying to fix things as you go. The audit is observation; the fixes start after chapter 03.
Not writing the artefact down. A mental model dies the next day; a written artefact survives.

Interview Q&A¶

Q1. Why audit behaviour-first instead of reading the code first? Because the code's complexity is mostly accidental — wiring, parsing, error handling — relative to the behaviour the system is trying to produce. Behaviour-first surfaces the system's intent (from PMs and prompts) and its actual production shape (from logs), which then becomes the lens through which the code makes sense. Reading 480 lines of code without this lens is slow and produces overwhelm. Reading the same code after knowing the intent is much faster. Wrong-answer notes: "the code is the source of truth" is true for non-AI systems; for AI, the prompt and the output distribution carry more behavioural weight than any single function.

Q2. The team tells you "we have great test coverage." You look and find the tests mock the model entirely. What do you say in the audit? You note that the test coverage is on the code paths around the model, not on the model's behaviour itself. The tests verify that the code calls the model correctly and parses its output safely, but they do not verify that the model returns useful outputs for representative inputs. This is the gap the eval backstop in chapter 03 fills. The audit records the existing coverage faithfully and identifies the missing coverage type separately. Wrong-answer notes: dismissing the existing tests as worthless is wrong — they are useful, just not for the right thing.

Q3. You have one week, not two. What do you cut? Cut sections 11 (owner map) and 12 (risk inventory) to "best effort" and finish the rest. The behaviour-first sections (1–5) and the failure-mode/scale sections (9, 10) are the load-bearing ones for planning. The owner and risk maps are operationally important but you can fill them in over the next two weeks while you work on the eval backstop. Wrong-answer notes: cutting the prompt or model audit (sections 4–5) is the wrong cut; those are the highest-leverage discoveries.

Q4. The audit finds the model is hardcoded as claude-3-opus-20240229, which is being retired in 90 days. Does that change your sequencing? Yes. The retirement adds a hard deadline. The eval backstop (chapter 03) and the model migration (chapter 08) move up in priority, because the migration cannot be safely done without the backstop, and the retirement is closer than the natural pace of the work. You front-load chapter 03 in the 30-day plan and start chapter 08 work in week three rather than week six. The audit's finding becomes a scheduling input, not just an observation. Wrong-answer notes: treating the audit as static (just a record) misses that its findings should change the plan.

What to do differently after reading this¶

When you inherit a system, do the behaviour-first walk before reading code.
Write the twelve-section audit artefact; keep it short enough to re-read every week.
Cross-reference the prompts and the model strings against the production I/O sample. Verify what is actually used.
Cluster customer complaints into failure shapes; this is your highest-leverage input for sequencing.
Treat the audit as the input to the plan, not the plan itself.

Bridge. The audit produces the map. Before you make any change against that map, you need a safety net — an eval backstop that catches behaviour regressions you cannot otherwise see. The next chapter is the minimum eval coverage and how to bootstrap it from the audit you just produced. → 03-the-eval-backstop.md