
01. The inherited system

Before we plan an audit, we need to understand the patient. Legacy AI is not legacy code with a model attached. It has its own diagnosis, its own progression of failure modes, and its own constraints on what you can safely do first.


The new tech lead from chapter 00 spends day five trying to read the system. She opens the claims-triage repo. The top-level structure looks normal — app/, models/ (this turns out to be domain models, not ML models), pipelines/, tests/, a Dockerfile. She finds the entry point. It calls into a class called ClaimsAdvisor, which in turn calls a function decide_triage_path. The function is 480 lines. The first 80 lines build a prompt by string-concatenating dozens of fields from the claim. Lines 80–110 call anthropic.Client().messages.create(model="claude-3-opus-20240229", ...). Lines 110–280 parse the response with a long chain of if "approved" in text.lower(): ... elif "needs review" in text:. Lines 280–480 fall back to keyword-based heuristics when the parse fails. There are no unit tests for decide_triage_path itself; the unit tests around it mock the function entirely.

She has, in five days, read approximately two of the system's pieces. There are eleven similar pieces. None of the team that wrote them is still at the company.

This is what an inherited AI system looks like. The pathology is recognisable, but it is not the same as inheriting a Django monolith.


Why legacy AI is not legacy code

Legacy AI systems differ from legacy code in four ways that matter for the playbook.

1. Non-determinism is the substrate

A legacy code path is, for practical purposes, deterministic. You run it twice on the same input and state, you get the same answer. You refactor it; you run the same test twice; if it passes both times, you have not broken it.

A legacy AI path is non-deterministic. You run it twice, you can get two different answers. You refactor it; you cannot test the new behaviour by comparing one new run to one old run. You need a distribution of runs, with aggregated judgement, to claim "the new version behaves like the old."

This distribution-level comparison is the eval backstop, and it is the load-bearing addition that makes legacy AI modernisation possible. Without it, you cannot safely ship changes; you can only operate the system.
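A minimal sketch of what "compare distributions, not single runs" can look like. Everything here is an assumption for illustration: `make_fake_path` stands in for a model-backed triage path, the outcome labels are borrowed from the parsing chain described earlier, and the tolerance is a placeholder rather than a recommended value.

```python
import random
from collections import Counter
from typing import Callable


def outcome_distribution(path: Callable[[str], str], claim: str, runs: int) -> dict[str, float]:
    """Call a non-deterministic path `runs` times and return outcome frequencies."""
    counts = Counter(path(claim) for _ in range(runs))
    return {label: n / runs for label, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def make_fake_path(p_approve: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: approves with probability p_approve."""
    rng = random.Random(seed)

    def path(claim: str) -> str:
        return "approved" if rng.random() < p_approve else "needs review"

    return path


TOLERANCE = 0.10  # assumed threshold; a real one depends on the cost of a wrong decision

old = outcome_distribution(make_fake_path(0.80, seed=1), "claim-123", runs=2000)
new = outcome_distribution(make_fake_path(0.78, seed=2), "claim-123", runs=2000)
drifted = outcome_distribution(make_fake_path(0.40, seed=3), "claim-123", runs=2000)

print("refactor within tolerance:", total_variation(old, new) <= TOLERANCE)
print("drifted within tolerance:", total_variation(old, drifted) <= TOLERANCE)
```

A single old run compared with a single new run would disagree about one time in three even when nothing changed, which is why the comparison is made over frequencies.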

2. The behaviour is in the prompts and the model

In a legacy codebase, the behaviour is in the code. Reading the code tells you what the system does. Refactoring the code preserves behaviour because the code is the behaviour.

In a legacy AI system, the behaviour is mostly in the prompts and partly in the model. The code orchestrates the inputs and parses the outputs, but the judgement — the part the AI does — lives in a multi-paragraph prompt and a model version string. Reading 480 lines of code tells you almost nothing about what the system is trying to do; reading the prompt tells you. The prompt is the spec. The code is the wiring.

Implication: when you encounter an inherited AI system, find the prompts first. They are the documentation. They are also the part most likely to drift, because they were tuned by trial-and-error against a model that may no longer exist.
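One way to start the prompt hunt in a Python codebase is to let the standard-library `ast` module list long string literals and any hardcoded `model=` arguments. This is a rough sketch, not a complete inventory tool: the function names, the length threshold, and the sample source are all assumptions, and prompts assembled from many short fragments will slip under the threshold.

```python
import ast


def find_prompt_candidates(source: str, min_chars: int = 200) -> list[tuple[int, str]]:
    """Return (line_number, preview) for string literals at least min_chars long."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if len(node.value) >= min_chars:
                preview = " ".join(node.value.split())[:60]
                hits.append((node.lineno, preview))
    return sorted(hits)


def find_model_strings(source: str) -> list[tuple[int, str]]:
    """Return (line_number, model_id) for calls that pass model="..." as a literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (
                    kw.arg == "model"
                    and isinstance(kw.value, ast.Constant)
                    and isinstance(kw.value.value, str)
                ):
                    hits.append((node.lineno, kw.value.value))
    return sorted(hits)


# Hypothetical source text standing in for one file of the inherited repo.
SAMPLE = '''
SYSTEM = "You are a claims triage assistant. Reply with approved or needs review."
def decide(client, claim):
    return client.messages.create(model="claude-3-opus-20240229", max_tokens=100, messages=[])
'''

prompts = find_prompt_candidates(SAMPLE, min_chars=40)
models = find_model_strings(SAMPLE)
print(prompts)
print(models)
```

Prompts kept outside the code (YAML files, wiki pages) need a separate search; this only covers what is embedded in Python source.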

3. The vendor moves under you

A legacy Django app sits still until you touch it. A legacy AI system drifts even when you do not touch it. The model is updated by the vendor. The model is retired by the vendor. The model behaves slightly differently after a vendor deploy. A "do nothing" stance for an AI system is unstable in a way that legacy code is not.

This is why the chapter-00 manager's instruction "do not break what works" requires more than restraint. It requires active stabilisation — pinning model versions, monitoring drift, planning for the next deprecation. Doing nothing is a slow-motion break.
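Part of active stabilisation can be automated as a check that runs in CI or on a schedule. The sketch below assumes that pinned model identifiers end in an eight-digit date, as `claude-3-opus-20240229` does; other vendors use other conventions. The retirement schedule, the `example-model-*` identifiers, and the 90-day notice window are hypothetical placeholders, not real vendor data.

```python
import re
from datetime import date
from typing import Optional

# Assumed convention: a pinned snapshot ends in -YYYYMMDD.
SNAPSHOT_SUFFIX = re.compile(r"-(\d{4})(\d{2})(\d{2})$")


def snapshot_date(model_id: str) -> Optional[date]:
    """Return the snapshot date encoded in the identifier, or None if it is not pinned."""
    match = SNAPSHOT_SUFFIX.search(model_id)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # eight digits that do not form a real calendar date
        return None


def stabilisation_warnings(
    model_id: str,
    today: date,
    retirements: dict[str, date],
    notice_days: int = 90,
) -> list[str]:
    """List reasons why the configured model needs attention."""
    warnings = []
    if snapshot_date(model_id) is None:
        warnings.append(f"{model_id}: not pinned to a dated snapshot")
    retires = retirements.get(model_id)
    if retires is not None and (retires - today).days <= notice_days:
        warnings.append(f"{model_id}: retires on {retires.isoformat()}, plan the migration")
    return warnings


# Hypothetical schedule, maintained by hand from the vendor's announcements.
RETIREMENTS = {"example-model-20240101": date(2025, 6, 30)}

print(snapshot_date("claude-3-opus-20240229"))
print(stabilisation_warnings("example-model-latest", date(2025, 5, 1), RETIREMENTS))
print(stabilisation_warnings("example-model-20240101", date(2025, 5, 1), RETIREMENTS))
```

A check like this covers pinning and the deprecation calendar; behavioural drift still needs the eval runs, because an identifier cannot tell you how the model behaves.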

4. The cost of wrong behaviour is asymmetric

A legacy code bug produces a wrong number, a wrong list, a stack trace. The user often notices immediately and reports it.

A legacy AI bug produces plausible-but-wrong text, a confident-but-misclassified outcome, a polite refusal of a legitimate request. The user often does not notice immediately. By the time the bug is detected (through customer complaints, regulatory inquiry, or a chance audit), many wrong decisions have been made. The lead time on AI defects is longer; the blast radius is correspondingly larger.

Implication: the threshold for "safe to ship a change" is higher. The eval backstop, the parallel-run discipline, the rollback rehearsal — these are not paranoia. They are commensurate with the cost of being wrong.
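The parallel-run discipline can be sketched as a wrapper that keeps serving the current path while recording what a candidate path would have decided. The class name, the stub triage functions, and the sample claims are hypothetical; a real version would also need sampling and persistent storage.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ParallelRun:
    """Serve `current`; run `candidate` in shadow and record disagreements."""

    current: Callable[[str], str]
    candidate: Callable[[str], str]
    disagreements: list[tuple[str, str, str]] = field(default_factory=list)
    total: int = 0

    def decide(self, claim: str) -> str:
        served = self.current(claim)
        self.total += 1
        try:
            shadow = self.candidate(claim)
        except Exception as exc:  # a candidate failure must never affect the user
            shadow = f"error: {type(exc).__name__}"
        if shadow != served:
            self.disagreements.append((claim, served, shadow))
        return served  # the candidate's answer is recorded, never served

    def disagreement_rate(self) -> float:
        return len(self.disagreements) / self.total if self.total else 0.0


# Hypothetical deterministic stubs standing in for the old and new triage paths.
def current_path(claim: str) -> str:
    return "needs review" if "injury" in claim else "approved"


def candidate_path(claim: str) -> str:
    return "needs review" if ("injury" in claim or "fraud" in claim) else "approved"


run = ParallelRun(current_path, candidate_path)
claims = ["windscreen chip", "injury at work", "suspected fraud ring", "lost luggage"]
served = [run.decide(claim) for claim in claims]

print(served)
print(run.disagreements)
print(run.disagreement_rate())
```

The disagreement list is the useful artefact: each entry is a concrete case to review before the candidate is allowed to serve traffic.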


The four states

Chapter 00 introduced the four states. Here they are with the diagnostic questions you ask to identify your system's state.

Frozen

You inherited a system. There is no eval. There is no audit log of model calls. The prompts may be in code, in a Notion page, or scattered in YAML files nobody touches. The model is hardcoded as a string. Nobody on the current team is sure exactly what the system does for any specific class of input.

You can operate it — restart it, scale it, route around it — but you cannot improve it, because you cannot tell whether a change made it better or worse.

Diagnostic: ask the current team "if I changed the system prompt to do X instead of Y, how would you know?" If the answer is "we'd see the customer complaints change" or "we'd watch the dashboard," you are frozen. The signal is too slow and too coarse to drive iteration.

Observable

There is some logging. Maybe traces, maybe not. The team can find what the system did for a specific case after the fact. There may be a customer-impact metric that moves with the system's behaviour.

You can operate it, you can investigate specific incidents, but you still cannot safely ship changes because no test catches behaviour regressions before users see them.

Diagnostic: ask "if I changed the system prompt and something subtle broke, would we know within a day?" If the answer is "maybe, depending on which subtle thing," you are observable but not eval-backed.
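Moving from frozen towards observable usually starts with an audit log of model calls. A minimal sketch, assuming the model call can be wrapped as a function of `(model, prompt)`; the record fields, the `fake_model` stub, and the in-memory sink are illustrative choices, and a production log would also need a policy for personal data in prompts.

```python
import hashlib
import io
import json
import time
from typing import Callable, TextIO


def with_audit_log(
    call_model: Callable[[str, str], str], sink: TextIO
) -> Callable[[str, str], str]:
    """Wrap a model call so each invocation appends one JSON line to `sink`."""

    def logged(model: str, prompt: str) -> str:
        started = time.time()
        response = call_model(model, prompt)
        record = {
            "ts": started,
            "model": model,
            # The hash lets you group calls by identical prompt text.
            "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "prompt": prompt,
            "response": response,
            "latency_s": round(time.time() - started, 4),
        }
        sink.write(json.dumps(record) + "\n")
        return response

    return logged


def fake_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for the vendor call."""
    return "approved"


sink = io.StringIO()  # a file opened in append mode would work the same way
logged_call = with_audit_log(fake_model, sink)
result = logged_call("claude-3-opus-20240229", "Triage this claim: windscreen chip")

record = json.loads(sink.getvalue().splitlines()[0])
print(record["model"], record["response"])
```

With records like these, "what did the system do for this case" becomes a log query rather than a guess, which is the capability the observable state describes.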

Eval-backed

There is at least a small, curated eval set with known-good outcomes. A code change or a prompt change can be run against the eval and produce a quality score. Changes that regress the score are caught before deployment.

You can ship changes — carefully, slowly, with rollback at the ready.

Diagnostic: ask "if I made a change today, what's the minimum check before merge?" If the answer is "run the eval suite; check the scores; if they hold or improve, ship; if they regress, investigate," you are eval-backed.
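The "run the eval suite; check the scores" step can be as small as the sketch below. The eval cases, the three stub paths, and the zero-regression default are assumptions made for illustration; with a non-deterministic path, each case would be run several times and the pass rate averaged.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (claim text, known-good outcome)


def eval_score(path: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of curated cases where the path returns the known-good outcome."""
    passed = sum(1 for claim, expected in cases if path(claim) == expected)
    return passed / len(cases)


def gate(candidate_score: float, baseline_score: float, max_regression: float = 0.0) -> bool:
    """Allow the merge only if the candidate does not regress beyond the allowance."""
    return candidate_score >= baseline_score - max_regression


# Hypothetical curated eval set.
CASES: list[EvalCase] = [
    ("windscreen chip, photos attached", "approved"),
    ("injury at work, no medical report", "needs review"),
    ("suspected duplicate claim", "needs review"),
    ("lost luggage, receipt attached", "approved"),
]


# Hypothetical stubs for the deployed path and two proposed changes.
def baseline(claim: str) -> str:
    return "needs review" if "injury" in claim else "approved"


def improved(claim: str) -> str:
    return "needs review" if ("injury" in claim or "duplicate" in claim) else "approved"


def regressed(claim: str) -> str:
    return "approved"


baseline_score = eval_score(baseline, CASES)
print("baseline:", baseline_score)
print("improved passes gate:", gate(eval_score(improved, CASES), baseline_score))
print("regressed passes gate:", gate(eval_score(regressed, CASES), baseline_score))
```

A gate like this is only as good as the curated cases behind it, which is why the set is small and deliberate rather than a dump of production traffic.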

Modular

The system is composed of pieces with explicit interfaces. Components can be replaced one at a time. The prompt lives in a registry. The model is selected by alias. Observability is comprehensive. Evals run automatically.

This is the destination. Most inherited systems are not in this state when you inherit them.
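Two of the modular properties, a prompt registry and model selection by alias, can be sketched in a few lines. The class names, the `triage-default` alias, and the template text are hypothetical; a real registry would persist versions and record who changed what.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str


class PromptRegistry:
    """In-memory registry: every change to a prompt creates a new numbered version."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def register(self, name: str, template: str) -> PromptVersion:
        history = self._versions.setdefault(name, [])
        entry = PromptVersion(name, len(history) + 1, template)
        history.append(entry)
        return entry

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        history = self._versions[name]
        if version is None:
            return history[-1]  # latest version
        if not 1 <= version <= len(history):
            raise KeyError(f"{name} has no version {version}")
        return history[version - 1]


# Callers ask for an alias; only this table knows the pinned identifier.
MODEL_ALIASES = {"triage-default": "claude-3-opus-20240229"}


def resolve_model(alias: str) -> str:
    return MODEL_ALIASES[alias]


registry = PromptRegistry()
registry.register("triage", "Classify this claim: {claim}")
registry.register("triage", "Classify this claim as approved or needs review: {claim}")

latest = registry.get("triage")
print(latest.version, latest.template.format(claim="windscreen chip"))
print(resolve_model("triage-default"))
```

Because callers hold only a prompt name and a model alias, swapping either one is a change to the registry or the alias table, not to the calling code.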


The pathologies that produce frozen systems

Three patterns explain how systems end up frozen:

The "we'll add tests later" pattern. The original team shipped fast under deadline pressure. The evals were always going to be next sprint's work. They never were.

The "this part is the model's job" pattern. The original team thought the model would handle edge cases. They did not write tests around the model's behaviour because "you can't really test a model." The model has since changed twice; the original assumptions no longer hold.

The "single author" pattern. One engineer built the system. They held the mental model in their head. They left. The mental model left with them. The system works because nothing has changed, not because anyone understands it.

You will see one or more of these in every inherited AI system. They are not character flaws of the previous team; they are structural patterns the field has been working through. Your job is not to assign blame; it is to recognise the pattern so you can plan accordingly.


What this means for the playbook

The next chapters operationalise the consequences:

  • Chapter 02 (day-one audit) is the systematic mapping that produces the diagnosis.
  • Chapter 03 (the eval backstop) is the move from frozen to eval-backed — the most important change you make.
  • Chapters 04–06 (sequencing, prompts-out-of-code, observability) are the stabilisation phase.
  • Chapter 07 (strangler migration) is the move from eval-backed to modular.

The order matters. You do not refactor before instrumenting. You do not instrument before establishing the eval backstop. You do not establish the eval backstop before auditing to understand what to evaluate against.


Interview Q&A

Q1. What is the single biggest difference between inheriting a legacy AI system and inheriting a legacy non-AI service?

Non-determinism. You cannot verify the system's output by re-running the same input and comparing. You need a distribution of runs over representative inputs and an aggregated judgement (an eval) to know whether a change preserved behaviour. The other differences (prompts as spec, vendor drift, asymmetric defect cost) matter, but non-determinism is the load-bearing one because it changes how every change has to be verified.

Wrong-answer notes: "it uses AI" is the surface answer; the value of the question is in the methodological consequence.

Q2. Walk through the four states an inherited AI system can be in.

Frozen: operate only; no signal for changes. Observable: operate plus investigate incidents; still no safe change. Eval-backed: operate plus investigate plus ship changes against a backstop. Modular: components have interfaces; one can be replaced without touching the others. Each diagnostic question asks how quickly you can detect a behaviour regression: customer complaints (frozen), per-incident traces (observable), pre-merge eval check (eval-backed), per-component contract test (modular).

Wrong-answer notes: missing the move-the-line discipline (frozen → observable → eval-backed → modular) is the most common gap.

Q3. The manager says "do not break what works." How do you reconcile that with the fact that the vendor will change the model under you?

By making active stabilisation part of "not breaking." Doing nothing is unstable because the vendor moves; the system you inherited is not really frozen, it is drifting. The right interpretation of the instruction is: pin model versions explicitly, monitor for drift, plan migrations on a calendar, and absorb vendor changes through the gateway. "Not breaking what works" requires the work; standing still does not preserve the state.

Wrong-answer notes: taking "do not break" as literal restraint produces a slow-motion failure.

Q4. A previous engineer wrote a 480-line function that builds a prompt, calls the model, and parses the output with keyword heuristics. The function has no unit tests. What is the first thing you do?

Find the prompt, the part the function builds. That is the spec for what the system is trying to do. Read it carefully; cross-reference it with sample outputs from production logs (if available). The 480 lines of code are mostly wiring; the prompt is the behavioural contract. After reading the prompt, you are ready to start the day-one audit on the rest of the system.

Wrong-answer notes: "refactor the function" before reading the prompt is busywork; the function's complexity is mostly accidental relative to the prompt's intent.


What to do differently after reading this

  • When you inherit an AI system, find and read the prompts first. They are the spec.
  • Diagnose the system's state using the four-state framework before committing to any plan.
  • Reject "do nothing" as a strategy. Active stabilisation is required because the vendor moves.
  • Plan changes in proportion to the asymmetric defect cost. The threshold for "safe to ship" is higher than in non-AI code.

Bridge. With the diagnosis framework in hand, the next chapter is the systematic first inspection: in your first two weeks, what do you look at, in what order, to produce the map you will work from. → 02-day-one-audit.md