Skip to content

04. Stop-the-bleeding versus do-it-right

The eval backstop is in place. You can change things now. The next question is which things to change first. Modernisation has two pulls — quick wins that ease the worst pain, and structural work that prevents the pain from coming back. Sequencing them is the lead's most important early call.


A tech lead at a Pune SaaS company has just finished her audit and built a first eval. The findings show one severe symptom (the agent occasionally surfaces internal SQL identifiers in user-facing messages — happens twice a week) and three structural problems (prompts in code, no model versioning, no observability). The PM is asking for a roadmap. The lead thinks for a day and produces it: in week three, a one-line prompt patch that prefixes every output with a sanitiser and a temporary regex strip on identifiers. Quick, ugly, effective. In weeks four through six, structural work: prompts into a registry, observability retrofitted, eval expanded. The bleeding stops in week three; the structural work prevents the bleeding from recurring. The PM gets a date for symptom relief and a date for prevention. The team gets a clear sequence rather than a tangle of work fronts.

This chapter is the discipline behind that sequencing. The wrong sequence either ships symptom relief that masks the underlying issue, or invests in structure while users keep being hurt. The right sequence does both, on different timeframes, with explicit ownership of when each finishes.


The two pulls

Every modernisation has two competing forces.

Stop the bleeding. The user-visible symptoms hurting customers right now. Quick patches that reduce the symptom rate. Often ugly. Often technical debt. Not the long-term answer.

Do it right. The structural work that fixes the underlying conditions. Slower. Less visible to users in the short term. The basis for sustained improvement.

A modernisation that only does the first is firefighting that never ends. A modernisation that only does the second is a long absence from delivering user value. The right balance is both, sequenced explicitly.


The sequencing principles

Three principles guide the order.

1. Stop the bleeding on the worst symptom, fast. If a user-visible failure is causing material harm — wrong outputs to customers, regulatory exposure, lost revenue — patch it now, even if the patch is temporary. The team's credibility with stakeholders depends on visible progress on real pain.

2. Structural work before the next major change. The eval backstop, the prompt registry, the observability — these are preconditions for every later improvement. Do them before you take on bigger refactors, not after.

3. Avoid mutually destructive concurrent changes. A structural change underway while a bleeding-stop patch is being made can collide. Sequence so one finishes before the other touches the same surface, or work on different parts of the system in parallel.


The decision table

For each candidate fix, classify it on two axes:

Low blast radius High blast radius
High user impact Stop the bleeding now (quick patch) Stop the bleeding and plan the structural fix; do not move fast carelessly
Low user impact Defer; pick up in structural work Defer; structural work first

Blast radius here means how many parts of the system the fix touches. A regex strip on an output string is low-blast. A change to the prompt registry that affects every prompt is high-blast.

User impact is the rate and severity of the user-visible harm.

The four quadrants give a quick sort:

  • High user impact + low blast radius — fix now, ugly is fine
  • High user impact + high blast radius — fix now carefully, with the structural work happening alongside
  • Low user impact + low blast radius — defer to structural work
  • Low user impact + high blast radius — definitely defer; do not introduce new risk for low return

A worked example

Inherited system: a customer-support agent. The audit produced three findings.

# Finding User impact Blast radius Sequencing
A Agent occasionally returns raw error strings from internal systems to users High (3% of conversations) Low (one parsing function) Stop the bleeding now — add a sanitiser in week 1
B Prompts are buried in a 2k-line file with several near-duplicates Low (no direct user impact) High (affects all behaviour) Structural — registry migration in weeks 3–5
C Model is claude-3-opus-20240229, retiring in 60 days Will be high in 60 days High (every call) Structural with a deadline — start in week 2, finish by week 6

Sequence:

  • Week 1: Fix A — sanitiser plus eval entry for the failure mode.
  • Week 2: Begin C — research successor model, plan canary.
  • Weeks 3–5: Do B — prompts into a registry; this clears the way for the C migration.
  • Weeks 5–7: Complete C — canary the new model with the registry-served prompt; ship.
  • Week 8: Revisit A — replace the sanitiser with a proper structured-error contract.

The bleeding stops in week 1. The structural work cascades from week 2 onward. The temporary patch in week 1 is not forgotten — it has an explicit replacement scheduled in week 8 after the structural work makes the proper fix possible.


How to patch without painting yourself into a corner

The risk of stop-the-bleeding patches is that they accumulate. Six months later, the system has dozens of regex strips, hard-coded rules, and special cases that nobody owns. The patches were necessary but the housekeeping was skipped.

Three disciplines avoid this:

1. Every patch has a structural follow-up. When you ship a patch, schedule the proper fix. Put it on the roadmap. The patch is the bridge; the structural fix is the destination.

2. Patches are commented as patches. The code annotation says it is a temporary fix, what the proper fix is, and when it will arrive. Reviewers later can spot whether the proper fix landed.

3. Patches have eval coverage. The eval set picks up the failure mode the patch addresses. When the proper fix replaces the patch, the eval re-verifies that the failure mode stays handled.

These disciplines convert technical debt from "unmanageable accumulation" into "tracked liability with a payoff date."


When to refuse a quick patch

Sometimes the right answer is "no quick patch; we need the structural fix first." Two cases.

The patch would make the structural fix harder. If a regex strip is layered over an output format, the structural fix that changes the output format has to navigate the regex. Sometimes the patch is so entangled it is easier to do the structural work directly.

The bleeding is not bad enough. A finding might be uncomfortable but not actively harmful. If the system "could be better" but no users are being hurt right now, you can defer to the structural work without a patch. Spending patch effort on non-bleeding-level findings is a poor use of attention.

The pushback to stakeholders in these cases: "we are doing the structural fix in three weeks, which will resolve this directly. Patching first would slow that down." This is a defensible position when supported by the audit data.


What to do about urgent compliance findings

Sometimes the audit produces a regulatory or compliance finding that cannot wait three weeks. A data-residency leak, an unredacted PII path, an unauthorised tool. These are stop-the-bleeding-now items regardless of blast radius.

The discipline: patch immediately with the narrowest fix that addresses the compliance issue, communicate to the compliance team that the patch is in place, and put the structural fix on the highest-priority track. Do not let the structural sequencing pull a compliance fix later than necessary.


Communicating the sequence

The sequence is also a story you tell stakeholders. The PM and the executive want to know:

  • When does the worst symptom stop hurting users?
  • When does the system become reliable?
  • What are you doing in the meantime?

The sequence answers all three:

  • The symptom stops hurting in week 1 (or the relevant patch week).
  • The system becomes reliable as the structural work lands (with dates).
  • In the meantime, you are doing the structural work that makes "reliable" possible.

The sequence is the lead's contract with stakeholders. It should be re-stated weekly during the modernisation, with progress against each item.


Common mistakes

Treating every finding as urgent. If everything is bleeding, nothing is. Sort by impact and blast radius; spend your patch budget on what is genuinely worst.

Skipping structural work because patches feel productive. Each patch is a one-off; structural work pays for itself across all future changes. Resist the pull toward perpetual patching.

Skipping patches because structure is more important. The opposite mistake. Users are hurting now; structural work alone does not help them this week.

Not closing patches when structure lands. A patch left in place after its structural replacement is debt. Schedule the removal.


Interview Q&A

Q1. The audit found three things bleeding badly and five structural problems. The team has bandwidth for two streams of work. How do you sequence? Stream one is the bleeding fixes, sorted by user impact. Tackle the top one or two in week one with quick patches. Stream two is structural — the eval backstop is already done (chapter 03), so the next structural items are prompts-out-of-code, observability, and the model migration if any deadline applies. The structural work proceeds in parallel to bleeding fixes, but they should not collide on the same code surface. The third bleeding item, if it is not catastrophic, can wait until after the first structural milestone, which often makes the proper fix easier. Wrong-answer notes: "do everything at once" produces collision and burnout; "do bleeding only" produces perpetual firefighting.

Q2. A PM is pushing for a feature launch in three weeks. You think modernisation needs another month before new features should ship. What do you say? "Here is the eval coverage today; here is the structural posture today; here is the risk profile of shipping the feature now versus in five weeks." The conversation is about risk and confidence, not about modernisation as an obstacle. Sometimes the feature can ship within the modernisation programme if the feature touches a part of the system you have already stabilised. Sometimes the feature legitimately needs to wait. The lead's job is to make the trade visible; the PM and lead jointly own the decision. Wrong-answer notes: "we can't ship anything until modernisation is done" is overreach; "ship it, we'll handle the risk" cedes the engineering judgement.

Q3. You shipped a regex strip three weeks ago to stop a bleeding symptom. The structural fix is now ready. What's the closing process? Replace the regex strip with the structural fix. Run the eval set, especially the cases tied to the original bleeding symptom; verify they all still pass without the patch. Delete the regex strip from the code. Note in the postmortem or sprint summary that the patch has been retired. Update the eval-set documentation to reflect that the case is now handled structurally. The closing matters — leaving the regex in place after the structural fix creates dead code and a parallel enforcement that may diverge. Wrong-answer notes: "leave the regex as belt-and-braces" silently produces conflicting enforcement paths.

Q4. A compliance finding lands during the modernisation: PII is appearing in logs. What is the right sequencing? Patch immediately — within hours, not days. The narrowest fix that prevents PII from being logged: a redaction filter in the logging path. Notify the compliance team. Then immediately add the proper fix to the highest-priority structural track: PII-handling discipline in the audit log (see module 03_ai_security_safety/03). The compliance issue cannot wait for the natural sequencing; it overrides. Wrong-answer notes: "schedule it in the sprint after next" is the wrong answer for compliance; speed of containment is the discipline.


What to do differently after reading this

  • For each audit finding, classify by user impact and blast radius. Sequence per the decision table.
  • For every quick patch, schedule the structural follow-up and annotate the patch as temporary.
  • Refuse patches when they would impede the structural fix or address non-bleeding-level findings.
  • For compliance findings, override the natural sequencing — patch fast, follow up structurally.
  • State the sequence to stakeholders weekly; update with progress.

Bridge. Sequencing is settled. The first structural work begins. The most common starting point is the prompts — they encode the system's intent and they are the artefact most likely to be buried in code. The next chapter is the migration: prompts out of source into a registry, without breaking the running system. → 05-prompts-out-of-code.md