09. Data pipeline and context debt
Models are stabilised; the system is on a calendar. The next part of the inherited mess is the data side: retrieval contexts, training-data references, hand-rolled context builders, embedding indexes nobody owns. Many inherited AI systems carry as much data debt as code debt. This chapter covers the discipline that addresses it.
A platform engineer at a Bengaluru e-commerce company inherits a product-recommendation system. The audit found the obvious code debt (chapters 02–08). It also found something subtler: every recommendation prompt is built by a build_context function that fetches "relevant" product data from a chain of five internal services, deduplicates it, summarises it, and concatenates the result into a single string. The function is 900 lines. It has no tests for what counts as relevant. The data it fetches has changed shape three times in two years; the function has been patched each time but never redesigned. The retrieval pipeline behind it indexes a snapshot of the catalogue that is rebuilt nightly by a different team; nobody knows what changed in the indexer last quarter or whether its output still matches what build_context expects.
This is data debt. The model gets a context; the context is constructed by a poorly-owned pipeline. The model's behaviour is downstream of every weak link in the pipeline. Modernisation that addresses only the code calling the model misses half the problem.
What "data debt" means in this context
Three categories.
Retrieval and indexing. The system that fetches relevant data into a prompt — vector indexes, search APIs, knowledge graphs, internal service calls. The pipeline's shape, freshness, and quality drive every model output.
Context construction. The code that turns raw fetched data into prompt-shaped strings — formatting, deduplication, summarisation, truncation. Often hand-rolled, poorly tested, brittle to upstream changes.
Training data and fine-tuning references. If the system uses fine-tuned models or in-context learning examples, the data that produced or seeded those is part of the system. Inherited systems often have stale or untracked references here.
For most inherited systems, the first two are the urgent problems. The third matters for systems that use fine-tuning; many do not.
Why the data side is often skipped
Three patterns that produce data debt and resist modernisation.
It does not look like code. Data pipelines are SQL, ETL DAGs, indexer configs, embedding-generation scripts. They are not the code people review for AI changes. They evolve under the radar.
Ownership is split. The team that owns the catalogue indexer is not the team that owns the recommendation agent. The boundary between them is informal. When something breaks, neither owns it cleanly.
The blame goes to the model. A degradation in recommendations gets attributed to "the model is getting worse." Sometimes that is true; often the model is fine and the data the model sees is worse. Without distinguishing, the wrong fix is applied.
The discipline of this chapter is to make these problems visible and addressable.
What to audit on the data side
Extend the day-one audit (chapter 02) with these data-specific questions.
| Question | What it reveals |
|---|---|
| Where does the context for each prompt come from? | The retrieval pipeline's shape |
| Who owns each upstream source? | The ownership boundary |
| How often is each source refreshed? | Freshness debt — stale data in production |
| What guarantees does each source make about its shape? | Drift exposure on the data side |
| When did the indexer last change? | Recent changes you may not have noticed |
| Are there tests on the context shape that the prompt receives? | Usually no — the gap to address |
| Does the audit log record the context that was assembled? | Without it, drift is invisible |
The output is a data-side companion to the code audit. It feeds the next sections.
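The "tests on the context shape" gap from the table can be closed incrementally. The following is a minimal sketch, not the system's actual contract: the source names are taken from this chapter's examples, while the field names, `EXPECTED_SHAPES`, and `shape_violations` are hypothetical and would come from the conversation with each source's owner.

```python
from typing import Any

# Hypothetical shape contract: required field name -> expected Python type.
# Field names are invented for illustration; real ones come from the source owners.
EXPECTED_SHAPES: dict[str, dict[str, type]] = {
    "product-index": {"product_id": str, "title": str, "price": float},
    "customer-history": {"customer_id": str, "event": str},
}


def shape_violations(source: str, records: list[dict[str, Any]]) -> list[str]:
    """Return human-readable violations; an empty list means the shape holds."""
    expected = EXPECTED_SHAPES[source]
    problems: list[str] = []
    for i, record in enumerate(records):
        for field, field_type in expected.items():
            if field not in record:
                problems.append(f"{source}[{i}]: missing field '{field}'")
            elif not isinstance(record[field], field_type):
                actual = type(record[field]).__name__
                problems.append(
                    f"{source}[{i}]: field '{field}' is {actual}, "
                    f"expected {field_type.__name__}"
                )
    return problems
```

Run against a sample of fetched records in CI or on a schedule, a check like this turns a silent upstream shape change into a named failure before it reaches the prompt.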
Capturing the context in the audit
A high-leverage early step: extend the per-call audit (chapter 06) to capture the assembled context — what the model actually saw. Without this, every context-related investigation is blind.
{
  "audit_id": "aud_...",
  "context_used": {
    "sources": [
      { "name": "product-index", "fetched_at": "2026-05-25T11:14:01Z", "count": 12, "version": "idx-v42" },
      { "name": "customer-history", "fetched_at": "2026-05-25T11:14:02Z", "count": 8 }
    ],
    "assembled_size_chars": 4820,
    "assembled_hash": "sha256:..."
  }
}
The sample of fully-captured calls (chapter 06 layer 4) includes the assembled context itself, with redaction. Privacy applies: the context may contain user data; the same disciplines from module 19 chapter 11 govern its retention and access.
After this is in place, context-related questions become answerable: "for this complaint, what data did the model see?" The answer is a query.
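One way to produce the context-used record at call time is sketched below. It assumes the context builder can report, per source, what it fetched; `SourceFetch` and `build_context_used` are hypothetical names, and only the field layout follows the JSON example above.

```python
import hashlib
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceFetch:
    """Per-source fetch metadata (hypothetical type; fields mirror the JSON example)."""
    name: str
    fetched_at: str  # ISO 8601 timestamp, UTC
    count: int
    version: Optional[str] = None  # not every source exposes a version


def build_context_used(sources: list[SourceFetch], assembled: str) -> dict:
    """Build the per-call audit field: metadata plus a hash, never the raw context."""
    digest = hashlib.sha256(assembled.encode("utf-8")).hexdigest()
    return {
        "sources": [
            # Drop unset optional fields so the record stays compact.
            {key: value for key, value in asdict(src).items() if value is not None}
            for src in sources
        ],
        "assembled_size_chars": len(assembled),
        "assembled_hash": f"sha256:{digest}",
    }
```

Storing only the hash per call keeps user data out of the bulk audit log; two calls with the same hash saw the same context, and the sampled full captures cover the cases where the content itself is needed.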
Establishing data freshness SLAs
Each upstream source has an implicit freshness — how long after data changes does it appear in the prompt? Inherited systems usually have no explicit SLAs. Modernisation adds them.
data_sources:
  product-index:
    owner: catalogue-team
    refresh_cadence: nightly at 02:00 UTC
    expected_freshness: "<24h"
    monitored_by: gateway-team
    on_breach: alert + degrade to last-good
  customer-history:
    owner: customer-platform
    refresh_cadence: real-time (event-driven)
    expected_freshness: "<5min"
    monitored_by: gateway-team
    on_breach: alert; do not degrade (consuming stale data is acceptable here)
The SLAs become monitors. A freshness breach is a signal that the data the model sees may not be current; the on-call investigates.
Building the SLAs requires conversation with each source's owner. Modernisation forces these conversations; before modernisation, the conversations rarely happen.
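How an SLA becomes a monitor can be sketched as follows. This is an illustration under assumptions: the thresholds and breach actions are copied from the config above, while `FreshnessSla`, `SLAS`, `check_freshness`, and the idea that each source exposes a last-refreshed timestamp are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FreshnessSla:
    max_age: timedelta
    degrade_on_breach: bool  # True: fall back to last-good data as well as alerting


# Values mirror the data_sources config; in practice they would be loaded from it.
SLAS = {
    "product-index": FreshnessSla(timedelta(hours=24), degrade_on_breach=True),
    "customer-history": FreshnessSla(timedelta(minutes=5), degrade_on_breach=False),
}


def check_freshness(source: str, last_refreshed: datetime, now: datetime) -> str:
    """Return 'ok', 'alert', or 'alert+degrade' for one source.

    Both datetimes must be timezone-aware; mixing aware and naive values
    raises TypeError on subtraction.
    """
    sla = SLAS[source]
    if now - last_refreshed < sla.max_age:
        return "ok"
    return "alert+degrade" if sla.degrade_on_breach else "alert"


if __name__ == "__main__":
    now = datetime(2026, 5, 25, 12, 0, tzinfo=timezone.utc)
    stale_index = datetime(2026, 5, 24, 2, 0, tzinfo=timezone.utc)
    print(check_freshness("product-index", stale_index, now))
```

Passing `now` in explicitly keeps the check deterministic and easy to test; a scheduled job would supply the current UTC time.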
Wrapping the context builder behind an interface
The 900-line build_context function from the chapter's opening is a classic strangler target (chapter 07). Same procedure:
- Define the interface — inputs (the request, the user's identity) and outputs (the assembled context, with metadata about its sources).
- Wrap the legacy function as one implementation.
- Build the modern implementation with proper retrieval discipline, smaller pieces, and tests.
- Shadow the new against the legacy; compare contexts; iterate until matched.
- Cut over.
The comparison is harder than for deterministic code — two context builders may produce semantically equivalent but textually different contexts. The eval is the arbiter: do the model's outputs on the same input stay equivalent when the context shifts? If yes, the new context builder is acceptable.
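The first two steps and the shadowing step might look like the sketch below. The real signature of build_context is not given in this chapter, so the `(request, user_id) -> str` shape is an assumption, and `AssembledContext`, `ContextBuilder`, `LegacyContextBuilder`, and `ShadowingContextBuilder` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class AssembledContext:
    text: str
    sources: tuple[str, ...]  # metadata about where the context came from


class ContextBuilder(Protocol):
    """The interface: request and user identity in, assembled context out."""
    def build(self, request: str, user_id: str) -> AssembledContext: ...


class LegacyContextBuilder:
    """Wraps the existing function unchanged, as one implementation."""
    def __init__(self, legacy_fn: Callable[[str, str], str], sources: tuple[str, ...]):
        self._legacy_fn = legacy_fn
        self._sources = sources

    def build(self, request: str, user_id: str) -> AssembledContext:
        return AssembledContext(self._legacy_fn(request, user_id), self._sources)


class ShadowingContextBuilder:
    """Serves the legacy context; runs the candidate on the side and reports differences."""
    def __init__(
        self,
        legacy: ContextBuilder,
        candidate: ContextBuilder,
        on_mismatch: Callable[[str, AssembledContext, AssembledContext], None],
    ):
        self._legacy = legacy
        self._candidate = candidate
        self._on_mismatch = on_mismatch

    def build(self, request: str, user_id: str) -> AssembledContext:
        served = self._legacy.build(request, user_id)
        try:
            shadow = self._candidate.build(request, user_id)
        except Exception:
            # A candidate failure must never affect production traffic.
            # A real system would also record the failure.
            return served
        if shadow.text != served.text:
            self._on_mismatch(request, served, shadow)
        return served
```

A textual mismatch here is a trigger for review, not a verdict: recorded pairs can be replayed through the eval to decide whether the model's outputs stay equivalent.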
Retrieval-pipeline modernisation specifically
If the inherited system uses RAG (retrieval-augmented generation), the retrieval pipeline often needs its own modernisation track. Module 01_ai_engineering/08_rag_system_design and 09_advanced_rag_patterns cover the discipline; the modernisation work usually involves:
- Replacing keyword search with hybrid or dense retrieval, where appropriate
- Adding reranking, where the inherited system did naive top-k
- Establishing chunking discipline if the inherited system chunked ad-hoc
- Introducing query rewriting or expansion if the retrieval quality is poor
These are not modernisation-specific techniques; they are the standard RAG disciplines applied to a system that needs them. The modernisation aspect is applying them without breaking what works. The eval backstop is the safety net; the strangler pattern is the migration mechanism.
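As an illustration of the eval acting as a gate on a retrieval change, the sketch below compares per-case eval scores for the current and the candidate pipeline. The score scale (0 to 1), both thresholds, and the names `GateResult` and `gate_retrieval_change` are assumptions made for the example; the eval itself is whatever chapter 03 established.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GateResult:
    baseline_mean: float
    candidate_mean: float
    regressed_cases: list[str]
    passed: bool


def gate_retrieval_change(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_mean_drop: float = 0.02,  # illustrative threshold, not a recommendation
    max_case_drop: float = 0.2,   # illustrative threshold, not a recommendation
) -> GateResult:
    """Pass only if the candidate holds overall quality and breaks no single case badly."""
    if not baseline:
        raise ValueError("eval set is empty")
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same eval cases")
    baseline_mean = mean(baseline.values())
    candidate_mean = mean(candidate.values())
    regressed = sorted(
        case for case in baseline
        if baseline[case] - candidate[case] > max_case_drop
    )
    passed = (baseline_mean - candidate_mean) <= max_mean_drop and not regressed
    return GateResult(baseline_mean, candidate_mean, regressed, passed)
```

Checking individual cases as well as the mean matters for "without breaking what works": a reranker can raise the average while ruining a handful of queries that used to succeed.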
When training data is in scope
If the system uses fine-tuned models, the training data is part of the system. The inherited state often includes:
- A frozen training set whose provenance is partly forgotten
- A fine-tuned model that runs in production but cannot be retrained because the training pipeline is broken
- In-context learning examples baked into prompts whose source is a now-stale CSV
The modernisation work for training data is its own subject (module 00_ai_foundation/06_adaptation_compression covers the technical side; the modernisation side is largely about making provenance explicit and the training pipeline operable). For most inherited production systems, this is a later track — the urgent work is the retrieval and context side.
Common mistakes on the data side
Treating data debt as "not our problem." The agent team often considers data debt to be the platform team's or the catalogue team's. The user does not care about ownership boundaries; they see the agent's output. The agent team has to advocate for the data fixes that affect the user's experience.
Fixing the model when the data is the problem. A degradation that looks like a model regression is sometimes a stale-data or shape-drift regression. The data-side audit catches this.
Skipping context capture in the audit. Without recording what context the model saw, every data investigation is guesswork.
Modernising retrieval before establishing the eval. A retrieval change is a behaviour change. The eval (chapter 03) gates it the same way it gates any other change.
Doing too much at once. Retrieval changes, context-builder refactor, training-data overhaul — three simultaneous tracks is a tangle. Pick one; finish; move to the next.
What this chapter does not solve
- Bad source data. If the underlying catalogue is incomplete or wrong, no retrieval discipline produces a good context. That is the catalogue team's problem.
- Mismatched ontologies. If two upstream sources use incompatible categorisations, the context builder has to reconcile — sometimes there is no clean reconciliation. The modernisation surfaces the problem; the cross-team fix is not the agent team's solo work.
- Latency in the data pipeline. If a source is genuinely slow, no agent-side change makes it faster. The fix is in the source's own engineering.
Interview Q&A
Q1. The agent's recommendation quality has been slowly degrading for months. The model has not changed; the prompts have not changed. Where do you look?
The data side. Pull the audit's context-used field for recent calls and compare it to calls from six months ago. Check: are sources fresh? Has any source's shape changed (new fields, removed fields, changed semantics)? Has the catalogue indexer's output diverged from what the context builder expects? Are there cases where a source is silently failing and the context proceeds without it? The model and prompt are fixed; the data the model sees is the variable. Investigation flows from the context-used audit to the upstream sources.
Wrong-answer notes: "switch to a better model" is the most common wrong instinct; the data investigation usually finds the real cause.
Q2. What is the first thing you add to the audit for data-side observability?
The context-used field: sources, fetch timestamps, counts, versions, assembled size, and a hash of the assembled context. With this, every call's data inputs are traceable. Without it, you can investigate a model output but not the data the model saw. The capture is per-call; the sample (1–5%) also stores the full assembled context with redaction for deeper review.
Wrong-answer notes: "trace IDs" is necessary but does not answer "what data did the model see"; the context-used field is the specific addition.
Q3. The context builder is a 900-line function. Strangler it or rewrite it?
Strangler. Define the interface from its current inputs and outputs. Wrap the legacy function as one implementation. Build the modern implementation. Shadow it; compare contexts; use the eval to verify that model behaviour holds. Cut over. The 900-line function is too entangled to rewrite without losing some behaviour the team did not realise it depended on. The strangler pattern preserves that behaviour.
Wrong-answer notes: "rewrite is cleaner" repeats the big-bang anti-pattern from chapter 07.
Q4. The retrieval pipeline is owned by another team that is hard to coordinate with. How do you make data-side modernisation possible?
Two moves. First, establish the freshness and shape SLAs with the team; they may not know what you depend on. Negotiating the SLA forces the conversation. Second, capture data-side observability on your own side (the context-used audit, the source-version tracking) so you can detect their changes even when they do not communicate them. The cross-team work is now data-informed: "your source's shape changed yesterday; here is the impact on our system." The data shifts the conversation from "your team is unreliable" to "here is a specific change with a measurable effect."
Wrong-answer notes: "escalate to leadership" is sometimes necessary but does not produce the engineering substrate that makes the next collaboration easier.
What to do differently after reading this
- Extend the audit to capture context-used. Per-call metadata; sample for full content.
- Inventory upstream data sources; establish freshness and shape SLAs with each owner.
- Strangler the context builder behind an interface; do not rewrite.
- When investigating quality degradation, look at data freshness and shape before re-tuning the model.
- Treat retrieval-pipeline modernisation as its own track, gated by the eval, applied via strangler.
Bridge. Models and data are stabilised. The technical work is on a clear track. The remaining modernisation challenge is non-technical — the stakeholders who care about the system. The PM wants a roadmap. The customer wants to know what is changing. The on-call wants to know what to expect. The executive wants progress they can defend. The next chapter is the stakeholder-management discipline that makes the technical work durable. → 10-stakeholder-management.md