
01. Train-production gap - the model passed the wrong exam

~20 min read. The first production skill in classical ML is learning what your offline number actually measured.

Builds on 00-eli5.md. The module's core contract is data shape -> model shape -> metric -> threshold -> production decision. This chapter focuses on the first break in that contract: the training rows are not the production world.


What the overview gave us before the first model fails

The overview gave us the operating picture: classical ML turns historical rows into future decisions. The model sees features, learns a pattern, emits a score or prediction, and some product threshold turns that score into an action. That sounds safe until you ask one uncomfortable question: did the historical rows test the same situation the model will face after launch?

Three ideas from the overview come back immediately:

  1. Data shape decides what the model is allowed to learn.
  2. Metric alignment decides what success appears to mean.
  3. Deployment threshold decides who pays when the score is wrong.

This file teaches the first audit: before you ask whether the algorithm is powerful enough, ask whether the offline exam was legal, representative, and computed the same way as production.

What this file solves

A fraud model reports 99.89% validation accuracy, ships, and still misses the fraud cases the business cares about. This file shows how to inspect the training-production gap through three concrete failure modes: label leakage, distribution shift, and training-serving skew. The artifact to demand is not a model name; it is a row-level feature audit, split design, and production-vs-training comparison.

The first warning sign is a number that is too clean

Start with a launch review for a card-fraud classifier.

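Stage           | Rows      | Fraud rate | Reported accuracy | Fraud recall
----------------|-----------|------------|-------------------|-------------
Training        | 8,000,000 | 0.12%      | 99.92%            | 96%
Validation      | 1,000,000 | 0.11%      | 99.89%            | 94%
First live week | 900,000   | 0.18%      | 99.74%            | 41%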

The dashboard is confusing. Accuracy still looks high, but fraud recall collapsed. Business users do not complain about accuracy; they complain that stolen cards are getting through and legitimate customers are being blocked.
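The arithmetic behind that confusion can be rebuilt from the first-live-week row of the table. This is a minimal sketch, assuming the reported rates are exact and that accuracy weights every transaction equally; the function name and dictionary keys are illustrative, not part of any real dashboard.

```python
def confusion_counts(rows, fraud_rate, accuracy, recall):
    """Rebuild approximate confusion-matrix counts from dashboard rates."""
    fraud = rows * fraud_rate           # actual fraud cases
    caught = fraud * recall             # true positives
    missed = fraud - caught             # false negatives
    errors = rows * (1 - accuracy)      # every wrong decision, of either kind
    false_alarms = errors - missed      # false positives
    return {"fraud": fraud, "caught": caught,
            "missed": missed, "false_alarms": false_alarms}


# First live week, taken from the launch-review table.
live = confusion_counts(rows=900_000, fraud_rate=0.0018,
                        accuracy=0.9974, recall=0.41)

# Accuracy of a "model" that approves every transaction.
always_approve_accuracy = 1 - 0.0018

print({name: round(count) for name, count in live.items()})
print(f"always-approve accuracy: {always_approve_accuracy:.2%}")
```

Under these assumptions the live week contains about 1,620 fraud cases, of which roughly 956 were missed, plus roughly 1,384 false alarms. Approving every transaction would score about 99.82% accuracy, which is higher than the reported 99.74%. That is why accuracy says almost nothing here.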

A smart engineer might say, "The model overfit. Add regularization." That might be true later. It is not the first diagnosis. The root cause may be simpler and more dangerous: the offline rows did not match the live decision. Maybe the training table included a field only known after investigation. Maybe live fraud changed after an attacker adapted. Maybe the online feature pipeline computed merchant history differently from the offline pipeline.

So the question becomes: what changed between the row the model practiced on and the row it served?

When the answer was hidden inside the input

Here is the smallest version.

At training time:
card_id, merchant_id, amount, country, chargeback_status_7d -> label: fraud

At prediction time:
card_id, merchant_id, amount, country    (chargeback_status_7d is not known yet)

If chargeback_status_7d is present in training, the model can learn the answer after the fact. In production, that field is blank or stale when the decision must be made.

Rule: every feature must be knowable at the moment of prediction

Point-in-time callout. The primitive is a timestamped fact. The constraint is that production decisions happen before many useful facts exist. The training set must join features as of the prediction timestamp, not as of the day the data scientist built the table.


1) Build the row as production would have seen it

The train-production gap begins with time. A training table is often built after the world has already unfolded. A production model acts before the outcome is known.

That difference is easy to miss because historical data feels complete. You query a warehouse table and every row is rich: user profile, merchant risk, investigation outcome, refund status, final settlement, support tags, chargeback flags. The model does not know which columns arrived before the decision and which arrived afterward. It only sees correlation.

The engineer has to enforce the timeline.

t0: user swipes card
t1: model must approve, decline, or challenge
t2: merchant settles
t3: user disputes
t4: investigation confirms fraud
t5: warehouse row contains final chargeback outcome

At t1, the model may use amount, merchant, country, device, account age, and prior behavior computed before the swipe. It may not use dispute status, chargeback outcome, manual investigation tags, or any aggregate that accidentally includes events after t1.
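The timeline can be turned into a mechanical check. The sketch below assumes each candidate feature is tagged with the stage at which it first exists; the feature-to-stage mapping is an illustrative assumption based on the t0..t5 list above, not a schema from a real system.

```python
# Stage index (t0..t5) at which each fact first exists.
# This mapping is a hypothetical example.
KNOWN_AT_STAGE = {
    "amount": 0,
    "country": 0,
    "device": 0,
    "account_age": 0,
    "prior_30d_declines": 0,
    "dispute_status": 3,
    "manual_investigation_tag": 4,
    "chargeback_outcome": 5,
}

DECISION_STAGE = 1  # t1: model must approve, decline, or challenge


def split_legal_features(known_at_stage, decision_stage=DECISION_STAGE):
    """Separate features that exist before the decision from those that do not."""
    legal = sorted(f for f, stage in known_at_stage.items() if stage < decision_stage)
    illegal = sorted(f for f, stage in known_at_stage.items() if stage >= decision_stage)
    return legal, illegal


legal, illegal = split_legal_features(KNOWN_AT_STAGE)
print("legal:  ", legal)
print("illegal:", illegal)
```

A stage tag per column is crude, but it forces the question to be answered for every feature instead of only for the ones that look suspicious.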

The mechanism is a point-in-time training row:

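-- Shape, not production SQL.
-- Each feature join keeps only the latest snapshot at or before the transaction,
-- so one transaction yields one row and no feature looks past the decision.
SELECT
  tx.transaction_id,
  tx.amount,
  tx.country,
  merchant_features.risk_score_as_of_tx_time,
  card_features.prior_30d_declines_as_of_tx_time,
  labels.confirmed_fraud_later   -- label: allowed to arrive after the decision
FROM transactions tx
JOIN merchant_features
  ON merchant_features.merchant_id = tx.merchant_id
 AND merchant_features.feature_time = (
       SELECT MAX(mf.feature_time)
       FROM merchant_features mf
       WHERE mf.merchant_id = tx.merchant_id
         AND mf.feature_time <= tx.transaction_time)
JOIN card_features
  ON card_features.card_id = tx.card_id
 AND card_features.feature_time = (
       SELECT MAX(cf.feature_time)
       FROM card_features cf
       WHERE cf.card_id = tx.card_id
         AND cf.feature_time <= tx.transaction_time)
JOIN labels
  ON labels.transaction_id = tx.transaction_id;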

The label can be learned later. The features must be knowable then. That separation is the first guardrail.
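The as-of rule can also be sketched outside SQL. The example below assumes feature snapshots are stored as (feature_time, value) pairs sorted by time; the snapshot values and the integer timestamps are hypothetical.

```python
from bisect import bisect_right


def as_of(snapshots, when):
    """Return the latest snapshot value with feature_time <= when, else None.

    snapshots: list of (feature_time, value) pairs sorted by feature_time.
    """
    times = [t for t, _ in snapshots]
    i = bisect_right(times, when)      # number of snapshots at or before `when`
    return snapshots[i - 1][1] if i else None


# Hypothetical merchant risk snapshots: (feature_time, risk_score).
merchant_risk = [(10, 0.02), (20, 0.05), (30, 0.40)]

print(as_of(merchant_risk, 25))   # uses the snapshot written at time 20
print(as_of(merchant_risk, 5))    # no snapshot existed yet
```

A transaction at time 25 must not see the 0.40 score written at time 30, even though the warehouse holds it by the time the training table is built.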

2) The gap has three different shapes

"Training was great, production is bad" is not one bug. It is a symptom shared by several bugs.

                 offline training                  live production

label leakage    feature contains future truth  ->  truth unavailable at decision time

distribution     historical row mix             ->  new users, attackers, market, season,
shift            differs from served row mix        product, policy, or geography

serving skew     feature computed one way        ->  same named feature computed another way

These failures need different fixes.

Leakage is fixed by time-safe feature construction and feature review. Shift is handled by split design, monitoring, fresh labels, retraining, reweighting, or product guardrails. Serving skew is fixed by shared feature definitions, feature stores, parity tests, and shadow scoring.

If you collapse all three into "overfitting," you reach for the wrong tool. Reducing model capacity does not remove future information. Adding data does not make stale serving code match offline code. Retraining does not help if attackers changed the problem faster than labels arrive.

3) Priya debugs the fraud model launch

Priya owns the first-week incident review for the fraud model. The model was a gradient-boosted tree trained on millions of transactions. The launch looked safe.

Attempt A - tune the model harder

The team first tries the usual ML moves. They lower tree depth. They increase regularization. They tune the classification threshold. Offline validation still looks strong. Live fraud recall stays weak.

That failure is useful. It tells Priya the model family may not be the root cause. The same offline validation set keeps rewarding the same hidden assumption.

She asks for a production row and the matching offline row.

Feature                      | Offline training value | Live serving value | Why it matters
-----------------------------|------------------------|--------------------|---------------
merchant_chargeback_rate_30d | 8.4%                   | 1.1%               | Offline job included chargebacks posted after the transaction.
device_seen_before           | true                   | false              | Online identity graph lagged by 20 minutes.
merchant_category_id         | 5812                   | unknown            | New merchant category missing from online encoder.
manual_review_flag           | true                   | null               | Human-review field existed only after model decision.

Now the shape is visible. The offline row was not the live row.
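A row comparison like Priya's can be automated as a parity check. This is a minimal sketch, assuming both pipelines can emit a feature dictionary for the same transaction; the helper names are illustrative, and the matching `amount` value is added only for contrast.

```python
import math


def values_match(a, b, rel_tol=1e-6):
    """Compare two feature values, tolerating tiny float differences."""
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def parity_diff(offline_row, online_row):
    """Return {feature: (offline_value, online_value)} for every disagreement."""
    diffs = {}
    for name in sorted(set(offline_row) | set(online_row)):
        off, on = offline_row.get(name), online_row.get(name)
        if not values_match(off, on):
            diffs[name] = (off, on)
    return diffs


offline = {"amount": 470.00, "merchant_chargeback_rate_30d": 0.084,
           "device_seen_before": True, "merchant_category_id": "5812",
           "manual_review_flag": True}
online = {"amount": 470.00, "merchant_chargeback_rate_30d": 0.011,
          "device_seen_before": False, "merchant_category_id": "unknown",
          "manual_review_flag": None}

diffs = parity_diff(offline, online)
for name, (off, on) in diffs.items():
    print(f"{name}: offline={off!r} online={on!r}")
```

The check only says that the values differ. Deciding whether the cause is leakage, lag, or a missing encoder entry still takes the kind of row-level reading shown in the table.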

Attempt B - audit the row boundary

Priya changes the review question from "Which algorithm is best?" to "Could this exact feature value exist before the decision?"

For every top feature, she records:

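Feature                      | Available at decision time? | Same offline and online code? | Monitored for drift? | Keep?
-----------------------------|-----------------------------|-------------------------------|----------------------|---------------
amount                       | yes                         | yes                           | yes                  | yes
merchant_chargeback_rate_30d | yes, if point-in-time       | no                            | partial              | rebuild
manual_review_flag           | no                          | no                            | no                   | remove
device_age_hours             | yes                         | yes                           | yes                  | yes
country_risk_score           | yes                         | yes                           | no                   | keep + monitor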

This table does more than clean features. It changes ownership. Data engineering owns point-in-time aggregates. Platform owns offline-online parity. Risk owns delayed labels and threshold cost. ML owns the model after the row is honest.
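The audit can be kept as data so the verdict is reproducible. The rule order below is one reading inferred from the five rows of the table, not a stated policy, and it treats "partial" monitoring as not yet monitored.

```python
def audit_decision(available, same_code, monitored):
    """Map the three audit answers to a verdict (rule order is an assumption)."""
    if not available:
        return "remove"            # cannot exist at decision time
    if not same_code:
        return "rebuild"           # offline and online definitions disagree
    if not monitored:
        return "keep + monitor"    # honest feature, but nobody is watching it
    return "keep"


# (available at decision time, same offline/online code, monitored for drift)
AUDIT = {
    "amount": (True, True, True),
    "merchant_chargeback_rate_30d": (True, False, False),  # "partial" treated as False
    "manual_review_flag": (False, False, False),
    "device_age_hours": (True, True, True),
    "country_risk_score": (True, True, False),
}

verdicts = {name: audit_decision(*answers) for name, answers in AUDIT.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```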

4) Why a random validation split loses to a time-aware split

The tempting alternative is a random train/test split. It is easy, statistically familiar, and often fine for stable textbook datasets.

It fails when production is time-ordered.

If transactions from Monday through Sunday are randomly mixed, the validation set can contain the same attacker campaign, same merchant rollout, same holiday pattern, and same data bug as training. The model appears to generalize because the exam shares the same leak.

A time-aware split asks the production-shaped question:

train:      Jan -> Mar
validation: Apr
test:       May
launch:     Jun

This is not always enough. If the product launches in a new country in June, May still may not represent June. But time-aware splitting is the baseline because production always arrives after training, never randomly interleaved with it.

Use random splits when rows are genuinely exchangeable. Use grouped splits when the same user, merchant, listing, patient, or device can appear many times. Use time splits when future behavior is the deployment target. Use backtesting when the model will be repeatedly retrained over time.
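Two of those split designs can be sketched in a few lines. The example assumes each row is a dictionary with a numeric `time` field and an entity ID; the field names, cutoffs, and holdout percentage are illustrative.

```python
import hashlib


def time_split(rows, train_end, valid_end):
    """Train before train_end, validate on [train_end, valid_end), test after."""
    train = [r for r in rows if r["time"] < train_end]
    valid = [r for r in rows if train_end <= r["time"] < valid_end]
    test = [r for r in rows if r["time"] >= valid_end]
    return train, valid, test


def grouped_split(rows, key, holdout_percent=20):
    """Send every row of an entity to the same side via a stable hash of its ID."""
    train, valid = [], []
    for r in rows:
        digest = hashlib.sha256(str(r[key]).encode()).hexdigest()
        bucket = int(digest, 16) % 100
        (valid if bucket < holdout_percent else train).append(r)
    return train, valid


# Hypothetical rows: 7 cards, 40 transactions in time order.
rows = [{"card_id": f"c{i % 7}", "time": i} for i in range(40)]

t_train, t_valid, t_test = time_split(rows, train_end=20, valid_end=30)
g_train, g_valid = grouped_split(rows, key="card_id")
```

Hashing the entity ID instead of shuffling rows keeps the assignment stable across reruns, so a card never drifts from train to validation when new rows are added. With only a handful of entities, the holdout side can be small or even empty.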

5) Delayed labels change what "live accuracy" can mean

Fraud labels arrive late. So do churn labels, loan default labels, house sale prices, support escalation labels, and medical outcomes.

That delay creates a monitoring trap. In the first hour after launch, you may know latency, score distribution, feature null rates, and decision counts. You do not yet know the true fraud recall.

So the first production signals are proxies:

Signal                     | Arrives fast? | What it catches                     | What it cannot prove
---------------------------|---------------|-------------------------------------|---------------------
Feature null rate          | yes           | pipeline break, unknown categories  | model correctness
Feature distribution drift | yes           | served rows differ from training    | true label quality
Score distribution         | yes           | model output shifted or collapsed   | business correctness
Human review overturn rate | medium        | obvious bad decisions               | full recall
Confirmed fraud recall     | slow          | true model performance              | early alerting

The lead-level habit is pairing early proxy signals with delayed truth. You do not wait three months to notice a broken feature, but you also do not declare success from clean proxy dashboards.
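Two of the fast proxies can be computed without any labels. The sketch below uses a null rate and the population stability index (PSI); PSI is one common drift score chosen here for illustration, not something the chapter prescribes, and the bin edges and sample values are hypothetical.

```python
import math


def null_rate(values):
    """Fraction of served values that are missing."""
    return sum(v is None for v in values) / len(values)


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two numeric samples.

    edges: sorted bin boundaries shared by both samples.
    Empty bins are floored at eps so the logarithm stays defined.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1   # index of the bin holding v
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train_amounts = [10, 20, 30, 40, 50, 60, 70, 80]
live_amounts = [60, 70, 80, 300, 400, 470, 500, 650]
edges = [25, 50, 100]

print("PSI train vs train:", psi(train_amounts, train_amounts, edges))
print("PSI train vs live: ", psi(train_amounts, live_amounts, edges))
print("null rate:", null_rate(["d_918", None, None, "d_204"]))
```

A rising PSI says the served rows moved away from the training rows. It does not say whether the model's decisions got worse; that still waits on delayed labels.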

6) The three failure modes under one row

Take one transaction:

tx_id: tx_8317
time: 2026-05-22 09:10:03
amount: 470.00
merchant: m_44
country: IN
device_id: d_918
decision needed by: 09:10:03.150

Now watch the three gaps.

Leakage. Offline training includes chargeback_filed_within_7d = true. That field is almost the label. It cannot exist at 09:10:03.150.

Distribution shift. Training saw mostly domestic grocery and fuel transactions. Production now includes a new travel-card partner with cross-border hotel transactions. The input mix changed.

Serving skew. Offline code computes merchant_risk_30d from settled transactions. Online code computes it from authorizations. Same feature name, different population.

The row did not just "perform badly." It asked the model a different question from the one the model practiced.
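The serving-skew case is easy to reproduce in miniature. The sketch assumes, purely for illustration, that merchant_risk_30d is the share of a merchant's recent transactions that were later disputed; the chapter does not define the feature, and the history below is invented.

```python
def disputed_share(transactions):
    """Share of the given transactions that were later disputed."""
    return sum(t["disputed"] for t in transactions) / len(transactions)


# Hypothetical 30-day history for merchant m_44.
# Every row is an authorization; only some of them settled.
history = [
    {"settled": True, "disputed": False},
    {"settled": True, "disputed": False},
    {"settled": True, "disputed": True},
    {"settled": True, "disputed": False},
    {"settled": False, "disputed": False},
    {"settled": False, "disputed": False},
    {"settled": False, "disputed": False},
    {"settled": False, "disputed": False},
]

# Offline job: settled transactions only.
offline_value = disputed_share([t for t in history if t["settled"]])
# Online job: every authorization.
online_value = disputed_share(history)

print("offline merchant_risk_30d:", offline_value)
print("online  merchant_risk_30d:", online_value)
```

The formula is identical in both jobs. Only the population differs, and that alone halves the value the model sees at serving time.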

7) The tradeoff: stricter validation hurts pride before it saves production

A stricter split usually makes offline numbers worse.

Validation design         | Fraud recall | False-positive rate | What the number means
--------------------------|--------------|---------------------|----------------------
Random split              | 94%          | 0.35%               | Rows are mixed with near-neighbors from training.
Grouped by card           | 81%          | 0.47%               | Same card cannot appear in train and validation.
Time split                | 68%          | 0.62%               | Future transactions differ from past transactions.
Time + new-merchant slice | 39%          | 1.4%                | Launch risk is concentrated in new merchants.

The lower number is not bad news. It is earlier news.

The cost moved from production surprise to offline discomfort. That is a good trade. You would rather argue about a weak backtest before launch than explain real customer harm after launch.

The tradeoff is that stricter validation can be pessimistic or noisy. A small future slice may underrepresent stable segments. A new-merchant slice may be intentionally hard. That is why you keep multiple views: broad validation for average behavior, stress slices for launch risk, and delayed labels for truth.

8) Signals that the row boundary is breaking

  • Healthy behavior: top feature distributions, null rates, category rates, and score distributions stay within expected bands for each launch slice.
  • First degrading metric: null or unknown-category rate jumps before business metrics can confirm harm.
  • Misleading beginner metric: aggregate accuracy, because rare failures and delayed labels can hide inside it.
  • Expert graph: fraud recall, false-positive rate, score distribution, and feature drift sliced by merchant age, geography, card age, device novelty, model version, and feature-pipeline version.

The expert graph matters because train-production gaps are rarely uniform. One region, tenant, merchant class, or new product path usually breaks first.
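Slicing is a small amount of code once labels arrive. The sketch below assumes each row records the true outcome, the model's decision, and one slice key; the field names and counts are made up to show how an acceptable aggregate can hide a weak slice.

```python
from collections import defaultdict


def recall_and_fpr(rows):
    """Fraud recall and false-positive rate for a list of labeled decisions."""
    tp = sum(r["fraud"] and r["blocked"] for r in rows)
    fn = sum(r["fraud"] and not r["blocked"] for r in rows)
    fp = sum((not r["fraud"]) and r["blocked"] for r in rows)
    tn = sum((not r["fraud"]) and (not r["blocked"]) for r in rows)
    recall = tp / (tp + fn) if tp + fn else None
    fpr = fp / (fp + tn) if fp + tn else None
    return recall, fpr


def by_slice(rows, key):
    """Compute the same metrics separately for each value of a slice key."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return {value: recall_and_fpr(group) for value, group in groups.items()}


def make(n, merchant_age, fraud, blocked):
    return [{"merchant_age": merchant_age, "fraud": fraud, "blocked": blocked}
            for _ in range(n)]


# Hypothetical labeled decisions.
rows = (make(9, "established", True, True) + make(1, "established", True, False)
        + make(1, "established", False, True) + make(99, "established", False, False)
        + make(2, "new", True, True) + make(3, "new", True, False)
        + make(2, "new", False, True) + make(18, "new", False, False))

overall = recall_and_fpr(rows)
sliced = by_slice(rows, "merchant_age")
print("overall:", overall)
print("sliced: ", sliced)
```

In this invented data, overall recall is about 73%, while the new-merchant slice sits at 40% with ten times the false-positive rate of established merchants.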

9) Where this lever works and where it becomes awkward

This chapter's lever is row-boundary discipline: make training rows look like production rows and make validation ask production-shaped questions.

It is a strong fit for tabular prediction systems: fraud, credit risk, pricing, ranking, churn, demand forecasting, lead scoring, recommendations, underwriting, quality inspection, and support routing.

It becomes awkward when labels are subjective, shifting, or produced by the model itself. In those cases you still need the row-boundary habit, but you also need eval design, human review, rubric stability, and policy ownership.

It can also become expensive. Point-in-time feature stores, backfills, parity tests, and delayed-label pipelines take engineering effort. That effort is justified when a wrong decision has real cost. It may be overkill for a low-risk internal toy model.

10) Wrong mental model: high validation means production-safe

The seductive mistake is treating validation as a blessing from statistics.

Validation is only useful if it resembles the deployment question. A random split with leaked features is not a production test. It is a faster way to fool yourself. A high score on stale rows is not evidence of future safety. It is evidence that the model solved that particular offline exam.

Replace the mental model:

Old: "The model has 99% validation accuracy."
New: "The model has 99% on this split, with these feature timestamps,
      under this row mix, with this label delay, for this threshold."

That longer sentence is not bureaucracy. It is the minimum context needed to decide whether the number deserves trust.

11) Other failure shapes you will recognize

  1. Entity leakage: the same user, merchant, house, patient, or document appears in both train and validation.
  2. Target encoding leakage: category statistics are computed using the full dataset before splitting.
  3. Time-window leakage: a rolling aggregate accidentally looks beyond the prediction timestamp.
  4. Policy shift: a business rule changes who gets reviewed, so labels after the change mean something different.
  5. Selection bias: training labels exist only for cases a previous system chose to inspect.
  6. Cold-start category failure: production contains new merchants, neighborhoods, SKUs, or devices missing from training.
  7. Encoder drift: offline and online category mappings assign different IDs to the same value.
  8. Silent fallback failure: serving replaces missing features with defaults, making the model confidently wrong.

12) Cross-topic reinforcement - the same pressure returns

  • Bias and variance separate "bad row boundary" from "model shape too weak or too flexible."
  • Regularization helps only after the training and validation exam is honest.
  • Feature engineering must preserve point-in-time availability, not just predictive power.
  • Evaluation and cross-validation formalize the split choices introduced here.
  • Metrics and calibration ask whether the score and threshold match the production cost.
  • Class imbalance shows why aggregate accuracy can hide the exact cases the product cares about.

13) Design-review questions that catch shallow validation

  1. Could every feature value be known at the exact moment of prediction?
  2. Can the same entity appear in both train and validation through a different row?
  3. Does the validation split reflect the future, group, geography, or product slice the model will serve?
  4. Can offline and online feature code produce different values for the same logical feature?
  5. Which production proxy signal arrives before true labels, and which delayed label later confirms it?

Where this shows up in production

  • Fraud detection: chargebacks arrive after authorization, attackers adapt, and review policy changes labels.
  • House pricing: market regimes shift, sale prices are delayed, and neighborhood aggregates can leak future sales.
  • Credit underwriting: rejected applicants have missing repayment labels, creating selection bias.
  • Search ranking: old click logs encode the previous ranker, not neutral user preference.
  • Churn prediction: "customer will churn" labels depend on retention campaigns that changed over time.
  • Demand forecasting: holidays, promotions, stockouts, and new regions break random validation.
  • Healthcare risk scoring: diagnosis codes arrive after care decisions and coding practices drift.
  • Ad targeting: privacy changes, auction dynamics, and creative fatigue shift the live population.
  • Support routing: label quality depends on how agents tag cases and which cases are escalated.
  • Recommendation systems: exposure bias means the model learns from items users were allowed to see.

Recall - rebuild the gap from memory

  1. What is the difference between a feature and a label in a point-in-time training row?
  2. Why can a model with high validation accuracy still be useless in production?
  3. Name the three train-production gaps and give one concrete example of each.
  4. Why does a time split often reveal risk that a random split hides?
  5. Which production signals arrive before delayed labels?
  6. Why does "add regularization" fail as the first response to a train-production gap?
  7. What table would you ask for in a launch review?
  8. State the chapter rule without using the word "leakage."

Interview Q&A

Q: Training accuracy is 99%, but production recall collapses. What is your first move?

A: I do not start by changing the algorithm. I compare production rows against training rows and audit the feature timeline. First I check whether any top feature used future information. Then I compare train and live feature distributions for shift. Then I run the same sample through offline and online feature code to detect serving skew.

Common wrong answer to avoid: "It is overfitting, so add regularization." Regularization can help variance. It cannot fix future-looking features or mismatched feature pipelines.

Q: What is label leakage?

A: Label leakage means the training features contain information that would not be available at prediction time and is directly or indirectly caused by the outcome. In fraud, a chargeback or manual-review result is leakage if the model must decide before those facts exist.

Common wrong answer to avoid: "Leakage only happens when the label column is accidentally included." Leakage is often indirect through aggregates, timestamps, target encodings, or post-outcome workflow fields.

Q: How is distribution shift different from training-serving skew?

A: Distribution shift means the live population differs from the training population: new merchants, new users, new geography, new season, new attacker behavior. Training-serving skew means the same named feature is computed differently offline and online. One is a world mismatch; the other is a pipeline mismatch.

Common wrong answer to avoid: "Both mean the data changed, so retraining fixes both." Retraining may help shift, but skew needs feature parity.

Q: Why are delayed labels a production monitoring problem?

A: Because true performance may arrive days, weeks, or months after the decision. You need early proxy signals such as feature drift, null rates, score distribution, and human-review overturns, then delayed labels to confirm whether those proxies reflected real quality.

Common wrong answer to avoid: "Just monitor live accuracy." In many systems live accuracy is unavailable when you most need the alarm.

Q: When is a random validation split acceptable?

A: It is acceptable when rows are close to exchangeable: no strong time ordering, no repeated entity leakage, no launch slice that differs from history, and no production sequence that matters. Many real product datasets violate at least one of those conditions, so grouped or time-aware splits are often safer.

Common wrong answer to avoid: "Random split is always statistically best." Randomness is not the goal; deployment resemblance is the goal.

Design/debug exercise (10 min)

You inherit a churn model with 92% validation accuracy that saves few at-risk customers in production.

Open a blank page and draw a four-column audit:

Feature | Available before churn decision? | Same offline/online computation? | Drift or missing-rate monitor?

Fill five likely features: last login, support tickets, discount offered, cancellation reason, plan age. Mark which ones are illegal at prediction time. Then choose a validation split: random, grouped by customer, time-based, or time plus launch-region slice. Write one sentence explaining why your split resembles production.

If you cannot identify at least one suspicious feature and one split risk, you have not audited the train-production gap yet.

Operational memory

When a model fails after launch, slow down before touching hyperparameters. The first question is whether the offline exam matched the live decision. If not, every algorithm comparison was performed on the wrong problem.

Remember:

  • A feature is legal only if it exists at prediction time.
  • Validation is useful only when the split resembles deployment.
  • Leakage, shift, and serving skew are different failures with different fixes.
  • Early production monitoring uses proxy signals; delayed labels provide truth.
  • A lower but honest offline number is better than a beautiful number that production will expose.

Bridge. Once the row boundary is honest, the next failure is inside the model itself: it may be too simple to learn the signal or too flexible to ignore noise. That is the bias-variance tradeoff. -> 02-bias-variance.md