12. Evaluation and cross-validation — splits that mimic deployment

Three-part contract. Train teaches. Validation tunes. Test judges once. Get the split wrong, every metric lies.

Built on the ELI5 in 00-eli5.md. The prediction and confidence score are only honest if the split is honest. Otherwise overfitting looks like skill.


The sealed envelope

Picture three stacks of labeled transactions on the fraud analyst's desk.

  • Train. The big stack. Past transactions. The model reads these openly. Learns patterns. Adjusts weights.
  • Validation. A medium stack, sealed but openable. We check performance here while tuning. Open, peek, adjust, re-seal. Many times.
  • Test. A small sealed envelope in the drawer. Opened exactly once, on the day of the audit. After that, it is dead. You cannot use it again for any decision.

This is the picture. Everything in between — every hyperparameter sweep, every cross-validation, every threshold pick — happens on train + validation. Test is the audit.

Why so strict? Because the moment you peek at test and act on what you saw — switch a model, retune a threshold, drop a feature — the test set has quietly become validation. Your final number is now optimistic. The fraud model now looks safer on paper than it really is. This is the most common self-deception in applied ML.
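A minimal sketch of the three stacks with scikit-learn. The 700/150/150 sizes, the synthetic data, and the variable names are assumptions for illustration, not numbers from this chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled transactions: 1000 rows, roughly 10% fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.10).astype(int)

# First cut: seal 150 rows away as the test envelope.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0
)

# Second cut: split the remainder into train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=150, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Everything that follows in a project (sweeps, CV, threshold picks) would touch only X_train and X_val. X_test gets scored once, at the end.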


K-fold cross-validation — the rotating validation seat

Data is scarce. One val split of 150 rows is noisy. So we rotate.

In 5-fold CV, slice the train+val portion into 5 equal chunks. Train on 4. Validate on 1. Rotate. Repeat 5 times. Average the 5 validation scores.

5-fold CV layout (each row is one fold iteration)

         chunk1   chunk2   chunk3   chunk4   chunk5
Fold 1:  [ VAL ]  [TRAIN]  [TRAIN]  [TRAIN]  [TRAIN]
Fold 2:  [TRAIN]  [ VAL ]  [TRAIN]  [TRAIN]  [TRAIN]
Fold 3:  [TRAIN]  [TRAIN]  [ VAL ]  [TRAIN]  [TRAIN]
Fold 4:  [TRAIN]  [TRAIN]  [TRAIN]  [ VAL ]  [TRAIN]
Fold 5:  [TRAIN]  [TRAIN]  [TRAIN]  [TRAIN]  [ VAL ]

Mean of the 5 scores → expected performance. Spread of the 5 → stability. If they bounce wildly across folds, the model is sensitive to which transactions it saw — that is a variance clue, overfitting leaving fingerprints.

Worked numerical example — 100 samples, 5-fold

100 rows. K = 5. Each fold is 100 / 5 = 20 samples.

Iteration   Train indices     Val indices   Train size   Val size
Fold 1      21–100            1–20          80           20
Fold 2      1–20, 41–100      21–40         80           20
Fold 3      1–40, 61–100      41–60         80           20
Fold 4      1–60, 81–100      61–80         80           20
Fold 5      1–80              81–100        80           20

Each row gets used as validation exactly once. Each row gets used in training exactly four times. Five trained models. Five scores. Average them.
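The same bookkeeping can be sketched by hand in plain Python. The helper name kfold_indices is hypothetical, and it assumes the row count divides evenly by K. Rows are numbered from 1 to match the worked example.

```python
def kfold_indices(n_rows, k):
    """Yield (train, val) lists of 1-based row numbers for contiguous K-fold."""
    fold_size = n_rows // k  # assumes n_rows is divisible by k
    rows = list(range(1, n_rows + 1))
    for i in range(k):
        val = rows[i * fold_size:(i + 1) * fold_size]
        train = rows[:i * fold_size] + rows[(i + 1) * fold_size:]
        yield train, val

folds = list(kfold_indices(100, 5))
for number, (train, val) in enumerate(folds, start=1):
    print(f"Fold {number}: val {val[0]}-{val[-1]}, "
          f"train size {len(train)}, val size {len(val)}")
```

In practice sklearn.model_selection.KFold does this job with 0-based indices; the hand-rolled version is only here to make the rotation visible.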

Stratified K-fold — keep the class ratio honest

Imagine 100 rows, 10 fraud, 90 legit. A random 20-row fold could land 0 fraud and 20 legit. Now your validation has no positives — the score is meaningless.
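How likely is that empty fold? A quick check, assuming the 20 rows of one fold are drawn uniformly at random without replacement from the 100. That makes the count of fraud rows hypergeometric.

```python
from math import comb

n_rows, n_fraud, fold_size = 100, 10, 20

# All 20 fold rows must come from the 90 legit rows.
p_no_fraud = comb(n_rows - n_fraud, fold_size) / comb(n_rows, fold_size)
print(f"P(one given fold has zero fraud) = {p_no_fraud:.3f}")
```

Under these assumptions the chance comes out a little under one in ten for any single fold. That is far too often to ignore when there are five folds per run.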

Stratified K-fold forces each fold to keep the same class ratio: 2 fraud, 18 legit in every fold. The prediction is judged on representative transactions each time. This is why scikit-learn falls back to sklearn.model_selection.StratifiedKFold when you pass an integer cv to cross_val_score or GridSearchCV with a classifier.
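A small check of the 10-fraud, 90-legit example with StratifiedKFold. The feature matrix is a placeholder, since only the labels drive the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 100 rows: 10 fraud (1) and 90 legit (0).
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))  # placeholder features; the split depends on y only

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fraud_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(fraud_per_fold)
```

Each validation fold should carry 2 fraud rows out of 20, whichever rows the shuffle picks.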


Time-series CV — the future does not leak backward

Now the hard case. Your data has time. Predict tomorrow's fraud rate from yesterday's features. Predict next week's demand from last week's pattern.

Random K-fold here is a lie. Why? Because after shuffling, the training folds can hold Tuesday's rows while the validation fold holds Monday's. The model trained on the future and was scored on the past. Production never works that way.

So what to do? Roll the window forward in time.

Time-series CV: expanding window (train on the past, validate on the next chunk)

time →

Fold 1:  [TRAIN: weeks 1–4]   [VAL: week 5]
Fold 2:  [TRAIN: weeks 1–5]   [VAL: week 6]
Fold 3:  [TRAIN: weeks 1–6]   [VAL: week 7]
Fold 4:  [TRAIN: weeks 1–7]   [VAL: week 8]

Train always sits in the past. Validation always sits strictly after. This is what production looks like — the model trained on history is scored on the next chunk it has not seen.
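The diagram can be reproduced with sklearn.model_selection.TimeSeriesSplit. This sketch assumes one row per week, already sorted by time, so the indices map straight onto week numbers.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# One row per week keeps the mapping to the diagram obvious.
weeks = np.arange(1, 9)          # weeks 1..8, sorted by time
X = weeks.reshape(-1, 1)         # placeholder feature matrix

tscv = TimeSeriesSplit(n_splits=4)
splits = [(weeks[tr].tolist(), weeks[va].tolist()) for tr, va in tscv.split(X)]
for number, (train_weeks, val_weeks) in enumerate(splits, start=1):
    print(f"Fold {number}: train weeks {train_weeks[0]}-{train_weeks[-1]}, "
          f"val week {val_weeks[0]}")
```

With real data there are many rows per week, so the split would be built on sorted timestamps rather than on one row per period.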

"X breaks" — three attempts at time-series CV

Stripe-style fraud model. The model must predict tomorrow's chargebacks from today's swipes. Three honest attempts:

Attempt 1 — random K-fold. Shuffle all rows, split 5 ways. Train on 4 chunks, val on 1.

What happens? Validation score = 0.94 AUC. Looks great. Ship it. Production AUC = 0.71. The model memorized seasonal patterns from the future and "predicted" them in the past. Random K-fold leaks the future. Dies in production.

Attempt 2 — expanding window. Train on weeks 1–4, val on week 5. Train on weeks 1–5, val on week 6. And so on.

What happens? No future leakage. Honest split. But — fraud at hour 23 of week 4 is in train; fraud at hour 0 of week 5 is in val. The events are 1 hour apart. They share so much context (same fraudster, same merchant, same fingerprint) that val score is still optimistic. Better than random K-fold. Still hides risk.

Attempt 3 — rolling window with embargo. Train on weeks 1–4. Skip week 5 (the embargo). Validate on week 6.

What happens? The 1-hour adjacency is broken. The validation period is genuinely "the future the model has not seen and cannot peek at via shared session." This is the most realistic. Used by quant funds, fraud teams, demand forecasters at Uber. Score is lower than the other two — and it is the score that matches production.

The lesson. Random K-fold gives the prettiest number and the worst product. Rolling window with embargo gives the ugliest number and the most honest product.
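Attempt 3 can be sketched in plain Python. The helper rolling_embargo_splits is hypothetical, and it works on whole period numbers (weeks here) rather than on raw timestamps.

```python
def rolling_embargo_splits(n_periods, train_len, embargo, val_len=1):
    """Yield (train, val) lists of 1-based period numbers.

    Each fold trains on `train_len` consecutive periods, skips `embargo`
    periods, then validates on the next `val_len` periods.
    """
    start = 1
    while start + train_len + embargo + val_len - 1 <= n_periods:
        train = list(range(start, start + train_len))
        val_start = start + train_len + embargo
        val = list(range(val_start, val_start + val_len))
        yield train, val
        start += 1  # roll the whole window forward by one period

embargo_splits = list(rolling_embargo_splits(n_periods=8, train_len=4, embargo=1))
for train, val in embargo_splits:
    print(f"train weeks {train[0]}-{train[-1]}, embargo, val week {val[0]}")
```

Recent scikit-learn versions expose gap and max_train_size on TimeSeriesSplit, which can express a similar split on row indices. The right embargo length depends on how long shared context (session, device, merchant) persists in your data.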


Pause and recall. Without scrolling — what does the test envelope rule say? What is the size of one fold in 100-sample 5-fold CV? Why does random K-fold lie on time-series data? If any link is fuzzy, scroll up.


The deploy-mimic rule — your CV split must look like production

This is the one rule that fixes most evaluation bugs. Ask yourself: when this model runs in production, what does it see that it has never seen before? Then make your validation set look like that.

Three concrete cases.

Case 1 — customer-level CV (e-commerce, lending)

You predict whether a customer will churn. A random row-level split puts customer C-42's January order in train and her March order in val. The model has already seen her browsing pattern, her ZIP, her preferred brands. Validation is a memory test, not a generalization test.

Deploy-mimic fix. Use GroupKFold with group = customer_id. All of C-42's rows land in one fold. The model never sees her in train when scoring her in val. This mirrors production — production sees genuinely new customers.

Case 2 — session-level CV (search, ranking)

You rank items inside a session. A random row split puts click 1 of session S-7 in train and click 5 of the same session in val. The two clicks share the same query, same user mood, same time-of-day. Information leaks across the split.

Deploy-mimic fix. Group by session_id. Each session lives entirely in one fold.
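Both fixes use the same tool. A minimal sketch with sklearn.model_selection.GroupKFold, on a made-up table of 12 rows from 4 customers; swap customer_id for session_id and nothing else changes.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 rows from 4 customers, 3 rows each.
customer_id = np.array(["C-1"] * 3 + ["C-2"] * 3 + ["C-3"] * 3 + ["C-42"] * 3)
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=customer_id):
    train_customers = set(customer_id[train_idx].tolist())
    val_customers = set(customer_id[val_idx].tolist())
    overlaps.append(train_customers & val_customers)  # should stay empty
    print("val customers:", sorted(val_customers))
```

No customer should ever show up on both sides of a fold. If the overlap set is non-empty, the split is row-level in disguise.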

Case 3 — stratified-by-time CV (forecasting)

You predict next month's revenue. December has holiday-shopping spikes. If your val months are all July, you never test the model on December. The metric looks stable because it never sees the hard month.

Deploy-mimic fix. Make sure each fold spans a representative time slice — or run multiple time-series splits and report each separately. Do not hide December.

The principle. The validation distribution must equal the production input distribution. If your CV is easier than production, you are calibrating the fraud model on easy legit traffic only — and the prediction will collapse the day a fraud burst hits.


Nested cross-validation — tune inside, judge outside

When you tune hyperparameters inside ordinary CV and report that same score, the estimate is optimistic. Model selection already learned from those folds.

Nested CV fixes this with two loops. The inner loop tunes hyperparameters. The outer loop evaluates the whole tuning procedure on untouched folds.

So the outer score is the honest one: a nearly unbiased estimate for "train + tune + deploy this pipeline." In sklearn, wrap your pipeline in GridSearchCV (the inner loop), then pass that GridSearchCV object to sklearn.model_selection.cross_val_score (the outer loop).
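A minimal sketch of the two loops. The synthetic data, the logistic-regression model, and the C grid are assumptions chosen to keep the example small. The preprocessing pipeline is the estimator handed to GridSearchCV, and the GridSearchCV object is what cross_val_score evaluates.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; sizes and the C grid are illustrative.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: pick C using only the rows the outer loop hands over.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: score the whole "scale + tune + fit" procedure on untouched folds.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())
```

The number to report is the mean of outer_scores. The best inner score found by the search is the optimistic one.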


Where this lives in the wild

  • Netflix recommendation A/B-test setup. Offline CV uses temporal splits — train on history, validate on the most recent 7 days, then validate again on a held-out user cohort to mimic the new-subscriber distribution. The offline number is a sanity gate before the actual A/B test on live traffic.
  • Stripe Radar fraud. Time-aware splits with embargo. The model must predict tomorrow's chargebacks from today's swipes. Random K-fold would inflate AUC by 15+ points. They use rolling windows with a multi-day embargo to break session adjacency.
  • E-commerce churn (Shopify, Amazon). GroupKFold by customer_id. A single customer's behavior across months is correlated; splitting her across folds inflates accuracy. Group CV is the deploy-mimic rule.
  • Uber demand forecasting. Expanding-window time-series CV by city-hour. Each city is its own fold-set. The model is judged on its ability to predict the next hour given the past — exactly what dispatch needs.
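  • sklearn StratifiedKFold defaults. shuffle=False is the default, and it catches many beginners: when rows are sorted (by date, by customer, by label), each fold is one contiguous block and the scores are distorted. For iid data, pass shuffle=True with a fixed random_state explicitly. For time-ordered data, do not shuffle at all; use a time-aware split.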

Pause and recall. Without scrolling — what is the deploy-mimic rule in one sentence? Why is GroupKFold the right call for customer churn? What does the test envelope rule forbid? If any link is fuzzy, scroll up.


Common leakage modes — how the future sneaks backward

Leakage is when information from the validation/test set, or from the future, contaminates the training set. Beautiful metrics. Terrible products.

  • Scaler fit on full data. You fit StandardScaler on all 1000 rows, then split. The scaler's mean now contains validation rows. Fix — fit on train only, transform val/test with the trained scaler.
  • Target encoding leakage. You encode "city" by mean of target. You compute it on the full train+val. Now val's target leaks into its own feature. Fix — compute target encoding inside each fold using train rows only.
  • Duplicate rows across split. Same customer, different rows, same outcome. A random split puts some of those rows in train and the rest in val, so both sides see the answer. Fix: group split.
  • Lookahead features. Feature 30_day_avg_revenue computed at time t actually uses revenue from t+1 to t+30. Train sees the future. Fix — audit every feature's timestamp logic.
  • Label leakage from production pipeline. Feature was_refunded is filled in only after the chargeback decision — which is the label. The model "predicts" with 99.9% accuracy in offline test. In production, the feature is null. Fix — only use features available at prediction time.
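The first mode in the list is the easiest to demonstrate. A sketch with synthetic data, where the validation rows are deliberately drifted so the contamination shows up in the scaler's mean; the sizes and the drift are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(800, 1))
X_val = rng.normal(loc=3.0, scale=1.0, size=(200, 1))  # validation has drifted

# Leaky: statistics computed over train and validation together.
leaky_scaler = StandardScaler().fit(np.vstack([X_train, X_val]))

# Clean: statistics come from train only, then get applied to validation.
clean_scaler = StandardScaler().fit(X_train)
X_val_scaled = clean_scaler.transform(X_val)

print("leaky mean:", leaky_scaler.mean_[0])
print("clean mean:", clean_scaler.mean_[0])
```

Inside cross-validation, placing the scaler in a sklearn Pipeline makes it refit on each fold's training rows, which removes this mode without extra bookkeeping.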

Interview Q&A

Q: Why does random K-fold lie on time-series data?
A: Because folds contain rows from both the past and the future relative to each other. The model trains on the future and is scored on the past. Production never works that way — production only ever sees the past. Use rolling/expanding window splits with an embargo gap.
Common wrong answer to avoid: "shuffle equals iid sampling." Time-series rows are not iid — they share temporal autocorrelation, seasonality, and shared session context. iid is the assumption being violated, not the cure.

Q: When is leave-one-out CV the right choice?
A: When data is genuinely tiny — say, under 50–100 samples — and each row is iid and cheap to retrain on. Otherwise the variance of the LOO estimate is high, the compute cost is N retrains, and 5- or 10-fold is just better. Almost never the right default in modern ML.
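Common wrong answer to avoid: "LOO is more accurate because it uses all the data." It is high-variance, not high-accuracy. Each held-out point is one noisy draw, and the N training sets overlap almost completely, so the N errors are strongly correlated. Averaging correlated errors shrinks the variance far less than averaging independent ones.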

Q: What is target leakage and how do you spot it?
A: Target leakage is when a feature carries information that, in production, is only available after the label is known. Spot it by checking: "could I compute this feature at prediction time, before knowing the answer?" If no, it leaks. Also — if your offline AUC jumps to 0.99+, something is leaking.
Common wrong answer to avoid: "leakage means duplicates between train and test." Duplicates are one mode. The deeper mode is temporal — features that quietly use future information.

Q: My 5-fold CV scores are 0.81, 0.84, 0.79, 0.83, 0.82. The 10-fold scores are 0.78, 0.91, 0.65, 0.88, 0.72, 0.90, 0.69, 0.85, 0.74, 0.80. Which is more trustworthy?
A: Neither mean on its own. The means are close (0.818 for 5-fold, 0.792 for 10-fold), but the 10-fold spread (0.65–0.91) reveals that the model is unstable on small validation chunks. With 10-fold, each val set is half the size, so each fold score is noisier. Look at both mean and spread; do not ship a model whose fold-to-fold spread exceeds the gap to your baseline.
Common wrong answer to avoid: "10-fold is always better because more folds means more accuracy." More folds also mean smaller validation chunks and noisier fold scores. Honest evaluation is mean plus variance, not mean alone.
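The arithmetic behind the last answer, using only the scores given in the question. Sample standard deviation is one reasonable choice of spread measure; the range works too.

```python
from statistics import mean, stdev

five_fold = [0.81, 0.84, 0.79, 0.83, 0.82]
ten_fold = [0.78, 0.91, 0.65, 0.88, 0.72, 0.90, 0.69, 0.85, 0.74, 0.80]

summary = {}
for name, scores in [("5-fold", five_fold), ("10-fold", ten_fold)]:
    summary[name] = {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation across folds
        "range": max(scores) - min(scores),
    }
    print(name, {k: round(v, 3) for k, v in summary[name].items()})
```

The two means differ by under 0.03, while the 10-fold scores range over 0.26. The spread is the part of the story the mean hides.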


Apply now (5 min)

Scenario. You are building a model at a ride-share company. Predict whether a driver will accept a surge-priced ride request. You have 6 months of historical request data — driver_id, rider_id, time, surge_multiplier, weather, accepted (label).

By hand, design the CV split. Answer:

  1. Is this time-sensitive? (Yes — driver behavior shifts with time-of-day, day-of-week, season.)
  2. Are rows grouped? (Yes — same driver appears many times.)
  3. What kind of split? (Time-based + group-aware. Train on months 1–4, embargo a week, validate on month 5, hold out month 6 as test. Within the train period, group by driver_id if you also need driver-level generalization.)
  4. What features need leakage audit? (Anything aggregated over a window — driver_30_day_acceptance_rate must use only data strictly before the request timestamp.)
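Point 4 can be made concrete. A sketch of a lookahead-safe version of driver_30_day_acceptance_rate in plain Python; the function name, the tuple layout of history, and the sample rows are all hypothetical.

```python
from datetime import datetime, timedelta

def trailing_acceptance_rate(history, driver_id, at_time, window_days=30):
    """Acceptance rate over requests strictly before `at_time`.

    `history` is a list of (driver_id, timestamp, accepted) tuples.
    Returns None when the driver has no requests in the window.
    """
    window_start = at_time - timedelta(days=window_days)
    outcomes = [
        accepted
        for d, ts, accepted in history
        if d == driver_id and window_start <= ts < at_time  # strict '<' blocks lookahead
    ]
    return sum(outcomes) / len(outcomes) if outcomes else None

t0 = datetime(2024, 3, 1, 12, 0)
history = [
    ("D-1", t0 - timedelta(days=10), 1),
    ("D-1", t0 - timedelta(days=2), 0),
    ("D-1", t0, 1),                      # the request being scored: excluded
    ("D-1", t0 + timedelta(days=1), 1),  # future row: excluded
]
rate = trailing_acceptance_rate(history, "D-1", at_time=t0)
print(rate)
```

The strict comparison matters: including the request itself would leak its own label into the feature.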

Then — without looking — sketch from memory:

  1. The 5-fold CV layout (5 rows, train/val flipped).
  2. The time-series CV diagram (train on the past, validate on the next chunk).
  3. One sentence: the deploy-mimic rule.

If you can reproduce all three in 90 seconds, you own this.


Bridge. Now you have splits that mimic production. But which number should you read off them? Accuracy lies on rare fraud. The confidence score from the ELI5 must match real probability. That is calibration. Read 13-metrics-and-calibration.md next.