06. Module 00 — Classical ML Interview Prep & Revision

Purpose: Exam-style cheatsheet for rapid review before interviews. Organized algorithm-first so you can scan one topic in 60 seconds.


Review loop

  1. Skim the TOC of 02_explainer.md. Re-read any section where you still feel vague.
  2. Re-answer the self-check questions in 01_weekly_plan.md without looking.
  3. Re-do the full set of prompts in 04_daily_recall.md from memory.
  4. Recreate the failure-fix chain from 02_explainer.md §6.1 on a blank page.
  5. Review the deliverables described in 05_hands_on_lab.md and mark one weak explanation you would tighten.
  6. Re-read the foundation-gap audit in 02_explainer.md §6.5 before moving to the next module.

Reflection

  • Which concept still feels least natural: regularization geometry, boosting, calibration, or split design?
  • Which failure mode from §6.1 have you actually seen before?
  • Where does classical ML thinking still show up in LLM work? Think evals, ranking, calibration, monitoring, and leakage.

Quick revision cheatsheet

Bias-variance tradeoff

  • Bias (underfit): model too rigid, train & val both bad → add capacity/features
  • Variance (overfit): model too flexible, train good & val bad → regularize/simplify/more data
  • Irreducible noise = floor no model can beat
  • Diagnostic: read the train-vs-val gap together with the absolute scores: a large gap points to variance, two poor scores point to bias
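
The diagnostic above can be sketched as a small helper. This is only an illustration: `diagnose_fit`, `target`, and `max_gap` are hypothetical names and cut-offs, not part of the module, and sensible values depend on the metric and the problem.

```python
def diagnose_fit(train_score, val_score, target=0.85, max_gap=0.05):
    """Classify a (train, val) score pair; higher scores are better.

    target and max_gap are hypothetical cut-offs; choose them per project.
    """
    gap = train_score - val_score
    if gap > max_gap:
        return "high variance"  # train good, val clearly worse
    if train_score < target:
        return "high bias"      # train and val both poor
    return "ok"

print(diagnose_fit(0.99, 0.80))  # high variance
print(diagnose_fit(0.70, 0.68))  # high bias
print(diagnose_fit(0.90, 0.88))  # ok
```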

Regularization geometry

  • L1 (Lasso): diamond constraint → corners force weights to zero → sparsity
  • L2 (Ridge): circle constraint → shrinks all weights toward zero, large ones the most → smooth, no exact zeros
  • Elastic Net: mix of both; keeps some sparsity while sharing weight across correlated features
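
To make the geometry concrete, here is a minimal sketch for a single weight, assuming the simplified one-dimensional objectives written in the docstrings (this is the orthonormal-feature special case, not a general solver). The function names are hypothetical.

```python
def l1_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*abs(v): soft-thresholding."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # weights with |w| <= lam land exactly on zero

def l2_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + 0.5*lam*v**2: proportional shrinkage."""
    return w / (1.0 + lam)

weights = [3.0, -0.4, 0.2, -2.0]
lam = 0.5
l1_result = [l1_shrink(w, lam) for w in weights]
l2_result = [l2_shrink(w, lam) for w in weights]
print(l1_result)  # [2.5, 0.0, 0.0, -1.5]
print(l2_result)  # every weight is smaller, none is zero
```

The L1 rule subtracts a fixed amount and clips at zero, which is where the sparsity comes from; the L2 rule rescales, so a non-zero weight stays non-zero.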

Linear models

  • Linear regression: ŷ = w·x + b, loss = MSE, coefficients are interpretable
  • Gradient descent: w := w - η * ∂L/∂w — repeated downhill nudging
  • Logistic regression: sigmoid squashes linear score → probability; loss = log loss
  • Feature engineering matters most for linear models (interactions, log, one-hot)
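
The update rule above, applied to one-feature linear regression with MSE loss, might look like the sketch below. The learning rate and step count are assumptions chosen so this toy example converges; they are not recommendations.

```python
def fit_linear_gd(xs, ys, lr=0.05, steps=2000):
    """Fit y ≈ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # dL/dw
        grad_b = 2.0 / n * sum(errors)                             # dL/db
        w -= lr * grad_w  # w := w - η * ∂L/∂w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_linear_gd(xs, ys)
print(round(w, 3), round(b, 3))  # approximately 2.0 and 1.0
```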

Trees & ensembles

  • Decision tree: greedy axis-aligned splits → rectangles; overfits easily
  • Random forest: many trees on bootstrap samples → averages reduce variance
  • Gradient boosting: sequential trees fix residuals of prior round → reduces bias
  • XGBoost often wins on tabular data: handles mixed types, threshold interactions, mature regularization
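
As a rough illustration of "sequential trees fix residuals", the sketch below boosts one-split stumps on a toy 1-D regression problem with squared loss. It is a teaching sketch under those assumptions, not how XGBoost or LightGBM are implemented; all names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on 1-D x, minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.3):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0, -0.3, 0.7]
one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=50)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))  # training error drops
```

Only training error is shown here; in practice the number of rounds would be chosen with early stopping on a validation set.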

Evaluation & metrics

  • Train/val/test split must mirror deployment (time-based, group-aware, stratified)
  • Cross-validation: rotate folds, average scores, spread = stability signal
  • Confusion matrix → Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of precision and recall
  • ROC-AUC for ranking; PR-AUC for rare positives
  • Accuracy is dangerous on imbalanced data (with 1% positives, always predicting the majority class scores 99%)
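
The formulas above and the accuracy trap can be checked with a few lines. This is a minimal sketch with hypothetical helper names; labels are assumed to be 0/1.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    # Undefined ratios (zero denominator) are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 1% positives, and a "model" that always predicts the majority class.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, recall)  # 0.99 0.0
```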

Calibration

  • "When model says 80%, is it right 80% of the time?"
  • Reliability diagram: predicted prob vs actual freq; ideal = diagonal
  • Fixes: Platt scaling (simple), isotonic regression (flexible)
  • Matters whenever probabilities drive triage, thresholds, or resource allocation
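
A reliability diagram is just a table of binned predictions before it is a plot. The sketch below builds that table with equal-width bins; the function name, the bin count, and the toy data are assumptions for illustration.

```python
def reliability_bins(probs, labels, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows

# Ten predictions at 0.9 (6 correct) and ten at 0.1 (1 positive).
probs = [0.9] * 10 + [0.1] * 10
labels = [1] * 6 + [0] * 4 + [0] * 9 + [1]
rows = reliability_bins(probs, labels)
for mean_p, rate, count in rows:
    print(round(mean_p, 2), round(rate, 2), count)
```

In this toy data the 0.9 bin is right only 60% of the time, so it sits below the diagonal (overconfident), while the 0.1 bin is on it.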

Class imbalance & threshold

  • Lower threshold → more recall, more FP; higher threshold → more precision, more FN
  • Threshold is a product decision based on downstream action cost
  • Tools: class weights, focal loss, stratified splits, PR curves, cost-based tuning
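
Cost-based threshold tuning can be sketched as a search over candidate thresholds on validation predictions. The cost values, candidate grid, and function names below are hypothetical; real costs come from the downstream action.

```python
def total_cost(probs, labels, threshold, fp_cost, fn_cost):
    """Sum the cost of false positives and false negatives at one threshold."""
    cost = 0.0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 0:
            cost += fp_cost
        elif pred == 0 and y == 1:
            cost += fn_cost
    return cost

def best_threshold(probs, labels, fp_cost, fn_cost, candidates):
    return min(candidates, key=lambda t: total_cost(probs, labels, t, fp_cost, fn_cost))

probs = [0.05, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

fp_expensive = best_threshold(probs, labels, fp_cost=5, fn_cost=1, candidates=candidates)
fn_expensive = best_threshold(probs, labels, fp_cost=1, fn_cost=10, candidates=candidates)
print(fp_expensive, fn_expensive)  # 0.9 0.3
```

The same scores yield a high threshold when false positives are expensive and a low one when misses are expensive, which is why the threshold is a product decision rather than a model property.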

Leakage red flags

  • Target encoding on full dataset before split
  • Features available at training time that will not exist at prediction time (future information)
  • Same patient/user in train and val
  • Val score better than train → investigate, don't celebrate
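
One way to avoid the "same patient in train and val" problem is to split by group rather than by row. The sketch below is a minimal stdlib version; `group_split`, the record layout, and the `patient_id` key are assumptions for illustration.

```python
import random

def group_split(records, group_key, val_fraction=0.25, seed=0):
    """Split so every group (e.g. a patient) lands entirely in train or in val."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [r for r in records if r[group_key] not in val_groups]
    val = [r for r in records if r[group_key] in val_groups]
    return train, val

# 8 hypothetical patients with 3 visits each.
records = [{"patient_id": pid, "visit": v} for pid in range(8) for v in range(3)]
train, val = group_split(records, "patient_id")
train_ids = {r["patient_id"] for r in train}
val_ids = {r["patient_id"] for r in val}
print(sorted(val_ids), train_ids.isdisjoint(val_ids))
```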

Production rules of thumb

  • Always start with 3 baselines: logistic regression, random forest, XGBoost
  • Monitor feature drift and label drift separately
  • Slice metrics by segment — global averages hide local disasters
  • Keep the baseline alive as operational anchor
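
Slicing a metric by segment needs very little code. The sketch below uses accuracy and a made-up web/mobile split purely as an example; the function name and data are hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(rows):
    """rows: iterable of (segment, y_true, y_pred). Returns {segment: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, y_true, y_pred in rows:
        totals[segment] += 1
        hits[segment] += int(y_true == y_pred)
    return {seg: hits[seg] / totals[seg] for seg in totals}

rows = [("web", 1, 1)] * 90 + [("mobile", 1, 0)] * 6 + [("mobile", 0, 0)] * 4
overall = sum(int(t == p) for _, t, p in rows) / len(rows)
per_segment = accuracy_by_segment(rows)
print(overall)      # 0.94
print(per_segment)  # {'web': 1.0, 'mobile': 0.4}
```

The global number looks healthy while the smaller segment is failing more often than not.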

Algorithm decision guide

Linear Regression

What: Weighted sum of features → continuous prediction
When to use: Continuous target, additive relationships, need interpretable coefficients, fast baseline
Why: Simple, fast, explainable; coefficients show each feature's direction and, on standardized features, its relative importance
When NOT to use: Non-linear interactions dominate; classification tasks; high-cardinality categoricals without encoding
How: Engineer features (interactions, log, polynomial) → fit → check residuals → add L2 if coefficients explode
Examples: Predicting house price from sqft/bedrooms/location; estimating hospital stay length from lab values; salary prediction from years of experience; ad spend → revenue attribution

Logistic Regression

What: Linear score → sigmoid → probability for binary classification
When to use: Binary classification; need calibrated probabilities; interpretability required; strong baseline
Why: Gives probabilities (not just labels), fast, stable, easy to debug; with good features it can match or beat complex models
When NOT to use: Highly non-linear boundaries; image/text without feature extraction; when threshold interactions dominate
How: One-hot encode categoricals → scale numerics → add interaction features → regularize (L1 for selection, L2 for stability) → tune threshold based on business cost
Examples: Email spam detection; loan default yes/no; patient readmission risk; click-through prediction baseline; churn prediction with engineered tenure/usage features

Decision Tree

What: Greedy axis-aligned threshold splits → rectangular regions
When to use: Need full interpretability; exploring feature interactions; quick EDA tool
Why: Handles mixed types, captures threshold interactions naturally, visually explainable
When NOT to use: Production predictions (overfits easily); smooth boundaries needed; stability required
How: Limit depth/min leaf size → use as exploration tool → graduate to ensemble for production
Examples: Clinical triage rules (if temp>101 AND O2<94 → ICU); explaining model logic to non-technical stakeholders; quick feature interaction discovery before building a real model

Random Forest

What: Many bootstrapped trees averaged; random feature subsets per split
When to use: Need stable out-of-box performance; little tuning budget; moderate data; want feature importance
Why: Variance reduction via averaging; handles non-linearities; robust with minimal hyperparameter work
When NOT to use: Need last-mile accuracy (boosting wins); need calibrated probabilities (often poorly calibrated); very high-dimensional sparse data
How: Set n_estimators high (500+) → tune max_depth and min_samples_leaf → use OOB score for quick validation
Examples: First production model for credit scoring; feature importance ranking for a new dataset; anomaly detection in sensor data; insurance claim severity when you need a quick reliable baseline

Gradient Boosting (XGBoost / LightGBM)

What: Sequential trees each correcting residual errors of the prior ensemble
When to use: Tabular data where you want best accuracy; Kaggle-style competitions; mixed feature types; moderate-to-large data
Why: Attacks bias sequentially; built-in regularization (shrinkage, subsampling, tree constraints); frequently tops tabular benchmarks
When NOT to use: Tiny data (overfits); need real-time sub-ms inference (many trees = slow); unstructured data (images, raw text)
How: Start with defaults → tune learning_rate + n_estimators first → then max_depth, subsample, colsample → early stopping on validation → audit for leakage
Examples: Fraud detection at scale (Stripe, PayPal); ranking search results; CTR prediction in ads; customer lifetime value; Kaggle tabular competitions; demand forecasting for e-commerce

L1 Regularization (Lasso)

What: Diamond penalty on weights → drives some to exact zero
When to use: Many irrelevant features; want automatic feature selection; sparse model desired
Why: Creates sparsity: simpler model, easier to interpret, less noise
When NOT to use: Many correlated useful features (arbitrarily drops one); when all features likely matter
How: Increase λ until only signal features survive → check which features drop → validate with cross-fold stability
Examples: Genomics with 20k genes but only ~50 matter; selecting which of 200 marketing features actually predict conversion; reducing model size for mobile deployment

L2 Regularization (Ridge)

What: Circular penalty → shrinks all weights toward zero without zeroing any
When to use: Correlated features that all contribute; coefficients exploding; want smooth stable solution
Why: Stabilizes inversion; distributes weight across correlated features instead of picking one
When NOT to use: When you explicitly need feature selection (use L1); when model is already underfitting
How: Start with small λ → increase until val loss stabilizes → coefficients should shrink but not zero out
Examples: NLP bag-of-words with many correlated synonym features; multicollinear economic indicators (GDP, unemployment, CPI); image pixel regression where neighboring pixels correlate

Metrics decision

Situation | Primary metric | Why not accuracy
Balanced classes, equal costs | Accuracy or F1 | Accuracy is fine here
Rare positives (fraud, disease) | PR-AUC, Recall@K | Accuracy rewards always-negative
Ranking matters (recommendations) | ROC-AUC, NDCG | Need ordering, not threshold
Probabilities drive actions (triage) | Log loss + calibration | Need trustworthy confidence
Regression with outliers | MAE over RMSE | RMSE punishes outliers too hard
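
The last row can be checked numerically. In this small sketch (toy numbers, hypothetical helper names) two sets of predictions have the same MAE, but the one that concentrates its error in a single large miss has a much higher RMSE.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [10, 12, 11, 13, 12]
spread_pred = [11, 11, 12, 12, 13]   # every error is 1
outlier_pred = [10, 12, 11, 13, 17]  # four perfect predictions, one error of 5

print(mae(y_true, spread_pred), rmse(y_true, spread_pred))    # 1.0 1.0
print(mae(y_true, outlier_pred), rmse(y_true, outlier_pred))  # MAE 1.0, RMSE about 2.24
```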

Top interview questions — with model answers

1. "Why did a model with 99% training accuracy fail in production?"

Overfitting (memorized noise), data leakage (feature knew the answer), or distribution shift (prod data differs from train). Diagnose: check train-vs-val gap, audit for leakage, compare feature distributions across time.

2. "Explain bias-variance tradeoff."

Bias = model too simple, systematically wrong (both train & val bad). Variance = model too flexible, memorizes noise (train good, val bad). Total error = bias² + variance + irreducible noise. Fix bias with more capacity; fix variance with regularization/data/ensembles.
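
Written out for squared error at a fixed input x, the decomposition reads as follows. The notation is an assumption added here for the sketch: y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```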

3. "Why does L1 produce sparsity but L2 doesn't?"

L1's diamond constraint has corners on the axes, and the loss contours usually first touch the diamond at a corner, where some weights are exactly zero. L2's circle has no corners, so the contact point almost never sits on an axis: all weights shrink, but they do not become exactly zero.

4. "Linear vs logistic regression — what changes?"

Target (continuous value → binary label), output (raw number → sigmoid-squashed probability in (0,1)), loss (MSE → log loss). Decision boundary is still linear. Sigmoid sits outside the linear computation, not inside.
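
A minimal sketch of the pieces named above: the linear score, the sigmoid applied after it, and log loss. Function names are hypothetical, and the code is not hardened for extreme scores (very negative scores would overflow `math.exp`).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    score = sum(w * x for w, x in zip(weights, features)) + bias  # linear part
    return sigmoid(score)  # squashing happens after the linear score

def log_loss(y_true, probs, eps=1e-12):
    """Mean negative log-likelihood for 0/1 labels; probs are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

p = predict_proba([2.0, -1.0], 0.0, [1.0, 2.0])  # score is 0, so p is 0.5
print(p)                                          # 0.5
print(log_loss([1], [0.9]) < log_loss([1], [0.6]))  # True: confident and right costs less
```

Because the sigmoid is monotonic, p ≥ 0.5 exactly when the linear score is ≥ 0, which is why the decision boundary stays linear.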

5. "Random forest vs gradient boosting — when each?"

RF: many trees in parallel on resampled data → averaging reduces variance. Use when you need stability with minimal tuning. Boosting: sequential trees fixing residuals → reduces bias. Use when you want maximum accuracy and can afford tuning. Different mechanisms, different goals.

6. "Why does XGBoost beat deep learning on tabular data?"

Tabular data lacks the spatial/sequential structure that gives neural nets their edge. Trees handle mixed feature types, threshold interactions, and missing values natively. Mature regularization limits overfitting on modest data sizes. Always benchmark XGBoost before reaching for deep learning on spreadsheets.

7. "ROC-AUC vs PR-AUC — when does PR-AUC matter more?"

When the positive class is rare (fraud, disease). ROC-AUC can look great even when precision is terrible because TN dominates. PR-AUC focuses on the positive class — precision and recall only — so it exposes weakness that ROC-AUC hides.
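
The gap between the two metrics shows up even on a tiny synthetic ranking. The sketch below computes ROC-AUC by pairwise comparison and uses average precision as a common summary of the PR curve (named plainly here because it is one of several ways to compute PR-AUC). Data and function names are made up for illustration.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision at each positive, scanning from the highest score down.

    Assumes at least one positive and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# 2 positives among 200 items; 8 negatives are scored above both positives.
labels = [0] * 8 + [1, 1] + [0] * 190
scores = [1.0 - i / 200 for i in range(200)]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(round(auc, 3), round(ap, 3))  # 0.96 0.156
```

Each positive still beats 190 of 198 negatives, so ROC-AUC stays high, while precision at the positives is only 1/9 and 2/10.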

8. "What is calibration and why should product teams care?"

Calibration = when the model says 80%, it's actually right ~80% of the time. Matters for: loan pricing (wrong probability → mispriced risk), triage queues (overconfidence → wrong resource allocation), any system where probabilities drive actions. A model can rank perfectly and still be dangerously miscalibrated.

9. "How do you handle class imbalance?"

Don't start with SMOTE. Order: (1) switch metric to PR-AUC/recall, (2) class_weight='balanced', (3) tune threshold on validation, (4) scale_pos_weight in XGBoost, (5) SMOTE as last resort. Always: stratified splits, slice metrics by segment.
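
For steps (2) and (4), the numbers behind the settings are simple ratios. The sketch below reproduces the usual "balanced" heuristic, n_samples / (n_classes * class_count), and the negatives-to-positives ratio often used as a starting value for scale_pos_weight. The helper names are hypothetical and no library is called.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def pos_weight_ratio(labels):
    """negatives / positives for 0/1 labels."""
    counts = Counter(labels)
    return counts[0] / counts[1]

labels = [1] * 10 + [0] * 990  # 1% positives
weights = balanced_class_weights(labels)
ratio = pos_weight_ratio(labels)
print(weights[1], round(weights[0], 3))  # 50.0 0.505
print(ratio)                             # 99.0
```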

10. "What is data leakage? Give examples."

Information from the future, the label, or the held-out data sneaks into training features. Examples: fitting scaler on full data before split, target encoding without cross-validation, same patient in train and val, using a feature that's computed after the prediction point. Detection: if val score is suspiciously close to or better than train → investigate.
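
The scaler example can be made concrete with a hand-rolled standardizer. This is a sketch with toy numbers and hypothetical function names; the point is only where the statistics are fitted.

```python
def fit_standardizer(values):
    """Return (mean, std) computed from the given values only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # fall back to 1.0 if all values are identical
    return mean, std

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
val = [10.0, 12.0]

leaky_mean, leaky_std = fit_standardizer(train + val)  # leaks val statistics into training
clean_mean, clean_std = fit_standardizer(train)        # fitted on train only
val_scaled = transform(val, clean_mean, clean_std)     # val is only transformed, never fitted

print(clean_mean, round(leaky_mean, 3))  # 2.5 5.333
```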


Mini-checkpoint

Conceptual

  1. Bias vs variance in one sentence each. (02_explainer.md §2.1)
  2. Why does L1 create sparsity? (02_explainer.md §2.3)
  3. Linear vs logistic regression — what changes in the output and loss? (02_explainer.md §3.1, §3.3)
  4. Random forest vs gradient boosting — what problem is each solving? (02_explainer.md §4.2, §4.3)
  5. Why does XGBoost often beat deep learning on tabular data? (02_explainer.md §4.4)
  6. Why is calibration different from accuracy? (02_explainer.md §5.3)
  7. Why is accuracy dangerous on imbalanced data? (02_explainer.md §5.4)

Applied

  1. A teammate reports 99% training accuracy on 100 patients. What do you ask next? (02_explainer.md §1.1, §5.1)
  2. You need to predict monthly churn. How do you split the data? (02_explainer.md §5.1)
  3. Your model ranks well but 0.9-confidence cases are only right 60% of the time. What is broken? (02_explainer.md §5.3)
  4. Fraud is 1% of transactions. Which metrics matter most and why? (02_explainer.md §5.2, §5.4)

Self-evaluation

Section | Score | Out of
Conceptual | __ | 14
Applied | __ | 8
Recall confidence | __ | 5
Total | __ | 27

Completion gate

  • [ ] Read all 6 chapters of 02_explainer.md
  • [ ] Completed the deliverables in 05_hands_on_lab.md
  • [ ] Can reproduce the failure-fix chain from §6.1 without looking
  • [ ] Can answer all prompts in 04_daily_recall.md from memory
  • [ ] Can explain the bridge to 01_neural_network_primitives from §6.6
  • [ ] Ready to move to the next module