06. Module 00: Classical ML Interview Prep & Revision
Purpose: Exam-style cheatsheet for rapid review before interviews. Organized algorithm-first so you can scan one topic in 60 seconds.
Review loop
- Skim the TOC of `02_explainer.md`. Re-read any section where you still feel vague.
- Re-answer the self-check questions in `01_weekly_plan.md` without looking.
- Re-do the full set of prompts in `04_daily_recall.md` from memory.
- Recreate the failure-fix chain from `02_explainer.md` §6.1 on a blank page.
- Review the deliverables described in `05_hands_on_lab.md` and mark one weak explanation you would tighten.
- Re-read the foundation-gap audit in `02_explainer.md` §6.5 before moving to the next module.
Reflection
- Which concept still feels least natural: regularization geometry, boosting, calibration, or split design?
- Which failure mode from §6.1 have you actually seen before?
- Where does classical ML thinking still show up in LLM work? Think evals, ranking, calibration, monitoring, and leakage.
Quick revision cheatsheet
Bias-variance tradeoff
- Bias (underfit): model too rigid, train & val both bad → add capacity/features
- Variance (overfit): model too flexible, train good & val bad → regularize/simplify/more data
- Irreducible noise = floor no model can beat
- Diagnostic: compare train vs val gap, not raw numbers
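The train-vs-val diagnostic above can be turned into a rough triage helper. This is a minimal sketch, not a standard tool: the function name `diagnose` and the thresholds `target` and `max_gap` are illustrative assumptions, and scores are assumed to be accuracy-like (higher is better).

```python
def diagnose(train_score, val_score, target=0.90, max_gap=0.05):
    """Rough bias/variance triage from two accuracy-like scores.

    `target` and `max_gap` are illustrative thresholds (assumptions),
    not standard values; tune them to the problem at hand.
    """
    gap = train_score - val_score
    if gap < 0:
        # Validation beating training is a leakage / split-design red flag.
        return "val beats train: check for leakage or a split problem"
    if gap > max_gap:
        # Good on train, bad on val: the model is too flexible.
        return "high variance: regularize, simplify, or add data"
    if train_score < target:
        # Bad on both: the model is too rigid.
        return "high bias: add capacity or features"
    return "ok: small gap and acceptable train score"


print(diagnose(0.99, 0.80))
print(diagnose(0.70, 0.68))
```

The point of the sketch is the ordering of the checks: the gap is inspected before the raw score.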
Regularization geometry
- L1 (Lasso): diamond constraint → corners force weights to zero → sparsity
- L2 (Ridge): circle constraint → shrinks every weight toward zero but rarely to exactly zero → smooth, dense solution
- Elastic Net: weighted mix of L1 and L2; useful when features are correlated and you still want some sparsity
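The sparsity difference is easiest to see in the one-dimensional case. The sketch below assumes a single weight whose unregularized optimum is `w`, which corresponds to orthonormal features in the multi-feature case. The function names and the example values are hypothetical.

```python
def lasso_shrink(w, lam):
    """Soft-thresholding: argmin over v of 0.5*(v - w)**2 + lam*abs(v)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are snapped to exactly zero


def ridge_shrink(w, lam):
    """argmin over v of 0.5*(v - w)**2 + 0.5*lam*v**2."""
    return w / (1.0 + lam)  # scaled down, never exactly zero unless w is


unregularized = [3.0, -0.4, 0.2, 1.5]  # illustrative weights
lam = 0.5

lasso = [lasso_shrink(w, lam) for w in unregularized]
ridge = [ridge_shrink(w, lam) for w in unregularized]

print("lasso:", lasso)
print("ridge:", ridge)
```

With these values the L1 rule zeroes the two small weights and subtracts a constant from the large ones, while the L2 rule rescales all four and leaves none at zero.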
Linear models
- Linear regression: `ŷ = w·x + b`, loss = MSE, coefficients are interpretable
- Gradient descent: `w := w - η * ∂L/∂w` (repeated downhill nudging)
- Logistic regression: sigmoid squashes linear score → probability; loss = log loss
- Feature engineering matters most for linear models (interactions, log, one-hot)
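The update rule `w := w - η * ∂L/∂w` can be sketched for one-feature linear regression with MSE loss. This is a minimal illustration on a toy dataset; the function name `fit_linear_gd`, the learning rate, and the step count are assumptions chosen so the toy problem converges.

```python
def fit_linear_gd(xs, ys, lr=0.05, steps=2000):
    """Fit y ≈ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errs = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(err**2) with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs))
        grad_b = 2.0 / n * sum(errs)
        # One downhill nudge.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = fit_linear_gd(xs, ys)
print(round(w, 4), round(b, 4))
```

On unscaled features with very different ranges, the same learning rate can diverge, which is one reason scaling numerics is a common preprocessing step for linear models.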
Trees & ensembles
- Decision tree: greedy axis-aligned splits → rectangles; overfits easily
- Random forest: many trees on bootstrap samples → averages reduce variance
- Gradient boosting: sequential trees fix residuals of prior round → reduces bias
- XGBoost often wins on tabular data: handles mixed types, threshold interactions, mature regularization
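"Sequential trees fix residuals of the prior round" can be sketched with depth-1 trees (stumps) and squared-error loss. This is a toy illustration of the idea, not how XGBoost or LightGBM are implemented; `fit_stump`, `boost`, the learning rate, and the round count are all hypothetical choices.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on 1-D x, minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every unique x except the max is a candidate
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=50, lr=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(ys) / len(ys)  # round 0: predict the mean
    stumps, preds, history = [], [base] * len(xs), []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    return predict, history


xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]  # a curve no single stump can fit

predict, history = boost(xs, ys)
print("train MSE after round 1: ", history[0])
print("train MSE after round 50:", history[-1])
```

Training error falls round after round, which is the bias reduction described above. The same property is why early stopping on a validation set matters: training error alone keeps improving.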
Evaluation & metrics
- Train/val/test split must mirror deployment (time-based, group-aware, stratified)
- Cross-validation: rotate folds, average scores, spread = stability signal
- Confusion matrix → Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean
- ROC-AUC for ranking; PR-AUC for rare positives
- Accuracy is dangerous on imbalanced data (99% by predicting majority)
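The confusion-matrix formulas and the accuracy trap can be checked by hand. The sketch below uses a made-up dataset with 1 positive in 100 and an "always negative" predictor; the helper names are hypothetical.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1] * 1 + [0] * 99       # 1% positives (illustrative)
always_negative = [0] * 100       # predicts the majority class every time

tp, fp, fn, tn = confusion(y_true, always_negative)
accuracy = (tp + tn) / len(y_true)

print("accuracy:", accuracy)
print("precision, recall, f1:", precision_recall_f1(tp, fp, fn))
```

The predictor scores 99% accuracy while catching none of the positives, which is why recall and precision are reported alongside (or instead of) accuracy on imbalanced data.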
Calibration
- "When model says 80%, is it right 80% of the time?"
- Reliability diagram: predicted prob vs actual freq; ideal = diagonal
- Fixes: Platt scaling (simple), isotonic regression (flexible)
- Matters whenever probabilities drive triage, thresholds, or resource allocation
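A reliability diagram is just a binned table of "mean predicted probability" against "observed positive rate". The sketch below builds that table with equal-width bins and also computes expected calibration error (ECE), a common one-number summary that is an addition here rather than something from the notes above. Function names and the example data are hypothetical.

```python
def reliability_table(probs, labels, n_bins=5):
    """Rows of (mean predicted prob, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            rows.append((mean_pred, frac_pos, len(b)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between predicted and observed rates."""
    n = sum(count for _, _, count in rows)
    return sum(count / n * abs(mp - fp) for mp, fp, count in rows)


# Ten cases scored 0.9, of which only six are actually positive.
probs = [0.9] * 10
labels = [1] * 6 + [0] * 4

rows = reliability_table(probs, labels)
ece = expected_calibration_error(rows)
print(rows)
print("ECE:", ece)
```

A perfectly calibrated model would have every row on the diagonal (mean predicted prob equal to observed rate); here the single row sits well below it, i.e. the model is overconfident.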
Class imbalance & threshold
- Lower threshold → more recall, more FP; higher threshold → more precision, more FN
- Threshold is a product decision based on downstream action cost
- Tools: class weights, focal loss, stratified splits, PR curves, cost-based tuning
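Cost-based threshold tuning can be sketched as a brute-force search over candidate thresholds on validation scores. The scores, labels, and cost values below are made up for illustration, and `best_threshold` is a hypothetical helper.

```python
def best_threshold(probs, labels, cost_fp, cost_fn):
    """Return (threshold, total cost) minimizing fp*cost_fp + fn*cost_fn.

    A case is flagged positive when its score is >= threshold.
    """
    candidates = sorted(set(probs)) + [1.1]  # 1.1 means "flag nothing"
    best_t, best_cost = None, float("inf")
    for t in candidates:
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        cost = fp * cost_fp + fn * cost_fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 1, 0, 0, 1, 0, 1]

# Missing a positive is 10x worse than a false alarm: threshold drops.
print(best_threshold(probs, labels, cost_fp=1, cost_fn=10))
# A false alarm is 10x worse than a miss: threshold rises.
print(best_threshold(probs, labels, cost_fp=10, cost_fn=1))
```

The model and its scores are identical in both calls; only the cost ratio changes, and the chosen threshold moves with it. That is the sense in which the threshold is a product decision rather than a modeling one.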
Leakage red flags
- Target encoding on full dataset before split
- Future features available at training time
- Same patient/user in train and val
- Val score better than train → investigate, don't celebrate
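The "same patient/user in train and val" flag is avoided by splitting on the group identifier instead of on rows. Below is a minimal stdlib sketch; `group_split` and the patient IDs are hypothetical. In scikit-learn, `GroupKFold` and `GroupShuffleSplit` serve the same purpose.

```python
import random


def group_split(groups, val_fraction=0.25, seed=0):
    """Return (train_idx, val_idx) with every group entirely on one side."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, at least one.
    n_val = max(1, round(len(unique) * val_fraction))
    val_groups = set(unique[:n_val])
    train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
    val_idx = [i for i, g in enumerate(groups) if g in val_groups]
    return train_idx, val_idx


# One row per visit; several patients have multiple visits.
patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]

train_idx, val_idx = group_split(patients)
train_patients = {patients[i] for i in train_idx}
val_patients = {patients[i] for i in val_idx}

print("train patients:", sorted(train_patients))
print("val patients:  ", sorted(val_patients))
```

A plain row-level shuffle would likely put different visits of the same patient on both sides, letting the model recognize the patient rather than learn the signal.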
Production rules of thumb
- Always start with 3 baselines: logistic regression, random forest, XGBoost
- Monitor feature drift and label drift separately
- Slice metrics by segment — global averages hide local disasters
- Keep the baseline alive as operational anchor
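"Global averages hide local disasters" is easy to demonstrate with per-segment accuracy. The segments and predictions below are invented for illustration, and `accuracy_by_segment` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_segment(segments, y_true, y_pred):
    """Accuracy computed separately for each segment label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s, t, p in zip(segments, y_true, y_pred):
        totals[s] += 1
        hits[s] += int(t == p)
    return {s: hits[s] / totals[s] for s in totals}


# 90 web rows, all predicted correctly; 10 mobile rows, only 4 correct.
segments = ["web"] * 90 + ["mobile"] * 10
y_true = [1] * 100
y_pred = [1] * 90 + [1] * 4 + [0] * 6

overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
by_segment = accuracy_by_segment(segments, y_true, y_pred)

print("overall:", overall)
print("by segment:", by_segment)
```

The overall number looks healthy while the smaller segment is failing more often than not.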
Algorithm decision guide
Linear Regression
| Aspect | Details |
|---|---|
| What | Weighted sum of features → continuous prediction |
| When to use | Continuous target, additive relationships, need interpretable coefficients, fast baseline |
| Why | Simple, fast, explainable; coefficients show each feature's direction and effect size (comparable across features only after standardizing) |
| When NOT to use | Non-linear interactions dominate; classification tasks; high-cardinality categoricals without encoding |
| How | Engineer features (interactions, log, polynomial) → fit → check residuals → add L2 if coefficients explode |
| Examples | Predicting house price from sqft/bedrooms/location; estimating hospital stay length from lab values; salary prediction from years of experience; ad spend → revenue attribution |
Logistic Regression
| Aspect | Details |
|---|---|
| What | Linear score → sigmoid → probability for binary classification |
| When to use | Binary classification; need calibrated probabilities; interpretability required; strong baseline |
| Why | Gives probabilities (not just labels), fast, stable, easy to debug; with good features it often matches more complex models |
| When NOT to use | Highly non-linear boundaries; image/text without feature extraction; when threshold interactions dominate |
| How | One-hot encode categoricals → scale numerics → add interaction features → regularize (L1 for selection, L2 for stability) → tune threshold based on business cost |
| Examples | Email spam detection; loan default yes/no; patient readmission risk; click-through prediction baseline; churn prediction with engineered tenure/usage features |
Decision Tree
| Aspect | Details |
|---|---|
| What | Greedy axis-aligned threshold splits → rectangular regions |
| When to use | Need full interpretability; exploring feature interactions; quick EDA tool |
| Why | Handles mixed types, captures threshold interactions naturally, visually explainable |
| When NOT to use | Production predictions (overfits easily); smooth boundaries needed; stability required |
| How | Limit depth/min leaf size → use as exploration tool → graduate to ensemble for production |
| Examples | Clinical triage rules (if temp>101 AND O2<94 → ICU); explaining model logic to non-technical stakeholders; quick feature interaction discovery before building a real model |
Random Forest
| Aspect | Details |
|---|---|
| What | Many bootstrapped trees averaged; random feature subsets per split |
| When to use | Need stable out-of-box performance; little tuning budget; moderate data; want feature importance |
| Why | Variance reduction via averaging; handles non-linearities; robust with minimal hyperparameter work |
| When NOT to use | Need last-mile accuracy (boosting wins); need calibrated probabilities (often poorly calibrated); very high-dimensional sparse data |
| How | Set n_estimators high (500+) → tune max_depth and min_samples_leaf → use OOB score for quick validation |
| Examples | First production model for credit scoring; feature importance ranking for a new dataset; anomaly detection in sensor data; insurance claim severity when you need a quick reliable baseline |
Gradient Boosting (XGBoost / LightGBM)
| Aspect | Details |
|---|---|
| What | Sequential trees each correcting residual errors of the prior ensemble |
| When to use | Tabular data where you want best accuracy; Kaggle-style competitions; mixed feature types; moderate-to-large data |
| Why | Attacks bias sequentially; built-in regularization (shrinkage, subsampling, tree constraints); dominates tabular benchmarks |
| When NOT to use | Tiny data (overfits); need real-time sub-ms inference (many trees = slow); unstructured data (images, raw text) |
| How | Start with defaults → tune learning_rate + n_estimators first → then max_depth, subsample, colsample → early stopping on validation → audit for leakage |
| Examples | Fraud detection at scale (Stripe, PayPal); ranking search results; CTR prediction in ads; customer lifetime value; Kaggle tabular competitions; demand forecasting for e-commerce |
L1 Regularization (Lasso)
| Aspect | Details |
|---|---|
| What | Diamond penalty on weights → drives some to exact zero |
| When to use | Many irrelevant features; want automatic feature selection; sparse model desired |
| Why | Creates sparsity — simpler model, easier to interpret, less noise |
| When NOT to use | Many correlated useful features (arbitrarily drops one); when all features likely matter |
| How | Increase λ until only signal features survive → check which features drop → validate with cross-fold stability |
| Examples | Genomics with 20k genes but only ~50 matter; selecting which of 200 marketing features actually predict conversion; reducing model size for mobile deployment |
L2 Regularization (Ridge)
| Aspect | Details |
|---|---|
| What | Circular penalty → shrinks every weight toward zero, but rarely to exactly zero |
| When to use | Correlated features that all contribute; coefficients exploding; want smooth stable solution |
| Why | Stabilizes inversion; distributes weight across correlated features instead of picking one |
| When NOT to use | When you explicitly need feature selection (use L1); when model is already underfitting |
| How | Start with small λ → increase until val loss stabilizes → coefficients should shrink but not zero out |
| Examples | NLP bag-of-words with many correlated synonym features; multicollinear economic indicators (GDP, unemployment, CPI); image pixel regression where neighboring pixels correlate |
Metrics decision
| Situation | Primary metric | Why not accuracy |
|---|---|---|
| Balanced classes, equal costs | Accuracy or F1 | Accuracy is fine here |
| Rare positives (fraud, disease) | PR-AUC, Recall@K | Accuracy rewards always-negative |
| Ranking matters (recommendations) | ROC-AUC, NDCG | Need ordering, not threshold |
| Probabilities drive actions (triage) | Log loss + calibration | Need trustworthy confidence |
| Regression with outliers | MAE over RMSE | RMSE punishes outliers too hard |
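The last row of the table (MAE over RMSE when outliers are present) can be checked with a small worked example. The residuals below are made up; the only difference between the two lists is one large error.

```python
import math


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


clean = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, -21.0]  # one residual replaced by an outlier

print("clean:        MAE", mae(clean), "RMSE", rmse(clean))
print("with outlier: MAE", mae(with_outlier), "RMSE", rmse(with_outlier))
```

On the clean residuals the two metrics agree. With the outlier, MAE rises to 6 while RMSE rises to about 10.5, because squaring lets the single large error dominate the average.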
Top interview questions, with model answers
1. "Why did a model with 99% training accuracy fail in production?"
Overfitting (memorized noise), data leakage (feature knew the answer), or distribution shift (prod data differs from train). Diagnose: check train-vs-val gap, audit for leakage, compare feature distributions across time.
2. "Explain bias-variance tradeoff."
Bias = model too simple, systematically wrong (both train & val bad). Variance = model too flexible, memorizes noise (train good, val bad). Total error = bias² + variance + irreducible noise. Fix bias with more capacity; fix variance with regularization/data/ensembles.
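For squared-error loss, the decomposition quoted above can be written out explicitly. This is a sketch under the usual assumption that $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and with expectations taken over training sets and noise. The symbols $f$, $\hat f$, and $\sigma^2$ are notation introduced here, not taken from the notes.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$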
3. "Why does L1 produce sparsity but L2 doesn't?"
L1's diamond constraint has corners on the axes. The loss contours typically first touch the diamond at a corner, where some weights are exactly zero. L2's circle has no corners, so the contact point usually has every weight nonzero: weights shrink but rarely hit exactly zero.
4. "Linear vs logistic regression — what changes?"
Target (continuous value → binary label), output (raw number → sigmoid-squashed probability in (0, 1)), loss (MSE → log loss). The decision boundary is still linear, because the sigmoid is applied after the linear score, not inside it.
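The output and loss changes can be sketched in a few lines. The weights and inputs are arbitrary illustrative numbers, and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real score into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(y, p, eps=1e-15):
    """Binary cross-entropy for one example; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


w, b = 1.5, -3.0                 # illustrative parameters
for x in (0.0, 2.0, 4.0):
    score = w * x + b            # the linear part is unchanged
    print(x, score, round(sigmoid(score), 3))

# Confidently wrong predictions are punished much harder than hesitant ones.
print(log_loss(1, 0.4), log_loss(1, 0.01))
```

A score of 0 maps to a probability of 0.5, so thresholding at 0.5 is the same as thresholding the linear score at 0, which is why the boundary stays linear.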
5. "Random forest vs gradient boosting — when each?"
RF: many trees in parallel on resampled data → averaging reduces variance. Use when you need stability with minimal tuning. Boosting: sequential trees fixing residuals → reduces bias. Use when you want maximum accuracy and can afford tuning. Different mechanisms, different goals.
6. "Why does XGBoost beat deep learning on tabular data?"
Tabular lacks spatial/sequential structure that gives neural nets their edge. Trees handle mixed feature types, threshold interactions, missing values natively. Mature regularization prevents overfitting on modest data sizes. Always benchmark XGBoost before reaching for deep learning on spreadsheets.
7. "ROC-AUC vs PR-AUC — when does PR-AUC matter more?"
When the positive class is rare (fraud, disease). ROC-AUC can look great even when precision is terrible because TN dominates. PR-AUC focuses on the positive class — precision and recall only — so it exposes weakness that ROC-AUC hides.
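The gap between the two metrics shows up clearly on a synthetic example. The sketch below assumes 10 positives among 1,000 cases, with 50 negatives scored above every positive. All scores are made up and distinct, so ties need no special handling. PR-AUC is approximated by average precision, which is an assumption about how the area is summarized.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of precision@k taken at the rank of each positive (needs >= 1 positive)."""
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


scores, labels = [], []
for i in range(50):    # negatives the model wrongly ranks at the very top
    scores.append(0.95 + i * 1e-4); labels.append(0)
for j in range(10):    # the true positives, just below them
    scores.append(0.90 + j * 1e-3); labels.append(1)
for i in range(940):   # the easy negatives
    scores.append(0.10 + i * 1e-4); labels.append(0)

roc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print("ROC-AUC:", round(roc, 3))
print("average precision:", round(ap, 3))
```

ROC-AUC comes out near 0.95 because each positive still outranks 940 of the 990 negatives. Average precision comes out below 0.1 because anyone working the top of the list meets 50 false alarms before the first true positive.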
8. "What is calibration and why should product teams care?"
Calibration = when the model says 80%, it's actually right ~80% of the time. Matters for: loan pricing (wrong probability → mispriced risk), triage queues (overconfidence → wrong resource allocation), any system where probabilities drive actions. A model can rank perfectly and still be dangerously miscalibrated.
9. "How do you handle class imbalance?"
Don't start with SMOTE. Order: (1) switch metric to PR-AUC/recall, (2) class_weight='balanced', (3) tune threshold on validation, (4) scale_pos_weight in XGBoost, (5) SMOTE as last resort. Always: stratified splits, slice metrics by segment.
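Steps (2) and (4) both reduce to simple ratios. The sketch below reproduces the heuristic scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count)`, and the negative-to-positive ratio commonly suggested as a starting value for XGBoost's `scale_pos_weight`. The helper names and the 1% dataset are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count), per class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def neg_pos_ratio(labels):
    """Count of negatives divided by count of positives (labels in {0, 1})."""
    counts = Counter(labels)
    return counts[0] / counts[1]


labels = [0] * 990 + [1] * 10   # 1% positives (illustrative)

weights = balanced_class_weights(labels)
print("class weights:", weights)
print("neg/pos ratio:", neg_pos_ratio(labels))
```

Both reweight the loss so the rare class counts for about as much as the common one. Reweighting shifts the predicted probabilities, so a calibration check afterwards is worthwhile.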
10. "What is data leakage? Give examples."
Information from the future, the label, or the held-out data sneaks into training features. Examples: fitting scaler on full data before split, target encoding without cross-validation, same patient in train and val, using a feature that's computed after the prediction point. Detection: if val score is suspiciously close to or better than train, investigate.
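The scaler example is the easiest one to reproduce. The sketch below compares statistics fitted on training rows only with statistics fitted on train and val together. The numbers are invented and the helper names are hypothetical.

```python
def fit_standardizer(values):
    """Return (mean, std) of the given values; std falls back to 1.0 if zero."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0
    return mean, std


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0]
val = [10.0]                      # an unusually large held-out value

# Correct: statistics come from the training rows only.
mean_ok, std_ok = fit_standardizer(train)
val_ok = transform(val, mean_ok, std_ok)

# Leaky: the held-out row has already shaped the statistics.
mean_leaky, std_leaky = fit_standardizer(train + val)
val_leaky = transform(val, mean_leaky, std_leaky)

print("train-only mean:", mean_ok, "-> val scaled to", val_ok)
print("leaky mean:     ", mean_leaky, "-> val scaled to", val_leaky)
```

In the leaky version the held-out value looks far less extreme than it would in production, where the scaler has never seen it. The fix is to fit every preprocessing step inside the training fold only.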
Mini-checkpoint
Conceptual
- Bias vs variance in one sentence each. (`02_explainer.md` §2.1)
- Why does L1 create sparsity? (`02_explainer.md` §2.3)
- Linear vs logistic regression: what changes in the output and loss? (`02_explainer.md` §3.1, §3.3)
- Random forest vs gradient boosting: what problem is each solving? (`02_explainer.md` §4.2, §4.3)
- Why does XGBoost often beat deep learning on tabular data? (`02_explainer.md` §4.4)
- Why is calibration different from accuracy? (`02_explainer.md` §5.3)
- Why is accuracy dangerous on imbalanced data? (`02_explainer.md` §5.4)
Applied
- A teammate reports 99% training accuracy on 100 patients. What do you ask next? (`02_explainer.md` §1.1, §5.1)
- You need to predict monthly churn. How do you split the data? (`02_explainer.md` §5.1)
- Your model ranks well but 0.9-confidence cases are only right 60% of the time. What is broken? (`02_explainer.md` §5.3)
- Fraud is 1% of transactions. Which metrics matter most and why? (`02_explainer.md` §5.2, §5.4)
Self-evaluation
| Section | Score | Out of |
|---|---|---|
| Conceptual | __ | 14 |
| Applied | __ | 8 |
| Recall confidence | __ | 5 |
| Total | __ | 27 |
Completion gate
- [ ] Read all 6 chapters of `02_explainer.md`
- [ ] Completed the deliverables in `05_hands_on_lab.md`
- [ ] Can reproduce the failure-fix chain from §6.1 without looking
- [ ] Can answer all prompts in `04_daily_recall.md` from memory
- [ ] Can explain the bridge to `01_neural_network_primitives` from §6.6
- [ ] Ready to move to the next module