06. Module 00 — Classical ML Interview Prep & Revision¶

Purpose: Exam-style cheatsheet for rapid review before interviews. Organized algorithm-first so you can scan one topic in 60 seconds.

Review loop¶

Skim the TOC of 02_explainer.md. Re-read any section where you still feel vague.
Re-answer the self-check questions in 01_weekly_plan.md without looking.
Re-do the full set of prompts in 04_daily_recall.md from memory.
Recreate the failure-fix chain from 02_explainer.md §6.1 on a blank page.
Review the deliverables described in 05_hands_on_lab.md and mark one weak explanation you would tighten.
Re-read the foundation-gap audit in 02_explainer.md §6.5 before moving to the next module.

Reflection¶

Which concept still feels least natural: regularization geometry, boosting, calibration, or split design?
Which failure mode from §6.1 have you actually seen before?
Where does classical ML thinking still show up in LLM work? Think evals, ranking, calibration, monitoring, and leakage.

Quick revision cheatsheet¶

Bias-variance tradeoff¶

Bias (underfit): model too rigid, train & val both bad → add capacity/features
Variance (overfit): model too flexible, train good & val bad → regularize/simplify/more data
Irreducible noise = floor no model can beat
Diagnostic: read the train-val gap alongside the absolute scores: a large gap signals variance, two poor scores signal bias
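
A minimal sketch of the diagnostic above. The function name `diagnose` and the thresholds `target` and `max_gap` are illustrative assumptions, not standard values; choose them per problem.

```python
def diagnose(train_score, val_score, target=0.85, max_gap=0.05):
    """Classify a fit from train/val scores (higher is better).

    `target` and `max_gap` are hypothetical thresholds chosen for illustration.
    """
    gap = train_score - val_score
    if gap > max_gap:
        return "high variance: regularize, simplify, or add data"
    if train_score < target:
        return "high bias: add capacity or features"
    return "ok: small gap and train score at target"


print(diagnose(0.99, 0.80))  # train good, val bad
print(diagnose(0.70, 0.69))  # train and val both bad
print(diagnose(0.90, 0.88))  # small gap, acceptable level
```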

Regularization geometry¶

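L1 (Lasso): diamond constraint → loss contours tend to first touch a corner, which sits on an axis → some weights exactly zero → sparsity
L2 (Ridge): circle constraint → shrinks all weights proportionally toward zero → smooth, rarely exact zeros
Elastic Net: mix of both; useful when features are correlated and you still want some sparsity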
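
The difference is easiest to see in the one-weight case. The sketch below assumes a single weight with a unit-scaled feature, where `z` is the unregularized least-squares weight; under that assumption L1 has a closed-form soft-thresholding solution and L2 rescales every weight by the same factor. The names `lasso_shrink` and `ridge_shrink` and the example weights are illustrative.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)


unregularized = [3.0, -1.5, 0.4, -0.2]  # hypothetical least-squares weights
lam = 0.5
l1 = [lasso_shrink(z, lam) for z in unregularized]
l2 = [ridge_shrink(z, lam) for z in unregularized]
print(l1)  # weights with |z| <= lam become exactly 0.0
print(l2)  # every weight is scaled by 1/(1+lam); none becomes exactly zero
```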

Linear models¶

Linear regression: ŷ = w·x + b, loss = MSE, coefficients are interpretable
Gradient descent: w := w - η * ∂L/∂w — repeated downhill nudging
Logistic regression: sigmoid squashes linear score → probability; loss = log loss
Feature engineering matters most for linear models (interactions, log, one-hot)
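
The gradient descent update above can be sketched for one-feature linear regression with MSE loss. The function name `fit_linear_gd`, the learning rate, the step count, and the toy data are assumptions chosen so the example converges; they are not recommended defaults.

```python
def fit_linear_gd(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)**2)
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # w := w - lr * dL/dw, and the same for the intercept
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1
w, b = fit_linear_gd(xs, ys)
print(round(w, 3), round(b, 3))
```

If the learning rate is too large for the feature scale, the same loop diverges, which is one reason to scale numeric features first.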

Trees & ensembles¶

Decision tree: greedy axis-aligned splits → rectangles; overfits easily
Random forest: many trees on bootstrap samples → averages reduce variance
Gradient boosting: sequential trees fix residuals of prior round → reduces bias
XGBoost often wins on tabular data: handles mixed types, threshold interactions, mature regularization
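
A toy sketch of the "fix residuals of the prior round" idea, assuming squared loss, a single numeric feature, and depth-1 trees (stumps). The names `fit_stump` and `boost`, the learning rate, and the round count are illustrative; this is not how XGBoost or LightGBM are implemented.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, by squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # each distinct value except the largest
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=50, lr=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]  # toy non-linear target
one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=50)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))  # training error falls with rounds
```

Training error keeps falling as rounds are added, which is why early stopping on a validation set matters in practice.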

Evaluation & metrics¶

Train/val/test split must mirror deployment (time-based, group-aware, stratified)
Cross-validation: rotate folds, average scores, spread = stability signal
Confusion matrix → Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of precision and recall = 2PR/(P+R)
ROC-AUC for ranking; PR-AUC for rare positives
Accuracy is dangerous on imbalanced data (with 1% positives, always predicting the majority class scores 99%)
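
The confusion-matrix formulas and the accuracy trap can be checked by hand. This is a minimal sketch; the helper names and the 1-in-100 dataset are made up for illustration, and undefined ratios are reported as 0.0 by assumption.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    # Ratios with an empty denominator are reported as 0.0 (an assumption).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced set: 1 positive in 100, model always predicts negative.
y_true = [1] + [0] * 99
y_pred = [0] * 100
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy)                         # looks excellent
print(precision_recall_f1(tp, fp, fn))  # the model never finds a positive
```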

Calibration¶

"When model says 80%, is it right 80% of the time?"
Reliability diagram: predicted prob vs actual freq; ideal = diagonal
Fixes: Platt scaling (simple), isotonic regression (flexible)
Matters whenever probabilities drive triage, thresholds, or resource allocation
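
A reliability diagram is just a table of binned predictions. The sketch below assumes equal-width bins and uses a count-weighted gap (often called expected calibration error) as a one-number summary; the function names and the overconfident toy data are illustrative.

```python
def reliability_bins(probs, labels, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            rows.append((mean_p, freq, len(members)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average of |predicted - observed| across bins."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(p - f) for p, f, n in rows) / total


# Hypothetical overconfident model: says 0.9 but is right only 60% of the time.
probs = [0.9] * 10 + [0.1] * 10
labels = [1] * 6 + [0] * 4 + [0] * 9 + [1] * 1
rows = reliability_bins(probs, labels)
ece = expected_calibration_error(rows)
print(rows)
print(ece)
```

A perfectly calibrated model would have every row's first two values equal, which is the diagonal on the diagram.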

Class imbalance & threshold¶

Lower threshold → more recall, more FP; higher threshold → more precision, more FN
Threshold is a product decision based on downstream action cost
Tools: class weights, focal loss, stratified splits, PR curves, cost-based tuning
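
Cost-based tuning can be sketched as a search over candidate thresholds on validation data. The costs, scores, labels, and the 0.05-step grid below are hypothetical; in practice the costs come from the downstream action.

```python
def total_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Cost of flagging every case scored at or above `threshold`."""
    cost = 0.0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp
        elif predicted == 0 and y == 1:
            cost += cost_fn
    return cost


def best_threshold(probs, labels, cost_fp, cost_fn):
    """Lowest-cost threshold on a 0.05-step grid (ties go to the lower threshold)."""
    candidates = [i / 20 for i in range(1, 20)]
    return min(candidates, key=lambda t: total_cost(probs, labels, t, cost_fp, cost_fn))


# Hypothetical validation scores and labels.
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

miss_is_costly = best_threshold(probs, labels, cost_fp=1, cost_fn=10)
alarm_is_costly = best_threshold(probs, labels, cost_fp=10, cost_fn=1)
print(miss_is_costly, alarm_is_costly)  # the first is lower than the second
```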

Leakage red flags¶

Target encoding on full dataset before split
Future features available at training time
Same patient/user in train and val
Val score better than train → investigate, don't celebrate
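
For the "same patient/user in train and val" red flag, a group-aware split assigns whole groups to one side. A minimal sketch, assuming rows are dicts with a group column; `group_split`, `patient_id`, and the toy rows are hypothetical names.

```python
import random


def group_split(rows, group_key, val_fraction=0.2, seed=0):
    """Split rows so every group (e.g. one patient) lands wholly in train or val."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, int(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [r for r in rows if r[group_key] not in val_groups]
    val = [r for r in rows if r[group_key] in val_groups]
    return train, val


# Hypothetical visits: three rows per patient, ten patients.
rows = [{"patient_id": pid, "visit": v} for pid in range(10) for v in range(3)]
train, val = group_split(rows, "patient_id")
overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in val}
print(len(train), len(val), overlap)  # no patient appears on both sides
```

A row-level random split of the same data would almost certainly put some patient on both sides, which inflates the validation score.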

Production rules of thumb¶

Always start with 3 baselines: logistic regression, random forest, XGBoost
Monitor feature drift and label drift separately
Slice metrics by segment — global averages hide local disasters
Keep the baseline alive as operational anchor
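
Slicing is a group-by over the evaluation records. The sketch below uses accuracy for brevity; the segment names and counts are invented to show a healthy global number hiding a failing slice.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """records: iterable of (segment, y_true, y_pred). Returns accuracy per segment."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, y_true, y_pred in records:
        totals[segment] += 1
        hits[segment] += int(y_true == y_pred)
    return {seg: hits[seg] / totals[seg] for seg in totals}


# Hypothetical data: a large segment that works and a small one that does not.
records = [("desktop", 1, 1)] * 90 + [("mobile", 1, 0)] * 8 + [("mobile", 1, 1)] * 2
overall = sum(int(t == p) for _, t, p in records) / len(records)
by_segment = accuracy_by_segment(records)
print(overall)     # global average looks healthy
print(by_segment)  # the mobile slice does not
```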

Algorithm decision guide¶

Linear Regression¶


What	Weighted sum of features → continuous prediction
When to use	Continuous target, additive relationships, need interpretable coefficients, fast baseline
Why	Simple, fast, explainable; with features on comparable scales, coefficient size and sign show each feature's effect directly
When NOT to use	Non-linear interactions dominate; classification tasks; high-cardinality categoricals without encoding
How	Engineer features (interactions, log, polynomial) → fit → check residuals → add L2 if coefficients explode
Examples	Predicting house price from sqft/bedrooms/location; estimating hospital stay length from lab values; salary prediction from years of experience; ad spend → revenue attribution

Logistic Regression¶


What	Linear score → sigmoid → probability for binary classification
When to use	Binary classification; need calibrated probabilities; interpretability required; strong baseline
Why	Gives probabilities (not just labels), fast, stable, easy to debug; with good features it can match or beat more complex models
When NOT to use	Highly non-linear boundaries; image/text without feature extraction; when threshold interactions dominate
How	One-hot encode categoricals → scale numerics → add interaction features → regularize (L1 for selection, L2 for stability) → tune threshold based on business cost
Examples	Email spam detection; loan default yes/no; patient readmission risk; click-through prediction baseline; churn prediction with engineered tenure/usage features
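
The "linear score → sigmoid → probability" pipeline and its log loss fit in a few lines. This is a sketch with hand-picked, hypothetical weights rather than a trained model; the clipping constant `eps` is an assumption to avoid `log(0)`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Linear score passed through the sigmoid."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)


def log_loss(y_true, probs, eps=1e-12):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


weights, bias = [1.5, -2.0], 0.25  # hypothetical coefficients
p = predict_proba(weights, bias, [1.0, 0.5])
print(p)
print(log_loss([1], [0.6]), log_loss([1], [0.01]))  # confident and wrong costs far more
```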

Decision Tree¶


What	Greedy axis-aligned threshold splits → rectangular regions
When to use	Need full interpretability; exploring feature interactions; quick EDA tool
Why	Handles mixed types, captures threshold interactions naturally, visually explainable
When NOT to use	Production predictions (overfits easily); smooth boundaries needed; stability required
How	Limit depth/min leaf size → use as exploration tool → graduate to ensemble for production
Examples	Clinical triage rules (if temp>101 AND O2<94 → ICU); explaining model logic to non-technical stakeholders; quick feature interaction discovery before building a real model

Random Forest¶


What	Many bootstrapped trees averaged; random feature subsets per split
When to use	Need stable out-of-box performance; little tuning budget; moderate data; want feature importance
Why	Variance reduction via averaging; handles non-linearities; robust with minimal hyperparameter work
When NOT to use	Need last-mile accuracy (boosting wins); need calibrated probabilities (often poorly calibrated); very high-dimensional sparse data
How	Set n_estimators high (500+) → tune max_depth and min_samples_leaf → use OOB score for quick validation
Examples	First production model for credit scoring; feature importance ranking for a new dataset; fault detection in sensor data when labeled anomalies exist; insurance claim severity when you need a quick reliable baseline

Gradient Boosting (XGBoost / LightGBM)¶


What	Sequential trees each correcting residual errors of the prior ensemble
When to use	Tabular data where you want best accuracy; Kaggle-style competitions; mixed feature types; moderate-to-large data
Why	Attacks bias sequentially; built-in regularization (shrinkage, subsampling, tree constraints); dominates tabular benchmarks
When NOT to use	Tiny data (overfits); need real-time sub-ms inference (many trees = slow); unstructured data (images, raw text)
How	Start with defaults → tune learning_rate + n_estimators first → then max_depth, subsample, colsample → early stopping on validation → audit for leakage
Examples	Fraud detection at scale (Stripe, PayPal); ranking search results; CTR prediction in ads; customer lifetime value; Kaggle tabular competitions; demand forecasting for e-commerce

L1 Regularization (Lasso)¶


What	Diamond penalty on weights → drives some to exact zero
When to use	Many irrelevant features; want automatic feature selection; sparse model desired
Why	Creates sparsity — simpler model, easier to interpret, less noise
When NOT to use	Many correlated useful features (arbitrarily drops one); when all features likely matter
How	Increase λ until only signal features survive → check which features drop → confirm the surviving set is stable across cross-validation folds
Examples	Genomics with 20k genes but only ~50 matter; selecting which of 200 marketing features actually predict conversion; reducing model size for mobile deployment

L2 Regularization (Ridge)¶


What	Circular penalty → shrinks all weights proportionally toward zero, rarely to exactly zero
When to use	Correlated features that all contribute; coefficients exploding; want smooth stable solution
Why	Stabilizes the matrix inversion in the least-squares solution (XᵀX + λI is always invertible); distributes weight across correlated features instead of picking one
When NOT to use	When you explicitly need feature selection (use L1); when model is already underfitting
How	Start with small λ → increase until val loss stabilizes → coefficients should shrink but not zero out
Examples	NLP bag-of-words with many correlated synonym features; multicollinear economic indicators (GDP, unemployment, CPI); image pixel regression where neighboring pixels correlate

Metrics decision¶

Situation	Primary metric	Why not accuracy
Balanced classes, equal costs	Accuracy or F1	Accuracy is fine here
Rare positives (fraud, disease)	PR-AUC, Recall@K	Accuracy rewards always-negative
Ranking matters (recommendations)	ROC-AUC, NDCG	Need ordering, not threshold
Probabilities drive actions (triage)	Log loss + calibration	Need trustworthy confidence
Regression with outliers	MAE over RMSE	RMSE squares errors, so a few outliers dominate the score
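
The last row can be verified with a small worked example. The values below are invented; the point is how one large error moves each metric.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
clean = [11.0, 11.0, 12.0, 12.0, 13.0]        # every error is 1
one_outlier = [11.0, 11.0, 12.0, 12.0, 32.0]  # last error is 20

print(mae(y_true, clean), rmse(y_true, clean))              # identical when errors are uniform
print(mae(y_true, one_outlier), rmse(y_true, one_outlier))  # RMSE grows much faster
```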

Mini-checkpoint¶

Conceptual¶

Bias vs variance in one sentence each. (02_explainer.md §2.1)
Why does L1 create sparsity? (02_explainer.md §2.3)
Linear vs logistic regression — what changes in the output and loss? (02_explainer.md §3.1, §3.3)
Random forest vs gradient boosting — what problem is each solving? (02_explainer.md §4.2, §4.3)
Why does XGBoost often beat deep learning on tabular data? (02_explainer.md §4.4)
Why is calibration different from accuracy? (02_explainer.md §5.3)
Why is accuracy dangerous on imbalanced data? (02_explainer.md §5.4)

Applied¶

A teammate reports 99% training accuracy on 100 patients. What do you ask next? (02_explainer.md §1.1, §5.1)
You need to predict monthly churn. How do you split the data? (02_explainer.md §5.1)
Your model ranks well but 0.9-confidence cases are only right 60% of the time. What is broken? (02_explainer.md §5.3)
Fraud is 1% of transactions. Which metrics matter most and why? (02_explainer.md §5.2, §5.4)

Self-evaluation¶

Section	Score	Out of
Conceptual	__	14
Applied	__	8
Recall confidence	__	5
Total	__	27

Completion gate¶

[ ] Read all 6 chapters of 02_explainer.md
[ ] Completed the deliverables in 05_hands_on_lab.md
[ ] Can reproduce the failure-fix chain from §6.1 without looking
[ ] Can answer all prompts in 04_daily_recall.md from memory
[ ] Can explain the bridge to 01_neural_network_primitives from §6.6
[ ] Ready to move to the next module
