03. Classical ML Refresher — Study Material

For deeper understanding, see 02_explainer.md. This file is the quick-reference companion: formulas, tables, model-selection notes, and interview-ready summaries.

Section 1 — Bias, variance, and generalization

Term	Plain meaning	Typical symptom	First fix
Bias	model too simple	train and validation both poor	add features or model flexibility
Variance	model too sensitive to sample noise	train great, validation poor	regularize, add data, ensemble
Generalization	performance on truly unseen data	stable validation and test scores	preserve split discipline

See explainer §2.1 and §6.1.

Section 2 — Linear regression

Model form: ŷ = w·x + b
Objective: mean squared error MSE = (1/n) * Σ (y - ŷ)^2
Best when the target is continuous and the relationship is reasonably additive after feature transforms
Strengths: interpretable, fast, good baseline
Risks: underfits curved interactions unless you engineer features

See explainer §3.1.
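
The model form and MSE objective above can be sketched for a single feature, where the least-squares solution has a simple closed form. This is a minimal illustration in plain Python; the data values and the helper names `fit_line` and `mse` are made up for the example.

```python
# Minimal sketch: one-feature least squares, y_hat = w * x + b.
# The data below is invented for illustration only.

def fit_line(xs, ys):
    """Return (w, b) that minimize mean squared error for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

def mse(xs, ys, w, b):
    """MSE = (1/n) * sum of squared residuals."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # roughly y = 2x + 1 plus noise
w, b = fit_line(xs, ys)
print(f"w={w:.3f} b={b:.3f} mse={mse(xs, ys, w, b):.4f}")
```

Any other choice of `w` and `b` gives an MSE at least as large on the same data, which is a quick way to sanity-check a fit.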

Section 3 — Logistic regression and gradient descent

z = w·x + b
p = 1 / (1 + e^(-z))
log loss = - [ y*log(p) + (1-y)*log(1-p) ]

Use for binary classification
Output is a probability, not just a label
Threshold choice is a business decision
Parameters are usually trained with gradient descent

w := w - η * ∂L/∂w    (η is the learning rate; b is updated the same way using ∂L/∂b)

See explainer §3.2 and §3.3.
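
The formulas in this section fit together as shown in the sketch below: compute z, squash it to p, and step the parameters against the gradient of the mean log loss. It is a plain-Python illustration under simple assumptions (one feature, toy data, a fixed learning rate of 0.5 and 2000 steps chosen by hand), not a production trainer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(X, y, w, b):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the dataset."""
    eps = 1e-12  # clip probabilities so log() never sees 0
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        p = min(max(p, eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

def train(X, y, lr=0.5, steps=2000):
    """Batch gradient descent: w := w - lr * dL/dw, b := b - lr * dL/db."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the per-example log loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data (invented): small x values are class 0, large x values are class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train(X, y)
print(f"w={w[0]:.3f} b={b:.3f} loss={log_loss(X, y, w, b):.4f}")
print(f"p(x=1.5)={sigmoid(w[0] * 1.5 + b):.3f}  p(x=3.0)={sigmoid(w[0] * 3.0 + b):.3f}")
```

The model returns probabilities; turning them into labels still requires a threshold, which is where the business decision comes in.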

Section 4 — Regularization

Type	Penalty	Geometry	Typical effect
L1 / Lasso	`λ * Σ |w|`	diamond	sparsity, feature selection
L2 / Ridge	`λ * Σ w²`	circle	smooth shrinkage, stability
Elastic net	mix of L1 and L2	mixed	useful with correlated features

Use L1 when you expect many irrelevant features. Use L2 when many correlated features matter a little. See explainer §2.3.
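
The difference between sparsity and smooth shrinkage is easiest to see on one weight at a time. The sketch below assumes a simplified setting: each weight is pulled toward an unregularized value by a quadratic term `0.5 * (v - w)**2`, plus the penalty from the table. Under that assumption the L1 solution is soft thresholding and the L2 solution is a uniform rescaling. The helper names and the weight values are illustrative.

```python
def l1_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*|v| (soft thresholding)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are set exactly to zero

def l2_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*v**2 (uniform scaling)."""
    return w / (1.0 + 2.0 * lam)

weights = [3.0, -0.4, 0.2, -2.5, 0.05]  # invented unregularized weights
lam = 0.5
l1 = [l1_shrink(w, lam) for w in weights]
l2 = [l2_shrink(w, lam) for w in weights]
print("L1:", l1)  # small weights become exactly 0.0
print("L2:", l2)  # every weight shrinks, none becomes exactly 0
```

With real models the weights interact through the data, so the effect is less clean, but the qualitative pattern (L1 zeroes, L2 shrinks) is the same.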

Section 5 — Trees and ensembles

Model	Mechanism	Main win	Main risk
Decision tree	greedy threshold splits	interpretable interactions	overfits easily
Random forest	many bootstrapped trees averaged	variance reduction	often less accurate than well-tuned boosting
Gradient boosting	sequential residual correction	strong tabular accuracy	tuning burden, can overfit
XGBoost / LightGBM	optimized boosting libraries	top-tier tabular baseline	still needs leakage-safe eval

See explainer §4.1-§4.4.
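
Two rows of the table, "greedy threshold splits" and "sequential residual correction", can be sketched together: a depth-1 tree (a stump) picks the single best threshold, and boosting repeatedly fits a new stump to what the current ensemble still gets wrong. This is a toy version for squared error on one feature, with invented data and hand-picked `rounds` and `lr`; real libraries such as XGBoost and LightGBM add regularization, second-order information, and much faster split finding.

```python
def fit_stump(xs, targets):
    """Greedy search for the one threshold that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20, lr=0.5):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    history = []  # training MSE after each round
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    return predict, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]  # invented step-like target
predict, history = boost(xs, ys)
print(f"train MSE after round 1: {history[0]:.4f}, after round {len(history)}: {history[-1]:.4f}")
```

Training error keeps falling as rounds are added, which is exactly why boosting can overfit: the stopping point has to be chosen on validation data, not on the training curve.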

Section 6 — Splits and cross-validation

Setup	When to use	Warning
Train / validation / test	default workflow	do not peek at test repeatedly
Stratified k-fold	imbalanced classification	preserve class ratios
Group k-fold	repeated users, patients, accounts	keep entities together
Time-series split	temporal prediction	future must never leak backward

Rule: evaluation must mirror deployment. See explainer §5.1.
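
The "keep entities together" warning for group k-fold can be made concrete with a small sketch. The function below is a hypothetical helper written for this note (it is not scikit-learn's `GroupKFold`), and the round-robin assignment of groups to folds is an assumption chosen for simplicity; it does not balance fold sizes.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) pairs where no group appears on both sides."""
    unique = sorted(set(groups))
    fold_of = {g: i % k for i, g in enumerate(unique)}  # round-robin assignment
    for fold in range(k):
        val = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, val

# Invented example: one row per event, several events per user.
user_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
folds = list(group_k_fold(user_ids, k=3))
for train_idx, val_idx in folds:
    train_users = sorted({user_ids[i] for i in train_idx})
    val_users = sorted({user_ids[i] for i in val_idx})
    print("train users:", train_users, "| val users:", val_users)
```

A plain random split over rows would put some of user "c" in train and some in validation, so the validation score would partly measure memorization of that user.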

Section 7 — Metrics cheat table

Metric	Formula idea	Best when	Trap
Accuracy	correct / total	balanced classes, equal costs	misleading on rare events
Precision	TP / (TP + FP)	false positives costly	ignores missed positives
Recall	TP / (TP + FN)	false negatives costly	can flood false alarms
F1	harmonic mean of P and R	both matter	hides calibration
ROC-AUC	ranking quality over thresholds	broad ranking view	optimistic on imbalance
PR-AUC	precision-recall area	rare positive class	harder to interpret; the baseline shifts with class prevalence
Log loss	penalty on wrong probabilities	probability quality matters	hard to explain without calibration
RMSE / MAE	regression error size	continuous targets	RMSE penalizes large errors more heavily than MAE

See explainer §5.2.
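
The accuracy trap in the table is easy to reproduce by hand. The sketch below computes accuracy, precision, recall, and F1 from confusion counts on an invented dataset with 5 positives in 100 rows, comparing a model that always predicts the negative class with one that catches most positives at the cost of some false alarms. Returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1] * 5 + [0] * 95                       # 5% positive rate
always_negative = [0] * 100                       # never flags anything
useful = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89     # 4 hits, 1 miss, 6 false alarms

lazy_scores = summarize(y_true, always_negative)
useful_scores = summarize(y_true, useful)
print("always negative:", lazy_scores)
print("useful model:   ", useful_scores)
```

The always-negative model wins on accuracy while finding none of the positives, which is why the rare-event rows of the table point toward precision, recall, and PR-AUC.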

Section 8 — Calibration and class imbalance

Calibration: when the model says 0.8, reality should be close to 80%
A model can rank well and still be badly calibrated
Use calibration when probabilities drive queueing, pricing, triage, or intervention
Heavy imbalance usually means PR-AUC, recall, threshold tuning, and slice metrics matter more than raw accuracy

See explainer §5.3 and §5.4.
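
One common way to check calibration is a reliability table: bucket the predicted probabilities, then compare the mean prediction in each bucket with the observed positive rate. The sketch below uses equal-width bins and a size-weighted average gap (often called expected calibration error); the bin count, helper names, and data are assumptions made for illustration. Both sets of scores rank the examples identically, yet only one is well calibrated.

```python
def reliability_table(probs, labels, n_bins=5):
    """Return one (mean_predicted, observed_rate, count) row per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        rows.append((mean_p, rate, len(members)))
    return rows

def expected_calibration_error(rows):
    """Size-weighted average of |mean predicted - observed rate| across bins."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(mean_p - rate) for mean_p, rate, n in rows) / total

# Invented data: 9 of the 10 high-score cases are positive, 1 of the 10 low-score cases.
labels = [1] * 9 + [0] + [1] + [0] * 9
probs_ok = [0.9] * 10 + [0.1] * 10      # says 0.9, reality is 90%
probs_over = [0.99] * 10 + [0.01] * 10  # same ranking, overconfident

ece_ok = expected_calibration_error(reliability_table(probs_ok, labels))
ece_over = expected_calibration_error(reliability_table(probs_over, labels))
print(f"ECE calibrated: {ece_ok:.4f}  ECE overconfident: {ece_over:.4f}")
```

Because the ordering of examples is the same in both cases, any ranking metric such as ROC-AUC would score them equally; only a calibration check separates them.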

Section 9 — Feature engineering and leakage

Topic	Practical note
Scaling	needed for distance-based models and models trained by gradient descent; tree-based models do not need it
One-hot encoding	safe default for low-cardinality categoricals
Target encoding	powerful, but compute inside folds only
Interactions	often essential for linear models
Time features	extract weekday, hour, lag, recency
Leakage audit	ask whether a feature knows the future or the label indirectly

Leakage examples:

- fitting a scaler on all data before split
- target encoding on the full dataset
- letting the same user appear in train and validation
- random split on time-ordered events

See explainer §3.4 and §5.1.
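
The first leakage example, fitting a scaler before the split, can be shown in a few lines. In the sketch below the scaling statistics come from the training rows only and are then reused on validation; the leaky variant is included purely as the anti-pattern. The values and helper names are invented, and the last three rows are assumed to be a later, shifted validation period.

```python
import statistics

def fit_scaler(values):
    """Learn mean and standard deviation from the given rows only."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero on constant data
    return mean, std

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

values = [10.0, 12.0, 11.0, 13.0, 9.0, 50.0, 55.0, 60.0]  # time ordered
train, val = values[:5], values[5:]

# Correct: statistics come from training rows, then are applied to validation.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
val_scaled = transform(val, mean, std)

# Anti-pattern: validation rows influence the statistics used on training rows.
leaky_mean, leaky_std = fit_scaler(values)

print(f"train-only mean={mean:.2f} std={std:.2f}")
print(f"leaky      mean={leaky_mean:.2f} std={leaky_std:.2f}")
print("validation after correct scaling:", [round(v, 2) for v in val_scaled])
```

With train-only statistics the validation values come out far from zero, which honestly reflects the shift the deployed model would face; the leaky version hides that shift inside the scaler.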

Section 10 — Foundation gap audit and bridge

What the next module assumes:

1. gradient descent concept
2. loss minimization
3. train/test split logic
4. feature representation
5. overfitting concept

Where to revise them:

- gradient descent → explainer §3.2
- loss minimization → explainer §3.1, §3.3
- split logic → explainer §1.1, §5.1
- feature representation → explainer §3.4
- overfitting → explainer §1.1, §2.1, §5.4

Bridge sentence from explainer §6.6:

Next module — 01_neural_network_primitives — takes these ideas to high-dimensional space. The gradient descent you learned here becomes backpropagation. The ensemble idea becomes layers. The regularization shapes become dropout and weight decay.

Self-check

For full interview-ready prompts, see 02_explainer.md §6.3.

Why is high training accuracy alone unconvincing? (§1.1)
Which geometry belongs to L1 and which to L2? (§2.3)
Why does logistic regression still have a linear boundary? (§3.3)
When does feature engineering matter more than algorithm choice? (§3.4)
Random forest vs boosting — what problem is each fixing? (§4.2, §4.3)
Why does XGBoost fit tabular structure so well? (§4.4)
Why is repeated test-set peeking dangerous? (§5.1)
ROC-AUC vs PR-AUC — when does PR-AUC matter more? (§5.2, §5.4)
What is calibration in one sentence? (§5.3)
Name three leakage paths from memory. (§5.1)