03. Classical ML Refresher — Study Material

For deep understanding see 02_explainer.md. This file is the quick-reference companion: formulas, tables, model-selection notes, and interview-ready summaries.

Section 1 — Bias, variance, and generalization

| Term | Plain meaning | Typical symptom | First fix |
|---|---|---|---|
| Bias | model too simple | train and validation both poor | add features or model flexibility |
| Variance | model too sensitive to sample noise | train great, validation poor | regularize, add data, ensemble |
| Generalization | performance on truly unseen data | stable validation and test scores | preserve split discipline |

See explainer §2.1 and §6.1.

Section 2 — Linear regression

  • Model form: ŷ = w·x + b
  • Objective: mean squared error MSE = (1/n) * Σ (y - ŷ)^2
  • Best when the target is continuous and the relationship is reasonably additive after feature transforms
  • Strengths: interpretable, fast, good baseline
  • Risks: underfits nonlinear relationships and feature interactions unless you engineer features for them
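
As a rough illustration of the model form and the MSE objective above, here is a minimal single-feature sketch in plain Python. The helper names `fit_line` and `mse` and the toy data are made up for this example. The closed-form slope (covariance over variance) is one standard way to minimize MSE for one feature, not necessarily the method the explainer uses.

```python
def fit_line(xs, ys):
    """Least-squares fit of y_hat = w * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); this minimizes MSE in closed form.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error: (1/n) * sum((y - y_hat)^2)."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (invented for illustration): roughly y = 2x + 1 with small noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, MSE={mse(xs, ys, w, b):.4f}")
```

The fitted line should score an MSE no worse than any other line on the same data, including the "true" guess w=2, b=1.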

See explainer §3.1.

Section 3 — Logistic regression and gradient descent

z = w·x + b
p = 1 / (1 + e^(-z))
log loss = - [ y*log(p) + (1-y)*log(1-p) ]
  • Use for binary classification
  • Output is a probability, not just a label
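  • Threshold choice is a business decision: it trades false positives against false negatives
  • Parameters are usually trained with gradient descent, where η is the learning rate and L is the log loss:
w := w - η * ∂L/∂w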
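
The formulas above can be sketched as a small training loop. This is a minimal single-feature version in plain Python, assuming batch gradient descent with a fixed learning rate. The function names, the toy data, and the hyperparameters (`lr=0.1`, `steps=500`) are illustrative choices, not values from the explainer.

```python
import math


def sigmoid(z):
    """p = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(xs, ys, w, b):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over the dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def train(xs, ys, lr=0.1, steps=500):
    """Batch gradient descent: w := w - lr * dL/dw, and the same for b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # For log loss with a sigmoid, dL/dz = p - y,
        # so dL/dw = (p - y) * x and dL/db = (p - y).
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy, deliberately overlapping data (invented for illustration).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

initial_loss = log_loss(xs, ys, 0.0, 0.0)
w, b = train(xs, ys)
final_loss = log_loss(xs, ys, w, b)
print(f"loss before={initial_loss:.4f}, after={final_loss:.4f}, w={w:.3f}")
```

The model still outputs a probability. Turning it into a label needs a separate threshold, which the loop above never touches.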

See explainer §3.2 and §3.3.

Section 4 — Regularization

| Type | Penalty | Geometry | Typical effect |
|---|---|---|---|
| L1 / Lasso | λ * Σ \|w\| | diamond | sparsity, feature selection |
| L2 / Ridge | λ * Σ w² | circle | smooth shrinkage, stability |
| Elastic net | mix of L1 and L2 | mixed | useful with correlated features |

Use L1 when you expect many irrelevant features. Use L2 when many correlated features matter a little. See explainer §2.3.
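
To see why L1 produces exact zeros while L2 only shrinks, it helps to look at one coefficient at a time. The sketch below assumes a simplified setting that is not stated in this file: orthonormal feature columns and a loss of ½ Σ (y - ŷ)², so each coefficient can be solved independently from its unregularized value. Under that assumption the L1 solution is soft-thresholding and the L2 solution is a rescaling. The coefficient values are hypothetical.

```python
def soft_threshold(w, lam):
    """L1 solution for one coefficient: move toward zero by lam, clip at zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def ridge_shrink(w, lam):
    """L2 solution for one coefficient (penalty lam * w^2): scale down only."""
    return w / (1.0 + 2.0 * lam)


# Hypothetical unregularized coefficients: two strong, three weak.
w_ols = [3.0, -2.0, 0.3, -0.2, 0.1]
lam = 0.5

w_l1 = [soft_threshold(w, lam) for w in w_ols]
w_l2 = [ridge_shrink(w, lam) for w in w_ols]

print("L1:", w_l1)  # weak coefficients become exactly 0.0
print("L2:", w_l2)  # every coefficient shrinks, none reaches zero
```

With real, correlated features the solutions no longer decouple like this, but the qualitative behavior is the same.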

Section 5 — Trees and ensembles

| Model | Mechanism | Main win | Main risk |
|---|---|---|---|
| Decision tree | greedy threshold splits | interpretable interactions | overfits easily |
| Random forest | many bootstrapped trees, averaged | variance reduction | often less accurate than tuned boosting |
| Gradient boosting | sequential residual correction | strong tabular accuracy | tuning burden, can overfit |
| XGBoost / LightGBM | optimized boosting libraries | top-tier tabular baseline | still needs leakage-safe evaluation |
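
"Sequential residual correction" can be sketched in a few lines. The example below is a toy version of gradient boosting for squared error, assuming one feature, depth-1 trees (stumps), and a fixed learning rate. The function names and the data are invented for illustration, and real libraries such as XGBoost or LightGBM add many refinements that are not shown here.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split: predict the mean residual on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=20, lr=0.5):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


def predict(x, base, stumps, lr=0.5):
    return base + lr * sum(s(x) for s in stumps)


# Toy curved target (invented): y = x^2 on ten points.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

base, stumps, history = boost(xs, ys)
print(f"train MSE after round 1: {history[0]:.2f}, after round 20: {history[-1]:.2f}")
```

Training error keeps falling as rounds are added, which is exactly why boosting can overfit: the falling number says nothing about validation error.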

See explainer §4.1-§4.4.

Section 6 — Splits and cross-validation

| Setup | When to use | Warning |
|---|---|---|
| Train / validation / test | default workflow | do not peek at test repeatedly |
| Stratified k-fold | imbalanced classification | preserve class ratios |
| Group k-fold | repeated users, patients, accounts | keep entities together |
| Time-series split | temporal prediction | future must never leak backward |

Rule: evaluation must mirror deployment. See explainer §5.1.
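
As one concrete case of the table above, here is a minimal sketch of a group-aware split in plain Python. The function `group_k_fold` and the user ids are hypothetical. It assigns whole groups to folds round-robin, which is a simplification: scikit-learn's `GroupKFold` serves the same purpose with better fold balancing.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) pairs where no group spans both sides."""
    unique = sorted(set(groups))
    # Round-robin assignment of whole groups to folds (deterministic sketch).
    fold_of = {g: i % k for i, g in enumerate(unique)}
    for fold in range(k):
        val_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, val_idx


# Hypothetical rows keyed by user id; some users appear several times.
users = ["u1", "u2", "u1", "u3", "u2", "u4", "u3", "u1"]

splits = list(group_k_fold(users, k=2))
for train_idx, val_idx in splits:
    train_users = {users[i] for i in train_idx}
    val_users = {users[i] for i in val_idx}
    print("train:", sorted(train_users), "val:", sorted(val_users))
```

A plain random split of the same rows would very likely put `u1` on both sides, letting the model score well by memorizing that user.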

Section 7 — Metrics cheat table

| Metric | Formula / idea | Best when | Trap |
|---|---|---|---|
| Accuracy | correct / total | balanced classes, equal costs | misleading on rare events |
| Precision | TP / (TP + FP) | false positives costly | ignores missed positives |
| Recall | TP / (TP + FN) | false negatives costly | can flood false alarms |
| F1 | harmonic mean of P and R | both matter | hides calibration |
| ROC-AUC | ranking quality over thresholds | broad ranking view | can look optimistic under heavy imbalance |
| PR-AUC | area under the precision-recall curve | rare positive class | less intuitive to explain |
| Log loss | penalty on wrong probabilities | probability quality matters | hard to interpret without a calibration baseline |
| RMSE / MAE | regression error size | continuous targets | RMSE penalizes outliers more than MAE |
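
The accuracy trap in the first row is easy to reproduce. The sketch below computes the table's confusion-matrix metrics in plain Python on an invented dataset with 5 positives out of 100. The function name and both sets of predictions are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented rare-event data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A never predicts the positive class.
always_negative = [0] * 100
# Model B finds 4 of the 5 positives at the cost of 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

a = classification_metrics(y_true, always_negative)
b = classification_metrics(y_true, detector)
print("always negative:", a)
print("detector:       ", b)
```

Model A wins on accuracy (0.95 vs 0.93) while catching nothing. Recall and F1 expose the difference.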

See explainer §5.2.

Section 8 — Calibration and class imbalance

  • Calibration: when the model says 0.8, reality should be close to 80%
  • A model can rank well and still be badly calibrated
  • Use calibration when probabilities drive queueing, pricing, triage, or intervention
  • Heavy imbalance usually means PR-AUC, recall, threshold tuning, and slice metrics matter more than raw accuracy
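
One simple way to check the first bullet is a reliability table: bucket predictions by predicted probability and compare the mean prediction in each bucket with the observed positive rate. The sketch below is a minimal plain-Python version with equal-width bins. The function name, the bin count, and the data are assumptions made for illustration.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for items in bins:
        if not items:
            continue
        mean_p = sum(p for p, _ in items) / len(items)
        observed = sum(y for _, y in items) / len(items)
        rows.append((mean_p, observed, len(items)))
    return rows


# Invented scores: the model says 0.1 for ten cases and 0.8 for ten others.
probs = [0.1] * 10 + [0.8] * 10
# Observed outcomes: 1 of 10 positive in the low group, 5 of 10 in the high group.
labels = [1] + [0] * 9 + [1] * 5 + [0] * 5

rows = reliability_table(probs, labels)
for mean_p, observed, count in rows:
    print(f"predicted={mean_p:.2f} observed={observed:.2f} n={count}")
```

Here the model ranks the two groups correctly but is overconfident in the high group: it says 0.8 where the observed rate is 0.5.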

See explainer §5.3 and §5.4.

Section 9 — Feature engineering and leakage

| Topic | Practical note |
|---|---|
| Scaling | needed for distance-based models and many gradient-based models |
| One-hot encoding | safe default for low-cardinality categoricals |
| Target encoding | powerful, but compute inside folds only |
| Interactions | often essential for linear models |
| Time features | extract weekday, hour, lag, recency |
| Leakage audit | ask whether a feature knows the future or the label indirectly |

Leakage examples:

  • fitting a scaler on all data before split
  • target encoding on the full dataset
  • letting the same user appear in train and validation
  • random split on time-ordered events
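
The scaler case is the easiest leak to show in code. The sketch below is a minimal plain-Python standardizer with hypothetical helper names and invented values. The safe path learns mean and standard deviation from the training rows only and reuses them on validation. The leaky path is included purely for contrast.

```python
def fit_scaler(values):
    """Learn mean and (population) standard deviation from the given values only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


# Invented feature column; the last two rows are the validation split.
feature = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, val = feature[:4], feature[4:]

# Leakage-safe: statistics come from the training rows only.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
val_scaled = transform(val, mean, std)

# Leaky variant: statistics computed on all rows before the split,
# so validation values have already influenced the training features.
leaky_mean, leaky_std = fit_scaler(feature)

print(f"train-only mean={mean:.2f}, all-data mean={leaky_mean:.2f}")
```

The same pattern applies to target encoding: compute the statistic on the training fold, then apply it to the held-out fold.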

See explainer §3.4 and §5.1.

Section 10 — Foundation gap audit and bridge

What the next module assumes:

  1. gradient descent concept
  2. loss minimization
  3. train/test split logic
  4. feature representation
  5. overfitting concept

Where to revise them:

  • gradient descent → explainer §3.2
  • loss minimization → explainer §3.1, §3.3
  • split logic → explainer §1.1, §5.1
  • feature representation → explainer §3.4
  • overfitting → explainer §1.1, §2.1, §5.4

Bridge sentence from explainer §6.6:

Next module — 01_neural_network_primitives — takes these ideas to high-dimensional space. The gradient descent you learned here becomes backpropagation. The ensemble idea becomes layers. The regularization shapes become dropout and weight decay.

Self-check

For full interview-ready prompts, see 02_explainer.md §6.3.

  1. Why is high training accuracy alone unconvincing? (§1.1)
  2. Which geometry belongs to L1 and which to L2? (§2.3)
  3. Why does logistic regression still have a linear boundary? (§3.3)
  4. When does feature engineering matter more than algorithm choice? (§3.4)
  5. Random forest vs boosting — what problem is each fixing? (§4.2, §4.3)
  6. Why does XGBoost fit tabular structure so well? (§4.4)
  7. Why is repeated test-set peeking dangerous? (§5.1)
  8. ROC-AUC vs PR-AUC — when does PR-AUC matter more? (§5.2, §5.4)
  9. What is calibration in one sentence? (§5.3)
  10. Name three leakage paths from memory. (§5.1)