03. Classical ML Refresher — Study Material
For deep understanding, see 02_explainer.md. This file is the quick-reference companion: formulas, tables, model-selection notes, and interview-ready summaries.
Section 1 — Bias, variance, and generalization
| Term | Plain meaning | Typical symptom | First fix |
|---|---|---|---|
| Bias | model too simple | train and validation both poor | add features or model flexibility |
| Variance | model too sensitive to sample noise | train great, validation poor | regularize, add data, ensemble |
| Generalization | performance on truly unseen data | stable validation and test scores | preserve split discipline |
See explainer §2.1 and §6.1.
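
The symptoms in the table can be reproduced on a tiny example. The sketch below is illustrative only: the data is made up, and the two models (a constant mean predictor and a 1-nearest-neighbour lookup) are stand-ins chosen to show the high-bias and high-variance extremes.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def mean_predictor(train_y):
    """High-bias model: always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


def one_nn_predictor(train_x, train_y):
    """High-variance model: copies the label of the nearest training point."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


# Toy data (made up): the true relationship is y = x,
# and the training labels carry +/-1 noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 0.0, 3.0, 2.0, 5.0, 4.0]
val_x = [0.4, 1.6, 2.4, 3.6, 4.4]
val_y = [0.4, 1.6, 2.4, 3.6, 4.4]

scores = {}
for name, model in [("mean", mean_predictor(train_y)),
                    ("1-NN", one_nn_predictor(train_x, train_y))]:
    train_mse = mse(train_y, [model(x) for x in train_x])
    val_mse = mse(val_y, [model(x) for x in val_x])
    scores[name] = (train_mse, val_mse)
    print(f"{name}: train MSE={train_mse:.2f}, validation MSE={val_mse:.2f}")
```

The mean predictor scores poorly on both sets (bias), while 1-NN memorizes the noisy training labels, so its training error is zero but its validation error is not (variance).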
Section 2 — Linear regression
- Model form: ŷ = w·x + b
- Objective: mean squared error, MSE = (1/n) * Σ (y - ŷ)^2
- Best when the target is continuous and the relationship is reasonably additive after feature transforms
- Strengths: interpretable, fast, good baseline
- Risks: underfits curved interactions unless you engineer features
See explainer §3.1.
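
As a worked instance of the model form and the MSE objective, here is a minimal sketch for a single feature using the closed-form least-squares solution. The function names and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least squares for one feature: returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mse(ys, preds):
    """Mean squared error: (1/n) * sum of (y - y_hat)^2."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (made up), generated exactly by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
preds = [w * x + b for x in xs]
print(w, b, mse(ys, preds))
```

Because the toy targets lie exactly on a line, the fit recovers w = 2 and b = 1 with zero error. Real data would leave a nonzero MSE.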
Section 3 — Logistic regression and gradient descent
- Use for binary classification
- Output is a probability, not just a label
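- Threshold choice is a business decision: it trades false positives against false negatives, so set it from their relative costs rather than defaulting to 0.5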
- Parameters are usually trained with gradient descent
See explainer §3.2 and §3.3.
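
The points above can be sketched as plain batch gradient descent on the mean log loss for a one-feature model. The learning rate, step count, data, and thresholds are illustrative assumptions, not recommended settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.3, steps=5000):
    """Batch gradient descent on mean log loss for p = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean log loss: averages of (p - y) * x and (p - y).
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (made up); the classes overlap, so the weights stay finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)
probs = [sigmoid(w * x + b) for x in xs]

# The same probabilities give different labels under different thresholds.
cautious = [1 if p >= 0.7 else 0 for p in probs]
eager = [1 if p >= 0.3 else 0 for p in probs]
print(sum(cautious), sum(eager))
```

Lowering the threshold can only add positive predictions, which is why the choice is about costs, not about the model itself.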
Section 4 — Regularization
| Type | Penalty | Geometry | Typical effect |
|---|---|---|---|
| L1 / Lasso | λ * Σ \|w\| | diamond | sparsity, feature selection |
| L2 / Ridge | λ * Σ w² | circle | smooth shrinkage, stability |
| Elastic net | mix of L1 and L2 | mixed | useful with correlated features |
Use L1 when you expect many irrelevant features. Use L2 when many correlated features matter a little. See explainer §2.3.
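
To see why L1 produces sparsity while L2 only shrinks, consider a simplified one-weight problem: minimize 0.5*(w - z)^2 plus the penalty, where z is the unpenalized weight. This per-weight setup is an assumption made for illustration (it holds exactly only when features do not interact); the function names are hypothetical.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small weights are set exactly to zero


def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


unpenalized = [3.0, -0.4, 0.2, -2.0]  # made-up weights
lam = 0.5

l1_weights = [l1_shrink(z, lam) for z in unpenalized]
l2_weights = [l2_shrink(z, lam) for z in unpenalized]
print(l1_weights)
print(l2_weights)
```

L1 zeroes the two small weights and subtracts a constant from the large ones. L2 scales every weight by the same factor and never reaches exactly zero.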
Section 5 — Trees and ensembles
| Model | Mechanism | Main win | Main risk |
|---|---|---|---|
| Decision tree | greedy threshold splits | interpretable interactions | overfits easily |
| Random forest | many bootstrapped trees averaged | variance reduction | often less accurate than well-tuned boosting |
| Gradient boosting | sequential residual correction | strong tabular accuracy | tuning burden, can overfit |
| XGBoost / LightGBM | optimized boosting libraries | top-tier tabular baseline | still needs leakage-safe eval |
See explainer §4.1-§4.4.
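
The "greedy threshold splits" mechanism in the first row can be sketched as a single regression split (a decision stump) that tries every midpoint and keeps the one with the lowest total squared error. This is a simplified illustration with made-up data, not how any particular library implements it.

```python
def best_split(xs, ys):
    """Greedy search for the threshold that minimizes total squared error."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_threshold, best_error = None, float("inf")
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct x values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Toy data (made up): two clusters with clearly different target levels.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]

threshold, error = best_split(xs, ys)
print(threshold, round(error, 3))
```

A full tree repeats this search inside each side of the split. Forests average many such trees, and boosting fits each new tree to the residuals of the previous ones.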
Section 6 — Splits and cross-validation
| Setup | When to use | Warning |
|---|---|---|
| Train / validation / test | default workflow | do not peek at test repeatedly |
| Stratified k-fold | imbalanced classification | preserve class ratios |
| Group k-fold | repeated users, patients, accounts | keep entities together |
| Time-series split | temporal prediction | future must never leak backward |
Rule: evaluation must mirror deployment. See explainer §5.1.
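
Two of the setups in the table can be sketched in a few lines of index bookkeeping. The helper names are hypothetical and the logic is deliberately simple (round-robin group assignment, equal-sized time blocks, rows assumed sorted by time); production code would normally rely on a library splitter.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    for fold in range(k):
        # Assign whole groups to folds round-robin.
        val_groups = set(unique_groups[fold::k])
        val_idx = [i for i, g in enumerate(groups) if g in val_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
        yield train_idx, val_idx


def time_series_splits(n_samples, n_splits):
    """Yield expanding-window splits: train on the past, validate on the next block.

    Rows left over when n_samples is not a multiple of (n_splits + 1) are ignored.
    """
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * block))
        val_idx = list(range(i * block, (i + 1) * block))
        yield train_idx, val_idx


# Made-up example: seven rows belonging to four users.
groups = ["u1", "u1", "u2", "u3", "u3", "u3", "u4"]
for train_idx, val_idx in group_k_fold(groups, k=2):
    print("group fold:", train_idx, val_idx)

for train_idx, val_idx in time_series_splits(n_samples=8, n_splits=3):
    print("time fold:", train_idx, val_idx)
```

In every time fold the largest training index is smaller than the smallest validation index, which is the "future must never leak backward" rule expressed as code.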
Section 7 — Metrics cheat table
| Metric | Formula idea | Best when | Trap |
|---|---|---|---|
| Accuracy | correct / total | balanced classes, equal costs | misleading on rare events |
| Precision | TP / (TP + FP) | false positives costly | ignores missed positives |
| Recall | TP / (TP + FN) | false negatives costly | can flood false alarms |
| F1 | harmonic mean of P and R | both matter | hides calibration |
| ROC-AUC | ranking quality over thresholds | broad ranking view | optimistic on imbalance |
| PR-AUC | precision-recall area | rare positive class | less intuitive casually |
| Log loss | penalty on wrong probabilities | probability quality matters | hard to explain without calibration |
| RMSE / MAE | regression error size | continuous targets | outlier sensitivity differs |
See explainer §5.2.
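
The accuracy trap and the RMSE/MAE difference are easy to verify by hand. The sketch below uses made-up predictions; returning 0.0 when a denominator is zero is an assumed convention, not a standard.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


# 100 made-up cases with 5 positives (a rare event).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100                     # never flags anything
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89  # 4 hits, 1 miss, 6 false alarms

baseline = classification_metrics(y_true, always_negative)
useful = classification_metrics(y_true, detector)
print(baseline)
print(useful)


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Same MAE, different RMSE: squaring punishes the single large error.
even_errors = [1.0, 1.0, 1.0, 1.0]
one_outlier = [0.0, 0.0, 0.0, 4.0]
print(mae(even_errors), rmse(even_errors))
print(mae(one_outlier), rmse(one_outlier))
```

The model that never flags a positive wins on accuracy (0.95 versus 0.93) while having zero recall, which is why accuracy misleads on rare events.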
Section 8 — Calibration and class imbalance
- Calibration: when the model says 0.8, reality should be close to 80%
- A model can rank well and still be badly calibrated
- Use calibration when probabilities drive queueing, pricing, triage, or intervention
- Heavy imbalance usually means PR-AUC, recall, threshold tuning, and slice metrics matter more than raw accuracy
See explainer §5.3 and §5.4.
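
A reliability table is one simple way to check calibration: bin the predictions and compare the mean predicted probability with the observed positive rate in each bin. The sketch below assumes equal-width bins and uses made-up scores; the function name is hypothetical.

```python
def reliability_table(probs, labels, n_bins=5):
    """Return (mean_predicted, observed_rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Made-up outcomes: 2 of 10 positives in the low group, 8 of 10 in the high group.
labels = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2

calibrated = [0.2] * 10 + [0.8] * 10       # stated probabilities match reality
overconfident = [0.01] * 10 + [0.99] * 10  # same ranking, exaggerated confidence

good_rows = reliability_table(calibrated, labels)
bad_rows = reliability_table(overconfident, labels)
for row in good_rows + bad_rows:
    print(row)
```

Both score lists rank the cases identically, so any ranking metric treats them the same, yet only the first one can be read as a probability.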
Section 9 — Feature engineering and leakage
| Topic | Practical note |
|---|---|
| Scaling | needed for distance-based models and helps most gradient-trained models converge; tree-based models generally do not need it |
| One-hot encoding | safe default for low-cardinality categoricals |
| Target encoding | powerful, but compute inside folds only |
| Interactions | often essential for linear models |
| Time features | extract weekday, hour, lag, recency |
| Leakage audit | ask whether a feature knows the future or the label indirectly |
Leakage examples:
- fitting a scaler on all data before split
- target encoding on the full dataset
- letting the same user appear in train and validation
- random split on time-ordered events
See explainer §3.4 and §5.1.
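
The first leakage example, fitting a scaler before the split, can be shown with standardization written by hand. The values are made up, and the helper names are hypothetical.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and (population) standard deviation from the given values."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]
validation = [100.0]  # an extreme value the training data never saw

# Leaky: statistics are computed on train and validation together,
# so the training features already "know" about the validation row.
leaky_mean, leaky_std = fit_scaler(train + validation)
leaky_train = transform(train, leaky_mean, leaky_std)

# Safe: statistics come from train only and are reused for validation.
safe_mean, safe_std = fit_scaler(train)
safe_train = transform(train, safe_mean, safe_std)
safe_validation = transform(validation, safe_mean, safe_std)

print(leaky_mean, safe_mean)
print(leaky_train)
print(safe_train)
```

With the leaky scaler, the training features shift because of a row the model should not have seen. The same fit-on-train-only pattern applies to target encoding inside each fold.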
Section 10 — Foundation gap audit and bridge
What the next module assumes:
1. gradient descent concept
2. loss minimization
3. train/test split logic
4. feature representation
5. overfitting concept
Where to revise them:
- gradient descent → explainer §3.2
- loss minimization → explainer §3.1, §3.3
- split logic → explainer §1.1, §5.1
- feature representation → explainer §3.4
- overfitting → explainer §1.1, §2.1, §5.4
Bridge sentence from explainer §6.6:
Next module — 01_neural_network_primitives — takes these ideas to high-dimensional space. The gradient descent you learned here becomes backpropagation. The ensemble idea becomes layers. The regularization shapes become dropout and weight decay.
Self-check
For full interview-ready prompts, see 02_explainer.md §6.3.
- Why is high training accuracy alone unconvincing? (§1.1)
- Which geometry belongs to L1 and which to L2? (§2.3)
- Why does logistic regression still have a linear boundary? (§3.3)
- When does feature engineering matter more than algorithm choice? (§3.4)
- Random forest vs boosting — what problem is each fixing? (§4.2, §4.3)
- Why does XGBoost fit tabular structure so well? (§4.4)
- Why is repeated test-set peeking dangerous? (§5.1)
- ROC-AUC vs PR-AUC — when does PR-AUC matter more? (§5.2, §5.4)
- What is calibration in one sentence? (§5.3)
- Name three leakage paths from memory. (§5.1)