00. Classical ML Refresher - First-Principles Overview
Classical ML is what you use when the world arrives as rows, columns, labels, scores, and business thresholds.
A fraud team ships a model that looks excellent in the notebook. Validation accuracy is 99.8%. The demo feels safe. Leadership relaxes.
Then production traffic arrives. Almost every transaction is legitimate, so a model can look accurate by saying "not fraud" all day. The few fraudulent transactions hide inside rare patterns. A new merchant category appears. Weekend behavior differs from weekday behavior. The model assigns a score of 0.91 to a group of transactions, but only about half of the transactions scored that high are truly fraud. Support complains about blocked cards, risk complains about missed fraud, and the dashboard still says accuracy is high.
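The accuracy trap described above can be reproduced in a few lines. This is a minimal sketch on made-up data: the 10,000 transactions, the 20 fraud cases, and the helper names `accuracy` and `recall` are all assumptions chosen so that the do-nothing model lands on the same 99.8% figure.

```python
# Hypothetical toy data: 10,000 transactions, 20 of them fraud (0.2% base rate).
labels = [1] * 20 + [0] * 9_980  # 1 = fraud, 0 = legitimate

# A "model" that never flags anything as fraud.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual fraud cases that the model caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy = {accuracy(labels, predictions):.3f}")  # accuracy = 0.998
print(f"recall   = {recall(labels, predictions):.3f}")    # recall   = 0.000
```

The accuracy number is real, and it says nothing about the cases the fraud team exists to catch.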
So the root cause is not "the model needs to be more advanced." The root cause is that classical ML is a contract between data shape, model shape, loss, metric, and deployment threshold. If any part of that contract is wrong, the model can train without a single error and still be operationally useless.
This module rebuilds that contract from first principles. We start with the train-production gap, because the only question that matters is whether a pattern learned from historical rows still works on future rows. Then we name the two failure directions: underfitting when the model cannot express the pattern, and overfitting when it memorizes accidents. From there, every mechanism has a job: choose a model shape, constrain it, optimize its parameters, engineer useful features, evaluate honestly, calibrate scores, handle imbalance, and admit where tabular learners stop being enough.
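The two failure directions can be seen on a tiny hand-made dataset. The sketch below is illustrative only: the data points are invented (roughly `y = 2x` plus alternating noise), and the three learners (`fit_constant`, `fit_line`, `fit_nearest`) are hypothetical stand-ins for "too simple", "about right", and "memorizes the training rows".

```python
# Hand-made toy data (an assumption for illustration): y is about 2*x plus noise.
train = [(0.0, 1.0), (1.0, 1.0), (2.0, 5.0), (3.0, 5.0), (4.0, 9.0), (5.0, 9.0)]
test = [(0.4, -0.2), (1.4, 3.8), (2.4, 3.8), (3.4, 7.8), (4.4, 7.8)]


def mse(model, rows):
    """Mean squared error of a model (a function of x) on (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


def fit_constant(rows):
    """Underfit candidate: always predict the training mean."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Ordinary least squares for a single feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_nearest(rows):
    """Overfit candidate: return the label of the closest training row."""
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]


fitters = {"constant": fit_constant, "line": fit_line, "nearest": fit_nearest}
results = {}
for name, fit in fitters.items():
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:8s} train MSE = {results[name][0]:.2f}  test MSE = {results[name][1]:.2f}")
```

On this data the constant model is bad on both splits, the nearest-row memorizer is perfect on training rows and noticeably worse on test rows, and the line sits between them on training error while doing best on test error. The gap between the two columns is the thing to watch, not the training column alone.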
The recurring lesson is simple: classical ML is not a bag of algorithms. It is controlled generalization from examples. A senior engineer should be able to look at a dataset, ask what signal is stable, choose the smallest model shape that can express it, measure the right production cost, and set a threshold that matches the business tradeoff.
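One way to make "a threshold that matches the business tradeoff" concrete is an expected-cost rule. The sketch below assumes the score is a calibrated probability of fraud and that each error type has a single fixed cost. Both are simplifying assumptions, and the cost values and function names are hypothetical. Under those assumptions, blocking is cheaper than allowing whenever `p * cost_false_negative > (1 - p) * cost_false_positive`, which rearranges to the threshold computed here.

```python
def cost_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting (blocking) has lower expected cost.

    Assumes the score is a calibrated probability and costs are fixed.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_block(probability, threshold):
    """Turn a probability into an action at the chosen threshold."""
    return probability >= threshold


# Hypothetical costs: wrongly blocking a good customer costs 5 units,
# letting a fraudulent transaction through costs 95 units.
threshold = cost_threshold(5.0, 95.0)
print(f"threshold = {threshold:.2f}")  # threshold = 0.05

print(should_block(0.10, threshold))  # True
print(should_block(0.01, threshold))  # False
```

If the two costs were equal, the same rule would give a threshold of 0.5. The threshold moves when the business cost ratio moves, even if the model does not change.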
The recurring pressures and concepts
| Pressure / concept | Meaning |
|---|---|
| Train-production gap | The model learns from yesterday's rows but must survive tomorrow's distribution. |
| Bias and variance | The two ways generalization fails: too simple to learn signal, or too flexible to ignore noise. |
| Model shape | The boundary or function family an algorithm can express before data and training begin. |
| Regularization | A deliberate constraint that trades some training fit for better future behavior. |
| Feature engineering | Turning raw columns into signals the model shape can actually use. |
| Metric alignment | Choosing a measurement that matches the real cost of false positives, false negatives, ranking errors, or bad scores. |
| Calibration and thresholds | Separating "how likely is this?" from "what action should we take at this score?" |
| Unsupervised structure | Finding useful geometry when there is no label to optimize directly. |
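
The "Calibration and thresholds" row can be checked with a simple binning table: group predictions by score, then compare the average score in each bin with the observed positive rate. This is a minimal sketch. The function name `calibration_table`, the bin count, and the toy scores are assumptions for illustration, chosen to mirror a score of 0.91 that is right only half the time.

```python
def calibration_table(scores, labels, n_bins=10):
    """Return (bin index, count, mean score, observed positive rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((score, label))

    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((idx, len(members), mean_score, observed_rate))
    return rows


# Hypothetical data: ten cases scored 0.91 (five are fraud), ten scored 0.05 (none are).
scores = [0.91] * 10 + [0.05] * 10
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] + [0] * 10

for idx, count, mean_score, observed_rate in calibration_table(scores, labels):
    print(f"bin {idx}: n={count}  mean score={mean_score:.2f}  observed rate={observed_rate:.2f}")
```

A well-calibrated model would show the two columns close to each other in every bin. Here the top bin claims about 0.91 and delivers 0.50, so the score may still rank cases usefully but should not be read as a probability.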
The core picture
historical rows
|
v
features + labels
|
v
model shape
|
v
training loss chooses parameters
|
v
validation estimates future behavior
|
v
metric + threshold choose action
|
v
production feedback exposes drift, imbalance, and bad assumptions
The picture matters because each box can lie in a different way.
The rows can be stale. The labels can encode human bias or delayed truth. The feature set can leak information that will not exist at prediction time. The model shape can be too weak or too flexible. The training loss can optimize a proxy nobody cares about. The validation split can fail to mimic deployment. The metric can reward doing nothing on rare events. The threshold can be tuned for a cost ratio the business no longer accepts.
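One of these lies, a validation split that does not mimic deployment, has a simple first defense when rows carry timestamps: train on the past and validate on the future. The sketch below uses invented rows and a hypothetical `time_split` helper. The field names `event_date`, `amount`, and `is_fraud` are assumptions, not a schema from this module.

```python
from datetime import date

# Hypothetical rows, deliberately out of order; field names are assumptions.
rows = [
    {"event_date": date(2024, 3, 3), "amount": 310.0, "is_fraud": 1},
    {"event_date": date(2024, 1, 5), "amount": 20.0, "is_fraud": 0},
    {"event_date": date(2024, 2, 28), "amount": 75.5, "is_fraud": 0},
    {"event_date": date(2024, 1, 20), "amount": 12.0, "is_fraud": 0},
    {"event_date": date(2024, 3, 19), "amount": 48.0, "is_fraud": 0},
    {"event_date": date(2024, 2, 11), "amount": 990.0, "is_fraud": 1},
]


def time_split(rows, cutoff):
    """Train on rows strictly before the cutoff, validate on the rest."""
    ordered = sorted(rows, key=lambda r: r["event_date"])
    train = [r for r in ordered if r["event_date"] < cutoff]
    valid = [r for r in ordered if r["event_date"] >= cutoff]
    return train, valid


train_rows, valid_rows = time_split(rows, cutoff=date(2024, 3, 1))

latest_train = max(r["event_date"] for r in train_rows)
earliest_valid = min(r["event_date"] for r in valid_rows)
print(len(train_rows), len(valid_rows))  # 4 2
print(latest_train < earliest_valid)     # True
```

A random shuffle would let March rows inform a model that is then "validated" on January rows, which is the opposite of what production asks of it. A time-ordered split does not remove leakage from features computed after the event, so that check still has to be done separately.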
That is why the module starts with failure before algorithms. Linear regression, logistic regression, trees, forests, boosting, SVMs, KNN, Naive Bayes, PCA, and clustering are not names to memorize. They are different answers to the same engineering question: what structure do we believe exists in the data, and what mistake are we willing to pay for?
Top resources
- The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman - the classic map of model families, regularization, trees, boosting, SVMs, and unsupervised methods.
- An Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani, and Taylor - the gentler version, especially strong for bias-variance, validation, regression, classification, and tree methods.
- scikit-learn documentation - useful for practical API details, model selection, preprocessing, metrics, calibration, cross-validation, and baseline implementations.
- Your own production data slices - the best resource for class imbalance, drift, leakage, threshold cost, and whether a score is trusted by the humans who act on it.
What's coming
- 01-train-prod-gap.md - why notebook success can collapse when future rows differ.
- 02-bias-variance.md - the two failure directions behind underfitting and overfitting.
- 03-model-shapes.md - what boundaries different model families can draw.
- 04-regularization.md - why constraining a model can improve future behavior.
- 05-linear-regression.md - the simplest weighted-combination learner.
- 06-gradient-descent.md - how parameters move toward lower loss.
- 07-logistic-regression.md - how linear scores become class probabilities.
- 08-feature-engineering.md - how raw columns become usable signal.
- 09-decision-trees.md - splitting data with readable if/then structure.
- 10-random-forests.md - reducing variance by averaging or voting across many trees.
- 11-gradient-boosting.md - correcting residual mistakes stage by stage.
- 12-evaluation-cv.md - splitting data so validation resembles deployment.
- 13-metrics-and-calibration.md - measuring the right error and trusting scores.
- 14-class-imbalance.md - why rare events break accuracy.
- 15-svm.md - choosing a separating margin under geometric pressure.
- 16-knn.md - using nearby examples directly.
- 17-naive-bayes.md - updating odds from feature evidence.
- 18-dimensionality-reduction.md - compressing feature space while preserving useful variation.
- 19-clustering.md - discovering groups without labels.
- 20-honest-admission.md - where classical ML is still powerful, and where it stops being enough.
Memory map
| Concept | Prerequisite | Pressure family | Recurs later as | Layer touched |
|---|---|---|---|---|
| Train-production gap | labeled historical rows | data quality | drift, leakage, validation design | data -> deployment |
| Bias | model shape | accuracy + simplicity | underfitting, missed nonlinear signal | algorithm -> metric |
| Variance | finite samples | stability | overfitting, tree ensembles, cross-validation | data -> algorithm |
| Regularization | loss optimization | generalization | L1/L2, pruning, early stopping | training -> model |
| Feature engineering | domain columns | representation | separability, leakage, scaling | data -> algorithm |
| Calibration | probabilistic scores | decision quality | thresholding, risk queues, ranking | model -> product action |
| Class imbalance | base rates | business cost | precision/recall, PR curves, sampling | metric -> operations |
| Clustering/PCA | geometry | compression + discovery | embeddings, retrieval, anomaly slices | data -> representation |
The rule to carry
Classical ML turns historical examples into future decisions by controlling generalization.
If you remember one sentence, remember this:
Teacher voice. A model is only useful when the data split, model shape, metric, calibration, and threshold match the decision it will make in production.
Everything else in the module is a way to inspect or repair one part of that sentence.
Bridge. Start with the first failure: the model can win on training rows and still lose in production. The next file shows why the train-production gap is the pressure that makes evaluation, regularization, and model choice necessary. -> 01-train-prod-gap.md