01. Week 0 — Classical ML Refresher¶

💥 The $400M bug¶

August 2019. Zillow's ML team ships an upgraded Zestimate model. Training accuracy: 97.2%. Internal celebration. Champagne emoji in Slack.

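By November 2021, Zillow had shut down its home-buying business and written down hundreds of millions of dollars on homes it had overpaid for. The pricing models had not kept up with pandemic-era market swings, and the failure modes this module covers (overfitting to recent trends, leaky temporal features, unmonitored calibration drift) are exactly the kind that stay invisible until the losses land. Roughly 2,000 employees, about a quarter of the company, lost their jobs. One pricing system. Too little validation. Hundreds of millions of dollars.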

Every concept in this module exists because someone, somewhere, shipped a model without it and got burned.

This isn't theory. This is the difference between models that ship and models that destroy companies.

🧠 How to use this module¶

This is not a textbook you read front-to-back. It's a training sim.

Mode	What to do
🎯 First pass	Read Part 2 one section per day → do "pause and try" before reading answers
🔁 Active recall	Cover the answer, explain to an invisible interviewer in 60 seconds, then check
🎬 Visual mode	Every concept has ASCII diagrams. Redraw from memory. Can't → re-read
🏋️ Interview prep	Jump to Part 3 (Algorithms) → practice WHY/WHAT/HOW for each
🩺 Diagnosis mode	Given train/val numbers → name the disease in one breath using Part 2
🔧 System design	Part 4 → model selection, production failures, cost tradeoffs

Rule: never read passively for more than 5 minutes. After each section, close the file and explain it aloud. If you mumble → you don't know it yet.

🎯 What you'll be able to do after this week¶

Explain where ML sits in the AI/DL landscape and why classical ML dominates tabular/production
Diagnose overfitting from a learning curve and propose three fixes
Sketch bias-variance curve and L1/L2 shapes from memory
Trace one gradient descent step by hand
Write logistic regression as σ(w·x + b) and justify log loss over MSE
Pick the right metric for: fraud (1% positive), cancer screening, ad CTR, loan pricing
Distinguish bagging from boosting — what each reduces and when
Choose between linear/logistic/RF/XGBoost/KNN/SVM for a given problem with one-line justification
Spot feature leakage in a sklearn pipeline
Explain K-means and PCA at whiteboard level

📖 The story of this week — a disaster in 10 acts¶

🎬 You're a newly hired ML engineer. Day 1. Your teammate shows you a model with 99% training accuracy. "Ship it?" they ask.

🌍 Wait, what even IS machine learning? → you learn the big picture (AI/ML/DL, where you fit)
🔥 The model is lying → you learn overfitting diagnosis (how to spot the scam)
🤔 But lying HOW? → you learn bias-variance (is it too dumb or too paranoid?)
💊 Okay, treat it → you learn regularization (L1/L2 — the medication)
⚙️ How does the model actually learn? → you discover gradient descent
🧪 Better ingredients → you learn feature engineering (the real lever)
📏 Did it actually work? Prove it. → you master evaluation & calibration
🔀 Is your proof honest? → you learn cross-validation (testing your test)
⚖️ What if positives are 0.1%? → you handle class imbalance
🧰 Which tool for which job? → you enter the algorithm zoo (Part 3)

The chain: each rescue creates the next problem. That's ML engineering.

Part 2: Foundations¶

Ten concepts that apply to ALL algorithms. Each flows into the next. Master these once — they return in every ML system you'll ever build.

2.1 The big picture — where does all this fit?¶

🎬 Scene: Interview, round 1. "Can you explain the difference between AI, ML, and deep learning?" You fumble. The interviewer already has a mental minus. Let's never let that happen.

WHY this matters. You need a 30-second mental map of the entire field to position yourself, your work, and your tool choices. Without it, you're a coder with a library. With it, you're an engineer who knows why each tool exists.

WHAT — the nesting:

┌─────────────────────────────────────────────────────────────┐
│  ARTIFICIAL INTELLIGENCE                                     │
│  "Machines that exhibit intelligent behavior"                │
│  (rule-based expert systems, search, planning, ML...)       │
│                                                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  MACHINE LEARNING                                      │  │
│  │  "Systems that learn patterns from DATA, not rules"    │  │
│  │  (linear models, trees, SVMs, neural nets...)         │  │
│  │                                                        │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │  DEEP LEARNING                                   │  │  │
│  │  │  "ML with many-layered neural networks"          │  │  │
│  │  │  (CNNs, RNNs, Transformers, LLMs...)            │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

HOW to think about it:

Layer	Input	Learns from	Example
AI (rule-based)	Rules written by humans	Expert knowledge	Chess engine (1990s), spam rules
ML (classical)	Features + labels	Data patterns	XGBoost fraud detection, credit scoring
DL (neural nets)	Raw data (pixels, text)	Hierarchical patterns	GPT, image classifiers, speech-to-text

The key insight for interviews: Classical ML ≠ outdated. It DOMINATES tabular/production:

Domain                  What actually runs in prod        Why not deep learning?
─────────────────────   ─────────────────────────────    ──────────────────────────
Credit scoring          Logistic regression + XGBoost    Regulatory interpretability
Fraud detection         XGBoost (Stripe, PayPal)         5ms latency constraint
Ad click prediction     Logistic regression (Google)     Billions of QPS, cost
Insurance pricing       GLMs + gradient boosting         Actuaries need coefficients
Recommendation ranking  XGBoost + embeddings             Tabular user features
Healthcare risk         Random Forest / XGBoost          Small datasets, audit trail

When to choose what:

- Classical ML: tabular data, need interpretability, moderate data, latency matters
- Deep Learning: images, audio, text, massive data, structure in the input
- Both: embeddings from DL fed into classical models (common in industry)

⏸️ Pause & Try. Your startup has 50k rows of user behavior data (columns: age, country, subscription_tier, days_active, purchases). Goal: predict churn. Someone proposes a transformer. What's your response?

✅ "This is tabular data with 50k rows — exactly where XGBoost dominates. Transformers need sequential/spatial structure and massive data. I'd start with XGBoost baseline, add logistic regression for interpretability comparison. We'd need 10-100x more data AND structural features before DL makes sense."

2.2 How models learn — the universal loop¶

🎬 Scene: Before diagnosing problems, we need to understand what "learning" even means. Every ML model — from linear regression to GPT-4 — follows the same 5-step loop.

WHY. If you don't understand the loop, you can't debug WHERE things go wrong. Bad data? Bad features? Wrong loss? Wrong optimizer? You need to see the machine to fix it.

WHAT — the loop:

┌──────────────────────────────────────────────────────────────────────┐
│                                                                       │
│   ① DATA        ② FEATURES       ③ MODEL         ④ LOSS             │
│   (raw input)   (transformed)    (prediction)    (how wrong?)        │
│                                                                       │
│   Patients  →   age, temp,    →  ŷ = f(x)    →  L = (y - ŷ)²       │
│   with labs     fever×lowO2       prediction      or log loss         │
│                                                                       │
│                          ⑤ UPDATE                                     │
│                     ←────────────────                                 │
│                  "Adjust weights to reduce loss"                       │
│                  (gradient descent, or exact solve)                    │
│                                                                       │
│   Repeat ①→⑤ until loss stops improving (convergence)                │
└──────────────────────────────────────────────────────────────────────┘

HOW each step can fail:

Step	What goes wrong	Result	Fix taught in...
① Data	Too little, biased, wrong split	Bad generalization	§2.3 (overfitting), §2.9 (CV)
② Features	Missing interactions, leakage	Weak or lying model	§2.7 (feature engineering)
③ Model	Too simple or too complex	Bias or variance	§2.4 (bias-variance)
④ Loss	Wrong metric, ignores costs	Optimizes wrong thing	§2.8 (evaluation)
⑤ Update	LR too high/low, wrong optimizer	Divergence or slow	§2.6 (gradient descent)

Two types of learning:

SUPERVISED (labels available)          UNSUPERVISED (no labels)
─────────────────────────────          ──────────────────────────
Input: (X, y) pairs                    Input: X only
Goal: predict y from X                 Goal: find structure in X
Examples:                              Examples:
  - Predict price (regression)           - Group customers (K-means)
  - Predict fraud (classification)       - Reduce dimensions (PCA)
  - Predict next word (language model)   - Find anomalies

⏸️ Pause & Try. A model has high training loss that won't decrease. Which step(s) of the loop are likely broken?

✅ Step ③ (model too simple — high bias) or Step ⑤ (learning rate too low). Could also be Step ② (features don't contain the signal).

2.3 When learning goes wrong — overfitting diagnosis¶

🎬 Scene: Your teammate: "We got 99.2% accuracy!" You: "On what?" "Training set." "Check validation." They come back pale. 61%.

WHY. A student who memorizes last year's exam bombs the real one. The model learned noise, not signal. If you can't spot this, everything after is shooting in the dark.

WHAT — three smoke signals (need ALL three):

Signal 1: THE GAP
┌─────────────────────────────────┐
│  Train acc:  99.2%              │
│  Val acc:    61.0%              │
│  Gap:        38.2% ← 🚨 ALARM │
└─────────────────────────────────┘

Signal 2: THE LEARNING CURVE
accuracy
  ▲
  │ ─────── train (keeps climbing)
  │         ╱
  │   ─────╱── validation (plateaus, then DROPS)
  │  ╱
  └──────────────────► training epochs
     ↑ gap widens = overfitting

Signal 3: THE BASELINE CHECK
┌──────────────────────────────────────────┐
│  Fancy model:    61% val                 │
│  Logistic reg:   58% val                 │
│  Majority class: 55% val                 │
│  Complex model barely beats simple →     │
│  it memorized, not learned.              │
└──────────────────────────────────────────┘

HOW to respond:

1. Check train vs val gap → big = overfitting
2. Check baseline → barely beats simple = memorized
3. Check learning curve → val degrades = stop earlier

In the wild — Stripe Radar. Retrain daily. Train-vs-val gap is a deploy gate. Gap too big → model doesn't ship.

Trap. Train vs val look close, you ship. Production: 45%. Why? Production has a third distribution. Always ask: "What changes between dev and prod?"

⏸️ Pause & Try. Colleague says 97% accuracy. Three questions before believing them?

✅ (1) Train or validation? (2) What's the baseline? (3) Is validation representative of production?

2.4 Naming the disease — bias-variance tradeoff¶

🎬 Scene: Dr. Simple: checks only temperature → misses 60% of cases. Dr. Complex: memorizes every patient → 100% on known, panics on new ones. Both bad. Different reasons. Different cures.

WHY. §2.3 told you: broken. But which way? Diagnose which one → name the cure.

WHAT — the archery picture (draw in interviews):

         BIAS (wrong aim)              VARIANCE (shaky hand)

           ┌─────────┐                  ┌─────────┐
           │    ⊕    │                  │         │
           │  ● ●●   │                  │ ●    ●  │
           │   ●●    │                  │       ● │
           │         │                  │  ●  ●   │
           └─────────┘                  └─────────┘
        Cluster together but            Scatter everywhere
        MISS the bullseye               Near bullseye on average

  Fix: AIM BETTER                    Fix: STEADY YOUR HAND
  → more capacity, features          → regularize, more data, ensemble

HOW — the 30-second diagnostic:

Train	Val	Disease	Prescription
55%	53%	High bias	More capacity, better features
99%	61%	High variance	Regularize, more data, ensemble
92%	91%	Balanced	Ship, monitor
85%	88%	Suspicious (val > train)	STOP. Audit pipeline for leakage or a skewed val split
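
The table boils down to a few comparisons. A minimal sketch, assuming accuracies arrive as fractions; the function name `diagnose` and the thresholds `gap_tol` and `target` are illustrative assumptions, not standard values:

```python
def diagnose(train_acc, val_acc, gap_tol=0.05, target=0.80):
    """Map a (train, val) accuracy pair to a rough diagnosis.

    gap_tol and target are hypothetical thresholds; choose them per problem.
    """
    if val_acc > train_acc + 0.01:
        # Validation beating training is unusual: audit the pipeline first.
        return "suspicious: check for leakage or a skewed val split"
    if train_acc - val_acc > gap_tol:
        return "high variance"   # memorizing: regularize, more data, ensemble
    if train_acc < target:
        return "high bias"       # too simple: more capacity, better features
    return "balanced"            # ship and monitor


for train, val in [(0.55, 0.53), (0.99, 0.61), (0.92, 0.91), (0.85, 0.88)]:
    print(f"train={train:.2f} val={val:.2f} -> {diagnose(train, val)}")
```

Treat the output as a first guess to confirm with a learning curve, not as a verdict.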

The classic curve:

   error
     ▲
     │ ╲                          ╱
     │   ╲   ← total error  ╱
     │     ╲              ╱
     │  bias²╲          ╱  variance
     │        ╲___ ___╱
     │             V         ← sweet spot
     └─────────────────────────► model complexity

Trap. Quoting "bias-variance" generically. Interviewer wants: point at ONE, explain WHY, name the fix. "Train 99%, val 61% → high variance → add L2 + more data."

⏸️ Pause & Try. Model A: train 72%, val 70%. Model B: train 98%, val 64%. Diagnose each, one fix.

✅ A: High bias → add features or use a more complex model. B: High variance → regularize or get more data.

2.5 The treatment — regularization (L1 vs L2)¶

🎬 Scene: 200 consultants predicting house prices. Without a budget: "Door color = $50,000!" They're memorizing. Rule: "Budget is limited. Make your case or get fired."

WHY penalize at all?

Without constraint:
  hours_studied × 8.2     ← sense
  shoe_size     × 47.3    ← 🚨 noise!
  birth_month   × -12.8   ← 🚨 nonsense!

Fix: Loss = errors + TAX ON WEIGHT SIZES
Model thinks: "Is shoe_size worth 47.3 of my budget?"

WHAT — two types of budget:

L2 = SALARY CAP                    L1 = HEADCOUNT CAP
"Keep all, pay less"               "Choose 15. Rest fired."
→ All shrink, none die             → Weak ones collapse to ZERO

L2 (Ridge): Tax = λ × Σwᵢ²        L1 (Lasso): Tax = λ × Σ|wᵢ|

HOW — numbers:

Feature          No tax    L2(λ=5)    L1(λ=1)    L1(λ=5)
─────────────────────────────────────────────────────────
sqft             150       60         140         80
beds             200       80         180         100
school           80        32         50          💀 0
door_color       45        18         💀 0        💀 0
neighbor_dog     12        5          💀 0        💀 0

WHY L1 kills (the geometry):

L2 = CIRCLE around origin        L1 = DIAMOND around origin
Loss oval touches circle:         Loss oval touches diamond:
→ smooth surface                  → hits a CORNER (on axis)
→ both weights ≠ 0               → one weight = 0 → SPARSITY
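
Shrink-versus-kill shows up in closed form in one special case. A sketch assuming orthonormal features (XᵀX = I) and the objective ½‖y − Xw‖² plus either (λ/2)·Σwᵢ² or λ·Σ|wᵢ|; the function names are hypothetical, and the outputs are not meant to reproduce the illustrative numbers in the table above:

```python
def ridge_shrink(w_ols, lam):
    """L2 penalty: every weight is scaled by 1 / (1 + lam), never exactly 0."""
    return w_ols / (1.0 + lam)


def lasso_soft_threshold(w_ols, lam):
    """L1 penalty: move each weight toward 0 by lam, and clamp at 0."""
    magnitude = max(abs(w_ols) - lam, 0.0)
    return magnitude if w_ols >= 0 else -magnitude


unpenalized = {"sqft": 150.0, "beds": 200.0, "school": 80.0,
               "door_color": 45.0, "neighbor_dog": 12.0}

for name, w in unpenalized.items():
    print(f"{name:13s} L2 -> {ridge_shrink(w, 1.5):6.1f}   "
          f"L1 -> {lasso_soft_threshold(w, 50.0):6.1f}")
```

L2 divides, so small weights get smaller but survive. L1 subtracts, so anything below λ lands on exactly zero.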

WHEN to use which:

Situation	Use	Example
Many useless features	L1	Genomics: 20k genes, 50 matter
Correlated features	L2	Economics: GDP, CPI correlated
Both selection + stability	Elastic Net	FICO credit scoring
Already underfitting	NONE	Makes it worse

Trap. L1 is unstable with correlated features. income and salary: L1 kills one randomly. L2 keeps both at half.

⏸️ Pause & Try. 500 genomics features (20 useful). Which? 50 correlated indicators. Which? Already underfitting. Add regularization?

✅ (1) L1. (2) L2. (3) No — makes underfitting worse.

2.6 The learning algorithm — gradient descent¶

🎬 Scene: Mountain in fog. Can't see. Feel the slope → step downhill → repeat until flat. Most parametric models (linear, logistic, neural nets) learn this way.

WHY. Millions of weight combinations. Can't try all. Gradient descent finds the minimum by always stepping downhill.

WHAT:

w_new = w_old - η × ∂L/∂w

"New weight = old weight - (learning rate × slope)"

HOW — one step:

w = 5.00, gradient = +1.20, η = 0.1
w_new = 5.00 - (0.1 × 1.20) = 4.88  ← moved left (downhill)

The learning rate drama:

η too BIG:  ●╱╲●╱╲● DIVERGE      η too SMALL: ●●●●●... 100k steps
η right:    ●╲___●●● converged

Full derivation (linear regression):

Cost: J = (1/2m) × Σ(h(x⁽ⁱ⁾) - y⁽ⁱ⁾)²

∂J/∂θ₁ = (1/m) × Σ(h(x⁽ⁱ⁾) - y⁽ⁱ⁾) × x⁽ⁱ⁾

Update: θ₁ := θ₁ - α × ∂J/∂θ₁
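
The update rule and the derivation fit in a few lines of plain Python. A minimal sketch, assuming the hypothesis h(x) = θ₀ + θ₁·x (the intercept θ₀ and its gradient are added here for completeness); `gd_step` and `fit_line` are hypothetical names, and the learning rate and step count are arbitrary choices for this toy data:

```python
def gd_step(w, grad, lr):
    """One update: new weight = old weight - learning rate * slope."""
    return w - lr * grad


def fit_line(xs, ys, lr=0.05, steps=5000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x under the cost J."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Both parameters are updated from the same (old) errors.
        theta0 = gd_step(theta0, grad0, lr)
        theta1 = gd_step(theta1, grad1, lr)
    return theta0, theta1


print(gd_step(5.0, 1.2, 0.1))                # the hand-traced step
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # data generated from y = 2x + 1
```

Try `lr=0.5` on the same data to watch the "η too BIG" case diverge.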

Variants:

Variant	Key idea	Used for
Batch GD	All data per step	Small datasets
SGD	1 sample per step	Online learning
Mini-batch	32-512 samples	Standard DL
Adam	Adaptive LR per weight	Default for neural nets
AdamW	Adam + decoupled weight decay	Default for transformers

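Trap. "Local minima!" → Linear/logistic regression: the loss is convex, so there are no spurious local minima. Neural nets: non-convex, but most local minima found in practice generalize fine. The bigger practical problem is saddle points and flat regions → momentum and adaptive optimizers help escape them.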

⏸️ Pause & Try. w=3.0, gradient=-0.8, η=0.05. New weight? Direction?

✅ 3.0 - (0.05 × -0.8) = 3.04. Right, because gradient is negative (downhill right).

2.7 Better ingredients — feature engineering¶

🎬 Scene: Two chefs, same stove. Fresh wagyu vs frozen fish sticks. Who wins? The ingredients.

WHY.

Tune hyperparameters:     +0.5-1%
Fancier model:            +1-2%
Better features:          +5-15%  ← WHERE THE MONEY IS

WHAT — the cookbook:

Raw	Transform	Example
income: ₹20L	`log(income)`	14.5
city: "Mumbai"	One-hot	`[0,1,0,0]`
temp:102, O₂:91	Interaction	`fever_AND_low_o2 = 1`
timestamp	Extract	`hour=14, dow=Mon`
age: 67	Polynomial	`67, 4489`
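
The cookbook rows can be reproduced with the standard library alone. A sketch under stated assumptions: ₹20L is read as 2,000,000, the city list and its order are hypothetical, and the fever and oxygen cut-offs are placeholder values rather than clinical guidance:

```python
import math
from datetime import datetime

CITIES = ["Delhi", "Mumbai", "Pune", "Chennai"]   # hypothetical category order
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot(value, categories):
    """1 in the slot of the matching category, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


def make_features(income, city, temp_f, spo2, ts, age):
    return {
        "log_income": math.log(income),       # tames the long right tail
        "city": one_hot(city, CITIES),
        # Interaction: 1 only when BOTH conditions hold (thresholds assumed).
        "fever_and_low_o2": int(temp_f >= 100.4 and spo2 < 92),
        "hour": ts.hour,
        "dow": DAYS[ts.weekday()],
        "age": age,
        "age_sq": age ** 2,                   # lets a linear model bend
    }


row = make_features(2_000_000, "Mumbai", 102, 91, datetime(2024, 1, 1, 14, 0), 67)
print(row)
```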

HOW — prevent leakage:

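# Pipeline prevents leakage: every step is fit on the training data only
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)  # X_train, y_train: your training split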

Trap. Fitting scaler on full data before split. Target encoding on full dataset. Same patient in train and val.
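
The first trap is easiest to see with the scaler written out by hand. A minimal sketch with made-up numbers; `fit_scaler` and `transform` are hypothetical stand-ins for what a scaler's fit and transform steps do:

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn scaling statistics from the values it is shown, and nothing else."""
    return mean(values), pstdev(values)


def transform(values, stats):
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


train = [10.0, 12.0, 11.0, 13.0]
val = [40.0, 42.0]            # validation happens to look very different

leaky_stats = fit_scaler(train + val)   # WRONG: statistics have seen validation rows
clean_stats = fit_scaler(train)         # RIGHT: statistics come from train only

print("leaky mean:", leaky_stats[0], " clean mean:", clean_stats[0])
print("val scaled with clean stats:", transform(val, clean_stats))
```

With clean statistics the validation rows show up as extreme values, which is the honest picture. The leaky version hides that shift by folding validation data into the mean and standard deviation.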

⏸️ Pause & Try. Features: age, purchase_amount, signup_date, country. Create two new features for churn. Leakage risk?

✅ days_since_signup, purchase × age_bucket. Risk: using future data for country aggregates.

2.8 Did it work? — evaluation & calibration¶

🎬 Scene: "99% accuracy!" on fraud detection. Fraud = 0.5%. A model that says "not fraud" to everything gets 99.5% accuracy. The celebrated model is WORSE than doing nothing.

WHY. Models look amazing on bad metrics. Need metrics matching actual cost of wrong decisions.

WHAT — confusion matrix:

                 Predicted FRAUD    Predicted NOT
Actually FRAUD     TP = 18            FN = 12
Actually NOT       FP = 6             TN = 64

Precision = 18/24 = 0.75    Recall = 18/30 = 0.60
F1 = 0.67                   Accuracy = 82/100 = 0.82 (misleading!)
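
The four numbers above come straight from the four cells. A small sketch (the function name is hypothetical, and it does not guard against empty denominators):

```python
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)   # of everything flagged, how much was right
    recall = tp / (tp + fn)      # of all real positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


m = classification_metrics(tp=18, fp=6, fn=12, tn=64)
print({name: round(value, 2) for name, value in m.items()})
```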

HOW — metric selection:

Situation	Metric	Why
Cancer screening	Recall	Missing = death
Spam filter	Precision	Real email in spam = angry user
Rare fraud	PR-AUC	Both P and R matter
Ad ranking	ROC-AUC	Ordering matters
Loan pricing	Calibration	Probabilities must be honest

Calibration: Two models, both AUC=1.0. One says "60% default" when reality is 100%. Pricing loans based on 60% → bankruptcy. AUC measures ranking, NOT probability honesty. Need both.

Trap. "AUC 0.95, ship it!" — says nothing about calibration. Need ranking metric AND calibration metric.
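
To see ranking and probability honesty come apart, score two models that order the cases identically. A sketch with made-up scores; the Brier score is used here as one common probability-quality metric (it mixes calibration with sharpness), and both helper functions are hand-rolled for illustration:

```python
def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


y = [0, 0, 1, 1]
confident = [0.10, 0.20, 0.80, 0.90]   # made-up scores
timid = [0.50, 0.55, 0.58, 0.60]       # same ordering, squashed toward the middle

for name, probs in [("confident", confident), ("timid", timid)]:
    print(name, "AUC =", roc_auc(y, probs), "Brier =", round(brier_score(y, probs), 3))
```

Same AUC, very different Brier scores: a ranking metric alone cannot tell these two models apart.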

⏸️ Pause & Try. TP=45, FP=15, FN=5, TN=435. Precision? Recall? Is accuracy misleading?

✅ P=0.75. R=0.90. Acc=0.96. Yes — "always negative" gets 90%. Accuracy hides 10% recall loss.

2.9 Honest testing — cross-validation¶

🎬 Scene: Single 80/20 split → 91% val. Ship. Production: 72%. Your one split was lucky.

WHY. CV tests across MULTIPLE splits. Gives average performance — honest.

WHAT — flavors:

Flavor	When	Why
Stratified K-fold	Imbalanced classification	Preserves class ratio
Group K-fold	Multiple rows per entity	Prevents identity leak
Time-series split	Temporal data	Never future-into-past
Nested CV	Hyperparameter tuning	Prevents tuning leak

HOW — failure case:

❌ Stock prediction + random KFold: trains on Nov to predict Oct → fake 85%
✅ TimeSeriesSplit: train Jan-Jun → predict Jul → honest 58%

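from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # keeps class ratio per fold
GroupKFold(n_splits=5)                                      # pass groups=... to .split()
TimeSeriesSplit(n_splits=5, test_size=30_000)               # train always precedes test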

Trap. Default KFold on time-series → future leaks into past. Always TimeSeriesSplit for temporal data.
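
The "never future-into-past" rule is just index arithmetic. A hand-rolled sketch of an expanding-window splitter, assuming rows are already sorted by time; `expanding_window_splits` is a hypothetical helper that illustrates the idea, not a replacement for the library class:

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) where training always ends before testing begins."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        train_end = n_samples - (n_splits - i) * test_size
        yield list(range(train_end)), list(range(train_end, train_end + test_size))


for train_idx, test_idx in expanding_window_splits(n_samples=12, n_splits=3):
    print("train", train_idx, "-> test", test_idx)
```

Every fold trains on a prefix of the timeline and tests on the block right after it, so no fold ever peeks at the future.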

⏸️ Pause & Try. Predicting hospital readmission, same patient appears multiple times, using random KFold. Bug?

✅ Identity leakage. Patient in both train/val. Fix: GroupKFold on patient_id.

2.10 The rare-event problem — class imbalance¶

🎬 Scene: Fraud = 0.1%. Model says "not fraud" always → 99.9% accuracy. $2M stolen overnight.

WHY. When positives < 5%, accuracy is broken. Standard loss ignores the rare class.

WHAT — six tools (order of try-first):

1. Right metric → PR-AUC, recall@precision
2. class_weight='balanced' → one-line fix
3. scale_pos_weight → XGBoost equivalent
4. Threshold tuning → 0.5 is rarely right
5. SMOTE → synthetic minority. Use sparingly.
6. Focal loss → re-weights hard examples

HOW:

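from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from xgboost import XGBClassifier

LogisticRegression(class_weight='balanced')
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()  # class counts from the training labels
XGBClassifier(scale_pos_weight=neg / pos)

# Threshold tuning: y_prob = model.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)

# SMOTE inside pipeline (prevents leakage)
ImbPipeline([('smote', SMOTE()), ('model', LogisticRegression())])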

Trap. Resampling before split → leaked. Always resample inside CV folds.
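
Two of the tools are simple enough to write out by hand. A sketch: `balanced_class_weights` follows the n_samples / (n_classes × class_count) rule behind the 'balanced' option, and `pick_threshold` is a hypothetical helper that returns the lowest score cut-off meeting a precision floor; the validation scores are made up:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def pick_threshold(y_true, y_prob, min_precision):
    """Lowest threshold (so highest recall) whose precision meets the floor."""
    for t in sorted(set(y_prob)):
        flagged = [y for y, p in zip(y_true, y_prob) if p >= t]
        if sum(flagged) / len(flagged) >= min_precision:
            return t
    return None   # no threshold reaches the requested precision


labels = [0] * 990 + [1] * 10                 # 1% positives
print(balanced_class_weights(labels))
print("scale_pos_weight =", 990 / 10)

y_val = [0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.30, 0.35, 0.60, 0.70, 0.90]
print("threshold:", pick_threshold(y_val, y_prob, min_precision=0.75))
```

Choose the precision floor from the business cost of a false alarm, and tune the threshold on validation data only.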

⏸️ Pause & Try. 0.1% fraud. Model: 99.9% accuracy, 0% recall. Useful? Two fixes?

✅ Useless (predicts "not fraud" always). Fix: class_weight='balanced' + switch to PR-AUC.

Part 3: The Algorithm Zoo¶

Each algorithm follows the same template: WHY (what problem demanded it), WHAT (equation + diagram), HOW (code + hyperparams), WHEN/WHEN NOT, INTERVIEW angle, TRAP. This is your interview prep — one card per algorithm.

3.1 Linear Regression — the $2 trillion equation¶

WHY does this exist? You need to predict a NUMBER from features. House price from sqft. Revenue from ad spend. The simplest possible prediction: a weighted sum. $2T+ in financial decisions daily still rest on this.

WHAT — the math:

ŷ = w₁·x₁ + w₂·x₂ + ... + wₖ·xₖ + b

Example: price = 150·sqft + 20000·beds + 50000·school - 10000·highway + 85000

Worked example:

Patient	Inflammation (x)	Stay days (y)	Predicted (ŷ=2x+1)	Error²
A	1	3	3	0
B	2	5	5	0
C	3	8	7	1
			MSE	0.33

MSE = (1/n) × Σ(y - ŷ)²
Why squared? Big mistakes punished MORE. Error of 5 → penalty 25.

R² — how much pattern did I capture?

R² = 1 - (residual sum of squares / total sum of squares)
   = 1 - (how wrong you are / how wrong "just predict mean" is)

R² = 0.85 → explained 85% of variance
R² = 0.00 → no better than mean
R² < 0   → WORSE than mean (something very wrong)

Adjusted R² — the feature-addition trap:

R² on the training data NEVER decreases when you add features (even useless ones).
Adjusted R² penalizes useless additions → DROPS if feature doesn't help.
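
These three numbers can be checked on the worked example. A sketch using the standard adjusted-R² formula, 1 − (1 − R²)(n − 1)/(n − p − 1), which the text describes but does not spell out; function names are hypothetical:

```python
def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # how wrong the model is
    ss_tot = sum((a - y_bar) ** 2 for a in y)              # how wrong "predict mean" is
    return 1 - ss_res / ss_tot


def adjusted_r2(r2_value, n, p):
    """n = number of rows, p = number of features (intercept not counted)."""
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)


y, y_hat = [3, 5, 8], [3, 5, 7]       # patients A, B, C with y_hat = 2x + 1
score = r2(y, y_hat)
print(round(mse(y, y_hat), 2), round(score, 3), round(adjusted_r2(score, n=3, p=1), 3))
```

With only three rows the adjustment is harsh, which is the point: each extra feature has to pay for itself.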

Assumptions of linear regression (interview staple — "name 4"):

Assumption	What it means	What breaks if violated	How to check
Linearity	Relationship is linear in parameters	Curved patterns in residuals	Residual vs fitted plot
Independence	Observations don't influence each other	SE underestimated, CIs wrong	Durbin-Watson test
Homoscedasticity	Constant variance of errors	Unreliable significance tests	Residual spread plot
Normality of errors	Errors ~ N(0, σ²)	CIs and p-values invalid	Q-Q plot
No multicollinearity	Features not highly correlated	Unstable coefficients	VIF > 10 = problem

Quick check: plot residuals vs fitted values
  ✅ Random cloud around 0 → assumptions OK
  ❌ Funnel shape → heteroscedasticity
  ❌ Curved pattern → non-linearity (add polynomial/log features)
  ❌ Clusters → independence violated

HOW — code:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
LinearRegression().fit(X, y)       # no regularization
Ridge(alpha=1.0).fit(X, y)         # L2 — prevents coefficient explosion
Lasso(alpha=0.1).fit(X, y)        # L1 — automatic feature selection

WHEN to use:

- Continuous target with additive relationships
- Need interpretable coefficients (audit, regulation)
- Fast baseline for any regression task
- A/B test effect estimation

WHEN NOT:

- Classification (use logistic)
- Highly non-linear without feature engineering
- High-dimensional sparse without regularization

INTERVIEW Qs:

- "Does linear regression assume linear relationships?" → Linear in PARAMETERS. Can add x², log(x) and it's still "linear regression."
- "Coefficient = 1500 for sqft: causal?" → No. Correlation under model. Bigger homes have pools. Need RCTs/DAGs for causation.
- "When R² is 0.99, celebrate?" → Maybe leakage. Or overfitting. Check adjusted R² and validation R².

Trap. Interpreting coefficients without standardizing. w=1500 on sqft vs w=0.02 on income doesn't mean sqft matters more — different scales.

⏸️ Pause & Try. R²=0.92. Add "day_of_week_born" → R²=0.921, adjusted R² drops 0.91→0.905. Keep the feature?

✅ No. Adjusted R² dropped: the feature adds noise. Plain R² never goes down when you add a feature (that's the trap). Trust adjusted R².

3.2 Logistic Regression — the confidence meter¶

WHY? Linear regression gives -∞ to +∞. But "will this borrower default?" needs a probability in [0,1]. You can't tell a customer "your fraud score is -47.2." You need a squeeze.

WHAT — the sigmoid:

Input:  z = w·x + b    (linear score, any real number)
Output: p = 1/(1+e⁻ᶻ)  (always between 0 and 1)

     1.0 ┤            ╭───────   ← very positive → ~1.0
         │          ╱
     0.5 ┤──────●──────          ← z=0 → 50/50
         │    ╱
     0.0 ┤───╯                   ← very negative → ~0.0
         └────────────────────►
        -4   -2    0    2    4

Worked numbers:

| Score z | σ(z) | Interpretation |
|---:|---:|---|
| -3.0 | 0.05 | "5%: almost certainly NOT fraud" |
| 0.0 | 0.50 | "Coin flip: no idea" |
| +3.0 | 0.95 | "95%: block it" |

Why log loss, not MSE?

Log loss: L = -[y·log(p) + (1-y)·log(1-p)]

True label = 1 (IS fraud):
  Predict p=0.95 → loss = 0.05   ← tiny, good!
  Predict p=0.50 → loss = 0.69   ← moderate
  Predict p=0.05 → loss = 3.00   ← HUGE 🔥

Being confidently WRONG is punished without bound: loss → ∞ as p → 0.
With a sigmoid output, MSE gradients vanish near 0/1 → model stops learning when most wrong.
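
Both claims can be checked numerically. A sketch, assuming a single example with p = σ(z) and gradients taken with respect to the score z; the function names are hypothetical and z = -8 is an arbitrary "confidently wrong" score:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_log_loss(y, z):
    """d(log loss)/dz with p = sigmoid(z): simply p - y."""
    return sigmoid(z) - y


def grad_mse(y, z):
    """d((p - y)^2)/dz: the extra p * (1 - p) factor shrinks toward 0 when saturated."""
    p = sigmoid(z)
    return 2 * (p - y) * p * (1 - p)


for p in (0.95, 0.50, 0.05):
    print(f"y=1, p={p:.2f} -> log loss {log_loss(1, p):.2f}")

z = -8.0   # true label is 1, but the model is confidently wrong (p is about 0.0003)
print("log-loss gradient:", grad_log_loss(1, z))
print("MSE gradient:     ", grad_mse(1, z))
```

The log-loss gradient stays close to -1 (a strong push to fix the mistake), while the MSE gradient is nearly zero at exactly the moment the model most needs correcting.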

HOW — code:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, penalty='l2', class_weight='balanced')
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]  # ALWAYS use probabilities
# Tune threshold based on business cost, not 0.5 default

WHEN: Binary classification + calibrated probabilities needed, interpretability, strong baseline, sparse high-dim (text bag-of-words + L1), billions-QPS inference.

WHEN NOT: Non-linear boundaries (use trees), images/audio/text without features, threshold interactions.

INTERVIEW Qs:

- "Is logistic regression non-linear because sigmoid?" → NO. Sigmoid is monotonic. Decision boundary (where p=0.5, z=0) is still a STRAIGHT LINE (a hyperplane in higher dimensions). Linear classifier.
- "Why log loss over MSE?" → Log loss: gradient = (p-y), clean. MSE: gradient vanishes near 0/1.
- "Over XGBoost, when?" → Interpretability, sub-ms at billions QPS, calibrated probabilities natively.

Trap. Using .predict() instead of .predict_proba(). Default 0.5 threshold is almost never optimal under imbalance.

⏸️ Pause & Try. Weights [0.5, -0.3], bias=0.1, input=[4, 2]. Compute z, then p.

✅ z = 0.5×4 + (-0.3)×2 + 0.1 = 2.0 - 0.6 + 0.1 = 1.5. p = 1/(1+e⁻¹·⁵) ≈ 0.82.

3.3 Decision Trees — the 20-questions game¶

WHY? Linear models draw ONE straight line. Reality has corners. "If income > 50k AND age > 30 AND no defaults → approve." A tree asks yes/no questions and splits data into boxes. Handles interactions naturally.

WHAT — the XOR proof (lines CANNOT work):

   x₁  x₂  label        x₂ ▲
    0   0    ○            1 │ ●───○
    0   1    ●              │
    1   0    ●            0 │ ○───●
    1   1    ○              └──────► x₁
                              0   1

No single line separates ● from ○. Trees solve it in 2 questions:

     x₁ = 0?
     ╱      ╲
   YES       NO
   ╱           ╲
 x₂=1?      x₂=0?
  ╱  ╲       ╱  ╲
 ●    ○     ●    ○

HOW it splits — information gain:

Parent node: 100 samples (60 positive, 40 negative)
  Entropy = -0.6×log₂(0.6) - 0.4×log₂(0.4) = 0.97

Split on "age > 30":
  Left (age≤30):  50 samples (45 pos, 5 neg) → entropy = 0.47
  Right (age>30): 50 samples (15 pos, 35 neg) → entropy = 0.88

Information gain = 0.97 - 0.5×0.47 - 0.5×0.88 ≈ 0.30 ← good split!

Greedy: try every feature, every threshold. Pick max gain. Repeat.
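
The split arithmetic is worth running once rather than trusting by eye. A minimal sketch with hypothetical function names, using class counts as input:

```python
import math


def entropy(counts):
    """Shannon entropy (bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [60, 40]                 # [positives, negatives]
left, right = [45, 5], [15, 35]   # the "age > 30" split
print(round(entropy(parent), 2), round(entropy(left), 2), round(entropy(right), 2))
print(round(information_gain(parent, [left, right]), 2))
```

A greedy tree builder calls `information_gain` for every candidate feature and threshold and keeps the split with the largest value.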

HOW — code:

from sklearn.tree import DecisionTreeClassifier, export_text
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # readable rules

WHEN: Need full interpretability (show stakeholders), exploring interactions during EDA, clinical/regulatory rules, quick prototype.

WHEN NOT: Production predictions alone (overfits hard), need stability (small data changes → different tree), smooth boundaries needed.

INTERVIEW Qs:

- "Why do trees overfit?" → Greedy splits memorize noise. Deep-enough tree = zero training error on anything. Need pruning/depth limits.
- "Trees vs linear?" → Trees: interactions + non-linearity natively, but unstable. Linear: stable, miss interactions without FE.

Trap. Trusting single tree's accuracy. High train accuracy = memorized. Always validate with held-out data.

⏸️ Pause & Try. A tree gets 100% train, 62% val. What happened? One fix that doesn't change the algorithm?

✅ Memorized training data (unlimited depth). Fix: set max_depth=5 or min_samples_leaf=20 — constrains the tree without changing algo.

3.4 Random Forest — the jury of 500¶

WHY? One tree is drunk — memorizes everything. What if you ask 500 trees, each trained on slightly different data, and AVERAGE their answers? Individual errors cancel out. That's a Random Forest.

WHAT — the mechanism:

Step 1: Draw 500 bootstrap samples (random subsets with replacement)
Step 2: Train one tree per sample (with random FEATURE subsets per split)
Step 3: Average predictions (regression) or majority vote (classification)

Why it works:
  Each tree is wrong sometimes. But wrong in DIFFERENT ways.
  Average → individual errors cancel → stable prediction.
  Like asking 500 jurors instead of 1.
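
The three steps reduce to two small operations plus an average. A toy sketch: the "trees" in the second half are simulated as the truth plus independent noise, which is an idealization (real trees are correlated, so the gain is smaller, and that is why feature subsampling matters). All names are hypothetical:

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Same size as the original, drawn WITH replacement (so some rows repeat)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Classification: the label most trees agree on."""
    return Counter(votes).most_common(1)[0][0]


def average(values):
    """Regression: the mean of the trees' predictions."""
    return sum(values) / len(values)


def spread(values):
    return max(values) - min(values)


rng = random.Random(0)
print(bootstrap_sample(list(range(10)), rng))
print(majority_vote([1, 0, 1, 1, 0]))

# Toy stand-in for trees: each "tree" is the truth plus its own independent noise.
truth = 10.0
single = [truth + rng.gauss(0, 2) for _ in range(500)]
forest = [average([truth + rng.gauss(0, 2) for _ in range(50)]) for _ in range(500)]
print("single tree spread:   ", round(spread(single), 2))
print("50-tree average spread:", round(spread(forest), 2))
```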

The key — what it fixes:

Single tree:           Random Forest:
Train: 100%            Train: 95%
Val:   62%             Val:   89%

Variance: HIGH         Variance: LOW (averaging)
Bias: LOW              Bias: slightly higher (can't memorize)

HOW — code:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=500,        # more trees = more stable (diminishing returns past ~300)
    max_features='sqrt',     # forces trees to disagree
    min_samples_leaf=10,     # prevents individual tree overfitting
    n_jobs=-1,               # parallel training
    oob_score=True,          # free validation estimate
)
rf.fit(X_train, y_train)
print(rf.oob_score_)        # out-of-bag score ≈ validation score

WHEN: First reliable baseline, stable performance with minimal tuning, feature importance (with caveats), can't afford extensive HP search.

WHEN NOT: Need last-mile accuracy (boosting wins by 1-3%), need calibrated probabilities (forests are often poorly calibrated), very high-dim sparse, real-time sub-ms inference.

INTERVIEW Qs:

- "RF vs boosting — what's each fixing?" → RF reduces VARIANCE (averaging). Boosting reduces BIAS (sequential correction). Different goals.
- "Does RF never overfit?" → Can — small noisy data, very deep trees. Just less aggressively than single tree/boosting.
- "How does feature randomness help?" → Forces trees to disagree. Averaging disagreeing trees cancels noise.

Trap. Trusting .feature_importances_ blindly. Inflates high-cardinality features, arbitrarily picks one of correlated pair. Use permutation importance or SHAP.

⏸️ Pause & Try. RF: val 88%. XGBoost: val 91%. Business needs STABLE predictions (same customer = same score tomorrow). Which ship?

✅ RF. More stable (averaging). XGBoost's 3% edge might not matter if predictions fluctuate. Stability > marginal accuracy for trust products.

3.5 Gradient Boosting (XGBoost) — the final boss of tabular¶

WHY? Random Forest fixes variance by averaging. But what if your problem is BIAS: the model isn't smart enough? Boosting builds trees SEQUENTIALLY: each new tree corrects the mistakes of the ensemble so far. Result: on tabular data, usually the most accurate model family you can reach for.

WHAT — the mechanism:

Step 1: Train tree₁ on original data
Step 2: Compute RESIDUALS (errors) of tree₁
Step 3: Train tree₂ to predict the RESIDUALS
Step 4: Ensemble = tree₁ + tree₂ (tree₂ corrects tree₁'s mistakes)
Step 5: Compute new residuals, train tree₃... repeat 500 times.

Each tree fixes what the previous ensemble got wrong.
Like a student correcting the teacher's homework — each round smarter.
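The residual-fitting loop above can be sketched from scratch for squared-error regression. This is a minimal illustration under stated assumptions, not how XGBoost is implemented: it uses one-split "stumps" on a single feature, starts from the mean prediction instead of a first tree, and applies a `learning_rate` shrinkage. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor on 1-D inputs (minimizes squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)              # start from the mean prediction
    preds = [base] * len(xs)
    trees, mse_history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # what the ensemble still gets wrong
        tree = fit_stump(xs, residuals)                  # next tree predicts the residuals
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, mse_history

xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]                 # a curve no single stump can fit
predict, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])    # training error after 1 tree vs after 50
```

Each round shrinks the training error a little, which is also why boosting can overfit if it runs unchecked.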

Boosting vs Bagging (interview staple):

BAGGING (Random Forest)              BOOSTING (XGBoost)
─────────────────────────            ─────────────────────────
Trees trained in PARALLEL            Trees trained SEQUENTIALLY
Each on random subset                Each on RESIDUALS of previous
Average → reduces VARIANCE           Correct → reduces BIAS
Hard to overfit                      CAN overfit (needs early stopping)
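Why averaging helps variance but not bias can be seen in a small simulation. This is a sketch with made-up numbers: each "model" is reduced to a noisy estimate of a known true value, and the estimates are assumed independent (real forest trees are correlated, so the reduction in practice is smaller).

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 10.0

def noisy_estimate():
    # Stand-in for one high-variance model: right on average, but noisy.
    return TRUE_VALUE + rng.gauss(0, 2.0)

def biased_estimate():
    # Stand-in for an underfit model: low noise, but systematically 3 too low.
    return TRUE_VALUE - 3.0 + rng.gauss(0, 0.2)

single = [noisy_estimate() for _ in range(2000)]
bagged = [statistics.mean(noisy_estimate() for _ in range(25)) for _ in range(2000)]
biased_avg = [statistics.mean(biased_estimate() for _ in range(25)) for _ in range(2000)]

print(statistics.pvariance(single), statistics.pvariance(bagged))  # variance shrinks with averaging
print(statistics.mean(biased_avg))                                 # the bias survives averaging
```

Averaging 25 noisy estimates cuts the variance sharply, while the averaged biased estimates stay about 3 below the truth. Fixing that second problem is what sequential correction is for.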

AdaBoost vs Gradient Boosting:

AdaBoost: re-weights SAMPLES (misclassified get higher weight)
Gradient Boosting: fits RESIDUALS directly (gradient of loss)

AdaBoost: simpler, older, more sensitive to outliers
GradBoost: more flexible, handles any differentiable loss, dominant today
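The sample re-weighting step can be made concrete with one round of classic (discrete) AdaBoost. This is a sketch of the textbook update only; the ten equal-weight samples and the `correct` flags are invented for illustration.

```python
import math

def adaboost_round(weights, correct):
    """One discrete-AdaBoost update.

    weights: current sample weights (summing to 1)
    correct: booleans, whether the weak learner classified each sample correctly
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)   # weighted error
    alpha = 0.5 * math.log((1 - err) / err)                     # this learner's vote strength
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]                      # renormalize to sum to 1

weights = [0.1] * 10
correct = [True] * 8 + [False] * 2      # the weak learner misses the last two samples
alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights)
```

After the update the two misclassified samples carry half of the total weight, so the next weak learner is forced to pay attention to them. An outlier that keeps getting misclassified keeps gaining weight, which is where the outlier sensitivity comes from.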

Why XGBoost beats deep learning on spreadsheets:

What gives neural nets edge:         Tabular data:
✓ Spatial structure (pixels)         ✗ No structure
✓ Sequential (tokens in order)       ✗ Column order arbitrary
✓ Massive data (millions)            ✗ Often 10k-100k rows
✓ Homogeneous input (all ints)       ✗ Mixed: int, float, bool, string

Trees ask threshold questions: "age > 30?" Works for ANYTHING.
Missing values? XGBoost learns which direction to route them.
No normalization needed. Little preprocessing needed (string/categorical columns still need encoding).

HOW — code:

import xgboost as xgb
model = xgb.XGBClassifier(
    n_estimators=500,          # number of sequential trees
    max_depth=6,               # shallow trees (avoid overfitting)
    learning_rate=0.05,        # shrinkage: each tree contributes less
    subsample=0.8,             # row sampling (like bagging)
    colsample_bytree=0.8,     # feature sampling
    early_stopping_rounds=20,  # stop when val stops improving
    eval_metric='auc',
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

Tuning order: learning_rate + n_estimators (with early stop) → max_depth → subsample, colsample → min_child_weight.

WHEN: Tabular + best accuracy needed, mixed types, moderate-to-large data (10k+), Kaggle tabular competitions (boosted trees are a staple of winning solutions).

WHEN NOT: <1000 rows (use logistic), images/text/audio (use DL), sub-ms inference at massive QPS, need interpretability for regulators.

INTERVIEW Qs:
- "Why XGBoost on tabular?" → Handles mixed numeric types, missing values, and threshold interactions natively. Mature regularization. Little preprocessing (categoricals still need encoding).
- "Boosting = bagging with more steps?" → NO. Parallel vs sequential. Variance vs bias. Different mechanisms entirely.
- "How to tune?" → LR + n_estimators first → depth → sampling → always early-stop.

Trap. Tuning on validation 30+ times → validation has leaked. Use nested CV or held-out test. Also: reaching for transformers on tabular without benchmarking XGBoost first.

⏸️ Pause & Try. Someone proposes BERT for credit scoring: 50 features, 20k rows. Your response?

✅ "Benchmark XGBoost first. 20k rows, 50 tabular features = exactly where boosting dominates. No sequential structure, too little data for DL. If XGBoost gets 85%, we need BERT to meaningfully beat it to justify complexity."

3.6 Naive Bayes — the impossibly good simplifier

WHY? You have text (1000s of word features). Many classifiers get slow or overfit at such high dimensionality. Naive Bayes makes ONE bold assumption (features are independent given the class) and suddenly the math becomes trivially fast. Wrong assumption, but often right classification.

WHAT — Bayes' theorem applied:

P(spam | words) = P(words | spam) × P(spam) / P(words)

"Naive" assumption: words are independent given the class
P(words | spam) = P("viagra" | spam) × P("free" | spam) × P("click" | spam) × ...

This is WRONG (words correlate!) but works because:
  → For classification, you only need argmax P(class | features)
  → Even if probabilities are wrong, the RANKING can still be correct

HOW — worked example:

Email: "Free viagra click now"

P(spam) = 0.4,  P(ham) = 0.6

P("free"|spam) = 0.8    P("free"|ham) = 0.1
P("viagra"|spam) = 0.9  P("viagra"|ham) = 0.01
P("click"|spam) = 0.7   P("click"|ham) = 0.1
P("now"|spam) = 0.5     P("now"|ham) = 0.3

P(spam|words) ∝ 0.4 × 0.8 × 0.9 × 0.7 × 0.5 = 0.1008
P(ham|words)  ∝ 0.6 × 0.1 × 0.01 × 0.1 × 0.3 = 0.000018

spam wins by a factor of 5,600. Classified: SPAM ✓
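The same calculation is usually done in log space, because multiplying thousands of small word probabilities underflows to zero. A minimal sketch using the priors and likelihoods from the worked example (the dictionary layout and function name are illustrative choices, not a library API):

```python
import math

priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "viagra": 0.9, "click": 0.7, "now": 0.5},
    "ham": {"free": 0.1, "viagra": 0.01, "click": 0.1, "now": 0.3},
}

def log_score(cls, words):
    """log P(class) + sum of log P(word | class): the unnormalized log posterior."""
    return math.log(priors[cls]) + sum(math.log(likelihoods[cls][w]) for w in words)

words = ["free", "viagra", "click", "now"]
log_scores = {cls: log_score(cls, words) for cls in priors}
prediction = max(log_scores, key=log_scores.get)     # argmax over classes

# Turn the scores into posteriors that sum to 1 (subtract the max for numerical stability)
m = max(log_scores.values())
norm = sum(math.exp(s - m) for s in log_scores.values())
posteriors = {cls: math.exp(s - m) / norm for cls, s in log_scores.items()}
print(prediction, posteriors)
```

Sums of logs give the same argmax as products of probabilities, so the classification is unchanged. Only the numerics are safer.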

HOW — code:

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Text classification (most common use)
tfidf = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf.fit_transform(texts)
MultinomialNB().fit(X_tfidf, y)

# Numeric features
GaussianNB().fit(X_numeric, y)

WHEN: Text classification (spam, sentiment), very small datasets, extremely fast training/inference needed, good probabilistic baseline.

WHEN NOT: Feature interactions matter, calibrated probabilities needed (overconfident), tabular with correlated features.

INTERVIEW Q: "Why does NB work despite wrong assumption?" → For classification, only need ranking of P(class|x) to be correct, not exact values. Even if joint probability is wrong, argmax often isn't.

Trap. Trusting probability VALUES. NB is notoriously poorly calibrated (overconfident). Use for ranking/classification, not probability-driven decisions.

⏸️ Pause & Try. Two features are perfectly correlated (feature₂ = 2×feature₁). Does this break Naive Bayes? Why?

✅ Yes — NB treats them as independent, effectively "double-counting" the same evidence. The magnitude of evidence gets inflated, making probabilities even more poorly calibrated.

3.7 K-Nearest Neighbors — ask your neighbors

WHY? Sometimes you don't want a formula at all. Just look at the K most similar examples and do what they did. No training phase — pure memory. The simplest possible non-parametric model.

WHAT — the mechanism:

New patient arrives. What's their risk?

Step 1: Compute distance from new patient to ALL training patients
Step 2: Find K nearest neighbors (e.g., K=5)
Step 3: Majority vote (classification) or average (regression)

     ○ ○
   ○ ★ ●    ← ★ is the new point. With K=3, the 3 nearest are ○○●. Vote: ○ wins (2 vs 1)
     ● ●

No "training" — just memorize all data and search at prediction time.

The curse of dimensionality (critical interview concept):

In 2D: points cluster naturally. Neighbors are meaningful.
In 100D: ALL points are roughly equidistant!

Why? In high dimensions, the difference between nearest and farthest
neighbor becomes negligible relative to total distance.

"Nearest neighbor" becomes meaningless when everyone is equally far.
Fix: reduce dimensions first (PCA) or use feature selection.
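The "everyone is equally far" effect is easy to measure. The sketch below assumes uniformly random points in a unit cube and reports the relative contrast (farthest minus nearest distance, divided by nearest) from one query point. The function name and sample sizes are arbitrary choices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 2))
```

In 2D the farthest point is many times farther than the nearest. By a few hundred dimensions the gap is a small fraction of the nearest distance, so a "nearest" neighbor is barely nearer than anyone else.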

HOW — code:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# MUST scale — KNN is distance-based
pipe = Pipeline([
    ('scale', StandardScaler()),  # income 0-100k and age 0-100 → same scale
    ('knn', KNeighborsClassifier(n_neighbors=5, weights='distance'))
])

WHEN: Small datasets, prototyping, anomaly detection (distance to K-th neighbor), highly irregular boundaries.

WHEN NOT: Large data (inference = O(n)), high dimensions (curse), features not scaled, production at scale.

INTERVIEW Qs:
- "What happens to KNN in high dimensions?" → Curse of dimensionality. All points become roughly equidistant. Fix: PCA or feature selection.
- "KNN has no parameters?" → It learns no weights, but K is a hyperparameter, and the distance metric (Euclidean, Manhattan, cosine) matters enormously.
- "Train time?" → Near zero (just store data). Inference time = O(n) per query without indexing (KD-tree/Ball-tree help in low dimensions).

Trap. Forgetting to scale. Income 0-100k, age 0-100 → distance dominated by income. Always standardize.
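A tiny from-scratch example shows the scaling trap changing which neighbor is "nearest". The four customers and the query are invented for illustration, and `standardize` is a hand-rolled z-score that plays the role `StandardScaler` plays in the pipeline above.

```python
import math
import statistics

# Hypothetical customers: (income in dollars, age in years)
train = [(50_000, 25), (50_900, 60), (80_000, 27), (30_000, 55)]
query = (50_800, 26)

def nearest(points, q):
    """Index of the training point closest to q (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))

def standardize(points, q):
    """Z-score every column using the training mean and standard deviation."""
    cols = list(zip(*points))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]

    def scale(p):
        return tuple((v - m) / s for v, m, s in zip(p, means, stds))

    return [scale(p) for p in points], scale(q)

raw_idx = nearest(train, query)                    # distance dominated by income
scaled_train, scaled_query = standardize(train, query)
scaled_idx = nearest(scaled_train, scaled_query)   # both features now count
print(train[raw_idx], train[scaled_idx])
```

On raw values the 26-year-old query is matched to a 60-year-old because their incomes differ by only $100. After standardization it is matched to the 25-year-old with a nearly identical income.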

⏸️ Pause & Try. 500-dimensional genomics data, 200 samples. Will KNN work well? Why/not?

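✅ No. Curse of dimensionality: with 500 dimensions and only 200 samples, distances are nearly uniform, so "nearest" carries little signal. Reduce with PCA to ~20 dimensions first, or use a tree-based method, which does not rely on distances (though 200 samples is still very little data for 500 features).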

3.8 SVM — the maximum margin separator

WHY? Logistic regression finds A separating line. But there are infinitely many lines that separate two classes. SVM finds THE BEST one — the one with maximum margin (distance) from both classes. More margin = more robust.

WHAT — the geometry:

           ● ●           ○ ○
         ● ● ●         ○ ○ ○
        ●   ●     |      ○ ○
              ●   |   ○
              ↑   |   ↑
        support   |   support
        vectors   |   vectors
                  ↑
            decision boundary
          ←─margin─→

SVM maximizes this margin. Support vectors = the closest points to boundary.
Only THESE points matter. All others could disappear without changing the result.

The kernel trick (when data isn't linearly separable):

2D problem: can't separate with a line
         ● ● ● ●
       ○ ○ ○ ○ ○ ○
         ● ● ● ●

Kernel trick: map to higher dimension where a linear separator EXISTS
  φ(x) = [x₁, x₂, x₁², x₂², x₁x₂]  ← 5D now

In 5D, the classes ARE linearly separable!
SVM finds the hyperplane in 5D, projects back to curved boundary in 2D.

Key insight: SVM never actually computes the high-D features.
It only needs dot products in that space → kernel function K(x,x') does this cheaply.
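The "dot products without the features" claim can be checked numerically. The sketch below uses the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)², whose explicit feature map is [x₁², √2·x₁x₂, x₂²]. That map is an assumption chosen because its kernel has a simple closed form; it is slightly different from the φ shown above.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (3-D output)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def poly_kernel(x, z):
    """The same inner product, computed entirely in the original 2-D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))     # map to 3-D first, then take the dot product
via_kernel = poly_kernel(x, z)     # never leaves 2-D
print(explicit, via_kernel)
```

Both routes give the same number up to floating-point rounding. For the RBF kernel the implicit feature space is infinite-dimensional, so the kernel route is the only practical one.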

HOW — code:

from sklearn.svm import LinearSVC, SVC

LinearSVC(C=1.0).fit(X, y)             # fast, linear kernel
SVC(kernel='rbf', C=1.0).fit(X, y)     # non-linear, slow on large data

# C = penalty for misclassification
#   High C: narrow margin, few violations (overfits)
#   Low C: wide margin, allows violations (regularizes)

WHEN: Small-medium datasets with clear margin, text classification (linear SVM on TF-IDF), maximum-margin guarantee, binary classification.

WHEN NOT: Large datasets (O(n² to n³) training), need probabilities (gives scores not probs), multi-class at scale, when XGBoost exists.

INTERVIEW Qs:
- "What's the kernel trick?" → Maps to higher-D space where linear separator exists, but only computes dot products there (cheap).
- "SVM vs logistic?" → SVM: max-margin, no probabilities, strong with small data. Logistic: calibrated probs, faster, scales better.
- "What are support vectors?" → The closest points to the boundary. Only these define the decision boundary.

Trap. RBF kernel on 100k+ rows — training explodes. Linear SVM is fine. For most modern tasks, XGBoost wins both accuracy and speed.

⏸️ Pause & Try. You have 100 million ad impressions to classify. Someone suggests SVM with RBF kernel. Problem?

✅ Kernel SVM training is O(n²-n³) = death with 100M rows. A linear SVM trained with SGD can scale, but the RBF kernel cannot. Use logistic regression (O(n) with SGD) or XGBoost. Kernel SVM is for small-medium data only.

3.9 K-Means Clustering — finding groups without labels

WHY? All algorithms above need LABELS (supervised). But sometimes you have no labels — just data. "Find natural groups in my customers." "Segment users by behavior." K-means finds K clusters by putting similar points together.

WHAT — the algorithm (4 steps, dead simple):

Step 1: Randomly place K centroids
Step 2: Assign each point to its NEAREST centroid
Step 3: Move each centroid to the MEAN of its assigned points
Step 4: Repeat steps 2-3 until centroids stop moving

    Before (random init):        After convergence:
        ★                            ★
      ○ ○ ○                        ● ● ●
    ○   ○                          ● ●
           ★                              ★
        ○ ○ ○                          ▲ ▲ ▲
          ○ ○                            ▲ ▲

    ★ = centroid    Points take cluster of nearest centroid
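The four steps map almost line-for-line onto code. This is a from-scratch sketch, not scikit-learn's implementation. It initializes centroids by picking K random data points (one common choice, assumed here), and the two Gaussian blobs are synthetic.

```python
import math
import random

def assign(points, centroids):
    """Step 2: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step 1: k random data points as initial centroids
    for _ in range(max_iters):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):                     # Step 3: move each centroid to its cluster mean
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # empty cluster: leave centroid in place
        if new_centroids == centroids:         # Step 4: stop when centroids no longer move
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia

# Synthetic data: two well-separated blobs
data_rng = random.Random(1)
blob_a = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(50)]
centroids, labels, inertia = kmeans(blob_a + blob_b, k=2)
print(centroids, inertia)
```

The returned `inertia` (sum of squared distances to the assigned centroid) is the quantity plotted on the y-axis of an elbow plot.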

HOW to choose K — the elbow method:

inertia (within-cluster distance)
    ▲
    │╲
    │  ╲
    │    ╲___
    │        ╲_________    ← "elbow" = diminishing returns
    └───────────────────► K
     1  2  3  4  5  6

Also: silhouette score (how well-separated clusters are)

HOW — code:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# MUST scale (distance-based, like KNN)
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

# Elbow plot data: inertia for K = 1..10 (same seed and restarts as above)
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_scaled).inertia_
    for k in range(1, 11)
]

WHEN: Customer segmentation, image compression, feature engineering (cluster_id as feature), initialization for other algorithms, exploratory data analysis.

WHEN NOT: Non-spherical clusters (use DBSCAN), very different cluster sizes, high dimensions without reduction, when you need run-to-run identical results (random init; fix random_state to get them).

INTERVIEW Qs:
- "How to choose K?" → Elbow method (inertia vs K plot), silhouette score, or business-driven (marketing wants 4 segments).
- "K-means assumptions?" → Spherical clusters, similar sizes, similar density. Fails on elongated/ring-shaped clusters.
- "What's the convergence guarantee?" → Always converges (inertia decreases monotonically) but may find local optimum. Fix: n_init=10 (run 10 times, keep best).

Trap. Forgetting to scale. If income=0-100k and age=0-100, clusters are based purely on income differences. Also: K-means finds K clusters even if the data has no natural grouping — always validate clusters make business sense.

⏸️ Pause & Try. You run K-means with K=3 on customer data and get 3 clusters. How do you know if these are REAL segments vs noise?

✅ Check silhouette score (>0.5 = decent). Visualize in 2D (PCA/t-SNE). Most importantly: do clusters have different BUSINESS outcomes (different churn rates, different LTV)? If clusters don't predict anything useful, they're noise.
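The silhouette score mentioned in the answer compares, for each point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b): s = (b − a) / max(a, b). A from-scratch sketch on six made-up points follows (scikit-learn offers `sklearn.metrics.silhouette_score` for real use).

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points, with s = (b - a) / max(a, b) per point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)          # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])   # labels follow the two visible groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])    # labels ignore the geometry
print(good, bad)
```

Labels that follow the geometry score close to 1, while labels that cut across it score around zero or below. The score still says nothing about whether the segments matter to the business, which is why the outcome check comes last.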

3.10 PCA — compressing reality

WHY? 500 features. Many are correlated (income ≈ salary, height ≈ weight). You want to reduce to 20 features that capture 95% of the information. PCA finds the directions of maximum variance and projects data onto them.

WHAT — the intuition:

Original 2D data:                   After PCA:
   y ▲     ●  ●                    PC1 captures MOST variance
     │   ●  ● ●  ●                  (the direction data "spreads" most)
     │ ●  ●  ●                     
     │  ●  ●                        PC2 captures remaining variance
     └──────────► x                  (perpendicular to PC1)

If data is a "cigar shape" in 2D:
  PC1 = long axis of cigar (captures 95% of spread)
  PC2 = short axis (captures 5%)
  → Drop PC2, keep PC1. Reduced from 2D to 1D with 95% info retained!

WHAT — the math (simplified):

1. Center data (subtract mean)
2. Compute covariance matrix
3. Find eigenvectors (directions of max variance) = principal components
4. Eigenvalues tell you HOW MUCH variance each PC captures
5. Keep top-k PCs that explain ≥ 95% cumulative variance

Eigenvalue:  λ₁=45  λ₂=30  λ₃=15  λ₄=5  λ₅=3  λ₆=2
Cumulative:  45%    75%    90%    95%   98%   100%
                                  ↑ keep 4 PCs = 95% info in 4D instead of 6D
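Step 5 is just a cumulative sum over the sorted eigenvalues. A minimal sketch using the six eigenvalues from the example (the function name and the tiny floating-point tolerance are illustrative choices):

```python
def components_for(eigenvalues, target=0.95):
    """Smallest k whose top-k eigenvalues reach the target share of total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += lam
        if running / total >= target - 1e-12:   # tolerance guards against rounding
            return k
    return len(eigenvalues)

eigenvalues = [45, 30, 15, 5, 3, 2]
k = components_for(eigenvalues)
print(k)  # number of components to keep for 95% of the variance
```

This is the selection that `PCA(n_components=0.95)` performs automatically in scikit-learn.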

HOW — code:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# MUST scale first (PCA is variance-based)
X_scaled = StandardScaler().fit_transform(X)

# Find number of components for 95% variance
pca = PCA(n_components=0.95)  # auto-selects k for 95% variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
print(f"Explained variance: {pca.explained_variance_ratio_.cumsum()}")

WHEN: High-dimensional data with correlated features, before KNN (curse of dimensionality), visualization (project to 2D/3D), noise reduction, speeding up downstream models.

WHEN NOT: Features have meaning you need to interpret (PCA components are uninterpretable), non-linear relationships (use t-SNE/UMAP for viz), sparse data (use TruncatedSVD instead).

INTERVIEW Qs:
- "What does PCA maximize?" → Variance captured in each component. PC1 = direction of maximum spread.
- "PCA for feature selection?" → No. PCA creates NEW features (linear combinations of originals). Feature selection keeps ORIGINAL features.
- "When does PCA fail?" → Non-linear relationships. Also fails if you don't scale (high-scale features dominate variance).

Trap. Not scaling before PCA. If income (0-100k) and age (0-100) → PC1 will just be "income": its scale is 1000× larger, so its variance is roughly 1,000,000× larger. Scale first.

⏸️ Pause & Try. 500 genomics features → PCA → 20 components (95% variance). Feed to KNN. Why is this better than raw 500D KNN?

✅ Curse of dimensionality: 500D makes all distances meaningless. PCA reduces to 20D where distances are meaningful again. Also removes noise from correlated features and speeds up KNN inference.

Part 4: System-Level Thinking

These sections separate IC-level from Lead-level interviews. Anyone can explain what XGBoost is. Fewer can explain when NOT to use it, what breaks in production, and what the latency cost is.

4.1 Model selection decision tree

                    Is the data tabular (rows × columns)?
                              │
                ┌─────────────┴──────────────┐
              YES                             NO → images/audio/text
                │                                  → deep learning module
                │
        Is the task regression?
                │
        ┌───────┴────────┐
       YES                NO (classification)
        │                  │
   ┌────┴──────┐      ┌───┴───────────────┐
   │           │      │                    │
 Need       Just    Need               Just need
 interpret-  best   interpret-          best accuracy
 ability?    acc?   ability?
   │           │      │                    │
 LINEAR     XGBoost  LOGISTIC           XGBoost
 regression          regression
                                           │
                                  Stable & low-tuning?
                                           │
                                     RANDOM FOREST

Special-case shortcuts:
- Time series → boosted trees + lag/rolling features, or Prophet/ARIMA
- Sparse text → logistic + L1, or linear SVM
- Small data (< 1k rows) → logistic + heavy regularization
- Imbalance > 100:1 → boosted trees + class weights + PR-AUC
- Need probabilities for pricing → calibrated logistic, or XGBoost + isotonic
- Need feature importance → XGBoost + SHAP
- Quick baseline → Naive Bayes (text) or KNN (tiny tabular)

The 90% rule: logistic baseline → XGBoost → deep learning ONLY on huge structured-feature data. Skipping the baseline is the most common system-design mistake.

4.2 Feature importance & SHAP

Four ways to ask "what does this model use?":

Method	How	Strength	Weakness
Coefficients (linear)	Direct from model	Signed, interpretable	Only after standardization
`.feature_importances_` (trees)	Split-weighted impurity	Fast, built-in	Inflates high-cardinality, hides correlated
Permutation importance	Shuffle feature, measure drop	Model-agnostic, honest	Slow, randomness
SHAP values	Game-theory attribution	Per-row, per-feature, additive	Slow for large datasets

SHAP — the gold standard:

Every prediction = baseline + Σ(feature contributions)

For a loan denial:
  Baseline (avg approval rate):    +0.60
  Credit utilization contribution: -0.25   ← biggest factor
  Income contribution:             +0.10
  Payment history contribution:    -0.08
  Final prediction:                 0.37 (deny)

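Explanations like this help meet adverse-action reason requirements (US ECOA) and GDPR rules on automated decisions (EU) for consumer credit. Neither law mandates SHAP specifically.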
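The "baseline + Σ(contributions)" identity comes from Shapley values, which can be computed exactly for a tiny model by averaging each feature's marginal contribution over every ordering. This is a sketch under strong assumptions: the scoring function `f` is invented (not a real credit model), and "missing" features are set to a single all-zero baseline row, whereas the SHAP library averages over a background dataset.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings.

    Features not yet 'revealed' in an ordering are held at their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = f(current)
        for j in order:
            current[j] = x[j]          # reveal feature j
            new = f(current)
            phi[j] += new - prev       # marginal contribution of j in this ordering
            prev = new
    return [p / len(orderings) for p in phi]

# Hypothetical scoring function with an interaction term
def f(v):
    utilization, income, late_payments = v
    return 0.6 - 0.3 * utilization + 0.1 * income - 0.05 * late_payments * utilization

x = [0.9, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(f, x, baseline)
print(f(baseline), phi, f(x))
```

The contributions always sum to the gap between the prediction and the baseline, and the interaction term is split evenly between the two features involved. Enumerating orderings is factorial in the number of features, which is why practical tools rely on model-specific shortcuts such as the tree explainer.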

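import shap

# xgb_model: an already-fitted XGBoost model; X: the feature DataFrame to explain
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X)     # per-row, per-feature contributions (log-odds by default for XGBoost classifiers)
shap.summary_plot(shap_values, X)          # global view

i = 0                                      # index of the row to explain
shap.force_plot(explainer.expected_value, shap_values[i], X.iloc[i])  # one-row explanation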

Trap. "High feature importance = causes the outcome." No. SHAP explains MODEL behavior, not the world. Importance ≠ causality.

4.3 Cost & latency intuition

Model	Train (1M×100)	Inference	Memory
Logistic regression	seconds	< 1ms (μs at batch)	tiny
Random forest (200 trees)	minutes	1-5ms	hundreds MB
XGBoost (500 trees)	minutes	1-10ms	tens-hundreds MB
Small MLP	minutes (GPU)	~1ms	small
BERT-base	hours (GPU)	10-100ms (CPU)	hundreds MB
GPT-class LLM	weeks (cluster)	seconds	tens GB

Why it matters:
- Google ad CTR: logistic at billions QPS (one dot product)
- Stripe fraud: XGBoost at 5ms (within card-swipe round trip)
- Transformer at same QPS: 1000× more expensive

Trap. Designing around accuracy ONLY, ignoring serving cost. "Downgrade model, gain 10× throughput, lose 1% accuracy" often wins.

4.4 Production failure modes

These happen to WORKING models after deployment. Lead-tier ML is mostly debugging these.

Failure	What happens	Detect	Fix
Concept drift	Relationship changes (pre-COVID → post-COVID)	Monitor metrics on fresh labeled slice	Scheduled retraining
Covariate drift	Input distribution shifts (new traffic source)	KS-test/PSI per feature	Retrain or re-segment
Label drift	Definition of positive class changes	Cross-team label audits	Re-label + retrain
Train-serve skew	Feature computed differently offline vs online	Shadow-mode logging	Feature store (Tecton, Feast)
Feedback loops	Predictions affect future labels	Reserve random "shadow" cohort	Unbiased data collection
Cold-start	New cohort has no training data	Per-cohort volume and error monitoring	Hierarchical model, borrow features, or bootstrap a smaller model

The discipline. Most "model is broken" tickets are concept drift, train-serve skew, or feedback loops. Build dashboards for these BEFORE shipping.
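As one concrete dashboard metric, the PSI check from the covariate-drift row can be sketched in plain Python. It bins a reference sample, bins a fresh sample with the same edges, and sums (actual − expected) × ln(actual / expected). The equal-width bins, the 1e-6 floor for empty bins, and the synthetic Gaussian data are assumptions. The 0.1 and 0.25 cut-offs in the comments are a commonly quoted rule of thumb, not a standard.

```python
import math
import random

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference sample and a fresh sample.

    Assumes the reference sample is not constant (so the bin width is non-zero).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)   # clamp out-of-range values into the edge bins
            counts[idx] += 1
        return [max(c / len(values), 1e-6) for c in counts]   # floor avoids log(0)

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
train_feature = [rng.gauss(0, 1) for _ in range(5000)]   # reference window
same_dist = [rng.gauss(0, 1) for _ in range(5000)]       # fresh data, no drift
shifted = [rng.gauss(1, 1) for _ in range(5000)]         # fresh data, mean shifted by 1 std

psi_stable = psi(train_feature, same_dist)   # rule of thumb: < 0.1 means stable
psi_drift = psi(train_feature, shifted)      # rule of thumb: > 0.25 means investigate
print(psi_stable, psi_drift)
```

Running this per feature on a schedule gives an early warning before labels arrive, which is the main advantage over waiting for metric degradation.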

Part 5: Interview Prep (Consolidated)

5.1 Common misconceptions (quick-index)

Misconception	Truth	Relevant section
"More data always fixes overfitting"	Only fixes high-variance. High-bias won't improve.	§2.4
"L1 is just smaller L2"	Different shape → L1 zeros weights, L2 only shrinks	§2.5
"ROC-AUC = model quality"	Only ranking. Miscalibrated model can have AUC=1.0	§2.8
"Linear regression assumes linear relationships"	Linear in PARAMETERS. Features can be x², log(x)	§3.1
"Random Forest never overfits"	Can — small noisy data, deep trees	§3.4
"Boosting = bagging with more steps"	Parallel+average vs sequential+residuals	§3.5
"Feature engineering is dead (deep learning)"	Dead for images/text. Dominant on tabular	§2.7
"Class imbalance = always SMOTE"	Distorts distributions. Class weights + threshold usually better	§2.10
"Higher accuracy = better model"	Only for balanced classes with symmetric costs	§2.8
"Logistic is non-linear (sigmoid)"	Decision boundary is still a straight line	§3.2
"KNN has no parameters"	K and distance metric are critical parameters	§3.7
"Deep learning always wins"	Not on tabular. XGBoost dominates spreadsheets	§3.5
"PCA = feature selection"	PCA creates NEW features. Selection keeps originals	§3.10

5.2 Interview phrasing guide

Each question is testing ONE specific concept. The annotation names the hidden test.

Question	They're testing...	Strong answer structure
"99% train, 82% val — what do you suspect?"	Overfitting diagnosis	Name it + 3 actions (learning curve, regularize, simplify)
"Why does L1 produce sparsity?"	Geometric intuition	Diamond corners on axes → weight hits zero
"Logistic is linear — how probabilities?"	Sigmoid placement	σ sits OUTSIDE linear part. Boundary is still a line
"RF vs boosting?"	Variance vs bias	Parallel+average vs sequential+residuals
"Strong AUC, weak precision — what now?"	Calibration vs threshold	Separate problems: recalibrate OR adjust threshold
"Explain this model to a non-technical PM"	SHAP communication	"Feature X contributed -0.25 to this decision because..."
"Design a fraud detection system"	End-to-end thinking	Data → features → model → metric → threshold → monitoring
"How would you handle 99.9% class imbalance?"	Toolkit knowledge	class_weight → threshold tuning → PR-AUC → focal loss

5.3 Self-check questions

Cannot answer in 60 seconds without notes → re-read that section.

Where does classical ML sit vs deep learning? When does each dominate? → §2.1
Name three signals of overfitting → §2.3
Train 98%, val 63% — diagnose and prescribe → §2.4
Why does L1 create sparsity? Draw the shapes → §2.5
Compute one gradient descent step by hand → §2.6
Why is feature engineering the biggest lever on tabular? → §2.7
When is accuracy a misleading metric? → §2.8
When should you NOT use random KFold? → §2.9
Name the assumptions of linear regression → §3.1
Why log loss over MSE for classification? → §3.2
Random forest vs boosting — what does each reduce? → §3.4/§3.5
When does XGBoost lose to deep learning? → §3.5
Explain the kernel trick in 30 seconds → §3.8
What does PCA maximize? → §3.10
Name 3 production failure modes and how to detect them → §4.4

5.4 Health check

[ ] Can explain AI/ML/DL taxonomy and where classical ML dominates
[ ] Can sketch bias-variance curve and L1/L2 shapes from memory
[ ] Can compute precision, recall, F1 without notes
[ ] Can trace one gradient descent step by hand
[ ] Can explain RF vs boosting (variance vs bias)
[ ] Can name 4 assumptions of linear regression
[ ] Can explain XGBoost vs DL on tabular in 90 seconds
[ ] Can explain PCA and K-means at whiteboard level
[ ] Can list 3 production failure modes with detection strategies
[ ] Ready to move into 01_neural_network_primitives

🔗 Prerequisites & connections

What you should already know. Weighted sums. Derivatives at "slope of a curve" level. Train/val/test discipline. Basic probability.
Where this leads. Next week: neural networks. Same pieces, deeper stack. Loss functions (cross-entropy returns), regularizers (L1/L2 → weight decay), optimizers (gradient descent → Adam), bias-variance, calibration — ALL return. These aren't training wheels. They're load-bearing intuitions for everything after.

⏱️ Difficulty markers

🟢 Linear regression, logistic regression, KNN, Naive Bayes
🟡 Bias-variance (easy to define, hard to feel), evaluation under imbalance, decision trees
🔴 L1 vs L2 geometry, calibration, boosting vs bagging mechanics, PCA math, SVM kernel trick