07. Logistic regression — the simplest spam classifier + confidence score¶

One linear score. One squash. A real classifier. Three minutes.

Built on the ELI5 in 00-eli5.md. The feature list, the prediction, and the confidence score — those three pictures live throughout this file.

The setup — a regression that refused to stay numerical¶

Linear regression predicts a number. Price. Salary. House value.

But in email filtering, we want spam or ham. A class, not a number. And we want a confidence score — not just a yes/no, but how sure.

So what to do? Take the linear score. Squash it into a probability between 0 and 1. That probability is exactly the confidence score from the ELI5.

That squash is the sigmoid. That whole machine is logistic regression.

Mental model before math — the S-curve squasher¶

Picture a number line stretching from minus infinity to plus infinity. That is the linear score z = w·x + b. It can be anything — minus 7, plus 12, zero.

Now imagine a soft S-shaped slide that takes any number on that line and gently bends it into the strip between 0 and 1. Big positive z slides up near 1. Big negative z slides down near 0. Zero lands exactly at 0.5 — total uncertainty, the confidence score sitting at the middle.

        sigmoid(z) — the confidence score
   p ↑
 1.0 |·····················___________
     |                ____/
 0.88|·············__/   <- z=2
     |          __/
 0.5 |--------+----------- <- z=0, total uncertainty
     |     __/
 0.12|··__/    <- z=-2
     |_/
 0.0 |·····················
     +--+----+----+----+-----→ z
       -4   -2    0    2    4

That curve is the whole story. The flat ends say "I am sure". The steep middle says "I am genuinely unsure". A confidence score that knows how to be uncertain.

The math — z, then sigmoid¶

Two steps. One linear. One non-linear.

z = w·x + b           (linear score, any real number)
p = 1 / (1 + e^(-z))  (sigmoid, squashes to (0, 1))

Three reference points to memorize:

z	sigmoid(z)	Confidence score says
−2	0.12	Probably ham (12% chance of spam)
0	0.50	Coin flip: model has no idea
+2	0.88	Probably spam (88% chance of spam)
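
The reference points are easy to check by hand. A minimal sketch, assuming scores of moderate size (this naive form can overflow inside math.exp for very large negative z); the function name sigmoid is just an illustrative choice:

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The three reference points from the table above.
for z in (-2.0, 0.0, 2.0):
    print(f"z = {z:+.0f}  ->  p = {sigmoid(z):.2f}")
```

The three printed probabilities are 0.12, 0.50, and 0.88, matching the table.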

The threshold (usually 0.5) turns probability into a label. But the threshold is a mailbox policy decision, not a model decision. For auto-delete you may set it to 0.95. For a softer spam folder, 0.5 is fine.
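
To see the threshold as a policy knob, here is a small sketch. The helper label_at and the probability 0.88 are hypothetical, and treating a score exactly at the threshold as spam is an arbitrary choice made for the example:

```python
def label_at(p_spam: float, threshold: float) -> str:
    """Turn a spam probability into a label under a given mailbox policy."""
    return "spam" if p_spam >= threshold else "ham"

p = 0.88  # hypothetical model output for one email

print(label_at(p, 0.5))   # softer spam-folder policy
print(label_at(p, 0.95))  # strict auto-delete policy
```

Same email, same probability, two different labels. The model did not change. Only the policy did.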

Softmax — when classes are more than two¶

Logistic regression handles two classes. Spam vs ham.

For K classes, replace sigmoid with softmax:

p_k = exp(z_k) / Σ_j exp(z_j)     (sum over all K classes, j = 1..K)

Each class gets a probability. All the probabilities sum to 1. This is multinomial logistic regression — same idea, just one score per class instead of one score total.
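
A minimal softmax sketch in plain Python. The three class scores are made up for illustration, and subtracting the maximum score is a common stability trick added here, not something the formula above requires:

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Convert K real-valued scores into K probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes: spam, promotions, ham.
probs = softmax([2.0, 0.5, -1.0])
print([round(p, 3) for p in probs])
print(sum(probs))
```

With two classes and scores (z, 0), softmax gives exp(z) / (exp(z) + 1), which is the sigmoid again. The binary case is just softmax with one score pinned at zero.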

Worked example — three emails, three features¶

Three features: word_count, link_count, and sender_reputation. Target: spam (1) or ham (0).

Pretend training has given us weights:

w_words = 0.01        (longer blasts lean spam)
w_links = 1.0         (more links lean spam)
w_rep   = -0.6        (trusted senders pull toward ham)
b       = -2.0        (the intercept that centers things)

Three emails hit the filter.

Email	words	links	rep	z = 0.01·words + 1.0·links − 0.6·rep − 2.0	p = sigmoid(z)	Prediction at 0.5
A — short note from known sender	40	0	9	0.4 + 0 − 5.4 − 2.0 = −7.0	0.001	ham
B — newsletter with many links	180	3	5	1.8 + 3 − 3.0 − 2.0 = −0.2	0.450	ham
C — urgent offer from shady sender	220	5	1	2.2 + 5 − 0.6 − 2.0 = 4.6	0.990	spam

See the middle case. Email B gets a 45% spam score. That is the confidence score being honest — not 10%, not 95%, but "I am leaning ham, yet I am not relaxed." A good email system may keep it in the inbox, send it to Promotions, or ask for one more rule check. The label hides that nuance. The probability is the value.

This is exactly what ELI5 promised. Two outputs. Prediction (the label). Confidence score (the probability). Same machine.
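
The table can be reproduced in a few lines. A sketch using the weights from the worked example; the function name spam_probability and the dictionary layout are illustrative choices:

```python
import math

# Weights from the worked example.
W_WORDS, W_LINKS, W_REP, BIAS = 0.01, 1.0, -0.6, -2.0

def spam_probability(words: int, links: int, rep: int) -> tuple[float, float]:
    """Return (z, p) for one email: linear score, then sigmoid."""
    z = W_WORDS * words + W_LINKS * links + W_REP * rep + BIAS
    p = 1.0 / (1.0 + math.exp(-z))
    return z, p

# (word_count, link_count, sender_reputation) for emails A, B, C.
emails = {"A": (40, 0, 9), "B": (180, 3, 5), "C": (220, 5, 1)}

for name, features in emails.items():
    z, p = spam_probability(*features)
    label = "spam" if p >= 0.5 else "ham"
    print(f"{name}: z = {z:+.1f}, p = {p:.3f}, {label}")
```

Each call returns both the score and the probability, so the label can be decided later by whatever threshold the mailbox policy picks.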

X cannot do Y — why MSE on sigmoid outputs is bad¶

Now the failure. People sometimes ask: why not just use mean squared error on the predicted probability? (p − y)² looks fine. It is differentiable. Let us try.

Setup: true label y = 1. Model predicts three different wrong p values. We compute the gradient of the loss with respect to the score z.

A useful fact: dp/dz = p(1−p). The sigmoid's slope.
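
For readers who want the chain rule spelled out, here is a short derivation for the case y = 1. It uses only the sigmoid definition and the slope fact above:

```latex
\begin{aligned}
\frac{dp}{dz} &= \frac{e^{-z}}{(1 + e^{-z})^{2}} = p\,(1 - p) \\[4pt]
\text{MSE: } L = (p - 1)^{2} \;&\Rightarrow\; \frac{dL}{dz} = 2\,(p - 1)\cdot p\,(1 - p) \\[4pt]
\text{CE: } L = -\log p \;&\Rightarrow\; \frac{dL}{dz} = -\frac{1}{p}\cdot p\,(1 - p) = p - 1
\end{aligned}
```

The 1/p from the log cancels the p in the sigmoid's slope. MSE has no such cancellation, so the p(1 − p) factor stays in its gradient.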

Attempt 1 — model is very wrong, p = 0.05¶

Truth says 1. Model is screaming 0. We need a big, loud "fix this" gradient.

Loss	Formula	dL/dz	Number
MSE	(p−1)²	2(p−1)·p(1−p)	2(−0.95)(0.05)(0.95) = −0.0903
Cross-entropy	−log(p)	p − 1	0.05 − 1 = −0.95

Cross-entropy yells −0.95. MSE whispers −0.09. Ten times softer. The model that desperately needs a big nudge gets a tiny one from MSE.

Attempt 2 — model is moderately wrong, p = 0.2¶

Loss	Formula	dL/dz	Number
MSE	(p−1)²	2(p−1)·p(1−p)	2(−0.8)(0.2)(0.8) = −0.256
Cross-entropy	−log(p)	p − 1	0.2 − 1 = −0.80

CE is still roughly 3× louder.

Attempt 3 — model is mildly wrong, p = 0.4¶

Loss	Formula	dL/dz	Number
MSE	(p−1)²	2(p−1)·p(1−p)	2(−0.6)(0.4)(0.6) = −0.288
Cross-entropy	−log(p)	p − 1	0.4 − 1 = −0.60

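CE is still about 2× louder, but the gap is shrinking. Notice that the MSE gradient barely moved between p = 0.2 and p = 0.4: for y = 1 it peaks near p = 1/3 and shrinks on both sides. Only the CE gradient falls steadily as the prediction approaches the truth. That is the correct behaviour.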

The pattern. MSE multiplies by p(1−p), which crushes the gradient near p = 0 and near p = 1. One of those ends is always the confidently-wrong end: p near 0 when y = 1, p near 1 when y = 0. Cross-entropy cancels that crushing factor out. Wrong-and-confident produces a loud gradient. Right-and-confident produces a quiet one. Perfect teaching signal.

        gradient magnitude when y=1, varying p

  |dL/dz| ↑
  1.0 |·····  CE
      |   \__
      |      \___
  0.5 |          \___
      |              \__
      |     MSE          \__
  0.1 |  ___---···---___      \__
      |_/    flat at edges    \_·
      +----+----+----+----+----+--→ p
       0.05 0.2  0.4  0.6  0.8  1.0

CE keeps a useful slope all the way to the edge. MSE flattens out in both corners. So the wrong-and-confident sample, the one that most needs learning, gets the least signal under MSE. That is the failure.
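
The three attempts can be recomputed in code. A sketch with hypothetical helper names mse_grad and ce_grad, using the gradient formulas from the tables and defaulting to the true label y = 1:

```python
def mse_grad(p: float, y: float = 1.0) -> float:
    """dL/dz for L = (p - y)^2 when p = sigmoid(z)."""
    return 2.0 * (p - y) * p * (1.0 - p)

def ce_grad(p: float, y: float = 1.0) -> float:
    """dL/dz for binary cross-entropy when p = sigmoid(z)."""
    return p - y

# Very wrong, moderately wrong, mildly wrong.
for p in (0.05, 0.2, 0.4):
    m, c = mse_grad(p), ce_grad(p)
    print(f"p = {p:.2f}: MSE {m:+.4f}, CE {c:+.4f}, CE is {c / m:.1f}x louder")
```

The ratio between the two gradients is largest at p = 0.05, which is exactly where the model most needs correcting.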

Picture — log-loss vs MSE for a true-positive case¶

Same setup. True label is 1. Vary the predicted probability from 0.01 to 0.99. Plot the loss.

   loss ↑
   4.6 |\
       | \
       |  \   log-loss = -log(p)
       |   \    "punishes wrong-confident sharply"
   2.3 |    \_
       |      \__
       |         \___
   1.0 |             \____
       |                  \________
       |     MSE = (p-1)²            \______
   0.0 |·························_____________·
       +----+----+----+----+----+----+----+--→ p
        0.01 0.1  0.3  0.5  0.7  0.9  0.99

Log-loss goes to infinity as p approaches 0 — confidently wrong is catastrophic. MSE caps at 1. This is why CE refuses to let the model be confidently wrong. A spam filter that confidently lets junk into the inbox or buries real mail in Spam is the one users stop trusting.
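
A few points on both curves, computed directly. This is a sketch for the true-label-is-1 case only, with illustrative function names:

```python
import math

def log_loss(p: float) -> float:
    """Cross-entropy for a single example whose true label is 1."""
    return -math.log(p)

def squared_error(p: float) -> float:
    """Squared error for a single example whose true label is 1."""
    return (p - 1.0) ** 2

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p = {p:.2f}: log-loss {log_loss(p):.2f}, MSE {squared_error(p):.2f}")
```

Push p toward 0 and log_loss keeps growing without bound, while squared_error never exceeds 1.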

MLE connection — same math, opposite sign¶

Minimizing cross-entropy is the same as maximizing likelihood. Likelihood says: make the observed labels as probable as possible under the model. For 0/1 labels, cross-entropy is exactly the negative log-likelihood, so the log-likelihood is just cross-entropy with the sign flipped. Same math. Same optimum.
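
A quick numeric check of that identity. The four labels and probabilities are invented for the example, and cross-entropy is summed rather than averaged so the two numbers match exactly:

```python
import math

# Hypothetical labels (1 = spam) and predicted spam probabilities for four emails.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.1]

# Likelihood: the probability the model assigns to the labels actually observed.
likelihood = 1.0
for y, p in zip(labels, probs):
    likelihood *= p if y == 1 else 1.0 - p

# Binary cross-entropy, summed over the same emails.
cross_entropy = -sum(
    y * math.log(p) + (1 - y) * math.log(1.0 - p) for y, p in zip(labels, probs)
)

print(-math.log(likelihood), cross_entropy)
```

The two printed numbers agree up to floating-point rounding. Raising the likelihood and lowering the cross-entropy are the same move.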

Where this lives in the wild¶

Logistic regression is everywhere you need a fast, auditable, calibrated first-pass prediction.

Gmail spam filter — first stage. A logistic regression over millions of token features scores each email in microseconds. Most obvious spam and obvious ham never reach the deeper classifier.
Affirm credit decisioning. First-pass approval uses logistic regression on income, debt, and history. Fast, auditable for ECOA compliance — the regulator can read the weights.
sklearn LogisticRegression, the default tabular baseline. Many Kaggle notebooks start here. If a fancy model cannot beat L2-regularized logistic regression by a wide margin, the fancy model is not earning its complexity.
B2B marketing lead scoring (HubSpot, Salesforce Einstein). Predict probability that a lead converts. Sales teams sort by score. The probability — not the binary label — is what drives the queue.
Card-fraud triage at banks. A lightweight logistic model scores each swipe before heavier ensembles run. Fast probability first, richer models later.

Pattern. Whenever you need probability + speed + auditability, logistic regression is the default. Voting panels come later.

Pause and recall. Without scrolling — what does sigmoid do, geometrically? Why is cross-entropy better than MSE for classification, in one phrase? What two things is logistic regression giving you that link back to the ELI5? If any link is fuzzy, scroll back.

Interview Q&A¶

Q: Why sigmoid + cross-entropy, not sigmoid + MSE?
A: MSE multiplies the gradient by p(1−p), which kills learning exactly when the model is confidently wrong. Cross-entropy's gradient is p − y, clean and proportional to error. The wrong-and-confident sample gets a loud nudge under CE and a near-zero nudge under MSE.
Common wrong answer to avoid: "MSE is for regression, CE is for classification — by convention." That is true but shallow. The real answer is the gradient shape. Conventions exist because the math made them.

Q: How does logistic regression handle more than two classes?
A: Softmax replaces the sigmoid. Each class gets its own weight vector, each class gets a probability, and the probabilities sum to 1. That native multiclass form is called multinomial logistic regression. One-vs-rest with separate models works, but it is not the core multiclass formulation.
Common wrong answer to avoid: "Use one-vs-rest with separate models." That can work as a workaround, but softmax is the native multiclass form.

Q: Is the decision boundary still linear?
A: Yes. The sigmoid only re-labels the y-axis. The set of points where p = 0.5 is exactly where z = 0, which is w·x + b = 0 — a hyperplane. So logistic regression cannot solve XOR-shaped problems any more than linear regression can. You need feature engineering or a non-linear model for that.
Common wrong answer to avoid: "Sigmoid is non-linear, so the boundary is curved." Sigmoid is non-linear in z, but z itself is linear in x. The boundary in feature space is a straight line in two dimensions and a flat hyperplane in general.

Q: What does the magnitude of a weight w_j tell you?
A: How much one unit of feature j changes the log-odds. Big positive w_j means that feature pushes hard toward class 1. Big negative means it pushes hard toward class 0. Near zero means it barely matters. But only compare magnitudes after scaling features to similar ranges — otherwise a huge weight may just belong to a tiny-ranged feature.
Common wrong answer to avoid: "the biggest weight is always the most important feature." Without comparable feature scales, raw weight size can mislead badly.

Q: Why is logistic regression a strong production baseline despite being "simple"?
A: It is fast, gives calibrated probabilities out of the box when data is decent, is auditable because you can read every weight, retrains in seconds, and handles millions of sparse features through L1/L2 regularization. Product teams love it because they can ship, monitor, and explain it on day one.
Common wrong answer to avoid: "It's outdated, deep models are always better." On clean tabular data with limited rows, well-engineered logistic regression often matches or beats deeper models — and it ships in a fraction of the time.
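
The weight-magnitude answer above can be made concrete. Since p = sigmoid(z), the odds p / (1 − p) equal exp(z), so z is the log-odds. A sketch using the weights from the worked example and email B, with one extra link added as a hypothetical change:

```python
import math

# Weights from the worked example.
W_WORDS, W_LINKS, W_REP, BIAS = 0.01, 1.0, -0.6, -2.0

def log_odds(words: int, links: int, rep: int) -> float:
    """The linear score z, which is the log-odds of spam."""
    return W_WORDS * words + W_LINKS * links + W_REP * rep + BIAS

base = log_odds(180, 3, 5)           # email B as given
one_more_link = log_odds(180, 4, 5)  # same email with one extra link

# One more unit of a feature shifts the log-odds by that feature's weight.
print(one_more_link - base)

# Equivalently, the odds of spam are multiplied by exp(weight).
print(math.exp(one_more_link) / math.exp(base), math.exp(W_LINKS))
```

The shift equals w_links only because link_count moved by exactly one unit. A feature measured in hundreds, like word_count, needs a much smaller weight to have the same pull, which is why scaling comes before comparing weights.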

Apply now (5 min)¶

Hand-compute the confidence score for one new email. No calculator needed. Use these rounded sigmoid values: σ(−3) ≈ 0.05, σ(−1) ≈ 0.27, σ(0) = 0.5, σ(1) ≈ 0.73, σ(3) ≈ 0.95.

Email X: word_count = 140, link_count = 2, sender_reputation = 4. Use weights from the worked example. Compute z. Read off p. Decide the prediction at threshold 0.5.

Then — without looking back — sketch from memory:

The S-curve, axes labelled, three points marked (z = −2, 0, +2).
Two words: why is MSE-on-sigmoid bad?
One sentence: what does logistic regression give back to the ELI5? Prediction + confidence score.

If you can do all three in under 2 minutes, you own this idea.

Bridge. Logistic regression draws a straight line. The feature list is rarely linearly separable as-given. The next file, 08-feature-engineering.md, shows how to bend the data instead of bending the model: feature engineering.