05. Linear regression: drawing one line through the cloud

Three minutes. One picture. The simplest learner that still ships in production every day.

Builds on the ELI5 in 00-eli5.md. Feature list, prediction, and confidence score are the names to keep in your head. Linear regression is the simplest model that turns a feature into a number.

The picture before the math

See. Imagine a scatter of dots on graph paper. Each dot is one house. The horizontal axis is one feature — say, square footage in hundreds. The vertical axis is the prediction target — price in lakhs.

The dots make a cloud. Not a line. A messy cloud. But the cloud has a slope — square footage up, price up.

So what to do? Draw one straight line through the cloud such that the line is as close to all the dots as possible.

price ↑
      |              ●        ╱
   80 |          ●      ●  ╱
      |              ●  ╱        ← best-fit line
   60 |          ●  ╱  ●
      |       ● ╱     ↕  ← vertical residual
   40 |   ● ╱  ●
      |  ╱ ●
   20 | ╱
      +─────────────────────→ sqft (hundreds)
        9    10   11   12

The vertical bars are residuals — how far each dot is above or below the line. Note: vertical distance, not perpendicular. We measure straight up and down because y is what we predict, x is given.

We hunt for the line that makes the sum of squared vertical bars as small as possible. That is linear regression.

The math, finally

The line has two knobs:

ŷ = w·x + b

w: slope. How much y rises per unit of x.
b: bias. The predicted value when x = 0, which is where the line crosses the y-axis.

The thing we minimize is mean squared error:

MSE = (1/n) · Σ (yᵢ - ŷᵢ)²

Why squared and not absolute? Two reasons. One, squaring punishes big misses harder than small misses — a residual of 4 costs 16, four residuals of 1 cost 4 total. The line bends to avoid one big miss. Two, the squared loss is smooth and differentiable everywhere — calculus gives us a closed-form solution. Absolute error has a kink at zero. No clean formula.
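
A minimal sketch of the two losses, using the numbers from the paragraph above. The function names `mse` and `mae` and the toy lists are illustrative assumptions, not part of any library.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: four targets of 0 and two ways of missing them.
targets = [0, 0, 0, 0]
one_big_miss = [4, 0, 0, 0]       # one residual of 4
four_small_misses = [1, 1, 1, 1]  # four residuals of 1

print(mse(targets, one_big_miss), mse(targets, four_small_misses))  # 4.0 1.0
print(mae(targets, one_big_miss), mae(targets, four_small_misses))  # 1.0 1.0
```

Squared error rates the single big miss four times worse. Absolute error cannot tell the two cases apart.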

Worked example: five houses, by hand

Five houses. Square footage x (in hundreds), price y (in lakhs).

x	y
1	2
2	3
3	5
4	4
5	6

The closed-form solution — the normal equation — gives:

w = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
b = ȳ - w·x̄

Step 1. Means. x̄ = (1+2+3+4+5)/5 = 3. ȳ = (2+3+5+4+6)/5 = 4.

Step 2. Deviations and products.

xᵢ	yᵢ	xᵢ-x̄	yᵢ-ȳ	(xᵢ-x̄)(yᵢ-ȳ)	(xᵢ-x̄)²
1	2	-2	-2	4	4
2	3	-1	-1	1	1
3	5	0	1	0	0
4	4	1	0	0	1
5	6	2	2	4	4
Σ				9	10

Step 3. Slope and bias.

w = 9 / 10 = 0.9
b = 4 - 0.9·3 = 4 - 2.7 = 1.3

So the best line is ŷ = 0.9·x + 1.3. That means: every extra 100 sqft adds about 0.9 lakh to the predicted price.
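
Steps 1 to 3 can be sketched in a few lines of plain Python. The helper name `fit_line` is an assumption made for this example; it simply mirrors the two formulas above.

```python
def fit_line(xs, ys):
    """Return (w, b) for the least-squares line through the points.

    Assumes at least two distinct x values; otherwise s_xx is zero.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / s_xx
    b = y_mean - w * x_mean
    return w, b


xs = [1, 2, 3, 4, 5]  # square footage, in hundreds
ys = [2, 3, 5, 4, 6]  # price, in lakhs
w, b = fit_line(xs, ys)
print(round(w, 4), round(b, 4))  # 0.9 1.3
```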

Step 4. Verify. Compute predictions and residuals.

x	y	ŷ = 0.9x + 1.3	residual y-ŷ	squared
1	2	2.2	-0.2	0.04
2	3	3.1	-0.1	0.01
3	5	4.0	+1.0	1.00
4	4	4.9	-0.9	0.81
5	6	5.8	+0.2	0.04
Σ			0	1.90

Residuals sum to zero. That is always true at the optimum whenever the model includes a bias term, and it is the same as saying the line passes through (x̄, ȳ). MSE is 1.90/5 = 0.38.
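
The Step 4 check can be reproduced with a short sketch. It hard-codes w = 0.9 and b = 1.3 from Step 3 and compares against a small tolerance, because floating-point sums rarely land on exactly zero.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
w, b = 0.9, 1.3  # values from Step 3

# Vertical residuals: actual minus predicted.
residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)
mse = sum(r ** 2 for r in residuals) / len(residuals)

print(abs(residual_sum) < 1e-9)  # True
print(round(mse, 4))             # 0.38
```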

OLS assumptions: what sits under the clean math

Ordinary least squares assumes four main things. Linearity: the target is a linear function of the features you feed in. Independence: one residual should not predict the next residual. Homoscedasticity: error variance stays roughly constant across the range. Normal residuals: mainly for confidence intervals and hypothesis tests, not for getting the point estimate itself. In practice, do not chant these. Check residual plots.

R²: how much variance did the line explain?

The common score is:

R² = 1 - SS_res / SS_tot

Read it as: how much of the target's variance did the model explain compared with just predicting the mean? R² = 1 is perfect. R² = 0 means you are no better than predicting the average price for every house.
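
A small sketch of the score, applied to the five-house example from earlier. The function name `r_squared` is an assumption for illustration.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

line_preds = [0.9 * x + 1.3 for x in xs]  # the fitted line
mean_preds = [4.0] * len(ys)              # always predict the average

print(round(r_squared(ys, line_preds), 4))  # 0.81
print(r_squared(ys, mean_preds))            # 0.0
```

Here SS_res = 1.90 and SS_tot = 10, so the line explains 81% of the variance. The always-predict-the-mean baseline scores exactly 0.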

Closed-form vs iterative: two ways to find the line

The closed-form normal equation for many features, with a column of ones added to X so that b becomes one of the weights, is:

w = (XᵀX)⁻¹ Xᵀy

One matrix inversion. Done. No loops. No learning rate.
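
A sketch of the normal equation in plain Python, under two assumptions worth stating. First, it solves the system (XᵀX)·w = Xᵀy by Gaussian elimination rather than forming the inverse explicitly, which gives the same answer. Second, it prepends a column of ones so the bias comes out as the first weight. The helper names are made up for this example.

```python
def solve_linear_system(a, rhs):
    """Solve a·w = rhs by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(a)
    # Augmented matrix [a | rhs], copied so the inputs are not modified.
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def normal_equation(rows, ys):
    """Least-squares weights [b, w1, w2, ...] for the given feature rows."""
    X = [[1.0] + list(row) for row in rows]  # leading 1 carries the bias
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


rows = [[1], [2], [3], [4], [5]]  # one feature per house
ys = [2, 3, 5, 4, 6]
b, w = normal_equation(rows, ys)
print(round(w, 4), round(b, 4))  # 0.9 1.3
```

The same function accepts rows with several features. Only the size of XᵀX changes.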

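But XᵀX is a k × k matrix where k is the number of features. Building it costs O(n·k²) for n rows, and inverting it costs roughly O(k³) time and O(k²) memory. With 100 features, trivial. With 100,000 features, infeasible: the matrix alone has 10¹⁰ entries, about 80 GB in 64-bit floats.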

So what to do? Use gradient descent instead. Start with random w, b. Compute the gradient of MSE. Step downhill. Repeat. That is the next file. Iterative needs many steps to reach what the closed form gets in one shot, but each step is cheap, so it scales to huge k and huge n. It also extends to losses that have no closed form, such as logistic regression and neural networks.
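
A rough preview of that loop on the five-house data. The learning rate of 0.05 and the 5000 steps are assumptions chosen by hand for this tiny dataset, and the sketch starts from zero instead of random values so that the run is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimize MSE for the line w·x + b by plain gradient descent."""
    w, b = 0.0, 0.0  # assumption: zero start instead of a random one
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point at the current w, b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))  # 0.9 1.3
```

It lands on the same line as the closed form, just by many small steps instead of one solve. A learning rate much larger than this would make the loop diverge on this data.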

Rule of thumb. Few features and modest data: closed form. Many features or streaming data: gradient descent.

Where linear regression fails: fitting `y = x²`

Now a failure case. A rigid model insists the target scales linearly with one feature. It tries to fit y = x² with a straight line. Watch it fail in three different ways.

Attempt 1: fit on `x ∈ [0, 1]`

Five points. (0, 0), (0.25, 0.0625), (0.5, 0.25), (0.75, 0.5625), (1, 1).

The closed-form gives ŷ = 1.0·x - 0.125 for these five points. Looks okay-ish. MSE is small (about 0.011) because the curve is mild on this range. The analyst says "see, linear works".

Attempt 2: fit on `x ∈ [-1, 1]`

Now the parabola dips and rises. Symmetric points. Means x̄ = 0, ȳ ≈ 0.4.

The slope formula gives w = 0 because positive and negative x cancel. Best line is ŷ = 0.4, a flat horizontal line through the average. The prediction ignores x entirely, so R² = 0: no better than guessing the mean. The failure is obvious.

Attempt 3: fit on `x ∈ [10, 20]`

Now the parabola is steep and far from the origin. The closed-form gives roughly ŷ = 30·x - 215 (the exact intercept depends on which sample points you pick). The line catches the slope but misses the curvature. Residuals at the endpoints are positive, residuals in the middle are negative. The line systematically under-predicts the ends and over-predicts the middle. MSE looks numerically modest next to prices in the hundreds, but the pattern of residuals is structured, which is a strong sign the model shape is wrong.

Three attempts. Three different failures. Mild range: fakes success. Symmetric range: collapses to a flat line. Far range: gets the slope but misses the curvature. The structural reason is the same in all three. The truth is curved. The line cannot bend.
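
The three attempts can be rerun with a short sketch. Attempt 1 uses the five points listed above. For Attempts 2 and 3 the text does not list the sample points, so the 11-point evenly spaced grids here are an assumption, as are the helper names `fit_line` and `even_grid`.

```python
def fit_line(xs, ys):
    """Least-squares slope and bias for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / s_xx
    return w, y_mean - w * x_mean


def even_grid(start, stop, count):
    """Evenly spaced points from start to stop, inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


attempts = {
    "[0, 1]": [0, 0.25, 0.5, 0.75, 1],  # the five points from Attempt 1
    "[-1, 1]": even_grid(-1, 1, 11),    # assumed grid
    "[10, 20]": even_grid(10, 20, 11),  # assumed grid
}

fits = {}
for label, xs in attempts.items():
    ys = [x * x for x in xs]  # the true relationship is y = x²
    w, b = fit_line(xs, ys)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    # One character per point: + means the line under-predicts there.
    signs = "".join("+" if r > 0 else "-" for r in residuals)
    fits[label] = (w, b, signs)
    print(f"x in {label}: w={w:.3f} b={b:.3f} residual signs {signs}")
```

On these grids the fits should come out close to ŷ = x - 0.125, ŷ = 0.4, and ŷ = 30·x - 215. In every case the residual signs run plus, minus, plus from left to right. That is the structured pattern to look for.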

The fix is not "more data" or "smaller learning rate". The fix is feature engineering — add x² as a new column. Now ŷ = w₁·x + w₂·x² + b. The model is still linear in weights. But the shape it can draw is now a parabola. Files later in the module work this through.
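
A minimal sketch of that fix, assuming five hypothetical sample points on [-1, 1] and a hand-rolled least-squares solver (the helper names are invented for this example). Each row gets two features, x and x², and the model stays linear in its weights.

```python
def solve_linear_system(a, rhs):
    """Solve a·w = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def least_squares(rows, ys):
    """Weights [b, w1, w2, ...] that minimize squared error."""
    X = [[1.0] + list(row) for row in rows]  # leading 1 carries the bias
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical sample points
ys = [x * x for x in xs]          # the true relationship is y = x²

rows = [[x, x * x] for x in xs]   # engineered features: x and x²
b, w1, w2 = least_squares(rows, ys)
print(b, w1, w2)
```

The weights should come out at b = 0, w₁ = 0, w₂ = 1 up to rounding. On the same symmetric range the plain line could do no better than a flat guess.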

Where this lives in the wild

Zillow Zestimate baseline. The first version of home price prediction was a regularized linear regression on tens of features — square footage, bedrooms, location codes. Gradient boosting beat it later, but the linear baseline was the sanity check that said "are we even close?".
Lyft surge pricing baseline. Linear regression on demand minus supply gives the first-order multiplier. Production uses richer models, but the linear version is the always-on fallback when the deep model misbehaves.
Airbnb price suggestion baseline. New listings get a linear-regression suggested price from comparable nearby listings. It anchors the host before the more complex model takes over.
sklearn.linear_model.LinearRegression. The default first-thing-to-try in any tabular ML notebook. If your fancy model cannot beat this baseline, your fancy model is broken or your features are bad.

The pattern. Linear regression is the floor. If a complex model cannot beat the linear baseline by a meaningful margin, something is wrong upstream — not in the model, but in the data or the framing.

Pause and recall. Without scrolling — what is w, what is b, why squared error and not absolute, and what does the closed-form solution cost you when features explode? If any of those are fuzzy, scroll back.

Interview Q&A

Q: Why squared error, not absolute error?
A: Two reasons. Squared error is smooth and differentiable everywhere, which gives a clean closed-form solution and clean gradients. And squared error punishes large misses much harder than small ones, which is useful when one big mistake is worse than many small ones. Absolute error has a kink at zero, no closed form, and charges one miss of 4 exactly the same as four misses of 1.
Common wrong answer to avoid: "squared error is always better". It is not. When outliers are noise, absolute error or a close relative (Huber loss, quantile loss) is more robust. Squared error and outliers fight badly.
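
The simplest place to see that fight is a model with no features at all, just one constant prediction. Squared error is minimized by the mean and absolute error by the median. The prices below are made up for illustration.

```python
from statistics import mean, median

prices = [40, 42, 45, 47, 50]   # hypothetical prices, in lakhs
with_outlier = prices + [500]   # one hypothetical noisy entry

# Best constant under squared error = mean; under absolute error = median.
print(mean(prices), median(prices))  # 44.8 45
print(mean(with_outlier), median(with_outlier))
```

One bad entry drags the mean above 120 while the median barely moves. The same thing happens to a fitted line when it chases an outlier.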

Q: When is the closed-form solution preferable to gradient descent?
A: When the number of features is modest (a few thousand at most) and the data fits in memory. Closed form gives the exact optimum in one matrix inversion — no learning rate, no convergence check. Gradient descent wins when features are huge, data streams, or the loss has no closed form (logistic regression, neural networks).
Common wrong answer to avoid: "closed form is always exact and faster". It is exact, but O(k³) inversion kills it past ~10,000 features. And XᵀX can be ill-conditioned, blowing up the inverse. That is why ridge regression adds λI — for stability, not just regularization.

Q: What does the bias term b actually do?
A: It lets the line cross the y-axis anywhere, not just the origin. Without b, the line is forced through (0, 0). With b, the line shifts up or down to match the average level of y. Geometrically, w sets the tilt of the line and b is its height at x = 0.
Common wrong answer to avoid: "b is just a small correction term, not important." Wrong. Without b, the whole model family is different because every line is forced through the origin.
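
A quick sketch of what dropping b costs on the five-house data. With the line forced through the origin, minimizing Σ(yᵢ - w·xᵢ)² gives w = Σxᵢyᵢ / Σxᵢ². The variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]


def mse_of_line(w, b):
    """MSE of the line w·x + b on the five houses."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# With a bias term: the values from the worked example.
w_free, b_free = 0.9, 1.3

# Without a bias term: best slope for a line through (0, 0).
w_origin = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(round(mse_of_line(w_free, b_free), 4))                     # 0.38
print(round(w_origin, 4), round(mse_of_line(w_origin, 0.0), 4))  # 1.2545 0.6873
```

The through-origin line has to tilt steeper to make up for the missing offset, and its error nearly doubles.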

Q: Linear regression fits a curve when you add x² as a feature. So why is it still called "linear"?
A: Linear refers to the weights, not the input. ŷ = w₁·x + w₂·x² + b is linear in w₁, w₂, b. The shape it draws in x-space can be a parabola, but the optimization problem is still convex, with a unique minimum as long as the feature columns are not collinear. That is what makes feature engineering plus linear regression so powerful: curved shapes, simple math.
Common wrong answer to avoid: "linear regression can only draw lines". False. It can draw any shape your features can express. Hidden layers of neural networks do this feature engineering automatically — same trick, learned instead of hand-crafted.

Q: What are the assumptions of linear regression?
A: Linearity, independence of errors, homoscedasticity (constant error variance), and normality of residuals. In practice, check residual plots — they tell you much more than chanting the assumptions. And remember: the linearity assumption is about the relationship between features and target, which you can often achieve through feature engineering even when the raw data is nonlinear.
Common wrong answer to avoid: "Linear regression assumes the data itself is linear." No. It assumes the relationship is linear in the features you provide.

Apply now (5 min)

Take graph paper. Plot four dots: (1, 2), (2, 4), (3, 5), (4, 8).

By hand:

1. Compute x̄, ȳ.
2. Compute the slope w = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)².
3. Compute the bias b = ȳ - w·x̄.
4. Draw the line. Draw the four vertical residuals.
5. Sum the squared residuals. That is n · MSE.

Then — without looking — sketch from memory:

The cloud-of-dots picture with the best-fit line and vertical residuals.
The normal equation w = (XᵀX)⁻¹Xᵀy and one sentence on when it breaks.

If you can reproduce both in 90 seconds, you own this idea.

Bridge. Closed form is elegant but does not scale. So we need an iterative way to find w, b — one that works for huge data, huge features, and any differentiable loss. The next file is the hill-walker: 06-gradient-descent.md.