08. Feature engineering — making the data linearly separable

The model only draws certain shapes. If the data needs a curve and your model draws lines, change the data — not the model.

Built on the ELI5 in 00-eli5.md. The feature list is the feature vector. Feature engineering is deciding what columns to add before the model trains.


The picture before the math

See. A linear model — logistic regression, linear regression, linear SVM — can only draw a straight line (or flat plane). That is its shape budget. Nothing more.

But many real problems are not straight. House prices bend with sqft × neighbourhood. Fraud risk spikes only when two things hold together. Seasonality goes up and down. A line cannot trace a spiral.

So what to do? Two paths.

  1. Change the model. Use trees, kernels, neural nets — things that draw curves.
  2. Change the feature list. Add new columns that turn the curve in your data into a line in the new feature space.

Path 2 is feature engineering. The model still draws a line. But now the line lands where you want it.

This is the engineer adding columns. Raw square footage is one column. The engineer adds sqft × premium_neighbourhood or amount × is_new_merchant as a new column. Same house. Same transaction. Richer feature list. The prediction gets easier from the columns alone.


XOR — the cleanest demonstration

Recall XOR. Output is 1 only when exactly one input is 1.

x1   x2   y
0    0    0
0    1    1
1    0    1
1    1    0

In raw (x1, x2) space, the four points are diagonally interleaved. No straight line separates them.

x2 ↑
 1 |  ●────────○
   |  (0,1)=1  (1,1)=0
   |
 0 |  ○────────●
   |  (0,0)=0  (1,0)=1
   +─────────────→ x1
      0         1

(○ = class 0, ● = class 1.) You can confirm by trying. The two ● sit on one diagonal and the two ○ on the other, so any side of a line that holds both ● also holds at least one ○. This is the rule pile collapsing. One line is not enough.

Now the trick. Add a third feature — x3 = x1 · x2, the interaction. Recompute the table.

x1   x2   x3 = x1·x2   y
0    0    0            0
0    1    0            1
1    0    0            1
1    1    1            0

Lift the four points into 3D. Three of them sit on the floor x3 = 0. One — the (1,1) point — pops up to x3 = 1. Picture it.

        x3 ↑
         1 |        ○  (1,1,1)  y=0
           |
         0 |  ○────●────●        floor: x3=0
           | (0,0) (0,1) (1,0)
           +─────────────→ in (x1,x2) plane

The y=1 points (0,1) and (1,0) sit on the floor. The y=0 point (0,0) sits there too. The other y=0 point, (1,1), floats above.

Now we want one plane that separates y=1 from y=0. Score each point with w1·x1 + w2·x2 + w3·x3 + b and predict 1 when the score is positive, 0 otherwise. Pick:

  • w1 = 1, w2 = 1, w3 = -2, b = -0.5.

Compute for each point:

Point      w1·x1 + w2·x2 + w3·x3 + b      Predicted   Target   Match?
(0,0,0)    0 + 0 + 0 − 0.5 = −0.5         0           0        yes
(0,1,0)    0 + 1 + 0 − 0.5 = +0.5         1           1        yes
(1,0,0)    1 + 0 + 0 − 0.5 = +0.5         1           1        yes
(1,1,1)    1 + 1 − 2 − 0.5 = −0.5         0           0        yes

Four out of four. The same linear model that failed in 2D now succeeds in 3D — because we enriched the feature list. The interaction x1·x2 makes "AND" visible to a linear scorer. Geometry that needed a curve in 2D needs only a flat plane in 3D.

The model did not get smarter. The feature list got richer.
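
The XOR check is small enough to run. What follows is a minimal sketch, not a proof: it scores the four points with the weights w1 = 1, w2 = 1, w3 = -2, b = -0.5, and it also tries a coarse grid of 2D weights to show that none of them separates the raw (x1, x2) points. The helper names `predict` and `separates` and the grid range are assumptions made for illustration.

```python
from itertools import product

# XOR truth table as ((x1, x2), y) rows.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(weights, bias, features):
    """Linear scorer: predict 1 when w.x + b is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def separates(weights, bias, rows):
    """True when the linear scorer classifies every row correctly."""
    return all(predict(weights, bias, x) == y for x, y in rows)

# Raw 2D features: try every (w1, w2, b) on a coarse grid (illustrative range).
grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
found_2d = any(
    separates((w1, w2), b, XOR) for w1, w2, b in product(grid, repeat=3)
)

# Enriched 3D features: append the interaction x3 = x1 * x2.
XOR_3D = [((x1, x2, x1 * x2), y) for (x1, x2), y in XOR]
found_3d = separates((1, 1, -2), -0.5, XOR_3D)

print(found_2d, found_3d)  # False True
```

The grid search only samples weights, so it illustrates the 2D failure rather than proving it. The 3D result is exact: one extra column, same scorer, four out of four.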


Linear regression on time-series — three concrete attempts

A claim: plain linear regression on y = w·t + b cannot capture seasonality. Walk it through.

Take dummy daily sales over six weeks, days 1..42. True pattern: a slow upward trend plus a strong weekend bump (Sat, Sun ~30% higher) plus a summer-month bump (June +15%). The R² values quoted below are illustrative, in keeping with the dummy data.

Attempt 1 — fit y = w·t + b on raw day index

The model gets one slope. It estimates the average upward drift and ignores the weekly cycle. R² stays around 0.30. Residuals show a clean 7-day ripple — high every Sat/Sun, low every Tue/Wed. Predictions on next Saturday: too low. Next Tuesday: too high. The line cannot go up-and-down within a week.

Attempt 2 — add day-of-week one-hot (Mon, Tue, …, Sun)

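Now the feature list has 8 columns: t, is_Mon, is_Tue, …, is_Sun. (If the model also keeps an intercept b, drop one day column or drop the intercept. The seven day columns always sum to 1, so together they duplicate it.) The model can lift Saturday and Sunday by a constant. R² jumps to ~0.75. Weekend ripple disappears. But the residuals now show a slow hump centered on June.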

Attempt 3 — add sin(2π · month / 12) and cos(2π · month / 12)

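Two more columns. Cyclic features. Why both? Because December and January should be near each other on a circle, and the sin/cos pair encodes that. One of them alone is not enough: sin gives the same value to two different months at the same height on the wave, and the cos column tells them apart. R² climbs to ~0.92. Residuals look like noise. Done.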

attempt 1:     /             linear trend only — flat across week
attempt 2:    /\/\/\/\        + DOW one-hot — weekly ripple captured
attempt 3:    /\/\/\/\ ⌒      + sin/cos month — yearly hump captured

The model never changed. We changed the feature list. Each new column made one shape of curve representable as a flat line in the new space.
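
The three attempts can be replayed on synthetic data. This is a sketch under loud assumptions: the data is generated here, not taken from the example above, and it spans two years (730 days) instead of six weeks so the month features see full cycles. The trend, weekend, and seasonal coefficients, the noise level, and the helper names are all made up for illustration. Expect R² to rise with each attempt; the exact values will not match the ones quoted above.

```python
import math
import random
from datetime import date, timedelta

def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        p = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[p] = A[p], A[col]
        b[col], b[p] = b[p], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, k))) / A[i][i]
    return w

def r_squared(X, y, w):
    preds = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

# Synthetic sales (assumed): trend x weekend bump x summer hump + noise.
rng = random.Random(0)
days = [date(2023, 1, 2) + timedelta(days=i) for i in range(730)]
y = []
for i, d in enumerate(days):
    trend = 100 + 0.05 * i
    weekend = 1.3 if d.weekday() >= 5 else 1.0
    season = 1 + 0.15 * math.cos(2 * math.pi * (d.month - 6) / 12)
    y.append(trend * weekend * season + rng.gauss(0, 3))

def features(i, d, level):
    t = i / 365  # rescaled day index
    if level == 1:
        return [1.0, t]  # attempt 1: intercept + trend
    # Attempt 2: the seven day columns sum to 1, so they replace the intercept.
    row = [t] + [1.0 if d.weekday() == k else 0.0 for k in range(7)]
    if level == 3:
        angle = 2 * math.pi * d.month / 12
        row += [math.sin(angle), math.cos(angle)]  # attempt 3: cyclic month
    return row

scores = []
for level in (1, 2, 3):
    X = [features(i, d, level) for i, d in enumerate(days)]
    scores.append(r_squared(X, y, fit_least_squares(X, y)))
print([round(s, 2) for s in scores])
```

The solver is plain least squares each time. Only the columns differ between the three runs.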


Catalog of common feature transforms

The engineer's toolkit for reshaping the feature list.

Log transform — for skewed numerics

Income, page views, transaction amount — heavy right tail. Most rows are small. A few are huge.

raw income:     |▓▓▓░░░░░░░░░░░░░░░░░░  ░  ░    →  median squashed near 0
log(income):    ░░▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░       →  spread out, near-Gaussian

Why it helps. Linear models give equal weight per unit. With raw income, going from ₹10k to ₹20k is the "same" as ₹100k to ₹110k — but in risk it is not. log makes equal ratios look equal.
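
A small sketch of the "equal ratios look equal" point, using the income figures above. The variable names are illustrative. For columns that can contain zeros, such as page views, `math.log1p` is a common substitute because plain log is undefined at 0.

```python
import math

# Two equal raw steps of 10k, at the low end and at the high end.
raw_gap_low = 20_000 - 10_000
raw_gap_high = 110_000 - 100_000

# After the log, a step is measured by its ratio, not its size.
log_gap_low = math.log(20_000) - math.log(10_000)       # a doubling
log_gap_high = math.log(110_000) - math.log(100_000)    # a 10% rise
log_gap_double = math.log(200_000) - math.log(100_000)  # another doubling

print(raw_gap_low == raw_gap_high)                # True: raw scale sees no difference
print(round(log_gap_low, 3), round(log_gap_high, 3), round(log_gap_double, 3))
```

Both doublings land at log(2), about 0.693, while the 10% rise shrinks to about 0.095.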

One-hot encoding — for low-cardinality categoricals

City has 5 values. Cannot feed city = "Mumbai" to a linear model. Cannot feed city_id = 3 either — that says Mumbai is "between" Delhi (2) and Bangalore (4), which is nonsense.

city = "Mumbai"  →  [is_Mumbai=1, is_Delhi=0, is_Bangalore=0, is_Pune=0, is_Chennai=0]

Each city gets its own column. The model picks an independent weight per city. No false ordering.
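
A minimal hand-rolled sketch of the encoding, assuming the five cities above. The function name `one_hot` and the choice to map unseen cities to all zeros are illustrative decisions, not a library's behaviour.

```python
CITIES = ["Mumbai", "Delhi", "Bangalore", "Pune", "Chennai"]

def one_hot(value, categories):
    """One 0/1 column per category; an unseen value maps to all zeros."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("Mumbai", CITIES))  # [1, 0, 0, 0, 0]
print(one_hot("Pune", CITIES))    # [0, 0, 0, 1, 0]
```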

Target encoding — for high-cardinality categoricals

Zip code has 30,000 values. One-hot blows up. So replace each zip with the mean target of training rows in that zip.

zip 110001 → mean default rate in 110001 = 0.04 → feature value 0.04

Powerful. But — leak warning below.

Sin/cos — for cyclic features

Hour of day, day of week, day of year, compass direction. December → January is one step on the calendar but a huge integer jump (12 → 1).

hour = 23  →  sin(2π·23/24) ≈ -0.26,  cos(2π·23/24) ≈ +0.97
hour = 1   →  sin(2π·1/24)  ≈ +0.26,  cos(2π·1/24)  ≈ +0.97

11pm and 1am are now neighbors on the circle, as they should be. Linear models can finally use "time of day" without breaking at midnight.
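
The midnight wrap can be checked numerically. A minimal sketch, with `encode_cyclic` as an illustrative helper name: it maps a value to a point on the unit circle and compares distances between hours.

```python
import math

def encode_cyclic(value, period):
    """Map a cyclic value to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

h23 = encode_cyclic(23, 24)
h1 = encode_cyclic(1, 24)
h11 = encode_cyclic(11, 24)

near = math.dist(h23, h1)   # 11pm vs 1am: two hours apart
far = math.dist(h23, h11)   # 11pm vs 11am: opposite sides of the clock

print(round(h23[0], 2), round(h23[1], 2))  # -0.26 0.97
print(round(near, 2), round(far, 2))       # 0.52 2.0
```

On the raw integer scale, 23 and 1 are 22 apart. On the circle they are closer than any pair of hours three or more hours apart.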

Binning — for non-linear thresholds

Age 0–18, 19–35, 36–60, 61+. Useful when risk jumps at a threshold rather than rising smoothly. Trades resolution for shape freedom.
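
A minimal binning sketch, assuming whole-number ages and treating 18, 35, and 60 as inclusive upper edges. The names `UPPER_EDGES`, `LABELS`, and `age_bin` are illustrative.

```python
from bisect import bisect_left

UPPER_EDGES = [18, 35, 60]  # inclusive upper bound of each closed bin
LABELS = ["0-18", "19-35", "36-60", "over 60"]

def age_bin(age):
    """Index of the bin an age falls into (0 = youngest bin)."""
    return bisect_left(UPPER_EDGES, age)

def age_bin_one_hot(age):
    """One column per bin, so the model gets an independent weight per bin."""
    index = age_bin(age)
    return [1 if i == index else 0 for i in range(len(LABELS))]

print(LABELS[age_bin(18)], LABELS[age_bin(19)], LABELS[age_bin(61)])
print(age_bin_one_hot(42))  # [0, 0, 1, 0]
```

Feeding the one-hot bins to a linear model lets it learn a separate level per age band, which is the "shape freedom" being bought.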

Date features

timestamp → day_of_week, is_weekend, is_holiday, month, hour, days_since_signup. Raw timestamps hide every cycle. Decompose them.
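
A sketch of the decomposition using only the standard library. The `HOLIDAYS` set is a hypothetical one-entry calendar and the sample timestamp and signup date are made up; a real pipeline would load a proper holiday calendar.

```python
from datetime import date, datetime

HOLIDAYS = {date(2024, 1, 26)}  # hypothetical holiday calendar

def date_features(ts, signup):
    """Break one timestamp into the columns a model can actually use."""
    return {
        "day_of_week": ts.weekday(),            # Monday=0 ... Sunday=6
        "is_weekend": int(ts.weekday() >= 5),
        "is_holiday": int(ts.date() in HOLIDAYS),
        "month": ts.month,
        "hour": ts.hour,
        "days_since_signup": (ts.date() - signup).days,
    }

row = date_features(datetime(2024, 1, 27, 14, 30), signup=date(2024, 1, 1))
print(row)
```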

Text — TF-IDF

Convert a document into a vector of term_frequency × inverse_document_frequency weights per word. Common words (the, and) get small weights. Rare-but-document-specific words get large weights. Linear classifier on top — fast, surprisingly hard to beat for many tasks.
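
A bare-bones sketch of one common TF-IDF variant: term frequency as count divided by document length, and idf as log(N / document frequency). The three documents are made up, and tokenization is a plain whitespace split. Library implementations often add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

docs = ["the card was declined", "the refund was issued", "the card was stolen"]
vectors = tf_idf(docs)
for word, weight in sorted(vectors[0].items(), key=lambda kv: kv[1]):
    print(word, round(weight, 3))
```

Words that appear in every document ("the", "was") get weight 0 in this variant. "declined", which appears in one document only, gets the largest weight in the first document.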

Standardization — make scales comparable

Standardization means zero mean and unit variance per feature. It matters for linear models, logistic regression, SVM, KNN, and neural nets because distance and gradient size depend on scale. Trees do not care — a split on income > 8 works the same before or after scaling. So what to do? When in doubt, standardize for distance-based and gradient-based models. In sklearn, the default move is StandardScaler.
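
What standardization does to one column, as a hand-rolled sketch rather than the sklearn class itself. The income values are made up, the helper names are illustrative, and the code assumes the column is not constant (otherwise the standard deviation is 0). The mean and standard deviation are learned from training data and then reused on new rows.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and (population) standard deviation from training data."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Shift to zero mean and rescale to unit variance."""
    return [(x - mean) / std for x in column]

train_income = [3.0, 5.0, 8.0, 12.0, 40.0]  # made-up values
mean, std = fit_scaler(train_income)
scaled = transform(train_income, mean, std)

# New rows reuse the training statistics; they are not refit.
new_scaled = transform([8.0], mean, std)

print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```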


Pause and recall. Three transforms. Without scrolling — when do you reach for log? When for one-hot vs target encoding? Why do hour-of-day features need sin AND cos, not just one? And which model families actually care about standardization?


Where this lives in the wild

  • Stripe Radar — feature pipeline. Card fraud is XOR-shaped. New merchant alone — fine. Big amount alone — fine. Big amount × new merchant × foreign IP — risky. The feature pipeline manufactures hundreds of such interactions before the gradient-boosted scorer sees them.
  • Airbnb price suggestions. Date features, holiday flags, neighbourhood encodings, and listing-quality features all feed the pricing model. The feature factory is what makes the model accurate.
  • Booking.com search ranking. Query × destination interactions ("beach" × Goa, "ski" × Manali) plus user-history features × current-session features keep the ranker from collapsing to "popular hotels".
  • Netflix recsys. User × item encodings, watch-time-ratio features, and time-of-day × genre crosses still do heavy lifting before deeper models take over.
  • Zillow-style house pricing. Square footage, neighbourhood, renovation age, school rating, and interaction terms like sqft × premium_zipcode help a price model separate ₹45 lakh homes from ₹1.2 crore homes.

The pattern. Wherever a linear or shallow model still ships, look behind it — there is a fat feature pipeline doing the curve-bending.


Interview Q&A

Q: Why one-hot encode rather than integer-encode categoricals?
A: Integer codes impose a false ordering (Mumbai=3 is between Delhi=2 and Bangalore=4) and a false distance. Linear and distance-based models will use both. One-hot gives each category its own free weight — no ordering, no distance assumption.
Common wrong answer to avoid: "trees handle integer codes fine, so it does not matter." Trees may survive it because they split on thresholds, but logistic regression, KNN, and neural nets can be silently corrupted by those fake distances.

Q: When is target encoding dangerous?
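A: When you compute it on the full dataset before splitting train/val, or whenever a row's own target feeds its own encoding. The target leaks into the feature: validation accuracy looks fantastic, production accuracy collapses. The fix is out-of-fold encoding. Split the training data into folds, encode each fold with means computed from the other folds, and encode validation and test rows with means from the training data only. Smoothing toward the overall mean helps stabilize rare categories.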
Common wrong answer to avoid: "as long as I do a global mean it's fine." Globalness is exactly the leak. Out-of-fold discipline is the fix.
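
A minimal out-of-fold sketch in plain Python. Everything here is an assumption made for illustration: the fold assignment (row index modulo the number of folds), the smoothing formula, the `strength` parameter, and the toy zip codes. The property to notice is that no row's encoding depends on that row's own target.

```python
from collections import defaultdict

def smoothed_means(keys, targets, prior, strength):
    """Per-category mean, pulled toward the prior when the category is rare."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, target in zip(keys, targets):
        sums[key] += target
        counts[key] += 1
    return {k: (sums[k] + prior * strength) / (counts[k] + strength) for k in sums}

def out_of_fold_encode(keys, targets, n_folds=3, strength=5.0):
    """Encode each row using only rows from the other folds."""
    n = len(keys)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        prior = sum(targets[i] for i in rest) / len(rest)
        means = smoothed_means(
            [keys[i] for i in rest], [targets[i] for i in rest], prior, strength
        )
        for i in held_out:
            # Categories unseen in the other folds fall back to the prior.
            encoded[i] = means.get(keys[i], prior)
    return encoded

zips = ["z1", "z1", "z2", "z2", "z1", "z2"]  # toy categories
defaults = [1, 0, 0, 0, 1, 0]                # toy binary target
encoded = out_of_fold_encode(zips, defaults)

# Flip row 0's own target: its encoding must not move.
flipped = [0] + defaults[1:]
encoded_flipped = out_of_fold_encode(zips, flipped)
print(encoded[0] == encoded_flipped[0])  # True
```

Validation and test rows would be encoded with `smoothed_means` fit on the whole training set, never on themselves.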

Q: When should you standardize features?
A: Always for distance-based models like KNN and SVM, and for gradient-based models like linear regression, logistic regression, and neural nets. It is not needed for tree-based models because threshold splits are invariant to monotonic transforms. When in doubt, standardize unless you know you are in tree-land.
Common wrong answer to avoid: "Always standardize everything." Trees do not need it, so scaling there mostly wastes effort and makes split thresholds harder to read in the original units.

Q: Is feature engineering still relevant in the deep learning era?
A: Yes — strongly so for tabular data. Trees and gradient boosting still dominate many structured-data problems, and they live on engineered features. For images, audio, and text, neural nets learn more of their own representation, but even there you still engineer data pipelines, augmentations, and tokenization choices.
Common wrong answer to avoid: "deep learning removed the need for feature engineering." It moved the work in some domains, but it absolutely did not erase it.

Q: When should you add an interaction term explicitly vs. let a tree learn it?
A: If the interaction is known from domain knowledge (sqft × premium_neighbourhood, amount × is_new_merchant), add it explicitly even for trees — it makes splits shallower and models more sample-efficient. If you have lots of data and no priors, trees or neural nets can discover it themselves. The small-data, strong-prior case is where explicit interactions pay the most.
Common wrong answer to avoid: "never add interactions because trees will find them anyway." Trees can learn them, yes, but explicit features often make the job easier and more data-efficient.


Apply now (5 min)

Take this tiny dataset of 8 days. Predict bike_rentals from (temp_celsius, day_of_week, is_holiday).

temp   dow   holiday   rentals
30     Sat   0         800
30     Mon   0         400
15     Sat   0         350
15     Mon   0         200
28     Sun   1         850
10     Tue   0         120
22     Fri   0         350
22     Wed   1         600

By hand, write down the feature list you would feed a linear model. At minimum:

  1. temp_celsius (raw) — or temp − 20 to center it.
  2. is_weekend (one binary column).
  3. is_holiday (already there).
  4. Interaction: temp × is_weekend — captures that warm + weekend amplifies rentals.
  5. temp² — captures that very cold and very hot both reduce rentals.

Then sketch — without coding — what shape the model can now draw vs. what it could draw on raw (temp, dow, holiday) alone. Feel where each new column buys you a new shape. That feel is the whole game.
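
After doing it by hand, one possible way to build the same five columns in code. This is a sketch of the feature list above, not a fitted model. The column order and the helper name `build_features` are illustrative choices, and raw temp is used rather than the centered version.

```python
# (temp_celsius, day_of_week, is_holiday, rentals) for the 8 days above.
ROWS = [
    (30, "Sat", 0, 800),
    (30, "Mon", 0, 400),
    (15, "Sat", 0, 350),
    (15, "Mon", 0, 200),
    (28, "Sun", 1, 850),
    (10, "Tue", 0, 120),
    (22, "Fri", 0, 350),
    (22, "Wed", 1, 600),
]
WEEKEND = {"Sat", "Sun"}

def build_features(temp, dow, holiday):
    """[temp, is_weekend, is_holiday, temp x is_weekend, temp squared]"""
    is_weekend = int(dow in WEEKEND)
    return [temp, is_weekend, holiday, temp * is_weekend, temp ** 2]

X = [build_features(temp, dow, holiday) for temp, dow, holiday, _ in ROWS]
y = [rentals for *_, rentals in ROWS]

for features, rentals in zip(X, y):
    print(features, rentals)
```

Compare the Sat and Mon rows at 30 degrees: the raw temp column is identical, and only is_weekend and temp × is_weekend differ. Those two columns are where the model can place the weekend lift.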


Bridge. Feature engineering enriches the feature list so a line works. But what if you just let the model itself draw step-shaped boundaries — no engineered crosses needed? That is the next file: 09-decision-trees.md.