01. The XOR problem - why one straight rule cannot learn feature interaction¶

~18 min read. XOR is the smallest dataset where "train longer" cannot fix the model, because the model class is wrong.

Builds on 00-eli5.md. The rule pile starts as weighted sums, the bend has not appeared yet, and this chapter shows why a straight rule cannot express a diagonal interaction.

What the cat robot knew before the first geometry failure¶

The overview gave us a simple cat-robot picture: inputs enter a rule pile, each rule combines numbers with weights, and the model emits a decision. That picture works for easy rules. If one feature alone is enough, a single weighted sum can separate the yes cases from the no cases.

What still breaks is interaction. A cat is not "high value in pixel 193" or "high value in pixel 194." Edges, corners, eyes, fur texture, and object parts are relationships between features. Classical tabular examples have the same shape: high amount alone may be safe, new merchant alone may be safe, but high amount with new merchant may be risky.

This chapter uses XOR because it is the smallest possible version of that pressure. Four points. Two inputs. One target. No noise. No optimizer mystery. If a perceptron fails here, the failure is structural.

What this file solves¶

A single perceptron can solve AND and OR, then fails on XOR even after more training. This file shows why the failure is not learning rate, initialization, or data size. The artifact to inspect is the four-point plane: the positive examples sit on opposite corners, so one straight decision boundary cannot isolate them.

The visible failure: training stalls on four perfect rows¶

Here is the whole dataset.

x1	x2	XOR target
0	0	0
0	1	1
1	0	1
1	1	0

The target is 1 when exactly one input is 1. It is 0 when both inputs match.

A single perceptron computes:

score = w1*x1 + w2*x2 + b
prediction = 1 if score >= 0 else 0

You train it. Loss improves a little and then stalls. You try a different seed. Same. You lower the learning rate. Same. You train longer. Same. The best hard classifier keeps getting at least one corner wrong, usually two.

The root cause is not that the optimizer failed to search long enough. The root cause is that the perceptron can only draw one straight cut through the input space. XOR needs two separated positive regions. A single line cannot make two islands.

The four points expose the impossibility¶

Plot the examples.

x2
^
|        target 1        target 0
|        (0,1) o---------x (1,1)
|
|
|        target 0        target 1
|        (0,0) x---------o (1,0)
+----------------------------------> x1
         0                           1

o means the model should output 1. x means it should output 0.

The positive examples are diagonal from each other. The negative examples are also diagonal from each other. One straight line cuts the plane into two half-planes. It can put a top half against a bottom half, or a left half against a right half, or one diagonal half against another. It cannot put two opposite corners on the same side while keeping the other two opposite corners on the other side.

Rule: a linear decision boundary can separate one half-space, not diagonal islands¶

Teacher voice. XOR is the first proof that capacity is not just "more weights." The model needs a non-linear feature or a non-linear layer, because the target depends on a relationship between inputs, not a weighted sum of inputs.

1) A perceptron is a line before it is an algorithm¶

The formula looks like a model:

w1*x1 + w2*x2 + b

Geometrically, the decision boundary is the set of points where the score equals zero:

w1*x1 + w2*x2 + b = 0

In two dimensions, that is a line. On one side of the line, the score is positive. On the other side, the score is negative.

That is the whole power and the whole limit.

score >= 0 side       decision boundary        score < 0 side
        yes      -------------------------->        no

The perceptron is useful because many tasks have a rough one-line separation. OR works: only (0,0) is negative, and a diagonal line can isolate it. AND works: only (1,1) is positive, and a diagonal line can isolate it. XOR fails because the positives are not one cluster.

2) Three plausible lines fail for three different reasons¶

The failure is easier to trust when you try the obvious lines.

Attempt A - split by x2¶

Use a horizontal boundary:

score = x2 - 0.5

Point	Score	Prediction	Target	Result
(0,0)	-0.5	0	0	correct
(1,0)	-0.5	0	1	wrong
(0,1)	0.5	1	1	correct
(1,1)	0.5	1	0	wrong

This line says "top row is positive." XOR says one top-row point is positive and the other is negative.

Attempt B - split by x1¶

Use a vertical boundary:

score = x1 - 0.5

Point	Score	Prediction	Target	Result
(0,0)	-0.5	0	0	correct
(0,1)	-0.5	0	1	wrong
(1,0)	0.5	1	1	correct
(1,1)	0.5	1	0	wrong

This line says "right column is positive." XOR says one right-column point is positive and the other is negative.

Attempt C - split by total count¶

Use a diagonal boundary:

score = x1 + x2 - 0.5

Point	Score	Prediction	Target	Result
(0,0)	-0.5	0	0	correct
(1,0)	0.5	1	1	correct
(0,1)	0.5	1	1	correct
(1,1)	1.5	1	0	wrong

This almost works. It detects "at least one input is on." That is OR, not XOR. The missing idea is "but not both." A single weighted sum can easily reward x1 and x2 individually. It cannot also subtract the special case where both are present unless that interaction exists as a feature or hidden unit.

3) The impossibility proof in four inequalities¶

Assume a perceptron solves XOR. Then the four rows require:

(0,0) -> 0: b < 0
(1,0) -> 1: w1 + b > 0
(0,1) -> 1: w2 + b > 0
(1,1) -> 0: w1 + w2 + b < 0

From the two positive rows:

w1 + b > 0
w2 + b > 0

Add them:

w1 + w2 + 2b > 0

Because (0,0) must be negative, b < 0. Subtracting a negative number makes the left side larger:

w1 + w2 + b > 0

But (1,1) requires:

w1 + w2 + b < 0

Contradiction. No weights exist.

Mini-FAQ. "Does the exact threshold convention matter?" No. Whether you use > 0, >= 0, or a sigmoid with threshold 0.5, the boundary is still linear. Boundary-edge ties can change one corner, not the geometry.

4) Why a product feature fixes what the line could not see¶

The plausible alternative to a hidden layer is feature engineering. Add one new feature:

x3 = x1 * x2

Now the model can represent:

score = x1 + x2 - 2*x3 - 0.5
prediction = 1 if score >= 0 else 0

Check it:

x1	x2	x1*x2	Score	Prediction	XOR
0	0	0	-0.5	0	0
1	0	0	0.5	1	1
0	1	0	0.5	1	1
1	1	1	-0.5	0	0

The product feature gives the model a name for "both inputs are on." Once that interaction is visible, a linear model can subtract it.

This is why the statement "linear models cannot solve XOR" needs precision. A linear model cannot solve XOR in the raw feature space (x1, x2). It can solve XOR if you hand it the interaction feature (x1*x2).

The hidden-layer promise is that the model can learn useful interaction features instead of waiting for a human to enumerate all of them.

5) The hidden layer turns one hard shape into two easy detectors¶

A small network can solve XOR by building two intermediate detectors:

h1 = OR(x1, x2)       detects at least one input
h2 = AND(x1, x2)      detects both inputs
out = h1 - h2         keeps one-on, removes both-on

Using threshold units, the idea is:

x1, x2
  |\
  | \
  |  +--> h1: x1 OR x2
  |
  +----> h2: x1 AND x2

output: h1 AND NOT h2

With ReLU-style units, one concrete construction is:

h1 = ReLU(x1 + x2)
h2 = ReLU(x1 + x2 - 1)
y  = h1 - 2*h2

x1	x2	h1	h2	y	thresholded output
0	0	0	0	0	0
1	0	1	0	1	1
0	1	1	0	1	1
1	1	2	1	0	0

The exact numbers are less important than the move. The hidden units create intermediate features. The output layer combines those features linearly. The bend keeps the two layers from collapsing into one line.

6) Why stacking linear layers without bends still fails¶

A common wrong repair is "add more layers." More layers help only if something non-linear happens between them.

Two linear layers are still one linear layer:

layer1 = W1*x + b1
layer2 = W2*layer1 + b2

layer2 = W2*(W1*x + b1) + b2
       = (W2*W1)*x + (W2*b1 + b2)

The stacked expression is still:

some_matrix*x + some_bias

That is the rule pile collapsing. You can stack ten linear layers and still get one effective linear transformation. Depth without non-linearity adds parameters, not shape.

Teacher voice. The bend is not decoration. It is the mechanism that prevents algebra from flattening the whole network back into one straight rule.

7) The tradeoff: feature engineering versus learned interactions¶

For XOR, hand-engineering x1*x2 is cheaper than training a neural network. In production, the choice depends on how many interactions exist and whether humans can name them.

Workload	Hand-engineered interaction	Hidden layer / non-linear model	Better default
Four-row XOR	Add `x1*x2`	Overkill	feature engineering
Fraud tabular data	Add known crosses like amount x merchant_age	Learns many interactions but needs monitoring	boosted trees or engineered features
Image edges	Hand-code filters possible but brittle	Convolutions learn local interactions	neural network
Text semantics	Manual crosses explode quickly	Embeddings and attention learn combinations	neural network
Small regulated scorecard	Human-readable crosses may be required	Harder to explain	linear + features

The pressure moves. Hand features move cost to human design and maintenance. Hidden layers move cost to data, training, monitoring, and interpretability. The right choice is not "deep is better." The right choice is whether the interaction space is small enough to name.

8) Signals that the model is missing interactions¶

Healthy behavior: residual errors look random after slicing by important feature pairs.
First degrading metric: pairwise slices fail even when one-feature slices look fine.
Misleading beginner metric: overall accuracy, because easy linear cases can hide interaction misses.
Expert graph: error heatmaps over two-feature grids, such as amount versus merchant age or user history versus item category.

The expert graph is the XOR picture at production scale. If the corners of a heatmap have opposite behavior and the model draws one smooth plane through them, you are seeing the same geometry.

9) Where XOR is the right mental model and where it is too small¶

XOR is a strong fit when the output depends on a combination that neither feature explains alone. It is useful for teaching interactions, feature crosses, hidden units, kernels, trees, and why non-linear activations matter.

It becomes pathological when treated as a full explanation of deep learning. Real networks do more than solve four-point logic. They learn distributed representations, reuse features across tasks, manage scale, optimize under noisy gradients, and run on hardware with memory and bandwidth constraints.

The scale limit is also obvious: XOR has no noise, no imbalance, no drift, no label ambiguity, and no train-test split problem. It isolates one constraint on purpose. Use it to understand representational limits, not to reason about every production failure.

10) Wrong mental model: a bigger linear model can eventually bend¶

The seductive intuition is that enough linear pieces should somehow add up to a curve.

They do not if no non-linearity separates them. Linear after linear is still linear. More weights rotate, scale, and shift the boundary. They do not create a corner, island, or hole.

Replace the mental model:

Old: "The model needs more parameters."
New: "The model needs a feature space where the target is separable."

Sometimes a human supplies that feature space with feature crosses. Sometimes a tree supplies it with splits. Sometimes a kernel supplies it implicitly. Sometimes a neural network learns it through hidden layers and activations.

11) Other failure shapes you will recognize¶

Feature cross failure: each feature is weak alone, but the pair is predictive.
Segment inversion: a rule helps one customer segment and hurts another.
Corner-case failure: the model handles common regions but fails on rare combinations.
Bag-of-words failure: words are harmless separately but risky together.
Pixel-composition failure: individual pixels mean little until arranged as an edge or texture.
Policy interaction failure: two safe permissions combine into an unsafe action.
Ranking interaction failure: user preference and item property matter jointly, not independently.
Modality fusion failure: text and image signals are useful only when compared.

12) Cross-topic reinforcement - the same pressure returns¶

Classical ML feature engineering uses product features and crosses to make linear models see interactions.
Activation functions are the neural-network answer: insert a bend so stacked rules can create new shapes.
Forward pass will make the hidden-feature construction concrete by computing intermediate activations.
Backpropagation later answers the next question: once a hidden layer can solve XOR, how do the weights find those detectors automatically?
Regularization will revisit the opposite risk: once the model can bend, it may bend too much.

13) Design-review questions that catch shallow explanations¶

What are the positive and negative regions in the raw feature space?
Can one line, plane, or hyperplane separate them?
If not, can a small engineered feature make them separable?
If hand features are not practical, which model family learns the interaction?
What slice or heatmap would reveal the interaction miss in production?

Where this shows up in production¶

Fraud detection: high amount and new merchant together can be risky even when each alone is ordinary.
Spam filtering: "free" and "unsubscribe" may be normal separately but suspicious in a specific template.
Search ranking: query intent and document property interact; neither score alone decides relevance.
Recommendation systems: user taste and item attribute become meaningful only as a pair.
Ad click prediction: creative, audience, placement, and time-of-day effects are interaction-heavy.
Credit risk: income, debt, employment stability, and recent inquiries can invert each other's meaning.
Medical triage: symptom combinations matter more than individual symptoms.
Security policy: two permissions that are safe separately can enable privilege escalation together.
Code analysis: one API call is safe unless paired with unchecked input or missing authorization.
Image recognition: edges and corners are local pixel interactions, not single-pixel facts.
Speech recognition: phoneme evidence depends on neighboring sounds and timing.
Vision-language models: the answer may depend on whether text agrees with image evidence.
A/B test analysis: treatment effect can flip by segment, hiding behind aggregate averages.
Pricing models: discount sensitivity can depend on customer tenure and current inventory.
Routing systems: latency risk may depend on region and provider together.
Agent tools: a tool call may be safe alone but unsafe after a prior state-changing action.
Anomaly detection: low-frequency signals may become important only when several occur together.
Data quality monitoring: a null feature may be harmless unless paired with a new category.

Recall - rebuild XOR from memory¶

What does a perceptron compute geometrically in two dimensions?
Where are the two positive XOR points on the plane?
Why do horizontal and vertical lines each get two rows wrong?
What contradiction appears in the four perceptron inequalities?
How does adding x1*x2 change the feature space?
Why do stacked linear layers collapse into one linear layer?
What production symptom suggests an XOR-shaped interaction miss?
Why is XOR useful but incomplete as a model of deep learning?

Interview Q&A¶

Q: Why can a single perceptron solve AND and OR but not XOR?

A: AND and OR are linearly separable in the raw (x1, x2) plane. A single line can isolate the one positive corner for AND or the one negative corner for OR. XOR has positives on opposite corners and negatives on opposite corners, so no one line can separate them.

Common wrong answer to avoid: "XOR is harder because it has more rows." It has the same four rows as AND and OR. The geometry is different.

Q: Why does adding more training time not solve XOR?

A: No valid weights exist for a single perceptron in the raw feature space. Training longer only searches the same impossible model class for longer. The fix is to change the representation or model family.

Common wrong answer to avoid: "The optimizer got stuck in a bad local minimum." For a single linear boundary, the representational limit is the main issue.

Q: Can a linear model ever solve XOR?

A: Yes, if the feature space includes the right interaction, such as x1*x2. Linear separability is a property of data relative to a feature representation, not a permanent property of the labels alone.

Common wrong answer to avoid: "Linear models can never solve XOR." They cannot solve raw XOR, but engineered features can make it linear.

Q: Why do neural networks need activation functions?

A: Without activation functions, stacked linear layers collapse algebraically into one linear layer. Activations insert non-linearity, so hidden layers can create intermediate features and decision regions that one line cannot express.

Common wrong answer to avoid: "Activations just keep values in a nice range." Some do, but their load-bearing role is breaking linear collapse.

Q: What is the smallest hidden-layer idea that solves XOR?

A: Create one hidden detector for "at least one input is on" and another for "both inputs are on." The output keeps the first and subtracts the second, producing exactly-one behavior.

Common wrong answer to avoid: "The network memorizes the four rows." A tiny network can solve XOR by representing the interaction, not merely storing rows.

Q: How would you detect XOR-shaped failure in a production model?

A: Slice errors by pairs or small groups of features. If each feature looks harmless alone but a combination has high error, a linear or additive model may be missing an interaction.

Common wrong answer to avoid: "Just look at global accuracy." Aggregate accuracy can hide the exact corner where the interaction matters.

Q: Does XOR prove deep networks are always better than feature engineering?

A: No. XOR proves a raw linear representation can be insufficient. If the interaction is known and small, feature engineering may be simpler and more interpretable. Deep models help when interactions are too many, subtle, or reusable for humans to enumerate.

Common wrong answer to avoid: "Deep learning exists because feature engineering is obsolete." Feature engineering is still a valid way to change representation.

Design/debug exercise (10 min)¶

First, reproduce the modeled example. Draw the four XOR points, label (0,1) and (1,0) as positive, and try a horizontal, vertical, and diagonal line. Write which row each line gets wrong.

Then do the production version. Pick a product you know and invent two binary features where each feature alone is not decisive but the combination flips the answer. Examples: new merchant and high amount, logged-out user and password reset, short video and thriller preference. Sketch the four corners and mark the risky corner or corners.

Finally, reproduce from memory:

the XOR truth table
one sentence explaining why no line works
one sentence explaining how x1*x2 or a hidden layer fixes it

Operational memory¶

XOR matters because it separates tuning failure from representation failure. When a model cannot express the target shape, better optimization cannot rescue it. The first diagnosis is not "train longer"; it is "what boundary can this model family draw?"

The move you learned is to inspect the feature space. If the positive and negative regions are interleaved, the model needs a new representation: engineered interaction features, trees, kernels, or hidden layers with non-linear activations. The bend exists because otherwise the rule pile collapses into one rule.

Remember:

A perceptron draws one linear boundary.
XOR places positive examples on opposite corners.
No line can put diagonal islands on one side.
A product feature like x1*x2 gives the interaction a name.
Hidden layers learn interaction features instead of requiring humans to list them.
Stacked linear layers without activations are still linear.
When pairwise slices fail, suspect an XOR-shaped interaction miss.

Bridge. We solved the first failure: one straight rule cannot represent XOR because the useful signal lives in an interaction. But stacking more straight rules still collapses into one straight rule. The next file introduces the bend - activation functions - and shows how ReLU, sigmoid, tanh, and GELU stop that collapse. -> 02-activation-functions.md