Skip to content

15. Support Vector Machines — the widest street between two document classes

Seven minutes. One street. One rope. Why the boundary with the most breathing room generalizes best.

Built on the ELI5 in 00-eli5.md. The feature list becomes a point in space, the prediction is which side of the street the document lands on, and the confidence score is usually distance from the street.


The picture before math

See. Two document classes sit on a 2D chart — one left-bottom, one right-top. Many straight lines can separate them.

SVM asks a stricter question: not "which line separates?" but "which line leaves the widest empty street between the two groups?"

class A documents          class B documents
x  x  x                    o  o  o
x  x                       o  o
-------------?-------------   many lines can separate
======== widest street ========
----------- center line -------
That center line is the decision boundary. The street edges are the margins. The points touching the curb are the support vectors.

If the street is wide, small TF-IDF noise will not flip the class label. If the street is narrow, one tiny word-count shift can cross the road.

The feature list becomes a point in space. The class label is which side of the street the point lands on. The confidence score is usually distance from the street, though raw SVM scores are not calibrated probabilities.

Why the widest street helps

Imagine two models. Model A draws the boundary hugging one class. Model B draws it in the middle with room on both sides.

A noisy word count arrives tomorrow. Which model breaks first? Model A.

That is overfitting. It memorized the exact training documents. SVM fights this by preferring margin. Wide margin means robustness. Wide margin means better generalization.

Support vectors — the few documents that matter

Most points do not matter directly. Only the closest points matter. Those are the support vectors.

far class A    far class A        far class B
x                    x                    o
        x   | margin |  boundary | margin |   o
            |        |           |        |
          support                         support
          vector                          vector
Move a far-away point slightly. The street hardly changes. Move a support vector. The whole street shifts.

So in the document picture, SVM is the model saying: "Ignore the easy documents. Show me the borderline cases. Those decide the rule."

The formula, lightly

For a linear SVM, the boundary is:

w · x + b = 0
x is the feature list. w is the direction perpendicular to the street. b shifts the street. Prediction is the sign:
if w · x + b > 0  → class B
if w · x + b < 0  → class A
The margin width is:
margin width = 2 / ||w||
So maximizing margin means minimizing ||w|| while still classifying correctly. Big picture first. Small ||w|| means a gentler boundary. A gentler boundary means a wider street.


Worked example — a 2D linear SVM by hand

Take six points. Negative class: - x1 = (1, 1) - x2 = (1, 2) - x3 = (2, 2) Positive class: - o1 = (4, 2) - o2 = (3, 3) - o3 = (4, 3) Now try this boundary:

x1 + x2 - 5 = 0
So: - w = (1, 1) - b = -5 - ||w|| = √(1² + 1²) = √2

Step 1 — check the two margin lines

For canonical SVM scaling, support vectors satisfy:

w · x + b = +1   or   -1
So the two curb lines are:
x1 + x2 - 5 = +1   →   x1 + x2 = 6
x1 + x2 - 5 = -1   →   x1 + x2 = 4
Now check points. Negative points: - (1,1) gives 2 - 5 = -3 - (1,2) gives 3 - 5 = -2 - (2,2) gives 4 - 5 = -1 ← support vector Positive points: - (4,2) gives 6 - 5 = +1 ← support vector - (3,3) gives 6 - 5 = +1 ← support vector - (4,3) gives 7 - 5 = +2 Good. The borderline documents are (2,2), (4,2), and (3,3). They touch the curbs.

Step 2 — compute the distance from one curb to the street

Distance from a point x to the boundary is:

distance = |w · x + b| / ||w||
Take support vector (4,2).
distance = |1| / √2 = 1/√2 ≈ 0.707
Same for (2,2). So full street width is:
2 × 0.707 = 1.414 = √2
That is the margin width. So what happened? SVM found the line halfway between the two closest class fronts. Not the average of all points. Not the prettiest line. The widest safe street.


Soft margin — because real data is messy

Clean separation is rare. Real corpora have overlap. A few documents from different classes sit mixed together.

So what to do? Allow some margin violations. That is the soft-margin SVM. The main knob is C. - Large C → punish mistakes hard. Narrower street. Lower training error. More overfitting risk. - Small C → allow some mistakes. Wider street. More regularization. Better against noise.

Think of C as how strict the classifier is. High C says every training document must be respected. Low C says some weird documents may be noise.

Hinge loss — why only support vectors matter

SVM minimizes hinge loss:

max(0, 1 - y·f(x))
Points inside the margin or on the wrong side pay a penalty. Points comfortably outside the margin pay nothing.

That is why only support vectors matter. The rest of the documents have zero loss and stop pulling on the boundary.


Linear SVM vs kernel SVM

Linear SVM draws a straight street. What if the data needs a curved road? Then we use a kernel.

Linear SVM

Best when the space is already good. This often happens in high-dimensional sparse text. A TF-IDF feature list is huge but often linearly separable enough.

Fast. Simple. Strong baseline.

Kernel SVM

Kernel SVM says: "Do not draw the curve directly. Measure similarity as if the points were mapped into a richer space." That is the kernel trick. We behave as if features were expanded, without explicitly building all of them.

original space           richer hidden space
ring around center   →   data becomes linearly separable there
XOR pattern          →   curved separation becomes straight
Common kernels: - RBF kernel. Local bubbles around points. Flexible. Most common nonlinear choice. - Polynomial kernel. Adds curved interactions like squared or cubic terms. So the mental picture is this. Linear SVM builds one straight street. Kernel SVM bends the city map first, then builds a straight street there.


The two knobs interviewers love — C and gamma

For RBF SVM, gamma matters. gamma controls how local each point's influence is. - Small gamma → each point influences a wide area. Smooth boundary. - Large gamma → each point influences only a tiny neighborhood. Wiggly boundary.

small gamma   →  smooth hill   → underfitting risk
large gamma   →  spiky hills   → overfitting risk
So usually: C controls how hard you fit training mistakes. gamma controls how wiggly the kernel boundary becomes. If both are large, danger. You can memorize.

Multi-class SVM — stitch binary streets together

SVM is natively binary. For K classes, use one-vs-rest (K models) or one-vs-one (K(K-1)/2 models).

In sklearn, LinearSVC uses OVR by default, while kernel SVC trains OVO internally. For large K, OVO can be faster because each sub-model trains on less data.

So if you classify Reuters topics or product categories, remember: one street per binary contrast, then combine the votes.


When SVM wins

SVM is excellent in some regimes. - Small datasets. Hundreds to low thousands of rows. - High-dimensional features. Text, bag-of-words, gene expression. - Binary classification. The native SVM setup. - Clear margin between classes. The street picture fits well.


When SVM loses

  • Large datasets. Training often scales around O(n²) to O(n³) for kernel SVMs.
  • Many classes. Native SVM is binary, so multi-class needs stitched-together binary models.
  • Need calibrated probabilities. Raw SVM scores are distances, not true probabilities.
  • Unscaled features. A feature with huge units dominates the street.
  • Messy nonlinear problems at scale. Gradient boosting or deep learning often wins. So if you have 10 million rows, do not reach for kernel SVM first.

Where this lives in the wild

  • ABBYY OCR pipelines. Engineered character-shape features plus SVM-style margin classifiers were long-time workhorses for character recognition before end-to-end deep learning became dominant.
  • Bloomberg document tagging. Sparse word features plus linear SVM remain strong for fast binary routing tasks where labels are expensive and explanations can be tied back to top-weighted terms.
  • Qualcomm sensor classification. Small on-device accelerometer and gyroscope datasets often favor linear or RBF SVMs because the models are compact and strong on few-shot problems.
  • Siemens industrial quality checks. Hand-engineered vibration or acoustic features with SVMs work well for defect-vs-no-defect screening when labeled failures are rare.
  • Reuters text categorization benchmark. Sparse bag-of-words features plus linear SVM became the classic strong baseline for topic classification because the margin works beautifully in high-dimensional text.

Pause and recall. Without scrolling: (a) what are support vectors and why do only they matter? (b) what does C control? (c) sketch the kernel trick — same data in 2D, then lifted to 3D where a plane separates. If any link is fuzzy, scroll back.


Interview Q&A

Q: Why do only support vectors matter?
A: Because the optimal boundary is pinned by the closest points to the margin. Far-away points already satisfy the constraints with room to spare, so moving them slightly does not change the widest street. The geometry is decided at the curb, not in the middle of the parking lot.
Common wrong answer to avoid: "All points matter equally because SVM uses the full dataset." It trains on the full dataset, yes, but the final boundary is determined mainly by the support vectors. Q: When would you choose linear SVM over logistic regression?
A: When the task is binary, the features are high-dimensional and sparse, the dataset is not huge, and margin matters more than calibrated probability. Linear SVM often wins on text because it focuses on separation, not probability fitting. Logistic regression is better when you need well-behaved probabilities and easier interpretation.
Common wrong answer to avoid: "SVM is always more accurate." Not true. On many tabular tasks, logistic regression or boosting matches or beats it. Q: What does the kernel trick actually do?
A: It computes inner products as if the data were mapped into a richer feature space, without explicitly building that huge space. That lets a linear separator in the hidden space become a nonlinear separator in the original space. So the city map bends, but the street is still straight in the lifted world.
Common wrong answer to avoid: "Kernel SVM just draws curves directly." That misses the key idea of implicit feature mapping. Q: Why is SVM weak for very large datasets?
A: Because kernel SVM training depends on pairwise relationships between examples, which becomes expensive in memory and time as n grows. Roughly, you pay quadratic to cubic cost. On millions of rows, linear models or tree ensembles are usually much more practical.
Common wrong answer to avoid: "because SVM is old." Age is not the issue. Pairwise scaling and memory cost are.


Apply now (5 min)

Take a page. Draw the six points from the worked example. Then do three things. 1. Draw the boundary x1 + x2 - 5 = 0. 2. Draw the two margin lines x1 + x2 = 4 and x1 + x2 = 6. 3. Circle the support vectors. Now compute from memory: - ||w|| for w = (1,1) - distance from a support vector to the street - full margin width Then — without looking — sketch from memory: 1. The widest-street picture. 2. One sentence on why support vectors matter. 3. One sentence on C vs gamma. If you can redraw all three in 90 seconds, you own SVM.


Bridge. SVM finds the widest street. But sometimes you do not need a street at all — just ask the neighbors. Read 16-knn.md next.