Support Vector Machines: Margins, KKT, and Slack Variables

Why Maximum Margin?

The margin is the distance from the decision boundary to the nearest training points (the support vectors). Maximising it creates the most separation between the boundary and the closest examples.

A boundary close to a training point is fragile. Small variation in a test point can flip it across the boundary and cause misclassification. A large margin is more robust.

The Decision Boundary

w^T x + b = 0

Support vectors lie exactly on the margin boundaries:

$w^T x_n + b = +1 \quad \text{(positive support vectors)}$ $w^T x_n + b = -1 \quad \text{(negative support vectors)}$

The constraint for all training points:

y_n(w^T x_n + b) \geq 1 \quad \forall n

Margin Width

The distance from a support vector to the hyperplane:

d = \frac{|w^T x_n + b|}{\|w\|} = \frac{1}{\|w\|}

The total margin (gap between positive and negative support vectors):

\text{margin} = \frac{2}{\|w\|}

Maximising the margin = minimising $\|w\|^2$ subject to the constraints.

KKT Conditions

The constrained optimisation is solved via the Lagrangian dual. The KKT conditions give:

w = \sum_n \alpha_n y_n x_n

\sum_n \alpha_n y_n = 0

The dual objective to maximise:

\sum_n \alpha_n - \frac{1}{2}\sum_n\sum_m \alpha_n \alpha_m y_n y_m x_n^T x_m

subject to $\alpha_n \geq 0$ .

KKT complementarity: $\alpha_n [y_n(w^T x_n + b) - 1] = 0$

This means $\alpha_n > 0$ only for support vectors, which are points exactly on the margin. All other training points have $\alpha_n = 0$ and contribute nothing to the solution. The decision boundary is determined entirely by a small subset of the data.

Soft Margin: Slack Variables

When classes overlap, the hard margin SVM has no feasible solution. Introduce a slack variable $\xi_n \geq 0$ for each training point:

Zero slack $\xi_n = 0$ : point on the correct side of the margin
Inside margin $0 < \xi_n \leq 1$ : point inside the margin, correctly classified
Misclassified $\xi_n > 1$ : point misclassified

Modified constraint:

y_n(w^T x_n + b) \geq 1 - \xi_n

The new KKT condition bounds $\alpha_n$ :

0 \leq \alpha_n \leq C

The parameter $C$ controls the trade-off:

Large $C$ → penalise misclassification heavily → narrow margin, fewer errors on training data
Small $C$ → allow more misclassification → wider margin, better generalisation

Note: SVMs remain sensitive to outliers even with slack variables. A single misplaced outlier can significantly shift the margin.

What You Can Do Now

The code below trains a linear SVM with sklearn, then extracts the support vectors, margin width, and decision boundary weights to verify the geometry directly.

import numpy as np
from sklearn.svm import SVC

np.random.seed(2)

# Two linearly separable clusters
X_pos = np.random.randn(15, 2) + np.array([2, 2])
X_neg = np.random.randn(15, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*15 + [-1]*15)

# Train hard-margin SVM (C very large = hard margin)
svm = SVC(kernel='linear', C=1e6)
svm.fit(X, y)

w = svm.coef_[0]           # decision boundary normal vector
b = svm.intercept_[0]
margin = 2 / np.linalg.norm(w)

print(f"Weight vector w:    {w}")
print(f"Bias b:             {b:.4f}")
print(f"Margin width:       {margin:.4f}")
print(f"Support vectors:    {len(svm.support_vectors_)} points")
print(f"Support vector indices: {svm.support_}")

# Verify: support vectors should satisfy |w^T x + b| ≈ 1
for sv in svm.support_vectors_:
    score = abs(w @ sv + b)
    print(f"  |w^T x + b| = {score:.4f}  (should be ~1.0)")

# Soft-margin comparison
svm_soft = SVC(kernel='linear', C=0.1)
svm_soft.fit(X, y)
print(f"\nSoft margin (C=0.1) width: {2/np.linalg.norm(svm_soft.coef_[0]):.4f}")
print(f"Soft margin support vectors: {len(svm_soft.support_vectors_)}")

Reduce C from 1e6 to 0.1 to see the soft-margin SVM widen its margin at the cost of more support vectors. Add a misplaced outlier to X_pos near the negative cluster to observe how much the hard-margin solution shifts.