Contents
  1. The Kernel Trick
  2. Valid Kernels
  3. Common Kernels
  4. The Dual Representation
  5. The Optimal Hyperplane
  6. Dual Space Intuition
← All posts

SVMs: The Kernel Trick and Dual Representation

Kernels allow SVMs to operate in high-dimensional feature spaces without explicitly computing the transformation. The dual representation reformulates the optimisation in terms of pairwise similarities between training points.

The Kernel Trick

A kernel function k(x,x)k(x, x') computes the inner product between two points in a (potentially infinite-dimensional) feature space ϕ(x)\phi(x) without explicitly computing the feature map:

k(x,x)=ϕ(x),ϕ(x)k(x, x') = \langle \phi(x), \phi(x') \rangle

This allows SVMs to find non-linear decision boundaries by implicitly working in a high-dimensional space while keeping the computation tractable.

Valid Kernels

A function k(x,x)k(x, x') is a valid kernel if it corresponds to an inner product in some feature space. Equivalently, the Gram matrix (kernel matrix) must be positive semidefinite.

Given valid kernels k1k_1 and k2k_2, the following are also valid kernels (Shawe-Taylor and Christianini, 2004):

  • k(x,x)=ck1(x,x)k(x, x') = c \cdot k_1(x, x') for any constant c>0c > 0
  • k(x,x)=f(x)k1(x,x)f(x)k(x, x') = f(x) \cdot k_1(x, x') \cdot f(x') for any function ff
  • k(x,x)=q(k1(x,x))k(x, x') = q(k_1(x, x')) where qq is a polynomial with non-negative coefficients
  • k(x,x)=exp(k1(x,x))k(x, x') = \exp(k_1(x, x'))
  • k(x,x)=k1(x,x)+k2(x,x)k(x, x') = k_1(x, x') + k_2(x, x')
  • k(x,x)=k1(x,x)k2(x,x)k(x, x') = k_1(x, x') \cdot k_2(x, x')
  • k(x,x)=ka(xa,xa)kb(xb,xb)k(x, x') = k_a(x_a, x_a') \cdot k_b(x_b, x_b') (on different input subspaces)

The linear kernel k(x,x)=xTxk(x, x') = x^T x' is the simplest valid kernel.

Common Kernels

Gaussian (RBF) kernel:

k(x,x)=exp ⁣(xx22σ2)k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)

This corresponds to an infinite-dimensional feature space. Points that are close in input space receive high similarity. The bandwidth σ\sigma controls the width of the kernel.

Polynomial kernel:

k(x,x)=(xTx+c)dk(x, x') = (x^T x' + c)^d

Degree dd controls the order of the polynomial feature interactions.

The Dual Representation

The primal SVM optimisation is over the weight vector ww. The dual reformulation expresses the solution in terms of the training data directly:

The optimal decision function becomes:

f(x)=i=1Nαiyik(xi,x)+bf(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b

where αi0\alpha_i \geq 0 are the dual variables. Only support vectors have αi>0\alpha_i > 0. All other training points do not contribute to the decision boundary.

This is the representer theorem: the optimal solution lies in the span of the kernel functions evaluated at training points.

The Optimal Hyperplane

The optimal hyperplane maximises the margin between the two classes. The margin is 2/w2 / \|w\|.

The distance from the hyperplane to the support vector in the direction of the weight vector is 1/w1 / \|w\|. For a point xx:

distance=wTx+bw\text{distance} = \frac{w^T x + b}{\|w\|}

As the kernel bandwidth σ20\sigma^2 \to 0 (in the Gaussian kernel), the hyperplane becomes increasingly determined by the nearest data point, and more distant points contribute negligibly. As σ2\sigma^2 \to \infty, all points contribute equally and the kernel approximates a linear kernel.

Dual Space Intuition

The dual vector space VV^* mirrors the original vector space VV. The dual representation is complementary: properties that are extreme in the primal (such as large weights) correspond to properties that are constrained in the dual (such as bounded α\alpha values). The dual is often isomorphic to the original vector space in terms of structural properties.

Working in the dual is what makes kernel methods computationally feasible: instead of explicitly computing ϕ(x)\phi(x) (which may be infinite-dimensional), we only compute pairwise kernel evaluations k(xi,xj)k(x_i, x_j).

← All posts