SVMs: The Kernel Trick and Dual Representation

The Kernel Trick

A kernel function $k(x, x')$ computes the inner product between two points in a (potentially infinite-dimensional) feature space $\phi(x)$ without explicitly computing the feature map:

k(x, x') = \langle \phi(x), \phi(x') \rangle

This allows SVMs to find non-linear decision boundaries by implicitly working in a high-dimensional space while keeping the computation tractable.

Valid Kernels

A function $k(x, x')$ is a valid kernel if it corresponds to an inner product in some feature space. Equivalently, the Gram matrix (kernel matrix) must be positive semidefinite.

Given valid kernels $k_1$ and $k_2$ , the following are also valid kernels (Shawe-Taylor and Christianini, 2004):

$k(x, x') = c \cdot k_1(x, x')$ for any constant $c > 0$
$k(x, x') = f(x) \cdot k_1(x, x') \cdot f(x')$ for any function $f$
$k(x, x') = q(k_1(x, x'))$ where $q$ is a polynomial with non-negative coefficients
$k(x, x') = \exp(k_1(x, x'))$
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = k_1(x, x') \cdot k_2(x, x')$
$k(x, x') = k_a(x_a, x_a') \cdot k_b(x_b, x_b')$ (on different input subspaces)

The linear kernel $k(x, x') = x^T x'$ is the simplest valid kernel.

Common Kernels

Gaussian (RBF) kernel:

k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)

This corresponds to an infinite-dimensional feature space. Points that are close in input space receive high similarity. The bandwidth $\sigma$ controls the width of the kernel.

Polynomial kernel:

k(x, x') = (x^T x' + c)^d

Degree $d$ controls the order of the polynomial feature interactions.

The Dual Representation

The primal SVM optimisation is over the weight vector $w$ . The dual reformulation expresses the solution in terms of the training data directly:

The optimal decision function becomes:

f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b

where $\alpha_i \geq 0$ are the dual variables. Only support vectors have $\alpha_i > 0$ . All other training points do not contribute to the decision boundary.

This is the representer theorem: the optimal solution lies in the span of the kernel functions evaluated at training points.

The Optimal Hyperplane

The optimal hyperplane maximises the margin between the two classes. The margin is $2 / \|w\|$ .

The distance from the hyperplane to the support vector in the direction of the weight vector is $1 / \|w\|$ . For a point $x$ :

\text{distance} = \frac{w^T x + b}{\|w\|}

As the kernel bandwidth $\sigma^2 \to 0$ (in the Gaussian kernel), the hyperplane becomes increasingly determined by the nearest data point, and more distant points contribute negligibly. As $\sigma^2 \to \infty$ , all points contribute equally and the kernel approximates a linear kernel.

Dual Space Intuition

The dual vector space $V^*$ mirrors the original vector space $V$ . The dual representation is complementary: properties that are extreme in the primal (such as large weights) correspond to properties that are constrained in the dual (such as bounded $\alpha$ values). The dual is often isomorphic to the original vector space in terms of structural properties.

Working in the dual is what makes kernel methods computationally feasible: instead of explicitly computing $\phi(x)$ (which may be infinite-dimensional), we only compute pairwise kernel evaluations $k(x_i, x_j)$ .