Kernels: The Representer Theorem and the Curse of Dimensionality

Kernels as Similarity Functions

A kernel function $k(x, x')$ is a similarity function between two inputs. A kernelised classifier computes a weighted sum of similarities between a new input and the training points:

f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x)

Instead of learning a weight vector in the input space, the classifier learns weights $\alpha_i$ over training examples. The kernel $k$ handles the nonlinear transformation implicitly.

This is the basis of kernel methods: learning algorithms that compose non-trivial representations through pairwise similarity, rather than explicit feature engineering.
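As a minimal sketch of the weighted-similarity prediction above, the following Python evaluates $f(x)$ for a toy training set. The Gaussian (RBF) similarity, its `gamma` parameter, the function names, and the hand-picked weights `alpha` are all assumptions made for illustration; in practice the weights would be learned.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) similarity; gamma is an assumed width parameter."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_predict(x, X_train, alpha, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a_i * kernel(x_i, x) for a_i, x_i in zip(alpha, X_train))

# Toy data: two training points with hand-picked (not learned) weights.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha = np.array([1.0, -1.0])

score_near_first = kernel_predict(np.array([0.1, 0.0]), X_train, alpha)
score_near_second = kernel_predict(np.array([1.0, 1.0]), X_train, alpha)
```

A query close to the first training point gets a positive score and one close to the second gets a negative score, because each weight is scaled by how similar the query is to its training point.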

The Representer Theorem

For a regularised loss minimisation problem:

\min_\theta \sum_{n=1}^{N} L(y_n, f(x_n)) + \frac{\lambda}{2}\|\theta\|^2

where $f(x) = \theta^T \phi(x)$ for a feature map $\phi$ and $\lambda > 0$, the representer theorem states that the optimal solution has the form:

\theta^* = \sum_{n=1}^{N} \alpha_n \phi(x_n)

The solution lies entirely in the span of the training data’s feature representations. As a result, predictions depend on $\phi$ only through inner products between feature vectors, so we never need to compute $\phi(x)$ explicitly. Substituting back:

f(x) = \sum_{n=1}^{N} \alpha_n \langle \phi(x_n), \phi(x) \rangle = \sum_{n=1}^{N} \alpha_n k(x_n, x)

The optimisation over $\theta$ (potentially infinite-dimensional) reduces to an optimisation over $\alpha \in \mathbb{R}^N$ (finite, one per training point).
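One way to check this reduction numerically is ridge regression, which is a sketch under two stated assumptions rather than the general case: the loss is taken to be $L(y, f) = \frac{1}{2}(y - f)^2$, and the feature map is the identity, $\phi(x) = x$, so that the primal solution can be computed directly and compared. Under those assumptions the dual coefficients have the closed form $\alpha = (K + \lambda I)^{-1} y$ with $K_{nm} = k(x_n, x_m)$. The helper name `fit_dual` and the random data are illustrative.

```python
import numpy as np

def fit_dual(K, y, lam):
    """Dual coefficients for squared loss: alpha = (K + lam * I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # N = 20 points, phi(x) = x in 3 dimensions
y = rng.normal(size=20)
lam = 0.5

# Primal: minimise 0.5 * ||y - X theta||^2 + 0.5 * lam * ||theta||^2 over theta.
theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Dual: optimise over alpha in R^N using only the Gram matrix K = X X^T.
K = X @ X.T
alpha = fit_dual(K, y, lam)

# Representer theorem: theta* = sum_n alpha_n phi(x_n).
theta_from_alpha = X.T @ alpha

x_new = rng.normal(size=3)
f_primal = theta @ x_new
f_dual = alpha @ (X @ x_new)   # sum_n alpha_n k(x_n, x_new)
```

The two routes give the same weight vector and the same prediction, but the dual route only ever touches inner products, which is what allows $\phi$ to be replaced by a kernel.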

The Gaussian Distribution and Maximum Entropy

The $d$-dimensional Gaussian density with mean $\mu$ and covariance $\Sigma$ is:

\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)

Among all distributions with a given mean and covariance, the Gaussian has the maximum (differential) entropy: it assumes the least beyond those two moments. This makes it a natural default choice for Parzen window density estimation and Gaussian process kernels.

For Parzen window estimation (kernel density estimation), the density at a new point is estimated as:

p(x) = \frac{1}{N} \sum_{n=1}^{N} k(x, x_n)

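where $k$ is a kernel function (often Gaussian) normalised so that it integrates to one over $x$. The kernel's bandwidth, for example the standard deviation $h$ in a Gaussian kernel $k(x, x_n) = \mathcal{N}(x | x_n, h^2 I)$, controls the smoothness of the estimate: a small $h$ gives a spiky estimate, a large $h$ an oversmoothed one.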
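The estimator can be sketched in a few lines for scalar data. The function name `gaussian_kde_1d`, the bandwidth symbol `h`, and the sample values below are assumptions chosen for illustration; the kernel is a Gaussian centred on each training point with standard deviation `h`.

```python
import numpy as np

def gaussian_kde_1d(x, samples, h):
    """Parzen estimate p(x) = (1/N) * sum_n N(x | x_n, h^2) for scalar data."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    z = (x[:, None] - samples[None, :]) / h            # shape (len(x), N)
    k = np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi))
    return k.mean(axis=1)                              # average over training points

samples = np.array([-1.0, 0.0, 0.2, 1.5])              # illustrative data
grid = np.linspace(-10.0, 10.0, 4001)
density = gaussian_kde_1d(grid, samples, h=0.5)

# Riemann-sum check that the estimate integrates to approximately one.
area = density.sum() * (grid[1] - grid[0])
```

Because each kernel is itself a normalised density, their average is one as well, which the numerical integral confirms.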

The Curse of Dimensionality

The curse of dimensionality refers to the phenomenon where the number of data points required to cover a space grows exponentially with dimensionality.

In $d$ dimensions:

The volume of a region with side length $r$ grows as $r^d$.
To maintain constant data density, the number of points must grow as $N \propto r^d$.
For example, a grid with 10 points per axis needs 10 points in $d = 1$ but $10^{10}$ points in $d = 10$, a factor of $10^9$ more.
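The arithmetic behind this growth can be made concrete with a regular grid. The choice of 10 points per axis and the helper name `grid_points` are assumptions for illustration.

```python
def grid_points(points_per_axis, d):
    """Number of points in a regular grid with the same resolution on every axis."""
    return points_per_axis ** d

# Point counts for a few dimensionalities at a fixed resolution of 10 per axis.
counts = {d: grid_points(10, d) for d in (1, 2, 3, 10)}
ratio_10_vs_1 = counts[10] // counts[1]
```

At this resolution the grid needs 10 points in one dimension and $10^{10}$ in ten, so `ratio_10_vs_1` is $10^9$.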

Consequences for machine learning:

Distance metrics become less meaningful: for many data distributions, the distances from a point to its nearest and farthest neighbours become nearly equal in high dimensions.
Kernel methods that rely on local similarity (Gaussian kernel) break down as the notion of “nearby” loses meaning.
Without further structural assumptions, models can need exponentially more data to generalise.
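The loss of contrast between near and far points can be observed in a small simulation. This is a sketch under specific assumptions: points are drawn uniformly from the unit hypercube, distances are Euclidean, and "relative contrast" is defined here as (farthest minus nearest) divided by nearest. Other distributions and metrics can behave differently.

```python
import numpy as np

def relative_contrast(d, n_points=500, seed=0):
    """(max - min) / min of distances from one random query to n_points random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast between nearest and farthest neighbour as dimensionality grows.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
```

In low dimensions the farthest point is many times farther away than the nearest one; in high dimensions the ratio shrinks towards zero, which is what makes "nearby" hard to define.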

This is not always a problem: data may lie on a low-dimensional manifold embedded in high-dimensional space. Methods that exploit this (dimensionality reduction, manifold learning) can recover useful structure.

Polynomial Kernel and Feature Spaces

The polynomial kernel $k(x, x') = (x^T x' + c)^d$ with $c > 0$ corresponds to a feature space containing all monomials of degree up to $d$ (here $d$ denotes the polynomial degree, not the input dimension). For two 2-dimensional inputs with $d = 2$:

k(x, x') = (x_1 x_1' + x_2 x_2' + c)^2

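expands to $x_1^2 (x_1')^2 + x_2^2 (x_2')^2 + 2 x_1 x_2 x_1' x_2' + 2c\, x_1 x_1' + 2c\, x_2 x_2' + c^2$, which is the inner product of two feature vectors with components $x_1^2$, $x_2^2$, $\sqrt{2}\, x_1 x_2$, $\sqrt{2c}\, x_1$, $\sqrt{2c}\, x_2$ and $c$. The feature space has dimension $\binom{d_\text{input} + d}{d}$, which grows rapidly with the degree, but evaluating the kernel still costs only $O(d_\text{input})$.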
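The correspondence between the kernel and an explicit feature map can be verified numerically for the degree-2, two-dimensional case. The feature map `phi_degree2` below is one valid choice obtained by expanding the square; its name, the value `c = 1.0`, and the test vectors are assumptions for illustration.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, degree=2):
    """Polynomial kernel (x^T z + c)^degree, computed without explicit features."""
    return (np.dot(x, z) + c) ** degree

def phi_degree2(x, c=1.0):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = x
    return np.array([
        x1**2,
        x2**2,
        np.sqrt(2.0) * x1 * x2,
        np.sqrt(2.0 * c) * x1,
        np.sqrt(2.0 * c) * x2,
        c,
    ])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)               # one dot product in input space
k_explicit = phi_degree2(x) @ phi_degree2(z) # dot product in 6-dim feature space
```

Both routes give the same value (4.0 for these inputs), but the implicit route never builds the six-dimensional feature vectors.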

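The Gaussian kernel can be written, via the Taylor series of the exponential, as a weighted sum of polynomial terms $(x^T x')^j$ of every degree $j$, so its feature space is infinite-dimensional. This is why it can represent arbitrarily complex boundaries.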
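The link to polynomial kernels can be checked by truncating the series. The sketch assumes the parameterisation $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$, which factorises as $\exp(-\|x\|^2 / (2\sigma^2)) \exp(-\|x'\|^2 / (2\sigma^2)) \sum_{j \ge 0} (x^T x' / \sigma^2)^j / j!$. The function names, `sigma = 1.0`, and the test vectors are illustrative.

```python
import math
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def gaussian_kernel_truncated(x, z, sigma=1.0, max_degree=10):
    """Gaussian kernel approximated by polynomial terms up to max_degree."""
    prefactor = np.exp(-(np.dot(x, x) + np.dot(z, z)) / (2.0 * sigma**2))
    s = np.dot(x, z) / sigma**2
    # Each term s**j / j! is a scaled homogeneous polynomial kernel of degree j.
    series = sum(s**j / math.factorial(j) for j in range(max_degree + 1))
    return prefactor * series

x = np.array([0.5, -1.0])
z = np.array([1.0, 0.3])

exact = gaussian_kernel(x, z)
approx = {m: gaussian_kernel_truncated(x, z, max_degree=m) for m in (1, 3, 10)}
errors = {m: abs(v - exact) for m, v in approx.items()}
```

The error shrinks as higher-degree terms are included, and no finite truncation reproduces the kernel exactly for general inputs.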