SVMs: The Kernel Trick and Dual Representation
Kernels allow SVMs to operate in high-dimensional feature spaces without explicitly computing the transformation. The dual representation reformulates the optimisation in terms of pairwise similarities between training points.
The Kernel Trick
A kernel function computes the inner product between two points in a (potentially infinite-dimensional) feature space without explicitly computing the feature map:
This allows SVMs to find non-linear decision boundaries by implicitly working in a high-dimensional space while keeping the computation tractable.
Valid Kernels
A function is a valid kernel if it corresponds to an inner product in some feature space. Equivalently, the Gram matrix (kernel matrix) must be positive semidefinite.
Given valid kernels and , the following are also valid kernels (Shawe-Taylor and Christianini, 2004):
- for any constant
- for any function
- where is a polynomial with non-negative coefficients
- (on different input subspaces)
The linear kernel is the simplest valid kernel.
Common Kernels
Gaussian (RBF) kernel:
This corresponds to an infinite-dimensional feature space. Points that are close in input space receive high similarity. The bandwidth controls the width of the kernel.
Polynomial kernel:
Degree controls the order of the polynomial feature interactions.
The Dual Representation
The primal SVM optimisation is over the weight vector . The dual reformulation expresses the solution in terms of the training data directly:
The optimal decision function becomes:
where are the dual variables. Only support vectors have . All other training points do not contribute to the decision boundary.
This is the representer theorem: the optimal solution lies in the span of the kernel functions evaluated at training points.
The Optimal Hyperplane
The optimal hyperplane maximises the margin between the two classes. The margin is .
The distance from the hyperplane to the support vector in the direction of the weight vector is . For a point :
As the kernel bandwidth (in the Gaussian kernel), the hyperplane becomes increasingly determined by the nearest data point, and more distant points contribute negligibly. As , all points contribute equally and the kernel approximates a linear kernel.
Dual Space Intuition
The dual vector space mirrors the original vector space . The dual representation is complementary: properties that are extreme in the primal (such as large weights) correspond to properties that are constrained in the dual (such as bounded values). The dual is often isomorphic to the original vector space in terms of structural properties.
Working in the dual is what makes kernel methods computationally feasible: instead of explicitly computing (which may be infinite-dimensional), we only compute pairwise kernel evaluations .