KANs: Kolmogorov-Arnold Networks

The Theorem

The Kolmogorov-Arnold representation theorem states that any continuous multivariate function on a bounded domain, $f : [0,1]^n \to \mathbb{R}$, can be expressed as a finite composition of continuous univariate functions and addition:

f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)

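This is the theoretical motivation for KANs. Every continuous function of $n$ variables has an exact representation built from $2n+1$ outer functions $\Phi_q$ and $n(2n+1)$ inner functions $\phi_{q,p}$, all univariate, regardless of dimensionality. The theorem guarantees only that these univariate functions are continuous, so they can be highly irregular; KANs relax the fixed two-layer form to arbitrary widths and depths in the hope of finding smoother, learnable representations.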
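To make the shape of the formula concrete, here is a minimal Python sketch that evaluates the right-hand side for given inner and outer functions. The helper name `ka_eval` and the hand-picked example (multiplication of positive numbers via `exp(log x_1 + log x_2)`, with the remaining four outer functions set to zero) are illustrative assumptions. The example shows the structure only; it is not the construction used in the theorem's proof.

```python
import math

def ka_eval(x, inner, outer):
    """Evaluate sum_q Phi_q( sum_p phi_{q,p}(x_p) ).

    inner[q][p] plays the role of phi_{q,p}; outer[q] plays the role of Phi_q.
    Both lists must have 2n + 1 entries, where n = len(x).
    """
    n = len(x)
    if len(inner) != 2 * n + 1 or len(outer) != 2 * n + 1:
        raise ValueError("expected 2n + 1 inner rows and outer functions")
    total = 0.0
    for Phi_q, phis_q in zip(outer, inner):
        # Inner sum: one univariate function per input coordinate.
        s = sum(phi(x_p) for phi, x_p in zip(phis_q, x))
        total += Phi_q(s)
    return total

# Hypothetical example for n = 2 and positive inputs:
# x1 * x2 = exp(log x1 + log x2), padded with zero terms for q = 1..4.
zero = lambda s: 0.0
outer = [math.exp, zero, zero, zero, zero]
inner = [[math.log, math.log]] + [[zero, zero]] * 4

value = ka_eval([2.0, 3.0], inner, outer)  # approximately 2.0 * 3.0
```

Only univariate functions and addition appear inside `ka_eval`; the multivariate behaviour comes entirely from how they are composed.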

How KANs Differ from MLPs

In a standard MLP, activation functions are fixed (ReLU, tanh, etc.) and sit on nodes. The learned parameters are the linear weights on edges.

In a KAN, this is reversed: the activation functions are learnable and sit on edges. There are no separate linear weight matrices; each edge learns a univariate function, implemented as a spline, and each node simply sums the outputs of its incoming edges.

The KAN activation function on each edge is:

\phi(x) = w_b \cdot b(x) + w_s \cdot \text{spline}(x)

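where $b(x) = \text{silu}(x) = x / (1 + e^{-x})$ is a fixed basis function, $\text{spline}(x) = \sum_i c_i B_i(x)$ is a linear combination of B-spline basis functions $B_i$ with learnable coefficients $c_i$, and $w_b$, $w_s$ are scale factors that set the relative weight of the two terms.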
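The edge activation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes a uniform grid extended by $k$ knots on each side, evaluates the B-spline basis with the Cox-de Boor recursion, and uses the hypothetical names `bspline_basis` and `edge_activation`.

```python
import numpy as np

def bspline_basis(x, grid, k):
    """Evaluate all degree-k B-spline basis functions at the points x.

    grid holds G + 1 uniformly spaced knots; the result has shape (len(x), G + k).
    """
    x = np.asarray(x, dtype=float)
    h = grid[1] - grid[0]
    # Assumption: pad the grid with k extra uniform knots on each side.
    left = grid[0] - h * np.arange(k, 0, -1)
    right = grid[-1] + h * np.arange(1, k + 1)
    t = np.concatenate([left, grid, right])
    # Degree 0: indicator of each half-open knot interval.
    B = ((x[:, None] >= t[None, :-1]) & (x[:, None] < t[None, 1:])).astype(float)
    # Cox-de Boor recursion raises the degree one step at a time.
    for d in range(1, k + 1):
        w_left = (x[:, None] - t[None, :-(d + 1)]) / (t[None, d:-1] - t[None, :-(d + 1)])
        w_right = (t[None, d + 1:] - x[:, None]) / (t[None, d + 1:] - t[None, 1:-d])
        B = w_left * B[:, :-1] + w_right * B[:, 1:]
    return B

def silu(x):
    return x / (1.0 + np.exp(-x))

def edge_activation(x, coef, grid, k, w_b=1.0, w_s=1.0):
    """phi(x) = w_b * silu(x) + w_s * spline(x) for one edge."""
    spline = bspline_basis(x, grid, k) @ coef
    return w_b * silu(x) + w_s * spline

# Usage: one edge with G = 5 intervals on [-1, 1] and cubic splines (k = 3).
grid = np.linspace(-1.0, 1.0, 6)
coef = np.zeros(5 + 3)            # G + k learnable coefficients
x = np.linspace(-0.9, 0.9, 7)
y = edge_activation(x, coef, grid, k=3)
```

With all coefficients at zero the edge reduces to the scaled `silu` term, which is one way to see why $b(x)$ behaves like a residual path around the spline.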

Parameter Count

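A KAN layer with $n$ inputs and $m$ outputs has $n \cdot m$ edges, and each edge carries its own set of spline coefficients. For a network with layer widths $[n_0, n_1, \ldots, n_L]$, the total number of spline coefficients is:

N_{\text{params}} = \sum_{i=1}^{L} n_{i-1} \cdot n_i \cdot (k + G)

where $k$ is the spline order (the degree of the piecewise polynomial, so $k = 3$ gives cubic splines) and $G$ is the number of grid intervals. Each edge holds a full spline with $G + k$ coefficients, not a scalar weight. The count above covers spline coefficients only; if the scale factors $w_b$ and $w_s$ are trained, they add two more parameters per edge.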
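As a quick check on the formula, the following sketch counts spline coefficients for a given list of layer widths and compares the result with a dense MLP of the same widths. The function names are hypothetical, and the MLP count assumes one bias per output unit.

```python
def kan_param_count(widths, k, G):
    """Spline coefficients in a KAN: (number of edges) * (k + G)."""
    edges = sum(n_in * n_out for n_in, n_out in zip(widths[:-1], widths[1:]))
    return edges * (k + G)

def mlp_param_count(widths):
    """Weights plus biases in a dense MLP with the same layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))

# Widths [2, 5, 1] give 2*5 + 5*1 = 15 edges.
kan_total = kan_param_count([2, 5, 1], k=3, G=5)   # 15 * 8 = 120
mlp_total = mlp_param_count([2, 5, 1])             # (10 + 5) + (5 + 1) = 21
```

For equal widths a KAN therefore has roughly $(k + G)$ times as many parameters as an MLP, so any parameter-efficiency claim depends on the KAN reaching the target accuracy with much smaller widths or depth.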

Because each edge holds a univariate function, the learned functions can be visualised directly. A trained KAN shows what transformation each connection applies, and symbolic regression can extract closed-form expressions from those functions.

This is the main interpretability advantage over MLPs: in an MLP the distributed representation across weights makes interpretation difficult. In a KAN, each edge function is independently inspectable.

Accuracy and Efficiency Claims

For problems with compositional structure (functions that decompose into nested univariate functions), KANs are reported to be:

More parameter-efficient: the same accuracy as an MLP with far fewer parameters.
More accurate on scientific tasks such as PDE solving.
Better suited to symbolic regression tasks.

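The curse of dimensionality (COD) is argued to apply differently. Take the quadratic form $f(x) = x^T A x$ with $x \in \mathbb{R}^4$: it is a sum of pairwise products $x_i x_j$, and each product can be rewritten using only addition and univariate squaring. A KAN can in principle exploit this compositional structure by learning a handful of one-dimensional functions instead of approximating a four-dimensional surface directly. An MLP has no comparable mechanism for learning univariate functions directly, so it is expected to exploit such structure less efficiently.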
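The compositional structure of the quadratic form can be checked numerically. The sketch below uses the polarization identity $x_i x_j = \left((x_i + x_j)^2 - (x_i - x_j)^2\right)/4$, so the only nonlinearity is the univariate map $u \mapsto u^2$. It illustrates a representation that a KAN could in principle learn; it makes no claim about what a trained KAN actually finds. The function names are hypothetical.

```python
import numpy as np

def square(u):
    """The only nonlinear building block: a univariate function."""
    return u * u

def quadratic_form_compositional(A, x):
    """Compute x^T A x using only addition, scaling, and square()."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Polarization identity replaces the product x_i * x_j.
            product = (square(x[i] + x[j]) - square(x[i] - x[j])) / 4.0
            total += A[i, j] * product
    return total

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

direct = x @ A @ x
compositional = quadratic_form_compositional(A, x)
# The two values agree up to floating-point rounding.
```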

KAN and Symbolic KAN

A Symbolic KAN replaces the learned spline functions with known symbolic expressions (sin, exp, polynomial) once training reveals the approximate shape. This converts a neural network into an interpretable symbolic formula, which is useful for scientific discovery.

Training workflow:

Train a standard KAN to convergence.
Inspect each edge function visually.
Assign symbolic functions to edges where the shape matches a known expression.
Fine-tune the remaining parameters (for example the scale and shift of each symbolic function) with the symbolic forms held fixed.
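The matching step in the workflow above can be sketched as a small least-squares search. This simplified version assumes an edge function has been sampled as arrays `x` and `y`, fits $y \approx c \cdot f(x) + d$ for each candidate $f$, and keeps the candidate with the highest $R^2$. The candidate set and the name `best_symbolic_match` are hypothetical; a fuller treatment would also fit a scale and shift on the input.

```python
import numpy as np

# Hypothetical candidate library of symbolic shapes.
CANDIDATES = {
    "sin": np.sin,
    "exp": np.exp,
    "square": np.square,
    "identity": lambda u: u,
}

def best_symbolic_match(x, y, candidates=CANDIDATES):
    """Return (name, c, d, r2) for the candidate f that best fits y ~ c*f(x) + d."""
    ss_tot = np.sum((y - y.mean()) ** 2)
    best = None
    for name, f in candidates.items():
        # Linear least squares in the two unknowns c and d.
        design = np.column_stack([f(x), np.ones_like(x)])
        params, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ params
        r2 = 1.0 - np.sum(residual ** 2) / ss_tot
        if best is None or r2 > best[3]:
            best = (name, params[0], params[1], r2)
    return best

# Usage: samples from a stand-in for a learned edge function.
x = np.linspace(-2.0, 2.0, 200)
y = 3.0 * np.sin(x) + 0.5
name, c, d, r2 = best_symbolic_match(x, y)
```

In practice the visual inspection step still matters: a high $R^2$ on a narrow input range can be reached by several candidates, so the choice should be checked against domain knowledge.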

Limitations

Increasing spline resolution requires grid extension (refitting each edge's spline on a finer grid), which adds complexity to the training procedure.
Training is slower than MLPs because spline operations are more expensive than linear transforms.
For general high-dimensional tasks without compositional structure, KANs may offer no advantage.
KANs are still trained by backpropagation with standard gradient-based optimisers, so familiar concerns such as initialisation, normalisation, and optimiser choice still apply.
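To illustrate the grid extension mentioned in the first limitation, here is one possible sketch: sample the current spline, then solve a least-squares problem for the coefficients on a finer grid. The uniform extended grid, the sample count, and the names `bspline_basis` and `extend_grid` are assumptions made for this example.

```python
import numpy as np

def bspline_basis(x, grid, k):
    """Degree-k B-spline basis on a uniform grid padded by k knots per side."""
    x = np.asarray(x, dtype=float)
    h = grid[1] - grid[0]
    t = np.concatenate([grid[0] - h * np.arange(k, 0, -1),
                        grid,
                        grid[-1] + h * np.arange(1, k + 1)])
    B = ((x[:, None] >= t[None, :-1]) & (x[:, None] < t[None, 1:])).astype(float)
    for d in range(1, k + 1):  # Cox-de Boor recursion
        w_left = (x[:, None] - t[None, :-(d + 1)]) / (t[None, d:-1] - t[None, :-(d + 1)])
        w_right = (t[None, d + 1:] - x[:, None]) / (t[None, d + 1:] - t[None, 1:-d])
        B = w_left * B[:, :-1] + w_right * B[:, 1:]
    return B

def extend_grid(coef, grid, new_G, k, n_samples=200):
    """Refit a spline with G intervals onto a grid with new_G intervals."""
    a, b = grid[0], grid[-1]
    xs = np.linspace(a, b, n_samples)
    ys = bspline_basis(xs, grid, k) @ coef           # values of the coarse spline
    new_grid = np.linspace(a, b, new_G + 1)
    # Least-squares fit of the fine coefficients to the coarse spline's values.
    new_coef, *_ = np.linalg.lstsq(bspline_basis(xs, new_grid, k), ys, rcond=None)
    return new_coef, new_grid

# Usage: double the resolution of a cubic spline from G = 5 to G = 10.
rng = np.random.default_rng(0)
k, G = 3, 5
grid = np.linspace(-1.0, 1.0, G + 1)
coef = rng.normal(size=G + k)
new_coef, new_grid = extend_grid(coef, grid, new_G=2 * G, k=k)
```

The refit starts the finer spline from the function already learned, but every extension multiplies the per-edge coefficient count, which is part of the training cost noted above.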