KL Divergence and Autoencoders
KL divergence measures the difference between two probability distributions. Autoencoders use this idea implicitly but are not generative models. This post draws the line between point estimates and distributions.
KL Divergence
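The Kullback-Leibler divergence between two distributions $P$ and $Q$ is defined as:
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
For continuous distributions the sum becomes an integral over densities.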
Three properties matter:
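- Non-negative: $D_{KL}(P \,\|\, Q) \geq 0$, with equality only when $P = Q$.
- Asymmetric: $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$ in general, so KL divergence is not a distance metric.
- Information gain: It measures how much information is lost when $Q$ is used to approximate $P$.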
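
The first two properties are easy to check numerically. Below is a minimal sketch for discrete distributions stored as NumPy arrays. The function name `kl_divergence` and the example values are made up for illustration, and the code assumes `q` is non-zero wherever `p` is.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # outcomes with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two arbitrary example distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

kl_pq = kl_divergence(p, q)  # q approximating p
kl_qp = kl_divergence(q, p)  # p approximating q
kl_pp = kl_divergence(p, p)  # identical distributions

print(kl_pq, kl_qp, kl_pp)
```

Both directions come out positive but unequal, and the divergence of a distribution from itself is zero.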
The relationship to entropy:
$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$$
where $H(P, Q)$ is the cross-entropy of $P$ and $Q$, and $H(P)$ is the entropy of $P$. Minimising KL divergence is equivalent to minimising cross-entropy when $P$ is fixed, which is why cross-entropy is the standard training loss in classification.
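
The decomposition into cross-entropy and entropy can be checked the same way. This is a sketch under the same assumptions as before: discrete distributions, natural logarithms, and hypothetical helper names. The one-hot example mimics a classification label, where the entropy of the target is zero and cross-entropy and KL divergence coincide.

```python
import numpy as np

def entropy(p):
    """Entropy of p in nats."""
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def cross_entropy(p, q):
    """Cross-entropy of p and q in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def kl_divergence(p, q):
    """KL(p || q) in nats, computed directly from its definition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
lhs = kl_divergence(p, q)
rhs = cross_entropy(p, q) - entropy(p)

# One-hot target, as in classification: the target has zero entropy.
label = [0.0, 1.0, 0.0]
pred = [0.2, 0.7, 0.1]
ce_label = cross_entropy(label, pred)
kl_label = kl_divergence(label, pred)

print(lhs, rhs, ce_label, kl_label)
```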
Autoencoders
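An autoencoder learns a compressed representation of input data by training two networks together: an encoder $E_\phi$ and a decoder $D_\theta$.
Given a dataset $\{x_i\}_{i=1}^{N}$, the training objective is:
$$\min_{\phi, \theta} \; \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - D_\theta(E_\phi(x_i)) \right\|_2^2$$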
This is an L2 reconstruction loss. The encoder compresses input to a latent code; the decoder reconstructs the input from that code.
Input x → [Encoder E_φ] → code h → [Decoder D_θ] → x'
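
To make the pipeline concrete, here is a minimal sketch of an autoencoder trained with the L2 reconstruction loss. Everything specific in it is an assumption chosen for brevity: the encoder and decoder are single linear maps (`W_enc`, `W_dec`) rather than deep networks, the data is synthetic, and the gradients are written out by hand for plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): 200 points in 5 dimensions lying near a 2-D subspace.
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 5))
X = Z @ A + 0.01 * rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each feature

# Linear encoder and decoder with small random initial weights.
W_enc = 0.1 * rng.normal(size=(5, 2))  # plays the role of the encoder
W_dec = 0.1 * rng.normal(size=(2, 5))  # plays the role of the decoder

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared L2 distance between inputs and their reconstructions."""
    X_hat = (X @ W_enc) @ W_dec
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.01
for _ in range(1000):
    H = X @ W_enc                       # latent codes h, one vector per input
    X_hat = H @ W_dec                   # reconstructions x'
    G = 2.0 * (X_hat - X) / X.shape[0]  # gradient of the loss w.r.t. X_hat
    grad_dec = H.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec

final_loss = reconstruction_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

Each row of `H` is one deterministic code for one input. Nothing in the loss says how those codes should be distributed.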
What Autoencoders Are Not
The critical distinction: autoencoders produce point estimates, not probability distributions.
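The goal is to find parameters $\phi$ and $\theta$ that minimise the reconstruction error between the input and its reconstruction. The focus is not to learn the underlying distribution that describes the data. Like maximum likelihood estimation (MLE), which returns a single best-fitting value rather than a distribution over values, a plain autoencoder returns a single code and a single reconstruction for each input.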
This means:
- You cannot sample new data from a plain autoencoder. There is no defined distribution over the latent space.
- The latent code is a single vector per input, not a distribution.
- Interpolating in latent space may produce incoherent outputs because there is no continuity constraint.
| Model | Latent representation | Can generate new samples |
|---|---|---|
| Autoencoder | Point estimate | No |
| VAE | Distribution | Yes |
Why This Matters
The limitation of the plain autoencoder is exactly the motivation for the Variational Autoencoder (VAE). Replacing the point estimate with a distribution (typically Gaussian) and adding a regularisation term (the KL divergence from the prior) makes the latent space structured and continuous. Sampling becomes possible.
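
The difference between a point estimate and a distribution shows up directly in the encoder's output. The sketch below is illustrative only: the weight matrices are random stand-ins for trained networks, the names (`encode_point`, `encode_gaussian`, `sample_latent`) are hypothetical, and the Gaussian encoder is assumed to output a mean and a log-variance per latent dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # one input vector

# Random stand-ins for trained encoder weights (assumption).
W_h = rng.normal(size=(4, 2))
W_mu = rng.normal(size=(4, 2))
W_log_var = 0.1 * rng.normal(size=(4, 2))

def encode_point(x):
    """Plain autoencoder: one deterministic code per input."""
    return x @ W_h

def encode_gaussian(x):
    """VAE-style encoder: parameters of a diagonal Gaussian over codes."""
    return x @ W_mu, x @ W_log_var

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

h1 = encode_point(x)
h2 = encode_point(x)
mu, log_var = encode_gaussian(x)
z1 = sample_latent(mu, log_var, rng)
z2 = sample_latent(mu, log_var, rng)

print(h1, h2)  # the same vector twice
print(z1, z2)  # two different draws around mu
```

Encoding the same input twice with the plain encoder gives the same code, while the Gaussian encoder gives a different sample on each draw.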
The plain autoencoder is still useful for dimensionality reduction, denoising, and anomaly detection. It is not useful as a generative model.
Where KL divergence fits: in a VAE, it appears in the ELBO as the regularisation term that pushes the approximate posterior $q_\phi(z \mid x)$ toward the prior $p(z)$. In a plain autoencoder, there is no such term and no probabilistic interpretation of the latent code.
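
When the approximate posterior is a diagonal Gaussian with mean `mu` and log-variance `log_var`, and the prior is a standard normal (both assumptions made here, though they are the usual choice), the KL term has a closed form: half the sum over latent dimensions of `mu**2 + exp(log_var) - 1 - log_var`. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# Posterior equal to the prior: no penalty.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Mean shifted by one unit in one dimension, unit variance.
kl_shifted = gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

# Mean at the prior, but a much narrower posterior.
kl_narrow = gaussian_kl_to_standard_normal([0.0, 0.0], [-2.0, -2.0])

print(kl_at_prior, kl_shifted, kl_narrow)
```

The penalty is zero only when the posterior matches the prior, and it grows when the mean drifts away or the variance collapses. That pressure is what keeps the latent space structured.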