KL Divergence and Autoencoders

KL Divergence

The Kullback-Leibler divergence between two distributions $P$ and $Q$ is defined as:

D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

For continuous distributions with densities $p$ and $q$, the sum becomes an integral:

D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

Three properties matter:

Non-negative: $D_{KL}(P \| Q) \geq 0$, with equality if and only if $P = Q$.
Asymmetric: in general $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$, so KL divergence is not a distance metric.
Information gain: it measures how much information is lost when $Q$ is used to approximate $P$.
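The first two properties can be checked numerically. The following is a minimal sketch for discrete distributions, assuming NumPy is available; the helper name `kl_divergence` and the example vectors `p` and `q` are illustrative choices, not part of the original notes.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

kl_pq = kl_divergence(p, q)  # D_KL(P || Q)
kl_qp = kl_divergence(q, p)  # D_KL(Q || P)
kl_pp = kl_divergence(p, p)  # zero, since the two arguments are identical

print(f"D_KL(P||Q) = {kl_pq:.4f}")
print(f"D_KL(Q||P) = {kl_qp:.4f}")
print(f"D_KL(P||P) = {kl_pp:.4f}")
```

Both divergences come out positive and differ from each other, while the divergence of a distribution from itself is zero.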

The relationship to entropy:

D_{KL}(P \| Q) = H(P, Q) - H(P)

where $H(P, Q)$ is the cross-entropy of $P$ and $Q$, and $H(P)$ is the entropy of $P$. Because $H(P)$ does not depend on $Q$, minimising KL divergence over $Q$ is equivalent to minimising cross-entropy when $P$ is fixed, which is why cross-entropy is the standard training loss in classification.
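The identity can be verified directly on a small example. This is a sketch under the assumption that all probabilities are strictly positive; the function names and the candidate distributions `q_close` and `q_far` are hypothetical.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])        # fixed target distribution
q_close = np.array([0.6, 0.3, 0.1])  # hypothetical model output near P
q_far = np.array([0.2, 0.5, 0.3])    # hypothetical model output far from P

for name, q in (("q_close", q_close), ("q_far", q_far)):
    gap = cross_entropy(p, q) - entropy(p)
    print(f"{name}: H(P,Q) - H(P) = {gap:.6f}, D_KL = {kl_divergence(p, q):.6f}")
```

Since $H(P)$ is the same constant for every candidate, ranking candidates by cross-entropy gives the same order as ranking them by KL divergence.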

Autoencoders

An autoencoder learns a compressed representation of input data by training two networks together: an encoder and a decoder.

x' = D_\theta(E_\phi(x)) \approx x

Given a dataset $\{x_i\}_{i=1}^n$ in which each input serves as its own target ($y_i = x_i$), the training objective is:

\min_{\theta, \phi} \sum_{i=1}^{n} \| D_\theta(E_\phi(x_i)) - x_i \|^2

This is an L2 reconstruction loss. The encoder compresses input to a latent code; the decoder reconstructs the input from that code.

Input x → [Encoder E_φ] → code h → [Decoder D_θ] → x'
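The pipeline above can be sketched with the simplest possible choice of networks. The example below assumes a purely linear encoder and decoder, synthetic data, and plain gradient descent with a hand-picked learning rate; all of these are illustrative assumptions, and a practical autoencoder would normally use nonlinear layers and a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for the demo): 200 points in 5-D near a 2-D subspace.
n, d, k = 200, 5, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.01 * rng.normal(size=(n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise each feature

# Linear encoder E_phi(x) = x @ W_enc and linear decoder D_theta(h) = h @ W_dec.
W_enc = 0.1 * rng.normal(size=(d, k))
W_dec = 0.1 * rng.normal(size=(k, d))

def reconstruction_loss(X, W_enc, W_dec):
    """Mean over samples of || D(E(x)) - x ||^2."""
    X_rec = (X @ W_enc) @ W_dec
    return float(np.mean(np.sum((X_rec - X) ** 2, axis=1)))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.01  # hand-picked step size
for step in range(2000):
    H = X @ W_enc                             # latent codes, shape (n, k)
    err = H @ W_dec - X                       # reconstruction error, shape (n, d)
    grad_dec = 2.0 * H.T @ err / n            # gradient w.r.t. W_dec
    grad_enc = 2.0 * X.T @ (err @ W_dec.T) / n  # gradient w.r.t. W_enc
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec

final_loss = reconstruction_loss(X, W_enc, W_dec)
h = X[0] @ W_enc  # the code for one input: a single vector of length k

print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.4f}")
print("code shape:", h.shape)
```

The encoder maps each input to exactly one code vector, which is the point-estimate behaviour that distinguishes this model from a probabilistic one.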

What Autoencoders Are Not

The critical distinction: autoencoders produce point estimates, not probability distributions.

The goal is to find parameters that minimise the reconstruction error between the input and its reconstruction. The goal is not to learn the underlying distribution that describes the data. Like maximum likelihood estimation (MLE), this training yields a single point estimate of the parameters, and the trained encoder maps each input to a single point in latent space.

This means:

You cannot sample new data from a plain autoencoder. There is no defined distribution over the latent space.
The latent code $h$ is a single vector per input, not a distribution.
Interpolating in latent space may produce incoherent outputs because there is no continuity constraint.

Model	Latent representation	Can generate new samples?
Autoencoder	Point estimate	No
VAE	Distribution $q_\phi(z|x)$	Yes

Why This Matters

The limitation of the plain autoencoder is exactly the motivation for the Variational Autoencoder (VAE). Replacing the point estimate with a distribution (typically Gaussian) and adding a regularisation term (the KL divergence from the prior) makes the latent space structured and continuous, so sampling becomes possible.
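Written out, the objective this describes is commonly given as the evidence lower bound (ELBO). In the form below, $q_\phi(z|x)$ is the encoder's distribution over latent codes, $p_\theta(x|z)$ is the decoder's likelihood, and $p(z)$ is the prior; this notation is an assumption that reuses the encoder and decoder parameters $\phi$ and $\theta$ from the autoencoder objective:

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))

The first term rewards accurate reconstruction and the second is the KL regulariser. With a Gaussian decoder of fixed variance, the first term reduces, up to an additive constant, to a scaled negative squared error, which links it back to the L2 loss of the plain autoencoder.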

The plain autoencoder is still useful for dimensionality reduction, denoising, and anomaly detection. It is not useful as a generative model.

This is where KL divergence fits: in a VAE, it appears in the ELBO (evidence lower bound) as the regularisation term that pushes the approximate posterior $q_\phi(z|x)$ toward the prior $p(z)$. In a plain autoencoder, there is no such term and no probabilistic interpretation of the latent code.
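For the common case of a diagonal Gaussian posterior and a standard normal prior, the KL term has the closed form $\frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The sketch below assumes that setting and an encoder that outputs a mean and a log-variance per latent dimension; the function name and the example values are hypothetical.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var))

# Posterior equal to the prior: the regulariser vanishes.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # 0.0

# Posterior with unit variance but shifted mean: penalty is 0.5 * ||mu||^2.
kl_shifted = gaussian_kl_to_standard_normal([1.0, -2.0], [0.0, 0.0])  # 2.5

# Drawing a latent sample from the posterior via z = mu + sigma * eps.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_var = np.array([0.0, 0.0])
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

print(kl_at_prior, kl_shifted, z.shape)
```

The penalty is zero only when the posterior matches the prior and grows as the mean or variance drifts away from it, which is what keeps the latent codes of different inputs close enough together for sampling to work.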