Contents
  1. KL Divergence
  2. Autoencoders
  3. What Autoencoders Are Not
  4. Why This Matters

KL Divergence and Autoencoders

KL divergence measures how one probability distribution differs from another. Plain autoencoders relate to this idea only implicitly, through their reconstruction loss, and are not generative models. This post draws the line between point estimates and distributions.

KL Divergence

The Kullback-Leibler divergence between two distributions $P$ and $Q$ is defined as:

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

for discrete distributions. For continuous distributions with densities $p$ and $q$, the sum becomes an integral:

$$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

Three properties matter:

  • Non-negative: $D_{KL}(P \| Q) \geq 0$, with equality if and only if $P = Q$.
  • Asymmetric: in general $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$, so KL divergence is not a distance metric.
  • Information gain: it measures how much information is lost when $Q$ is used to approximate $P$.
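
The first two properties can be checked numerically. The following is a minimal sketch, assuming discrete distributions stored as NumPy arrays; the function name `kl_divergence` and the example values for `p` and `q` are made up for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    Assumes p and q are 1-D arrays that each sum to 1, and that q > 0
    wherever p > 0 (otherwise the divergence is infinite).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Two illustrative distributions over three outcomes (made-up values).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl_pq = kl_divergence(p, q)  # D_KL(P || Q)
kl_qp = kl_divergence(q, p)  # D_KL(Q || P)
kl_pp = kl_divergence(p, p)  # D_KL(P || P)

print("D_KL(P || Q):", kl_pq)
print("D_KL(Q || P):", kl_qp)
print("D_KL(P || P):", kl_pp)
```

Both directions come out positive but with different values, and the divergence of a distribution from itself is zero.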

The relationship to entropy:

$$D_{KL}(P \| Q) = H(P, Q) - H(P)$$

where $H(P, Q)$ is the cross-entropy of $P$ and $Q$, and $H(P)$ is the entropy of $P$. Minimising KL divergence is equivalent to minimising cross-entropy when $P$ is fixed, which is why cross-entropy is the standard training loss in classification.
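
The identity can be verified directly on a small example. This is a sketch under the same assumptions as the definition above (discrete distributions, natural logarithm); the helper names and the values of `p`, `q`, and `one_hot` are illustrative only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) in nats, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q), computed from its definition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

p = [0.7, 0.2, 0.1]  # made-up "true" distribution
q = [0.4, 0.4, 0.2]  # made-up model distribution

gap = cross_entropy(p, q) - entropy(p)  # H(P, Q) - H(P)
direct = kl_divergence(p, q)            # D_KL(P || Q)

# With a one-hot label, H(P) = 0, so cross-entropy and KL coincide.
one_hot = [0.0, 1.0, 0.0]
label_entropy = entropy(one_hot)
label_ce = cross_entropy(one_hot, q)
label_kl = kl_divergence(one_hot, q)
```

The one-hot case is the classification setting: the label distribution has zero entropy, so the cross-entropy loss equals the KL divergence exactly.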

Autoencoders

An autoencoder learns a compressed representation of input data by training two networks together: an encoder and a decoder.

$$x' = D_\theta(E_\phi(x)) \approx x$$

Given a dataset $\{x_i, y_i = x_i\}_{i=1}^n$, the training objective is:

$$\min_{\theta, \phi} \sum_{i=1}^{n} \| D_\theta(E_\phi(x_i)) - x_i \|^2$$

This is an L2 reconstruction loss. The encoder compresses input to a latent code; the decoder reconstructs the input from that code.

Input x → [Encoder E_φ] → code h → [Decoder D_θ] → x'
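
The pipeline and the objective above can be sketched with a deliberately tiny model. This example assumes a linear encoder and decoder (single weight matrices, no biases or nonlinearities), synthetic data, and plain gradient descent; none of these choices come from the post, and real autoencoders usually use deeper nonlinear networks trained with an autodiff library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 200 points in 4-D that lie close to a 2-D subspace.
n, d, k = 200, 4, 2
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.01 * rng.normal(size=(n, d))

# Linear encoder E_phi(x) = x @ W_enc and linear decoder D_theta(h) = h @ W_dec.
W_enc = 0.1 * rng.normal(size=(d, k))
W_dec = 0.1 * rng.normal(size=(k, d))

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared L2 reconstruction error over the dataset."""
    H = X @ W_enc        # codes h, shape (n, k)
    X_rec = H @ W_dec    # reconstructions x', shape (n, d)
    return float(np.mean(np.sum((X_rec - X) ** 2, axis=1)))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.01
for step in range(2000):
    H = X @ W_enc
    R = H @ W_dec - X                        # residual x' - x, shape (n, d)
    grad_dec = 2.0 * H.T @ R / n             # gradient w.r.t. W_dec
    grad_enc = 2.0 * X.T @ (R @ W_dec.T) / n # gradient w.r.t. W_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
codes = X @ W_enc  # one deterministic code vector per input

print("loss before training:", initial_loss)
print("loss after training: ", final_loss)
```

Note that `codes` holds exactly one vector per input: the encoder is a deterministic function, which is the point the next section builds on.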

What Autoencoders Are Not

The critical distinction: autoencoders produce point estimates, not probability distributions.

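The goal is to find parameters that minimise the error between the input and its reconstruction. The goal is not to learn the underlying distribution that generated the data. Minimising squared error is equivalent to maximum likelihood estimation (MLE) under a fixed-variance Gaussian noise assumption, and MLE gives point estimates: one set of parameters and, through the encoder, one code per input. That is exactly what a plain autoencoder does.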

This means:

  • You cannot sample new data from a plain autoencoder. There is no defined distribution over the latent space.
  • The latent code $h$ is a single vector per input, not a distribution.
  • Interpolating in latent space may produce incoherent outputs because there is no continuity constraint.

| Model | Latent representation | Can generate new samples |
|---|---|---|
| Autoencoder | Point estimate | No |
| VAE | Distribution $q_\phi(z \mid x)$ | Yes |

Why This Matters

The limitation of the plain autoencoder is exactly the motivation for the Variational Autoencoder. By replacing the point estimate with a distribution (typically Gaussian) and adding a regularisation term (the KL divergence from the prior), the latent space becomes structured and continuous. Sampling becomes possible.

The plain autoencoder is still useful for dimensionality reduction, denoising, and anomaly detection. It is not useful as a generative model.

This is where KL divergence fits: in a VAE, it appears in the ELBO as the regularisation term that pushes the approximate posterior $q_\phi(z \mid x)$ toward the prior $p(z)$. In a plain autoencoder, there is no such term and no probabilistic interpretation of the latent code.
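
For a concrete sense of that regularisation term, here is a minimal sketch. It assumes a diagonal Gaussian approximate posterior with mean $\mu$ and variance $\sigma^2$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$, for which the KL divergence has the closed form $\frac{1}{2} \sum_j (\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2)$. The function name, the log-variance parameterisation, and the example values are assumptions for illustration, not something the post specifies.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    log_var is the elementwise log of the variance.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

# Posterior equal to the prior: no penalty.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Posterior mean moved away from the origin: penalty grows with mu^2.
kl_shifted = gaussian_kl_to_standard_normal([1.0, -0.5], [0.0, 0.0])

# Posterior much narrower than the prior: also penalised.
kl_narrow = gaussian_kl_to_standard_normal([0.0, 0.0], [-2.0, -2.0])

print(kl_at_prior, kl_shifted, kl_narrow)
```

The narrow-posterior case shows why the term matters here: as the variance shrinks toward zero the posterior approaches a point estimate, and the penalty grows, which discourages the encoder from collapsing back into plain-autoencoder behaviour.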
