VAEs: ELBO, Reparameterisation, and the Latent Space
The VAE replaces the autoencoder's point estimate with a distribution over the latent space. Training is done by maximising the ELBO, which balances reconstruction quality against regularisation of the latent space.
The Core Idea
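A plain autoencoder maps an input $x$ to a fixed code $z$. A Variational Autoencoder (VAE) instead maps $x$ to a distribution over the latent space, from which a code $z$ is sampled. This makes the latent space continuous and structured, which enables generation.
The encoder produces the parameters of a distribution: $q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big)$. The decoder $p_\theta(x \mid z)$ then reconstructs $x$ from a sample $z \sim q_\phi(z \mid x)$.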
The ELBO
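The true posterior $p_\theta(z \mid x)$ is intractable. Instead, we approximate it with $q_\phi(z \mid x)$ and train by maximising the Evidence Lower BOund (ELBO):

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
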
The two terms have clear roles:
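- Reconstruction loss: the expected log-likelihood $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ measures how well the decoder recovers the input $x$ from a sampled $z$.
- Regularisation: the divergence $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$ pushes the approximate posterior toward the prior $p(z)$.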
Maximising the ELBO brings $q_\phi(z \mid x)$ closer to the true posterior $p_\theta(z \mid x)$. The KL term is what enforces structure on the latent space: smoothness, continuity, and a specific distribution (in practice, the standard normal).
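The two ELBO terms can be sketched for a single data point as follows. This is a minimal illustration, assuming a diagonal-Gaussian approximate posterior, a standard normal prior, and a Gaussian decoder with a fixed noise scale; the function names and the `sigma` parameter are hypothetical and do not come from any particular library.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_hat, sigma=1.0):
    """log N(x; x_hat, sigma^2 I). Assumption: Gaussian decoder with fixed noise scale."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    norm_const = 0.5 * len(x) * math.log(2.0 * math.pi * sigma ** 2)
    return -0.5 * sq_err / sigma ** 2 - norm_const


def elbo_single_sample(x, x_hat, mu, log_var):
    """One-sample ELBO estimate: reconstruction term minus KL term.

    x_hat is assumed to be the decoder output for one z sampled from the encoder.
    """
    return gaussian_log_likelihood(x, x_hat) - gaussian_kl_to_standard_normal(mu, log_var)


# Hypothetical values for one input and its reconstruction.
x = [0.5, -1.0]
x_hat = [0.4, -0.8]
mu = [0.3, -0.2]
log_var = [-0.5, 0.1]
print(elbo_single_sample(x, x_hat, mu, log_var))
```

With a fixed-variance Gaussian decoder the reconstruction term reduces to a scaled squared error plus a constant, which is why squared error is a common choice of reconstruction loss.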
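The derivation connects to KL divergence. Expanding the KL between the approximate and the true posterior with $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$:

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x) = \log p_\theta(x) - \mathrm{ELBO}(\theta, \phi; x)$$

Rearranging: $\log p_\theta(x) = \mathrm{ELBO}(\theta, \phi; x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$. Since $D_{\mathrm{KL}} \geq 0$, the ELBO is a lower bound on the log evidence.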
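The identity "log evidence = ELBO + KL to the true posterior" can be checked numerically on a model small enough that the posterior is tractable. The sketch below assumes a hypothetical toy model with a binary latent variable and one fixed observation; all probabilities are made-up illustrative values.

```python
import math

# Hypothetical toy model: binary latent z, one fixed observation x.
prior = [0.7, 0.3]        # p(z)
likelihood = [0.2, 0.9]   # p(x | z), evaluated at the observed x
q = [0.4, 0.6]            # an arbitrary approximate posterior q(z | x)

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                   # p(x)
posterior = [j / evidence for j in joint]               # true posterior p(z | x)


def kl(a, b):
    """KL(a || b) for two discrete distributions on the same support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# ELBO = E_q[log p(x | z)] - KL(q || prior)
elbo = sum(qz * math.log(px) for qz, px in zip(q, likelihood)) - kl(q, prior)
gap = kl(q, posterior)            # KL(q || true posterior)
log_evidence = math.log(evidence)

print(log_evidence, elbo + gap)   # the two values agree up to rounding
```

Setting `q` equal to `posterior` drives the gap to zero, at which point the ELBO equals the log evidence.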
The Reparameterisation Trick
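Sampling $z \sim q_\phi(z \mid x)$ is not differentiable, so gradients cannot flow through the sampler back to the encoder parameters $\phi$. The reparameterisation trick resolves this by rewriting the sample as a deterministic function of the parameters and an independent noise variable:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
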
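The stochasticity is now in $\epsilon$, which does not depend on $\phi$. Gradients flow through $\mu_\phi$ and $\sigma_\phi$ as normal. By Leibniz’s rule, the gradient can be moved inside the expectation for any differentiable $f$:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \nabla_\phi\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)\big]$$
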
Without reparameterisation, one typically has to fall back on a score-function (REINFORCE-style) estimator, whose gradient variance is usually much higher, which makes stochastic optimisation slow or impractical.
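The variance claim can be illustrated on a one-dimensional example. The sketch below assumes a scalar Gaussian with parameters `mu` and `sigma` and the test function f(z) = z², chosen because its expectation mu² + sigma² has known gradients (2·mu and 2·sigma). It compares the reparameterised (pathwise) gradient samples against a score-function estimator for the same quantity. All names and values are hypothetical, and the gradients are written out by hand rather than computed by an autodiff library.

```python
import random
import statistics


def f(z):
    return z * z  # test function; E[f(z)] = mu^2 + sigma^2


def df_dz(z):
    return 2.0 * z


def gradient_samples(mu, sigma, n, seed=0):
    """Per-sample gradient estimates of E[f(z)] for z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    pathwise_mu, pathwise_sigma, score_mu = [], [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z = mu + sigma * eps        # reparameterised sample
        pathwise_mu.append(df_dz(z) * 1.0)     # chain rule with dz/dmu = 1
        pathwise_sigma.append(df_dz(z) * eps)  # chain rule with dz/dsigma = eps
        # Score-function estimator: f(z) * d/dmu log N(z; mu, sigma^2)
        score_mu.append(f(z) * (z - mu) / sigma ** 2)
    return pathwise_mu, pathwise_sigma, score_mu


mu, sigma = 1.5, 0.5
pw_mu, pw_sigma, sc_mu = gradient_samples(mu, sigma, n=50_000)

print("pathwise d/dmu    mean:", statistics.fmean(pw_mu))
print("pathwise d/dsigma mean:", statistics.fmean(pw_sigma))
print("score    d/dmu    mean:", statistics.fmean(sc_mu))
print("variance pathwise vs score:", statistics.variance(pw_mu), statistics.variance(sc_mu))
```

Both estimators target the same gradient with respect to `mu`, but in this setup the per-sample variance of the score-function version is far larger, so it needs many more samples to reach the same accuracy.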
Latent Space Structure
The regularisation term controls the geometry of the latent space. With $p(z) = \mathcal{N}(0, I)$ as the prior, the encoder is pushed to produce a posterior that is close to a standard Gaussian. In practice, the encoder outputs a vector of means and a vector of (log) standard deviations, so each input is represented by one mean and one standard deviation per latent dimension.
This has two consequences:
- Continuity: Nearby points in latent space decode to similar outputs.
- Completeness: Every point sampled from $p(z)$ decodes to a plausible output.
Neither property is guaranteed for a plain autoencoder, whose training objective places no constraint on the latent space.
Two-Loss View
| Loss term | Role | Direction |
|---|---|---|
| Reconstruction | Force decoder to recover $x$ from $z$ | Minimise |
| KL regularisation | Keep $q_\phi(z \mid x)$ close to $p(z)$ | Minimise |
The tension between these two terms is what makes the VAE work. Too much weight on reconstruction: the latent space becomes unstructured. Too much weight on KL: the approximate posterior collapses onto the prior, so the latent code carries no information about the input and the model learns nothing useful (posterior collapse).
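The trade-off can be made explicit by putting a weight on the KL term. The sketch below assumes both loss terms have already been computed for a batch; the `beta` parameter and the function name are hypothetical, introduced only to illustrate the balance described above.

```python
def weighted_vae_loss(reconstruction_loss, kl_loss, beta=1.0):
    """Total loss to minimise. `beta` is a hypothetical weight on the KL term.

    If reconstruction_loss is the negative expected log-likelihood,
    beta = 1.0 gives the negative ELBO.
    """
    if beta < 0:
        raise ValueError("beta must be non-negative")
    return reconstruction_loss + beta * kl_loss


# Hypothetical per-batch values of the two terms.
recon, kl_term = 12.0, 3.0
for beta in (0.0, 1.0, 4.0):
    print(beta, weighted_vae_loss(recon, kl_term, beta))
```

With `beta = 0.0` the KL term is ignored and the objective is that of a plain autoencoder with a stochastic code. Large values of `beta` push the optimiser toward the collapsed regime.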