VAEs: ELBO, Reparameterisation, and the Latent Space

The Core Idea

A plain autoencoder maps input $x$ to a fixed code $h$. A Variational Autoencoder (VAE) maps $x$ to a distribution over the latent space, from which a code $z$ is sampled. This makes the latent space continuous and structured, which enables generation.

The encoder produces the parameters of a distribution, typically a diagonal Gaussian: $q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x)))$. The decoder then reconstructs $x$ from a sample $z \sim q_\phi(z|x)$.

The ELBO

The true posterior $p_\theta(z|x)$ is intractable. Instead, we approximate it with $q_\phi(z|x)$ and train by maximising the Evidence Lower BOund (ELBO):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left[q_\phi(z|x) \| p(z)\right]$$

The two terms have clear roles:

Reconstruction: $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is the expected log-likelihood of the input under the decoder. It measures how well the decoder recovers the input from a sampled $z$; its negative is the reconstruction loss.
Regularisation: $D_{KL}[q_\phi(z|x) \| p(z)]$ pushes the approximate posterior toward the prior $p(z) = \mathcal{N}(0, I)$.

For fixed decoder parameters $\theta$, maximising the ELBO over $\phi$ brings $q_\phi(z|x)$ closer to the true posterior $p_\theta(z|x)$. The KL term is what enforces structure on the latent space: smoothness, continuity, and a specific target distribution (in practice, the standard normal).
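For a diagonal Gaussian posterior and a standard normal prior, the KL term has a well-known closed form, $\tfrac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - 2\log\sigma_j)$. The following is a minimal sketch in plain Python, assuming the means and standard deviations are given as lists; the function name is illustrative and not taken from any library.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)], summed over latent dimensions.

    Assumes every entry of `sigma` is strictly positive.
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

# The penalty vanishes when the posterior already equals the prior...
print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
# ...and grows as the mean moves away from the origin.
print(kl_diag_gaussian_to_standard_normal([1.0, 0.0], [1.0, 1.0]))  # 0.5
```

The value is zero only when every mean is 0 and every standard deviation is 1, which is why this term pulls each posterior toward the prior.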

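The bound follows from expanding the KL divergence between the approximate and the true posterior, using $p_\theta(z|x) = p_\theta(z, x) / p_\theta(x)$:

$$D_{KL}[q_\phi(z|x) \| p_\theta(z|x)] = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(z, x)] + \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)] + \log p_\theta(x)$$

Because $\log p_\theta(z, x) = \log p_\theta(x|z) + \log p(z)$, the first two terms on the right sum to $-\mathcal{L}$. Rearranging: $\log p_\theta(x) = \mathcal{L} + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)]$. Since KL $\geq 0$, the ELBO is a lower bound on the log evidence.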
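The decomposition can be checked numerically on a toy model in which the latent variable takes only three values, so every expectation is a finite sum. This is a sketch under that assumption; the probabilities below are made-up illustrative numbers, not values from any trained model.

```python
import math

# Toy model: discrete latent z in {0, 1, 2} and one fixed observation x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.60, 0.30]  # p(x | z), evaluated at the observed x
q = [0.2, 0.5, 0.3]              # an arbitrary approximate posterior q(z | x)

joint = [p * l for p, l in zip(prior, likelihood)]  # p(z, x)
evidence = sum(joint)                               # p(x)
posterior = [j / evidence for j in joint]           # true posterior p(z | x)

# ELBO written as E_q[log p(z, x) - log q(z | x)].
elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

# The same ELBO written as reconstruction term minus KL to the prior.
reconstruction = sum(qz * math.log(l) for qz, l in zip(q, likelihood))
kl_to_prior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))

# Gap between the log evidence and the ELBO.
kl_to_posterior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

print("log p(x)       :", math.log(evidence))
print("ELBO + KL gap  :", elbo + kl_to_posterior)
print("recon - KL     :", reconstruction - kl_to_prior)
```

The first two printed values agree up to floating-point error, and the third matches the ELBO itself. Setting `q` equal to `posterior` closes the gap and makes the bound tight.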

The Reparameterisation Trick

Sampling $z \sim q_\phi(z|x)$ is not differentiable, so gradients cannot flow through the sampler back to $\phi$. The reparameterisation trick resolves this by rewriting the sample as a deterministic function of the parameters plus independent noise:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$
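As a minimal sketch of this sampling step, the function below draws one latent vector from given per-dimension means and standard deviations. It uses plain Python lists in place of tensors, and the names `sample_latent`, `mu`, and `sigma` are illustrative assumptions.

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma = [1.0, -2.0], [0.5, 0.1]
samples = [sample_latent(mu, sigma, rng) for _ in range(20000)]

# Empirical mean and standard deviation of the first latent dimension.
first = [z[0] for z in samples]
mean_0 = sum(first) / len(first)
std_0 = (sum((v - mean_0) ** 2 for v in first) / len(first)) ** 0.5
print(mean_0, std_0)
```

The empirical statistics should land close to the requested `mu[0]` and `sigma[0]`, which confirms that shifting and scaling standard normal noise reproduces the intended Gaussian.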

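The stochasticity is now in $\varepsilon$, which does not depend on $\phi$, so gradients flow through $\mu_\phi$ and $\sigma_\phi$ as normal. Writing $z = g(\phi, \varepsilon)$, the expectation over $q_\phi$ becomes an expectation over $p(\varepsilon)$. Because $p(\varepsilon)$ does not depend on $\phi$, the gradient can then be moved inside the expectation (Leibniz’s rule for differentiating under the integral sign):

$$\nabla_\phi \, \mathbb{E}_{q_\phi(z|x)}[f(z)] = \nabla_\phi \, \mathbb{E}_{p(\varepsilon)}[f(g(\phi, \varepsilon))] = \mathbb{E}_{p(\varepsilon)}[\nabla_\phi f(g(\phi, \varepsilon))]$$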

Without reparameterisation, the usual fallback is a score-function (REINFORCE) estimator of the gradient. It is unbiased, but its variance is typically much higher, which makes stochastic optimisation slow to converge.
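The variance difference can be seen on a one-dimensional toy problem. The sketch below assumes $f(z) = z^2$ and $z \sim \mathcal{N}(\mu, \sigma^2)$, for which $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ and the exact gradient with respect to $\mu$ is $2\mu$. It compares the reparameterised estimator with the score-function (REINFORCE) estimator; the choice of $f$ and all names are illustrative assumptions.

```python
import random

def gradient_estimates(mu, sigma, n_samples, rng):
    """Per-sample estimates of d/dmu E[z^2] for z ~ N(mu, sigma^2)."""
    pathwise, score = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        # Reparameterised estimator: df/dz * dz/dmu = 2z * 1.
        pathwise.append(2.0 * z)
        # Score-function estimator: f(z) * d log q(z) / dmu.
        score.append(z * z * (z - mu) / sigma ** 2)
    return pathwise, score

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
pathwise, score = gradient_estimates(mu=1.5, sigma=0.5, n_samples=20000, rng=rng)

print("exact gradient     :", 2 * 1.5)
print("pathwise mean, var :", mean(pathwise), variance(pathwise))
print("score    mean, var :", mean(score), variance(score))
```

Both estimators average to roughly the exact gradient, but on this toy problem the score-function estimates are far more spread out, so many more samples are needed for the same accuracy.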

Latent Space Structure

The regularisation term controls the geometry of the latent space. With $p(z) = \mathcal{N}(0, I)$ as the prior, the encoder is pushed to keep each posterior $q_\phi(z|x)$ close to a standard Gaussian. In practice, the encoder outputs a vector of means and a vector of (log) standard deviations, so each input is represented by a mean and a standard deviation per latent dimension rather than by a single point.
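One reason for predicting log standard deviations is that exponentiating them always yields a positive value, so the network output needs no constraint. The sketch below assumes a hypothetical encoder whose raw output concatenates the means and the log standard deviations; that layout and the function name are assumptions for illustration only.

```python
import math

def split_encoder_output(raw, latent_dim):
    """Split a raw encoder output into means and positive standard deviations.

    Assumes `raw` holds `latent_dim` means followed by `latent_dim`
    log standard deviations.
    """
    if len(raw) != 2 * latent_dim:
        raise ValueError("expected 2 * latent_dim values")
    mu = raw[:latent_dim]
    log_sigma = raw[latent_dim:]
    sigma = [math.exp(v) for v in log_sigma]  # exp keeps sigma > 0
    return mu, sigma

mu, sigma = split_encoder_output([0.3, -1.2, 0.0, -2.0], latent_dim=2)
print(mu, sigma)
```

Some implementations predict the log variance instead, in which case the standard deviation would be recovered as `exp(0.5 * log_var)`.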

This has two consequences:

Continuity: Nearby points in latent space tend to decode to similar outputs.
Completeness: Points sampled from $p(z)$ tend to decode to plausible outputs.

Neither property is guaranteed for a plain autoencoder, whose codes are not regularised toward any prior.
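A common way to probe continuity is to walk along a straight line between two latent codes and decode each point. The sketch below covers only the interpolation step; the decoder itself is not shown, and the function name is an illustrative assumption.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # runs from 0.0 to 1.0 inclusive
        points.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return points

path = interpolate_latents([0.0, 1.0], [2.0, -1.0], steps=5)
for z in path:
    print(z)  # each z would be passed to a trained decoder
```

If the latent space is continuous in the sense described above, decoding consecutive points on this path should give outputs that change gradually rather than abruptly.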

Two-Loss View

Loss term	Role	Direction
Reconstruction	Force decoder to recover $x$ from $z$	Minimise
KL regularisation	Keep $q_\phi(z|x)$ close to $\mathcal{N}(0,I)$	Minimise

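The tension between these two terms is what makes the VAE work. Too much weight on reconstruction: the KL penalty is effectively ignored and the latent space becomes unstructured, as in a plain autoencoder. Too much weight on KL: $q_\phi(z|x)$ collapses to the prior, so $z$ carries no information about the input and the model learns nothing useful (posterior collapse).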
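The trade-off can be made concrete with a one-dimensional toy model that has a closed-form optimum. Assume a fixed decoder $p(x|z) = \mathcal{N}(x; z, s^2)$, a prior $\mathcal{N}(0, 1)$, a Gaussian posterior $\mathcal{N}(\mu, \sigma^2)$, and a hypothetical weight $\beta$ on the KL term (the weight is not part of the objective as stated above; $\beta = 1$ recovers it). Minimising $\frac{(x-\mu)^2 + \sigma^2}{2s^2} + \beta \, D_{KL}$ gives $\mu = x / (1 + \beta s^2)$ and $\sigma^2 = \beta s^2 / (1 + \beta s^2)$. This is a sketch under those assumptions.

```python
import math

def optimal_posterior(x, beta, decoder_var=1.0):
    """Closed-form optimum of the weighted toy objective.

    Toy setting (an assumption for illustration): decoder N(x; z, decoder_var),
    prior N(0, 1), posterior N(mu, sigma^2), KL term weighted by beta > 0.
    """
    mu = x / (1.0 + beta * decoder_var)
    sigma = math.sqrt(beta * decoder_var / (1.0 + beta * decoder_var))
    return mu, sigma

for beta in (0.01, 1.0, 100.0):
    mu, sigma = optimal_posterior(x=2.0, beta=beta)
    print(f"beta={beta:g}  mu={mu:.3f}  sigma={sigma:.3f}")
```

With a small weight the posterior shrinks to a near-deterministic code centred on the input, as in a plain autoencoder. With a large weight it approaches $\mathcal{N}(0, 1)$ regardless of $x$, which is posterior collapse. At $\beta = 1$ the optimum coincides with the true posterior of this toy model, $\mathcal{N}(x/2, 1/2)$.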