Contents
  1. The Core Idea
  2. The ELBO
  3. The Reparameterisation Trick
  4. Latent Space Structure
  5. Two-Loss View

VAEs: ELBO, Reparameterisation, and the Latent Space

The VAE replaces the autoencoder's point estimate with a distribution over the latent space. Training is done by maximising the ELBO, which balances reconstruction quality against regularisation of the latent space.

The Core Idea

A plain autoencoder maps an input $x$ to a fixed code $h$. A Variational Autoencoder (VAE) maps $x$ to a distribution over the latent space, from which a code $z$ is sampled. This makes the latent space continuous and structured, which enables generation.

The encoder produces the parameters of a Gaussian distribution: $q_\phi(z|x) = \mathcal{N}(\mu, \sigma^2)$, where the mean $\mu$ and variance $\sigma^2$ are functions of $x$. The decoder then reconstructs the input from a sample $z \sim q_\phi(z|x)$.

The ELBO

The true posterior $p_\theta(z|x)$ is intractable. Instead, we approximate it with $q_\phi(z|x)$ and train by maximising the Evidence Lower BOund (ELBO):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left[q_\phi(z|x) \,\|\, p(z)\right]$$

The two terms have clear roles:

  • Reconstruction: $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is the expected log-likelihood of the input under the decoder. It measures how well the decoder recovers $x$ from a sampled $z$; its negative is the reconstruction loss.
  • Regularisation: $D_{KL}[q_\phi(z|x) \,\|\, p(z)]$ pushes the approximate posterior toward the prior $p(z) = \mathcal{N}(0, I)$.

Maximising the ELBO with respect to $\phi$ brings $q_\phi(z|x)$ closer to the true posterior $p_\theta(z|x)$. The KL term is what enforces structure on the latent space: smoothness, continuity, and agreement with a specific distribution (in practice, the standard normal).
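For a diagonal Gaussian posterior and a standard normal prior, the KL term has a well-known closed form, $\frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right)$. The following is a minimal sketch in plain Python, assuming the encoder outputs a mean and a log standard deviation per latent dimension. The function name `kl_to_standard_normal` and the example values are illustrative, not taken from any library.

```python
import math

def kl_to_standard_normal(mu, log_std):
    """KL[ N(mu, diag(sigma^2)) || N(0, I) ], summed over latent dimensions.

    Assumes sigma = exp(log_std), so sigma^2 = exp(2 * log_std).
    """
    return 0.5 * sum(
        m * m + math.exp(2.0 * s) - 1.0 - 2.0 * s
        for m, s in zip(mu, log_std)
    )

# The KL is zero when the posterior equals the prior (mu = 0, sigma = 1).
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# It grows as the mean moves away from 0 (here sigma stays at 1).
kl_shifted = kl_to_standard_normal([1.0, -2.0], [0.0, 0.0])
```

With unit variance, only the squared means contribute, so `kl_shifted` is half the sum of the squared means.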

The derivation connects to KL divergence:

$$D_{KL}[q_\phi(z|x) \,\|\, p_\theta(z|x)] = -\sum_z q_\phi(z|x) \log p_\theta(z, x) + \sum_z q_\phi(z|x) \log q_\phi(z|x) + \log p_\theta(x)$$

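The first two terms on the right are $-\mathcal{L}$, because $p_\theta(z, x) = p_\theta(x|z)\,p(z)$. Rearranging: $\log p_\theta(x) = \mathcal{L} + D_{KL}[q_\phi(z|x) \,\|\, p_\theta(z|x)]$. Since the KL divergence is non-negative, the ELBO is a lower bound on the log evidence $\log p_\theta(x)$, and the bound is tight when $q_\phi(z|x)$ equals the true posterior.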
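The identity can be checked numerically on a model small enough to enumerate. The sketch below uses a hypothetical toy setup (a binary latent, one fixed observation, hand-picked probabilities), so every number in it is an assumption made for illustration.

```python
import math

# Hypothetical toy model: binary latent z, one fixed observation x.
prior = [0.5, 0.5]        # p(z)
likelihood = [0.9, 0.2]   # p(x | z) for the observed x
q = [0.7, 0.3]            # an arbitrary approximate posterior q(z | x)

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(z, x)
evidence = sum(joint)                                    # p(x)
posterior = [j / evidence for j in joint]                # p(z | x)
log_evidence = math.log(evidence)

def kl(a, b):
    """KL[a || b] for two discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def elbo(dist):
    """E_dist[log p(x|z)] - KL[dist || p(z)]."""
    expected_log_lik = sum(d * math.log(l) for d, l in zip(dist, likelihood))
    return expected_log_lik - kl(dist, prior)

elbo_q = elbo(q)                     # strictly below log p(x) for this q
gap = kl(q, posterior)               # the slack in the bound
elbo_at_posterior = elbo(posterior)  # bound is tight when q is the posterior
```

Here `elbo_q + gap` matches `log_evidence` up to floating-point error, and the gap vanishes when `q` is replaced by the exact posterior.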

The Reparameterisation Trick

Sampling $z \sim q_\phi(z|x)$ is not a differentiable operation, so gradients cannot flow through the sampler back to $\phi$. The reparameterisation trick resolves this by rewriting the sample as a deterministic function of the parameters plus independent noise:

$$z = \mu_\phi(x) + \sigma_\phi(x) \cdot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$

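The stochasticity is now in $\varepsilon$, which does not depend on $\phi$, so gradients flow through $\mu_\phi$ and $\sigma_\phi$ as normal. Writing $z = g(\phi, \varepsilon)$, the expectation can be taken over $p(\varepsilon)$ instead of $q_\phi(z|x)$, and by Leibniz's rule the gradient can then be moved inside the expectation:

$$\nabla_\phi \, \mathbb{E}_{q_\phi(z|x)}[f(z)] = \nabla_\phi \, \mathbb{E}_{p(\varepsilon)}[f(g(\phi, \varepsilon))] = \mathbb{E}_{p(\varepsilon)}[\nabla_\phi f(g(\phi, \varepsilon))]$$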

Gradient estimators that do not use reparameterisation, such as the score-function (REINFORCE) estimator, are still unbiased but typically have much higher variance, which makes stochastic optimisation slow or unstable.
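The variance gap can be seen on a toy problem. The sketch below assumes a one-dimensional $q = \mathcal{N}(\mu, \sigma^2)$ and the objective $f(z) = z^2$, for which the exact gradient of $\mathbb{E}[f(z)]$ with respect to $\mu$ is $2\mu$. It compares the reparameterised (pathwise) estimator with the score-function estimator $f(z)\,\nabla_\mu \log q(z)$. The objective, the parameter values, and the sample count are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)
mu, sigma, n = 1.0, 1.0, 20_000

def f(z):
    return z * z  # toy objective; d/dmu E[f(z)] = 2 * mu exactly

reparam, score = [], []
for _ in range(n):
    eps = random.gauss(0.0, 1.0)
    z = mu + sigma * eps  # reparameterised sample
    # Pathwise estimator: differentiate f(mu + sigma * eps) w.r.t. mu.
    reparam.append(2.0 * z)
    # Score-function estimator: f(z) * d/dmu log N(z; mu, sigma^2).
    score.append(f(z) * (z - mu) / sigma**2)

reparam_mean = statistics.fmean(reparam)
reparam_var = statistics.variance(reparam)
score_mean = statistics.fmean(score)
score_var = statistics.variance(score)
```

Both sample means estimate the same gradient, $2\mu = 2$. For this particular toy case the analytic per-sample variances are 4 for the pathwise estimator and 30 for the score-function estimator, so the empirical variances should differ by roughly that factor.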

Latent Space Structure

The regularisation term controls the geometry of the latent space. With $p(z) = \mathcal{N}(0, I)$ as the prior, the encoder is pushed to produce a posterior $q_\phi(z|x)$ that is close to a standard Gaussian. In practice, the encoder outputs a vector of means and a vector of (log) standard deviations. Each input is therefore encoded not as a single point but as a mean and a standard deviation per latent dimension.

This has two consequences:

  1. Continuity: Nearby points in latent space decode to similar outputs.
  2. Completeness: Points sampled from the prior $p(z)$ should decode to plausible outputs.

Neither property is guaranteed for a plain autoencoder, whose latent space is not regularised.

Two-Loss View

| Loss term | Role | Direction |
| --- | --- | --- |
| Reconstruction | Force decoder to recover $x$ from $z$ | Minimise |
| KL regularisation | Keep $q_\phi(z \mid x)$ close to $\mathcal{N}(0, I)$ | Minimise |

The tension between these two terms is what makes the VAE work. Too much weight on reconstruction: the latent space becomes unstructured, as in a plain autoencoder. Too much weight on KL: the approximate posterior collapses to the prior, so $z$ carries no information about the input and the decoder learns to ignore it (posterior collapse).
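One way to make the trade-off explicit is to put a weight on the KL term. The sketch below is a per-example loss under stated assumptions: squared error stands in for the negative log-likelihood (as for a Gaussian decoder with fixed unit variance, up to a constant), the posterior is a diagonal Gaussian parameterised by mean and log standard deviation, and `beta` is a hypothetical weight, with `beta = 1` corresponding to the negative ELBO. The function name and the example values are illustrative.

```python
import math

def vae_loss(x, x_hat, mu, log_std, beta=1.0):
    """Return (total, reconstruction, kl) for one example.

    reconstruction: 0.5 * squared error between x and its reconstruction x_hat
    kl:             KL[ N(mu, diag(sigma^2)) || N(0, I) ], sigma = exp(log_std)
    total:          reconstruction + beta * kl
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    kl = 0.5 * sum(
        m * m + math.exp(2.0 * s) - 1.0 - 2.0 * s
        for m, s in zip(mu, log_std)
    )
    return recon + beta * kl, recon, kl

# Same encoder/decoder outputs scored under two KL weights.
x, x_hat = [1.0, 0.0], [0.8, 0.1]
mu, log_std = [1.0], [0.0]
total_b1, recon, kl = vae_loss(x, x_hat, mu, log_std, beta=1.0)
total_b4, _, _ = vae_loss(x, x_hat, mu, log_std, beta=4.0)
```

Raising `beta` leaves both terms unchanged for a given example but makes the KL term dominate the total, which is the pressure that, taken too far, drives the posterior toward the prior.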
