Variational Autoencoder (VAE)

A latent-variable generative model trained by maximizing a variational lower bound on the data likelihood, with an amortized encoder and the reparameterization trick.

A variational autoencoder (VAE) is a latent-variable generative model. A latent code $z$ is drawn from a fixed prior $p(z)=\mathcal{N}(0,I)$ and decoded into data through a conditional decoder $p_\theta(x\mid z)$. Exact maximum likelihood is intractable because the marginal $p_\theta(x)=\int p_\theta(x\mid z)\,p(z)\,dz$ has no closed form when the decoder is a neural network. The VAE introduces an encoder $q_\phi(z\mid x)$ that approximates the true posterior $p_\theta(z\mid x)$ and trains both networks jointly by variational inference. The encoder and decoder form a single forward path: an input is compressed into a low-dimensional code and then expanded back into a reconstruction, with the whole path trained end to end under one ELBO objective. The construction was introduced by Kingma and Welling in 2013 and remains a standard baseline for deep latent-variable modeling.

Intuition

Think of a VAE as a compressor with a bottleneck, except that the bottleneck holds a cloud of probability rather than a fixed string of bits. The encoder looks at a datum $x$ and answers “roughly which region of latent space does it fall in” with a mean and a width. The decoder draws one point from that cloud and tries to rebuild $x$. Training forces two things at once: the drawn point must reconstruct $x$ (the code has to carry information), yet the codes of all data taken together must cover the whole prior (no holes in latent space). The first makes the code useful; the second makes the space sampleable: after training, draw $z$ straight from $p(z)$, decode it, and you get a new datum.

Figure (interactive): from a 2D latent space to generated samples. The left panel is the latent space $z=(z_1,z_2)$, starting at $z=(0.00, 0.00)$; the right panel is the decoded output $p(x\mid z)$. Axis $z_1$ controls how spread out the output is; axis $z_2$ controls rotation and core density. Moving through the latent space morphs the output smoothly, which is the essence of a VAE: a continuous low-dimensional code that the decoder maps to data, so nearby $z$ decode to similar samples. (The decoder here is illustrative, not trained.)

The objective is the evidence lower bound (ELBO):

\log p_\theta(x)\;\ge\; \mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big] -\mathrm{KL}\!\big(q_\phi(z\mid x)\,\Vert\,p(z)\big).

The first term is a reconstruction term: codes sampled from the encoder should decode back to $x$. Here $\mathbb{E}_{q_\phi(z\mid x)}$ means “average under the code distribution the encoder produces for this $x$,” and $\log p_\theta(x\mid z)$ is the log-likelihood the decoder assigns to the true datum $x$; for a Gaussian decoder with fixed variance this reduces, up to a positive scale factor and an additive constant, to a negative squared reconstruction error. The second is a Kullback–Leibler penalty: $\mathrm{KL}(q_\phi(z\mid x)\Vert p(z))$ measures how far the per-sample posterior $q_\phi(z\mid x)$ strays from the prior $p(z)$, pulling each code cloud toward the unit Gaussian centered at the origin. The gap between $\log p_\theta(x)$ and the ELBO is exactly $\mathrm{KL}(q_\phi(z\mid x)\Vert p_\theta(z\mid x))$, so tightening the bound over $\phi$ also sharpens the posterior approximation.
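The link between a fixed-variance Gaussian decoder and squared error can be checked numerically. The sketch below is illustrative only: the function name `gaussian_log_likelihood`, the variance `sigma2`, and the toy vectors are assumptions, not part of the original text. It shows that the difference in log-likelihood between two reconstructions equals the difference in squared error scaled by $-1/(2\sigma^2)$, because the normalizing constant cancels.

```python
import math

def gaussian_log_likelihood(x, x_hat, sigma2=1.0):
    """log N(x; x_hat, sigma2 * I) for two equal-length lists of floats."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Normalizing constant (independent of x_hat) plus the scaled squared error.
    return -0.5 * d * math.log(2 * math.pi * sigma2) - sq_err / (2 * sigma2)

def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Toy values chosen for illustration.
x = [1.0, 2.0, 3.0]
good = [1.1, 1.9, 3.0]  # close reconstruction
bad = [0.0, 0.0, 0.0]   # poor reconstruction

ll_diff = gaussian_log_likelihood(x, good) - gaussian_log_likelihood(x, bad)
se_diff = squared_error(x, good) - squared_error(x, bad)

# With sigma2 = 1 the two printed numbers agree: ll_diff == -se_diff / 2.
print(ll_diff, -se_diff / 2)
```

Ranking reconstructions by this log-likelihood is therefore the same as ranking them by squared error; the choice of `sigma2` only sets how strongly reconstruction is weighted against the KL term.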

A concrete number: take a one-dimensional latent and suppose the encoder outputs $\mu=2,\ \sigma=0.5$ for some $x$. The KL between a one-dimensional Gaussian and the unit normal has the closed form $\tfrac12(\mu^2+\sigma^2-1-\log\sigma^2)$ (for a diagonal Gaussian, sum this over coordinates), which evaluates to $\tfrac12(4+0.25-1-\log 0.25)\approx 2.32$ nats. This is the price paid for moving the code away from the prior. If the encoder retreats to $\mu=0,\ \sigma=1$, the KL drops to zero, but the code then carries no information about $x$. Reconstruction and KL tug against each other along this axis.
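The arithmetic above is easy to reproduce. A minimal sketch, where the helper name `kl_to_unit_normal` is an assumption made for illustration:

```python
import math

def kl_to_unit_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) in nats, for a one-dimensional latent."""
    var = sigma ** 2
    return 0.5 * (mu ** 2 + var - 1.0 - math.log(var))

print(round(kl_to_unit_normal(2.0, 0.5), 2))  # 2.32
print(kl_to_unit_normal(0.0, 1.0))            # 0.0
```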

From evidence to the bound

The bound follows from a single identity. Multiplying and dividing inside the log marginal by $q_\phi(z\mid x)$ and applying Jensen’s inequality to the resulting expectation gives the ELBO directly; equivalently, the log evidence decomposes exactly as

\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big] -\mathrm{KL}\!\big(q_\phi(z\mid x)\,\Vert\,p(z)\big)}_{\text{ELBO}} \;+\;\mathrm{KL}\!\big(q_\phi(z\mid x)\,\Vert\,p_\theta(z\mid x)\big).

The final term, the KL gap between the approximate posterior $q_\phi(z\mid x)$ and the true posterior $p_\theta(z\mid x)$, is non-negative and unobservable (the true posterior is itself intractable), which is what makes the ELBO a lower bound rather than the evidence itself. Maximizing the ELBO over $\phi$ leaves the evidence unchanged and therefore shrinks this gap; maximizing it over $\theta$ raises a lower bound on the evidence. One quantity thus trains both networks: the encoder is pushed toward the true posterior, and the decoder toward higher likelihood. The identity also explains why maximizing the bound is safe: the bound differs from the evidence by a KL term that can never be negative, so the ELBO can never overshoot $\log p_\theta(x)$.
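The decomposition can be verified exactly in a toy model where every term has a closed form. The sketch below assumes a one-dimensional linear-Gaussian model that is not in the original text: prior $p(z)=\mathcal{N}(0,1)$, decoder $p(x\mid z)=\mathcal{N}(z, s^2)$, so that $p(x)=\mathcal{N}(0, 1+s^2)$ and $p(z\mid x)=\mathcal{N}\big(x/(1+s^2),\ s^2/(1+s^2)\big)$. The constant `S2` and all function names are hypothetical.

```python
import math

S2 = 0.5  # decoder noise variance s^2 (hypothetical value)

def log_evidence(x):
    # Marginal of the toy model: p(x) = N(0, 1 + S2).
    var = 1.0 + S2
    return -0.5 * math.log(2 * math.pi * var) - x ** 2 / (2 * var)

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians (v = variance).
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v):
    # E_q[log p(x|z)] - KL(q || p(z)) with q = N(m, v), both in closed form.
    recon = -0.5 * math.log(2 * math.pi * S2) - ((x - m) ** 2 + v) / (2 * S2)
    return recon - kl_gauss(m, v, 0.0, 1.0)

def posterior(x):
    # True posterior p(z|x) of the toy model: (mean, variance).
    return x / (1.0 + S2), S2 / (1.0 + S2)

x, m, v = 1.3, 0.4, 0.8  # an arbitrary datum and an arbitrary, imperfect q
mp, vp = posterior(x)
gap = kl_gauss(m, v, mp, vp)

print(log_evidence(x), elbo(x, m, v) + gap)  # the two values agree
print(log_evidence(x) - elbo(x, mp, vp))     # ~0 when q equals the true posterior
```

In a real VAE neither `log_evidence` nor `posterior` is available, which is exactly why the ELBO is optimized instead.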

Intuition

The two terms pull in opposite directions. Reconstruction rewards an encoder that packs distinct inputs into well-separated codes; the KL term rewards an encoder whose outputs blur into the prior. Balancing them yields a latent space that is both informative and smoothly sampleable.

Optimizing the ELBO requires gradients of an expectation over $q_\phi$, whose parameters are themselves being trained, and backpropagation cannot pass through a sampling step whose distribution depends on those parameters. The reparameterization trick removes the sampling from the gradient path: with a diagonal Gaussian encoder $q_\phi(z\mid x)=\mathcal{N}\big(\mu_\phi(x),\,\mathrm{diag}(\sigma_\phi(x)^2)\big)$, a sample is written

z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon, \qquad \epsilon\sim\mathcal{N}(0,I),

so the randomness lives in $\epsilon$ alone and $\nabla_\phi$ passes through $\mu_\phi,\sigma_\phi$ by the chain rule. Here $\mu_\phi(x)$ is the center of the code cloud, $\sigma_\phi(x)$ its width, $\odot$ an elementwise product, and $\epsilon$ a noise source independent of the parameters. Once the noise is externalized as $\epsilon$, gradients flow through the sampling as through any deterministic operation, training the whole path by ordinary backpropagation.
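A minimal sketch of the resulting pathwise gradient estimator, without any autodiff library. The toy objective $f(z)=z^2$, the function name `reparam_grads`, and the sample count are assumptions chosen so the exact answer is known: $\mathbb{E}[f(z)]=\mu^2+\sigma^2$, with gradients $2\mu$ and $2\sigma$.

```python
import random

def reparam_grads(mu, sigma, f_prime, n_samples=100_000, seed=0):
    """Pathwise Monte Carlo estimates of d/dmu and d/dsigma of E[f(z)],
    where z = mu + sigma * eps and eps ~ N(0, 1)."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # noise drawn independently of (mu, sigma)
        z = mu + sigma * eps       # deterministic in (mu, sigma) once eps is fixed
        df_dz = f_prime(z)
        g_mu += df_dz              # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps     # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

# Toy objective f(z) = z^2, so f'(z) = 2z.
g_mu, g_sigma = reparam_grads(2.0, 0.5, f_prime=lambda z: 2.0 * z)
print(g_mu, g_sigma)  # close to the exact gradients 2*mu = 4 and 2*sigma = 1
```

In practice an autodiff framework applies the same chain rule automatically once the sample is written as $\mu+\sigma\odot\epsilon$.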

Deep dive

The encoder is amortized: a single network maps any $x$ to its variational parameters, rather than optimizing a separate posterior per data point. This efficiency carries a cost, splittable into two gaps. The amortization gap is the difference between the ELBO achieved by the shared encoder and the best ELBO attainable by optimizing a free posterior for each $x$ independently; a network that must generalize across all inputs cannot match per-point optimization everywhere. The approximation gap is the error incurred because the chosen variational family — a diagonal Gaussian, say — may simply be unable to represent the true posterior; even with per-point optimization, a diagonal Gaussian cannot fit a correlated or multimodal posterior. The two compound to set how far the ELBO sits below the true evidence: a more expressive family (e.g. a normalizing-flow posterior) shrinks the approximation gap, while a stronger encoder or test-time posterior refinement shrinks the amortization gap.
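The two gaps can be separated explicitly in a toy setting. The sketch below assumes a one-dimensional linear-Gaussian model (prior $\mathcal{N}(0,1)$, decoder $\mathcal{N}(z, s^2)$), a deliberately restricted variational family with variance pinned at 1, and a deliberately misfit amortized encoder with mean `slope * x`. All of these choices and names are hypothetical and serve only to make both gaps nonzero and computable.

```python
import math

S2 = 0.5       # decoder noise variance s^2 (hypothetical value)
V_FIXED = 1.0  # restricted family: every q has variance 1 (hypothetical)

def log_evidence(x):
    # Marginal of the toy model: p(x) = N(0, 1 + S2).
    var = 1.0 + S2
    return -0.5 * math.log(2 * math.pi * var) - x ** 2 / (2 * var)

def elbo(x, m, v):
    # Closed-form ELBO for q = N(m, v) in the toy model.
    recon = -0.5 * math.log(2 * math.pi * S2) - ((x - m) ** 2 + v) / (2 * S2)
    kl = 0.5 * (m ** 2 + v - 1.0 - math.log(v))
    return recon - kl

def best_mean_in_family(x):
    # Per-point optimum within the family: the ELBO is maximized over the
    # mean at the true posterior mean x / (1 + S2).
    return x / (1.0 + S2)

def gaps(x, slope=0.3):
    """Return (approximation gap, amortization gap) for one datum."""
    best = elbo(x, best_mean_in_family(x), V_FIXED)  # free per-point fit
    amortized = elbo(x, slope * x, V_FIXED)          # shared, misfit encoder
    return log_evidence(x) - best, best - amortized

approx_gap, amort_gap = gaps(1.3)
print(approx_gap, amort_gap)                 # both positive
print(gaps(1.3, slope=1.0 / (1.0 + S2))[1])  # ~0: a perfect encoder closes it
```

The approximation gap here comes entirely from the pinned variance and would vanish if the family could also fit the posterior variance; the amortization gap comes entirely from the encoder's wrong slope.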

Posterior collapse and the $\beta$-VAE

A failure mode peculiar to the VAE is posterior collapse: for some latent coordinates the encoder reverts to the prior, $q_\phi(z\mid x)\approx p(z)$, so the KL term for those coordinates vanishes and they carry no information about $x$. This occurs when a sufficiently expressive decoder can reconstruct the data without consulting the latent code. An autoregressive decoder is the classic case: it fills in each output from what it has already generated and treats the latent as an optional side channel. The KL penalty is then the only active gradient for those dimensions, and it pushes them back to the prior. The reconstruction stays acceptable while the latent representation degrades into noise.
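One way to look for collapsed coordinates is to average the KL term per latent dimension over a batch and flag dimensions whose average is near zero. The sketch below is an assumption-laden illustration: the function names, the `threshold` of 0.01 nats, and the encoder outputs are all made up for the example and do not come from a trained model.

```python
import math

def per_dim_kl(mus, sigmas):
    """Average KL( q(z_j | x) || N(0, 1) ) per latent coordinate j.
    mus, sigmas: lists of per-example lists, shape [batch][latent_dim]."""
    n, d = len(mus), len(mus[0])
    totals = [0.0] * d
    for mu_row, sigma_row in zip(mus, sigmas):
        for j in range(d):
            var = sigma_row[j] ** 2
            totals[j] += 0.5 * (mu_row[j] ** 2 + var - 1.0 - math.log(var))
    return [t / n for t in totals]

def collapsed_dims(mus, sigmas, threshold=0.01):
    # threshold (in nats) is a hypothetical cutoff, not a standard value.
    return [j for j, kl in enumerate(per_dim_kl(mus, sigmas)) if kl < threshold]

# Made-up encoder outputs for three inputs and two latent coordinates:
# coordinate 0 moves with the input, coordinate 1 sits on the prior.
mus = [[1.5, 0.01], [-1.2, -0.02], [0.8, 0.00]]
sigmas = [[0.3, 1.0], [0.4, 0.99], [0.5, 1.01]]

print(collapsed_dims(mus, sigmas))  # [1]
```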

Reweighting the two terms exposes the trade-off directly. The $\beta$-VAE scales the KL term by a coefficient $\beta$,

\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big] -\beta\,\mathrm{KL}\!\big(q_\phi(z\mid x)\,\Vert\,p(z)\big),

recovering the standard ELBO at $\beta=1$. The coefficient $\beta$ is the weight on the KL term: $\beta>1$ strengthens the pull toward the prior, $\beta<1$ relaxes it. Larger $\beta$ tightens the latent toward the factorized prior, which tends to encourage disentanglement (individual coordinates aligning with independent factors of variation) at the expense of reconstruction fidelity. The KL term measures the rate, the average number of nats the code carries about $x$, and the reconstruction term measures the distortion. Varying $\beta$ traces out a rate–distortion curve: posterior collapse is the extreme of vanishing rate (large $\beta$ presses every coordinate back to the prior), while $\beta\to 0$ degenerates to an ordinary autoencoder with no latent constraint.
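The rate–distortion trade-off can be traced in closed form in a toy model. The sketch below assumes a one-dimensional linear-Gaussian setup (prior $\mathcal{N}(0,1)$, decoder $\mathcal{N}(z, s^2)$, Gaussian $q=\mathcal{N}(m, v)$), which is not part of the original text. Setting the derivatives of $\mathcal{L}_\beta$ with respect to $m$ and $v$ to zero gives $m = x/(1+\beta s^2)$ and $v = \beta s^2/(1+\beta s^2)$; the names `optimal_q`, `rate`, and `distortion` are hypothetical.

```python
import math

S2 = 0.5  # decoder noise variance s^2 (hypothetical value)

def optimal_q(x, beta):
    # Maximizer of E_q[log p(x|z)] - beta * KL(q || N(0, 1)) over q = N(m, v).
    m = x / (1.0 + beta * S2)
    v = beta * S2 / (1.0 + beta * S2)
    return m, v

def rate(m, v):
    # KL( N(m, v) || N(0, 1) ) in nats.
    return 0.5 * (m ** 2 + v - 1.0 - math.log(v))

def distortion(x, m, v):
    # Expected squared reconstruction error E_q[(x - z)^2].
    return (x - m) ** 2 + v

x = 1.3
for beta in (0.1, 1.0, 10.0, 100.0):
    m, v = optimal_q(x, beta)
    # Rate falls and distortion rises as beta grows.
    print(beta, round(rate(m, v), 4), round(distortion(x, m, v), 4))
```

At large `beta` the optimum approaches $m=0,\ v=1$, which is the collapsed solution; at small `beta` it approaches $m=x,\ v=0$, a deterministic autoencoder code.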

Where it sits in Cryo-ET reconstruction

The VAE is the entry point for recasting missing-wedge restoration as a generative problem. It supplies three parts that later methods reuse repeatedly: amortized inference of a latent through the encoder, the ELBO that swaps an intractable likelihood for an optimizable lower bound, and reparameterization that makes a stochastic path differentiable. But the VAE’s per-sample KL inflates each posterior toward the prior and tends to blur reconstructions — a real problem when the goal is high-resolution structure. The later routes branch from exactly here: the Wasserstein autoencoder replaces the per-sample KL with a divergence on the aggregated posterior, often producing sharper samples and connecting to the optimal-transport route taken by CryoGEN-II (WAE/OT for a stable single answer); and the need to upgrade a single point reconstruction into a family of posterior reconstructions points to CryoWGEN. Reading the ELBO and its two gaps is what lets you see which constraint each of these methods is loosening.
