Variational inference & the ELBO

Approximating an intractable posterior by optimizing over a tractable family, with the evidence lower bound as the objective.

Big picture first

The exact posterior $p(z\mid x)$ is usually impossible to compute — the integral in its denominator is intractable. Variational inference takes a different route: instead of solving for the posterior exactly, pick the closest match out of a family of simple, easy-to-handle distributions. “Computing the posterior” then becomes “tuning a few parameters until two distributions are as close as possible” — an optimization problem. The rest of this page makes that idea precise.

Variational inference turns posterior computation into optimization. Given a latent-variable model $p(x,z)=p(x\mid z)\,p(z)$, the posterior

$$p(z\mid x) = \frac{p(x,z)}{p(x)},\qquad p(x)=\int p(x,z)\,dz,$$

is usually intractable because the evidence $p(x)$ requires marginalizing over all latents. Here $x$ is the observation (in Cryo-ET, a set of noisy, missing-wedge projections), $z$ is the latent we want to infer (the 3D structure to reconstruct, the pose, or a latent code), $p(x\mid z)$ is the likelihood encoding the imaging physics, and $p(z)$ is the prior. The trouble is that integral: $z$ is often a voxel grid with hundreds of thousands of dimensions, so marginalizing over it is numerically hopeless, and neither $p(x)$ nor $p(z\mid x)$ has a closed form.

Variational inference replaces the true posterior with a tractable approximation $q(z\mid x)$ drawn from a chosen family, and fits $q$ by maximizing a lower bound on $\log p(x)$. “Tractable” means we can sample from $q$ and evaluate its density. The canonical choice is a mean-field Gaussian $q(z\mid x)=\prod_i \mathcal{N}(z_i;\mu_i,\sigma_i^2)$, with one mean and one variance per latent. Fitting $q$ is just optimizing those $(\mu_i,\sigma_i)$.
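The two operations that make a mean-field Gaussian "tractable" can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the example values of `mu` and `sigma` are hypothetical, chosen for the sketch rather than taken from any particular model.

```python
import math
import random

def sample_mean_field(mu, sigma, rng):
    """Draw one z from q(z) = prod_i N(z_i; mu_i, sigma_i^2)."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]

def log_density_mean_field(z, mu, sigma):
    """Evaluate log q(z) as a sum of per-coordinate Gaussian log-densities."""
    total = 0.0
    for zi, m, s in zip(z, mu, sigma):
        total += -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((zi - m) / s) ** 2
    return total

rng = random.Random(0)
mu = [0.0, 1.5, -2.0]      # hypothetical variational means
sigma = [1.0, 0.5, 2.0]    # hypothetical variational standard deviations

z = sample_mean_field(mu, sigma, rng)                    # sampling from q
log_q_at_z = log_density_mean_field(z, mu, sigma)        # density evaluation at the sample
log_q_at_mean = log_density_mean_field(mu, mu, sigma)    # density evaluation at the mode
```

Because the density factorizes, both operations cost time linear in the number of latents, which is what keeps this family usable when $z$ has hundreds of thousands of dimensions.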

The expressiveness of the chosen family caps the quality of the approximation: when the true posterior is multimodal but $q$ is restricted to a single Gaussian, the two cannot coincide, and the fit settles on one region of the posterior. Concretely, minimizing $\mathrm{KL}(q\,\|\,p)$, the direction variational inference uses by default, is mode-seeking: it would rather shrink $q$ into a single peak than straddle the valley between two, because placing $q$-mass where $p$ is near zero makes $\log(q/p)$ blow up. So a single-Gaussian $q$ locks onto one mode rather than averaging the two.

The interactive panel below fixes a bimodal target distribution and exposes the mean and standard deviation of a single Gaussian $q$, making the trade-off between the expected log target, the entropy, and the resulting ELBO concrete. Slide the mean onto either peak and shrink the standard deviation, and the ELBO rises; try to cover both peaks at once with a wide, high-variance Gaussian and the ELBO instead drops, because $q$ wastes probability mass in the low-density valley between the peaks.

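[Interactive panel: a bimodal target p and a single-Gaussian approximation q, with sliders for the mean μ and standard deviation σ of q. Example readout at μ = -0.40, σ = 2.60: expected log target E_q[log p̃] = -3.286, entropy H[q] = 2.374, ELBO = -0.912, KL proxy (log Z − ELBO) = 0.912.]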

A single Gaussian q cannot cover both peaks at once: spanning the low-density valley drags down the expected log target. Maximizing the ELBO drives q to lock onto one mode — the mode-seeking signature of reverse-KL variational inference.
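The same comparison can be reproduced offline. The sketch below assumes a hypothetical normalized target (an equal mixture of two Gaussians at -2 and +2 with standard deviation 0.6); the panel's actual target is not specified on this page, so only the qualitative behavior is expected to match. The expectation is computed by trapezoid quadrature and the entropy in closed form.

```python
import math

def log_normal(v, mean, std):
    """Log-density of N(mean, std^2) at v."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((v - mean) / std) ** 2

def log_target(z):
    """Log of a hypothetical normalized bimodal target with peaks at -2 and +2."""
    a = math.log(0.5) + log_normal(z, -2.0, 0.6)
    b = math.log(0.5) + log_normal(z, 2.0, 0.6)
    top = max(a, b)  # log-sum-exp keeps the tails numerically stable
    return top + math.log(math.exp(a - top) + math.exp(b - top))

def elbo_terms(mu, sigma, n=4001, width=8.0):
    """Return (E_q[log p], H[q], ELBO) for q = N(mu, sigma^2)."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n - 1)
    expected = 0.0
    for i in range(n):
        z = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        expected += weight * math.exp(log_normal(z, mu, sigma)) * log_target(z) * step
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * sigma * sigma)
    return expected, entropy, expected + entropy

on_peak = elbo_terms(2.0, 0.6)      # narrow q sitting on one mode
straddling = elbo_terms(0.0, 2.6)   # wide q trying to cover both modes
```

With this target, `on_peak` has the higher ELBO: the wide Gaussian gains entropy but loses more in expected log target. The entropy of a Gaussian depends only on σ, so for σ = 2.60 it evaluates to about 2.374 regardless of the target, in agreement with the panel readout above; the other panel values depend on the panel's own target and are not reproduced here.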

That bound is the evidence lower bound (ELBO). Starting from Jensen’s inequality,

$$\log p(x) \;\ge\; \mathbb{E}_{q(z\mid x)}\!\big[\log p(x,z)\big] - \mathbb{E}_{q(z\mid x)}\!\big[\log q(z\mid x)\big] \;=\; \mathcal{L}(q).$$

Term by term: the first, $\mathbb{E}_{q}[\log p(x,z)]$, averages the joint log-density under $q$. To estimate it, draw a few $z$ from $q$, plug them into $\log p(x,z)$, and average; no intractable integral is involved. The second, $-\mathbb{E}_{q}[\log q(z\mid x)]$, is the entropy of $q$, which rewards $q$ for staying spread out instead of collapsing to a point. Their sum is the ELBO, written $\mathcal{L}(q)$. Both terms can be estimated by sampling, which is exactly why the ELBO can serve as an optimization objective.
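A sampling estimate of the bound can be checked on a toy model where $\log p(x)$ is known. The sketch below assumes a hypothetical one-dimensional conjugate model, $z\sim\mathcal{N}(0,1)$ and $x\mid z\sim\mathcal{N}(z,1)$, for which $p(x)=\mathcal{N}(x;0,2)$ and the exact posterior is $\mathcal{N}(x/2,\,1/2)$. The model, the observation value, and the function names are illustrative assumptions.

```python
import math
import random

def log_normal(v, mean, std):
    """Log-density of N(mean, std^2) at v."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((v - mean) / std) ** 2

def log_joint(x, z):
    """log p(x, z) for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

def elbo_estimate(x, m, s, n_samples, rng):
    """Monte Carlo ELBO for q = N(m, s^2): average of log p(x, z) - log q(z)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(m, s)                       # draw z from q
        total += log_joint(x, z) - log_normal(z, m, s)
    return total / n_samples

rng = random.Random(0)
x = 1.2                                           # hypothetical observation
log_evidence = log_normal(x, 0.0, math.sqrt(2.0)) # exact log p(x) for this model

elbo_rough = elbo_estimate(x, 0.0, 1.0, 20000, rng)                  # q = prior
elbo_exact_q = elbo_estimate(x, x / 2.0, math.sqrt(0.5), 20000, rng) # q = posterior
```

With $q$ set to the prior, the estimate sits visibly below `log_evidence`. With $q$ set to the exact posterior, every sample contributes exactly $\log p(x)$, so the estimate matches `log_evidence` up to floating-point error.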

The gap between $\log p(x)$ and the ELBO is exactly $\mathrm{KL}\!\big(q(z\mid x)\,\|\,p(z\mid x)\big)\ge 0$. Equivalently,

$$\log p(x) = \mathcal{L}(q) + \mathrm{KL}\!\big(q(z\mid x)\,\|\,p(z\mid x)\big).$$

Since the left-hand side $\log p(x)$ does not depend on $q$, raising the ELBO and lowering the KL are two sides of one move: maximizing the bound drives $q$ toward the true posterior while simultaneously tightening an estimate of the evidence. When $q$ equals the true posterior the KL is zero, the bound is tight, and the ELBO reaches $\log p(x)$ itself.
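The decomposition itself takes only a few lines to verify. A short derivation, using nothing beyond Bayes' rule $p(z\mid x)=p(x,z)/p(x)$ and the fact that $\log p(x)$ is constant under the expectation over $z$:

```latex
\begin{aligned}
\mathrm{KL}\big(q(z\mid x)\,\|\,p(z\mid x)\big)
  &= \mathbb{E}_{q}\big[\log q(z\mid x) - \log p(z\mid x)\big] \\
  &= \mathbb{E}_{q}\big[\log q(z\mid x) - \log p(x,z) + \log p(x)\big] \\
  &= \log p(x) - \underbrace{\Big(\mathbb{E}_{q}[\log p(x,z)] - \mathbb{E}_{q}[\log q(z\mid x)]\Big)}_{\mathcal{L}(q)} .
\end{aligned}
```

Rearranging the last line gives $\log p(x)=\mathcal{L}(q)+\mathrm{KL}$, and non-negativity of the KL divergence recovers the lower bound without a separate appeal to Jensen's inequality.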

Intuition

Rewriting the ELBO exposes its two competing terms:

$$\mathcal{L}(q)=\underbrace{\mathbb{E}_{q}[\log p(x\mid z)]}_{\text{reconstruction}} -\underbrace{\mathrm{KL}\!\big(q(z\mid x)\,\|\,p(z)\big)}_{\text{regularization}}.$$

The first term rewards latents that explain the observation: it asks “if I draw $z$ from $q$ and pass it through the imaging model $p(x\mid z)$, does it look like the $x$ I actually saw?” The second penalizes departures of $q$ from the prior, pulling $q$ back toward “what it should look like before seeing any data.” Turn the regularizer up and $q$ stays conservative, close to the prior; turn it down and $q$ fits this one observation more aggressively and overfits noise more easily. The objective balances fitting the data against staying close to the prior, which is exactly why it helps in Cryo-ET: a single tilt series carries limited information, and the prior term covers for what is missing.
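Both forms of the ELBO, and the effect of turning the regularizer up or down, can be checked in closed form on a toy model. The sketch assumes a hypothetical one-dimensional model, $z\sim\mathcal{N}(0,1)$ and $x\mid z\sim\mathcal{N}(z,1)$, with $q=\mathcal{N}(m,s^2)$. The weight `beta` on the KL term is an illustrative knob that does not appear in the formula above; `beta = 1` recovers the ELBO.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)

def reconstruction(x, m, s):
    """E_q[log p(x | z)] for x | z ~ N(z, 1) and q = N(m, s^2)."""
    return -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s * s)

def kl_to_prior(m, s):
    """KL(N(m, s^2) || N(0, 1)) in closed form."""
    return 0.5 * (m * m + s * s - 1.0) - math.log(s)

def elbo_joint_plus_entropy(x, m, s):
    """E_q[log p(x, z)] + H[q], the form used when the bound was introduced."""
    expected_log_prior = -0.5 * LOG_2PI - 0.5 * (m * m + s * s)
    entropy = 0.5 * (LOG_2PI + 1.0) + math.log(s)
    return reconstruction(x, m, s) + expected_log_prior + entropy

def weighted_objective(x, m, s, beta=1.0):
    """Reconstruction minus beta * KL; beta is a hypothetical regularizer weight."""
    return reconstruction(x, m, s) - beta * kl_to_prior(m, s)

def best_q(x, beta):
    """Maximizer of weighted_objective for this toy model (set both gradients to zero)."""
    return x / (1.0 + beta), math.sqrt(beta / (1.0 + beta))

x = 1.2                                            # hypothetical observation
gap = abs(elbo_joint_plus_entropy(x, 0.3, 0.8) - weighted_objective(x, 0.3, 0.8))
m_weak, s_weak = best_q(x, 0.1)                    # light regularization
m_unit, s_unit = best_q(x, 1.0)                    # the plain ELBO
m_strong, s_strong = best_q(x, 10.0)               # heavy regularization
```

In this model, `gap` is zero up to rounding, confirming the two forms agree. A small `beta` gives a mean near $x$ and a narrow $q$, a large `beta` pulls the mean toward 0 and the width toward 1 (the prior), and `beta = 1` returns the exact posterior $\mathcal{N}(x/2,\,1/2)$.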

Depth

Expectation–maximization (EM) is the special case in which $q$ is set to the exact posterior at each E-step, making the bound tight, followed by an M-step that maximizes $\mathbb{E}_{q}[\log p(x,z)]$ over model parameters. Variational inference generalizes EM to settings where the exact posterior is unavailable: the E-step is itself an approximate optimization over a restricted family of distributions $q$, and the same ELBO is ascended in both arguments.
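The EM special case is easy to see on a model whose posterior is available exactly. The sketch below assumes a hypothetical two-component Gaussian mixture in one dimension with unit variances, where the latent is the component label; the data, initial values, and function names are illustrative.

```python
import math
import random

def log_normal(v, mean, std):
    """Log-density of N(mean, std^2) at v."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((v - mean) / std) ** 2

def e_step(data, means, weights):
    """Set q to the exact posterior over the component label (responsibilities)."""
    resp = []
    for x in data:
        joint = [w * math.exp(log_normal(x, m, 1.0)) for m, w in zip(means, weights)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(data, resp):
    """Maximize E_q[log p(x, z)] over the means and mixing weights."""
    k = len(resp[0])
    counts = [sum(r[j] for r in resp) for j in range(k)]
    means = [sum(r[j] * x for r, x in zip(resp, data)) / counts[j] for j in range(k)]
    weights = [c / len(data) for c in counts]
    return means, weights

def log_likelihood(data, means, weights):
    """log p(x) summed over the data; tractable here because z is a binary label."""
    return sum(
        math.log(sum(w * math.exp(log_normal(x, m, 1.0)) for m, w in zip(means, weights)))
        for x in data
    )

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

means, weights = [-0.5, 0.5], [0.5, 0.5]          # deliberately poor starting point
history = [log_likelihood(data, means, weights)]
for _ in range(30):
    resp = e_step(data, means, weights)           # tighten the bound
    means, weights = m_step(data, resp)           # raise the bound
    history.append(log_likelihood(data, means, weights))
```

Because each E-step makes the bound tight and each M-step can only raise it, the values in `history` never decrease. When the E-step cannot be solved exactly, the same loop survives with `e_step` replaced by an approximate optimization over $q$, which is the variational generalization described above.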

When $q$ is parameterized by a neural network $q_\phi(z\mid x)$ that maps the observation $x$ straight to $\mu,\sigma$ (known as amortized inference), differentiating the ELBO with respect to $\phi$ hits a snag: the expectation is taken under a distribution that itself depends on the parameters being optimized, so the gradient cannot simply move inside it. The reparameterization trick sidesteps this by writing $z=\mu_\phi+\sigma_\phi\odot\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$, pushing the randomness into a $\phi$-independent $\epsilon$, so that

$$\nabla_\phi\,\mathbb{E}_{q_\phi}[f(z)] =\mathbb{E}_{\epsilon}\!\big[\nabla_\phi f(\mu_\phi+\sigma_\phi\odot\epsilon)\big],$$

an unbiased gradient estimate, available from even a single sample, that backpropagates through the whole network. This is the technical core that lets the variational autoencoder train end to end. One cost to keep in mind: the mean-field $q$ assumes the $z_i$ are independent under the posterior, which tends to underestimate variance. It discards correlations between latents that the true posterior has, so the uncertainty it reports is typically too narrow.
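The reparameterized gradient can be checked against a case with a known answer. The sketch below assumes a scalar latent and a hypothetical test function $f(z)=z^2$, for which $\mathbb{E}_q[f]=\mu^2+\sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. There is no neural network here; $\mu$ and $\sigma$ stand in directly for the parameters $\phi$.

```python
import random

def f_prime(z):
    """Derivative of the hypothetical test function f(z) = z**2."""
    return 2.0 * z

def reparam_gradient(mu, sigma, n_samples, rng):
    """Pathwise gradient estimate of E_q[f(z)] with respect to (mu, sigma)."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # noise that does not depend on mu or sigma
        z = mu + sigma * eps           # reparameterized sample
        g_mu += f_prime(z)             # chain rule: dz/dmu = 1
        g_sigma += f_prime(z) * eps    # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

rng = random.Random(0)
mu, sigma = 0.7, 1.3                   # hypothetical variational parameters
g_mu, g_sigma = reparam_gradient(mu, sigma, 50000, rng)
exact = (2.0 * mu, 2.0 * sigma)        # analytic gradients for this f
```

Each term in the average is itself an unbiased estimate, so a single sample would already give a valid (if noisy) gradient; averaging many samples here only serves to make the agreement with `exact` visible.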

The ELBO is the training objective behind the variational autoencoder, and the intractability of the posterior KL term motivates the distribution-matching view used in optimal transport and in CryoGEN.

Placing this machinery back in the reconstruction pipeline: solving for $p(z\mid x)$ means “given noisy projections, infer the distribution over 3D structures.” Variational inference returns a single best approximation $q$ to that posterior: one settled, stable answer, matching the MAP point estimate of CryoGEN-I and the WAE/OT stable single answer of CryoGEN-II. But to characterize the whole family of posterior solutions, that is, several plausible structures consistent with the same data, a single $q$ is not enough, and you switch to tools that genuinely sample the posterior. Energy-based models give the unnormalized posterior shape $p(z\mid x)\propto e^{-E(z)}$, and Langevin dynamics and SGLD sample it repeatedly using only its score $\nabla_z\log p$, returning a family of structures rather than one. This is the inner loop of CryoWGEN (EVIA Monte-Carlo for CryoWGEN-I, EVIA Langevin for the CryoWGEN-II posterior family). The ELBO on this page and that sampling chain share one starting point: both are stuck with the same intractable posterior and both work from a tractable proxy for it; the only difference is whether the proxy is “one distribution $q$” or “a stream of samples.”
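The sampling alternative can be illustrated with unadjusted Langevin dynamics on a one-dimensional double-well energy. This is a minimal sketch under stated assumptions: the energy $E(z)=(z^2-4)^2/8$ is a hypothetical stand-in with modes at $\pm 2$, the step size and chain counts are arbitrary choices, and nothing here reproduces the CryoWGEN or EVIA procedures.

```python
import math
import random

def score(z):
    """Score of the hypothetical target: d/dz log p(z) = -dE/dz, E(z) = (z^2 - 4)^2 / 8."""
    return -0.5 * z * (z * z - 4.0)

def langevin_chain(z0, n_steps, step, rng):
    """Unadjusted Langevin dynamics; uses only the score, never the normalizer."""
    z = z0
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        z = z + step * score(z) + noise_scale * rng.gauss(0.0, 1.0)
    return z

rng = random.Random(0)
# Run many independent chains from random starting points and keep each final state.
finals = [langevin_chain(rng.gauss(0.0, 1.0), 1000, 0.01, rng) for _ in range(100)]

near_plus = sum(1 for z in finals if z > 0)    # chains that ended in the +2 well
near_minus = len(finals) - near_plus           # chains that ended in the -2 well
```

Unlike a single Gaussian $q$, which would commit to one well, the collection `finals` populates both, which is the "family of solutions" behavior described above. The unadjusted scheme carries a discretization bias that shrinks with the step size; correcting it exactly would need a Metropolis accept/reject step, which this sketch omits.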
