Variational inference & the ELBO

Approximating an intractable posterior by optimizing over a tractable family, with the evidence lower bound as the objective.

Big picture first

The exact posterior $p(z\mid x)$ is usually impossible to compute because the integral in its denominator is intractable. Variational inference takes a different route: instead of solving for the posterior exactly, pick the closest match out of a family of simple, easy-to-handle distributions. “Computing the posterior” then becomes “tuning a few parameters until two distributions are as close as possible,” which is an optimization problem. The rest of this page makes that idea precise.

Variational inference turns posterior computation into optimization. Given a latent-variable model $p(x,z)=p(x\mid z)\,p(z)$, the posterior

$$p(z\mid x) = \frac{p(x,z)}{p(x)},\qquad p(x)=\int p(x,z)\,dz,$$

is usually intractable because the evidence $p(x)$ requires marginalizing over all latents. Here $x$ is the observation (in Cryo-ET, a set of noisy, missing-wedge projections), $z$ is the latent we want to infer (the 3D structure to reconstruct, the pose, or a latent code), $p(x\mid z)$ is the likelihood encoding the imaging physics, and $p(z)$ is the prior. The trouble is that integral: $z$ is often a voxel grid with hundreds of thousands of dimensions, so marginalizing over it is numerically hopeless, and neither $p(x)$ nor $p(z\mid x)$ has a closed form.

Variational inference replaces the true posterior with a tractable approximation $q(z\mid x)$ drawn from a chosen family, and fits $q$ by maximizing a lower bound on $\log p(x)$. “Tractable” means we can sample from $q$ and evaluate its density. The canonical choice is a mean-field Gaussian $q(z\mid x)=\prod_i \mathcal{N}(z_i;\mu_i,\sigma_i^2)$, with one mean and variance per latent. Fitting $q$ is just optimizing those $(\mu_i,\sigma_i)$.
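The two operations that make a mean-field Gaussian "tractable" can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the three-dimensional example values are assumptions made for the sketch, not part of any particular library.

```python
import math
import random

def sample_mean_field(mu, sigma, rng):
    """Draw one z from q(z) = prod_i N(z_i; mu_i, sigma_i^2), one coordinate at a time."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]

def log_q_mean_field(z, mu, sigma):
    """Log-density of the factorized Gaussian: a sum of 1D Gaussian log-densities."""
    total = 0.0
    for zi, m, s in zip(z, mu, sigma):
        total += -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((zi - m) / s) ** 2
    return total

rng = random.Random(0)        # fixed seed so the sketch is reproducible
mu = [0.0, 1.0, -2.0]         # hypothetical variational means
sigma = [1.0, 0.5, 2.0]       # hypothetical variational standard deviations
z = sample_mean_field(mu, sigma, rng)
print(z, log_q_mean_field(z, mu, sigma))
```

Because the density factorizes, both sampling and density evaluation cost time linear in the number of latents, which is what keeps the family usable when $z$ is very high-dimensional.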

The expressiveness of the chosen family caps the quality of the approximation: when the true posterior is multimodal but $q$ is restricted to a single Gaussian, the two cannot coincide, and the fit settles on one region of the posterior. Concretely, minimizing $\mathrm{KL}(q\,\|\,p)$, the direction variational inference uses by default, is mode-seeking: it would rather shrink $q$ into a single peak than straddle the valley between two, because placing $q$-mass where $p$ is near zero blows up $\log(q/p)$. So a single-Gaussian $q$ locks onto one mode rather than averaging the two.

The interactive panel below fixes a bimodal target distribution and exposes the mean and standard deviation of a single Gaussian $q$, making the trade-off between the expected log target, the entropy, and the resulting ELBO concrete. Slide the mean onto either peak and shrink the standard deviation, and the ELBO rises; try to cover both peaks at once with a wide, high-variance Gaussian and the ELBO instead drops, because $q$ wastes probability mass in the low-density valley between the peaks.

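[Interactive panel: bimodal target p and Gaussian approximation q plotted over the latent z (axis from -7 to 7), with a control for the mean μ. Example readout: expected log target E_q[log p̃] = -3.286, entropy H[q] = 2.374, ELBO = -0.912, KL proxy (log Z − ELBO) = 0.912.]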

A single Gaussian q cannot cover both peaks at once: spanning the low-density valley drags down the expected log target. Maximizing the ELBO drives q to lock onto one mode — the mode-seeking signature of reverse-KL variational inference.
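The same comparison can be reproduced offline. The sketch below is not the panel's code: it assumes a normalized target with two unit-variance Gaussian peaks at $\pm 4$ (so $\log Z = 0$), and computes the ELBO of a Gaussian $q$ as a Riemann sum for $\mathbb{E}_q[\log p]$ plus the closed-form Gaussian entropy. The names `MODE`, `log_target`, and `elbo` are illustrative.

```python
import math

MODE = 4.0  # assumed: target peaks at -MODE and +MODE, each with unit variance

def log_target(z):
    """log of 0.5*N(z; -MODE, 1) + 0.5*N(z; +MODE, 1), via a stable log-sum-exp."""
    a = -0.5 * (z - MODE) ** 2
    b = -0.5 * (z + MODE) ** 2
    top = max(a, b)
    return (math.log(0.5) - 0.5 * math.log(2.0 * math.pi)
            + top + math.log(math.exp(a - top) + math.exp(b - top)))

def elbo(mu, sigma, n_grid=4001, width=8.0):
    """ELBO of q = N(mu, sigma^2): E_q[log p] by a Riemann sum, plus the entropy of q."""
    lo = mu - width * sigma
    dz = 2.0 * width * sigma / (n_grid - 1)
    expected_log_p = 0.0
    for k in range(n_grid):
        z = lo + k * dz
        q = math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        expected_log_p += q * log_target(z) * dz
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)
    return expected_log_p + entropy

on_one_mode = elbo(mu=4.0, sigma=1.0)    # narrow q sitting on the right-hand peak
covering_both = elbo(mu=0.0, sigma=4.0)  # wide q stretched across both peaks
print(on_one_mode, covering_both)
```

Under these assumptions the narrow $q$ scores close to $\log 0.5$, the price of ignoring half of the target's mass, while the wide $q$ scores noticeably lower because most of its mass lands where $\log p$ is very negative.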

That bound is the evidence lower bound (ELBO). Writing $\log p(x)=\log\mathbb{E}_{q(z\mid x)}\big[p(x,z)/q(z\mid x)\big]$ and applying Jensen’s inequality to the concave logarithm gives

$$\log p(x) \;\ge\; \mathbb{E}_{q(z\mid x)}\!\big[\log p(x,z)\big] - \mathbb{E}_{q(z\mid x)}\!\big[\log q(z\mid x)\big] \;=\; \mathcal{L}(q).$$

Term by term: the first, $\mathbb{E}_{q}[\log p(x,z)]$, averages the joint log-density under $q$. Estimate it by drawing a few $z$ from $q$, plugging them into $\log p(x,z)$, and averaging, with no intractable integral in sight. The second, $-\mathbb{E}_{q}[\log q(z\mid x)]$, is the entropy of $q$, rewarding $q$ for staying spread out instead of collapsing to a point. Their sum is the ELBO, written $\mathcal{L}(q)$, and the whole thing is estimable by sampling, which is exactly why it can serve as an optimization objective.
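A sampling-based ELBO estimate can be checked on a model where the evidence is known. The sketch below assumes a toy model chosen only for illustration, $z\sim\mathcal{N}(0,1)$ and $x\mid z\sim\mathcal{N}(z,1)$, for which $p(x)=\mathcal{N}(x;0,2)$ and the exact posterior is $\mathcal{N}(x/2,\tfrac12)$. The function names are hypothetical.

```python
import math
import random

def log_normal(v, mean, var):
    """Log-density of a 1D Gaussian with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (v - mean) ** 2 / var

def log_joint(x, z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

def elbo_estimate(x, m, s, n_samples, rng):
    """Monte Carlo ELBO for q(z | x) = N(m, s^2): mean of log p(x, z) - log q(z | x)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(m, s)
        total += log_joint(x, z) - log_normal(z, m, s * s)
    return total / n_samples

rng = random.Random(0)
x = 2.0
log_evidence = log_normal(x, 0.0, 2.0)  # exact for this toy model
elbo_prior_q = elbo_estimate(x, 0.0, 1.0, 20000, rng)                 # q set to the prior
elbo_exact_q = elbo_estimate(x, x / 2.0, math.sqrt(0.5), 20000, rng)  # q set to the posterior
print(log_evidence, elbo_prior_q, elbo_exact_q)
```

With $q$ equal to the exact posterior, every sample of $\log p(x,z)-\log q(z\mid x)$ equals $\log p(x)$, so the estimate matches the evidence up to floating-point error; with $q$ set to the prior, the estimate sits clearly below it.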

The gap between $\log p(x)$ and the ELBO is exactly $\mathrm{KL}\!\big(q(z\mid x)\,\|\,p(z\mid x)\big)\ge 0$. Equivalently,

$$\log p(x) = \mathcal{L}(q) + \mathrm{KL}\!\big(q(z\mid x)\,\|\,p(z\mid x)\big).$$

Since the left-hand side $\log p(x)$ does not depend on $q$, raising the ELBO and lowering the KL are two sides of one move: maximizing the bound drives $q$ toward the true posterior while simultaneously tightening an estimate of the evidence. When $q$ equals the true posterior the KL is zero, the bound is tight, and the ELBO reaches $\log p(x)$ itself.
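For completeness, the decomposition follows in three lines from the definition of the KL divergence and Bayes' rule $p(z\mid x)=p(x,z)/p(x)$, using only the symbols already defined on this page:

$$
\begin{aligned}
\mathrm{KL}\!\big(q(z\mid x)\,\|\,p(z\mid x)\big)
&= \mathbb{E}_{q}\!\big[\log q(z\mid x) - \log p(z\mid x)\big] \\
&= \mathbb{E}_{q}\!\big[\log q(z\mid x) - \log p(x,z)\big] + \log p(x) \\
&= -\mathcal{L}(q) + \log p(x).
\end{aligned}
$$

The second line uses the fact that $\log p(x)$ is constant in $z$, so it passes through the expectation unchanged. Rearranging the last line gives the identity above.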

Intuition

Rewriting the ELBO exposes its two competing terms:

$$\mathcal{L}(q)=\underbrace{\mathbb{E}_{q}[\log p(x\mid z)]}_{\text{reconstruction}} -\underbrace{\mathrm{KL}\!\big(q(z\mid x)\,\|\,p(z)\big)}_{\text{regularization}}.$$

The first term rewards latents that explain the observation: it asks “if I draw $z$ from $q$ and pass it through the imaging model $p(x\mid z)$, does it look like the $x$ I actually saw?” The second penalizes departures of $q$ from the prior, pulling $q$ back toward “what it should look like before seeing any data.” Turn the regularizer up and $q$ stays conservative, close to the prior; turn it down and $q$ fits this one observation more aggressively and overfits noise more easily. The objective balances fitting the data against staying close to the prior, which is exactly why it helps in Cryo-ET: a single tilt series carries limited information, and the prior term covers for what is missing.
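Both terms have closed forms in simple cases, which makes the reconstruction-minus-regularization form easy to check numerically. The sketch below again assumes the illustrative toy model $z\sim\mathcal{N}(0,1)$, $x\mid z\sim\mathcal{N}(z,1)$ with $q=\mathcal{N}(m,s^2)$; this model and the function names are assumptions, not anything specific to Cryo-ET.

```python
import math

def kl_to_standard_normal(m, s):
    """KL( N(m, s^2) || N(0, 1) ) in closed form: the regularization term."""
    return 0.5 * (m * m + s * s - 1.0 - math.log(s * s))

def elbo_recon_minus_kl(x, m, s):
    """ELBO as reconstruction minus regularization for the assumed toy model."""
    # E_q[log N(x; z, 1)] with z ~ N(m, s^2), evaluated analytically.
    reconstruction = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s * s)
    return reconstruction - kl_to_standard_normal(m, s)

def log_evidence(x):
    """Exact log p(x) for the toy model, where p(x) = N(x; 0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def kl_to_posterior(x, m, s):
    """KL( N(m, s^2) || N(x/2, 1/2) ), the gap to the exact posterior."""
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var / (s * s))
            + (s * s + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)

x, m, s = 2.0, 0.3, 0.9  # hypothetical observation and variational parameters
print(elbo_recon_minus_kl(x, m, s), log_evidence(x) - kl_to_posterior(x, m, s))
```

The two printed numbers agree, which is the identity $\log p(x)=\mathcal{L}(q)+\mathrm{KL}(q\,\|\,p(z\mid x))$ evaluated with the ELBO written in its reconstruction-minus-regularization form.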

Depth

Expectation–maximization (EM) is the special case in which $q$ is set to the exact posterior at each E-step, making the bound tight, followed by an M-step that maximizes $\mathbb{E}_{q}[\log p(x,z)]$ over model parameters. Variational inference generalizes EM to settings where the exact posterior is unavailable: the E-step is itself an approximate optimization over a restricted family $q$, and the same ELBO is ascended in both arguments.
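A small EM loop makes the alternating structure concrete. The sketch below assumes a deliberately simple model, an equal-weight mixture of two unit-variance 1D Gaussians in which only the two means are learned, on synthetic data; the function names and settings are illustrative.

```python
import math
import random

def log_normal(x, mean):
    """Log-density of a unit-variance 1D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mean) ** 2

def em_two_means(data, mu, n_iters=30):
    """EM for the assumed model 0.5*N(mu[0], 1) + 0.5*N(mu[1], 1); only the means are fit."""
    mu = list(mu)
    history = []
    for _ in range(n_iters):
        # E-step: q is the exact posterior over component labels (responsibilities).
        resp, log_lik = [], 0.0
        for x in data:
            a = math.log(0.5) + log_normal(x, mu[0])
            b = math.log(0.5) + log_normal(x, mu[1])
            top = max(a, b)
            log_px = top + math.log(math.exp(a - top) + math.exp(b - top))
            resp.append(math.exp(a - log_px))
            log_lik += log_px
        # With the exact posterior the bound is tight, so the ELBO equals the log-likelihood.
        history.append(log_lik)
        # M-step: maximize E_q[log p(x, z)] over the means (responsibility-weighted averages).
        w0 = sum(resp)
        w1 = len(data) - w0
        mu[0] = sum(r * x for r, x in zip(resp, data)) / w0
        mu[1] = sum((1.0 - r) * x for r, x in zip(resp, data)) / w1
    return mu, history

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(2.0, 1.0) for _ in range(200)]
fitted_mu, bound_history = em_two_means(data, mu=[-0.5, 0.5])
print(fitted_mu, bound_history[0], bound_history[-1])
```

The recorded bound never decreases from one iteration to the next, which is the coordinate-ascent behaviour described above; in variational inference the E-step would be replaced by an optimization over a restricted $q$, and the bound would in general no longer be tight.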

When $q$ is parameterized by a neural network $q_\phi(z\mid x)$ that maps the observation $x$ straight to $\mu,\sigma$ (known as amortized inference), differentiating the ELBO with respect to $\phi$ hits a snag: the distribution the expectation is taken over depends on the parameters being optimized, so the gradient cannot simply move inside it. The reparameterization trick sidesteps this by writing $z=\mu_\phi+\sigma_\phi\odot\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$, pushing the randomness into a $\phi$-independent $\epsilon$, so that

$$\nabla_\phi\,\mathbb{E}_{q_\phi}[f(z)] =\mathbb{E}_{\epsilon}\!\big[\nabla_\phi f(\mu_\phi+\sigma_\phi\odot\epsilon)\big],$$

which gives an unbiased gradient estimate from a single sample of $\epsilon$ and backpropagates through the whole network. This is the technical core that lets the variational autoencoder train end to end. One cost to keep in mind: the mean-field $q$ assumes the $z_i$ are independent under the posterior, which tends to underestimate variance. It discards correlations between latents that the true posterior has, so the uncertainty it reports is typically too narrow.
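The reparameterized gradient can be checked on a case with a known answer. The sketch below assumes a scalar latent and the test function $f(z)=z^2$, for which $\mathbb{E}_{q}[f(z)]=\mu^2+\sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. Derivatives are written out by hand instead of using an autodiff library, and the function name is hypothetical.

```python
import random

def reparam_grad(mu, sigma, n_samples, rng):
    """Estimate d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[z^2] via z = mu + sigma * eps."""
    g_mu, g_sigma = 0.0, 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # randomness that does not depend on (mu, sigma)
        z = mu + sigma * eps       # reparameterized sample
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        g_mu += df_dz * 1.0        # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps     # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

rng = random.Random(0)
mu, sigma = 1.5, 0.8  # hypothetical variational parameters
g_mu, g_sigma = reparam_grad(mu, sigma, 50000, rng)
print(g_mu, g_sigma)  # compare with the exact values 2*mu and 2*sigma
```

In a variational autoencoder the same chain rule is applied by automatic differentiation, with $\mu$ and $\sigma$ produced by the encoder network and $f$ given by the per-sample ELBO integrand.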

The ELBO is the training objective behind the variational autoencoder, and the intractability of the posterior KL term motivates the distribution-matching view used in optimal transport and in CryoGEN.

Placing this machinery back in the reconstruction pipeline: solving for $p(z\mid x)$ means “given noisy projections, infer the distribution over 3D structures.” Variational inference returns a single best approximation $q$ to that posterior: one settled, stable answer, matching the MAP point estimate of CryoGEN-I and the WAE/OT stable single answer of CryoGEN-II. But to characterize the whole family of posterior solutions (several plausible structures consistent with the same data), a single $q$ is not enough, and you switch to tools that genuinely sample the posterior. Energy-based models give the unnormalized posterior shape $p(z\mid x)\propto e^{-E(z)}$, and Langevin dynamics and SGLD sample it repeatedly using only its score $\nabla_z\log p$, returning a family of structures rather than one. This is the inner loop of CryoWGEN (EVIA Monte-Carlo for CryoWGEN-I, EVIA Langevin for the CryoWGEN-II posterior family). The ELBO on this page and that sampling chain share one starting point: both are stuck with the same intractable posterior and both work from a tractable proxy for it; the only difference is whether the proxy is “one distribution $q$” or “a stream of samples.”
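The sampling alternative can be sketched with unadjusted Langevin dynamics on a 1D bimodal target. This is a generic illustration, not the CryoWGEN inner loop: the target (an equal mixture of unit-variance Gaussians at $\pm 2$), the step size, and the chain counts are all assumptions chosen to keep the example small.

```python
import math
import random

M = 2.0  # assumed: target modes at -M and +M, each with unit variance

def score(z):
    """d/dz log p(z) for p = 0.5*N(-M, 1) + 0.5*N(+M, 1); this equals -dE/dz."""
    return -z + M * math.tanh(M * z)

def langevin_chain(z0, step, n_steps, rng):
    """Unadjusted Langevin dynamics: drift along the score, plus Gaussian noise."""
    z = z0
    for _ in range(n_steps):
        z = z + step * score(z) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return z

rng = random.Random(0)
# Run many short chains from random starts and keep each chain's final state.
samples = [langevin_chain(rng.gauss(0.0, 1.0), step=0.05, n_steps=500, rng=rng)
           for _ in range(200)]
frac_right = sum(1 for z in samples if z > 0) / len(samples)
mean_abs = sum(abs(z) for z in samples) / len(samples)
print(frac_right, mean_abs)
```

Only the score is needed, never the normalizing constant, and the final states spread across both modes instead of settling on one. That is the contrast with the single-Gaussian $q$ earlier on this page. The discretization introduces a small bias that shrinks with the step size.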
