Langevin dynamics & SGLD

Sampling from an unnormalized density using only its score, by following a noisy gradient ascent.

Langevin dynamics is a Markov chain that samples from a target density $p(x)$ using nothing but its score $\nabla_x \log p(x)$. Each step takes a gradient step uphill in log-density and injects Gaussian noise:

$$x_{t+1} = x_t + \frac{\eta}{2}\,\nabla_x \log p(x_t) + \sqrt{\eta}\;\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I).$$

Reading it term by term: $x_t$ is the sample at step $t$; $\eta$ is the step size (learning rate); $\nabla_x \log p(x_t)$ is the score at the current point, pointing in the direction of steepest increase of log-density; and $\xi_t$ is fresh standard Gaussian noise drawn independently each step and scaled by $\sqrt{\eta}$. The drift carries a factor of $\eta/2$ and the noise a factor of $\sqrt{\eta}$, and these are not arbitrary: it is this specific ratio that makes the chain’s stationary distribution $p$ itself in the small-step limit, rather than some power or distortion of $p$.

For a small enough step size $\eta$, the chain converges in distribution to a close approximation of $p$. The drift term pulls samples toward high-probability regions, while the noise prevents the iterate from collapsing onto a single mode and lets it explore the full distribution.
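The update above can be sketched in a few lines of standard-library Python. This is a minimal scalar illustration, not a reference implementation: the function names, the target $\mathcal{N}(3, 1)$, the step size, and the burn-in length are all assumptions chosen for the example.

```python
import math
import random


def langevin_step(x, score, eta, rng):
    """One unadjusted Langevin update for a scalar state x."""
    noise = rng.gauss(0.0, 1.0)                 # xi_t ~ N(0, 1), fresh each step
    return x + 0.5 * eta * score(x) + math.sqrt(eta) * noise


def run_chain(score, x0, eta, n_steps, seed=0):
    """Run a single chain and return every visited state."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n_steps):
        x = langevin_step(x, score, eta, rng)
        path.append(x)
    return path


def gaussian_score(x):
    # Illustrative target N(3, 1): log p(x) = -(x - 3)^2 / 2 + const.
    return -(x - 3.0)


path = run_chain(gaussian_score, x0=-5.0, eta=0.05, n_steps=20000)
kept = path[2000:]                              # discard burn-in
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(f"mean ~ {mean:.2f}, variance ~ {var:.2f}")
```

Started far from the mode, the chain first drifts toward it and then fluctuates around it; the empirical mean and variance should sit close to the target's 3 and 1, up to Monte Carlo noise and a small step-size bias.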

Intuition

Pure gradient ascent on $\log p$ would settle at a mode and stop: useful for finding a maximum, useless for sampling. The added noise turns that ascent into exploration: the walker lingers where probability is high but still wanders, visiting each region in proportion to its mass.

Picture a ball rolling on a potential landscape, where deeper valleys mean higher probability, while a steady random kicking (temperature) keeps jostling it. With no kicking it gets stuck in the nearest valley; with too much kicking it ignores the wells and wanders almost uniformly. Langevin dynamics tunes the kicking so the ball spends time in each valley in proportion to that valley’s probability mass — most of its time in the deepest well, but occasionally crossing a barrier to visit a shallower one.

[Figure: target density $p(x)$ and sample histogram]

Hundreds of walkers start uniform and, under Langevin dynamics, drift along the gradient of the log-density with added noise until they settle on the two modes of a bimodal target. A larger step is faster but, taken too far, overshoots detail. Only the gradient is needed; no normalizing constant is required.

Notice that the walkers do not all rush to the taller peak: both peaks retain a share of walkers set by their respective probability mass. This is exactly what separates sampling from optimization — optimization wants only the global best, sampling wants the whole distribution.
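The walker experiment can be approximated offline. The sketch below assumes a specific two-Gaussian mixture (weights 0.3 and 0.7, modes at -1.5 and 1.5, shared width 0.8); these numbers, like the function names, are illustrative choices and not taken from the figure.

```python
import math
import random

WEIGHTS = (0.3, 0.7)     # assumed mixture weights (shorter peak, taller peak)
MEANS = (-1.5, 1.5)      # assumed mode locations
SIGMA = 0.8              # assumed shared component width


def mixture_score(x):
    """Score of the two-Gaussian mixture; the shared normalizer cancels in the ratio."""
    dens = [w * math.exp(-0.5 * ((x - m) / SIGMA) ** 2) for w, m in zip(WEIGHTS, MEANS)]
    pulls = [-(x - m) / SIGMA ** 2 for m in MEANS]
    return sum(d * g for d, g in zip(dens, pulls)) / sum(dens)


def run_walkers(n_walkers=300, n_steps=1500, eta=0.05, seed=0):
    rng = random.Random(seed)
    walkers = [rng.uniform(-4.0, 4.0) for _ in range(n_walkers)]   # uniform start
    for _ in range(n_steps):
        walkers = [
            x + 0.5 * eta * mixture_score(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
            for x in walkers
        ]
    return walkers


walkers = run_walkers()
share_right = sum(x > 0.0 for x in walkers) / len(walkers)
print(f"share of walkers on the taller peak's side: {share_right:.2f}")
```

The uniform start puts half the walkers on each side of zero, while this target holds roughly 0.69 of its mass at $x > 0$. After the run the share should move toward that value rather than toward 1, which is the sampling-versus-optimization distinction in numbers.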

The key practical point is that the update depends on $p$ only through the score $\nabla_x \log p$, in which any normalizing constant disappears. For an energy-based model $p(x)\propto e^{-E_\theta(x)}$,

$$\nabla_x \log p(x) = -\nabla_x E_\theta(x),$$

so Langevin dynamics can draw samples from the model without ever computing the intractable partition function $Z_\theta$. The reason is that $Z_\theta$ is a constant independent of $x$, so its gradient with respect to $x$ is zero; sampling sees only the shape of the energy (the differences in energy between points), never its absolute level. This makes Langevin the standard sampler for EBMs and for score-based generative models: as long as you can differentiate an unnormalized energy, you can sample from the distribution it defines.
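Written out, the cancellation is a single line of calculus. The derivation below uses only the symbols already defined, with $Z_\theta$ written explicitly as the integral of the unnormalized density:

```latex
\begin{aligned}
\log p(x) &= -E_\theta(x) - \log Z_\theta,
\qquad Z_\theta = \int e^{-E_\theta(x')}\,dx', \\
\nabla_x \log p(x) &= -\nabla_x E_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0}
= -\nabla_x E_\theta(x).
\end{aligned}
```

The same argument shows that replacing $E_\theta(x)$ by $E_\theta(x) + c$ for any constant $c$ leaves the score, and hence the sampler, unchanged.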

To make it concrete, take a one-dimensional Gaussian $p(x)=\mathcal{N}(0,1)$, whose score is $\nabla_x\log p(x)=-x$. Substituting into the update gives $x_{t+1}=x_t-\tfrac{\eta}{2}x_t+\sqrt{\eta}\,\xi_t=(1-\tfrac{\eta}{2})x_t+\sqrt{\eta}\,\xi_t$. That is a first-order autoregressive process: the drift shrinks the sample toward the origin by a fixed fraction, and the noise puts variance back; in the small-$\eta$ limit its stationary variance tends to exactly $1$, recovering the target Gaussian. This is what the abstract update above looks like in the simplest possible case.
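The stationary variance of this autoregressive process can be computed in closed form. In the short derivation below, $a$ and $v$ are notation introduced here for the contraction factor and the stationary variance; the result holds for $0 < \eta < 4$, where $|a| < 1$:

```latex
\begin{aligned}
x_{t+1} &= a\,x_t + \sqrt{\eta}\,\xi_t, \qquad a = 1 - \tfrac{\eta}{2}, \\
v &= a^2 v + \eta
\;\;\Longrightarrow\;\;
v = \frac{\eta}{1 - a^2} = \frac{\eta}{\eta - \eta^2/4}
  = \frac{1}{1 - \eta/4} = 1 + \frac{\eta}{4} + O(\eta^2).
\end{aligned}
```

A finite step therefore inflates the variance by roughly $\eta/4$: with $\eta = 0.1$ the chain settles at a variance of about $1.026$ instead of $1$, and the excess vanishes as $\eta \to 0$.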

Depth

The discrete update above only samples $p$ exactly in the continuous limit $\eta\to 0$. It is the Euler–Maruyama discretization of the Langevin stochastic differential equation

$$dx = \tfrac{1}{2}\nabla_x \log p(x)\,dt + dW_t,$$

where $W_t$ is Brownian motion. The stationary distribution of this SDE is exactly $p$, as one can verify from its Fokker–Planck equation. A finite step size $\eta$ introduces a discretization bias of order $O(\eta)$, so the chain converges to a slightly distorted distribution; the Metropolis-adjusted Langevin algorithm (MALA) removes this bias with an accept/reject test after each step, at the cost of needing an evaluable (unnormalized) density.
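A minimal sketch of the accept/reject correction, on the standard Gaussian target and with a deliberately coarse step so the bias is visible. The helper names and the choice $\eta = 1$ are assumptions for illustration; the acceptance ratio is the usual Metropolis-Hastings ratio with the Langevin update as the proposal.

```python
import math
import random


def log_p(x):
    return -0.5 * x * x          # unnormalized log-density of N(0, 1)


def score(x):
    return -x                    # gradient of log_p


def mala_step(x, eta, rng):
    """Langevin proposal followed by a Metropolis-Hastings accept/reject test."""
    def log_q(dst, src):
        # log proposal density of moving src -> dst, up to a shared constant
        mean = src + 0.5 * eta * score(src)
        return -((dst - mean) ** 2) / (2.0 * eta)

    prop = x + 0.5 * eta * score(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    log_alpha = log_p(prop) + log_q(x, prop) - log_p(x) - log_q(prop, x)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return prop, True
    return x, False


def variance(samples):
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / len(samples)


rng = random.Random(0)
eta, n_steps = 1.0, 20000        # coarse step chosen to expose the bias

x, ula = 0.0, []
for _ in range(n_steps):         # unadjusted chain: every proposal is kept
    x = x + 0.5 * eta * score(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    ula.append(x)

x, mala, accepted = 0.0, [], 0
for _ in range(n_steps):
    x, ok = mala_step(x, eta, rng)
    accepted += ok
    mala.append(x)

ula_var, mala_var = variance(ula), variance(mala)
accept_rate = accepted / n_steps
print(f"unadjusted var ~ {ula_var:.2f}, MALA var ~ {mala_var:.2f}, accept ~ {accept_rate:.2f}")
```

At this step size the unadjusted chain should overestimate the variance noticeably, while the adjusted chain stays near the target value of 1. Note that `log_p` only needs to be known up to an additive constant, since it enters through a difference.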

Stochastic Gradient Langevin Dynamics (SGLD) addresses a different scaling problem: when the log-density is itself a sum over $N$ data points (as in a Bayesian posterior $\log p(\theta)+\sum_i \log p(\mathrm{data}_i\mid\theta)$), it replaces the full-data gradient with a minibatch estimate, rescaled so that it remains unbiased, making the method scalable to large datasets. With a step size $\eta_t \to 0$ on a suitable schedule (satisfying $\sum_t\eta_t=\infty$ and $\sum_t\eta_t^2<\infty$), the minibatch noise, whose variance shrinks like $\eta_t^2$, becomes negligible relative to the injected Langevin noise, whose variance shrinks only like $\eta_t$, and the chain transitions smoothly from stochastic optimization to posterior sampling. SGLD thus produces samples from a Bayesian posterior at roughly the cost of stochastic gradient descent: the large early steps find a high-probability region as fast as SGD, and the small later steps turn the iterate into a genuine posterior sampler.
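A small sketch of SGLD on a problem where the exact posterior is known, so the result can be checked. Everything specific here is an assumption for illustration: the model (Gaussian observations with unit noise and a $\mathcal{N}(0, 10^2)$ prior on the mean), the synthetic data, the batch size, and the constants of the polynomial step schedule.

```python
import math
import random

rng = random.Random(0)
N, BATCH = 1000, 100
PRIOR_VAR = 100.0                                   # assumed prior: theta ~ N(0, 10^2)
data = [rng.gauss(2.0, 1.0) for _ in range(N)]      # synthetic data, unit noise


def minibatch_score(theta, batch):
    """Unbiased estimate of the posterior score from one minibatch."""
    prior_term = -theta / PRIOR_VAR
    lik_term = sum(y - theta for y in batch)        # gradient of sum_i log N(y_i | theta, 1)
    return prior_term + (N / len(batch)) * lik_term # rescale the batch sum to the full data


def step_size(t, eta0=5e-4, t0=100.0, gamma=0.55):
    # Polynomial decay; 0.5 < gamma <= 1 gives sum(eta) = inf and sum(eta^2) < inf.
    return eta0 * (1.0 + t / t0) ** (-gamma)


theta, samples = 0.0, []
for t in range(5000):
    eta = step_size(t)
    batch = rng.sample(data, BATCH)
    theta += 0.5 * eta * minibatch_score(theta, batch) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
    if t >= 1000:                                   # discard burn-in
        samples.append(theta)

sgld_mean = sum(samples) / len(samples)
exact_mean = sum(data) / (N + 1.0 / PRIOR_VAR)      # conjugate Gaussian posterior mean
exact_std = (N + 1.0 / PRIOR_VAR) ** -0.5
print(f"SGLD mean ~ {sgld_mean:.3f}, exact posterior mean = {exact_mean:.3f}")
```

Each step touches only `BATCH` of the `N` points. Over a finite run the residual minibatch noise can still widen the sample spread somewhat relative to `exact_std`; shrinking the step or enlarging the batch reduces that effect.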

Langevin sampling supplies the inner loop that draws candidate reconstructions from an energy prior. In Cryo-ET reconstruction the density $p(x)$ is a posterior: an energy prior (which 3D structures are plausible) plus a data term (the reconstruction must agree with the measured tilt projections) sum to $-\log p$ up to an additive constant, and Langevin dynamics samples from it. CryoWGEN-II does exactly this, sampling the Boltzmann posterior directly and iteratively with Langevin/SGLD, and returns not a single answer but a family of reconstructions, each a volume consistent with the data, their spread reading out the uncertainty left by the missing wedge. This connects the energy-based and variational views of generative modeling: variational inference approximates the posterior with a tractable distribution, while Langevin samples from it directly, two routes to the same target.
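The idea that posterior spread reads out what the data cannot constrain can be seen in a two-parameter toy. This is emphatically not CryoWGEN-II or a tomography model: it assumes a quadratic prior energy on $(x_1, x_2)$, a single noisy measurement of $x_1$, and no measurement of $x_2$ at all, with `SIGMA`, `Y_OBS`, and the function names invented for the illustration.

```python
import math
import random
import statistics

SIGMA = 0.3      # assumed measurement noise on the observed component
Y_OBS = 1.0      # assumed measurement of x1; x2 is never observed


def posterior_score(x1, x2):
    """Score of the toy posterior: minus the gradient of (prior energy + data term)."""
    g1 = -x1 + (Y_OBS - x1) / SIGMA ** 2   # prior pull plus data pull
    g2 = -x2                               # prior only: no data constrains x2
    return g1, g2


def sample_posterior(n_chains=300, n_steps=300, eta=0.05, seed=0):
    """Run independent Langevin chains and keep each chain's final state."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_chains):
        x1, x2 = 0.0, 0.0
        for _ in range(n_steps):
            g1, g2 = posterior_score(x1, x2)
            x1 += 0.5 * eta * g1 + math.sqrt(eta) * rng.gauss(0.0, 1.0)
            x2 += 0.5 * eta * g2 + math.sqrt(eta) * rng.gauss(0.0, 1.0)
        finals.append((x1, x2))
    return finals


samples = sample_posterior()
spread_x1 = statistics.pstdev([s[0] for s in samples])
spread_x2 = statistics.pstdev([s[1] for s in samples])
print(f"spread of measured x1 ~ {spread_x1:.2f}, spread of unmeasured x2 ~ {spread_x2:.2f}")
```

Each chain's final state plays the role of one candidate reconstruction. The measured component should come out tightly clustered near the observation, while the unmeasured one keeps roughly the prior's width, a loose analogue of uncertainty concentrating in directions the measurements never probe.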
