Langevin dynamics & SGLD

Sampling from an unnormalized density using only its score, by following a noisy gradient ascent.

Langevin dynamics is a Markov chain that samples from a target density $p(x)$ using nothing but its score $\nabla_x \log p(x)$. Each step moves uphill in log-density and injects Gaussian noise:

$$x_{t+1} = x_t + \frac{\eta}{2}\,\nabla_x \log p(x_t) + \sqrt{\eta}\;\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I).$$

Reading it term by term: $x_t$ is the sample at step $t$; $\eta$ is the step size (the analogue of a learning rate); $\nabla_x \log p(x_t)$ is the score at the current point, pointing in the direction of steepest increase of log-density; and $\xi_t$ is fresh standard Gaussian noise, drawn independently each step and scaled by $\sqrt{\eta}$. The drift carries a factor of $\eta/2$ and the noise a factor of $\sqrt{\eta}$, and these are not arbitrary: this specific ratio is what makes the chain target $p$ itself (exactly so in the limit $\eta \to 0$) rather than some power or distortion of $p$. Doubling the noise variance at fixed drift, for instance, would target $p^{1/2}$ instead.

For a small enough step size $\eta$, the chain converges in distribution to a close approximation of $p$, with an error that vanishes as $\eta \to 0$. The drift term pulls samples toward high-probability regions, while the noise prevents the iterate from collapsing onto a single mode and lets it explore the full distribution.
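As a minimal sketch of the update above (not a reference implementation), the following NumPy code runs many independent walkers in parallel. The function names `langevin_step` and `langevin_sample`, the shifted Gaussian target $\mathcal{N}(3,1)$, and the particular step size and step count are all illustrative assumptions.

```python
import numpy as np

def langevin_step(x, score, eta, rng):
    """One Langevin update: drift of eta/2 along the score plus sqrt(eta) noise."""
    noise = rng.standard_normal(x.shape)
    return x + 0.5 * eta * score(x) + np.sqrt(eta) * noise

def langevin_sample(score, x0, eta, n_steps, rng):
    """Run n_steps updates on every entry of x0 (independent walkers)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = langevin_step(x, score, eta, rng)
    return x

# Assumed target for illustration: N(3, 1), whose score is -(x - 3).
rng = np.random.default_rng(0)
walkers = langevin_sample(lambda x: -(x - 3.0), np.zeros(5000),
                          eta=0.05, n_steps=500, rng=rng)
print(walkers.mean(), walkers.var())
```

With this target the walkers should end with mean near 3 and variance near 1, up to Monte Carlo error and a small step-size bias.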

Intuition

Pure gradient ascent on $\log p$ would settle at a mode and stop — useful for finding a maximum, useless for sampling. The added noise turns that ascent into exploration: the walker lingers where probability is high but still wanders, visiting each region in proportion to its mass.

Picture a ball rolling on a potential landscape, where deeper valleys mean higher probability, while a steady random kicking (temperature) keeps jostling it. With no kicking it gets stuck in the nearest valley; with too much kicking it ignores the wells and wanders almost uniformly. Langevin dynamics tunes the kicking so the ball spends time in each valley in proportion to that valley’s probability mass — most of its time in the deepest well, but occasionally crossing a barrier to visit a shallower one.

[Interactive figure: target $p(x)$ overlaid with the sample histogram, with a step-size slider (shown at $\eta = 0.040$).]

Hundreds of walkers start uniform and, under Langevin dynamics, drift along the gradient of the log-density with added noise until they settle on the two modes of a bimodal target. A larger step is faster but, taken too far, overshoots detail. Only the gradient is needed, with no normalizing constant.

Notice that the walkers do not all rush to the taller peak: both peaks retain a share of walkers set by their respective probability mass. This is exactly what separates sampling from optimization — optimization wants only the global best, sampling wants the whole distribution.
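The experiment described above can be approximated offline. The sketch below assumes a hypothetical two-component Gaussian mixture (the weights, means, and widths are made up for illustration, not taken from the figure) and tracks the share of walkers in the left mode.

```python
import numpy as np

# Hypothetical bimodal target: weights, means, and std devs of two Gaussians.
W = np.array([0.7, 0.3])
MU = np.array([-1.5, 1.5])
S = np.array([0.8, 0.8])

def mixture_score(x):
    """Score d/dx log p(x) of the mixture, for a 1-D array of points x."""
    d = x[:, None] - MU                                # shape (n, 2)
    log_comp = np.log(W / S) - 0.5 * (d / S) ** 2      # log weighted components
    log_comp -= log_comp.max(axis=1, keepdims=True)    # numerical stability
    resp = np.exp(log_comp)
    resp /= resp.sum(axis=1, keepdims=True)            # component responsibilities
    return (resp * (-d / S**2)).sum(axis=1)

rng = np.random.default_rng(0)
eta = 0.04
x = rng.uniform(-4.0, 4.0, size=2000)                  # walkers start uniform
frac_left_start = np.mean(x < 0.0)

for _ in range(3000):
    x = x + 0.5 * eta * mixture_score(x) + np.sqrt(eta) * rng.standard_normal(x.shape)

frac_left_end = np.mean(x < 0.0)
print(f"share left of 0: start {frac_left_start:.2f}, end {frac_left_end:.2f}")
```

For this mixture roughly 0.69 of the probability mass lies left of zero, so the share of walkers there should move from about one half toward that value. The modes are deliberately placed close enough that walkers cross the barrier within the run; with well-separated modes the shares would stay near their initial split for a very long time.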

The key practical point is that the update depends on $p$ only through the score $\nabla_x \log p$, in which any normalizing constant disappears. For an energy-based model $p(x)\propto e^{-E_\theta(x)}$,

$$\nabla_x \log p(x) = -\nabla_x E_\theta(x),$$

so Langevin dynamics can draw samples from the model without ever computing the intractable partition function $Z_\theta$. The reason is that $Z_\theta$ is a constant independent of $x$, so the gradient of $\log Z_\theta$ with respect to $x$ is zero; sampling sees only the shape of the energy (the differences in energy between points), never its absolute level. This makes Langevin the standard sampler for EBMs and for score-based generative models: as long as you can differentiate an unnormalized energy, you can sample from the distribution it defines.
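To illustrate that only energy differences matter, here is a small sketch that samples from a hypothetical double-well energy using nothing but energy evaluations (the gradient is taken by central finite differences, an assumption made so that no normalizer or analytic score is needed). The names `energy`, `grad_fd`, and `sample_ebm` are invented for this example.

```python
import numpy as np

def energy(x):
    """Hypothetical double-well energy; p(x) is proportional to exp(-energy(x))."""
    return (x**2 - 1.0) ** 2

def grad_fd(f, x, h=1e-4):
    """Central finite-difference derivative, using only evaluations of f."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def sample_ebm(energy_fn, n_walkers=2000, eta=0.01, n_steps=300, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_walkers)
    for _ in range(n_steps):
        score = -grad_fd(energy_fn, x)                 # score = -dE/dx
        x = x + 0.5 * eta * score + np.sqrt(eta) * rng.standard_normal(n_walkers)
    return x

samples = sample_ebm(energy)
# Adding a constant to the energy rescales Z but leaves the score unchanged.
samples_shifted = sample_ebm(lambda x: energy(x) + 100.0)
max_diff = np.max(np.abs(samples - samples_shifted))
print(max_diff)
```

Because both runs share a seed, any difference between them comes only from floating-point round-off in the finite-difference gradient, so `max_diff` should be tiny.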

To make it concrete, take a one-dimensional Gaussian $p(x)=\mathcal{N}(0,1)$, whose score is $\nabla_x\log p(x)=-x$. Substituting into the update gives $x_{t+1}=x_t-\tfrac{\eta}{2}x_t+\sqrt{\eta}\,\xi_t=(1-\tfrac{\eta}{2})x_t+\sqrt{\eta}\,\xi_t$. That is a first-order autoregressive process: the drift shrinks the sample toward the origin by a fixed fraction, and the noise puts variance back; in the small-$\eta$ limit its stationary variance tends to exactly $1$, recovering the target Gaussian. This is what the abstract update above looks like in the simplest possible case.
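The stationary variance of this recursion can be worked out in closed form, which makes the size of the step-size error explicit. Write $v$ for the stationary variance (a symbol introduced here). Since $x_t$ and $\xi_t$ are independent, taking the variance of both sides of the recursion gives:

```latex
\begin{aligned}
v &= \left(1-\tfrac{\eta}{2}\right)^{2} v + \eta \\
\Longrightarrow\quad
v &= \frac{\eta}{1-\left(1-\tfrac{\eta}{2}\right)^{2}}
   = \frac{\eta}{\eta-\tfrac{\eta^{2}}{4}}
   = \frac{1}{1-\tfrac{\eta}{4}}
   = 1 + \frac{\eta}{4} + O(\eta^{2}),
   \qquad 0<\eta<4 .
\end{aligned}
```

At $\eta = 0.04$, for example, the variance is about $1.01$: the chain is slightly too wide, and the excess shrinks linearly as $\eta \to 0$.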

Depth

The discrete update above samples $p$ exactly only in the continuous limit $\eta\to 0$. It is the Euler–Maruyama discretization of the Langevin stochastic differential equation

$$dx = \tfrac{1}{2}\nabla_x \log p(x)\,dt + dW_t,$$

where $W_t$ is Brownian motion. The stationary distribution of this SDE is exactly $p$, as one can verify from its Fokker–Planck equation. A finite step size $\eta$ introduces a discretization bias of order $O(\eta)$, so the chain converges to a slightly distorted distribution; the Metropolis-adjusted Langevin algorithm (MALA) removes this bias with an accept/reject test after each step, at the cost of needing an evaluable (unnormalized) density.
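A minimal sketch of the accept/reject correction, assuming a standard normal target and a deliberately large step so that the bias of the unadjusted chain is visible. The helper names (`log_p`, `log_q`, `run`) are invented for this example; the acceptance ratio is the usual Metropolis–Hastings ratio for the Langevin proposal.

```python
import numpy as np

def log_p(x):
    """Unnormalized log-density of N(0, 1); MALA needs this, not just the score."""
    return -0.5 * x**2

def score(x):
    return -x

def log_q(x_to, x_from, eta):
    """Log proposal density (up to a constant) of moving x_from -> x_to."""
    mean = x_from + 0.5 * eta * score(x_from)
    return -((x_to - mean) ** 2) / (2.0 * eta)

def run(eta, n_steps, n_chains, adjust, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_chains)
    for _ in range(n_steps):
        prop = x + 0.5 * eta * score(x) + np.sqrt(eta) * rng.standard_normal(n_chains)
        if adjust:
            # Metropolis-Hastings log acceptance ratio for the Langevin proposal.
            log_alpha = (log_p(prop) + log_q(x, prop, eta)
                         - log_p(x) - log_q(prop, x, eta))
            accept = np.log(rng.uniform(size=n_chains)) < log_alpha
            x = np.where(accept, prop, x)
        else:
            x = prop
    return x

eta = 0.5  # deliberately large so the discretization bias shows up
var_ula = run(eta, 200, 20000, adjust=False).var()
var_mala = run(eta, 200, 20000, adjust=True).var()
print(var_ula, var_mala)
```

For this target the unadjusted chain should settle at a variance near $1/(1-\eta/4) \approx 1.14$, while the adjusted chain should stay near $1$, up to Monte Carlo error.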

Stochastic Gradient Langevin Dynamics (SGLD) addresses a different scaling problem: when the log-density is itself a sum over $N$ data points (as in a Bayesian log-posterior $\log p(\theta)+\sum_i \log p(\mathrm{data}_i\mid\theta)$, up to a constant), every score evaluation costs a full pass over the data. SGLD replaces the full-data gradient with a minibatch estimate, rescaled by $N$ over the batch size so that it stays unbiased, making the method scalable to large datasets. With a step size $\eta_t \to 0$ on a suitable schedule (satisfying $\sum_t\eta_t=\infty$ and $\sum_t\eta_t^2<\infty$), the minibatch noise becomes negligible relative to the injected Langevin noise, and the chain transitions smoothly from stochastic optimization to posterior sampling. SGLD thus produces samples from a Bayesian posterior at roughly the cost of stochastic gradient descent: the large early steps find a high-probability region as fast as SGD, and the small later steps turn the iterate into a genuine posterior sampler.
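The following sketch applies SGLD to a toy problem where the exact posterior is known: inferring the mean $\theta$ of Gaussian data with unit variance under a Gaussian prior. The synthetic data, the prior variance, the batch size, and the schedule constants (`eta0`, `t0`, `gamma`) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): x_i ~ N(theta_true, 1).
N, theta_true = 1000, 2.0
data = theta_true + rng.standard_normal(N)

prior_var = 100.0            # prior: theta ~ N(0, prior_var)
batch_size = 100

def eta_schedule(t, eta0=1e-3, t0=100.0, gamma=0.55):
    """Polynomially decaying step size. For 0.5 < gamma <= 1 the steps sum to
    infinity while their squares have a finite sum."""
    return eta0 * (1.0 + t / t0) ** (-gamma)

theta, samples = 0.0, []
for t in range(5000):
    eta = eta_schedule(t)
    batch = rng.choice(data, size=batch_size, replace=False)
    grad_prior = -theta / prior_var
    # Rescale by N / batch_size so the minibatch gradient is an unbiased
    # estimate of the full-data likelihood gradient.
    grad_lik = (N / batch_size) * np.sum(batch - theta)
    theta = theta + 0.5 * eta * (grad_prior + grad_lik) + np.sqrt(eta) * rng.standard_normal()
    samples.append(theta)

samples = np.array(samples[2500:])        # drop the early, optimization-like phase
post_var = 1.0 / (1.0 / prior_var + N)    # exact Gaussian posterior, for comparison
post_mean = post_var * data.sum()
print(samples.mean(), post_mean, samples.std(), np.sqrt(post_var))
```

The sample mean should land close to the exact posterior mean. The spread is only approximately right over a finite run, because the residual minibatch noise and finite step size both widen it somewhat.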

Langevin sampling supplies the inner loop that draws candidate reconstructions from an energy prior. In Cryo-ET reconstruction the density $p(x)$ is a posterior: an energy prior (which 3D structures are plausible) and a data term (the reconstruction must agree with the measured tilt projections) sum to $-\log p$, and Langevin dynamics samples from it. CryoWGEN-II does exactly this, sampling the Boltzmann posterior directly and iteratively with Langevin/SGLD, and returns not a single answer but a family of reconstructions, each a volume consistent with the data, whose spread reads out the uncertainty left by the missing wedge. This connects the energy-based and variational views of generative modeling: variational inference approximates the posterior with a tractable distribution, while Langevin samples from it directly. They are two routes to the same target.
