Energy-based models

Probabilistic models that assign an unnormalized energy to every configuration, defining a density through the Boltzmann form.

Big picture first

To say “which $x$ are more likely,” the direct route is to write down a density $p(x)$. But a density must be nonnegative everywhere and integrate to one, and those two constraints make it awkward to design a density by hand. Energy-based models start elsewhere: first give every configuration $x$ a score, its energy $E_\theta(x)$ (lower meaning more favorable), then mechanically turn scores into probabilities. The modeler only has to reason about relative preference; the formula handles normalization. The rest of this page shows how that conversion works, where its cost lies, and why a “compare, never normalize” structure fits Cryo-ET reconstruction so well.

An energy-based model (EBM) defines a probability density over configurations $x$ through a scalar energy function $E_\theta(x)$. Here $x$ is the object being modeled (in Cryo-ET, a 3D density volume) and $\theta$ are the parameters of the energy function itself (for instance the weights of a neural network). Low energy corresponds to high probability:

[Figure: energy $E(x)$ and the induced density $p(x) \propto \exp(-E(x)/T)$, plotted against $x$.]

Lower energy means higher probability: the two well bottoms become the two peaks of the density. Higher T flattens exp(−E/T), pushing the density toward uniform and erasing the contrast between the wells.

The local minima of the energy function are the modes of the density, and a deeper well concentrates more probability mass at that location. A temperature parameter $T$ rescales the energy inside the Boltzmann factor $e^{-E/T}$: at low temperature the density sharpens and nearly all mass collects in the deepest well, while at high temperature it flattens toward uniform as configurations approach equal probability. The partition function $Z_\theta$ supplies only the overall normalization and leaves the shape of the density unchanged. The formula below fixes $T=1$, which loses no generality because $T$ can be absorbed into $E_\theta$.
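The effect of temperature can be checked numerically on a one-dimensional grid. The sketch below is only an illustration: the tilted double-well energy `double_well`, the grid range, and the three temperatures are assumptions chosen for the example, not values taken from this page.

```python
import math

def double_well(x):
    """Hypothetical tilted double-well energy; the left well (near x = -2) is deeper."""
    return (x * x - 4.0) ** 2 / 8.0 + 0.3 * x

def boltzmann_density(energy, xs, temperature):
    """Normalize exp(-E/T) on a uniform 1D grid, using a Riemann sum for Z."""
    dx = xs[1] - xs[0]
    energies = [energy(x) for x in xs]
    e_min = min(energies)  # constant shift for numerical stability; density is unchanged
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    z = sum(weights) * dx
    return [w / z for w in weights]

xs = [-5.0 + 10.0 * i / 2000 for i in range(2001)]
dx = xs[1] - xs[0]

left_mass = {}
for temperature in (0.2, 1.0, 5.0):
    density = boltzmann_density(double_well, xs, temperature)
    # Probability mass assigned to the deeper (left) well, x < 0.
    left_mass[temperature] = sum(p for x, p in zip(xs, density) if x < 0.0) * dx
    print(f"T={temperature}: mass in left well = {left_mass[temperature]:.3f}")
```

As the temperature drops, the share of mass in the deeper well should approach one; as it rises, the two wells should approach an even split.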

$$p_\theta(x) = \frac{1}{Z_\theta}\,e^{-E_\theta(x)}, \qquad Z_\theta = \int e^{-E_\theta(x)}\,dx .$$

Symbol by symbol: $p_\theta(x)$ is the probability density of configuration $x$; $E_\theta(x)$ is the scalar energy assigned to it; $e^{-E_\theta(x)}$ maps “low energy” monotonically to “large weight”; and $Z_\theta$ is the normalizing constant obtained by summing (integrating) over all possible $x$, which guarantees $\int p_\theta(x)\,dx=1$. Note the minus sign in the exponent: it is what makes the lowest-energy configuration the most probable, matching the physical intuition that particles settle into low-energy states.

This is the Boltzmann–Gibbs form. The model places no architectural constraint on $E_\theta$: any function mapping $x$ to a real number induces a valid density once normalized, provided $\int e^{-E_\theta(x)}\,dx$ is finite. This makes EBMs an extremely flexible family. A minimal example: take $E_\theta(x)=\tfrac12(x-\mu)^2/\sigma^2$ and the formula recovers a Gaussian with mean $\mu$ and variance $\sigma^2$, with $Z_\theta=\sqrt{2\pi}\,\sigma$ available in closed form. The Gaussian is “easy” precisely because its energy is quadratic; replace $E_\theta$ with a deep network and the density can have arbitrarily many wells of arbitrary shape, with the entire cost hidden inside $Z_\theta$.
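The Gaussian case is small enough to verify directly. The following sketch integrates $e^{-E_\theta(x)}$ numerically for the quadratic energy and compares the result with $\sqrt{2\pi}\,\sigma$. The helper names and the values of `mu` and `sigma` are illustrative assumptions.

```python
import math

def gaussian_energy(x, mu, sigma):
    """Quadratic energy E(x) = 0.5 * (x - mu)^2 / sigma^2."""
    return 0.5 * (x - mu) ** 2 / sigma ** 2

def partition_function_1d(energy, lo, hi, n=20001):
    """Trapezoidal estimate of Z = integral of exp(-E(x)) over [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        weight = math.exp(-energy(lo + i * dx))
        # Endpoints get half weight under the trapezoidal rule.
        total += 0.5 * weight if i in (0, n - 1) else weight
    return total * dx

mu, sigma = 1.5, 0.7  # arbitrary example values
z_numeric = partition_function_1d(
    lambda x: gaussian_energy(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma
)
z_closed_form = math.sqrt(2.0 * math.pi) * sigma
print(z_numeric, z_closed_form)
```

The two numbers should agree to many digits. This brute-force integration is practical only because $x$ is one-dimensional here.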

The difficulty is the partition function $Z_\theta$. For high-dimensional $x$ the integral is intractable: a $128^3$ voxel volume has roughly two million dimensions, and there is no closed form or feasible numerical sweep over it. As a result $p_\theta(x)$ cannot be evaluated exactly, and neither can the likelihood. Practical use of EBMs is built around the observation that many quantities of interest do not require $Z_\theta$.
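A quick back-of-the-envelope calculation shows why a numerical sweep is out of reach. The sketch assumes the coarsest conceivable grid, two sample points per dimension, which is a deliberately generous assumption rather than a realistic quadrature scheme.

```python
import math

voxels_per_side = 128
dims = voxels_per_side ** 3  # one dimension per voxel
points_per_axis = 2          # assumed: the coarsest possible grid

# A full grid needs points_per_axis ** dims energy evaluations.
# That number is far too large to print, so report its base-10 logarithm.
log10_evaluations = dims * math.log10(points_per_axis)
print(f"dimensions: {dims}")
print(f"grid evaluations: about 10^{log10_evaluations:.0f}")
```

Even at two points per axis, the evaluation count has hundreds of thousands of digits, so any method that enumerates configurations is ruled out.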

Intuition

The energy only ever matters by comparison. The probability ratio between two states is

$$\frac{p_\theta(x_1)}{p_\theta(x_2)} = e^{-(E_\theta(x_1)-E_\theta(x_2))},$$

in which $Z_\theta$ cancels. For instance, if $E_\theta(x_1)-E_\theta(x_2)=-2$, then $x_1$ is $e^2\approx 7.4$ times as probable as $x_2$, a conclusion that never touches the intractable $Z_\theta$. Sampling, ranking, and gradient-based search all depend on energy differences, not absolute probabilities. Equivalently, shifting the whole energy by a constant, $E_\theta\mapsto E_\theta+c$, changes no distribution at all: the factor $e^{-c}$ cancels in the ratio and is absorbed by $Z_\theta$ in the normalized density.
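Both facts, the cancellation of $Z_\theta$ in ratios and the invariance under constant shifts, can be checked on a toy model. The sketch below uses a hypothetical four-state discrete system so that the normalizer is a finite sum; the energy values are made up for illustration.

```python
import math

def boltzmann_probs(energies):
    """Normalized probabilities p_i = exp(-E_i) / Z for a finite set of states."""
    e_min = min(energies)  # constant shift for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

energies = [3.0, 5.0, 4.2, 7.5]  # hypothetical energies; E[0] - E[1] = -2
probs = boltzmann_probs(energies)

# The ratio of normalized probabilities equals exp(-(E_1 - E_2)); Z has cancelled.
ratio_from_probs = probs[0] / probs[1]
ratio_from_energies = math.exp(-(energies[0] - energies[1]))

# Adding a constant to every energy leaves the distribution unchanged.
shifted_probs = boltzmann_probs([e + 100.0 for e in energies])

print(ratio_from_probs, ratio_from_energies)
```

The continuous case behaves the same way, with the sum over states replaced by the integral that defines $Z_\theta$.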

The gradient of the log-density with respect to $x$, the score, is also free of $Z_\theta$:

$$\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x).$$

The derivation is one line: $\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$, and since $\log Z_\theta$ does not depend on $x$ its gradient is zero, leaving only $-\nabla_x E_\theta(x)$. The score is a vector field that at each $x$ points in the direction of steepest increase in log-probability, that is, steepest decrease in energy. This is what makes gradient-based samplers such as Langevin dynamics compatible with EBMs: they only ever need to ask the energy function “which way is energy lower from here,” which is exactly what backpropagation with respect to the input returns.
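A minimal sketch of such a sampler is unadjusted Langevin dynamics, which repeats the update $x \leftarrow x - \varepsilon\,\nabla_x E(x) + \sqrt{2\varepsilon}\,\xi$ with $\xi \sim \mathcal{N}(0,1)$. The example below is an assumption-laden toy: it uses the one-dimensional Gaussian energy so the result can be checked against a known mean and variance, supplies the gradient analytically instead of through automatic differentiation, and picks the step size, chain length, and burn-in arbitrarily.

```python
import math
import random

def langevin_chain(grad_energy, x0, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x - eps * dE/dx + sqrt(2 * eps) * noise."""
    x = x0
    samples = []
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(n_steps):
        x = x - step_size * grad_energy(x) + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

mu, sigma = 1.5, 0.7  # target: Gaussian with this mean and standard deviation

def grad_gaussian_energy(x):
    """dE/dx for E(x) = 0.5 * (x - mu)^2 / sigma^2; Z is never needed."""
    return (x - mu) / sigma ** 2

rng = random.Random(0)
samples = langevin_chain(grad_gaussian_energy, x0=-3.0, step_size=0.01,
                         n_steps=200_000, rng=rng)
kept = samples[2_000:]  # discard the initial transient (burn-in)

sample_mean = sum(kept) / len(kept)
sample_var = sum((s - sample_mean) ** 2 for s in kept) / len(kept)
print(sample_mean, sample_var)
```

The sample mean and variance should land close to $\mu$ and $\sigma^2$. Because the step size is finite and no accept/reject correction is applied, the chain carries a small discretization bias.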

Depth

Maximum-likelihood training requires the gradient of the log partition function. Differentiating under the integral sign gives $\nabla_\theta \log Z_\theta = \frac{1}{Z_\theta}\int \nabla_\theta e^{-E_\theta(x)}\,dx = -\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x)]$, an expectation under the model that demands sampling from $p_\theta$. Substituting it into the gradient of the log-likelihood gives an intuitive tug-of-war,

$$\nabla_\theta \log p_\theta(x_{\text{data}}) = -\,\nabla_\theta E_\theta(x_{\text{data}}) \;+\; \mathbb{E}_{p_\theta}\!\big[\nabla_\theta E_\theta(x)\big],$$

where, under gradient ascent, the first term pushes the energy of data points down and the second pushes up the energy of the model’s own samples; the two balance when the model distribution matches the data. The hard part is the expectation in the second term: it requires sampling from $p_\theta$, which is the partition-function problem wearing a different mask. Contrastive divergence approximates this expectation with short Markov chains initialized at the data, trading bias for affordable compute. Alternatives sidestep $Z_\theta$ entirely. Score matching never touches the likelihood and instead fits the model score $\nabla_x \log p_\theta$ to the data score, an objective in which only derivatives of $E_\theta$ with respect to $x$ appear, so it is independent of $Z_\theta$. Noise-contrastive estimation turns density estimation into a binary classification problem against a known noise distribution, letting a classifier learn the unnormalized density as a byproduct of separating “real data” from “noise.”
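The tug-of-war and its contrastive-divergence approximation can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the one-parameter energy $E_\theta(x)=\tfrac12(x-\theta)^2$, the synthetic data, the use of a few Langevin steps as the short Markov chain, and all step sizes. It is not the training procedure of any method described on this site.

```python
import random

def grad_theta_energy(x, theta):
    """d/dtheta of the toy energy E_theta(x) = 0.5 * (x - theta)^2."""
    return theta - x

def grad_x_energy(x, theta):
    """d/dx of the same toy energy."""
    return x - theta

def short_langevin(x, theta, rng, n_steps=10, step_size=0.1):
    """A few Langevin steps under the current model, started at x."""
    noise_scale = (2.0 * step_size) ** 0.5
    for _ in range(n_steps):
        x = x - step_size * grad_x_energy(x, theta) + noise_scale * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(100)]  # synthetic data, true mean 2.0
data_mean = sum(data) / len(data)

theta = -1.0
learning_rate = 0.1
for _ in range(200):
    # Negative samples: short chains initialized at the data points.
    negatives = [short_langevin(x, theta, rng) for x in data]
    # First term: lower the energy of the data.
    positive_term = -sum(grad_theta_energy(x, theta) for x in data) / len(data)
    # Second term: raise the energy of the model's own (approximate) samples.
    negative_term = sum(grad_theta_energy(x, theta) for x in negatives) / len(negatives)
    theta += learning_rate * (positive_term + negative_term)  # ascent on log-likelihood

print(theta, data_mean)
```

In this toy model the learned $\theta$ should settle near the data mean. The short chains do not reach the model distribution, so the gradient estimate is biased; here that bias mainly slows convergence, but in richer models it can shift the solution.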

EBMs supply the probabilistic backbone for several reconstruction methods on this site: an energy prior $p(x)\propto e^{-E_\phi(x)}$ encodes which 3D structures are plausible without ever evaluating $Z$. Smooth, connected volumes consistent with known biology get low energy, while fractured or artifact-laden volumes get high energy. Reconstruction multiplies this prior by the data likelihood, and CryoGEN takes exactly this route: CryoGEN-I finds a single lowest point on the energy landscape (a MAP point estimate), while CryoWGEN does not stop at one point but instead samples a family of solutions from that posterior. Its inner loop is precisely the Langevin sampling above, which needs only the score $-\nabla_x E$. The same “compare, never normalize” idea appears in optimal transport and in the Gibbs couplings of entropic transport, where an unnormalized exponential weight is likewise the central modeling object.
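Multiplying a prior by a likelihood corresponds to adding their energies, so a MAP estimate is a minimum of the summed energy. The sketch below illustrates only that general idea on a scalar stand-in for a volume. The quadratic prior energy, the Gaussian noise model, and every numeric value are assumptions for the example; it does not reproduce CryoGEN-I or CryoWGEN.

```python
def grad_posterior_energy(x, y, prior_mean=0.0, prior_std=1.0, noise_std=0.5):
    """Gradient of E_prior(x) + negative log-likelihood, for y = x + Gaussian noise."""
    grad_prior = (x - prior_mean) / prior_std ** 2  # from 0.5 * (x - m)^2 / s^2
    grad_data = (x - y) / noise_std ** 2            # from 0.5 * (y - x)^2 / noise^2
    return grad_prior + grad_data

y_observed = 2.0  # hypothetical measurement

# MAP estimate: plain gradient descent on the summed (posterior) energy.
x = 0.0
for _ in range(200):
    x -= 0.1 * grad_posterior_energy(x, y_observed)
x_map = x

# For this quadratic toy the minimizer is a precision-weighted average.
closed_form_map = (0.0 / 1.0 ** 2 + y_observed / 0.5 ** 2) / (1.0 / 1.0 ** 2 + 1.0 / 0.5 ** 2)
print(x_map, closed_form_map)
```

Replacing the descent step with a Langevin step (the same gradient plus injected noise) would turn this point estimate into a sampler for the posterior, again without any normalizing constant.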
