Energy-based models

Probabilistic models that assign an unnormalized energy to every configuration, defining a density through the Boltzmann form.

Big picture first

To say “which $x$ are more likely,” the direct route is to write down a density $p(x)$. But a density must be nonnegative everywhere and integrate to one, and those two constraints make it awkward to design by hand. Energy-based models start elsewhere: first give every configuration $x$ a score, its energy $E_\theta(x)$, with lower meaning more favorable; then mechanically turn scores into probabilities. The modeler only has to reason about relative preference; the formula handles normalization. The rest of this page shows how that conversion works, where its cost lies, and why a “compare, never normalize” structure fits Cryo-ET reconstruction so well.

An energy-based model (EBM) defines a probability density over configurations $x$ through a scalar energy function $E_\theta(x)$. Here $x$ is the object being modeled (in Cryo-ET, a 3D density volume) and $\theta$ are the parameters of the energy function itself (for instance the weights of a neural network). Low energy corresponds to high probability:

[Interactive figure: an energy curve $E(x)$ with two wells, shown beside the density $p(x) \propto \exp(-E(x)/T)$. Controls: well separation $d$ (default 4.0) and temperature $T$ (default 1.00).]

Lower energy means higher probability: the two well bottoms become the two peaks of the density. Higher $T$ flattens $\exp(-E/T)$, pushing the density toward uniform and erasing the contrast between the wells.

The local minima of the energy function are the modes of the density, and a deeper well concentrates more probability mass at that location. A temperature parameter $T$ divides the energy inside the exponential, $e^{-E/T}$: at low temperature the density sharpens and nearly all mass collects in the deepest well, while at high temperature it flattens toward uniform as configurations approach equal probability. The formulas on the rest of this page take $T=1$, which amounts to absorbing the temperature into $E_\theta$. The partition function $Z_\theta$ supplies only the overall normalization and leaves the shape of the density unchanged.

$$p_\theta(x) = \frac{1}{Z_\theta}\,e^{-E_\theta(x)}, \qquad Z_\theta = \int e^{-E_\theta(x)}\,dx .$$

Symbol by symbol: $p_\theta(x)$ is the probability density of configuration $x$; $E_\theta(x)$ is the scalar energy assigned to it; $e^{-E_\theta(x)}$ maps “low energy” monotonically to “large weight”; and $Z_\theta$ is the normalizing constant obtained by summing (integrating) over all possible $x$, which guarantees $\int p_\theta(x)\,dx=1$. Note the minus sign in the exponent: it is what makes the lowest-energy configuration the most probable, matching the physical intuition that particles settle into low-energy states.
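The conversion from energy to density can be sketched numerically in one dimension, where $Z$ is cheap to approximate on a grid. This is a minimal illustration, not the code behind the figure: the quartic form of `double_well_energy`, the grid bounds, and the helper name `boltzmann_density` are all assumptions chosen so that the wells sit a distance $d$ apart.

```python
import math

def double_well_energy(x, d=4.0):
    """Hypothetical 1D energy with two minima at x = -d/2 and x = +d/2."""
    return (x ** 2 - (d / 2.0) ** 2) ** 2 / 4.0

def boltzmann_density(energy, grid, temperature=1.0):
    """Return p(x) on a uniform grid, with Z approximated by a Riemann sum."""
    dx = grid[1] - grid[0]
    weights = [math.exp(-energy(x) / temperature) for x in grid]
    z = sum(weights) * dx  # grid approximation of the partition function
    return [w / z for w in weights], z

# Uniform grid on [-5, 5]; the unnormalized weight is negligible outside it.
grid = [-5.0 + 10.0 * i / 2000 for i in range(2001)]
dx = grid[1] - grid[0]

p_cold, z_cold = boltzmann_density(double_well_energy, grid, temperature=0.5)
p_hot, z_hot = boltzmann_density(double_well_energy, grid, temperature=5.0)

# The low-temperature density has taller, narrower peaks at the well bottoms.
peak_x = grid[p_cold.index(max(p_cold))]
print("total mass (cold):", sum(p_cold) * dx)
print("peak location (cold):", peak_x)
print("peak height cold vs hot:", max(p_cold), max(p_hot))
```

Dividing by `z` changes the height of the curve but not its shape, which is the sense in which the partition function only normalizes.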

This is the Boltzmann–Gibbs form. The model places no architectural constraint on $E_\theta$: any function mapping $x$ to a real number induces a valid density once normalized, provided $e^{-E_\theta}$ is integrable so that $Z_\theta$ is finite. This makes EBMs an extremely flexible family. A minimal example: take $E_\theta(x)=\tfrac12(x-\mu)^2/\sigma^2$ and the formula recovers a Gaussian with mean $\mu$ and variance $\sigma^2$, with $Z_\theta=\sqrt{2\pi}\,\sigma$ available in closed form. The Gaussian is “easy” precisely because its energy is quadratic; replace $E_\theta$ with a deep network and the density can have arbitrarily many wells of arbitrary shape, with the entire cost hidden inside $Z_\theta$.
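The Gaussian case can be checked numerically: integrating $e^{-E_\theta(x)}$ for the quadratic energy should reproduce $\sqrt{2\pi}\,\sigma$. The sketch below uses a plain midpoint rule; the helper name `partition_function`, the values of `mu` and `sigma`, and the integration window of ten standard deviations are illustrative assumptions.

```python
import math

def gaussian_energy(x, mu, sigma):
    """Quadratic energy E(x) = (x - mu)^2 / (2 sigma^2)."""
    return 0.5 * (x - mu) ** 2 / sigma ** 2

def partition_function(energy, lo, hi, n=20000):
    """Midpoint-rule approximation of Z, the integral of exp(-E(x)) over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(math.exp(-energy(lo + (i + 0.5) * dx)) for i in range(n)) * dx

mu, sigma = 1.5, 0.7  # arbitrary illustrative values
z_numeric = partition_function(
    lambda x: gaussian_energy(x, mu, sigma),
    mu - 10.0 * sigma,
    mu + 10.0 * sigma,
)
z_closed = math.sqrt(2.0 * math.pi) * sigma

print("numeric Z:", z_numeric)
print("closed-form Z:", z_closed)
```

A one-dimensional sweep like this is feasible only because the grid has a single axis; the same approach needs a number of grid points that grows exponentially with the dimension of $x$.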

The difficulty is the partition function $Z_\theta$. For high-dimensional $x$ the integral is intractable: a $128^3$ voxel volume has roughly two million dimensions, and there is neither a closed form nor a feasible numerical sweep over it. As a result, neither $p_\theta(x)$ nor the likelihood can be evaluated exactly. Practical use of EBMs is built around the observation that many quantities of interest do not require $Z_\theta$.

Intuition

The energy only ever matters by comparison. The probability ratio between two states is

$$\frac{p_\theta(x_1)}{p_\theta(x_2)} = e^{-(E_\theta(x_1)-E_\theta(x_2))},$$

in which $Z_\theta$ cancels. For instance, if $E_\theta(x_1)-E_\theta(x_2)=-2$, then $x_1$ is $e^2\approx 7.4$ times as probable as $x_2$, a conclusion that never touches the intractable $Z_\theta$. Sampling, ranking, and gradient-based search all depend on energy differences, not absolute probabilities. Equivalently, shifting the whole energy by a constant, $E_\theta\mapsto E_\theta+c$, changes no distribution at all: the factor $e^{-c}$ cancels in every ratio and is absorbed by $Z_\theta$ in the normalized density.
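Both claims, the ratio and the shift invariance, can be verified in a few lines without ever computing a normalizer. The quadratic `energy` and the helper `probability_ratio` below are hypothetical stand-ins; the two points are chosen so that the energy difference is exactly $-2$, as in the example above.

```python
import math

def energy(x):
    """Hypothetical energy, used only for illustration: E(x) = x^2 / 2."""
    return 0.5 * x ** 2

def probability_ratio(energy_fn, x1, x2):
    """p(x1) / p(x2) from the energy difference alone; Z never appears."""
    return math.exp(-(energy_fn(x1) - energy_fn(x2)))

def shifted_energy(x, c=3.0):
    """The same energy shifted by a constant c."""
    return energy(x) + c

x1, x2 = 0.0, 2.0  # E(x1) - E(x2) = 0 - 2 = -2
ratio = probability_ratio(energy, x1, x2)
ratio_shifted = probability_ratio(shifted_energy, x1, x2)

print("ratio:", ratio)                  # equals e^2
print("ratio after shift:", ratio_shifted)
```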

The gradient of the log-density with respect to $x$, the score, is also free of $Z_\theta$:

$$\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x).$$

The derivation is one line: $\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$, and since $\log Z_\theta$ does not depend on $x$ its gradient is zero, leaving only $-\nabla_x E_\theta(x)$. The score is a vector field that at each $x$ points in the direction of steepest increase in log-probability, that is, steepest decrease in energy. This is what makes gradient-based samplers such as Langevin dynamics compatible with EBMs: they only ever need to ask the energy function “which way is energy lower from here,” which is the quantity backpropagation with respect to the input $x$ returns.
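A minimal sketch of unadjusted Langevin dynamics shows how a sampler can run on the score alone. It assumes the Gaussian energy from the earlier example (with `mu = 1`, `sigma = 1`), a hand-written gradient in place of backpropagation, and illustrative values for the step size, chain length, and number of chains. With a finite step size the samples are only approximately distributed according to $p_\theta$.

```python
import math
import random

def grad_energy(x, mu=1.0, sigma=1.0):
    """Gradient of the Gaussian energy E(x) = (x - mu)^2 / (2 sigma^2)."""
    return (x - mu) / sigma ** 2

def langevin_sample(grad_fn, x0, step=0.05, n_steps=200, rng=random):
    """Unadjusted Langevin: x <- x - step * grad E(x) + sqrt(2 * step) * noise."""
    x = x0
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        # The drift term is step * score, since score = -grad E; Z is never needed.
        x = x - step * grad_fn(x) + noise_scale * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)  # fixed seed for reproducibility
# Independent chains, all started far from the mode at x0 = -3.
samples = [langevin_sample(grad_energy, x0=-3.0, rng=rng) for _ in range(2000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print("sample mean:", mean)      # should be close to mu = 1
print("sample variance:", var)   # should be close to sigma^2 = 1
```

Running many short independent chains rather than one long chain keeps the samples uncorrelated, at the cost of repeating the burn-in for each chain.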

Depth

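Maximum-likelihood training requires the gradient of the log-partition function. Differentiating under the integral sign gives $\nabla_\theta \log Z_\theta = \frac{1}{Z_\theta}\int \nabla_\theta e^{-E_\theta(x)}\,dx = -\mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x)]$, an expectation under the model that demands sampling from $p_\theta$. Substituting it into the gradient of the log-likelihood gives an intuitive tug-of-war,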

$$\nabla_\theta \log p_\theta(x_{\text{data}}) = -\,\nabla_\theta E_\theta(x_{\text{data}}) \;+\; \mathbb{E}_{p_\theta}\!\big[\nabla_\theta E_\theta(x)\big],$$

where, under gradient ascent, the first term pushes the energy of data points down and the second pushes up the energy of the model’s own samples; the two balance when the model distribution matches the data. The hard part is the expectation in the second term: it requires sampling from $p_\theta$, which is the partition-function problem wearing a different mask. Contrastive divergence approximates this expectation with short Markov chains initialized at the data, accepting some bias in exchange for affordable compute. Alternatives sidestep $Z_\theta$ entirely. Score matching never touches the likelihood and instead fits the model score $\nabla_x \log p_\theta$ to the data score, an objective in which only derivatives of $E_\theta$ with respect to $x$ appear, so it is independent of $Z_\theta$. Noise-contrastive estimation turns density estimation into a binary classification problem against a known noise distribution, letting a classifier learn the unnormalized density as a byproduct of separating “real data” from “noise.”
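The tug-of-war is easiest to see in a toy setting where the model expectation can be computed exactly. The sketch below assumes a tabular energy over three discrete states, $E_\theta(k)=\theta_k$, so that $\nabla_{\theta_j} E_\theta(k)$ is 1 when $k=j$ and 0 otherwise; the data frequencies, learning rate, and function names are illustrative. In realistic models the second term would have to be estimated from samples, for example by contrastive divergence.

```python
import math

def model_probs(theta):
    """p_theta(k) proportional to exp(-theta[k]) for the tabular energy E(k) = theta[k]."""
    weights = [math.exp(-t) for t in theta]
    z = sum(weights)  # tractable here only because there are three states
    return [w / z for w in weights]

def log_likelihood_grad(theta, data_freq):
    """Data-averaged gradient: -freq(j) (data term) + p_theta(j) (model term)."""
    p = model_probs(theta)
    return [-f + q for f, q in zip(data_freq, p)]

data_freq = [0.5, 0.3, 0.2]  # hypothetical empirical frequencies
theta = [0.0, 0.0, 0.0]      # start from a uniform model
learning_rate = 1.0

for _ in range(500):
    grad = log_likelihood_grad(theta, data_freq)
    # Gradient ascent: energy rises where the model has too much mass,
    # and falls where the data has more mass than the model.
    theta = [t + learning_rate * g for t, g in zip(theta, grad)]

final_probs = model_probs(theta)
print("model probabilities:", final_probs)
print("data frequencies:   ", data_freq)
```

The gradient vanishes exactly when `model_probs(theta)` equals `data_freq`, which is the balance point described above.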

EBMs supply the probabilistic backbone for several reconstruction methods on this site: an energy prior $p(x)\propto e^{-E_\phi(x)}$ encodes which 3D structures are plausible without ever evaluating $Z$. Smooth, connected volumes consistent with known biology get low energy, while fractured or artifact-laden volumes get high energy. Reconstruction multiplies this prior by the data likelihood to form a posterior, and CryoGEN takes exactly this route: CryoGEN-I finds a single lowest point on the energy landscape (a MAP point estimate), while CryoWGEN does not stop at one point but instead samples a family of solutions from that posterior, with an inner loop that is precisely the Langevin sampling above and needs only the score $-\nabla_x E$. The same “compare, never normalize” idea appears in optimal transport and in the Gibbs couplings of entropic transport, where an unnormalized exponential weight is likewise the central modeling object.
