Entropy & KL divergence

Measuring uncertainty and the discrepancy between distributions through Shannon entropy, cross-entropy, and KL divergence, with the link to maximum likelihood.

Think of a distribution as “how unsure you are about the outcome.” If a coin almost always lands heads, you are nearly certain about the next flip and your uncertainty is low; if it is fair, every outcome is maximally hard to guess and your uncertainty is highest. Shannon entropy turns this “hard to guess” feeling into a single number. This page starts from entropy, builds up to cross-entropy and KL divergence — the latter measuring how far apart two distributions are — and shows that this is the shared language behind nearly every probabilistic objective, including the reconstruction methods.

Shannon entropy measures the uncertainty carried by a distribution. For a discrete distribution $p$,

$$H(p)=-\sum_{i}p(i)\log p(i),$$

where $p(i)$ is the probability of outcome $i$, $-\log p(i)$ is the “surprise” of seeing that outcome (rarer outcomes are more surprising), and the sum is the expected surprise. Entropy attains its maximum at the uniform distribution and is zero when all mass sits on a single outcome. It lower-bounds the average code length needed to losslessly encode samples drawn from $p$.

A concrete example, with logarithms base 2: a fair coin $p=(\tfrac12,\tfrac12)$ has entropy $-\tfrac12\log_2\tfrac12-\tfrac12\log_2\tfrac12=1$ bit, so on average one bit per flip is needed to encode it. A biased coin $p=(0.9,0.1)$ has entropy $-0.9\log_2 0.9-0.1\log_2 0.1\approx 0.47$ bits, less than half: the outcome is more predictable, so fewer bits are needed on average. A fully certain coin $p=(1,0)$ has entropy zero and needs no bits at all.
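The three coin values above can be checked numerically. The following is a minimal sketch using only the standard library; the function name `entropy_bits` is an illustrative choice, not something defined elsewhere on this page.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution given as a list.

    Outcomes with p(i) = 0 are skipped, following the usual convention
    that 0 * log(0) contributes nothing.
    """
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

fair = entropy_bits([0.5, 0.5])      # fair coin
biased = entropy_bits([0.9, 0.1])    # biased coin
certain = entropy_bits([1.0, 0.0])   # fully certain coin

print(f"fair={fair:.3f}  biased={biased:.3f}  certain={certain:.3f}")
```

Writing each term as $p(i)\log_2\frac{1}{p(i)}$ rather than $-p(i)\log_2 p(i)$ is only a cosmetic choice that keeps the zero-entropy case from printing as a negative zero.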

[Figure: interactive plot over x from -6 to 6 of p = N(0, 1) (fixed) and q = N(μ, σ²) (adjustable), with readouts KL(p‖q) = 2.459, KL(q‖p) = 1.227, and asymmetry ratio 2.00.]

Forward KL(p‖q) grows fast where q has little mass but p has mass; reverse KL(q‖p) does the opposite. The two generally differ, so KL is not a distance.

The figure illustrates the asymmetry of the KL divergence with two univariate Gaussians: the reference $p$ is fixed as a standard normal, while the mean and standard deviation of $q$ are adjustable. When $q$ collapses to near-zero density in a tail where $p$ still carries mass, the forward KL rises sharply; swapping the roles of the two distributions generally yields a different value, and both vanish together only when $p=q$. This asymmetry, driven by the log-ratio term inside the integral, carries over directly to KL and cross-entropy on continuous distributions.
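For two univariate Gaussians the KL divergence has a well-known closed form, $\mathrm{KL}\big(\mathcal N(\mu_1,\sigma_1^2)\,\Vert\,\mathcal N(\mu_2,\sigma_2^2)\big)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac12$ (in nats), so the two directions can be compared without numerical integration. The sketch below evaluates both for one setting of $q$; the values $\mu=1$, $\sigma=0.5$ are chosen here purely for illustration and are not the setting behind the figure's readouts.

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) in nats, closed form."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# p is the fixed standard normal; q is a narrower, shifted Gaussian.
mu, sigma = 1.0, 0.5                      # illustrative choice for q
forward = kl_gauss(0.0, 1.0, mu, sigma)   # KL(p || q)
reverse = kl_gauss(mu, sigma, 0.0, 1.0)   # KL(q || p)

print(f"KL(p||q)={forward:.3f}  KL(q||p)={reverse:.3f}")
```

With this narrow $q$ the forward direction comes out at roughly 2.81 nats against roughly 0.82 nats for the reverse direction, because $p$ has mass in tails where $q$ has almost none.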

Cross-entropy gives the average code length when samples from $p$ are encoded with a code designed for a distribution $q$:

$$H(p,q)=-\sum_{i}p(i)\log q(i).$$

Here outcomes occur at the frequencies of $p$ (hence the weighting by $p(i)$), but the code lengths are designed for $q$ (hence $-\log q(i)$). If the $q$ used for coding does not match the true $p$, the average code length must grow; the excess is exactly the KL divergence below.
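As a quick illustration of the code-length reading, the sketch below encodes a fair coin once with its own code and once with a code built for a 0.9/0.1 coin. The helper name `cross_entropy_bits` is hypothetical, and the function assumes $q(i)>0$ wherever $p(i)>0$.

```python
import math

def cross_entropy_bits(p, q):
    """Average code length in bits when samples from p use a code built for q.

    Assumes q(i) > 0 wherever p(i) > 0; outcomes with p(i) = 0 are skipped.
    """
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # true distribution: fair coin
q = [0.9, 0.1]   # distribution the code was designed for

matched = cross_entropy_bits(p, p)      # equals the entropy H(p)
mismatched = cross_entropy_bits(p, q)   # code built for the wrong coin
excess = mismatched - matched           # extra bits paid per flip

print(f"matched={matched:.3f}  mismatched={mismatched:.3f}  excess={excess:.3f}")
```

The mismatched code costs about 1.74 bits per flip instead of 1 bit, an excess of about 0.74 bits.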

The difference between cross-entropy and entropy is the KL divergence:

$$\mathrm{KL}(p\,\Vert\,q)=\sum_{i}p(i)\log\frac{p(i)}{q(i)}=H(p,q)-H(p).$$

It measures the extra code length paid for using a code built for $q$ instead of the true $p$. By Gibbs’ inequality, $\mathrm{KL}(p\Vert q)\ge 0$, with equality if and only if $p=q$. The KL divergence is asymmetric: in general $\mathrm{KL}(p\Vert q)\neq\mathrm{KL}(q\Vert p)$, so it is not a distance metric.

A worked example of the asymmetry. Let $p=(0.5,0.5)$ and $q=(0.9,0.1)$. In bits, $\mathrm{KL}(p\Vert q)=0.5\log_2\frac{0.5}{0.9}+0.5\log_2\frac{0.5}{0.1}\approx 0.74$ bits, whereas $\mathrm{KL}(q\Vert p)=0.9\log_2\frac{0.9}{0.5}+0.1\log_2\frac{0.1}{0.5}\approx 0.53$ bits. Both directions are positive and vanish only when $p=q$, but the values differ: swapping the two distributions changes the answer.
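The same two numbers can be reproduced in a few lines. This is a minimal sketch; `kl_bits` is an illustrative helper name, and it assumes the second argument is positive wherever the first is.

```python
import math

def kl_bits(a, b):
    """KL(a || b) in bits for discrete distributions given as lists.

    Assumes b(i) > 0 wherever a(i) > 0; outcomes with a(i) = 0 are skipped.
    """
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_bits(p, q)   # KL(p || q)
reverse = kl_bits(q, p)   # KL(q || p)

print(f"KL(p||q)={forward:.3f} bits  KL(q||p)={reverse:.3f} bits")
```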

Intuition

Forward KL $\mathrm{KL}(p\Vert q)$ heavily penalizes placing little $q$-mass where $p$ has mass, forcing $q$ to cover the whole support of $p$ (“mass-covering”). Reverse KL $\mathrm{KL}(q\Vert p)$ penalizes $q$ for spreading outside $p$, favoring a single mode (“mode-seeking”).

Deep dive

The asymmetry comes from which distribution does the weighting. Forward KL takes the expectation under $p$: wherever $p(i)>0$ but $q(i)\to 0$, the term $\log\frac{p(i)}{q(i)}\to+\infty$ is amplified by $p(i)$, so $q$ dares not leave a gap anywhere $p$ has support. This is the source of “mass-covering.” Reverse KL takes the expectation under $q$: it penalizes wherever $q(i)>0$ but $p(i)\to 0$, so $q$ prefers to cover a single peak of $p$ rather than spill outside it. This is the source of “mode-seeking.” When $p$ is multimodal, the $q$ minimizing forward KL spreads to cover every peak (even placing mass in the low-density valleys between them), while the $q$ minimizing reverse KL retreats into a single peak. The two objectives give markedly different solutions in generative modeling, and the choice depends on whether you want to “cover all modes” or “produce one sharp sample.”
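The two behaviors can be seen in a toy experiment: fit a single Gaussian $q$ to a two-peaked $p$ by brute-force search, once under forward KL and once under reverse KL. Everything in the sketch below is an assumption made for illustration: the target (equal peaks at $\pm 3$ with standard deviation 0.5), the grid on which both distributions are discretized, and the search ranges for the mean and standard deviation of $q$.

```python
import math

def normalize(w):
    total = sum(w)
    return [v / total for v in w]

def gauss_on_grid(xs, mu, sigma):
    """Gaussian evaluated on the grid and normalized to sum to 1."""
    return normalize([math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs])

def kl(a, b):
    """KL(a || b) in nats; +inf if b is zero somewhere a has mass."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai > 0:
            if bi == 0:
                return math.inf
            total += ai * (math.log(ai) - math.log(bi))
    return total

# Grid from -12 to 12 in steps of 0.1 (illustrative discretization).
xs = [-12 + 0.1 * i for i in range(241)]

# Bimodal target p: equal-weight peaks at -3 and +3, each with std 0.5.
p = normalize([
    0.5 * math.exp(-0.5 * ((x + 3) / 0.5) ** 2)
    + 0.5 * math.exp(-0.5 * ((x - 3) / 0.5) ** 2)
    for x in xs
])

# Candidate single-Gaussian fits: mean in [-4, 4], std in [0.3, 4.0].
candidates = [(-4 + 0.2 * i, 0.3 + 0.1 * j) for i in range(41) for j in range(38)]

mu_fwd, sigma_fwd = min(candidates, key=lambda c: kl(p, gauss_on_grid(xs, *c)))
mu_rev, sigma_rev = min(candidates, key=lambda c: kl(gauss_on_grid(xs, *c), p))

print(f"forward KL fit: mu={mu_fwd:.1f}, sigma={sigma_fwd:.1f}")
print(f"reverse KL fit: mu={mu_rev:.1f}, sigma={sigma_rev:.1f}")
```

Under these assumptions the forward-KL fit lands between the peaks with a wide spread, while the reverse-KL fit sits on one peak with a narrow spread. Which of the two symmetric peaks the reverse fit picks is an artifact of the search order.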

The KL divergence connects directly to maximum likelihood. Maximizing a model’s log-likelihood on data is equivalent to minimizing the forward KL between the empirical data distribution $\hat p$ and the model $q_\theta$:

$$\arg\max_{\theta}\,\mathbb{E}_{\hat p}[\log q_\theta]=\arg\min_{\theta}\,\mathrm{KL}(\hat p\,\Vert\,q_\theta),$$

where $\hat p$ is the empirical distribution that puts equal weight on each observed sample and $q_\theta$ is the model distribution with parameters $\theta$. The equivalence holds because $\mathrm{KL}(\hat p\Vert q_\theta)=H(\hat p,q_\theta)-H(\hat p)$, and $H(\hat p)$ does not depend on $\theta$, so it is constant for the optimization. This unifies maximum likelihood with information-theoretic distribution matching: training a maximum-likelihood model is making the model distribution match the empirical distribution of the data.
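The equivalence is easy to check on a small discrete case. The sketch below uses a made-up sample of ten coin flips and a one-parameter Bernoulli model, then scans a grid of $\theta$ values; the data, the grid, and all function names are illustrative assumptions.

```python
import math

# Hypothetical data: 10 coin flips, 1 = heads, 0 = tails (8 heads).
data = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
n = len(data)

# Empirical distribution over the outcomes (tails, heads).
p_hat = [data.count(0) / n, data.count(1) / n]

def model(theta):
    """Bernoulli model q_theta as (P(tails), P(heads))."""
    return [1.0 - theta, theta]

def avg_log_lik(theta):
    """Average log-likelihood of the data, i.e. E_{p_hat}[log q_theta]."""
    q = model(theta)
    return sum(math.log(q[x]) for x in data) / n

def kl(a, b):
    """KL(a || b) in nats; assumes b(i) > 0 wherever a(i) > 0."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

thetas = [i / 100 for i in range(1, 100)]   # grid 0.01 .. 0.99

theta_mle = max(thetas, key=avg_log_lik)
theta_kl = min(thetas, key=lambda t: kl(p_hat, model(t)))

# KL + average log-likelihood equals -H(p_hat), whatever theta is.
neg_entropy = sum(pi * math.log(pi) for pi in p_hat if pi > 0)
gaps = [kl(p_hat, model(t)) + avg_log_lik(t) - neg_entropy for t in thetas]

print(theta_mle, theta_kl, max(abs(g) for g in gaps))
```

Both searches select the same $\theta$, the observed fraction of heads, and the sum of the two objectives stays at $-H(\hat p)$ across the whole grid up to floating-point error.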

The KL divergence grows without bound as $q$ puts vanishing mass where $p$ has mass, and it equals $+\infty$ when the supports are disjoint, where it gives no useful gradient. This is common early in training before the model and data distributions align, which is exactly why CryoGEN-II matches distributions with optimal transport instead of KL. The Wasserstein distance instead measures discrepancy through an optimal-transport cost, staying finite even without overlapping support and varying smoothly with geometric displacement, which makes it a frequent alternative to KL in variational inference and generative modeling.
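A one-dimensional toy case makes the contrast concrete. For histograms on a shared, evenly spaced grid, the Wasserstein-1 distance equals the bin width times the summed absolute difference of the two cumulative distributions. The sketch below compares it with KL for two point masses that are moved apart; the grid size, the shifts, and the helper names are assumptions for illustration, and this is not a description of how CryoGEN-II computes its transport cost.

```python
import math

def kl(a, b):
    """KL(a || b) in nats; +inf if b is zero somewhere a has mass."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai > 0:
            if bi == 0:
                return math.inf
            total += ai * math.log(ai / bi)
    return total

def wasserstein1(a, b, dx=1.0):
    """W1 between two histograms on the same evenly spaced 1-D grid.

    Uses the 1-D identity W1 = integral of |CDF_a - CDF_b|.
    """
    total, cdf_a, cdf_b = 0.0, 0.0, 0.0
    for ai, bi in zip(a, b):
        cdf_a += ai
        cdf_b += bi
        total += abs(cdf_a - cdf_b)
    return total * dx

def point_mass(n_bins, k):
    """All probability in bin k of an n_bins histogram."""
    return [1.0 if i == k else 0.0 for i in range(n_bins)]

p = point_mass(10, 0)
results = []
for shift in (1, 3, 6):
    q = point_mass(10, shift)
    results.append((shift, kl(p, q), wasserstein1(p, q)))

for shift, kl_value, w1_value in results:
    print(f"shift={shift}  KL={kl_value}  W1={w1_value}")
```

KL is infinite for every shift, so it cannot tell a near miss from a far one, whereas W1 grows in step with the displacement.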

These quantities are everywhere in Cryo-ET reconstruction. Maximum-likelihood objectives (MAP, MLE & EM) are at heart minimizing the forward KL between the data and the predictions of the forward imaging model, and maximum-a-posteriori objectives add a log-prior term to the same likelihood; the E-step of EM approximates the latent posterior with an auxiliary distribution, and the bound is tight exactly when the KL between that auxiliary distribution and the true posterior is zero. Variational inference explicitly minimizes a reverse KL, bringing the “mode-seeking” behavior into the approximate posterior. And when KL degenerates because the distributions do not overlap and gives no gradient, optimal transport takes over: this KL-versus-Wasserstein trade-off maps directly onto the boundary between the KL-family point estimates and the OT-family (CryoGEN-II) among the reconstruction methods.
