Entropy & KL divergence

Measuring uncertainty and the discrepancy between distributions through Shannon entropy, cross-entropy, and KL divergence, with the link to maximum likelihood.

Think of a distribution as “how unsure you are about the outcome.” If a coin almost always lands heads, you are nearly certain about the next flip and your uncertainty is low; if it is fair, the outcome is as hard to guess as it can be and your uncertainty is highest. Shannon entropy turns this “hard to guess” feeling into a single number. This page starts from entropy, builds up to cross-entropy and KL divergence (the latter measuring how far apart two distributions are), and shows that these quantities are the shared language behind nearly every probabilistic objective, including those used by the Cryo-ET reconstruction methods.

Shannon entropy measures the uncertainty carried by a distribution. For a discrete distribution $p$,

H(p)=-\sum_{i}p(i)\log p(i),

where $p(i)$ is the probability of outcome $i$, $-\log p(i)$ is the “surprise” of seeing that outcome (rarer outcomes are more surprising), and the sum is the expected surprise; terms with $p(i)=0$ contribute zero by convention. Over a finite set of outcomes, entropy attains its maximum at the uniform distribution and is zero when all mass sits on a single outcome. It lower-bounds the average code length needed to losslessly encode samples drawn from $p$.

A concrete example, with logarithms base 2: a fair coin $p=(\tfrac12,\tfrac12)$ has entropy $-\tfrac12\log_2\tfrac12-\tfrac12\log_2\tfrac12=1$ bit, so on average one bit per flip is needed to encode it. A biased coin $p=(0.9,0.1)$ has entropy $-0.9\log_2 0.9-0.1\log_2 0.1\approx 0.47$ bits, less than half the fair coin's: the outcome is more predictable, so fewer bits are needed on average. A fully certain coin $p=(1,0)$ has entropy zero and needs no bits at all.
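The three coin entropies above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `entropy_bits` is illustrative and not part of any library.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; outcomes with p(i) = 0 are skipped (0 log 0 = 0)."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

fair = entropy_bits([0.5, 0.5])      # fair coin
biased = entropy_bits([0.9, 0.1])    # biased coin
certain = entropy_bits([1.0, 0.0])   # fully certain coin

print(f"fair={fair:.3f}  biased={biased:.3f}  certain={certain:.3f}")
```

The fair coin comes out at exactly 1 bit, the biased coin at roughly 0.47 bits, and the certain coin at 0 bits, matching the hand calculation.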

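[Interactive figure: KL divergence between two univariate Gaussians, with $p=\mathcal{N}(0,1)$ fixed and $q=\mathcal{N}(\mu,\sigma^2)$ adjustable. Example setting: mean $\mu=1.50$ and standard deviation $\sigma=0.70$ give $\mathrm{KL}(p\Vert q)=2.459$, $\mathrm{KL}(q\Vert p)=1.227$, and an asymmetry ratio of 2.00.]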

Forward KL(p‖q) grows fast where q has little mass but p has mass; reverse KL(q‖p) does the opposite. The two generally differ, so KL is not a distance.

The figure illustrates the asymmetry of the KL divergence with two univariate Gaussians: the reference $p$ is fixed as a standard normal, while the mean and standard deviation of $q$ are adjustable. When $q$ collapses to near-zero density in a tail where $p$ still carries mass, the forward KL rises sharply; swapping the roles of the two distributions generally yields a different value, and both vanish together only when $p=q$ . This asymmetry, driven by the log-ratio term inside the integral, carries over directly to KL and cross-entropy on continuous distributions.
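For two univariate Gaussians the KL divergence has a well-known closed form, $\mathrm{KL}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\Vert\,\mathcal{N}(\mu_2,\sigma_2^2)\big)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac12$. The sketch below evaluates it at the figure's example setting ($\mu=1.5$, $\sigma=0.7$). It assumes the figure reports values in nats (natural logarithm); the function name `kl_gauss` is illustrative.

```python
import math

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats (closed form)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# p = N(0, 1) is the fixed reference; q = N(1.5, 0.7^2) is the example setting.
forward = kl_gauss(0.0, 1.0, 1.5, 0.7)   # KL(p || q)
reverse = kl_gauss(1.5, 0.7, 0.0, 1.0)   # KL(q || p)

print(f"forward={forward:.4f}  reverse={reverse:.4f}  ratio={forward / reverse:.2f}")
```

Under that assumption the two directions evaluate to roughly 2.46 and 1.23 nats, in line with the values displayed in the figure up to the last digit, and their ratio is close to 2.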

Cross-entropy gives the average code length when samples from $p$ are encoded with a code designed for a distribution $q$:

H(p,q)=-\sum_{i}p(i)\log q(i).

Here outcomes occur at the frequencies of $p$ (hence the weighting by $p(i)$), but the code lengths are designed for $q$ (hence $-\log q(i)$). If the $q$ used for coding does not match the true $p$, the average code length grows; the excess is exactly the KL divergence defined below.

The difference between cross-entropy and entropy is the KL divergence:

\mathrm{KL}(p\,\Vert\,q)=\sum_{i}p(i)\log\frac{p(i)}{q(i)}=H(p,q)-H(p).

It measures the extra code length paid for using a code built for $q$ instead of the true $p$. By Gibbs’ inequality, $\mathrm{KL}(p\Vert q)\ge 0$, with equality if and only if $p=q$. The KL divergence is asymmetric: in general $\mathrm{KL}(p\Vert q)\neq\mathrm{KL}(q\Vert p)$, so it is not a distance metric.

A worked example of the asymmetry. Let $p=(0.5,0.5)$ and $q=(0.9,0.1)$. In bits, $\mathrm{KL}(p\Vert q)=0.5\log_2\frac{0.5}{0.9}+0.5\log_2\frac{0.5}{0.1}\approx 0.74$ bits, whereas $\mathrm{KL}(q\Vert p)=0.9\log_2\frac{0.9}{0.5}+0.1\log_2\frac{0.1}{0.5}\approx 0.53$ bits. Both directions are positive and vanish only when $p=q$, but the values differ: swapping the two distributions changes the answer.
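The worked example and the identity $\mathrm{KL}(p\Vert q)=H(p,q)-H(p)$ can be verified in a few lines. This is a minimal sketch in bits; the helper names are illustrative, and it assumes $q(i)>0$ wherever $p(i)>0$ so that every logarithm is finite.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q(i) > 0 wherever p(i) > 0."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) in bits; assumes q(i) > 0 wherever p(i) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl(p, q)   # about 0.74 bits
kl_qp = kl(q, p)   # about 0.53 bits
excess = cross_entropy(p, q) - entropy(p)   # should equal kl_pq

print(f"KL(p||q)={kl_pq:.3f}  KL(q||p)={kl_qp:.3f}  H(p,q)-H(p)={excess:.3f}")
```

The two directions differ, and the excess code length `H(p,q) - H(p)` coincides with `KL(p||q)` up to floating-point error.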

Intuition

Forward KL $\mathrm{KL}(p\Vert q)$ heavily penalizes placing little $q$-mass where $p$ has mass, forcing $q$ to cover the whole support of $p$ (“mass-covering”). Reverse KL $\mathrm{KL}(q\Vert p)$ penalizes $q$ for spreading outside $p$, favoring a single mode (“mode-seeking”).

Deep dive

The asymmetry comes from which distribution does the weighting. Forward KL takes the expectation under $p$: wherever $p(i)>0$ but $q(i)\to 0$, the term $\log\frac{p(i)}{q(i)}\to+\infty$ is amplified by $p(i)$, so $q$ dares not leave a gap anywhere $p$ has support; this is the source of “mass-covering.” Reverse KL takes the expectation under $q$: it penalizes wherever $q(i)>0$ but $p(i)\to 0$, so $q$ prefers to cover a single peak of $p$ rather than spill outside it; this is the source of “mode-seeking.” When $p$ is multimodal, the $q$ minimizing forward KL spreads to cover every peak (even placing mass in the low-density valleys between them), while the $q$ minimizing reverse KL retreats into a single peak. The two objectives give markedly different solutions in generative modeling, and the choice depends on whether you want to “cover all modes” or “produce one sharp sample.”
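The mass-covering versus mode-seeking behavior can be seen in a toy experiment: fit a single Gaussian $q$ to a bimodal $p$ by brute-force search, once under forward KL and once under reverse KL. Everything in this sketch is an illustrative assumption (the bimodal target with peaks at $\pm 3$, the discretization grid, and the coarse candidate set); it is not the procedure of any particular reconstruction method.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def gauss_on_grid(xs, mean, std):
    """Gaussian shape evaluated on the grid, renormalized to sum to 1."""
    return normalize([math.exp(-0.5 * ((x - mean) / std) ** 2) for x in xs])

def kl(p, q):
    """Discrete KL(p || q) in nats; +inf if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

# Grid on [-12, 12] with spacing 0.1.
xs = [i * 0.1 for i in range(-120, 121)]

# Hypothetical bimodal target: equal-weight peaks at -3 and +3, each with std 0.5.
p = normalize([a + b for a, b in zip(gauss_on_grid(xs, -3.0, 0.5),
                                     gauss_on_grid(xs, 3.0, 0.5))])

# Candidate (mean, std) pairs: means in [-4, 4], stds in [0.5, 4], both in steps of 0.5.
candidates = [(k * 0.5, s * 0.5) for k in range(-8, 9) for s in range(1, 9)]

forward_best = min(candidates, key=lambda c: kl(p, gauss_on_grid(xs, *c)))
reverse_best = min(candidates, key=lambda c: kl(gauss_on_grid(xs, *c), p))

print("forward KL minimizer (mean, std):", forward_best)
print("reverse KL minimizer (mean, std):", reverse_best)
```

In this setup the forward-KL minimizer is a wide Gaussian centered between the peaks, putting mass in the valley, while the reverse-KL minimizer is a narrow Gaussian sitting on one of the two peaks. Which peak it picks is arbitrary, since the target is symmetric.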

The KL divergence connects directly to maximum likelihood. Maximizing a model’s log-likelihood on data is equivalent to minimizing the forward KL between the empirical data distribution $\hat p$ and the model $q_\theta$ :

\arg\max_{\theta}\,\mathbb{E}_{\hat p}[\log q_\theta] =\arg\min_{\theta}\,\mathrm{KL}(\hat p\,\Vert\,q_\theta),

where $\hat p$ is the empirical distribution that puts equal weight on each observed sample and $q_\theta$ is the model distribution with parameters $\theta$. The equivalence holds because $\mathrm{KL}(\hat p\Vert q_\theta)=H(\hat p,q_\theta)-H(\hat p)$, where the cross-entropy $H(\hat p,q_\theta)=-\mathbb{E}_{\hat p}[\log q_\theta]$ is the negative average log-likelihood and $H(\hat p)$ does not depend on $\theta$, so it is constant for the optimization. This unifies maximum likelihood with information-theoretic distribution matching: training a maximum-likelihood model amounts to making the model distribution match the empirical distribution of the data.
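A small numerical check of this equivalence, as a sketch: for a Bernoulli model fit to a handful of coin flips, the parameter that maximizes the average log-likelihood is the same one that minimizes $\mathrm{KL}(\hat p\Vert q_\theta)$. The data set and the grid of candidate $\theta$ values are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical observations: 1 = heads, 0 = tails (7 heads, 3 tails).
data = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]

# Empirical distribution over (tails, heads).
p_hat = [data.count(0) / len(data), data.count(1) / len(data)]

def avg_log_likelihood(theta):
    """E_{p_hat}[log q_theta] for a Bernoulli model with P(heads) = theta."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

def kl_to_model(theta):
    """KL(p_hat || q_theta) in nats."""
    q = [1 - theta, theta]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p_hat, q) if pi > 0)

thetas = [k / 100 for k in range(1, 100)]          # candidate parameters 0.01 .. 0.99
theta_mle = max(thetas, key=avg_log_likelihood)    # maximum likelihood
theta_kl = min(thetas, key=kl_to_model)            # minimum forward KL

print(theta_mle, theta_kl)
```

Both searches land on the empirical frequency of heads, and the sum `kl_to_model(theta) + avg_log_likelihood(theta)` is the same for every `theta`, which is the constant $-H(\hat p)$ from the argument above.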

The KL divergence takes values in $[0,+\infty]$: it grows without bound as $q$ puts vanishing mass where $p$ has mass, and it equals $+\infty$ as soon as $q$ assigns zero mass to any outcome that $p$ can produce, in particular when the supports are disjoint, where it gives no useful gradient. This is common early in training, before the model and data distributions align, which is exactly why CryoGEN-II matches distributions with optimal transport instead of KL. The Wasserstein distance measures discrepancy through an optimal-transport cost; it stays finite even without overlapping support and varies smoothly with geometric displacement, which makes it a frequent alternative to KL in variational inference and generative modeling.
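The contrast can be made concrete with two point masses on a line. In one dimension the Wasserstein-1 distance between two histograms on the same evenly spaced grid equals the summed absolute difference of their cumulative distributions times the bin spacing. The sketch below uses that fact; the helper names and the 10-bin grid are illustrative assumptions, and this is not a description of how CryoGEN-II computes its transport cost.

```python
import math

def kl(p, q):
    """Discrete KL(p || q) in nats; +inf if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def wasserstein1(p, q, spacing=1.0):
    """W1 between histograms on a shared 1-D grid: sum |CDF_p - CDF_q| * spacing."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total

def point_mass(n_bins, index):
    """All probability on a single bin."""
    return [1.0 if i == index else 0.0 for i in range(n_bins)]

p = point_mass(10, 0)
for shift in (1, 3, 6):
    q = point_mass(10, shift)
    print(shift, kl(p, q), wasserstein1(p, q))
```

For every shift the KL divergence is infinite, so it cannot tell a near miss from a far one, whereas the Wasserstein-1 distance equals the shift itself and therefore shrinks steadily as $q$ moves toward $p$.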

These quantities are everywhere in Cryo-ET reconstruction. Maximum-likelihood and maximum-a-posteriori objectives (MAP, MLE & EM) are at heart minimizing the forward KL between the data and the predictions of the forward imaging model, with MAP adding a log-prior term; the E-step of EM approximates the latent posterior with an auxiliary distribution, and the resulting lower bound on the log-likelihood is tight exactly when the KL between that auxiliary distribution and the true posterior is zero. Variational inference explicitly minimizes a reverse KL, bringing the “mode-seeking” behavior into the approximate posterior. And when KL degenerates because the distributions do not overlap and gives no gradient, optimal transport takes over: this KL-versus-Wasserstein trade-off maps directly onto the boundary between the KL-family point estimates and the OT-family (CryoGEN-II) among the reconstruction methods.
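The statement about the EM bound rests on the standard decomposition $\log p(x)=\mathcal{L}(q)+\mathrm{KL}\big(q(z)\,\Vert\,p(z\mid x)\big)$, where $\mathcal{L}(q)=\sum_z q(z)\log\frac{p(x,z)}{q(z)}$ is the lower bound. The sketch below checks it on a toy model with a binary latent variable; the prior and likelihood values are hypothetical numbers chosen for illustration.

```python
import math

prior = [0.5, 0.5]        # hypothetical p(z) for a binary latent z
likelihood = [0.8, 0.2]   # hypothetical p(x | z) for one observed x

joint = [pz * lx for pz, lx in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                     # p(x)
posterior = [j / evidence for j in joint]                 # p(z | x)

def lower_bound(q):
    """L(q) = sum_z q(z) log( p(x, z) / q(z) ), in nats."""
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint) if qi > 0)

def kl(p, q):
    """Discrete KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q_guess = [0.5, 0.5]                                    # an auxiliary distribution
gap = math.log(evidence) - lower_bound(q_guess)         # slack in the bound
gap_at_posterior = math.log(evidence) - lower_bound(posterior)

print(f"gap={gap:.4f}  KL(q||posterior)={kl(q_guess, posterior):.4f}  "
      f"gap at posterior={gap_at_posterior:.1e}")
```

The slack in the bound equals the KL between the auxiliary distribution and the true posterior, and it vanishes (up to floating-point error) when the auxiliary distribution is set to the posterior, which is what the E-step does.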
