Bayesian inference

Describing unknowns with probability and updating a prior into a posterior from observed data via Bayes' rule.

Bayesian inference treats every unknown as a random variable and represents uncertainty about it with a probability distribution. Let $\theta$ denote the parameters or latent state to be estimated and $x$ the observed data. A model specifies two ingredients: a prior $p(\theta)$ that encodes belief about $\theta$ before observation, and a likelihood $p(x\mid\theta)$ that describes how the data are generated given $\theta$.

Intuition

Picture inference as spreading belief across every possible truth. You start by distributing belief according to the prior: some values of $\theta$ look more reasonable and get more weight. Each observation then asks, “if the truth were this $\theta$, how likely is the data I just saw?”, and reweights accordingly. Values of $\theta$ consistent with the data gain weight; those that contradict it lose weight. The reweighted, renormalized distribution is the posterior. The process never picks a single answer; it reshapes the whole belief from the prior’s shape into the posterior’s.

An example directly relevant to Cryo-ET: only one coordinate of a 2-D structure is measured, while the other goes unobserved, like the missing wedge. Watch how prior and likelihood combine into the posterior, and how the MAP point estimate differs from the full posterior:

[Interactive demo. Curves: prior p(x); likelihood (data); posterior ∝ prior × likelihood. Slider: missing-wedge severity, from ample data (sharp likelihood) to more missing (flat likelihood). Readout at high severity: wide posterior (uncertain, a family).]

Bayesian inference casts the recovery of a clean structure from a noisy observation as a single update: the prior p(x) (amber, the energy prior, describing which structures are plausible) times the likelihood (blue, what this observation says) gives the posterior (purple, the updated belief). In this demo x denotes the unknown structure, so it plays the role that $\theta$ plays in the rest of this page. The MAP (amber tick) is the posterior's peak, which is all that CryoGEN-I reports; the whole purple curve, peak plus width, is what CryoWGEN reports. The missing wedge weakens the observation in this direction and flattens the likelihood, so the posterior widens: the same gap admits a family of plausible answers. Drag toward ample data and the posterior tightens onto the MAP.

The posterior: updating belief with Bayes’ rule

After observing $x$, belief about $\theta$ is given by the posterior $p(\theta\mid x)$, obtained from Bayes’ rule:

$$p(\theta\mid x)=\frac{p(x\mid\theta)\,p(\theta)}{p(x)}, \qquad p(x)=\int p(x\mid\theta)\,p(\theta)\,d\theta.$$

Reading it term by term: the numerator $p(x\mid\theta)\,p(\theta)$ is the prior belief $p(\theta)$ multiplied by the likelihood $p(x\mid\theta)$ that this $\theta$ explains the data. The denominator $p(x)$ is the evidence, or marginal likelihood; it marginalizes over all $\theta$ (integrates $\theta$ out) and normalizes the posterior in $\theta$ into a valid probability distribution. Because $p(x)$ does not depend on $\theta$ — it is just a constant scale factor — the relation is often written

$$p(\theta\mid x)\;\propto\;p(x\mid\theta)\,p(\theta),$$

that is, posterior $\propto$ likelihood $\times$ prior. The symbol $\propto$ (proportional to) is a reminder that to recover actual probabilities you still divide by $p(x)$ so the density integrates to one. That dropped constant is irrelevant for locating the posterior mode (the MAP point estimate), but when comparing two different models, $p(x)$ itself measures how well each model, averaged over its prior, predicted the observed data.

Intuition

The prior sets a starting point, the likelihood supplies the evidence carried by the data, and the posterior is a compromise between them. With ample data the likelihood dominates and the posterior concentrates; with scarce data the prior retains more influence.
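The update can be sketched numerically on a grid of candidate $\theta$ values: multiply prior by likelihood pointwise, then divide by the sum so the weights total one. The following is a minimal illustration, not a production method; the coin-flip likelihood, the uniform prior, the 101-point grid, and the helper names `grid_posterior`, `spread`, and `coin_likelihood` are all assumptions made for this example.

```python
def grid_posterior(grid, prior, likelihood):
    """Posterior weights on a grid: multiply prior by likelihood, then normalize."""
    unnormalized = [prior(t) * likelihood(t) for t in grid]
    # Discrete stand-in for the evidence p(x), up to the constant dropped
    # from the likelihood below.
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


def spread(grid, weights):
    """Standard deviation of the discrete distribution `weights` on `grid`."""
    mean = sum(t * w for t, w in zip(grid, weights))
    return sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) ** 0.5


def coin_likelihood(k, n):
    """Likelihood of k successes in n trials, up to a constant in theta."""
    return lambda t: t**k * (1 - t) ** (n - k)


grid = [i / 100 for i in range(101)]      # candidate values of theta (assumed grid)
flat_prior = lambda t: 1.0 / len(grid)    # assumed uniform prior over the grid

post_10 = grid_posterior(grid, flat_prior, coin_likelihood(7, 10))
post_100 = grid_posterior(grid, flat_prior, coin_likelihood(70, 100))

print("posterior mode, n=10:", grid[post_10.index(max(post_10))])
print("posterior spread, n=10 vs n=100:", spread(grid, post_10), spread(grid, post_100))
```

Both datasets have the same success frequency, so the difference in spread reflects only the amount of data: the larger sample gives the more concentrated posterior.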

A worked example: Beta-Binomial conjugacy

A canonical closed-form example is the Beta-Binomial conjugate pair: to estimate a success probability $p$, a Beta$(a,b)$ prior combined with $k$ successes in $n$ Binomial trials, whose likelihood is proportional to $p^{k}(1-p)^{n-k}$, yields a Beta posterior, Beta$(a+k,\,b+n-k)$. The posterior mean $(a+k)/(a+b+n)$ lies between the prior mean $a/(a+b)$ and the data frequency $k/n$, and shifts toward $k/n$ as the sample size grows.
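The conjugacy can be checked in one line. The Beta$(a,b)$ density is proportional to $p^{a-1}(1-p)^{b-1}$ (the standard form, not spelled out above). Multiplying it by the likelihood and dropping every constant that does not depend on $p$ gives

$$p^{k}(1-p)^{n-k}\;\cdot\;p^{a-1}(1-p)^{b-1}\;=\;p^{(a+k)-1}\,(1-p)^{(b+n-k)-1},$$

which is the unnormalized density of Beta$(a+k,\,b+n-k)$.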

Concrete numbers make this tangible. With a uniform prior Beta$(1,1)$ (no knowledge of $p$) and $k=7$ successes in $n=10$ trials, the posterior is Beta$(8,4)$, with mean $8/12\approx0.67$, pulled by the data from the prior mean $0.5$ toward the frequency $0.7$, but not all the way, because the sample is small. Swap in a strong prior Beta$(20,20)$ (a firm belief the coin is nearly fair), and the same data gives only Beta$(27,23)$, mean $27/50=0.54$; the prior drags the estimate back near $0.5$. Read $a,b$ as pseudo-counts: the prior acts as if you had already seen $a$ successes and $b$ failures, and the real data simply add to those tallies. That is exactly how prior strength trades off against the amount of data.
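The arithmetic above can be reproduced in a few lines. This is a small sketch using exact fractions; the function names `beta_binomial_update` and `beta_mean` are made up for this illustration.

```python
from fractions import Fraction


def beta_binomial_update(a, b, k, n):
    """Beta(a, b) prior plus k successes in n trials -> Beta(a + k, b + n - k)."""
    return a + k, b + (n - k)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, kept as an exact fraction."""
    return Fraction(a, a + b)


priors = {"uniform Beta(1,1)": (1, 1), "strong Beta(20,20)": (20, 20)}
for name, (a, b) in priors.items():
    a_post, b_post = beta_binomial_update(a, b, k=7, n=10)
    mean = beta_mean(a_post, b_post)
    print(f"{name}: posterior Beta({a_post},{b_post}), mean {mean} = {float(mean):.2f}")
```

The update only adds counts, which is the pseudo-count reading in code form: the prior parameters and the observed tallies enter the posterior on equal footing.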

When the prior and likelihood belong to matched families such that the posterior shares the prior’s family, the prior is called conjugate, and the posterior is available in closed form — updating just rewrites a few parameters. Conjugacy is one of the rare cases that sidesteps the integral; for most real models $p(x)$ has no closed form and one resorts to MAP point estimates, variational inference, or Langevin sampling.

Prediction and uncertainty

Prediction for a new observation $\tilde{x}$ is given by the predictive distribution, which averages over the posterior:

$$p(\tilde{x}\mid x)=\int p(\tilde{x}\mid\theta)\,p(\theta\mid x)\,d\theta.$$

This step is where the Bayesian approach parts ways with point estimation: instead of fixing a single $\hat\theta$ and predicting from it, every possible $\theta$ votes, weighted by its posterior density $p(\theta\mid x)$. The factor $p(\tilde{x}\mid\theta)$ in the integrand is the likelihood of the new data under that $\theta$. When the posterior is broad (the parameters are uncertain), the predictive distribution widens accordingly, automatically propagating parameter uncertainty into predictions, something a point estimate cannot do.
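For the Beta-Binomial model the predictive integral has a closed form (the Beta-Binomial distribution), which makes the widening easy to see. The sketch below compares it with a plug-in prediction that fixes $p$ at the posterior mean. The choice of the Beta$(8,4)$ posterior, of $m=10$ future trials, and of the posterior mean as the plug-in value are assumptions of this example, as are the helper names.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_predictive_pmf(j, m, a, b):
    """P(j successes in m new trials), averaging Binomial(m, p) over Beta(a, b)."""
    return math.comb(m, j) * math.exp(log_beta(a + j, b + m - j) - log_beta(a, b))


def plug_in_pmf(j, m, p_hat):
    """P(j successes in m new trials) with p fixed at a single point estimate."""
    return math.comb(m, j) * p_hat**j * (1 - p_hat) ** (m - j)


def variance(pmf):
    """Variance of a distribution over the counts 0..len(pmf)-1."""
    mean = sum(j * w for j, w in enumerate(pmf))
    return sum((j - mean) ** 2 * w for j, w in enumerate(pmf))


a, b, m = 8, 4, 10   # posterior Beta(8, 4); predict 10 further trials
predictive = [posterior_predictive_pmf(j, m, a, b) for j in range(m + 1)]
plug_in = [plug_in_pmf(j, m, a / (a + b)) for j in range(m + 1)]

print("predictive variance:", variance(predictive))
print("plug-in variance:   ", variance(plug_in))
```

Both distributions have the same mean, but the posterior predictive has the larger variance, because it carries the remaining uncertainty about $p$ in addition to the sampling noise.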

Depth

The posterior delivers a full uncertainty structure, not just a point. A common summary is the credible interval, which is worth distinguishing from the frequentist confidence interval. A credible interval comes straight from the posterior: $[\ell,u]$ is a 95% credible interval exactly when $\int_\ell^u p(\theta\mid x)\,d\theta=0.95$, and it means literally “the posterior probability that $\theta$ lies in this interval is 0.95”. That is the reading frequentist confidence intervals are so often mistaken for; here it is the definition.
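As an illustration, a 95% credible interval for the Beta$(8,4)$ posterior from the coin example can be approximated by accumulating the density on a fine grid. The sketch below assumes the equal-tailed convention (2.5% of posterior mass cut from each side), which is one choice among several; the function names and the step count are likewise assumptions.

```python
import math


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def equal_tailed_interval(a, b, mass=0.95, steps=20_000):
    """Approximate equal-tailed credible interval via a midpoint-rule CDF."""
    width = 1.0 / steps
    tail = (1.0 - mass) / 2.0
    cdf, lower, upper = 0.0, None, None
    for i in range(steps):
        p = (i + 0.5) * width
        cdf += beta_pdf(p, a, b) * width   # running posterior mass up to p
        if lower is None and cdf >= tail:
            lower = p
        if cdf >= 1.0 - tail:
            upper = p
            break
    return lower, upper


lower, upper = equal_tailed_interval(8, 4)
print(f"95% credible interval for p: [{lower:.3f}, {upper:.3f}]")
```

The interval is read directly as a statement about $p$ given the data: under this model and prior, 95% of the posterior mass lies between the two endpoints.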

The evidence $p(x)$ looks like a mere normalizing constant, yet it is the engine of model comparison. For two models $M_1,M_2$, the ratio of evidences $p(x\mid M_1)/p(x\mid M_2)$ is the Bayes factor. It bakes in an Occam’s razor: an over-flexible model spreads its prior probability across many possible datasets, lowering $p(x)$ for any particular $x$, so a simpler model that still fits wins on evidence. Goodness-of-fit and model complexity end up unified in one quantity, with no separate penalty term.
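A small coin example shows the Occam effect. The two models here are chosen purely for illustration: $M_1$ fixes $p=0.5$ with no free parameter, while $M_2$ lets $p$ vary under a uniform Beta$(1,1)$ prior, whose evidence is the Binomial likelihood integrated against that prior.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def evidence_fixed(k, n, p0=0.5):
    """p(x | M1): the coin's success probability is fixed at p0."""
    return math.comb(n, k) * p0**k * (1 - p0) ** (n - k)


def evidence_beta(k, n, a=1.0, b=1.0):
    """p(x | M2): Binomial likelihood integrated against a Beta(a, b) prior."""
    return math.comb(n, k) * math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


# Bayes factor of the simple model M1 over the flexible model M2.
bayes_factor_7 = evidence_fixed(7, 10) / evidence_beta(7, 10)     # 7 of 10 successes
bayes_factor_10 = evidence_fixed(10, 10) / evidence_beta(10, 10)  # 10 of 10 successes

print("Bayes factor M1/M2, k=7: ", bayes_factor_7)
print("Bayes factor M1/M2, k=10:", bayes_factor_10)
```

With 7 successes in 10 trials the fixed-coin model is mildly favored even though $p=0.7$ fits the data better, because the flexible model spreads its evidence evenly over all eleven possible counts. With 10 successes in 10 the simple model no longer fits and the flexible one wins.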

For decisions, the framework supplies a clean optimality criterion: given a loss function $L(\theta,\hat\theta)$, the best estimate minimizes the posterior expected loss $\mathbb{E}_{p(\theta\mid x)}[L(\theta,\hat\theta)]$. Squared loss yields the posterior mean, absolute loss the posterior median, and 0-1 loss the posterior mode (the MAP); for continuous $\theta$ the last holds in the limit of a vanishingly small tolerance around $\hat\theta$. Different point estimates correspond to different loss assumptions.
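The correspondence between losses and point estimates can be checked by brute force: discretize a posterior, compute the expected loss of every candidate estimate, and keep the minimizer. The sketch below does this for the Beta$(8,4)$ posterior from the coin example; the grid size and helper names are assumptions, and the mode is read off as the grid point of highest density rather than by minimizing a 0-1 loss.

```python
import math


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


N = 400
grid = [(i + 0.5) / N for i in range(N)]          # candidate values of theta
weights = [beta_pdf(t, 8, 4) for t in grid]
total = sum(weights)
weights = [w / total for w in weights]            # discretized posterior


def expected_loss(estimate, loss):
    """Posterior expected loss of reporting `estimate`."""
    return sum(w * loss(t, estimate) for t, w in zip(grid, weights))


def bayes_estimate(loss):
    """Grid point that minimizes the posterior expected loss."""
    return min(grid, key=lambda est: expected_loss(est, loss))


under_squared = bayes_estimate(lambda t, est: (t - est) ** 2)   # approx. posterior mean
under_absolute = bayes_estimate(lambda t, est: abs(t - est))    # approx. posterior median
posterior_mode = grid[weights.index(max(weights))]              # approx. MAP

print(under_squared, under_absolute, posterior_mode)
```

Because this posterior is skewed toward lower values, the three estimates differ: the mean sits lowest, the mode highest, and the median between them.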

Where it sits in Cryo-ET reconstruction

In cryo-electron tomography, reconstruction can be framed as a Bayesian inverse problem: the unknown three-dimensional density plays the role of $\theta$, the tilt-series projections are $x$, the imaging model set by the CTF and noise gives the likelihood, and structural assumptions on the density act as the prior. The prior is not optional decoration here: projections span only a limited angular range (the missing wedge), so the likelihood carries almost no information along certain directions, and the posterior stays broad there. It is the prior that fills in where the data are silent. The demo above compresses this mechanism into one missing coordinate in 2-D.
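The mechanism can be mimicked with a deliberately tiny toy, which is not a Cryo-ET pipeline: two scalar coordinates with independent Gaussian priors, of which only the first is measured (with Gaussian noise). The independence of the prior, all numeric values, and the function name `gaussian_update` are assumptions of this sketch.

```python
def gaussian_update(prior_mean, prior_var, obs, noise_var):
    """Conjugate Gaussian update for one scalar coordinate.

    `obs=None` marks an unobserved coordinate: the likelihood is flat there,
    so the posterior equals the prior.
    """
    if obs is None:
        return prior_mean, prior_var
    precision = 1.0 / prior_var + 1.0 / noise_var        # precisions add
    mean = (prior_mean / prior_var + obs / noise_var) / precision
    return mean, 1.0 / precision


# Coordinate 0 is measured as 2.0 with noise variance 0.1;
# coordinate 1 is never measured (the "missing wedge" direction).
observations = (2.0, None)
posterior = [gaussian_update(0.0, 1.0, obs, 0.1) for obs in observations]

for axis, (mean, var) in enumerate(posterior):
    print(f"coordinate {axis}: posterior mean {mean:.3f}, variance {var:.3f}")
```

The measured coordinate ends up with a much smaller variance than its prior, while the unmeasured one keeps the prior's mean and variance unchanged. With an independent prior the data say nothing about the missing direction; a prior that couples the coordinates is what would let the observed direction inform the unobserved one.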

Locating an optimum of the posterior and the role of regularization are treated in MAP, MLE & the EM algorithm. The four methods can be told apart by how they treat this posterior: CryoGEN-I takes the posterior mode (a MAP point estimate); CryoGEN-II returns a stable single answer via WAE/OT; and CryoWGEN-I and CryoWGEN-II refuse to collapse to one point, using EVIA (Monte-Carlo and Langevin respectively) to characterize a whole family of posterior samples — delivering how uncertain the density is along with the density itself.
