MAP, MLE & the EM algorithm

Two routes to point estimates—maximum likelihood and maximum a posteriori—and expectation–maximization for latent-variable models.

The full posterior $p(\theta\mid x)$ carries all the uncertainty about a parameter $\theta$, but often we just want one concrete answer: a set of weights, a single 3D density, a set of pose parameters. Collapsing the whole distribution into one $\theta$ is point estimation, and the two criteria below cover most of practice.

Intuition

Picture the posterior as a landscape over parameter space whose height is how credible a parameter is. A point estimate asks “where is the highest peak?” Maximum likelihood reads off the landscape shaped by the data alone; maximum a posteriori first reweights that landscape by the prior—raising some regions, lowering others—before locating the peak. Both return the peak (the mode), not the landscape’s center of mass (the posterior mean); when the landscape is skewed, those two diverge.

Maximum likelihood estimation (MLE) takes the parameters that make the observed data most probable:

$$\hat\theta_{\text{MLE}}=\arg\max_{\theta}\;p(x\mid\theta).$$

Here $p(x\mid\theta)$ is the likelihood: read as a function of $\theta$ (with $x$ fixed at its observed value), it scores how probable this data is under that parameter. In practice one almost always maximizes the log-likelihood $\log p(x\mid\theta)$, because for independent samples the joint likelihood is a product, and the log turns it into a sum $\sum_i \log p(x_i\mid\theta)$, which is numerically stable against underflow and cleaner to differentiate. A familiar case: MLE for $n$ independent Gaussian samples gives $\hat\mu$ equal to the sample mean and $\hat\sigma^2$ equal to the sample variance (divided by $n$). Many textbook “natural estimators” are in fact the MLE of some model.
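The Gaussian case can be checked numerically. The following is a minimal sketch, assuming synthetic data drawn with a fixed seed; the function names `gaussian_log_likelihood` and `gaussian_mle` are illustrative and not taken from any library. It computes the closed-form estimates, evaluates the log-likelihood as a sum, and shows that the raw product of densities underflows to zero for the same data.

```python
import math
import random

def gaussian_log_likelihood(xs, mu, var):
    """Sum of log N(x_i | mu, var) over independent samples."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)

def gaussian_mle(xs):
    """Closed-form MLE: the sample mean and the variance that divides by n."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

# Synthetic data (an assumption made only for this illustration).
rng = random.Random(0)
xs = [rng.gauss(2.0, 1.5) for _ in range(2000)]

mu_hat, var_hat = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, mu_hat, var_hat)

# The same likelihood written as a product of densities underflows in
# double precision, which is why the sum of logs is used instead.
raw_product = math.prod(
    math.exp(-((x - mu_hat) ** 2) / (2 * var_hat)) / math.sqrt(2 * math.pi * var_hat)
    for x in xs
)

print("mu_hat:", mu_hat, "var_hat:", var_hat)
print("log-likelihood at the MLE:", best)
print("raw product of densities:", raw_product)
```

Moving either estimate away from its closed-form value lowers the log-likelihood, which is what makes the pair the maximizer.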

A Gaussian mixture is the canonical setting for EM: which component generated each sample is the latent variable, and the soft assignment of responsibilities yields closed-form E- and M-steps. The demonstration below runs EM step by step on a fixed one-dimensional two-component dataset, with the log-likelihood non-decreasing after every iteration.

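[Interactive demonstration: EM on a one-dimensional two-component Gaussian mixture. Legend: data histogram, component 1, component 2, mixture. Initial state: iteration 0, log-likelihood -327.258, means -0.50 and 0.80, weights 0.50 and 0.50.]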

Each step of the demonstration runs one E-step (responsibilities under the current two Gaussians) followed by one M-step (updated means, variances, weights). The log-likelihood is monotonically non-decreasing until convergence to a local maximum.

Maximum a posteriori (MAP) estimation incorporates the prior alongside the likelihood, taking the mode of the posterior:

$$\hat\theta_{\text{MAP}}=\arg\max_{\theta}\;p(x\mid\theta)\,p(\theta).$$

Here $p(\theta)$ is the prior, the belief about $\theta$ before seeing data. Note that the denominator, the evidence $p(x)$ from Bayes’ theorem, does not depend on $\theta$, so it drops out of the $\arg\max$. That is why MAP needs only $p(x\mid\theta)\,p(\theta)$ and never has to compute the intractable normalizing integral.
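A small worked case makes the point about the evidence concrete. The sketch below assumes a coin-flip model with a Beta prior, an example chosen for illustration and not taken from the text above; the counts `k`, `n` and the prior parameters `a`, `b` are made-up values. It locates the MAP estimate by maximizing only the unnormalized log-posterior over a grid and compares it with the known closed form for this model.

```python
import math

def log_unnormalized_posterior(theta, k, n, a, b):
    """log p(x | theta) + log p(theta), up to constants; no evidence term."""
    log_likelihood = k * math.log(theta) + (n - k) * math.log(1 - theta)
    log_prior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return log_likelihood + log_prior

# Hypothetical data and prior: 7 heads in 10 flips, Beta(5, 5) prior.
k, n = 7, 10
a, b = 5.0, 5.0

# Grid search over the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_map_grid = max(grid, key=lambda t: log_unnormalized_posterior(t, k, n, a, b))

# Closed forms for this model (the MAP formula holds for a, b > 1).
theta_map_closed = (k + a - 1) / (n + a + b - 2)
theta_mle = k / n

print("MLE:", theta_mle)
print("MAP (grid):", theta_map_grid)
print("MAP (closed form):", theta_map_closed)
```

The prior centered at 0.5 pulls the estimate from the MLE of 0.7 toward 0.5, and no normalizing integral was evaluated at any point.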

The two are closely related: in the log domain the MAP objective is the MLE objective plus a term $\log p(\theta)$. That term lets the prior act as a regularizer: a Gaussian prior corresponds to an $L_2$ penalty, a Laplace prior to an $L_1$ penalty. As the prior flattens, MAP reduces to MLE. This view connects Bayesian inference to regularized frequentist optimization.

Depth

Why is a Gaussian prior exactly an $L_2$ penalty? Let $p(\theta)\propto\exp\!\big(-\tfrac{1}{2\tau^2}\lVert\theta\rVert^2\big)$, so $\log p(\theta)=-\tfrac{1}{2\tau^2}\lVert\theta\rVert^2+\text{const}$. Substituting into $\arg\max[\log p(x\mid\theta)+\log p(\theta)]$, the constant does not move the optimum, and what remains is “log-likelihood minus a penalty on $\lVert\theta\rVert^2$” with coefficient $\tfrac{1}{2\tau^2}$. This plays the role of the $\lambda$ of ridge regression: for a Gaussian likelihood with noise variance $\sigma^2$, rescaling the objective gives $\lambda=\sigma^2/\tau^2$. A tighter prior (smaller $\tau$) means a heavier penalty and an estimate pulled harder toward zero. A Laplace prior $\propto\exp(-\lVert\theta\rVert_1/b)$ likewise yields an $L_1$ penalty, and its non-differentiable kink at zero is precisely why it tends to produce sparse solutions. In short, “adding a regularizer” and “choosing a prior” are two descriptions of the same thing under MAP.
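The correspondence can be verified on a one-parameter linear model. This is a minimal sketch under stated assumptions: the model $y_i=\theta x_i+\varepsilon_i$ with Gaussian noise of known standard deviation `sigma`, a zero-mean Gaussian prior of standard deviation `tau`, and synthetic data; the helper names are illustrative. Under those assumptions the MAP estimate coincides with a ridge estimate whose penalty is $\lambda=\sigma^2/\tau^2$.

```python
import random

def log_posterior(theta, xs, ys, sigma, tau):
    """Gaussian log-likelihood plus Gaussian log-prior, constants dropped."""
    log_likelihood = -sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    log_prior = -theta ** 2 / (2 * tau ** 2)
    return log_likelihood + log_prior

def ridge_estimate(xs, ys, lam):
    """Minimizer of sum (y - theta x)^2 + lam * theta^2 for scalar theta."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def map_estimate(xs, ys, sigma, tau):
    """MAP under the Gaussian prior, written as ridge with lam = sigma^2 / tau^2."""
    return ridge_estimate(xs, ys, lam=sigma ** 2 / tau ** 2)

# Synthetic data (assumed for illustration): true slope 1.5, noise sd 0.5.
rng = random.Random(1)
sigma = 0.5
xs = [rng.uniform(-1.0, 1.0) for _ in range(20)]
ys = [1.5 * x + rng.gauss(0.0, sigma) for x in xs]

theta_mle = ridge_estimate(xs, ys, lam=0.0)  # flat prior: no penalty
for tau in (10.0, 1.0, 0.1):
    print("tau =", tau, "MAP =", map_estimate(xs, ys, sigma, tau))
print("MLE =", theta_mle)
```

As `tau` shrinks the estimate moves toward zero, and as it grows the estimate approaches the unpenalized MLE.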

Carried into Cryo-ET: CryoGEN-I is precisely a MAP estimate whose $\log p(\theta)$ term is a learned energy-based prior (an EBM rather than a hand-designed $L_2$/$L_1$ penalty). It returns a single best density (a point estimate, the posterior mode), which is exactly the MAP slot in the four-method taxonomy; later methods move toward distribution-level answers (links below).

Many models contain latent variables $z$, so the likelihood requires marginalization:

$$p(x\mid\theta)=\int p(x,z\mid\theta)\,dz.$$

Here $z$ is a quantity that is not observed directly yet takes part in generating the data: in a mixture it is “which component a sample came from,” in imaging it can be the unknown pose behind each projection. The integral sums $z$ away, which usually renders $\log p(x\mid\theta)$ hard to optimize directly: the log sits outside the integral and cannot be pushed through term by term.
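For a discrete latent variable the integral is a finite sum, and the structure "log outside the sum" is visible directly in code. The sketch below assumes a one-dimensional Gaussian mixture; the function names and the parameter values in the usage lines are illustrative. It evaluates the marginal log-likelihood with a log-sum-exp over components, which stays finite even where the individual densities underflow.

```python
import math

def log_normal_pdf(x, mu, var):
    """log N(x | mu, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_sum_exp(values):
    """Stable log(sum(exp(v))) by factoring out the largest term."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def mixture_log_likelihood(xs, weights, means, variances):
    """sum_i log sum_k w_k N(x_i | mu_k, var_k): the log stays outside the sum over z."""
    total = 0.0
    for x in xs:
        log_joint = [
            math.log(w) + log_normal_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        total += log_sum_exp(log_joint)
    return total

# Illustrative parameters and data (assumed values).
weights, means, variances = [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5]
xs = [-1.2, 0.4, 1.9, 2.5]
print(mixture_log_likelihood(xs, weights, means, variances))

# A far-out point: each component density underflows to 0.0 in double
# precision, yet the log-domain computation remains finite.
print(mixture_log_likelihood([60.0], weights, means, variances))
```

Because the inner sum couples all components inside a single logarithm, the gradient with respect to one component's parameters depends on every other component, which is the coupling that EM removes.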

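The expectation–maximization (EM) algorithm sidesteps the difficulty iteratively, alternating two steps. The E-step forms the expected complete-data log-likelihood under the posterior over $z$ at the current parameters $\theta^{(t)}$:

$$Q(\theta\mid\theta^{(t)})=\mathbb{E}_{p(z\mid x,\theta^{(t)})}\!\big[\log p(x,z\mid\theta)\big].$$

The M-step then maximizes it over the parameters:

$$\theta^{(t+1)}=\arg\max_{\theta}\;Q(\theta\mid\theta^{(t)}).$$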

Intuitively: the complete-data log-likelihood $\log p(x,z\mid\theta)$ is easy to optimize when $z$ is known (in a mixture it collapses to “fit each component to the samples assigned to it”). The trouble is only that $z$ is unknown. EM’s move is to first guess a soft assignment of $z$ from the current parameters (the E-step produces the “responsibilities,” each sample’s probability of belonging to each component), then treat that guess as truth to update parameters (the M-step), and repeat. In the demo above, the log-likelihood curve only rises after each iteration, a visible consequence of the guarantee below.

Each iteration never decreases the observed-data likelihood, so EM climbs to a stationary point of the likelihood, in practice usually a local maximum. Note local: EM is sensitive to initialization, different starts can land on different peaks, and in practice one often restarts from several random seeds and keeps the best.
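The two steps can be sketched for a one-dimensional two-component Gaussian mixture. This is a minimal illustration, not the code behind the demonstration on this page: the dataset is synthetic, the initial means and weights follow the demo's iteration-0 readout, and the initial variances of 1.0 and the variance floor `min_var` are assumptions added here.

```python
import math
import random

def log_normal_pdf(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def e_step(xs, weights, means, variances):
    """Responsibilities r[i][k] = p(z_i = k | x_i, theta) and the log-likelihood."""
    resp, log_lik = [], 0.0
    for x in xs:
        log_joint = [
            math.log(w) + log_normal_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        top = max(log_joint)
        log_marginal = top + math.log(sum(math.exp(l - top) for l in log_joint))
        resp.append([math.exp(l - log_marginal) for l in log_joint])
        log_lik += log_marginal
    return resp, log_lik

def m_step(xs, resp, min_var=1e-6):
    """Closed-form weighted updates of weights, means, and variances."""
    n, k = len(xs), len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)  # effective number of samples in component j
        mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
        weights.append(nj / n)
        means.append(mu)
        variances.append(max(var, min_var))  # assumed guard against collapse
    return weights, means, variances

def run_em(xs, weights, means, variances, n_iter=50):
    history = []  # log-likelihood at the parameters entering each iteration
    for _ in range(n_iter):
        resp, log_lik = e_step(xs, weights, means, variances)
        history.append(log_lik)
        weights, means, variances = m_step(xs, resp)
    return weights, means, variances, history

# Synthetic two-cluster data (assumed for illustration).
rng = random.Random(0)
xs = [rng.gauss(-2.0, 0.8) for _ in range(120)] + [rng.gauss(2.5, 1.0) for _ in range(80)]

weights, means, variances, history = run_em(xs, [0.5, 0.5], [-0.5, 0.8], [1.0, 1.0])
print("weights:", weights)
print("means:", means)
print("log-likelihood, first and last:", history[0], history[-1])
```

Rerunning `run_em` from several different initial means and keeping the run with the highest final log-likelihood is the restart strategy described above.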

Depth

Why does raising $Q$ raise the true likelihood? The key identity is

$$\log p(x\mid\theta)=Q(\theta\mid\theta^{(t)})-\mathbb{E}_{p(z\mid x,\theta^{(t)})}\!\big[\log p(z\mid x,\theta)\big],$$

whose second term, minus sign included, is a cross-entropy between the posterior over $z$ at $\theta^{(t)}$ and the posterior at $\theta$. Moving $\theta$ from $\theta^{(t)}$ to $\theta^{(t+1)}$, the $Q$ term does not decrease by the definition of the M-step, while the change in the second term equals a KL divergence $\mathrm{KL}\big(p(z\mid x,\theta^{(t)})\,\Vert\,p(z\mid x,\theta^{(t+1)})\big)\ge 0$ that points the same way. Together they give $\log p(x\mid\theta^{(t+1)})\ge\log p(x\mid\theta^{(t)})$. Equivalently, EM is coordinate ascent on an evidence lower bound: the E-step sets the auxiliary distribution to the exact posterior, making the bound tight against the true likelihood, and the M-step maximizes that bound over parameters. When the exact posterior is unavailable, the E-step becomes an approximate optimization over a restricted family, which is the content of variational inference; for the KL and entropy language see entropy and KL divergence.
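The identity and the sign of the KL term can be checked numerically. The sketch below assumes a one-dimensional two-component Gaussian mixture with made-up data and parameters; `theta_new` is an arbitrary second parameter setting and not the output of an M-step, since both facts being checked hold for any pair of parameters. The function names are illustrative.

```python
import math

def log_normal_pdf(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_joint(x, theta):
    """log p(x, z = k | theta) for each component k."""
    weights, means, variances = theta
    return [
        math.log(w) + log_normal_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]

def log_posterior_and_marginal(x, theta):
    """Return (log p(z = k | x, theta) for each k, log p(x | theta))."""
    lj = log_joint(x, theta)
    top = max(lj)
    log_marginal = top + math.log(sum(math.exp(l - top) for l in lj))
    return [l - log_marginal for l in lj], log_marginal

def decompose(xs, theta, theta_t):
    """Return (log p(x | theta), Q(theta | theta_t), cross-entropy term)."""
    log_lik = q = cross_entropy = 0.0
    for x in xs:
        log_post_t, _ = log_posterior_and_marginal(x, theta_t)
        log_post, log_marginal = log_posterior_and_marginal(x, theta)
        resp_t = [math.exp(l) for l in log_post_t]  # p(z | x, theta_t)
        log_lik += log_marginal
        q += sum(r * l for r, l in zip(resp_t, log_joint(x, theta)))
        cross_entropy -= sum(r * l for r, l in zip(resp_t, log_post))
    return log_lik, q, cross_entropy

# Illustrative data and parameters (assumed values).
xs = [-2.1, -1.3, 0.2, 1.7, 2.9]
theta_t = ([0.5, 0.5], [-0.5, 0.8], [1.0, 1.0])
theta_new = ([0.4, 0.6], [-1.5, 2.0], [0.7, 1.2])

log_lik, q, h = decompose(xs, theta_new, theta_t)
_, _, h_t = decompose(xs, theta_t, theta_t)
kl = h - h_t  # KL(p(z | x, theta_t) || p(z | x, theta_new)), summed over samples

print("log-likelihood:", log_lik)
print("Q + cross-entropy:", q + h)
print("KL term:", kl)
```

Since the KL term is never negative, any parameter change that does not lower $Q$ cannot lower the log-likelihood either, which is the monotonicity argument in numerical form.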

In imaging models with latent poses or densities, this framework underlies many iterative reconstruction and alignment pipelines: the unknown 3D orientation behind each 2D projection is the latent $z$, the 3D density is the parameter $\theta$, the E-step soft-aligns orientations and the M-step updates the density from them, which is the shape of EM. See subtomogram averaging. From here, Cryo-ET reconstruction methods split on what kind of answer they return: MAP gives a single mode (CryoGEN-I), a more stable single answer follows the WAE/OT route (CryoGEN-II), and CryoWGEN goes further to a family of posterior samples rather than one point.
