MAP, MLE & the EM algorithm

Two routes to point estimates—maximum likelihood and maximum a posteriori—and expectation–maximization for latent-variable models.

The full posterior $p(\theta\mid x)$ carries all the uncertainty about a parameter $\theta$, but often we just want one concrete answer: a set of weights, a single 3D density, a set of pose parameters. Collapsing the whole distribution into one $\theta$ is point estimation, and the two criteria below cover most of practice.

Intuition

Picture the posterior as a landscape over parameter space whose height is how credible a parameter is. A point estimate asks “where is the highest peak?” Maximum likelihood reads off the landscape shaped by the data alone; maximum a posteriori first reweights that landscape by the prior—raising some regions, lowering others—before locating the peak. Both return the peak (the mode), not the landscape’s center of mass (the posterior mean); when the landscape is skewed, those two diverge.

Maximum likelihood estimation (MLE) takes the parameters that make the observed data most probable:

$$\hat\theta_{\text{MLE}}=\arg\max_{\theta}\;p(x\mid\theta).$$

Here $p(x\mid\theta)$ is the likelihood: read as a function of $\theta$ (with $x$ fixed at its observed value), it scores how probable the observed data are under that parameter. In practice one almost always maximizes the log-likelihood $\log p(x\mid\theta)$, because for independent samples the joint likelihood is a product, and the log turns it into a sum $\sum_i \log p(x_i\mid\theta)$ that is numerically stable against underflow and cleaner to differentiate. Because the log is monotone, the maximizer is unchanged. A familiar case: MLE for $n$ independent Gaussian samples gives $\hat\mu$ equal to the sample mean and $\hat\sigma^2$ equal to the sample variance with divisor $n$ (the biased version, not the $n-1$ one). Many textbook “natural estimators” are in fact the MLE of some model.
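The Gaussian case can be checked numerically. The following is a minimal sketch using only the Python standard library; the data are synthetic with a fixed seed, and the function names are illustrative assumptions rather than part of any library.

```python
import math
import random

def gaussian_log_likelihood(xs, mu, sigma2):
    """Sum of log N(x_i | mu, sigma2) over independent samples."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)

def gaussian_mle(xs):
    """Closed-form MLE: sample mean and variance with divisor n (not n - 1)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

rng = random.Random(0)  # fixed seed; the data are purely illustrative
xs = [rng.gauss(2.0, 1.5) for _ in range(500)]
mu_hat, sigma2_hat = gaussian_mle(xs)
best = gaussian_log_likelihood(xs, mu_hat, sigma2_hat)

# Nudging either parameter away from the MLE must not raise the log-likelihood.
perturbed = [
    gaussian_log_likelihood(xs, mu_hat + d_mu, sigma2_hat + d_s2)
    for d_mu, d_s2 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.2), (0.0, -0.2)]
]
```

Every entry of `perturbed` is at most `best`, which is what "arg max" means here: the closed-form estimates sit at the top of the log-likelihood surface.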

A Gaussian mixture is the canonical setting for EM: which component generated each sample is the latent variable, and the soft assignment of responsibilities yields closed-form E- and M-steps. The demonstration below runs EM step by step on a fixed one-dimensional two-component dataset, with the log-likelihood non-decreasing after every iteration.

[Interactive demo. Legend: data histogram, Component 1, Component 2, mixture. Initial state: iteration 0, log-likelihood -327.258, means -0.50 and 0.80, weights 0.50 and 0.50.]

Each Step runs one E-step (responsibilities under the current two Gaussians) followed by one M-step (updated means, variances, weights). The log-likelihood is monotonically non-decreasing until convergence to a local maximum.

Maximum a posteriori (MAP) estimation incorporates the prior alongside the likelihood, taking the mode of the posterior:

$$\hat\theta_{\text{MAP}}=\arg\max_{\theta}\;p(x\mid\theta)\,p(\theta).$$

Here $p(\theta)$ is the prior, the belief about $\theta$ before seeing data. The denominator of Bayes’ theorem, the evidence $p(x)$, does not depend on $\theta$, so it drops out of the $\arg\max$. That is why MAP needs only $p(x\mid\theta)\,p(\theta)$ and never has to compute the often intractable normalizing integral.
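A small sketch of this point, assuming a coin-flip (Bernoulli) likelihood with a Beta prior; the model, the data counts, and the function names are all illustrative assumptions. The code maximizes the unnormalized log-posterior on a grid and never evaluates $p(x)$.

```python
import math

def log_unnormalized_posterior(theta, k, n, a, b):
    """log[p(x | theta) p(theta)] up to a theta-independent constant:
    Bernoulli likelihood with k successes in n trials, Beta(a, b) prior."""
    log_lik = k * math.log(theta) + (n - k) * math.log(1 - theta)
    log_prior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return log_lik + log_prior

def map_on_grid(k, n, a, b, steps=10_000):
    """Arg max over an interior grid; the evidence p(x) is never computed."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: log_unnormalized_posterior(t, k, n, a, b))

k, n = 7, 10      # hypothetical data: 7 successes in 10 trials
a, b = 2.0, 2.0   # hypothetical Beta prior, mildly favoring 0.5
theta_mle = k / n
theta_map = map_on_grid(k, n, a, b)
closed_form = (k + a - 1) / (n + a + b - 2)   # mode of Beta(k + a, n - k + b)
theta_flat = map_on_grid(k, n, 1.0, 1.0)      # flat prior: MAP reduces to MLE
```

The grid maximizer agrees with the known posterior mode to within the grid spacing, and it lies between the MLE and the prior's preferred value of 0.5. With the flat Beta(1, 1) prior the same routine returns the MLE.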

The two are closely related: in the log domain the MAP objective is the MLE objective plus a term $\log p(\theta)$. That term lets the prior act as a regularizer: a Gaussian prior corresponds to an $L_2$ penalty, a Laplace prior to an $L_1$ penalty. As the prior flattens, MAP reduces to MLE. This view connects Bayesian inference to regularized frequentist optimization.

Depth

Why is a Gaussian prior exactly an $L_2$ penalty? Let $p(\theta)\propto\exp\!\big(-\tfrac{1}{2\tau^2}\lVert\theta\rVert^2\big)$, so $\log p(\theta)=-\tfrac{1}{2\tau^2}\lVert\theta\rVert^2+\text{const}$. Substituting into $\arg\max[\log p(x\mid\theta)+\log p(\theta)]$, the constant does not move the optimum, and what remains is “log-likelihood minus a penalty on $\lVert\theta\rVert^2$” with coefficient $\tfrac{1}{2\tau^2}$. This is the $\lambda$ of ridge regression, up to the factor of $\tfrac12$ and the noise-variance scaling of the likelihood. A tighter prior (smaller $\tau$) means a heavier penalty and an estimate pulled harder toward zero. A Laplace prior $\propto\exp(-\lVert\theta\rVert_1/b)$ likewise yields an $L_1$ penalty, and its non-differentiable kink at zero is precisely why it tends to produce sparse solutions. In short, “adding a regularizer” and “choosing a prior” are two descriptions of the same thing under MAP.
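The two priors can be compared in the simplest possible setting. This sketch assumes a single scalar observation $y\sim\mathcal{N}(\theta,1)$ (unit noise variance is an assumption made here for brevity), where both MAP estimates have closed forms; the function names are hypothetical.

```python
import math

def map_gaussian_prior(y, tau):
    """MAP of theta for y ~ N(theta, 1) with prior theta ~ N(0, tau^2).
    Minimizes 0.5 * (y - theta)^2 + theta^2 / (2 * tau^2): ridge-style shrinkage."""
    lam = 1.0 / tau ** 2
    return y / (1.0 + lam)

def map_laplace_prior(y, b):
    """MAP of theta for y ~ N(theta, 1) with prior proportional to exp(-|theta| / b).
    Minimizes 0.5 * (y - theta)^2 + |theta| / b: soft thresholding at 1 / b."""
    return math.copysign(max(abs(y) - 1.0 / b, 0.0), y)

for y in (0.3, 1.0, 3.0):
    print(y, map_gaussian_prior(y, tau=1.0), map_laplace_prior(y, b=1.0))
```

The Gaussian prior rescales every observation toward zero but never returns exactly zero for a nonzero `y`. The Laplace prior returns exactly zero whenever `abs(y)` is at most `1 / b`, which is the sparsity effect of the kink.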

Carried into Cryo-ET: CryoGEN-I is precisely a MAP estimate whose $\log p(\theta)$ term is a learned energy-based prior—an EBM rather than a hand-designed $L_2$ / $L_1$ penalty. It returns a single best density (a point estimate, the posterior mode), which is exactly the MAP slot in the four-method taxonomy; later methods move toward distribution-level answers (links below).

Many models contain latent variables $z$, so the likelihood requires marginalization:

$$p(x\mid\theta)=\int p(x,z\mid\theta)\,dz.$$

Here $z$ is a quantity that is not observed directly yet takes part in generating the data—in a mixture it is “which component a sample came from,” in imaging it can be the unknown pose behind each projection. The integral sums $z$ away, which usually renders $\log p(x\mid\theta)$ hard to optimize directly: the log sits outside the integral and cannot be pushed through term by term.
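When the latent is discrete, as with a mixture component label, the integral becomes a sum over components. A minimal sketch, assuming a one-dimensional Gaussian mixture; the function names are illustrative, and the log-sum-exp step is a standard numerical trick rather than something the text above specifies.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """log N(x | mu, sigma^2) for scalar x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_sum_exp(values):
    """Stable log(sum(exp(v))) by factoring out the largest term."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_likelihood(xs, weights, mus, sigmas):
    """log p(x | theta) = sum_i log sum_k pi_k N(x_i | mu_k, sigma_k^2).
    The inner sum over k marginalizes the latent component label z_i."""
    total = 0.0
    for x in xs:
        log_joint = [
            math.log(w) + log_normal_pdf(x, m, s)
            for w, m, s in zip(weights, mus, sigmas)
        ]
        total += log_sum_exp(log_joint)
    return total

ll = mixture_log_likelihood([-1.2, 0.3, 2.1], [0.5, 0.5], [-0.5, 0.8], [1.0, 1.0])
```

The log of a sum over `k` is what blocks a term-by-term optimization: the parameters of all components are coupled inside each `log_sum_exp` call.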

The expectation–maximization (EM) algorithm sidesteps the difficulty iteratively, alternating two steps:

E-step: using the current parameters $\theta^{(t)}$, form the posterior over latents $p(z\mid x,\theta^{(t)})$ and the expected complete-data log-likelihood

$$Q(\theta\mid\theta^{(t)})=\mathbb{E}_{p(z\mid x,\theta^{(t)})}\!\big[\log p(x,z\mid\theta)\big].$$

M-step: maximize that expectation, $\theta^{(t+1)}=\arg\max_{\theta}Q(\theta\mid\theta^{(t)})$.

Intuitively: the complete-data log-likelihood $\log p(x,z\mid\theta)$ is easy to optimize when $z$ is known (in a mixture it collapses to “fit each component to the samples assigned to it”). The trouble is only that $z$ is unknown. EM’s move is to first guess a soft assignment of $z$ from the current parameters (the E-step produces the “responsibilities,” each sample’s probability of belonging to each component), then treat that guess as truth to update parameters (the M-step), and repeat. In the demo above, the log-likelihood curve only rises after each iteration—a visible consequence of the guarantee below.

Each iteration never decreases the observed-data likelihood, so under mild regularity conditions EM converges to a stationary point of the likelihood, in practice usually a local maximum. Note local: EM is sensitive to initialization, different starts can land on different peaks, and in practice one often restarts from several random seeds and keeps the best.
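The E-step and M-step can be sketched for a two-component one-dimensional Gaussian mixture. This is a minimal illustration, not the demo's actual code: the dataset is synthetic with a fixed seed (the demo's data are not given in the text), only the starting means and weights are borrowed from the demo's initial state, and the variance floor is an added safeguard that is not part of textbook EM.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, w, mu, var):
    """Observed-data log-likelihood of the two-component mixture."""
    return sum(
        math.log(sum(w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)))
        for x in xs
    )

def em_step(xs, w, mu, var):
    # E-step: responsibilities resp[i][k] = p(z_i = k | x_i, current parameters).
    resp = []
    for x in xs:
        joint = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: responsibility-weighted versions of the Gaussian MLE formulas.
    n = len(xs)
    new_w, new_mu, new_var = [], [], []
    for k in range(2):
        nk = sum(r[k] for r in resp)
        m = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        v = sum(r[k] * (x - m) ** 2 for r, x in zip(resp, xs)) / nk
        new_w.append(nk / n)
        new_mu.append(m)
        new_var.append(max(v, 1e-6))  # assumption: floor against component collapse
    return new_w, new_mu, new_var

rng = random.Random(1)  # synthetic data, chosen only for illustration
xs = [rng.gauss(-2.0, 0.7) for _ in range(150)] + [rng.gauss(1.5, 1.0) for _ in range(150)]

w, mu, var = [0.5, 0.5], [-0.5, 0.8], [1.0, 1.0]
history = [log_likelihood(xs, w, mu, var)]
for _ in range(30):
    w, mu, var = em_step(xs, w, mu, var)
    history.append(log_likelihood(xs, w, mu, var))
```

The list `history` is non-decreasing from one iteration to the next, up to floating-point rounding. Rerunning with different starting means is a simple way to see the sensitivity to initialization.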

Depth

Why does raising $Q$ raise the true likelihood? The key identity is

$$\log p(x\mid\theta)=Q(\theta\mid\theta^{(t)})-\mathbb{E}_{p(z\mid x,\theta^{(t)})}\!\big[\log p(z\mid x,\theta)\big],$$

whose second term, sign included, is a cross-entropy between the posterior at $\theta^{(t)}$ and the posterior at $\theta$. Moving $\theta$ from $\theta^{(t)}$ to $\theta^{(t+1)}$, the $Q$ term does not decrease by the definition of the M-step, while the change in the second term equals a KL divergence $\mathrm{KL}\big(p(z\mid x,\theta^{(t)})\,\Vert\,p(z\mid x,\theta^{(t+1)})\big)\ge 0$ that points the same way. Together they give $\log p(x\mid\theta^{(t+1)})\ge\log p(x\mid\theta^{(t)})$. Equivalently, EM is coordinate ascent on an evidence lower bound: the E-step sets the auxiliary distribution to the exact posterior, making the bound tight against the true likelihood, and the M-step maximizes that bound over parameters. When the exact posterior is unavailable, the E-step becomes an approximate optimization over a restricted family, which is the content of variational inference; for the KL and entropy language see entropy and KL divergence.
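For readers who want the intermediate steps, the identity and the monotonicity argument can be written out as follows. This is a sketch that uses only the quantities already defined in this section. The first line is the product rule $p(x,z\mid\theta)=p(z\mid x,\theta)\,p(x\mid\theta)$, which holds for every $z$; the second averages both sides over $p(z\mid x,\theta^{(t)})$, which leaves the left side unchanged because it does not depend on $z$.

$$
\begin{aligned}
\log p(x\mid\theta) &= \log p(x,z\mid\theta)-\log p(z\mid x,\theta)\\
&= Q(\theta\mid\theta^{(t)})-\mathbb{E}_{p(z\mid x,\theta^{(t)})}\!\big[\log p(z\mid x,\theta)\big].
\end{aligned}
$$

Evaluating at $\theta^{(t+1)}$ and at $\theta^{(t)}$ and subtracting gives

$$
\log p(x\mid\theta^{(t+1)})-\log p(x\mid\theta^{(t)})
=\underbrace{Q(\theta^{(t+1)}\mid\theta^{(t)})-Q(\theta^{(t)}\mid\theta^{(t)})}_{\ge 0\ \text{by the M-step}}
+\underbrace{\mathbb{E}_{p(z\mid x,\theta^{(t)})}\!\left[\log\frac{p(z\mid x,\theta^{(t)})}{p(z\mid x,\theta^{(t+1)})}\right]}_{=\,\mathrm{KL}\left(p(z\mid x,\theta^{(t)})\,\Vert\,p(z\mid x,\theta^{(t+1)})\right)\ \ge\ 0}.
$$

Both braces are non-negative, so the observed-data log-likelihood cannot go down.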

In imaging models with latent poses or densities, this framework underlies many iterative reconstruction and alignment pipelines: the unknown 3D orientation behind each 2D projection is the latent $z$ , the 3D density is the parameter $\theta$ , the E-step soft-aligns orientations and the M-step updates the density from them—the shape of EM. See subtomogram averaging. From here, Cryo-ET reconstruction methods split on what kind of answer they return: MAP gives a single mode (CryoGEN-I), a more stable single answer follows the WAE/OT route (CryoGEN-II), and CryoWGEN goes further to a family of posterior samples rather than one point.
