Entropic optimal transport & Sinkhorn

Adding an entropy / KL penalty to the transport cost yields a unique, everywhere-positive smooth coupling, solved efficiently by Sinkhorn matrix scaling.

Kantorovich optimal transport is a linear program: among all couplings with marginals $\mu,\nu$ , pick the one of least cost. Its solution is usually “too hard” — mass piles onto a handful of pairs $(x,y)$ , approaching a deterministic Monge map. Entropic regularization does one thing: it adds a relative-entropy penalty to the cost that punishes this collapse and forces the plan to spread mass out. In return you get three things — a unique solution, differentiability everywhere, and an algorithm that finishes in a few dozen matrix-scaling steps (Sinkhorn). This page works out where those three come from, and why this is exactly the posterior CryoWGEN samples during training.

Entropic optimal transport augments the Kantorovich transport cost with a relative-entropy penalty, turning the original linear program into a strongly convex problem. The penalty is taken against a reference coupling $\kappa$ and weighted by a regularization strength γ (read it as a “temperature”: larger γ smears the coupling flatter), trading transport cost against the diffuseness of the coupling and giving the objective

\min_{\pi\in\Pi(\mu,\nu)} \int c\,d\pi \;+\; \gamma\,\mathrm{KL}(\pi\Vert\kappa).

Term by term: $\pi$ is the coupling being solved for (the transport schedule), $\Pi(\mu,\nu)$ is the set of all couplings whose row/column marginals are $\mu,\nu$ , $c$ is the unit transport cost (often $\lVert x-y\rVert^2$ ), and $\int c\,d\pi$ is the plan’s total cost; $\mathrm{KL}(\pi\Vert\kappa)$ measures how far $\pi$ departs from the reference coupling $\kappa$ , with weight $\gamma>0$ . The larger $\gamma$ , the heavier the second term, and the more the solver is willing to sacrifice low cost for diffuseness.

Row marginal μColumn marginal νCoupling πᵢⱼ

Temperature γ: 0.050

Each cell is shaded by its Boltzmann weight exp(−cᵢⱼ/γ): small γ concentrates mass on the low-cost diagonal band, approaching a hard assignment; large γ smears the weights with entropy and the coupling spreads toward the independent product μ⊗ν.

Throughout, $\pi$ denotes the coupling and $\gamma$ the temperature (consistent with the site-wide convention in optimal transport). The reference coupling is most often the independent product measure $\kappa=\mu\otimes\nu$ , in which case $\mathrm{KL}(\pi\Vert\mu\otimes\nu)$ measures how far the coupling $\pi$ departs from independence (see entropy and KL divergence); this is simply the special case $\kappa=\mu\otimes\nu$ of the general form $\gamma\,\mathrm{KL}(\pi\Vert\kappa)$ . Strong convexity makes the optimum unique, with the explicit Gibbs–Boltzmann form

\pi^\star(x,y)\;\propto\;\kappa(x,y)\,e^{-c(x,y)/\gamma} \;\xrightarrow{\ \kappa=\mu\otimes\nu\ }\; \mu(x)\,\nu(y)\,e^{-c(x,y)/\gamma},

a Boltzmann reweighting of the reference coupling with the cost as potential energy at temperature $\gamma$ . The factor $e^{-c(x,y)/\gamma}$ translates “low cost” into “high weight”: every increase of the cost by $\gamma$ drops the weight by a factor of $e$ , so $\gamma$ sets directly how steep that decay is.

Unregularized optimal transport tends to concentrate mass on a sparse, near-deterministic support (a Monge map in the extreme case). The KL penalty punishes this collapse: every pair $(x,y)$ receives strictly positive transported mass, so the optimal plan is diffuse and smooth everywhere rather than a sharp assignment.

What the entropy term actually does

The entropy penalty is not just a computational trick — it changes the answer in three ways that are worth keeping separate.

It makes the problem well-posed and cheap. Unregularized OT is a linear program whose solution can sit on a brittle corner of the feasible polytope and costs $O(n^3)$ to find. Entropy makes the objective strictly convex: the optimum becomes unique and acquires the diagonal-scaling structure $\pi^\star=\mathrm{diag}(a)\,K\,\mathrm{diag}(b)$ that the Sinkhorn iteration below exploits, collapsing the linear program into a few matrix–vector products (Cuturi 2013).

It buys smoothness, and with it a usable gradient. Because no pair can be assigned exactly zero mass, the plan varies smoothly as the cost or the marginals move, and is differentiable in them. A hard assignment is piecewise-constant — its gradient is zero almost everywhere and undefined at the jumps — and so useless for training; the entropic plan is what lets optimal transport serve as a loss inside a network. Concretely in Cryo-ET: nudge the encoder’s parameters and the reconstruction $x$ moves a little, the degraded $\mathcal{T}_M(x)$ and the cost $c$ move with it; only when the transport plan responds to that nudge with a nonzero, continuous change can a gradient flow back to the encoder.

It has a physical meaning — a Schrödinger bridge. Entropic OT is exactly the most likely way a cloud of diffusing (Brownian) particles could have evolved from $\mu$ into $\nu$ : the cost is the kinetic effort, and the entropy term is the thermal noise of that diffusion, with $\gamma$ its temperature. This is the sense in which $\gamma$ is a temperature rather than a metaphor.

Intuition

Read $\gamma$ as how much randomness the transport is forced to keep. At $\gamma\to0$ the noise vanishes and you recover sharp, deterministic (Monge) transport — one answer. As $\gamma$ grows the coupling hedges, smearing mass across many pairs; in the limit it forgets the cost entirely and relaxes to the independent product $\mu\otimes\nu$ (maximum entropy, no transport structure left). The whole value of the entropy term is that it lets you dial between a single committed answer and a diffuse, hedged one — which, fixing one observation $y$ below, is exactly the dial between a single MAP reconstruction and a broad posterior.

The Sinkhorn algorithm

Solving the problem exploits the special structure of the optimal coupling. Any $\pi^\star$ can be written as a diagonal scaling $\pi^\star=\mathrm{diag}(a)\,K\,\mathrm{diag}(b)$ with kernel $K_{ij}=e^{-c_{ij}/\gamma}$ . The Sinkhorn algorithm alternately updates the two scaling vectors $a,b$ so that the row and column marginals match $\mu$ and $\nu$ . Given a cost matrix $C$ , marginals $\mu,\nu$ , and temperature $\gamma$ :

Form the kernel $K_{ij}=e^{-C_{ij}/\gamma}$ and initialize $b\leftarrow\mathbf{1}$ .
Update the row scaling: $a\leftarrow \mu \oslash (Kb)$ ( $\oslash$ is elementwise division).
Update the column scaling: $b\leftarrow \nu \oslash (K^{\top}a)$ .
Repeat 2–3 until the marginal error converges.
Return $\pi^\star=\mathrm{diag}(a)\,K\,\mathrm{diag}(b)$ .

Why does alternating these two steps work? Step 2 forces the current row sums onto $\mu$ , step 3 forces the column sums onto $\nu$ — each on its own is the cheapest KL projection that fixes one marginal in a single shot. Alternating them, each step only slightly disturbs the other’s marginal, the error contracts geometrically, and the iteration converges linearly to the unique coupling that satisfies both marginals at once. Each step costs only a single matrix–vector product. Because every operation is smooth and differentiable, the Sinkhorn output can serve as a differentiable loss embedded in end-to-end training. The scalings $a,b$ are the exponentiated dual potentials $f=\gamma\log a$ , $g=\gamma\log b$ — Sinkhorn is coordinate ascent on the smooth entropic dual.

Depth

Run the smallest possible numbers. Take two 2-point distributions $\mu=\nu=(\tfrac12,\tfrac12)$ , cost matrix $C=\begin{psmallmatrix}0&1\\1&0\end{psmallmatrix}$ (the diagonal is “stay in place” at zero cost, off-diagonal is “swap” at unit cost), and temperature $\gamma=1$ . Form the kernel $K_{ij}=e^{-C_{ij}}$ : $K=\begin{psmallmatrix}1&e^{-1}\\e^{-1}&1\end{psmallmatrix}\approx\begin{psmallmatrix}1&0.368\\0.368&1\end{psmallmatrix}$ . Symmetry already gives the answer away — symmetric rows/columns force $a=b=\text{const}$ , and the unique plan matching the marginals is $\pi^\star\approx\begin{psmallmatrix}0.365&0.135\\0.135&0.365\end{psmallmatrix}$ : most mass stays on the zero-cost diagonal, a little “leaks” into the swap, every row and column summing to exactly $0.5$ . Vary the temperature to see the trend: as $\gamma\to0$ the off-diagonal $e^{-1/\gamma}\to0$ and the plan collapses to the pure diagonal $\begin{psmallmatrix}0.5&0\\0&0.5\end{psmallmatrix}$ (hard transport, all mass on zero cost); as $\gamma\to\infty$ the off-diagonal $e^{-1/\gamma}\to1$ , $K$ tends to all-ones, and the plan smears to $\begin{psmallmatrix}0.25&0.25\\0.25&0.25\end{psmallmatrix}=\mu\otimes\nu$ (cost forgotten). $\gamma=1$ sits between the two — the entropy “dial” made visible.

Intuition

$\gamma$ plays the role of a temperature. Larger $\gamma$ smears the plan with entropy, pushing it toward the reference coupling $\kappa$ (the independent product $\mu\otimes\nu$ when $\kappa=\mu\otimes\nu$ ); as $\gamma\to0$ the Boltzmann weights sharpen into a hard selection of the lowest cost, recovering unregularized (hard) optimal transport, though the Sinkhorn iteration becomes ill-conditioned in that limit: the kernel’s $e^{-c/\gamma}$ either underflows or spans wildly different magnitudes at small $\gamma$ , which is why in practice it is run in the log domain (log-domain Sinkhorn), replacing direct products with log-sum-exp.

From entropic cost to Sinkhorn divergence

The entropic cost above is convenient but biased: $\mathrm{OT}_\gamma(\mu,\mu)\neq 0$ , so a generator that minimizes it under-disperses — as $\gamma$ grows the entropic cost shrinks distances toward a blurred metric. Put differently, even the cost of a distribution to itself is nonzero, so using it as a loss puts the optimum not at “generated = data” but pulled by the entropy term toward something more concentrated. The fix is the Sinkhorn divergence, which subtracts the self-terms:

\mathrm{S}_\gamma(\mu,\nu)=\mathrm{OT}_\gamma(\mu,\nu)-\tfrac12\,\mathrm{OT}_\gamma(\mu,\mu)-\tfrac12\,\mathrm{OT}_\gamma(\nu,\nu).

The two $-\tfrac12\,\mathrm{OT}_\gamma(\cdot,\cdot)$ self-terms subtract off exactly the residual “distribution-to-itself” cost, zeroing the total at $\mu=\nu$ . It satisfies $\mathrm{S}_\gamma(\mu,\mu)=0$ and $\mathrm{S}_\gamma\ge 0$ , and interpolates between optimal transport ( $\gamma\to0$ ) and maximum mean discrepancy ( $\gamma\to\infty$ ) (Genevay et al. 2018; Feydy et al. 2019). This debiased $\mathrm{S}_\gamma$ — not the raw entropic cost — is what is actually minimized when a Sinkhorn cost trains a generative model (the Sinkhorn-autoencoder setting, Patrini et al. 2020); and its large- $\gamma$ MMD limit is the same MMD the Wasserstein autoencoder can use to match its aggregated posterior to the prior.

This is exactly the EVIA / CryoWGEN posterior

The Boltzmann coupling $\pi^\star(y,x)\propto\kappa(y,x)\,e^{-c(y,\mathcal{T}_M(x))/\gamma}$ above is not merely a solver trick — it is the posterior sampled during training by EVIA and CryoWGEN. Fixing one observation $y$ and reading off the conditional slice $\pi^\star(\cdot\mid y)$ gives that observation’s per-image posterior

q^\star(x\mid y)\;\propto\;\kappa(x\mid y)\,e^{-c(y,\mathcal{T}_M(x))/\gamma},

a Boltzmann reweighting of the data-consistency reference $\kappa$ with the degradation mismatch $c(y,\mathcal{T}_M(x))=\lVert y-\mathcal{T}_M(x)\rVert^2$ as energy and $\gamma$ as temperature. Term by term: $x$ is a candidate clean reconstruction, $\mathcal{T}_M(x)$ is what it looks like after the missing-wedge degradation operator $\mathcal{T}_M$ pushes it back to observation space, and $\lVert y-\mathcal{T}_M(x)\rVert^2$ measures how far that degraded version sits from the measured $y$ ; the better $x$ explains $y$ , the lower the energy and the higher the weight the posterior gives it. This is exactly the distribution CryoWGEN samples in its E-step — CryoWGEN-I draws from it by Monte-Carlo reweighting, CryoWGEN-II by iterative Langevin / SGLD — while the encoder learns its conditional mean $\mathbb{E}[q(x\mid y)]$ , an Entropy-SGD-style smoothed point estimate.

The temperature $\gamma$ has a clear reading here. As $\gamma\to0$ the posterior collapses to the single point at the energy’s $\arg\min$ , recovering CryoGEN-II / MAP’s hard transport — one deterministic reconstruction per observation; for $\gamma>0$ a whole family of reconstructions survives, capturing exactly the uncertainty left by the missing wedge: the same corrupted $y$ should correspond to many plausible clean volumes $x$ . This puts the site’s four-method taxonomy on one dial — MAP point estimate (CryoGEN-I), WAE/OT stable single answer (CryoGEN-II), EVIA Monte-Carlo (CryoWGEN-I), EVIA Langevin posterior family (CryoWGEN-II) — and much of what separates them is this $\gamma$ (and how the Boltzmann posterior is sampled).

Depth

Why does the optimal conditional take the Boltzmann form? Fix an observation $y$ and write the conditional part of the entropy-regularized objective as a functional of $q(\cdot\mid y)$ :

\min_{q}\;\Big\{\mathbb{E}_{q}\big[c(y,\mathcal{T}_M(x))\big]\;+\;\gamma\,\mathrm{KL}\big(q\Vert\kappa(\cdot\mid y)\big)\Big\}.

Introduce a Lagrange multiplier $\lambda$ for the constraint $\int q\,dx=1$ , take the variation with respect to $q(x)$ , and set it to zero:

c(y,\mathcal{T}_M(x))+\gamma\big(\log\tfrac{q(x)}{\kappa(x\mid y)}+1\big)+\lambda=0 \;\Longrightarrow\; q^\star(x\mid y)\propto\kappa(x\mid y)\,e^{-c(y,\mathcal{T}_M(x))/\gamma}.

This is the Gibbs variational principle (Donsker–Varadhan): minimizing expected cost under a KL proximity to a reference measure forces the optimum to be a Boltzmann reweighting of that reference. The smaller $\gamma$ , the sharper the weights, and the closer the posterior is to a hard argmin of the cost — the MAP estimate. Read the other way, this is why learning the conditional mean is enough for the encoder: at small $\gamma$ the Boltzmann posterior is approximately a Gaussian bell whose mean and mode nearly coincide, so “learn the mean” and “find the MAP” converge to the same answer in that limit.

Entropic optimal transport underlies the transport objective in EVIA and the Wasserstein autoencoder, and supplies the differentiable distribution-matching approximation used in CryoWGEN. Plugging this Sinkhorn objective into an encoder–decoder gives the Sinkhorn autoencoder (Patrini et al., 2020) — a direct precursor of EVIA.

Entropic optimal transport & Sinkhorn

What the entropy term actually does

The Sinkhorn algorithm

From entropic cost to Sinkhorn divergence

This is exactly the EVIA / CryoWGEN posterior

Further reading