Kantorovich duality & the dual potentials

Every transport problem has a dual played on prices instead of plans; the c-transform collapses it to a single potential, the W₁ case is exactly the Wasserstein-GAN critic, and Brenier's theorem recovers the optimal map.

The Kantorovich problem chooses a transport plan — a coupling $\pi$ over pairs $(x,y)$ . Every such problem has a dual that chooses something completely different: not a plan, but a pair of prices. The dual view is the one that turns the Wasserstein distance from a number you compute into an objective a network can train, and it is the piece the Wasserstein autoencoder and Wasserstein GAN — and through them CryoGEN-II and CryoWGEN — quietly rely on.

One sentence to anchor the whole page: the primal optimizes over what to move where (a 2-D table $\pi$ ), the dual over what each location is worth (two curves $f,g$ ); both return the same number, but the dual’s unknown is a function, and a function is exactly what a neural network can fit. The four sections below show how that pair of prices collapses into one potential, why the distance cost forces it to be 1-Lipschitz, how it becomes the WGAN critic, and how, under squared cost, its gradient recovers a genuine map.

Two prices instead of a plan

Picture a shipping contractor. Instead of routing every shovel of earth yourself (choosing $\pi$ ), you hire a contractor who posts two price functions: a price $f(x)$ to load a unit at $x$ and a price $g(y)$ to unload it at $y$ . The contractor wants to maximize revenue $\int f\,d\mu + \int g\,d\nu$ , but to win your business no load–unload pair may cost more than doing it yourself, $f(x)+g(y)\le c(x,y)$ . The dual problem is exactly this:

\sup_{f,g}\ \int f\,d\mu + \int g\,d\nu \qquad\text{subject to}\qquad f(x)+g(y)\le c(x,y)\ \ \forall x,y.

Symbol by symbol: $\mu$ is the supply distribution (where the piles of earth are), $\nu$ is the demand distribution (where the holes are), $c(x,y)$ is the unit cost of moving a unit from $x$ to $y$ ; $f$ is the “load price schedule” and $g$ the “unload price schedule,” both functions defined over the whole space. The objective $\int f\,d\mu$ is the total loading fee the contractor collects across all piles, $\int g\,d\nu$ the total unloading fee across all holes. The constraint must hold for every pair $(x,y)$ , because you can always choose to “move that pair yourself,” so the contractor’s two posted prices must never sum to more than your own cost.

Strong duality holds: this supremum equals the primal infimum — the cheapest plan costs exactly what the smartest contractor can charge. (Kantorovich duality; Peyré & Cuturi 2019, §2.5.)

= W₁　optimal (the 1-Lipschitz bound is exactly attained)

potential slopes = -1.00

steep (−1.5)steep (+1.5)

The dual no longer chooses a transport plan — it chooses a potential f that maximizes 𝔼_μ[f] − 𝔼_ν[f]. The potential wants to be as steep as possible to separate μ from ν, but Kantorovich–Rubinstein requires f to be 1-Lipschitz (|slope| ≤ 1). That cap pins the dual value exactly at W₁, attained at slope −1. Push the slope past ±1 and the value exceeds W₁ — impossible for a true lower bound, which is exactly why the 1-Lipschitz constraint is indispensable.

Pin it to an example you can do in your head. Let $\mu$ put all its mass at a single point $x_0$ and $\nu$ at a single point $y_0$ , with cost $c=\lVert x_0-y_0\rVert$ . The primal has no choice but to ship the whole pile across, at cost $d=\lVert x_0-y_0\rVert$ . On the dual side the contractor maximizes $f(x_0)+g(y_0)$ subject to $f(x_0)+g(y_0)\le d$ , so he pushes the inequality to equality and earns exactly $d$ — both sides equal, strong duality visible at a glance. A notch harder: $\mu=\tfrac12\delta_{0}+\tfrac12\delta_{2}$ , $\nu=\delta_{1}$ , distance cost. Both piles are distance $1$ from $1$ , so the optimal plan ships each half-unit to $1$ at total cost $\tfrac12\cdot1+\tfrac12\cdot1=1$ ; the dual takes $f(0)=f(2)=\tfrac12$ and $g(1)=\tfrac12$ (check $f(x)+g(1)\le|x-1|$ , tight at $x=0,2$ ), earning $\tfrac12\cdot\tfrac12+\tfrac12\cdot\tfrac12+\tfrac12=1$ — matched again.

The c-transform: one potential, not two

For a fixed $f$ , the best the contractor can do at each $y$ is charge as much as the constraint allows, giving the $c$ -transform

f^{c}(y)=\inf_{x}\big(c(x,y)-f(x)\big).

This step just writes “raise $g(y)$ to the constraint ceiling at each $y$ ” as a formula: the constraint $g(y)\le c(x,y)-f(x)$ must hold for all $x$ , so the largest legal $g(y)$ is the smallest of those upper bounds, i.e. the infimum over $x$ . In other words, once the load price $f$ is fixed, the best unload price $g$ is no longer free — it is fully determined by $f$ . Substituting $g=f^{c}$ collapses the two prices into one. The dual becomes a maximization over a single potential:

\sup_{f}\ \int f\,d\mu + \int f^{c}\,d\nu .

This is the semi-dual, and that lone $f$ is the “single potential” the encoder–decoder methods refer to: optimizing over one function $f$ is what lets CryoGEN-II replace an adversarial min–max with a single-potential objective. (Peyré & Cuturi §3.1; Genevay et al., semi-dual stochastic OT.) This matters in practice: the primal unknown $\pi$ is an $n\times m$ table that grows with the dataset, while the semi-dual unknown $f$ is a function you can hand to a network and optimize with stochastic gradients on mini-batches — that is the hinge on which OT enters a training loop at all.

W₁ and the 1-Lipschitz potential

When the cost is the distance itself, $c(x,y)=\lVert x-y\rVert$ , the constraint forces $g=-f$ with $f$ 1-Lipschitz, and the semi-dual reduces to the clean Kantorovich–Rubinstein form

W_1(\mu,\nu)=\sup_{\lVert f\rVert_{\mathrm{Lip}}\le 1}\ \mathbb{E}_{\mu}[f]-\mathbb{E}_{\nu}[f].

Why does the distance cost force these two conclusions? Put $c(x,y)=\lVert x-y\rVert$ into $f(x)+g(y)\le\lVert x-y\rVert$ : setting $x=y$ gives $f(x)+g(x)\le0$ , and the triangle inequality makes $f^{c}=-f$ , so $g=-f$ — the two schedules collapse to one and its negative mirror. Back-substituting gives $f(x)-f(y)\le\lVert x-y\rVert$ for all $x,y$ , which is precisely the definition of 1-Lipschitz (slope at most $1$ in magnitude). $\lVert f\rVert_{\mathrm{Lip}}\le1$ is that constraint; $\mathbb{E}_{\mu}[f]-\mathbb{E}_{\nu}[f]$ is the difference of one and the same potential’s means under the two distributions. So $W_1$ no longer needs a plan solved and summed — it reduces to “find one function that climbs no faster than the geometry allows and maximizes the gap between its means on $\mu$ and $\nu$ .”

Intuition

The potential is a witness that separates the two distributions: it should be high where $\mu$ lives and low where $\nu$ lives, and the gap it opens, $\mathbb{E}_\mu[f]-\mathbb{E}_\nu[f]$ , scores how far apart they are. The 1-Lipschitz cap forbids it from rising faster than the geometry allows — without that cap the score could be inflated without bound (drag the slider past $\pm 1$ in the demo to see the “value” overshoot $W_1$ , which a true lower bound can never do). The tightest legal witness scores exactly $W_1$ .

Complementary slackness: how plan and prices interlock

The primal and the dual are not two unrelated optimizations — at the optimum they interlock through complementary slackness. A pair $(x,y)$ carries optimal mass only where its price constraint is tight, $f(x)+g(y)=c(x,y)$ ; wherever the constraint is strict, $f(x)+g(y)<c(x,y)$ , the transported mass on that pair is zero.

\pi^\star(x,y)>0\ \Longrightarrow\ f(x)+g(y)=c(x,y).

The intuition: the contractor only wins business on pairs where his quote exactly meets your own cost — anywhere else he is uncompetitive, and no mass flows there. This set of tight pairs (the $c$ -cyclically-monotone support) is exactly the set of pairs the optimal coupling is allowed to use. In the $\mu=\tfrac12\delta_0+\tfrac12\delta_2$ example above, the constraint is tight at $x=0,2$ and the plan does ship only on those two pairs — complementary slackness at work. The same equation explains why Brenier’s theorem in the next section can recover a map from the potential: tightness pins each $x$ to the $y$ that makes the equality hold.

This is the Wasserstein-GAN critic

The Kantorovich–Rubinstein form is the Wasserstein GAN objective. The WGAN critic is the potential $f$ ; weight clipping or a gradient penalty enforces the 1-Lipschitz constraint; and the generator is trained to lower $\mathbb{E}_{q_x}[f]$ , pushing the generated distribution $q_x$ toward the data. What looks like an adversarial game is, underneath, one side solving the $W_1$ dual and the other descending it — which is why the WGAN gradient stays informative even when the two distributions barely overlap (see the disjoint-support argument).

The contrast with a plain GAN makes the difference concrete. An ordinary discriminator outputs a probability in $[0,1]$ , which saturates (to $0$ or $1$ ) the moment the two distributions stop overlapping; its gradient then vanishes and the generator gets no direction. The WGAN critic outputs an unbounded real-valued potential, the 1-Lipschitz cap keeps it from saturating, and $\mathbb{E}_\mu[f]-\mathbb{E}_{q_x}[f]$ always gives a distance proportional to $W_1$ that is differentiable everywhere — the practical payoff of the dual view.

Brenier’s theorem: when the plan is a map

The primal page left a thread hanging — a deterministic Monge map “need not exist,” because mass may have to split. The dual closes it. For the squared cost $c(x,y)=\lVert x-y\rVert^2$ with an absolutely continuous source $\mu$ , the optimal potential is essentially a convex function, and its gradient is the optimal map:

T(x)=\nabla\varphi(x),\qquad \varphi\ \text{convex},

symbol by symbol: $\varphi$ is the convex potential built from the dual potentials (concretely a form like $\varphi(x)=\tfrac12\lVert x\rVert^2-f(x)$ ), $\nabla\varphi(x)$ is its gradient at $x$ , and $T$ is the map that sends source point $x$ to its target. Convexity is the load-bearing fact: it makes $\nabla\varphi$ monotone, so distinct $x$ are never sent to the same $y$ to fight over mass, hence no splitting — exactly the squared-cost consequence of the complementary-slackness “pin each $x$ to one $y$ ” above. So the Kantorovich optimum is induced by a genuine map after all — no splitting needed. (Brenier 1991; Peyré & Cuturi §2.4.) This is the precise sense in which squared-cost optimal transport “straightens” a distribution onto another.

Depth

The entropic dual is the Sinkhorn potentials. Adding the entropy penalty softens the hard constraint $f+g\le c$ into a smooth penalty, and the dual becomes unconstrained and differentiable. The $c$ -transform softens into a soft-min (log-sum-exp):

f^{c,\gamma}(y)=-\gamma\log\!\int e^{\,(f(x)-c(x,y))/\gamma}\,d\mu(x),

and the optimal potentials are exactly the log-domain Sinkhorn scalings, $f=\gamma\log a$ and $g=\gamma\log b$ . The row/column rescalings you iterate there are the dual prices here — Sinkhorn is coordinate ascent on this smooth dual. (Peyré & Cuturi §4.4; Cuturi 2013.) Note the $\inf$ has become $-\gamma\log\int e^{\cdot/\gamma}$ : as $\gamma\to0$ the soft-min converges to the hard $\inf$ and the smooth dual reverts to the non-smooth hard dual above; the larger $\gamma$ , the flatter the price curves and the looser the constraint — the dual-side face of entropic OT’s single temperature knob between “hard assignment” and “diffuse hedge.”

Why duality matters for learning. The primal optimizes over couplings — an $n\times m$ object that grows with the data. The dual optimizes over a single function $f$ , which a neural network can parameterize directly. That swap is the whole mechanism: it is what makes the Wasserstein distance a trainable loss rather than a quantity to recompute from scratch, and it is shared by WGAN (the hard 1-Lipschitz dual) and the Sinkhorn autoencoder (the smooth entropic dual) alike — the optimal-transport → autoencoder thread, seen from the dual side.

Where this lands in Cryo-ET reconstruction

This chain of dualities is not an abstract exercise; it carries the training objective of several reconstruction methods. When you want the distribution of reconstructed clean volumes to match a reference distribution, you need a distribution distance you can backpropagate through — and the primal’s plan $\pi$ is non-differentiable and explodes with voxel count, so it cannot serve as a loss directly. The dual rewrites that distance as an optimization over a potential: CryoGEN-II takes the WAE/OT route, using this page’s single-potential objective to get one stable answer; Wasserstein GAN uses the same 1-Lipschitz potential as a critic to drive distribution alignment. Step once toward the entropic side and the smooth dual’s Boltzmann solution becomes the per-image posterior that EVIA / CryoWGEN samples in its E-step (details on the entropic page) — the same dual, giving a MAP-like single answer as $\gamma\to0$ and a family of reconstructions for $\gamma>0$ that carries the uncertainty the missing wedge leaves behind.