Kantorovich duality & the dual potentials

Every transport problem has a dual played on prices instead of plans; the c-transform collapses it to a single potential, the W₁ case is exactly the Wasserstein-GAN critic, and Brenier's theorem recovers the optimal map.

The Kantorovich problem chooses a transport plan — a coupling π\pi over pairs (x,y)(x,y). Every such problem has a dual that chooses something completely different: not a plan, but a pair of prices. The dual view is the one that turns the Wasserstein distance from a number you compute into an objective a network can train, and it is the piece the Wasserstein autoencoder and Wasserstein GAN — and through them CryoGEN-II and CryoWGEN — quietly rely on.

One sentence to anchor the whole page: the primal optimizes over what to move where (a 2-D table π\pi), the dual over what each location is worth (two curves f,gf,g); both return the same number, but the dual’s unknown is a function, and a function is exactly what a neural network can fit. The four sections below show how that pair of prices collapses into one potential, why the distance cost forces it to be 1-Lipschitz, how it becomes the WGAN critic, and how, under squared cost, its gradient recovers a genuine map.

Two prices instead of a plan

Picture a shipping contractor. Instead of routing every shovel of earth yourself (choosing π\pi), you hire a contractor who posts two price functions: a price f(x)f(x) to load a unit at xx and a price g(y)g(y) to unload it at yy. The contractor wants to maximize revenue fdμ+gdν\int f\,d\mu + \int g\,d\nu, but to win your business no load–unload pair may cost more than doing it yourself, f(x)+g(y)c(x,y)f(x)+g(y)\le c(x,y). The dual problem is exactly this:

supf,g fdμ+gdνsubject tof(x)+g(y)c(x,y)  x,y.\sup_{f,g}\ \int f\,d\mu + \int g\,d\nu \qquad\text{subject to}\qquad f(x)+g(y)\le c(x,y)\ \ \forall x,y.

Symbol by symbol: μ\mu is the supply distribution (where the piles of earth are), ν\nu is the demand distribution (where the holes are), c(x,y)c(x,y) is the unit cost of moving a unit from xx to yy; ff is the “load price schedule” and gg the “unload price schedule,” both functions defined over the whole space. The objective fdμ\int f\,d\mu is the total loading fee the contractor collects across all piles, gdν\int g\,d\nu the total unloading fee across all holes. The constraint must hold for every pair (x,y)(x,y), because you can always choose to “move that pair yourself,” so the contractor’s two posted prices must never sum to more than your own cost.

Strong duality holds: this supremum equals the primal infimum — the cheapest plan costs exactly what the smartest contractor can charge. (Kantorovich duality; Peyré & Cuturi 2019, §2.5.)

potential f(x)source μtarget νW₁ = 4dual value 𝔼_μ[f] − 𝔼_ν[f] = 4.00

= W₁ optimal (the 1-Lipschitz bound is exactly attained)

The dual no longer chooses a transport plan — it chooses a potential f that maximizes 𝔼_μ[f] − 𝔼_ν[f]. The potential wants to be as steep as possible to separate μ from ν, but Kantorovich–Rubinstein requires f to be 1-Lipschitz (|slope| ≤ 1). That cap pins the dual value exactly at W₁, attained at slope −1. Push the slope past ±1 and the value exceeds W₁ — impossible for a true lower bound, which is exactly why the 1-Lipschitz constraint is indispensable.

Pin it to an example you can do in your head. Let μ\mu put all its mass at a single point x0x_0 and ν\nu at a single point y0y_0, with cost c=x0y0c=\lVert x_0-y_0\rVert. The primal has no choice but to ship the whole pile across, at cost d=x0y0d=\lVert x_0-y_0\rVert. On the dual side the contractor maximizes f(x0)+g(y0)f(x_0)+g(y_0) subject to f(x0)+g(y0)df(x_0)+g(y_0)\le d, so he pushes the inequality to equality and earns exactly dd — both sides equal, strong duality visible at a glance. A notch harder: μ=12δ0+12δ2\mu=\tfrac12\delta_{0}+\tfrac12\delta_{2}, ν=δ1\nu=\delta_{1}, distance cost. Both piles are distance 11 from 11, so the optimal plan ships each half-unit to 11 at total cost 121+121=1\tfrac12\cdot1+\tfrac12\cdot1=1; the dual takes f(0)=f(2)=12f(0)=f(2)=\tfrac12 and g(1)=12g(1)=\tfrac12 (check f(x)+g(1)x1f(x)+g(1)\le|x-1|, tight at x=0,2x=0,2), earning 1212+1212+12=1\tfrac12\cdot\tfrac12+\tfrac12\cdot\tfrac12+\tfrac12=1 — matched again.

The c-transform: one potential, not two

For a fixed ff, the best the contractor can do at each yy is charge as much as the constraint allows, giving the cc-transform

fc(y)=infx(c(x,y)f(x)).f^{c}(y)=\inf_{x}\big(c(x,y)-f(x)\big).

This step just writes “raise g(y)g(y) to the constraint ceiling at each yy” as a formula: the constraint g(y)c(x,y)f(x)g(y)\le c(x,y)-f(x) must hold for all xx, so the largest legal g(y)g(y) is the smallest of those upper bounds, i.e. the infimum over xx. In other words, once the load price ff is fixed, the best unload price gg is no longer free — it is fully determined by ff. Substituting g=fcg=f^{c} collapses the two prices into one. The dual becomes a maximization over a single potential:

supf fdμ+fcdν.\sup_{f}\ \int f\,d\mu + \int f^{c}\,d\nu .

This is the semi-dual, and that lone ff is the “single potential” the encoder–decoder methods refer to: optimizing over one function ff is what lets CryoGEN-II replace an adversarial min–max with a single-potential objective. (Peyré & Cuturi §3.1; Genevay et al., semi-dual stochastic OT.) This matters in practice: the primal unknown π\pi is an n×mn\times m table that grows with the dataset, while the semi-dual unknown ff is a function you can hand to a network and optimize with stochastic gradients on mini-batches — that is the hinge on which OT enters a training loop at all.

W₁ and the 1-Lipschitz potential

When the cost is the distance itself, c(x,y)=xyc(x,y)=\lVert x-y\rVert, the constraint forces g=fg=-f with ff 1-Lipschitz, and the semi-dual reduces to the clean Kantorovich–Rubinstein form

W1(μ,ν)=supfLip1 Eμ[f]Eν[f].W_1(\mu,\nu)=\sup_{\lVert f\rVert_{\mathrm{Lip}}\le 1}\ \mathbb{E}_{\mu}[f]-\mathbb{E}_{\nu}[f].

Why does the distance cost force these two conclusions? Put c(x,y)=xyc(x,y)=\lVert x-y\rVert into f(x)+g(y)xyf(x)+g(y)\le\lVert x-y\rVert: setting x=yx=y gives f(x)+g(x)0f(x)+g(x)\le0, and the triangle inequality makes fc=ff^{c}=-f, so g=fg=-f — the two schedules collapse to one and its negative mirror. Back-substituting gives f(x)f(y)xyf(x)-f(y)\le\lVert x-y\rVert for all x,yx,y, which is precisely the definition of 1-Lipschitz (slope at most 11 in magnitude). fLip1\lVert f\rVert_{\mathrm{Lip}}\le1 is that constraint; Eμ[f]Eν[f]\mathbb{E}_{\mu}[f]-\mathbb{E}_{\nu}[f] is the difference of one and the same potential’s means under the two distributions. So W1W_1 no longer needs a plan solved and summed — it reduces to “find one function that climbs no faster than the geometry allows and maximizes the gap between its means on μ\mu and ν\nu.”

Intuition

The potential is a witness that separates the two distributions: it should be high where μ\mu lives and low where ν\nu lives, and the gap it opens, Eμ[f]Eν[f]\mathbb{E}_\mu[f]-\mathbb{E}_\nu[f], scores how far apart they are. The 1-Lipschitz cap forbids it from rising faster than the geometry allows — without that cap the score could be inflated without bound (drag the slider past ±1\pm 1 in the demo to see the “value” overshoot W1W_1, which a true lower bound can never do). The tightest legal witness scores exactly W1W_1.

Complementary slackness: how plan and prices interlock

The primal and the dual are not two unrelated optimizations — at the optimum they interlock through complementary slackness. A pair (x,y)(x,y) carries optimal mass only where its price constraint is tight, f(x)+g(y)=c(x,y)f(x)+g(y)=c(x,y); wherever the constraint is strict, f(x)+g(y)<c(x,y)f(x)+g(y)<c(x,y), the transported mass on that pair is zero.

π(x,y)>0  f(x)+g(y)=c(x,y).\pi^\star(x,y)>0\ \Longrightarrow\ f(x)+g(y)=c(x,y).

The intuition: the contractor only wins business on pairs where his quote exactly meets your own cost — anywhere else he is uncompetitive, and no mass flows there. This set of tight pairs (the cc-cyclically-monotone support) is exactly the set of pairs the optimal coupling is allowed to use. In the μ=12δ0+12δ2\mu=\tfrac12\delta_0+\tfrac12\delta_2 example above, the constraint is tight at x=0,2x=0,2 and the plan does ship only on those two pairs — complementary slackness at work. The same equation explains why Brenier’s theorem in the next section can recover a map from the potential: tightness pins each xx to the yy that makes the equality hold.

This is the Wasserstein-GAN critic

The Kantorovich–Rubinstein form is the Wasserstein GAN objective. The WGAN critic is the potential ff; weight clipping or a gradient penalty enforces the 1-Lipschitz constraint; and the generator is trained to lower Eqx[f]\mathbb{E}_{q_x}[f], pushing the generated distribution qxq_x toward the data. What looks like an adversarial game is, underneath, one side solving the W1W_1 dual and the other descending it — which is why the WGAN gradient stays informative even when the two distributions barely overlap (see the disjoint-support argument).

The contrast with a plain GAN makes the difference concrete. An ordinary discriminator outputs a probability in [0,1][0,1], which saturates (to 00 or 11) the moment the two distributions stop overlapping; its gradient then vanishes and the generator gets no direction. The WGAN critic outputs an unbounded real-valued potential, the 1-Lipschitz cap keeps it from saturating, and Eμ[f]Eqx[f]\mathbb{E}_\mu[f]-\mathbb{E}_{q_x}[f] always gives a distance proportional to W1W_1 that is differentiable everywhere — the practical payoff of the dual view.

Brenier’s theorem: when the plan is a map

The primal page left a thread hanging — a deterministic Monge map “need not exist,” because mass may have to split. The dual closes it. For the squared cost c(x,y)=xy2c(x,y)=\lVert x-y\rVert^2 with an absolutely continuous source μ\mu, the optimal potential is essentially a convex function, and its gradient is the optimal map:

T(x)=φ(x),φ convex,T(x)=\nabla\varphi(x),\qquad \varphi\ \text{convex},

symbol by symbol: φ\varphi is the convex potential built from the dual potentials (concretely a form like φ(x)=12x2f(x)\varphi(x)=\tfrac12\lVert x\rVert^2-f(x)), φ(x)\nabla\varphi(x) is its gradient at xx, and TT is the map that sends source point xx to its target. Convexity is the load-bearing fact: it makes φ\nabla\varphi monotone, so distinct xx are never sent to the same yy to fight over mass, hence no splitting — exactly the squared-cost consequence of the complementary-slackness “pin each xx to one yy” above. So the Kantorovich optimum is induced by a genuine map after all — no splitting needed. (Brenier 1991; Peyré & Cuturi §2.4.) This is the precise sense in which squared-cost optimal transport “straightens” a distribution onto another.

Depth

The entropic dual is the Sinkhorn potentials. Adding the entropy penalty softens the hard constraint f+gcf+g\le c into a smooth penalty, and the dual becomes unconstrained and differentiable. The cc-transform softens into a soft-min (log-sum-exp):

fc,γ(y)=γlog ⁣e(f(x)c(x,y))/γdμ(x),f^{c,\gamma}(y)=-\gamma\log\!\int e^{\,(f(x)-c(x,y))/\gamma}\,d\mu(x),

and the optimal potentials are exactly the log-domain Sinkhorn scalings, f=γlogaf=\gamma\log a and g=γlogbg=\gamma\log b. The row/column rescalings you iterate there are the dual prices here — Sinkhorn is coordinate ascent on this smooth dual. (Peyré & Cuturi §4.4; Cuturi 2013.) Note the inf\inf has become γloge/γ-\gamma\log\int e^{\cdot/\gamma}: as γ0\gamma\to0 the soft-min converges to the hard inf\inf and the smooth dual reverts to the non-smooth hard dual above; the larger γ\gamma, the flatter the price curves and the looser the constraint — the dual-side face of entropic OT’s single temperature knob between “hard assignment” and “diffuse hedge.”

Why duality matters for learning. The primal optimizes over couplings — an n×mn\times m object that grows with the data. The dual optimizes over a single function ff, which a neural network can parameterize directly. That swap is the whole mechanism: it is what makes the Wasserstein distance a trainable loss rather than a quantity to recompute from scratch, and it is shared by WGAN (the hard 1-Lipschitz dual) and the Sinkhorn autoencoder (the smooth entropic dual) alike — the optimal-transport → autoencoder thread, seen from the dual side.

Where this lands in Cryo-ET reconstruction

This chain of dualities is not an abstract exercise; it carries the training objective of several reconstruction methods. When you want the distribution of reconstructed clean volumes to match a reference distribution, you need a distribution distance you can backpropagate through — and the primal’s plan π\pi is non-differentiable and explodes with voxel count, so it cannot serve as a loss directly. The dual rewrites that distance as an optimization over a potential: CryoGEN-II takes the WAE/OT route, using this page’s single-potential objective to get one stable answer; Wasserstein GAN uses the same 1-Lipschitz potential as a critic to drive distribution alignment. Step once toward the entropic side and the smooth dual’s Boltzmann solution becomes the per-image posterior that EVIA / CryoWGEN samples in its E-step (details on the entropic page) — the same dual, giving a MAP-like single answer as γ0\gamma\to0 and a family of reconstructions for γ>0\gamma>0 that carries the uncertainty the missing wedge leaves behind.

Further reading

← Optimal Transport