Kantorovich duality & the dual potentials
Every transport problem has a dual played on prices instead of plans; the c-transform collapses it to a single potential, the W₁ case is exactly the Wasserstein-GAN critic, and Brenier's theorem recovers the optimal map.
The Kantorovich problem chooses a transport plan — a coupling over pairs . Every such problem has a dual that chooses something completely different: not a plan, but a pair of prices. The dual view is the one that turns the Wasserstein distance from a number you compute into an objective a network can train, and it is the piece the Wasserstein autoencoder and Wasserstein GAN — and through them CryoGEN-II and CryoWGEN — quietly rely on.
One sentence to anchor the whole page: the primal optimizes over what to move where (a 2-D table ), the dual over what each location is worth (two curves ); both return the same number, but the dual’s unknown is a function, and a function is exactly what a neural network can fit. The four sections below show how that pair of prices collapses into one potential, why the distance cost forces it to be 1-Lipschitz, how it becomes the WGAN critic, and how, under squared cost, its gradient recovers a genuine map.
Two prices instead of a plan
Picture a shipping contractor. Instead of routing every shovel of earth yourself (choosing ), you hire a contractor who posts two price functions: a price to load a unit at and a price to unload it at . The contractor wants to maximize revenue , but to win your business no load–unload pair may cost more than doing it yourself, . The dual problem is exactly this:
Symbol by symbol: is the supply distribution (where the piles of earth are), is the demand distribution (where the holes are), is the unit cost of moving a unit from to ; is the “load price schedule” and the “unload price schedule,” both functions defined over the whole space. The objective is the total loading fee the contractor collects across all piles, the total unloading fee across all holes. The constraint must hold for every pair , because you can always choose to “move that pair yourself,” so the contractor’s two posted prices must never sum to more than your own cost.
Strong duality holds: this supremum equals the primal infimum — the cheapest plan costs exactly what the smartest contractor can charge. (Kantorovich duality; Peyré & Cuturi 2019, §2.5.)
= W₁ optimal (the 1-Lipschitz bound is exactly attained)
The dual no longer chooses a transport plan — it chooses a potential f that maximizes 𝔼_μ[f] − 𝔼_ν[f]. The potential wants to be as steep as possible to separate μ from ν, but Kantorovich–Rubinstein requires f to be 1-Lipschitz (|slope| ≤ 1). That cap pins the dual value exactly at W₁, attained at slope −1. Push the slope past ±1 and the value exceeds W₁ — impossible for a true lower bound, which is exactly why the 1-Lipschitz constraint is indispensable.
Pin it to an example you can do in your head. Let put all its mass at a single point and at a single point , with cost . The primal has no choice but to ship the whole pile across, at cost . On the dual side the contractor maximizes subject to , so he pushes the inequality to equality and earns exactly — both sides equal, strong duality visible at a glance. A notch harder: , , distance cost. Both piles are distance from , so the optimal plan ships each half-unit to at total cost ; the dual takes and (check , tight at ), earning — matched again.
The c-transform: one potential, not two
For a fixed , the best the contractor can do at each is charge as much as the constraint allows, giving the -transform
This step just writes “raise to the constraint ceiling at each ” as a formula: the constraint must hold for all , so the largest legal is the smallest of those upper bounds, i.e. the infimum over . In other words, once the load price is fixed, the best unload price is no longer free — it is fully determined by . Substituting collapses the two prices into one. The dual becomes a maximization over a single potential:
This is the semi-dual, and that lone is the “single potential” the encoder–decoder methods refer to: optimizing over one function is what lets CryoGEN-II replace an adversarial min–max with a single-potential objective. (Peyré & Cuturi §3.1; Genevay et al., semi-dual stochastic OT.) This matters in practice: the primal unknown is an table that grows with the dataset, while the semi-dual unknown is a function you can hand to a network and optimize with stochastic gradients on mini-batches — that is the hinge on which OT enters a training loop at all.
W₁ and the 1-Lipschitz potential
When the cost is the distance itself, , the constraint forces with 1-Lipschitz, and the semi-dual reduces to the clean Kantorovich–Rubinstein form
Why does the distance cost force these two conclusions? Put into : setting gives , and the triangle inequality makes , so — the two schedules collapse to one and its negative mirror. Back-substituting gives for all , which is precisely the definition of 1-Lipschitz (slope at most in magnitude). is that constraint; is the difference of one and the same potential’s means under the two distributions. So no longer needs a plan solved and summed — it reduces to “find one function that climbs no faster than the geometry allows and maximizes the gap between its means on and .”
The potential is a witness that separates the two distributions: it should be high where lives and low where lives, and the gap it opens, , scores how far apart they are. The 1-Lipschitz cap forbids it from rising faster than the geometry allows — without that cap the score could be inflated without bound (drag the slider past in the demo to see the “value” overshoot , which a true lower bound can never do). The tightest legal witness scores exactly .
Complementary slackness: how plan and prices interlock
The primal and the dual are not two unrelated optimizations — at the optimum they interlock through complementary slackness. A pair carries optimal mass only where its price constraint is tight, ; wherever the constraint is strict, , the transported mass on that pair is zero.
The intuition: the contractor only wins business on pairs where his quote exactly meets your own cost — anywhere else he is uncompetitive, and no mass flows there. This set of tight pairs (the -cyclically-monotone support) is exactly the set of pairs the optimal coupling is allowed to use. In the example above, the constraint is tight at and the plan does ship only on those two pairs — complementary slackness at work. The same equation explains why Brenier’s theorem in the next section can recover a map from the potential: tightness pins each to the that makes the equality hold.
This is the Wasserstein-GAN critic
The Kantorovich–Rubinstein form is the Wasserstein GAN objective. The WGAN critic is the potential ; weight clipping or a gradient penalty enforces the 1-Lipschitz constraint; and the generator is trained to lower , pushing the generated distribution toward the data. What looks like an adversarial game is, underneath, one side solving the dual and the other descending it — which is why the WGAN gradient stays informative even when the two distributions barely overlap (see the disjoint-support argument).
The contrast with a plain GAN makes the difference concrete. An ordinary discriminator outputs a probability in , which saturates (to or ) the moment the two distributions stop overlapping; its gradient then vanishes and the generator gets no direction. The WGAN critic outputs an unbounded real-valued potential, the 1-Lipschitz cap keeps it from saturating, and always gives a distance proportional to that is differentiable everywhere — the practical payoff of the dual view.
Brenier’s theorem: when the plan is a map
The primal page left a thread hanging — a deterministic Monge map “need not exist,” because mass may have to split. The dual closes it. For the squared cost with an absolutely continuous source , the optimal potential is essentially a convex function, and its gradient is the optimal map:
symbol by symbol: is the convex potential built from the dual potentials (concretely a form like ), is its gradient at , and is the map that sends source point to its target. Convexity is the load-bearing fact: it makes monotone, so distinct are never sent to the same to fight over mass, hence no splitting — exactly the squared-cost consequence of the complementary-slackness “pin each to one ” above. So the Kantorovich optimum is induced by a genuine map after all — no splitting needed. (Brenier 1991; Peyré & Cuturi §2.4.) This is the precise sense in which squared-cost optimal transport “straightens” a distribution onto another.
The entropic dual is the Sinkhorn potentials. Adding the entropy penalty softens the hard constraint into a smooth penalty, and the dual becomes unconstrained and differentiable. The -transform softens into a soft-min (log-sum-exp):
and the optimal potentials are exactly the log-domain Sinkhorn scalings, and . The row/column rescalings you iterate there are the dual prices here — Sinkhorn is coordinate ascent on this smooth dual. (Peyré & Cuturi §4.4; Cuturi 2013.) Note the has become : as the soft-min converges to the hard and the smooth dual reverts to the non-smooth hard dual above; the larger , the flatter the price curves and the looser the constraint — the dual-side face of entropic OT’s single temperature knob between “hard assignment” and “diffuse hedge.”
Why duality matters for learning. The primal optimizes over couplings — an object that grows with the data. The dual optimizes over a single function , which a neural network can parameterize directly. That swap is the whole mechanism: it is what makes the Wasserstein distance a trainable loss rather than a quantity to recompute from scratch, and it is shared by WGAN (the hard 1-Lipschitz dual) and the Sinkhorn autoencoder (the smooth entropic dual) alike — the optimal-transport → autoencoder thread, seen from the dual side.
Where this lands in Cryo-ET reconstruction
This chain of dualities is not an abstract exercise; it carries the training objective of several reconstruction methods. When you want the distribution of reconstructed clean volumes to match a reference distribution, you need a distribution distance you can backpropagate through — and the primal’s plan is non-differentiable and explodes with voxel count, so it cannot serve as a loss directly. The dual rewrites that distance as an optimization over a potential: CryoGEN-II takes the WAE/OT route, using this page’s single-potential objective to get one stable answer; Wasserstein GAN uses the same 1-Lipschitz potential as a critic to drive distribution alignment. Step once toward the entropic side and the smooth dual’s Boltzmann solution becomes the per-image posterior that EVIA / CryoWGEN samples in its E-step (details on the entropic page) — the same dual, giving a MAP-like single answer as and a family of reconstructions for that carries the uncertainty the missing wedge leaves behind.
Further reading
- Peyré & Cuturi. Computational Optimal Transport. 2019. (§2.4–2.5, §3.1, §4.4, §6.2)
- Genevay, Cuturi, Peyré, Bach. Stochastic Optimization for Large-scale Optimal Transport. NeurIPS 2016.
- Arjovsky, Chintala, Bottou. Wasserstein GAN. ICML 2017.
- Brenier. Polar factorization and monotone rearrangement of vector-valued functions. 1991.