GANs and the Wasserstein GAN

Training a generator against a critic, and replacing the classifier with a Wasserstein-distance critic for stable gradients.

A generative adversarial network (GAN) trains a generator $G$ that maps noise $z\sim p_z$ to samples $G(z)$, in opposition to a discriminator $D$ that tries to tell generated samples from real ones. The two play a minimax game,

$$\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] +\mathbb{E}_{z\sim p_z}[\log(1-D(G(z)))].$$

[Interactive figure: mode collapse, illustrated. Panels: real data and generated samples. Control: how many modes the generator covers (shown at 2 of 8).]

The real distribution has 8 modes (clusters on a ring). A healthy generator covers them all; a collapsed one piles mass on a few and ignores the rest, which is mode collapse. The Wasserstein distance (WGAN) penalises leaving whole modes unserved, which mitigates it.

Read the objective term by term. $D(x)\in(0,1)$ is the probability the discriminator assigns to $x$ being real. The first term, $\mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)]$, rewards $D$ for scoring real samples high. In the second term, $G(z)$ is a fake sample decoded from prior noise $z$, and $\log(1-D(G(z)))$ rewards $D$ for scoring fakes low. $D$ wants both terms large at once; $G$ wants the second term small, that is, it wants its fakes to fool $D$.
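As a rough numerical check of the objective, the sketch below replaces the two expectations with sample averages. The function name `gan_value` and the hand-picked discriminator outputs are illustrative assumptions, not values from any trained model.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs: a discriminator that separates real from fake well ...
confident = gan_value(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])
# ... and one that cannot tell them apart and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(confident, fooled)
```

The well-separating discriminator scores higher on the objective than the fooled one, whose value is $-\log 4$. That is the direction $D$ pushes in, and the opposite of what $G$ wants.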

The networks update in alternation: $D$ maximizes the objective with $G$ fixed, while $G$ minimizes it with $D$ fixed, so a single objective couples them into an adversarial loop rather than a loss either one minimizes outright. At the optimum the generator’s distribution matches the data and $D\equiv \tfrac12$ — the discriminator can no longer tell real from fake and outputs $0.5$ on every sample. In practice the original objective is fragile: when $D$ grows confident the generator’s gradient vanishes, and training is prone to mode collapse, where $G$ produces only a few outputs that fool $D$ rather than covering the full data distribution.
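The vanishing gradient can be seen in one line of calculus. Assume, for illustration, that the discriminator output is a sigmoid of a logit $a$, so $D(G(z))=\sigma(a)$. Then $\frac{d}{da}\log(1-\sigma(a)) = -\sigma(a)$, which tends to $0$ as $D$ grows confident that a sample is fake ($a\to-\infty$). A minimal sketch under that assumption:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_term_grad(a):
    """d/da log(1 - sigmoid(a)), which simplifies to -sigmoid(a)."""
    return -sigmoid(a)

# As the logit on a fake sample becomes more negative (D more confident),
# the gradient reaching the generator through this term shrinks toward 0.
for a in (0.0, -2.0, -5.0, -10.0):
    print(f"logit={a:6.1f}  D(G(z))={sigmoid(a):.5f}  grad={generator_term_grad(a):.5f}")
```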

Intuition

Picture a forger ($G$) and an inspector ($D$) in a tug-of-war. Each time the inspector spots a tell, the forger fixes that flaw; each time the forger learns a trick, the inspector raises its bar. The ideal endpoint is forgeries indistinguishable from the real thing, leaving the inspector to guess by coin flip. The difficulty is that two networks with opposite goals are moving at once, with no fixed loss surface to descend: progress by one side reshapes the terrain for the other, so the game can circle, or one side can overwhelm the other, without converging.

Because $G$ and $D$ must stay roughly balanced in strength, such games are sensitive to the ratio of learning rates, network capacity, and update counts. A common recipe updates $D$ several times per $G$ step, keeping the judging side slightly ahead so the generator receives a better-aimed gradient — but too strong a $D$ pushes the gradient into a saturated region. Most later refinements amount to different ways of constraining this loop so that its gradients remain informative.
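The update schedule described above can be sketched as a plain loop. The names `train_alternating`, `d_step`, `g_step`, and `n_critic` are hypothetical, and the value `n_critic=5` is only an illustrative choice. In a real setup each step would compute a loss and apply an optimizer update; here the steps only count calls so the schedule itself is visible.

```python
def train_alternating(d_step, g_step, n_iters, n_critic=5):
    """Run n_iters generator updates, each preceded by n_critic discriminator updates."""
    for _ in range(n_iters):
        for _ in range(n_critic):
            d_step()   # update D with G held fixed
        g_step()       # update G with D held fixed

# Stand-in steps that only count how often each side is updated.
counts = {"d": 0, "g": 0}

def count_d():
    counts["d"] += 1

def count_g():
    counts["g"] += 1

train_alternating(count_d, count_g, n_iters=100, n_critic=5)
print(counts)
```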

The minimax game as a Nash equilibrium

The adversarial objective is not minimized by either network alone; its solution is a Nash equilibrium of a two-player game, a joint configuration $(G^*, D^*)$ at which neither player can improve by changing its own parameters while the other is held fixed. Ordinary gradient descent on each network’s parameters need not converge to such a fixed point: the coupled dynamics can cycle around the equilibrium or diverge, since the descent direction for $G$ depends on a $D$ that is itself moving. The simplest illustration is the bilinear game $\min_x\max_y\, xy$, whose only equilibrium is the origin, yet continuous-time gradient dynamics orbit the origin at undiminished radius. This is the same mechanism behind “the objective is dropping but the samples keep getting worse” in adversarial training.

Mode collapse is another symptom of this instability: the generator concentrates probability mass on a narrow set of outputs that the current $D$ happens to score highly, then chases $D$’s response as it adapts, never settling on the full data distribution. When $G$ and $D$ become badly mismatched, a near-perfect discriminator drives the term $\log(1-D(G(z)))$ into a flat region, and the generator’s gradient vanishes. These pathologies motivate replacing the Jensen–Shannon objective with a distance whose gradient remains informative regardless of how the two distributions are positioned.
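The bilinear game is small enough to simulate directly. The sketch below applies simultaneous gradient descent on $x$ and ascent on $y$ with a finite step size; the function name `simultaneous_gda`, the starting point, and the step size are illustrative assumptions. With discrete steps, each update multiplies the distance from the origin by $\sqrt{1+\eta^2}$, so the iterates do not approach the equilibrium and instead spiral slowly outward.

```python
import math

def simultaneous_gda(x, y, lr, steps):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    radii = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x                      # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y    # both players move at once
        radii.append(math.hypot(x, y))
    return radii

radii = simultaneous_gda(x=1.0, y=0.0, lr=0.1, steps=100)
print(radii[0], radii[-1])  # distance from the equilibrium at start and end
```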

Intuition

The standard GAN loss measures a Jensen–Shannon divergence. When the generated and real distributions barely overlap — common early in training — that divergence saturates and provides almost no usable gradient. Here is the picture: if two clouds of points are completely disjoint, the quantity “how much do they overlap” is exactly zero whether they sit an inch or a mile apart — it tells you “no overlap” but not “which way to move to get closer.” A distance that degrades gracefully under disjoint support, and that points toward the direction of approach, is needed instead.

Depth

Make the contrast concrete. Take two distributions, each concentrated on one of two parallel lines a distance $\theta$ apart, so that $\theta$ is the gap between them. As long as $\theta\neq 0$ the supports are disjoint, and the Jensen–Shannon divergence is the constant $\log 2$ (its maximum value, in natural log, reached under complete non-overlap), independent of $\theta$. Its gradient with respect to $\theta$ is therefore $0$ wherever $\theta\neq 0$, so the generator gets no signal pushing $\theta$ toward zero. The Wasserstein-1 distance, by contrast, is $|\theta|$, with gradient $\pm 1$ for every $\theta\neq 0$; it decreases smoothly as the distributions approach even while their supports stay disjoint, so the generator always has a usable direction. That is exactly what “graceful degradation” means. Plug in any number to check: $\theta=2$ and $\theta=0.01$ both give a JS gradient of $0$, while the Wasserstein gradient always points toward smaller $|\theta|$.
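The same check can be run numerically on a simplified one-dimensional version of the example: all real mass at position $0$ and all generated mass at position $\theta$. The helper names `js_divergence` and `w1_point_masses` are illustrative assumptions, and the point-mass setup is a reduction of the parallel-lines picture, not a general-purpose estimator.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between discrete distributions given as dicts."""
    js = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)          # mixture mass at this point
        if pk > 0:
            js += 0.5 * pk * math.log(pk / mk)
        if qk > 0:
            js += 0.5 * qk * math.log(qk / mk)
    return js

def w1_point_masses(a, b):
    """Wasserstein-1 distance between unit point masses at a and b on the real line."""
    return abs(a - b)

for theta in (2.0, 0.01, 0.0):
    real = {0.0: 1.0}      # all real mass at position 0
    fake = {theta: 1.0}    # all generated mass at position theta
    print(theta, js_divergence(real, fake), w1_point_masses(0.0, theta))
```

For any nonzero $\theta$ the JS column stays at $\log 2$, while the Wasserstein column tracks $|\theta|$.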

The Wasserstein GAN (WGAN) replaces the classifier with a critic $f$ that estimates the Wasserstein-1 distance between generated and real data. By the Kantorovich–Rubinstein duality,

$$W_1(p_{\text{data}}, p_G) = \sup_{\|f\|_L \le 1}\; \mathbb{E}_{x\sim p_{\text{data}}}[f(x)] -\mathbb{E}_{z\sim p_z}[f(G(z))],$$

where the supremum is over all 1-Lipschitz functions $f$. Term by term: $f$ is the critic, mapping a sample to a real-valued score. $\|f\|_L\le 1$ is the 1-Lipschitz constraint, requiring that $f$’s values at any two points differ by no more than the distance between them; equivalently, its slope is bounded between $\pm 1$, so $f$ can never be arbitrarily steep. The expression $\mathbb{E}_{p_{\text{data}}}[f]-\mathbb{E}_{p_G}[f]$ is the difference in the critic’s average score on real versus generated samples, and its supremum over all admissible $f$ equals the Wasserstein-1 distance between the two distributions. The critic outputs a real-valued score rather than a probability, and its objective gives the generator a meaningful gradient even when the two distributions do not overlap: because $f$’s slope is bounded by $1$ in magnitude, the best achievable score gap equals the distance itself and shrinks in step with it as the distributions approach, no matter how far apart real and fake start.
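The duality can be checked on a one-dimensional toy case where the exact distance is known. In the sketch below the generated samples are a fixed grid and the real samples are the same grid shifted by 3, so $W_1=3$. The helper names `critic_gap` and `w1_empirical_1d`, and the three hand-picked critics, are illustrative assumptions. Each critic is 1-Lipschitz, so each score gap is a lower bound on $W_1$, and the identity critic attains it.

```python
import math

def mean(values):
    return sum(values) / len(values)

def critic_gap(f, real, fake):
    """E_real[f] - E_fake[f] for a candidate critic f."""
    return mean([f(x) for x in real]) - mean([f(x) for x in fake])

def w1_empirical_1d(real, fake):
    """Exact W1 between two equal-size 1-D samples: match sorted points in order."""
    return mean([abs(a - b) for a, b in zip(sorted(real), sorted(fake))])

fake = [i / 10 for i in range(-20, 21)]   # toy generated samples
real = [x + 3.0 for x in fake]            # toy real samples: same shape, shifted by 3

w1 = w1_empirical_1d(real, fake)
# Three 1-Lipschitz candidates; each gap is a lower bound on W1.
gaps = {
    "identity": critic_gap(lambda x: x, real, fake),
    "sin": critic_gap(math.sin, real, fake),
    "abs": critic_gap(abs, real, fake),
}
print(w1, gaps)
```

Training a WGAN critic amounts to searching over a parametric family of such functions for the one with the largest gap.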

Depth

The duality is valid only while $f$ stays 1-Lipschitz, so the constraint must be enforced. Early WGANs clipped the critic’s weights to a small range, a crude device that distorts the function: clip too tightly and $f$ degenerates to nearly linear with too little capacity; clip too loosely and the constraint stops binding. The gradient penalty (WGAN-GP) instead adds a soft term pushing the gradient norm of $f$ toward $1$ at interpolated points $\hat x$:

$$\lambda\,\mathbb{E}_{\hat x}\big[(\|\nabla_{\hat x} f(\hat x)\|_2 - 1)^2\big].$$

Here $\hat x$ is a point sampled by randomly interpolating between a real and a generated sample. The optimal critic has gradient norm exactly $1$ along the optimal transport path, which motivates enforcing the constraint on those connecting lines rather than everywhere. $\|\nabla_{\hat x} f\|_2$ is the magnitude of $f$’s gradient at that point; the penalty pulls it toward $1$, with $\lambda$ setting the penalty strength. This keeps $f$ close to 1-Lipschitz without clipping, yielding markedly more stable training.
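The penalty term can be sketched without a deep-learning framework. Assumptions in this illustration: the critic is an ordinary Python function on lists of floats, its gradient is approximated by central differences (a real implementation would use automatic differentiation), and the names `grad_norm`, `gradient_penalty`, and the default `lam=10.0` are hypothetical choices.

```python
import math
import random

def grad_norm(f, x, h=1e-5):
    """Euclidean norm of the gradient of f at x, by central differences."""
    sq = 0.0
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        sq += ((f(up) - f(down)) / (2 * h)) ** 2
    return math.sqrt(sq)

def gradient_penalty(f, real, fake, lam=10.0, rng=random):
    """lam * mean over pairs of (||grad f(x_hat)|| - 1)^2 at random interpolates x_hat."""
    total = 0.0
    for xr, xf in zip(real, fake):
        eps = rng.random()  # uniform in [0, 1)
        x_hat = [eps * r + (1 - eps) * g for r, g in zip(xr, xf)]
        total += (grad_norm(f, x_hat) - 1.0) ** 2
    return lam * total / len(real)

real = [[1.0, 2.0], [0.5, -1.0]]   # toy real samples
fake = [[-1.0, 0.0], [2.0, 1.0]]   # toy generated samples

def steep(x):
    return 2.0 * x[0]               # gradient norm 2 everywhere

def unit(x):
    return 0.6 * x[0] + 0.8 * x[1]  # gradient norm 1 everywhere

gp_steep = gradient_penalty(steep, real, fake)
gp_unit = gradient_penalty(unit, real, fake)
print(gp_steep, gp_unit)
```

The too-steep critic is penalised by $\lambda(2-1)^2$, while the unit-slope critic incurs essentially no penalty.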

Density-ratio estimation and the discriminator

The standard GAN discriminator and the Wasserstein critic represent two distinct strategies for comparing distributions. A classifier-style $D$ trained to optimality recovers a density ratio: at its optimum, $D^*(x)=p_{\text{data}}(x)\big/\!\big(p_{\text{data}}(x)+p_G(x)\big)$, from which $p_{\text{data}}/p_G = D^*/(1-D^*)$ can be read off, and substituting $D^*$ back into the objective yields the Jensen–Shannon divergence up to a constant scale and offset. Intuitively, the density ratio answers “at $x$, how many times more real samples than generated ones,” a pointwise quantity that only makes sense where both distributions place mass at the same point. The same density-ratio mechanism appears in latent space in the adversarial autoencoder, where a discriminator matches an aggregated posterior to a prior.

The Wasserstein critic, by contrast, does not estimate a ratio at all: it estimates a transport cost, the minimal “work” needed to move the generated distribution’s mass into the real distribution. That cost stays finite and differentiable even where $p_{\text{data}}$ and $p_G$ share no support and the ratio $p_{\text{data}}/p_G$ is undefined. This is precisely why the Wasserstein formulation degrades gracefully under disjoint support while the density-ratio formulation saturates: transport cost asks “how far to move,” the density ratio asks “who has more at the same point,” and the latter has no answer once the two clouds of points are offset.
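Both claims about the optimal discriminator can be verified on a small discrete example. The two three-point distributions below are made-up toy values. The sketch computes $D^*$, recovers the density ratio as $D^*/(1-D^*)$, and checks that the objective evaluated at $D^*$ equals $2\,\mathrm{JS}(p_{\text{data}}, p_G) - \log 4$.

```python
import math

p_data = {"a": 0.5, "b": 0.3, "c": 0.2}   # toy real distribution
p_gen = {"a": 0.2, "b": 0.3, "c": 0.5}    # toy generator distribution

# Optimal discriminator and the density ratio it encodes.
d_star = {k: p_data[k] / (p_data[k] + p_gen[k]) for k in p_data}
ratio = {k: d_star[k] / (1.0 - d_star[k]) for k in p_data}

# GAN objective evaluated at D*.
value = (sum(p_data[k] * math.log(d_star[k]) for k in p_data)
         + sum(p_gen[k] * math.log(1.0 - d_star[k]) for k in p_data))

# Jensen-Shannon divergence computed directly from its definition.
mix = {k: 0.5 * (p_data[k] + p_gen[k]) for k in p_data}
js = (0.5 * sum(p_data[k] * math.log(p_data[k] / mix[k]) for k in p_data)
      + 0.5 * sum(p_gen[k] * math.log(p_gen[k] / mix[k]) for k in p_data))

print(ratio, value, 2.0 * js - math.log(4.0))
```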

This same adversarial critic reappears as the learned energy prior of CryoGEN-I, matching degraded reconstructions to real observations in observation space; the Wasserstein distance it estimates is then what CryoGEN-II minimizes directly through optimal transport, with no min–max. That shift from “adversarial game” to “direct transport” mirrors this page’s theme: CryoGEN-I keeps the GAN minimax loop, accepting its instability in exchange for a flexible learned energy prior; CryoGEN-II skips the game and minimizes the Wasserstein distance the critic was approximating, yielding a stable single answer. (CryoWGEN goes further still, replacing the critic with an entropic-OT objective solved by sampling, not adversarial training.) In Cryo-ET reconstruction this is the entry point that turns missing-wedge restoration into a distribution-matching problem: real observations form $p_{\text{data}}$, degraded reconstructions form $p_G$, and a critic or transport cost measures and closes the gap between them (see the methods overview). It also connects to the adversarial variant of the autoencoders.
