GANs and the Wasserstein GAN

Training a generator against a critic, and replacing the classifier with a Wasserstein-distance critic for stable gradients.

A generative adversarial network (GAN) trains a generator $G$ that maps noise $z\sim p_z$ to samples $G(z)$, in opposition to a discriminator $D$ that tries to tell generated samples from real ones. The two play a minimax game,

[Figure: GAN data flow. Noise $z \sim p_z$ enters the generator $G$, which emits a fake sample $G(z)$; the discriminator / critic $D$ receives real data $x$ and fake samples and outputs real / fake or a Wasserstein score. In the minimax game, $G$ tries to make $D$ accept $G(z)$, while $D$ tries to separate real from fake.]

[Interactive figure: mode collapse, illustrated, with a control for how many modes the generator covers. Legend: real data, generated.] The real distribution has 8 modes (clusters on a ring). A healthy generator covers them all; a collapsed one piles mass on a few and ignores the rest, which is mode collapse. The Wasserstein distance (WGAN) penalises leaving whole modes unserved, which mitigates it.

$$\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1-D(G(z)))].$$

Read the objective term by term. $D(x)\in(0,1)$ is the probability the discriminator assigns to $x$ being real. The first term, $\mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)]$, rewards $D$ for scoring real samples high. In the second term, $G(z)$ is a fake sample decoded from prior noise $z$, and $\log(1-D(G(z)))$ rewards $D$ for scoring fakes low. $D$ wants both terms large at once; $G$ wants the second term small, that is, it wants its fakes to fool $D$.
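The two terms can be estimated from batches of discriminator outputs. The following is a minimal sketch, not a training implementation: the function name `gan_value` and the example batches are illustrative assumptions, and the outputs are assumed to lie strictly inside $(0,1)$.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on a batch of real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on a batch of generated samples
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator: both terms sit just below 0, their upper bound.
confident = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# A discriminator that outputs 0.5 everywhere: the value is -log 4.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator pushes this number up toward 0, while the generator can only influence the second term, through `d_fake`.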

The networks update in alternation: $D$ maximizes the objective with $G$ fixed, while $G$ minimizes it with $D$ fixed, so a single objective couples them into an adversarial loop rather than a loss either one minimizes outright. At the optimum the generator’s distribution matches the data and $D\equiv \tfrac12$: the discriminator can no longer tell real from fake and outputs $0.5$ on every sample. In practice the original objective is fragile: when $D$ grows confident the generator’s gradient vanishes, and training is prone to mode collapse, where $G$ produces only a few outputs that fool $D$ rather than covering the full data distribution.

Intuition

Picture a forger ($G$) and an inspector ($D$) in a tug-of-war. Each time the inspector spots a tell, the forger fixes that flaw; each time the forger learns a trick, the inspector raises its bar. The ideal endpoint is forgeries indistinguishable from the real thing, leaving the inspector to guess by coin flip. The difficulty is that two networks with opposite goals are moving at once, with no fixed loss surface to descend: progress by one side reshapes the terrain for the other, so the game can circle, or one side can overwhelm the other, without converging.

Because $G$ and $D$ must stay roughly balanced in strength, such games are sensitive to the ratio of learning rates, network capacity, and update counts. A common recipe updates $D$ several times per $G$ step, keeping the judging side slightly ahead so the generator receives a better-aimed gradient. Too strong a $D$, however, pushes the generator’s loss into a saturated region where its gradient vanishes. Most later refinements amount to different ways of constraining this loop so that its gradients remain informative.
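The "several discriminator updates per generator step" recipe can be sketched as a loop skeleton. Everything named here is hypothetical: `d_step` and `g_step` stand in for whatever performs one optimizer step in a real framework, and the default `n_critic=5` is only an illustrative choice.

```python
def train_alternating(d_step, g_step, num_iterations, n_critic=5):
    """Run n_critic discriminator updates for every generator update.

    d_step and g_step are hypothetical zero-argument callables that each
    perform one optimizer step and return that step's loss.
    """
    history = []
    for _ in range(num_iterations):
        # D maximizes the objective with G held fixed.
        d_losses = [d_step() for _ in range(n_critic)]
        # G minimizes the objective with D held fixed.
        g_loss = g_step()
        history.append((d_losses[-1], g_loss))
    return history

# Usage with stand-in steps that only count how often they are called.
calls = {"D": 0, "G": 0}

def fake_d_step():
    calls["D"] += 1
    return 0.0

def fake_g_step():
    calls["G"] += 1
    return 0.0

history = train_alternating(fake_d_step, fake_g_step, num_iterations=3, n_critic=5)
```

Raising `n_critic` strengthens the judging side; the right ratio depends on the model and is something to tune rather than a fixed rule.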

The minimax game as a Nash equilibrium

The adversarial objective is not minimized by either network alone; its solution is a Nash equilibrium of a two-player game, a joint configuration $(G^*, D^*)$ at which neither player can improve by changing its own parameters while the other is held fixed. Ordinary gradient descent on each network’s parameters need not converge to such a fixed point: the coupled dynamics can cycle around the equilibrium or diverge, since the descent direction for $G$ depends on a $D$ that is itself moving. The simplest illustration is the bilinear game $\min_x\max_y\, xy$, whose only equilibrium is the origin: continuous-time gradient dynamics orbit the origin at undiminished radius, and discrete simultaneous updates spiral outward. This is the same mechanism behind “the objective is dropping but the samples keep getting worse” in adversarial training. Mode collapse is another symptom of this instability: the generator concentrates probability mass on a narrow set of outputs that the current $D$ happens to score highly, then chases $D$’s response as it adapts, never settling on the full data distribution. When $G$ and $D$ become badly mismatched, a near-perfect discriminator drives the term $\log(1-D(G(z)))$ into a flat region, and the generator’s gradient vanishes. These pathologies motivate replacing the Jensen–Shannon objective with a distance whose gradient remains informative regardless of how the two distributions are positioned.
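The bilinear game is small enough to simulate directly. This is a minimal sketch assuming simultaneous discrete updates with an illustrative step size `lr`; under that assumption each step multiplies the distance from the origin by $\sqrt{1+\text{lr}^2}$, so the iterates never approach the equilibrium.

```python
import math

def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent on x and ascent on y for f(x, y) = x * y.

    Returns the distance from the origin after every step.
    """
    radii = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # df/dx = y, df/dy = x
        # Both players move at once, each using the other's old position.
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(math.hypot(x, y))
    return radii

radii = simultaneous_gda(x=1.0, y=0.0)
```

Shrinking `lr` slows the outward drift but does not reverse it, which is why stabilising adversarial training generally takes a change of objective or update rule rather than a smaller step size alone.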

Intuition

The standard GAN loss measures a Jensen–Shannon divergence. When the generated and real distributions barely overlap — common early in training — that divergence saturates and provides almost no usable gradient. Here is the picture: if two clouds of points are completely disjoint, the quantity “how much do they overlap” is exactly zero whether they sit an inch or a mile apart — it tells you “no overlap” but not “which way to move to get closer.” A distance that degrades gracefully under disjoint support, and that points toward the direction of approach, is needed instead.

Depth

Make the contrast concrete. Take two distributions, each concentrated on one of two parallel lines a distance $|\theta|$ apart. As long as $\theta\neq 0$ the supports are disjoint, and the Jensen–Shannon divergence is the constant $\log 2$, independent of $\theta$: its gradient with respect to $\theta$ is $0$ wherever $\theta\neq 0$, so the generator gets no signal pushing $\theta$ toward zero, and the objective stays saturated for as long as the supports are separate. The Wasserstein-1 distance, by contrast, is $|\theta|$, with gradient $\pm 1$ for every $\theta\neq 0$; it decreases smoothly as the distributions approach even while their supports stay disjoint, so the generator always has a usable direction, which is exactly what “graceful degradation” means. Here $\theta$ is the offset between the two distributions and $\log 2$ is the maximum value the JS divergence can take, reached under complete non-overlap (in natural log). Plug in any number to check: $\theta=2$ and $\theta=0.01$ both give a JS gradient of $0$, while the Wasserstein gradient always points toward smaller $|\theta|$.
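The comparison can be checked numerically. This sketch makes a simplifying assumption that is not in the text above: the two distributions are reduced to unit point masses at $0$ and $\theta$, for which the Wasserstein-1 distance is simply $|\theta|$. The helper names are illustrative.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete
    distributions given as {point: probability} dictionaries."""
    def kl_to_mixture(a, b):
        total = 0.0
        for point, mass in a.items():
            if mass > 0.0:
                mixture = 0.5 * (mass + b.get(point, 0.0))
                total += mass * math.log(mass / mixture)
        return total
    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)

def w1_point_masses(a, b):
    """Wasserstein-1 distance between unit point masses at a and b."""
    return abs(a - b)

results = {}
for theta in (2.0, 0.01):
    p, q = {0.0: 1.0}, {theta: 1.0}
    results[theta] = (js_divergence(p, q), w1_point_masses(0.0, theta))
```

Both entries of `results` carry the same JS value, $\log 2$, while the Wasserstein value tracks $\theta$; only the second tells the generator how far it still has to move.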

The Wasserstein GAN (WGAN) replaces the classifier with a critic $f$ that estimates the Wasserstein-1 distance between generated and real data. By the Kantorovich–Rubinstein duality,

$$W_1(p_{\text{data}}, p_G) = \sup_{\|f\|_L \le 1}\; \mathbb{E}_{x\sim p_{\text{data}}}[f(x)] - \mathbb{E}_{z\sim p_z}[f(G(z))],$$

where the supremum is over all 1-Lipschitz functions $f$. Term by term: $f$ is the critic, mapping a sample to a real-valued score; $\|f\|_L\le 1$ is the 1-Lipschitz constraint, requiring that $f$’s values at any two points differ by no more than the distance between them (equivalently, its slope is bounded between $\pm 1$, so $f$ can never be arbitrarily steep); the expression $\mathbb{E}_{p_{\text{data}}}[f]-\mathbb{E}_{p_G}[f]$ is the difference in the critic’s average score on real versus generated samples, and its supremum over all admissible $f$ equals the Wasserstein-1 distance between the two distributions. The critic outputs a real-valued score rather than a probability, and its objective gives the generator a meaningful gradient even when the two distributions do not overlap: the optimal $f$ uses its full slope budget of $1$ between the two distributions, so the score gap shrinks in proportion to the distance as they approach, no matter how far apart real and fake start.
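In one dimension both sides of the duality can be computed by hand, which makes for a quick sanity check. This is a sketch under stated assumptions: equal-size 1-D samples, for which the optimal transport plan matches sorted values, and a generated sample that is just the real sample shifted by 3. The function names are illustrative.

```python
def w1_empirical_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan matches sorted samples.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def critic_gap(f, real, fake):
    """E_real[f] - E_fake[f], the quantity inside the supremum."""
    return sum(map(f, real)) / len(real) - sum(map(f, fake)) / len(fake)

real = [3.0, 3.5, 4.0, 4.5]
fake = [0.0, 0.5, 1.0, 1.5]  # the same sample shifted left by 3

primal = w1_empirical_1d(real, fake)               # transport cost
dual = critic_gap(lambda x: x, real, fake)         # f(x) = x is 1-Lipschitz
weak = critic_gap(lambda x: 0.5 * x, real, fake)   # also 1-Lipschitz, not optimal
```

Here the critic $f(x)=x$ attains the transport cost, while the flatter critic undershoots it: any admissible $f$ gives a lower bound, and training the critic amounts to tightening that bound.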

Depth

The duality is valid only while $f$ stays 1-Lipschitz, so the constraint must be enforced. Early WGANs clipped the critic’s weights to a small range, a crude device that distorts the function: clip too tightly and $f$ degenerates to nearly linear with too little capacity; clip too loosely and the constraint stops binding. The gradient penalty (WGAN-GP) instead adds a soft term pushing the gradient norm of $f$ toward $1$ at interpolated points $\hat x$:

$$\lambda\,\mathbb{E}_{\hat x}\big[(\|\nabla_{\hat x} f(\hat x)\|_2 - 1)^2\big].$$

Here $\hat x$ is a point sampled by randomly interpolating between a real and a generated sample (the optimal critic has gradient norm exactly $1$ along the optimal transport paths, which motivates enforcing the constraint on those connecting lines); $\|\nabla_{\hat x} f\|_2$ is the magnitude of $f$’s gradient at that point; the penalty pulls it toward $1$, with $\lambda$ setting the penalty strength. This keeps $f$ close to 1-Lipschitz without clipping, yielding markedly more stable training.
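The penalty term can be sketched without a deep-learning framework. This illustration makes several assumptions: the critic is a plain Python function, its gradient is approximated by central differences (a real implementation would use automatic differentiation), and `lam=10.0` is only an illustrative default for $\lambda$.

```python
import math
import random

def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x (a list)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def gradient_penalty(f, real_batch, fake_batch, lam=10.0, rng=random):
    """lam * mean((||grad f(x_hat)||_2 - 1)^2) over interpolated points x_hat."""
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        t = rng.random()  # interpolation coefficient, uniform on [0, 1)
        x_hat = [t * r + (1.0 - t) * g for r, g in zip(x_real, x_fake)]
        norm = math.sqrt(sum(g * g for g in numerical_grad(f, x_hat)))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)

real_batch = [[1.0, 2.0], [0.5, -1.0]]
fake_batch = [[0.0, 0.0], [2.0, 1.0]]

steep = gradient_penalty(lambda x: 2.0 * x[0], real_batch, fake_batch)  # slope 2
unit = gradient_penalty(lambda x: x[0], real_batch, fake_batch)         # slope 1
```

A critic with slope 2 is penalised by roughly $\lambda\,(2-1)^2$, while a unit-slope critic incurs essentially no penalty, so the term acts as a soft constraint rather than a hard clip.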

Density-ratio estimation and the discriminator

The standard GAN discriminator and the Wasserstein critic represent two distinct strategies for comparing distributions. A classifier-style $D$ trained to optimality recovers a density ratio: at its optimum, $D^*(x)=p_{\text{data}}(x)\big/\big(p_{\text{data}}(x)+p_G(x)\big)$, from which $p_{\text{data}}/p_G = D^*/(1-D^*)$ can be read off, and substituting $D^*$ back into the objective yields the Jensen–Shannon divergence up to an affine transformation, namely $2\,\mathrm{JS}(p_{\text{data}}\,\|\,p_G)-\log 4$. Intuitively, the density ratio answers “at $x$, how many times more real samples than generated ones”, a pointwise quantity that only makes sense where both distributions place mass at the same point. The same density-ratio mechanism appears in latent space in the adversarial autoencoder, where a discriminator matches an aggregated posterior to a prior. The Wasserstein critic, by contrast, does not estimate a ratio at all: it estimates a transport cost, the minimal “work” needed to move the generated distribution’s mass into the real distribution, which stays finite and differentiable even where $p_{\text{data}}$ and $p_G$ share no support and the ratio $p_{\text{data}}/p_G$ is undefined. This is precisely why the Wasserstein formulation degrades gracefully under disjoint support while the density-ratio formulation saturates: transport cost asks “how far to move,” the density ratio asks “who has more at the same point,” and the latter has no answer once the two clouds of points are offset.
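The density-ratio reading can be verified on a small discrete example. This is a sketch under illustrative assumptions: both distributions live on the same three points with full support, and the probabilities are made up for the check. It confirms that $D^*/(1-D^*)$ recovers $p_{\text{data}}/p_G$ and that the objective at $D^*$ equals $2\,\mathrm{JS}-\log 4$.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) on a shared finite support."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_at(d, p_data, p_g):
    """E_{p_data}[log D] + E_{p_G}[log(1 - D)] for discrete distributions."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for dx, pd, pg in zip(d, p_data, p_g))

def kl(a, b):
    """Kullback-Leibler divergence (natural log) for full-support inputs."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) for full-support inputs."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.7, 0.2, 0.1]
p_g = [0.1, 0.3, 0.6]

d_star = optimal_discriminator(p_data, p_g)
density_ratio = [d / (1.0 - d) for d in d_star]  # recovers p_data / p_G
value = value_at(d_star, p_data, p_g)
```

When the two distributions coincide, `optimal_discriminator` returns $0.5$ everywhere and the value drops to $-\log 4$, its minimum over generators.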

This same adversarial critic reappears as the learned energy prior of CryoGEN-I, matching degraded reconstructions to real observations in observation space; the Wasserstein distance it estimates is then what CryoGEN-II minimizes directly through optimal transport, with no min–max. That shift from “adversarial game” to “direct transport” mirrors this page’s theme: CryoGEN-I keeps the GAN minimax loop, accepting its instability in exchange for a flexible learned energy prior; CryoGEN-II skips the game and minimizes the Wasserstein distance the critic was approximating, yielding a stable single answer. (CryoWGEN goes further still, replacing the critic with an entropic-OT objective solved by sampling, not adversarial training.) In Cryo-ET reconstruction this is the entry point that turns missing-wedge restoration into a distribution-matching problem: real observations form $p_{\text{data}}$, degraded reconstructions form $p_G$, and a critic or transport cost measures and closes the gap between them (see the methods overview). It also connects to the adversarial variant of the autoencoders.
