CryoGEN-II: distribution matching via optimal transport

Trading per-image optimality for global distribution alignment — using optimal transport to stabilize training and match the distribution of real structures

CryoGEN-II picks up directly from CryoGEN-I. CryoGEN-I returns a point estimate $x^*$ per tomogram, but its adversarial min–max training is prone to instability and mode collapse. CryoGEN-II keeps the same foundation (no ground truth, supervision via the $P_Y$ proxy) but changes the goal from per-image optimality to making the aggregate distribution of generated reconstructions agree with the distribution of real structures, and it pursues that goal through optimal transport rather than an adversarial game. It still returns one deterministic reconstruction per observation; turning that answer into a family of reconstructions that expresses the uncertainty left by the missing wedge is the job of CryoWGEN.

Intuition

CryoGEN-I gives each image one “best answer,” but what it learns is a density ratio (a discriminator separating real from fake). Early in training, when the two distributions barely overlap, that ratio either saturates or produces exploding or vanishing gradients — this is the root of the min–max instability and mode collapse.

CryoGEN-II asks a different question: rather than fighting image by image, require that all generated reconstructions, taken as one aggregate distribution, agree with the distribution of real structures. Even when two distributions start out disjoint, “how far the mass has to move to turn one pile into the other” stays finite and varies continuously as the piles approach each other, so it still supplies a usable gradient. That is exactly what lets optimal transport stabilize training.
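The contrast can be checked on a toy example. The sketch below is only an illustration: it uses the Jensen-Shannon divergence as a stand-in for a discriminator-style objective (the text does not specify CryoGEN-I's loss), and the histograms, grid size, and function names are all made up for this example. For two disjoint point masses the divergence stays pinned at $\log 2$ however far apart they are, while the 1D transport distance grows with the separation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two histograms."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_on_grid(p, q):
    """1D Wasserstein-1 distance on a unit-spaced grid: sum of |CDF gaps|."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total

def point_mass(position, size=12):
    """Histogram with all of its mass in a single bin (toy distribution)."""
    return [1.0 if i == position else 0.0 for i in range(size)]

source = point_mass(0)
for shift in (1, 3, 6, 9):
    target = point_mass(shift)
    # The divergence column does not change; the transport column tracks the shift.
    print(shift, round(js_divergence(source, target), 4), w1_on_grid(source, target))
```

The divergence carries no information about which direction reduces the gap, whereas the transport distance shrinks steadily as the target moves toward the source.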

[Figure: source distribution μ and target distribution ν, joined by the optimal matching. Total transport cost in the example shown: 13.79]

Under 1D squared cost the optimal coupling pairs the sorted source points with the sorted target points (the monotone matching). Any crossing pair strictly raises the total cost, so the connecting lines never cross; shifting the offset translates the mass and the cost varies smoothly.
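A minimal sketch of the monotone matching, assuming equal numbers of equally weighted source and target points. The sample values are made up and do not correspond to the figure. The sorted pairing is compared against a brute-force search over all one-to-one matchings, and the last loop shows how the cost responds to shifting the source.

```python
from itertools import permutations

def total_cost(source, target):
    """Total squared cost of pairing source[i] with target[i]."""
    return sum((s - t) ** 2 for s, t in zip(source, target))

def monotone_cost(source, target):
    """Cost of the monotone matching: sort both sides, pair in order."""
    return total_cost(sorted(source), sorted(target))

def brute_force_cost(source, target):
    """Cheapest one-to-one matching, found by trying every permutation."""
    return min(total_cost(source, perm) for perm in permutations(target))

source = [0.3, -1.2, 2.0, 0.9, -0.4]   # made-up sample points
target = [1.5, 0.1, 3.2, -0.7, 2.4]

# Both searches should agree on the minimum total cost.
print(monotone_cost(source, target), brute_force_cost(source, target))

# Shifting every source point by the same offset preserves the sorted order,
# so the optimal cost is a quadratic (hence smooth) function of the offset.
for offset in (-1.0, 0.0, 1.0, 2.0):
    shifted = [s + offset for s in source]
    print(offset, round(monotone_cost(shifted, target), 4))
```

Sorting costs O(n log n), whereas the brute-force search is factorial in n and is included only as a check.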

From an adversarial game to optimal transport

Instead of training a discriminator to play against a generator, CryoGEN-II directly minimizes the cost of transporting the generated distribution $q_x$ onto the data distribution $p_y$:

$$\mathcal{W}_c(p_y,q_x)=\inf_{\pi\in\Pi(p_y,q_x)}\mathbb{E}_{(y,x)\sim\pi}\big[c\big(y,\mathcal{T}_M(x)\big)\big].$$

Term by term: $p_y$ is the distribution of real observations (the many tomograms actually recorded); $q_x$ is the aggregate, in reconstruction space $\mathcal{X}$, of the reconstructions the network outputs across observations; $\pi$ is a coupling (a transport plan) specifying how much mass moves from $y$ to $x$, and $\Pi(p_y,q_x)$ is the set of all couplings with marginals $p_y$ and $q_x$; the $\inf$ picks the cheapest such plan. The cost $c$ is the squared $\ell_2$ distance. Note that $x$ and $y$ live in different spaces and cannot be compared directly: the cost first pushes the clean reconstruction $x$ back into observation space with the same missing-wedge operator $\mathcal{T}_M$ that CryoGEN uses, then compares it to $y$ via the squared distance $\big\lVert y-\mathcal{T}_M(x)\big\rVert^2$.
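The objective can be evaluated exactly on a tiny example. In this sketch, which is a toy under stated assumptions rather than the authors' implementation, the operator $\mathcal{T}_M$ is replaced by a hypothetical 0/1 mask acting on short coefficient vectors, and both sample sets have the same size with uniform weights, in which case the infimum over couplings is attained by a one-to-one assignment and can be found by brute force.

```python
from itertools import permutations

# Hypothetical mask: 0 marks a coefficient lost to the missing wedge.
MASK = [1.0, 1.0, 0.0, 1.0]

def apply_wedge(x, mask=MASK):
    """Toy stand-in for T_M: zero out the masked coefficients."""
    return [m * xi for m, xi in zip(mask, x)]

def cost(y, x):
    """c(y, T_M(x)) = ||y - T_M(x)||^2, compared in observation space."""
    return sum((yi - zi) ** 2 for yi, zi in zip(y, apply_wedge(x)))

def empirical_transport_cost(observations, reconstructions):
    """Brute-force OT between equal-size, uniformly weighted sample sets."""
    n = len(observations)
    return min(
        sum(cost(y, reconstructions[j]) for y, j in zip(observations, perm)) / n
        for perm in permutations(range(n))
    )

# Made-up reconstructions, and observations generated from them in shuffled order.
reconstructions = [[1.0, 0.0, 5.0, 2.0], [0.5, 1.5, -3.0, 0.0], [2.0, 2.0, 0.7, 1.0]]
observations = [apply_wedge(reconstructions[i]) for i in (2, 0, 1)]

print(empirical_transport_cost(observations, reconstructions))

# Shifting every reconstruction moves the degraded set away from the observations.
shifted = [[xi + 0.5 for xi in x] for x in reconstructions]
print(empirical_transport_cost(observations, shifted))
```

Because the comparison happens after masking, changing only a masked coefficient of a reconstruction leaves `cost` unchanged in this toy setup. The factorial search is for illustration only and does not scale.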

Depth

Taking the $\inf$ over couplings $\pi$ directly is a linear program whose size blows up with the number of samples, so it cannot sit inside a training loop. The semi-dual form of optimal transport rewrites it as an optimization over a single potential $f$:

$$\mathcal{W}_c(p_y,q_x)=\sup_{f}\;\mathbb{E}_{y\sim p_y}\big[f(y)\big]\;-\;\mathbb{E}_{x\sim q_x}\big[f^c\big(\mathcal{T}_M(x)\big)\big],$$

where $f^c$ is the c-transform of $f$ with respect to the cost $c$. The key point: only one function $f$ is being optimized, with no CryoGEN-I-style min–max game between a generator and a discriminator, which makes training markedly more stable. This single-potential form is also naturally compatible with an encoder–decoder architecture. It closely parallels the Wasserstein Auto-Encoder (WAE) objective: encode data into a latent space, decode to reconstruct, and penalize the mismatch between the aggregate posterior in latent space and the prior. This “optimal transport → autoencoder” skeleton is what CryoGEN-II borrows.
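A small numerical sketch of the semi-dual, under assumptions that are not in the text. It takes the sign convention $f^c(z)=\sup_y\,[f(y)-c(y,z)]$, which is the one under which the formula holds with a minus sign as written. It works with 1D scalar samples, and the values `zs` stand for already-degraded generations $\mathcal{T}_M(x)$. The potential is just a list of values, one per observation, improved by plain subgradient ascent. By construction $f(y)-f^c(z)\le c(y,z)$, so the dual value can never exceed the primal transport cost, and the ascent should push it up toward that cost.

```python
def c_transform(f, ys, z):
    """f^c(z) = max_i [ f(y_i) - (y_i - z)^2 ] over the observed samples."""
    return max(fi - (yi - z) ** 2 for fi, yi in zip(f, ys))

def dual_value(f, ys, zs):
    """Semi-dual objective: mean of f over y minus mean of f^c over z."""
    return sum(f) / len(ys) - sum(c_transform(f, ys, z) for z in zs) / len(zs)

def primal_cost_1d(ys, zs):
    """Exact 1D squared-cost OT for equal-size samples: sorted matching."""
    return sum((y - z) ** 2 for y, z in zip(sorted(ys), sorted(zs))) / len(ys)

def ascend(ys, zs, steps=2000, lr=0.05):
    """Subgradient ascent on the potential values; returns the best dual value seen."""
    n = len(ys)
    f = [0.0] * n
    best = dual_value(f, ys, zs)
    for _ in range(steps):
        grad = [1.0 / n] * n
        for z in zs:
            # The index attaining the max in f^c(z) receives the negative mass.
            winner = max(range(n), key=lambda i: f[i] - (ys[i] - z) ** 2)
            grad[winner] -= 1.0 / len(zs)
        f = [fi + lr * g for fi, g in zip(f, grad)]
        best = max(best, dual_value(f, ys, zs))
    return best

ys = [0.0, 1.0, 2.5, 4.0]    # made-up observations
zs = [0.4, 1.8, 2.0, 5.0]    # made-up degraded generations T_M(x)

print(primal_cost_1d(ys, zs), dual_value([0.0] * 4, ys, zs), ascend(ys, zs))
```

A neural potential trained by stochastic gradient steps would replace the list `f` in practice; the list version is used here only because it makes the bound easy to verify.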

Global distribution matching

The real dividing line between CryoGEN-I and CryoGEN-II is the level at which “alignment” happens. CryoGEN-I is per-image: for each $y$ it independently finds the $x^*$ minimizing $\big\lVert\mathcal{T}_M(x)-y\big\rVert^2$ plus the energy prior, with every image on its own. CryoGEN-II adds a global constraint: it aggregates the reconstructions across all observations and requires

$$\int q(x\mid y)\,p(y)\,dy \;\approx\; p(x).$$

Term by term: $q(x\mid y)$ is the distribution of reconstructions the network outputs given observation $y$ (a point mass at its single output, since the network is deterministic), $p(y)$ is the marginal distribution of observations, and integrating over $y$ collects the reconstructions across all observations into one aggregate distribution; $p(x)$ is the prior over clean structures. The constraint says: the full set of reconstructions the network generates must, in a statistical sense, agree with the overall distribution of real structures, not merely each image matching its own posterior.
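The aggregation is easiest to see in a discrete toy case, where the integral becomes a weighted sum. Everything in the sketch below is hypothetical: three observation values, a deterministic lookup table standing in for the network, and a made-up target `P_X`. In the real setting the prior over clean structures is not available, so this only illustrates what the constraint compares.

```python
# Toy marginal over three observation values (hypothetical numbers).
P_Y = {"y1": 0.5, "y2": 0.3, "y3": 0.2}

# Deterministic "network": each observation maps to exactly one reconstruction,
# so q(x | y) is a point mass at RECONSTRUCT[y].
RECONSTRUCT = {"y1": "xA", "y2": "xB", "y3": "xA"}

# Made-up target distribution over structures, for illustration only.
P_X = {"xA": 0.6, "xB": 0.4}

def aggregate(p_y, reconstruct):
    """q(x) = sum over y of q(x | y) p(y) for a deterministic reconstruction map."""
    q_x = {}
    for y, weight in p_y.items():
        x = reconstruct[y]
        q_x[x] = q_x.get(x, 0.0) + weight
    return q_x

def total_variation(p, q):
    """Half the L1 gap between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

q_x = aggregate(P_Y, RECONSTRUCT)
print(q_x, total_variation(q_x, P_X))
```

Total variation is used here only because it is trivial to compute on a finite set; the text's actual measure of mismatch is the transport cost.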

Intuition

Per-image fidelity guarantees that “every image matches its own observation.” Global distribution matching guarantees that “the whole batch of reconstructions, taken together, looks the way a real dataset should.” That extra constraint is exactly what suppresses artifacts that look plausible image by image yet, in aggregate, contradict the distribution of real structures, because such artifacts pull the aggregate distribution $q_x$ away from the prior $p(x)$ and thereby raise the transport cost.

Here $p(x)$ itself is unavailable, since there is no ground-truth reconstruction at all. Supervision still comes entirely from the real observations $p_y$: transporting the distribution of degraded generations $\mathcal{T}_M(x)$ to match $p_y$ implicitly drives the aggregate reconstruction toward $p(x)$. This is the $P_Y$ proxy idea expressed at the level of distributions. The whole procedure is pure optimization, with no Langevin or Monte Carlo sampling.

What it achieves, and its limit

Lifting the goal from per-image optimality to global distribution matching buys CryoGEN-II two things. First, the single-potential optimal-transport objective replaces the adversarial min–max, so training no longer mode-collapses and is markedly more stable. Second, the reconstructions agree with the distribution of real structures in aggregate statistics, so they carry fewer artifacts and are more trustworthy. What it delivers is a stable, realistic point estimate: the most reliable single reconstruction in the whole CryoGEN lineage.

Its limit, though, stems from being a point estimate at all: per observation, CryoGEN-II still returns one deterministic reconstruction. Yet the Fourier information the missing wedge erases corresponds, in principle, to more than one plausible $x$, so the data itself carries irreducible uncertainty. A single deterministic reconstruction cannot say which regions are well-constrained and which were guessed. Turning that one answer into a family of reconstructions, and capturing the uncertainty with a Boltzmann posterior, calls for a different, entropic-variational framework, which is exactly where CryoWGEN begins.


Paper: CryoGEN-II is published at CVPR 2026 Workshops (VISION); its predecessor CryoGEN-I appeared at ICLR 2025.

Prerequisites: CryoGEN-I and optimal transport; to lift the single reconstruction into a posterior distribution, see CryoWGEN.
