CryoGEN-II: distribution matching via optimal transport

Trading per-image optimality for global distribution alignment — using optimal transport to stabilize training and match the distribution of real structures

CryoGEN-II picks up directly from CryoGEN-I. CryoGEN-I returns a point estimate $x^*$ per tomogram, but its adversarial min–max training is prone to instability and mode collapse. CryoGEN-II keeps the same foundation — no ground truth, supervision via the $P_Y$ proxy — yet changes the goal from per-image optimality to making the aggregate distribution of generated reconstructions agree with the distribution of real structures, and reaches it through optimal transport rather than an adversarial game. It still returns one deterministic reconstruction per observation — turning the answer into a family of reconstructions that expresses the uncertainty the missing wedge leaves is the job of CryoWGEN.

Intuition

CryoGEN-I gives each image one “best answer,” but what it learns is a density ratio (a discriminator separating real from fake). Early in training, when the two distributions barely overlap, that ratio either saturates or produces exploding or vanishing gradients — this is the root of the min–max instability and mode collapse.

CryoGEN-II asks a different question: rather than fighting image by image, require that all generated reconstructions, taken as one aggregate distribution, agree with the distribution of real structures. Even when two distributions start out disjoint, “how far the mass has to move to turn one pile into the other” stays continuous, and differentiable almost everywhere, as the distributions move. That is exactly what lets optimal transport stabilize training.
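The contrast can be made concrete with a small sketch. It is an illustration only, not CryoGEN code: the Jensen-Shannon divergence stands in for what an idealized discriminator measures (an assumption of this example), and the names `js_divergence`, `histogram`, `squared_transport_cost`, and `compare` are hypothetical. For disjoint supports the divergence sits at its ceiling of $\log 2$ no matter how far apart the piles are, while the transport cost keeps tracking the distance.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def histogram(points, lo=0.0, hi=10.0, bins=20):
    """Normalized histogram of `points` on [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in points:
        counts[min(bins - 1, int((v - lo) / width))] += 1
    return [c / len(points) for c in counts]

def squared_transport_cost(src, tgt):
    """1D squared-cost transport between equal-size samples (sorted pairing)."""
    return sum((s - t) ** 2 for s, t in zip(sorted(src), sorted(tgt))) / len(src)

def compare(offset, source=(0.0, 0.1, 0.2, 0.3)):
    """Return (JS divergence, transport cost) between source and shifted source."""
    target = [s + offset for s in source]
    js = js_divergence(histogram(source), histogram(target))
    return js, squared_transport_cost(source, target)

for offset in (2.0, 4.0, 8.0):
    js, ot = compare(offset)
    print(f"offset={offset}: JS={js:.4f}  transport cost={ot:.4f}")
```

The divergence gives the generator no signal about which direction reduces the gap, whereas the transport cost grows with the square of the offset and so always points back toward the target.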

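[Interactive figure: source distribution μ, target distribution ν, and the optimal matching between them, with readouts for the total transport cost (13.79 as captured) and the target offset (1.0 as captured).]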

Under 1D squared cost the optimal coupling pairs the sorted source points with the sorted target points (the monotone matching). Any crossing pair strictly raises the total cost, so the connecting lines never cross; shifting the offset translates the mass and the cost varies smoothly.
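The monotone-matching claim is easy to check numerically on a handful of points. The sketch below uses made-up sample values and hypothetical helper names; it compares the sorted pairing against an exhaustive search over all pairings, which is only feasible for tiny inputs.

```python
from itertools import permutations

def matching_cost(source, target, order):
    """Total squared cost when source[i] is paired with target[order[i]]."""
    return sum((s - target[j]) ** 2 for s, j in zip(source, order))

def monotone_cost(source, target):
    """Cost of the sorted-to-sorted (monotone) pairing."""
    return sum((s - t) ** 2 for s, t in zip(sorted(source), sorted(target)))

def brute_force_cost(source, target):
    """Cheapest pairing found by trying every permutation (tiny inputs only)."""
    return min(matching_cost(source, target, p)
               for p in permutations(range(len(target))))

# Illustrative values, not data from the figure.
source = [0.3, -1.2, 2.0, 0.9, -0.4]
base_target = [1.5, -0.7, 0.1, 2.6, -1.9]

def shifted(offset):
    """Translate every target point by `offset`."""
    return [t + offset for t in base_target]

for offset in (0.0, 0.5, 1.0):
    tgt = shifted(offset)
    print(offset, monotone_cost(source, tgt), brute_force_cost(source, tgt))
```

Because a translation does not change the sort order, the same pairing stays optimal for every offset, and the total cost is a quadratic polynomial in the offset. That is the smooth variation the caption describes.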

From an adversarial game to optimal transport

Instead of training a discriminator to play against a generator, CryoGEN-II directly minimizes the cost of transporting the generated distribution $q_x$ onto the data distribution $p_y$:

$$\mathcal{W}_c(p_y,q_x)=\inf_{\pi\in\Pi(p_y,q_x)}\mathbb{E}_{(y,x)\sim\pi}\big[c\big(y,\mathcal{T}_M(x)\big)\big].$$

Term by term: $p_y$ is the distribution of real observations (the many tomograms actually recorded); $q_x$ is the aggregate, in reconstruction space $\mathcal{X}$, of the reconstructions the network outputs across observations; $\pi$ is a coupling (a transport plan) specifying how much mass moves from $y$ to $x$, and $\Pi(p_y,q_x)$ is the set of all couplings with marginals $p_y$ and $q_x$; the $\inf$ picks the cheapest such plan. The cost $c$ is the squared $\ell_2$ distance. Note that $x$ and $y$ live in different spaces and cannot be compared directly: the cost first pushes the clean reconstruction $x$ back into observation space with the same missing-wedge operator $\mathcal{T}_M$ that CryoGEN uses, then compares it to $y$ via the squared distance $\big\lVert y-\mathcal{T}_M(x)\big\rVert^2$.
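To make the cost concrete, here is a minimal one-dimensional sketch. The function `toy_missing_wedge` is a hypothetical stand-in for $\mathcal{T}_M$ that simply zeroes a conjugate pair of DFT bins; the real operator acts on 3D tomograms, and nothing here is taken from the CryoGEN implementation. The signal values are made up.

```python
import cmath
import math

def dft(signal):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part (input assumed conjugate-symmetric)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def toy_missing_wedge(x, missing):
    """Hypothetical stand-in for T_M: zero the DFT bins listed in `missing`."""
    kept = [0.0 if k in missing else c for k, c in enumerate(dft(x))]
    return idft(kept)

def cost(y, x, missing):
    """c(y, T_M(x)) = || y - T_M(x) ||^2, compared in observation space."""
    degraded = toy_missing_wedge(x, missing)
    return sum((yi - di) ** 2 for yi, di in zip(y, degraded))

n = 8
missing = {3, 5}  # a conjugate pair, so the masked signal stays real
x_true = [0.0, 1.0, 0.5, -0.3, 0.8, -1.1, 0.2, 0.4]
y = toy_missing_wedge(x_true, missing)

# Perturb x only inside the erased band, then only inside the kept band.
x_hidden = [xi + math.cos(2 * math.pi * 3 * t / n) for t, xi in enumerate(x_true)]
x_visible = [xi + math.cos(2 * math.pi * 1 * t / n) for t, xi in enumerate(x_true)]

cost_true = cost(y, x_true, missing)
cost_hidden = cost(y, x_hidden, missing)
cost_visible = cost(y, x_visible, missing)
print(cost_true, cost_hidden, cost_visible)
```

In this toy, a change confined to the erased frequencies leaves the cost unchanged, while a change in the retained frequencies is penalized. The per-pair cost therefore constrains only what the operator lets through.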

Depth

Taking the $\inf$ over couplings $\pi$ directly is a linear program whose size blows up with the number of samples — it cannot sit inside a training loop. The semi-dual form of optimal transport rewrites it as an optimization over a single potential $f$ :

$$\mathcal{W}_c(p_y,q_x)=\sup_{f}\;\mathbb{E}_{y\sim p_y}\big[f(y)\big]\;-\;\mathbb{E}_{x\sim q_x}\big[f^c\big(\mathcal{T}_M(x)\big)\big],$$

where $f^c$ is the c-transform of $f$ with respect to the cost $c$. The key point: there is only one function $f$ being optimized, with no CryoGEN-I-style min–max game between a generator and a discriminator, which makes training markedly more stable. This single-potential form is also naturally compatible with an encoder–decoder architecture, and it closely parallels the Wasserstein Auto-Encoder (WAE) objective: encode data into a latent space, decode to reconstruct, and penalize the mismatch between the aggregate posterior in latent space and the prior. This “optimal transport → autoencoder” skeleton is what CryoGEN-II borrows.
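A small discrete sketch shows how the single-potential form behaves. Two things here are assumptions, not statements from the source. First, the sign convention $f^c(z)=\max_y\,[f(y)-c(y,z)]$ is chosen so that the minus sign in the formula above gives a lower bound on the primal cost. Second, the plain supergradient ascent on $f$ is an illustrative optimizer, not the training procedure of CryoGEN-II. The list `zs` stands in for degraded generations $\mathcal{T}_M(x)$.

```python
def sq_cost(y, z):
    return (y - z) ** 2

def c_transform(f, ys, z):
    """f^c(z) = max_i [ f(y_i) - c(y_i, z) ] (sign convention assumed here)."""
    return max(fi - sq_cost(yi, z) for fi, yi in zip(f, ys))

def semi_dual(f, ys, zs):
    """E_y[f(y)] - E_z[f^c(z)] for uniform empirical measures."""
    return sum(f) / len(ys) - sum(c_transform(f, ys, z) for z in zs) / len(zs)

def primal_cost(ys, zs):
    """Exact 1D squared-cost transport via the sorted pairing."""
    return sum((y - z) ** 2 for y, z in zip(sorted(ys), sorted(zs))) / len(ys)

def maximize_potential(ys, zs, steps=200, lr=0.1):
    """Supergradient ascent on the concave semi-dual; returns (f, best value)."""
    f = [0.0] * len(ys)
    best = semi_dual(f, ys, zs)
    for _ in range(steps):
        grad = [1.0 / len(ys)] * len(ys)
        for z in zs:
            # Index of the y that attains the max inside the c-transform.
            i_star = max(range(len(ys)), key=lambda i: f[i] - sq_cost(ys[i], z))
            grad[i_star] -= 1.0 / len(zs)
        f = [fi + lr * gi for fi, gi in zip(f, grad)]
        best = max(best, semi_dual(f, ys, zs))
    return f, best

ys = [0.0, 1.0, 2.0]   # toy observations
zs = [0.1, 0.2, 3.0]   # toy degraded generations, standing in for T_M(x)

primal = primal_cost(ys, zs)
start = semi_dual([0.0] * len(ys), ys, zs)
f_opt, best = maximize_potential(ys, zs)
print(primal, start, best)
```

Any potential gives a value no larger than the primal cost, so the search over $f$ climbs toward the transport cost from below. In this tiny example the ascent closes the gap entirely, with a single function being optimized throughout.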

Global distribution matching

The real dividing line between CryoGEN-I and CryoGEN-II is the level at which “alignment” happens. CryoGEN-I is per-image: for each $y$ it independently finds the $x^*$ minimizing $\big\lVert\mathcal{T}_M(x)-y\big\rVert^2$ plus the energy prior, with every image on its own. CryoGEN-II adds a global constraint — it aggregates the reconstructions across all observations and requires

$$\int q(x\mid y)\,p(y)\,dy \;\approx\; p(x).$$

Term by term: $q(x\mid y)$ is the distribution of reconstructions the network outputs given observation $y$ (a point mass at a single reconstruction, since the network is deterministic), $p(y)$ is the marginal distribution of observations, and integrating over $y$ collects the reconstructions across all observations into one aggregate distribution; $p(x)$ is the prior over clean structures. The constraint says: the full set of reconstructions the network generates must, in a statistical sense, agree with the overall distribution of real structures, not merely each image matching its own posterior.

Intuition

Per-image fidelity guarantees that “every image matches its own observation.” Global distribution matching guarantees that “the whole batch of reconstructions, taken together, looks the way a real dataset should.” That extra constraint is exactly what suppresses artifacts that look plausible image by image yet, in aggregate, contradict the distribution of real structures — because such artifacts pull the aggregate distribution $q_x$ away from the prior $p(x)$ and thereby raise the transport cost.

Here $p(x)$ itself is unavailable: there is no ground-truth reconstruction at all. Supervision still comes entirely from the real observations $p_y$: transporting the distribution of degraded generations $\mathcal{T}_M(x)$ to match $p_y$ implicitly drives the aggregate reconstruction toward $p(x)$. This is the $P_Y$ proxy idea expressed at the level of distributions. The whole procedure is pure optimization, with no Langevin or Monte Carlo sampling.
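The difference between per-image fidelity and aggregate agreement can be seen in a deliberately simple toy. Everything in it is an assumption made for illustration: structures are pairs `(seen, hidden)`, the toy operator keeps only `seen`, and the two reconstructors are hypothetical. Unlike the real setting, the toy has reference structures available, which is what makes the aggregate gap directly measurable here. Only the marginal of the hidden coordinate is compared, not the full joint distribution.

```python
import random

def transport_cost_1d(a, b):
    """Squared-cost transport between equal-size 1D samples (sorted pairing)."""
    return sum((u - v) ** 2 for u, v in zip(sorted(a), sorted(b))) / len(a)

random.seed(0)
# Toy clean structures x = (seen, hidden); the toy T_M keeps only `seen`.
reference = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
observations = [seen for seen, _ in reference]

def fill_zero(y):
    """Hypothetical reconstructor A: hidden part always set to 0."""
    return (y, 0.0)

def fill_spread(y):
    """Hypothetical reconstructor B: hidden part given a prior-like spread."""
    return (y, -y)

def report(reconstruct):
    """Return (worst per-image residual, aggregate gap on the hidden part)."""
    recon = [reconstruct(y) for y in observations]
    residual = max(abs(x[0] - y) for x, y in zip(recon, observations))
    hidden_gap = transport_cost_1d([x[1] for x in recon],
                                   [h for _, h in reference])
    return residual, hidden_gap

res_a, gap_a = report(fill_zero)
res_b, gap_b = report(fill_spread)
print(res_a, gap_a)
print(res_b, gap_b)
```

Both reconstructors match every observation exactly, so a per-image criterion cannot tell them apart. Only the aggregate comparison exposes that the first one collapses the hidden part to a single value.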

What it achieves, and its limit

Lifting the goal from per-image optimality to global distribution matching buys CryoGEN-II two things. First, the single-potential optimal-transport objective replaces the adversarial min–max, which removes the main source of mode collapse and makes training markedly more stable. Second, the reconstructions agree with the distribution of real structures in aggregate statistics, so they carry fewer artifacts and are more trustworthy. What it delivers is a stable, realistic point estimate: the most reliable single reconstruction in the whole CryoGEN lineage.

Its limit, though, stems from being a point estimate at all: per observation, CryoGEN-II still returns one deterministic reconstruction. Yet because the missing wedge erases Fourier information, more than one plausible $x$ is, in principle, consistent with the same observation; the data itself carries irreducible uncertainty. A single deterministic reconstruction cannot say which regions are well-constrained and which were guessed. Turning that one answer into a family of reconstructions, and capturing the uncertainty with a Boltzmann posterior, calls for a different, entropic-variational framework, which is exactly where CryoWGEN begins.

Paper: CryoGEN-II is published at CVPR 2026 Workshops (VISION); its predecessor CryoGEN-I appeared at ICLR 2025.

Prerequisites: CryoGEN-I and optimal transport; to lift the single reconstruction into a posterior distribution, see CryoWGEN.
