Entropic Variational Inference Auto-encoding (EVIA)

Matching the aggregated posterior with entropic optimal transport to obtain a Boltzmann posterior and a soft barycentric encoder — a stochastic generalization of the WAE

EVIA (Entropic Variational Inference Auto-encoding) is a family of autoencoders that match the aggregated posterior to a prior using entropic optimal transport. It generalizes the WAE: where the WAE matches distributions with an optimal-transport cost, EVIA adds an entropy term to that cost, replacing a pointwise deterministic encoder with a Boltzmann posterior that returns a family of latents for each input.

γ→0: the posterior collapses to one point — the deterministic map of WAE / CryoGEN-II; γ>0: a family of codes capturing missing-wedge uncertainty.

The VAE, the WAE, and EVIA sit on one line. The VAE pulls every per-sample $q(z\mid x)$ toward the prior; the WAE constrains only the aggregated posterior $q_z=\int q(z\mid x)\,p_x(x)\,dx$ and lets the encoder collapse to a deterministic map; EVIA also constrains only the aggregated posterior but uses a temperature $\gamma$ to hold the encoder open, tunable continuously between a deterministic map ($\gamma\to0$) and a diffuse distribution (large $\gamma$). That temperature knob is the source of everything below.

Intuition

When distributions are matched with optimal transport and no regularization, the optimum often degenerates to a deterministic map (a Monge map) — the encoder emits a single $z$ for each $x$ . But inference should be distributional: one $x$ can correspond to several plausible latent representations. EVIA adds an entropy term to the transport cost that acts as a barrier, forcing the solution to stay spread out, so the posterior becomes everywhere-positive and samplable again.

Missing-wedge restoration makes this concrete: a tomogram reconstructed from a tilt series genuinely lacks information along the missing direction, because the high-angle projections were never collected. A deterministic encoder can only pick one most-likely fill-in and hide the inferential uncertainty; EVIA returns a family of fill-ins consistent with the observation, and the temperature $\gamma$ corresponds directly to “how uncertain we are about the directions we did not see.”

Entropic regularization and the Boltzmann posterior

For data $x$ , latent $z$ , decoder $\mathcal{A}:z\mapsto x$ , reference coupling $\kappa$ , and temperature $\gamma>0$ , the EVIA primal objective is an entropy-penalized optimal transport:

$$\min_{\pi\in\Pi(p_x,q_z)}\;\mathbb{E}_{(x,z)\sim\pi}\big[\|x-\mathcal{A}(z)\|^2\big]\;+\;\gamma\,\mathrm{KL}(\pi\,\|\,\kappa).$$

Reading it term by term: $\pi\in\Pi(p_x,q_z)$ ranges over all joint couplings whose marginals are the data law $p_x$ and the latent law $q_z$; the first term $\mathbb{E}\,\|x-\mathcal{A}(z)\|^2$ is the transport cost, demanding that paired $(x,z)$ decode back to $x$; the second term $\gamma\,\mathrm{KL}(\pi\,\|\,\kappa)$ is an entropy penalty against the reference coupling $\kappa$, weighted by the temperature $\gamma$. Larger $\gamma$ favors couplings that are closer to $\kappa$ and more spread out; as $\gamma\to0$ the penalty vanishes, the objective reverts to pure transport, and the optimal coupling collapses to a deterministic map. The WAE objective is the $\gamma=0$ special case: EVIA simply adds a temperature to the same transport problem.
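A discrete version of this objective can be solved with Sinkhorn iterations, which makes the role of $\gamma$ visible. The sketch below is only an illustration under stated assumptions: the reference coupling is taken to be the product of the marginals, $\kappa=p_x\otimes q_z$, the decoder is a hypothetical identity map on a 1D toy, and the names `sinkhorn` and `mean_conditional_entropy` are introduced here, not taken from any EVIA code.

```python
import numpy as np

def sinkhorn(cost, p, q, gamma, n_iter=500):
    """Discrete entropic OT: min <pi, cost> + gamma * KL(pi || p q^T)
    over couplings pi with row marginal p and column marginal q."""
    K = np.exp(-cost / gamma)            # Gibbs kernel exp(-c / gamma)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = 1.0 / (K @ (q * v))          # rescale so rows sum to p
        v = 1.0 / (K.T @ (p * u))        # rescale so columns sum to q
    return (p * u)[:, None] * K * (q * v)[None, :]

def mean_conditional_entropy(pi):
    """H(Z | X) of a coupling: how spread out the 'posterior' rows are."""
    cond = pi / pi.sum(axis=1, keepdims=True)
    return float(-(pi * np.log(np.clip(cond, 1e-300, None))).sum())

# Toy problem (assumption): 6 data points, 6 latents, identity decoder A(z) = z.
x = np.linspace(-1.0, 1.0, 6)
z = np.linspace(-1.0, 1.0, 6)
cost = (x[:, None] - z[None, :]) ** 2    # ||x - A(z)||^2
p = np.full(6, 1.0 / 6.0)
q = np.full(6, 1.0 / 6.0)

pi_hot = sinkhorn(cost, p, q, gamma=1.0)    # diffuse coupling
pi_cold = sinkhorn(cost, p, q, gamma=0.02)  # nearly a deterministic map

print("H(Z|X) at gamma=1.0 :", mean_conditional_entropy(pi_hot))
print("H(Z|X) at gamma=0.02:", mean_conditional_entropy(pi_cold))
```

In this toy the cold coupling sits almost entirely on the diagonal (each $x$ paired with one $z$), while the hot coupling spreads each row over several latents.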

Depth

With a prior potential $w(z)$ — corresponding to the aggregated-posterior marginal constraint, and in Cryo-ET realized by the adversarially-learned energy critic $D_\psi=-E_\phi$ — and a data-fit weight $\lambda$ , define the utility $\mathcal{G}(z;x)=w(z)-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2$ . By the Gibbs variational principle (Donsker–Varadhan), the optimal conditional for fixed $x$ takes Gibbs (Boltzmann) form:

$$q^\star(z\mid x)\;\propto\;\kappa(z\mid x)\,\exp\!\Big(\frac{\mathcal{G}(z;x)}{\gamma}\Big),$$

everywhere positive. This is the same Boltzmann coupling $\pi^\star\propto\kappa\,e^{-c/\gamma}$ as in entropic optimal transport, up to the adversarial term $w$ . As $\gamma\to0$ it reduces to the WAE’s deterministic hard transport.
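The step from the utility to the Gibbs form can be spelled out. The block below is the standard statement of the Gibbs variational principle written in this section's symbols; it assumes the per-$x$ problem is "maximize expected utility minus $\gamma$ times the KL to $\kappa(\cdot\mid x)$" and that the normalizer $Z(x)$ is finite.

```latex
% Normalizer of the Gibbs posterior for a fixed x
Z(x) = \int \kappa(z\mid x)\,\exp\!\Big(\frac{\mathcal{G}(z;x)}{\gamma}\Big)\,dz,
\qquad
q^\star(z\mid x) = \frac{\kappa(z\mid x)\,\exp\!\big(\mathcal{G}(z;x)/\gamma\big)}{Z(x)} .

% For any conditional q(. | x), expand KL(q || q*) using log q* = log kappa + G/gamma - log Z:
\mathbb{E}_{q}\big[\mathcal{G}(z;x)\big] - \gamma\,\mathrm{KL}\big(q\,\|\,\kappa(\cdot\mid x)\big)
  = \gamma\log Z(x) - \gamma\,\mathrm{KL}\big(q\,\|\,q^\star(\cdot\mid x)\big).

% KL >= 0 with equality iff q = q*, hence
\max_{q}\Big\{\mathbb{E}_{q}[\mathcal{G}] - \gamma\,\mathrm{KL}\big(q\,\|\,\kappa(\cdot\mid x)\big)\Big\}
  = \gamma\log Z(x),
\qquad\text{attained at } q = q^\star .
```

Because $q^\star$ is $\kappa$ times a strictly positive factor, it is positive wherever $\kappa$ is, which is the "everywhere positive" property used in the text.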

What each piece does physically: the utility $\mathcal{G}$ combines two forces. $w(z)$ rewards latents that look like the true distribution (high energy-critic score), and $-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2$ penalizes latents that do not reconstruct $x$; $\lambda$ sets their relative weight. The exponent $\mathcal{G}/\gamma$ is the familiar “negative energy over temperature” of statistical mechanics: small $\gamma$ peaks the distribution sharply on the highest-utility $z$, approaching an $\arg\max$; large $\gamma$ flattens it toward the reference $\kappa$. The $w$ term turns plain transport matching into energy-based-model density matching, which is what connects EVIA to energy-based models and adversarial training.

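Figure (interactive in the original): how the temperature $\gamma$ sets the posterior width and the reconstruction uncertainty. Curves: energy $E(x)$ and posterior $q(x\mid y)\propto e^{-E(x)/\gamma}$. Slider: temperature $\gamma$, shown at $\gamma=0.45$ (a wide posterior, i.e. a family of reconstructions expressing missing-wedge uncertainty, as in CryoWGEN), ranging from one reconstruction as $\gamma\to0$ to a family at large $\gamma$.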

Temperature γ sets the posterior's width directly. Write data-consistency as an energy E(x) (the amber well); the posterior is the Boltzmann distribution in that well, q(x|y) ∝ e^(−E(x)/γ) (purple). As γ→0 it collapses to a spike at the bottom — one deterministic reconstruction, exactly WAE / CryoGEN-II; as γ grows it spreads into a family of reconstructions, and that width is the missing-wedge uncertainty CryoWGEN reports. The purple ticks along the bottom are sample reconstructions drawn from the posterior; they fan out as γ rises.

Turn the knob to either extreme: at $\gamma\to0$ the Boltzmann distribution collapses to the single highest-utility point, EVIA becomes the WAE’s deterministic encoder, and you get one best-guess reconstruction; at large $\gamma$ the posterior spreads back to the reference $\kappa$ and the encoder barely looks at the data. The useful operating regime in Cryo-ET is in between: pick $\gamma$ so the posterior just covers the family of solutions the missing wedge permits — neither pretending certainty nor degenerating into noise.
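The dependence of posterior width on temperature can be checked numerically on a 1D toy. This is a minimal sketch, assuming a hypothetical quadratic data-consistency well $E(x)=x^2/2$ (for which the Boltzmann density is a Gaussian with variance $\gamma$); the function names are illustrative only.

```python
import numpy as np

def boltzmann_density(energy, grid, gamma):
    """Normalized density q ∝ exp(-E / gamma) on a uniform grid."""
    logits = -energy(grid) / gamma
    logits = logits - logits.max()       # stabilize the exponential
    q = np.exp(logits)
    dx = grid[1] - grid[0]
    return q / (q.sum() * dx)

def density_std(q, grid):
    """Standard deviation of a gridded density (Riemann sums)."""
    dx = grid[1] - grid[0]
    mean = (grid * q).sum() * dx
    return float(np.sqrt((((grid - mean) ** 2) * q).sum() * dx))

def quadratic_well(x):
    # Hypothetical energy; a real data-consistency term would replace this.
    return 0.5 * x ** 2

grid = np.linspace(-6.0, 6.0, 6001)
widths = {
    gamma: density_std(boltzmann_density(quadratic_well, grid, gamma), grid)
    for gamma in (0.05, 0.45, 1.0)
}
for gamma, width in widths.items():
    print(f"gamma={gamma:<5} posterior std={width:.4f}")
```

For this particular well the width is $\sqrt{\gamma}$, so lowering the temperature shrinks the posterior toward the bottom of the well and raising it fans the samples out.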

The soft barycentric encoder

The optimal encoder is not a single sample but the conditional expectation of the latent under the posterior — a soft barycentric projection:

$$T^\star(x)=\mathbb{E}_{q^\star(\cdot\mid x)}[z]\;\approx\;\sum_{m}\omega_m(x)\,z^{(m)},\qquad \omega_m(x)=\frac{\exp(\mathcal{G}(z^{(m)};x)/\gamma)}{\sum_j\exp(\mathcal{G}(z^{(j)};x)/\gamma)},$$

with $\{z^{(m)}\}$ drawn from the reference. The weights are a softmax, so the map is smooth and differentiable and converges to the classical hard optimal-transport map as $\gamma\to0$ .

How it is actually computed: draw a batch of candidate latents $z^{(1)},\dots,z^{(M)}$ from the reference, score each with its utility $\mathcal{G}(z^{(m)};x)$ , pass them through a temperature- $\gamma$ softmax to get weights $\omega_m$ , and take the weighted average of the candidates. That is the “soft” part — not hard-selecting the single highest-utility $z$ , but taking the barycenter under their Gibbs weights. The two extremes stay consistent again: as $\gamma\to0$ the softmax tends to one-hot, the barycenter collapses to a single $\arg\max$ , recovering the hard optimal-transport map; at large $\gamma$ the weights become uniform and $T^\star(x)$ falls back to a plain average of the candidates, nearly independent of $x$ . Because the whole chain (utility, softmax, weighted sum) is differentiable, the encoder trains end-to-end by backpropagation.
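The recipe above (sample candidates, score, softmax, weighted average) can be sketched in a few lines. The toy below makes several assumptions that are not from the source: the decoder is the identity, the critic term is flat ($w(z)=0$), the reference is a standard Gaussian, and `utility` and `soft_barycenter` are hypothetical names.

```python
import numpy as np

def utility(candidates, x, lam=1.0):
    """Toy G(z; x) = w(z) - lam/2 ||x - A(z)||^2 with w = 0 and A = identity
    (both are assumptions made only for this illustration)."""
    return -0.5 * lam * np.sum((x - candidates) ** 2, axis=1)

def soft_barycenter(x, candidates, gamma):
    """Softmax-weighted average of candidate latents at temperature gamma."""
    scores = utility(candidates, x) / gamma
    scores = scores - scores.max()       # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ candidates, weights

rng = np.random.default_rng(0)
candidates = rng.standard_normal((256, 2))   # z^(m) drawn from the reference
x = np.array([0.5, -0.3])

t_cold, w_cold = soft_barycenter(x, candidates, gamma=1e-8)  # near one-hot
t_mid, w_mid = soft_barycenter(x, candidates, gamma=0.1)     # soft average
t_hot, w_hot = soft_barycenter(x, candidates, gamma=1e8)     # near uniform

print("cold:", t_cold, "mid:", t_mid, "hot:", t_hot)
```

At very small $\gamma$ the result coincides with the single highest-utility candidate, and at very large $\gamma$ it falls back to the plain mean of the candidates, matching the two limits described in the text.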

Two implementations: SGLD and amortized

Taking the conditional prior to be an isotropic Gaussian $\kappa(z\mid x)=\mathcal{N}(\bar z(x),\beta^{-1}I)$, the effective potential to minimize is a tractable Log-Sum-Exp (written here at $\gamma=1$; a general temperature divides $w$ and $\lambda$ by $\gamma$):

$$\Psi(x)=-\log\!\int\exp\!\Big\{w(z)-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2-\tfrac{\beta}{2}\|z-\bar z\|^2\Big\}\,dz.$$

Reading this potential: the Gaussian prior folds $\kappa(z\mid x)\propto\exp(-\tfrac{\beta}{2}\|z-\bar z\|^2)$ straight into the exponent, so the integrand collects three terms — the energy critic $w(z)$ , the reconstruction penalty $-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2$ , and a Gaussian pull (strength $\beta$ ) tying $z$ back to the prior mean $\bar z(x)$ . The leading negative log makes $\Psi(x)$ the “free energy” of this family of solutions; its gradient with respect to the parameters is the training signal. The only hard part is the integral over $z$ , and the two implementations are two ways to handle it.
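To see why the integral is the crux, it helps to estimate $\Psi(x)$ directly on a case with a known answer. The sketch below uses plain Monte Carlo over the Gaussian reference, which is neither of the two implementations discussed in this section, just the simplest baseline. The toy setting (1D latent, identity decoder, $w=0$) and all function names are assumptions made for illustration.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def free_energy_mc(x, z_bar, beta, log_weight, n_samples, rng):
    """Monte Carlo estimate of Psi(x): sample z ~ N(z_bar, I / beta), so the
    Gaussian factor exp(-beta/2 ||z - z_bar||^2) is handled by the sampling,
    then average exp(w(z) - lam/2 ||x - A(z)||^2) in log space."""
    d = z_bar.shape[0]
    z = z_bar + rng.standard_normal((n_samples, d)) / np.sqrt(beta)
    log_gauss_norm = 0.5 * d * np.log(2.0 * np.pi / beta)
    return float(-(log_gauss_norm + log_mean_exp(log_weight(z, x))))

lam, beta = 1.0, 1.0
x = np.array([1.0])
z_bar = np.array([0.0])

def log_weight(z, x):
    # Hypothetical toy: w(z) = 0 and identity decoder A(z) = z.
    return -0.5 * lam * np.sum((x - z) ** 2, axis=1)

rng = np.random.default_rng(0)
psi_mc = free_energy_mc(x, z_bar, beta, log_weight, 200_000, rng)

# Closed form for this toy, from completing the square in z.
psi_exact = float(
    -0.5 * np.log(2.0 * np.pi / (lam + beta))
    + lam * beta / (2.0 * (lam + beta)) * np.sum((x - z_bar) ** 2)
)
print(psi_mc, psi_exact)
```

Plain sampling from the reference works here because the toy is one-dimensional; in high dimensions most reference samples carry negligible weight, which is the motivation for Langevin sampling or an amortized encoder.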

EVIA-SGLD (MCMC): draw negative samples iteratively with Langevin dynamics, $z\leftarrow z+\eta\nabla_z[\cdots]+\sqrt{2\eta}\,\xi$ , needing only the energy gradient; more general. Here $\eta$ is the step size and $\xi\sim\mathcal{N}(0,I)$ is Gaussian noise injected each step — that noise is what makes the iteration sample the whole family of the posterior rather than slide into a single minimum, so SGLD yields actual posterior samples.
EVIA-amortized: train an encoder $q_\phi$ to predict $z$ in one shot and a decoder $p_\theta$ in place of $\mathcal{A}$ , with the end-to-end consistency objective $\mathcal{L}_{\phi,\theta}=\mathbb{E}_x\|p_\theta(q_\phi(x))-x\|^2+\mathcal{L}_\phi+\mathcal{L}_\theta$ ; faster. It amortizes the inner MCMC, otherwise run per image, into a single forward pass, at the cost of approximating only the conditional expectation (the soft barycenter) rather than the full family.

The trade-off is direct: SGLD runs a Langevin chain per sample — slow, but it keeps the full posterior uncertainty; amortized swaps the inner sampling for one forward pass — much faster, but it delivers a point estimate of the posterior (the barycenter). This SGLD-vs-amortized split is exactly the choice that, in Cryo-ET, separates CryoWGEN-II (Langevin, a family of posterior solutions) from the single-answer variant.
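The Langevin update can be tried on a toy where the posterior is known in closed form. The sketch below runs many independent chains of the update $z\leftarrow z+\eta\nabla_z[\cdots]+\sqrt{2\eta}\,\xi$ with a full (not minibatched) gradient, so strictly it is unadjusted Langevin dynamics rather than stochastic-gradient Langevin. The linear decoder matrix `A`, the flat critic $w=0$, and every name in the code are assumptions for illustration, not the CryoWGEN-II implementation.

```python
import numpy as np

def langevin_sample(grad_log_density, z0, step, n_steps, rng):
    """Unadjusted Langevin dynamics, run in parallel over the rows of z0."""
    z = z0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        z = z + step * grad_log_density(z) + np.sqrt(2.0 * step) * noise
    return z

# Toy setup (assumptions): linear decoder A, w(z) = 0, Gaussian pull to z_bar.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([1.0, -1.0, 0.5])
z_bar = np.zeros(2)
lam, beta = 1.0, 1.0

def grad_log_density(z):
    """Gradient of -lam/2 ||x - A z||^2 - beta/2 ||z - z_bar||^2 for each row."""
    residual = x - z @ A.T               # shape (n_chains, 3)
    return lam * residual @ A - beta * (z - z_bar)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4000, 2))      # 4000 independent chains
samples = langevin_sample(grad_log_density, z0, step=0.01, n_steps=500, rng=rng)

# Exact Gaussian posterior mean for this toy, for comparison.
precision = lam * A.T @ A + beta * np.eye(2)
exact_mean = np.linalg.solve(precision, lam * A.T @ x + beta * z_bar)

print("sample mean:", samples.mean(axis=0), "exact mean:", exact_mean)
print("sample std :", samples.std(axis=0))
```

The sample mean plays the role of the barycenter that an amortized encoder would predict in one pass, while the nonzero spread of the samples is the posterior family that only the Langevin route retains.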

Place in the autoencoder family

Algorithm	Posterior matching	Form of the posterior
VAE	per-sample KL	Gaussian $q_\phi(z\mid x)$
WAE	optimal transport on the aggregated posterior	often a deterministic encoder
AAE	adversarial (density ratio)	implicit, set by the discriminator
EVIA	entropic optimal transport	Boltzmann posterior (everywhere positive)

One thread through the table is “how tightly the posterior is constrained.” The VAE is the strictest, pinning every sample near the prior. WAE and AAE constrain only the aggregated posterior and free the individual sample, measuring the aggregate mismatch by different means (closed-form MMD vs. an adversarial discriminator). EVIA also constrains only the aggregated posterior, but uses temperature to soften “hard” transport continuously into an everywhere-positive Boltzmann posterior. That places it in the same family as the others: $\gamma\to0$ is the WAE, and adding the adversarial energy term $w$ connects to the AAE’s adversarial matching.

EVIA casts missing-wedge restoration as a distributional inverse problem: the observation (a wedge-deficient tomogram) does not pin down a unique solution, so the correct output is a family of volumes consistent with the data, not one image. In Cryo-ET the energy critic $w=-E_\phi$ is learned adversarially, and the SGLD/amortized implementations correspond to “sample a posterior family” versus “take the posterior barycenter.” For the concrete application see CryoWGEN; its Langevin version, which returns a family of posterior solutions, is CryoWGEN-II.
