Entropic Variational Inference Auto-encoding (EVIA)

Matching the aggregated posterior with entropic optimal transport to obtain a Boltzmann posterior and a soft barycentric encoder — a stochastic generalization of the WAE

EVIA (Entropic Variational Inference Auto-encoding) is a family of autoencoders that match the aggregated posterior to a prior using entropic optimal transport. It generalizes the WAE: where the WAE matches distributions with an optimal-transport cost, EVIA adds an entropy term to that cost, replacing a pointwise deterministic encoder with a Boltzmann posterior that returns a family of latents for each input.

[Figure: Observation x → Encoder → Boltzmann posterior q*(z|x) ∝ κ·e^{G/γ} → Decoder A → Reconstruction. The encoder output is either a sample z ~ q*(z|x) or the soft barycenter T*(x).]
γ→0: the posterior collapses to one point — the deterministic map of WAE / CryoGEN-II; γ>0: a family of codes capturing missing-wedge uncertainty.

The VAE, the WAE, and EVIA sit on one line. The VAE pulls every per-sample $q(z\mid x)$ toward the prior; the WAE constrains only the aggregated posterior $q_Z=\int q(z\mid x)\,p_X(x)\,dx$ and lets the encoder collapse to a deterministic map; EVIA also constrains only the aggregated posterior but uses a temperature $\gamma$ to hold the encoder open, tunable continuously between a deterministic map ($\gamma\to0$) and a diffuse distribution (large $\gamma$). That temperature knob is the source of everything below.

Intuition

When distributions are matched with optimal transport and no regularization, the optimum often degenerates to a deterministic map (a Monge map): the encoder emits a single $z$ for each $x$. But inference should be distributional: one $x$ can correspond to several plausible latent representations. EVIA adds an entropy term to the transport cost that acts as a barrier, forcing the solution to stay spread out, so the posterior becomes everywhere-positive and samplable again.

Made concrete on missing-wedge restoration: a tomogram reconstructed from a tilt series genuinely lacks information along the missing direction, because the high-angle projections were never collected. A deterministic encoder can only pick one most-likely fill-in and hide the inferential uncertainty; EVIA returns a family of fill-ins consistent with the observation, and the temperature $\gamma$ corresponds directly to “how uncertain we are about the directions we did not see.”

Entropic regularization and the Boltzmann posterior

For data $x$, latent $z$, decoder $\mathcal{A}:z\mapsto x$, reference coupling $\kappa$, and temperature $\gamma>0$, the EVIA primal objective is an entropy-penalized optimal transport problem:

$$\min_{\pi\in\Pi(p_x,q_z)}\;\mathbb{E}_{(x,z)\sim\pi}\big[\|x-\mathcal{A}(z)\|^2\big]\;+\;\gamma\,\mathrm{KL}(\pi\,\|\,\kappa).$$

Reading it term by term: $\pi\in\Pi(p_x,q_z)$ ranges over all joint couplings whose marginals are the data law $p_x$ and the latent law $q_z$; the first term $\mathbb{E}\,\|x-\mathcal{A}(z)\|^2$ is the transport cost, demanding that paired $(x,z)$ decode back to $x$; the second term $\gamma\,\mathrm{KL}(\pi\,\|\,\kappa)$ is an entropy penalty against the reference coupling $\kappa$, weighted by the temperature $\gamma$. Larger $\gamma$ favors couplings that are closer to $\kappa$ and more spread out; as $\gamma\to0$ the penalty vanishes, the objective reverts to pure transport, and the optimal coupling collapses to a deterministic map. The WAE objective is the $\gamma=0$ special case: EVIA simply adds a temperature to the same transport problem.
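The effect of the temperature on the coupling can be seen on a tiny discrete instance. The sketch below solves the entropic problem with Sinkhorn iteration. The source does not say how the coupling is computed; Sinkhorn is the standard solver for discrete entropic optimal transport and is used here only as an illustration. The function name `sinkhorn`, the two-point toy cost, and the choice of the reference coupling as the product of the marginals are all assumptions.

```python
import math

def sinkhorn(cost, a, b, gamma, n_iter=200):
    """Solve min <pi, cost> + gamma * KL(pi || kappa) over couplings with marginals a, b.

    Assumption: the reference coupling kappa is the product a[i] * b[j].
    """
    n, m = len(a), len(b)
    # Gibbs kernel: kappa * exp(-cost / gamma)
    kernel = [[a[i] * b[j] * math.exp(-cost[i][j] / gamma) for j in range(m)]
              for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem: two observations, two decoded latents, squared-error cost.
cost = [[0.0, 1.0], [1.0, 0.0]]
a = b = [0.5, 0.5]
pi_cold = sinkhorn(cost, a, b, gamma=0.05)  # nearly a one-to-one matching
pi_hot = sinkhorn(cost, a, b, gamma=50.0)   # nearly the product coupling
```

At the low temperature almost all mass sits on the diagonal (each observation paired with one latent), while at the high temperature every entry is close to 0.25, the reference coupling.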

Depth

With a prior potential $w(z)$ (corresponding to the aggregated-posterior marginal constraint, and in Cryo-ET realized by the adversarially learned energy critic $D_\psi=-E_\phi$) and a data-fit weight $\lambda$, define the utility $\mathcal{G}(z;x)=w(z)-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2$. By the Gibbs variational principle (Donsker–Varadhan), the optimal conditional for fixed $x$ takes Gibbs (Boltzmann) form:

$$q^\star(z\mid x)\;\propto\;\kappa(z\mid x)\,\exp\!\Big(\frac{\mathcal{G}(z;x)}{\gamma}\Big),$$

which is everywhere positive on the support of $\kappa$. This is the same Boltzmann coupling $\pi^\star\propto\kappa\,e^{-c/\gamma}$ as in entropic optimal transport, up to the adversarial term $w$ and the weight $\lambda/2$ on the cost. As $\gamma\to0$ it reduces to the WAE’s deterministic hard transport.

What each piece does physically: the utility $\mathcal{G}$ writes two forces together. The term $w(z)$ rewards latents that look like the true distribution (high energy-critic score), and $-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2$ penalizes latents that do not reconstruct $x$; $\lambda$ sets their relative say. The exponent $\mathcal{G}/\gamma$ is the familiar “negative energy over temperature” of statistical mechanics: small $\gamma$ peaks the distribution sharply on the highest-utility $z$, approaching an $\arg\max$; large $\gamma$ flattens it toward the reference $\kappa$. The $w$ term turns plain transport matching into energy-based-model density matching, and that is exactly what wires EVIA to energy-based models and adversarial training.
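A minimal numerical sketch of this Boltzmann form, on a one-dimensional grid, makes the role of the temperature visible. Everything specific here is an assumption for illustration: a scalar latent, a uniform reference on the grid, a zero critic `w`, an identity `decode`, and the helper names `boltzmann_posterior` and `posterior_std`.

```python
import math

def boltzmann_posterior(z_grid, x, gamma, lam=1.0,
                        w=lambda z: 0.0, decode=lambda z: z):
    """Discrete q*(z|x) proportional to exp(G(z;x)/gamma), uniform reference on the grid."""
    utility = [w(z) - 0.5 * lam * (x - decode(z)) ** 2 for z in z_grid]
    top = max(utility)  # shift by the max so the exponentials cannot overflow
    unnorm = [math.exp((g - top) / gamma) for g in utility]
    total = sum(unnorm)
    return [p / total for p in unnorm]

def posterior_std(z_grid, probs):
    """Standard deviation of a discrete distribution on z_grid."""
    mean = sum(p * z for p, z in zip(probs, z_grid))
    return math.sqrt(sum(p * (z - mean) ** 2 for p, z in zip(probs, z_grid)))

z_grid = [-3.0 + 0.01 * i for i in range(601)]
sharp = boltzmann_posterior(z_grid, x=0.5, gamma=0.04)
broad = boltzmann_posterior(z_grid, x=0.5, gamma=1.0)
print(posterior_std(z_grid, sharp), posterior_std(z_grid, broad))
```

Under these toy assumptions the posterior is a Gaussian in $z$ with variance $\gamma/\lambda$, so the low-temperature width comes out close to 0.2, and the high-temperature posterior is visibly wider (and truncated by the grid). A learned critic and a nonlinear decoder would change the shape but not the way $\gamma$ scales the exponent.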

[Interactive figure: how the temperature γ sets the posterior width and the reconstruction uncertainty. Labels: energy E(x), whose minimum is the MAP estimate; posterior q(x|y) ∝ e^(−E/γ); sample reconstructions. A wide posterior is a family of reconstructions, i.e. missing-wedge uncertainty (CryoWGEN).]

Temperature γ sets the posterior's width directly. Write data-consistency as an energy E(x) (the amber well); the posterior is the Boltzmann distribution in that well, q(x|y) ∝ e^(−E(x)/γ) (purple). As γ→0 it collapses to a spike at the bottom — one deterministic reconstruction, exactly WAE / CryoGEN-II; as γ grows it spreads into a family of reconstructions, and that width is the missing-wedge uncertainty CryoWGEN reports. The purple ticks along the bottom are sample reconstructions drawn from the posterior; they fan out as γ rises.

Turn the knob to either extreme: at $\gamma\to0$ the Boltzmann distribution collapses to the single highest-utility point, EVIA becomes the WAE’s deterministic encoder, and you get one best-guess reconstruction; at large $\gamma$ the posterior spreads back to the reference $\kappa$ and the encoder barely looks at the data. The useful operating regime in Cryo-ET is in between: pick $\gamma$ so the posterior just covers the family of solutions the missing wedge permits, neither pretending certainty nor degenerating into noise.

The soft barycentric encoder

The optimal encoder is not a single sample but the conditional expectation of the latent under the posterior — a soft barycentric projection:

$$T^\star(x)=\mathbb{E}_{q^\star(\cdot\mid x)}[z]\;\approx\;\sum_{m}\omega_m(x)\,z^{(m)},\qquad \omega_m(x)=\frac{\exp(\mathcal{G}(z^{(m)};x)/\gamma)}{\sum_j\exp(\mathcal{G}(z^{(j)};x)/\gamma)},$$

with $\{z^{(m)}\}$ drawn from the reference. The weights are a softmax, so the map is smooth and differentiable and converges to the classical hard optimal-transport map as $\gamma\to0$.

How it is actually computed: draw a batch of candidate latents $z^{(1)},\dots,z^{(M)}$ from the reference, score each with its utility $\mathcal{G}(z^{(m)};x)$, pass the scores through a temperature-$\gamma$ softmax to get weights $\omega_m$, and take the weighted average of the candidates. That is the “soft” part: not hard-selecting the single highest-utility $z$, but taking the barycenter under their Gibbs weights. The two extremes are again consistent: as $\gamma\to0$ the softmax tends to one-hot, the barycenter collapses to a single $\arg\max$, recovering the hard optimal-transport map; at large $\gamma$ the weights become uniform and $T^\star(x)$ falls back to a plain average of the candidates, nearly independent of $x$. Because the whole chain (utility, softmax, weighted sum) is differentiable, the encoder trains end-to-end by backpropagation.
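The recipe above (draw candidates, score, softmax, weighted average) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: latents are scalars, the candidate list is fixed by hand instead of being sampled from the reference, and `toy_utility` takes $w(z)=0$ with an identity decoder. The names `soft_barycenter` and `toy_utility` are hypothetical.

```python
import math

def soft_barycenter(x, candidates, utility, gamma):
    """Return (T(x), weights): the Gibbs-weighted average of the candidate latents."""
    scores = [utility(z, x) / gamma for z in candidates]
    top = max(scores)  # shift by the max for a numerically stable softmax
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]
    return sum(wt * z for wt, z in zip(weights, candidates)), weights

def toy_utility(z, x, lam=1.0):
    # Assumption: w(z) = 0 and an identity decoder, so G = -(lam / 2) * (x - z)^2
    return -0.5 * lam * (x - z) ** 2

candidates = [-1.0, 0.0, 0.4, 1.0, 2.0]  # stand-ins for draws from the reference
t_cold, w_cold = soft_barycenter(0.5, candidates, toy_utility, gamma=1e-3)
t_hot, w_hot = soft_barycenter(0.5, candidates, toy_utility, gamma=1e6)
```

With `x = 0.5`, the cold barycenter lands on the best-scoring candidate `0.4` (one-hot weights), and the hot barycenter is essentially the plain average `0.48` of the candidates. In a trained model the same three steps would run on vector latents inside an automatic-differentiation framework so that gradients flow through the weights.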

Two implementations: SGLD and amortized

Taking the conditional prior to be an isotropic Gaussian $\kappa(z\mid x)=\mathcal{N}(\bar z(x),\beta^{-1}I)$, the effective potential to minimize is a tractable Log-Sum-Exp:

$$\Psi(x)=-\log\!\int\exp\!\Big\{w(z)-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2-\tfrac{\beta}{2}\|z-\bar z\|^2\Big\}\,dz.$$

Reading this potential: the Gaussian prior folds $\kappa(z\mid x)\propto\exp(-\tfrac{\beta}{2}\|z-\bar z\|^2)$ straight into the exponent, so the integrand collects three terms: the energy critic $w(z)$, the reconstruction penalty $-\tfrac{\lambda}{2}\|x-\mathcal{A}(z)\|^2$, and a Gaussian pull (strength $\beta$) tying $z$ back to the prior mean $\bar z(x)$. The leading negative log makes $\Psi(x)$ the “free energy” of this family of solutions; its gradient with respect to the parameters is the training signal. The only hard part is the integral over $z$, and the two implementations are two ways to handle it.
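For intuition about the integral itself, a brute-force Monte Carlo estimate is possible in low dimension. This is neither of the two implementations discussed here, only a sketch under toy assumptions: a scalar latent, a zero critic, and an identity decoder. It samples $z\sim\mathcal{N}(\bar z,\beta^{-1})$ so that the Gaussian factor becomes the sampling law, then applies the log-sum-exp trick. The name `free_energy_mc` and all parameter values are hypothetical.

```python
import math
import random

def free_energy_mc(x, z_bar, beta, lam, w, decode, n_samples=20000, seed=0):
    """Monte Carlo estimate of Psi(x) for a scalar latent.

    With z ~ N(z_bar, 1/beta):
    integral = sqrt(2*pi/beta) * E[exp(w(z) - lam/2 * (x - decode(z))^2)].
    """
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(beta)
    scores = []
    for _ in range(n_samples):
        z = rng.gauss(z_bar, sigma)
        scores.append(w(z) - 0.5 * lam * (x - decode(z)) ** 2)
    top = max(scores)  # log-sum-exp trick: factor out the largest term
    log_mean = top + math.log(sum(math.exp(s - top) for s in scores) / n_samples)
    return -(0.5 * math.log(2.0 * math.pi / beta) + log_mean)

psi = free_energy_mc(x=1.0, z_bar=0.0, beta=1.0, lam=1.0,
                     w=lambda z: 0.0, decode=lambda z: z)
```

In this linear-Gaussian toy case the integral has a closed form, $\Psi(x)=-\tfrac12\log\tfrac{2\pi}{\lambda+\beta}+\tfrac{\lambda\beta}{2(\lambda+\beta)}(x-\bar z)^2$, which the estimate should approach as the number of samples grows. Naive sampling of this kind degrades quickly in high dimension, which is the reason more careful schemes are needed.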

The trade-off is direct: SGLD runs a Langevin chain per sample — slow, but it keeps the full posterior uncertainty; amortized swaps the inner sampling for one forward pass — much faster, but it delivers a point estimate of the posterior (the barycenter). This SGLD-vs-amortized split is exactly the choice that, in Cryo-ET, separates CryoWGEN-II (Langevin, a family of posterior solutions) from the single-answer variant.
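The sampling side of this trade-off can be sketched with a Langevin chain on a toy potential. Assumptions, all for illustration: the gradient is exact, so this is plain unadjusted Langevin dynamics and not the stochastic-gradient variant the text names; the latent is a scalar; the critic is zero and the decoder is the identity; the temperature is folded into the potential as in $\Psi$. The name `langevin_chain` and the step size are hypothetical choices.

```python
import math
import random

def langevin_chain(grad_potential, z0, step, n_steps, seed=0):
    """Unadjusted Langevin dynamics targeting exp(-U(z)); returns all iterates."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step)
    z, samples = z0, []
    for _ in range(n_steps):
        # Gradient step on U plus Gaussian noise of variance 2 * step
        z = z - step * grad_potential(z) + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(z)
    return samples

# Toy potential: U(z) = lam/2 * (x - z)^2 + beta/2 * (z - z_bar)^2
x, z_bar, lam, beta = 1.0, 0.0, 1.0, 1.0
grad_u = lambda z: -lam * (x - z) + beta * (z - z_bar)

samples = langevin_chain(grad_u, z0=z_bar, step=0.1, n_steps=50000)[1000:]  # drop burn-in
post_mean = sum(samples) / len(samples)  # the barycenter of the sampled family
post_var = sum((z - post_mean) ** 2 for z in samples) / len(samples)
```

The chain keeps the whole family: `samples` spreads around the posterior mean $(\lambda x+\beta\bar z)/(\lambda+\beta)=0.5$ with a variance near $1/(\lambda+\beta)$ (slightly inflated by the finite step size). An amortized encoder would instead be trained to output something like `post_mean` in a single forward pass, giving up `post_var`.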

Place in the autoencoder family

| Algorithm | Posterior matching | Form of the posterior |
|---|---|---|
| VAE | per-sample KL | Gaussian $q_\phi(z\mid x)$ |
| WAE | optimal transport on the aggregated posterior | often a deterministic encoder |
| AAE | adversarial (density ratio) | implicit, set by the discriminator |
| EVIA | entropic optimal transport | Boltzmann posterior (everywhere positive) |

One thread through the table is “how tightly the posterior is constrained”: the VAE pins every sample near the prior, the strictest; WAE/AAE constrain only the aggregated posterior and free the single sample, estimating the same divergence by different means (closed-form MMD vs. an adversarial discriminator); EVIA also constrains only the aggregated posterior, but uses temperature to soften “hard” transport continuously into an everywhere-positive Boltzmann posterior, which folds the other three into one family: $\gamma\to0$ is the WAE, and adding the adversarial energy term $w$ connects to the AAE’s adversarial matching.

EVIA casts missing-wedge restoration as a distributional inverse problem: the observation (a wedge-deficient tomogram) does not pin down a unique solution, so the correct output is a family of volumes consistent with the data, not one image. In Cryo-ET the energy critic $w=-E_\phi$ is learned adversarially, and the SGLD/amortized implementations correspond to “sample a posterior family” versus “take the posterior barycenter.” For the concrete application see CryoWGEN; its Langevin version, which returns a family of posterior solutions, is CryoWGEN-II.
