Entropic Variational Inference Auto-encoding (EVIA)
Matching the aggregated posterior with entropic optimal transport to obtain a Boltzmann posterior and a soft barycentric encoder — a stochastic generalization of the WAE
EVIA (Entropic Variational Inference Auto-encoding) is a family of autoencoders that match the aggregated posterior to a prior using entropic optimal transport. It generalizes the WAE: where the WAE matches distributions with an optimal-transport cost, EVIA adds an entropy term to that cost, replacing a pointwise deterministic encoder with a Boltzmann posterior that returns a family of latents for each input.
The VAE, the WAE, and EVIA sit on one line. The VAE pulls every per-sample posterior toward the prior; the WAE constrains only the aggregated posterior and lets the encoder collapse to a deterministic map; EVIA also constrains only the aggregated posterior but uses a temperature γ to hold the encoder open, tunable continuously between a deterministic map (γ → 0) and a diffuse distribution (large γ). That temperature knob is the source of everything below.
When distributions are matched with optimal transport and no regularization, the optimum often degenerates to a deterministic map (a Monge map): the encoder emits a single latent for each input. But inference should be distributional: one input can correspond to several plausible latent representations. EVIA adds an entropy term to the transport cost that acts as a barrier, forcing the solution to stay spread out, so the posterior becomes everywhere-positive and samplable again.
Missing-wedge restoration makes this concrete: a tomogram reconstructed from a tilt series genuinely lacks information along the missing direction, because the high-angle projections were never collected. A deterministic encoder can only pick one most-likely fill-in and hide the inferential uncertainty; EVIA returns a family of fill-ins consistent with the observation, and the temperature corresponds directly to “how uncertain we are about the directions we did not see.”
Entropic regularization and the Boltzmann posterior
Given the data, a latent variable, a decoder, a reference coupling, and a temperature γ, the EVIA primal objective is an entropy-penalized optimal transport problem.
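One way to write this objective is sketched below. The notation is an assumption chosen for illustration: $x$ is the data with law $P_X$, $z$ the latent with law $P_Z$, $G$ the decoder, $c$ a reconstruction cost, $\pi_{\mathrm{ref}}$ the reference coupling, and $\gamma$ the temperature.

```latex
\min_{\pi \in \Pi(P_X, P_Z)} \;
  \mathbb{E}_{(x,z)\sim\pi}\big[\, c\big(x, G(z)\big) \big]
  \;+\; \gamma \, \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
```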
Reading it term by term: the minimization ranges over all joint couplings whose marginals are the data law and the latent law; the first term is the transport cost, demanding that each paired latent decode back to its input; the second term is an entropy penalty against the reference coupling, weighted by the temperature γ. Larger γ favors couplings that are close to the reference and more spread out; as γ → 0 the penalty vanishes, the objective reverts to pure transport, and the optimal coupling collapses to a deterministic map. The WAE objective is the zero-temperature special case: EVIA simply adds a temperature to the same transport problem.
Introduce a prior potential (it corresponds to the aggregated-posterior marginal constraint, and in Cryo-ET it is realized by the adversarially learned energy critic) and a data-fit weight, and define the utility as the prior potential minus the weighted reconstruction cost. By the Gibbs variational principle (Donsker–Varadhan), the optimal conditional for a fixed input takes Gibbs (Boltzmann) form: the reference conditional reweighted by the exponential of the utility divided by γ, which is everywhere positive. This is the same Boltzmann coupling as in entropic optimal transport, up to the adversarial prior-potential term. As γ → 0 it reduces to the WAE’s deterministic hard transport.
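In symbols, the utility and the resulting Gibbs posterior can be sketched as follows. The names are assumptions introduced here for illustration: $f$ is the prior potential (energy critic), $\lambda$ the data-fit weight, $c$ the reconstruction cost, $G$ the decoder, and $\pi_{\mathrm{ref}}(z \mid x)$ the reference conditional.

```latex
U(x,z) = f(z) - \lambda\, c\big(x, G(z)\big),
\qquad
\pi^{*}(z \mid x)
  = \frac{\pi_{\mathrm{ref}}(z \mid x)\, \exp\!\big(U(x,z)/\gamma\big)}
         {\int \pi_{\mathrm{ref}}(z' \mid x)\, \exp\!\big(U(x,z')/\gamma\big)\, dz'}
```

This is the maximizer of $\mathbb{E}_{q}[U] - \gamma\,\mathrm{KL}(q \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x))$ over conditionals $q$, which is the standard statement of the Gibbs variational principle.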
What each piece does physically: the utility combines two forces. The prior potential rewards latents that look like the true distribution (high energy-critic score), the reconstruction term penalizes latents that do not reconstruct the input, and the data-fit weight sets their relative say. The exponent is the familiar “negative energy over temperature” of statistical mechanics: small γ peaks the distribution sharply on the highest-utility latent, approaching an argmax; large γ flattens it toward the reference. The prior-potential term turns plain transport matching into energy-based-model density matching, and that is exactly what wires EVIA to energy-based models and adversarial training.
Figure (interactive): how the temperature sets the posterior width and the reconstruction uncertainty. Temperature γ sets the posterior's width directly. Write data-consistency as an energy E(x) (the amber well); the posterior is the Boltzmann distribution in that well, q(x|y) ∝ e^(−E(x)/γ) (purple). As γ→0 it collapses to a spike at the bottom, giving one deterministic reconstruction, exactly WAE / CryoGEN-II; as γ grows it spreads into a family of reconstructions, and that width is the missing-wedge uncertainty CryoWGEN reports. The purple ticks along the bottom are sample reconstructions drawn from the posterior; they fan out as γ rises.
Turn the knob to either extreme: as γ → 0 the Boltzmann distribution collapses to the single highest-utility point, EVIA becomes the WAE’s deterministic encoder, and you get one best-guess reconstruction; at large γ the posterior spreads back to the reference and the encoder barely looks at the data. The useful operating regime in Cryo-ET is in between: pick γ so the posterior just covers the family of solutions the missing wedge permits, neither pretending certainty nor degenerating into noise.
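The dependence of posterior width on temperature can be checked numerically on a toy case. The sketch below assumes a one-dimensional quadratic data-consistency well, E(x) = x²/2, which is an illustrative choice and not the energy used in any Cryo-ET pipeline; the function names are hypothetical. For this well the Boltzmann distribution is Gaussian, so its standard deviation should track the square root of γ.

```python
import numpy as np

def boltzmann_on_grid(energy, grid, gamma):
    """Normalized Boltzmann weights exp(-E/gamma) on a uniform 1-D grid."""
    logits = -energy(grid) / gamma
    logits = logits - logits.max()      # stabilize before exponentiating
    weights = np.exp(logits)
    return weights / weights.sum()

def posterior_std(energy, grid, gamma):
    """Standard deviation of the gridded Boltzmann distribution."""
    p = boltzmann_on_grid(energy, grid, gamma)
    mean = np.sum(p * grid)
    return float(np.sqrt(np.sum(p * (grid - mean) ** 2)))

# Toy data-consistency well (an assumption for illustration): E(x) = x^2 / 2.
def quadratic_well(x):
    return 0.5 * x ** 2

grid = np.linspace(-20.0, 20.0, 40001)

# Posterior width at several temperatures; it grows like sqrt(gamma) here.
widths = {g: posterior_std(quadratic_well, grid, g) for g in (0.01, 0.1, 1.0, 4.0)}
for g, w in widths.items():
    print(f"gamma={g:<5} posterior std={w:.4f}")
```

For a non-quadratic well the square-root law no longer holds exactly, but the qualitative behavior (narrow at low γ, wide at high γ) is the same.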
The soft barycentric encoder
The optimal encoder is not a single sample but the conditional expectation of the latent under the posterior, a soft barycentric projection: a weighted average of candidate latents drawn from the reference, each weighted by its Gibbs factor. The weights are a softmax, so the map is smooth and differentiable and converges to the classical hard optimal-transport map as γ → 0.
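Written out, the soft barycentric projection and its sample-based approximation might look like this. The notation is assumed for illustration: $U$ is the utility, $\pi^{*}$ the Gibbs posterior, $\pi_{\mathrm{ref}}$ the reference, and $N$ the number of candidate latents.

```latex
T_\gamma(x) = \mathbb{E}_{z \sim \pi^{*}(\cdot \mid x)}[\, z \,]
  \;\approx\; \sum_{i=1}^{N} w_i(x)\, z_i,
\qquad
w_i(x) = \frac{\exp\!\big(U(x,z_i)/\gamma\big)}
              {\sum_{j=1}^{N} \exp\!\big(U(x,z_j)/\gamma\big)},
\qquad
z_i \sim \pi_{\mathrm{ref}}(\cdot \mid x)
```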
How it is actually computed: draw a batch of candidate latents from the reference, score each with its utility, pass the scores through a temperature-γ softmax to get weights, and take the weighted average of the candidates. That is the “soft” part: not hard-selecting the single highest-utility candidate, but taking the barycenter under their Gibbs weights. The two extremes stay consistent again: as γ → 0 the softmax tends to one-hot and the barycenter collapses to a single candidate, recovering the hard optimal-transport map; at large γ the weights become uniform and the encoder output falls back to a plain average of the candidates, nearly independent of the input. Because the whole chain (utility, softmax, weighted sum) is differentiable, the encoder trains end-to-end by backpropagation.
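The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the EVIA implementation: `toy_utility` uses an identity decoder and no prior potential, the four candidates are fixed by hand rather than sampled, and all function names are hypothetical.

```python
import numpy as np

def soft_barycentric_encode(x, candidates, utility, gamma):
    """Gibbs-weighted average of candidate latents.

    x          : input, shape (d_x,)
    candidates : candidate latents, shape (N, d_z)
    utility    : callable (x, candidates) -> scores of shape (N,)
    gamma      : temperature of the softmax
    """
    scores = utility(x, candidates) / gamma
    scores = scores - scores.max()          # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ candidates, weights

# Hypothetical utility: identity decoder, squared-error data fit, no critic term.
def toy_utility(x, z, lam=1.0):
    return -lam * np.sum((x - z) ** 2, axis=1)

# Four hand-picked candidates (corners of the unit square) and one input.
candidates = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([0.9, 0.1])

# Low temperature: weights become one-hot on the best-scoring candidate.
z_cold, w_cold = soft_barycentric_encode(x, candidates, toy_utility, gamma=1e-3)
# High temperature: weights become uniform, output is the plain average.
z_hot, w_hot = soft_barycentric_encode(x, candidates, toy_utility, gamma=1e6)
```

In a trainable model the same three steps would be written in an autodiff framework so that gradients flow through the utility, the softmax, and the weighted sum.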
Two implementations: SGLD and amortized
Taking the conditional prior to be an isotropic Gaussian, the effective potential to minimize is a tractable Log-Sum-Exp: a negative log of an integral, over the latent, of an exponential.
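A plausible form of this potential is sketched below, up to an additive constant. The notation and the exact scaling of the Gaussian term are assumptions made for illustration: $f$ is the energy critic, $\lambda$ the data-fit weight, $c$ the reconstruction cost, $G$ the decoder, and $\mathcal{N}(\mu(x), \sigma^{2} I)$ the conditional Gaussian prior.

```latex
\Phi(x) = -\gamma \log \int
  \exp\!\Big(
    \tfrac{1}{\gamma}\big[\, f(z) - \lambda\, c\big(x, G(z)\big) \big]
    \;-\; \tfrac{1}{2\sigma^{2}} \big\lVert z - \mu(x) \big\rVert^{2}
  \Big)\, dz
```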
Reading this potential: the Gaussian prior folds straight into the exponent, so the integrand collects three terms: the energy critic, the reconstruction penalty, and a Gaussian pull, with strength set by the prior's inverse variance, tying the latent back to the prior mean. The leading negative log makes the potential the “free energy” of this family of solutions; its gradient with respect to the parameters is the training signal. The only hard part is the integral over the latent, and the two implementations are two ways to handle it.
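To make the "integral over the latent" concrete, here is a minimal Monte Carlo sketch of a free energy of this kind. It is neither of the two EVIA implementations: it simply samples latents from the Gaussian prior (which absorbs the Gaussian pull) and applies a numerically stable log-mean-exp to the scaled utilities. The function names, the toy utility with an identity decoder, and the sample count are all assumptions for illustration.

```python
import numpy as np

def log_mean_exp(a):
    """Stable log(mean(exp(a))) via the max-shift trick."""
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def free_energy_estimate(x, utility, mu, sigma, gamma, n_samples=4096, seed=0):
    """Monte Carlo estimate of -gamma * log E_{z ~ N(mu, sigma^2 I)}[exp(U(x, z) / gamma)]."""
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.normal(size=(n_samples, mu.shape[0]))
    return -gamma * log_mean_exp(utility(x, z) / gamma)

# Hypothetical utility: identity decoder, squared-error data fit, no critic term.
def toy_utility(x, z, lam=1.0):
    return -lam * np.sum((x - z) ** 2, axis=1)

x = np.array([0.5, -0.3])
phi = free_energy_estimate(x, toy_utility, mu=np.zeros(2), sigma=1.0, gamma=1.0)
print(f"estimated free energy: {phi:.3f}")
```

The max-shift inside `log_mean_exp` matters at low temperature, where the scaled utilities become large in magnitude and a naive `exp` would overflow or underflow.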
- EVIA-SGLD (MCMC): draw negative samples iteratively with Langevin dynamics, where each update takes a step down the energy gradient and adds Gaussian noise, needing only the energy gradient; more general. The step size controls how far each update moves, and the Gaussian noise injected at each step is what makes the iteration sample the whole family of the posterior rather than slide into a single minimum, so SGLD yields actual posterior samples.
- EVIA-amortized: train an encoder to predict the latent in one shot, together with a learned decoder, under an end-to-end consistency objective; faster. It amortizes the inner MCMC, otherwise run per image, into a single forward pass, at the cost of approximating only the conditional expectation (the soft barycenter) rather than the full family.
The trade-off is direct: SGLD runs a Langevin chain per sample — slow, but it keeps the full posterior uncertainty; amortized swaps the inner sampling for one forward pass — much faster, but it delivers a point estimate of the posterior (the barycenter). This SGLD-vs-amortized split is exactly the choice that, in Cryo-ET, separates CryoWGEN-II (Langevin, a family of posterior solutions) from the single-answer variant.
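The Langevin side of this trade-off can be illustrated with a small sampler. The sketch below is plain unadjusted Langevin dynamics with an exact gradient, not stochastic-gradient Langevin on a learned critic, and it targets a toy quadratic energy; the function name, step size, temperature, and burn-in length are assumptions chosen for illustration. With noise scale sqrt(2 · step · γ), the chain targets a density proportional to exp(−E(z)/γ), up to discretization bias.

```python
import numpy as np

def langevin_sample(grad_energy, z0, step, gamma, n_steps, rng):
    """Unadjusted Langevin chain targeting a density proportional to exp(-E(z) / gamma)."""
    z = np.array(z0, dtype=float)
    chain = np.empty((n_steps,) + z.shape)
    noise_scale = np.sqrt(2.0 * step * gamma)
    for k in range(n_steps):
        # Gradient step down the energy plus injected Gaussian noise.
        z = z - step * grad_energy(z) + noise_scale * rng.normal(size=z.shape)
        chain[k] = z
    return chain

# Toy energy (an assumption): E(z) = ||z||^2 / 2, so grad E(z) = z
# and the target is a zero-mean Gaussian with variance gamma per coordinate.
rng = np.random.default_rng(0)
chain = langevin_sample(lambda z: z, z0=np.zeros(2), step=0.05,
                        gamma=0.5, n_steps=20000, rng=rng)
samples = chain[2000:]                      # discard burn-in

print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var())
```

Setting the noise term to zero turns the loop into gradient descent, which slides to the single minimum; that is the contrast between sampling a posterior family and returning one answer.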
Place in the autoencoder family
| Algorithm | Posterior matching | Form of the posterior |
|---|---|---|
| VAE | per-sample KL | Gaussian |
| WAE | optimal transport on the aggregated posterior | often a deterministic encoder |
| AAE | adversarial (density ratio) | implicit, set by the discriminator |
| EVIA | entropic optimal transport | Boltzmann posterior (everywhere positive) |
One thread through the table is “how tightly the posterior is constrained”. The VAE pins every sample near the prior, the strictest. WAE and AAE constrain only the aggregated posterior and free the single sample, estimating the same divergence by different means (closed-form MMD vs. an adversarial discriminator). EVIA also constrains only the aggregated posterior, but uses temperature to soften “hard” transport continuously into an everywhere-positive Boltzmann posterior, which folds the other three into one family: γ → 0 is the WAE, and adding the adversarial energy term connects to the AAE’s adversarial matching.
EVIA casts missing-wedge restoration as a distributional inverse problem: the observation (a wedge-deficient tomogram) does not pin down a unique solution, so the correct output is a family of volumes consistent with the data, not one image. In Cryo-ET the energy critic is learned adversarially, and the SGLD/amortized implementations correspond to “sample a posterior family” versus “take the posterior barycenter.” For the concrete application see CryoWGEN; its Langevin version, which returns a family of posterior solutions, is CryoWGEN-II.