Self-supervised learning

Learning representations and solving inverse problems without labels, by constructing supervision from the data itself.

Self-supervised learning trains models without human labels by inventing a pretext task whose targets are derived from the data itself. The model is forced to predict one part of an input from another, and in solving that artificial task it learns structure that transfers to downstream problems. Because the supervisory signal is generated automatically, the approach scales to large unlabeled datasets. The design of the pretext task governs which regularities the model is steered toward capturing, so its choice is typically tied closely to the downstream objective.

Put differently: supervised learning needs paired $(x,y)$, and labeling cost grows linearly with the amount of data. Self-supervision carves $y$ out of $x$ itself, decoupling the amount of data from the amount of supervision. The price is that the pretext task must be calibrated to be just hard enough. If it is too easy (interpolating a masked pixel from its neighbors), the model takes a shortcut and learns no semantics; if it is too hard (reconstructing a whole image from one pixel), there is no usable signal. A good pretext task pins the model at the point where the only way to solve it is to understand the underlying structure.

A direct demonstration of masked autoencoding — hide part, reconstruct from the rest:

[Interactive figure: panels "Structure", "Masked input", and "Self-supervised reconstruction"; mask ratio 35%.]

Hide part of the structure and have the model predict the masked region from what remains: there are no labels, and the data supervises itself. The more that is masked, the less context remains and the blurrier the reconstruction. The missing wedge is itself a structured mask.

A taxonomy of objectives

Self-supervised objectives fall into two broad families. Generative (or predictive) objectives ask the model to reconstruct or predict missing content — a masked patch, the next token, a colorized version of a grayscale image — so that the target lives in the data space itself. Contrastive objectives instead operate in representation space: two augmented views of the same datum are pulled together while views of different data are pushed apart, so the model learns an embedding in which semantic similarity is encoded by distance, without ever reconstructing the input. Generative methods retain low-level detail and supply a reconstruction map directly; contrastive methods discard nuisance detail and excel at producing transferable embeddings. The two are complementary, and many systems blend them.
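The two families can be contrasted as loss functions. The following is a minimal NumPy sketch, not any particular system's implementation: `masked_reconstruction_loss` stands in for a generative objective (squared error on hidden entries), and `info_nce_loss` is one common contrastive objective. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Generative objective: mean squared error on the hidden entries only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive objective: row i of z_a should match row i of z_b
    and be dissimilar from every other row of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))     # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                      # hypothetical embeddings of 8 data points
view_a = z + 0.05 * rng.normal(size=z.shape)      # two lightly perturbed "views"
view_b = z + 0.05 * rng.normal(size=z.shape)

aligned = info_nce_loss(view_a, view_b)           # each view paired with its partner
shuffled = info_nce_loss(view_a, view_b[::-1])    # every view paired with a wrong partner
print(aligned < shuffled)
```

The generative loss is computed in data space and needs the original values at the hidden positions; the contrastive loss never touches the input and only compares embeddings, which is why it yields no reconstruction map.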

This distinction decides the downstream use. Contrastive embeddings suit classification and retrieval, tasks that only care whether two things are alike, because they deliberately throw away pixel-level detail; image restoration, by contrast, needs every pixel put back in place, so solving inverse problems is almost always generative — it wants exactly the reconstruction map that contrastive methods discard. Cryo-ET restoration is of the latter kind, and the rest of this page focuses on the generative branch.

A canonical generative example is masked autoencoding: a portion of the input is hidden and the model must reconstruct it from the visible remainder. Filling in a masked patch of an image or a masked token in a sequence requires capturing context, regularities, and long-range dependencies — knowledge that is useful far beyond the masking game itself. The masked autoencoder (MAE) makes this concrete for images by masking a large fraction of patches, encoding only the visible ones, and reconstructing the rest with a lightweight decoder; the high masking ratio forces the encoder to infer global structure rather than interpolate locally.

Why is the high masking ratio the crucial knob? Mask only 15% of patches and almost every hidden location has a near neighbor to interpolate from; a low-level texture extrapolator gets by without understanding what the object is. Push the ratio to 75% and the visible patches become sparse, so only semantic-level priors (“this is an airplane, the wing should continue here”) can fill the gaps. The masking ratio sets the level of abstraction the model is forced to invoke, which makes it the central consideration in pretext-task design.
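The effect of the masking ratio on available local context can be checked with a short NumPy sketch. It draws a uniformly random patch mask and counts, for each hidden patch, how many of its four edge neighbors remain visible. The 14×14 grid, the helper names, and the neighbor statistic are illustrative assumptions, not part of MAE itself.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array that is True where a patch is hidden."""
    num_hidden = int(round(num_patches * mask_ratio))
    order = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[order[:num_hidden]] = True
    return mask

def mean_visible_neighbors(mask_grid):
    """Average number of visible edge neighbors (0 to 4) per hidden patch.
    Positions outside the grid count as not visible."""
    visible = np.pad(~mask_grid, 1, constant_values=False).astype(int)
    count = (visible[:-2, 1:-1] + visible[2:, 1:-1]
             + visible[1:-1, :-2] + visible[1:-1, 2:])
    return float(count[mask_grid].mean())

rng = np.random.default_rng(0)
grid = 14                                         # assumed 14 x 14 patch grid
low = random_patch_mask(grid * grid, 0.15, rng).reshape(grid, grid)
high = random_patch_mask(grid * grid, 0.75, rng).reshape(grid, grid)

context_low = mean_visible_neighbors(low)
context_high = mean_visible_neighbors(high)
print(context_low, context_high)
```

At the low ratio a hidden patch typically keeps most of its neighbors, so local interpolation is possible; at the high ratio the average drops sharply, which is the regime in which the encoder has to rely on global structure.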

Intuition

The data is its own answer key. If part of a signal can be predicted from the rest, that predictability encodes real structure; a model that learns to exploit it has learned something about the signal, even though no human ever provided a label.

For inverse problems the same principle becomes a way to learn without ground truth. When the degradation operator $\mathcal{A}$ that corrupts a clean signal $x$ into an observation $y=\mathcal{A}(x)+\epsilon$ is known, it can itself supply supervision. Here $x$ is the clean signal we want to recover but never observe, $\mathcal{A}$ is the known physical process that turns it into a measurement (in Cryo-ET, projection plus the missing wedge), $\epsilon$ is zero-mean noise, and $y$ is the corrupted observation that is all we actually hold. One common scheme generates two corrupted views of the same underlying signal and trains the model to predict one from the other; consistency under the known corruption replaces the missing clean target.

Depth

Applying the known operator to a candidate restoration and comparing the result against real observations turns “is this restoration plausible?” into a measurable loss in observation space. A reconstruction $\hat x$ is acceptable when $\mathcal{A}(\hat x)$ matches the statistics of genuine measurements, so the operator $\mathcal{A}$ bridges the unobserved clean domain and the observed corrupted domain — exactly the structure that makes label-free reconstruction tractable.

But note: consistency $\mathcal{A}(\hat x)\approx y$ alone does not pin down a unique $\hat x$. The directions that $\mathcal{A}$ erases (the Fourier components inside the missing wedge) leave no trace in observation space, so the restoration is underdetermined along them. This is why a prior is indispensable: either an inductive bias baked into the network architecture, or an explicit matching loss that pulls $\hat x$ toward the distribution of real structures. The known operator handles “agree with the observation”; the prior handles “fill in plausible structure along the directions the operator cannot see,” and neither works without the other.
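The non-uniqueness can be made concrete with a toy 2-D analogue. In the sketch below, $\mathcal{A}$ is modeled as a binary Fourier mask that zeroes a wedge of frequencies; this is an assumed simplification (no projection, no noise), and `wedge_mask`, `apply_operator`, and the 60° half-angle are hypothetical choices for illustration. Two clearly different candidates then produce the same observation.

```python
import numpy as np

def wedge_mask(n, half_angle_deg=60.0):
    """Boolean Fourier mask: True where the toy tilt range measures data."""
    k = np.fft.fftfreq(n)
    kz, kx = np.meshgrid(k, k, indexing="ij")
    return np.abs(kz) <= np.abs(kx) * np.tan(np.deg2rad(half_angle_deg))

def apply_operator(x, mask):
    """Toy A(x): keep only the measured Fourier components."""
    return np.fft.ifft2(np.fft.fft2(x) * mask).real

rng = np.random.default_rng(0)
n = 32
mask = wedge_mask(n)
x = rng.normal(size=(n, n))                       # one candidate restoration

# A perturbation whose Fourier content lies entirely inside the missing wedge.
invisible = np.fft.ifft2(np.fft.fft2(rng.normal(size=(n, n))) * ~mask).real
x_alt = x + invisible                             # a second, different candidate

consistency_gap = np.abs(apply_operator(x, mask) - apply_operator(x_alt, mask)).max()
restoration_gap = np.abs(x - x_alt).max()
print(consistency_gap, restoration_gap)
```

The consistency gap is at the level of floating-point round-off while the restoration gap is not, so a loss built only from $\mathcal{A}(\hat x)$ cannot prefer one candidate over the other. That choice has to come from the prior.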

Denoising without clean targets

The most direct instance of this principle in imaging is label-free denoising. Noise2Noise observes that if two independent noisy measurements $y_1, y_2$ of the same scene differ only by zero-mean noise, training a network to map $y_1\to y_2$ under a squared loss converges, in expectation, to the same optimum as training against the unavailable clean target, because the expectation of a noisy target equals the clean signal. Why does this hold? The minimizer of a squared loss $\mathbb{E}\,\Vert f(y_1)-y_2\Vert^2$ is the conditional mean $\mathbb{E}[y_2\mid y_1]$; when $y_2 = x + \text{noise}$ with noise that is zero-mean and independent of $y_1$, then $\mathbb{E}[y_2\mid y_1]=\mathbb{E}[x\mid y_1]$, which is exactly the minimizer obtained by training against the clean $x$. No ground truth is needed, only paired noisy observations.
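The conditional-mean argument can be checked numerically in the simplest possible setting. The sketch below replaces the network with a linear least-squares fit $f(y_1)=a\,y_1+b$ and uses scalar Gaussian signals and noise; these are assumptions chosen so that the result is easy to verify, not a description of how Noise2Noise is trained in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000
sigma = 0.5                                       # assumed noise standard deviation

x = rng.normal(size=n_samples)                    # clean signal, unavailable in practice
y1 = x + sigma * rng.normal(size=n_samples)       # noisy input
y2 = x + sigma * rng.normal(size=n_samples)       # independent noisy target

# Least-squares fit of a linear "denoiser" f(y1) = a * y1 + b.
a_clean, b_clean = np.polyfit(y1, x, deg=1)       # supervised by the clean signal
a_noisy, b_noisy = np.polyfit(y1, y2, deg=1)      # supervised by a second noisy copy

print(a_clean, a_noisy)
print(b_clean, b_noisy)
```

Both fits should land on nearly the same slope and intercept, with the slope close to the theoretical shrinkage factor $1/(1+\sigma^2)$ for unit-variance signal. The noisy target adds variance to the estimate but does not move its optimum.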

Noise2Void removes even the pairing requirement: a blind-spot network predicts each pixel from its neighborhood while being structurally forbidden from seeing the pixel itself, so it cannot copy the pixel-independent noise and must instead reconstruct the underlying signal from context. The key assumption here is that the noise is conditionally independent across pixels given the signal: since this pixel’s noise cannot be inferred from the neighborhood, all the network can learn is the part the neighborhood does predict, namely the underlying signal. The cost is discarding the predicted pixel’s own information, which makes the denoiser slightly conservative, but the dependence on paired data is dropped entirely. Both methods turn the statistics of a known noise process directly into supervision, differing only in whether they pivot on the noise being zero-mean or conditionally independent.
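A blind spot is often realized through input masking during training rather than through the architecture. The sketch below shows one such scheme under stated assumptions: a random subset of pixels is overwritten with values copied from nearby pixels, and the loss is evaluated only at those positions against the original noisy values. The helper names, the neighborhood radius, and the wrap-around border handling are illustrative choices, not a reference implementation.

```python
import numpy as np

def blind_spot_inputs(noisy, n_masked, rng, radius=2):
    """Hide n_masked pixels by overwriting each with a nearby pixel's value.
    Assumes radius is smaller than both image dimensions."""
    h, w = noisy.shape
    flat = rng.choice(h * w, size=n_masked, replace=False)
    rows, cols = np.unravel_index(flat, (h, w))
    d_row = rng.integers(-radius, radius + 1, size=n_masked)
    d_col = rng.integers(-radius, radius + 1, size=n_masked)
    d_row[(d_row == 0) & (d_col == 0)] = 1        # never copy a pixel onto itself
    # Wrap-around borders: a simplification that keeps the source pixel distinct.
    src_rows = (rows + d_row) % h
    src_cols = (cols + d_col) % w
    masked = noisy.copy()
    masked[rows, cols] = noisy[src_rows, src_cols]
    mask = np.zeros((h, w), dtype=bool)
    mask[rows, cols] = True
    return masked, mask

def blind_spot_loss(prediction, noisy, mask):
    """Squared error against the noisy values, at the hidden pixels only."""
    return float(np.mean((prediction[mask] - noisy[mask]) ** 2))

rng = np.random.default_rng(0)
noisy = rng.normal(size=(16, 16))                 # stand-in for a noisy image
masked, mask = blind_spot_inputs(noisy, n_masked=20, rng=rng)
loss_if_copying = blind_spot_loss(masked, noisy, mask)  # identity "network" on masked input
print(mask.sum(), loss_if_copying)
```

Because the network only ever receives `masked`, copying its input at a hidden position reproduces a neighbor's value, not the target, so the identity shortcut no longer drives the loss to zero.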

The lesson for Cryo-ET is direct: tomographic data carries abundant, statistically characterizable noise but almost no clean ground truth. Methods that extract supervision from the noise statistics themselves fit this data regime exactly — which is the premise for treating missing-wedge restoration as a generative self-supervision problem.

This degradation-as-supervision idea is the foundation of the self-supervised reconstruction methods on this site: with the missing-wedge and noise operator known, CryoGEN and CryoWGEN learn entirely from real tomograms. Concretely, CryoGEN’s self-supervision is instantiated by a random-rotation-then-missing-wedge proxy $T_M\circ R$ applied to its own outputs, which yields a supervised target with no ground truth — the known degradation operators $T_M$ (the missing-wedge mask) and $R$ (a random rotation) play the role of $\mathcal{A}$ above, while “the network’s output, once degraded, should match the distribution of real observations” plays the role of the consistency constraint. The matching loss draws on optimal transport and the autoencoder framework to inject the structure prior along the directions the operator cannot see. Along this line, the site’s four methods take different stances on what to return: CryoGEN-I yields a MAP point estimate; CryoGEN-II gives a stable single answer via WAE/optimal transport; CryoWGEN-I draws Monte-Carlo samples with EVIA; and CryoWGEN-II produces a posterior family via EVIA Langevin dynamics.
