Particle picking & template matching

Locating copies of a target molecule in a noisy tomogram is the step that precedes subtomogram averaging.

Particle picking is the task of finding the positions — and often the rough orientations — of a molecule of interest within a tomogram. It precedes subtomogram averaging, which needs the coordinates of every copy before it can extract and average them. Picking is hard precisely because the data is so noisy: the targets sit at very low signal-to-noise ratio, are distorted by the missing wedge, and may be densely packed among other cellular material. A single tomogram often contains hundreds to thousands of copies awaiting localization, so picking must be both accurate and scalable across the whole volume.

Intuition

Think of picking as “circling every copy of one screw model in an old, snow-flecked photograph.” Each screw is too blurry to make out its edges, but you recognize its rough shape and size. Picking answers two questions in sequence: first where a copy sits (localization), then which way it points (orientation). The second matters because the next step, averaging, has to rotate thousands of copies into a common orientation, and the rough angle that picking reports is the starting point for that alignment. Find more of the true copies and the final average resolves better; circle too many “fake screws” and you fold noise into the average and drag resolution down. Picking is always a balancing act between recall (how many of the true copies you find) and purity, or precision (how many of your picks are real).

[Interactive figure. Panels: noisy field (hidden particles); normalized cross-correlation map. Example setting: noise level 0.60, peak threshold 0.40, giving 5 true particles and 5 picked particles.]

A disk template is correlated with the noisy field at every position; the normalized map peaks where the local density resembles the template. A lower threshold recovers more true particles but admits more false positives — the reason template matching prunes its candidate list.
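The trade-off in the figure can be put into numbers. The sketch below is a minimal illustration, not a standard evaluation tool: the function name `precision_recall`, the greedy one-to-one matching rule, and the `tolerance` parameter are all assumptions made for the example. It scores a pick list against known true positions, as one could for simulated data.

```python
import numpy as np

def precision_recall(picked, truth, tolerance):
    """Match picks to true positions one-to-one within `tolerance` voxels.

    Assumes both inputs are non-empty sequences of coordinates, shape (n, d).
    Returns (precision, recall).
    """
    picked = np.asarray(picked, dtype=float)
    truth = np.asarray(truth, dtype=float)
    unmatched = list(range(len(truth)))  # indices of true particles not yet claimed
    true_positives = 0
    for p in picked:
        if not unmatched:
            break
        dist = np.linalg.norm(truth[unmatched] - p, axis=1)
        nearest = int(np.argmin(dist))
        if dist[nearest] <= tolerance:
            true_positives += 1
            unmatched.pop(nearest)  # each true particle can be matched only once
    return true_positives / len(picked), true_positives / len(truth)

# Toy 2D example: three true particles, four picks (two of them spurious).
truth = [(10, 10), (30, 30), (50, 50)]
picked = [(11, 10), (29, 31), (80, 80), (5, 5)]
precision, recall = precision_recall(picked, truth, tolerance=3.0)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Lowering the peak threshold generally adds picks, which can only raise recall but tends to lower precision.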

Template matching: sliding a stencil through the volume

The classical approach is template matching. A 3D template (a known or low-resolution model of the target) is correlated against the tomogram at every position and orientation. The cross-correlation between template $t$ and volume $v$,

$$c(\mathbf{x}) = \sum_{\mathbf{u}} t(\mathbf{u})\,v(\mathbf{x}+\mathbf{u}),$$

peaks where the local density resembles the template. Here $\mathbf{x}$ is the position being tested in the tomogram, $\mathbf{u}$ runs over every voxel the template covers, $t(\mathbf{u})$ is the template density at offset $\mathbf{u}$, and $v(\mathbf{x}+\mathbf{u})$ is the tomogram density at the corresponding spot. The sum is a voxel-by-voxel multiply-and-add. Taking both densities relative to their mean, when the template’s shape matches the local density everywhere (same sign, similar height) the products are all positive and the sum is large; when the shapes disagree, positives and negatives cancel and the sum falls toward zero. Repeating the calculation with the template rotated to each orientation on a grid (typically thousands of angle combinations) and taking local maxima of the resulting correlation maps yields candidate positions and angles.
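For a single orientation, the sum above can be evaluated at every position at once with the FFT. The following is a minimal sketch under stated assumptions: periodic boundaries, a toy cube as the template, and $\mathbf{x}$ taken as the corner of the template box. The function name `cross_correlate` and all toy data are illustrative, not taken from any particular package.

```python
import numpy as np

def cross_correlate(volume, template):
    """Return c(x) = sum_u t(u) v(x+u) for every x, with periodic wrap-around."""
    padded = np.zeros(volume.shape)
    padded[tuple(slice(0, s) for s in template.shape)] = template
    # Correlation theorem: correlation in real space is conj(T) * V in Fourier space.
    spectrum = np.conj(np.fft.fftn(padded)) * np.fft.fftn(volume)
    return np.fft.ifftn(spectrum).real

rng = np.random.default_rng(0)

# Toy template: a small cube, mean-subtracted so that flat density scores zero.
template = np.zeros((5, 5, 5))
template[1:4, 1:4, 1:4] = 1.0
template -= template.mean()

# Toy tomogram: Gaussian noise plus one copy of the template at a known corner.
volume = rng.normal(0.0, 0.5, size=(24, 24, 24))
corner = (7, 12, 3)
volume[7:12, 12:17, 3:8] += 2.0 * template

cmap = cross_correlate(volume, template)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(cmap), cmap.shape))
print("true corner:", corner, "strongest peak:", peak)
```

A full search would repeat this for each rotated copy of the template and keep, per voxel, the best score and the angle that produced it.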

Raw correlation values cannot simply be compared, because dense regions of the tomogram naturally produce a larger sum and drown out weaker signals whose shape actually matches. So in practice one uses normalized cross-correlation: at each position the local patch and the template are each mean-subtracted and divided by their own standard deviation, which puts the score between $-1$ and $1$, with $1$ a perfect linear match independent of absolute density. There is a second subtlety. The tomogram lacks a wedge of Fourier information (the missing wedge), while the template is complete. Compared directly, template density at frequencies the data never sampled has no counterpart in the tomogram and would spuriously lower the correlation. The standard fix is to apply the same missing-wedge mask to the template in Fourier space, so that both are compared on the same incomplete Fourier support and the comparison is fair.
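Both corrections can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: the local statistics are taken over the full template box rather than a shaped mask, boundaries are periodic, and the wedge geometry assumes array axes ordered (z, y, x) with the tilt axis along y and the beam along z. The function names and the `max_tilt_deg` parameter are hypothetical.

```python
import numpy as np

def _correlate(volume, kernel):
    """Periodic cross-correlation of `kernel` against `volume` via the FFT."""
    padded = np.zeros(volume.shape)
    padded[tuple(slice(0, s) for s in kernel.shape)] = kernel
    return np.fft.ifftn(np.conj(np.fft.fftn(padded)) * np.fft.fftn(volume)).real

def normalized_cross_correlation(volume, template, eps=1e-8):
    """Score in [-1, 1]: each local patch and the template are standardized."""
    n = template.size
    t = (template - template.mean()) / template.std()
    box = np.ones(template.shape)
    local_mean = _correlate(volume, box) / n
    local_var = _correlate(volume ** 2, box) / n - local_mean ** 2
    local_std = np.sqrt(np.maximum(local_var, 0.0))
    # t sums to zero, so subtracting the local mean from the patch changes nothing.
    return _correlate(volume, t) / (n * np.maximum(local_std, eps))

def wedge_mask(shape, max_tilt_deg=60.0):
    """Boolean Fourier mask, True where a +/-max_tilt tilt series has data."""
    kz = np.fft.fftfreq(shape[0])[:, None, None]
    kx = np.fft.fftfreq(shape[2])[None, None, :]
    sampled = np.abs(kz) <= np.abs(kx) * np.tan(np.deg2rad(max_tilt_deg))
    return np.broadcast_to(sampled, shape)

def apply_missing_wedge(template, max_tilt_deg=60.0):
    """Remove from the template the frequencies the tomogram never sampled."""
    mask = wedge_mask(template.shape, max_tilt_deg)
    return np.fft.ifftn(np.fft.fftn(template) * mask).real

rng = np.random.default_rng(1)
template = np.zeros((6, 6, 6))
template[1:5, 2:4, 1:5] = 1.0

# A patch that is a scaled, offset copy of the template should still score ~1.
volume = rng.normal(size=(20, 20, 20))
volume[4:10, 8:14, 2:8] = 3.0 * template + 10.0
ncc = normalized_cross_correlation(volume, template)
print("score at the planted copy:", round(float(ncc[4, 8, 2]), 6))

wedged = apply_missing_wedge(template, max_tilt_deg=60.0)
```

In a real search the wedge-masked template, not the complete one, would be passed to the normalized correlation.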

Going deeper

Why does cross-correlation work as a detector at all? Put it in a statistical frame: under additive white Gaussian noise, the test of “is there a copy of template $t$ here” has an optimal statistic, the matched filter, which is proportional to the correlation of $t$ with the data. Cross-correlation is therefore not an ad-hoc similarity score but the likelihood-ratio detection statistic under that noise model. This also explains its two weaknesses. First, the white-noise assumption fails in tomograms: the CTF and the reconstruction color the noise, making it spatially correlated, so structured peaks appear in the correlation map where no particle exists, creating false positives. Second, at the low resolution typical of tomograms the score is dominated by the strong low-frequency components that encode overall size and shape, so two molecules with similar outlines but different interiors can produce similar peaks. Remedies include whitening by the noise power spectrum in Fourier space (the generalized matched filter, which weights each frequency by the inverse of its noise variance) and local-energy normalization to suppress the inflated scores from high-density membranes or gold fiducials. Even so, the orientation search has a real cost: compute is proportional to the number of orientations tested, and that number grows roughly as the inverse cube of the angular step, so halving the step costs about eight times as much. A full-orientation scan of one cellular tomogram routinely takes hours.
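The cost of the orientation search can be estimated with a line of arithmetic. The rotation group has total volume $8\pi^2$ when the three angles are measured in radians, so a roughly uniform grid with angular step $\Delta$ needs about $8\pi^2/\Delta^3$ orientations. The sketch below uses that approximation; real angle lists differ somewhat, and symmetric targets need fewer orientations. The function name is hypothetical.

```python
import math

def approx_orientation_count(step_deg):
    """Rough number of orientations needed to cover all rotations at a given step."""
    step = math.radians(step_deg)
    return round(8 * math.pi ** 2 / step ** 3)

for step_deg in (20, 10, 5):
    print(f"{step_deg:>2} degree step: about {approx_orientation_count(step_deg):,} orientations")
```

Since each orientation costs one full correlation pass over the tomogram, halving the angular step multiplies the work by about eight.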

Once the candidate peaks are found, the list almost always contains false positives and must be pruned. Common criteria include peak height (too low usually means a chance coincidence of noise), peak shape — sharp and isolated versus diffuse (diffuse peaks often correspond to membranes or ice contamination rather than a compact particle), a minimum spacing between neighboring peaks (to stop one particle being picked twice, usually set by the particle diameter), and whether the location is plausible (a ribosome should not land in the middle of a lipid bilayer). Template matching is noise-sensitive, but it remains a workhorse: it needs no human annotations, runs from even a crude initial model, and supplies the initial orientations that seed averaging — something a pure detector cannot give.
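Two of these criteria, peak height and minimum spacing, are easy to sketch as a greedy selection: take the highest remaining score, record it, blank out everything within one exclusion radius, and repeat until no score clears the threshold. The function name `pick_peaks` and its parameters are assumptions made for this illustration, and the peak-shape and plausibility checks are not covered.

```python
import numpy as np

def pick_peaks(score_map, threshold, min_distance, max_picks=1000):
    """Greedy peak extraction with a height threshold and an exclusion radius."""
    scores = score_map.astype(float).copy()
    grids = np.ogrid[tuple(slice(0, s) for s in scores.shape)]
    picks = []
    while len(picks) < max_picks:
        idx = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[idx] < threshold:
            break  # everything left is below the height cutoff
        picks.append((tuple(int(i) for i in idx), float(scores[idx])))
        # Blank a sphere around the pick so one particle is not picked twice.
        dist2 = sum((g - i) ** 2 for g, i in zip(grids, idx))
        scores[dist2 <= min_distance ** 2] = -np.inf
    return picks

# Toy score map: one strong peak with a shoulder, one moderate peak, one weak bump.
score_map = np.zeros((10, 10, 10))
score_map[2, 2, 2] = 0.9
score_map[2, 2, 3] = 0.8   # shoulder of the same particle
score_map[7, 7, 7] = 0.6
score_map[5, 5, 5] = 0.2   # below threshold

picks = pick_peaks(score_map, threshold=0.4, min_distance=3)
print([position for position, score in picks])
```

Setting `min_distance` near the particle diameter follows the spacing rule described above; the threshold is data-dependent and usually chosen by inspecting the score distribution.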

Neural picking: from correlation to learning

Neural picking has become a strong complement. A convolutional network (such as a 3D U-Net) or a transformer-based network is trained — from a modest set of annotations, or via self-supervised pretraining — to flag voxels belonging to the target. Its fundamental difference from template matching is this: template matching only asks “how much does the local density resemble this one stencil,” whereas a network has seen the target across many orientations and crowded backgrounds during training, so what it learns is a family of features more tolerant of noise and deformation than any single rigid stencil.

Intuition

Template matching is like trying every lock with one fixed key — a slightly worn key opens nothing. A neural network is the person who, after seeing thousands of “this is a lock, that is not” photos, has worked out for themselves what a lock looks like. The former needs you to have a good key first (a high-quality template); the latter needs you to have a stack of labeled examples first (annotations). In the dirty, crowded setting of a cellular tomogram the latter usually holds up better — it does not fail outright just because one copy is partly occluded by a neighbor.

The cost is annotation: a supervised network needs experts to mark true particles by hand in several tomograms first, and labeling cellular data is both slow and subjective. This is why self- and semi-supervised pretraining matter — learn a general representation on a large pool of unlabeled tomograms, then fine-tune with a handful of labels. Whichever route, one thing holds for both methods: the cleaner the underlying tomogram, the more reliable picking becomes. Template-matching peaks rise more clearly from the noise floor, and networks separate target from background more easily. This is one reason learned restoration methods such as CryoGEN are relevant upstream of this step — suppress missing-wedge artifacts and noise first, and downstream picking, alignment, and averaging all benefit.

Picking also works hand in hand with segmentation: segmenting large-scale structures like membranes and organelles first lets you restrict picking to plausible regions (search for ribosomes only in the cytosol, say), which both cuts false positives and saves compute. In in-situ imaging this “frame the scene, then pick the molecules” pairing is especially useful. Finally, the picked coordinates and angles flow directly into subtomogram extraction and on to subtomogram averaging — so the accuracy of picking is written directly into the resolution of the final structure.
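The simplest form of this pairing is a filter that drops candidates falling outside an allowed region. The sketch below assumes a boolean segmentation mask on the same voxel grid as the tomogram and integer pick coordinates; the function name and the toy "cytosol" mask are hypothetical.

```python
import numpy as np

def keep_picks_in_region(coords, allowed_mask):
    """Drop candidate picks whose voxel lies outside a boolean segmentation mask."""
    coords = np.asarray(coords, dtype=int).reshape(-1, allowed_mask.ndim)
    inside = allowed_mask[tuple(coords.T)]  # look up the mask at each pick
    return coords[inside]

# Hypothetical segmentation: True marks cytosol, False marks everything else.
cytosol = np.zeros((16, 16, 16), dtype=bool)
cytosol[:, :, 8:] = True

candidates = [(3, 4, 12), (5, 5, 2), (10, 1, 9)]
kept = keep_picks_in_region(candidates, cytosol)
print(kept.tolist())
```

Filtering after the search only removes false positives. Saving compute requires restricting the search itself to the masked region.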
