Segmentation

Delineating membranes, organelles, filaments, and macromolecules turns a tomogram into an interpretable map of cellular structures.

Segmentation partitions a tomogram into meaningful structures — labeling which voxels belong to a membrane, an organelle, a cytoskeletal filament, a macromolecular complex, or background. A raw reconstruction is a dense grey-scale volume in which everything overlaps; segmentation converts it into an annotated map that can be visualized, measured, and quantified. It is a general analysis task in Cryo-ET, applied across cellular tomograms regardless of the downstream goal.

One way to place it: reconstruction tells you how bright each voxel is, segmentation tells you what each voxel is. The first is a physical quantity (scattering density); the second is a semantic label (membrane / microtubule / ribosome / background). A tomogram on the order of $1000^3$ holds about $10^9$ voxels, far too many to label by hand, so the central question of segmentation is always the same: how do you turn “an expert recognizes a membrane at a glance” into a rule or function that runs automatically across every voxel.

Grey-scale slice

Segmentation mask

Structure recall: 87%Noise picked up: 14%

Threshold: 0.55

A high threshold misses faint membrane; a low one admits background noise as structure — no single global threshold satisfies both, which is why fixed thresholds fail under uneven contrast.

The simplest form of segmentation is thresholding: a density value is chosen, and every voxel denser than it (darker in inverted grey-scale) is labeled as structure. On a volume with high signal-to-noise and uniform contrast, a single threshold can outline a membrane cleanly; in real tomograms, however, intensity drifts with thickness, defocus, and depth, so a global threshold that fits in one region runs too high or too low in another. A high threshold drops faint membrane, while a low one admits background noise into the mask, and the two failure modes cannot be avoided at once.

To make this concrete, suppose structure voxels sit at a true density near $1.0$ (arbitrary units), background near $0.0$ , and the noise standard deviation is $\sigma \approx 0.4$ . Holding background false positives below $5\%$ requires a threshold above roughly $1.65\sigma \approx 0.66$ ; but if thickness in one region drops the effective contrast of real membrane to $0.7$ , that same line now grazes the signal itself and faint membrane vanishes in patches. This is the structural problem with a single scalar threshold: it has one knob but two conflicting jobs — suppress noise and retain weak signal — and at low SNR those goals pull in opposite directions.

Segmentation is difficult for the same reasons that pervade tomography. The low signal-to-noise ratio blurs boundaries between objects, the missing wedge elongates and weakens structures along the beam axis — membranes nearly perpendicular to the missing-wedge direction can fade or vanish — and molecular crowding inside cells packs many overlapping densities into a small volume. The missing wedge is especially awkward because it does not lower SNR uniformly: it erases features of specific orientations directionally, so a single membrane can read “clear here, gone there,” and methods that rely on local density alone struggle to bridge the gap.

Several families of methods are in use. Manual segmentation, in which an expert traces structures slice by slice, is accurate but slow and does not scale to large datasets. Threshold-based methods label voxels by density value, sometimes combined with geometric criteria such as local membrane-like curvature; they are fast but struggle with the uneven contrast and noise of real tomograms. Deep-learning segmentation, typically using convolutional networks trained on annotated examples, has become dominant: a network learns to recognize the appearance of a target class and produces dense labels across whole tomograms, generalizing across noise levels far better than fixed thresholds. Common tools include EMAN2’s tomoseg, which trains a convolutional network from sparse manual annotations to label membranes, organelles, and filaments; dedicated pipelines also target membrane bilayers or specific organelles such as mitochondria and the endoplasmic reticulum.

Deeper

Write voxel-level segmentation as a function $f$ that, at each position $\mathbf{x}$ , looks not at that point’s intensity alone but at a whole local neighborhood $v(\mathbf{x} + \mathbf{u})$ ( $\mathbf{u}$ ranging over a small window) and outputs the probability the point belongs to structure, $p(\mathbf{x}) = f\big(\{v(\mathbf{x}+\mathbf{u})\}\big)$ . Thresholding is the most degenerate special case of this framework — the window shrinks to a single point and $f$ becomes a step function $f(v) = \mathbb{1}[v > \tau]$ — which is exactly why it has no defense against noise: a single voxel’s intensity is itself corrupted at the $\sigma$ scale.

A convolutional network is stronger because the $f$ it learns lets the whole neighborhood vote. Its first layer is a convolution

(\,k * v\,)(\mathbf{x}) = \sum_{\mathbf{u}} k(\mathbf{u})\, v(\mathbf{x}+\mathbf{u}),

where $k$ is a learned kernel (a set of weights), $\mathbf{u}$ runs over the kernel’s support, and $v$ is the input tomogram. This step encodes local patterns such as the bright–dark–bright cross-profile of a membrane bilayer; stacking layers widens the receptive field, so the network can use context — is this dark line a continuous bilayer, or a filament passing through — to disambiguate. Training typically minimizes a per-voxel cross-entropy

\mathcal{L} = -\sum_{\mathbf{x}} \big[\, y(\mathbf{x})\log p(\mathbf{x}) + (1-y(\mathbf{x}))\log(1-p(\mathbf{x})) \,\big],

where $y(\mathbf{x}) \in \{0,1\}$ is the manual label for that voxel and $p(\mathbf{x})$ is the network’s predicted structure probability. The key point is that a few sparsely annotated surfaces suffice to constrain $f$ , because a given structural class shows the same local texture again and again throughout the volume; once the network has learned that texture it generalizes to regions that were never labeled.

Intuition

Segmentation answers “what is this?” at every point in the volume rather than “how bright is it?”. A threshold sees only intensity and is fooled by noise; a trained network learns the texture and context that distinguish a membrane bilayer from a passing filament, so it can label structures even where their density alone is ambiguous.

Cleaner input improves every approach, since both human and machine segmenters draw sharper boundaries on higher-SNR volumes — one reason learned restoration methods such as CryoGEN are useful before segmentation. Once restoration suppresses missing-wedge artifacts and noise, the “vanishing arc” of a membrane often reconnects, and both thresholds and networks return more coherent masks. The resulting labels support quantitative cell biology — membrane area, organelle volume, filament-network density, and inter-complex spacing all read directly off the mask — and complement molecule-level analysis such as particle picking and in situ structural biology: segmentation supplies the context of where organelles and membranes are, and picking then localizes individual molecules within that context. When downstream work aligns and stacks many copies of the same macromolecule — each extracted into a subtomogram box typically $64$ – $128$ voxels on a side — subtomogram averaging over roughly $10^4$ copies suppresses random noise by about $\sqrt{10^4}=100$ , a roughly $100\times$ gain in SNR.

← Structural Analysis