Masked Pretraining for Scientific Spectra: Lessons from Breaking BERT
The fundamental challenge of spectral machine learning is the data asymmetry. ImageNet has 1,200 labeled images per class. Spectroscopy has approximately one labeled spectrum per molecule — sometimes zero if the compound has never been synthesized. You cannot train a foundation model on a dataset where every class has a single example.
Self-supervised pretraining sidesteps the label bottleneck entirely. Instead of "given this spectrum, predict the molecule," the model learns from a different signal: "given part of this spectrum, predict the rest." No labels. No classification. Just structure — the statistical regularities that make spectra more than random noise. Masked pretraining is the simplest and most effective way to extract this structure, and adapting it from discrete text to continuous spectra turned out to be harder than expected.
The idea behind masked pretraining is not conceptually difficult. You hide parts of the input, ask the model to fill in the blanks, and the representations it builds along the way capture the deep structure of the data. BERT did this for language. MAE did this for images. We are doing this for vibrational spectra — and the translation from discrete tokens to continuous signals exposed several non-obvious failure modes that consumed weeks of debugging.
From Tokens to Patches
BERT masks discrete tokens (words) and predicts them from context. Spectra are continuous 1D signals — there are no natural tokens. The solution is patching: divide the spectrum into contiguous wavenumber regions and treat each region as a token.
A 3,501-point IR spectrum split into 128 patches gives approximately 27 wavenumber points per patch. Each patch is embedded into a $d$-dimensional vector via a learned linear projection:
$$\mathbf{p}_i = \text{Embed}(s[i \cdot P : (i+1) \cdot P]) \in \mathbb{R}^d$$
where $P$ is the patch size and $s \in \mathbb{R}^{3501}$ is the raw spectrum. The patches play the role of BERT's word tokens. Masking a patch means replacing its embedding with a learned mask vector before feeding it into the encoder.
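A minimal NumPy sketch of the patching and embedding step (the projection weights are random stand-ins for the learned layer, and truncating the 3,501 points to $128 \times 27 = 3456$ is an assumption — the post does not say how the remainder is handled):

```python
import numpy as np

rng = np.random.default_rng(0)

P, N_PATCHES, D = 27, 128, 256         # patch size, patch count, embedding dim
spectrum = rng.standard_normal(3501)   # stand-in for an area-normalized spectrum

# 128 * 27 = 3456 < 3501: drop the remainder (an assumption; padding or an
# uneven final patch would serve equally well)
patches = spectrum[:N_PATCHES * P].reshape(N_PATCHES, P)   # (128, 27)

# Learned linear projection Embed(.): random stand-in weights
W = rng.standard_normal((P, D)) * 0.02
embeddings = patches @ W               # (128, 256): one d-dim vector per patch
```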
The connection to Masked Autoencoders (MAE) is direct: He et al. applied the same idea to image patches in 2022. Spectra have different properties than images — we will return to this — but the core mechanism is identical: mask some patches, predict them from context, and hope the representations learned in the process are useful for downstream tasks.
Patches vs. Points. Masking individual wavenumber points is too fine-grained. Spectra are locally smooth — the value at wavenumber $w_i$ is highly correlated with $w_{i-1}$ and $w_{i+1}$. The model can trivially interpolate single masked points from neighbors without learning any higher-level structure. Masking contiguous patches of ~27 points forces the model to reconstruct entire peak shapes from distant context — overtone correlations, combination band patterns, functional group fingerprints. This is the representation-building signal.
Why 27 Points?
Each patch spans approximately 35 cm$^{-1}$ at our 3,501-point resolution over the 4000-400 cm$^{-1}$ mid-IR range. This is not arbitrary — it is calibrated to the physics of infrared absorption. The full width at half maximum (FWHM) of a typical IR absorption peak in the condensed phase falls between 20 and 50 cm$^{-1}$. A patch of 27 points covers roughly one peak width.
This matters because the patch size controls the difficulty of the pretext task, and hence the quality of the learned representations.
Smaller patches (5 points, ~7 cm$^{-1}$) slice individual peaks into multiple fragments. Each fragment is trivially reconstructible from its immediate neighbors by interpolation — the spectrum is smooth at this scale. The model learns local continuity, which is not useful for downstream molecular identification. In our ablations, 5-point patches produced encoder representations that performed only 1.2 percentage points above a random baseline on linear probe evaluation.
Larger patches (100 points, ~130 cm$^{-1}$) mask entire spectral features — and sometimes multiple overlapping features at once. In the fingerprint region (1000-1500 cm$^{-1}$), a 130 cm$^{-1}$ window can contain three or four overlapping C-C, C-O, and C-N stretches. Reconstructing all of them from distant context is too hard: the model produces blurred averages that earn low reconstruction loss but build poor representations. Accuracy on linear probe dropped by 3.4 percentage points compared to the 27-point optimum.
27 points (~35 cm$^{-1}$) hits the sweet spot: a masked patch removes approximately one peak but leaves neighboring peaks visible. To reconstruct the missing peak, the model must reason about inter-peak relationships — the correlation between C=O stretching and C-O stretching, the harmonic relationship between fundamentals and overtones, the characteristic spacing of functional group multiplets. This is exactly the reasoning that transfers to downstream tasks.
The Masking Strategy
Select a random subset of patches to mask. Three design choices matter:
Masking ratio. What fraction of patches to replace with the mask token. BERT uses 15% (conservative, designed for fine-tuning stability). MAE uses 75% (aggressive, works because images have high 2D spatial redundancy). For spectra, 30–40% works best. Higher than BERT because spectra have substantial local redundancy along the wavenumber axis. Lower than MAE because spectra are sparser than images — fewer peaks, more baseline — so masking too aggressively leaves insufficient context for reconstruction.
Mask token. A single learnable parameter $\mathbf{m} \in \mathbb{R}^d$ shared across all masked positions. This is the model's way of saying "I don't know what goes here." The mask token participates in self-attention (or SSM processing), allowing information from visible patches to flow into masked positions through the backbone.
Where to apply the mask. This is the critical decision. The mask replaces the patch embedding before the encoder sees it:
$$\tilde{\mathbf{p}}_i = \begin{cases} \mathbf{m} & \text{if } i \in \mathcal{M} \\ \mathbf{p}_i & \text{otherwise} \end{cases}$$
where $\mathcal{M}$ is the random set of masked patch indices. This operation corrupts the encoder's input — the model cannot see the ground truth at masked positions.
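Putting the three choices together, a NumPy sketch of mask sampling and injection at a 35% ratio (all values here are stand-ins; in training, $\mathbf{m}$ is a learned parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
N_PATCHES, D, RATIO = 128, 256, 0.35

embeddings = rng.standard_normal((N_PATCHES, D))  # patch embeddings p_i
mask_token = rng.standard_normal(D) * 0.02        # learned vector m (stand-in)

n_masked = int(round(N_PATCHES * RATIO))          # 45 of 128 patches
masked_idx = rng.choice(N_PATCHES, size=n_masked, replace=False)

corrupted = embeddings.copy()
corrupted[masked_idx] = mask_token                # p~_i = m for every i in M
```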
The Architecture
The full pretraining pipeline:
- Raw spectrum $s \in \mathbb{R}^{3501}$ — area-normalized IR or Raman spectrum
- Patch embedding — linear projection to $\{\mathbf{p}_i\}_{i=1}^{128}$, $\mathbf{p}_i \in \mathbb{R}^{d}$
- Mask injection — replace $\mathbf{p}_i$ with $\mathbf{m}$ for $i \in \mathcal{M}$
- Positional encoding — add learnable position embeddings
- D-LinOSS backbone — 4 layers of Diagonal Linear Operator State Space blocks
- Reconstruction head — linear projection back to patch dimension $\mathbb{R}^{27}$
- Loss — MSE computed only on masked patches
The loss is computed exclusively on masked patches. Visible patches are not penalized — the model is free to represent them however it wants. This forces the backbone to build contextual representations at every position: the output at a masked position must encode the prediction, and this prediction can only come from attending to visible neighbors.
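Concretely, the masked-only loss can be written either with index selection or with a broadcast mask — a NumPy sketch (shapes follow the pipeline above; values are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 128, 27
target = rng.standard_normal((N, P))   # ground-truth patches
pred = rng.standard_normal((N, P))     # reconstruction head output
mask = rng.random(N) < 0.35            # True where the patch was masked

# Index form: MSE over masked patches only
loss = ((pred[mask] - target[mask]) ** 2).mean()

# Equivalent broadcast form: visible patches are zeroed out and contribute
# nothing to the loss (and hence nothing to the gradient)
diff = (pred - target) * mask[:, None]
loss_bc = (diff ** 2).sum() / (mask.sum() * P)
```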
The Near-Identity Collapse
This is the most important section of this post. It describes the single most dangerous pitfall in adapting masked pretraining from text to continuous signals.
The temptation is to implement masking as a loss mask rather than an input mask. Instead of replacing masked patch embeddings with $\mathbf{m}$ before the encoder, you feed the full, unmasked spectrum through the encoder and simply compute the loss only on the masked positions:
```python
# The wrong way (loss-only masking)
embeddings = embed(full_spectrum)    # no masking!
outputs = encoder(embeddings)        # the encoder sees everything
reconstruction = decode(outputs)
loss = mse(reconstruction[mask], spectrum[mask])   # loss on masked positions only
```
This compiles. It runs. The loss drops beautifully — from 0.42 to 0.003 within 700 training steps. The training curve looks perfect. The model is completely useless.
What happened: without input masking, the encoder sees the ground truth at every position including the masked ones. The shortest path to zero reconstruction loss is the identity function — pass the input through unchanged. The latent dimension ($d = 256$) is large enough that the spectrum's intrinsic dimensionality fits comfortably. The model learns to copy, not to understand.
The training metrics are deceptive. An MSE of 0.003 looks like remarkable reconstruction quality. But the model has learned nothing about molecular structure, peak correlations, or spectral physics. It has learned $f(x) \approx x$.
With input masking, the encoder at masked positions sees $\mathbf{m}$ — a fixed, learned vector with no information about the local spectrum. The only way to reconstruct the masked patch is to infer it from surrounding context. This forces the model to learn:
- Peak correlations: the O–H stretch at 3300 cm⁻¹ implies an O–H bend near 1400 cm⁻¹
- Functional group patterns: C=O at 1720 cm⁻¹ with specific C–H neighbors constrains the carbonyl environment
- Overtone relationships: fundamentals predict their overtones and combination bands at fixed frequency ratios
- Baseline structure: smooth, globally constrained — trivially interpolated, freeing the model to focus on peaks
The Masking Principle. For masked pretraining to learn non-trivial representations, the mask must corrupt the encoder's input, not just the loss computation. This is obvious in hindsight — BERT replaces masked tokens with [MASK] before feeding to the Transformer. But when adapting to continuous signals, it is tempting to mask only the loss, since "the model should figure out what to predict." The model does figure it out: it predicts the identity.
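For contrast with the broken snippet above, here is a runnable toy version of the corrected flow. Every component is a stand-in (random linear maps, not the actual D-LinOSS pipeline); the only point being illustrated is where the corruption happens:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, D = 128, 27, 64

spectrum_patches = rng.standard_normal((N, P))
W_embed = rng.standard_normal((P, D)) * 0.1   # stand-in embed
W_enc = rng.standard_normal((D, D)) * 0.1     # toy "encoder" (no real mixing)
W_dec = rng.standard_normal((D, P)) * 0.1     # stand-in reconstruction head
mask_token = rng.standard_normal(D)           # learned m (stand-in)
mask = rng.random(N) < 0.35

embeddings = spectrum_patches @ W_embed
embeddings[mask] = mask_token                 # corrupt BEFORE the encoder
outputs = embeddings @ W_enc                  # encoder sees m, never the truth
reconstruction = outputs @ W_dec
loss = ((reconstruction[mask] - spectrum_patches[mask]) ** 2).mean()
```

Without the corruption line, this toy pipeline could drive the loss to zero by composing `W_enc @ W_dec` into an inverse of `W_embed` — the identity shortcut; with it, the masked rows carry no information about their own targets.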
Why This Works for Spectra
Masked pretraining works when the signal has structure that allows masked regions to be inferred from unmasked context. Spectra have this structure in abundance — but the reasons are more specific and more interesting than "spectra are redundant." Three distinct physical mechanisms provide the reconstruction signal, each forcing the model to learn different aspects of molecular structure.
Overtone Correlations
Molecular vibrations are not perfectly harmonic. A bond modeled as a Morse potential produces not just a fundamental vibration but a series of overtones and combination bands at predictable frequencies. The fundamental C-H stretch at 2900 cm$^{-1}$ has a first overtone near 5800 cm$^{-1}$ (slightly less than 2x due to anharmonicity) and a combination band with C-H bending at approximately 4300 cm$^{-1}$.
When the model masks the fundamental at 2900 cm$^{-1}$, the overtone at 5800 cm$^{-1}$ and the combination band at 4300 cm$^{-1}$ remain visible. To reconstruct the masked fundamental, the model must learn the anharmonicity relationship — that the overtone frequency is related to the fundamental by a factor slightly less than 2, with the deviation encoding the shape of the potential energy surface. This is not a statistical correlation learned from data. It is a hard physical constraint that the model discovers through masked reconstruction.
The same logic works in reverse: masking the overtone region forces reconstruction from the fundamental. And masking both simultaneously (probability $\approx 0.35^2 \approx 12\%$ at a 35% masking ratio, since the two regions fall in separate patches that are masked independently) forces the model to use the combination band — learning a three-way relationship. After 50K pretraining steps, we observe that the model's internal representations at the fundamental, overtone, and combination band positions become linearly correlated, confirming that the encoder has learned the anharmonic coupling.
Functional Group Patterns
A carbonyl group (C=O) does not exist in isolation. It is bonded to other atoms that produce their own spectral signatures: the C=O stretch appears near 1700 cm$^{-1}$, but the neighboring C-C stretch, the C-O stretch (if it is an ester or acid), and the C-C-O bending mode all produce characteristic features in the fingerprint region between 1000 and 1300 cm$^{-1}$. An amide carbonyl additionally shows N-H bending near 1550 cm$^{-1}$ and C-N stretching near 1400 cm$^{-1}$.
Masking the carbonyl peak at 1700 cm$^{-1}$ forces the model to predict it from the fingerprint region — learning that a specific pattern of C-O, C-N, and C-C stretches implies the presence (and exact position) of the carbonyl. This is group-level reasoning. The model does not memorize individual spectra; it learns the spectroscopic grammar of functional groups, the rules that govern which peaks co-occur and how their positions and intensities are correlated through molecular structure.
This group-level learning is directly visible in the downstream performance. On a functional group classification task (given a spectrum, identify which of 15 functional groups are present), a pretrained encoder achieves 91.2% F1 versus 83.7% for training from scratch — a 7.5-point gap. The pretrained model has already learned functional group fingerprints during reconstruction.
Baseline Physics
The spectral baseline — the slowly varying signal underneath the peaks — is not noise. It encodes the instrument response function: the interferogram apodization in FTIR, the detector sensitivity curve, Rayleigh scattering in Raman, and fluorescence background. These effects vary smoothly across the wavenumber range with characteristic length scales of 500-2000 cm$^{-1}$.
Masking baseline regions (patches in the 1800-2500 cm$^{-1}$ "spectral desert" where few organic molecules absorb) is trivially easy — the model interpolates from neighboring baseline points. But this ease is pedagogically valuable: it teaches the model to separate the slowly-varying baseline from the sharp molecular peaks, which is exactly the right inductive bias for downstream tasks. After pretraining, the encoder's first principal component across all positions corresponds almost perfectly to the baseline shape, meaning the model has learned to factor it out — a prerequisite for robust quantification and identification across different instruments.
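The asymmetry is easy to demonstrate: straight-line interpolation across a masked window is near-perfect on baseline and useless on a peak. A self-contained sketch with a synthetic linear baseline plus one Gaussian peak (illustrative values, not real spectra):

```python
import numpy as np

x = np.arange(400, dtype=float)
baseline = 0.001 * x + 0.2                      # slowly varying background
peak = np.exp(-0.5 * ((x - 300.0) / 5.0) ** 2)  # one sharp absorption peak
spectrum = baseline + peak

def interp_fill(y, lo, hi):
    """Reconstruct a masked window [lo, hi) by linear interpolation."""
    filled = y.copy()
    filled[lo:hi] = np.interp(x[lo:hi], [lo - 1.0, float(hi)], [y[lo - 1], y[hi]])
    return filled

# Masking a "spectral desert" window: interpolation is essentially exact
err_desert = np.abs(interp_fill(spectrum, 100, 127)[100:127] - spectrum[100:127]).max()

# Masking the peak: interpolation erases it entirely
err_peak = np.abs(interp_fill(spectrum, 287, 314)[287:314] - spectrum[287:314]).max()
```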
Physical Constraints
Spectral intensities are non-negative. Integrated band areas are proportional to transition dipole moments (IR) or polarizability derivatives (Raman). Peak positions cluster at frequencies corresponding to molecular vibrations, not uniformly across the axis. These soft constraints narrow the reconstruction space and help the model converge to physically plausible predictions.
The comparison to images is instructive. Images have 2D spatial redundancy — a masked patch can be inferred from surrounding patches in all directions. Spectra have 1D spectral redundancy plus long-range physical correlations that span the entire wavenumber range. The effective redundancy per masked position is lower for spectra, which is why 30–40% masking works best (not 75% as in MAE for images).
Masking as Feature Selection. After pretraining, the encoder's output at a visible (unmasked) position encodes not just the local peak shape, but its relationship to all other peaks in the spectrum. The representation at 2900 cm⁻¹ (C–H stretch) carries information about what the model expects at 1450 cm⁻¹ (C–H bend), 5800 cm⁻¹ (overtone), and 1720 cm⁻¹ (whether a carbonyl is present). These contextual representations are exactly what downstream tasks — identification, quantification, anomaly detection — need.
What the Encoder Learns
What does a pretrained encoder actually learn? We can peek inside by examining the learned representations at different stages of the architecture.
First-layer CNN kernels. After pretraining, the 1D convolutional kernels in the patch embedding layer organize into three distinct classes of features:
- Peak detectors — Gabor-like filters tuned to the typical peak FWHM of 20-50 cm$^{-1}$. These filters have a central excitatory lobe flanked by inhibitory lobes, responding maximally to the characteristic Gaussian or Lorentzian shape of an absorption peak. Roughly 40% of the 256 kernels learn this pattern, with different kernels tuned to different widths covering the range from narrow gas-phase peaks (~5 cm$^{-1}$) to broad hydrogen-bonded features (~100 cm$^{-1}$).
- Derivative filters — Asymmetric kernels that compute approximate first and second derivatives of the spectral signal. First-derivative filters detect inflection points (peak edges), while second-derivative filters detect peak centers. These are precisely the features that classical chemometrics uses for peak detection (Savitzky-Golay differentiation), but the model learns them from scratch without any chemometric prior. About 35% of kernels are derivative-like.
- Baseline estimators — Low-frequency filters with wide, smooth profiles that capture the slowly-varying instrument response and scattering background. These filters effectively perform implicit baseline correction — the remaining 25% of kernels extract the baseline component so that deeper layers can reason about peaks without baseline interference.
The fact that the encoder independently discovers these three feature types — which correspond exactly to the three processing steps in classical spectral preprocessing (baseline correction, smoothing/differentiation, peak detection) — is strong evidence that masked pretraining learns physically meaningful representations rather than statistical shortcuts.
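A quick illustration of the derivative-filter claim — a discrete second-derivative kernel responds most strongly (most negatively) exactly at a peak center, which is what makes such filters peak-center detectors. This is a sketch of the principle, not the trained kernels:

```python
import numpy as np

x = np.arange(200)
peak = np.exp(-0.5 * ((x - 100) / 8.0) ** 2)   # Gaussian peak centered at 100

kernel = np.array([1.0, -2.0, 1.0])            # discrete second derivative
response = np.convolve(peak, kernel, mode="same")

# The second derivative is most negative at the peak maximum
# (cf. Savitzky-Golay differentiation in classical chemometrics)
center = int(np.argmin(response))
```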
Pretraining Results
Training setup: 222K QM9S computed spectra (IR + Raman), masked patch modeling with 35% masking ratio, D-LinOSS backbone (4 layers, $d = 256$, state dimension 128), trained on 2× RTX 5060 Ti 16GB.
Pretraining matters most when labels are scarce. The gap between pretrained and from-scratch grows as labels decrease. At 100% labels (the full 222K dataset), the gap is ~2 points — both approaches have enough data to learn. At 10% labels, the gap is 12 points. At 1% labels (2,200 spectra), pretrained reaches 71% versus from-scratch at 43% — a 28-point gap. This is the practical value: pretraining makes spectral ML viable in the realistic regime where labeled experimental data is expensive to produce.
This scaling behavior matches the theoretical expectation. Pretraining provides an initialization in a good basin of the loss landscape — one where the encoder already understands spectral structure. With abundant labels, gradient descent finds this basin regardless of initialization. With scarce labels, the loss landscape is underspecified and initialization quality dominates. The pretrained model starts in the right neighborhood; the randomly initialized model wanders.
Practical Pitfalls
Hard-won lessons from implementation:
Patch size matters. Too small (5 points) and the model interpolates from immediate neighbors — no long-range learning. Too large (100 points) and each masked region contains multiple overlapping peaks that are too complex to reconstruct from context. 27 points — matching the CNN tokenizer's receptive field and approximately one peak width — is the sweet spot for our architecture and spectral resolution. The details are in the "Why 27 Points?" section above, but the takeaway is simple: calibrate patch size to the characteristic feature width of your signal.
Learning rate for the mask token. The mask embedding $\mathbf{m}$ is a single parameter being pulled in different directions by every masked position in every training sample. Without a learning rate boost (10× the backbone LR), it gets stuck near initialization and all masked positions produce similar, uninformative outputs. A dedicated learning rate group for $\mathbf{m}$ fixes this. We use $\text{lr}_{\text{mask}} = 3 \times 10^{-3}$ while the backbone uses $3 \times 10^{-4}$. The mask token converges within the first 2K steps and then remains relatively stable — it finds a "neutral" point in embedding space equidistant from all patch clusters.
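In code, the dedicated group is just a second parameter-group dict with its own learning rate; `backbone_params` and `mask_token` below are stand-ins for the real tensors:

```python
import math

def make_param_groups(backbone_params, mask_token,
                      base_lr=3e-4, mask_lr_mult=10.0):
    """Two optimizer groups: backbone at base_lr, mask token at 10x."""
    return [
        {"params": backbone_params, "lr": base_lr},              # 3e-4
        {"params": [mask_token], "lr": base_lr * mask_lr_mult},  # 3e-3
    ]

groups = make_param_groups(backbone_params=["w1", "w2"], mask_token="m")
```

Dicts of this shape are what a PyTorch optimizer such as `torch.optim.AdamW` accepts as parameter groups (with actual tensors in place of the string stand-ins).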
Random masking per sample per epoch. If the masking pattern is deterministic (same patches masked every time a spectrum is seen), the model memorizes the reconstruction for each training sample rather than learning general spectral relationships. The mask must be resampled independently for every sample in every epoch. With 222K training spectra and 50 epochs, each spectrum is seen ~50 times with different masks — generating ~50 different reconstruction tasks per molecule. This is where the "14 million effective training examples" number comes from: 222K spectra × ~40 masked patches per sample × ~1.6 epochs of unique masks before significant repetition.
Combine with OT loss. Pure MSE reconstruction loss misses shifted peaks, as described in the optimal transport post. Using the hybrid MSE + Sinkhorn loss from that work improves downstream accuracy by ~1.5 percentage points — the model learns to produce sharper, better-positioned peaks. The mechanism is that MSE penalizes amplitude errors uniformly, so a peak predicted at the right position but wrong height incurs the same loss as a peak predicted at the right height but shifted by 2 cm$^{-1}$. The Wasserstein-1 distance in the Sinkhorn loss is sensitive to position shifts, providing the complementary gradient signal the model needs to localize peaks precisely.
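The position-sensitivity argument has a closed form in 1D: the Wasserstein-1 distance between two normalized histograms is the L1 distance between their CDFs. A sketch of the principle on synthetic peaks (not the Sinkhorn implementation used in training):

```python
import numpy as np

x = np.arange(400, dtype=float)

def peak(center, width=3.0):
    g = np.exp(-0.5 * ((x - center) / width) ** 2)
    return g / g.sum()                           # unit-mass peak

def w1(p, q):
    # 1D Wasserstein-1 = L1 distance between cumulative distributions
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def mse(p, q):
    return ((p - q) ** 2).mean()

target = peak(200.0)

# W1 grows linearly with the shift...
w1_near, w1_far = w1(target, peak(202.0)), w1(target, peak(250.0))

# ...while MSE saturates once the peaks stop overlapping: it cannot tell a
# 30-bin shift from a 50-bin shift
mse_30, mse_50 = mse(target, peak(230.0)), mse(target, peak(250.0))
```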
Gradient accumulation bookkeeping. With multi-GPU training and gradient accumulation (4 sub-steps in our setup), periodic operations — logging, validation, checkpointing — must trigger only when a full optimizer step completes, not on every sub-step. We lost a week to a bug where validation ran 4× too often, appearing to show faster convergence that was actually an artifact of evaluating mid-accumulation with stale gradients. The fix: track a `did_step` flag and gate all periodic operations on it. Additionally, step 0 satisfies `0 % N == 0` for all `N`, so every periodic action triggers at initialization. Guard all periodic checks with `self.step > 0`.
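A compact sketch of the gating fix (a toy loop, not the actual trainer; `ACCUM` and `LOG_EVERY` are illustrative values):

```python
ACCUM = 4        # gradient-accumulation sub-steps per optimizer step
LOG_EVERY = 2    # periodic action cadence, in optimizer steps

logged_at, opt_step = [], 0
for micro_step in range(1, 17):              # 16 micro-batches
    # ...backward pass on the micro-batch would happen here...
    did_step = (micro_step % ACCUM == 0)     # optimizer.step() only now
    if did_step:
        opt_step += 1
        # gate periodic work on did_step AND opt_step > 0 (the 0 % N == 0 trap)
        if opt_step > 0 and opt_step % LOG_EVERY == 0:
            logged_at.append(opt_step)
```

Gating on `micro_step` instead of `did_step` would fire 8 times in this loop rather than 2 — the 4×-too-often bug described above.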
The Road Forward
Masked pretraining is the first stage of the Spektron training pipeline. It gives the encoder a strong initialization — an understanding of spectral structure, peak correlations, and instrument physics — without requiring a single label. But it is not sufficient on its own. The encoder representations from masked pretraining are optimized for reconstruction, not for the downstream tasks we actually care about: molecular identification, quantification, and calibration transfer.
The next stages of training — the Variational Information Bottleneck that disentangles chemistry from instrument, and the fine-tuning on labeled data with optimal transport loss — build on the pretrained representations. Each stage refines what the previous one built. The pretrained encoder provides the foundation: a model that already knows what spectra look like, how peaks relate to each other, and what constitutes a physically plausible reconstruction. The later stages teach it what to do with that knowledge.
Originally published at tubhyam.dev/blog/masked-pretraining-scientific-spectra