The Spectral Inverse Problem: From Group Theory to Foundation Models
Vibrational spectroscopy — IR and Raman — is one of the most widely deployed analytical techniques in chemistry. You shine light on a molecule, measure what comes back, and try to figure out what the molecule looks like. The forward direction of this problem is solved: given a structure, compute its spectrum. The inverse direction — given a spectrum, recover the structure — is fundamentally harder, and the reason is group theory.
A forward-computed spectrum is unambiguous — each peak maps to a specific bond vibration, and the pattern uniquely fingerprints the molecule. The inverse question is harder: given only these peaks and their intensities, can we reconstruct the molecular structure that produced them? The forward map is a smooth function. The inverse map is a nightmare.
The Forward Map
The starting point is the Wilson GF secular equation:
$$\det(\mathbf{GF} - \lambda \mathbf{I}) = 0$$
The matrix G encodes atomic masses and molecular geometry. The matrix F is the force constant matrix — essentially the Hessian of the potential energy surface. The eigenvalues give the squared vibrational frequencies, and the eigenvectors determine which modes are observable by IR and Raman spectroscopy.
What makes the forward map well-behaved is that it's smooth, computable, and well-conditioned. Given any reasonable molecular geometry, you can compute the full IR and Raman spectrum to arbitrary precision. DFT codes do this routinely at the B3LYP/def2-TZVP level.
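As a minimal illustration of the forward map, here is the GF eigenproblem for a toy diatomic in one internal coordinate (the masses and force constant are illustrative, in natural units; this is a sketch, not the post's DFT pipeline):

```python
import numpy as np

def vibrational_frequencies(G, F):
    """Solve det(GF - lambda*I) = 0; frequencies are sqrt of the eigenvalues."""
    lam = np.linalg.eigvals(G @ F)
    return np.sqrt(np.sort(lam.real))

# Diatomic with a single stretch coordinate: G = 1/m1 + 1/m2, F = k.
m1, m2, k = 1.0, 16.0, 5.0                 # illustrative masses and force constant
G = np.array([[1.0 / m1 + 1.0 / m2]])
F = np.array([[k]])
omega = vibrational_frequencies(G, F)      # sqrt(k * (1/m1 + 1/m2))
```

Scaling the same recipe up to a real molecule only changes the size of G and F; the eigenproblem stays identical, which is why the forward direction is routine.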
The inverse map has none of these properties.
Why the Inverse Fails: Symmetry
The fundamental obstruction to inversion is molecular symmetry. A molecule's point group G determines which vibrational modes are visible to each technique. The selection rules are strict:
- A mode is IR-active only if it transforms as a translation (changes the dipole moment)
- A mode is Raman-active only if it transforms as a quadratic form (changes the polarizability)
- Modes that do neither are silent — permanently invisible to both techniques
The Information Completeness Ratio measures the damage:
$$R(G, N) = \frac{N_{\text{IR}} + N_{\text{Raman}}}{3N - 6}$$
Here $N_{\text{IR}}$ and $N_{\text{Raman}}$ count distinct active modes (a mode active in both techniques is counted once), and the denominator is $3N - 5$ for linear molecules. When $R = 1$, every vibrational degree of freedom is observable by at least one technique. When $R < 1$, information is permanently lost.
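A sketch of the ratio as code, with mode activity given as sets so a mode active in both techniques is counted once (the CO₂ assignments below are the standard ones, labeled by index for illustration):

```python
def completeness_ratio(ir_active, raman_active, n_atoms, linear=False):
    """R(G, N): fraction of vibrational DOF visible to at least one technique."""
    dof = 3 * n_atoms - (5 if linear else 6)
    return len(set(ir_active) | set(raman_active)) / dof

# CO2 (linear, N = 3, so 4 modes): the antisymmetric stretch and the two
# degenerate bends are IR-only; the symmetric stretch is Raman-only.
R = completeness_ratio(ir_active={1, 2, 3}, raman_active={0}, n_atoms=3, linear=True)
```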
Theorem 1 — Symmetry Quotient
The vibrational forward map is G-invariant: it factors through the quotient space M/G. The inverse map recovers structure only up to symmetry equivalence. When R(G, N) = 1, the quotient map is potentially injective. When R < 1, the silent modes create a degenerate fiber — multiple distinct force constant matrices produce identical spectra.
How bad does it get? For 99.9% of organic molecules, R = 1 and everything is observable. But the high-symmetry exceptions, the molecules with silent modes, are exactly the ones where inversion fails.
Modal Complementarity
There is a structural result that makes combined IR + Raman strictly better than either alone. For molecules with a center of inversion (the centrosymmetric ones — CO₂, benzene, cubane), the mutual exclusion principle applies:
Theorem 2 — Modal Complementarity
For centrosymmetric molecules, IR-active and Raman-active modes are completely disjoint. Gerade (symmetric) modes are Raman-only. Ungerade (antisymmetric) modes are IR-only. Combined measurement always strictly increases the observable degrees of freedom.
This is not an approximation — it follows directly from the character table. The practical consequence: any ML model that fuses IR + Raman should see its largest accuracy gains on centrosymmetric molecules. This is a testable, quantitative prediction from the theory.
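The disjointness can be checked mechanically from parity labels. The CO₂ assignments below are standard, with readable names chosen for illustration:

```python
# Gerade/ungerade parity labels for CO2's four normal modes.
modes = {
    "symmetric_stretch": "g",
    "antisymmetric_stretch": "u",
    "bend_1": "u",
    "bend_2": "u",
}
raman_active = {m for m, parity in modes.items() if parity == "g"}  # gerade
ir_active = {m for m, parity in modes.items() if parity == "u"}     # ungerade
# Mutual exclusion: the two sets never overlap, but together they cover
# every mode, so the combined measurement sees all of them.
```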
Generic Identifiability
The central open question is whether combined IR + Raman can uniquely determine molecular structure (up to symmetry equivalence) at generic points:
Conjecture 3 — Generic Identifiability
For almost all molecular geometries (outside a measure-zero set), the combined IR + Raman forward map is injective on the quotient space: distinct force constant equivalence classes produce distinct combined spectra.
This is a conjecture, not a theorem. The obstruction to proving it is that the forward map's smoothness breaks at eigenvalue degeneracies, so Sard's theorem does not directly apply. But the numerical evidence is strong: the combined spectra show roughly a 4:1 overdetermination ratio, meaning they contain about four times more equations than unknowns. The inverse problem is not just solvable — it is well-conditioned.
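One plausible bookkeeping behind the 4:1 figure (an assumption on my part; the post does not spell out its counting): each of the 3N − 6 modes contributes a frequency, an IR intensity, a Raman intensity, and a depolarization ratio, while the unknowns are the 3N − 6 internal coordinates.

```python
def overdetermination_ratio(n_atoms):
    """Equations vs. unknowns for the combined inverse problem.
    Assumed counting: frequency + IR intensity + Raman intensity +
    depolarization ratio per mode; unknowns are internal coordinates."""
    modes = 3 * n_atoms - 6
    equations = 4 * modes
    unknowns = modes
    return equations / unknowns
```

Under this counting the ratio is exactly 4 for any nonlinear molecule, independent of size.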
The Architecture: Spektron
The theory says what is achievable. The model is designed to get there. Spektron is a CNN-SSM encoder with a Variational Information Bottleneck (VIB) that splits the latent space into chemistry and instrument:
The pipeline has five stages:
- Embed: a 1D CNN with 7 convolutional layers compresses the raw spectrum (2048 wavenumber channels) into 76 overlapping patches, each capturing local peak shapes and fine-grained spectral features.
- Scan: four layers of D-LinOSS (Damped Linear Oscillatory State Space), a structured SSM that processes the full sequence in O(n) time while maintaining stable long-range dynamics via CFL-clamped recurrence.
- Route: Mixture-of-Experts gating selects the top 2 of 4 specialized expert networks for each patch, so different spectral regions (fingerprint vs. functional group) are processed by different parameter subsets.
- Split: the VIB head projects the routed features into two disentangled latent spaces, z_chem (128 dimensions of transferable chemical identity) and z_inst (64 dimensions of instrument-specific artifacts).
- Predict: task-specific heads apply masked reconstruction during pretraining and property regression during fine-tuning.
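A shape-only walk through those stages, with the dimensions from the post and random matrices standing in for learned weights (the window/stride sizes and the 192-wide patch embedding, chosen so the 128 + 64 split lines up, are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, win, stride = 192, 64, 26                   # hypothetical tokenizer sizes

x = rng.standard_normal(2048)                  # raw spectrum, 2048 channels
# Embed: 76 overlapping windows, each projected to a d-dim patch token
# (a linear stand-in for the 7-layer CNN tokenizer).
windows = np.stack([x[i * stride : i * stride + win] for i in range(76)])
patches = windows @ rng.standard_normal((win, d)) / win ** 0.5
# Scan/Route stand-in: any sequence mixer with the same in/out shape.
mixed = patches @ rng.standard_normal((d, d)) / d ** 0.5
pooled = mixed.mean(axis=0)                    # pool the 76-token sequence
z_chem, z_inst = pooled[:128], pooled[128:]    # Split: the VIB head
```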
$$\mathcal{L}_{\text{VIB}} = \mathbb{E}_{q(z|x)}\!\left[-\log p(y|z)\right] + \beta \, D_{\text{KL}}\!\left(q(z|x) \,\|\, p(z)\right)$$
The latent vector splits into z_chem (128 dimensions, transferable chemistry) and z_inst (64 dimensions, instrument artifacts). At transfer time, z_inst is discarded — only the chemistry survives.
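The KL term has a closed form when q(z|x) is a diagonal Gaussian and the prior is standard normal — the usual VIB/VAE formula, sketched here for the 192-dimensional latent:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# A posterior equal to the prior pays zero bottleneck penalty...
zero = kl_to_standard_normal(np.zeros(192), np.zeros(192))
# ...and any deviation is penalized; beta scales the trade-off against
# the reconstruction term.
shifted = kl_to_standard_normal(np.full(192, 0.5), np.zeros(192))
```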
Why the 128+64 Split
The 2:1 ratio between z_chem and z_inst is not arbitrary — it reflects the intrinsic dimensionality gap between chemical identity and instrument variation.
Chemical identity is high-dimensional. The QM9S training set contains roughly 130K unique molecules, each with a distinct combination of functional groups, ring systems, heteroatom positions, and conformational preferences. A meaningful embedding must capture not just coarse functional group presence (which ~20 dimensions could handle) but fine-grained distinctions: the difference between ortho- and meta-substituted benzenes, between primary and secondary amines, between strained and unstrained ring systems. Principal component analysis on computed force constant matrices shows that roughly 80-100 dimensions are needed to capture 95% of the variance across the QM9 chemical space. We allocate 128 dimensions — enough headroom for the nonlinear manifold structure that a neural encoder learns, which typically requires 1.2-1.5x the linear intrinsic dimensionality.
Instrument variation, by contrast, is low-dimensional. The dominant instrument effects — baseline drift (2-3 DOF for polynomial curvature), wavelength/wavenumber shift (1 DOF), intensity scaling (1 DOF), and spectral resolution broadening (1 DOF) — account for perhaps 8-10 true degrees of freedom. But we allocate 64 dimensions rather than 10 because the mapping from these physical effects to spectral distortions is highly nonlinear: a small wavelength shift produces peak-position-dependent intensity changes across the entire spectrum, and baseline curvature interacts with peak height in complex ways. The 64-dimensional z_inst space gives the VIB enough capacity to capture these nonlinear interactions without requiring the encoder to disentangle them into the clean physical parameters. At transfer time, all 64 dimensions are discarded — the over-allocation is essentially free since it only costs capacity during training, not at inference.
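The 95%-variance argument can be sketched with PCA via the SVD. The data below is synthetic with a decaying variance spectrum, purely to show the counting procedure — it is not the QM9 force-constant analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features with variance decaying across 200 axes.
Z = rng.standard_normal((1000, 200)) * np.linspace(3.0, 0.1, 200)
sing = np.linalg.svd(Z - Z.mean(axis=0), compute_uv=False)
explained = np.cumsum(sing ** 2) / np.sum(sing ** 2)
# Number of principal components needed to reach 95% explained variance.
k95 = int(np.searchsorted(explained, 0.95)) + 1
```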
Architecture Ablation
Key Design Choice
A 1D CNN tokenizer before the SSM backbone gives 8-10% accuracy gains over raw patch tokenization on spectral data. Vibrational peaks are sharp, narrow features — convolutional kernels capture this local structure before the SSM handles global context. This is the single largest architectural improvement in ablation studies.
The numbers tell a layered story. The CNN-only baseline (71.2%) captures local peak shapes effectively — it knows what a carbonyl stretch looks like — but cannot model the long-range correlations between distant spectral regions. The difference between a primary amide and a secondary amide shows up as correlated changes in both the N-H stretching region (~3300 cm⁻¹) and the amide I/II bands (~1650/1550 cm⁻¹), separated by over 1500 wavenumber channels. A CNN with reasonable kernel sizes simply cannot see both ends of this correlation simultaneously.
The pure Transformer (78.3%) handles long-range correlations well through self-attention but struggles with the raw input: spectral peaks are 5-15 channels wide in a 2048-channel spectrum. Without convolutional preprocessing, the Transformer must learn peak detection from scratch in its early layers — wasting capacity on a task that a simple 1D convolution solves trivially. This is why the CNN tokenizer provides such a large boost.
CNN+Transformer (83.7%) combines the best of both: convolutional peak detection followed by attentional long-range modeling. But it pays O(n²) in sequence length for the attention mechanism. With 76 patches this is manageable, but it scales poorly if we want to process higher-resolution spectra or longer sequences.
CNN+D-LinOSS (84.2%) matches the CNN+Transformer accuracy while running in O(n) time. The D-LinOSS backbone is a damped linear oscillatory state space model — physically, it models the spectrum as a driven oscillator system, which is a natural inductive bias for vibrational data. The damped recurrence captures long-range correlations through persistent state evolution rather than explicit pairwise attention. The CFL stability constraint (clamping the recurrence eigenvalues inside the unit circle) prevents the gradient explosion that plagues vanilla SSMs on long sequences. At 2048 channels, the wall-clock speedup over the Transformer is 1.8x; at 8192 channels (high-resolution FT-IR), it would be ~7x.
The 0.5% accuracy advantage of D-LinOSS over Transformer is modest but consistent across random seeds. The real advantage is scaling: the same architecture handles 2K, 4K, or 8K channel spectra without quadratic blowup, which matters for high-resolution instruments and broadband spectral fusion.
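A minimal damped-oscillator recurrence with the stability clamp described above (parameter names and values are illustrative; the real D-LinOSS layer is learned, multi-state, and parallelized):

```python
import numpy as np

def damped_oscillator_scan(u, freq=0.3, damping=0.05, max_radius=0.999):
    """O(n) scan with a damped complex rotation as the recurrence eigenvalue."""
    a = np.exp(-damping + 1j * freq)        # damped rotation: decay * oscillation
    a = a / max(1.0, abs(a) / max_radius)   # clamp |a| inside the unit circle
    h, out = 0.0 + 0.0j, []
    for u_t in u:
        h = a * h + u_t                     # persistent state carries long context
        out.append(h.real)
    return np.array(out)

y = damped_oscillator_scan(np.ones(2048))   # bounded even on long sequences
```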
Calibration Transfer
The practical test case. A model trained on spectra from instrument A fails on instrument B — different detectors, optical paths, lamp aging all shift the spectral shape. Current approaches (PDS, SBC) require 25+ paired transfer samples. The VIB architecture targets far fewer by learning instrument-invariant representations during pretraining.
The transfer objective aligns latent distributions across instruments using Sinkhorn-based optimal transport:
$$\mathcal{L}_{\text{OT}} = W_\epsilon\!\left( q(z_{\text{chem}} \mid \mathcal{D}_A), \; q(z_{\text{chem}} \mid \mathcal{D}_B) \right)$$
Combined with test-time training — running a few self-supervised gradient steps at inference on the new instrument — this enables adaptation without labeled target data.
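A minimal entropic-OT (Sinkhorn) cost between two latent batches, as a sketch of the alignment term above — not Spektron's implementation, and the mean-cost scaling of the regularizer is an ad hoc choice to keep the kernel well-conditioned:

```python
import numpy as np

def sinkhorn_cost(A, B, eps=0.1, iters=200):
    """Entropy-regularized OT cost between two point clouds, uniform weights."""
    C = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-C / (eps * C.mean() + 1e-12))            # Gibbs kernel
    u = np.full(len(A), 1.0 / len(A))
    v = np.full(len(B), 1.0 / len(B))
    b = v.copy()
    for _ in range(iters):                               # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.T @ a)
    P = a[:, None] * K * b[None, :]                      # transport plan
    return float((P * C).sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
aligned, shifted = sinkhorn_cost(A, A), sinkhorn_cost(A, A + 3.0)
```

Minimizing this cost pulls the two instruments' z_chem distributions together, which is exactly the invariance the transfer objective wants.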
The Corn Moisture Benchmark
The corn dataset is the standard benchmark for calibration transfer: 80 samples of corn measured on three near-infrared instruments (m5, mp5, mp6), each recording 700 wavelength channels. The task is to predict moisture content. It is small, well-characterized, and every calibration transfer method in the literature reports numbers on it.
Piecewise Direct Standardization (PDS) — the classical approach — builds a linear transfer matrix between instruments using paired measurements of the same samples on both instruments. It achieves strong R² values (~0.94-0.96) but requires 25+ paired samples to build a stable transfer matrix. In practice, this means running 25 identical corn samples on both the old and new instruments — expensive, time-consuming, and sometimes impossible (e.g., when the old instrument has been decommissioned).
LoRA-CT (Low-Rank Adaptation for Calibration Transfer) is a recent deep learning approach that fine-tunes a pretrained spectral model using low-rank adapter matrices. It reduces the sample requirement to 10-15 paired samples while matching or exceeding PDS accuracy. The key insight is that instrument differences live in a low-rank subspace — LoRA naturally captures this structure.
Spektron with test-time training (TTT) targets a more aggressive operating point: comparable R² with 5 or fewer transfer samples. The mechanism is different from both PDS and LoRA-CT. Instead of learning an explicit transfer function (PDS) or fine-tuning model weights (LoRA-CT), Spektron discards z_inst entirely at transfer time and adapts z_chem through a few self-supervised gradient steps on unlabeled spectra from the new instrument. The self-supervised objective (masked spectral reconstruction) requires no labels — just raw spectra from the target instrument. With 5 unlabeled samples providing the TTT signal and the pretrained z_chem space providing the chemical prior, the model adapts its internal representation to the new instrument's characteristics without ever seeing a labeled target sample.
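The TTT mechanism can be sketched under stated simplifications: a linear map stands in for the adaptable encoder, and masked reconstruction on five unlabeled synthetic "spectra" provides the only gradient signal:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 64))             # 5 unlabeled target spectra
mask = rng.random(X.shape) < 0.25            # channels hidden from the input
X_in = np.where(mask, 0.0, X)                # masked input
W = np.eye(64) + 0.01 * rng.standard_normal((64, 64))  # adaptable weights

def loss_and_grad(W):
    err = (X_in @ W - X) * mask              # score only the hidden channels
    return (err ** 2).mean(), 2.0 * X_in.T @ err / mask.sum()

before, _ = loss_and_grad(W)
for _ in range(20):                          # a few self-supervised steps
    _, g = loss_and_grad(W)
    W = W - 0.05 * g
after, _ = loss_and_grad(W)                  # reconstruction improved, no labels
```

No label from the target instrument is ever touched; the adaptation signal is reconstruction alone, which is the qualitative difference from PDS and LoRA-CT.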
Benchmark Target
R² > 0.952 on corn moisture prediction (beating LoRA-CT) with ≤5 transfer samples across three NIR instruments (m5, mp5, mp6). PDS requires 25+ labeled pairs to reach this threshold. LoRA-CT requires 10-15. Spektron's VIB + TTT approach targets the same accuracy with only unlabeled spectra from the target instrument — a 5x reduction in labeled data requirements over LoRA-CT and a qualitative shift from supervised to self-supervised transfer.
Current Status
The theoretical framework is complete. The model is pretraining on QM9S (130K molecules, computed IR + Raman + UV at B3LYP/def2-TZVP) and ChEMBL (220K experimental spectra). Next: symmetry-stratified evaluation to test whether empirical accuracy tracks R(G, N) as the theory predicts. Details on the theory are in the companion post on spectral identifiability.
Originally published at tubhyam.dev/blog/spectral-inverse-problem