State Space Models for Spectroscopy: Why Sequence Models Beat CNNs
A vibrational spectrum is conventionally treated as a fixed-length vector — 3501 intensity values spanning 500 to 4000 cm⁻¹. You feed it into a 1D CNN, extract features, and classify. This works. It also misses something fundamental.
A spectrum is a sequence. The O-H stretch at 3400 cm⁻¹ is physically correlated with the O-H bend at 1640 cm⁻¹ — they're different vibrations of the same bond. The C=O stretch at 1720 cm⁻¹ shifts when a neighboring C-H appears at 2950 cm⁻¹, because the bonds share electron density. These correlations span thousands of wavenumbers — far beyond the receptive field of any practical CNN.
State space models (SSMs) process sequences with linear-time complexity while maintaining a compressed memory of the entire history. Applied to spectra, this means the model at wavenumber 3400 cm⁻¹ already "remembers" what it saw at 500 cm⁻¹. No skip connections, no attention, no quadratic cost.
The Receptive Field Problem
A 1D CNN with kernel size $k$ and $L$ layers has an effective receptive field of $L \times (k-1) + 1$ points. For a standard architecture — $k=7$, $L=6$ — that's 37 points, or about 37 cm⁻¹. The O-H stretch and O-H bend are separated by 1760 cm⁻¹.
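The arithmetic is worth making concrete. A quick sketch of the receptive-field formula, with an optional uniform dilation factor (a simplification; real dilated stacks usually grow the dilation per layer):

```python
def receptive_field(kernel_size: int, num_layers: int, dilation: int = 1) -> int:
    """Effective receptive field of stacked stride-1 1D convolutions:
    each layer adds dilation * (kernel_size - 1) points of context."""
    return num_layers * dilation * (kernel_size - 1) + 1

print(receptive_field(kernel_size=7, num_layers=6))  # 37 points
```

At 1 cm⁻¹ sampling, 37 points is 37 cm⁻¹ — two orders of magnitude short of the O-H stretch/bend separation.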
You can increase the CNN receptive field with dilated convolutions or deeper networks, but both have costs. Dilated convolutions create "gridding artifacts" — they sample the input at regular intervals and miss features between the dilation gaps. Deeper networks require more parameters and are harder to train.
Transformers solve the receptive field problem completely — full attention connects every point to every other point. But attention is $O(N^2)$ in sequence length. For a 3501-point spectrum, that's 12 million attention scores per layer. It works, but it's expensive, and the cost grows quadratically if you increase spectral resolution.
State Space Models: The Third Option
An SSM processes a sequence by maintaining a hidden state that evolves according to a linear dynamical system. The formalism starts in continuous time and is discretized for computation:

$$\mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}\mathbf{h}(t)$$

$$\mathbf{h}_t = \bar{\mathbf{A}}\mathbf{h}_{t-1} + \bar{\mathbf{B}}x_t, \qquad y_t = \mathbf{C}\mathbf{h}_t$$
The matrix $\mathbf{A}$ is the state transition — it determines how memory decays and which frequencies are preserved. $\mathbf{B}$ controls how new input enters the state. $\mathbf{C}$ reads from the state to produce output. The hidden state $\mathbf{h}_t$ is a compressed representation of the entire history $x_0, x_1, \ldots, x_{t-1}$.
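Concretely, the continuous system is discretized before it touches data. A minimal numpy sketch of zero-order-hold discretization and the recurrent view, restricted to a diagonal state matrix for simplicity (helper names are illustrative, not from any library):

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-order hold: exact discretization when A is diagonal."""
    A_bar = np.exp(dt * A_diag)          # per-dimension state decay
    B_bar = (A_bar - 1.0) / A_diag * B   # exact ZOH input scaling
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C . h_t"""
    h = np.zeros_like(A_bar)
    ys = []
    for xt in x:
        h = A_bar * h + B_bar * xt
        ys.append(C @ h)
    return np.array(ys)
```

With negative real parts on the diagonal of $\mathbf{A}$, each state dimension is a leaky accumulator whose decay rate sets how far back it remembers.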
The critical innovation in S4 (Efficiently Modeling Long Sequences with Structured State Spaces, Gu et al., 2022) is the HiPPO initialization of the $\mathbf{A}$ matrix. Instead of random initialization, $\mathbf{A}$ is set to optimally compress the history under a specific measure: it projects the input history onto a basis of Legendre polynomials, retaining long-range dependencies that a randomly initialized system would forget exponentially fast. The HiPPO matrix has a beautiful structure: its off-diagonal entries $A_{nk} = -(2n+1)^{1/2}(2k+1)^{1/2}$ for $n > k$, together with $A_{nn} = -(n+1)$ on the diagonal, form a lower-triangular matrix in which each state dimension captures a different polynomial moment of the history. The hidden state is, in effect, a polynomial approximation of everything the model has seen.
Why Linear Dynamics Work
The linearity of the state transition is not a limitation — it's a feature. A linear SSM can be unrolled into a global convolution: $y = K * u$ where the kernel is $K = (\mathbf{CB}, \mathbf{CAB}, \mathbf{CA}^2\mathbf{B}, \ldots)$. This means a single SSM layer has a receptive field equal to the full sequence length. For a 2048-point spectrum, the kernel $K$ has 2048 entries — each output point is a weighted sum of all input points, with weights determined by the learned matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$.
This dual interpretation is what gives SSMs their computational edge. During training, the convolution view enables parallel computation across the entire sequence using FFT — compute $K$ once, then $y = \text{iFFT}(\text{FFT}(K) \odot \text{FFT}(u))$ in $O(N \log N)$ time. At inference, the recurrence view processes each new point in $O(1)$ time, maintaining a running hidden state. No other architecture has both properties simultaneously.
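Assuming a time-invariant, diagonal state transition, both views fit in a few lines of numpy. The function names are illustrative; the FFT path zero-pads to avoid circular wrap-around, which the formula above leaves implicit:

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """Materialize K_l = C A^l B for a diagonal SSM, l = 0..L-1."""
    powers = A_bar[None, :] ** np.arange(L)[:, None]  # (L, N)
    return powers @ (C * B_bar)                       # (L,)

def causal_conv_fft(K, u):
    """y = K * u in O(N log N) via FFT, zero-padded to keep it causal/linear."""
    n = len(u)
    f = np.fft.rfft(K, 2 * n) * np.fft.rfft(u, 2 * n)
    return np.fft.irfft(f, 2 * n)[:n]
```

The test of the duality is that this convolution reproduces the step-by-step recurrence exactly, which is what lets training use the parallel view and inference the recurrent one.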
For spectra, the full-sequence kernel matters because physical correlations are genuinely global. The C-H stretch at 2900 cm⁻¹ constrains what is possible in the fingerprint region at 1000 cm⁻¹ — these peaks arise from the same molecule and share electron density. A CNN with a 37 cm⁻¹ receptive field cannot see this constraint. An SSM encodes it directly in the convolution kernel.
From S4 to D-LinOSS
The SSM landscape has evolved rapidly. Three generations matter for spectral applications:
S4 (2022) — The original structured state space. Parameterizes $\mathbf{A}$ as a diagonal-plus-low-rank matrix for efficiency (the later S4D variant simplifies this to a purely diagonal form). Showed that SSMs could match Transformers on long-range benchmarks (Path-X, ListOps) while being much faster.
Mamba (2023) — Made $\mathbf{B}$ and $\mathbf{C}$ input-dependent (selective state spaces). The transition matrices now depend on the input, allowing the model to selectively remember or forget information. This broke the convolution interpretation but enabled much better performance on language tasks.
D-LinOSS (2024) — Damped Linear Oscillatory State Space. Returns to a diagonalizable, oscillator-based state transition but with a learnable discretization step that adapts to the input. Combines S4's parallelism with Mamba's input-dependent behavior.
The pure SSM outperforms the pure CNN by 9 points — the global receptive field matters. But the hybrid architecture (CNN tokenizer + SSM backbone) beats both, because peak shapes are inherently local features that CNNs capture better than SSMs, while cross-peak correlations are global features that SSMs capture better than CNNs.
D-LinOSS Deep Dive: Damped Oscillators as State Transitions
D-LinOSS deserves more than a one-line summary, because its design is deeply connected to the physics of the signals it processes.
The key innovation is replacing Mamba's data-dependent gates with a physically-motivated damped oscillator. Each state dimension in D-LinOSS evolves according to:
$$\mathbf{M} = \begin{pmatrix} \cos(\omega\Delta t) & \sin(\omega\Delta t)/\omega \\ -\omega\sin(\omega\Delta t) & \cos(\omega\Delta t) \end{pmatrix} \cdot e^{-\gamma\Delta t}$$
This is the exact solution to a damped harmonic oscillator with frequency $\omega$ and damping $\gamma$. The state does not just accumulate information — it oscillates, maintaining a frequency-specific memory of the input. For vibrational spectroscopy, this is natural: each state dimension resonates at a learned frequency, acting as a tunable bandpass filter over the wavenumber axis. The damping term $e^{-\gamma\Delta t}$ controls how quickly old information decays — high damping for local features, low damping for long-range correlations.
The discretization step $\Delta t$ is input-dependent (computed by a linear projection of each token), giving D-LinOSS the selectivity of Mamba while preserving the parallelizable structure of S4. When the model encounters a strong peak, it can increase $\Delta t$ to "take a bigger step" through state space, effectively allocating more representational capacity. When it encounters baseline, it can decrease $\Delta t$ to coast through.
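The transition matrix is easy to sanity-check numerically: its eigenvalues should sit at radius $e^{-\gamma\Delta t}$, strictly inside the unit circle whenever $\gamma > 0$. A small sketch (parameter names are illustrative):

```python
import numpy as np

def oscillator_transition(omega: float, gamma: float, dt: float) -> np.ndarray:
    """One exact step of a damped harmonic oscillator:
    rotate at frequency omega, decay at rate gamma."""
    c, s = np.cos(omega * dt), np.sin(omega * dt)
    return np.exp(-gamma * dt) * np.array([[c, s / omega],
                                           [-omega * s, c]])

M = oscillator_transition(omega=2.0, gamma=0.1, dt=0.5)
print(np.abs(np.linalg.eigvals(M)))  # both ~ exp(-0.05) ~ 0.951
```

The rotation block contributes eigenvalues of unit magnitude, so the damping factor alone controls stability — which is exactly where the trouble described next comes from.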
CFL Stability: The Training Killer
The damped oscillator formulation introduces a subtle but critical numerical constraint. The state transition matrix $\mathbf{M}$ is stable only when its eigenvalues remain inside the unit circle. For the discretized system, this requires:
$$\alpha = \frac{\Delta t^2 \cdot A}{S} \leq 2$$
where $A$ is the diagonal of the state matrix (which grows during training via gradient updates) and $S$ is a normalization constant. This is analogous to the Courant-Friedrichs-Lewy (CFL) condition in numerical PDE solvers — take too large a step and the simulation explodes.
In practice, this manifests as NaN around steps 1000-1400 during training. The $A$ parameters grow through gradient updates, eventually violating $\alpha \leq 2$. When the eigenvalues exit the unit circle, the 2048-step recurrence diverges exponentially: a perturbation of $10^{-6}$ at step 0 becomes $10^{+6}$ at step 2048, overflowing even float32.
The fix is a soft clamp that prevents $\alpha$ from reaching 2 while maintaining smooth gradients:
$$\alpha_{\text{clamped}} = 1.99 \cdot \tanh(\alpha / 1.99)$$
This maps $\alpha \in [0, \infty) \to [0, 1.99)$, keeping eigenvalues strictly inside the unit circle. The $\tanh$ ensures gradients flow smoothly even when $\alpha$ approaches the boundary — unlike a hard clamp, which would zero out gradients and stall learning. After applying this fix, training runs stably to 50K+ steps without a single NaN.
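The clamp itself is a one-liner. A sketch mirroring the formula above, with the bound of 1.99:

```python
import numpy as np

def soft_clamp(alpha, bound=1.99):
    """Smoothly map [0, inf) -> [0, bound); near-identity for small alpha,
    saturating (with nonzero gradient) as alpha grows."""
    return bound * np.tanh(alpha / bound)

print(soft_clamp(0.5))    # ~0.49: nearly unchanged in the stable regime
print(soft_clamp(100.0))  # ~1.99: saturated just below the stability bound
```

Because $\tanh$ is close to the identity near zero, well-behaved values pass through almost untouched; only values approaching the CFL bound get compressed.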
Why Spectra Are Ideal for SSMs
SSMs excel on sequences with specific properties — and vibrational spectra have all of them:
1. Long-range dependencies are physically meaningful. The correlation between the O-H stretch and the O-H bend is not a statistical artifact — it's a consequence of shared atomic displacement vectors. SSMs that model this correlation produce better molecular embeddings.
2. The sequence has a natural ordering. Wavenumber is a physical axis with units. Unlike token sequences in language (where position is arbitrary), the wavenumber axis has a metric structure. Adjacent points are more correlated than distant points, but distant correlations also exist.
3. Resolution can vary. Some spectral regions are information-dense (the fingerprint region, 500-1500 cm⁻¹) and others are sparse (2000-2500 cm⁻¹ for most organic molecules). An input-dependent SSM can allocate more state capacity to information-dense regions — something fixed architectures cannot do.
4. Sequence length is moderate. At 3501 points, a spectrum is long enough that Transformers become expensive but short enough that SSMs are extremely efficient. The sweet spot for SSMs is sequences of length 1K-100K — exactly where spectral data lives.
The Selective Attention Analogy
When Mamba or D-LinOSS processes an IR spectrum, the input-dependent gating learns to "pay attention" at peaks and "skip" over baselines. This is analogous to how a spectroscopist reads a spectrum: scan quickly over featureless regions, slow down at peaks, and relate distant peaks to each other. The SSM learns this reading strategy from data.
The Hybrid Architecture in Spektron
Spektron uses a CNN tokenizer → D-LinOSS backbone architecture. The CNN converts the raw 3501-point spectrum into 128 tokens, each representing a ~27 cm⁻¹ window. The D-LinOSS layers then process these tokens as a sequence, building global representations.
The CNN tokenizer provides two things the SSM needs: local feature extraction (peak shapes, shoulders, multiplets) and dimensionality reduction (3501 → 128 tokens). The D-LinOSS backbone then relates these local features across the full spectral range, producing representations where the O-H token "knows about" the C=O token 1000 cm⁻¹ away.
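At the level of shapes, the pipeline looks like the following numpy sketch. The sizes follow the numbers in the post (3501 points → 128 tokens); the random projection merely stands in for the learned conv filters, and all names are hypothetical, not Spektron's actual modules:

```python
import numpy as np

def cnn_tokenize(spectrum, n_tokens=128, d_model=64):
    """Stand-in tokenizer: overlapping windows + a fixed random projection.
    (The real model uses learned, stacked conv filters; this shows shapes only.)"""
    rng = np.random.default_rng(0)
    n = len(spectrum)
    hop = n // n_tokens                       # ~27 points per token
    win = 2 * hop                             # 50% overlap between windows
    padded = np.pad(spectrum, (0, win))
    windows = np.stack([padded[i * hop : i * hop + win] for i in range(n_tokens)])
    W = rng.standard_normal((win, d_model)) / np.sqrt(win)
    return windows @ W                        # (n_tokens, d_model)

tokens = cnn_tokenize(np.random.default_rng(1).standard_normal(3501))
print(tokens.shape)  # (128, 64)
```

The (128, d_model) token sequence is what the D-LinOSS backbone consumes; the overlap is what lets peaks near window edges survive tokenization.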
Ablation: CNN Tokenizer Matters
The CNN tokenizer is not optional. Replacing it with simple patch tokenization (chop the spectrum into 128 non-overlapping windows) drops accuracy by 8-10%.
The reason: vibrational peaks are sharp, asymmetric features that don't align with fixed patch boundaries. A peak at the edge of a patch gets split between two tokens, destroying its shape information. The CNN's overlapping receptive fields and learned filters capture peak shapes regardless of alignment.
Architecture Lesson
For 1D signal data with sharp local features and long-range correlations, the optimal architecture is a hybrid: CNN for local feature extraction → SSM for global context. This pattern applies beyond spectroscopy to any signal where local structure and global dependencies both matter — ECG, seismology, audio, time series.
Practical Considerations
Training SSMs on spectral data has a few gotchas, most of which we discovered the hard way during Spektron development.
Numerical Stability: bfloat16 vs float16
D-LinOSS uses complex-valued state matrices that can produce extreme intermediate values (±200K) before the GLU gate. Under mixed-precision training (AMP), the choice of reduced-precision format matters enormously:
float16 has a range of ±65,504. Values above this overflow to infinity, which propagates as NaN through the rest of the computation. The D-LinOSS state values routinely hit ±200K during normal operation — not a bug, but a consequence of the oscillatory dynamics. Float16 cannot represent these values.
bfloat16 has the same range as float32 (±3.4 × 10^38) but only 7 bits of mantissa precision versus float32's 23 bits. The reduced precision is acceptable for neural network training — gradient noise already introduces far more error than 7-bit rounding. But the range is critical: bfloat16 comfortably holds ±200K values without overflow.
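The overflow is easy to reproduce. A minimal numpy demonstration of the float16 half of the story (numpy has no bfloat16, so only the failing format is shown):

```python
import numpy as np

state = np.float32(2.0e5)  # typical D-LinOSS intermediate magnitude
f16 = np.float16(state)
print(f16)                 # inf: 200000 exceeds float16's +/-65504 range
print(f16 - f16)           # nan: inf - inf, and the NaN then propagates
print(np.float32(state))   # 200000.0: comfortably inside float32/bfloat16 range
```

Once a single intermediate hits inf, the very next subtraction or normalization manufactures a NaN, which is why the failure shows up far downstream of the actual overflow.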
The fix is two-layered. First, use bfloat16 instead of float16 for AMP. Second, force the entire LinOSSBlock to run in float32 via torch.amp.autocast('cuda', enabled=False), allowing only the rest of the model (CNN tokenizer, projections, loss computation) to use bfloat16. This costs approximately 15% more memory but eliminates the NaN failures completely.
Gradient Clipping and Weight Decay
Standard gradient clipping at 1.0 is too aggressive for SSMs. The $\mathbf{A}$ matrix gradients are inherently larger than typical parameter gradients because they affect every timestep in the recurrence — a small change to $\mathbf{A}$ compounds over 2048 steps. Clipping at 1.0 throttles these gradients excessively, slowing convergence by 3-5x.
We use gradient clipping at 5.0, which prevents catastrophic gradient spikes (which do occur when the CFL condition is nearly violated) while allowing the SSM parameters to learn at a reasonable rate.
Weight decay must be excluded for three parameter categories: LayerNorm parameters (which should remain at their learned scale), bias terms (which regularize toward zero unnaturally with weight decay), and embedding parameters (including the learnable mask token used in BERT-style pretraining). Applying weight decay to these parameters creates a persistent bias toward zero that fights the learning signal, manifesting as a subtle but consistent 1-2% accuracy drop.
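One way to implement the exclusion is to partition parameters by name before building the optimizer. A framework-agnostic sketch — the keyword list and the example parameter names are hypothetical, not Spektron's actual module names:

```python
# Parameters whose names match these substrings get weight_decay = 0.0;
# everything else gets the normal decay (e.g. 0.01). Keywords are assumptions.
NO_DECAY_KEYWORDS = ("norm", "bias", "embed", "mask_token")

def split_decay_groups(named_params):
    """Return (decay, no_decay) lists of parameter names."""
    decay, no_decay = [], []
    for name, _ in named_params:
        target = no_decay if any(k in name.lower() for k in NO_DECAY_KEYWORDS) else decay
        target.append(name)
    return decay, no_decay

names = [("backbone.layer0.A_log", None), ("backbone.layer0.bias", None),
         ("norm1.weight", None), ("embedding.mask_token", None),
         ("tokenizer.conv1.weight", None)]
decay, no_decay = split_decay_groups(names)
print(decay)     # ['backbone.layer0.A_log', 'tokenizer.conv1.weight']
print(no_decay)  # ['backbone.layer0.bias', 'norm1.weight', 'embedding.mask_token']
```

The two lists then become separate optimizer parameter groups, with gradient clipping (at 5.0, per the above) applied globally across both.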
Initialization and State Dimension
The HiPPO initialization of $\mathbf{A}$ assumes the input is a continuous signal sampled uniformly. Spectra satisfy this — wavenumber is uniformly sampled. But if you resample to non-uniform spacing (e.g., to compress baseline regions), you need to adjust the discretization step accordingly.
The hidden state dimension $d_{\text{state}}$ controls how much history the SSM can remember. For 128-token spectral sequences, $d_{\text{state}} = 128$ is sufficient — the state has as many dimensions as there are tokens. Increasing beyond this shows diminishing returns.
The Bigger Picture
SSMs represent a shift in how we think about spectral data. The traditional view — a spectrum is a vector of features — leads to architectures that treat each wavenumber independently. The sequence view — a spectrum is a signal unfolding along the wavenumber axis — leads to architectures that model dependencies between wavenumbers.
This distinction matters because the physics is sequential. The wavenumber axis is not arbitrary — it corresponds to energy, and physical correlations between modes follow from shared molecular structure. A model that respects this sequential structure learns more from less data.
The practical upshot: on QM9S with 130K spectra, a CNN + D-LinOSS hybrid achieves 84.2% identification accuracy with 12.4M parameters, matching a CNN + Transformer at 83.7% while running at linear cost. On larger datasets or higher-resolution spectra, the linear scaling advantage will compound.
The deeper lesson is that architecture should follow physics. Spectra are oscillatory, long-range, and ordered — and the best model for them is a damped oscillator with a full-sequence receptive field. When your data has structure, build that structure into the model. The SSM does not just learn spectral features; it learns to resonate with them.
Originally published at tubhyam.dev/blog/state-space-models-for-spectroscopy