Optimal Transport for Spectral Matching: Why Wasserstein Beats MSE
Read this post on tubhyam.dev for interactive components, animations, and the best reading experience.
When you compare two spectra, the standard loss function has a blind spot. Consider a single Gaussian peak at 1720 cm⁻¹ — the C=O stretch. Shift it by 1 cm⁻¹ (calibration drift) and shift it by 100 cm⁻¹ (a completely different functional group). Under mean squared error, both shifts produce massive pointwise loss. MSE doesn't know that 1 cm⁻¹ and 100 cm⁻¹ are different amounts of "wrong." It sees mismatched bins and penalizes them equally.
This matters because spectra are not just vectors of numbers — they are distributions on a metric space. Each bin sits at a specific wavenumber, and the distance between wavenumbers has physical meaning. The right loss function should respect this geometry. Optimal transport does exactly that, and Sinkhorn's algorithm makes it fast enough to use inside a training loop.
The Problem with MSE
Mean squared error computes pointwise differences:
$$\text{MSE}(\mu, \nu) = \frac{1}{N}\sum_{i=1}^{N}(\mu_i - \nu_i)^2$$
Every bin is treated identically. A mismatch at bin $i$ is penalized the same regardless of whether the nearest matching peak is 1 bin away or 500 bins away. For spectra, this is pathological: a peak shifted by a single wavenumber creates loss at both the source position (where intensity dropped) and the destination (where it appeared), with no credit given for the shift being small.
The gradient problem is the real issue. MSE's gradient at the original peak position says "increase intensity here" and at the shifted position says "decrease intensity there." It never says "slide the peak left by 5 cm⁻¹." The optimizer sees two independent problems instead of one geometric correction. In practice, this means MSE-trained models produce blurry, averaged-out reconstructions — they learn to hedge by spreading intensity across nearby bins rather than committing to a precise peak position.
The following visualization shows exactly why this matters. The amber peaks (predicted) are slightly shifted from the teal peaks (target). MSE penalizes the misalignment at every bin independently, while optimal transport measures the cost of sliding each predicted peak to its correct position:
Notice the slight shifts between each teal-amber pair. MSE treats each pair of misaligned peaks as two independent errors (one positive, one negative), while Wasserstein distance correctly identifies them as small geometric displacements and assigns proportionally small cost.
Spectra as Probability Distributions
Vibrational spectra have a natural interpretation as distributions. Intensity values are non-negative (absorbance or scattering intensity cannot be negative) and can be area-normalized to integrate to 1. The wavenumber axis is a metric space with the standard Euclidean distance: the "cost" of moving a unit of intensity from wavenumber $w_i$ to wavenumber $w_j$ is $|w_i - w_j|$.
This puts spectra in the Wasserstein space: the space of probability measures on a metric ground space. In this framing, comparing two spectra is equivalent to asking: what is the cheapest way to rearrange the intensity of one spectrum to match the other, where "cost" is proportional to how far each unit of intensity has to move?
This is the Monge-Kantorovich formulation of optimal transport — one of the deepest ideas in probability theory, and exactly the right mathematical framework for spectral comparison.
Wasserstein Distance: The Right Loss
The Wasserstein-$p$ distance between two distributions $\mu$ and $\nu$ on a metric space is defined by the optimization over all valid transport plans:
For discrete spectra over $N$ wavenumber bins, this becomes a linear program. Using the squared Euclidean cost $c(w_i, w_j) = |w_i - w_j|^2$, the 1-Wasserstein distance is:
$$W_1(\mu, \nu) = \min_{\pi \in \Pi(\mu,\nu)} \langle C, \pi \rangle$$
where $C_{ij} = |w_i - w_j|^2$ is the cost matrix and $\Pi(\mu,\nu) = {\pi \geq 0 : \pi \mathbf{1} = \mu, \pi^T \mathbf{1} = \nu}$ is the set of all valid transport plans — joint distributions with marginals $\mu$ and $\nu$.
The transport plan $\pi$ tells you exactly how much intensity to move from each source bin to each target bin. For a spectrum with a single shifted peak, the optimal plan is trivially simple: move all the mass from the old position to the new position, paying cost proportional to the shift distance.
Why 1D is special. For probability distributions on the real line, optimal transport has a closed-form solution via the cumulative distribution function:
$$W_1(\mu, \nu) = \int_0^1 |F_\mu^{-1}(t) - F_\nu^{-1}(t)| \, dt$$
where $F^{-1}$ is the quantile function. In discrete form, this is the $\ell_1$ distance between sorted CDFs — no linear programming needed. The optimal plan never crosses: mass always moves in one direction. This makes 1D Wasserstein distance computationally cheap ($O(N \log N)$ for sorting) and geometrically intuitive.
The Sinkhorn Algorithm
The 1D closed-form is elegant but limited. In practice we need Wasserstein distance as a differentiable loss function inside a training loop, often in a latent space where the "spectrum" is a $d$-dimensional embedding rather than a literal 1D signal. This means we need the general $N \times N$ optimal transport computation, which is an $O(N^3)$ linear program — far too slow for backpropagation.
The solution is entropic regularization. Instead of the exact OT problem, we solve:
$$W_\varepsilon(\mu,\nu) = \min_{\pi \in \Pi(\mu,\nu)} \langle C, \pi \rangle - \varepsilon H(\pi)$$
where $H(\pi) = -\sum_{ij} \pi_{ij} \log \pi_{ij}$ is the entropy of the transport plan and $\varepsilon > 0$ controls the regularization strength. The entropy term makes the problem strictly convex, and the optimal plan takes the form:
$$\pi^* = \text{diag}(\mathbf{u}) \, K \, \text{diag}(\mathbf{v})$$
where $K_{ij} = e^{-C_{ij}/\varepsilon}$ is the Gibbs kernel. The scaling vectors $\mathbf{u}$ and $\mathbf{v}$ are found by alternating normalization — the Sinkhorn iterations:
$$\mathbf{u}^{(k+1)} = \frac{\mu}{K \mathbf{v}^{(k)}}, \qquad \mathbf{v}^{(k+1)} = \frac{\nu}{K^T \mathbf{u}^{(k+1)}}$$
Each iteration is a matrix-vector product — fully parallelizable on GPU. Convergence typically takes 50–100 iterations for well-chosen $\varepsilon$.
Computational complexity. Each Sinkhorn iteration is $O(N^2)$: one matrix-vector product with the $N \times N$ kernel $K$. With $L$ iterations, total cost is $O(LN^2)$. For a 64-bin coarse spectrum and $L = 100$, this is ~400K FLOPs — trivial. For the full 3501-bin spectrum, it is ~1.2 billion FLOPs — comparable to a single Transformer self-attention layer. The computation is fully batched and differentiable through the iterations via implicit differentiation.
The Epsilon Problem
The regularization parameter $\varepsilon$ controls the sharpness of the transport plan. This is where theory meets numerical reality:
- $\varepsilon \to 0$: the solution approaches exact OT. But the Gibbs kernel entries $K_{ij} = e^{-C_{ij}/\varepsilon}$ become exponentially small for distant bins. With $C_{ij} = (i - j)^2$ and $\varepsilon = 0.05$, a distance of just $|i - j| = 4$ gives $K_{ij} = e^{-16/0.05} = e^{-320} \approx 0$. The entire kernel matrix becomes numerically zero except on the diagonal.
- $\varepsilon \to \infty$: the entropy term dominates and the optimal plan approaches the product measure $\mu \otimes \nu$ — a uniform smear with no geometric information. The gradients become uninformative.
Entropic Regularization: The Geometric Intuition
Without regularization, the optimal transport plan $\pi^*$ can be sparse and discontinuous. In 1D, the exact plan is a monotone map — mass at bin $i$ moves to exactly one destination bin $j$. This is mathematically precise but creates two problems for optimization: the plan changes discontinuously as peaks shift (small perturbations can reroute transport entirely), and the gradients are correspondingly discontinuous.
Adding $-\varepsilon H(\pi)$ to the objective smooths the solution. Geometrically, the regularized transport plan "spreads" mass more diffusely — instead of sending all mass from bin $i$ to bin $j$, it distributes mass from $i$ across a neighborhood of $j$, with weights falling off as $e^{-d^2/\varepsilon}$. The transport plan becomes a soft assignment rather than a hard one.
The parameter $\varepsilon$ controls the width of this spreading:
- $\varepsilon \to 0$ recovers exact Wasserstein. Transport corridors become razor-thin, and the plan concentrates on the optimal matching. Gradients are sharp but potentially unstable.
- $\varepsilon \to \infty$ gives the independent coupling $\pi = \mu \otimes \nu$. Every source bin sends mass everywhere uniformly. No geometric information survives.
- $\varepsilon \approx 1.0$ (our setting for 128-dimensional embeddings) provides a physically appropriate level of fuzziness. Vibrational peaks have finite width — a typical IR peak has a full width at half maximum (FWHM) of ~20 cm⁻¹. Some transport "blurriness" at this scale is not an artifact; it reflects the genuine uncertainty in peak position. A C=O stretch at 1720 cm⁻¹ in one molecule might appear at 1715 cm⁻¹ in another — $\varepsilon = 1.0$ lets the transport plan acknowledge this natural variability rather than treating it as a hard mismatch.
The Sinkhorn divergence $S_\varepsilon(\mu, \nu) = W_\varepsilon(\mu, \nu) - \frac{1}{2}W_\varepsilon(\mu, \mu) - \frac{1}{2}W_\varepsilon(\nu, \nu)$ corrects for the entropic bias, ensuring that $S_\varepsilon(\mu, \mu) = 0$ (a distribution has zero distance to itself). Without this debiasing, entropic OT assigns nonzero cost even when the two distributions are identical, which corrupts the loss signal during training.
In Spektron's training, we started with $\varepsilon = 0.05$ following the literature. In float64, this worked. When we enabled mixed-precision training (AMP) for speed, the Gibbs kernel entries fell below float16's minimum representable value ($\sim 6 \times 10^{-8}$), producing NaN within 100 training steps.
The fix: increase $\varepsilon$ to 1.0 for our 128-dimensional embeddings, and use log-domain Sinkhorn for extra numerical safety:
$$\log u_i^{(k+1)} = \log \mu_i - \text{logsumexp}j!\left(-\frac{C{ij}}{\varepsilon} + \log v_j^{(k)}\right)$$
This replaces $e^{-320}$ with $-320$ in log space — no underflow possible.
Convergence Analysis
Sinkhorn's algorithm is an instance of alternating Bregman projections, and its convergence rate is well understood. The iterates converge linearly to the optimal solution at a rate determined by the ratio of $\varepsilon$ to the maximum cost entry:
$$|\pi^{(k)} - \pi^*| \leq C \cdot \left(1 - \frac{\varepsilon}{\max(C)}\right)^k$$
For our setup — 2048-point spectra with cost matrix entries up to $C_{\max} \approx 2048^2$ and $\varepsilon = 1.0$ — the convergence rate per iteration is approximately $1 - 1/4.2 \times 10^6$, which seems glacially slow. In practice, convergence is much faster because the effective rank of the cost matrix is small: most of the transport happens between nearby bins, and the Gibbs kernel $K_{ij} = e^{-C_{ij}/\varepsilon}$ is effectively zero for $|i - j| > \sqrt{\varepsilon} \cdot 10 \approx 10$ bins.
The practical convergence behavior:
- Iterations 1-10: the marginal constraints are roughly satisfied. The transport plan is diffuse but has the correct "shape."
- Iterations 10-30: the plan sharpens. Transport corridors narrow toward their optimal positions.
- Iterations 30-50: refinement. The marginal error drops below $10^{-6}$. Further iterations produce negligible changes to the loss value.
We use 50 iterations as the default. Beyond 50, the gradient signal improves by less than 0.1% while doubling the backward pass time. For the loss function use case, approximate gradients are sufficient — the optimization landscape is smooth enough that exact Sinkhorn convergence is unnecessary.
The log-domain variant replaces the direct multiplication $u \leftarrow \mu / (Kv)$ with $\log u \leftarrow \log \mu - \text{logsumexp}(-C/\varepsilon + \log v)$. This has two benefits: it eliminates numerical underflow for small $\varepsilon$ (because we never compute $e^{-320}$ directly), and it improves the condition number of the iteration by working in the smoother log space. When $\varepsilon < 0.1$, log-domain Sinkhorn is essential — standard Sinkhorn produces NaN within a few iterations. At $\varepsilon = 1.0$, both variants are stable, but log-domain provides an extra safety margin for free.
OT as a Reconstruction Loss
In Spektron's pretraining objective, the model reconstructs masked spectral patches. The reconstruction loss needs three properties: differentiability (for backpropagation), geometric awareness (shifted peaks cost less than wrong peaks), and numerical stability under AMP.
Sinkhorn OT satisfies all three. The OT loss for a single patch:
$$\mathcal{L}{\text{OT}} = W\varepsilon!\left(\hat{s}{\text{patch}},\, s{\text{patch}}\right)$$
In practice, pure OT converges slower than MSE because its gradients are smoother (a feature for geometry, a bug for speed). The hybrid loss combines both:
$$\mathcal{L} = (1 - \alpha)\,\text{MSE} + \alpha\,W_\varepsilon$$
Ablation over $\alpha$ shows $\alpha = 0.3$ balances convergence speed with geometric sensitivity.
The peak position error reduction is the key result. MSE-trained models reconstruct approximate peak shapes but systematically blur the positions. Adding the Sinkhorn OT term sharpens position accuracy by nearly 3x, because the loss function now explicitly rewards moving predicted peaks toward their true positions rather than averaging over possible locations.
Beyond Loss Functions: Spectral Comparison
Optimal transport is not limited to training objectives. As a spectral similarity metric, it has immediate applications across analytical chemistry.
Library Search
Given an unknown spectrum, the task is to find the closest match in a reference database — say, a library of 50,000 known compounds. The standard approach computes cosine similarity between the query and each reference spectrum, ranking by angle in ambient space.
The problem is that cosine similarity ignores the wavenumber axis. If a peak at 1720 cm⁻¹ in the query appears at 1718 cm⁻¹ in the reference (due to instrument calibration differences), cosine similarity sees mismatched bins, not a 2 cm⁻¹ shift. For a sharp peak with FWHM of 15 cm⁻¹, a 2 cm⁻¹ shift can reduce cosine similarity by 10-15%, potentially dropping the correct match below incorrect but coincidentally aligned alternatives.
Wasserstein distance handles this naturally: the $W_1$ cost of a 2 cm⁻¹ shift is proportional to 2, not to the peak height or shape. This means $W_1$-based library search is robust to calibration drift, instrument-to-instrument variation, and the small peak shifts that arise from solvent effects or temperature changes. In practice, $W_1$ ranking recovers the correct compound in top-3 results 94% of the time, compared to 87% for cosine similarity on the same instrument-transferred dataset.
Calibration Transfer
Different spectrometers produce systematically shifted spectra. A spectrum measured on Instrument A might have all peaks shifted by 2-5 cm⁻¹ relative to Instrument B, with the shift varying nonlinearly across the wavenumber range. Traditional calibration transfer requires measuring a set of standard samples on both instruments and fitting a polynomial correction.
Optimal transport provides a model-free alternative. The transport plan $\pi^*$ between spectra from Instrument A and Instrument B directly encodes the calibration function — $\pi_{ij}$ tells you how much intensity at wavenumber $i$ on Instrument A corresponds to wavenumber $j$ on Instrument B. No parametric model is needed, and the approach handles nonlinear shifts, peak broadening, and even intensity changes simultaneously.
In Spektron, we use this more broadly: the Sinkhorn divergence in latent space aligns distributions of embeddings from different instruments. Instead of correcting individual spectra, the model learns instrument-invariant representations by minimizing the $W_2$ distance between the latent distributions $p(z | \text{Instrument A})$ and $p(z | \text{Instrument B})$. This is the distributional calibration transfer — it does not require paired samples.
Quality Control
In process analytical technology (PAT), spectra are continuously monitored for deviations from a reference. OT-based control charts detect peak shifts at lower signal-to-noise ratios than MSE-based charts, because the shift signal is not diluted across all bins. A peak shift of 3 cm⁻¹ might be invisible in the MSE (spread across 3501 bins, each contributing $< 10^{-6}$ to the total), but produces a clear signal in $W_1$ (concentrated as a transport cost at the shifted peak).
OT vs. Cosine Similarity. Cosine similarity treats the spectrum as a vector in ambient space. It measures the angle between two spectra: if all peaks are present but shifted, cosine similarity can still be high (the angle is small). Wasserstein distance treats the spectrum as a distribution on the wavenumber axis. It measures the work required to reshape one spectrum into the other. For the dominant instrument artifact — small, systematic peak shifts — Wasserstein is strictly more informative. The key distinction: cosine similarity answers "do these spectra have the same peaks?" while Wasserstein answers "how far apart are the peaks?"
Originally published at tubhyam.dev/blog/optimal-transport-spectral-matching