SpectraKit: A Functional API for Spectral Preprocessing
Spectral preprocessing is the unglamorous part of spectroscopy. Before you can identify a compound, quantify a concentration, or train a model, you need to remove baselines, smooth noise, normalize intensities, and correct for scatter. Every spectroscopist does this. Most write their own scripts. The scripts are never reusable.
SpectraKit exists because I got tired of rewriting the same preprocessing code for every project. It's a Python library — pip install pyspectrakit — that provides a functional API over NumPy arrays. No classes, no state, no framework lock-in. Every function takes arrays in and returns arrays out.
The Preprocessing Pipeline
Every spectral analysis follows the same general flow: load raw data from whatever instrument format you have, correct the baseline, smooth out high-frequency noise, normalize intensities so spectra are comparable, detect peaks of interest, and export the results. SpectraKit covers each stage with composable, pure functions.
Why Functional
Most preprocessing libraries for spectroscopy are object-oriented. You create a Spectrum object, call methods on it, and the object mutates internal state. This design has two problems.
First, it forces a data model. Your spectra live in whatever container the library invented — Spectrum, SpectralCollection, Dataset. You can't use plain NumPy arrays. You can't use pandas DataFrames without wrapping them. Integration with any other tool requires conversion.
Second, it makes composition opaque. When you chain spectrum.baseline().smooth().normalize(), you can't easily inspect intermediate results, swap one step for another, or build a pipeline that sklearn can use. The method chain is convenient but rigid.
SpectraKit takes the opposite approach. Every function signature follows the same pattern: ndarray in, ndarray out.
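To make the pattern concrete without reproducing the library's internals, here is a minimal asymmetric least squares baseline in the style of Eilers and Boelens (2005). This is a sketch of the idea behind baseline_als, not SpectraKit's actual code; the parameter names lam and p follow the usual ALS convention.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least squares baseline (Eilers & Boelens, 2005).
    Plain ndarray in, plain ndarray out: no wrapper objects."""
    n = y.size
    # Second-difference penalty matrix, scaled by the smoothness parameter lam
    D = sparse.diags([1.0, -2.0, 1.0], [0, -1, -2], shape=(n, n - 2))
    P = lam * (D @ D.T)
    w = np.ones(n)
    z = y
    for _ in range(n_iter):
        W = sparse.spdiags(w, 0, n, n)
        z = spsolve((W + P).tocsc(), w * y)
        # Points above the fit are probably peaks: down-weight them
        w = p * (y > z) + (1 - p) * (y < z)
    return z

# Synthetic spectrum: linear drift plus one Gaussian band
x = np.linspace(0, 100, 500)
y = 0.02 * x + np.exp(-((x - 60) ** 2) / 20)
corrected = y - baseline_als(y)   # the drift is (approximately) removed
```

Because every step consumes and returns a plain array, the intermediate baseline is just another ndarray you can plot, diff, or discard.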
You can inspect corrected before passing it to smooth_savgol. You can swap baseline_als for baseline_snip without changing anything else. You can use these functions inside a for loop, a multiprocessing pool, or a PyTorch data loader.
What It Covers
The library handles the full preprocessing pipeline that every spectroscopist needs: fourteen modules, each doing one thing well.
Every baseline correction method returns convergence diagnostics — not just the corrected spectrum, but also the number of iterations and the final residual. You don't have to trust that ALS converged. You can check.
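In sketch form, that return convention looks like the following. The ConvergenceInfo name comes from this post; the field names and loop are assumptions, not the library's actual implementation.

```python
import numpy as np
from typing import NamedTuple

class ConvergenceInfo(NamedTuple):   # fields assumed from the description above
    iterations: int
    residual: float

def iterate_until_converged(y, step, tol=1e-7, max_iter=100):
    """Generic fixed-point loop returning (result, ConvergenceInfo).
    `step` maps the current estimate to the next one."""
    z = np.zeros_like(y, dtype=float)
    residual = np.inf
    for i in range(1, max_iter + 1):
        z_new = step(z)
        residual = float(np.linalg.norm(z_new - z) / max(np.linalg.norm(z_new), 1e-12))
        z = z_new
        if residual < tol:
            break
    return z, ConvergenceInfo(iterations=i, residual=residual)

# Example: damped relaxation toward y converges geometrically
y = np.linspace(0.0, 1.0, 50)
z, info = iterate_until_converged(y, step=lambda z: z + 0.5 * (y - z))
```

The caller gets the answer and the evidence in one return value, so checking convergence is one attribute access instead of an act of faith.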
Before and After: A Real-World Pipeline
Here's a scenario that every FTIR spectroscopist has lived through. You receive an IR spectrum from a Bruker FTIR — an ethanol sample measured in ATR mode. The raw data has a sloping baseline from incomplete ATR correction, high-frequency interferometric noise, and arbitrary intensity units that make it incomparable to any reference spectrum.
The problems are visible: the baseline drifts upward toward the low-wavenumber end (right side), spurious noise spikes are scattered throughout, and the O-H stretch at 3300 cm⁻¹ sits on top of a broad hump that obscures its true shape. Without preprocessing, any peak-picking algorithm would report dozens of false positives, and any quantitative model would be biased by the baseline offset.
Here's the SpectraKit pipeline that fixes all three problems in four lines:
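The version below is a stand-in sketch rather than the actual four lines: a crude polynomial fit plays the role of baseline_arpls, and plain scipy calls play the roles of smooth_savgol and normalize_snv, applied to a synthetic spectrum shaped like the one described above.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for the Bruker ATR spectrum described above:
# sloping baseline + O-H stretch band near 3300 cm^-1 + high-frequency noise
wn = np.linspace(4000, 400, 1800)                      # wavenumber axis
rng = np.random.default_rng(0)
raw = (0.0003 * (4000 - wn)                            # sloping baseline
       + np.exp(-((wn - 3300) / 100.0) ** 2)           # O-H stretch
       + rng.normal(0, 0.01, wn.size))                 # interferometric noise

# Three stand-in steps; SpectraKit's baseline_arpls, smooth_savgol, and
# normalize_snv are assumed to fill these roles in the real pipeline
baseline = np.polyval(np.polyfit(wn, raw, 1), wn)      # crude arPLS substitute
corrected = raw - baseline
smoothed = savgol_filter(corrected, window_length=15, polyorder=3)
normalized = (smoothed - smoothed.mean()) / smoothed.std()   # SNV
```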
And here's what the spectrum looks like after preprocessing:
The difference is stark. The sloping baseline is gone — arPLS (asymmetrically reweighted penalized least squares) identified it as a low-frequency trend and subtracted it without distorting the peaks. The spurious high-frequency noise is smoothed away by Savitzky-Golay filtering, which fits local polynomials instead of just averaging, so peak shapes are preserved. And SNV normalization has centered the spectrum at zero mean and unit variance, making it directly comparable to any other ethanol spectrum in your dataset regardless of path length or instrument sensitivity.
The key detail: baseline_arpls returned both the corrected spectrum and a ConvergenceInfo object. You can verify that the algorithm converged (23 iterations, residual below 1e-7). If it hadn't converged — say, because you set lam too low and the baseline was trying to follow every peak — you'd see a high residual and could adjust parameters before the error propagated downstream.
The Dependency Decision
SpectraKit has two core dependencies: numpy and scipy. That's it. Everything else — matplotlib for plotting, h5py for HDF5 I/O, scikit-learn for pipeline integration — is optional. You install what you need.
This was a deliberate constraint. Spectroscopy code runs in environments ranging from Jupyter notebooks to embedded systems to production pipelines. A library that drags in tensorflow or torch as a dependency is unusable in half these contexts. NumPy and SciPy are the common denominator.
The practical benefit: installing SpectraKit inside a Docker container for a production preprocessing service adds less than 50MB to the image. Compare that to a spectroscopy library that depends on PyTorch (2.5GB) or TensorFlow (1.8GB). When you're deploying preprocessing as a microservice or running it in a CI pipeline, dependency size matters.
The I/O Problem
Spectral file formats are a mess. JCAMP-DX has six variants. SPC files encode data differently depending on whether the vendor is Thermo, PerkinElmer, or Shimadzu. Bruker OPUS is a binary format with no official spec — you need to reverse-engineer the byte layout.
SpectraKit's I/O module handles all of these with a single consistent interface. read_jcamp, read_spc, read_opus — each returns a named tuple with wavenumbers, intensities, and metadata. The format detection is automatic: pass a file path and the library figures out the rest.
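To show the shape of that interface, here is a minimal parser for the common JCAMP-DX (X++(Y..Y)) layout. It is a sketch only, not the library's parser: real files also need DELTAX consistency checks, compressed Y encodings, and more robust comment handling.

```python
import numpy as np
from typing import NamedTuple

class Spectrum(NamedTuple):   # field names follow the description above
    wavenumbers: np.ndarray
    intensities: np.ndarray
    metadata: dict

def read_jcamp_minimal(text):
    """Parse the common JCAMP-DX (X++(Y..Y)) layout: each data line is
    'firstX y1 y2 ...'; metadata lines are '##KEY=VALUE'."""
    meta, xs, ys = {}, [], []
    in_data = False
    for line in text.splitlines():
        line = line.split("$$")[0].strip()   # drop $$ inline comments
        if not line:
            continue
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            in_data = key.strip().upper() == "XYDATA"
            if not in_data:
                meta[key.strip()] = value.strip()
            continue
        if in_data:
            nums = [float(t) for t in line.split()]
            first_x, y_vals = nums[0], nums[1:]
            dx = float(meta.get("DELTAX", 1.0))
            xs.extend(first_x + dx * np.arange(len(y_vals)))
            ys.extend(y_vals)
    return Spectrum(np.array(xs), np.array(ys), meta)

sample = """##TITLE=Ethanol
##DELTAX=2.0
##XYDATA=(X++(Y..Y))
400.0 1.0 2.0 3.0
406.0 4.0 5.0
##END=
"""
spec = read_jcamp_minimal(sample)
```

Each reader returning the same named-tuple shape is what makes the automatic format detection useful: downstream code never cares which parser ran.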
The Bruker OPUS parser deserves special mention. Most Python libraries that claim OPUS support wrap the Bruker SDK or shell out to a command-line converter. SpectraKit reads the binary format directly — no external dependencies, no SDK license, no subprocess calls. It handles single-channel, interferogram, and ratioed spectra from any Bruker instrument manufactured after 2000.
Pipelines and sklearn
Functional composition is natural — you chain function calls. But for production use, you often want a reusable pipeline object that can be serialized, logged, and dropped into a sklearn workflow.
SpectraKit's Pipeline class wraps the functional API into a declarative chain:
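The core idea fits in a few lines. This is a minimal sketch of the declarative chain, not the real class, which presumably also handles serialization and logging.

```python
import numpy as np
from scipy.signal import savgol_filter

class Pipeline:
    """Minimal sketch: a list of (function, kwargs) steps applied in order."""
    def __init__(self, steps):
        self.steps = steps
    def __call__(self, y):
        for fn, kwargs in self.steps:
            y = fn(y, **kwargs)      # every step is ndarray in, ndarray out
        return y

def normalize_snv(y):
    return (y - y.mean()) / y.std()

pipe = Pipeline([
    (savgol_filter, {"window_length": 11, "polyorder": 3}),
    (normalize_snv, {}),
])
signal = np.sin(np.linspace(0, 6, 200)) + 0.01 * np.random.default_rng(1).normal(size=200)
result = pipe(signal)
```

Because steps are just (function, kwargs) pairs, the whole chain is data: you can log it, diff two pipelines, or swap one step without touching the rest.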
SpectralTransformer wraps any SpectraKit pipeline into a sklearn-compatible transformer. It implements fit, transform, and fit_transform. This means you can use SpectraKit preprocessing inside GridSearchCV, cross_val_score, or any sklearn meta-estimator without writing adapter code.
A Complete sklearn Integration
The real power of SpectralTransformer shows up in production chemometrics workflows. Here's a complete example: predicting ethanol concentration from NIR spectra using PLS regression with SpectraKit preprocessing baked into the model pipeline.
The critical thing: preprocessing is now part of the model. When you serialize this pipeline with joblib.dump, the preprocessing steps are saved alongside the PLS model. When you load it in production and call pipeline.predict(new_spectra), the baseline correction, smoothing, and normalization happen automatically. No separate preprocessing scripts. No chance of applying different parameters in production than in training.
This also means you can grid-search preprocessing parameters. Want to know if Whittaker smoothing works better than Savitzky-Golay for your dataset? Want to compare SNV vs. MSC normalization? Wrap the options in GridSearchCV and let cross-validation decide.
Edge Cases and Defensive Design
Spectral data in the wild is messy. Instruments produce artifacts, file formats have ambiguities, and users pass unexpected inputs. SpectraKit handles these defensively rather than crashing silently.
Constant signals and division by zero
SNV normalization divides each spectrum by its standard deviation. When you have a constant signal — a blank measurement, a flat baseline region, or a dead detector channel — the standard deviation is zero. Dividing by zero produces infinity or NaN, which then propagates through every downstream operation.
SpectraKit catches this. When normalize_snv encounters a constant spectrum, it returns a zero array and emits a SpectraKitWarning rather than silently producing infinity. The warning includes the spectrum index (when processing batches) so you can identify which measurement is problematic.
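A sketch of that guard for the single-spectrum case; the SpectraKitWarning name comes from the library, but the body here is an assumption.

```python
import warnings
import numpy as np

class SpectraKitWarning(UserWarning):   # warning class named above; definition assumed
    pass

def normalize_snv(y):
    """SNV with the defensive behavior described above: a constant
    spectrum yields zeros plus a warning instead of inf/NaN."""
    y = np.asarray(y, dtype=float)
    sd = y.std()
    if sd == 0.0:
        warnings.warn("constant spectrum: SNV is undefined, returning zeros",
                      SpectraKitWarning)
        return np.zeros_like(y)
    return (y - y.mean()) / sd
```

Returning zeros keeps batch shapes intact while the warning makes the bad measurement impossible to miss in logs.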
NaN propagation
NaN values in spectral data happen more often than you'd think — dead pixels in array detectors, parsing errors in corrupt files, interpolation at the edges of wavelength ranges. SpectraKit's default behavior is to propagate NaN rather than silently impute. If your input contains NaN, the output contains NaN in the affected regions. This follows NumPy's convention and ensures you never get a "clean" output from dirty input without knowing about it.
But for batch processing where you need robustness over strict correctness, every function accepts an nan_policy parameter: "propagate" (default), "raise" (throw an error), or "omit" (ignore NaN positions and interpolate). The "omit" mode uses linear interpolation to fill gaps before processing and masks the results back to NaN afterward, so the output shape is preserved.
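Here is a sketch of that contract around a single smoothing function; the nan_policy parameter name matches the description above, while the implementation details are assumed.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_savgol(y, window_length=11, polyorder=3, nan_policy="propagate"):
    """Savitzky-Golay smoothing with the nan_policy contract described above."""
    y = np.asarray(y, dtype=float)
    mask = np.isnan(y)
    if nan_policy == "raise" and mask.any():
        raise ValueError("input contains NaN")
    if nan_policy == "omit" and mask.any():
        idx = np.arange(y.size)
        filled = y.copy()
        # linearly interpolate across the gaps, smooth, then re-mask
        filled[mask] = np.interp(idx[mask], idx[~mask], y[~mask])
        out = savgol_filter(filled, window_length, polyorder)
        out[mask] = np.nan          # preserve NaN positions in the output
        return out
    return savgol_filter(y, window_length, polyorder)   # "propagate"
```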
Negative absorbance
Absorbance values below zero are physically impossible — they'd mean the sample is generating light. But they happen all the time in practice, usually because the baseline correction overcorrected (subtracted too much) or because the reference measurement drifted between background and sample scans.
SpectraKit's quality module flags this explicitly:
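A sketch of what such a check might look like; the check_spectrum name is the library's, but the report fields and formulas below are assumptions based on the description that follows.

```python
import numpy as np
from typing import NamedTuple

class QualityReport(NamedTuple):   # field names assumed
    snr: float
    roughness: float
    n_negative: int
    n_saturated: int

def check_spectrum(y, saturation_limit=3.0):
    """Run simple quality checks and return a structured report."""
    y = np.asarray(y, dtype=float)
    d2 = np.diff(y, 2)
    # White noise passed through a second difference has variance 6*sigma^2
    noise = d2.std() / np.sqrt(6)
    snr = (y.max() - y.min()) / noise if noise > 0 else np.inf
    return QualityReport(
        snr=float(snr),
        roughness=float(np.sqrt(np.mean(d2 ** 2))),   # RMS of second derivative
        n_negative=int((y < 0).sum()),                # physically impossible absorbance
        n_saturated=int((y > saturation_limit).sum()),
    )
```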
The check_spectrum function runs a suite of quality checks and returns a structured report: signal-to-noise ratio, roughness (RMS of second derivative), negative value detection, saturation detection (absorbance above 3.0 where Beer-Lambert breaks down), and wavenumber range validation. It doesn't fix anything — that's not its job. It tells you what's wrong and suggests which SpectraKit functions would address each issue.
Testing
699 tests. Zero mypy strict-mode errors. Zero ruff violations. Every public function has tests for:
- Correctness — Output matches reference implementations (SciPy, MATLAB, published papers)
- Shape preservation — 1D input produces 1D output, 2D batch input produces 2D output
- Edge cases — Empty arrays, single-point spectra, constant signals, NaN handling
- Numerical stability — Large dynamic ranges, near-zero denominators, ill-conditioned matrices
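The shape-preservation and edge-case tests look roughly like this; the SNV stand-in below is mine, but the test structure mirrors the categories above.

```python
import numpy as np

def normalize_snv(y):
    """Stand-in under test: row-wise SNV that works for 1D and 2D input."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean(axis=-1, keepdims=True)) / y.std(axis=-1, keepdims=True)

def test_shape_preservation():
    rng = np.random.default_rng(0)
    assert normalize_snv(rng.normal(size=100)).shape == (100,)          # 1D in, 1D out
    assert normalize_snv(rng.normal(size=(8, 100))).shape == (8, 100)   # 2D batch

def test_constant_signal():
    # Constant rows have zero std, so naive SNV produces non-finite values;
    # tests like this one pin down exactly that failure mode
    out = normalize_snv(np.ones((2, 50)))
    assert not np.isfinite(out).all()
```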
Testing Philosophy
The baseline tests are the most important. ALS and ArPLS are iterative algorithms — they can silently fail to converge, producing baselines that look reasonable but introduce systematic error downstream. Every baseline function in SpectraKit returns convergence metadata (iterations, residual norm), and the tests verify convergence on real-world spectral shapes, not just synthetic Gaussians.
The I/O tests are the most tedious. Each format has vendor-specific quirks — JCAMP-DX files from Shimadzu use ##XYDATA=(X++(Y..Y)) while files from Bruker use ##XYDATA=(X++(Y..Y)) $$DX=0.964. SPC files from Thermo use 32-bit float Y-data while older Galactic files use 16-bit integers with a separate exponent. The test suite includes real spectral files from five different instrument vendors to catch these quirks.
The Augmentation Module
The augment module is one of SpectraKit's newer additions, built specifically for training spectral ML models. When you're training a model like Spektron on a finite dataset, data augmentation is critical — but you can't apply the same augmentations used for image data. Random cropping makes no sense for spectra. Horizontal flips reverse the wavenumber axis, which is physically meaningless.
Spectral augmentation needs to be physically plausible. SpectraKit provides five augmentation functions, each designed to simulate real-world spectral variation.
The spectral_mixup function deserves a note: it's an adaptation of Zhang et al.'s mixup regularization for spectral data. Instead of linearly interpolating pixel values (which works for images because pixel intensities are arbitrary), it interpolates absorbance values — which is physically valid because absorbance is additive under Beer-Lambert law. A 70/30 mix of ethanol and methanol spectra genuinely looks like a 70/30 mixture.
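In sketch form (the function name is the library's; this particular signature and the beta-distributed mixing coefficient, standard in mixup, are assumptions):

```python
import numpy as np

def spectral_mixup(x1, x2, alpha=0.3, rng=None):
    """Mixup (Zhang et al.) applied to absorbance spectra: because absorbance
    is additive under Beer-Lambert, the convex combination of two spectra is
    itself a physically plausible mixture spectrum."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    return lam * x1 + (1.0 - lam) * x2, lam

# Two synthetic single-band "pure component" spectra
ethanol = np.exp(-((np.linspace(0, 10, 200) - 4) ** 2))
methanol = np.exp(-((np.linspace(0, 10, 200) - 7) ** 2))
mixed, lam = spectral_mixup(ethanol, methanol, rng=np.random.default_rng(3))
```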
What's Next
SpectraKit is stable, tested, and published. The next step is using it as the preprocessing foundation for Spektron — the spectral foundation model. Every spectrum that enters the Spektron training pipeline goes through SpectraKit preprocessing first. The functional API makes this trivial: the data loader calls baseline_als, normalize_snv, and resampling in sequence, each operating on raw NumPy arrays that PyTorch can consume directly.
Originally published at tubhyam.dev/blog/spectrakit