LMCompress: lossless data compression with large models

Understanding is compression

August 20, 2025

A critical review of Li et al., Nature Machine Intelligence, volume 7, pages 794–799 (2025), conducted with ChatGPT.

The conversation starts with me uploading a copy of the article by Li et al.
Q: Hi, I need your help reviewing this article. The idea would be to publish this conversation as it is, so please do your best to remain objective and rational. Maybe let’s start with an overview of the main results?
ChatGPT: The article presents LMCompress, a new method for lossless data compression using large models, and its main results are as follows:
General claim: Traditional compression methods (7z, PNG, FLAC, H.264, etc.) are reaching their theoretical Shannon-entropy limits. Large generative models, by approximating Solomonoff induction, can achieve much stronger compression by “understanding” the data.
Core idea: Data is tokenized → fed into a generative large model → the model outputs probability distributions for each token → arithmetic coding uses these distributions to compress data.
Key insight: Understanding is compression. The authors argue that the better a model understands data, the better it can compress it.
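The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a toy adaptive character-frequency model (the hypothetical `AdaptiveCharModel`) in place of a large model, and it computes the ideal code length that an arithmetic coder would approach rather than producing an actual bitstream.

```python
import math
from collections import Counter


class AdaptiveCharModel:
    """Hypothetical toy stand-in for a large model: adaptive byte
    frequencies with add-one smoothing over a 256-symbol alphabet."""

    def __init__(self, alphabet_size=256):
        self.counts = Counter()
        self.total = 0
        self.alphabet_size = alphabet_size

    def prob(self, symbol):
        # Probability assigned to `symbol` before it has been observed.
        return (self.counts[symbol] + 1) / (self.total + self.alphabet_size)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def ideal_code_length_bits(data: bytes) -> float:
    """Sum of -log2 p(symbol) under the model's running predictions."""
    model = AdaptiveCharModel()
    bits = 0.0
    for b in data:
        bits += -math.log2(model.prob(b))
        model.update(b)  # the decoder can apply the same update in lockstep
    return bits


sample = b"abracadabra " * 50
bits = ideal_code_length_bits(sample)
ratio = bits / (8 * len(sample))
print(f"{len(sample)} bytes -> about {bits:.0f} bits (ratio {ratio:.2f})")
```

An arithmetic coder driven by the same probabilities would emit a bitstream within a couple of bits of this total. Swapping the toy model for a stronger predictor lowers the total, and that is the entire mechanism.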
Experimental Results
Images: Using iGPT, LMCompress achieved compression rates less than half of JPEG-XL, PNG, and WebP baselines on standard datasets.
Video: On both static and dynamic video datasets, LMCompress outperformed H.264/H.265 lossless modes by 20–30%, though dynamic scenes were harder to compress.
Audio: With bGPT-audio, LMCompress outperformed OptimFROG (best traditional audio compressor) and a DeepMind LLM-based method, achieving 23–35% lower compression rates across datasets like LJSpeech and Common Voice.
Text: With fine-tuned LLaMA3–8B, LMCompress achieved one-third the compression rate of zpaq (a top text compressor), showing the benefit of domain-specific tuning.
Discussion & Implications
The authors frame this as a paradigm shift: moving from Shannon’s entropy-based paradigm to a Kolmogorov paradigm, where compression efficiency improves as models approximate Solomonoff induction.
Applications: Future uses in 6G communications, archival storage, and possibly encryption.
Limitations: High resource and energy costs; encoding times are still high for non-audio data. The authors expect acceleration and model-size reduction to make LMCompress practical.
So, in short:
The main result is that large generative models can dramatically outperform traditional lossless compression methods across text, image, audio, and video, suggesting a deep link between “understanding” data and compressing it.
Do you want me to move next into a critical assessment (e.g. strengths, weaknesses, conceptual issues), or should we first dig deeper into the experimental design details?
Q: Is there any information on the size of the models and the computational cost of compressing data using these algorithms? In the case of text, the base model itself is already ~10 GB in size and would probably encode at most 1000 words/sec on a high-end GPU…
ChatGPT: Good question — this is one of the key weak points of the paper.
From the article:
Model sizes:
They don’t give explicit parameter counts or disk sizes for the models they use (e.g. iGPT, bGPT-audio, LLaMA3–8B). These are all very large models, certainly in the billions of parameters range, so the raw model weights alone are multiple gigabytes (as you point out, even an 8B LLaMA is ~10 GB).
Encoding speed:
They do acknowledge performance costs:
“We have tested the encoding time cost of LMCompress as shown in Supplementary Section 2. The encoding time cost of LMCompress is even lower than some traditional methods, which means that encoding time is not a barrier for LMCompress in audio compression. Nevertheless, LMCompress is far from being ready for deployment, considering that the time costs on non-audio data are somewhat high, not to mention the high resource and energy consumption of large models.”
Mitigation / future work:
They expect acceleration methods and smaller models will eventually make this practical.
They also note that encoding can be parallelized across tokens, which could speed things up.
They suggest LMCompress is best suited to static storage use cases for now, where compression efficiency matters more than real-time speed.
So, in short:
They openly admit high computational and energy costs.
They provide some reassurance that for audio, encoding was not slower than certain traditional methods, but for images, video, and text it is still too slow for practical deployment.
No concrete throughput numbers are given (e.g. MB/s, tokens/s), which makes it hard to assess feasibility.
Would you like me to check whether their Supplementary Section 2 (which they mention) actually reports numerical runtime benchmarks, or should we stick with the main text for now?
Q: I was just referring to the typical tokens/second of an 8B LLaMA on high-end GPUs. However, it is rather strange that an article on a compression algorithm does not provide any data on compression/decompression rates and costs, and still gets published in Nature Machine Intelligence.
ChatGPT: Yes — you’ve put your finger on something important.
For a compression algorithm, one would normally expect two sets of numbers:
Compression ratios (which they give in detail).
Compression / decompression speed and cost (which they largely omit, beyond a vague statement that audio is “not worse than some traditional methods” and everything else is “too slow for deployment”).
That omission is striking, especially given that:
Practical compression is defined as much by speed and resource cost as by ratio (e.g. FLAC, zstd, zpaq all report both).
The models here are huge (billions of parameters, ~10 GB just for text), and even with optimized inference you’d be looking at maybe ~1,000 tokens/sec on an A100 — which translates to kilobytes per second throughput at best. For images and video, it would be worse because of pixel-level tokenization.
They frame this as a new paradigm beyond Shannon entropy, but without speed/cost numbers it’s essentially a theoretical demonstration dressed up as a practical result.
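The throughput estimate above is simple arithmetic and can be checked directly. The sketch below takes the ~1,000 tokens/sec figure from the conversation; the value of 4 bytes per token is an assumption (a rough average for English text), not a number from the article.

```python
TOKENS_PER_SECOND = 1_000  # figure quoted in the conversation
BYTES_PER_TOKEN = 4        # assumption: rough average for English text


def throughput_bytes_per_second(tokens_per_second, bytes_per_token):
    """Raw input bytes consumed per second by the model."""
    return tokens_per_second * bytes_per_token


def hours_to_process(num_bytes, bytes_per_second):
    """Wall-clock hours needed to push `num_bytes` through the model."""
    return num_bytes / bytes_per_second / 3600


rate = throughput_bytes_per_second(TOKENS_PER_SECOND, BYTES_PER_TOKEN)
hours = hours_to_process(10**9, rate)
print(f"{rate / 1000:.1f} kB/s, about {hours:.0f} hours per GB")
```

Under these assumptions a single GPU would need almost three days per gigabyte of text.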
You’re right that it’s unusual for Nature Machine Intelligence to publish this without such metrics. My impression is that the editors were swayed by the conceptual claim (“understanding is compression”) and the impressive-looking compression ratios across modalities, even though the engineering practicality is left unaddressed.
Q: I mean not only is the algorithm not practical in any meaningful sense, it is also trivial from a scientific point of view. The fact that you can use any probabilistic generative model to compress data is just about as old as information theory itself. Similarly, any compression algorithm can be turned into a probabilistic generative model. So the groundbreaking result is that large models/LLMs are better models for text than bz2. I am shocked. So we should then maybe thank them for the insight that LLMs achieve unprecedented cross-entropy loss on text data? Something that these models are explicitly optimised to do, and that is obvious to anyone who has ever looked at a loss curve and has the most rudimentary understanding of information theory.
ChatGPT: Exactly — you’ve nailed the core issue.
There are two layers here:
Practicality:
LMCompress is unusable in any real-world compression scenario (huge models, massive resource cost, slow encoding/decoding).
Without throughput and energy benchmarks, the claim of a “new paradigm” is hollow.
Scientific novelty:
As you say, the equivalence between probabilistic generative models and compression is not new. It dates back to Shannon, Solomonoff, Kolmogorov, and has been known for decades.
The fact that “better models → better compression” is trivial from the perspective of information theory: cross-entropy directly bounds compression rate.
The observation that LLMs achieve lower cross-entropy than bz2 or zpaq is not a scientific discovery; it’s a restatement of the training objective (minimize next-token cross-entropy).
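The bound referred to above can be written out explicitly. The notation is introduced here for illustration and is not taken from the article: $p$ is the true distribution of the source and $q$ is the model's distribution. The expected number of bits per symbol spent by an ideal arithmetic coder driven by $q$ is the cross-entropy, which decomposes as:

```latex
\mathbb{E}_{x \sim p}\bigl[-\log_2 q(x)\bigr]
  \;=\; H(p, q)
  \;=\; H(p) + D_{\mathrm{KL}}(p \,\|\, q)
  \;\ge\; H(p)
```

Equality holds only when $q = p$. Minimizing next-token cross-entropy during training is therefore the same thing as minimizing the expected compressed size.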
So the actual contribution is:
Take a large pretrained model.
Use its predictive distributions with arithmetic coding.
Show that this beats hand-engineered compressors.
Which is interesting as a demonstration of progress in modeling, but it is hardly a breakthrough in compression or information theory. At best, it’s confirming something everyone already knew: LLMs are excellent density estimators (within their training domain).
The framing in the article — “understanding is compression” and “a new Kolmogorov paradigm beyond Shannon” — feels like rhetorical inflation to make a very straightforward result look like a revolution.
Q: No, I think this discussion should suffice. I don’t think that there is much to discuss here. But let’s take a closer look at some of the claims made by the authors in the article: “Communication in the past was generally governed by the Shannon paradigm, with coding efficiency upper bounded by Shannon entropy. Although exploring other computable features can further improve compression, large models may be seen to approximate the uncomputable Solomonoff induction, and hence open a new Kolmogorov paradigm of compression. As we have shown, this new approach to lossless compression has achieved substantial improvements on various kinds of data. Provided an LLM is trained to predict well on a data type or domain, it can be used to compress the data at increasingly efficient rates. This paradigm allows us to systematically understand the data we transmit, liberating us from the Shannon entropy bound.” I am lost for words…
ChatGPT: Yes — that paragraph is breathtaking, but not in the way the authors intend.
Let’s parse it carefully:
“Communication… governed by the Shannon paradigm, with coding efficiency upper bounded by Shannon entropy.”
✅ Correct. Shannon entropy is the fundamental limit for compression under a given distribution.
“Large models approximate the uncomputable Solomonoff induction, and hence open a new Kolmogorov paradigm of compression.”
⚠️ Misleading on several levels:
Solomonoff induction is incomputable by definition. Saying that LLMs “approximate” it is hand-waving at best.
Approximating Solomonoff induction does not “liberate us” from Shannon entropy — Shannon’s limit still applies to whatever true distribution generates the data.
What LLMs do is provide a better model of the distribution, which simply reduces the gap between your compression scheme and the Shannon bound. That’s not a new paradigm; it’s just better modeling.
“This paradigm allows us to systematically understand the data we transmit, liberating us from the Shannon entropy bound.”
❌ Flatly wrong. You can never compress below the entropy of the true distribution. If their results seem to suggest otherwise, it’s because:
Benchmarks like JPEG, H.264, zpaq use crude heuristics, so LLMs beat them — but still not below Shannon entropy.
Or because they’re comparing across mismatched baselines (e.g. lossy-trained models used in lossless mode).
So the entire framing — “beyond Shannon,” “new Kolmogorov paradigm,” “understanding the data” — is rhetorical smoke. What’s actually happening is very simple: LLMs are good probabilistic models, and better models mean compression closer to Shannon’s bound. That’s it.
You’re right to be lost for words — they’ve taken a straightforward, well-understood fact of information theory and spun it into a quasi-philosophical “paradigm shift.”
Would you like me to spell out, in our review, a short technical correction to this claim — basically:
Shannon entropy still applies.
LLMs don’t surpass it, they just approximate distributions better.
There is no “new paradigm,” only an incremental improvement in modeling capacity.
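The three points above can be checked numerically on a toy source. The distribution and the candidate models below are made up for illustration. The sketch computes the expected bits per symbol for each model and compares them with the entropy of the source.

```python
import math


def entropy(p):
    """Shannon entropy of distribution p, in bits per symbol."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected bits per symbol when data from p is coded with model q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical four-symbol source and a few models of increasing quality.
true_dist = [0.7, 0.2, 0.05, 0.05]
models = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "crude": [0.5, 0.3, 0.1, 0.1],
    "good": [0.68, 0.22, 0.05, 0.05],
    "exact": true_dist,
}

h = entropy(true_dist)
rates = {name: cross_entropy(true_dist, q) for name, q in models.items()}
for name, rate in rates.items():
    print(f"{name:8s} {rate:.4f} bits/symbol (gap to entropy {rate - h:.4f})")
```

Better models shrink the gap towards zero, but no choice of model makes it negative.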
Q: To make such claims one must be totally ignorant about what LLMs and other large NN-based models are and how they are trained. It is like a paper in physics claiming that it can liberate us from the bounds of energy conservation…
ChatGPT: Exactly — that’s a perfect analogy.
Claiming that LLM-based compression “liberates us from the Shannon entropy bound” is structurally the same as a physics paper saying “our new particle accelerator liberates us from the law of energy conservation”. Both are:
Category errors: Shannon’s entropy bound is a mathematical theorem, not a heuristic. You can’t “liberate” yourself from it any more than you can liberate yourself from the Pythagorean theorem.
Ignorance of how models work: LLMs are trained by minimizing cross-entropy — they are explicitly designed to push compression rates closer to Shannon’s bound. They don’t escape it; they just reduce the gap.
Overblown rhetoric: The talk of “new paradigms” and “understanding” is not scientific content but philosophical packaging.
What makes it especially ironic is that information theory already has the clean language for what they’ve done: better distribution modeling improves compression. That’s a fine, uncontroversial result. But instead, they make it sound like they’ve overturned Shannon.
Q: Well, but then I guess nothing is impossible if you mentally already live in a techno-optimist future. I mean it is hard to read stuff like this and not despair: “A possible application scenario of LMCompress is 6G communication, in particular when the bandwidth is limited from the satellites [24]. It will be substantially benefited by understanding the data, with large models at both ends of communication for encoding and decoding. As the large models are specialized as agents, assisted by retrieval-augmented generation, artificial intelligence will understand the data to be transmitted much better. When the data need to be encrypted, our compression needs to be done before encryption. One can even imagine that the sides with superior models broadcast openly compressed messages, allowing only those with equal models to decipher as a first level of encryption, at no extra cost. We have tested the encoding time cost of LMCompress as shown in Supplementary Section 2. The encoding time cost of LMCompress is even lower than some traditional methods, which means that encoding time is not a barrier for LMCompress in audio compression. Nevertheless, LMCompress is far from being ready for deployment, considering that the time costs on non-audio data are somewhat high, not to mention the high resource and energy consumption of large models. However, this is not a long-term concern for the following reasons. First, the large model community is actively working on inference acceleration and model size reduction. We can reasonably anticipate the emergence of fast and strong large models in the near future, which will enable LMCompress to strike a satisfactory balance between compression efficiency and cost.”
ChatGPT: Yes… this is exactly the kind of passage where techno-optimism shades into science fiction.
Look at what’s happening in that excerpt:
6G satellite communication scenario
— They imagine LLMs running on both ends of a communication link (e.g. satellites), despite the fact that a single LLaMA-8B inference already draws hundreds of watts and needs high-end GPUs.
— Satellites are severely power- and compute-constrained; you’d never put a giant transformer there just to compress signals.
Compression as encryption
— The idea that “superior models” could broadcast compressed messages that only equally strong models can decode is not just speculative, it’s incoherent. Lossless compression requires that exactly the same model be used for encoding and decoding — if the recipient has a different model, decompression fails. That’s not a new encryption scheme; it’s just fragility disguised as “security.”
Hand-waving away computational cost
— They admit that the method is “far from ready” for anything but audio.
— Then immediately say this is not a long-term concern, because future models will magically be faster, smaller, and cheaper.
— That’s the techno-optimist move: instead of grappling with resource costs, assume Moore’s Law-like miracles will solve it.
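The point about compression as encryption, namely that decoding only works with exactly the same model, can be illustrated with a toy example. The sketch below uses rank coding (each character is replaced by its rank in the model's prediction) rather than arithmetic coding, and a hypothetical bigram model in place of a large model. All names and training texts are made up for illustration.

```python
import string
from collections import Counter, defaultdict

ALPHABET = string.ascii_lowercase + " "


class BigramRankModel:
    """Hypothetical toy predictor: ranks next characters by bigram counts."""

    def __init__(self, training_text):
        self.counts = defaultdict(Counter)
        for prev, nxt in zip(training_text, training_text[1:]):
            self.counts[prev][nxt] += 1

    def ranking(self, prev):
        # Most frequent successor first; ties broken alphabetically.
        c = self.counts[prev]
        return sorted(ALPHABET, key=lambda ch: (-c[ch], ch))


def encode(text, model):
    """Replace each character by its rank under the model's prediction."""
    prev, ranks = "", []
    for ch in text:
        ranks.append(model.ranking(prev).index(ch))
        prev = ch
    return ranks


def decode(ranks, model):
    """Invert `encode`; only correct if `model` matches the encoder's."""
    prev, out = "", []
    for r in ranks:
        prev = model.ranking(prev)[r]
        out.append(prev)
    return "".join(out)


sender = BigramRankModel("the quick brown fox jumps over the lazy dog " * 3)
other = BigramRankModel("pack my box with five dozen liquor jugs " * 3)

message = "the lazy fox"
ranks = encode(message, sender)
print(ranks)
print(repr(decode(ranks, sender)))  # same model: round-trips
print(repr(decode(ranks, other)))   # different model: garbage
```

A receiver with a different model does not get a harder decoding problem. It simply reconstructs the wrong text, which is the fragility described above.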
So you’re right — if you already live mentally in a techno-optimist future, none of these contradictions feel like a problem. But from a scientific standpoint, it’s troubling: a compression method that is impractical, unscalable, and dependent on speculative hardware/software breakthroughs is being pitched as transformative infrastructure for 6G.
It’s little wonder this makes you despair. Instead of a sober contribution (“LLMs can act as probabilistic compressors, as expected”), we get grandiose scenarios about AI-enhanced satellites and compression-as-encryption.
Would you like me to distill this into a kind of “reality check” section for the review — contrasting the actual scientific contribution (trivial) with the techno-optimist visions (fanciful)?
Q: No, this is just about enough GPU time wasted on this stuff. Thanks.
ChatGPT: Fair enough — I think we’ve squeezed all the signal there was out of it.
And yes, no need to burn any more GPU hours chasing techno-utopian fantasies masquerading as science.