This is a post I wanted to write some time ago; I’ve forgotten why, but it was short and cute enough to finish. Our starting point is the following observation:

**Theorem 1:** Universal lossless compression is impossible. That is, there is no function which takes as input finite strings (over some fixed alphabet) and always produces as output shorter finite strings (over the same alphabet) in such a way that the latter is recoverable from the former.

Supposedly people sometimes claim to be able to do this. It is as impossible as constructing a perpetual motion machine, if not more so, and for much more easily understandable reasons.

*Proof. *The problem is simply that there are not enough short strings. To be specific, let’s work with just binary strings. There are strings of length , and

strings of length at most , which is smaller. Hence there’s no injective function from the set of binary strings of length to the set of binary strings of length at most , for any . The problem only gets worse if is replaced by a larger number, or if we try to compress strings of length at most instead of strings of length exactly .

Informally, another way to summarize the argument is that it takes bits to represent an arbitrary binary string of length , and less than bits to represent an arbitrary binary string of length at most .

This argument straightforwardly implies that any lossless compression algorithm which makes at least one of its inputs smaller necessarily makes at least one of its inputs larger, and more broadly draws attention to the set of inputs one might actually want to compress in practice. If you want to compress 1000×1000 images – strings of a million pixels – but you only want to compress, say, photos that humans might actually take, this is potentially much smaller than the set of all 1000×1000 images (for example, because most real images have the property that most of the time, a pixel is close in color to its neighbors), and so the theoretical lower bound on how well such images can be compressed might be much better.

More generally, the situation is improved if you have a probability distribution over inputs you want to compress with low entropy and you only want to compress well on average, which allows you to do something like Huffman coding. (This is a generalization because the uniform distribution over a set of size has entropy .) Probably all probability distributions describing real-life data have this property. And this is before taking into account the possibility of lossy compression.

What I like about Theorem 1 is that it is both incredibly easy to prove and a surprisingly deep and important fact. Here is, for example, another very important result that is arguably just a corollary of Theorem 1.

Consider what is in some sense the “ultimate” lossless compression scheme, namely, given a binary string of length , compress it to the shortest program, in some fixed programming language, which prints that string. The length of this program – let’s also write it as a binary string, using ASCII or equivalent – is called the Kolmogorov complexity of the string. Let’s call this **Kolmogorov compression**.

The sense in which Kolmogorov compression is the “ultimate” lossless compression scheme is that whatever decompression is, it surely ought to be computable or else we wouldn’t be able to do it (ignoring for the moment that Kolmogorov compression is uncomputable!), and any compression scheme whatsoever (computable or not) which compresses some complicated input string into a string of length , together with the decompression program, constitutes a program which prints , of size at most plus the size of the decompression program, which is a constant.

So the Kolmogorov complexity of a string is a lower bound, up to an additive constant, on the length of a string that compresses it, according to *any* compression scheme with a computable decompression function. (Of course you can come up with a silly compression scheme which “memorizes” any fixed string and compresses it arbitrarily small, but the price you pay is that the Kolmogorov complexity of this string is more or less a lower bound on the size of your decompression program, and also there’s no reason this would help you compress anything you actually care about.)

Now, Theorem 1, applied to Kolmogorov compression, implies the following.

**Theorem 2:** For every , there exist strings of length whose Kolmogorov complexity is at least .

In other words, there exist incompressible strings, whose shortest descriptions are just those strings themselves. In fact, most strings are basically incompressible in this sense; a slight generalization of Theorem 1 shows that at most of all binary strings of length can be compressed to have length at most . For example, taking , less than 1% of strings of length can have their length reduced by or more by any fixed compression scheme.

This observation suggests an important connection between incompressibility and randomness. On the one hand, a randomly chosen string is incompressible with high probability. On the other hand, incompressible strings “look random” – they don’t have any patterns in them, because patterns can be described by short programs.

on October 31, 2020 at 7:07 pm |RafaelI think you should do a Phd in Switzerland (ETH Zürich, for instance). You’d be a great professor at an University and a excellent researcher!

on October 16, 2020 at 6:44 am |J.P. McCarthyGreat to see you active again mathematically and I hope you are well.

on November 1, 2020 at 12:24 pm |Qiaochu YuanThanks!

on October 15, 2020 at 8:27 pm |Arnav DhariyaI am a freshman in high school and I am passionate about mathematics(currently doing precalc through AoPS) and physics(currently doing AP physics 1 from Khan Academy). I am looking to improve both of them. Can you recommend some online resources to do that ?

on October 15, 2020 at 8:28 pm |Arnav DhariyaThank you!!

on September 27, 2020 at 2:47 pm |iconjackGood article, as usual. Another way to prove Theorem 1: Let f(s) be the compression of string s, and assume f(s) is always shorter than s. Then f(f(s)) is even shorter, and f(f(f(s))) shorter still, and so on. But this contradicts the well-ordering principle of the integers.

on October 1, 2020 at 10:41 am |Qiaochu YuanThanks! Yep. Alternatively, without contradiction, repeatedly applying f eventually produces a string of the same length or longer.