Kraft’s inequality: Let be a prefix code on an alphabet of size
, and let
denote the number of codewords of length
. Then
.
In lieu of reposting my old blog posts I thought I’d revisit some of their topics instead. In an old post of mine I set this inequality as a practice problem. I had an easy but uninspired proof in mind, but recently I learned that this problem is an exercise in the first chapter of The probabilistic method by Alon and Spencer. The probabilistic proof is quite elegant, so I thought it was worth presenting.
The “follow-your-nose” proof
(Below “code” will mean “prefix code.”)
The idea behind the inequality is that there is a trade-off among codewords: having a particular small codeword forbids you from having many other larger codewords. Define the weight of a codeword of length to be
, and define the weight of a code to be the sum of the weights of its codewords. Then we want to show that prefix codes cannot be too heavy.
An example of a code of weight is the code
consisting of precisely the letters of the alphabet. This code has the property that it is not possible to add more words to it without violating the prefix condition; call such a code maximal. Clearly any code is a subcode of a maximal code, so it suffices to show that every maximal code has weight at most
. (As we will see, they in fact have weight exactly
.)
Given a codeword of length one may replace it by all the codewords of length
whose prefix is that codeword; call this lengthening the codeword. The important observation here is that lengthening a codeword leaves the weight of the code invariant; moreover, lengthening preserves the property of being a maximal code. The following claim will then prove the original assertion.
Claim: Any maximal code can be obtained from
by a “convergent sequence” of lengthenings, in the sense that for any
there exists
such that at any time after the first
lengthenings the codewords of length at most
are correct.
Proof. This is an example of a just-do-it proof. Construct a sequence of codes as follows: starting with
, at step
lengthen every codeword of length
in
that is not in
and call the new code
. This algorithm guarantees that after step
the codes
agrees with
at least up to codewords of length
, since lengthening a codeword of length
can only cause “disagreements” for codewords of length at least
. And we are done.
The topology we’re using here is essentially the -adic topology on the generating functions
of these codes. Consideration of these generating functions leads to another proof of this result, but the Wikipedia article does a good job of explaining it, so I won’t.
The probabilistic proof
Write down letters from the alphabet at random. If at any point you write down a codeword, stop. This defines a probability distribution on the set of codewords, and since no codeword is a prefix of another codeword the probability of reaching a codeword of length is exactly
.
It follows that is the probability that this process never terminates.
Generalizations
The generating functions proof generalizes to the Kraft-McMillan inequality for uniquely decodable codes. This is essentially the reason people care about prefix codes: given a concatenation of codewords one can recover each codeword uniquely. This is the motivating idea behind the generating functions proof, but the Kraft-McMillan inequality is also listed as an exercise to be done via the probabilistic method. Unfortunately, I haven’t yet found this more general proof.
I am really curious whether the prefix-code result could be subsumed under a generalization of Dilworth’s theorem. A prefix code is an antichain in the poset of words ordered by prefix; what is the analogue of the size of a chain decomposition in this setting? Alternately, how can Dilworth’s theorem be stated (and proven?) as a probabilistic result?
If you don’t care about generalizing to uniquely decodable codes, then Kraft’s inequality follows because a prefix code (say, a binary one) can be represented by a sequence of subintervals of $[0, 1]$. Namely, we map a sequence
to the interval of length
starting at
(in the binary expansion). The prefix-free condition means that the intervals don’t overlap, so the fact their total measure is at most one is precisely the inequality in question.
[…] known results, but clever applications can give upper bounds as well, for instance in the proof of Kraft’s inequality. Of particular interest to us is the fact that it provides our first nontrivial proof in the […]
[…] Yuan writes about Kraft’s inequality for prefix codes and the probabilistic method, and also on linear relationships between square […]
Now that I think about it, the LYM inequality is another instance of this very general result that doesn’t actually seem to exist, and is way more obviously similar to the Kraft inequality.
You probably thought of this already, but it should be stated outright: In particular, note that both Kraft’s inequality and Dilworth’s theorem have to do with maximal weights of antichains in a poset with weighted elements. (Dilworth’s theorem just sets all the weights to 1.)