In lieu of reposting my old blog posts I thought I’d revisit some of their topics instead. In an old post of mine I set this inequality as a practice problem. I had an easy but uninspired proof in mind, but recently I learned that this problem is an exercise in the first chapter of The probabilistic method by Alon and Spencer. The probabilistic proof is quite elegant, so I thought it was worth presenting.
The “follow-your-nose” proof
(Below “code” will mean “prefix code.”)
The idea behind the inequality is that there is a trade-off among codewords: having a particular small codeword forbids you from having many other larger codewords. Define the weight of a codeword of length to be , and define the weight of a code to be the sum of the weights of its codewords. Then we want to show that prefix codes cannot be too heavy.
An example of a code of weight is the code consisting of precisely the letters of the alphabet. This code has the property that it is not possible to add more words to it without violating the prefix condition; call such a code maximal. Clearly any code is a subcode of a maximal code, so it suffices to show that every maximal code has weight at most . (As we will see, they in fact have weight exactly .)
Given a codeword of length one may replace it by all the codewords of length whose prefix is that codeword; call this lengthening the codeword. The important observation here is that lengthening a codeword leaves the weight of the code invariant; moreover, lengthening preserves the property of being a maximal code. The following claim will then prove the original assertion.
Claim: Any maximal code can be obtained from by a “convergent sequence” of lengthenings, in the sense that for any there exists such that at any time after the first lengthenings the codewords of length at most are correct.
Proof. This is an example of a just-do-it proof. Construct a sequence of codes as follows: starting with , at step lengthen every codeword of length in that is not in and call the new code . This algorithm guarantees that after step the codes agrees with at least up to codewords of length , since lengthening a codeword of length can only cause “disagreements” for codewords of length at least . And we are done.
The topology we’re using here is essentially the -adic topology on the generating functions of these codes. Consideration of these generating functions leads to another proof of this result, but the Wikipedia article does a good job of explaining it, so I won’t.
The probabilistic proof
Write down letters from the alphabet at random. If at any point you write down a codeword, stop. This defines a probability distribution on the set of codewords, and since no codeword is a prefix of another codeword the probability of reaching a codeword of length is exactly .
It follows that is the probability that this process never terminates.
The generating functions proof generalizes to the Kraft-McMillan inequality for uniquely decodable codes. This is essentially the reason people care about prefix codes: given a concatenation of codewords one can recover each codeword uniquely. This is the motivating idea behind the generating functions proof, but the Kraft-McMillan inequality is also listed as an exercise to be done via the probabilistic method. Unfortunately, I haven’t yet found this more general proof.
I am really curious whether the prefix-code result could be subsumed under a generalization of Dilworth’s theorem. A prefix code is an antichain in the poset of words ordered by prefix; what is the analogue of the size of a chain decomposition in this setting? Alternately, how can Dilworth’s theorem be stated (and proven?) as a probabilistic result?