Feeds:
Posts

## Maximum entropy from Bayes’ theorem

The principle of maximum entropy asserts that when trying to determine an unknown probability distribution (for example, the distribution of possible results that occur when you toss a possibly unfair die), you should pick the distribution with maximum entropy consistent with your knowledge.

The goal of this post is to derive the principle of maximum entropy in the special case of probability distributions over finite sets from

• Bayes’ theorem and
• the principle of indifference: assign probability $\frac{1}{n}$ to each of $n$ possible outcomes if you have no additional knowledge. (The slogan in statistical mechanics is “all microstates are equally likely.”)

We’ll do this by deriving an arguably more fundamental principle of maximum relative entropy using only Bayes’ theorem.

## Finite noncommutative probability, the Born rule, and wave function collapse

The previous post on noncommutative probability was too long to leave much room for examples of random algebras. In this post we will describe all finite-dimensional random algebras with faithful states and all states on them. This will lead, in particular, to a derivation of the Born rule from statistical mechanics. We will then give a mathematical description of wave function collapse as taking a conditional expectation.

## A little more about zeta functions and statistical mechanics

In the previous post we described the following result characterizing the zeta distribution.

Theorem: Let $a_n = \mathbb{P}(X = n)$ be a probability distribution on $\mathbb{N}$. Suppose that the exponents in the prime factorization of $n$ are chosen independently and according to a geometric distribution, and further suppose that $a_n$ is monotonically decreasing. Then $a_n = \frac{1}{\zeta(s)} \left( \frac{1}{n^s} \right)$ for some real $s > 1$.

I have been thinking about the first condition, and I no longer like it. At least, I don’t like how I arrived at it. Here is a better way to conceptualize it: given that $n | X$, the probability distribution on $\frac{X}{n}$ should be the same as the original distribution on $X$. By Bayes’ theorem, this is equivalent to the condition that

$\displaystyle \frac{a_{mn}}{a_n + a_{2n} + a_{3n} + ...} = \frac{a_m}{a_1 + a_2 + ...}$

which in turn is equivalent to the condition that

$\displaystyle \frac{a_{mn}}{a_m} = \frac{a_n + a_{2n} + a_{3n} + ...}{a_1 + a_2 + a_3 + ...}$.

(I am adopting the natural assumption that $a_n > 0$ for all $n$. No sense in excluding a positive integer from any reasonable probability distribution on $\mathbb{N}$.) In other words, $\frac{a_{mn}}{a_m}$ is independent of $m$, from which it follows that $a_{mn} = c a_m a_n$ for some constant $c$. From here it already follows that $a_n$ is determined by $a_p$ for $p$ prime and that the exponents in the prime factorization are chosen geometrically. And now the condition that $a_n$ is monotonically decreasing gives the zeta distribution as before. So I think we should use the following characterization theorem instead.

Theorem: Let $a_n = \mathbb{P}(X = n)$ be a probability distribution on $\mathbb{N}$. Suppose that $a_{nm} = c a_n a_m$ for all $n, m \ge 1$ and some $c$, and further suppose that $a_n$ is monotonically decreasing. Then $a_n = \frac{1}{\zeta(s)} \left( \frac{1}{n^s} \right)$ for some real $s > 1$.

More generally, the following situation covers all the examples we have used so far. Let $M$ be a free commutative monoid on generators $p_1, p_2, ...$, and let $\phi : M \to \mathbb{R}$ be a homomorphism. Let $a_m = \mathbb{P}(X = m)$ be a probability distribution on $M$. Suppose that $a_{nm} = c a_n a_m$ for all $n, m \in M$ and some $c$, and further suppose that if $\phi(n) \ge \phi(m)$ then $a_n \le a_m$. Then $a_m = \frac{1}{\zeta_M(s)} e^{-\phi(m) s}$ for some $s$ such that the zeta function

$\displaystyle \zeta_M(s) = \sum_{m \in M} e^{-\phi(m) s}$

converges. Moreover, $\zeta_M(s)$ has the Euler product

$\displaystyle \zeta_M(s) = \prod_{i=1}^{\infty} \frac{1}{1 - e^{- \phi(p_i) s}}$.

Recall that in the statistical-mechanical interpretation, we are looking at a system whose states are finite collections of particles of types $p_1, p_2, ...$ and whose energies are given by $\phi(p_i)$; then the above is just the partition function. In the special case of the zeta function of a Dedekind abstract number ring, $M = M_R$ is the commutative monoid of nonzero ideals of $R$ under multiplication, which is free on the prime ideals by unique factorization, and $\phi(I) = \log N(I)$. In the special case of the dynamical zeta function of an invertible map $f : X \to X$, $M = M_X$ is the free commutative monoid on orbits of $f$ (equivalently, the invariant submonoid of the free commutative monoid on $X$), and $\phi(P) = \log |P|$, where $|P|$ is the number of points in $P$.

## Zeta functions, statistical mechanics and Haar measure

An interesting result that demonstrates, among other things, the ubiquity of $\pi$ in mathematics is that the probability that two random positive integers are relatively prime is $\frac{6}{\pi^2}$. A more revealing way to write this number is $\frac{1}{\zeta(2)}$, where

$\displaystyle \zeta(s) = \sum_{n \ge 1} \frac{1}{n^s}$

is the Riemann zeta function. A few weeks ago this result came up on math.SE in the following form: if you are standing at the origin in $\mathbb{R}^2$ and there is an infinitely thin tree placed at every integer lattice point, then $\frac{6}{\pi^2}$ is the proportion of the lattice points that you can see. In this post I’d like to explain why this “should” be true. This will give me a chance to blog about some material from another math.SE answer of mine which I’ve been meaning to get to, and along the way we’ll reach several other interesting destinations.

## Walks on graphs and statistical mechanics

I finally learned the solution to a little puzzle that’s been bothering me for awhile.

The setup of the puzzle is as follows. Let $G$ be a weighted undirected graph, e.g. to each edge $i \leftrightarrow j$ is associated a non-negative real number $a_{ij}$, and let $A$ be the corresponding weighted adjacency matrix. If $A$ is stochastic, one can interpret the weights $a_{ij}$ as transition probabilities between the vertices which describe a Markov chain. (The undirected condition then means that the transition probability between two states doesn’t depend on the order in which the transition occurs.) So one can talk about random walks on such a graph, and between any two vertices the most likely walk is the one which maximizes the product of the weights of the corresponding edges.

Suppose you don’t want to maximize a product associated to the edges, but a sum. For example, if the vertices of $G$ are locations to which you want to travel, then maybe you want the most likely random walk to also be the shortest one. If $E_{ij}$ is the distance between vertex $i$ and vertex $j$, then a natural way to do this is to set

$a_{ij} = e^{- \beta E_{ij}}$

where $\beta$ is some positive constant. Then the weight of a path is a monotonically decreasing function of its total length, and (fudging the stochastic constraint a bit) the most likely path between two vertices, at least if $\beta$ is sufficiently large, is going to be the shortest one. In fact, the larger $\beta$ is, the more likely you are to always be on the shortest path, since the contribution from any longer paths becomes vanishingly small. As $\beta \to \infty$, the ring in which the entries of the adjacency matrix lives stops being $\mathbb{R}$ and becomes (a version of) the tropical semiring.

That’s pretty cool, but it’s not what’s been puzzling me. What’s been puzzling me is that matrix entries in powers of $A$ look an awful lot like partition functions in statistical mechanics, with $\beta$ playing the role of the inverse temperature and $E_{ij}$ playing the role of energies. So, for awhile now, I’ve been wondering whether they actually are partition functions of systems I can construct starting from the matrix $A$. It turns out that the answer is yes: the corresponding systems are called one-dimensional vertex models, and in the literature the connection to matrix entries is called the transfer matrix method. I learned this from an expository article by Vaughan Jones, “In and around the origin of quantum groups,” and today I’d like to briefly explain how it works.