.

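It turns out that the equivalence class of an $n \times m$ matrix $M$ is completely determined by its rank $r = \operatorname{rank}(M)$. To prove this we construct some bases by induction. For starters, let $x_1 \in \mathbb{R}^m$ be a vector such that $y_1 = M x_1 \neq 0$; this is always possible unless $M = 0$. Next, let $x_2 \in \mathbb{R}^m$ be a vector such that $y_2 = M x_2$ is linearly independent of $y_1$; this is always possible unless $r = 1$.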

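Continuing in this way, we construct vectors $x_1, \dots, x_r \in \mathbb{R}^m$ such that the vectors $y_1 = M x_1, \dots, y_r = M x_r \in \mathbb{R}^n$ are linearly independent, hence a basis of the column space of $M$. Next, we complete the $x_i$ and $y_i$ to bases of $\mathbb{R}^m$ and $\mathbb{R}^n$: the $y_i$ in whatever manner we like, and the $x_i$ by adding a basis $x_{r+1}, \dots, x_m$ of the null space of $M$ (this gives a basis because the span of $x_1, \dots, x_r$ meets the null space only in $0$, and the null space has dimension $m - r$). With respect to these bases, $M$ takes a very simple form: we have $M x_i = y_i$ if $1 \le i \le r$ and otherwise $M x_i = 0$. Hence, in these bases, $M$ is a block matrix where the top left block is an $r \times r$ identity matrix and the other blocks are zero.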

Explicitly, this means we can write $M$ as a product

$$M = P D Q^{-1}$$

where $D$ has the block form above, the columns of $P$ are the basis of $\mathbb{R}^n$ we found by completing $y_1, \dots, y_r$, and the columns of $Q$ are the basis of $\mathbb{R}^m$ we found by completing $x_1, \dots, x_r$. This decomposition can be computed by row and column reduction on $M$, where the row operations we perform determine $P$ and the column operations we perform determine $Q$.

Conceptually, the question we’ve asked is: what does a linear transformation $T : V \to W$ between vector spaces “look like,” when we don’t restrict ourselves to picking a particular basis of $V$ or $W$? The answer, stated in a basis-independent form, is the following. First, we can factor $T$ as a composite

$$V \xrightarrow{p} \operatorname{im}(T) \xrightarrow{i} W$$

where $\operatorname{im}(T)$ is the image of $T$. Next, we can find direct sum decompositions $V \cong \operatorname{im}(T) \oplus V'$ and $W \cong \operatorname{im}(T) \oplus W'$ such that $p$ is the projection of $V$ onto its first factor and $i$ is the inclusion of the first factor into $W$. Hence every linear transformation “looks like” a composite

$$\operatorname{im}(T) \oplus V' \to \operatorname{im}(T) \to \operatorname{im}(T) \oplus W'$$

of a projection onto a direct summand and an inclusion of a direct summand. So the only basis-independent information contained in $T$ is the dimension of the image $\operatorname{im}(T)$, or equivalently the rank of $T$. (It’s worth considering the analogous question for functions between sets, whose answer is a bit more complicated.)

The actual problem this blog post is about is more interesting: it is to classify $n \times m$ matrices $M$ up to *orthogonal* change of basis in both the source and the target. In other words, we now want to understand the equivalence classes of the equivalence relation given by

$$M \sim N \Leftrightarrow M = P N Q^{-1}, \quad P \in O(n), \ Q \in O(m).$$

Conceptually, we’re now asking: what does a linear transformation between finite-dimensional *Hilbert spaces* “look like”?

**Inventing singular value decomposition**

As before, we’ll answer this question by picking bases with respect to which $M$ is as easy to understand as possible, only this time we need to deal with the additional restriction of choosing orthonormal bases. We will follow roughly the same inductive strategy as before. For starters, we would like to pick a unit vector $v_1 \in \mathbb{R}^m$ such that $M v_1 \neq 0$; this is possible unless $M$ is identically zero, in which case there’s not much to say. Now, there’s no guarantee that $M v_1$ will be a unit vector, but we can always use

$$u_1 = \frac{M v_1}{\| M v_1 \|}$$

as the beginning of an orthonormal basis of $\mathbb{R}^n$. The question remains which of the many possible values of $v_1$ to use. In the previous argument it didn’t matter because they were all related by change of coordinates, but now it very much does because the length $\| M v_1 \|$ may differ for different choices of $v_1$. A natural choice is to pick $v_1$ so that $\| M v_1 \|$ is as large as possible (hence equal to the **operator norm** $\| M \|$ of $M$); writing $\sigma_1 = \| M v_1 \|$, we then have

$$M v_1 = \sigma_1 u_1, \quad \| v_1 \| = \| u_1 \| = 1.$$

$\sigma_1$ is called the **first singular value** of $M$, $v_1$ is called its **first right singular vector**, and $u_1$ is called its **first left singular vector**. (The singular vectors aren’t unique in general, but we’ll ignore this for now.) To continue building orthonormal bases we need to find a unit vector

$$v_2 \in \mathbb{R}^m$$

orthogonal to $v_1$ such that $M v_2$ is linearly independent of $u_1$; this is possible unless $M$ has rank $1$, in which case we’re already done and $M$ is completely describable as $x \mapsto \sigma_1 \langle v_1, x \rangle u_1$; equivalently, in this case we have

$$M = \sigma_1 u_1 v_1^T.$$

We’ll pick using the same strategy as before: we want the value of such that is as large as possible. Note that since , this is equivalent to finding the value of such that is as large as possible. Call this largest possible value and write

$$M v_2 = \sigma_2 u_2.$$

At this point we are in trouble unless $u_2$ is orthogonal to $u_1$; if this weren’t the case then our strategy would fail to actually build an orthonormal basis of $\mathbb{R}^n$. Very importantly, this turns out to be the case.

**Key lemma #1:** Suppose $v$ is a unit vector maximizing $\| M v \|$. Let $v'$ be a unit vector orthogonal to $v$. Then $M v'$ is also orthogonal to $M v$.

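*Proof.* Consider the function

$$f(t) = \| M (v \cos t + v' \sin t) \|^2.$$

The vectors $v \cos t + v' \sin t$ are all unit vectors since $v, v'$ are orthonormal, so by construction (of $v$) this function is maximized when $t = 0$. In particular, its derivative at $t = 0$ is zero. On the other hand, we can expand $f(t)$ out using dot products as

$$f(t) = \| M v \|^2 \cos^2 t + 2 \langle M v, M v' \rangle \cos t \sin t + \| M v' \|^2 \sin^2 t.$$

Now we can compute the first-order Taylor series expansion of this function around $t = 0$, giving

$$f(t) = \| M v \|^2 + 2 \langle M v, M v' \rangle t + O(t^2)$$

so setting the first derivative equal to $0$ gives $\langle M v, M v' \rangle = 0$ as desired.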

This is the technical heart of singular value decomposition, so it’s worth understanding in some detail. Michael Nielsen has a very nice interactive demo / explanation of this. Geometrically, the points $M (v \cos t + v' \sin t)$ trace out an ellipse centered at the origin, and by hypothesis $M v$ describes the semimajor axis of the ellipse: the point furthest away from the origin. As we move away from $M v$, to first order we are moving slightly in the direction of $M v'$, and so if $M v'$ were not orthogonal to $M v$ it would be possible to move slightly further away from the origin than $M v$ by moving either in the positive or negative direction, depending on whether the angle between $M v$ and $M v'$ is greater than or less than $90^{\circ}$. The only way to ensure that moving in the direction of $M v'$ does not, to first order, get us further away from the origin is if $M v'$ is orthogonal to $M v$.

Note that this gives a proof that the semiminor axis of an ellipse – the point closest to the origin – is always orthogonal to its semimajor axis. We can think of key lemma #1 above as more or less being equivalent to this fact, also known as the principal axis theorem in the plane, and which is also closely related to but slightly weaker than the spectral theorem for real matrices.

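Thanks to key lemma #1, we can continue our construction. With $\sigma_1, u_1, v_1$ as before, we inductively produce orthonormal vectors $v_1, \dots, v_r \in \mathbb{R}^m$ such that $\| M v_i \|$ is maximized subject to the condition that $\langle v_i, v_j \rangle = 0$ for all $j < i$, and write

$$M v_i = \sigma_i u_i$$

where $\sigma_i = \| M v_i \|$ is the maximum value of $\| M v \|$ on all unit vectors $v$ orthogonal to $v_1, \dots, v_{i-1}$; note that this implies that

$$\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0.$$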

The $\sigma_i$ are the **singular values** of $M$, the $v_i$ are its **right singular vectors**, and the $u_i$ are its **left singular vectors**. Repeated application of key lemma #1 shows that the $u_i$ are an orthonormal basis of the column space of $M$, so the construction stops here: $M$ is identically zero on the orthogonal complement of $\operatorname{span}(v_1, \dots, v_r)$, because if it weren’t then it would take a nonzero value orthogonal to $u_1, \dots, u_r$. This means we can write $M$ as a sum

$$M = \sum_{i=1}^r \sigma_i u_i v_i^T.$$
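The inductive construction above can be imitated numerically. What follows is a minimal sketch, assuming NumPy is available: power iteration on $M^T M$, restricted to the orthogonal complement of the vectors already found, is swapped in here as a stand-in for "pick the unit vector maximizing $\| M v \|$", and the function names `greedy_svd` and `project_out` are hypothetical. The test matrix is built to have singular values 3, 2, 1 so that the iteration converges quickly.

```python
import numpy as np

def project_out(v, vs):
    """Remove from v its components along the orthonormal vectors in vs."""
    for w in vs:
        v = v - (w @ v) * w
    return v

def greedy_svd(M, iters=500, tol=1e-10, seed=0):
    """Find v_1, v_2, ... one at a time, each (approximately) maximizing |Mv|
    among unit vectors orthogonal to the ones found so far."""
    rng = np.random.default_rng(seed)
    sigmas, us, vs = [], [], []
    for _ in range(min(M.shape)):
        v = project_out(rng.standard_normal(M.shape[1]), vs)
        v /= np.linalg.norm(v)
        for _ in range(iters):
            w = project_out(M.T @ (M @ v), vs)  # power step, kept orthogonal to earlier v's
            if np.linalg.norm(w) < tol:         # M vanishes on what is left
                break
            v = w / np.linalg.norm(w)
        sigma = np.linalg.norm(M @ v)
        if sigma < tol:                         # rank reached: stop the construction
            break
        sigmas.append(sigma)
        us.append(M @ v / sigma)                # u_i = M v_i / sigma_i
        vs.append(v)
    return np.array(sigmas), np.array(us).T, np.array(vs).T

# A 4 x 3 test matrix with prescribed singular values 3, 2, 1.
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.standard_normal((4, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((3, 3)))
M = U0 @ np.diag([3.0, 2.0, 1.0]) @ V0.T

s, U, V = greedy_svd(M)
print(s)                                   # singular values found greedily
print(np.round(U.T @ U, 6))                # the u_i come out orthonormal (key lemma #1)
print(np.allclose(U @ np.diag(s) @ V.T, M, atol=1e-6))
```

Note that the code never orthogonalizes the $u_i$; their orthogonality is exactly what key lemma #1 predicts.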

This is one version of the **singular value decomposition** (SVD for short) of $M$, and it has the benefit of being as close to unique as possible. A more familiar version of SVD is obtained by completing the $u_i$ and $v_i$ to orthonormal bases of $\mathbb{R}^n$ and $\mathbb{R}^m$ (necessarily highly non-unique in general). With respect to these bases, $M$ takes, similar to the warm-up, a block form where the top left block is the diagonal matrix with entries $\sigma_1, \dots, \sigma_r$ and the remaining blocks are zero. Hence we can write $M$ as a product

$$M = U \Sigma V^T$$

where $\Sigma$ has the above block form, $U$ has columns given by $u_i$, and $V$ has columns given by $v_i$.
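As a numerical sanity check of the two forms of the decomposition (the sum of rank-one terms and the product of an orthogonal matrix, a block-diagonal matrix, and another orthogonal matrix), here is a small sketch using NumPy's `numpy.linalg.svd`. The variable names and the random test matrix are illustrative assumptions, not part of the original argument.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))           # a generic 4 x 3 matrix

# full_matrices=True returns square orthogonal factors: U is 4 x 4, Vt is 3 x 3.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Block form: singular values on the diagonal of the top-left block, zeros elsewhere.
Sigma = np.zeros(M.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

product_form = U @ Sigma @ Vt
# Sum of rank-one terms sigma_i * u_i * v_i^T.
sum_form = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

print(np.allclose(M, product_form), np.allclose(M, sum_form))
print(np.isclose(s[0], np.linalg.norm(M, 2)))  # largest singular value vs. operator norm
```

NumPy returns the transposed right factor directly (here `Vt`), so its rows, not its columns, are the right singular vectors.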

So, stepping back a bit: what have we learned about what a linear transformation between Hilbert spaces looks like? Up to orthogonal change of basis, we’ve learned that they all look like “weighted projections”: we are almost projecting onto the image as in the warmup, except with weights given by the singular values to account for changes in length. The only orthogonal-basis-independent information contained in a linear transformation turns out to be its singular values.

Looking for more analogies between singular value decomposition and the warmup, we might think of the singular values as a quantitative refinement of the rank, since there are $r$ of them where $r$ is the rank, and if some of them are small then $M$ is close (in the operator norm) to a linear transformation having lower rank.

Geometrically, one way to describe the answer provided by singular value decomposition to the question “what does a linear transformation look like” is that the key to understanding $M$ is to understand what it does to the unit sphere of $\mathbb{R}^m$. The image of the unit sphere is an $r$-dimensional ellipsoid, and its principal axes have direction given by the left singular vectors $u_i$ and lengths given by the singular values $\sigma_i$. The right singular vectors $v_i$ map to these.

**Properties**

Singular value decomposition has lots of useful properties, some of which we’ll prove here. First, note that taking the transpose of a singular value decomposition gives another singular value decomposition

$$M^T = V \Sigma^T U^T$$

showing that $M^T$ has the same singular values as $M$, but with the left and right singular vectors swapped. This can be proven more conceptually as follows.

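**Key lemma #2:** Write $\langle u, M v \rangle = u^T M v$ for $u \in \mathbb{R}^n, v \in \mathbb{R}^m$. Then for every $i$, the left and right singular vectors $u_i, v_i$ maximize the value of $\langle u, M v \rangle$ subject to the constraint that $u$ and $v$ are unit vectors, that $u$ is orthogonal to $u_1, \dots, u_{i-1}$, and that $v$ is orthogonal to $v_1, \dots, v_{i-1}$. This maximum value is $\sigma_i$.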

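*Proof.* At the maximum value of $\langle u, M v \rangle$ subject to the constraint that $u$ is orthogonal to $u_1, \dots, u_{i-1}$ and $v$ is orthogonal to $v_1, \dots, v_{i-1}$, it must also be the case that if we fix $v$ then $\langle u, M v \rangle$ takes its maximum value at $u$. But for fixed $v$, $\langle u, M v \rangle$ uniquely takes its maximum value when $u$ is proportional to $M v$ (if possible), hence $u$ must in fact be equal to $\frac{M v}{\| M v \|}$; moreover, this is always possible thanks to key lemma #1. So we are in fact maximizing

$$\left\langle \frac{M v}{\| M v \|}, M v \right\rangle = \| M v \|$$

subject to the above constraints and we already know the solution is given by $v = v_i$.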

**Left-right symmetry:** Let $\sigma_i, u_i, v_i$ be the singular values, left singular vectors, and right singular vectors of $M$ as above. Then $\sigma_i, v_i, u_i$ are the singular values, left singular vectors, and right singular vectors of $M^T$. In particular, $M^T u_i = \sigma_i v_i$.

*Proof.* Apply key lemma #2 to $M^T$, and note that $\langle u, M v \rangle = \langle M^T u, v \rangle$ is the same either way, just with the roles of $u$ and $v$ switched.

**Singular = eigen:** The left singular vectors $u_i$ are the eigenvectors of $M M^T$ corresponding to its nonzero eigenvalues, which are $\sigma_i^2$. The right singular vectors $v_i$ are the eigenvectors of $M^T M$ corresponding to its nonzero eigenvalues, which are also $\sigma_i^2$.

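*Proof.* We now know that $M v_i = \sigma_i u_i$ and that $M^T u_i = \sigma_i v_i$, hence

$$M M^T u_i = \sigma_i M v_i = \sigma_i^2 u_i$$

and

$$M^T M v_i = \sigma_i M^T u_i = \sigma_i^2 v_i.$$

Hence $u_i, v_i$ are orthonormal eigenvectors of $M M^T, M^T M$ respectively. Moreover, these matrices have rank at most (in fact exactly) $r$, so this exhausts all eigenvectors corresponding to nonzero eigenvalues.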
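The relationship between singular values and eigenvalues is easy to check numerically. The sketch below (an illustration with NumPy on an assumed random $5 \times 3$ matrix, not part of the proof) compares the squared singular values of a matrix with the eigenvalues of its two Gram matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))                        # generic, so of rank 3

s = np.linalg.svd(M, compute_uv=False)                 # singular values, descending

# eigvalsh is for symmetric matrices and returns eigenvalues in ascending order.
eig_MtM = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]   # 3 eigenvalues
eig_MMt = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]   # 5 eigenvalues

print(np.allclose(eig_MtM, s**2))             # eigenvalues of M^T M are sigma_i^2
print(np.allclose(eig_MMt[:3], s**2))         # same nonzero eigenvalues for M M^T
print(np.allclose(eig_MMt[3:], 0, atol=1e-10))  # the remaining eigenvalues vanish
```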

This gives an alternative route to understanding singular value decomposition which comes from writing $M^T M$ as

$$M^T M = V \Sigma^T \Sigma V^T$$

and then applying the spectral theorem to $M^T M$ to diagonalize, but I think it’s worth knowing that there’s a route to singular value decomposition which is independent of the spectral theorem.

In addition to the above algebraic characterization of singular values, the singular values also admit the following variational characterization.

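**Variational characterizations of singular values (Courant, Fischer):** We have

$$\sigma_i = \max_{S \subseteq \mathbb{R}^m, \, \dim S = i} \ \min_{v \in S, \, \| v \| = 1} \| M v \|$$

and

$$\sigma_i = \min_{S \subseteq \mathbb{R}^m, \, \dim S = m - i + 1} \ \max_{v \in S, \, \| v \| = 1} \| M v \|.$$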

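*Proof.* For the first characterization, complete the $v_i$ to an orthonormal basis $v_1, \dots, v_m$ of $\mathbb{R}^m$ and set $\sigma_j = 0$ for $j > r$. Any $i$-dimensional subspace $S$ intersects $\operatorname{span}(v_i, \dots, v_m)$ nontrivially, hence contains a unit vector of the form

$$v = \sum_{j=i}^m c_j v_j, \quad \sum_{j=i}^m c_j^2 = 1.$$

We compute that

$$M v = \sum_{j=i}^m c_j \sigma_j u_j$$

and hence that

$$\| M v \|^2 = \sum_{j=i}^m c_j^2 \sigma_j^2 \le \sigma_i^2.$$

We conclude that every $S$ contains a unit vector $v$ such that $\| M v \| \le \sigma_i$, hence $\min_{v \in S, \| v \| = 1} \| M v \| \le \sigma_i$. Equality is obtained when $S = \operatorname{span}(v_1, \dots, v_i)$.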

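The second characterization is very similar. Any $(m - i + 1)$-dimensional subspace $S$ intersects $\operatorname{span}(v_1, \dots, v_i)$ nontrivially, hence contains a unit vector of the form

$$v = \sum_{j=1}^i c_j v_j, \quad \sum_{j=1}^i c_j^2 = 1.$$

We compute that

$$M v = \sum_{j=1}^i c_j \sigma_j u_j$$

and hence that

$$\| M v \|^2 = \sum_{j=1}^i c_j^2 \sigma_j^2 \ge \sigma_i^2.$$

We conclude that every $S$ contains a unit vector $v$ such that $\| M v \| \ge \sigma_i$, hence $\max_{v \in S, \| v \| = 1} \| M v \| \ge \sigma_i$. Equality is obtained when $S$ is the orthogonal complement of $\operatorname{span}(v_1, \dots, v_{i-1})$.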

The second variational characterization above can be used to prove the following important theorem.

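**Low rank approximation (Eckart, Young):** If $M = U \Sigma V^T$ is the SVD of $M$, let $M_k = U \Sigma_k V^T$ where $\Sigma_k$ has diagonal entries $\sigma_1, \dots, \sigma_k$ and all other entries zero. Then $M_k$ is the closest matrix to $M$ in operator norm with rank at most $k$; that is, $M_k$ minimizes $\| M - N \|$ subject to the constraint that $\operatorname{rank}(N) \le k$. This minimum value is $\sigma_{k+1}$.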

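*Proof.* Suppose $N$ is a matrix of rank at most $k$. Let $S$ be the nullspace of $N$, which by hypothesis has dimension at least $m - k$. By the second variational characterization above, this means that it contains a unit vector $v$ such that $\| M v \| \ge \sigma_{k+1}$, and since $N v = 0$ this gives

$$\| (M - N) v \| = \| M v \| \ge \sigma_{k+1}$$

and hence that $\| M - N \| \ge \sigma_{k+1}$. Equality is obtained when $N = M_k$ as defined above.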
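To see the low rank approximation theorem in action, the following sketch (NumPy, with an assumed random $6 \times 5$ matrix and $k = 2$) truncates the SVD and compares the operator-norm error with the next singular value. The random rank-$k$ competitor `B` is only a spot check of the inequality, not a proof of optimality.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
# Keep only the k largest singular values and their singular vectors.
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err = np.linalg.norm(M - M_k, 2)     # operator-norm distance from M to M_k
print(err, s[k])                     # s[k] is sigma_{k+1} with 0-based indexing

# An arbitrary competitor of rank at most k does no better.
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
print(np.linalg.norm(M - B, 2) >= s[k])
```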

The variational characterizations can also be used to prove the following inequality relating the singular values of two matrices and of their sum, which can be thought of as a quantitative refinement of the observation that the rank of a sum of two matrices is at most the sum of their ranks.

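**Additive perturbation (Weyl):** Let $M, N$ be $n \times m$ matrices with singular values $\sigma_i(M), \sigma_i(N)$. Then, for $i, j \ge 0$,

$$\sigma_{i+j+1}(M + N) \le \sigma_{i+1}(M) + \sigma_{j+1}(N).$$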

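*Proof.* We want to bound $\sigma_{i+j+1}(M + N)$ in terms of the singular values of $M$ and $N$. By the second variational characterization, we have

$$\sigma_{i+j+1}(M + N) = \min_{\dim S = m - i - j} \ \max_{v \in S, \, \| v \| = 1} \| (M + N) v \|.$$

To give an upper bound on a minimum value of a function we just need to give an upper bound on some value that it takes. Let $S_M$ and $S_N$ be the subspaces of $\mathbb{R}^m$ of dimensions $m - i, m - j$ respectively which achieve the minimum values of $\sigma_{i+1}(M)$ and $\sigma_{j+1}(N)$ respectively, and let $S_M \cap S_N$ be their intersection. This intersection has dimension at least $m - i - j$, and by construction

$$\max_{v \in S_M \cap S_N, \, \| v \| = 1} \| (M + N) v \| \le \max_{v \in S_M, \, \| v \| = 1} \| M v \| + \max_{v \in S_N, \, \| v \| = 1} \| N v \| = \sigma_{i+1}(M) + \sigma_{j+1}(N).$$

Since $S_M \cap S_N$ has dimension at least $m - i - j$, the above is an upper bound on the value of $\max_{v \in S, \| v \| = 1} \| (M + N) v \|$ for any $(m - i - j)$-dimensional subspace $S \subseteq S_M \cap S_N$, from which the conclusion follows.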

The slightly curious off-by-one indexing in the above inequality can be understood as follows: if $\sigma_{i+1}(M)$ and $\sigma_{j+1}(N)$ are both very small, this means that $M$ and $N$ are close to matrices of rank at most $i$ and $j$ respectively, and hence $M + N$ is close to a matrix of rank at most $i + j$, hence $\sigma_{i+j+1}(M + N)$ also ought to be small.
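The additive perturbation inequality can be spot-checked on random matrices. In the sketch below (NumPy; the helper `sigma` and the test matrices are illustrative assumptions), NumPy's 0-based indexing conveniently absorbs the off-by-one: entry `i` of the singular value array is the $(i+1)$-st singular value, so the inequality reads "entry `i + j` for the sum is at most entry `i` for the first matrix plus entry `j` for the second".

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 4))
N = rng.standard_normal((5, 4))

sM = np.linalg.svd(M, compute_uv=False)
sN = np.linalg.svd(N, compute_uv=False)
sS = np.linalg.svd(M + N, compute_uv=False)

def sigma(s, index):
    """Singular value at a 0-based index, taken to be zero past the end of the list."""
    return s[index] if index < len(s) else 0.0

# Collect every index pair (i, j) at which the inequality fails (up to rounding).
violations = [
    (i, j)
    for i in range(len(sM))
    for j in range(len(sN))
    if sigma(sS, i + j) > sigma(sM, i) + sigma(sN, j) + 1e-12
]
print(violations)
```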

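Setting $j = 0$ in the additive perturbation inequality we deduce the following corollary.

**Singular values are Lipschitz:** The singular values, as functions on matrices, are uniformly Lipschitz with respect to the operator norm with Lipschitz constant $1$: that is,

$$| \sigma_i(M) - \sigma_i(N) | \le \| M - N \|.$$

*Proof.* Apply additive perturbation twice with $j = 0$, first to get

$$\sigma_{i+1}(M) \le \sigma_{i+1}(N) + \sigma_1(M - N)$$

(remembering that $\sigma_1$ is the operator norm), and second to get

$$\sigma_{i+1}(N) \le \sigma_{i+1}(M) + \sigma_1(N - M)$$

(remembering that $\| N - M \| = \| M - N \|$).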

This is very much not the case with eigenvalues: a small perturbation of a square matrix can have a large effect on its eigenvalues. This is explained e.g. in this blog post by Terence Tao, and is related to pseudospectra.
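A standard illustration of this contrast (not taken from the post linked above) is a nilpotent Jordan block: perturbing a single corner entry by $\varepsilon$ moves the eigenvalues from $0$ to the $n$-th roots of $\varepsilon$, while the singular values move by at most $\varepsilon$. The sketch below uses NumPy with the assumed values $n = 8$ and $\varepsilon = 10^{-8}$, for which $\varepsilon^{1/n} = 0.1$.

```python
import numpy as np

n, eps = 8, 1e-8
J = np.diag(np.ones(n - 1), k=1)   # nilpotent Jordan block: every eigenvalue is 0
E = np.zeros((n, n))
E[-1, 0] = eps                     # perturbation of operator norm eps

# (J + E)^n = eps * I, so the perturbed eigenvalues are the n-th roots of eps.
eig_shift = np.max(np.abs(np.linalg.eigvals(J + E)))

# Singular values are sorted the same way in both arrays, so compare entrywise.
sv_shift = np.max(np.abs(
    np.linalg.svd(J + E, compute_uv=False) - np.linalg.svd(J, compute_uv=False)
))

print(eig_shift)   # on the order of eps ** (1 / n), vastly larger than eps
print(sv_shift)    # on the order of eps
```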

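Setting $\sigma_{j+1}(N) = 0$, or equivalently $\operatorname{rank}(N) \le j$, in the additive perturbation inequality, we deduce the following corollary.

**Interlacing:** Suppose $M, M'$ are $n \times m$ matrices such that $\operatorname{rank}(M - M') \le j$. Then, for $i \ge j$,

$$\sigma_{i+j+1}(M) \le \sigma_{i+1}(M') \le \sigma_{i-j+1}(M).$$

*Proof.* Apply additive perturbation twice, first to get

$$\sigma_{i+j+1}(M) \le \sigma_{i+1}(M') + \sigma_{j+1}(M - M') = \sigma_{i+1}(M')$$

and second to get

$$\sigma_{i+1}(M') \le \sigma_{i-j+1}(M) + \sigma_{j+1}(M' - M) = \sigma_{i-j+1}(M).$$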

Interlacing gives us some control over what happens to the singular values under a low-rank perturbation (as opposed to a low-norm perturbation; a low-rank perturbation may have arbitrarily high norm, and vice versa). For example, we learn that if all of the singular values are clumped together, then a rank-$j$ perturbation will keep most of the singular values clumped together, except possibly for either the $j$ largest or $j$ smallest singular values. We can’t expect any control over these, since in the worst case a rank-$j$ perturbation can make the $j$ largest singular values arbitrarily large, or make the $j$ smallest singular values arbitrarily small.

A particular special case of a low-rank perturbation is deleting a small number of rows or columns (note that a row or column which is entirely zero does not affect the singular values, so deleting a row or column is equivalent to setting all of its entries to zero), in which case the upper bound above can be tightened.

**Cauchy interlacing:** Suppose $M$ is an $n \times m$ matrix and $M'$ is obtained from $M$ by deleting at most $j$ rows. Then

$$\sigma_{i+j+1}(M) \le \sigma_{i+1}(M') \le \sigma_{i+1}(M).$$

*Proof.* The lower bound follows from interlacing. The upper bound follows from the observation that we have $\| M' v \| \le \| M v \|$ for all $v$, then applying either variational characterization of the singular values.

Cauchy interlacing also applies to deleting columns, or combinations of rows and columns, because the singular values are unchanged by transposition. In particular, we learn that if $M'$ is obtained from $M$ by deleting either a single row or a single column, then the singular values of $M'$ interlace with the singular values of $M$, hence the name.
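Here is a quick numerical spot check of Cauchy interlacing for row deletion, written with NumPy on an assumed random $6 \times 4$ matrix from which two rows are removed. With 0-based indexing, the claim is that entry `i + j` of the original singular values is at most entry `i` of the new ones, which in turn is at most entry `i` of the original ones.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 4))

deleted_rows = [0, 3]
j = len(deleted_rows)
M_del = np.delete(M, deleted_rows, axis=0)   # 4 x 4 matrix with two rows removed

s = np.linalg.svd(M, compute_uv=False)       # singular values of the original
t = np.linalg.svd(M_del, compute_uv=False)   # singular values after deletion

def sigma(values, index):
    """Singular value at a 0-based index, taken to be zero past the end of the list."""
    return values[index] if index < len(values) else 0.0

upper_ok = all(t[i] <= s[i] + 1e-12 for i in range(len(t)))
lower_ok = all(sigma(s, i + j) <= t[i] + 1e-12 for i in range(len(t)))
print(upper_ok, lower_ok)
```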

In particular, if all of the singular values of $M$ are clumped together then so are those of $M'$, with no exceptions. Taking the contrapositive, if the singular values of $M'$ are spread out, then the singular values of $M$ must be as well.

**Three special cases**

Three special cases of the general singular value decomposition are worth pointing out.

First, if $M$ has orthogonal columns, or equivalently if $M^T M$ is diagonal, then the singular values are the lengths of its columns, we can take the right singular vectors to be the standard basis vectors $e_i \in \mathbb{R}^m$, and we can take the left singular vectors to be the unit rescalings of its columns. This means that we can take $V$ to be the identity matrix, and in general suggests that $V$ is a measure of the extent to which the columns of $M$ fail to be orthogonal (with the caveat that $V$ is not unique and so in general we would want to look at the $V$ closest to the identity).

Second, if $M$ has orthogonal rows, or equivalently if $M M^T$ is diagonal, then the singular values are the lengths of its rows, we can take the left singular vectors to be the standard basis vectors $e_i \in \mathbb{R}^n$, and we can take the right singular vectors to be the unit rescalings of its rows. This means that we can take $U$ to be the identity matrix, and in general suggests that $U$ is a measure of the extent to which the rows of $M$ fail to be orthogonal (with the same caveat as above).

Finally, if $M$ is square and an orthogonal matrix, so that $M^T M = I$, then the singular values are all equal to $1$, and an arbitrary choice of either the left or the right singular vectors uniquely determines the other. This means that we can take $\Sigma$ to be the identity matrix, and in general suggests that $\Sigma$ is a measure of the extent to which $M$ fails to be orthogonal. In fact it is possible to show that the closest orthogonal matrix to $M$ is given by $U V^T$, or in other words by replacing all of the singular values of $M$ with $1$, so

$$\| M - U V^T \| = \| \Sigma - I \| = \max_i | \sigma_i - 1 |$$

is precisely the distance from $M$ to the nearest orthogonal matrix. This fact can be used to solve the orthogonal Procrustes problem.
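The nearest-orthogonal-matrix claim can also be explored numerically. The sketch below (NumPy, on an assumed random $4 \times 4$ matrix) forms the product of the two orthogonal SVD factors, checks that its operator-norm distance from the original matrix is the largest deviation of a singular value from $1$, and compares against one random orthogonal matrix obtained from a QR factorization. The single competitor is only a spot check, not a proof of optimality.

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))

U, s, Vt = np.linalg.svd(M)
Q_near = U @ Vt                       # replace every singular value of M by 1

dist = np.linalg.norm(M - Q_near, 2)  # operator-norm distance to Q_near
print(dist, np.max(np.abs(s - 1)))    # these two numbers agree

# A random orthogonal competitor, from the QR factorization of a random matrix.
Q_rand, _ = np.linalg.qr(rng.standard_normal((4, 4)))
print(np.linalg.norm(M - Q_rand, 2) >= dist)
```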

In general, we should expect that the SVD of a matrix $M$ is relevant to answering any question about $M$ whose answer is invariant under left and right multiplication by orthogonal matrices. This includes, for example, the question of low-rank approximations to $M$ with respect to operator norm we answered above, since both rank and operator norm are invariant.

]]>

The answer is that we can think of the Morita 2-category as a 2-category of **module categories** over the symmetric monoidal category $\text{Mod}(k)$ of $k$-modules, equipped with the usual tensor product $\otimes_k$ over $k$. By the Eilenberg-Watts theorem, the Morita 2-category is equivalently the 2-category whose

- objects are the categories $\text{Mod}(A)$, where $A$ is a $k$-algebra,
- morphisms are cocontinuous $k$-linear functors $\text{Mod}(A) \to \text{Mod}(B)$, and
- 2-morphisms are natural transformations.

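An equivalent way to describe the morphisms is that they are “$\text{Mod}(k)$-linear” in that they respect the natural action of $\text{Mod}(k)$ on $\text{Mod}(A)$ given by

$$\text{Mod}(k) \times \text{Mod}(A) \ni (V, M) \mapsto V \otimes_k M \in \text{Mod}(A).$$

This action comes from taking the adjoint of the enrichment of $\text{Mod}(A)$ over $\text{Mod}(k)$, which gives a tensoring of $\text{Mod}(A)$ over $\text{Mod}(k)$. Since the two are related by an adjunction in this way, a functor respects one iff it respects the other.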

So Morita theory can be thought of as a categorified version of module theory, where we study modules over $\text{Mod}(k)$ instead of over $k$. In the simplest cases, we can think of Morita theory as a categorified version of linear algebra, and in this post we’ll flesh out this analogy further.

**Technical preliminaries**

Let $V$ be a symmetric monoidal category (which in this post will be $\text{Mod}(k)$, for $k$ a commutative ring), and consider categories enriched over $V$, or $V$-categories for short. If $C$ and $D$ are two $V$-categories, their **naive tensor product** $C \otimes D$ is the $V$-category whose objects are pairs $(c, d)$ of objects in $C$ and objects in $D$, and whose homs are given by the tensor products

$$(C \otimes D)((c_1, d_1), (c_2, d_2)) = C(c_1, c_2) \otimes D(d_1, d_2)$$

of the homs in $C$ and $D$, with the obvious composition (which requires the ability to switch tensor factors to define). When $C$ and $D$ have one object and $V = \text{Mod}(k)$, this reduces to the usual tensor product of $k$-algebras.

Thinking of -categories as many-object generalizations of -algebras (one might call them “-algebroids,” but I won’t), it’s natural to define notions of modules over them. We’ll say that a **left module** over is a -functor (-enriched functor) , while a **right module** over is a -functor . If are two -categories, a **-bimodule** is a -functor .

All of this terminology has its usual meaning when $C$ and $D$ have one object (so correspond to algebras) and $V = \text{Mod}(k)$.

**The basic analogy**

The basic analogy, one piece of structure at a time, goes like this.

- Sets are analogous to categories.
- Abelian groups are analogous to cocomplete categories. (There are several other things we could have said here, but this is the one that’s relevant to thinking about Morita theory. The idea is that taking colimits categorifies addition.)
- Rings are analogous to monoidal cocomplete categories (this includes the condition that the monoidal structure distributes over colimits).
- Commutative rings are analogous to symmetric monoidal cocomplete categories.
- Modules over commutative rings are analogous to cocomplete module categories over symmetric monoidal cocomplete categories.

We won’t get into the generalities of thinking about symmetric monoidal categories or modules over them because, in this post, the only symmetric monoidal categories we care about are those of the form $\text{Mod}(k)$ for a commutative ring $k$, and the only module categories over them we care about are the ones that are “free on a category of generators” in the sense that they are categories of $k$-linear presheaves / right modules

$$\text{Mod}(C) = [C^{op}, \text{Mod}(k)]$$

on essentially small $k$-linear categories $C$ (thinking of taking presheaves as the free cocompletion). By the universal property of the free cocompletion, cocontinuous $k$-linear functors $\text{Mod}(C) \to \text{Mod}(D)$ correspond to $k$-linear functors $C \to \text{Mod}(D)$, or equivalently (by an adjunction) to $(C, D)$-bimodules; this generalizes, and in particular proves, the Eilenberg-Watts theorem. Composition is given by tensor product of bimodules, which is computed using coends.

In the special case that the categories involved have one object, they correspond to -algebras, and the words “module,” “bimodule,” and “tensor product” all have their usual meaning. More generally, if has finitely many isomorphism classes of objects, then we can replace it with the endomorphism ring of the direct sum of one object from each isomorphism class (because is Morita equivalent to the one-object -linear category with this endomorphism ring), so we get a genuinely bigger Morita 2-category if we allow to have infinitely many objects.

From now on we’ll work in this bigger Morita 2-category, which is now the 2-category deserving the name . It has

- objects essentially small -linear categories ,
- morphisms -bimodules over , and
- 2-morphisms homomorphisms of bimodules.

Equivalently, it has

- objects the cocomplete -linear categories (where is as above),
- morphisms cocontinuous -linear functors , and
- 2-morphisms natural transformations.

From now on, when we say that a -linear category “is” a -algebra, possibly with some further properties, what we mean is that it’s equivalent to a category with one object and endomorphism ring a -algebra, possibly with some further properties.

In what ways do the objects of the Morita 2-category behave like modules?

**Proposition:** The Morita 2-category has biproducts.

In other words, the product is also the coproduct. This is because on the one hand a functor (cocontinuous and -linear) into is precisely a pair of functors into and , and on the other hand

If and are algebras rather than categories it would be tempting to write rather than , but in the case of algebras these are Morita equivalent as -linear categories. Actually, the above argument, with disjoint unions of -linear categories, applies just as well to infinitely many -linear categories: this means that the bigger Morita 2-category has *infinite* biproducts. Note that this has nothing to do with itself having biproducts: the same is true for the bigger Morita 2-category over .

**Proposition:** There is a tensor-hom adjunction

.

Here the internal hom is the category of cocontinuous -linear functors, which (as we’ve seen above, since it’s a category of bimodules) is itself a cocomplete -linear category, and the tensor product is

(this is a computation, not a definition: the definition is a universal property in terms of functors out of which are cocontinuous and -linear in each variable). Here is the “naive” tensor product over .

Both sides of the equivalence above are equivalent to , which might look familiar as a categorification of a corresponding statement about finite-dimensional vector spaces. This reflects the fact that is always dualizable with respect to the above tensor product, with dual .

With respect to this tensor product, is the unit object. It should be thought of as the “tensor product over ,” exactly analogous to the tensor product on modules over a commutative ring.

**Aside: big categories vs. little categories**

The biggest secret about category theory that I don’t think is in common circulation is that there are approximately two kinds of categories, and they should be thought of very differently (because the relevant notion of morphism between them is very different):

- **Big** categories are “categories *of* mathematical objects.” Typical examples are categories of modules and sheaves. They tend to be cocomplete, and people like to consider cocontinuous functors (really, left adjoints) between them.
- **Little** categories are “categories *as* mathematical objects.” Typical examples include categories with one object, or finitely many objects. They tend to be Cauchy complete at best, and people like to consider arbitrary functors, or more generally bimodules, between them.

The Morita 2-categories we discussed above have both little descriptions and big descriptions (via $k$-linear categories and modules over these respectively), and both are important. We can pass from little to big by taking modules / presheaves, and we can pass from big to little by taking tiny objects. (This can at best recover the Cauchy completion of the original little $k$-linear category, but this is fine since we’re only hoping to do things up to Morita equivalence anyway.)

I think one thing that confuses people when they first start to learn category theory is that the first examples of categories (e.g. groups, rings, modules) tend to be big, even though little categories figure prominently in the theory (e.g. as shapes for diagrams to take limits or colimits over, and/or as things to take presheaves or sheaves on) and feel very different. It’s little categories that can reasonably be thought of as algebraic objects generalizing more familiar objects like monoids and posets (or, in our enriched setting, -algebras), whereas big categories, I think, genuinely require a new set of intuitions.

On the other hand, the ability to pass between big and little categories is also important. Eilenberg-Watts, as we have seen, gives one version of this: another version is Gabriel-Ulmer duality.

The big vs. little nomenclature suggests, but is not equivalent to, the rigorous distinction between large and small categories, and is related to the distinction between big and little toposes (or, depending on your preferences, between gros and petit topoi) in sheaf theory. The basic point of this distinction is that there are two somewhat different sorts of things people mean by sheaf: on the one hand one might mean a functor on a little category like the category of open subsets of a topological space, and on the other hand one might mean a functor on a big category like the category of commutative rings. It would be nice if people emphasized this distinction more.

**Bases, coordinates, and matrices**

In terms of higher linear algebra, big categories of modules are “higher vector spaces,” while the little categories that can be used to present the big categories as categories of modules are “bases” for them. As we learned previously, a cocomplete abelian category has a “basis” in this sense iff it has a family of tiny (compact projective) generators, for various notions of generator.

Given a “basis” for a “higher vector space” (really, a higher module), any object is described by a module / presheaf ; the components of this presheaf can be thought of as the “coordinates” of in the “basis” . In the same way that a vector is a sum over elements of a basis weighted by its coordinates in that basis, a presheaf is a weighted colimit / coend / functor tensor product

weighted by its “coordinates” of the “basis” of representable presheaves. This is an enriched version of the familiar statement that a presheaf of sets over a category is canonically a colimit of representable presheaves. It is sometimes called the co-Yoneda lemma.

Similarly, the statement that cocontinuous -linear functors are equivalent to functors can be interpreted as saying that such functors can be written as “matrices” indexed by and . Composition, as well as evaluation, are given by the familiar formulas if we consistently reinterpret the relevant products as tensor products and the relevant sums as coends.

Suppose in particular that . Then we get that endomorphisms of correspond to -bimodules, or equivalently to functors . These are precisely the sorts of things we can take coends of, getting an object

which deserves to be called the **trace** of the endomorphism . This is a generalization of (zeroth) Hochschild homology with coefficients in a bimodule, which it reduces to when is an algebra. (There is also a second trace generalizing Hochschild cohomology and computed using ends.)

The identity functor turns out to be represented by , so taking its trace we get the **Hochschild homology** or **trace** of (or, depending on your point of view, of ) itself:

.

More explicitly, this coend is the result of coequalizing the left and right actions of on (by postcomposition and precomposition respectively), so can be written as

where the two arrows send a pair of morphisms and to the two composites and . In other words, is the quotient of the direct sum of the endomorphism rings of every object in by the subspace of “commutators” of the form , where need not themselves be endomorphisms.

By construction, this means that is the recipient of a “universal trace” map: any endomorphism in has an image satisfying

for all as above, and is universal with respect to this property.

*Example.* Suppose $C$ has one object, so corresponds to an algebra $A$. Then the trace of $C$ is just the quotient $A/[A, A]$ of $A$ by the subspace of commutators, hence is the ordinary zeroth Hochschild homology $HH_0(A)$ of $A$ (with coefficients in itself).
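As a concrete instance (a standard computation added here for illustration; the matrix-unit notation $E_{ij}$ is introduced only for this example), consider the algebra $M_n(k)$ of $n \times n$ matrices over a commutative ring $k$. The commutators of matrix units are

$$[E_{ij}, E_{jl}] = E_{il} \quad (i \neq l), \qquad [E_{ij}, E_{ji}] = E_{ii} - E_{jj},$$

and these span exactly the matrices of trace zero. Hence the ordinary matrix trace induces an isomorphism

$$M_n(k) / [M_n(k), M_n(k)] \cong k,$$

which agrees with the quotient of $k$ by its own commutators (namely $k$ itself), as one would hope given that $M_n(k)$ and $k$ are Morita equivalent.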

The above construction of the trace implies that it is Morita invariant, and $A$ is Morita equivalent to the $k$-linear category of finitely generated projective (right) $A$-modules (which is its Cauchy completion). It follows that the trace of this category is also $A/[A, A]$, and this means that for any finitely generated projective $A$-module $P$ and any endomorphism $f : P \to P$ there is a trace

$$\text{tr}(f) \in A/[A, A].$$

This trace is called the **Hattori-Stallings trace.** For free modules it is computed by taking the image of the sum of the diagonal elements of a matrix representing $f$ in $A/[A, A]$. It is a shadow of various more general and more interesting maps from versions of K-theory to versions of Hochschild homology, including a version of the Chern character and the Dennis trace.

In particular, if $A$ is commutative, we recover the usual notion of trace of an endomorphism of a finitely generated projective $A$-module without using a monoidal structure (previously we recovered this notion using dualizability with respect to the usual tensor product of $A$-modules).

The simplest interesting case of the above discussion occurs when $k$ is a field and the algebras we consider are finite direct sums of matrix algebras $M_n(k)$; equivalently, the $\text{Mod}(k)$-module categories we consider are finite direct sums of copies of $\text{Mod}(k)$. These are known as **Kapranov-Voevodsky 2-vector spaces**. Morphisms from $\text{Mod}(k)^n$ to $\text{Mod}(k)^m$ correspond to $m \times n$ matrices of vector spaces, just as for free modules, and the trace of an endomorphism is the direct sum of its diagonal vector spaces.
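To make "matrices of vector spaces" concrete, here is a toy sketch in plain Python. It rests on a simplifying assumption: every entry is a finite-dimensional vector space, recorded only by its dimension (which determines it up to isomorphism). Composition then becomes ordinary matrix multiplication of dimension matrices, with products standing in for tensor products and sums for direct sums, and the trace becomes the sum of the diagonal dimensions. The function names `compose` and `trace_dim` are hypothetical.

```python
def compose(G, F):
    """Dimension matrix of the composite G after F.

    Entry (i, j) is the dimension of the direct sum over k of G[i][k] tensor F[k][j].
    """
    rows, inner, cols = len(G), len(F), len(F[0])
    return [
        [sum(G[i][k] * F[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

def trace_dim(F):
    """Dimension of the trace: the direct sum of the diagonal entries."""
    return sum(F[i][i] for i in range(len(F)))

# Two endomorphisms of a 2-dimensional 2-vector space, recorded by dimensions.
F = [[1, 2], [0, 3]]
G = [[2, 0], [1, 1]]

GF = compose(G, F)
FG = compose(F, G)
print(GF, trace_dim(GF))
print(FG, trace_dim(FG))   # same trace dimension, as cyclicity of the trace suggests
```

Tracking only dimensions forgets the actual maps between the vector spaces, so this captures the 1-morphisms up to isomorphism but none of the 2-morphisms.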

**Higher representation theory**

One reason you might want a higher version of linear algebra is to study “higher representation theory,” where groups (or higher versions of groups, such as 2-groups) act on higher vector spaces. Previously we saw such actions occur naturally in Galois descent: namely, if $k \to L$ is a Galois extension with Galois group $G$, then $G$ naturally acts on the category $\text{Mod}(L)$ of $L$-vector spaces, and we used this action to describe Galois descent for vector spaces.

More generally, if is a group acting on a scheme (or some more or less general object, depending on taste), then naturally acts on the category of quasicoherent sheaves on . If is an -scheme for some base , this is naturally a module category over , and if acts on by -scheme automorphisms, the induced action on is -linear. One reason you might want to understand higher representation theory is to understand these sorts of actions. In particular, a natural question is what the homotopy fixed points of this action are, and the answer is that

;

that is, the homotopy fixed point category (which here is typically called “-equivariant quasicoherent sheaves on “) is the category of quasicoherent sheaves on the quotient regarded as a stack. (It certainly does not suffice to take the ordinary quotient of schemes: for example, if is a point, then is , and is the category of -linear representations of .)

If the stacky quotient happens to be an ordinary scheme (so is a -torsor in schemes), this is a generalization of Galois descent, to which it reduces in the case when is a finite extension, , and is the Galois group.


**Fire**

**Fire** is a sustained chain reaction involving **combustion**, which is an exothermic reaction in which an oxidant, typically oxygen, oxidizes a fuel, typically a hydrocarbon, to produce products such as carbon dioxide, water, and heat and light. A typical example is the combustion of methane, which looks like

$$\mathrm{CH}_4 + 2\,\mathrm{O}_2 \to \mathrm{CO}_2 + 2\,\mathrm{H}_2\mathrm{O}.$$

The heat produced by combustion can be used to fuel more combustion, and when that happens enough that no additional energy needs to be added to sustain combustion, you’ve got a fire. To stop a fire, you can remove the fuel (e.g. turning off a gas stove), remove the oxidant (e.g. smothering a fire using a fire blanket), remove the heat (e.g. spraying a fire with water), or remove the combustion reaction itself (e.g. with halon).

Combustion is in some sense the opposite of photosynthesis, an endothermic reaction which takes in light, water, and carbon dioxide and produces hydrocarbons.

It’s tempting to assume that when burning wood, the hydrocarbons that are being combusted are e.g. the cellulose in the wood. It seems, however, that something more complicated happens. When wood is exposed to heat, it undergoes **pyrolysis** (which, unlike combustion, doesn’t involve oxygen), which converts it to more flammable compounds, such as various gases, and these are what combust in wood fires.

When a wood fire burns for long enough it will lose its flame but continue to smolder, and in particular the wood will continue to glow. Smoldering involves **incomplete combustion**, which, unlike complete combustion, produces carbon monoxide.

**Flames**

**Flames** are the visible parts of a fire. As fires burn, they produce **soot** (which can refer to some of the products of incomplete combustion or some of the products of pyrolysis), which heats up, producing **thermal radiation**. This is one of the mechanisms responsible for giving fire its color. It is also how fires warm up their surroundings.

Thermal radiation is produced by the motion of charged particles: anything at positive temperature consists of charged particles moving around, so emits thermal radiation. A more common but arguably less accurate term is **black body radiation**; this properly refers to the thermal radiation emitted by an object which absorbs all incoming radiation. It’s common to approximate thermal radiation by black body radiation, or by black body radiation times a constant, because it has the useful property that it depends only on the temperature of the black body. Black body radiation happens at all frequencies, with more radiation at higher frequencies at higher temperatures; in particular, the peak frequency is directly proportional to temperature by **Wien’s displacement law**.

Everyday objects are constantly producing thermal radiation, but most of it is infrared – its wavelength is longer than that of visible light, and so is invisible without special cameras. Fires are hot enough to produce visible light, although they are still producing a lot of infrared light.

Another mechanism giving fire its color is the emission spectra of whatever’s being burned. Unlike black body radiation, emission spectra occur at discrete frequencies; this is caused by electrons producing photons of a particular frequency after transitioning from a higher-energy state to a lower-energy state. These frequencies can be used to detect elements present in a sample in flame tests, and a similar idea (using absorption spectra) is used to determine the composition of the sun and various stars. Emission spectra are also responsible for the color of fireworks and of colored fire.

The characteristic shape of a flame on Earth depends on gravity. As a fire heats up the surrounding air, natural convection occurs: the hot air (which contains, among other things, hot soot) rises, while cool air (which contains oxygen) falls, sustaining the fire and giving flames their characteristic shape. In low gravity, such as on a space station, this no longer occurs; instead, fires are only fed by the diffusion of oxygen, and so burn more slowly and with a spherical shape (since now combustion is only happening at the interface of the fire with the parts of the air containing oxygen; inside the sphere there is presumably no more oxygen to burn).

**Black body radiation**

Black body radiation is described by **Planck’s law**, which is fundamentally quantum mechanical in nature, and which was historically one of the first applications of any form of quantum mechanics. It can be deduced from (quantum) statistical mechanics as follows.

What we’ll actually compute is the distribution of frequencies in a (quantum) gas of photons at some temperature ; the claim that this matches the distribution of frequencies of photons emitted by a black body at the same temperature comes from a physical argument related to Kirchhoff’s law of thermal radiation. The idea is that the black body can be put into thermal equilibrium with the gas of photons (since they have the same temperature). The gas of photons is getting absorbed by the black body, which is also emitting photons, so in order for them to stay in equilibrium, it must be the case that at every frequency the black body is emitting radiation at the same rate as it’s absorbing it, which is determined by the distribution of frequencies in the gas. (Or something like that. I Am Not A Physicist, so if your local physicist says different then believe them instead.)

In statistical mechanics, the probability of finding a system in microstate $s$ given that it’s in thermal equilibrium at temperature $T$ is proportional to

$$e^{-\beta E_s}$$

where $E_s$ is the energy of state $s$ and $\beta = \frac{1}{k_B T}$ is **thermodynamic beta** (so $T$ is temperature and $k_B$ is Boltzmann’s constant); this is the Boltzmann distribution. For one possible justification of this, see this blog post by Terence Tao. This means that the probability is

$$p_s = \frac{1}{Z(\beta)} e^{-\beta E_s}$$

where $Z(\beta)$ is the normalizing constant

$$Z(\beta) = \sum_s e^{-\beta E_s}$$

called the **partition function**. Note that these probabilities don’t change if $E_s$ is modified by an additive constant (which multiplies the partition function by a constant); only differences in energy between states matter.

It’s a standard observation that the partition function, up to multiplicative scale, contains the same information as the Boltzmann distribution, so anything that can be computed from the Boltzmann distribution can be computed from the partition function. For example, the moments of the energy are given by

$$\langle E^k \rangle = \frac{1}{Z(\beta)} \sum_s E_s^k e^{-\beta E_s} = \frac{(-1)^k}{Z(\beta)} \frac{\partial^k}{\partial \beta^k} Z(\beta)$$

and, up to solving the moment problem, this characterizes the Boltzmann distribution. In particular, the average energy is

$$\langle E \rangle = -\frac{\partial}{\partial \beta} \log Z(\beta).$$

The Boltzmann distribution can be used as a definition of temperature. It correctly suggests that in some sense is the more fundamental quantity because it might be zero (meaning every microstate is equally likely; this corresponds to “infinite temperature”) or negative (meaning higher-energy microstates are more likely; this corresponds to “negative temperature,” which it is possible to transition to after “infinite temperature,” and which in particular is hotter than every positive temperature).

To describe the state of a gas of photons we’ll need to know something about the quantum behavior of photons. In the standard quantization of the electromagnetic field, the electromagnetic field can be treated as a collection of quantum harmonic oscillators each oscillating at various (angular) frequencies $\omega$. The energy eigenstates of a quantum harmonic oscillator are labeled by a nonnegative integer $n$, which can be interpreted as the number of photons of frequency $\omega$. The energies of these eigenstates are (up to an additive constant, which doesn’t matter for this calculation and so which we will ignore)

$$E_n = n \hbar \omega$$

where $\hbar$ is the reduced Planck constant. The fact that we only need to keep track of the number of photons rather than distinguishing them reflects the fact that photons are bosons. Accordingly, for fixed $\omega$, the partition function is

$$Z_\omega(\beta) = \sum_{n=0}^\infty e^{-n \beta \hbar \omega} = \frac{1}{1 - e^{-\beta \hbar \omega}}.$$

**Digression: the (wrong) classical answer**

The assumption that $n$, or equivalently the energy $E_n$ measured in units of $\hbar \omega$, is required to be an integer here is the **Planck postulate**, and historically it was perhaps the first appearance of a quantization (in the sense of quantum mechanics) in physics. Without this assumption (so using classical harmonic oscillators), the sum above becomes an integral (where $n$ is now proportional to the square of the amplitude), and we get a “classical” partition function

$$Z^{cl}_\omega(\beta) = \int_0^\infty e^{-n \beta \hbar \omega} \, dn = \frac{1}{\beta \hbar \omega}.$$

(It’s unclear what measure we should be integrating against here, but this calculation appears to reproduce the usual classical answer, so I’ll stick with it.)

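These two partition functions give very different predictions, although the quantum one approaches the classical one as $\beta \hbar \omega \to 0$. In particular, the average energy of all photons of frequency $\omega$, computed using the quantum partition function, is

$$\langle E \rangle = -\frac{\partial}{\partial \beta} \log \frac{1}{1 - e^{-\beta \hbar \omega}} = \frac{\hbar \omega}{e^{\beta \hbar \omega} - 1}$$

whereas the average energy computed using the classical partition function is

$$\langle E \rangle = -\frac{\partial}{\partial \beta} \log \frac{1}{\beta \hbar \omega} = \frac{1}{\beta} = k_B T.$$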

The quantum answer approaches the classical answer as (so for small frequencies), and the classical answer is consistent with the equipartition theorem in classical statistical mechanics, but it is also grossly inconsistent with experiment and experience. It predicts that the average energy of the radiation emitted by a black body at a frequency is a constant independent of , and since radiation can occur at arbitrarily high frequencies, the conclusion is that a black body is emitting an infinite amount of energy, at every possible frequency, which is of course badly wrong. This is (most of) the ultraviolet catastrophe.

The quantum partition function instead predicts that at low frequencies (relative to the temperature) the classical answer is approximately correct, but that at high frequencies the average energy becomes exponentially damped, with more damping at lower temperatures. This is because at high frequencies and low temperatures a quantum harmonic oscillator spends most of its time in its ground state, and cannot easily transition to its next lowest state, which is exponentially less likely. Physicists say that most of this “degree of freedom” (the freedom of an oscillator to oscillate at a particular frequency) gets “frozen out.” The same phenomenon is responsible for classical but incorrect computations of specific heat, e.g. for diatomic gases such as oxygen.
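To see the freezing-out numerically, here is a small sketch in Python. It assumes the standard formulas for the two averages, namely $\frac{\hbar \omega}{e^{\beta \hbar \omega} - 1}$ for the quantum oscillator and $k_B T$ for the classical one, and works in units of $k_B T$ with the dimensionless variable $x = \beta \hbar \omega$; the function names are just illustrative.

```python
import math

def quantum_mean_energy(x):
    """Mean energy of a quantum oscillator in units of k_B*T, with x = beta*hbar*omega."""
    # hbar*omega / (exp(beta*hbar*omega) - 1), divided by k_B*T = 1/beta
    return x / math.expm1(x)

def classical_mean_energy(x):
    """Equipartition value in the same units: k_B*T regardless of frequency."""
    return 1.0

if __name__ == "__main__":
    for x in (0.01, 0.1, 1.0, 5.0, 20.0):
        ratio = quantum_mean_energy(x) / classical_mean_energy(x)
        print(f"x = {x:6.2f}   quantum / classical = {ratio:.3e}")
```

For small $x$ the ratio is close to $1$, and for large $x$ it decays roughly like $x e^{-x}$, which is the exponential damping described above.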

**The density of states and Planck’s law**

Now that we know what’s going on at a fixed frequency , it remains to sum over all possible frequencies. This part of the computation is essentially classical and no quantum corrections to it need to be made.

We’ll make a standard simplifying assumption that our gas of photons is trapped in a box with side length $L$ subject to periodic boundary conditions (so really, the flat torus $T^3 = \mathbb{R}^3 / L \mathbb{Z}^3$); the choice of boundary conditions, as well as the shape of the box, will turn out not to matter in the end. Possible frequencies are then classified by standing wave solutions to the electromagnetic wave equation in the box with these boundary conditions, which in turn correspond (up to multiplication by $c$) to eigenvalues of the Laplacian $\Delta$. More explicitly, if $\Delta v = \lambda v$, where $v(x)$ is a smooth function $T^3 \to \mathbb{R}$, then the corresponding standing wave solution of the electromagnetic wave equation is

$$v(t, x) = e^{c \sqrt{\lambda} t} v(x)$$

and hence (keeping in mind that $\lambda$ is typically negative, so $\sqrt{\lambda}$ is typically purely imaginary) the corresponding frequency is

$$\omega = c \sqrt{-\lambda}.$$

This frequency occurs $\dim V_\lambda$ times where $V_\lambda$ is the $\lambda$-eigenspace of the Laplacian.

The reason for the simplifying assumptions above is that for a box with periodic boundary conditions (again, mathematically a flat torus) it is very easy to explicitly write down all of the eigenfunctions of the Laplacian: working over the complex numbers for simplicity, they are given by

$$v_k(x) = e^{i k \cdot x}$$

where $k = (k_1, k_2, k_3) \in \frac{2\pi}{L} \mathbb{Z}^3$ is the wave vector. (Somewhat more generally, on the flat torus $\mathbb{R}^n / \Gamma$ where $\Gamma$ is a lattice, wave numbers take values in the dual lattice of $\Gamma$, possibly up to scaling by $2\pi$ depending on conventions.) The corresponding eigenvalue of the Laplacian is

$$\lambda_k = -|k|^2 = -k_1^2 - k_2^2 - k_3^2$$

from which it follows that the multiplicity of a given eigenvalue $\lambda$ is the number of ways to write $-\frac{L^2}{4\pi^2} \lambda$ as a sum of three squares. The corresponding frequency is

$$\omega_k = c |k|$$

and so the corresponding energy (of a single photon with that frequency) is

$$E_k = \hbar \omega_k = \hbar c |k|.$$

At this point we’ll approximate the probability distribution over possible frequencies $\omega$, which is strictly speaking discrete, as a continuous probability distribution, and compute the corresponding **density of states** $g(\omega)$; the idea is that $g(\omega) \, d\omega$ should correspond to the number of states available with frequencies between $\omega$ and $\omega + d\omega$. Then we’ll do an integral over the density of states to get the final partition function.

Why is this approximation reasonable (unlike the case of the partition function for a single harmonic oscillator, where it wasn’t)? The full partition function can be described as follows. For each wavenumber $k$, there is an occupancy number $n_k$ describing the number of photons with that wavenumber; the total number $\sum_k n_k$ of photons is finite. Each such photon contributes $\hbar \omega_k = \hbar c |k|$ to the energy, from which it follows that the partition function factors as a product

$$Z(\beta) = \prod_k \frac{1}{1 - e^{-\beta \hbar \omega_k}}$$

over all wave numbers $k$, hence that its logarithm factors as a sum

$$\log Z(\beta) = \sum_k \log \frac{1}{1 - e^{-\beta \hbar \omega_k}}$$

and it is this sum that we want to approximate by an integral. It turns out that for reasonable temperatures and reasonably large boxes, the integrand varies very slowly as $k$ varies, so the approximation by an integral is very close. The approximation stops being reasonable only at very low temperatures, where as above quantum harmonic oscillators mostly end up in their ground states and we get Bose-Einstein condensates.

The density of states can be computed as follows. We can think of wave vectors as evenly spaced lattice points living in some “phase space,” from which it follows that the number of wave vectors in some region of phase space is proportional to its volume, at least for regions which are large compared to the lattice spacing $\frac{2\pi}{L}$. In fact, the number of wave vectors in a region of phase space is exactly $\frac{V}{8 \pi^3}$ times the volume, where $V = L^3$ is the volume of our box / torus.

It remains to compute the volume of the region of phase space given by all wave vectors with frequencies between $\omega$ and $\omega + d\omega$. This region is a spherical shell with thickness $\frac{d\omega}{c}$ and radius $\frac{\omega}{c}$, and hence its volume is

$$\frac{4 \pi \omega^2}{c^3} \, d\omega$$

from which we get that the density of states for a single photon is

$$g(\omega) \, d\omega = \frac{V \omega^2}{2 \pi^2 c^3} \, d\omega.$$

Actually this formula is off by a factor of two: we forgot to take photon polarization into account (equivalently, photon spin), which doubles the number of states with a given wave number, giving the corrected density

$$g(\omega) \, d\omega = \frac{V \omega^2}{\pi^2 c^3} \, d\omega.$$

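The fact that the density of states is linear in the volume $V$ is not specific to the flat torus; it’s a general feature of eigenvalues of the Laplacian by Weyl’s law. This gives that the logarithm of the partition function is

$$\log Z(\beta) = \int_0^\infty \frac{V \omega^2}{\pi^2 c^3} \log \frac{1}{1 - e^{-\beta \hbar \omega}} \, d\omega.$$

Taking its derivative with respect to $\beta$ (and negating) gives the average energy of the photon gas as

$$\langle E \rangle = -\frac{\partial}{\partial \beta} \log Z(\beta) = \int_0^\infty \frac{V \hbar}{\pi^2 c^3} \frac{\omega^3}{e^{\beta \hbar \omega} - 1} \, d\omega$$

but for us the significance of this integral lies in its integrand, which gives the “density of energies”

$$E(\omega) \, d\omega = \frac{V \hbar}{\pi^2 c^3} \frac{\omega^3}{e^{\beta \hbar \omega} - 1} \, d\omega$$

describing how much of the energy of the photon gas comes from photons of frequencies between $\omega$ and $\omega + d\omega$. This, finally, is a form of Planck’s law, although it needs some massaging to become a statement about black bodies as opposed to about gases of photons (we need to divide by $V$ to get the energy density per unit volume, then do some other stuff to get a measure of radiation).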

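Planck’s law has two noteworthy limits. In the limit as $\beta \hbar \omega \to 0$ (meaning high temperature relative to frequency), the denominator approaches $\beta \hbar \omega$, and we get

$$E(\omega) \, d\omega \approx \frac{V}{\pi^2 c^3} \frac{\omega^2}{\beta} \, d\omega = \frac{V k_B T \omega^2}{\pi^2 c^3} \, d\omega.$$

This is a form of the Rayleigh-Jeans law, which is the classical prediction for black body radiation. It’s approximately valid at low frequencies but becomes less and less accurate at higher frequencies.

Second, in the limit as $\beta \hbar \omega \to \infty$ (meaning low temperature relative to frequency), the denominator approaches $e^{\beta \hbar \omega}$, and we get

$$E(\omega) \, d\omega \approx \frac{V \hbar}{\pi^2 c^3} \omega^3 e^{-\beta \hbar \omega} \, d\omega.$$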

This is a form of the Wien approximation. It’s approximately valid at high frequencies but becomes less and less accurate at low frequencies.

Both of these limits historically preceded Planck’s law itself.

**Wien’s displacement law**

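This form of Planck’s law is enough to tell us at what frequency the energy $E(\omega)$ is maximized given the temperature $T$ (and hence roughly what color a black body of temperature $T$ is): we differentiate with respect to $\omega$ and find that we need to solve

$$\frac{d}{d\omega} \frac{\omega^3}{e^{\beta \hbar \omega} - 1} = 0$$

or equivalently (taking the logarithmic derivative instead)

$$\frac{3}{\omega} - \frac{\beta \hbar e^{\beta \hbar \omega}}{e^{\beta \hbar \omega} - 1} = 0.$$

Let $\zeta = \beta \hbar \omega$, so that we can rewrite the equation as

$$3 - \frac{\zeta e^{\zeta}}{e^{\zeta} - 1} = 0$$

or, with some rearrangement,

$$3 - \zeta = 3 e^{-\zeta}.$$

This form of the equation makes it relatively straightforward to show that there is a unique positive solution $\zeta = 2.821\dots$, and hence that $\beta \hbar \omega = \zeta$, giving that the maximizing frequency is

$$\omega_{\max} = \frac{\zeta}{\beta \hbar} = \frac{\zeta k_B}{\hbar} T$$

where $T$ is the temperature. This is Wien’s displacement law for frequencies. Rewriting in terms of wavelengths gives

$$\lambda = \frac{2 \pi c}{\omega_{\max}} = \frac{b}{T}$$

where

$$b = \frac{2 \pi \hbar c}{\zeta k_B} \approx 5.099 \times 10^{-3} \ \mathrm{m \, K}$$

(the units here being meter-kelvins). This computation is typically done in a slightly different way, by first re-expressing the density of energies in terms of wavelengths, then taking the maximum of the resulting density. Because $d\omega$ is proportional to $\frac{d\lambda}{\lambda^2}$, this has the effect of changing the $\omega^3$ from earlier to an $\omega^5$, so replaces $\zeta$ with the unique solution $\zeta'$ to

$$5 - \zeta' = 5 e^{-\zeta'}$$

which is about $4.965$. This gives a maximizing wavelength

$$\lambda_{\max} = \frac{2 \pi \hbar c}{\zeta' k_B T} = \frac{b'}{T}$$

where

$$b' = \frac{2 \pi \hbar c}{\zeta' k_B} \approx 2.898 \times 10^{-3} \ \mathrm{m \, K}.$$

This is Wien’s displacement law for wavelengths. Note that $\lambda_{\max} \neq \frac{2 \pi c}{\omega_{\max}}$.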
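The two constants can be checked numerically. The following is a minimal Python sketch, assuming the fixed-point forms $\zeta = 3(1 - e^{-\zeta})$ and $\zeta = 5(1 - e^{-\zeta})$ of the two equations (for the frequency and wavelength versions respectively) and the exact SI values of the physical constants; the function names are hypothetical.

```python
import math

# Exact SI values (2019 redefinition of the SI units)
H = 6.62607015e-34        # Planck constant, J s
K_B = 1.380649e-23        # Boltzmann constant, J / K
C = 299792458.0           # speed of light, m / s
HBAR = H / (2 * math.pi)  # reduced Planck constant

def solve_zeta(m, iterations=100):
    """Positive root of z = m * (1 - exp(-z)), found by fixed-point iteration."""
    z = float(m)  # the root lies just below m, and the map contracts there
    for _ in range(iterations):
        z = m * (1.0 - math.exp(-z))
    return z

def wien_constant(m):
    """Constant b in metre-kelvins such that the peak wavelength is b / T."""
    return 2 * math.pi * HBAR * C / (solve_zeta(m) * K_B)

if __name__ == "__main__":
    print("frequency form:  zeta =", solve_zeta(3), " b =", wien_constant(3))
    print("wavelength form: zeta =", solve_zeta(5), " b =", wien_constant(5))
```

Fixed-point iteration is enough here because the derivative of $m(1 - e^{-\zeta})$ at the root is well below $1$ in both cases; a general-purpose root finder would work just as well.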

A wood fire has a temperature of around (or around celsius), and substituting this in above produces wavelengths of

and

.

For comparison, the wavelengths of visible light range between about for red light and for violet light. Both of these computations correctly suggest that most of the radiation from a wood fire is infrared; this is the radiation that’s heating you but not producing visible light.

By contrast, the temperature of the surface of the sun is about , and substituting that in produces wavelengths

and

which correctly suggests that the sun is emitting lots of light all around the visible spectrum (hence appears white). In some sense this argument is backwards: probably the visible spectrum evolved to be what it is because of the wide availability of light in the particular frequencies the sun emits the most.

Finally, a more sobering calculation. Nuclear explosions reach temperatures of around , comparable to the temperature of the interior of the sun. Substituting this in produces wavelengths of

and

.

These are the wavelengths of X-rays. Planck’s law doesn’t just stop at the maximum, so nuclear explosions also produce even shorter wavelength radiation, namely gamma rays. This is solely the radiation a nuclear explosion produces because it is hot, as opposed to the radiation it produces because it is nuclear, such as neutron radiation.


on the partition function . In this post I’d like to record an elementary argument, making no use of generating functions, giving a lower bound of the form for some , which might help explain intuitively why this exponential-of-a-square-root rate of growth makes sense.

The starting point is to think of a partition of as a Young diagram of size , or equivalently (in French coordinates) as a lattice path from somewhere on the y-axis to somewhere on the x-axis, which only steps down or to the right, such that the area under the path is . Heuristically, if the path takes a total of steps then there are about such paths, and if the area under the path is then the length of the path should be about , so this already goes a long way towards explaining the exponential-of-a-square-root behavior.

We can make this argument into a rigorous lower bound as follows. Consider lattice paths beginning at and ending at where is a positive integer to be determined later. Suppose that the steps of the lattice paths alternate between paths of the form (down, right, right, down) and (right, down, down, right), which means that is even. Then the area under the path is exactly the area of the right triangle it approximates, which is , and the number of such paths is exactly . This gives

whenever , so we get a lower bound of the form where , quite a bit worse than the correct value . This bound generalizes to all values of with only a small loss in the exponent because is nondecreasing (since the lattice path can continue along the line for a while at the end before hitting the x-axis).

One reason this construction can’t produce a very good bound is that the partitions we get this way do not resemble the “typical” partition, which (as proven by Vershik and explained by David Speyer here) is a suitably scaled version of the curve

.

whereas our partitions resemble the curve . With a more convex curve we can afford to make the path longer while fixing the area under it.

So let’s remove the restriction that our curve resemble as follows. Rather than count directly, we will count , so the number of lattice paths with area at most . Since is increasing, it must be at least times this count. And we have much more freedom to pick a path now that we only need to bound its area rather than find it exactly. We can now take the path to be any Dyck path from to , of which there are

where denotes the Catalan numbers and the asymptotic can be derived from Stirling’s approximation. The area under a Dyck path is at most , which gives the lower bound

and hence, when (so that ),

which (ignoring polynomial factors) is of the form where , a substantial improvement over the previous bound. Although we are now successfully in a regime where our counts include paths of a typical shape, we’re overestimating the area under them, so the bound is still not as good as it could be.


Let $p(n)$ denote the partition function, which describes the number of ways to write $n$ as a sum of positive integers, ignoring order. In 1918 Hardy and Ramanujan proved that $p(n)$ is given asymptotically by

$$p(n) \sim \frac{1}{4 n \sqrt{3}} \exp \left( \pi \sqrt{\frac{2n}{3}} \right).$$

This is a major plot point in the new Ramanujan movie, where Ramanujan conjectures this result and MacMahon challenges him by agreeing to compute $p(200)$ and comparing it to what this approximation gives. In this post I’d like to describe how one might go about conjecturing this result up to a multiplicative constant; proving it is much harder.

**Verification**

MacMahon computed $p(200)$ using a recursion for $p(n)$ implied by the pentagonal number theorem, which makes it feasible to compute by hand. Instead of doing it by hand I asked WolframAlpha, which gives

$$p(200) = 3972999029388 \approx 3.973 \times 10^{12}$$

whereas the asymptotic formula gives

$$\frac{1}{800 \sqrt{3}} \exp \left( \pi \sqrt{\frac{400}{3}} \right) \approx 4.100 \times 10^{12}.$$

Ramanujan is shown getting closer than this in the movie, presumably using a more precise asymptotic.
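For anyone who would rather reproduce the comparison than trust a lookup, here is a short Python sketch. It assumes the standard recurrence coming from the pentagonal number theorem, $p(n) = \sum_{k \ge 1} (-1)^{k+1} \left( p\left(n - \tfrac{k(3k-1)}{2}\right) + p\left(n - \tfrac{k(3k+1)}{2}\right) \right)$, and the leading-order Hardy-Ramanujan formula; the function names are made up for illustration.

```python
import math

def partition_numbers(n_max):
    """Return [p(0), ..., p(n_max)] via Euler's pentagonal number recurrence."""
    p = [1] + [0] * n_max
    for n in range(1, n_max + 1):
        total = 0
        k = 1
        while True:
            g1 = k * (3 * k - 1) // 2  # generalized pentagonal numbers
            g2 = k * (3 * k + 1) // 2
            if g1 > n:
                break
            sign = 1 if k % 2 == 1 else -1
            total += sign * p[n - g1]
            if g2 <= n:
                total += sign * p[n - g2]
            k += 1
        p[n] = total
    return p

def hardy_ramanujan(n):
    """Leading-order asymptotic exp(pi * sqrt(2n/3)) / (4 * n * sqrt(3))."""
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))

if __name__ == "__main__":
    p = partition_numbers(200)
    print("exact p(200):    ", p[200])
    print("asymptotic value:", hardy_ramanujan(200))
    print("ratio:           ", hardy_ramanujan(200) / p[200])
```

Each $p(n)$ only needs about $\sqrt{n}$ earlier values, which is what makes the recurrence feasible by hand.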

How might we conjecture such a result? In general, a very powerful method for figuring out the growth rate of a sequence $a_n$ is to associate to it the generating function

$$A(z) = \sum_{n \ge 0} a_n z^n$$

and relate the behavior of $A(z)$, often for complex values of $z$, to the behavior of $a_n$. The most comprehensive resource I know for descriptions both of how to write down generating functions and to analyze them for asymptotic behavior is Flajolet and Sedgewick’s *Analytic Combinatorics*, and the rest of this post will follow that book’s lead closely.

**The meromorphic case**

The generating function strategy is easiest to carry out in the case that is meromorphic (for example, if it’s rational); in that case the asymptotic growth of is controlled by the behavior of near the pole (or poles) closest to the origin. For rational this is particularly simple and just a matter of looking at the partial fraction decomposition.

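*Example.* The generating function for the sequence $p_k(n)$ of partitions into parts of size at most $k$ is the rational function

$$P_k(z) = \prod_{i=1}^k \frac{1}{1 - z^i}$$

whose most important pole is at $z = 1$ and of order $k$, and whose other poles are at the roots of unity, and of order at most $\frac{k}{2}$. We can factor $P_k(z)$ as

$$P_k(z) = \frac{1}{(1 - z)^k} \prod_{i=1}^k \frac{1}{1 + z + \dots + z^{i-1}}$$

which gives, upon substituting $z = 1$, that the partial fraction decomposition begins

$$P_k(z) = \frac{1}{k! \, (1 - z)^k} + \dots$$

and hence, using the fact that

$$\frac{1}{(1 - z)^k} = \sum_{n \ge 0} \binom{n + k - 1}{k - 1} z^n$$

we conclude that $p_k(n)$ is asymptotically

$$p_k(n) \sim \frac{1}{k!} \binom{n + k - 1}{k - 1} \sim \frac{n^{k-1}}{k! \, (k-1)!}.$$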

We ignored all of the other terms in the partial fraction decomposition to get this estimate, so there’s no reason to expect it to be particularly accurate unless is fairly large compared to . Nonetheless, let’s see if we can learn anything from it about how big we should expect the partition function itself to be. Taking logs and using the rough form of Stirling’s approximation gives

and substituting for some gives

.

If then this quantity is negative; at this point we’re clearly not in the regime where this approximation is accurate. If then the first term dominates and we get something that grows like . But if then the first term vanishes and we get

.

This is at least qualitatively the correct behavior for , and the multiplicative constant isn’t so off either: the correct value is .

*Example.* A weak order is like a total order, but with ties: for example, it describes a possible outcome of a horse race. Let $a_n$ denote the number of weak orders of $n$ elements (the ordered Bell numbers, or Fubini numbers). General techniques described in Flajolet and Sedgewick can be used to show that $a_n$ has exponential generating function

$$A(z) = \sum_{n \ge 0} \frac{a_n}{n!} z^n = \frac{1}{2 - e^z}$$

(which should be parsed as $\frac{1}{1 - (e^z - 1)}$; loosely speaking, this corresponds to a description of the combinatorial species of weak orders as the species of “lists of nonempty sets”). This is a meromorphic function with poles at $z = \log 2 + 2 \pi i k, k \in \mathbb{Z}$. The unique pole closest to the origin is at $z = \log 2$, and we can compute using l’Hopital’s rule that

$$\lim_{z \to \log 2} \frac{\log 2 - z}{2 - e^z} = \lim_{z \to \log 2} \frac{-1}{-e^z} = \frac{1}{2}$$

and hence the partial fraction decomposition of $A(z)$ begins

$$A(z) = \frac{1}{2 (\log 2 - z)} + \dots$$

which gives the asymptotic

$$a_n \sim \frac{n!}{2 (\log 2)^{n+1}}.$$

Curiously, the error in the above approximation has some funny quasi-periodic behavior corresponding to the arguments of the next most relevant poles, at $z = \log 2 \pm 2 \pi i$.
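The quality of this kind of pole estimate is easy to check by machine. The sketch below, in Python, assumes the standard recurrence $a_n = \sum_{k=1}^n \binom{n}{k} a_{n-k}$ for the ordered Bell numbers (choose which $k$ elements tie for first place) and the standard leading-order asymptotic $\frac{n!}{2 (\log 2)^{n+1}}$ coming from the pole at $\log 2$; neither the recurrence nor the function names come from the text above.

```python
import math

def ordered_bell_numbers(n_max):
    """Return [a_0, ..., a_{n_max}], the numbers of weak orders on n elements."""
    a = [1]
    for n in range(1, n_max + 1):
        # Pick the k elements tied for first place, then weakly order the rest.
        a.append(sum(math.comb(n, k) * a[n - k] for k in range(1, n + 1)))
    return a

def pole_asymptotic(n):
    """Contribution of the pole closest to the origin: n! / (2 * (log 2)^(n+1))."""
    return math.factorial(n) / (2 * math.log(2) ** (n + 1))

if __name__ == "__main__":
    a = ordered_bell_numbers(15)
    for n in range(1, 16):
        print(n, a[n], pole_asymptotic(n) / a[n])
```

The printed ratios approach $1$ very quickly, since the next poles are roughly nine times farther from the origin.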

**The saddle point bound**

However, understanding meromorphic functions is not enough to handle the case of partitions, where the relevant generating function is

$$P(z) = \sum_{n \ge 0} p(n) z^n = \prod_{k \ge 1} \frac{1}{1 - z^k}.$$

This function is holomorphic inside the unit disk but has essential singularities at every root of unity, and to handle it we will need a more powerful method known as the **saddle point method**, which is beautifully explained with pictures both in Flajolet and Sedgewick and in these slides, and concisely explained in these notes by Jacques Verstraete.

The starting point is that we can recover the coefficients of a generating function $A(z) = \sum_{n \ge 0} a_n z^n$ using the Cauchy integral formula, which gives

$$a_n = \frac{1}{2 \pi i} \oint_\gamma \frac{A(z)}{z^{n+1}} \, dz$$

where $\gamma$ is a closed contour in $\mathbb{C}$ with winding number $1$ around the origin and such that $A(z)$ is holomorphic in an open disk containing $\gamma$. This integral can be straightforwardly bounded by the product of the length of $\gamma$, denoted $\ell(\gamma)$, and the maximum value that $\left| \frac{A(z)}{z^{n+1}} \right|$ takes on it, which gives

$$|a_n| \le \frac{\ell(\gamma)}{2 \pi} \max_{z \in \gamma} \left| \frac{A(z)}{z^{n+1}} \right|.$$

From here, the name of the game is to attempt to pick $\gamma$ such that this bound is as good as possible. For example, if we pick $\gamma$ to be the circle of radius $r$, then $\ell(\gamma) = 2 \pi r$. If $A(z)$ has nonnegative coefficients, which will always be the case in combinatorial applications, then $|A(z)|$ will take its maximum value when $z = r$, which gives the **saddle point bound**

$$a_n \le \frac{A(r)}{r^n}$$

as long as $A(z)$ is holomorphic in an open disk of radius greater than $r$. Strictly speaking, this bound doesn’t require any complex analysis to prove: if $A(z)$ has nonnegative coefficients and $A(r)$ converges then for $r > 0$ we clearly have

$$a_n r^n \le \sum_{m \ge 0} a_m r^m = A(r).$$

But later we will actually use some complex analysis to improve the saddle point bound to an estimate.

The saddle point bound gets its name from what happens when we try to optimize this bound as a function of $r$: we’re led to pick a value of $r$ such that $\frac{A(r)}{r^n}$ is minimized (subject to the constraint that $r$ is less than the radius of convergence of $A$). This value will be a solution to the **saddle point equation**

$$\frac{d}{dr} \frac{A(r)}{r^n} = \frac{A'(r)}{r^n} - \frac{n A(r)}{r^{n+1}} = 0$$

and at such points the function $\left| \frac{A(z)}{z^n} \right|$, as a real-valued function of $z$, will have a saddle point. The saddle point equation can be rearranged into the more convenient form

$$\frac{r A'(r)}{A(r)} = n$$

and from here we can attempt to pick $r$ (which will generally depend on $n$) such that this equation is at least approximately satisfied, then see what kind of bound we get from doing so.

We’ll often be able to write $A(z) = e^{f(z)}$ for some nice $f$, in which case the saddle point equation simplifies further to

$$r f'(r) = n.$$

*Example.* Let ; we’ll use this generating function to get a lower bound on factorials. The saddle point bound gives

for any , since has infinite radius of convergence. The saddle point equation gives , which gives the upper bound

or equivalently the lower bound

.

We see that the saddle point bound already gets us within a factor of of the true answer.
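Here is a quick numerical sanity check of this example in Python. It is a sketch under the assumption that the bound in question is $n! \ge \frac{n^n}{e^n}$ (from taking $r = n$ in the saddle point bound for $e^z$), and it compares the gap with the $\sqrt{2 \pi n}$ factor from Stirling’s approximation; logarithms are used throughout to avoid overflow.

```python
import math

def log_saddle_lower_bound(n):
    """log of n^n / e^n, the lower bound on n! obtained from r = n."""
    return n * math.log(n) - n

def log_factorial(n):
    """log(n!) computed via the log-gamma function."""
    return math.lgamma(n + 1)

def gap(n):
    """The factor n! / (n^n / e^n) by which the bound misses the true value."""
    return math.exp(log_factorial(n) - log_saddle_lower_bound(n))

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, gap(n), math.sqrt(2 * math.pi * n))
```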

This example has a probabilistic interpretation: the saddle point bound can be rearranged to read

which says that the probability that a Poisson random variable with rate takes the value is at most . When we take we’re looking at the Poisson random variable with rate , which for large is concentrated around its mean .

*Example.* Let denote the number of involutions (permutations squaring to the identity) on elements. These are precisely the permutations consisting of cycles of length or , so the exponential formula gives an exponential generating function

.

The saddle point bound gives

for any , since as above has infinite radius of convergence. The saddle point equation is

with exact solution

and using for simplicity gives

This approximation turns out to also only be off by a factor of from the true answer.
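The same kind of check works for involutions. The Python sketch below assumes the standard recurrence $I_n = I_{n-1} + (n-1) I_{n-2}$ (the element $n$ is either a fixed point or swapped with one of the other $n - 1$ elements) and the saddle point bound in the form $\frac{I_n}{n!} \le \frac{\exp(r + r^2/2)}{r^n}$ with the simplified choice $r = \sqrt{n}$; the recurrence and the function names are not from the text above.

```python
import math

def involution_numbers(n_max):
    """Return [I_0, ..., I_{n_max}], the numbers of involutions on n elements."""
    inv = [1, 1]
    for n in range(2, n_max + 1):
        # n is a fixed point, or n is paired with one of the other n - 1 elements
        inv.append(inv[n - 1] + (n - 1) * inv[n - 2])
    return inv[: n_max + 1]

def log_saddle_bound(n):
    """log of n! * exp(r + r^2 / 2) / r^n at the approximate saddle point r = sqrt(n)."""
    r = math.sqrt(n)
    return math.lgamma(n + 1) + r + r * r / 2 - n * math.log(r)

if __name__ == "__main__":
    inv = involution_numbers(50)
    for n in (5, 10, 20, 50):
        overshoot = math.exp(log_saddle_bound(n) - math.log(inv[n]))
        print(n, inv[n], overshoot)
```

The printed overshoot factor grows only slowly with $n$, in contrast to the super-exponential growth of $I_n$ itself.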

*Example.* The Bell numbers count the number of partitions of a set of elements into disjoint subsets. They have exponential generating function

(which also admits a combinatorial species description: this is the species of “sets of nonempty sets”), which as above has infinite radius of convergence. The saddle point equation is

which is approximately solved when (a bit more precisely, we want something more like , and it’s possible to keep going from here but let’s not), giving the saddle point bound

.

This turns out to be off from the true answer by a factor of something like .
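And once more for the Bell numbers. This Python sketch assumes the saddle point bound in the form $\frac{B_n}{n!} \le \frac{\exp(e^r - 1)}{r^n}$ and, as a labeled simplification, the crude choice $r = \log n$ for the approximate saddle point; the Bell numbers themselves are computed with the Bell triangle, which is a standard method not mentioned above.

```python
import math

def bell_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] using the Bell triangle."""
    row = [1]
    bells = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

def log_saddle_bound(n):
    """log of n! * exp(e^r - 1) / r^n with the assumed choice r = log(n), for n >= 2."""
    r = math.log(n)
    return math.lgamma(n + 1) + math.exp(r) - 1 - n * math.log(r)

if __name__ == "__main__":
    bells = bell_numbers(40)
    for n in (5, 10, 20, 40):
        print(n, bells[n], math.exp(log_saddle_bound(n) - math.log(bells[n])))
```

Since the bound holds for every $r > 0$, a sloppy choice of $r$ never makes it false, only weaker.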

*Example.* Now let’s tackle the example of the partition function . Its generating function

has radius of convergence , unlike the previous examples, so our saddle points will be confined to the interval (between the pole of at and the essential singularity at ). The saddle point equation involves understanding the logarithmic derivative of , so let’s try to understand the logarithm. ( previously appeared on this blog here as the generating function of the number of subgroups of of index , although that won’t be directly relevant here.) The logarithm is

and it will turn out to be convenient to rearrange this a little: expanding this out gives

and exchanging the order of summation gives

.

The point of writing in this way is to make the behavior as very clear: we see that as , approaches

.

It turns out that as , the saddle point gets closer and closer to the essential singularity at . Near this singularity we may as well for the sake of convenience replace with , which gives the approximate saddle point equation

.

This saddle point equation reinforces the idea that is close to : the only way to solve it in the interval is to take something like , so we can ignore the factor, which gives the approximate saddle point

.

and saddle point bound

.

From here we need an upper bound on and a lower bound on to get an upper bound on . **Edit, 12/12/16:** As Akshaj explains in the comments, the argument that was previously here regarding the lower bound on was incorrect. Akshaj’s argument involving the Taylor series of log gives

As for the upper bound on , if then for , hence

from which we conclude that

and hence that

which is off from the true answer.

Of course, without knowing a better method that in fact gives the true answer, we have no way of independently verifying that the saddle point bounds are as close as we’ve claimed they are. We need a more powerful idea to turn these bounds into asymptotics and recover our factors of .
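Although we can’t verify the asymptotics this way, we can at least check the saddle point bound itself numerically. The following is a minimal sketch, assuming the approximate saddle point $r = 1 - \frac{\pi}{\sqrt{6n}}$ (which lies in $(0, 1)$ only for $n \ge 2$) and computing $\log P(r) = -\sum_{k \ge 1} \log(1 - r^k)$ by truncating the sum; the function names and the truncation tolerance are choices made for this illustration.

```python
import math

def partitions(n_max):
    """Return [p(0), ..., p(n_max)] via the standard dynamic program over part sizes."""
    p = [1] + [0] * n_max
    for part in range(1, n_max + 1):
        for total in range(part, n_max + 1):
            p[total] += p[total - part]
    return p

def log_P(r, tol=1e-18):
    """log P(r) = -sum_k log(1 - r^k), truncated once r^k drops below tol."""
    total, power = 0.0, r
    while power > tol:
        total -= math.log1p(-power)
        power *= r
    return total

def log_saddle_bound(n):
    """log(P(r) / r^n) at the approximate saddle point r = 1 - pi / sqrt(6n)."""
    r = 1 - math.pi / math.sqrt(6 * n)
    return log_P(r) - n * math.log(r)

if __name__ == "__main__":
    p = partitions(200)
    for n in (10, 50, 100, 200):
        print(n, round(math.log(p[n]), 3), round(log_saddle_bound(n), 3),
              round(math.pi * math.sqrt(2 * n / 3), 3))
```

Each printed row shows $\log p(n)$, the logarithm of the saddle point bound, and $\pi \sqrt{\frac{2n}{3}}$ side by side, which makes it easy to see how much room is left between the bound and the truth.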

**Hardy’s approach**

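In the eighth of Hardy’s twelve lectures on Ramanujan’s work, he describes a more down-to-earth way to guess that

$$\log p(n) \sim \pi \sqrt{\frac{2n}{3}}$$

starting from the approximation

$$\log P(z) \sim \frac{\pi^2}{6(1 - z)}$$

as $z \to 1$. He first observes that $p(n)$ must grow faster than a polynomial but slower than an exponential: if $p(n)$ grew like a polynomial then $P(z)$ would have a pole of finite order at $z = 1$, whereas if $p(n)$ grew like an exponential then $P(z)$ would have a singularity closer to the origin. Hence, in Hardy’s words, it is “natural to conjecture” that

$$p(n) \approx \exp \left( a n^b \right)$$

for some $a > 0$ and some $0 < b < 1$. From here he more or less employs a saddle point bound in reverse, estimating

$$P(z) \approx \sum_{n \ge 0} \exp \left( a n^b \right) z^n$$

based on the size of its largest term. It’s convenient to write $z = e^{-t}$ so that this sum can be rewritten

$$\sum_{n \ge 0} \exp \left( a n^b - n t \right)$$

so that, differentiating in $n$, we see that the maximum occurs when $a b n^{b-1} = t$. We want to turn this into an estimate for $\log P(z)$ involving only $z$, so we want to eliminate $n$ and use the approximation $t \approx 1 - z$, valid as $z \to 1$. This gives

$$n \approx \left( \frac{ab}{1 - z} \right)^{\frac{1}{1 - b}}$$

(keeping in mind that $1 - b$ is positive), so that

$$a n^b \approx a \left( \frac{ab}{1 - z} \right)^{\frac{b}{1 - b}}$$

and

$$n t = a b n^b \approx ab \left( \frac{ab}{1 - z} \right)^{\frac{b}{1 - b}}$$

which altogether gives (again, approximating by its largest term)

$$\log P(z) \approx \frac{C}{(1 - z)^{\frac{b}{1 - b}}}$$

for $C = a (1 - b) (ab)^{\frac{b}{1 - b}}$. Matching this up with $\log P(z) \sim \frac{\pi^2}{6(1 - z)}$ gives $b = \frac{1}{2}$ and

$$C = \frac{a^2}{4} = \frac{\pi^2}{6},$$

hence $a = \pi \sqrt{\frac{2}{3}}$.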

**The saddle point method**

The saddle point bound, although surprisingly informative, uses very little of the information provided by the Cauchy integral formula. We ought to be able to do a lot better by picking a contour $\gamma$ to integrate over such that, by analyzing the integrand more closely, we can bound the contour integral

$$a_n = \frac{1}{2 \pi i} \oint_{\gamma} \frac{f(z)}{z^{n+1}} \, dz$$

more carefully than just by the “trivial” bound $|a_n| \le \frac{\ell(\gamma)}{2\pi} \max_{z \in \gamma} \left| \frac{f(z)}{z^{n+1}} \right|$, where $\ell(\gamma)$ is the length of $\gamma$.

From here we’ll still be looking at saddle points, but more carefully, as follows. Ignoring the length factor $\ell(\gamma)$ in the trivial bound, if we try to minimize $\max_{z \in \gamma} \left| \frac{f(z)}{z^{n+1}} \right|$ we’ll end up at a saddle point of $\frac{f(z)}{z^{n+1}}$ (so we get a point very slightly different from above, where we used a saddle point of $\frac{f(z)}{z^n}$). Depending on the geometry of the locations of the zeroes, poles, and saddle points of $\frac{f(z)}{z^{n+1}}$, we can hope to choose $\gamma$ to be a contour that passes through this saddle point in such a way that $\left| \frac{f(z)}{z^{n+1}} \right|$ is in fact maximized on $\gamma$ at the saddle point. This means that $\gamma$ should enter and exit the “saddle” around the saddle point in the direction of steepest descent from the saddle point.

If we pick $\gamma$ carefully, and $f$ has certain nice properties, we can furthermore hope that

- the contour integral over a small arc (in terms of the circle method, the “major arc”) is easy to approximate (usually by a Gaussian integral), and
- the contour integral over everything else (in terms of the circle method, the “minor arc”) is small enough that it’s easy to bound.

The Gaussian integrals that often appear when integrating over the major arc are responsible for the factors of we lost above.

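Let’s see how this works in the case of the factorials, where $f(z) = e^z$. The function $\frac{e^z}{z^{n+1}}$ has a unique saddle point at $z = n + 1$, but to simplify the computation we’ll take $r = n$ as before. We’ll take $\gamma$ to be the circle of radius $n$, which gives a contour integral which can be rewritten in polar coordinates as

$$\frac{1}{n!} = \frac{1}{2 \pi n^n} \int_{-\pi}^{\pi} \exp \left( n e^{i \theta} - i n \theta \right) \, d \theta.$$

This integral can also be thought of as coming from computing the Fourier coefficients of a suitable Fourier series. Write the integrand as $\exp \left( n \cos \theta + i n (\sin \theta - \theta) \right)$, so that the exponent has real part $n \cos \theta$ and imaginary part $n (\sin \theta - \theta)$.

The imaginary part $n (\sin \theta - \theta)$ only controls the phase of the integrand, and since it doesn’t vary much (it grows like $\theta^3$ for small $\theta$) we’ll be able to ignore it. The real part $n \cos \theta$ controls the absolute value of the integrand and so is much more important. For small values of $\theta$ we have

$$n \cos \theta \approx n - \frac{n \theta^2}{2}$$

so from here we can try to break up the integral into a “major arc” where $|\theta| \le \delta$ for some small $\delta$ (where the meaning of “small” depends on $n$) and a “minor arc” consisting of the other values of $\theta$, and try to show both that the integral over the major arc is well approximated by the Gaussian integral

$$\int_{-\infty}^{\infty} e^{- \frac{n \theta^2}{2}} \, d \theta = \sqrt{\frac{2\pi}{n}}$$

and that the integral over the minor arc is negligible compared to this. This can be done, and the details are in Flajolet and Sedgewick (who take $\delta = n^{-2/5}$); ignoring all the details, the conclusion is that

$$\frac{1}{n!} \sim \frac{e^n}{n^n \sqrt{2 \pi n}}$$

and hence that

$$n! \sim \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n$$

which is exactly Stirling’s approximation. This computation also has a probabilistic interpretation: it says that the probability that a Poisson random variable with rate $n$ takes its mean value is asymptotically $\frac{1}{\sqrt{2 \pi n}}$, which can be viewed as a corollary of the central limit theorem, since such a Poisson random variable is a sum of $n$ independent Poisson random variables with rate $1$.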
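The contour integral above is easy to evaluate numerically, which gives a quick check on the Gaussian approximation. The sketch below assumes the normalization $\frac{1}{2\pi} \int_{-\pi}^{\pi} \exp \left( n (e^{i\theta} - 1) - i n \theta \right) d\theta = \frac{e^{-n} n^n}{n!}$ (the Poisson probability mentioned above) and uses a plain equally spaced Riemann sum, which is very accurate for periodic integrands; the function names and the number of sample points are illustrative choices.

```python
import cmath
import math

def poisson_mode_integral(n, steps=4096):
    """(1 / 2pi) * integral over [-pi, pi] of exp(n (e^{it} - 1) - i n t) dt."""
    h = 2 * math.pi / steps
    total = 0.0
    for j in range(steps):
        t = -math.pi + j * h
        # The imaginary parts cancel by symmetry, so only the real part is summed.
        total += cmath.exp(n * (cmath.exp(1j * t) - 1) - 1j * n * t).real
    return total * h / (2 * math.pi)

def poisson_mode_exact(n):
    """e^{-n} n^n / n!, computed in logarithms."""
    return math.exp(n * math.log(n) - n - math.lgamma(n + 1))

def gaussian_prediction(n):
    """The leading-order saddle point prediction 1 / sqrt(2 pi n)."""
    return 1 / math.sqrt(2 * math.pi * n)

if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, poisson_mode_integral(n), poisson_mode_exact(n), gaussian_prediction(n))
```

The first two columns should agree to many digits, while the third approaches them as $n$ grows, reflecting the relative error of order $\frac{1}{n}$ in Stirling’s approximation.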

In general we’ll again find that, under suitable hypotheses, we can approximate the major arc integral by a Gaussian integral (using the same strategy as above) and bound the minor arc integral to show that it’s negligible. This gives the following:

**Theorem (saddle point approximation):** Under suitable hypotheses, if and , let be a saddle point of , so that . Then, as , we have

.

Directly applying this theorem to the partition function is difficult because of the difficulty of bounding what happens on the minor arc. $P(z)$ has essential singularities on a dense subset of the unit circle, and delicate analysis has to be done to describe the contributions of these singularities (or more precisely, of saddle points near these singularities); the circle method used by Hardy and Ramanujan to prove the asymptotic formula accomplishes this by choosing the contour $\gamma$ very carefully and then using modularity properties of $P(z)$ (which is closely related to the eta function).

We will completely ignore these difficulties and pretend that only the contribution from the (saddle point near the) essential singularity at $z = 1$ matters to the leading term. Even ignoring the minor arc, to make use of the saddle point approximation requires that we know the asymptotics of $P(z)$ as $z \to 1$ in more detail than we do right now.

Unfortunately there does not seem to be a really easy way to do this; Hardy’s approach uses the modular properties of the eta function, while Flajolet and Sedgewick use Mellin transforms. So at this point we’ll just quote without proof the asymptotic we need from Flajolet and Sedgewick, up to the accuracy we need, namely

.

Although this changes the location of the saddle point slightly, for ease of computation (and because it will lose us at worst multiplicative factors in the end) we’ll continue to work with the same approximate saddle point

as before. The saddle point approximation differs from the saddle point bound we established earlier in two ways: first, the introduction of the term contributes a factor of

and second, the introduction of the denominator contributes another factor, which we approximate as follows. We have

and hence

so that

which gives (we actually know the multiplicative constant here, but it doesn’t matter because we already lost multiplicative constants when estimating ) and hence

.

Altogether the saddle point approximation, up to a multiplicative constant, is

.


(Disclaimer: this blog does not endorse any of the opinions Hardy expresses in the Apology, e.g. the one about mathematics being a young man’s game, the one about pure math being better than applied math, or the one about exposition being an unfit activity for a real mathematician. The opinion of this blog is that the Apology should be read mostly for insight into Hardy’s psychology rather than for guidance about how to do mathematics.)

Anyway, since this is a movie about Ramanujan, let’s talk about some of the math that appears in the movie. It’s what he would have wanted, probably.

**Elliptic integrals**

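There’s a moment in the movie where a Cambridge professor writes on the board (if memory serves) the complete elliptic integral of the first kind

$$K(k) = \int_0^{\frac{\pi}{2}} \frac{d \theta}{\sqrt{1 - k^2 \sin^2 \theta}}$$

and goads Ramanujan into stepping up to the board, presumably with the intent to embarrass him, whereupon Ramanujan immediately writes down the Taylor series expansion of the integral as a function of $k$. In the movie it’s a bit unclear whether this meant he worked out the answer off the top of his head or knew it already, but my inclination is to assume the latter based on the fact that this integral appears in Carr’s *Synopsis*, which Ramanujan famously studied in India.

In any case, how might we go about finding this Taylor series? A natural strategy is to first compute the Taylor series of the integrand, then integrate it term-by-term, especially if, in the spirit of Ramanujan, we’re willing to play fast and loose with issues like where the Taylor series converges and whether we can exchange infinite sums and integrals here. Using the general form of the binomial theorem, the integrand expands to

$$\left( 1 - k^2 \sin^2 \theta \right)^{-\frac{1}{2}} = \sum_{n \ge 0} \binom{-\frac{1}{2}}{n} (-1)^n k^{2n} \sin^{2n} \theta$$

where

$$\binom{-\frac{1}{2}}{n} = \frac{\left( -\frac{1}{2} \right) \left( -\frac{3}{2} \right) \cdots \left( -\frac{2n-1}{2} \right)}{n!}$$

which simplifies to

$$\binom{-\frac{1}{2}}{n} = \frac{(-1)^n}{4^n} \binom{2n}{n}.$$

It follows that we have

$$K(k) = \sum_{n \ge 0} \frac{1}{4^n} \binom{2n}{n} k^{2n} \int_0^{\frac{\pi}{2}} \sin^{2n} \theta \, d \theta$$

so we’re left with computing the integral of $\sin^{2n} \theta$. This is straightforward to compute using Euler’s formula, which gives

$$\sin^{2n} \theta = \left( \frac{e^{i \theta} - e^{-i \theta}}{2i} \right)^{2n} = \frac{(-1)^n}{4^n} \left( e^{i \theta} - e^{-i \theta} \right)^{2n}.$$

Using a second application of the binomial theorem, the $\left( e^{i \theta} - e^{-i \theta} \right)^{2n}$ term is

$$\sum_{j=0}^{2n} \binom{2n}{j} (-1)^j e^{i (2n - 2j) \theta}.$$

Since we know the answer is real, we can ignore the imaginary part of this integral and focus on its real part. The integral of the real part

$$\int_0^{\frac{\pi}{2}} \cos \left( (2n - 2j) \theta \right) \, d \theta$$

vanishes unless $j = n$ by symmetry considerations, so $j = n$ is the only relevant term and we get

$$\int_0^{\frac{\pi}{2}} \sin^{2n} \theta \, d \theta = \frac{1}{4^n} \binom{2n}{n} \int_0^{\frac{\pi}{2}} d \theta$$

where the integral is just $\frac{\pi}{2}$. This gives the final answer

$$K(k) = \frac{\pi}{2} \sum_{n \ge 0} \frac{1}{16^n} \binom{2n}{n}^2 k^{2n}$$

which Ramanujan wrote as

$$\frac{\pi}{2} \left( 1 + \left( \frac{1}{2} \right)^2 k^2 + \left( \frac{1 \cdot 3}{2 \cdot 4} \right)^2 k^4 + \cdots \right).$$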
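For readers who would rather not play fast and loose, the series is easy to test numerically. Here is a minimal sketch, assuming the normalization $K(k) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - k^2 \sin^2 \theta}}$ and the series $\frac{\pi}{2} \sum_{n \ge 0} \left( \frac{1}{4^n} \binom{2n}{n} \right)^2 k^{2n}$; the function names, the number of terms, and the midpoint rule are choices made for this illustration.

```python
import math

def K_series(k, terms=200):
    """(pi / 2) * sum over n of (binom(2n, n) / 4^n)^2 * k^(2n), valid for |k| < 1."""
    total = 0.0
    coeff = 1.0  # coeff holds binom(2n, n) / 4^n for the current n
    for n in range(terms):
        total += coeff * coeff * k ** (2 * n)
        # Ratio of consecutive coefficients is (2n + 1) / (2n + 2).
        coeff *= (2 * n + 1) / (2 * n + 2)
    return math.pi / 2 * total

def K_numeric(k, steps=20000):
    """Midpoint rule applied directly to the defining integral."""
    h = (math.pi / 2) / steps
    return h * sum(
        1 / math.sqrt(1 - (k * math.sin((j + 0.5) * h)) ** 2)
        for j in range(steps)
    )

if __name__ == "__main__":
    for k in (0.1, 0.5, 0.9):
        print(k, K_series(k), K_numeric(k))
```

The two columns should agree closely for moderate $k$; the series converges more slowly as $k$ approaches $1$, where the integral itself diverges.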

**Ramanujan’s prime number theorem**

Ramanujan claimed, in his letters to Hardy, that he had found a more or less exact formula for the prime counting function $\pi(x)$ (the number of primes less than or equal to $x$). Upon closer inspection by Hardy and Littlewood, this formula was later shown to be incorrect. This is part of the ongoing tension in the movie between mathematical intuition, as exemplified by Ramanujan, and mathematical rigor, as exemplified by Hardy; Hardy emphasizes that situations like this are why intuition is not enough and Ramanujan needs rigor as well. But the movie never explains what, exactly, Ramanujan’s error was. So what was it?

**Edit, 5/9/16:** In fact Hardy writes about this error in *The Indian Mathematician Ramanujan* (beginning on page 150), as I learned from Alison Miller in the comments. It’s well worth reading everything Hardy has to say, but on the subject of $\pi(x)$ in particular, he writes

Ramanujan’s theory of primes was vitiated by his ignorance of the theory of functions of a complex variable. It was (so to say) what the theory might be if the Zeta-function had no complex zeros. His method depended upon a wholesale use of divergent series… That his proofs should have been invalid was only to be expected. But the mistakes went deeper than that, and many of the actual results were false. He had obtained the dominant terms of the classical formulae, although by invalid methods; but none of them are such close approximations as he supposed.

So, it sounds like what happened is that Ramanujan found a version of the explicit formulas relating $\pi(x)$ to the zeroes of the Riemann zeta function. However, he believed, incorrectly, that the zeta function had no complex zeroes, and so didn’t include the terms in the explicit formulas having to do with those zeroes; this simplifies the formulas but at the cost of introducing errors which Ramanujan did not do enough computations to notice. Hardy once said elsewhere of Ramanujan that he

had indeed but the vaguest idea of what a function of a complex variable was

and says here, more specifically, that he

knew nothing at all about the theory of analytic functions

so this is perhaps unsurprising.

**Black holes**

At some point in the movie there is some text claiming that Ramanujan’s work is now being applied to understand black holes. It’s easy for such claims to be overblown: for example, when Grothendieck died, some articles claimed that his work had applications to subjects like cryptography, robotics, and genetics. These claims come from a combination of two claims:

- Grothendieck’s work had a big impact on algebraic geometry.
- Algebraic geometry is applied to cryptography, robotics, and genetics.

However, as far as I can tell, Grothendieck’s work in particular has no direct relevance to cryptography, robotics, or genetics, although I’d be happy to see evidence to the contrary.

But this black hole claim seems to check out, sort of. I believe it refers to Ramanujan’s work on mock modular forms, which he studied in the last year of his life, after leaving England and before he died (not shown in the movie). Ramanujan described various examples of such functions, but a general theory, including a general definition, was missing until surprisingly recently, when Zwegers showed in 2002 that they were related to harmonic Maass forms.

The connection to black holes comes from Dabholkar, Murthy, and Zagier, who showed that certain mock modular forms arise as generating functions of BPS states in certain supersymmetric string theories, which are relevant to the study of black holes from the perspective of quantum gravity. This ties Ramanujan’s mock modular forms to a rich interaction between physics and mathematics involving the AdS/CFT correspondence, also known as the holographic principle, and variants of Monstrous moonshine such as umbral moonshine.

A more famous example of this relationship comes from Monstrous moonshine itself, as follows. Perhaps the most famous non-mock modular form is the j-invariant, whose Fourier expansion begins

$$j(\tau) = \frac{1}{q} + 744 + 196884 q + 21493760 q^2 + \cdots.$$

The story of Monstrous moonshine begins with McKay’s famous 1978 observation that the coefficients of $j$ can be written as sums of the dimensions of the irreducible representations of the Monster group; for example, the dimension of its smallest nontrivial irreducible representation is $196883$, and $196884 = 196883 + 1$. Frenkel, Lepowsky, and Meurman later showed that this is because the j-invariant is the generating function for the dimensions of the graded pieces of a vertex operator algebra on which the Monster acts, and which describes a certain conformal field theory related to the Leech lattice.

Here the relationship to black holes comes from Witten (via John Baez), who suggested that the Monster conformal field theory might have something to do with 3d (really 2+1d; 2 space, 1 time) quantum gravity, and hence with 3d black holes, via the holographic principle. A tantalizing piece of numerical evidence for this conjecture comes from calculations of black hole entropy. The lightest black hole in one version of the theory has $196883$ states, and so its entropy is

$$\log 196883 \approx 12.19$$

whereas a semiclassical approximation to this entropy, using the Bekenstein-Hawking formula, gives

$$4 \pi \approx 12.57.$$

These aren’t supposed to agree exactly because there are quantum corrections to the semiclassical approximation. There is a parameter $k$ that can be varied in the theory, and as $k \to \infty$ the agreement between the quantum and semiclassical answers becomes exact. This is proven using a known asymptotic for the coefficients of the j-invariant, namely that the coefficient $c_n$ of $q^n$ satisfies

$$c_n \sim \frac{e^{4 \pi \sqrt{n}}}{\sqrt{2} \, n^{3/4}}.$$

It’s very curious to think that this might be related to black hole entropy.
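The numbers here are small enough to check by hand or by machine. The following sketch compares the first three positive-index coefficients of the j-invariant, $196884$, $21493760$, and $864299970$, with the asymptotic $\frac{e^{4 \pi \sqrt{n}}}{\sqrt{2} \, n^{3/4}}$, and compares the two entropies $\log 196883$ and $4\pi$. The coefficient table is hard-coded, and the names are made up for this illustration.

```python
import math

# First few Fourier coefficients of the j-invariant: coefficient of q^n for n = 1, 2, 3.
J_COEFFS = {1: 196884, 2: 21493760, 3: 864299970}

def asymptotic(n):
    """The asymptotic e^{4 pi sqrt(n)} / (sqrt(2) n^{3/4}) for the coefficient of q^n."""
    return math.exp(4 * math.pi * math.sqrt(n)) / (math.sqrt(2) * n ** 0.75)

def entropies():
    """Return (quantum, semiclassical) entropies for the lightest black hole."""
    return math.log(196883), 4 * math.pi

if __name__ == "__main__":
    for n, c in J_COEFFS.items():
        print(n, c, round(asymptotic(n) / c, 4))
    print(entropies())
```

Even for these tiny values of $n$ the ratio between the asymptotic and the true coefficient is within a few percent of $1$, which is the same phenomenon that makes the two entropies so close.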

Incidentally, the proof of this result (with good error terms) relies on the Hardy-Littlewood circle method, which was pioneered by Hardy and Ramanujan in the work on the asymptotics of the partition function. This is a major part of the movie which we’ll defer discussion of to a second post.


**Definition-Theorem:** The following conditions on $A$ are all equivalent, and all define what it means for $A$ to be a **separable** $k$-algebra:

- $A$ is projective as an $(A, A)$-bimodule (equivalently, as a left $A \otimes_k A^{op}$-module).
- The multiplication map $m : A \otimes_k A^{op} \ni (a, b) \mapsto ab \in A$ has a section as an $(A, A)$-bimodule map.
- $A$ admits a **separability idempotent**: an element $p \in A \otimes_k A^{op}$ such that $m(p) = 1$ and $ap = pa$ for all $a \in A$ (which implies that $p^2 = p$).

(**Edit, 3/27/16:** Previously this definition included a condition involving Hochschild cohomology, but it’s debatable whether what I had in mind is the correct definition of Hochschild cohomology unless $k$ is a field or $A$ is projective over $k$. It’s been removed since it plays no role in the post anyway.)

When $k$ is a field, this condition turns out to be a natural strengthening of the condition that $A$ is semisimple. In general, loosely speaking, a separable $k$-algebra is like a “bundle of semisimple algebras” over $\text{Spec } k$.

**Proofs that the above conditions are equivalent**

$1 \Leftrightarrow 2$: the multiplication map

$$m : A \otimes_k A^{op} \to A$$

is an epimorphism of $(A, A)$-bimodules, so if $A$ is projective as an $(A, A)$-bimodule then it splits, meaning it has a section. Conversely, since $A \otimes_k A^{op}$ is a free $(A, A)$-bimodule, if this map has a section then $A$ is a retract of a free $(A, A)$-bimodule, hence is projective.

$2 \Leftrightarrow 3$: a section of the multiplication map, as above, is determined by what it does to $1 \in A$; let’s call the image $p$, and abuse terminology by identifying $p$ with the section it defines. What does it mean for $p$ to split the multiplication map? As a splitting, it must satisfy

$$m(p) = 1$$

since it’s the image of $1$. Second, as an $(A, A)$-bimodule map, it must satisfy $ap = pa$ for all $a \in A$, since $a 1 = 1 a$ in $A$. (In fact $A$ is the free $(A, A)$-bimodule on a generator with this property.) Writing $p = \sum_i x_i \otimes y_i$, these conditions together imply that

$$p^2 = \sum_i x_i \, p \, y_i = \sum_i x_i y_i \, p = m(p) \, p = p,$$

hence that $p$ is an idempotent, as the name “separability idempotent” suggests. Hence a splitting of the multiplication map is the same data as a separability idempotent, which is in fact an idempotent.

This concludes the proof.

**A note on $A \otimes_k A^{op}$**

Above we chose to write the source of the multiplication map as $A \otimes_k A^{op}$ to emphasize that it is the free $A \otimes_k A^{op}$-module, or equivalently the free $(A, A)$-bimodule, on a generator. However, it can just as well be written $A \otimes_k A$, provided that we remember that the natural $(A, A)$-bimodule structure on this is given by left multiplication on the first copy of $A$ and right multiplication on the second copy of $A$. (Also, when we think of the separability idempotent as an idempotent, we really have in mind the algebra structure on $A \otimes_k A^{op}$.) This is how we’ll write things down in examples below.

**Some examples**

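*Example.* $k$ itself is a separable $k$-algebra, since it is even free as a $k \otimes_k k^{op}$-module.

*Example.* The matrix algebra $M_n(k)$ is a separable $k$-algebra. We can prove this explicitly by writing down a separability idempotent. Letting $e_{ij}$ be the usual basis of $M_n(k)$ (with relations $e_{ij} e_{jk} = e_{ik}$, and all other multiplications zero), set

$$p = \sum_{i=1}^n e_{ij} \otimes e_{ji}$$

where $j$ is fixed. We have

$$m(p) = \sum_{i=1}^n e_{ij} e_{ji} = \sum_{i=1}^n e_{ii} = 1$$

and

$$e_{kl} \, p = \sum_{i=1}^n e_{kl} e_{ij} \otimes e_{ji} = e_{kj} \otimes e_{jl}$$

while

$$p \, e_{kl} = \sum_{i=1}^n e_{ij} \otimes e_{ji} e_{kl} = e_{kj} \otimes e_{jl}$$

so $p$ is a separability idempotent.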
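This verification is mechanical enough to hand to a computer. Below is a minimal sketch that checks the element $p = \sum_i e_{ij} \otimes e_{ji}$ of $M_n(k) \otimes_k M_n(k)$ against the defining conditions of a separability idempotent, using the bimodule structure described above (left multiplication on the first factor, right multiplication on the second). Matrix units are encoded as index pairs and tensors as dictionaries; this encoding and all function names are choices made for the illustration, with integer coefficients standing in for $k$.

```python
from collections import defaultdict

def unit_mul(a, b):
    """Product of matrix units e_a e_b for a = (i, j), b = (k, l): e_il if j == k, else None (zero)."""
    return (a[0], b[1]) if a[1] == b[0] else None

def idempotent(n, j):
    """p = sum_i e_ij (x) e_ji, stored as {(unit, unit): coefficient}."""
    return {((i, j), (j, i)): 1 for i in range(n)}

def left_act(unit, tensor):
    """unit . (x (x) y) = (unit x) (x) y."""
    out = defaultdict(int)
    for (x, y), c in tensor.items():
        prod = unit_mul(unit, x)
        if prod is not None:
            out[(prod, y)] += c
    return dict(out)

def right_act(tensor, unit):
    """(x (x) y) . unit = x (x) (y unit)."""
    out = defaultdict(int)
    for (x, y), c in tensor.items():
        prod = unit_mul(y, unit)
        if prod is not None:
            out[(x, prod)] += c
    return dict(out)

def multiply_out(tensor):
    """The multiplication map m(x (x) y) = xy, returned as {unit: coefficient}."""
    out = defaultdict(int)
    for (x, y), c in tensor.items():
        prod = unit_mul(x, y)
        if prod is not None:
            out[prod] += c
    return dict(out)

def is_separability_idempotent(n, j):
    """Check m(p) = 1 and e_kl p = p e_kl for every matrix unit e_kl."""
    p = idempotent(n, j)
    identity = {(i, i): 1 for i in range(n)}
    units = [(k, l) for k in range(n) for l in range(n)]
    return multiply_out(p) == identity and all(
        left_act(u, p) == right_act(p, u) for u in units
    )
```

Since both conditions are linear in the element acting, checking them on the basis of matrix units is enough. Note that each choice of $j$ gives a different separability idempotent, so these are far from unique.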

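*Example.* If $G$ is a finite group whose order $|G|$ is invertible in $k$, then the group algebra $k[G]$ is a separable $k$-algebra. Again we can prove this explicitly by writing down a separability idempotent. Set

$$p = \frac{1}{|G|} \sum_{g \in G} g \otimes g^{-1}.$$

We have

$$m(p) = \frac{1}{|G|} \sum_{g \in G} g g^{-1} = 1$$

and, for any $h \in G$,

$$\begin{aligned} h p &= \frac{1}{|G|} \sum_{g \in G} h g \otimes g^{-1} \\ &= \frac{1}{|G|} \sum_{g \in G} g \otimes g^{-1} h = p h \end{aligned}$$

where in the second line we made the substitution $g \mapsto h^{-1} g$. So $p$ is a separability idempotent.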
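Here is the same kind of mechanical check for a group algebra, as a minimal sketch. It takes $G = S_3$ (an arbitrary small nonabelian example) with rational coefficients, builds $p = \frac{1}{|G|} \sum_g g \otimes g^{-1}$, and tests $m(p) = 1$ and $hp = ph$ for every $h \in G$. Permutations are stored as tuples, and all names are made up for this illustration.

```python
from fractions import Fraction
from itertools import permutations

def compose(g, h):
    """(g h)(x) = g(h(x)) for permutations stored as tuples."""
    return tuple(g[h[x]] for x in range(len(h)))

def inverse(g):
    """Inverse permutation."""
    inv = [0] * len(g)
    for x, gx in enumerate(g):
        inv[gx] = x
    return tuple(inv)

G = list(permutations(range(3)))
IDENTITY = tuple(range(3))

# p = (1 / |G|) * sum over g of g (x) g^{-1}, stored as {(g, g'): coefficient}.
p = {(g, inverse(g)): Fraction(1, len(G)) for g in G}

def left_act(h, tensor):
    """h . (x (x) y) = (h x) (x) y."""
    return {(compose(h, x), y): c for (x, y), c in tensor.items()}

def right_act(tensor, h):
    """(x (x) y) . h = x (x) (y h)."""
    return {(x, compose(y, h)): c for (x, y), c in tensor.items()}

def multiply_out(tensor):
    """The multiplication map m(x (x) y) = xy, returned as {group element: coefficient}."""
    out = {}
    for (x, y), c in tensor.items():
        g = compose(x, y)
        out[g] = out.get(g, 0) + c
    return out
```

The division by $|G|$ is exactly where invertibility of the order is used; over a ring in which $|G|$ is not invertible this particular element is not available.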

To get some more examples it will be convenient to use the following lemma.

**Lemma:** If $A \otimes_k A^{op}$ is semisimple, then $A$ is separable over $k$.

*Proof.* A ring is semisimple iff every module over it is projective. So if $A \otimes_k A^{op}$ is semisimple, then in particular $A$ is a projective $A \otimes_k A^{op}$-module.

**Corollary:** If $L$ is a finite separable extension of a field $k$ (in the usual sense), then $L$ is separable over $k$ (in the above sense).

*Proof.* By the primitive element theorem, $L \cong k[x]/f(x)$ for some irreducible separable polynomial $f$. Hence

$$L \otimes_k L \cong L[x]/f(x) \cong \prod_i L[x]/f_i(x)$$

where $f_i$ are the irreducible factors of $f$ over $L$ (it has at least two, since by definition $L$ contains a root of $f$). This is a finite product of fields and hence semisimple, so by the lemma we conclude that $L$ is separable.

Over a field, this gives another way to prove that the matrix algebras $M_n(k)$ and the group algebras $k[G]$ (where $|G|$ is invertible in $k$) are separable: $M_n(k) \otimes_k M_n(k)^{op} \cong M_{n^2}(k)$ is semisimple, and so is $k[G] \otimes_k k[G]^{op} \cong k[G \times G]$. But the proofs via writing down separability idempotents work for much more general base rings.

**Some general lemmas**

**Lemma:** is a separable -algebra iff is.

*Proof.* The opposite of a separability idempotent for is a separability idempotent for .

More explicitly, suppose is a separability idempotent for . Then is a separability idempotent for .

**Lemma:** If and are separable -algebras, then so is .

*Proof.* The sum of separability idempotents for and is a separability idempotent for .

More explicitly, recall that tensor product distributes over finite products for algebras. (In the commutative case this means that product distributes over finite coproducts for affine schemes.) Hence

and this is even an isomorphism of -bimodules, respecting the multiplication map down to . From this it’s not hard to verify that if is a separability idempotent for , and is a separability idempotent for , then , included into the above, is a separability idempotent for .

**Lemma:** If and are separable -algebras, then so is .

*Proof.* The tensor product of separability idempotents for and is a separability idempotent for .

**Lemma:** If is a separable -algebra, then the base change is a separable -algebra, for any commutative -algebra . (Hence separability is a **geometric** property in the strong sense that it is preserved by arbitrary base change.)

*Proof.* A separability idempotent for remains a separability idempotent for .

Note that this is not true for semisimple algebras, since an inseparable extension of the ground field is a counterexample.

**Lemma:** Any quotient of a separable algebra is separable.

*Proof.* A separability idempotent for remains a separability idempotent for any quotient of .

**Corollary:** Two $k$-algebras $A, B$ are separable over $k$ if and only if $A \times B$ is separable over $k$.

*Proof.* In one direction, if $A, B$ are separable, then so is $A \times B$. In the other, if $A \times B$ is separable, then $A, B$ are quotients of it, hence are also separable.

**Lemma:** Separability is Morita invariant: if and are Morita equivalent over , then is separable over iff is.

*Proof.* By the Eilenberg-Watts theorem, the category of -bimodules is equivalent (even monoidally) to the category of cocontinuous -linear endofunctors of . Among these, the bimodule itself represents the identity functor. Hence separability is equivalent to the condition that the identity is projective, and since this condition can be stated entirely in terms of it is Morita invariant.

**Lemma:** If is a commutative -algebra and is an -algebra such that 1) is separable over and 2) is separable over , then is separable over .

*Proof.* By hypothesis, the multiplication maps and split as bimodule maps, and we want to know that the same is true of the multiplication map . (Note that we are writing even though is commutative and so canonically isomorphic to its opposite; we don’t want to use this isomorphism.) If we write

then we can factor the multiplication map as a composite of two maps we know how to split, namely

where the first map applies the multiplication map between the two copies of and the second map is the multiplication map . (A similar argument can be used to show that the tensor product of separable algebras is separable.)

**Classification over a field**

We now classify separable -algebras when is a field.

**Lemma:** If is separable over a field , then is semisimple.

*Proof.* Any -bimodule describes a cocontinuous -linear functor as follows:

.

The bimodule represents the identity functor, while the free bimodule represents the functor

.

Consequently, the multiplication map represents the natural transformation

given by the action of . Now, if is separable, then this natural transformation splits, and in particular all of these action maps split. If is a field, then is a free -module, so is a free -module. This means that every -module is a retract of a free -module, hence is projective, and so is semisimple as desired.

**Corollary:** If is separable over a field , then is a finite product of matrix algebras over division algebras over , all of which must also be separable.

*Proof.* Since is semisimple, Artin-Wedderburn implies that for some division rings over . The lemmas we proved above imply that is separable iff are separable and that is separable iff is separable, so any such product is separable iff each is separable.

**Corollary:** If is separable over a field , then is **geometrically semisimple**: is semisimple for every field extension .

*Proof.* We know that separability is geometric (preserved by base change), so is separable over for every . If is a field extension, then the above lemma implies that is also semisimple.

**Corollary:** If is separable over a field , then it is finite-dimensional over .

*Proof.* The base change to the algebraic closure is semisimple, and a semisimple -algebra is necessarily a finite direct product of matrix algebras over (since there are no nontrivial division algebras over an algebraically closed field), hence finite-dimensional over . And .

**Corollary:** An algebra over a field is separable over iff is semisimple.

*Proof.* Above we showed that if is semisimple, then is separable over (since every module, and in particular , is projective over ). Conversely, we also showed that tensor products and opposites of separable algebras are separable, so if is separable over then so is , and so it must also be semisimple.

At this point we’ve reduced to looking for finite-dimensional division algebras over such that is semisimple. We’ll find them by inspecting their centers , which are finite extensions of .

**Lemma:** Let be algebras over a field , and let denote the center. Then .

*Proof.* Suppose is central. This is equivalent to the condition that

for all and

.

for all . Since we’re working over a field, we can assume WLOG that the and are linearly independent in and respectively, from which it follows that these conditions hold if and only if for all . Hence , which (again, since we’re working over a field) is naturally a subalgebra of , and so must be the entire thing.

**Lemma:** If is a separable algebra over a field , then so is its center .

*Proof.* We know that is separable iff is semisimple. By the above lemma, we have

and since the center of a semisimple algebra is (a finite product of fields, hence) semisimple, it follows that is semisimple, hence (by another lemma) that is separable.

**Lemma:** Let be a finite field extension of a field . The following conditions are equivalent:

- is a separable extension of (in the usual sense).
- is a separable -algebra (in the above sense).
- is geometrically semisimple: is semisimple for all field extensions .
- is semisimple (equivalently, is a finite product of copies of ).

*Proof.* : we proved this above from the primitive element theorem.

: follows from a lemma above.

: set .

: any generates a -subalgebra of isomorphic to where is the minimal polynomial of . If is semisimple, it is a finite product of copies of , hence so is every -subalgebra of it. And is a finite product of copies of iff is separable over .

**Corollary:** If a division algebra over a field is separable over , then is finite-dimensional over , and its center is a separable extension of (in the usual sense).

This necessary condition in fact turns out to be sufficient. We need one more lemma to prove this, which is the following.

**Theorem:** Let be a **central simple algebra** over a field : that is, is a finite-dimensional simple -algebra with center . Then , where .

*Proof.* naturally acts on ( acting from the left, acting from the right). Simplicity of means that has no nontrivial two-sided ideals, or equivalently has no nontrivial -submodules, hence is simple as an -module. Its endomorphism ring is the center , and as a module over this endomorphism ring, . Hence the natural action of on gives a map

and we want this map to be a bijection. But it is a surjection by the Jacobson density theorem, hence a bijection since both sides have dimension over .

**Corollary:** A central simple algebra over a field is separable over .

**Corollary:** A finite-dimensional division algebra over a field whose center is a finite separable extension of is separable over .

*Proof.* We now know that is separable over , and that is separable over . By a lemma, it follows that is separable over .

**Corollary:** The separable algebras over a field are precisely the finite products of matrix algebras over finite-dimensional division algebras over whose centers are separable extensions of .

Over a perfect field, the last condition is automatic, so this just says that the separable algebras over are precisely the finite-dimensional semisimple -algebras.


Let $C$ be a cocommutative coalgebra over a commutative ring $k$. If we want to make sense of $C$ as defining an algebro-geometric object, it needs to have a functor of points on commutative $k$-algebras. Here it is:

$$R \mapsto \{ c \in C \otimes_k R : \Delta(c) = c \otimes c, \ \varepsilon(c) = 1 \}.$$

In words, the functor of points of a cocommutative coalgebra $C$ sends a commutative $k$-algebra $R$ to the set of setlike elements of $C \otimes_k R$. In the rest of this post we’ll work through some examples.

**Sets**

Recall that if $X$ is a set then $k[X]$ is a cocommutative coalgebra with comultiplication coming from the diagonal $X \to X \times X$. More explicitly, the comultiplication is determined by the condition that $\Delta(x) = x \otimes x$ for all $x \in X$.

The functor of points of this coalgebra sends a commutative $k$-algebra $R$ to the set of setlike elements of $k[X] \otimes_k R \cong R[X]$, and as we computed before, these are precisely the elements of the form $\sum_{x \in X} e_x x$ where

$$e_x e_y = \delta_{xy} e_x$$

and $\sum_{x \in X} e_x = 1$, or equivalently $\{ e_x \}$ is a complete orthogonal set of idempotents in $R$. Together, the $e_x$ determine a direct product decomposition

$$R \cong \prod_{x \in X} e_x R$$

which geometrically corresponds to a decomposition of $\text{Spec } R$ into disjoint components $\text{Spec } e_x R$. As mentioned previously, the data of such a decomposition is equivalent to the data of a continuous function from the Pierce spectrum of $R$ to $X$.

In other words, the set of $R$-points consists of “locally constant functions from $\text{Spec } R$ to $X$.”
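For a small concrete case, we can enumerate these setlike elements by brute force. The sketch below assumes $k = \mathbb{Z}$, takes $R = \mathbb{Z}/m$ and a finite set $X$ with a given number of points, and lists all tuples $(e_x)$ in $R$ satisfying $e_x e_y = \delta_{xy} e_x$ and $\sum_x e_x = 1$; the function name and the choice of rings are illustrative only.

```python
from itertools import product

def setlike_elements(modulus, num_points):
    """All tuples (e_x) in (Z/modulus)^num_points with e_x e_y = delta_xy e_x and sum of e_x = 1."""
    found = []
    for coeffs in product(range(modulus), repeat=num_points):
        orthogonal = all(
            (coeffs[x] * coeffs[y]) % modulus == (coeffs[x] if x == y else 0)
            for x in range(num_points)
            for y in range(num_points)
        )
        if orthogonal and sum(coeffs) % modulus == 1 % modulus:
            found.append(coeffs)
    return found

if __name__ == "__main__":
    print(setlike_elements(6, 2))
    print(len(setlike_elements(30, 2)))
```

For $R = \mathbb{Z}/6 \cong \mathbb{Z}/2 \times \mathbb{Z}/3$ and a two-point set $X$, the spectrum has two connected components, so we expect $2^2 = 4$ locally constant functions to $X$, and the enumeration should find exactly four setlike elements.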

We can also equip $X$ with a group structure, and then $k[X]$, with the usual Hopf algebra structure, has a functor of points sending a commutative $k$-algebra $R$ to the group of continuous functions from the Pierce spectrum of $R$ to $X$, with pointwise product.

**Finite-dimensional algebras**

Now we restrict to the case that is a field.

Let be a finite-dimensional commutative -algebra. Then the linear dual acquires a natural coalgebra structure given by dualizing the algebra structure on . (We don’t need commutativity to say this.) More explicitly, if is an element of , then the comultiplication is

and the counit is

.

On the other hand,

.

We conclude the following.

**Lemma:** A linear functional is setlike if and only if for all and ; in other words, if and only if is a morphism of -algebras.

More generally, because is a finite-dimensional -vector space, if is any commutative -algebra then the natural map

is an isomorphism. We can check that it’s even an isomorphism of coalgebras, and exactly the same computation as above shows the following.

**Corollary:** An element of is setlike if and only if the corresponding element of is a morphism of -algebras.

Hence the functor of points of as a coalgebra is precisely the functor of points of as an algebra: setlike elements of correspond to morphisms of affine schemes over .

The dual map induces an equivalence of categories between finite-dimensional commutative algebras and finite-dimensional cocommutative coalgebras over , so we can learn something about the latter by learning something about the former. Every finite-dimensional commutative algebra over a field is in particular Artinian, and so factors as a finite product of Artinian local rings. The nilradical of such a ring coincides with its Jacobson radical, and the quotient is a finite-dimensional commutative semisimple -algebra, hence factors as a finite product of finite field extensions of .

Hence, up to taking finite extensions, looks like a finite set of points together with some “nilpotent fuzz.” looks like functions on this and looks like distributions; both are equally sensitive to the “nilpotent fuzz,” as we saw previously in the special case of primitive elements.

**Infinite-dimensional algebras**

Again let be a field. Let be a commutative -algebra, not necessarily finite-dimensional. Then it is no longer true that we can put a coalgebra structure on : when we try dualizing the multiplication, the map goes in the wrong direction to get a comultiplication.

Intuitively, the problem is that because we’re using the algebraic tensor product to define coalgebras, the comultiplication can only output a sum of finitely many tensors, and so has trouble dealing with distributions that are not “compactly supported.”

However, it is possible to rescue this construction as follows. If is a commutative -algebra, define its **finite dual**

to consist of all linear functionals factoring through a finite quotient of (as a -algebra). Geometrically, these are the distributions with “finite support,” and they do in fact have a comultiplication, as follows. If factors through a finite quotient , then

factors through

and the quotient map dualizes to a map , giving us an element of coming from , and hence giving an element of . This is our comultiplication. The counit is as usual; this poses no problems.

The result we showed in the finite-dimensional case above shows the following here.

**Theorem:** Let be a commutative -algebra and let be its finite dual. Then the setlike elements of can naturally be identified with the -algebra homomorphisms which factor through a finite quotient of .

Geometrically, this says that the functor of points of sends an affine scheme to maps from to the spectrum of the **profinite completion**

of . In other words, itself is the coalgebra of distributions on the profinite completion.

*Example.* Let , so that is the affine line. The distributions on the affine line with finite support, or equivalently the profinite completion of , can be very explicitly classified. By the Chinese remainder theorem, the finite quotients of take the form

where the are irreducible polynomials over . This is a finite product, hence a finite direct sum, of vector spaces, and so any linear functional on it breaks up as a direct sum of linear functionals on each piece, so we can restrict attention to linear functionals on (distributions “supported on “) without loss of generality.

In the simplest case, is a linear polynomial . Then the linear dual of has a basis consisting of taking each of the first terms of the Taylor series expansion of a polynomial in centered at : these are (up to the issue of dividing by various factors if has positive characteristic) the derivatives of the Dirac delta at .

In the general case we can understand what’s happening using Galois descent. After passing to a suitable field extension of , namely the splitting field of , the quotient breaks up further into linear factors. In the case that is Galois, linear functionals on can be interpreted as -invariant distributions on . Geometrically we should think of a finite set of “fuzzy” points acted on by the Galois group; examples of Galois-invariant distributions on this include the sum of Dirac deltas at each point, or the sum of derivatives of Dirac deltas at each point. If isn’t Galois (meaning that is inseparable), there is actually extra “fuzziness” that could be hidden over and only becomes visible over .

*Subexample.* Let and consider the quotient of . After passing to the Galois extension , this quotient becomes , and it’s clear that the dual space has a natural basis given by two Dirac deltas, one at and one at . The corresponding linear functionals are just evaluation at these two points.

Unfortunately, these Dirac deltas don’t directly make sense over . Instead, there are two Galois-invariant linear combinations that do: we can take

which, up to a factor of , takes the real part of , and

which, again up to a factor of , takes the imaginary part.
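The two Galois-invariant combinations can be checked numerically. The sketch below assumes the quotient in question is R[x]/(x^2 + 1), so that the two Dirac deltas are evaluation at i and -i after extending scalars to the complex numbers. The function names and the normalizations by 2 and 2i are choices made here for illustration.

```python
# Assumption: the quotient is R[x]/(x^2 + 1), and the two Dirac deltas are
# evaluation at i and -i after base change to C.

def evaluate(coeffs, z):
    """Evaluate sum(coeffs[k] * z**k) at z using Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * z + c
    return result

def real_part_functional(coeffs):
    """(delta_i + delta_{-i}) / 2 applied to a polynomial with real coefficients."""
    return (evaluate(coeffs, 1j) + evaluate(coeffs, -1j)) / 2

def imaginary_part_functional(coeffs):
    """(delta_i - delta_{-i}) / (2i) applied to a polynomial with real coefficients."""
    return (evaluate(coeffs, 1j) - evaluate(coeffs, -1j)) / 2j

# f(x) = 3 + 2x + 5x^2 reduces to -2 + 2x modulo x^2 + 1, so f(i) = -2 + 2i.
f = [3.0, 2.0, 5.0]
print(real_part_functional(f))       # real part of f(i)
print(imaginary_part_functional(f))  # imaginary part of f(i)

# Both functionals vanish on x^2 + 1, so they descend to the quotient.
print(real_part_functional([1.0, 0.0, 1.0]))
print(imaginary_part_functional([1.0, 0.0, 1.0]))
```

Both functionals take real values on polynomials with real coefficients, which is what it means for them to make sense over the real numbers.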

**Cartier duality**

We mostly restricted to the case of a field above because over a field duality behaves in the following very nice way.

**Theorem:** The functor is a contravariant equivalence of symmetric monoidal categories between the symmetric monoidal category of finite-dimensional -vector spaces and itself.

Because this equivalence is symmetric monoidal, it induces various further equivalences.

**Corollary:** The functor is a contravariant equivalence of categories between finite-dimensional -algebras and finite-dimensional -coalgebras, and between finite-dimensional commutative -algebras and finite-dimensional cocommutative -coalgebras.

These remain symmetric monoidal equivalences if we equip everything with the usual tensor product (which for commutative algebras is the coproduct and for cocommutative coalgebras is the product, so in this case we get that the equivalence is symmetric monoidal for free). We can even ask for both an algebra and a coalgebra structure at once, which gives us this.

**Corollary (Cartier duality):** The functor is a contravariant equivalence of categories between finite-dimensional commutative and cocommutative Hopf algebras over and itself.

Finite-dimensional commutative and cocommutative Hopf algebras over are the analogues of finite abelian groups in the world of algebraic geometry over : more precisely, they are finite (in the sense that they are Spec of a finite-dimensional algebra) commutative (because “abelian” means something else in algebraic geometry) group schemes (meaning Spec of a commutative Hopf algebra).

*Example.* Suppose is a finite abelian group, and is its group algebra, regarded as a Hopf algebra in the usual way (so cocommutative for general reasons, and commutative because is abelian). Then the Cartier dual of is the function algebra , regarded as a Hopf algebra in the usual way (commutative for general reasons, and cocommutative because is abelian).

*Subexample.* If is the cyclic group of order , then , as a group scheme, has functor of points

sending a commutative -algebra to the group of roots of unity in . This group scheme has its own name in algebraic geometry: it’s called . On the other hand, its Cartier dual is the “constant” group scheme with value : it has functor of points

sending a commutative -algebra to, as above, the group of locally constant functions from to . This is the same functor of points we get if we think about as a coalgebra, and its name is just .
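To make the roots-of-unity functor concrete, here is a small sketch that evaluates it on one test ring. The choice of ring (the integers mod a prime p, viewed as an algebra over the field with p elements) and the function names are assumptions made for illustration.

```python
def roots_of_unity(n, p):
    """The x in Z/p with x**n == 1, i.e. the n-th roots of unity in Z/p."""
    return [x for x in range(p) if pow(x, n, p) == 1 % p]

def is_closed_under_multiplication(elements, p):
    """Check that the product (mod p) of any two listed elements is listed."""
    s = set(elements)
    return all((a * b) % p in s for a in s for b in s)

p = 7
for n in (2, 3, 6, 7):
    roots = roots_of_unity(n, p)
    print(n, roots, is_closed_under_multiplication(roots, p))
```

For n = 7 and p = 7 only the trivial root appears, even though the corresponding algebra is 7-dimensional: in characteristic p the scheme of p-th roots of unity is a single point with nilpotent fuzz, which field-valued points cannot see.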

Cartier duality can be described as switching between two possible functors of points for a finite-dimensional commutative and cocommutative Hopf algebra as above: one based on thinking of as a group object in finite schemes, and one based on thinking of itself as a group object in finite-dimensional cocommutative coalgebras. In the second description, the functor of points

sends a commutative -algebra to the group (really a group now, since we are in a Hopf algebra) of setlike elements of .

As it turns out, it’s possible to give a description of what this functor is doing without explicitly thinking about coalgebras or Cartier duality. Namely, we saw above that the coalgebra of distributions on a point represents the setlike elements functor on coalgebras. We can ask what represents the setlike elements functor on Hopf algebras, and it’s not hard to see that the answer is the Hopf algebra whose underlying algebra is

where the comultiplication is , the counit is , and the antipode is . This Hopf algebra is commutative, and thinking of it as a group scheme, it is a very famous one, the **multiplicative group scheme** , whose functor of points

sends a commutative -algebra to its group of units. Morphisms of Hopf algebras correspond to setlike elements of , and if is commutative these in addition correspond to morphisms of affine group schemes. A morphism from an affine group scheme to the multiplicative group is called a **character**: it is the correct notion of a 1-dimensional representation in the world of group schemes.

Cartier duality can then be interpreted as follows: if is a finite commutative group scheme, then “characters of ” forms another finite commutative group scheme, whose functor of points

sends a commutative -algebra to the group (under pointwise multiplication) of characters of the base change . But we saw earlier that this is nothing more than the set of setlike elements of , or equivalently the set of homomorphisms , and so this is precisely the functor of points of the Cartier dual as previously defined.

Once Cartier duality is described in terms of characters, it seems a little more surprising: since the dual of the dual of a finite-dimensional vector space is just again, we conclude that taking characters of the characters of a finite commutative group scheme gets us the same group scheme again. This should be compared to Pontryagin duality for finite abelian groups, which says the same thing, where “characters” means homomorphisms , and which can be interpreted as Cartier duality for constant group schemes over .
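The Pontryagin side of this comparison is easy to check by brute force for a cyclic group. The sketch below is an illustration with hypothetical function names: a character of Z/n is stored as a tuple of exponents, where the tuple chi stands for the map sending x to exp(2*pi*i*chi[x]/n).

```python
from itertools import product

def characters(n):
    """All homomorphisms from Z/n to the n-th roots of unity, as exponent tuples."""
    return [chi for chi in product(range(n), repeat=n)
            if all(chi[(x + y) % n] == (chi[x] + chi[y]) % n
                   for x in range(n) for y in range(n))]

def evaluation_maps(n):
    """For each x in Z/n, the function on characters given by chi -> chi(x)."""
    dual = characters(n)
    return [tuple(chi[x] for chi in dual) for x in range(n)]

n = 6
print(len(characters(n)))            # number of characters of Z/n
print(len(set(evaluation_maps(n))))  # number of distinct evaluation maps
```

Every complex character of Z/n lands in the n-th roots of unity, so nothing is lost by restricting the target. The dual group has n elements, and the n evaluation maps are pairwise distinct characters of it, so evaluation identifies Z/n with its double dual.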

Less commonly, mathematicians sometimes think about coalgebras. In general it seems that mathematicians find these harder to think about, although it’s sometimes unavoidable, e.g. when discussing Hopf algebras. The goal of this post is to describe how to begin thinking about cocommutative coalgebras as consisting of distributions of some sort on spaces of some sort.

**Functions vs. distributions**

Distributions are typically defined as being duals (spaces of continuous linear functionals) to topological vector spaces of functions. Loosely speaking, a distribution is something you can integrate a class of functions against; it’s a kind of generalized measure.

For example, the dual of the space of continuous functions on a compact Hausdorff space (with the sup norm topology) is a space of (signed) Radon measures on . A class of examples closer to the examples we’ll be considering, although it involves more technicalities than we’ll need, is the dual of the space of smooth functions on a smooth manifold (with the Fréchet topology), which can be thought of as “distributions with compact support” on .

The simplest examples of distributions are the **Dirac delta** distributions, definable in great generality: as linear functionals on spaces of functions they are precisely the evaluation functionals

.

When we take duals to spaces of smooth functions, as opposed to continuous functions, we get more interesting distributions “supported at a point” given by taking derivatives. For example, on , at every point there are linear functionals on given by

.

These distributions are named using derivative notation because they are the distributional derivatives of .

The two most important things to keep in mind about the difference between functions and distributions are the following.

- Functions pull back, while distributions push forward.
- Functions form commutative algebras, while distributions form cocommutative coalgebras.

These points are closely related: the multiplication on functions, resp. the comultiplication on distributions, can be described using pullback, resp. pushforward, along the diagonal map

.

Namely, because we can multiply functions on by functions on to get functions on , for any reasonable notion of functions we get a dual map

giving the multiplication on functions.

The situation for distributions is similar but less straightforward: if is any reasonable notion of distributions we get a map

To get a comultiplication from this we’d like for there to be an isomorphism, or at least a map, from to . Unfortunately, the map that exists usually goes in the other direction, and usually will not be an isomorphism unless is some kind of completed tensor product.

Nevertheless, in some examples, and/or with the right modified notion of tensor product, the required maps do exist and we do get a comultiplication on distributions.

In addition to comultiplication, coalgebras also need a counit. In the case of distributions on spaces this counit comes from pushing forward along the unique map , getting a map

which, if we think of distributions as generalized measures, computes the “total measure” of a measure.

**The diagonal**

The appearance of the diagonal map above can be put into a more abstract context. Recall that in any category with finite products, every object is canonically a cocommutative comonoid in a unique way, via the diagonal map

.

A typical example for us will be , and in general we’ll want to think of as a category of “spaces.” We can get both commutative monoids and cocommutative comonoids out of diagonal maps as follows.

If is a contravariant functor out of (describing a notion of “functions”) to a symmetric monoidal category (typically something like ) which is lax symmetric monoidal in the sense that it is equipped with natural transformations

compatible with symmetries (plus some unit stuff), then pulling back along the diagonal endows each with the structure of a commutative monoid in .

*Example.* If , then we can take to consist of all functions , where is the underlying field. If , then is even symmetric monoidal in the sense that the natural transformations above are isomorphisms.

Dually, if is a covariant functor out of (describing a notion of “distributions”) to a symmetric monoidal category which is oplax symmetric monoidal in the sense that it is equipped with natural transformations

compatible with symmetries (plus unit stuff as above), then pushing forward along the diagonal endows each with the structure of a cocommutative comonoid in .

*Example.* If , then we can take to consist of the free -vector space on , where is the underlying field. Without any finiteness hypotheses, this is even symmetric monoidal.

**Sets as coalgebras**

Let’s slightly generalize the construction above. Let be a commutative ring (in fact we could take a commutative semiring here). Then we have a free -module functor from sets to -modules. The above construction shows that this functor can be regarded as taking values in cocommutative coalgebras over , so in fact we have a functor

.

At this point it will be convenient to introduce the following definition.

**Definition:** An element of a coalgebra (where is the comultiplication and is the counit) is *setlike* if and . If is a coalgebra, we’ll write for its set of setlike elements.

(The more common term is *grouplike*, but that term is really only appropriate to the case of Hopf algebras, since in that case the setlike elements form a group. Here the setlike elements only form a set.)

Now we can describe , as a coalgebra, as being freely generated by setlike elements. Thinking in terms of distributions, setlike elements correspond to Dirac distributions, and so it’s reasonable to think of them as the “points” of a coalgebra, or more precisely of a hypothetical space on which the coalgebra is distributions.

**Proposition:** The functor from sets to coalgebras above has a right adjoint sending a coalgebra to its set of setlike elements.

*Proof.* We want to show that if is a set and is a coalgebra, we have a natural bijection

.

But this is clear from the observation that is a free -module on setlike elements, from which it follows that a coalgebra homomorphism is uniquely and freely determined by what it does to each element . These elements must be sent to some setlike element of and can be sent to any such element.

In particular, the functor is represented by the coalgebra (of “distributions on a point”).

**Lemma:** Suppose has no nontrivial idempotents (that is, it is a connected ring). Then the setlike elements of are precisely the elements : that is, the unit of the above adjunction is an isomorphism.

*Proof.* Suppose is a setlike element. Then

must be equal to

which happens if and only if if and otherwise. The counit condition is

.

Altogether, the condition that is setlike is precisely the condition that the elements are a complete set of orthogonal idempotents in . Since has no nontrivial idempotents by assumption, each is equal to either or . Since they are orthogonal (meaning if ), at most one of them is equal to . And since they sum to , exactly one of them is equal to . Hence our setlike element is some .

The correct statement without the hypothesis that is connected, which is not hard to extract from the above argument, is that the setlike elements of in general correspond to functions from the set of connected components of to with finite image, or equivalently to continuous functions from the Pierce spectrum to .
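Both the lemma and the statement for non-connected rings can be checked by brute force over small rings. The sketch below is an illustration only: the base rings Z/5 and Z/6 and the function name are choices made here. It enumerates all elements of the free module on a finite set and keeps those satisfying the two setlike conditions.

```python
from itertools import product

def setlike_elements(m, size):
    """Setlike elements of the free (Z/m)-module on a set with `size` elements.

    An element sum_s c_s * s has comultiplication sum_s c_s * (s tensor s),
    so it is setlike iff c_s * c_t equals c_s when s == t and 0 otherwise,
    and the counit condition sum_s c_s == 1 holds.
    """
    found = []
    for c in product(range(m), repeat=size):
        tensor_ok = all((c[s] * c[t]) % m == (c[s] if s == t else 0)
                        for s in range(size) for t in range(size))
        counit_ok = sum(c) % m == 1 % m
        if tensor_ok and counit_ok:
            found.append(c)
    return found

print(setlike_elements(5, 2))  # Z/5 is connected
print(setlike_elements(6, 2))  # Z/6 = Z/2 x Z/3 has the idempotents 3 and 4
print(len(setlike_elements(6, 3)))
```

Over Z/5 only the two basis elements survive. Over Z/6 there are four setlike elements on a two-element set and nine on a three-element set, matching the count of functions from the two connected components of Spec Z/6 to the set.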

**Corollary:** Let have no nontrivial idempotents. Then the functor is an equivalence of categories from sets to cocommutative coalgebras over which are free on setlike elements.

In other words, as a slogan, sets are coalgebras of Dirac deltas.

*Proof.* We showed that the unit of the adjunction between sets and coalgebras is an isomorphism on sets. In general, an adjunction restricts to an equivalence of categories between the subcategories on which the unit resp. the counit of the adjunction are isomorphisms. So it remains to determine for which coalgebras the counit of the adjunction is an isomorphism. Explicitly, the counit is the natural map

from the free -module on the setlike elements of a coalgebra to . If this is an isomorphism, then must in particular be free on some setlike elements. Conversely, if is free on setlike elements, then the lemma above shows that naturally, so that is an isomorphism.

This equivalence induces an equivalence between groups and cocommutative Hopf algebras over which are free (as modules) on setlike (here “grouplike”) elements.

**Beyond Dirac deltas**

We’ve said a lot about setlike elements of coalgebras, or equivalently about Dirac delta distributions. But coalgebras have lots of other kinds of elements in general. For example, if is a Lie algebra, its universal enveloping algebra has a natural comultiplication given by extending

where ; that is, each is primitive. In a geometric story about distributions, where do the primitive elements come from?

The first observation is that in an arbitrary coalgebra there isn’t an element called , so coalgebras don’t have a notion of primitive element. What makes the element special is that it is in fact the unique setlike element: it satisfies and is the only element of with this property. So whatever primitivity means, geometrically it has something to do with a fixed setlike element, or in distributional terms with a fixed Dirac delta.

**Definition:** Let be a setlike element of a coalgebra . An element is *primitive with respect to* if

and .

We can get a big hint about what this definition means by going back to the example of distributions coming from taking the dual of the space of smooth functions . Consider the distribution

.

How does comultiplication act on this distribution? To answer that question we need to see what this distribution does to a product of functions (since this describes the action of the distribution on at least a dense subspace of the pullback of functions along the diagonal map ). The answer, using the product rule, is that

.

This gives that

and tells us that primitivity is a reflection of the Leibniz rule for derivations: saying that an element is primitive with respect to a setlike element means that if is a “point,” or more precisely a Dirac delta at a point, then is a “directional derivative” in a tangent direction at that point. Similarly, computing the pushforward to a point means differentiating constant functions (which are the functions pulled back from a point), which gives zero.
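The product-rule computation can be checked on polynomials, which is enough to see the pattern. In the sketch below the function names are hypothetical, polynomials are stored as coefficient lists, and the derivative functional is taken to be f -> f'(a) with no minus sign; the sign convention does not affect the shape of the identity.

```python
def evaluate(coeffs, a):
    """Evaluate sum(coeffs[k] * a**k) using Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * a + c
    return result

def derivative(coeffs):
    """Coefficient list of the derivative; a constant gives the empty list."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def multiply(f, g):
    """Coefficient list of the product of two polynomials."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

def delta(a):
    """The Dirac delta at a: f -> f(a)."""
    return lambda f: evaluate(f, a)

def delta_prime(a):
    """Derivative functional at a: f -> f'(a) (sign convention assumed)."""
    return lambda f: evaluate(derivative(f), a)

a = 2
f, g = [1, 2, 3], [4, 0, -1, 2]
lhs = delta_prime(a)(multiply(f, g))
rhs = delta_prime(a)(f) * delta(a)(g) + delta(a)(f) * delta_prime(a)(g)
print(lhs, rhs)               # the two sides of the Leibniz rule agree
print(delta_prime(a)([5]))    # pushing forward to a point: constants give 0
```

The first line is the statement that the comultiplication of the derivative functional is a sum of two tensors, each pairing the derivative with the Dirac delta at the same point. The second is the counit condition.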

More formally, we can say the following.

**Theorem:** Let be a setlike element of a cocommutative coalgebra over , and let be an arbitrary element. Then is primitive with respect to iff is a setlike element of .

*Proof.* Computation.

Intuitively, is primitive with respect to iff both and are “points,” where the indicates that they are “infinitesimally close” points.
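For readers who want the computation spelled out, here is a sketch in notation assumed for this purpose: $g$ is the setlike element, $x$ the arbitrary element, $\epsilon$ a square-zero parameter (so the base ring is $k[\epsilon]/(\epsilon^2)$), $\Delta$ the comultiplication and $\varepsilon$ the counit, both extended linearly over $k[\epsilon]/(\epsilon^2)$.

```latex
\begin{align*}
\Delta(g + \epsilon x)
  &= \Delta(g) + \epsilon\,\Delta(x)
   = g \otimes g + \epsilon\,\Delta(x), \\
(g + \epsilon x) \otimes (g + \epsilon x)
  &= g \otimes g + \epsilon\,(x \otimes g + g \otimes x) + \epsilon^2\,(x \otimes x) \\
  &= g \otimes g + \epsilon\,(x \otimes g + g \otimes x), \\
\varepsilon(g + \epsilon x)
  &= \varepsilon(g) + \epsilon\,\varepsilon(x)
   = 1 + \epsilon\,\varepsilon(x).
\end{align*}
```

Comparing coefficients of $\epsilon$, the element $g + \epsilon x$ is setlike exactly when $\Delta(x) = x \otimes g + g \otimes x$ and $\varepsilon(x) = 0$, which are the two conditions for $x$ to be primitive with respect to $g$.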

The fact that , as a Hopf algebra, is generated by primitive elements can be interpreted geometrically as saying that it corresponds to distributions “supported at a point.” In fact it is possible to describe as distributions supported at the identity on a Lie group with Lie algebra .

The goal of this post is to derive the principle of maximum entropy in the special case of probability distributions over finite sets from

- Bayes’ theorem and
- the principle of indifference: assign probability 1/n to each of n possible outcomes if you have no additional knowledge. (The slogan in statistical mechanics is “all microstates are equally likely.”)

We’ll do this by deriving an arguably more fundamental principle of maximum relative entropy using only Bayes’ theorem.

**A better way to state Bayes’ theorem**

Suppose you have a set of hypotheses about something, exactly one of which can be true, and some prior probabilities that these hypotheses are true (which therefore sum to 1). Then you see some evidence . (Here is a simultaneous definition of both hypotheses and evidence: hypotheses are things that assert how likely or unlikely evidence is. That is, what it means to give evidence about some hypotheses is that there ought to be some conditional probabilities , the likelihoods, describing how likely it is that you see evidence conditional on hypothesis .)

Bayes’ theorem in this setting is then usually stated as follows: you should now have updated posterior probabilities that your hypotheses are true conditional on your evidence, and they should be given by

.

That is, each prior probability gets multiplied by , which describes how much more likely thinks the evidence is than before. You might be concerned that requires the introduction of extra information, but in fact it must be given by

by conditioning on each in turn, so it’s already determined by the priors and the likelihoods. (This is if the are parameterized by a discrete parameter ; in general this sum should be replaced by an integral.)

In practice this statement of Bayes’ theorem seems to be annoyingly easy to forget, at least for me. Here is a better statement. The idea is to think of as just a normalization constant. Hence the revised statement is

.

That is, the posterior probability is proportional to the prior probability times the likelihood, where the proportionality constant is uniquely determined by the requirement that the probabilities sum to 1.

Intuitively: after seeing some evidence, your confidence in a hypothesis gets multiplied by how well the hypothesis predicted the evidence, then normalized. Now you can take your posteriors to be your new priors in preparation for seeing some more evidence. This is a **Bayesian update**.
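In symbols, with notation assumed here (hypotheses $H_i$, evidence $E$), the proportional form reads $P(H_i \mid E) \propto P(E \mid H_i)\,P(H_i)$. A minimal sketch of one update step follows; the function name and the two-coin example are illustrative choices.

```python
def bayesian_update(priors, likelihoods):
    """Multiply each prior weight by its likelihood, then normalize.

    The priors only need to be nonnegative weights; they do not have to
    sum to 1, because any overall scale cancels in the normalization.
    """
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("the evidence has probability zero under every hypothesis")
    return [u / total for u in unnormalized]

# Two hypotheses about a coin: it is fair, or it lands heads 90% of the time.
heads_probability = [0.5, 0.9]
beliefs = [1.0, 1.0]  # equal weights, deliberately left unnormalized

# Observe heads three times, feeding each posterior back in as the next prior.
for _ in range(3):
    beliefs = bayesian_update(beliefs, heads_probability)
    print(beliefs)
```

Because only the final normalization depends on the overall scale, the same function accepts priors given as measures up to scale.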

**Aside: measures up to scale and improper priors**

This statement of Bayes’ theorem suggests a slight reformulation of what we mean by a probability measure: a probability measure is the same thing as a measure with nonzero total measure, up to scaling by positive reals. One reason to like this description is that it naturally incorporates improper priors, which correspond to prior probabilities with possibly infinite total measure, up to scaling by positive reals. The point is that after a Bayesian update an improper prior may become proper again. For example, there’s an improper prior assigning measure to every positive integer , which allows us to talk about hypotheses indexed by the positive integers and with a prior which makes all of them equally likely.

Improper priors may seem obviously bad because they don’t assign probabilities to things: in order to assign a probability you need to normalize by the total measure, which is infinite. However, with an improper prior it is still meaningful to make comparisons between probabilities: you can still meaningfully say that is larger than , or exactly times , since this comparison is invariant under scaling by positive reals.

There’s a somewhat philosophical argument that when performing Bayesian reasoning, only comparisons between probabilities are meaningful anyway: in order to know the probability, in the absolute sense, of something, you need to be absolutely sure you’ve written down every possible hypothesis (in order to ensure that exactly one of them is true). If you leave out the true hypothesis, then you might end up being more and more sure of an arbitrarily bad hypothesis because the true hypothesis wasn’t included in your calculations. In other words, computing the normalization constant in the usual statement of Bayes’ theorem is “global” in that it requires information about all of the , but computing is “local” in that it only involves one at a time.

(And it’s not enough just to have a few hypotheses and then a catch-all hypothesis called “everything else,” because “everything else” is not a hypothesis in the sense that it does not assign likelihoods . A hypothesis has to make predictions.)

**The setup**

Back to maximum entropy. Imagine that you are repeatedly rolling an -sided die, and you don’t know what the various weights on the die are: that is, you don’t know the true probabilities that the face of the die will come up.

However, you have some hypotheses about these probabilities. Your hypotheses are parameterized by a parameter , which for the sake of concreteness we’ll take to be a real number or a tuple of real numbers, but which could in principle be anything. Your hypotheses assign probability to the face coming up. You also have some prior over your hypotheses, which we’ll write as a probability density function . Hence and

while the are normalized so that

.

*Example.* If , we might imagine that we’re flipping a coin, with the probability that we flip tails and the probability that we flip heads. Our hypotheses might take the form where , and our prior might be the uniform prior: each is equally likely. Hence our probability density is .

Now suppose you roll the die times. What happens to your beliefs under Bayesian updating in the limit ?

**The principle of maximum relative entropy**

Suppose you see the face come up times (so the are nonnegative and ; they describe the observed relative frequencies of the various faces coming up, and altogether describe the empirical probability distribution). Hypothesis predicts that this happens with probability

.

Let’s see how this function behaves as . Taking the log, and using Stirling’s approximation in the form

we get

.

Various terms cancel here due to the fact that . At the end of the day we get

.

This is the first appearance of the function , the **entropy** of , regarded as a probability distribution over faces. This is perhaps the most concrete and least mysterious way of introducing entropy: it’s a concise way of summarizing the asymptotic behavior of the multinomial distribution as . Already we see that the entropy being larger corresponds to the counts being more likely, in a very serious way.
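The asymptotic claim is easy to test numerically. The sketch below uses notation assumed here (N rolls, frequencies p_i, entropy taken as minus the sum of p_i log p_i with natural logarithms) and compares the logarithm of the multinomial coefficient, divided by N, with the entropy.

```python
import math

def log_multinomial(counts):
    """Natural log of the multinomial coefficient N! / (c_1! ... c_n!)."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def entropy(p):
    """Shannon entropy -sum p_i log p_i, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.5, 0.3, 0.2]
for n in (10, 1000, 100000):
    counts = [round(n * x) for x in p]
    print(n, log_multinomial(counts) / n, entropy(p))
```

The per-roll value approaches the entropy from below as N grows. The gap is of order log(N)/N, which is the part of Stirling's approximation that the leading-order calculation discards.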

But there’s a second term in the likelihood, so let’s compute the logarithm of that too. This gives

.

Thus the logarithm of the likelihood, or log-likelihood, is

.

Now the function that appears is the negative of the **Kullback-Leibler divergence**. We’ll call it the **relative entropy** (although this term is sometimes used for the KL divergence, not its negative) and denote it somewhat arbitrarily by .

Altogether, the posterior density is now proportional to

.

From here it’s not hard to see that the posterior density is overwhelmingly concentrated at the hypotheses that maximize relative entropy as , subject to the constraint that the prior density is positive. This is because all the other posterior densities are exponentially smaller in comparison, and as long as the prior density is positive, it doesn’t matter what its exact value is because it too is exponentially small in comparison to the main exponential term.

This calculation suggests that we can interpret the relative entropy as a measure of how well the hypothesis fits the evidence : the larger this number is, the better the fit. (A more common way to describe relative entropy is as a measure of how well a hypothesis fits the “truth.” Here our model for being told that is the “truth” is seeing it asymptotically as .)

Let’s wrap that conclusion up into a theorem.

**Theorem:** With hypotheses as above, as , Bayesian updates converge towards believing the hypothesis that maximizes the relative entropy subject to the constraint that the prior density is positive.

Now, suppose the true probabilities are . Then as we expect, by the law of large numbers, that the observed frequencies approach the true probabilities . If the true probabilities are among our hypotheses , we would hope, and it seems intuitively clear, that we’ll converge towards believing the true hypothesis. This requires showing the following.

**Theorem:** The relative entropy is nonpositive, and for fixed , it takes its maximum value iff .

*Proof.* This is more or less a computation with Lagrange multipliers. For fixed , we want to maximize

subject to the constraint . This constraint means that at a critical point of (whether a maximum, a minimum, or a saddle point), all of the partial derivatives should be equal. (Intuitively, we have a “budget” of probability to spend to increase , and as we spend more probability on one we necessarily must spend less probability on the others. The critical points are then the points where we can’t do any better by shifting our probability budget, meaning that the marginal value of each probability increase is equally good.)

We compute that

so setting all partial derivatives equal we conclude that must be proportional to , and the additional constraint gives for all .

At this critical point takes value . Now we need to show that this critical point is a maximum and not a minimum. Since it’s the unique critical point, it suffices to show that it’s a local maximum. So, consider a point in a small neighborhood of this critical point, where . To second order, we have

and hence

The linear term vanishes, and the quadratic term is negative definite as desired. (Strictly speaking we need to require that the are all positive, but if any of them happen to be zero then the corresponding value of can be safely ignored anyway, since it won’t figure in any of our computations.)
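As a sanity check on the theorem, the sketch below evaluates the relative entropy numerically. The notation is assumed here: p denotes the observed frequencies, q the probabilities assigned by a hypothesis, and the relative entropy is the sum of p_i log(q_i / p_i), the negative of the Kullback-Leibler divergence. It also compares the exact value near the critical point with the quadratic approximation, minus the sum of eps_i^2 / (2 p_i).

```python
import math
import random

def relative_entropy(p, q):
    """sum_i p_i * log(q_i / p_i), the negative of the KL divergence."""
    return sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(n, rng):
    """A random point in the interior of the probability simplex."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
worst = max(relative_entropy(random_distribution(4, rng),
                             random_distribution(4, rng))
            for _ in range(1000))
print(worst)  # largest value seen over random pairs; never positive

p = [0.5, 0.3, 0.2]
eps = [1e-3, -2e-3, 1e-3]  # perturbation with sum zero
q = [pi + ei for pi, ei in zip(p, eps)]
quadratic = -sum(ei * ei / (2 * pi) for pi, ei in zip(p, eps))
print(relative_entropy(p, q), quadratic)
```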

**Corollary:** With hypotheses as above, as , if the true hypothesis is among the hypotheses with positive density, Bayesian updates converge towards believing it. False hypotheses are disbelieved at an exponential rate with base the exponential of the relative entropy.

In other words, as , the Bayesian definition of probability converges to the frequentist definition of probability.

*Example.* Let’s return to the example, where we’re flipping a coin with an unknown bias , so that are the probabilities of flipping heads and tails respectively given bias , and our prior is uniform. Suppose that after trials we observe heads and tails, where . Then

.

(We can drop the binomial coefficient because it’s the same for all values of and so can be absorbed into our proportionality constant. We introduced it into the above computation because it becomes important later.)

This computation can be used to deduce the rule of succession, which asserts that at this point you should assign probability to heads coming up on the next coin flip. Note that as this converges to .

The posterior density can be written as

which takes its maximum value when by our results above, although in this case one-variable calculus suffices to prove this. Near this maximum value, Taylor expanding around shows that, for values of sufficiently close to , the posterior density is approximately a Gaussian centered at with standard deviation . Hence in order to be confident that the true bias lies in an interval of size with high probability we need to look at coin flips.
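The coin example can be explored on a grid. The sketch below rests on assumptions labeled here: theta is the probability of heads, the prior is uniform on [0, 1], the posterior is proportional to theta^h (1 - theta)^t for h heads and t tails, the rule of succession is taken in its usual form (h + 1)/(N + 2), and the Gaussian width is taken to be sqrt(p(1 - p)/N) with p = h/N, which is what the second derivative of the log posterior at its maximum gives.

```python
import math

def posterior_on_grid(heads, tails, points=20000):
    """Normalized posterior weights on midpoints of a uniform grid over (0, 1)."""
    thetas = [(k + 0.5) / points for k in range(points)]
    log_post = [heads * math.log(t) + tails * math.log(1 - t) for t in thetas]
    peak = max(log_post)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return thetas, [w / total for w in weights]

heads, tails = 70, 30
n = heads + tails
thetas, probs = posterior_on_grid(heads, tails)

mean = sum(t * w for t, w in zip(thetas, probs))
mode = thetas[max(range(len(probs)), key=probs.__getitem__)]
std = math.sqrt(sum((t - mean) ** 2 * w for t, w in zip(thetas, probs)))

print(mode, heads / n)                     # posterior peak vs. observed frequency
print(mean, (heads + 1) / (n + 2))         # posterior mean vs. rule of succession
print(std, math.sqrt(0.7 * 0.3 / n))       # posterior width vs. Gaussian estimate
```

The posterior mean is the probability assigned to heads on the next flip, which is why it matches the rule of succession rather than the observed frequency.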

**Maximum entropy**

We were supposed to get a criterion in terms of maximizing entropy, not relative entropy. What happened to that?

Now instead of knowing the relative frequencies , let’s assume that we only know that they satisfy some conditions. For example, for any function , where is a finite-dimensional real vector space, we might know the expected value

of with respect to the empirical probability distribution. In statistical mechanics a typical and important example is that we might know the average energy. We also might be observing a random walk, and while we don’t know how many steps the random walker took in a given direction (perhaps because they’re moving too fast for us to see), we might know where they ended up after steps, which tells us the average of all the steps the walker took.

(Strictly speaking, if we’re talking about the empirical distribution, where in particular each is necessarily rational, it’s too much to ask that any particular condition be exactly satisfied. We’d be happy to see that it’s asymptotically satisfied as , from which we’re concluding that our conditions are exactly satisfied for the true probabilities , or something like that. It seems there’s something subtle going on here and I am going to completely ignore it.)

Knowing the expected values of some functions is equivalent to knowing that the empirical distribution lies in some affine subspace of the probability simplex

.

However, more complicated constraints are possible. For example, suppose that and that we’re really rolling two independent dice with and sides, respectively, so that we can relabel the possible outcomes with pairs

.

Observing this is true means that, at least asymptotically as , we observe that we can write , where

is the empirical probability that the first die comes up , and similarly

is the empirical probability that the second die comes up . This is a nonlinear constraint: in fact it describes a collection of quadratic equations that the variables must satisfy. Imposing these equations turns out to be equivalent to imposing the simpler homogeneous quadratic equations

which we might recognize as the equations cutting out the image of the Segre embedding

.

The idea is to think of the probability simplex as sitting inside projective space ; then the restriction of the Segre embedding to probability simplices produces a map

describing how a probability distribution over the first die and a probability distribution over the second die gives rise to a joint probability distribution over both of them. More complicated variations of this example are considered in algebraic statistics.
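To make the two-dice example concrete, here is a small sketch using exact rational arithmetic. The names `joint_from_marginals` and `satisfies_segre_equations` are hypothetical. It builds a joint distribution as the outer product of two marginals and checks the homogeneous quadratic equations in the form "every 2-by-2 minor of the table of joint probabilities vanishes."

```python
from fractions import Fraction
from itertools import product


def joint_from_marginals(first, second):
    """Joint distribution of two independent dice: entry (i, j) is first[i] * second[j]."""
    return [[a * b for b in second] for a in first]


def satisfies_segre_equations(joint):
    """Check p[i][j] * p[k][l] == p[i][l] * p[k][j] for all index choices (all 2x2 minors vanish)."""
    rows, cols = range(len(joint)), range(len(joint[0]))
    return all(
        joint[i][j] * joint[k][l] == joint[i][l] * joint[k][j]
        for i, k in product(rows, rows)
        for j, l in product(cols, cols)
    )


if __name__ == "__main__":
    first = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
    second = [Fraction(1, 4), Fraction(3, 4)]
    independent = joint_from_marginals(first, second)
    correlated = [[Fraction(1, 2), Fraction(0)], [Fraction(0), Fraction(1, 2)]]
    print(satisfies_segre_equations(independent))  # independent dice pass
    print(satisfies_segre_equations(correlated))   # perfectly correlated dice fail
```

The correlated table is a perfectly good probability distribution on pairs, but it does not lie in the image of the product map, so at least one minor is nonzero.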

In any case, the game is that instead of knowing the empirical distribution we now only know some conditions it satisfies. Write the set of all distributions satisfying these conditions as . What happens as ? Hypothesis still predicts that empirical distribution occurs with probability

and hence it predicts that we observe that our conditions are satisfied with probability

(where is shorthand for the event ). Using our previous approximations, we can rewrite this as

which gives posterior densities

.

As before, we find that the posterior densities are overwhelmingly concentrated at the hypotheses that maximize relative entropy as (again subject to the constraint that ), but where is now allowed to run over all .

If our prior assigns nonzero density to every possible probability distribution in the probability simplex (for simplicity, we could take to be parameterized by the points of the probability simplex, to be the probability distribution corresponding to the point , and to be a constant, suitably normalized), then we know that relative entropy takes its maximum value when its two arguments are equal, so we can restrict our attention to the case that above, and we find that, asymptotically as , the posterior density is proportional to the prior density as long as satisfies the conditions, and otherwise.

This is unsurprising: we assumed that all we were told about the empirical distribution is that it satisfied some conditions, so the only change we make to our prior is that we condition on that.

We still haven’t gotten a characterization in terms of entropy, as opposed to relative entropy. This is where we are going to invoke the principle of indifference, which in this situation asserts that the prior we should have is the one concentrated entirely at the hypothesis

that the die rolls are being generated uniformly at random. Note that this means the posterior is *also* concentrated entirely at this hypothesis!

We now predict empirical probability distribution with probability distributed according to the multinomial distribution, namely

where is now a normalization constant and can be ignored, and is the entropy, rather than the relative entropy. This comes from substituting the uniform distribution for into the relative entropy .
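To see numerically how entropy controls the multinomial distribution under the uniform hypothesis, here is a small sketch. The function names and the example counts are made up for illustration. It compares the per-roll logarithm of the multinomial coefficient with the entropy (natural logarithm) of the corresponding empirical distribution; the factor coming from the uniform hypothesis is the same for every outcome and is left out.

```python
from math import lgamma, log


def log_multinomial(counts):
    """Natural log of n! / (n_1! ... n_k!), computed stably via log-gamma."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)


def entropy(counts):
    """Entropy (in nats) of the empirical distribution with the given counts."""
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)


if __name__ == "__main__":
    for scale in (1, 10, 100, 1000, 10000):
        counts = (1 * scale, 2 * scale, 3 * scale)
        n = sum(counts)
        print(n, log_multinomial(counts) / n, entropy(counts))
```

The gap between the two columns shrinks like `log(n) / n`, which is the sense in which the multinomial coefficient behaves like the exponential of `n` times the entropy.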

We now want to ask a slightly different question than before. Before we were asking what our beliefs were about the underlying “true” probabilities generating the die rolls. Now we’ve already fixed those beliefs, and we’re instead going to ask what our beliefs are about the empirical distribution , which we now no longer know, conditioned on the fact that . By Bayes’ theorem, this is

where is either if or if , and as before. Overall, we conclude the following.

**Theorem:** Starting from the indifference prior, as Bayesian updates converge towards believing that the empirical distribution is the maximum entropy distribution in .

In some sense this is not at all a deep statement: it’s just the observation that entropy describes the asymptotics of the multinomial distribution, together with conditioning on . It is somewhat interesting, though, that conditioning on , in this setup, is done by seeing that appears to asymptotically lie in as .
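The theorem can be watched in action on a toy case. The sketch below is my own illustrative setup, not one from the text: a three-sided die with faces 1, 2, 3, the uniform hypothesis, and the constraint that the average of the rolls is exactly 2.5. With `n` rolls this forces the counts to be `(n1, n/2 - 2*n1, n/2 + n1)`, so the conditional distribution is a distribution over `n1` alone, with weights proportional to the multinomial coefficient. The maximum entropy distribution with mean 2.5 has probabilities proportional to `x, x**2, x**3` where `x = (1 + sqrt(13)) / 2`.

```python
from math import exp, lgamma, sqrt


def log_multinomial(counts):
    """Natural log of n! / (n_1! ... n_k!)."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)


def constrained_posterior(n):
    """Distribution of n1 (the number of 1s) given that n fair rolls average exactly 2.5."""
    if n % 2:
        raise ValueError("n must be even for the average to equal 2.5 exactly")
    half = n // 2
    log_w = {
        n1: log_multinomial((n1, half - 2 * n1, half + n1))
        for n1 in range(half // 2 + 1)
    }
    top = max(log_w.values())  # subtract the max before exponentiating, for stability
    weights = {n1: exp(v - top) for n1, v in log_w.items()}
    total = sum(weights.values())
    return {n1: w / total for n1, w in weights.items()}


def max_entropy_f1():
    """Probability of face 1 in the maximum entropy distribution with mean 2.5."""
    x = (1 + sqrt(13)) / 2
    return 1 / (1 + x + x * x)


if __name__ == "__main__":
    target = max_entropy_f1()
    for n in (40, 400, 4000):
        post = constrained_posterior(n)
        mode = max(post, key=post.get)
        near = sum(p for n1, p in post.items() if abs(n1 / n - target) < 0.05)
        print(n, mode / n, target, near)
```

As `n` grows, the most probable empirical frequency of face 1 approaches the maximum entropy value, and the mass within a fixed distance of it approaches 1.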

**Edit:** This is essentially the Wallis derivation, but with a much larger emphasis placed on the choice of prior.

*Example.* Suppose consists of all probability distributions such that the expected value

of some random variable (possibly vector-valued) is fixed; call this fixed value . Then we want to maximize subject to this constraint and the constraint . This is again a Lagrange multiplier problem. We’ll introduce a vector-valued Lagrange multiplier , as well as a scalar Lagrange multiplier for the constraint that will later disappear from the calculation. (The will slightly simplify the calculation.)

Then the method of Lagrange multipliers says that any maximum must be a critical point of the function

for some value of and . (Here we are hiding the dependence of on .) Using the fact that

we compute that

and setting these partial derivatives equal to gives

where . But since the must sum to , is a normalization constant determined by this condition, and in fact must be the **partition function**

.

From here we can compute that the expected value of is

and the entropy is

.
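Here is a minimal numerical sketch of this example for a scalar random variable. The sign convention is an assumption on my part: probabilities are taken proportional to `exp(-lam * x)`, as is usual in statistical mechanics. The function names, the bisection bracket, and the test case (a six-sided die whose average roll is 4.5) are also mine. The code solves for the multiplier by bisection, then checks the two identities that follow from this form: the expected value equals minus the derivative of `log Z` with respect to `lam`, and the entropy equals `log Z + lam * mean`.

```python
from math import exp, log


def gibbs(values, lam):
    """Return (probabilities, log Z) for p_i proportional to exp(-lam * x_i)."""
    exponents = [-lam * x for x in values]
    shift = max(exponents)  # log-sum-exp shift to avoid overflow
    weights = [exp(e - shift) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights], shift + log(total)


def mean(values, probs):
    return sum(p * x for p, x in zip(probs, values))


def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)


def solve_multiplier(values, target_mean, lo=-50.0, hi=50.0, iterations=200):
    """Bisection on lam; the mean is a decreasing function of lam (its derivative is -variance)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        probs, _ = gibbs(values, mid)
        if mean(values, probs) > target_mean:
            lo = mid  # mean too large, so lam must increase
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    faces = [1, 2, 3, 4, 5, 6]
    lam = solve_multiplier(faces, 4.5)
    probs, log_z = gibbs(faces, lam)
    print(lam, [round(p, 4) for p in probs])
    print(entropy(probs), log_z + lam * mean(faces, probs))
```

A negative multiplier comes out here because the target mean 4.5 is above the uniform mean 3.5, so the distribution has to tilt toward the larger faces.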

In statistical mechanics, a “die” is a statistical-mechanical system, and is a vector of variables such as energy and particle number describing that system. The Lagrange multiplier is a vector of conjugate variables such as (inverse) temperature and chemical potential. The probability distribution we’ve just described is the canonical ensemble if consists only of energy and the grand canonical ensemble if consists of energy and particle numbers.

The uniform prior we assumed using the principle of indifference, possibly after conditioning on a fixed value of the energy (rather than a fixed expected value), is the microcanonical ensemble. The assumption that this is a reasonable prior is called the fundamental postulate of statistical mechanics. As Terence Tao explains here, in a suitable finite toy model involving Markov chains at equilibrium it can be proven rigorously, but in more complicated settings the fundamental postulate is harder to justify, and of course in some settings it will just be wrong. In these settings we can instead use the principle of maximum relative entropy.
