
In Part I we discussed some conceptual proofs of the Sylow theorems. Two of those proofs involve reducing the existence of Sylow subgroups to the existence of Sylow subgroups of S_n and GL_n(\mathbb{F}_p) respectively. The goal of this post is to understand the Sylow p-subgroups of GL_n(\mathbb{F}_p) in more detail and see what we can learn from them about Sylow subgroups in general.
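
For concreteness, a Sylow p-subgroup of GL_n(\mathbb{F}_p) is the group of upper triangular matrices with 1’s on the diagonal, which has order p^{n(n-1)/2}, exactly the p-part of |GL_n(\mathbb{F}_p)|. A quick sanity check in Python, my own illustration rather than part of the post:

```python
from math import prod

def order_GL(n, p):
    """|GL_n(F_p)| = (p^n - 1)(p^n - p) ... (p^n - p^(n-1))."""
    return prod(p**n - p**i for i in range(n))

def valuation(N, p):
    """Largest v such that p^v divides N."""
    v = 0
    while N % p == 0:
        N //= p
        v += 1
    return v

# The p-part of |GL_n(F_p)| is p^(n(n-1)/2), which is also the order of the
# group of upper unitriangular matrices, so that group is a Sylow p-subgroup.
for p in (2, 3, 5):
    for n in (2, 3, 4, 5):
        assert valuation(order_GL(n, p), p) == n * (n - 1) // 2
```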

Continue Reading »

As an undergraduate the proofs I saw of the Sylow theorems seemed very complicated and I was totally unable to remember them. The goal of this post is to explain proofs of the Sylow theorems which I am actually able to remember, several of which use our old friend

The p-group fixed point theorem (PGFPT): If P is a finite p-group and X is a finite set on which P acts, then the subset X^P of fixed points satisfies |X^P| \equiv |X| \bmod p. In particular, if |X| \not \equiv 0 \bmod p then this action has at least one fixed point.
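
As a toy illustration of the PGFPT (my example, not from the post): let P = \mathbb{Z}/p act on the set X of strings of length p over a k-letter alphabet by cyclic rotation. The fixed points are the k constant strings and |X| = k^p, so the congruence specializes to Fermat’s little theorem k^p \equiv k \bmod p.

```python
from itertools import product

p, k = 5, 3
X = list(product(range(k), repeat=p))     # all strings of length p over k letters
rotate = lambda s: s[1:] + s[:1]          # generator of the Z/p action
fixed = [s for s in X if rotate(s) == s]  # fixed points: exactly the constant strings

assert len(fixed) == k
assert len(X) % p == len(fixed) % p       # |X| ≡ |X^P| (mod p)
```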

There will be some occasional historical notes taken from Waterhouse’s The Early Proofs of Sylow’s Theorem.

Continue Reading »

This is a post I wanted to write some time ago; I’ve forgotten why, but it was short and cute enough to finish. Our starting point is the following observation:

Theorem 1: Universal lossless compression is impossible. That is, there is no function which takes as input finite strings (over some fixed alphabet) and always produces as output shorter finite strings (over the same alphabet) in such a way that the latter is recoverable from the former.
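
The proof is a counting argument; here is a minimal numerical restatement of it (my own sketch): over an alphabet of size k \ge 2 there are k^n strings of length exactly n but only (k^n - 1)/(k - 1) strings of length strictly less than n, so no injection from the former into the latter can exist.

```python
# Pigeonhole check: strings of length < n (including the empty string) number
# 1 + k + ... + k**(n-1) = (k**n - 1)//(k - 1), which is strictly less than
# k**n, the number of strings of length exactly n, whenever k >= 2.
for k in (2, 3, 26):
    for n in (1, 2, 5, 10):
        shorter = (k**n - 1) // (k - 1)
        assert shorter < k**n
```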

Continue Reading »

Gradient descent

Note: this is a repost of a Facebook status I wrote off the cuff about a year ago, lightly edited. As such it has a different style from my other posts, but I still wanted to put it somewhere where it’d be easier to find and share than Facebook. 

Gradient descent, in its simplest form, where you just subtract the gradient of your loss function J, is not dimensionally consistent: if the parameters you’re optimizing over have units of length, and the loss function is dimensionless, then the derivatives you’re subtracting have units of inverse length.

This observation can be used to reinvent the learning rate, which, for dimensional consistency, must have units of length squared. It also suggests that the learning rate ought to be set to something like L^2 for some kind of characteristic length scale L, which loosely speaking is the length at which the curvature of J starts to matter.

It might also make sense to give different parameters different units, which suggests furthermore that one might want a different learning rate for each parameter, or at least that one might want to partition the parameters into different subsets and choose different learning rates for each.

Going much further, from an abstract coordinate-free point of view the extra information you need to compute the gradient of a smooth function is a choice of (pseudo-)Riemannian metric on parameter space, which if you like is a gigantic hyperparameter you can try to optimize. Concretely this amounts to a version of preconditioned gradient descent where you allow yourself to multiply the gradient (in the coordinate-dependent sense) by a symmetric (invertible, ideally positive definite) matrix which is allowed to depend on the parameters. In the first paragraph this matrix was a constant scalar multiple of the identity and in the third paragraph this matrix was constant and diagonal.
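
Here is a minimal sketch of this preconditioned update in NumPy (the function name, toy loss, and constant preconditioners are my own choices):

```python
import numpy as np

def preconditioned_gd_step(theta, grad, P):
    """One step of preconditioned gradient descent: theta <- theta - P @ grad(theta),
    where P is a symmetric (ideally positive definite) matrix, here taken constant."""
    return theta - P @ grad(theta)

grad = lambda t: 2 * t                  # gradient of the toy loss J(t) = |t|^2
theta = np.array([1.0, -2.0])

# P = lr * I recovers plain gradient descent with a single learning rate;
# a diagonal P gives each parameter its own learning rate.
theta = preconditioned_gd_step(theta, grad, 0.1 * np.eye(2))
theta = preconditioned_gd_step(theta, grad, np.diag([0.1, 0.01]))
```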

This is an extremely general form of gradient descent, general enough to be equivariant under arbitrary smooth change of coordinates: that is, if you do this form of gradient descent and then apply a diffeomorphism to parameter space, you are still doing this form of gradient descent, with a different metric. For example, if you pick the preconditioning matrix to be the inverse Hessian (in the usual sense, assuming it’s invertible), you recover Newton’s method. This corresponds to choosing the metric at each point to be given by the Hessian (in the usual sense), which is the choice that makes the Hessian (in the coordinate-free sense) equal to the identity. This is a precise version of “the length at which the curvature of J starts to matter” and in principle ameliorates the problem where gradient descent performs poorly in narrow valleys (regions where the Hessian (in the usual sense) is poorly conditioned), at least up to cubic and higher order effects.
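
A toy check of the Newton special case (again mine, on a quadratic loss where the claim is exact): for J(\theta) = \frac{1}{2} \theta^T A \theta - b^T \theta, preconditioning by the inverse Hessian A^{-1} reaches the minimizer A^{-1} b in a single step, however badly conditioned A is.

```python
import numpy as np

A = np.array([[100.0, 0.0], [0.0, 1.0]])  # poorly conditioned Hessian: a narrow valley
b = np.array([1.0, 1.0])

grad = lambda t: A @ t - b                # gradient of J(t) = 0.5 t^T A t - b^T t
hess_inv = np.linalg.inv(A)

theta = np.zeros(2)
theta = theta - hess_inv @ grad(theta)    # one Newton step

assert np.allclose(theta, np.linalg.solve(A, b))
```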

In general it’s expensive to compute the inverse Hessian, so a more practical thing to do is to use a matrix which approximates it in some sense. And now we’re well on the way towards quasi-Newton methods.

In this post we’ll describe the representation theory of the additive group scheme \mathbb{G}_a over a field k. The answer turns out to depend dramatically on whether or not k has characteristic zero.

Continue Reading »

As a warm-up to the subject of this blog post, consider the problem of how to classify n \times m matrices M \in \mathbb{R}^{n \times m} up to change of basis in both the source (\mathbb{R}^m) and the target (\mathbb{R}^n). In other words, the problem is to describe the equivalence classes of the equivalence relation on n \times m matrices given by

\displaystyle M \sim N \Leftrightarrow M = PNQ^{-1}, P \in GL_n(\mathbb{R}), Q \in GL_m(\mathbb{R}).

It turns out that the equivalence class of M is completely determined by its rank r = \text{rank}(M). To prove this we construct some bases by induction. For starters, let x_1 \in \mathbb{R}^m be a vector such that y_1 = M x_1 \neq 0; this is always possible unless M = 0. Next, let x_2 \in \mathbb{R}^m be a vector such that y_2 = M x_2 is linearly independent of y_1; this is always possible unless \text{rank}(M) = 1.

Continuing in this way, we construct vectors x_1, \dots, x_r \in \mathbb{R}^m such that the vectors y_1 = M x_1, \dots, y_r = M x_r \in \mathbb{R}^n are linearly independent, hence a basis of the column space of M. Next, we complete the x_i to a basis of \mathbb{R}^m by adjoining a basis of the kernel \ker(M) (this works because \ker(M) has dimension m - r and meets the span of the x_i trivially), and we complete the y_i to a basis of \mathbb{R}^n in whatever manner we like. With respect to these bases, M takes a very simple form: we have M x_i = y_i if 1 \le i \le r and otherwise M x_i = 0. Hence, in these bases, M is a block matrix where the top left block is an r \times r identity matrix and the other blocks are zero.

Explicitly, this means we can write M as a product

\displaystyle M = PDQ^{-1}, P \in GL_n(\mathbb{R}), Q \in GL_m(\mathbb{R})

where D has the block form above, the columns of P are the basis of \mathbb{R}^n we found by completing M x_1, \dots, M x_r, and the columns of Q are the basis of \mathbb{R}^m we found by completing x_1, \dots, x_r. This decomposition can be computed by row and column reduction on M: if row operations R and column operations C reduce M to D, so that RMC = D, then P = R^{-1} and Q = C.
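
Here is a NumPy sketch of this construction (the function name and the use of singular vectors to choose the x_i and to complete the bases are my own choices; the post doesn’t specify them):

```python
import numpy as np

def rank_normal_form(M, tol=1e-10):
    """Return P, D, Q with M = P @ D @ inv(Q), where D has an r x r identity
    block in the top left and zeros elsewhere (r = rank of M)."""
    n, m = M.shape
    U, s, Vt = np.linalg.svd(M)
    r = int(np.sum(s > tol))

    # Columns of Q: vectors x_1, ..., x_r whose images M x_i are linearly
    # independent, completed by a basis of ker(M) so that M x_i = 0 for i > r.
    Q = Vt.T
    # Columns of P: the images y_i = M x_i, completed to a basis of R^n.
    P = np.column_stack([M @ Q[:, :r], U[:, r:]])

    D = np.zeros((n, m))
    D[:r, :r] = np.eye(r)
    return P, D, Q

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # a random rank-2 matrix
P, D, Q = rank_normal_form(M)
assert np.allclose(M, P @ D @ np.linalg.inv(Q))
```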

Conceptually, the question we’ve asked is: what does a linear transformation T : X \to Y between vector spaces “look like,” when we don’t restrict ourselves to picking a particular basis of X or Y? The answer, stated in a basis-independent form, is the following. First, we can factor T as a composite

\displaystyle X \xrightarrow{p} \text{im}(T) \xrightarrow{i} Y

where \text{im}(T) is the image of T. Next, we can find direct sum decompositions X \cong \text{im}(T) \oplus X' and Y \cong \text{im}(T) \oplus Y' such that p is the projection of X onto its first factor and i is the inclusion of the first factor into Y. Hence every linear transformation “looks like” a composite

\displaystyle \text{im}(T) \oplus X' \xrightarrow{p_{\text{im}(T)}} \text{im}(T) \xrightarrow{i_{\text{im}(T)}} \text{im}(T) \oplus Y'

of a projection onto a direct summand and an inclusion of a direct summand. So the only basis-independent information contained in T is the dimension of the image \text{im}(T), or equivalently the rank of T. (It’s worth considering the analogous question for functions between sets, whose answer is a bit more complicated.)

The actual problem this blog post is about is more interesting: it is to classify n \times m matrices M \in \mathbb{R}^{n \times m} up to orthogonal change of basis in both the source and the target. In other words, we now want to understand the equivalence classes of the equivalence relation given by

\displaystyle M \sim N \Leftrightarrow M = UNV^{-1}, U \in O(n), V \in O(m).

Conceptually, we’re now asking: what does a linear transformation T : X \to Y between finite-dimensional Hilbert spaces “look like”?
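
As a small numerical hint at the answer (my addition, not part of the excerpt): the singular values of M are unchanged by this equivalence, so they are invariants of the equivalence classes.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))

# Random orthogonal U in O(5) and V in O(3), obtained from QR factorizations.
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))

N = U @ M @ np.linalg.inv(V)              # N ~ M under the new equivalence relation

# The singular values agree.
assert np.allclose(np.linalg.svd(M, compute_uv=False),
                   np.linalg.svd(N, compute_uv=False))
```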

Continue Reading »

Let k be a commutative ring. A popular thing to do on this blog is to think about the Morita 2-category \text{Mor}(k) of algebras, bimodules, and bimodule homomorphisms over k, but it might be unclear exactly what we’re doing when we do this. What are we studying when we study the Morita 2-category?

The answer is that we can think of the Morita 2-category as a 2-category of module categories over the symmetric monoidal category \text{Mod}(k) of k-modules, equipped with the usual tensor product \otimes_k over k. By the Eilenberg-Watts theorem, the Morita 2-category is equivalently the 2-category whose

  • objects are the categories \text{Mod}(A), where A is a k-algebra,
  • morphisms are cocontinuous k-linear functors \text{Mod}(A) \to \text{Mod}(B), and
  • 2-morphisms are natural transformations.
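
Concretely, taking \text{Mod}(A) to mean right A-modules (the orientation convention here is my choice; the excerpt doesn’t fix one), the Eilenberg-Watts theorem says that every such cocontinuous k-linear functor is naturally isomorphic to one of the form

\displaystyle \text{Mod}(A) \ni M \mapsto M \otimes_A P \in \text{Mod}(B)

for an (A, B)-bimodule P, which is how bimodules enter the Morita 2-category in the first place.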

An equivalent way to describe the morphisms is that they are “\text{Mod}(k)-linear” in that they respect the natural action of \text{Mod}(k) on \text{Mod}(A) given by

\displaystyle \text{Mod}(k) \times \text{Mod}(A) \ni (V, M) \mapsto V \otimes_k M \in \text{Mod}(A).

This action comes from taking the adjoint of the enrichment of \text{Mod}(A) over \text{Mod}(k), which gives a tensoring of \text{Mod}(A) over \text{Mod}(k). Since the two are related by an adjunction in this way, a functor respects one iff it respects the other.

So Morita theory can be thought of as a categorified version of module theory, where we study modules over \text{Mod}(k) instead of over k. In the simplest cases, we can think of Morita theory as a categorified version of linear algebra, and in this post we’ll flesh out this analogy further.

Continue Reading »

I was staring at a bonfire on a beach the other day and realized that I didn’t understand anything about fire and how it works. (For example: what determines its color?) So I looked up some stuff, and here’s what I learned.

Continue Reading »

In the previous post we described a fairly straightforward argument, using generating functions and the saddle-point bound, for giving an upper bound

\displaystyle p(n) \le \exp \left( \pi \sqrt{ \frac{2n}{3} } \right)

on the partition function p(n). In this post I’d like to record an elementary argument, making no use of generating functions, giving a lower bound of the form \exp C \sqrt{n} for some C > 0, which might help explain intuitively why this exponential-of-a-square-root rate of growth makes sense.

The starting point is to think of a partition of n as a Young diagram of size n, or equivalently (in French coordinates) as a lattice path from somewhere on the y-axis to somewhere on the x-axis, which only steps down or to the right, such that the area under the path is n. Heuristically, if the path takes a total of L steps then there are about 2^L such paths, and if the area under the path is n then the length of the path should be about O(\sqrt{n}), so this already goes a long way towards explaining the exponential-of-a-square-root behavior.
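
For a quick numerical feel for this growth rate (a check of my own, not from the post), one can compute p(n) by a standard dynamic program and watch \log p(n) / \sqrt{n} creep up toward \pi \sqrt{2/3} \approx 2.565:

```python
import math

def partition_counts(N):
    """p(0), ..., p(N) by the standard DP: introduce parts 1, 2, ..., N in turn."""
    p = [1] + [0] * N
    for part in range(1, N + 1):
        for n in range(part, N + 1):
            p[n] += p[n - part]
    return p

p = partition_counts(1000)
for n in (10, 100, 1000):
    upper = math.exp(math.pi * math.sqrt(2 * n / 3))   # the bound from the previous post
    print(n, p[n], p[n] < upper, round(math.log(p[n]) / math.sqrt(n), 3))
```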

Continue Reading »

(Part I of this post is here)

Let p(n) denote the partition function, which describes the number of ways to write n as a sum of positive integers, ignoring order. In 1918 Hardy and Ramanujan proved that p(n) is given asymptotically by

\displaystyle p(n) \approx \frac{1}{4n \sqrt{3}} \exp \left( \pi \sqrt{ \frac{2n}{3} } \right).

This is a major plot point in the new Ramanujan movie, where Ramanujan conjectures this result and MacMahon challenges him by agreeing to compute p(200) and comparing it to what this approximation gives. In this post I’d like to describe how one might go about conjecturing this result up to a multiplicative constant; proving it is much harder.
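
One can replay the p(200) challenge numerically (my own check, not from the post); the exact value MacMahon computed by hand is 3972999029388, and the approximation above overshoots it by only a few percent.

```python
import math

def partition_count(N):
    """p(N) by a simple DP: introduce parts 1, 2, ..., N in turn."""
    p = [1] + [0] * N
    for part in range(1, N + 1):
        for n in range(part, N + 1):
            p[n] += p[n - part]
    return p[N]

n = 200
exact = partition_count(n)
approx = math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))

assert exact == 3972999029388
print(approx / exact)   # roughly 1.03
```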

Continue Reading »