## Archive for the ‘math’ Category

Note: this is a repost of a Facebook status I wrote off the cuff about a year ago, lightly edited. As such it has a different style from my other posts, but I still wanted to put it somewhere where it’d be easier to find and share than Facebook.

Gradient descent, in its simplest form, where you just subtract the gradient of your loss function $J$, is not dimensionally consistent: if the parameters you’re optimizing over have units of length, and the loss function is dimensionless, then the derivatives you’re subtracting have units of inverse length.

This observation can be used to reinvent the learning rate, which, for dimensional consistency, must have units of length squared. It also suggests that the learning rate ought to be set to something like $L^2$ for some kind of characteristic length scale $L$, which loosely speaking is the length at which the curvature of $J$ starts to matter.
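To make the bookkeeping concrete, here is a minimal numerical sketch (the quadratic loss, the scale $L$, the minimizer $a$, and the constant $0.25$ are all hypothetical choices for illustration): the loss $J(x) = ((x-a)/L)^2$ is dimensionless, its derivative carries units of inverse length, and a learning rate proportional to $L^2$ makes the update carry units of length, as it must.

```python
# Minimal sketch: dimensionally consistent gradient descent.
# x and a carry units of length; J is dimensionless; dJ/dx has units 1/length.
L = 5.0           # characteristic length scale (hypothetical choice)
a = 3.0           # location of the minimum (hypothetical choice)
lr = 0.25 * L**2  # learning rate with units of length^2

x = 0.0
for _ in range(100):
    grad = 2 * (x - a) / L**2  # d/dx of J(x) = ((x - a)/L)^2, units 1/length
    x -= lr * grad             # length^2 * (1/length) = length: consistent

print(abs(x - a))  # converges to the minimum at x = a
```

With this choice the update contracts the error $x - a$ by a factor of $1/2$ per step, independently of the choice of $L$.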

It might also make sense to give different parameters different units, which suggests furthermore that one might want a different learning rate for each parameter, or at least that one might want to partition the parameters into different subsets and choose different learning rates for each.

Going much further, from an abstract coordinate-free point of view the extra information you need to compute the gradient of a smooth function is a choice of (pseudo-)Riemannian metric on parameter space, which if you like is a gigantic hyperparameter you can try to optimize. Concretely this amounts to a version of preconditioned gradient descent where you allow yourself to multiply the gradient (in the coordinate-dependent sense) by a symmetric (invertible, ideally positive definite) matrix which is allowed to depend on the parameters. In the first paragraph this matrix was a constant scalar multiple of the identity and in the third paragraph this matrix was constant diagonal.
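As a sketch of the constant diagonal case (the matrices and step size below are hypothetical choices for illustration): multiplying the gradient by a constant positive definite diagonal matrix is the same as giving each parameter its own learning rate, and on a badly scaled quadratic it can equalize the convergence rates of the different coordinates.

```python
import numpy as np

# Sketch of preconditioned gradient descent on a badly scaled quadratic
# J(theta) = theta^T A theta / 2 (all numbers are hypothetical illustrations).
A = np.diag([1.0, 100.0])        # ill-conditioned Hessian
P = np.diag([1.0, 1.0 / 100.0])  # constant positive definite diagonal
                                 # preconditioner: per-parameter learning rates

theta = np.array([1.0, 1.0])
for _ in range(50):
    grad = A @ theta          # the gradient in the coordinate-dependent sense
    theta -= 0.5 * P @ grad   # premultiply by the preconditioning matrix

print(np.linalg.norm(theta))  # both coordinates now decay at the same rate
```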

This is an extremely general form of gradient descent, general enough to be equivariant under arbitrary smooth change of coordinates: that is, if you do this form of gradient descent and then apply a diffeomorphism to parameter space, you are still doing this form of gradient descent, with a different metric. For example, if you pick the preconditioning matrix to be the inverse Hessian (in the usual sense, assuming it’s invertible), you recover Newton’s method. This corresponds to choosing the metric at each point to be given by the Hessian (in the usual sense), which is the choice that makes the Hessian (in the coordinate-free sense) equal to the identity. This is a precise version of “the length at which the curvature of $J$ starts to matter” and in principle ameliorates the problem where gradient descent performs poorly in narrow valleys (regions where the Hessian (in the usual sense) is poorly conditioned), at least up to cubic and higher order effects.
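On an exactly quadratic loss this is easy to see numerically (the Hessian and minimizer below are hypothetical choices): preconditioning by the inverse Hessian reaches the minimum in a single step, no matter how poorly conditioned the valley is.

```python
import numpy as np

# Newton's method as preconditioned gradient descent: on the quadratic
# J(theta) = (theta - a)^T H (theta - a) / 2, preconditioning by inv(H)
# jumps to the minimum in one step, however narrow the valley.
H = np.array([[100.0, 0.0], [0.0, 0.01]])  # very poorly conditioned Hessian
a = np.array([2.0, -1.0])                  # minimizer (hypothetical choice)

theta = np.zeros(2)
grad = H @ (theta - a)            # gradient of J at theta
theta -= np.linalg.inv(H) @ grad  # one Newton step
print(theta)                      # equals a up to rounding
```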

In general it’s expensive to compute the inverse Hessian, so a more practical thing to do is to use a matrix which approximates it in some sense. And now we’re well on the way towards quasi-Newton methods.
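For instance, the BFGS update builds an approximation to the inverse Hessian out of successive gradient differences alone, without ever forming the Hessian. Here is a minimal sketch on a hypothetical quadratic, using an exact line search (which is available in closed form in the quadratic case).

```python
import numpy as np

# Minimal BFGS sketch: maintain an approximate inverse Hessian B using only
# parameter steps s and gradient differences y; the Hessian is never formed.
# The quadratic test problem J(x) = x^T A x / 2 is a hypothetical illustration.
A = np.diag([1.0, 10.0, 100.0])

x = np.ones(3)
B = np.eye(3)                         # initial guess for the inverse Hessian
g = A @ x                             # gradient of J at x
for _ in range(10):
    if np.linalg.norm(g) < 1e-10:
        break
    d = -B @ g                        # quasi-Newton search direction
    alpha = -(g @ d) / (d @ (A @ d))  # exact line search (quadratic case)
    s = alpha * d
    x_new, g_new = x + s, A @ (x + s)
    y = g_new - g
    rho = 1.0 / (y @ s)
    I = np.eye(3)
    # The standard BFGS update of the inverse-Hessian approximation.
    B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)
    x, g = x_new, g_new

print(np.linalg.norm(x))  # near the minimizer at the origin
```

With an exact line search on a quadratic, BFGS terminates in at most as many iterations as there are parameters, so the loop above stops early.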

## The representation theory of the additive group scheme

In this post we’ll describe the representation theory of the additive group scheme $\mathbb{G}_a$ over a field $k$. The answer turns out to depend dramatically on whether or not $k$ has characteristic zero.

## Singular value decomposition

As a warm-up to the subject of this blog post, consider the problem of how to classify $n \times m$ matrices $M \in \mathbb{R}^{n \times m}$ up to change of basis in both the source ($\mathbb{R}^m$) and the target ($\mathbb{R}^n$). In other words, the problem is to describe the equivalence classes of the equivalence relation on $n \times m$ matrices given by

$\displaystyle M \sim N \Leftrightarrow M = PNQ^{-1}, P \in GL_n(\mathbb{R}), Q \in GL_m(\mathbb{R})$.

It turns out that the equivalence class of $M$ is completely determined by its rank $r = \text{rank}(M)$. To prove this we construct some bases by induction. For starters, let $x_1 \in \mathbb{R}^m$ be a vector such that $y_1 = M x_1 \neq 0$; this is always possible unless $M = 0$. Next, let $x_2 \in \mathbb{R}^m$ be a vector such that $y_2 = M x_2$ is linearly independent of $y_1$; this is always possible unless $\text{rank}(M) = 1$.

Continuing in this way, we construct vectors $x_1, \dots x_r \in \mathbb{R}^m$ such that the vectors $y_1 = M x_1, \dots y_r = M x_r \in \mathbb{R}^n$ are linearly independent, hence a basis of the column space of $M$. Next, we complete the $x_i$ to a basis of $\mathbb{R}^m$ by adjoining a basis of the kernel of $M$ (the two together really do form a basis, since any linear combination of the $x_i$ lying in the kernel maps to a vanishing linear combination of the $y_i$), and we complete the $y_i$ to a basis of $\mathbb{R}^n$ in whatever manner we like. With respect to these bases, $M$ takes a very simple form: we have $M x_i = y_i$ if $1 \le i \le r$ and otherwise $M x_i = 0$. Hence, in these bases, $M$ is a block matrix where the top left block is an $r \times r$ identity matrix and the other blocks are zero.

Explicitly, this means we can write $M$ as a product

$\displaystyle M = PDQ^{-1}, P \in GL_n(\mathbb{R}), Q \in GL_m(\mathbb{R})$

where $D$ has the block form above, the columns of $P$ are the basis of $\mathbb{R}^n$ we found by completing $M x_1, \dots M x_r$, and the columns of $Q$ are the basis of $\mathbb{R}^m$ we found by completing $x_1, \dots x_r$. This decomposition can be computed by row and column reduction on $M$, where the row operations we perform give $P$ and the column operations we perform give $Q$.
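Here is a sketch of that computation in code (the function name and test matrix are hypothetical): we accumulate the row operations in a matrix $E$ and the column operations in a matrix $F$, so that $D = EMF$ holds at every step, and then $P = E^{-1}$ and $Q = F$.

```python
import numpy as np

# Sketch: compute M = P D Q^{-1} with D in rank normal form, by simultaneous
# row and column reduction with full pivoting.
def rank_normal_form(M, tol=1e-10):
    D = M.astype(float).copy()
    n, m = D.shape
    E, F = np.eye(n), np.eye(m)   # invariant: D = E @ M_original @ F
    r = 0
    while r < min(n, m):
        sub = np.abs(D[r:, r:])
        if sub.max() < tol:       # no pivot left: the rank is r
            break
        i, j = divmod(sub.argmax(), sub.shape[1])
        i, j = i + r, j + r
        D[[r, i]] = D[[i, r]]; E[[r, i]] = E[[i, r]]              # row swap
        D[:, [r, j]] = D[:, [j, r]]; F[:, [r, j]] = F[:, [j, r]]  # column swap
        s = D[r, r]
        D[r] /= s; E[r] /= s      # scale the pivot to 1
        for k in range(n):        # clear the pivot column
            if k != r:
                c = D[k, r]; D[k] -= c * D[r]; E[k] -= c * E[r]
        for k in range(m):        # clear the pivot row
            if k != r:
                c = D[r, k]; D[:, k] -= c * D[:, r]; F[:, k] -= c * F[:, r]
        r += 1
    return np.linalg.inv(E), D, F, r   # P = inv(E), Q = F

M = np.array([[1., 2., 3.], [2., 4., 6.]])   # a rank-1 example
P, D, Q, r = rank_normal_form(M)
print(r, np.allclose(P @ D @ np.linalg.inv(Q), M))
```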

Conceptually, the question we’ve asked is: what does a linear transformation $T : X \to Y$ between vector spaces “look like,” when we don’t restrict ourselves to picking a particular basis of $X$ or $Y$? The answer, stated in a basis-independent form, is the following. First, we can factor $T$ as a composite

$\displaystyle X \xrightarrow{p} \text{im}(T) \xrightarrow{i} Y$

where $\text{im}(T)$ is the image of $T$. Next, we can find direct sum decompositions $X \cong \text{im}(T) \oplus X'$ and $Y \cong \text{im}(T) \oplus Y'$ such that $p$ is the projection of $X$ onto its first factor and $i$ is the inclusion of the first factor into $Y$. Hence every linear transformation “looks like” a composite

$\displaystyle \text{im}(T) \oplus X' \xrightarrow{p_{\text{im}(T)}} \text{im}(T) \xrightarrow{i_{\text{im}(T)}} \text{im}(T) \oplus Y'$

of a projection onto a direct summand and an inclusion of a direct summand. So the only basis-independent information contained in $T$ is the dimension of the image $\text{im}(T)$, or equivalently the rank of $T$. (It’s worth considering the analogous question for functions between sets, whose answer is a bit more complicated.)

The actual problem this blog post is about is more interesting: it is to classify $n \times m$ matrices $M \in \mathbb{R}^{n \times m}$ up to orthogonal change of basis in both the source and the target. In other words, we now want to understand the equivalence classes of the equivalence relation given by

$\displaystyle M \sim N \Leftrightarrow M = UNV^{-1}, U \in O(n), V \in O(m)$.

Conceptually, we’re now asking: what does a linear transformation $T : X \to Y$ between finite-dimensional Hilbert spaces “look like”?
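As the post title suggests, singular values will be the key invariant. As a numerical sanity check (the matrix sizes and random seed below are arbitrary choices), they are unchanged under orthogonal change of basis in both the source and the target:

```python
import numpy as np

# The singular values of M are invariant under M -> U M V^{-1}
# with U, V orthogonal; here we check this numerically.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 5))

# Random orthogonal U in O(3) and V in O(5) via QR factorization.
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))

s1 = np.linalg.svd(M, compute_uv=False)
s2 = np.linalg.svd(U @ M @ V.T, compute_uv=False)  # V.T = V^{-1}
print(np.allclose(s1, s2))  # True: singular values unchanged
```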

## Higher linear algebra

Let $k$ be a commutative ring. A popular thing to do on this blog is to think about the Morita 2-category $\text{Mor}(k)$ of algebras, bimodules, and bimodule homomorphisms over $k$, but it might be unclear exactly what we’re doing when we do this. What are we studying when we study the Morita 2-category?

The answer is that we can think of the Morita 2-category as a 2-category of module categories over the symmetric monoidal category $\text{Mod}(k)$ of $k$-modules, equipped with the usual tensor product $\otimes_k$ over $k$. By the Eilenberg-Watts theorem, the Morita 2-category is equivalently the 2-category whose

• objects are the categories $\text{Mod}(A)$, where $A$ is a $k$-algebra,
• morphisms are cocontinuous $k$-linear functors $\text{Mod}(A) \to \text{Mod}(B)$, and
• 2-morphisms are natural transformations.

An equivalent way to describe the morphisms is that they are “$\text{Mod}(k)$-linear” in that they respect the natural action of $\text{Mod}(k)$ on $\text{Mod}(A)$ given by

$\displaystyle \text{Mod}(k) \times \text{Mod}(A) \ni (V, M) \mapsto V \otimes_k M \in \text{Mod}(A)$.

This action comes from taking the adjoint of the enrichment of $\text{Mod}(A)$ over $\text{Mod}(k)$, which gives a tensoring of $\text{Mod}(A)$ over $\text{Mod}(k)$. Since the two are related by an adjunction in this way, a functor respects one iff it respects the other.

So Morita theory can be thought of as a categorified version of module theory, where we study modules over $\text{Mod}(k)$ instead of over $k$. In the simplest cases, we can think of Morita theory as a categorified version of linear algebra, and in this post we’ll flesh out this analogy further.

## More on partition asymptotics

In the previous post we described a fairly straightforward argument, using generating functions and the saddle-point bound, for giving an upper bound

$\displaystyle p(n) \le \exp \left( \pi \sqrt{ \frac{2n}{3} } \right)$

on the partition function $p(n)$. In this post I’d like to record an elementary argument, making no use of generating functions, giving a lower bound of the form $\exp C \sqrt{n}$ for some $C > 0$, which might help explain intuitively why this exponential-of-a-square-root rate of growth makes sense.

The starting point is to think of a partition of $n$ as a Young diagram of size $n$, or equivalently (in French coordinates) as a lattice path from somewhere on the y-axis to somewhere on the x-axis, which only steps down or to the right, such that the area under the path is $n$. Heuristically, if the path takes a total of $L$ steps then there are about $2^L$ such paths, and if the area under the path is $n$ then the length of the path should be about $O(\sqrt{n})$, so this already goes a long way towards explaining the exponential-of-a-square-root behavior.
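This growth rate is easy to probe numerically with the standard dynamic program for $p(n)$ (the cutoff $n = 2000$ below is an arbitrary choice): the ratio $\log p(n) / \sqrt{n}$ stays below the constant $\pi \sqrt{2/3} \approx 2.565$ from the upper bound and slowly climbs toward it.

```python
import math

# Standard dynamic program for the partition function p(n):
# incrementally allow parts of size 1, 2, ..., n.
def partitions(n):
    p = [1] + [0] * n
    for k in range(1, n + 1):        # allow parts of size k
        for i in range(k, n + 1):
            p[i] += p[i - k]
    return p[n]

C = math.pi * math.sqrt(2 / 3)       # the constant in the upper bound, ~2.565
n = 2000
ratio = math.log(partitions(n)) / math.sqrt(n)
print(ratio, C)                      # ratio sits below C and creeps toward it
```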

## The man who knew partition asymptotics

(Part I of this post is here)

Let $p(n)$ denote the partition function, which describes the number of ways to write $n$ as a sum of positive integers, ignoring order. In 1918 Hardy and Ramanujan proved that $p(n)$ is given asymptotically by

$\displaystyle p(n) \approx \frac{1}{4n \sqrt{3}} \exp \left( \pi \sqrt{ \frac{2n}{3} } \right)$.

This is a major plot point in the new Ramanujan movie, where Ramanujan conjectures this result and MacMahon challenges him by agreeing to compute $p(200)$ and comparing it to what this approximation gives. In this post I’d like to describe how one might go about conjecturing this result up to a multiplicative constant; proving it is much harder.
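MacMahon’s computation is easy to replicate today with the standard dynamic program for $p(n)$ (which is, of course, not how MacMahon did it); comparing with the Hardy–Ramanujan formula shows the approximation is already accurate to about 3% at $n = 200$.

```python
import math

# Compute p(200) exactly with the standard dynamic program, then compare
# with the Hardy-Ramanujan asymptotic formula.
def partitions(n):
    p = [1] + [0] * n
    for k in range(1, n + 1):        # allow parts of size k
        for i in range(k, n + 1):
            p[i] += p[i - k]
    return p[n]

n = 200
exact = partitions(n)                # 3972999029388, MacMahon's value
approx = math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))
print(exact, approx / exact)         # the formula overshoots by about 3%
```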