*Note: this is a repost of a Facebook status I wrote off the cuff about a year ago, lightly edited. As such it has a different style from my other posts, but I still wanted to put it somewhere where it’d be easier to find and share than Facebook.*

Gradient descent, in its simplest form, where you just subtract the gradient of your loss function $f$, is not dimensionally consistent: if the parameters you’re optimizing over have units of length, and the loss function is dimensionless, then the derivatives you’re subtracting have units of inverse length.
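In symbols, writing $x$ for the parameter vector (notation mine, for concreteness), the naive update is

$$x \leftarrow x - \nabla f(x),$$

and if $x$ has units of length while $f$ is dimensionless, then $\nabla f(x)$ has units of inverse length, so the two terms on the right don’t share units.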

This observation can be used to reinvent the learning rate, which, for dimensional consistency, must have units of length squared. It also suggests that the learning rate ought to be set to something like $\lambda^2$ for some kind of characteristic length scale $\lambda$, which loosely speaking is the length at which the curvature of $f$ starts to matter.

It might also make sense to give different parameters different units, which suggests furthermore that one might want a different learning rate for each parameter, or at least that one might want to partition the parameters into different subsets and choose different learning rates for each.
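As a sketch of what per-parameter learning rates look like in practice (the toy loss and the particular rates here are my own illustration, not from the post):

```python
import numpy as np

# Toy loss whose two parameters would naturally carry different units,
# so it is plausible to give each its own learning rate.
def loss(params):
    w, b = params
    return (3.0 * w + b - 2.0) ** 2

def grad(params):
    w, b = params
    r = 3.0 * w + b - 2.0
    return np.array([6.0 * r, 2.0 * r])  # analytic gradient of the loss

# One learning rate per parameter (equivalently, a constant diagonal
# preconditioner); the values are arbitrary, chosen for illustration.
lrs = np.array([0.01, 0.05])

params = np.array([0.0, 0.0])
for _ in range(500):
    params = params - lrs * grad(params)
```

Scaling each coordinate of the gradient separately like this is exactly the "different learning rate for each parameter" idea, and it foreshadows the diagonal preconditioner below.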

Going much further, from an abstract coordinate-free point of view the extra information you need to compute the gradient of a smooth function is a choice of (pseudo-)Riemannian metric on parameter space, which if you like is a gigantic hyperparameter you can try to optimize. Concretely this amounts to a version of preconditioned gradient descent where you allow yourself to multiply the gradient (in the coordinate-dependent sense) by a symmetric (invertible, ideally positive definite) matrix which is allowed to depend on the parameters. In the first paragraph this matrix was a constant scalar multiple of the identity, and in the third paragraph it was a constant diagonal matrix.
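A minimal sketch of preconditioned gradient descent in this sense, on a deliberately badly scaled quadratic (the function and the particular preconditioner are mine, for illustration):

```python
import numpy as np

def f(x):
    # badly scaled quadratic: a "narrow valley" along the second coordinate
    return 0.5 * (100.0 * x[0] ** 2 + x[1] ** 2)

def grad_f(x):
    return np.array([100.0 * x[0], x[1]])

def precond(x):
    # symmetric positive definite matrix, in general allowed to depend on x;
    # here a constant diagonal choice that undoes the bad scaling
    return np.diag([0.01, 1.0])

# update: x <- x - step * P(x) @ grad f(x)
x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - 0.5 * precond(x) @ grad_f(x)
```

With this preconditioner both coordinates shrink at the same rate, whereas plain gradient descent with a single scalar learning rate would have to crawl along the shallow direction to avoid diverging along the steep one.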

This is an extremely general form of gradient descent, general enough to be equivariant under arbitrary smooth changes of coordinates: that is, if you do this form of gradient descent and then apply a diffeomorphism to parameter space, you are still doing this form of gradient descent, with a different metric. For example, if you pick the preconditioning matrix to be the inverse Hessian (in the usual sense, assuming it’s invertible), you recover Newton’s method. This corresponds to choosing the metric at each point to be given by the Hessian (in the usual sense), which is the choice that makes the Hessian (in the coordinate-free sense) equal to the identity. This is a precise version of “the length at which the curvature of $f$ starts to matter,” and in principle it ameliorates the problem where gradient descent performs poorly in narrow valleys (regions where the Hessian (in the usual sense) is poorly conditioned), at least up to cubic and higher-order effects.
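On a quadratic this is easy to see concretely: preconditioning by the inverse Hessian (applied via `np.linalg.solve` rather than an explicit inverse; the example is mine) reaches the minimum in a single step, no matter how poorly conditioned the Hessian is:

```python
import numpy as np

def grad_f(x):
    # gradient of f(x) = 0.5 * (100 x_0^2 + x_1^2), a poorly conditioned quadratic
    return np.array([100.0 * x[0], x[1]])

def hess_f(x):
    # the Hessian of a quadratic is constant
    return np.diag([100.0, 1.0])

x = np.array([3.0, -4.0])
# Newton step: subtract the inverse Hessian times the gradient
x = x - np.linalg.solve(hess_f(x), grad_f(x))
```

For non-quadratic losses the same step is only exact up to the cubic and higher-order terms mentioned above, which is why Newton’s method is iterated rather than applied once.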

In general it’s expensive to compute the inverse Hessian, so a more practical thing to do is to use a matrix which approximates it in some sense. And now we’re well on the way towards quasi-Newton methods.
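For concreteness, a bare-bones BFGS-style sketch (my own illustration: it maintains an approximation `H` to the inverse Hessian built only from gradient differences, with an exact line search that is valid for quadratics standing in for a real one):

```python
import numpy as np

def grad_f(x):
    # gradient of the quadratic f(x) = 0.5 * (100 x_0^2 + x_1^2)
    return np.array([100.0 * x[0], x[1]])

def exact_step(x, d, g):
    # for a quadratic, A @ d = grad_f(x + d) - grad_f(x), so the exact
    # minimizing step length along d needs only gradient evaluations
    Ad = grad_f(x + d) - grad_f(x)
    return -(g @ d) / (d @ Ad)

x = np.array([1.0, 1.0])
H = np.eye(2)  # initial inverse-Hessian approximation
g = grad_f(x)
for _ in range(10):
    if np.linalg.norm(g) < 1e-8:
        break
    d = -H @ g                      # preconditioned (quasi-Newton) direction
    s = exact_step(x, d, g) * d     # step taken
    x_new = x + s
    g_new = grad_f(x_new)
    y = g_new - g                   # change in gradient
    rho = 1.0 / (y @ s)
    I = np.eye(2)
    # standard BFGS update of the inverse-Hessian approximation
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    x, g = x_new, g_new
```

The point is that `H` is never computed from second derivatives at all; it is inferred from how the gradient changes between steps, which is exactly the sense in which quasi-Newton methods approximate the inverse Hessian cheaply.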