*Note: this is a repost of a Facebook status I wrote off the cuff about a year ago, lightly edited. As such it has a different style from my other posts, but I still wanted to put it somewhere where it’d be easier to find and share than Facebook. *

Gradient descent, in its simplest where you just subtract the gradient of your loss function , is not dimensionally consistent: if the parameters you’re optimizing over have units of length, and the loss function is dimensionless, then the derivatives you’re subtracting have units of inverse length.

This observation can be used to reinvent the learning rate, which, for dimensional consistency, must have units of length squared. It also suggests that the learning rate ought to be set to something like for some kind of characteristic length scale , which loosely speaking is the length at which the curvature of starts to matter.

It might also make sense to give different parameters different units, which suggests furthermore that one might want a different learning rate for each parameter, or at least that one might want to partition the parameters into different subsets and choose different learning rates for each.

Going much further, from an abstract coordinate-free point of view the extra information you need to compute the gradient of a smooth function is a choice of (pseudo-)Riemannian metric on parameter space, which if you like is a gigantic hyperparameter you can try to optimize. Concretely this amounts to a version of preconditioned gradient descent where you allow yourself to multiply the gradient (in the coordinate-dependent sense) by a symmetric (invertible, ideally positive definite) matrix which is allowed to depend on the parameters. In the first paragraph this matrix was a constant scalar multiple of identity and in the third paragraph this matrix was constant diagonal.

This is an extremely general form of gradient descent, general enough to be equivariant under arbitrary smooth change of coordinates: that is, if you do this form of gradient descent and then apply a diffeomorphism to parameter space, you are still doing this form of gradient descent, with a different metric. For example, if you pick the preconditioning matrix to be the inverse Hessian (in the usual sense, assuming it’s invertible), you recover Newton’s method. This corresponds to choosing the metric at each point to be given by the Hessian (in the usual sense), which is the choice that makes the Hessian (in the coordinate-free sense) equal to the identity. This is a precise version of “the length at which the curvature of starts to matter” and in principle ameliorates the problem where gradient descent performs poorly in narrow valleys (regions where the Hessian (in the usual sense) is poorly conditioned), at least up to cubic and higher order effects.

In general it’s expensive to compute the inverse Hessian, so a more practical thing to do is to use a matrix which approximates it in some sense. And now we’re well on the way towards quasi-Newton methods.

on March 19, 2018 at 10:22 pm |rajeshd007Having different learning rates for different parameters is by definition, not gradient descent, unless there isn’t a mathematically sensible way of choosing them. In a sense you are playing catch 22, where in you are searching for best paraemeters, and apriori you are making assumptions about their relative scaling (by having different learning rates), which makes your entire solution a heuristic one rather than gradient descent.

on February 16, 2018 at 6:40 am |RobertThere are various “gradient descent strategies” that are not stateless: e.g. momentum. Can you transform an optimization problem in such a way (this obviously requires making the parameter space larger) that momentum becomes SGD?

on February 8, 2018 at 3:38 am |Kram EinsnulldreizweiJust a heads up: That final link is broken. You accidentally put the url twice.

on February 8, 2018 at 9:38 am |Qiaochu YuanFixed, thanks!