## Singular value decomposition

As a warm-up to the subject of this blog post, consider the problem of how to classify $n \times m$ matrices $M \in \mathbb{R}^{n \times m}$ up to change of basis in both the source ($\mathbb{R}^m$) and the target ($\mathbb{R}^n$). In other words, the problem is to describe the equivalence classes of the equivalence relation on $n \times m$ matrices given by

$\displaystyle M \sim N \Leftrightarrow M = PNQ^{-1}, P \in GL_n(\mathbb{R}), Q \in GL_m(\mathbb{R})$.

It turns out that the equivalence class of $M$ is completely determined by its rank $r = \text{rank}(M)$. To prove this we construct some bases by induction. For starters, let $x_1 \in \mathbb{R}^m$ be a vector such that $y_1 = M x_1 \neq 0$; this is always possible unless $M = 0$. Next, let $x_2 \in \mathbb{R}^m$ be a vector such that $y_2 = M x_2$ is linearly independent of $y_1$; this is always possible unless $\text{rank}(M) = 1$.

Continuing in this way, we construct vectors $x_1, \dots x_r \in \mathbb{R}^m$ such that the vectors $y_1 = M x_1, \dots y_r = M x_r \in \mathbb{R}^n$ are linearly independent, hence a basis of the column space of $M$. Next, we complete the $x_i$ and $y_i$ to bases of $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively, in whatever manner we like. With respect to these bases, $M$ takes a very simple form: we have $M x_i = y_i$ if $1 \le i \le r$ and otherwise $M x_i = 0$. Hence, in these bases, $M$ is a block matrix where the top left block is an $r \times r$ identity matrix and the other blocks are zero.

Explicitly, this means we can write $M$ as a product

$\displaystyle M = PDQ^{-1}, P \in GL_n(\mathbb{R}), Q \in GL_m(\mathbb{R})$

where $D$ has the block form above, the columns of $P$ are the basis of $\mathbb{R}^n$ we found by completing $M x_1, \cdots M x_r$, and the columns of $Q$ are the basis of $\mathbb{R}^m$ we found by completing $x_1, \cdots x_r$. This decomposition can be computed by row and column reduction on $M$, where the row operations we perform give $P$ and the column operations we perform give $Q$.
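For readers who want to experiment, this decomposition is easy to produce numerically. The sketch below (assuming numpy; all variable names are ours, not from the post) avoids doing row and column reduction by hand and instead folds the nonzero singular values of a numerically computed SVD into $P$:

```python
import numpy as np

# A sketch of computing M = P D Q^{-1} numerically. From the SVD
# M = U S V^T, folding the nonzero singular values into U gives an
# invertible P with M = P D Q^{-1}, where Q = V.
rng = np.random.default_rng(0)
n, m, r = 4, 6, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, m))  # rank-2 matrix

U, s, Vt = np.linalg.svd(M)
tol = 1e-10
r_computed = int(np.sum(s > tol))

# D: n x m block matrix with an r x r identity in the top-left corner
D = np.zeros((n, m))
D[:r_computed, :r_computed] = np.eye(r_computed)

# fold the singular values into U to get an invertible P
scale = np.ones(n)
scale[:r_computed] = s[:r_computed]
P = U @ np.diag(scale)
Q = Vt.T  # orthogonal, hence invertible

assert r_computed == r
assert np.allclose(P @ D @ np.linalg.inv(Q), M)
```

This also illustrates that the decomposition is far from unique: any invertible $P, Q$ compatible with the block form of $D$ will do.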

Conceptually, the question we’ve asked is: what does a linear transformation $T : X \to Y$ between vector spaces “look like,” when we don’t restrict ourselves to picking a particular basis of $X$ or $Y$? The answer, stated in a basis-independent form, is the following. First, we can factor $T$ as a composite

$\displaystyle X \xrightarrow{p} \text{im}(T) \xrightarrow{i} Y$

where $\text{im}(T)$ is the image of $T$. Next, we can find direct sum decompositions $X \cong \text{im}(T) \oplus X'$ and $Y \cong \text{im}(T) \oplus Y'$ such that $p$ is the projection of $X$ onto its first factor and $i$ is the inclusion of the first factor into $Y$. Hence every linear transformation “looks like” a composite

$\displaystyle \text{im}(T) \oplus X' \xrightarrow{p_{\text{im}(T)}} \text{im}(T) \xrightarrow{i_{\text{im}(T)}} \text{im}(T) \oplus Y'$

of a projection onto a direct summand and an inclusion of a direct summand. So the only basis-independent information contained in $T$ is the dimension of the image $\text{im}(T)$, or equivalently the rank of $T$. (It’s worth considering the analogous question for functions between sets, whose answer is a bit more complicated.)

The actual problem this blog post is about is more interesting: it is to classify $n \times m$ matrices $M \in \mathbb{R}^{n \times m}$ up to orthogonal change of basis in both the source and the target. In other words, we now want to understand the equivalence classes of the equivalence relation given by

$\displaystyle M \sim N \Leftrightarrow M = UNV^{-1}, U \in O(n), V \in O(m)$.

Conceptually, we’re now asking: what does a linear transformation $T : X \to Y$ between finite-dimensional Hilbert spaces “look like”?

## Inventing singular value decomposition

As before, we’ll answer this question by picking bases with respect to which $M$ is as easy to understand as possible, only this time we need to deal with the additional restriction of choosing orthonormal bases. We will follow roughly the same inductive strategy as before. For starters, we would like to pick a unit vector $v_1 \in \mathbb{R}^m, \| v_1 \| = 1$ such that $Mv_1 \neq 0$; this is possible unless $M$ is identically zero, in which case there’s not much to say. Now, there’s no guarantee that $M v_1$ will be a unit vector, but we can always use

$\displaystyle u_1 = \frac{M v_1}{\| M v_1 \|} \in \mathbb{R}^n$

as the beginning of an orthonormal basis of $\mathbb{R}^n$. The question remains which of the many possible values of $v_1$ to use. In the previous argument it didn’t matter because they were all related by change of coordinates, but now it very much does because the length $\| M v_1 \|$ may differ for different choices of $v_1$. A natural choice is to pick $v_1$ so that $\| M v_1 \|$ is as large as possible (hence equal to the operator norm $\| M \|$ of $M$); writing $\sigma_1 = \| M v_1 \|$, we then have

$\displaystyle M v_1 = \sigma_1 u_1, \| v_1 \| = \| u_1 \| = 1$.
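As an aside, the maximizing $v_1$ can be found numerically. The sketch below (assuming numpy; a standard technique, not part of the argument) uses power iteration on $M^T M$, whose relevance to singular vectors is explained later in the post:

```python
import numpy as np

# Sketch: find v1 maximizing ||M v|| over unit vectors by power
# iteration on M^T M, whose top eigenvector is v1 (see "singular =
# eigen" later in the post).
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3))

v = rng.normal(size=3)
v /= np.linalg.norm(v)
for _ in range(500):            # power iteration on M^T M
    v = M.T @ (M @ v)
    v /= np.linalg.norm(v)

sigma1 = np.linalg.norm(M @ v)  # first singular value
u1 = M @ v / sigma1             # first left singular vector

# sigma1 agrees with the operator norm of M
assert np.isclose(sigma1, np.linalg.norm(M, 2))
```

For a generic matrix the iteration converges geometrically at rate $(\sigma_2/\sigma_1)^2$, so 500 iterations is far more than enough here.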

$\sigma_1$ is called the first singular value of $M$, $v_1$ is called its first right singular vector, and $u_1$ is called its first left singular vector. (The singular vectors aren’t unique in general, but we’ll ignore this for now.) To continue building orthonormal bases we need to find a unit vector

$\displaystyle v_2 \in \mathbb{R}^m, \| v_2 \| = 1, \langle v_1, v_2 \rangle = 0$

orthogonal to $v_1$ such that $M v_2$ is linearly independent of $M v_1$; this is possible unless $\text{rank}(M) = 1$, in which case we’re already done and $M$ is completely describable as $M_1 = \sigma_1 u_1 \otimes v_1^{\ast}$; equivalently, in this case we have

$\displaystyle M x = M_1 x = \sigma_1 u_1 \langle v_1, x \rangle$.

We’ll pick $v_2$ using the same strategy as before: we want the value of $v_2$ such that $\| M v_2 \|$ is as large as possible. Note that since $M_1 v_2 = 0$, this is equivalent to finding the value of $v_2$ such that $\| (M - M_1) v_2 \|$ is as large as possible. Call this largest possible value $\sigma_2$ and write

$\displaystyle M v_2 = \sigma_2 u_2, \| v_2 \| = \| u_2 \| = 1$.

At this point we are in trouble unless $\langle u_1, u_2 \rangle = 0$; if this weren’t the case then our strategy would fail to actually build an orthonormal basis of $\mathbb{R}^n$. Very importantly, this turns out to be the case.

Key lemma #1: Suppose $v_1$ is a unit vector maximizing $\| M v_1 \|$. Let $v$ be a unit vector orthogonal to $v_1$. Then $M v$ is also orthogonal to $M v_1$.

Proof. Consider the function

$\displaystyle f(t) = \| M(v_1 \cos t + v \sin t) \|^2$.

The vectors $v_1 \cos t + v \sin t$ are all unit vectors since $v_1, v$ are orthonormal, so by construction (of $v_1$) this function is maximized when $t = 0$. In particular, its derivative at $t = 0$ is zero. On the other hand, we can expand $f(t)$ out using dot products as

$\displaystyle f(t) = \| M v_1 \|^2 \cos^2 t + 2 \langle M v_1, M v \rangle \cos t \sin t + \| M v \|^2 \sin^2 t$.

Now we can compute the first-order Taylor series expansion of this function around $t = 0$, giving

$\displaystyle f(t) = \| M v_1 \|^2 + 2t \langle M v_1, M v \rangle + O(t^2)$

so setting the first derivative equal to $0$ gives $2 \langle M v_1, M v \rangle = 0$ as desired. $\Box$

This is the technical heart of singular value decomposition, so it’s worth understanding in some detail. Michael Nielsen has a very nice interactive demo / explanation of this. Geometrically, the points $M v_1 \cos t + M v \sin t$ trace out an ellipse centered at the origin, and by hypothesis $M v_1$ describes the semimajor axis of the ellipse: the point furthest away from the origin. As we move away from $t = 0$, to first order we are moving slightly in the direction of $M v$, and so if $M v$ were not orthogonal to $M v_1$ it would be possible to move slightly further away from the origin than $M v_1$ by moving either in the positive or negative $t$ direction, depending on whether the angle between $M v_1$ and $M v$ is greater than or less than $90^{\circ}$. The only way to ensure that moving in the direction of $Mv$ does not, to first order, get us further away from the origin is if $Mv$ is orthogonal to $M v_1$.

Note that this gives a proof that the semiminor axis of an ellipse – the point closest to the origin – is always orthogonal to its semimajor axis. We can think of key lemma #1 above as more or less being equivalent to this fact, also known as the principal axis theorem in the plane, and which is also closely related to but slightly weaker than the spectral theorem for real $2 \times 2$ matrices.
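A quick numerical sanity check of key lemma #1 (an illustration only, assuming numpy):

```python
import numpy as np

# Check key lemma #1: if v1 maximizes ||M v|| over unit vectors, then
# any unit vector v orthogonal to v1 has M v orthogonal to M v1.
rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))

_, _, Vt = np.linalg.svd(M)
v1 = Vt[0]                      # unit vector maximizing ||M v||

# build a random unit vector orthogonal to v1
v = rng.normal(size=4)
v -= (v @ v1) * v1
v /= np.linalg.norm(v)

assert np.isclose(v @ v1, 0)
assert np.isclose((M @ v) @ (M @ v1), 0)
```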

Thanks to key lemma #1, we can continue our construction. With $r = \text{rank}(M)$ as before, we inductively produce orthonormal vectors $v_1, \dots v_r \in \mathbb{R}^m$ such that $\| M v_i \|$ is maximized subject to the condition that $\langle v_i, v_j \rangle = 0$ for all $j \le i-1$, and write

$\displaystyle M v_i = \sigma_i u_i, \| u_i \| = 1$

where $\sigma_i$ is the maximum value of $\| M v \|$ over all unit vectors $v$ orthogonal to $v_1, \dots v_{i-1}$; note that this implies that

$\displaystyle \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$.

The $\sigma_i$ are the singular values of $M$, the $v_i$ are its right singular vectors, and the $u_i$ are its left singular vectors. Repeated application of key lemma #1 shows that the $u_i$ are an orthonormal basis of the column space of $M$, so the construction stops here: $M$ is identically zero on the orthogonal complement of $\text{span}(v_1, \dots v_r)$, because if it weren’t then it would take a value orthogonal to $\text{span}(u_1, \dots u_r)$. This means we can write $M$ as a sum

$\displaystyle M = \sum_{i=1}^r \sigma_i u_i \otimes v_i^{\ast}$.

This is one version of the singular value decomposition (SVD for short) of $M$, and it has the benefit of being as close to unique as possible. A more familiar version of SVD is obtained by completing the $v_i$ and $u_i$ to orthonormal bases of $\mathbb{R}^m$ and $\mathbb{R}^n$ (necessarily highly non-unique in general). With respect to these bases, $M$ takes a block form as in the warm-up, where the top left block is the diagonal matrix with entries $\sigma_1, \dots \sigma_r$ and the remaining blocks are zero. Hence we can write $M$ as a product

$\displaystyle M = U \Sigma V^T, U \in O(n), V \in O(m)$

where $\Sigma$ has the above block form, $U$ has columns given by $u_1, \dots u_n$, and $V$ has columns given by $v_1, \dots v_m$.
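Both forms of the decomposition are easy to check numerically. A sketch assuming numpy, whose `np.linalg.svd` returns exactly the factors $U$, $\Sigma$, $V^T$ described above (with $\Sigma$ given as a vector of singular values):

```python
import numpy as np

# Verify both forms of the SVD: the product U Sigma V^T and the sum
# of rank-one matrices sigma_i u_i v_i^T.
rng = np.random.default_rng(3)
n, m = 4, 6
M = rng.normal(size=(n, m))

U, s, Vt = np.linalg.svd(M)     # full SVD: U is n x n, Vt is m x m
Sigma = np.zeros((n, m))
Sigma[:n, :n] = np.diag(s)      # s has min(n, m) = n entries here

# product form M = U Sigma V^T
assert np.allclose(U @ Sigma @ Vt, M)

# sum-of-rank-one form M = sum_i sigma_i u_i v_i^T
M_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(M_sum, M)
```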

So, stepping back a bit: what have we learned about what a linear transformation $T : X \to Y$ between Hilbert spaces looks like? Up to orthogonal change of basis, we’ve learned that they all look like “weighted projections”: we are almost projecting onto the image as in the warmup, except with weights given by the singular values $\sigma_i$ to account for changes in length. The only orthogonal-basis-independent information contained in a linear transformation turns out to be its singular values.

Looking for more analogies between singular value decomposition and the warmup, we might think of the singular values as a quantitative refinement of the rank, since there are $r$ of them where $r$ is the rank, and if some of them are small then $T$ is close (in the operator norm) to a linear transformation having lower rank.

Geometrically, one way to describe the answer provided by singular value decomposition to the question “what does a linear transformation look like” is that the key to understanding $T$ is to understand what it does to the unit sphere of $X$. The image of the unit sphere is an $r$-dimensional ellipsoid, whose principal axes have directions given by the left singular vectors $u_i$ and lengths given by the singular values $\sigma_i$; the right singular vectors $v_i$ are the unit vectors that map to these principal axes.

## Properties

Singular value decomposition has lots of useful properties, some of which we’ll prove here. First, note that taking the transpose of a singular value decomposition $M = U \Sigma V^T$ gives another singular value decomposition

$\displaystyle M^T = V \Sigma^T U^T$

showing that $M^T$ has the same singular values as $M$, but with the left and right singular vectors swapped. This can be proven more conceptually as follows.

Key lemma #2: Write $B(u, v) = \langle u, Mv \rangle = \langle M^T u, v \rangle$. Then for every $1 \le i \le r$, the left and right singular vectors $u_i, v_i$ maximize the value of $B(u, v)$ subject to the constraints that $\| u \| = \| v \| = 1$, that $u$ is orthogonal to $u_j, j \le i-1$, and that $v$ is orthogonal to $v_j, j \le i-1$. This maximum value is $\sigma_i$.

Proof. At the maximum value of $B(u, v)$ subject to the constraint that $u$ is orthogonal to $u_j, j \le i-1$ and $v$ is orthogonal to $v_j, j \le i-1$, it must also be the case that if we fix $v$ then $B(-, v)$ takes its maximum value at $u$. But for fixed $v$, $B(u, v) = \langle u, M v \rangle$ uniquely takes its maximum value when $u$ is proportional to $Mv$ (if possible), hence must in fact be equal to $\frac{Mv}{\| M v \|}$; moreover, this is always possible thanks to key lemma #1. So we are in fact maximizing

$\displaystyle \left\langle \frac{Mv}{\| M v \|}, M v \right\rangle = \| M v \|$

subject to the above constraints and we already know the solution is given by $v = v_i$. $\Box$

Left-right symmetry: Let $\sigma_i, u_i, v_i$ be the singular values, left singular vectors, and right singular vectors of $M$ as above. Then $\sigma_i, v_i, u_i$ are the singular values, left singular vectors, and right singular vectors of $M^T$. In particular, $M^T u_i = \sigma_i v_i$.

Proof. Apply key lemma #2 to $M^T$, and note that $B(u, v)$ is the same either way, just with the roles of $u$ and $v$ switched. $\Box$

Singular = eigen: The left singular vectors $u_i$ are the eigenvectors of $M M^T$ corresponding to its nonzero eigenvalues, which are $\sigma_i^2, 1 \le i \le r$. The right singular vectors $v_i$ are the eigenvectors of $M^T M$ corresponding to its nonzero eigenvalues, which are also $\sigma_i^2, 1 \le i \le r$.

Proof. We now know that $M v_i = \sigma_i u_i$ and that $M^T u_i = \sigma_i v_i$, hence

$\displaystyle M^T M v_i = M^T (\sigma_i u_i) = \sigma_i^2 v_i$

and

$\displaystyle M M^T u_i = M (\sigma_i v_i) = \sigma_i^2 u_i$.

Hence $v_i, u_i$ are orthonormal eigenvectors of $M^T M, M M^T$ respectively. Moreover, these matrices have rank at most (in fact exactly) $r$, so this exhausts all eigenvectors corresponding to nonzero eigenvalues. $\Box$
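A numerical check of the “singular = eigen” correspondence (illustration only, assuming numpy):

```python
import numpy as np

# Check that the squared singular values of M are the nonzero
# eigenvalues of both M M^T and M^T M.
rng = np.random.default_rng(4)
M = rng.normal(size=(3, 5))

s = np.linalg.svd(M, compute_uv=False)      # singular values of M
eig_MMt = np.linalg.eigvalsh(M @ M.T)       # eigenvalues of M M^T
eig_MtM = np.linalg.eigvalsh(M.T @ M)       # eigenvalues of M^T M

assert np.allclose(np.sort(s**2), np.sort(eig_MMt))
# M^T M is 5 x 5 of rank 3: the top 3 eigenvalues are the sigma_i^2,
# and the remaining two are (numerically) zero
assert np.allclose(np.sort(eig_MtM)[-3:], np.sort(s**2))
assert np.allclose(np.sort(eig_MtM)[:2], 0)
```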

This gives an alternative route to understanding singular value decomposition which comes from writing $\| M v \|^2$ as

$\displaystyle \| M v \|^2 = \langle M v, M v \rangle = \langle v, M^T M v \rangle$

and then applying the spectral theorem to $M^T M$ to diagonalize, but I think it’s worth knowing that there’s a route to singular value decomposition which is independent of the spectral theorem.

In addition to the above algebraic characterization, the singular values also admit the following variational characterization:

$\displaystyle \sigma_k = \max_{V \subseteq \mathbb{R}^m, \dim V = k} \min_{v \in V, \| v \| = 1} \| M v \|$

and

$\displaystyle \sigma_{k+1} = \min_{V \subseteq \mathbb{R}^m, \dim V = m-k} \max_{v \in V, \| v \| = 1} \| M v \|$.

Proof. For the first characterization, any $k$-dimensional subspace $V$ intersects $\text{span}(v_k, \dots v_m)$ nontrivially, hence contains a unit vector of the form

$\displaystyle v = \sum_{i=k}^m c_i v_i, \| v \|^2 = \sum_{i=k}^m c_i^2 = 1$.

We compute that

$\displaystyle M v = \sum_{i=k}^m c_i \sigma_i u_i$

and hence that

$\displaystyle \| M v \|^2 = \sum_{i=k}^m c_i^2 \sigma_i^2 \le \sigma_k^2$.

We conclude that every $V$ contains $v$ such that $\| M v \| \le \sigma_k$, hence $\min_{v \in V, \| v \| = 1} \| M v \| \le \sigma_k$. Equality is obtained when $V = \text{span}(v_1, \dots v_k)$.

The second characterization is very similar. Any $(m-k)$-dimensional subspace $V$ intersects $\text{span}(v_1, \dots v_{k+1})$ nontrivially, hence contains a unit vector of the form

$\displaystyle v = \sum_{i=1}^{k+1} c_i v_i, \| v \|^2 = \sum_{i=1}^{k+1} c_i^2 = 1$.

We compute that

$\displaystyle M v = \sum_{i=1}^{k+1} c_i \sigma_i u_i$

and hence that

$\displaystyle \| M v \|^2 = \sum_{i=1}^{k+1} c_i^2 \sigma_i^2 \ge \sigma_{k+1}^2$.

We conclude that every $V$ contains a vector $v$ such that $\| M v \| \ge \sigma_{k+1}$, hence $\max_{v \in V, \| v \| = 1} \| M v \| \ge \sigma_{k+1}$. Equality is obtained when $V = \text{span}(v_{k+1}, \dots v_m)$. $\Box$
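A spot-check of both variational characterizations (a sketch, assuming numpy): checking the max-min over all subspaces is infeasible, so we sample unit vectors in the two extremal subspaces $\text{span}(v_1, \dots v_k)$ and $\text{span}(v_{k+1}, \dots v_m)$.

```python
import numpy as np

# On span(v_1..v_k), every unit vector v has ||M v|| >= sigma_k;
# on span(v_{k+1}..v_m), every unit vector v has ||M v|| <= sigma_{k+1}.
rng = np.random.default_rng(5)
n, m, k = 6, 5, 2
M = rng.normal(size=(n, m))
U, s, Vt = np.linalg.svd(M)

def random_unit_vectors_in(rows, count=2000):
    # rows: orthonormal vectors spanning the subspace
    c = rng.normal(size=(count, rows.shape[0]))
    c /= np.linalg.norm(c, axis=1, keepdims=True)
    return c @ rows

norms_top = np.linalg.norm(random_unit_vectors_in(Vt[:k]) @ M.T, axis=1)
norms_bot = np.linalg.norm(random_unit_vectors_in(Vt[k:]) @ M.T, axis=1)

# note 0-based indexing: s[k-1] is sigma_k, s[k] is sigma_{k+1}
assert norms_top.min() >= s[k-1] - 1e-9
assert norms_bot.max() <= s[k] + 1e-9
```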

The second variational characterization above can be used to prove the following important theorem.

Low rank approximation (Eckart, Young): If $M = U \Sigma V^T$ is the SVD of $M$, let $M_k = U \Sigma_k V^T$ where $\Sigma_k$ has diagonal entries $\sigma_1, \dots \sigma_k$ and all other entries zero. Then $M_k$ is the closest matrix to $M$ in operator norm with rank at most $k$; that is, $M_k$ minimizes $\| M - X \|$ subject to the constraint that $\text{rank}(X) \le k$. This minimum value is $\sigma_{k+1}$.

Proof. Suppose $X$ is a matrix of rank at most $k$. Let $W = \text{ker}(X)$ be the nullspace of $X$, which by hypothesis has dimension at least $m-k$. By the second variational characterization above, this means that it contains a vector $w$ such that $\| Mw \| \ge \sigma_{k+1}$, and since $Xw = 0$ this gives

$\| (M - X) w \| = \| M w \| \ge \sigma_{k+1}$

and hence that $\| M - X \| \ge \sigma_{k+1}$. Equality is obtained when $X = M_k$ as defined above. $\Box$

The variational characterizations can also be used to prove the following inequality relating the singular values of two matrices and of their sum, which can be thought of as a quantitative refinement of the observation that the rank of a sum $M + N$ of two matrices is at most the sum of their ranks.

Additive perturbation (Weyl): Let $M, N$ be $n \times m$ matrices with singular values $\sigma_i(M), \sigma_i(N)$. Then

$\displaystyle \sigma_{k+\ell+1}(M+N) \le \sigma_{k+1}(M) + \sigma_{\ell+1}(N)$.

Proof. We want to bound $\sigma_{k+\ell+1}(M + N)$ in terms of the singular values of $M$ and $N$. By the second variational characterization, we have

$\displaystyle \sigma_{k+\ell+1}(M + N) = \min_{V \subseteq \mathbb{R}^m, \dim V = m-k-\ell} \max_{v \in V, \| v \| = 1} \| (M + N) v \|$.

To give an upper bound on a minimum value of a function we just need to give an upper bound on some value that it takes. Let $V_M$ and $V_N$ be the subspaces of $\mathbb{R}^m$ of dimensions $m - k, m - \ell$ respectively which achieve the minimum values of $\max_{v \in V_M, \| v \| = 1} \| M v \|$ and $\max_{v \in V_N, \| v \| = 1} \| N v \|$ respectively, and let $W = V_M \cap V_N$ be their intersection. This intersection has dimension at least $m - k - \ell$, and by construction

$\displaystyle \max_{v \in W, \| v \| = 1} \| Mv + Nv \| \le \max_{v \in W, \| v \| = 1} \left( \| M v \| + \| N v \| \right) \le \sigma_{k+1}(M) + \sigma_{\ell+1}(N)$.

Since $W$ has dimension at least $m - k - \ell$, the above is an upper bound on the value of $\max_{v \in V, \| v \| = 1} \| (M + N) v \|$ for any $(m-k-\ell)$-dimensional subspace $V \subseteq W$, from which the conclusion follows. $\Box$

The slightly curious off-by-one indexing in the above inequality can be understood as follows: if $\sigma_{k+1}(M)$ and $\sigma_{\ell+1}(N)$ are both very small, this means that $M$ and $N$ are close to matrices of rank at most $k$ and $\ell$ respectively, and hence $M+N$ is close to a matrix of rank at most $k+\ell$, hence $\sigma_{k+\ell+1}(M+N)$ also ought to be small.
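The inequality is easy to test exhaustively on small matrices (a sketch, assuming numpy; note the shift to 0-based indexing, so $\sigma_{k+1}$ becomes `s[k]`):

```python
import numpy as np

# Check Weyl's inequality sigma_{k+l+1}(M+N) <= sigma_{k+1}(M) + sigma_{l+1}(N)
# for all valid k, l, in 0-based indexing.
rng = np.random.default_rng(7)
n, m = 6, 5
M = rng.normal(size=(n, m))
N = rng.normal(size=(n, m))

s_M = np.linalg.svd(M, compute_uv=False)
s_N = np.linalg.svd(N, compute_uv=False)
s_sum = np.linalg.svd(M + N, compute_uv=False)

for k in range(m):
    for l in range(m):
        if k + l + 1 <= m:      # sigma_{k+l+1} must exist
            assert s_sum[k + l] <= s_M[k] + s_N[l] + 1e-9
```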

Setting $\ell = 0$ in the additive perturbation inequality we deduce the following corollary.

Singular values are Lipschitz: The singular values, as functions on matrices, are uniformly Lipschitz with respect to the operator norm with Lipschitz constant $1$: that is,

$\displaystyle \left| \sigma_k(M) - \sigma_k(N) \right| \le \| M - N \|$.

Proof. Apply additive perturbation twice with $\ell = 0$, first to get

$\displaystyle \sigma_k(M) \le \sigma_k(N) + \sigma_1(M-N)$

(remembering that $\sigma_1$ is the operator norm), and second to get

$\displaystyle \sigma_k(N) \le \sigma_k(M) + \sigma_1(N-M)$

(remembering that $\sigma_1(N-M) = \sigma_1(M-N)$). $\Box$
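A numerical illustration of the Lipschitz property (assuming numpy):

```python
import numpy as np

# Check that every singular value moves by at most ||M - N|| under a
# perturbation of M.
rng = np.random.default_rng(8)
M = rng.normal(size=(5, 5))
N = M + 0.01 * rng.normal(size=(5, 5))   # small perturbation of M

s_M = np.linalg.svd(M, compute_uv=False)
s_N = np.linalg.svd(N, compute_uv=False)
gap = np.linalg.norm(M - N, 2)

assert np.all(np.abs(s_M - s_N) <= gap + 1e-12)
```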

This is very much not the case with eigenvalues: a small perturbation of a square matrix can have a large effect on its eigenvalues. This is explained e.g. in this blog post by Terence Tao, and is related to pseudospectra.

Setting $\sigma_{\ell+1}(N) = 0$, or equivalently $\text{rank}(N) \le \ell$, in the additive perturbation inequality, we deduce the following corollary.

Interlacing: Suppose $M, N$ are matrices such that $\text{rank}(M-N) \le \ell$. Then

$\displaystyle \sigma_{k+\ell}(M) \le \sigma_k(N) \le \sigma_{k-\ell}(M)$.

Proof. Apply additive perturbation twice, first to get

$\displaystyle \sigma_{k+\ell}(M) \le \sigma_k(N) + \sigma_{\ell+1}(M-N) = \sigma_k(N)$

and second to get

$\displaystyle \sigma_k(N) \le \sigma_{k-\ell}(M) + \sigma_{\ell+1}(N-M) = \sigma_{k-\ell}(M)$. $\Box$

Interlacing gives us some control over what happens to the singular values under a low-rank perturbation (as opposed to a low-norm perturbation; a low-rank perturbation may have arbitrarily high norm, and vice versa). For example, we learn that if all of the singular values are clumped together, then a rank-$\ell$ perturbation will keep most of the singular values clumped together, except possibly for either the $\ell$ largest or $\ell$ smallest singular values. We can’t expect any control over these, since in the worst case a rank-$\ell$ perturbation can make the $\ell$ largest singular values arbitrarily large, or make the $\ell$ smallest singular values arbitrarily small.

A particular special case of a low-rank perturbation is deleting a small number of rows or columns (note that a row or column which is entirely zero does not affect the singular values, so deleting a row or column is equivalent to setting all of its entries to zero), in which case the upper bound above can be tightened.

Cauchy interlacing: Suppose $M$ is a matrix and $N$ is obtained from $M$ by deleting at most $\ell$ rows. Then

$\displaystyle \sigma_k(M) \ge \sigma_k(N) \ge \sigma_{k+\ell}(M)$.

Proof. The lower bound follows from interlacing. The upper bound follows from the observation that we have $\| N v \| \le \| M v \|$ for all $v$, then applying either variational characterization of the singular values.  $\Box$

Cauchy interlacing also applies to deleting columns, or combinations of rows and columns, because the singular values are unchanged by transposition. In particular, we learn that if $N$ is obtained from $M$ by deleting either a single row or a single column, then the singular values of $N$ interlace with the singular values of $M$, hence the name.

In particular, if all of the singular values of $M$ are clumped together then so are those of $N$, with no exceptions. Taking the contrapositive, if the singular values of $N$ are spread out, then the singular values of $M$ must be as well.
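Cauchy interlacing under row deletion can be checked directly (a sketch, assuming numpy):

```python
import numpy as np

# Check sigma_k(M) >= sigma_k(N) >= sigma_{k+ell}(M) when N is obtained
# from M by deleting ell rows (0-based indexing below).
rng = np.random.default_rng(9)
M = rng.normal(size=(7, 5))
ell = 2
N = M[:-ell]                    # delete the last ell = 2 rows

s_M = np.linalg.svd(M, compute_uv=False)   # 5 singular values
s_N = np.linalg.svd(N, compute_uv=False)   # 5 singular values

for k in range(len(s_N)):
    assert s_M[k] >= s_N[k] - 1e-12            # sigma_k(M) >= sigma_k(N)
    if k + ell < len(s_M):
        assert s_N[k] >= s_M[k + ell] - 1e-12  # sigma_k(N) >= sigma_{k+ell}(M)
```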

## Three special cases

Three special cases of the general singular value decomposition $M = U \Sigma V^T$ are worth pointing out.

First, if $M$ has orthogonal columns, or equivalently if $M^T M$ is diagonal, then the singular values $\sigma_i$ are the lengths of its columns, we can take the right singular vectors to be the standard basis vectors $v_i = e_i$, and we can take the left singular vectors to be the unit rescalings of its columns. This means that we can take $V = I$ to be the identity matrix, and in general suggests that $\| I - V \|$ is a measure of the extent to which the columns of $M$ fail to be orthogonal (with the caveat that $V$ is not unique and so in general we would want to look at the $V$ closest to $I$).

Second, if $M$ has orthogonal rows, or equivalently if $M M^T$ is diagonal, then the singular values $\sigma_i$ are the lengths of its rows, we can take the left singular vectors to be the standard basis vectors $u_i = e_i$, and we can take the right singular vectors to be the unit rescalings of its rows. This means that we can take $U = I$ to be the identity matrix, and in general suggests that $\| I - U \|$ is a measure of the extent to which the rows of $M$ fail to be orthogonal (with the same caveat as above).

Finally, if $M$ is square and an orthogonal matrix, so that $M^T M = M M^T = I$, then the singular values $\sigma_i$ are all equal to $1$, and an arbitrary choice of either the left or the right singular vectors uniquely determines the other. This means that we can take $\Sigma = I$ to be the identity matrix, and in general suggests that $\| I - \Sigma \|$ is a measure of the extent to which $M$ fails to be orthogonal. In fact it is possible to show that the closest orthogonal matrix to $M = U \Sigma V^T$ is given by $U V^T$, or in other words by replacing all of the singular values of $M$ with $1$, so

$\displaystyle \| I - \Sigma \| = \max_i |1 - \sigma_i|$

is precisely the distance from $M$ to the nearest orthogonal matrix. This fact can be used to solve the orthogonal Procrustes problem.
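This fact, too, is easy to verify numerically (a sketch, assuming numpy):

```python
import numpy as np

# Check that U V^T is orthogonal and sits at operator-norm distance
# max_i |1 - sigma_i| from M, as claimed.
rng = np.random.default_rng(10)
M = rng.normal(size=(4, 4))
U, s, Vt = np.linalg.svd(M)

Omega = U @ Vt                  # candidate nearest orthogonal matrix
assert np.allclose(Omega @ Omega.T, np.eye(4))
assert np.isclose(np.linalg.norm(M - Omega, 2), np.max(np.abs(1 - s)))
```

The distance computation works because $M - UV^T = U(\Sigma - I)V^T$, and multiplying by orthogonal matrices preserves the operator norm.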

In general, we should expect that the SVD of a matrix $M$ is relevant to answering any question about $M$ whose answer is invariant under left and right multiplication by orthogonal matrices. This includes, for example, the question of low-rank approximations to $M$ with respect to operator norm we answered above, since both rank and operator norm are invariant.

### 4 Responses

1. Btw. In the proof of interlacing, I don’t see the first displayed equality: $\sigma_k(N) + \sigma_{\ell +1}(M-N) = \sigma_{k+\ell}(N)$. What am I missing?

• Oops, that’s a typo; it should just be $\sigma_k(N)$ on the RHS.

2. Thanks for that concise and clear introduction to the SVD. I do not understand why it is often not even touched in a first class in linear algebra. It seems to me that it would make sense to introduce it even before the spectral theorem.

Regarding “weighted projections”: up to a scale factor (i.e., a single weight $\sigma_1$), you can view a linear transformation
$T: X \to Y$ between Euclidean spaces as an orthogonal projection.

Specifically, if $X$ and $Y$ are subspaces of $\mathbb{R}^N$ of dimension $n$ and $m$ respectively, and $P_{YX}$ is the orthogonal projection onto $Y$ restricted to $X$, then, the singular values of $P_{YX}$ are the cosines of the principal angles
https://en.wikipedia.org/wiki/Angles_between_flats
If $N \geq m+n$, then these singular values can take any value in $[0,1]$.

So we see that any $T: X \to Y$ can be represented as $\sigma_1(T)P_{YX}$ for an appropriate choice of $X, Y$ as subspaces of $\mathbb{R}^{m+n-1}$.

Also, regarding the best orthogonal transformation to represent $M$, it is worth pointing out that you are talking about the orthogonal factor in the polar decomposition
https://en.wikipedia.org/wiki/Polar_decomposition
which is an immediate consequence of the SVD. We can always represent our matrix $M$ as a composition of an orthogonal matrix and a positive semidefinite matrix:
$M = \Phi R = R' \Phi$,
where $\Phi = UV^T$, $R = V \Sigma V^T$, and $R' = U \Sigma U^T$.

3. Ammar Husain, March 15, 2017:

Thinking of the QR algorithm and the Toda lattice, you gain when you swap and refactor repeatedly as dressing transformations. Haven’t thought about what if anything reasonable happens when you permute the factors in a square SVD.