Notes - Numerical Analysis HT24, Singular value decomposition
Flashcards
Quickly prove that given any $A \in \mathbb R^{m \times n}$, there exists an SVD
\[A = U\Sigma V^\top\]
where $U \in \mathbb R^{m \times n}$, $\Sigma \in \mathbb R^{n \times n}$ and $V \in \mathbb R^{n \times n}$ (i.e. the thin SVD).
Consider the symmetric matrix $A^\top A$. By the spectral theorem it can be orthogonally diagonalised, and since its eigenvalues are real and non-negative we may write $A^\top A = VD^2 V^\top$ for some orthogonal $V$ and diagonal $D$, both $n \times n$. (This could be done using e.g. the QR algorithm.) (Why write $D^2$ here? Because the eigenvalues are non-negative, each can be written as the square of a real number, which is the form we will want later.)
Let $B = AV$. Since $B^\top B = D^2$, this implies the columns of $B$ are pairwise orthogonal (though not necessarily orthonormal) and $B$ is an $m \times n$ matrix.
In some sense, we are almost there. If we were to write $B D V^\top$, then this would become $AV D V^\top$. If we could eliminate that factor of $D$ in the middle, it would become $AVV^\top = A$, since $VV^\top = I$ ($V$ is square and orthogonal). We also need to handle the fact that $B$ doesn’t have orthonormal columns, so the rest of the analysis is working out how we can modify $B$ into some matrix $U$ that satisfies the properties we need.
The full rank case: If $\text{rank}(A) = n$, then $\text{rank}(A^\top A) = n$ and $D^2 = \text{diag}(\lambda _ 1, \cdots, \lambda _ n)$ where each $\lambda _ i > 0$. (This is not an obvious fact; it follows from the identity $\text{rank}(A) = \text{rank}(A^\top A)$, which can be seen by showing that these two matrices have the same kernel).
Then $D^{-1} = \text{diag}(1/\sqrt{\lambda _ 1}, \cdots, 1/\sqrt{\lambda _ n})$, and by letting $U = BD^{-1}$ we ensure that $U^\top U = I$ (since $U^\top U = D^{-1} D^2 D^{-1} = I$). Then letting $\Sigma = D$ gives
\[A = U\Sigma V^\top\]since $U\Sigma V^\top = BD^{-1}D V^\top = A V V^\top = A$ as required. (Why not write $\Sigma = D$ in the first place? Because in the rank deficient case, $\Sigma$ and $D$ don’t exactly correspond).
The rank deficient case: If $\text{rank}(A) = r < n$, then $D^2 = \text{diag}(\lambda _ 1, \cdots, \lambda _ r, 0, \cdots, 0)$ for strictly positive $\lambda _ i$.
This implies the last $n - r$ columns of $B$ are $0$ (the diagonal of $B^\top B = D^2$ contains the squared norm of each column of $B$, so a zero on the diagonal must correspond to a zero column).
Let $D _ r = \text{diag}(\sqrt{\lambda _ 1}, \cdots, \sqrt{\lambda _ r})$, i.e. the top-left $r \times r$ block of $D$. Then
\[B \begin{bmatrix} D_{r}^{-1} & \\\\ & I_{n-r} \end{bmatrix} = [U_1; 0]\]for some $U _ 1$ with orthonormal columns (the first $r$ columns of $B$, which have norms $\sqrt{\lambda _ i}$, get normalised, and the last $n - r$ columns are already zero). Then
\[A = [U_1; 0] \begin{bmatrix} D_{r} & \\\\ & I_{n-r} \end{bmatrix} V^\top\]since
\[[U_1; 0] \begin{bmatrix} D_{r} & \\\\ & I_{n-r} \end{bmatrix} V^\top = B \begin{bmatrix} D_{r}^{-1} & \\\\ & I_{n-r} \end{bmatrix} \begin{bmatrix} D_{r} & \\\\ & I_{n-r} \end{bmatrix} V^\top = BV^\top = A\](this is just substituting the expression for $[U _ 1; 0]$, together with $BV^\top = AVV^\top = A$). But then
\[\begin{aligned} A &= [U_1; 0] \begin{bmatrix} D_{r} & \\\\ & I_{n-r} \end{bmatrix} V^\top \\\\ &= [U_1; U_2] \begin{bmatrix} D_{r} & \\\\ & 0 \end{bmatrix} V^\top \\\\ &= U\Sigma V^\top \end{aligned}\]where $U _ 2$ is any $m \times (n - r)$ matrix whose columns are orthonormal and orthogonal to those of $U _ 1$ (the swap of the middle factor is valid because both the zero block times $I _ {n-r}$ and $U _ 2$ times the zero block vanish), $U$ is defined as $[U _ 1; U _ 2]$ and $\Sigma$ is defined as $\begin{bmatrix} D _ {r} & \\ & 0 \end{bmatrix}$.
(note that this is still the “thin” SVD, since $U$ is $m \times n$. If we wished to construct the “full” SVD, we could add more orthogonal columns to $U$ and more zero rows to $\Sigma$ in order to make $U$ an $m \times m$ matrix and $\Sigma$ an $m \times n$ matrix).
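To make the construction concrete, here is a minimal numpy sketch of the full-rank case of the proof (my own illustration, not from the course; forming $A^\top A$ explicitly like this is fine as a sanity check but is not good numerical practice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.standard_normal((m, n))       # a generic A, so rank(A) = n almost surely

# Diagonalise the symmetric matrix A^T A = V D^2 V^T
lam, V = np.linalg.eigh(A.T @ A)      # eigenvalues in ascending order
order = np.argsort(lam)[::-1]         # reorder so sigma_1 >= ... >= sigma_n
lam, V = lam[order], V[:, order]

B = A @ V                             # columns pairwise orthogonal, with norms sqrt(lam_i)
sigma = np.sqrt(lam)                  # the singular values (the diagonal of D)
U = B / sigma                         # U = B D^{-1}: orthonormal columns in the full-rank case

assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose((U * sigma) @ V.T, A)   # A = U Sigma V^T
```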
Suppose we have a matrix $A \in \mathbb R^{m \times n}$. By breaking down the cases $m \le n$ and $m \ge n$, what are all the possible types of SVD factorisation, and what are the dimensions of all the matrices involved?
When $m \le n$,
“Fat SVD”: $A = U\Sigma V^\top$, $U \in \mathbb R^{m \times m}$, $\Sigma \in \mathbb R^{m \times m}$, $V \in \mathbb R^{n \times m}$ (so that $V^\top \in \mathbb R^{m \times n}$).
When $m \ge n$,
“Full SVD”: $A = U\Sigma V^\top$, $U \in \mathbb R^{m \times m}$, $\Sigma \in \mathbb R^{m \times n}$, $V \in \mathbb R^{n \times n}$.
“Thin SVD”: $A = U\Sigma V^\top$, $U \in \mathbb R^{m \times n}$, $\Sigma \in \mathbb R^{n \times n}$, $V \in \mathbb R^{n \times n}$.
(note that the decomposition where $\Sigma$ is $r \times r$, with $r$ the rank of $A$, which appears in the machine learning course, doesn’t appear in the numerical analysis course).
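As a quick illustration of these shapes (my own numpy sketch, not from the course; note that `np.linalg.svd` returns $V^\top$ rather than $V$ as its third output):

```python
import numpy as np

A = np.random.default_rng(1).standard_normal((5, 3))     # m = 5 >= n = 3

# Full SVD: U is m x m, Sigma is m x n (returned as the n singular values), V^T is n x n
U_full, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U_full.shape, s.shape, Vt.shape)                    # (5, 5) (3,) (3, 3)

# Thin SVD: U is m x n, Sigma is n x n, V^T is n x n
U_thin, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U_thin.shape, s.shape, Vt.shape)                    # (5, 3) (3,) (3, 3)

assert np.allclose((U_thin * s) @ Vt, A)
```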
Suppose we want to compute the “full” versions of either the SVD or QR factorisations of a matrix. Given that we can compute the thin versions, how can this be done?
Compute the thin versions $A = U \Sigma V^\top$ or $A = QR$, and then append columns that are orthonormal and orthogonal to the existing columns of $U$ or $Q$, giving $U _ \perp$ or $Q _ \perp$, so that $U _ F = [U; U _ \perp]$ or $Q _ F = [Q; Q _ \perp]$ is square and orthogonal. (For the SVD, $\Sigma$ is also padded with $m - n$ zero rows so that the product is unchanged; likewise $R$ for the QR factorisation.)
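One way to do this numerically (my own sketch; `scipy.linalg.null_space` returns an orthonormal basis of $\ker U^\top = (\operatorname{range} U)^\perp$, which is exactly the extension needed):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))

# Thin SVD: U is m x n with orthonormal columns
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Extend U by an orthonormal basis of range(U)^perp = ker(U^T)
U_perp = null_space(U.T)                       # m x (m - n)
U_F = np.hstack([U, U_perp])                   # m x m, orthogonal

# Pad Sigma with zero rows so the product is unchanged
Sigma_F = np.vstack([np.diag(s), np.zeros((m - n, n))])

assert np.allclose(U_F.T @ U_F, np.eye(m))
assert np.allclose(U_F @ Sigma_F @ Vt, A)
```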
Suppose $A$ has the singular value decomposition
\[A = U\Sigma V^\top\]
How can you find the…
- the rank $r$ of $A$?
- a basis for the column space of $A$?
- a basis for the column space of $A^\top$ (i.e. the row space)?
- a basis for the null space of $A$?
- a basis for the null space of $A^\top$?
- The rank $r$ of $A$: the number of non-zero singular values of $A$, which is the number of non-zero entries in $\Sigma$
- A basis for the column space of $A$: The first $r$ columns of $U$
- A basis for the column space of $A^\top$: The first $r$ columns of $V$
- A basis for the null space of $A$: The last $n - r$ columns of $V$
- A basis for the null space of $A^\top$: The last $m - r$ columns of $U$ (this needs the full SVD; the thin SVD only supplies $n$ columns of $U$)
(the statements about $A^\top$ follow from the statements about $A$: taking the transpose of the SVD gives $A^\top = V \Sigma^\top U^\top$, which is an SVD of $A^\top$ with the roles of $U$ and $V$ swapped, so the column space of $A^\top$ is spanned by the first $r$ columns of $V$ and the null space of $A^\top$ by the last columns of $U$).
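A small numpy sketch of reading all of this off (my own illustration; the tolerance used to decide which singular values count as non-zero is an arbitrary but standard choice):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r_true = 6, 4, 2
A = rng.standard_normal((m, r_true)) @ rng.standard_normal((r_true, n))   # a rank-2 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=True)
tol = max(m, n) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))          # rank = number of "non-zero" singular values

col_space = U[:, :r]              # basis for the column space of A
row_space = Vt[:r, :].T           # first r columns of V: column space of A^T
null_A    = Vt[r:, :].T           # last n - r columns of V: null space of A
null_At   = U[:, r:]              # last m - r columns of U: null space of A^T

assert r == r_true
assert np.allclose(A @ null_A, 0)
assert np.allclose(A.T @ null_At, 0)
```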
Suppose $A \in \mathbb R^{m \times n}$. What is the spectral norm of this matrix? Give three alternative characterisations.
- The largest singular value of $A$
- $\max _ {x \ne 0} \frac{ \vert \vert Ax \vert \vert }{ \vert \vert x \vert \vert }$
- $\max _ { \vert \vert x \vert \vert = 1} \vert \vert Ax \vert \vert $
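A quick numerical check of these characterisations (my own sketch; `np.linalg.norm(A, 2)` computes the spectral norm, and the random unit vectors are just a crude probe of the maximisation):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

sigma_1 = np.linalg.svd(A, compute_uv=False)[0]     # largest singular value
assert np.isclose(np.linalg.norm(A, 2), sigma_1)

# ||Ax|| over random unit vectors x never exceeds sigma_1
x = rng.standard_normal((3, 1000))
x /= np.linalg.norm(x, axis=0)
assert np.all(np.linalg.norm(A @ x, axis=0) <= sigma_1 + 1e-12)
```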
Quickly prove that
\[||A|| _ 2 := \max _ {||x|| = 1} ||Ax|| = \sigma _ 1\]
where $\sigma _ 1$ is the largest singular value of $A$.
First, for any $x$ with $ \vert \vert x \vert \vert = 1$, let $y = V^\top x$, which also satisfies $ \vert \vert y \vert \vert = 1$ since $V$ is orthogonal. Then
\[||Ax||^2 = ||U\Sigma V^\top x||^2 = ||\Sigma y||^2 = \sum_i \sigma_i^2 y_i^2 \le \sigma_1^2 \sum_i y_i^2 = \sigma_1^2\]using that $U$ has orthonormal columns (so it does not change the norm). Hence $ \vert \vert Ax \vert \vert \le \sigma _ 1$ for every unit vector $x$. But equality must be achieved for some $x$, since if $x = v _ 1$, then
\[Ax = Av_1 = U\Sigma V^\top v_1 = U \begin{pmatrix}\sigma_1 \\\\ 0 \\\\ \vdots \\\\ 0\end{pmatrix} = \sigma_1 u_1\]so
\[\max_{||x|| = 1} ||Ax|| = \sigma_1\]Suppose $A$ and $B$ are matrices and $ \vert \vert \cdot \vert \vert $ is the spectral norm, defined as
\[||A|| _ 2 := \max _ {x \ne 0} \frac{||Ax||}{||x||}\]
Quickly show that
\[||A B||_2 \le ||A||_2 ||B||_2\]
\[||AB||_2 = \max_{x \ne 0} \frac{||ABx||}{||x||} = \max_{Bx \ne 0} \frac{||A(Bx)||}{||Bx||} \cdot \frac{||Bx||}{||x||} \overset{(\star)}{\le} \left( \max_{y \ne 0} \frac{||Ay||}{||y||} \right) \left( \max_{x \ne 0} \frac{||Bx||}{||x||} \right) = ||A||_2 ||B||_2\]where the restriction to $Bx \ne 0$ in the second maximum loses nothing, since any $x$ with $Bx = 0$ contributes $0$ (and if $AB = 0$ the inequality is trivial). The step $(\star)$ is justified since the vectors $y = Bx$ range only over $\text{Im}(B)$, a subset of all possible $y$, so enlarging the set over which the first maximum is taken can only increase it.
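And a one-off numerical sanity check of the inequality (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))

# ||AB||_2 <= ||A||_2 ||B||_2
assert np.linalg.norm(A @ B, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(B, 2) + 1e-12
```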
State the Eckart-Young theorem, which connects SVD, low-rank approximations and the spectral norm.
Suppose:
- $A$ is a matrix
- $A _ k$ is the matrix obtained after truncating SVD after $k$ steps, i.e. $A _ k := \sum^k _ {i=1} \sigma _ i \pmb u _ i \pmb v _ i^\top$
Then:
- $A _ k$ gives the best rank-$k$ approximation of $A$ with respect to the spectral norm, i.e. $ \vert \vert A-A _ k \vert \vert _ 2 \le \vert \vert A - B \vert \vert _ 2$ for all matrices $B$ of rank at most $k$.
(in fact, it is the best approximation under any unitarily invariant norm, i.e. $ \vert \vert UA \vert \vert = \vert \vert AU \vert \vert = \vert \vert A \vert \vert $ where $U$ is unitary).
Quickly prove the Eckart-Young theorem, which states that if:
- $A$ is a matrix
- $A _ k$ is the matrix obtained after truncating SVD after $k$ steps, i.e. $A _ k := \sum^k _ {i=1} \sigma _ i \pmb u _ i \pmb v _ i^\top$
Then:
- $A _ k$ gives the best rank-$k$ approximation of $A$ with respect to the spectral norm, i.e. $ \vert \vert A-A _ k \vert \vert _ 2 \le \vert \vert A - B \vert \vert _ 2$ for all matrices $B$ of rank at most $k$.
Since $ \vert \vert A - A _ k \vert \vert _ 2 = \sigma _ {k+1}$ (as $A - A _ k = \sum _ {i > k} \sigma _ i \pmb u _ i \pmb v _ i^\top$, whose largest singular value is $\sigma _ {k+1}$), it suffices to show that $ \vert \vert A - B \vert \vert _ 2 \ge \sigma _ {k+1}$ for any matrix $B$ of rank at most $k$.
As $B$ has rank at most $k$, we can write
\[B = B_1 B_2^\top\]where $B _ 1$ is $m \times k$ and $B _ 2$ is $n \times k$.
Since $B _ 2^\top$ has only $k$ rows, we can find a nontrivial linear combination of the first $k+1$ columns of $V$, say $w = \gamma _ 1 v _ 1 + \cdots + \gamma _ {k+1} v _ {k+1}$, such that $B _ 2^\top w = 0$, and by rescaling we can ensure $ \vert \vert w \vert \vert = 1$.
Why is this possible? Since $\dim \langle v _ 1, \cdots, v _ {k+1}\rangle = k+1$ and $\dim \ker B _ 2^\top \ge n - k$, their intersection must contain a nonzero vector (otherwise the sum $\langle v _ 1, \cdots, v _ {k+1} \rangle + \ker B _ 2^\top$ would be a subspace of $\mathbb R^n$ of dimension at least $n+1$, a contradiction).
Then:
\[\begin{aligned} ||A - B||^2_2 &\ge ||(A -B)w||^2_2 &&(1\star)\\\\ &= ||Aw||_2^2 &&(\text{since } Bw = B_1 B_2^\top w = 0)\\\\ &= ||U\Sigma V^\top w||^2_2 \\\\ &= ||U\Sigma (\gamma_1 e_1 + \cdots + \gamma_{k+1}e_{k+1})||^2_2 \\\\ &= || \gamma_1 \sigma_1 e_1 + \cdots + \gamma_{k+1} \sigma_{k+1} e_{k+1} ||^2_2 &&(2\star)\\\\ &= \gamma_1^2 \sigma_1^2 + \cdots + \gamma_{k+1}^2 \sigma_{k+1}^2 &&(3\star)\\\\ &\ge \sigma^2_{k+1} (\gamma_1^2 + \cdots + \gamma^2_{k+1}) \\\\ &= \sigma^2_{k+1} &&(\text{since } ||w|| = 1) \end{aligned}\]Notes:
- $(1\star)$: This follows from the weakly multiplicative property of the spectral norm, $ \vert \vert Mx \vert \vert \le \vert \vert M \vert \vert _ 2 \vert \vert x \vert \vert $ (which can be proved from the definition of the spectral norm as $ \vert \vert A \vert \vert _ 2 = \max _ { \vert \vert x \vert \vert = 1} \vert \vert Ax \vert \vert $), together with the fact that $ \vert \vert w \vert \vert = 1$.
- $(2\star)$: Since $U$ is orthogonal, it doesn’t change the norm.
- $(3\star)$: This follows from the fact that the spectral norm on a vector viewed as an $n \times 1$ matrix coincides with the standard Euclidean norm.
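A numpy illustration of the theorem (my own sketch; the comparison matrix $B$ here is just a random rank-$k$ matrix, not a serious competitor):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 8, 6, 2
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]            # truncate the SVD after k terms

# The truncation error in the spectral norm is exactly sigma_{k+1}
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])

# Any other rank-k matrix does at least as badly
B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
assert np.linalg.norm(A - B, 2) >= s[k] - 1e-12
```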
If $A \in \mathbb R^{m \times n}$ and the singular values of $A$ are strictly positive, distinct and arranged in descending order of magnitude, how many possible (thin) SVDs are there?
There are $2^n$ possible thin SVDs: for each $i$, the pair $(\pmb u _ i, \pmb v _ i)$ can be replaced by $(-\pmb u _ i, -\pmb v _ i)$, and these sign flips are the only freedom. Some notes:
- This comes from taking $\pm$ in the square root of $D$ in the derivation.
- If the singular values are not distinct, then there are infinitely many, because you can rotate within the subspace corresponding to a repeated singular value.
- If the singular values are not necessarily strictly positive but are still distinct, then there are (I think) $2^{n-1}$ possible SVDs.
- This is all working over $\mathbb R$. If we are working over $\mathbb C$, the situation is more complicated (for instance, each pair $(\pmb u _ i, \pmb v _ i)$ can be multiplied by a common complex phase $e^{i\theta}$, so there are infinitely many SVDs).
- Also, if you use the full SVD, there are further choices to be made when finding an orthonormal extension of $U$.
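A small numpy check of the sign ambiguity (my own sketch): flipping the sign of the $i$-th column of $U$ together with the $i$-th column of $V$ leaves the product $U \Sigma V^\top$ unchanged, which is where the $2^n$ count comes from.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Flip the sign of column 2 of U together with column 2 of V (i.e. row 2 of V^T)
flip = np.array([1.0, -1.0, 1.0])
U2, Vt2 = U * flip, Vt * flip[:, None]

assert np.allclose((U2 * s) @ Vt2, A)      # still a valid thin SVD of A
```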