NLA MT25, HMT algorithm


Flashcards

What is the purpose of the Halko-Martinsson-Tropp algorithm?

Given $A \in \mathbb R^{m \times n}$ and rank $r$, find a rank-$r$ approximation $\hat A \approx A$.

@Describe the Halko-Martinsson-Tropp (HMT) @algorithm and state its time cost.

Setup: Given $A \in \mathbb R^{m \times n}$ and rank $r$, find a randomised rank-$r$ approximation $\hat A \approx A$.

Algorithm:

  1. Form a Gaussian random matrix $G \in \mathbb R^{n \times r}$ (entries i.i.d. $\sim N(0, 1)$).
  2. Compute $AG \in \mathbb R^{m \times r}$.
  3. Compute the thin QR factorisation $AG = QR$.
  4. Approximate $A \approx \hat A := Q Q^\top A$, a rank-$r$ approximation.

Time: $O(mnr)$ for dense $A$.

Given $A \in \mathbb R^{m \times n}$ and rank $r$, the HMT algorithm proceeds as:

  1. Form a random matrix $G \in \mathbb R^{n \times r}$ with i.i.d. entries
  2. Compute $AG \in \mathbb R^{m \times r}$
  3. Compute the QR factorisation $AG = QR$
  4. Approximate $A \approx \hat A := Q Q^\top A$, a rank-$r$ approximation

@State a theorem concerning the error of this algorithm, and describe what it says intuitively (unlike ∆hmt-algorithm-rough-error, we do not prove this theorem in the course).

For any $\hat r < r - 1$,

\[\mathbb E[\|A - \hat A\| _ {\mathsf F}] \le \sqrt{1 + \frac{\hat r}{r - \hat r - 1}} \|A - A _ {\hat r}\| _ {\mathsf F},\]

where $A _ {\hat r}$ is the rank $\hat r$-truncated SVD and the expectation is with respect to the random matrix $G$.

This says that the approximant $\hat A$ has error $\|A - \hat A\| _ {\mathsf F}$ that is within a factor $\sqrt{1 + \frac{\hat r}{r - \hat r - 1}}$ of the optimal rank-$\hat r$ truncated SVD.

Given $A \in \mathbb R^{m \times n}$ and rank $r$, the HMT algorithm proceeds as:

  1. Form a random matrix $G \in \mathbb R^{n \times r}$ with i.i.d. entries
  2. Compute $AG \in \mathbb R^{m \times r}$
  3. Compute the QR factorisation $AG = QR$
  4. Approximate $A \approx \hat A := Q Q^\top A$, a rank-$r$ approximation

We have the error bound that for any $\hat r < r - 1$,

\[\mathbb E[\|A - \hat A\| _ {\mathsf F}] \le \sqrt{1 + \frac{\hat r}{r - \hat r - 1}} \|A - A _ {\hat r}\| _ {\mathsf F},\]

where $A _ {\hat r}$ is the rank $\hat r$-truncated SVD and the expectation is with respect to the random matrix $G$. This says that the approximant $\hat A$ has error $\|A - \hat A\| _ {\mathsf F}$ that is within a factor $\sqrt{1 + \frac{\hat r}{r - \hat r - 1}}$ of the optimal rank-$\hat r$ truncated SVD.

Explain why we have this $\hat r$ in the analysis, rather than just considering the error for $r$.

This is sort-of an artefact of the fact square Gaussian random matrices can be ill-conditioned. It may be helpful to think of $\hat r$ as $r = \hat r + p$ where $\hat r$ is the target rank, and $p$ is a quantity that measures “oversampling”.

Given $A \in \mathbb R^{m \times n}$ and rank $r$, the HMT algorithm proceeds as:

  1. Form a random matrix $G \in \mathbb R^{n \times r}$ with i.i.d. entries
  2. Compute $AG \in \mathbb R^{m \times r}$
  3. Compute the QR factorisation $AG = QR$
  4. Approximate $A \approx \hat A := Q Q^\top A$, a rank-$r$ approximation

@State a theorem concerning the rough error of this algorithm, and describe what it says intuitively (unlike the tighter result in ∆hmt-algorithm-error, we prove this looser bound in the course in ∆hmt-algorithm-rough-error-proof).

With high probability,

\[\|A - \hat A\| _ 2 \lesssim \frac{\sqrt n + \sqrt r}{\sqrt r - \sqrt{\hat r}} \|A - A _ {\hat r}\| _ 2\]

for any comparison rank $\hat r$, where $A _ {\hat r}$ is the optimal rank-$\hat r$ approximation (from the SVD, see ∆eckart-young-theorem).

Given $A \in \mathbb R^{m \times n}$ and rank $r$, the HMT algorithm proceeds as:

  1. Form a random matrix $G \in \mathbb R^{n \times r}$ with i.i.d. entries
  2. Compute $AG \in \mathbb R^{m \times r}$
  3. Compute the QR factorisation $AG = QR$
  4. Approximate $A \approx \hat A := Q Q^\top A$, a rank-$r$ approximation

@Prove (with a little bit of handwaving) a theorem concerning the rough error of this algorithm, that with high probability

\[\|A - \hat A\| _ 2 \lesssim \frac{\sqrt n + \sqrt r}{\sqrt r - \sqrt{\hat r}} \|A - A _ {\hat r}\| _ 2\]

for any comparison rank $\hat r$, where $A _ {\hat r}$ is the optimal rank-$\hat r$ approximation.

Strategy: The algorithm outputs $\hat A := Q Q^\top A$, so the error may be written

\[A - \hat A = (I _ m - Q Q^\top) A.\]

$QQ^\top$ is the orthogonal projector onto $\text{range}(Q)$, and $(I _ m - QQ^\top)A$ is the part of the columns of $A$ that don’t belong to $\text{range}(Q)$.

By construction, $\text{range}(Q) = \text{range}(AG)$ by construction (since $AG = QR)$. This means that we have

\[(I _ m - Q Q^\top)AG = (I _ m - QQ^\top)QR = QR - QR = 0.\]

Hence we may write the error as

\[\begin{aligned} A - \hat A &= (I _ m - Q Q^\top)A - (I _ m - QQ^\top)AG \\ &= (I _ m - QQ^\top)(A - AG) \\ &= (I _ m - QQ^\top)A(I _ n - GM^\top). \end{aligned}\]

We aim to choose $M$ cleverly so that the expression $(I _ m - Q Q^\top)A(I _ n - GM^\top)$ has small norm.

Step 1: Set $M^\top = (V _ 1^\top G)^\dagger V _ 1^\top$, where $V = [v _ 1, \ldots, v _ {\hat r}] \in \mathbb R^{n \times \hat r}$ is the top $\hat r$ right singular vectors of $A$. Recall that $\hat r$ is an integer bounded by $\hat r$.

Then $V _ 1^\top G \in \mathbb R^{\hat r \times r}$ has full row rank with probability one, since it is an $\hat r \times r$ Gaussian random matrix by the orthogonal invariance of Gaussian matrices, so $(V _ 1^\top G)(V _ 1^\top G)^\dagger = I _ {\hat r}$, and

\[A _ {\hat r}GM^\top = U _ 1 \Sigma _ 1 (V _ 1^\top G)(V _ 1^\top G)^\dagger V _ 1^\top = A _ {\hat r}.\]

By properties of the pseudoinverse, we have $(V _ 1^\top G)(V _ 1^\top G)^\dagger = I _ {\hat r}$, so $A _ {\hat r} (I _ n - GM^\top) = 0$. Thus

\[\begin{aligned} A(I _ n - GM^\top) &= A(I _ n - GM^\top) - A _ {\hat r}(I _ n - GM^\top) \\ &= (A - A _ {\hat r})(I _ n - GM^\top) \end{aligned}\]

Hence we have overall

\[A - \hat A = (I _ m - QQ^\top)(A - A _ {\hat r})(I _ n - GM^\top)\]

Step 2: Taking norms and using submultiplicativity we have

\[\begin{aligned} \|A - \hat A\| &\le \|I _ m - QQ^\top\| \cdot \|A - A _ {\hat r}\| \cdot \|I _ n - GM^\top\| \\ &\le 1 \cdot \|A - A _ {\hat r}\| \cdot \|I _ n - GM^\top\| \end{aligned}\]

where $\|I _ m - QQ^\top\| \le 1$ since this is an orthogonal projector.

Step 3: We now see that $I - GM^\top$ is itself a projector. Note that

\[\begin{aligned} (GM^\top)^2 &= G(V _ 1^\top G)^\dagger V _ 1^\top G(V _ 1^\top G)^\dagger V _ 1^\top \\ &= GM^\top \end{aligned}\]

where we appeal to ∆pseudoinverse-symmetry. Note also ∆projector-norm-identity yields that $\|I-P\| _ 2 = \|P\| _ 2$ for any non-trivial projector, so then $\|I _ n - GM^\top\| _ 2 = \|GM^\top\| _ 2$.

Step 4: Apply ∆marchenko-pastur on the components of $\|GM^\top\|$ to yield

\[\begin{aligned} \|GM^\top\| _ 2 &= \|G(V _ 1^\top G)^\dagger V _ 1^\top\| \\ &\le \|G\| _ 2 \cdot \|(V _ 1^\top G)^\dagger\| _ 2 \cdot \|V _ 1^\top\|, \end{aligned}\]

since:

  • $\|G\| _ 2 \lesssim \sqrt n + \sqrt r$ (as $G \in \mathbb R^{n \times r}$ is Gaussian)
  • $\|(V _ 1^\top G)^\dagger\| _ 2 = \tfrac{1}{\sigma _ \min(V _ 1^\top G)} \lesssim \tfrac{1}{\sqrt r - \sqrt{\hat r}}$, as $V _ 1^\top G \in \mathbb R^{\hat r \times r}$ is Gaussian and ∆pseudoinverse-spectral-norm.
  • $\|V _ 1^\top\| _ 2 = 1$.

Hence

\[\|GM^\top\| _ 2 \lesssim \frac{\sqrt n + \sqrt r}{\sqrt r - \sqrt{\hat r}}.\]

Conclusion: Combining these steps, we see

\[\begin{aligned} \|A - \hat A\| _ 2 &\le \|A - A _ {\hat r}\| _ 2 \cdot \|GM^\top\| _ 2 \\ &\lesssim \frac{\sqrt n + \sqrt r}{\sqrt r - \sqrt{\hat r}} \|A - A _ {\hat r}\| _ 2. \end{aligned}\]

Given $A \in \mathbb R^{m \times n}$ and rank $r$, the HMT algorithm proceeds as:

  1. Form a random matrix $G \in \mathbb R^{n \times r}$ with i.i.d. entries
  2. Compute $AG \in \mathbb R^{m \times r}$
  3. Compute the QR factorisation $AG = QR$
  4. Approximate $A \approx \hat A := Q Q^\top A$, a rank-$r$ approximation

In analysis, $G$ is typically assumed to be a Gaussian matrix. In practice, the performance is similar for a non-Gaussian random matrix. What’s a popular alternative choice and why?

SRFT matrices, which use the FFT and can be applied to $A$ with $O(mn \log m)$ cost rather than $O(mn r)$.

Bite-sized

HMT’s rank-$r$ approximation $\hat A = QQ^\top A$ costs $O(mnr)$ flops for dense $A$, vs $O(mn^2)$ for the (truncated) SVD — a big win when $r \ll n$.

Source: Lecture 15, Randomised SVD by HMT slide and §15.1 of the lecture notes (cost discussion under Algorithm 15.1).

@bite~

Why does the range of $AG$ (with $G$ Gaussian) capture the dominant column space of $A$?

If $A$ has exact rank $r$, write $A = U _ r \Sigma _ r V _ r^\top$. Then $AG = U _ r \Sigma _ r (V _ r^\top G)$, and $V _ r^\top G$ is Gaussian and full row rank with probability $1$, so $\Sigma _ r (V _ r^\top G)$ is full rank. Therefore $\mathrm{range}(AG) = \mathrm{range}(U _ r)$ exactly, and projecting $A$ onto $\mathrm{range}(Q) = \mathrm{range}(AG)$ recovers $A$. For approximately rank-$r$ $A$, the dominant components are captured the same way; the trailing singular values contribute an error controlled by Marchenko-Pastur conditioning of $V _ 1^\top G$.

Source: Lecture 15, §15.1 of the lecture notes (post-Algorithm-15.1 “To gain insight” paragraph).

@bite~

The HMT prefactor $\sqrt{1 + \tfrac{\hat r}{r - \hat r - 1}}$ becomes modest under moderate oversampling: $r = 2 \hat r$ gives factor approximately $\sqrt 2$; $r = \hat r + 10$ gives factor close to $1$.

Source: Lecture 15, §15.1 of the lecture notes (Theorem 15.1 post-statement remark).

@bite~

Once HMT has produced $\hat A = Q Q^\top A$, how do you turn it into an SVD-like factorisation?

Compute the (small) SVD of the $r \times n$ matrix $Q^\top A = U _ 0 \Sigma _ 0 V _ 0^\top$. Then $\hat A = Q U _ 0 \Sigma _ 0 V _ 0^\top$, so the approximate left singular vectors are the columns of $Q U _ 0$, the singular values are $\Sigma _ 0$, and the right singular vectors are the columns of $V _ 0$. Cost is $O(n r^2)$ — negligible compared to forming $Q^\top A$.

Source: Lecture 15, §15.1 of the lecture notes (parenthetical in Algorithm 15.1 step 4: “$\hat A = Q (Q^\top A) = (Q U _ 0) \Sigma _ 0 V _ 0^\top$”).

@bite~

Strategy for proving the HMT rough error bound (∆hmt-algorithm-rough-error-proof).

  • Write the error as a product of three “structurally significant” matrices: $A - \hat A = (I - QQ^\top)(A - A _ {\hat r})(I - GM^\top)$ for a cleverly chosen $M$.
  • Choose $M$: $M^\top = (V _ 1^\top G)^\dagger V _ 1^\top$ where $V _ 1$ holds the top-$\hat r$ right singular vectors. This makes the leading rank-$\hat r$ part of $A$ disappear: $A _ {\hat r}(I - GM^\top) = 0$.
  • Bound each factor: $\|I - QQ^\top\| _ 2 \le 1$ (orthogonal projector); middle factor gives $\|A - A _ {\hat r}\| _ 2$ for free; the third factor is itself a projector so $\|I - GM^\top\| _ 2 = \|GM^\top\| _ 2$ (∆projector-norm-identity).
  • Marchenko-Pastur on each Gaussian piece: $\|G\| _ 2 \lesssim \sqrt n + \sqrt r$, $\|(V _ 1^\top G)^\dagger\| _ 2 \lesssim 1/(\sqrt r - \sqrt{\hat r})$. Multiply.

Source: Lecture 15, §15.1 of the lecture notes (proof of the rough HMT bound).

@bite~ @proofsupport~

Where does the choice of $M^\top = (V _ 1^\top G)^\dagger V _ 1^\top$ come from in the HMT error proof (∆hmt-algorithm-rough-error-proof), and why is it “the right” choice?

$M$ is engineered so the rank-$\hat r$ part of $A$ exactly cancels. With $V _ 1 \in \mathbb R^{n \times \hat r}$ the top right singular vectors of $A$:

  • $V _ 1^\top G \in \mathbb R^{\hat r \times r}$ is rectangular Gaussian (by orthogonal invariance), hence full row rank a.s., so $(V _ 1^\top G)(V _ 1^\top G)^\dagger = I _ {\hat r}$.
  • Plugging into $A _ {\hat r} G M^\top = U _ 1 \Sigma _ 1 V _ 1^\top G (V _ 1^\top G)^\dagger V _ 1^\top = U _ 1 \Sigma _ 1 V _ 1^\top = A _ {\hat r}$.
  • So $A _ {\hat r}(I - GM^\top) = 0$, reducing $A(I - GM^\top) = (A - A _ {\hat r})(I - GM^\top)$ — only the tail beyond rank $\hat r$ matters.

This is the analogue, for randomised SVD, of the “subtract-the-leading-SVD-part” trick that appears in CUR and Blendenpik proofs.

Source: Lecture 15, §15.1 of the lecture notes (Step 1 of the rough-bound proof).

@bite~ @proofsupport~