NLA MT25, CUR approximation
Flashcards
@Define the CUR approximation of a matrix $A \in \mathbb R^{m \times n}$ and state the two common choices for the middle factor. Why might it be useful?
A CUR approximation of $A$ has the form
\[A \approx CUR\]and:
- $C = A[:, J] \in \mathbb R^{m \times r}$ is a column submatrix of $A$, $J \subseteq \{1, \ldots, n\}$ with $ \vert J \vert = r$
- $R = A[I, :] \in \mathbb R^{r \times n}$ is a row submatrix of $A$, $I \subseteq \{1, \ldots, m\}$ with $ \vert I \vert = r$
- $U \in \mathbb R^{r \times r}$ is the middle factor, with two common choices:
- $U _ 1 = C^\dagger A R^\dagger$
- $U = A[I, J]^{-1}$ (we may also choose $ \vert I \vert \ne \vert J \vert $ and then set $U = A[I, J]^{\dagger}$).
It might be useful, because:
- With some choices of $U$ (e.g. the last one, computing $CUR$ doesn’t require looking at the full matrix $A$)
- $C, R$ may be sparse if $A$ is
- $C$ and $R$ being submatrices of $A$ might give them interpretable physical structure
The CUR approximation of $A$ is
\[A \approx CUR\]
where
- $C = A[:, J] \in \mathbb R^{m \times r}$ is a column submatrix of $A$, $J \subseteq \{1, \ldots, n\}$ with $ \vert J \vert = r$
- $R = A[I, :] \in \mathbb R^{r \times n}$ is a row submatrix of $A$, $I \subseteq \{1, \ldots, m\}$ with $ \vert I \vert = r$
- $U \in \mathbb R^{r \times r}$ is the middle factor, with two common choices:
- $U _ 1 = C^\dagger A R^\dagger$
- $U = A[I, J]^{-1}$
@Describe how you might check the quality of a choice of $C$ and $R$, and how this could be estimated.
- $U _ 1 = C^\dagger A R^\dagger$
- $U = A[I, J]^{-1}$
Compute
\[\|A - CUR\| _ F\]for which a reliable estimate is
\[\|A - CUR\| \approx \frac{1}{\sqrt \ell} \|(A - CUR)G\| _ F\]where $G \in \mathbb R^{n \times \ell}$ is Gaussian and $\ell = O(1)$.
Suppose we consider the CUR approximation $A \approx CUR$ where
- $C = A[:, J] \in \mathbb R^{m \times r}$ is a column submatrix of $A$, $J \subseteq \{1, \ldots, n\}$ with $ \vert J \vert = r$
- $R = A[I, :] \in \mathbb R^{r \times n}$ is a row submatrix of $A$, $I \subseteq \{1, \ldots, m\}$ with $ \vert I \vert = r$
and we pick $U = A[I, J]^{-1}$.
@State a useful way of decomposing the error.
After permuting the $I$ rows and $J$ columns of $A - CUR$ to the leading $r$ positions, the residual collapses to a single trailing block:
\[P _ 1 (A - CUR) P _ 2 = \begin{bmatrix} 0 _ {r \times r} & 0 _ {r \times (n-r)} \\ 0 _ {(m-r) \times r} & A _ 2 \end{bmatrix},\]where $P _ 1, P _ 2$ are the corresponding row and column permutation matrices and $A _ 2 \in \mathbb R^{(m-r) \times (n-r)}$ is the Schur-complement block. Since permutations are orthogonal, in any unitarily invariant norm
\[\|A - CUR\| = \|P _ 1 (A - CUR) P _ 2\| = \|A _ 2\|.\]Suppose we consider the CUR approximation $A \approx CUR$ where
- $C = A[:, J] \in \mathbb R^{m \times r}$ is a column submatrix of $A$, $J \subseteq \{1, \ldots, n\}$ with $ \vert J \vert = r$
- $R = A[I, :] \in \mathbb R^{r \times n}$ is a row submatrix of $A$, $I \subseteq \{1, \ldots, m\}$ with $ \vert I \vert = r$
and we pick $U = A[I, J]^{-1}$.
@Justify the connection between this and the LU decomposition.
With $U = A[I, J]^{-1}$, after permuting the chosen $r$ rows and columns to the top-left, the residual has block structure
\[P _ 1 (A - CUR) P _ 2 = \begin{bmatrix} 0 _ {r \times r} & 0 _ {r \times (n-r)} \\ 0 _ {(m-r) \times r} & A _ 2 \end{bmatrix},\]where $A _ 2$ is the Schur complement of $A[I, J]$ in $A$.
This is the structure of the residual after $r$ steps of the rank-1 outer-product LU algorithm (∆rank-1-view-of-lu) applied to the permuted matrix:
\[P _ 1 A P _ 2 - \sum^r _ {i = 1} L _ i U _ i = \begin{bmatrix} 0 _ {r \times r} & 0 _ {r \times (n-r)} \\ 0 _ {(m-r) \times r} & A _ 2 \end{bmatrix}.\]CUR with this $U$ is essentially $r$ steps of $LU$ where the pivots are chosen by $I, J$, rather than greedily picked column-by-column.
Suppose we consider the CUR approximation of $A \in \mathbb R^{m \times n}$ and $A \approx CUR$ where
- $C = A[:, J] \in \mathbb R^{m \times r}$ is a column submatrix of $A$, $J \subseteq \{1, \ldots, n\}$, and $ \vert I \vert = r$
- $R = A[I, :] \in \mathbb R^{r \times n}$ is a row submatrix of $A$, $I \subseteq \{1, \ldots, m\}$, and $ \vert J \vert = r$
and we pick $U = A[I, J]^{-1}$.
@State a (rough, proven in this course) bound on the CUR approximation error, and note how it compares to tighter bounds.
There exist a choice of $C$ and $R$ such that
\[\|A - CUR\| _ {\mathsf F} \le \sqrt{(m - r)r + 1} \left( \sqrt{(n - r)r + 1} + 1 \right) \|A - A _ r\| _ {\mathsf F}.\]Comparison with tighter bounds: It is possible to prove that the error is in fact bounded by $(r+1)\|A - A _ r\| _ {\mathsf F}$.
@Prove the ∆cur-error-bound, i.e. that if we consider the CUR approximation of $A \in \mathbb R^{m \times n}$ and $A \approx CUR$ where
- $C = A[:, J] \in \mathbb R^{m \times r}$ is a column submatrix of $A$, $J \subseteq \{1, \ldots, n\}$, and $ \vert I \vert = r$
- $R = A[I, :] \in \mathbb R^{r \times n}$ is a row submatrix of $A$, $I \subseteq \{1, \ldots, m\}$, and $ \vert J \vert = r$
and we pick $U = A[I, J]^{-1}$, then there exist a choice of $C$ and $R$ such that
\[ \vert A - CUR\| _ {\mathsf F} \le \sqrt{(m - r)r + 1} \left( \sqrt{(n - r)r + 1} + 1 \right) \|A - A _ r\| _ {\mathsf F}.\]
Setup: Write the SVD of $A = U _ 1 \Sigma _ 1 V _ 1^\top + E$ where $\Sigma _ 1 \in \mathbb R^{r\times r}$ holds the leading $r$ singular values and $E = U _ 2 \Sigma _ 2 V^\top _ 2$ is the tail with $\|E\| _ {\mathsf F} = \|A - A _ r\| _ {\mathsf F}$. Let $P _ C \in \mathbb R^{n \times r}$ be the column-subsampling matrix selecting columns $J$ (so $C = AP _ C)$, and $P _ R \in \mathbb R^{r \times m}$ be the row-subsampling matrix selecting rows $I$ (so $R = P _ R A$). We have $\|P _ C\| _ 2 = \|P _ R\| _ 2 = 1$.
Step 1: Right-multiplying $C = U _ 1 \Sigma _ 1(V _ 1^\top P _ C) + EP _ C$ by $M := (V _ 1^\top P _ C)^\dagger V _ 1^\top$ gives
\[CM = U _ 1 \Sigma _ 1 V _ 1^\top + E P _ C M = A + (-E + E P _ C M).\]By the triangle inequality and submultiplicativity, we have
\[\begin{aligned} \|CM - A\| _ {\mathsf F} &\le \|E\| _ {\mathsf F} + \|E P _ C M\| _ {\mathsf F} \\ &\le \|\Sigma _ 2\| _ {\mathsf F} (1 + \|M\| _ 2) \\ &= \left( 1 + \frac{1}{\sigma _ \min(V _ 1^\top P _ C)} \right) \|A - A _ r\| _ {\mathsf F}, \end{aligned}\]using $\|M\| _ 2 = \frac{1}{\sigma _ \min(V _ 1^\top P _ C)}$ and $\|V _ 2^\top P _ C\| _ 2 \le 1$.
Now $C C^\dagger$ is the orthogonal projector onto $\text{range}(C)$ (the thin SVD $C = U _ C \Sigma _ C V _ C^\top$ gives $CC^\dagger = U _ C U _ C^\top)$, so $C C^\dagger A$ is the column-wise orthogonal projection of $A$ onto $\text{range}(C)$. The Frobenius norm decomposes column-by-column as
\[\|A - X\| _ {\mathsf F} = \sum _ j \|a _ j - x _ j\| _ 2^2\]so $CC^\dagger A$ is the Frobenius-best approximation to $A$ among matrices whose columns lie in $\text{range}(C)$. The matrix $CM$ has columns in $\text{range}(C)$, and hence
\[\|C C^\dagger A - A\| _ {\mathsf F} \quad \le \quad \|CM - A\| _ {\mathsf F} \quad \le \quad \left(1 + \frac{1}{\sigma _ \min(V _ 1^\top P _ C)}\right) \|A - A _ r\| _ {\mathsf F}.\]Step 2: Computing $C^\dagger A$ requires reading all of $A$. Approximate it instead by sketched-least-squares with row-subsampling $P _ R$, solving
\[\min _ {\hat M} \|P _ R(C \hat M - A)\| _ {\mathsf F}\]Since $P _ R C = C[I, :] = A[I, J]$ is $r \times r$ and nonsingular (justified in step 3). So $C\hat M = CUR$, which is exactly the CUR formula.
We now bound how far the sketched solution is from the optimal via ∆sketched-least-squares-multiple-rhs. This gives us the residual bound (with $\|P _ R\| _ 2 = 1$) that:
\[\|CUR - A\| _ {\mathsf F} \le \frac{1}{\sigma _ \min(P _ R Q _ C)} \|C C^\dagger A - A\| _ {\mathsf F}\]where $C = Q _ C R _ C$ is the thin QR.
Step 3: We now make our choice of $J$ and $I$ explicit. Both $V _ 1 \in \mathbb R^{n \times r}$ and $Q _ C \in \mathbb R^{m \times r}$ are tall-skinny matrices with orthonormal columns. Applying ∆max-vol-orthonormal-submatrix to each gives the existence of sub-samplings:
- applied to $V _ 1$, with the chosen rows of $V _ 1$ defining $J$: $\sigma _ \min(V _ 1^\top P _ C) \ge 1/\sqrt{(n-r)r + 1}$.
- applied to $Q _ C$, with the chosen rows of $Q _ C$ defining $I$: $\sigma _ \min(P _ R Q _ C) \ge 1/\sqrt{(m-r)r + 1}$.
Both lower bounds are positive, so $V _ 1^\top P _ C$ and $P _ R Q _ C$ are nonsingular. The first nonsingularity makes $C$ full column rank, since in $C = U _ 1 \Sigma _ 1 (V _ 1^\top P _ C) + U _ 2 \Sigma(V _ 2^\top P _ C)$, the first term has rank $r$ and lies in $\text{range}(U _ 1)$, while the second lies in $\text{range}(U _ 2) \perp \text{range}(U _ 1)$, so $\text{rank}(C) \ge r$, and hence equals $r$. Therefore $R _ C$ in $C = Q _ C R _ C$ is invertible, so $P _ R C = P _ R Q _ C R _ C$ is invertible, and equivalently $A[I, J]$ is invertible, justifying $A[I, J]^{-1}$ in Step 2.
Combine: Substituting the bounds from Step 3 into Step 2 and $(\ast)$,
\[\|A - CUR\| _ {\mathsf F} \le \sqrt{(m-r) r + 1)}(\sqrt{(n-r) r + 1} + 1) \|A - A _ r\| _ {\mathsf F}.\]@State a result bounding the residual of a sketched least-squares problem with multiple right-hand sides.
Suppose:
- $A \in \mathbb R^{m \times r}$ with $m > r$ has full column rank
- $B \in \mathbb R^{m \times n}$
- $S \in \mathbb R^{s \times m}$ with $r \le s \le m$, and $SA$ has full column rank
- $X = \text{argmin} _ X \|AX - B\| _ F$ (exact solution)
- $\hat X = \text{argmin} _ {\hat X}\|S(A\hat X - B)\| _ {\mathsf F}$ (sketched solution)
Let $A = QR$ be the thin QR factorisation. Then
\[\frac{\|A \hat X - B\| _ {\mathsf F}}{\|AX - B\| _ {\mathsf F}} \le \frac{\|S\| _ 2}{\sigma _ \min(SQ)}.\]When $S$ is a subsampling matrix, $\|S\| _ 2 = 1$ and the bound is $1/\sigma _ \min(SQ)$.
Why useful? The right hand side doesn’t depend on $B$, so the bound holds uniformly across all right-hand sides, which is useful when $B$ has many columns, as in ∆cur-error-bound-proof.
@Prove ∆sketched-least-squares-multiple-rhs, i.e. that if
- $A \in \mathbb R^{m \times r}$ with $m > r$ has full column rank
- $B \in \mathbb R^{m \times n}$
- $S \in \mathbb R^{s \times m}$ with $r \le s \le m$, and $SA$ has full column rank
- $X = \text{argmin} _ X \|AX - B\| _ F$ (exact solution)
- $\hat X = \text{argmin} _ {\hat X}\|S(A\hat X - B)\| _ {\mathsf F}$ (sketched solution)
- $A = QR$ is the thin QR factorisation of $A$
then:
\[\frac{\|A \hat X - B\| _ {\mathsf F}}{\|AX - B\| _ {\mathsf F}} \le \frac{\|S\| _ 2}{\sigma _ \min(SQ)}.\]
and the same bound holds where the Frobenius norm on the left-hand side is replaced by the $2$-norm.
Decompose $B$: $B = Q Q^\top B + (I - QQ^\top)B =: B _ 1 + B _ 2$. This decomposition is orthogonal, since $B _ 1$ lies in $\text{range}(Q) = \text{range}(A)$ and $B _ 2$ lies in $\text{range}(Q)^\perp$.
Exact solution and its residual: $X = R^{-1} Q^\top B$, so $AX = QR \cdot R^{-1}Q^\top B = QQ^\top B = B _ 1$, giving
\[\|AX - B\| _ {\mathsf F} = \|B _ 1 - B\| _ {\mathsf F} = \|B _ 2\| _ {\mathsf F}.\]Sketched solution: Using $(SA)^\dagger = (SQR)^\dagger = R^{-1} (SQ)^\dagger$ and $(SQ)^\dagger SQ = I _ r$ (since $SQ$ has full column rank):
\[\hat X = (SA)^\dagger SB = R^{-1}(SQ)^\dagger SB _ 1 + R^{-1}(SQ)^\dagger S B _ 2.\]For the first term, $SB _ 1 = SQQ^\top B$, so $(SQ)^\dagger SB _ 1 = (SQ)^\dagger (SQ) Q^\top B = Q^\top B$, giving $R^{-1} (SQ)^\dagger SB _ 1 = R^{-1} Q^\top B = X$. Hence
\[\hat X = X + R^{-1}(SQ)^\dagger S B _ 2, \qquad A\hat X = B _ 1 + Q(SQ)^\dagger SB _ 2.\]Residual bound: Subtracting $B = B _ 1 + B _ 2$, we have
\[A\hat X - B = (Q(SQ)^\dagger S - I)B _ 2.\]The matrix $P := Q(SQ)^\dagger S$ is a projector:
\[P^2 = Q(SQ)^\dagger \cdot SQ \cdot (SQ)^\dagger S = Q(SQ)^\dagger S = P\]For any nontrivial projector, $\|I - P\| _ 2 = \|P\| _ 2$ (∆projector-norm-identity), hence:
\[\begin{aligned} \|A\hat X - B\| _ {\mathsf F} &\le \|P\| _ 2 \|B _ 2\| _ {\mathsf F} \\ &= \|(SQ)^\dagger S\| _ 2 \|B _ 2\| _ {\mathsf F} \\ &\le \|(SQ)^\dagger\| _ 2 \|S\| _ 2 \|B _ 2\| _ {\mathsf F} \\ &= \frac{\|S\| _ 2}{\sigma _ \min(SQ)} \|AX - B\| _ {\mathsf F} \end{aligned}\]By replacing $\|\cdot\| _ {\mathsf F}$ by $\|\cdot\| _ 2$ where necessary, we also see the same bound holds for the $2$-norm.
@State a result giving a well-conditioned $r \times r$ submatrix of any tall orthonormal matrix.
Suppose $Q \in \mathbb R^{n \times r}$ has orthonormal columns. Then there exists an $r \times r$ submatrix $SQ$ formed by selecting rows of $Q$ such that
\[\frac{1}{\sigma _ \min(SQ)} \le \sqrt{(n-r) r + 1}.\]The submatrix achieving this bound is the one with the maximum absolute determinant, sometimes called the max-volume submatrix.
Remark: since $\|S\| _ 2 = \|Q\| _ 2 = 1$ implies $\sigma _ {\max}(SQ) \le 1$, the bound is equivalent in strength to a condition-number bound:
\[\kappa _ 2(SQ) = \frac{\sigma _ {\max}(SQ)}{\sigma _ \min(SQ)} \le \frac{1}{\sigma _ \min(SQ)} \le \sqrt{(n-r) r + 1}.\]So the lemma says the max-volume subsampling keeps $SQ$ well-conditioned, with $\kappa _ 2(SQ) = O(\sqrt{nr})$ in the worst case.
Why useful? This gives the existence of a row/column subsampling that keeps a tall orthonormal matrix well-conditioned after subsampling, which is useful in ∆cur-error-bound-proof.
@Prove that if $Q \in \mathbb R^{n \times r}$ has orthonormal columns, then there exists an $r \times r$ submatrix $SQ$ formed by selecting rows of $Q$ such that
\[\frac{1}{\sigma _ \min(SQ)} \le \sqrt{(n-r) r + 1}.\]
Setup: Choose $S \in \mathbb R^{r \times n}$ to maximise $ \vert \det(SQ) \vert $ over all $r \times r$ submatrices of $Q$ (the “max-volume” submatrix; since $ \vert \det(SQ) \vert $ is the product of the singular values of $SQ$, maximising the determinant indirectly maximises $\sigma _ {\min}(SQ)$). By permuting rows of $Q$ if necessary, assume without loss of generality $S = [I _ r \mid 0]$, so $SQ = Q _ 1$ where $Q = \begin{bmatrix}Q _ 1 \\ Q _ 2\end{bmatrix}$ with $Q _ 1 \in \mathbb R^{r \times r}$ and $Q _ 2 \in \mathbb R^{(n-r) \times r}$. Consider
\[Q Q _ 1^{-1} = \begin{bmatrix} I _ r \\ Q _ 2 Q _ 1^{-1} \end{bmatrix}.\]Key claim: every entry of $B := Q Q _ 1^{-1} = \begin{bmatrix} I _ r \\ Q _ 2 Q _ 1^{-1} \end{bmatrix}$ has absolute value $\le 1$.
Proof of claim: the top $r$ rows of $B$ are $I _ r$ (entries $0$ or $1$), so only a bottom entry can be a problem. Fix one, $B _ {ij}$ with row $i > r$ and column $j \le r$; we show $ \vert B _ {ij} \vert \le 1$.
Form the $r \times r$ matrix $\tilde Q$ by taking $Q _ 1$ and replacing its $j$-th row with the $i$-th row of $Q$. This $\tilde Q$ is again an $r \times r$ submatrix of $Q$ (it picks rows $\{1, \ldots, r\}$ but with $j$ swapped out for $i$), so the max-volume property of $Q _ 1$ gives $ \vert \det \tilde Q \vert \le \vert \det Q _ 1 \vert $.
Now relate $\det \tilde Q$ to $B _ {ij}$. Since $B = Q Q _ 1^{-1}$ means $Q = B Q _ 1$, the $i$-th row of $Q$ equals (the $i$-th row of $B$) $\cdot Q _ 1$. Hence $\tilde Q = R\, Q _ 1$, where $R \in \mathbb R^{r \times r}$ is the identity matrix with its $j$-th row replaced by the $i$-th row of $B$. It is this $r \times r$ matrix $R$ whose determinant we expand: every row of $R$ except row $j$ is a standard basis vector $e _ k^\top$, so Laplace expansion collapses to the single surviving term $\det R = R _ {jj} = B _ {ij}$. Therefore
\[\det \tilde Q = \det R \cdot \det Q _ 1 = B _ {ij}\, \det Q _ 1.\]Combined with $ \vert \det \tilde Q \vert \le \vert \det Q _ 1 \vert $ and $\det Q _ 1 \ne 0$, this forces $ \vert B _ {ij} \vert \le 1$. $\square$
Conclude: Since $Q^\top Q = I _ r$, $Q$ preserves vector $2$-norms ($\|Qy\| _ 2 = \|y\| _ 2$ for all $y \in \mathbb R^r$), so $\|Q Q _ 1^{-1}\| _ 2 = \|Q _ 1^{-1}\| _ 2$. Then
\[\begin{aligned} \frac{1}{\sigma _ \min(Q _ 1)} = \|Q _ 1^{-1}\| _ 2 &= \left\| \begin{bmatrix} I _ r \\ Q _ 2 Q _ 1^{-1} \end{bmatrix} \right\| _ 2 \\ &= \sqrt{1 + \|Q _ 2 Q _ 1^{-1}\|^2 _ 2} \\ &\le \sqrt{1 + \|Q _ 2 Q _ 1^{-1}\|^2 _ {\mathsf F}} \\ &\le \sqrt{1 + (n-r) r}. \end{aligned}\]The third equality uses $\left\| \begin{bmatrix} I _ r \\ M \end{bmatrix} x \right\|^2 = \|x\|^2 + \|Mx\|^2$, so the operator norm squared is $1 + \|M\| _ 2^2$. The final inequality uses the key claim: $Q _ 2 Q _ 1^{-1}$ has $(n - r) r$ entries each bounded by $1$ in absolute value, so $\|Q _ 2 Q _ 1^{-1}\|^2 _ {\mathsf F} \le (n - r) r$.
Bite-sized
With the “fast” choice $U = A[I, J]^{-1}$, computing the CUR approximation $A \approx CUR$ requires looking at only the chosen $r$ columns and $r$ rows of $A$, not the full matrix — the standout appeal of CUR.
CUR is a special case of generalised Nyström $\hat A = AX(Y^\top A X)^\dagger Y^\top A$, with the sketches taken as subsampling matrices: $X = I _ n[:, J]$ (selecting the $r$ columns) and $Y = I _ m[I, :]$ (selecting the $r$ rows).
What are the three CUR-related Frobenius-norm error bounds shown to exist for any $A \in \mathbb R^{m \times n}$ and rank $r$?
- Column-only (best rank-$r$ approximant with columns in $\mathrm{range}(C)$): $\exists\ C$ with $\|A - C C^\dagger A\| _ F \le \sqrt{r + 1}\, \|A - A _ r\| _ F$.
- Standard CUR with $U _ 1 = C^\dagger A R^\dagger$: $\|A - C U _ 1 R\| _ F \le \sqrt{2(r + 1)}\, \|A - A _ r\| _ F$.
- “Fast” CUR with $U = A[I, J]^\dagger$: $\|A - C U R\| _ F \le (r + 1)\, \|A - A _ r\| _ F$.
Reading the column-only bound: $C = A[:, J]$ is the chosen $r$ columns of $A$, and $C C^\dagger$ is the orthogonal projector onto $\mathrm{range}(C)$ (equal to $Q _ C Q _ C^\top$ if $C = Q _ C R _ C$ is a thin QR). So $C C^\dagger A$ projects every column of $A$ onto the span of the chosen columns, and — by column-by-column least squares — it is the Frobenius-best rank-$\le r$ matrix whose columns lie in $\mathrm{range}(C)$ (it is $CM$ with the optimal $M = C^\dagger A$ in $\min _ M \|A - CM\| _ F$). The bound therefore says: some choice of $r$ actual columns of $A$ comes within $\sqrt{r+1}$ of the SVD-optimal $\|A - A _ r\| _ F$ — only a $\sqrt{r+1}$ penalty for being forced to use real columns of $A$ rather than its singular vectors. It is the foundational case: the standard- and fast-CUR bounds add an $r$-row restriction on top of it, each extra sketch costing a further factor (∆cur-error-bound-proof).
A crucial feature of the Frobenius-norm CUR bounds is that the suboptimality factor depends only on $r$ — it does not contain $\sqrt m, \sqrt n$. (In the spectral norm, by contrast, the best known bounds do involve $\sqrt m, \sqrt n$.)
Given that good $I, J$ exist for CUR, how do we find them in practice?
Many choices of $I, J$ achieve a near-optimal bound, so finding any one suffices. A deterministic $O(m n^2)$ algorithm due to Osinsky finds such a CUR. Randomised sketching algorithms can compute near-optimal CURs even faster (possibly faster than HMT or generalised Nyström).
Strategy for proving the CUR error bound (∆cur-error-bound-proof).
The proof has three stages:
- Column-only stage: bound the column-projection error $\|C C^\dagger A - A\| _ F$ by exhibiting an explicit better-than-orthogonal-projection witness $CM$, with $M = (V _ 1^\top P _ C)^\dagger V _ 1^\top$. The witness recovers the rank-$r$ leading SVD part exactly, leaving only the tail $\|A - A _ r\| _ F$ scaled by $(1 + 1/\sigma _ \min(V _ 1^\top P _ C))$.
- Row-only stage: replace the unavailable $C^\dagger A$ in the column projection by the sketched solution $CUR$. By ∆sketched-least-squares-multiple-rhs, this loses another factor $1/\sigma _ \min(P _ R Q _ C)$ where $Q _ C$ is the thin-QR-Q of $C$.
- Choice of $I, J$: apply ∆max-vol-orthonormal-submatrix to $V _ 1$ (giving $J$) and to $Q _ C$ (giving $I$). Each yields $\sigma _ \min \ge 1/\sqrt{(\cdot)r + 1}$.
The two load-bearing $\sigma _ \min$ lower bounds in the CUR proof come from applying ∆max-vol-orthonormal-submatrix: once to the right singular vectors $V _ 1 \in \mathbb R^{n \times r}$ giving $\sigma _ \min(V _ 1^\top P _ C) \ge $ $1/\sqrt{(n-r)r + 1}$, and once to $Q _ C \in \mathbb R^{m \times r}$ (the thin-QR factor of $C$) giving $\sigma _ \min(P _ R Q _ C) \ge $ $1/\sqrt{(m-r)r + 1}$. These products give the suboptimality factor.
Strategy for proving the sketched least-squares multiple-RHS bound (∆sketched-least-squares-multiple-rhs-proof).
The bound (∆sketched-least-squares-multiple-rhs): for full-column-rank $A \in \mathbb R^{m \times r}$ ($m > r$), $B \in \mathbb R^{m \times n}$, and a sketch $S \in \mathbb R^{s \times m}$ ($r \le s \le m$, $SA$ full column rank), the sketched solution $\hat X = \mathrm{argmin} _ {\hat X}\|S(A\hat X - B)\| _ {\mathsf F}$ has residual within a fixed factor of the exact $X = \mathrm{argmin} _ X\|AX - B\| _ {\mathsf F}$:
\[\frac{\|A\hat X - B\| _ {\mathsf F}}{\|AX - B\| _ {\mathsf F}} \le \frac{\|S\| _ 2}{\sigma _ \min(SQ)}\]where $A = QR$ is the thin QR (so for a subsampling $S$, $\|S\| _ 2 = 1$ and the factor is $1/\sigma _ \min(SQ)$). The factor is independent of $B$ — exactly what makes it usable for the many right-hand sides in ∆cur-error-bound-proof.
Strategy:
- Decompose $B = B _ 1 + B _ 2$ orthogonally with $B _ 1 = QQ^\top B \in \mathrm{range}(Q)$ and $B _ 2 = (I - QQ^\top)B \in \mathrm{range}(Q)^\perp$. Then the exact LS residual equals $\|B _ 2\| _ F$.
- Show the sketched solution $\hat X$ recovers the $B _ 1$-component exactly: $A\hat X = B _ 1 + Q(SQ)^\dagger SB _ 2$. Hence the sketched residual is $A\hat X - B = (Q(SQ)^\dagger S - I)B _ 2$.
- Observe $P := Q(SQ)^\dagger S$ is a projector (using $(SQ)^\dagger SQ = I$). By ∆projector-norm-identity, $\|I - P\| _ 2 = \|P\| _ 2$.
- Bound $\|P\| _ 2 \le \|(SQ)^\dagger\| _ 2 \|S\| _ 2 = \|S\| _ 2 / \sigma _ \min(SQ)$ via submultiplicativity.
Strategy for proving the max-volume orthonormal-submatrix bound (∆max-vol-orthonormal-submatrix-proof).
The bound (∆max-vol-orthonormal-submatrix): for any tall $Q \in \mathbb R^{n \times r}$ with orthonormal columns, there is an $r \times r$ row-submatrix $SQ$ — the max-volume one, i.e. with maximal $ \vert \det \vert $ — that stays well-conditioned after subsampling:
\[\frac{1}{\sigma _ \min(SQ)} \le \sqrt{(n-r)\,r + 1}\]This is what guarantees the row/column subsamplings in ∆cur-error-bound-proof keep the orthonormal factors well-conditioned.
Strategy:
- Choose the submatrix $SQ = Q _ 1$ with maximum $ \vert \det Q _ 1 \vert $. WLOG permute so $Q = \begin{bmatrix}Q _ 1 \\ Q _ 2\end{bmatrix}$.
- Swap argument: every entry of $Q _ 2 Q _ 1^{-1}$ has absolute value $\le 1$. Otherwise, replacing the corresponding row of $Q _ 1$ with a row of $Q _ 2$ would multiply the determinant by that entry’s value (via Laplace expansion), giving a larger-volume submatrix and contradicting maximality.
- Convert entry-bound to operator bound: $1/\sigma _ \min(Q _ 1) = \|Q _ 1^{-1}\| _ 2 = \big\|\begin{bmatrix}I _ r \\ Q _ 2 Q _ 1^{-1}\end{bmatrix}\big\| _ 2 = \sqrt{1 + \|Q _ 2 Q _ 1^{-1}\| _ 2^2} \le \sqrt{1 + \|Q _ 2 Q _ 1^{-1}\| _ F^2}$.
- The Frobenius norm has $\le (n-r)r$ entries each bounded by $1$, so $\|Q _ 2 Q _ 1^{-1}\| _ F^2 \le (n-r)r$, giving the result.