NLA MT25, Randomised least-squares
Flashcards
In the context of randomised least-squares, what kinds of least-squares problems are we considering?
where $A \in \mathbb R^{m \times n}$ is tall-skinny so that $m \gg n$, and $\text{rank}(A) = n$.
Row subset selection
Suppose:
- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
What’s the intuitive idea behind row subset selection?
Suppose that $Ax^\ast = b$, so that the system is consistent. Then $x^\ast$ can be found by solving the smaller problem
\[x^\ast = \text{argmin} _ x \vert \vert A _ 1 x - b _ 1 \vert \vert _ 2\]where $A _ 1 \in \mathbb R^{s \times n}$ is any rank-$n$ submatrix of $A$. If this is the case, we can obtain the solution in $O(sn^2)$ operations.
Suppose:
- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
The intuitive idea behind row subset selection is that if $Ax^\ast = b$ for some $x^\ast$, then $x^\ast$ can be found by solving the smaller problem
\[x^\ast = \text{argmin} _ x \vert \vert A _ 1 x - b _ 1 \vert \vert _ 2\]
where $A _ 1 \in \mathbb R^{s \times n}$ is any rank-$n$ submatrix of $A$. If this is the case, we can obtain the solution in $O(sn^2)$ operations.
Give an @example to show that this need not be the case when $Ax _ \ast \ne b$.
where $A _ 1 = \epsilon I _ n$ and $\epsilon \ll 1$, and $A _ i = I _ n$ for $i \ge 2$, and $b _ i = b _ j$ if $i, j \ge 2$.
Then the least squares solution is approximately $x \approx b _ 2$. But if the top $s = n$ rows are chosen, solving $\min _ x \vert \vert A _ 1 x - b _ 1 \vert \vert _ 2$ gets $\hat x = \epsilon^{-1} b _ 1$, and this can be arbitrarily bad.
Sketch-and-solve
Suppose:
- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
@Describe the sketch-and-solve @algorithm in this context.
- Generate a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
- Computes $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$
Suppose:
- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
The ∆least-squares-sketch-and-solve algorithm approximates this problem by:
- Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
- Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$
@Justify its connection to the subset selection algorithm.
Consider the square Gaussian matrix
\[\hat G = \begin{bmatrix} G \\ G _ 2 \end{bmatrix}\]and consider $\min _ x \vert \vert \hat G(Ax - b) \vert \vert _ 2$, and solve by selecting the first $s$ rows.
Suppose:
- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
The ∆least-squares-sketch-and-solve algorithm approximates this problem by:
- Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
- Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$
@State a theorem characterising the error in the approximate solutions.
Suppose:
- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
- $x^\ast = \text{argmin} _ x \vert \vert Ax - b \vert \vert _ 2$
- $G \in \mathbb R^{s \times m}$ is a Gaussian matrix with $s > cn$ and $c > 1$
- $\hat x = \text{argmin} _ x \vert \vert G(Ax - b) \vert \vert _ 2$
Then, with high probability:
\[\vert \vert A\hat x - b \vert \vert _ 2 \le \frac{\sqrt s + \sqrt{n+1}}{\sqrt{s} - \sqrt{n+1}} \vert \vert Ax^\ast - b \vert \vert _ 2\]Suppose:
- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
The ∆least-squares-sketch-and-solve algorithm approximates this problem by:
- Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
- Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$
@Prove that then if:
- $x^\ast = \text{argmin} _ x \vert \vert Ax - b \vert \vert _ 2$
- $G \in \mathbb R^{s \times m}$ is a Gaussian matrix with $s > cn$ and $c > 1$
- $\hat x = \text{argmin} _ x \vert \vert G(Ax - b) \vert \vert _ 2$
Then, with high probability:
\[\vert \vert A\hat x - b \vert \vert _ 2 \le \frac{\sqrt s + \sqrt{n+1}}{\sqrt{s} - \sqrt{n+1}} \vert \vert Ax^\ast - b \vert \vert _ 2\]
The trick: view every residual $Av - b$ as a single vector in a fixed $(n+1)$-dimensional subspace, then show the sketch $G$ is a near-isometry on it.
Residual as one vector. For any $v$,
\[Av - b = [A,\, b] \begin{bmatrix} v \\ -1 \end{bmatrix}.\]Take the thin QR factorisation $[A, b] = QR$, with $Q \in \mathbb R^{m \times (n+1)}$ having orthonormal columns and $R \in \mathbb R^{(n+1) \times (n+1)}$ upper triangular, and set $w := R \begin{bmatrix} v \ -1 \end{bmatrix} \in \mathbb R^{n+1}$. Then
\[Av - b = QR \begin{bmatrix} v \\ -1 \end{bmatrix} = Qw, \qquad G(Av - b) = (GQ) w,\]and since $Q$ has orthonormal columns, $ \vert \vert Av - b \vert \vert _ 2 = \vert \vert Qw \vert \vert _ 2 = \vert \vert w \vert \vert _ 2$.
Near-isometry of the sketch. By ∆orthogonal-invariance-of-gaussian-matrices, $GQ \in \mathbb R^{s \times (n+1)}$ is again Gaussian, so by ∆marchenko-pastur its singular values lie in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$. Hence for every $w$,
\[(\sqrt s - \sqrt{n+1}) \vert \vert w \vert \vert _ 2 \le \vert \vert (GQ) w \vert \vert _ 2 \le (\sqrt s + \sqrt{n+1}) \vert \vert w \vert \vert _ 2,\]so the stretch is $\approx \sqrt s$, not $\approx 1$ (a Gaussian with $N(0,1)$ entries enlarges every direction by about $\sqrt s$). Translating back via $ \vert \vert w \vert \vert _ 2 = \vert \vert Av - b \vert \vert _ 2$ and $ \vert \vert (GQ) w \vert \vert _ 2 = \vert \vert G(Av - b) \vert \vert _ 2$ gives, for every $v$,
\[(\sqrt s - \sqrt{n+1}) \vert \vert Av - b \vert \vert _ 2 \le \vert \vert G(Av - b) \vert \vert _ 2 \le (\sqrt s + \sqrt{n+1}) \vert \vert Av - b \vert \vert _ 2. \qquad (\ast)\]The bound holds uniformly in $v$, which is essential: it must apply to the $G$-dependent minimiser $\hat x$, so a single-vector concentration bound would not do.
Combine.
\[\begin{aligned} \vert \vert A\hat x - b \vert \vert _ 2 &\le \frac{1}{\sqrt s - \sqrt{n+1}} \vert \vert G(A\hat x - b) \vert \vert _ 2 && (\star 1) \\ &\le \frac{1}{\sqrt s - \sqrt{n+1}} \vert \vert G(Ax^\ast - b) \vert \vert _ 2 && (\star 2) \\ &\le \frac{\sqrt s + \sqrt{n+1}}{\sqrt s - \sqrt{n+1}} \vert \vert Ax^\ast - b \vert \vert _ 2 && (\star 3) \end{aligned}\]as required. The absolute $\sqrt s$ never appears in the final bound: it is the ratio of $GQ$’s two extreme singular values, both $\approx \sqrt s$.
- $(\star 1)$ lower bound of $(\ast)$ at $v = \hat x$.
- $(\star 2)$ optimality: $\hat x$ minimises the sketched residual, so $ \vert \vert G(A\hat x - b) \vert \vert _ 2 \le \vert \vert G(Ax^\ast - b) \vert \vert _ 2$.
- $(\star 3)$ upper bound of $(\ast)$ at $v = x^\ast$.
Blendenpik
Bite-sized
For sketch-and-solve least-squares with $G \in \mathbb R^{s \times m}$ Gaussian, choosing $s = 4(n + 1)$ makes the residual-bound prefactor $\frac{\sqrt s + \sqrt{n+1}}{\sqrt s - \sqrt{n+1}}$ equal to $3$.
Why does randomly mixing the rows (sketch-and-solve) succeed where deterministic row subset selection can fail catastrophically?
In subset selection, if the chosen rows happen to be those with anomalous structure (e.g., the $\epsilon I _ n$ block in the failure example), the small subproblem is misleading and $\hat x$ can be arbitrarily bad. A random Gaussian (or FFT-type) sketch spreads each row’s signal across the sketched rows, so with high probability no submatrix is degenerate — Marchenko-Pastur then guarantees a near-isometric embedding.
Sketch-and-solve cost: forming the dense Gaussian sketch $GA$ already takes $O(smn)$ flops, which is no faster than the direct QR-based solver. The speedup comes from structured sketches (e.g., $G = SFD$ with FFT, “SRFT”), reducing the cost to $O(mn \log s + s n^2)$.
In one sentence, what does Blendenpik do?
Blendenpik uses a random sketch $G \in \mathbb R^{4n \times m}$ to build a preconditioner: form $GA = \hat Q \hat R$ (QR factorisation), then solve the original problem $\min _ x |Ax - b| _ 2$ in the well-conditioned variables $y = \hat R x$, i.e. run CG on the normal equations of $B = A \hat R^{-1}$ (with $\kappa _ 2(B) = O(1)$) and recover $x = \hat R^{-1} y$. The change of variables is exact ($By = Ax$), so this returns the true minimiser to full accuracy, in $O(mn \log m + n^3)$ rather than $O(m n^2)$.
Blendenpik forms $GA = \hat Q \hat R$ with $G \in \mathbb R^{4n \times m}$ Gaussian. The preconditioned matrix $A \hat R^{-1}$ has condition number $O(1)$ with high probability, justified by applying Marchenko-Pastur to $GQ$ (where $A = QR$ is the thin QR of $A$).
Inside Blendenpik, once $\hat R$ is computed, the inner solver is CG applied to the normal equation $B^\top B y = B^\top b$ with $B = A \hat R^{-1}$. Because $\kappa _ 2(B^\top B) = O(1)$, CG reaches a target tolerance $\tau$ on the relative residual in $O(\log(1/\tau))$ iterations, each costing $O(mn)$. ($\tau$ is the iteration tolerance — distinct from the course’s standing $\epsilon = O(u)$ for unit roundoff.)
Strategy for proving the sketch-and-solve error bound (∆sketch-and-solve-error-bound-proof).
- Reduce the problem to a property of $GQ$ where $Q$ is from the QR factorisation $[A, b] = QR$. Then $|Av - b| _ 2 = |Qw| _ 2$ and $|G(Av - b)| _ 2 = |(GQ)w| _ 2$.
- Argue $GQ$ is itself Gaussian by orthogonal invariance of Gaussian matrices, since $Q$ is independent of $G$.
- Apply Marchenko-Pastur to the rectangular $s \times (n+1)$ Gaussian $GQ$: all its singular values lie in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$.
- Conclude a two-sided sandwich $|G(Av - b)| _ 2 \approx |Av - b| _ 2$ uniformly in $v$, then plug $v = \hat x$ and $v = x^\ast$ and use that $\hat x$ minimises the sketched residual.
The load-bearing two-sided inequality in the sketch-and-solve proof: for any vector $v$, $(\sqrt s - \sqrt{n+1}) \ \vert Av - b\ \vert _ 2 \le \ \vert G(Av - b)\ \vert _ 2 \le (\sqrt s + \sqrt{n+1}) \ \vert Av - b\ \vert _ 2$. The bound depends on $s, n$ but not on $v$ — it is a uniform near-isometry on $\mathrm{range}([A, b])$.
Where does the hypothesis “$G$ is Gaussian” enter the sketch-and-solve error proof (∆sketch-and-solve-error-bound-proof)?
In exactly two places, both via the rectangular random matrix $GQ$:
- Orthogonal invariance of Gaussian matrices: applied to argue that $GQ$ has the same distribution as a fresh Gaussian. This uses $Q$ orthogonal and independent of $G$.
- Marchenko-Pastur conditioning: applied to $GQ$ to bound its singular values in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$. Strictly, MP itself only needs i.i.d. mean-$0$ variance-$1$ entries — but here orthogonal invariance is the step that gets us to a fresh random matrix with i.i.d. entries.
So if $G$ were merely a non-Gaussian sub-Gaussian sketch, MP-style conditioning of $GQ$ would still go through, but only via a more delicate non-invariance-based argument.