# Notes - NLA MT25, Randomised least-squares

> Source: https://ollybritton.com/notes/uni/part-c/mt25/nla/randomised-least-squares/ · Updated: 2026-05-05 · Tags: uni, notes

- [Course - Numerical Linear Algebra MT25](https://ollybritton.com/notes/uni/part-c/mt25/nla/numerical-linear-algebra/)
	- [Notes - NLA MT25, Least-squares](https://ollybritton.com/notes/uni/part-c/mt25/nla/notes/least-squares/)
	- [Notes - NLA MT25, Marchenko-Pastur theorem](https://ollybritton.com/notes/uni/part-c/mt25/nla/notes/marchenko-pastur-theorem/)
	- [Notes - NLA MT25, Gaussian random matrices](https://ollybritton.com/notes/uni/part-c/mt25/nla/notes/gaussian-random-matrices/)

### Flashcards
In the context of randomised least-squares, what kinds of least-squares problems are we considering?::

$$
\min_x ||Ax - b||_2
$$
where $A \in \mathbb R^{m \times n}$ is tall-skinny so that $m \gg n$, and $\text{rank}(A) = n$.

#### Row subset selection
Suppose:

- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min_x ||Ax - b||_2$

What's the intuitive idea behind row subset selection?::

Suppose that $Ax^\ast = b$, so that the system is consistent. Then $x^\ast$ can be found by solving the smaller problem
$$
x^\ast = \text{argmin}_x||A_1 x - b_1||_2
$$
where $A_1 \in \mathbb R^{s \times n}$ is any rank-$n$ submatrix of $A$. If this is the case, we can obtain the solution in $O(sn^2)$ operations.

Suppose:

- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min_x ||Ax - b||_2$

The intuitive idea behind row subset selection is that if $Ax^\ast = b$ for some $x^\ast$, then $x^\ast$ can be found by solving the smaller problem
$$
x^\ast = \text{argmin}_x||A_1 x - b_1||_2
$$
where $A_1 \in \mathbb R^{s \times n}$ is any rank-$n$ submatrix of $A$. If this is the case, we can obtain the solution in $O(sn^2)$ operations.

Give an @example to show that this need not be the case when $Ax_\ast \ne b$.::

$$
A = \begin{bmatrix}
A_1 \\ A_2 \\ \vdots \\ A_k
\end{bmatrix}, \quad b = \begin{bmatrix}
b_1 \\ b_2 \\ \vdots \\ b_k
\end{bmatrix}
$$
where $A_1 = \epsilon I_n$ and $\epsilon \ll 1$, and $A_i = I_n$ for $i \ge 2$, and $b_i = b_j$ if $i, j \ge 2$.

Then the least squares solution is approximately $x \approx b_2$. But if the top $s = n$ rows are chosen, solving $\min_x ||A_1 x - b_1||_2$ gets $\hat x = \epsilon^{-1} b_1$, and this can be arbitrarily bad.

#### Sketch-and-solve
Suppose:

- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min_x ||Ax - b||_2$

@Describe the sketch-and-solve @algorithm in this context.::

1. Generate a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
2. Computes $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min_x ||G(Ax - b)||_2$

Suppose:

- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min_x ||Ax - b||_2$

The ∆least-squares-sketch-and-solve algorithm approximates this problem by:

1. Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
2. Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min_x ||G(Ax - b)||_2$

@Justify its connection to the subset selection algorithm.::

Consider the square Gaussian matrix
$$
\hat G = \begin{bmatrix}
G \\ G_2
\end{bmatrix}
$$
and consider $\min_x ||\hat G(Ax - b)||_2$, and solve by selecting the first $s$ rows.

Suppose:

- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min_x ||Ax - b||_2$

The ∆least-squares-sketch-and-solve algorithm approximates this problem by:

1. Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
2. Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min_x ||G(Ax - b)||_2$

@State a theorem characterising the error in the approximate solutions.::

Suppose:

- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min_x ||Ax - b||_2$
- $x^\ast = \text{argmin}_x ||Ax - b||_2$
- $G \in \mathbb R^{s \times m}$ is a Gaussian matrix with $s > cn$ and $c > 1$
- $\hat x = \text{argmin}_x||G(Ax - b)||_2$

Then, with high probability:
$$
||A\hat x - b||_2 \le \frac{\sqrt s + \sqrt{n+1}}{\sqrt{s} - \sqrt{n+1}} ||Ax^\ast - b||_2
$$

Suppose:

- $A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
- We are considering the least-squares problem $\min_x ||Ax - b||_2$

The ∆least-squares-sketch-and-solve algorithm approximates this problem by:

1. Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
2. Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min_x ||G(Ax - b)||_2$

@Prove that then if:

- $x^\ast = \text{argmin}_x ||Ax - b||_2$
- $G \in \mathbb R^{s \times m}$ is a Gaussian matrix with $s > cn$ and $c > 1$
- $\hat x = \text{argmin}_x||G(Ax - b)||_2$

Then, with high probability:
$$
||A\hat x - b||_2 \le \frac{\sqrt s + \sqrt{n+1}}{\sqrt{s} - \sqrt{n+1}} ||Ax^\ast - b||_2
$$

::

The trick: view every residual $Av - b$ as a single vector in a fixed $(n+1)$-dimensional subspace, then show the sketch $G$ is a near-isometry on it.

**Residual as one vector.** For any $v$,
$$
Av - b = [A,\, b] \begin{bmatrix} v \\ -1 \end{bmatrix}.
$$
Take the thin QR factorisation $[A, b] = QR$, with $Q \in \mathbb R^{m \times (n+1)}$ having orthonormal columns and $R \in \mathbb R^{(n+1) \times (n+1)}$ upper triangular, and set $w := R \begin{bmatrix} v \\ -1 \end{bmatrix} \in \mathbb R^{n+1}$. Then
$$
Av - b = QR \begin{bmatrix} v \\ -1 \end{bmatrix} = Qw, \qquad G(Av - b) = (GQ) w,
$$
and since $Q$ has orthonormal columns, $||Av - b||_2 = ||Qw||_2 = ||w||_2$.

**Near-isometry of the sketch.** By ∆orthogonal-invariance-of-gaussian-matrices, $GQ \in \mathbb R^{s \times (n+1)}$ is again Gaussian, so by ∆marchenko-pastur its singular values lie in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$. Hence for every $w$,
$$
(\sqrt s - \sqrt{n+1}) ||w||_2 \le ||(GQ) w||_2 \le (\sqrt s + \sqrt{n+1}) ||w||_2,
$$
so the stretch is $\approx \sqrt s$, not $\approx 1$ (a Gaussian with $N(0,1)$ entries enlarges every direction by about $\sqrt s$). Translating back via $||w||_2 = ||Av - b||_2$ and $||(GQ) w||_2 = ||G(Av - b)||_2$ gives, for every $v$,
$$
(\sqrt s - \sqrt{n+1}) ||Av - b||_2 \le ||G(Av - b)||_2 \le (\sqrt s + \sqrt{n+1}) ||Av - b||_2. \qquad (\ast)
$$
The bound holds *uniformly* in $v$, which is essential: it must apply to the $G$-dependent minimiser $\hat x$, so a single-vector concentration bound would not do.

**Combine.**
$$
\begin{aligned}
||A\hat x - b||_2
&\le \frac{1}{\sqrt s - \sqrt{n+1}} ||G(A\hat x - b)||_2 && (\star 1) \\
&\le \frac{1}{\sqrt s - \sqrt{n+1}} ||G(Ax^\ast - b)||_2 && (\star 2) \\
&\le \frac{\sqrt s + \sqrt{n+1}}{\sqrt s - \sqrt{n+1}} ||Ax^\ast - b||_2 && (\star 3)
\end{aligned}
$$
as required. The absolute $\sqrt s$ never appears in the final bound: it is the ratio of $GQ$'s two extreme singular values, both $\approx \sqrt s$.

- $(\star 1)$ lower bound of $(\ast)$ at $v = \hat x$.
- $(\star 2)$ optimality: $\hat x$ minimises the sketched residual, so $||G(A\hat x - b)||_2 \le ||G(Ax^\ast - b)||_2$.
- $(\star 3)$ upper bound of $(\ast)$ at $v = x^\ast$.

### Blendenpik

### Bite-sized

For sketch-and-solve least-squares with $G \in \mathbb R^{s \times m}$ Gaussian, choosing $s = 4(n + 1)$ makes the residual-bound prefactor $\frac{\sqrt s + \sqrt{n+1}}{\sqrt s - \sqrt{n+1}}$ equal to $3$.

**Source**: Lecture 14, **Sketch and solve for $\min_x \|Ax - b\|_2$** slide and **§14.4** of the lecture notes (interpretation of Theorem 14.2).

@bite~

Why does randomly mixing the rows (sketch-and-solve) succeed where deterministic row subset selection can fail catastrophically?

::

In subset selection, if the chosen rows happen to be those with anomalous structure (e.g., the $\epsilon I_n$ block in the failure example), the small subproblem is misleading and $\hat x$ can be arbitrarily bad. A random Gaussian (or FFT-type) sketch spreads each row's signal across the sketched rows, so with high probability *no* submatrix is degenerate — Marchenko-Pastur then guarantees a near-isometric embedding.

**Source**: Lecture 14, **'Fast' (but fragile) alg for $\min_x \|Ax - b\|_2$** and **Sketch and solve for $\min_x \|Ax - b\|_2$** slides; **§14.3-14.4** of the lecture notes.

@bite~

Sketch-and-solve cost: forming the dense Gaussian sketch $GA$ already takes $O(smn)$ flops, which is no faster than the direct QR-based solver. The speedup comes from structured sketches (e.g., $G = SFD$ with FFT, "SRFT"), reducing the cost to $O(mn \log s + s n^2)$.

**Source**: Lecture 14, **Complexity and faster sketches** discussion in **§14.4** of the lecture notes.

@bite~

In one sentence, what does Blendenpik do?

::

Blendenpik uses a random sketch $G \in \mathbb R^{4n \times m}$ to build a preconditioner: form $GA = \hat Q \hat R$ (QR factorisation), then solve the original problem $\min_x \|Ax - b\|_2$ in the well-conditioned variables $y = \hat R x$, i.e. run CG on the normal equations of $B = A \hat R^{-1}$ (with $\kappa_2(B) = O(1)$) and recover $x = \hat R^{-1} y$. The change of variables is exact ($By = Ax$), so this returns the true minimiser to full accuracy, in $O(mn \log m + n^3)$ rather than $O(m n^2)$.

**Source**: Lecture 14, **Sketch-to-precondition: Blendenpik** slide and **§14.5** of the lecture notes.

@bite~

Blendenpik forms $GA = \hat Q \hat R$ with $G \in \mathbb R^{4n \times m}$ Gaussian. The preconditioned matrix $A \hat R^{-1}$ has condition number $O(1)$ with high probability, justified by applying Marchenko-Pastur to $GQ$ (where $A = QR$ is the thin QR of $A$).

**Source**: Lecture 14, **Explaining $\kappa_2(A \hat R^{-1}) = O(1)$ via Marchenko-Pastur** slide and **§14.5.1** of the lecture notes (Theorem 14.3).

@bite~

Inside Blendenpik, once $\hat R$ is computed, the inner solver is CG applied to the normal equation $B^\top B y = B^\top b$ with $B = A \hat R^{-1}$. Because $\kappa_2(B^\top B) = O(1)$, CG reaches a target tolerance $\tau$ on the relative residual in $O(\log(1/\tau))$ iterations, each costing $O(mn)$. ($\tau$ is the iteration tolerance — distinct from the course's standing $\epsilon = O(u)$ for unit roundoff.)

**Source**: Lecture 14, **Blendenpik: solving $\min_x \|Ax - b\|_2$ using $\hat R$** slide and **§14.5.2** of the lecture notes.

@bite~

Strategy for proving the sketch-and-solve error bound (∆sketch-and-solve-error-bound-proof).

::

- Reduce the problem to a property of $GQ$ where $Q$ is from the QR factorisation $[A, b] = QR$. Then $\|Av - b\|_2 = \|Qw\|_2$ and $\|G(Av - b)\|_2 = \|(GQ)w\|_2$.
- Argue $GQ$ is itself Gaussian by orthogonal invariance of Gaussian matrices, since $Q$ is independent of $G$.
- Apply Marchenko-Pastur to the rectangular $s \times (n+1)$ Gaussian $GQ$: all its singular values lie in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$.
- Conclude a two-sided sandwich $\|G(Av - b)\|_2 \approx \|Av - b\|_2$ uniformly in $v$, then plug $v = \hat x$ and $v = x^\ast$ and use that $\hat x$ minimises the sketched residual.

**Source**: Lecture 14, Theorem 14.2 in **§14.4** of the lecture notes.

@bite~ @proofsupport~

The load-bearing two-sided inequality in the sketch-and-solve proof: for any vector $v$, $(\sqrt s - \sqrt{n+1}) \|Av - b\|_2 \le \|G(Av - b)\|_2 \le (\sqrt s + \sqrt{n+1}) \|Av - b\|_2$. The bound depends on $s, n$ but not on $v$ — it is a *uniform* near-isometry on $\mathrm{range}([A, b])$.

**Source**: Lecture 14, **§14.4** of the lecture notes (proof of Theorem 14.2).

@bite~ @proofsupport~

Where does the hypothesis "$G$ is Gaussian" enter the sketch-and-solve error proof (∆sketch-and-solve-error-bound-proof)?

::

In exactly two places, both via the rectangular random matrix $GQ$:

- **Orthogonal invariance of Gaussian matrices**: applied to argue that $GQ$ has the *same distribution* as a fresh Gaussian. This uses $Q$ orthogonal *and* independent of $G$.
- **Marchenko-Pastur conditioning**: applied to $GQ$ to bound its singular values in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$. Strictly, MP itself only needs i.i.d. mean-$0$ variance-$1$ entries — but here orthogonal invariance is the step that gets us *to* a fresh random matrix with i.i.d. entries.

So if $G$ were merely a non-Gaussian sub-Gaussian sketch, MP-style conditioning of $GQ$ would still go through, but only via a more delicate non-invariance-based argument.

**Source**: Lecture 14, **§14.4** of the lecture notes (proof of Theorem 14.2 — orthogonal invariance + Marchenko-Pastur citation).

@bite~ @proofsupport~

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt