Notes - NLA MT25, Randomised least-squares

Course - Numerical Linear Algebra MT25^U

Flashcards

In the context of randomised least-squares, what kinds of least-squares problems are we considering?

randomised-least-squares-problem-setup

\[\min _ x \vert \vert Ax - b \vert \vert _ 2\]

where $A \in \mathbb R^{m \times n}$ is tall-skinny so that $m \gg n$, and $\text{rank}(A) = n$.

Row subset selection

Suppose:

$A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$

What’s the intuitive idea behind row subset selection?

row-subset-selection-intuition

Suppose that $Ax^\ast = b$, so that the system is consistent. Then $x^\ast$ can be found by solving the smaller problem

\[x^\ast = \text{argmin} _ x \vert \vert A _ 1 x - b _ 1 \vert \vert _ 2\]

where $A _ 1 \in \mathbb R^{s \times n}$ is any rank-$n$ submatrix of $A$. If this is the case, we can obtain the solution in $O(sn^2)$ operations.

Suppose:

$A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$

The intuitive idea behind row subset selection is that if $Ax^\ast = b$ for some $x^\ast$, then $x^\ast$ can be found by solving the smaller problem

\[x^\ast = \text{argmin} _ x \vert \vert A _ 1 x - b _ 1 \vert \vert _ 2\]

where $A _ 1 \in \mathbb R^{s \times n}$ is any rank-$n$ submatrix of $A$. If this is the case, we can obtain the solution in $O(sn^2)$ operations.

Give an @example to show that this need not be the case when $Ax _ \ast \ne b$.

row-subset-selection-failure-example

\[A = \begin{bmatrix} A _ 1 \\ A _ 2 \\ \vdots \\ A _ k \end{bmatrix}, \quad b = \begin{bmatrix} b _ 1 \\ b _ 2 \\ \vdots \\ b _ k \end{bmatrix}\]

where $A _ 1 = \epsilon I _ n$ and $\epsilon \ll 1$, and $A _ i = I _ n$ for $i \ge 2$, and $b _ i = b _ j$ if $i, j \ge 2$.

Then the least squares solution is approximately $x \approx b _ 2$. But if the top $s = n$ rows are chosen, solving $\min _ x \vert \vert A _ 1 x - b _ 1 \vert \vert _ 2$ gets $\hat x = \epsilon^{-1} b _ 1$, and this can be arbitrarily bad.

Sketch-and-solve

Suppose:

$A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$

@Describe the sketch-and-solve @algorithm in this context.

least-squares-sketch-and-solve

Generate a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
Computes $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$

Suppose:

$A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$

The ∆least-squares-sketch-and-solve algorithm approximates this problem by:

Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$

@Justify its connection to the subset selection algorithm.

sketch-and-solve-connection-to-subset-selection

Consider the square Gaussian matrix

\[\hat G = \begin{bmatrix} G \\ G _ 2 \end{bmatrix}\]

and consider $\min _ x \vert \vert \hat G(Ax - b) \vert \vert _ 2$, and solve by selecting the first $s$ rows.

Suppose:

$A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$

The ∆least-squares-sketch-and-solve algorithm approximates this problem by:

Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$

@State a theorem characterising the error in the approximate solutions.

sketch-and-solve-error-bound

Suppose:

$A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$
$x^\ast = \text{argmin} _ x \vert \vert Ax - b \vert \vert _ 2$
$G \in \mathbb R^{s \times m}$ is a Gaussian matrix with $s > cn$ and $c > 1$
$\hat x = \text{argmin} _ x \vert \vert G(Ax - b) \vert \vert _ 2$

Then, with high probability:

\[ \vert \vert A\hat x - b \vert \vert _ 2 \le \frac{\sqrt s + \sqrt{n+1}}{\sqrt{s} - \sqrt{n+1}} \vert \vert Ax^\ast - b \vert \vert _ 2\]

Suppose:

$A \in \mathbb R^{m \times n}$ is tall-skinny ($m \gg n$) and $\text{rank}(A) = n$
We are considering the least-squares problem $\min _ x \vert \vert Ax - b \vert \vert _ 2$

The ∆least-squares-sketch-and-solve algorithm approximates this problem by:

Generating a random sketch matrix $G \in \mathbb R^{s \times m}$, where $s = cn$ for $c > 1$, typically Gaussian for analysis
Computing $GA$ and $Gb$, and solves the $s \times n$ least-squares problem $\min _ x \vert \vert G(Ax - b) \vert \vert _ 2$

@Prove that then if:

$x^\ast = \text{argmin} _ x \vert \vert Ax - b \vert \vert _ 2$
$G \in \mathbb R^{s \times m}$ is a Gaussian matrix with $s > cn$ and $c > 1$
$\hat x = \text{argmin} _ x \vert \vert G(Ax - b) \vert \vert _ 2$

Then, with high probability:

\[ \vert \vert A\hat x - b \vert \vert _ 2 \le \frac{\sqrt s + \sqrt{n+1}}{\sqrt{s} - \sqrt{n+1}} \vert \vert Ax^\ast - b \vert \vert _ 2\]

sketch-and-solve-error-bound-proof

The trick: view every residual $Av - b$ as a single vector in a fixed $(n+1)$-dimensional subspace, then show the sketch $G$ is a near-isometry on it.

Residual as one vector. For any $v$,

\[Av - b = [A,\, b] \begin{bmatrix} v \\ -1 \end{bmatrix}.\]

Take the thin QR factorisation $[A, b] = QR$, with $Q \in \mathbb R^{m \times (n+1)}$ having orthonormal columns and $R \in \mathbb R^{(n+1) \times (n+1)}$ upper triangular, and set $w := R \begin{bmatrix} v \\ -1 \end{bmatrix} \in \mathbb R^{n+1}$. Then

\[Av - b = QR \begin{bmatrix} v \\ -1 \end{bmatrix} = Qw, \qquad G(Av - b) = (GQ) w,\]

and since $Q$ has orthonormal columns, $ \vert \vert Av - b \vert \vert _ 2 = \vert \vert Qw \vert \vert _ 2 = \vert \vert w \vert \vert _ 2$.

Near-isometry of the sketch. By ∆orthogonal-invariance-of-gaussian-matrices, $GQ \in \mathbb R^{s \times (n+1)}$ is again Gaussian, so by ∆marchenko-pastur its singular values lie in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$. Hence for every $w$,

\[(\sqrt s - \sqrt{n+1}) \vert \vert w \vert \vert _ 2 \le \vert \vert (GQ) w \vert \vert _ 2 \le (\sqrt s + \sqrt{n+1}) \vert \vert w \vert \vert _ 2,\]

so the stretch is $\approx \sqrt s$, not $\approx 1$ (a Gaussian with $N(0,1)$ entries enlarges every direction by about $\sqrt s$). Translating back via $ \vert \vert w \vert \vert _ 2 = \vert \vert Av - b \vert \vert _ 2$ and $ \vert \vert (GQ) w \vert \vert _ 2 = \vert \vert G(Av - b) \vert \vert _ 2$ gives, for every $v$,

\[(\sqrt s - \sqrt{n+1}) \vert \vert Av - b \vert \vert _ 2 \le \vert \vert G(Av - b) \vert \vert _ 2 \le (\sqrt s + \sqrt{n+1}) \vert \vert Av - b \vert \vert _ 2. \qquad (\ast)\]

The bound holds uniformly in $v$, which is essential: it must apply to the $G$-dependent minimiser $\hat x$, so a single-vector concentration bound would not do.

Combine.

\[\begin{aligned} \vert \vert A\hat x - b \vert \vert _ 2 &\le \frac{1}{\sqrt s - \sqrt{n+1}} \vert \vert G(A\hat x - b) \vert \vert _ 2 && (\star 1) \\ &\le \frac{1}{\sqrt s - \sqrt{n+1}} \vert \vert G(Ax^\ast - b) \vert \vert _ 2 && (\star 2) \\ &\le \frac{\sqrt s + \sqrt{n+1}}{\sqrt s - \sqrt{n+1}} \vert \vert Ax^\ast - b \vert \vert _ 2 && (\star 3) \end{aligned}\]

as required. The absolute $\sqrt s$ never appears in the final bound: it is the ratio of $GQ$’s two extreme singular values, both $\approx \sqrt s$.

$(\star 1)$ lower bound of $(\ast)$ at $v = \hat x$.
$(\star 2)$ optimality: $\hat x$ minimises the sketched residual, so $ \vert \vert G(A\hat x - b) \vert \vert _ 2 \le \vert \vert G(Ax^\ast - b) \vert \vert _ 2$.
$(\star 3)$ upper bound of $(\ast)$ at $v = x^\ast$.

Blendenpik

Bite-sized

For sketch-and-solve least-squares with $G \in \mathbb R^{s \times m}$ Gaussian, choosing $s = 4(n + 1)$ makes the residual-bound prefactor $\frac{\sqrt s + \sqrt{n+1}}{\sqrt s - \sqrt{n+1}}$ equal to $3$.

Source: Lecture 14, Sketch and solve for $\min _ x \ \vert Ax - b\ \vert _ 2$ slide and §14.4 of the lecture notes (interpretation of Theorem 14.2).

@bite~

bite-sketch-and-solve-prefactor

Why does randomly mixing the rows (sketch-and-solve) succeed where deterministic row subset selection can fail catastrophically?

bite-why-sketching-beats-row-selection

In subset selection, if the chosen rows happen to be those with anomalous structure (e.g., the $\epsilon I _ n$ block in the failure example), the small subproblem is misleading and $\hat x$ can be arbitrarily bad. A random Gaussian (or FFT-type) sketch spreads each row’s signal across the sketched rows, so with high probability no submatrix is degenerate — Marchenko-Pastur then guarantees a near-isometric embedding.

Source: Lecture 14, ‘Fast’ (but fragile) alg for $\min _ x \|Ax - b\| _ 2$ and Sketch and solve for $\min _ x \|Ax - b\| _ 2$ slides; §14.3-14.4 of the lecture notes.

@bite~

Sketch-and-solve cost: forming the dense Gaussian sketch $GA$ already takes $O(smn)$ flops, which is no faster than the direct QR-based solver. The speedup comes from structured sketches (e.g., $G = SFD$ with FFT, “SRFT”), reducing the cost to $O(mn \log s + s n^2)$.

Source: Lecture 14, Complexity and faster sketches discussion in §14.4 of the lecture notes.

@bite~

bite-sketch-and-solve-cost

In one sentence, what does Blendenpik do?

bite-blendenpik-one-sentence

Blendenpik uses a random sketch $G \in \mathbb R^{4n \times m}$ to build a preconditioner: form $GA = \hat Q \hat R$ (QR factorisation), then solve the original problem $\min _ x \|Ax - b\| _ 2$ in the well-conditioned variables $y = \hat R x$, i.e. run CG on the normal equations of $B = A \hat R^{-1}$ (with $\kappa _ 2(B) = O(1)$) and recover $x = \hat R^{-1} y$. The change of variables is exact ($By = Ax$), so this returns the true minimiser to full accuracy, in $O(mn \log m + n^3)$ rather than $O(m n^2)$.

Source: Lecture 14, Sketch-to-precondition: Blendenpik slide and §14.5 of the lecture notes.

@bite~

Blendenpik forms $GA = \hat Q \hat R$ with $G \in \mathbb R^{4n \times m}$ Gaussian. The preconditioned matrix $A \hat R^{-1}$ has condition number $O(1)$ with high probability, justified by applying Marchenko-Pastur to $GQ$ (where $A = QR$ is the thin QR of $A$).

Source: Lecture 14, Explaining $\kappa _ 2(A \hat R^{-1}) = O(1)$ via Marchenko-Pastur slide and §14.5.1 of the lecture notes (Theorem 14.3).

@bite~

bite-blendenpik-well-conditioned-preconditioner

Inside Blendenpik, once $\hat R$ is computed, the inner solver is CG applied to the normal equation $B^\top B y = B^\top b$ with $B = A \hat R^{-1}$. Because $\kappa _ 2(B^\top B) = O(1)$, CG reaches a target tolerance $\tau$ on the relative residual in $O(\log(1/\tau))$ iterations, each costing $O(mn)$. ($\tau$ is the iteration tolerance — distinct from the course’s standing $\epsilon = O(u)$ for unit roundoff.)

Source: Lecture 14, Blendenpik: solving $\min _ x \ \vert Ax - b\ \vert _ 2$ using $\hat R$ slide and §14.5.2 of the lecture notes.

@bite~

bite-blendenpik-inner-solver-cost

Strategy for proving the sketch-and-solve error bound (∆sketch-and-solve-error-bound-proof).

bite-sketch-and-solve-strategy

Reduce the problem to a property of $GQ$ where $Q$ is from the QR factorisation $[A, b] = QR$. Then $\|Av - b\| _ 2 = \|Qw\| _ 2$ and $\|G(Av - b)\| _ 2 = \|(GQ)w\| _ 2$.
Argue $GQ$ is itself Gaussian by orthogonal invariance of Gaussian matrices, since $Q$ is independent of $G$.
Apply Marchenko-Pastur to the rectangular $s \times (n+1)$ Gaussian $GQ$: all its singular values lie in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$.
Conclude a two-sided sandwich $\|G(Av - b)\| _ 2 \approx \|Av - b\| _ 2$ uniformly in $v$, then plug $v = \hat x$ and $v = x^\ast$ and use that $\hat x$ minimises the sketched residual.

Source: Lecture 14, Theorem 14.2 in §14.4 of the lecture notes.

@bite~ @proofsupport~

The load-bearing two-sided inequality in the sketch-and-solve proof: for any vector $v$, $(\sqrt s - \sqrt{n+1}) \ \vert Av - b\ \vert _ 2 \le <span class="cloze" tabindex="0">\ \vert G(Av - b)\ \vert _ 2</span> \le <span class="cloze" tabindex="0">(\sqrt s + \sqrt{n+1}) \ \vert Av - b\ \vert _ 2</span>$. The bound depends on $s, n$ but not on $v$ — it is a uniform near-isometry on $\mathrm{range}([A, b])$.

Source: Lecture 14, §14.4 of the lecture notes (proof of Theorem 14.2).

@bite~ @proofsupport~

bite-sketch-and-solve-key-inequality-mp-sandwich

Where does the hypothesis “$G$ is Gaussian” enter the sketch-and-solve error proof (∆sketch-and-solve-error-bound-proof)?

bite-sketch-and-solve-hypothesis-gaussian

In exactly two places, both via the rectangular random matrix $GQ$:

Orthogonal invariance of Gaussian matrices: applied to argue that $GQ$ has the same distribution as a fresh Gaussian. This uses $Q$ orthogonal and independent of $G$.
Marchenko-Pastur conditioning: applied to $GQ$ to bound its singular values in $[\sqrt s - \sqrt{n+1}, \sqrt s + \sqrt{n+1}]$. Strictly, MP itself only needs i.i.d. mean-$0$ variance-$1$ entries — but here orthogonal invariance is the step that gets us to a fresh random matrix with i.i.d. entries.

So if $G$ were merely a non-Gaussian sub-Gaussian sketch, MP-style conditioning of $GQ$ would still go through, but only via a more delicate non-invariance-based argument.

Source: Lecture 14, §14.4 of the lecture notes (proof of Theorem 14.2 — orthogonal invariance + Marchenko-Pastur citation).

@bite~ @proofsupport~

Flashcards

Row subset selection

Sketch-and-solve

Blendenpik

Bite-sized

Related posts