NLA MT25, Krylov subspace methods
Flashcards
Suppose we are solving linear systems $Ax = b$ or eigenvalue problems $Ax = \lambda x$. What is the general idea of Krylov subspace methods?
Look for approximate solutions $\hat x \in \mathcal K _ {r}(A, v)$, where
\[\mathcal K _ r(A, v) = \text{span}(v, Av, \ldots, A^{r-1} v)\]where $v \in \mathbb R^n$, or equivalently $\hat x = p _ {r-1}(A)v$, where $p \in \mathbb C _ {r-1}[t]$ is a degree $r-1$ polynomial.
Why is this useful?
- Only matrix-vector products are needed to form this basis,
- The polynomial in $A^i$ is chosen optimally for the problem (compare to the power method, which fixes $p(z) = z^{r-1}$).
- Often only $O(1)$ iterations suffice when the spectrum is well-behaved.
@Define the order-$r$ Krylov subspace $\mathcal K _ r(A, v)$.
Orthonormal bases of Krylov subspaces
This is worth considering because orthonormal bases of Krylov subspaces are very useful in iterative algorithms, such as the Notes - NLA MT25, GMRES algorithmU. A better approach than using a generic approach (e.g. QR decomposition on a matrix of columns) applied to $\{v, Av, \ldots, A^{r-1}v\}$ is given in Notes - NLA MT25, Arnoldi decomposition and iterationU.
Why is it useful to have an orthonormal basis of the Krylov subspace $\mathcal K _ r(A, v)$, rather than any basis (e.g. $[v, Av, \ldots, A^{r-1}v]$ itself)?
Many Krylov methods (e.g. ∆gmres-algorithm, ∆conjugate-gradient-algorithm, Arnoldi-eigenvalue) seek approximate solutions in the form $\hat x = Q _ r y \in \mathcal K _ r(A, v)$. For this to be efficient and stable, we want $Q _ r$ to be orthonormal.
Consider the order-$r$ Krylov subspace $\mathcal K _ r(A, v)$.
Orthonormal bases of these subspaces turn out to be very useful in iterative algorithms. What’s a naïve way of computing an orthonormal basis of $\mathcal K _ r(A, v)$, and why is this naïve?
Form the matrix
\[K = \Bigg[ v \, \Bigg \vert \, Av \, \Bigg \vert \, \cdots \, \Bigg \vert \, A^{r-1} v \Bigg]\]and orthogonalise, e.g. by finding the QR decomposition $K = QR$. This is naïve, because $K$ is typically very poorly conditioned (its columns increasingly align with dominant eigenvector of $A$), and so the computed $\hat Q$ loses orthogonality in floating-point arithmetic.
Bite-sized
The power method is a special case of a Krylov method with $\hat x = A^{k-1} v = p _ {k-1}(A) v$ where $p _ {k-1}(z) = $ $z^{k-1}$. Krylov methods improve on this by choosing the polynomial $p _ {k-1}$ optimally for the problem (residual minimisation, $A$-norm error minimisation, etc.), not just as a monomial.
Every vector in $\mathcal K _ k(A, v)$ can be written as $\hat x = $ $p _ {k-1}(A) v$ for some polynomial $p _ {k-1}$ of degree at most $k - 1$. This polynomial viewpoint is what makes the convergence analysis of Krylov methods tractable (everything reduces to polynomial approximation on the spectrum of $A$).
The Krylov subspace is shift-invariant: $\mathcal K _ k(A - sI, v) = $ $\mathcal K _ k(A, v)$. This is why shift-and-invert Arnoldi changes only the matrix being inverted, not the subspace structure.
When does a Krylov subspace method converge in $O(1)$ iterations independent of the matrix size $n$?
When the spectrum of $A$ is “well-behaved” for polynomial approximation: either (a) eigenvalues clustered away from 0, so a polynomial $p(z) = (1 - z/c)^k$ with $ \vert c \vert > r$ shrinks geometrically on $\lambda(A)$; or (b) only a small number $k \ll n$ of distinct eigenvalues, so $p(z) = \prod _ i (z - \lambda _ i)/(0 - \lambda _ i)$ vanishes at every eigenvalue and convergence terminates in $\le k$ steps. In both cases the rate doesn’t depend on $n$ — the practical appeal of Krylov methods on large sparse problems.