NLA MT25, Conjugate gradient method


Flashcards

What is the main idea of the conjugate gradient method for solving positive definite linear systems $Ax = b$?


Solve instead

\[Q _ k^\top (AQ _ k y - b) = 0\]

or equivalently

\[T _ k y - Q _ k^\top b = 0\]

where $x = Q _ k y$, and $Q _ k, T _ k$ come from the Lanczos decomposition

\[A Q _ k = Q _ k T _ k + q _ {k+1} [0, \ldots, 0, t _ {k+1, k}]\]

@Define Galerkin orthogonality in the context of the conjugate gradient method.


The requirement that

\[Q _ k^\top (AQ _ k y - b) = 0\]

i.e. that the residual $Ax - b$ is orthogonal to $Q _ k$.

In the conjugate gradient method, we solve positive definite linear systems $Ax = b$ by instead solving

\[Q _ k^\top (AQ _ k y - b) = 0\]

where

\[AQ _ k = Q _ k T _ k + q _ {k+1} [0, \ldots, 0, t _ {k+1, k}]\]

and $T _ k$ is symmetric tridiagonal. What is the motivation behind this?


The CG algorithm minimises the $A$-norm of the error in the Krylov subspace

\[\begin{aligned} x _ k &= \text{argmin} _ {x \in Q _ k} \vert \vert x - x _ \ast \vert \vert _ A \\ \end{aligned}\]

Writing $x _ k = Q _ k y$, we have

\[\begin{aligned} (x _ k - x _ \ast)^\top A (x _ k - x _ \ast) &= (Q _ k y - x _ \ast)^\top A (Q _ k y - x _ \ast) \\ &= y^\top (Q _ k^\top A Q _ k) y - 2b^\top Q _ k y + b^\top x _ \ast \end{aligned}\]

Differentiating with respect to $y$, we obtain the minimum as

\[y = (Q _ k^\top A Q _ k)^{-1} Q _ k^\top b\]

so $Q _ k^\top (AQ _ k y - b) = 0$.

Another way to see this is via the results in [[Notes - Numerical Analysis HT24, Best approximation in inner product spaces]]; the minimum in the $A$-norm will be attained when $\langle q _ i, x _ \ast - x\rangle _ A = 0$ for all $q _ i$.

@justify~

@State the CG @algorithm for solving positive definite linear systems $Ax = b$.


  • Initialisation
    • $x _ 0 = 0$
    • $r _ 0 = -b$
    • $p _ 0 = r _ 0$
  • For $k = 1, 2, 3, \ldots$
    • $\alpha _ k = \langle r _ k, r _ k \rangle / \langle p _ k, Ap _ k \rangle$
    • $x _ {k+1} = x _ k + \alpha _ k p _ k$
    • $r _ {k+1} = r _ k - \alpha _ k A p _ k$
    • $\beta _ k = \langle r _ {k+1}, r _ {k+1} \rangle / \langle r _ k, r _ k \rangle$
    • $p _ {k+1} = r _ {k+1} + \beta _ k p _ k$

where $r _ k = b - Ax _ k$ is the residual and $p _ k$ is the search direction.

In the conjugate gradient method, we solve positive definite linear systems $Ax = b$ by instead solving

\[Q _ k^\top (AQ _ k y - b) = 0\]

where

\[AQ _ k = Q _ k T _ k + q _ {k+1} [0, \ldots, 0, t _ {k+1, k}]\]

and $T _ k$ is symmetric tridiagonal.

@Prove that this may be implemented via the algorithm:

  • Initialisation
    • $x _ 0 = 0$
    • $r _ 0 = -b$
    • $p _ 0 = r _ 0$
  • For $k = 1, 2, 3, \ldots$
    • $\alpha _ k = \langle r _ k, r _ k \rangle / \langle p _ k, Ap _ k \rangle$
    • $x _ {k+1} = x _ k + \alpha _ k p _ k$
    • $r _ {k+1} = r _ k - \alpha _ k A p _ k$
    • $\beta _ k = \langle r _ {k+1}, r _ {k+1} \rangle / \langle r _ k, r _ k \rangle$
    • $p _ {k+1} = r _ {k+1} + \beta _ k p _ k$

where $r _ k = b - Ax _ k$ is the residual and $p _ k$ is the search direction.


@todo, once have done sheet 4.

Suppose:

  • $A \in \mathbb R^{n \times n}$, $A \succ 0$
  • $b \in \mathbb R^n$

@State a theorem describing the convergence of the CG algorithm applied to the linear system $Ax = b$.


Let $e _ k := x _ \ast - x _ k$ be the error after the $k$th CG iteration where $x _ \ast$ is the exact solution. Then

\[\frac{ \vert \vert e _ k \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k\]

Suppose:

  • $A \in \mathbb R^{n \times n}$, $A \succ 0$
  • $b \in \mathbb R^n$
  • $Ax _ \ast = b$
  • $e _ k := x _ \ast - x _ k$ be the error after the $k$th CG iteration

@Prove that then

\[\frac{ \vert \vert e _ k \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k\]

You may assume that

\[\min _ {p \in \mathcal P _ k, p(0) = 1} \max _ {x \in [\lambda _ \min(A), \lambda _ \max(A)]} \vert p(x) \vert \le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k\]

(∆bound-polynomails-over-eigenvalue-interval)


We have $e _ 0 = x _ \ast$ since $x _ 0 = 0$. Then

\[\begin{aligned} \frac{ \vert \vert e _ k \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} &= \min _ {x \in \mathcal K _ k(A, b)} \frac{ \vert \vert x _ k - x _ \ast \vert \vert _ A}{ \vert \vert x _ \ast \vert \vert _ A} \\ &= \min _ {p _ {k-1} \in \mathcal P _ {k-1}} \frac{ \vert \vert p _ {k-1}(A) b - A^{-1} b \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \\ &= \min _ {p _ {k-1} \in \mathcal P _ {k-1}} \frac{ \vert \vert (p _ {k-1}(A) A - I)e _ 0 \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \\ &= \min _ {p \in \mathcal P _ k, p(0) = 1} \frac{ \vert \vert p(A) e _ 0 \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \\ &= \min _ {p \in \mathcal P _ k, p(0) = 1} \frac{\left \vert \left \vert V\begin{bmatrix} p(\lambda _ 1) & & \\ & \ddots & \\ & & p(\lambda _ n) \end{bmatrix} V^\top e _ 0 \right \vert \right \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \end{aligned}\]

Then

\[\begin{aligned} \left \vert \left \vert V\begin{bmatrix} p(\lambda _ 1) & & \\ & \ddots & \\ & & p(\lambda _ n) \end{bmatrix} V^\top e _ 0 \right \vert \right \vert _ A &= \sum _ i \lambda _ i p(\lambda _ i)^2 (V^\top e _ 0)^2 _ i \\ &\le \max _ j p(\lambda _ j)^2 \sum _ i \lambda _ i (V^\top e _ 0) _ i^2 \\ &= \max _ j p(\lambda _ j)^2 \vert \vert e _ 0 \vert \vert ^2 _ A \end{aligned}\]

Thus

\[\begin{aligned} \frac{ \vert \vert e _ k \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} &\le \min _ {p \in \mathcal P _ k, p(0) = 1} \max _ j \vert p(\lambda _ j) \vert \\ &\le \min _ {p \in \mathcal P _ k, p(0) = 1} \max _ {x \in [\lambda _ \min(A), \lambda _ \max(A)]} \vert p(x) \vert \\ &\le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k \end{aligned}\]

@Prove using results about Chebyshev polynomials that

\[\min _ {p \in \mathcal P _ k, p(0) = 1} \max _ {x \in [\lambda _ \min(A), \lambda _ \max(A)]} \vert p(x) \vert \le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k\]

where $A$ is a matrix with minimum and maximum eigenvalue $\lambda _ \min$ and $\lambda _ \max$.


For ease of notation, denote $a = \lambda _ \min$ and $b = \lambda _ \max$. Let

\[p(x) = c _ k T _ k\left(\frac{2x - b - a}{b - a}\right)\]

where $c _ k = 1 / T _ k \left(\frac{-b-a}{b-a}\right)$. Then $p(0) = 1$, and by results about Chebyshev polynomials we have

\[\vert p(x) \vert \le 1/ \vert T _ k\left(\frac{b+a}{b-a}\right)\]

on $x \in [a, b]$.

Furthermore, we have

\[T _ k(z) = \frac 1 2 (z^k + z^{-k})\]

with $\frac 1 2 (z + z^{-1}) = \frac{b+a}{b-a}$, and hence

\[\begin{aligned} z &= \frac{\sqrt{b/a} + 1}{\sqrt{b / a}-1} \\ &= \frac{\sqrt{\kappa _ 2(A)} + 1}{\sqrt{\kappa _ 2(A)} - 1} \end{aligned}\]

Hence

\[\begin{aligned} \vert p(x) \vert &\le \frac{1}{T _ k\left(\frac{b+a}{b-a}\right)} \\ &\le 2 \left( \frac{\sqrt \kappa _ 2(A) - 1}{\sqrt \kappa _ 2(A) + 1} \right)^k \end{aligned}\]

We have the ∆convergence-of-conjugate-gradient result that if:

  • $A \in \mathbb R^{n \times n}$, $A \succ 0$
  • $b \in \mathbb R^n$
  • $e _ k := x _ \ast - x _ k$ be the error after the $k$th CG iteration where $x _ \ast$ is the exact solution

then:

\[\frac{ \vert \vert e _ k \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k\]

Intuitively, what does this mean in terms of when CG will be slow?


When the eigenvalues are not clustered away from $0$.

We have the ∆convergence-of-conjugate-gradient result that if:

  • $A \in \mathbb R^{n \times n}$, $A \succ 0$
  • $b \in \mathbb R^n$
  • $e _ k := x _ \ast - x _ k$ be the error after the $k$th CG iteration where $x _ \ast$ is the exact solution

then:

\[\frac{ \vert \vert e _ k \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k\]

This means that CG will be slow if the eigenvalues are not distributed in a favourable manner, like not being clustered away from $0$.

To solve this, we can use preconditioning. What does the preconditioning setup look like in this context?


Find some matrix $M$ such that

\[M^\top M \approx A^{-1}\]

and solve

\[M^\top A M y = M^\top b, \quad My = x\]

We have the ∆convergence-of-conjugate-gradient result that if:

  • $A \in \mathbb R^{n \times n}$, $A \succ 0$
  • $b \in \mathbb R^n$
  • $e _ k := x _ \ast - x _ k$ be the error after the $k$th CG iteration where $x _ \ast$ is the exact solution

then:

\[\frac{ \vert \vert e _ k \vert \vert _ A}{ \vert \vert e _ 0 \vert \vert _ A} \le 2\left(\frac{\sqrt{\kappa _ 2(A)} - 1}{\sqrt{\kappa _ 2(A)} + 1}\right)^k\]

This means that CG will be slow if the eigenvalues are not distributed in a favourable manner, like not being clustered away from $0$.

To solve this, we can use preconditioning where we find some matrix $M$ such that

\[M^\top M \approx A^{-1}\]

and solve

\[M^\top A M y = M^\top b, \quad My = x\]

What are the desiderata of $M$ in this context?


  • $M^\top A M$ is easy to multiply to a vector
  • $M^\top A M$ has eigenvalues has clustered eigenvalues away from $0$



Related posts