Continuous Optimisation HT26, Least-squares


Flashcards

Basic definitions

@State what is meant by a least-squares problem.

We have $r : \mathbb R^n \to \mathbb R^m$ with $m \ge n$, and want $r(x) = 0$ or $r(x) \approx 0$ via the objective

\[\min _ {x \in \mathbb R^n} f(x) := \frac 1 2 \sum^m _ {j=1} [r _ j(x)]^2 = \frac 1 2 \vert \vert r(x) \vert \vert ^2\]

Suppose we have the (potentially nonlinear) least-squares problem $r : \mathbb R^n \to \mathbb R^m$ with $m \ge n$, and want $r(x) = 0$ or $r(x) \approx 0$ via the objective

\[\min _ {x \in \mathbb R^n} f(x) := \frac 1 2 \sum^m _ {j=1} [r _ j(x)]^2 = \frac 1 2 \vert \vert r(x) \vert \vert ^2\]

What is $\nabla f(x)$ in this case? @Justify your answer.

\[\nabla f(x) = J(x)^\top r(x)\]

where $J(x)$ is the Jacobian of $r$ at $x$.


\[\frac{\partial f}{\partial x _ i}(x) = \sum^m _ {j = 1} r _ j(x) \frac{\partial r _ j}{\partial x _ i}(x) = r(x)^\top \begin{pmatrix} \frac{\partial r _ 1}{\partial x _ i}(x) \\ \vdots \\ \frac{\partial r _ m}{\partial x _ i}(x) \end{pmatrix}\]

Suppose we have the (potentially nonlinear) least-squares problem $r : \mathbb R^n \to \mathbb R^m$ with $m \ge n$, and want $r(x) = 0$ or $r(x) \approx 0$ via the objective

\[\min _ {x \in \mathbb R^n} f(x) := \frac 1 2 \sum^m _ {j=1} [r _ j(x)]^2 = \frac 1 2 \vert \vert r(x) \vert \vert ^2\]

What is $\nabla^2 f(x)$ in this case?

\[\nabla^2 f(x) = J(x)^\top J(x) + \sum^m _ {j=1} r _ j(x) \nabla^2 r _ j(x)\]

Gauss-Newton method

Suppose we have the (potentially nonlinear) least-squares problem $r : \mathbb R^n \to \mathbb R^m$ with $m \ge n$, and want $r(x) = 0$ or $r(x) \approx 0$ via the objective

\[\min _ {x \in \mathbb R^n} f(x) := \frac 1 2 \sum^m _ {j=1} [r _ j(x)]^2 = \frac 1 2 \vert \vert r(x) \vert \vert ^2\]

Recall that in this case:

  • $\nabla f(x) = J(x)^\top r(x)$
  • $\nabla^2 f(x) = J(x)^\top J(x) + \sum^m _ {j=1} r _ j(x) \nabla^2 r _ j(x)$

where $J(x)$ is the Jacobian of $r$ at $x$. @State and motivate the Gauss-Newton direction used as a (hopefully) descent direction in the Gauss-Newton method for nonlinear least-squares.

When $r _ j(x^\ast)\approx 0$ or $\nabla^2 r _ j(x^\ast)$ is small, then $r _ j(x) \nabla^2 r _ j(x)$ is small. Hence when $x$ is close to $x^\ast$, $\nabla^2 f(x) \approx J(x)^\top J(x)$. We define

\[\widetilde{\nabla^2}f(x) := J(x)^\top J(x)\]

and the Gauss-Newton direction $s^k$ as satisfying

\[\widetilde{\nabla^2}f(x)s^k = -\nabla f(x^k)\]

i.e.

\[J(x^k)^\top J(x^k) s^k = -J(x^k)^\top r(x^k)\]

and hence $s^k$ solves the linear least squares problem

\[\min _ {s \in \mathbb R^n} \frac 1 2 \vert \vert J(x^k)s + r(x^k) \vert \vert ^2\]

@exam~

Suppose we have the (potentially nonlinear) least-squares problem $r : \mathbb R^n \to \mathbb R^m$ with $m \ge n$, and want $r(x) = 0$ or $r(x) \approx 0$ via the objective

\[\min _ {x \in \mathbb R^n} f(x) := \frac 1 2 \sum^m _ {j=1} [r _ j(x)]^2 = \frac 1 2 \vert \vert r(x) \vert \vert ^2\]

Recall that in this case:

  • $\nabla f(x) = J(x)^\top r(x)$
  • $\nabla^2 f(x) = J(x)^\top J(x) + \sum^m _ {j=1} r _ j(x) \nabla^2 r _ j(x)$

where $J(x)$ is the Jacobian of $r$ at $x$. In this context, we the Gauss-Newton method picks the direction

\[s^k = -(J(x^k)^\top J(x^k))^{-1}J(x^k)^\top r(x^k)\]

Under what conditions is $s^k$ a descent direction? Why?

$s^k$ is a descent direction, provided $J(x^k)$ has full column rank (since if $J(x^k)$ has full column rank, $J(x^k)^\top J(x^k)$ is positive definite).

@exam~

Suppose we have the (potentially nonlinear) least-squares problem $r : \mathbb R^n \to \mathbb R^m$ with $m \ge n$, and want $r(x) = 0$ or $r(x) \approx 0$ via the objective

\[\min _ {x \in \mathbb R^n} f(x) := \frac 1 2 \sum^m _ {j=1} [r _ j(x)]^2 = \frac 1 2 \vert \vert r(x) \vert \vert ^2\]

In this context, describe the Gauss-Newton method for nonlinear least squares.

Suppose we have $\epsilon > 0$ and $x^0 \in \mathbb R^n$.

  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\widetilde{\nabla^2} f(x^k) s^k = -\nabla f(x^k)$.
    • Set $x^{k+1} = x^k + \alpha^k s^k$ with $\alpha^k \in (0, 1]$ (found e.g. via backtracking Armijo)
    • $k = k+1$

where

\[\widetilde{\nabla^2}f(x) := J(x)^\top J(x)\]

motivated by the fact that when $r _ j(x^\ast)\approx 0$ or $\nabla^2 r _ j(x^\ast)$ is small, then $r _ j(x) \nabla^2 r _ j(x)$ is small. Hence when $x$ is close to $x^\ast$, $\nabla^2 f(x) \approx J(x)^\top J(x)$.

Suppose we have the (potentially nonlinear) least-squares problem $r : \mathbb R^n \to \mathbb R^m$ with $m \ge n$, and want $r(x) = 0$ or $r(x) \approx 0$ via the objective

\[\min _ {x \in \mathbb R^n} f(x) := \frac 1 2 \sum^m _ {j=1} [r _ j(x)]^2 = \frac 1 2 \vert \vert r(x) \vert \vert ^2\]

In this context, the Gauss-Newton method for nonlinear least-squares calculates

  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\widetilde{\nabla^2} f(x^k) s^k = -\nabla f(x^k)$.
    • Set $x^{k+1} = x^k + \alpha^k s^k$ with $\alpha^k \in (0, 1]$ (found e.g. via backtracking Armijo)
    • $k = k+1$

@State a result about the global and local convergence of this method.

  • Global convergence: If $J(x^k)$ is uniformly full-rank for all $x^k$, and $\nabla f$ Lipschitz continuous, then $ \vert \vert \nabla f(x^k) \vert \vert = \vert \vert J(x^k)^\top r(x^k) \vert \vert \to 0$ as $k \to \infty$ (although $\nabla f(x) = 0$ does not imply $r(x) = 0$).
  • Local convergence: If $r(x^\ast) = 0$, $J(x^\ast)$ full-rank, $\alpha^k = 1$ for all $k$ and we have the conditions of ∆newtons-method-convergence (i.e. $f \in \mathcal C^2$, $\nabla f(x^\ast) = 0$, $\nabla^2 f(x^\ast)$ nonsingular, $\nabla^2 f$ locally Lipschitz at $x^\ast$, and $x^{k _ 0}$ sufficiently close to $x^\ast$), then $x^k \to x^\ast$ quadratically.

@exam~

Bite-sized

A least-squares problem is zero-residual if $r(x^\ast) = 0$ at the minimiser, and nonzero-residual otherwise. Gauss-Newton converges quadratically in the zero-residual case but only linearly in the nonzero-residual case.

Source: Lecture 7, Nonlinear Least-Squares (NLS) slide.

@bite~

For the linear least-squares problem $\min _ {x} \tfrac{1}{2} \ \vert Jx + r\ \vert ^2$ with $J \in \mathbb R^{m \times n}$, $m \ge n$, the global minimiser satisfies the normal equations $J^\top J x^\ast = -J^\top r$. Solvers in practice: Cholesky on $J^\top J$, or (more stably) QR or SVD of $J$.

Source: Lecture 7, Linear Least-Squares (LLS) slide.

@bite~

The Gauss-Newton step $s^k$ solves the linear least-squares subproblem $\min _ s \tfrac{1}{2} \ \vert J(x^k) s + r(x^k)\ \vert ^2$ — i.e. it linearises the residual $r$ at $x^k$ and minimises the resulting squared norm. This view makes it clear that Gauss-Newton reuses the LLS solver in the inner loop.

Source: Lecture 7, The Gauss-Newton Method slide.

@bite~

Geometrically, linear least-squares $\min _ x \tfrac{1}{2}\ \vert Ax - b\ \vert ^2$ finds the orthogonal projection of $b$ onto the column space of $A$. The normal equations $A^\top A x^\ast = A^\top b$ are exactly the orthogonality condition $A^\top(Ax^\ast - b) = 0$, i.e. the residual $Ax^\ast - b$ is orthogonal to every column of $A$.

Source: Lecture 7, Linear Least-Squares (LLS) geometric remark.

@bite~

Why does Gauss-Newton only converge linearly in the nonzero-residual case? Because the omitted term in $\nabla^2 f = J^\top J + \sum _ j r _ j \nabla^2 r _ j$ is no longer small at $x^\ast$ (since $r _ j(x^\ast) \ne 0$), so $\widetilde{\nabla^2}f = J^\top J$ is a bad approximation to the true Hessian, and Gauss-Newton loses Newton’s local quadratic rate. The asymptotic rate becomes $ \vert \lambda \vert $ where $\lambda$ is the largest eigenvalue of $(J^\top J)^{-1} \sum _ j r _ j \nabla^2 r _ j$ at $x^\ast$.

Source: Lecture 7, Gauss-Newton vs. Newton slide and numerical example.

@bite~