Continuous Optimisation HT26, Newton's method


Flashcards

Basic setup

Suppose $B^k$ is a symmetric, positive definite matrix. Suppose we choose directions $s^k$ where

\[B^k s^k = -\nabla f(x^k)\]

@Justify that $s^k$ is a descent direction.


Recall that a ∆descent-direction is one satisfying $\nabla f(x^k)^\top s^k < 0$. Since

\[\nabla f(x^k)^\top s^k = -\nabla f(x^k)^\top (B^k)^{-1} \nabla f(x^k)\]

and $(B^k)^{-1}$ is positive definite (since $B^k$ is), whenever $\nabla f(x^k) \ne 0$ we have $\nabla f(x^k)^\top s^k < 0$.
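A quick numerical sanity check of this argument, with a made-up positive definite $B^k$ and gradient vector:

```python
import numpy as np

# Made-up test data: B stands in for a symmetric positive definite B^k,
# g for a nonzero gradient vector grad f(x^k).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = A @ A.T + 4 * np.eye(4)    # symmetric positive definite by construction
g = rng.standard_normal(4)
s = np.linalg.solve(B, -g)     # solve B s = -g for the search direction
slope = g @ s                  # equals -g^T B^{-1} g, negative by PD of B^{-1}
```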

Suppose $B^k$ is a symmetric, positive definite matrix. Suppose we choose directions $s^k$ where

\[B^k s^k = -\nabla f(x^k)\]

@Justify why this is a sensible choice in terms of $s^k$ being the minimiser of some function.


$s^k$ uniquely solves

\[\min _ {s \in \mathbb R^n} m _ k(s), \quad m _ k(s) := f(x^k) + \nabla f(x^k)^\top s + \frac 1 2 s^\top B^k s\]

where $m _ k(s)$ is a strictly convex quadratic in $s$ (as $B^k$ is positive definite), so the minimiser is unique. For example, taking $B^k = \nabla^2 f(x^k)$ picks $s^k$ as the minimiser of the second-order Taylor approximation of $f$ around $x^k$.

Newton’s method

Suppose we wish to minimise $f(x)$ subject to $x \in \mathbb R^n$.

@State the @algorithm corresponding to Newton’s method.


  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

where we hope $\nabla^2 f(x^k)$ is positive definite so that $s^k$ is a descent direction.
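The loop above can be sketched numerically (a minimal illustration, with a fixed step $\alpha^k = \alpha \in (0, 1]$ in place of linesearch; the function names and quadratic test problem are illustrative, not from the notes):

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-8, alpha=1.0, max_iter=50):
    """Newton's method with a fixed step alpha in (0, 1] (no linesearch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:           # stopping test ||grad f|| <= eps
            break
        s = np.linalg.solve(hess(x), -g)       # hess f(x^k) s^k = -grad f(x^k)
        x = x + alpha * s
    return x

# Illustrative test problem: f(x) = 1/2 x^T Q x - b^T x, minimised where Q x = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
xmin = newton(lambda x: Q @ x - b, lambda x: Q, [5.0, -5.0])
```

On a strictly convex quadratic the pure step $\alpha = 1$ reaches the minimiser $Q^{-1}b$ in a single iteration, since the quadratic model is exact.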

Recall ∆newtons-method for optimisation:

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

where we hope $\nabla^2 f(x^k)$ is positive definite so that $s^k$ is a descent direction. Suppose this is the case. @Justify the choice of $s^k$.


$s^k$ minimises the second-order Taylor approximation of $f$ around $x^k$.

@Define the Newton direction.


\[s^k = -(\nabla^2 f(x^k))^{-1} \nabla f(x^k)\]

Recall ∆newtons-method for optimisation:

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

@Define what is meant by pure Newton’s method.


$\alpha^k = 1$ for all $k$.

Recall ∆newtons-method for optimisation:

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

@Justify its connection to Newton’s method for rootfinding.


Pure Newton's method is equivalent to applying Newton's method for rootfinding to the system $\nabla f(x) = 0$: the rootfinding iteration $x^{k+1} = x^k - (\nabla^2 f(x^k))^{-1} \nabla f(x^k)$ is exactly the pure Newton step.

Convergence

Recall ∆newtons-method for optimisation:

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

@State a theorem about the local convergence of ∆pure-newtons-method under some smoothness assumptions.


Suppose:

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $\nabla f(x^\ast) = 0$
  • $\nabla^2 f(x^\ast)$ nonsingular
  • $\nabla^2 f$ is locally Lipschitz continuous at $x^\ast$
  • For some $k _ 0 \ge 0$, $x^{k _ 0}$ is sufficiently close to $x^\ast$

Then:

  • $x^k$ is well-defined for all $k \ge k _ 0$
  • $x^k \to x^\ast$ as $k \to \infty$ at a quadratic rate
  • $\nabla f(x^k) \to 0$ at a quadratic rate

@Prove ∆newtons-method-convergence, i.e. that if we use the following algorithm for optimisation

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

where

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $\nabla f(x^\ast) = 0$
  • $\nabla^2 f(x^\ast)$ nonsingular
  • $\nabla^2 f$ is locally Lipschitz continuous at $x^\ast$
  • For some $k _ 0 \ge 0$, $x^{k _ 0}$ is sufficiently close to $x^\ast$

then:

  • $x^k$ is well-defined for all $k \ge k _ 0$
  • $x^k \to x^\ast$ as $k \to \infty$ at a quadratic rate
  • $\nabla f(x^k) \to 0$ at a quadratic rate

Consider the Taylor expansion of $\nabla f$ around $x$:

\[\nabla f(x^\ast) = \nabla f(x) + \nabla^2 f(x) (x^\ast - x) + O( \vert \vert x^\ast - x \vert \vert ^2)\]

where $x$ is sufficiently close to $x^\ast$ and the constant in $O(\cdot)$ depends on the local Lipschitz constant of $\nabla^2 f$ at $x^\ast$. Then since $\nabla f(x^\ast) = 0$, taking $x = x^k$ with $x^k$ sufficiently close to $x^\ast$ gives

\[0 = \nabla f(x^k) + \nabla^2 f(x^k) (x^\ast - x^k) + O( \vert \vert x^\ast - x^k \vert \vert ^2)\]

Then as $\nabla^2 f(x^\ast)$ is nonsingular, by continuity of $\nabla^2 f$ we have that $\nabla^2 f(x^k)$ is nonsingular whenever $x^k$ is sufficiently close to $x^\ast$. Hence the above implies that

\[x^k - x^\ast = (\nabla^2 f(x^k))^{-1} \nabla f(x^k) + O( \vert \vert x^\ast - x^k \vert \vert ^2)\]

Letting $s^k$ be the Newton direction and $x^{k+1} = x^k + s^k$, it follows that

\[x^k - x^\ast = x^k - x^{k+1} + O( \vert \vert x^\ast - x^k \vert \vert ^2)\]

Hence

\[x^{k+1} - x^\ast = O( \vert \vert x^k - x^\ast \vert \vert ^2)\]
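The quadratic rate can be observed numerically. A small sketch, using the illustrative function $f(x) = e^x - 2x$ (not from the notes), whose minimiser is $x^\ast = \ln 2$:

```python
import math

# Illustrative function: f(x) = exp(x) - 2x, so that
# f'(x) = exp(x) - 2, f''(x) = exp(x), and the minimiser is x* = log 2.
xstar = math.log(2)
x = 1.0
errs = [abs(x - xstar)]
for _ in range(6):
    x = x - (math.exp(x) - 2) / math.exp(x)   # pure Newton step x - f'(x)/f''(x)
    errs.append(abs(x - xstar))
# The number of correct digits roughly doubles at each iteration.
```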

Recall ∆newtons-method for optimisation:

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

@State a theorem about the local convergence of this algorithm under some smoothness assumptions when using ∆backtracking-armijo-linesearch.


Suppose:

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $\nabla^2 f$ is Lipschitz continuous and positive definite at the iterates
  • $\alpha _ {(0)} = 1$ and $\beta \le 0.5$ in the ∆armijo-condition
  • $x^k \to x^\ast$ as $k \to \infty$
  • $\nabla f(x^\ast) = 0$
  • $\nabla^2 f(x^\ast) \succ 0$

Then:

  • $\alpha^k = 1$ for all $k$ sufficiently large
  • $x^k \to x^\ast$ converges quadratically

Invariance to linear transformations of the variables

@State (briefly) a theorem about the invariance of Newton’s method to linear transformations of variables.


Newton’s method is scale invariant with respect to linear transformations of variables.

Recall ∆newtons-method for optimisation:

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
    • Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
    • Set $k := k+1$

@Prove that this method (with or without linesearch) is scale invariant with respect to linear transformations of variables.


Let $\mathbf A \in \mathbb R^{n \times n}$ be a nonsingular matrix and let $y = \mathbf A x$. Write $\mathbf B = \mathbf A^{-1}$ and consider

\[\bar f(y) := f(x(y)) = f(\mathbf By)\]

We aim to show that minimising $\bar f$ with respect to $y$ is equivalent to minimising $f$ with respect to $x$.

From the definitions, we have that

\[\nabla \bar f(y) = \mathbf B^\top \nabla f(x), \quad \nabla^2 \bar f(y) = \mathbf B^\top \nabla^2 f(x) \mathbf B\]

Hence the ∆newton-direction at $y$ is given by

\[\begin{aligned} s _ y &= -(\mathbf B^\top \nabla^2 f(x) \mathbf B)^{-1} \mathbf B^\top \nabla f(x) \\ &= -\mathbf B^{-1} (\nabla^2 f(x))^{-1} \mathbf B^{-\top} \mathbf B^\top \nabla f(x) \\ &= -\mathbf B^{-1}(\nabla^2 f(x))^{-1} \nabla f(x) \\ &= \mathbf As _ x \end{aligned}\]

Hence $y + \alpha s _ y = \mathbf A(x + \alpha s _ x)$.
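The identity $s _ y = \mathbf A s _ x$ can be checked numerically on a made-up quadratic (all matrices below are illustrative):

```python
import numpy as np

# Illustrative quadratic f(x) = 1/2 x^T Q x - b^T x, so grad f = Q x - b
# and hess f = Q; A is a made-up nonsingular change of variables y = A x.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
A = np.array([[2.0, 1.0], [0.0, 1.0]])
B = np.linalg.inv(A)

x = np.array([1.0, -1.0])
g_x = Q @ x - b
s_x = np.linalg.solve(Q, -g_x)       # Newton direction in the x variables

g_y = B.T @ g_x                      # grad fbar(y) = B^T grad f(x)
H_y = B.T @ Q @ B                    # hess fbar(y) = B^T hess f(x) B
s_y = np.linalg.solve(H_y, -g_y)     # Newton direction in the y variables
```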

@State three advantages and disadvantages of Newton’s method for optimisation.


  • Advantages
    • Quadratic local convergence: $ \vert \vert x^{k+1} - x^\ast \vert \vert = O( \vert \vert x^k - x^\ast \vert \vert ^2)$
    • Scale invariance: is invariant to linear transformations
    • Automatic step size: once close to the solution, the step size needs no adaptation or backtracking
  • Disadvantages
    • Newton direction $s^k$ is not well-defined if $\nabla^2 f(x^k)$ is singular
    • Iterates can get attracted to local maxima or saddle points
    • Iterates may fail to converge if $x^0$ is too far from solution

Give an @example of a scalar function on which ∆pure-newtons-method fails to converge globally and @visualise the failure mode.


\[f(x) = -\frac{x^6}{6} + \frac{x^4}{4} + 2x^2\]

Then $x^\ast = 0$ is a local minimiser and $x = \pm \sqrt{(1 + \sqrt{17})/2}$ are global maximisers. If pure Newton's method is applied to $f$ with $x^0 = 1$, then $x^{2k} = 1$ and $x^{2k+1} = -1$ for all $k$, so the iterates oscillate between $\pm 1$ and never converge.
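This oscillation can be verified directly (a short sketch of the iteration above):

```python
# f(x) = -x^6/6 + x^4/4 + 2x^2, so f'(x) = -x^5 + x^3 + 4x
# and f''(x) = -5x^4 + 3x^2 + 4.
def newton_step(x):
    return x - (-x**5 + x**3 + 4*x) / (-5*x**4 + 3*x**2 + 4)

xs = [1.0]
for _ in range(6):
    xs.append(newton_step(xs[-1]))
# xs cycles 1, -1, 1, -1, ... and never approaches the minimiser 0.
```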

@State a theorem about the convergence of ∆newtons-method when using ∆backtracking-armijo-linesearch under some conditions on the eigenvalues of the Hessian.


Suppose:

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous
  • We minimise $f$ with Newton’s method and backtracking Armijo linesearch and $\epsilon := 0$
  • The eigenvalues of $\nabla^2 f(x^k)$ at the iterates $x^k$ are positive and uniformly bounded below away from $0$, independently of $k$

Then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.

We have the result that if:

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous
  • We minimise $f$ with Newton's method and backtracking Armijo linesearch and $\epsilon := 0$
  • The eigenvalues of $\nabla^2 f(x^k)$ at the iterates $x^k$ are positive and uniformly bounded below away from $0$, independently of $k$

Then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$. What simpler conditions imply these ones?


  • $f \in \mathcal C^2$
  • $\nabla f$ is Lipschitz continuous
  • $f$ is strongly convex

@Prove that if

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous
  • We minimise $f$ with Newton's method and backtracking Armijo linesearch and $\epsilon := 0$
  • The eigenvalues of $\nabla^2 f(x^k)$ at the iterates $x^k$ are positive and uniformly bounded below away from $0$, independently of $k$

then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.


Note that these conditions satisfy the prerequisites for ∆global-convergence-of-backtracking-armijo. Hence there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, or

\[E _ k := \min \left\{ \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} \to 0 \text{ as } k \to \infty \quad (\ast)\]

Assume now that $\nabla f(x^k) \ne 0$ for all $k \ge 0$. Then it remains to show that $(\ast)$ implies $\nabla f(x^k) \to 0$ as $k \to \infty$. To do this, we express the terms in $(\ast)$ in terms of $\nabla f(x^k)$.

Let $H _ k := \nabla^2 f(x^k)$. By assumption, the eigenvalues of $H _ k$ are positive and uniformly bounded below away from zero for all $k$. That is, $\lambda _ \min(H _ k) > 0$ and there exists $\lambda _ \min > 0$ independent of $k$ such that

\[\lambda _ \min(H _ k) \ge \lambda _ \min \text{ for all }k \ge 0\]

The ∆rayleigh-ritz-theorem says that for all $s \in \mathbb R^n \setminus \{0\}$ and $\mathbf M \in \mathbb R^{n \times n}$ symmetric we have

\[\lambda _ \min(\mathbf M) \le \frac{s^\top \mathbf M s}{ \vert \vert s \vert \vert ^2} \le \lambda _ \max(\mathbf M)\]

Furthermore by ∆lipschitz-continuous-iff-bounded-hessian we have that $\nabla f$ is Lipschitz continuous iff $\nabla^2 f$ is uniformly bounded above. Hence all the eigenvalues of $\nabla^2 f$ are bounded above, and hence there exists $\lambda _ \max > 0$ independent of $k$ such that

\[\lambda _ \max (H _ k) \le \lambda _ \max \text{ for all }k \ge 0\]

Returning to $(\ast)$, we see from the definition of $s^k$ and $\nabla f(x^k)$ that

\[\begin{aligned} \vert \nabla f(x^k)^\top s^k \vert &= \vert \nabla f(x^k)^\top H _ k^{-1} \nabla f(x^k) \vert \\ &\ge \lambda _ \min(H _ k^{-1}) \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda _ \max(H _ k)} \\ &\ge \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda _ \max} \end{aligned}\]

Then $s^k = -H _ k^{-1}\nabla f(x^k)$ implies

\[\begin{aligned} \vert \vert s^k \vert \vert ^2 &= \nabla f(x^k)^\top H _ {k}^{-2} \nabla f(x^k) \\ &\le \lambda _ \max(H _ k^{-2}) \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda _ \min(H _ k)^2} \\ &\le \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda^2 _ \min} \end{aligned}\]

and thus

\[\frac{1}{ \vert \vert s^k \vert \vert } \ge \frac{\lambda _ \min}{ \vert \vert \nabla f(x^k) \vert \vert }\]

Furthermore, by the previous results we have that

\[E _ k \ge \min \left\{ \frac{\lambda _ \min}{\lambda _ \max} \vert \vert \nabla f(x^k) \vert \vert , \frac{1}{\lambda _ \max} \vert \vert \nabla f(x^k) \vert \vert ^2 \right\} > 0\]

for all $k$. Since $(\ast)$ gives $E _ k \to 0$, and each term in the minimum is an increasing function of $ \vert \vert \nabla f(x^k) \vert \vert $, it follows that $\nabla f(x^k) \to 0$ as $k \to \infty$.

@State a theorem about the convergence of ∆newtons-method when using ∆backtracking-armijo-linesearch under some conditions on the Lipschitz continuity of $\nabla f$ and choices of $B^k$ in $B^k s^k = -\nabla f(x^k)$.


Suppose:

  • $f \in \mathcal C^1 (\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$
  • $\nabla f$ is Lipschitz continuous
  • We minimise $f$ via backtracking Armijo linesearch with $s^k$ chosen according to $B^k s^k = -\nabla f(x^k)$
  • For all $k$, the eigenvalues of $B^k$ are uniformly bounded above and below away from $0$ independently of $k$

Then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.

Modified damped Newton methods

Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. What is often done instead?


We instead solve

\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]

where:

  • $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
  • $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite

Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. In this case we often instead solve

\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]

where:

  • $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
  • $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite

@State one approach based on the spectral decomposition.


As $\nabla^2 f(x^k)$ is symmetric, we may factor

\[\nabla^2 f(x^k) = Q^k D^k (Q^k)^\top\]

where $Q^k$ is orthogonal and $D^k$ is diagonal. Then we may set

\[\nabla^2 f(x^k) + M^k := Q^k \max(\epsilon I, \vert D^k \vert )(Q^k)^\top\]

for some small $\epsilon > 0$, where $\max$ and $\vert \cdot \vert$ are applied entrywise to the diagonal of $D^k$.

(this is expensive for large problems)
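A sketch of this modification, using `numpy.linalg.eigh` for the spectral decomposition (the indefinite test matrix is made up):

```python
import numpy as np

def spectral_modification(H, eps=1e-8):
    """Return Q max(eps I, |D|) Q^T, which is symmetric positive definite."""
    evals, Q = np.linalg.eigh(H)          # H = Q diag(evals) Q^T for symmetric H
    d = np.maximum(eps, np.abs(evals))    # floor/flip the eigenvalues entrywise
    return (Q * d) @ Q.T                  # rebuild with the modified eigenvalues

H = np.diag([1.0, -2.0])                  # made-up indefinite Hessian
H_mod = spectral_modification(H)          # eigenvalues become 1 and 2
```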

Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. In this case we often instead solve

\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]

where:

  • $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
  • $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite

@State one approach based on estimating the smallest eigenvalue of $\nabla^2 f(x^k)$.


Estimate $\lambda _ \min(\nabla^2 f(x^k))$ and set

\[M^k := \max(0, \epsilon - \lambda _ \min(\nabla^2 f(x^k)))I\]

(this is biased to overemphasise large negative eigenvalues at the expense of small, positive ones).
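A sketch, computing $\lambda _ \min$ exactly for illustration (in practice it would only be estimated; the test matrix and choice of $\epsilon$ are made up):

```python
import numpy as np

def shifted_hessian(H, eps):
    """Return H + max(0, eps - lambda_min(H)) I."""
    lam_min = np.linalg.eigvalsh(H)[0]   # exact here; in practice only estimated
    return H + max(0.0, eps - lam_min) * np.eye(H.shape[0])

H = np.diag([1.0, -2.0])                 # made-up indefinite Hessian
H_shift = shifted_hessian(H, eps=0.5)    # shift by 0.5 - (-2) = 2.5
```

Note the bias mentioned above: the shift of $2.5$ determined by the eigenvalue $-2$ also inflates the already-positive eigenvalue $1$ to $3.5$.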

Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. In this case we often instead solve

\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]

where:

  • $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
  • $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite

@State one approach based on the ∆cholesky-factorisation.


Compute the Cholesky factorisation

\[\nabla^2 f(x^k) = L^k(L^k)^\top\]

where $L^k$ is a lower triangular matrix, and we modify the generated $L^k$ if the factorisation is in danger of failing.
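One simple variant of this idea retries the factorisation on $\nabla^2 f(x^k) + \tau I$ with increasing $\tau$ until it succeeds (a sketch under that assumption, not the in-place modification of $L^k$ described above):

```python
import numpy as np

def cholesky_with_shift(H, beta=1e-3):
    """Attempt Cholesky on H + tau I, increasing tau until it succeeds."""
    tau = 0.0 if np.all(np.diag(H) > 0) else beta
    for _ in range(60):
        try:
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            return L, tau
        except np.linalg.LinAlgError:
            tau = max(2 * tau, beta)     # double the shift and retry
    raise RuntimeError("no positive definite modification found")

H = np.diag([1.0, -2.0])                 # made-up indefinite Hessian
L, tau = cholesky_with_shift(H)
```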



