Continuous Optimisation HT26, Newton's method
Flashcards
Basic setup
Suppose $B^k$ is a symmetric, positive definite matrix. Suppose we choose directions $s^k$ where
\[B^k s^k = -\nabla f(x^k)\]
@Justify that $s^k$ is a descent direction.
Recall that a ∆descent-direction satisfies $\nabla f(x^k)^\top s^k < 0$. Since
\[\nabla f(x^k)^\top s^k = -\nabla f(x^k)^\top (B^k)^{-1} \nabla f(x^k)\]and $(B^k)^{-1}$ is also symmetric positive definite, whenever $\nabla f(x^k) \ne 0$ we have $\nabla f(x^k)^\top s^k < 0$.
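As a quick numerical sanity check (the SPD matrix $B$ and vector $g$ below are arbitrary assumptions standing in for $B^k$ and $\nabla f(x^k)$), the direction solving $Bs = -g$ is always a descent direction:

```python
import numpy as np

# Arbitrary symmetric positive definite B and nonzero g (assumptions standing
# in for B^k and grad f(x^k)); the direction solving B s = -g must satisfy
# g^T s = -g^T B^{-1} g < 0.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)        # symmetric positive definite by construction
g = rng.standard_normal(4)         # plays the role of grad f(x^k) != 0

s = np.linalg.solve(B, -g)
print(g @ s)                       # negative: s is a descent direction
```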
Suppose $B^k$ is a symmetric, positive definite matrix. Suppose we choose directions $s^k$ where
\[B^k s^k = -\nabla f(x^k)\]
@Justify why this is a sensible choice in terms of $s^k$ being the minimiser of some function.
$s^k$ uniquely solves
\[\min _ {s \in \mathbb R^n} m _ k(s), \quad m _ k(s) := f(x^k) + \nabla f(x^k)^\top s + \frac 1 2 s^\top B^k s\]where $m _ k$ is a convex quadratic in $s$, since $B^k$ is positive definite. This is a sensible choice because, e.g. with $B^k = \nabla^2 f(x^k)$, $m _ k$ is the second-order Taylor approximation of $f$ around $x^k$, so $s^k$ minimises that local model.
Newton’s method
Suppose we wish to minimise $f(x)$ subject to $x \in \mathbb R^n$.
@State the @algorithm corresponding to Newton’s method.
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
where we hope $\nabla^2 f(x^k)$ is positive definite so that $s^k$ is a descent direction.
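The loop above can be sketched as follows (a minimal sketch: the test function and the crude halving linesearch are illustrative assumptions, not part of the notes):

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-8, max_iter=100):
    # Newton's method with a crude halving linesearch for alpha in (0, 1].
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:          # stopping test ||grad f(x^k)|| <= eps
            break
        s = np.linalg.solve(hess(x), -g)      # solve hess f(x^k) s^k = -grad f(x^k)
        alpha = 1.0                           # try the full Newton step first
        while f(x + alpha * s) > f(x) and alpha > 1e-10:
            alpha *= 0.5                      # shrink alpha until f decreases
        x = x + alpha * s
    return x

# Illustrative test problem (an assumption): f(x, y) = (x - 1)^2 + 10 (y - x^2)^2
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([2 * (x[0] - 1) - 40 * x[0] * (x[1] - x[0]**2),
                           20 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 40 * (x[1] - 3 * x[0]**2), -40 * x[0]],
                           [-40 * x[0], 20.0]])
x_min = newton(f, grad, hess, np.array([-1.0, 1.0]))   # should converge to (1, 1)
```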
Recall ∆newtons-method for optimisation:
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
where we hope $\nabla^2 f(x^k)$ is positive definite so that $s^k$ is a descent direction. Suppose this is the case. @Justify the choice of $s^k$.
$s^k$ minimises the second-order Taylor approximation of $f$ around $x^k$.
@Define the Newton direction.
Recall ∆newtons-method for optimisation:
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
@Define what is meant by pure Newton’s method.
$\alpha^k = 1$ for all $k$.
Recall ∆newtons-method for optimisation:
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
@Justify its connection to Newton’s method for rootfinding.
This is equivalent to applying Newton's method for rootfinding to the nonlinear system $\nabla f(x) = 0$, with the Hessian $\nabla^2 f$ playing the role of the Jacobian.
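This equivalence is easy to check numerically; the scalar function below ($f(x) = x^4/4$, so $f'(x) = x^3$) is an assumed illustration:

```python
import numpy as np

# Pure Newton for minimising f is Newton's rootfinding method applied to
# g = grad f. Assumed illustration: f(x) = x^4/4, so f'(x) = x^3, f''(x) = 3x^2.
fp  = lambda x: x**3       # f'(x): the function whose root we seek
fpp = lambda x: 3 * x**2   # f''(x): its derivative (the "Jacobian")

def root_step(g, gprime, x):
    # Generic Newton rootfinding step for g(x) = 0
    return x - g(x) / gprime(x)

x_opt = x_root = 2.0
for _ in range(5):
    x_opt = x_opt - fp(x_opt) / fpp(x_opt)   # optimisation: x - f'(x)/f''(x)
    x_root = root_step(fp, fpp, x_root)      # rootfinding applied to g = f'
    assert np.isclose(x_opt, x_root)         # the two iterations coincide
```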
Convergence
Recall ∆newtons-method for optimisation:
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
@State a theorem about the local convergence of ∆pure-newtons-method under some smoothness assumptions.
Suppose:
- $f \in \mathcal C^2(\mathbb R^n)$
- $\nabla f(x^\ast) = 0$
- $\nabla^2 f(x^\ast)$ nonsingular
- $\nabla^2 f$ is locally Lipschitz continuous at $x^\ast$
- For some $k _ 0 \ge 0$, $x^{k _ 0}$ is sufficiently close to $x^\ast$
Then:
- $x^k$ is well-defined for all $k \ge k _ 0$
- $x^k \to x^\ast$ as $k \to \infty$ at a quadratic rate
- $\nabla f(x^k) \to 0$ at a quadratic rate
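The quadratic rate can be observed numerically. The test function below ($f(x) = e^x - 2x$, minimised at $x^\ast = \ln 2$ with $f''(x^\ast) = 2 > 0$) is an assumed illustration satisfying the hypotheses:

```python
import numpy as np

# Assumed test problem: f(x) = exp(x) - 2x, with f'(x) = exp(x) - 2,
# f''(x) = exp(x) > 0 and f'' locally Lipschitz, so the theorem applies.
fp  = lambda x: np.exp(x) - 2
fpp = lambda x: np.exp(x)
x_star = np.log(2)

x = 1.0
errors = []
for _ in range(5):
    errors.append(abs(x - x_star))
    x = x - fp(x) / fpp(x)                 # pure Newton step (alpha^k = 1)
errors.append(abs(x - x_star))

# Quadratic rate: each error is roughly a constant times the previous one squared.
ratios = [errors[k + 1] / errors[k]**2 for k in range(4)]
```

Here the ratios settle near $f'''(x^\ast) / (2 f''(x^\ast)) = 1/2$, consistent with the Lipschitz-based analysis.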
@Prove ∆newtons-method-convergence, i.e. that if we use the following algorithm for optimisation
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
where
- $f \in \mathcal C^2(\mathbb R^n)$
- $\nabla f(x^\ast) = 0$
- $\nabla^2 f(x^\ast)$ nonsingular
- $\nabla^2 f$ is locally Lipschitz continuous at $x^\ast$
- For some $k _ 0 \ge 0$, $x^{k _ 0}$ is sufficiently close to $x^\ast$
then:
- $x^k$ is well-defined for all $k \ge k _ 0$
- $x^k \to x^\ast$ as $k \to \infty$ at a quadratic rate
- $\nabla f(x^k) \to 0$ at a quadratic rate
Consider the Taylor expansion of $\nabla f$ around $x$:
\[\nabla f(x^\ast) = \nabla f(x) + \nabla^2 f(x) (x^\ast - x) + O( \vert \vert x^\ast - x \vert \vert ^2)\]where $x$ is sufficiently close to $x^\ast$ and the constant in $O(\cdot)$ depends on the local Lipschitz constant of $\nabla^2 f$ at $x^\ast$. Then since $\nabla f(x^\ast) = 0$, taking $x = x^k$ whenever $x^k$ is sufficiently close to $x^\ast$, we have
\[0 = \nabla f(x^k) + \nabla^2 f(x^k) (x^\ast - x^k) + O( \vert \vert x^\ast - x^k \vert \vert ^2)\]Then as $\nabla^2 f(x^\ast)$ nonsingular, we have that $\nabla^2 f(x^k)$ is nonsingular whenever $x^k$ is sufficiently close to $x^\ast$. Hence the above implies that
\[x^k - x^\ast = (\nabla^2 f(x^k))^{-1} \nabla f(x^k) + O( \vert \vert x^\ast - x^k \vert \vert ^2)\]Letting $s^k$ be the Newton direction and $x^{k+1} = x^k + s^k$, it follows that
\[x^k - x^\ast = x^k - x^{k+1} + O( \vert \vert x^\ast - x^k \vert \vert ^2)\]Hence
\[x^{k+1} - x^\ast = O( \vert \vert x^k - x^\ast \vert \vert ^2)\]Recall ∆newtons-method for optimisation:
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
@State a theorem about the local convergence of this algorithm under some smoothness assumptions when using ∆backtracking-armijo-linesearch.
Suppose:
- $f \in \mathcal C^2(\mathbb R^n)$
- $\nabla^2 f$ is Lipschitz continuous and positive definite at the iterates
- $\alpha _ {(0)} = 1$ and $\beta \le 0.5$ in the ∆armijo-condition
- $x^k \to x^\ast$ as $k \to \infty$
- $\nabla f(x^\ast) = 0$
- $\nabla^2 f(x^\ast) \succ 0$
Then:
- $\alpha^k = 1$ for all $k$ sufficiently large
- $x^k \to x^\ast$ at a quadratic rate
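A sketch of the statement in action (the strictly convex test function and parameter choices are assumptions for illustration): with $\alpha _ {(0)} = 1$, backtracking Armijo accepts the unit step once the iterates are close to the minimiser.

```python
import numpy as np

# Newton + backtracking Armijo with alpha_0 = 1, beta = 0.1, tau = 0.5.
# The strictly convex test function is an assumed illustration.
f    = lambda x: x[0]**4 + x[0]**2 + x[1]**2
grad = lambda x: np.array([4 * x[0]**3 + 2 * x[0], 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2 + 2, 0.0], [0.0, 2.0]])

beta, tau = 0.1, 0.5
x = np.array([1.5, -2.0])
alphas = []
for _ in range(20):
    g = grad(x)
    if np.linalg.norm(g) <= 1e-12:
        break
    s = np.linalg.solve(hess(x), -g)            # Newton direction
    alpha = 1.0                                 # always try the unit step first
    while f(x + alpha * s) > f(x) + beta * alpha * (g @ s):
        alpha *= tau                            # Armijo backtracking
    alphas.append(alpha)
    x = x + alpha * s
# Near the solution the unit step is accepted, giving quadratic convergence.
```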
Invariance to linear transformations of the variables
@State (briefly) a theorem about the invariance of Newton’s method to linear transformations of variables.
Newton’s method is scale invariant with respect to linear transformations of variables.
Recall ∆newtons-method for optimisation:
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- While $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$, repeat:
- Solve the linear system $\nabla^2 f(x^k) s^k = -\nabla f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$, where $\alpha^k \in (0, 1]$ (computed by e.g. exact linesearch)
- Set $k := k+1$
@Prove that this method (with or without linesearch) is scale invariant with respect to linear transformations of variables.
Let $\mathbf A \in \mathbb R^{n \times n}$ be a nonsingular matrix and let $y = \mathbf A x$. Write $\mathbf B = \mathbf A^{-1}$ and consider
\[\bar f(y) := f(x(y)) = f(\mathbf By)\]We aim to show that minimising $\bar f$ with respect to $y$ is equivalent to minimising $f$ with respect to $x$.
From the definitions, we have that
\[\nabla \bar f(y) = \mathbf B^\top \nabla f(x), \quad \nabla^2 \bar f(y) = \mathbf B^\top \nabla^2 f(x) \mathbf B\]Hence the ∆newton-direction at $y$ is given by
\[\begin{aligned} s _ y &= -(\mathbf B^\top \nabla^2 f(x) \mathbf B)^{-1} \mathbf B^\top \nabla f(x) \\ &= -\mathbf B^{-1} (\nabla^2 f(x))^{-1} \mathbf B^{-\top} \mathbf B^\top \nabla f(x) \\ &= -\mathbf B^{-1}(\nabla^2 f(x))^{-1} \nabla f(x) \\ &= \mathbf As _ x \end{aligned}\]Hence $y + \alpha s _ y = \mathbf A(x + \alpha s _ x)$.
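The transformation rule $s _ y = \mathbf A s _ x$ can be verified numerically (the test function and the matrix $\mathbf A$ below are arbitrary assumptions):

```python
import numpy as np

# Numerical check of scale invariance: with y = A x and fbar(y) = f(B y),
# B = A^{-1}, the Newton directions satisfy s_y = A s_x. The test function
# f(x) = x1^4 + x1 x2 + x2^2 and the matrix A are arbitrary assumptions.
grad = lambda x: np.array([4 * x[0]**3 + x[1], x[0] + 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])

A = np.array([[2.0, 1.0], [0.0, 3.0]])    # any nonsingular matrix
B = np.linalg.inv(A)

x = np.array([0.7, -1.2])
y = A @ x

s_x = np.linalg.solve(hess(x), -grad(x))  # Newton direction in x-variables

g_bar = B.T @ grad(x)                     # grad fbar(y) = B^T grad f(x)
H_bar = B.T @ hess(x) @ B                 # hess fbar(y) = B^T hess f(x) B
s_y = np.linalg.solve(H_bar, -g_bar)      # Newton direction in y-variables
```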
@State three advantages and disadvantages of Newton’s method for optimisation.
- Advantages
- Quadratic local convergence: $ \vert \vert x^{k+1} - x^\ast \vert \vert = O( \vert \vert x^k - x^\ast \vert \vert ^2)$
- Scale invariance: the method is invariant to linear transformations of the variables
- Automatic step size: once close to the solution, the step size needs no adaptation or backtracking
- Disadvantages
- Newton direction $s^k$ is not well-defined if $\nabla^2 f(x^k)$ is singular
- Iterates can get attracted to local maxima or saddle points
- Iterates may fail to converge if $x^0$ is too far from solution
Give an @example of a scalar function on which ∆pure-newtons-method fails to converge globally and @visualise the failure mode.
Consider $f(x) = 2x^2 + \frac{x^4}{4} - \frac{x^6}{6}$, so that $f'(x) = 4x + x^3 - x^5$ and $f''(x) = 4 + 3x^2 - 5x^4$. Then $x^\ast = 0$ is a local minimiser and $x = \pm \sqrt{(1 + \sqrt{17})/2}$ are global maximisers. If pure Newton's method is applied to $f$ with $x^0 = 1$, then $x^{2k} = 1$ and $x^{2k+1} = -1$ for all $k$: the iterates oscillate between $\pm 1$ forever and never approach $x^\ast$.
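A function consistent with the stationary points described above is $f(x) = 2x^2 + x^4/4 - x^6/6$ (a reconstruction from the stated maximisers and the $1 \leftrightarrow -1$ cycle; treat it as an assumption). Iterating pure Newton from $x^0 = 1$ exhibits the oscillation:

```python
# Assumed reconstruction of the failing example:
# f(x) = 2 x^2 + x^4/4 - x^6/6, f'(x) = 4x + x^3 - x^5, f''(x) = 4 + 3x^2 - 5x^4.
f   = lambda x: 2 * x**2 + x**4 / 4 - x**6 / 6
fp  = lambda x: 4 * x + x**3 - x**5
fpp = lambda x: 4 + 3 * x**2 - 5 * x**4

x = 1.0
iterates = [x]
for _ in range(6):
    x = x - fp(x) / fpp(x)    # pure Newton step (alpha = 1)
    iterates.append(x)

# The iterates cycle 1, -1, 1, -1, ... instead of approaching the minimiser 0.
```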

@State a theorem about the convergence of ∆newtons-method when using ∆backtracking-armijo-linesearch under some conditions on the eigenvalues of the Hessian.
Suppose:
- $f \in \mathcal C^2(\mathbb R^n)$
- $\nabla f$ is Lipschitz continuous
- We minimise $f$ with Newton's method and backtracking Armijo linesearch and $\epsilon := 0$
- The eigenvalues of $\nabla^2 f(x^k)$ at the iterates are positive and uniformly bounded below away from $0$, independently of $k$
Then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.
We have the result that if:
- $f \in \mathcal C^2(\mathbb R^n)$
- $\nabla f$ is Lipschitz continuous
- We minimise $f$ with Newton's method and backtracking Armijo linesearch and $\epsilon := 0$
- The eigenvalues of $\nabla^2 f(x^k)$ at the iterates are positive and uniformly bounded below away from $0$, independently of $k$
Then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$. What simpler conditions imply these ones?
- $f \in \mathcal C^2$
- $\nabla f$ is Lipschitz continuous
- $f$ is strongly convex
@Prove that if
- $f \in \mathcal C^2(\mathbb R^n)$
- $\nabla f$ is Lipschitz continuous
- We minimise $f$ with Newton's method and backtracking Armijo linesearch and $\epsilon := 0$
- The eigenvalues of $\nabla^2 f(x^k)$ at the iterates are positive and uniformly bounded below away from $0$, independently of $k$
then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.
Note that these conditions satisfy the prerequisites for ∆global-convergence-of-backtracking-armijo. Hence either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, or
\[E _ k := \min \left\{ \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} \to 0 \text{ as } k \to \infty \quad (\ast)\]Assume now that $\nabla f(x^k) \ne 0$ for all $k \ge 0$. It remains to show that $(\ast)$ implies $\nabla f(x^k) \to 0$ as $k \to \infty$. To do this, we bound the terms in $(\ast)$ from below in terms of $\nabla f(x^k)$.
Let $H _ k := \nabla^2 f(x^k)$. By assumption, the eigenvalues of $H _ k$ are positive and uniformly bounded below away from zero for all $k$: there exists $\lambda _ \min > 0$ independent of $k$ such that
\[\lambda _ \min(H _ k) \ge \lambda _ \min \text{ for all }k \ge 0\]The ∆rayleigh-ritz-theorem says that for all $s \in \mathbb R^n \setminus \{0\}$ and symmetric $\mathbf M \in \mathbb R^{n \times n}$ we have
\[\lambda _ \min(\mathbf M) \le \frac{s^\top \mathbf M s}{ \vert \vert s \vert \vert ^2} \le \lambda _ \max(\mathbf M)\]Furthermore by ∆lipschitz-continuous-iff-bounded-hessian we have that $\nabla f$ is Lipschitz continuous iff $\nabla^2 f$ is uniformly bounded above. Hence all the eigenvalues of $\nabla^2 f$ are bounded above, and hence there exists $\lambda _ \max > 0$ independent of $k$ such that
\[\lambda _ \max (H _ k) \le \lambda _ \max \text{ for all }k \ge 0\]Returning to $(\ast)$, we see from the definition of $s^k$ and $\nabla f(x^k)$ that
\[\begin{aligned} \vert \nabla f(x^k)^\top s^k \vert &= \vert \nabla f(x^k)^\top H _ k^{-1} \nabla f(x^k) \vert \\ &\ge \lambda _ \min(H _ k^{-1}) \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda _ \max(H _ k)} \\ &\ge \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda _ \max} \end{aligned}\]Then $s^k = -H _ k^{-1}\nabla f(x^k)$ implies
\[\begin{aligned} \vert \vert s^k \vert \vert ^2 &= \nabla f(x^k)^\top H _ {k}^{-2} \nabla f(x^k) \\ &\le \lambda _ \max(H _ k^{-2}) \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda _ \min(H _ k)^2} \\ &\le \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{\lambda^2 _ \min} \end{aligned}\]and thus
\[\frac{1}{ \vert \vert s^k \vert \vert } \ge \frac{\lambda _ \min}{ \vert \vert \nabla f(x^k) \vert \vert }\]Furthermore, by the previous results we have that
\[E _ k \ge \min \left\{ \frac{\lambda _ \min}{\lambda _ \max} \vert \vert \nabla f(x^k) \vert \vert , \frac{1}{\lambda _ \max} \vert \vert \nabla f(x^k) \vert \vert ^2 \right\} > 0\]for all $k$. This and $(\ast)$ then imply that $\nabla f(x^k) \to 0$ as $k \to \infty$.
@State a theorem about the convergence of ∆newtons-method when using ∆backtracking-armijo-linesearch under some conditions on the Lipschitz continuity of $\nabla f$ and choices of $B^k$ in $B^k s^k = -\nabla f(x^k)$.
Suppose:
- $f \in \mathcal C^1 (\mathbb R^n)$
- $f$ is bounded below on $\mathbb R^n$
- $\nabla f$ is Lipschitz continuous
- We minimise $f$ via backtracking Armijo linesearch with $s^k$ chosen according to $B^k s^k = -\nabla f(x^k)$
- For all $k$, the eigenvalues of $B^k$ are uniformly bounded above and below away from $0$ independently of $k$
Then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.
Modified damped Newton methods
Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. What is often done instead?
We instead solve
\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]where:
- $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
- $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite
Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. In this case we often instead solve
\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]
where:
- $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
- $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite
@State one approach based on the spectral decomposition.
As $\nabla^2 f(x^k)$ is symmetric, we may factor
\[\nabla^2 f(x^k) = Q^k D^k (Q^k)^\top\]where $Q^k$ is orthogonal and $D^k$ is diagonal, containing the eigenvalues of $\nabla^2 f(x^k)$. Then we may set
\[\nabla^2 f(x^k) + M^k := Q^k \max(\epsilon I, \vert D^k \vert )(Q^k)^\top\]for some small $\epsilon > 0$, where the $\max$ and $\vert \cdot \vert$ are applied entrywise to the diagonal.
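A sketch of this modification (the indefinite example matrix is an assumption):

```python
import numpy as np

# Sketch of the eigenvalue-modification approach: replace each eigenvalue d of
# the symmetric matrix H by max(eps, |d|), keeping the eigenvectors.
def make_positive_definite(H, eps=1e-8):
    d, Q = np.linalg.eigh(H)                # H = Q diag(d) Q^T for symmetric H
    d_mod = np.maximum(eps, np.abs(d))      # flip/raise the bad eigenvalues
    return Q @ np.diag(d_mod) @ Q.T

H = np.array([[1.0, 2.0], [2.0, -3.0]])     # indefinite example (an assumption)
H_mod = make_positive_definite(H)           # now safe to use in B^k s^k = -g
```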
(this is expensive for large problems)
Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. In this case we often instead solve
\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]
where:
- $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
- $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite
@State one approach based on estimating the smallest eigenvalue of $\nabla^2 f(x^k)$.
Estimate $\lambda _ \min(\nabla^2 f(x^k))$ and set
\[M^k := \max(0, \epsilon - \lambda _ \min(\nabla^2 f(x^k)))I\](this is biased to overemphasise large negative eigenvalues at the expense of small, positive ones).
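A sketch of the shift (the indefinite example matrix is an assumption); note that the same shift is added to every eigenvalue, which is the source of the bias just mentioned:

```python
import numpy as np

# Sketch of the shift approach: M = max(0, eps - lambda_min(H)) I, so that
# every eigenvalue of H + M is at least eps.
def shifted_hessian(H, eps=1e-8):
    lam_min = np.linalg.eigvalsh(H)[0]      # smallest eigenvalue (eigvalsh sorts)
    shift = max(0.0, eps - lam_min)
    return H + shift * np.eye(H.shape[0])

H = np.array([[1.0, 2.0], [2.0, -3.0]])     # indefinite example (an assumption)
H_shifted = shifted_hessian(H)              # all eigenvalues moved up uniformly
```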
Suppose we wish to use ∆newtons-method but $\nabla^2 f(x^k)$ is not positive definite. In this case we often instead solve
\[(\nabla^2 f(x^k) + M^k)s^k = -\nabla f(x^k)\]
where:
- $M^k$ is chosen such that $\nabla^2 f(x^k) + M^k$ is sufficiently positive definite
- $M^k := 0$ when $\nabla^2 f(x^k)$ is sufficiently positive definite
@State one approach based on the ∆cholesky-factorisation.
Compute the Cholesky factorisation
\[\nabla^2 f(x^k) = L^k(L^k)^\top\]where $L^k$ is lower triangular, modifying the factor $L^k$ as it is generated whenever the factorisation is in danger of failing (i.e. when $\nabla^2 f(x^k)$ is not sufficiently positive definite).
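A simple version of this idea retries the factorisation with an increasing diagonal shift $\tau I$ (a crude sketch, not a true modified-Cholesky algorithm that edits $L^k$ on the fly; the example matrix is an assumption):

```python
import numpy as np

# Attempt a Cholesky factorisation; if it fails (H not positive definite),
# add tau * I with increasing tau and retry.
def cholesky_with_shift(H, tau0=1e-3, max_tries=60):
    tau = 0.0
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            return L, tau
        except np.linalg.LinAlgError:       # raised when H + tau I is not PD
            tau = tau0 if tau == 0.0 else 10 * tau
    raise RuntimeError("could not make H positive definite")

H = np.array([[1.0, 2.0], [2.0, -3.0]])     # indefinite example (an assumption)
L, tau = cholesky_with_shift(H)             # L L^T = H + tau I, tau > 0 here
```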