Continuous Optimisation HT26, Steepest descent


Flashcards

Basic definitions

Recall the ∆generic-linesearch-method

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
    • Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
    • Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$.
    • Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$.

In this context, @define what is meant by steepest descent.


Choose $s^k = -\nabla f(x^k)$.

When is $s^k = -\nabla f(x^k)$ a descent direction?


Whenever $\nabla f(x^k) \ne 0$.
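The method above can be sketched in code. This is a minimal illustrative sketch, assuming Armijo-style backtracking for the step size; the function name and the parameter values `beta` and `tau` are my choices, not part of the definition.

```python
import numpy as np

def steepest_descent(f, grad, x0, eps=1e-8, beta=0.1, tau=0.5, max_iter=10_000):
    """Generic linesearch method with s^k = -grad f(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:       # stopping test ||grad f(x^k)|| <= eps
            break
        s = -g                             # steepest descent direction
        alpha = 1.0
        # Armijo backtracking: shrink alpha until sufficient decrease holds
        while f(x + alpha * s) > f(x) + beta * alpha * (g @ s):
            alpha *= tau
        x = x + alpha * s                  # x^{k+1} = x^k + alpha^k s^k
    return x
```

Since $\nabla f(x^k)^\top s^k = - \vert \vert \nabla f(x^k) \vert \vert^2 < 0$ whenever the gradient is nonzero, each iteration uses a valid descent direction.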

Recall the ∆generic-linesearch-method

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
    • Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
    • Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$.
    • Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$.

In this context, steepest descent chooses $s^k = -\nabla f(x^k)$. Why is this called steepest descent?


The direction $s^k = -\nabla f(x^k)$ is the unique global solution of

\[\min _ {s \in \mathbb R^n, \vert \vert s \vert \vert = \vert \vert \nabla f(x^k) \vert \vert } f(x^k) + s^\top \nabla f(x^k)\]

Convergence

Recall the ∆generic-linesearch-method

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
    • Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
    • Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$.
    • Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$.

In this context, steepest descent chooses $s^k = -\nabla f(x^k)$.

@State a result about the convergence of this algorithm when used with standard backtracking and with Armijo backtracking.


Suppose:

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$
  • $\nabla f$ is Lipschitz continuous

Then with both standard backtracking and Armijo backtracking, either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.
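A numerical illustration of the conclusion, using the illustrative objective $f(x) = \sum _ i \log(1 + x _ i^2)$ (my choice, not from the notes), which is $\mathcal C^1$, bounded below, and has Lipschitz continuous gradient:

```python
import numpy as np

# Illustrative objective (an assumption, not from the notes): C^1,
# bounded below by 0, with Lipschitz continuous gradient.
f = lambda x: np.sum(np.log1p(x**2))
grad = lambda x: 2 * x / (1 + x**2)

x = np.array([3.0, -2.0])
norms = []
for _ in range(200):
    g = grad(x)
    norms.append(np.linalg.norm(g))
    alpha = 1.0
    # Armijo backtracking with illustrative parameters beta = 0.1, tau = 0.5
    while f(x - alpha * g) > f(x) - 0.1 * alpha * (g @ g):
        alpha *= 0.5
    x = x - alpha * g
# The recorded gradient norms tend to 0, as the theorem predicts.
```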

Recall the ∆generic-linesearch-method

  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
    • Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
    • Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$.
    • Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$.

In this context, steepest descent chooses $s^k = -\nabla f(x^k)$. @Prove, by appealing to another result, that if

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$
  • $\nabla f$ is Lipschitz continuous

Then with Armijo backtracking either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.


Let $s^k = -\nabla f(x^k)$ for all $k$ in ∆global-convergence-of-backtracking-armijo.

Can you give an @example of where steepest descent converges very slowly due to the objective function being poorly scaled?


\[f(x) = \frac 1 2 (a x _ 1^2 + x _ 2^2) = \frac 1 2 x^\top \begin{pmatrix}a & 0 \\ 0 & 1\end{pmatrix} x\]

Then steepest descent with exact linesearch, started from $x^0 = (1, a)^\top$, yields

\[x^k = \left( \frac{a-1}{a+1} \right)^k \begin{pmatrix} (-1)^k \\ a \end{pmatrix}, \quad k \ge 0\]

Hence $x^k \to 0$ as $k \to \infty$, but when $a \gg 1$ the contraction factor $\frac{a-1}{a+1}$ is close to $1$, so steepest descent converges very slowly.
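This can be checked numerically. A sketch assuming exact linesearch (for a quadratic $f(x) = \frac 1 2 x^\top A x$ with gradient $g$, the exact step is $\alpha = g^\top g / g^\top A g$) and the illustrative value $a = 50$:

```python
import numpy as np

a = 50.0
A = np.diag([a, 1.0])              # Hessian of f(x) = (1/2)(a x1^2 + x2^2)
rho = (a - 1) / (a + 1)            # contraction factor, close to 1 for large a

x = np.array([1.0, a])             # x^0 = (1, a), matching the formula at k = 0
for k in range(25):
    assert np.allclose(x, rho**k * np.array([(-1)**k, a]))
    g = A @ x                      # gradient of the quadratic
    alpha = (g @ g) / (g @ A @ g)  # exact linesearch step
    x = x - alpha * g
# After 25 iterations the error ||x - 0|| has only shrunk by rho^25 ≈ 0.37.
```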

Steepest descent struggles with scale dependence. If we have the objective

\[f(x) = \frac 1 2 (a x _ 1^2 + x _ 2^2) = \frac 1 2 x^\top \begin{pmatrix}a & 0 \\ 0 & 1\end{pmatrix} x\]

then steepest descent with exact linesearch from $x^0 = (1, a)^\top$ yields

\[x^k = \left( \frac{a-1}{a+1} \right)^k \begin{pmatrix} (-1)^k \\ a \end{pmatrix}, \quad k \ge 0\]

and when $a \gg 1$ steepest descent converges very slowly. Give an @example of a linear transformation of variables that improves the scaling and @visualise why this works.


Let

\[y = \begin{pmatrix} a^{1/2} & 0 \\ 0 & 1 \end{pmatrix} x\]

and set $\bar f(y) = f(x(y)) = \frac 1 2 (y _ 1^2 + y _ 2^2)$. Then steepest descent from any $y^0 \in \mathbb R^2$ converges to the global minimiser in one step. This works because the level sets of $\bar f$ are circles rather than elongated ellipses: the negative gradient $-\nabla \bar f(y) = -y$ points directly at the minimiser $y^\ast = 0$.
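A quick numerical check of the one-step claim, assuming exact linesearch and an arbitrary seeded starting point (illustrative choices):

```python
import numpy as np

# After the change of variables y = diag(a^(1/2), 1) x, the objective
# becomes fbar(y) = (1/2)(y1^2 + y2^2), whose gradient at y is y itself.
rng = np.random.default_rng(0)
y = rng.standard_normal(2)          # arbitrary starting point y^0

g = y                               # gradient of fbar at y
alpha = (g @ g) / (g @ g)           # exact linesearch step, equal to 1
y = y - alpha * g
assert np.allclose(y, 0.0)          # global minimiser reached in one step
```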

Asymptotically, steepest descent converges linearly to a solution: if $x^k \to x^\ast$ as $k \to \infty$, then there exists $\rho \in (0, 1)$ such that

\[\vert \vert x^{k+1} - x^\ast \vert \vert \le \rho \vert \vert x^k - x^\ast \vert \vert\]

for all $k$ sufficiently large.

Why is steepest descent often slow in practice?


Because $\rho$ is often very close to $1$.

@State a theorem about the convergence of steepest descent to a local minimiser $x^\ast$ of $f \in \mathcal C^2(\mathbb R^n)$ in terms of the eigenvalues of the Hessian.


Suppose:

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $x^\ast$ local minimiser of $f$
  • $\nabla^2 f(x^\ast)$ is positive definite with smallest and largest eigenvalues $\lambda^\ast _ \min$ and $\lambda^\ast _ \max$
  • $x^k \to x^\ast$ via steepest descent iterations

Then $x^k$ converges linearly to $x^\ast$ with rate

\[\rho \le \frac{\kappa(x^\ast) - 1}{\kappa(x^\ast) + 1} =: \rho _ \text{SD}\]

where $\kappa(x^\ast) = \lambda^\ast _ \max / \lambda^\ast _ \min$ (i.e. the condition number of $\nabla^2 f(x^\ast)$). In practice, $\rho \approx \rho _ \text{SD}$.
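For the poorly scaled quadratic $f(x) = \frac 1 2 (a x _ 1^2 + x _ 2^2)$, the Hessian eigenvalues are $a$ and $1$, so $\kappa = a$ and $\rho _ \text{SD} = (a-1)/(a+1)$. A sketch checking that the observed per-step contraction matches $\rho _ \text{SD}$, assuming exact linesearch and the illustrative value $a = 50$:

```python
import numpy as np

a = 50.0
A = np.diag([a, 1.0])               # Hessian: eigenvalues a and 1, kappa = a
rho_sd = (a - 1) / (a + 1)          # = (kappa - 1)/(kappa + 1)

x = np.array([1.0, a])              # x* = 0, so ||x - x*|| = ||x||
for _ in range(10):
    g = A @ x
    alpha = (g @ g) / (g @ A @ g)   # exact linesearch step
    x_new = x - alpha * g
    ratio = np.linalg.norm(x_new) / np.linalg.norm(x)
    assert np.isclose(ratio, rho_sd)   # contraction matches rho_SD each step
    x = x_new
```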



