Continuous Optimisation HT26, Steepest descent
Flashcards
Basic definitions
Recall the ∆generic-linesearch-method
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
- Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
- Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$
In this context, @define what is meant by steepest descent.
Choose $s^k = -\nabla f(x^k)$.
When is $s^k = -\nabla f(x^k)$ a descent direction?
Whenever $\nabla f(x^k) \ne 0$.
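As a quick numerical sanity check (a sketch, not part of the notes; the quadratic objective is our own choice), the choice $s^k = -\nabla f(x^k)$ gives slope $\nabla f(x^k)^\top s^k = - \vert \vert \nabla f(x^k) \vert \vert^2 < 0$ whenever the gradient is nonzero:

```python
import numpy as np

# Illustrative objective (our own choice): f(x) = (1/2)(4 x1^2 + x2^2)
grad = lambda x: np.array([4.0 * x[0], x[1]])

x = np.array([1.0, 4.0])
g = grad(x)
s = -g                       # steepest-descent direction
slope = g @ s                # equals -||g||^2, so negative whenever g != 0
assert slope < 0
assert np.isclose(slope, -np.linalg.norm(g) ** 2)
```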
Recall the ∆generic-linesearch-method
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
- Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
- Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$
In this context, steepest descent chooses $s^k = -\nabla f(x^k)$. Why is this called steepest descent?
It is the unique global solution of
\[\min _ {s \in \mathbb R^n, \vert \vert s \vert \vert = \vert \vert \nabla f(x^k) \vert \vert } f(x^k) + s^\top \nabla f(x^k)\]
Convergence
Recall the ∆generic-linesearch-method
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
- Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
- Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$
In this context, steepest descent chooses $s^k = -\nabla f(x^k)$.
@State a result about the convergence of this algorithm when used with standard backtracking or with Armijo backtracking.
Suppose:
- $f \in \mathcal C^1(\mathbb R^n)$
- $f$ is bounded below on $\mathbb R^n$
- $\nabla f$ is Lipschitz continuous
Then with both standard backtracking and Armijo backtracking, either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.
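A minimal implementation sketch of this scheme (our own code; the Armijo parameters `beta` and `tau` and the test problem are assumptions, not from the notes) shows $\vert \vert \nabla f(x^k) \vert \vert$ being driven below the tolerance on a smooth, bounded-below quadratic:

```python
import numpy as np

def steepest_descent_armijo(f, grad, x0, eps=1e-6, beta=1e-4, tau=0.5, max_iter=50_000):
    """Steepest descent with Armijo backtracking (sketch)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            return x, k
        s = -g                                   # steepest-descent direction
        alpha = 1.0
        # backtrack until the Armijo sufficient-decrease condition holds
        while f(x + alpha * s) > f(x) + beta * alpha * (g @ s):
            alpha *= tau
        x = x + alpha * s
    return x, max_iter

# test problem: f(x) = (1/2)(a x1^2 + x2^2): C^1, bounded below, Lipschitz gradient
a = 10.0
f = lambda x: 0.5 * (a * x[0] ** 2 + x[1] ** 2)
grad = lambda x: np.array([a * x[0], x[1]])
x_star, iters = steepest_descent_armijo(f, grad, [1.0, a])
assert np.linalg.norm(grad(x_star)) <= 1e-6      # gradient norm driven to ~0
```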
Recall the ∆generic-linesearch-method
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
- Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
- Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$
In this context, steepest descent chooses $s^k = -\nabla f(x^k)$. @Prove, by appealing to another result, that if
- $f \in \mathcal C^1(\mathbb R^n)$
- $f$ is bounded below on $\mathbb R^n$
- $\nabla f$ is Lipschitz continuous
Then with Armijo backtracking either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, or $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$.
Let $s^k = -\nabla f(x^k)$ for all $k$ in ∆global-convergence-of-backtracking-armijo.
Can you give an @example of where steepest descent converges very slowly due to the objective function being poorly scaled?
Consider the poorly scaled quadratic
\[f(x) = \frac 1 2 (a x _ 1^2 + x _ 2^2) = \frac 1 2 x^\top \begin{pmatrix}a & 0 \\ 0 & 1\end{pmatrix} x\]
with $a > 1$. Steepest descent with exact line search started from $x^0 = (1, a)^\top$ yields
\[x^k = \left( \frac{a-1}{a+1} \right)^k \begin{pmatrix} (-1)^k \\ a \end{pmatrix}, \quad k \ge 0\]
Hence $x^k \to 0$ as $k \to \infty$, but when $a \gg 1$ the ratio $(a-1)/(a+1)$ is close to $1$, so steepest descent converges very slowly.
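The closed-form iterates can be checked numerically (a sketch under our assumptions: exact line search on the quadratic $f(x) = \frac 1 2 (a x _ 1^2 + x _ 2^2)$, starting from $x^0 = (1, a)^\top$, which is what the formula gives at $k = 0$):

```python
import numpy as np

a = 5.0
A = np.diag([a, 1.0])              # f(x) = (1/2) x^T A x
x = np.array([1.0, a])             # x^0 = (1, a), the k = 0 case of the formula
rho = (a - 1) / (a + 1)

for k in range(20):
    expected = rho ** k * np.array([(-1) ** k, a])
    assert np.allclose(x, expected)            # matches the closed form
    g = A @ x                                  # gradient of the quadratic
    alpha = (g @ g) / (g @ A @ g)              # exact line search along -g
    x = x - alpha * g
```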
Steepest descent struggles with scale dependence. If we have the objective
\[f(x) = \frac 1 2 (a x _ 1^2 + x _ 2^2) = \frac 1 2 x^\top \begin{pmatrix}a & 0 \\ 0 & 1\end{pmatrix} x\]
then steepest descent with exact line search started from $x^0 = (1, a)^\top$ yields
\[x^k = \left( \frac{a-1}{a+1} \right)^k \begin{pmatrix} (-1)^k \\ a \end{pmatrix}, \quad k \ge 0\]
and when $a \gg 1$, steepest descent converges very slowly. Give an @example of a linear transformation of variables that improves the scaling and @visualise why this works.
Let
\[y = \begin{pmatrix} a^{1/2} & 0 \\ 0 & 1 \end{pmatrix} x\]
and set $\bar f(y) = f(x(y)) = \frac 1 2 \vert \vert y \vert \vert^2$. Then steepest descent with exact line search from any $y^0 \in \mathbb R^2$ converges to the global minimiser in one step. This works because the change of variables turns the long, narrow elliptical contours of $f$ into the circular contours of a perfectly conditioned bowl, so the negative gradient points directly at the minimiser.
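A small numerical check of this (our own sketch): under $y = \operatorname{diag}(a^{1/2}, 1)\, x$ the objective becomes $\bar f(y) = \frac 1 2 \vert \vert y \vert \vert^2$, and one exact steepest-descent step from an arbitrary $y^0$ lands on the minimiser:

```python
import numpy as np

a = 100.0
f = lambda x: 0.5 * (a * x[0] ** 2 + x[1] ** 2)

# change of variables y = diag(a^{1/2}, 1) x, so x(y) = diag(a^{-1/2}, 1) y
fbar = lambda y: f(np.array([y[0] / np.sqrt(a), y[1]]))

y = np.array([3.0, -7.0])                   # arbitrary starting point y^0
assert np.isclose(fbar(y), 0.5 * (y @ y))   # fbar(y) = (1/2)||y||^2
g = y                                       # grad fbar(y) = y
alpha = (g @ g) / (g @ g)                   # exact line search gives alpha = 1
y_next = y - alpha * g
assert np.allclose(y_next, 0.0)             # global minimiser in one step
```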

Asymptotically, steepest descent converges linearly to a solution: if $x^k \to x^\ast$ as $k \to \infty$, then
\[\vert \vert x^{k+1} - x^\ast \vert \vert \le \rho \vert \vert x^k - x^\ast \vert \vert\]
for some constant $\rho \in (0, 1)$ and all $k$ sufficiently large.
Why is steepest descent often slow in practice?
Because $\rho$ is often very close to $1$.
@State a theorem about the convergence of steepest descent to a local minimiser $x^\ast$ of $f \in \mathcal C^2(\mathbb R^n)$ in terms of the eigenvalues of the Hessian.
Suppose:
- $f \in \mathcal C^2(\mathbb R^n)$
- $x^\ast$ local minimiser of $f$
- $\nabla^2 f(x^\ast)$ is positive definite with smallest and largest eigenvalues $\lambda^\ast _ \min, \lambda^\ast _ \max$
- $x^k \to x^\ast$ via steepest descent iterations
Then $x^k$ converges linearly to $x^\ast$ with rate
\[\rho \le \frac{\kappa(x^\ast) - 1}{\kappa(x^\ast) + 1} := \rho _ \text{SD}\]
where $\kappa(x^\ast) = \lambda^\ast _ \max / \lambda^\ast _ \min$ (i.e. the condition number of $\nabla^2 f(x^\ast)$). In practice, $\rho \approx \rho _ \text{SD}$.
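For a quadratic the Hessian is constant, so $\kappa$ can be read off directly and the rate can be checked empirically (a sketch with our own test matrix and starting point; exact line search assumed):

```python
import numpy as np

lam_min, lam_max = 1.0, 9.0
A = np.diag([lam_max, lam_min])       # Hessian of f(x) = (1/2) x^T A x
kappa = lam_max / lam_min
rho_sd = (kappa - 1) / (kappa + 1)    # predicted linear rate, 0.8 here

x = np.array([1.0, lam_max])          # a worst-case starting point
ratios = []
for _ in range(30):
    g = A @ x
    alpha = (g @ g) / (g @ A @ g)     # exact line search along -g
    x_new = x - alpha * g
    ratios.append(np.linalg.norm(x_new) / np.linalg.norm(x))
    x = x_new

# the observed per-iteration error ratio (x* = 0 here) matches rho_SD
assert np.allclose(ratios, rho_sd)
```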