Continuous Optimisation HT26, Linesearch methods
Flashcards
Suppose we wish to minimise $f(x)$ subject to $x \in \mathbb R^n$, where $f \in \mathcal C^1$ or $\mathcal C^2$.
@State the @algorithm corresponding to generic linesearch methods.
- Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
- For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
- Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
- Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$
- Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$
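As a concrete illustration, the loop above can be sketched in Python. The steepest-descent choice $s^k = -\nabla f(x^k)$ and the simple step-halving rule for $\alpha^k$ are illustrative assumptions, not part of the generic method.

```python
# Hypothetical sketch of the generic linesearch loop, using the
# steepest-descent direction s^k = -grad f(x^k) and simple step halving
# until f(x^k + a s^k) < f(x^k). All names are illustrative.

def generic_linesearch(f, grad, x0, eps=1e-8, max_iter=10_000):
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= eps:
            break                      # ||grad f(x^k)|| <= eps: stop
        s = [-gi for gi in g]          # descent direction: grad f(x)^T s < 0
        a = 1.0
        while f([xi + a * si for xi, si in zip(x, s)]) >= f(x):
            a *= 0.5                   # shrink until simple decrease holds
        x = [xi + a * si for xi, si in zip(x, s)]
    return x

# Example: f(x, y) = (x - 1)^2 + 2 y^2, with minimiser (1, 0).
f = lambda x: (x[0] - 1) ** 2 + 2 * x[1] ** 2
grad = lambda x: [2 * (x[0] - 1), 4 * x[1]]
x_star = generic_linesearch(f, grad, [5.0, -3.0])  # converges to (1, 0)
```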
Suppose we are minimising $f$ at iteration $k$. @Define what is meant by a descent direction.
Exact linesearch
@Define what is meant by an exact linesearch method.
Where the step size $\alpha^k$ is chosen so that
\[\alpha^k := \text{argmin} _ {\alpha > 0} f(x^k + \alpha s^k)\]
Suppose we have a quadratic function
\[q(x) = g^\top x + \frac 1 2 x^\top H x\]
where $x \in \mathbb R^n$ and aim to minimise this via a linesearch method along directions $s^k$.
@State the optimal stepsize $\alpha^\ast$ at step $k$ in this case.
\[\alpha^\ast = -\frac{(s^k)^\top \nabla q(x^k)}{(s^k)^\top H s^k}\]
Suppose we have a quadratic function
\[q(x) = g^\top x + \frac 1 2 x^\top H x\]
where $x \in \mathbb R^n$ and aim to minimise this via a linesearch method along directions $s^k$.
@Prove that the optimal stepsize $\alpha^\ast$ at step $k$ in this case is
\[\alpha^\ast = -\frac{(s^k)^\top \nabla q(x^k)}{(s^k)^\top H s^k}\]
We write
\[\phi _ k(\alpha) := q(x^k + \alpha s^k)\]Then, by the chain rule,
\[\begin{aligned} \phi _ k'(\alpha) &= \frac{\text d}{\text d \alpha} q(x^k + \alpha s^k) \\ &= \sum^n _ {i = 1} \frac{\text d (x^k + \alpha s^k) _ i}{\text d \alpha} \frac{\partial q}{\partial x _ i}(x^k + \alpha s^k) \\ &= \sum^n _ {i = 1} s^k _ i \frac{\partial q}{\partial x _ i}(x^k + \alpha s^k) \\ &= (s^k)^\top \nabla q(x^k + \alpha s^k) \end{aligned}\]We also have
\[\begin{aligned} \nabla q(x) &= g + Hx \\ \nabla q(x^k + \alpha s^k) &= g + H(x^k + \alpha s^k) \end{aligned}\]and hence
\[\phi _ k'(\alpha) = (s^k)^\top \nabla q(x^k) + \alpha (s^k)^\top H s^k\]Thus $\phi _ k$ has a stationary point $\alpha^\ast$ iff $(s^k)^\top H s^k \ne 0$, in which case
\[\alpha^\ast = -\frac{(s^k)^\top \nabla q(x^k)}{(s^k)^\top H s^k}\]and this is the global minimiser of $\phi _ k$ when $(s^k)^\top H s^k > 0$.
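A quick numerical sanity check of this formula; the quadratic data $g$, $H$ and the points $x$, $s$ below are arbitrary illustrative choices.

```python
# Numerical sanity check of the exact-linesearch stepsize
# a* = -(s^T grad q(x)) / (s^T H s) on a small concrete quadratic.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

g = [1.0, -2.0]
H = [[3.0, 1.0], [1.0, 2.0]]          # symmetric positive definite
q = lambda x: dot(g, x) + 0.5 * dot(x, matvec(H, x))
grad_q = lambda x: [gi + hi for gi, hi in zip(g, matvec(H, x))]

x, s = [1.0, 1.0], [-1.0, 0.5]        # current iterate and search direction
a_star = -dot(s, grad_q(x)) / dot(s, matvec(H, s))

# phi(a) = q(x + a s) should be minimised at a_star: since s^T H s > 0,
# no nearby step does better.
phi = lambda a: q([xi + a * si for xi, si in zip(x, s)])
assert all(phi(a_star) <= phi(a_star + d) for d in (-0.1, -0.01, 0.01, 0.1))
```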
Backtracking
Suppose we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$. Describe the basic backtracking @algorithm that can be used in this case.
- Choose $\alpha _ {(0)} > 0$ and $\tau \in (0, 1)$.
- While $f(x^k + \alpha _ {(i)} s^k) \ge f(x^k)$, repeat:
- Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
- $i = i + 1$
- Set $\alpha^k = \alpha _ {(i)}$.
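A minimal sketch of the basic backtracking loop above; the names and the concrete example are illustrative.

```python
# Basic backtracking: shrink alpha until simple decrease holds.
# Assumes s is a descent direction (otherwise the loop may not terminate).

def backtrack_basic(f, x, s, alpha0=1.0, tau=0.5):
    """Return the first alpha in alpha0, tau*alpha0, ... with f(x + alpha s) < f(x)."""
    alpha = alpha0
    fx = f(x)
    while f([xi + alpha * si for xi, si in zip(x, s)]) >= fx:
        alpha *= tau
    return alpha

# Example: f(x) = x^2 at x = 1 with descent direction s = -1.
alpha_k = backtrack_basic(lambda x: x[0] ** 2, [1.0], [-1.0], alpha0=4.0)
```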
@State the Armijo condition.
For some $\beta \in (0, 1)$, we aim to choose a stepsize such that
\[f(x^k + \alpha^k s^k) \le f(x^k) + \beta \alpha^k \nabla f(x^k)^\top s^k\]
What values of $\beta$ are typically used in practice?
Small values, typically $\beta \in [10^{-4}, 10^{-1}]$, so that the sufficient-decrease requirement is easy to satisfy.
Suppose we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$. Describe the Armijo backtracking (bArmijo) @algorithm that can be used in this case.
- Choose $\alpha _ {(0)} > 0$ and $\tau \in (0, 1)$.
- While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
- Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
- $i = i + 1$
- Set $\alpha^k = \alpha _ {(i)}$.
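The Armijo backtracking loop above can be sketched as follows; the names and default values of $\tau$ and $\beta$ are illustrative choices.

```python
# Armijo backtracking: shrink alpha until the sufficient-decrease
# condition f(x + alpha s) <= f(x) + beta * alpha * grad f(x)^T s holds.
# Assumes s is a descent direction, which guarantees finite termination.

def backtrack_armijo(f, grad, x, s, alpha0=1.0, tau=0.5, beta=0.1):
    """Return the first alpha in alpha0, tau*alpha0, ... satisfying Armijo."""
    alpha = alpha0
    fx = f(x)
    slope = sum(gi * si for gi, si in zip(grad(x), s))  # grad f(x)^T s < 0
    while f([xi + alpha * si for xi, si in zip(x, s)]) > fx + beta * alpha * slope:
        alpha *= tau
    return alpha

# Example: f(x) = x^2 at x = 1 with s = -1, so grad f(x)^T s = -2.
alpha_k = backtrack_armijo(lambda x: x[0] ** 2, lambda x: [2 * x[0]],
                           [1.0], [-1.0], alpha0=4.0)
```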
Suppose we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$. The Armijo backtracking algorithm in this case is as follows:
- Choose $\alpha _ {(0)} > 0$ and $\tau \in (0, 1)$.
- While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
- Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
- $i = i + 1$
- Set $\alpha^k = \alpha _ {(i)}$.
@State a result about the termination of this algorithm.
If $s^k$ is a descent direction so that $\nabla f(x^k)^\top s^k < 0$, then this algorithm terminates in a finite number of steps.
Convergence of GLM
Suppose:
- $f \in \mathcal C^1(\mathbb R^n)$
- $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
- $s^k$ is a descent direction at step $k$
@State a result about when the ∆armijo-condition
\[f(x^k + \alpha s^k) \le f(x^k) + \beta \alpha \nabla f(x^k)^\top s^k\]
is satisfied when a generic linesearch method is applied to minimising $f$.
The Armijo condition is satisfied for all $\alpha \in [0, \alpha^k _ \max]$ where
\[\alpha^k _ \max = \frac{(\beta - 1)\nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]
Suppose:
- $f \in \mathcal C^1(\mathbb R^n)$
- $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
- $s^k$ is a descent direction at step $k$
@Prove that then the ∆armijo-condition
\[f(x^k + \alpha s^k) \le f(x^k) + \beta \alpha \nabla f(x^k)^\top s^k\]
is satisfied for all $\alpha \in [0, \alpha^k _ \max]$ where
\[\alpha^k _ \max = \frac{(\beta - 1) \nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]
Note that $\alpha^k _ \max > 0$ since $\beta \in (0, 1)$ and $s^k$ is a descent direction. Then for any $\alpha > 0$ and some $\tilde \alpha \in (0, \alpha)$, we have
\[\begin{aligned} f(x^k + \alpha s^k) &= f(x^k) + \alpha \nabla f(x^k + \tilde \alpha s^k)^\top s^k &&(\star 1)\\ &= f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha [\nabla f(x^k + \tilde \alpha s^k) - \nabla f(x^k)]^\top s^k \\ &\le f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha \vert \vert \nabla f(x^k + \tilde \alpha s^k) - \nabla f(x^k) \vert \vert \, \vert \vert s^k \vert \vert &&(\star 2) \\ &\le f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha L \vert \vert x^k + \tilde \alpha s^k - x^k \vert \vert \, \vert \vert s^k \vert \vert &&(\star 3) \\ &\le f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha^2 L \vert \vert s^k \vert \vert ^2 &&(\star 4) \end{aligned}\]where:
- $(\star 1)$ is justified by the first-order Taylor expansion
- $(\star 2)$ is justified by the Cauchy-Schwarz inequality
- $(\star 3)$ is justified by the Lipschitz continuity of the gradient
- $(\star 4)$ is justified by the fact that $\tilde \alpha \le \alpha$
Thus the Armijo condition is satisfied for all $\alpha \ge 0$ such that
\[f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha^2 L \vert \vert s^k \vert \vert ^2 \le f(x^k) + \beta \alpha \nabla f(x^k)^\top s^k\]which is equivalent to $\alpha \in [0, \alpha^k _ \max]$.
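This result can be checked numerically on a one-dimensional quadratic whose gradient has a known Lipschitz constant; the concrete values of $L$, $\beta$, $x$ and $s$ below are illustrative.

```python
# Checking the Armijo interval on f(x) = L/2 * x^2, whose gradient L*x
# is Lipschitz with constant exactly L. Setup values are illustrative.
L, beta = 4.0, 0.1
f = lambda x: 0.5 * L * x * x
grad = lambda x: L * x

x, s = 1.0, -1.0                      # grad f(x) * s = -L < 0: descent
alpha_max = (beta - 1) * grad(x) * s / (L * s * s)   # here = 1 - beta

# The Armijo condition should hold for every alpha in [0, alpha_max].
armijo = lambda a: f(x + a * s) <= f(x) + beta * a * grad(x) * s
assert all(armijo(alpha_max * i / 100) for i in range(101))
```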
Suppose:
- $f \in \mathcal C^1(\mathbb R^n)$
- $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
- $s^k$ is a descent direction at step $k$
Recall that when we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$, the Armijo backtracking algorithm finds this via the algorithm
- Choose $\alpha _ {(0)} > 0$ and $\tau \in (0, 1)$.
- While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
- Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
- $i = i + 1$
- Set $\alpha^k = \alpha _ {(i)}$.
@State a result about the size of $\alpha^k$ found by this method.
We have
\[\alpha^k \ge \min \{\alpha _ {(0)}, \tau \alpha^k _ \max\}\]
for all $k \ge 0$ where $\alpha^k _ \max$ is defined as
\[\alpha^k _ \max = \frac{(\beta - 1)\nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]
Suppose:
- $f \in \mathcal C^1(\mathbb R^n)$
- $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
- $s^k$ is a descent direction at step $k$
Recall that when we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$, the Armijo backtracking algorithm finds this via the algorithm
- Choose $\alpha _ {(0)} > 0$ and $\tau \in (0, 1)$.
- While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
- Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
- $i = i + 1$
- Set $\alpha^k = \alpha _ {(i)}$.
@Prove that we then have
\[\alpha^k \ge \min \{\alpha _ {(0)}, \tau \alpha^k _ \max\}\]
for all $k \ge 0$ where $\alpha^k _ \max$ is defined as
\[\alpha^k _ \max = \frac{(\beta - 1)\nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]
If $\alpha _ {(0)}$ satisfies the Armijo condition, then the backtracking algorithm terminates with $i = 0$ and $\alpha^k = \alpha _ {(0)} \ge \min\{\alpha _ {(0)}, \tau \alpha^k _ \max\}$. Otherwise, by ∆armijo-condition-satisifed-when-lipschitz, the algorithm is guaranteed to terminate once
\[\alpha _ {(i)} \le \alpha^k _ \max\]Let $i$ be the first iteration such that
\[\alpha _ {(i-1)} > \alpha^k _ \max \quad \text{and} \quad \alpha _ {(i)} \le \alpha _ \max^k\]Then it follows that
\[\alpha^k = \alpha _ {(i)} = \tau \alpha _ {(i-1)} > \tau \alpha^k _ \max \ge \min\{\alpha _ {(0)}, \tau \alpha^k _ \max\}\]Note that if $\alpha _ {(0)} > \alpha^k _ \max$, then $\alpha _ {(i)} = \tau^i \alpha _ {(0)} \le \alpha^k _ \max$ for any $i \ge \log(\alpha _ {(0)} / \alpha^k _ \max) / \vert \log \tau \vert$, so such an $i$ exists.
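The bound can also be verified numerically; as before, the quadratic $f(x) = \tfrac L 2 x^2$ and the parameter values below are illustrative choices.

```python
# Numerical check of alpha^k >= min{alpha_(0), tau * alpha_max} on
# f(x) = L/2 * x^2, whose gradient is Lipschitz with constant L.
L, beta, tau = 4.0, 0.1, 0.5
f = lambda x: 0.5 * L * x * x
grad = lambda x: L * x

def backtrack_armijo(x, s, alpha0):
    """Armijo backtracking for the 1-D quadratic above."""
    alpha, fx, slope = alpha0, f(x), grad(x) * s
    while f(x + alpha * s) > fx + beta * alpha * slope:
        alpha *= tau
    return alpha

x, s = 1.0, -1.0                      # descent direction: grad f(x) * s < 0
alpha_max = (beta - 1) * grad(x) * s / (L * s * s)
for alpha0 in (0.05, 1.0, 4.0, 32.0):
    assert backtrack_armijo(x, s, alpha0) >= min(alpha0, tau * alpha_max)
```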
…@todo.
Suppose:
- $f \in \mathcal C^1(\mathbb R^n)$
- $f$ is bounded below on $\mathbb R^n$ by $f _ \text{low}$
- $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
- $s^k$ is a descent direction at step $k$
- We apply a generic backtracking Armijo linesearch method
@State a result about the convergence of $x^k$ in this case.
Either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or
\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]
@Prove that if
- $f \in \mathcal C^1(\mathbb R^n)$
- $f$ is bounded below on $\mathbb R^n$ by $f _ \text{low}$
- $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
- $s^k$ is a descent direction at step $k$
- We apply a generic backtracking Armijo linesearch method
then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or
\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]
If there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, we are done.
Assume now that $\nabla f(x^k) \ne 0$ for all $k \ge 0$. Then the ∆armijo-condition with $\alpha := \alpha^k$ and $x^{k+1} = x^k + \alpha^k s^k$ gives
\[f(x^{k+1}) \le f(x^k) + \beta \alpha^k \nabla f(x^k)^\top s^k\]for all $k \ge 0$, or equivalently for all $k \ge 0$
\[f(x^k) - f(x^{k+1}) \ge -\beta \alpha^k \nabla f(x^k)^\top s^k\]Since $\nabla f(x^k)^\top s^k < 0$ (as $s^k$ is a descent direction), it follows that $-\nabla f(x^k)^\top s^k = \vert \nabla f(x^k)^\top s^k \vert$ and therefore
\[f(x^k) - f(x^{k+1}) \ge \beta \alpha^k \vert \nabla f(x^k)^\top s^k \vert\]Let $i \ge 0$. Summing up this inequality from $k = 0$ to $k = i$ gives a telescoping sum and hence
\[f(x^0) - f(x^{i+1}) \ge \beta \sum^i _ {k=0} \alpha^k \vert \nabla f(x^k)^\top s^k \vert\]As $f$ is bounded below by $f _ \text{low}$, $f(x^{i+1}) \ge f _ \text{low}$ for all $i \ge 0$. Thus letting $i \to \infty$ in this inequality, it follows that
\[\beta \sum^\infty _ {k = 0} \alpha^k \vert \nabla f(x^k)^\top s^k \vert \le f(x^0) - f _ \text{low} < \infty\]Since this series converges, the individual terms must tend towards 0. Hence
\[\lim _ {k \to \infty} \alpha^k \vert \nabla f(x^k)^\top s^k \vert = 0\]Recall from the ∆armijo-stepsize-bound that
\[\alpha^k \ge \min \{ \alpha _ {(0)}, \tau \alpha^k _ \max \}\]Define the following sets
\[\mathcal K _ 1 = \{k : \alpha _ {(0)} \ge \tau \alpha^k _ \max \}, \quad \mathcal K _ 2 = \{k : \alpha _ {(0)} < \tau \alpha^k _ \max\}\]and note that every iteration $k$ belongs to either $\mathcal K _ 1$ or $\mathcal K _ 2$.
For all $k \in \mathcal K _ 1$, we have from the ∆armijo-stepsize-bound and ∆armijo-condition-satisifed-when-lipschitz that
\[\alpha^k \vert \nabla f(x^k)^\top s^k \vert \ge \frac{(1-\beta)\tau}{L} \cdot \left( \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert } \right)^2 \ge 0\]and hence the limit above implies that
\[\lim _ {k \to \infty, k \in \mathcal K _ 1} \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert } = 0\]The ∆armijo-stepsize-bound also gives that $\alpha^k \ge \alpha _ {(0)}$ for all $k \in \mathcal K _ 2$, so furthermore
\[\lim _ {k \to \infty, k \in \mathcal K _ 2} \vert \nabla f(x^k)^\top s^k \vert = 0\]These two limits for the $\mathcal K _ 1$ and $\mathcal K _ 2$ subsequences together imply the required result, namely that
\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]
Recall ∆global-convergence-of-backtracking-armijo, i.e. that if
- $f \in \mathcal C^1(\mathbb R^n)$
- $f$ is bounded below on $\mathbb R^n$ by $f _ \text{low}$
- $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
- $s^k$ is a descent direction at step $k$
- We apply a generic backtracking Armijo linesearch method
then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or
\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]
Give an intuitive explanation of this result.
Define
\[\cos \theta _ k := \frac{(-\nabla f(x^k))^\top s^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert s^k \vert \vert } = \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert s^k \vert \vert }\]i.e. $\theta _ k$ is the angle between $s^k$ and the steepest-descent direction $-\nabla f(x^k)$. Then the result says that
\[\lim _ {k \to \infty} \vert \vert \nabla f(x^k) \vert \vert \cdot \cos \theta _ k \cdot \min\{1, \vert \vert s^k \vert \vert \} = 0\]Thus to conclude $\vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$, and hence ensure the global convergence of generic linesearch methods, we require that $\cos \theta _ k \ge \delta > 0$ for all $k$ (so that $s^k$ is prevented from becoming orthogonal to the steepest-descent direction as $k$ increases) and that $\vert \vert s^k \vert \vert$ stays bounded away from zero.