Continuous Optimisation HT26, Linesearch methods


Flashcards

Suppose we wish to minimise $f(x)$ subject to $x \in \mathbb R^n$, where $f \in \mathcal C^1$ or $\mathcal C^2$.

@State the @algorithm corresponding to generic linesearch methods.


  • Choose $\epsilon > 0$ and $x^0 \in \mathbb R^n$.
  • For $k \ge 0$ and while $ \vert \vert \nabla f(x^k) \vert \vert > \epsilon$:
    • Compute descent direction $s^k \in \mathbb R^n$ (i.e. such that $\nabla f(x^k)^\top s^k < 0$).
    • Compute step size $\alpha^k > 0$ along $s^k$ such that $f(x^k + \alpha^k s^k) < f(x^k)$
    • Set $x^{k+1} = x^k + \alpha^k s^k$ and $k = k+1$
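The steps above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes steepest descent directions $s^k = -\nabla f(x^k)$ and a naive halving stepsize, which are just one possible instantiation of the generic method.

```python
import numpy as np

def generic_linesearch(f, grad, x0, eps=1e-6, max_iter=1000):
    """Generic linesearch: while the gradient is large, take a descent step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        s = -g  # steepest descent: grad f(x)^T s = -||g||^2 < 0 whenever g != 0
        alpha = 1.0
        while f(x + alpha * s) >= f(x):  # naive halving until strict decrease
            alpha *= 0.5
        x = x + alpha * s
    return x

# Minimise f(x) = ||x||^2, whose unique minimiser is the origin.
x_star = generic_linesearch(lambda x: x @ x, lambda x: 2 * x, [3.0, -4.0])
```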

Suppose we are minimising $f$ at iteration $k$. @Define what is meant by a descent direction.


\[\nabla f(x^k)^\top s^k < 0\]

Exact linesearch

@Define what is meant by an exact linesearch method.


Where the step size $\alpha^k$ is chosen so that

\[\alpha^k := \text{argmin} _ {\alpha > 0} f(x^k + \alpha s^k)\]

Suppose we have a quadratic function

\[q(x) = g^\top x + \frac 1 2 x^\top H x\]

where $x \in \mathbb R^n$ and aim to minimise this via a linesearch method along directions $s^k$.

@State the optimal stepsize $\alpha^\ast$ at step $k$ in this case.


\[\alpha^\ast = -\frac{(s^k)^\top \nabla q(x^k)}{(s^k)^\top H s^k}\]
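This formula is easy to sanity-check numerically. The values of $H$, $g$ and $x$ below are illustrative, with $H$ positive definite so that $\alpha^\ast$ really is the minimiser along $s^k$.

```python
import numpy as np

def exact_step(g, H, x, s):
    """Optimal stepsize for q(x) = g^T x + 0.5 x^T H x along direction s."""
    grad_q = g + H @ x                    # since grad q(x) = g + Hx
    return -(s @ grad_q) / (s @ (H @ s))

H = np.array([[2.0, 0.0], [0.0, 4.0]])    # positive definite
g = np.array([1.0, -1.0])
x = np.array([0.5, 0.5])
s = -(g + H @ x)                          # steepest descent direction for q

q = lambda y: g @ y + 0.5 * y @ (H @ y)
alpha = exact_step(g, H, x, s)            # minimises q(x + alpha * s) over alpha
```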

Suppose we have a quadratic function

\[q(x) = g^\top x + \frac 1 2 x^\top H x\]

where $x \in \mathbb R^n$ and aim to minimise this via a linesearch method along directions $s^k$.

@Prove that the optimal stepsize $\alpha^\ast$ at step $k$ in this case is

\[\alpha^\ast = -\frac{(s^k)^\top \nabla q(x^k)}{(s^k)^\top H s^k}\]

We write

\[\phi _ k(\alpha) := q(x^k + \alpha s^k)\]

Then

\[\begin{aligned} \phi _ k'(\alpha) &= \frac{\text d}{\text d \alpha} q(x^k + \alpha s^k) \\ &= \sum^n _ {i = 1} \frac{\text d (x^k _ i + \alpha s^k _ i)}{\text d \alpha} \frac{\partial q}{\partial x _ i}(x^k + \alpha s^k) \\ &= \sum^n _ {i = 1} s^k _ i \frac{\partial q}{\partial x _ i}(x^k + \alpha s^k) \\ &= (s^k)^\top \nabla q(x^k + \alpha s^k) \end{aligned}\]

We also have

\[\begin{aligned} \nabla q(x) &= g + Hx \\ \nabla q(x^k + \alpha s^k) &= g + H(x^k + \alpha s^k) \end{aligned}\]

and hence

\[\phi _ k'(\alpha) = (s^k)^\top \nabla q(x^k) + \alpha (s^k)^\top H s^k\]

Setting $\phi _ k'(\alpha) = 0$, we see that $\phi _ k$ has a unique stationary point iff $(s^k)^\top H s^k \ne 0$, given by

\[\alpha^\ast = -\frac{(s^k)^\top \nabla q(x^k)}{(s^k)^\top H s^k}\]

and this is the global minimiser of $\phi _ k$ when $(s^k)^\top H s^k > 0$, since $\phi _ k$ is then a strictly convex quadratic in $\alpha$.

Backtracking

Suppose we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$. Describe the basic backtracking @algorithm that can be used in this case.


  • Choose $\alpha _ {(0)} > 0$ and $\tau \in (0, 1)$.
  • While $f(x^k + \alpha _ {(i)} s^k) \ge f(x^k)$, repeat:
    • Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
    • $i = i + 1$
  • Set $\alpha^k = \alpha _ {(i)}$.

@State the Armijo condition.


For some $\beta \in (0, 1)$, we aim to choose a stepsize such that

\[f(x^k + \alpha^k s^k) \le f(x^k) + \beta \alpha^k \nabla f(x^k)^\top s^k\]

The ∆armijo-condition states that for some $\beta \in (0, 1)$, we aim to choose a stepsize such that

\[f(x^k + \alpha^k s^k) \le f(x^k) + \beta \alpha^k \nabla f(x^k)^\top s^k\]

What values of $\beta$ are typically used in practice?


\[\beta = 0.1, 0.01, 0.001\]

Suppose we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$. Describe the Armijo backtracking (bArmijo) @algorithm that can be used in this case.


  • Choose $\alpha _ {(0)} > 0$, $\beta \in (0, 1)$ and $\tau \in (0, 1)$.
  • While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
    • Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
    • $i = i + 1$
  • Set $\alpha^k = \alpha _ {(i)}$.
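A minimal Python sketch of this loop (the function name and default parameter values here are illustrative, not prescribed):

```python
import numpy as np

def armijo_backtrack(f, gk, xk, sk, alpha0=1.0, tau=0.5, beta=0.1):
    """Shrink alpha geometrically until the Armijo condition holds."""
    fk = f(xk)
    slope = gk @ sk                  # negative, since sk is a descent direction
    alpha = alpha0
    while f(xk + alpha * sk) > fk + beta * alpha * slope:
        alpha *= tau
    return alpha

# Example: f(x) = ||x||^2 at xk = (1, 1), steepest descent direction.
f = lambda x: x @ x
xk = np.array([1.0, 1.0])
gk = 2 * xk
alpha_k = armijo_backtrack(f, gk, xk, -gk)
```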

Suppose we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$. The Armijo backtracking algorithm in this case is as follows:

  • Choose $\alpha _ {(0)} > 0$, $\beta \in (0, 1)$ and $\tau \in (0, 1)$.
  • While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
    • Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
    • $i = i + 1$
  • Set $\alpha^k = \alpha _ {(i)}$.

@State a result about the termination of this algorithm.


If $s^k$ is a descent direction so that $\nabla f(x^k)^\top s^k < 0$, then this algorithm terminates in a finite number of steps.

Convergence of GLM

Suppose:

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
  • $s^k$ is a descent direction at step $k$

@State a result about when the ∆armijo-condition

\[f(x^k + \alpha s^k) \le f(x^k) + \beta \alpha \nabla f(x^k)^\top s^k\]

is satisfied when a generic linesearch method is applied to minimising $f$.


The Armijo condition is satisfied for all $\alpha \in [0, \alpha^k _ \max]$ where

\[\alpha^k _ \max = \frac{(\beta - 1)\nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]
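This bound can be checked numerically on a quadratic, where the Lipschitz constant of the gradient is just the largest eigenvalue of the Hessian. All values below are illustrative; note that $\alpha^k _ \max$ is a sufficient bound, not a tight one.

```python
import numpy as np

# f(x) = 0.5 x^T H x, so grad f(x) = H x and L = lambda_max(H).
H = np.array([[3.0, 0.0], [0.0, 1.0]])
f = lambda x: 0.5 * x @ (H @ x)
L = np.linalg.eigvalsh(H).max()

beta = 0.1
xk = np.array([1.0, 2.0])
gk = H @ xk
sk = -gk                                  # descent direction

alpha_max = (beta - 1) * (gk @ sk) / (L * (sk @ sk))

# The Armijo condition should hold for every alpha in [0, alpha_max].
armijo_ok = all(
    f(xk + a * sk) <= f(xk) + beta * a * (gk @ sk) + 1e-12
    for a in np.linspace(0.0, alpha_max, 50)
)
```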

Suppose:

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
  • $s^k$ is a descent direction at step $k$

@Prove that then the ∆armijo-condition

\[f(x^k + \alpha s^k) \le f(x^k) + \beta \alpha \nabla f(x^k)^\top s^k\]

is satisfied for all $\alpha \in [0, \alpha^k _ \max]$ where

\[\alpha^k _ \max = \frac{(\beta - 1) \nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]

Note that $\alpha^k _ \max > 0$ since $\beta \in (0, 1)$ and $s^k$ is a descent direction. Then for any $\alpha > 0$ and some $\tilde \alpha \in (0, \alpha)$, we have

\[\begin{aligned} f(x^k + \alpha s^k) &= f(x^k) + \alpha \nabla f(x^k + \tilde \alpha s^k)^\top s^k &&(\star 1)\\ &= f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha [\nabla f(x^k + \tilde \alpha s^k) - \nabla f(x^k)]^\top s^k \\ &\le f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha \vert \vert \nabla f(x^k + \tilde \alpha s^k) - \nabla f(x^k) \vert \vert \, \vert \vert s^k \vert \vert &&(\star 2) \\ &\le f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha L \vert \vert x^k + \tilde \alpha s^k - x^k \vert \vert \, \vert \vert s^k \vert \vert &&(\star 3) \\ &\le f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha^2 L \vert \vert s^k \vert \vert ^2 &&(\star 4) \end{aligned}\]

where:

  • $(\star 1)$ is justified by the first-order Taylor expansion
  • $(\star 2)$ is justified by the Cauchy-Schwarz inequality
  • $(\star 3)$ is justified by the Lipschitz continuity of the gradient
  • $(\star 4)$ is justified by the fact that $\tilde \alpha \le \alpha$

Thus the Armijo condition is satisfied for all $\alpha \ge 0$ such that

\[f(x^k) + \alpha \nabla f(x^k)^\top s^k + \alpha^2 L \vert \vert s^k \vert \vert ^2 \le f(x^k) + \beta \alpha \nabla f(x^k)^\top s^k\]

which is equivalent to $\alpha \in [0, \alpha^k _ \max]$.

Suppose:

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
  • $s^k$ is a descent direction at step $k$

Recall that when we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$, the Armijo backtracking algorithm finds this via the algorithm

  • Choose $\alpha _ {(0)} > 0$, $\beta \in (0, 1)$ and $\tau \in (0, 1)$.
  • While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
    • Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
    • $i = i + 1$
  • Set $\alpha^k = \alpha _ {(i)}$.

@State a result about the size of $\alpha^k$ found by this method.


\[\alpha^k \ge \min \{\alpha _ {(0)}, \tau \alpha^k _ \max\}\]

for all $k \ge 0$ where $\alpha^k _ \max$ is defined as

\[\alpha^k _ \max = \frac{(\beta - 1)\nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]
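Continuing with an illustrative quadratic (where $L = \lambda_{\max}(H)$ for $f(x) = \tfrac12 x^\top H x$), an actual backtracking run can be checked against this lower bound:

```python
import numpy as np

H = np.array([[3.0, 0.0], [0.0, 1.0]])
f = lambda x: 0.5 * x @ (H @ x)
L = np.linalg.eigvalsh(H).max()      # Lipschitz constant of grad f(x) = Hx

beta, tau, alpha0 = 0.1, 0.5, 1.0
xk = np.array([1.0, 2.0])
gk = H @ xk
sk = -gk

# Backtracking-Armijo loop.
alpha_k = alpha0
while f(xk + alpha_k * sk) > f(xk) + beta * alpha_k * (gk @ sk):
    alpha_k *= tau

alpha_max = (beta - 1) * (gk @ sk) / (L * (sk @ sk))
lower_bound = min(alpha0, tau * alpha_max)
```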

Suppose:

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
  • $s^k$ is a descent direction at step $k$

Recall that when we are trying to find some $\alpha^k$ to use as our stepsize to minimise $f(x^k + \alpha^k s^k)$, the Armijo backtracking algorithm finds this via the algorithm

  • Choose $\alpha _ {(0)} > 0$, $\beta \in (0, 1)$ and $\tau \in (0, 1)$.
  • While $f(x^k + \alpha _ {(i)} s^k) > f(x^k) + \beta \alpha _ {(i)} \nabla f(x^k)^\top s^k$, repeat:
    • Set $\alpha _ {(i+1)} = \tau \alpha _ {(i)}$
    • $i = i + 1$
  • Set $\alpha^k = \alpha _ {(i)}$.

@Prove that we then have

\[\alpha^k \ge \min \{\alpha _ {(0)}, \tau \alpha^k _ \max\}\]

for all $k \ge 0$ where $\alpha^k _ \max$ is defined as

\[\alpha^k _ \max = \frac{(\beta - 1)\nabla f(x^k)^\top s^k}{L \vert \vert s^k \vert \vert ^2}\]

If $\alpha _ {(0)}$ satisfies the Armijo condition, then the backtracking algorithm terminates with $i = 0$ and $\alpha^k = \alpha _ {(0)}$. Otherwise, by ∆armijo-condition-satisifed-when-lipschitz, the Armijo condition holds for every $\alpha \in [0, \alpha^k _ \max]$, so the algorithm is guaranteed to terminate no later than the first index $i$ for which

\[\alpha _ {(i)} \le \alpha^k _ \max\]

For this first index $i$, we have

\[\alpha _ {(i-1)} > \alpha^k _ \max \quad \text{and} \quad \alpha _ {(i)} \le \alpha _ \max^k\]

Since the algorithm terminates at or before iteration $i$, it follows that

\[\alpha^k \ge \alpha _ {(i)} = \tau \alpha _ {(i-1)} > \tau \alpha^k _ \max\]

Note that if $\alpha _ {(0)} > \alpha^k _ \max$, then $\alpha _ {(i)} = \tau^i \alpha _ {(0)} \le \alpha^k _ \max$ for any $i \ge \log(\alpha _ {(0)} / \alpha^k _ \max) / \vert \log \tau \vert$, so such a first index exists.

Combining the two cases, we conclude that $\alpha^k \ge \min \{\alpha _ {(0)}, \tau \alpha^k _ \max\}$ for all $k \ge 0$, as required.

Suppose:

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$ by $f _ \text{low}$
  • $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
  • $s^k$ is a descent direction at step $k$
  • We apply a generic backtracking Armijo linesearch method

@State a result about the convergence of $x^k$ in this case.


Either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or

\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]
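A small numerical illustration of this result (the test function, parameters and tolerances below are assumptions chosen for the demo, not part of the theorem): running steepest descent with backtracking-Armijo, the quantity inside the limit does tend to zero.

```python
import numpy as np

f = lambda x: np.log(1 + x @ x)          # smooth, bounded below by 0,
grad = lambda x: 2 * x / (1 + x @ x)     # with Lipschitz-continuous gradient

x = np.array([3.0, -2.0])
beta, tau = 0.1, 0.5
vals = []
for k in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:        # effectively stationary: stop
        break
    s = -g                               # steepest descent direction
    alpha = 1.0
    while f(x + alpha * s) > f(x) + beta * alpha * (g @ s):
        alpha *= tau                     # backtracking-Armijo
    x = x + alpha * s
    vals.append(min(abs(g @ s) / np.linalg.norm(s), abs(g @ s)))
```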

@Prove that if

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$ by $f _ \text{low}$
  • $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
  • $s^k$ is a descent direction at step $k$
  • We apply a generic backtracking Armijo linesearch method

then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or

\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]

If there exists $l \ge 0$ such that $\nabla f(x^l) = 0$, we are done.

Assume now that $\nabla f(x^k) \ne 0$ for all $k \ge 0$. Then the ∆armijo-condition with $\alpha := \alpha^k$ and $x^{k+1} = x^k + \alpha^k s^k$ give

\[f(x^{k+1}) \le f(x^k) + \beta \alpha^k \nabla f(x^k)^\top s^k\]

for all $k \ge 0$, or equivalently for all $k \ge 0$

\[f(x^k) - f(x^{k+1}) \ge -\beta \alpha^k \nabla f(x^k)^\top s^k\]

Since $\nabla f(x^k)^\top s^k < 0$ (as $s^k$ is a descent direction), it follows that $-\nabla f(x^k)^\top s^k = \vert \nabla f(x^k)^\top s^k \vert$ and therefore

\[f(x^k) - f(x^{k+1}) \ge \beta \alpha^k \vert \nabla f(x^k)^\top s^k \vert\]

Let $i \ge 0$. Summing up this inequality from $k = 0$ to $k = i$ gives a telescoping sum and hence

\[f(x^0) - f(x^{i+1}) \ge \beta \sum^i _ {k=0} \alpha^k \vert \nabla f(x^k)^\top s^k \vert\]

As $f$ is bounded below by $f _ \text{low}$, $f(x^{i+1}) \ge f _ \text{low}$ for all $i \ge 0$. Thus, letting $i \to \infty$ in the inequality above, it follows that

\[\beta \sum^\infty _ {k = 0} \alpha^k \vert \nabla f(x^k)^\top s^k \vert \le f(x^0) - f _ \text{low} < \infty\]

Since this series converges, the individual terms must tend towards 0. Hence

\[\lim _ {k \to \infty} \alpha^k \vert \nabla f(x^k)^\top s^k \vert = 0\]

Recall from the ∆armijo-stepsize-bound that

\[\alpha^k \ge \min \{ \alpha _ {(0)}, \tau \alpha^k _ \max \}\]

Define the following sets

\[\mathcal K _ 1 = \{k : \alpha _ {(0)} \ge \tau \alpha^k _ \max \}, \quad \mathcal K _ 2 = \{k : \alpha _ {(0)} < \tau \alpha^k _ \max\}\]

and note that every iteration $k$ either belongs to $\mathcal K _ 1$ or $\mathcal K _ 2$.

For all $k \in \mathcal K _ 1$, we have from the ∆armijo-stepsize-bound and ∆armijo-condition-satisifed-when-lipschitz that

\[\alpha^k \vert \nabla f(x^k)^\top s^k \vert \ge \frac{(1-\beta)\tau}{L} \cdot \left( \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert } \right)^2 \ge 0\]

and hence the limit above implies that

\[\lim _ {k \to \infty, k \in \mathcal K _ 1} \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert } = 0\]

The ∆armijo-stepsize-bound also gives that $\alpha^k \ge \alpha _ {(0)}$ for all $k \in \mathcal K _ 2$, so furthermore

\[\lim _ {k \to \infty, k \in \mathcal K _ 2} \vert \nabla f(x^k)^\top s^k \vert = 0\]

These two limits for the $\mathcal K _ 1$ and $\mathcal K _ 2$ subsequences together imply the required result, namely that

\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]

Recall ∆global-convergence-of-backtracking-armijo, i.e. that if

  • $f \in \mathcal C^1(\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$ by $f _ \text{low}$
  • $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$
  • $s^k$ is a descent direction at step $k$
  • We apply a generic backtracking Armijo linesearch method

then either there exists $l \ge 0$ such that $\nabla f(x^l) = 0$ or

\[\lim _ {k \to \infty} \min \left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert }, \vert \nabla f(x^k)^\top s^k \vert \right\} = 0\]

Give an intuitive explanation of this result.


Define

\[\cos \theta _ k := \frac{(-\nabla f(x^k))^\top s^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert s^k \vert \vert } = \frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert s^k \vert \vert }\]

Then the result says that

\[\lim _ {k \to \infty} \vert \vert \nabla f(x^k) \vert \vert \cdot \cos \theta _ k \cdot \min\{1, \vert \vert s^k \vert \vert \} = 0\]
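To see why, note that since $\vert \nabla f(x^k)^\top s^k \vert = \vert \vert \nabla f(x^k) \vert \vert \, \vert \vert s^k \vert \vert \cos \theta _ k$, the two terms in the minimum become

\[\frac{ \vert \nabla f(x^k)^\top s^k \vert }{ \vert \vert s^k \vert \vert } = \vert \vert \nabla f(x^k) \vert \vert \cos \theta _ k, \qquad \vert \nabla f(x^k)^\top s^k \vert = \vert \vert \nabla f(x^k) \vert \vert \cos \theta _ k \, \vert \vert s^k \vert \vert\]

and factoring out $\vert \vert \nabla f(x^k) \vert \vert \cos \theta _ k \ge 0$ gives the form above.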

Thus to ensure global convergence of generic linesearch methods in the sense that $\vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$, it suffices to require that $\cos \theta _ k \ge \delta > 0$ for all $k$, so that $s^k$ is prevented from becoming arbitrarily close to orthogonal to the steepest descent direction, and that $\vert \vert s^k \vert \vert$ stays bounded away from zero.



