Optimisation for Data Science HT25, Line search methods



Flashcards

Assumptions

When solving unconstrained optimisation problems of the form

\[\min _ {x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ using updates of the form

\[x^{k+1} = x^k + \alpha _ k d^k\]

what assumptions do we make on $f$, $d^k$ and $\alpha _ k$?


  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
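These two conditions are easy to check numerically for a candidate step. A minimal sketch in Python (the helper `wolfe_ok` and the quadratic example are illustrative, not from the course):

```python
import numpy as np

def wolfe_ok(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions (Assumptions 2 and 3) for step length alpha."""
    g = grad(x)
    x_new = x + alpha * d
    armijo = f(x_new) <= f(x) + c1 * alpha * (g @ d)    # sufficient decrease
    curvature = grad(x_new) @ d >= c2 * (g @ d)         # step not too short
    return bool(armijo and curvature)

# Example: f(x) = ||x||^2 / 2 with steepest descent direction from x = (1, 1)
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, 1.0])
d = -grad(x)

print(wolfe_ok(f, grad, x, d, alpha=1.0))    # True: exact minimiser along d
print(wolfe_ok(f, grad, x, d, alpha=0.01))   # False: fails the curvature condition
```

Here $\alpha = 1$ lands exactly on the minimiser along $d$, while $\alpha = 0.01$ satisfies the Armijo condition but is rejected by the curvature condition for being too short.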

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

Can you give intuitive justifications for why we have the three assumptions?


Assumption 1: We may rewrite this condition to say that

\[\cos(\angle(-\nabla f(x^k), d^k)) = \frac{(-\nabla f(x^k))^\top d^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert } \ge \eta\]

This says that the directions don’t become nearly orthogonal to the steepest descent direction.

Assumption 2: This is just an “upgraded” version of the condition that $f(x^{k+1}) < f(x^k)$, with an additional term proportional to the step length and directional derivative to ensure that there is sufficient decrease.

Assumption 3: This prevents Assumption 2 being satisfied by just taking very small values of $\alpha _ k$.

$L$-smooth general case

Overestimation property

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, @state an overestimation property on the iterates.


\[f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]

@important~ (overestimation property)

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

@Prove that in this case we have the following overestimation property on the iterates:

\[f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]

Lower bound step size: By Assumption 3, we have that

\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]

By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:

\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\stackrel{\text{C.S.}}{\le} \vert \vert \nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert \\\\ &\stackrel{\text{L.S.}}{\le} L \cdot \alpha _ k \vert \vert d^k \vert \vert ^2 \end{aligned}\]

Solving for $\alpha _ k$ implies that

\[\alpha _ k \ge -\frac{1-c _ 2}{L \vert \vert d^k \vert \vert ^2} \cdot \nabla f(x^k) ^\top d^k\]

and hence by Assumption 1,

\[\alpha _ k \ge \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert }\]

Use lower bound on step size to bound next iterate: Then

\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert } \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2 && (4\star) \end{aligned}\]

where:

  • $(1\star)$ is by definition
  • $(2\star)$ is by Assumption 2
  • $(3\star)$ is by the previous result
    • This is a bit confusing since the bound on $\alpha _ k$ is a lower bound, but note that $\nabla f(x^k)^\top d^k < 0$ by Assumption 1, so multiplying through by it reverses the inequality.
  • $(4\star)$ is by Assumption 1

@important~ (overestimation property)
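As a sanity check, the inequality can be verified numerically on a simple example. A sketch assuming $f(x) = \frac 1 2 \vert \vert x \vert \vert ^2$ (so $L = 1$, and steepest descent directions give $\eta = 1$); the constants $c _ 1, c _ 2$ are chosen for illustration:

```python
import numpy as np

# Sanity check on f(x) = ||x||^2 / 2, which is L-smooth with L = 1;
# steepest descent directions give eta = 1 in Assumption 1.
L, eta, c1, c2 = 1.0, 1.0, 0.25, 0.75
f = lambda x: 0.5 * x @ x
grad = lambda x: x

x = np.array([1.0, 1.0])
g = grad(x)
d = -g           # steepest descent direction
alpha = 1.0      # one step length satisfying both Wolfe conditions here

assert f(x + alpha * d) <= f(x) + c1 * alpha * (g @ d)   # Armijo holds
assert grad(x + alpha * d) @ d >= c2 * (g @ d)           # curvature holds

# The overestimation property then holds as well:
bound = f(x) - (c1 * (1 - c2) * eta**2 / L) * (g @ g)
print(f(x + alpha * d) <= bound)   # True
```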

Convergence results

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, @state a result concerning the convergence of long-step steepest descent in the general case, and contrast it with results on the convergence of steepest descent where the step size is fixed at $\frac 1 L$.


For long-step methods:

\[\min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]

For fixed-step methods:

\[\min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]

So both rates are $O(1/\sqrt T)$; only the constant changes.

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

@Prove that then we have a reasonable bound on the convergence rate of steepest descent:

\[\min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]

Bound step size: By Assumption 3, we have that

\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]

By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:

\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\stackrel{\text{C.S.}}{\le} \vert \vert \nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert \\\\ &\stackrel{\text{L.S.}}{\le} L \cdot \alpha _ k \vert \vert d^k \vert \vert ^2 \end{aligned}\]

Solving for $\alpha _ k$ implies that

\[\alpha _ k \ge -\frac{1-c _ 2}{L \vert \vert d^k \vert \vert ^2} \cdot \nabla f(x^k) ^\top d^k\]

and hence by Assumption 1,

\[\alpha _ k \ge \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert }\]

Use bound on step size to bound derivative: Then

\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert } \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2 && (4\star) \end{aligned}\]

where $(1\star)$ is by definition, $(2\star)$ is by Assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by Assumption 1.

Now solving for $ \vert \vert \nabla f(x^k) \vert \vert ^2$ gives

\[\vert \vert \nabla f(x^k) \vert \vert ^2 \le \frac{L}{c _ 1(1 - c _ 2)\eta^2}(f(x^k) - f(x^{k+1}))\]

Telescoping sum: Summing over $k$, it follows:

\[\begin{aligned} \min _ {0 \le k\le T-1} \vert \vert \nabla f(x^k) \vert \vert ^2 &\le \frac 1 T\sum^{T-1} _ {k = 0} \vert \vert \nabla f(x^k) \vert \vert ^2 \\\\ &\le \frac{L}{c _ 1 (1 - c _ 2)\eta^2 T} \sum^{T-1} _ {k = 0}(f(x^k) - f(x^{k+1})) \\\\ &= \frac{L}{c _ 1 (1 - c _ 2)\eta^2 T} (f(x^0) - f(x^T)) \\\\ &\le \frac{L}{c _ 1(1 - c _ 2) \eta^2 T} (f(x^0) - C) \end{aligned}\]

and therefore:

\[\begin{aligned} \min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert &\le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} } \end{aligned}\]

Remarks: When using the fixed step size of $\alpha _ k = \frac 1 L$, we have the “foundational inequality”

\[f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} \vert \vert \nabla f(x^k) \vert \vert ^2\]

When we bound the derivative above, we prove that

\[f(x^k + \alpha _ k d^k)\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]

which is analogous to the foundational inequality, with the constant $\frac{c _ 1 (1 - c _ 2)\eta^2}{L}$ playing the role of $\frac{1}{2L}$.
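The fixed-step bound above can be checked empirically by running steepest descent with $\alpha _ k = \frac 1 L$ on a quadratic. A sketch (the matrix `Q` and starting point are illustrative assumptions):

```python
import numpy as np

# Fixed-step steepest descent on f(x) = x^T Q x / 2, which is L-smooth with
# L = max eigenvalue of Q, and bounded below by C = 0.
Q = np.diag([1.0, 10.0])
L, C = 10.0, 0.0
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x = np.array([5.0, 1.0])
f0, T = f(x), 50
grad_norms = []
for _ in range(T):
    grad_norms.append(np.linalg.norm(grad(x)))
    x = x - (1.0 / L) * grad(x)

# Fixed-step bound: min gradient norm over k = 0, ..., T-1
bound = np.sqrt(2 * L * (f0 - C) / T)
print(min(grad_norms) <= bound)   # True
```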

With weaker assumptions on the descent direction

There is a lot of overlap here with [[Notes - Optimisation for Data Science HT25, Coordinate descent]], since the analysis there applies these results with $d^k$ as the coordinate gradient directions.

Overestimation property

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, we have the overestimation property that

\[f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]

However, sometimes it’s not possible to guarantee Assumption 1, such as when considering cyclic coordinate descent. @State another overestimation property that doesn’t require Assumption 1 but is less useful.


\[f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \cos^2 \theta _ k \vert \vert \nabla f(x^k) \vert \vert ^2\]

where $\cos \theta _ k$ is the cosine of the angle between the chosen descent direction $d^k$ and the steepest descent direction $-\nabla f(x^k)$:

\[\cos \theta _ k = \frac{-\nabla f(x^k)^\top d^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert }\]

@important~ (overestimation property)

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, we have the overestimation property that

\[f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]

However, sometimes it’s not possible to guarantee Assumption 1, such as when considering cyclic coordinate descent.

@Prove that without Assumption 1, we can still guarantee

\[f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \cos^2 \theta _ k \vert \vert \nabla f(x^k) \vert \vert ^2\]

where $\cos \theta _ k$ is the cosine of the angle between the chosen descent direction $d^k$ and the steepest descent direction $-\nabla f(x^k)$:

\[\cos \theta _ k = \frac{-\nabla f(x^k)^\top d^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert }\]

@important~ (overestimation property)


Let $\eta _ k := \cos \theta _ k$.

Bound step size: By the curvature condition, we have that

\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]

By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:

\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\stackrel{\text{C.S.}}{\le} \vert \vert \nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert \\\\ &\stackrel{\text{L.S.}}{\le} L \cdot \alpha _ k \vert \vert d^k \vert \vert ^2 \end{aligned}\]

Solving for $\alpha _ k$ implies that

\[\alpha _ k \ge -\frac{1-c _ 2}{L \vert \vert d^k \vert \vert ^2} \cdot \nabla f(x^k) ^\top d^k\]

and hence, by the definition of $\eta _ k = \cos \theta _ k$,

\[\alpha _ k \ge \frac{\eta _ k(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert }\]

Use bound on step size to bound derivative: Then

\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta _ k(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert } \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta _ k^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2 && (4\star) \end{aligned}\]

where $(1\star)$ is by definition, $(2\star)$ is by Assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by the definition of $\eta _ k = \cos \theta _ k$.

@important~ (overestimation property)

Convergence results

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
  • $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
  • Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
    • There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
  • (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
    • There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
    • (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
    • (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

However, sometimes it’s not possible to guarantee Assumption 1, such as when considering cyclic coordinate descent. @State a result about what you can still guarantee about the iterates without Assumption 1.


The sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies:

\[\sum^\infty _ {k = 0} \cos^2 \theta _ k \cdot \vert \vert \nabla f(x^k) \vert \vert ^2 < \infty\]

and hence $\cos \theta _ k \, \vert \vert \nabla f(x^k) \vert \vert \to 0$, i.e. either $\theta _ k \to \frac\pi 2$ or $\nabla f(x^k) \to 0$ (possibly alternating along subsequences), where

\[\theta _ k = \angle(-\nabla f(x^k), d^k)\]
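To see concretely why Assumption 1 can fail for coordinate directions: with $d^k = -e _ i$ and $g _ i > 0$, the cosine above reduces to $g _ i / \vert \vert g \vert \vert$, which can be arbitrarily small. A quick numerical illustration (the gradient values are hypothetical):

```python
import numpy as np

# For a coordinate descent direction d = -e_1, the angle to steepest descent has
# cos(theta) = g_1 / ||g||, which no fixed eta > 0 can bound away from zero.
g = np.array([1.0, 100.0])     # hypothetical gradient: tiny first component
d = np.array([-1.0, 0.0])      # descent along the first coordinate
cos_theta = -(g @ d) / (np.linalg.norm(g) * np.linalg.norm(d))
print(round(cos_theta, 3))     # about 0.01: nearly orthogonal to -grad f
```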

Not in the lectures or slides, but has come up in problem sheets and past papers.

@Define what is meant by exact line-search and @prove that if

\[f(x) = f(0) + \nabla f(0)^\top x + \frac 1 2 x^\top \nabla^2 f(x)\, x\]

is optimised with exact line-search, then the step length $\alpha^k$ chosen with search direction $s^k$ at each iteration is

\[\alpha^k = \frac{-\nabla f(x^k)^\top s^k}{(s^k)^\top \nabla^2 f(x^k)\, s^k}\]

Exact line search in direction $s^k$ picks

\[\alpha^k = \text{argmin} _ {\alpha > 0} f(x^k + \alpha s^k)\]

Note we may write

\[f(x^k + \alpha s^k) = f(x^k) + \alpha \nabla f(x^k)^\top s^k + \frac{\alpha^2}{2} (s^k)^\top \nabla^2 f(x^k)\, s^k\]

Differentiating with respect to $\alpha$ and setting to zero (a minimum, assuming $(s^k)^\top \nabla^2 f(x^k)\, s^k > 0$ so the quadratic in $\alpha$ opens upwards), we obtain

\[\alpha^k = \frac{-\nabla f(x^k)^\top s^k}{(s^k)^\top \nabla^2 f(x^k)\, s^k}\]

@exam~
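The closed-form step can be sanity-checked by comparing it against nearby step lengths. A sketch on an assumed quadratic (the matrix `H` and vector `b` are illustrative):

```python
import numpy as np

# Quadratic f(x) = b^T x + x^T H x / 2 with constant Hessian H (assumed values).
H = np.array([[2.0, 0.0], [0.0, 8.0]])
b = np.array([-2.0, -8.0])
f = lambda x: b @ x + 0.5 * x @ H @ x
grad = lambda x: b + H @ x

x = np.zeros(2)
s = -grad(x)                             # steepest descent direction
alpha = -(grad(x) @ s) / (s @ H @ s)     # closed-form exact line-search step

# alpha minimises the 1-D quadratic along s: perturbed steps are no better
perturbed = min(f(x + (alpha - 0.01) * s), f(x + (alpha + 0.01) * s))
print(f(x + alpha * s) <= perturbed)     # True
```

The directional derivative $\nabla f(x^k + \alpha s^k)^\top s^k$ vanishes at this $\alpha$, which is exactly the first-order condition used in the proof.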

Suppose you’re considering a function which may be written exactly as

\[f(x) = f(0) + \nabla f(0)^\top x + \frac 1 2 x^\top \nabla^2 f(x)\, x\]

and are asked to find a change of variables $x = A^{-1} y$ so that steepest descent with exact line search applied to the new function $g(y) = f(x(y))$ converges in one iteration.

How should you go about this, and what is one candidate $A$ you could pick?


The idea is to choose $A$ so that the Hessian of the new function is the identity, since $\nabla^2 g(y) = A^{-\top} \nabla^2 f \, A^{-1}$. Writing $c = \nabla g(0)$, we then have $\nabla g(y) = y + c$, so

\[y^{k+1} = y^k - \alpha \nabla g(y^k) = y^k - \alpha (y^k + c)\]

Exact line search picks $\alpha = 1$ (as the Hessian of $g$ is the identity), giving $y^{k+1} = -c$, which is the minimiser of $g$, so the method converges in one step. One choice is hence

\[A = \left(\nabla^2 f\right)^{1/2}\]
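This can be verified numerically; with a diagonal Hessian the square root is immediate. A sketch (the values of `H`, `b`, and the starting point are illustrative assumptions):

```python
import numpy as np

# Assumed quadratic data: f(x) = b^T x + x^T H x / 2 with diagonal Hessian H.
H = np.diag([2.0, 8.0])
b = np.array([1.0, -4.0])

A = np.diag(np.sqrt(np.diag(H)))        # A = H^{1/2}; change of variables x = A^{-1} y
A_inv = np.diag(1.0 / np.diag(A))

# g(y) = f(A^{-1} y) has Hessian A^{-1} H A^{-1} = I, so a single gradient step
# with alpha = 1 (the exact line-search choice) reaches the minimiser.
grad_g = lambda y: A_inv @ (b + H @ (A_inv @ y))
y = np.array([3.0, 7.0])                # arbitrary starting point
y_next = y - 1.0 * grad_g(y)
print(np.allclose(grad_g(y_next), 0))   # True: y_next is the minimiser
```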


