Optimisation for Data Science HT25, Line search methods
Flashcards
Assumptions
When solving unconstrained optimisation problems of the form
\[\min _ {x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ using updates of the form
\[x^{k+1} = x^k + \alpha _ k d^k\]
what assumptions do we make on $f$, $d^k$ and $\alpha _ k$?
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
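To make the Wolfe conditions concrete, here is a minimal Python sketch that checks the Armijo and curvature conditions for a candidate step on a toy quadratic (the matrix `A`, the point `x` and the values $c _ 1 = 10^{-4}$, $c _ 2 = 0.9$ are illustrative choices, not from the lectures):

```python
import numpy as np

def satisfies_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo (Assumption 2) and curvature (Assumption 3) conditions."""
    slope = grad(x) @ d                              # directional derivative ∇f(x)ᵀd, negative for a descent direction
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope
    curvature = grad(x + alpha * d) @ d >= c2 * slope
    return armijo, curvature

# Toy convex quadratic f(x) = ½ xᵀAx (illustrative).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, -2.0])
d = -grad(x)                                         # steepest descent direction
for alpha in [1e-3, 0.3, 5.0]:
    print(alpha, satisfies_wolfe(f, grad, x, d, alpha))
# Very small steps fail the curvature condition ("too short"),
# very large steps fail the Armijo condition ("too long").
```

For this particular example, $\alpha = 0.3$ satisfies both conditions, $\alpha = 10^{-3}$ only the Armijo condition, and $\alpha = 5$ only the curvature condition.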
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
Can you give intuitive justifications for why we have the three assumptions?
Assumption 1: We may rewrite this condition to say that
\[\cos(\angle(-\nabla f(x^k), d^k)) = \frac{(-\nabla f(x^k))^\top d^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert } \ge \eta\]This says that the directions don’t become nearly orthogonal to the steepest descent direction.
Assumption 2: This is just an “upgraded” version of the condition that $f(x^{k+1}) < f(x^k)$, with an additional term proportional to the step length and directional derivative to ensure that there is sufficient decrease.
Assumption 3: This prevents Assumption 2 being satisfied by just taking very small values of $\alpha _ k$.
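As a quick numerical illustration of the cosine rewriting of Assumption 1, the following sketch computes $\cos(\angle(-\nabla f(x^k), d^k))$ for a few candidate directions (the gradient vector used is an arbitrary illustrative choice):

```python
import numpy as np

def cos_angle_to_steepest_descent(grad_x, d):
    """cos(∠(-∇f(x), d)) = (-∇f(x))ᵀd / (||∇f(x)|| · ||d||)."""
    return (-grad_x @ d) / (np.linalg.norm(grad_x) * np.linalg.norm(d))

grad_x = np.array([2.0, -1.0])                                        # illustrative gradient
print(cos_angle_to_steepest_descent(grad_x, -grad_x))                 # 1.0: steepest descent itself
print(cos_angle_to_steepest_descent(grad_x, np.array([-1.0, 0.0])))   # ≈ 0.89: fine for any η ≤ 0.89
print(cos_angle_to_steepest_descent(grad_x, np.array([-1.0, -2.0])))  # 0.0: orthogonal to -∇f, violates Assumption 1 for every η > 0
```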
$L$-smooth general case
Overestimation property
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
In this context, @state an overestimation property on the iterates.
@important~ (overestimation property)
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
@Prove that in this case we have the following overestimation property on the iterates:
\[f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]
Lower bound step size: By Assumption 3, we have that
\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:
\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\stackrel{\text{C.S.} }{\le} \vert \vert \nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert \\\\ &\stackrel{\text{L.S.} }{\le} L \cdot \alpha _ k \vert \vert d^k \vert \vert ^2 \end{aligned}\]where C.S. denotes the Cauchy-Schwarz inequality and L.S. the $L$-smoothness of $f$ (Lipschitz continuity of $\nabla f$, together with $\vert \vert x^{k+1} - x^k \vert \vert = \alpha _ k \vert \vert d^k \vert \vert$). Solving for $\alpha _ k$ implies that
\[\alpha _ k \ge -\frac{1-c _ 2}{L \vert \vert d^k \vert \vert ^2} \cdot \nabla f(x^k) ^\top d^k\]and hence by Assumption 1,
\[\alpha _ k \ge \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert }\]Use lower bound on step size to bound next iterate: Then
\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert } \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2 && (4\star) \end{aligned}\]where:
- $(1\star)$ is by definition
- $(2\star)$ is by Assumption 2
- $(3\star)$ is by the previous result
- This is a bit confusing since the bound on $\alpha _ k$ is a lower bound, but note that $\nabla f(x^k)^\top d^k \le 0$ by Assumption 1, so replacing $\alpha _ k$ by its lower bound makes the term $c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$ larger and the inequality still goes the right way.
- $(4\star)$ is by Assumption 1
@important~ (overestimation property)
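A small numerical check of the overestimation property: run steepest descent ($\eta = 1$) with Wolfe steps on a toy quadratic and verify the per-iteration decrease. This assumes `scipy.optimize.line_search`, which to my understanding returns a step satisfying the (strong) Wolfe conditions; the quadratic and the starting point are illustrative choices:

```python
import numpy as np
from scipy.optimize import line_search

# Toy L-smooth convex quadratic f(x) = ½ xᵀAx (illustrative choice).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()              # Lipschitz constant of ∇f for a quadratic

c1, c2, eta = 1e-4, 0.9, 1.0                 # eta = 1 because we use steepest descent directions
x = np.array([5.0, -3.0])
for k in range(10):
    d = -grad(x)                             # Assumption 1 holds with η = 1
    alpha = line_search(f, grad, x, d, c1=c1, c2=c2)[0]   # Wolfe step (assumed scipy API)
    if alpha is None:                        # line search failed; should not happen here
        break
    x_new = x + alpha * d
    bound = f(x) - c1 * (1 - c2) * eta**2 / L * np.linalg.norm(grad(x)) ** 2
    assert f(x_new) <= bound + 1e-12         # the overestimation property
    x = x_new
print("overestimation property held; final ||∇f(x)|| =", np.linalg.norm(grad(x)))
```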
Convergence results
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
In this context, @state a result concerning the convergence of long-step steepest descent in the general case, and contrast it with results on the convergence of steepest descent where the step size is fixed at $\frac 1 L$.
For long-step methods:
\[\min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]For fixed-step methods:
\[\min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]So the $O(1/\sqrt T)$ rate is identical; only the constant changes.
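A sketch comparing the two bounds empirically, on an illustrative quadratic with $C = 0$ and assuming `scipy.optimize.line_search` for the Wolfe steps:

```python
import numpy as np
from scipy.optimize import line_search

# Toy L-smooth quadratic f(x) = ½ xᵀAx, bounded below by C = 0 (illustrative).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()
c1, c2, eta, C, T = 1e-4, 0.9, 1.0, 0.0, 20
x0 = np.array([5.0, -3.0])

def min_grad_norm(step_rule):
    """Run T steepest-descent steps; return min_k ||∇f(x^k)|| over k = 0, ..., T-1."""
    x, norms = x0.copy(), []
    for _ in range(T):
        norms.append(np.linalg.norm(grad(x)))
        x = x + step_rule(x) * (-grad(x))
    return min(norms)

def wolfe_step(x):
    alpha = line_search(f, grad, x, -grad(x), c1=c1, c2=c2)[0]  # assumed scipy Wolfe search
    return alpha if alpha is not None else 1.0 / L              # fall back to 1/L if it fails

print("fixed 1/L:", min_grad_norm(lambda x: 1.0 / L), "<=", np.sqrt(2 * L * (f(x0) - C) / T))
print("long-step:", min_grad_norm(wolfe_step), "<=", np.sqrt(L * (f(x0) - C) / (eta**2 * c1 * (1 - c2) * T)))
```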
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
@Prove that then we have a reasonable bound on the convergence rate of steepest descent:
\[\min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]
Bound step size: By Assumption 3, we have that
\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:
\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\stackrel{\text{C.S.} }{\le} \vert \vert \nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert \\\\ &\stackrel{\text{L.S.} }{\le} L \cdot \alpha _ k \vert \vert d^k \vert \vert ^2 \end{aligned}\]Solving for $\alpha _ k$ implies that
\[\alpha _ k \ge -\frac{1-c _ 2}{L \vert \vert d^k \vert \vert ^2} \cdot \nabla f(x^k) ^\top d^k\]and hence by Assumption 1,
\[\alpha _ k \ge \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert }\]Use bound on step size to bound derivative: Then
\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert } \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2 && (4\star) \end{aligned}\]where $(1\star)$ is by definition, $(2\star)$ is by Assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by Assumption 1.
Now solving for $ \vert \vert \nabla f(x^k) \vert \vert ^2$ gives
\[\vert \vert \nabla f(x^k) \vert \vert ^2 \le \frac{L}{c _ 1(1 - c _ 2)\eta^2}(f(x^k) - f(x^{k+1}))\]Telescoping sum: Summing over $k$, it follows:
\[\begin{aligned} \min _ {0 \le k\le T-1} \vert \vert \nabla f(x^k) \vert \vert ^2 &\le \frac 1 T\sum^{T-1} _ {k = 0} \vert \vert \nabla f(x^k) \vert \vert ^2 \\\\ &\le \frac{L}{c _ 1 (1 - c _ 2)\eta^2 T} \sum^{T-1} _ {k = 0}(f(x^k) - f(x^{k+1})) \\\\ &= \frac{L}{c _ 1 (1 - c _ 2)\eta^2 T} (f(x^0) - f(x^T)) \\\\ &\le \frac{L}{c _ 1(1 - c _ 2) \eta^2 T} (f(x^0) - C) \end{aligned}\]and therefore:
\[\begin{aligned} \min _ {0 \le k \le T - 1} \vert \vert \nabla f(x^k) \vert \vert &\le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} } \end{aligned}\]Remarks: When using the fixed step size of $\alpha _ k = \frac 1 L$, we have the “foundational inequality”
\[f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} \vert \vert \nabla f(x^k) \vert \vert ^2\]When we bound the derivative above, we prove that
\[f(x^k + \alpha _ k d^k)\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]which plays the same role as the foundational inequality, just with the smaller constant $\frac{c _ 1 (1 - c _ 2)\eta^2}{L}$ in place of $\frac{1}{2L}$.
With weaker assumptions on the descent direction
There is a lot of overlap here with [[Notes - Optimisation for Data Science HT25, Coordinate descent]], since the analysis there applies these results with $d^k$ as the coordinate gradient directions.
Overestimation property
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
In this context, we have the overestimation property that
\[f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]
However, it is sometimes not possible to guarantee Assumption 1, for example when using cyclic coordinate descent. @State another overestimation property that doesn’t require Assumption 1 but is less useful.
\[f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \cos^2 \theta _ k \vert \vert \nabla f(x^k) \vert \vert ^2\]where $\cos \theta _ k$ is the cosine of the angle between the chosen descent direction $d^k$ and the steepest descent direction $-\nabla f(x^k)$:
\[\cos \theta _ k = \frac{-\nabla f(x^k)^\top d^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert }\]@important~ (overestimation property)
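The reason Assumption 1 fails for coordinate directions is that no fixed $\eta > 0$ works: $\cos \theta _ k$ can be arbitrarily small when most of the gradient lives in the coordinates we are not updating. A tiny illustrative sketch (the gradient vector is an assumption for the example):

```python
import numpy as np

def cos_theta(grad_x, d):
    """cos θ = (-∇f(x))ᵀd / (||∇f(x)|| ||d||)."""
    return (-grad_x @ d) / (np.linalg.norm(grad_x) * np.linalg.norm(d))

grad_x = np.array([1e-3, 1.0])        # illustrative gradient: almost all of it lives in coordinate 2
d = np.array([-grad_x[0], 0.0])       # coordinate descent step along coordinate 1
print(cos_theta(grad_x, d))           # ≈ 1e-3, so the per-step decrease bound degrades by cos²θ_k ≈ 1e-6
```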
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
In this context, we have the overestimation property that
\[f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2\]
However, it is sometimes not possible to guarantee Assumption 1, for example when using cyclic coordinate descent.
@Prove that without Assumption 1, we can still guarantee
\[f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \cos^2 \theta _ k \vert \vert \nabla f(x^k) \vert \vert ^2\]
where $\cos \theta _ k$ is the cosine of the angle between the chosen descent direction $d^k$ and the steepest descent direction $-\nabla f(x^k)$:
\[\cos \theta _ k = \frac{-\nabla f(x^k)^\top d^k}{ \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert }\]
@important~ (overestimation property)
Let $\eta _ k := \cos \theta _ k$.
Bound step size: By the curvature condition, we have that
\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:
\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\stackrel{\text{C.S.} }{\le} \vert \vert \nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert \\\\ &\stackrel{\text{L.S.} }{\le} L \cdot \alpha _ k \vert \vert d^k \vert \vert ^2 \end{aligned}\]Solving for $\alpha _ k$ implies that
\[\alpha _ k \ge -\frac{1-c _ 2}{L \vert \vert d^k \vert \vert ^2} \cdot \nabla f(x^k) ^\top d^k\]and hence, since $\nabla f(x^k)^\top d^k = -\eta _ k \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert$ by the definition of $\eta _ k = \cos \theta _ k$,
\[\alpha _ k \ge \frac{\eta _ k(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert }\]Use bound on step size to bound derivative: Then
\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta _ k(1 - c _ 2) \vert \vert \nabla f(x^k) \vert \vert }{L \vert \vert d^k \vert \vert } \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta _ k^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2 && (4\star) \end{aligned}\]where $(1\star)$ is by definition, $(2\star)$ is by the Armijo condition (Assumption 2), $(3\star)$ is by the previous result (note $\nabla f(x^k)^\top d^k \le 0$ since $d^k$ is a descent direction), and $(4\star)$ is again by the definition of $\eta _ k = \cos \theta _ k$.
@important~ (overestimation property)
Convergence results
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
However, it is sometimes not possible to guarantee Assumption 1, for example when using cyclic coordinate descent. @State a result about what you can still guarantee about the iterates without Assumption 1.
The sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies:
\[\sum^\infty _ {k = 0} \cos^2 \theta _ k \, \vert \vert \nabla f(x^k) \vert \vert ^2 < \infty\]and hence $\cos^2 \theta _ k \, \vert \vert \nabla f(x^k) \vert \vert ^2 \to 0$: informally, either the search directions become asymptotically orthogonal to the steepest descent direction ($\theta _ k \to \frac\pi 2$) or $\nabla f(x^k) \to 0$, where
\[\theta _ k = \angle(-\nabla f(x^k), d^k)\]
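Summing the weaker overestimation property gives $\sum _ k \cos^2 \theta _ k \vert \vert \nabla f(x^k) \vert \vert ^2 \le \frac{L(f(x^0) - C)}{c _ 1 (1 - c _ 2)}$, which is where the result above comes from. The following sketch checks the partial sums for cyclic coordinate descent with Wolfe steps on an illustrative quadratic (again assuming `scipy.optimize.line_search` returns a Wolfe step):

```python
import numpy as np
from scipy.optimize import line_search

# Toy quadratic f(x) = ½ xᵀAx, bounded below by C = 0 (illustrative).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()
c1, c2, C = 1e-4, 0.9, 0.0
x0 = np.array([5.0, -3.0])

x, partial_sum = x0.copy(), 0.0
for k in range(40):                                  # cyclic coordinate descent with Wolfe steps
    g = grad(x)
    d = np.zeros(2)
    d[k % 2] = -g[k % 2]                             # coordinate direction: Assumption 1 need not hold
    if not np.any(d):                                # zero direction, nothing to do this step
        continue
    partial_sum += (g @ d) ** 2 / (d @ d)            # = cos²θ_k ||∇f(x^k)||²
    alpha = line_search(f, grad, x, d, c1=c1, c2=c2)[0]
    if alpha is None:                                # line search failed; stop early
        break
    x = x + alpha * d
print(partial_sum, "<=", L * (f(x0) - C) / (c1 * (1 - c2)))
```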
Exact line search
Not in the lectures or slides, but has come up in problem sheets and past papers.
@Define what is meant by exact line-search and @prove that if
\[f(x) = f(0) + \nabla f(0)^\top x + \frac 1 2 x^\top H x, \qquad H := \nabla^2 f \text{ constant},\]
i.e. $f$ is a quadratic with constant Hessian $H$, is optimised with exact line-search, then the step length $\alpha^k$ chosen with search direction $s^k$ at each iteration is
\[\alpha^k = \frac{-\nabla f(x^k)^\top s^k}{(s^k)^\top H s^k}\]
Exact line search in direction $s^k$ picks
\[\alpha^k = \text{argmin} _ {\alpha > 0} f(x^k + \alpha s^k)\]Note we may write
\[f(x^k + \alpha s^k) = f(x^k) + \alpha \nabla f(x^k)^\top s^k + \frac{\alpha^2}{2} (s^k)^\top H s^k\](this expansion is exact because the Hessian $H$ is constant). Differentiating with respect to $\alpha$ and setting the derivative to zero, we obtain
\[\alpha = \frac{-\nabla f(x^k)^\top s^k}{(s^k)^\top H s^k}\]which is a minimiser provided $(s^k)^\top H s^k > 0$.
@exam~
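A quick numerical sanity check of the closed-form step (the quadratic $H$, $g$, the point $x$ and the direction $s$ below are illustrative choices): the derivative of $\phi(\alpha) = f(x + \alpha s)$ should vanish at the computed $\alpha$.

```python
import numpy as np

# Quadratic f(x) = gᵀx + ½ xᵀHx (take f(0) = 0; H, g, x, s are illustrative choices).
H = np.array([[3.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -2.0])
f = lambda x: g @ x + 0.5 * x @ H @ x
grad = lambda x: g + H @ x

x = np.array([2.0, 1.0])
s = -grad(x)                                  # e.g. the steepest descent direction
alpha = -(grad(x) @ s) / (s @ H @ s)          # closed-form exact line-search step

phi_prime = grad(x + alpha * s) @ s           # derivative of φ(α) = f(x + αs) at the computed α
print(alpha, phi_prime)                       # phi_prime ≈ 0, confirming α minimises along the line
```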
Suppose you’re considering a function which may be written exactly as
\[f(x) = f(0) + \nabla f(0)^\top x + \frac 1 2 x^\top H x, \qquad H := \nabla^2 f \text{ constant and positive definite},\]
and are asked to find a change of variables $x = A^{-1} y$ so that steepest descent with exact line search applied to the new function $g(y) = f(x(y))$ converges in one iteration.
How should you go about this, and what is one candidate $A$ you could pick?
The idea is to make the Hessian of the new function equal to the identity. If $y = Ax$ with $A$ invertible, then $g(y) = f(A^{-1}y)$ has gradient $\nabla g(y) = A^{-\top}\nabla f(A^{-1}y)$ and Hessian $A^{-\top} H A^{-1}$, so we want $A^{-\top} H A^{-1} = I$. In that case $\nabla g(y) = c + y$ for a constant vector $c$ (namely $c = A^{-\top}\nabla f(0)$), and steepest descent gives
\[y^{k+1} = y^k - \alpha \nabla g(y^k) = y^k - \alpha (c + y^k)\]Exact line search with an identity Hessian selects $\alpha = 1$, so $y^{k+1} = -c$, which is the minimiser since $\nabla g(-c) = 0$: it converges in one step. One choice is hence the symmetric matrix square root
\[A = (\nabla^2 f)^{1/2}\]since then $A^{-\top} H A^{-1} = H^{-1/2} H H^{-1/2} = I$.
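A sketch verifying the one-step convergence after the change of variables, using `scipy.linalg.sqrtm` for the matrix square root (the quadratic below is an illustrative choice):

```python
import numpy as np
from scipy.linalg import sqrtm

# Quadratic f(x) = gᵀx + ½ xᵀHx with constant positive-definite Hessian H (illustrative).
H = np.array([[3.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -2.0])
A = np.real(sqrtm(H))                         # A = H^{1/2} is symmetric, so the Hessian of g(y) = f(A⁻¹y) is I

grad_f = lambda x: g + H @ x
Ainv = np.linalg.inv(A)
grad_g = lambda y: Ainv.T @ grad_f(Ainv @ y)  # ∇g(y) = A⁻ᵀ ∇f(A⁻¹y)

y = A @ np.array([10.0, -7.0])                # arbitrary starting point, expressed in y coordinates
d = -grad_g(y)                                # steepest descent direction for g
alpha = -(grad_g(y) @ d) / (d @ d)            # exact line search; Hessian of g is I, so α = 1
y_next = y + alpha * d
print(alpha, np.linalg.norm(grad_g(y_next)))  # α ≈ 1 and ||∇g(y¹)|| ≈ 0: one-step convergence
```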