# Notes - Optimisation for Data Science HT25, Steepest descent with inexact line search

> Source: https://ollybritton.com/notes/uni/part-b/ht25/optimisation-for-data-science/notes/steepest-descent-with-inexact-line-search/ · Updated: 2026-01-26 · Tags: uni, notes

- [Course - Optimisation for Data Science HT25](https://ollybritton.com/notes/uni/part-b/ht25/optimisation-for-data-science/)
	- [Notes - Optimisation for Data Science HT25, Steepest descent](https://ollybritton.com/notes/uni/part-b/ht25/optimisation-for-data-science/notes/steepest-descent/)
	- [Notes - Optimisation for Data Science HT25, Coordinate descent](https://ollybritton.com/notes/uni/part-b/ht25/optimisation-for-data-science/notes/coordinate-descent/)

### Flashcards
#### Assumptions
When solving unconstrained optimisation problems of the form
$$
\min _ {x \in \mathbb R^n} f(x)
$$
where $f : \mathbb R^n \to \mathbb R$ using updates of the form
$$
x^{k+1} = x^k + \alpha _ k d^k
$$
what assumptions do we make on $f$, $d^k$ and $\alpha_k$?::

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

Can you give intuitive justifications for why we have the three assumptions?::

**Assumption 1**: We may rewrite this condition to say that
$$
\cos(\angle(-\nabla f(x^k), d^k)) = \frac{(-\nabla f(x^k))^\top d^k}{||\nabla f(x^k)|| \cdot ||d^k||} \ge \eta
$$
This says that the directions don't become nearly orthogonal to the steepest descent direction.

**Assumption 2**: This is just an "upgraded" version of the condition that $f(x^{k+1}) < f(x^k)$, with an additional term proportional to the step length and directional derivative to ensure that there is sufficient decrease.

**Assumption 3**: This prevents Assumption 2 being satisfied by just taking very small values of $\alpha_k$.

#### $L$-smooth general case
##### Overestimation property
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, @state an overestimation property on the iterates.::

$$
f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L}  ||\nabla f(x^k)||^2
$$

@important~ (overestimation property)

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

@Prove that in this case we have the following overestimation property on the iterates:
$$
f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L}  ||\nabla f(x^k)||^2
$$
::

**Lower bound step size**: By Assumption 3, we have that
$$
\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k
$$
By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:
$$
\begin{aligned}
-(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\
&\stackrel{\text{C.S.} } \le  ||\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)|| \cdot ||d^k|| \\\\
&\stackrel{\text{L.S.} } \le L \cdot \alpha _ k ||d^k||^2
\end{aligned}
$$
Solving for $\alpha _ k$ implies that
$$
\alpha _ k \ge -\frac{1-c _ 2}{L||d^k||^2} \cdot \nabla f(x^k) ^\top d^k
$$
and hence by Assumption 1,
$$
\alpha _ k \ge \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||}
$$

**Use lower bound on step size to bound next iterate**: Then
$$
\begin{aligned}
f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\
&\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\
&\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||} \nabla f(x^k)^\top d^k &&(3\star) \\\\
&\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} ||\nabla f(x^k)||^2 && (4\star)
\end{aligned}
$$
where:

- $(1\star)$ is by definition
- $(2\star)$ is by Assumption 2
- $(3\star)$ is by the previous result
	- This is a bit confusing since the bound on $\alpha_k$ is a lower bound, but note that $\nabla f(x^k)^\top d^k$ is always negative by Assumption 1.
- $(4\star)$ is by Assumption 1

@important~ (overestimation property)

##### Convergence results
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, @state a result concerning the convergence of long-step steepest descent in the general case, and contrast it with results on the convergence of steepest descent where the step size is fixed at $\frac 1 L$.::

For long-step methods:
$$
\min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }
$$

For fixed-step methods:
$$
\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }
$$

So only the constant changes.

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

@Prove that then we have a reasonable bound on the convergence rate of steepest descent:
$$
\min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }
$$
::

**Bound step size**: By Assumption 3, we have that
$$
\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k
$$
By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:
$$
\begin{aligned}
-(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\
&\stackrel{\text{C.S.} } \le  ||\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)|| \cdot ||d^k|| \\\\
&\stackrel{\text{L.S.} } \le L \cdot \alpha _ k ||d^k||^2
\end{aligned}
$$
Solving for $\alpha _ k$ implies that
$$
\alpha _ k \ge -\frac{1-c _ 2}{L||d^k||^2} \cdot \nabla f(x^k) ^\top d^k
$$
and hence by Assumption 1,
$$
\alpha _ k \ge \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||}
$$

**Use bound on step size to bound derivative**: Then
$$
\begin{aligned}
f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\
&\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\
&\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||} \nabla f(x^k)^\top d^k &&(3\star) \\\\
&\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} ||\nabla f(x^k)||^2 && (4\star)
\end{aligned}
$$
where $(1\star)$ is by definition, $(2\star)$ is by Assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by Assumption 1.

Now solving for $||\nabla f(x^k)||^2$ gives
$$
||\nabla f(x^k)||^2 \le \frac{L}{c _ 1(1 - c _ 2)\eta^2}(f(x^k) - f(x^{k+1}))
$$

**Telescoping sum:** Summing over $k$, it follows:
$$
\begin{aligned}
\min_{0 \le k\le T-1} ||\nabla f(x^k)||^2 &\le \frac 1 T\sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2  \\\\
&\le \frac{L}{c _ 1 (1 - c _ 2)\eta^2 T} \sum^{T-1} _ {k = 0}(f(x^k) - f(x^{k+1})) \\\\
&= \frac{L}{c _ 1 (1 - c _ 2)\eta^2 T} (f(x^0) - f(x^T)) \\\\
&\le \frac{L}{c _ 1(1 - c _ 2) \eta^2 T} (f(x^0) - C)
\end{aligned}
$$
and therefore:
$$
\begin{aligned}
\min _ {0 \le k \le T} ||\nabla f(x^k)||
&\le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }
\end{aligned}
$$

---

**Remarks**: When using the fixed step size of $\alpha_k = \frac 1 L$, we have the "foundational inequality"
$$
f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2
$$
When we bound the derivative above, we prove that
$$
f(x^k + \alpha_k d^k)\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L}  ||\nabla f(x^k)||^2
$$
which is like the foundational inequality.

#### With weaker assumptions on the descent direction
There is a lot of overlap here with [Notes - Optimisation for Data Science HT25, Coordinate descent](https://ollybritton.com/notes/uni/part-b/ht25/optimisation-for-data-science/notes/coordinate-descent/), since the analysis there applies these results with $d^k$ as the coordinate gradient directions.

##### Overestimation property
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, we have the overestimation property that
$$
f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L}  ||\nabla f(x^k)||^2
$$
**However**, sometimes it's not possible to guarantee Assumption 1 such as when considering cyclic coordinate descent. @State another overestimation property that doesn't require Assumption 1 but is less useful.::

$$
f(x^{k+1})
\le
f(x^k) - \frac{c  _  1 (1 - c  _  2)}{L} \cos^2 \theta_k ||\nabla f(x^k)||^2
$$
where $\cos \theta_k$ is the cosine of the angle between the chosen descent direction $d^k$ and the steepest descent direction $\nabla f(x^k)$:
$$
\cos \theta_k = \frac{-\nabla f(x^k)^\top d^k}{||\nabla f(x^k)|| \cdot ||d^k||}
$$

@important~ (overestimation property)

Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

In this context, we have the overestimation property that
$$
f(x^{k+1})\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L}  ||\nabla f(x^k)||^2
$$
**However**, sometimes it's not possible to guarantee Assumption 1 such as when considering cyclic coordinate descent.

@Prove that without Assumption 1, we can still guarantee
$$
f(x^{k+1})
\le
f(x^k) - \frac{c  _  1 (1 - c  _  2)}{L} \cos^2 \theta_k ||\nabla f(x^k)||^2
$$
where $\cos \theta_k$ is the cosine of the angle between the chosen descent direction $d^k$ and the steepest descent direction $\nabla f(x^k)$:
$$
\cos \theta_k = \frac{-\nabla f(x^k)^\top d^k}{||\nabla f(x^k)|| \cdot ||d^k||}
$$

@important~ (overestimation property)

::

Let $\eta_k := \cos \theta_k$.

**Bound step size**: By the curvature condition, we have that
$$
\nabla f(x^k + \alpha  _  k d^k)^\top d^k \ge c  _  2 \nabla f(x^k)^\top d^k
$$
By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:
$$
\begin{aligned}
-(1-c  _  2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha  _  k d^k) - \nabla f(x^k)\right)^\top d^k \\\\
&\stackrel{\text{C.S.} } \le  ||\nabla f(x^k + \alpha  _  k d^k) - \nabla f(x^k)|| \cdot ||d^k|| \\\\
&\stackrel{\text{L.S.} } \le L \cdot \alpha  _  k ||d^k||^2
\end{aligned}
$$
Solving for $\alpha  _  k$ implies that
$$
\alpha  _  k \ge -\frac{1-c  _  2}{L||d^k||^2} \cdot \nabla f(x^k) ^\top d^k
$$
and hence by Assumption 1,
$$
\alpha  _  k \ge \frac{\eta _ k(1 - c  _  2)||\nabla f(x^k)||}{L||d^k||}
$$

**Use bound on step size to bound derivative**: Then
$$
\begin{aligned}
f(x^{k+1}) &= f(x^k + \alpha  _  k d^k) && (1\star) \\\\
&\le f(x^k) + c  _  1 \alpha  _  k \nabla f(x^k)^\top d^k && (2\star) \\\\
&\le f(x^k) + c  _  1 \frac{\eta _ k(1 - c  _  2)||\nabla f(x^k)||}{L||d^k||} \nabla f(x^k)^\top d^k &&(3\star) \\\\
&\le f(x^k) - \frac{c  _  1 (1 - c  _  2) \eta _ k^2}{L} ||\nabla f(x^k)||^2 && (4\star)
\end{aligned}
$$
where $(1\star)$ is by definition, $(2\star)$ is by Assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by Assumption 1.

@important~ (overestimation property)

##### Convergence results
Suppose we wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using updates of the form $x^{k+1} = x^k + \alpha _ k d^k$, where:

- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- Assumption 1: The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
	- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||$.
- (Wolfe conditions) $\alpha_k$ is neither too short nor too long:
	- There exists constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
	- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
	- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

**However**, sometimes it's not possible to guarantee Assumption 1 such as when considering cyclic coordinate descent. @State a result about what you can still guarantee about the iterates without Assumption 1.::

The sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies:
$$
\sum^\infty _ {k = 0} \big(\cos^2 \theta _ k \times ||\nabla f(x^k)||^2\big) < \infty
$$
and hence either $\theta _ k \to \frac\pi 2$ or $\nabla f(x^k) \to 0$ where
$$
\theta _ k = \angle(-\nabla f(x^k), d^k)
$$

### Exact line search
Not in the lectures or slides, but has come up in problem sheets and past papers.

@Define what is meant by exact line-search and @prove that if
$$
f(x) = f(0) + \nabla f(x)^\top x + \frac 1 2 x^\top \nabla^2 f(x) x
$$
is optimised with exact line-search, then the step length $\alpha^k$ chosen with search direction $s^k$ at each iteration is
$$
\alpha^k = \frac{-\nabla f(x^k)^\top s^k}{(s^k)^\top \nabla^2 f(x) s^k}
$$
::

Exact line search in direction $s^k$ picks
$$
\alpha^k = \text{argmin}_{\alpha > 0} f(x^k + \alpha s^k)
$$

---

Note we may write
$$
f(x^k + \alpha s^k) = f(x^k) + \alpha \nabla f(x^k)^\top s^k +  \frac{\alpha^2}{2} (s^k)^\top \nabla^2 f(x) s^k 
$$
Differentiating and setting to zero, we obtain
$$
\alpha = \frac{-\nabla f(x^k)^\top s^k}{(s^k)^\top H s^k}
$$

@exam~

Suppose you're considering a function which may be written exactly as
$$
f(x) = f(0) + \nabla f(x)^\top x + \frac 1 2 x^\top \nabla^2 f(x) x
$$
and are asked to find a change of variables $x = A^{-1} y$ so that steepest descent with exact line search applied to the new function $g(y) = f(x(y))$ converges in one iteration.

How should you go about this, and what is one candidate $A$ you could pick?::

The idea is to make the Hessian of the new function equal to the identity. Then
$$
y^{k+1} = y^k - \alpha \nabla g(y^k) = y^k - \alpha (c - y^k)
$$
so that taking $\alpha = 1$ gives $y^{k+1} = -c$ and it converges in one step. One choice is hence
$$
A = \sqrt{\nabla^2 f(x)}
$$

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt