Notes - Optimisation for Data Science HT25, Steepest descent
Flashcards
Basic principles
What kind of optimisation problems do gradient methods solve, and how do they do it?
They solve unconstrained optimisation problems of the form
\[\min _ {x \in \mathbb R^n} f(x)\]where $f : \mathbb R^n \to \mathbb R$ is $L$-smooth. They iteratively find a solution via updates of the form
\[x^{k+1} = x^k + \alpha _ k d^k\]where $\alpha _ k > 0$ is called the step length and $d^k \in \mathbb R^n$ is called the search direction.
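Below is a minimal numerical sketch of this generic iteration (the names `grad_f`, `x0`, `step` are hypothetical placeholders, not from the notes), specialised to the steepest-descent direction:

```python
import numpy as np

def gradient_method(grad_f, x0, step, num_iters=100):
    """Generic gradient-method loop x^{k+1} = x^k + alpha_k d^k,
    specialised here to the steepest-descent direction d^k = -grad f(x^k)
    with a fixed step length alpha_k = step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        d = -grad_f(x)      # search direction
        x = x + step * d    # update with step length alpha_k
    return x
```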
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
@Define the updates made in steepest descent.
Steepest descent uses the search direction $d^k = -\nabla f(x^k)$ with the fixed step length $\alpha _ k = \frac 1 L$, giving updates
\[x^{k+1} = x^k - \frac 1 L \nabla f(x^k)\]
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
@Prove that the steepest descent update
\[x^{k+1} = x^k - \frac 1 L \nabla f(x^k)\]
is motivated by the minimisation of a particular upper bound.
Note that since $f$ is $L$-smooth, we have the bound
\[f(y) - f(x) - \nabla f(x)^\top (y - x) \le \frac L 2 ||y - x||^2\]Letting $y = x - \alpha \nabla f(x)$ and rearranging, we obtain:
\[f(x - \alpha \nabla f(x)) \le f(x) - \alpha ||\nabla f(x)||^2 + \frac{\alpha^2}{2} L ||\nabla f(x)||^2\]The right hand side is a convex function of $\alpha$ when $x$ is fixed, i.e.
\[\varphi(\alpha) := f(x) - \alpha ||\nabla f(x)||^2 + \frac{\alpha^2}{2} L ||\nabla f(x)||^2\]Therefore $\varphi$ achieves its global minimum when $\varphi'(\alpha) = (\alpha L - 1)||\nabla f(x)||^2 = 0$. Solving this gives that the optimum is
\[\alpha^\ast = \frac 1 L\]
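As a quick sanity check, minimising $\varphi$ numerically over a grid recovers $\alpha^\ast = 1/L$ (the constants `L`, `fx`, `g2` below are illustrative values of my own choosing):

```python
import numpy as np

# phi(alpha) = f(x) - alpha*g2 + (alpha^2/2)*L*g2, with g2 = ||grad f(x)||^2.
L, fx, g2 = 4.0, 10.0, 9.0
alphas = np.linspace(0.0, 1.0, 100_001)
phi = fx - alphas * g2 + 0.5 * L * alphas**2 * g2
print(alphas[np.argmin(phi)], 1.0 / L)  # 0.25 0.25
```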
Foundational inequality
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent iterations $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
In this context, @state and @prove the foundational inequality of steepest descent.
Starting iterations at some arbitrary point $x^0$ gives
\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]Proof: Since $f$ is $L$-smooth, we have the bound
\[f(y) - f(x) - \nabla f(x)^\top (y - x) \le \frac L 2 ||y - x||^2\]for all $x, y \in \mathbb R^n$. Letting $x = x^k$ and $y = x^k - \frac 1 L \nabla f(x^k)$ (i.e. taking the step length $\alpha = \frac 1 L$) and rearranging, we obtain:
\[f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]
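A small numerical check of the foundational inequality, sketched on a quadratic whose smoothness constant is the largest Hessian eigenvalue (the random test function is my own choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((5, 5))
A = R.T @ R                          # positive semidefinite Hessian
L = np.linalg.eigvalsh(A).max()      # smoothness constant of f
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(5)
for _ in range(10):
    x_new = x - grad(x) / L
    # foundational inequality: f(x^{k+1}) <= f(x^k) - ||grad f(x^k)||^2 / (2L)
    assert f(x_new) <= f(x) - grad(x) @ grad(x) / (2 * L) + 1e-12
    x = x_new
print("foundational inequality verified on this quadratic")
```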
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
@State the general results about the convergence of steepest descent given increasingly strict conditions on $f$.
- For general $f$, $(||\nabla f(x^k)||) _ {k \in \mathbb N} \to 0$ sublinearly with rate $O(1/\sqrt k)$.
- For convex $f$, $(f(x^k) - f(x^\ast)) _ {k \in \mathbb N} \to 0$ sublinearly with rate $O(1/k)$.
- For $\gamma$-strongly convex $f$, $(f(x^k) - f(x^\ast)) _ {k \in \mathbb N} \to 0$ $Q$-linearly with rate $(1 - \gamma/L)$.
Fixed-step convergence in general case
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
@State a result about the convergence of steepest descent in the general case.
If $f$ is bounded below by $C$, then steepest descent satisfies
\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]for all $T \ge 1$.
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
- $f$ is bounded below by $C$
@Prove that then
\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]
for all $T \ge 1$.
The foundational inequality of steepest descent says that:
\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]for all $k \ge 0$. Therefore
\[\begin{aligned} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2 &\le 2L \sum^{T-1} _ {k = 0} [f(x^k) - f(x^{k+1})] \\\\ &= 2L(f(x^0) - f(x^T)) \end{aligned}\]as the sum telescopes. Thus
\[\begin{aligned} \min _ {0 \le k \le T - 1} || \nabla f(x^k)|| &\le \sqrt{\frac{1}{T} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2} \\\\ &\le \sqrt{\frac{2L(f(x^0) - f(x^T))}{T} } \\\\ &\le \sqrt{\frac{2L(f(x^0) - C)}{T} } \end{aligned}\]as required.
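To see the bound in action, here is a sketch on a non-convex test function of my own choosing, $f(x) = \cos x$, which is $1$-smooth and bounded below by $C = -1$:

```python
import numpy as np

L, C = 1.0, -1.0
f = np.cos                     # 1-smooth, bounded below by -1
grad = lambda x: -np.sin(x)

x0 = x = 1.0
T, g_norms = 100, []
for _ in range(T):
    g_norms.append(abs(grad(x)))
    x -= grad(x) / L           # fixed-step steepest descent
bound = np.sqrt(2 * L * (f(x0) - C) / T)
print(min(g_norms) <= bound)   # True
```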
Fixed-step convergence in convex case
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
@State a result about the convergence of steepest descent in this $L$-smooth convex case.
Steepest descent satisfies
\[f(x^T) - f(x^\ast) \le \frac{L}{2T} ||x^0 - x^\ast||^2\]for all $T \ge 1$, where $x^\ast$ is any minimiser of $f$.
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates of the form $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
@Prove that then steepest descent satisfies
\[f(x^T) - f(x^\ast) \le \frac{L}{2T} ||x^0 - x^\ast||^2\]
Recall the foundational inequality of steepest gradient descent:
\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]By the convexity of $f$, $f(x^\ast) \ge f(x^k) + \nabla f(x^k)^\top (x^\ast - x^k)$, i.e. $f(x^k) \le f(x^\ast) + \nabla f(x^k)^\top (x^k - x^\ast)$. Substituting this into the foundational inequality yields
\[\begin{aligned} f(x^{k+1}) &\le f(x^\ast) + \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} ||\nabla f(x^k)||^2 \\\\ &= f(x^\ast) + \frac L 2 \left( ||x^k - x^\ast||^2 - \left|\left|x^k - x^\ast - \frac 1 L \nabla f(x^k)\right|\right|^2 \right) \\\\ &= f(x^\ast) + \frac L 2 \left( ||x^k - x^\ast||^2 - ||x^{k+1}-x^\ast||^2 \right) \end{aligned}\]Therefore
\[\begin{aligned} \sum^{T-1} _ {k = 0} \left( f(x^{k+1}) - f(x^\ast) \right) &\le \frac L 2 \sum^{T-1} _ {k = 0} \left( ||x^k - x^\ast||^2 - || x^{k+1} - x^\ast ||^2 \right) \\\\ &= \frac L 2 ||x^0 - x^\ast||^2 \end{aligned}\]and since the foundational inequality makes the iterates monotonically decreasing, $f(x^{k+1}) \ge f(x^T)$ for all $k \le T - 1$, so
\[\begin{aligned} f(x^T) - f(x^\ast) &\le \frac{1}{T} \sum^{T-1} _ {k = 0} \left( f(x^{k+1}) - f(x^\ast) \right) \\\\ &\le \frac{L}{2T} ||x^0 - x^\ast||^2 \end{aligned}\]
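A numerical sketch of this $O(1/T)$ bound on a convex but not strongly convex least-squares problem (the rank-deficient test problem is my own construction):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 6))           # wide matrix: Hessian B^T B is singular
b = rng.standard_normal(3)
L = np.linalg.eigvalsh(B.T @ B).max()
f = lambda x: 0.5 * np.sum((B @ x - b) ** 2)
grad = lambda x: B.T @ (B @ x - b)

x_star = np.linalg.pinv(B) @ b            # one minimiser (min-norm solution)
x0 = x = rng.standard_normal(6)
T = 200
for _ in range(T):
    x = x - grad(x) / L
print(f(x) - f(x_star) <= L / (2 * T) * np.sum((x0 - x_star) ** 2))  # True
```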
Fixed-step convergence in strongly convex case
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
- $f$ is $\gamma$-strongly convex
@State a result about the convergence of steepest descent in this $L$-smooth $\gamma$-strongly convex case.
There is a unique minimiser $x^\ast$ and
\[f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)(f(x^k) - f(x^\ast))\]for all $k$.
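A quick numerical check of this per-iteration contraction on a strongly convex quadratic, where $\gamma$ and $L$ are the extreme Hessian eigenvalues (the test problem is my own choice):

```python
import numpy as np

rng = np.random.default_rng(2)
Q = rng.standard_normal((4, 4))
A = Q.T @ Q + np.eye(4)                   # positive definite Hessian
eig = np.linalg.eigvalsh(A)
gamma, L = eig.min(), eig.max()           # strong convexity / smoothness constants
f = lambda x: 0.5 * x @ A @ x             # unique minimiser x* = 0, f(x*) = 0
grad = lambda x: A @ x

x = rng.standard_normal(4)
for _ in range(5):
    x_new = x - grad(x) / L
    assert f(x_new) <= (1 - gamma / L) * f(x) + 1e-12
    x = x_new
print("Q-linear contraction factor:", 1 - gamma / L)
```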
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
- $f$ is $\gamma$-strongly convex
@Prove that then there is a unique minimiser $x^\ast$ and
\[f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)(f(x^k) - f(x^\ast))\]
for all $k$.
The uniqueness of the minimiser follows from the $\gamma$-strong convexity of $f$. Furthermore, $\gamma$-strong convexity implies that for all $y \in \mathbb R^n$,
\[f(y) \ge f(x^k) + \nabla f(x^k)^\top (y - x^k) + \frac \gamma 2 ||y - x^k||^2\]Taking the minimum over $y$ on both sides,
\[\min _ y f(y) \ge \min _ y \left( f(x^k) + \nabla f(x^k)^\top(y-x^k) + \frac \gamma 2 ||y - x^k||^2 \right)\]The minimum on the right hand side occurs when
\[y^\ast = x^k - \frac 1 \gamma \nabla f(x^k)\]and the minimum on the left hand side occurs when $y = x^\ast$.
Substitution back gives
\[\begin{aligned} f(x^\ast) &\ge f(x^k) - \nabla f(x^k)^\top \left( \frac 1 \gamma \nabla f(x^k) \right) + \frac\gamma2 \left|\left| \frac 1 \gamma \nabla f(x^k) \right|\right|^2 \\\\ &= f(x^k) - \frac{1}{2\gamma} \left|\left| \nabla f(x^k) \right|\right|^2 \end{aligned}\]Therefore
\[\left|\left| \nabla f(x^k) \right|\right|^2 \ge 2\gamma (f(x^k) - f(x^\ast))\]Substituting this back into the foundational inequality,
\[f(x^{k+1}) - f(x^\ast) \le f(x^k) - f(x^\ast) - \frac\gamma L \left(f(x^k) - f(x^\ast)\right) = \left(1 - \frac \gamma L\right)\left(f(x^k) - f(x^\ast)\right)\]
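A standard consequence (not stated above, but a one-line unrolling of this recursion, using $(1 - \gamma/L)^k \le e^{-k\gamma/L}$): the error decays geometrically, so $\varepsilon$-accuracy needs at most on the order of $(L/\gamma)\log(1/\varepsilon)$ iterations:
\[f(x^k) - f(x^\ast) \le \left(1 - \frac \gamma L\right)^k \left(f(x^0) - f(x^\ast)\right) \le \varepsilon \quad \text{whenever} \quad k \ge \frac L \gamma \log \frac{f(x^0) - f(x^\ast)}{\varepsilon}\]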
Long-step methods
When solving unconstrained optimisation problems of the form
\[\min _ {x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ using steepest descent updates of the form
\[x^{k+1} = x^k + \alpha _ k d^k\]
what assumptions do we make on $f$, $d^k$ and $\alpha _ k$?
- $f$ is $L$-smooth but possibly non-convex
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- $\alpha _ k$ is neither too short nor too long, i.e. it satisfies the Wolfe conditions (see the sketch after this list):
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
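A step length satisfying both Wolfe conditions can be found by bisection. Below is a minimal sketch of my own (not the course's algorithm; library routines such as `scipy.optimize.line_search` are more careful):

```python
import numpy as np

def wolfe_step(f, grad, x, d, c1=1e-4, c2=0.9, alpha=1.0, max_iter=50):
    """Bisection search for a step length satisfying both Wolfe conditions."""
    lo, hi = 0.0, np.inf
    g0d = grad(x) @ d                 # directional derivative; < 0 for a descent d
    for _ in range(max_iter):
        if f(x + alpha * d) > f(x) + c1 * alpha * g0d:
            hi = alpha                # sufficient decrease fails: step too long
        elif grad(x + alpha * d) @ d < c2 * g0d:
            lo = alpha                # curvature condition fails: step too short
        else:
            return alpha              # both conditions hold
        alpha = (lo + hi) / 2 if np.isfinite(hi) else 2 * alpha
    return alpha                      # fall back to the last trial step
```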
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates of the form $x^{k+1} = x^k + \alpha _ k d^k$.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
In this context, @state a result concerning the convergence of long-step steepest descent in the general case, and contrast it with the corresponding result where the step size is fixed at $\frac 1 L$.
For long-step methods:
\[\min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]For fixed-step methods:
\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]So the $O(1/\sqrt T)$ rate is unchanged; only the constant changes.
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates of the form $x^{k+1} = x^k + \alpha _ k d^k$.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- (Assumption 1): The search directions $d^k$ lie within a bounded angle of the steepest descent direction: there exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- $\alpha _ k$ is neither too short nor too long: there exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Assumption 2): $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Assumption 3): $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
@Prove that then we have a reasonable bound on the convergence rate of steepest descent:
\[\min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]
Assumption 3 states that
\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]Subtracting $\nabla f(x^k)^\top d^k$ from both sides gives:
\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\le L \cdot \alpha _ k ||d^k||^2 \end{aligned}\]where the second inequality follows from the Cauchy–Schwarz inequality and $L$-smoothness.
Solving for $\alpha _ k$ implies that
\[\alpha _ k \ge -\frac{1-c _ 2}{L||d^k||^2} \cdot \nabla f(x^k) ^\top d^k\]and hence by assumption 1,
\[\alpha _ k \ge \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||}\]Then
\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||} \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \eta^2 ||\nabla f(x^k)||^2 && (4\star) \end{aligned}\]where $(1\star)$ is by definition, $(2\star)$ is by assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by assumption 1.
Now solving for $ \vert \vert \nabla f(x^k) \vert \vert ^2$ gives
\[||\nabla f(x^k)||^2 \le \frac{L}{c _ 1(1 - c _ 2)\eta^2}(f(x^k) - f(x^{k+1}))\]This implies that $f(x^{k+1}) \le f(x^k)$, so we have descent at each iteration.
Summing over $k$, we find
\[\begin{aligned} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2 &\le \frac{L}{c _ 1 (1 - c _ 2)\eta^2} \sum^{T-1} _ {k = 0}(f(x^k) - f(x^{k+1})) \\\\ &= \frac{L}{c _ 1 (1 - c _ 2)\eta^2} (f(x^0) - f(x^T)) \\\\ &\le \frac{L}{c _ 1(1 - c _ 2) \eta^2} (f(x^0) - C) \end{aligned}\]and therefore
\[\begin{aligned} \min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| &\le \sqrt{\frac 1 T \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2} \\\\ &\le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} } \end{aligned}\]
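Finally, a numerical sketch of the long-step bound, reusing the hypothetical `wolfe_step` search from earlier (with $d^k = -\nabla f(x^k)$, so $\eta = 1$; the test function is my own choice):

```python
import numpy as np

# Requires the wolfe_step function sketched earlier.
f = lambda x: np.cos(x).sum()            # 1-smooth, bounded below by C = -2
grad = lambda x: -np.sin(x)
L, C, c1, c2, eta = 1.0, -2.0, 1e-4, 0.9, 1.0

x0 = x = np.array([1.0, -0.5])
T, g_norms = 50, []
for _ in range(T):
    d = -grad(x)                         # steepest descent direction (eta = 1)
    g_norms.append(np.linalg.norm(d))
    x = x + wolfe_step(f, grad, x, d) * d
bound = np.sqrt(L * (f(x0) - C) / (eta**2 * c1 * (1 - c2) * T))
print(min(g_norms) <= bound)             # True
```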