Notes - Optimisation for Data Science HT25, Steepest descent
Flashcards
Basic principles
What kind of optimisation problems do gradient methods solve, and how do they do it?
They solve unconstrained optimisation problems of the form
\[\min _ {x \in \mathbb R^n} f(x)\]where $f : \mathbb R^n \to \mathbb R$ is $L$-smooth. They iteratively find a solution via updates of the form
\[x^{k+1} = x^k + \alpha _ k d^k\]where $\alpha _ k > 0$ is called the step length and $d^k \in \mathbb R^n$ is called the search direction.
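Below is a minimal numerical sketch of this generic iteration (the names `grad_f`, `x0`, `step` are hypothetical placeholders, not from the notes), specialised to the steepest-descent direction:

```python
import numpy as np

def gradient_method(grad_f, x0, step, num_iters=100):
    """Generic gradient-method loop x^{k+1} = x^k + alpha_k d^k,
    specialised here to the steepest-descent direction d^k = -grad f(x^k)
    with a fixed step length alpha_k = step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        d = -grad_f(x)      # search direction
        x = x + step * d    # update with step length alpha_k
    return x
```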
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
@Define the updates made in steepest descent.
Steepest descent uses the search direction $d^k = -\nabla f(x^k)$ with the fixed step length $\alpha _ k = \frac 1 L$, giving updates
\[x^{k+1} = x^k - \frac 1 L \nabla f(x^k)\]
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
@Prove that the steepest descent update
\[x^{k+1} = x^k - \frac 1 L \nabla f(x^k)\]
is motivated by the minimisation of a particular upper bound.
Note that since $f$ is $L$-smooth, we have the bound
\[f(y) - f(x) - \nabla f(x)^\top (y - x) \le \frac L 2 ||y - x||^2\]Letting $y = x - \alpha \nabla f(x)$ and rearranging, we obtain:
\[f(x - \alpha \nabla f(x)) \le f(x) - \alpha ||\nabla f(x)||^2 + \frac{\alpha^2}{2} L ||\nabla f(x)||^2\]The right hand side is a convex function of $\alpha$ when $x$ is fixed, i.e.
\[\varphi(\alpha) := f(x) - \alpha ||\nabla f(x)||^2 + \frac{\alpha^2}{2} L ||\nabla f(x)||^2\]Therefore $\varphi$ achieves its global minimum when $\varphi'(\alpha) = (\alpha L - 1)||\nabla f(x)||^2 = 0$. Solving this gives that the optimum is
\[\alpha^\ast = \frac 1 L\]
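As a quick sanity check, minimising $\varphi$ numerically over a grid recovers $\alpha^\ast = 1/L$ (the constants `L`, `fx`, `g2` below are illustrative values of my own choosing):

```python
import numpy as np

# phi(alpha) = f(x) - alpha*g2 + (alpha^2/2)*L*g2, with g2 = ||grad f(x)||^2.
L, fx, g2 = 4.0, 10.0, 9.0
alphas = np.linspace(0.0, 1.0, 100_001)
phi = fx - alphas * g2 + 0.5 * L * alphas**2 * g2
print(alphas[np.argmin(phi)], 1.0 / L)  # 0.25 0.25
```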
Foundational inequality
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent iterations $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
In this context, @state and @prove the foundational inequality of steepest descent.
Starting iterations at some arbitrary point $x^0$ gives
\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]Proof: Since $f$ is $L$-smooth, we have the bound
\[f(y) - f(x) - \nabla f(x)^\top (y - x) \le \frac L 2 ||y - x||^2\]for all $x, y \in \mathbb R^n$. Letting $x = x^k$ and $y = x^k - \frac 1 L \nabla f(x^k)$ (i.e. taking the step length $\alpha = \frac 1 L$) and rearranging, we obtain:
\[f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]
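A small numerical check of the foundational inequality, sketched on a quadratic whose smoothness constant is the largest Hessian eigenvalue (the random test function is my own choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((5, 5))
A = R.T @ R                          # positive semidefinite Hessian
L = np.linalg.eigvalsh(A).max()      # smoothness constant of f
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(5)
for _ in range(10):
    x_new = x - grad(x) / L
    # foundational inequality: f(x^{k+1}) <= f(x^k) - ||grad f(x^k)||^2 / (2L)
    assert f(x_new) <= f(x) - grad(x) @ grad(x) / (2 * L) + 1e-12
    x = x_new
print("foundational inequality verified on this quadratic")
```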
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
@State the general results about the convergence of steepest descent given increasingly strict conditions on $f$.
- For general $f$, $(||\nabla f(x^k)||) _ {k \in \mathbb N} \to 0$ sublinearly with rate $O(1/\sqrt k)$.
- For convex $f$, $(f(x^k) - f(x^\ast)) _ {k \in \mathbb N} \to 0$ sublinearly with rate $O(1/k)$.
- For $\gamma$-strongly convex $f$, $(f(x^k) - f(x^\ast)) _ {k \in \mathbb N} \to 0$ $Q$-linearly with rate $(1 - \gamma/L)$.
Fixed-step convergence in general case
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
@State a result about the convergence of steepest descent in the general case.
If $f$ is bounded below by $C$, then steepest descent satisfies
\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]for all $T \ge 1$.
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
- $f$ is bounded below by $C$
@Prove that then
\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]
for all $T \ge 1$.
The foundational inequality of steepest descent says that:
\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]for all $k \ge 0$. Therefore
\[\begin{aligned} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2 &\le 2L \sum^{T-1} _ {k = 0} [f(x^k) - f(x^{k+1})] \\\\ &= 2L(f(x^0) - f(x^T)) \end{aligned}\]as the sum telescopes. Thus
\[\begin{aligned} \min _ {0 \le k \le T - 1} || \nabla f(x^k)|| &\le \sqrt{\frac{1}{T} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2} \\\\ &\le \sqrt{\frac{2L(f(x^0) - f(x^T))}{T} } \\\\ &\le \sqrt{\frac{2L(f(x^0) - C)}{T} } \end{aligned}\]as required.
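To see the bound in action, here is a sketch on a non-convex test function of my own choosing, $f(x) = \cos x$, which is $1$-smooth and bounded below by $C = -1$:

```python
import numpy as np

L, C = 1.0, -1.0
f = np.cos                     # 1-smooth, bounded below by -1
grad = lambda x: -np.sin(x)

x0 = x = 1.0
T, g_norms = 100, []
for _ in range(T):
    g_norms.append(abs(grad(x)))
    x -= grad(x) / L           # fixed-step steepest descent
bound = np.sqrt(2 * L * (f(x0) - C) / T)
print(min(g_norms) <= bound)   # True
```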
Fixed-step convergence in convex case
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
@State a result about the convergence of steepest descent in this $L$-smooth convex case.
Steepest descent satisfies
\[f(x^T) - f(x^\ast) \le \frac{L}{2T} ||x^0 - x^\ast||^2\]for all $T \ge 1$, where $x^\ast$ is any minimiser of $f$.
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates of the form $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$
@Prove that then steepest descent satisfies
\[f(x^T) - f(x^\ast) \le \frac{L}{2T} ||x^0 - x^\ast||^2\]
Recall the foundational inequality of steepest gradient descent:
\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]By the convexity of $f$, $f(x^\ast) \ge f(x^k) + \nabla f(x^k)^\top (x^\ast - x^k)$, i.e. $f(x^k) \le f(x^\ast) + \nabla f(x^k)^\top (x^k - x^\ast)$. Substituting this into the foundational inequality yields
\[\begin{aligned} f(x^{k+1}) &\le f(x^\ast) + \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} ||\nabla f(x^k)||^2 \\\\ &= f(x^\ast) + \frac L 2 \left( ||x^k - x^\ast||^2 - \left|\left|x^k - x^\ast - \frac 1 L \nabla f(x^k)\right|\right|^2 \right) \\\\ &= f(x^\ast) + \frac L 2 \left( ||x^k - x^\ast||^2 - ||x^{k+1}-x^\ast||^2 \right) \end{aligned}\]Therefore
\[\begin{aligned} \sum^{T-1} _ {k = 0} \left( f(x^{k+1}) - f(x^\ast) \right) &\le \frac L 2 \sum^{T-1} _ {k = 0} \left( ||x^k - x^\ast||^2 - || x^{k+1} - x^\ast ||^2 \right) \\\\ &= \frac L 2 ||x^0 - x^\ast||^2 \end{aligned}\]and since the foundational inequality makes the iterates monotonically decreasing, $f(x^{k+1}) \ge f(x^T)$ for all $k \le T - 1$, so
\[\begin{aligned} f(x^T) - f(x^\ast) &\le \frac{1}{T} \sum^{T-1} _ {k = 0} \left( f(x^{k+1}) - f(x^\ast) \right) \\\\ &\le \frac{L}{2T} ||x^0 - x^\ast||^2 \end{aligned}\]
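A numerical sketch of this $O(1/T)$ bound on a convex but not strongly convex least-squares problem (the rank-deficient test problem is my own construction):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 6))           # wide matrix: Hessian B^T B is singular
b = rng.standard_normal(3)
L = np.linalg.eigvalsh(B.T @ B).max()
f = lambda x: 0.5 * np.sum((B @ x - b) ** 2)
grad = lambda x: B.T @ (B @ x - b)

x_star = np.linalg.pinv(B) @ b            # one minimiser (min-norm solution)
x0 = x = rng.standard_normal(6)
T = 200
for _ in range(T):
    x = x - grad(x) / L
print(f(x) - f(x_star) <= L / (2 * T) * np.sum((x0 - x_star) ** 2))  # True
```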
Fixed-step convergence in strongly convex case
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
- $f$ is $\gamma$-strongly convex
@State a result about the convergence of steepest descent in this $L$-smooth $\gamma$-strongly convex case.
There is a unique minimiser $x^\ast$ and
\[f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)(f(x^k) - f(x^\ast))\]for all $k$.
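A quick numerical check of this per-iteration contraction on a strongly convex quadratic, where $\gamma$ and $L$ are the extreme Hessian eigenvalues (the test problem is my own choice):

```python
import numpy as np

rng = np.random.default_rng(2)
Q = rng.standard_normal((4, 4))
A = Q.T @ Q + np.eye(4)                   # positive definite Hessian
eig = np.linalg.eigvalsh(A)
gamma, L = eig.min(), eig.max()           # strong convexity / smoothness constants
f = lambda x: 0.5 * x @ A @ x             # unique minimiser x* = 0, f(x*) = 0
grad = lambda x: A @ x

x = rng.standard_normal(4)
for _ in range(5):
    x_new = x - grad(x) / L
    assert f(x_new) <= (1 - gamma / L) * f(x) + 1e-12
    x = x_new
print("Q-linear contraction factor:", 1 - gamma / L)
```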
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
- The step size $\alpha _ k$ is fixed at $1/L$.
- $f$ is $\gamma$-strongly convex
@Prove that then there is a unique minimiser $x^\ast$ and
\[f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)(f(x^k) - f(x^\ast))\]
for all $k$.
The uniqueness of the minimiser follows from the $\gamma$-strong convexity of $f$. Furthermore, $\gamma$-strong convexity implies that for all $y \in \mathbb R^n$,
\[f(y) \ge f(x^k) + \nabla f(x^k)^\top (y - x^k) + \frac \gamma 2 ||y - x^k||^2\]Taking the minimum over $y$ on both sides,
\[\min _ y f(y) \ge \min _ y \left( f(x^k) + \nabla f(x^k)^\top(y-x^k) + \frac \gamma 2 ||y - x^k||^2 \right)\]The minimum on the right hand side occurs when
\[y^\ast = x^k - \frac 1 \gamma \nabla f(x^k)\]and the minimum on the left hand side occurs when $y = x^\ast$.
Substitution back gives
\[\begin{aligned} f(x^\ast) &\ge f(x^k) - \nabla f(x^k)^\top \left( \frac 1 \gamma \nabla f(x^k) \right) + \frac\gamma2 \left|\left| \frac 1 \gamma \nabla f(x^k) \right|\right|^2 \\\\ &= f(x^k) - \frac{1}{2\gamma} \left|\left| \nabla f(x^k) \right|\right|^2 \end{aligned}\]Therefore
\[\left|\left| \nabla f(x^k) \right|\right|^2 \ge 2\gamma (f(x^k) - f(x^\ast))\]Substituting this back into the foundational inequality,
\[f(x^{k+1}) - f(x^\ast) \le f(x^k) - f(x^\ast) - \frac\gamma L \left(f(x^k) - f(x^\ast)\right) = \left(1 - \frac \gamma L\right)\left(f(x^k) - f(x^\ast)\right)\]
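A standard consequence (not stated above, but a one-line unrolling of this recursion, using $(1 - \gamma/L)^k \le e^{-k\gamma/L}$): the error decays geometrically, so $\varepsilon$-accuracy needs at most on the order of $(L/\gamma)\log(1/\varepsilon)$ iterations:
\[f(x^k) - f(x^\ast) \le \left(1 - \frac \gamma L\right)^k \left(f(x^0) - f(x^\ast)\right) \le \varepsilon \quad \text{whenever} \quad k \ge \frac L \gamma \log \frac{f(x^0) - f(x^\ast)}{\varepsilon}\]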
Long-step methods
When solving unconstrained optimisation problems of the form
\[\min _ {x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ using steepest descent updates of the form
\[x^{k+1} = x^k + \alpha _ k d^k\]
what assumptions do we make on $f$, $d^k$ and $\alpha _ k$?
- $f$ is $L$-smooth but possibly non-convex
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- $\alpha _ k$ is neither too short nor too long, i.e. it satisfies the Wolfe conditions (see the sketch after this list):
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
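A step length satisfying both Wolfe conditions can be found by bisection. Below is a minimal sketch of my own (not the course's algorithm; library routines such as `scipy.optimize.line_search` are more careful):

```python
import numpy as np

def wolfe_step(f, grad, x, d, c1=1e-4, c2=0.9, alpha=1.0, max_iter=50):
    """Bisection search for a step length satisfying both Wolfe conditions."""
    lo, hi = 0.0, np.inf
    g0d = grad(x) @ d                 # directional derivative; < 0 for a descent d
    for _ in range(max_iter):
        if f(x + alpha * d) > f(x) + c1 * alpha * g0d:
            hi = alpha                # sufficient decrease fails: step too long
        elif grad(x + alpha * d) @ d < c2 * g0d:
            lo = alpha                # curvature condition fails: step too short
        else:
            return alpha              # both conditions hold
        alpha = (lo + hi) / 2 if np.isfinite(hi) else 2 * alpha
    return alpha                      # fall back to the last trial step
```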
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates of the form $x^{k+1} = x^k + \alpha _ k d^k$.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- The search directions $d^k$ lie within a bounded angle of the steepest descent direction:
- There exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
In this context, @state a result concerning the convergence of long-step steepest descent in the general case, and contrast it with the corresponding result where the step size is fixed at $\frac 1 L$.
For long-step methods:
\[\min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]For fixed-step methods:
\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]So the $O(1/\sqrt T)$ rate is unchanged; only the constant changes.
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth but possibly non-convex
- We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates of the form $x^{k+1} = x^k + \alpha _ k d^k$.
- $f$ is bounded below by some $C$ for all $x \in \mathbb R^n$.
- (Assumption 1): The search directions $d^k$ lie within a bounded angle of the steepest descent direction: there exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$, $\nabla f(x^k)^\top d^k \le -\eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $.
- $\alpha _ k$ is neither too short nor too long: there exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Assumption 2): $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Assumption 3): $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
@Prove that then we have a reasonable bound on the convergence rate of steepest descent:
\[\min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} }\]
Assumption 3 states that
\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]Subtracting $\nabla f(x^k)^\top d^k$ from both sides gives:
\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\le L \cdot \alpha _ k ||d^k||^2 \end{aligned}\]where the second inequality follows from the Cauchy–Schwarz inequality and $L$-smoothness.
Solving for $\alpha _ k$ implies that
\[\alpha _ k \ge -\frac{1-c _ 2}{L||d^k||^2} \cdot \nabla f(x^k) ^\top d^k\]and hence by assumption 1,
\[\alpha _ k \ge \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||}\]Then
\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||} \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \eta^2 ||\nabla f(x^k)||^2 && (4\star) \end{aligned}\]where $(1\star)$ is by definition, $(2\star)$ is by assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by assumption 1.
Now solving for $ \vert \vert \nabla f(x^k) \vert \vert ^2$ gives
\[||\nabla f(x^k)||^2 \le \frac{L}{c _ 1(1 - c _ 2)\eta^2}(f(x^k) - f(x^{k+1}))\]This implies that $f(x^{k+1}) \le f(x^k)$, so we have descent at each iteration.
Summing over $k$, we find
\[\begin{aligned} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2 &\le \frac{L}{c _ 1 (1 - c _ 2)\eta^2} \sum^{T-1} _ {k = 0}(f(x^k) - f(x^{k+1})) \\\\ &= \frac{L}{c _ 1 (1 - c _ 2)\eta^2} (f(x^0) - f(x^T)) \\\\ &\le \frac{L}{c _ 1(1 - c _ 2) \eta^2} (f(x^0) - C) \end{aligned}\]and therefore
\[\begin{aligned} \min _ {0 \le k \le T - 1} ||\nabla f(x^k)|| &\le \sqrt{\frac 1 T \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2} \\\\ &\le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1 (1 - c _ 2) T} } \end{aligned}\]
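Finally, a numerical sketch of the long-step bound, reusing the hypothetical `wolfe_step` search from earlier (with $d^k = -\nabla f(x^k)$, so $\eta = 1$; the test function is my own choice):

```python
import numpy as np

# Requires the wolfe_step function sketched earlier.
f = lambda x: np.cos(x).sum()            # 1-smooth, bounded below by C = -2
grad = lambda x: -np.sin(x)
L, C, c1, c2, eta = 1.0, -2.0, 1e-4, 0.9, 1.0

x0 = x = np.array([1.0, -0.5])
T, g_norms = 50, []
for _ in range(T):
    d = -grad(x)                         # steepest descent direction (eta = 1)
    g_norms.append(np.linalg.norm(d))
    x = x + wolfe_step(f, grad, x, d) * d
bound = np.sqrt(L * (f(x0) - C) / (eta**2 * c1 * (1 - c2) * T))
print(min(g_norms) <= bound)             # True
```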