Optimisation for Data Science HT25, Steepest descent



Flashcards

Basic principles

What kind of optimisation problems do gradient methods solve, and how do they do it?


They solve unconstrained optimisation problems of the form

\[\min _ {x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is $L$-smooth. They iteratively find a solution via updates of the form

\[x^{k+1} = x^k + \alpha _ k d^k\]

where $\alpha _ k > 0$ is called the step length and $d^k \in \mathbb R^n$ is called the search direction.
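
A minimal sketch of this template in Python (the names `gradient_method`, `direction` and `step_length` are illustrative placeholders, not from the notes):

```python
import numpy as np

# A generic gradient-method loop implementing x^{k+1} = x^k + alpha_k d^k.
# `direction` and `step_length` are placeholder callables: each concrete
# method supplies its own rules for d^k and alpha_k.
def gradient_method(x0, direction, step_length, num_iters=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        d = direction(x)           # search direction d^k
        alpha = step_length(x, d)  # step length alpha_k > 0
        x = x + alpha * d          # update x^{k+1} = x^k + alpha_k d^k
    return x
```

Different choices of direction and step-length rule give different methods; steepest descent, defined next, fixes both.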

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$

@Define the updates made in steepest descent.


\[x^{k+1} = x^k - \frac 1 L \nabla f(x^k)\]
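
As a concrete sketch (the quadratic test problem below is an illustrative assumption, not from the notes), steepest descent on $f(x) = \frac 1 2 x^\top A x - b^\top x$, which is $L$-smooth with $L = \lambda_{\max}(A)$:

```python
import numpy as np

# Steepest descent with fixed step 1/L on the quadratic
# f(x) = 0.5 x^T A x - b^T x, which is L-smooth with L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # illustrative positive definite matrix
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()         # smoothness constant

x = np.zeros(2)
for k in range(200):
    grad = A @ x - b   # gradient of f at x^k
    x = x - grad / L   # x^{k+1} = x^k - (1/L) grad f(x^k)

print(np.allclose(x, np.linalg.solve(A, b)))  # True: x approaches the minimiser
```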

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$

@Prove that the steepest descent update

\[x^{k+1} = x^k - \frac 1 L \nabla f(x^k)\]

is motivated by the minimisation of a particular upper bound.


Note that since $f$ is $L$-smooth, we have the bound

\[f(y) - f(x) - \nabla f(x)^\top (y - x) \le \frac L 2 ||y - x||^2\]

Letting $y = x - \alpha \nabla f(x)$ and rearranging, we obtain:

\[f(x - \alpha \nabla f(x)) \le f(x) - \alpha ||\nabla f(x)||^2 + \frac{\alpha^2}{2} L ||\nabla f(x)||^2\]

The right-hand side is a convex quadratic in $\alpha$ for fixed $x$:

\[\varphi(\alpha) := f(x) - \alpha ||\nabla f(x)||^2 + \frac{\alpha^2}{2} L ||\nabla f(x)||^2\]

Therefore $\varphi$ achieves its global minimum where $\varphi'(\alpha) = (\alpha L - 1)||\nabla f(x)||^2 = 0$, which gives the optimal step length

\[\alpha^\ast = \frac 1 L\]

Substituting $\alpha = 1/L$ back into the bound gives the guaranteed decrease $f\left(x - \frac 1 L \nabla f(x)\right) \le f(x) - \frac{1}{2L} ||\nabla f(x)||^2$, i.e. the foundational inequality below.
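
A quick numerical sanity check of this optimum (a sketch; the values of $L$, $||\nabla f(x)||^2$ and $f(x)$ below are arbitrary illustrative constants):

```python
import numpy as np

# phi(alpha) = f(x) - alpha*||g||^2 + (alpha^2/2)*L*||g||^2 is minimised
# at alpha = 1/L. All constants are arbitrary, for illustration only.
L, g_sq, f_x = 4.0, 2.5, 1.0           # L, ||grad f(x)||^2, f(x)
alphas = np.linspace(0.0, 1.0, 100_001)
phi = f_x - alphas * g_sq + 0.5 * L * alphas**2 * g_sq
print(alphas[np.argmin(phi)], 1 / L)   # both print 0.25
```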

Foundational inequality

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent iterations $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$

@State the foundational inequality of steepest descent.


\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]

@important~ (overestimation property)
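
A numerical spot-check of the inequality (a sketch; the quadratic and the sampled points are illustrative assumptions):

```python
import numpy as np

# Verify f(x - g/L) <= f(x) - ||g||^2 / (2L) at random points for an
# L-smooth quadratic. The small tolerance absorbs floating-point error.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x - b @ x

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=2)
    g = A @ x - b
    assert f(x - g / L) <= f(x) - (g @ g) / (2 * L) + 1e-12
print("foundational inequality holds at all sampled points")
```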

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent iterations $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$

@Prove the foundational inequality of steepest descent, i.e. that starting iterations at some arbitrary point $x^0$ gives

\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]

Since $f$ is $L$-smooth, we have the bound

\[f(y) - f(x) - \nabla f(x)^\top (y - x) \le \frac L 2 ||y - x||^2\]

for all $x$ and $y$. Letting $x = x^k$ and $y = x^k - \frac 1 L \nabla f(x^k)$ and rearranging, we obtain:

\[f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]

@important~ (an overestimation property)

Summary of results

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$

@State the general results about the convergence of steepest descent given increasingly strict conditions on $f$.


  • For general $f$ bounded below, the best gradient norm $\min _ {0 \le j \le k} ||\nabla f(x^j)|| \to 0$ sublinearly with rate $O(1/\sqrt k)$.
  • For convex $f$, $(f(x^k) - f(x^\ast)) _ {k \in \mathbb N} \to 0$ sublinearly with rate $O(1/k)$.
  • For $\gamma$-strongly convex $f$, $(f(x^k) - f(x^\ast)) _ {k \in \mathbb N} \to 0$ $Q$-linearly with rate $(1 - \gamma/L)$.

In particular, for general $f$,

\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]

where $C$ is a lower bound on $f$. For convex $f$,

\[f(x^T) - f(x^\ast) \le \frac{L}{2T} ||x^0 - x^\ast||^2\]

For $\gamma$-strongly convex $f$,

\[f(x^{k}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)^k(f(x^0) - f(x^\ast))\]

Fixed-step convergence in general case

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$

@State a result about the convergence of steepest descent in the general case.


If $f$ is bounded below by $C$, then steepest descent satisfies

\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]

for all $T \ge 1$.
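
A numerical check of this bound on a nonconvex example (a sketch; the choice $f(x) = \sum_i \cos(x_i)$, which is $1$-smooth and bounded below by $C = -n$, and the starting point are illustrative assumptions):

```python
import numpy as np

# f(x) = sum(cos(x_i)) has Hessian eigenvalues in [-1, 1], so it is
# 1-smooth, and it is bounded below by C = -n.
n, L, C = 5, 1.0, -5.0
f = lambda x: np.cos(x).sum()
grad = lambda x: -np.sin(x)

x = np.full(n, 0.3)                      # arbitrary starting point x^0
f0, best = f(x), np.inf
for T in range(1, 501):
    g = grad(x)
    best = min(best, np.linalg.norm(g))  # min over iterates x^0, ..., x^{T-1}
    assert best <= np.sqrt(2 * L * (f0 - C) / T)
    x = x - g / L                        # steepest-descent step
print("gradient-norm bound holds for every T checked")
```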

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
  • The step size $\alpha _ k$ is fixed at $1/L$.
  • $f$ is bounded below by $C$

@Prove that then

\[\min _ {0 \le k \le T - 1} || \nabla f(x^k)|| \le \sqrt{\frac{2L(f(x^0) - C)}{T} }\]

for all $T \ge 1$.


The foundational inequality of steepest descent says that:

\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]

for all $k \ge 0$. Therefore

\[\begin{aligned} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2 &\le 2L \sum^{T-1} _ {k = 0} [f(x^k) - f(x^{k+1})] \\\\ &= 2L(f(x^0) - f(x^T)) \end{aligned}\]

as the sum telescopes. Thus

\[\begin{aligned} \min _ {0 \le k \le T - 1} || \nabla f(x^k)|| &\le \sqrt{\frac{1}{T} \sum^{T-1} _ {k = 0} ||\nabla f(x^k)||^2} \\\\ &\le \sqrt{\frac{2L(f(x^0) - f(x^T))}{T} } \\\\ &\le \sqrt{\frac{2L(f(x^0) - C)}{T} } \end{aligned}\]

as required.

Fixed-step convergence in convex case

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
  • The step size $\alpha _ k$ is fixed at $1/L$.

@State a result about the convergence of steepest descent in this $L$-smooth convex case.


Steepest descent satisfies

\[f(x^T) - f(x^\ast) \le \frac{L}{2T} ||x^0 - x^\ast||^2\]
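
A numerical check (a sketch; the quadratic test problem is an illustrative assumption, and being strongly convex it is in particular convex):

```python
import numpy as np

# Check f(x^T) - f(x*) <= L/(2T) * ||x^0 - x*||^2 along the iterates.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)   # the minimiser
x0 = np.array([5.0, -3.0])       # arbitrary starting point
r0_sq = np.sum((x0 - x_star) ** 2)

x = x0.copy()
for T in range(1, 201):
    x = x - (A @ x - b) / L      # x is now the iterate x^T
    assert f(x) - f(x_star) <= L / (2 * T) * r0_sq + 1e-12
print("O(1/T) suboptimality bound holds for every T checked")
```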

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ using steepest descent updates of the form $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$

@Prove that then steepest descent satisfies

\[f(x^T) - f(x^\ast) \le \frac{L}{2T} ||x^0 - x^\ast||^2\]

Recall the foundational inequality of steepest descent:

\[f(x^{k+1}) = f\left(x^k - \frac 1 L \nabla f(x^k)\right) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]

By the convexity of $f$, $f(x^\ast) \ge f(x^k) + \nabla f(x^k)^\top (x^\ast - x^k)$, i.e. $f(x^k) \le f(x^\ast) + \nabla f(x^k)^\top (x^k - x^\ast)$. Substituting this into the foundational inequality yields

\[\begin{aligned} f(x^{k+1}) &\le f(x^\ast) + \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} ||\nabla f(x^k)||^2 \\\\ &= f(x^\ast) + \frac L 2 \left( ||x^k - x^\ast||^2 - \left|\left|x^k - x^\ast - \frac 1 L \nabla f(x^k)\right|\right|^2 \right) &&(\ast) \\\\ &= f(x^\ast) + \frac L 2 \left( ||x^k - x^\ast||^2 - ||x^{k+1}-x^\ast||^2 \right) \end{aligned}\]

where $(\ast)$ follows from expanding the square: writing $v = x^k - x^\ast$ and $g = \nabla f(x^k)$, we have $\frac L 2 \left( ||v||^2 - \left|\left|v - \frac 1 L g\right|\right|^2 \right) = g^\top v - \frac{1}{2L} ||g||^2$. Therefore

\[\begin{aligned} \sum^{T-1} _ {k = 0} \left( f(x^{k+1}) - f(x^\ast) \right) &\le \frac L 2 \sum^{T-1} _ {k = 0} \left( ||x^k - x^\ast||^2 - || x^{k+1} - x^\ast ||^2 \right) \\\\ &= \frac L 2 (||x^0 - x^\ast||^2 - ||x^T - x^\ast||^2) \\\\ &\le \frac L 2 ||x^0 - x^\ast||^2 \end{aligned}\]

and since $f(x^T) \le f(x^{k+1})$ for all $k \le T - 1$ (the foundational inequality guarantees descent at every step), this implies

\[\begin{aligned} f(x^T) - f(x^\ast) &= \frac{1}{T} \sum^{T-1} _ {k = 0} \left( f(x^{T}) - f(x^\ast) \right) \\\\ &\le \frac{1}{T} \sum^{T-1} _ {k = 0} \left( f(x^{k+1}) - f(x^\ast) \right) \\\\ &\le \frac{L}{2T} ||x^0 - x^\ast||^2 \end{aligned}\]

Equivalent “$\Delta$-flavoured” proof:

Let $\Delta _ k := f(x^k) - f(x^\ast)$. Recall that we have:

  • $f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \vert \vert \nabla f(x^k) \vert \vert ^2$ (foundational inequality)
  • $\Delta _ k \le \nabla f(x^k)^\top (x^k - x^\ast)$ (follows from convexity)

Then:

\[\begin{aligned} \Delta_{k+1} &\le \Delta_k - \frac{1}{2L} ||\nabla f(x^k)||^2 &&(\text{foundational inequality}) \\\\ &\le \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} ||\nabla f(x^k)||^2 &&(\text{convexity}) \\\\ &= \frac{L}{2}\big( ||x^k - x^\ast||^2 - ||x^k - \frac 1 L \nabla f(x^k) - x^\ast||^2 \big) &&(\text{algebra}) \\\\ &= \frac{L}{2}\big(||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2\big) &&(\text{def.}) \end{aligned}\]

Hence

\[\begin{aligned} f(x^T) - f(x^\ast) &= \Delta_T \\\\ &= \frac{1}{T} \sum^{T-1}_{k = 0} \Delta_T \\\\ &\le \frac{1}{T} \sum^{T-1}_{k = 0} \Delta_{k+1} &&(\text{decreasing}) \\\\ &\le \frac{L}{2T} \sum^{T-1}_{k=0} \big(||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2\big) &&(\text{def.}) \\\\ &= \frac{L}{2T} \big(||x^0 - x^\ast||^2 -||x^T - x^\ast||^2\big) &&(\text{telescope}) \\\\ &\le \frac{L}{2T} ||x^0 - x^\ast||^2 \end{aligned}\]

Fixed-step convergence in strongly convex case

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
  • The step size $\alpha _ k$ is fixed at $1/L$.
  • $f$ is $\gamma$-strongly convex

@State a result about the convergence of steepest descent in this $L$-smooth $\gamma$-strongly convex case.


\[f(x^{k}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)^k(f(x^0) - f(x^\ast))\]

for all $k$.
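
A numerical check (a sketch; the quadratic test problem is an illustrative assumption, with $\gamma$ and $L$ taken as the extreme eigenvalues of $A$):

```python
import numpy as np

# Check f(x^k) - f(x*) <= (1 - gamma/L)^k * (f(x^0) - f(x*)).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
eigs = np.linalg.eigvalsh(A)
gamma, L = eigs.min(), eigs.max()  # strong convexity / smoothness constants
f = lambda x: 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)
x = np.array([5.0, -3.0])          # arbitrary starting point x^0
gap0 = f(x) - f(x_star)

for k in range(1, 101):
    x = x - (A @ x - b) / L        # x is now the iterate x^k
    assert f(x) - f(x_star) <= (1 - gamma / L) ** k * gap0 + 1e-12
print("Q-linear bound holds for every k checked")
```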

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth
  • We wish to solve the optimisation problem $\min _ {x \in \mathbb R^n} f(x)$
  • The step size $\alpha _ k$ is fixed at $1/L$.
  • $f$ is $\gamma$-strongly convex with unique minimiser $x^\ast$

@Prove that then:

\[f(x^{k}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)^k(f(x^0) - f(x^\ast))\]

for all $k$.


By $\gamma$-strong convexity, we have that for all $y \in \mathbb R^n$,

\[f(y) \ge f(x^k) + \nabla f(x^k)^\top (y - x^k) + \frac \gamma 2 ||y - x^k||^2\]

Taking $\min$ of both sides,

\[\min _ y f(y) \ge \min _ y \left( f(x^k) + \nabla f(x^k)^\top(y-x^k) + \frac \gamma 2 ||y - x^k||^2 \right)\]

The minimum on the left-hand side is achieved at $y = x^\ast$, and the minimum on the right-hand side occurs at $y = x^k - \frac 1 \gamma \nabla f(x^k)$ (check by differentiating).

Substituting these back in yields

\[\begin{aligned} f(x^\ast) &\ge f(x^k) - \nabla f(x^k)^\top \left( \frac 1 \gamma \nabla f(x^k) \right) + \frac\gamma2 \left|\left| \frac 1 \gamma \nabla f(x^k) \right|\right|^2 \\\\ &= f(x^k) - \frac{1}{2\gamma} \left|\left| \nabla f(x^k) \right|\right|^2 \end{aligned}\]

Hence

\[\left|\left| \nabla f(x^k) \right|\right|^2 \ge 2\gamma (f(x^k) - f(x^\ast))\]

Substituting this back into the foundational inequality, we obtain

\[f(x^k) - f(x^{k+1}) \ge \frac{1}{2L} ||\nabla f(x^k)||^2 \ge \frac{\gamma}{L}(f(x^k) - f(x^\ast))\]

Subtracting $f(x^\ast)$ from both sides and rearranging, we obtain

\[f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac\gamma L\right)(f(x^k) - f(x^\ast))\]

and hence the claim that

\[f(x^{k}) - f(x^\ast) \le \left(1 - \frac \gamma L\right)^k(f(x^0) - f(x^\ast))\]

for all $k \ge 1$ follows by induction.


Equivalent “$\Delta$-flavoured” proof:

Let $\Delta _ k := f(x^k) - f(x^\ast)$. Recall that we have:

  • $f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \vert \vert \nabla f(x^k) \vert \vert ^2$ (foundational inequality)
  • $\Delta _ k \le \frac{1}{2\gamma} \vert \vert \nabla f(x^k) \vert \vert ^2$ (follows from $\gamma$-strong convexity)

Then:

\[\begin{aligned} \Delta_{k+1} &\le \Delta_k - \frac{1}{2L} ||\nabla f(x^k)||^2 && (\text{foundational inequality}) \\\\ &\le \Delta_k - \frac{\gamma}{L} \Delta_k &&(\gamma\text{-strong convexity}) \\\\ &= \left(1 - \frac{\gamma}{L}\right) \Delta_k \end{aligned}\]

and hence by induction

\[\begin{aligned} f(x^T) - f(x^\ast) &= \Delta_T \\\\ &\le \left(1 - \frac{\gamma}{L}\right)^T (f(x^0) - f(x^\ast)) \end{aligned}\]


