Optimisation for Data Science HT25, Overview of convergence results



Table of results

The bulk of this course consists of proving a lot of different results about iterative algorithms for solving optimisation problems

\[\min_{x \in \mathcal D \subseteq \mathbb R^d} f(x)\]

under different assumptions on the objective function $f$. The main algorithms considered are:

  • Steepest descent: The classic gradient descent algorithm
  • Line search: Not an algorithm per se, but a method for choosing search directions and step lengths
  • Proximal method: A specialised algorithm for solving regularised problems
  • Heavy ball method: A momentum-based algorithm
  • Nesterov’s accelerated gradient method: A more general momentum-based algorithm
  • Stochastic gradient descent: Gradient descent with randomised gradients
  • SVRG: “Stochastic variance-reduced gradient” descent, picks random gradients cleverly to improve convergence
  • Coordinate descent: Gradient descent along a subset of the coordinate directions at each iteration

and the main assumptions considered explicitly here are:

  • $L$-smoothness: Does the function have $L$-Lipschitz continuous gradients?
  • Convexity: Is the function well shaped for optimisation?
  • $\gamma$-strong convexity: Is the function really well shaped for optimisation?

There are also a few assumptions that are taken implicitly throughout these results. For example, it makes sense to assume that the functions are bounded below, since otherwise there is no hope of converging to a minimiser (in the $\gamma$-strongly convex case this holds automatically, since such functions are always bounded below).
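
To fix ideas, here is a minimal sketch of the simplest of these, fixed-step steepest descent with step size $1/L$. This is Python with numpy only; the quadratic test problem and all names (`steepest_descent`, `grad_f`) are illustrative, not taken from the course materials.

```python
import numpy as np

def steepest_descent(grad_f, x0, L, iters=100):
    """Fixed-step gradient descent: x^{k+1} = x^k - (1/L) * grad f(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad_f(x) / L
    return x

# Illustrative use on f(x) = 1/2 x^T A x - b^T x, which is L-smooth with L = lambda_max(A)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
x_hat = steepest_descent(lambda x: A @ x - b, x0=np.zeros(2), L=L, iters=500)
print(x_hat, np.linalg.solve(A, b))  # iterate vs exact minimiser
```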

Method Result Assumptions  
Steepest descent

fixed step, general case
(proof)
$\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{T}}$
  • $L$-smooth
  • Step size $1/L$
 
Steepest descent

fixed step, convex
(proof)
$f(x^k) - f(x^\ast) \le \frac{L\vert\vert x^0 - x^\ast \vert\vert^2}{2k}$
  • $L$-smooth
  • Step size $1/L$
 
Steepest descent

fixed step, $\gamma$-strongly convex
(proof)
$f(x^k) - f(x^\ast) \le \left(1-\frac{\gamma}{L}\right)^k (f(x^0) - f(x^\ast))$
  • $L$-smooth
  • Step size $1/L$
 
Gradient descent

line search, general case
(proof)
$\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1(1-c _ 2)T}}$
  • $L$-smooth
  • Okay descent directions
  • Wolfe conditions
 
Gradient descent

line search, $\gamma$-strongly convex
(sheets only)
$f(x^k) - f(x^\ast) \le \left(1 - \frac{2\eta^2 \gamma c _ 1 (1 - c _ 2)}{L}\right)^k(f(x^0) - f(x^\ast))$
  • $L$-smooth
  • Okay descent directions
  • Wolfe conditions
 
Proximal method for $\varphi(x)$ as the sum $f(x) + \lambda \psi(x)$

fixed step, $f$ convex and smooth + $\psi$ convex
(proof)
$\varphi(x^k) - \varphi(x^\ast) \le \frac{L \vert\vert x^0 - x^\ast \vert \vert^2}{2k}$
  • $f$ $L$-smooth and convex
  • $\psi$ convex
 
Heavy ball method for $\frac 1 2 x^\top A x - b^\top x$

fixed step, quadratic function
(stated, but not proved)
$\vert\vert x^k - x^\ast\vert \vert \le C \beta^k$
  • $\alpha _ k = \frac{4}{(\sqrt L + \sqrt \gamma)^2}$
  • $\beta _ k = \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma}$
 
Nesterov’s accelerated gradient method

dynamic step, convex case
(proof)
$f(x^k) - f(x^\ast) \le \frac{4L\vert\vert x^0 - x^\ast \vert\vert^2}{(k+2)^2}$
  • $L$-smooth
  • Finite minimiser and finite minimum
 
Nesterov’s accelerated gradient method

fixed step, $\gamma$-strongly convex case
(stated, but not proved)
$f(x^k) - f(x^\ast) \le \frac{L + \gamma}{2} \left( 1 - \sqrt{\frac \gamma L} \right)^k \vert\vert x^0 - x^\ast\vert\vert^2$
  • $L$-smooth
 
Stochastic gradient descent

fixed step, general case
(proof)
$\min _ {0 \le i \le k - 1} \mathbb E[\vert\vert \nabla f(X^i) \vert\vert^2]\le\eta M + \frac{2L(f(x^0) - f _ \text{low})}{k \eta}$
  • Unbiased gradient
  • Bounded below
  • $L$-smooth summands
  • Bounded variance
  • Step size $\eta/L$
 
Stochastic gradient descent

fixed step, $\gamma$-strongly convex
(proof)
$\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L}\right)^k \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]$
  • Unbiased gradient
  • Bounded below
  • $L$-smooth summands
  • Bounded variance
  • Step size $\eta/L$
 
Stochastic gradient descent

fixed step, $\gamma$-strongly convex,
exponentially decreasing variance

(currently unproven)
$\mathbb E[f(X^k)] - f(x^\ast) \le \nu \rho^k$, $\nu := \max \left( \frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast) \right)$ and $\rho := \max \left( 1 - \frac{\alpha \gamma}{2}, \frac{1}{\xi} \right) < 1$
  • Unbiased gradient
  • $L$-smooth summands
  • Exponentially decreasing variance
  • Step size $\eta / L$
 
Stochastic gradient descent

dynamic step, $\gamma$-strongly convex
(proof)
$\mathbb E[f(X^k)] - f(x^\ast) \le \frac{\frac{2L}{\gamma}\max\left(\frac M \gamma, f(x^0) - f(x^\ast)\right)}{\frac{2L}{\gamma} + k}$
  • Unbiased gradient
  • $L$-smooth summands
  • Bounded variance
  • Step size $\frac{2}{2L + k\gamma}$
 
Stochastic gradient descent

fixed step, $\gamma$-strongly convex,
increased batch size

(currently unproven)
$\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma p} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma p} \right]$
  • Unbiased gradient
  • $L$-smooth summands
  • Batch size $p$
  • Step size $\eta / L$
 
SVRG

$\gamma$-strongly convex
(currently unproven)
$\mathbb E[f(X^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))$ where $\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1$
  • $L$-smooth and convex summands
  • Step size $\alpha < \frac{1}{4L}$ and $p > \frac{1}{\gamma \alpha (1 - 4L\alpha)}$
 
Cyclic coordinate descent

line search, general case
(proof)
$\sum^\infty _ {k=0}\left(\cos^2 \theta _ k \times \vert\vert \nabla f(x^k) \vert\vert^2 \right) < \infty$
  • $L$-smooth
  • Bounded below
  • Wolfe conditions
 
Cyclic coordinate descent

exact line search, unique coordinatewise minima
(stated, in lecture notes only)
$\text{Acc}\big((x^{j}) _ {j \in n\mathbb N}\big) \subseteq \{x \in \mathbb R^n : \nabla f(x) = 0\}$
  • Continuously differentiable
  • Bounded below
  • Unique coordinatewise minima
  • Exact minimiser
 
Cyclic coordinate descent

fixed step, convex
(stated, but not proved)
$f(x^k) - f(x^\ast) \le \frac{4nD^2 L _ \max (1 + n\frac{L^2}{L^2 _ \max})}{k+8}$
  • CL-smooth
  • $D$-bounded initial sublevel set
  • Step size $1/L _ \text{max}$
 
Cyclic coordinate descent

fixed step, $\gamma$-strongly convex
(stated, but not proved)
$f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L _ \max \left(1 + n\frac{L^2}{L _ \max^2}\right)}\right)^{k/n} (f(x^0) - f(x^\ast))$
  • CL-smooth
  • Step size $1/L _ \text{max}$
  • $k$ multiple of $n$
 
Gauss-Southwell coordinate descent

fixed step, convex
(sheets only)
$f(x^k) - f(x^\ast) \le \frac{2nL _ \text{max} D^2}{k}$
  • CL-smooth
  • Step size $1/L _ \text{max}$
 
Randomised coordinate descent

fixed step, general case
(proof)
$\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}$
  • CL-smooth
  • Bounded below
  • Step size $1/L _ \text{max}$
 
Randomised coordinate descent

fixed step, convex
(proof)
$\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \text{max}D^2}{k}$
  • CL-smooth
  • $D$-bounded initial sublevel set
  • Step size $1/L _ \text{max}$
 
Randomised coordinate descent

fixed step, $\gamma$-strongly convex
(proof, in slides only)
$\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \max} \right)^k (f(x^0) - f(x^\ast))$
  • CL-smooth
  • Step size $1/L _ \text{max}$
 
Randomised coordinate descent

line search, general case
(currently unproven)
$\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$
  • $L$-smooth
  • Bounded below
  • Line search
  • Wolfe conditions
 
Randomised coordinate descent

componentwise line search, general case
(currently unproven, in slides only)
$\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL _ \text{max} (f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$
  • CL-smooth
  • Bounded below
  • Line search
  • Componentwise Wolfe conditions
 
Randomised coordinate descent

componentwise line search, convex
(currently unproven, in slides only)
$\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}$
  • CL-smooth
  • $D$-bounded initial sublevel set
  • Line search
  • Componentwise Wolfe conditions
 
Randomised coordinate descent

componentwise line search, $\gamma$-strongly convex
(currently unproven, in slides only)
$\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$
  • CL-smooth
  • Line search
  • Componentwise Wolfe conditions
 
Randomised block coordinate descent

fixed step, $\gamma$-strongly convex
(currently unproven, in slides only)
$\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{\gamma}{\ell L _ \text{bmax} }\right)^k(f(x^0) - f(x^\ast))$
  • BL-smooth
  • Step size $1/L _ \text{bmax}$
  • Known set of blocks
 
Randomised block coordinate descent

fixed step, $\gamma$-strongly convex
(currently unproven, in slides only)
$\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{n _ 0 \gamma}{n L _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$
  • BL-smooth
  • Step size $1/L _ \text{bmax}$
  • All blocks of a fixed size
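
The table's proximal-method row never spells out the iteration itself. Here is a minimal sketch, assuming the usual proximal-gradient update $x^{k+1} = \text{prox} _ {\alpha \lambda \psi}(x^k - \alpha \nabla f(x^k))$ with $\psi = \vert\vert \cdot \vert\vert _ 1$ (so the prox is soft-thresholding) and fixed step $\alpha = 1/L$; the function names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate towards zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient(grad_f, lam, x0, L, iters=200):
    """Proximal gradient method for phi(x) = f(x) + lam * ||x||_1 with fixed step 1/L."""
    x = np.asarray(x0, dtype=float)
    alpha = 1.0 / L
    for _ in range(iters):
        x = soft_threshold(x - alpha * grad_f(x), alpha * lam)
    return x

# Illustrative use on a small lasso problem: f(x) = 1/2 ||Ax - y||^2, L = lambda_max(A^T A)
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
x_hat = proximal_gradient(lambda x: A.T @ (A @ x - y), lam=0.1, x0=np.zeros(2),
                          L=np.linalg.eigvalsh(A.T @ A).max(), iters=500)
```

(If $G _ \alpha(x)$ in the property table below denotes the generalised gradient $\frac{1}{\alpha}\big(x - \text{prox} _ {\alpha\lambda\psi}(x - \alpha\nabla f(x))\big)$, this update is exactly $x^{k+1} = x^k - \alpha G _ \alpha(x^k)$.)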
 

Proving these results

Most of the results above follow from proving some “overestimation” or “local descent” property: a result that places an upper bound on the objective value at the next iterate in terms of the current iterate and other local information such as the gradient. Global convergence results for each method are then obtained by applying these properties repeatedly to analyse the sequence of iterates as a whole.
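
As a quick sanity check of the first row below, the following sketch verifies the fixed-step overestimation property $f(x^{k+1}) \le f(x^k) - \frac{1}{2L}\vert\vert\nabla f(x^k)\vert\vert^2$ numerically on a random convex quadratic (numpy assumed; the setup is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)            # positive definite, so f is convex
b = rng.standard_normal(5)
L = np.linalg.eigvalsh(A).max()    # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = rng.standard_normal(5)
for _ in range(20):
    g = grad(x)
    x_next = x - g / L
    # local descent / overestimation property: f(x^{k+1}) <= f(x^k) - ||g||^2 / (2L)
    assert f(x_next) <= f(x) - (g @ g) / (2 * L) + 1e-12
    x = x_next
print("descent property held on every iteration")
```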

Method Property Assumptions
Steepest descent

fixed step
(proof)
$f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \vert\vert\nabla f(x^k)\vert\vert^2$
  • $L$-smooth
Gradient descent

line search
(proof)
$f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2$
  • $L$-smooth
  • Wolfe conditions
  • Okay descent directions
Proximal method

$f$ smooth, $\psi$ convex
(proof)
$\varphi(x - \alpha G _ \alpha(x)) \le \varphi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert\vert G _ \alpha(x)\vert \vert^2$
  • $f$ is $L$-smooth
  • $\psi$ is convex
Stochastic gradient descent

fixed step
(proof)
$\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha _ k\left( 1-\frac{L\alpha _ k}{2} \right) \vert\vert\nabla f(X^k)\vert\vert^2 + \frac{ML\alpha _ k^2}{2}$
  • Unbiased gradient
  • $L$-smooth summands
  • Batch size $p$
  • Step size $\eta / L$
Gradient descent

line search, weaker assumptions on descent direction
(proof)
$f(x^{k+1}) \le f(x^k) - \frac{c _ 1(1-c _ 2)}{L} \cos^2 \theta _ k \vert\vert\nabla f(x^k)\vert\vert^2$
  • $L$-smooth
  • Wolfe conditions
Randomised coordinate descent

line search, $L$-smooth
(proof)
$\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{c _ 1 (1 - c _ 2)}{n L} \mathbb E[\vert \vert \nabla f(X^k)\vert \vert^2]$
  • $L$-smooth
Coordinate descent

CL-smooth
(proof)
$f(x^{k+1}) \le f(x^k) - \alpha _ k \left( 1 - \frac{L _ \text{max} \alpha _ k}{2} \right) \left( \frac{\partial f}{\partial x _ i}(x^k) \right)^2$
  • CL-smooth
Blockwise coordinate descent

BL-smooth
(currently unproven)
$f(x - \alpha \nabla _ \mathcal B f(x)) \le f(x) - \alpha \left(1 - \frac{L _ \text{bmax} \alpha }{2}\right) \vert\vert\nabla _ {\mathcal B} f(x)\vert\vert^2$
  • BL-smooth

In stochastic methods (such as stochastic gradient descent or randomised coordinate descent), it’s not always possible to guarantee local descent given the standard assumptions; you might just get unlucky and keep on picking directions or gradients which don’t actually make any progress. In these cases, you can instead show that you get local descent in expectation, rather than in every case.
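
For concreteness, here is a minimal sketch of minibatch stochastic gradient descent for a finite sum $f(x) = \frac{1}{n} \sum _ j f _ j(x)$: the gradient estimate is unbiased because the batch is sampled uniformly at random. The least-squares summands and all names here are illustrative assumptions, not the course's code.

```python
import numpy as np

def sgd(grad_fj, n, x0, alpha, batch_size, iters, seed=0):
    """Minibatch SGD: G^k averages grad f_j over a uniformly random batch S_k,
    so E_{S_k}[G^k] = grad f(X^k) (the 'unbiased gradient' assumption)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        batch = rng.choice(n, size=batch_size, replace=False)
        g = np.mean([grad_fj(x, j) for j in batch], axis=0)
        x = x - alpha * g
    return x

# Illustrative least-squares problem: f_j(x) = 1/2 * (a_j^T x - y_j)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
y = A @ np.ones(5) + 0.1 * rng.standard_normal(100)
x_hat = sgd(lambda x, j: (A[j] @ x - y[j]) * A[j], n=100, x0=np.zeros(5),
            alpha=0.01, batch_size=10, iters=2000)
```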

Likewise, when using cyclic coordinate descent you might also get unlucky and pick directions which don’t make any progress towards minima (and this can be deliberately engineered, as it is with Powell’s objective; see [[Notes - Optimisation for Data Science HT25, Coordinate descent]]). In these cases, the overestimation property might depend on local information beyond the magnitude of the gradient. For example, in the table you will notice that the overestimation property for gradient descent with line search (under weaker assumptions on the descent direction) contains an additional factor of $\cos^2 \theta _ k$, where $\theta _ k$ is the angle between the chosen descent direction $d^k$ and the steepest descent direction $-\nabla f(x^k)$; this reflects the fact that you only get local descent if your chosen direction points in roughly the right direction. If you upgrade the assumptions on $f$ to CL-smoothness, you can obtain descent in every iteration.
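
A matching sketch of randomised coordinate descent with the componentwise step size $1/L _ \text{max}$ used in the table (numpy only; the quadratic example and names are illustrative):

```python
import numpy as np

def randomised_coordinate_descent(grad_i, n, x0, L_max, iters, seed=0):
    """Each iteration updates one uniformly chosen coordinate with step 1/L_max."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        i = rng.integers(n)
        x[i] -= grad_i(x, i) / L_max
    return x

# Illustrative quadratic f(x) = 1/2 x^T A x - b^T x: componentwise constants L_i = A[i, i]
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])
x_hat = randomised_coordinate_descent(lambda x, i: A[i] @ x - b[i], n=3,
                                      x0=np.zeros(3), L_max=A.diagonal().max(), iters=3000)
```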

When $f$ is nonconvex

When $f$ is nonconvex, $f$ might have many local minima. Since these methods are all local, the best you can hope to show is something about the rate at which $ \vert \vert \nabla f(x^k) \vert \vert $ tends to zero. For example, when using line search methods we have the overestimation property that

\[f(x^{k+1}) \le f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2\]

Rearranging, you obtain

\[||\nabla f(x^k)||^2 \le \frac{L}{c_1 (1 - c_2) \eta^2} \big( f(x^k) - f(x^{k+1}) \big)\]

And then by considering telescoping sums it follows that:

\[\min_{0 \le i \le k-1} ||\nabla f(x^i)|| \le \sqrt{\frac{L (f(x^0) - C)}{c_1(1-c_2)\eta^2 k} }\]
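
Spelling out the telescoping step (with $C$ a lower bound on $f$): summing the rearranged inequality over $i = 0, \ldots, k-1$ gives

\[\sum_{i=0}^{k-1} ||\nabla f(x^i)||^2 \le \frac{L}{c_1(1-c_2)\eta^2} \sum_{i=0}^{k-1} \big( f(x^i) - f(x^{i+1}) \big) = \frac{L\big(f(x^0) - f(x^k)\big)}{c_1(1-c_2)\eta^2} \le \frac{L(f(x^0) - C)}{c_1(1-c_2)\eta^2}\]

and since the minimum of $k$ numbers is at most their average, dividing by $k$ and taking square roots gives the bound above.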

When $f$ is convex

When $f$ is convex and has some minimiser $x^\ast$, that minimiser is a global minimiser. So it makes sense to talk about the rate at which $f(x^k)$ tends towards $f(x^\ast)$. It’s often convenient to let $\Delta _ k := f(x^k) - f(x^\ast)$ denote the gap between the value at the current iterate and the value at the global minimiser.

In this context, you typically have two inequalities:

  • $\Delta _ {k+1} \le \Delta _ k - B(x^k)$ (the overestimation property you have proved)
  • $\Delta _ k \le \nabla f(x^k)^\top (x^k - x^\ast)$ (rearranging the first-order characterisation of convexity)

There’s two main ways the proofs proceed from here.

Finding a telescoping sum $\sum \Delta _ k \le \sum g(x^k) - g(x^{k+1})$
Finding a telescoping sum $\sum \frac{1}{\Delta _ k} - \frac{1}{\Delta _ {k+1} } \ge \cdots$
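
For instance, in the fixed-step convex case, taking $B(x^k) = \frac{1}{2L} \vert\vert \nabla f(x^k) \vert\vert^2$ (the fixed-step overestimation property) and combining the two inequalities above gives a telescoping sum with $g(x) = \frac{L}{2} \vert\vert x - x^\ast \vert\vert^2$:

\[\Delta_{k+1} \le \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} ||\nabla f(x^k)||^2 = \frac{L}{2}\left( ||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2 \right)\]

Summing this over the first $k$ iterations and using that $\Delta _ k$ is nonincreasing then gives $k\Delta _ k \le \sum _ {i=1}^{k} \Delta _ i \le \frac{L}{2} \vert\vert x^0 - x^\ast \vert\vert^2$, which is exactly the $O(1/k)$ rate in the table.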

When $f$ is $\gamma$-strongly convex

When $f$ is $\gamma$-strongly convex you can upgrade the inequality $\Delta _ k \le \nabla f(x^k)^\top (x^k - x^\ast)$ to the much stronger $\Delta _ k \le \frac{1}{2\gamma} \vert \vert \nabla f(x^k) \vert \vert ^2$. If your overestimation property is something like

\[f(x^{k+1}) \le f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2\]

Then you obtain

\[\begin{aligned} \Delta_{k+1} &\le \Delta_k - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2 && (\text{rewriting}) \\\\ &\le \left(1 - \frac{2\gamma c_1 (1-c_2)}{L} \eta^2\right) \Delta_k && (\text{using } \vert\vert \nabla f(x^k) \vert\vert^2 \ge 2\gamma \Delta_k) \end{aligned}\]

and then an inductive proof establishes that

\[f(x^{k}) - f(x^\ast) \le \left( 1 - \frac{2\gamma c_1 (1 - c_2)}{L} \eta^2 \right)^k (f(x^0) - f(x^\ast))\]

Some of the assumptions

Okay descent directions

The search directions $d^k$ lie within a bounded angle of the steepest descent direction: there exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$,

\[\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||\]

(relevant to line search methods)

Wolfe conditions

$\alpha _ k$ is neither too short nor too long: there exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$, we have:

  • Armijo condition: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
  • Curvature condition: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

(relevant to line search methods)
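
A small sketch of checking the two conditions for a candidate step size, assuming nothing beyond numpy (the function name and the grid-scan usage are illustrative; in practice the step would come from a backtracking or interpolation routine):

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo (sufficient decrease) and curvature conditions
    for step alpha along the descent direction d, with 0 < c1 < c2 < 1."""
    slope = grad_f(x) @ d   # directional derivative at x, negative for a descent direction
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope
    curvature = grad_f(x + alpha * d) @ d >= c2 * slope
    return armijo and curvature

# Illustrative use: scan a grid of steps along the steepest descent direction of f(x) = 1/2 ||x||^2
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([2.0, -1.0]); d = -grad_f(x)
acceptable = [a for a in np.linspace(0.01, 1.99, 50) if satisfies_wolfe(f, grad_f, x, d, a)]
```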

Unbiased gradient

$G^k$ (the estimate of the gradient) is an unbiased estimator of the true gradient, where the expectation is taken over the randomly sampled batch $\mathcal S _ k$:

\[\mathbb E_{\mathcal S_k} [G^k] = \nabla f(X^k)\]

(relevant to stochastic gradient descent)

$L$-smooth summands

For all $j$, the summand $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous.

(relevant to stochastic gradient descent)

Exponentially decreasing variance

\[\text{Var}_{\mathcal S_k}(G^k) := \mathbb E_{\mathcal S_k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le \frac {M} {\xi^k}\]

(relevant to stochastic gradient descent)

Unique coordinatewise minima

For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.

(relevant to coordinate descent)

Exact minimiser

$\alpha _ k$ is chosen to exactly minimise $\alpha \mapsto f(x^k + \alpha d^k)$.

(relevant to coordinate descent)

CL-smooth (componentwise Lipschitz-smooth)

Each partial derivative of $f$ is “coordinatewise $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

(relevant to coordinate descent)
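
For example, for a quadratic $f(x) = \frac 1 2 x^\top A x - b^\top x$ we have

\[\frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) = t A_{ii},\]

so $L _ i = \vert A _ {ii} \vert$ and $L _ \text{max} = \max _ i \vert A _ {ii} \vert$, which can be much smaller than the full Lipschitz constant $L = \lambda _ \text{max}(A)$ (for symmetric positive semidefinite $A$).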

$D$-bounded initial sublevel set

$f$ has a $D$-bounded initial sublevel set if there exists a finite minimiser $x^\ast$ with $ \vert \vert x - x^\ast \vert \vert \le D$ for all $x$ with $f(x) \le f(x^0)$ (this terminology is not standard).

(relevant to coordinate descent)

Componentwise Wolfe conditions

There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$ and $1 \le i \le n$,

  • $f\left(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i\right) \le f(x^k) - c _ 1 \alpha _ k \left( \frac{\partial f(x^k)}{\partial x _ i} \right)^2$
  • $\left(\frac{\partial f(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i)}{\partial x _ i}\right) \left(-\frac{\partial f(x^k)}{\partial x _ i} \right) \ge -c _ 2 \left(\frac{\partial f(x^k)}{\partial x _ i}\right)^2$

(these are just the Wolfe conditions with $d^k$ taken as $-\frac{\partial f(x^k)}{\partial x _ i} e _ i$ and all occurrences of $\nabla f(x^k)$ replaced by $\frac{\partial f(x^k)}{\partial x _ i}$).

BL-smooth (blockwise Lipschitz-smooth)

A function with a natural block structure

\[f(x) = f(x _ {\mathcal B _ 1}, \ldots, x _ {\mathcal B _ \ell})\]

is block Lipschitz smooth (BL-smooth) with blockwise Lipschitz constant $L _ \text{bmax}$ if for each block $\mathcal B$, there exists a Lipschitz constant $L _ \mathcal B$ such that

\[||\nabla _ {\mathcal B} f(x - \alpha \nabla _ {\mathcal B} f(x)) - \nabla _ {\mathcal B} f(x)|| \le L _ {\mathcal B} \times ||\alpha \nabla _ {\mathcal B} f(x)||\]

for all $x \in \mathbb R^n$ and $\alpha \in \mathbb R$, and

\[L _ \text{bmax} := \max _ \mathcal B L _ {\mathcal B}\]

Other results not covered in the course

Here are some other results which are not covered in the course. These mainly come from the “Optimization for Data Science” course at ETH Zurich.

Method Result Assumptions and notes
Fixed-step steepest descent $\sum^{k-1} _ {i = 0} (f(x^i) - f(x^\ast)) \le \frac \alpha 2 \sum^{k-1} _ {i = 0}\vert\vert\nabla f(x^i)\vert\vert^2 + \frac 1 {2\alpha} \vert\vert x^0 - x^\ast\vert\vert^2$
  • $f$ is convex
  • $f$ is not necessarily $L$-smooth
Fixed-step steepest descent $\frac 1 k \sum^{k-1} _ {i = 0} (f(x^i) - f(x^\ast)) \le \frac{\ell \vert\vert x^0 - x^\ast\vert\vert}{\sqrt k}$
  • $f$ is convex
  • $f$ is $\ell$-Lipschitz


