Optimisation for Data Science HT25, Overview of convergence results
- Table of results
- Proving these results
- Some of the assumptions
- Okay descent directions
- Wolfe conditions
- Unbiased gradient
- $L$-smooth summands
- Exponentially decreasing variance
- Unique coordinatewise minima
- Step size exact minimiser
- CL-smooth (componentwise Lipschitz-smooth)
- $D$-bounded initial sublevel set
- Componentwise Wolfe conditions
- BL-smooth (blockwise Lipschitz-smooth)
- Other results not covered in the course
Table of results
The bulk of this course consists of proving a lot of different results about iterative algorithms for solving optimisation problems
\[\min_{x \in \mathcal D \subseteq \mathbb R^d} f(x)\]under different assumptions on the objective function $f$. The main algorithms considered are:
- Steepest descent: The classic gradient descent algorithm
- Line search: Not an algorithm per se, but a method for determining directions and step lengths
- Proximal method: A specialised algorithm for solving regularised problems
- Heavy ball method: A momentum-based algorithm
- Nesterov’s accelerated gradient method: A more general momentum-based algorithm
- Stochastic gradient descent: Gradient descent with randomised gradients
- SVRG: “Stochastic variance-reduced gradient” descent, which combines stochastic gradients with occasional full-gradient computations to reduce variance and improve convergence
- Coordinate descent: Gradient descent along a subset of the coordinate directions at each iteration
and the main assumptions considered explicitly here are:
- $L$-smoothness: Does the function have $L$-Lipschitz continuous gradients?
- Convexity: Is the function well shaped for optimisation?
- $\gamma$-strong convexity: Is the function really well shaped for optimisation?
There are also a few assumptions that are taken implicitly throughout these results. For example, it makes sense to assume that the functions are bounded below, otherwise there is no hope of converging to a minimiser (although this is not necessary in the $\gamma$-strongly convex case).
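As a reference point for the table below, here is a minimal sketch of the fixed-step steepest descent iteration $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$ that the first few rows analyse. The quadratic objective, the data and the choice of step $1/L$ are illustrative assumptions for the example, not something specified by the course.

```python
import numpy as np

# Minimal sketch of fixed-step steepest descent on an illustrative convex quadratic
# f(x) = 0.5 x^T A x - b^T x, whose gradient is L-Lipschitz with L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient
x = np.zeros(2)                   # x^0
for k in range(200):
    x = x - grad_f(x) / L         # x^{k+1} = x^k - (1/L) grad f(x^k)

x_star = np.linalg.solve(A, b)    # exact minimiser, for comparison
print(f(x) - f(x_star))           # gap f(x^k) - f(x^*)
```

For a convex objective like this one, the printed gap should sit below the $\frac{L\vert\vert x^0 - x^\ast \vert\vert^2}{2k}$ bound from the table.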
Method | Result | Assumptions
---|---|---
Steepest descent, fixed step, general case (proof) | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{T}}$ | 
Steepest descent, fixed step, convex (proof) | $f(x^k) - f(x^\ast) \le \frac{L\vert\vert x^0 - x^\ast \vert\vert^2}{2k}$ | 
Steepest descent, fixed step, $\gamma$-strongly convex (proof) | $f(x^k) - f(x^\ast) \le \left(1-\frac{\gamma}{L}\right)^k (f(x^0) - f(x^\ast))$ | 
Gradient descent, line search, general case (proof) | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1(1-c _ 2)T}}$ | 
Gradient descent, line search, $\gamma$-strongly convex (sheets only) | $f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac{2\eta^2 \gamma c _ 1 (1 - c _ 2)}{L}\right)(f(x^k) - f(x^\ast))$ | 
Proximal method for $\varphi(x) = f(x) + \lambda \psi(x)$, fixed step, $f$ convex and smooth, $\psi$ convex (proof) | $\varphi(x^k) - \varphi(x^\ast) \le \frac{L \vert\vert x^0 - x^\ast \vert \vert^2}{2k}$ | 
Heavy ball method for $\frac 1 2 x^\top A x - b^\top x$, fixed step, quadratic function (stated, but not proved) | $\vert\vert x^k - x^\ast\vert \vert \le C \beta^k$ | 
Nesterov’s accelerated gradient method, dynamic step, convex case (proof) | $f(x^k) - f(x^\ast) \le \frac{4L\vert\vert x^0 - x^\ast \vert\vert^2}{(k+2)^2}$ | 
Nesterov’s accelerated gradient method, fixed step, $\gamma$-strongly convex case (stated, but not proved) | $f(x^k) - f(x^\ast) \le \frac{L + \gamma}{2} \left( 1 - \sqrt{\frac \gamma L} \right)^k \vert\vert x^0 - x^\ast\vert\vert^2$ | 
Stochastic gradient descent, fixed step, general case (proof) | $\min _ {0 \le i \le k - 1} \mathbb E[\vert\vert \nabla f(X^i) \vert\vert^2]\le\eta M + \frac{2L(f(x^0) - f _ \text{low})}{k \eta}$ | 
Stochastic gradient descent, fixed step, $\gamma$-strongly convex (proof) | $\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L}\right)^k \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]$ | 
Stochastic gradient descent, fixed step, $\gamma$-strongly convex, exponentially decreasing variance (currently unproven) | $\mathbb E[f(X^k)] - f(x^\ast) \le \nu \rho^k$, where $\nu := \max \left( \frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast) \right)$ and $\rho := \max \left( 1 - \frac{\alpha \gamma}{2}, \frac{1}{\xi} \right) < 1$ | 
Stochastic gradient descent, dynamic step, $\gamma$-strongly convex (proof) | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{\frac{2L}{\gamma}\max\left(\frac M \gamma, f(x^0) - f(x^\ast)\right)}{\frac{2L}{\gamma} + k}$ | 
Stochastic gradient descent, fixed step, $\gamma$-strongly convex, increased batch size (currently unproven) | $\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma p} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma p} \right]$ | 
SVRG, $\gamma$-strongly convex (currently unproven) | $\mathbb E[f(X^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))$ where $\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L}{1 - 2L\alpha} < 1$ | 
Cyclic coordinate descent, fixed step, general case (proof) | $\sum^\infty _ {k=0}\left(\cos^2 \theta _ k \times \vert\vert \nabla f(x^k) \vert\vert^2 \right) < \infty$ | 
Cyclic coordinate descent, fixed step, unique coordinatewise minima (stated, in lecture notes only) | $\text{Acc}\big((x^{j}) _ {j \in n\mathbb N}\big) \subseteq \{x \in \mathbb R^n : \nabla f(x) = 0\}$ | 
Cyclic coordinate descent, fixed step, convex (stated, but not proved) | $f(x^k) - f(x^\ast) \le \frac{4nD^2 L _ \max (1 + n\frac{L^2}{L^2 _ \max})}{k+8}$ | 
Cyclic coordinate descent, fixed step, $\gamma$-strongly convex (stated, but not proved) | $f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L _ \max \left(1 + n\frac{L^2}{L _ \max^2}\right)}\right)^{k/n} (f(x^0) - f(x^\ast))$ | 
Gauss-Southwell coordinate descent, fixed step, convex (sheets only) | $f(x^k) - f(x^\ast) \le \frac{2nL _ \text{max} D^2}{k}$ | 
Randomised coordinate descent, fixed step, general case (proof) | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}$ | 
Randomised coordinate descent, fixed step, convex (proof) | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \text{max}D^2}{k}$ | 
Randomised coordinate descent, fixed step, $\gamma$-strongly convex (proof, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \max} \right)^k (f(x^0) - f(x^\ast))$ | 
Randomised coordinate descent, line search, general case (currently unproven) | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | 
Randomised coordinate descent, componentwise line search, general case (currently unproven, in slides only) | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL _ \text{max} (f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | 
Randomised coordinate descent, componentwise line search, convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}$ | 
Randomised coordinate descent, componentwise line search, $\gamma$-strongly convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$ | 
Randomised block coordinate descent, fixed step, $\gamma$-strongly convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{\gamma}{\ell L _ \text{bmax} }\right)^k(f(x^0) - f(x^\ast))$ | 
Randomised block coordinate descent, fixed step, $\gamma$-strongly convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{n _ 0 \gamma}{n L _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$ | 
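To make the proximal method row concrete, here is a minimal sketch of proximal gradient descent for $\varphi(x) = f(x) + \lambda \psi(x)$ with an illustrative least-squares $f$ and $\psi = \vert\vert \cdot \vert\vert _ 1$; the data, the regularisation weight and the fixed step $1/L$ are assumptions made for the example rather than part of the course material.

```python
import numpy as np

# Sketch of the proximal method for phi(x) = f(x) + lam * ||x||_1,
# with an illustrative smooth convex f(x) = 0.5 * ||Ax - y||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
y = rng.normal(size=50)
lam = 0.5

grad_f = lambda x: A.T @ (A @ x - y)

def prox_l1(z, t):
    # proximal operator of t * ||.||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of grad f (largest singular value squared)
alpha = 1.0 / L                   # fixed step
x = np.zeros(10)
for k in range(500):
    # x^{k+1} = prox_{alpha * lam * psi}(x^k - alpha * grad f(x^k))
    x = prox_l1(x - alpha * grad_f(x), alpha * lam)
```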
Proving these results
Most of the results above follow from proving some “overestimation” or “local descent” property, which is a result that places an upper bound on the next iterate given the current iterate and other local information like the gradient. Global results on the convergence of each of the methods are then obtained by applying these results repeatedly to analyse the sequence of iterates as a whole.
Method | Property | Assumptions
---|---|---
Steepest descent, fixed step (proof) | $f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \vert\vert\nabla f(x^k)\vert\vert^2$ | 
Gradient descent, line search (proof) | $f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2$ | 
Proximal method, smooth + convex (proof) | $\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert\vert G _ \alpha(x)\vert \vert^2$ | 
Stochastic gradient descent, fixed step (proof) | $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha _ k\left( 1-\frac{L\alpha _ k}{2} \right) \vert\vert\nabla f(X^k)\vert\vert^2 + \frac{ML\alpha _ k^2}{2}$ | 
Gradient descent, line search, weaker assumptions on descent direction (proof) | $f(x^{k+1}) \le f(x^k) - \frac{c _ 1(1-c _ 2)}{L} \cos^2 \theta _ k \vert\vert\nabla f(x^k)\vert\vert^2$ | 
Randomised coordinate descent, line search, L-smooth (proof) | $\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{c _ 1 (1 - c _ 2)}{n L} \mathbb E[\vert \vert \nabla f(X^k)\vert \vert^2]$ | 
Coordinate descent, CL-smooth (proof) | $f(x^{k+1}) \le f(x^k) - \alpha _ k \left( 1 - \frac{L _ \text{max} \alpha _ k}{2} \right) \left( \frac{\partial f}{\partial x _ i}(x^k) \right)^2$ | 
Blockwise coordinate descent, BL-smooth (currently unproven) | $f(x - \alpha \nabla _ \mathcal B f(x)) \le f(x) - \alpha \left(1 - \frac{L _ \text{bmax} \alpha }{2}\right) \vert\vert\nabla _ {\mathcal B} f(x)\vert\vert^2$ | 
In stochastic methods (such as stochastic gradient descent or randomised coordinate descent), it’s not always possible to guarantee local descent given the standard assumptions; you might just get unlucky and keep on picking directions or gradients which don’t actually make any progress. In these cases, you can instead show that you get local descent in expectation, rather than in every case.
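As an illustration of the point above, here is a minimal sketch of minibatch stochastic gradient descent for a finite-sum objective; the least-squares summands, the batch size and the step size are illustrative assumptions. Any single step may increase the objective, but the gradient estimate is unbiased and descent holds in expectation.

```python
import numpy as np

# Sketch of minibatch SGD for f(x) = (1/N) sum_j f_j(x), with illustrative
# least-squares summands f_j(x) = 0.5 * (a_j^T x - y_j)^2.
rng = np.random.default_rng(0)
N, d = 500, 5
A = rng.normal(size=(N, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

grad_f_j = lambda x, j: (A[j] @ x - y[j]) * A[j]

x = np.zeros(d)
alpha, batch_size = 0.01, 10
for k in range(2000):
    S_k = rng.choice(N, size=batch_size, replace=False)    # random batch
    G_k = np.mean([grad_f_j(x, j) for j in S_k], axis=0)   # unbiased estimate of grad f(x)
    x = x - alpha * G_k                                    # stochastic gradient step
```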
Likewise, when using cyclic coordinate descent the chosen coordinate directions might make no progress towards a minimiser (and this can be deliberately engineered, as it is with Powell’s objective; see [[Notes - Optimisation for Data Science HT25, Coordinate descent]]U). In this case, the overestimation property might depend on local information other than the magnitude of the gradient. For example, in the table you will notice that the overestimation property for gradient descent with line search under weaker assumptions on the descent direction contains an additional factor of $\cos^2 \theta _ k$, where $\theta _ k$ is the angle between the chosen descent direction $d^k$ and the steepest descent direction $-\nabla f(x^k)$; this reflects the fact that you only get local descent if your chosen direction points in roughly the right direction. If you upgrade the assumptions on $f$ to CL-smoothness, you can obtain descent in every iteration.
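For comparison, here is a minimal sketch of fixed-step cyclic coordinate descent, again on an illustrative quadratic; the matrix and the step $1/L _ \text{max}$ are assumptions for the example. On well-behaved problems this converges, but as noted above, examples like Powell’s objective show that cycling through the coordinates can fail to make progress in general.

```python
import numpy as np

# Sketch of fixed-step cyclic coordinate descent on an illustrative quadratic
# f(x) = 0.5 x^T A x - b^T x, updating one coordinate per iteration.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b

L_max = np.diag(A).max()           # componentwise Lipschitz constant for a quadratic
x = np.zeros(2)
n = len(x)
for k in range(200):
    i = k % n                      # cycle through the coordinates
    x[i] -= grad_f(x)[i] / L_max   # step along e_i only, with step 1/L_max
```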
When $f$ is nonconvex
When $f$ is nonconvex, $f$ might have lots of local minima. Since these methods are all local, all you can hope for is saying something about the rate that $ \vert \vert \nabla f(x^k) \vert \vert $ tends towards zero. For example, when using line search methods we have the overestimation property that
\[f(x^{k+1}) \le f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2\]Rearranging, you obtain
\[||\nabla f(x^k)||^2 \le \frac{L}{c_1 (1 - c_2) \eta^2} \big( f(x^k) - f(x^{k+1}) \big)\]Summing this inequality over iterations $0, \ldots, k-1$ telescopes the right-hand side to $f(x^0) - f(x^k) \le f(x^0) - C$, where $C$ is a lower bound on $f$, and bounding the minimum by the average then gives:
\[\min_{0 \le i \le k-1} ||\nabla f(x^i)|| \le \sqrt{\frac{L (f(x^0) - C)}{c_1(1-c_2)\eta^2 k} }\]When $f$ is convex
When $f$ is convex and has some minimiser $x^\ast$, that minimiser is a global minimiser, so it makes sense to talk about the rate at which $f(x^k)$ tends towards $f(x^\ast)$. It is convenient to let $\Delta _ k := f(x^k) - f(x^\ast)$ denote the gap between the value at the current iterate and the value at the global minimiser.
In this context, you typically have two inequalities:
- $\Delta _ {k+1} \le \Delta _ k - B(x^k)$ (the overestimation property you have proved)
- $\Delta _ k \le \nabla f(x^k)^\top (x^k - x^\ast)$ (rearranging the first-order characterisation of convexity)
There are two main ways the proofs proceed from here.
Finding a telescoping sum $\sum \Delta _ k \le \sum g(x^k) - g(x^{k+1})$
- [[Notes - Optimisation for Data Science HT25, Steepest descent]]U uses the foundational inequality and convexity to establish $f(x^{k+1}) \le f(x^\ast) + \frac L 2 \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1}-x^\ast \vert \vert ^2 \right)$ and then telescopes
- [[Notes - Optimisation for Data Science HT25, Proximal method]]U first proves the “foundational inequality of the prox method” $\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2$; when $z = x^k$, this gives $\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k}{2} \vert \vert G _ {\alpha _ k} (x^k) \vert \vert ^2$. Then it is established that $\phi(x^{k+1}) - \phi(x^\ast) \le \frac{1}{2\alpha _ k}( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2)$, and a telescoping sum finishes the proof
- [[Notes - Optimisation for Data Science HT25, Heavy ball method]]U
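Concretely, for the steepest descent bullet above, summing that inequality over $i = 0, \ldots, k-1$ gives
\[\sum_{i=0}^{k-1} \big(f(x^{i+1}) - f(x^\ast)\big) \le \frac L 2 \sum_{i=0}^{k-1} \left( \vert\vert x^i - x^\ast\vert\vert^2 - \vert\vert x^{i+1} - x^\ast\vert\vert^2 \right) = \frac L 2 \left( \vert\vert x^0 - x^\ast\vert\vert^2 - \vert\vert x^k - x^\ast\vert\vert^2 \right) \le \frac{L \vert\vert x^0 - x^\ast\vert\vert^2}{2}\]and since $f(x^{i+1}) \le f(x^i)$ for every $i$ (by the local descent property), the left-hand side is at least $k(f(x^k) - f(x^\ast))$, which recovers the $\frac{L\vert\vert x^0 - x^\ast\vert\vert^2}{2k}$ rate from the table.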
Finding a telescoping sum $\sum \frac{1}{\Delta _ k} - \frac{1}{\Delta _ {k+1} } \ge \cdots$
- [[Notes - Optimisation for Data Science HT25, Coordinate descent]]U uses this technique in the convex CL-smooth case (I still don’t fully understand why since I think there is a proof that )
When $f$ is $\gamma$-strongly convex
When $f$ is $\gamma$-strongly convex you can upgrade the inequality $\Delta _ k \le \nabla f(x^k)^\top (x^k - x^\ast)$ to the much stronger $\Delta _ k \le \frac{1}{2\gamma} \vert \vert \nabla f(x^k) \vert \vert ^2$. If your overestimation property is something like
\[f(x^{k+1}) \le f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2\]Then you obtain
\[\begin{aligned} \Delta_{k+1} &\le \Delta_k - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2 && (\text{rewriting}) \\\\ &\le \left(1 - \frac{2\gamma c_1 (1-c_2)}{L} \eta^2\right) \Delta_k && (\text{using } \vert\vert\nabla f(x^k)\vert\vert^2 \ge 2\gamma \Delta_k) \end{aligned}\]and then an inductive proof establishes that
\[f(x^{k}) - f(x^\ast) \le \left( 1 - \frac{2\gamma c_1 (1 - c_2)}{L} \eta^2 \right)^k (f(x^0) - f(x^\ast))\]Some of the assumptions
Okay descent directions
The search directions $d^k$ lie within a bounded angle of the steepest descent direction: there exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$,
\[\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||\](relevant to line search methods)
Wolfe conditions
$\alpha _ k$ is neither too short nor too long: there exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$, we have:
- Armijo condition: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- Curvature condition: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
(relevant to line search methods)
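Here is a small sketch of how these two conditions can be checked for a candidate step; the quadratic test function and the constants $c _ 1 = 10^{-4}$, $c _ 2 = 0.9$ are illustrative choices, not values fixed by the course.

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo and curvature conditions for step length alpha along direction d."""
    slope = grad_f(x) @ d                                    # directional derivative (negative for a descent direction)
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope
    curvature = grad_f(x + alpha * d) @ d >= c2 * slope
    return armijo and curvature

# Illustrative usage on a simple quadratic, with the steepest descent direction.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([1.0, -2.0])
d = -grad_f(x)
print(satisfies_wolfe(f, grad_f, x, d, alpha=0.5))   # True for this step length
```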
Unbiased gradient
$G^k$ (the estimate of the gradient) is an unbiased estimator of the true gradient, where the expectation is taken over the random batch $\mathcal S _ k$:
\[\mathbb E_{\mathcal S_k} [G^k] = \nabla f(X^k)\](relevant to stochastic gradient descent)
$L$-smooth summands
For all $j$, the summand $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous.
(relevant to stochastic gradient descent)
Exponentially decreasing variance
\[\text{Var}(G^k \mid \mathcal S_k) := \mathbb E[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k)) \mid \mathcal S_k] \le \frac {M} {\xi^k}\](relevant to stochastic gradient descent)
Unique coordinatewise minima
For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.
(relevant to coordinate descent)
Step size exact minimiser
$\alpha _ k$ is chosen to be a minimiser of $\alpha \mapsto f(x^k + \alpha d^k)$ over $\alpha$.
(relevant to coordinate descent)
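For an illustrative quadratic $f(x) = \frac 1 2 x^\top A x - b^\top x$ with $A$ positive definite (not a restriction made by the course), this minimiser has a closed form: setting the derivative of $\alpha \mapsto f(x^k + \alpha d^k)$ to zero gives
\[\alpha_k = -\frac{\nabla f(x^k)^\top d^k}{(d^k)^\top A d^k}\]and for coordinate descent with $d^k = e _ i$ this reduces to $\alpha _ k = -\frac{\partial f(x^k) / \partial x _ i}{A _ {ii}}$.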
CL-smooth (componentwise Lipschitz-smooth)
Each partial derivative of $f$ is “coordinatewise $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\](relevant to coordinate descent)
$D$-bounded initial sublevel set
$f$ has a $D$-bounded initial sublevel set if there exists a finite minimiser $x^\ast$ with $ \vert \vert x - x^\ast \vert \vert \le D$ for all $x$ with $f(x) \le f(x^0)$ (this terminology is not standard).
(relevant to coordinate descent)
Componentwise Wolfe conditions
There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$ and $1 \le i \le n$,
- $f\left(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i\right) \le f(x^k) - c _ 1 \alpha _ k \left( \frac{\partial f(x^k)}{\partial x _ i} \right)^2$
- $\left(\frac{\partial f(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i)}{\partial x _ i}\right) \left(-\frac{\partial f(x^k)}{\partial x _ i} \right) \ge -c _ 2 \left(\frac{\partial f(x^k)}{\partial x _ i}\right)^2$
(these are just the Wolfe conditions with $d^k$ as $-\frac{\partial f(x^k)}{\partial x _ i} e _ i$ and all the $\nabla f(x^k)$ replaced by $\frac{\partial f(x^k)}{\partial x _ i}$).
BL-smooth (blockwise Lipschitz-smooth)
A function with a natural block structure
\[f(x) = f(x _ {\mathcal B _ 1}, \ldots, x _ {\mathcal B _ \ell})\]is block Lipschitz smooth (BL-smooth) with blockwise Lipschitz constant $L _ \text{bmax}$ if for each block $\mathcal B$, there exists a Lipschitz constant $L _ \mathcal B$ such that
\[||\nabla _ {\mathcal B} f(x - \alpha \nabla _ {\mathcal B} f(x)) - \nabla _ {\mathcal B} f(x)|| \le L _ {\mathcal B} \times ||\alpha \nabla _ {\mathcal B} f(x)||\]for all $x \in \mathbb R^n$ and $\alpha \in \mathbb R$, and
\[L _ \text{bmax} := \max _ \mathcal B L _ {\mathcal B}\]Other results not covered in the course
Here are some other results which are not covered in the course. These mainly come from the “Optimization for Data Science” course at ETH Zurich.
Method | Result | Assumptions and notes
---|---|---
Fixed-step steepest descent | $\sum^{k-1} _ {i = 0} (f(x^i) - f(x^\ast)) \le \frac \alpha 2 \sum^{k-1} _ {i = 0}\vert\vert\nabla f(x^i)\vert\vert^2 + \frac 1 {2\alpha} \vert\vert x^0 - x^\ast\vert\vert^2$ | 
Fixed-step steepest descent | $\frac 1 k \sum^{k-1} _ {i = 0} (f(x^i) - f(x^\ast)) \le \frac{\ell \vert\vert x^0 - x^\ast\vert\vert}{\sqrt k}$ | 