Optimisation for Data Science HT25, Overview of convergence results
- Table of results
- Proving these results
- Some of the assumptions
- Okay descent directions
- Wolfe conditions
- Unbiased gradient
- $L$-smooth summands
- Exponentially decreasing variance
- Unique coordinatewise minima
- Step size exact minimiser
- CL-smooth (componentwise Lipschitz-smooth)
- $D$-bounded initial sublevel set
- Componentwise Wolfe conditions
- BL-smooth (blockwise Lipschitz-smooth)
- Other results not covered in the course
Table of results
The bulk of this course consists of proving a lot of different results about iterative algorithms for solving optimisation problems
\[\min_{x \in \mathcal D \subseteq \mathbb R^d} f(x)\]under different assumptions on the objective function $f$. The main algorithms considered are:
- Steepest descent: The classic gradient descent algorithm
- Line search: Not an algorithm per se, but a method for determining directions and step lengths
- Proximal method: A specialised algorithm for solving regularised problems
- Heavy ball method: A momentum-based algorithm
- Nesterov’s accelerated gradient method: A more general momentum-based algorithm
- Stochastic gradient descent: Gradient descent with randomised gradients
- SVRG: “Stochastic variance-reduced gradient” descent, which combines stochastic gradients with occasional full-gradient computations to reduce variance and improve convergence
- Coordinate descent: Gradient descent along a subset of the coordinate directions at each iteration
and the main assumptions considered explicitly here are:
- $L$-smoothness: Does the function have $L$-Lipschitz continuous gradients?
- Convexity: Is the function well shaped for optimisation?
- $\gamma$-strong convexity: Is the function really well shaped for optimisation?
There are also a few assumptions that are taken implicitly throughout these results. For example, it makes sense to assume that the functions are bounded below, otherwise there is no hope of converging to a minimiser (although this is not necessary in the $\gamma$-strongly convex case).
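As a reference point for the table below, here is a minimal sketch of the fixed-step steepest descent iteration $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$ that the first few rows analyse. The quadratic objective, the data and the choice of step $1/L$ are illustrative assumptions for the example, not something specified by the course.

```python
import numpy as np

# Minimal sketch of fixed-step steepest descent on an illustrative convex quadratic
# f(x) = 0.5 x^T A x - b^T x, whose gradient is L-Lipschitz with L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient
x = np.zeros(2)                   # x^0
for k in range(200):
    x = x - grad_f(x) / L         # x^{k+1} = x^k - (1/L) grad f(x^k)

x_star = np.linalg.solve(A, b)    # exact minimiser, for comparison
print(f(x) - f(x_star))           # gap f(x^k) - f(x^*)
```

For a convex objective like this one, the printed gap should sit below the $\frac{L\vert\vert x^0 - x^\ast \vert\vert^2}{2k}$ bound from the table.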
Method | Result | Assumptions
---|---|---
Steepest descent, fixed step, general case (proof) | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{T}}$ | 
Steepest descent, fixed step, convex (proof) | $f(x^k) - f(x^\ast) \le \frac{L\vert\vert x^0 - x^\ast \vert\vert^2}{2k}$ | 
Steepest descent, fixed step, $\gamma$-strongly convex (proof) | $f(x^k) - f(x^\ast) \le \left(1-\frac{\gamma}{L}\right)^k (f(x^0) - f(x^\ast))$ | 
Gradient descent, line search, general case (proof) | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1(1-c _ 2)T}}$ | 
Gradient descent, line search, $\gamma$-strongly convex (sheets only) | $f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac{2\eta^2 \gamma c _ 1 (1 - c _ 2)}{L}\right)(f(x^k) - f(x^\ast))$ | 
Proximal method for $\varphi(x) = f(x) + \lambda \psi(x)$, fixed step, $f$ convex and smooth, $\psi$ convex (proof) | $\varphi(x^k) - \varphi(x^\ast) \le \frac{L \vert\vert x^0 - x^\ast \vert \vert^2}{2k}$ | 
Heavy ball method for $\frac 1 2 x^\top A x - b^\top x$, fixed step, quadratic function (stated, but not proved) | $\vert\vert x^k - x^\ast\vert \vert \le C \beta^k$ | 
Nesterov’s accelerated gradient method, dynamic step, convex case (proof) | $f(x^k) - f(x^\ast) \le \frac{4L\vert\vert x^0 - x^\ast \vert\vert^2}{(k+2)^2}$ | 
Nesterov’s accelerated gradient method, fixed step, $\gamma$-strongly convex case (stated, but not proved) | $f(x^k) - f(x^\ast) \le \frac{L + \gamma}{2} \left( 1 - \sqrt{\frac \gamma L} \right)^k \vert\vert x^0 - x^\ast\vert\vert^2$ | 
Stochastic gradient descent, fixed step, general case (proof) | $\min _ {0 \le i \le k - 1} \mathbb E[\vert\vert \nabla f(X^i) \vert\vert^2]\le\eta M + \frac{2L(f(x^0) - f _ \text{low})}{k \eta}$ | 
Stochastic gradient descent, fixed step, $\gamma$-strongly convex (proof) | $\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L}\right)^k \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]$ | 
Stochastic gradient descent, fixed step, $\gamma$-strongly convex, exponentially decreasing variance (currently unproven) | $\mathbb E[f(X^k)] - f(x^\ast) \le \nu \rho^k$, where $\nu := \max \left( \frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast) \right)$ and $\rho := \max \left( 1 - \frac{\alpha \gamma}{2}, \frac{1}{\xi} \right) < 1$ | 
Stochastic gradient descent, dynamic step, $\gamma$-strongly convex (proof) | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{\frac{2L}{\gamma}\max\left(\frac M \gamma, f(x^0) - f(x^\ast)\right)}{\frac{2L}{\gamma} + k}$ | 
Stochastic gradient descent, fixed step, $\gamma$-strongly convex, increased batch size (currently unproven) | $\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma p} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma p} \right]$ | 
SVRG, $\gamma$-strongly convex (currently unproven) | $\mathbb E[f(X^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))$ where $\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L}{1 - 2L\alpha} < 1$ | 
Cyclic coordinate descent, fixed step, general case (proof) | $\sum^\infty _ {k=0}\left(\cos^2 \theta _ k \times \vert\vert \nabla f(x^k) \vert\vert^2 \right) < \infty$ | 
Cyclic coordinate descent, fixed step, unique coordinatewise minima (stated, in lecture notes only) | $\text{Acc}\big((x^{j}) _ {j \in n\mathbb N}\big) \subseteq \{x \in \mathbb R^n : \nabla f(x) = 0\}$ | 
Cyclic coordinate descent, fixed step, convex (stated, but not proved) | $f(x^k) - f(x^\ast) \le \frac{4nD^2 L _ \max (1 + n\frac{L^2}{L^2 _ \max})}{k+8}$ | 
Cyclic coordinate descent, fixed step, $\gamma$-strongly convex (stated, but not proved) | $f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L _ \max \left(1 + n\frac{L^2}{L _ \max^2}\right)}\right)^{k/n} (f(x^0) - f(x^\ast))$ | 
Gauss-Southwell coordinate descent, fixed step, convex (sheets only) | $f(x^k) - f(x^\ast) \le \frac{2nL _ \text{max} D^2}{k}$ | 
Randomised coordinate descent, fixed step, general case (proof) | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}$ | 
Randomised coordinate descent, fixed step, convex (proof) | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \text{max}D^2}{k}$ | 
Randomised coordinate descent, fixed step, $\gamma$-strongly convex (proof, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \max} \right)^k (f(x^0) - f(x^\ast))$ | 
Randomised coordinate descent, line search, general case (currently unproven) | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | 
Randomised coordinate descent, componentwise line search, general case (currently unproven, in slides only) | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL _ \text{max} (f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | 
Randomised coordinate descent, componentwise line search, convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}$ | 
Randomised coordinate descent, componentwise line search, $\gamma$-strongly convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$ | 
Randomised block coordinate descent, fixed step, $\gamma$-strongly convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{\gamma}{\ell L _ \text{bmax} }\right)^k(f(x^0) - f(x^\ast))$ | 
Randomised block coordinate descent, fixed step, $\gamma$-strongly convex (currently unproven, in slides only) | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{n _ 0 \gamma}{n L _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$ | 
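To make the proximal method row concrete, here is a minimal sketch of proximal gradient descent for $\varphi(x) = f(x) + \lambda \psi(x)$ with an illustrative least-squares $f$ and $\psi = \vert\vert \cdot \vert\vert _ 1$; the data, the regularisation weight and the fixed step $1/L$ are assumptions made for the example rather than part of the course material.

```python
import numpy as np

# Sketch of the proximal method for phi(x) = f(x) + lam * ||x||_1,
# with an illustrative smooth convex f(x) = 0.5 * ||Ax - y||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
y = rng.normal(size=50)
lam = 0.5

grad_f = lambda x: A.T @ (A @ x - y)

def prox_l1(z, t):
    # proximal operator of t * ||.||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of grad f (largest singular value squared)
alpha = 1.0 / L                   # fixed step
x = np.zeros(10)
for k in range(500):
    # x^{k+1} = prox_{alpha * lam * psi}(x^k - alpha * grad f(x^k))
    x = prox_l1(x - alpha * grad_f(x), alpha * lam)
```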
Proving these results
Most of the results above follow from proving some “overestimation” or “local descent” property, which is a result that places an upper bound on the next iterate given the current iterate and other local information like the gradient. Global results on the convergence of each of the methods are then obtained by applying these results repeatedly to analyse the sequence of iterates as a whole.
Method | Property | Assumptions
---|---|---
Steepest descent, fixed step (proof) | $f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \vert\vert\nabla f(x^k)\vert\vert^2$ | 
Gradient descent, line search (proof) | $f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2$ | 
Proximal method, smooth + convex (proof) | $\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert\vert G _ \alpha(x)\vert \vert^2$ | 
Stochastic gradient descent, fixed step (proof) | $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha _ k\left( 1-\frac{L\alpha _ k}{2} \right) \vert\vert\nabla f(X^k)\vert\vert^2 + \frac{ML\alpha _ k^2}{2}$ | 
Gradient descent, line search, weaker assumptions on descent direction (proof) | $f(x^{k+1}) \le f(x^k) - \frac{c _ 1(1-c _ 2)}{L} \cos^2 \theta _ k \vert\vert\nabla f(x^k)\vert\vert^2$ | 
Randomised coordinate descent, line search, L-smooth (proof) | $\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{c _ 1 (1 - c _ 2)}{n L} \mathbb E[\vert \vert \nabla f(X^k)\vert \vert^2]$ | 
Coordinate descent, CL-smooth (proof) | $f(x^{k+1}) \le f(x^k) - \alpha _ k \left( 1 - \frac{L _ \text{max} \alpha _ k}{2} \right) \left( \frac{\partial f}{\partial x _ i}(x^k) \right)^2$ | 
Blockwise coordinate descent, BL-smooth (currently unproven) | $f(x - \alpha \nabla _ \mathcal B f(x)) \le f(x) - \alpha \left(1 - \frac{L _ \text{bmax} \alpha }{2}\right) \vert\vert\nabla _ {\mathcal B} f(x)\vert\vert^2$ | 
In stochastic methods (such as stochastic gradient descent or randomised coordinate descent), it’s not always possible to guarantee local descent given the standard assumptions; you might just get unlucky and keep on picking directions or gradients which don’t actually make any progress. In these cases, you can instead show that you get local descent in expectation, rather than in every case.
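As an illustration of the point above, here is a minimal sketch of minibatch stochastic gradient descent for a finite-sum objective; the least-squares summands, the batch size and the step size are illustrative assumptions. Any single step may increase the objective, but the gradient estimate is unbiased and descent holds in expectation.

```python
import numpy as np

# Sketch of minibatch SGD for f(x) = (1/N) sum_j f_j(x), with illustrative
# least-squares summands f_j(x) = 0.5 * (a_j^T x - y_j)^2.
rng = np.random.default_rng(0)
N, d = 500, 5
A = rng.normal(size=(N, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

grad_f_j = lambda x, j: (A[j] @ x - y[j]) * A[j]

x = np.zeros(d)
alpha, batch_size = 0.01, 10
for k in range(2000):
    S_k = rng.choice(N, size=batch_size, replace=False)    # random batch
    G_k = np.mean([grad_f_j(x, j) for j in S_k], axis=0)   # unbiased estimate of grad f(x)
    x = x - alpha * G_k                                    # stochastic gradient step
```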
Likewise, when using cyclic coordinate descent the chosen coordinate directions might make no progress towards a minimiser (and this can be deliberately engineered, as it is with Powell’s objective; see [[Notes - Optimisation for Data Science HT25, Coordinate descent]]U). In this case, the overestimation property might depend on local information other than the magnitude of the gradient. For example, in the table you will notice that the overestimation property for gradient descent with line search under weaker assumptions on the descent direction contains an additional factor of $\cos^2 \theta _ k$, where $\theta _ k$ is the angle between the chosen descent direction $d^k$ and the steepest descent direction $-\nabla f(x^k)$; this reflects the fact that you only get local descent if your chosen direction points in roughly the right direction. If you upgrade the assumptions on $f$ to CL-smoothness, you can obtain descent in every iteration.
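For comparison, here is a minimal sketch of fixed-step cyclic coordinate descent, again on an illustrative quadratic; the matrix and the step $1/L _ \text{max}$ are assumptions for the example. On well-behaved problems this converges, but as noted above, examples like Powell’s objective show that cycling through the coordinates can fail to make progress in general.

```python
import numpy as np

# Sketch of fixed-step cyclic coordinate descent on an illustrative quadratic
# f(x) = 0.5 x^T A x - b^T x, updating one coordinate per iteration.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b

L_max = np.diag(A).max()           # componentwise Lipschitz constant for a quadratic
x = np.zeros(2)
n = len(x)
for k in range(200):
    i = k % n                      # cycle through the coordinates
    x[i] -= grad_f(x)[i] / L_max   # step along e_i only, with step 1/L_max
```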
When $f$ is nonconvex
When $f$ is nonconvex, $f$ might have lots of local minima. Since these methods are all local, all you can hope for is saying something about the rate that $ \vert \vert \nabla f(x^k) \vert \vert $ tends towards zero. For example, when using line search methods we have the overestimation property that
\[f(x^{k+1}) \le f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2\]Rearranging, you obtain
\[||\nabla f(x^k)||^2 \le \frac{L}{c_1 (1 - c_2) \eta^2} \big( f(x^k) - f(x^{k+1}) \big)\]Summing this inequality over iterations $0, \ldots, k-1$ telescopes the right-hand side to $f(x^0) - f(x^k) \le f(x^0) - C$, where $C$ is a lower bound on $f$, and bounding the minimum by the average then gives:
\[\min_{0 \le i \le k-1} ||\nabla f(x^i)|| \le \sqrt{\frac{L (f(x^0) - C)}{c_1(1-c_2)\eta^2 k} }\]When $f$ is convex
When $f$ is convex and has some minimiser $x^\ast$, that minimiser is a global minimiser, so it makes sense to talk about the rate at which $f(x^k)$ tends towards $f(x^\ast)$. It is convenient to let $\Delta _ k := f(x^k) - f(x^\ast)$ denote the gap between the value at the current iterate and the value at the global minimiser.
In this context, you typically have two inequalities:
- $\Delta _ {k+1} \le \Delta _ k - B(x^k)$ (the overestimation property you have proved)
- $\Delta _ k \le \nabla f(x^k)^\top (x^k - x^\ast)$ (rearranging the first-order characterisation of convexity)
There are two main ways the proofs proceed from here.
Finding a telescoping sum $\sum \Delta _ k \le \sum g(x^k) - g(x^{k+1})$
- [[Notes - Optimisation for Data Science HT25, Steepest descent]]U uses the foundational inequality and convexity to establish $f(x^{k+1}) \le f(x^\ast) + \frac L 2 \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1}-x^\ast \vert \vert ^2 \right)$ and then telescopes
- [[Notes - Optimisation for Data Science HT25, Proximal method]]U first proves the “foundational inequality of the prox method” $\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2$; when $z = x^k$, this gives $\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k}{2} \vert \vert G _ {\alpha _ k} (x^k) \vert \vert ^2$. Then it is established that $\phi(x^{k+1}) - \phi(x^\ast) \le \frac{1}{2\alpha _ k}( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2)$, and a telescoping sum finishes the proof
- [[Notes - Optimisation for Data Science HT25, Heavy ball method]]U
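Concretely, for the steepest descent bullet above, summing that inequality over $i = 0, \ldots, k-1$ gives
\[\sum_{i=0}^{k-1} \big(f(x^{i+1}) - f(x^\ast)\big) \le \frac L 2 \sum_{i=0}^{k-1} \left( \vert\vert x^i - x^\ast\vert\vert^2 - \vert\vert x^{i+1} - x^\ast\vert\vert^2 \right) = \frac L 2 \left( \vert\vert x^0 - x^\ast\vert\vert^2 - \vert\vert x^k - x^\ast\vert\vert^2 \right) \le \frac{L \vert\vert x^0 - x^\ast\vert\vert^2}{2}\]and since $f(x^{i+1}) \le f(x^i)$ for every $i$ (by the local descent property), the left-hand side is at least $k(f(x^k) - f(x^\ast))$, which recovers the $\frac{L\vert\vert x^0 - x^\ast\vert\vert^2}{2k}$ rate from the table.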
Finding a telescoping sum $\sum \frac{1}{\Delta _ k} - \frac{1}{\Delta _ {k+1} } \ge \cdots$
- [[Notes - Optimisation for Data Science HT25, Coordinate descent]]U uses this technique in the convex CL-smooth case (I still don’t fully understand why since I think there is a proof that )
When $f$ is $\gamma$-strongly convex
When $f$ is $\gamma$-strongly convex you can upgrade the inequality $\Delta _ k \le \nabla f(x^k)^\top (x^k - x^\ast)$ to the much stronger $\Delta _ k \le \frac{1}{2\gamma} \vert \vert \nabla f(x^k) \vert \vert ^2$. If your overestimation property is something like
\[f(x^{k+1}) \le f(x^k) - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2\]Then you obtain
\[\begin{aligned} \Delta_{k+1} &\le \Delta_k - \frac{c_1 (1 - c_2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2 && (\text{rewriting}) \\\\ &\le \left(1 - \frac{2\gamma c_1 (1-c_2)}{L} \eta^2\right) \Delta_k && (\text{using } \vert\vert\nabla f(x^k)\vert\vert^2 \ge 2\gamma \Delta_k) \end{aligned}\]and then an inductive proof establishes that
\[f(x^{k}) - f(x^\ast) \le \left( 1 - \frac{2\gamma c_1 (1 - c_2)}{L} \eta^2 \right)^k (f(x^0) - f(x^\ast))\]Some of the assumptions
Okay descent directions
The search directions $d^k$ lie within a bounded angle of the steepest descent direction: there exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$,
\[\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||\](relevant to line search methods)
Wolfe conditions
$\alpha _ k$ is neither too short nor too long: there exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$, we have:
- Armijo condition: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- Curvature condition: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
(relevant to line search methods)
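Here is a small sketch of how these two conditions can be checked for a candidate step; the quadratic test function and the constants $c _ 1 = 10^{-4}$, $c _ 2 = 0.9$ are illustrative choices, not values fixed by the course.

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo and curvature conditions for step length alpha along direction d."""
    slope = grad_f(x) @ d                                    # directional derivative (negative for a descent direction)
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope
    curvature = grad_f(x + alpha * d) @ d >= c2 * slope
    return armijo and curvature

# Illustrative usage on a simple quadratic, with the steepest descent direction.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([1.0, -2.0])
d = -grad_f(x)
print(satisfies_wolfe(f, grad_f, x, d, alpha=0.5))   # True for this step length
```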
Unbiased gradient
$G^k$ (the estimate of the gradient) is an unbiased estimator of the true gradient, where the expectation is taken over the random batch $\mathcal S _ k$:
\[\mathbb E_{\mathcal S_k} [G^k] = \nabla f(X^k)\](relevant to stochastic gradient descent)
$L$-smooth summands
For all $j$, the summand $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous.
(relevant to stochastic gradient descent)
Exponentially decreasing variance
\[\text{Var}(G^k \mid \mathcal S_k) := \mathbb E[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k)) \mid \mathcal S_k] \le \frac {M} {\xi^k}\](relevant to stochastic gradient descent)
Unique coordinatewise minima
For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.
(relevant to coordinate descent)
Step size exact minimiser
$\alpha _ k$ is chosen to be a minimiser of $\alpha \mapsto f(x^k + \alpha d^k)$ over $\alpha$.
(relevant to coordinate descent)
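For an illustrative quadratic $f(x) = \frac 1 2 x^\top A x - b^\top x$ with $A$ positive definite (not a restriction made by the course), this minimiser has a closed form: setting the derivative of $\alpha \mapsto f(x^k + \alpha d^k)$ to zero gives
\[\alpha_k = -\frac{\nabla f(x^k)^\top d^k}{(d^k)^\top A d^k}\]and for coordinate descent with $d^k = e _ i$ this reduces to $\alpha _ k = -\frac{\partial f(x^k) / \partial x _ i}{A _ {ii}}$.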
CL-smooth (componentwise Lipschitz-smooth)
Each partial derivative of $f$ is “coordinatewise $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\](relevant to coordinate descent)
$D$-bounded initial sublevel set
$f$ has a $D$-bounded initial sublevel set if there exists a finite minimiser $x^\ast$ with $ \vert \vert x - x^\ast \vert \vert \le D$ for all $x$ with $f(x) \le f(x^0)$ (this terminology is not standard).
(relevant to coordinate descent)
Componentwise Wolfe conditions
There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$ and $1 \le i \le n$,
- $f\left(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i\right) \le f(x^k) - c _ 1 \alpha _ k \left( \frac{\partial f(x^k)}{\partial x _ i} \right)^2$
- $\left(\frac{\partial f(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i)}{\partial x _ i}\right) \left(-\frac{\partial f(x^k)}{\partial x _ i} \right) \ge -c _ 2 \left(\frac{\partial f(x^k)}{\partial x _ i}\right)^2$
(these are just the Wolfe conditions with $d^k$ as $-\frac{\partial f(x^k)}{\partial x _ i} e _ i$ and all the $\nabla f(x^k)$ replaced by $\frac{\partial f(x^k)}{\partial x _ i}$).
BL-smooth (blockwise Lipschitz-smooth)
A function with a natural block structure
\[f(x) = f(x _ {\mathcal B _ 1}, \ldots, x _ {\mathcal B _ \ell})\]is block Lipschitz smooth (BL-smooth) with blockwise Lipschitz constant $L _ \text{bmax}$ if for each block $\mathcal B$, there exists a Lipschitz constant $L _ \mathcal B$ such that
\[||\nabla _ {\mathcal B} f(x - \alpha \nabla _ {\mathcal B} f(x)) - \nabla _ {\mathcal B} f(x)|| \le L _ {\mathcal B} \times ||\alpha \nabla _ {\mathcal B} f(x)||\]for all $x \in \mathbb R^n$ and $\alpha \in \mathbb R$, and
\[L _ \text{bmax} := \max _ \mathcal B L _ {\mathcal B}\]Other results not covered in the course
Here are some other results which are not covered in the course. These mainly come from the “Optimization for Data Science” course at ETH Zurich.
Method | Result | Assumptions and notes
---|---|---
Fixed-step steepest descent | $\sum^{k-1} _ {i = 0} (f(x^i) - f(x^\ast)) \le \frac \alpha 2 \sum^{k-1} _ {i = 0}\vert\vert\nabla f(x^i)\vert\vert^2 + \frac 1 {2\alpha} \vert\vert x^0 - x^\ast\vert\vert^2$ | 
Fixed-step steepest descent | $\frac 1 k \sum^{k-1} _ {i = 0} (f(x^i) - f(x^\ast)) \le \frac{\ell \vert\vert x^0 - x^\ast\vert\vert}{\sqrt k}$ | 