Optimisation for Data Science HT25, Overview of convergence results
Table of results
Method | Result | Assumptions |
---|---|---|
Fixed-step steepest descent, general case | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{2L(f(x^0) - C)}{T}}$ | $L$-smooth, bounded below by $C$, step size $1/L$ |
Fixed-step steepest descent, convex | $f(x^k) - f(x^\ast) \le \frac{L\vert\vert x^0 - x^\ast \vert\vert^2}{2k}$ | $L$-smooth, step size $1/L$ |
Fixed-step steepest descent, $\gamma$-strongly convex | $f(x^k) - f(x^\ast) \le \left(1-\frac{\gamma}{L}\right)^k (f(x^0) - f(x^\ast))$ | $L$-smooth, step size $1/L$, unique minimiser |
Line search, general case | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert \vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1(1-c _ 2)T}}$ | $L$-smooth, bounded below by $C$, okay descent directions, Wolfe conditions |
Proximal method for $\varphi(x) := f(x) + \lambda \psi(x)$ | $\varphi(x^k) - \varphi(x^\ast) \le \frac{L \vert\vert x^0 - x^\ast \vert \vert^2}{2k}$ | $f$ $L$-smooth and convex, $\psi$ convex |
Nesterov’s accelerated gradient method, convex case, dynamic step lengths | $f(x^k) - f(x^\ast) \le \frac{4L\vert\vert x^0 - x^\ast \vert\vert^2}{(k+2)^2}$ | $L$-smooth, finite minimiser and finite minimum |
Nesterov’s accelerated gradient method, $\gamma$-strongly convex case, fixed step lengthsᴾ | $f(x^k) - f(x^\ast) \le \frac{L + \gamma}{2} \left( 1 - \sqrt{\frac \gamma L} \right)^k \vert\vert x^0 - x^\ast\vert\vert^2$ | $L$-smooth |
Stochastic gradient descent, general case | $\min _ {0 \le i \le k - 1} \mathbb E[\vert\vert \nabla f(X^i) \vert\vert^2]\le\eta M + \frac{2L(f(x^0) - f _ \text{low})}{k \eta}$ | Unbiased gradient, bounded below, $L$-smooth summands, bounded variance, step size $\eta/L$ |
Stochastic gradient descent, $\gamma$-strongly convex | $\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L}\right)^k \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]$ | Unbiased gradient, bounded below, $L$-smooth summands, bounded variance, step size $\eta/L$ |
Stochastic gradient descent, $\gamma$-strongly convex, variable step size | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{\frac{2L}{\gamma}\max\left(\frac M \gamma, f(x^0) - f(x^\ast)\right)}{\frac{2L}{\gamma} + k}$ | Unbiased gradient, $L$-smooth summands, bounded variance, step size $\frac{2}{2L + k\gamma}$ |
Stochastic gradient descent, $\gamma$-strongly convex, exponentially decreasing variance | $\mathbb E[f(X^k)] - f(x^\ast) \le \nu \rho^k$, $\nu := \max \left( \frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast) \right)$ and $\rho := \max \left( 1 - \frac{\alpha \gamma}{2}, \frac{1}{\xi} \right) < 1$ | Unbiased gradient, $L$-smooth summands, exponentially decreasing variance, step size $\alpha = \eta / L$ |
SVRG, $\gamma$-strongly convex | $\mathbb E[f(X^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))$ where $\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1$ | $L$-smooth and convex summands, step size $\alpha < \frac{1}{4L}$ and $p > \frac{1}{\gamma \alpha (1 - 4L\alpha)}$ |
Cyclic coordinate descent, general case | $\sum^\infty _ {k=0}\left(\cos^2 \theta _ k \times \vert\vert \nabla f(x^k) \vert\vert^2 \right) < \infty$ | $L$-smooth, bounded below, Wolfe conditions |
Cyclic coordinate descent, unique coordinatewise minima | $\text{Acc}\big((x^{j}) _ {j \in n\mathbb N}\big) \subseteq \{x \in \mathbb R^n : \nabla f(x) = 0\}$ | Continuously differentiable, bounded below, unique coordinatewise minima, step size exact minimiser |
Cyclic coordinate descent, convexᴾ | $f(x^k) - f(x^\ast) \le \frac{4nD^2 L _ \max (1 + n\frac{L^2}{L^2 _ \max})}{k+8}$ | CL-smooth, $D$-bounded initial sublevel set, step size $1/L _ \text{max}$ |
Cyclic coordinate descent, $\gamma$-strongly convexᴾ | $f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L _ \max \left(1 + n\frac{L^2}{L _ \max^2}\right)}\right)^{k/n} (f(x^0) - f(x^\ast))$ | CL-smooth, step size $1/L _ \text{max}$, $k$ multiple of $n$ |
Gauss-Southwell coordinate descent, convexᴴ | $f(x^k) - f(x^\ast) \le \frac{2nL _ \text{max} D^2}{k}$ | CL-smooth, step size $1/L _ \text{max}$ |
Randomised coordinate descent, general case | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | L-smooth, bounded below, line search, Wolfe conditions |
Randomised coordinate descent, general case | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}$ | CL-smooth, bounded below, step size $1/L _ \text{max}$ |
Randomised coordinate descent, general caseᵀ | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL _ \text{max} (f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | CL-smooth, bounded below, line search, componentwise Wolfe conditions |
Randomised coordinate descent, convex | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \text{max}D^2}{k}$ | CL-smooth, $D$-bounded initial sublevel set, step size $1/L _ \text{max}$ |
Randomised coordinate descent, convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}$ | CL-smooth, $D$-bounded initial sublevel set, line search, componentwise Wolfe conditions |
Randomised coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \max} \right)^k (f(x^0) - f(x^\ast))$ | CL-smooth, step size $1/L _ \text{max}$. |
Randomised coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$ | CL-smooth, line search, componentwise Wolfe conditions |
Randomised block coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{\gamma}{\ell L _ \text{bmax} }\right)^k(f(x^0) - f(x^\ast))$ | BL-smooth, step size $1/L _ \text{bmax}$, known set of blocks |
Randomised block coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{n _ 0 \gamma}{n L _ \text{bmax} }\right)^k(f(x^0) - f(x^\ast))$ | BL-smooth, step size $1/L _ \text{bmax}$, all blocks of size $n _ 0$ |
where:
- (ᴾ) denotes that the result was not proved.
- (ᴸ) denotes that the result is in the lecture notes, but not in the slides.
- (ᵀ) denotes that the result is in the slides, but not in the lecture notes.
- (ᴴ) denotes that the result is only in the problem sheets.
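As a quick sanity check on the kind of guarantee in the table, the sketch below (not part of the course material) runs fixed-step steepest descent with step size $1/L$ on a strongly convex quadratic and checks the $\left(1 - \frac{\gamma}{L}\right)^k$ bound numerically; the matrix, the random seed and the iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(20, 20))
A = Q.T @ Q + np.eye(20)                   # symmetric positive definite Hessian
gamma, L = np.linalg.eigvalsh(A)[[0, -1]]  # strong convexity and smoothness constants

f = lambda x: 0.5 * x @ A @ x              # minimised at x* = 0 with f(x*) = 0
x = rng.normal(size=20)
gap0 = f(x)                                # f(x^0) - f(x*)

for k in range(1, 51):
    x = x - (1.0 / L) * (A @ x)            # x^{k+1} = x^k - (1/L) grad f(x^k)
    assert f(x) <= (1 - gamma / L) ** k * gap0 + 1e-12
print("strongly convex rate bound holds for 50 iterations")
```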
Overestimation properties
Most of the results above follow from proving some “overestimation property”:
Method | Property | Assumptions |
---|---|---|
Fixed-step gradient descent | $f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \vert\vert\nabla f(x^k)\vert\vert^2$ | $L$-smooth |
Line-search gradient descent | $f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \eta^2 \vert \vert \nabla f(x^k) \vert \vert^2$ | $L$-smooth, Wolfe conditions |
Proximal method | $\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k}{2}\vert \vert G _ {\alpha _ k}(x^k)\vert \vert^2$ | $f$ is $L$-smooth, $\psi$ is convex |
Nesterov acceleration | $f(x^k) \le f(y^k) - \frac{\alpha _ k}{2} \vert \vert \nabla f(y^k) \vert \vert^2$ | $L$-smooth |
Stochastic gradient descent | $\mathbb E[f(X^{k+1})\mid X^k] \le f(X^k) + \alpha _ k \left( \frac{L\alpha _ k}{2} - 1 \right) \vert \vert \nabla f(X^k)\vert \vert^2 + \frac{ML\alpha _ k^2}{2}$ | Unbiased gradient, $L$-smooth summands, bounded variance |
Coordinate descent | $f(x^{k+1}) \le f(x^k) - \alpha _ k \left( 1 - \frac{L _ \text{max} }{2} \alpha _ k \right) \left( \frac{\partial f}{\partial x _ {i _ k}}(x^k) \right)^2$ | CL-smooth |
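For example, the fixed-step property in the first row follows by substituting the update $x^{k+1} = x^k - \frac 1L \nabla f(x^k)$ into the $L$-smoothness overestimate of $f$ (a sketch of the standard argument):

\[f(x^{k+1}) \le f(x^k) + \nabla f(x^k)^\top (x^{k+1} - x^k) + \frac{L}{2} \vert\vert x^{k+1} - x^k \vert\vert^2 = f(x^k) - \frac{1}{L}\vert\vert \nabla f(x^k) \vert\vert^2 + \frac{1}{2L}\vert\vert \nabla f(x^k) \vert\vert^2 = f(x^k) - \frac{1}{2L}\vert\vert \nabla f(x^k) \vert\vert^2.\]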
Proof techniques
Proofs in the $L$-smooth, possibly non-convex case
- Find a “foundational inequality” giving the per-iteration decrease of the objective in terms of the gradient
- Use a telescoping sum
- Examples:
    - [[Notes - Optimisation for Data Science HT25, Steepest descent]] uses $L$-smoothness to establish $f\left(x^{k+1}\right) \le f(x^k) - \frac 1 {2L} \vert \vert \nabla f(x^k) \vert \vert ^2$ and then telescopes (this argument is worked through after this list)
    - [[Notes - Optimisation for Data Science HT25, Line search methods]] uses $L$-smoothness and the Wolfe conditions to establish $f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta^2}{L} \vert \vert \nabla f(x^k) \vert \vert ^2$ and then telescopes
    - [[Notes - Optimisation for Data Science HT25, Stochastic gradient descent]] proves that $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 \vert \vert \nabla f(X^k) \vert \vert ^2 + \frac{M L \alpha^2}{2}$ and then uses a telescoping argument
    - [[Notes - Optimisation for Data Science HT25, Coordinate descent]]
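Concretely, the steepest-descent argument referenced above finishes as follows (a sketch of the standard telescoping step, with $C$ a lower bound on $f$): summing the per-iteration decrease over $k = 0, \ldots, T-1$ gives

\[\frac{1}{2L} \sum_{k=0}^{T-1} \vert\vert \nabla f(x^k) \vert\vert^2 \le \sum_{k=0}^{T-1} \left( f(x^k) - f(x^{k+1}) \right) = f(x^0) - f(x^T) \le f(x^0) - C,\]

so $T \min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert\vert^2 \le 2L(f(x^0) - C)$, which rearranges to the bound in the table.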
Proofs in the convex case
- Relate the optimality gap $f(x^{k+1}) - f(x^\ast)$ to the per-iteration decrease of the error in the iterates, $ \vert \vert x^{k} - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2$
- Use a telescoping sum
- Examples:
    - [[Notes - Optimisation for Data Science HT25, Steepest descent]] uses the foundational inequality and convexity to establish $f(x^{k+1}) \le f(x^\ast) + \frac L 2 \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1}-x^\ast \vert \vert ^2 \right)$ and then telescopes (sketched after this list)
    - [[Notes - Optimisation for Data Science HT25, Proximal methods]] first proves the “foundational inequality of the prox method” $\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2$; when $z = x^k$, this gives $\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k}{2} \vert \vert G _ {\alpha _ k} (x^k) \vert \vert ^2$. Taking $z = x^\ast$ instead establishes $\phi(x^{k+1}) - \phi(x^\ast) \le \frac{1}{2\alpha _ k}( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2)$, which is then telescoped
    - [[Notes - Optimisation for Data Science HT25, Accelerated methods]]
    - [[Notes - Optimisation for Data Science HT25, Coordinate descent]] uses a different technique involving the sequence $\Delta _ k := f(x^k) - f(x^\ast)$ and the fact that convexity implies $f(x^k) - f(x^\ast) \le D \vert \vert \nabla f(x^k) \vert \vert $
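As a sketch of the telescoping step for steepest descent in the convex case: summing the inequality from the first example above over $k = 0, \ldots, K-1$ and using that $f(x^k)$ is non-increasing gives

\[K \left( f(x^K) - f(x^\ast) \right) \le \sum_{k=0}^{K-1} \left( f(x^{k+1}) - f(x^\ast) \right) \le \frac{L}{2} \left( \vert\vert x^0 - x^\ast \vert\vert^2 - \vert\vert x^K - x^\ast \vert\vert^2 \right) \le \frac{L}{2} \vert\vert x^0 - x^\ast \vert\vert^2,\]

so $f(x^K) - f(x^\ast) \le \frac{L \vert\vert x^0 - x^\ast \vert\vert^2}{2K}$, matching the table.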
Proofs in the $\gamma$-strongly convex case
- Conclude something like $ \vert \vert \nabla f(x^k) \vert \vert ^2 \ge 2\gamma(f(x^k) - f(x^\ast))$, and then substitute back into the foundational inequality to obtain $f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac{\gamma}{L}\right) (f(x^k) - f(x^\ast))$ (see the sketch below)
- Examples:
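As a sketch for fixed-step steepest descent with step size $1/L$, combining the inequality above with the foundational inequality gives

\[f(x^{k+1}) - f(x^\ast) \le f(x^k) - f(x^\ast) - \frac{1}{2L} \vert\vert \nabla f(x^k) \vert\vert^2 \le \left( 1 - \frac{\gamma}{L} \right) \left( f(x^k) - f(x^\ast) \right),\]

and iterating yields the linear rate in the table.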
Assumptions
Okay descent directions
The search directions $d^k$ lie within a bounded angle of the steepest descent direction: there exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$,
\[\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||\]
(relevant to line search methods)
Wolfe conditions
$\alpha _ k$ is neither too short nor too long: there exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$, we have:
- Armijo condition: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- Curvature condition: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
(relevant to line search methods)
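A minimal sketch (not from the notes) of checking these two conditions for a trial step length; `f`, `grad`, the default constants `c1`, `c2` and the quadratic test function are illustrative choices:

```python
import numpy as np

def satisfies_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo and curvature conditions for step length alpha along d."""
    g0 = grad(x) @ d                                       # directional derivative at x
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * g0    # sufficient decrease
    curvature = grad(x + alpha * d) @ d >= c2 * g0         # step not too short
    return armijo and curvature

# Example on f(x) = 0.5 ||x||^2 with the steepest-descent direction:
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([2.0, -1.0])
print(satisfies_wolfe(f, grad, x, d=-grad(x), alpha=1.0))  # True: exact line minimiser
```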
Unbiased gradient
$G^k$ (the estimate of the gradient) is an unbiased estimator of the true gradient, where the expectation is taken over the random batch $\mathcal S _ k$:
\[\mathbb E_{\mathcal S_k} [G^k] = \nabla f(X^k)\]
(relevant to stochastic gradient descent)
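For a finite-sum objective $f(x) = \frac 1n \sum _ j f _ j(x)$, sampling the batch uniformly at random and averaging the summand gradients gives such an unbiased estimator. A small numerical illustration (a sketch with an arbitrary least-squares objective, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 1000, 5, 32
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

def full_grad(x):
    # gradient of f(x) = (1/2n) ||Ax - b||^2
    return A.T @ (A @ x - b) / n

def batch_grad(x):
    # G^k: average of summand gradients over a batch sampled uniformly with replacement
    S = rng.integers(0, n, size=batch)
    return A[S].T @ (A[S] @ x - b[S]) / batch

# Averaging many independent batch gradients approaches the full gradient,
# illustrating E_{S_k}[G^k] = grad f(X^k).
est = np.mean([batch_grad(x) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full_grad(x)))  # close to zero
```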
$L$-smooth summands
For all $j$, $\nabla f _ j$ is $L$-smooth.
(relevant to stochastic gradient descent)
Exponentially decreasing variance
There exist constants $M \ge 0$ and $\xi > 1$ such that
\[\text{Var}_{\mathcal S_k}(G^k) := \mathbb E_{\mathcal S_k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le \frac {M} {\xi^k}\]
(relevant to stochastic gradient descent)
Unique coordinatewise minima
For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.
(relevant to coordinate descent)
Step size exact minimiser
$\alpha _ k$ is chosen to be a minimiser of $\alpha \mapsto f(x^k + \alpha d^k)$.
(relevant to coordinate descent)
CL-smooth (componentwise Lipschitz-smooth)
Each partial derivative of $f$ is “coordinatewise $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
where $L _ \text{max} := \max _ i L _ i$.
(relevant to coordinate descent)
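For example (a standard illustration, not from the notes): for a quadratic $f(x) = \frac 12 x^\top A x - b^\top x$ with $A$ symmetric positive semidefinite,

\[\frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) = t A_{ii},\]

so $L _ i = A _ {ii}$ and $L _ \text{max} = \max _ i A _ {ii}$, while the full Lipschitz constant is $L = \lambda _ {\max}(A)$; in particular $L _ \text{max} \le L \le n L _ \text{max}$.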
$D$-bounded initial sublevel set
$f$ has a $D$-bounded initial sublevel set if there exists a finite minimiser $x^\ast$ with $ \vert \vert x - x^\ast \vert \vert \le D$ for all $x$ with $f(x) \le f(x^0)$ (this terminology is not standard).
(relevant to coordinate descent)
Componentwise Wolfe conditions
TODO
Checklist
- Fixed-step steepest descent
    - $L$-smooth
    - Convex
    - $\gamma$-strongly convex
- Line search methods
    - $L$-smooth (only case covered)
- Proximal methods
    - Sum of $L$-smooth convex and convex (only case covered)
- Accelerated methods
    - Heavy ball method
        - Convex quadratic functions
    - Nesterov’s accelerated gradient method
        - Convex
        - $\gamma$-strongly convex (no proof)
- Stochastic gradient descent
    - $L$-smooth
    - Convex
    - $\gamma$-strongly convex
    - Variance reduction
        - Exponentially decreasing variance
        - SVRG