Optimisation for Data Science HT25, Overview of convergence results



Table of results

| Method | Result | Assumptions |
| --- | --- | --- |
| Fixed-step steepest descent, general case | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert\vert \le \sqrt{\frac{2L(f(x^0) - C)}{T}}$ | $L$-smooth, bounded below by $C$, step size $1/L$ |
| Fixed-step steepest descent, convex | $f(x^k) - f(x^\ast) \le \frac{L\vert\vert x^0 - x^\ast \vert\vert^2}{2k}$ | $L$-smooth, step size $1/L$ |
| Fixed-step steepest descent, $\gamma$-strongly convex | $f(x^k) - f(x^\ast) \le \left(1-\frac{\gamma}{L}\right)^k (f(x^0) - f(x^\ast))$ | $L$-smooth, step size $1/L$, unique minimiser |
| Line search, general case | $\min _ {0 \le k \le T-1} \vert\vert \nabla f(x^k) \vert\vert \le \sqrt{\frac{L(f(x^0) - C)}{\eta^2 c _ 1(1-c _ 2)T}}$ | $L$-smooth, bounded below by $C$, okay descent directions, Wolfe conditions |
| Proximal method for $\varphi(x) = f(x) + \lambda \psi(x)$ | $\varphi(x^k) - \varphi(x^\ast) \le \frac{L \vert\vert x^0 - x^\ast \vert\vert^2}{2k}$ | $f$ $L$-smooth and convex, $\psi$ convex |
| Nesterov’s accelerated gradient method, convex case, dynamic step lengths | $f(x^k) - f(x^\ast) \le \frac{4L\vert\vert x^0 - x^\ast \vert\vert^2}{(k+2)^2}$ | $L$-smooth, convex, finite minimiser and finite minimum |
| Nesterov’s accelerated gradient method, $\gamma$-strongly convex case, fixed step lengthsᴾ | $f(x^k) - f(x^\ast) \le \frac{L + \gamma}{2} \left( 1 - \sqrt{\frac \gamma L} \right)^k \vert\vert x^0 - x^\ast\vert\vert^2$ | $L$-smooth |
| Stochastic gradient descent, general case | $\min _ {0 \le i \le k - 1} \mathbb E[\vert\vert \nabla f(X^i) \vert\vert^2] \le \eta M + \frac{2L(f(x^0) - f _ \text{low})}{k \eta}$ | Unbiased gradient, bounded below, $L$-smooth summands, bounded variance, step size $\eta/L$ |
| Stochastic gradient descent, $\gamma$-strongly convex | $\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L}\right)^k \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]$ | Unbiased gradient, bounded below, $L$-smooth summands, bounded variance, step size $\eta/L$ |
| Stochastic gradient descent, $\gamma$-strongly convex, variable step size | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{\frac{2L}{\gamma}\max\left(\frac M \gamma, f(x^0) - f(x^\ast)\right)}{\frac{2L}{\gamma} + k}$ | Unbiased gradient, $L$-smooth summands, bounded variance, step size $\frac{2}{2L + k\gamma}$ |
| Stochastic gradient descent, $\gamma$-strongly convex, exponentially decreasing variance | $\mathbb E[f(X^k)] - f(x^\ast) \le \nu \rho^k$, where $\nu := \max \left( \frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast) \right)$ and $\rho := \max \left( 1 - \frac{\alpha \gamma}{2}, \frac{1}{\xi} \right) < 1$ | Unbiased gradient, $L$-smooth summands, exponentially decreasing variance, fixed step size $\alpha$ |
| SVRG, $\gamma$-strongly convex | $\mathbb E[f(X^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))$, where $\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1$ | $L$-smooth and convex summands, step size $\alpha < \frac{1}{4L}$, epoch length $p > \frac{1}{\gamma \alpha (1 - 4L\alpha)}$ |
| Cyclic coordinate descent, general case | $\sum _ {k=0}^\infty \cos^2 \theta _ k \, \vert\vert \nabla f(x^k) \vert\vert^2 < \infty$ | $L$-smooth, bounded below, Wolfe conditions |
| Cyclic coordinate descent, unique coordinatewise minima | $\text{Acc}\big((x^{j}) _ {j \in n\mathbb N}\big) \subseteq \{x \in \mathbb R^n : \nabla f(x) = 0\}$ | Continuously differentiable, bounded below, unique coordinatewise minima, step size exact minimiser |
| Cyclic coordinate descent, convexᴾ | $f(x^k) - f(x^\ast) \le \frac{4nD^2 L _ \text{max} \left(1 + n\frac{L^2}{L _ \text{max}^2}\right)}{k+8}$ | CL-smooth, $D$-bounded initial sublevel set, step size $1/L _ \text{max}$ |
| Cyclic coordinate descent, $\gamma$-strongly convexᴾ | $f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L _ \text{max} \left(1 + n\frac{L^2}{L _ \text{max}^2}\right)}\right)^{k/n} (f(x^0) - f(x^\ast))$ | CL-smooth, step size $1/L _ \text{max}$, $k$ a multiple of $n$ |
| Gauss–Southwell coordinate descent, convexᴴ | $f(x^k) - f(x^\ast) \le \frac{2nL _ \text{max} D^2}{k}$ | CL-smooth, $D$-bounded initial sublevel set, step size $1/L _ \text{max}$ |
| Randomised coordinate descent, general case | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | $L$-smooth, bounded below, line search, Wolfe conditions |
| Randomised coordinate descent, general case | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{2n L _ \text{max} (f(x^0) - f _ \text{low})}{k}$ | CL-smooth, bounded below, step size $1/L _ \text{max}$ |
| Randomised coordinate descent, general caseᵀ | $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL _ \text{max} (f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$ | CL-smooth, bounded below, line search, componentwise Wolfe conditions |
| Randomised coordinate descent, convex | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \text{max}D^2}{k}$ | CL-smooth, $D$-bounded initial sublevel set, step size $1/L _ \text{max}$ |
| Randomised coordinate descent, convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}$ | CL-smooth, $D$-bounded initial sublevel set, line search, componentwise Wolfe conditions |
| Randomised coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \text{max}} \right)^k (f(x^0) - f(x^\ast))$ | CL-smooth, step size $1/L _ \text{max}$ |
| Randomised coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max}}\right)^k(f(x^0) - f(x^\ast))$ | CL-smooth, line search, componentwise Wolfe conditions |
| Randomised block coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{\gamma}{\ell L _ \text{bmax}}\right)^k(f(x^0) - f(x^\ast))$ | BL-smooth, step size $1/L _ \text{bmax}$, fixed partition into $\ell$ blocks |
| Randomised block coordinate descent, $\gamma$-strongly convexᵀ | $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{n _ 0 \gamma}{n L _ \text{max}}\right)^k(f(x^0) - f(x^\ast))$ | BL-smooth, step size $1/L _ \text{bmax}$, all blocks of size $n _ 0$ |

where:

  • (ᴾ) denotes that the result was not proved.
  • (ᴸ) denotes that the result is in the lecture notes, but not in the slides.
  • (ᵀ) denotes that the result is in the slides, but not in the lecture notes.
  • (ᴴ) denotes that the result is only in the problem sheets.
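
As a sanity check on the first row, here is a minimal NumPy sketch (my own illustration, not from the course materials) that runs fixed-step steepest descent on a least-squares objective and verifies the gradient-norm bound; all problem data and variable names are assumptions made for the example.

```python
import numpy as np

# Illustrative check of the first row: fixed-step steepest descent on
# f(x) = 0.5 * ||Ax - b||^2, which is L-smooth with L = ||A^T A||_2.
# (Hypothetical example data; not from the course.)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)               # smoothness constant
C = f(np.linalg.lstsq(A, b, rcond=None)[0])  # lower bound on f

T = 200
x0 = np.zeros(20)
x, grad_norms = x0, []
for _ in range(T):
    g = grad(x)
    grad_norms.append(np.linalg.norm(g))
    x = x - g / L                            # step size 1/L

# min_{0 <= k <= T-1} ||grad f(x^k)|| <= sqrt(2 L (f(x^0) - C) / T)
bound = np.sqrt(2 * L * (f(x0) - C) / T)
assert min(grad_norms) <= bound
```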

Overestimation properties

Most of the results above follow from proving some “overestimation property”:

| Method | Property | Assumptions |
| --- | --- | --- |
| Fixed-step gradient descent | $f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \vert\vert\nabla f(x^k)\vert\vert^2$ | $L$-smooth, step size $1/L$ |
| Line-search gradient descent | $f(x^{k+1}) \le f(x^k) - \frac{c _ 1 (1 - c _ 2)}{L} \eta^2 \vert\vert \nabla f(x^k) \vert\vert^2$ | $L$-smooth, okay descent directions, Wolfe conditions |
| Proximal method | $\varphi(x^{k+1}) \le \varphi(x^k) - \frac{\alpha _ k}{2}\vert\vert G _ {\alpha _ k}(x^k)\vert\vert^2$ | $f$ is $L$-smooth, $\psi$ is convex |
| Nesterov acceleration | $f(x^{k+1}) \le f(y^k) - \frac{\alpha _ k}{2} \vert\vert \nabla f(y^k) \vert\vert^2$ | $L$-smooth |
| Stochastic gradient descent | $\mathbb E[f(X^{k+1})\mid X^k] \le f(X^k) + \alpha _ k \left( \frac{L\alpha _ k}{2} - 1 \right) \vert\vert \nabla f(X^k)\vert\vert^2 + \frac{ML\alpha _ k^2}{2}$ | Unbiased gradient, $L$-smooth summands, bounded variance |
| Coordinate descent | $f(x^{k+1}) \le f(x^k) - \alpha _ k \left( 1 - \frac{L _ \text{max}}{2} \alpha _ k \right) \left( \frac{\partial f}{\partial x _ {i _ k}}(x^k) \right)^2$ | CL-smooth |
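
For instance, the fixed-step property is the $L$-smooth overestimation (descent lemma) $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \vert\vert y - x \vert\vert^2$ evaluated at the gradient step $y = x^{k+1} = x^k - \frac{1}{L} \nabla f(x^k)$:

\[f(x^{k+1}) \le f(x^k) - \frac{1}{L} \vert\vert \nabla f(x^k) \vert\vert^2 + \frac{1}{2L} \vert\vert \nabla f(x^k) \vert\vert^2 = f(x^k) - \frac{1}{2L} \vert\vert \nabla f(x^k) \vert\vert^2\]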

Proof techniques

Proofs in the $L$-smooth, possibly non-convex case

  • Sum the relevant overestimation property over iterations $0, \ldots, T-1$: the left-hand side telescopes to $f(x^T) - f(x^0)$.
  • Use boundedness below, $f(x^T) \ge C$, to bound the sum of squared gradient norms, then bound the minimum by the average.

Proofs in the convex case

  • Relate the objective error $f(x^{k+1}) - f(x^\ast)$ to the per-iteration decrease of the error in the iterates, $\vert\vert x^{k} - x^\ast \vert\vert^2 - \vert\vert x^{k+1} - x^\ast \vert\vert^2$
  • Use a telescoping sum (see the worked example after this list)
  • Examples:
    • [[Notes - Optimisation for Data Science HT25, Coordinate descent]] uses a different technique involving the sequence $\Delta _ k := f(x^k) - f(x^\ast)$ and the fact that, on a $D$-bounded sublevel set, convexity implies $f(x^k) - f(x^\ast) \le D \vert\vert \nabla f(x^k) \vert\vert$
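
For fixed-step gradient descent these two steps look as follows (a sketch consistent with the table above): combining the overestimation property with convexity gives

\[f(x^{k+1}) - f(x^\ast) \le \frac{L}{2}\left( \vert\vert x^k - x^\ast \vert\vert^2 - \vert\vert x^{k+1} - x^\ast \vert\vert^2 \right)\]

Summing over $k = 0, \ldots, K-1$ telescopes the right-hand side to at most $\frac{L}{2} \vert\vert x^0 - x^\ast \vert\vert^2$, and since $f(x^k)$ is non-increasing, the left-hand side is at least $K(f(x^K) - f(x^\ast))$, recovering the $\frac{L \vert\vert x^0 - x^\ast \vert\vert^2}{2K}$ rate.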

Proofs in the $\gamma$-strongly convex case

  • Use that $\gamma$-strong convexity implies the Polyak–Łojasiewicz inequality $\vert\vert \nabla f(x) \vert\vert^2 \ge 2\gamma (f(x) - f(x^\ast))$.
  • Substitute this into the overestimation property to obtain a per-iteration contraction, e.g. $f(x^{k+1}) - f(x^\ast) \le \left(1 - \frac{\gamma}{L}\right)(f(x^k) - f(x^\ast))$ for fixed-step gradient descent, and iterate.

Assumptions

Okay descent directions

The search directions $d^k$ lie within a bounded angle of the steepest descent direction: there exists a constant $0 < \eta \le 1$ such that for all $k \in \mathbb N$,

\[\nabla f(x^k)^\top d^k \le -\eta||\nabla f(x^k)|| \cdot ||d^k||\]

(relevant to line search methods; steepest descent, $d^k = -\nabla f(x^k)$, satisfies this with $\eta = 1$)

Wolfe conditions

$\alpha _ k$ is neither too short nor too long: there exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$, we have:

  • Armijo condition: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
  • Curvature condition: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$

(relevant to line search methods)
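
As a worked illustration of the two conditions, here is a small Python sketch (the function name and setup are mine, not from the notes) that checks whether a candidate step size satisfies both:

```python
import numpy as np

def satisfies_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo and curvature conditions for step size alpha
    along direction d at point x (requires 0 < c1 < c2 < 1)."""
    slope = grad(x) @ d  # directional derivative; negative for a descent direction
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope
    curvature = grad(x + alpha * d) @ d >= c2 * slope
    return armijo and curvature
```

A backtracking or bracketing line search would shrink or grow $\alpha$ until this returns true; for steepest descent one takes $d = -\nabla f(x)$.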

Unbiased gradient

$G^k$ (the estimate of the gradient) is an unbiased estimator of the true gradient, in expectation over the sampled batch $\mathcal S _ k$:

\[\mathbb E_{\mathcal S_k} [G^k] = \nabla f(X^k)\]

(relevant to stochastic gradient descent)
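
To make the estimator concrete, here is a sketch (assuming the finite-sum setup $f = \frac{1}{n} \sum _ j f _ j$; the helper name is mine, not from the notes) of the standard mini-batch estimator, which is unbiased because each index is sampled uniformly:

```python
import numpy as np

def minibatch_gradient(grad_fs, x, batch_size, rng):
    """Unbiased estimate of grad f(x) for f = (1/n) * sum_j f_j:
    average the gradients of a uniformly sampled batch of summands."""
    idx = rng.choice(len(grad_fs), size=batch_size, replace=True)
    return sum(grad_fs[j](x) for j in idx) / batch_size
```

Taking the expectation over the sampled indices recovers $\frac{1}{n}\sum _ j \nabla f _ j(x) = \nabla f(x)$.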

$L$-smooth summands

For all $j$, the summand $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz.

(relevant to stochastic gradient descent)

Exponentially decreasing variance

\[\text{Var}(G^k \mid X^k) := \mathbb E[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k)) \mid X^k] \le \frac {M} {\xi^k}\]

for some constants $M \ge 0$ and $\xi > 1$.

(relevant to stochastic gradient descent)

Unique coordinatewise minima

For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.

(relevant to coordinate descent)

Step size exact minimiser

$\alpha _ k$ is chosen to be a minimiser of $\alpha \mapsto f(x^k + \alpha d^k)$.

(relevant to coordinate descent)
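
For example (a standard closed form, not stated here in the notes): for a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A$ symmetric positive definite, $\alpha \mapsto f(x^k + \alpha d^k)$ is a strictly convex parabola, so the exact step is

\[\alpha _ k = -\frac{\nabla f(x^k)^\top d^k}{(d^k)^\top A d^k}\]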

CL-smooth (componentwise Lipschitz-smooth)

Each partial derivative of $f$ is “coordinatewise $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

We write $L _ \text{max} := \max _ i L _ i$ for the largest such constant.

(relevant to coordinate descent)
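
For example (a standard illustration, my own rather than the notes'): for a quadratic $f(x) = \frac{1}{2} x^\top A x + b^\top x$ with $A$ symmetric, $\frac{\partial f}{\partial x _ i}(x + t e _ i) - \frac{\partial f}{\partial x _ i}(x) = t A _ {ii}$, so $L _ i = \vert A _ {ii} \vert$, while the full smoothness constant is $L = \vert\vert A \vert\vert _ 2 \ge L _ \text{max}$; this is why coordinate methods may take the longer step $1/L _ \text{max} \ge 1/L$.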

$D$-bounded initial sublevel set

$f$ has a $D$-bounded initial sublevel set if there exists a finite minimiser $x^\ast$ with $ \vert \vert x - x^\ast \vert \vert \le D$ for all $x$ with $f(x) \le f(x^0)$ (this terminology is not standard).

(relevant to coordinate descent)

Componentwise Wolfe conditions

TODO

Checklist

  • Fixed-step steepest descent
    • $L$-smooth
    • Convex
    • $\gamma$-strongly convex
  • Line search methods
    • $L$-smooth (only case covered)
  • Proximal methods
    • Sum of $L$-smooth convex and convex (only case covered)
  • Accelerated methods
    • Heavy ball method
      • Convex quadratic functions
    • Nesterov’s accelerated gradient method
      • Convex
      • $\gamma$-strongly convex (no proof)
  • Stochastic gradient descent
    • $L$-smooth
    • Convex
    • $\gamma$-strongly convex
    • Variance reduction
      • Exponentially decreasing variance
      • SVRG
