Optimisation for Data Science HT25, Variance reduction methods



Flashcards

Convergence when variance is decaying exponentially

Recall the stochastic gradient descent algorithm:

To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]
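
As a concrete point of reference for the variants below, here is a minimal NumPy sketch of this loop. The least-squares objective $f _ j(x) = \frac 1 2 (a _ j^\top x - b _ j)^2$, the step size and the batch size are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem (assumption): f_j(x) = 0.5 * (a_j^T x - b_j)^2, so f = (1/m) sum_j f_j
m, n = 200, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_fj(x, j):
    # Gradient of the j-th component: a_j * (a_j^T x - b_j)
    return A[j] * (A[j] @ x - b[j])

def sgd(x0, alpha=0.01, batch_size=5, iters=1000):
    x = x0.copy()
    for k in range(iters):
        # (1) Sample S_k i.i.d. (with replacement) from U({1, ..., m})
        S = rng.integers(0, m, size=batch_size)
        # (2) g^k = (1/m_k) * sum_{j in S_k} grad f_j(x^k)
        g = np.mean([grad_fj(x, j) for j in S], axis=0)
        # (3) x^{k+1} = x^k - alpha_k g^k (constant step size here)
        x = x - alpha * g
    return x

x_sgd = sgd(np.zeros(n))
```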

Under standard assumptions, in the $\gamma$-strongly convex case SGD only converges to a noise floor. Consider this stronger set of assumptions:

  1. Conditioned on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $\nabla f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality)
  3. There exist $M > 0$ and $\xi > 1$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le \frac {M} {\xi^k}$ for all $k$, i.e. the variance decays exponentially. (note the $\xi^k$ here)

@State a result about the convergence of SGD with a fixed step size in the $\gamma$-strongly convex case under these assumptions.


Suppose:

  • All the assumptions above are satisfied
  • $f$ is $\gamma$-strongly convex
  • $\alpha _ k = \alpha = \frac \eta L$ where $\eta \in (0, 1]$

Then for all $k \ge 0$:

\[\mathbb E[f(X^k)] - f(x^\ast) \le \nu \rho^k\]

where:

  • $\nu := \max \left( \frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast) \right)$
  • $\rho := \max \left( 1 - \frac{\alpha \gamma}{2}, \frac{1}{\xi} \right) < 1$
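
For a feel for the constants, plugging hypothetical values of $L$, $\gamma$, $M$, $\xi$ and $\eta$ into the theorem (all numbers below are made up for illustration) gives:

```python
# Hypothetical constants, chosen only to illustrate the bound E[f(X^k)] - f(x*) <= nu * rho^k
L, gamma, M, xi, eta = 10.0, 1.0, 4.0, 2.0, 0.5
initial_gap = 3.0                 # assumed value of f(x^0) - f(x*)

alpha = eta / L                   # fixed step size alpha = eta / L
nu = max(alpha * L * M / gamma, initial_gap)
rho = max(1 - alpha * gamma / 2, 1 / xi)

print(nu, rho)                    # nu = 3.0, rho = 0.975 < 1, so the gap decays geometrically
```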

SVRG, reducing variance by gradient aggregation

Recall the stochastic gradient descent algorithm:

To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

In this context, @state the SVRG @algorithm and explain its operation intuitively.


  • Given $x^0 \in \mathbb R^n$
  • For $k = 0, 1, 2, \ldots$ do:
    • Calculate $\nabla f(x^k)$ (expensive!)
    • Let $y^0 := x^k$
    • For $j \in \{ 0, \ldots, p-1\}$, do:
      • Select $I \sim U(\{1, \ldots, m\})$
      • $g^j := \nabla f _ I (y^j) - \nabla f _ I (x^k) + \nabla f(x^k)$
      • $y^{j+1} = y^j - \alpha _ {k, j} g^j$
    • New iterate: $x^{k+1} \in \{y^0, y^1, \ldots, y^{p-1}\}$ chosen uniformly at random
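
Continuing the least-squares sketch from the SGD card above (it reuses the assumed A, b, rng and grad_fj defined there), a minimal implementation of this outer/inner loop might look like:

```python
def full_grad(x):
    # Exact gradient nabla f(x) = (1/m) * A^T (A x - b): the expensive once-per-outer-loop step
    return A.T @ (A @ x - b) / m

def svrg(x0, alpha=0.05, p=100, outer_iters=30):
    x = x0.copy()
    for k in range(outer_iters):
        mu = full_grad(x)                 # nabla f(x^k)
        y = x.copy()                      # y^0 := x^k
        ys = []
        for j in range(p):
            ys.append(y.copy())           # record y^j (candidates y^0, ..., y^{p-1})
            i = rng.integers(0, m)        # I ~ U({1, ..., m})
            # Variance-reduced gradient: grad f_I(y^j) - grad f_I(x^k) + grad f(x^k)
            g = grad_fj(y, i) - grad_fj(x, i) + mu
            y = y - alpha * g             # y^{j+1} = y^j - alpha_{k,j} g^j (constant step here)
        x = ys[rng.integers(0, p)]        # x^{k+1} chosen uniformly from {y^0, ..., y^{p-1}}
    return x

x_svrg = svrg(np.zeros(n))
```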

SGD is a good algorithm, but the variance in the stochastic gradients introduces a “noise floor” and you don’t get actual convergence. Instead, we modify the stochastic gradient to be

\[g^j = \nabla f_I(y^j) - \nabla f_I (x^k) + \nabla f(x^k)\]

This is still an unbiased estimator of the gradient, since

\[\begin{aligned}\mathbb E_I [G^j] &= \mathbb E_I [\nabla f_I(Y^j)] - \mathbb E_I [\nabla f_I(X^k)] + \nabla f(X^k) \\\\&= \nabla f(Y^j) - \nabla f(X^k) + \nabla f(X^k) \\\\&= \nabla f(Y^j)\end{aligned}\]

But the added term $\nabla f(x^k)$ does not depend on the sampled index $I$ and so does not contribute to the variance, giving

\[\text{Var}(G^j) = \text{Var}(\nabla f_I(Y^j)) + \text{Var}(\nabla f_I(X^k)) - 2\text{Cov}(\nabla f_I(Y^j), \nabla f_I(X^k))\]

And since $y^j$ and $x^k$ are close, the gradients should be positively correlated and the variance smaller than if we had just taken the usual stochastic gradient.
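
A quick empirical sanity check of this claim, again in the illustrative least-squares setting above (it reuses grad_fj and full_grad; the points $x$ and $y$ are assumptions): at a point $y$ close to $x$, the spread of the variance-reduced gradients over the index $I$ is far smaller than that of the plain stochastic gradients.

```python
# Compare the spread over I of the plain stochastic gradient grad f_I(y)
# with that of the SVRG gradient grad f_I(y) - grad f_I(x) + grad f(x).
x = rng.standard_normal(n)
y = x + 0.01 * rng.standard_normal(n)        # y close to x, as in the inner loop
mu = full_grad(x)

plain = np.array([grad_fj(y, i) for i in range(m)])
reduced = np.array([grad_fj(y, i) - grad_fj(x, i) + mu for i in range(m)])

# Total variance E[||G - E[G]||^2] over the uniform index I
var_plain = np.mean(np.sum((plain - plain.mean(axis=0)) ** 2, axis=1))
var_reduced = np.mean(np.sum((reduced - reduced.mean(axis=0)) ** 2, axis=1))
print(var_plain, var_reduced)                # var_reduced should be orders of magnitude smaller
```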

Recall the stochastic gradient descent algorithm:

To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

In this context, the SVRG algorithm modifies this by doing the following:

Given $x^0 \in \mathbb R^n$, for $k = 0, 1, 2, \ldots$:

  • Calculate $\nabla f(x^k)$ and let $\overline x^{k, 0} := x^k$:
    • For $j \in \{ 0, \ldots, p-1\}$, do:
      • Select $\mathcal S _ {k, j}$ i.i.d. $\sim U(\{1, \ldots, m\})$ with replacement, $ \vert \mathcal S _ {k, j} \vert = 1$
      • $g^{k, j} := \nabla f _ {\mathcal S _ {k, j} } (\overline x^{k, j}) - \nabla f _ {\mathcal S _ {k, j} } (x^k) + \nabla f(x^k)$
      • $\overline x^{k, j+1} = \overline x^{k, j} - \alpha _ {k, j} g^{k, j}$
    • New iterate: $x^{k+1} \in \{\overline x^{k, j} : j \in \{0, \ldots, p-1\}\}$ chosen uniformly at random

@State a result relating to its convergence in the $\gamma$-strongly convex case.


Suppose:

  • Each $f _ j$, for $j \in \{1, \ldots, m\}$, is convex and $L$-smooth
  • $f$ is $\gamma$-strongly convex with minimiser $x^\ast$
  • The SVRG algorithm is applied with constant step size $\alpha$ and $p$ such that $0 < \alpha < \frac 1 {4L}$ and $p > \frac{1}{\gamma \alpha(1 - 4L\alpha)}$

Then:

\[\mathbb E[f(X^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))\]

for all $k \ge 0$ and where

\[\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1\]
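
For illustration, with hypothetical $L$ and $\gamma$ and a step size satisfying the condition, the smallest admissible inner-loop length $p$ and the resulting rate $\rho$ can be read off directly (all numbers are made up):

```python
import math

# Hypothetical constants, only to illustrate the parameter condition and the rate rho
L, gamma = 10.0, 1.0
alpha = 1 / (8 * L)                                  # satisfies 0 < alpha < 1/(4L)

p_min = 1 / (gamma * alpha * (1 - 4 * L * alpha))    # condition: p > p_min
p = 2 * math.ceil(p_min)                             # any p above the threshold will do

rho = 1 / (gamma * alpha * (1 - 2 * L * alpha) * p) + 2 * L * alpha / (1 - 2 * L * alpha)
print(p, rho)                                        # p = 320, rho ≈ 0.67 < 1
```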


