Notes - Optimisation for Data Science HT25, Stochastic variance reduction methods



Flashcards

Reducing variance by increasing batch size

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent (SGD) proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample a batch $\mathcal S _ k$ of size $m _ k$, with elements drawn i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]
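
As an illustration only (not part of the notes), a minimal Python sketch of this loop could look as follows; the interface, where `grads[j](x)` returns $\nabla f _ j(x)$, and the default step size and batch size are assumptions made purely for this example.

```python
# Minimal mini-batch SGD sketch. Assumed (hypothetical) interface:
# grads is a list of callables with grads[j](x) = grad f_j(x).
import numpy as np

def sgd(grads, x0, alpha=0.1, batch_size=8, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m = len(grads)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        # (1) sample the batch S_k i.i.d. uniformly from {0, ..., m-1}
        batch = rng.integers(0, m, size=batch_size)
        # (2) averaged stochastic gradient g^k over the batch
        g = np.mean([grads[j](x) for j in batch], axis=0)
        # (3) gradient step with (here constant) step size alpha_k = alpha
        x = x - alpha * g
    return x
```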

We also have the following three assumptions:

  1. $G^k$ is an unbiased estimator of the true gradient, conditional on the current iterate: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$, where the expectation is over the draw of $\mathcal S _ k$.
  2. For all $j \in \{1, \ldots, m\}$, $\nabla f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality)
  3. There exist $M > 0$ and $\xi > 1$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le \frac {M} {\xi^k}$ for all $k$. (note the $\xi^k$ here: the variance bound must decrease geometrically in $k$)

@State a result about the convergence of SGD in the $\gamma$-strongly convex case with fixed step-size.


Suppose:

  • All the assumptions above are satisfied
  • $f$ is $\gamma$-strongly convex
  • $\alpha _ k = \alpha = \frac \eta L$ where $\eta \in (0, 1]$

Then SGD satisfies the linear rate

\[\mathbb E[f(X^k)] - f(x^\ast) \le \nu_{vr} \rho^k\]

for all $k \ge 0$ where

\[\nu_{vr} := \max \\{\frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast)\\}\]

and

\[\rho := \max \\{ 1 - \alpha \gamma / 2, 1/\xi\\} < 1\]

and thus

\[\lim_{k \to \infty} \mathbb E[f(X^k)] = f(x^\ast),\]

with convergence at a linear rate.
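
Assumption 3 is where the section title comes in: the batch average of $m _ k$ i.i.d. unbiased gradient samples has variance proportional to $1 / m _ k$, so growing the batch like $m _ k = \lceil \xi^k \rceil$ for some $\xi > 1$ yields a bound of the form $M / \xi^k$. A hypothetical sketch of this schedule (the names and default values are mine, not from the notes):

```python
# SGD with geometrically growing batch sizes m_k ~ xi^k (illustrative sketch).
import math
import numpy as np

def sgd_growing_batches(grads, x0, L, eta=1.0, xi=1.5, num_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m = len(grads)
    alpha = eta / L                       # fixed step size alpha = eta / L
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        m_k = min(m, math.ceil(xi ** k))  # batch grows like xi^k, capped at m
        batch = rng.integers(0, m, size=m_k)
        g = np.mean([grads[j](x) for j in batch], axis=0)
        x = x - alpha * g
    return x
```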

Reducing variance by gradient aggregation

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent (SGD) proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample a batch $\mathcal S _ k$ of size $m _ k$, with elements drawn i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

In this context, @State the SVRG algorithm and explain its operation intuitively.


Given $x^0 \in \mathbb R^n$ for $k = 0, 1, 2, \ldots$:

  • Calculate $\nabla f(x^k)$ and let $\overline x^{k, 0} := x^k$:
    • For $j \in \{ 0, \ldots, p-1\}$, do:
      • Select $\mathcal S _ {k, j}$ i.i.d. $\sim U(\{1, \ldots, m\})$ with replacement, $ \vert \mathcal S _ {k, j} \vert = 1$
      • $g^{k, j} := \nabla f _ {\mathcal S _ {k, j}} (\overline{x}^{k, j}) - \nabla f _ {\mathcal S _ {k, j}} (x^k) + \nabla f(x^k)$
      • $\overline x^{k, j+1} = \overline x^{k, j} - \alpha^{k, j} g^{k, j}$
    • New iterate: $x^{k+1} \in \{\overline x^{k, j} : j \in \{0, \ldots, p-1\}\}$

Intuitively, SVRG works because each $g^{k, j}$ is an unbiased estimator of $\nabla f(\overline x^{k, j})$: taking expectations over the sampled index,

\[\mathbb E _ {\mathcal S _ {k, j}}[g^{k, j}] = \nabla f(\overline x^{k, j}) - \nabla f(x^k) + \nabla f(x^k) = \nabla f(\overline x^{k, j}).\]

At the same time its variance is much smaller than that of the plain stochastic gradient: the full gradient $\nabla f(x^k)$, computed once per outer iteration at the anchor point $x^k$, is corrected by the cheap sampled difference $\nabla f _ {\mathcal S _ {k, j}}(\overline x^{k, j}) - \nabla f _ {\mathcal S _ {k, j}}(x^k)$, which is small whenever $\overline x^{k, j}$ stays close to $x^k$. As the iterates approach the minimiser this variance shrinks, which is what allows SVRG to keep a constant step size and still converge linearly (see the result below), while each inner step costs only $O(1)$ evaluations of the $\nabla f _ j$.
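
A minimal Python sketch of this scheme, under the same assumed interface as before (`grads[j](x)` returns $\nabla f _ j(x)$); the step size, the value of $p$, and the rule used to pick $x^{k+1}$ are illustrative choices, not prescribed by the notes:

```python
# Minimal SVRG sketch (illustration only). Each outer iteration pays one full
# gradient evaluation, then p cheap inner steps using a single sampled index.
import numpy as np

def svrg(grads, x0, alpha=0.01, p=50, num_outer=20, seed=0):
    rng = np.random.default_rng(seed)
    m = len(grads)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_outer):
        full_grad = np.mean([grad(x) for grad in grads], axis=0)  # nabla f(x^k)
        x_bar = x.copy()                                          # xbar^{k,0} = x^k
        inner = [x_bar.copy()]
        for _ in range(p):
            i = rng.integers(0, m)                                # |S_{k,j}| = 1
            # variance-reduced gradient estimate g^{k,j}
            g = grads[i](x_bar) - grads[i](x) + full_grad
            x_bar = x_bar - alpha * g
            inner.append(x_bar.copy())
        # new outer iterate chosen among {xbar^{k,j} : j = 0, ..., p-1},
        # here uniformly at random (one common choice)
        x = inner[rng.integers(0, p)]
    return x
```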

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent (SGD) proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample a batch $\mathcal S _ k$ of size $m _ k$, with elements drawn i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

In this context, the SVRG algorithm modifies SGD as follows.


Given $x^0 \in \mathbb R^n$ for $k = 0, 1, 2, \ldots$:

  • Calculate $\nabla f(x^k)$ and let $\overline x^{k, 0} := x^k$:
    • For $j \in \{ 0, \ldots, p-1\}$, do:
      • Select $\mathcal S _ {k, j}$ i.i.d. $\sim U(\{1, \ldots, m\})$ with replacement, $ \vert \mathcal S _ {k, j} \vert = 1$
      • $g^{k, j} := \nabla f _ {\mathcal S _ {k, j}} (\overline{x}^{k, j}) - \nabla f _ {\mathcal S _ {k, j}} (x^k) + \nabla f(x^k)$
      • $\overline x^{k, j+1} = \overline x^{k, j} - \alpha^{k, j} g^{k, j}$
    • New iterate: $x^{k+1} \in \{\overline x^{k, j} : j \in \{0, \ldots, p-1\}\}$

@State a result relating to its convergence.::

Suppose:

  • $f _ j$ is convex and $L$-smooth for each $j \in \{1, \ldots, m\}$
  • $f$ is $\gamma$-strongly convex with minimiser $x^\ast$
  • The SVRG algorithm is applied with constant step size $\alpha$ and inner-loop length $p$ such that $0 < \alpha < \frac 1 {4L}$ and $p > \frac{1}{\gamma \alpha(1 - 4L\alpha)}$

Then:

\[\mathbb E[f(x^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))\]

for all $k \ge 0$, where

\[\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1\]
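
As a quick numerical sanity check with made-up values of $L$ and $\gamma$ (not from the notes), any $\alpha$ and $p$ satisfying the conditions above indeed give $\rho < 1$:

```python
# Hypothetical numbers, only to check the step-size / inner-loop conditions.
import math

L, gamma = 10.0, 1.0
alpha = 1 / (8 * L)                                 # 0 < alpha < 1/(4L)
p_min = 1 / (gamma * alpha * (1 - 4 * L * alpha))   # threshold for p
p = math.ceil(p_min) + 1                            # any p above the threshold
rho = (1 / (gamma * alpha * (1 - 2 * L * alpha) * p)
       + (2 * L * alpha) / (1 - 2 * L * alpha))
print(p_min, p, rho)                                # rho ~ 0.996 < 1
```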


