Optimisation for Data Science HT25, Variance reduction methods
Flashcards
Convergence when variance is decaying exponentially
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
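As a concrete reference point, here is a minimal NumPy sketch of this loop for the illustrative least-squares objective $f _ j(x) = \frac 1 2 (a _ j^\top x - b _ j)^2$; the data `A`, `b` and the constants `alpha`, `batch_size` are assumptions made for the example, not part of the algorithm statement.

```python
import numpy as np

def sgd(A, b, alpha=0.01, batch_size=10, iters=1000, seed=0):
    """Mini-batch SGD for f(x) = (1/m) * sum_j 0.5*(a_j^T x - b_j)^2 (illustrative)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)                              # x^0
    for _ in range(iters):
        # (1) sample S_k i.i.d. (with replacement) from U({1,...,m})
        S = rng.integers(0, m, size=batch_size)
        # (2) g^k = (1/|S_k|) * sum_{j in S_k} grad f_j(x^k)
        g = A[S].T @ (A[S] @ x - b[S]) / batch_size
        # (3) x^{k+1} = x^k - alpha_k * g^k   (constant step size here)
        x = x - alpha * g
    return x
```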
Under standard assumptions, in the $\gamma$-strongly convex case SGD with a fixed step size only converges to a noise floor. Consider this stronger set of assumptions:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality)
- There exist $M > 0$ and $\xi > 1$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le \frac {M} {\xi^k}$ for all $k$ (note the $\xi^k$ here: the variance decays exponentially).
@State a result about the convergence of SGD with a fixed step size in the $\gamma$-strongly convex case under these assumptions.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality)
- There exist $M > 0$ and $\xi > 1$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le \frac {M} {\xi^k}$ for all $k$ (note the $\xi^k$ here: the variance decays exponentially).
Suppose:
- All the assumptions above are satisfied
- $f$ is $\gamma$-strongly convex
- $\alpha _ k = \alpha = \frac \eta L$ where $\eta \in (0, 1]$
Then for all $k \ge 0$:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \nu \rho^k\]where:
- $\nu := \max \left( \frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast) \right)$
- $\rho := \max \left( 1 - \frac{\alpha \gamma}{2}, \frac{1}{\xi} \right) < 1$
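As an illustrative sanity check (not from the notes), this result can be observed numerically by running the iteration on a strongly convex quadratic and injecting artificial gradient noise whose variance decays like $M / \xi^k$; every constant below is an arbitrary choice for the demonstration.

```python
import numpy as np

# f(x) = 0.5 * x^T D x with D diagonal, so x* = 0, gamma = min(D), L = max(D).
rng = np.random.default_rng(0)
n = 20
D = np.linspace(1.0, 10.0, n)                 # gamma = 1, L = 10
x = rng.standard_normal(n)                    # x^0
L = D.max()
alpha, M, xi = 0.5 / L, 1.0, 1.5              # alpha = eta/L with eta = 0.5, xi > 1

for k in range(61):
    if k % 10 == 0:
        print(k, 0.5 * np.dot(x, D * x))      # f(x^k) - f(x*): roughly geometric decay
    noise = rng.standard_normal(n) * np.sqrt(M / (xi**k * n))  # E||noise||^2 = M / xi^k
    g = D * x + noise                         # unbiased: E[g] = grad f(x^k)
    x = x - alpha * g                         # x^{k+1} = x^k - alpha * g^k
```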
SVRG, reducing variance by gradient aggregation
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
In this context, @state the SVRG @algorithm and explain its operation intuitively.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- Given $x^0 \in \mathbb R^n$
- For $k = 0, 1, 2, \ldots$ do:
- Calculate $\nabla f(x^k)$ (expensive!)
- Let $y^0 := x^k$
- For $j \in \{ 0, \ldots, p-1\}$, do:
- Select $I \sim U(\{1, \ldots, m\})$
- $g^j := \nabla f _ I (y^j) - \nabla f _ I (x^k) + \nabla f(x^k)$
- $y^{j+1} = y^j - \alpha _ {k, j} g^j$
- New iterate: $x^{k+1} \in \{y^0, y^1, \ldots, y^{p-1}\}$ chosen uniformly at random
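A minimal NumPy sketch of this outer/inner loop, again for the illustrative least-squares objective $f _ j(x) = \frac 1 2 (a _ j^\top x - b _ j)^2$ and with a constant inner step size (the data `A`, `b` and the constants `alpha`, `p` are assumptions made for the example):

```python
import numpy as np

def svrg(A, b, alpha=0.01, p=100, outer_iters=50, seed=0):
    """SVRG for f(x) = (1/m) * sum_j 0.5*(a_j^T x - b_j)^2 (illustrative)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)                                    # x^0
    for _ in range(outer_iters):
        full_grad = A.T @ (A @ x - b) / m              # nabla f(x^k)  (expensive!)
        y = x.copy()                                   # y^0 := x^k
        ys = []
        for _ in range(p):
            ys.append(y.copy())                        # record y^j, j = 0, ..., p-1
            i = rng.integers(m)                        # I ~ U({1, ..., m})
            # g^j = grad f_I(y^j) - grad f_I(x^k) + grad f(x^k)
            g = A[i] * (A[i] @ y - b[i]) - A[i] * (A[i] @ x - b[i]) + full_grad
            y = y - alpha * g                          # y^{j+1} = y^j - alpha * g^j
        x = ys[rng.integers(p)]                        # x^{k+1} ~ U({y^0, ..., y^{p-1}})
    return x
```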
SGD is a good algorithm, but the variance in the stochastic gradients introduces a “noise floor” and you don’t get actual convergence with a fixed step size. Instead, we modify the stochastic gradient to be
\[g^j = \nabla f_I(y^j) - \nabla f_I (x^k) + \nabla f(x^k)\]This is still an unbiased estimator of the gradient, since
\[\begin{aligned}\mathbb E_I [G^j] &= \mathbb E_I [\nabla f_I(Y^j)] - \mathbb E_I [\nabla f_I(X^k)] + \nabla f(X^k) \\\\&= \nabla f(Y^j) - \nabla f(X^k) + \nabla f(X^k) = \nabla f(Y^j)\end{aligned}\]But
\[\text{Var}(G^j) = \text{Var}(\nabla f_I(Y^j)) + \text{Var}(\nabla f_I(X^k)) - 2\text{Cov}(\nabla f_I(Y^j), \nabla f_I(X^k))\]And since $y^j$ and $x^k$ are close, the two gradients should be positively correlated, so the variance is smaller than if we had just taken the usual stochastic gradient.
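To see the variance reduction numerically, a quick illustrative experiment: at a point $y$ close to $x$, compare the total variance of the plain stochastic gradient $\nabla f_I(y)$ with that of the SVRG direction $\nabla f_I(y) - \nabla f_I(x) + \nabla f(x)$ on a random least-squares instance. All data and constants below are made up for the illustration; both estimators have mean $\nabla f(y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = rng.standard_normal(n)                  # reference point x^k
y = x + 0.01 * rng.standard_normal(n)       # nearby inner iterate y^j
full_grad_x = A.T @ (A @ x - b) / m         # nabla f(x^k)

def grad_j(z, j):
    # nabla f_j(z) for f_j(z) = 0.5 * (a_j^T z - b_j)^2
    return A[j] * (A[j] @ z - b[j])

plain = np.array([grad_j(y, j) for j in range(m)])
svrg = np.array([grad_j(y, j) - grad_j(x, j) + full_grad_x for j in range(m)])

# total variance E||G - E[G]||^2 over I ~ U({1, ..., m})
print("plain stochastic gradient:", plain.var(axis=0).sum())
print("SVRG direction:           ", svrg.var(axis=0).sum())
```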
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
In this context, the SVRG algorithm modifies this by doing the following:
Given $x^0 \in \mathbb R^n$, for $k = 0, 1, 2, \ldots$:
- Calculate $\nabla f(x^k)$ and let $\overline x^{k, 0} := x^k$:
- For $j \in \{ 0, \ldots, p-1\}$, do:
- Select $\mathcal S _ {k, j}$ i.i.d. $\sim U(\{1, \ldots, m\})$ with replacement, $ \vert \mathcal S _ {k, j} \vert = 1$
- $g^{k, j} := \nabla f _ {\mathcal S _ {k, j} } (\overline x^{k, j}) - \nabla f _ {\mathcal S _ {k, j} } (x^k) + \nabla f(x^k)$
- $\overline x^{k, j+1} = \overline x^{k, j} - \alpha _ {k, j} g^{k, j}$
- New iterate: $x^{k+1} \in \{\overline x^{k, j} : j \in \{0, \ldots, p-1\}\}$ chosen uniformly at random
@State a result relating to its convergence in the $\gamma$-strongly convex case.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ is continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
Given $x^0 \in \mathbb R^n$, for $k = 0, 1, 2, \ldots$:
- Calculate $\nabla f(x^k)$ and let $\overline x^{k, 0} := x^k$:
- For $j \in \{ 0, \ldots, p-1\}$, do:
- Select $\mathcal S _ {k, j}$ i.i.d. $\sim U(\{1, \ldots, m\})$ with replacement, $ \vert \mathcal S _ {k, j} \vert = 1$
- $g^{k, j} := \nabla f _ {\mathcal S _ {k, j} } (\overline x^{k, j}) - \nabla f _ {\mathcal S _ {k, j} } (x^k) + \nabla f(x^k)$
- $\overline x^{k, j+1} = \overline x^{k, j} - \alpha _ {k, j} g^{k, j}$
- New iterate: $x^{k+1} \in \{\overline x^{k, j} : j \in \{0, \ldots, p-1\}\}$ chosen uniformly at random
Suppose:
- $f _ j$ is convex and $L$-smooth for each $j \in \{1, \ldots, m\}$
- $f$ is $\gamma$-strongly convex with minimiser $x^\ast$
- The SVRG algorithm is applied with constant step size $\alpha$ and $p$ such that $0 < \alpha < \frac 1 {4L}$ and $p > \frac{1}{\gamma \alpha(1 - 4L\alpha)}$
Then:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))\]for all $k \ge 0$, where
\[\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1\]
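As a sanity check on these constants (an illustrative choice, not from the notes): taking $\alpha = \frac 1 {8L}$, which satisfies $0 < \alpha < \frac 1 {4L}$, and $p \ge \frac{32L}{\gamma} > \frac{1}{\gamma \alpha (1 - 4L\alpha)} = \frac{16L}{\gamma}$ gives
\[\rho = \frac{1}{\gamma \cdot \frac 1 {8L} \cdot \frac 3 4 \cdot p} + \frac{2L \cdot \frac 1 {8L}}{1 - 2L \cdot \frac 1 {8L}} \le \frac 1 3 + \frac{1/4}{3/4} = \frac 1 3 + \frac 1 3 = \frac 2 3 < 1,\]i.e. an inner-loop length of order $L / \gamma$ suffices for a linear rate.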