Notes - Optimisation for Data Science HT25, Stochastic variance reduction methods
Flashcards
Reducing variance by increasing batch size
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, the stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ of size $m _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions:
- $G^k$ is an unbiased estimator of the true gradient, conditional on the current iterate: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality)
- There exists $M > 0$ and $\xi > 1$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le \frac {M} {\xi^k}$ for all $k$. (note the $\xi^k$ here)
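As a concrete illustration, here is a minimal Python sketch of this scheme on an assumed least-squares instance $f _ j(x) = \frac 1 2 (a _ j^\top x - b _ j)^2$ (the data, step size and growth factor are arbitrary illustrative choices, not from the notes). The batch size $m _ k$ is grown geometrically, $m _ k \approx \xi^k$, so that, provided the per-sample gradient variance stays bounded, the variance bound above holds:

```python
import numpy as np

# Sketch only: mini-batch SGD where |S_k| grows like xi^k, so the gradient
# noise satisfies Var(G^k) <= M / xi^k (assuming bounded per-sample variance).
rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))          # assumed data: f_j(x) = 0.5*(a_j^T x - b_j)^2
b = rng.standard_normal(m)

def grad_batch(x, idx):
    """(1/m_k) * sum_{j in idx} grad f_j(x) for the sampled batch idx."""
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

x = np.zeros(n)
alpha, xi = 0.01, 1.5                    # fixed step size, batch growth factor xi > 1
for k in range(30):
    m_k = min(m, int(np.ceil(xi ** k)))  # batch size m_k ~ xi^k (capped at m)
    S_k = rng.integers(0, m, size=m_k)   # i.i.d. uniform sample from {1, ..., m}
    g = grad_batch(x, S_k)               # g^k = (1/m_k) sum_{j in S_k} grad f_j(x^k)
    x = x - alpha * g                    # x^{k+1} = x^k - alpha * g^k
```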
@State a result about the convergence of SGD in the $\gamma$-strongly convex case with fixed step-size.
Suppose:
- All the assumptions above are satisfied
- $f$ is $\gamma$-strongly convex
- $\alpha^k = \alpha = \frac \eta L$ where $\eta \in (0, 1]$
Then SGD satisfies the linear rate
\[\mathbb E[f(X^k)] - f(x^\ast) \le \nu_{vr} \rho^k\]for all $k \ge 0$ where
\[\nu_{vr} := \max \\{\frac{\alpha L M}{\gamma}, f(x^0) - f(x^\ast)\\}\]and
\[\rho := \max \\{ 1 - \alpha \gamma / 2, 1/\xi\\} < 1\]and thus
\[\lim_{k \to \infty} \mathbb E[f(X^k)] = f(x^\ast)\]linearly.
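For concreteness, the constants in this bound can be evaluated directly; the numbers below are assumed purely for illustration and do not come from the notes:

```python
# Illustration only: plug assumed constants into the theorem's bound.
L, gamma, M, xi = 10.0, 1.0, 4.0, 2.0    # assumed smoothness / strong convexity / variance constants
eta = 0.5
alpha = eta / L                          # alpha = eta / L with eta in (0, 1]
nu_vr = max(alpha * L * M / gamma, 1.0)  # taking f(x^0) - f(x*) = 1 for concreteness
rho = max(1 - alpha * gamma / 2, 1 / xi) # here rho = max{0.975, 0.5} = 0.975 < 1
print(f"E[f(X^k)] - f(x*) <= {nu_vr} * {rho}**k")
```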
Reducing variance by gradient aggregation
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, the stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ of size $m _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
In this context, @State the SVRG algorithm and explain its operation intuitively.
Given $x^0 \in \mathbb R^n$, for $k = 0, 1, 2, \ldots$:
- Calculate $\nabla f(x^k)$ and let $\overline x^{k, 0} := x^k$:
- For $j \in \{ 0, \ldots, p-1\}$, do:
- Select $\mathcal S _ {k, j}$ i.i.d. $\sim U(\{1, \ldots, m\})$ with replacement, $ \vert \mathcal S _ {k, j} \vert = 1$
- $g^{k, j} := \nabla f _ {\mathcal S _ {k, j}} (\overline x^{k, j}) - \nabla f _ {\mathcal S _ {k, j}} (x^k) + \nabla f(x^k)$
- $\overline x^{k, j+1} = \overline x^{k, j} - \alpha^{k, j} g^{k, j}$
- New iterate: $x^{k+1} \in \{\overline x^{k, j} : j \in \{0, \ldots, p-1\}\}$
Intuitively, each inner step uses a control variate: $g^{k, j}$ is still an unbiased estimate of $\nabla f(\overline x^{k, j})$, because the correction $\nabla f(x^k) - \nabla f _ {\mathcal S _ {k, j}}(x^k)$ has zero mean, but its variance is governed by how far $\overline x^{k, j}$ has drifted from the reference point $x^k$. As the outer iterates approach $x^\ast$, both the inner iterates and the reference point cluster around the minimiser, so the variance of $g^{k, j}$ shrinks to zero; plain SGD has no such mechanism and needs diminishing step sizes instead. The cost is one full gradient evaluation per outer iteration, amortised over the $p$ cheap inner steps.
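Here is a minimal, self-contained Python sketch of the loop above on the same kind of assumed least-squares instance (the problem data, step size and inner-loop length $p$ are illustrative choices, not from the notes):

```python
import numpy as np

# SVRG sketch: one full gradient per outer iteration, p cheap inner steps,
# each using the variance-reduced estimate g = grad_fi(x_bar) - grad_fi(x_ref) + mu.
rng = np.random.default_rng(1)
m, n = 200, 5
A = rng.standard_normal((m, n))          # assumed data: f_j(x) = 0.5*(a_j^T x - b_j)^2
b = rng.standard_normal(m)

def grad_fj(x, j):
    """Gradient of the single component f_j."""
    return A[j] * (A[j] @ x - b[j])

def full_grad(x):
    return A.T @ (A @ x - b) / m

x = np.zeros(n)
alpha, p = 0.01, 2 * m                   # assumed step size and inner-loop length
for k in range(20):
    mu = full_grad(x)                    # grad f(x^k) at the reference point
    x_bar = x.copy()                     # x_bar^{k,0} := x^k
    inner = [x_bar.copy()]
    for j in range(p):
        i = rng.integers(m)              # |S_{k,j}| = 1, uniform over {1, ..., m}
        g = grad_fj(x_bar, i) - grad_fj(x, i) + mu   # variance-reduced gradient g^{k,j}
        x_bar = x_bar - alpha * g                    # inner update
        inner.append(x_bar.copy())
    x = inner[rng.integers(p)]           # new iterate x^{k+1} from {x_bar^{k,j} : j < p}
```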
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, the stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ of size $m _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
In this context, the SVRG algorithm modifies this by doing the following
Given $x^0 \in \mathbb R^n$, for $k = 0, 1, 2, \ldots$:
- Calculate $\nabla f(x^k)$ and let $\overline x^{k, 0} := x^k$:
- For $j \in \{ 0, \ldots, p-1\}$, do:
- Select $\mathcal S _ {k, j}$ i.i.d. $\sim U(\{1, \ldots, m\})$ with replacement, $ \vert \mathcal S _ {k, j} \vert = 1$
- $g^{k, j} := \nabla f _ {\mathcal S _ {k, j}} (\overline x^{k, j}) - \nabla f _ {\mathcal S _ {k, j}} (x^k) + \nabla f(x^k)$
- $\overline x^{k, j+1} = \overline x^{k, j} - \alpha^{k, j} g^{k, j}$
- New iterate: $x^{k+1} \in \{\overline x^{k, j} : j \in \{0, \ldots, p-1\}\}$
@State a result relating to its convergence.::
Suppose:
- $f _ j$ is convex and $L$-smooth for each $j \in \{1, \ldots, m\}$
- $f$ is $\gamma$-strongly convex with minimiser $x^\ast$
- The SVRG algorithm is applied with constant step size $\alpha$ and $p$ such that $0 < \alpha < \frac 1 {4L}$ and $p > \frac{1}{\gamma \alpha(1 - 4L\alpha)}$
Then:
\[\mathbb E[f(x^k)] - f(x^\ast) \le \rho^k (f(x^0) - f(x^\ast))\]for all $k \ge 0$, where
\[\rho := \frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1\]
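To see that the lower bound on $p$ in the assumptions is exactly the condition for $\rho < 1$, note that $0 < \alpha < \frac{1}{4L}$ makes $1 - 2L\alpha$ and $1 - 4L\alpha$ positive, so rearranging gives
\[\frac{1}{\gamma \alpha (1 - 2L\alpha) p} + \frac{2L\alpha}{1 - 2L\alpha} < 1 \iff \frac{1}{\gamma \alpha (1 - 2L\alpha) p} < \frac{1 - 4L\alpha}{1 - 2L\alpha} \iff p > \frac{1}{\gamma \alpha (1 - 4L\alpha)}.\]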