Notes - Optimisation for Data Science HT25, Stochastic gradient descent
Flashcards
Basic definitions and assumptions
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$.
In this context, can you @state the stochastic gradient descent @algorithm?
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned} g^k &= \nabla f_{\mathcal S_k} (x^k) \\\\ &= \frac 1 {m_k} \sum_{j \in \mathcal S_k} \nabla f_j(x^k) \end{aligned}\]
where $m _ k := \vert \mathcal S _ k \vert$ is the mini-batch size.
(3): Update
\[x^{k+1} = x^k - \alpha_k g^k\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
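As an illustrative aside (not part of the original notes), the procedure above can be sketched in Python; the least-squares terms $f _ j(x) = \frac 1 2 (a _ j^\top x - b _ j)^2$ below are a hypothetical stand-in for the general $f _ j$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite-sum problem: f_j(x) = 0.5 * (a_j . x - b_j)^2,
# so f(x) = (1/m) * sum_j f_j(x) is a least-squares objective.
m, n = 200, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_fj(x, j):
    """Gradient of the j-th term: nabla f_j(x) = a_j * (a_j . x - b_j)."""
    return A[j] * (A[j] @ x - b[j])

def sgd(x0, alpha, batch_size, num_iters):
    """Mini-batch SGD: step (1) samples indices i.i.d. uniformly,
    step (2) averages their gradients, step (3) takes the update."""
    x = x0.copy()
    for _ in range(num_iters):
        S = rng.integers(0, m, size=batch_size)           # step (1)
        g = np.mean([grad_fj(x, j) for j in S], axis=0)   # step (2)
        x = x - alpha * g                                 # step (3)
    return x

print(sgd(np.zeros(n), alpha=0.01, batch_size=1, num_iters=5000))
```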
@State the three assumptions that we use in order to prove that SGD converges in the general case.
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
Suppose that $g^k$ is calculated as above with $ \vert \mathcal S _ k \vert = 1$. @Justify that the stochastic gradient is then an unbiased estimator of the true gradient, where the expectation is taken over the selected data point.
Since $ \vert \mathcal S _ k \vert = 1$ and the index is drawn uniformly from $\{1, \ldots, m\}$, conditional on the current iterate we have
\[\mathbb E_{\mathcal S_k}[G^k] = \sum^m_{j = 1} \mathbb P(\mathcal S_k = \{j\}) \nabla f_j(X^k) = \frac 1 m \sum^m_{j = 1} \nabla f_j(X^k) = \nabla f(X^k)\]
where we have used that $G^k = \nabla f _ j(X^k)$ whenever $\mathcal S _ k = \{j\}$, and that each index is selected with probability $1/m$, independently of the past, due to the i.i.d. uniform choice of $\mathcal S _ k$.
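A quick numerical sanity check of this (illustrative only; the least-squares finite sum below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite sum: f_j(x) = 0.5 * (a_j . x - b_j)^2.
m, n = 100, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)             # fixed point at which we check unbiasedness

full_grad = A.T @ (A @ x - b) / m      # nabla f(x) = (1/m) sum_j a_j (a_j . x - b_j)
per_sample = A * (A @ x - b)[:, None]  # row j holds nabla f_j(x)

# The exact expectation over a uniformly chosen index equals the full gradient.
print(np.allclose(per_sample.mean(axis=0), full_grad))   # True

# A Monte Carlo estimate with |S_k| = 1 also approaches the full gradient.
idx = rng.integers(0, m, size=200_000)
print(np.linalg.norm(per_sample[idx].mean(axis=0) - full_grad))  # small
```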
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$ (this quantity is illustrated numerically in the sketch after this list).
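As an aside, the quantity bounded in the third assumption can be inspected numerically at a fixed point; a minimal sketch on a hypothetical least-squares finite sum:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical finite sum: f_j(x) = 0.5 * (a_j . x - b_j)^2.
m, n = 100, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

full_grad = A.T @ (A @ x - b) / m
per_sample = A * (A @ x - b)[:, None]          # row j holds nabla f_j(x)

# E[(G - nabla f(x))^T (G - nabla f(x))] for |S_k| = 1 and a uniform index:
variance_at_x = np.mean(np.sum((per_sample - full_grad) ** 2, axis=1))
print(variance_at_x)   # any valid bound M must be at least this large at this x
```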
Can you @state two useful results, the first of which follows from the second assumption and the second of which follows from all three assumptions together?
Assumption 2 ($L$-smoothness) implies that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha \nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + \frac{L\alpha^2}{2} \mathbb E_{\mathcal S_k} [||G^k||^2]\]
All three assumptions together imply that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) ||\nabla f(X^k)||^2 + \frac{ML(\alpha^k)^2}{2}\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@Prove that Assumption 2 implies
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha \nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + \frac{L\alpha^2}{2} \mathbb E_{\mathcal S_k} [||G^k||^2] \quad (1\star)\]
and that all the assumptions together imply that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) ||\nabla f(X^k)||^2 + \frac{ML(\alpha^k)^2}{2} \quad (2\star)\]
For $(1\star)$, note that Assumption 2 implies that $f$ is $L$-smooth and so we have the upper bound:
\[f(x + \alpha d) \le f(x) + \alpha \nabla f(x)^\top d + \frac {L\alpha^2} 2||d||^2\]
for all $x, d \in \mathbb R^n$ and $\alpha \in \mathbb R$. Letting $x = X^k$, $d = -G^k$ and $\alpha = \alpha^k$, and using $X^{k+1} = X^k - \alpha^k G^k$, we have that
\[f(X^{k+1}) \le f(X^k) - \alpha^k \nabla f(X^k)^\top G^k + \frac{L(\alpha^k)^2} 2 ||G^k||^2\]
Applying expectation on both sides with respect to $\mathcal S _ k$, we see that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha^k \nabla f(X^k)^\top \mathbb E_{\mathcal S_k}[G^k] + \frac {L(\alpha^k)^2} 2 \mathbb E_{\mathcal S_k} [||G^k||^2]\]
where we have used that $f(X^k)$ and $\nabla f(X^k)$ do not depend on $\mathcal S _ k$.
For $(2\star)$, note that Assumption 1 implies that $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$. Therefore
\[\begin{aligned} \text{Var}_{\mathcal S_k}(G^k) &= \mathbb E_{\mathcal S_k}[||G^k||^2] - 2\nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + ||\nabla f(X^k)||^2 \\\\ &= \mathbb E_{\mathcal S_k} [||G^k||^2] - ||\nabla f(X^k)||^2 \end{aligned}\]
which together with Assumption 3 implies that
\[\mathbb E_{\mathcal S_k} [||G^k||^2] \le M + ||\nabla f(X^k)||^2\]
Substituting this bound and $\mathbb E_{\mathcal S_k}[G^k] = \nabla f(X^k)$ into $(1\star)$, we see that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) ||\nabla f(X^k)||^2 + \frac{ML(\alpha^k)^2}{2} \quad (2\star)\]
Convergence in the general case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@State a theorem about the convergence of SGD in this general case.
Suppose:
- All the above assumptions hold
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
Then, for any $k \ge 1$:
\[\min_{0 \le i \le k - 1} \mathbb E[|| \nabla f(X^i) ||^2] \le \eta M + \frac{2L(f(x^0) - f_\text{low})}{k \eta}\]
and so SGD takes at most
\[k = \left\lceil \frac{2L(f(x^0) - f_\text{low})}{\eta \varepsilon} \right\rceil\]
iterations to generate
\[\min_{0 \le i \le k - 1} \mathbb E[||\nabla f(X^{i})||^2] \le \varepsilon + \eta M\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@Prove that if
- All the above assumptions hold
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
Then, for any $k \ge 1$:
\[\min_{0 \le i \le k - 1} \mathbb E[|| \nabla f(X^i) ||^2]
\le
\eta M + \frac{2L(f(x^0) - f_\text{low})}{k \eta}\]
and so SGD takes at most
\[k = \left\lceil \frac{2L(f(x^0) - f_\text{low})}{\eta \varepsilon} \right\rceil\]
iterations to generate
\[\min_{0 \le i \le k - 1} \mathbb E[||\nabla f(X^{i})||^2] \le \varepsilon + \eta M\]
You may use the following two useful results, which follow from the above assumptions:
- $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha \nabla f(X^k)^\top \mathbb E _ {\mathcal S _ k} [G^k] + \frac{L\alpha^2}{2} \mathbb E _ {\mathcal S _ k} [ \vert \vert G^k \vert \vert ^2] \quad (1\star)$
- $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) \vert \vert \nabla f(X^k) \vert \vert ^2 + \frac{ML(\alpha^k)^2}{2} \quad (2\star)$
The second of these results, together with the fact that $\frac{L \alpha}{2} - 1 = \frac \eta 2 - 1 \le -\frac 1 2$, gives that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 ||\nabla f(X^k)||^2 + \frac{M L \alpha^2}{2}\]
Now take expectation with respect to the past samples $\mathcal S _ 0, \ldots, \mathcal S _ {k-1}$ on both sides of the above. Writing $\mathbb E _ k[\cdot] := \mathbb E _ {\mathcal S _ 0, \ldots, \mathcal S _ k}[\cdot]$, the tower property gives $\mathbb E _ {k-1}[\mathbb E _ {\mathcal S _ k}[f(X^{k+1})]] = \mathbb E _ k[f(X^{k+1})]$, since $X^{k+1}$ depends only on $\mathcal S _ 0, \ldots, \mathcal S _ k$, and hence
\[\mathbb E_k[f(X^{k+1})] \le \mathbb E_{k-1}[f(X^k)] - \frac \alpha 2 \mathbb E_{k-1} [||\nabla f(X^k)||^2] + \frac{ML\alpha^2}{2}\]
Now we have that for all $k \ge 0$, the expected decrease satisfies
\[\mathbb E_{k-1} [f(X^k)] - \mathbb E_k [f(X^{k+1})] \ge \frac \alpha 2 \mathbb E_{k-1} [||\nabla f(X^k)||^2] - \frac{ML\alpha^2}{2}\]
Summing up from $i = 0$ to $k$ and using that $\mathbb E _ k[f(X^{k+1})] \ge f _ \text{low}$, we deduce
\[\begin{aligned} f(x^0) - f_\text{low} &\ge f(x^0) - \mathbb E_k[f(X^{k+1})] \\\\ &\ge \frac \alpha 2 \sum^k_{i = 0} \mathbb E_{i-1} [||\nabla f(X^i)||^2] - (k+1)\frac{ML\alpha^2}{2} \\\\ &\ge \frac \alpha 2 (k+1) \left[\min_{0 \le i \le k} \mathbb E[||\nabla f(X^i)||^2] - M L \alpha\right] \end{aligned}\]
Rearranging and substituting $\alpha = \eta / L$ gives
\[\min_{0 \le i \le k} \mathbb E[||\nabla f(X^i)||^2] \le M L \alpha + \frac{2(f(x^0) - f_\text{low})}{(k+1)\alpha} = \eta M + \frac{2L(f(x^0) - f_\text{low})}{(k+1)\eta}\]
which is the stated bound after relabelling $k + 1$ as $k$. For the complexity bound, the right-hand side is at most $\varepsilon + \eta M$ as soon as $\frac{2L(f(x^0) - f_\text{low})}{k \eta} \le \varepsilon$, i.e. once $k \ge \frac{2L(f(x^0) - f_\text{low})}{\eta \varepsilon}$.
Convergence in the strongly-convex case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@State a theorem about the convergence of SGD in this $\gamma$-strongly convex case.
Suppose:
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
Then
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@Prove that if:
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
then:
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
As in the general case, result $(2\star)$ and the fact that $\frac{L \alpha}{2} - 1 = \frac \eta 2 - 1 \le -\frac 1 2$ give that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 ||\nabla f(X^k)||^2 + \frac{M L \alpha^2}{2}\]
Taking expectation with respect to the past samples $\mathcal S _ 0, \ldots, \mathcal S _ {k-1}$ on both sides of the above, and writing $\mathbb E _ k[\cdot] := \mathbb E _ {\mathcal S _ 0, \ldots, \mathcal S _ k}[\cdot]$ as before (so that, by the tower property, $\mathbb E _ {k-1}[\mathbb E _ {\mathcal S _ k}[f(X^{k+1})]] = \mathbb E _ k[f(X^{k+1})]$), we obtain
\[\mathbb E_k [f(X^{k+1})] - f(x^\ast) \le \mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac \alpha 2 \mathbb E_{k-1}[||\nabla f(X^k)||^2] + \frac{ML\alpha^2}{2}\]
As $f$ is strongly convex, the global minimiser $x^\ast$ is unique and
\[f(X^k) - f(x^\ast) \le \frac{1}{2\gamma}||\nabla f(X^k)||^2\]
and thus
\[2\gamma \, \mathbb E_{k-1} [f(X^k) - f(x^\ast)] \le \mathbb E_{k-1}[||\nabla f(X^k)||^2]\]
Therefore we deduce
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) \le (1 - \gamma \alpha)(\mathbb E_{k-1}[f(X^k)] - f(x^\ast)) + \frac{ML\alpha^2}{2}\]
or equivalently,
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha M L}{2\gamma} \le (1-\gamma \alpha) \left(\mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha M L}{2\gamma}\right)\]
Then, since $\alpha = \eta / L \le \frac 1 L \le \frac 1 \gamma$ and $\frac{\alpha M L}{2\gamma} = \frac{M\eta}{2\gamma}$, we have
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{M\eta}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L} \right) \left( \mathbb E_{k-1} [f(X^k)] - f(x^\ast) - \frac{M\eta}{2\gamma} \right)\]
and the claim follows by induction.
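As an illustration of this behaviour (not part of the notes; the quadratic problem below is hypothetical), the following sketch runs single-sample SGD with two fixed step sizes and reports the averaged optimality gap after different numbers of iterations; the gap stalls at a level that shrinks with the step size, as the noise-floor term $\frac{\eta M}{2\gamma}$ suggests:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical strongly convex finite sum: f_j(x) = 0.5 * (a_j . x - b_j)^2,
# so f(x) = (1/(2m)) * ||A x - b||^2.
m, n = 500, 5
A = rng.standard_normal((m, n))
b = A @ np.ones(n) + 0.5 * rng.standard_normal(m)   # planted model plus noise
x_star = np.linalg.lstsq(A, b, rcond=None)[0]       # minimiser of f

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def averaged_gap(alpha, num_iters, reps=20):
    """Average f(X^k) - f(x*) at the final iterate over several SGD runs
    with |S_k| = 1 and a fixed step size alpha."""
    total = 0.0
    for _ in range(reps):
        x = np.zeros(n)
        for _ in range(num_iters):
            j = rng.integers(m)
            x = x - alpha * A[j] * (A[j] @ x - b[j])
        total += f(x) - f(x_star)
    return total / reps

# Comparing iteration counts shows the gap stalls; halving the step size lowers
# the stall level, consistent with a noise floor proportional to the step size.
for alpha in (0.01, 0.005):
    print(alpha, averaged_gap(alpha, num_iters=2000), averaged_gap(alpha, num_iters=5000))
```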
Decreasing the noise floor
Dynamic step-size
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Then if
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
We have the result that
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
The problem with this is that it only guarantees convergence up to the “noise floor” $\frac{\eta M}{2\gamma}$. @State a result about how using a variable step size $\alpha^k$ can guarantee actual convergence in expectation at the cost of sublinear convergence.
Let everything be as above but set
\[\alpha^k = \frac{2}{2L + k \gamma}\]
for $k \ge 0$. Then SGD satisfies, for all $k \ge 0$,
\[0 \le \mathbb E[f(X^k)] - f(x^\ast) \le \frac{v}{\frac {2L} \gamma + k}\]
where
\[v := \frac{2L}{\gamma} \max \\{\frac M \gamma, f(x^0) - f(x^\ast)\\}\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Then if
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
We have the result that
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
The problem with this is that it only guarantees convergence up to the “noise floor” $\frac{\eta M}{2\gamma}$. @Prove that if you let
\[\alpha^k = \frac{2}{2L + k \gamma}\]
for $k \ge 0$, then SGD satisfies, for all $k \ge 0$
\[0 \le \mathbb E[f(X^k)] - f(x^\ast) \le \frac{v}{\frac {2L} \gamma + k}\]
where
\[v := \frac{2L}{\gamma} \max \\{\frac M \gamma, f(x^0) - f(x^\ast)\\}\]
The proof is identical to the proof in the ordinary $\gamma$-strongly convex case up to the point where we establish that
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha M L}{2\gamma} \le (1-\gamma \alpha) \left(\mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha M L}{2\gamma}\right)\]
Then, since $\alpha^k \le \frac 1 L \le \frac 1 \gamma$ for all $k \ge 0$, the same argument with $\alpha$ replaced by $\alpha^k$ gives
\[\mathbb E_k [f(X^{k+1})] - f(x^\ast) - \frac{\alpha^k M L}{2\gamma} \le (1 - \gamma \alpha^k) \left( \mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha^k ML}{2\gamma} \right)\]
Now proceed by induction. At $k = 0$, the result holds since $f(x^0) - f(x^\ast) \le \max \\{\frac M \gamma, f(x^0) - f(x^\ast)\\} = \frac{v}{2L / \gamma}$. Now assume that it holds at some $k \ge 0$, and substitute into the above inequality to obtain
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha^k M L}{2\gamma} \le (1-\gamma \alpha^k)\left( \frac{v}{\frac{2L}{\gamma} + k} - \frac{\alpha^k ML}{2\gamma} \right)\]
Substituting $\alpha^k = \frac{2}{2L + k\gamma}$, so that $\gamma \alpha^k = \frac{2}{\frac{2L}{\gamma} + k}$, and using $v \ge \frac{2LM}{\gamma^2}$ together with $\left(\frac{2L}{\gamma} + k - 1\right)\left(\frac{2L}{\gamma} + k + 1\right) \le \left(\frac{2L}{\gamma} + k\right)^2$, simplifying establishes the result with $k$ replaced by $k+1$.
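To illustrate the contrast with the fixed-step result (again on a hypothetical least-squares problem, with $L$ and $\gamma$ computed for that particular instance), the following sketch compares a fixed step $1/L$ with the decreasing schedule $\alpha^k = \frac{2}{2L + k\gamma}$; one expects the latter to keep shrinking the averaged gap rather than stalling at a noise floor:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical strongly convex finite sum: f_j(x) = 0.5 * (a_j . x - b_j)^2.
m, n = 500, 5
A = rng.standard_normal((m, n))
b = A @ np.ones(n) + 0.5 * rng.standard_normal(m)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

L = np.max(np.sum(A ** 2, axis=1))            # every f_j is L-smooth with this L
gamma = np.linalg.eigvalsh(A.T @ A / m)[0]    # strong convexity constant of f

def averaged_gap(step_fn, num_iters=20_000, reps=10):
    """Average f(X^k) - f(x*) at the final iterate over several runs of
    single-sample SGD, where step_fn(k) is the step size at iteration k."""
    total = 0.0
    for _ in range(reps):
        x = np.zeros(n)
        for k in range(num_iters):
            j = rng.integers(m)
            x = x - step_fn(k) * A[j] * (A[j] @ x - b[j])
        total += f(x) - f(x_star)
    return total / reps

print("fixed step 1/L:  ", averaged_gap(lambda k: 1.0 / L))
print("decreasing step: ", averaged_gap(lambda k: 2.0 / (2 * L + k * gamma)))
```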
Increasing batch size
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- Conditional on the current iterate, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
If
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
then
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
How do things change if the mini-batch size $ \vert \mathcal S _ k \vert $ is $p$ rather than $1$?
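The notes do not record an answer here, but as a hedged aside: the mini-batch gradient averages $p$ i.i.d. sampled gradients, so its variance is $1/p$ times the single-sample variance, and $M$ is effectively replaced by $M / p$ in the bounds above (a lower noise floor, at the cost of $p$ gradient evaluations per iteration). A small empirical check on a hypothetical least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical finite sum: f_j(x) = 0.5 * (a_j . x - b_j)^2.
m, n = 200, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

full_grad = A.T @ (A @ x - b) / m

def batch_gradient_variance(p, trials=20_000):
    """Monte Carlo estimate of E[||G - nabla f(x)||^2] when G averages the
    gradients of p indices drawn i.i.d. uniformly from {1, ..., m}."""
    total = 0.0
    for _ in range(trials):
        S = rng.integers(0, m, size=p)
        g = np.mean(A[S] * (A[S] @ x - b[S])[:, None], axis=0)
        total += np.sum((g - full_grad) ** 2)
    return total / trials

v1 = batch_gradient_variance(1)
for p in (2, 5, 10):
    print(p, batch_gradient_variance(p), v1 / p)   # the two values roughly agree
```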