Optimisation for Data Science HT25, Stochastic gradient descent
Flashcards
Basic definitions and assumptions
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]
where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$.
In this context, can you @state the stochastic gradient descent @algorithm?
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned} g^k &= \nabla f_{\mathcal S_k} (x^k) \\\\ &= \frac 1 {m_k} \sum_{j \in \mathcal S_k} \nabla f_j(x^k) \end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha_k g^k\]
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
@State the three assumptions that we use in order to prove that SGD converges in the general case.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
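As a concrete illustration (my own sketch, not part of the course notes), the SGD algorithm above can be written in a few lines of NumPy for a least-squares problem $f _ j(x) = \frac 1 2 (a _ j^\top x - b _ j)^2$; the data, batch size and step size below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares data: f_j(x) = 0.5 * (a_j @ x - b_j)**2,
# so f(x) = 0.5 * mean((A @ x - b)**2).
m, n = 200, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

def sgd(x0, alpha, batch_size, iters):
    x = x0.copy()
    for _ in range(iters):
        # (1) Sample a mini-batch of indices uniformly at random (with replacement)
        S = rng.integers(0, m, size=batch_size)
        # (2) Stochastic gradient: average of the sampled component gradients
        g = A[S].T @ (A[S] @ x - b[S]) / batch_size
        # (3) Update with step size alpha
        x = x - alpha * g
    return x

x_final = sgd(np.zeros(n), alpha=1e-2, batch_size=8, iters=5000)
print("final objective:", 0.5 * np.mean((A @ x_final - b) ** 2))
```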
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
Suppose that $g^k$ is calculated as above with $ \vert \mathcal S _ k \vert = 1$. @Justify that the gradient with respect to the selected data point is then an unbiased estimator of the true gradient.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
Since $ \vert \mathcal S _ k \vert = 1$, write $\mathcal S _ k = \{J\}$ where $J \sim U(\{1, \ldots, m\})$, so that $G^k = \nabla f _ J(X^k)$. Then
\[\mathbb E_{\mathcal S_k}[G^k] = \sum^m_{j = 1} \mathbb P(J = j) \nabla f_j(X^k) = \frac 1 m \sum^m_{j = 1} \nabla f_j(X^k) = \nabla f(X^k)\]
where we have used that $J$ is chosen uniformly from $\{1, \ldots, m\}$, independently of the past samples (so $X^k$ is fixed under $\mathbb E _ {\mathcal S _ k}$). Hence $G^k$ is an unbiased estimator of the true gradient.
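As a quick numerical sanity check (my own illustration, not from the notes): averaging many single-sample gradients recovers the full gradient for a least-squares example, consistent with unbiasedness.

```python
import numpy as np

rng = np.random.default_rng(1)

# f_j(x) = 0.5 * (a_j @ x - b_j)**2, so grad f(x) = (1/m) * A.T @ (A @ x - b)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

full_grad = A.T @ (A @ x - b) / m

# Monte-Carlo estimate of E[G^k] with |S_k| = 1: average many single-sample gradients
J = rng.integers(0, m, size=200_000)
single_grads = (A[J] @ x - b[J])[:, None] * A[J]
mc_estimate = single_grads.mean(axis=0)

# Should be small, and shrinks as the number of Monte-Carlo samples grows
print(np.linalg.norm(mc_estimate - full_grad))
```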
Overestimation properties
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Can you @state two useful results that follow from these assumptions?
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Assumption 2 (smoothness) implies that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha \nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + \frac{L\alpha^2}{2} \mathbb E_{\mathcal S_k} [||G^k||^2] \quad (1\star)\]
and all the assumptions together imply that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha_k\left( 1 - \frac{L\alpha_k}{2} \right) ||\nabla f(X^k)||^2 + \frac{ML\alpha_k^2}{2} \quad (2\star)\]
@important~ (overestimation property)
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@Prove that Assumption 2 implies
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha \nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + \frac{L\alpha^2}{2} \mathbb E_{\mathcal S_k} [||G^k||^2] \quad (1\star)\]
and that all the assumptions together imply that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha_k\left( 1-\frac{L\alpha_k}{2} \right) ||\nabla f(X^k)||^2 + \frac{ML\alpha_k^2}{2} \quad (2\star)\]
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
For $(1\star)$, note that Assumption 2 implies that $f$ is $L$-smooth, and so we have the upper bound
\[f(x + \alpha d) \le f(x) + \alpha \nabla f(x)^\top d + \frac {L\alpha^2} 2||d||^2\]
for all $x, d \in \mathbb R^n$ and $\alpha \in \mathbb R$. Letting $x = X^k$, $d = G^k$ and $\alpha = -\alpha _ k$, and using $X^{k+1} = X^k - \alpha _ k G^k$, we have that
\[f(X^{k+1}) \le f(X^k) - \alpha_k \nabla f(X^k)^\top G^k + \frac{L\alpha_k^2} 2 ||G^k||^2\]
(note that this slightly conflicts with the convention used elsewhere in the course, since here the update is $x^{k+1} = x^k - \alpha _ k g^k$, i.e. we subtract the step rather than add it; this is because it is notationally convenient to have $G^k$ be an estimator of the gradient rather than of the negative gradient).
Applying expectation on both sides with respect to $\mathcal S _ k$, we see that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha_k \nabla f(X^k)^\top \mathbb E_{\mathcal S_k}[G^k] + \frac {L\alpha_k^2} 2 \mathbb E_{\mathcal S_k} [||G^k||^2]\]where we have used that $f(X^k)$ and $\nabla f(X^k)$ do not depend on $\mathcal S _ k$.
For $(2\star)$, note that Assumption 1 implies that $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$. Therefore
\[\begin{aligned} \text{Var}_{\mathcal S_k}(G^k) &= \mathbb E_{\mathcal S_k}[||G^k||^2] - 2\nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + ||\nabla f(X^k)||^2 \\\\ &= \mathbb E_{\mathcal S_k} [||G^k||^2] - ||\nabla f(X^k)||^2 \end{aligned}\]
which together with Assumption 3 implies that
\[\mathbb E_{\mathcal S_k} [||G^k||^2] \le M + ||\nabla f(X^k)||^2\]
Substituting this bound and $\mathbb E_{\mathcal S_k}[G^k] = \nabla f(X^k)$ into $(1\star)$, we see that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha_k\left( 1 - \frac{L\alpha_k}{2} \right) ||\nabla f(X^k)||^2 + \frac{ML\alpha_k^2}{2} \quad (2\star)\]
@important~ (overestimation property)
Analysing stochastic algorithms
- Expectation with respect to one of the sources of randomness, versus expectation given another random variable
- Memoryless / Markov property
- Tower law of expectation
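As a compact summary of how these are used in the proofs below (in the notation introduced there): writing $\mathbb E _ k := \mathbb E _ {\mathcal S _ 0, \ldots, \mathcal S _ k}[\cdot]$ for the total expectation over the first $k+1$ samples, the Markov property says that $X^{k+1}$ is a function of $\mathcal S _ 0, \ldots, \mathcal S _ k$ only, and the tower law then gives
\[\mathbb E_k[f(X^{k+1})] = \mathbb E_{k-1}\big[\mathbb E_{\mathcal S_k} [f(X^{k+1})]\big]\]
which is the step used to pass from the conditional bounds $(1\star)$, $(2\star)$ to bounds on total expectations.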
Convergence in the general case
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@State a theorem about the convergence of SGD in this general case.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Suppose:
- All the above assumptions hold
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
Then, for any $k \ge 1$:
\[\min_{0 \le i \le k - 1} \mathbb E[|| \nabla f(X^i) ||^2] \le \eta M + \frac{2L(f(x^0) - f_\text{low})}{k \eta}\]
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@Prove that if:
- All the above assumptions hold
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
Then, for any $k \ge 1$:
\[\min_{0 \le i \le k - 1} \mathbb E[|| \nabla f(X^i) ||^2]
\le
\eta M + \frac{2L(f(x^0) - f_\text{low})}{k \eta}\]
You may use the following two useful results, which follow from the above assumptions:
- $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha \nabla f(X^k)^\top \mathbb E _ {\mathcal S _ k} [G^k] + \frac{L\alpha^2}{2} \mathbb E _ {\mathcal S _ k} [ \vert \vert G^k \vert \vert ^2] \quad (1\star)$
- $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha _ k\left(1 - \frac{L\alpha _ k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2 + \frac{ML\alpha _ k^2}{2} \quad (2\star)$
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
- $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha \nabla f(X^k)^\top \mathbb E _ {\mathcal S _ k} [G^k] + \frac{L\alpha^2}{2} \mathbb E _ {\mathcal S _ k} [ \vert \vert G^k \vert \vert ^2] \quad (1\star)$
- $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha _ k\left(1 - \frac{L\alpha _ k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2 + \frac{ML\alpha _ k^2}{2} \quad (2\star)$
The second of these two results, $(2\star)$, and the fact that $\frac{L \alpha}{2} - 1 = \frac \eta 2 - 1 \le -\frac 1 2$ give that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 ||\nabla f(X^k)||^2 + \frac{M L \alpha^2}{2}\]
Now take expectation with respect to the past samples $\mathcal S_0, \ldots, \mathcal S_{k-1}$ on both sides. By the Markov property, $X^k$ depends only on $\mathcal S_0, \ldots, \mathcal S_{k-1}$, so writing
\[\mathbb E_k := \mathbb E_{\mathcal S_0, \ldots, \mathcal S_k}[\cdot]\]
for the total expectation over the first $k+1$ samples, the tower law gives $\mathbb E_k[\cdot] = \mathbb E_{k-1}[\mathbb E_{\mathcal S_k}[\cdot]]$ and hence
\[\mathbb E_k[f(X^{k+1})] \le \mathbb E_{k-1}[f(X^k)] - \frac \alpha 2 \mathbb E_{k-1} [||\nabla f(X^k)||^2] + \frac{ML\alpha^2}{2}\]Rearranging,
\[\mathbb E_{k-1} [f(X^k)] - \mathbb E_k [f(X^{k+1})] \ge \frac \alpha 2 \mathbb E_{k-1} [||\nabla f(X^k)||^2] - \frac{ML\alpha^2}{2}\]
Summing this inequality over iterations $0, \ldots, k - 1$ (replacing $k$ by the running index $i$), we obtain
\[\sum^{k-1}_{i = 0}\Big(\mathbb E_{i-1} [f(X^i)] - \mathbb E_i [f(X^{i+1})] \Big) \ge \sum^{k-1}_{i = 0} \Big( \frac \alpha 2 \mathbb E_{i-1} [||\nabla f(X^i)||^2] - \frac{ML\alpha^2}{2} \Big)\]
where $\mathbb E_{-1}[f(X^0)] = f(x^0)$. Since the left-hand sum telescopes and $f(X^{k}) \ge f_\text{low}$, we obtain
\[\begin{aligned} f(x^0) - f_\text{low} &\ge \frac \alpha 2 \sum^{k-1}_{i = 0} \mathbb E_{i-1} [||\nabla f(X^i)||^2] - k \cdot \frac{ML\alpha^2}{2} \\\\ &\ge \frac {\alpha k} 2 \left[\min_{0 \le i \le k - 1} \mathbb E[||\nabla f(X^i)||^2] - M L \alpha\right] \end{aligned}\]
where the last step uses $\mathbb E_{i-1}[||\nabla f(X^i)||^2] = \mathbb E[||\nabla f(X^i)||^2]$, again because $X^i$ depends only on $\mathcal S_0, \ldots, \mathcal S_{i-1}$. Hence, using $\alpha = \eta / L$,
\[\min_{0 \le i \le k - 1} \mathbb E[|| \nabla f(X^i) ||^2] \le \eta M + \frac{2L(f(x^0) - f_\text{low})}{k \eta}\]as required.
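To see the behaviour this bound describes (my own experiment; the problem data and constants are arbitrary), one can track the running minimum of $\|\nabla f(x^i)\|^2$ for a least-squares problem with fixed step $\alpha = \eta / L$: it decreases quickly at first and then stalls at a level of order $\eta M$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Least-squares problem: f_j(x) = 0.5 * (a_j @ x - b_j)**2
m, n = 200, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

L = np.max(np.sum(A ** 2, axis=1))   # each f_j is ||a_j||^2-smooth; take the worst case
eta = 0.5
alpha = eta / L

def full_grad(x):
    return A.T @ (A @ x - b) / m

x = np.zeros(n)
running_min = np.inf
for k in range(20_000):
    running_min = min(running_min, np.sum(full_grad(x) ** 2))
    j = rng.integers(0, m)                       # |S_k| = 1
    g = (A[j] @ x - b[j]) * A[j]                 # single-sample stochastic gradient
    x = x - alpha * g

print("running min of ||grad f(x^i)||^2:", running_min)
```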
Convergence in the strongly-convex case
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@State a theorem about the convergence of SGD in this $\gamma$-strongly convex case.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Suppose:
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
Then
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
@Prove that if:
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
then:
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
You may assume that the three assumptions together imply that:
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha_k\left( 1- \frac{L\alpha_k}{2} \right) ||\nabla f(X^k)||^2 + \frac{ML\alpha_k^2}{2}\]
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
The given result and the fact that $\frac{L \alpha}{2} - 1 = \frac \eta 2 - 1 \le -\frac 1 2$ imply that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 ||\nabla f(X^k)||^2 + \frac{M L \alpha^2}{2}\]
Now take expectation with respect to the past samples $\mathcal S_0, \ldots, \mathcal S_{k-1}$ on both sides. By the Markov property, $X^k$ depends only on $\mathcal S_0, \ldots, \mathcal S_{k-1}$, so writing
\[\mathbb E_k := \mathbb E_{\mathcal S_0, \ldots, \mathcal S_k}[\cdot]\]
for the total expectation over the first $k+1$ samples, the tower law gives $\mathbb E_k[\cdot] = \mathbb E_{k-1}[\mathbb E_{\mathcal S_k}[\cdot]]$. Subtracting $f(x^\ast)$ from both sides, we hence obtain
\[\mathbb E_k [f(X^{k+1})] - f(x^\ast) \le \mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac \alpha 2 \mathbb E_{k-1}[||\nabla f(X^k)||^2] + \frac{ML\alpha^2}{2}\]As $f$ is strongly convex, the global minimiser $x^\ast$ is unique and
\[f(X^k) - f(x^\ast) \le \frac{1}{2\gamma}||\nabla f(X^k)||^2\]and thus
\[2\gamma (\mathbb E_{k-1} [f(X^k)] - f(x^\ast)) \le \mathbb E_{k-1}[||\nabla f(X^k)||^2]\]And therefore we deduce
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) \le (1 - \gamma \alpha)(\mathbb E_{k-1}[f(X^k)] - f(x^\ast)) + \frac{ML\alpha^2}{2}\]or equivalently,
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha M L}{2\gamma} \le (1-\gamma \alpha) \left(\mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha M L}{2\gamma}\right)\]And then since $\alpha = \eta / L \le \frac 1 L \le \frac 1 \gamma$, we have
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{M\eta}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L} \right) \left( \mathbb E_{k-1} [f(X^k)] - f(x^\ast) - \frac{M\eta}{2\gamma} \right)\]and then the claim follows by induction.
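A small experiment (my own sketch; data and constants arbitrary) makes the trade-off in this bound visible: a larger $\eta$ gives a faster linear decrease of the optimality gap but a higher noise floor $\frac{\eta M}{2\gamma}$, while a smaller $\eta$ is slower but plateaus lower.

```python
import numpy as np

rng = np.random.default_rng(3)

# Strongly convex least-squares problem (m > n, so (1/m) A.T A is positive definite)
m, n = 300, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # minimiser of f
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

L = np.max(np.sum(A ** 2, axis=1))

def run_sgd(eta, iters=30_000):
    alpha = eta / L                               # fixed step alpha = eta / L
    x = np.zeros(n)
    for _ in range(iters):
        j = rng.integers(0, m)                    # |S_k| = 1
        x = x - alpha * (A[j] @ x - b[j]) * A[j]
    return f(x) - f(x_star)

for eta in (1.0, 0.1):
    print(f"eta = {eta}: final optimality gap ~ {run_sgd(eta):.2e}")
```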
Decreasing the noise floor
Dynamic step-size
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Then if
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
We have the result that
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
The only problem with this is that it only guarantees convergence up to the “noise floor”, $\frac{\eta M}{2\gamma}$. @State a result about how using a variable step size $\alpha _ k$ can guarantee actual convergence in expectation at the cost of sublinear convergence.
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Let everything be as above but set
\[\alpha_k = \frac{2}{2L + k \gamma}\]for $k \ge 0$. Then SGD satisfies, for all $k \ge 0$,
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{\frac{2L}{\gamma} \max \left(\frac M \gamma, f(x^0) - f(x^\ast)\right)}{\frac {2L} \gamma + k}\]
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Then if
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
We have the result that
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
The only problem with this is that it only guarantees convergence up to the “noise floor”, $\frac{\eta M}{2\gamma}$.
@Prove that if you let
\[\alpha_k = \frac{2}{2L + k \gamma}\]
for $k \ge 0$, then SGD satisfies, for all $k \ge 0$
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{v}{\frac {2L} \gamma + k}\]
where
\[v := \frac{2L}{\gamma} \max \\{\frac M \gamma, f(x^0) - f(x^\ast)\\}\]
You may assume that the three assumptions together imply that:
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha_k\left( 1-\frac{L\alpha_k}{2} \right) ||\nabla f(X^k)||^2 + \frac{ML\alpha_k^2}{2}\]
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
The given result, writing $\alpha := \alpha_k$ for the step size used at iteration $k$ and noting that $\alpha \le \frac 1 L$, so $\frac{L \alpha}{2} - 1 \le -\frac 1 2$, implies that
\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 ||\nabla f(X^k)||^2 + \frac{M L \alpha^2}{2}\]
Now take expectation with respect to the past samples $\mathcal S_0, \ldots, \mathcal S_{k-1}$ on both sides. By the Markov property, $X^k$ depends only on $\mathcal S_0, \ldots, \mathcal S_{k-1}$, so writing
\[\mathbb E_k := \mathbb E_{\mathcal S_0, \ldots, \mathcal S_k}[\cdot]\]
for the total expectation over the first $k+1$ samples, the tower law gives $\mathbb E_k[\cdot] = \mathbb E_{k-1}[\mathbb E_{\mathcal S_k}[\cdot]]$. Subtracting $f(x^\ast)$ from both sides, we hence obtain
\[\mathbb E_k [f(X^{k+1})] - f(x^\ast) \le \mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac \alpha 2 \mathbb E_{k-1}[||\nabla f(X^k)||^2] + \frac{ML\alpha^2}{2}\]As $f$ is strongly convex, the global minimiser $x^\ast$ is unique and
\[f(X^k) - f(x^\ast) \le \frac{1}{2\gamma}||\nabla f(X^k)||^2\]and thus
\[2\gamma (\mathbb E_{k-1} [f(X^k)] - f(x^\ast)) \le \mathbb E_{k-1}[||\nabla f(X^k)||^2]\]And therefore we deduce
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) \le (1 - \gamma \alpha)(\mathbb E_{k-1}[f(X^k)] - f(x^\ast)) + \frac{ML\alpha^2}{2}\]or equivalently,
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha M L}{2\gamma} \le (1-\gamma \alpha) \left(\mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha M L}{2\gamma}\right)\]Then, since $\alpha _ k \le \frac 1 L \le \frac 1 \gamma$, for all $k \ge 0$, we have
\[\mathbb E_k [f(X^{k+1})] - f(x^\ast) - \frac{\alpha_k M L}{2\gamma} \le (1 - \gamma \alpha_k) \left( \mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha_k ML}{2\gamma} \right)\](up until this point, the proof is identical to the fixed-step case).
Now proceed by induction. At $k = 0$, the claimed bound holds since $v \ge \frac{2L}{\gamma}\left(f(x^0) - f(x^\ast)\right)$. Assume it holds at some $k \ge 0$, and substitute it into the above inequality to obtain
\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha_k M L}{2\gamma} \le (1-\gamma \alpha_k)\left( \frac{v}{\frac{2L}{\gamma} + k} - \frac{\alpha_k ML}{2\gamma} \right)\]
Substituting $\alpha_k = \frac{2}{2L + k\gamma}$ and simplifying, using $v \ge \frac{2LM}{\gamma^2}$ and $\left(\frac{2L}{\gamma} + k - 1\right)\left(\frac{2L}{\gamma} + k + 1\right) \le \left(\frac{2L}{\gamma} + k\right)^2$, establishes the bound with $k$ replaced by $k+1$.
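The effect of the decreasing step size can be checked numerically (my own sketch; data arbitrary): unlike a fixed-step run, the optimality gap keeps shrinking, roughly like $1/k$, rather than plateauing.

```python
import numpy as np

rng = np.random.default_rng(4)

# Strongly convex least-squares problem
m, n = 300, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

L = np.max(np.sum(A ** 2, axis=1))             # smoothness constant of the f_j
gamma = np.linalg.eigvalsh(A.T @ A / m)[0]     # strong convexity constant of f

x = np.zeros(n)
for k in range(200_000):
    alpha_k = 2.0 / (2 * L + k * gamma)        # the dynamic step size from the result above
    j = rng.integers(0, m)                     # |S_k| = 1
    x = x - alpha_k * (A[j] @ x - b[j]) * A[j]
    if (k + 1) in (1_000, 10_000, 100_000, 200_000):
        print(f"k = {k + 1:>7d}: f(x) - f(x*) = {f(x) - f(x_star):.3e}")
```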
Increased batch size
Recall the stochastic gradient descent algorithm:
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]
(3): Update
\[x^{k+1} = x^k - \alpha _ k g^k\]
We also have the following three assumptions used in our analysis of SGD in the general case:
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
If
- All the above assumptions hold
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $ \vert \mathcal S _ k \vert = 1$
- The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$
then
\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma}
\le
\left( 1 - \frac{\eta \gamma}{L} \right)^k
\cdot
\left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
How do things change if the mini-batch size $ \vert \mathcal S _ k \vert $ is $p$ rather than $1$?
To solve $\min _ {x \in \mathbb R^n} f(x) = \frac 1 m \sum^m _ {j = 1} f _ j(x)$ where each $f _ j$ continuously differentiable, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$
(2): Calculate
(3): Update
- In expectation over the current batch, $G^k$ is an unbiased estimator of the true gradient: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
- For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth (and hence $f$ is $L$-smooth by the triangle inequality).
- There exists $M > 0$ such that $\text{Var} _ {\mathcal S _ k}(G^k) := \mathbb E _ {\mathcal S _ k}[(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.
Since $G^k$ is now an average of $p$ i.i.d. single-sample gradient estimators, its variance is divided by $p$, so the bound $M$ can be replaced by $M/p$ and the noise floor $\frac{\eta M}{2\gamma}$ shrinks to $\frac{\eta M}{2\gamma p}$, at the cost of $p$ gradient evaluations per iteration. Choosing the batch size to grow geometrically makes this variance term decrease exponentially, which is one route to variance reduction.
See [[Notes - Optimisation for Data Science HT25, Variance reduction methods]].
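A final sketch (my own, not from the notes) of this effect: with a fixed step size, increasing the batch size from $1$ to $16$ lowers the plateau that SGD settles at, roughly in proportion to the batch size.

```python
import numpy as np

rng = np.random.default_rng(5)

# Strongly convex least-squares problem
m, n = 300, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

L = np.max(np.sum(A ** 2, axis=1))
alpha = 0.5 / L                                  # fixed step, eta = 0.5

def run(batch_size, iters=30_000):
    x = np.zeros(n)
    for _ in range(iters):
        S = rng.integers(0, m, size=batch_size)  # mini-batch of size p
        g = A[S].T @ (A[S] @ x - b[S]) / batch_size
        x = x - alpha * g
    return f(x) - f(x_star)

for p in (1, 16):
    print(f"batch size p = {p:>2d}: final optimality gap ~ {run(p):.2e}")
```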