Notes - Optimisation for Data Science HT25, Stochastic gradient descent



Flashcards

Basic definitions and assumptions

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$.

In this context, can you @state the stochastic gradient descent @algorithm?


Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned} g^k &= \nabla f_{\mathcal S_k} (x^k) \\\\ &= \frac 1 {m_k} \sum_{j \in \mathcal S_k} \nabla f_j(x^k) \end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha_k g^k\]
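
As a concrete illustration (my own sketch, not part of the lecture notes; the least-squares example and all names are arbitrary), the algorithm above can be written in a few lines of numpy:

```python
import numpy as np

def sgd(grad_fns, x0, step_size, batch_size=1, num_iters=1000, seed=0):
    """Minimal mini-batch SGD sketch for f(x) = (1/m) * sum_j f_j(x).

    grad_fns  : list of callables, grad_fns[j](x) returns grad f_j(x)
    step_size : callable k -> alpha_k (use a constant function for a fixed step)
    """
    rng = np.random.default_rng(seed)
    m = len(grad_fns)
    x = np.array(x0, dtype=float)
    for k in range(num_iters):
        # (1) sample the batch S_k: batch_size indices, i.i.d. uniform on {0, ..., m-1}
        batch = rng.integers(0, m, size=batch_size)
        # (2) averaged stochastic gradient g^k
        g = np.mean([grad_fns[j](x) for j in batch], axis=0)
        # (3) update x^{k+1} = x^k - alpha_k * g^k
        x = x - step_size(k) * g
    return x

# Example: least squares, f_j(x) = 0.5 * (a_j @ x - b_j)^2
rng = np.random.default_rng(1)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grads = [lambda x, j=j: A[j] * (A[j] @ x - b[j]) for j in range(100)]
L = max(np.linalg.norm(A[j]) ** 2 for j in range(100))  # each f_j is L-smooth
x_hat = sgd(grads, np.zeros(5), step_size=lambda k: 0.5 / L, num_iters=5000)
```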

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

@State the three assumptions that we use in order to prove that SGD converges in the general case.


  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

Suppose that $g^k$ is calculated as above, with $ \vert \mathcal S _ k \vert = 1$. @Justify that $G^k$ is then an unbiased estimator of the true gradient, where the expectation is taken over the randomly selected data point.


\[\begin{aligned} \mathbb E_{\mathcal S_k} [G^k] &= \sum^m_{j =1} \mathbb E[G^k \mid \mathcal S_k = j] \cdot \mathbb P[\mathcal S_k = j] \\\\ &= \sum^m_{j=1} \nabla f_j(X^k) \cdot \frac 1 m \\\\ &= \nabla f(X^k) \end{aligned}\]

where the first equality is the law of total expectation, and we have used $\mathbb E[G^k \mid \mathcal S _ k = j] = \nabla f _ j (X^k)$ (since $G^k = \nabla f _ j(X^k)$ whenever the sampled index is $j$) together with $\mathbb P[\mathcal S _ k = j] = \frac 1 m$, which holds by the uniform sampling of $\mathcal S _ k$.
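
A quick numerical sanity check of this calculation (my own illustration on a synthetic least-squares objective): the empirical average of the single-sample gradient over many uniform draws of $j$ should approach the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

# f_j(x) = 0.5 * (a_j @ x - b_j)^2, so grad f_j(x) = a_j * (a_j @ x - b_j)
def grad_fj(j, x):
    return A[j] * (A[j] @ x - b[j])

x = rng.normal(size=n)
full_grad = np.mean([grad_fj(j, x) for j in range(m)], axis=0)   # nabla f(x)

# Empirical mean of G^k = grad f_{S_k}(x) with S_k ~ U({1, ..., m})
draws = rng.integers(0, m, size=100_000)
mc_mean = np.mean([grad_fj(j, x) for j in draws], axis=0)

print(np.max(np.abs(mc_mean - full_grad)))   # small, and shrinks as the number of draws grows
```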

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

Can you @state two useful results that follow from these assumptions?


Assumption 2 ($L$-smoothness) implies that

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha^k \nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + \frac{L(\alpha^k)^2}{2} \mathbb E_{\mathcal S_k} [||G^k||^2]\]

Assumptions 1, 2 and 3 together imply that

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) ||\nabla f(X^k)||^2 + \frac{ML(\alpha^k)^2}{2}\]

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

@Prove that Assumption 2 implies

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha^k \nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + \frac{L(\alpha^k)^2}{2} \mathbb E_{\mathcal S_k} [||G^k||^2] \quad (1\star)\]

and that all the assumptions together imply that

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) ||\nabla f(X^k)||^2 + \frac{ML(\alpha^k)^2}{2} \quad (2\star)\]

For $(1\star)$, note that Assumption 2 implies that $f$ is $L$-smooth, and so we have the upper bound:

\[f(x + \alpha d) \le f(x) + \alpha \nabla f(x)^\top d + \frac {L\alpha^2} 2||d||^2\]

for all $x, d \in \mathbb R^n$ and $\alpha \in \mathbb R$. Letting $x = X^k$, $d = -G^k$ and $\alpha = \alpha^k$, and using $X^{k+1} = X^k - \alpha^k G^k$, we have that

\[f(X^{k+1}) \le f(X^k) - \alpha^k \nabla f(X^k)^\top G^k + \frac{L(\alpha^k)^2} 2 ||G^k||^2\]

Applying expectation on both sides with respect to $\mathcal S _ k$, we see that

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \alpha^k \nabla f(X^k)^\top \mathbb E_{\mathcal S_k}[G^k] + \frac {L(\alpha^k)^2} 2 \mathbb E_{\mathcal S_k} [||G^k||^2]\]

where we have used that $f(X^k)$ and $\nabla f(X^k)$ do not depend on $\mathcal S _ k$.


For $(2\star)$, note that Assumption 1 gives $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$. Therefore

\[\begin{aligned} \text{Var}(G^k \mid \mathcal S_k) &= \mathbb E_{\mathcal S_k}[||G^k||^2] - 2\nabla f(X^k)^\top \mathbb E_{\mathcal S_k} [G^k] + ||\nabla f(X^k)||^2 \\\\ &= \mathbb E_{\mathcal S_k} [||G^k||^2] - ||\nabla f(X^k)||^2 \end{aligned}\]

which together with Assumption 3 implies that

\[\mathbb E_{\mathcal S_k} [||G^k||^2] \le M + ||\nabla f(X^k)||^2\]

Applying this bound, together with Assumption 1 (which gives $\nabla f(X^k)^\top \mathbb E _ {\mathcal S _ k}[G^k] = ||\nabla f(X^k)||^2$), to $(1\star)$, we see that

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) ||\nabla f(X^k)||^2 + \frac{ML(\alpha^k)^2}{2} \quad (2\star)\]
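
Since $\mathbb E _ {\mathcal S _ k}$ is just an average over the $m$ equally likely singleton batches when $ \vert \mathcal S _ k \vert = 1$, the bound $(2\star)$ can be checked numerically at a single point. Below is a sketch of such a check (my own illustration on a synthetic least-squares problem; taking $M$ to be the exact gradient variance at the current point is purely for the purposes of the check).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 4
A, b = rng.normal(size=(m, n)), rng.normal(size=m)

f_j    = lambda j, x: 0.5 * (A[j] @ x - b[j]) ** 2
grad_j = lambda j, x: A[j] * (A[j] @ x - b[j])
f      = lambda x: np.mean([f_j(j, x) for j in range(m)])
grad   = lambda x: np.mean([grad_j(j, x) for j in range(m)], axis=0)

L = max(np.linalg.norm(A[j]) ** 2 for j in range(m))   # every f_j (and hence f) is L-smooth
x, alpha = rng.normal(size=n), 0.5 / L

# Left-hand side: exact E_{S_k}[f(X^{k+1})], averaging over the m equally likely batches {j}
lhs = np.mean([f(x - alpha * grad_j(j, x)) for j in range(m)])

# Right-hand side of (2*), with M taken to be the exact gradient variance at x
second_moment = np.mean([grad_j(j, x) @ grad_j(j, x) for j in range(m)])
M = second_moment - grad(x) @ grad(x)
rhs = f(x) + alpha * (L * alpha / 2 - 1) * (grad(x) @ grad(x)) + M * L * alpha ** 2 / 2

print(lhs <= rhs)   # True
```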

Convergence in the general case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

@State a theorem about the convergence of SGD in this general case.


Suppose:

  • All the above assumptions hold
  • $f$ is bounded below by $f _ \text{low}$
  • $ \vert \mathcal S _ k \vert = 1$
  • The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$

Then, for any $k \ge 1$:

\[\min_{0 \le i \le k - 1} \mathbb E[|| \nabla f(X^i) ||^2] \le \eta M + \frac{2L(f(x^0) - f_\text{low})}{k \eta}\]

and so SGD takes at most

\[k = \left\lceil \frac{2L(f(x^0) - f_\text{low})}{\eta \varepsilon} \right\rceil\]

iterations to generate an iterate $X^i$, $0 \le i \le k - 1$, with

\[\mathbb E[||\nabla f(X^{i})||^2] \le \varepsilon + \eta M\]
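
A small helper (my own, with made-up constants) showing the trade-off the theorem describes: shrinking $\eta$ lowers the accuracy floor $\eta M$ but inflates the iteration bound.

```python
import math

def sgd_iteration_bound(L, f0_gap, eta, eps):
    """Iterations after which min_i E[||grad f(X^i)||^2] <= eps + eta * M,
    according to the theorem above; f0_gap stands for f(x^0) - f_low."""
    return math.ceil(2 * L * f0_gap / (eta * eps))

# Made-up constants, purely to show how the bound scales with eta
L, f0_gap, M, eps = 10.0, 5.0, 2.0, 1e-2
for eta in (1.0, 0.1, 0.01):
    k = sgd_iteration_bound(L, f0_gap, eta, eps)
    print(f"eta = {eta:5.2f}   iterations <= {k:>9d}   floor eta*M = {eta * M:.3f}")
```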

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

@Prove that if

  • All the above assumptions hold
  • $f$ is bounded below by $f _ \text{low}$
  • $ \vert \mathcal S _ k \vert = 1$
  • The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$

Then, for any $k \ge 1$:

\[\min_{0 \le i \le k - 1} \mathbb E[|| \nabla f(X^i) ||^2] \le \eta M + \frac{2L(f(x^0) - f_\text{low})}{k \eta}\]

and so SGD takes at most

\[k = \left\lceil \frac{2L(f(x^0) - f_\text{low})}{\eta \varepsilon} \right\rceil\]

iterations to generate an iterate $X^i$, $0 \le i \le k - 1$, with

\[\mathbb E[||\nabla f(X^{i})||^2] \le \varepsilon + \eta M\]

You may use the following two useful results, which follow from the above assumptions:

  • $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) - \alpha^k \nabla f(X^k)^\top \mathbb E _ {\mathcal S _ k} [G^k] + \frac{L(\alpha^k)^2}{2} \mathbb E _ {\mathcal S _ k} [ \vert \vert G^k \vert \vert ^2] \quad (1\star)$
  • $\mathbb E _ {\mathcal S _ k} [f(X^{k+1})] \le f(X^k) + \alpha^k\left( \frac{L\alpha^k}{2} - 1 \right) \vert \vert \nabla f(X^k) \vert \vert ^2 + \frac{ML(\alpha^k)^2}{2} \quad (2\star)$

The second of these results, $(2\star)$, and the fact that $\frac{L \alpha}{2} - 1 = \frac \eta 2 - 1 \le -\frac 1 2$ give that

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 ||\nabla f(X^k)||^2 + \frac{M L \alpha^2}{2}\]

Taking expectation with respect to the past samples $\mathcal S _ 0, \ldots, \mathcal S _ {k-1}$ on both sides of the above, and writing $\mathbb E _ k$ for the total expectation over the first $k + 1$ samples, the tower property (together with the fact that the iterate $X^k$ depends only on $\mathcal S _ 0, \ldots, \mathcal S _ {k-1}$) gives

\[\mathbb E_k[\,\cdot\,] := \mathbb E_{\mathcal S_0, \ldots, \mathcal S_k}[\,\cdot\,] = \mathbb E_{k-1}\big[\mathbb E_{\mathcal S_k}[\,\cdot\,]\big]\]

and hence

\[\mathbb E_k[f(X^{k+1})] \le \mathbb E_{k-1}[f(X^k)] - \frac \alpha 2 \mathbb E_{k-1} [||\nabla f(X^k)||^2] + \frac{ML\alpha^2}{2}\]

Now we have that for all $k \ge 0$, the expected decrease satisfies

\[\mathbb E_{k-1} [f(X^k)] - \mathbb E_k [f(X^{k+1})] \ge \frac \alpha 2 \mathbb E_{k-1} [||\nabla f(X^k)||^2] - \frac{ML\alpha^2}{2}\]

Then, applying this for $i = 0, \ldots, k$ (with $\mathbb E _ {-1}$ understood as no expectation, so that $\mathbb E _ {-1}[f(X^0)] = f(x^0)$), summing the resulting telescoping inequalities and using that $\mathbb E _ k[f(X^{k+1})] \ge f _ \text{low}$, we deduce

\[\begin{aligned} f(x^0) - f_\text{low} &\ge f(x^0) - \mathbb E_k[f(X^{k+1})] \\\\ &\ge \frac \alpha 2 \sum^k_{i = 0} \mathbb E_{i-1} [||\nabla f(X^i)||^2] - (k+1)\frac{ML\alpha^2}{2} \\\\ &\ge \frac \alpha 2 (k+1) \left[\min_{0 \le i \le k} \mathbb E[||\nabla f(X^i)||^2] - M L \alpha\right] \end{aligned}\]
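
Rearranging the last display and substituting $\alpha = \eta / L$ then gives

\[\min_{0 \le i \le k} \mathbb E[||\nabla f(X^i)||^2] \le ML\alpha + \frac{2(f(x^0) - f_\text{low})}{\alpha(k+1)} = \eta M + \frac{2L(f(x^0) - f_\text{low})}{(k+1)\eta}\]

which is the claimed bound with $k$ relabelled as $k + 1$. In particular, once $k + 1 \ge \frac{2L(f(x^0) - f_\text{low})}{\eta \varepsilon}$, the second term is at most $\varepsilon$, giving the stated iteration count.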

Convergence in the strongly-convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

@State a theorem about the convergence of SGD in this $\gamma$-strongly convex case.


Suppose:

  • All the above assumptions hold
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • $ \vert \mathcal S _ k \vert = 1$
  • The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$

Then

\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]
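
A simulation sketch of the behaviour this describes (my own construction on a synthetic strongly convex least-squares problem, so all constants are illustrative): with a fixed step size the estimated gap $\mathbb E[f(X^k)] - f(x^\ast)$ drops quickly and then levels off at a positive value rather than converging to zero. For reference the script also prints $\frac{\eta M}{2\gamma}$ with $M$ estimated by the gradient variance at $x^\ast$ (a stand-in for the uniform bound in Assumption 3).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 5
A = rng.normal(size=(m, n))
b = A @ (10.0 * rng.normal(size=n)) + rng.normal(size=m)

# f(x) = (1/m) sum_j 0.5*(a_j @ x - b_j)^2 : gamma-strongly convex, each f_j is L-smooth
gamma = np.linalg.eigvalsh(A.T @ A / m).min()
L = max(np.linalg.norm(A[j]) ** 2 for j in range(m))
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

eta = 0.5
alpha = eta / L

# Gradient variance at x_star (where nabla f is essentially zero), as a proxy for M
g_star = A * (A @ x_star - b)[:, None]          # row j is grad f_j(x_star)
M_hat = np.mean(np.sum(g_star ** 2, axis=1))

runs, iters = 100, 2000
gap = np.zeros(iters + 1)
for _ in range(runs):
    x = np.zeros(n)
    gap[0] += f(x) - f(x_star)
    for k in range(iters):
        j = rng.integers(m)                      # |S_k| = 1
        x = x - alpha * A[j] * (A[j] @ x - b[j])
        gap[k + 1] += f(x) - f(x_star)
gap /= runs                                      # estimate of E[f(X^k)] - f(x*)

print("gap at k = 10, 100, 1000, 2000:", gap[[10, 100, 1000, 2000]])
print("reference noise floor eta*M/(2*gamma):", eta * M_hat / (2 * gamma))
```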

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

@Prove that if:

  • All the above assumptions hold
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • $ \vert \mathcal S _ k \vert = 1$
  • The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$

then:

\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]

The earlier bound $(2\star)$ and the fact that $\frac{L \alpha}{2} - 1 = \frac \eta 2 - 1 \le -\frac 1 2$ give that

\[\mathbb E_{\mathcal S_k} [f(X^{k+1})] \le f(X^k) - \frac \alpha 2 ||\nabla f(X^k)||^2 + \frac{M L \alpha^2}{2}\]

Taking expectation with respect to the past samples $\mathcal S _ 0, \ldots, \mathcal S _ {k-1}$ on both sides of the above, and writing $\mathbb E _ k$ for the total expectation over the first $k + 1$ samples, the tower property (together with the fact that the iterate $X^k$ depends only on $\mathcal S _ 0, \ldots, \mathcal S _ {k-1}$) gives

\[\mathbb E_k[\,\cdot\,] := \mathbb E_{\mathcal S_0, \ldots, \mathcal S_k}[\,\cdot\,] = \mathbb E_{k-1}\big[\mathbb E_{\mathcal S_k}[\,\cdot\,]\big]\]

and hence

\[\mathbb E_k [f(X^{k+1})] - f(x^\ast) \le \mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac \alpha 2 \mathbb E_{k-1}[||\nabla f(X^k)||^2] + \frac{ML\alpha^2}{2}\]

As $f$ is strongly convex, the global minimiser $x^\ast$ is unique and

\[f(X^k) - f(x^\ast) \le \frac{1}{2\gamma}||\nabla f(X^k)||^2\]

and thus

\[2\gamma\, \mathbb E_{k-1} [f(X^k) - f(x^\ast)] \le \mathbb E_{k-1}[||\nabla f(X^k)||^2]\]

And therefore we deduce

\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) \le (1 - \gamma \alpha)(\mathbb E_{k-1}[f(X^k)] - f(x^\ast)) + \frac{ML\alpha^2}{2}\]

or equivalently,

\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha M L}{2\gamma} \le (1-\gamma \alpha) \left(\mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha M L}{2\gamma}\right)\]

Then, since $\alpha = \eta / L \le \frac 1 L \le \frac 1 \gamma$ (so that $0 \le 1 - \gamma\alpha < 1$), substituting $\alpha = \eta / L$ gives

\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{M\eta}{2\gamma} \le \left(1 - \frac{\eta \gamma}{L} \right) \left( \mathbb E_{k-1} [f(X^k)] - f(x^\ast) - \frac{M\eta}{2\gamma} \right)\]

and then the claim follows by induction.

Decreasing the noise floor

Dynamic step-size

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

Then if

  • All the above assumptions hold
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • $ \vert \mathcal S _ k \vert = 1$
  • The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$

We have the result that

\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]

The only problem with this is that it only guarantees convergence up to the “noise floor” $\frac{\eta M}{2\gamma}$. @State a result about how using a variable step size $\alpha^k$ can guarantee actual convergence in expectation, at the cost of sublinear convergence.


Let everything be as above but set

\[\alpha^k = \frac{2}{2L + k \gamma}\]

for $k \ge 0$. Then SGD satisfies, for all $k \ge 0$,

\[0 \le \mathbb E[f(X^k)] - f(x^\ast) \le \frac{v}{\frac {2L} \gamma + k}\]

where

\[v := \frac{2L}{\gamma} \max \\{\frac M \gamma, f(x^0) - f(x^\ast)\\}\]
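
A companion sketch to the simulation idea above (again my own illustration on a synthetic least-squares problem): with the fixed step $\alpha = \eta / L$ the estimated gap levels off across the checkpoints, whereas the decreasing schedule $\alpha^k = \frac{2}{2L + k \gamma}$ keeps reducing it, in line with the sublinear guarantee.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 200, 5
A = rng.normal(size=(m, n))
b = A @ (10.0 * rng.normal(size=n)) + rng.normal(size=m)

gamma = np.linalg.eigvalsh(A.T @ A / m).min()
L = max(np.linalg.norm(A[j]) ** 2 for j in range(m))
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

checkpoints = [2_000, 10_000, 50_000]

def avg_gaps(step, runs=10):
    """Average of f(X^k) - f(x*) at the checkpoints for the step-size rule step(k)."""
    gaps = np.zeros(len(checkpoints))
    for _ in range(runs):
        x = np.zeros(n)
        for k in range(checkpoints[-1]):
            j = rng.integers(m)                  # |S_k| = 1
            x = x - step(k) * A[j] * (A[j] @ x - b[j])
            if k + 1 in checkpoints:
                gaps[checkpoints.index(k + 1)] += (f(x) - f(x_star)) / runs
    return gaps

eta = 0.5
print("fixed alpha = eta/L        :", avg_gaps(lambda k: eta / L))
print("alpha^k = 2/(2L + k*gamma) :", avg_gaps(lambda k: 2.0 / (2 * L + k * gamma)))
```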

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

Then if

  • All the above assumptions hold
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • $ \vert \mathcal S _ k \vert = 1$
  • The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$

We have the result that

\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]

The only problem with this is that it only guarantees convergence up to the “noise floor” $\frac{\eta M}{2\gamma}$. @Prove that if you let

\[\alpha^k = \frac{2}{2L + k \gamma}\]

for $k \ge 0$, then SGD satisfies, for all $k \ge 0$

\[0 \le \mathbb E[f(X^k)] - f(x^\ast) \le \frac{v}{\frac {2L} \gamma + k}\]

where

\[v := \frac{2L}{\gamma} \max \\{\frac M \gamma, f(x^0) - f(x^\ast)\\}\]

The proof is identical to the proof in the ordinary $\gamma$-strongly convex case up to the point where we establish that

\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha M L}{2\gamma} \le (1-\gamma \alpha) \left(\mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha M L}{2\gamma}\right)\]

Then, since $\alpha^k \le \frac 1 L \le \frac 1 \gamma$, for all $k \ge 0$, we have

\[\mathbb E_k [f(X^{k+1})] - f(x^\ast) - \frac{\alpha^k M L}{2\gamma} \le (1 - \gamma \alpha^k) \left( \mathbb E_{k-1}[f(X^k)] - f(x^\ast) - \frac{\alpha^k ML}{2\gamma} \right)\]

Now proceed by induction. At $k = 0$ the result holds, since $f(x^0) - f(x^\ast) \le \frac{v}{2L/\gamma}$ by the definition of $v$. Now assume it holds at some $k \ge 0$, and substitute into the above equation to obtain

\[\mathbb E_k[f(X^{k+1})] - f(x^\ast) - \frac{\alpha^k M L}{2\gamma} \le (1-\gamma \alpha^k)\left( \frac{v}{\frac{2L}{\gamma} + k} - \frac{\alpha^k ML}{2\gamma} \right)\]

Then substituting the expression for $\alpha^k$ into the above and simplifying (the algebra is written out below) establishes the result with $k$ replaced by $k+1$.
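
Explicitly, writing $c := \frac{2L}{\gamma}$, so that $\alpha^k = \frac{2}{\gamma(c + k)}$ and $1 - \gamma \alpha^k = \frac{c + k - 2}{c + k}$, moving the $\frac{\alpha^k M L}{2\gamma}$ term back to the right-hand side gives

\[\begin{aligned} \mathbb E_k[f(X^{k+1})] - f(x^\ast) &\le (1 - \gamma\alpha^k)\frac{v}{c + k} + \frac{(\alpha^k)^2 ML}{2} \\\\ &= \frac{(c + k - 2)v + 2ML/\gamma^2}{(c+k)^2} \\\\ &\le \frac{(c + k - 1)v}{(c+k)^2} \qquad \text{(using } v \ge 2LM/\gamma^2 \text{)} \\\\ &\le \frac{v}{c + k + 1} \qquad \text{(using } (c+k-1)(c+k+1) \le (c+k)^2 \text{)} \end{aligned}\]

which is the claimed bound at $k + 1$.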

Increasing batch size

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x) = \frac 1 m \sum^m_{j = 1} f_j(x)\]

where each $f _ j : \mathbb R^n \to \mathbb R$ is continuously differentiable for $j \in \{1, \ldots, m\}$. In this context, stochastic gradient descent proceeds as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Sample $\mathcal S _ k$ i.i.d. from $U(\{1, \ldots, m\})$

(2): Calculate

\[\begin{aligned}g^k &= \nabla f _ {\mathcal S _ k} (x^k) \\\\&= \frac 1 {m _ k} \sum _ {j \in \mathcal S _ k} \nabla f _ j(x^k)\end{aligned}\]

(3): Update

\[x^{k+1} = x^k - \alpha _ k g^k\]

We also have the following three assumptions used in our analysis of SGD in the general case:

  1. $G^k$ is an unbiased estimator of the true gradient, where the expectation is taken over the current batch: $\mathbb E _ {\mathcal S _ k} [G^k] = \nabla f(X^k)$.
  2. For all $j \in \{1, \ldots, m\}$, $f _ j$ is $L$-smooth, i.e. $\nabla f _ j$ is $L$-Lipschitz continuous (and hence $f$ is $L$-smooth by the triangle inequality).
  3. There exists $M > 0$ such that $\text{Var}(G^k \mid \mathcal S _ k) := \mathbb E _ {\mathcal S _ k} [(G^k - \nabla f(X^k))^\top (G^k - \nabla f(X^k))] \le M$ for all $k$.

If

  • All the above assumptions hold
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • $ \vert \mathcal S _ k \vert = 1$
  • The step size is fixed at $\alpha = \eta / L$ where $\eta \in (0, 1]$

then

\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma} \right]\]

How do things change if the mini-batch size $ \vert \mathcal S _ k \vert $ is $p$ rather than $1$?


\[\mathbb E[f(X^k)] - f(x^\ast) - \frac{\eta M}{2\gamma p} \le \left( 1 - \frac{\eta \gamma}{L} \right)^k \cdot \left[ f(x^0) - f(x^\ast) - \frac{\eta M}{2\gamma p} \right]\]
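
The reason for the improvement is that $G^k$ is now an average of $p$ i.i.d. single-sample gradients, so its variance bound improves from $M$ to $M / p$ and the noise floor shrinks accordingly. A quick check of this scaling (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 100, 4
A, b = rng.normal(size=(m, n)), rng.normal(size=m)
grad_j = lambda j, x: A[j] * (A[j] @ x - b[j])

x = rng.normal(size=n)
full_grad = np.mean([grad_j(j, x) for j in range(m)], axis=0)

def batch_grad_variance(p, samples=20_000):
    """Empirical E||G - grad f(x)||^2 for a mini-batch of size p (indices i.i.d. uniform)."""
    errs = np.empty(samples)
    for s in range(samples):
        batch = rng.integers(0, m, size=p)
        g = np.mean([grad_j(j, x) for j in batch], axis=0)
        errs[s] = np.sum((g - full_grad) ** 2)
    return errs.mean()

v1 = batch_grad_variance(1)
for p in (2, 5, 10):
    print(f"p = {p:2d}   variance = {batch_grad_variance(p):.4f}   v1/p = {v1 / p:.4f}")
```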


