Notes - Optimisation for Data Science HT25, Accelerated methods



Flashcards

Heavy ball method

Suppose we wish to minimise the convex quadratic function

\[f(x) = \frac 1 2 x^\top A x - b^\top x\]

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$.

In this context, @define the general template for iterative updates used in the heavy ball method.


\[x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})\]

where $\beta _ k > 0$ is some constant (and initially, a steepest descent update is performed to avoid the need for two starting points).
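As an illustration, here is a minimal NumPy sketch of the heavy ball iteration on this quadratic, using the constant step lengths from the result below; `A`, `b`, and `x0` are illustrative placeholders:

```python
import numpy as np

def heavy_ball(A, b, x0, num_iters=200):
    """Minimise f(x) = 0.5 x^T A x - b^T x with the heavy ball method."""
    gamma, L = np.linalg.eigvalsh(A)[[0, -1]]  # smallest/largest eigenvalues
    alpha = 4 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

    grad = lambda x: A @ x - b
    x_prev = x0
    x = x0 - alpha * grad(x0)  # first iterate: a plain steepest descent step
                               # (the step length choice here is illustrative)
    for _ in range(num_iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x
```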

Suppose we wish to minimise the convex quadratic function

\[f(x) = \frac 1 2 x^\top A x - b^\top x\]

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form

\[x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})\]

@State a theorem about the convergence of the heavy ball method.


There exists a constant $C > 0$ such that the heavy ball method, applied with constant step lengths

\[\begin{aligned} \alpha_k &= \frac{4}{(\sqrt L + \sqrt \gamma)^2} \\\\ \beta_k &= \beta := \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma} \end{aligned}\]

produces iterates satisfying

\[||x^k - x^\ast|| \le C\beta^k\]

for all $k$.

Suppose we wish to minimise the convex quadratic function

\[f(x) = \frac 1 2 x^\top A x - b^\top x\]

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form

\[x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})\]

There is a result that says there is some constant $C > 0$ such that the heavy ball method, applied with constant step lengths

\[\begin{aligned} \alpha_k &= \frac{4}{(\sqrt L + \sqrt \gamma)^2} \\\\ \beta_k &= \beta := \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma} \end{aligned}\]

produces iterates satisfying

\[||x^k - x^\ast|| \le C\beta^k\]

for all $k$. This gives a result about the convergence of $x^k$ to $x^\ast$. Can you @State and @prove a theorem about the convergence of the objective value, possibly under some additional assumptions?


\[\begin{aligned} f(x^k) - f(x^\ast) &\le \langle \nabla f(x^\ast), x^k - x^\ast \rangle + \frac L 2 ||x^k - x^\ast||^2 &&(1\star) \\\\ &\le \frac{LC^2}{2} \beta^{2k} &&(2\star) \\\\ &\approx \frac{LC^2}{2} \left( 1 - 2 \sqrt{\frac \gamma L} \right)^{2k} &&(3\star) \end{aligned}\]

where:

  • $(1\star)$ follows from the fact $f$ is $L$-smooth
  • $(2\star)$ follows from the fact that $\nabla f(x^\ast) = 0$ and by the above result
  • $(3\star)$ follows from an approximation when $L \gg \gamma$
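As a quick numerical sanity check (reusing the `heavy_ball` sketch from above; the random quadratic is purely illustrative), the objective gap should decay geometrically:

```python
import numpy as np

# Illustrative check: on a positive definite quadratic, f(x^k) - f(x*)
# should decay roughly like beta^(2k), as in the bound above.
rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 50))
A = Q @ Q.T + 0.01 * np.eye(50)   # symmetric positive definite
b = rng.standard_normal(50)
x_star = np.linalg.solve(A, b)    # exact minimiser: A x = b
f = lambda x: 0.5 * x @ A @ x - b @ x

x = heavy_ball(A, b, x0=np.zeros(50), num_iters=500)
print(f(x) - f(x_star))           # should be a tiny positive number
```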

Nesterov acceleration for $L$-smooth $\gamma$-strongly convex functions

Suppose we wish to minimise a strongly convex objective $f(x)$

\[\min_{x \in \mathbb R^n} f(x)\]

In this context, @define the general template of the iterative updates used in Nesterov’s accelerated gradient method.


\[x^{k+1} = x^k - \alpha_k \nabla f(x^k + \beta_k (x^k - x^{k-1})) + \beta_k(x^k - x^{k-1})\]

Suppose we wish to minimise a $\gamma$-strongly convex and $L$-smooth objective $f(x)$

\[\min_{x \in \mathbb R^n} f(x)\]

In this context, Nesterov’s accelerated gradient method uses updates of the form

\[x^{k+1} = x^k - \alpha_k \nabla f(x^k + \beta_k (x^k - x^{k-1})) + \beta_k(x^k - x^{k-1})\]

In this context, @state a result about the convergence of $f(x^k)$.


Nesterov’s accelerated gradient method with constant step lengths

\[\begin{aligned} \alpha_k &= \frac 1 L \\\\ \beta_k &= \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma} \end{aligned}\]

produces iterates satisfying

\[f(x^k) - f(x^\ast) \le \frac{L + \gamma}{2} ||x^0 - x^\ast||^2 \times \left( 1 - \sqrt{\frac \gamma L} \right)^k\]

for all $k \ge 1$.
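A minimal sketch of this constant-step variant in NumPy, assuming $\gamma$ and $L$ are known; the gradient is passed in as a placeholder:

```python
import numpy as np

def nesterov_strongly_convex(grad, x0, gamma, L, num_iters=500):
    """Nesterov's method with the constant step lengths above."""
    alpha = 1 / L
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev, x = x0, x0
    for _ in range(num_iters):
        y = x + beta * (x - x_prev)          # extrapolation point
        x, x_prev = y - alpha * grad(y), x   # gradient step at y
    return x
```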

Nesterov acceleration for $L$-smooth convex functions

@State the @algorithm corresponding to Nesterov acceleration for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$.


  • Initialisation:
    • Find $z \in \mathbb R^n$ with $z \ne y^0$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
    • $k := 0$
    • $\lambda _ 0 := 1$
    • $x^{-1} := y^0$
  • Main body: While $ \vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
    • Find the minimal $i \in \mathbb N$ such that
    • $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
    • $\alpha _ k := 2^{-i} \alpha _ {k-1}$
    • $x^k := y^k - \alpha _ k \nabla f(y^k)$
    • $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
    • $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
    • $k := k + 1$

Intuitively, this algorithm produces two sequences of iterates, $x^k$ and $y^k$. The points $x^k$ are obtained as the minimisers of the quadratic upper bound $m _ {k, \alpha}^u(y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1} }{2} \vert \vert y - y^k \vert \vert ^2$ (justified below), and the $y^k$ are approximate second order corrections to the $x^k$.
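A runnable sketch of this procedure, with some hedges: the auxiliary point `z` used to initialise $\alpha _ {-1}$ and the values of `tol` and `max_iters` are illustrative choices, not part of the algorithm as stated:

```python
import numpy as np

def nesterov_backtracking(f, grad, y0, tol=1e-8, max_iters=10_000):
    """Nesterov acceleration with backtracking for an L-smooth convex f."""
    # Initialisation: estimate alpha_{-1} from an auxiliary point z != y0
    # (any z with grad(z) != grad(y0) works; this choice is illustrative).
    z = y0 + np.ones_like(y0)
    alpha = np.linalg.norm(y0 - z) / np.linalg.norm(grad(y0) - grad(z))
    lam = 1.0            # lambda_0
    x_prev, y = y0, y0   # x^{-1} := y^0

    for _ in range(max_iters):
        # Backtracking: halve alpha until the sufficient decrease condition
        # f(y - alpha g) <= f(y) - (alpha / 2) ||g||^2 holds.
        g = grad(y)
        while f(y - alpha * g) > f(y) - 0.5 * alpha * (g @ g):
            alpha /= 2
        x = y - alpha * g                               # gradient step at y
        lam_next = 0.5 * (1 + np.sqrt(4 * lam**2 + 1))  # lambda update
        y = x + ((lam - 1) / lam_next) * (x - x_prev)   # momentum step
        if np.linalg.norm(x - x_prev) <= tol:
            return x
        x_prev, lam = x, lam_next
    return x
```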

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:

  • Initialisation:
    • Find $z \in \mathbb R^n$ with $z \ne y^0$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
    • $k := 0$
    • $\lambda _ 0 := 1$
    • $x^{-1} := y^0$
  • Main body: While $ \vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
    • Find the minimal $i \in \mathbb N$ such that
    • $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
    • $\alpha _ k := 2^{-i} \alpha _ {k-1}$
    • $x^k := y^k - \alpha _ k \nabla f(y^k)$
    • $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
    • $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
    • $k := k + 1$

@Prove that $\alpha _ k$ is non-increasing and that it stops decreasing once $\alpha _ k \le L^{-1}$, and then motivate this process.


The sequence is non-increasing because of the step $\alpha _ k := 2^{-i} \alpha _ {k-1}$ with $i \ge 0$. Once $\alpha _ {k-1} \le L^{-1}$, $L$-smoothness gives

\[\begin{aligned} f(y^k - \alpha_{k-1} \nabla f(y^k)) &\le f(y^k) + \langle \nabla f(y^k), -\alpha_{k-1} \nabla f(y^k)\rangle + \frac L 2 ||\alpha_{k-1} \nabla f(y^k)||^2 \\\\ &\le f(y^k) - \frac{\alpha_{k-1} }{2} ||\nabla f(y^k)||^2 \end{aligned}\]

Therefore the sufficient decrease condition

\[f(y^k - 2^{-i} \alpha_{k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha_{k-1} ||\nabla f(y^k)||^2\]

is satisfied with $i = 0$, and hence $\alpha _ k = \alpha _ {k-1}$: the step length stops decreasing.

If $L$ were known in advance, $\alpha _ k$ could simply be fixed at the optimal value $\alpha = \frac 1 L$; instead, this algorithm learns the value of $L$ over time. The condition $\alpha \le L^{-1}$ is exactly what is needed to ensure that

\[m_{k, \alpha}^u (y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1} }{2} ||y - y^k||^2\]

is an upper bound on $f(y)$.
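Explicitly, $L$-smoothness gives, for all $y$,

\[f(y) \le f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac L 2 ||y - y^k||^2 \le m_{k, \alpha}^u(y) \quad \text{whenever } \alpha^{-1} \ge L\]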

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:

  • Initialisation:
    • Find $z \in \mathbb R^n$ with $z \ne y^0$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
    • $k := 0$
    • $\lambda _ 0 := 1$
    • $x^{-1} := y^0$
  • Main body: While $ \vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
    • Find the minimal $i \in \mathbb N$ such that
    • $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
    • $\alpha _ k := 2^{-i} \alpha _ {k-1}$
    • $x^k := y^k - \alpha _ k \nabla f(y^k)$
    • $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
    • $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
    • $k := k + 1$

Can you @justify the update for $x^k$:

  • $x^k := y^k - \alpha _ k \nabla f(y^k)$

$x^k$ is the global minimiser of the function

\[m_{k, \alpha}^u (y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1} }{2} ||y - y^k||^2\]

which is an upper bound on $f$ given that $\alpha _ k \le L^{-1}$.
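To see this, note that $m _ {k, \alpha _ k}^u$ is a strictly convex quadratic in $y$, so its global minimiser is found by setting the gradient to zero:

\[\nabla m_{k, \alpha_k}^u (y) = \nabla f(y^k) + \alpha_k^{-1}(y - y^k) = 0 \iff y = y^k - \alpha_k \nabla f(y^k) = x^k\]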

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:

  • Initialisation:
    • Find $z \in \mathbb R^n$ with $z \ne y^0$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
    • $k := 0$
    • $\lambda _ 0 := 1$
    • $x^{-1} := y^0$
  • Main body: While $ \vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
    • Find the minimal $i \in \mathbb N$ such that
    • $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
    • $\alpha _ k := 2^{-i} \alpha _ {k-1}$
    • $x^k := y^k - \alpha _ k \nabla f(y^k)$
    • $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
    • $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
    • $k := k + 1$

@State a result related to its convergence.


Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is a convex $L$-smooth function
  • $f$ has at least one finite minimiser $x^\ast \in \text{argmin } f(x)$
  • $f$ has a finite minimum $f^\ast = \min f(x)$

Then:

  • The sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies $f(x^k) \le f^\ast + \frac{4L \vert \vert x^0 - x^\ast \vert \vert ^2}{(k+2)^2}$ for all $k \in \mathbb N$, so
  • $x^k$ is $\varepsilon$-optimal for all $k \ge N(\varepsilon) = \left\lfloor \frac{2\sqrt L \times \vert \vert x^0 - x^\ast \vert \vert }{\sqrt \varepsilon} \right\rfloor - 1$
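For a feel for the numbers, here is a small helper implementing the bound $N(\varepsilon)$ (the values plugged in are illustrative):

```python
import math

def n_eps(L, dist0, eps):
    """Iterations after which x^k is guaranteed eps-optimal."""
    return math.floor(2 * math.sqrt(L) * dist0 / math.sqrt(eps)) - 1

print(n_eps(L=100.0, dist0=1.0, eps=1e-6))  # 19999
```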

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:

  • Initialisation:
    • Find $z \in \mathbb R^n$ with $z \ne y^0$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
    • $k := 0$
    • $\lambda _ 0 := 1$
    • $x^{-1} := y^0$
  • Main body: While $ \vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
    • Find the minimal $i \in \mathbb N$ such that
    • $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
    • $\alpha _ k := 2^{-i} \alpha _ {k-1}$
    • $x^k := y^k - \alpha _ k \nabla f(y^k)$
    • $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
    • $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
    • $k := k + 1$

@Prove that if

  • $f : \mathbb R^n \to \mathbb R$ is a convex $L$-smooth function
  • $f$ has at least one finite minimiser $x^\ast \in \text{argmin } f(x)$
  • $f$ has a finite minimum $f^\ast = \min f(x)$

then:

  • The sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies $f(x^k) \le f^\ast + \frac{4L \vert \vert x^0 - x^\ast \vert \vert ^2}{(k+2)^2}$ for all $k \in \mathbb N$, so
  • $x^k$ is $\varepsilon$-optimal for all $k \ge N(\varepsilon) = \left\lfloor \frac{2\sqrt L \times \vert \vert x^0 - x^\ast \vert \vert }{\sqrt \varepsilon} \right\rfloor - 1$

It is easiest to analyse the algorithm in terms of

\[p^k = (\lambda_k - 1)(x^{k-1} - x^k)\]

We have:

\[y^{k+1} = x^k + \frac{\lambda_k - 1}{\lambda_{k+1} }(x^k - x^{k-1})\]

and

\[\begin{aligned} x^{k+1} &= y^{k+1} - \alpha_{k+1} \nabla f(y^{k+1}) \\\\ &= x^k + \frac{\lambda_k - 1}{\lambda_{k+1} } (x^k - x^{k-1}) - \alpha_{k+1} \nabla f(y^{k+1}) \end{aligned}\]

Hence by considering $p^k$, we have

\[\begin{aligned} p^{k+1} - x^{k+1} &= (\lambda_{k+1} - 1)(x^k - x^{k+1}) - x^{k+1} \\\\ &= \lambda_{k+1} (x^k - x^{k+1}) - x^k \\\\ &= (1 - \lambda_k) (x^k - x^{k-1}) - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}) &&(1\star) \\\\ &= p^k - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}) \end{aligned}\]

where $(1\star)$ follows from the expression for $x^{k+1}$ above. Hence

\[\begin{aligned} &||p^{k+1} - x^{k+1} + x^\ast||^2\\\\ =&||p^k - x^k + x^\ast||^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 + 2\lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k - x^k + x^\ast \rangle \\\\ =&||p^k - x^k + x^\ast||^2 + \lambda^2_{k+1} \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 + 2(\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k\rangle + 2\lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^\ast - x^k + \lambda^{-1}_{k+1} p^k \rangle \\\\ =&||p^k - x^k + x^\ast||^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^\ast - y^{k+1} \rangle \end{aligned}\]

where the last equality follows from the identity

\[\begin{aligned} y^{k+1} &= x^k + \frac{\lambda_k - 1}{\lambda_{k+1} } (x^k - x^{k-1} ) \\\\ &= x^k - \frac{1}{\lambda_{k+1} } p^k \end{aligned}\]

Now we wish to derive bounds on the last two terms on the right hand side of the big equation.

To bound $2\lambda _ {k+1} \alpha _ {k+1} \langle \nabla f(y^{k+1}), x^\ast - y^{k+1} \rangle$ (the second of the terms), note that by the convexity of $f$, we have

\[\langle \nabla f(y^{k+1}), x^\ast - y^{k+1}\rangle \le f^\ast - f(y^{k+1})\]

The sufficient decrease condition yields that

\[f(y^{k+1}) - f(x^{k+1}) \ge \frac{\alpha_{k+1} }{2} ||\nabla f(y^{k+1})||^2\]

and so substituting this into the convexity bound, we have that

\[\langle \nabla f(y^{k+1}), x^\ast - y^{k+1}\rangle \le -(f(x^{k+1}) - f^\ast) - \frac{\alpha_{k+1}}{2} ||\nabla f(y^{k+1})||^2\]

To bound $2(\lambda _ {k+1} - 1) \alpha _ {k+1} \langle \nabla f(y^{k+1}), p^k\rangle$ (the first of the terms), note that by the definition of $y^{k+1}$, we have that

\[\begin{aligned} y^{k+1} &= x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} (x^k - x^{k-1}) \\\\ &= x^k - \lambda^{-1}_{k+1} p^k \end{aligned}\]

And then convexity at $x^k$ implies that

\[f(y^{k+1}) + \lambda^{-1}_{k+1} \langle \nabla f(y^{k+1}), p^k\rangle \le f(x^k)\]

so by $f(y^{k+1}) - f(x^{k+1}) \ge \frac{\alpha _ {k+1} }{2} \vert \vert \nabla f(y^{k+1}) \vert \vert ^2$ above, we see that

\[\frac{\alpha_{k+1}}{2} ||\nabla f(y^{k+1})||^2 \le f(x^k) - f(x^{k+1}) - \lambda^{-1}_{k+1} \langle \nabla f(y^{k+1}), p^k\rangle\]

Combining everything so far, we have that

\[\begin{aligned} &||p^{k+1} - x^{k+1} + x^\ast||^2 - ||p^k - x^k + x^\ast||^2 \\\\ \le&2(\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle \\\\ & - 2\lambda_{k+1} \alpha_{k+1} (f(x^{k+1}) - f^\ast) \\\\ & + (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 \\\\ \le& {-2}\lambda_{k+1} \alpha_{k+1} (f(x^{k+1}) - f^\ast) + 2(\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1} (f(x^k) - f(x^{k+1})) \\\\ =&2\alpha_{k+1} \lambda^2_k (f(x^k) - f^\ast) - 2\alpha_{k+1} \lambda^2_{k+1} (f(x^{k+1}) - f^\ast) \\\\ \le&2\alpha_k \lambda_k^2 (f(x^k) - f^\ast) - 2\alpha_{k+1} \lambda_{k+1}^2 (f(x^{k+1}) - f^\ast) \end{aligned}\]

Here: the first inequality applies the expansion of $||p^{k+1} - x^{k+1} + x^\ast||^2$ together with the bound on $\langle \nabla f(y^{k+1}), x^\ast - y^{k+1}\rangle$; the second inequality applies the bound on $\langle \nabla f(y^{k+1}), p^k \rangle$, whose $||\nabla f(y^{k+1})||^2$ contribution exactly cancels the remaining gradient term; the equality uses $\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2$, obtained by rearranging the update $\lambda _ {k+1} = \frac 1 2 (1 + \sqrt{4\lambda _ k^2 + 1})$; and the final inequality uses $\alpha _ {k+1} \le \alpha _ k$ and $f(x^k) \ge f^\ast$.

Applying this inequality iteratively, we find

\[\begin{aligned} 2 \alpha_{k+1} \lambda^2_{k+1} (f(x^{k+1}) - f^\ast) &\le 2\alpha_{k+1} \lambda_{k+1}^2 (f(x^{k+1}) - f^\ast) + ||p^{k+1} - x^{k+1} + x^\ast||^2 \\\\ &\le 2\alpha_k \lambda_k^2 (f(x^k) - f^\ast) + ||p^k - x^k + x^\ast||^2 \\\\ &\le \cdots \\\\ &\le 2\alpha_0 \lambda_0^2 (f(x^0) - f^\ast) + ||p^0 - x^0 + x^\ast||^2 \\\\ &\le ||y^0 - x^\ast||^2 \end{aligned}\]

where the last inequality follows from $\lambda _ 0 = 1$, $p^0 = (\lambda _ 0 - 1)(x^{-1} - x^0) = 0$, and

\[\begin{aligned} ||p^0 - x^0 + x^\ast||^2 &= ||x^0 - x^\ast||^2 \\\\ &= ||y^0 - \alpha_0 \nabla f(y^0) - x^\ast||^2 \\\\ &= ||y^0 - x^\ast||^2 + \alpha_0^2 ||\nabla f(y^0)||^2 - 2 \alpha_0 \langle \nabla f(y^0), y^0 - x^\ast \rangle \\\\ &\le ||y^0 - x^\ast||^2 + \alpha_0^2 ||\nabla f(y^0)||^2 - 2\alpha_0 \left( f(x^0) - f^\ast + \frac{\alpha_0}{2} ||\nabla f(y^0)||^2 \right) \\\\ &= ||y^0 - x^\ast||^2 - 2\alpha_0 \lambda_0^2 (f(x^0) - f^\ast) \end{aligned}\]

where the inequality step uses $\langle \nabla f(y^0), y^0 - x^\ast \rangle \ge f(y^0) - f^\ast \ge f(x^0) - f^\ast + \frac{\alpha_0}{2} \vert \vert \nabla f(y^0) \vert \vert ^2$ (convexity, then the sufficient decrease condition).

Since $\lambda _ {k+1} \ge \lambda _ k + \frac 1 2$, induction from $\lambda _ 0 = 1$ gives $\lambda _ {k+1} \ge 1 + \frac{k+1}{2}$, and the backtracking halves the step length at most once past $L^{-1}$, so $\alpha _ {k+1} \ge \frac{1}{2L}$ (provided $\alpha _ {-1}$ is not already smaller). Hence

\[2 \alpha_{k+1} \left(1 + \frac{k+1}{2}\right)^2 (f(x^{k+1}) - f^\ast) \le ||y^0 - x^\ast||^2\]

and so

\[f(x^{k+1}) - f^\ast \le \frac{4L||y^0 - x^\ast||^2}{(k+3)^2}\]

as required.
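As an optional numerical check (reusing the `nesterov_backtracking` sketch from above on an illustrative least-squares objective, which is convex and $L$-smooth):

```python
import numpy as np

# Illustrative check of the O(1/k^2) rate on a least-squares problem.
rng = np.random.default_rng(1)
M = rng.standard_normal((100, 40))
b = rng.standard_normal(100)
f = lambda x: 0.5 * np.sum((M @ x - b) ** 2)
grad = lambda x: M.T @ (M @ x - b)

x = nesterov_backtracking(f, grad, y0=np.zeros(40))
x_star, *_ = np.linalg.lstsq(M, b, rcond=None)
print(f(x) - f(x_star))  # should be near zero
```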



