Notes - Optimisation for Data Science HT25, Accelerated methods



Flashcards

Heavy ball method

Suppose we wish to minimise the convex quadratic function

$$f(x) = \tfrac{1}{2}x^\top A x - b^\top x$$

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$.

In this context, @define the general template for iterative updates used in the heavy ball method.


$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1})$$

where $\beta_k > 0$ is some constant (and initially, a steepest descent update is performed to avoid the need for two starting points).
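
A minimal numpy sketch of this template (the toy quadratic and the particular constant choices of `alpha` and `beta` below are illustrative assumptions; the next cards give a principled choice of constants):

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, iters=100):
    """Heavy ball template: x_{k+1} = x_k - alpha*grad(x_k) + beta*(x_k - x_{k-1}).

    The first update is plain steepest descent, so only one start point is needed.
    """
    x_prev = x0
    x = x0 - alpha * grad(x0)  # initial steepest descent step
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# Toy quadratic f(x) = 0.5 x^T A x - b^T x with eigenvalues in [gamma, L] = [1, 3].
A = np.diag([3.0, 1.0])
b = np.array([1.0, 1.0])
x = heavy_ball(lambda x: A @ x - b, np.zeros(2), alpha=0.25, beta=0.3)
print(x, np.linalg.solve(A, b))  # final iterate vs exact minimiser
```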

Suppose we wish to minimise the convex quadratic function

$$f(x) = \tfrac{1}{2}x^\top A x - b^\top x$$

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1})$$

@State a theorem about the convergence of the heavy ball method.


There exists a constant $C > 0$ such that the heavy ball method, applied with the constant step lengths

$$\alpha_k = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$

satisfies

$$\|x_k - x^*\| \le C\beta^k$$

for all $k$.

Suppose we wish to minimise the convex quadratic function

$$f(x) = \tfrac{1}{2}x^\top A x - b^\top x$$

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1})$$

There is a result that says there is some constant $C > 0$ such that the heavy ball method, applied with the constant step lengths

$$\alpha_k = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$

satisfies

$$\|x_k - x^*\| \le C\beta^k$$

for all $k$. This gives a result about the convergence of $x_k$ to $x^*$. Can you @State and @prove a theorem about the convergence of the objective value, possibly under some additional assumptions?


$$f(x_k) - f(x^*) \overset{(1)}{\le} \nabla f(x^*)^\top (x_k - x^*) + \frac{L}{2}\|x_k - x^*\|^2 \overset{(2)}{\le} \frac{LC^2}{2}\beta^{2k} \overset{(3)}{\approx} \frac{LC^2}{2}\left(1 - 2\sqrt{\frac{\gamma}{L}}\right)^{2k}$$

where:

  • (1) follows from the fact that $f$ is $L$-smooth
  • (2) follows from the fact that $\nabla f(x^*) = 0$ and from the result above
  • (3) follows from the approximation $\beta \approx 1 - 2\sqrt{\gamma/L}$, valid when $L \gg \gamma$
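
As a numerical sanity check of this bound (a sketch only; the random test problem and all sizes below are my assumptions), the ratio $(f(x_k) - f(x^*))/\beta^{2k}$ should remain bounded along the iterates:

```python
import numpy as np

# Random quadratic f(x) = 0.5 x^T A x - b^T x with eigenvalues in [gamma, L].
rng = np.random.default_rng(0)
gamma, L, n = 1.0, 100.0, 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(gamma, L, n)) @ Q.T
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

# Constant step lengths from the result above.
alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

x_prev = np.zeros(n)
x = x_prev - alpha * grad(x_prev)  # initial steepest descent step
for k in range(1, 41):
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    if k % 10 == 0:
        # a bounded ratio is consistent with f(x_k) - f(x*) <= (L C^2 / 2) beta^(2k)
        print(k, (f(x) - f(x_star)) / beta ** (2 * k))
```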

Nesterov acceleration for L-smooth γ-strongly convex functions

Suppose we wish to minimise a strongly convex objective $f(x)$:

$$\min_{x \in \mathbb{R}^n} f(x)$$

In this context, @define the general template of the iterative updates used in Nesterov’s accelerated gradient method.


$$x_{k+1} = x_k - \alpha_k \nabla f\big(x_k + \beta_k(x_k - x_{k-1})\big) + \beta_k(x_k - x_{k-1})$$

Suppose we wish to minimise a $\gamma$-strongly convex and $L$-smooth objective $f(x)$:

$$\min_{x \in \mathbb{R}^n} f(x)$$

In this context, Nesterov’s accelerated gradient method uses updates of the form

$$x_{k+1} = x_k - \alpha_k \nabla f\big(x_k + \beta_k(x_k - x_{k-1})\big) + \beta_k(x_k - x_{k-1})$$

@State candidate constant step lengths and give a result about the convergence of f(xk) in this case.


$$\alpha_k = \frac{1}{L}, \qquad \beta_k = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}}$$

gives

$$f(x_k) - f(x^*) \le \frac{L + \gamma}{2}\left(1 - \sqrt{\frac{\gamma}{L}}\right)^k \|x_0 - x^*\|^2$$

for all $k \ge 1$.
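
A minimal numpy sketch of this variant with the constant step lengths above (the random quadratic test problem is an illustrative assumption):

```python
import numpy as np

# Random gamma-strongly convex, L-smooth quadratic.
rng = np.random.default_rng(0)
gamma, L, n = 1.0, 100.0, 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(gamma, L, n)) @ Q.T
b = rng.standard_normal(n)
grad = lambda x: A @ x - b

alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

x_prev = np.zeros(n)
x = x_prev.copy()
for k in range(300):
    y = x + beta * (x - x_prev)         # extrapolation point x_k + beta*(x_k - x_{k-1})
    x, x_prev = y - alpha * grad(y), x  # gradient step taken at y
print(np.linalg.norm(A @ x - b))        # gradient norm should be near zero
```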

Nesterov acceleration for L-smooth convex functions

@State the @algorithm corresponding to Nesterov acceleration for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$.


  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

Intuitively, this algorithm produces two sequences of iterates, $x_k$ and $y_k$. The points $x_k$ are obtained as the minimisers of a quadratic upper bound function (specifically, the model $m^u_{k,\alpha_k}$ discussed below), and the $y_k$ are approximate second-order corrections to the $x_k$. A runnable sketch of the whole algorithm follows.
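
The following Python sketch implements the algorithm as stated (the function and argument names are mine, and the quadratic test problem at the bottom is an assumption):

```python
import numpy as np

def nesterov(f, grad, y0, z, tol=1e-8, max_iter=5000):
    """Nesterov acceleration for an L-smooth convex f, with the backtracking
    step-length rule from the notes; a sketch, not a tuned implementation."""
    # alpha_{-1} := ||y0 - z|| / ||grad f(y0) - grad f(z)|| estimates 1/L
    alpha = np.linalg.norm(y0 - z) / np.linalg.norm(grad(y0) - grad(z))
    lam = 1.0
    x_prev, y = y0, y0
    for _ in range(max_iter):
        g = grad(y)
        # halve the step until the sufficient decrease condition holds,
        # i.e. find the minimal i with alpha_k = 2^{-i} alpha_{k-1}
        while f(y - alpha * g) > f(y) - 0.5 * alpha * np.dot(g, g):
            alpha *= 0.5
        x = y - alpha * g                              # x_k := y_k - alpha_k grad f(y_k)
        lam_next = 0.5 * (1.0 + np.sqrt(4.0 * lam**2 + 1.0))
        y = x + (lam - 1.0) / lam_next * (x - x_prev)  # extrapolated point y_{k+1}
        if np.linalg.norm(x - x_prev) <= tol:
            return x
        x_prev, lam = x, lam_next
    return x_prev

# Illustrative use on a smooth convex quadratic:
A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
sol = nesterov(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b,
               y0=np.zeros(3), z=np.ones(3))
print(sol, np.linalg.solve(A, b))  # computed vs exact minimiser
```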

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

@Prove that $\alpha_k$ is non-increasing and that it stops decreasing once $\alpha_k \le L^{-1}$, and then motivate this process.


It is non-increasing because of the step $\alpha_k := 2^{-i}\alpha_{k-1}$ with $i \ge 0$. Once $\alpha_{k-1} \le L^{-1}$, $L$-smoothness gives

$$f(y_k - \alpha_{k-1}\nabla f(y_k)) \le f(y_k) + \langle \nabla f(y_k), -\alpha_{k-1}\nabla f(y_k)\rangle + \frac{L}{2}\|\alpha_{k-1}\nabla f(y_k)\|^2 \le f(y_k) - \frac{\alpha_{k-1}}{2}\|\nabla f(y_k)\|^2,$$

where the last inequality uses $\frac{L\alpha_{k-1}^2}{2} \le \frac{\alpha_{k-1}}{2}$. Therefore the condition that

$$f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$$

is satisfied with $i = 0$ and hence $\alpha_k = \alpha_{k-1}$ remains the same.

If $L$ were known in advance, $\alpha_k$ could be set to the optimal value $\alpha = \frac{1}{L}$ from the start, but this algorithm learns the value of $L$ over time. $\alpha \le L^{-1}$ is the condition needed to ensure that

$$m^u_{k,\alpha}(y) = f(y_k) + \langle \nabla f(y_k), y - y_k\rangle + \frac{1}{2\alpha}\|y - y_k\|^2$$

is an upper bound on $f(y)$.
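
To spell out the upper-bound claim: by $L$-smoothness, for all $y$,

$$f(y) \le f(y_k) + \langle \nabla f(y_k), y - y_k\rangle + \frac{L}{2}\|y - y_k\|^2 \le m^u_{k,\alpha}(y),$$

where the second inequality holds because $\frac{L}{2} \le \frac{1}{2\alpha}$ whenever $\alpha \le L^{-1}$.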

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

Can you @justify the update for $x_k$:

  • $x_k := y_k - \alpha_k \nabla f(y_k)$

$x_k$ is the global minimiser of the function

$$\tilde{f}(y) = f(y_k) + \nabla f(y_k)^\top (y - y_k) + \frac{1}{2\alpha_k}\|y - y_k\|^2$$

which is an upper bound on $f$ given that $\alpha_k \le L^{-1}$ (straightforwardly checked by differentiating).
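
Explicitly, $\nabla \tilde{f}(y) = \nabla f(y_k) + \frac{1}{\alpha_k}(y - y_k)$, which vanishes precisely at $y = y_k - \alpha_k \nabla f(y_k) = x_k$; since $\tilde{f}$ is a strictly convex quadratic, this stationary point is its global minimiser.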

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

@State a result related to its convergence.


Suppose:

  • $f : \mathbb{R}^n \to \mathbb{R}$ is a convex $L$-smooth function
  • $f$ has at least one finite minimiser $x^* \in \operatorname{argmin} f(x)$
  • $f$ has a finite minimum $f^* = \min f(x)$

Then for all $k \ge 1$:

$$f(x_k) - f(x^*) \le \frac{4L\|x_0 - x^*\|^2}{(k+2)^2}$$
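
To see where the $\varepsilon$-optimality threshold quoted in the next card comes from: the right-hand side is at most $\varepsilon$ as soon as $(k+2)^2 \ge 4L\|x_0 - x^*\|^2/\varepsilon$, that is,

$$k \ge 2\sqrt{L}\,\|x_0 - x^*\|\,\varepsilon^{-1/2} - 2.$$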

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

@Prove that if

  • $f : \mathbb{R}^n \to \mathbb{R}$ is a convex $L$-smooth function
  • $f$ has at least one finite minimiser $x^* \in \operatorname{argmin} f(x)$
  • $f$ has a finite minimum $f^* = \min f(x)$

then:

  • The sequence of iterates $(x_k)_{k \in \mathbb{N}}$ satisfies $f(x_k) \le f^* + \frac{4L\|x_0 - x^*\|^2}{(k+2)^2}$ for all $k \in \mathbb{N}$, so
  • $x_k$ is $\varepsilon$-optimal for all $k \ge N(\varepsilon) = \left\lceil 2\sqrt{L}\,\|x_0 - x^*\|\,\varepsilon^{-1/2} \right\rceil$

It is easiest to analyse the algorithm in terms of

$$p_k = (\lambda_k - 1)(x_{k-1} - x_k)$$

We have:

$$y_{k+1} = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$$

and

$$x_{k+1} = y_{k+1} - \alpha_{k+1}\nabla f(y_{k+1}) = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1}) - \alpha_{k+1}\nabla f(y_{k+1})$$

Hence, considering $p_k$, we have

$$\begin{aligned} p_{k+1} - x_{k+1} &= (\lambda_{k+1} - 1)(x_k - x_{k+1}) - x_{k+1} \\ &= \lambda_{k+1}(x_k - x_{k+1}) - x_k \\ &= (1 - \lambda_k)(x_k - x_{k-1}) - x_k + \lambda_{k+1}\alpha_{k+1}\nabla f(y_{k+1}) \\ &\overset{(1)}{=} p_k - x_k + \lambda_{k+1}\alpha_{k+1}\nabla f(y_{k+1}) \end{aligned}$$

where the third equality substitutes the update for $x_{k+1}$ above, and (1) follows from $p_k = (1 - \lambda_k)(x_k - x_{k-1})$. Hence

$$\begin{aligned} \|p_{k+1} - x_{k+1} + x^*\|^2 &= \|p_k - x_k + x^*\|^2 + \lambda_{k+1}^2\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 + 2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k - x_k + x^*\rangle \\ &= \|p_k - x_k + x^*\|^2 + \lambda_{k+1}^2\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle + 2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), x^* - x_k + \lambda_{k+1}^{-1}p_k\rangle \\ &= \|p_k - x_k + x^*\|^2 + \lambda_{k+1}^2\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle + 2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle \end{aligned}$$

where the last equality follows from the identity

$$y_{k+1} = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1}) = x_k - \frac{1}{\lambda_{k+1}}p_k$$

Now we wish to derive bounds on the last two terms on the right hand side of the big equation.

To bound $2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle$ (the second of the terms), note that by the convexity of $f$, we have

$$f(y_{k+1}) - f^* \le -\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle$$

The sufficient decrease condition yields that

$$f(y_{k+1}) - f(x_{k+1}) \ge \frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2$$

and so substituting this into the convexity inequality, we have that

$$\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle \le -\left(f(x_{k+1}) - f^* + \frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2\right)$$

To bound $2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle$ (the first of the terms), note that by the definition of $y_{k+1}$, we have that

$$y_{k+1} = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1}) = x_k - \lambda_{k+1}^{-1}p_k$$

And then convexity implies that

$$f(y_{k+1}) + \lambda_{k+1}^{-1}\langle \nabla f(y_{k+1}), p_k\rangle \le f(x_k)$$

so by $f(y_{k+1}) - f(x_{k+1}) \ge \frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2$ above, we see that

$$\frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2 \le f(x_k) - f(x_{k+1}) - \lambda_{k+1}^{-1}\langle \nabla f(y_{k+1}), p_k\rangle$$

Combining everything so far, we have that

$$\begin{aligned} \|p_{k+1} - x_{k+1} + x^*\|^2 &\le \|p_k - x_k + x^*\|^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle - 2\lambda_{k+1}\alpha_{k+1}(f(x_{k+1}) - f^*) + (\lambda_{k+1}^2 - \lambda_{k+1})\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 \\ &\le \|p_k - x_k + x^*\|^2 - 2\lambda_{k+1}\alpha_{k+1}(f(x_{k+1}) - f^*) + 2(\lambda_{k+1}^2 - \lambda_{k+1})\alpha_{k+1}(f(x_k) - f(x_{k+1})) \\ &= \|p_k - x_k + x^*\|^2 + 2\alpha_{k+1}\lambda_k^2(f(x_k) - f^*) - 2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) \\ &\le \|p_k - x_k + x^*\|^2 + 2\alpha_k\lambda_k^2(f(x_k) - f^*) - 2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) \end{aligned}$$

Here the first inequality applies the bound on $\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle$; the second applies the bound on $\langle \nabla f(y_{k+1}), p_k\rangle$, which makes the $(\lambda_{k+1}^2 - \lambda_{k+1})\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2$ terms cancel exactly; the equality uses $\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2$, which follows from the update $\lambda_{k+1} := \frac{1}{2}(1 + \sqrt{4\lambda_k^2 + 1})$; and the final inequality uses $\alpha_{k+1} \le \alpha_k$ together with $f(x_k) \ge f^*$.

Applying this inequality iteratively, we find

$$2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) \le 2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) + \|p_{k+1} - x_{k+1} + x^*\|^2 \le 2\alpha_k\lambda_k^2(f(x_k) - f^*) + \|p_k - x_k + x^*\|^2 \le \cdots \le 2\alpha_0\lambda_0^2(f(x_0) - f^*) + \|p_0 - x_0 + x^*\|^2 \le \|y_0 - x^*\|^2$$

where the last inequality follows from $\lambda_0 = 1$, $p_0 = (\lambda_0 - 1)(x_{-1} - x_0) = 0$, and

$$\begin{aligned} \|p_0 - x_0 + x^*\|^2 &= \|x_0 - x^*\|^2 = \|y_0 - \alpha_0 \nabla f(y_0) - x^*\|^2 \\ &= \|y_0 - x^*\|^2 + \alpha_0^2\|\nabla f(y_0)\|^2 - 2\alpha_0\langle \nabla f(y_0), y_0 - x^*\rangle \\ &\le \|y_0 - x^*\|^2 + \alpha_0^2\|\nabla f(y_0)\|^2 - 2\alpha_0\left(f(x_0) - f^* + \frac{\alpha_0}{2}\|\nabla f(y_0)\|^2\right) \\ &= \|y_0 - x^*\|^2 - 2\alpha_0\lambda_0^2(f(x_0) - f^*) \end{aligned}$$

where the inequality uses $\langle \nabla f(y_0), y_0 - x^*\rangle \ge f(y_0) - f^* \ge f(x_0) - f^* + \frac{\alpha_0}{2}\|\nabla f(y_0)\|^2$ (convexity, then sufficient decrease).

Since $\lambda_{k+1} \ge 1 + \frac{k+1}{2}$ (by induction, as $\lambda_{j+1} \ge \lambda_j + \frac{1}{2}$ for every $j$), this implies

$$2\alpha_{k+1}\left(1 + \frac{k+1}{2}\right)^2(f(x_{k+1}) - f^*) \le \|y_0 - x^*\|^2$$

and since the backtracking guarantees $\alpha_{k+1} \ge \frac{1}{2L}$ (the initialisation gives $\alpha_{-1} \ge L^{-1}$ by $L$-smoothness, and a step length is only ever halved from a value exceeding $L^{-1}$), we obtain

$$f(x_{k+1}) - f^* \le \frac{4L\|y_0 - x^*\|^2}{(k+3)^2}$$

as required.



