Notes - Optimisation for Data Science HT25, Accelerated methods
Flashcards
Heavy ball method
Suppose we wish to minimise the convex quadratic function
\[f(x) = \frac 1 2 x^\top A x - b^\top x\]
where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$.
In this context, @define the general template for iterative updates used in the heavy ball method.
We perform updates of the form
\[x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})\]where $\alpha_k > 0$ is a step length and $\beta_k > 0$ is a momentum constant (and initially, a steepest descent update is performed to avoid the need for two starting points).
Suppose we wish to minimise the convex quadratic function
\[f(x) = \frac 1 2 x^\top A x - b^\top x\]
where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form
\[x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})\]
@State a theorem about the convergence of the heavy ball method.
There exists a constant $C > 0$ such that when the heavy ball method is applied with constant step lengths
\[\begin{aligned} \alpha_k &= \frac{4}{(\sqrt L + \sqrt \gamma)^2} \\\\ \beta_k &= \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma} \end{aligned}\]the iterates satisfy
\[||x^k - x^\ast|| \le C\beta^k\]for all $k$, where $\beta = \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma}$ denotes the common value of the $\beta_k$.
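This linear rate can be checked numerically. Below is a minimal sketch (not from the notes) that runs the heavy ball method with these step lengths on a hypothetical diagonal quadratic with $\gamma = 1$ and $L = 100$; the matrix, the vector $b$ and the iteration count are illustrative choices.

```python
import numpy as np

# Heavy ball method on f(x) = 0.5 x^T A x - b^T x, where A is a
# hypothetical diagonal matrix with eigenvalues in [gamma, L] = [1, 100].
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
gamma, L = 1.0, 100.0
x_star = np.linalg.solve(A, b)  # unique minimiser of the quadratic

# Constant step lengths from the result above
alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

grad = lambda x: A @ x - b

# Initial steepest descent step avoids needing two starting points
x_prev = np.zeros(2)
x = x_prev - alpha * grad(x_prev)

for _ in range(500):
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x

err = np.linalg.norm(x - x_star)  # shrinks geometrically with k
```

On this example the error decreases geometrically, reaching roughly machine precision well within the 500 iterations.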
Suppose we wish to minimise the convex quadratic function
\[f(x) = \frac 1 2 x^\top A x - b^\top x\]
where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form
\[x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})\]
There is a result that says there is some constant $C > 0$ such that when the heavy ball method is applied with constant step lengths
\[\begin{aligned}
\alpha_k &= \frac{4}{(\sqrt L + \sqrt \gamma)^2} \\\\
\beta_k &= \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma}
\end{aligned}\]
the iterates satisfy
\[||x^k - x^\ast|| \le C\beta^k\]
for all $k$. This gives a result about the convergence of $x^k$ to $x^\ast$. Can you @State and @prove a theorem about the convergence of the objective value, possibly under some additional assumptions?
By $L$-smoothness and the above result,
\[\begin{aligned} f(x^k) - f(x^\ast) &\le \langle \nabla f(x^\ast), x^k - x^\ast \rangle + \frac L 2 ||x^k - x^\ast||^2 &&(1\star) \\\\ &\le \frac{L C^2}{2} \beta^{2k} &&(2\star) \\\\ &\approx \frac{L C^2}{2} \left( 1 - 2\sqrt{\frac \gamma L} \right)^{2k} &&(3\star) \end{aligned}\]
where:
- $(1\star)$ follows from the fact $f$ is $L$-smooth
- $(2\star)$ follows from the fact that $\nabla f(x^\ast) = 0$ and by the above result
- $(3\star)$ follows from an approximation when $L \gg \gamma$
Nesterov acceleration for $L$-smooth $\gamma$-strongly convex functions
Suppose we wish to minimise a strongly convex objective $f(x)$
\[\min_{x \in \mathbb R^n} f(x)\]
In this context, @define the general template of the iterative updates used in Nesterov’s accelerated gradient method.
Suppose we wish to minimise a $\gamma$-strongly convex and $L$-smooth objective $f(x)$
\[\min_{x \in \mathbb R^n} f(x)\]
In this context, Nesterov’s accelerated gradient method uses updates of the form
\[x^{k+1} = x^k - \alpha_k \nabla f(x^k + \beta_k (x^k - x^{k-1})) + \beta_k(x^k - x^{k-1})\]
In this context, @state a result about the convergence of $f(x^k)$.
When Nesterov’s accelerated gradient method is used with constant step lengths
\[\begin{aligned} \alpha_k &= \frac 1 L \\\\ \beta_k &= \frac{\sqrt L - \sqrt \gamma}{\sqrt L + \sqrt \gamma} \end{aligned}\]the iterates satisfy
\[f(x^k) - f(x^\ast) \le \frac{L + \gamma}{2} ||x^0 - x^\ast||^2 \times \left( 1 - \sqrt{\frac \gamma L} \right)^k\]for all $k \ge 1$.
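As a numerical sanity check (a sketch assuming $\gamma$ and $L$ are known exactly; the quadratic below is a hypothetical test problem), Nesterov's method with these constant step lengths converges rapidly:

```python
import numpy as np

# Nesterov's accelerated gradient method with constant step lengths on a
# hypothetical strongly convex quadratic with gamma = 1, L = 100.
A = np.diag([1.0, 100.0])
b = np.array([1.0, -1.0])
gamma, L = 1.0, 100.0
x_star = np.linalg.solve(A, b)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

x_prev = np.zeros(2)
x = np.zeros(2)
for _ in range(300):
    y = x + beta * (x - x_prev)         # extrapolated point
    x, x_prev = y - alpha * grad(y), x  # gradient step taken at y

gap = f(x) - f(x_star)  # controlled by the geometric rate above
```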
Nesterov acceleration for $L$-smooth convex functions
@State the @algorithm corresponding to Nesterov acceleration for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$.
- Initialisation:
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Main body: While $\vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
Intuitively, this algorithm produces two sequences of iterates, $x^k$ and $y^k$. The points $x^k$ are obtained as the minimisers of the quadratic upper bound function $m_{k, \alpha}^u (y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1} }{2} ||y - y^k||^2$, and the $y^k$ are approximate second order corrections to the $x^k$.
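The steps above can be sketched in code. The least-squares objective, the problem dimensions and the iteration budget below are hypothetical choices for illustration; the inner `while` loop realises the search for the minimal $i$:

```python
import numpy as np

# Nesterov acceleration with backtracking on a hypothetical least-squares
# problem f(x) = 0.5 ||Ax - b||^2, which is convex and L-smooth.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

# Initialisation: estimate alpha_{-1} from a second point z != y^0
y = np.zeros(5)
z = np.ones(5)
alpha = np.linalg.norm(y - z) / np.linalg.norm(grad(y) - grad(z))
lam = 1.0
x_prev = y.copy()

for _ in range(2000):
    g = grad(y)
    # Find the minimal i: halve alpha until sufficient decrease holds
    while f(y - alpha * g) > f(y) - 0.5 * alpha * np.dot(g, g):
        alpha *= 0.5
    x = y - alpha * g
    lam_next = 0.5 * (1 + np.sqrt(4 * lam ** 2 + 1))
    y = x + (lam - 1) / lam_next * (x - x_prev)
    x_prev, lam = x, lam_next

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])
gap = f(x) - f_star
```

Here the termination test on $\vert \vert x^k - x^{k-1} \vert \vert$ is replaced by a fixed iteration budget for simplicity.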
Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:
- Initialisation:
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Main body: While $\vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
@Prove that $\alpha _ k$ is decreasing and that it stops decreasing once $\alpha _ k \le L^{-1}$, and then motivate this process.
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
Clearly it’s decreasing because of the step $\alpha _ k := 2^{-i} \alpha _ {k-1}$. Once $\alpha _ {k-1} \le L^{-1}$, we have
\[\begin{aligned} f(y^k - \alpha_{k-1} \nabla f(y^k)) &\le f(y^k) + \langle \nabla f(y^k), -\alpha_{k-1} \nabla f(y^k)\rangle + \frac L 2 ||\alpha_{k-1} \nabla f(y^k)||^2 \\\\ &\le f(y^k) - \frac{\alpha_{k-1} }{2} ||\nabla f(y^k)||^2 \end{aligned}\]Therefore the condition that
\[f(y^k - 2^{-i} \alpha_{k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha_{k-1} ||\nabla f(y^k)||^2\]is satisfied with $i = 0$ and hence $\alpha _ k$ remains the same.
If $L$ were known in advance, $\alpha _ k$ could simply be set to the optimal value $\alpha = \frac 1 L$, but this algorithm learns the value of $L$ over time. $\alpha \le L^{-1}$ is the condition needed to ensure that
\[m_{k, \alpha}^u (y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1} }{2} ||y - y^k||^2\]is an upper bound on $f(y)$.
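This upper bound property can be checked numerically. The sketch below uses a hypothetical one-dimensional quadratic whose smoothness constant is exactly $L$, so $m_{k, \alpha}^u$ is a tight upper bound at $\alpha = L^{-1}$ but fails to be an upper bound at $\alpha = 2L^{-1}$:

```python
import numpy as np

# Model m(y) = f(y^k) + f'(y^k)(y - y^k) + (1/(2 alpha)) (y - y^k)^2
# versus f(y) = 0.5 L y^2 on a hypothetical example with L = 10.
L = 10.0
f = lambda y: 0.5 * L * y ** 2
df = lambda y: L * y

yk = 1.0  # current iterate
m = lambda y, alpha: f(yk) + df(yk) * (y - yk) + (y - yk) ** 2 / (2 * alpha)

ys = np.linspace(-3.0, 3.0, 1001)
is_upper = np.all(m(ys, 1 / L) >= f(ys) - 1e-12)  # alpha = 1/L: upper bound
is_not_upper = np.any(m(ys, 2 / L) < f(ys))       # alpha = 2/L: bound fails
```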
Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:
- Initialisation:
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Main body: While $\vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
Can you @justify the update for $x^k$:
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
$x^k$ is the global minimiser of the function
\[m_{k, \alpha}^u (y) = f(y^k) + \langle \nabla f(y^k), y - y^k \rangle + \frac{\alpha^{-1} }{2} ||y - y^k||^2\]which is an upper bound on $f$ given that $\alpha _ k \le L^{-1}$.
Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:
- Initialisation:
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Main body: While $\vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
@State a result related to its convergence.
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is a convex $L$-smooth function
- $f$ has at least one finite minimiser $x^\ast \in \text{argmin } f(x)$
- $f$ has a finite minimum $f^\ast = \min f(x)$
Then:
- The sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies $f(x^k) \le f^\ast + \frac{4L \vert \vert x^0 - x^\ast \vert \vert ^2}{(k+2)^2}$ for all $k \in \mathbb N$, so
- $x^k$ is $\varepsilon$-optimal for all $k \ge N(\varepsilon) = \left\lfloor \frac{2\sqrt L \times \vert \vert x^0 - x^\ast \vert \vert }{\sqrt \varepsilon} \right\rfloor - 1$
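To get a feel for the complexity bound, the following evaluates $N(\varepsilon)$ for hypothetical values $L = 100$ and $\vert \vert x^0 - x^\ast \vert \vert = 1$; a $4\times$ smaller tolerance roughly doubles the bound, reflecting the $O(1/k^2)$ rate:

```python
import numpy as np

# Iteration bound N(eps) = floor(2 sqrt(L) ||x^0 - x*|| / sqrt(eps)) - 1
# for hypothetical values L = 100, ||x^0 - x*|| = 1.
L, R = 100.0, 1.0
N = lambda eps: int(np.floor(2 * np.sqrt(L) * R / np.sqrt(eps))) - 1

n1 = N(0.25)    # eps = 1/4
n2 = N(0.0625)  # eps = 1/16: four times smaller, bound roughly doubles
```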
Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y^0$ proceeds as follows:
- Initialisation:
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Main body: While $\vert \vert x^k - x^{k-1} \vert \vert > \text{tol}$:
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
@Prove that if
- $f : \mathbb R^n \to \mathbb R$ is a convex $L$-smooth function
- $f$ has at least one finite minimiser $x^\ast \in \text{argmin } f(x)$
- $f$ has a finite minimum $f^\ast = \min f(x)$
then:
- The sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies $f(x^k) \le f^\ast + \frac{4L \vert \vert x^0 - x^\ast \vert \vert ^2}{(k+2)^2}$ for all $k \in \mathbb N$, so
- $x^k$ is $\varepsilon$-optimal for all $k \ge N(\varepsilon) = \left\lfloor \frac{2\sqrt L \times \vert \vert x^0 - x^\ast \vert \vert }{\sqrt \varepsilon} \right\rfloor - 1$
- Find $z \ne y^0 \in \mathbb R^n$ and set $\alpha _ {-1} := \frac{ \vert \vert y^0 - z \vert \vert }{ \vert \vert \nabla f(y^0) - \nabla f(z) \vert \vert }$
- $k := 0$
- $\lambda _ 0 := 1$
- $x^{-1} := y^0$
- Find the minimal $i \in \mathbb N$ such that
- $f(y^k - 2^{-i} \alpha _ {k-1} \nabla f(y^k)) \le f(y^k) - 2^{-(i+1)} \alpha _ {k-1} \vert \vert \nabla f(y^k) \vert \vert ^2$
- $\alpha _ k := 2^{-i} \alpha _ {k-1}$
- $x^k := y^k - \alpha _ k \nabla f(y^k)$
- $\lambda _ {k+1} := \frac{1}{2} \left( 1 + \sqrt{4 \lambda _ k^2 + 1} \right)$
- $y^{k+1} := x^k + \frac{\lambda _ k - 1}{\lambda _ {k+1} }(x^k - x^{k-1})$
It is easiest to analyse the algorithm in terms of
\[p^k = (\lambda_k - 1)(x^{k-1} - x^k)\]We have:
\[y^{k+1} = x^k + \frac{\lambda_k - 1}{\lambda_{k+1} }(x^k - x^{k-1})\]and
\[\begin{aligned} x^{k+1} &= y^{k+1} - \alpha_{k+1} \nabla f(y^{k+1}) \\\\ &= x^k + \frac{\lambda_k - 1}{\lambda_{k+1} } (x^k - x^{k-1}) - \alpha_{k+1} \nabla f(y^{k+1}) \end{aligned}\]Hence by considering $p^k$, we have
\[\begin{aligned} p^{k+1} - x^{k+1} &= (\lambda_{k+1} - 1)(x^k - x^{k+1}) - x^{k+1} \\\\ &= \lambda_{k+1} (x^k - x^{k+1}) - x^k \\\\ &= (1 - \lambda_k) (x^k - x^{k-1}) - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}) &&(1\star) \\\\ &= p^k - x^k + \lambda_{k+1} \alpha_{k+1} \nabla f(y^{k+1}) \end{aligned}\]where $(1\star)$ follows from the above equality. Hence
\[\begin{aligned} &||p^{k+1} - x^{k+1} + x^\ast||^2\\\\ =&||p^k - x^k + x^\ast||^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 + 2\lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k - x^k + x^\ast \rangle \\\\ =&||p^k - x^k + x^\ast||^2 + \lambda^2_{k+1} \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 + 2(\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k\rangle + 2\lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^\ast - x^k + \lambda^{-1}_{k+1} p^k \rangle \\\\ =&||p^k - x^k + x^\ast||^2 + \lambda_{k+1}^2 \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle + 2 \lambda_{k+1} \alpha_{k+1} \langle \nabla f(y^{k+1}), x^\ast - y^{k+1} \rangle \end{aligned}\]where the last equality follows from the identity
\[\begin{aligned} y^{k+1} &= x^k + \frac{\lambda_k - 1}{\lambda_{k+1} } (x^k - x^{k-1} ) \\\\ &= x^k - \frac{1}{\lambda_{k+1} } p^k \end{aligned}\]Now we wish to derive bounds on the last two terms on the right hand side of the big equation.
To bound $2\lambda _ {k+1} \alpha _ {k+1} \langle \nabla f(y^{k+1}), x^\ast - y^k \rangle$ (the second of the terms), note that by the convexity of $f$, we have
\[f(y^{k+1}) - f^\ast \le \langle \nabla f(y^{k+1}), y^{k+1} - x^\ast\rangle\]The sufficient decrease condition yields that
\[f(y^{k+1}) - f(x^{k+1}) \ge \frac{\alpha_{k+1} }{2} ||\nabla f(y^{k+1})||^2\]and so substituting this into the convexity equation, we have that
\[\langle \nabla f(y^{k+1}), x^\ast - y^{k+1}\rangle \le f^\ast - f(x^{k+1}) - \frac{\alpha_{k+1} }{2} ||\nabla f(y^{k+1})||^2\]To bound $2(\lambda _ {k+1} - 1) \alpha _ {k+1} \langle \nabla f(y^{k+1}), p^k\rangle$ (the first of the terms), note that by the definition of $y^{k+1}$, we have that
\[\begin{aligned} y^{k+1} &= x^k + \frac{\lambda_k - 1}{\lambda_{k+1}} (x^k - x^{k-1}) \\\\ &= x^k - \lambda^{-1}_{k+1} p^k \end{aligned}\]And then convexity implies that
\[f(y^{k+1}) + \lambda^{-1}_{k+1} \langle \nabla f(y^{k+1}), p^k\rangle \le f(x^k)\]so by $f(y^{k+1}) - f(x^{k+1}) \ge \frac{\alpha _ {k+1} }{2} \vert \vert \nabla f(y^{k+1}) \vert \vert ^2$ above, we see that
\[\frac{\alpha_{k+1} }{2} ||\nabla f(y^{k+1})||^2 \le f(x^k) - f(x^{k+1}) - \lambda^{-1}_{k+1} \langle \nabla f(y^{k+1}), p^k\rangle\]Combining everything so far, we have that
\[\begin{aligned} &||p^{k+1} - x^{k+1} + x^\ast||^2 - ||p^k - x^k + x^\ast||^2 \\\\ \le&2(\lambda_{k+1} - 1) \alpha_{k+1} \langle \nabla f(y^{k+1}), p^k \rangle \\\\ & - 2\lambda_{k+1} \alpha_{k+1} (f(x^{k+1}) - f^\ast) \\\\ & + (\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1}^2 ||\nabla f(y^{k+1})||^2 \\\\ \le& {-2}\lambda_{k+1} \alpha_{k+1} (f(x^{k+1}) - f^\ast) + 2(\lambda_{k+1}^2 - \lambda_{k+1}) \alpha_{k+1} (f(x^k) - f(x^{k+1})) \\\\ =&2\alpha_{k+1} \lambda^2_k (f(x^k) - f^\ast) - 2\alpha_{k+1} \lambda^2_{k+1} (f(x^{k+1}) - f^\ast) \\\\ \le&2\alpha_k \lambda_k^2 (f(x^k) - f^\ast) - 2\alpha_{k+1} \lambda_{k+1}^2 (f(x^{k+1}) - f^\ast) \end{aligned}\]Here the first inequality substitutes the bound on $\langle \nabla f(y^{k+1}), x^\ast - y^{k+1}\rangle$ into the expansion above; the second applies the bound on $\frac{\alpha_{k+1} }{2} ||\nabla f(y^{k+1})||^2$, after which the $\langle \nabla f(y^{k+1}), p^k \rangle$ terms cancel; the equality uses $\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2$ (which follows from the update rule for $\lambda _ {k+1}$) together with writing $f(x^k) - f(x^{k+1}) = (f(x^k) - f^\ast) - (f(x^{k+1}) - f^\ast)$; and the final inequality uses $\alpha_{k+1} \le \alpha_k$ and $f(x^k) - f^\ast \ge 0$.
Applying this inequality iteratively, we find
\[\begin{aligned} 2 \alpha_{k+1} \lambda^2_{k+1} (f(x^{k+1}) - f^\ast) &\le 2\alpha_{k+1} \lambda_{k+1}^2 (f(x^{k+1}) - f^\ast) + ||p^{k+1} - x^{k+1} + x^\ast||^2 \\\\ &\le 2\alpha_k \lambda_k^2 (f(x^k) - f^\ast) + ||p^k - x^k + x^\ast||^2 \\\\ &\le \cdots \\\\ &\le 2\alpha_0 \lambda_0^2 (f(x^0) - f^\ast) + ||p^0 - x^0 + x^\ast||^2 \\\\ &\le ||y^0 - x^\ast||^2 \end{aligned}\]where the last inequality follows from $\lambda _ 0 = 1$, $p^0 = (\lambda _ 0 - 1)(x^{-1} - x^0) = 0$, and
\[\begin{aligned} ||p^0 - x^0 + x^\ast||^2 &= ||x^0 - x^\ast||^2 \\\\ &= ||y^0 - \alpha_0 \nabla f(y^0) - x^\ast||^2 \\\\ &= ||y^0 - x^\ast||^2 + \alpha_0^2 ||\nabla f(y^0)||^2 - 2 \alpha_0 \langle \nabla f(y^0), y^0 - x^\ast \rangle \\\\ &\le ||y^0 - x^\ast||^2 + \alpha_0^2 ||\nabla f(y^0)||^2 - 2\alpha_0\left(f(x^0) - f^\ast + \frac{\alpha_0} 2 ||\nabla f(y^0)||^2\right) \\\\ &= ||y^0 - x^\ast||^2 - 2\alpha_0 \lambda_0^2(f(x^0) - f^\ast) \end{aligned}\]Hence this implies
Since $\lambda _ {k+1} \ge 1 + \frac{k+1}{2}$ (by induction, as $\lambda _ {j+1} \ge \lambda _ j + \frac 1 2$),
\[2 \alpha_{k+1} \left(1 + \frac{k+1}{2}\right)^2 (f(x^{k+1}) - f^\ast) \le ||y^0 - x^\ast||^2\]and so, using $\alpha _ {k+1} \ge \frac{1}{2L}$,
\[f(x^{k+1}) - f^\ast \le \frac{4L||y^0 - x^\ast||^2}{(k+3)^2}\]as required.