Optimisation for Data Science HT25, Proximal method
Why the proximal method?
One way to motivate the proximal method is to see it as a generalisation of projected gradient descent. When using projected gradient descent, we wish to solve the optimisation problem
\[\min _ {x \in \Omega} f(x)\]where $\Omega$ is some convex set. To do this, we use gradient descent but project every iterate onto the feasible region:
\[x^{k+1} = \text{proj} _ \Omega(x^k - \alpha _ k \nabla f(x^k))\]We may rewrite this as
\[x^{k+1} = \text{argmin} _ {z \in \Omega} \frac{1}{2}||z - (x^k - \alpha _ k \nabla f(x^k))||^2\]and then convert this into an unconstrained optimisation problem
\[x^{k+1} = \text{argmin} _ z \Big[ \frac{1}{2}|| z - (x^k - \alpha _ k \nabla f(x^k)) ||^2 + I _ \Omega(z) \Big]\]You can show that this method has reasonable convergence properties ($O(1/\sqrt k)$ decay of the gradient map norm in the non-convex objective case). The proximal method is what you get when you generalise to other non-smooth but still convex penalty terms, other than $I _ \Omega$:
\[\begin{aligned} x^{k+1} &= \text{argmin} _ z \Big[ \frac{1}{2} ||z - (x^k - \alpha _ k \nabla f(x^k))||^2 + \lambda \alpha _ k \psi(z) \Big] \\\\ &= \text{prox} _ {\alpha _ k \lambda \psi}(x^k - \alpha _ k \nabla f(x^k)) \end{aligned}\]So $\text{prox}$ can be seen as a generalised projection operation.
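To make this concrete, here is a minimal numpy sketch (my own illustration, not from the lecture notes) of a single prox-gradient step; the names `prox_gradient_step` and `prox_box` are assumptions made for this example. Taking $\psi = I _ \Omega$ for a box $\Omega$ makes the prox a clipping operation, i.e. a projection, so the iteration reduces to projected gradient descent.

```python
import numpy as np

def prox_gradient_step(x, grad_f, prox, alpha):
    """One prox-gradient step: x+ = prox(x - alpha * grad_f(x), alpha)."""
    return prox(x - alpha * grad_f(x), alpha)

# psi = indicator of the box [lo, hi]^n, so its prox is projection, i.e. clipping.
lo, hi = 0.0, 1.0
prox_box = lambda v, alpha: np.clip(v, lo, hi)

# f(x) = 0.5 * ||x - c||^2, so grad f(x) = x - c and f is 1-smooth.
c = np.array([2.0, -1.0, 0.5])
grad_f = lambda x: x - c

x = np.zeros(3)
for _ in range(20):
    x = prox_gradient_step(x, grad_f, prox_box, alpha=1.0)

print(x)  # [1.  0.  0.5]: the projection of the unconstrained minimiser c onto the box
```

Swapping in a different prox (for instance soft-thresholding for an $\ell _ 1$ penalty) turns the same step into the general proximal method.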
Flashcards
Basic definitions
Suppose:
- $h : \mathbb R^n \to \mathbb R$ is proper convex
- $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
- $\lambda > 0$
@Define the Moreau envelope of $h$.
\[M _ {\lambda, h}(x) := \min _ {u} \left(h(u) + \frac{1}{2\lambda}||u-x||^2\right)\]
Suppose:
- $h : \mathbb R^n \to \mathbb R$ is proper convex
- $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
- $\lambda > 0$
In this context, define the proximal operator $\text{prox} _ {\lambda h}$.
It is defined by
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2}||u-x||^2\right)\]Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
@Define the gradient map $G _ \alpha$ in this context.
It is defined by
\[G _ \alpha(x) = \frac{1}{\alpha}(x - \text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]Examples of the proximal operator
Suppose $h(x) := 0$, i.e. it is identically zero. What is $\text{prox} _ {\lambda h}$?
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \frac{1}{2}||u-x||^2 = x\]i.e. it is the identity map.
@example~
Suppose $h(x) := I _ \Omega (x)$ where $\Omega$ is closed and convex. What is $\text{prox} _ {\lambda h}$?
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u \in \Omega} \frac{1}{2}||u-x||^2 = \text{proj} _ \Omega(x)\]which is the projection of $x$ onto $\Omega$.
@example~
Suppose $h(x) := \vert \vert x \vert \vert _ 1$. What is $\text{prox} _ {\lambda h}$?
Defined component-wise, it is:
\[(\text{prox} _ {\lambda ||\cdot|| _ 1}(x)) _ i = \begin{cases} x _ i - \lambda &\text{if } x _ i > \lambda \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i < -\lambda \end{cases}\]Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$. What is $\text{prox} _ {\lambda h}$?
Defined component-wise, it is:
\[(\text{prox} _ {\lambda ||\cdot|| _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } |x _ i| > \sqrt{2\lambda} \\\\ 0 &\text{if } |x _ i| \le \sqrt{2\lambda} \end{cases}\]Properties
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
- $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$
@State a property of $\text{prox} _ {\lambda h}$ that relates $\text{prox} _ {\lambda h}$ to the stationary points of $\phi$.
\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]and
\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]i.e. $\text{prox} _ {\lambda h}(x)$ is a minimiser of $\phi$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
- $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$
@Prove that
\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]
and
\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]
Note
\[\partial \phi(u) = \lambda \partial h(u) + u - x\]By letting $u = \text{prox} _ {\lambda h}(x)$, we have
\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]as required.
Since
\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ u \left[\lambda h(u) + \frac 1 2 ||u - x||^2\right] \\\\ &=\text{argmin} _ u [\phi(u)] \end{aligned}\]and $\phi$ is convex (it is the sum of two convex functions), $0 \in \partial \phi(u)$ iff $u$ minimises $\phi$. Since $\text{prox} _ {\lambda h}(x)$ is precisely the minimiser,
\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its finiteness.
$M _ {\lambda, h}(x) < +\infty$ for all $x$, even if $h(x) = \infty$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its convexity.
$M _ {\lambda, h}(x)$ is convex in $x$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its differentiability and give its derivative.
$M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient
\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]@Prove that if
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
then $M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient
\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]
(you may assume $h$ is differentiable at $u = \text{prox} _ {\lambda h}(x)$ and that $\text{prox} _ {\lambda h}$ is differentiable at $x$).
\[\begin{aligned} \nabla M _ {\lambda, h}(x) &= \nabla _ x \left[ h(u(x)) + \frac{1}{2\lambda} ||u(x) - x||^2 \right] \\\\ &= \nabla u(x)^\top \left[ \nabla h(u(x)) + \frac 1 \lambda (u(x) - x) \right] + \frac 1 \lambda (x - u(x)) && (1\star) \\\\ &= \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x)) && (2\star) \end{aligned}\]where $u(x) := \text{prox} _ {\lambda h}(x)$ and:
- $(1\star)$: Follows from implicit differentiation.
- $(2\star)$: Follows from the fact that $u(x)$ is a minimiser of $u \mapsto h(u) + \frac{1}{2\lambda} \vert \vert u-x \vert \vert ^2$, so the bracketed term is zero by the first-order optimality condition.
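As a quick numerical illustration (my own addition, assuming the definition $M _ {\lambda, h}(x) = \min _ u \big(h(u) + \frac{1}{2\lambda}||u-x||^2\big)$ and the soft-thresholding formula for the $\ell _ 1$ prox from the examples above), the gradient formula can be checked against finite differences; `soft_threshold` and `moreau_env_l1` are illustrative names.

```python
import numpy as np

def soft_threshold(x, lam):
    # prox of lam * ||.||_1 (componentwise shrinkage)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_l1(x, lam):
    # M_{lam, ||.||_1}(x) = min_u ||u||_1 + (1/(2*lam)) * ||u - x||^2,
    # evaluated at its minimiser u = prox_{lam * ||.||_1}(x)
    u = soft_threshold(x, lam)
    return np.sum(np.abs(u)) + np.sum((u - x) ** 2) / (2 * lam)

lam = 0.3
x = np.array([1.2, -0.1, 0.05, -2.0])

# Gradient from the formula: (1/lam) * (x - prox_{lam h}(x))
grad_formula = (x - soft_threshold(x, lam)) / lam

# Central finite differences of the Moreau envelope
eps = 1e-6
grad_fd = np.array([
    (moreau_env_l1(x + eps * e, lam) - moreau_env_l1(x - eps * e, lam)) / (2 * eps)
    for e in np.eye(len(x))
])

print(np.max(np.abs(grad_formula - grad_fd)))  # tiny: the two gradients agree
```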
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates its minima to the minima of $h$.
$M _ {\lambda, h}$ and $h$ have the same minimisers: $x^\ast$ minimises $M _ {\lambda, h}$ if and only if $x^\ast$ minimises $h$, and the minimum values coincide.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
@State a property of $\text{prox} _ {\lambda h}$ that relates to its smoothness.
\[||\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y)|| \le ||x - y||\]for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).
@Prove that if
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
then
\[|| \text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y) || \le ||x - y||\]
for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).
Note that $x - \text{prox} _ {\lambda h}(x) \in \lambda \partial h(\text{prox} _ {\lambda h}(x))$, which is the optimality condition $0 \in \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x$ rearranged. Then
\[\lambda h(\text{prox} _ {\lambda h}(y)) \ge \lambda h(\text{prox} _ {\lambda h}(x)) + (x - \text{prox} _ {\lambda h}(x))^\top (\text{prox} _ {\lambda h}(y) - \text{prox} _ {\lambda h}(x))\]for all $x$ and $y$, by properties of the subdifferential. Likewise,
\[\lambda h(\text{prox} _ {\lambda h}(x)) \ge \lambda h(\text{prox} _ {\lambda h}(y)) + (y - \text{prox} _ {\lambda h}(y))^\top (\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y))\]Summing these inequalities and simplifying, we have
\[0 \ge (x - \text{prox} _ {\lambda h}(x) - y + \text{prox} _ {\lambda h}(y))^\top (\text{prox} _ {\lambda h}(y) - \text{prox} _ {\lambda h}(x))\]Therefore
\[||\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y)||^2 \le (x - y)^\top (\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y))\]and hence by Cauchy-Schwarz,
\[||\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y)|| \le ||x - y||\]as required.
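A quick sanity check of this nonexpansiveness property (my own sketch, not from the notes), using the projection onto the Euclidean unit ball, which is the prox of its indicator function:

```python
import numpy as np

rng = np.random.default_rng(0)

def proj_unit_ball(x):
    # prox of the indicator of the Euclidean unit ball = projection onto it
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

# Check ||prox(x) - prox(y)|| <= ||x - y|| on random pairs.
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    lhs = np.linalg.norm(proj_unit_ball(x) - proj_unit_ball(y))
    assert lhs <= np.linalg.norm(x - y) + 1e-12

print("1-Lipschitz inequality held on all sampled pairs")
```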
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
@State a property of $G _ \alpha$ that relates it to the derivative of $f$.
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
@Prove that then:
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]
Note that
\[\begin{aligned} x - \alpha G _ \alpha(x) &= x - \alpha \cdot \frac 1 \alpha (x - \text{prox} _ {\alpha\lambda \psi}(x - \alpha \nabla f(x))) \\\\ &= \text{prox} _ {\alpha \lambda \psi} (x - \alpha \nabla f(x)) \\\\ &= \text{argmin} _ u \text{ } E(u) \end{aligned}\]where
\[E(u) := \alpha \lambda \psi(u) + \frac 1 2 ||u - (x - \alpha \nabla f(x))||^2\]is a convex function.
Then
\[0 \in \partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]but since
\[\begin{aligned} &\partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x))) \\\\ =& \partial E(x - \alpha G _ \alpha(x)) \\\\ =& \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + (x - \alpha G _ \alpha(x)) - (x - \alpha \nabla f(x)) \end{aligned}\]it follows that
\[\alpha G _ \alpha(x) \in \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + \alpha \nabla f(x)\]Regularised optimisation
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
What is the standard template used for iterative algorithms in this context? Give this in three ways:
- First in terms of two separate steps,
- then in one step using the proximal operator,
- and finally using the gradient map $G _ \alpha$.
As a two-step iterative process:
\[\begin{aligned} y^{k+1} &= x^k - \alpha _ k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin} _ {z} \text{ } \frac 1 2 ||z - y^{k+1}||^2 + \alpha _ k \lambda \psi(z) \end{aligned}\]In one step using $\text{prox}$:
\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]Using the gradient map $G _ \alpha$:
\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k}(x^k)\]Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
The proximal method solves this problem by using the iterative step
\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]
where
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2}||u-x||^2\right)\]
Motivate this update in two different ways:
- First as a generalisation of the projected gradient descent, then
- As solving a sequence of simpler optimisation problems.
Generalisation of projected gradient descent:
In the projected gradient descent method, we want to solve the optimisation problem
\[\min_{x \in \Omega} f(x)\]where $\Omega$ is a convex set. We do this by performing gradient descent steps as normal and then projecting onto $\Omega$:
\[\begin{aligned} y^{k+1} &= x^k - \alpha_k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin}_{z \in \Omega} \frac 1 2||z - y^{k+1}||^2 \end{aligned}\]which we can rewrite as
\[x^{k+1} = \text{prox}_{\alpha_kI_\Omega}(x^k - \alpha_k \nabla f(x^k))\]The proximal method is just what you get when you consider more general “projection operators” $\text{prox} _ {\alpha _ k \lambda \psi}$ where $\psi$ is a convex function.
As solving a sequence of simpler optimisation problems:
We may rewrite the update as so:
\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin}_{z} \left[\frac{1}{2} ||z - (x^k - \alpha_k \nabla f(x^k))||^2 + \lambda \alpha_k \psi(z)\right] \\\\ &= \text{argmin}_z \left[ f(x^k) + \nabla f(x^k)^\top(z - x^k) + \frac{1}{2\alpha_k}||z - x^k||^2 + \lambda \psi(z) \right] \end{aligned}\]where the last equality follows from expanding the square, dividing by $\alpha_k$ and dropping terms that do not depend on $z$. Note that
\[f(x^k) + \nabla f(x^k)^\top (z - x^k) + \frac{1}{2\alpha_k} ||z - x^k||^2\]is a quadratic upper bound on $f(z)$ whenever $\alpha_k \le 1/L$ (by $L$-smoothness). So the proximal method can be interpreted as solving a sequence of simpler optimisation problems, each of which minimises an upper bound on $\phi$ obtained from a quadratic model of $f$ plus the regulariser.
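As an illustration of this template (my own sketch, not from the notes), the following applies the prox-gradient update to a synthetic lasso problem, i.e. $f(x) = \frac 1 2 ||Ax - b||^2$ and $\psi = ||\cdot|| _ 1$, for which the prox is the soft-thresholding operator; the data and the names `grad_f`, `soft`, `phi` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)
lam = 0.1

grad_f = lambda x: A.T @ (A @ x - b)                              # f(x) = 0.5 * ||Ax - b||^2
L = np.linalg.norm(A.T @ A, 2)                                    # Lipschitz constant of grad f
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # prox of t * ||.||_1
phi = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(20)
alpha = 1.0 / L
for k in range(500):
    # x^{k+1} = prox_{alpha * lam * ||.||_1}(x^k - alpha * grad f(x^k))
    x = soft(x - alpha * grad_f(x), alpha * lam)

print(phi(x))  # objective value after 500 prox-gradient steps
```

This instantiation of the prox-gradient method is usually known as ISTA.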
Convergence theory
Foundational inequality / Overestimation property
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
@State the foundational inequality of the prox-method in this context, first in full, and then @prove that it generalises the foundational inequality of steepest descent.
In general: For all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]Generalises the foundational inequality: Note that one step in the prox method is
\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k} (x^k)\]Substituting this into the above equation, it reduces to
\[\phi(x^{k+1}) \le \phi(z) + G _ {\alpha _ k}(x^k)^\top (x^k - z) - \frac{\alpha _ k}{2} ||G _ {\alpha _ k}(x^k)||^2\]Letting $z = x^k$, this becomes:
\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 ||G _ {\alpha _ k}(x^k)||^2\]Compare this to the foundational inequality of gradient descent:
\[f(x^{k+1}) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]@important~ (overestimation property)
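As an aside (my own numerical sanity check, not part of the notes), the foundational inequality can be verified on random points for a small $\ell _ 1$-regularised least-squares instance; all names and data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
lam = 0.2

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
phi = lambda x: f(x) + lam * np.sum(np.abs(x))
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # prox of t * ||.||_1

L = np.linalg.norm(A.T @ A, 2)
alpha = 1.0 / L

def G(x):
    # gradient map: (1/alpha) * (x - prox_{alpha*lam*psi}(x - alpha * grad f(x)))
    return (x - soft(x - alpha * grad_f(x), alpha * lam)) / alpha

# Check phi(x - alpha*G(x)) <= phi(z) + G(x)^T (x - z) - (alpha/2) * ||G(x)||^2.
for _ in range(1000):
    x, z = rng.normal(size=10), rng.normal(size=10)
    g = G(x)
    lhs = phi(x - alpha * g)
    rhs = phi(z) + g @ (x - z) - 0.5 * alpha * np.sum(g ** 2)
    assert lhs <= rhs + 1e-9

print("foundational inequality held on all sampled (x, z) pairs")
```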
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
@Prove the foundational inequality of the prox-method in this context, i.e. that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]
We require 3 sub-results.
$L$-smoothness upper bound (A): As $f$ is $L$-smooth, we have
\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac L 2 ||y - x||^2\]for all $y \in \mathbb R^n$. Letting $y = x - \alpha G _ \alpha(x)$, we obtain
\[\begin{aligned} f(x - \alpha G _ \alpha(x)) &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac{L\alpha^2}{2} ||G _ \alpha(x)||^2 \\\\ &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 \end{aligned}\]where the last inequality follows from the assumption that $\alpha \in (0, \frac 1 L]$.
Convexity of $f$ upper bound (B): The convexity of $f$ implies that
\[f(z) \ge f(x) + \nabla f(x)^\top (z - x)\]and hence
\[f(x) \le f(z) - \nabla f(x)^\top (z - x)\]Convexity of $\psi$ upper bound (C): By a previous result, we have that
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]or equivalently,
\[G_\alpha(x) - \nabla f(x) \in \lambda \partial \psi(x - \alpha G_\alpha(x))\]Thus, since $G _ \alpha(x) - \nabla f(x)$ is a subgradient of $\lambda \psi$ at $x - \alpha G _ \alpha(x)$:
\[\lambda \psi(z) \ge \lambda \psi(x - \alpha G_\alpha (x)) + (G_\alpha(x) - \nabla f(x))^\top (z-(x - \alpha G_\alpha(x)))\]and hence
\[\lambda \psi(x - \alpha G_\alpha(x)) \le \lambda \psi(z) - (G_\alpha(x) - \nabla f(x))^\top (z - (x - \alpha G_\alpha (x)))\]Combining these results:
\[\begin{aligned} &\phi(x - \alpha G _ \alpha(x))\\\\ &= f(x - \alpha G _ \alpha(x)) + \lambda \psi(x - \alpha G _ \alpha(x)) && (1\star) \\\\ &\le f(x) - \alpha G _ \alpha (x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (2\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x) - \alpha G_\alpha(x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (3\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x))) &&(4\star)\\\\ &= f(z) + \lambda \psi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2 &&(5\star)\\\\ &= \phi(z) + G _ \alpha(x)^\top (x - z) - \frac{\alpha}{2}||G _ \alpha(x)||^2 && (6\star) \end{aligned}\]where:
- $(1\star)$: Follows from the definition of $\phi$.
- $(2\star)$: Follows from applying the $L$-smoothness upper bound to expand $f(x - \alpha G _ \alpha(x))$ (A).
- $(3\star)$: Follows from expanding $f(x)$ using the convexity upper bound (B).
- $(4\star)$: Follows from expanding $\lambda \psi(x - \alpha G _ \alpha(x))$ using the convexity upper bound (C).
- $(5\star)$: This is just algebra, noticing that the $\nabla f(x)^\top (z - x + \alpha G _ \alpha(x))$ terms cancel.
- $(6\star)$: Follows from the definition of $\phi$.
@important~ (overestimation property)
Global convergence
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
- We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
- The model has minimiser $x^\ast$
@State a result about the convergence in this case.
\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]for all $k \ge 1$.
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
- We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
- The model has minimiser $x^\ast$
@Prove that then
\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]
for all $k \ge 1$
We have the overestimation result that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]
With $x = x^k$, $z = x^\ast$ and $\alpha = 1/L$, this gives
\[\begin{aligned}
\phi(x^{k+1})
&\le \phi(x^\ast) + G _ {\alpha _ k}(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} ||G _ {\alpha _ k}(x^k)||^2 \\\\
&= \phi(x^\ast) + \frac{L}{2} \left( ||x^k - x^\ast||^2 - ||x^k - x^\ast - \alpha _ k G _ {\alpha _ k}(x^k)||^2 \right) \\\\
&= \phi(x^\ast) + \frac{L}{2} \left( ||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2 \right)
\end{aligned}\]
Therefore
\[\begin{aligned}
\sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) &\le \frac L {2} \sum^{T-1} _ {k = 0} \left( ||x^k - x^\ast||^2 - || x^{k+1} - x^\ast ||^2 \right) \\\\
&= \frac L {2} (||x^0 - x^\ast||^2 - ||x^T - x^\ast||^2) \\\\
&\le \frac L {2} ||x^0 - x^\ast||^2
\end{aligned}\]
Note also that the sequence of objective values $\{\phi(x^k)\}$ is non-increasing: setting both $x$ and $z$ equal to $x^k$ in the overestimation result gives $\phi(x^{k+1}) \le \phi(x^k) - \frac \alpha 2 \vert \vert G _ \alpha(x^k) \vert \vert ^2 \le \phi(x^k)$. Therefore
\[\begin{aligned}
\phi(x^T) - \phi(x^\ast) &= \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{T}) - \phi(x^\ast) \right) \\\\
&\le \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) \\\\
&\le \frac{L}{2T} ||x^0 - x^\ast||^2
\end{aligned}\]
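As a rough numerical illustration of this $O(1/k)$ rate (my own sketch, not from the notes), the code below runs the prox-gradient method with $\alpha _ k = 1/L$ on a small $\ell _ 1$-regularised least-squares problem and compares $\phi(x^k) - \phi(x^\ast)$ with the bound, approximating $x^\ast$ and $\phi(x^\ast)$ by a long run of the same method; all names and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 15))
b = rng.normal(size=40)
lam = 0.1

grad_f = lambda x: A.T @ (A @ x - b)
phi = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # prox of t * ||.||_1

L = np.linalg.norm(A.T @ A, 2)
alpha = 1.0 / L

def run(x0, T):
    x, values = x0.copy(), []
    for _ in range(T):
        x = soft(x - alpha * grad_f(x), alpha * lam)
        values.append(phi(x))
    return x, values

x0 = np.zeros(15)
x_star, _ = run(x0, 20000)           # long run to approximate the minimiser x*
phi_star = phi(x_star)
_, values = run(x0, 200)

for k in (1, 10, 50, 200):
    bound = L * np.sum((x0 - x_star) ** 2) / (2 * k)
    # values[k-1] is phi(x^k); the theorem says phi(x^k) - phi(x*) <= L * ||x^0 - x*||^2 / (2k)
    print(k, values[k - 1] - phi_star, "<=", bound)
```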
Examples
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$ using the iterative step $x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))$
Consider the case where
\[\phi(x) := f(x) + \lambda I _ \Omega(x)\]
and where $\Omega$ is a convex set. Derive the expression for $x^{k+1}$ and interpret the results.
In this case the update becomes
\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda I _ \Omega}(x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin} _ {z \in \Omega} \frac 1 2 ||z - (x^k - \alpha _ k \nabla f(x^k))||^2 \\\\ &= \text{proj} _ \Omega(x^k - \alpha _ k \nabla f(x^k)) \end{aligned}\]which is the orthogonal projection of the steepest descent update onto the feasible set.
@example~