Optimisation for Data Science HT25, Proximal method


Why the proximal method?

One way to motivate the proximal method is to see it as a generalisation of projected gradient descent. When using projected gradient descent, we wish to solve the optimisation problem

\[\min _ {x \in \Omega} f(x)\]

where $\Omega$ is some convex set. To do this, we use gradient descent but project every iterate onto the feasible region:

\[x^{k+1} = \text{proj} _ \Omega(x^k - \alpha _ k \nabla f(x^k))\]

We may rewrite this as

\[x^{k+1} = \text{argmin} _ {z \in \Omega} ||z - (x^k - \alpha _ k \nabla f(x^k))||^2\]

and then convert this into an unconstrained optimisation problem

\[x^{k+1} = \text{argmin} _ z \Big[ || z - (x^k - \alpha _ k \nabla f(x^k)) ||^2 + I _ \Omega(z) \Big]\]

You can show that this method has reasonable convergence properties ($O(1/\sqrt k)$ in the non-convex objective case). The proximal method is what you get when you generalise to other non-smooth but still convex penalty terms, other than $I _ \Omega$:

\[\begin{aligned} x^{k+1} &= \text{argmin} _ z \Big[ \frac{1}{2} ||z - (x^k - \alpha _ k \nabla f(x^k))||^2 + \lambda \alpha _ k \psi(z) \Big] \\\\ &= \text{prox} _ {\alpha _ k \lambda \psi}(x^k - \alpha _ k \nabla f(x^k)) \end{aligned}\]

So $\text{prox}$ can be seen as a generalised projection operation.

Flashcards

Basic definitions

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is proper convex
  • $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
  • $\lambda > 0$

@Define the Moreau envelope of $h$.


\[M _ {\lambda, h} : \mathbb R^n \to \mathbb R\] \[M _ {\lambda, h}(x) := \frac{1}{\lambda} \inf _ u \left(\lambda h(u) + \frac 1 2||u-x||^2\right)\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is proper convex
  • $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
  • $\lambda > 0$

In this context, define the proximal operator $\text{prox} _ {\lambda h}$.


\[\text{prox} _ {\lambda h} : \mathbb R^n \to \mathbb R^n\]

defined by

\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2}||u-x||^2\right)\]
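
For intuition, the defining minimisation can also be approximated numerically by handing it to a generic solver. A rough sketch, assuming NumPy and SciPy are available; the function `prox_numeric`, the choice $h = ||\cdot|| _ 2$, $\lambda = 0.5$ and the test point are all illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def prox_numeric(h, lam, x):
    """Approximate prox_{lam*h}(x) by directly minimising lam*h(u) + 0.5*||u - x||^2."""
    obj = lambda u: lam * h(u) + 0.5 * np.sum((u - x) ** 2)
    return minimize(obj, x, method="Nelder-Mead").x

# Illustrative check: prox of the Euclidean norm at x = (3, -1) with lam = 0.5.
# The exact answer is block soft-thresholding, (1 - lam/||x||) * x, roughly (2.53, -0.84).
x = np.array([3.0, -1.0])
print(prox_numeric(np.linalg.norm, 0.5, x))
```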

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

@Define the gradient map $G _ \alpha$ in this context.


\[G _ \alpha : \mathbb R^n \to \mathbb R^n\]

where

\[G _ \alpha(x) = \frac{1}{\alpha}(x - \text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]

Examples of the proximal operator

Suppose $h(x) := 0$, i.e. it is identically zero. What is $\text{prox} _ {\lambda h}$?


\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ {u} \frac{1}{2} ||u - x||^2 \\\\ &= x \end{aligned}\]

@example~

Suppose $h(x) := I _ \Omega (x)$ where $\Omega$ is closed and convex. What is $\text{prox} _ {\lambda h}$?


\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ {u} \text{ } \lambda I _ \Omega(u) + \frac{1}{2} ||u - x||^2 \\\\ &= \text{argmin} _ {u \in \Omega} ||u - x||^2 \end{aligned}\]

which is the projection of $x$ onto $\Omega$.

@example~
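
As a concrete instance, projection onto the Euclidean unit ball; a minimal sketch assuming NumPy (the radius and test point are illustrative):

```python
import numpy as np

def proj_unit_ball(x, radius=1.0):
    """prox of I_Omega for Omega = {u : ||u|| <= radius}: rescale x onto the ball if it lies outside."""
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

print(proj_unit_ball(np.array([3.0, 4.0])))  # [0.6 0.8]
```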

Suppose $h(x) := \vert \vert x \vert \vert _ 1$. What is $\text{prox} _ {\lambda h}$?


Defined component-wise, it is:

\[(\text{prox} _ {\lambda ||\cdot|| _ 1}(x)) _ i = \begin{cases} x _ i - \lambda &\text{if } x _ i > \lambda \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i < -\lambda \end{cases}\]
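
This is the soft-thresholding (shrinkage) operator. A minimal componentwise sketch, assuming NumPy (the test vector and $\lambda$ are illustrative):

```python
import numpy as np

def soft_threshold(x, lam):
    """Componentwise prox of lam*||.||_1: shrink each entry towards zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([2.0, 0.3, -1.5]), 0.5))  # [ 1.5  0.  -1. ]
```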

Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$. What is $\text{prox} _ {\lambda h}$?


Defined component-wise, it is:

\[(\text{prox} _ {\lambda ||\cdot|| _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } |x _ i| > \sqrt{2\lambda} \\\\ 0 &\text{if } |x _ i| \le \sqrt{2\lambda} \end{cases}\]
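
This is hard thresholding. A matching sketch under the same assumptions (NumPy; illustrative inputs):

```python
import numpy as np

def hard_threshold(x, lam):
    """Componentwise prox of lam*||.||_0: keep entries with |x_i| > sqrt(2*lam), zero the rest."""
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

print(hard_threshold(np.array([2.0, 0.3, -1.5]), 0.5))  # [ 2.   0.  -1.5]
```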

Properties

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed
  • $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$

@State a property of $\text{prox} _ {\lambda h}$ that relates $\text{prox} _ {\lambda h}$ to the stationary points of $\phi$.


\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

and

\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]

i.e. $\text{prox} _ {\lambda h}(x)$ is a minimiser of $\phi$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed
  • $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$

@Prove that

\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

and

\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]

Note

\[\partial \phi(u) = \lambda \partial h(u) + u - x\]

By letting $u = \text{prox} _ {\lambda h}(x)$, we have

\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

as required.

Since

\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ u \left[\lambda h(u) + \frac 1 2 ||u - x||^2\right] \\\\ &=\text{argmin} _ u [\phi(u)] \end{aligned}\]

and $\phi$ is convex (it is the sum of two convex functions), $0 \in \partial \phi(u)$ iff $u$ minimises $\phi$. Since $\text{prox} _ {\lambda h}(x)$ is precisely this minimiser,

\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its finiteness.


\[M _ {\lambda, h}(x) < \infty\]

for all $x$ even if $h(x) = \infty$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its convexity.


\[M _ {\lambda, h}(x)\]

is convex in $x$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its differentiability and give its derivative.


$M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient

\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]
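
A rough numerical check of this gradient formula for $h = ||\cdot|| _ 1$, assuming NumPy; the helper `moreau_env_l1` evaluates the envelope (with the same scaling as the definition above) at its minimiser, and the test point and $\lambda$ are illustrative:

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_l1(x, lam):
    # M_{lam, ||.||_1}(x) = h(u) + ||u - x||^2 / (2*lam) evaluated at u = prox_{lam*||.||_1}(x),
    # matching the scaling of the definition above.
    u = soft_threshold(x, lam)
    return np.sum(np.abs(u)) + np.sum((u - x) ** 2) / (2.0 * lam)

lam, x = 0.7, np.array([1.3, -0.2, 0.9])
analytic = (x - soft_threshold(x, lam)) / lam        # claimed gradient (1/lam)(x - prox(x))
eps = 1e-6
numeric = np.array([(moreau_env_l1(x + eps * e, lam) - moreau_env_l1(x - eps * e, lam)) / (2 * eps)
                    for e in np.eye(len(x))])
print(np.allclose(analytic, numeric, atol=1e-5))     # expect True
```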

@Prove that if

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

then $M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient

\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]

(you may assume $h$ is differentiable at $u = \text{prox} _ {\lambda h}(x)$ and that $\text{prox} _ {\lambda h}$ is differentiable at $x$).


\[\begin{aligned} \frac{\text d}{\text dx} \big( M _ {\lambda , h}(x) \big) &= \frac{\partial}{\partial u} \big [ M _ {\lambda, h}(x) \big] \frac{\text du}{\text dx} + \frac{\partial}{\partial x}\big[ M _ {\lambda, h}(x) \big] &(1\star) \\\\ &= \frac{\partial}{\partial u} \left[h(u) + \frac{1}{2\lambda}||u - x||^2\right] \frac{\text d}{\text dx} (\text{prox} _ {\lambda h}(x)) + \frac{\partial}{\partial x} \left[\frac 1 {2\lambda}||u - x||^2\right] \\\\ &= \frac{\partial}{\partial x} \left[\frac{1}{2\lambda} ||u - x||^2\right] &(2\star) \\\\ &= \frac 1 \lambda ( x - u ) \\\\ &= \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x)) \end{aligned}\]

where:

  • $(1\star)$: Follows from the chain rule, treating $u = \text{prox} _ {\lambda h}(x)$ as a function of $x$.
  • $(2\star)$: Follows from the fact that $u$ is a minimiser of $h(u) + \frac{1}{2\lambda} \vert \vert u-x \vert \vert ^2$, and so the partial derivative with respect to $u$ vanishes there.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates its minima to the minima of $h$.


\[x^\ast \in \text{argmin} _ x h(x) \iff x^\ast \in \text{argmin} _ x M _ {\lambda, h}(x)\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function

@State a property of $\text{prox} _ {\lambda h}$ that relates to its smoothness.


\[|| \text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y) || \le ||x - y||\]

for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).
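
A quick numerical sanity check of this nonexpansiveness property for the $\ell _ 1$ prox (soft-thresholding), assuming NumPy; the random points and the value of $\lambda$ are illustrative:

```python
import numpy as np

# Draw two random points and compare distances before and after applying the l1 prox.
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
rng = np.random.default_rng(2)
x, y, lam = rng.standard_normal(5), rng.standard_normal(5), 0.3
print(np.linalg.norm(soft(x, lam) - soft(y, lam)) <= np.linalg.norm(x - y))  # expect True
```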

@Prove that if

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function

then

\[|| \text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y) || \le ||x - y||\]

for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).


Note that $x - \text{prox} _ {\lambda h}(x) \in \lambda \partial h(\text{prox} _ {\lambda h}(x))$ (this is the optimality condition $0 \in \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x$, rearranged). Then

\[\lambda h(\text{prox} _ {\lambda h}(y)) \ge \lambda h(\text{prox} _ {\lambda h}(x)) + (x - \text{prox} _ {\lambda h}(x))^\top (\text{prox} _ {\lambda h}(y) - \text{prox} _ {\lambda h}(x))\]

for all $x$ and $y$, by properties of the subdifferential. Likewise,

\[\lambda h(\text{prox} _ {\lambda h}(x)) \ge \lambda h(\text{prox} _ {\lambda h}(y)) + (y - \text{prox} _ {\lambda h}(y))^\top (\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y))\]

Summing these inequalities and simplifying, we have

\[0 \ge (x - \text{prox} _ {\lambda h}(x) - y + \text{prox} _ {\lambda h}(y))^\top (\text{prox} _ {\lambda h}(y) - \text{prox} _ {\lambda h}(x))\]

Therefore

\[||\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y)||^2 \le (x - y)^\top (\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y))\]

and hence by Cauchy-Schwarz,

\[||\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y)|| \le ||x - y||\]

as required.

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map

@State a property of $G _ \alpha$ that relates it to the derivative of $f$.


\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map

@Prove that then:

\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

Note that

\[\begin{aligned} x - \alpha G _ \alpha(x) &= x - \alpha \cdot \frac 1 \alpha (x - \text{prox} _ {\alpha\lambda \psi}(x - \alpha \nabla f(x))) \\\\ &= \text{prox} _ {\alpha \lambda \psi} (x - \alpha \nabla f(x)) \\\\ &= \text{argmin} _ u \text{ } E(u) \end{aligned}\]

where

\[E(u) := \alpha \lambda \psi(u) + \frac 1 2 ||u - (x - \alpha \nabla f(x))||^2\]

is a convex function.

Then

\[0 \in \partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]

but since

\[\begin{aligned} &\partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x))) \\\\ =& \partial E(x - \alpha G _ \alpha(x)) \\\\ =& \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + (x - \alpha G _ \alpha(x)) - (x - \alpha \nabla f(x)) \end{aligned}\]

it follows that

\[\alpha G _ \alpha(x) \in \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + \alpha \nabla f(x)\]

Dividing through by $\alpha$ gives the required result.

Regularised optimisation

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

What is the standard template used for iterative algorithms in this context? Give this in three ways:

  • First in terms of two separate steps,
  • then in one step using the proximal operator,
  • and finally using the gradient map $G _ \alpha$.

As a two-step iterative process:

\[\begin{aligned} y^{k+1} &= x^k - \alpha _ k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin} _ {z} \text{ } \frac 1 2 ||z - y^{k+1}||^2 + \alpha _ k \lambda \psi(z) \end{aligned}\]

In one step using $\text{prox}$:

\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]

Using the gradient map $G _ \alpha$:

\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k}(x^k)\]
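
A minimal code sketch of this template, assuming NumPy; the function names (`prox_gradient`, `grad_f`, `prox_psi`) and the LASSO-style test instance ($f(x) = \frac 1 2 ||Ax - b||^2$, $\psi = ||\cdot|| _ 1$, random $A$ and $b$) are illustrative rather than part of the notes:

```python
import numpy as np

def prox_gradient(grad_f, prox_psi, x0, alpha, lam, iters=500):
    """Generic prox-gradient template: x <- prox_{alpha*lam*psi}(x - alpha*grad_f(x))."""
    x = x0.copy()
    for _ in range(iters):
        x = prox_psi(x - alpha * grad_f(x), alpha * lam)
    return x

# Illustrative LASSO-style instance: f(x) = 0.5*||Ax - b||^2, psi = ||.||_1.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L = np.linalg.norm(A, 2) ** 2                        # Lipschitz constant of grad f
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
print(prox_gradient(lambda x: A.T @ (A @ x - b), soft, np.zeros(5), 1.0 / L, lam=0.1))
```

With step size $\alpha _ k = 1/L$ this is exactly the constant-step scheme analysed in the convergence section below.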

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

The proximal method solves this problem by using the iterative step

\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]

where

\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2}||u-x||^2\right)\]

Motivate this update in two different ways:

  • First as a generalisation of the projected gradient descent, then
  • As solving a sequence of simpler optimisation problems.

Generalisation of projected gradient descent:

In the projected gradient descent method, we want to solve the optimisation problem

\[\min_{x \in \Omega} f(x)\]

where $\Omega$ is a convex set. We do this by performing gradient descent steps as normal and then projecting onto $\Omega$:

\[\begin{aligned} y^{k+1} &= x^k - \alpha_k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin}_{z \in \Omega} \frac 1 2||z - y^{k+1}||^2 \end{aligned}\]

which we can rewrite as

\[x^{k+1} = \text{prox}_{\alpha_kI_\Omega}(x^k - \alpha_k \nabla f(x^k))\]

The proximal method is just what you get when you consider more general “projection operators” $\text{prox} _ {\alpha _ k \lambda \psi}$ where $\psi$ is a convex function.


As solving a sequence of simpler optimisation problems:

We may rewrite the update as so:

\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin}_{z} \left[\frac{1}{2} ||z - (x^k - \alpha_k \nabla f(x^k))||^2 + \lambda \alpha_k \psi(z)\right] \\\\ &= \text{argmin}_z \left[ f(x^k) + \nabla f(x^k)^\top(z - x^k) + \frac{1}{2\alpha_k}||z - x^k||^2 + \lambda \psi(z) \right] \end{aligned}\]

Note that

\[f(x^k) + \nabla f(x^k)^\top (z - x^k) + \frac{1}{2\alpha_k} ||z - x^k||^2\]

is a quadratic upper bound on $f$ whenever $\alpha_k \le 1/L$ (by $L$-smoothness). So the proximal method can be interpreted as solving a sequence of simpler optimisation problems, each of which minimises a model that upper-bounds $\phi$.

Convergence theory

Foundational inequality / Overestimation property

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed

@State the foundational inequality of the prox-method in this context, first in full, and then show how it reduces to a statement generalising the foundational inequality of steepest descent.


In general: For all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]

Generalises the foundational inequality: Note that one step in the prox method is

\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k} (x^k)\]

Substituting this into the above inequality (with $x = x^k$ and $\alpha = \alpha _ k$), it reduces to

\[\phi(x^{k+1}) \le \phi(z) + G _ {\alpha _ k}(x^k)^\top (x^k - z) - \frac{\alpha _ k}{2} ||G _ {\alpha _ k}(x^k)||^2\]

Letting $z = x^k$, this becomes:

\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 ||G _ {\alpha _ k}(x^k)||^2\]

Compare this to the foundational inequality of gradient descent:

\[f(x^{k+1}) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]

@important~ (overestimation property)

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed

@Prove the foundational inequality of the prox-method in this context, i.e. that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]

We require 3 sub-results.

$L$-smoothness upper bound (A): As $f$ is $L$-smooth, we have

\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac L 2 ||y - x||^2\]

for all $y \in \mathbb R^n$. Letting $y = x - \alpha G _ \alpha(x)$, we obtain

\[\begin{aligned} f(x - \alpha G _ \alpha(x)) &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac{L\alpha^2}{2} ||G _ \alpha(x)||^2 \\\\ &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 \end{aligned}\]

where the last inequality follows from the assumption that $\alpha \in (0, \frac 1 L]$.

Convexity of $f$ upper bound (B): The convexity of $f$ implies that

\[f(z) \ge f(x) + \nabla f(x)^\top (z - x)\]

and hence

\[f(x) \le f(z) - \nabla f(x)^\top (z - x)\]

Convexity of $\psi$ upper bound (C): By a previous result, we have that

\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

or equivalently,

\[G_\alpha(x) - \nabla f(x) \in \lambda \partial \psi(x - \alpha G_\alpha(x))\]

Thus, since $G _ \alpha(x) - \nabla f(x)$ is a subgradient of $\lambda \psi$ at $x - \alpha G _ \alpha(x)$:

\[\lambda \psi(z) \ge \lambda \psi(x - \alpha G_\alpha (x)) + (G_\alpha(x) - \nabla f(x))^\top (z-(x - \alpha G_\alpha(x)))\]

and hence

\[\lambda \psi(x - \alpha G_\alpha(x)) \le \lambda \psi(z) - (G_\alpha(x) - \nabla f(x))^\top (z - (x - \alpha G_\alpha (x)))\]

Combining these results:

\[\begin{aligned} &\phi(x - \alpha G _ \alpha(x))\\\\ &= f(x - \alpha G _ \alpha(x)) + \lambda \psi(x - \alpha G _ \alpha(x)) && (1\star) \\\\ &\le f(x) - \alpha G _ \alpha (x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (2\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x) - \alpha G_\alpha(x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (3\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x))) &&(4\star)\\\\ &= f(z) + \lambda \psi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2 &&(5\star)\\\\ &= \phi(z) + G _ \alpha(x)^\top (x - z) - \frac{\alpha}{2}||G _ \alpha(x)||^2 && (6\star) \end{aligned}\]

where:

  • $(1\star)$: Follows from the definition of $\phi$.
  • $(2\star)$: Follows from applying the $L$-smoothness upper bound to expand $f(x - \alpha G _ \alpha(x))$ (A).
  • $(3\star)$: Follows from expanding $f(x)$ using the convexity upper bound (B).
  • $(4\star)$: Follows from expanding $\lambda \psi(x - \alpha G _ \alpha(x))$ using the convexity upper bound (C).
  • $(5\star)$: This is just algebra, noticing that the $\nabla f(x)^\top (z - x + \alpha G _ \alpha(x))$ terms cancel.
  • $(6\star)$: Follows from the definition of $\phi$.

@important~ (overestimation property)

Global convergence

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed
  • We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
  • The problem has minimiser $x^\ast$

@State a result about the convergence in this case.


\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]

for all $k \ge 1$.
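
Below is a rough empirical sanity check of this bound on a small LASSO-style instance (NumPy assumed; the data $A$, $b$, the value of $\lambda$ and the iteration counts are illustrative, and $x^\ast$ is approximated by running the method for many iterations):

```python
import numpy as np

rng = np.random.default_rng(1)
A, b, lam = rng.standard_normal((30, 8)), rng.standard_normal(30), 0.05
L = np.linalg.norm(A, 2) ** 2
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
phi = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x0 = np.zeros(8)
x, values = x0.copy(), []
for _ in range(5000):
    x = soft(x - (A.T @ (A @ x - b)) / L, lam / L)   # prox-gradient step with alpha = 1/L
    values.append(phi(x))
x_star, phi_star = x, values[-1]                     # treat the final iterate as (approximately) x*

for k in (1, 10, 100, 1000):
    gap = values[k - 1] - phi_star                   # phi(x^k) - phi(x*)
    bound = L / (2 * k) * np.sum((x0 - x_star) ** 2)
    print(k, gap <= bound + 1e-12)                   # expect True at each k
```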

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed
  • We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
  • The problem has minimiser $x^\ast$

@Prove that then

\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]

for all $k \ge 1$

We have the overestimation result that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]

With $x = x^k$, $z = x^\ast$ and $\alpha = 1/L$, this gives

\[\begin{aligned} \phi(x^{k+1}) &\le \phi(x^\ast) + G _ {\alpha _ k}(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} ||G _ {\alpha _ k}(x^k)||^2 \\\\ &= \phi(x^\ast) + \frac{L}{2} \left( ||x^k - x^\ast||^2 - ||x^k - x^\ast - \alpha _ k G _ {\alpha _ k}(x^k)||^2 \right) \\\\ &= \phi(x^\ast) + \frac{L}{2} \left( ||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2 \right) \end{aligned}\]

Therefore

\[\begin{aligned} \sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) &\le \frac L {2} \sum^{T-1} _ {k = 0} \left( ||x^k - x^\ast||^2 - || x^{k+1} - x^\ast ||^2 \right) \\\\ &= \frac L {2} (||x^0 - x^\ast||^2 - ||x^T - x^\ast||^2) \\\\ &\le \frac L {2} ||x^0 - x^\ast||^2 \end{aligned}\]

Note also that the sequence of objective values $\phi(x^k)$ is non-increasing: taking $x = z = x^k$ in the foundational inequality gives $\phi(x^{k+1}) \le \phi(x^k) - \frac {\alpha _ k} 2 \vert \vert G _ {\alpha _ k}(x^k) \vert \vert ^2$. Therefore

\[\begin{aligned} \phi(x^T) - \phi(x^\ast) &= \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{T}) - \phi(x^\ast) \right) \\\\ &\le \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) \\\\ &\le \frac{L}{2T} ||x^0 - x^\ast||^2 \end{aligned}\]

Examples

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$ using the iterative step $x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))$

Consider the case where

\[\phi(x) := f(x) + \lambda I _ \Omega(x)\]

and where $\Omega$ is a convex set. Derive the expression for $x^{k+1}$ and interpret the results.


\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda I _ \Omega}(x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin} _ {z \in \Omega} \text{ } \frac{1}{2} ||z - (x^k - \alpha _ k\nabla f(x^k))||^2 \end{aligned}\]

which is the orthogonal projection of the steepest descent update onto the feasible set.

@example~
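
As a tiny concrete sketch of this update (NumPy assumed; the objective $f(x) = \frac 1 2 ||x - c||^2$, the box $[0, 1]^2$ and the step size are illustrative), one iteration clips the gradient step back into the feasible set:

```python
import numpy as np

def proj_box(x, lo, hi):
    """prox of I_Omega for the box Omega = [lo, hi]^n, i.e. componentwise clipping."""
    return np.clip(x, lo, hi)

# One projected-gradient step for f(x) = 0.5*||x - c||^2 over the box [0, 1]^2.
c, x, alpha = np.array([2.0, -0.5]), np.array([0.5, 0.5]), 0.5
print(proj_box(x - alpha * (x - c), 0.0, 1.0))       # [1. 0.]
```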



