Optimisation for Data Science HT25, Proximal method


Why the proximal method?

One way to motivate the proximal method is to see it as a generalisation of projected gradient descent. When using projected gradient descent, we wish to solve the optimisation problem

\[\min _ {x \in \Omega} f(x)\]

where $\Omega$ is some convex set. To do this, we use gradient descent but project every iterate onto the feasible region:

\[x^{k+1} = \text{proj} _ \Omega(x^k - \alpha _ k \nabla f(x^k))\]

We may rewrite this as

\[x^{k+1} = \text{argmin} _ {z \in \Omega} ||z - (x^k - \alpha _ k \nabla f(x^k))||^2\]

and then convert this into an unconstrained optimisation problem

\[x^{k+1} = \text{argmin} _ z \Big[ || z - (x^k - \alpha _ k \nabla f(x^k)) ||^2 + I _ \Omega(z) \Big]\]

You can show that this method has reasonable convergence properties ($O(\sqrt k)$ in the non-convex objective case). The proximal method is what you get when you generalise to other non-smooth but still convex penalty terms, other than $I _ \Omega$:

\[\begin{aligned} x^{k+1} &= \text{argmin} _ z \Big[ \frac{1}{2} ||z - (x^k - \alpha _ k \nabla f(x^k))||^2 + \lambda \alpha _ k \psi(z) \Big] \\\\ &= \text{prox} _ {\alpha _ k \lambda \psi}(x^k - \alpha _ k \nabla f(x^k)) \end{aligned}\]

So $\text{prox}$ can be seen as a generalised projection operation.

Flashcards

Basic definitions

Suppose:

  • $h : \mathbb R^n \to \mathbb R$ is proper convex
  • $h$ is closed (all sublevel sets $\{x : f(x) \le \alpha\}$ are closed)
  • $\lambda > 0$

@Define the Moreau envelope of $h$.


\[M _ {\lambda, h} : \mathbb R^n \to \mathbb R\] \[M _ {\lambda, h}(x) := \frac{1}{\lambda} \inf _ u \left(\lambda h(u) + \frac 1 2 \vert \vert u-x \vert \vert ^2\right)\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R$ is proper convex
  • $h$ is closed (all sublevel sets $\{x : f(x) \le \alpha\}$ are closed)
  • $\lambda > 0$

In this context, define the proximal operator $\text{prox} _ {\lambda h}$.


\[\text{prox} _ {\lambda h} : \mathbb R^n \to \mathbb R\]

defined by

\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2} \vert \vert u-x \vert \vert ^2\right)\]

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\sigma : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

@Define the gradient map $G _ \alpha$ in this context.


\[G _ \alpha : \mathbb R^n \to \mathbb R^n\]

where

\[G _ \alpha(x) = \frac{1}{\alpha}(x - \text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]

Properties

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed
  • $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$

@State a property of $\text{prox} _ {\lambda h}$ that relates $\text{prox} _ {\lambda h}$ to the stationary points of $\phi$.


\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

and

\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]

i.e. $\text{prox} _ {\lambda h}(x)$ is a minimiser of $\phi$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed
  • $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$

@Prove that

\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

and

\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]

Note

\[\partial \phi(u) = \lambda \partial h(u) + u - x\]

By letting $u = \text{prox} _ {\lambda h}(x)$, we have

\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

as required.

Since

\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ u \left[\lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2\right] \\\\ &=\text{argmin} _ u [\varphi(u)] \end{aligned}\]

and $\varphi$ is convex (it is the sum of two convex functions), $0 \in \partial(\varphi(u))$ iff $u$ minimises $\varphi$. Since $\text{prox} _ {\lambda h}(x)$ is precisely the minimiser,

\[0 \in \partial \varphi(\text{prox} _ {\lambda h}(x))\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its finiteness.


\[M _ {\lambda, h}(x) < \infty\]

for all $x$ even if $h(x) = \infty$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its convexity.


\[M _ {\lambda, h}(x)\]

is convex in $x$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its differentiability and give its derivative.


$M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient

\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]

@Prove that if

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

then $M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient

\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]

(you may assume $h$ is differentiable at $u = \text{prox} _ {\lambda h}(x)$ and that $\text{prox} _ {\lambda h}$ is differentiable at $x$).


Recall

\[M _ {\lambda, h}(x) := \frac{1}{\lambda} \inf _ u \left(\lambda h(u) + \frac 1 2 \vert \vert u-x \vert \vert ^2\right)\]

We write

\[M _ {\lambda, h}(x) := h(\text{prox} _ {\lambda h}(x)) + \frac 1 {2\lambda} \vert \vert \text{prox} _ {\lambda h}(x) - x \vert \vert ^2\] \[\begin{aligned} \frac{\text d}{\text dx} \big( M _ {\lambda , h}(x) \big) &= \frac{\partial}{\partial u} \big [ M _ {\lambda, h}(x) \big] \frac{\text du}{\text dx} + \frac{\partial}{\partial x}\big[ M _ {\lambda, h}(x) \big] &(1\star) \\\\ &= \frac{\partial}{\partial u} \left[h(u) + \frac{1}{2\lambda} \vert \vert u - x \vert \vert ^2\right] \frac{\text d}{\text dx} (\text{prox} _ {\lambda h}(x)) + \frac{\partial}{\partial x} \left[\frac 1 {2\lambda} \vert \vert u - x \vert \vert ^2\right] \\\\ &= \frac{\partial}{\partial x} \left[\frac{1}{2\lambda} \vert \vert u - x \vert \vert ^2\right] &(2\star) \\\\ &= \frac 1 \lambda ( x - u ) \\\\ &= \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x)) \end{aligned}\]

where:

  • $(1\star)$: Follows from the chain rule applied to the function $g(x, u(x))$ where $g(x, u) = h(u) + \frac 1 {2\lambda} \vert \vert u - x \vert \vert ^2$.
  • $(2\star)$: Follows from the fact that $u$ is a minimiser of $h(u) - \frac{1}{2\lambda} \vert \vert u-x \vert \vert ^2$ and so that term is zero.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates its minima to the minima of $h$.


\[x^\ast \in \text{argmin} _ x h(x) \iff x^\ast \in \text{argmin} _ x M _ {\lambda, h}(x)\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function

@State a property of $\text{prox} _ {\lambda h}$ that relates to its smoothness.


\[\vert \vert \text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y) \vert \vert \le \vert \vert x - y \vert \vert\]

for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).

@Prove that if

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function

then

\[\vert \vert \text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y) \vert \vert \le \vert \vert x - y \vert \vert\]

for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).


By the result that says:

\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

and

\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]

In the following, write $p$ for $\text{prox} _ {\lambda h}$. Then

Note that $x - p(x) \in \lambda \partial h(p(x))$. Then

\[\lambda h(u) \ge \lambda h(p(x)) + (x - p(x))^\top (u - p(x))\]

for all $x$ and $u$, by properties of the subdifferential. In particular, by taking $u = p(y)$ we have

\[\lambda h(p(y)) \ge \lambda h(p(x)) + (x - p(x))^\top (p(y) - p(x))\]

for all $x$ and $y$. Likewise,

\[\lambda h(p(x)) \ge \lambda h(p(y)) + (y - p(y))^\top (p(x) - p(y))\]

Summing these inequalities and simplifying, we have

\[\begin{aligned} &0 \ge (x - p(x) - y + p(y))^\top (p(y) - p(x)) \\\\ \implies& 0 \ge ((x - y) - (p(x) - p(y)))^\top (p(y) - p(x)) \\\\ \implies& 0 \ge (x - y)^\top(p(y) - p(x)) - (p(x) - p(y))^\top (p(y) - p(x)) \\\\ \implies& 0 \ge -(x-y)^\top (p(x) - p(y)) +(p(x) - p(y))^\top(p(x) - p(y)) \\\\ \implies& (x - y)^\top (p(x) - p(y)) \ge \vert \vert p(x) - p(y) \vert \vert ^2 \\\\ \end{aligned}\]

Hence by Cauchy-Schwarz and dividing through,

\[\vert \vert p(x) - p(y) \vert \vert \le \vert \vert x - y \vert \vert\]

as required.

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map

@State a property of $G _ \alpha$ that relates it to the derivative of $f$.


\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map

@Prove that then:

\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

Note that

\[\begin{aligned} x - \alpha G _ \alpha(x) &= x - \alpha \cdot \frac 1 \alpha (x - \text{prox} _ {a\lambda \psi}(x - \alpha \nabla f(x))) \\\\ &= \text{prox} _ {\alpha \lambda \psi} (x - \alpha \nabla f(x)) \\\\ &= \text{argmin} _ u \text{ } E(u) \end{aligned}\]

where

\[E(u) := \alpha \lambda \psi(u) + \frac 1 2 \vert \vert u - (x - \alpha \nabla f(x)) \vert \vert ^2\]

is a convex function.

Then

\[0 \in \partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]

but since

\[\begin{aligned} &\partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x))) \\\\ =& \partial E(x - \alpha G _ \alpha(x)) \\\\ =& \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + (x - \alpha G _ \alpha(x)) - (x - \alpha \nabla f(x)) \end{aligned}\]

it follows that

\[\alpha G _ \alpha(x) \in \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + \alpha \nabla f(x)\]

Regularised optimisation

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\sigma : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

What is the standard template used for iterative algorithms in this context? Give this in three ways:

  • First in terms of two separate steps,
  • then in one step using the proximal operator,
  • and finally using the gradient map $G _ \alpha$.

As a two-step iterative process:

\[\begin{aligned} y^{k+1} &= x^k - \alpha _ k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin} _ {z} \text{ } \vert \vert z - y^{k+1} \vert \vert ^2 + \alpha _ k \lambda \psi(z) \end{aligned}\]

In one step using $\text{prox}$:

\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]

Using the gradient map $G _ \alpha$:

\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k}(x^k)\]

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\sigma : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

The proximal method solves this problem by using the iterative step

\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]

where

\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2} \vert \vert u-x \vert \vert ^2\right)\]

Motivate this update in two different ways:

  • First as a generalisation of the projected gradient descent, then
  • As solving a sequence of simpler optimisation problems.

Generalisation of projected gradient descent:

In the projected gradient descent method, we want to solve the optimisation problem

\[\min _ {x \in \Omega} f(x)\]

where $\Omega$ is a convex set. We do this by performing gradient descent steps as normal and then projecting onto $\Omega$:

\[\begin{aligned} x^{k+1} &= y^k - \alpha _ k \nabla f(y^k) \\\\ y^{k+1} &= \text{argmin} _ {z \in \mathbb \Omega} \frac 1 2 \vert \vert z - x^{k+1} \vert \vert ^2 \end{aligned}\]

which we could rewrite as.

\[x^{k+1} = \text{prox} _ {\alpha _ kI _ \Omega}(x^k - \alpha _ k \nabla f(x^k))\]

The proximal method is just what you get when you consider more general “projection operators” $\text{prox} _ {\alpha _ k \lambda \psi}$ where $\psi$ is a convex function.


As solving a sequence of simpler optimisation problems:

We may rewrite the update as so:

\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin} _ {z} \left[\frac{1}{2} \vert \vert z - (x^k - \alpha _ k \nabla f(x^k) \vert \vert ^2 + \lambda \alpha _ k \psi(z)\right] \\\\ &= \text{argmin} _ z \left[ f(x^k) + \nabla f(x^k)^\top(z - x^k) + \frac{1}{2\alpha _ k} \vert \vert z - x^k \vert \vert ^2 + \lambda \alpha _ k \psi(x) \right] \end{aligned}\]

Note that

\[f(x^k) + \nabla f(x^k)^\top (z - x^k) + \frac{1}{2\alpha _ k} \vert \vert z - x^k \vert \vert ^2\]

is a quadratic upper bound of $f$. So the proximal method can be interpreted as solving a sequence of simpler optimisation problems for which the minimum is an upper bound on the minimum of $f$.

Convergence theory

Foundational inequality / Overestimation property

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed

@State the foundational inequality of the prox-method in this context, first in full and then @prove how it reduces to generalise the foundational inequality of steepest descent.


In general: For all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2\]

Generalises the foundational inequality: Note that one step in the prox method is

\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k} (x^k)\]

Substituting this into the above equation, it reduces to

\[\phi(x^{k+1}) \le \phi(z) + G _ {\alpha _ k}(x^k)^\top (x^k - z) - \frac{\alpha _ k}{2} \vert \vert G _ {\alpha _ k}(x^k) \vert \vert ^2\]

Letting $z = x^k$, this becomes:

\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 \vert \vert G _ {\alpha _ k}(x^k) \vert \vert ^2\]

Compare this to the foundational inequality of gradient descent:

\[f(x^{k+1}) \le f(x^k) - \frac 1 {2L} \vert \vert \nabla f(x^k) \vert \vert ^2\]

@important~ (overestimation property)

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed

@Prove the foundational inequality of the prox-method in this context, i.e. that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2\]

We require 3 sub-results.

$L$-smoothness upper bound (A): As $f$ is $L$-smooth, we have

\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac L 2 \vert \vert y - x \vert \vert ^2\]

for all $y \in \mathbb R^n$. Letting $y = x - \alpha G _ \alpha(x)$, we obtain

\[\begin{aligned} f(x - \alpha G _ \alpha(x)) &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac{L\alpha^2}{2} \vert \vert G _ \alpha(x) \vert \vert ^2 \\\\ &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 \end{aligned}\]

where the last inequality follows from the assumption that $\alpha \in (0, \frac 1 L]$.

Convexity of $f$ upper bound (B): The convexity of $f$ implies that

\[f(z) \ge f(x) + \nabla f(x)^\top (z - x)\]

and hence

\[f(x) \le f(z) - \nabla f(x)^\top (z - x)\]

Convexity of $\psi$ upper bound (C): By a previous result, we have that

\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

or equivalently,

\[G _ \alpha(x) - \nabla f(x) \in \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

Thus by the definition of a $G _ \alpha(x) - \nabla f(x)$ being a subgradient of $\lambda \psi$ at $x - \alpha G _ \alpha(x)$:

\[\lambda \psi(z) \ge \lambda \psi(x - \alpha G _ \alpha (x)) + (G _ \alpha(x) - \nabla f(x))^\top (z-(x - \alpha G _ \alpha(x))) \\\\\]

and hence

\[\lambda \psi(x - \alpha G _ \alpha(x)) \le \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top (z - (x - \alpha G _ \alpha (x)))\]

Combining these results:

\[\begin{aligned} &\phi(x - \alpha G _ \alpha(x))\\\\ &= f(x - \alpha G _ \alpha(x)) + \lambda \psi(x - \alpha G _ \alpha(x)) && (1\star) \\\\ &\le f(x) - \alpha G _ \alpha (x)^\top \nabla f(x) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (2\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (3\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 + \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x))) &&(4\star)\\\\ &= f(z) + \lambda \psi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 &&(5\star)\\\\ &= \phi(z) + G _ \alpha(x)^\top (x - z) - \frac{\alpha}{2} \vert \vert G _ \alpha(x) \vert \vert ^2 && (6\star) \end{aligned}\]

where:

  • $(1\star)$: Follows from the definition of $\phi$.
  • $(2\star)$: Follows from applying the $L$-smoothness upper bound to expand $f(x - \alpha G _ \alpha(x))$ (A).
  • $(3\star)$: Follows from expanding $f(x)$ using the convexity upper bound (B).
  • $(4\star)$: Follows from expanding $\lambda \psi(x - \alpha G _ \alpha(x))$ using the convexity upper bound (C).
  • $(5\star)$: This is just algebra, noticing that the $\nabla f(x)^\top (z - x + \alpha G _ \alpha(x))$ terms cancel.
  • $(6\star)$: Follows from the definition of $\phi$.

@important~ (overestimation property)

Global convergence

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed
  • We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
  • The model has minimiser $x^\ast$

@State a result about the convergence in this case.


\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} \vert \vert x^0 - x^\ast \vert \vert ^2\]

for all $k \ge 1$.

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed
  • We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
  • The model has minimiser $x^\ast$

@Prove that then

\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} \vert \vert x^0 - x^\ast \vert \vert ^2\]

for all $k \ge 1$.


We have the overestimation result that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2\]

With $x = x^k$, $z = x^\ast$ and $\alpha = 1/L$, this gives

\[\begin{aligned} \phi(x^{k+1}) &\le \phi(x^\ast) + G _ {\alpha _ k}(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} \vert \vert G _ {\alpha _ k}(x^k) \vert \vert ^2 \\\\ &= \phi(x^\ast) + \frac{L}{2} \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^k - x^\ast - \alpha _ k G _ {\alpha _ k}(x^k) \vert \vert ^2 \right) \\\\ &= \phi(x^\ast) + \frac{L}{2} \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2 \right) \end{aligned}\]

Therefore

\[\begin{aligned} \sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) &\le \frac L {2} \sum^{T-1} _ {k = 0} \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2 \right) \\\\ &= \frac L {2} ( \vert \vert x^0 - x^\ast \vert \vert ^2 - \vert \vert x^T - x^\ast \vert \vert ^2) \\\\ &\le \frac L {2} \vert \vert x^0 - x^\ast \vert \vert ^2 \end{aligned}\]

Note also that the sequence of iterates is decreasing, since if $x$ and $z$ are both $x^{k}$, we have $\phi(x^{k+1}) \le \phi(x^k) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2$. Therefore

\[\begin{aligned} \phi(x^T) - \phi(x^\ast) &= \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{T}) - \phi(x^\ast) \right) \\\\ &\le \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) \\\\ &\le \frac{L}{2T} \vert \vert x^0 - x^\ast \vert \vert ^2 \end{aligned}\]

Computations of Moreau envelopes, proximal operators and gradient maps

Suppose you’re asked to calculate $\text{prox} _ {\lambda h}(x)$. Then you have

\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ u \text{ } E(u)\]

where

\[E(u) = \frac{1}{2}||u-x||^2 _ 2 + \lambda h(u)\]

In general, it’s helpful to proceed like so.

  • See if $E$ is separable. If $h = \vert \vert \cdot \vert \vert _ 1$ then you can write $E(u) = \sum^n _ {i = 1} \frac 1 2 (u _ i - x _ i)^2 _ 2 + \lambda \vert u _ i \vert $ and hence it suffices to minimise each summand separately. This also applies to if $E$ splits into blocks of variables.
  • Calculate $\partial E(u)$. This will likely split into several different cases depending on the value of $u$.
  • Assume you are in each particular case, and solve for $u$.
  • Combine all these cases together.

For simple functions like $h = 0$ or $h = I _ \Omega$, this full derivation is not needed.

$\text{prox}$ for $h = 0$

Suppose $h(x) := 0$, i.e. it is identically zero. What is $\text{prox} _ {\lambda h}$?


\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ {u} \frac{1}{2} \vert \vert u - x \vert \vert ^2 \\\\ &= x \end{aligned}\]

@example~ @justify~

$\text{prox}$ for $h = I _ {\Omega}$, indicator function on closed and convex set

Suppose $h(x) := I _ \Omega (x)$ where $\Omega$ is closed and convex. What is $\text{prox} _ {\lambda h}$?


\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ {u} \text{ } \lambda I _ \Omega(u) + \frac{1}{2} \vert \vert u - x \vert \vert ^2 \\\\ &= \text{argmin} _ {u \in \Omega} \vert \vert u - x \vert \vert ^2 \end{aligned}\]

which is the projection of $x$ onto $\Omega$.

@example~ @justify~

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$ using the iterative step $x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))$

Consider the case where

\[\phi(x) := f(x) + \lambda I _ \Omega(x)\]

and where $\Omega$ is a convex set. Derive the expression for $x^{k+1}$ and interpret the results.


\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda I _ \Omega}(x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin} _ {z \in \Omega} \text{ } \frac{1}{2} \vert \vert z - (x^k - \alpha _ k\nabla f(x^k)) \vert \vert ^2 \end{aligned}\]

which is the orthogonal projection of the steepest descent update onto the feasible set.

@example~

$\text{prox}$ for $h = \vert \vert \cdot \vert \vert _ 1$

Suppose $h(x) := \vert \vert x \vert \vert _ 1$. What is $\text{prox} _ {\lambda h}$?


Defined component-wise, it is:

\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1}(x)) _ i = \begin{cases} x _ i - \lambda &\text{if } x _ i > \lambda \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i < -\lambda \end{cases}\]

@Prove that

\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1}(x)) _ i = \begin{cases} x _ i - \lambda &\text{if } x _ i > \lambda \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i < -\lambda \end{cases}\]

Note that

\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ u \left\lbrace \frac 1 2 \vert \vert u - x \vert \vert ^2 _ 2 + \lambda \vert \vert u \vert \vert _ 1 \right\rbrace \\\\ &= \text{argmin} _ u \left\lbrace \sum^n _ {i = 1} \frac 1 2 (u _ i - x _ i)^2 + \lambda \vert u _ i \vert \right\rbrace \end{aligned}\]

and hence it suffices to minimise each component separately. Let

\[E(u) := \frac 1 2 \vert \vert u - x \vert \vert ^2 _ 2 + \lambda \vert \vert u \vert \vert _ 1\]

Hence

\[\partial E(u) _ i = \begin{cases} \lbrace u _ i - x _ i + \lambda \rbrace &\text{if } u _ i > 0 \\\\ \lbrace u _ i - x _ i + \ell \mid \ell \in [-\lambda, \lambda] \rbrace &\text{if } u _ i = 0 \\\\ \lbrace u _ i - x _ i - \lambda \rbrace &\text{if } u _ i < 0 \\\\ \end{cases}\]

and by optimality conditions, we need $0 \in \partial E(u)$, so have

\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1} (x) = \begin{cases} x _ i - \lambda &\text{if } x _ i - \lambda > 0 \\\\ 0 &\text{if } 0 \in \lbrace -x _ i + \ell \mid \ell \in [-\lambda, \lambda]\rbrace \\\\ x _ i + \lambda &\text{if } x _ i + \lambda < 0 \end{cases}\]

i.e.

\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1} (x) = \begin{cases} x _ i - \lambda &\text{if } x _ i - \lambda > 0 \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i + \lambda < 0 \end{cases}\]
$\text{prox}$ for $h = \vert \vert \cdot \vert \vert _ 2$

Suppose $h(x) := \vert \vert x \vert \vert _ 2$. What is $\text{prox} _ {\lambda h}$?


\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \max\left(0, \left( 1 - \frac{\lambda}{ \vert \vert x \vert \vert _ 2} \right)\right) x\]

Prove that

\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \max\left(0, \left( 1 - \frac{\lambda}{ \vert \vert x \vert \vert _ 2} \right)\right) x\]

\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \text{argmin} _ u \left \lbrace \frac 1 2 \vert \vert u - x \vert \vert ^2 _ 2 + \lambda \vert \vert u \vert \vert _ 1 \right \rbrace\]

Let

\[E(u) = \frac 1 2 \vert \vert u - x \vert \vert ^2 _ 2 + \lambda \vert \vert u \vert \vert _ 1\]

We require $0 \in \partial E(u)$. Then

\[\partial E(u) = u - x + \lambda \partial( \vert \vert \cdot \vert \vert _ 2)(u)\]

since

\[\partial( \vert \vert \cdot \vert \vert _ 2)(u) = \begin{cases} \lbrace u / \vert \vert u \vert \vert \rbrace &\text{if }u \ne 0 \\\\ \lbrace v \mid \vert \vert v \vert \vert \le 1 \rbrace &\text{if } u =0 \end{cases}\]

So

\[\partial E(u) = \begin{cases} \lbrace u - x + \lambda \frac{u}{ \vert \vert u \vert \vert } \rbrace &\text{if } u \ne 0 \\\\ \lbrace u - x + \lambda v \mid \vert \vert v \vert \vert \le 1 \rbrace &\text{if } u = 0 \end{cases}\]

Case $u \ne 0$: In this case

\[\partial E(u) = \lbrace u - x + \lambda \frac{u}{ \vert \vert u \vert \vert }\rbrace\]

Then for the subgradient to contain $0$, we require $u - x + \lambda \frac{u}{ \vert \vert u \vert \vert } = 0$. Then $x = u + \lambda \frac u { \vert \vert u \vert \vert }$ which implies $ \vert \vert x \vert \vert = \vert \vert u \vert \vert (1 + \frac{\lambda}{ \vert \vert u \vert \vert }) = \vert \vert u \vert \vert + \lambda$, and hence $ \vert \vert u \vert \vert = \vert \vert x \vert \vert - \lambda$. This tells us two things: This case is only possible when $ \vert \vert x \vert \vert \ge \lambda$ (otherwise the norm would be negative), and if this is true, then: $x = u(1 + \frac{\lambda}{ \vert \vert x \vert \vert - \lambda})$ which implies

\[\begin{aligned} u &= \frac{x}{1 + \frac{\lambda}{ \vert \vert x \vert \vert - \lambda} } \\\\ &= \frac{x( \vert \vert x \vert \vert - \lambda)}{ \vert \vert x \vert \vert - \lambda + \lambda} \\\\ &= \left(1 - \frac{\lambda}{ \vert \vert x \vert \vert }\right)x \end{aligned}\]

Case $u = 0$: Then we require

\[-x + \lambda v = 0\]

which is feasible when $ \vert \vert x \vert \vert \le \lambda$.

Overall: Thus

\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \begin{cases} \left(1 - \frac{\lambda}{ \vert \vert x \vert \vert }\right)x &\text{if } \vert \vert x \vert \vert \ge \lambda \\\\ 0 &\text{otherwise, when } \vert \vert x \vert \vert \le \lambda \end{cases}\]
$\text{prox}$ for $h = \vert \vert \cdot \vert \vert _ 0$

Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$. What is $\text{prox} _ {\lambda h}$?


Defined component-wise, it is:

\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } \vert x _ i \vert > \sqrt{2\lambda} \\\\ 0 &\text{if } \vert x _ i \vert \le \sqrt{2\lambda} \end{cases}\]

Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$.

@Prove that

\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } \vert x _ i \vert > \sqrt{2\lambda} \\\\ 0 &\text{if } \vert x _ i \vert \le \sqrt{2\lambda} \end{cases}\]

We have

\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x) = \text{argmin} _ u \text{ } E(u)\]

where

\[E(u) = \sum^n _ {i = 1} \frac{1}{2}(u _ i - x _ i)^2 + \lambda [u _ i \ne 0]\]

Hence it suffices to minimise each summand separately. Then

\[E _ i(u) = \frac 1 2 (u _ i - x _ i)^2 + \lambda [u _ i \ne 0]\]

If $u _ i \ne 0$, the cost is $E _ i(u) = \frac 1 2(u _ i - x _ i)^2 + \lambda$. Hence it is worth setting $u _ i = x _ i$ iff $\frac 1 2 x _ i^2 > \lambda$, i.e. $ \vert x _ i \vert > 2\lambda$. Otherwise, set $u _ i = 0$. Then

\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } \vert x _ i \vert > \sqrt{2\lambda} \\\\ 0 &\text{if } \vert x _ i \vert \le \sqrt{2\lambda} \end{cases}\]
$\text{prox}$ for $h(x) = \sum \vert \vert y _ \ell \vert \vert _ 2$, sum of $2$-norms of blocks

Suppose

\[\begin{aligned} &h : \mathbb R^n \to \mathbb R \\\\ &h(x) = \sum^m _ {\ell = 1} \vert \vert y _ \ell \vert \vert \end{aligned}\]

where $n = km$ for some $k, m \in \mathbb N$ and the inputs $x$ have block structure $x = (y _ 1, \ldots, y _ m)$. What is $\text{prox} _ {\lambda h}$?


Let $\text{prox} _ {\lambda h}(x) _ {y _ \ell}$ denote the coordinates of $\text{prox} _ {\lambda h}(x)$ in the $\ell$th block. Then

\[\text{prox} _ {\lambda h}(x) _ {y _ {\ell} } = \begin{cases} (1 - \frac{\lambda}{ \vert \vert y _ \ell \vert \vert })y _ \ell &\text{if } \vert \vert y _ \ell \vert \vert \ge \lambda \\\\ 0 &\text{if } \vert \vert y _ \ell \vert \vert < \lambda \end{cases}\]

Suppose

\[\begin{aligned} &h : \mathbb R^n \to \mathbb R \\\\ &h(x) = \sum^m _ {\ell = 1} \vert \vert y _ \ell \vert \vert \end{aligned}\]

where $n = km$ for some $k, m \in \mathbb N$ and the inputs $x$ have block structure $x = (y _ 1, \ldots, y _ m)$.

@Prove that

\[\text{prox} _ {\lambda h}(x) _ {y _ {\ell} } = \begin{cases} (1 - \frac{\lambda}{ \vert \vert y _ \ell \vert \vert })y _ \ell &\text{if } \vert \vert y _ \ell \vert \vert \ge \lambda \\\\ 0 &\text{if } \vert \vert y _ \ell \vert \vert < \lambda \end{cases}\]

where $\text{prox} _ {\lambda h}(x) _ {y _ \ell}$ denote the coordinates of $\text{prox} _ {\lambda h}(x)$ in the $\ell$th block.

(you may assume results about the proximal maps of simpler regularisers)


We have

\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ u \text{ } E(u)\]

where

\[E(u) := \sum^m _ {\ell = 1} \frac{1}{2} \vert \vert u - y _ \ell \vert \vert ^2 + \vert \vert y _ \ell \vert \vert\]

Hence it suffices to minimise each block separately. But minimising each block corresponds to the proximal map applied to the $2$-norm. So by the result

If $h(x) := \vert \vert x \vert \vert _ 2$. Then

\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \max\left(0, \left( 1 - \frac{\lambda}{ \vert \vert x \vert \vert _ 2} \right)\right) x\]

it follows

\[\text{prox} _ {\lambda h}(x) _ {y _ {\ell} } = \text{prox} _ {\lambda h _ {y _ \ell} }(y _ \ell) = \begin{cases} (1 - \frac{\lambda}{ \vert \vert y _ \ell \vert \vert })y _ \ell &\text{if } \vert \vert y _ \ell \vert \vert \ge \lambda \\\\ 0 &\text{if } \vert \vert y _ \ell \vert \vert < \lambda \end{cases}\]
$M _ {\lambda h}$ for $h$ piecewise linear

Suppose

\[\begin{aligned} &h : \mathbb R \to \mathbb R \\\\ &h(x) = \begin{cases} \frac 1 2 x &\text{if } x \le 0 \\\\ 2x &\text{if } x > 0 \end{cases} \end{aligned}\]

@Prove that then

\[M _ {\lambda, h}(x) = \begin{cases} \frac 1 2 x - \frac \lambda 8 &\text{if } x \le \frac \lambda 2 \\\\ \frac{x^2}{2\lambda} &\text{if } \frac{\lambda}{2} < x < 2\lambda \\\\ 2x - 2\lambda &\text{if }x \ge 2\lambda \end{cases}\]

We have

\[M _ {\lambda, h}(x) = \inf _ u E(u)\]

where

\[\begin{aligned} E(u) &= \frac 1 {2\lambda} (u - x)^2 + h(u) \\\\ &= \begin{cases} \frac{1}{2\lambda} (u - x)^2 + \frac 1 2 u &\text{if } x \le 0 \\\\ \frac{1}{2\lambda} (u - x)^2 + 2u &\text{if } x > 0 \end{cases} \end{aligned}\]

Case $u \le 0$: Then

\[\frac{\text d}{\text du} \left[ \frac{1}{2\lambda} (u - x)^2 + \frac 1 2 u \right] = \frac 1 \lambda (u - x) + \frac 1 2\]

which implies $u = x - \frac{\lambda}{2}$. Hence $u \le 0$ implies $x \le \frac \lambda 2$ and for this value $E(u) = \frac 1 2 x - \frac \lambda 8$ (plug it in).

Case $u > 0$: Then

\[\frac{\text d}{\text du} \left[ \frac{1}{2\lambda} (u - x)^2 + \frac 1 2 u \right] = \frac 1 \lambda (u - x) + 2\]

which implies $u = x - 2\lambda$. Hence $u > 0$ implies $x > 2 \lambda$ and for this value $E(u) = 2x - 2\lambda$.

Missing case: We still need to determine what happens when $\frac \lambda 2 < x < 2\lambda$. In this case, the value of the objective is given by the value at the boundary, i.e. when $u = 0$. Then $E(0) = \frac{x^2}{2\lambda}$.

Overall: Then

\[M _ {\lambda, h}(x) = \begin{cases} \frac 1 2 x - \frac \lambda 8 &\text{if } x \le \frac \lambda 2 \\\\ \frac{x^2}{2\lambda} &\text{if } \frac{\lambda}{2} < x < 2\lambda \\\\ 2x - 2\lambda &\text{if }x \ge 2\lambda \end{cases}\]
$\text{prox}$ and $G _ \alpha$ for $h$ weighted 1-norm plus indicator function




Related posts