Notes - Optimisation for Data Science HT25, Proximal methods


Flashcards

Basic definitions

Suppose:

  • $h : \mathbb R^n \to \mathbb R$ is proper convex
  • $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
  • $\lambda > 0$

@Define the Moreau envelope of $h$.


\[M _ {\lambda, h} : \mathbb R^n \to \mathbb R\]

defined by

\[M _ {\lambda, h}(x) := \frac{1}{\lambda} \inf _ u \left(\lambda h(u) + \frac 1 2||u-x||^2\right)\]
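To make the definition concrete, the infimum can be approximated numerically over a fine grid. A minimal NumPy sketch in one dimension (the function `moreau_envelope_1d`, the choice $h = \vert\cdot\vert$, and the grid are all illustrative):

```python
import numpy as np

def moreau_envelope_1d(h, x, lam, grid):
    # M_{lam,h}(x) = (1/lam) * inf_u ( lam*h(u) + 0.5*(u - x)^2 ),
    # with the infimum approximated over a fine grid of u values.
    return (lam * h(grid) + 0.5 * (grid - x) ** 2).min() / lam

grid = np.linspace(-5.0, 5.0, 100_001)
# For h = |.| the envelope is the Huber function, so M(2.0) = 2 - 0.5/2 = 1.75.
print(moreau_envelope_1d(np.abs, 2.0, 0.5, grid))  # ~1.75
```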

Suppose:

  • $h : \mathbb R^n \to \mathbb R$ is proper convex
  • $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
  • $\lambda > 0$

@Define the proximal operator $\text{prox} _ {\lambda h}$ in this context.


\[\text{prox} _ {\lambda h} : \mathbb R^n \to \mathbb R^n\]

defined by

\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2}||u-x||^2\right)\]
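The argmin can likewise be approximated on a grid for simple one-dimensional $h$; a hedged sketch (again $h = \vert\cdot\vert$ is only an illustration, anticipating the soft-thresholding example below):

```python
import numpy as np

def prox_1d(h, x, lam, grid):
    # prox_{lam h}(x) = argmin_u ( lam*h(u) + 0.5*(u - x)^2 ),
    # approximated by the grid point with the smallest objective value.
    values = lam * h(grid) + 0.5 * (grid - x) ** 2
    return grid[values.argmin()]

grid = np.linspace(-5.0, 5.0, 100_001)
print(prox_1d(np.abs, 2.0, 0.5, grid))  # ~1.5, matching soft-thresholding
```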

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

@Define the gradient map $G _ \alpha$ in this context.


\[G _ \alpha : \mathbb R^n \to \mathbb R^n\]

where

\[G _ \alpha(x) = \frac{1}{\alpha}(x - \text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]
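A direct transcription, assuming the caller supplies $\nabla f$ and a routine for $\text{prox} _ {\alpha \lambda \psi}$ (the names below are placeholders):

```python
import numpy as np

def gradient_map(x, alpha, grad_f, prox):
    # G_alpha(x) = (1/alpha) * ( x - prox_{alpha*lam*psi}( x - alpha*grad_f(x) ) );
    # `prox` is assumed to already incorporate the factor alpha*lam.
    return (x - prox(x - alpha * grad_f(x))) / alpha
```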

Examples of the proximal operator

Suppose $h(x) := 0$, i.e. it is identically zero. What is $\text{prox} _ {\lambda h}$?


\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ {u} \frac{1}{2} ||u - x||^2 \\\\ &= x \end{aligned}\]

@example~

Suppose $h(x) := I _ \Omega (x)$ where $\Omega$ is closed and convex. What is $\text{prox} _ {\lambda h}$?


\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ {u} \text{ } \lambda I _ \Omega(u) + \frac{1}{2} ||u - x||^2 \\\\ &= \text{argmin} _ {u \in \Omega} ||u - x||^2 \end{aligned}\]

which is the projection of $x$ onto $\Omega$.
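For instance, when $\Omega$ is a box the projection is a componentwise clip; a minimal sketch (the box $[lo, hi]^n$ is an illustrative choice):

```python
import numpy as np

def prox_indicator_box(x, lo, hi):
    # prox of lam * I_Omega for Omega = [lo, hi]^n: the projection onto the
    # box, computed componentwise; note that lam plays no role here.
    return np.clip(x, lo, hi)
```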

@example~

Suppose $h(x) := \vert \vert x \vert \vert _ 1$. What is $\text{prox} _ {\lambda h}$?


Defined component-wise, it is:

\[(\text{prox} _ {\lambda ||\cdot|| _ 1}(x)) _ i = \begin{cases} x _ i - \lambda &\text{if } x _ i > \lambda \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i < -\lambda \end{cases}\]
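This is soft-thresholding; a minimal NumPy sketch of the componentwise formula:

```python
import numpy as np

def soft_threshold(x, lam):
    # Componentwise prox of lam * ||.||_1: shrink each entry towards zero
    # by lam, mapping the interval [-lam, lam] exactly to zero.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```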

Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$. What is $\text{prox} _ {\lambda h}$?


Defined component-wise, it is:

\[(\text{prox} _ {\lambda ||\cdot|| _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } |x _ i| > \sqrt{2\lambda} \\\\ 0 &\text{if } |x _ i| < \sqrt{2\lambda} \end{cases}\]

with both $x _ i$ and $0$ attaining the minimum when $|x _ i| = \sqrt{2\lambda}$, so the prox is set-valued there. (Note $||\cdot|| _ 0$ is not convex, so this example sits outside the convex setting above.)
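This is hard-thresholding; a minimal sketch (the boundary case $|x _ i| = \sqrt{2\lambda}$ is resolved here by keeping zero, one of the two minimisers):

```python
import numpy as np

def hard_threshold(x, lam):
    # Componentwise prox of lam * ||.||_0: keep entries with
    # |x_i| > sqrt(2*lam), zero out the rest.
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)
```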

Properties

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed
  • $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$

@State a property of $\text{prox} _ {\lambda h}$ that relates $\text{prox} _ {\lambda h}$ to the stationary points of $\phi$.


\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]

and

\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its finiteness.


\[M _ {\lambda, h}(x) < \infty\]

for all $x$, even if $h(x) = \infty$: since $h$ is proper, there is some $u _ 0$ with $h(u _ 0) < \infty$, and then $M _ {\lambda, h}(x) \le h(u _ 0) + \frac{1}{2\lambda} ||u _ 0 - x||^2 < \infty$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its convexity.


\[M _ {\lambda, h}(x)\]

is convex in $x$.

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates to its differentiability.


$M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient

\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]
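This can be spot-checked for $h = \vert\cdot\vert$, where both sides are available in closed form (the envelope is the Huber function and the prox is soft-thresholding); a hedged finite-difference sketch:

```python
import numpy as np

def envelope(x, lam):
    # Moreau envelope of h = |.|: the Huber function.
    return x**2 / (2 * lam) if abs(x) <= lam else abs(x) - lam / 2

lam, x, eps = 0.5, 2.0, 1e-6
prox = np.sign(x) * max(abs(x) - lam, 0.0)          # soft-thresholding
fd = (envelope(x + eps, lam) - envelope(x - eps, lam)) / (2 * eps)
print(fd, (x - prox) / lam)                         # both ~1.0
```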

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
  • $x \in \mathbb R^n$ is fixed

@State a property of $M _ {\lambda, h}$ that relates its minima to the minima of $h$.


\[x^\ast \in \text{argmin} _ x h(x) \iff x^\ast \in \text{argmin} _ x M _ {\lambda, h}(x)\]

Suppose:

  • $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function

@State a property of $\text{prox} _ {\lambda h}$ that relates to its smoothness.


\[|| \text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y) || \le ||x - y||\]

for all $x, y \in \mathbb R^n$; that is, $\text{prox} _ {\lambda h}$ is nonexpansive (1-Lipschitz).
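A quick numeric spot-check of nonexpansiveness using the $\ell _ 1$ prox (soft-thresholding) at random pairs of points:

```python
import numpy as np

st = lambda v, lam: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # ||prox(x) - prox(y)|| <= ||x - y|| should hold for every pair.
    assert np.linalg.norm(st(x, 0.7) - st(y, 0.7)) <= np.linalg.norm(x - y) + 1e-12
```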

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map

@State a property of $G _ \alpha$ that relates it to the derivative of $f$.

\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map

@Prove that then:

\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

Note that

\[\begin{aligned} x - \alpha G _ \alpha(x) &= \text{prox} _ {\alpha \lambda \psi} (x - \alpha \nabla f(x)) \\\\ &= \text{argmin} _ u \text{ } \Xi(u) \end{aligned}\]

where

\[\Xi(u) := \alpha \lambda \psi(u) + \frac 1 2 ||u - (x - \alpha \nabla f(x))||^2\]

is a convex function.

Then

\[0 \in \partial \Xi(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]

but since

\[\begin{aligned} \partial \Xi(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x))) &= \partial \Xi(x - \alpha G _ \alpha(x)) \\\\ &= \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + (x - \alpha G _ \alpha(x)) - (x - \alpha \nabla f(x)) \end{aligned}\]

it follows that

\[\alpha G _ \alpha(x) \in \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + \alpha \nabla f(x)\]

Dividing through by $\alpha > 0$ gives the result.

Regularised optimisation

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$

What is the standard template used for iterative algorithms in this context? Give this in three ways:

  • First in terms of two separate steps,
  • then in one step using the proximal operator,
  • and finally using the gradient map $G _ \alpha$.

As a two-step iterative process:

\[\begin{aligned} y^{k+1} &= x^k - \alpha _ k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin} _ {z} \text{ } \frac 1 2 ||z - y^{k+1}||^2 + \alpha _ k \lambda \psi(z) \end{aligned}\]

In one step using $\text{prox}$:

\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]

Using the gradient map $G _ \alpha$:

\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k}(x^k)\]
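A minimal sketch of this template for the lasso, $f(x) = \frac 1 2 ||Ax - b||^2$ and $\psi = ||\cdot|| _ 1$, with constant step $\alpha _ k = 1/L$ where $L = ||A|| _ 2^2$ (the problem data and all names are illustrative):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_gradient_lasso(A, b, lam, iters=500):
    # min_x 0.5*||Ax - b||^2 + lam*||x||_1 via the prox-gradient template.
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        y = x - alpha * A.T @ (A @ x - b)        # gradient step
        x = soft_threshold(y, alpha * lam)       # prox step
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((40, 100)), rng.standard_normal(40)
print(prox_gradient_lasso(A, b, lam=0.5))
```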

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
  • $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$ using the iterative step $x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))$

If $\psi(x) = I _ \Omega(x)$ where $\Omega$ is a closed convex set (so that $\phi(x) = f(x) + \lambda I _ \Omega(x)$), can you derive the corresponding update and interpret the result?


\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda I _ \Omega}(x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin} _ {z \in \Omega} \text{ } \frac{1}{2} ||z - (x^k - \alpha _ k\nabla f(x^k))||^2 \end{aligned}\]

which is the orthogonal projection of the steepest descent update onto the feasible set.
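A minimal sketch of the resulting projected gradient method, using the box $\Omega = [0, 1]^n$ purely as an illustrative feasible set:

```python
import numpy as np

def projected_gradient(grad_f, x0, alpha, iters=200):
    # Prox-gradient with psi = I_Omega: gradient step, then project onto
    # Omega = [0, 1]^n (the projection here is a componentwise clip).
    x = x0.copy()
    for _ in range(iters):
        x = np.clip(x - alpha * grad_f(x), 0.0, 1.0)
    return x
```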

@example~

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed

@State the foundational inequality of the prox-method in this context, first in full, and then @prove that it generalises the foundational inequality of steepest descent.

In general: For all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]

Generalises the foundational inequality: Note that one step in the prox method is

\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k} (x^k)\]

Substituting this into the above inequality, it reduces to

\[\phi(x^{k+1}) \le \phi(z) + G _ {\alpha _ k}(x^k)^\top (x^k - z) - \frac{\alpha _ k}{2} ||G _ {\alpha _ k}(x^k)||^2\]

Letting $z = x^k$, this becomes:

\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 ||G _ {\alpha _ k}(x^k)||^2\]

Compare this to the foundational inequality of gradient descent:

\[f(x^{k+1}) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]

which is recovered when $\psi = 0$ (so that $G _ {\alpha _ k} = \nabla f$) and $\alpha _ k = 1/L$.

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed

@Prove the foundational inequality of the prox-method in this context, i.e. that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]

As $f$ is $L$-smooth, we have

\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac L 2 ||y - x||^2\]

for all $y \in \mathbb R^n$. Letting $y = x - \alpha G _ \alpha(x)$, we obtain

\[\begin{aligned} f(x - \alpha G _ \alpha(x)) &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac{L\alpha^2}{2} ||G _ \alpha(x)||^2 \\\\ &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 \end{aligned}\]

where the last inequality follows from the assumption that $\alpha \in (0, \frac 1 L]$.

The convexity of $f$ implies that

\[f(z) \ge f(x) + \nabla f(x)^\top (z - x)\]

By the result that

\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]

we have

\[\lambda \psi(z) \ge \lambda \psi(x - \alpha G _ \alpha(x)) + (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x)))\]

Combining these results

\[\begin{aligned} \phi(x - \alpha G _ \alpha(x)) &= f(x - \alpha G _ \alpha(x)) + \lambda \psi(x - \alpha G _ \alpha(x)) && (1\star) \\\\ &\le f(x) - \alpha G _ \alpha (x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (2\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (3\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x))) &&(4\star)\\\\ &= f(z) + \lambda \psi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2 &&(5\star)\\\\ &= \phi(z) + G _ \alpha(x)^\top (x - z) - \frac{\alpha}{2}||G _ \alpha(x)||^2 \end{aligned}\]

where:

  • $(1\star)$: Follows from the definition of $\phi$
  • $(2\star)$: Follows from the $L$-smoothness bound above
  • $(3\star)$: Follows from the convexity of $f$
  • $(4\star)$: Follows from the subgradient inequality for $\psi$ above
  • $(5\star)$: Follows by collecting terms

Convergence theory

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed
  • We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
  • The problem has minimiser $x^\ast$

@State a result about the convergence in this case.


\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]

for all $k \ge 1$.
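The bound can be spot-checked numerically; a hedged sketch on a small lasso instance, with $x^\ast$ approximated by a long run of the same iteration (all problem data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((40, 100)), rng.standard_normal(40), 0.5
L = np.linalg.norm(A, 2) ** 2
st = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
phi = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
step = lambda x: st(x - A.T @ (A @ x - b) / L, lam / L)

x_star = np.zeros(100)
for _ in range(20_000):                 # long run to approximate x*
    x_star = step(x_star)

x = np.zeros(100)                       # x0 = 0, so ||x0 - x*|| = ||x*||
for k in range(1, 51):
    x = step(x)
    bound = L / (2 * k) * np.linalg.norm(x_star) ** 2
    assert phi(x) - phi(x_star) <= bound + 1e-8
```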

Suppose:

  • We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
  • $G _ \alpha$ is the gradient map
  • $f$ is $L$-smooth and convex
  • $\psi$ is convex and closed
  • We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
  • The problem has minimiser $x^\ast$

@Prove that then

\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]

for all $k \ge 1$

(you may use previous results).


By a previous result, we have that

\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 ||G _ {\alpha _ k}(x^k)||^2\]

and therefore $(\phi(x^k)) _ {k \in \mathbb N}$ is monotone decreasing.

Furthermore, using the result that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,

\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]

with $z = x^\ast$ gives

\[\begin{aligned} 0 &\le \phi(x^{k+1}) - \phi(x^\ast) \\\\ &\le G _ {\alpha _ k}(x^k)^\top (x^k - x^\ast) - \frac{\alpha _ k}{2} ||G _ {\alpha _ k}(x^k)||^2 \\\\ &= \frac{1}{2\alpha _ k} \left( ||x^k - x^\ast||^2 - ||x^k - x^\ast - \alpha _ k G _ {\alpha _ k}(x^k)||^2 \right) \\\\ &= \frac{1}{2\alpha _ k} \left( ||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2 \right) \end{aligned}\]

Therefore $( \vert \vert x^k - x^\ast \vert \vert ) _ {k \in \mathbb N}$ is also monotone decreasing.

Combining this together, we have

\[\begin{aligned} T \times (\phi(x^T) - \phi(x^\ast)) &\le \sum^{T-1} _ {k = 0} (\phi(x^{k+1}) - \phi(x^\ast)) \\\\ &\le \frac L 2 \sum^{T-1} _ {k = 0}(||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2) \\\\ &\le \frac L 2 ||x^0 - x^\ast||^2 \end{aligned}\]

where the first inequality uses that $(\phi(x^k))$ is monotone decreasing and the last follows because the sum telescopes. Dividing both sides by $T$ gives the result.


