Notes - Optimisation for Data Science HT25, Proximal methods
Flashcards
Basic definitions
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is proper convex
- $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
- $\lambda > 0$
@Define the Moreau envelope of $h$.
The Moreau envelope of $h$ is defined by
\[M _ {\lambda, h}(x) := \frac{1}{\lambda} \inf _ u \left(\lambda h(u) + \frac 1 2||u-x||^2\right)\]
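As an unofficial numerical illustration: for $h = |\cdot|$ in one dimension, the Moreau envelope can be evaluated by brute-force minimisation over a grid and compared against its known closed form (a Huber-type function). A minimal Python/numpy sketch; the grid, $\lambda$ and test points are arbitrary choices:

```python
import numpy as np

def moreau_envelope_abs(x, lam, grid=np.linspace(-10, 10, 200001)):
    """Numerically evaluate M_{lam,|.|}(x) = (1/lam) * min_u (lam*|u| + 0.5*(u - x)^2)."""
    vals = lam * np.abs(grid) + 0.5 * (grid - x) ** 2
    return vals.min() / lam

def moreau_abs_closed_form(x, lam):
    """Known closed form of the envelope of |.| (a Huber-type function)."""
    return x ** 2 / (2 * lam) if abs(x) <= lam else abs(x) - lam / 2

lam = 0.5
for x in [-3.0, -0.2, 0.0, 0.4, 2.5]:
    print(x, moreau_envelope_abs(x, lam), moreau_abs_closed_form(x, lam))
```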
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is proper convex
- $h$ is closed (all sublevel sets $\{x : h(x) \le \alpha\}$ are closed)
- $\lambda > 0$
@Define the proximal operator $\text{prox} _ {\lambda h}$ in this context.
The proximal operator is defined by
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2}||u-x||^2\right)\]
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
@Define the gradient map $G _ \alpha$ in this context.
The gradient map is defined by
\[G _ \alpha(x) = \frac{1}{\alpha}(x - \text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]
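An unofficial numerical sketch of this definition for a lasso-type problem $f(x) = \frac 1 2 ||Ax - b||^2$, $\psi = ||\cdot|| _ 1$, using the soft-thresholding prox derived in the examples below; numpy is assumed and the data, seed and helper names are arbitrary:

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise prox of t*||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def gradient_map(x, alpha, lam, A, b):
    """G_alpha(x) = (x - prox_{alpha*lam*psi}(x - alpha*grad f(x))) / alpha
    for f(x) = 0.5*||Ax - b||^2 and psi = ||.||_1."""
    grad_f = A.T @ (A @ x - b)
    return (x - soft_threshold(x - alpha * grad_f, alpha * lam)) / alpha

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = rng.standard_normal(5)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # alpha = 1/L with L = ||A||_2^2
print(gradient_map(x, alpha, 0.1, A, b))
```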
Examples of the proximal operator
Suppose $h(x) := 0$, i.e. it is identically zero. What is $\text{prox} _ {\lambda h}$?
It is the identity map: $\text{prox} _ {\lambda h}(x) = \text{argmin} _ u \frac 1 2 ||u - x||^2 = x$.
@example~
Suppose $h(x) := I _ \Omega (x)$ where $\Omega$ is closed and convex. What is $\text{prox} _ {\lambda h}$?
In this case
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u \in \Omega} \frac 1 2 ||u - x||^2\]which is the projection of $x$ onto $\Omega$ (independent of $\lambda$).
@example~
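A minimal sketch (not from the notes) of the special case where $\Omega$ is a box, in which the projection, and hence the prox, is componentwise clipping regardless of $\lambda$; assumes numpy:

```python
import numpy as np

def prox_box_indicator(x, lower, upper):
    """prox_{lam * I_[lower, upper]}(x) = projection onto the box (independent of lam)."""
    return np.clip(x, lower, upper)

x = np.array([-2.0, 0.3, 5.0])
print(prox_box_indicator(x, -1.0, 1.0))   # -> [-1.   0.3  1. ]
```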
Suppose $h(x) := \vert \vert x \vert \vert _ 1$. What is $\text{prox} _ {\lambda h}$?
Defined component-wise, it is:
\[(\text{prox} _ {\lambda ||\cdot|| _ 1}(x)) _ i = \begin{cases} x _ i - \lambda &\text{if } x _ i > \lambda \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i < -\lambda \end{cases}\]
Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$. What is $\text{prox} _ {\lambda h}$?
Defined component-wise, it is:
\[(\text{prox} _ {\lambda ||\cdot|| _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } |x _ i| > \sqrt{2\lambda} \\\\ 0 &\text{if } |x _ i| < \sqrt{2\lambda} \end{cases}\](when $|x _ i| = \sqrt{2\lambda}$, both $0$ and $x _ i$ are minimisers, so the prox is set-valued there).
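The following sketch (unofficial; assumes numpy) implements both thresholding rules and checks the soft-thresholding formula against a brute-force minimisation of $\lambda |u| + \frac 1 2 (u - x)^2$ over a grid; the grid, $\lambda$ and test points are arbitrary:

```python
import numpy as np

def prox_l1(x, lam):
    """Soft thresholding: componentwise prox of lam*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    """Hard thresholding: componentwise prox of lam*||.||_0 (ties at |x_i| = sqrt(2*lam) sent to 0 here)."""
    return np.where(np.abs(x) > np.sqrt(2 * lam), x, 0.0)

# Brute-force check of the soft-thresholding rule in one dimension.
lam = 0.7
grid = np.linspace(-10, 10, 400001)
for x in [-2.0, -0.3, 0.0, 0.5, 3.1]:
    numerical = grid[np.argmin(lam * np.abs(grid) + 0.5 * (grid - x) ** 2)]
    print(x, prox_l1(np.array([x]), lam)[0], numerical)
```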
Properties
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
- $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$
@State a property of $\text{prox} _ {\lambda h}$ that relates $\text{prox} _ {\lambda h}$ to the stationary points of $\phi$.
$\text{prox} _ {\lambda h}(x)$ is the unique minimiser of $\phi$, and
\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its finiteness.
$M _ {\lambda, h}(x)$ is finite for all $x$, even if $h(x) = \infty$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its convexity.
$M _ {\lambda, h}(x)$ is convex in $x$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its differentiability.
$M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient
\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]
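A quick finite-difference check of this gradient formula for $h = |\cdot|$ in one dimension (unofficial; assumes numpy; the step size and test points are arbitrary and chosen away from the kinks at $\pm\lambda$):

```python
import numpy as np

def prox_l1(x, lam):
    """Soft thresholding: prox of lam*|.| in one dimension."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_l1(x, lam):
    """M_{lam,|.|}(x), evaluated at its minimiser u = prox_{lam|.|}(x)."""
    u = prox_l1(x, lam)
    return (lam * np.abs(u) + 0.5 * (u - x) ** 2) / lam

lam, h = 0.5, 1e-6
for x in [-2.0, -0.1, 0.3, 1.7]:
    finite_diff = (moreau_l1(x + h, lam) - moreau_l1(x - h, lam)) / (2 * h)
    formula = (x - prox_l1(x, lam)) / lam
    print(x, finite_diff, formula)
```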
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates its minima to the minima of $h$.
$x^\ast$ minimises $M _ {\lambda, h}$ if and only if $x^\ast$ minimises $h$; the two functions have the same minimisers and the same minimum value.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
@State a property of $\text{prox} _ {\lambda h}$ that relates to its smoothness.
$\text{prox} _ {\lambda h}$ is nonexpansive (Lipschitz continuous with constant $1$):
\[||\text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y)|| \le ||x - y||\]for all $x, y \in \mathbb R^n$.
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
@State a property of $G _ \alpha$ that relates it to the derivative of $f$.
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
@Prove that then:
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]
Note that
\[\begin{aligned} x - \alpha G _ \alpha(x) &= \text{prox} _ {\alpha \lambda \psi} (x - \alpha \nabla f(x)) \\\\ &= \text{argmin} _ u \text{ } \Xi(u) \end{aligned}\]where
\[\Xi(u) := \alpha \lambda \psi(u) + \frac 1 2 ||u - (x - \alpha \nabla f(x))||^2\]is a convex function.
Then
\[0 \in \partial \Xi(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]but since
\[\begin{aligned} \partial \Xi(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x))) &= \partial \Xi(x - \alpha G _ \alpha(x)) \\\\ &= \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + (x - \alpha G _ \alpha(x)) - (x - \alpha \nabla f(x)) \end{aligned}\]it follows that
\[\alpha G _ \alpha(x) \in \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + \alpha \nabla f(x)\]Dividing through by $\alpha$ gives the result.
Regularised optimisation
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
What is the standard template used for iterative algorithms in this context? Give this in three ways:
- First in terms of two separate steps,
- then in one step using the proximal operator,
- and finally using the gradient map $G _ \alpha$.
As a two-step iterative process:
\[\begin{aligned} y^{k+1} &= x^k - \alpha _ k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin} _ {z} \text{ } \frac 1 2 ||z - y^{k+1}||^2 + \alpha _ k \lambda \psi(z) \end{aligned}\]In one step using $\text{prox}$:
\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]Using the gradient map $G _ \alpha$:
\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k}(x^k)\]
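The sketch below (unofficial; assumes numpy) runs this template, i.e. the proximal-gradient / ISTA iteration, on a small randomly generated lasso problem $f(x) = \frac 1 2 ||Ax - b||^2$, $\psi = ||\cdot|| _ 1$, with the constant step $\alpha _ k = 1/L$:

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_lasso(A, b, lam, iters=500):
    """x^{k+1} = prox_{alpha*lam*||.||_1}(x^k - alpha*grad f(x^k)) with alpha = 1/L."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
print(prox_gradient_lasso(A, b, lam=1.0))
```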
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$ using the iterative step $x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))$
If $\psi(x) = I _ \Omega(x)$ where $\Omega$ is a closed convex set, can you derive the corresponding update and interpret the result?
Since $\text{prox} _ {\alpha _ k \lambda I _ \Omega}(y) = \text{argmin} _ {u \in \Omega} \frac 1 2 ||u - y||^2 = P _ \Omega(y)$, the update becomes
\[x^{k+1} = P _ \Omega(x^k - \alpha _ k \nabla f(x^k))\]which is the orthogonal projection of the steepest descent update onto the feasible set (projected gradient descent).
@example~
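An unofficial sketch of this special case (projected gradient descent), with $f(x) = \frac 1 2 ||Ax - b||^2$ and $\Omega$ the box $[-1, 1]^n$; assumes numpy, and the data is arbitrary:

```python
import numpy as np

def projected_gradient(A, b, lower=-1.0, upper=1.0, iters=500):
    """x^{k+1} = P_Omega(x^k - alpha*grad f(x^k)) for the box Omega = [lower, upper]^n."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # alpha = 1/L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = np.clip(x - alpha * grad, lower, upper)   # prox of the indicator = projection
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 6))
b = 5.0 * rng.standard_normal(30)
print(projected_gradient(A, b))
```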
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
@State the foundational inequality of the prox-method in this context, first in full, and then show how it reduces to a generalisation of the foundational inequality of steepest descent.
In general: For all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]
Generalises the foundational inequality: Note that one step in the prox method is
\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k} (x^k)\]
Substituting $x = x^k$ into the above inequality, it reduces to
\[\phi(x^{k+1}) \le \phi(z) + G _ {\alpha _ k}(x^k)^\top (x^k - z) - \frac{\alpha _ k}{2} ||G _ {\alpha _ k}(x^k)||^2\]
Letting $z = x^k$, this becomes:
\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 ||G _ {\alpha _ k}(x^k)||^2\]
Compare this to the foundational inequality of gradient descent:
\[f(x^{k+1}) \le f(x^k) - \frac 1 {2L} ||\nabla f(x^k)||^2\]
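An unofficial numerical check of the reduced inequality $\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k}{2}||G _ {\alpha _ k}(x^k)||^2$ along prox-gradient iterations on a small lasso instance; assumes numpy, and the data and seed are arbitrary:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(4)
A, b, lam = rng.standard_normal((30, 6)), rng.standard_normal(30), 0.3
L = np.linalg.norm(A, 2) ** 2
alpha = 1.0 / L
phi = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = rng.standard_normal(6)
for k in range(20):
    grad = A.T @ (A @ x - b)
    G = (x - soft_threshold(x - alpha * grad, alpha * lam)) / alpha   # gradient map
    x_next = x - alpha * G
    # Check phi(x^{k+1}) <= phi(x^k) - (alpha/2) * ||G_alpha(x^k)||^2 (up to rounding).
    print(phi(x_next) <= phi(x) - 0.5 * alpha * np.sum(G ** 2) + 1e-12)
    x = x_next
```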
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
@Prove the foundational inequality of the prox-method in this context, i.e. that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]
As $f$ is $L$-smooth, we have
\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac L 2 ||y - x||^2\]for all $y \in \mathbb R^n$. Letting $y = x - \alpha G _ \alpha(x)$, we obtain
\[\begin{aligned} f(x - \alpha G _ \alpha(x)) &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac{L\alpha^2}{2} ||G _ \alpha(x)||^2 \\\\ &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 \end{aligned}\]where the last inequality follows from the assumption that $\alpha \in (0, \frac 1 L]$.
The convexity of $f$ implies that
\[f(z) \ge f(x) + \nabla f(x)^\top (z - x)\]By the result that
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]we have
\[\lambda \psi(z) \ge \lambda \psi(x - \alpha G _ \alpha(x)) + (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x)))\]Combining these results
\[\begin{aligned} \phi(x - \alpha G _ \alpha(x)) &= f(x - \alpha G _ \alpha(x)) + \lambda \psi(x - \alpha G _ \alpha(x)) && (1\star) \\\\ &\le f(x) - \alpha G _ \alpha (x)^\top \nabla f(x) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (2\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (3\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 ||G _ \alpha(x)||^2 + \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x))) &&(4\star)\\\\ &= f(z) + \lambda \psi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2 &&(5\star)\\\\ &= \phi(z) + G _ \alpha(x)^\top (x - z) - \frac{\alpha}{2}||G _ \alpha(x)||^2 \end{aligned}\]where:
- $(1\star)$: Follows from the definition of $\phi$
- $(2\star)$: Follows from the $L$-smoothness bound above
- $(3\star)$: Follows from the convexity of $f$
- $(4\star)$: Follows from the subgradient inequality for $\lambda \psi$ above
- $(5\star)$: Follows by collecting terms (the $\nabla f(x)$ terms cancel)
Convergence theory
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
- We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
- The model has minimiser $x^\ast$
@State a result about the convergence in this case.
Then
\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]for all $k \ge 1$.
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
- We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
- The model has minimiser $x^\ast$
@Prove that then
\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} ||x^0 - x^\ast||^2\]
for all $k \ge 1$
(you may use previous results).
By a previous result, we have that
\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 ||G _ {\alpha _ k}(x^k)||^2\]and therefore $(\phi(x^k)) _ {k \in \mathbb N}$ is monotone decreasing.
Furthermore, using the result that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 ||G _ \alpha(x)||^2\]with $z = x^\ast$ gives
\[\begin{aligned} 0 &\le \phi(x^{k+1}) - \phi(x^\ast) \\\\ &\le G _ {\alpha _ k}(x^k)^\top (x^k - x^\ast) - \frac{\alpha _ k}{2} ||G _ {\alpha _ k}(x^k)||^2 \\\\ &= \frac{1}{2\alpha _ k} \left( ||x^k - x^\ast||^2 - ||x^k - x^\ast - \alpha _ k G _ {\alpha _ k}(x^k)||^2 \right) \\\\ &= \frac{1}{2\alpha _ k} \left( ||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2 \right) \end{aligned}\]Therefore $( \vert \vert x^k - x^\ast \vert \vert ) _ {k \in \mathbb N}$ is also monotone decreasing.
Combining these results and using $\alpha _ k = 1/L$, we have
\[\begin{aligned} T \times (\phi(x^T) - \phi(x^\ast)) &\le \sum^{T-1} _ {k = 0} (\phi(x^{k+1}) - \phi(x^\ast)) \\\\ &\le \frac L 2 \sum^{T-1} _ {k = 0}(||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2) \\\\ &\le \frac L 2 ||x^0 - x^\ast||^2 \end{aligned}\]where the first inequality uses that $(\phi(x^k)) _ {k \in \mathbb N}$ is monotone decreasing; dividing both sides by $T$ gives the result.
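An unofficial sanity check of this $O(1/k)$ rate (assumes numpy): run prox-gradient on a small lasso instance, take a late iterate as a stand-in for $x^\ast$, and compare $\phi(x^k) - \phi(x^\ast)$ against the bound $\frac L {2k} ||x^0 - x^\ast||^2$; the data, seed and iteration counts are arbitrary:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
A, b, lam = rng.standard_normal((40, 8)), rng.standard_normal(40), 0.5
L = np.linalg.norm(A, 2) ** 2
phi = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Run prox-gradient with alpha = 1/L and record objective values; history[k-1] = phi(x^k).
x0 = np.zeros(8)
x, history = x0.copy(), []
for _ in range(5000):
    x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    history.append(phi(x))

x_star = x                               # late iterate as a proxy for the minimiser
for k in [1, 10, 100, 1000]:
    gap = history[k - 1] - phi(x_star)
    bound = L / (2 * k) * np.sum((x0 - x_star) ** 2)
    print(k, gap <= bound + 1e-12, gap, bound)
```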