Optimisation for Data Science HT25, Proximal method
Why the proximal method?
One way to motivate the proximal method is to see it as a generalisation of projected gradient descent. When using projected gradient descent, we wish to solve the optimisation problem
\[\min _ {x \in \Omega} f(x)\]where $\Omega$ is some convex set. To do this, we use gradient descent but project every iterate onto the feasible region:
\[x^{k+1} = \text{proj} _ \Omega(x^k - \alpha _ k \nabla f(x^k))\]We may rewrite this as
\[x^{k+1} = \text{argmin} _ {z \in \Omega} ||z - (x^k - \alpha _ k \nabla f(x^k))||^2\]and then convert this into an unconstrained optimisation problem
\[x^{k+1} = \text{argmin} _ z \Big[ || z - (x^k - \alpha _ k \nabla f(x^k)) ||^2 + I _ \Omega(z) \Big]\]You can show that this method has reasonable convergence properties ($O(\sqrt k)$ in the non-convex objective case). The proximal method is what you get when you generalise to other non-smooth but still convex penalty terms, other than $I _ \Omega$:
\[\begin{aligned} x^{k+1} &= \text{argmin} _ z \Big[ \frac{1}{2} ||z - (x^k - \alpha _ k \nabla f(x^k))||^2 + \lambda \alpha _ k \psi(z) \Big] \\\\ &= \text{prox} _ {\alpha _ k \lambda \psi}(x^k - \alpha _ k \nabla f(x^k)) \end{aligned}\]So $\text{prox}$ can be seen as a generalised projection operation.
Flashcards
Basic definitions
Suppose:
- $h : \mathbb R^n \to \mathbb R$ is proper convex
- $h$ is closed (all sublevel sets $\{x : f(x) \le \alpha\}$ are closed)
- $\lambda > 0$
@Define the Moreau envelope of $h$.
Suppose:
- $h : \mathbb R^n \to \mathbb R$ is proper convex
- $h$ is closed (all sublevel sets $\{x : f(x) \le \alpha\}$ are closed)
- $\lambda > 0$
In this context, define the proximal operator $\text{prox} _ {\lambda h}$.
defined by
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2} \vert \vert u-x \vert \vert ^2\right)\]Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\sigma : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
@Define the gradient map $G _ \alpha$ in this context.
where
\[G _ \alpha(x) = \frac{1}{\alpha}(x - \text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]Properties
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
- $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$
@State a property of $\text{prox} _ {\lambda h}$ that relates $\text{prox} _ {\lambda h}$ to the stationary points of $\phi$.
and
\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]i.e. $\text{prox} _ {\lambda h}(x)$ is a minimiser of $\phi$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
- $\phi(u) := \lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2$
@Prove that
\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox}
_ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]
and
\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]
Note
\[\partial \phi(u) = \lambda \partial h(u) + u - x\]By letting $u = \text{prox} _ {\lambda h}(x)$, we have
\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]as required.
Since
\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ u \left[\lambda h(u) + \frac 1 2 \vert \vert u - x \vert \vert ^2\right] \\\\ &=\text{argmin} _ u [\varphi(u)] \end{aligned}\]and $\varphi$ is convex (it is the sum of two convex functions), $0 \in \partial(\varphi(u))$ iff $u$ minimises $\varphi$. Since $\text{prox} _ {\lambda h}(x)$ is precisely the minimiser,
\[0 \in \partial \varphi(\text{prox} _ {\lambda h}(x))\]Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its finiteness.
for all $x$ even if $h(x) = \infty$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its convexity.
is convex in $x$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates to its differentiability and give its derivative.
$M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient
\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]@Prove that if
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
then $M _ {\lambda, h}(x)$ is differentiable in $x$ with gradient
\[\nabla M _ {\lambda, h}(x) = \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x))\]
(you may assume $h$ is differentiable at $u = \text{prox} _ {\lambda h}(x)$ and that $\text{prox} _ {\lambda h}$ is differentiable at $x$).
Recall
\[M _ {\lambda, h}(x) := \frac{1}{\lambda} \inf _ u \left(\lambda h(u) + \frac 1 2 \vert \vert u-x \vert \vert ^2\right)\]We write
\[M _ {\lambda, h}(x) := h(\text{prox} _ {\lambda h}(x)) + \frac 1 {2\lambda} \vert \vert \text{prox} _ {\lambda h}(x) - x \vert \vert ^2\] \[\begin{aligned} \frac{\text d}{\text dx} \big( M _ {\lambda , h}(x) \big) &= \frac{\partial}{\partial u} \big [ M _ {\lambda, h}(x) \big] \frac{\text du}{\text dx} + \frac{\partial}{\partial x}\big[ M _ {\lambda, h}(x) \big] &(1\star) \\\\ &= \frac{\partial}{\partial u} \left[h(u) + \frac{1}{2\lambda} \vert \vert u - x \vert \vert ^2\right] \frac{\text d}{\text dx} (\text{prox} _ {\lambda h}(x)) + \frac{\partial}{\partial x} \left[\frac 1 {2\lambda} \vert \vert u - x \vert \vert ^2\right] \\\\ &= \frac{\partial}{\partial x} \left[\frac{1}{2\lambda} \vert \vert u - x \vert \vert ^2\right] &(2\star) \\\\ &= \frac 1 \lambda ( x - u ) \\\\ &= \frac 1 \lambda (x - \text{prox} _ {\lambda h}(x)) \end{aligned}\]where:
- $(1\star)$: Follows from the chain rule applied to the function $g(x, u(x))$ where $g(x, u) = h(u) + \frac 1 {2\lambda} \vert \vert u - x \vert \vert ^2$.
- $(2\star)$: Follows from the fact that $u$ is a minimiser of $h(u) - \frac{1}{2\lambda} \vert \vert u-x \vert \vert ^2$ and so that term is zero.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
- $x \in \mathbb R^n$ is fixed
@State a property of $M _ {\lambda, h}$ that relates its minima to the minima of $h$.
Suppose:
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
@State a property of $\text{prox} _ {\lambda h}$ that relates to its smoothness.
for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).
@Prove that if
- $h : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex and closed function
then
\[\vert \vert \text{prox} _ {\lambda h}(x) - \text{prox} _ {\lambda h}(y) \vert \vert \le \vert \vert x - y \vert \vert\]
for all $x, y \in \mathbb R^n$ (i.e. it is $1$-Lipschitz).
By the result that says:
\[\partial \phi(\text{prox} _ {\lambda h}(x)) = \lambda \partial h(\text{prox} _ {\lambda h}(x)) + \text{prox} _ {\lambda h}(x) - x\]
\[0 \in \partial \phi(\text{prox} _ {\lambda h}(x))\]and
In the following, write $p$ for $\text{prox} _ {\lambda h}$. Then
Note that $x - p(x) \in \lambda \partial h(p(x))$. Then
\[\lambda h(u) \ge \lambda h(p(x)) + (x - p(x))^\top (u - p(x))\]for all $x$ and $u$, by properties of the subdifferential. In particular, by taking $u = p(y)$ we have
\[\lambda h(p(y)) \ge \lambda h(p(x)) + (x - p(x))^\top (p(y) - p(x))\]for all $x$ and $y$. Likewise,
\[\lambda h(p(x)) \ge \lambda h(p(y)) + (y - p(y))^\top (p(x) - p(y))\]Summing these inequalities and simplifying, we have
\[\begin{aligned} &0 \ge (x - p(x) - y + p(y))^\top (p(y) - p(x)) \\\\ \implies& 0 \ge ((x - y) - (p(x) - p(y)))^\top (p(y) - p(x)) \\\\ \implies& 0 \ge (x - y)^\top(p(y) - p(x)) - (p(x) - p(y))^\top (p(y) - p(x)) \\\\ \implies& 0 \ge -(x-y)^\top (p(x) - p(y)) +(p(x) - p(y))^\top(p(x) - p(y)) \\\\ \implies& (x - y)^\top (p(x) - p(y)) \ge \vert \vert p(x) - p(y) \vert \vert ^2 \\\\ \end{aligned}\]Hence by Cauchy-Schwarz and dividing through,
\[\vert \vert p(x) - p(y) \vert \vert \le \vert \vert x - y \vert \vert\]as required.
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
@State a property of $G _ \alpha$ that relates it to the derivative of $f$.
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
@Prove that then:
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]
Note that
\[\begin{aligned} x - \alpha G _ \alpha(x) &= x - \alpha \cdot \frac 1 \alpha (x - \text{prox} _ {a\lambda \psi}(x - \alpha \nabla f(x))) \\\\ &= \text{prox} _ {\alpha \lambda \psi} (x - \alpha \nabla f(x)) \\\\ &= \text{argmin} _ u \text{ } E(u) \end{aligned}\]where
\[E(u) := \alpha \lambda \psi(u) + \frac 1 2 \vert \vert u - (x - \alpha \nabla f(x)) \vert \vert ^2\]is a convex function.
Then
\[0 \in \partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x)))\]but since
\[\begin{aligned} &\partial E(\text{prox} _ {\alpha \lambda \psi}(x - \alpha \nabla f(x))) \\\\ =& \partial E(x - \alpha G _ \alpha(x)) \\\\ =& \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + (x - \alpha G _ \alpha(x)) - (x - \alpha \nabla f(x)) \end{aligned}\]it follows that
\[\alpha G _ \alpha(x) \in \alpha \lambda \partial \psi(x - \alpha G _ \alpha(x)) + \alpha \nabla f(x)\]Regularised optimisation
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\sigma : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
What is the standard template used for iterative algorithms in this context? Give this in three ways:
- First in terms of two separate steps,
- then in one step using the proximal operator,
- and finally using the gradient map $G _ \alpha$.
As a two-step iterative process:
\[\begin{aligned} y^{k+1} &= x^k - \alpha _ k \nabla f(x^k) \\\\ x^{k+1} &= \text{argmin} _ {z} \text{ } \vert \vert z - y^{k+1} \vert \vert ^2 + \alpha _ k \lambda \psi(z) \end{aligned}\]In one step using $\text{prox}$:
\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]Using the gradient map $G _ \alpha$:
\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k}(x^k)\]Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\sigma : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
The proximal method solves this problem by using the iterative step
\[x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))\]
where
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ {u} \left(\lambda h(u) + \frac{1}{2} \vert \vert u-x \vert \vert ^2\right)\]
Motivate this update in two different ways:
- First as a generalisation of the projected gradient descent, then
- As solving a sequence of simpler optimisation problems.
Generalisation of projected gradient descent:
In the projected gradient descent method, we want to solve the optimisation problem
\[\min _ {x \in \Omega} f(x)\]where $\Omega$ is a convex set. We do this by performing gradient descent steps as normal and then projecting onto $\Omega$:
\[\begin{aligned} x^{k+1} &= y^k - \alpha _ k \nabla f(y^k) \\\\ y^{k+1} &= \text{argmin} _ {z \in \mathbb \Omega} \frac 1 2 \vert \vert z - x^{k+1} \vert \vert ^2 \end{aligned}\]which we could rewrite as.
\[x^{k+1} = \text{prox} _ {\alpha _ kI _ \Omega}(x^k - \alpha _ k \nabla f(x^k))\]The proximal method is just what you get when you consider more general “projection operators” $\text{prox} _ {\alpha _ k \lambda \psi}$ where $\psi$ is a convex function.
As solving a sequence of simpler optimisation problems:
We may rewrite the update as so:
\[\begin{aligned} x^{k+1} &= \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k)) \\\\ &= \text{argmin} _ {z} \left[\frac{1}{2} \vert \vert z - (x^k - \alpha _ k \nabla f(x^k) \vert \vert ^2 + \lambda \alpha _ k \psi(z)\right] \\\\ &= \text{argmin} _ z \left[ f(x^k) + \nabla f(x^k)^\top(z - x^k) + \frac{1}{2\alpha _ k} \vert \vert z - x^k \vert \vert ^2 + \lambda \alpha _ k \psi(x) \right] \end{aligned}\]Note that
\[f(x^k) + \nabla f(x^k)^\top (z - x^k) + \frac{1}{2\alpha _ k} \vert \vert z - x^k \vert \vert ^2\]is a quadratic upper bound of $f$. So the proximal method can be interpreted as solving a sequence of simpler optimisation problems for which the minimum is an upper bound on the minimum of $f$.
Convergence theory
Foundational inequality / Overestimation property
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
@State the foundational inequality of the prox-method in this context, first in full and then @prove how it reduces to generalise the foundational inequality of steepest descent.
In general: For all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2\]Generalises the foundational inequality: Note that one step in the prox method is
\[x^{k+1} = x^k - \alpha _ k G _ {\alpha _ k} (x^k)\]Substituting this into the above equation, it reduces to
\[\phi(x^{k+1}) \le \phi(z) + G _ {\alpha _ k}(x^k)^\top (x^k - z) - \frac{\alpha _ k}{2} \vert \vert G _ {\alpha _ k}(x^k) \vert \vert ^2\]Letting $z = x^k$, this becomes:
\[\phi(x^{k+1}) \le \phi(x^k) - \frac{\alpha _ k} 2 \vert \vert G _ {\alpha _ k}(x^k) \vert \vert ^2\]Compare this to the foundational inequality of gradient descent:
\[f(x^{k+1}) \le f(x^k) - \frac 1 {2L} \vert \vert \nabla f(x^k) \vert \vert ^2\]@important~ (overestimation property)
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
@Prove the foundational inequality of the prox-method in this context, i.e. that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$,
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2\]
We require 3 sub-results.
$L$-smoothness upper bound (A): As $f$ is $L$-smooth, we have
\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac L 2 \vert \vert y - x \vert \vert ^2\]for all $y \in \mathbb R^n$. Letting $y = x - \alpha G _ \alpha(x)$, we obtain
\[\begin{aligned} f(x - \alpha G _ \alpha(x)) &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac{L\alpha^2}{2} \vert \vert G _ \alpha(x) \vert \vert ^2 \\\\ &\le f(x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 \end{aligned}\]where the last inequality follows from the assumption that $\alpha \in (0, \frac 1 L]$.
Convexity of $f$ upper bound (B): The convexity of $f$ implies that
\[f(z) \ge f(x) + \nabla f(x)^\top (z - x)\]and hence
\[f(x) \le f(z) - \nabla f(x)^\top (z - x)\]Convexity of $\psi$ upper bound (C): By a previous result, we have that
\[G _ \alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G _ \alpha(x))\]or equivalently,
\[G _ \alpha(x) - \nabla f(x) \in \lambda \partial \psi(x - \alpha G _ \alpha(x))\]Thus by the definition of a $G _ \alpha(x) - \nabla f(x)$ being a subgradient of $\lambda \psi$ at $x - \alpha G _ \alpha(x)$:
\[\lambda \psi(z) \ge \lambda \psi(x - \alpha G _ \alpha (x)) + (G _ \alpha(x) - \nabla f(x))^\top (z-(x - \alpha G _ \alpha(x))) \\\\\]and hence
\[\lambda \psi(x - \alpha G _ \alpha(x)) \le \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top (z - (x - \alpha G _ \alpha (x)))\]Combining these results:
\[\begin{aligned} &\phi(x - \alpha G _ \alpha(x))\\\\ &= f(x - \alpha G _ \alpha(x)) + \lambda \psi(x - \alpha G _ \alpha(x)) && (1\star) \\\\ &\le f(x) - \alpha G _ \alpha (x)^\top \nabla f(x) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (2\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x) - \alpha G _ \alpha(x)^\top \nabla f(x) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 + \lambda \psi(x - \alpha G _ \alpha(x)) && (3\star) \\\\ &\le f(z) - \nabla f(x)^\top (z - x + \alpha G _ \alpha(x)) + \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 + \lambda \psi(z) - (G _ \alpha(x) - \nabla f(x))^\top(z - (x - \alpha G _ \alpha(x))) &&(4\star)\\\\ &= f(z) + \lambda \psi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2 &&(5\star)\\\\ &= \phi(z) + G _ \alpha(x)^\top (x - z) - \frac{\alpha}{2} \vert \vert G _ \alpha(x) \vert \vert ^2 && (6\star) \end{aligned}\]where:
- $(1\star)$: Follows from the definition of $\phi$.
- $(2\star)$: Follows from applying the $L$-smoothness upper bound to expand $f(x - \alpha G _ \alpha(x))$ (A).
- $(3\star)$: Follows from expanding $f(x)$ using the convexity upper bound (B).
- $(4\star)$: Follows from expanding $\lambda \psi(x - \alpha G _ \alpha(x))$ using the convexity upper bound (C).
- $(5\star)$: This is just algebra, noticing that the $\nabla f(x)^\top (z - x + \alpha G _ \alpha(x))$ terms cancel.
- $(6\star)$: Follows from the definition of $\phi$.
@important~ (overestimation property)
Global convergence
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
- We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
- The model has minimiser $x^\ast$
@State a result about the convergence in this case.
for all $k \ge 1$.
Suppose:
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$
- $G _ \alpha$ is the gradient map
- $f$ is $L$-smooth and convex
- $\psi$ is convex and closed
- We use the prox-gradient algorithm with constant step length $\alpha _ k = 1/L$ and initial point $x^0$
- The model has minimiser $x^\ast$
@Prove that then
\[\phi(x^k) - \phi(x^\ast) \le \frac L {2k} \vert \vert x^0 - x^\ast \vert \vert ^2\]
for all $k \ge 1$.
We have the overestimation result that for all $\alpha \in (0, \frac 1 L]$ and $z \in \mathbb R^n$
\[\phi(x - \alpha G _ \alpha(x)) \le \phi(z) + G _ \alpha(x)^\top (x - z) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2\]With $x = x^k$, $z = x^\ast$ and $\alpha = 1/L$, this gives
\[\begin{aligned} \phi(x^{k+1}) &\le \phi(x^\ast) + G _ {\alpha _ k}(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} \vert \vert G _ {\alpha _ k}(x^k) \vert \vert ^2 \\\\ &= \phi(x^\ast) + \frac{L}{2} \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^k - x^\ast - \alpha _ k G _ {\alpha _ k}(x^k) \vert \vert ^2 \right) \\\\ &= \phi(x^\ast) + \frac{L}{2} \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2 \right) \end{aligned}\]Therefore
\[\begin{aligned} \sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) &\le \frac L {2} \sum^{T-1} _ {k = 0} \left( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2 \right) \\\\ &= \frac L {2} ( \vert \vert x^0 - x^\ast \vert \vert ^2 - \vert \vert x^T - x^\ast \vert \vert ^2) \\\\ &\le \frac L {2} \vert \vert x^0 - x^\ast \vert \vert ^2 \end{aligned}\]Note also that the sequence of iterates is decreasing, since if $x$ and $z$ are both $x^{k}$, we have $\phi(x^{k+1}) \le \phi(x^k) - \frac \alpha 2 \vert \vert G _ \alpha(x) \vert \vert ^2$. Therefore
\[\begin{aligned} \phi(x^T) - \phi(x^\ast) &= \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{T}) - \phi(x^\ast) \right) \\\\ &\le \frac{1}{T} \sum^{T-1} _ {k = 0} \left( \phi(x^{k+1}) - \phi(x^\ast) \right) \\\\ &\le \frac{L}{2T} \vert \vert x^0 - x^\ast \vert \vert ^2 \end{aligned}\]Computations of Moreau envelopes, proximal operators and gradient maps
Suppose you’re asked to calculate $\text{prox} _ {\lambda h}(x)$. Then you have
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ u \text{ } E(u)\]where
\[E(u) = \frac{1}{2}||u-x||^2 _ 2 + \lambda h(u)\]In general, it’s helpful to proceed like so.
- See if $E$ is separable. If $h = \vert \vert \cdot \vert \vert _ 1$ then you can write $E(u) = \sum^n _ {i = 1} \frac 1 2 (u _ i - x _ i)^2 _ 2 + \lambda \vert u _ i \vert $ and hence it suffices to minimise each summand separately. This also applies to if $E$ splits into blocks of variables.
- Calculate $\partial E(u)$. This will likely split into several different cases depending on the value of $u$.
- Assume you are in each particular case, and solve for $u$.
- Combine all these cases together.
For simple functions like $h = 0$ or $h = I _ \Omega$, this full derivation is not needed.
$\text{prox}$ for $h = 0$
Suppose $h(x) := 0$, i.e. it is identically zero. What is $\text{prox} _ {\lambda h}$?
@example~ @justify~
$\text{prox}$ for $h = I _ {\Omega}$, indicator function on closed and convex set
Suppose $h(x) := I _ \Omega (x)$ where $\Omega$ is closed and convex. What is $\text{prox} _ {\lambda h}$?
which is the projection of $x$ onto $\Omega$.
@example~ @justify~
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is $L$-smooth and convex
- $\psi : \mathbb R^n \to \mathbb R$ is convex and closed
- We want to solve the regularised optimisation problem $\min _ {x \in \mathbb R^n} \phi(x) = f(x) + \lambda \psi(x)$ using the iterative step $x^{k+1} = \text{prox} _ {\alpha _ k \lambda \psi} (x^k - \alpha _ k \nabla f(x^k))$
Consider the case where
\[\phi(x) := f(x) + \lambda I _ \Omega(x)\]
and where $\Omega$ is a convex set. Derive the expression for $x^{k+1}$ and interpret the results.
which is the orthogonal projection of the steepest descent update onto the feasible set.
@example~
$\text{prox}$ for $h = \vert \vert \cdot \vert \vert _ 1$
Suppose $h(x) := \vert \vert x \vert \vert _ 1$. What is $\text{prox} _ {\lambda h}$?
Defined component-wise, it is:
\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1}(x)) _ i = \begin{cases} x _ i - \lambda &\text{if } x _ i > \lambda \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i < -\lambda \end{cases}\]@Prove that
\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1}(x)) _ i = \begin{cases}
x _ i - \lambda &\text{if } x _ i > \lambda \\\\
0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\
x _ i + \lambda &\text{if } x _ i < -\lambda
\end{cases}\]
Note that
\[\begin{aligned} \text{prox} _ {\lambda h}(x) &= \text{argmin} _ u \left\lbrace \frac 1 2 \vert \vert u - x \vert \vert ^2 _ 2 + \lambda \vert \vert u \vert \vert _ 1 \right\rbrace \\\\ &= \text{argmin} _ u \left\lbrace \sum^n _ {i = 1} \frac 1 2 (u _ i - x _ i)^2 + \lambda \vert u _ i \vert \right\rbrace \end{aligned}\]and hence it suffices to minimise each component separately. Let
\[E(u) := \frac 1 2 \vert \vert u - x \vert \vert ^2 _ 2 + \lambda \vert \vert u \vert \vert _ 1\]Hence
\[\partial E(u) _ i = \begin{cases} \lbrace u _ i - x _ i + \lambda \rbrace &\text{if } u _ i > 0 \\\\ \lbrace u _ i - x _ i + \ell \mid \ell \in [-\lambda, \lambda] \rbrace &\text{if } u _ i = 0 \\\\ \lbrace u _ i - x _ i - \lambda \rbrace &\text{if } u _ i < 0 \\\\ \end{cases}\]and by optimality conditions, we need $0 \in \partial E(u)$, so have
\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1} (x) = \begin{cases} x _ i - \lambda &\text{if } x _ i - \lambda > 0 \\\\ 0 &\text{if } 0 \in \lbrace -x _ i + \ell \mid \ell \in [-\lambda, \lambda]\rbrace \\\\ x _ i + \lambda &\text{if } x _ i + \lambda < 0 \end{cases}\]i.e.
\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 1} (x) = \begin{cases} x _ i - \lambda &\text{if } x _ i - \lambda > 0 \\\\ 0 &\text{if } x _ i \in [-\lambda, \lambda] \\\\ x _ i + \lambda &\text{if } x _ i + \lambda < 0 \end{cases}\]$\text{prox}$ for $h = \vert \vert \cdot \vert \vert _ 2$
Suppose $h(x) := \vert \vert x \vert \vert _ 2$. What is $\text{prox} _ {\lambda h}$?
Prove that
\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \max\left(0, \left( 1 - \frac{\lambda}{ \vert \vert x \vert \vert _ 2} \right)\right) x\]
Let
\[E(u) = \frac 1 2 \vert \vert u - x \vert \vert ^2 _ 2 + \lambda \vert \vert u \vert \vert _ 1\]We require $0 \in \partial E(u)$. Then
\[\partial E(u) = u - x + \lambda \partial( \vert \vert \cdot \vert \vert _ 2)(u)\]since
\[\partial( \vert \vert \cdot \vert \vert _ 2)(u) = \begin{cases} \lbrace u / \vert \vert u \vert \vert \rbrace &\text{if }u \ne 0 \\\\ \lbrace v \mid \vert \vert v \vert \vert \le 1 \rbrace &\text{if } u =0 \end{cases}\]So
\[\partial E(u) = \begin{cases} \lbrace u - x + \lambda \frac{u}{ \vert \vert u \vert \vert } \rbrace &\text{if } u \ne 0 \\\\ \lbrace u - x + \lambda v \mid \vert \vert v \vert \vert \le 1 \rbrace &\text{if } u = 0 \end{cases}\]Case $u \ne 0$: In this case
\[\partial E(u) = \lbrace u - x + \lambda \frac{u}{ \vert \vert u \vert \vert }\rbrace\]Then for the subgradient to contain $0$, we require $u - x + \lambda \frac{u}{ \vert \vert u \vert \vert } = 0$. Then $x = u + \lambda \frac u { \vert \vert u \vert \vert }$ which implies $ \vert \vert x \vert \vert = \vert \vert u \vert \vert (1 + \frac{\lambda}{ \vert \vert u \vert \vert }) = \vert \vert u \vert \vert + \lambda$, and hence $ \vert \vert u \vert \vert = \vert \vert x \vert \vert - \lambda$. This tells us two things: This case is only possible when $ \vert \vert x \vert \vert \ge \lambda$ (otherwise the norm would be negative), and if this is true, then: $x = u(1 + \frac{\lambda}{ \vert \vert x \vert \vert - \lambda})$ which implies
\[\begin{aligned} u &= \frac{x}{1 + \frac{\lambda}{ \vert \vert x \vert \vert - \lambda} } \\\\ &= \frac{x( \vert \vert x \vert \vert - \lambda)}{ \vert \vert x \vert \vert - \lambda + \lambda} \\\\ &= \left(1 - \frac{\lambda}{ \vert \vert x \vert \vert }\right)x \end{aligned}\]Case $u = 0$: Then we require
\[-x + \lambda v = 0\]which is feasible when $ \vert \vert x \vert \vert \le \lambda$.
Overall: Thus
\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \begin{cases} \left(1 - \frac{\lambda}{ \vert \vert x \vert \vert }\right)x &\text{if } \vert \vert x \vert \vert \ge \lambda \\\\ 0 &\text{otherwise, when } \vert \vert x \vert \vert \le \lambda \end{cases}\]$\text{prox}$ for $h = \vert \vert \cdot \vert \vert _ 0$
Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$. What is $\text{prox} _ {\lambda h}$?
Defined component-wise, it is:
\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } \vert x _ i \vert > \sqrt{2\lambda} \\\\ 0 &\text{if } \vert x _ i \vert \le \sqrt{2\lambda} \end{cases}\]Suppose $h(x) := \vert \vert x \vert \vert _ 0$ where $ \vert \vert x \vert \vert _ 0$ counts the non-zero components of $x$.
@Prove that
\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x)) _ i = \begin{cases}
x _ i &\text{if } \vert x _ i \vert > \sqrt{2\lambda} \\\\
0 &\text{if } \vert x _ i \vert \le \sqrt{2\lambda}
\end{cases}\]
We have
\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x) = \text{argmin} _ u \text{ } E(u)\]where
\[E(u) = \sum^n _ {i = 1} \frac{1}{2}(u _ i - x _ i)^2 + \lambda [u _ i \ne 0]\]Hence it suffices to minimise each summand separately. Then
\[E _ i(u) = \frac 1 2 (u _ i - x _ i)^2 + \lambda [u _ i \ne 0]\]If $u _ i \ne 0$, the cost is $E _ i(u) = \frac 1 2(u _ i - x _ i)^2 + \lambda$. Hence it is worth setting $u _ i = x _ i$ iff $\frac 1 2 x _ i^2 > \lambda$, i.e. $ \vert x _ i \vert > 2\lambda$. Otherwise, set $u _ i = 0$. Then
\[(\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 0}(x)) _ i = \begin{cases} x _ i &\text{if } \vert x _ i \vert > \sqrt{2\lambda} \\\\ 0 &\text{if } \vert x _ i \vert \le \sqrt{2\lambda} \end{cases}\]$\text{prox}$ for $h(x) = \sum \vert \vert y _ \ell \vert \vert _ 2$, sum of $2$-norms of blocks
Suppose
\[\begin{aligned}
&h : \mathbb R^n \to \mathbb R \\\\
&h(x) = \sum^m _ {\ell = 1} \vert \vert y _ \ell \vert \vert
\end{aligned}\]
where $n = km$ for some $k, m \in \mathbb N$ and the inputs $x$ have block structure $x = (y _ 1, \ldots, y _ m)$. What is $\text{prox} _ {\lambda h}$?
Let $\text{prox} _ {\lambda h}(x) _ {y _ \ell}$ denote the coordinates of $\text{prox} _ {\lambda h}(x)$ in the $\ell$th block. Then
\[\text{prox} _ {\lambda h}(x) _ {y _ {\ell} } = \begin{cases} (1 - \frac{\lambda}{ \vert \vert y _ \ell \vert \vert })y _ \ell &\text{if } \vert \vert y _ \ell \vert \vert \ge \lambda \\\\ 0 &\text{if } \vert \vert y _ \ell \vert \vert < \lambda \end{cases}\]Suppose
\[\begin{aligned}
&h : \mathbb R^n \to \mathbb R \\\\
&h(x) = \sum^m _ {\ell = 1} \vert \vert y _ \ell \vert \vert
\end{aligned}\]
where $n = km$ for some $k, m \in \mathbb N$ and the inputs $x$ have block structure $x = (y _ 1, \ldots, y _ m)$.
@Prove that
\[\text{prox} _ {\lambda h}(x) _ {y _ {\ell} } = \begin{cases}
(1 - \frac{\lambda}{ \vert \vert y _ \ell \vert \vert })y _ \ell &\text{if } \vert \vert y _ \ell \vert \vert \ge \lambda \\\\
0 &\text{if } \vert \vert y _ \ell \vert \vert < \lambda
\end{cases}\]
where $\text{prox} _ {\lambda h}(x) _ {y _ \ell}$ denote the coordinates of $\text{prox} _ {\lambda h}(x)$ in the $\ell$th block.
(you may assume results about the proximal maps of simpler regularisers)
We have
\[\text{prox} _ {\lambda h}(x) = \text{argmin} _ u \text{ } E(u)\]where
\[E(u) := \sum^m _ {\ell = 1} \frac{1}{2} \vert \vert u - y _ \ell \vert \vert ^2 + \vert \vert y _ \ell \vert \vert\]Hence it suffices to minimise each block separately. But minimising each block corresponds to the proximal map applied to the $2$-norm. So by the result
\[\text{prox} _ {\lambda \vert \vert \cdot \vert \vert _ 2}(x) = \max\left(0, \left( 1 - \frac{\lambda}{ \vert \vert x \vert \vert _ 2} \right)\right) x\]If $h(x) := \vert \vert x \vert \vert _ 2$. Then
it follows
\[\text{prox} _ {\lambda h}(x) _ {y _ {\ell} } = \text{prox} _ {\lambda h _ {y _ \ell} }(y _ \ell) = \begin{cases} (1 - \frac{\lambda}{ \vert \vert y _ \ell \vert \vert })y _ \ell &\text{if } \vert \vert y _ \ell \vert \vert \ge \lambda \\\\ 0 &\text{if } \vert \vert y _ \ell \vert \vert < \lambda \end{cases}\]$M _ {\lambda h}$ for $h$ piecewise linear
Suppose
\[\begin{aligned}
&h : \mathbb R \to \mathbb R \\\\
&h(x) = \begin{cases}
\frac 1 2 x &\text{if } x \le 0 \\\\
2x &\text{if } x > 0
\end{cases}
\end{aligned}\]
@Prove that then
\[M _ {\lambda, h}(x) = \begin{cases}
\frac 1 2 x - \frac \lambda 8 &\text{if } x \le \frac \lambda 2 \\\\
\frac{x^2}{2\lambda} &\text{if } \frac{\lambda}{2} < x < 2\lambda \\\\
2x - 2\lambda &\text{if }x \ge 2\lambda
\end{cases}\]
We have
\[M _ {\lambda, h}(x) = \inf _ u E(u)\]where
\[\begin{aligned} E(u) &= \frac 1 {2\lambda} (u - x)^2 + h(u) \\\\ &= \begin{cases} \frac{1}{2\lambda} (u - x)^2 + \frac 1 2 u &\text{if } x \le 0 \\\\ \frac{1}{2\lambda} (u - x)^2 + 2u &\text{if } x > 0 \end{cases} \end{aligned}\]Case $u \le 0$: Then
\[\frac{\text d}{\text du} \left[ \frac{1}{2\lambda} (u - x)^2 + \frac 1 2 u \right] = \frac 1 \lambda (u - x) + \frac 1 2\]which implies $u = x - \frac{\lambda}{2}$. Hence $u \le 0$ implies $x \le \frac \lambda 2$ and for this value $E(u) = \frac 1 2 x - \frac \lambda 8$ (plug it in).
Case $u > 0$: Then
\[\frac{\text d}{\text du} \left[ \frac{1}{2\lambda} (u - x)^2 + \frac 1 2 u \right] = \frac 1 \lambda (u - x) + 2\]which implies $u = x - 2\lambda$. Hence $u > 0$ implies $x > 2 \lambda$ and for this value $E(u) = 2x - 2\lambda$.
Missing case: We still need to determine what happens when $\frac \lambda 2 < x < 2\lambda$. In this case, the value of the objective is given by the value at the boundary, i.e. when $u = 0$. Then $E(0) = \frac{x^2}{2\lambda}$.
Overall: Then
\[M _ {\lambda, h}(x) = \begin{cases} \frac 1 2 x - \frac \lambda 8 &\text{if } x \le \frac \lambda 2 \\\\ \frac{x^2}{2\lambda} &\text{if } \frac{\lambda}{2} < x < 2\lambda \\\\ 2x - 2\lambda &\text{if }x \ge 2\lambda \end{cases}\]$\text{prox}$ and $G _ \alpha$ for $h$ weighted 1-norm plus indicator function
…