Notes - Optimisation for Data Science HT25, Basic definitions
Flashcards
Minimisers
Consider the optimisation model
\[\min _ {x \in \mathbb R^n} f(x) \quad\text{ subject to }x \in \mathcal F\]
@Define what is meant by:
- A local minimiser
- A strict local minimiser
- A global minimiser
$x^\ast \in \mathcal F$ is:
- A local minimiser if $f(x^\ast) \le f(x)$ for all $x \in \mathcal F \cap B _ \varepsilon (x^\ast)$ for some $\varepsilon > 0$.
- A strict local minimiser if $f(x^\ast) < f(x)$ for all $x \in \mathcal F \cap B _ \varepsilon(x^\ast)$ with $x \ne x^\ast$, for some $\varepsilon > 0$.
- A global minimiser if $f(x^\ast) \le f(x)$ for all $x \in \mathcal F$.
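An illustrative example (added, not from the course notes): take $n = 1$, $f(x) = x^3 - 3x$ and $\mathcal F = [-3, 2]$. Then $x = 1$ is a strict local minimiser (indeed $f''(1) = 6 > 0$), but it is not a global minimiser since $f(-3) = -18 < -2 = f(1)$; the global minimiser over $\mathcal F$ is $x^\ast = -3$.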
Convexity
@Define what it means for a function $f : \mathbb R^n \to \mathbb R$ to be convex.
For all $\lambda \in [0, 1]$ and $x, y \in \mathbb R^n$,
\[f((1-\lambda)x + \lambda y) \le (1 - \lambda)f(x) + \lambda f(y)\]
Suppose:
- We have a set of functions $\{f _ i(x) : \mathcal D \to \mathbb R \mid i \in \mathcal N\}$ with $\mathcal N$ a finite index set
- Each $f _ i$ is a convex function defined on a common convex domain $\mathcal D \subseteq \mathbb R^n$.
@Justify that
\[x \mapsto \max _ {i \in \mathcal N} f _ i(x)\]
is a convex function.
For all $x _ 1, x _ 2 \in \mathcal D$ and $\lambda \in [0, 1]$,
\[\begin{aligned} \max _ i f _ i(\lambda x _ 1 + (1-\lambda)x _ 2) &\le \max _ i (\lambda f _ i(x _ 1) + (1-\lambda) f _ i(x _ 2)) \\\\ &\le \lambda \max _ i f _ i(x _ 1) + (1 - \lambda)\max _ i f _ i(x _ 2) \end{aligned}\]where the first inequality uses the convexity of each $f _ i$ and the second uses $\max _ i (a _ i + b _ i) \le \max _ i a _ i + \max _ i b _ i$.
Suppose:
- $\mathcal D \subseteq \mathbb R^n$ is a convex domain
- $f : \mathcal D \to \mathbb R$ is convex
- $f$ has gradient $\nabla f(x)$ at $x \in \mathcal D$
@Justify that then the first-order Taylor approximation is a lower bounding function, i.e. for all $y \in \mathcal D$,
\[f(x) + \nabla f(x)^\top (y - x) \le f(y)\]
By definition, $f$ is convex if for all $\lambda \in [0, 1]$ and $x, y \in \mathbb R^n$,
\[f((1-\lambda)x + \lambda y) \le (1 - \lambda)f(x) + \lambda f(y)\]Since $(1-\lambda)x + \lambda y = x + \lambda(y - x)$, rearranging (for $\lambda > 0$) gives
\[f(x) + \frac{f(x + \lambda(y - x)) - f(x)}{\lambda} \le f(y)\]and taking the limit $\lambda \to 0^+$, in which the difference quotient tends to the directional derivative $\nabla f(x)^\top (y - x)$, yields the result.
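A quick sanity check (illustrative example, added): for $f(x) = x^2$ on $\mathbb R$, the claimed lower bound reads
\[x^2 + 2x(y - x) \le y^2,\]which rearranges to $0 \le (y - x)^2$ and so holds for all $x, y$.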
@Define what it means for a function $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ to be proper convex.
$f$ is convex and $f(x) < +\infty$ for at least one point $x$.
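For instance (illustrative, added): the function that is identically $+\infty$ is convex but not proper, whereas any real-valued convex function $f : \mathbb R^n \to \mathbb R$ is automatically proper.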
Suppose:
- $\mathcal D \subseteq \mathbb R^n$ is a convex domain
- $f : \mathcal D \to \mathbb R$ is a convex function
How is it possible to extend $f$ to a proper convex function on all of $\mathbb R^n$?
Set $f(x) := +\infty$ for $x \notin \mathcal D$.
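An illustrative example (added): $f(x) = -\log x$ is convex on $\mathcal D = (0, \infty)$, and setting $f(x) := +\infty$ for $x \le 0$ gives a proper convex function on all of $\mathbb R$.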
Suppose $\Omega \subset \mathbb R^n$. @Define the indicator function $I _ \Omega$, and state a result about its convexity.
$I _ \Omega(x) := 0$ if $x \in \Omega$ and $I _ \Omega(x) := +\infty$ otherwise. $\Omega \ne \emptyset$ is a convex set iff $I _ \Omega$ is a proper convex function.
Suppose we have the convex optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{subject to }x \in \mathcal F\]
where $\mathcal F$ is a convex set. How can you convert it into an equivalent unconstrained convex optimisation problem?
Consider
\[\min _ {x \in \mathbb R^n} f(x) + I _ \mathcal F(x)\]
Smoothness and strong convexity
@Define what it means for a function $f : \mathbb R^n \to \mathbb R$ to be strongly convex with modulus of convexity $\gamma > 0$.
For all $\lambda \in [0, 1]$ and $x, y \in \mathbb R^n$:
\[f((1 - \lambda)x + \lambda y) \le (1 - \lambda) f(x) + \lambda f(y) - \frac \gamma 2 \lambda (1 - \lambda)||x - y||^2\]
@Define what it means for a function $f : \mathbb R^n \to \mathbb R$ to be $L$-smooth.
It is differentiable everywhere with $L$-Lipschitz continuous gradient, i.e. for all $x, y \in \mathbb R^n$
\[||\nabla f(x) - \nabla f(y) || \le L||x - y||\]
Subgradients
Suppose $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex function. @Define a subgradient $v$ of $f$ at $x$ and the subdifferential $\partial f(x)$.
- $v \in \mathbb R^n$ is a subgradient of $f$ at $x$ if for all $y \in \mathbb R^n$, $f(y) \ge f(x) + v^\top (y - x)$.
- $\partial f(x)$ is the set of all subgradients of $f$ at $x$.
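A standard illustrative example (added): for $f(x) = |x|$ on $\mathbb R$, $\partial f(0) = [-1, 1]$, since $|y| \ge v y$ for all $y \in \mathbb R$ exactly when $|v| \le 1$; for $x \ne 0$, $\partial f(x) = \{\mathrm{sign}(x)\}$.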
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is a proper convex function
- $a \in \partial f(x)$
- $b \in \partial f(y)$
@Prove that then:
- $(a - b)^\top (x - y) \ge 0$
@todo, after problem sheet one.
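A possible two-line argument (sketch, added; may differ from the intended solution): by the definition of subgradients, $f(y) \ge f(x) + a^\top (y - x)$ and $f(x) \ge f(y) + b^\top (x - y)$; adding the two inequalities and cancelling $f(x) + f(y)$ gives $0 \ge (a - b)^\top (y - x)$, i.e. $(a - b)^\top (x - y) \ge 0$.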
Suppose $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex function. Consider the optimisation model
\[\min _ {x \in \mathbb R^n} f(x) \quad\text{ subject to }x \in \mathcal F\]
Recall that $x^\ast \in \mathcal F$ is:
- A local minimiser if $f(x^\ast) \le f(x)$ for all $x \in \mathcal F \cap B _ \varepsilon (x^\ast)$ for some $\varepsilon > 0$.
- A strict local minimiser if $f(x^\ast) < f(x)$ for all $x \in \mathcal F \cap B _ \varepsilon(x^\ast)$ with $x \ne x^\ast$, for some $\varepsilon > 0$.
- A global minimiser if $f(x^\ast) \le f(x)$ for all $x \in \mathcal F$.
@State three results about the minimisation of proper convex functions.
(1): $x^\ast$ is a local minimiser iff it is a global minimiser.
(2): The set of minimisers
\[\text{argmin} _ {x \in \mathbb R^n} f(x) := \\{x^\ast \mid f(x^\ast) = \min _ {x \in \mathbb R^n} f(x)\\}\]is a convex set.
(3): $x^\ast \in \text{argmin} _ {x \in \mathbb R^n} f(x)$ if and only if $\pmb 0 \in \partial f(x^\ast)$.
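An illustrative instance of (3) (added): for $f(x) = |x|$, we have $\pmb 0 \in \partial f(0) = [-1, 1]$, consistent with $x^\ast = 0$ being the global minimiser.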
Suppose $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is a proper convex function. Consider the optimisation model
\[\min _ {x \in \mathbb R^n} f(x) \quad\text{ subject to }x \in \mathcal F\]
Recall that $x^\ast \in \mathcal F$ is:
- A local minimiser if $f(x^\ast) \le f(x)$ for all $x \in \mathcal F \cap B _ \varepsilon (x^\ast)$ for some $\varepsilon > 0$.
- A strict local minimiser if $f(x^\ast) < f(x)$ for all $x \in \mathcal F \cap B _ \varepsilon(x^\ast)$ with $x \ne x^\ast$, for some $\varepsilon > 0$.
- A global minimiser if $f(x^\ast) \le f(x)$ for all $x \in \mathcal F$.
@Prove the following three results about the minimisation of proper convex functions:
- $x^\ast$ is a local minimiser iff it is a global minimiser.
- The set of minimisers $\text{argmin} _ {x \in \mathbb R^n} f(x) := \{x^\ast \mid f(x^\ast) = \min _ {x \in \mathbb R^n} f(x)\}$ is a convex set.
- $x^\ast \in \text{argmin} _ {x \in \mathbb R^n} f(x)$ if and only if $\pmb 0 \in \partial f(x^\ast)$.
@todo, after sheet one.
Suppose $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is proper convex. @State a result that effectively reduces the subdifferential to the ordinary gradient when $f$ is differentiable, and vice versa.
If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$. Likewise, if $\partial f(x) = \{v\}$ for some $v$, then $f$ is differentiable at $x$.
Suppose $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$ is proper convex. @Prove that:
- If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$
- If $\partial f(x) = \{v\}$ for some $v$, then $f$ is differentiable at $x$.
$f$ differentiable at $x$ implies $\partial f(x) = \{\nabla f(x)\}$:
By convexity, for all $x, y \in \mathbb R^n$ and $t \in [0, 1]$,
\[\begin{aligned} & f((1-t)x + ty) \le (1 - t)f(x) + t f(y) \\\\ \implies& \frac{f(x + t(y - x)) - f(x)}{t} \le f(y) - f(x) \\\\ \end{aligned}\]Taking the limit as $t \to 0^+$, we see
\[\nabla f(x)^\top (y - x) \le f(y) - f(x)\]and hence $\nabla f(x) \in \partial f(x)$.
For the reverse inclusion, suppose that $\gamma \in \partial f(x)$. Then by the definition of subgradients (taking $y = x - tw$ with $t > 0$ and $w \in \mathbb R^n$),
\[\begin{aligned} \gamma^\top ( -w) &\le \frac{f(x - tw) - f(x)}{t} \\ &\stackrel{t \to 0^+}\to \nabla f(x)^\top (-w) \end{aligned}\]But also (taking $y = x + tw$),
\[\begin{aligned} \gamma^\top w &\le \frac{f(x + tw) - f(x)}{t} \\ &\stackrel{t \to 0^+}\to \nabla f(x)^\top w \end{aligned}\]and hence
\[\nabla f(x)^\top w \le \gamma^\top w \le \nabla f(x)^\top w\]Since this holds for all $w \in \mathbb R^n$, it follows that
\[\gamma = \nabla f(x)\]
$\partial f(x) = \{v\}$ for some $v$ implies $f$ is differentiable at $x$:
@todo
Connections between strong convexity and smoothness
Suppose:
- $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$
- $f$ is $\gamma$-strongly convex
- $f$ is differentiable at $x$
@State a result which relates the Taylor approximation of $f$ to $\gamma$.
For all $y \in \mathbb R^n$,
\[\frac \gamma 2 ||y - x||^2 \le f(y) - f(x) - \nabla f(x)^\top ( y - x)\]
Suppose:
- $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$
- $f$ is $\gamma$-strongly convex
- $f$ is differentiable at $x$
@Prove that then for all $y \in \mathbb R^n$,
\[\frac \gamma 2 ||y - x||^2 \le f(y) - f(x) - \nabla f(x)^\top ( y - x)\]
@todo, after sheet 1.
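An illustrative check rather than the requested proof (added): for $f(x) = \frac \gamma 2 ||x||^2$, which is $\gamma$-strongly convex, the bound holds with equality, since
\[f(y) - f(x) - \nabla f(x)^\top (y - x) = \frac \gamma 2 ||y||^2 - \frac \gamma 2 ||x||^2 - \gamma x^\top (y - x) = \frac \gamma 2 ||y - x||^2.\]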
Suppose:
- $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$
- $f$ is $L$-smooth (and not necessarily convex)
- $x \in \mathbb R^n$
@State a result which relates the Taylor approximation of $f$ to $L$.
For all $y \in \mathbb R^n$,
\[f(y) - f(x) - \nabla f(x)^\top ( y - x) \le \frac L 2 ||y - x||^2\]
Suppose:
- $f : \mathbb R^n \to \mathbb R \cup \{+\infty\}$
- $f$ is $L$-smooth (and not necessarily convex)
- $x \in \mathbb R^n$
@Prove that then for all $y \in \mathbb R^n$,
\[f(y) - f(x) - \nabla f(x)^\top ( y - x) \le \frac L 2 ||y - x||^2\]
@todo, after sheet one.
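An illustrative check rather than the requested proof (added): for $f(x) = \frac 1 2 x^\top A x$ with $A$ symmetric, $\nabla f(x) = Ax$, so $||\nabla f(x) - \nabla f(y)|| = ||A(x - y)|| \le ||A||\,||x - y||$ and $f$ is $L$-smooth with $L = ||A||$ (the spectral norm). Moreover
\[f(y) - f(x) - \nabla f(x)^\top (y - x) = \frac 1 2 (y - x)^\top A (y - x) \le \frac L 2 ||y - x||^2,\]as the result predicts.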
Suppose:
- $f : \mathbb R^n \to \mathbb R$
- $f$ is $\gamma$-strongly convex
- $f$ is $L$-smooth
@State a result which relates $\gamma$ and $L$.
$\gamma \le L$.
Suppose:
- $f : \mathbb R^n \to \mathbb R$
- $f$ is $\gamma$-strongly convex
- $f$ is $L$-smooth
@Justify (appealing to other results) that then $\gamma \le L$.
By other results,
\[\frac \gamma 2 ||y - x||^2 \le f(y) - f(x) - \nabla f(x)^\top ( y - x) \le \frac L 2 ||y - x||^2\]for all $y \in \mathbb R^n$; taking any $y \ne x$ and dividing by $\frac 1 2 ||y - x||^2$ gives $\gamma \le L$.
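For example (illustrative, added): $f(x) = \frac 1 2 x^\top A x$ with symmetric $A \succ 0$ is $\lambda _ {\min}(A)$-strongly convex and $\lambda _ {\max}(A)$-smooth, and indeed $\lambda _ {\min}(A) \le \lambda _ {\max}(A)$.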
Convergence rates
Suppose:
- $(\phi _ k) _ {k \in \mathbb N} \in \mathbb R _ +^\mathbb N$
- $\phi _ k \to 0$ as $k \to \infty$.
@Define what it means for $(\phi _ k) _ {k \in \mathbb N}$ to converge:
- Sublinearly
- $R$-linearly
- $Q$-linearly
- $R$-superlinearly
- $Q$-superlinearly
- $Q$-quadratically
and state the implications between these convergence rates.
- Sublinearly: $(\phi _ k)$ converges to $0$, but not $R$-linearly.
- $R$-linearly: $\exists \sigma \in (0, 1)$ and $C > 0$ such that $\forall k \in \mathbb N$, $\phi _ k \le C(1 - \sigma)^k$.
- $Q$-linearly: $\exists \sigma \in (0, 1)$ such that $\forall k \in \mathbb N$, $\frac{\phi _ {k+1} }{\phi _ k} \le 1 - \sigma$.
- $Q$-superlinearly: $\lim _ {k \to \infty} \frac{\phi _ {k+1} }{\phi _ k} = 0$
- $R$-superlinearly: $\exists (\nu _ k) _ {k \in \mathbb N} \in \mathbb R _ +^{\mathbb N}$ such that $\nu _ k \to 0$ $Q$-superlinearly and $\forall k \in \mathbb N$, $\phi _ k \le \nu _ k$
- $Q$-quadratically: $\exists C > 0$ such that $\forall k \in \mathbb N$, $\frac{\phi _ {k+1} }{\phi _ k^2} \le C$.
All the following implications are strictly one-directional:
- $R$-linearity is implied by $Q$-linearity,
- $Q$-linearity is implied by $R$-superlinearity,
- $R$-superlinearity is implied by $Q$-superlinearity,
- $Q$-superlinearity is implied by $Q$-quadraticity.
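Illustrative examples of each rate (added, not from the notes): $\phi _ k = \frac{1}{k + 1}$ converges sublinearly; $\phi _ k = 2^{-k}$ converges $Q$-linearly (ratio $\frac 1 2$); $\phi _ k = 2^{-k^2}$ converges $Q$-superlinearly (ratio $2^{-(2k+1)} \to 0$); and $\phi _ k = 2^{-2^k}$ converges $Q$-quadratically, since $\frac{\phi _ {k+1}}{\phi _ k^2} = 1$.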