Optimisation for Data Science HT25, Useful miscellany


Flashcards

Useful identity in convex convergence proofs

@State a useful identity between two vectors $g, v \in \mathbb R^n$ which is useful when proving convergence with the assumption of convexity.


\[g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 = \frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2)\]

E.g. when $g = \nabla f(x^k)$ and $v = x^k - x^\ast$, this becomes

\[\begin{aligned} \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} \vert \vert \nabla f(x^k) \vert \vert ^2 &= \frac{L}{2}( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^k - x^\ast - \frac{1}{L}\nabla f(x^k) \vert \vert ^2) \\\\ &= \frac L 2( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2) \end{aligned}\]

where the second equality uses the gradient descent update $x^{k+1} = x^k - \frac{1}{L} \nabla f(x^k)$,

which can then be used in a telescoping sum.
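As a quick sanity check, the identity can be verified numerically for arbitrary vectors (a sketch; the dimension and the constant $L$ are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 3.0                      # any positive constant
g = rng.standard_normal(5)   # plays the role of grad f(x^k)
v = rng.standard_normal(5)   # plays the role of x^k - x^*

lhs = g @ v - (g @ g) / (2 * L)
rhs = (L / 2) * (v @ v - (v - g / L) @ (v - g / L))

assert np.isclose(lhs, rhs)
```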

@Prove that for any two vectors $g, v \in \mathbb R^n$,

\[g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 = \frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2)\]

\[\begin{aligned} g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 &= \frac{L}{2} \left( \frac{2}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g \rangle \right) \\\\ &= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \frac{2}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g\rangle \right) \\\\ &= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \frac{1}{L} \langle g, v \rangle + \frac{1}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g\rangle \right) \\\\ &= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \left\langle \frac{1}{L} g, v \right\rangle + \left\langle v, \frac{1}{L} g \right\rangle - \left\langle \frac{1}{L} g, \frac{1}{L} g\right\rangle \right) \\\\ &= \frac{L}{2} \left(\langle v, v\rangle - \left\langle v - \frac 1 L g, v - \frac 1 L g \right\rangle\right) \\\\ &=\frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2) \end{aligned}\]

Useful consequence of Taylor’s theorem

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
  • $d \in \mathbb R^n$

@State a useful result about how you can rewrite $f(x + d)$.


\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
  • $d \in \mathbb R^n$

@Prove that then

\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]

Consider the function $g : [0, 1] \to \mathbb R$ defined by $g(t) = f(x + td)$. By the chain rule, $g'(t) = \nabla f(x + td)^\top d$, and by the fundamental theorem of calculus,

\[g(1) - g(0) = \int^1 _ 0 g'(t) \text dt\]

Hence

\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]
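The identity can also be spot-checked numerically (a sketch; the test function $f(x) = \sum_i \sin x_i$ and the trapezoidal quadrature are arbitrary choices):

```python
import numpy as np

def f(x):
    return np.sum(np.sin(x))

def grad_f(x):
    return np.cos(x)

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
d = rng.standard_normal(4)

# Approximate int_0^1 grad f(x + t d)^T d dt with the trapezoidal rule.
ts = np.linspace(0.0, 1.0, 2001)
integrand = np.array([grad_f(x + t * d) @ d for t in ts])
integral = np.sum((integrand[1:] + integrand[:-1]) / 2) * (ts[1] - ts[0])

assert np.isclose(f(x + d), f(x) + integral, atol=1e-5)
```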

Common inequalities

@State a bound involving $ \vert \vert a + b \vert \vert ^2$.


\[\vert \vert a + b \vert \vert ^2 \le 2( \vert \vert a \vert \vert ^2 + \vert \vert b \vert \vert ^2)\]

@State a bound involving $\sqrt{a + b}$ where $a, b > 0$.


\[\sqrt{a + b} \le \sqrt a + \sqrt b\]
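Both bounds are easy to spot-check numerically (a sketch with arbitrary random inputs):

```python
import numpy as np

rng = np.random.default_rng(2)
a_vec = rng.standard_normal(6)
b_vec = rng.standard_normal(6)

# ||a + b||^2 <= 2(||a||^2 + ||b||^2)
s = a_vec + b_vec
assert s @ s <= 2 * (a_vec @ a_vec + b_vec @ b_vec)

# sqrt(a + b) <= sqrt(a) + sqrt(b) for positive scalars
a, b = rng.uniform(0.1, 10.0, size=2)
assert np.sqrt(a + b) <= np.sqrt(a) + np.sqrt(b)
```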

Condition number of a matrix

What is the condition number of a matrix $\kappa(A)$?


\[\kappa(A) = \frac{\lambda _ \text{max}(A)}{\lambda _ \text{min}(A)}\]

for a symmetric positive definite matrix $A$. In general, $\kappa(A) = \sigma _ \text{max}(A) / \sigma _ \text{min}(A)$, the ratio of the largest to smallest singular values; the two definitions agree in the symmetric positive definite case.
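For a symmetric positive definite matrix, the eigenvalue ratio matches NumPy's condition number in the 2-norm (a sketch; the matrix here is an arbitrary random SPD example):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)   # symmetric positive definite

eigs = np.linalg.eigvalsh(A)
kappa = eigs.max() / eigs.min()

# np.linalg.cond uses sigma_max / sigma_min by default, which agrees
# with the eigenvalue ratio for SPD matrices.
assert np.isclose(kappa, np.linalg.cond(A))
```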

Multivariate chain rules

Suppose:

  • $f : \mathbb R^d \to \mathbb R^m$
  • $g : \mathbb R^m \to \mathbb R$

What is

\[\nabla (g \circ f)\]

?


\[\mathbf J(f)^\top (\nabla g \circ f)\]

e.g. $\nabla _ x ( \vert \vert u - 2x \vert \vert ^2) = -4(u - 2x)$, since here $\mathbf J(f) = -2I$ and $(\nabla g \circ f)(x) = 2(u - 2x)$.
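This example can be checked against central finite differences (a sketch; $g(y) = \vert \vert y \vert \vert ^2$, $f(x) = u - 2x$, with arbitrary random $u$ and $x$):

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.standard_normal(3)
x = rng.standard_normal(3)

def comp(z):
    r = u - 2 * z
    return r @ r          # (g o f)(z) = ||u - 2z||^2

J = -2 * np.eye(3)                       # Jacobian of f(x) = u - 2x
grad_formula = J.T @ (2 * (u - 2 * x))   # J(f)^T (grad g o f)

eps = 1e-6
grad_fd = np.array([
    (comp(x + eps * e) - comp(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

assert np.allclose(grad_formula, grad_fd, atol=1e-4)
```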

Suppose:

  • $f : \mathbb R^d \to \mathbb R$
  • $g : \mathbb R \to \mathbb R$

What is

\[\nabla(g \circ f)\]

?


\[\nabla f(x) \cdot g'(f(x))\]

Compositions of convex / concave functions

Suppose $f$ and $g$ are functions. What rules let you remember if $f \circ g$ is convex/concave depending on the convexity of $f$ and $g$?


| $f$ | $f$ monotonicity | $g$ | $f \circ g$ |
| --- | --- | --- | --- |
| concave | decreasing | concave | nothing in general |
| concave | decreasing | convex | concave |
| concave | increasing | concave | concave |
| concave | increasing | convex | nothing in general |
| convex | decreasing | concave | convex |
| convex | decreasing | convex | nothing in general |
| convex | increasing | concave | nothing in general |
| convex | increasing | convex | convex |
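One of the positive cases can be illustrated numerically (not a proof, just a spot-check): $f(t) = e^t$ is convex and increasing, $g(t) = t^2$ is convex, so $f(g(t)) = e^{t^2}$ should satisfy midpoint convexity on sampled pairs.

```python
import numpy as np

rng = np.random.default_rng(5)
h = lambda t: np.exp(t ** 2)   # convex-increasing composed with convex

pairs = rng.uniform(-2, 2, size=(100, 2))
midpoint_ok = all(
    h((s + t) / 2) <= (h(s) + h(t)) / 2 + 1e-12 for s, t in pairs
)
assert midpoint_ok
```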

Jensen’s inequality

@State Jensen’s inequality, first in the general finite form and then in terms of expectation.


Suppose:

  • $f$ is a real convex function
  • $x _ 1, \ldots, x _ n$ are points in the domain
  • $\alpha _ 1, \ldots, \alpha _ n$ are positive weights

Then:

  • $f\left(\frac{\sum \alpha _ i x _ i}{\sum \alpha _ i}\right) \le \frac{\sum \alpha _ i f(x _ i)}{\sum \alpha _ i}$, or
  • $f(\mathbb E[x]) \le \mathbb E[f(x)]$
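A quick numerical sanity check of the weighted form (a sketch; the convex function $\exp$ and the random weights are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
xs = rng.standard_normal(10)
alphas = rng.uniform(0.1, 1.0, size=10)
w = alphas / alphas.sum()     # normalised positive weights

lhs = np.exp(w @ xs)          # f(E[x])
rhs = w @ np.exp(xs)          # E[f(x)]
assert lhs <= rhs
```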

Hölder’s inequality

@State Hölder’s inequality (in the finite sum form), and describe how the Cauchy-Schwarz inequality is a special case.


Suppose:

  • $p$ and $q$ are such that $\frac 1 p + \frac 1 q = 1$
  • $x _ 1, \ldots, x _ n$ and $y _ 1, \ldots, y _ n$ are arbitrary real or complex numbers

Then:

\[\sum^n _ {i = 1} \vert x _ i y _ i \vert \le \left(\sum^n _ {i = 1} \vert x _ i \vert ^p\right)^{1/p} \left(\sum^n _ {i = 1} \vert y _ i \vert ^q\right)^{1/q}\]

The Cauchy-Schwarz inequality corresponds to the case $p = q = 2$.
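Both a general conjugate pair and the $p = q = 2$ case can be spot-checked numerically (a sketch; $p = 3$, $q = 3/2$ is an arbitrary choice satisfying $\frac 1 p + \frac 1 q = 1$):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(8)
y = rng.standard_normal(8)

def holder_rhs(x, y, p, q):
    return np.sum(np.abs(x) ** p) ** (1 / p) * np.sum(np.abs(y) ** q) ** (1 / q)

lhs = np.sum(np.abs(x * y))
assert lhs <= holder_rhs(x, y, 3, 1.5)   # general Hoelder, 1/3 + 2/3 = 1
assert lhs <= holder_rhs(x, y, 2, 2)     # Cauchy-Schwarz case
```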

Log-sum-exp is convex

@Justify that

\[f(x) = \log\left( \sum^n _ {i=1} e^{x _ i} \right)\]

is a convex function.


Let $u _ i = e^{x _ i}$ and $v _ i = e^{y _ i}$, and let $\lambda \in (0, 1)$. Then

\[\begin{aligned} f((1-\lambda)x + \lambda y) &= \log\left( \sum^n _ {i=1} e^{(1 - \lambda)x _ i + \lambda y _ i} \right) \\\\ &= \log\left( \sum^n _ {i=1} u _ i^{1-\lambda} v _ i^{\lambda} \right) \\\\ &\le \log\left(\left(\sum^n _ {i=1} u _ i^{(1 - \lambda) \cdot \frac{1}{1-\lambda} }\right)^{1 - \lambda}\cdot \left(\sum^n _ {i=1} v _ i^{\lambda \cdot \frac{1}{\lambda} }\right)^\lambda\right) \\\\ &= (1 - \lambda)\log\left(\sum^n _ {i = 1} u _ i\right) + \lambda \log\left(\sum^n _ {i=1} v _ i\right) \\\\ &= (1 - \lambda) f(x) + \lambda f(y) \end{aligned}\]

where the inequality is Hölder's inequality with exponents $p = \frac{1}{1-\lambda}$ and $q = \frac{1}{\lambda}$,

as required.
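The convexity of log-sum-exp can also be spot-checked along a segment (a sketch; the points and the value of $\lambda$ are arbitrary):

```python
import numpy as np

def lse(x):
    return np.log(np.sum(np.exp(x)))

rng = np.random.default_rng(8)
x = rng.standard_normal(5)
y = rng.standard_normal(5)
lam = 0.3

# f((1-lam) x + lam y) <= (1-lam) f(x) + lam f(y)
assert lse((1 - lam) * x + lam * y) <= (1 - lam) * lse(x) + lam * lse(y) + 1e-12
```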

Other useful characterisations of $L$-smoothness, convexity, and $\gamma$-strong convexity

@State a useful result which helps you check whether a function $f : \mathbb R^d \to \mathbb R$ (with sufficiently many derivatives) is $\gamma$-strongly convex or $L$-smooth.


  • $f : \mathbb R^d \to \mathbb R$ is $L$-smooth iff for all $x$, every eigenvalue of $\nabla^2 f(x)$ lies in $[-L, L]$.
  • This is equivalent to the Hessian being uniformly bounded in norm by $L$, i.e. for all $x$, $ \vert \vert \nabla^2 f(x) \vert \vert \le L$. (For convex $f$, this reduces to $\lambda _ \text{max}(\nabla^2 f(x)) \le L$.)

  • $f : \mathbb R^d \to \mathbb R$ is $\gamma$-strongly convex iff for all $x$, $\lambda _ \text{min}(\nabla^2 f(x)) \ge \gamma$.
  • This is equivalent to the inverse Hessian being uniformly bounded above by $1/\gamma$, i.e. for all $x$, $ \vert \vert \nabla^2 f(x)^{-1} \vert \vert \le 1/\gamma$.
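For a quadratic $f(x) = \frac 1 2 x^\top A x$ with $A$ symmetric positive definite, the Hessian is $A$ everywhere, so $L = \lambda _ \text{max}(A)$ and $\gamma = \lambda _ \text{min}(A)$. This makes the characterisation easy to check numerically (a sketch with an arbitrary random SPD matrix):

```python
import numpy as np

rng = np.random.default_rng(9)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)        # symmetric positive definite

eigs = np.linalg.eigvalsh(A)
gamma, L = eigs.min(), eigs.max()

# grad f(x) = A x; check the L-Lipschitz gradient and strong
# monotonicity (strong convexity) conditions on sampled points.
x, y = rng.standard_normal((2, 5))
gx, gy = A @ x, A @ y
assert np.linalg.norm(gx - gy) <= L * np.linalg.norm(x - y) + 1e-9
assert (gx - gy) @ (x - y) >= gamma * np.linalg.norm(x - y) ** 2 - 1e-9
```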

@State a result about the smoothness of $\sum \lambda _ i f _ i$ where each $f _ i$ is $L _ i$-smooth.


The sum is $\sum \lambda _ i L _ i$-smooth.

Suppose $f$ is $L$-smooth. What can you say about the smoothness of $f(Ax + b)$?


It is $L \vert \vert A \vert \vert ^2$-smooth (i.e. $L \sigma _ \max^2$ where $\sigma _ \max$ is the largest singular value).
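For a concrete instance, take $f(y) = \frac 1 2 \vert \vert y \vert \vert ^2$, which is $1$-smooth; then $h(x) = f(Ax + b)$ has constant Hessian $A^\top A$, whose norm is exactly $\sigma _ \max^2 = 1 \cdot \vert \vert A \vert \vert ^2$ (a sketch; the matrix is an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((4, 6))

sigma_max = np.linalg.norm(A, 2)            # largest singular value of A
hessian_norm = np.linalg.norm(A.T @ A, 2)   # smoothness constant of h

assert np.isclose(hessian_norm, sigma_max ** 2)
```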



