Optimisation for Data Science HT25, Useful miscellany
- [[Course - Optimisation for Data Science HT25]]U
- [[Notes - Optimisation for Data Science HT25, Overview of convergence results]]U
- [[Notes - Optimisation for Data Science HT25, Optimisation terminology]]U
- [[Notes - Optimisation for Data Science HT25, Smoothness and convexity]]U
- See also:
- [[Notes - Continuous Mathematics HT23, Taylor’s theorem]]U
- [[Notes - Continuous Mathematics HT23, Derivatives]]U
Flashcards
Useful identity in convex convergence proofs
@State a useful identity between two vectors $g, v \in \mathbb R^n$ which is useful when proving convergence with the assumption of convexity.
For any $g, v \in \mathbb R^n$ and any $L > 0$,
\[g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 = \frac{L}{2}\left( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2\right)\]E.g. when $g = \nabla f(x^k)$ and $v = x^k - x^\ast$, this becomes
\[\begin{aligned} \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} \vert \vert \nabla f(x^k) \vert \vert ^2 &= \frac{L}{2}( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^k - x^\ast - \frac{1}{L}\nabla f(x^k) \vert \vert ^2) \\\\ &= \frac L 2( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2) \end{aligned}\]where the second equality uses the gradient descent step $x^{k+1} = x^k - \frac 1 L \nabla f(x^k)$; this can then be used in a telescoping sum.
@Prove that for any two vectors $g, v \in \mathbb R^n$,
\[g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 = \frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2)\]
Expanding the squared norm on the right-hand side:
\[\frac{L}{2}\left( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2 \right) = \frac L 2 \left( \vert \vert v \vert \vert ^2 - \vert \vert v \vert \vert ^2 + \frac 2 L g^\top v - \frac{1}{L^2} \vert \vert g \vert \vert ^2 \right) = g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2\]
as required.
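A quick numerical sanity check of this identity (a sketch using NumPy; the values of $L$, $g$ and $v$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 2.5                      # arbitrary positive constant
g = rng.standard_normal(5)   # plays the role of a gradient
v = rng.standard_normal(5)   # plays the role of x^k - x*

lhs = g @ v - np.dot(g, g) / (2 * L)
rhs = (L / 2) * (np.dot(v, v) - np.dot(v - g / L, v - g / L))

assert np.isclose(lhs, rhs)
```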
Useful consequence of Taylor’s theorem
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
- $d \in \mathbb R^n$
@State a useful result about how you can rewrite $f(x + d)$.
\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
- $d \in \mathbb R^n$
@Prove that then
\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]
Consider the function $g : [0, 1] \to \mathbb R$ defined by $g(t) = f(x + td)$. By the chain rule, $g'(t) = \nabla f(x + td)^\top d$, and by the fundamental theorem of calculus,
\[g(1) - g(0) = \int^1 _ 0 g'(t) \text dt\]Hence
\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]
Common inequalities
@State a bound involving $ \vert \vert a + b \vert \vert ^2$.
\[ \vert \vert a + b \vert \vert ^2 \le 2 \vert \vert a \vert \vert ^2 + 2 \vert \vert b \vert \vert ^2\]
@State a bound involving $\sqrt{a + b}$ where $a, b > 0$.
\[\sqrt{a + b} \le \sqrt a + \sqrt b\]
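These two bounds, in the forms $\vert \vert a + b \vert \vert ^2 \le 2 \vert \vert a \vert \vert ^2 + 2 \vert \vert b \vert \vert ^2$ and $\sqrt{a + b} \le \sqrt a + \sqrt b$ commonly used in convergence proofs, can be spot-checked numerically (a sketch over random test values):

```python
import numpy as np

rng = np.random.default_rng(1)

# ||a + b||^2 <= 2||a||^2 + 2||b||^2 for random vectors
for _ in range(100):
    a, b = rng.standard_normal(4), rng.standard_normal(4)
    assert np.dot(a + b, a + b) <= 2 * np.dot(a, a) + 2 * np.dot(b, b) + 1e-12

# sqrt(a + b) <= sqrt(a) + sqrt(b) for random positive scalars
for _ in range(100):
    s, t = rng.uniform(0.01, 10, size=2)
    assert np.sqrt(s + t) <= np.sqrt(s) + np.sqrt(t) + 1e-12
```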
Condition number of a matrix
What is the condition number of a matrix $\kappa(A)$?
\[\kappa(A) = \vert \vert A \vert \vert \, \vert \vert A^{-1} \vert \vert = \frac{\sigma _ \max(A)}{\sigma _ \min(A)}\]
(in the spectral norm; for symmetric positive definite $A$ this equals $\lambda _ \max / \lambda _ \min$).
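A quick check that the ratio of extreme singular values agrees with NumPy's built-in condition number (which uses the spectral norm by default):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))  # almost surely invertible

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, descending
kappa = sigma[0] / sigma[-1]

# np.linalg.cond uses the 2-norm (spectral norm) by default
assert np.isclose(kappa, np.linalg.cond(A))
```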
Multivariate chain rules
Suppose:
- $f : \mathbb R^d \to \mathbb R^m$
- $g : \mathbb R^m \to \mathbb R$
What is
\[\nabla (g \circ f)\]
?
\[\nabla (g \circ f)(x) = J _ f(x)^\top \nabla g(f(x))\]
where $J _ f(x) \in \mathbb R^{m \times d}$ is the Jacobian of $f$ at $x$.
e.g. $\nabla _ x ( \vert \vert u - 2x \vert \vert ^2) = -4(u - 2x)$, since here $J _ f(x) = -2I$ and $\nabla g(v) = 2v$.
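A finite-difference sanity check of this example (a sketch; note the inner factor $-2$ from the chain rule makes the gradient $-4(u - 2x)$, and `u`, `x` are arbitrary test vectors):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.standard_normal(4)
x = rng.standard_normal(4)

phi = lambda z: np.dot(u - 2 * z, u - 2 * z)  # ||u - 2z||^2
grad = -4 * (u - 2 * x)                       # chain-rule gradient at x

# central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([
    (phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

assert np.allclose(grad, fd, atol=1e-4)
```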
Suppose:
- $f : \mathbb R^d \to \mathbb R$
- $g : \mathbb R \to \mathbb R$
What is
\[\nabla(g \circ f)\]
?
\[\nabla(g \circ f)(x) = g'(f(x)) \nabla f(x)\]
Compositions of convex / concave functions
Suppose $f$ and $g$ are functions. What rules let you remember whether $f \circ g$ is convex or concave, depending on the properties of $f$ and $g$? (In the table, 1 means convex/increasing and 0 means concave/decreasing; in the last column, 1 means $f \circ g$ is convex and 0 means it is concave.)
| $f$ convex? | $f$ increasing? | $g$ convex? | Then $f \circ g$ convex? |
|---|---|---|---|
| 0 | 0 | 0 | Nothing in general |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | Nothing in general |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | Nothing in general |
| 1 | 1 | 0 | Nothing in general |
| 1 | 1 | 1 | 1 |
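Two rows of the table can be spot-checked with random midpoint-convexity tests (a sketch; the specific choices of $f$ and $g$ are mine):

```python
import numpy as np

rng = np.random.default_rng(4)

def midpoint_convex(h, lo, hi, trials=1000):
    """Check h((x+y)/2) <= (h(x)+h(y))/2 on random pairs in [lo, hi]."""
    x = rng.uniform(lo, hi, trials)
    y = rng.uniform(lo, hi, trials)
    return np.all(h((x + y) / 2) <= (h(x) + h(y)) / 2 + 1e-12)

# Row (1, 1, 1): f = exp (convex, increasing), g = t^2 (convex)
# -> f o g = exp(t^2) is convex
assert midpoint_convex(lambda t: np.exp(t ** 2), -2.0, 2.0)

# Row (0, 1, 0): f = sqrt (concave, increasing), g = log (concave)
# -> f o g = sqrt(log t) is concave, i.e. its negation is convex
assert midpoint_convex(lambda t: -np.sqrt(np.log(t)), 1.5, 10.0)
```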
Jensen’s inequality
@State Jensen’s inequality, first in the general finite form and then in terms of expectation.
Suppose:
- $f$ is a real convex function
- $x _ 1, \ldots, x _ n$ are points in the domain
- $\alpha _ 1, \ldots, \alpha _ n$ are positive weights
Then:
- $f\left(\frac{\sum \alpha _ i x _ i}{\sum \alpha _ i}\right) \le \frac{\sum \alpha _ i f(x _ i)}{\sum \alpha _ i}$, or
- $f(\mathbb E[x]) \le \mathbb E[f(x)]$
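A quick numerical check of the weighted finite form (a sketch with $f = \exp$ as the convex function and random positive weights):

```python
import numpy as np

rng = np.random.default_rng(5)
f = np.exp  # a convex function

x = rng.standard_normal(10)       # points in the domain
alpha = rng.uniform(0.1, 1.0, 10)  # positive weights

lhs = f(np.sum(alpha * x) / np.sum(alpha))
rhs = np.sum(alpha * f(x)) / np.sum(alpha)

assert lhs <= rhs + 1e-12
```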
Hölder’s inequality
@State Hölder’s inequality (in the finite sum form), and describe how the Cauchy-Schwarz inequality is a special case.
Suppose:
- $p, q > 1$ are conjugate exponents, i.e. $\frac 1 p + \frac 1 q = 1$
- $x _ 1, \ldots, x _ n$ and $y _ 1, \ldots, y _ n$ are arbitrary real or complex numbers
Then:
\[\sum^n _ {i = 1} \vert x _ i y _ i \vert \le \left(\sum^n _ {i = 1} \vert x _ i \vert ^p\right)^{1/p} \left(\sum^n _ {i = 1} \vert y _ i \vert ^q\right)^{1/q}\]The Cauchy-Schwarz inequality corresponds to the case $p = q = 2$.
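A numerical sanity check over a few conjugate exponent pairs, with the Cauchy-Schwarz case $p = q = 2$ checked explicitly (a sketch on random vectors):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(8)
y = rng.standard_normal(8)

for p in (1.5, 2.0, 3.0):
    q = p / (p - 1)  # conjugate exponent: 1/p + 1/q = 1
    lhs = np.sum(np.abs(x * y))
    rhs = np.sum(np.abs(x) ** p) ** (1 / p) * np.sum(np.abs(y) ** q) ** (1 / q)
    assert lhs <= rhs + 1e-12

# p = q = 2 recovers Cauchy-Schwarz: |<x, y>| <= ||x|| ||y||
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y) + 1e-12
```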
Log-sum-exp is convex
@Justify that
\[f(x) = \log\left( \sum^n _ {i=1} e^{x _ i} \right)\]
is a convex function.
Fix $x, y \in \mathbb R^n$ and $\lambda \in (0, 1)$, and let $u _ i = e^{x _ i}$ and $v _ i = e^{y _ i}$. Applying Hölder's inequality with exponents $p = \frac{1}{1 - \lambda}$ and $q = \frac{1}{\lambda}$,
\[\begin{aligned} f((1-\lambda)x + \lambda y) &= \log\left( \sum^n _ {i=1} e^{(1 - \lambda)x _ i + \lambda y _ i} \right) \\\\ &= \log\left( \sum^n _ {i=1} u _ i^{1-\lambda} v _ i^{\lambda} \right) \\\\ &\le \log\left(\left(\sum^n _ {i=1} u _ i\right)^{1 - \lambda}\cdot \left(\sum^n _ {i=1} v _ i\right)^\lambda\right) \\\\ &= (1 - \lambda)\log\left(\sum^n _ {i = 1} u _ i\right) + \lambda \log\left(\sum^n _ {i=1} v _ i\right) \\\\ &= (1 - \lambda) f(x) + \lambda f(y) \end{aligned}\]as required.
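The convexity inequality can also be spot-checked numerically with a stable log-sum-exp implementation (a sketch on random points and random $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(7)

def lse(x):
    """Numerically stable log-sum-exp."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

for _ in range(100):
    x = rng.standard_normal(6)
    y = rng.standard_normal(6)
    lam = rng.uniform(0, 1)
    # f((1-lam) x + lam y) <= (1-lam) f(x) + lam f(y)
    assert lse((1 - lam) * x + lam * y) <= (1 - lam) * lse(x) + lam * lse(y) + 1e-12
```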
Other useful characterisations of $L$-smoothness, convexity, and $\gamma$-strong convexity
@State a useful result which helps you check whether a function $f : \mathbb R^d \to \mathbb R$ (with sufficiently many derivatives) is $\gamma$-strongly convex or $L$-smooth.
- $f : \mathbb R^d \to \mathbb R$ is $L$-smooth iff for all $x$, every eigenvalue of $\nabla^2 f(x)$ lies in $[-L, L]$, i.e. $ \vert \vert \nabla^2 f(x) \vert \vert \le L$.
	- For convex $f$ (where the Hessian is positive semidefinite), this reduces to $\lambda _ \text{max}(\nabla^2 f(x)) \le L$.
- $f : \mathbb R^d \to \mathbb R$ is $\gamma$-strongly convex iff for all $x$, $\lambda _ \text{min}(\nabla^2 f(x)) \ge \gamma$.
	- This is equivalent to the inverse Hessian being uniformly bounded above, i.e. for all $x$, $ \vert \vert (\nabla^2 f(x))^{-1} \vert \vert \le \frac 1 \gamma$.
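For a quadratic $f(x) = \frac 1 2 x^\top Q x$ with $Q$ symmetric positive definite, $\nabla^2 f \equiv Q$, so $f$ is $\lambda _ \text{max}(Q)$-smooth and $\lambda _ \text{min}(Q)$-strongly convex. A sketch checking these constants against the gradient-based definitions on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((5, 5))
Q = B.T @ B + 0.1 * np.eye(5)  # symmetric positive definite

grad = lambda x: Q @ x  # gradient of f(x) = 0.5 x^T Q x
eigs = np.linalg.eigvalsh(Q)  # ascending order
gamma, L = eigs[0], eigs[-1]

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    d = np.linalg.norm(x - y)
    # L-smoothness: ||grad f(x) - grad f(y)|| <= L ||x - y||
    assert np.linalg.norm(grad(x) - grad(y)) <= L * d + 1e-9
    # gamma-strong convexity (monotonicity form):
    # (grad f(x) - grad f(y))^T (x - y) >= gamma ||x - y||^2
    assert (grad(x) - grad(y)) @ (x - y) >= gamma * d ** 2 - 1e-9
```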
@State a result about the smoothness of $\sum \lambda _ i f _ i$ where each $f _ i$ is $L _ i$-smooth.
The sum is $\sum \lambda _ i L _ i$-smooth (assuming each $\lambda _ i \ge 0$; in general it is $\sum \vert \lambda _ i \vert L _ i$-smooth).
Suppose $f$ is $L$-smooth. What can you say about the smoothness of $f(Ax + b)$?
It is $L \vert \vert A \vert \vert ^2$-smooth, i.e. $L \sigma _ \max(A)^2$-smooth, where $\sigma _ \max(A)$ is the largest singular value of $A$.
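This can be checked for $f(y) = \frac 1 2 \vert \vert y \vert \vert ^2$, which is $1$-smooth, so $g(x) = f(Ax + b)$ should be $ \vert \vert A \vert \vert ^2$-smooth (a sketch with random $A$, $b$):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)

# g(x) = 0.5 ||Ax + b||^2 has gradient A^T (Ax + b), which is
# Lipschitz with constant ||A^T A|| = sigma_max(A)^2
grad_g = lambda x: A.T @ (A @ x + b)
L = np.linalg.norm(A, 2) ** 2  # squared largest singular value

for _ in range(100):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lhs = np.linalg.norm(grad_g(x) - grad_g(y))
    assert lhs <= L * np.linalg.norm(x - y) + 1e-9
```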