Optimisation for Data Science HT25, Useful miscellany
- [[Course - Optimisation for Data Science HT25]]
- Cauchy-Schwarz inequality
- Holder’s inequality
- $ \vert \vert a + b \vert \vert ^2 \le 2( \vert \vert a \vert \vert ^2 + \vert \vert b \vert \vert ^2)$
- $\sqrt{a + b} \le \sqrt a + \sqrt b$ for all $a, b \ge 0$
- Spectral norm
- Differentiability of functions $f : \mathbb R^d \to \mathbb R^m$
- Chain rule for multivariate functions
- Multivariate Taylor’s theorem
- Useful version of Taylor’s theorem with integral error term used in proof that $L$-smooth implies a quadratic upper bound
- $B$-Lipschitz implies that the derivatives are bounded by $B$ in spectral norm
- Jensen’s inequality
- First-order characterisation(s) of convexity
- Second-order characterisations of convexity
- Convex increasing of convex is convex, etc.
- Log-sum-exp is convex
Flashcards
Useful identity in convex convergence proofs
@State an identity involving two vectors $g, v \in \mathbb R^n$ and a constant $L > 0$ which is useful when proving convergence under a convexity assumption.
\[g^\top v - \frac{1}{2L} ||g||^2 = \frac{L}{2}(||v||^2 - ||v - \frac{1}{L} g||^2)\]
E.g. when $g = \nabla f(x^k)$ and $v = x^k - x^\ast$, this becomes
\[\begin{aligned} \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L}||\nabla f(x^k)||^2 &= \frac{L}{2}(||x^k - x^\ast||^2 - ||x^k - x^\ast - \frac{1}{L}\nabla f(x^k)||^2) \\\\ &= \frac L 2(||x^k - x^\ast||^2 - ||x^{k+1} - x^\ast||^2) \end{aligned}\]where the last equality uses the gradient descent update $x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k)$, and which can then be used in a telescoping sum.
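As a numerical sanity check (my own sketch, assuming gradient descent with step size $1/L$ on a simple quadratic $f(x) = \frac12 x^\top A x$), the per-step terms do telescope as claimed:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
A = A.T @ A + np.eye(4)          # SPD Hessian, so f(x) = 0.5 x^T A x is convex
L = np.linalg.eigvalsh(A).max()  # smoothness constant of f

def grad(x):
    return A @ x

x_star = np.zeros(4)             # minimiser of f
x = rng.standard_normal(4)
x0 = x.copy()
lhs_sum = 0.0
K = 50
for _ in range(K):
    g = grad(x)
    # left-hand side of the identity with g = grad f(x^k), v = x^k - x*
    lhs_sum += g @ (x - x_star) - (g @ g) / (2 * L)
    x = x - g / L                # gradient descent step x^{k+1} = x^k - (1/L) grad f(x^k)

# the right-hand sides telescope to (L/2)(||x^0 - x*||^2 - ||x^K - x*||^2)
rhs = (L / 2) * (np.dot(x0 - x_star, x0 - x_star) - np.dot(x - x_star, x - x_star))
assert np.isclose(lhs_sum, rhs)
```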
@Prove that for any two vectors $g, v \in \mathbb R^n$ and any constant $L > 0$,
\[g^\top v - \frac{1}{2L} ||g||^2 = \frac{L}{2}(||v||^2 - ||v - \frac{1}{L} g||^2)\]
\[\begin{aligned}
g^\top v - \frac{1}{2L} ||g||^2 &= \frac{L}{2} \left( \frac{2}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g \rangle \right) \\\\
&= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \frac{2}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g\rangle \right) \\\\
&= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \frac{1}{L} \langle g, v \rangle + \frac{1}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g\rangle \right) \\\\
&= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \left\langle \frac{1}{L} g, v \right\rangle + \left\langle v, \frac{1}{L} g \right\rangle - \left\langle \frac{1}{L} g, \frac{1}{L} g\right\rangle \right) \\\\
&= \frac{L}{2} \left(\langle v, v\rangle - \left\langle v - \frac 1 L g, v - \frac 1 L g \right\rangle\right) \\\\
&=\frac{L}{2}(||v||^2 - ||v - \frac{1}{L} g||^2)
\end{aligned}\]
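The identity can also be checked numerically with random vectors (a quick sketch, not part of the course notes):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(5)
v = rng.standard_normal(5)
L = 2.5  # any positive constant works

lhs = g @ v - np.dot(g, g) / (2 * L)
rhs = (L / 2) * (np.dot(v, v) - np.dot(v - g / L, v - g / L))

assert np.isclose(lhs, rhs)
```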
Useful consequence of Taylor’s theorem
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
- $d \in \mathbb R^n$
@State a useful result about how you can rewrite $f(x + d)$.
\[f(x + d) = f(x) + \int^1_0 \nabla f(x + td)^\top d \text dt\]
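A numerical sanity check of this integral form (my own sketch, using log-sum-exp as the test function and a midpoint-rule quadrature):

```python
import numpy as np

def f(x):
    # log-sum-exp, a smooth convex test function
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    # gradient of log-sum-exp is the softmax (shifted by max for stability)
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
d = rng.standard_normal(4)

# midpoint-rule approximation of \int_0^1 grad f(x + t d)^T d dt
N = 10_000
ts = (np.arange(N) + 0.5) / N
integral = sum(grad_f(x + t * d) @ d for t in ts) / N

assert np.isclose(f(x + d), f(x) + integral, atol=1e-6)
```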
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
- $d \in \mathbb R^n$
@Prove that then
\[f(x + d) = f(x) + \int^1_0 \nabla f(x + td)^\top d \text dt\]
Consider the function $g : [0, 1] \to \mathbb R$ defined by $g(t) = f(x + td)$. By the chain rule, $g'(t) = \nabla f(x + td)^\top d$. Then by the fundamental theorem of calculus,
\[g(1) - g(0) = \int^1_0 g'(t) \text dt\]Hence
\[f(x + d) = f(x) + \int^1_0 \nabla f(x + td)^\top d \text dt\]
Condition number of a matrix
@What is the condition number $\kappa(A)$ of a matrix $A$?
For a symmetric positive definite matrix $A$,
\[\kappa(A) =\frac{\lambda_\text{max}(A) }{\lambda_\text{min}(A) }\]
In general, $\kappa(A) = \sigma_\text{max}(A) / \sigma_\text{min}(A)$, the ratio of the largest to smallest singular values.
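A quick NumPy check (my own sketch) that for a symmetric positive definite matrix the eigenvalue ratio agrees with `np.linalg.cond`, which by default computes the 2-norm condition number via singular values:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)  # symmetric positive definite

eigs = np.linalg.eigvalsh(A)      # eigenvalues in ascending order
kappa_eig = eigs.max() / eigs.min()

# np.linalg.cond defaults to the 2-norm condition number (singular-value ratio)
assert np.isclose(kappa_eig, np.linalg.cond(A))
```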