Optimisation for Data Science HT25, Misc
Flashcards
Useful identity in convex convergence proofs
@State a useful identity between two vectors $g, v \in \mathbb R^n$ which is useful when proving convergence with the assumption of convexity.
E.g. when $g = \nabla f(x^k)$ and $v = x^k - x^\ast$, this becomes
\[\begin{aligned} \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} \vert \vert x^k - x^\ast \vert \vert ^2 &= \frac{L}{2}( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^k - x^\ast - \frac{1}{L}\nabla f(x^k) \vert \vert ^2) \\\\ &= \frac L 2( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2) \end{aligned}\]which can then be used in a telescoping sum.
@Prove that for any two vectors $g, v \in \mathbb R^n$,
\[g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 = \frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2)\]
Useful consequence of Taylor’s theorem
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
- $d \in \mathbb R^n$
@State a useful result about how you can rewrite $f(x + d)$.
Suppose:
- $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
- $d \in \mathbb R^n$
@Prove that then
\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]
Consider the function $g : [0, 1] \to \mathbb R$ defined by $g(t) = f(x + td)$. Then by the fundamental theorem of calculus,
\[g(1) - g(0) = \int^1 _ 0 g'(t) \text dt\]Hence
\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]Common inequalities
@State a bound involving $ \vert \vert a + b \vert \vert ^2$.
@State a bound involving $\sqrt{a + b}$ where $a, b > 0$.
Condition number of a matrix
What is the condition number of a matrix $\kappa(A)$?
Multivariate chain rules
Suppose:
- $f : \mathbb R^d \to \mathbb R^m$
- $g : \mathbb R^m \to \mathbb R$
What is
\[\nabla (g \circ f)\]
?
e.g. $\nabla ( \vert \vert u - 2x \vert \vert ^2) = 4(u - 2x)$.
^multivariate-chain-rule
Compute $\nabla f(x)$, where
\[f(x) = \|Ax - b\|^2, \quad A \in \mathbb R^{m \times n},\ b \in \mathbb R^m.\]
$\nabla f(x) = 2 A^\top (Ax - b)$.
Write $f = g \circ h$ with $h(x) = Ax - b : \mathbb R^n \to \mathbb R^m$ (so $\mathbf J(h)(x) = A$) and $g(u) = \|u\|^2$ (so $\nabla g(u) = 2u$). The multivariate chain rule (∆multivariate-chain-rule) gives
\[\nabla f(x) = \mathbf J(h)(x)^\top (\nabla g)(h(x)) = A^\top \cdot 2(Ax - b) = 2 A^\top (Ax - b).\]Compute $\nabla f(x)$, where
\[f(x) = \log\!\left(\sum _ {i=1}^m e^{a _ i^\top x}\right), \quad a _ 1, \ldots, a _ m \in \mathbb R^n.\]
Let $p _ i(x) := e^{a _ i^\top x}\!\big/\!\sum _ j e^{a _ j^\top x}$ (the softmax weight). Then
\[\nabla f(x) = \sum _ {i=1}^m p _ i(x)\, a _ i.\]Write $f = g \circ h$ with $h(x) = (a _ 1^\top x, \ldots, a _ m^\top x)^\top \in \mathbb R^m$ (so $\mathbf J(h)(x)$ has $i$th row $a _ i^\top$, and $\mathbf J(h)(x)^\top$ has $i$th column $a _ i$) and $g(u) = \log\sum _ i e^{u _ i}$ (so $\nabla g(u) = \mathrm{softmax}(u)$, with $i$th entry $e^{u _ i}/\sum _ j e^{u _ j}$). The multivariate chain rule (∆multivariate-chain-rule) gives
\[\nabla f(x) = \mathbf J(h)(x)^\top\, \mathrm{softmax}(h(x)) = \sum _ i p _ i(x)\, a _ i.\]Suppose:
- $f : \mathbb R^d \to \mathbb R$
- $g : \mathbb R \to \mathbb R$
What is
\[\nabla(g \circ f)\]
?
^scalar-chain-rule
Compute $\nabla f(x)$, where
\[f(x) = e^{a^\top x}, \quad a \in \mathbb R^n.\]
$\nabla f(x) = e^{a^\top x}\, a$.
Write $f = g \circ h$ with $h(x) = a^\top x$ (so $\nabla h(x) = a$) and $g(t) = e^t$ (so $g'(t) = e^t$). The scalar chain rule (∆scalar-chain-rule) gives
\[\nabla f(x) = g'(h(x))\, \nabla h(x) = e^{a^\top x}\, a.\]Compute $\nabla f(x)$, where
\[f(x) = \sin(\|x\|^2).\]
$\nabla f(x) = 2x\, \cos(\|x\|^2)$.
Write $f = g \circ h$ with $h(x) = \|x\|^2 = x^\top x$ (so $\nabla h(x) = 2x$) and $g(t) = \sin t$ (so $g'(t) = \cos t$). The scalar chain rule (∆scalar-chain-rule) gives
\[\nabla f(x) = g'(h(x))\, \nabla h(x) = 2x\, \cos(\|x\|^2).\]Product rule
Suppose:
- $g, h : \mathbb R^n \to \mathbb R$ are differentiable
- $x \in \mathbb R^n$
@State the product rule for $\nabla(g \cdot h)(x)$.
Compute $\nabla f(x)$, where
\[f(x) = (a^\top x)(b^\top x), \quad a, b \in \mathbb R^n.\]
$\nabla f(x) = (b^\top x)\, a + (a^\top x)\, b$.
Apply the product rule (∆product-rule) with $g(x) = a^\top x$ (so $\nabla g(x) = a$) and $h(x) = b^\top x$ (so $\nabla h(x) = b$):
\[\nabla f(x) = h(x)\, \nabla g(x) + g(x)\, \nabla h(x) = (b^\top x)\, a + (a^\top x)\, b.\]Compute $\nabla f(x)$, where
\[f(x) = \|x\|^2\, e^{c^\top x}, \quad c \in \mathbb R^n.\]
$\nabla f(x) = e^{c^\top x}\bigl(2x + \|x\|^2\, c\bigr)$.
Apply the product rule (∆product-rule) with $g(x) = \|x\|^2$ (so $\nabla g(x) = 2x$) and $h(x) = e^{c^\top x}$ (so $\nabla h(x) = e^{c^\top x}\, c$ by ∆scalar-chain-rule):
\[\nabla f(x) = h(x)\, \nabla g(x) + g(x)\, \nabla h(x) = 2x\, e^{c^\top x} + \|x\|^2\, e^{c^\top x}\, c = e^{c^\top x}\bigl(2x + \|x\|^2\, c\bigr).\]Compositions of convex / concave functions
Suppose $f : \mathbb R \to \mathbb R$ and $g : \mathbb R^n \to \mathbb R$. What rules let you remember whether the composition $f \circ g$ is convex or concave, from the curvature of $f$ and $g$?
How to read the table: each $0/1$ is a curvature toggle, not a yes/no. In the “$f$ convex?”, “$g$ convex?” and “$f \circ g$ convex?” columns, $1 =$ convex and $0 =$ concave — i.e. the opposite curvature, not merely “non-convex” (so $g(x) = x^3$, which is neither, is not covered by a $0$). In the “$f$ increasing?” column, $1 =$ nondecreasing and $0 =$ nonincreasing. “Nothing in general” means no conclusion.
| $f$ convex? | $f$ increasing? | $g$ convex? | Then $f \circ g$ convex? |
|---|---|---|---|
| 0 | 0 | 0 | Nothing in general |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | Nothing in general |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | Nothing in general |
| 1 | 1 | 0 | Nothing in general |
| 1 | 1 | 1 | 1 |
The four determined rows are the standard scalar composition rules:
- $f$ convex & nondecreasing, $g$ convex $\implies f \circ g$ convex.
- $f$ convex & nonincreasing, $g$ concave $\implies f \circ g$ convex.
- $f$ concave & nondecreasing, $g$ concave $\implies f \circ g$ concave.
- $f$ concave & nonincreasing, $g$ convex $\implies f \circ g$ concave.
Caveat: the monotonicity of $f$ must hold over the range $g$ actually takes (formally, on $f$’s extended-value extension).
Jensen’s inequality
@State Jensen’s inequality, first in the general finite form and then in terms of expectation.
Suppose:
- $f$ is a real convex function
- $x _ 1, \ldots, x _ n$ are points in the domain
- $\alpha _ 1, \ldots, \alpha _ n$ are positive weights
Then:
- $f\left(\frac{\sum \alpha _ i x _ i}{\sum \alpha _ i}\right) \le \frac{\sum \alpha _ i f(x _ i)}{\sum \alpha _ i}$, or
- $f(\mathbb E[x]) \le \mathbb E[f(x)]$
Holder’s inequality
@State Holder’s inequality (in the finite sum form), and describe how the Cauchy-Schwarz inequality is a special case.
Suppose:
- $p$ and $q$ are such that $\frac 1 p + \frac 1 q = 1$
- $x _ 1, \ldots, x _ n$ and $y _ 1, \ldots, y _ n$ are arbitrary real or complex numbers
Then:
\[\sum^n _ {i = 1} \vert x _ i y _ i \vert \le \left(\sum^n _ {i = 1} \vert x _ i \vert ^p\right)^{1/p} \left(\sum^n _ {i = 1} \vert y _ i \vert ^q\right)^{1/q}\]The Cauchy-Schwarz inequality corresponds to the case where $p = 1/2, q = 1/2$.
Log-sum-exp is convex
@Justify that
\[f(x) = \log\left( \sum^n _ {i=1} e^{x _ i} \right)\]
is a convex function.
Let $u _ i = e^{x _ i}$ and $v _ i = e^{y _ i}$. Therefore
\[\begin{aligned} f((1-\lambda)x + \lambda y) &= \log\left( \sum^n _ {i=1} e^{(1 - \lambda)x _ i + \lambda y _ i} \right) \\\\ &= \log\left( \sum^n _ {i=1} u _ i^{(1-\lambda)} v _ i^{\lambda} \right) \\\\ &\le \log\left(\left(\sum^n _ {i=1} u _ i^{(1 - \lambda) \cdot \frac{1}{1-\lambda} }\right)^{1 - \lambda}\cdot \left(\sum^n _ {i=1} v _ i^{\lambda \cdot \frac{1}{\lambda} }\right)^\lambda\right) \\\\ &\le (1 - \lambda)\log\left(\sum^n _ {i = 1} u _ i^{}\right) + \lambda \log\left(\sum^n _ {i=1} v _ i\right) \end{aligned}\]as required.
Other useful characterisations of $L$-smoothness, convexity, and $\gamma$-strong convexity
@State a useful result which helps you check whether a function $f : \mathbb R^d \to \mathbb R$ (with sufficiently many derivatives) is $\gamma$-strongly convex or $L$-smooth.
$f : \mathbb R^d \to \mathbb R$ is $L$-smooth iff for all $x$, $\lambda _ \text{max}(\nabla^2 f(x)) \le L$.
This is equivalent to the Hessian being uniformly bounded above by $L$, i.e. for all $x$, $ \vert \vert \nabla^2 f(x) \vert \vert \le L$.
$f : \mathbb R^d \to \mathbb R$ is $\gamma$-strongly convex iff for all $x$, $\lambda _ \text{min}(\nabla^2 f(x)) \ge \gamma$.
This is equivalent to the inverse Hessian being uniformly bounded above by $1/\gamma$, i.e. for all $x$, $ \vert \vert \nabla^2 f(x))^{-1} \vert \vert \ge \gamma$.
@State a result about the smoothness of $\sum \lambda _ i f _ i$ where each $f _ i$ is $L _ i$-smooth.
The sum is $\sum \lambda _ i L _ i$-smooth.
Suppose $f$ is $L$-smooth. What can you say about the smoothness of $f(Ax + b)$?
It is $L \vert \vert A \vert \vert ^2$-smooth (i.e. $L \sigma _ \max^2$ where $\sigma _ \max$ is the largest singular value).