Optimisation for Data Science HT25, Misc


Flashcards

Useful identity in convex convergence proofs

@State a useful identity between two vectors $g, v \in \mathbb R^n$ which is useful when proving convergence with the assumption of convexity.

\[g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 = \frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2)\]

E.g. when $g = \nabla f(x^k)$ and $v = x^k - x^\ast$, this becomes

\[\begin{aligned} \nabla f(x^k)^\top (x^k - x^\ast) - \frac{1}{2L} \vert \vert x^k - x^\ast \vert \vert ^2 &= \frac{L}{2}( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^k - x^\ast - \frac{1}{L}\nabla f(x^k) \vert \vert ^2) \\\\ &= \frac L 2( \vert \vert x^k - x^\ast \vert \vert ^2 - \vert \vert x^{k+1} - x^\ast \vert \vert ^2) \end{aligned}\]

which can then be used in a telescoping sum.

@Prove that for any two vectors $g, v \in \mathbb R^n$,

\[g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 = \frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2)\]
\[\begin{aligned} g^\top v - \frac{1}{2L} \vert \vert g \vert \vert ^2 &= \frac{L}{2} \left( \frac{2}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g \rangle \right) \\\\ &= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \frac{2}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g\rangle \right) \\\\ &= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \frac{1}{L} \langle g, v \rangle + \frac{1}{L} \langle g, v \rangle - \frac{1}{L^2} \langle g, g\rangle \right) \\\\ &= \frac{L}{2}\left( \langle v, v\rangle - \langle v, v \rangle + \left\langle \frac{1}{L} g, v \right\rangle + \left\langle v, \frac{1}{L} g \right\rangle - \left\langle \frac{1}{L} g, \frac{1}{L} g\right\rangle \right) \\\\ &= \frac{L}{2} \left(\langle v, v\rangle - \left\langle v - \frac 1 L g, v - \frac 1 L g \right\rangle\right) \\\\ &=\frac{L}{2}( \vert \vert v \vert \vert ^2 - \vert \vert v - \frac{1}{L} g \vert \vert ^2) \end{aligned}\]

Useful consequence of Taylor’s theorem

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
  • $d \in \mathbb R^n$

@State a useful result about how you can rewrite $f(x + d)$.

\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]

Suppose:

  • $f : \mathbb R^n \to \mathbb R$ is sufficiently smooth
  • $d \in \mathbb R^n$

@Prove that then

\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]

Consider the function $g : [0, 1] \to \mathbb R$ defined by $g(t) = f(x + td)$. Then by the fundamental theorem of calculus,

\[g(1) - g(0) = \int^1 _ 0 g'(t) \text dt\]

Hence

\[f(x + d) = f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text dt\]

Common inequalities

@State a bound involving $ \vert \vert a + b \vert \vert ^2$.

\[ \vert \vert a + b \vert \vert ^2 \le 2( \vert \vert a \vert \vert ^2 + \vert \vert b \vert \vert ^2)\]

@State a bound involving $\sqrt{a + b}$ where $a, b > 0$.

\[\sqrt{a + b} \le \sqrt a + \sqrt b\]

Condition number of a matrix

What is the condition number of a matrix $\kappa(A)$?

\[\kappa(A) =\frac{\lambda _ \text{max} }{\lambda _ \text{min} }\]

Multivariate chain rules

Suppose:

  • $f : \mathbb R^d \to \mathbb R^m$
  • $g : \mathbb R^m \to \mathbb R$

What is

\[\nabla (g \circ f)\]

?

\[\mathbf J(f)^\top (\nabla g \circ f)\]

e.g. $\nabla ( \vert \vert u - 2x \vert \vert ^2) = 4(u - 2x)$.

^multivariate-chain-rule

Compute $\nabla f(x)$, where

\[f(x) = \|Ax - b\|^2, \quad A \in \mathbb R^{m \times n},\ b \in \mathbb R^m.\]

$\nabla f(x) = 2 A^\top (Ax - b)$.

Write $f = g \circ h$ with $h(x) = Ax - b : \mathbb R^n \to \mathbb R^m$ (so $\mathbf J(h)(x) = A$) and $g(u) = \|u\|^2$ (so $\nabla g(u) = 2u$). The multivariate chain rule (∆multivariate-chain-rule) gives

\[\nabla f(x) = \mathbf J(h)(x)^\top (\nabla g)(h(x)) = A^\top \cdot 2(Ax - b) = 2 A^\top (Ax - b).\]

Source: ∆multivariate-chain-rule.

@example~

Compute $\nabla f(x)$, where

\[f(x) = \log\!\left(\sum _ {i=1}^m e^{a _ i^\top x}\right), \quad a _ 1, \ldots, a _ m \in \mathbb R^n.\]

Let $p _ i(x) := e^{a _ i^\top x}\!\big/\!\sum _ j e^{a _ j^\top x}$ (the softmax weight). Then

\[\nabla f(x) = \sum _ {i=1}^m p _ i(x)\, a _ i.\]

Write $f = g \circ h$ with $h(x) = (a _ 1^\top x, \ldots, a _ m^\top x)^\top \in \mathbb R^m$ (so $\mathbf J(h)(x)$ has $i$th row $a _ i^\top$, and $\mathbf J(h)(x)^\top$ has $i$th column $a _ i$) and $g(u) = \log\sum _ i e^{u _ i}$ (so $\nabla g(u) = \mathrm{softmax}(u)$, with $i$th entry $e^{u _ i}/\sum _ j e^{u _ j}$). The multivariate chain rule (∆multivariate-chain-rule) gives

\[\nabla f(x) = \mathbf J(h)(x)^\top\, \mathrm{softmax}(h(x)) = \sum _ i p _ i(x)\, a _ i.\]

Source: ∆multivariate-chain-rule; cf. ∆ObslqGUqfkEnCPNk2bWcj7Dg (log-sum-exp convexity) for the inner function.

@example~

Suppose:

  • $f : \mathbb R^d \to \mathbb R$
  • $g : \mathbb R \to \mathbb R$

What is

\[\nabla(g \circ f)\]

?

\[\nabla f(x) \cdot g'(f(x))\]

^scalar-chain-rule

Compute $\nabla f(x)$, where

\[f(x) = e^{a^\top x}, \quad a \in \mathbb R^n.\]

$\nabla f(x) = e^{a^\top x}\, a$.

Write $f = g \circ h$ with $h(x) = a^\top x$ (so $\nabla h(x) = a$) and $g(t) = e^t$ (so $g'(t) = e^t$). The scalar chain rule (∆scalar-chain-rule) gives

\[\nabla f(x) = g'(h(x))\, \nabla h(x) = e^{a^\top x}\, a.\]

Source: ∆scalar-chain-rule.

@example~

Compute $\nabla f(x)$, where

\[f(x) = \sin(\|x\|^2).\]

$\nabla f(x) = 2x\, \cos(\|x\|^2)$.

Write $f = g \circ h$ with $h(x) = \|x\|^2 = x^\top x$ (so $\nabla h(x) = 2x$) and $g(t) = \sin t$ (so $g'(t) = \cos t$). The scalar chain rule (∆scalar-chain-rule) gives

\[\nabla f(x) = g'(h(x))\, \nabla h(x) = 2x\, \cos(\|x\|^2).\]

Source: ∆scalar-chain-rule.

@example~

Product rule

Suppose:

  • $g, h : \mathbb R^n \to \mathbb R$ are differentiable
  • $x \in \mathbb R^n$

@State the product rule for $\nabla(g \cdot h)(x)$.

\[\nabla(g \cdot h)(x) = h(x)\, \nabla g(x) + g(x)\, \nabla h(x).\]

Compute $\nabla f(x)$, where

\[f(x) = (a^\top x)(b^\top x), \quad a, b \in \mathbb R^n.\]

$\nabla f(x) = (b^\top x)\, a + (a^\top x)\, b$.

Apply the product rule (∆product-rule) with $g(x) = a^\top x$ (so $\nabla g(x) = a$) and $h(x) = b^\top x$ (so $\nabla h(x) = b$):

\[\nabla f(x) = h(x)\, \nabla g(x) + g(x)\, \nabla h(x) = (b^\top x)\, a + (a^\top x)\, b.\]

Source: ∆product-rule.

@example~

Compute $\nabla f(x)$, where

\[f(x) = \|x\|^2\, e^{c^\top x}, \quad c \in \mathbb R^n.\]

$\nabla f(x) = e^{c^\top x}\bigl(2x + \|x\|^2\, c\bigr)$.

Apply the product rule (∆product-rule) with $g(x) = \|x\|^2$ (so $\nabla g(x) = 2x$) and $h(x) = e^{c^\top x}$ (so $\nabla h(x) = e^{c^\top x}\, c$ by ∆scalar-chain-rule):

\[\nabla f(x) = h(x)\, \nabla g(x) + g(x)\, \nabla h(x) = 2x\, e^{c^\top x} + \|x\|^2\, e^{c^\top x}\, c = e^{c^\top x}\bigl(2x + \|x\|^2\, c\bigr).\]

Compositions of convex / concave functions

Suppose $f : \mathbb R \to \mathbb R$ and $g : \mathbb R^n \to \mathbb R$. What rules let you remember whether the composition $f \circ g$ is convex or concave, from the curvature of $f$ and $g$?

How to read the table: each $0/1$ is a curvature toggle, not a yes/no. In the “$f$ convex?”, “$g$ convex?” and “$f \circ g$ convex?” columns, $1 =$ convex and $0 =$ concave — i.e. the opposite curvature, not merely “non-convex” (so $g(x) = x^3$, which is neither, is not covered by a $0$). In the “$f$ increasing?” column, $1 =$ nondecreasing and $0 =$ nonincreasing. “Nothing in general” means no conclusion.

$f$ convex?$f$ increasing?$g$ convex?Then $f \circ g$ convex?
000Nothing in general
0010
0100
011Nothing in general
1001
101Nothing in general
110Nothing in general
1111

The four determined rows are the standard scalar composition rules:

  • $f$ convex & nondecreasing, $g$ convex $\implies f \circ g$ convex.
  • $f$ convex & nonincreasing, $g$ concave $\implies f \circ g$ convex.
  • $f$ concave & nondecreasing, $g$ concave $\implies f \circ g$ concave.
  • $f$ concave & nonincreasing, $g$ convex $\implies f \circ g$ concave.

Caveat: the monotonicity of $f$ must hold over the range $g$ actually takes (formally, on $f$’s extended-value extension).

Source: Boyd & Vandenberghe, Convex Optimization, §3.2.4 (scalar composition); supporting fact for the convexity material in Course - Optimisation for Data Science HT25U / Course - Continuous Optimisation HT26U.

Jensen’s inequality

@State Jensen’s inequality, first in the general finite form and then in terms of expectation.

Suppose:

  • $f$ is a real convex function
  • $x _ 1, \ldots, x _ n$ are points in the domain
  • $\alpha _ 1, \ldots, \alpha _ n$ are positive weights

Then:

  • $f\left(\frac{\sum \alpha _ i x _ i}{\sum \alpha _ i}\right) \le \frac{\sum \alpha _ i f(x _ i)}{\sum \alpha _ i}$, or
  • $f(\mathbb E[x]) \le \mathbb E[f(x)]$

Holder’s inequality

@State Holder’s inequality (in the finite sum form), and describe how the Cauchy-Schwarz inequality is a special case.

Suppose:

  • $p$ and $q$ are such that $\frac 1 p + \frac 1 q = 1$
  • $x _ 1, \ldots, x _ n$ and $y _ 1, \ldots, y _ n$ are arbitrary real or complex numbers

Then:

\[\sum^n _ {i = 1} \vert x _ i y _ i \vert \le \left(\sum^n _ {i = 1} \vert x _ i \vert ^p\right)^{1/p} \left(\sum^n _ {i = 1} \vert y _ i \vert ^q\right)^{1/q}\]

The Cauchy-Schwarz inequality corresponds to the case where $p = 1/2, q = 1/2$.

Log-sum-exp is convex

@Justify that

\[f(x) = \log\left( \sum^n _ {i=1} e^{x _ i} \right)\]

is a convex function.

Let $u _ i = e^{x _ i}$ and $v _ i = e^{y _ i}$. Therefore

\[\begin{aligned} f((1-\lambda)x + \lambda y) &= \log\left( \sum^n _ {i=1} e^{(1 - \lambda)x _ i + \lambda y _ i} \right) \\\\ &= \log\left( \sum^n _ {i=1} u _ i^{(1-\lambda)} v _ i^{\lambda} \right) \\\\ &\le \log\left(\left(\sum^n _ {i=1} u _ i^{(1 - \lambda) \cdot \frac{1}{1-\lambda} }\right)^{1 - \lambda}\cdot \left(\sum^n _ {i=1} v _ i^{\lambda \cdot \frac{1}{\lambda} }\right)^\lambda\right) \\\\ &\le (1 - \lambda)\log\left(\sum^n _ {i = 1} u _ i^{}\right) + \lambda \log\left(\sum^n _ {i=1} v _ i\right) \end{aligned}\]

as required.

Other useful characterisations of $L$-smoothness, convexity, and $\gamma$-strong convexity

@State a useful result which helps you check whether a function $f : \mathbb R^d \to \mathbb R$ (with sufficiently many derivatives) is $\gamma$-strongly convex or $L$-smooth.

  • $f : \mathbb R^d \to \mathbb R$ is $L$-smooth iff for all $x$, $\lambda _ \text{max}(\nabla^2 f(x)) \le L$.

  • This is equivalent to the Hessian being uniformly bounded above by $L$, i.e. for all $x$, $ \vert \vert \nabla^2 f(x) \vert \vert \le L$.

  • $f : \mathbb R^d \to \mathbb R$ is $\gamma$-strongly convex iff for all $x$, $\lambda _ \text{min}(\nabla^2 f(x)) \ge \gamma$.

  • This is equivalent to the inverse Hessian being uniformly bounded above by $1/\gamma$, i.e. for all $x$, $ \vert \vert \nabla^2 f(x))^{-1} \vert \vert \ge \gamma$.

@State a result about the smoothness of $\sum \lambda _ i f _ i$ where each $f _ i$ is $L _ i$-smooth.

The sum is $\sum \lambda _ i L _ i$-smooth.

Suppose $f$ is $L$-smooth. What can you say about the smoothness of $f(Ax + b)$?

It is $L \vert \vert A \vert \vert ^2$-smooth (i.e. $L \sigma _ \max^2$ where $\sigma _ \max$ is the largest singular value).