Notes - Continuous Mathematics HT23, Optimisation
Flashcards
Exact methods/theory
What is the name of $f : F \to \mathbb{R}$, where $F \subseteq \mathbb{R}^n$ and you’re trying to solve
\[\text{argmin}_{x \in F} f(x)\]
?
The objective function.
What’s it called when your objective function is $f : F \to \mathbb R$ and $F = \mathbb{R}^n$?
Unconstrained optimisation.
What’s it called when your objective function is $f : F \to \mathbb R$ and $F \subset \mathbb R^n$?
Constrained optimisation.
When you have an objective function $f : F \to \mathbb R$ and $F \subset \mathbb R^n$, what’s the standard form for $F$?
\[F = \\{\pmb x \in \mathbb{R}^n : g_1(\pmb x) = \ldots = g_k(\pmb x) = 0, \; h_1(\pmb x) \ge 0, \ldots, h_m(\pmb x) \ge 0\\}\]
where $g _ i$ are equality constraints and $h _ i$ are inequality constraints.
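For example, minimising over the right half of the unit circle would use $F = \\{\pmb x \in \mathbb R^2 : x_1^2 + x_2^2 - 1 = 0, \; x_1 \ge 0\\}$, with one equality constraint $g _ 1(\pmb x) = x _ 1^2 + x _ 2^2 - 1$ and one inequality constraint $h _ 1(\pmb x) = x _ 1$.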
What’s it called in optimisation when the constraints actually force $F$, the domain of the objective function, to be $\emptyset$?
Inconsistent.
What’s it called in optimisation where either the objective function $f$ can take arbitrarily small values or $\forall x \in F \text{ } \exists y \in F \text{ s.t. } f(y) < f(x)$?
Unbounded.
If $f : [a,b] \to \mathbb{R}$ and $x \ne a, b$, then what is the “first-order condition”, i.e. what does any solution to $\text{argmin} _ {x\in F} f(x)$ satisfy?
\[f'(x) = 0\]
Suppose the first non-zero derivative of a function at a stationary point is the $k$-th, with value $d$. How do you classify the point as a stationary point of inflection, a local maximum or a local minimum?
- If $k$ is odd, then it is a stationary point of inflection.
- If $k$ is even, then:
  - It is a minimum if $d > 0$, and
  - It is a maximum if $d < 0$.
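For example, $f(x) = x^4$ at $x = 0$ has first non-zero derivative the fourth, with $d = 24 > 0$, so $x = 0$ is a local minimum; $f(x) = x^3$ at $x = 0$ has first non-zero derivative the third, so $x = 0$ is a stationary point of inflection.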
What’s the intuition behind a saddle point of $f$ being fundamentally different from a local minimum or maximum?
Moving in some directions increases $f$, while moving in others decreases it.
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function and the Hessian at some stationary point $x _ 0$ be $\text H = \mathbf{H}(f)(x _ 0)$. When is $x _ 0$ a local minimum, a local maximum or a saddle point?
- Local minimum: $\text H$ is positive definite.
- Local maximum: $\text H$ is negative definite.
- Saddle point: $\text H$ is indefinite.
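This can be checked numerically from the signs of the Hessian’s eigenvalues. A minimal sketch in Python/NumPy, using $f(x, y) = x^2 - y^2$ at its stationary point $(0, 0)$ (a saddle) as the example:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the stationary point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigs = np.linalg.eigvalsh(H)  # H is symmetric, so eigvalsh is appropriate

if np.all(eigs > 0):
    print("local minimum")    # H positive definite
elif np.all(eigs < 0):
    print("local maximum")    # H negative definite
elif np.any(eigs > 0) and np.any(eigs < 0):
    print("saddle point")     # H indefinite
else:
    print("inconclusive")     # some zero eigenvalues; the test says nothing
```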
When minimising $f(\pmb x)$ subject to a single equality constraint $g(\pmb x) = 0$, you find the stationary points of the Lagrangian
\[\Lambda(\lambda, \pmb x) = f(\pmb x) - \lambda g(\pmb x)\]
because you’re actually solving
\[\frac{\text d f}{\text d \pmb x} = \lambda \frac{\text d g}{\text d \pmb x}\]
Why (draw a picture!)?
Because at a constrained minimum or maximum the contours of $f$ must be tangent to the constraint curve $g(\pmb x) = 0$ (equivalently, their gradients must be parallel); otherwise you could move along the constraint and still decrease (or increase) $f$.
What’s the technique for minimising $f(\pmb x)$ subject to a single equality constraint $g(\pmb x) = 0$?
Find the stationary points of the Lagrangian
\[\Lambda(\lambda, \pmb x) = f(\pmb x) - \lambda g(\pmb x)\]
When minimising $f(\pmb x)$ subject to a single equality constraint $g(\pmb x) = 0$, what big system of equations do you need to solve?
\[\frac{\text d f}{\text d \pmb x} = \lambda \frac{\text d g}{\text d \pmb x}, \quad g(\pmb x) = 0\]
i.e.
\[\begin{aligned} \frac{\partial f}{\partial x_1} &= \lambda\frac{\partial g}{\partial x_1} \\\\ &\ldots\\\\ \frac{\partial f}{\partial x_n} &= \lambda\frac{\partial g}{\partial x_n} \end{aligned}\]
together with the constraint $g(\pmb x) = 0$.
What’s the technique for minimising $f(\pmb x)$ subject to several equality constraints $g _ 1(\pmb x) = 0, g _ 2(\pmb x) = 0, \ldots, g _ l(\pmb x) = 0$?
Find the stationary points of the Lagrangian
\[\Lambda(\lambda_1, \ldots, \lambda_l, \pmb x) = f(\pmb x) - \lambda_1 g_1(\pmb x) - \ldots - \lambda_l g_l(\pmb x)\]
Say you’re minimising $f(\pmb x)$ subject to several equality constraints $g _ 1(\pmb x) = 0, g _ 2(\pmb x) = 0, \ldots, g _ l(\pmb x) = 0$ and you’ve formed the Lagrangian
\[\Lambda(\lambda_1, \ldots, \lambda_l, \pmb x) = f(\pmb x) - \lambda_1 g_1(\pmb x) - \ldots - \lambda_l g_l(\pmb x)\]
How can you tell if a stationary point of the Lagrangian is a minimum, maximum, or saddle point?
It’s hard in general; in practice you evaluate $f$ at all the candidate points and see which are smallest (or largest, if maximising).
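As a worked instance of the single-constraint technique: minimise $f(x, y) = x + y$ subject to $g(x, y) = x^2 + y^2 - 1 = 0$. The stationary-point equations are
\[1 = 2\lambda x, \quad 1 = 2\lambda y, \quad x^2 + y^2 = 1\]
so $x = y = \pm 1/\sqrt 2$. Evaluating both candidates, $f(1/\sqrt 2, 1/\sqrt 2) = \sqrt 2$ and $f(-1/\sqrt 2, -1/\sqrt 2) = -\sqrt 2$, so the latter is the minimum (and the former the maximum).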
What’s the technique for minimising $f(\pmb x)$ subject to an inequality constraint $g _ 1(\pmb x) \ge 0$?
Consider whether the constraint is tight (i.e. $g _ 1(\pmb x) = 0$) or slack (i.e. $g _ 1(\pmb x) > 0$) and proceed accordingly.
What does it mean for an inequality constraint $g _ 1(\pmb x) \ge 0$ to be slack?
That the constraint holds strictly at the point in question, i.e. $g _ 1(\pmb x) > 0$.
How does knowing an inequality constraint $g _ 1(\pmb x) \ge 0$ is slack, i.e. $g _ 1(\pmb x) > 0$, help you find a solution?
You can ignore it and then just see if a potential solution satisfies it at the end.
What does it mean for an inequality constraint $g _ 1(\pmb x) \ge 0$ to be tight?
That the constraint holds with equality at the point in question, i.e. $g _ 1(\pmb x) = 0$.
How does knowing an inequality constraint $g _ 1(\pmb x) \ge 0$ is tight, i.e. $g _ 1(\pmb x) = 0$, help you find a solution?
You can treat it like an equality constraint.
What’s the technique for minimising $f(\pmb x)$ subject to several inequality constraints $g _ 1(\pmb x) \ge 0, \ldots, g _ l(\pmb x) \ge 0$?
Consider whether each is slack or tight and proceed accordingly.
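For example, to minimise $f(x) = (x - 2)^2$ subject to $g _ 1(x) = x - 3 \ge 0$: if the constraint were slack, the unconstrained minimum $x = 2$ would apply, but it violates $x \ge 3$; treating the constraint as tight instead gives $x = 3$, which is the constrained minimum.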
Approximate methods
When we use root finding methods like Newton’s method, we can hope for a relative error on the order of $O(\epsilon)$, where $\epsilon$ is machine epsilon. What’s the best we can hope for when solving optimisation (i.e. minimisation) problems?
A relative error on the order of $O(\sqrt \epsilon)$, because near a minimum $f$ varies only quadratically, so changes of order $\epsilon$ in $f(x)$ correspond to changes of order $\sqrt \epsilon$ in $x$.
What is a bracket when bracketing a minimum?
A triple $a < b < c$ with $f(a) > f(b)$ and $f(c) > f(b)$.
When bracketing a minimum, why is a triple $(a, b, c)$ with $f(a), f(c) > f(b)$ a sensible idea?
Because the function has to “turn around” somewhere between $a$ and $c$, so there must be at least one local minimum.
Golden section search
Say you’ve bracketed a minimum in $(a, b, c)$. How can you refine the bracket given $z \in (a, b)$?
- If $f(z) < f(b)$, take the new bracket as $(a, z, b)$.
- If $f(z) > f(b)$, take the new bracket as $(z, b, c)$.
Say you’ve bracketed a minimum in $(a, b, c)$. How should we pick a new point $z$ to narrow down the minimum (there are two cases)?
- If $b - a > c - b$, take $z = a + \phi (b - a)$.
- If $b - a < c - b$, take $z = c - \phi(c - b)$.
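Putting the refinement rule and the choice of $z$ together: a minimal sketch of golden section search in Python (the function name, tolerance and usage example are illustrative; a careful implementation would also cache $f(b)$ rather than re-evaluating it each iteration):

```python
PHI = (5 ** 0.5 - 1) / 2  # ~0.618

def golden_section_search(f, a, b, c, tol=1e-8):
    # assumes a valid bracket: a < b < c with f(a) > f(b) < f(c)
    while c - a > tol:
        if b - a > c - b:                # subdivide the larger interval
            z = a + PHI * (b - a)        # z lies in (a, b)
            if f(z) < f(b):
                b, c = z, b              # new bracket (a, z, b)
            else:
                a = z                    # new bracket (z, b, c)
        else:
            z = c - PHI * (c - b)        # z lies in (b, c)
            if f(z) < f(b):
                a, b = b, z              # new bracket (b, z, c)
            else:
                c = z                    # new bracket (a, b, z)
    return b

# e.g. golden_section_search(lambda x: (x - 2) ** 2, 0.0, 1.0, 5.0) -> roughly 2.0
```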
What type of convergence does golden section search have?
Linear.
Why does golden section search not have quite as fast convergence as interval bisection?
Because the bracket shrinks by roughly a factor of $0.618$ each time, compared to a factor of $0.5$.
Golden section search requires a starting bracket $(a, b, c)$. What’s a good way to do an exponentially increasing search for a starting bracket?
Start with two points $a < b$ with $f(b) < f(a)$, then repeatedly step further in the downhill direction, increasing the step size by a constant factor each time, until $f$ increases again; the last three points then form a bracket.
Prove that the optimal ratio to subdivide a bracket $(a _ n, b _ n, c _ n)$ for a minimum during golden section search is $z = a _ n + \phi(b _ n - a _ n)$ or $z = c _ n - \phi(c _ n - b _ n)$ where $\phi$ is the golden ratio.
Suppose WLOG that $(b, c)$ is the larger interval, so we pick $z \in (b, c)$. The new bracket is either $(a, b, z)$, of length $z - a$, or $(b, z, c)$, of length $c - b$. To avoid being penalised by bad luck, make these equal: $z - a = c - b$. Requiring additionally that the new bracket has the same proportions as the old one (so the same analysis applies at every step) gives $\phi^2 = 1 - \phi$, whose positive root is $\phi = (\sqrt 5 - 1)/2 \approx 0.618$, and hence $z = c - \phi(c - b)$ (symmetrically, $z = a + \phi(b - a)$ if $(a, b)$ is larger).
During golden section search on a bracket $(a, b, c)$, there’s a choice to either pick $z \in (a, b)$ or $z \in (b, c)$. How do you decide which one to choose?
Subdivide the largest interval.
Assuming that $(b, c)$ is the largest interval in the bracket $(a, b, c)$ during golden section search, could you write an equation for picking $z \in (b, c)$ such that the interval is subdivided in the ratio $1 - \phi : \phi$?
\[z = b + (1 - \phi)(c - b) = c - \phi(c - b)\]
Assuming that $(b, c)$ is the largest interval in the bracket $(a, b, c)$ during golden section search, what might the length of the new interval be, depending on an evaluation of $f(z)$, and what should you assume so bad luck doesn’t penalise you?
$c - b$ or $z - a$; you should assume they’re equal:
\[c - b = z-a\]
If, during golden section search, the bracket $(a, b, c)$ was split into the ratio $1 - \phi : \phi$, then can you write an equation linking $a, b, c$ together?
\[c - b = \phi(c - a)\]
(taking $(b, c)$ as the larger interval; symmetrically, $b - a = \phi(c - a)$ if $(a, b)$ is larger).
Other 1D algorithms
Muller’s method is a root finding algorithm that uses parabolic interpolation. What’s the name of the optimisation algorithm (i.e. finding a minimum) that uses a similar idea?
Successive parabolic interpolation.
Halley’s method uses the second-degree Taylor polynomial to find the roots of a given function. What happens if you try and apply this technique to create an optimisation algorithm for finding the minimum of a function?
It is equivalent to Newton’s method on $f'$.
What’s the best algorithm to use for a numerical one-dimensional optimisation problem?
Brent’s method.
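In practice you’d reach for a library implementation; for example, SciPy’s 1D minimiser uses Brent’s method by default:

```python
from scipy.optimize import minimize_scalar

# minimize_scalar defaults to Brent's method
res = minimize_scalar(lambda x: (x - 2.0) ** 2)
print(res.x)  # -> roughly 2.0
```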
Gradient descent
What’s the general pattern of an iteration for line-search optimisation algorithms, given a current $\pmb x _ n$?
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
There are two desirable properties we want to be true of the direction $\pmb d _ n$. What are they?
- $\pmb d _ n$ is a descent direction, i.e. it moves down the function
- $\pmb d _ n, \pmb d _ {n-1}, \cdots$ should not veer wildly in different directions.
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
One desirable property of $\pmb d _ n$ is that it is a descent direction, i.e. it moves “down” the function towards the minimum. How could we write this mathematically in terms of $f, \pmb d _ n$ and $\alpha _ n$?
\[\frac{\text d f(\pmb x_n + \alpha_n \pmb d_n)}{\text d \alpha} < 0\]
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
One desirable property of $\pmb d _ n$ is that it is a descent direction, i.e. it moves “down” the function towards the minimum. This can be written mathematically as
\[\frac{\text d f(\pmb x_n + \alpha_n \pmb d_n)}{\text d \alpha} < 0\]
What is this equivalent to, in terms of a dot product?
\[\pmb d_n^\intercal \frac{\text d f}{\text d \pmb x}(\pmb x_n) < 0\]
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
There are two desirable properties we want to be true of the step length $\alpha _ n$. What are they?
- $\alpha _ n$ should loosely minimise $f(\pmb x _ n + \alpha _ n \pmb d _ n)$.
- We don’t run into Zeno’s paradox: $\sum^\infty _ {n=1} \alpha _ n = \infty$.
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
How does co-ordinate gradient descent choose $\pmb d _ n$?
It cycles through unit vectors $\pmb e _ i$.
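A minimal sketch of this in Python, delegating the 1D minimisation along each $\pmb e _ i$ to SciPy’s minimize_scalar (Brent’s method, as recommended above); the function name and cycle count are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, n_cycles=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_cycles):
        for i in range(x.size):
            e = np.zeros(x.size)
            e[i] = 1.0                                   # direction d_n = e_i
            # step length alpha_n: minimise f along the i-th coordinate
            alpha = minimize_scalar(lambda a: f(x + a * e)).x
            x = x + alpha * e
    return x

# e.g. coordinate_descent(lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2,
#                         [0.0, 0.0])  # -> roughly [1, -2]
```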
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
How does steepest gradient descent choose $\pmb d _ n$ (in terms of $f$)?
\[\pmb d_n = -\frac{\text d f}{\text d\pmb x}(\pmb x_n)\]
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
Steepest gradient descent chooses $\pmb d _ n$ as
\[\pmb d_n = -\frac{\text d f}{\text d\pmb x}(\pmb x_n)\]
What’s a high-level explanation of how conjugate gradient descent improves this?
It picks directions that won’t “undo” our previous minimisation efforts along previous directions.
What does it mean for a set of directions $\pmb d _ 1, \cdots, \pmb d _ n$ to be conjugate?
\[\pmb d_n^\intercal \mathbf{H} \pmb d_m = 0\]
for all $m < n$, where $\mathbf H$ is the Hessian of $f$.
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
Assume that we’ve determined a direction $\pmb d _ n$. Now all that’s left is to determine a step length $\alpha _ n$. What would we ideally want to minimise here?
$f(\pmb x _ n + \alpha _ n \pmb d _ n)$, as a function of $\alpha _ n$.
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
Assume that we’ve determined a direction $\pmb d _ n$. Now all that’s left is to determine a step length $\alpha _ n$. Here we’re minimising $f(\pmb x _ n + \alpha _ n \pmb d _ n)$ with respect to $\alpha _ n$, so it could make sense to use one of the 1D optimisation algorithms like golden section search. Why is this a bad idea in practice?
It’s often faster to only approximately minimise $f$ in order to make each individual iteration faster.
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
Assume that we’ve determined a direction $\pmb d _ n$. Now all that’s left is to determine a step length $\alpha _ n$. One approach is to use backtracking: initialising $\alpha _ n$ to $\alpha^\text{init}$ and then successively halving $\alpha _ n$ until we improve $f(\pmb x _ n + \alpha _ n \pmb d _ n)$. Why is the condition that
\[f(\pmb x_n + \alpha_n \pmb d_n) < f(\pmb x_n)\]
not a good idea?
It means that we might make infinitesimal progress.
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
Assume that we’ve determined a direction $\pmb d _ n$. Now all that’s left is to determine a step length $\alpha _ n$. One approach is to use backtracking: initialising $\alpha _ n$ to $\alpha^\text{init}$ and then successively halving $\alpha _ n$ until we improve $f(\pmb x _ n + \alpha _ n \pmb d _ n)$. A naïve condition for improvement would be
\[f(\pmb x_n + \alpha_n \pmb d_n) < f(\pmb x_n)\]
but this means that we might make infinitesimal progress. What’s the name and formula for the rule that is a better condition than this?
The Armijo rule,
\[f(\pmb x_n + \alpha_n \pmb d_n) < f(\pmb x_n) + \sigma\alpha_{n} \pmb d_n^\intercal \frac{\text d f}{\text d \pmb x}(\pmb x_n)\]
where $\sigma$ is a constant roughly somewhere from $10^{-4}$ to $10^{-1}$.
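Combining the steepest descent direction with Armijo backtracking gives a complete, if basic, line-search method. A minimal sketch in Python/NumPy; grad_f, the tolerances and the usage example are illustrative:

```python
import numpy as np

def steepest_descent(f, grad_f, x0, sigma=1e-4, alpha_init=1.0,
                     tol=1e-8, max_iters=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break                        # gradient ~ 0: stop
        d = -g                           # steepest descent direction
        alpha, fx = alpha_init, f(x)
        # backtrack: halve alpha until the Armijo condition holds
        while f(x + alpha * d) >= fx + sigma * alpha * (d @ g):
            alpha /= 2
        x = x + alpha * d
    return x

# e.g. steepest_descent(lambda x: x @ x, lambda x: 2 * x,
#                       np.array([3.0, 4.0]))  # -> roughly [0, 0]
```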
What is the order of convergence for gradient descent or conjugate gradient descent?
Linear.
The general pattern of an iteration for $\pmb x _ n$ for a line-search optimisation algorithm looks like:
- Choose a direction $\pmb d _ n$
- Choose a step length $\alpha _ n$
- Set $\pmb x _ {n+1} = \pmb x _ n + \alpha _ n \pmb d _ n$
What does this look like for Newton’s method when applied to optimisation?
- Choose $\pmb d _ n = -\mathbf H^{-1}(f)(\pmb x _ n) \frac{\text d f}{\text d \pmb x}(\pmb x _ n)$.
- Set $\alpha _ n = 1$.
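A minimal sketch of this in Python/NumPy; hess_f is assumed to return the Hessian $\mathbf H(f)(\pmb x)$, and the names and tolerances are illustrative:

```python
import numpy as np

def newton_minimise(grad_f, hess_f, x0, tol=1e-10, max_iters=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # solve H d = -g rather than forming the inverse Hessian explicitly
        d = np.linalg.solve(hess_f(x), -g)
        x = x + d                        # step length alpha_n = 1
    return x
```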
Where does Newton’s method for optimising $f$ (as a line search method) come from?
Approximating $f$ as
\[\hat f(\pmb x) = f(\pmb x_0) + (\pmb x - \pmb x_0)^\intercal \frac{\text d f}{\text d \pmb x}(\pmb x_0) + \frac{1}{2} (\pmb x - \pmb x_0)^\intercal \mathbf H(f)(\pmb x_0) (\pmb x - \pmb x_0)\]and then solving the first-order condition.
When does Newton’s method for multidimensional optimisation, i.e.
\[\pmb x_{n+1} = \pmb x_n -\mathbf H^{-1}(f)(\pmb x_n) \frac{\text d f}{\text d \pmb x}(\pmb x_n)\]
guarantee (quadratic) convergence?
When the Hessian is positive definite.
What is the name of the analogue of the secant method when applied to multidimensional optimisation?
BFGS.
Broyden’s method for multidimensional root finding keeps track of an approximate Jacobian in order to perform iterations that look like
\[\pmb x_n = \pmb x_{n-1} - \hat{\mathbf J}_n^{-1} f(\pmb x_{n-1})\]
BFGS is the analogue when Newton’s method is used for multidimensional optimisation instead. Rather than tracking an approximate Jacobian $\hat{\mathbf J}$, we keep track of an approximate Hessian $\hat{\mathbf H}$. What direction $\pmb d _ n$ and step length $\alpha _ n$ do we use?
- $\pmb d _ n = -\hat{\mathbf H} _ n^{-1} \frac{\text d f}{\text d \pmb x}(\pmb x _ n)$
- $\alpha _ n = 1$
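In practice you’d use a library implementation rather than rolling your own; for example, SciPy exposes BFGS through minimize (the Rosenbrock function here is just a standard test problem):

```python
from scipy.optimize import minimize, rosen, rosen_der

result = minimize(rosen, x0=[0.0, 0.0], jac=rosen_der, method="BFGS")
print(result.x)  # -> roughly [1, 1], the minimiser of the Rosenbrock function
```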
What order of convergence do we get for multidimensional optimisation with BFGS?
Superlinear.