Continuous Optimisation HT26, Trust-region methods


Flashcards

Basic definitions

Suppose we have the unconstrained optimisation problem

\[\min f(x) \text{ s.t. } x \in \mathbb R^n\]

Compare linesearch methods to trust region methods.

  • Linesearch methods
    • Choose descent direction $s^k$
    • Compute stepsize $\alpha^k$ to reduce $f(x^k + \alpha s^k)$
    • Update $x^{k+1} = x^k + \alpha^k s^k$
  • Trust region methods:
    • Pick direction $s^k$ to reduce local model of $f(x^k + s^k)$
    • Accept $x^{k+1} = x^k + s^k$ if decrease in model is also achieved by $f(x^k + s^k)$
    • Otherwise, set $x^{k+1} = x^k$ and refine the model

In trust region methods, we

  • Pick direction $s^k$ to reduce local model of $f(x^k + s^k)$
  • Accept $x^{k+1} = x^k + s^k$ if decrease in model is also achieved by $f(x^k + s^k)$
  • Otherwise, set $x^{k+1} = x^k$ and refine the model

@State how we might approximate $f$ by a linear model and a quadratic model at step $k$, and the problems with both approaches.

Linear model:

\[l _ k(s) := f(x^k) + s^\top \nabla f(x^k)\]

Quadratic model:

\[q _ k(s) := f(x^k) + s^\top \nabla f(x^k) + \frac 1 2 s^\top \nabla^2 f(x^k) s\]

Problems:

  • Models may not resemble $f(x^k + s)$ when $s$ is large
  • Models may be unbounded from below
    • $l _ k(s)$ is always unbounded below, unless $\nabla f(x^k) = 0$
    • $q _ k(s)$ is always unbounded below if $\nabla^2 f(x^k)$ is negative definite or indefinite, and sometimes if $\nabla^2 f(x^k)$ is positive semidefinite

In trust region methods, we

  • Pick direction $s^k$ to reduce local model of $f(x^k + s^k)$
  • Accept $x^{k+1} = x^k + s^k$ if decrease in model is also achieved by $f(x^k + s^k)$
  • Otherwise, set $x^{k+1} = x^k$ and refine the model

For example, we might use a local linear or quadratic model $m _ k$. However, one problem is that these models may not resemble $f(x^k + s)$ when $s$ is large, or these models might not be bounded below.

What is the solution to this, and how does this lead to the trust region subproblem?

We trust the models only in a trust region, defined by a trust region constraint

\[ \vert \vert s \vert \vert \le \Delta _ k\]

for some radius $\Delta _ k > 0$. Thus we have the trust region subproblem

\[\min _ {s \in \mathbb R^n} m _ k(s) \quad\text{s.t.}\quad \vert \vert s \vert \vert \le \Delta _ k\]

where $m _ k$ is our local model.

In trust region methods, we

  • Pick direction $s^k$ to reduce local model of $f(x^k + s^k)$
  • Accept $x^{k+1} = x^k + s^k$ if decrease in model is also achieved by $f(x^k + s^k)$
  • Otherwise, set $x^{k+1} = x^k$ and refine the model

where we trust the models of $f$ in a trust region defined by $ \vert \vert s \vert \vert \le \Delta _ k$ for some $\Delta _ k$. How is $\Delta _ k$ chosen?

It is chosen based on the value of

\[\rho _ k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m _ k(s^k)}\]

where $s^k$ is a (approximate) minimiser of $s _ k$ (i.e. the ratio of the true function decrease to the predicted model decrease). Then if:

  • $\rho _ k$ is not too much smaller than $1$: $x^{k+1} := x^k + s^k$, $\Delta _ {k+1} \ge \Delta _ k$
  • $\rho _ k$ close to or $\ge 1$: $\Delta _ k$ is increased
  • $\rho _ k$ is much smaller than $1$: $x^{k+1} = x^k$, and $\Delta _ k$ is reduced

@State the steps in a generic trust region @algorithm.

Suppose we have:

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho _ k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m _ k(s^k)}$.
  • Then:
    • (Very successful) If $\rho _ k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta _ {k+1} = 2\Delta _ k$
    • (Successful) Else if $\rho _ k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta _ {k+1} := \Delta _ k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta _ {k+1} := \frac 1 2 \Delta _ k$
  • Let $k := k+1$.

@exam~

Convergence analysis

Suppose we have the unconstrained optimisation problem

\[\min f(x) \text{ s.t. } x \in \mathbb R^n\]

Recall that the trust-region method works as follows. Suppose further we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho _ k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m _ k(s^k)}$.
  • Then:
    • (Very successful) If $\rho _ k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta _ {k+1} = 2\Delta _ k$
    • (Successful) Else if $\rho _ k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta _ {k+1} := \Delta _ k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta _ {k+1} := \frac 1 2 \Delta _ k$
  • Let $k := k+1$.

@Define what is meant by the Cauchy point at step $k$, and why this is useful.

Assume $\nabla f(x^k) \ne 0$ (guaranteed inside the GTR loop by the termination test $\|\nabla f(x^k)\| \ge \epsilon$; without this the argmin below is degenerate, see remark). The Cauchy step is

\[s^k _ c := -\alpha _ c^k \nabla f(x^k), \quad \text{where} \quad \alpha _ c^k := \arg\min _ {\alpha} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad 0 < \alpha \le \frac{\Delta_k}{\|\nabla f(x^k)\|}.\]

This is a linesearch along the steepest-descent direction of $m_k$ at $x^k$, restricted to the trust region (the upper bound on $\alpha$ comes from $\|\alpha \nabla f(x^k)\| = \alpha \|\nabla f(x^k)\| \le \Delta_k$, using $\alpha > 0$). The Cauchy point itself is

\[y^k _ c := x^k + s^k _ c.\]

Why useful: it gives the “minimal” condition for sufficient decrease in the model — the Cauchy decrease condition

\[m _ k(s^k) \le m _ k(s^k _ c), \quad \text{and} \quad \|s^k\| \le \Delta _ k.\]

Any reasonable subproblem solver (CG, dogleg, etc.) does at least as well as the Cauchy point per iteration, so this condition is easy to enforce while pinning down the global-convergence guarantee (∆gtr-global-convergence-theorem).

Remark (why $\nabla f(x^k) \ne 0$ matters): if $\nabla f(x^k) = 0$, then $-\alpha \nabla f(x^k) = 0$ for every $\alpha$, so $m _ k(0) = f(x^k)$ is constant in $\alpha$ and the constraint $\|\alpha \cdot 0\| \le \Delta _ k$ holds for every $\alpha > 0$; the argmin is then the whole set $(0, \infty)$, not a single value. The GTR termination test rules this out before the Cauchy point is ever computed.

@exam~

Suppose we have the unconstrained optimisation problem

\[\min f(x) \text{ s.t. } x \in \mathbb R^n\]

Recall that the trust-region method works as follows. Suppose further we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall furthermore the Cauchy point $y^k_c := x^k + s^k_c$ given by

\[s^k_c := -\alpha_c^k \nabla f(x^k)\]

where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

@State a minimal condition of sufficient decrease in this model.

\[m _ k(s^k) \le m _ k(s^k _ c), \text{ and } \vert \vert s^k \vert \vert \le \Delta _ k \]

@exam~

Suppose we have the unconstrained optimisation problem

\[\min f(x) \text{ s.t. } x \in \mathbb R^n\]

Recall that the trust-region method works as follows. Suppose further we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall further the Cauchy point $y^k_c := x^k + s^k_c$ given by

\[s^k_c := -\alpha_c^k \nabla f(x^k)\]

where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

In this context, @state a theorem about the global convergence of the generic trust region method.

Suppose:

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$
  • $\nabla f$ is Lipschitz continuous on $\mathbb R^n$
  • $\{x^k\}$ is generated by the generic trust region method
  • The computation of $s^k$ is such that $m _ k(s^k) \le m _ k(s^k _ c)$ for all $k$

Then either:

  • There exists $k \ge 0$ such that $\nabla f(x^k) = 0$, or
  • $\liminf _ {k \to \infty} \vert \vert \nabla f(x^k) \vert \vert = 0$

@exam~

Suppose we have the unconstrained optimisation problem

\[\min f(x) \text{ s.t. } x \in \mathbb R^n\]

Recall that the trust-region method works as follows. Suppose further we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall further the Cauchy point $y^k_c := x^k + s^k_c$ given by

\[s^k_c := -\alpha_c^k \nabla f(x^k)\]

where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

@State and @prove how to compute $\alpha _ c^k$.

Let $\overline \alpha := \frac{\Delta _ k}{ \vert \vert \nabla f(x^k) \vert \vert }$ and $h^k := \nabla f(x^k)^\top \nabla^2 f(x^k) \nabla f(x^k)$. Then:

  • If $h^k > 0$: $\alpha^k _ c = \min(\frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{h^k}, \overline \alpha)$
  • If $h^k \le 0$: $\alpha^k _ c = \overline \alpha$

Expanding our local quadratic model $m _ k$, we have

\[m _ k(s) = f(x^k) + s^\top \nabla f(x^k) + \frac 1 2 s^\top \nabla^2 f(x^k) s\]

and we aim to find $\alpha _ c^k$ as the global solution of

\[\min _ {\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \text{ s.t. } \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

given that $\nabla f(x^k) \ne 0$.

The conditions that $\alpha > 0$ and $ \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k$ imply $0 < \alpha \le \frac{\Delta _ k}{ \vert \vert \nabla f(x^k) \vert \vert } := \overline \alpha$. Define

\[\psi(\alpha) := m _ k(-\alpha \nabla f(x^k)) = f(x^k) - \alpha \vert \vert \nabla f(x^k) \vert \vert ^2 + \frac{\alpha^2}{2} h^k\]

where $h^k := \nabla f(x^k)^\top \nabla^2 f(x^k) \nabla f(x^k)$. Then

\[\psi'(0) = - \vert \vert \nabla f(x^k) \vert \vert ^2 < 0\]

so decreasing from $\alpha = 0$ for sufficiently small $\alpha$, thus $\alpha^k_c > 0$ (i.e. the minimum exists, since this rules out the case that any step $\alpha$ would strictly increase the objective).

Therefore if $h^k > 0$, it follows we may take $\alpha _ \min = \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{h^k} = \arg\min _ {\alpha > 0} \psi(\alpha)$ and so $\alpha^k _ c = \min(\alpha _ \min, \overline \alpha)$ (i.e. we are basically switching on whether $\alpha _ \min < \overline \alpha$. If this is the case, then the unconstrained minimiser is within the trust region and we should move directly to the unconstrained minimiser. If not, then the best we can do is move right up to the trust region boundary).

If $h^k \le 0$, $\psi(\alpha)$ is unbounded below on $\mathbb R$ and so $\alpha _ c^k = \overline \alpha$. (One way to think about this case is that $\alpha _ \min$ is basically $\infty$, so the $\min(\alpha _ \min, \overline \alpha)$) formula still holds.)

@exam~

Suppose we have the unconstrained optimisation problem

\[\min f(x) \text{ s.t. } x \in \mathbb R^n\]

Recall that the trust-region method works as follows. Suppose further we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall further the Cauchy point $y^k_c := x^k + s^k_c$ given by

\[s^k_c := -\alpha_c^k \nabla f(x^k)\]

where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

@State a lemma which connects the Cauchy decrease condition $m _ k(s^k) \le m _ k(s^k _ c)$ with the decrease in $f$.

Suppose:

  • $m _ k(s^k) \le m _ k(s^k _ c)$ for all $k$

Then for every $k$:

\[\begin{aligned} f(x^k) - m _ k(s^k) &\ge f(x^k) - m _ k(s^k _ c) \\ &\ge \frac 1 2 \vert \vert \nabla f(x^k) \vert \vert \min \left\{\Delta _ k, \frac{ \vert \vert \nabla f(x^k) \vert \vert }{ \vert \vert \nabla^2 f(x^k) \vert \vert }\right\} \end{aligned}\]

@exam~

Suppose we have the unconstrained optimisation problem $\min f(x)$ subject to $x \in \mathbb R^n$ and are using the trust-region method, where we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, we do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall further the Cauchy point $y^k_c := x^k + s^k_c$ given by $s^k_c := -\alpha_c^k \nabla f(x^k)$ where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

@Prove that if we have $m _ k(s^k) \le m _ k(s^k _ c)$ for all $k$, then

\[\begin{aligned} f(x^k) - m _ k(s^k) &\ge f(x^k) - m _ k(s^k _ c) \\ &\ge \frac 1 2 \vert \vert \nabla f(x^k) \vert \vert \min \left\{\Delta _ k, \frac{ \vert \vert \nabla f(x^k) \vert \vert }{ \vert \vert \nabla^2 f(x^k) \vert \vert }\right\} \end{aligned}\]

for all $k$.

Define

\[\psi(\alpha) := m _ k(-\alpha \nabla f(x^k)) = f(x^k) - \alpha \vert \vert \nabla f(x^k) \vert \vert ^2 + \frac{\alpha^2}{2} h^k\]

where $h^k := \nabla f(x^k)^\top \nabla^2 f(x^k) \nabla f(x^k)$.

If $h^k \le 0$, then the definition of $\psi$ implies

\[\begin{aligned} m _ k(s^k _ c) &= m _ k(-\alpha _ c^k \nabla f(x^k)) \\ &= \psi(\alpha^k _ c) \\ &\le f(x^k) - \alpha^k _ c \vert \vert \nabla f(x^k) \vert \vert ^2 \end{aligned}\]

and in this case, since the function is unbounded below (by increasing $\alpha^k _ c$), the constraint must be tight and so we also have

\[\begin{aligned} \alpha^k _ c &= \overline \alpha \\ &= \frac{\Delta _ k}{ \vert \vert \nabla f(x^k) \vert \vert } \end{aligned}\]

and hence

\[\begin{aligned} f(x^k) - m _ k(s^k _ c) &\ge \alpha _ c^k \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \frac{\Delta _ k}{ \vert \vert \nabla f(x^k) \vert \vert } \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \Delta _ k \vert \vert \nabla f(x^k) \vert \vert \\ &\ge \frac 1 2 \vert \vert \nabla f(x^k) \vert \vert \min \left\{\Delta _ k, \frac{ \vert \vert \nabla f(x^k) \vert \vert }{ \vert \vert \nabla^2 f(x^k) \vert \vert }\right\} \end{aligned}\]

Otherwise, $h^k > 0$. Then $\alpha^k _ c = \min\{\alpha _ \min, \overline \alpha\}$ where $\alpha _ \min = \vert \vert \nabla f(x^k) \vert \vert ^2 / h^k$. Assume first that $\alpha^k _ c = \overline \alpha$. Then $\alpha^k _ c \le \alpha _ \min$, which from the expression of $\alpha _ \min$, implies $\alpha _ c^k h^k \le \vert \vert \nabla f(x^k) \vert \vert ^2$. Using this bound in the expression of $\psi(\alpha)$, we obtain

\[\begin{aligned} f(x^k) - m _ k(s^k _ c) &= f(x^k) - \psi(\alpha^k _ c) \\ &= \alpha^k _ c \vert \vert \nabla f(x^k) \vert \vert ^2 - \frac{(\alpha^k _ c)^2}{2} h^k \\ &= \alpha^k _ c \vert \vert \nabla f(x^k) \vert \vert ^2 - \frac{\alpha^k _ c}{2}(\alpha^k _ c h^k) \\ &\ge \left(\alpha^k _ c - \frac{\alpha^k _ c}{2}\right) \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \frac{\alpha^k _ c}{2} \vert \vert \nabla f(x^k) \vert \vert ^2 \end{aligned}\]

and from the expression of $\overline \alpha$, we find

\[\begin{aligned} f(x^k) - m _ k(s^k _ c) &\ge \frac{\Delta _ k}{2 \vert \vert \nabla f(x^k) \vert \vert } \vert \vert \nabla f(x^k) \vert \vert ^2 \\ &= \frac 1 2 \Delta _ k \vert \vert \nabla f(x^k) \vert \vert \\ &\ge \frac 1 2 \vert \vert \nabla f(x^k) \vert \vert \min \left\{\Delta _ k, \frac{ \vert \vert \nabla f(x^k) \vert \vert }{ \vert \vert \nabla^2 f(x^k) \vert \vert }\right\} \end{aligned}\]

Assume now that $h^k > 0$ and $\alpha^k _ c = \alpha _ \min = \vert \vert \nabla f(x^k) \vert \vert ^2 / h^k$ (i.e. the constraint is not active, so we may obtain this minimum by differentiating and setting to zero, see ∆exact-linesearch-quadratic-optimal-stepsize-proof). Replacing this value in the model decrease, we obtain

\[\begin{aligned} f(x^k) - m _ k(s^k _ c) &= \alpha^k _ c \vert \vert \nabla f(x^k) \vert \vert ^2 - \frac{(\alpha^k _ c)^2}{2} h^k \\ &= \frac{ \vert \vert \nabla f(x^k) \vert \vert ^4}{2h^k} \end{aligned}\]

and hence by Cauchy-Schwarz and the Rayleigh quotient inequalities, we have

\[\begin{aligned} \frac{ \vert \vert \nabla f(x^k) \vert \vert ^4}{2h^k} &= \frac{ \vert \vert \nabla f(x^k) \vert \vert ^4}{2(\nabla f(x^k))^\top \nabla^2 f(x^k) \nabla f(x^k)} \\ &\ge \frac{ \vert \vert \nabla f(x^k) \vert \vert ^4}{2 \vert \vert \nabla^2 f(x^k) \vert \vert \cdot \vert \vert \nabla f(x^k) \vert \vert ^2} \\ &\ge \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{2 \vert \vert \nabla^2 f(x^k) \vert \vert } \end{aligned}\]

Hence

\[\begin{aligned} f(x^k) - m _ k(s^k _ c) &\ge \frac{ \vert \vert \nabla f(x^k) \vert \vert ^2}{2 \vert \vert \nabla^2 f(x^k) \vert \vert } \\ &\ge \frac 1 2 \vert \vert \nabla f(x^k) \vert \vert \min \left\{\Delta _ k, \frac{ \vert \vert \nabla f(x^k) \vert \vert }{ \vert \vert \nabla^2 f(x^k) \vert \vert }\right\} \end{aligned}\]

Suppose we have the unconstrained optimisation problem $\min f(x)$ subject to $x \in \mathbb R^n$ and are using the trust-region method, where we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, we do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall further the Cauchy point $y^k_c := x^k + s^k_c$ given by $s^k_c := -\alpha_c^k \nabla f(x^k)$ where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

@State a lemma which gives a lower bound on the trust region radius $\Delta _ k$.

Suppose:

  • $f \in \mathcal C^2(\mathbb R^n)$
  • $\nabla f$ is Lipschitz continuous on $\mathbb R^n$ with Lipschitz constant $L$
  • $m _ k(s^k) \le m _ k(s^k _ c)$ for all $k$
  • For all $k$, there exists $\epsilon > 0$ such that $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$

Then there exists a constant $c \in (0, 1)$ independent of $k$ such that

\[\Delta _ k \ge \frac{c}{L}\epsilon\]

for all $k \ge 0$.

Suppose we have the unconstrained optimisation problem $\min f(x)$ subject to $x \in \mathbb R^n$ and are using the trust-region method, where we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, we do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall further the Cauchy point $y^k_c := x^k + s^k_c$ given by $s^k_c := -\alpha_c^k \nabla f(x^k)$ where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

@State a lemma which says that at least one limit point is stationary.

Suppose:

  • $f \in \mathcal C^2 (\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$
  • $\nabla f$ is Lipschitz continuous on $\mathbb R^n$ with Lipschitz constant $L$
  • $\{x^k\}$ is generated by the generic trust region method
  • $m _ k(s^k) \le m _ k(s^k _ c)$ for all $k$

Then either:

  • there exists $k \ge 0$ such that $\nabla f(x^k) = 0$, or
  • $\lim \inf _ {k \to \infty} \vert \vert \nabla f(x^k) \vert \vert = 0$

Suppose we have the unconstrained optimisation problem $\min f(x)$ subject to $x \in \mathbb R^n$ and are using the trust-region method, where we have

  • $\Delta _ 0 > 0$
  • $x^0 \in \mathbb R^n$, $\epsilon > 0$

Then while $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$, we do the following:

  • Form the local quadratic model $m _ k(s)$ of $f(x^k + s)$
  • Solve (approximately) the trust region subproblem for $s^k$ with $m _ k(s^k) < f(x^k)$. Compute $\rho_k := \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - m_k(s^k)}$.
  • Then:
    • (Very successful) If $\rho_k \ge 0.9$ then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} = 2\Delta_k$
    • (Successful) Else if $\rho_k \ge 0.1$, then: set $x^{k+1} := x^k + s^k$, and $\Delta_{k+1} := \Delta_k$
    • (Unsuccessful) Else set $x^{k+1} = x^k$ and $\Delta_{k+1} := \frac 1 2 \Delta_k$
  • Let $k := k+1$.

Recall further the Cauchy point $y^k_c := x^k + s^k_c$ given by $s^k_c := -\alpha_c^k \nabla f(x^k)$ where

\[\alpha_c^k := \arg\min_{\alpha > 0} m _ k(-\alpha \nabla f(x^k)) \quad\text{s.t.}\quad \vert \vert \alpha \nabla f(x^k) \vert \vert \le \Delta _ k\]

@Prove that if

  • $f \in \mathcal C^2 (\mathbb R^n)$
  • $f$ is bounded below on $\mathbb R^n$
  • $\nabla f$ is Lipschitz continuous on $\mathbb R^n$ with Lipschitz constant $L$
  • $\{x^k\}$ is generated by the generic trust region method
  • $m _ k(s^k) \le m _ k(s^k _ c)$ for all $k$

then either:

  • there exists $k \ge 0$ such that $\nabla f(x^k) = 0$, or
  • $\lim \inf _ {k \to \infty} \vert \vert \nabla f(x^k) \vert \vert = 0$

If there exists $k$ such that $\nabla f(x^k) = 0$, then the method terminates.

Assume instead that there exists $\epsilon > 0$ such that $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$ for all $k$. Hence there are infinite many successful iterations $k$, which we say belong to a set $\mathcal S$. Hence by definition of $\rho _ k$, we obtain

\[\begin{aligned} f(x^k) - f(x^{k+1}) &\ge 0.1(f(x^k) - m _ k(s^k)) \\ &\ge \frac{0.1}{2} \vert \vert \nabla f(x^k) \vert \vert \min \left\{ \Delta _ k, \frac{ \vert \vert \nabla f(x^k) \vert \vert }{ \vert \vert \nabla^2 f(x^k) \vert \vert } \right\} \end{aligned}\]

where the second inequality is justified by ∆sufficient-model-decrease-given-cauchy-condition. This holds for all $k \in \mathcal S$.

As $\nabla f$ is Lipschitz continuous with Lipschitz constant $L$, it follows that $ \vert \vert \nabla^2 f(x) \vert \vert \le L$ for all $x$ (by general results about Lipschitz functions). Hence since $ \vert \vert \nabla f(x^k) \vert \vert \ge \epsilon$ for all $K$, it follows for all $k \in \mathcal S$ that

\[\begin{aligned} f(x^k) - f(x^{k+1}) &\ge 0.05\epsilon \min \left\{ \frac{\epsilon}{L}, \Delta _ k \right\} \\ &\ge 0.05\epsilon \min \left\{ \frac{\epsilon}{L}, \frac{c}{L}\epsilon \right\} \end{aligned}\]

where we have used ∆lower-bound-on-trust-region-radius. Hence by the fact $c \in (0, 1)$, it follows that for all $k \in \mathcal S$,

\[f(x^k) - f(x^{k+1}) \ge \frac{0.05c}{L}\epsilon^2 \quad (\star)\]

Since $f(x^k) \ge f _ \text{low}$ for all $k$, it follows that for all $k \ge 0$,

\[\begin{aligned} f(x^0) - f _ \text{low} &\ge f(x^0) - f(x^k) \\ &= \sum^{k-1} _ {i=0} (f(x^i) - f(x^{i+1})) \\ &= \sum^{k-1} _ {i \in \mathcal S} (f(x^i) - f(x^{i+1})) \end{aligned}\]

where we used $f(x^k) = f(x^{k+1})$ on unsuccessful $k$. Let $k \to \infty$. Then

\[\begin{aligned} f(x^0) - f _ \text{low} &\ge \sum^\infty _ {i=0} (f(x^i) - f(x^{i+1})) \\ &= \sum _ {i \in \mathcal S} (f(x^i) - f(x^{i+1})) \\ &\ge \vert S \vert \frac{0.05c}{L}\epsilon^2 \quad (\star\star) \end{aligned}\]

where we used $(\star)$ and that $ \vert \mathcal S \vert $ is the number of successful iterations. Since the left hand side of $(\star\star)$ is finite whereas the right hand side is infinite (since $ \vert \mathcal S \vert = \infty$). Hence there must exist some $k$ such that $ \vert \vert \nabla f(x^k) \vert \vert < \epsilon$, and so it follows that $\lim\inf _ {k \to \infty} \vert \vert \nabla f(x^k) \vert \vert = 0$.

@exam~

Solving the trust region subproblem

Suppose we have:

  • $h \in \mathbb R$
  • $\Delta > 0$
  • $g \in \mathbb R^{n}$
  • $H \in \mathbb R^{n\times n}$, symmetric

and consider the trust region subproblem

\[\min _ {s\in \mathbb R^n} m(s) = h + s^\top g + \frac 1 2 s^\top H s\quad \text{s.t. } \vert \vert s \vert \vert \le \Delta\]

@State a theorem giving necessary and sufficient conditions for $s^\ast$ be a global minimiser of this subproblem.

$s^\ast$ is a global minimiser iff there exists $\lambda^\ast \ge 0$ such that

  1. $(H + \lambda^\ast I) s^\ast = -g$
  2. $H + \lambda^\ast I \succeq 0$
  3. $\lambda^\ast ( \vert \vert s^\ast \vert \vert - \Delta) = 0$
  4. $ \vert \vert s^\ast \vert \vert \le \Delta$

Moreover, if $H + \lambda^\ast I$ is positive definite, then $s^\ast$ is the unique global minimiser.

@exam~

Suppose we have:

  • $h \in \mathbb R$
  • $\Delta > 0$
  • $g \in \mathbb R^{n}$
  • $H \in \mathbb R^{n\times n}$, symmetric

and consider the trust region subproblem

\[\min _ {s\in \mathbb R^n} m(s) = h + s^\top g + \frac 1 2 s^\top H s\quad \text{s.t. } \vert \vert s \vert \vert \le \Delta\]

Recall also the ∆exact-conditions-for-solving-trust-region-subproblem, i.e. that $s^\ast$ is a global minimiser iff there exists $\lambda^\ast \ge 0$ such that

  1. $(H + \lambda^\ast I) s^\ast = -g$
  2. $H + \lambda^\ast I \succeq 0$
  3. $\lambda^\ast ( \vert \vert s^\ast \vert \vert - \Delta) = 0$
  4. $ \vert \vert s^\ast \vert \vert \le \Delta$

and moreover, if $H + \lambda^\ast I$ is positive definite, then $s^\ast$ is the unique global minimiser. @State and @justify how to solve the trust region subproblem exact as a result.

Step 1: First check if $H$ is positive semidefinite, i.e. $\lambda _ \min(H) \ge 0$. If this is the case:

  • Compute the minimum norm solution $s$ of $Hs = -g$. This gives $-H^{-1} g$ if $H \succ 0$, or $-H^+ g$ if $H$ is singular and $g \in \text{range}(H)$.
  • If such an $s$ exists and $\|s\| \le \Delta$, we are done, then $s^\ast = s$ and $\lambda^\ast = 0$ (i.e. we are on the interior of the trust region).

(Why? If $H$ is positive semidefinite, then $\lambda^\ast$ might equal $0$. In this case we can try solving directly for $s^\ast$ and see if all the conditions are satisfies).

Step 2: If $H$ is not positive semidefinite, then it has a strictly negative eigenvalue and so we know that for $H + \lambda^\ast I \ge 0$ to hold we require $\lambda^\ast > 0$. If $H$ was positive semidefinite but such an $s$ didn’t exist, then we know we have to pick $\lambda^\ast > 0$ to make it work. So the constraint is active in either case. This gives us that

\[(H + \lambda I)s = -g, \quad \vert \vert s \vert \vert = \Delta\]

We know by the requirement that $H + \lambda^\ast I \succeq 0$, we must have $\lambda \ge -\lambda _ \min(H) =: \overline\lambda$. Hence we can set $s(\lambda) = -(H + \lambda I)^{-1} g$ and try and solve for

\[\|s(\lambda)\| = \lambda \quad \text{for } \lambda \ge \overline\lambda.\]

If this equation has a solution $\lambda^\ast$, then we are done, since we may calculate $s^\ast = -(H + \lambda I)^{-1}g$.

There is still the possibility that this equation has no root.

Suppose $H$ is positive definite and if we compute $s = -H^{-1}g$, we find that $ \vert \vert s \vert \vert \le \Delta$. Then $s^\ast = s$ uniquely and $\lambda^\ast = 0$.

Case 2: If $H$ is not positive definite, then we know that $\lambda^\ast \ge 0$

If $H$ is not positive definite, or $H$ is positive definite but $s = -H^{-1}g$ gives $ \vert \vert s \vert \vert \ge \Delta$. Then by the above ∆exact-conditions-for-solving-trust-region-subproblem we have that

\[(H + \lambda I)s = -g, \quad \vert \vert s \vert \vert = \Delta\]

and we know by positive semidefiniteness that $\lambda \ge \max\{0, -\lambda _ \min(H)\} := \underline \lambda$. Define

\[s(\lambda) = -(H + \lambda I)^{-1} g\]

for any $\lambda > \underline \lambda$. Then $s^\ast = s(\lambda^\ast)$ where $\lambda^\ast \ge \underline \lambda$ is a solution of

\[ \vert \vert s(\lambda) \vert \vert = \Delta, \quad \lambda \ge \underline \lambda\]

which is a nonlinear equation in one variable $\lambda$ for which we can use Newton’s method to solve.

@exam~

Suppose we have:

  • $h \in \mathbb R$
  • $\Delta > 0$
  • $g \in \mathbb R^{n}$
  • $H \in \mathbb R^{n\times n}$, symmetric

and consider the trust region subproblem

\[\min _ {s\in \mathbb R^n} m(s) = h + s^\top g + \frac 1 2 s^\top H s\quad \text{s.t. } \vert \vert s \vert \vert \le \Delta\]

Recall also the ∆exact-conditions-for-solving-trust-region-subproblem, i.e. that $s^\ast$ is a global minimiser iff there exists $\lambda^\ast \ge 0$ such that

  1. $(H + \lambda^\ast I) s^\ast = -g$
  2. $H + \lambda^\ast I \succeq 0$
  3. $\lambda^\ast ( \vert \vert s^\ast \vert \vert - \Delta) = 0$
  4. $ \vert \vert s^\ast \vert \vert \le \Delta$

and moreover, if $H + \lambda^\ast I$ is positive definite, then $s^\ast$ is the unique global minimiser. In the case where $H$ is not positive definite, in order to calculate $s^\ast$ via these conditions we end up solving the equation

\[ \vert \vert s(\lambda) \vert \vert = \Delta, \quad \lambda \ge \underline \lambda\]

where

\[s(\lambda) = -(H + \lambda I)^{-1} g\]

@State the “secular equation” in this case and explain why this is useful.

Setup: let $H = U^\top \Lambda U$ be the spectral decomposition of $H$, with $U$ orthogonal, $\Lambda = \mathrm{diag}(\lambda _ 1, \ldots, \lambda _ n)$ in increasing order, and define $\gamma := Ug \in \mathbb R^n$ so that $\gamma _ i = v _ i^\top g$ is the projection of $g$ onto the $i$-th eigenvector $v _ i$ of $H$. Then

\[\|s(\lambda)\|^2 = \gamma^\top (\Lambda + \lambda I)^{-2} \gamma = \sum _ {i=1}^n \frac{\gamma _ i^2}{(\lambda + \lambda _ i)^2}\]

(see ∆trust-region-secular-equation-proof).

Secular equation: rather than solving $\|s(\lambda)\| = \Delta$ directly, define

\[\phi(\lambda) := \frac{1}{\|s(\lambda)\|} - \frac{1}{\Delta} = \left(\sum _ {i=1}^n \frac{\gamma _ i^2}{(\lambda + \lambda _ i)^2}\right)^{-1/2} - \frac{1}{\Delta}\]

and solve $\phi(\lambda) = 0$ for $\lambda \in (\max\{0, -\lambda _ 1\}, \infty)$.

Why useful: $\phi$ has no poles on $(-\lambda _ 1, \infty)$ and is analytic there, unlike $\|s(\lambda)\|$ which blows up at $\lambda = -\lambda _ 1$. Newton’s method on $\phi$ is therefore globally well-behaved and locally quadratic — except in the ∆trust-region-hard-case, where $\gamma _ 1 = 0$, the $i = 1$ term in the sum drops out, and the secular equation has no solution on $\lambda > -\lambda _ 1$.

Suppose we have:

  • $h \in \mathbb R$
  • $\Delta > 0$
  • $g \in \mathbb R^{n}$
  • $H \in \mathbb R^{n\times n}$, symmetric

and consider the trust region subproblem

\[\min _ {s\in \mathbb R^n} m(s) = h + s^\top g + \frac 1 2 s^\top H s\quad \text{s.t. } \vert \vert s \vert \vert \le \Delta\]

Recall also the ∆exact-conditions-for-solving-trust-region-subproblem, i.e. that $s^\ast$ is a global minimiser iff there exists $\lambda^\ast \ge 0$ such that

  1. $(H + \lambda^\ast I) s^\ast = -g$
  2. $H + \lambda^\ast I \succeq 0$
  3. $\lambda^\ast ( \vert \vert s^\ast \vert \vert - \Delta) = 0$
  4. $ \vert \vert s^\ast \vert \vert \le \Delta$

and moreover, if $H + \lambda^\ast I$ is positive definite, then $s^\ast$ is the unique global minimiser. In the case where $H$ is not positive definite, in order to calculate $s^\ast$ via these conditions we end up solving the equation

\[ \vert \vert s(\lambda) \vert \vert = \Delta, \quad \lambda \ge \underline \lambda\]

where

\[s(\lambda) = -(H + \lambda I)^{-1} g\]

@Prove that rather than solving this equation, we may instead define

\[\begin{aligned} \phi(\lambda) &= \frac{1}{ \vert \vert s(\lambda) \vert \vert }-\frac 1 \Delta \\ &= \frac{1}{\sum^n _ {i=1} \frac{\gamma _ i^2}{(\lambda + \lambda _ i)^2}} - \frac{1}{\Delta} \end{aligned}\]

and solve the secular equation

\[\phi(\lambda) = 0 \quad \text{for } \lambda \in(\max\{0, -\lambda _ 1\}, \infty)\]

Since $H$ is symmetric, we know it has spectral decomposition $H = U^\top \Lambda U$ where $U$ is an orthonormal matrix of eigenvectors of $H$ and $\Lambda$ is a diagonal matrix of eigenvalues of $H$ given by $\lambda _ 1 \le \lambda _ 2 \le \cdots \le \lambda _ n$, where $\lambda _ 1 = \lambda _ \min(H)$.

By ∆exact-conditions-for-solving-trust-region-subproblem, we know that

\[H + \lambda I = U^\top (\Lambda + \lambda I) U\]

is positive semidefinite. Hence $\lambda _ 1 + \lambda \ge 0$, and so $\lambda \ge -\lambda _ 1$ and thus $\lambda \ge \max\{0, -\lambda _ 1\}$. Thus

\[\begin{aligned} \psi(\lambda) &= \vert \vert s(\lambda) \vert \vert ^2 \\ &= \vert \vert U^\top (\Lambda + \lambda I)^{-1}Ug \vert \vert ^2 \\ &= g^\top U^\top (\Lambda + \lambda I)^{-2} Ug \end{aligned}\]

Let $g = U^\top \gamma$ for some $\gamma = (\gamma _ 1, \ldots, \gamma _ n) \in \mathbb R^n$. As $U U^\top = U^\top U = I$, we have

\[\begin{aligned} \psi(\lambda) &= \gamma^\top (\Lambda + \lambda I)^{-2} \gamma \\ &= \sum^n _ {i=1} \frac{\gamma _ i^2}{(\lambda + \lambda _ i)^2} \\ &= \Delta^2 \end{aligned}\]

Hence the minimising the secular equation is equivalent to solving for $\lambda$. $\phi$ is a convenient form as it has no poles and is analytic on $(-\lambda, \infty)$.

Consider the trust-region subproblem

\[\min _ {s \in \mathbb R^n} m(s) = h + s^\top g + \frac 1 2 s^\top H s \quad\text{ s.t. }\quad \|s\| \le \Delta.\]

@Define the “hard case” of this subproblem, and @describe how the solution is constructed.

Hard case: occurs when both

  • $H$ has $\lambda _ \min (H) < 0$, and
  • $g$ is orthogonal to the eigenspace of $\lambda_\min(H)$.

In the ordinary nonconvex case ($g$ has a component along the negative-eigenvalue eigenspace), we find $\lambda^\ast > \underline \lambda := \max \{0, -\lambda _ \min(H)\}$ such that $\|s(\lambda^\ast)\| = \Delta$, where $s(\lambda) = -(H + \lambda I)^{-1} g$ (see ∆trust-region-subproblem-solving-recipe). In the hard case this fails: $\lambda \to \underline \lambda^+$, $s(\lambda)$ stays bounded (the singularity in $(H + \lambda I)^{-1}$ along the bad eigenspace doesn’t blow up because $g$ has zero component there), so the secular equation has no solution for $\lambda > \underline \lambda$.

Recipe: set $\lambda^\ast := -\lambda _ \min(H) = \underline \lambda$, so $H + \lambda^\ast I \succ 0$ with non-trivial null space (the eigenspace of $\lambda _ \min(H)$). Then $(H + \lambda^\ast I)s = -g$ is solvable (since $g$ lies in the range of $H + \lambda^\ast I$, which is the orthogonal complement of the null space). Solve for the components of $s$ orthogonal to the null-space; the null-space component is free to choose. Fix its magnitude using $\|s\| = \Delta$.

Uniqueness fails: since $H + \lambda^\ast I$ is only PSD (not PD), ∆exact-conditions-for-solving-trust-region-subproblem only guarantees existence and the global minimiser is not unique. See ∆trust-region-hard-case exampe.

Work through ∆trust-region-hard-case on

\[H = \text{diag}(-2, -1, -1), \quad g = (0, 0, 1)^\top, \quad \Delta = \sqrt 2.\]

Check the hard-case conditions:

  • $\lambda _ \min(H) = -2 < 0$
  • Eigenspace of $\lambda _ \min(H)$ is $\text{span}\{e _ 1\}$, so $g = (0, 0, 1)^\top$ is orthogonal to $e _ 1$

So set $\lambda^\ast = 2$, giving $H + 2I = \text{diag}(0, 1, 1)$. Solve $(H + 2I)s = -g$:

\[\begin{pmatrix} 0 & & \\ & 1 & \\ & & 1 \end{pmatrix} \begin{pmatrix} s _ 1 \\ s _ 2 \\ s _ 3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ -1 \end{pmatrix}\]

Hence $s _ 2 = 0$, $s _ 3 = -1$, and $s _ 1$ is free (it lies in the null space of $H + \lambda^\ast I)$.

Imposing $\|s\| = \sqrt 2$ gives $s _ 1^2 + 0 + 1 = 2$, so $s _ 1 = \pm 1$. Hence there are two global minimisers:

\[s^\ast = (1, 0, -1)^\top, \quad \text{or} \quad s^\ast = (-1, 0, -1)^\top.\]

@example~

Solve the trust-region subproblem

\[\min _ {s \in \mathbb R^3} s^\top g + \frac 1 2 s^\top H s \quad \text{s.t.} \quad \|s\| \le \Delta\]

with $H = \text{diag}(1, 2, 2)$, $g = (1, 0, 1)^\top$, $\Delta = 2$.

Recall also the ∆exact-conditions-for-solving-trust-region-subproblem, i.e. that $s^\ast$ is a global minimiser iff there exists $\lambda^\ast \ge 0$ such that

  1. $(H + \lambda^\ast I) s^\ast = -g$
  2. $H + \lambda^\ast I \succeq 0$
  3. $\lambda^\ast ( \vert \vert s^\ast \vert \vert - \Delta) = 0$
  4. $ \vert \vert s^\ast \vert \vert \le \Delta$

and moreover, if $H + \lambda^\ast I$ is positive definite, then $s^\ast$ is the unique global minimiser.

Since $H \succ 0$, we are in case 1 of ∆trust-region-subproblem-solving-recipe. Try the unconstrained minimiser:

\[s _ N = -H^{-1} g = (-1, 0, -1/2)^\top, \quad \|s _ N\| = \sqrt{1 + 1/4} = \frac{\sqrt 5}{2}\]

Since $\|s _ N\| < 2 = \Delta$, the trust-region constraint is inactive. By ∆exact-conditions-for-solving-trust-region-subproblem, $\lambda^\ast = 0$ and the unique global minimiser is

\[s^\ast = s _ N = (-1, 0, -1/2)^\top.\]

@example~

Solve the trust-region subproblem

\[\min _ {s \in \mathbb R^3} s^\top g + \frac 1 2 s^\top H s \quad \text{s.t.} \quad \|s\| \le \Delta\]

with $H = \text{diag}(1, 2, 2)$, $g = (1, 0, 1)^\top$, $\Delta = 5/12$.

Recall also the ∆exact-conditions-for-solving-trust-region-subproblem, i.e. that $s^\ast$ is a global minimiser iff there exists $\lambda^\ast \ge 0$ such that

  1. $(H + \lambda^\ast I) s^\ast = -g$
  2. $H + \lambda^\ast I \succeq 0$
  3. $\lambda^\ast ( \vert \vert s^\ast \vert \vert - \Delta) = 0$
  4. $ \vert \vert s^\ast \vert \vert \le \Delta$

and moreover, if $H + \lambda^\ast I$ is positive definite, then $s^\ast$ is the unique global minimiser.

Since $H \succ 0$, we are in case 1 of ∆trust-region-subproblem-solving-recipe. Try the unconstrained minimiser:

\[s _ N = -H^{-1} g = (-1, 0, -1/2)^\top, \quad \|s _ N\| = \sqrt{1 + 1/4} = \frac{\sqrt 5}{2}\]

This fails since $\sqrt{5}/2 > 5/12 = \Delta$ (see ∆trust-region-case-1-example), so the constraint is active. By ∆exact-conditions-for-solving-trust-region-subproblem, we look for $\lambda^\ast > 0$ with

\[s(\lambda) = -(H + \lambda I)^{-1} g = \left( -\frac{1}{1+\lambda}, 0, -\frac{1}{2 + \lambda} \right)^\top\]

and $\|s(\lambda)\| = \Delta$, i.e.

\[\frac{1}{(1 + \lambda)^2} + \frac{1}{(2 + \lambda)^2} = \frac{25}{144}.\]

We note that $\lambda = 2$ is a root. Since $\lambda^\ast = 2 > 0$ and $H + 2I \succ 0$, this is the unique global minimiser:

\[s^\ast = (-1/3, 0, -1/4)^\top.\]

@example~

Solve the trust-region subproblem

\[\min _ {s \in \mathbb R^3} s^\top g + \frac 1 2 s^\top H s \quad \text{s.t.} \quad \|s\| \le \Delta\]

with $H = \text{diag}(-2, -1, -1)$, $g = (1, 0, 1)^\top$, $\Delta = 5/12$.

Recall also the ∆exact-conditions-for-solving-trust-region-subproblem, i.e. that $s^\ast$ is a global minimiser iff there exists $\lambda^\ast \ge 0$ such that

  1. $(H + \lambda^\ast I) s^\ast = -g$
  2. $H + \lambda^\ast I \succeq 0$
  3. $\lambda^\ast ( \vert \vert s^\ast \vert \vert - \Delta) = 0$
  4. $ \vert \vert s^\ast \vert \vert \le \Delta$

and moreover, if $H + \lambda^\ast I$ is positive definite, then $s^\ast$ is the unique global minimiser.

$H$ has $\lambda _ \min(H) = -2 < 0$, and $g$ has nonzero component along $e _ 1$ (the eigenspace of $\lambda _ \min$), so we are in the ordinary nonconvex case, not the hard case (∆trust-region-hard-case). By ∆exact-conditions-for-solving-trust-region-subproblem, $\lambda^\ast = -\lambda _ \min(H) = 2$ and $\|s^\ast\| = \Delta$. Write

\[s(\lambda) = -(H + \lambda I)^{-1} g = \left( -\frac{1}{\lambda - 2}, 0, -\frac{1}{\lambda - 1} \right)^\top\]

and we require $\|s(\lambda)\|^2 = 25/144$:

\[\frac{1}{(\lambda - 2)^2} + \frac{1}{(\lambda - 1)^2} = \frac{25}{144}.\]

Substitute $u := \lambda - 3$, so $\lambda - 2 = u + 1$ and $\lambda - 1 = u + 2$. Then the equation becomes

\[\frac{1}{(u+1)^2} + \frac{1}{(u + 2)^2} = \frac{25}{144},\]

the same equation as ∆trust-region-case-2-pd-example, with root $u = 2$. Hence $\lambda^\ast = 5$, and

\[s^\ast = (-1/3, 0, -1/4)^\top.\]

@example~

Global convergence theorem

Consider the modified trust-region method where the local model uses an arbitrary symmetric matrix $B^k$ in place of $\nabla^2 f(x^k)$:

\[m _ k(s) := f(x^k) + s^\top \nabla f(x^k) + \frac 1 2 s^\top B^k s, \quad \|s\| \le \Delta _ k.\]

The Cauchy point and the GTR scheme are defined as in ∆generic-trust-region-algorithm. @State the global convergence theorem for this generalisation and identify the extra hypotheses needed beyond the standard GTR setup.

Suppose:

  • $f \in \mathcal C^2(\mathbb R^n)$, bounded below on $\mathbb R^n$
  • $\nabla f$ Lipschitz continuous with constant $L$
  • $\{x^k\}$ generated by GTR with model $m _ k(s)$ above
  • $m _ k(s^k) \le m _ k(s^k _ c)$ for all $k$ (Cauchy decreases)
  • New hypothesis: $\|B^k\| \le \kappa _ B$ uniformly in $k$

Then either:

  • there exists $k \ge 0$ with $\nabla f(x^k) = 0$, or
  • $\lim \inf _ {k \to \infty} \|\nabla f(x^k)\| = 0$.

Where the hypothesis enters: the proof of ∆one-limit-point-of-gtr-method-stationary generalises with $\nabla^2 f(x^k)$ replaced by $B^k$ throughout, with two changes:

\[f(x^k) - m _ k(s) \ge \frac 1 2 \|\nabla f(x^k)\| \min \left\{ \Delta _ k, \frac{\|\nabla f(x^k)\|}{\|B^k\|} \right\}.\]
  • The bound $ \vert f(x^k + s^k) - m _ k(s^k) \vert \le \tfrac 1 2 [L + \kappa _ B]\Delta^2 _ k$ replaces the $L$-only bound in the original proof of ∆lower-bound-on-trust-region-radius, and the lower bound on $\Delta _ k$ weakens to $\Delta _ k \ge c/(L + \kappa _ B) \cdot \epsilon$ for some $c \in (0, 1)$.

The hypothesis $\|B^k\| \le \kappa _ B$ is what makes both these bounds finite. Without it, $B^k$ could be unboundedly indefinite and the Cauchy decrease becomes useless.

Why useful: this covers quasi-Newton trust-region methods (where $B^k$ is a Hessian approximation, not necessarily PD) and methods using approximate derivatives, without requiring PD-ness of $B^k$ (compare with ∆general-second-order-glm-global-convergence for the linesearch case, which did need PD).

Bite-sized

In the generic trust-region scheme, the success of an iteration is judged by the ratio $\rho _ k := (f(x^k) - f(x^k + s^k)) / (f(x^k) - m _ k(s^k))$ with thresholds: $\rho _ k \ge 0.9$ very successful (accept, double $\Delta$); $\rho _ k \ge 0.1$ successful (accept, keep $\Delta$); else unsuccessful (reject, halve $\Delta$).

Source: Lecture 8, A Generic Trust Region (GTR) method slide.

@bite~

Why is the Cauchy decrease condition $m _ k(s^k) \le m _ k(s^k _ c)$ imposed on the (approximate) trust-region subproblem solution, and why is it usually easy to satisfy?

Purpose: it provides the minimal model decrease needed for the global convergence proof (∆gtr-global-convergence-theorem). Without it, an approximate solver could pick $s^k$ that decreases $m _ k$ trivially and the proof’s “sufficient decrease per iteration” step would fail.

Easy in practice: $s^k _ c$ is just the Cauchy point — a steepest-descent step inside the trust region, computable in closed form (∆cauchy-point-computation) from one gradient and one Hessian-vector product. Any reasonable iterative solver (conjugate gradient, dogleg, two-dimensional subspace minimisation) does better than $s^k _ c$ on every iteration. So requiring $m _ k(s^k) \le m _ k(s^k _ c)$ doesn’t constrain the algorithm — it just nails down the worst-case guarantee.

Source: Lecture 8, The Cauchy point of the (TR) subproblem discussion.

@bite~

The “secular equation” $\phi(\lambda) := 1/\ \vert s(\lambda)\ \vert - 1/\Delta = 0$ is preferred over the direct equation $\ \vert s(\lambda)\ \vert = \Delta$ when solving the trust-region subproblem (case 2) because $\phi$ has no poles and is analytic on $(-\lambda _ 1, \infty)$, making Newton’s method globally convergent and locally quadratically convergent. The direct $\ \vert s(\lambda)\ \vert $ has a pole at $\lambda = -\lambda _ 1$ which makes Newton iterations unstable nearby.

Source: Lecture 9, The secular equation slide.

@bite~

In one line each, what’s the philosophical difference between linesearch and trust-region methods?

  • Linesearch: choose a (descent) direction first, then choose the stepsize along it to control how much we trust the local model. Liberal in direction, conservative in stepsize.
  • Trust-region: choose a region around $x^k$ in which we trust the local model, then minimise the model over that region. Conservative in step size (bounded by $\Delta _ k$), liberal in direction (anywhere in the trust region, including non-descent directions for non-PD Hessians).

Both reduce a global problem to a tractable local subproblem; they differ only in which axis (direction or stepsize) is committed to first.

Source: Lecture 8, Linesearch versus trust-region methods slide.

@bite~

The ratio $\rho _ k = (f(x^k) - f(x^k + s^k)) / (f(x^k) - m _ k(s^k))$ compares actual decrease in $f$ to predicted decrease in the model $m _ k$. $\rho _ k \approx 1$ means the model is accurate; $\rho _ k \ll 1$ means the model is overconfident and we should shrink the trust region; $\rho _ k > 1$ means the model is overconservative and we can expand.

Source: Lecture 8, Generic trust-region method slide.

@bite~

When computing the Cauchy point, the case $h^k := \nabla f(x^k)^\top \nabla^2 f(x^k) \nabla f(x^k) \le 0$ corresponds to negative or zero curvature of $f$ along $-\nabla f(x^k)$. In that case $\psi(\alpha) := m _ k(-\alpha \nabla f)$ is unbounded below on $\mathbb R$, so the constrained minimum is attained at $\alpha^k _ c = \overline\alpha = \Delta _ k / \ \vert \nabla f(x^k)\ \vert $ (the trust-region boundary).

Source: Lecture 9, Computation of the Cauchy point slide.

@bite~

For large-scale problems, solving the trust-region subproblem exactly (via Newton’s method on the secular equation, requiring Cholesky factorisations of $H + \lambda I$ for various $\lambda$) is too expensive. The practical alternative is to use conjugate-gradient or Lanczos methods on the subproblem — these only need matrix-vector products $Hv$, and their first step is a steepest-descent step so the Cauchy decrease is automatic.

Source: Lecture 9, Solving the (TR) subproblem for large-scale problems slide.

@bite~

Give the proof strategy for ∆cauchy-point-computation (the closed-form computation of $\alpha^k _ c$).

  • Reduce to a 1D problem in $\alpha$: define $\psi(\alpha) := m _ k(-\alpha \nabla f(x^k)) = f(x^k) - \alpha \|\nabla f(x^k)\|^2 + \tfrac 1 2 \alpha^2 h^k$, where $h^k := \nabla f(x^k)^\top \nabla^2 f(x^k) \nabla f(x^k)$.
  • Note $\psi'(0) = -\|\nabla f(x^k)\|^2 < 0$ so $\alpha^k_c > 0$.
  • Feasibility window: $\alpha \in (0, \overline\alpha]$ with $\overline\alpha := \Delta _ k / \|\nabla f(x^k)\|$.
  • Case-split on sign of $h^k$:
    • $h^k > 0$: $\psi$ is a convex parabola; unconstrained minimiser at $\alpha _ \min = \|\nabla f\|^2 / h^k$. Hence $\alpha^k _ c = \min(\alpha _ \min, \overline\alpha)$.
    • $h^k \le 0$: $\psi$ is concave (or linear); unbounded below on $\mathbb R$, so constrained min lies on the boundary, $\alpha^k _ c = \overline\alpha$.

Source: Lecture 9, Computation of the Cauchy point slide.

@bite~ @proofsupport~

Give the proof strategy for ∆sufficient-model-decrease-given-cauchy-condition (Lemma 12: Cauchy model decrease ≥ $\tfrac{1}{2} \|\nabla f\| \min\{\Delta _ k, \|\nabla f\|/\|\nabla^2 f\|\}$).

  • The Cauchy-decrease hypothesis $m _ k(s^k) \le m _ k(s^k _ c)$ reduces the task to bounding $f(x^k) - m _ k(s^k _ c)$.
  • Split into three cases based on which regime of ∆cauchy-point-computation produced $\alpha^k _ c$:
    • Case A ($h^k \le 0$): $\alpha^k _ c = \overline\alpha = \Delta _ k / \|\nabla f(x^k)\|$. Since $\psi(\alpha^k _ c) \le f(x^k) - \alpha^k _ c \|\nabla f\|^2$ (drop the $h^k$ term), get $f(x^k) - m _ k(s^k _ c) \ge \Delta _ k \|\nabla f(x^k)\|$.
    • Case B ($h^k > 0$, $\alpha^k _ c = \overline\alpha$): $\alpha^k _ c \le \alpha _ \min$, so $\alpha^k _ c h^k \le \|\nabla f\|^2$. Using this in $\psi$, $f(x^k) - m _ k(s^k _ c) \ge \tfrac{1}{2} \alpha^k _ c \|\nabla f\|^2 = \tfrac{1}{2} \Delta _ k \|\nabla f(x^k)\|$.
    • Case C ($h^k > 0$, $\alpha^k _ c = \alpha _ \min = \|\nabla f\|^2/h^k$): plug in to get $f(x^k) - m _ k(s^k _ c) = \|\nabla f\|^4 / (2 h^k)$, then bound $h^k \le \|\nabla^2 f\| \|\nabla f\|^2$ by Rayleigh–Ritz to give $\|\nabla f\|^2 / (2 \|\nabla^2 f\|)$.
  • All three lower bounds dominate $\tfrac{1}{2} \|\nabla f\| \min\{\Delta _ k, \|\nabla f\| / \|\nabla^2 f\|\}$.

Source: Lecture 9, Lemma 12 proof.

@bite~ @proofsupport~

Case A ($h^k \le 0$) of the Cauchy-decrease lemma (∆cauchy-decrease-lemma-proof). When $\alpha^k _ c = \overline\alpha = \Delta _ k/\ \vert \nabla f(x^k)\ \vert $, dropping the non-positive $h^k$ term in $\psi$ gives

\[f(x^k) - m _ k(s^k _ c) \ge \alpha^k _ c \ \vert \nabla f(x^k)\ \vert ^2 = <span class="cloze" tabindex="0">\Delta _ k \ \vert \nabla f(x^k)\ \vert</span>.\]

This is the easiest of the three cases — the negative-curvature step rides the boundary all the way.

Source: Lecture 9, Lemma 12 proof, $h^k \le 0$ case.

@bite~ @proofsupport~

Case C ($h^k > 0$, unconstrained interior minimum) of the Cauchy-decrease lemma (∆cauchy-decrease-lemma-proof). When $\alpha^k _ c = \ \vert \nabla f(x^k)\ \vert ^2 / h^k$, substituting into $\psi$ collapses the quadratic to

\[f(x^k) - m _ k(s^k _ c) = <span class="cloze" tabindex="0">\frac{\ \vert \nabla f(x^k)\ \vert ^4}{2 h^k}</span>\]

and Rayleigh–Ritz $h^k \le \ \vert \nabla^2 f(x^k)\ \vert \ \vert \nabla f(x^k)\ \vert ^2$ then yields the $\ \vert \nabla f\ \vert ^2 / (2 \ \vert \nabla^2 f\ \vert )$ branch of the min.

Source: Lecture 9, Lemma 12 proof, $h^k > 0$, $\alpha^k _ c = \alpha _ \min$ case.

@bite~ @proofsupport~

Give the proof strategy for ∆one-limit-point-of-gtr-method-stationary (Theorem 14: GTR has $\liminf \|\nabla f(x^k)\| = 0$).

  • Argue by contradiction: assume $\|\nabla f(x^k)\| \ge \epsilon > 0$ for all $k$ (else done).
  • On every successful step $k \in \mathcal S$, the radius-update rule gives $\rho _ k \ge 0.1$, so $f(x^k) - f(x^{k+1}) \ge 0.1 (f(x^k) - m _ k(s^k))$.
  • Apply ∆sufficient-model-decrease-given-cauchy-condition (Lemma 12) to bound the model decrease; combine with the $\|\nabla^2 f\| \le L$ bound (from Lipschitz $\nabla f$) and ∆lower-bound-on-trust-region-radius (Lemma 13: $\Delta _ k \ge c\epsilon/L$) to get a uniform per-successful-step decrease $\ge 0.05 c \epsilon^2 / L$.
  • Telescope over successful iterations: bounded-below $f$ gives finite LHS but $ \vert \mathcal S \vert = \infty$ gives infinite RHS — contradiction.
  • Hence the standing assumption fails, so $\liminf \|\nabla f(x^k)\| = 0$.

Source: Lecture 9, Theorem 14 proof.

@bite~ @proofsupport~

The uniform per-successful-step decrease used in the contradiction argument of ∆one-limit-point-of-gtr-method-stationary-proof (Theorem 14). Combining $\rho _ k \ge 0.1$ with ∆sufficient-model-decrease-given-cauchy-condition (Lemma 12), $\ \vert \nabla^2 f(x^k)\ \vert \le L$, and ∆lower-bound-on-trust-region-radius (Lemma 13’s $\Delta _ k \ge c\epsilon/L$):

\[\forall k \in \mathcal S : f(x^k) - f(x^{k+1}) \ge <span class="cloze" tabindex="0">\frac{0.05 c}{L} \epsilon^2</span>.\]

This independent-of-$k$ lower bound clashes with the bounded-below telescoping LHS, giving the contradiction.

Source: Lecture 9, Theorem 14 proof, inequality $(\star)$.

@bite~ @proofsupport~

In ∆one-limit-point-of-gtr-method-stationary (Theorem 14), where does "$ \vert \mathcal S \vert = \infty$" (infinitely many successful iterations) come in, and what happens if instead only finitely many successful steps occur?

  • The contradiction in the proof uses $ \vert \mathcal S \vert = \infty$ to make the RHS of the telescoped sum infinite while the LHS $\le f(x^0) - f _ {\text{low}}$ stays finite. Without infinite $\mathcal S$, this contradiction fails.
  • However, the case of finitely many successful iterations is handled separately (∆lower-bound-on-trust-region-radius, remark 2): after the last successful step, all subsequent steps are unsuccessful, so $\Delta _ k \to 0$. Combined with Lemma 13 (which says $\Delta _ k \ge c\epsilon/L$ under $\|\nabla f(x^k)\| \ge \epsilon$), this forces $\|\nabla f(x^k)\| < \epsilon$ for some $k$, so in fact the last successful iterate has zero gradient.
  • Either way, the conclusion holds: either GTR terminates with $\nabla f(x^k) = 0$, or $\liminf \|\nabla f(x^k)\| = 0$.

Source: Lecture 9, Theorem 14 proof + remark following ∆lower-bound-on-trust-region-radius.

@bite~ @proofsupport~

Give the proof strategy for ∆trust-region-secular-equation-proof (deriving the secular equation $\phi(\lambda) = 1/\|s(\lambda)\| - 1/\Delta = 0$ via spectral decomposition).

  • Spectral-decompose $H = U^\top \Lambda U$ with $U$ orthonormal and $\Lambda$ diagonal of eigenvalues $\lambda _ 1 \le \cdots \le \lambda _ n$.
  • Note ∆exact-conditions-for-solving-trust-region-subproblem requires $H + \lambda I \succeq 0$, which translates to $\lambda \ge -\lambda _ 1 = -\lambda _ \min(H)$, hence $\lambda \ge \max\{0, -\lambda _ 1\}$.
  • Compute $\|s(\lambda)\|^2 = g^\top (H + \lambda I)^{-2} g$, change variables $g = U^\top \gamma$ and use $UU^\top = I$ to get $\|s(\lambda)\|^2 = \sum _ i \gamma _ i^2 / (\lambda + \lambda _ i)^2$.
  • Set $\|s(\lambda)\|^2 = \Delta^2$ as the equation to solve.
  • Switch to $\phi(\lambda) := 1/\|s(\lambda)\| - 1/\Delta$ for numerical stability — it has no poles on $(-\lambda _ 1, \infty)$ and is analytic there, unlike $\|s(\lambda)\|$ which blows up at $\lambda = -\lambda _ 1$.

Source: Lecture 9, Spectral decomposition approach and The secular equation slides.

@bite~ @proofsupport~

Visualising the subproblem

The solution recipe ∆trust-region-subproblem-solving-recipe and the ∆trust-region-secular-equation are easier to picture in two dimensions. The left panel shows the model contours, the trust region $\ \vert s\ \vert \le \Delta$ and the solution $s^\ast$ as $\Delta$ varies. The right panel shows the secular picture, with $\phi(\lambda)$ crossing zero at $\lambda^\ast$. Drag $g$, push the eigenvalues of $H$ into the indefinite range, and try the hard case.

Solving the trust-region subproblem   \(\min_s\)   \(m(s) = g^\top s + \tfrac12 s^\top H s\)   subject to   \(\|s\| \le \Delta\). Drag inside the left panel to move the gradient g; the sliders set the eigenvalues \(\lambda_1 \le \lambda_2\) of H, its tilt, and \(\Delta\). On the boundary the solution solves \((H+\lambda^{*}I)s^{*} = -g\) with \(\|s^{*}\| = \Delta\) via the secular equation.

model contours · trust region \(\|s\|\le\Delta\) · solution path \(s^{*}(\Delta)\)

secular picture: \(\|s(\lambda)\|\) and \(\phi(\lambda)=1/\|s(\lambda)\|-1/\Delta\)