Continuous Optimisation HT26, Quasi-Newton methods


Flashcards

@State an expression for approximating the Hessian from $n +1$ gradient evaluations.

You can determine the $i$th column via the approximation

\[[\nabla^2 f(x)]e^i \approx \frac 1 h \left[ \nabla f(x + he^i) - \nabla f(x) \right]\]

One way to approximate the Hessian via $n+1$ gradient evaluations is to do the following for each column:

\[[\nabla^2 f(x)]e^i \approx \frac 1 h \left[ \nabla f(x + he^i) - \nabla f(x) \right]\]

What is the problem with making $h$ too large or too small?

  • If $h$ is too large, you have inaccurate approximations
  • If $h$ is too small, you have numerical cancellation errors

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. What is the idea behind secant approximations for computing $B^k \approx \nabla^2 f(x^k)$, and what are some things we should require about this matrix?

At the start of line search, choose some $B^0$ (e.g. $I$). After computing $s^k = -(B^k)^{-1} \nabla f(x^k)$ and $x^{k+1} = x^k + \alpha^k s^k$, compute an update $B^{k+1}$ of $B^k$ in terms of already computed quantities $\nabla f(x^{k+1})$, $\nabla f(x^k)$, $\nabla f(x^0)$, etc.

We would like:

  • $B^{k+1}$ should be symmetric, nonsingular and positive definite
  • $B^{k+1}$ should be close to $B^k$
  • $B^{k+1}$ should be cheap to compute
  • $B^k \to \nabla^2 f(x^k)$

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context, @state the secant equation.

We choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

@exam~

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

@Justify that this equation is satisfied by the Hessian when $f$ is a quadratic function.

Let $f(x) = g^\top x + \frac 1 2 x^\top H x$. Then $\nabla f(x) = Hx + g$ and $\nabla^2 f = H$. Thus $\nabla f(x^{k+1}) - \nabla f(x^k) = H(x^{k+1} - x^k)$ and hence the secant equation holds with $B^{k+1} := \nabla^2 f$.

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

Can you give an intuitive interpretation of the secant equation?

We implicitly have the quadratic model

\[m(x^k + s) = f(x^k) + \nabla f(x^k)^\top s + \frac 1 2 s^\top B^k s\]

We want the quadratic model in the next iteration

\[m _ + (x^k + s) := f(x^k) + \nabla f(x^k)^\top s + \frac 1 2 s^\top B^{k+1} s\]

to correctly predict the change in gradient between iterations given by $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$:

\[\begin{aligned} \gamma^k &= \nabla f(x^{k+1}) - \nabla f(x^k) \\ &= \nabla m _ + (x^{k+1}) - \nabla m _ +(x^k) \\ &= B^{k+1} (x^{k+1} - x^k) + \nabla f(x^k) - \nabla f(x^k) \\ &= B^{k+1} (x^{k+1} - x^k) \end{aligned}\]

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

@State how this could be satisfied with a symmetric rank 1 update.

Choose $B^{k+1} := B^k + u^k (u^k)^\top$, where

  • $u^k = \frac{\gamma^k - B^k \delta^k}{\rho^k}$
  • $\delta^k := x^{k+1} - x^k = \alpha^k s^k$
  • $(\rho^k)^2 := (\gamma^k - B^k \delta^k)^\top \delta^k > 0$

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

A symmetric rank 1 update satisfies this by choosing $B^{k+1} := B^k + u^k (u^k)^\top$, where

  • $u^k = (\gamma^k - B^k \delta^k) / \rho^k$
  • $\delta^k := x^{k+1} - x^k = \alpha^k s^k$
  • $(\rho^k)^2 := (\gamma^k - B^k \delta^k)^\top \delta^k > 0$

What are the issues with this?

  • $B^k$ may not be positive definite
  • $s^k$ may not be a descent direction
  • $\rho^k$ may be close to zero

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

A symmetric rank 1 update satisfies this by choosing $B^{k+1} := B^k + u^k (u^k)^\top$, where

  • $u^k = (\gamma^k - B^k \delta^k) / \rho^k$
  • $\delta^k := x^{k+1} - x^k = \alpha^k s^k$
  • $(\rho^k)^2 := (\gamma^k - B^k \delta^k)^\top \delta^k > 0$

What is the time complexity of this?

\[O(n^2)\]

via Sherman-Morrison-Woodbury formula.

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

@Define the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update.

Set

\[**@Define** theB^{k+1} = B^k - \frac{B^k \delta^k (\delta^k)^\top B^k}{(\delta^k)^\top B^k \delta^k} + \frac{\gamma^k (\gamma^k)^\top}{(\gamma^k)^\top \delta^k}\]

where $\delta^k := x^{k+1} - x^k$ and $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$. This is a rank-2 update of $B^k$.

@exam~

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update computes

\[B^{k+1} = B^k - \frac{B^k \delta^k (\delta^k)^\top B^k}{(\delta^k)^\top B^k \delta^k} + \frac{\gamma^k (\gamma^k)^\top}{(\gamma^k)^\top \delta^k}\]

where $\delta^k := x^{k+1} - x^k$ and $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$. This is a rank-2 update of $B^k$. What is the time complexity of this?

\[O(n^2)\]

via the Sherman-Morrison-Woodbury formula.

In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that

\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]

The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update computes

\[B^{k+1} = B^k - \frac{B^k \delta^k (\delta^k)^\top B^k}{(\delta^k)^\top B^k \delta^k} + \frac{\gamma^k (\gamma^k)^\top}{(\gamma^k)^\top \delta^k}\]

where $\delta^k := x^{k+1} - x^k$ and $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$. This is a rank-2 update of $B^k$. @State a result about the symmetry and definiteness of this update.

If $B^k$ is symmetric positive definite and the curvature condition $(\delta^k)^\top \gamma^k > 0$ holds, then $B^{k+1}$ is symmetric positive definite.

The curvature condition is enforced by Wolfe linesearch but not by backtracking Armijo, which is why BFGS in practice is paired with Wolfe.

@exam~

What is meant by the BFGS method?

A generic linesearch method with

\[s^k := -(B^k)^{-1}\nabla f(x^k)\]

where $B^k$ is updated by the BFGS formula on each iteration.

The BFGS method is the generic linesearch method with

\[s^k := -(B^k)^{-1}\nabla f(x^k)\]

where $B^k$ is updated by the BFGS formula on each iteration. What must be true about how the step-size is calculated in order to ensure global convergence, and why?

We require Wolfe linesearch (∆wolfe-conditions) rather than backtracking Armijo.

Why: global convergence of BFGS in the GLM framework needs $B^k$ to stay positive definite — so that $s^k = -(B^k)^{-1}\nabla f(x^k)$ stays a descent direction and the umbrella ∆general-second-order-glm-global-convergence theorem applies (its eigenvalue hypothesis requires PD $B^k$). BFGS preserves PD iff the curvature condition $(\delta^k)^\top \gamma^k > 0$ holds (see ∆bfgs-preserves-positive-definiteness, ∆bite-bfgs-curvature-condition).

Wolfe enforces this. The curvature half (W2) says $\nabla f(x^{k+1})^\top s^k \ge c _ 2 \nabla f(x^k)^\top s^k$. Subtracting $\nabla f(x^k)^\top s^k$ from both sides:

\[(\gamma^k)^\top s^k \ge (c _ 2 - 1)\, \nabla f(x^k)^\top s^k > 0,\]

where the second inequality uses $c _ 2 - 1 < 0$ and $\nabla f(x^k)^\top s^k < 0$ (descent direction). Multiplying by $\alpha^k > 0$ gives $(\delta^k)^\top \gamma^k = \alpha^k (s^k)^\top \gamma^k > 0$ — the curvature condition holds at every iteration.

Armijo doesn’t. Backtracking Armijo only enforces (W1) — sufficient decrease on $f(x^{k+1})$ — and says nothing about $\nabla f(x^{k+1})^\top s^k$. So $(\delta^k)^\top \gamma^k > 0$ can fail, $B^{k+1}$ can become indefinite, the next $s^{k+1}$ may not be descent, and the umbrella convergence theorem’s hypothesis breaks down.

The BFGS method is the generic linesearch method with

\[s^k := -(B^k)^{-1}\nabla f(x^k)\]

where $B^k$ is updated by the BFGS formula on each iteration. @State a high-level result about its local convergence.

The BFGS method has local Q-superlinear convergence.

The BFGS method is the generic linesearch method with

\[s^k := -(B^k)^{-1}\nabla f(x^k)\]

where $B^k$ is updated by the BFGS formula on each iteration. @State a result about the convergence of $B^k$ to the Hessian.

Suppose:

  • $f$ is a strictly convex quadratic, $f(x) = c + g^\top x + \tfrac{1}{2} x^\top H x$ with $H \in \mathbb R^{n \times n}$ symmetric positive definite — so $\nabla^2 f(x) \equiv H$ is constant in $x$
  • the BFGS method uses exact linesearch

Then after at most $n$ iterations:

  • $B^n = H$ (the BFGS approximation equals the true, constant Hessian of $f$); and
  • $x^n = x^\ast = -H^{-1} g$ (the iterates terminate at the unique global minimiser).

@Describe how BFGS is implemented in practice using Cholesky factors, and why this is preferred over directly maintaining $B^k$.

In practice, BFGS maintains a Cholesky factor $L^k$ of $B^k$:

\[B^k = L^k(L^k)^\top, \qquad L^k \text{ lower triangular.}\]

The rank-2 BFGS update for $B^k$ corresponds to a rank-2 update of $L^k$, computable in $O(n^2)$ flops.

Benefits over storing $B^k$ explicitly:

  • Solving the Newton system $B^k s^k = -\nabla f(x^k)$ becomes two triangular solves: $L^k v = -\nabla f(x^k)$ (forward substitution), then $(L^k)^\top s^k = v$ (back substitution). Cost $O(n^2)$ instead of $O(n^3)$ for a generic dense solve.
  • Positive-definiteness preservation is automatic: as long as $L^k$ stays nonsingular (no zero diagonal entry), $B^k = L^k (L^k)^\top$ is positive definite by construction. No need to verify PD-ness separately after each each update.
  • Numerical stability: avoids forming $B^k$ or its inverse explicitly, which would amplify round-off from the secant update.

Cost summary: BFGS via Cholesky factors achieves $O(n^2)$ per iteration, matching the SMW (Sherman-Morrison-Woodbury) inverse-update approach but with better numerical stability.

Bite-sized

@State the four wishes for a quasi-Newton update $B^{k+1}$ of $B^k$.

  1. Computable from already-known quantities ($\nabla f(x^{k+1}), \nabla f(x^k), \ldots, B^k, s^k$) — no second derivatives.
  2. Symmetric and positive definite (so the resulting search direction $s^k = -(B^k)^{-1} \nabla f(x^k)$ is well-defined and descent).
  3. Cheap update of $B^k$: $O(n^2)$ per iteration, not $O(n^3)$ as for forming/inverting from scratch.
  4. Asymptotic convergence: $B^k \to \nabla^2 f(x^\ast)$ along the iterates, so the local rate approaches Newton’s.

These rule out “approximate the Hessian with finite differences each iteration” ($O(n^2)$ extra gradient evaluations) in favour of recurrence-based updates like ∆sr1-update and ∆bfgs-update-formula.

Source: Lecture 7, Wish List for $B^{k+1}$ slide.

@bite~

For BFGS to preserve positive definiteness of $B^k$, the curvature condition $(\delta^k)^\top \gamma^k > 0$ (where $\delta^k = x^{k+1} - x^k$, $\gamma^k = \nabla f(x^{k+1}) - \nabla f(x^k)$) must hold. This is enforced automatically by the Wolfe linesearch but not by backtracking Armijo alone — hence BFGS in practice is paired with Wolfe.

Source: Lecture 7, BFGS Updates — Key properties slide.

@bite~

The lecturer motivates BFGS as the closest factored update meeting a linear secant-style constraint. Write $B^k = J^k (J^k)^\top$ with $J^k$ arbitrary nonsingular, choose $J^{k+1}$ to solve $\min _ J \ \vert J - J^k\ \vert _ F$ subject to $J \delta^k = \gamma^k$, then set $B^{k+1} := J^{k+1}(J^{k+1})^\top$. The Frobenius objective keeps the new factor close to $J^k$ — realising the wish-list requirement that $B^{k+1}$ stay close to $B^k$ (∆bite-quasi-newton-wish-list) — and the closed-form $J^{k+1}$ is a rank-1 update of $J^k$, so $B^{k+1}$ is a rank-2 update of $B^k$, which the lecturer identifies with the BFGS formula ∆bfgs-update-formula. In practice take $J^k := L^k$, the lower-triangular Cholesky factor of $B^k$: the update is then done directly on $L^k$ in $O(n^2)$, and $B^{k+1} = L^{k+1}(L^{k+1})^\top$ stays positive definite automatically whenever the curvature condition $(\delta^k)^\top \gamma^k > 0$ holds (∆bite-bfgs-curvature-condition).

Source: Lecture 7, BFGS Updates — Derivation slide. ⚠ The identification with the BFGS rank-2 formula is the lecturer’s heuristic claim — the constraint $J\delta = \gamma$ doesn’t pin down $(J^{k+1})^\top \delta$, so $B^{k+1}\delta = J^{k+1}(J^{k+1})^\top \delta$ need not equal $\gamma$ (the secant fails on $B$ in general; a 2×2 counter-example with $B^k = \mathrm{diag}(1, 4)$, $\delta = (1, 1)$, $\gamma = (2, 5)$ confirms this). The standard closest-update derivation of BFGS (Dennis–Schnabel; Nocedal–Wright Ch. 6) instead minimises a weighted Frobenius norm on $B$ itself subject to symmetry and the actual secant $B\delta = \gamma$ — bare Frobenius on $B$ with symmetry + secant gives PSB (Powell-Symmetric-Broyden), not BFGS. Treat the factor-level $\min _ J$ as motivation for the form of a rank-2 factored update, not a derivation of BFGS specifically.

@bite~

Compare SR1 and BFGS quasi-Newton updates on three axes: rank, PD preservation, and reliability.

  • Rank: SR1 is rank-1 ($B^{k+1} = B^k + uu^\top$, one outer product); BFGS is rank-2 (one term subtracted, one added).
  • PD preservation: BFGS preserves PD when the curvature condition $\delta^\top \gamma > 0$ holds (enforced by Wolfe linesearch). SR1 does not preserve PD in general — $B^k$ may become indefinite, making $s^k$ ascent rather than descent.
  • Reliability: SR1 has a denominator $\rho^k = ((\gamma - B\delta)^\top \delta)^{1/2}$ that can be near zero, causing the update to blow up; BFGS has more stable denominators. BFGS is preferred in practice; SR1 is used when an indefinite $B^k$ is acceptable (e.g. trust-region methods).

Cost: both $O(n^2)$ per iteration via Sherman-Morrison-Woodbury (∆sherman-morrison-woodbury).

Source: Lecture 7, Symmetric Rank-1 (SR1) Updates and BFGS Updates slides.

@bite~

Why is the per-iteration cost of quasi-Newton methods $O(n^2)$ rather than $O(n^3)$ as for Newton? Because the Sherman-Morrison-Woodbury formula lets us update $(B^k)^{-1}$ (or its Cholesky factor) by a low-rank perturbation in $O(n^2)$, rather than refactorising from scratch.

Source: Lecture 7, cost discussion for SR1 and BFGS updates; cf. ∆sherman-morrison-woodbury.

@bite~