Continuous Optimisation HT26, Quasi-Newton methods
Flashcards
@State an expression for approximating the Hessian from $n +1$ gradient evaluations.
You can determine the $i$th column via the approximation
\[[\nabla^2 f(x)]e^i \approx \frac 1 h \left[ \nabla f(x + he^i) - \nabla f(x) \right]\]One way to approximate the Hessian via $n+1$ gradient evaluations is to do the following for each column:
\[[\nabla^2 f(x)]e^i \approx \frac 1 h \left[ \nabla f(x + he^i) - \nabla f(x) \right]\]
What is the problem with making $h$ too large or too small?
- If $h$ is too large, you have inaccurate approximations
- If $h$ is too small, you have numerical cancellation errors
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. What is the idea behind secant approximations for computing $B^k \approx \nabla^2 f(x^k)$, and what are some things we should require about this matrix?
At the start of line search, choose some $B^0$ (e.g. $I$). After computing $s^k = -(B^k)^{-1} \nabla f(x^k)$ and $x^{k+1} = x^k + \alpha^k s^k$, compute an update $B^{k+1}$ of $B^k$ in terms of already computed quantities $\nabla f(x^{k+1})$, $\nabla f(x^k)$, $\nabla f(x^0)$, etc.
We would like:
- $B^{k+1}$ should be symmetric, nonsingular and positive definite
- $B^{k+1}$ should be close to $B^k$
- $B^{k+1}$ should be cheap to compute
- $B^k \to \nabla^2 f(x^k)$
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context, @state the secant equation.
We choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
@Justify that this equation is satisfied by the Hessian when $f$ is a quadratic function.
Let $f(x) = g^\top x + \frac 1 2 x^\top H x$. Then $\nabla f(x) = Hx + g$ and $\nabla^2 f = H$. Thus $\nabla f(x^{k+1}) - \nabla f(x^k) = H(x^{k+1} - x^k)$ and hence the secant equation holds with $B^{k+1} := \nabla^2 f$.
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
Can you give an intuitive interpretation of the secant equation?
We implicitly have the quadratic model
\[m(x^k + s) = f(x^k) + \nabla f(x^k)^\top s + \frac 1 2 s^\top B^k s\]We want the quadratic model in the next iteration
\[m _ + (x^k + s) := f(x^k) + \nabla f(x^k)^\top s + \frac 1 2 s^\top B^{k+1} s\]to correctly predict the change in gradient between iterations given by $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$:
\[\begin{aligned} \gamma^k &= \nabla f(x^{k+1}) - \nabla f(x^k) \\ &= \nabla m _ + (x^{k+1}) - \nabla m _ +(x^k) \\ &= B^{k+1} (x^{k+1} - x^k) + \nabla f(x^k) - \nabla f(x^k) \\ &= B^{k+1} (x^{k+1} - x^k) \end{aligned}\]In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
@State how this could be satisfied with a symmetric rank 1 update.
Choose $B^{k+1} := B^k + u^k (u^k)^\top$, where
- $u^k = \frac{\gamma^k - B^k \delta^k}{\rho^k}$
- $\delta^k := x^{k+1} - x^k = \alpha^k s^k$
- $(\rho^k)^2 := (\gamma^k - B^k \delta^k)^\top \delta^k > 0$
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
A symmetric rank 1 update satisfies this by choosing $B^{k+1} := B^k + u^k (u^k)^\top$, where
- $u^k = (\gamma^k - B^k \delta^k) / \rho^k$
- $\delta^k := x^{k+1} - x^k = \alpha^k s^k$
- $(\rho^k)^2 := (\gamma^k - B^k \delta^k)^\top \delta^k > 0$
What are the issues with this?
- $B^k$ may not be positive definite
- $s^k$ may not be a descent direction
- $\rho^k$ may be close to zero
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
A symmetric rank 1 update satisfies this by choosing $B^{k+1} := B^k + u^k (u^k)^\top$, where
- $u^k = (\gamma^k - B^k \delta^k) / \rho^k$
- $\delta^k := x^{k+1} - x^k = \alpha^k s^k$
- $(\rho^k)^2 := (\gamma^k - B^k \delta^k)^\top \delta^k > 0$
What is the time complexity of this?
via Sherman-Morrison-Woodbury formula.
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
@Define the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update.
Set
\[**@Define** theB^{k+1} = B^k - \frac{B^k \delta^k (\delta^k)^\top B^k}{(\delta^k)^\top B^k \delta^k} + \frac{\gamma^k (\gamma^k)^\top}{(\gamma^k)^\top \delta^k}\]where $\delta^k := x^{k+1} - x^k$ and $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$. This is a rank-2 update of $B^k$.
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update computes
\[B^{k+1} = B^k - \frac{B^k \delta^k (\delta^k)^\top B^k}{(\delta^k)^\top B^k \delta^k} + \frac{\gamma^k (\gamma^k)^\top}{(\gamma^k)^\top \delta^k}\]
where $\delta^k := x^{k+1} - x^k$ and $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$. This is a rank-2 update of $B^k$. What is the time complexity of this?
via the Sherman-Morrison-Woodbury formula.
In Newton’s method for optimisation, we require the Hessian $\nabla^2 f$. In quasi-Newton methods, we approximate $\nabla^2 f$ by some other matrix $B^k$. In this context the secant equation states we should choose $B^{k+1}$ such that
\[\gamma _ k := \nabla f(x^{k+1}) - \nabla f(x^k) = B^{k+1} (x^{k+1} - x^k) = B^{k+1} \alpha^k s^k\]
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update computes
\[B^{k+1} = B^k - \frac{B^k \delta^k (\delta^k)^\top B^k}{(\delta^k)^\top B^k \delta^k} + \frac{\gamma^k (\gamma^k)^\top}{(\gamma^k)^\top \delta^k}\]
where $\delta^k := x^{k+1} - x^k$ and $\gamma^k := \nabla f(x^{k+1}) - \nabla f(x^k)$. This is a rank-2 update of $B^k$. @State a result about the symmetry and definiteness of this update.
If $B^k$ is symmetric positive definite and the curvature condition $(\delta^k)^\top \gamma^k > 0$ holds, then $B^{k+1}$ is symmetric positive definite.
The curvature condition is enforced by Wolfe linesearch but not by backtracking Armijo, which is why BFGS in practice is paired with Wolfe.
What is meant by the BFGS method?
A generic linesearch method with
\[s^k := -(B^k)^{-1}\nabla f(x^k)\]where $B^k$ is updated by the BFGS formula on each iteration.
The BFGS method is the generic linesearch method with
\[s^k := -(B^k)^{-1}\nabla f(x^k)\]
where $B^k$ is updated by the BFGS formula on each iteration. What must be true about how the step-size is calculated in order to ensure global convergence, and why?
We require Wolfe linesearch (∆wolfe-conditions) rather than backtracking Armijo.
Why: global convergence of BFGS in the GLM framework needs $B^k$ to stay positive definite — so that $s^k = -(B^k)^{-1}\nabla f(x^k)$ stays a descent direction and the umbrella ∆general-second-order-glm-global-convergence theorem applies (its eigenvalue hypothesis requires PD $B^k$). BFGS preserves PD iff the curvature condition $(\delta^k)^\top \gamma^k > 0$ holds (see ∆bfgs-preserves-positive-definiteness, ∆bite-bfgs-curvature-condition).
Wolfe enforces this. The curvature half (W2) says $\nabla f(x^{k+1})^\top s^k \ge c _ 2 \nabla f(x^k)^\top s^k$. Subtracting $\nabla f(x^k)^\top s^k$ from both sides:
\[(\gamma^k)^\top s^k \ge (c _ 2 - 1)\, \nabla f(x^k)^\top s^k > 0,\]where the second inequality uses $c _ 2 - 1 < 0$ and $\nabla f(x^k)^\top s^k < 0$ (descent direction). Multiplying by $\alpha^k > 0$ gives $(\delta^k)^\top \gamma^k = \alpha^k (s^k)^\top \gamma^k > 0$ — the curvature condition holds at every iteration.
Armijo doesn’t. Backtracking Armijo only enforces (W1) — sufficient decrease on $f(x^{k+1})$ — and says nothing about $\nabla f(x^{k+1})^\top s^k$. So $(\delta^k)^\top \gamma^k > 0$ can fail, $B^{k+1}$ can become indefinite, the next $s^{k+1}$ may not be descent, and the umbrella convergence theorem’s hypothesis breaks down.
The BFGS method is the generic linesearch method with
\[s^k := -(B^k)^{-1}\nabla f(x^k)\]
where $B^k$ is updated by the BFGS formula on each iteration. @State a high-level result about its local convergence.
The BFGS method has local Q-superlinear convergence.
The BFGS method is the generic linesearch method with
\[s^k := -(B^k)^{-1}\nabla f(x^k)\]
where $B^k$ is updated by the BFGS formula on each iteration. @State a result about the convergence of $B^k$ to the Hessian.
Suppose:
- $f$ is a strictly convex quadratic, $f(x) = c + g^\top x + \tfrac{1}{2} x^\top H x$ with $H \in \mathbb R^{n \times n}$ symmetric positive definite — so $\nabla^2 f(x) \equiv H$ is constant in $x$
- the BFGS method uses exact linesearch
Then after at most $n$ iterations:
- $B^n = H$ (the BFGS approximation equals the true, constant Hessian of $f$); and
- $x^n = x^\ast = -H^{-1} g$ (the iterates terminate at the unique global minimiser).
@Describe how BFGS is implemented in practice using Cholesky factors, and why this is preferred over directly maintaining $B^k$.
In practice, BFGS maintains a Cholesky factor $L^k$ of $B^k$:
\[B^k = L^k(L^k)^\top, \qquad L^k \text{ lower triangular.}\]The rank-2 BFGS update for $B^k$ corresponds to a rank-2 update of $L^k$, computable in $O(n^2)$ flops.
Benefits over storing $B^k$ explicitly:
- Solving the Newton system $B^k s^k = -\nabla f(x^k)$ becomes two triangular solves: $L^k v = -\nabla f(x^k)$ (forward substitution), then $(L^k)^\top s^k = v$ (back substitution). Cost $O(n^2)$ instead of $O(n^3)$ for a generic dense solve.
- Positive-definiteness preservation is automatic: as long as $L^k$ stays nonsingular (no zero diagonal entry), $B^k = L^k (L^k)^\top$ is positive definite by construction. No need to verify PD-ness separately after each each update.
- Numerical stability: avoids forming $B^k$ or its inverse explicitly, which would amplify round-off from the secant update.
Cost summary: BFGS via Cholesky factors achieves $O(n^2)$ per iteration, matching the SMW (Sherman-Morrison-Woodbury) inverse-update approach but with better numerical stability.
Bite-sized
@State the four wishes for a quasi-Newton update $B^{k+1}$ of $B^k$.
- Computable from already-known quantities ($\nabla f(x^{k+1}), \nabla f(x^k), \ldots, B^k, s^k$) — no second derivatives.
- Symmetric and positive definite (so the resulting search direction $s^k = -(B^k)^{-1} \nabla f(x^k)$ is well-defined and descent).
- Cheap update of $B^k$: $O(n^2)$ per iteration, not $O(n^3)$ as for forming/inverting from scratch.
- Asymptotic convergence: $B^k \to \nabla^2 f(x^\ast)$ along the iterates, so the local rate approaches Newton’s.
These rule out “approximate the Hessian with finite differences each iteration” ($O(n^2)$ extra gradient evaluations) in favour of recurrence-based updates like ∆sr1-update and ∆bfgs-update-formula.
For BFGS to preserve positive definiteness of $B^k$, the curvature condition $(\delta^k)^\top \gamma^k > 0$ (where $\delta^k = x^{k+1} - x^k$, $\gamma^k = \nabla f(x^{k+1}) - \nabla f(x^k)$) must hold. This is enforced automatically by the Wolfe linesearch but not by backtracking Armijo alone — hence BFGS in practice is paired with Wolfe.
The lecturer motivates BFGS as the closest factored update meeting a linear secant-style constraint. Write $B^k = J^k (J^k)^\top$ with $J^k$ arbitrary nonsingular, choose $J^{k+1}$ to solve $\min _ J \ \vert J - J^k\ \vert _ F$ subject to $J \delta^k = \gamma^k$, then set $B^{k+1} := J^{k+1}(J^{k+1})^\top$. The Frobenius objective keeps the new factor close to $J^k$ — realising the wish-list requirement that $B^{k+1}$ stay close to $B^k$ (∆bite-quasi-newton-wish-list) — and the closed-form $J^{k+1}$ is a rank-1 update of $J^k$, so $B^{k+1}$ is a rank-2 update of $B^k$, which the lecturer identifies with the BFGS formula ∆bfgs-update-formula. In practice take $J^k := L^k$, the lower-triangular Cholesky factor of $B^k$: the update is then done directly on $L^k$ in $O(n^2)$, and $B^{k+1} = L^{k+1}(L^{k+1})^\top$ stays positive definite automatically whenever the curvature condition $(\delta^k)^\top \gamma^k > 0$ holds (∆bite-bfgs-curvature-condition).
Compare SR1 and BFGS quasi-Newton updates on three axes: rank, PD preservation, and reliability.
- Rank: SR1 is rank-1 ($B^{k+1} = B^k + uu^\top$, one outer product); BFGS is rank-2 (one term subtracted, one added).
- PD preservation: BFGS preserves PD when the curvature condition $\delta^\top \gamma > 0$ holds (enforced by Wolfe linesearch). SR1 does not preserve PD in general — $B^k$ may become indefinite, making $s^k$ ascent rather than descent.
- Reliability: SR1 has a denominator $\rho^k = ((\gamma - B\delta)^\top \delta)^{1/2}$ that can be near zero, causing the update to blow up; BFGS has more stable denominators. BFGS is preferred in practice; SR1 is used when an indefinite $B^k$ is acceptable (e.g. trust-region methods).
Cost: both $O(n^2)$ per iteration via Sherman-Morrison-Woodbury (∆sherman-morrison-woodbury).
Why is the per-iteration cost of quasi-Newton methods $O(n^2)$ rather than $O(n^3)$ as for Newton? Because the Sherman-Morrison-Woodbury formula lets us update $(B^k)^{-1}$ (or its Cholesky factor) by a low-rank perturbation in $O(n^2)$, rather than refactorising from scratch.