Continuous Optimisation HT26, Overview of results and methods


This is a map of every main result and method in the course.

Optimality-condition theorems

Theorem 1: First-order necessary condition (unconstrained)

For the unconstrained problem $\min _ {x \in \mathbb R^n} f(x)$ with $f \in \mathcal C^1$, every local minimiser is stationary:

\[x^\ast \text{ a local minimiser} \implies \nabla f(x^\ast) = 0.\]

The converse is false — stationary points also include maximisers and saddles — so this is only a first filter, sharpened by the second-order conditions below (and made sufficient in the convex case).

Source: Lecture 1b, Theorem 1 (First-order necessary conditions).

Lemma 1: Descent-direction lemma

If $\nabla f(x)^\top s < 0$ then $s$ is a genuine descent direction: $f(x + \alpha s) < f(x)$ for all sufficiently small $\alpha > 0$. This is the workhorse behind every linesearch method and the contradiction step of the optimality proofs.

Source: Lecture 1b, Lemma 1 (Descent directions).

Theorem 2: Second-order necessary condition (unconstrained)

If $x^\ast$ is a local minimiser of $f \in \mathcal C^2$ then, in addition to stationarity, the Hessian is positive semidefinite:

\[x^\ast \text{ a local minimiser} \implies \nabla^2 f(x^\ast) \succeq 0.\]

PSD (not PD) is the most we can conclude — e.g. $f(x)=x^3$ at $0$ satisfies it vacuously yet has no minimiser.

Source: Lecture 2, Theorem 2 (Second-order necessary conditions).

Theorem 3: Second-order sufficient condition (unconstrained)

If $\nabla f(x^\ast) = 0$ and $\nabla^2 f(x^\ast) \succ 0$ then $x^\ast$ is a strict local minimiser. The PD requirement is sufficient but not necessary ($f(x)=x^4$ at $0$ is a strict minimiser with $f''(0)=0$).

Source: Lecture 2, Theorem 3 (Second-order sufficient conditions).

Stationary points of quadratics

For $q(x) = g^\top x + \tfrac12 x^\top H x$ with $H$ symmetric, $\nabla q = Hx + g$ and $\nabla^2 q = H$, so stationary points solve the linear system $Hx + g = 0$. When $H$ is nonsingular the unique stationary point is $x^\ast = -H^{-1}g$, classified by the definiteness of $H$ (PD $\to$ min, ND $\to$ max, indefinite $\to$ saddle). This is the prototype for local behaviour near any critical point, since $f$ is locally quadratic there.

Source: Lecture 2 (Stationary points of quadratic functions).

Convexity: definitions and characterisations

A function $f$ on a convex domain is convex if the chord lies on or above the graph; for $f \in \mathcal C^1$ this is equivalent to the tangent lying on or below the graph, and for $f \in \mathcal C^2$ to $\nabla^2 f(x) \succeq 0$ everywhere:

\[f(y) \ge f(x) + \nabla f(x)^\top (y-x) \quad\Longleftrightarrow\quad \nabla^2 f(x) \succeq 0.\]

Convexity collapses the optimality hierarchy: every stationary point and every local minimiser is automatically a global minimiser.

Source: Lecture 1b (Optimality conditions for convex problems); Problem Sheet 1 Q C.1.

Theorem 16: Necessity of KKT under a constraint qualification

At a constrained minimiser of $\min f(x)$ s.t. $c _ E(x)=0,\ c _ I(x)\ge 0$, the KKT system (stationarity of the Lagrangian, primal/dual feasibility, complementary slackness) holds provided a constraint qualification makes the linearised feasible directions $\mathcal F(x^\ast)$ coincide with the tangent cone $T _ \Omega(x^\ast)$. The course proves only the equality-only case (the general inequality case via Farkas is non-carded and a pending to-do).

Source: Lectures 10-11, Theorem 16 (Necessity of first-order conditions).

Theorem 17: KKT necessary for linearly constrained problems

If every constraint is affine, $c(x) = Ax - b$, then a local minimiser is a KKT point with no constraint qualification required — affineness makes the first-order linearisation of the constraints exact, so the CQ is automatic.

Source: Lectures 10-11, Theorem 17 (Linearly constrained problems).

Theorem 18: KKT sufficient for convex programs

For a convex program ($f$ convex, each $c _ i$ ($i\in I$) concave, $c _ E$ affine) the feasible set $\Omega$ is convex and any KKT point is a global minimiser — for convex problems the first-order KKT conditions are both necessary and sufficient.

Source: Lectures 10-11, Theorem 18 (Sufficiency for convex problems).

Theorem 19: Second-order necessary conditions (constrained)

At a constrained local minimiser that is a KKT point with multipliers $(y^\ast,\lambda^\ast)$, the Lagrangian Hessian is PSD on the critical cone:

\[s^\top \nabla^2 _ {xx}\mathcal L(x^\ast,y^\ast,\lambda^\ast)\, s \ge 0 \qquad \text{for all } s \in F(\lambda^\ast).\]

It is the Lagrangian Hessian, not $\nabla^2 f$, because curvature of the constraints matters (the two-circles example: same linear $f$, min vs max distinguished only by $\nabla^2 _ {xx}\mathcal L$). The full proof is non-examinable (Lecture 11 handout).

Source: Lectures 10-11, Theorem 19 (Second-order necessary conditions; proof non-examinable).

Theorem 20: Second-order sufficient conditions (constrained)

If $(x^\ast,y^\ast,\lambda^\ast)$ satisfies KKT and $s^\top \nabla^2 _ {xx}\mathcal L\, s > 0$ for every nonzero $s$ in the critical cone, then $x^\ast$ is a constrained local minimiser. This is the constrained analogue of the unconstrained PD-Hessian sufficient condition.

  • Statement (sufficient)∆kkt-conditions-sufficient-second-order — KKT + PD Lagrangian Hessian on the critical cone $\implies$ constrained local min.
  • Proofno flashcard (non-examinable companion to Theorem 19; Lecture 11 handout).

Source: Lectures 10-11, Theorem 20 (Second-order sufficient conditions).

Quadratic programming and duality

A QP is $\min c^\top x + \tfrac12 x^\top H x$ s.t. $Ax = b,\ \tilde A x \ge \tilde b$; it is convex iff $H \succeq 0$, and its KKT system is linear-plus-complementarity. For a strictly convex QP with only inequality constraints the primal and dual optimal values coincide (strong duality).

Source: Lectures 10-11 (Quadratic programming; duality theory).

Convergence theorems

Lemma 2: Armijo holds under a Lipschitz gradient

If $\nabla f$ is $L$-Lipschitz and $s^k$ is a descent direction, the Armijo condition holds for every step in an explicit interval:

\[f(x^k+\alpha s^k) \le f(x^k) + \beta\alpha\nabla f(x^k)^\top s^k \quad \text{for all } \alpha \in [0,\alpha^k _ {\max}], \qquad \alpha^k _ {\max} = \frac{(\beta-1)\nabla f(x^k)^\top s^k}{L\ \vert s^k\ \vert ^2}.\]

This explicit interval is what makes the backtracking stepsize bound (Lemma 3) and hence global convergence quantitative.

Source: Lecture 3, Lemma 2.

Lemma 3: Backtracking-Armijo stepsize lower bound

The bArmijo linesearch terminates in finitely many steps and the accepted stepsize is bounded below:

\[\alpha^k \ge \min\{\alpha _ {(0)},\ \tau\,\alpha^k _ {\max}\}.\]

A short or absent lower bound here would break the telescoping global-convergence argument.

Source: Lecture 3, Lemma 3.

Theorem 4: Global convergence of the generic linesearch method (bArmijo)

With $f \in \mathcal C^1$ bounded below, $\nabla f$ Lipschitz, and descent directions $s^k$, the bArmijo GLM satisfies: either $\nabla f(x^l)=0$ for some $l$, or

\[\lim _ {k\to\infty} \min\!\left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{\ \vert s^k\ \vert },\ \vert \nabla f(x^k)^\top s^k \vert \right\} = 0.\]

The $\min$ form is exactly why $s^k$ must stay bounded away from orthogonality to $-\nabla f$ (the $\cos\theta _ k \ge \delta$ requirement) for genuine $\ \vert \nabla f\ \vert \to 0$.

Source: Lecture 4, Theorem 4.

Theorem 5: Global convergence of steepest descent

Steepest descent ($s^k=-\nabla f(x^k)$) with (b)Armijo and the Theorem 4 hypotheses gives $\nabla f(x^l)=0$ for some $l$ or $\ \vert \nabla f(x^k)\ \vert \to 0$. It is a one-line corollary of Theorem 4 with $\cos\theta _ k\equiv 1$.

Source: Lecture 5, Theorem 5.

Theorem 6: Local linear rate of steepest descent

Near a PD-Hessian minimiser, SD converges only linearly, with rate governed by the Hessian conditioning:

\[\rho \le \rho _ {\mathrm{SD}} = \frac{\kappa(x^\ast)-1}{\kappa(x^\ast)+1}, \qquad \kappa(x^\ast)=\frac{\lambda^\ast _ {\max}}{\lambda^\ast _ {\min}}.\]

Ill-conditioning ($\kappa \gg 1$) pushes $\rho _ {\mathrm{SD}} \to 1$ (Rosenbrock: $\kappa\approx 258$, $\rho _ {\mathrm{SD}}\approx 0.992$), the core weakness of SD.

Source: Lecture 5, Theorem 6.

Theorem 7: Local convergence of pure Newton

With $\nabla^2 f(x^\ast)$ nonsingular, $\nabla^2 f$ locally Lipschitz, and $x^{k _ 0}$ close enough to $x^\ast$, pure Newton is well-defined and converges quadratically:

\[\ \vert x^{k+1}-x^\ast\ \vert = O(\ \vert x^k-x^\ast\ \vert ^2).\]

The “close enough” neighbourhood depends on unknown problem constants, which is why pure Newton is wrapped in a globalisation in practice.

Source: Lecture 6, Theorem 7.

Theorem 8: Newton + bArmijo accepts the unit step

With $\beta < \tfrac12$ and $\alpha _ {(0)}=1$, backtracking-Armijo eventually accepts $\alpha^k=1$, so globalised Newton recovers the pure-Newton quadratic rate near $x^\ast$. The mechanism is the “half-decrease” identity $f(x^k+s^k)=f(x^k)+\tfrac12\nabla f(x^k)^\top s^k+o(\ \vert s^k\ \vert ^2)$.

Source: Lecture 6, Theorem 8; Problem Sheet 2 Q C.3.

Theorem 9: Global convergence of Newton + bArmijo

If the Hessian eigenvalues at the iterates are positive and uniformly bounded below away from zero, then bArmijo-Newton gives $\nabla f(x^l)=0$ for some $l$ or $\ \vert \nabla f(x^k)\ \vert \to 0$. (Strong convexity is a clean sufficient set of hypotheses.) Positivity makes the Newton step descent; the uniform lower bound controls $\ \vert s^k\ \vert $.

Source: Lecture 6, Theorem 9.

Theorem 10: Global convergence of a general second-order method

The Theorem 9 conclusion holds for any $s^k$ solving $B^k s^k = -\nabla f(x^k)$ when the eigenvalues of $B^k$ are uniformly bounded above and below away from zero (so the proof is structurally identical with $B^k$ in place of $\nabla^2 f$). This is the umbrella result covering quasi-Newton and modified-Newton globalisation.

Source: Lecture 6, Theorem 10.

Newton's method is scale invariant

Newton’s iterates commute with any nonsingular linear change of variables $y=Ax$: $s _ y = A s _ x$, hence $y+\alpha s _ y = A(x+\alpha s _ x)$. This is the structural reason Newton is immune to the conditioning problems that cripple steepest descent.

Source: Lecture 6 (Scale invariance of Newton’s method).

Modified / damped Newton

When $\nabla^2 f(x^k)$ is not PD the Newton direction need not be descent; solve $(\nabla^2 f(x^k)+M^k)s^k=-\nabla f(x^k)$ with $M^k$ chosen to make the matrix sufficiently PD (zero when already PD). Three standard recipes: spectral, smallest-eigenvalue shift, modified Cholesky.

Source: Lecture 6 (Modified Newton methods).

Quasi-Newton (SR1, BFGS) convergence properties

BFGS preserves positive-definiteness under the curvature condition $(\delta^k)^\top\gamma^k>0$ (enforced by Wolfe, not by Armijo alone), converges Q-superlinearly locally, and terminates in $n$ steps on a strictly convex quadratic with exact linesearch. SR1 is rank-1 and does not preserve PD.

Source: Lecture 7 (Quasi-Newton methods); Problem Sheet 3 Q B.1.

Gauss-Newton convergence (nonlinear least-squares)

For $\min \tfrac12\ \vert r(x)\ \vert ^2$, Gauss-Newton uses $\widetilde{\nabla^2}f=J^\top J$. It converges globally under a uniformly full-rank Jacobian and Lipschitz $\nabla f$ (to a stationary point of $f$, not necessarily $r=0$); locally it is quadratic in the zero-residual case (reduces to Theorem 7) and only linear in the nonzero-residual case (rate $ \vert \lambda \vert $, $\lambda$ the top eigenvalue of $(J^\top J)^{-1}\sum _ j r _ j\nabla^2 r _ j$).

Source: Lecture 7 (Gauss-Newton method); Problem Sheet 3 Q B.2.

Theorem 14: Global convergence of the trust-region method

With $f \in \mathcal C^2$ bounded below, $\nabla f$ Lipschitz, and the Cauchy-decrease condition, the generic trust-region method gives $\nabla f(x^k)=0$ for some $k$ or $\liminf _ {k\to\infty}\ \vert \nabla f(x^k)\ \vert =0$ (only $\liminf$, not $\lim$). It rests on the Cauchy model-decrease lemma (Lemma 12) and the trust-region radius lower bound (Lemma 13).

Source: Lectures 8-9, Theorem 14 (with Lemmas 12-13); complete proof non-examinable.

Trust-region method with arbitrary $B^k$

The Theorem 14 conclusion still holds with an arbitrary symmetric model Hessian $B^k$ provided $\ \vert B^k\ \vert \le \kappa _ B$; the convergence constants merely weaken to $c/(L+\kappa _ B)$ (Sheet 3 Q B.4).

  • Statement∆gk-with-arbitrary-bk — convergence under $\ \vert B^k\ \vert \le\kappa _ B$, weakened constants.
  • Proofno flashcard (Theorem 14 proof with $\nabla^2 f\to B^k$; Sheet 3 Q B.4).

Source: Lectures 8-9; Problem Sheet 3 Q B.4.

Trust-region subproblem: global solution

$s^\ast$ solves $\min _ {\ \vert s\ \vert \le\Delta} g^\top s+\tfrac12 s^\top H s$ iff there is $\lambda^\ast \ge 0$ with

\[(H+\lambda^\ast I)s^\ast = -g, \qquad H+\lambda^\ast I \succeq 0, \qquad \lambda^\ast(\ \vert s^\ast\ \vert -\Delta)=0,\]

and the solution is unique when $H+\lambda^\ast I \succ 0$. The secular equation $1/\ \vert s(\lambda)\ \vert - 1/\Delta = 0$ turns this into a 1-D rootfind; the hard case is when $g \perp$ the $\lambda _ {\min}$-eigenspace.

Source: Lecture 9 (Solving the trust-region subproblem); Problem Sheet 3 Q B.3.

Theorem 21: Convergence of the quadratic penalty method

For $\min f(x)$ s.t. $c(x)=0$ solved via $\Phi _ \sigma=f+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$ with $\sigma^k\to 0$ and approximate stationarity, if $x^k\to x^\ast$ with $\{\nabla c _ i(x^\ast)\}$ linearly independent (LICQ) then $x^\ast$ is a KKT point and the estimates $y^k=-c(x^k)/\sigma^k \to y^\ast$. LICQ is what makes the pseudo-inverse $J^+=(JJ^\top)^{-1}J$ well-defined; without $\sigma^k\to 0$ the limit need not be feasible.

Source: Lecture 12, Theorem 21.

Theorem 22: Convergence of the augmented Lagrangian method

For $\Phi=f-u^\top c+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$ with multiplier estimate $y^k=u^k-c(x^k)/\sigma^k$, under LICQ the multipliers converge and Lagrangian stationarity holds; the limit is a KKT point if either $\sigma^k\to 0$ (bounded $u^k$) or $u^k\to y^\ast$ (bounded $\sigma^k$). The second case is the distinctive advantage: KKT without driving $\sigma\to 0$, so the Hessian stays well-conditioned.

Source: Lecture 13, Theorem 22.

Theorem 27: Existence of the central path

For $\min f(x)$ s.t. $c(x)\ge 0$, under strict complementarity, LICQ, and a second-order sufficient condition at $x^\ast$, there is a unique $\mathcal C^1$ central path $\mu\mapsto x(\mu)$ of barrier minimisers for small $\mu>0$, with $x(\mu)\to x^\ast$ as $\mu\to 0$. (Nonconvex problems can have one central path per local minimiser.)

Source: Lectures 14-15, Theorem 27.

Theorem 28: Global convergence of the barrier method

Under LICQ, with $\mu^k\to 0$ and approximate stationarity of $f _ {\mu^k}$, the barrier iterates converge to a KKT point of (iCP) and the estimates $\lambda^k=\mu^k/c(x^k)\to\lambda^\ast$. The proof splits active/inactive indices: inactive multipliers vanish, active ones converge via the active-Jacobian pseudo-inverse.

Source: Lectures 14-15, Theorem 28.

Supporting results (Useful miscellany)

Auxiliary theorems repeatedly used inside the convergence proofs above.

Source: Lecture 5 / Problem Sheets 2-3 (Rayleigh-Ritz, descent lemma, SMW).

Methods

Generic linesearch method (GLM)

Iterate $x^{k+1}=x^k+\alpha^k s^k$ with a descent direction $s^k$ and a stepsize rule for $\alpha^k$. The rule must keep $\alpha^k$ neither too long (Armijo) nor too short (Wolfe / Goldstein curvature side); convergence is Theorem 4, resting on Lemmas 2-3.

Source: Lectures 3-4 (Generic linesearch methods).

Steepest descent

GLM with $s^k=-\nabla f(x^k)$ — the direction that most decreases the linear model on a fixed-norm sphere. Globally convergent (Theorem 5) but only linearly local (Theorem 6), and scale-dependent.

Source: Lecture 5 (Steepest descent).

Newton's method

GLM with $s^k$ solving $\nabla^2 f(x^k)s^k=-\nabla f(x^k)$ — minimiser of the second-order model. Quadratic local rate (Theorem 7), unit step under bArmijo (Theorem 8), global under bounded PD Hessian (Theorem 9), scale invariant; globalised by linesearch or trust region.

Source: Lecture 6 (Newton’s method for unconstrained optimisation).

Quasi-Newton (SR1, BFGS)

GLM with $s^k=-(B^k)^{-1}\nabla f(x^k)$ where $B^k$ approximates the Hessian via the secant equation $\gamma^k=B^{k+1}\delta^k$, updated cheaply ($O(n^2)$ via Sherman-Morrison-Woodbury). BFGS is the rank-2 PD-preserving update; SR1 is rank-1 and may be indefinite.

Source: Lecture 7 (Quasi-Newton methods).

Gauss-Newton (nonlinear least-squares)

For $\min\tfrac12\ \vert r(x)\ \vert ^2$, approximate $\nabla^2 f\approx J^\top J$ and solve a linear least-squares subproblem $\min _ s\tfrac12\ \vert J(x^k)s+r(x^k)\ \vert ^2$ each step.

Source: Lecture 7 (Nonlinear least-squares and Gauss-Newton).

Trust-region method

Minimise a local quadratic model inside $\ \vert s\ \vert \le\Delta _ k$; accept/reject by the actual-to-predicted ratio $\rho _ k$, adjusting $\Delta _ k$. Globally convergent (Theorem 14) via the Cauchy point; subproblem solved by the secular equation.

Source: Lectures 8-9 (Trust-region methods).

Quadratic penalty method (equality constrained)

Minimise $\Phi _ \sigma=f+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$ for a decreasing $\sigma^k\to 0$. Converges to a KKT point under LICQ (Theorem 21) but the Hessian is $O(1/\sigma)$ ill-conditioned; an auxiliary-variable Newton system computes the step cleanly.

Source: Lecture 12 (Penalty methods).

Augmented Lagrangian method (equality constrained)

Minimise $\Phi=f-u^\top c+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$, updating the multiplier estimate $u$ and penalty $\sigma$. Converges to a KKT point (Theorem 22) and — unlike the pure penalty — can do so with $\sigma$ bounded, limiting ill-conditioning.

Source: Lecture 13 (Augmented Lagrangian methods).

Barrier / interior-point methods (inequality constrained)

Replace $c(x)\ge 0$ by the log barrier $f _ \mu=f-\mu\sum\log c _ i$ and follow the central path as $\mu\to 0$ (Theorems 27-28). The basic primal method is ill-conditioned and has poor warm-starts; primal-dual methods fix this by making the multipliers explicit Newton variables (perturbed KKT $c _ i\lambda _ i=\mu$). For LP, IPMs are polynomial-time vs simplex’s exponential worst case.

Source: Lectures 14-15 (Interior point methods).