This is a map of every main result and method in the course.
Optimality-condition theorems
Theorem 1: First-order necessary condition (unconstrained)
For the unconstrained problem $\min _ {x \in \mathbb R^n} f(x)$ with $f \in \mathcal C^1$, every local minimiser is stationary:
\[x^\ast \text{ a local minimiser} \implies \nabla f(x^\ast) = 0.\]The converse is false — stationary points also include maximisers and saddles — so this is only a first filter, sharpened by the second-order conditions below (and made sufficient in the convex case).
- Statement∆first-order-necessary-condition-statement — local minimiser $\implies \nabla f(x^\ast)=0$.
- Characterisation∆stationary-point-iff-gradient-zero — $x$ stationary $\iff \nabla f(x)=0$ (the defining equivalence).
- Why not sufficient∆bite-first-order-conditions-do-not-classify — also holds at maximisers/saddles; needs Hessian info, except in the convex case.
- Proof— no flashcard (3-line contradiction from Lecture 1a using ∆negative-inner-product-implies-descent-direction; pending course to-do item).
Source: Lecture 1b, Theorem 1 (First-order necessary conditions).
Lemma 1: Descent-direction lemma
If $\nabla f(x)^\top s < 0$ then $s$ is a genuine descent direction: $f(x + \alpha s) < f(x)$ for all sufficiently small $\alpha > 0$. This is the workhorse behind every linesearch method and the contradiction step of the optimality proofs.
- Statement∆negative-inner-product-implies-descent-direction — $\nabla f(x)^\top s<0 \implies f$ strictly decreases along $s$ for small $\alpha$.
- Definition∆descent-direction — $s$ is a descent direction iff $\nabla f(x)^\top s < 0$ (cloze form: ∆bite-descent-direction-definition).
- Negative gradient is descent∆negative-gradient-descent-direction-when-nonzero — $-\nabla f(x)$ is descent whenever $\nabla f(x) \ne 0$.
- Proof∆descent-direction-lemma-proof — continuity of $\nabla f$ gives $\bar\alpha$ with $\nabla f(x+\alpha s)^\top s<0$ on $[0,\bar\alpha]$; first-order Taylor closes it.
- Strategy∆bite-descent-direction-lemma-strategy — continuity-extension + mean-value Taylor + combine.
Source: Lecture 1b, Lemma 1 (Descent directions).
Theorem 2: Second-order necessary condition (unconstrained)
If $x^\ast$ is a local minimiser of $f \in \mathcal C^2$ then, in addition to stationarity, the Hessian is positive semidefinite:
\[x^\ast \text{ a local minimiser} \implies \nabla^2 f(x^\ast) \succeq 0.\]PSD (not PD) is the most we can conclude — e.g. $f(x)=x^3$ at $0$ satisfies it vacuously yet has no minimiser.
- Statement∆second-order-necessary-condition-statement — local minimiser $\implies \nabla^2 f(x^\ast)\succeq 0$.
- Counterexample (PSD not sufficient)∆x-cubed-second-order-necessary-counterexample — $f(x)=x^3$ at $0$: $f'(0)=f''(0)=0$, not a minimiser.
- PSD vs PD gap∆bite-second-order-psd-vs-pd-asymmetry — the $x^3$/$x^4$ pair illustrating necessary-vs-sufficient.
- Proof∆second-order-necessary-condition-proof — contradiction: assume $s^\top\nabla^2 f s<0$, extend by continuity, reduced 2nd-order Taylor (linear term killed by $\nabla f(x^\ast)=0$).
- Strategy∆bite-second-order-necessary-strategy — contradiction + continuity-extension + Taylor.
- Key identity∆bite-second-order-necessary-key-inequality-reduced-taylor — $f(x^\ast+\alpha s)=f(x^\ast)+\tfrac{\alpha^2}{2}s^\top\nabla^2 f(\cdot)s$.
- Hypothesis role∆bite-second-order-necessary-hypothesis-first-order-zero — where $\nabla f(x^\ast)=0$ enters.
Source: Lecture 2, Theorem 2 (Second-order necessary conditions).
Theorem 3: Second-order sufficient condition (unconstrained)
If $\nabla f(x^\ast) = 0$ and $\nabla^2 f(x^\ast) \succ 0$ then $x^\ast$ is a strict local minimiser. The PD requirement is sufficient but not necessary ($f(x)=x^4$ at $0$ is a strict minimiser with $f''(0)=0$).
- Statement∆second-order-sufficient-condition-statement — $\nabla f(x^\ast)=0,\ \nabla^2 f(x^\ast)\succ 0 \implies$ strict local min.
- Counterexample (PD not necessary)∆x-fourth-second-order-sufficient-counterexample — $f(x)=x^4$ at $0$.
- Strict vs non-strict∆bite-second-order-sufficient-gives-strict — sufficient gives strict; necessary only PSD.
- Proof∆second-order-sufficient-condition-proof — PD + continuity gives a neighbourhood with $\nabla^2 f\succ 0$; 2nd-order Taylor at $\alpha=1$.
- Strategy∆bite-second-order-sufficient-strategy — neighbourhood-of-PD + Taylor + interior intermediate point.
Source: Lecture 2, Theorem 3 (Second-order sufficient conditions).
Stationary points of quadratics
For $q(x) = g^\top x + \tfrac12 x^\top H x$ with $H$ symmetric, $\nabla q = Hx + g$ and $\nabla^2 q = H$, so stationary points solve the linear system $Hx + g = 0$. When $H$ is nonsingular the unique stationary point is $x^\ast = -H^{-1}g$, classified by the definiteness of $H$ (PD $\to$ min, ND $\to$ max, indefinite $\to$ saddle). This is the prototype for local behaviour near any critical point, since $f$ is locally quadratic there.
- Nonsingular case∆quadratic-stationary-point-nonsingular — $x^\ast=-H^{-1}g$, classified by $H$’s definiteness.
- Singular case∆quadratic-stationary-point-singular — solvable iff $Hx+g=0$ is consistent.
- Gradient & Hessian∆bite-quadratic-gradient-and-hessian — $\nabla q=Hx+g$, $\nabla^2 q=H$.
- As a linear system∆bite-quadratic-stationary-linear-system — stationary points $\iff Hx+g=0$.
- Local quadratic model∆bite-local-quadratic-near-stationary-point — near $x^\ast$, $f\approx f(x^\ast)+\tfrac12(x-x^\ast)^\top\nabla^2 f(x^\ast)(x-x^\ast)$.
Source: Lecture 2 (Stationary points of quadratic functions).
Convexity: definitions and characterisations
A function $f$ on a convex domain is convex if the chord lies on or above the graph; for $f \in \mathcal C^1$ this is equivalent to the tangent lying on or below the graph, and for $f \in \mathcal C^2$ to $\nabla^2 f(x) \succeq 0$ everywhere:
\[f(y) \ge f(x) + \nabla f(x)^\top (y-x) \quad\Longleftrightarrow\quad \nabla^2 f(x) \succeq 0.\]Convexity collapses the optimality hierarchy: every stationary point and every local minimiser is automatically a global minimiser.
- Zeroth-order definition∆convex-function-definition — chord-above-graph inequality + geometric reading.
- First-order characterisation (+ proof)∆convexity-first-order — tangent-below-graph; iff convex.
- Second-order characterisation (+ proof)∆convexity-second-order — $\nabla^2 f\succeq 0$ everywhere iff convex.
- Stationary $\implies$ global∆bite-convex-stationary-implies-global — first-order conditions are already sufficient when convex.
- Local $\implies$ global∆bite-convex-local-implies-global — every local min is global.
- Convex quadratic∆bite-convex-quadratic-psd-hessian — $H\succeq 0 \implies q$ convex; stationary $\implies$ global.
- Strategies∆bite-convexity-first-order-strategy, ∆bite-convexity-second-order-strategy — difference-quotient limit; Taylor-minus-first-order.
Source: Lecture 1b (Optimality conditions for convex problems); Problem Sheet 1 Q C.1.
Theorem 16: Necessity of KKT under a constraint qualification
At a constrained minimiser of $\min f(x)$ s.t. $c _ E(x)=0,\ c _ I(x)\ge 0$, the KKT system (stationarity of the Lagrangian, primal/dual feasibility, complementary slackness) holds provided a constraint qualification makes the linearised feasible directions $\mathcal F(x^\ast)$ coincide with the tangent cone $T _ \Omega(x^\ast)$. The course proves only the equality-only case (the general inequality case via Farkas is non-carded and a pending to-do).
- KKT definition∆kkt-conditions — feasibility, stationarity, dual feasibility, complementary slackness.
- Lagrangian∆lagrangian-of-constrained-optimisation-problem — $\mathcal L=f-y^\top c _ E-\lambda^\top c _ I$; KKT $\implies \nabla _ x\mathcal L=0$.
- Necessity statement∆kkt-necessity-theorem-statement — under a CQ, local min $\implies$ KKT point.
- Why a CQ∆constraint-qualifications — sufficient condition for $T _ \Omega(x)=\mathcal F(x)$.
- SCQ / LICQ∆scq, ∆licq — Slater (affine $c _ E$, strict interior) and linear-independence CQs.
- Tangent cone / linearised directions∆tangent-cone, ∆set-of-linearised-feasible-directions — geometric vs algebraic feasible sets.
- CQ-failure example∆constraint-qualifications-failure-example — circle $\cap$ half-plane cusp where SCQ & LICQ both fail.
- Proof (equality-only)∆kkt-necessity-equality-only-proof — feasible $\mathcal C^1$ path, linearise $c _ E$ ($s\in\ker J _ E$), expand $f$, rank-nullity ($\nabla f\in\operatorname{Range}J _ E^\top$).
- Strategy∆bite-kkt-necessity-equality-strategy — path setup + constraint linearisation + $f$-expansion + linear algebra.
- Proof (general inequality / Farkas)— no flashcard (Sheet 4 Q C.1; pending course to-do item).
Source: Lectures 10-11, Theorem 16 (Necessity of first-order conditions).
Theorem 17: KKT necessary for linearly constrained problems
If every constraint is affine, $c(x) = Ax - b$, then a local minimiser is a KKT point with no constraint qualification required — affineness makes the first-order linearisation of the constraints exact, so the CQ is automatic.
- Statement∆kkt-necessary-for-linearly-constrained-problem — affine constraints $\implies$ local min is KKT, no CQ.
- KKT form (linear constraints)∆kkt-conditions-for-linearly-constrained-problem — $\nabla f=A _ E^\top y+A _ I^\top\lambda$, feasibility, $\lambda\ge 0$, complementarity.
- Proof— no flashcard (the Theorem 16 Taylor argument with exact linearisation; pending course to-do item).
Source: Lectures 10-11, Theorem 17 (Linearly constrained problems).
Theorem 18: KKT sufficient for convex programs
For a convex program ($f$ convex, each $c _ i$ ($i\in I$) concave, $c _ E$ affine) the feasible set $\Omega$ is convex and any KKT point is a global minimiser — for convex problems the first-order KKT conditions are both necessary and sufficient.
- Convex program definition∆convex-programming-problem — $f$ convex, $c _ I$ concave, $c _ E$ affine (+ generality caveat).
- Feasible set is convex∆convex-program-implies-feasible-set-convex — $\Omega$ convex.
- Sufficiency statement∆kkt-condition-sufficient-when-convex — convex program: KKT $\implies$ global minimiser.
- Convex sufficiency (bite)∆bite-convex-kkt-sufficient — KKT not just necessary but sufficient when convex.
- Proof (Ω convex)∆convex-program-feasible-set-convex-proof — affine $c _ E$ + concave $c _ I$ + convex combination.
- Proof (KKT sufficient)∆kkt-condition-sufficient-when-convex-proof — first-order convexity + KKT stationarity + concavity + complementarity.
- Strategies∆bite-convex-program-feasible-set-strategy, ∆bite-convex-kkt-sufficient-strategy — convex-combination; convexity-inequality + KKT substitution.
Source: Lectures 10-11, Theorem 18 (Sufficiency for convex problems).
Theorem 19: Second-order necessary conditions (constrained)
At a constrained local minimiser that is a KKT point with multipliers $(y^\ast,\lambda^\ast)$, the Lagrangian Hessian is PSD on the critical cone:
\[s^\top \nabla^2 _ {xx}\mathcal L(x^\ast,y^\ast,\lambda^\ast)\, s \ge 0 \qquad \text{for all } s \in F(\lambda^\ast).\]It is the Lagrangian Hessian, not $\nabla^2 f$, because curvature of the constraints matters (the two-circles example: same linear $f$, min vs max distinguished only by $\nabla^2 _ {xx}\mathcal L$). The full proof is non-examinable (Lecture 11 handout).
- Why second-order needed∆why-second-order-conditions-needed-constrained — along feasible $s$, first-order gives $s^\top\nabla f\ge 0$; ties need curvature.
- Statement (necessary)∆second-order-necessary-conditions-constrained — $s^\top\nabla^2 _ {xx}\mathcal L\,s\ge 0$ on the critical cone.
- Critical cone∆critical-cone — $F(\lambda^\ast)$: feasible directions with $s^\top\nabla c _ i=0$ for active $i$ with $\lambda _ i^\ast>0$.
- Why $\nabla^2 _ {xx}\mathcal L$ not $\nabla^2 f$∆why-lagrangian-hessian-example — two-circles example, min vs max via multiplier sign.
- Critical-cone purpose∆bite-critical-cone-purpose — curvature only matters where the first-order term is tight.
- Proof— no flashcard (non-examinable; Lecture 11 Theorem 19 handout).
Source: Lectures 10-11, Theorem 19 (Second-order necessary conditions; proof non-examinable).
Theorem 20: Second-order sufficient conditions (constrained)
If $(x^\ast,y^\ast,\lambda^\ast)$ satisfies KKT and $s^\top \nabla^2 _ {xx}\mathcal L\, s > 0$ for every nonzero $s$ in the critical cone, then $x^\ast$ is a constrained local minimiser. This is the constrained analogue of the unconstrained PD-Hessian sufficient condition.
- Statement (sufficient)∆kkt-conditions-sufficient-second-order — KKT + PD Lagrangian Hessian on the critical cone $\implies$ constrained local min.
- Proof— no flashcard (non-examinable companion to Theorem 19; Lecture 11 handout).
Source: Lectures 10-11, Theorem 20 (Second-order sufficient conditions).
Quadratic programming and duality
A QP is $\min c^\top x + \tfrac12 x^\top H x$ s.t. $Ax = b,\ \tilde A x \ge \tilde b$; it is convex iff $H \succeq 0$, and its KKT system is linear-plus-complementarity. For a strictly convex QP with only inequality constraints the primal and dual optimal values coincide (strong duality).
- QP definition∆quadratic-programming-problem-definition — $\min c^\top x+\tfrac12 x^\top Hx$ s.t. $Ax=b,\tilde Ax\ge\tilde b$.
- QP KKT conditions∆qp-kkt-conditions — $H\hat x+c=A^\top\hat y+\tilde A^\top\hat\lambda$, feasibility, $\hat\lambda\ge 0$, complementarity.
- QP convex when∆bite-qp-convex-when-h-psd — convex $\iff H\succeq 0$.
- QP duality∆qp-duality-statement — strictly convex QP: primal and dual optima coincide.
- Proof of strong duality— no flashcard (Lectures 10-11; statement only is carded).
Source: Lectures 10-11 (Quadratic programming; duality theory).
Convergence theorems
Lemma 2: Armijo holds under a Lipschitz gradient
If $\nabla f$ is $L$-Lipschitz and $s^k$ is a descent direction, the Armijo condition holds for every step in an explicit interval:
\[f(x^k+\alpha s^k) \le f(x^k) + \beta\alpha\nabla f(x^k)^\top s^k \quad \text{for all } \alpha \in [0,\alpha^k _ {\max}], \qquad \alpha^k _ {\max} = \frac{(\beta-1)\nabla f(x^k)^\top s^k}{L\ \vert s^k\ \vert ^2}.\]This explicit interval is what makes the backtracking stepsize bound (Lemma 3) and hence global convergence quantitative.
- Statement∆armijo-condition-satisifed-when-lipschitz — Armijo on $[0,\alpha^k _ {\max}]$ with the explicit $\alpha^k _ {\max}$.
- Proof∆armijo-condition-satisfied-when-lipschitz-proof — Taylor + Cauchy-Schwarz + Lipschitz $\to$ $\alpha^2 L\ \vert s\ \vert ^2$ bound, then solve.
- Strategy∆bite-lipschitz-armijo-strategy — mean-value Taylor + ± gradient + CS + Lipschitz + solve.
- Key inequality∆bite-lipschitz-armijo-key-inequality-quadratic-bound — the $f(x^k+\alpha s)\le f(x^k)+\alpha\nabla f^\top s+\alpha^2 L\ \vert s\ \vert ^2$ bound.
- Hypothesis roles∆bite-lipschitz-armijo-hypothesis-lipschitz-role, ∆bite-lipschitz-armijo-hypothesis-beta-range — where Lipschitz / $\beta\in(0,1)$ enter.
Source: Lecture 3, Lemma 2.
Lemma 3: Backtracking-Armijo stepsize lower bound
The bArmijo linesearch terminates in finitely many steps and the accepted stepsize is bounded below:
\[\alpha^k \ge \min\{\alpha _ {(0)},\ \tau\,\alpha^k _ {\max}\}.\]A short or absent lower bound here would break the telescoping global-convergence argument.
- Statement∆armijo-stepsize-bound — $\alpha^k\ge\min\{\alpha _ {(0)},\tau\alpha^k _ {\max}\}$.
- Finite termination∆barmijo-terminates-finite-steps — descent $s^k \implies$ bArmijo halts in finitely many backtracks.
- Proof∆armijo-backtracking-bound — two cases on whether $\alpha _ {(0)}$ is accepted; minimality of first accepted index.
- Strategy∆bite-armijo-backtracking-strategy — case split + “one step too far” $\alpha^k>\tau\alpha^k _ {\max}$.
- Key inequality∆bite-armijo-backtracking-key-inequality-tau-alphamax — $\alpha^k>\tau\alpha^k _ {\max}$.
Source: Lecture 3, Lemma 3.
Theorem 4: Global convergence of the generic linesearch method (bArmijo)
With $f \in \mathcal C^1$ bounded below, $\nabla f$ Lipschitz, and descent directions $s^k$, the bArmijo GLM satisfies: either $\nabla f(x^l)=0$ for some $l$, or
\[\lim _ {k\to\infty} \min\!\left\{\frac{ \vert \nabla f(x^k)^\top s^k \vert }{\ \vert s^k\ \vert },\ \vert \nabla f(x^k)^\top s^k \vert \right\} = 0.\]The $\min$ form is exactly why $s^k$ must stay bounded away from orthogonality to $-\nabla f$ (the $\cos\theta _ k \ge \delta$ requirement) for genuine $\ \vert \nabla f\ \vert \to 0$.
- Statement∆global-convergence-of-backtracking-armijo — the $\min\{\cdot,\cdot\}\to 0$ dichotomy.
- Intuition∆global-convergence-of-backtracking-armijo-intuition — $\cos\theta _ k$ reading; need $\cos\theta _ k\ge\delta>0$.
- Proof∆global-convergence-of-backtracking-armijo-proof — per-step Armijo decrease, telescope (bounded below), split $\mathcal K _ 1/\mathcal K _ 2$.
- Strategy∆bite-glm-bArmijo-convergence-strategy — decrease + telescope + stepsize-regime split.
- Key inequalities∆bite-glm-bArmijo-convergence-key-inequality-per-step-decrease, ∆bite-glm-bArmijo-convergence-key-inequality-K1-bound — per-step decrease; $\mathcal K _ 1$ lower bound.
- Hypothesis role∆bite-glm-bArmijo-convergence-hypothesis-bounded-below — where “bounded below” enters (telescoping).
Source: Lecture 4, Theorem 4.
Theorem 5: Global convergence of steepest descent
Steepest descent ($s^k=-\nabla f(x^k)$) with (b)Armijo and the Theorem 4 hypotheses gives $\nabla f(x^l)=0$ for some $l$ or $\ \vert \nabla f(x^k)\ \vert \to 0$. It is a one-line corollary of Theorem 4 with $\cos\theta _ k\equiv 1$.
- Statement∆steepest-descent-global-convergence — SD + (b)Armijo: $\ \vert \nabla f(x^k)\ \vert \to 0$.
- Definition / descent∆steepest-descent-definition, ∆steepest-descent-direction-when-descent — $s^k=-\nabla f$; descent iff $\nabla f\ne 0$.
- Proof∆steepest-descent-global-convergence-proof — apply Theorem 4 with $s^k=-\nabla f(x^k)$.
Source: Lecture 5, Theorem 5.
Theorem 6: Local linear rate of steepest descent
Near a PD-Hessian minimiser, SD converges only linearly, with rate governed by the Hessian conditioning:
\[\rho \le \rho _ {\mathrm{SD}} = \frac{\kappa(x^\ast)-1}{\kappa(x^\ast)+1}, \qquad \kappa(x^\ast)=\frac{\lambda^\ast _ {\max}}{\lambda^\ast _ {\min}}.\]Ill-conditioning ($\kappa \gg 1$) pushes $\rho _ {\mathrm{SD}} \to 1$ (Rosenbrock: $\kappa\approx 258$, $\rho _ {\mathrm{SD}}\approx 0.992$), the core weakness of SD.
- Statement∆steepest-descent-local-linear-rate-via-condition-number — linear rate $\rho\le(\kappa-1)/(\kappa+1)$.
- Rate formula∆bite-steepest-descent-rho-sd-formula — $\rho _ {\mathrm{SD}}=(\kappa-1)/(\kappa+1)$.
- Condition number∆bite-condition-number-spd-definition — $\kappa(A)=\lambda _ {\max}/\lambda _ {\min}$.
- Why slow in practice∆steepest-descent-asymptotic-rate-near-one — $\rho$ often $\approx 1$.
- Rosenbrock numbers∆bite-steepest-descent-rosenbrock-numbers — $\kappa\approx 258$, $\rho _ {\mathrm{SD}}\approx 0.992$.
- Poor-scaling example & fix∆steepest-descent-poorly-scaled-example, ∆steepest-descent-rescaling-fix-example — $\tfrac12(ax _ 1^2+x _ 2^2)$; rescale to one-step convergence.
- Disadvantages∆bite-steepest-descent-three-disadvantages — scale-dependent, slow linear, no 2nd-order guarantee.
- Proof— no flashcard (rate quoted; ill-conditioning shown via the poorly-scaled example).
Source: Lecture 5, Theorem 6.
Theorem 7: Local convergence of pure Newton
With $\nabla^2 f(x^\ast)$ nonsingular, $\nabla^2 f$ locally Lipschitz, and $x^{k _ 0}$ close enough to $x^\ast$, pure Newton is well-defined and converges quadratically:
\[\ \vert x^{k+1}-x^\ast\ \vert = O(\ \vert x^k-x^\ast\ \vert ^2).\]The “close enough” neighbourhood depends on unknown problem constants, which is why pure Newton is wrapped in a globalisation in practice.
- Statement∆newtons-method-convergence — well-defined iterates; $x^k\to x^\ast$, $\nabla f(x^k)\to 0$ both quadratically.
- Newton direction / pure Newton∆newton-direction, ∆pure-newtons-method — $s^k=-(\nabla^2 f)^{-1}\nabla f$; $\alpha^k\equiv 1$.
- Root-finding view∆newton-optimisation-as-root-finding-of-gradient, ∆bite-hessian-is-jacobian-of-gradient — Newton on $\nabla f=0$; Jacobian $=\nabla^2 f$.
- Proof∆newtons-method-convergence-proof — Taylor-expand $\nabla f$ at $x^k$, evaluate at $x^\ast$, invert $\nabla^2 f(x^k)$, identify Newton step.
- Strategy∆bite-newton-local-convergence-strategy — Taylor-on-gradient skeleton.
- Key identity / hypotheses∆bite-newton-local-convergence-key-inequality-taylor-on-grad, ∆bite-newton-local-convergence-hypothesis-lipschitz-hessian, ∆bite-newton-local-convergence-hypothesis-hessian-nonsingular — $O(\ \vert \cdot\ \vert ^2)$ remainder; Lipschitz $\to$ quadratic vs continuity $\to$ superlinear; nonsingularity well-defines the step.
- Why not a practical $x^0$ recipe∆bite-newton-local-neighbourhood-unknown — basin depends on unknown constants.
- Failure example∆pure-newton-failure-oscillation-example — degree-6 $f$, $x^0=1$ cycles $\pm 1$.
- Advantages / disadvantages∆newtons-method-advantages-disadvantages — quadratic, scale-invariant, auto step vs singular/saddle/far-start.
- Per-iteration cost∆bite-newton-cost-per-iteration — $O(n^3)$ dense solve.
Source: Lecture 6, Theorem 7.
Theorem 8: Newton + bArmijo accepts the unit step
With $\beta < \tfrac12$ and $\alpha _ {(0)}=1$, backtracking-Armijo eventually accepts $\alpha^k=1$, so globalised Newton recovers the pure-Newton quadratic rate near $x^\ast$. The mechanism is the “half-decrease” identity $f(x^k+s^k)=f(x^k)+\tfrac12\nabla f(x^k)^\top s^k+o(\ \vert s^k\ \vert ^2)$.
- Statement∆newtons-method-using-backtracking-armijo-convergence — $\beta<\tfrac12,\alpha _ {(0)}=1 \implies \alpha^k=1$ eventually, quadratic.
- Damped Newton name∆bite-damped-newton-name — Newton + linesearch = damped Newton.
- Proof∆newtons-method-using-backtracking-armijo-convergence-proof — half-decrease identity; Armijo at $\alpha=1$ reduces to $(\tfrac12-\beta)\nabla f^\top s+o\le 0$; then Theorem 7.
- Strategy∆bite-newton-barmijo-quadratic-strategy — unit-step-accepted + pure-Newton-rate.
- Key identity / hypothesis∆bite-newton-barmijo-quadratic-key-inequality-half-decrease, ∆bite-newton-barmijo-quadratic-hypothesis-beta-lt-half — half-decrease; why $\beta<\tfrac12$.
Source: Lecture 6, Theorem 8; Problem Sheet 2 Q C.3.
Theorem 9: Global convergence of Newton + bArmijo
If the Hessian eigenvalues at the iterates are positive and uniformly bounded below away from zero, then bArmijo-Newton gives $\nabla f(x^l)=0$ for some $l$ or $\ \vert \nabla f(x^k)\ \vert \to 0$. (Strong convexity is a clean sufficient set of hypotheses.) Positivity makes the Newton step descent; the uniform lower bound controls $\ \vert s^k\ \vert $.
- Statement∆newton-barmijo-global-convergence — uniformly-bounded-below PD Hessian $\implies \ \vert \nabla f(x^k)\ \vert \to 0$.
- Simpler sufficient conditions∆newton-barmijo-global-convergence-simpler-conditions — $\mathcal C^2$ + Lipschitz $\nabla f$ + strongly convex.
- Strong convexity∆bite-strong-convexity-definition — eigenvalues of $\nabla^2 f$ uniformly bounded below.
- Proof∆newton-barmijo-global-convergence-proof — Theorem 4 + Rayleigh-Ritz on $H _ k$, $H _ k^{-2}$ to turn $E _ k\to 0$ into $\ \vert \nabla f\ \vert \to 0$.
- Strategy∆bite-newton-barmijo-global-strategy — apply Thm 4, convert $E _ k\to 0$ via Rayleigh-Ritz.
- Key inequalities / hypothesis∆bite-newton-barmijo-global-key-inequality-descent-bound, ∆bite-newton-barmijo-global-key-inequality-direction-length, ∆bite-newton-barmijo-global-hypothesis-uniform-eigenvalues — descent bound; direction-length bound; the two roles of the eigenvalue hypothesis.
Source: Lecture 6, Theorem 9.
Theorem 10: Global convergence of a general second-order method
The Theorem 9 conclusion holds for any $s^k$ solving $B^k s^k = -\nabla f(x^k)$ when the eigenvalues of $B^k$ are uniformly bounded above and below away from zero (so the proof is structurally identical with $B^k$ in place of $\nabla^2 f$). This is the umbrella result covering quasi-Newton and modified-Newton globalisation.
- Direction is descent / is a model minimiser∆general-second-order-direction-is-descent-justification, ∆general-second-order-direction-as-quadratic-model-minimiser — $B^k\succ 0 \implies$ descent; minimiser of the quadratic model.
- Statement∆general-second-order-glm-global-convergence — bounded-eigenvalue $B^k$ $\implies \ \vert \nabla f(x^k)\ \vert \to 0$.
- Proof— no flashcard (structurally identical to Theorem 9 with $B^k$ for $\nabla^2 f$; pending course to-do item).
Source: Lecture 6, Theorem 10.
Newton's method is scale invariant
Newton’s iterates commute with any nonsingular linear change of variables $y=Ax$: $s _ y = A s _ x$, hence $y+\alpha s _ y = A(x+\alpha s _ x)$. This is the structural reason Newton is immune to the conditioning problems that cripple steepest descent.
- Statement∆newtons-method-scale-invariance-statement — invariant under nonsingular linear changes of variable.
- Direction identity∆bite-newton-scale-invariance-direction-identity — $s _ y=As _ x$.
- Proof∆newtons-method-scale-invariance-proof — chain-rule $\nabla\bar f,\nabla^2\bar f$, cancel $B^\top$ factors.
- Strategy∆bite-newton-scale-invariance-strategy — chain rule + $(B^\top MB)^{-1}$ cancellation.
Source: Lecture 6 (Scale invariance of Newton’s method).
Modified / damped Newton
When $\nabla^2 f(x^k)$ is not PD the Newton direction need not be descent; solve $(\nabla^2 f(x^k)+M^k)s^k=-\nabla f(x^k)$ with $M^k$ chosen to make the matrix sufficiently PD (zero when already PD). Three standard recipes: spectral, smallest-eigenvalue shift, modified Cholesky.
- Overview∆modified-damped-newton-overview — solve with $\nabla^2 f+M^k$ sufficiently PD.
- Spectral∆modified-damped-newton-spectral-decomposition — $Q\max(\epsilon I, \vert D \vert )Q^\top$ (expensive).
- Eigenvalue shift∆modified-damped-newton-smallest-eigenvalue-estimate — $M^k=\max(0,\epsilon-\lambda _ {\min})I$.
- Modified Cholesky∆modified-damped-newton-cholesky — modify $L^k$ if factorisation is in danger.
Source: Lecture 6 (Modified Newton methods).
Quasi-Newton (SR1, BFGS) convergence properties
BFGS preserves positive-definiteness under the curvature condition $(\delta^k)^\top\gamma^k>0$ (enforced by Wolfe, not by Armijo alone), converges Q-superlinearly locally, and terminates in $n$ steps on a strictly convex quadratic with exact linesearch. SR1 is rank-1 and does not preserve PD.
- PD preservation∆bfgs-preserves-positive-definiteness — PD + curvature condition $\implies B^{k+1}\succ 0$.
- Needs Wolfe∆bfgs-requires-wolfe-for-global-convergence — Wolfe (not Armijo) for global convergence.
- Curvature condition∆bite-bfgs-curvature-condition — $(\delta^k)^\top\gamma^k>0$ via Wolfe.
- Superlinear∆bfgs-superlinear-convergence — local Q-superlinear.
- Finite termination∆bfgs-finite-termination-on-quadratic — exact linesearch on strictly convex quadratic: $B^k=\nabla^2 f$ after $n$ steps.
- SR1 vs BFGS∆bite-sr1-vs-bfgs-comparison — rank, PD preservation, reliability.
- PD-preservation proof— no flashcard (Cholesky-based, Problem Sheet 3 Q B.1(iv); pending course to-do item).
- Superlinear-rate proof— no flashcard (rate quoted only; beyond course scope).
Source: Lecture 7 (Quasi-Newton methods); Problem Sheet 3 Q B.1.
Gauss-Newton convergence (nonlinear least-squares)
For $\min \tfrac12\ \vert r(x)\ \vert ^2$, Gauss-Newton uses $\widetilde{\nabla^2}f=J^\top J$. It converges globally under a uniformly full-rank Jacobian and Lipschitz $\nabla f$ (to a stationary point of $f$, not necessarily $r=0$); locally it is quadratic in the zero-residual case (reduces to Theorem 7) and only linear in the nonzero-residual case (rate $ \vert \lambda \vert $, $\lambda$ the top eigenvalue of $(J^\top J)^{-1}\sum _ j r _ j\nabla^2 r _ j$).
- Statement∆gauss-newton-convergence — global (full-rank $J$) + local (quadratic if $r(x^\ast)=0$).
- Zero vs nonzero residual∆bite-zero-vs-nonzero-residual — quadratic vs linear.
- Nonzero-residual rate∆bite-gauss-newton-linear-rate-nonzero-residual — rate $ \vert \lambda \vert $ and why $J^\top J$ is a bad Hessian there.
- Descent condition∆gauss-newton-direction-is-descent-when — descent iff $J$ full column rank.
- Proof— no flashcard (zero-residual case reduces to Theorem 7; nonzero case quoted).
Source: Lecture 7 (Gauss-Newton method); Problem Sheet 3 Q B.2.
Theorem 14: Global convergence of the trust-region method
With $f \in \mathcal C^2$ bounded below, $\nabla f$ Lipschitz, and the Cauchy-decrease condition, the generic trust-region method gives $\nabla f(x^k)=0$ for some $k$ or $\liminf _ {k\to\infty}\ \vert \nabla f(x^k)\ \vert =0$ (only $\liminf$, not $\lim$). It rests on the Cauchy model-decrease lemma (Lemma 12) and the trust-region radius lower bound (Lemma 13).
- Statement∆gtr-global-convergence-theorem — $\liminf\ \vert \nabla f(x^k)\ \vert =0$ under Cauchy decrease.
- Limit-point form∆one-limit-point-of-gtr-method-stationary — at least one limit point is stationary.
- Cauchy decrease condition∆cauchy-decrease-condition — $m _ k(s^k)\le m _ k(s^k _ c)$.
- Lemma 12 (model decrease)∆sufficient-model-decrease-given-cauchy-condition — Cauchy decrease $\to$ guaranteed $f$-reduction $\tfrac12\ \vert \nabla f\ \vert \min\{\Delta,\ \vert \nabla f\ \vert /\ \vert H\ \vert \}$.
- Cauchy point∆cauchy-point-definition, ∆cauchy-point-computation — minimiser of the model along $-\nabla f$ within the region; 3-case computation.
- Lemma 13 (radius lower bound)∆lower-bound-on-trust-region-radius — $\Delta _ k$ bounded below.
- Proof (liminf)∆one-limit-point-of-gtr-method-stationary-proof, ∆cauchy-decrease-lemma-proof — successful-step decrease + telescope; Lemma 12 proof.
- Strategy∆bite-gtr-liminf-strategy — uniform decrease on successful steps + contradiction.
- Full proof of Theorem 14— no flashcard (complete proof non-examinable; Lecture 9 handout).
Source: Lectures 8-9, Theorem 14 (with Lemmas 12-13); complete proof non-examinable.
Trust-region method with arbitrary $B^k$
The Theorem 14 conclusion still holds with an arbitrary symmetric model Hessian $B^k$ provided $\ \vert B^k\ \vert \le \kappa _ B$; the convergence constants merely weaken to $c/(L+\kappa _ B)$ (Sheet 3 Q B.4).
- Statement∆gk-with-arbitrary-bk — convergence under $\ \vert B^k\ \vert \le\kappa _ B$, weakened constants.
- Proof— no flashcard (Theorem 14 proof with $\nabla^2 f\to B^k$; Sheet 3 Q B.4).
Source: Lectures 8-9; Problem Sheet 3 Q B.4.
Trust-region subproblem: global solution
$s^\ast$ solves $\min _ {\ \vert s\ \vert \le\Delta} g^\top s+\tfrac12 s^\top H s$ iff there is $\lambda^\ast \ge 0$ with
\[(H+\lambda^\ast I)s^\ast = -g, \qquad H+\lambda^\ast I \succeq 0, \qquad \lambda^\ast(\ \vert s^\ast\ \vert -\Delta)=0,\]and the solution is unique when $H+\lambda^\ast I \succ 0$. The secular equation $1/\ \vert s(\lambda)\ \vert - 1/\Delta = 0$ turns this into a 1-D rootfind; the hard case is when $g \perp$ the $\lambda _ {\min}$-eigenspace.
- Conditions∆exact-conditions-for-solving-trust-region-subproblem — the $(H+\lambda^\ast I)$ characterisation.
- Solving recipe∆trust-region-subproblem-solving-recipe — interior vs boundary; find $\lambda^\ast$.
- Secular equation∆trust-region-secular-equation — $\phi(\lambda)=1/\ \vert s(\lambda)\ \vert -1/\Delta=0$.
- Hard case∆trust-region-hard-case — $g\perp$ eigenspace of $\lambda _ {\min}<0$; nonunique solution.
- Proof (secular equation)∆trust-region-secular-equation-proof — spectral decomposition reduction.
- Worked examples∆trust-region-hard-case-example, ∆trust-region-case-1-example, ∆trust-region-case-2-pd-example, ∆trust-region-case-2-indefinite-example — concrete subproblem solves.
Source: Lecture 9 (Solving the trust-region subproblem); Problem Sheet 3 Q B.3.
Theorem 21: Convergence of the quadratic penalty method
For $\min f(x)$ s.t. $c(x)=0$ solved via $\Phi _ \sigma=f+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$ with $\sigma^k\to 0$ and approximate stationarity, if $x^k\to x^\ast$ with $\{\nabla c _ i(x^\ast)\}$ linearly independent (LICQ) then $x^\ast$ is a KKT point and the estimates $y^k=-c(x^k)/\sigma^k \to y^\ast$. LICQ is what makes the pseudo-inverse $J^+=(JJ^\top)^{-1}J$ well-defined; without $\sigma^k\to 0$ the limit need not be feasible.
- Statement∆quadratic-penalty-method-convergence — LICQ + $\sigma^k\to 0$ $\implies$ KKT point, $y^k\to y^\ast$.
- Multiplier estimate∆bite-penalty-multiplier-estimate — $y^k=-c(x^k)/\sigma^k$.
- Proof∆quadratic-penalty-method-convergence-proof — $\nabla\Phi=\nabla f-J^\top y^k$; pseudo-inverse; multiplier convergence; feasibility from $c=-\sigma y$.
- Strategy∆bite-penalty-method-convergence-strategy — pseudo-inverse setup + multiplier + stationarity + feasibility.
- Key identity / hypothesis∆bite-penalty-method-convergence-key-inequality-feasibility-identity, ∆bite-penalty-method-convergence-hypothesis-sigma-to-zero — $c(x^k)=-\sigma^k y^k$; why $\sigma^k\to 0$.
Source: Lecture 12, Theorem 21.
Theorem 22: Convergence of the augmented Lagrangian method
For $\Phi=f-u^\top c+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$ with multiplier estimate $y^k=u^k-c(x^k)/\sigma^k$, under LICQ the multipliers converge and Lagrangian stationarity holds; the limit is a KKT point if either $\sigma^k\to 0$ (bounded $u^k$) or $u^k\to y^\ast$ (bounded $\sigma^k$). The second case is the distinctive advantage: KKT without driving $\sigma\to 0$, so the Hessian stays well-conditioned.
- Setup / name∆augmented-lagrangian-setup, ∆augmented-lagrangian-name-justification — $\Phi=f-u^\top c+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$; $\nabla _ x\Phi=\nabla _ x\mathcal L$ with $y=u-c/\sigma$.
- Statement∆aug-lagrangian-global-convergence — multiplier convergence + two routes to KKT.
- Limits ill-conditioning∆bite-aug-lagrangian-limits-ill-conditioning — KKT with $\sigma^k$ bounded if $u^k\to y^\ast$.
- Two interpretations∆bite-aug-lagrangian-two-interpretations — shifted penalty / convexified Lagrangian.
- Proof∆aug-lagrangian-global-convergence-proof — multiplier step as in penalty; feasibility via $c=\sigma(u-y)$, two cases.
- Strategy∆bite-aug-lagrangian-convergence-strategy — penalty-style multiplier + two-case feasibility.
Source: Lecture 13, Theorem 22.
Theorem 27: Existence of the central path
For $\min f(x)$ s.t. $c(x)\ge 0$, under strict complementarity, LICQ, and a second-order sufficient condition at $x^\ast$, there is a unique $\mathcal C^1$ central path $\mu\mapsto x(\mu)$ of barrier minimisers for small $\mu>0$, with $x(\mu)\to x^\ast$ as $\mu\to 0$. (Nonconvex problems can have one central path per local minimiser.)
- Central path definition∆central-path — $\{x(\mu):\mu>0\}$, $x(\mu)=\arg\min f _ \mu$ on $\Omega^o$.
- Statement∆central-path-existence — strict complementarity + LICQ + 2nd-order $\implies$ unique $\mathcal C^1$ path, $x(\mu)\to x^\ast$.
- Strict complementarity∆bite-strict-complementarity — $\lambda _ i^\ast>0$ on every active constraint.
- Convex vs nonconvex∆central-path-visualisation-convex-vs-nonconvex — one path per local minimiser in the nonconvex case.
- Proof— no flashcard (local existence/uniqueness via the implicit function theorem; beyond course scope).
Source: Lectures 14-15, Theorem 27.
Theorem 28: Global convergence of the barrier method
Under LICQ, with $\mu^k\to 0$ and approximate stationarity of $f _ {\mu^k}$, the barrier iterates converge to a KKT point of (iCP) and the estimates $\lambda^k=\mu^k/c(x^k)\to\lambda^\ast$. The proof splits active/inactive indices: inactive multipliers vanish, active ones converge via the active-Jacobian pseudo-inverse.
- Statement∆barrier-method-global-convergence — LICQ + $\mu^k\to 0$ $\implies$ KKT point, $\lambda^k\to\lambda^\ast$.
- Multiplier estimates∆barrier-multiplier-estimates — $\lambda _ i(\mu)=\mu/c _ i(x(\mu))$; OPT$ _ \mu$ vs KKT.
- Proof∆barrier-method-global-convergence-proof — active/inactive split; inactive $\lambda\to 0$; active via $J _ {\mathcal A}^+$; stationarity in the limit.
- Strategy∆bite-barrier-method-convergence-strategy — pseudo-inverse setup + 3-step active/inactive argument.
- Key inequalities / hypothesis∆bite-barrier-method-convergence-key-inequality-inactive-vanish, ∆bite-barrier-method-convergence-key-inequality-active-block-residual, ∆bite-barrier-method-convergence-hypothesis-licq-role — inactive vanish; active-block residual; LICQ role.
Source: Lectures 14-15, Theorem 28.
Supporting results (Useful miscellany)
Auxiliary theorems repeatedly used inside the convergence proofs above.
- Rayleigh-Ritz bound (+ proof)∆rayleigh-ritz-theorem, ∆rayleigh-ritz-theorem-proof — $\lambda _ {\min}\le s^\top Ms/\ \vert s\ \vert ^2\le\lambda _ {\max}$; eigenbasis expansion. Used in Theorems 9-10.
- Lipschitz $\nabla f$ $\iff$ bounded Hessian (+ proof)∆lipschitz-continuous-iff-bounded-hessian, ∆lipschitz-continuous-iff-bounded-hessian-proof — $\ \vert \nabla^2 f\ \vert \le L$; FTC / difference-quotient. Used everywhere Lipschitz $\nabla f$ appears.
- Descent lemma∆descent-lemma — $f(x+d)\le f(x)+\nabla f(x)^\top d+\tfrac L2\ \vert d\ \vert ^2$ (Sheet 2 Q B.4(i)).
- Sherman-Morrison-Woodbury (proof on card)∆sherman-morrison-woodbury — rank-$m$ inverse update; why SR1/BFGS cost $O(n^2)$ (Sheet 3 Q C.1).
- Strategies∆bite-rayleigh-ritz-strategy, ∆bite-lipschitz-bounded-hessian-strategy, ∆bite-descent-lemma-strategy, ∆bite-smw-strategy — one-line proof skeletons.
Source: Lecture 5 / Problem Sheets 2-3 (Rayleigh-Ritz, descent lemma, SMW).
Methods
Generic linesearch method (GLM)
Iterate $x^{k+1}=x^k+\alpha^k s^k$ with a descent direction $s^k$ and a stepsize rule for $\alpha^k$. The rule must keep $\alpha^k$ neither too long (Armijo) nor too short (Wolfe / Goldstein curvature side); convergence is Theorem 4, resting on Lemmas 2-3.
- Algorithm∆generic-linesearch-method — choose descent $s^k$, stepsize $\alpha^k$, update; stop when $\ \vert \nabla f\ \vert \le\epsilon$.
- Descent direction∆descent-direction — $\nabla f(x^k)^\top s^k<0$.
- Exact linesearch (+ quadratic optimum, proof)∆exact-linesearch-definition, ∆exact-linesearch-quadratic-optimal-stepsize, ∆exact-linesearch-quadratic-optimal-stepsize-proof — $\alpha^\ast=-(s^k)^\top\nabla q/(s^k)^\top Hs^k$.
- Backtracking / Armijo / bArmijo∆basic-backtracking-algorithm, ∆armijo-condition, ∆backtracking-armijo-linesearch — sufficient-decrease line; backtrack by $\tau$.
- Wolfe / Goldstein-Armijo∆wolfe-conditions, ∆bite-goldstein-armijo-conditions — curvature side that rules out short steps.
- Convergence∆global-convergence-of-backtracking-armijo — Theorem 4 (rests on Lemmas 2-3).
- Rates of convergence∆rates-of-convergence — $p$-rate; linear/quadratic/superlinear.
- Too-short / too-long counterexamples∆bite-stepsize-too-short-counterexample, ∆bite-stepsize-too-long-counterexample — why both bounds are needed.
Source: Lectures 3-4 (Generic linesearch methods).
Steepest descent
GLM with $s^k=-\nabla f(x^k)$ — the direction that most decreases the linear model on a fixed-norm sphere. Globally convergent (Theorem 5) but only linearly local (Theorem 6), and scale-dependent.
- Definition∆steepest-descent-definition — $s^k=-\nabla f(x^k)$.
- Why "steepest"∆steepest-descent-why-name — unique minimiser of the linear model at fixed $\ \vert s\ \vert $.
- Scale dependence∆bite-steepest-descent-scale-dependent — a linear change of variables can drastically change SD.
- Convergence∆steepest-descent-global-convergence, ∆steepest-descent-local-linear-rate-via-condition-number — Theorem 5 global; Theorem 6 linear rate.
Source: Lecture 5 (Steepest descent).
Newton's method
GLM with $s^k$ solving $\nabla^2 f(x^k)s^k=-\nabla f(x^k)$ — minimiser of the second-order model. Quadratic local rate (Theorem 7), unit step under bArmijo (Theorem 8), global under bounded PD Hessian (Theorem 9), scale invariant; globalised by linesearch or trust region.
- Algorithm / direction∆newtons-method, ∆newton-direction-minimises-second-order-taylor — solve the Newton system; minimiser of the 2nd-order Taylor model.
- Convergence∆newtons-method-convergence, ∆newtons-method-using-backtracking-armijo-convergence, ∆newton-barmijo-global-convergence — Theorems 7-9.
- Scale invariance∆newtons-method-scale-invariance-statement — invariant under linear changes of variable.
- Modified / damped∆modified-damped-newton-overview — handle indefinite $\nabla^2 f$.
- General second-order∆general-second-order-glm-global-convergence — Theorem 10 umbrella.
Source: Lecture 6 (Newton’s method for unconstrained optimisation).
Quasi-Newton (SR1, BFGS)
GLM with $s^k=-(B^k)^{-1}\nabla f(x^k)$ where $B^k$ approximates the Hessian via the secant equation $\gamma^k=B^{k+1}\delta^k$, updated cheaply ($O(n^2)$ via Sherman-Morrison-Woodbury). BFGS is the rank-2 PD-preserving update; SR1 is rank-1 and may be indefinite.
- Wish list / secant equation∆secant-approximation-wish-list, ∆secant-equation, ∆secant-equation-intuition — symmetric/PD/cheap/$\to\nabla^2 f$; $\gamma^k=B^{k+1}\delta^k$.
- SR1 update∆sr1-update, ∆sr1-update-issues — rank-1 $B^k+uu^\top$; PD/descent/denominator issues.
- BFGS update∆bfgs-update-formula, ∆bfgs-method-definition — rank-2 update; the BFGS GLM.
- Cholesky-factor implementation∆bfgs-cholesky-factors — maintain $L^k$, $O(n^2)$, auto-PD.
- Frobenius derivation∆bite-bfgs-frobenius-derivation — closest factored update meeting the secant constraint.
- Cost via SMW∆bite-quasi-newton-cost-via-smw — $O(n^2)$ low-rank inverse update.
Source: Lecture 7 (Quasi-Newton methods).
Gauss-Newton (nonlinear least-squares)
For $\min\tfrac12\ \vert r(x)\ \vert ^2$, approximate $\nabla^2 f\approx J^\top J$ and solve a linear least-squares subproblem $\min _ s\tfrac12\ \vert J(x^k)s+r(x^k)\ \vert ^2$ each step.
- Problem / gradient / Hessian∆least-squares-problem-definition, ∆least-squares-gradient, ∆least-squares-hessian — $\nabla f=J^\top r$; $\nabla^2 f=J^\top J+\sum r _ j\nabla^2 r _ j$.
- Direction / algorithm∆gauss-newton-direction, ∆gauss-newton-algorithm — drop $\sum r _ j\nabla^2 r _ j$; LLS subproblem.
- As an LLS subproblem∆bite-gauss-newton-as-lls-subproblem — $\min _ s\tfrac12\ \vert Js+r\ \vert ^2$ inner solve.
- Linear LS / normal equations / geometry∆bite-linear-least-squares-normal-equations, ∆bite-lls-projection-geometry — $J^\top Jx=-J^\top r$; orthogonal projection onto $\operatorname{col}(J)$.
- Convergence∆gauss-newton-convergence — global (full-rank $J$); local quadratic/linear.
Source: Lecture 7 (Nonlinear least-squares and Gauss-Newton).
Trust-region method
Minimise a local quadratic model inside $\ \vert s\ \vert \le\Delta _ k$; accept/reject by the actual-to-predicted ratio $\rho _ k$, adjusting $\Delta _ k$. Globally convergent (Theorem 14) via the Cauchy point; subproblem solved by the secular equation.
- Subproblem / radius rule / algorithm∆trust-region-subproblem, ∆trust-region-radius-update-rule, ∆generic-trust-region-algorithm — model in $\ \vert s\ \vert \le\Delta$; $\rho _ k$ accept/reject.
- Linear vs quadratic models∆trust-region-linear-vs-quadratic-models — why a bounded region is needed.
- Cauchy point / decrease∆cauchy-point-definition, ∆cauchy-decrease-condition — guaranteed-decrease anchor.
- Convergence∆gtr-global-convergence-theorem, ∆gk-with-arbitrary-bk — Theorem 14; arbitrary-$B^k$ variant.
- Subproblem solution∆exact-conditions-for-solving-trust-region-subproblem, ∆trust-region-subproblem-solving-recipe, ∆trust-region-secular-equation, ∆trust-region-hard-case — conditions, recipe, secular equation, hard case.
- Linesearch vs trust region∆trust-region-vs-linesearch, ∆bite-linesearch-vs-trust-region-philosophy — direction-then-length vs region-then-step.
- Large-scale subproblem∆bite-trust-region-large-scale-cg-lanczos — CG/Lanczos inner solves.
Source: Lectures 8-9 (Trust-region methods).
Quadratic penalty method (equality constrained)
Minimise $\Phi _ \sigma=f+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$ for a decreasing $\sigma^k\to 0$. Converges to a KKT point under LICQ (Theorem 21) but the Hessian is $O(1/\sigma)$ ill-conditioned; an auxiliary-variable Newton system computes the step cleanly.
- Motivation / function / algorithm∆penalty-method-motivation, ∆quadratic-penalty-function-definition, ∆quadratic-penalty-method — unconstrained surrogate; $\sigma^k\to 0$.
- Derivatives as Lagrangian∆penalty-derivatives-as-lagrangian — $\nabla\Phi _ \sigma=\nabla _ x\mathcal L(x,y(\sigma))$.
- Convergence∆quadratic-penalty-method-convergence — Theorem 21.
- Ill-conditioning∆penalty-hessian-ill-conditioning, ∆bite-penalty-hessian-condition-number — $\kappa\sim O(1/\sigma)$, $m$ eigenvalues $\to\infty$.
- Clean Newton direction∆newton-direction-with-auxiliary-variable — $\sigma$-independent augmented system; caveat on step shape.
- Exact penalties∆exact-penalty-function, ∆exact-penalty-functions-l1-l2, ∆bite-exact-penalty-nonsmooth-consequence — $\ell _ 1/\ell _ 2$ exactness; non-smoothness tradeoff.
Source: Lecture 12 (Penalty methods).
Augmented Lagrangian method (equality constrained)
Minimise $\Phi=f-u^\top c+\tfrac{1}{2\sigma}\ \vert c\ \vert ^2$, updating the multiplier estimate $u$ and penalty $\sigma$. Converges to a KKT point (Theorem 22) and — unlike the pure penalty — can do so with $\sigma$ bounded, limiting ill-conditioning.
- Setup / derivatives / algorithm∆augmented-lagrangian-setup, ∆augmented-lagrangian-derivatives, ∆augmented-lagrangian-algorithm — $\Phi$; $\nabla^2\Phi=\nabla^2\mathcal L+\tfrac1\sigma J^\top J$; $(u,\sigma)$ update rule.
- Multiplier estimate∆bite-aug-lagrangian-multiplier-estimate — $y^k=u^k-c(x^k)/\sigma^k$.
- Update rule∆bite-aug-lagrangian-update-rule — promote $u$ when feasible, else shrink $\sigma$.
- Convergence∆aug-lagrangian-global-convergence — Theorem 22.
Source: Lecture 13 (Augmented Lagrangian methods).
Barrier / interior-point methods (inequality constrained)
Replace $c(x)\ge 0$ by the log barrier $f _ \mu=f-\mu\sum\log c _ i$ and follow the central path as $\mu\to 0$ (Theorems 27-28). The basic primal method is ill-conditioned and has poor warm-starts; primal-dual methods fix this by making the multipliers explicit Newton variables (perturbed KKT $c _ i\lambda _ i=\mu$). For LP, IPMs are polynomial-time vs simplex’s exponential worst case.
- Problem / log barrier∆icp, ∆logarithmic-barrier-function — (iCP); $f _ \mu=f-\mu\sum\log c _ i$, $\Omega^o$.
- Central path∆central-path, ∆central-path-existence — $\{x(\mu)\}$; Theorem 27.
- Barrier algorithm / convergence∆basic-barrier-method-algorithm, ∆barrier-method-global-convergence — Theorem 28.
- Lagrangian connection / ill-conditioning∆barrier-lagrangian-connection, ∆barrier-hessian-ill-conditioning — $\nabla f _ \mu=\nabla _ x\mathcal L$; $\kappa\sim O(1/\mu)$.
- Poor starting points / difficulties∆barrier-poor-starting-points, ∆barrier-potential-difficulties — full Newton step infeasible if $\mu$ more than halves.
- Perturbed optimality / primal-dual∆perturbed-optimality-conditions, ∆primal-dual-newton-system, ∆primal-dual-ipm-algorithm — OPT$ _ \mu$ ($c _ i\lambda _ i=\mu$); reduced Newton system; primal-dual algorithm.
- Primal-dual advantage∆bite-primal-dual-key-advantage — explicit vs implicit multipliers.
- LP: simplex vs IPM∆simplex-vs-ipm-for-lp, ∆bite-ipm-polynomial-vs-simplex-exponential — polynomial vs exponential worst case; Karmarkar 1984.
Source: Lectures 14-15 (Interior point methods).