Continuous Optimisation HT26, Augmented Lagrangian methods
Flashcards
Suppose we have the equality-constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]
where:
- $f : \mathbb R^n \to \mathbb R$ smooth
- $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth
@State the setup of the augmented Lagrangian function for solving this class of problems.
where both $u \in \mathbb R^m$ and $\sigma > 0$ are parameters we modify to encourage convergence.
Suppose we have the equality-constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]
where:
- $f : \mathbb R^n \to \mathbb R$ smooth
- $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth
In this context, the augmented Lagrangian function is
\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma} \| c(x) \|^2\]
where both $u \in \mathbb R^m$ and $\sigma > 0$ are parameters we modify to encourage convergence.
@Justify the name “augmented Lagrangian function”.
Recall that the Lagrangian is $\mathcal L(x, y) = f(x) - y^\top c(x)$ for any $y \in \mathbb R^m$. Then note
\[\begin{aligned} &\nabla _ x \Phi(x, u, \sigma) = \nabla f(x) - J(x)^\top u + \frac 1 \sigma J(x)^\top c(x) \\ \implies& \nabla _ x \Phi(x, u, \sigma) = \nabla f(x) - J(x)^\top y = \nabla _ x \mathcal L(x, y) \end{aligned}\]where:
\[y := u - \frac{c(x)}{\sigma}\]which we may interpret as an estimate of the Lagrangian.
Suppose we have the equality-constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]
where:
- $f : \mathbb R^n \to \mathbb R$ smooth
- $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth
In this context, the augmented Lagrangian function is
\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma} \| c(x) \|^2\]
where both $u \in \mathbb R^m$ and $\sigma > 0$ are parameters we modify to encourage convergence.
Derive the first and second derivatives of $\Phi$.
Recall that the Lagrangian is $\mathcal L(x, y) = f(x) - y^\top c(x)$ for any $y \in \mathbb R^m$.
First derivative:
\[\begin{aligned} &\nabla _ x \Phi(x, y, \sigma) = \nabla f(x) - J(x)^\top u + \frac 1 \sigma J(x)^\top c(x) \\ \implies& \nabla _ x \Phi(x, u, \sigma) = \nabla f(x) - J(x)^\top y = \nabla _ x \mathcal L(x, y) \end{aligned}\]where $y := u - c(x)/\sigma$.
Second derivative:
\[\begin{aligned} &\nabla^2 \Phi(x, u, \sigma) = \nabla^2 f(x) - \sum^m _ {i=1} u _ i \nabla^2 c _ i(x) + \frac 1 \sigma \sum^m _ {i=1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x) \\ \implies& \nabla^2 \Phi(x, u, \sigma) = \nabla^2 f(x) - \sum^m _ {i = 1} y _ i \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x) \\ \implies& \nabla^2 \Phi(x, u, \sigma) = \nabla^2 \mathcal L(x, y) + \frac 1 \sigma J(x)^\top J(x) \end{aligned}\]where $u = c(x)/\sigma$.
Suppose we have a nonlinear equality-constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad\text{s.t.}\quad c(x) = 0,\]
where $f : \mathbb R^n \to \mathbb R$, $c : \mathbb R^n \to \mathbb R^m$ are smooth. @State the basic augmented Lagrangian @algorithm for this problem.
Inputs:
- Starting iterate $x^0$,
- Initial $u^0 \in \mathbb R^m$ and $\sigma^0 > 0$,
- Decay factor $\tau \in (0, 1)$,
- Feasibility-tolerance schedule $\eta^k \to 0$,
- Optimality-tolerance schedule $\epsilon^k \to 0$.
where the schedules are not fixed ahead of time and may depend on $\sigma^k$.
Algorithm:
Let $k = 0$. Until convergence:
- Compute the Lagrange multiplier estimate $y^k := u^k - c(x^k)/\sigma^k$
- Update $(u, \sigma)$:
- If $\|c(x^k)\| \le \eta^k$ (feasibility on track): $u^{k+1} = y^k$ (promote the multiplier estimate), $\sigma^{k+1} = \sigma^k$.
- Otherwise: $u^{k+1} = u^k$, $\sigma^{k+1} \le \tau \sigma^k$ (shrink penalty to drive infeasibility down).
- Minimise: from $x _ 0^k$ (often just $x^k _ 0 := x^k)$, use an unconstrained minimiser to find $x^{k+1}$ with $\|\nabla _ x \Phi(x^{k+1}, u^{k+1}, \sigma^{k+1})\| \le \epsilon^{k+1}$ where
- Set $k := k+1$.
Some reasonable choices:
- $\tau = \min(0.1, \sqrt{\sigma^k})$
- $\epsilon^k = (\sigma^k)^{j+1}$
- $\eta^k = (\sigma^k)^{0.1 + 0.9j}$
- where $j$ counts iterations since $\sigma^k$ last changed.
Suppose we have the equality-constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]
where:
- $f : \mathbb R^n \to \mathbb R$ smooth
- $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth
- $\Phi(x, y, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma}\|c(x)\|^2$
- $u \in \mathbb R^m$ and $\sigma > 0$ are chosen parameters
Consider any sequence of iterates $(x^k, u^k, \sigma^k)$, where $x^k$ is an approximate stationary point of $\Phi(\cdot, u^k, \sigma^k)$, e.g. those produced by ∆augmented-lagrangian-algorithm. Let $y^k := u^k - c(x^k)/\sigma^k$ be the Lagrangian multiplier estimate.
@State a theorem about the global convergence of $(x^k, y^k)$, and give an intuitive explanation.
Suppose:
- $f, c \in \mathcal C^1$
- $\|\nabla _ x \Phi(x^k, u^k, \sigma^k)\| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast), i =1, \ldots, m$ are linearly independent
Then:
- $y^k \to y^\ast$ as $k \to \infty$, where
- $y^\ast$ satisfies $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast = 0$.
If additionally either:
- $\sigma^k \to 0$ with $u^k$ bounded, or
- $u^k \to y^\ast$ with $\sigma^k$ bounded
Then:
- $x^\ast$ is a KKT point of the original problem with associated Lagrange multipliers $y^\ast$.
Intuitive explanation: This statement is really talking about how if you optimise $x^k \to x^\ast$, you get Lagrangian stationarity $\nabla f(x^k) - J(x^\ast)^\top y^\ast$ automatically holds, given that we take $y^k := u^k - c(x^k)/\sigma^k$ as our Lagrangian estimate at each step.
Additionally, we also can upgrade to saying that we actually get a KKT point, given that either:
- Increase the penalty asymptotically ($\sigma^k \to 0$)
- Push the multiplier to its true value $u^k \to y^\ast$ (with $\sigma^k$ kept bounded)
Suppose we have a nonlinear equality-constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad\text{s.t.}\quad c(x) = 0,\]
and the augmented Lagrangian function
\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma} \|c(x)\|^2.\]
Consider any sequence of iterates $(x^k, u^k, \sigma^k)$ where $x^k$ is an approximate stationary point of $\Phi(\cdot, u^k, \sigma^k)$, e.g. those produced by ∆augmented-lagrangian-algorithm. Let $y^k := u^k - c(x^k)/\sigma^k$ be the Lagrange multiplier estimate.
@Prove that if
- $f, c \in \mathcal C^1$
- $\|\nabla _ x \Phi(x^k, u^k, \sigma^k)\| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$, $i = 1, \ldots, m$ are linearly independent
then:
- $y^k \to y^\ast$
- $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast = 0$
and if either:
- $\sigma^k \to 0$ with $u^k$ bounded, or
- $u^k \to y^\ast$ with $\sigma^k$ bounded
then:
- $x^\ast$ is a KKT point of the original problem with multiplier $y^\ast$.
Multiplier convergence:
The expression
\[\nabla _ x \Phi(x^k, u^k, \sigma^k) = \nabla f(x^k) - J(x^k)^\top y^k\]is the only equation needed, so the argument is identical to the corresponding step in ∆quadratic-penalty-method-convergence-proof (the pure-penalty case is the specialisation $u^k = 0$). The hypothesis that $\nabla c _ i(x^\ast)$ are linearly independent is ∆licq at $x^\ast$. Hence $J(x^\ast)$ has full row rank and $J(x^\ast)^+ := (J(x^\ast)J(x^\ast)^\top)^{-1} J(x^\ast)$ is well-defined, so by continuity $J(x^k)^+$ exists and is bounded for large $k$. The hypothesis $\nabla _ x \Phi \to 0$ then forces $y^k \to y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$, and hence by construction $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast = 0$.
(Note that this uses none of the asymptotic hypotheses on $\sigma^k$ or $u^k$).
Feasibility ($c(x^\ast) = 0$):
From $y^k = u^k - c(x^k)/\sigma^k$, rearrange:
\[c(x^k) = \sigma^k(u^k - y^k).\]The triangle inequality then gives
\[\begin{aligned} \|c(x^k)\| &= \sigma^k \|u^k - y^k\| \\ &\le \sigma^k \|y^k - y^\ast\| + \sigma^k \|u^k - y^\ast\|. \qquad (\ast) \end{aligned}\]The LHS tends to $\|c(x^\ast)\|$ by continuity, so we just need RHS $\to 0$.
- Case (i): $\sigma^k \to 0$, $\|u^k\| \le M$. Then $\sigma^k \|u^k - y^\ast\| \le \sigma^k(M + \|y^\ast\|) \to 0$, and $\sigma^k \|y^k - y^\ast\| \to 0$ since $y^k \to y^\ast$ and $\sigma^k$ is bounded.
- Case (ii): $u^k \to y^\ast$, $\sigma^k \le \overline \sigma$. Then $\sigma^k \|u^k - y^\ast\| \le \overline \sigma \|u^k - y^\ast\| \to 0$, and $\sigma^k \|y^k - y^\ast \| \le \overline \sigma \|y^k - y^\ast\| \to 0$.
In either case, the RHS of $(\ast) \to 0$, so $c(x^\ast) = 0$. Combined with Lagrangian stationarity, $x^\ast$ is KKT.
Note that the Augmented Lagrangian may converge to KKT points without $\sigma^k \to 0$, which limits the ill-conditioning.
Worked examples
Consider
\[\min _ {x \in \mathbb R^2} x _ 1^2 + (x _ 2 - 1)^2 \quad \text{s.t.} \quad x _ 1 + x _ 2 - 2 = 0, \qquad x^\ast = \left(\tfrac12, \tfrac32\right),\ y^\ast = 1.\]
For the augmented Lagrangian $\Phi(x,u,\sigma) = f(x) - u\,c(x) + \tfrac{1}{2\sigma} c(x)^2$: (i) find its global minimiser $x(u,\sigma)$; (ii) show $x(u,\sigma) \to x^\ast$ as $\sigma \to 0$ ($u$ fixed) and that $\operatorname{cond}\nabla^2 _ {xx}\Phi \to \infty$; (iii) with $\sigma$ fixed and $u^{k+1} = u^k - c(x(u^k,\sigma))/\sigma$, show $u^k \to 1$ linearly and $x(u^k,\sigma) \to x^\ast$.
(i) Find $x(u,\sigma)$. Differentiate with respect to $x$ only ($u, \sigma$ are parameters, not variables). Using $\nabla _ x[-u\,c] = -u\nabla c$ and $\nabla _ x[\tfrac{1}{2\sigma}c^2] = \tfrac1\sigma c\,\nabla c$ (chain rule), with $\nabla c = (1,1)$:
\[2x _ 1 - u + \tfrac1\sigma(x _ 1 + x _ 2 - 2) = 0, \qquad 2(x _ 2 - 1) - u + \tfrac1\sigma(x _ 1 + x _ 2 - 2) = 0.\]Subtracting the two equations: $2x _ 1 - 2(x _ 2 - 1) = 0$, so $x _ 1 = x _ 2 - 1$. Substituting back,
\[x _ 1(u,\sigma) = \frac{1 + \sigma u}{2(1+\sigma)}, \qquad x _ 2(u,\sigma) = \frac{3 + \sigma(u+2)}{2(1+\sigma)}.\]The Hessian $\nabla^2 _ {xx}\Phi = \begin{pmatrix} 2 + 1/\sigma & 1/\sigma \\ 1/\sigma & 2 + 1/\sigma \end{pmatrix} \succ 0$, so this is the unique global minimiser.
(ii) Limit and conditioning. As $\sigma \to 0$ with $u$ fixed, $x _ 1 \to \tfrac12$ and $x _ 2 \to \tfrac32$, recovering $x^\ast$. The Hessian has eigenvalues $\lambda _ {\max} = 2 + 2/\sigma$ (eigenvector $(1,1)$) and $\lambda _ {\min} = 2$ (eigenvector $(1,-1)$), so
\[\operatorname{cond}\nabla^2 _ {xx}\Phi = \frac{\lambda _ {\max}}{\lambda _ {\min}} = 1 + \frac1\sigma \to \infty.\](iii) Multiplier iteration. The multiplier estimate is $y = u - c(x)/\sigma$ (shift $u$ minus penalty-scaled constraint, not $\sigma - c/u$). Substituting $x(u^k,\sigma)$ into the constraint,
\[c(x(u^k,\sigma)) = \frac{\sigma(u^k - 1)}{1+\sigma}.\]The update becomes $u^{k+1} - 1 = \tfrac{\sigma}{1+\sigma}(u^k - 1)$, linear convergence with factor $\rho = \tfrac{\sigma}{1+\sigma} \in (0,1)$. So $u^k \to 1 = y^\ast$ linearly, and feeding $u \to 1$ into $x(u,\sigma)$ gives $x(u^k,\sigma) \to x^\ast$. This is the distinctive advantage over the pure penalty: convergence with $\sigma$ fixed and bounded, so no $O(1/\sigma)$ ill-conditioning (∆aug-lagrangian-global-convergence).
Bite-sized
For the augmented Lagrangian $\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \tfrac{1}{2\sigma}\ \vert c(x)\ \vert ^2$, the Lagrange multiplier estimate at iterate $x^k$ is $y^k := u^k - c(x^k)/\sigma^k$.
The augmented Lagrangian $\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \tfrac{1}{2\sigma}\|c(x)\|^2$ admits two complementary interpretations. State both.
- Shifted quadratic penalty: $\Phi$ is the quadratic penalty function for the shifted constraint $c(x) - \sigma u = 0$ (modulo a constant). The $u$ parameter shifts where the penalty is centred.
- Convexification of the Lagrangian: $\Phi(x, u, \sigma) = \mathcal L(x, u) + \tfrac{1}{2\sigma}\|c(x)\|^2$, i.e. the standard Lagrangian plus a quadratic regulariser that pulls $x$ toward feasibility. The regulariser convexifies $\mathcal L$ near the constraint manifold and ensures a well-defined minimiser even when $\mathcal L(\cdot, u)$ is unbounded below.
Both views are useful: shifted-penalty for designing $\sigma$-update rules, convexified-Lagrangian for Newton-based inner solves.
Unlike the pure quadratic penalty (which requires $\sigma \to 0$ to recover KKT, blowing up the condition number of $\nabla^2 \Phi _ \sigma$), the augmented Lagrangian can converge to a KKT point with $\sigma^k$ bounded above, provided the multiplier estimate $u^k \to y^\ast$. This is what limits the ill-conditioning of the augmented Lagrangian method.
In the practical augmented Lagrangian algorithm, the update rule for $(u, \sigma)$ depends on whether feasibility is on track: if $\ \vert c(x^k)\ \vert \le \eta^k$ then $u^{k+1} := y^k$ (promote the multiplier estimate) and $\sigma^{k+1} := \sigma^k$ (keep penalty); otherwise $u^{k+1} := u^k$ (keep multiplier) and $\sigma^{k+1} \le \tau \sigma^k$ (shrink penalty).
Give the proof strategy for ∆aug-lagrangian-global-convergence-proof (Theorem 22: augmented Lagrangian convergence under LICQ).
- Multiplier convergence (and Lagrangian stationarity): identical to ∆quadratic-penalty-method-convergence-proof — the penalty case is just $u^k \equiv 0$. From $\nabla _ x \Phi(x^k, u^k, \sigma^k) = \nabla f(x^k) - J(x^k)^\top y^k$ and the LICQ-driven pseudo-inverse $J(x^\ast)^+$, $\epsilon^k \to 0$ forces $y^k \to y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$ and $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$. Note: this part uses neither asymptotic hypothesis on $\sigma^k$ or $u^k$.
- Feasibility ($c(x^\ast) = 0$): rearrange $y^k = u^k - c(x^k)/\sigma^k$ to $c(x^k) = \sigma^k(u^k - y^k)$. Triangle inequality: $\|c(x^k)\| \le \sigma^k \|y^k - y^\ast\| + \sigma^k \|u^k - y^\ast\|$.
- Case (i) $\sigma^k \to 0$, $\|u^k\| \le M$: $\sigma^k \|u^k - y^\ast\| \le \sigma^k(M + \|y^\ast\|) \to 0$, and $\sigma^k \|y^k - y^\ast\| \to 0$ since $y^k \to y^\ast$ with $\sigma^k$ bounded. So RHS $\to 0$.
- Case (ii) $u^k \to y^\ast$, $\sigma^k \le \overline\sigma$: $\sigma^k \|u^k - y^\ast\| \le \overline\sigma \|u^k - y^\ast\| \to 0$, and $\sigma^k \|y^k - y^\ast\| \le \overline\sigma \|y^k - y^\ast\| \to 0$.
- In either case, $\|c(x^k)\| \to 0$, so by continuity $c(x^\ast) = 0$, and KKT holds.
The augmented-Lagrangian feasibility identity (analogue of ∆bite-penalty-method-convergence-key-inequality-feasibility-identity). Rearranging $y^k = u^k - c(x^k)/\sigma^k$:
\[c(x^k) = <span class="cloze" tabindex="0">\sigma^k(u^k - y^k)</span>.\]This is the penalty-method identity $c(x^k) = -\sigma^k y^k$ shifted by $\sigma^k u^k$, and it explains why two distinct hypotheses can force $c(x^\ast) = 0$ in the limit — driving either $\sigma^k \to 0$ (penalty-style) or $\ \vert u^k - y^k\ \vert \to 0$ (multiplier-promotion-style).
In ∆aug-lagrangian-global-convergence-proof (Theorem 22), Case (ii) — convergence to KKT with $\sigma^k$ bounded and $u^k \to y^\ast$ — is the distinctive augmented-Lagrangian case (the pure penalty cannot achieve this). Where does this case enter, and what’s the practical consequence?
- It enters in the feasibility step. From $\|c(x^k)\| \le \sigma^k \|y^k - y^\ast\| + \sigma^k \|u^k - y^\ast\|$, the second term goes to zero by hypothesis ($u^k \to y^\ast$, $\sigma^k \le \overline\sigma$), and the first goes to zero because $y^k \to y^\ast$ with $\sigma^k$ bounded.
- Why $u^k \to y^\ast$ is plausible in practice: the AL algorithm (∆augmented-lagrangian-algorithm) actively promotes $u^{k+1} := y^k$ whenever the feasibility tolerance $\|c(x^k)\| \le \eta^k$ is met. Since $y^k \to y^\ast$ already from the multiplier-convergence step, this forces $u^k \to y^\ast$ over the subsequence of “feasible-enough” iterations.
- Practical consequence: $\sigma^k$ can stay bounded above, so the augmented-Lagrangian Hessian $\nabla^2 _ {xx} \mathcal L(x, y(\sigma)) + (1/\sigma) J^\top J$ remains well-conditioned. This avoids the $O(1/\sigma)$ ill-conditioning that plagues the pure penalty method — see ∆bite-aug-lagrangian-limits-ill-conditioning.