Continuous Optimisation HT26, Augmented Lagrangian methods


Flashcards

Suppose we have the equality-constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]

where:

  • $f : \mathbb R^n \to \mathbb R$ smooth
  • $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth

@State the setup of the augmented Lagrangian function for solving this class of problems.

\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma} \| c(x) \|^2\]

where both $u \in \mathbb R^m$ and $\sigma > 0$ are parameters we modify to encourage convergence.

@exam~

Suppose we have the equality-constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]

where:

  • $f : \mathbb R^n \to \mathbb R$ smooth
  • $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth

In this context, the augmented Lagrangian function is

\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma} \| c(x) \|^2\]

where both $u \in \mathbb R^m$ and $\sigma > 0$ are parameters we modify to encourage convergence.

@Justify the name “augmented Lagrangian function”.

Recall that the Lagrangian is $\mathcal L(x, y) = f(x) - y^\top c(x)$ for any $y \in \mathbb R^m$. Then note

\[\begin{aligned} &\nabla _ x \Phi(x, u, \sigma) = \nabla f(x) - J(x)^\top u + \frac 1 \sigma J(x)^\top c(x) \\ \implies& \nabla _ x \Phi(x, u, \sigma) = \nabla f(x) - J(x)^\top y = \nabla _ x \mathcal L(x, y) \end{aligned}\]

where:

\[y := u - \frac{c(x)}{\sigma}\]

which we may interpret as an estimate of the Lagrangian.

@exam~

Suppose we have the equality-constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]

where:

  • $f : \mathbb R^n \to \mathbb R$ smooth
  • $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth

In this context, the augmented Lagrangian function is

\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma} \| c(x) \|^2\]

where both $u \in \mathbb R^m$ and $\sigma > 0$ are parameters we modify to encourage convergence.

Derive the first and second derivatives of $\Phi$.

Recall that the Lagrangian is $\mathcal L(x, y) = f(x) - y^\top c(x)$ for any $y \in \mathbb R^m$.

First derivative:

\[\begin{aligned} &\nabla _ x \Phi(x, y, \sigma) = \nabla f(x) - J(x)^\top u + \frac 1 \sigma J(x)^\top c(x) \\ \implies& \nabla _ x \Phi(x, u, \sigma) = \nabla f(x) - J(x)^\top y = \nabla _ x \mathcal L(x, y) \end{aligned}\]

where $y := u - c(x)/\sigma$.

Second derivative:

\[\begin{aligned} &\nabla^2 \Phi(x, u, \sigma) = \nabla^2 f(x) - \sum^m _ {i=1} u _ i \nabla^2 c _ i(x) + \frac 1 \sigma \sum^m _ {i=1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x) \\ \implies& \nabla^2 \Phi(x, u, \sigma) = \nabla^2 f(x) - \sum^m _ {i = 1} y _ i \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x) \\ \implies& \nabla^2 \Phi(x, u, \sigma) = \nabla^2 \mathcal L(x, y) + \frac 1 \sigma J(x)^\top J(x) \end{aligned}\]

where $u = c(x)/\sigma$.

@exam~

Suppose we have a nonlinear equality-constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad\text{s.t.}\quad c(x) = 0,\]

where $f : \mathbb R^n \to \mathbb R$, $c : \mathbb R^n \to \mathbb R^m$ are smooth. @State the basic augmented Lagrangian @algorithm for this problem.

Inputs:

  • Starting iterate $x^0$,
  • Initial $u^0 \in \mathbb R^m$ and $\sigma^0 > 0$,
  • Decay factor $\tau \in (0, 1)$,
  • Feasibility-tolerance schedule $\eta^k \to 0$,
  • Optimality-tolerance schedule $\epsilon^k \to 0$.

where the schedules are not fixed ahead of time and may depend on $\sigma^k$.

Algorithm:

Let $k = 0$. Until convergence:

  • Compute the Lagrange multiplier estimate $y^k := u^k - c(x^k)/\sigma^k$
  • Update $(u, \sigma)$:
    • If $\|c(x^k)\| \le \eta^k$ (feasibility on track): $u^{k+1} = y^k$ (promote the multiplier estimate), $\sigma^{k+1} = \sigma^k$.
    • Otherwise: $u^{k+1} = u^k$, $\sigma^{k+1} \le \tau \sigma^k$ (shrink penalty to drive infeasibility down).
  • Minimise: from $x _ 0^k$ (often just $x^k _ 0 := x^k)$, use an unconstrained minimiser to find $x^{k+1}$ with $\|\nabla _ x \Phi(x^{k+1}, u^{k+1}, \sigma^{k+1})\| \le \epsilon^{k+1}$ where
\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma}\|c(x)\|^2.\]
  • Set $k := k+1$.

Some reasonable choices:

  • $\tau = \min(0.1, \sqrt{\sigma^k})$
  • $\epsilon^k = (\sigma^k)^{j+1}$
  • $\eta^k = (\sigma^k)^{0.1 + 0.9j}$
  • where $j$ counts iterations since $\sigma^k$ last changed.

@exam~

Suppose we have the equality-constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c(x) = 0,\]

where:

  • $f : \mathbb R^n \to \mathbb R$ smooth
  • $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth
  • $\Phi(x, y, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma}\|c(x)\|^2$
  • $u \in \mathbb R^m$ and $\sigma > 0$ are chosen parameters

Consider any sequence of iterates $(x^k, u^k, \sigma^k)$, where $x^k$ is an approximate stationary point of $\Phi(\cdot, u^k, \sigma^k)$, e.g. those produced by ∆augmented-lagrangian-algorithm. Let $y^k := u^k - c(x^k)/\sigma^k$ be the Lagrangian multiplier estimate.

@State a theorem about the global convergence of $(x^k, y^k)$, and give an intuitive explanation.

Suppose:

  • $f, c \in \mathcal C^1$
  • $\|\nabla _ x \Phi(x^k, u^k, \sigma^k)\| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
  • $x^k \to x^\ast$ where $\nabla c _ i(x^\ast), i =1, \ldots, m$ are linearly independent

Then:

  • $y^k \to y^\ast$ as $k \to \infty$, where
  • $y^\ast$ satisfies $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast = 0$.

If additionally either:

  • $\sigma^k \to 0$ with $u^k$ bounded, or
  • $u^k \to y^\ast$ with $\sigma^k$ bounded

Then:

  • $x^\ast$ is a KKT point of the original problem with associated Lagrange multipliers $y^\ast$.

Intuitive explanation: This statement is really talking about how if you optimise $x^k \to x^\ast$, you get Lagrangian stationarity $\nabla f(x^k) - J(x^\ast)^\top y^\ast$ automatically holds, given that we take $y^k := u^k - c(x^k)/\sigma^k$ as our Lagrangian estimate at each step.

Additionally, we also can upgrade to saying that we actually get a KKT point, given that either:

  • Increase the penalty asymptotically ($\sigma^k \to 0$)
  • Push the multiplier to its true value $u^k \to y^\ast$ (with $\sigma^k$ kept bounded)

@exam~

Suppose we have a nonlinear equality-constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad\text{s.t.}\quad c(x) = 0,\]

and the augmented Lagrangian function

\[\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \frac{1}{2\sigma} \|c(x)\|^2.\]

Consider any sequence of iterates $(x^k, u^k, \sigma^k)$ where $x^k$ is an approximate stationary point of $\Phi(\cdot, u^k, \sigma^k)$, e.g. those produced by ∆augmented-lagrangian-algorithm. Let $y^k := u^k - c(x^k)/\sigma^k$ be the Lagrange multiplier estimate.

@Prove that if

  • $f, c \in \mathcal C^1$
  • $\|\nabla _ x \Phi(x^k, u^k, \sigma^k)\| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
  • $x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$, $i = 1, \ldots, m$ are linearly independent

then:

  • $y^k \to y^\ast$
  • $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast = 0$

and if either:

  • $\sigma^k \to 0$ with $u^k$ bounded, or
  • $u^k \to y^\ast$ with $\sigma^k$ bounded

then:

  • $x^\ast$ is a KKT point of the original problem with multiplier $y^\ast$.

Multiplier convergence:

The expression

\[\nabla _ x \Phi(x^k, u^k, \sigma^k) = \nabla f(x^k) - J(x^k)^\top y^k\]

is the only equation needed, so the argument is identical to the corresponding step in ∆quadratic-penalty-method-convergence-proof (the pure-penalty case is the specialisation $u^k = 0$). The hypothesis that $\nabla c _ i(x^\ast)$ are linearly independent is ∆licq at $x^\ast$. Hence $J(x^\ast)$ has full row rank and $J(x^\ast)^+ := (J(x^\ast)J(x^\ast)^\top)^{-1} J(x^\ast)$ is well-defined, so by continuity $J(x^k)^+$ exists and is bounded for large $k$. The hypothesis $\nabla _ x \Phi \to 0$ then forces $y^k \to y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$, and hence by construction $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast = 0$.

(Note that this uses none of the asymptotic hypotheses on $\sigma^k$ or $u^k$).

Feasibility ($c(x^\ast) = 0$):

From $y^k = u^k - c(x^k)/\sigma^k$, rearrange:

\[c(x^k) = \sigma^k(u^k - y^k).\]

The triangle inequality then gives

\[\begin{aligned} \|c(x^k)\| &= \sigma^k \|u^k - y^k\| \\ &\le \sigma^k \|y^k - y^\ast\| + \sigma^k \|u^k - y^\ast\|. \qquad (\ast) \end{aligned}\]

The LHS tends to $\|c(x^\ast)\|$ by continuity, so we just need RHS $\to 0$.

  • Case (i): $\sigma^k \to 0$, $\|u^k\| \le M$. Then $\sigma^k \|u^k - y^\ast\| \le \sigma^k(M + \|y^\ast\|) \to 0$, and $\sigma^k \|y^k - y^\ast\| \to 0$ since $y^k \to y^\ast$ and $\sigma^k$ is bounded.
  • Case (ii): $u^k \to y^\ast$, $\sigma^k \le \overline \sigma$. Then $\sigma^k \|u^k - y^\ast\| \le \overline \sigma \|u^k - y^\ast\| \to 0$, and $\sigma^k \|y^k - y^\ast \| \le \overline \sigma \|y^k - y^\ast\| \to 0$.

In either case, the RHS of $(\ast) \to 0$, so $c(x^\ast) = 0$. Combined with Lagrangian stationarity, $x^\ast$ is KKT.

Note that the Augmented Lagrangian may converge to KKT points without $\sigma^k \to 0$, which limits the ill-conditioning.

Worked examples

Consider

\[\min _ {x \in \mathbb R^2} x _ 1^2 + (x _ 2 - 1)^2 \quad \text{s.t.} \quad x _ 1 + x _ 2 - 2 = 0, \qquad x^\ast = \left(\tfrac12, \tfrac32\right),\ y^\ast = 1.\]

For the augmented Lagrangian $\Phi(x,u,\sigma) = f(x) - u\,c(x) + \tfrac{1}{2\sigma} c(x)^2$: (i) find its global minimiser $x(u,\sigma)$; (ii) show $x(u,\sigma) \to x^\ast$ as $\sigma \to 0$ ($u$ fixed) and that $\operatorname{cond}\nabla^2 _ {xx}\Phi \to \infty$; (iii) with $\sigma$ fixed and $u^{k+1} = u^k - c(x(u^k,\sigma))/\sigma$, show $u^k \to 1$ linearly and $x(u^k,\sigma) \to x^\ast$.

(i) Find $x(u,\sigma)$. Differentiate with respect to $x$ only ($u, \sigma$ are parameters, not variables). Using $\nabla _ x[-u\,c] = -u\nabla c$ and $\nabla _ x[\tfrac{1}{2\sigma}c^2] = \tfrac1\sigma c\,\nabla c$ (chain rule), with $\nabla c = (1,1)$:

\[2x _ 1 - u + \tfrac1\sigma(x _ 1 + x _ 2 - 2) = 0, \qquad 2(x _ 2 - 1) - u + \tfrac1\sigma(x _ 1 + x _ 2 - 2) = 0.\]

Subtracting the two equations: $2x _ 1 - 2(x _ 2 - 1) = 0$, so $x _ 1 = x _ 2 - 1$. Substituting back,

\[x _ 1(u,\sigma) = \frac{1 + \sigma u}{2(1+\sigma)}, \qquad x _ 2(u,\sigma) = \frac{3 + \sigma(u+2)}{2(1+\sigma)}.\]

The Hessian $\nabla^2 _ {xx}\Phi = \begin{pmatrix} 2 + 1/\sigma & 1/\sigma \\ 1/\sigma & 2 + 1/\sigma \end{pmatrix} \succ 0$, so this is the unique global minimiser.

(ii) Limit and conditioning. As $\sigma \to 0$ with $u$ fixed, $x _ 1 \to \tfrac12$ and $x _ 2 \to \tfrac32$, recovering $x^\ast$. The Hessian has eigenvalues $\lambda _ {\max} = 2 + 2/\sigma$ (eigenvector $(1,1)$) and $\lambda _ {\min} = 2$ (eigenvector $(1,-1)$), so

\[\operatorname{cond}\nabla^2 _ {xx}\Phi = \frac{\lambda _ {\max}}{\lambda _ {\min}} = 1 + \frac1\sigma \to \infty.\]

(iii) Multiplier iteration. The multiplier estimate is $y = u - c(x)/\sigma$ (shift $u$ minus penalty-scaled constraint, not $\sigma - c/u$). Substituting $x(u^k,\sigma)$ into the constraint,

\[c(x(u^k,\sigma)) = \frac{\sigma(u^k - 1)}{1+\sigma}.\]

The update becomes $u^{k+1} - 1 = \tfrac{\sigma}{1+\sigma}(u^k - 1)$, linear convergence with factor $\rho = \tfrac{\sigma}{1+\sigma} \in (0,1)$. So $u^k \to 1 = y^\ast$ linearly, and feeding $u \to 1$ into $x(u,\sigma)$ gives $x(u^k,\sigma) \to x^\ast$. This is the distinctive advantage over the pure penalty: convergence with $\sigma$ fixed and bounded, so no $O(1/\sigma)$ ill-conditioning (∆aug-lagrangian-global-convergence).

Source: Continuous Optimisation 2017, Q3(a); method per Lecture 13 (∆augmented-lagrangian-setup, multiplier update).

@example~ @exam~

Bite-sized

For the augmented Lagrangian $\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \tfrac{1}{2\sigma}\ \vert c(x)\ \vert ^2$, the Lagrange multiplier estimate at iterate $x^k$ is $y^k := u^k - c(x^k)/\sigma^k$.

Source: Lecture 13, Derivatives of the augmented Lagrangian function slide.

@bite~

@exam~

The augmented Lagrangian $\Phi(x, u, \sigma) = f(x) - u^\top c(x) + \tfrac{1}{2\sigma}\|c(x)\|^2$ admits two complementary interpretations. State both.

  • Shifted quadratic penalty: $\Phi$ is the quadratic penalty function for the shifted constraint $c(x) - \sigma u = 0$ (modulo a constant). The $u$ parameter shifts where the penalty is centred.
  • Convexification of the Lagrangian: $\Phi(x, u, \sigma) = \mathcal L(x, u) + \tfrac{1}{2\sigma}\|c(x)\|^2$, i.e. the standard Lagrangian plus a quadratic regulariser that pulls $x$ toward feasibility. The regulariser convexifies $\mathcal L$ near the constraint manifold and ensures a well-defined minimiser even when $\mathcal L(\cdot, u)$ is unbounded below.

Both views are useful: shifted-penalty for designing $\sigma$-update rules, convexified-Lagrangian for Newton-based inner solves.

Source: Lecture 13, Nonlinear equality-constrained problems introductory slide.

@bite~

Unlike the pure quadratic penalty (which requires $\sigma \to 0$ to recover KKT, blowing up the condition number of $\nabla^2 \Phi _ \sigma$), the augmented Lagrangian can converge to a KKT point with $\sigma^k$ bounded above, provided the multiplier estimate $u^k \to y^\ast$. This is what limits the ill-conditioning of the augmented Lagrangian method.

Source: Lecture 13, remark following Theorem 22.

@bite~

@exam~

In the practical augmented Lagrangian algorithm, the update rule for $(u, \sigma)$ depends on whether feasibility is on track: if $\ \vert c(x^k)\ \vert \le \eta^k$ then $u^{k+1} := y^k$ (promote the multiplier estimate) and $\sigma^{k+1} := \sigma^k$ (keep penalty); otherwise $u^{k+1} := u^k$ (keep multiplier) and $\sigma^{k+1} \le \tau \sigma^k$ (shrink penalty).

Source: Lecture 13, A basic augmented Lagrangian method slide.

@bite~

@exam~

Give the proof strategy for ∆aug-lagrangian-global-convergence-proof (Theorem 22: augmented Lagrangian convergence under LICQ).

  • Multiplier convergence (and Lagrangian stationarity): identical to ∆quadratic-penalty-method-convergence-proof — the penalty case is just $u^k \equiv 0$. From $\nabla _ x \Phi(x^k, u^k, \sigma^k) = \nabla f(x^k) - J(x^k)^\top y^k$ and the LICQ-driven pseudo-inverse $J(x^\ast)^+$, $\epsilon^k \to 0$ forces $y^k \to y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$ and $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$. Note: this part uses neither asymptotic hypothesis on $\sigma^k$ or $u^k$.
  • Feasibility ($c(x^\ast) = 0$): rearrange $y^k = u^k - c(x^k)/\sigma^k$ to $c(x^k) = \sigma^k(u^k - y^k)$. Triangle inequality: $\|c(x^k)\| \le \sigma^k \|y^k - y^\ast\| + \sigma^k \|u^k - y^\ast\|$.
  • Case (i) $\sigma^k \to 0$, $\|u^k\| \le M$: $\sigma^k \|u^k - y^\ast\| \le \sigma^k(M + \|y^\ast\|) \to 0$, and $\sigma^k \|y^k - y^\ast\| \to 0$ since $y^k \to y^\ast$ with $\sigma^k$ bounded. So RHS $\to 0$.
  • Case (ii) $u^k \to y^\ast$, $\sigma^k \le \overline\sigma$: $\sigma^k \|u^k - y^\ast\| \le \overline\sigma \|u^k - y^\ast\| \to 0$, and $\sigma^k \|y^k - y^\ast\| \le \overline\sigma \|y^k - y^\ast\| \to 0$.
  • In either case, $\|c(x^k)\| \to 0$, so by continuity $c(x^\ast) = 0$, and KKT holds.

Source: Lecture 13, Proof of Theorem 22.

@bite~ @proofsupport~

The augmented-Lagrangian feasibility identity (analogue of ∆bite-penalty-method-convergence-key-inequality-feasibility-identity). Rearranging $y^k = u^k - c(x^k)/\sigma^k$:

\[c(x^k) = <span class="cloze" tabindex="0">\sigma^k(u^k - y^k)</span>.\]

This is the penalty-method identity $c(x^k) = -\sigma^k y^k$ shifted by $\sigma^k u^k$, and it explains why two distinct hypotheses can force $c(x^\ast) = 0$ in the limit — driving either $\sigma^k \to 0$ (penalty-style) or $\ \vert u^k - y^k\ \vert \to 0$ (multiplier-promotion-style).

Source: Lecture 13, Proof of Theorem 22, equation $(\ast)$.

@bite~ @proofsupport~

In ∆aug-lagrangian-global-convergence-proof (Theorem 22), Case (ii) — convergence to KKT with $\sigma^k$ bounded and $u^k \to y^\ast$ — is the distinctive augmented-Lagrangian case (the pure penalty cannot achieve this). Where does this case enter, and what’s the practical consequence?

  • It enters in the feasibility step. From $\|c(x^k)\| \le \sigma^k \|y^k - y^\ast\| + \sigma^k \|u^k - y^\ast\|$, the second term goes to zero by hypothesis ($u^k \to y^\ast$, $\sigma^k \le \overline\sigma$), and the first goes to zero because $y^k \to y^\ast$ with $\sigma^k$ bounded.
  • Why $u^k \to y^\ast$ is plausible in practice: the AL algorithm (∆augmented-lagrangian-algorithm) actively promotes $u^{k+1} := y^k$ whenever the feasibility tolerance $\|c(x^k)\| \le \eta^k$ is met. Since $y^k \to y^\ast$ already from the multiplier-convergence step, this forces $u^k \to y^\ast$ over the subsequence of “feasible-enough” iterations.
  • Practical consequence: $\sigma^k$ can stay bounded above, so the augmented-Lagrangian Hessian $\nabla^2 _ {xx} \mathcal L(x, y(\sigma)) + (1/\sigma) J^\top J$ remains well-conditioned. This avoids the $O(1/\sigma)$ ill-conditioning that plagues the pure penalty method — see ∆bite-aug-lagrangian-limits-ill-conditioning.

Source: Lecture 13, Proof of Theorem 22, Case (ii) and concluding remark.

@bite~ @proofsupport~