Continuous Optimisation HT26, Penalty methods
Flashcards
Suppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
where $f : \mathbb R^n \to \mathbb R$, $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth. What is the idea and motivation behind penalty methods for such problems?
It is very difficult to find local solutions given such nonlinear constraints. Instead, we form a single, parameterised and unconstrained objective so that minimisers approach initial problem solutions as parameters vary.
Suppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
@State the quadratic penalty function given a penalty parameter $\sigma > 0$.
Suppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
@State an @algorithm for a quadratic penalty method.
Suppose we have $\sigma^0 > 0$. Let $k = 0$. Until convergence:
- Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
- Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}} = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
- Let $k := k+1$
Suppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
and we solve via the quadratic penalty method where given $\sigma^0 > 0$ and $k = 0$, until convergence:
- Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
- Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}} = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
- Let $k := k+1$
@State a theorem about the global convergence of this method.
Suppose:
- $f, c \in \mathcal C^1$
- $y _ i^k = -c _ i(x^k)/\sigma^k$ for $i = 1, \ldots, m$
- $ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k$ where $\epsilon^k \to 0$ as $k \to \infty$
- $\sigma^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$ for each $i = 1, \ldots, m$ are linearly independent
Then:
- $x^\ast$ is a KKT point of the optimisation problem
- $y^k \to y^\ast$ where $y^\ast$ is the vector of Lagrange multipliers of the constraints
Remarks on the Jacobian rank hypothesis:
- $\nabla c _ i(x^\ast)$ being linearly independent for $i = 1, \ldots, m$ is equivalent to $J(x^\ast) \in \mathbb R^{m \times n}$ having full row-rank, which forces $m \le n$ (at most as many equality constraints as variables).
- If $J(x^\ast)$ is rank-deficient, then the theorem fails. The multiplier estimates $y^k \to \infty$, and $x^\ast$ instead locally minimises the infeasibility $|c(x)|$. It’s closest-feasible-point in a degenerate-constraint sense, not a KKT point of the original problem.
- In the corresponding ∆quadratic-penalty-method-convergence-proof, full row rank is what makes the pseudo-inverse $J(x^\ast)^+ = (J J^\top)^{-1} J$ well-defined.t
Suppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
and we solve via the quadratic penalty method where given $\sigma^0 > 0$ and $k = 0$, until convergence:
- Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
- Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}} = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
- Let $k := k+1$
@Prove that if
- $f, c \in \mathcal C^1$
- $y _ i^k = -c _ i(x^k)/\sigma^k$ for $i = 1, \ldots, m$
- $ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k$ where $\epsilon^k \to 0$ as $k \to \infty$
- $\sigma^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$ for each $i = 1, \ldots, m$ are linearly independent
then:
- $x^\ast$ is a KKT point of the optimisation problem
- $y^k \to y^\ast$ where $y^\ast$ is the vector of Lagrange multipliers of the constraints
What we must show: since (eCP) has only equality constraints, “$x^\ast$ is a KKT point” means precisely that there exists $y^\ast \in \mathbb R^m$ with
- feasibility $c(x^\ast) = 0$, and
- stationarity $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$.
The multiplier estimate: differentiating $\Phi _ {\sigma^k}$ gives (this is ∆penalty-derivatives-as-lagrangian)
\[\begin{aligned} \nabla \Phi _ {\sigma^k}(x^k) &= \nabla f(x^k) + \frac{1}{\sigma^k} J(x^k)^\top c(x^k) \\ &= \nabla f(x^k) - J(x^k)^\top y^k \end{aligned}\]where $y^k := -c(x^k)/\sigma^k$. So $y^k$ is chosen exactly to make $\nabla \Phi _ {\sigma^k}(x^k) = \nabla _ x \mathcal L(x^k, y^k)$: approximate stationarity of $\Phi$ is approximate Lagrangian stationarity.
The candidate multiplier (built from LICQ alone): for fixed $x$, the least-squares problem
\[\overline y(x) := \arg \min _ {y \in \mathbb R^m} \vert \vert \nabla f(x) - J(x)^\top y \vert \vert ^2\]has normal equations $J(x) J(x)^\top \overline y = J(x) \nabla f(x)$, so $\overline y(x) = J(x)^+ \nabla f(x)$ with pseudo-inverse $J(x)^+ := (J(x) J(x)^\top)^{-1} J(x)$, provided $J(x)$ has full row rank. By ∆licq, $J(x^\ast)$ has full row rank, so $J(x^\ast)^+$ exists; by continuity $J(x^k)^+$ also exists, is continuous and bounded for all large $k$. Define
\[y^\ast := J(x^\ast)^+ \nabla f(x^\ast).\]This construction uses only LICQ, not optimality. Note that a priori $y^\ast$ solves only the normal equations $J(x^\ast)J(x^\ast)^\top y^\ast = J(x^\ast)\nabla f(x^\ast)$; the exact stationarity $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$ holds only if the least-squares residual vanishes, which is what Step 2 establishes.
Step 1 (multiplier convergence): we show $y^k \to y^\ast$. By the triangle inequality,
\[\vert \vert y^k - y^\ast \vert \vert \le \vert \vert J(x^k)^+ \nabla f(x^k) - J(x^\ast)^+ \nabla f(x^\ast) \vert \vert + \vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert .\]The first term $\to 0$, since $x^k \to x^\ast$ with $J^+$ and $\nabla f$ continuous gives $J(x^k)^+ \nabla f(x^k) \to J(x^\ast)^+ \nabla f(x^\ast) = y^\ast$. For the second term, full row rank gives $J^+ J^\top = (JJ^\top)^{-1}(JJ^\top) = I _ m$, so using $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$,
\[\begin{aligned} \vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert &= \vert \vert J(x^k)^+ (\nabla f(x^k) - J(x^k)^\top y^k) \vert \vert \\ &= \vert \vert J(x^k)^+ \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \\ &\le \vert \vert J(x^k)^+ \vert \vert \cdot \epsilon^k \to 0, \end{aligned}\]as $ \vert \vert J(x^k)^+ \vert \vert $ is bounded and $\epsilon^k \to 0$. Both terms vanish, so $y^k \to y^\ast$.
Step 2 (stationarity): pass to the limit in $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$. The right-hand side $\to \nabla f(x^\ast) - J(x^\ast)^\top y^\ast$ by continuity and $y^k \to y^\ast$, while the left-hand side $\to 0$ since $ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k \to 0$. Hence
\[\nabla f(x^\ast) = J(x^\ast)^\top y^\ast.\]Step 3 (feasibility): rearranging the definition of $y^k$ gives $c(x^k) = -\sigma^k y^k$. As $k \to \infty$ the right-hand side $\to 0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ is bounded), while the left-hand side $\to c(x^\ast)$ by continuity. Hence
\[c(x^\ast) = 0.\]Steps 2 and 3 are exactly stationarity and feasibility, so $(x^\ast, y^\ast)$ is a KKT point of (eCP) with $y^\ast$ the vector of Lagrange multipliers, as required.
@exam~
**^quadratic-penalty-method-convergence-proof
Suppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
Let $L$ be the Lagrangian function
\[L(x,y) = f(x) - y^\top c(x)\]
and let $\Phi _ \sigma$ be the quadratic penalty function
\[\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2\]
Give an interpretation of the derivative $\nabla \Phi _ \sigma$ and Hessian $\nabla^2 \Phi _ \sigma$ in terms of the Lagrangian.
where $y(\sigma) := -c(x)/\sigma$. Likewise
\[\begin{aligned} \nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i=1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x) \\ &= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x) \end{aligned}\]Suppose:
- We have a nonlinear equality constrained optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ such that $c(x) = 0$
- Lagrangian $L(x,y) = f(x) - y^\top c(x)$
- $\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Recall then that
\[\begin{aligned}
\nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i = 1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x)\\
&= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x)
\end{aligned}\]
Describe the behaviour of this function as $\sigma \to 0$ (appealing to results you do not need to prove).
- Generally, $c _ i(x) \to 0$ at the same rate with $\sigma$ for all $i$ and hence usually $\nabla^2 _ {xx} L(x, y(\sigma)$ is well-behaved
- For the second term, $J(x)^\top J(x) / \sigma \to \infty$
Suppose:
- We have a nonlinear equality constrained optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ such that $c(x) = 0$
- Lagrangian $L(x,y) = f(x) - y^\top c(x)$
- $\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Recall then that
\[\begin{aligned}
\nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i = 1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x)\\
&= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x)
\end{aligned}\]
@State a theorem about the behaviour of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ in the limit as $k \to \infty$ and what this means for solving nonlinear equality constrained optimisation problems.
In this context:
- There are $n$ eigenvalues of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ in total (by dimensions)
- $m$ of these eigenvalues are $\mathcal O(1 / \sigma^k)$ and hence tend to infinity as $k \to \infty$
- The remaining $n - m$ eigenvalues are $\mathcal O(1)$ in the limit
- Hence the condition number of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ is $\mathcal O(1/\sigma^k)$
This is bad news because many optimisation algorithms depend on the conditioning properties of $\nabla^2 \Phi _ {\sigma^k}$ (for example Newton’s method).
Suppose:
- Nonlinear equality-constrained problem $\min _ {x \in \mathbb R^n} f(x)$ subject to $c(x) = 0$.
- Lagrangian $L(x, y) = f(x) - y^\top c(x)$.
- Quadratic penalty function $\Phi _ \sigma(x) = f(x) + \tfrac{1}{2\sigma} |c(x)|^2$.
We have
\[\nabla^2 \Phi _ \sigma(x) = \nabla^2 _ {xx} L(x, y(\sigma)) + \tfrac{1}{\sigma} J(x)^\top J(x), \qquad y(\sigma) = -c(x)/\sigma.\]
The conditioning problem. The Hessian $\nabla^2 \Phi _ {\sigma^k}(x^k)$ has $n$ eigenvalues, of which:
- $m$ are $\mathcal O(1/\sigma^k)$ and diverge as $\sigma^k \to 0$, coming from the $\tfrac{1}{\sigma} J^\top J$ term;
- the remaining $n - m$ are $\mathcal O(1)$.
So $\kappa _ 2(\nabla^2 \Phi _ {\sigma^k}(x^k)) = \mathcal O(1/\sigma^k)$. This is bad news for algorithms relying on accurate solves of the Newton system
\[\nabla^2 \Phi _ \sigma(x) \, s = -\nabla \Phi _ \sigma(x), \quad (\ast)\]
where $s$ is the Newton direction at some current $x = x^{k,i}$, $\sigma = \sigma^{k+1}$.
@State and @justify a method for accurately determining the Newton direction $s$ by introducing an auxiliary variable that avoids these conditioning issues.
The fix in one sentence. Don’t solve the $n \times n$ system $(\ast)$ directly. Instead, introduce an auxiliary unknown $w \in \mathbb R^m$ and solve a coupled $(n + m) \times (n + m)$ system in $(s, w)$ whose matrix is well-conditioned as $\sigma \to 0$. The same $s$ is recovered, more accurately.
Setup. Substituting the expressions for $\nabla \Phi _ \sigma$ and $\nabla^2 \Phi _ \sigma$, the Newton system $(\ast)$ reads
\[\left(\nabla^2 _ {xx} L(x, y(\sigma)) + \tfrac{1}{\sigma} J(x)^\top J(x)\right) s = -\left(\nabla f(x) + \tfrac{1}{\sigma} J(x)^\top c(x)\right). \quad (\ast\ast)\]Both sides contain the troublesome $1/\sigma$ factors. The idea is to extract them into a new variable.
Introduce the auxiliary $w$. Define
\[w := \tfrac{1}{\sigma} \big(J(x) s + c(x)\big). \quad (\dagger)\]By construction $(\dagger)$, we have the identity $\tfrac{1}{\sigma} J(x)^\top \big(J(x) s + c(x)\big) = J(x)^\top w$, so $(\ast\ast)$ becomes
\[\nabla^2 _ {xx} L(x, y(\sigma)) \, s + J(x)^\top w = -\nabla f(x). \quad (Eq.\ 1)\]And rearranging the definition $(\dagger)$ gives
\[J(x) s - \sigma w = -c(x). \quad (Eq.\ 2)\]Stacking Eq.\ 1 and Eq.\ 2 into a $(n + m) \times (n + m)$ system:
\[\begin{pmatrix} \nabla^2 _ {xx} L(x, y(\sigma)) & J(x)^\top \\ J(x) & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ w \end{pmatrix} = -\begin{pmatrix} \nabla f(x) \\ c(x) \end{pmatrix}. \quad (\ddagger)\]Why solving the augmented system gives the same $s$, with $w$ automatically taking its defining value. The system $(\ddagger)$ has $n + m$ scalar equations in $n + m$ unknowns. Eq.\ 2 is literally the defining relation $(\dagger)$ rearranged, so any solution $(s, w)$ to $(\ddagger)$ has $w$ equal to the value $\tfrac{1}{\sigma}(J(x) s + c(x))$ given by its definition — there is no “different value of $w$” possible. And substituting that defining value of $w$ back into Eq.\ 1 yields exactly $(\ast\ast)$, the original Newton equation for $s$.
So the augmented system is mathematically equivalent to the original Newton system. The point is purely numerical: solving $(\ddagger)$ behaves better than solving $(\ast)$ when $\sigma$ is small.
Why the augmented system has good conditioning. The original Hessian gets its $\mathcal O(1/\sigma)$ condition number from the $\tfrac{1}{\sigma} J^\top J$ block. In the augmented form, $\sigma$ enters only through the $-\sigma I$ block, which vanishes (rather than diverges) as $\sigma \to 0$. As $\sigma \to 0$, the matrix tends to
\[\begin{pmatrix} \nabla^2 _ {xx} L(x, y^\ast) & J(x)^\top \\ J(x) & 0 \end{pmatrix},\]the KKT/SQP saddle-point matrix at the limit. This is well-conditioned provided $J(x)$ has full row rank and $\nabla^2 _ {xx} L$ is positive definite on $\ker J(x)$ — the standard non-degeneracy assumptions at a regular constrained minimiser.
Interpretation of $w$. From $w = \tfrac{1}{\sigma}(J(x) s + c(x))$ and $y(\sigma) = -c(x)/\sigma$, after one linearised step $s$ the predicted constraint value is $c(x) + J(x) s$, so $-w = -\tfrac{1}{\sigma}(c(x) + J(x) s)$ is the multiplier estimate at the next iterate. Unlike $y(\sigma) = -c(x)/\sigma$, which behaves badly at intermediate iterates where $c(x) \ne 0$ and $\sigma \to 0$, the quantity $w$ stays bounded (and converges to $y^\ast$) because $c(x) + J(x) s$ is the first-order prediction of $c(x _ {\text{next}})$, which vanishes at the same rate as $\sigma$.
Caveat. The cleanly-computed $s$ is necessary but not sufficient for a good step. The underlying $\Phi _ \sigma$ remains poorly scaled: its level sets are elongated with $\mathcal O(1/\sigma)$ aspect ratio. Consequences:
- Linesearch: a Euclidean step $\alpha s$ that looks fine on the quadratic model can exit the region where the quadratic is actually accurate, overshooting in high-curvature directions.
- Trust region: a Euclidean constraint $|s| \le \Delta$ inherits the same scaling problem — it allows large moves in low-curvature directions while clamping high-curvature ones equally.
- Fix: use a $B$-norm trust region $|s| _ B \le \Delta$, where $B$ encodes the ill-conditioned curvature. The trust region then tracks the elongated level sets, encouraging roughly equal model decrease in all directions.
Consider the general constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c _ E(x) = 0, c _ I(x) \ge 0\]
@Define what it means for $\Phi(x, \sigma)$ to be an exact penalty function.
There exists $\sigma _ \ast > 0$ such that if $\sigma < \sigma _ \ast$, any local solution of the constrained optimisation problem is a local minimiser of $\Phi(x, \sigma)$.
Consider the general constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c _ E(x) = 0, c _ I(x) \ge 0\]
Recall that $\Phi(x, \sigma)$ is an ∆exact-penalty-function if there exists $\sigma _ \ast > 0$ such that if $\sigma < \sigma _ \ast$, any local solution of the constrained optimisation problem is a local minimiser of $\Phi(x, \sigma)$. @State the $l _ 1$ and $l _ 2$ exact penalty functions.
Let $[z]^- := \max(-z, 0)$ denote the negative-part of $z$: zero when $z \ge 0$ and $-z > 0$ when $z < 0$. Write $[c _ I(x)]^- := ([c _ i(x)]^-) _ {i \in I}$ for the entrywise negative-part of the inequality block. Then both penalties take the unified form $\Phi(x, \sigma) = f(x) + (1/\sigma) |(c _ E(x),\ [c _ I(x)]^-)| _ p$ with $p = 1, 2$.
$l _ 2$ penalty function ($p = 2$):
\[\Phi(x, \sigma) = f(x) + \frac 1 \sigma \big\|\big(c _ E(x),\ [c _ I(x)]^-\big)\big\| _ 2 = f(x) + \frac 1 \sigma \sqrt{\|c _ E(x)\|^2 + \sum _ {i \in I} ([c _ i(x)]^-)^2}.\]$l _ 1$ penalty function ($p = 1$):
\[\Phi(x, \sigma) = f(x) + \frac 1 \sigma \sum _ {i \in E} \vert c _ i(x) \vert + \frac 1 \sigma \sum _ {i \in I} [c _ i(x)]^-.\]Both kink at ${c _ E = 0}$ (the unsquared norm / absolute value) and at every active or violated inequality (the negative-part), which is the price of exactness — see ∆bite-exact-penalty-nonsmooth-consequence.
Note on the lecture: Cartis’s Lecture 12 “Other penalty functions” slide states the $\ell _ 2$ penalty only in the equality-only form $f + (1/\sigma)|c _ E|$; the mixed-constraint form above is the standard Fletcher exact penalty and is what makes the $\ell _ 1$/$\ell _ 2$ pair sit on equal footing.
Worked examples
For
\[\min _ {x \in \mathbb R^2} x _ 1^2 + x _ 2^2 \quad \text{s.t.} \quad x _ 1 + x _ 2^2 = \alpha, \qquad \alpha \ge 0,\]
write the quadratic penalty $\Phi _ \sigma$, find its stationary points $x(\sigma)$, say when each is a local minimiser, take $\sigma \to 0$ and connect to the constrained solution, and show the Hessian’s condition number grows unbounded as $\sigma \to 0$.
Step 1 (penalty function): with $c(x) = x _ 1 + x _ 2^2 - \alpha$,
\[\Phi _ \sigma(x) = x _ 1^2 + x _ 2^2 + \frac{1}{2\sigma}(x _ 1 + x _ 2^2 - \alpha)^2.\]Watch: the coefficient is $\tfrac{1}{2\sigma}$, not $\tfrac1\sigma$ (∆quadratic-penalty-function-definition). The factor of $2$ propagates into every formula below.
Step 2 (stationary points): using $\nabla _ x[\tfrac{1}{2\sigma}c^2] = \tfrac1\sigma c\,\nabla c$,
\[\nabla\Phi _ \sigma = \begin{pmatrix} 2x _ 1 + \tfrac1\sigma c \\ 2x _ 2\left(1 + \tfrac1\sigma c\right) \end{pmatrix} = 0.\]From the first equation $\tfrac1\sigma c = -2x _ 1$, so the second becomes $2x _ 2(1 - 2x _ 1) = 0$: either $x _ 2 = 0$ or $x _ 1 = \tfrac12$.
- $x _ 2 = 0$: $2x _ 1 + \tfrac1\sigma(x _ 1 - \alpha) = 0$ gives $x(\sigma) = \left(\dfrac{\alpha}{1 + 2\sigma},\ 0\right)$.
- $x _ 1 = \tfrac12$: $\tfrac1\sigma c = -1$ gives $x _ 2^2 = \alpha - \tfrac12 - \sigma$, real only for $\alpha > \tfrac12$ and $\sigma \le \alpha - \tfrac12$: $\hat x(\sigma) = \left(\tfrac12,\ \pm\sqrt{\alpha - \tfrac12 - \sigma}\right)$.
Step 3 (which are local minima): at the $x _ 2 = 0$ branch,
\[\nabla^2 _ {xx}\Phi _ \sigma(x(\sigma)) = \operatorname{diag}\left(2 + \tfrac1\sigma,\ \frac{2(1 - 2\alpha + 2\sigma)}{1 + 2\sigma}\right),\]which is positive definite iff $1 - 2\alpha + 2\sigma > 0$ (always true when $\alpha \le \tfrac12$). The branch $\hat x(\sigma)$ has positive determinant for $\sigma < \alpha - \tfrac12$, so it too is a local minimiser there.
Step 4 ($\sigma \to 0$): $x(\sigma) \to (\alpha, 0)$ and $\hat x(\sigma) \to \left(\tfrac12, \pm\sqrt{\alpha - \tfrac12}\right)$, exactly the KKT points of the constrained problem (∆kkt-equality-worked-example-parabola-constraint).
Step 5 (ill-conditioning): along the $x _ 2 = 0$ branch the top eigenvalue $2 + \tfrac1\sigma \to \infty$ while the bottom $\to 2(1 - 2\alpha)$ stays finite, so $\operatorname{cond}\nabla^2 _ {xx}\Phi _ \sigma = O(1/\sigma) \to \infty$ as $\sigma \to 0$ (∆penalty-hessian-ill-conditioning).
Bite-sized
In the quadratic penalty $\Phi _ \sigma(x) = f(x) + \tfrac{1}{2\sigma} \ \vert c(x)\ \vert ^2$, small $\sigma$ enforces feasibility: the penalty term $\ \vert c(x)\ \vert ^2 / (2\sigma)$ becomes very steep in the $c \ne 0$ direction, so minimisers $x(\sigma)$ are pushed toward the constraint manifold ${c = 0}$ as $\sigma \to 0$.
For the quadratic penalty $\Phi _ \sigma$ on (eCP), the natural Lagrange multiplier estimate at iterate $x^k$ is $y^k = -c(x^k)/\sigma^k$ (note the sign). This is what makes $\nabla \Phi _ {\sigma^k}(x^k) = \nabla _ x \mathcal L(x^k, y^k)$, identifying penalty-minimisation with Lagrangian stationarity.
Why use exact penalty functions at all, what makes the $\ell _ 1$ and $\ell _ 2$ versions non-smooth, and what’s the consequence?
Why use exact penalties at all: the quadratic penalty $\Phi _ \sigma = f + \tfrac{1}{2\sigma}|c|^2$ only converges to a KKT point of the original constrained problem in the limit $\sigma \to 0$ (∆quadratic-penalty-method-convergence), and this drives the Hessian’s condition number to $O(1/\sigma) \to \infty$ (∆penalty-hessian-ill-conditioning, ∆bite-penalty-hessian-condition-number). An exact penalty $\Phi(x, \sigma)$ instead recovers any local solution of the constrained problem as a local minimiser of $\Phi(\cdot, \sigma)$ for all $\sigma < \sigma_\ast$ at some finite $\sigma_\ast > 0$ (∆exact-penalty-function) — so there’s no need to drive $\sigma \to 0$, and no ill-conditioning. The $\ell _ 1$ and $\ell _ 2$ constructions are the standard exact penalties (∆exact-penalty-functions-l1-l2).
What makes them non-smooth:
- The $\ell _ 2$ penalty $\Phi(x, \sigma) = f(x) + (1/\sigma) |c _ E(x)|$ has a kink at ${c _ E(x) = 0}$ because $|\cdot|$ (un-squared) is not differentiable at $0$.
- The $\ell _ 1$ penalty $\Phi(x, \sigma) = f(x) + (1/\sigma) \sum \vert c _ i(x) \vert + (1/\sigma) \sum [c _ i(x)]^-$ has kinks at every active or violated constraint.
Consequence: standard smooth-optimisation algorithms (Newton, BFGS, trust region with quadratic models) don’t apply directly — minimisation of exact penalties typically requires bundle methods, smoothing, or specialised SQP-style algorithms. So the tradeoff is smooth-but-inexact (quadratic, $\sigma \to 0$, ill-conditioning) vs nonsmooth-but-exact ($\ell _ 1, \ell _ 2$, finite $\sigma$, well-conditioned but the inner minimisation is harder).
The Hessian $\nabla^2 \Phi _ {\sigma^k}(x^k)$ of the quadratic penalty has condition number $\kappa \sim O(1/\sigma^k)$ as $\sigma^k \to 0$, with $m$ eigenvalues blowing up to $\infty$ (from the $J^\top J / \sigma$ term) and $n - m$ remaining $O(1)$.
Give the proof strategy for ∆quadratic-penalty-method-convergence-proof (Theorem 21: under LICQ at $x^\ast$, the quadratic penalty method converges to a KKT point and $y^k \to y^\ast$).
- Setup: from $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$ with $y^k := -c(x^k)/\sigma^k$, the approximate-stationarity hypothesis becomes $|\nabla f(x^k) - J(x^k)^\top y^k| \le \epsilon^k$.
- LICQ ⟹ $J(x^\ast)$ has full row rank ⟹ pseudo-inverse $J(x^\ast)^+ = (JJ^\top)^{-1} J$ exists; by continuity it persists for nearby $x^k$.
- Step 1 (multipliers converge): define candidate $y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$. Triangle inequality: $|y^k - y^\ast| \le |J(x^k)^+ \nabla f(x^k) - J(x^\ast)^+ \nabla f(x^\ast)| + |J(x^k)^+ \nabla f(x^k) - y^k|$. First piece → 0 by continuity. Second piece: use $J^+ J^\top = I _ m$ to rewrite $|J(x^k)^+ \nabla f(x^k) - y^k| = |J(x^k)^+ \nabla \Phi _ {\sigma^k}(x^k)| \le |J(x^k)^+| \cdot \epsilon^k \to 0$.
- Step 2 (stationarity): pass to the limit in $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$. LHS → 0; RHS → $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast$. Hence $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$.
- Step 3 (feasibility): from $y^k = -c(x^k)/\sigma^k$, rearrange to $c(x^k) = -\sigma^k y^k$. Pass to limit: $c(x^\ast) \leftarrow c(x^k) = -\sigma^k y^k \to 0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ bounded). So $c(x^\ast) = 0$.
Together, $(x^\ast, y^\ast)$ satisfies the equality-case KKT conditions.
The feasibility identity at the heart of Step 3 of ∆quadratic-penalty-method-convergence-proof. From the definition of the multiplier estimate $y^k := -c(x^k)/\sigma^k$:
\[c(x^k) = <span class="cloze" tabindex="0">-\sigma^k y^k</span>.\]Passing to the limit, LHS → $c(x^\ast)$ by continuity, RHS → $0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ stays bounded). Hence $c(x^\ast) = 0$ — primal feasibility of the limit point.
In ∆quadratic-penalty-method-convergence-proof (Theorem 21), where does the hypothesis $\sigma^k \to 0$ enter, and what fails if $\sigma^k$ stays bounded away from zero?
- $\sigma^k \to 0$ enters in the feasibility step: from the rearrangement $c(x^k) = -\sigma^k y^k$, only $\sigma^k \to 0$ (combined with bounded $y^k$) forces $c(x^k) \to 0$, hence $c(x^\ast) = 0$.
- If $\sigma^k$ stays bounded away from zero: the multiplier estimates $y^k = -c(x^k)/\sigma^k$ can still converge (Step 1), and Lagrangian stationarity (Step 2) can still hold, but there’s no reason $c(x^\ast)$ should equal zero. The penalty method can therefore converge to a critical point of $\Phi _ \sigma$ that is not a KKT point of the constrained problem — it minimises a tradeoff $f + (1/(2\sigma))|c|^2$ rather than $f$ on the feasible set.
- Contrast with the augmented Lagrangian method (∆aug-lagrangian-global-convergence), which can converge to a KKT point even without $\sigma^k \to 0$, because the multiplier update absorbs the role $\sigma^k \to 0$ plays here.