Notes - Continuous Optimisation HT26, Penalty methods

Course - Continuous Optimisation HT26^U
- Notes - Continuous Optimisation HT26, KKT conditions^U

Flashcards

Suppose we have a nonlinear equality constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]

where $f : \mathbb R^n \to \mathbb R$, $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ smooth. What is the idea and motivation behind penalty methods for such problems?

penalty-method-motivation

It is very difficult to find local solutions given such nonlinear constraints. Instead, we form a single, parameterised and unconstrained objective so that minimisers approach initial problem solutions as parameters vary.

Suppose we have a nonlinear equality constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]

@State the quadratic penalty function given a penalty parameter $\sigma > 0$.

quadratic-penalty-function-definition

\[\min _ {x \in \mathbb R^n} \Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2\]

@exam~

Suppose we have a nonlinear equality constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]

@State an @algorithm for a quadratic penalty method.

quadratic-penalty-method

Suppose we have $\sigma^0 > 0$. Let $k = 0$. Until convergence:

Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}} = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Let $k := k+1$

@exam~

Suppose we have a nonlinear equality constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]

and we solve via the quadratic penalty method where given $\sigma^0 > 0$ and $k = 0$, until convergence:

Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}} = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Let $k := k+1$

@State a theorem about the global convergence of this method.

quadratic-penalty-method-convergence

Suppose:

$f, c \in \mathcal C^1$
$y _ i^k = -c _ i(x^k)/\sigma^k$ for $i = 1, \ldots, m$
$ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k$ where $\epsilon^k \to 0$ as $k \to \infty$
$\sigma^k \to 0$ as $k \to \infty$
$x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$ for each $i = 1, \ldots, m$ are linearly independent

Then:

$x^\ast$ is a KKT point of the optimisation problem
$y^k \to y^\ast$ where $y^\ast$ is the vector of Lagrange multipliers of the constraints

Remarks on the Jacobian rank hypothesis:

$\nabla c _ i(x^\ast)$ being linearly independent for $i = 1, \ldots, m$ is equivalent to $J(x^\ast) \in \mathbb R^{m \times n}$ having full row-rank, which forces $m \le n$ (at most as many equality constraints as variables).
If $J(x^\ast)$ is rank-deficient, then the theorem fails. The multiplier estimates $y^k \to \infty$, and $x^\ast$ instead locally minimises the infeasibility $|c(x)|$. It’s closest-feasible-point in a degenerate-constraint sense, not a KKT point of the original problem.
In the corresponding ∆quadratic-penalty-method-convergence-proof, full row rank is what makes the pseudo-inverse $J(x^\ast)^+ = (J J^\top)^{-1} J$ well-defined.t

@exam~

Suppose we have a nonlinear equality constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]

and we solve via the quadratic penalty method where given $\sigma^0 > 0$ and $k = 0$, until convergence:

Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}} = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Let $k := k+1$

@Prove that if

$f, c \in \mathcal C^1$
$y _ i^k = -c _ i(x^k)/\sigma^k$ for $i = 1, \ldots, m$
$ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k$ where $\epsilon^k \to 0$ as $k \to \infty$
$\sigma^k \to 0$ as $k \to \infty$
$x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$ for each $i = 1, \ldots, m$ are linearly independent

then:

$x^\ast$ is a KKT point of the optimisation problem
$y^k \to y^\ast$ where $y^\ast$ is the vector of Lagrange multipliers of the constraints

dullness-wipeout-custom

What we must show: since (eCP) has only equality constraints, “$x^\ast$ is a KKT point” means precisely that there exists $y^\ast \in \mathbb R^m$ with

feasibility $c(x^\ast) = 0$, and
stationarity $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$.

The multiplier estimate: differentiating $\Phi _ {\sigma^k}$ gives (this is ∆penalty-derivatives-as-lagrangian)

\[\begin{aligned} \nabla \Phi _ {\sigma^k}(x^k) &= \nabla f(x^k) + \frac{1}{\sigma^k} J(x^k)^\top c(x^k) \\ &= \nabla f(x^k) - J(x^k)^\top y^k \end{aligned}\]

where $y^k := -c(x^k)/\sigma^k$. So $y^k$ is chosen exactly to make $\nabla \Phi _ {\sigma^k}(x^k) = \nabla _ x \mathcal L(x^k, y^k)$: approximate stationarity of $\Phi$ is approximate Lagrangian stationarity.

The candidate multiplier (built from LICQ alone): for fixed $x$, the least-squares problem

\[\overline y(x) := \arg \min _ {y \in \mathbb R^m} \vert \vert \nabla f(x) - J(x)^\top y \vert \vert ^2\]

has normal equations $J(x) J(x)^\top \overline y = J(x) \nabla f(x)$, so $\overline y(x) = J(x)^+ \nabla f(x)$ with pseudo-inverse $J(x)^+ := (J(x) J(x)^\top)^{-1} J(x)$, provided $J(x)$ has full row rank. By ∆licq, $J(x^\ast)$ has full row rank, so $J(x^\ast)^+$ exists; by continuity $J(x^k)^+$ also exists, is continuous and bounded for all large $k$. Define

\[y^\ast := J(x^\ast)^+ \nabla f(x^\ast).\]

This construction uses only LICQ, not optimality. Note that a priori $y^\ast$ solves only the normal equations $J(x^\ast)J(x^\ast)^\top y^\ast = J(x^\ast)\nabla f(x^\ast)$; the exact stationarity $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$ holds only if the least-squares residual vanishes, which is what Step 2 establishes.

Step 1 (multiplier convergence): we show $y^k \to y^\ast$. By the triangle inequality,

\[\vert \vert y^k - y^\ast \vert \vert \le \vert \vert J(x^k)^+ \nabla f(x^k) - J(x^\ast)^+ \nabla f(x^\ast) \vert \vert + \vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert .\]

The first term $\to 0$, since $x^k \to x^\ast$ with $J^+$ and $\nabla f$ continuous gives $J(x^k)^+ \nabla f(x^k) \to J(x^\ast)^+ \nabla f(x^\ast) = y^\ast$. For the second term, full row rank gives $J^+ J^\top = (JJ^\top)^{-1}(JJ^\top) = I _ m$, so using $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$,

\[\begin{aligned} \vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert &= \vert \vert J(x^k)^+ (\nabla f(x^k) - J(x^k)^\top y^k) \vert \vert \\ &= \vert \vert J(x^k)^+ \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \\ &\le \vert \vert J(x^k)^+ \vert \vert \cdot \epsilon^k \to 0, \end{aligned}\]

as $ \vert \vert J(x^k)^+ \vert \vert $ is bounded and $\epsilon^k \to 0$. Both terms vanish, so $y^k \to y^\ast$.

Step 2 (stationarity): pass to the limit in $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$. The right-hand side $\to \nabla f(x^\ast) - J(x^\ast)^\top y^\ast$ by continuity and $y^k \to y^\ast$, while the left-hand side $\to 0$ since $ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k \to 0$. Hence

\[\nabla f(x^\ast) = J(x^\ast)^\top y^\ast.\]

Step 3 (feasibility): rearranging the definition of $y^k$ gives $c(x^k) = -\sigma^k y^k$. As $k \to \infty$ the right-hand side $\to 0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ is bounded), while the left-hand side $\to c(x^\ast)$ by continuity. Hence

\[c(x^\ast) = 0.\]

Steps 2 and 3 are exactly stationarity and feasibility, so $(x^\ast, y^\ast)$ is a KKT point of (eCP) with $y^\ast$ the vector of Lagrange multipliers, as required.

@exam~

**^quadratic-penalty-method-convergence-proof

Suppose we have a nonlinear equality constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]

Let $L$ be the Lagrangian function

\[L(x,y) = f(x) - y^\top c(x)\]

and let $\Phi _ \sigma$ be the quadratic penalty function

\[\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2\]

Give an interpretation of the derivative $\nabla \Phi _ \sigma$ and Hessian $\nabla^2 \Phi _ \sigma$ in terms of the Lagrangian.

penalty-derivatives-as-lagrangian

\[\begin{aligned} \nabla \Phi _ \sigma(x) &= \nabla f(x) + \frac 1 \sigma J(x)^\top c(x) \\ &= \nabla _ x L(x, y(\sigma)) \end{aligned}\]

where $y(\sigma) := -c(x)/\sigma$. Likewise

\[\begin{aligned} \nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i=1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x) \\ &= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x) \end{aligned}\]

@exam~

Suppose:

We have a nonlinear equality constrained optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ such that $c(x) = 0$
Lagrangian $L(x,y) = f(x) - y^\top c(x)$
$\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$

Recall then that

\[\begin{aligned} \nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i = 1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x)\\ &= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x) \end{aligned}\]

Describe the behaviour of this function as $\sigma \to 0$ (appealing to results you do not need to prove).

penalty-hessian-behaviour-as-sigma-zero

Generally, $c _ i(x) \to 0$ at the same rate with $\sigma$ for all $i$ and hence usually $\nabla^2 _ {xx} L(x, y(\sigma)$ is well-behaved
For the second term, $J(x)^\top J(x) / \sigma \to \infty$

Suppose:

We have a nonlinear equality constrained optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ such that $c(x) = 0$
Lagrangian $L(x,y) = f(x) - y^\top c(x)$
$\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$

Recall then that

\[\begin{aligned} \nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i = 1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x)\\ &= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x) \end{aligned}\]

@State a theorem about the behaviour of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ in the limit as $k \to \infty$ and what this means for solving nonlinear equality constrained optimisation problems.

penalty-hessian-ill-conditioning

In this context:

There are $n$ eigenvalues of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ in total (by dimensions)
$m$ of these eigenvalues are $\mathcal O(1 / \sigma^k)$ and hence tend to infinity as $k \to \infty$
The remaining $n - m$ eigenvalues are $\mathcal O(1)$ in the limit
Hence the condition number of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ is $\mathcal O(1/\sigma^k)$

This is bad news because many optimisation algorithms depend on the conditioning properties of $\nabla^2 \Phi _ {\sigma^k}$ (for example Newton’s method).

@exam~

Suppose:

Nonlinear equality-constrained problem $\min _ {x \in \mathbb R^n} f(x)$ subject to $c(x) = 0$.
Lagrangian $L(x, y) = f(x) - y^\top c(x)$.
Quadratic penalty function $\Phi _ \sigma(x) = f(x) + \tfrac{1}{2\sigma} |c(x)|^2$.

We have

\[\nabla^2 \Phi _ \sigma(x) = \nabla^2 _ {xx} L(x, y(\sigma)) + \tfrac{1}{\sigma} J(x)^\top J(x), \qquad y(\sigma) = -c(x)/\sigma.\]

The conditioning problem. The Hessian $\nabla^2 \Phi _ {\sigma^k}(x^k)$ has $n$ eigenvalues, of which:

$m$ are $\mathcal O(1/\sigma^k)$ and diverge as $\sigma^k \to 0$, coming from the $\tfrac{1}{\sigma} J^\top J$ term;
the remaining $n - m$ are $\mathcal O(1)$.

So $\kappa _ 2(\nabla^2 \Phi _ {\sigma^k}(x^k)) = \mathcal O(1/\sigma^k)$. This is bad news for algorithms relying on accurate solves of the Newton system

\[\nabla^2 \Phi _ \sigma(x) \, s = -\nabla \Phi _ \sigma(x), \quad (\ast)\]

where $s$ is the Newton direction at some current $x = x^{k,i}$, $\sigma = \sigma^{k+1}$.

@State and @justify a method for accurately determining the Newton direction $s$ by introducing an auxiliary variable that avoids these conditioning issues.

newton-direction-with-auxiliary-variable

The fix in one sentence. Don’t solve the $n \times n$ system $(\ast)$ directly. Instead, introduce an auxiliary unknown $w \in \mathbb R^m$ and solve a coupled $(n + m) \times (n + m)$ system in $(s, w)$ whose matrix is well-conditioned as $\sigma \to 0$. The same $s$ is recovered, more accurately.

Setup. Substituting the expressions for $\nabla \Phi _ \sigma$ and $\nabla^2 \Phi _ \sigma$, the Newton system $(\ast)$ reads

\[\left(\nabla^2 _ {xx} L(x, y(\sigma)) + \tfrac{1}{\sigma} J(x)^\top J(x)\right) s = -\left(\nabla f(x) + \tfrac{1}{\sigma} J(x)^\top c(x)\right). \quad (\ast\ast)\]

Both sides contain the troublesome $1/\sigma$ factors. The idea is to extract them into a new variable.

Introduce the auxiliary $w$. Define

\[w := \tfrac{1}{\sigma} \big(J(x) s + c(x)\big). \quad (\dagger)\]

By construction $(\dagger)$, we have the identity $\tfrac{1}{\sigma} J(x)^\top \big(J(x) s + c(x)\big) = J(x)^\top w$, so $(\ast\ast)$ becomes

\[\nabla^2 _ {xx} L(x, y(\sigma)) \, s + J(x)^\top w = -\nabla f(x). \quad (Eq.\ 1)\]

And rearranging the definition $(\dagger)$ gives

\[J(x) s - \sigma w = -c(x). \quad (Eq.\ 2)\]

Stacking Eq.\ 1 and Eq.\ 2 into a $(n + m) \times (n + m)$ system:

\[\begin{pmatrix} \nabla^2 _ {xx} L(x, y(\sigma)) & J(x)^\top \\ J(x) & -\sigma I \end{pmatrix} \begin{pmatrix} s \\ w \end{pmatrix} = -\begin{pmatrix} \nabla f(x) \\ c(x) \end{pmatrix}. \quad (\ddagger)\]

Why solving the augmented system gives the same $s$, with $w$ automatically taking its defining value. The system $(\ddagger)$ has $n + m$ scalar equations in $n + m$ unknowns. Eq.\ 2 is literally the defining relation $(\dagger)$ rearranged, so any solution $(s, w)$ to $(\ddagger)$ has $w$ equal to the value $\tfrac{1}{\sigma}(J(x) s + c(x))$ given by its definition — there is no “different value of $w$” possible. And substituting that defining value of $w$ back into Eq.\ 1 yields exactly $(\ast\ast)$, the original Newton equation for $s$.

So the augmented system is mathematically equivalent to the original Newton system. The point is purely numerical: solving $(\ddagger)$ behaves better than solving $(\ast)$ when $\sigma$ is small.

Why the augmented system has good conditioning. The original Hessian gets its $\mathcal O(1/\sigma)$ condition number from the $\tfrac{1}{\sigma} J^\top J$ block. In the augmented form, $\sigma$ enters only through the $-\sigma I$ block, which vanishes (rather than diverges) as $\sigma \to 0$. As $\sigma \to 0$, the matrix tends to

\[\begin{pmatrix} \nabla^2 _ {xx} L(x, y^\ast) & J(x)^\top \\ J(x) & 0 \end{pmatrix},\]

the KKT/SQP saddle-point matrix at the limit. This is well-conditioned provided $J(x)$ has full row rank and $\nabla^2 _ {xx} L$ is positive definite on $\ker J(x)$ — the standard non-degeneracy assumptions at a regular constrained minimiser.

Interpretation of $w$. From $w = \tfrac{1}{\sigma}(J(x) s + c(x))$ and $y(\sigma) = -c(x)/\sigma$, after one linearised step $s$ the predicted constraint value is $c(x) + J(x) s$, so $-w = -\tfrac{1}{\sigma}(c(x) + J(x) s)$ is the multiplier estimate at the next iterate. Unlike $y(\sigma) = -c(x)/\sigma$, which behaves badly at intermediate iterates where $c(x) \ne 0$ and $\sigma \to 0$, the quantity $w$ stays bounded (and converges to $y^\ast$) because $c(x) + J(x) s$ is the first-order prediction of $c(x _ {\text{next}})$, which vanishes at the same rate as $\sigma$.

Caveat. The cleanly-computed $s$ is necessary but not sufficient for a good step. The underlying $\Phi _ \sigma$ remains poorly scaled: its level sets are elongated with $\mathcal O(1/\sigma)$ aspect ratio. Consequences:

Linesearch: a Euclidean step $\alpha s$ that looks fine on the quadratic model can exit the region where the quadratic is actually accurate, overshooting in high-curvature directions.
Trust region: a Euclidean constraint $|s| \le \Delta$ inherits the same scaling problem — it allows large moves in low-curvature directions while clamping high-curvature ones equally.
Fix: use a $B$-norm trust region $|s| _ B \le \Delta$, where $B$ encodes the ill-conditioned curvature. The trust region then tracks the elongated level sets, encouraging roughly equal model decrease in all directions.

Consider the general constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c _ E(x) = 0, c _ I(x) \ge 0\]

@Define what it means for $\Phi(x, \sigma)$ to be an exact penalty function.

exact-penalty-function

There exists $\sigma _ \ast > 0$ such that if $\sigma < \sigma _ \ast$, any local solution of the constrained optimisation problem is a local minimiser of $\Phi(x, \sigma)$.

Consider the general constrained optimisation problem

\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c _ E(x) = 0, c _ I(x) \ge 0\]

Recall that $\Phi(x, \sigma)$ is an ∆exact-penalty-function if there exists $\sigma _ \ast > 0$ such that if $\sigma < \sigma _ \ast$, any local solution of the constrained optimisation problem is a local minimiser of $\Phi(x, \sigma)$. @State the $l _ 1$ and $l _ 2$ exact penalty functions.

exact-penalty-functions-l1-l2

Let $[z]^- := \max(-z, 0)$ denote the negative-part of $z$: zero when $z \ge 0$ and $-z > 0$ when $z < 0$. Write $[c _ I(x)]^- := ([c _ i(x)]^-) _ {i \in I}$ for the entrywise negative-part of the inequality block. Then both penalties take the unified form $\Phi(x, \sigma) = f(x) + (1/\sigma) |(c _ E(x),\ [c _ I(x)]^-)| _ p$ with $p = 1, 2$.

$l _ 2$ penalty function ($p = 2$):

\[\Phi(x, \sigma) = f(x) + \frac 1 \sigma \big\|\big(c _ E(x),\ [c _ I(x)]^-\big)\big\| _ 2 = f(x) + \frac 1 \sigma \sqrt{\|c _ E(x)\|^2 + \sum _ {i \in I} ([c _ i(x)]^-)^2}.\]

$l _ 1$ penalty function ($p = 1$):

\[\Phi(x, \sigma) = f(x) + \frac 1 \sigma \sum _ {i \in E} \vert c _ i(x) \vert + \frac 1 \sigma \sum _ {i \in I} [c _ i(x)]^-.\]

Both kink at ${c _ E = 0}$ (the unsquared norm / absolute value) and at every active or violated inequality (the negative-part), which is the price of exactness — see ∆bite-exact-penalty-nonsmooth-consequence.

Note on the lecture: Cartis’s Lecture 12 “Other penalty functions” slide states the $\ell _ 2$ penalty only in the equality-only form $f + (1/\sigma)|c _ E|$; the mixed-constraint form above is the standard Fletcher exact penalty and is what makes the $\ell _ 1$/$\ell _ 2$ pair sit on equal footing.

Worked examples

For

\[\min _ {x \in \mathbb R^2} x _ 1^2 + x _ 2^2 \quad \text{s.t.} \quad x _ 1 + x _ 2^2 = \alpha, \qquad \alpha \ge 0,\]

write the quadratic penalty $\Phi _ \sigma$, find its stationary points $x(\sigma)$, say when each is a local minimiser, take $\sigma \to 0$ and connect to the constrained solution, and show the Hessian’s condition number grows unbounded as $\sigma \to 0$.

penalty-method-worked-example-full-cycle

Step 1 (penalty function): with $c(x) = x _ 1 + x _ 2^2 - \alpha$,

\[\Phi _ \sigma(x) = x _ 1^2 + x _ 2^2 + \frac{1}{2\sigma}(x _ 1 + x _ 2^2 - \alpha)^2.\]

Watch: the coefficient is $\tfrac{1}{2\sigma}$, not $\tfrac1\sigma$ (∆quadratic-penalty-function-definition). The factor of $2$ propagates into every formula below.

Step 2 (stationary points): using $\nabla _ x[\tfrac{1}{2\sigma}c^2] = \tfrac1\sigma c\,\nabla c$,

\[\nabla\Phi _ \sigma = \begin{pmatrix} 2x _ 1 + \tfrac1\sigma c \\ 2x _ 2\left(1 + \tfrac1\sigma c\right) \end{pmatrix} = 0.\]

From the first equation $\tfrac1\sigma c = -2x _ 1$, so the second becomes $2x _ 2(1 - 2x _ 1) = 0$: either $x _ 2 = 0$ or $x _ 1 = \tfrac12$.

$x _ 2 = 0$: $2x _ 1 + \tfrac1\sigma(x _ 1 - \alpha) = 0$ gives $x(\sigma) = \left(\dfrac{\alpha}{1 + 2\sigma},\ 0\right)$.
$x _ 1 = \tfrac12$: $\tfrac1\sigma c = -1$ gives $x _ 2^2 = \alpha - \tfrac12 - \sigma$, real only for $\alpha > \tfrac12$ and $\sigma \le \alpha - \tfrac12$: $\hat x(\sigma) = \left(\tfrac12,\ \pm\sqrt{\alpha - \tfrac12 - \sigma}\right)$.

Step 3 (which are local minima): at the $x _ 2 = 0$ branch,

\[\nabla^2 _ {xx}\Phi _ \sigma(x(\sigma)) = \operatorname{diag}\left(2 + \tfrac1\sigma,\ \frac{2(1 - 2\alpha + 2\sigma)}{1 + 2\sigma}\right),\]

which is positive definite iff $1 - 2\alpha + 2\sigma > 0$ (always true when $\alpha \le \tfrac12$). The branch $\hat x(\sigma)$ has positive determinant for $\sigma < \alpha - \tfrac12$, so it too is a local minimiser there.

Step 4 ($\sigma \to 0$): $x(\sigma) \to (\alpha, 0)$ and $\hat x(\sigma) \to \left(\tfrac12, \pm\sqrt{\alpha - \tfrac12}\right)$, exactly the KKT points of the constrained problem (∆kkt-equality-worked-example-parabola-constraint).

Step 5 (ill-conditioning): along the $x _ 2 = 0$ branch the top eigenvalue $2 + \tfrac1\sigma \to \infty$ while the bottom $\to 2(1 - 2\alpha)$ stays finite, so $\operatorname{cond}\nabla^2 _ {xx}\Phi _ \sigma = O(1/\sigma) \to \infty$ as $\sigma \to 0$ (∆penalty-hessian-ill-conditioning).

Source: Continuous Optimisation 2016, Q3(b); method per Lecture 12 (quadratic penalty, ill-conditioning).

@example~ @exam~

Bite-sized

In the quadratic penalty $\Phi _ \sigma(x) = f(x) + \tfrac{1}{2\sigma} \ \vert c(x)\ \vert ^2$, small $\sigma$ enforces feasibility: the penalty term $\ \vert c(x)\ \vert ^2 / (2\sigma)$ becomes very steep in the $c \ne 0$ direction, so minimisers $x(\sigma)$ are pushed toward the constraint manifold ${c = 0}$ as $\sigma \to 0$.

Source: Lecture 12, A penalty function for (eCP) slide.

@bite~

bite-penalty-sigma-direction

For the quadratic penalty $\Phi _ \sigma$ on (eCP), the natural Lagrange multiplier estimate at iterate $x^k$ is $y^k = -c(x^k)/\sigma^k$ (note the sign). This is what makes $\nabla \Phi _ {\sigma^k}(x^k) = \nabla _ x \mathcal L(x^k, y^k)$, identifying penalty-minimisation with Lagrangian stationarity.

Source: Lecture 12, proof of Theorem 21 (penalty convergence).

@bite~

bite-penalty-multiplier-estimate

Why use exact penalty functions at all, what makes the $\ell _ 1$ and $\ell _ 2$ versions non-smooth, and what’s the consequence?

bite-exact-penalty-nonsmooth-consequence

Why use exact penalties at all: the quadratic penalty $\Phi _ \sigma = f + \tfrac{1}{2\sigma}|c|^2$ only converges to a KKT point of the original constrained problem in the limit $\sigma \to 0$ (∆quadratic-penalty-method-convergence), and this drives the Hessian’s condition number to $O(1/\sigma) \to \infty$ (∆penalty-hessian-ill-conditioning, ∆bite-penalty-hessian-condition-number). An exact penalty $\Phi(x, \sigma)$ instead recovers any local solution of the constrained problem as a local minimiser of $\Phi(\cdot, \sigma)$ for all $\sigma < \sigma_\ast$ at some finite $\sigma_\ast > 0$ (∆exact-penalty-function) — so there’s no need to drive $\sigma \to 0$, and no ill-conditioning. The $\ell _ 1$ and $\ell _ 2$ constructions are the standard exact penalties (∆exact-penalty-functions-l1-l2).

What makes them non-smooth:

The $\ell _ 2$ penalty $\Phi(x, \sigma) = f(x) + (1/\sigma) |c _ E(x)|$ has a kink at ${c _ E(x) = 0}$ because $|\cdot|$ (un-squared) is not differentiable at $0$.
The $\ell _ 1$ penalty $\Phi(x, \sigma) = f(x) + (1/\sigma) \sum \vert c _ i(x) \vert + (1/\sigma) \sum [c _ i(x)]^-$ has kinks at every active or violated constraint.

Consequence: standard smooth-optimisation algorithms (Newton, BFGS, trust region with quadratic models) don’t apply directly — minimisation of exact penalties typically requires bundle methods, smoothing, or specialised SQP-style algorithms. So the tradeoff is smooth-but-inexact (quadratic, $\sigma \to 0$, ill-conditioning) vs nonsmooth-but-exact ($\ell _ 1, \ell _ 2$, finite $\sigma$, well-conditioned but the inner minimisation is harder).

Source: Lecture 12, Other penalty functions slide.

@bite~

The Hessian $\nabla^2 \Phi _ {\sigma^k}(x^k)$ of the quadratic penalty has condition number $\kappa \sim O(1/\sigma^k)$ as $\sigma^k \to 0$, with $m$ eigenvalues blowing up to $\infty$ (from the $J^\top J / \sigma$ term) and $n - m$ remaining $O(1)$.

Source: Lecture 12, Ill-conditioning of the penalty’s Hessian slide.

@bite~

bite-penalty-hessian-condition-number

Give the proof strategy for ∆quadratic-penalty-method-convergence-proof (Theorem 21: under LICQ at $x^\ast$, the quadratic penalty method converges to a KKT point and $y^k \to y^\ast$).

bite-penalty-method-convergence-strategy

Setup: from $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$ with $y^k := -c(x^k)/\sigma^k$, the approximate-stationarity hypothesis becomes $|\nabla f(x^k) - J(x^k)^\top y^k| \le \epsilon^k$.
LICQ ⟹ $J(x^\ast)$ has full row rank ⟹ pseudo-inverse $J(x^\ast)^+ = (JJ^\top)^{-1} J$ exists; by continuity it persists for nearby $x^k$.
Step 1 (multipliers converge): define candidate $y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$. Triangle inequality: $|y^k - y^\ast| \le |J(x^k)^+ \nabla f(x^k) - J(x^\ast)^+ \nabla f(x^\ast)| + |J(x^k)^+ \nabla f(x^k) - y^k|$. First piece → 0 by continuity. Second piece: use $J^+ J^\top = I _ m$ to rewrite $|J(x^k)^+ \nabla f(x^k) - y^k| = |J(x^k)^+ \nabla \Phi _ {\sigma^k}(x^k)| \le |J(x^k)^+| \cdot \epsilon^k \to 0$.
Step 2 (stationarity): pass to the limit in $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$. LHS → 0; RHS → $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast$. Hence $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$.
Step 3 (feasibility): from $y^k = -c(x^k)/\sigma^k$, rearrange to $c(x^k) = -\sigma^k y^k$. Pass to limit: $c(x^\ast) \leftarrow c(x^k) = -\sigma^k y^k \to 0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ bounded). So $c(x^\ast) = 0$.

Together, $(x^\ast, y^\ast)$ satisfies the equality-case KKT conditions.

Source: Lecture 12, Proof of Theorem 21.

@bite~ @proofsupport~

The feasibility identity at the heart of Step 3 of ∆quadratic-penalty-method-convergence-proof. From the definition of the multiplier estimate $y^k := -c(x^k)/\sigma^k$:

\[c(x^k) = <span class="cloze" tabindex="0">-\sigma^k y^k</span>.\]

Passing to the limit, LHS → $c(x^\ast)$ by continuity, RHS → $0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ stays bounded). Hence $c(x^\ast) = 0$ — primal feasibility of the limit point.

Source: Lecture 12, Proof of Theorem 21, feasibility step.

@bite~ @proofsupport~

bite-penalty-method-convergence-key-inequality-feasibility-identity

In ∆quadratic-penalty-method-convergence-proof (Theorem 21), where does the hypothesis $\sigma^k \to 0$ enter, and what fails if $\sigma^k$ stays bounded away from zero?

bite-penalty-method-convergence-hypothesis-sigma-to-zero

$\sigma^k \to 0$ enters in the feasibility step: from the rearrangement $c(x^k) = -\sigma^k y^k$, only $\sigma^k \to 0$ (combined with bounded $y^k$) forces $c(x^k) \to 0$, hence $c(x^\ast) = 0$.
If $\sigma^k$ stays bounded away from zero: the multiplier estimates $y^k = -c(x^k)/\sigma^k$ can still converge (Step 1), and Lagrangian stationarity (Step 2) can still hold, but there’s no reason $c(x^\ast)$ should equal zero. The penalty method can therefore converge to a critical point of $\Phi _ \sigma$ that is not a KKT point of the constrained problem — it minimises a tradeoff $f + (1/(2\sigma))|c|^2$ rather than $f$ on the feasible set.
Contrast with the augmented Lagrangian method (∆aug-lagrangian-global-convergence), which can converge to a KKT point even without $\sigma^k \to 0$, because the multiplier update absorbs the role $\sigma^k \to 0$ plays here.

Source: Lecture 12, Theorem 21 hypotheses and proof; cf. Augmented Lagrangian advantage noted in Lecture 13.

@bite~ @proofsupport~

Flashcards

Worked examples

Bite-sized

Related posts