Continuous Optimisation HT26, Penalty methods
Flashcards
penalty-method-motivationSuppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
where $f : \mathbb R^n \to \mathbb R$ and $c = (c _ 1, \ldots, c _ m) : \mathbb R^n \to \mathbb R^m$ are smooth. What is the idea and motivation behind penalty methods for such problems?
Finding local solutions subject to nonlinear constraints directly is difficult. Instead, we replace the constrained problem by a single, parameterised, unconstrained objective, chosen so that its minimisers approach solutions of the original problem as the parameter varies (here, as the penalty parameter $\sigma \to 0$).
quadratic-penalty-function-definitionSuppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
@State the quadratic penalty function given a penalty parameter $\sigma > 0$.
\[\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2\]
quadratic-penalty-methodSuppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
@State an @algorithm for a quadratic penalty method.
Suppose we have $\sigma^0 > 0$. Let $k = 0$. Until convergence:
- Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
- Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}}(x) = f(x) + \frac{1}{2\sigma^{k+1}} \vert \vert c(x) \vert \vert ^2$
- Let $k := k+1$
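As a concrete illustration, here is a minimal numerical sketch of this loop on a toy instance of my own choosing (minimise $x _ 1 + x _ 2$ subject to $x _ 1^2 + x _ 2^2 = 2$, with solution $x^\ast = (-1,-1)$ and multiplier $y^\ast = -1/2$); the inner solver is a plain Newton iteration, which is one possible choice and not part of the algorithm statement above.

```python
import numpy as np

# Toy instance (illustrative, not from the notes):
#   min  x1 + x2   s.t.   c(x) = x1^2 + x2^2 - 2 = 0,
# with solution x* = (-1, -1) and Lagrange multiplier y* = -1/2.

def grad_phi(x, sigma):
    # gradient of Phi_sigma(x) = f(x) + ||c(x)||^2 / (2 sigma)
    c = x @ x - 2.0
    return np.array([1.0, 1.0]) + (c / sigma) * 2.0 * x

def hess_phi(x, sigma):
    # Hessian of Phi_sigma (f is linear, so only penalty terms appear)
    c = x @ x - 2.0
    return (4.0 * np.outer(x, x) + 2.0 * c * np.eye(2)) / sigma

def quadratic_penalty_method(x0, sigma0=1.0, shrink=0.1, n_outer=6):
    x, sigma = np.asarray(x0, dtype=float).copy(), sigma0
    for _ in range(n_outer):
        # inner loop: Newton's method on Phi_sigma, warm-started at the
        # previous outer iterate
        for _ in range(100):
            g = grad_phi(x, sigma)
            if np.linalg.norm(g) < 1e-10:
                break
            x = x - np.linalg.solve(hess_phi(x, sigma), g)
        y = -(x @ x - 2.0) / sigma   # multiplier estimate y^k = -c(x^k)/sigma^k
        sigma *= shrink              # 0 < sigma^{k+1} < sigma^k, sigma^k -> 0
    return x, y

x_star, y_star = quadratic_penalty_method([-1.5, -1.5])
```

Note that each iterate $x(\sigma)$ is slightly infeasible for every $\sigma > 0$; the method approaches the constrained minimiser (and the multiplier estimate approaches $y^\ast$) only as $\sigma \to 0$.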
quadratic-penalty-method-convergenceSuppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
and we solve via the quadratic penalty method where given $\sigma^0 > 0$ and $k = 0$, until convergence:
- Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
- Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}}(x) = f(x) + \frac{1}{2\sigma^{k+1}} \vert \vert c(x) \vert \vert ^2$
- Let $k := k+1$
@State a theorem about the global convergence of this method.
Suppose:
- $f, c \in \mathcal C^1$
- $y _ i^k = -c _ i(x^k)/\sigma^k$ for $i = 1, \ldots, m$
- $ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k$ where $\epsilon^k \to 0$ as $k \to \infty$
- $\sigma^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$ for each $i = 1, \ldots, m$ are linearly independent
Then:
- $x^\ast$ is a KKT point of the optimisation problem
- $y^k \to y^\ast$ where $y^\ast$ is the vector of Lagrange multipliers of the constraints
Remarks on the Jacobian rank hypothesis:
- $\nabla c _ i(x^\ast)$ being linearly independent for $i = 1, \ldots, m$ is equivalent to $J(x^\ast) \in \mathbb R^{m \times n}$ having full row-rank, which forces $m \le n$ (at most as many equality constraints as variables).
- If $J(x^\ast)$ is rank-deficient, the theorem can fail: the multiplier estimates may diverge ($\vert \vert y^k \vert \vert \to \infty$), and $x^\ast$ instead locally minimises the infeasibility $\vert \vert c(x) \vert \vert$; it is a point of minimal constraint violation, not necessarily a KKT point of the original problem.
- In the corresponding ∆quadratic-penalty-method-convergence-proof, full row rank is what makes the pseudo-inverse $J(x^\ast)^+ = (J J^\top)^{-1} J$ well-defined.
quadratic-penalty-method-convergence-proofSuppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
and we solve via the quadratic penalty method where given $\sigma^0 > 0$ and $k = 0$, until convergence:
- Choose $0 < \sigma^{k+1} < \sigma^k$ (so that $\sigma^k \to 0$ as $k \to \infty$)
- Starting from $x _ 0^k$, use an unconstrained minimisation algorithm to find an approximate minimiser $x^{k+1}$ of $\Phi _ {\sigma^{k+1}}(x) = f(x) + \frac{1}{2\sigma^{k+1}} \vert \vert c(x) \vert \vert ^2$
- Let $k := k+1$
@Prove that if
- $f, c \in \mathcal C^1$
- $y _ i^k = -c _ i(x^k)/\sigma^k$ for $i = 1, \ldots, m$
- $ \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k$ where $\epsilon^k \to 0$ as $k \to \infty$
- $\sigma^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast)$ for each $i = 1, \ldots, m$ are linearly independent
then:
- $x^\ast$ is a KKT point of the optimisation problem
- $y^k \to y^\ast$ where $y^\ast$ is the vector of Lagrange multipliers of the constraints
By the ∆kkt-conditions, it suffices to show that:
- $c(x^\ast) = 0$
- $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$ for some $y^\ast \in \mathbb R^m$
We have that
\[\begin{aligned} \nabla \Phi _ {\sigma^k}(x^k) &= \nabla f(x^k) + \frac{1}{\sigma^k} J(x^k)^\top c(x^k) \\ &= \nabla f(x^k) - J(x^k)^\top y^k \end{aligned}\]where $y^k = -c(x^k)/\sigma^k \in \mathbb R^m$. Note also that if
\[\overline y = \arg \min _ {y \in \mathbb R^m} \vert \vert \nabla f(x) - J(x)^\top y \vert \vert ^2\]then $\overline y$ is the solution of an overdetermined linear least squares system. Thus from the normal equations, we have that $\overline y$ satisfies
\[J(x) J(x)^\top \overline y = J(x) \nabla f(x)\]Or, equivalently
\[\overline y = J(x)^+ \nabla f(x)\]where $J(x)^+ = (J(x) J(x)^\top)^{-1} J(x)$ is the pseudo-inverse, if $J(x)$ is full row rank (so that $J(x) J(x)^\top$ is nonsingular). Here, the assumed linear independence of the $\nabla c _ i(x^\ast)$ (the ∆licq) implies that $J(x^\ast)$ is full row rank and so $J(x^\ast)^+$ exists. Since $x^k \to x^\ast$ and $J$ is continuous, $J(x^k)^+$ exists, is continuous, and is bounded above in norm for all sufficiently large $k$.
We first show that $y^k \to y^\ast$ as $k \to \infty$, where $y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$ is the solution of the normal equations with $x = x^\ast$. We evaluate
\[\begin{aligned} \vert \vert y^k - y^\ast \vert \vert \le \vert \vert J(x^k)^+ \nabla f(x^k) - J(x^\ast)^+ \nabla f(x^\ast) \vert \vert + \vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert \end{aligned}\]As $x^k \to x^\ast$, $J^+$ and $\nabla f$ continuous, then $J(x^k)^+ \to J(x^\ast)^+$ and $\nabla f(x^k) \to \nabla f(x^\ast)$. Thus $ \vert \vert J(x^k)^+ \nabla f(x^k) - J(x^\ast)^+ \nabla f(x^\ast) \vert \vert \to 0$.
For the second term, note that since $J$ is $m \times n$ with full row rank, $J^+ J^\top = (JJ^\top)^{-1}(JJ^\top) = I _ m$. Combined with $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$, we have
\[\begin{aligned} \vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert &= \vert \vert J(x^k)^+ \nabla f(x^k) - J(x^k)^+ J(x^k)^\top y^k \vert \vert \\ &= \vert \vert J(x^k)^+ (\nabla f(x^k) - J(x^k)^\top y^k) \vert \vert \\ &= \vert \vert J(x^k)^+ \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \\ &\le \vert \vert J(x^k)^+ \vert \vert \cdot \vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \\ &\le C\epsilon^k \end{aligned}\]for some constant $C$, and hence as $\epsilon^k \to 0$, it follows that $y^k \to y^\ast$.
It remains to show that the limit point $x^\ast$ of ${x^k}$ is a KKT point of the problem. Passing to the limit in
\[\begin{aligned} \nabla \Phi _ {\sigma^k}(x^k) &= \nabla f(x^k) + \frac{1}{\sigma^k} J(x^k)^\top c(x^k) \\ &= \nabla f(x^k) - J(x^k)^\top y^k \end{aligned}\]where $y^k = -c(x^k)/\sigma^k \in \mathbb R^m$ and using $x^k \to x^\ast$, $y^k \to y^\ast$ and the continuity of $\nabla f$ and $J$, it follows that
\[\nabla \Phi _ {\sigma^k}(x^k) \to \nabla f(x^\ast) - J(x^\ast)^\top y^\ast\]Since $\vert \vert \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \epsilon^k \to 0$, it follows that $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast = 0$.
The definition of $y^k$ gives
\[c(x^k) = -\sigma^k y^k\]Since $\sigma^k \to 0$ and $y^k \to y^\ast$ is bounded, the right-hand side tends to $0$. By continuity, the left-hand side tends to $c(x^\ast)$, and hence $c(x^\ast) = 0$.
Thus $x^\ast$ is a KKT point, as required.
penalty-derivatives-as-lagrangianSuppose we have a nonlinear equality constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{ s.t. }\quad c(x) = 0\]
Let $L$ be the Lagrangian function
\[L(x,y) = f(x) - y^\top c(x)\]
and let $\Phi _ \sigma$ be the quadratic penalty function
\[\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2\]
Give an interpretation of the derivative $\nabla \Phi _ \sigma$ and Hessian $\nabla^2 \Phi _ \sigma$ in terms of the Lagrangian.
\[\begin{aligned} \nabla \Phi _ \sigma(x) &= \nabla f(x) + \frac 1 \sigma J(x)^\top c(x) \\ &= \nabla f(x) - J(x)^\top y(\sigma) = \nabla _ x L(x, y(\sigma)) \end{aligned}\]where $y(\sigma) := -c(x)/\sigma$. Likewise
\[\begin{aligned} \nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i=1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x) \\ &= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x) \end{aligned}\]penalty-hessian-behaviour-as-sigma-zeroSuppose:
- We have a nonlinear equality constrained optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ such that $c(x) = 0$
- Lagrangian $L(x,y) = f(x) - y^\top c(x)$
- $\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Recall then that
\[\begin{aligned}
\nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i = 1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x)\\
&= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x)
\end{aligned}\]
Describe the behaviour of this function as $\sigma \to 0$ (appealing to results you do not need to prove).
- Generally, $c _ i(x) \to 0$ at the same rate as $\sigma$ for all $i$, so the multiplier estimates $y(\sigma) = -c(x)/\sigma$ stay bounded and $\nabla^2 _ {xx} L(x, y(\sigma))$ is usually well-behaved
- For the second term, the $m$ nonzero eigenvalues of $J(x)^\top J(x) / \sigma$ blow up like $1/\sigma$, so $\nabla^2 \Phi _ \sigma(x)$ becomes increasingly ill-conditioned
penalty-hessian-ill-conditioningSuppose:
- We have a nonlinear equality constrained optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ such that $c(x) = 0$
- Lagrangian $L(x,y) = f(x) - y^\top c(x)$
- $\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Recall then that
\[\begin{aligned}
\nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i = 1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x)\\
&= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x)
\end{aligned}\]
@State a theorem about the behaviour of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ in the limit as $k \to \infty$ and what this means for solving nonlinear equality constrained optimisation problems.
In this context:
- There are $n$ eigenvalues of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ in total (by dimensions)
- $m$ of these eigenvalues are $\mathcal O(1 / \sigma^k)$ and hence tend to infinity as $k \to \infty$
- The remaining $n - m$ eigenvalues are $\mathcal O(1)$ in the limit
- Hence the condition number of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ is $\mathcal O(1/\sigma^k)$
This is bad news because many optimisation algorithms depend on the conditioning properties of $\nabla^2 \Phi _ {\sigma^k}$ (for example Newton’s method).
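A quick numerical check of this eigenvalue split, on an assumed toy instance of my own (minimise $x _ 1 + x _ 2$ subject to $x _ 1^2 + x _ 2^2 = 2$, so $n = 2$, $m = 1$), evaluating the penalty Hessian near its minimiser $x(\sigma) \approx -(1 + \sigma/8)(1, 1)$:

```python
import numpy as np

# Toy instance (my own): f(x) = x1 + x2, c(x) = x1^2 + x2^2 - 2, so
# grad^2 f = 0, J(x) = 2 x^T, and
#   grad^2 Phi_sigma(x) = (2 c(x) I + 4 x x^T) / sigma.

def hess_phi(x, sigma):
    c = x @ x - 2.0
    return (2.0 * c * np.eye(2) + 4.0 * np.outer(x, x)) / sigma

conds = []
for sigma in [1e-2, 1e-4, 1e-6]:
    x = -(1.0 + sigma / 8.0) * np.array([1.0, 1.0])  # approx minimiser x(sigma)
    ev = np.linalg.eigvalsh(hess_phi(x, sigma))       # ascending eigenvalues
    # here m = 1 eigenvalue grows like 8/sigma while n - m = 1 stays O(1),
    # so the condition number scales like 1/sigma
    conds.append(ev[-1] / ev[0])
```

On this instance the small eigenvalue stays near $1$ (the tangential direction of the constraint) while the large one tracks $8/\sigma$ (the normal direction), matching the $m$ versus $n - m$ split above.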
newton-direction-with-auxiliary-variableSuppose:
- We have a nonlinear equality constrained optimisation problem $\min _ {x \in \mathbb R^n} f(x)$ such that $c(x) = 0$
- Lagrangian $L(x,y) = f(x) - y^\top c(x)$
- $\Phi _ \sigma(x) = f(x) + \frac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$
Recall then that
\[\begin{aligned}
\nabla^2 \Phi _ \sigma(x) &= \nabla^2 f(x) + \frac 1 \sigma \sum^m _ {i = 1} c _ i(x) \nabla^2 c _ i(x) + \frac 1 \sigma J(x)^\top J(x)\\
&= \nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x)
\end{aligned}\]
One issue in this context is that
- There are $n$ eigenvalues of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ in total (by dimensions)
- $m$ of these eigenvalues are $\mathcal O(1 / \sigma^k)$ and hence tend to infinity as $k \to \infty$
- The remaining $n - m$ eigenvalues are $\mathcal O(1)$ in the limit
- Hence the condition number of $\nabla^2 \Phi _ {\sigma^k}(x^k)$ is $\mathcal O(1/\sigma^k)$
This is bad news because many optimisation algorithms depend on the conditioning properties of $\nabla^2 \Phi _ {\sigma^k}$, for example Newton’s method where we have to solve the system
\[\nabla^2 \Phi _ \sigma(x) \text dx = -\nabla \Phi _ \sigma(x) \tag{$\ast$}\]
for $\text{d}x$ for some current $x = x^{k,i}$ and $\sigma = \sigma^{k+1}$.
@State and @justify a method for accurately determining this Newton direction $\text dx$ by introducing an auxiliary variable, and which avoids these conditioning issues.
Rewriting $(\ast)$ using the expressions we have determined for the derivatives, we obtain
\[\left(\nabla^2 _ {xx} L(x, y(\sigma)) + \frac 1 \sigma J(x)^\top J(x)\right)\text dx = -\left(\nabla f(x) + \frac{1}{\sigma} J(x)^\top c(x)\right)\]where $y(\sigma) = -c(x)/\sigma$. Define
\[w := \frac 1 \sigma \left( J(x) \text dx + c(x) \right)\]so that
\[J(x) \text dx - \sigma w = -c(x)\]and
\[\nabla^2 _ {xx} L(x, y(\sigma)) \text dx + J(x)^\top w = -\nabla f(x)\]and the Newton system can be rewritten as
\[\begin{pmatrix} \nabla^2 _ {xx} L(x, y(\sigma)) & J(x)^\top \\ J(x) & -\sigma I \end{pmatrix} \begin{pmatrix} \text dx \\ w \end{pmatrix} = -\begin{pmatrix} \nabla f(x) \\ c(x) \end{pmatrix}\]
Why this resolves the conditioning issue: the LHS matrix is essentially independent of $\sigma$ for small $\sigma$ (only the $-\sigma I$ block depends on $\sigma$, and it vanishes rather than diverges). So the Newton direction $\text dx$ can be computed accurately as $\sigma \to 0$, despite the original Hessian $\nabla^2 \Phi _ \sigma$ having condition number $O(1/\sigma)$.
But you still need to be careful when using $\text dx$. The underlying $\Phi _ \sigma$ remains poorly scaled, since its level sets become extremely elongated as $\sigma \to 0$ (eigenvalue ratio $O(1/\sigma)$). So:
- Linesearch: a Euclidean step $\alpha \text dx$ that looks fine on the quadratic model can exit the region where the quadratic is actually accurate, overshooting in high-curvature directions.
- Trust region: a Euclidean constraint $\vert \vert \text dx \vert \vert \le \Delta$ inherits the same scaling problem; it allows large moves in low-curvature directions while clamping high-curvature ones equally.
- Fix: use a $B$-norm trust region $\vert \vert \text dx \vert \vert _ B \le \Delta$, where $B$ encodes the ill-conditioned curvature. The trust region then tracks the elongated level sets, encouraging roughly equal model decrease in all directions.
The cleanly-computed $\text dx$ is necessary but not sufficient, since the step length / step shape still needs scaling-aware safeguards.
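A small numerical sketch of the two formulations, on an assumed toy instance of my own (minimise $x _ 1 + x _ 2$ subject to $x _ 1^2 + x _ 2^2 = 2$; here $\nabla^2 f = 0$ and $\nabla^2 c = 2I$, so $\nabla^2 _ {xx} L(x, y(\sigma)) = -2 y(\sigma) I$). Both solves return the same Newton direction, but only the augmented matrix stays well-conditioned as $\sigma \to 0$:

```python
import numpy as np

def newton_direct(x, sigma):
    # solve  grad^2 Phi_sigma(x) dx = -grad Phi_sigma(x)  directly
    c = x @ x - 2.0
    H = (2.0 * c * np.eye(2) + 4.0 * np.outer(x, x)) / sigma
    g = np.array([1.0, 1.0]) + (2.0 * c / sigma) * x
    return np.linalg.solve(H, -g), np.linalg.cond(H)

def newton_augmented(x, sigma):
    # equivalent (dx, w) system; only the -sigma I block depends on sigma
    c = x @ x - 2.0
    y = -c / sigma                       # y(sigma) = -c(x)/sigma
    J = 2.0 * x.reshape(1, 2)            # Jacobian of c
    HL = -2.0 * y * np.eye(2)            # grad^2_xx L(x, y(sigma))
    K = np.block([[HL, J.T], [J, -sigma * np.eye(1)]])
    sol = np.linalg.solve(K, -np.concatenate([[1.0, 1.0], [c]]))
    return sol[:2], np.linalg.cond(K)

# both formulations give the same direction at a generic point
dx_d, _ = newton_direct(np.array([-1.2, -0.9]), 1e-2)
dx_a, _ = newton_augmented(np.array([-1.2, -0.9]), 1e-2)

# near the penalised minimiser with tiny sigma, only the augmented
# matrix remains well-conditioned
sigma = 1e-8
x = -(1.0 + sigma / 8.0) * np.array([1.0, 1.0])
_, cond_H = newton_direct(x, sigma)
_, cond_K = newton_augmented(x, sigma)
```

Eliminating $w$ from the augmented system recovers exactly the direct system, which is why the two directions agree; the payoff is purely numerical.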
exact-penalty-functionConsider the general constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c _ E(x) = 0, c _ I(x) \ge 0\]
@Define what it means for $\Phi(x, \sigma)$ to be an exact penalty function.
There exists $\sigma _ \ast > 0$ such that if $\sigma < \sigma _ \ast$, any local solution of the constrained optimisation problem is a local minimiser of $\Phi(x, \sigma)$.
exact-penalty-functions-l1-l2Consider the general constrained optimisation problem
\[\min _ {x \in \mathbb R^n} f(x) \quad \text{s.t.} \quad c _ E(x) = 0, c _ I(x) \ge 0\]
Recall that $\Phi(x, \sigma)$ is an ∆exact-penalty-function if there exists $\sigma _ \ast > 0$ such that if $\sigma < \sigma _ \ast$, any local solution of the constrained optimisation problem is a local minimiser of $\Phi(x, \sigma)$. @State the $l _ 1$ and $l _ 2$ exact penalty functions.
$l _ 2$ penalty function:
\[\Phi(x, \sigma) = f(x) + \frac 1 \sigma \vert \vert c _ E(x) \vert \vert\]$l _ 1$ penalty function:
\[\Phi(x, \sigma) = f(x) + \frac 1 \sigma \sum _ {i \in E} \vert c _ i(x) \vert + \frac 1 \sigma \sum _ {i \in I} \max(-c _ i(x), 0)\]where the inequality penalty $\max(-c _ i(x), 0) = [c _ i(x)]^-$ with $[z]^- := \max(-z, 0)$ is the negative-part of $c _ i(x)$: zero when $c _ i(x) \ge 0$ (constraint satisfied) and $-c _ i(x) > 0$ when $c _ i(x) < 0$ (constraint violated).
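To see exactness concretely, here is a 1-D sketch on an assumed toy problem of my own: minimise $x$ subject to $x - 1 \ge 0$, with solution $x^\ast = 1$. For any $\sigma < \sigma _ \ast = 1$ the $\ell _ 1$-penalised objective has slope $1 - 1/\sigma < 0$ left of the kink and $1 > 0$ right of it, so its minimiser is exactly $x^\ast = 1$ at a finite parameter value (unlike the quadratic penalty, which is only feasible in the limit):

```python
import numpy as np

# Toy 1-D problem (my own): min x s.t. x - 1 >= 0, solution x* = 1.
# l1 penalty: Phi(x, sigma) = x + (1/sigma) * [x - 1]^-
#                           = x + (1/sigma) * max(1 - x, 0).

def phi_l1(x, sigma):
    return x + np.maximum(1.0 - x, 0.0) / sigma

xs = np.linspace(0.0, 2.0, 200001)   # fine grid containing the kink x = 1
mins = [xs[np.argmin(phi_l1(xs, sigma))] for sigma in [0.9, 0.5, 0.1]]
# every sigma < 1 puts the minimiser exactly at the kink x* = 1
```

The minimiser sits at the non-differentiable kink, which is exactly the non-smoothness tradeoff discussed in ∆bite-exact-penalty-nonsmooth-consequence.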
Bite-sized
In the quadratic penalty $\Phi _ \sigma(x) = f(x) + \tfrac{1}{2\sigma} \vert \vert c(x) \vert \vert ^2$, small $\sigma$ enforces feasibility: the penalty term $\vert \vert c(x) \vert \vert ^2 / (2\sigma)$ becomes very steep in the $c \ne 0$ directions, so minimisers $x(\sigma)$ are pushed toward the constraint manifold $\{c = 0\}$ as $\sigma \to 0$.
Source: Lecture 12, A penalty function for (eCP) slide.
@bite~
For the quadratic penalty $\Phi _ \sigma$ on (eCP), the natural Lagrange multiplier estimate at iterate $x^k$ is $y^k = -c(x^k)/\sigma^k$ (note the sign). This is what makes $\nabla \Phi _ {\sigma^k}(x^k) = \nabla _ x \mathcal L(x^k, y^k)$, identifying penalty-minimisation with Lagrangian stationarity.
Source: Lecture 12, proof of Theorem 21 (penalty convergence).
@bite~
bite-exact-penalty-nonsmooth-consequenceWhy are the $\ell _ 1$ and $\ell _ 2$ exact penalty functions non-smooth, and what’s the consequence?
- The $\ell _ 2$ penalty $\Phi(x, \sigma) = f(x) + (1/\sigma) \vert \vert c _ E(x) \vert \vert$ has a kink on $\{c _ E(x) = 0\}$ because the un-squared norm $\vert \vert \cdot \vert \vert$ is not differentiable at $0$.
- The $\ell _ 1$ penalty $\Phi(x, \sigma) = f(x) + (1/\sigma) \sum _ {i \in E} \vert c _ i(x) \vert + (1/\sigma) \sum _ {i \in I} [c _ i(x)]^-$ has kinks at every active or violated constraint.
Consequence: standard smooth-optimisation algorithms (Newton, BFGS, trust region with quadratic models) don’t apply directly — minimisation of exact penalties typically requires bundle methods, smoothing, or specialised SQP-style algorithms. This is the tradeoff for exactness: smooth-but-inexact (quadratic) vs nonsmooth-but-exact ($\ell _ 1, \ell _ 2$).
Source: Lecture 12, Other penalty functions slide.
@bite~
The Hessian $\nabla^2 \Phi _ {\sigma^k}(x^k)$ of the quadratic penalty has condition number $\kappa \sim O(1/\sigma^k)$ as $\sigma^k \to 0$, with $m$ eigenvalues blowing up to $\infty$ (from the $J^\top J / \sigma$ term) and $n - m$ remaining $O(1)$.
Source: Lecture 12, Ill-conditioning of the penalty’s Hessian slide.
@bite~
bite-penalty-method-convergence-strategyGive the proof strategy for ∆quadratic-penalty-method-convergence-proof (Theorem 21: under LICQ at $x^\ast$, the quadratic penalty method converges to a KKT point and $y^k \to y^\ast$).
- Setup: from $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$ with $y^k := -c(x^k)/\sigma^k$, the approximate-stationarity hypothesis becomes $\vert \vert \nabla f(x^k) - J(x^k)^\top y^k \vert \vert \le \epsilon^k$.
- LICQ ⟹ $J(x^\ast)$ has full row rank ⟹ pseudo-inverse $J(x^\ast)^+ = (JJ^\top)^{-1} J$ exists; by continuity it persists for nearby $x^k$.
- Step 1 (multipliers converge): define candidate $y^\ast := J(x^\ast)^+ \nabla f(x^\ast)$. Triangle inequality: $\vert \vert y^k - y^\ast \vert \vert \le \vert \vert J(x^k)^+ \nabla f(x^k) - J(x^\ast)^+ \nabla f(x^\ast) \vert \vert + \vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert$. First piece → 0 by continuity. Second piece: use $J^+ J^\top = I _ m$ to rewrite $\vert \vert J(x^k)^+ \nabla f(x^k) - y^k \vert \vert = \vert \vert J(x^k)^+ \nabla \Phi _ {\sigma^k}(x^k) \vert \vert \le \vert \vert J(x^k)^+ \vert \vert \cdot \epsilon^k \to 0$.
- Step 2 (stationarity): pass to the limit in $\nabla \Phi _ {\sigma^k}(x^k) = \nabla f(x^k) - J(x^k)^\top y^k$. LHS → 0; RHS → $\nabla f(x^\ast) - J(x^\ast)^\top y^\ast$. Hence $\nabla f(x^\ast) = J(x^\ast)^\top y^\ast$.
- Step 3 (feasibility): from $y^k = -c(x^k)/\sigma^k$, rearrange to $c(x^k) = -\sigma^k y^k$. Pass to the limit: $c(x^k) \to c(x^\ast)$ by continuity, while $-\sigma^k y^k \to 0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ is bounded). So $c(x^\ast) = 0$.
Together, $(x^\ast, y^\ast)$ satisfies the equality-case KKT conditions.
Source: Lecture 12, Proof of Theorem 21.
@bite~ @proofsupport~
The feasibility identity at the heart of Step 3 of ∆quadratic-penalty-method-convergence-proof. From the definition of the multiplier estimate $y^k := -c(x^k)/\sigma^k$:
\[c(x^k) = <span class="cloze" tabindex="0">-\sigma^k y^k</span>.\]Passing to the limit, LHS → $c(x^\ast)$ by continuity, RHS → $0$ (since $\sigma^k \to 0$ and $y^k \to y^\ast$ stays bounded). Hence $c(x^\ast) = 0$ — primal feasibility of the limit point.
Source: Lecture 12, Proof of Theorem 21, feasibility step.
@bite~ @proofsupport~
bite-penalty-method-convergence-hypothesis-sigma-to-zeroIn ∆quadratic-penalty-method-convergence-proof (Theorem 21), where does the hypothesis $\sigma^k \to 0$ enter, and what fails if $\sigma^k$ stays bounded away from zero?
- $\sigma^k \to 0$ enters in the feasibility step: from the rearrangement $c(x^k) = -\sigma^k y^k$, only $\sigma^k \to 0$ (combined with bounded $y^k$) forces $c(x^k) \to 0$, hence $c(x^\ast) = 0$.
- If $\sigma^k$ stays bounded away from zero: the multiplier estimates $y^k = -c(x^k)/\sigma^k$ can still converge (Step 1), and Lagrangian stationarity (Step 2) can still hold, but there's no reason $c(x^\ast)$ should equal zero. The penalty method can therefore converge to a critical point of $\Phi _ \sigma$ that is not a KKT point of the constrained problem: it minimises a tradeoff $f + (1/(2\sigma)) \vert \vert c \vert \vert ^2$ rather than $f$ on the feasible set.
- Contrast with the augmented Lagrangian method (∆aug-lagrangian-global-convergence), which can converge to a KKT point even without $\sigma^k \to 0$, because the multiplier update absorbs the role $\sigma^k \to 0$ plays here.
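A numerical illustration of this failure mode, on an assumed toy instance of my own (minimise $x _ 1 + x _ 2$ subject to $x _ 1^2 + x _ 2^2 = 2$): along the symmetric line $x _ 1 = x _ 2 = t$, stationarity of $\Phi _ \sigma$ reduces to $t^3 - t + \sigma/4 = 0$, and the resulting minimiser has infeasibility $\vert c \vert \approx \sigma/2$, which vanishes only because $\sigma$ does:

```python
import numpy as np

# For fixed sigma the penalty minimiser stays infeasible: it trades off
# f against ||c||^2 / (2 sigma) instead of enforcing c = 0.
infeas = []
for sigma in [0.4, 0.1, 0.01]:
    # stationarity along x1 = x2 = t reduces to  t^3 - t + sigma/4 = 0
    roots = np.roots([1.0, 0.0, -1.0, sigma / 4.0])
    t = min(r.real for r in roots if abs(r.imag) < 1e-9)  # branch near t = -1
    infeas.append(abs(2.0 * t * t - 2.0))                 # |c(x(sigma))|
# infeasibility ~ sigma/2 at each fixed sigma: it never reaches zero
# unless sigma -> 0
```

Each fixed $\sigma$ gives a genuinely infeasible stationary point; driving $\sigma \to 0$ is what removes the bias.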
Source: Lecture 12, Theorem 21 hypotheses and proof; cf. Augmented Lagrangian advantage noted in Lecture 13.
@bite~ @proofsupport~