Continuous Optimisation HT26, Inequality constrained optimisation problems
Flashcards
@Define what is meant by a (nonconvex) inequality-constrained optimisation problem (iCP) and its strictly feasible set $\Omega^o$. @State the standard assumption used to derive interior point methods.
Problem (iCP):
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.} \qquad c(x) \ge 0,\]where $f : \mathbb R^n \to \mathbb R$, $c = (c _ 1, \ldots, c _ p) : \mathbb R^n \to \mathbb R^p$ are smooth. The aim is to find KKT points (or local minimisers of this problem.
Strictly feasible set:
\[\Omega^o := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]i.e. the interior of the feasible region.
Assumption: $\Omega^0 \ne \emptyset$, i.e. the ∆scq (Slater) condition.
Consider the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]
@Define the logarithmic barrier function $f _ \mu$, the associated barrier subproblem $\text{iCP} _ \mu$, and give the intuition for the role of $\mu > 0$.
For each $\mu > 0$, the logarithmic barrier function is:
\[f _ \mu (x) := f(x) - \mu \sum^p _ {i = 1} \log c _ i(x), \qquad \text{dom}(f _ \mu) = \Omega^o.\]The associated logarithmic barrier subproblem is
\[\min _ {x \in \mathbb R^n} f _ \mu(x) \qquad \text{s.t.} \qquad c(x) > 0. \quad (\text{iCP} _ \mu)\]This is essentially unconstrained. The constraint $c(x) > 0$ is enforced implicitly by the barrier term, since $c _ i(x) \to 0^+ \implies -\log c _ i(x) \to +\infty$.
Role of $\mu$:
- $\mu \to \infty$: the barrier term dominates, so a minimiser $x(\mu)$ is pushed deep into $\Omega^o$, far from the boundary. Strict feasibility is well-enforced.
- $\mu \to 0$: the barrier term vanishes pointwise on $\Omega^o$, so $x(\mu)$ approaches the constrained minimiser. This is where ill-conditioning can occur.
Consider the following inequality-constrained optimisation problem
\[\min x _ 1^2 + x _ 2^2 \qquad \text{s.t.} \qquad x _ 1 + x _ 2^2 \ge 1\]
and the associated barrier function
\[f _ \mu(x) := x _ 1^2 + x _ 2^2 - \mu \log(x _ 1 - x _ 2^2 - 1).\]
@Visualise the contours of $f _ \mu$ for $\mu = 10, 1, 0.1, 0.01$.


Consider the inequality-constrained problem ∆icp and its ∆logarithmic-barrier-function
\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)\]
@State the gradient $\nabla f _ \mu$, and derive the first-order optimality conditions for $\text{iCP} _ \mu$, and compare with the KKT conditions for $\text{iCP}$.
Gradient:
\[\begin{aligned} \nabla f _ \mu(x) &= \nabla f(x) - \sum^p _ {i = 1} \frac{\mu}{c _ i(x)} \nabla c _ i(x) \\ &= \nabla f(x) - \mu J(x)^\top c(x)^{-1} \end{aligned}\]where $J(x)$ is the constraint Jacobian (∆constraint-jacobian-matrices), and $c(x)^{-1} = (1/c _ 1(x), \ldots, 1/c _ p(x))$.
Stationarity for $\text{iCP} _ \mu$: We have that $x(\mu)$ is a minimiser iff $\nabla f _ \mu(x(\mu)) = 0$, or equivalently
\[\nabla f(x(\mu)) = \sum^p _ {i = 1} \underbrace{\frac{\mu}{c _ i(x(\mu))}} _ {=:\, \lambda _ i(\mu)} \nabla c _ i(x(\mu)),\]with $\lambda _ i(\mu) > 0$ on $\Omega^o$.
KKT for $\text{iCP}$: We require
\[\nabla f(x^\ast) = \sum _ i \lambda _ i^\ast \, \nabla c _ i(x^\ast)\]and $\lambda^\ast \ge 0$, $\lambda _ i^\ast c _ i(x^\ast) = 0$.
Comparison: These two expressions have the same algebraic shape:
\[\lambda _ i(\mu)c _ i(x(\mu)) = \mu \quad \text{v.s.} \quad \lambda _ i^\ast c _ i(x^\ast) = 0.\]This identifies $\lambda _ i(\mu) := \mu / c _ i(x(\mu))$ as a natural Lagrange multiplier estimate, and motivates the central question: under what conditions do $x(\mu) \to x^\ast$ and $\lambda _ i(\mu) \to \lambda _ i^\ast$ as $\mu \to 0$?
Consider the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]
and its ∆logarithmic-barrier-function
\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]
and the corresponding strictly feasible set
\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]
i.e. the interior of the feasible region. In this context, @define the central path.
The central path is the set
\[\{x(\mu) : \mu > 0\}\]where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$. As $\mu$ decreases, $x(\mu)$ traces a curve deep in $\Omega^o$ toward the boundary of $\Omega$, and under suitable conditions (see ∆central-path-existence) converges to a constrained minimiser of $\text{iCP}$ as $\mu \to 0$.
Consider the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]
and its ∆logarithmic-barrier-function
\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]
and the corresponding strictly feasible set
\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]
i.e. the interior of the feasible region. Recall that the central path is the set
\[\{x(\mu) : \mu > 0\}\]
where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$.
@State a theorem giving sufficient conditions for the central path to exist locally and converge to a local minimiser of (iCP) as $\mu \to 0$.
Suppose $\Omega^o \ne \emptyset$, and $x^\ast$ is a local minimiser of $\text{iCP}$ with KKT multipliers $\lambda^\ast$, satisfying:
- Strict complementarity: $\lambda _ i^\ast > 0$ whenever $c _ i(x^\ast) = 0$.
- ∆licq: $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linearly independent.
- Second-order sufficient condition: there exists $\alpha > 0$ such that
where $\mathcal L$ is the ∆lagrangian-of-constrained-optimisation-problem.
Then there is a unique $\mathcal C^1$ path $\mu \mapsto x(\mu)$ of (global) minimisers of $f _ \mu$ defined for all $\mu > 0$ sufficiently small, and $x(\mu) \to x^\ast$ as $\mu \to 0$.
Consider the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]
and its ∆logarithmic-barrier-function
\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]
and the corresponding strictly feasible set
\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]
i.e. the interior of the feasible region. Recall that the central path is the set
\[\{x(\mu) : \mu > 0\}\]
where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$ and that we have the theorem ∆central-path-existence guaranteeing that when $\Omega^o \ne \emptyset$, and $x^\ast$ is a local minimiser of $\text{iCP}$ with KKT multipliers $\lambda^\ast$, satisfying:
- Strict complementarity: $\lambda _ i^\ast > 0$ whenever $c _ i(x^\ast) = 0$.
- ∆licq: $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linearly independent.
- Second-order sufficient condition: there exists $\alpha > 0$ such that
\[s^\top \nabla^2 _ {xx} \mathcal L(x^\ast, \lambda^\ast)s \ge \alpha \|s\|^2 \quad\text{for all }s\text{ with } J _ \mathcal A(x^\ast)s = 0,\]
where $\mathcal L$ is the ∆lagrangian-of-constrained-optimisation-problem.
Then there is a unique $\mathcal C^1$ path $\mu \mapsto x(\mu)$ of (global) minimisers of $f _ \mu$ defined for all $\mu > 0$ sufficiently small, and $x(\mu) \to x^\ast$ as $\mu \to 0$.
@Visualise the central path trajectory for two problems: one convex, and one nonconvex. What is the difference between them?
Convex:

Nonconvex:

Difference:
In the nonconvex case, there are two local minimisers. Hence there are two central paths starting within the feasible region (for $\mu$ sufficiently small), both of which converge to each local minimiser.
Consider the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]
and its ∆logarithmic-barrier-function
\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]
and the corresponding strictly feasible set
\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]
i.e. the interior of the feasible region. Recall that the central path is the set
\[\{x(\mu) : \mu > 0\}\]
where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$.
In this context, @state the basic barrier method @algorithm.
Inputs: initial $\mu^0 > 0$, starting iterate $x^0 \in \Omega^o$.
Algorithm: Let $k = 0$. Until convergence:
- Choose $0 < \mu^{k+1} < \mu^k$ (commonly $\mu^{k+1} = 0.1 \mu^k$ or $\mu^{k+1} = (\mu^k)^2$).
- Choose a starting point $x_0^k \in \Omega^o$ for the inner minimisation (commonly $x_0^k := x^k$).
- Inner minimisation: from $x_0^k$, use an unconstrained minimiser to find an approximate minimiser $x^{k+1}$ of $f_{\mu^{k+1}}$.
- $k := k+1$
Convergence requirement: $\mu^k \to 0$ as $k \to \infty$.
Inner-minimisation algorithms: $f_\mu$ has log singularities at the boundary of $\Omega$, so we may use:
- Linesearch methods: use special linesearch to cope with the singularity of the log.
- Trust-region methods: shape the trust region to cope with the log contours, reject any candidate with $c(x^k + s^k) \not> 0$.
Suppose we have the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{ s.t } \qquad c(x) \ge 0,\]
and the ∆logarithmic-barrier-function $f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)$. Consider any sequence of iterates $(x^k, \mu^k)$ where $x^k$ is an approximate stationary point of $f _ {\mu^k}$ with $c(x^k) > 0$, i.e. those produced by ∆basic-barrier-method-algorithm. Let $\lambda _ i^k := \mu^k / c _ i(x^k)$ be the Lagrange multiplier estimates (from ∆barrier-multiplier-estimates).
@State a theorem about the global convergence of this scheme.
Suppose:
- $f, c \in \mathcal C^1$
- $\|\nabla f _ {\mu^k}(x^k) \| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
- $\mu^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linear independent (∆licq at $x^\ast$)
Then $x^\ast$ is a KKT point of $\text{iCP}$, and $\lambda^k \to \lambda^\ast$, where $\lambda^\ast$ is the vector of Lagrange multipliers at $x^\ast$.
(compare to ∆quadratic-penalty-method-convergence, ∆aug-lagrangian-global-convergence).
Suppose we have the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{ s.t } \qquad c(x) \ge 0,\]
and the ∆logarithmic-barrier-function $f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)$. Consider any sequence of iterates $(x^k, \mu^k)$ where $x^k$ is an approximate stationary point of $f _ {\mu^k}$ with $c(x^k) > 0$, i.e. those produced by ∆basic-barrier-method-algorithm. Let $\lambda _ i^k := \mu^k / c _ i(x^k)$ be the Lagrange multiplier estimates (from ∆barrier-multiplier-estimates).
@Prove that if
- $f, c \in \mathcal C^1$
- $\|\nabla f _ {\mu^k}(x^k) \| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
- $\mu^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linear independent (∆licq at $x^\ast$)
then $x^\ast$ is a KKT point of $\text{iCP}$, and $\lambda^k \to \lambda^\ast$, where $\lambda^\ast$ is the vector of Lagrange multipliers at $x^\ast$.
By ∆licq, $J _ \mathcal A(x^\ast)$ has full row rank, so the pseudo-inverse $J _ {\mathcal A}(x^\ast)^+ = (J _ \mathcal A J _ \mathcal A^\top)^{-1} J _ {\mathcal A}$ is well-defined. By continuity, $J _ {\mathcal A}(x^k)^+$ exists, is bounded, and converges to $J _ {\mathcal A}(x^\ast)^+$ for $k$ large.
Let $\mathcal I = \{1, \ldots, p\} \setminus \mathcal A$. From $c(x^k) > 0$ and $x^k \to x^\ast$: $c(x^\ast) \ge 0$ with $c _ {\mathcal A}(x^\ast) = 0$ and $c _ {\mathcal I}(x^\ast) > 0$, so $x^\ast$ is feasible (primal feasibility).
Define $\lambda^\ast = (\lambda _ {\mathcal A}^\ast, \lambda^\ast _ {\mathcal I})$ via
\[\lambda _ {\mathcal A}^\ast := J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast), \qquad \lambda^\ast _ {\mathcal I} := 0.\]Complementary slackness $\lambda _ i^\ast c _ i(x^\ast) = 0$ is immediate. $\lambda _ {\mathcal I}^\ast = 0$ means inactive $c _ {\mathcal A}(x^\ast) = 0$ kills active ones.
It remains to show that:
- $\lambda^k _ {\mathcal I} \to 0$
- $\lambda^k _ {\mathcal A} \to \lambda^\ast _ {\mathcal A} \ge 0$
- Lagrangian stationarity $\nabla f(x^\ast) = J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$.
For (1), note that for $i \in \mathcal I$, $\lambda^k _ i = \mu^k / c _ i(x^k) \to / c _ i(x^\ast) = 0$, since $\mu^k \to 0$ and $c _ i(x^\ast) > 0$.
For (2), write $r^k := \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$ for the active-block residual. The multiplier estimate (∆barrier-multiplier-estimates) gives $\nabla f _ {\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$, and splitting $J^\top \lambda^k = J _ {\mathcal A}^\top \lambda^k _ {\mathcal A} + J _ {\mathcal I}^\top \lambda^k _ {\mathcal I}$ rearranges this to $r^k = \nabla f _ {\mu^k}(x^k) + J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}$. Hence
\[\begin{aligned} \|r^k\| &= \|\nabla f _ {\mu^k}(x^k) + J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}\| &&(\star 1) \\ &\le \|\nabla f _ {\mu^k}(x^k)\| + \|J _ {\mathcal I}(x^k)\| \cdot \|\lambda^k _ {\mathcal I}\| &&(\star 2) \\ &\le \epsilon^k + \|J _ {\mathcal I}(x^k)\| \cdot \|\lambda^k _ {\mathcal I}\| &&(\star 3) \\ &\to 0 &&(\star 4) \end{aligned}\]where:
- $(\star 1)$: substituting the identity $r^k = \nabla f _ {\mu^k}(x^k) + J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}$ derived above.
- $(\star 2)$: triangle inequality, then submultiplicativity $\|J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}\| \le \|J _ {\mathcal I}(x^k)\| \cdot \|\lambda^k _ {\mathcal I}\|$.
- $(\star 3)$: the approximate-stationarity hypothesis $\|\nabla f _ {\mu^k}(x^k)\| \le \epsilon^k$.
- $(\star 4)$: $\epsilon^k \to 0$; $\|J _ {\mathcal I}(x^k)\|$ is bounded ($J _ {\mathcal I}$ continuous along $x^k \to x^\ast$); and $\|\lambda^k _ {\mathcal I}\| \to 0$ by (1).
LICQ makes $J _ {\mathcal A}(x^k)$ full row rank, so $J _ {\mathcal A}(x^k)^+ J _ {\mathcal A}(x^k)^\top = (J _ {\mathcal A} J _ {\mathcal A}^\top)^{-1}(J _ {\mathcal A} J _ {\mathcal A}^\top) = I$; left-multiplying $r^k = \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$ by $J _ {\mathcal A}(x^k)^+$ and rearranging gives $\lambda^k _ {\mathcal A} = J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - J _ {\mathcal A}(x^k)^+ r^k$. Hence
\[\begin{aligned} \|\lambda^k _ {\mathcal A} - \lambda^\ast _ {\mathcal A}\| &= \left\|\left(J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - \lambda^\ast _ {\mathcal A}\right) - J _ {\mathcal A}(x^k)^+ r^k\right\| &&(\star 5) \\ &\le \|J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - \lambda^\ast _ {\mathcal A}\| + \|J _ {\mathcal A}(x^k)^+\| \cdot \|r^k\| &&(\star 6) \\ &\to 0 &&(\star 7) \end{aligned}\]where:
- $(\star 5)$: substituting $\lambda^k _ {\mathcal A} = J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - J _ {\mathcal A}(x^k)^+ r^k$ and grouping the two limit-bearing pieces.
- $(\star 6)$: triangle inequality and submultiplicativity $\|J _ {\mathcal A}(x^k)^+ r^k\| \le \|J _ {\mathcal A}(x^k)^+\| \cdot \|r^k\|$.
- $(\star 7)$: the first term $\to 0$ because $J _ {\mathcal A}(x^k)^+ \nabla f(x^k) \to J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast) = \lambda^\ast _ {\mathcal A}$ (continuity of $J _ {\mathcal A}^+$ and $\nabla f$, $x^k \to x^\ast$, and the definition of $\lambda^\ast _ {\mathcal A}$); the second term $\to 0$ because $\|J _ {\mathcal A}(x^k)^+\|$ is bounded and $\|r^k\| \to 0$ by $(\star 4)$.
Therefore $\lambda^k _ {\mathcal A} \to \lambda^\ast _ {\mathcal A}$, and since $\lambda^k > 0$ for all $k$, $\lambda^\ast _ {\mathcal A} \ge 0$ (dual feasibility).
For (3), pass to the limit in $r^k = \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$. By continuity and (2), it converges to $\nabla f(x^\ast) - J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$. But then $(\star 4)$ says it also converges to $0$. Hence
\[\nabla f(x^\ast) = J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A} = \sum _ {i \in \mathcal A} \lambda _ i^\ast \nabla c _ i(x^\ast).\]All four ∆kkt-conditions hold at $(x^\ast, \lambda^\ast)$, so $x^\ast$ is KKT, and $\lambda^k \to \lambda^\ast$ overall.
Consider the inequality-constrained problem ∆icp and its ∆logarithmic-barrier-function $f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)$, with multiplier estimates $\lambda (x) = \mu c(x)^{-1}$ (entry-wise, from ∆barrier-multiplier-estimates).
@State $\nabla f _ \mu$ and $\nabla^2 f _ \mu$ in terms of the $\text{iCP}$ ∆lagrangian-of-constrained-optimisation-problem (which collapses to $\mathcal L(x, \lambda) = f(x) - \sum _ i \lambda _ i c _ i(x)$ for inequality-only problems).
Gradient:
\[\begin{aligned} \nabla f _ \mu(x) &= \nabla f(x) - \mu J(x)^\top c(x)^{-1} \\ &= \nabla f(x) - J(x)^\top \lambda(x) \\ &= \nabla _ x \mathcal L(x, \lambda(x)). \end{aligned}\]So minimising $f _ \mu$ in $x$ is equivalent to seeking Lagrangian stationarity $(x, \lambda(x))$.
Hessian: (writing $C(x) := \text{diag}(c _ 1(x), \ldots, c _ p(x))$):
\[\nabla^2 f _ \mu(x) = \underbrace{\nabla^2 f(x) - \sum^p _ {i = 1} \frac{\mu}{c _ i(x)} \nabla^2 c _ i(x)} _ {=\,\nabla^2 _ {xx} \mathcal L(x, \lambda(x))} + \mu J(x)^\top C(x)^{-2} J(x).\]The “extra” term $\mu J(x)^\top C(x)^{-2} J(x)$ comes from differentiating the multiplier estimate $\lambda(x) = \mu c(x)^{-1}$: each entry $\lambda _ i(x) = \mu/c _ i(x)$ contributes $-(\mu/c _ i(x)^2)\,\nabla c _ i(x) \nabla c _ i(x)^\top$ when re-differentiated. For $i \in \mathcal A$ (active at $x^\ast$), $c _ i(x(\mu)) = O(\mu)$ along the central path, so $\mu/c _ i(x)^2 = O(1/\mu) \to \infty$ as $\mu \to 0$ — this is the mechanism behind ∆barrier-hessian-ill-conditioning.
In the setup of ∆barrier-lagrangian-connection,
\[\nabla^2 f _ \mu(x) = \nabla^2 _ {xx} \mathcal L(x, \lambda(x)) + \mu J(x)^\top C(x)^{-2} J(x)\]
@Describe the behaviour of $\nabla^2 f _ \mu$ as $\mu \to 0$ along the ∆central-path, and the consequences for Newton-based barrier methods.
Along the central path with $x(\mu) \to x^\ast$ and $\lambda _ i(\mu) = \mu/c _ i(x(\mu))$ staying bounded (i.e. $c _ i(x(\mu)) \to c _ i(x^\ast)$ at rate $O(\mu)$ for active constraints):
- $i \in \mathcal A$ (active, $c _ i(x^\ast) = 0$): $\mu / c _ i(x(\mu))^2 \to \infty$
- $i \in \mathcal I$ (inactive, $c _ i(x^\ast) > 0$): $\mu / c _ i(x(\mu))^2 \to 0$.
So as $\mu \to 0$ (along with a Fact from Gould), $\nabla^2 f _ {\mu}(x(\mu))$ splits into:
- $p _ a := \vert \mathcal A \vert $ eigenvalues of $\nabla^2 f _ {\mu^k}(x^k)$ tend to infinity as $k \to \infty$.
- $n - p _ a$ eigenvalues that are $O(1)$ (from $\nabla^2 _ {xx} \mathcal L$) (inactive contributions to the extra term vanish)
- The condition number of $\nabla^2 f _ {\mu^k}(x^k)$ is $O(1/\mu^k)$.
This blows up as $k \to \infty$, and we may not able to compute $x^k$ accurately.
Consider the ∆basic-barrier-method-algorithm. At each outer iteration we decrease $\mu^k$ to $\mu^{k+1}$ and need a starting point $x^k _ 0$ for the inner minimisation of $f _ {\mu^{k+1}}$. The obvious choice $x^k _ 0 := x^k$ turns out to be a very poor one. @Describe what goes wrong.
Even though $x^k$ is an approximate minimiser of $f _ {\mu^k}$ (and hence interior), once we shift to the new barrier $f _ {\mu^{k+1}}$, the full Newton step $s^k$ from $x^k _ 0 = x^k$ is asymptotically infeasible.:
\[c(x^k + s^k) < 0 \quad \text{whenever } \mu^{k+1} < 0.5 \mu^k.\]Any non-trivial decrease in $\mu$ (anything more aggressive than halving) drives the Newton iterate out of $\Omega^o$. So either:
- We must take a damped step (small $\alpha^k$), which loses the local quadratic convergence of Newton, or
- Use only small decreases $\mu^{k+1} \ge 0.5 \mu^k$, which means we have very slow outer convergence.
Consider the inequality-constrained optimisation problem ∆icp
\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]
and its ∆logarithmic-barrier-function
\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]
and the corresponding strictly feasible set
\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]
i.e. the interior of the feasible region. Recall that the central path is the set
\[\{x(\mu) : \mu > 0\}\]
where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$.
In this context, the basic barrier method algorithm is:
Inputs: initial $\mu^0 > 0$, starting iterate $x^0 \in \Omega^o$.
Algorithm: Let $k = 0$. Until convergence:
- Choose $0 < \mu^{k+1} < \mu^k$ (commonly $\mu^{k+1} = 0.1 \mu^k$ or $\mu^{k+1} = (\mu^k)^2$).
- Choose a starting point $x_0^k \in \Omega^o$ for the inner minimisation (commonly $x_0^k := x^k$).
- Inner minimisation: from $x_0^k$, use an unconstrained minimiser to find an approximate minimiser $x^{k+1}$ of $f_{\mu^{k+1}}$.
- $k := k+1$
Convergence requirement: $\mu^k \to 0$ as $k \to \infty$.
Inner-minimisation algorithms: $f_\mu$ has log singularities at the boundary of $\Omega$, so we may use:
- Linesearch methods: use special linesearch to cope with the singularity of the log.
- Trust-region methods: shape the trust region to cope with the log contours, reject any candidate with $c(x^k + s^k) \not> 0$.
@Describe the two issues with this method, and the technique used to address these difficulties.
- The Hessian of $f _ {\mu^k}$ becomes ill-conditioned as $\mu^k \to 0$ (condition number $O(1/\mu^k)$), so e.g. Newton’s method will get worse and worse (∆barrier-hessian-ill-conditioning).
- It can be shown that the natural starting point $x^k _ 0 := x^k$ is a poor choice, since in order for iterates to remain feasible we must either dampen the Newton step (losing quadratic convergence), or only take small decreases in $\mu^k$, which means the outer loop converges very slowly (∆barrier-poor-starting-points).
The technique to address these issues is ∆primal-dual-newton-system, which treat the multiplier estimates as explicit variables.
Consider the inequality-constrained problem ∆icp, its ∆logarithmic-barrier-function $f _ \mu$, and a minimiser $x(\mu) \in \Omega^o$ of the barrier subproblem $\text{iCP} _ \mu$.
@Derive the perturbed optimality conditions ($\text{OPT} _ \mu$) that $(x(\mu), \lambda(\mu))$ satisfy, where $\lambda(\mu) := \mu c(x(\mu))^{-1}$ . Compare with the ∆kkt-conditions for ($\text{iCP}$).
First-order necessary conditions for $\text{iCP} _ \mu$ (which is essentially unconstrained, since $f _ \mu \to \infty$ at the boundary) give $\nabla f _ {\mu} (x(\mu)) = 0$:
\[\nabla f(x(\mu)) - \mu J(x(\mu))^\top c(x(\mu))^{-1} = 0.\]With $\lambda(\mu) := \mu c(x(\mu))^{-1}$, this is Lagrangian stationarity:
\[\boxed{\nabla f(x(\mu)) - J(x(\mu))^\top \lambda(\mu) = 0}\]Multiplying $\lambda _ i(\mu) = \mu / c _ i(x(\mu))$ by $c _ i(x(\mu))$ gives perturbed complementarity:
\[\boxed{c _ i(x(\mu))\lambda _ i(\mu) = \mu, \quad i = 1, \ldots, p.}\]Strict feasibility $x(\mu) \in \Omega^o$ gives $\boxed{c > 0, \lambda > 0}$.
So $(x(\mu), \lambda(\mu))$ satisfies $\text{OPT} _ \mu$:
\[\begin{cases} \nabla f(x) - J(x)^\top \lambda = 0 \\ c _ i(x) \lambda _ i = \mu, &\,i = 1, \ldots ,p \end{cases} \qquad \text{with }c(x) > 0, \lambda > 0\]Contrast this with KKT for $\text{iCP}$:
\[\begin{cases} \nabla f(x) - J(x)^\top \lambda = 0 \\ c _ i(x) \lambda _ i = 0 &\,i =1, \ldots, p \end{cases} \quad \text{with } c(x) \ge 0, \lambda \ge 0.\]There are two differences: complementarity is $c _ i \lambda _ i = \mu$ vs $0$, and the inequalities are strict vs non-strict. As $\mu \to 0$, $\text{OPT} _ \mu$ recovers KKT.
Why this matters: KKT complementarity $c _ i \lambda _ i$ is a combinatorial statement: for each $i$, either $c _ i = 0$ (active) or $\lambda _ i = 0$ (inactive), and the algorithm has to discover which. $\text{OPT} _ \mu$ loosens this to a smooth equation $c _ i \lambda _ i = \mu$, so that every constraint may be “slightly active”, and Newton’s method now applies to the joint $(x, \lambda)$ unknowns (in contrast to ∆barrier-multiplier-estimates, where $\lambda$ was derived from $x$). As $\mu \to 0$, the active set emerges automatically: active constraints have $c _ i \to 0$ with $\lambda _ i$ bounded, inactive ones $\lambda _ i \to 0$ with $c _ i$ bounded. In this view, $\text{OPT} _ \mu$ is a smooth perturbation of KKT, and primal-dual methods follow the path of solutions to $\mu = 0$ (see ∆primal-dual-newton-system).
Apply Newton’s method to the ∆perturbed-optimality-conditions $\text{OPT} _ \mu$:
\[\begin{cases}
\nabla f(x) - J(x)^\top \lambda = 0 \\
c _ i(x) \lambda _ i = \mu, &\,i = 1, \ldots ,p
\end{cases} \qquad \text{with }c(x) > 0, \lambda > 0\]
treating $(x, \lambda)$ as joint unknowns, maintaining $c(x) > 0, \lambda > 0$. Derive the Newton direction $(dx, d\lambda)$ and its reduced form after eliminating $d\lambda$.
Write the system as $g(x, \lambda) = 0$ with $g = (g _ 1; g _ 2)$, where
\[\begin{aligned} g _ 1(x) &:= \nabla f(x) - J(x)^\top \lambda \\ g _ 2 (x) &:= C(x) \lambda - \mu e \end{aligned}\]with $C(x) := \text{diag}(c _ 1(x), \ldots, c _ p(x))$ and $e = (1, \ldots, 1)^\top$. Linearising, we obtain
- $\partial g _ 1 / \partial x = \nabla^2 f(x) - \sum _ i \lambda _ i \nabla^2 c _ i(x) = \nabla^2 _ {xx} \mathcal L(x, \lambda)$ (cf. ∆barrier-lagrangian-connection)
- $\partial g _ 1 / \partial \lambda = -J(x)^\top$
and
- $\partial g _ 2 / \partial x = \Lambda J(x)$ where $\Lambda := \text{diag}(\lambda)$ (the $i$th component $c _ i(x) \lambda _ i$ differentiates to row $\lambda _ i \nabla c _ i(x)^\top)$
- $\partial g _ 2 / \partial \Lambda = C(x)$
Then Newton’s method solves for
\[\begin{pmatrix} \nabla^2 _ {xx} \mathcal L(x, \lambda) & -J(x)^\top \\ \Lambda J(x) & C(x) \end{pmatrix} \begin{pmatrix} \text dx \\ \text d\lambda \end{pmatrix} = -\begin{pmatrix} \nabla f(x) - J(x)^\top \lambda \\ C(x) \lambda - \mu e \end{pmatrix}\]Reduce by eliminating $d\lambda$: from the bottom block,
\[d\lambda = -\lambda + \mu c(x)^{-1} - C(x)^{-1} \Lambda J(x) \text dx\]Substituting into the top block, the $J^\top \lambda$ terms cancel and we get
\[\begin{aligned} (\nabla^2 _ {xx} \mathcal L(x, \lambda) + J(x)^\top C^{-1}(x) \Lambda J(x)) \text dx &= -(\nabla f(x) - \mu J(x)^\top c(x)^{-1}) \\ &= -\nabla f _ \mu(x) \end{aligned}\]Why this matters: The RHS matches the primal Newton step on $f _ \mu$ (∆barrier-lagrangian-connection). The LHS differs by using explicit $\lambda$ where primal methods are forced to use the implicit $\lambda(x) := \mu c(x)^{-1}$. So the multipliers get their own Newton update $\text d \lambda$ rather than being recomputed by division by vanishing $c _ i$ each iteration. This decouples the multiplier dynamics from the singularity $\lambda(x) = \mu / c(x)$ that creates the ∆barrier-hessian-ill-conditioning and ∆barrier-poor-starting-points issues. The multipliers are now full state variables that Newton can update directly, rather than quantities derived from $x$.
Consider the inequality-constrained problem ∆icp. Combining ∆perturbed-optimality-conditions with the ∆primal-dual-newton-system Newton step, @state the basic primal-dual interior point method @algorithm for solving $\text{iCP}$ in practice.
Inputs: initial $\mu^0 > 0$ and $(x^0, \lambda^0)$ with $c(x^0) > 0, \lambda^0 > 0$ (e.g. $\lambda _ i^0 := \mu^0 / c _ i(x^0))$, $\mu$-shrinkage rule (often $\mu^{k+1} := (\mu^k)^2)$.
Algorithm: Until $\mu^k$ is small and KKT residual below tolerance:
- Inner Newton on $\text{OPT} _ {\mu^k}$, warm-started at $(x^k, \lambda^k)$, repeat a small number of times:
- Compute $(\text dx, \text d\lambda)$ from ∆primal-dual-newton-system at the current $(x, \lambda)$ with parameter $\mu^k$.
- Damped step $(x, \lambda) \leftarrow (x + \alpha \text dx , \lambda + \alpha \text d\lambda)$ with $\alpha \in (0, 1]$ chosen by linesearch / trust-region to:
- Maintain $c(x) > 0$ and $\lambda > 0$ (say strictly inside the cone in both primal and dual);
- Descent the $\text{OPT} _ {\mu^k}$ residual norm.
- Repeat until approximate stationarity at $\text{OPT} _ {\mu^k}$.
- Set $(x^{k+1}, \lambda^{k+1})$ to the converged inner iterate.
- Shrink $\mu^{k+1} := \tau(\mu^k)$, commonly $\mu^{k+1} = (\mu^k)^2$.
- $k := k+1$.
Compare the simplex method and the primal-dual interior point method (∆primal-dual-ipm-algorithm specialised to linear programs) for solving linear programming problems, on worst-case complexity, average-case behaviour, and practical considerations.
Simplex method: walk along the vertices of the feasible polytope, at each pivot stepping to an adjacent vertex with better objective.
- Worst-case: exponential in problem dimension
- Average-cost: linear polynomial in practice
- Per-iteration cost: cheap pivot operation on the simplex tableau
Primal-dual IPM: cuts through the interior along the central path.
- Worst-case: polynomial in dimension and input length
- Average-cost: $O(\log (\text{LP dimension}))$ outer iterations until convergence
- Per-iteration cost: $O(N^3)$ for the primal-dual Newton solve
In practice:
- For small / medium LPs, simplex is often faster
- For very large linear programs, IPMs are often faster
Bite-sized
Strict complementarity at a KKT point of (iCP) holds iff for every active inequality $c _ i(x^\ast) = 0$, the corresponding multiplier is $\lambda _ i^\ast > 0$ (strictly). This rules out “weakly active” constraints where both $c _ i = 0$ and $\lambda _ i = 0$, and is a hypothesis of the ∆central-path-existence theorem.
In primal-dual interior point methods, the central path is followed by Newton iterates on the perturbed system $\text{OPT} _ \mu$ with the shrinkage rule $\mu^{k+1} = (\mu^k)^2$ (quadratic), giving fast asymptotic convergence as $\mu \to 0$.
Why does the primal-dual approach succeed where the basic primal barrier method gets stuck? Because in primal-dual methods, the multipliers $\lambda$ are explicit state variables updated by Newton, whereas in primal methods they are implicitly $\lambda(x) := \mu/c(x)$, derived from $x$ by division by potentially-vanishing $c _ i$ — which is exactly what causes the Hessian ill-conditioning and the poor-starting-point pathology.
Historical milestone: IPMs solve linear programs in polynomial time in the worst case, while the simplex method has exponential worst-case running time (Klee-Minty 1972). Karmarkar’s 1984 result was the breakthrough that re-popularised barrier-style methods after they fell out of favour in the 1960s.
The logarithmic barrier $-\log c _ i(x)$ blows up to $+\infty$ as $c _ i(x) \to 0^+$, so the unconstrained minimiser of $f _ \mu$ on $\{c > 0\}$ stays strictly inside $\Omega^o$. This is why we call $-\mu \sum _ i \log c _ i(x)$ a barrier function — it actively repels iterates from the boundary of $\Omega$, rather than penalising violations after the fact.
Give the proof strategy for ∆barrier-method-global-convergence-proof (Theorem 28: under LICQ, the barrier method’s iterates $x^k$ converge to a KKT point of (iCP), and multiplier estimates $\lambda^k \to \lambda^\ast$).
- Setup: LICQ ⟹ $J _ {\mathcal A}(x^\ast)$ has full row rank, so the pseudo-inverse $J _ {\mathcal A}(x^\ast)^+ = (J _ {\mathcal A} J _ {\mathcal A}^\top)^{-1} J _ {\mathcal A}$ exists; by continuity it exists, is bounded, and is continuous at $x^k$ for large $k$. Split indices into active $\mathcal A := \{i : c _ i(x^\ast) = 0\}$ and inactive $\mathcal I$.
- Define the candidate $\lambda^\ast = (\lambda^\ast _ {\mathcal A}, 0)$ via $\lambda^\ast _ {\mathcal A} := J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast)$. Complementary slackness $\lambda^\ast _ i c _ i(x^\ast) = 0$ is then immediate.
- Step 1 (inactive multipliers vanish): $\lambda^k _ i = \mu^k / c _ i(x^k) \to 0 / c _ i(x^\ast) = 0$ for $i \in \mathcal I$, using $\mu^k \to 0$ and $c _ i(x^\ast) > 0$.
- Step 2 (active multipliers converge): starting from $\nabla f _ {\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$, isolate the active part and bound $\|\nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}\| \le \epsilon^k + \|J _ {\mathcal I}(x^k)\| \|\lambda^k _ {\mathcal I}\| \to 0$. Apply $J _ {\mathcal A}(x^k)^+$ to deduce $\lambda^k _ {\mathcal A} \to J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast) = \lambda^\ast _ {\mathcal A}$. Since $\lambda^k > 0$, $\lambda^\ast _ {\mathcal A} \ge 0$.
- Step 3 (stationarity): passing to the limit in $\nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$ gives both $\nabla f(x^\ast) - J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$ (by continuity) and $0$ (by Step 2’s bound), hence $\nabla f(x^\ast) = J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$ — Lagrangian stationarity.
All four KKT pieces (primal feasibility, stationarity, dual feasibility, complementary slackness) hold at $(x^\ast, \lambda^\ast)$.
Step 1 of ∆barrier-method-global-convergence-proof (Theorem 28): the inactive multipliers go to zero. For $i \in \mathcal I$ (i.e. $c _ i(x^\ast) > 0$),
\[\lambda^k _ i = \frac{\mu^k}{c _ i(x^k)} \longrightarrow \frac{0}{c _ i(x^\ast)} = <span class="cloze" tabindex="0">0</span>\]using $\mu^k \to 0$ and $c _ i(x^k) \to c _ i(x^\ast) > 0$. This is what gives complementary slackness on inactive constraints in the limit.
Step 2 of ∆barrier-method-global-convergence-proof (Theorem 28): the active-block residual goes to zero. Starting from $\nabla f _ {\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$ and splitting $J^\top \lambda^k = J _ {\mathcal A}^\top \lambda^k _ {\mathcal A} + J _ {\mathcal I}^\top \lambda^k _ {\mathcal I}$, the triangle inequality gives
\[\ \vert \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}\ \vert \le \ \vert \nabla f _ {\mu^k}(x^k)\ \vert + \ \vert J _ {\mathcal I}(x^k)\ \vert \cdot \ \vert \lambda^k _ {\mathcal I}\ \vert \le <span class="cloze" tabindex="0">\epsilon^k + \ \vert J _ {\mathcal I}(x^k)\ \vert \ \vert \lambda^k _ {\mathcal I}\ \vert</span> \to 0,\]using the approximate-stationarity hypothesis $\ \vert \nabla f _ {\mu^k}(x^k)\ \vert \le \epsilon^k \to 0$, boundedness of $\ \vert J _ {\mathcal I}(x^k)\ \vert $, and $\lambda^k _ {\mathcal I} \to 0$ from Step 1.
In ∆barrier-method-global-convergence-proof (Theorem 28), where does the LICQ hypothesis (linear independence of $\nabla c _ i(x^\ast)$ for $i \in \mathcal A$) enter, and what specifically breaks if LICQ fails?
- LICQ ⟺ $J _ {\mathcal A}(x^\ast)$ has full row rank ⟺ $J _ {\mathcal A} J _ {\mathcal A}^\top$ is invertible, so the pseudo-inverse $J _ {\mathcal A}(x^\ast)^+ := (J _ {\mathcal A} J _ {\mathcal A}^\top)^{-1} J _ {\mathcal A}$ exists, is bounded, and (by continuity) is continuous at $x^k$ for large $k$. Without this, the pseudo-inverse blows up or is undefined.
- The pseudo-inverse is used twice: to define $\lambda^\ast _ {\mathcal A} := J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast)$, and in Step 2 to convert the active-block residual bound into a bound on $\lambda^k _ {\mathcal A}$ itself (“apply $J _ {\mathcal A}(x^k)^+$ to both sides”).
- Without LICQ: the multiplier candidate $\lambda^\ast$ may not exist, or be non-unique; the limit $\lambda^k$ can blow up or oscillate; convergence to a KKT point cannot be proved this way.
The same role is played by LICQ in the analogous penalty-method proof (∆quadratic-penalty-method-convergence-proof) and aug-Lagrangian proof (∆aug-lagrangian-global-convergence-proof).
Visualising the central path
The ∆logarithmic-barrier-function $f _ \mu = f - \mu \sum _ i \log c _ i$ turns the inequalities into a smooth wall (∆visualise-log-barrier-contours), and its minimisers trace the ∆central-path as $\mu$ varies. Slide $\mu$ or step the ∆basic-barrier-method-algorithm and watch $x(\mu)$ leave the analytic centre (large $\mu$, deep inside) and slide to the constrained optimum as $\mu \to 0$; drag the objective’s minimiser to change where it ends up. The ∆barrier-multiplier-estimates $\lambda _ i(\mu) = \mu / c _ i(x(\mu))$ shown alongside satisfy the perturbed complementarity $\lambda _ i c _ i = \mu$ and converge to the KKT multipliers.