Continuous Optimisation HT26, Inequality constrained optimisation problems


Flashcards

@Define what is meant by a (nonconvex) inequality-constrained optimisation problem (iCP) and its strictly feasible set $\Omega^o$. @State the standard assumption used to derive interior point methods.

Problem (iCP):

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.} \qquad c(x) \ge 0,\]

where $f : \mathbb R^n \to \mathbb R$, $c = (c _ 1, \ldots, c _ p) : \mathbb R^n \to \mathbb R^p$ are smooth. The aim is to find KKT points (or local minimisers of this problem.

Strictly feasible set:

\[\Omega^o := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]

i.e. the interior of the feasible region.

Assumption: $\Omega^0 \ne \emptyset$, i.e. the ∆scq (Slater) condition.

Consider the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]

@Define the logarithmic barrier function $f _ \mu$, the associated barrier subproblem $\text{iCP} _ \mu$, and give the intuition for the role of $\mu > 0$.

For each $\mu > 0$, the logarithmic barrier function is:

\[f _ \mu (x) := f(x) - \mu \sum^p _ {i = 1} \log c _ i(x), \qquad \text{dom}(f _ \mu) = \Omega^o.\]

The associated logarithmic barrier subproblem is

\[\min _ {x \in \mathbb R^n} f _ \mu(x) \qquad \text{s.t.} \qquad c(x) > 0. \quad (\text{iCP} _ \mu)\]

This is essentially unconstrained. The constraint $c(x) > 0$ is enforced implicitly by the barrier term, since $c _ i(x) \to 0^+ \implies -\log c _ i(x) \to +\infty$.

Role of $\mu$:

  • $\mu \to \infty$: the barrier term dominates, so a minimiser $x(\mu)$ is pushed deep into $\Omega^o$, far from the boundary. Strict feasibility is well-enforced.
  • $\mu \to 0$: the barrier term vanishes pointwise on $\Omega^o$, so $x(\mu)$ approaches the constrained minimiser. This is where ill-conditioning can occur.

@exam~

Consider the following inequality-constrained optimisation problem

\[\min x _ 1^2 + x _ 2^2 \qquad \text{s.t.} \qquad x _ 1 + x _ 2^2 \ge 1\]

and the associated barrier function

\[f _ \mu(x) := x _ 1^2 + x _ 2^2 - \mu \log(x _ 1 - x _ 2^2 - 1).\]

@Visualise the contours of $f _ \mu$ for $\mu = 10, 1, 0.1, 0.01$.

Consider the inequality-constrained problem ∆icp and its ∆logarithmic-barrier-function

\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)\]

@State the gradient $\nabla f _ \mu$, and derive the first-order optimality conditions for $\text{iCP} _ \mu$, and compare with the KKT conditions for $\text{iCP}$.

Gradient:

\[\begin{aligned} \nabla f _ \mu(x) &= \nabla f(x) - \sum^p _ {i = 1} \frac{\mu}{c _ i(x)} \nabla c _ i(x) \\ &= \nabla f(x) - \mu J(x)^\top c(x)^{-1} \end{aligned}\]

where $J(x)$ is the constraint Jacobian (∆constraint-jacobian-matrices), and $c(x)^{-1} = (1/c _ 1(x), \ldots, 1/c _ p(x))$.

Stationarity for $\text{iCP} _ \mu$: We have that $x(\mu)$ is a minimiser iff $\nabla f _ \mu(x(\mu)) = 0$, or equivalently

\[\nabla f(x(\mu)) = \sum^p _ {i = 1} \underbrace{\frac{\mu}{c _ i(x(\mu))}} _ {=:\, \lambda _ i(\mu)} \nabla c _ i(x(\mu)),\]

with $\lambda _ i(\mu) > 0$ on $\Omega^o$.

KKT for $\text{iCP}$: We require

\[\nabla f(x^\ast) = \sum _ i \lambda _ i^\ast \, \nabla c _ i(x^\ast)\]

and $\lambda^\ast \ge 0$, $\lambda _ i^\ast c _ i(x^\ast) = 0$.

Comparison: These two expressions have the same algebraic shape:

\[\lambda _ i(\mu)c _ i(x(\mu)) = \mu \quad \text{v.s.} \quad \lambda _ i^\ast c _ i(x^\ast) = 0.\]

This identifies $\lambda _ i(\mu) := \mu / c _ i(x(\mu))$ as a natural Lagrange multiplier estimate, and motivates the central question: under what conditions do $x(\mu) \to x^\ast$ and $\lambda _ i(\mu) \to \lambda _ i^\ast$ as $\mu \to 0$?

Consider the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]

and its ∆logarithmic-barrier-function

\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]

and the corresponding strictly feasible set

\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]

i.e. the interior of the feasible region. In this context, @define the central path.

The central path is the set

\[\{x(\mu) : \mu > 0\}\]

where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$. As $\mu$ decreases, $x(\mu)$ traces a curve deep in $\Omega^o$ toward the boundary of $\Omega$, and under suitable conditions (see ∆central-path-existence) converges to a constrained minimiser of $\text{iCP}$ as $\mu \to 0$.

Consider the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]

and its ∆logarithmic-barrier-function

\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]

and the corresponding strictly feasible set

\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]

i.e. the interior of the feasible region. Recall that the central path is the set

\[\{x(\mu) : \mu > 0\}\]

where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$.

@State a theorem giving sufficient conditions for the central path to exist locally and converge to a local minimiser of (iCP) as $\mu \to 0$.

Suppose $\Omega^o \ne \emptyset$, and $x^\ast$ is a local minimiser of $\text{iCP}$ with KKT multipliers $\lambda^\ast$, satisfying:

  1. Strict complementarity: $\lambda _ i^\ast > 0$ whenever $c _ i(x^\ast) = 0$.
  2. ∆licq: $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linearly independent.
  3. Second-order sufficient condition: there exists $\alpha > 0$ such that
\[s^\top \nabla^2 _ {xx} \mathcal L(x^\ast, \lambda^\ast)s \ge \alpha \|s\|^2 \quad\text{for all }s\text{ with } J _ \mathcal A(x^\ast)s = 0,\]

where $\mathcal L$ is the ∆lagrangian-of-constrained-optimisation-problem.

Then there is a unique $\mathcal C^1$ path $\mu \mapsto x(\mu)$ of (global) minimisers of $f _ \mu$ defined for all $\mu > 0$ sufficiently small, and $x(\mu) \to x^\ast$ as $\mu \to 0$.

Consider the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]

and its ∆logarithmic-barrier-function

\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]

and the corresponding strictly feasible set

\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]

i.e. the interior of the feasible region. Recall that the central path is the set

\[\{x(\mu) : \mu > 0\}\]

where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$ and that we have the theorem ∆central-path-existence guaranteeing that when $\Omega^o \ne \emptyset$, and $x^\ast$ is a local minimiser of $\text{iCP}$ with KKT multipliers $\lambda^\ast$, satisfying:

  1. Strict complementarity: $\lambda _ i^\ast > 0$ whenever $c _ i(x^\ast) = 0$.
  2. ∆licq: $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linearly independent.
  3. Second-order sufficient condition: there exists $\alpha > 0$ such that
\[s^\top \nabla^2 _ {xx} \mathcal L(x^\ast, \lambda^\ast)s \ge \alpha \|s\|^2 \quad\text{for all }s\text{ with } J _ \mathcal A(x^\ast)s = 0,\]

where $\mathcal L$ is the ∆lagrangian-of-constrained-optimisation-problem.

Then there is a unique $\mathcal C^1$ path $\mu \mapsto x(\mu)$ of (global) minimisers of $f _ \mu$ defined for all $\mu > 0$ sufficiently small, and $x(\mu) \to x^\ast$ as $\mu \to 0$.

@Visualise the central path trajectory for two problems: one convex, and one nonconvex. What is the difference between them?

Convex:

Nonconvex:

Difference:

In the nonconvex case, there are two local minimisers. Hence there are two central paths starting within the feasible region (for $\mu$ sufficiently small), both of which converge to each local minimiser.

Consider the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]

and its ∆logarithmic-barrier-function

\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]

and the corresponding strictly feasible set

\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]

i.e. the interior of the feasible region. Recall that the central path is the set

\[\{x(\mu) : \mu > 0\}\]

where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$.

In this context, @state the basic barrier method @algorithm.

Inputs: initial $\mu^0 > 0$, starting iterate $x^0 \in \Omega^o$.

Algorithm: Let $k = 0$. Until convergence:

  • Choose $0 < \mu^{k+1} < \mu^k$ (commonly $\mu^{k+1} = 0.1 \mu^k$ or $\mu^{k+1} = (\mu^k)^2$).
  • Choose a starting point $x_0^k \in \Omega^o$ for the inner minimisation (commonly $x_0^k := x^k$).
  • Inner minimisation: from $x_0^k$, use an unconstrained minimiser to find an approximate minimiser $x^{k+1}$ of $f_{\mu^{k+1}}$.
  • $k := k+1$

Convergence requirement: $\mu^k \to 0$ as $k \to \infty$.

Inner-minimisation algorithms: $f_\mu$ has log singularities at the boundary of $\Omega$, so we may use:

  • Linesearch methods: use special linesearch to cope with the singularity of the log.
  • Trust-region methods: shape the trust region to cope with the log contours, reject any candidate with $c(x^k + s^k) \not> 0$.

@exam~

Suppose we have the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{ s.t } \qquad c(x) \ge 0,\]

and the ∆logarithmic-barrier-function $f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)$. Consider any sequence of iterates $(x^k, \mu^k)$ where $x^k$ is an approximate stationary point of $f _ {\mu^k}$ with $c(x^k) > 0$, i.e. those produced by ∆basic-barrier-method-algorithm. Let $\lambda _ i^k := \mu^k / c _ i(x^k)$ be the Lagrange multiplier estimates (from ∆barrier-multiplier-estimates).

@State a theorem about the global convergence of this scheme.

Suppose:

  • $f, c \in \mathcal C^1$
  • $\|\nabla f _ {\mu^k}(x^k) \| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
  • $\mu^k \to 0$ as $k \to \infty$
  • $x^k \to x^\ast$ where $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linear independent (∆licq at $x^\ast$)

Then $x^\ast$ is a KKT point of $\text{iCP}$, and $\lambda^k \to \lambda^\ast$, where $\lambda^\ast$ is the vector of Lagrange multipliers at $x^\ast$.

(compare to ∆quadratic-penalty-method-convergence, ∆aug-lagrangian-global-convergence).

@exam~

Suppose we have the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{ s.t } \qquad c(x) \ge 0,\]

and the ∆logarithmic-barrier-function $f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)$. Consider any sequence of iterates $(x^k, \mu^k)$ where $x^k$ is an approximate stationary point of $f _ {\mu^k}$ with $c(x^k) > 0$, i.e. those produced by ∆basic-barrier-method-algorithm. Let $\lambda _ i^k := \mu^k / c _ i(x^k)$ be the Lagrange multiplier estimates (from ∆barrier-multiplier-estimates).

@Prove that if

  • $f, c \in \mathcal C^1$
  • $\|\nabla f _ {\mu^k}(x^k) \| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
  • $\mu^k \to 0$ as $k \to \infty$
  • $x^k \to x^\ast$ where $\nabla c _ i(x^\ast), i \in \mathcal A := \{i : c _ i(x^\ast) = 0\}$, are linear independent (∆licq at $x^\ast$)

then $x^\ast$ is a KKT point of $\text{iCP}$, and $\lambda^k \to \lambda^\ast$, where $\lambda^\ast$ is the vector of Lagrange multipliers at $x^\ast$.

By ∆licq, $J _ \mathcal A(x^\ast)$ has full row rank, so the pseudo-inverse $J _ {\mathcal A}(x^\ast)^+ = (J _ \mathcal A J _ \mathcal A^\top)^{-1} J _ {\mathcal A}$ is well-defined. By continuity, $J _ {\mathcal A}(x^k)^+$ exists, is bounded, and converges to $J _ {\mathcal A}(x^\ast)^+$ for $k$ large.

Let $\mathcal I = \{1, \ldots, p\} \setminus \mathcal A$. From $c(x^k) > 0$ and $x^k \to x^\ast$: $c(x^\ast) \ge 0$ with $c _ {\mathcal A}(x^\ast) = 0$ and $c _ {\mathcal I}(x^\ast) > 0$, so $x^\ast$ is feasible (primal feasibility).

Define $\lambda^\ast = (\lambda _ {\mathcal A}^\ast, \lambda^\ast _ {\mathcal I})$ via

\[\lambda _ {\mathcal A}^\ast := J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast), \qquad \lambda^\ast _ {\mathcal I} := 0.\]

Complementary slackness $\lambda _ i^\ast c _ i(x^\ast) = 0$ is immediate. $\lambda _ {\mathcal I}^\ast = 0$ means inactive $c _ {\mathcal A}(x^\ast) = 0$ kills active ones.

It remains to show that:

  1. $\lambda^k _ {\mathcal I} \to 0$
  2. $\lambda^k _ {\mathcal A} \to \lambda^\ast _ {\mathcal A} \ge 0$
  3. Lagrangian stationarity $\nabla f(x^\ast) = J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$.

For (1), note that for $i \in \mathcal I$, $\lambda^k _ i = \mu^k / c _ i(x^k) \to / c _ i(x^\ast) = 0$, since $\mu^k \to 0$ and $c _ i(x^\ast) > 0$.

For (2), write $r^k := \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$ for the active-block residual. The multiplier estimate (∆barrier-multiplier-estimates) gives $\nabla f _ {\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$, and splitting $J^\top \lambda^k = J _ {\mathcal A}^\top \lambda^k _ {\mathcal A} + J _ {\mathcal I}^\top \lambda^k _ {\mathcal I}$ rearranges this to $r^k = \nabla f _ {\mu^k}(x^k) + J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}$. Hence

\[\begin{aligned} \|r^k\| &= \|\nabla f _ {\mu^k}(x^k) + J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}\| &&(\star 1) \\ &\le \|\nabla f _ {\mu^k}(x^k)\| + \|J _ {\mathcal I}(x^k)\| \cdot \|\lambda^k _ {\mathcal I}\| &&(\star 2) \\ &\le \epsilon^k + \|J _ {\mathcal I}(x^k)\| \cdot \|\lambda^k _ {\mathcal I}\| &&(\star 3) \\ &\to 0 &&(\star 4) \end{aligned}\]

where:

  • $(\star 1)$: substituting the identity $r^k = \nabla f _ {\mu^k}(x^k) + J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}$ derived above.
  • $(\star 2)$: triangle inequality, then submultiplicativity $\|J _ {\mathcal I}(x^k)^\top \lambda^k _ {\mathcal I}\| \le \|J _ {\mathcal I}(x^k)\| \cdot \|\lambda^k _ {\mathcal I}\|$.
  • $(\star 3)$: the approximate-stationarity hypothesis $\|\nabla f _ {\mu^k}(x^k)\| \le \epsilon^k$.
  • $(\star 4)$: $\epsilon^k \to 0$; $\|J _ {\mathcal I}(x^k)\|$ is bounded ($J _ {\mathcal I}$ continuous along $x^k \to x^\ast$); and $\|\lambda^k _ {\mathcal I}\| \to 0$ by (1).

LICQ makes $J _ {\mathcal A}(x^k)$ full row rank, so $J _ {\mathcal A}(x^k)^+ J _ {\mathcal A}(x^k)^\top = (J _ {\mathcal A} J _ {\mathcal A}^\top)^{-1}(J _ {\mathcal A} J _ {\mathcal A}^\top) = I$; left-multiplying $r^k = \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$ by $J _ {\mathcal A}(x^k)^+$ and rearranging gives $\lambda^k _ {\mathcal A} = J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - J _ {\mathcal A}(x^k)^+ r^k$. Hence

\[\begin{aligned} \|\lambda^k _ {\mathcal A} - \lambda^\ast _ {\mathcal A}\| &= \left\|\left(J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - \lambda^\ast _ {\mathcal A}\right) - J _ {\mathcal A}(x^k)^+ r^k\right\| &&(\star 5) \\ &\le \|J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - \lambda^\ast _ {\mathcal A}\| + \|J _ {\mathcal A}(x^k)^+\| \cdot \|r^k\| &&(\star 6) \\ &\to 0 &&(\star 7) \end{aligned}\]

where:

  • $(\star 5)$: substituting $\lambda^k _ {\mathcal A} = J _ {\mathcal A}(x^k)^+ \nabla f(x^k) - J _ {\mathcal A}(x^k)^+ r^k$ and grouping the two limit-bearing pieces.
  • $(\star 6)$: triangle inequality and submultiplicativity $\|J _ {\mathcal A}(x^k)^+ r^k\| \le \|J _ {\mathcal A}(x^k)^+\| \cdot \|r^k\|$.
  • $(\star 7)$: the first term $\to 0$ because $J _ {\mathcal A}(x^k)^+ \nabla f(x^k) \to J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast) = \lambda^\ast _ {\mathcal A}$ (continuity of $J _ {\mathcal A}^+$ and $\nabla f$, $x^k \to x^\ast$, and the definition of $\lambda^\ast _ {\mathcal A}$); the second term $\to 0$ because $\|J _ {\mathcal A}(x^k)^+\|$ is bounded and $\|r^k\| \to 0$ by $(\star 4)$.

Therefore $\lambda^k _ {\mathcal A} \to \lambda^\ast _ {\mathcal A}$, and since $\lambda^k > 0$ for all $k$, $\lambda^\ast _ {\mathcal A} \ge 0$ (dual feasibility).

For (3), pass to the limit in $r^k = \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$. By continuity and (2), it converges to $\nabla f(x^\ast) - J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$. But then $(\star 4)$ says it also converges to $0$. Hence

\[\nabla f(x^\ast) = J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A} = \sum _ {i \in \mathcal A} \lambda _ i^\ast \nabla c _ i(x^\ast).\]

All four ∆kkt-conditions hold at $(x^\ast, \lambda^\ast)$, so $x^\ast$ is KKT, and $\lambda^k \to \lambda^\ast$ overall.

@exam~

Consider the inequality-constrained problem ∆icp and its ∆logarithmic-barrier-function $f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x)$, with multiplier estimates $\lambda (x) = \mu c(x)^{-1}$ (entry-wise, from ∆barrier-multiplier-estimates).

@State $\nabla f _ \mu$ and $\nabla^2 f _ \mu$ in terms of the $\text{iCP}$ ∆lagrangian-of-constrained-optimisation-problem (which collapses to $\mathcal L(x, \lambda) = f(x) - \sum _ i \lambda _ i c _ i(x)$ for inequality-only problems).

Gradient:

\[\begin{aligned} \nabla f _ \mu(x) &= \nabla f(x) - \mu J(x)^\top c(x)^{-1} \\ &= \nabla f(x) - J(x)^\top \lambda(x) \\ &= \nabla _ x \mathcal L(x, \lambda(x)). \end{aligned}\]

So minimising $f _ \mu$ in $x$ is equivalent to seeking Lagrangian stationarity $(x, \lambda(x))$.

Hessian: (writing $C(x) := \text{diag}(c _ 1(x), \ldots, c _ p(x))$):

\[\nabla^2 f _ \mu(x) = \underbrace{\nabla^2 f(x) - \sum^p _ {i = 1} \frac{\mu}{c _ i(x)} \nabla^2 c _ i(x)} _ {=\,\nabla^2 _ {xx} \mathcal L(x, \lambda(x))} + \mu J(x)^\top C(x)^{-2} J(x).\]

The “extra” term $\mu J(x)^\top C(x)^{-2} J(x)$ comes from differentiating the multiplier estimate $\lambda(x) = \mu c(x)^{-1}$: each entry $\lambda _ i(x) = \mu/c _ i(x)$ contributes $-(\mu/c _ i(x)^2)\,\nabla c _ i(x) \nabla c _ i(x)^\top$ when re-differentiated. For $i \in \mathcal A$ (active at $x^\ast$), $c _ i(x(\mu)) = O(\mu)$ along the central path, so $\mu/c _ i(x)^2 = O(1/\mu) \to \infty$ as $\mu \to 0$ — this is the mechanism behind ∆barrier-hessian-ill-conditioning.

@exam~

In the setup of ∆barrier-lagrangian-connection,

\[\nabla^2 f _ \mu(x) = \nabla^2 _ {xx} \mathcal L(x, \lambda(x)) + \mu J(x)^\top C(x)^{-2} J(x)\]

@Describe the behaviour of $\nabla^2 f _ \mu$ as $\mu \to 0$ along the ∆central-path, and the consequences for Newton-based barrier methods.

Along the central path with $x(\mu) \to x^\ast$ and $\lambda _ i(\mu) = \mu/c _ i(x(\mu))$ staying bounded (i.e. $c _ i(x(\mu)) \to c _ i(x^\ast)$ at rate $O(\mu)$ for active constraints):

  • $i \in \mathcal A$ (active, $c _ i(x^\ast) = 0$): $\mu / c _ i(x(\mu))^2 \to \infty$
  • $i \in \mathcal I$ (inactive, $c _ i(x^\ast) > 0$): $\mu / c _ i(x(\mu))^2 \to 0$.

So as $\mu \to 0$ (along with a Fact from Gould), $\nabla^2 f _ {\mu}(x(\mu))$ splits into:

  • $p _ a := \vert \mathcal A \vert $ eigenvalues of $\nabla^2 f _ {\mu^k}(x^k)$ tend to infinity as $k \to \infty$.
  • $n - p _ a$ eigenvalues that are $O(1)$ (from $\nabla^2 _ {xx} \mathcal L$) (inactive contributions to the extra term vanish)
  • The condition number of $\nabla^2 f _ {\mu^k}(x^k)$ is $O(1/\mu^k)$.

This blows up as $k \to \infty$, and we may not able to compute $x^k$ accurately.

@exam~

Consider the ∆basic-barrier-method-algorithm. At each outer iteration we decrease $\mu^k$ to $\mu^{k+1}$ and need a starting point $x^k _ 0$ for the inner minimisation of $f _ {\mu^{k+1}}$. The obvious choice $x^k _ 0 := x^k$ turns out to be a very poor one. @Describe what goes wrong.

Even though $x^k$ is an approximate minimiser of $f _ {\mu^k}$ (and hence interior), once we shift to the new barrier $f _ {\mu^{k+1}}$, the full Newton step $s^k$ from $x^k _ 0 = x^k$ is asymptotically infeasible.:

\[c(x^k + s^k) < 0 \quad \text{whenever } \mu^{k+1} < 0.5 \mu^k.\]

Any non-trivial decrease in $\mu$ (anything more aggressive than halving) drives the Newton iterate out of $\Omega^o$. So either:

  • We must take a damped step (small $\alpha^k$), which loses the local quadratic convergence of Newton, or
  • Use only small decreases $\mu^{k+1} \ge 0.5 \mu^k$, which means we have very slow outer convergence.

Consider the inequality-constrained optimisation problem ∆icp

\[\min _ {x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0\]

and its ∆logarithmic-barrier-function

\[f _ \mu(x) = f(x) - \mu \sum _ i \log c _ i(x).\]

and the corresponding strictly feasible set

\[\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c _ i(x) > 0 \text{ for all }1 \le i \le p\}\]

i.e. the interior of the feasible region. Recall that the central path is the set

\[\{x(\mu) : \mu > 0\}\]

where $x(\mu)$ is a (global) minimiser of $f _ \mu$ on $\Omega^o$.

In this context, the basic barrier method algorithm is:

Inputs: initial $\mu^0 > 0$, starting iterate $x^0 \in \Omega^o$.

Algorithm: Let $k = 0$. Until convergence:

  • Choose $0 < \mu^{k+1} < \mu^k$ (commonly $\mu^{k+1} = 0.1 \mu^k$ or $\mu^{k+1} = (\mu^k)^2$).
  • Choose a starting point $x_0^k \in \Omega^o$ for the inner minimisation (commonly $x_0^k := x^k$).
  • Inner minimisation: from $x_0^k$, use an unconstrained minimiser to find an approximate minimiser $x^{k+1}$ of $f_{\mu^{k+1}}$.
  • $k := k+1$

Convergence requirement: $\mu^k \to 0$ as $k \to \infty$.

Inner-minimisation algorithms: $f_\mu$ has log singularities at the boundary of $\Omega$, so we may use:

  • Linesearch methods: use special linesearch to cope with the singularity of the log.
  • Trust-region methods: shape the trust region to cope with the log contours, reject any candidate with $c(x^k + s^k) \not> 0$.

@Describe the two issues with this method, and the technique used to address these difficulties.

  • The Hessian of $f _ {\mu^k}$ becomes ill-conditioned as $\mu^k \to 0$ (condition number $O(1/\mu^k)$), so e.g. Newton’s method will get worse and worse (∆barrier-hessian-ill-conditioning).
  • It can be shown that the natural starting point $x^k _ 0 := x^k$ is a poor choice, since in order for iterates to remain feasible we must either dampen the Newton step (losing quadratic convergence), or only take small decreases in $\mu^k$, which means the outer loop converges very slowly (∆barrier-poor-starting-points).

The technique to address these issues is ∆primal-dual-newton-system, which treat the multiplier estimates as explicit variables.

Consider the inequality-constrained problem ∆icp, its ∆logarithmic-barrier-function $f _ \mu$, and a minimiser $x(\mu) \in \Omega^o$ of the barrier subproblem $\text{iCP} _ \mu$.

@Derive the perturbed optimality conditions ($\text{OPT} _ \mu$) that $(x(\mu), \lambda(\mu))$ satisfy, where $\lambda(\mu) := \mu c(x(\mu))^{-1}$ . Compare with the ∆kkt-conditions for ($\text{iCP}$).

First-order necessary conditions for $\text{iCP} _ \mu$ (which is essentially unconstrained, since $f _ \mu \to \infty$ at the boundary) give $\nabla f _ {\mu} (x(\mu)) = 0$:

\[\nabla f(x(\mu)) - \mu J(x(\mu))^\top c(x(\mu))^{-1} = 0.\]

With $\lambda(\mu) := \mu c(x(\mu))^{-1}$, this is Lagrangian stationarity:

\[\boxed{\nabla f(x(\mu)) - J(x(\mu))^\top \lambda(\mu) = 0}\]

Multiplying $\lambda _ i(\mu) = \mu / c _ i(x(\mu))$ by $c _ i(x(\mu))$ gives perturbed complementarity:

\[\boxed{c _ i(x(\mu))\lambda _ i(\mu) = \mu, \quad i = 1, \ldots, p.}\]

Strict feasibility $x(\mu) \in \Omega^o$ gives $\boxed{c > 0, \lambda > 0}$.


So $(x(\mu), \lambda(\mu))$ satisfies $\text{OPT} _ \mu$:

\[\begin{cases} \nabla f(x) - J(x)^\top \lambda = 0 \\ c _ i(x) \lambda _ i = \mu, &\,i = 1, \ldots ,p \end{cases} \qquad \text{with }c(x) > 0, \lambda > 0\]

Contrast this with KKT for $\text{iCP}$:

\[\begin{cases} \nabla f(x) - J(x)^\top \lambda = 0 \\ c _ i(x) \lambda _ i = 0 &\,i =1, \ldots, p \end{cases} \quad \text{with } c(x) \ge 0, \lambda \ge 0.\]

There are two differences: complementarity is $c _ i \lambda _ i = \mu$ vs $0$, and the inequalities are strict vs non-strict. As $\mu \to 0$, $\text{OPT} _ \mu$ recovers KKT.

Why this matters: KKT complementarity $c _ i \lambda _ i$ is a combinatorial statement: for each $i$, either $c _ i = 0$ (active) or $\lambda _ i = 0$ (inactive), and the algorithm has to discover which. $\text{OPT} _ \mu$ loosens this to a smooth equation $c _ i \lambda _ i = \mu$, so that every constraint may be “slightly active”, and Newton’s method now applies to the joint $(x, \lambda)$ unknowns (in contrast to ∆barrier-multiplier-estimates, where $\lambda$ was derived from $x$). As $\mu \to 0$, the active set emerges automatically: active constraints have $c _ i \to 0$ with $\lambda _ i$ bounded, inactive ones $\lambda _ i \to 0$ with $c _ i$ bounded. In this view, $\text{OPT} _ \mu$ is a smooth perturbation of KKT, and primal-dual methods follow the path of solutions to $\mu = 0$ (see ∆primal-dual-newton-system).

Apply Newton’s method to the ∆perturbed-optimality-conditions $\text{OPT} _ \mu$:

\[\begin{cases} \nabla f(x) - J(x)^\top \lambda = 0 \\ c _ i(x) \lambda _ i = \mu, &\,i = 1, \ldots ,p \end{cases} \qquad \text{with }c(x) > 0, \lambda > 0\]

treating $(x, \lambda)$ as joint unknowns, maintaining $c(x) > 0, \lambda > 0$. Derive the Newton direction $(dx, d\lambda)$ and its reduced form after eliminating $d\lambda$.

Write the system as $g(x, \lambda) = 0$ with $g = (g _ 1; g _ 2)$, where

\[\begin{aligned} g _ 1(x) &:= \nabla f(x) - J(x)^\top \lambda \\ g _ 2 (x) &:= C(x) \lambda - \mu e \end{aligned}\]

with $C(x) := \text{diag}(c _ 1(x), \ldots, c _ p(x))$ and $e = (1, \ldots, 1)^\top$. Linearising, we obtain

  • $\partial g _ 1 / \partial x = \nabla^2 f(x) - \sum _ i \lambda _ i \nabla^2 c _ i(x) = \nabla^2 _ {xx} \mathcal L(x, \lambda)$ (cf. ∆barrier-lagrangian-connection)
  • $\partial g _ 1 / \partial \lambda = -J(x)^\top$

and

  • $\partial g _ 2 / \partial x = \Lambda J(x)$ where $\Lambda := \text{diag}(\lambda)$ (the $i$th component $c _ i(x) \lambda _ i$ differentiates to row $\lambda _ i \nabla c _ i(x)^\top)$
  • $\partial g _ 2 / \partial \Lambda = C(x)$

Then Newton’s method solves for

\[\begin{pmatrix} \nabla^2 _ {xx} \mathcal L(x, \lambda) & -J(x)^\top \\ \Lambda J(x) & C(x) \end{pmatrix} \begin{pmatrix} \text dx \\ \text d\lambda \end{pmatrix} = -\begin{pmatrix} \nabla f(x) - J(x)^\top \lambda \\ C(x) \lambda - \mu e \end{pmatrix}\]

Reduce by eliminating $d\lambda$: from the bottom block,

\[d\lambda = -\lambda + \mu c(x)^{-1} - C(x)^{-1} \Lambda J(x) \text dx\]

Substituting into the top block, the $J^\top \lambda$ terms cancel and we get

\[\begin{aligned} (\nabla^2 _ {xx} \mathcal L(x, \lambda) + J(x)^\top C^{-1}(x) \Lambda J(x)) \text dx &= -(\nabla f(x) - \mu J(x)^\top c(x)^{-1}) \\ &= -\nabla f _ \mu(x) \end{aligned}\]

Why this matters: The RHS matches the primal Newton step on $f _ \mu$ (∆barrier-lagrangian-connection). The LHS differs by using explicit $\lambda$ where primal methods are forced to use the implicit $\lambda(x) := \mu c(x)^{-1}$. So the multipliers get their own Newton update $\text d \lambda$ rather than being recomputed by division by vanishing $c _ i$ each iteration. This decouples the multiplier dynamics from the singularity $\lambda(x) = \mu / c(x)$ that creates the ∆barrier-hessian-ill-conditioning and ∆barrier-poor-starting-points issues. The multipliers are now full state variables that Newton can update directly, rather than quantities derived from $x$.

Consider the inequality-constrained problem ∆icp. Combining ∆perturbed-optimality-conditions with the ∆primal-dual-newton-system Newton step, @state the basic primal-dual interior point method @algorithm for solving $\text{iCP}$ in practice.

Inputs: initial $\mu^0 > 0$ and $(x^0, \lambda^0)$ with $c(x^0) > 0, \lambda^0 > 0$ (e.g. $\lambda _ i^0 := \mu^0 / c _ i(x^0))$, $\mu$-shrinkage rule (often $\mu^{k+1} := (\mu^k)^2)$.

Algorithm: Until $\mu^k$ is small and KKT residual below tolerance:

  • Inner Newton on $\text{OPT} _ {\mu^k}$, warm-started at $(x^k, \lambda^k)$, repeat a small number of times:
    • Compute $(\text dx, \text d\lambda)$ from ∆primal-dual-newton-system at the current $(x, \lambda)$ with parameter $\mu^k$.
    • Damped step $(x, \lambda) \leftarrow (x + \alpha \text dx , \lambda + \alpha \text d\lambda)$ with $\alpha \in (0, 1]$ chosen by linesearch / trust-region to:
      • Maintain $c(x) > 0$ and $\lambda > 0$ (say strictly inside the cone in both primal and dual);
      • Descent the $\text{OPT} _ {\mu^k}$ residual norm.
    • Repeat until approximate stationarity at $\text{OPT} _ {\mu^k}$.
  • Set $(x^{k+1}, \lambda^{k+1})$ to the converged inner iterate.
  • Shrink $\mu^{k+1} := \tau(\mu^k)$, commonly $\mu^{k+1} = (\mu^k)^2$.
  • $k := k+1$.

Compare the simplex method and the primal-dual interior point method (∆primal-dual-ipm-algorithm specialised to linear programs) for solving linear programming problems, on worst-case complexity, average-case behaviour, and practical considerations.

Simplex method: walk along the vertices of the feasible polytope, at each pivot stepping to an adjacent vertex with better objective.

  • Worst-case: exponential in problem dimension
  • Average-cost: linear polynomial in practice
  • Per-iteration cost: cheap pivot operation on the simplex tableau

Primal-dual IPM: cuts through the interior along the central path.

  • Worst-case: polynomial in dimension and input length
  • Average-cost: $O(\log (\text{LP dimension}))$ outer iterations until convergence
  • Per-iteration cost: $O(N^3)$ for the primal-dual Newton solve

In practice:

  • For small / medium LPs, simplex is often faster
  • For very large linear programs, IPMs are often faster

Bite-sized

Strict complementarity at a KKT point of (iCP) holds iff for every active inequality $c _ i(x^\ast) = 0$, the corresponding multiplier is $\lambda _ i^\ast > 0$ (strictly). This rules out “weakly active” constraints where both $c _ i = 0$ and $\lambda _ i = 0$, and is a hypothesis of the ∆central-path-existence theorem.

Source: Lectures 14-15, The ‘central’ path of barrier minimizers exists locally slide (Theorem 27).

@bite~

In primal-dual interior point methods, the central path is followed by Newton iterates on the perturbed system $\text{OPT} _ \mu$ with the shrinkage rule $\mu^{k+1} = (\mu^k)^2$ (quadratic), giving fast asymptotic convergence as $\mu \to 0$.

Source: Lectures 14-15, Primal-dual path-following methods — practical aspects slide.

@bite~

Why does the primal-dual approach succeed where the basic primal barrier method gets stuck? Because in primal-dual methods, the multipliers $\lambda$ are explicit state variables updated by Newton, whereas in primal methods they are implicitly $\lambda(x) := \mu/c(x)$, derived from $x$ by division by potentially-vanishing $c _ i$ — which is exactly what causes the Hessian ill-conditioning and the poor-starting-point pathology.

Source: Lectures 14-15, Primal-dual versus primal methods slide; cf. ∆primal-dual-newton-system, ∆barrier-hessian-ill-conditioning.

@bite~

Historical milestone: IPMs solve linear programs in polynomial time in the worst case, while the simplex method has exponential worst-case running time (Klee-Minty 1972). Karmarkar’s 1984 result was the breakthrough that re-popularised barrier-style methods after they fell out of favour in the 1960s.

Source: Lectures 14-15, The simplex versus interior point methods for LP slide.

@bite~

The logarithmic barrier $-\log c _ i(x)$ blows up to $+\infty$ as $c _ i(x) \to 0^+$, so the unconstrained minimiser of $f _ \mu$ on $\{c > 0\}$ stays strictly inside $\Omega^o$. This is why we call $-\mu \sum _ i \log c _ i(x)$ a barrier function — it actively repels iterates from the boundary of $\Omega$, rather than penalising violations after the fact.

Source: Lectures 14-15, The logarithmic barrier function for (iCP) discussion.

@bite~

Give the proof strategy for ∆barrier-method-global-convergence-proof (Theorem 28: under LICQ, the barrier method’s iterates $x^k$ converge to a KKT point of (iCP), and multiplier estimates $\lambda^k \to \lambda^\ast$).

  • Setup: LICQ ⟹ $J _ {\mathcal A}(x^\ast)$ has full row rank, so the pseudo-inverse $J _ {\mathcal A}(x^\ast)^+ = (J _ {\mathcal A} J _ {\mathcal A}^\top)^{-1} J _ {\mathcal A}$ exists; by continuity it exists, is bounded, and is continuous at $x^k$ for large $k$. Split indices into active $\mathcal A := \{i : c _ i(x^\ast) = 0\}$ and inactive $\mathcal I$.
  • Define the candidate $\lambda^\ast = (\lambda^\ast _ {\mathcal A}, 0)$ via $\lambda^\ast _ {\mathcal A} := J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast)$. Complementary slackness $\lambda^\ast _ i c _ i(x^\ast) = 0$ is then immediate.
  • Step 1 (inactive multipliers vanish): $\lambda^k _ i = \mu^k / c _ i(x^k) \to 0 / c _ i(x^\ast) = 0$ for $i \in \mathcal I$, using $\mu^k \to 0$ and $c _ i(x^\ast) > 0$.
  • Step 2 (active multipliers converge): starting from $\nabla f _ {\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$, isolate the active part and bound $\|\nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}\| \le \epsilon^k + \|J _ {\mathcal I}(x^k)\| \|\lambda^k _ {\mathcal I}\| \to 0$. Apply $J _ {\mathcal A}(x^k)^+$ to deduce $\lambda^k _ {\mathcal A} \to J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast) = \lambda^\ast _ {\mathcal A}$. Since $\lambda^k > 0$, $\lambda^\ast _ {\mathcal A} \ge 0$.
  • Step 3 (stationarity): passing to the limit in $\nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}$ gives both $\nabla f(x^\ast) - J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$ (by continuity) and $0$ (by Step 2’s bound), hence $\nabla f(x^\ast) = J _ {\mathcal A}(x^\ast)^\top \lambda^\ast _ {\mathcal A}$ — Lagrangian stationarity.

All four KKT pieces (primal feasibility, stationarity, dual feasibility, complementary slackness) hold at $(x^\ast, \lambda^\ast)$.

Source: Lectures 14-15, Proof of Theorem 28.

@bite~ @proofsupport~

Step 1 of ∆barrier-method-global-convergence-proof (Theorem 28): the inactive multipliers go to zero. For $i \in \mathcal I$ (i.e. $c _ i(x^\ast) > 0$),

\[\lambda^k _ i = \frac{\mu^k}{c _ i(x^k)} \longrightarrow \frac{0}{c _ i(x^\ast)} = <span class="cloze" tabindex="0">0</span>\]

using $\mu^k \to 0$ and $c _ i(x^k) \to c _ i(x^\ast) > 0$. This is what gives complementary slackness on inactive constraints in the limit.

Source: Lectures 14-15, Proof of Theorem 28, step (i).

@bite~ @proofsupport~

Step 2 of ∆barrier-method-global-convergence-proof (Theorem 28): the active-block residual goes to zero. Starting from $\nabla f _ {\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$ and splitting $J^\top \lambda^k = J _ {\mathcal A}^\top \lambda^k _ {\mathcal A} + J _ {\mathcal I}^\top \lambda^k _ {\mathcal I}$, the triangle inequality gives

\[\ \vert \nabla f(x^k) - J _ {\mathcal A}(x^k)^\top \lambda^k _ {\mathcal A}\ \vert \le \ \vert \nabla f _ {\mu^k}(x^k)\ \vert + \ \vert J _ {\mathcal I}(x^k)\ \vert \cdot \ \vert \lambda^k _ {\mathcal I}\ \vert \le <span class="cloze" tabindex="0">\epsilon^k + \ \vert J _ {\mathcal I}(x^k)\ \vert \ \vert \lambda^k _ {\mathcal I}\ \vert</span> \to 0,\]

using the approximate-stationarity hypothesis $\ \vert \nabla f _ {\mu^k}(x^k)\ \vert \le \epsilon^k \to 0$, boundedness of $\ \vert J _ {\mathcal I}(x^k)\ \vert $, and $\lambda^k _ {\mathcal I} \to 0$ from Step 1.

Source: Lectures 14-15, Proof of Theorem 28, inequality $(\diamondsuit)$.

@bite~ @proofsupport~

In ∆barrier-method-global-convergence-proof (Theorem 28), where does the LICQ hypothesis (linear independence of $\nabla c _ i(x^\ast)$ for $i \in \mathcal A$) enter, and what specifically breaks if LICQ fails?

  • LICQ ⟺ $J _ {\mathcal A}(x^\ast)$ has full row rank ⟺ $J _ {\mathcal A} J _ {\mathcal A}^\top$ is invertible, so the pseudo-inverse $J _ {\mathcal A}(x^\ast)^+ := (J _ {\mathcal A} J _ {\mathcal A}^\top)^{-1} J _ {\mathcal A}$ exists, is bounded, and (by continuity) is continuous at $x^k$ for large $k$. Without this, the pseudo-inverse blows up or is undefined.
  • The pseudo-inverse is used twice: to define $\lambda^\ast _ {\mathcal A} := J _ {\mathcal A}(x^\ast)^+ \nabla f(x^\ast)$, and in Step 2 to convert the active-block residual bound into a bound on $\lambda^k _ {\mathcal A}$ itself (“apply $J _ {\mathcal A}(x^k)^+$ to both sides”).
  • Without LICQ: the multiplier candidate $\lambda^\ast$ may not exist, or be non-unique; the limit $\lambda^k$ can blow up or oscillate; convergence to a KKT point cannot be proved this way.

The same role is played by LICQ in the analogous penalty-method proof (∆quadratic-penalty-method-convergence-proof) and aug-Lagrangian proof (∆aug-lagrangian-global-convergence-proof).

Source: Lectures 14-15, Theorem 28 statement and proof setup.

@bite~ @proofsupport~

Visualising the central path

The ∆logarithmic-barrier-function $f _ \mu = f - \mu \sum _ i \log c _ i$ turns the inequalities into a smooth wall (∆visualise-log-barrier-contours), and its minimisers trace the ∆central-path as $\mu$ varies. Slide $\mu$ or step the ∆basic-barrier-method-algorithm and watch $x(\mu)$ leave the analytic centre (large $\mu$, deep inside) and slide to the constrained optimum as $\mu \to 0$; drag the objective’s minimiser to change where it ends up. The ∆barrier-multiplier-estimates $\lambda _ i(\mu) = \mu / c _ i(x(\mu))$ shown alongside satisfy the perturbed complementarity $\lambda _ i c _ i = \mu$ and converge to the KKT multipliers.

The barrier method replaces the inequalities \(c_i(x) \ge 0\) with a smooth wall: minimise \(f_\mu(x) = f(x) - \mu \sum \log c_i(x)\). Increase the penalty and the barrier parameter \(\mu \to 0\): the wall thins and the minimiser \(x(\mu)\) is pulled from deep inside the region to the constrained optimum. The estimates \(\lambda_i(\mu) = \mu/c_i(x(\mu))\) satisfy the perturbed complementarity \(\lambda_i c_i = \mu\), and \(\to\) the KKT multipliers as \(\mu \to 0\).
Central path. The set \(\{x(\mu) : \mu > 0\}\) of barrier-subproblem minimisers — one point per penalty level. As the penalty rises (\(\mu \to 0\)) it runs from the analytic centre (the point furthest from all walls) to a constrained minimiser.
Penalty
lowhigh
Feasible region
Objective (or drag its minimiser)

Purple = contours of \(f_\mu\) (note the wall near the boundary); blue dots = \(x(\mu)\) at other penalties; orange = current \(x(\mu)\); ✦ = optimum (\(\mu \to 0\)).