# Notes - Continuous Optimisation HT26, Inequality constrained optimisation problems

> Source: https://ollybritton.com/notes/uni/part-c/ht26/continuous-optimisation/notes/inequality-constrained-optimisation-problems/ · Updated: 2026-06-08 · Tags: uni, notes

- [Course - Continuous Optimisation HT26](https://ollybritton.com/notes/uni/part-c/ht26/continuous-optimisation/)
	- [Notes - Continuous Optimisation HT26, Constrained optimisation problems](https://ollybritton.com/notes/uni/part-c/ht26/continuous-optimisation/notes/constrained-optimisation-problems/)

### Flashcards
@Define what is meant by a (nonconvex) inequality-constrained optimisation problem (iCP) and its strictly feasible set $\Omega^o$. @State the standard assumption used to derive interior point methods.::

**Problem (iCP)**:
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{s.t.} \qquad c(x) \ge 0,
$$
where $f : \mathbb R^n \to \mathbb R$, $c = (c_1, \ldots, c_p) : \mathbb R^n \to \mathbb R^p$ are smooth. The aim is to find KKT points (or local minimisers of this problem.

**Strictly feasible set**:
$$
\Omega^o := \{x \mid c(x) > 0\} = \{x \mid c_i(x) > 0 \text{ for all }1 \le i \le p\}
$$
i.e. the *interior* of the feasible region.

**Assumption**: $\Omega^0 \ne \emptyset$.

Consider the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0
$$
@Define the logarithmic barrier function $f_\mu$, the associated barrier subproblem $\text{iCP}_\mu$, and give the intuition for the role of $\mu > 0$.::

For each $\mu > 0$, the logarithmic barrier function is:
$$
f_\mu (x) := f(x) - \mu \sum^p_{i = 1} \log c_i(x), \qquad \text{dom}(f_\mu) = \Omega^o.
$$
The associated logarithmic barrier subproblem is
$$
\min_{x \in \mathbb R^n} f_\mu(x) \qquad \text{s.t.} \qquad c(x) > 0. \quad (\text{iCP}_\mu)
$$
This is essentially unconstrained. The constraint $c(x) > 0$ is enforced implicitly by the barrier term, since $c_i(x) \to 0^+ \implies -\log c_i(x) \to +\infty$.

**Role of $\mu$**:

- $\mu \to \infty$: the barrier term dominates, so a minimiser $x(\mu)$ is pushed deep into $\Omega^o$, far from the boundary. Strict feasibility is well-enforced.
- $\mu \to 0$: the barrier term vanishes pointwise on $\Omega^o$, so $x(\mu)$ approaches the constrained minimiser. This is where ill-conditioning can occur.

@exam~

Consider the following inequality-constrained optimisation problem
$$
\min x_1^2 + x_2^2 \qquad \text{s.t.} \qquad x_1 + x_2^2 \ge 1
$$
and the associated barrier function
$$
f_\mu(x) := x_1^2 + x_2^2 - \mu \log(x_1 - x_2^2 - 1).
$$
@Visualise the contours of $f_\mu$ for $\mu = 10, 1, 0.1, 0.01$.::

![Screenshot 2026-05-13 at 09.57.13.png](https://ollybritton.com/assets/attachments/img/Screenshot 2026-05-13 at 09.57.13.png)
![Screenshot 2026-05-13 at 09.57.24.png](https://ollybritton.com/assets/attachments/img/Screenshot 2026-05-13 at 09.57.24.png)

Consider the inequality-constrained problem ∆icp and its ∆logarithmic-barrier-function
$$
f_\mu(x) = f(x) - \mu \sum_i \log c_i(x)
$$
@State the gradient $\nabla f_\mu$, and derive the first-order optimality conditions for $\text{iCP}_\mu$, and compare with the KKT conditions for $\text{iCP}$.::

**Gradient**:
$$
\begin{aligned}
\nabla f_\mu(x) &= \nabla f(x) - \sum^p_{i = 1} \frac{\mu}{c_i(x)} \nabla c_i(x) \\
&= \nabla f(x) - \mu J(x)^\top c(x)^{-1}
\end{aligned}
$$
where $J(x)$ is the constraint Jacobian (∆constraint-jacobian-matrices), and $c(x)^{-1} = (1/c_1(x), \ldots, 1/c_p(x))$.

**Stationarity for $\text{iCP}_\mu$**: We have that $x(\mu)$ is a minimiser iff $\nabla f_\mu(x(\mu)) = 0$, or equivalently
$$
\nabla f(x(\mu)) = \sum^p_{i = 1} \underbrace{\frac{\mu}{c_i(x(\mu))}}_{=:\, \lambda_i(\mu)} \nabla c_i(x(\mu)),
$$
with $\lambda_i(\mu) > 0$ on $\Omega^o$.

**KKT for $\text{iCP}$**: We require
$$
\nabla f(x^\ast) = \sum_i \lambda_i^\ast \, \nabla c_i(x^\ast)
$$
and $\lambda^\ast \ge 0$, $\lambda_i^\ast c_i(x^\ast) = 0$.

**Comparison**: These two expressions have the same algebraic shape:
$$
\lambda_i(\mu)c_i(x(\mu)) = \mu \quad \text{v.s.} \quad \lambda_i^\ast c_i(x^\ast) = 0.
$$
This identifies $\lambda_i(\mu) := \mu / c_i(x(\mu))$ as a natural Lagrange multiplier estimate, and motivates the central question: under what conditions do $x(\mu) \to x^\ast$ and $\lambda_i(\mu) \to \lambda_i^\ast$ as $\mu \to 0$?

Consider the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0
$$
and its ∆logarithmic-barrier-function
$$
f_\mu(x) = f(x) - \mu \sum_i \log c_i(x).
$$
and the corresponding strictly feasible set
$$
\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c_i(x) > 0 \text{ for all }1 \le i \le p\}
$$
i.e. the interior of the feasible region. In this context, @define the central path.::

The central path is the set
$$
\{x(\mu) : \mu > 0\}
$$
where $x(\mu)$ is a (global) minimiser of $f_\mu$ on $\Omega^o$. As $\mu$ decreases, $x(\mu)$ traces a curve deep in $\Omega^o$ toward the boundary of $\Omega$, and under suitable conditions (see ∆central-path-existence) converges to a constrained minimiser of $\text{iCP}$ as $\mu \to 0$.

Consider the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0
$$
and its ∆logarithmic-barrier-function
$$
f_\mu(x) = f(x) - \mu \sum_i \log c_i(x).
$$
and the corresponding strictly feasible set
$$
\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c_i(x) > 0 \text{ for all }1 \le i \le p\}
$$
i.e. the interior of the feasible region. Recall that the central path is the set
$$
\{x(\mu) : \mu > 0\}
$$
where $x(\mu)$ is a (global) minimiser of $f_\mu$ on $\Omega^o$.

@State a theorem giving sufficient conditions for the central path to exist locally and converge to a local minimiser of (iCP) as $\mu \to 0$.::

Suppose $\Omega^o \ne \emptyset$, and $x^\ast$ is a local minimiser of $\text{iCP}$ with KKT multipliers $\lambda^\ast$, satisfying:

1. **Strict complementarity**: $\lambda_i^\ast > 0$ whenever $c_i(x^\ast) = 0$.
2. **∆licq**: $\nabla c_i(x^\ast), i \in \mathcal A := \{i : c_i(x^\ast) = 0\}$, are linearly independent.
3. **Second-order sufficient condition**: there exists $\alpha > 0$ such that $$s^\top \nabla^2_{xx} \mathcal L(x^\ast, \lambda^\ast)s \ge \alpha \|s\|^2 \quad\text{for all }s\text{ with } J_\mathcal A(x^\ast)s = 0,$$where $\mathcal L$ is the ∆lagrangian-of-constrained-optimisation-problem.

Then there is a unique $\mathcal C^1$ path $\mu \mapsto x(\mu)$ of (global) minimisers of $f_\mu$ defined for all $\mu > 0$ sufficiently small, and $x(\mu) \to x^\ast$ as $\mu \to 0$.

Consider the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0
$$
and its ∆logarithmic-barrier-function
$$
f_\mu(x) = f(x) - \mu \sum_i \log c_i(x).
$$
and the corresponding strictly feasible set
$$
\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c_i(x) > 0 \text{ for all }1 \le i \le p\}
$$
i.e. the interior of the feasible region. Recall that the central path is the set
$$
\{x(\mu) : \mu > 0\}
$$
where $x(\mu)$ is a (global) minimiser of $f_\mu$ on $\Omega^o$ and that we have the theorem ∆central-path-existence guaranteeing that when $\Omega^o \ne \emptyset$, and $x^\ast$ is a local minimiser of $\text{iCP}$ with KKT multipliers $\lambda^\ast$, satisfying:

1. **Strict complementarity**: $\lambda_i^\ast > 0$ whenever $c_i(x^\ast) = 0$.
2. **∆licq**: $\nabla c_i(x^\ast), i \in \mathcal A := \{i : c_i(x^\ast) = 0\}$, are linearly independent.
3. **Second-order sufficient condition**: there exists $\alpha > 0$ such that $$s^\top \nabla^2_{xx} \mathcal L(x^\ast, \lambda^\ast)s \ge \alpha \|s\|^2 \quad\text{for all }s\text{ with } J_\mathcal A(x^\ast)s = 0,$$where $\mathcal L$ is the ∆lagrangian-of-constrained-optimisation-problem.

Then there is a unique $\mathcal C^1$ path $\mu \mapsto x(\mu)$ of (global) minimisers of $f_\mu$ defined for all $\mu > 0$ sufficiently small, and $x(\mu) \to x^\ast$ as $\mu \to 0$.

@Visualise the central path trajectory for two problems: one convex, and one nonconvex. What is the difference between them?::

**Convex**:

![Screenshot 2026-05-13 at 10.38.22.png](https://ollybritton.com/assets/attachments/img/Screenshot 2026-05-13 at 10.38.22.png)

**Nonconvex**:

![Screenshot 2026-05-13 at 10.38.47.png](https://ollybritton.com/assets/attachments/img/Screenshot 2026-05-13 at 10.38.47.png)

**Difference**:

In the nonconvex case, there are two local minimisers. Hence there are two central paths starting within the feasible region (for $\mu$ sufficiently small), both of which converge to each local minimiser.

Consider the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0
$$
and its ∆logarithmic-barrier-function
$$
f_\mu(x) = f(x) - \mu \sum_i \log c_i(x).
$$
and the corresponding strictly feasible set
$$
\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c_i(x) > 0 \text{ for all }1 \le i \le p\}
$$
i.e. the interior of the feasible region. Recall that the central path is the set
$$
\{x(\mu) : \mu > 0\}
$$
where $x(\mu)$ is a (global) minimiser of $f_\mu$ on $\Omega^o$.

In this context, @state the basic barrier method @algorithm.::

**Inputs**: initial $\mu^0 > 0$, starting iterate $x^0 \in \Omega^o$.

**Algorithm**: Let $k = 0$. Until convergence:

- Choose $0 < \mu^{k+1} < \mu^k$ (commonly $\mu^{k+1} = 0.1 \mu^k$ or $\mu^{k+1} = (\mu^k)^2$).
- Choose a starting point $x_0^k \in \Omega^o$ for the inner minimisation (commonly $x_0^k := x^k$).
- *Inner minimisation*: from $x_0^k$, use an unconstrained minimiser to find an approximate minimiser $x^{k+1}$ of $f_{\mu^{k+1}}$.
- $k := k+1$

**Convergence requirement**: $\mu^k \to 0$ as $k \to \infty$.

**Inner-minimisation algorithms**: $f_\mu$ has log singularities at the boundary of $\Omega$, so we may use:

- *Linesearch methods*: use special linesearch to cope with the singularity of the log.
- *Trust-region methods*: shape the trust region to cope with the log contours, reject any candidate with $c(x^k + s^k) \not> 0$.

@exam~

Suppose we have the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{ s.t } \qquad c(x) \ge 0,
$$
and the ∆logarithmic-barrier-function $f_\mu(x) = f(x) - \mu \sum_i \log c_i(x)$. Consider any sequence of iterates $(x^k, \mu^k)$ where $x^k$ is an approximate stationary point of $f_{\mu^k}$ with $c(x^k) > 0$, i.e. those produced by ∆basic-barrier-method-algorithm. Let $\lambda_i^k := \mu^k / c_i(x^k)$ be the Lagrange multiplier estimates (from ∆barrier-multiplier-estimates).

@State a theorem about the global convergence of this scheme.::

Suppose:

- $f, c \in \mathcal C^1$
- $\|\nabla f_{\mu^k}(x^k) \| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
- $\mu^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c_i(x^\ast), i \in \mathcal A := \{i : c_i(x^\ast) = 0\}$, are linear independent (∆licq at $x^\ast$)

Then $x^\ast$ is a KKT point of $\text{iCP}$, and $\lambda^k \to \lambda^\ast$, where $\lambda^\ast$ is the vector of Lagrange multipliers at $x^\ast$.

(compare to ∆quadratic-penalty-method-convergence, ∆aug-lagrangian-global-convergence).

@exam~

Suppose we have the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{ s.t } \qquad c(x) \ge 0,
$$
and the ∆logarithmic-barrier-function $f_\mu(x) = f(x) - \mu \sum_i \log c_i(x)$. Consider any sequence of iterates $(x^k, \mu^k)$ where $x^k$ is an approximate stationary point of $f_{\mu^k}$ with $c(x^k) > 0$, i.e. those produced by ∆basic-barrier-method-algorithm. Let $\lambda_i^k := \mu^k / c_i(x^k)$ be the Lagrange multiplier estimates (from ∆barrier-multiplier-estimates).

@Prove that if

- $f, c \in \mathcal C^1$
- $\|\nabla f_{\mu^k}(x^k) \| \le \epsilon^k$, where $\epsilon^k \to 0$ as $k \to \infty$
- $\mu^k \to 0$ as $k \to \infty$
- $x^k \to x^\ast$ where $\nabla c_i(x^\ast), i \in \mathcal A := \{i : c_i(x^\ast) = 0\}$, are linear independent (∆licq at $x^\ast$)

then $x^\ast$ is a KKT point of $\text{iCP}$, and $\lambda^k \to \lambda^\ast$, where $\lambda^\ast$ is the vector of Lagrange multipliers at $x^\ast$.::

By ∆licq, $J_\mathcal A(x^\ast)$ has full row rank, so the pseudo-inverse $J_{\mathcal A}(x^\ast)^+ = (J_\mathcal A J_\mathcal A^\top)^{-1} J_{\mathcal A}$ is well-defined. By continuity, $J_{\mathcal A}(x^k)^+$ exists, is bounded, and converges to $J_{\mathcal A}(x^\ast)^+$ for $k$ large.

Let $\mathcal I = \{1, \ldots, p\} \setminus \mathcal A$. From $c(x^k) > 0$ and $x^k \to x^\ast$: $c(x^\ast) \ge 0$ with $c_{\mathcal A}(x^\ast) = 0$ and $c_{\mathcal I}(x^\ast) > 0$, so $x^\ast$ is feasible (primal feasibility).

Define $\lambda^\ast = (\lambda_{\mathcal A}^\ast, \lambda^\ast_{\mathcal I})$ via
$$
\lambda_{\mathcal A}^\ast := J_{\mathcal A}(x^\ast)^+ \nabla f(x^\ast), \qquad \lambda^\ast_{\mathcal I} := 0.
$$
**Complementary slackness $\lambda_i^\ast c_i(x^\ast) = 0$** is immediate. $\lambda_{\mathcal I}^\ast = 0$ means inactive $c_{\mathcal A}(x^\ast) = 0$ kills active ones.

It remains to show that:

1. $\lambda^k_{\mathcal I} \to 0$
2. $\lambda^k_{\mathcal A} \to \lambda^\ast_{\mathcal A} \ge 0$
3. Lagrangian stationarity $\nabla f(x^\ast) = J_{\mathcal A}(x^\ast)^\top \lambda^\ast_{\mathcal A}$.

**For (1),** note that for $i \in \mathcal I$, $\lambda^k_i = \mu^k / c_i(x^k) \to / c_i(x^\ast) = 0$, since $\mu^k \to 0$ and $c_i(x^\ast) > 0$.

**For (2)**, write $r^k := \nabla f(x^k) - J_{\mathcal A}(x^k)^\top \lambda^k_{\mathcal A}$ for the active-block residual. The multiplier estimate (∆barrier-multiplier-estimates) gives $\nabla f_{\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$, and splitting $J^\top \lambda^k = J_{\mathcal A}^\top \lambda^k_{\mathcal A} + J_{\mathcal I}^\top \lambda^k_{\mathcal I}$ rearranges this to $r^k = \nabla f_{\mu^k}(x^k) + J_{\mathcal I}(x^k)^\top \lambda^k_{\mathcal I}$. Hence
$$
\begin{aligned}
\|r^k\| &= \|\nabla f_{\mu^k}(x^k) + J_{\mathcal I}(x^k)^\top \lambda^k_{\mathcal I}\| &&(\star 1) \\
&\le \|\nabla f_{\mu^k}(x^k)\| + \|J_{\mathcal I}(x^k)\| \cdot \|\lambda^k_{\mathcal I}\| &&(\star 2) \\
&\le \epsilon^k + \|J_{\mathcal I}(x^k)\| \cdot \|\lambda^k_{\mathcal I}\| &&(\star 3) \\
&\to 0 &&(\star 4)
\end{aligned}
$$
where:

- $(\star 1)$: substituting the identity $r^k = \nabla f_{\mu^k}(x^k) + J_{\mathcal I}(x^k)^\top \lambda^k_{\mathcal I}$ derived above.
- $(\star 2)$: triangle inequality, then submultiplicativity $\|J_{\mathcal I}(x^k)^\top \lambda^k_{\mathcal I}\| \le \|J_{\mathcal I}(x^k)\| \cdot \|\lambda^k_{\mathcal I}\|$.
- $(\star 3)$: the approximate-stationarity hypothesis $\|\nabla f_{\mu^k}(x^k)\| \le \epsilon^k$.
- $(\star 4)$: $\epsilon^k \to 0$; $\|J_{\mathcal I}(x^k)\|$ is bounded ($J_{\mathcal I}$ continuous along $x^k \to x^\ast$); and $\|\lambda^k_{\mathcal I}\| \to 0$ by **(1)**.

LICQ makes $J_{\mathcal A}(x^k)$ full row rank, so $J_{\mathcal A}(x^k)^+ J_{\mathcal A}(x^k)^\top = (J_{\mathcal A} J_{\mathcal A}^\top)^{-1}(J_{\mathcal A} J_{\mathcal A}^\top) = I$; left-multiplying $r^k = \nabla f(x^k) - J_{\mathcal A}(x^k)^\top \lambda^k_{\mathcal A}$ by $J_{\mathcal A}(x^k)^+$ and rearranging gives $\lambda^k_{\mathcal A} = J_{\mathcal A}(x^k)^+ \nabla f(x^k) - J_{\mathcal A}(x^k)^+ r^k$. Hence
$$
\begin{aligned}
\|\lambda^k_{\mathcal A} - \lambda^\ast_{\mathcal A}\| &= \left\|\left(J_{\mathcal A}(x^k)^+ \nabla f(x^k) - \lambda^\ast_{\mathcal A}\right) - J_{\mathcal A}(x^k)^+ r^k\right\| &&(\star 5) \\
&\le \|J_{\mathcal A}(x^k)^+ \nabla f(x^k) - \lambda^\ast_{\mathcal A}\| + \|J_{\mathcal A}(x^k)^+\| \cdot \|r^k\| &&(\star 6) \\
&\to 0 &&(\star 7)
\end{aligned}
$$
where:

- $(\star 5)$: substituting $\lambda^k_{\mathcal A} = J_{\mathcal A}(x^k)^+ \nabla f(x^k) - J_{\mathcal A}(x^k)^+ r^k$ and grouping the two limit-bearing pieces.
- $(\star 6)$: triangle inequality and submultiplicativity $\|J_{\mathcal A}(x^k)^+ r^k\| \le \|J_{\mathcal A}(x^k)^+\| \cdot \|r^k\|$.
- $(\star 7)$: the first term $\to 0$ because $J_{\mathcal A}(x^k)^+ \nabla f(x^k) \to J_{\mathcal A}(x^\ast)^+ \nabla f(x^\ast) = \lambda^\ast_{\mathcal A}$ (continuity of $J_{\mathcal A}^+$ and $\nabla f$, $x^k \to x^\ast$, and the definition of $\lambda^\ast_{\mathcal A}$); the second term $\to 0$ because $\|J_{\mathcal A}(x^k)^+\|$ is bounded and $\|r^k\| \to 0$ by $(\star 4)$.

Therefore $\lambda^k_{\mathcal A} \to \lambda^\ast_{\mathcal A}$, and since $\lambda^k > 0$ for all $k$, $\lambda^\ast_{\mathcal A} \ge 0$ (dual feasibility).

**For (3)**, pass to the limit in $r^k = \nabla f(x^k) - J_{\mathcal A}(x^k)^\top \lambda^k_{\mathcal A}$. By continuity and **(2)**, it converges to $\nabla f(x^\ast) - J_{\mathcal A}(x^\ast)^\top \lambda^\ast_{\mathcal A}$. But then $(\star 4)$ says it also converges to $0$. Hence
$$
\nabla f(x^\ast) = J_{\mathcal A}(x^\ast)^\top \lambda^\ast_{\mathcal A} = \sum_{i \in \mathcal A} \lambda_i^\ast \nabla c_i(x^\ast).
$$

All four ∆kkt-conditions hold at $(x^\ast, \lambda^\ast)$, so $x^\ast$ is KKT, and $\lambda^k \to \lambda^\ast$ overall.

@exam~

Consider the inequality-constrained problem ∆icp and its ∆logarithmic-barrier-function $f_\mu(x) = f(x) - \mu \sum_i \log c_i(x)$, with multiplier estimates $\lambda (x) = \mu c(x)^{-1}$ (entry-wise, from ∆barrier-multiplier-estimates).

@State $\nabla f_\mu$ and $\nabla^2 f_\mu$ in terms of the $\text{iCP}$ ∆lagrangian-of-constrained-optimisation-problem (which collapses to $\mathcal L(x, \lambda) = f(x) - \sum_i \lambda_i c_i(x)$ for inequality-only problems).::

**Gradient**:
$$
\begin{aligned}
\nabla f_\mu(x) &= \nabla f(x) - \mu J(x)^\top c(x)^{-1} \\
&= \nabla f(x) - J(x)^\top \lambda(x) \\
&= \nabla_x \mathcal L(x, \lambda(x)).
\end{aligned}
$$
So minimising $f_\mu$ in $x$ is equivalent to seeking Lagrangian stationarity $(x, \lambda(x))$.

**Hessian**: (writing $C(x) := \text{diag}(c_1(x), \ldots, c_p(x))$):
$$
\nabla^2 f_\mu(x) = \underbrace{\nabla^2 f(x) - \sum^p_{i = 1} \frac{\mu}{c_i(x)} \nabla^2 c_i(x)}_{=\,\nabla^2_{xx} \mathcal L(x, \lambda(x))} + \mu J(x)^\top C(x)^{-2} J(x).
$$
The "extra" term $\mu J(x)^\top C(x)^{-2} J(x)$ comes from differentiating the multiplier estimate $\lambda(x) = \mu c(x)^{-1}$: each entry $\lambda_i(x) = \mu/c_i(x)$ contributes $-(\mu/c_i(x)^2)\,\nabla c_i(x) \nabla c_i(x)^\top$ when re-differentiated. For $i \in \mathcal A$ (active at $x^\ast$), $c_i(x(\mu)) = O(\mu)$ along the central path, so $\mu/c_i(x)^2 = O(1/\mu) \to \infty$ as $\mu \to 0$ — this is the mechanism behind ∆barrier-hessian-ill-conditioning.

@exam~

In the setup of ∆barrier-lagrangian-connection,
$$
\nabla^2 f_\mu(x) = \nabla^2_{xx} \mathcal L(x, \lambda(x)) + \mu J(x)^\top C(x)^{-2} J(x)
$$
@Describe the behaviour of $\nabla^2 f_\mu$ as $\mu \to 0$ along the ∆central-path, and the consequences for Newton-based barrier methods.::

Along the central path with $x(\mu) \to x^\ast$ and $\lambda_i(\mu) = \mu/c_i(x(\mu))$ staying bounded (i.e. $c_i(x(\mu)) \to c_i(x^\ast)$ at rate $O(\mu)$ for active constraints):

- $i \in \mathcal A$ (active, $c_i(x^\ast) = 0$): $\mu / c_i(x(\mu))^2 \to \infty$
- $i \in \mathcal I$ (inactive, $c_i(x^\ast) > 0$): $\mu / c_i(x(\mu))^2 \to 0$.

So as $\mu \to 0$ (along with a Fact from Gould), $\nabla^2 f_{\mu}(x(\mu))$ splits into:

- $p_a := |\mathcal A|$ eigenvalues of $\nabla^2 f_{\mu^k}(x^k)$ tend to infinity as $k \to \infty$.
- $n - p_a$ eigenvalues that are $O(1)$ (from $\nabla^2_{xx} \mathcal L$) (inactive contributions to the extra term vanish)
- The condition number of $\nabla^2 f_{\mu^k}(x^k)$ is $O(1/\mu^k)$.

This blows up as $k \to \infty$, and we may not able to compute $x^k$ accurately.

@exam~

Consider the ∆basic-barrier-method-algorithm. At each outer iteration we decrease $\mu^k$ to $\mu^{k+1}$ and need a starting point $x^k_0$ for the inner minimisation of $f_{\mu^{k+1}}$. The obvious choice $x^k_0 := x^k$ turns out to be a very poor one. @Describe what goes wrong.::

Even though $x^k$ is an approximate minimiser of $f_{\mu^k}$ (and hence interior), once we shift to the new barrier $f_{\mu^{k+1}}$, the full Newton step $s^k$ from $x^k_0 = x^k$ is asymptotically infeasible.:
$$
c(x^k + s^k) < 0 \quad \text{whenever } \mu^{k+1} < 0.5 \mu^k.
$$
Any non-trivial decrease in $\mu$ (anything more aggressive than halving) drives the Newton iterate out of $\Omega^o$. So either:

- We must take a damped step (small $\alpha^k$), which loses the local quadratic convergence of Newton, or
- Use only small decreases $\mu^{k+1} \ge 0.5 \mu^k$, which means we have very slow outer convergence.

Consider the inequality-constrained optimisation problem ∆icp
$$
\min_{x \in \mathbb R^n} f(x) \qquad \text{s.t.}\qquad c(x) \ge 0
$$
and its ∆logarithmic-barrier-function
$$
f_\mu(x) = f(x) - \mu \sum_i \log c_i(x).
$$
and the corresponding strictly feasible set
$$
\Omega^0 := \{x \mid c(x) > 0\} = \{x \mid c_i(x) > 0 \text{ for all }1 \le i \le p\}
$$
i.e. the interior of the feasible region. Recall that the central path is the set
$$
\{x(\mu) : \mu > 0\}
$$
where $x(\mu)$ is a (global) minimiser of $f_\mu$ on $\Omega^o$.

In this context, the basic barrier method algorithm is:

**Inputs**: initial $\mu^0 > 0$, starting iterate $x^0 \in \Omega^o$.

**Algorithm**: Let $k = 0$. Until convergence:

- Choose $0 < \mu^{k+1} < \mu^k$ (commonly $\mu^{k+1} = 0.1 \mu^k$ or $\mu^{k+1} = (\mu^k)^2$).
- Choose a starting point $x_0^k \in \Omega^o$ for the inner minimisation (commonly $x_0^k := x^k$).
- *Inner minimisation*: from $x_0^k$, use an unconstrained minimiser to find an approximate minimiser $x^{k+1}$ of $f_{\mu^{k+1}}$.
- $k := k+1$

**Convergence requirement**: $\mu^k \to 0$ as $k \to \infty$.

**Inner-minimisation algorithms**: $f_\mu$ has log singularities at the boundary of $\Omega$, so we may use:

- *Linesearch methods*: use special linesearch to cope with the singularity of the log.
- *Trust-region methods*: shape the trust region to cope with the log contours, reject any candidate with $c(x^k + s^k) \not> 0$.

@Describe the two issues with this method, and the technique used to address these difficulties.::

- The Hessian of $f_{\mu^k}$ becomes ill-conditioned as $\mu^k \to 0$ (condition number $O(1/\mu^k)$), so e.g. Newton's method will get worse and worse (∆barrier-hessian-ill-conditioning).
- It can be shown that the natural starting point $x^k_0 := x^k$ is a poor choice, since in order for iterates to remain feasible we must either dampen the Newton step (losing quadratic convergence), or only take small decreases in $\mu^k$, which means the outer loop converges very slowly (∆barrier-poor-starting-points).

The technique to address these issues is ∆primal-dual-newton-system, which treat the multiplier estimates as explicit variables.

Consider the inequality-constrained problem ∆icp, its ∆logarithmic-barrier-function $f_\mu$, and a minimiser $x(\mu) \in \Omega^o$ of the barrier subproblem $\text{iCP}_\mu$.

@Derive the *perturbed optimality conditions* ($\text{OPT}_\mu$) that $(x(\mu), \lambda(\mu))$ satisfy, where $\lambda(\mu) := \mu c(x(\mu))^{-1}$ . Compare with the ∆kkt-conditions for ($\text{iCP}$).::

First-order necessary conditions for $\text{iCP}_\mu$ (which is essentially unconstrained, since $f_\mu \to \infty$ at the boundary) give $\nabla f_{\mu} (x(\mu)) = 0$:
$$
\nabla f(x(\mu)) - \mu J(x(\mu))^\top c(x(\mu))^{-1} = 0.
$$
With $\lambda(\mu) := \mu c(x(\mu))^{-1}$, this is Lagrangian stationarity:
$$
\boxed{\nabla f(x(\mu)) - J(x(\mu))^\top \lambda(\mu) = 0}
$$

---

Multiplying $\lambda_i(\mu) = \mu / c_i(x(\mu))$ by $c_i(x(\mu))$ gives perturbed complementarity:
$$
\boxed{c_i(x(\mu))\lambda_i(\mu) = \mu, \quad i = 1, \ldots, p.}
$$

---

Strict feasibility $x(\mu) \in \Omega^o$ gives $\boxed{c > 0, \lambda > 0}$.

---

So $(x(\mu), \lambda(\mu))$ satisfies $\text{OPT}_\mu$:
$$
\begin{cases}
\nabla f(x) - J(x)^\top \lambda = 0 \\
c_i(x) \lambda_i = \mu, &\,i = 1, \ldots ,p
\end{cases}  \qquad \text{with }c(x) > 0, \lambda > 0
$$

Contrast this with KKT for $\text{iCP}$:
$$
\begin{cases}
\nabla f(x) - J(x)^\top \lambda = 0 \\
c_i(x) \lambda_i = 0 &\,i =1, \ldots, p 
\end{cases} \quad \text{with } c(x) \ge 0, \lambda \ge 0.
$$
There are two differences: complementarity is $c_i \lambda_i = \mu$ vs $0$, and the inequalities are strict vs non-strict. As $\mu \to 0$, $\text{OPT}_\mu$ recovers KKT.

**Why this matters**: KKT complementarity $c_i \lambda_i$ is a combinatorial statement: for each $i$, either $c_i = 0$ (active) or $\lambda_i = 0$ (inactive), and the algorithm has to discover which. $\text{OPT}_\mu$ loosens this to a smooth equation $c_i \lambda_i = \mu$, so that every constraint may be "slightly active", and Newton's method now applies to the joint $(x, \lambda)$ unknowns (in contrast to ∆barrier-multiplier-estimates, where $\lambda$ was derived from $x$). As $\mu \to 0$, the active set emerges automatically: active constraints have $c_i \to 0$ with $\lambda_i$ bounded, inactive ones $\lambda_i \to 0$ with $c_i$ bounded. In this view, $\text{OPT}_\mu$ is a smooth perturbation of KKT, and primal-dual methods follow the path of solutions to $\mu = 0$ (see ∆primal-dual-newton-system).

Apply Newton's method to the ∆perturbed-optimality-conditions $\text{OPT}_\mu$:
$$
\begin{cases}
\nabla f(x) - J(x)^\top \lambda = 0 \\
c_i(x) \lambda_i = \mu, &\,i = 1, \ldots ,p
\end{cases}  \qquad \text{with }c(x) > 0, \lambda > 0
$$
treating $(x, \lambda)$ as joint unknowns, maintaining $c(x) > 0, \lambda > 0$. Derive the Newton direction $(dx, d\lambda)$ and its reduced form after eliminating $d\lambda$.::

Write the system as $g(x, \lambda) = 0$ with $g = (g_1; g_2)$, where
$$
\begin{aligned}
g_1(x) &:= \nabla f(x) - J(x)^\top \lambda \\
g_2 (x) &:= C(x) \lambda - \mu e
\end{aligned}
$$
with $C(x) := \text{diag}(c_1(x), \ldots, c_p(x))$ and $e = (1, \ldots, 1)^\top$. Linearising, we obtain

- $\partial g_1 / \partial x = \nabla^2 f(x) - \sum_i \lambda_i \nabla^2 c_i(x) = \nabla^2_{xx} \mathcal L(x, \lambda)$ (cf. ∆barrier-lagrangian-connection)
- $\partial g_1 / \partial \lambda = -J(x)^\top$

and

- $\partial g_2 / \partial x = \Lambda J(x)$ where $\Lambda := \text{diag}(\lambda)$ (the $i$th component $c_i(x) \lambda_i$ differentiates to row $\lambda_i \nabla c_i(x)^\top)$
- $\partial g_2 / \partial \Lambda = C(x)$

Then Newton's method solves for
$$
\begin{pmatrix}
\nabla^2_{xx} \mathcal L(x, \lambda) & -J(x)^\top \\
\Lambda J(x) & C(x)
\end{pmatrix}
\begin{pmatrix}
\text dx \\
\text d\lambda
\end{pmatrix}
=
-\begin{pmatrix}
\nabla f(x) - J(x)^\top \lambda \\
C(x) \lambda - \mu e
\end{pmatrix}
$$
Reduce by eliminating $d\lambda$: from the bottom block,
$$
d\lambda = -\lambda + \mu c(x)^{-1} - C(x)^{-1} \Lambda J(x) \text dx
$$
Substituting into the top block, the $J^\top \lambda$ terms cancel and we get
$$
\begin{aligned}
(\nabla^2_{xx} \mathcal L(x, \lambda) + J(x)^\top C^{-1}(x) \Lambda J(x)) \text dx &= -(\nabla f(x) - \mu J(x)^\top c(x)^{-1}) \\
&= -\nabla f_\mu(x)
\end{aligned}
$$

**Why this matters**: The RHS matches the primal Newton step on $f_\mu$ (∆barrier-lagrangian-connection). The LHS differs by using explicit $\lambda$ where primal methods are forced to use the implicit $\lambda(x) := \mu c(x)^{-1}$. So the multipliers get their own Newton update $\text d \lambda$ rather than being recomputed by division by vanishing $c_i$ each iteration. This decouples the multiplier dynamics from the singularity $\lambda(x) = \mu / c(x)$ that creates the ∆barrier-hessian-ill-conditioning and ∆barrier-poor-starting-points issues. The multipliers are now full state variables that Newton can update directly, rather than quantities derived from $x$.

Consider the inequality-constrained problem ∆icp. Combining ∆perturbed-optimality-conditions with the ∆primal-dual-newton-system Newton step, @state the basic primal-dual interior point method @algorithm for solving $\text{iCP}$ in practice.::

**Inputs**: initial $\mu^0 > 0$ and $(x^0, \lambda^0)$ with $c(x^0) > 0, \lambda^0 > 0$ (e.g. $\lambda_i^0 := \mu^0 / c_i(x^0))$, $\mu$-shrinkage rule (often $\mu^{k+1} := (\mu^k)^2)$.

**Algorithm**: Until $\mu^k$ is small and KKT residual below tolerance:

- Inner Newton on $\text{OPT}_{\mu^k}$, warm-started at $(x^k, \lambda^k)$, repeat a small number of times:
	- Compute $(\text dx, \text d\lambda)$ from ∆primal-dual-newton-system at the current $(x, \lambda)$ with parameter $\mu^k$.
	- Damped step $(x, \lambda) \leftarrow (x + \alpha \text dx , \lambda + \alpha \text d\lambda)$ with $\alpha \in (0, 1]$ chosen by linesearch / trust-region to:
		- Maintain $c(x) > 0$ and $\lambda > 0$ (say strictly inside the cone in both primal and dual);
		- Descent the $\text{OPT}_{\mu^k}$ residual norm.
	- Repeat until approximate stationarity at $\text{OPT}_{\mu^k}$.
- Set $(x^{k+1}, \lambda^{k+1})$ to the converged inner iterate.
- Shrink $\mu^{k+1} := \tau(\mu^k)$, commonly $\mu^{k+1} = (\mu^k)^2$.
- $k := k+1$.

Compare the simplex method and the primal-dual interior point method (∆primal-dual-ipm-algorithm specialised to linear programs) for solving linear programming problems, on worst-case complexity, average-case behaviour, and practical considerations.::

**Simplex method**: walk along the vertices of the feasible polytope, at each pivot stepping to an adjacent vertex with better objective.

- **Worst-case**: exponential in problem dimension
- **Average-cost**: linear polynomial in practice
- **Per-iteration cost**: cheap pivot operation on the simplex tableau

**Primal-dual IPM**: cuts through the interior along the central path.

- **Worst-case**: polynomial in dimension and input length
- **Average-cost**: $O(\log (\text{LP dimension}))$ outer iterations until convergence
- **Per-iteration cost**: $O(N^3)$ for the primal-dual Newton solve

In practice:

- For small / medium LPs, simplex is often faster
- For very large linear programs, IPMs are often faster

### Bite-sized
**Strict complementarity** at a KKT point of (iCP) holds iff for every active inequality $c_i(x^\ast) = 0$, the corresponding multiplier is $\lambda_i^\ast > 0$ (strictly). This rules out "weakly active" constraints where both $c_i = 0$ and $\lambda_i = 0$, and is a hypothesis of the ∆central-path-existence theorem.

**Source**: Lectures 14-15, **The 'central' path of barrier minimizers exists locally** slide (Theorem 27).

@bite~

In primal-dual interior point methods, the central path is followed by Newton iterates on the perturbed system $\text{OPT}_\mu$ with the shrinkage rule $\mu^{k+1} = (\mu^k)^2$  (quadratic), giving fast asymptotic convergence as $\mu \to 0$.

**Source**: Lectures 14-15, **Primal-dual path-following methods — practical aspects** slide.

@bite~

Why does the primal-dual approach succeed where the basic primal barrier method gets stuck? Because in primal-dual methods, the multipliers $\lambda$ are explicit state variables updated by Newton, whereas in primal methods they are implicitly $\lambda(x) := \mu/c(x)$, derived from $x$ by division by potentially-vanishing $c_i$ — which is exactly what causes the Hessian ill-conditioning and the poor-starting-point pathology.

**Source**: Lectures 14-15, **Primal-dual versus primal methods** slide; cf. ∆primal-dual-newton-system, ∆barrier-hessian-ill-conditioning.

@bite~

**Historical milestone**: IPMs solve linear programs in polynomial time  in the worst case, while the simplex method has exponential  worst-case running time (Klee-Minty 1972). Karmarkar's 1984 result was the breakthrough that re-popularised barrier-style methods after they fell out of favour in the 1960s.

**Source**: Lectures 14-15, **The simplex versus interior point methods for LP** slide.

@bite~

The **logarithmic** barrier $-\log c_i(x)$ blows up to $+\infty$ as $c_i(x) \to 0^+$, so the unconstrained minimiser of $f_\mu$ on $\{c > 0\}$ stays strictly inside $\Omega^o$. This is why we call $-\mu \sum_i \log c_i(x)$ a barrier function — it actively repels iterates from the boundary of $\Omega$, rather than penalising violations after the fact.

**Source**: Lectures 14-15, **The logarithmic barrier function for (iCP)** discussion.

@bite~

Give the proof strategy for ∆barrier-method-global-convergence-proof (Theorem 28: under LICQ, the barrier method's iterates $x^k$ converge to a KKT point of (iCP), and multiplier estimates $\lambda^k \to \lambda^\ast$).

::

- **Setup**: LICQ ⟹ $J_{\mathcal A}(x^\ast)$ has full row rank, so the pseudo-inverse $J_{\mathcal A}(x^\ast)^+ = (J_{\mathcal A} J_{\mathcal A}^\top)^{-1} J_{\mathcal A}$ exists; by continuity it exists, is bounded, and is continuous at $x^k$ for large $k$. Split indices into active $\mathcal A := \{i : c_i(x^\ast) = 0\}$ and inactive $\mathcal I$.
- **Define the candidate** $\lambda^\ast = (\lambda^\ast_{\mathcal A}, 0)$ via $\lambda^\ast_{\mathcal A} := J_{\mathcal A}(x^\ast)^+ \nabla f(x^\ast)$. Complementary slackness $\lambda^\ast_i c_i(x^\ast) = 0$ is then immediate.
- **Step 1 (inactive multipliers vanish)**: $\lambda^k_i = \mu^k / c_i(x^k) \to 0 / c_i(x^\ast) = 0$ for $i \in \mathcal I$, using $\mu^k \to 0$ and $c_i(x^\ast) > 0$.
- **Step 2 (active multipliers converge)**: starting from $\nabla f_{\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$, isolate the active part and bound $\|\nabla f(x^k) - J_{\mathcal A}(x^k)^\top \lambda^k_{\mathcal A}\| \le \epsilon^k + \|J_{\mathcal I}(x^k)\| \|\lambda^k_{\mathcal I}\| \to 0$. Apply $J_{\mathcal A}(x^k)^+$ to deduce $\lambda^k_{\mathcal A} \to J_{\mathcal A}(x^\ast)^+ \nabla f(x^\ast) = \lambda^\ast_{\mathcal A}$. Since $\lambda^k > 0$, $\lambda^\ast_{\mathcal A} \ge 0$.
- **Step 3 (stationarity)**: passing to the limit in $\nabla f(x^k) - J_{\mathcal A}(x^k)^\top \lambda^k_{\mathcal A}$ gives both $\nabla f(x^\ast) - J_{\mathcal A}(x^\ast)^\top \lambda^\ast_{\mathcal A}$ (by continuity) and $0$ (by Step 2's bound), hence $\nabla f(x^\ast) = J_{\mathcal A}(x^\ast)^\top \lambda^\ast_{\mathcal A}$ — Lagrangian stationarity.

All four KKT pieces (primal feasibility, stationarity, dual feasibility, complementary slackness) hold at $(x^\ast, \lambda^\ast)$.

**Source**: Lectures 14-15, **Proof of Theorem 28**.

@bite~ @proofsupport~

Step 1 of ∆barrier-method-global-convergence-proof (Theorem 28): the inactive multipliers go to zero. For $i \in \mathcal I$ (i.e. $c_i(x^\ast) > 0$),
$$\lambda^k_i = \frac{\mu^k}{c_i(x^k)} \longrightarrow \frac{0}{c_i(x^\ast)} = 0 $$
using $\mu^k \to 0$ and $c_i(x^k) \to c_i(x^\ast) > 0$. This is what gives complementary slackness on inactive constraints in the limit.

**Source**: Lectures 14-15, **Proof of Theorem 28**, step (i).

@bite~ @proofsupport~

Step 2 of ∆barrier-method-global-convergence-proof (Theorem 28): the active-block residual goes to zero. Starting from $\nabla f_{\mu^k}(x^k) = \nabla f(x^k) - J(x^k)^\top \lambda^k$ and splitting $J^\top \lambda^k = J_{\mathcal A}^\top \lambda^k_{\mathcal A} + J_{\mathcal I}^\top \lambda^k_{\mathcal I}$, the triangle inequality gives
$$\|\nabla f(x^k) - J_{\mathcal A}(x^k)^\top \lambda^k_{\mathcal A}\| \le \|\nabla f_{\mu^k}(x^k)\| + \|J_{\mathcal I}(x^k)\| \cdot \|\lambda^k_{\mathcal I}\| \le \epsilon^k + \|J_{\mathcal I}(x^k)\| \|\lambda^k_{\mathcal I}\|  \to 0,$$
using the approximate-stationarity hypothesis $\|\nabla f_{\mu^k}(x^k)\| \le \epsilon^k \to 0$, boundedness of $\|J_{\mathcal I}(x^k)\|$, and $\lambda^k_{\mathcal I} \to 0$ from Step 1.

**Source**: Lectures 14-15, **Proof of Theorem 28**, inequality $(\diamondsuit)$.

@bite~ @proofsupport~

In ∆barrier-method-global-convergence-proof (Theorem 28), where does the LICQ hypothesis (linear independence of $\nabla c_i(x^\ast)$ for $i \in \mathcal A$) enter, and what specifically breaks if LICQ fails?

::

- LICQ ⟺ $J_{\mathcal A}(x^\ast)$ has full row rank ⟺ $J_{\mathcal A} J_{\mathcal A}^\top$ is invertible, so the pseudo-inverse $J_{\mathcal A}(x^\ast)^+ := (J_{\mathcal A} J_{\mathcal A}^\top)^{-1} J_{\mathcal A}$ exists, is bounded, and (by continuity) is continuous at $x^k$ for large $k$. Without this, the pseudo-inverse blows up or is undefined.
- The pseudo-inverse is used twice: to *define* $\lambda^\ast_{\mathcal A} := J_{\mathcal A}(x^\ast)^+ \nabla f(x^\ast)$, and in Step 2 to convert the active-block residual bound into a bound on $\lambda^k_{\mathcal A}$ itself ("apply $J_{\mathcal A}(x^k)^+$ to both sides").
- Without LICQ: the multiplier candidate $\lambda^\ast$ may not exist, or be non-unique; the limit $\lambda^k$ can blow up or oscillate; convergence to a KKT point cannot be proved this way.

The same role is played by LICQ in the analogous penalty-method proof (∆quadratic-penalty-method-convergence-proof) and aug-Lagrangian proof (∆aug-lagrangian-global-convergence-proof).

**Source**: Lectures 14-15, **Theorem 28** statement and proof setup.

@bite~ @proofsupport~

### Visualising the central path

The ∆logarithmic-barrier-function $f_\mu = f - \mu \sum_i \log c_i$ turns the inequalities into a smooth wall (∆visualise-log-barrier-contours), and its minimisers trace the ∆central-path as $\mu$ varies. Slide $\mu$ or step the ∆basic-barrier-method-algorithm and watch $x(\mu)$ leave the analytic centre (large $\mu$, deep inside) and slide to the constrained optimum as $\mu \to 0$; drag the objective's minimiser to change where it ends up. The ∆barrier-multiplier-estimates $\lambda_i(\mu) = \mu / c_i(x(\mu))$ shown alongside satisfy the perturbed complementarity $\lambda_i c_i = \mu$ and converge to the KKT multipliers.

{% include barrier_method.liquid %}

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt
