Machine Learning MT23, Optimisation


Flashcards

Suppose we have the general constrained minimisation problem

\[\begin{aligned} \text{minimise:} \quad &F(\pmb z)\\\\ \text{subject to:} \quad &g _ i(\pmb z) \ge 0 \\\\ &h _ j(\pmb z) = 0 \end{aligned}\]

with Lagrangian

\[\Lambda(\pmb z, \pmb \alpha, \pmb \mu) = F(\pmb z) - \sum^m _ {i=1} \alpha _ i g _ i(\pmb z) - \sum^l _ {j = 1} \mu _ j h _ j(\pmb z)\]

Can you state the KKT (Karush-Kuhn-Tucker) conditions, which give necessary conditions for a stationary point $(\pmb z^\star, \pmb \alpha^\star, \pmb \mu^\star)$ to be a minimum of this problem, and what each means intuitively?


  • Primal feasibility
    • $g _ i(\pmb z^\star) \ge 0$
    • $h _ j(\pmb z^\star) = 0$
    • Satisfies the actual constraints
  • Dual feasibility
    • $\alpha _ i^\star \ge 0$
    • The multipliers on the inequality constraints must be nonnegative, similar to how all the dual variables introduced in an LP are nonnegative. Intuitively, $\alpha _ i^\star$ measures how hard constraint $i$ “pushes back” at the optimum, and a constraint $g _ i(\pmb z) \ge 0$ can only push the solution towards feasibility, never pull it out.
  • Complementary slackness
    • $\alpha _ i^\star g _ i(\pmb z^\star) = 0$
    • Either the solution lies on the boundary of an inequality constraint ($g _ i(\pmb z^\star) = 0$), or it lies strictly inside the feasible region of that constraint, in which case the constraint is inactive and we can ignore it ($\alpha _ i^\star = 0$).
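As a concrete sanity check (a toy problem of my own, not from the notes): minimise $z^2$ subject to $z - 1 \ge 0$. The constraint is active at the optimum $z^\star = 1$, and stationarity of the Lagrangian ($2z - \alpha = 0$) gives $\alpha^\star = 2$, so all the conditions can be verified directly:

```python
# Toy check of the KKT conditions for: minimise z^2 subject to z - 1 >= 0.
F = lambda z: z**2
dF = lambda z: 2.0 * z      # gradient of the objective
g = lambda z: z - 1.0       # inequality constraint, g(z) >= 0
dg = lambda z: 1.0          # gradient of the constraint

# Candidate: constraint active, so z* = 1; stationarity 2z - alpha = 0
# then gives alpha* = 2.
z_star, alpha_star = 1.0, 2.0

assert g(z_star) >= 0                                      # primal feasibility
assert alpha_star >= 0                                     # dual feasibility
assert abs(alpha_star * g(z_star)) < 1e-12                 # complementary slackness
assert abs(dF(z_star) - alpha_star * dg(z_star)) < 1e-12   # stationarity of Lagrangian
print("all KKT conditions hold")
```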

How could you rephrase

\[\min \quad \mathcal L(\pmb w) = \sum^N _ {i = 1} \vert \pmb x _ i^\top \pmb w - y _ i \vert\]

as a linear programming problem?


\[\begin{aligned} \min& && \sum^N _ {i = 1} \zeta _ i \\\\ \text{s.t.}& &&\pmb w^\top \pmb x _ i - y _ i \le \zeta _ i &&\forall i \\\\ & &&y _ i - \pmb w^\top \pmb x _ i \le \zeta _ i &&\forall i \\\\ \end{aligned}\]
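To see why this works: for any fixed $\pmb w$, the two constraints together say $\zeta _ i \ge \vert \pmb w^\top \pmb x _ i - y _ i \vert$, and minimising $\sum _ i \zeta _ i$ drives each $\zeta _ i$ down to that bound, so the LP objective equals the original $L _ 1$ loss. A quick numerical sketch (toy data of my own, not from the notes):

```python
# For a fixed w, the tightest feasible slack is zeta_i = |x_i'w - y_i|,
# so the LP objective sum(zeta_i) matches the original L1 loss.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # rows x_i
ys = [1.0, 2.0, 2.5]
w = [0.8, 1.9]                              # an arbitrary candidate w

residuals = [sum(xij * wj for xij, wj in zip(x, w)) - y
             for x, y in zip(xs, ys)]
l1_loss = sum(abs(r) for r in residuals)

# smallest zeta_i satisfying both  x_i'w - y_i <= zeta_i  and  y_i - x_i'w <= zeta_i
zetas = [max(r, -r) for r in residuals]
assert all(r <= z and -r <= z for r, z in zip(residuals, zetas))
assert abs(sum(zetas) - l1_loss) < 1e-12
print(l1_loss)
```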

Suppose we are optimising some parameters $\pmb w$ using gradient descent. Can you give the iterative step used in gradient descent methods?


\[\begin{aligned} \pmb w _ {t+1} &= \pmb w _ t - \eta _ t \pmb g _ t \\\\ &= \pmb w _ t - \eta _ t \nabla f(\pmb w _ t) \end{aligned}\]

where:

  • $\eta _ t$ is the learning rate
  • $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$.
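The update above can be sketched on a toy objective (my own example, with a fixed learning rate rather than a schedule $\eta _ t$):

```python
# Gradient descent on f(w) = (w - 3)^2 with a constant learning rate.
def grad(w):
    return 2.0 * (w - 3.0)      # nabla f(w_t), i.e. g_t

w, eta = 0.0, 0.1
for t in range(100):
    w = w - eta * grad(w)       # w_{t+1} = w_t - eta_t * g_t

# the iterates contract towards the minimiser w = 3
assert abs(w - 3.0) < 1e-6
print(w)
```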

Gradient descent uses the following iterative step:

\[\begin{aligned} \pmb w _ {t+1} &= \pmb w _ t - \eta _ t \pmb g _ t \\\\ &= \pmb w _ t - \eta _ t \nabla f(\pmb w _ t) \end{aligned}\]

where:

  • $\eta _ t$ is the learning rate
  • $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$

In this context, what is “line-search”?


A technique for picking $\eta _ t$ at each step: take the global minimum of the univariate function $\phi(\eta) = f(\pmb w _ t - \eta \pmb g _ t)$, obtained by restricting $f$ to the line through $\pmb w _ t$ in the direction of the negative gradient.
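For a quadratic $f(\pmb w) = \frac 1 2 \pmb w^\top \pmb A \pmb w$, $\phi(\eta)$ is itself a quadratic and the line search has the closed form $\eta _ t = \pmb g _ t^\top \pmb g _ t / (\pmb g _ t^\top \pmb A \pmb g _ t)$. A toy sketch (my own example, not from the notes):

```python
# Steepest descent with exact line search on f(w) = 1/2 w'Aw,
# whose minimum is at w = 0.  Gradient: Aw.
A = [[3.0, 1.0], [1.0, 2.0]]                 # symmetric positive definite

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w = [4.0, -2.0]
for t in range(50):
    g = matvec(A, w)                         # g_t = Aw_t
    if dot(g, g) < 1e-30:                    # already converged
        break
    eta = dot(g, g) / dot(g, matvec(A, g))   # exact minimiser of phi(eta)
    w = [wi - eta * gi for wi, gi in zip(w, g)]

assert all(abs(wi) < 1e-8 for wi in w)
print(w)
```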

Suppose we have a function $f : \mathbb R^N \to \mathbb R$. What is the local quadratic approximation of $f$ at $\pmb w _ t$?


\[f _ \text{quad}(\pmb w) = f(\pmb w _ t) + \pmb g _ t^\top (\pmb w - \pmb w _ t) + \frac 1 2 (\pmb w - \pmb w _ t)^\top \pmb H _ t (\pmb w- \pmb w _ t)\]

where

  • $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$
  • $\pmb H _ t$ is the Hessian of $f$ at $\pmb w _ t$.
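The approximation matches $f$ up to second order, so the error at $\pmb w _ t + \pmb d$ shrinks like $O(\Vert \pmb d \Vert^3)$. A quick check with $f(w _ 1, w _ 2) = w _ 1^3 + w _ 1 w _ 2 + w _ 2^2$ at $\pmb w _ t = (1, 1)$ (a toy function of my own):

```python
# Local quadratic approximation of f(w1, w2) = w1^3 + w1*w2 + w2^2 at (1, 1).
def f(w1, w2):
    return w1**3 + w1 * w2 + w2**2

wt = (1.0, 1.0)
g = (3 * wt[0]**2 + wt[1], wt[0] + 2 * wt[1])   # gradient at w_t
H = ((6 * wt[0], 1.0), (1.0, 2.0))              # Hessian at w_t

def f_quad(w1, w2):
    d = (w1 - wt[0], w2 - wt[1])
    lin = g[0] * d[0] + g[1] * d[1]
    quad = 0.5 * (H[0][0]*d[0]*d[0] + 2*H[0][1]*d[0]*d[1] + H[1][1]*d[1]*d[1])
    return f(*wt) + lin + quad

# halving the displacement should cut the error roughly 8x (third order)
e1 = abs(f(1.2, 1.1) - f_quad(1.2, 1.1))
e2 = abs(f(1.1, 1.05) - f_quad(1.1, 1.05))
assert e2 < e1 / 4      # loose check of the cubic error decay
```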

For a function $f : \mathbb R^N \to \mathbb R$ we have the local quadratic approximation of $f$ at $\pmb w _ t$

\[f _ \text{quad}(\pmb w) = f(\pmb w _ t) + \pmb g _ t^\top (\pmb w - \pmb w _ t) + \frac 1 2 (\pmb w - \pmb w _ t)^\top \pmb H _ t (\pmb w- \pmb w _ t)\]

where

  • $\pmb g _ t$ is the gradient $\nabla f(\pmb w _ t)$
  • $\pmb H _ t$ is the Hessian of $f$ at $\pmb w _ t$.

Given this, quickly derive the Newton update step for $\pmb w _ {t+1}$.


We want $\nabla f _ \text{quad} = 0$, so

\[\pmb g _ t + \pmb H _ t (\pmb w - \pmb w _ t) = 0\]

Thus

\[\pmb w _ {t+1} = \pmb w _ t - \pmb H^{-1} _ t \pmb g _ t\]
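Since the Newton step minimises the quadratic model exactly, on a quadratic objective it lands on the minimiser in a single step. A toy sketch (my own example) with $f(\pmb w) = \frac 1 2 \pmb w^\top \pmb H \pmb w - \pmb b^\top \pmb w$, whose minimiser solves $\pmb H \pmb w = \pmb b$:

```python
# One Newton step on a quadratic: w_{t+1} = w_t - H^{-1} g_t.
H = [[2.0, 0.5], [0.5, 1.0]]   # Hessian (constant for a quadratic)
b = [1.0, 3.0]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def inv2x2(M):
    (a, c), (d, e) = M[0], M[1]
    det = a * e - c * d
    return [[e / det, -c / det], [-d / det, a / det]]

w = [10.0, -10.0]                                       # arbitrary start
g = [gi - bi for gi, bi in zip(matvec(H, w), b)]        # gradient Hw - b
step = matvec(inv2x2(H), g)                             # H^{-1} g_t
w = [wi - si for wi, si in zip(w, step)]                # Newton update

# after one step, w satisfies Hw = b (up to floating-point error)
resid = [gi - bi for gi, bi in zip(matvec(H, w), b)]
assert all(abs(r) < 1e-9 for r in resid)
print(w)
```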
