Notes - Machine Learning MT23, Optimisation


Flashcards

Suppose we have the general constrained minimisation problem

\[\begin{aligned} \text{minimise:} \quad &F(\pmb z)\\\\ \text{subject to:} \quad &g_i(\pmb z) \ge 0 \\\\ &h_j(\pmb z) = 0 \end{aligned}\]

with Lagrangian

\[\Lambda(\pmb z, \pmb \alpha, \pmb \mu) = F(\pmb z) - \sum^m_{i=1} \alpha_i g_i(\pmb z) - \sum^l_{j = 1} \mu_j h_j(\pmb z)\]

Can you state the KKT (Karush-Kuhn-Tucker) conditions, which give necessary conditions for a stationary point $(\pmb z^\star, \pmb \alpha^\star, \pmb \mu^\star)$ to be a minimum of this problem, and what each means intuitively?


  • Primal feasibility
    • $g _ i(\pmb z^\star) \ge 0$
    • $h _ j(\pmb z^\star) = 0$
    • Satisfies the actual constraints
  • Dual feasibility
    • $\alpha _ i^\star \ge 0$
    • Analogous to how the dual variables introduced for the inequality constraints of an LP are nonnegative. Intuitively, a nonnegative multiplier means an active constraint can only hold the solution back: decreasing $F$ further would require leaving the feasible region.
  • Complementary slackness
    • $\alpha _ i^\star g _ i(\pmb z^\star) = 0$
    • Either the solution lies on the boundary of an inequality constraint ($g _ i(\pmb z^\star) = 0$), or it lies strictly inside that constraint, in which case the constraint is inactive and can be ignored ($\alpha _ i^\star = 0$). A small worked example follows.
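
As a quick sanity check of these conditions, here is a small worked example of my own (not from the notes): minimise $F(z) = (z - 2)^2$ subject to the single constraint $g(z) = 1 - z \ge 0$. The Lagrangian is

\[\Lambda(z, \alpha) = (z - 2)^2 - \alpha (1 - z)\]

and stationarity gives $2(z - 2) + \alpha = 0$. The unconstrained minimiser $z = 2$ is infeasible, so the solution sits on the boundary: $z^\star = 1$ and $\alpha^\star = 2$. All three conditions hold: $g(z^\star) = 0 \ge 0$ (primal feasibility), $\alpha^\star = 2 \ge 0$ (dual feasibility), and $\alpha^\star g(z^\star) = 0$ via the active-constraint case (complementary slackness).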

How could you rephrase

\[\min \quad \mathcal L(\pmb w) = \sum^N_{i = 1} |\pmb x_i^\top \pmb w - y_i|\]

as a linear programming problem?


\[\begin{aligned} \min& && \sum^N_{i = 1} \zeta_i \\\\ \text{s.t.}& &&\pmb w^\top \pmb x_i - y_i \le \zeta_i &&\forall i \\\\ & &&y_i - \pmb w^\top \pmb x_i \le \zeta_i &&\forall i \\\\ \end{aligned}\]
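
To sanity-check this reformulation numerically, here is a minimal sketch using `scipy.optimize.linprog`; the variable ordering $(\pmb w, \pmb \zeta)$, the synthetic data, and the HiGHS solver choice are my own assumptions, not part of the notes.

```python
import numpy as np
from scipy.optimize import linprog

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.laplace(scale=0.1, size=N)

# Decision variables: [w (D entries), zeta (N entries)]; objective is sum(zeta).
c = np.concatenate([np.zeros(D), np.ones(N)])

# x_i^T w - y_i <= zeta_i   ->   X w - zeta <= y
# y_i - x_i^T w <= zeta_i   ->  -X w - zeta <= -y
A_ub = np.block([[X, -np.eye(N)],
                 [-X, -np.eye(N)]])
b_ub = np.concatenate([y, -y])

# w must be unbounded: linprog's default lower bound of 0 would otherwise constrain it.
bounds = [(None, None)] * D + [(0, None)] * N

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w_hat = res.x[:D]
print(w_hat)   # close to w_true
```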

Suppose we are optimising some parameters $\pmb w$ using gradient descent. Can you give the iterative step used in gradient descent methods?


\[\begin{aligned} \pmb w_{t+1} &= \pmb w_t - \eta_t \pmb g_t \\\\ &= \pmb w_t - \eta_t \nabla f(\pmb w_t) \end{aligned}\]

where:

  • $\eta _ t$ is the learning rate
  • $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$.
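
A minimal sketch of this update loop in Python (the objective, learning rate, and iteration count below are my own illustrative choices):

```python
import numpy as np

# Gradient descent on a toy quadratic objective.
def f(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def grad_f(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.zeros(2)      # w_0
eta = 0.1            # constant learning rate eta_t
for t in range(200):
    g = grad_f(w)    # g_t = grad f(w_t)
    w = w - eta * g  # w_{t+1} = w_t - eta_t g_t

print(w, f(w))       # w approaches the minimiser (1, -0.5)
```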

Gradient descent uses the following iterative step:

\[\begin{aligned} \pmb w_{t+1} &= \pmb w_t - \eta_t \pmb g_t \\\\ &= \pmb w_t - \eta_t \nabla f(\pmb w_t) \end{aligned}\]

where:

  • $\eta _ t$ is the learning rate
  • $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$

In this context, what is “line-search”?


A technique for choosing $\eta _ t$ at each step: we take the global minimiser of the univariate function obtained by restricting $f$ to the ray from $\pmb w _ t$ in the direction of the negative gradient, i.e. $\eta _ t = \arg\min _ {\eta \ge 0} f(\pmb w _ t - \eta \pmb g _ t)$.
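
For a quadratic objective $f(\pmb w) = \frac 1 2 \pmb w^\top \pmb A \pmb w - \pmb b^\top \pmb w$ the exact line-search step has the closed form $\eta _ t = \frac{\pmb g _ t^\top \pmb g _ t}{\pmb g _ t^\top \pmb A \pmb g _ t}$. A minimal sketch (the matrix $\pmb A$ and vector $\pmb b$ below are my own toy example):

```python
import numpy as np

# Gradient descent with exact line search on a quadratic f(w) = 0.5 w^T A w - b^T w.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])   # symmetric positive definite
b = np.array([1.0, -2.0])

w = np.zeros(2)
for t in range(50):
    g = A @ w - b                  # gradient at w_t
    if np.linalg.norm(g) < 1e-12:
        break
    eta = (g @ g) / (g @ (A @ g))  # exact line search: argmin over eta of f(w_t - eta g_t)
    w = w - eta * g

print(w)                           # matches the minimiser A^{-1} b
print(np.linalg.solve(A, b))
```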

Suppose we have a function $f : \mathbb R^N \to \mathbb R$. What is the local quadratic approximation of $f$ at $\pmb w _ t$?


\[f_\text{quad}(\pmb w) = f(\pmb w_t) + \pmb g_t^\top (\pmb w - \pmb w_t) + \frac 1 2 (\pmb w - \pmb w_t)^\top \pmb H_t (\pmb w- \pmb w_t)\]

where

  • $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$
  • $\pmb H _ t$ is the Hessian of $f$ at $\pmb w _ t$.
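
A quick numerical check of this approximation (the toy function and evaluation points below are my own choices; its gradient and Hessian are computed by hand):

```python
import numpy as np

# f(w) = w0^4 + w1^2, with hand-derived gradient and Hessian.
def f(w):
    return w[0] ** 4 + w[1] ** 2

def grad(w):
    return np.array([4.0 * w[0] ** 3, 2.0 * w[1]])

def hess(w):
    return np.array([[12.0 * w[0] ** 2, 0.0],
                     [0.0, 2.0]])

w_t = np.array([1.0, 0.5])

def f_quad(w):
    d = w - w_t
    return f(w_t) + grad(w_t) @ d + 0.5 * d @ hess(w_t) @ d

w = w_t + np.array([0.1, -0.05])
print(f(w), f_quad(w))   # the two values agree closely for small steps away from w_t
```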

For a function $f : \mathbb R^N \to \mathbb R$ we have the local quadratic approximation of $f$ at $\pmb w _ t$

\[f_\text{quad}(\pmb w) = f(\pmb w_t) + \pmb g_t^\top (\pmb w - \pmb w_t) + \frac 1 2 (\pmb w - \pmb w_t)^\top \pmb H_t (\pmb w- \pmb w_t)\]

where

  • $\pmb g _ t$ is the gradient $\nabla f(\pmb w _ t)$
  • $\pmb H _ t$ is the Hessian of $f$ at $\pmb w _ t$.

Given this, quickly derive the Newton update step for $\pmb w _ {t+1}$ using this approximation.


We choose $\pmb w _ {t+1}$ to be the stationary point of $f _ \text{quad}$, so setting $\nabla f _ \text{quad}(\pmb w) = 0$ gives

\[\pmb g_t + \pmb H_t (\pmb w - \pmb w_t) = 0\]

Solving for $\pmb w$ and taking the solution as the next iterate gives

\[\pmb w_{t+1} = \pmb w_t - \pmb H^{-1}_t \pmb g_t\]
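
A minimal sketch of this update in Python (the toy function, starting point, and iteration count are my own choices; it solves $\pmb H _ t \pmb d = \pmb g _ t$ rather than forming $\pmb H _ t^{-1}$ explicitly):

```python
import numpy as np

# Newton's method on the toy function f(w) = w0^4 + w1^2 from the previous sketch.
def grad(w):
    return np.array([4.0 * w[0] ** 3, 2.0 * w[1]])

def hess(w):
    return np.array([[12.0 * w[0] ** 2, 0.0],
                     [0.0, 2.0]])

w = np.array([1.0, 0.5])
for t in range(10):
    g = grad(w)
    H = hess(w)
    w = w - np.linalg.solve(H, g)   # w_{t+1} = w_t - H_t^{-1} g_t, without forming the inverse
print(w)                            # approaches the minimiser (0, 0)
```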


