Notes - Machine Learning MT23, Optimisation
Flashcards
Suppose we have the general constrained minimisation problem
\[\begin{aligned}
\text{minimise:} \quad &F(\pmb z)\\\\
\text{subject to:} \quad &g_i(\pmb z) \ge 0 \\\\
&h_j(\pmb z) = 0
\end{aligned}\]
with Lagrangian
\[\Lambda(\pmb z, \pmb \alpha, \pmb \mu) = F(\pmb z) - \sum^m_{i=1} \alpha_i g_i(\pmb z) - \sum^l_{j = 1} \mu_j h_j(\pmb z)\]
Can you state the KKT (Karush-Kuhn-Tucker) conditions, which give necessary conditions for a stationary point $(\pmb z^\star, \pmb \alpha^\star, \pmb \mu^\star)$ to be a minimum of this problem, and what each means intuitively?
- Primal feasibility
- $g _ i(\pmb z^\star) \ge 0$
- $h _ j(\pmb z^\star) = 0$
- Satisfies the actual constraints
- Dual feasibility
- $\alpha _ i^\star \ge 0$
- Similar to how all the dual variables introduced for inequality constraints in an LP are nonnegative; intuitively, it ensures $\nabla F(\pmb z^\star)$ is a nonnegative combination of the gradients of the active constraints, so there is no feasible descent direction at $\pmb z^\star$.
- Complementary slackness
- $\alpha _ i^\star g _ i(\pmb z^\star) = 0$
- Either the solution lies on the boundary of an inequality constraint ($g _ i(\pmb z^\star) = 0$), or it lies strictly inside the feasible region for that constraint and we can ignore the constraint ($\alpha _ i^\star = 0$). See the worked example after this list.
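As a quick worked example (my own illustration, not from the notes): minimise $F(z) = z^2$ subject to the single inequality constraint $g(z) = z - 1 \ge 0$. The Lagrangian and stationarity condition are
\[\begin{aligned}
\Lambda(z, \alpha) &= z^2 - \alpha(z - 1) \\\\
\frac{\partial \Lambda}{\partial z} = 0 &\implies \alpha^\star = 2z^\star
\end{aligned}\]
Complementary slackness rules out $\alpha^\star = 0$ (that would force $z^\star = 0$, violating primal feasibility), so the constraint is active: $z^\star = 1$ and $\alpha^\star = 2 \ge 0$, which also satisfies dual feasibility.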
How could you rephrase
\[\min \quad \mathcal L(\pmb w) = \sum^N_{i = 1} |\pmb x_i^\top \pmb w - y_i|\]
as a linear programming problem?
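One standard reformulation (a sketch; the slack variables $t _ i$ are introduced here and are not part of the original problem): bound each absolute value by a slack variable and minimise the sum of the slacks,
\[\begin{aligned}
\text{minimise:} \quad &\sum^N_{i=1} t_i \\\\
\text{subject to:} \quad &\pmb x_i^\top \pmb w - y_i \le t_i \\\\
&-(\pmb x_i^\top \pmb w - y_i) \le t_i
\end{aligned}\]
At an optimum $t _ i = |\pmb x _ i^\top \pmb w - y _ i|$, and both the objective and the constraints are linear in $(\pmb w, \pmb t)$, so this is an LP with the same optimal $\pmb w$.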
Suppose we are optimising some parameters $\pmb w$ using gradient descent. Can you give the iterative step used in gradient descent methods?
where:
- $\eta _ t$ is the learning rate
- $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$.
Gradient descent uses the following iterative step:
\[\begin{aligned}
\pmb w_{t+1} &= \pmb w_t - \eta_t \pmb g_t \\\\
&= \pmb w_t - \eta_t \nabla f(\pmb w_t)
\end{aligned}\]
where:
- $\eta _ t$ is the learning rate
- $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$
In this context, what is “line-search”?
A technique for picking $\eta _ t$ at each step: restrict $f$ to the ray from $\pmb w _ t$ in the direction of the negative gradient, giving the univariate function $\phi(\eta) = f(\pmb w _ t - \eta \pmb g _ t)$, and (for exact line search) take its global minimiser as $\eta _ t$.
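A minimal numerical sketch (my own illustration, not from the notes), assuming the quadratic objective $f(\pmb w) = \frac 1 2 \pmb w^\top \pmb A \pmb w - \pmb b^\top \pmb w$, for which the exact line-search step size has the closed form $\eta _ t = \pmb g _ t^\top \pmb g _ t / (\pmb g _ t^\top \pmb A \pmb g _ t)$:

```python
import numpy as np

# Sketch: gradient descent with exact line search on the quadratic
# f(w) = 0.5 w^T A w - b^T w, whose gradient is g = A w - b.
# For this f, phi(eta) = f(w - eta g) is minimised at
# eta = (g^T g) / (g^T A g).

def gradient_descent_exact_ls(A, b, w0, n_steps=100, tol=1e-10):
    w = w0.astype(float)
    for _ in range(n_steps):
        g = A @ w - b                  # gradient at the current iterate
        if np.linalg.norm(g) < tol:    # stop when the gradient is tiny
            break
        eta = (g @ g) / (g @ A @ g)    # exact line-search step size
        w = w - eta * g                # gradient descent update
    return w

# Example: the minimiser of f solves A w = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = gradient_descent_exact_ls(A, b, np.zeros(2))
print(w_star, np.linalg.solve(A, b))  # the two should agree closely
```

For a general objective the inner minimisation has no closed form and is usually done approximately (e.g. by backtracking).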
Suppose we have a function $f : \mathbb R^N \to \mathbb R$. What is the local quadratic approximation of $f$ at $\pmb w _ t$?
where
- $\pmb g _ t$ is a shorthand for the gradient $\nabla f(\pmb w _ t)$
- $\pmb H _ t$ is the Hessian of $f$ at $\pmb w _ t$.
For a function $f : \mathbb R^N \to \mathbb R$ we have the local quadratic approximation of $f$ at $\pmb w _ t$
\[f_\text{quad}(\pmb w) = f(\pmb w_t) + \pmb g_t^\top (\pmb w - \pmb w_t) + \frac 1 2 (\pmb w - \pmb w_t)^\top \pmb H_t (\pmb w- \pmb w_t)\]
where
- $\pmb g _ t$ is the gradient $\nabla f(\pmb w _ t)$
- $\pmb H _ t$ is the Hessian of $f$ at $\pmb w _ t$.
Given this, quickly derive the Newton update step for $\pmb w _ {t+1}$.
We want $\nabla f _ \text{quad} = 0$, so
\[\pmb g_t + \pmb H_t (\pmb w - \pmb w_t) = 0\]
Solving for $\pmb w$ and taking it as the next iterate gives
\[\pmb w_{t+1} = \pmb w_t - \pmb H^{-1}_t \pmb g_t\]
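A minimal sketch of the resulting Newton iteration (my own illustration, not from the notes; the test function $f(\pmb w) = \sum_i (e^{w _ i} - w _ i)$ is chosen only because its gradient and Hessian are easy to write down):

```python
import numpy as np

# Sketch: Newton's method on f(w) = sum_i exp(w_i) - w_i, which has
# gradient g(w) = exp(w) - 1 and diagonal Hessian H(w) = diag(exp(w)),
# with minimiser w = 0.

def newton(grad, hess, w0, n_steps=20, tol=1e-12):
    w = w0.astype(float)
    for _ in range(n_steps):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        # Newton update: w <- w - H^{-1} g, computed via a linear solve
        # rather than forming the inverse explicitly.
        w = w - np.linalg.solve(hess(w), g)
    return w

grad = lambda w: np.exp(w) - 1.0
hess = lambda w: np.diag(np.exp(w))

print(newton(grad, hess, np.array([2.0, -1.0])))  # approximately [0, 0]
```

In practice the update is computed by solving $\pmb H _ t \pmb d = \pmb g _ t$ for the step $\pmb d$ rather than forming $\pmb H _ t^{-1}$ explicitly.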