Machine Learning MT23, Support vector machines
Flashcards
In the context of SVMs, what is the “linearly separable” setting?
When the data in question admits a linear separator.
Suppose we have data
\[\mathcal D = \langle (\mathbf x _ i, y _ i)\rangle^N _ {i = 1}\]
where $y _ i = \pm 1$ and $\mathbf x _ i \in \mathbb R^D$, and are trying to use SVMs to classify the data. Quickly derive a condition for the data to be linearly separable.
The data is said to be linearly separable if it lies either side of some hyperplane, i.e. there exists some $\mathbf w, w _ 0$ such that
- If $y = 1$, $\mathbf w^\top \mathbf x + w _ 0 > 0$
- If $y = -1$, $\mathbf w^\top \mathbf x + w _ 0 < 0$.
This can be simplified to
\[\forall \mathbf x, y: y(\mathbf w^\top \mathbf x + w _ 0) > 0\]Note that since the data is finite, there will always exist some $\varepsilon > 0$ such that the above is true iff
\[\forall \mathbf x, y: y(\mathbf w^\top \mathbf x + w _ 0) \ge \varepsilon\]Hence there will always be $\mathbf w' := \mathbf w / \varepsilon$ and $w _ 0' := w _ 0 / \varepsilon$ such that
\[\forall \mathbf x, y: y(\mathbf w'^\top \mathbf x + w _ 0') \ge 1\]So in conclusion our condition is:
\[\mathcal D\text{ is linearly separable} \iff \exists \mathbf w, w _ 0: \forall (\mathbf x, y) \in \mathcal D : y(\mathbf w^\top \mathbf x + w _ 0) \ge 1\]
SVMs (at least in the linearly separable setting) are all about drawing linear separators between two datasets. In this context, what is the margin of a datapoint?
The distance of the point from the separating hyperplane.
SVMs (at least in the linearly separable setting) are all about drawing linear separators between two datasets. In this context, what is the maximum margin principle?
The separating boundary that maximises the smallest margin should be preferred.
Suppose we have a hyperplane $\pmb w^T \pmb x + w _ 0 = 0$. Can you give the formula for the distance of a point $\pmb x^\star$ to the hyperplane?
\[\frac{ \vert \pmb w^T \pmb x^\star + w _ 0 \vert }{ \vert \vert \pmb w \vert \vert _ 2}\]
(this can be derived by solving for the closest point on the hyperplane to $\pmb x^\star$ using a Lagrangian, and then substituting in).
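As a quick sanity check, here is a minimal numpy sketch of this formula (the function name and example values are illustrative, not from the notes):

```python
import numpy as np

# Distance of a point x_star from the hyperplane w^T x + w_0 = 0,
# i.e. |w^T x_star + w_0| / ||w||_2.
def distance_to_hyperplane(w, w0, x_star):
    return abs(w @ x_star + w0) / np.linalg.norm(w)

# e.g. the line x + y = 1 in 2D and the point (1, 1):
print(distance_to_hyperplane(np.array([1.0, 1.0]), -1.0, np.array([1.0, 1.0])))  # ~0.7071
```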
Consider SVMs in the case where the data is linearly separable, and that we have data $\mathcal D = \langle (\pmb x _ i, y _ i) \rangle^N _ {i = 1}$, where $y _ i \in \{-1, +1\}$. Remember that a hyperplane is given by $\pmb w^T \pmb x + w _ 0 = 0$. What relationship with $y _ i$ must $\pmb w \in \mathbb R^D$ and $w _ 0 \in \mathbb R$ satisfy?
\[y _ i (\pmb w^T \pmb x _ i + w _ 0) \ge 1\]
(it is more natural to see that this is in fact $> 0$, but there is an argument showing that we can say $\ge 1$ with no loss of generality. This is useful for when we form the Lagrangian later)
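The "no loss of generality" step can also be seen computationally. A hedged numpy sketch (illustrative names) that checks a candidate separator and rescales it so that the smallest value of $y _ i(\pmb w^T \pmb x _ i + w _ 0)$ is exactly $1$:

```python
import numpy as np

# Given a candidate (w, w0), check that y_i (w^T x_i + w_0) > 0 for every point,
# then rescale so that the smallest such value is exactly 1.
# X has one datapoint per row, y is a vector of +/-1 labels.
def check_and_rescale(w, w0, X, y):
    margins = y * (X @ w + w0)
    if not np.all(margins > 0):
        return None                      # (w, w0) does not separate the data
    eps = margins.min()                  # smallest value, strictly positive
    return w / eps, w0 / eps             # now min_i y_i (w'^T x_i + w_0') = 1
```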
Consider SVMs in the case where the data is linearly separable, and that we have data $\mathcal D = \langle (\pmb x _ i, y _ i) \rangle^N _ {i = 1}$, where $y _ i \in \{-1, +1\}$. We know that the formula for the distance of a point $\pmb x _ i$ from the hyperplane is given by
\[\frac{ \vert \pmb w^T \pmb x _ i + w _ 0 \vert }{ \vert \vert \pmb w \vert \vert _ 2}\]
And that
\[y _ i (\pmb w^T \pmb x _ i + w _ 0) \ge 1\]
for some solution $\pmb w$. Can you give a simpler formula for the margin without any absolute value in the numerator, and then use an inequality to construct a simpler optimisation problem?
Note that
\[\vert \pmb w^T \pmb x _ i + w _ 0 \vert = y _ i(\pmb w^T \pmb x _ i + w _ 0)\]So we have
\[\frac{y _ i(\pmb w^T \pmb x _ i + w _ 0)}{ \vert \vert \pmb w \vert \vert _ 2}\]From the inequality constraint, this is then bounded below by
\[\frac{1}{ \vert \vert \pmb w \vert \vert _ 2}\]So we get the optimisation problem
\[\begin{aligned} \text{maximise:} \quad &\frac{1}{ \vert \vert \pmb w \vert \vert _ 2} \\\\ \text{subject to:}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 \end{aligned}\]
What are support vectors in the context of support vector machines?
Datapoints that are closest to the optimal separator.
The SVM primal formulation in the linearly separable case is as follows:
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} \vert \vert \pmb w \vert \vert ^2 _ 2 \\\\
\text{subject to:}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1
\end{aligned}\]
By adding slack variables $\zeta _ i$, can you give the SVM primal formulation in the non-linearly separable case?
The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} \vert \vert \pmb w \vert \vert ^2 _ 2 + C\sum^N _ {i=1}\zeta _ i \\\\
\text{subject to:}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i \\\\
&\zeta _ i \ge 0
\end{aligned}\]
Suppose we have an optimal solution $(\pmb w, w _ 0, \pmb \zeta)$. What must be true about each $\zeta _ i$ at the optimal solution?
Either $\zeta _ i = 0$ or $\zeta _ i = 1 - y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 0$, i.e.
\[\zeta _ i = \max\\{0, 1 - y _ i(\pmb w^T \pmb x _ i + w _ 0)\\}\]
The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} \vert \vert \pmb w \vert \vert ^2 _ 2 + C\sum^N _ {i=1}\zeta _ i \\\\
\text{subject to:}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i \\\\
&\zeta _ i \ge 0
\end{aligned}\]
Suppose we have an optimal solution $(\pmb w, w _ 0, \pmb \zeta)$. At the optimal solution, each $\zeta _ i$ must satisfy
\[\zeta _ i = \max\\{0, 1 - y _ i(\pmb w^T x _ i + w _ 0)\\}\]
By rewriting in terms of the hinge loss function $\ell$ where
\[\ell (\hat y \mid y _ i) = \max\\{0, 1 - y _ i \hat y\\}\]
Can you give the loss function view of the SVM primal formulation by describing $\mathcal L _ \text{SVM}(\pmb w, w _ 0 \mid \pmb X, \pmb y)$?
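Since at the optimum $\zeta _ i = \max\\{0, 1 - y _ i(\pmb w^T \pmb x _ i + w _ 0)\\} = \ell(\pmb w^T \pmb x _ i + w _ 0 \mid y _ i)$, we can eliminate the slack variables and the constraints entirely, giving the unconstrained problem
\[\min \quad \mathcal L _ {\text{SVM} }(\pmb w, w _ 0 \mid \pmb X, \pmb y) = C\sum^N _ {i = 1} \ell(\pmb w^T \pmb x _ i + w _ 0 \mid y _ i) + \frac{1}{2} \vert \vert \pmb w \vert \vert ^2\]
A hedged numpy sketch of this objective (names are illustrative), in which the hinge losses play the role of the optimal slack variables:

```python
import numpy as np

# Loss-function view of the soft-margin SVM: C * sum of hinge losses + (1/2)||w||^2.
# X has one datapoint per row, y is a vector of +/-1 labels.
def svm_objective(w, w0, X, y, C):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + w0))  # equals the optimal zeta_i
    return C * hinge.sum() + 0.5 * w @ w
```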
The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} \vert \vert \pmb w \vert \vert ^2 _ 2 + C\sum^N _ {i=1}\zeta _ i \\\\
\text{subject to:}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i \\\\
&\zeta _ i \ge 0
\end{aligned}\]
By messing around with the Lagrangian and using the KKT conditions, we can form the dual problem
\[\begin{aligned}
\text{maximise:} \quad &\sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i=1}\sum^N _ {j=1} \alpha _ i \alpha _ j y _ i y _ j (\pmb x _ i \cdot \pmb x _ j) \\\\
\text{subject to:} \quad &\sum^N _ {i=1} \alpha _ i y _ i = 0 \\\\
&0 \le \alpha _ i \le C
\end{aligned}\]
It is possible then to reconstruct $\pmb w$ and $w _ 0$ from $\alpha _ i$:
\[\pmb w = \sum^N _ {i=1}\alpha _ i y _ i \pmb x _ i\]
What are the advantages and disadvantages of the dual formulation compared to the primal formulation?
- Advantages
- Only $N$ variables (one $\alpha _ i$ per datapoint), as opposed to $N + (D + 1)$ in the primal formulation, where $D$ is the dimension of the data
- Simpler constraints
- (and the kernel trick is more obvious)
- Disadvantages
- Objective function requires $N^2$ terms to express, whereas the primal formulation only requires $N + D$.
The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} \vert \vert \pmb w \vert \vert ^2 _ 2 + C\sum^N _ {i=1}\zeta _ i \\\\
\text{subject to:}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i \\\\
&\zeta _ i \ge 0
\end{aligned}\]
By messing around with the Lagrangian and using the KKT conditions, we can form the dual problem
\[\begin{aligned}
\text{maximise:} \quad &\sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i=1}\sum^N _ {j=1} \alpha _ i \alpha _ j y _ i y _ j (\pmb x _ i \cdot \pmb x _ j) \\\\
\text{subject to:} \quad &\sum^N _ {i=1} \alpha _ i y _ i = 0 \\\\
&0 \le \alpha _ i \le C
\end{aligned}\]
It is possible then to reconstruct $\pmb w$ and $w _ 0$ from $\alpha _ i$:
\[\pmb w = \sum^N _ {i=1}\alpha _ i y _ i \pmb x _ i\]
What is the kernel trick in this context, i.e. what is special about the following observation?
\[\begin{aligned}
\phi([x, y]^T) &= [1, \sqrt2 x, \sqrt 2 y, x^2, y^2, \sqrt2 xy]^T \\\\
\phi(\pmb x _ 1)^T\phi(\pmb x _ 2) &= \cdots = (1 + \pmb x _ 1^T \pmb x _ 2)^2
\end{aligned}\]
- To train and make predictions, we only need the inner products between pairs of datapoints
- Taking the inner product in specific basis expansions actually simplifies down into functions of the original data points (a quick numerical check is sketched below)
- This allows us to easily find linear separating surfaces in higher dimensional space and project the datapoints back down
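A quick numerical check of the identity above (the values are illustrative):

```python
import numpy as np

# Verify that the explicit basis expansion phi and the polynomial kernel
# (1 + x1 . x2)^2 give the same inner product.
def phi(v):
    x, y = v
    return np.array([1.0, np.sqrt(2) * x, np.sqrt(2) * y, x**2, y**2, np.sqrt(2) * x * y])

x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 1.0])
print(phi(x1) @ phi(x2))      # inner product in the expanded 6-dimensional space
print((1 + x1 @ x2) ** 2)     # kernel evaluated on the original 2D points: same value
```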
The squared-slack penalty soft-margin SVM constrained optimisation problem is given by
\[\begin{aligned}
\min _ {\pmb w} \quad &\frac{1}{2} \vert \vert \pmb w \vert \vert ^2 + \frac C 2\sum^N _ {i=1}\zeta _ i^2 \\\\
\text{s.t.}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i \\\\
&\zeta _ i \ge 0
\end{aligned}\]
Quickly derive the dual problem
\[\begin{aligned}
& \max _ {\alpha} && \sum _ {i=1}^{N} \alpha _ i - \frac{1}{2} \sum _ {i=1}^{N} \sum _ {j=1}^{N} \alpha _ i \alpha _ j y _ i y _ j \mathbf{x} _ i^\top \mathbf{x} _ j - \frac{1}{2} \sum _ {i=1}^{N} \frac{\alpha _ i^2}{C} \\\\
& \text{s.t.} && \alpha _ i \geq 0, \; \forall i = 1, \ldots, N \\\\
& && \sum _ {i=1}^{N} \alpha _ i y _ i = 0
\end{aligned}\]
Note that we can drop the constraint $\zeta _ i \ge 0$: if some $\zeta _ i < 0$ satisfies $y _ i (\pmb w^\top \pmb x _ i + w _ 0) \ge 1 - \zeta _ i$, then the constraint is also satisfied with $\zeta _ i = 0$, which gives a smaller objective.
We first find the Lagrangian for the original problem and use the KKT conditions to express the primal variables $\pmb w, w _ 0, \zeta _ i$ in terms of the dual variables (the multipliers on the constraints). Substituting these back into the Lagrangian gives the Lagrangian dual objective, which we maximise with respect to the dual variables subject to the constraints from the KKT conditions.
The Lagrangian is:
\[\Lambda(\pmb w, w _ 0; \pmb \zeta, \pmb \alpha) = \frac 1 2 \pmb w^\top \pmb w + \frac{C}{2} \sum^N _ {i = 1} \zeta _ i^2 - \sum^N _ {i = 1} \alpha _ i (y _ i (\pmb w^\top \pmb x _ i + w _ 0) - 1 + \zeta _ i)\]The KKT conditions then say that at the optimal solution, the gradient has to be $\pmb 0$ and we require dual feasibility $\alpha _ i \ge 0$ (we also need primal feasibility and complementary slackness, but these aren't used in the derivation).
The gradients are:
\[\begin{aligned} \frac{\partial \Lambda}{\partial \pmb w} &= \pmb w - \sum^N _ {i = 1} \alpha _ i y _ i \pmb x _ i \\\\ \frac{\partial \Lambda}{\partial w _ 0} &= -\sum^N _ {i = 1} \alpha _ i y _ i \\\\ \frac{\partial \Lambda}{\partial \zeta _ i} &= C \zeta _ i - \alpha _ i \end{aligned}\]So at the optimal solution, we have:
\[\begin{aligned} &\pmb w = \sum^N _ {i = 1} \alpha _ i y _ i \pmb x _ i \\\\ &\sum^N _ {i = 1} \alpha _ i y _ i = 0 \\\\ &C\zeta _ i = \alpha _ i \quad \left(\implies \zeta _ i = \frac{\alpha _ i}{C}\right) \end{aligned}\]Then, rewriting the Lagrangian in terms of the duals:
\[\begin{aligned} &\frac 1 2 \pmb w^\top \pmb w + \frac{C}{2} \sum^N _ {i = 1} \zeta _ i^2 - \sum^N _ {i = 1} \alpha _ i (y _ i (\pmb w^\top \pmb x _ i + w _ 0) - 1 + \zeta _ i) \\\\ =&\frac 1 2 \left( \sum^N _ {i=1} \alpha _ i y _ i \pmb x _ i \right)^\top \left(\sum^N _ {i = 1} \alpha _ i y _ i \pmb x _ i\right) + \frac 1 2 \sum^N _ {i = 1} \frac{\alpha _ i}{\zeta _ i} \zeta _ i^2 - \sum^N _ {i = 1} \alpha _ i \left(y _ i \left(\left(\sum^N _ {j = 1} \alpha _ j y _ j \pmb x _ j\right)^\top \pmb x _ i+w _ 0\right) - 1 + \zeta _ i\right) \\\\ =&-\frac 1 2 \sum^N _ {i =1}\sum^N _ {j = 1} \alpha _ i \alpha _ j y _ i y _ j \pmb x _ i^\top \pmb x _ j + \frac 1 2 \sum^N _ {i = 1} \alpha _ i \zeta _ i - \left(\sum^N _ {i = 1} \alpha _ i y _ i\right) w _ 0 + \sum^N _ {i = 1} \alpha _ i - \sum^N _ {i = 1} \alpha _ i \zeta _ i \\\\ =& \sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i =1}\sum^N _ {j = 1} \alpha _ i \alpha _ j y _ i y _ j \pmb x _ i^\top \pmb x _ j - \frac 1 2 \sum^N _ {i = 1} \frac{\alpha _ i^2}{C} \end{aligned}\]where in the last step we used $\sum^N _ {i=1} \alpha _ i y _ i = 0$ and $\zeta _ i = \alpha _ i / C$. This, along with the inequality $\alpha _ i \ge 0$ from the KKT conditions, gives us the dual problem
\[\begin{aligned} & \max _ {\alpha} && \sum _ {i=1}^{N} \alpha _ i - \frac{1}{2} \sum _ {i=1}^{N} \sum _ {j=1}^{N} \alpha _ i \alpha _ j y _ i y _ j \mathbf{x} _ i^\top \mathbf{x} _ j - \frac{1}{2} \sum _ {i=1}^{N} \frac{\alpha _ i^2}{C} \\\\ & \text{s.t.} && \alpha _ i \geq 0, \; \forall i = 1, \ldots, N \\\\ & && \sum _ {i=1}^{N} \alpha _ i y _ i = 0 \end{aligned}\]
The soft-margin SVM constrained optimisation problem is given by
\[\begin{aligned}
\min _ {\pmb w} \quad &\frac{1}{2} \vert \vert \pmb w \vert \vert ^2 + C\sum^N _ {i=1}\zeta _ i \\\\
\text{s.t.}\quad & y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i \\\\
&\zeta _ i \ge 0
\end{aligned}\]
Quickly derive its loss function formulation, given by:
\[\min \quad \mathcal L _ {\text{SVM} }(\pmb w, w _ 0 \mid \pmb X, \pmb y) = C\sum^N _ {i = 1} \ell _ {\text{hinge} }(\pmb w, w _ 0 \mid \pmb x _ i, y _ i) + \frac{1}{2} \vert \vert \pmb w \vert \vert ^2\]
Consider an optimal solution $(\pmb w, w _ 0, \pmb \zeta)$ to the original optimisation problem. Either a point $(\pmb x _ i, y _ i)$ satisfies the margin condition $y _ i (\pmb w^\top \pmb x _ i + w _ 0) \ge 1$, in which case $\zeta _ i = 0$, or it violates the margin with $y _ i(\pmb w^\top \pmb x _ i + w _ 0) < 1$, in which case the smallest $\zeta _ i$ that lets the constraint $y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i$ be satisfied is $\zeta _ i = 1 - y _ i (\pmb w^\top \pmb x _ i + w _ 0)$.
Hence at the optimum, we need $\zeta _ i = \max \{0, 1 - y _ i(\pmb w^\top \pmb x _ i + w _ 0)\}$.
But this is equivalent to the hinge loss $\ell (\pmb w, w _ 0 \mid \pmb x _ i, y _ i)$ (recall that the hinge loss penalises the output of a $\pm 1$ classifier via $\max \{0, 1 - y _ i (\text{prediction})\}$). Hence we can formulate the original optimisation problem as
\[\min \quad \mathcal L _ {\text{SVM} }(\pmb w, w _ 0 \mid \pmb X, \pmb y) = C\sum^N _ {i = 1} \ell _ {\text{hinge} }(\pmb w, w _ 0 \mid \pmb x _ i, y _ i) + \frac{1}{2} \vert \vert \pmb w \vert \vert ^2\]
Can you give the dual formulation of SVMs, and what the solution for $\pmb w$ looks like in terms of the dual variables?
\[\begin{aligned}
\text{maximise:} \quad &\sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i=1}\sum^N _ {j=1} \alpha _ i \alpha _ j y _ i y _ j (\pmb x _ i \cdot \pmb x _ j) \\\\
\text{subject to:} \quad &\sum^N _ {i=1} \alpha _ i y _ i = 0 \\\\
&0 \le \alpha _ i \le C
\end{aligned}\]
and
\[\pmb w = \sum^N _ {i = 1} \alpha _ i y _ i \pmb x _ i\]
Suppose we are using the kernel trick with SVMs. This involves solving the dual problem
\[\begin{aligned}
\text{maximise:} \quad &\sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i=1}\sum^N _ {j=1} \alpha _ i \alpha _ j y _ i y _ j \kappa(\pmb x _ i, \pmb x _ j) \\\\
\text{subject to:} \quad &\sum^N _ {i=1} \alpha _ i y _ i = 0 \\\\
&0 \le \alpha _ i \le C
\end{aligned}\]
and then deducing the value of the weights $\pmb w$ from the dual variables. Thus to decide the class of a new point, we need to evaluate
\[\pmb w^\top \phi(\pmb x) + w _ 0\]
This looks a bit scary, since $\phi(\pmb x)$ is a large (potentially infinite-dimensional) vector. Why is this not a problem?
$\pmb w$ in terms of the dual variables is
\[\pmb w = \sum^N _ {i = 1} \alpha _ i y _ i \phi(\pmb x _ i)\]So
\[\pmb w^\top \phi(\pmb x) + w _ 0 = w _ 0 + \sum^N _ {i = 1} \alpha _ i y _ i \kappa(\pmb x _ i, \pmb x)\]so we can use the kernel instead.
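A minimal sketch of the resulting decision function (the names are illustrative; `alpha` and `w0` are assumed to come from an already-solved dual problem):

```python
import numpy as np

# Classify a new point using only kernel evaluations:
# f(x) = w_0 + sum_i alpha_i y_i kappa(x_i, x), then predict sign(f(x)).
def predict(x_new, alpha, X, y, w0, kappa):
    f = w0 + sum(alpha[i] * y[i] * kappa(X[i], x_new) for i in range(len(y)))
    return np.sign(f)
```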
Suppose we are using the dual formulation of SVMs, this requires solving
\[\begin{aligned}
\text{maximise:} \quad &\sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i=1}\sum^N _ {j=1} \alpha _ i \alpha _ j y _ i y _ j (\pmb x _ i \cdot \pmb x _ j) \\\\
\text{subject to:} \quad &\sum^N _ {i=1} \alpha _ i y _ i = 0 \\\\
&0 \le \alpha _ i \le C
\end{aligned}\]
The weight vector can then be written in the form
\[\pmb w = \sum^N _ {i = 1} \alpha _ i y _ i \pmb x _ i\]
How can you then compute the bias $w _ 0$ (even in the case where we are using a kernel)?
The vectors where $0 < \alpha _ i < C$ are support vectors that lie exactly on the margin, i.e. $y _ i(\pmb w^\top \pmb x _ i + w _ 0) = 1$. Then we can solve via
\[y _ i (\pmb w^\top \pmb x _ i + w _ 0) = 1 \implies w _ 0 = y _ i - \pmb w^\top \pmb x _ i\]Note that $y^{-1} _ i = y _ i$ since $y _ i = \pm 1$. In the kernelised case this still only requires kernel evaluations, since $\pmb w^\top \phi(\pmb x _ i) = \sum^N _ {j = 1} \alpha _ j y _ j \kappa(\pmb x _ j, \pmb x _ i)$.
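A hedged sketch of this computation (illustrative names; it averages over all on-margin support vectors, a common stabilisation rather than something required by the derivation above):

```python
import numpy as np

# Recover w_0 from the on-margin support vectors (0 < alpha_i < C), using only
# kernel evaluations: w^T phi(x_i) = sum_j alpha_j y_j kappa(x_j, x_i).
def bias_from_dual(alpha, X, y, kappa, C, tol=1e-8):
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]
    w0_values = []
    for i in on_margin:
        wTx = sum(alpha[j] * y[j] * kappa(X[j], X[i]) for j in range(len(y)))
        w0_values.append(y[i] - wTx)     # from y_i (w^T phi(x_i) + w_0) = 1
    return np.mean(w0_values)
```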
Suppose we are considering the dual formulation of SVMs:
\[\begin{aligned}
\text{maximise:} \quad &\sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i=1}\sum^N _ {j=1} \alpha _ i \alpha _ j y _ i y _ j \kappa(\pmb x _ i, \pmb x _ j) \\\\
\text{subject to:} \quad &\sum^N _ {i=1} \alpha _ i y _ i = 0 \\\\
&0 \le \alpha _ i \le C
\end{aligned}\]
How can you deduce that the points where $\alpha _ i \ne 0$ correspond to the support vectors?
Complementary slackness implies
\[\alpha _ i \left( y _ i (\pmb w^\top \pmb x _ i + w _ 0) - 1 + \zeta _ i \right) = 0\]So either $\alpha _ i = 0$ or $y _ i(\pmb w^\top \pmb x _ i + w _ 0) = 1 - \zeta _ i$. If $\alpha _ i$ is nonzero, we can then conclude that $\pmb x _ i$ is a support vector.
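This is also what SVM libraries expose in practice; for example, scikit-learn's `SVC` (which solves this dual internally) reports exactly the points with nonzero $\alpha _ i$ as support vectors. A small sketch on made-up data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy dataset: two clusters, labels +/-1.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_)     # indices i with alpha_i != 0 (the support vectors)
print(clf.dual_coef_)   # the corresponding values y_i * alpha_i
```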
The dual formulation of SVMs in the kernelised case is:
\[\begin{aligned}
\text{maximise:} \quad &\sum^N _ {i = 1} \alpha _ i - \frac 1 2 \sum^N _ {i=1}\sum^N _ {j=1} \alpha _ i \alpha _ j y _ i y _ j \kappa (\pmb x _ i, \pmb x _ j) \\\\
\text{subject to:} \quad &\sum^N _ {i=1} \alpha _ i y _ i = 0 \\\\
&0 \le \alpha _ i \le C
\end{aligned}\]
Can you write the objective function $g(\pmb \alpha)$ succinctly in vector notation, and in doing so explain why Mercer kernels are important for SVMs?
\[g(\pmb \alpha) = \pmb \alpha^\top \pmb 1 - \frac 1 2 (\pmb \alpha \odot \pmb y)^\top \mathbf K (\pmb \alpha \odot \pmb y)\]
where $\odot$ is the Hadamard product (i.e. elementwise product) and $\mathbf K$ is the Gram matrix with $\mathbf K _ {ij} = \kappa(\pmb x _ i, \pmb x _ j)$. Mercer kernels imply that $\mathbf K$ is positive semidefinite and therefore that $g(\pmb \alpha)$ is concave, so maximising it subject to the linear constraints is a convex optimisation problem.
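A hedged sketch of solving this dual numerically with a generic solver (real implementations use specialised methods such as SMO; the kernel, data, and parameter values here are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def kappa(a, b):
    return (1.0 + a @ b) ** 2                       # the polynomial kernel from earlier

# Toy data and Gram matrix K_ij = kappa(x_i, x_j).
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
N, C = len(y), 1.0
K = np.array([[kappa(xi, xj) for xj in X] for xi in X])

def neg_g(alpha):                                   # minimise -g(alpha)
    u = alpha * y                                   # alpha (Hadamard product) y
    return -(alpha.sum() - 0.5 * u @ K @ u)

result = minimize(
    neg_g,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, C)] * N,                          # 0 <= alpha_i <= C
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
print(result.x)                                     # the dual variables alpha
```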