Notes - Machine Learning MT23, Support vector machines
Flashcards
In the context of SVMs, what is the “linearly separable” setting?
When the data in question admits a linear separator.
Suppose we have data
\[\mathcal D = \langle (\mathbf x_i, y_i)\rangle^N_{i = 1}\]
where $y _ i = \pm 1$ and $\mathbf x _ i \in \mathbb R^D$, and we are trying to use SVMs to classify the data. Quickly derive a condition for the data to be linearly separable.
The data is said to be linearly separable if it lies on either side of some hyperplane, i.e. there exists some $\mathbf w, w _ 0$ such that
- If $y = 1$, $\mathbf w^\top \mathbf x + w _ 0 > 0$
- If $y = -1$, $\mathbf w^\top \mathbf x + w _ 0 < 0$.
This can be simplified to
\[\forall \mathbf x, y: y(\mathbf w^\top \mathbf x + w_0) > 0\]Note that since the data is finite, there will always exist some $\varepsilon > 0$ such that the above holds iff
\[\forall \mathbf x, y: y(\mathbf w^\top \mathbf x + w_0) \ge \varepsilon\]Hence there will always exist $\mathbf w' := \mathbf w / \varepsilon$ and $w _ 0' := w _ 0 / \varepsilon$ such that
\[\forall \mathbf x, y: y(\mathbf w'^\top \mathbf x + w_0') \ge 1\]So in conclusion our condition is:
\[\mathcal D\text{ is linearly separable} \iff \exists \mathbf w, w_0: \forall (\mathbf x, y) \in \mathcal D : y(\mathbf w^\top \mathbf x + w_0) \ge 1\]SVMs (at least in the linearly separable setting) are all about drawing linear separators between two datasets. In this context, what is the margin of a datapoint?
The distance of the point from the separating hyperplane.
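As a quick sanity check of the separability condition derived above, here is a minimal numpy sketch; the toy dataset and the hand-picked $(\mathbf w, w_0)$ are assumptions made purely for illustration.

```python
import numpy as np

# Toy 2D dataset: two classes, chosen so that they are linearly separable.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# A hand-picked candidate hyperplane w^T x + w_0 = 0.
w, w0 = np.array([1.0, 1.0]), 0.0

scores = y * (X @ w + w0)                 # y_i (w^T x_i + w_0) for each datapoint
print("separable with this (w, w0)?", bool(np.all(scores > 0)))

# Rescaling by the smallest score gives (w', w_0') with y_i (w'^T x_i + w_0') >= 1.
eps = scores.min()
w_p, w0_p = w / eps, w0 / eps
print(bool(np.all(y * (X @ w_p + w0_p) >= 1 - 1e-12)))
```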
SVMs (at least in the linearly separable setting) are all about drawing linear separators between two datasets. In this context, what is the maximum margin principle?
The separating boundary that maximises the smallest margin should be preferred.
Suppose we have a hyperplane $\pmb w^T \pmb x + w _ 0 = 0$. Can you give the formula for the distance of a point $\pmb x^\star$ to the hyperplane?
\[\text{dist}(\pmb x^\star, \text{hyperplane}) = \frac{|\pmb w^T \pmb x^\star + w_0|}{||\pmb w||_2}\]
(this can be derived by solving for the closest point on the hyperplane to $\pmb x^\star$ using a Lagrangian, and then substituting in).
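A tiny numpy illustration of this distance formula (the hyperplane and query point are arbitrary):

```python
import numpy as np

# Hyperplane 3x + 4y - 5 = 0 and an arbitrary query point.
w, w0 = np.array([3.0, 4.0]), -5.0
x_star = np.array([2.0, 1.0])

distance = abs(w @ x_star + w0) / np.linalg.norm(w)
print(distance)   # |3*2 + 4*1 - 5| / 5 = 1.0
```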
Consider SVMs in the case where the data is linearly separable and we have data $\mathcal D = \langle (\pmb x _ i, y _ i) \rangle^N _ {i = 1}$, where $y _ i \in \{-1, +1\}$. Remember that a hyperplane is given by $\pmb w^T \pmb x + w _ 0 = 0$. What relationship with $y _ i$ must $\pmb w \in \mathbb R^D$ and $w _ 0 \in \mathbb R$ satisfy?
\[y_i(\pmb w^T \pmb x_i + w_0) \ge 1\]
(it is more natural to see that this is in fact $> 0$, but there is an argument showing that it is with no loss of generality that we can say $\ge 1$. This is useful for when we form the Lagrangian later)
Consider SVMs in the case where the data is linearly separable and we have data $\mathcal D = \langle (\pmb x _ i, y _ i) \rangle^N _ {i = 1}$, where $y _ i \in \{-1, +1\}$. We know that the formula for the distance of a point $\pmb x _ i$ to the hyperplane is given by
\[\frac{|\pmb w^T \pmb x_i + w_0|}{||\pmb w||_2}\]
And that
\[y_i (\pmb w^T \pmb x_i + w_0) \ge 1\]
for some solution $\pmb w$. Can you give a simpler formula for the margin without any absolute value in the numerator, and then use an inequality to construct a simpler optimisation problem?
Note that
\[|\pmb w^T \pmb x_i + w_0| = y_i(\pmb w^T \pmb x_i + w_0)\]So we have
\[\frac{y_i(\pmb w^T \pmb x_i + w_0)}{||\pmb w||_2}\]From the inequality constraint, this is then bounded below by
\[\frac{1}{||\pmb w||_2}\]So we get the optimisation problem
\[\begin{aligned} \text{maximise:} \quad &\frac{1}{||\pmb w||_2} \\\\ \text{subject to:}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1 \end{aligned}\]What are support vectors in the context of support vector machines?
Datapoints that are closest to the optimal separator.
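Given some feasible $(\pmb w, w_0)$, the sketch below (toy data and hand-picked weights, purely for illustration) computes each point's margin and picks out the closest points, i.e. the support vectors of that particular separator:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, w0 = np.array([0.5, 0.5]), 0.0                # satisfies y_i (w^T x_i + w_0) >= 1 on this toy data

margins = y * (X @ w + w0) / np.linalg.norm(w)   # distance of each point from the hyperplane
print("smallest margin:", margins.min())          # the quantity the max-margin principle maximises
print("closest points:", np.where(np.isclose(margins, margins.min()))[0])   # support vector candidates
```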
The SVM primal formulation in the linearly separable case is as follows:
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} ||\pmb w||^2_2 \\\\
\text{subject to:}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1
\end{aligned}\]
By adding slack variables $\zeta _ i$, can you give the SVM primal formulation in the non-linearly separable case?
The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} ||\pmb w||^2_2 + C\sum^N_{i=1}\zeta_i \\\\
\text{subject to:}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1 - \zeta_i \\\\
&\zeta_i \ge 0
\end{aligned}\]
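For concreteness, here is a sketch of solving this primal with a generic convex solver; this assumes `cvxpy` is available, and the toy data and choice of $C$ are arbitrary:

```python
import numpy as np
import cvxpy as cp

# Toy data; C trades off margin width against the total slack.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5], [0.5, -0.4]])
y = np.array([1, 1, -1, -1, 1])
N, D = X.shape
C = 1.0

w, w0, zeta = cp.Variable(D), cp.Variable(), cp.Variable(N)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(zeta))
constraints = [cp.multiply(y, X @ w + w0) >= 1 - zeta, zeta >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value)
print(zeta.value)   # optimal slacks; close to zero for points that satisfy the margin
```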
Suppose we have an optimal solution $(\pmb w, w _ 0, \pmb \zeta)$. What must be true about each $\zeta _ i$ at the optimal solution?
Either $\zeta _ i = 0$ or $\zeta _ i = 1 - y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 0$, i.e.
\[\zeta_i = \max\\{0, 1 - y_i(\pmb w^T \pmb x_i + w_0)\\}\]The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} ||\pmb w||^2_2 + C\sum^N_{i=1}\zeta_i \\\\
\text{subject to:}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1 - \zeta_i \\\\
&\zeta_i \ge 0
\end{aligned}\]
Suppose we have an optimal solution $(\pmb w, w _ 0, \pmb \zeta)$. At the optimal solution, each $\zeta _ i$ must satisfy
\[\zeta_i = \max\\{0, 1 - y_i(\pmb w^T \pmb x_i + w_0)\\}\]
By rewriting in terms of the hinge loss function $\ell$ where
\[\ell (\hat y \mid y_i) = \max\\{0, 1 - y_i \hat y\\}\]
Can you give the loss function view of the SVM primal formulation by describing $\mathcal L _ \text{SVM}(\pmb w, w _ 0 \mid \pmb X, \pmb y)$?
The constrained problem is equivalent to the unconstrained problem
\[\min_{\pmb w, w_0} \quad \mathcal L_{\text{SVM}}(\pmb w, w_0 \mid \pmb X, \pmb y) = C\sum^N_{i = 1} \ell(\pmb w^T \pmb x_i + w_0 \mid y_i) + \frac{1}{2} ||\pmb w||^2_2\]
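A direct numpy translation of this unconstrained objective (toy data and weights assumed for illustration; `svm_loss` is just an illustrative helper):

```python
import numpy as np

def svm_loss(w, w0, X, y, C):
    """Hinge-loss view of the soft-margin SVM objective (illustrative helper)."""
    scores = X @ w + w0
    hinge = np.maximum(0.0, 1.0 - y * scores)   # per-point hinge loss
    return C * hinge.sum() + 0.5 * w @ w        # C * total hinge loss + (1/2) ||w||^2

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(svm_loss(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```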
The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} ||\pmb w||^2_2 + C\sum^N_{i=1}\zeta_i \\\\
\text{subject to:}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1 - \zeta_i \\\\
&\zeta_i \ge 0
\end{aligned}\]
By messing around with the Lagrangian and using the KKT conditions, we can form the dual problem
\[\begin{aligned}
\text{maximise:} \quad &\sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y_i y_j (\pmb x_i \cdot \pmb x_j) \\\\
\text{subject to:} \quad &\sum^N_{i=1} \alpha_i y_i = 0 \\\\
&0 \le \alpha_i \le C
\end{aligned}\]
It is possible then to reconstruct $\pmb w$ and $w _ 0$ from $\alpha _ i$:
\[\pmb w = \sum^N_{i=1}\alpha_i y_i \pmb x_i\]
What are the advantages and disadvantages of the dual formulation compared to the primal formulation?
- Advantages
- Only $N$ variables (one $\alpha _ i$ per datapoint), as opposed to $N + (D + 1)$ in the primal formulation; this is a big saving when the dimension $D$ is large
- Simpler constraints
- (and the kernel trick is more obvious)
- Disadvantages
- Objective function requires $N^2$ terms to express, whereas the primal formulation only requires $N + D$.
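To make the comparison concrete, here is a rough sketch of solving the dual with a generic constrained optimiser (assuming `scipy` is available; in practice a dedicated QP/SMO solver would be used). Note the $N$ box-bounded variables, the single equality constraint, and the $N \times N$ matrix needed for the objective:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, C = len(y), 1.0

Q = (y[:, None] * y[None, :]) * (X @ X.T)     # Q_ij = y_i y_j (x_i . x_j): the N^2 terms of the objective

def neg_dual(alpha):                          # minimise the negative of the dual objective
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,                                 # 0 <= alpha_i <= C
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                           # reconstruct w = sum_i alpha_i y_i x_i
print(alpha, w)
```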
The SVM primal formulation in the non-linearly separable case is given by
\[\begin{aligned}
\text{minimise:} \quad &\frac{1}{2} ||\pmb w||^2_2 + C\sum^N_{i=1}\zeta_i \\\\
\text{subject to:}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1 - \zeta_i \\\\
&\zeta_i \ge 0
\end{aligned}\]
By messing around with the Lagrangian and using the KKT conditions, we can form the dual problem
\[\begin{aligned}
\text{maximise:} \quad &\sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y_i y_j (\pmb x_i \cdot \pmb x_j) \\\\
\text{subject to:} \quad &\sum^N_{i=1} \alpha_i y_i = 0 \\\\
&0 \le \alpha_i \le C
\end{aligned}\]
It is possible then to reconstruct $\pmb w$ and $w _ 0$ from $\alpha _ i$:
\[\pmb w = \sum^N_{i=1}\alpha_i y_i \pmb x_i\]
What is the kernel trick in this context, i.e. what is special about the following observation?
\[\begin{aligned}
\phi([x, y]^T) &= [1, \sqrt2 x, \sqrt 2 y, x^2, y^2, \sqrt2 xy]^T \\\\
\phi(\pmb x_1)^T\phi(\pmb x_2) &= \cdots = (1 + \pmb x_1^T \pmb x_2)^2
\end{aligned}\]
- To train and make predictions, we only need the inner products between pairs of datapoints
- Taking the inner product in specific basis expansions actually simplifies down into a function of the original data points (the kernel)
- This allows us to easily find linear separating surfaces in higher dimensional space (corresponding to non-linear boundaries for the original datapoints) without ever explicitly computing the basis expansion
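A quick numerical check of the identity above, using arbitrary 2D points:

```python
import numpy as np

def phi(v):
    x, y = v
    return np.array([1.0, np.sqrt(2) * x, np.sqrt(2) * y, x**2, y**2, np.sqrt(2) * x * y])

x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 0.7])

lhs = phi(x1) @ phi(x2)        # inner product in the 6-dimensional feature space
rhs = (1 + x1 @ x2) ** 2       # the same quantity computed directly on the original points
print(np.isclose(lhs, rhs))    # True
```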
The squared-slack penalty soft-margin SVM constrained optimisation problem is given by
\[\begin{aligned}
\min_{\pmb w} \quad &\frac{1}{2} ||\pmb w||^2 + \frac C 2\sum^N_{i=1}\zeta_i^2 \\\\
\text{s.t.}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1 - \zeta_i \\\\
&\zeta_i \ge 0
\end{aligned}\]
Quickly derive the dual problem
\[\begin{aligned}
& \max_{\alpha} && \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j - \frac{1}{2} \sum_{i=1}^{N} \frac{\alpha_i^2}{C} \\\\
& \text{s.t.} && \alpha_i \geq 0, \; \forall i = 1, \ldots, N \\\\
& && \sum_{i=1}^{N} \alpha_i y_i = 0
\end{aligned}\]
Note that we can ignore the constraint that $\zeta _ i \ge 0$: if some $\zeta _ i < 0$ satisfies $y _ i (\pmb w^\top \pmb x _ i + w _ 0) \ge 1 - \zeta _ i$, then the constraint is also satisfied with $\zeta _ i = 0$, which gives a smaller objective value.
We first find the Lagrangian for the original problem and use the KKT conditions to express the primal variables $\pmb w, w _ 0, \zeta _ i$ in terms of the dual variables (the multipliers on the constraints). Then we form the Lagrangian dual objective by substituting these back into the Lagrangian, and maximise it with respect to the dual variables, subject to the constraints coming from the KKT conditions.
The Lagrangian is:
\[\Lambda(\pmb w, w_0; \pmb \zeta, \pmb \alpha) = \frac 1 2 \pmb w^\top \pmb w + \frac{C}{2} \sum^N_{i = 1} \zeta_i^2 - \sum^N_{i = 1} \alpha_i (y_i (\pmb w^\top \pmb x_i + w_0) - 1 + \zeta_i)\]The KKT conditions then say that at the optimal solution, the gradient has to be $\pmb 0$ and we require dual feasibility $\pmb \alpha _ i \ge 0$ (we also need primal feasibility and complementary slackness, but these aren’t used in the derivation).
The gradients are:
\[\begin{aligned} \frac{\partial \Lambda}{\partial \pmb w} &= \pmb w - \sum^N_{i = 1} \alpha_i y_i \pmb x_i \\\\ \frac{\partial \Lambda}{\partial w_0} &= -\sum^N_{i = 1} \alpha_i y_i \\\\ \frac{\partial \Lambda}{\partial \zeta_i} &= C \zeta_i - \alpha_i \end{aligned}\]So at the optimal solution, we have:
\[\begin{aligned} &\pmb w = \sum^N_{i = 1} \alpha_i y_i \pmb x_i \\\\ &\sum^N_{i = 1} \alpha_i y_i = 0 \\\\ &C\zeta_i = \alpha_i \quad \left(\implies \zeta_i = \frac{\alpha_i}{C}\right) \end{aligned}\]Then, rewriting the Lagrangian in terms of the duals:
\[\begin{aligned} &\frac 1 2 \pmb w^\top \pmb w + \frac{C}{2} \sum^N_{i = 1} \zeta_i^2 - \sum^N_{i = 1} \alpha_i (y_i (\pmb w^\top \pmb x_i + w_0) - 1 + \zeta_i) \\\\ =&\frac 1 2 \left( \sum^N_{i=1} \alpha_i y_i \pmb x_i \right)^\top \left(\sum^N_{i = 1} \alpha_i y_i \pmb x_i\right) + \frac 1 2 \sum^N_{i = 1} \alpha_i \zeta_i - \sum^N_{i = 1} \alpha_i \left(y_i \left(\left(\sum^N_{j = 1} \alpha_j y_j \pmb x_j\right)^\top \pmb x_i+w_0\right) - 1 + \zeta_i\right) \\\\ =&-\frac 1 2 \sum^N_{i =1}\sum^N_{j = 1} \alpha_i \alpha_j y_i y_j \pmb x_i^\top \pmb x_j + \frac 1 2 \sum^N_{i = 1} \alpha_i \zeta_i - \left(\sum^N_{i = 1} \alpha_i y_i\right) w_0 + \sum^N_{i = 1} \alpha_i - \sum^N_{i = 1} \alpha_i \zeta_i \\\\ =& \sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i =1}\sum^N_{j = 1} \alpha_i \alpha_j y_i y_j \pmb x_i^\top \pmb x_j - \frac 1 2 \sum^N_{i = 1} \frac{\alpha_i^2}{C} \end{aligned}\]where we have used $C\zeta_i = \alpha_i$ (so that $\frac{C}{2}\zeta_i^2 = \frac 1 2 \alpha_i \zeta_i$ and $\alpha_i \zeta_i = \frac{\alpha_i^2}{C}$) and $\sum^N_{i=1} \alpha_i y_i = 0$. This, along with the inequality $\alpha _ i \ge 0$ from the KKT conditions, gives us the dual problem
\[\begin{aligned} & \max_{\alpha} && \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j - \frac{1}{2} \sum_{i=1}^{N} \frac{\alpha_i^2}{C} \\\\ & \text{s.t.} && \alpha_i \geq 0, \; \forall i = 1, \ldots, N \\\\ & && \sum_{i=1}^{N} \alpha_i y_i = 0 \end{aligned}\]The soft-margin SVM constrained optimisation problem is given by
\[\begin{aligned}
\min_{\pmb w} \quad &\frac{1}{2} ||\pmb w||^2 + C\sum^N_{i=1}\zeta_i \\\\
\text{s.t.}\quad & y_i(\pmb w^T \pmb x_i + w_0) \ge 1 - \zeta_i \\\\
&\zeta_i \ge 0
\end{aligned}\]
Quickly derive its loss function formulation, given by:
\[\min \quad \mathcal L_{\text{SVM}}(\pmb w, w_0 \mid \pmb X, \pmb y) = C\sum^N_{i = 1} \ell_{\text{hinge}}(\pmb w, w_0 \mid \pmb x_i, y_i) + \frac{1}{2} ||\pmb w||^2\]
Consider an optimal solution $(\pmb w, w _ 0, \pmb \zeta)$ to the original optimisation problem. Either a point $(\pmb x _ i, y _ i)$ satisfies the margin condition $y _ i (\pmb w^\top \pmb x _ i + w _ 0) \ge 1$, in which case the smallest feasible slack is $\zeta _ i = 0$, or it violates the margin with $y _ i(\pmb w^\top \pmb x _ i + w _ 0) < 1$, so the smallest $\zeta _ i$ that lets the constraint $y _ i(\pmb w^T \pmb x _ i + w _ 0) \ge 1 - \zeta _ i$ be satisfied is $\zeta _ i = 1 - y _ i (\pmb w^\top \pmb x _ i + w _ 0)$.
Hence at the optimum, we need $\zeta _ i = \max \{0, 1 - y _ i(\pmb w^\top \pmb x _ i + w _ 0)\}$.
But this is equivalent to the hinge loss $\ell (\pmb w, w _ 0 \mid \pmb x _ i, y _ i)$ (recall that the hinge loss penalises the output of a $\pm 1$ classifier via $\max \{0, 1 - y _ i (\text{prediction})\} )$. Hence we can formulate the original optimisation problem as
\[\min \quad \mathcal L_{\text{SVM} }(\pmb w, w_0 \mid \pmb X, \pmb y) = C\sum^N_{i = 1} \ell_{\text{hinge} }(\pmb w, w_0 \mid \pmb x_i, y_i) + \frac{1}{2} ||\pmb w||^2\]Can you give the dual formulation of SVMs, and what the solution for $\pmb w$ looks like in terms of the dual variables?
The dual formulation is
\[\begin{aligned} \text{maximise:} \quad &\sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y_i y_j (\pmb x_i \cdot \pmb x_j) \\\\ \text{subject to:} \quad &\sum^N_{i=1} \alpha_i y_i = 0 \\\\ &0 \le \alpha_i \le C \end{aligned}\]and
\[\pmb w = \sum^N_{i = 1} \alpha_i y_i \pmb x_i\]Suppose we are using the kernel trick with SVMs. This involves solving the dual problem
\[\begin{aligned}
\text{maximise:} \quad &\sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y_i y_j \kappa(\pmb x_i, \pmb x_j) \\\\
\text{subject to:} \quad &\sum^N_{i=1} \alpha_i y_i = 0 \\\\
&0 \le \alpha_i \le C
\end{aligned}\]
and then deducing the value of the weights $\pmb w$ from the dual variables. Thus to decide the class of a new point, we need to evaluate
\[\pmb w^\top \phi(\pmb x) + w_0\]
This looks a bit scary, since $\phi(\pmb x)$ is a large (potentially infinite-dimensional) vector. Why is this not a problem?
$\pmb w$ in terms of the dual variables is
\[\pmb w = \sum^N_{i = 1} \alpha_i y_i \phi(\pmb x_i)\]So
\[\pmb w^\top \phi(\pmb x) + w_0 = w_0 + \sum^N_{i = 1} \alpha_i y_i \kappa(\pmb x_i, \pmb x)\]so we can use the kernel instead.
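A sketch of kernelised prediction in numpy; the kernel is the polynomial kernel from the earlier example, and the values of `alpha` and `w0` are placeholders standing in for the output of the dual solver:

```python
import numpy as np

def kernel(a, b):
    return (1.0 + a @ b) ** 2            # example Mercer kernel (degree-2 polynomial)

def decision(x_new, X, y, alpha, w0):
    # w^T phi(x_new) + w_0 = w_0 + sum_i alpha_i y_i k(x_i, x_new); phi is never formed explicitly
    return w0 + sum(alpha[i] * y[i] * kernel(X[i], x_new) for i in range(len(y)))

# Placeholder values standing in for the output of the dual solver.
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.1, 0.1])
w0 = 0.0
print(np.sign(decision(np.array([1.5, 1.0]), X, y, alpha, w0)))
```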
Suppose we are using the dual formulation of SVMs, which requires solving
\[\begin{aligned}
\text{maximise:} \quad &\sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y_i y_j (\pmb x_i \cdot \pmb x_j) \\\\
\text{subject to:} \quad &\sum^N_{i=1} \alpha_i y_i = 0 \\\\
&0 \le \alpha_i \le C
\end{aligned}\]
The weight vector can then be written in the form
\[\pmb w = \sum^N_{i = 1} \alpha_i y_i \pmb x_i\]
How can you then compute the bias $w _ 0$ (even in the case where we are using a kernel)?
The support vectors with $0 < \alpha _ i < C$ have $\zeta _ i = 0$, so the corresponding $\pmb x _ i$ lie exactly on the margin. For any such point we can solve
\[y_i (\pmb w^\top \pmb x_i + w_0) = 1 \implies w_0 = y_i - \pmb w^\top \pmb x_i\]Note that $y^{-1} _ i = y _ i$ since $y _ i = \pm 1$. In the kernelised case, $\pmb w^\top \phi(\pmb x _ i) = \sum^N _ {j=1} \alpha _ j y _ j \kappa(\pmb x _ j, \pmb x _ i)$, so $w _ 0$ can still be computed without forming $\phi$ explicitly (in practice it is often averaged over all such support vectors).
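In code, this might look like the following numpy sketch (assuming `alpha` came from the kernelised dual); `bias_from_duals` is just an illustrative helper, and averaging over all margin support vectors is for numerical stability:

```python
import numpy as np

def bias_from_duals(X, y, alpha, C, kernel, tol=1e-8):
    """Recover w_0 from the margin support vectors (0 < alpha_i < C), kernelised."""
    K = np.array([[kernel(a, b) for b in X] for a in X])   # Gram matrix
    f = (alpha * y) @ K                                    # f_i = w^T phi(x_i), computed via the kernel
    on_margin = (alpha > tol) & (alpha < C - tol)          # support vectors exactly on the margin
    return np.mean(y[on_margin] - f[on_margin])            # w_0 = y_i - w^T phi(x_i), averaged

# usage (alpha assumed to come from the dual solver):
# w0 = bias_from_duals(X, y, alpha, C=1.0, kernel=lambda a, b: (1 + a @ b) ** 2)
```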
Suppose we are considering the dual formulation of SVMs:
\[\begin{aligned}
\text{maximise:} \quad &\sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y_i y_j \kappa(\pmb x_i, \pmb x_j) \\\\
\text{subject to:} \quad &\sum^N_{i=1} \alpha_i y_i = 0 \\\\
&0 \le \alpha_i \le C
\end{aligned}\]
How can you deduce that the points where $\alpha _ i \ne 0$ correspond to the support vectors?
Complementary slackness implies
\[\alpha_i \left( y_i (\pmb w^\top \pmb x_i + w_0) - 1 + \zeta_i \right) = 0\]So either $\alpha _ i = 0$ or $y _ i(\pmb w^\top \pmb x _ i + w _ 0) = 1 - \zeta _ i$. If $\alpha _ i$ is nonzero, we can then conclude that $\pmb x _ i$ is a support vector.
The dual formulation of SVMs in the kernelised case is:
\[\begin{aligned}
\text{maximise:} \quad &\sum^N_{i = 1} \alpha_i - \frac 1 2 \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y_i y_j \kappa (\pmb x_i, \pmb x_j) \\\\
\text{subject to:} \quad &\sum^N_{i=1} \alpha_i y_i = 0 \\\\
&0 \le \alpha_i \le C
\end{aligned}\]
Can you write the objective function $g(\pmb \alpha)$ succinctly in vector notation, and in doing so explain why Mercer kernels are important for SVMs?
\[g(\pmb \alpha) = \pmb 1^\top \pmb \alpha - \frac 1 2 (\pmb \alpha \odot \pmb y)^\top \mathbf K (\pmb \alpha \odot \pmb y)\]where $\odot$ is the Hadamard product (i.e. elementwise product) and $\mathbf K$ is the Gram matrix with $\mathbf K_{ij} = \kappa(\pmb x_i, \pmb x_j)$. Mercer kernels imply that $\mathbf K$ is positive semidefinite and therefore that $g(\pmb \alpha)$ is concave, so the dual is a well-behaved concave maximisation problem.
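A numpy rendering of this vector form, together with a check that the Gram matrix of a (Mercer) polynomial kernel is positive semidefinite; the data and the test value of $\pmb\alpha$ are arbitrary:

```python
import numpy as np

def dual_objective(alpha, y, K):
    v = alpha * y                                # alpha ⊙ y (elementwise product)
    return alpha.sum() - 0.5 * v @ K @ v         # g(alpha) = 1^T alpha - (1/2)(alpha⊙y)^T K (alpha⊙y)

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2                         # Gram matrix of the degree-2 polynomial kernel

print(bool(np.all(np.linalg.eigvalsh(K) >= -1e-9)))   # PSD check: eigenvalues (numerically) >= 0
print(dual_objective(np.array([0.1, 0.0, 0.1]), y, K))
```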