Notes - Machine Learning MT23, Classification
Flashcards
What is one-vs-one (ovo) for using binary classifiers for $k$-class classification?
Train $\binom{K}{2}$ binary classifiers, one for each possible pair of classes, and then predict the most commonly occurring label among their outputs (a majority vote).
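A minimal sketch of the one-vs-one scheme (not from the notes); the iris dataset and `LogisticRegression` base classifier are purely illustrative choices, assuming scikit-learn is available:

```python
# Manual one-vs-one sketch: train one binary classifier per pair of classes,
# then predict by majority vote over the pairwise decisions.
from itertools import combinations
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One classifier per unordered pair of classes: K choose 2 in total.
pair_clfs = {}
for a, b in combinations(classes, 2):
    mask = (y == a) | (y == b)
    pair_clfs[(a, b)] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

def predict_ovo(X_new):
    # Each pairwise classifier casts one vote per point; take the most common label.
    votes = np.stack([clf.predict(X_new) for clf in pair_clfs.values()])
    return np.array([np.bincount(col, minlength=classes.max() + 1).argmax()
                     for col in votes.T])

print(predict_ovo(X[:5]), y[:5])
```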
What is one-vs-rest (ovr) for using binary classifiers for $k$-class classification?
Train $K$ binary classifiers where, for the $k$-th classifier, class $k$ is positive and all other classes are negative; predict the class whose classifier is most confident.
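A minimal one-vs-rest sketch in the same illustrative setting (scikit-learn, iris, `LogisticRegression` are all assumptions, not part of the notes):

```python
# Manual one-vs-rest sketch: train K binary classifiers ("class k" vs "not class k"),
# then predict the class whose classifier gives the highest score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One classifier per class, trained on "is class k?" labels.
clfs = [LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int)) for k in classes]

# Score each point under every classifier and pick the most confident one.
scores = np.column_stack([clf.decision_function(X) for clf in clfs])
y_pred = classes[scores.argmax(axis=1)]
print((y_pred == y).mean())
```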
Can you draw a $2 \times 2$ table for the possible mistakes a binary classifier can make, and label what is meant by Type I and Type II errors?
|                  | Actual label = yes       | Actual label = no       |
| ---------------- | ------------------------ | ----------------------- |
| Prediction = yes | True positive            | False positive (Type I) |
| Prediction = no  | False negative (Type II) | True negative           |
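As a concrete illustration (not from the notes), the four counts can be tallied directly from toy labels and predictions; the data here is made up:

```python
# Counting the four outcomes for a binary classifier on toy labels.
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted yes, actually yes
fp = np.sum((y_pred == 1) & (y_true == 0))   # predicted yes, actually no  (Type I)
fn = np.sum((y_pred == 0) & (y_true == 1))   # predicted no,  actually yes (Type II)
tn = np.sum((y_pred == 0) & (y_true == 0))   # predicted no,  actually no
print(tp, fp, fn, tn)                        # 3 1 1 3
```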
Can you define the True Positive Rate (TPR) of a binary classifier?
\[\text{TPR} = \frac{\text{TP} }{\text{TP} + \text{FN} }\]
Can you define the False Positive Rate (FPR) of a binary classifier?
\[\text{FPR} = \frac{\text{FP} }{\text{FP} + \text{TN} }\]
Can you define the precision of a binary classifier?
\[\text{Precision} = \frac{\text{TP} }{\text{TP} + \text{FP} }\]
Can you describe the TPR, FPR, precision and recall of a binary classifier?
- TPR: the proportion of actual positives that are correctly predicted positive, $\text{TP} / (\text{TP} + \text{FN})$.
- FPR: the proportion of actual negatives that are incorrectly predicted positive, $\text{FP} / (\text{FP} + \text{TN})$.
- Precision: the proportion of positive predictions that are actually positive, $\text{TP} / (\text{TP} + \text{FP})$.
- Recall: the proportion of actual positives that are correctly predicted positive, $\text{TP} / (\text{TP} + \text{FN})$.
Note recall and TPR are exactly the same.
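A quick sketch computing these metrics from the four counts (the counts are the toy values from the example above, made up for illustration):

```python
# Computing TPR/recall, FPR and precision from the four confusion-matrix counts.
tp, fp, fn, tn = 3, 1, 1, 3

tpr = tp / (tp + fn)          # true positive rate == recall
fpr = fp / (fp + tn)          # false positive rate
precision = tp / (tp + fp)    # fraction of positive predictions that are correct
print(tpr, fpr, precision)    # 0.75 0.25 0.75
```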
What is a ROC (receiver operating characteristic) curve?
A plot of TPR against FPR as the classifier's decision threshold (or some other parameter controlling both) is varied, showing the tradeoff between them.
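A rough sketch of how the ROC points are traced out by sweeping a threshold over predicted scores; the scores and labels are made up, and `sklearn.metrics.roc_curve` computes essentially the same points:

```python
# Trace out ROC points by sweeping a decision threshold over predicted scores.
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9, 0.6, 0.2])   # e.g. predicted probabilities

for t in np.sort(np.unique(scores))[::-1]:
    y_pred = (scores >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```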
How can you view the true positive, false negative, etc. matrix as a specific case of a confusion matrix?
It’s a confusion matrix for a binary classifier.
Suppose we have a single data point $(\pmb x _ i, y _ i)$ for some classification problem. In terms of a prediction $\hat y _ i$ given by a model:
- Give the hinge loss for this data point (highlighting what range we expect $y _ i$ to be in)
- Give the log loss for this data point (again highlighting what range we expect $y _ i$ to be in)
- Describe what model you actually end up with using a linear model and hinge loss on all data points plus a regularisation term
- Describe what model you end up with using sigmoid applied to a linear function and log loss
Hinge loss for a data point: we expect $y _ i = \pm 1$, and our model outputs a real-valued score $\hat y _ i$ whose sign is used to classify the point
\[\ell_{\text{hinge} }(\hat y_i \mid \pmb x_i, y_i) := \max\{0, 1 - y_i \hat y_i\}\]
Log loss for a data point: we expect $y _ i = 0$ or $y _ i = 1$, and our model is outputting $\hat y _ i \in [0, 1]$ (which can be interpreted as a probability)
\[\ell_{\text{log} }(\hat y_i \mid \pmb x_i, y_i) = -y_i \log(\hat y_i) - (1-y_i) \log(1-\hat y_i)\]
If you have data $\{(\pmb x _ i, y _ i)\}$ and use hinge loss for each data point to get an overall loss function for your model, you end up with the loss function for soft-margin SVMs:
\[\mathcal L(\pmb w, w_0 \mid \pmb X, \pmb y) = C\sum^N_{i = 1} \ell_\text{hinge}(\pmb w^\top \pmb x_i + w_0 \mid \pmb x_i, y_i) + \frac 1 2 ||\pmb w||^2_2\]
If instead you use log loss for each data point with sigmoid applied to a linear function, you get the loss function for logistic regression:
\[\begin{aligned} \mathcal L(\pmb w, w_0 \mid \pmb X, \pmb y) &= \sum^N_{i = 1} \ell_\text{log}(\sigma(\pmb w^\top \pmb x_i + w_0) \mid \pmb x_i, y_i) \\\\ &= -\sum^N_{i = 1} \Big[ y_i\log(\sigma(\pmb w^\top \pmb x_i + w_0)) + (1-y_i)\log(1 - \sigma(\pmb w^\top \pmb x_i + w_0)) \Big] \end{aligned}\]
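A short sketch (not from the notes) evaluating both per-point losses and the two resulting objectives; the data, the value of $C$ and the parameter values $\pmb w, w_0$ are made up for illustration:

```python
# Evaluate the hinge-loss (soft-margin SVM) and log-loss (logistic regression)
# objectives at an arbitrary choice of (w, w0) on toy data.
import numpy as np

def hinge_loss(y_hat, y):            # y in {-1, +1}, y_hat a real-valued score
    return np.maximum(0.0, 1.0 - y * y_hat)

def log_loss(y_hat, y):              # y in {0, 1}, y_hat in (0, 1)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5], [-2.0, 0.5]])
y_pm = np.array([1, 1, -1, -1])      # +/-1 labels for the hinge-loss view
y_01 = (y_pm + 1) // 2               # 0/1 labels for the log-loss view

w, w0, C = np.array([0.5, -0.3]), 0.1, 1.0
scores = X @ w + w0                  # linear scores w^T x + w0

svm_objective = C * hinge_loss(scores, y_pm).sum() + 0.5 * w @ w
logreg_objective = log_loss(sigmoid(scores), y_01).sum()
print(svm_objective, logreg_objective)
```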
The log-loss for a single data point $(\pmb x _ i, y _ i)$ in a classification task is given by
\[\ell_{\text{log} }(\hat y_i \mid \pmb x_i, y_i) = -y_i \log(\hat y_i) - (1-y_i) \log(1-\hat y_i)\]
where we have $y _ i \in \{0, 1\}$ and $\hat y _ i \in [0, 1]$ is a prediction from a classification model. Quickly derive that minimising this is in fact the same as maximising the probability that the data point was generated by the model.
Interpret the model's output as a probability: $\hat y _ i = P(y _ i = 1 \mid \pmb x _ i)$. Then the probability the model assigns to the observed label is
\[P(y_i \mid \pmb x_i) = \hat y_i^{\, y_i} (1 - \hat y_i)^{1 - y_i}\]
Taking the negative logarithm,
\[-\log P(y_i \mid \pmb x_i) = -y_i \log(\hat y_i) - (1-y_i)\log(1-\hat y_i) = \ell_{\text{log} }(\hat y_i \mid \pmb x_i, y_i)\]
so minimising this is in fact maximising the likelihood of the data point given the model.
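A quick numerical sanity check of this equivalence (the values of $y _ i$ and $\hat y _ i$ are chosen arbitrarily):

```python
# Check that the log loss equals the negative log of the Bernoulli likelihood
# y_hat**y * (1 - y_hat)**(1 - y) for an arbitrary label and prediction.
import numpy as np

y, y_hat = 1, 0.8
log_loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
neg_log_likelihood = -np.log(y_hat**y * (1 - y_hat)**(1 - y))
print(np.isclose(log_loss, neg_log_likelihood))   # True
```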