Notes - Machine Learning MT23, Classification


Flashcards

What is one-vs-one (ovo) for using binary classifiers for $k$-class classification?


Train $\binom{K}{2}$ binary classifiers, one for each possible pair of classes, and then predict the most commonly occurring label among their outputs (a majority vote).
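A minimal numpy sketch of how this could look; the `make_classifier` factory and the `fit`/`predict` interface of the binary classifier are illustrative assumptions:

```python
# One-vs-one: a minimal sketch. `make_classifier` is an illustrative factory
# returning a binary classifier with fit(X, y) / predict(X); binary labels are ±1.
from itertools import combinations
import numpy as np

def fit_ovo(X, y, make_classifier):
    """Train one binary classifier per pair of classes."""
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        clf = make_classifier()
        clf.fit(X[mask], np.where(y[mask] == a, 1, -1))  # class a -> +1, class b -> -1
        models[(a, b)] = clf
    return classes, models

def predict_ovo(X, classes, models):
    """Each pairwise classifier votes; predict the most commonly occurring label."""
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), clf in models.items():
        pred = clf.predict(X)
        votes[pred == 1, index[a]] += 1
        votes[pred == -1, index[b]] += 1
    return classes[np.argmax(votes, axis=1)]
```

In practice a library wrapper such as `sklearn.multiclass.OneVsOneClassifier` does essentially this.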

What is one-vs-rest (ovr) for using binary classifiers for $k$-class classification?


Train $K$ binary classifiers, where the $k$-th classifier treats class $k$ as positive and all other classes as negative.
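A corresponding sketch for one-vs-rest; `make_classifier` is again illustrative, and using a real-valued score (sklearn-style `decision_function`) to choose between the $K$ classifiers is an assumption:

```python
# One-vs-rest: a minimal sketch.
import numpy as np

def fit_ovr(X, y, make_classifier):
    """Train one binary classifier per class: that class vs. everything else."""
    classes = np.unique(y)
    models = []
    for c in classes:
        clf = make_classifier()
        clf.fit(X, np.where(y == c, 1, -1))  # class c -> +1, all others -> -1
        models.append(clf)
    return classes, models

def predict_ovr(X, classes, models):
    """Predict the class whose classifier gives the highest positive score."""
    scores = np.column_stack([clf.decision_function(X) for clf in models])
    return classes[np.argmax(scores, axis=1)]
```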

Can you draw a $2 \times 2$ table for the possible mistakes a binary classifier can make, and label what is meant by Type I and Type II errors?


  • Prediction = yes, Actual label = yes: True positive
  • Prediction = yes, Actual label = no: False positive (Type I)
  • Prediction = no, Actual label = yes: False negative (Type II)
  • Prediction = no, Actual label = no: True negative
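A small sketch of counting these four outcomes from predictions, on toy data with labels in $\{0, 1\}$:

```python
# Counting the four outcomes for a binary classifier on toy data (labels in {0, 1}).
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives (Type I errors)
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives (Type II errors)
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
```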

Can you define the True Positive Rate (TPR) of a binary classifier?


\[\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]

Can you define the False Positive Rate (FPR) of a binary classifier?


\[\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}\]

Can you define the precision of a binary classifier?


\[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]

Can you describe the TPR, FPR, precision and recall of a binary classifier?


\[\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}\] \[\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}\] \[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\] \[\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]

Note recall and TPR are exactly the same.
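A quick sketch of these four metrics computed from toy confusion-matrix counts:

```python
# The four metrics in terms of the confusion-matrix counts (toy numbers).
tp, fp, fn, tn = 40, 10, 5, 45

tpr       = tp / (tp + fn)   # true positive rate
fpr       = fp / (fp + tn)   # false positive rate
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

assert recall == tpr         # recall and TPR are the same quantity
```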

What is a ROC (receiver operating characteristic) curve?


A plot showing the trade-off between the FPR and TPR of a classifier as a parameter controlling it (typically the decision threshold) is varied.
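A minimal sketch of tracing out a ROC curve by sweeping the decision threshold over real-valued classifier scores (toy scores and labels; in practice something like `sklearn.metrics.roc_curve` computes these points):

```python
# Tracing a ROC curve by sweeping the decision threshold over real-valued scores.
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1,   1,   0,   1,    0,   1,   0,   0])

for t in np.sort(scores)[::-1]:
    y_pred = (scores >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(f"threshold={t:.2f}  TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")
```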

How can you view the true positive, false negative, etc. matrix as a specific case of a confusion matrix?


It’s a confusion matrix for a binary classifier.

Suppose we have a single data point $(\pmb x _ i, y _ i)$ for some classification problem. In terms of a prediction $\hat y _ i$ given by a model:

  • Give the hinge loss for this data point (highlighting what range we expect $y _ i$ to be in)
  • Give the log loss for this data point (again highlighting what range we expect $y _ i$ to be in)
  • Describe what model you actually end up with using a linear model and hinge loss on all data points plus a regularisation term
  • Describe what model you end up with using sigmoid applied to a linear function and log loss

Hinge loss for a data point: we expect $y _ i = \pm 1$, and our model outputs a real-valued score $\hat y _ i$ whose sign gives the predicted class.

\[\ell_{\text{hinge} }(\hat y_i \mid \pmb x_i, y_i) := \max\{0,\ 1 - y_i\hat y_i\}\]

Log loss for a data point: we expect $y _ i = 0$ or $y _ i = 1$, and our model outputs $\hat y _ i \in [0, 1]$, which can be interpreted as the predicted probability that $y _ i = 1$.

\[\ell_{\text{log} }(\hat y_i \mid \pmb x_i, y_i) = -y_i \log(\hat y_i) - (1-y_i) \log(1-\hat y_i)\]
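A small sketch of the two per-point losses, highlighting the different label conventions:

```python
# Per-point hinge loss and log loss, with their different label conventions.
import numpy as np

def hinge_loss(y_hat, y):
    """Hinge loss: y in {-1, +1}, y_hat a real-valued score."""
    return np.maximum(0.0, 1.0 - y * y_hat)

def log_loss(y_hat, y, eps=1e-12):
    """Log loss: y in {0, 1}, y_hat in (0, 1) interpreted as P(y = 1 | x)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)
```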

If you have data $\{(\pmb x _ i, y _ i)\}$ and sum the hinge loss of a linear model over all data points, adding an $\ell _ 2$ regularisation term, you end up with the loss function for soft-margin SVMs:

\[\mathcal L(\pmb w, w_0 \mid \pmb X, \pmb y) = C\sum^N_{i = 1} \ell_\text{hinge}(\pmb w^\top \pmb x_i + w_0 \mid \pmb x_i, y_i) + \frac 1 2 ||\pmb w||^2_2\]
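A sketch of evaluating this objective, assuming $\pmb X$ has shape $(N, d)$ and $\pmb y \in \{-1, +1\}^N$:

```python
# Soft-margin SVM objective: summed hinge loss of the linear model plus an
# L2 penalty on the weights; C trades data fit against margin width.
import numpy as np

def svm_loss(w, w0, X, y, C=1.0):
    scores = X @ w + w0  # raw linear scores
    return C * np.sum(np.maximum(0.0, 1.0 - y * scores)) + 0.5 * np.dot(w, w)
```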

If instead you use log loss for each data point with sigmoid applied to a linear function, you get the loss function for logistic regression:

\[\begin{aligned} \mathcal L(\pmb w, w_0 \mid \pmb X, \pmb y) &= \sum^N_{i = 1} \ell_\text{log}(\sigma(\pmb w^\top \pmb x_i + w_0) \mid \pmb x_i, y_i) \\\\ &= -\sum^N_{i = 1} \Big[ y_i\log(\sigma(\pmb w^\top \pmb x_i + w_0)) + (1-y_i)\log(1 - \sigma(\pmb w^\top \pmb x_i + w_0)) \Big] \end{aligned}\]
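And a corresponding sketch of the logistic regression objective, assuming $\pmb y \in \{0, 1\}^N$:

```python
# Logistic regression objective: log loss of a sigmoid applied to a linear model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, w0, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ w + w0), eps, 1.0 - eps)  # predicted P(y = 1 | x)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```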

The log-loss for a single data point $(\pmb x _ i, y _ i)$ in a classification task is given by

\[\ell_{\text{log} }(\hat y_i \mid \pmb x_i, y_i) = -y_i \log(\hat y_i) - (1-y_i) \log(1-\hat y_i)\]

where we have $y _ i \in \{0, 1\}$ and $\hat y _ i \in [0, 1]$ is a prediction from a classification model. Quickly derive that minimising this is in fact the same as maximising the probability that the data point was generated by the model.


\[\begin{aligned} \ell_{\text{log} }(\hat y_i \mid \pmb x_i, y_i) &= -y_i \log(\hat y_i) - (1-y_i) \log(1-\hat y_i) \\\\ &= -\log( \hat y_i^{y_i} (1-\hat y_i)^{1 - y_i} ) \\\\ &= -\log(\hat y_i \cdot \mathbb 1 [y_i = 1] + (1-\hat y_i)\cdot \mathbb 1[y_i = 0]) \\\\ &= \text{NLL}(\hat y _i \mid \pmb x_i, y_i) \end{aligned}\]

so minimising this is in fact maximising the likelihood of the data point given the model.
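A quick numerical check of this equivalence (toy value for $\hat y _ i$):

```python
# Numerical check: log loss equals the negative log-likelihood of the observed label.
import numpy as np

y_hat = 0.8  # model's predicted P(y = 1 | x)
for y in (0, 1):
    ll = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    nll = -np.log(y_hat if y == 1 else 1 - y_hat)
    assert np.isclose(ll, nll)
```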

Proofs



