Notes - Machine Learning MT23, Logistic regression
Flashcards
What type of machine learning problem does logistic regression solve?
Binary classification.
Is logistic regression a generative or discriminative method?
Discriminative: it models $p(y \mid \pmb x)$ directly rather than the joint distribution $p(\pmb x, y)$.
Can you define $\sigma(x)$, the sigmoid function?
\[\sigma(x) = \frac{1}{1 + \exp(-x)}\]
How does logistic regression model $p(y = 1 \mid \pmb x, \pmb w)$?
Logistic regression models $p(y = 1 \mid \pmb x, \pmb w)$ as $\sigma(\pmb w^T \pmb x) = \frac{1}{1 + \exp(-\pmb w^T \pmb x)}$.
How do we then use this to make predictions, i.e. decide the category of $\pmb x _ \text{new}$?
Check whether $\sigma(\pmb w^T \pmb x _ \text{new}) > 1/2$ (or some other threshold value); if so, predict $y = 1$, otherwise predict $y = 0$.
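A quick sketch of this decision rule in NumPy (assuming `w` and `x_new` are 1-D arrays of the same length; the names are illustrative):

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def predict(w, x_new, threshold=0.5):
    # p(y = 1 | x_new, w) = sigma(w^T x_new); predict class 1 above the threshold
    return int(sigmoid(w @ x_new) > threshold)
```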
Give the negative log-likelihood $\text{NLL}(\pmb y \mid \pmb X, \pmb w)$ for a logistic regression model.
First,
\[p(\pmb y \mid \pmb X, \pmb w) = \prod^N_{i=1} \sigma(\pmb w^T \pmb x_i)^{y_i} (1-\sigma(\pmb w^T \pmb x_i))^{1-y_i}\]
Then,
\[\text{NLL}(\pmb y \mid \pmb X, \pmb w) = -\sum^N_{i=1}(y_i \log(\sigma(\pmb w^T \pmb x_i)) + (1-y_i)\log(1-\sigma(\pmb w^T \pmb x_i)))\]
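Translated directly into NumPy (a sketch reusing the `sigmoid` above, with `X` an $N \times D$ design matrix and `y` a vector of 0/1 labels):

```python
def nll(w, X, y):
    # mu_i = sigma(w^T x_i), computed for all rows of X at once
    mu = sigmoid(X @ w)
    # NLL = -sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ]
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
```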
What is the “iteratively reweighted least squares” method?
A technique for finding the maximum-likelihood parameters of a logistic regression model, based on iteratively solving a weighted least squares problem.
Define the softmax function on a vector $a \in \mathbb R^C$, and describe why it is useful.
\[\text{softmax}([a_1, \ldots, a_C]^T) = \left[\frac{e^{a_1}}{Z}, \ldots, \frac{e^{a_C}}{Z}\right]^T\]
where
\[Z = \sum^C_{i=1} e^{a_i}\]
Useful because it converts an unbounded vector of $C$ real numbers into a probability distribution over $C$ categories.
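A sketch in NumPy; subtracting $\max_i a_i$ before exponentiating (an implementation trick assumed here, not part of the definition) avoids overflow and changes nothing, since softmax is invariant to shifting every $a_i$ by the same constant:

```python
def softmax(a):
    # Shift for numerical stability: softmax(a - max(a)) == softmax(a)
    e = np.exp(a - np.max(a))
    return e / np.sum(e)  # divides each e^{a_i} by Z = sum_i e^{a_i}
```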
In multiclass logistic regression, how is $p(y \mid \pmb x, \pmb W)$ where $\pmb x \in \mathbb R^D$ and $\pmb W \in \mathbb R^{D \times C}$ defined (let $\pmb w _ c$ denote the $c$-th column of $\pmb W$)?
\[p(y = c \mid \pmb x, \pmb W) = \text{softmax}(\pmb W^T \pmb x)_c = \frac{\exp(\pmb w_c^T \pmb x)}{\sum^C_{c' = 1} \exp(\pmb w_{c'}^T \pmb x)}\]
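Correspondingly, a minimal prediction sketch (assuming the `softmax` above, with `W` of shape `(D, C)` and `x` of shape `(D,)`):

```python
def predict_multiclass(W, x):
    # p(y = c | x, W) = softmax(W^T x)_c; pick the most probable class
    probs = softmax(W.T @ x)
    return int(np.argmax(probs))
```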
The NLL of logistic regression is given by
\[-\sum^N_{i=1} (y_i \log \mu_i + (1 - y_i)\log(1 - \mu_i))\]
where
\[\mu_i = \sigma(\pmb w^\top \pmb x_i)\]
Quickly derive $\partial _ {\pmb w} \text{NLL}$ and the Hessian, and use Newton’s method to define an update rule that can be used to calculate the weights.
Differentiating the NLL gives
\[\pmb g = \partial _ {\pmb w} \text{NLL} = \sum^N_{i=1} (\mu_i - y_i) \pmb x_i = X^\top (\pmb \mu - \pmb y)\]
and differentiating once more gives the Hessian
\[\pmb H = \sum^N_{i=1} \mu_i (1 - \mu_i) \pmb x_i \pmb x_i^\top = X^\top S X\]
where
\[S := \text{diag}(\mu_i (1 - \mu_i))\]
It can be shown $S$ is positive definite, so $X^\top S X$ is positive semidefinite.
Then
\[\begin{aligned} \pmb g_t &= X^\top (\pmb \mu_t - \pmb y) = -X^\top (\pmb y - \pmb \mu_t) \\\\ \pmb H_t &= X^\top S_t X \end{aligned}\]
So Newton’s update rule gives
\[\begin{aligned} \pmb w_{t + 1} &= \pmb w_t - \pmb H_t^{-1} \pmb g_t \\\\ &= \pmb w_t + (X^\top S_t X)^{-1} X^\top (\pmb y - \pmb \mu_t) \\\\ &= (X^\top S_t X)^{-1} X^\top S_t (X \pmb w_t + S_t^{-1} (\pmb y - \pmb \mu_t)) \\\\ &= (X^\top S_t X)^{-1}(X^\top S_t \pmb z_t) \end{aligned}\]
where
\[\pmb z_t = X\pmb w_t + S^{-1}_t (\pmb y - \pmb \mu_t)\]
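This update is the whole of IRLS. A minimal sketch in NumPy (assuming the `sigmoid` above; it solves the linear system rather than forming the inverse, and a real implementation would add a convergence test, regularization, and care when the $S_{ii}$ underflow):

```python
def irls(X, y, num_iters=20):
    # Iteratively reweighted least squares for the logistic regression MLE
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        mu = sigmoid(X @ w)        # mu_i = sigma(w^T x_i)
        s = mu * (1 - mu)          # diagonal entries of S_t
        z = X @ w + (y - mu) / s   # working response z_t
        XtS = X.T * s              # X^T S_t, without materialising S_t
        # Solve (X^T S_t X) w = X^T S_t z_t
        w = np.linalg.solve(XtS @ X, XtS @ z)
    return w
```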
Deriving the update used in Newton’s method for the MLE of logistic regression gives
\[\pmb w_{t + 1} = (X^\top S_t X)^{-1}(X^\top S_t \pmb z_t)\]
where
\[\pmb z_t = X\pmb w_t + S^{-1}_t (\pmb y - \pmb \mu_t)\]
and
\[S := \text{diag}(\mu_i (1 - \mu_i))\]
How can you recognise this is a solution to a least squares problem?
It’s equivalent to the solution of
\[\min_{\pmb w} \quad \sum^N_{i = 1}S_{t,ii} (z_{t,i} - \pmb w^\top \pmb x_i)^2\]
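This can be sanity-checked numerically (a sketch using the illustrative `irls` and `sigmoid` above on synthetic data): at a fixed point of IRLS, re-solving the weighted least squares problem reproduces the weights.

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)

w = irls(X, y)
mu = sigmoid(X @ w)
s = mu * (1 - mu)
z = X @ w + (y - mu) / s
XtS = X.T * s
w_wls = np.linalg.solve(XtS @ X, XtS @ z)  # weighted least squares solution
print(np.max(np.abs(w - w_wls)))  # approximately 0 once IRLS has converged
```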