Notes - Machine Learning MT23, Gaussian discriminant analysis


Flashcards

When doing Gaussian discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]

How do we then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$?


\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]
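As a quick illustration (a minimal sketch with made-up parameters, assuming NumPy/SciPy), this class-conditional density can be evaluated directly:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for a single class c in D = 2 dimensions.
mu_c = np.array([0.0, 1.0])
Sigma_c = np.array([[2.0, 0.3],
                    [0.3, 1.0]])

x = np.array([0.5, 0.5])

# p(x | y = c, theta_c) = N(x | mu_c, Sigma_c)
print(multivariate_normal.pdf(x, mean=mu_c, cov=Sigma_c))
```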

Gaussian discriminant analysis and Naïve Bayes models both use the generative framework. How do they differ in their treatment of $p(\pmb x \mid y = c, \pmb \theta _ c)$?


Naïve Bayes makes a strong assumption about conditional independence of features given class labels:

\[p(\pmb x \mid y = c, \pmb \theta_c) = \prod^D_{j=1} p(x_j \mid y = c, \pmb \theta_{jc})\]

Gaussian discriminant analysis makes the assumption that the probabilities for a particular class come from a multivariate Gaussian:

\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]
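To make the contrast concrete, here is a minimal sketch (Python with NumPy/SciPy, made-up parameters; Gaussian Naïve Bayes is assumed, so the per-feature factors are univariate Gaussians). Naïve Bayes with Gaussian features amounts to a multivariate Gaussian with a diagonal covariance, whereas GDA permits a full covariance matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

x = np.array([0.5, 0.5])
mu_c = np.array([0.0, 1.0])

# Gaussian Naive Bayes: product of independent univariate densities, one per feature.
sigma2_c = np.array([2.0, 1.0])                       # per-feature variances
p_nb = np.prod(norm.pdf(x, loc=mu_c, scale=np.sqrt(sigma2_c)))

# Equivalently, a multivariate Gaussian with a *diagonal* covariance.
p_nb_equiv = multivariate_normal.pdf(x, mean=mu_c, cov=np.diag(sigma2_c))

# GDA: a full covariance matrix, so correlations between features are modelled.
Sigma_c = np.array([[2.0, 0.8],
                    [0.8, 1.0]])
p_gda = multivariate_normal.pdf(x, mean=mu_c, cov=Sigma_c)

print(p_nb, p_nb_equiv, p_gda)   # the first two agree; the third differs in general
```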

In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:

\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]

What is it called if we make no further assumptions about the parameters $\pmb \mu _ c$ and $\pmb \Sigma _ c$ and why?


Quadratic discriminant analysis, since we get quadratic separating surfaces.

In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:

\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]

What do we need to assume about the parameters $\pmb \mu _ c$ and $\pmb \Sigma _ c$ to get “linear discriminant analysis”?


The covariance matrix is shared across all classes, i.e. $\pmb \Sigma _ c = \pmb \Sigma$ for every class $c$.

In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:

\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]

What is it called if we assume that $\pmb \Sigma _ c$ is the same across all classes, and why?


Linear discriminant analysis, since we get linear separating surfaces.

Does Gaussian discriminant analysis use a generative or discriminative framework?


Generative.

When doing Gaussian discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]

We then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$ as

\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]

Quickly prove that the decision boundaries are given by piecewise quadratic surfaces.


Suppose we have two classes $c$ and $c'$ and we want to find the decision boundary between them. This is the set of $\pmb x$ satisfying

\[p(y = c \mid \pmb x, \pmb \theta) = p(y = c' \mid \pmb x, \pmb \theta)\]

We have the general decision rule for generative models that

\[p(y = c \mid \pmb x, \pmb \theta) = \frac{p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)}{\sum^C_{\hat c=1} p(y = \hat c \mid \pmb \theta)p(\pmb x \mid y = \hat c, \pmb \theta)}\]

Or, since the denominator does not depend on the class,

\[p(y = c \mid \pmb x, \pmb \theta) \propto p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)\]

So the decision boundaries are given by

\[p(y = c \mid \pmb \theta)p(\pmb x \mid y = c, \pmb \theta) = p(y=c'\mid \pmb \theta)p(\pmb x \mid y = c', \pmb \theta)\]

Substituting in the Gaussian class-conditional densities and writing this condition as a ratio equal to $1$,

\[\frac{\pi_ c (2\pi)^{-D/2} |\pmb \Sigma_ c|^{-1/2}\exp\left(- \frac 1 2 ( \pmb x - \pmb \mu_ c)^\top \pmb \Sigma^{-1}_ c (\pmb x - \pmb \mu_ c )\right)}{\pi_ {c'} (2\pi)^{-D/2} |\pmb \Sigma_ {c'}|^{-1/2}\exp\left(- \frac 1 2 ( \pmb x - \pmb \mu_ {c'})^\top \pmb \Sigma^{-1}_ {c'} (\pmb x - \pmb \mu_ {c'} )\right)} = 1\]

With further rearranging (the $(2\pi)^{-D/2}$ factors cancel), we see

\[(\pmb x - \pmb \mu_ c)^\top \pmb \Sigma^{-1}_ c (\pmb x - \pmb \mu_ c ) - (\pmb x - \pmb \mu_ {c'})^\top \pmb \Sigma^{-1}_ {c'} (\pmb x - \pmb \mu_ {c'} ) = -2\log\left(\frac{\pi_ {c'} |\pmb \Sigma_ {c'}|^{-1/2}}{\pi_ {c} |\pmb \Sigma_ {c}|^{-1/2}}\right)\]

which defines a quadratic surface. The boundaries are piecewise because, with more than two classes, even where the probabilities of $\pmb x$ belonging to classes $c$ and $c'$ are equal, there may be a third class $c''$ to which $\pmb x$ belongs with higher probability than either, so only part of each quadratic surface forms an actual decision boundary.
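The argument also suggests how one might classify in practice: pick the class maximising $\log \pi_c + \log \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)$, which is quadratic in $\pmb x$ for each class. A minimal sketch with hypothetical two-class parameters (Python, NumPy/SciPy):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical QDA parameters for two classes.
pis = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.array([[1.0, 0.2], [0.2, 1.0]]),
          np.array([[2.0, -0.3], [-0.3, 0.5]])]

def quadratic_discriminants(x):
    # log pi_c + log N(x | mu_c, Sigma_c): a quadratic function of x for each class c.
    return np.array([
        np.log(pis[c]) + multivariate_normal.logpdf(x, mean=mus[c], cov=Sigmas[c])
        for c in range(len(pis))
    ])

x = np.array([1.0, 0.5])
print(np.argmax(quadratic_discriminants(x)))   # predicted class
# On the boundary between classes c and c', the two discriminants are equal.
```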

When doing linear discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]

We then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$ as

\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma)\]

Quickly prove, in one go, that LDA has linear decision boundaries, and that

\[p(y = c \mid \pmb x, \pmb \theta) = \text{softmax}(\pmb \eta)_c\]

for some vector $\pmb \eta$ (where the subscript $c$ denotes the $c$-th entry of the vector).


We have the general decision rule for generative models that

\[p(y = c \mid \pmb x, \pmb \theta) = \frac{p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)}{\sum^C_{\hat c=1} p(y = \hat c \mid \pmb \theta)p(\pmb x \mid y = \hat c, \pmb \theta)}\]

Or, since the denominator does not depend on the class,

\[p(y = c \mid \pmb x, \pmb \theta) \propto p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)\]

Then in the specific case of LDA,

\[\begin{aligned} p(y = c \mid \pmb x, \pmb \theta) &\propto (2\pi)^{-D/2}|\pmb \Sigma|^{-1/2} \pi_c \exp\left(-\frac 1 2 (\pmb x - \pmb \mu_c)^\top \pmb \Sigma^{-1}(\pmb x - \pmb \mu_c)\right) \\\\ &\propto \pi_c \exp\left(-\frac 1 2 (\pmb x - \pmb \mu_c)^\top \pmb \Sigma^{-1}(\pmb x - \pmb \mu_c)\right) \\\\ &= \pi_c \exp\left( -\frac 1 2 \pmb x^\top \pmb \Sigma^{-1}\pmb x + \frac 1 2 \pmb x^\top \pmb \Sigma^{-1} \pmb \mu_c + \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1}\pmb x - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c\right) \\\\ &= \pi_c \exp\left(\pmb x^\top \pmb \Sigma^{-1} \pmb \mu_c - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c \right)\cdot \exp\left(-\frac 1 2 \pmb x^\top \pmb \Sigma^{-1} \pmb x\right) \\\\ &\propto \pi_c \exp\left(\pmb \mu_c^\top \pmb \Sigma^{-1} \pmb x - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c\right) \\\\ &= \exp\left(\pmb \mu_c^\top \pmb \Sigma^{-1} \pmb x - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c + \log \pi_c\right) \\\\ &= \exp(\pmb \beta_c^\top \pmb x + \gamma_c) \end{aligned}\]

where

\[\pmb \beta_c = \pmb \Sigma^{-1} \pmb \mu_c\] \[\gamma_c = - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c + \log \pi_c\]

(Note the proportionality rather than equality in the step where the factor $\exp\left(-\frac 1 2 \pmb x^\top \pmb \Sigma^{-1}\pmb x\right)$ is dropped: it does not depend on the class, so it can be absorbed into the normalising constant.)

Hence if $\exp(\pmb \beta _ c^\top \pmb x + \gamma _ c) = \exp(\pmb \beta _ {c'}^\top \pmb x + \gamma _ {c'})$ (as occurs on the decision boundary), then taking logarithms gives $(\pmb \beta _ c - \pmb \beta _ {c'})^\top \pmb x + (\gamma _ c - \gamma _ {c'}) = 0$, which is linear in $\pmb x$.

For the softmax part, note that

\[\begin{aligned} p(y = c \mid \pmb x, \pmb \theta) &= \frac{\exp(\pmb \beta_c^\top \pmb x + \gamma_c)}{\sum^C_{c'=1} \exp(\pmb \beta_{c'}^\top \pmb x + \gamma_{c'})} \\\\ &= \text{softmax}(\pmb \eta)_c \end{aligned}\]

where

\[\pmb \eta := [\pmb \beta_1^\top \pmb x + \gamma_1, \cdots, \pmb \beta_C^\top \pmb x + \gamma_C]\]
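A minimal sketch of this computation (Python with NumPy, made-up parameters): form $\pmb \beta_c = \pmb \Sigma^{-1} \pmb \mu_c$ and $\gamma_c = -\frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c + \log \pi_c$, then pass $\eta_c = \pmb \beta_c^\top \pmb x + \gamma_c$ through a softmax.

```python
import numpy as np

# Hypothetical LDA parameters: shared covariance, per-class means and priors.
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
mus = np.array([[0.0, 0.0],
                [2.0, 1.0],
                [-1.0, 3.0]])        # one row per class
pis = np.array([0.5, 0.3, 0.2])

Sigma_inv = np.linalg.inv(Sigma)
betas = mus @ Sigma_inv                                    # row c is beta_c^T
gammas = -0.5 * np.sum((mus @ Sigma_inv) * mus, axis=1) + np.log(pis)

def lda_posterior(x):
    eta = betas @ x + gammas                 # eta_c = beta_c^T x + gamma_c
    eta = eta - eta.max()                    # subtract max for numerical stability
    return np.exp(eta) / np.exp(eta).sum()   # softmax(eta)

print(lda_posterior(np.array([1.0, 0.5])))   # posterior over the three classes
```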

Suppose we have data $\langle \pmb x _ i, y _ i \rangle$ where $\pmb x _ i \in \mathbb R^D$ and $y _ i \in \{1, \cdots, C\}$, and are using Gaussian discriminant analysis to find a classifier for this data. This involves modelling

\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]

What are the MLEs for $\pmb \mu _ c$, $\pmb \Sigma _ c$ for this data?


\[\begin{aligned} \hat{\pmb \mu} _ c &= \frac{1}{N_c} \sum _ {i:y_i=c} \pmb x _ i \\\\ \hat{\pmb \Sigma} _ c &= \frac{1}{N _ c} \sum_{i:y _ i=c} (\pmb x _ i - \hat{\pmb \mu} _ c)(\pmb x _ i - \hat{\pmb \mu} _ c)^\top \end{aligned}\]

where $N _ c$ is the number of data points with label $c$.
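A minimal sketch of computing these MLEs on toy data (Python with NumPy; the arrays `X`, `y` and the helper `gda_mle` are hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 100 points in D = 2 dimensions with C = 2 class labels.
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

def gda_mle(X, y):
    """Per-class MLEs: mu_c is the class mean, Sigma_c the class covariance (1/N_c normalisation)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                       # the N_c points with label c
        mu_c = Xc.mean(axis=0)
        diff = Xc - mu_c
        Sigma_c = diff.T @ diff / len(Xc)    # divide by N_c, not N_c - 1
        params[c] = (mu_c, Sigma_c)
    return params

for c, (mu_c, Sigma_c) in gda_mle(X, y).items():
    print(c, mu_c, Sigma_c)
```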

Proofs



