Notes - Machine Learning MT23, Gaussian discriminant analysis
Flashcards
When doing Gaussian discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as
\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]
How do we then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$?
As a multivariate Gaussian for each class:
\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]
Gaussian discriminant analysis and Naïve Bayes models both use the generative framework. How do they differ in their treatment of $p(\pmb x \mid y = c, \pmb \theta _ c)$?
Naïve Bayes makes a strong assumption about conditional independence of features given class labels:
\[p(\pmb x \mid y = c, \pmb \theta_c) = \prod^D_{j=1} p(x_j \mid y = c, \pmb \theta_{jc})\]Gaussian discriminant analysis instead assumes that the class-conditional distribution of the features is a multivariate Gaussian:
\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:
\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]
What is it called if we make no further assumptions about the parameters $\pmb \mu _ c$ and $\pmb \Sigma _ c$ and why?
Quadratic discriminant analysis, since we get quadric (quadratic) separating surfaces.
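As a quick illustration (a minimal sketch, not from the notes, assuming NumPy/SciPy and made-up toy parameters), the generative model above turns into class posteriors by Bayes' rule:

```python
# Sketch: class posteriors under quadratic discriminant analysis (QDA),
# i.e. a separate mean and covariance per class. Toy parameters are made up.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.6, 0.4])                                  # class priors pi_c
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]          # class means mu_c
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]    # per-class covariances Sigma_c

def posterior(x):
    """p(y = c | x) is proportional to pi_c * N(x | mu_c, Sigma_c), normalised over classes."""
    lik = np.array([multivariate_normal.pdf(x, mean=m, cov=S) for m, S in zip(mu, Sigma)])
    joint = pi * lik
    return joint / joint.sum()

print(posterior(np.array([1.0, 0.5])))   # entries sum to 1
```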
In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:
\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]
What do we need to assume about the parameters $\pmb \mu _ c$ and $\pmb \Sigma _ c$ to get “linear discriminant analysis”?
The covariance is the same across all classes.
In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:
\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]
What is it called if we assume that $\pmb \Sigma _ c$ is the same across all classes, and why?
Linear discriminant analysis, since we get linear separating surfaces.
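To make the shared-covariance assumption concrete, here is a minimal fitting sketch (not from the notes; assumes NumPy, an $(N, D)$ data matrix `X` and integer labels `y`), using maximum-likelihood estimates with a pooled covariance:

```python
# Sketch: maximum-likelihood fitting of LDA parameters with a single
# covariance matrix shared across all classes.
import numpy as np

def fit_lda(X, y):
    classes = np.unique(y)
    N, D = X.shape
    pi, mu = [], []
    Sigma = np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        pi.append(len(Xc) / N)                     # prior pi_c = N_c / N
        mu.append(Xc.mean(axis=0))                 # class mean mu_c
        Sigma += (Xc - mu[-1]).T @ (Xc - mu[-1])   # within-class scatter
    Sigma /= N                                     # pooled (shared) covariance
    return np.array(pi), np.array(mu), Sigma
```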
Does Gaussian discriminant analysis use a generative or discriminative framework?
Generative.
When doing Gaussian discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as
\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]
We then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$ as
\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma_c)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma_c)\]
Quickly prove that the decision boundaries are given by piecewise quadratic curves.
Suppose we have two classes $c$ and $c'$ and we want to look at the decision boundary between them. This is the set of $\pmb x$ satisfying
\[p(y = c \mid \pmb x, \pmb \theta) = p(y = c' \mid \pmb x, \pmb \theta)\]We have the general decision rule for generative models that
\[p(y = c \mid \pmb x, \pmb \theta) = \frac{p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)}{\sum^C_{\hat c=1} p(y = \hat c \mid \pmb \theta)p(\pmb x \mid y = \hat c, \pmb \theta)}\]Or, since the denominator does not depend on the class,
\[p(y = c \mid \pmb x, \pmb \theta) \propto p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)\]So the decision boundaries are given by
\[p(y = c \mid \pmb \theta)p(\pmb x \mid y = c, \pmb \theta) = p(y=c'\mid \pmb \theta)p(\pmb x \mid y = c', \pmb \theta)\]Substituting in the Gaussian class-conditional densities and setting the ratio of the two sides to $1$,
\[\frac{\pi_ c (2\pi)^{-D/2} |\pmb \Sigma_ c|^{-1/2}\exp\left(- \frac 1 2 ( \pmb x - \pmb \mu_ c)^\top \pmb \Sigma^{-1}_ c (\pmb x - \pmb \mu_ c )\right)}{\pi_ {c'} (2\pi)^{-D/2} |\pmb \Sigma_ {c'}|^{-1/2}\exp\left(- \frac 1 2 ( \pmb x - \pmb \mu_ {c'})^\top \pmb \Sigma^{-1}_ {c'} (\pmb x - \pmb \mu_ {c'} )\right)} = 1\]With further rearranging, we see
\[(\pmb x - \pmb \mu_ c)^\top \pmb \Sigma^{-1}_ c (\pmb x - \pmb \mu_ c ) - (\pmb x - \pmb \mu_ {c'})^\top \pmb \Sigma^{-1}_ {c'} (\pmb x - \pmb \mu_ {c'} ) = -2\log\left(\frac{\pi_ {c'} |\pmb \Sigma_ {c'}|^{-1/2}}{\pi_ {c} |\pmb \Sigma_ {c}|^{-1/2}}\right)\]which is a quadratic surface in $\pmb x$ (the $(2\pi)^{-D/2}$ factors cancel in the ratio). The boundaries are piecewise because, with more than two classes, even where the probabilities of $\pmb x$ belonging to classes $c$ and $c'$ are equal, there may be a third class $c''$ whose probability is higher than both, so only part of each quadric surface is an actual decision boundary.
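As a numerical sanity check of this algebra (a sketch, not part of the original derivation; assumes NumPy/SciPy and made-up parameters), the log-ratio of the joints should equal minus half the difference of quadratic forms plus the class-dependent constants, for any $\pmb x$:

```python
# Sketch: verifying the quadratic-boundary algebra numerically.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.6, 0.4])
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def quad(x, m, S):
    """(x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - m
    return d @ np.linalg.solve(S, d)

x = np.random.randn(2)

# log [pi_c N(x | mu_c, Sigma_c)] - log [pi_c' N(x | mu_c', Sigma_c')]
lhs = (np.log(pi[0]) + multivariate_normal.logpdf(x, mean=mu[0], cov=Sigma[0])
       - np.log(pi[1]) - multivariate_normal.logpdf(x, mean=mu[1], cov=Sigma[1]))

# -1/2 (difference of quadratic forms) + log(pi_c |Sigma_c|^{-1/2}) - log(pi_c' |Sigma_c'|^{-1/2});
# the (2*pi)^{-D/2} factors cancel. Setting this to zero gives the boundary equation above.
rhs = (-0.5 * (quad(x, mu[0], Sigma[0]) - quad(x, mu[1], Sigma[1]))
       + np.log(pi[0]) - 0.5 * np.log(np.linalg.det(Sigma[0]))
       - np.log(pi[1]) + 0.5 * np.log(np.linalg.det(Sigma[1])))

assert np.isclose(lhs, rhs)
```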
When doing linear discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as
\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]
We then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$ as
\[p(\pmb x \mid y = c, \pmb \theta_c = (\pmb \mu_c, \pmb \Sigma)) = \mathcal N(\pmb x \mid \pmb \mu_c, \pmb \Sigma)\]
Quickly prove, in one go, that LDA has linear decision boundaries, and that
\[p(y = c \mid \pmb x, \pmb \theta) = \text{softmax}(\pmb \eta)_c\]
for some vector $\pmb \eta$ (where the subscript $c$ denotes the $c$-th entry of the vector).
We have the general decision rule for generative models that
\[p(y = c \mid \pmb x, \pmb \theta) = \frac{p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)}{\sum^C_{\hat c=1} p(y = \hat c \mid \pmb \theta)p(\pmb x \mid y = \hat c, \pmb \theta)}\]Or, since the denominator does not depend on the class,
\[p(y = c \mid \pmb x, \pmb \theta) \propto p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)\]Then in the specific case of LDA,
\[\begin{aligned} p(y = c \mid \pmb x, \pmb \theta) &\propto (2\pi)^{-D/2}|\pmb \Sigma|^{-1/2} \pi_c \exp\left(-\frac 1 2 (\pmb x - \pmb \mu_c)^\top \pmb \Sigma^{-1}(\pmb x - \pmb \mu_c)\right) \\\\ &\propto \pi_c \exp\left(-\frac 1 2 (\pmb x - \pmb \mu_c)^\top \pmb \Sigma^{-1}(\pmb x - \pmb \mu_c)\right) \\\\ &= \pi_c \exp\left( -\frac 1 2 \pmb x^\top \pmb \Sigma^{-1}\pmb x + \frac 1 2 \pmb x^\top \pmb \Sigma^{-1} \pmb \mu_c + \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1}\pmb x - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c\right) \\\\ &= \pi_c \exp\left(\pmb x^\top \pmb \Sigma^{-1} \pmb \mu_c - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c \right)\cdot \exp\left(-\frac 1 2 \pmb x^\top \pmb \Sigma^{-1} \pmb x\right) \\\\ &\propto \pi_c \exp\left(\pmb \mu_c^\top \pmb \Sigma^{-1} \pmb x - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c\right) \\\\ &= \exp\left(\pmb \mu_c^\top \pmb \Sigma^{-1} \pmb x - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c + \log \pi_c\right) \\\\ &= \exp(\pmb \beta_c^\top \pmb x + \gamma_c) \end{aligned}\]where
\[\pmb \beta_c = \pmb \Sigma^{-1} \pmb \mu_c\] \[\gamma_c = - \frac 1 2 \pmb \mu_c^\top \pmb \Sigma^{-1} \pmb \mu_c + \log \pi_c\](Note the proportionality rather than equality when passing to the fifth line: the factor $\exp\left(-\frac 1 2 \pmb x^\top \pmb \Sigma^{-1} \pmb x\right)$ on the far right is dropped because it does not depend on the class.)
Hence on the decision boundary between classes $c$ and $c'$ we have $\exp(\pmb \beta _ c^\top \pmb x + \gamma _ c) = \exp(\pmb \beta _ {c'}^\top \pmb x + \gamma _ {c'})$; taking logarithms gives
\[(\pmb \beta_c - \pmb \beta_{c'})^\top \pmb x + \gamma_c - \gamma_{c'} = 0\]which is a linear equation in $\pmb x$, so the boundary is a hyperplane.
For the softmax part, note that
\[\begin{aligned} p(y = c \mid \pmb x, \pmb \theta) &= \frac{\exp(\pmb \beta_c^\top \pmb x + \gamma_c)}{\sum^{C}_{c'=1} \exp(\pmb \beta_{c'}^\top \pmb x + \gamma_{c'})} \\\\ &= \text{softmax}(\pmb \eta)_c \end{aligned}\]where
\[\pmb \eta := [\pmb \beta_1^\top \pmb x + \gamma_1, \cdots, \pmb \beta_C^\top \pmb x + \gamma_C]\]
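A quick check of the softmax form (a sketch with made-up parameters, assuming NumPy/SciPy, with `softmax` taken from `scipy.special`): computing $\pmb \eta$ from $\pmb \beta _ c = \pmb \Sigma^{-1} \pmb \mu _ c$ and $\gamma _ c$ reproduces the posterior obtained directly from Bayes' rule.

```python
# Sketch: LDA posterior as softmax(eta)_c with eta_c = beta_c^T x + gamma_c.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import softmax

pi = np.array([0.5, 0.3, 0.2])                        # class priors
mu = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 1.5]])  # class means (rows)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])            # shared covariance (LDA)
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([0.7, -0.2])

# beta_c = Sigma^{-1} mu_c,  gamma_c = -1/2 mu_c^T Sigma^{-1} mu_c + log pi_c
beta = mu @ Sigma_inv                                 # row c is beta_c^T (Sigma is symmetric)
gamma = -0.5 * np.einsum('cd,de,ce->c', mu, Sigma_inv, mu) + np.log(pi)
eta = beta @ x + gamma

# Direct Bayes-rule posterior for comparison
joint = pi * np.array([multivariate_normal.pdf(x, mean=m, cov=Sigma) for m in mu])
assert np.allclose(softmax(eta), joint / joint.sum())
```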