Machine Learning MT23, Gaussian discriminant analysis
Flashcards
When doing Gaussian discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as
\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]
How do we then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$?
Gaussian discriminant analysis and Naïve Bayes models both use the generative framework. How do they differ in their treatment of $p(\pmb x \mid y = c, \pmb \theta _ c)$?
Naïve Bayes makes a strong assumption about conditional independence of features given class labels:
\[p(\pmb x \mid y = c, \pmb \theta _ c) = \prod^D _ {j=1} p(x _ j \mid y = c, \pmb \theta _ {jc})\]Gaussian discriminant analysis makes the assumption that the probabilities for a particular class come from a multivariate Gaussian:
\[p(\pmb x \mid y = c, \pmb \theta _ c = (\pmb \mu _ c, \pmb \Sigma _ c)) = \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma _ c)\]In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:
\[p(\pmb x \mid y = c, \pmb \theta _ c = (\pmb \mu _ c, \pmb \Sigma _ c)) = \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma _ c)\]
What is it called if we make no further assumptions about the parameters $\pmb \mu _ c$ and $\pmb \Sigma _ c$ and why?
Quadratic discriminant analysis, since we get quadratic separating surfaces.
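A minimal QDA sketch (illustrative only, with made-up synthetic data; the names `fit_qda` and `predict` are my own): fit a separate Gaussian $\mathcal N(\pmb \mu _ c, \pmb \Sigma _ c)$ per class by maximum likelihood, then classify with the generative rule $p(y = c \mid \pmb x) \propto \pi _ c \, \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma _ c)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_qda(X, y):
    # MLE per class: prior pi_c, mean mu_c, covariance Sigma_c
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),           # class prior pi_c
                     Xc.mean(axis=0),            # mean mu_c
                     np.cov(Xc, rowvar=False))   # covariance Sigma_c
    return params

def predict(params, x):
    # argmax over classes of log pi_c + log N(x | mu_c, Sigma_c)
    scores = {c: np.log(pi) + multivariate_normal.logpdf(x, mu, Sigma)
              for c, (pi, mu, Sigma) in params.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),    # class 0 cluster
               rng.normal([5, 5], 0.5, (50, 2))])   # class 1 cluster
y = np.array([0] * 50 + [1] * 50)
params = fit_qda(X, y)
print(predict(params, np.array([0.1, -0.2])))  # near the class-0 mean
print(predict(params, np.array([4.8, 5.1])))   # near the class-1 mean
```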
In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:
\[p(\pmb x \mid y = c, \pmb \theta _ c = (\pmb \mu _ c, \pmb \Sigma _ c)) = \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma _ c)\]
What do we need to assume about the parameters $\pmb \mu _ c$ and $\pmb \Sigma _ c$ to get “linear discriminant analysis”?
The covariance is the same across all classes.
In Gaussian discriminant analysis, we model the probability of $\pmb x$ given $y = c$ as follows:
\[p(\pmb x \mid y = c, \pmb \theta _ c = (\pmb \mu _ c, \pmb \Sigma _ c)) = \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma _ c)\]
What is it called if we assume that $\pmb \Sigma _ c$ is the same across all classes, and why?
Linear discriminant analysis, since we get linear separating surfaces.
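This can be checked numerically. A sketch (with made-up parameters): when the covariance $\pmb \Sigma$ is shared, the log-ratio of class scores $g(\pmb x) = \log[\pi _ c \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma)] - \log[\pi _ {c'} \mathcal N(\pmb x \mid \pmb \mu _ {c'}, \pmb \Sigma)]$ is affine in $\pmb x$ (the quadratic terms cancel), so the boundary $g(\pmb x) = 0$ is a hyperplane.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # shared across both classes
mu_c, mu_cp = np.array([0., 0.]), np.array([3., 1.])
pi_c, pi_cp = 0.4, 0.6

def g(x):
    # log-ratio of the two unnormalised class posteriors
    return (np.log(pi_c) + multivariate_normal.logpdf(x, mu_c, Sigma)
            - np.log(pi_cp) - multivariate_normal.logpdf(x, mu_cp, Sigma))

# An affine function has vanishing second differences in every direction.
x, u, v = rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)
second_diff = g(x + u + v) - g(x + u) - g(x + v) + g(x)
print(np.isclose(second_diff, 0.0))  # True: the quadratic terms cancelled
```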
Does Gaussian discriminant analysis use a generative or discriminative framework?
Generative.
When doing Gaussian discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as
\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]
We then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$ as
\[p(\pmb x \mid y = c, \pmb \theta _ c = (\pmb \mu _ c, \pmb \Sigma _ c)) = \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma _ c)\]
Quickly prove that the decision boundaries are given by piecewise quadratic curves.
Suppose we have two classes $c$ and $c'$ and we want to look at the decision boundary between them. This is the set of $\pmb x$ satisfying
\[p(y = c \mid \pmb x, \pmb \theta) = p(y = c' \mid \pmb x, \pmb \theta)\]We have the general decision rule for generative models that
\[p(y = c \mid \pmb x, \pmb \theta) = \frac{p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)}{\sum^C _ {\hat c=1} p(y = \hat c \mid \pmb \theta)p(\pmb x \mid y = \hat c, \pmb \theta)}\]or, since the denominator does not depend on the class $c$,
\[p(y = c \mid \pmb x, \pmb \theta) \propto p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)\]So the decision boundaries are given by
\[p(y = c \mid \pmb \theta)p(\pmb x \mid y = c, \pmb \theta) = p(y=c'\mid \pmb \theta)p(\pmb x \mid y = c', \pmb \theta)\]Substituting in the actual expressions for Gaussian discriminant analysis, and setting the ratio to $1$,
\[\frac{\pi _ c (2\pi) ^{-D/2} \vert \pmb \Sigma _ c \vert ^{-1/2}\exp\left(- \frac 1 2 ( \pmb x - \pmb \mu _ c)^\top \pmb \Sigma^{-1} _ c (\pmb x - \pmb \mu _ c )\right)}{\pi _ {c'} (2\pi) ^{-D/2} \vert \pmb \Sigma _ {c'} \vert ^{-1/2}\exp\left(- \frac 1 2 ( \pmb x - \pmb \mu _ {c'})^\top \pmb \Sigma^{-1} _ {c'} (\pmb x - \pmb \mu _ {c'} )\right)} = 1\]Rearranging (the $(2\pi)^{-D/2}$ factors cancel), we see
\[(\pmb x - \pmb \mu _ c)^\top \pmb \Sigma^{-1} _ c (\pmb x - \pmb \mu _ c ) - (\pmb x - \pmb \mu _ {c'})^\top \pmb \Sigma^{-1} _ {c'} (\pmb x - \pmb \mu _ {c'} ) = -2\log\left(\frac{\pi _ {c'} \vert \pmb \Sigma _ {c'} \vert ^{-1/2}}{\pi _ {c} \vert \pmb \Sigma _ {c} \vert ^{-1/2}}\right)\]which is a quadratic surface. The boundaries are only piecewise quadratic because, with more than two classes, the probability of $\pmb x$ belonging to classes $c$ and $c'$ might be equal while a third class $c''$ is more probable than both, so the boundary between $c$ and $c'$ is only active where they are the two most probable classes.
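The rearrangement can be sanity-checked numerically. A sketch with made-up parameters: the log-ratio of the unnormalised posteriors $\pi _ c \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma _ c)$ should equal the quadratic-form expression (the $(2\pi)^{-D/2}$ factors cancel, leaving only the $\vert \pmb \Sigma \vert ^{-1/2}$ terms in the constant).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu_c, mu_cp = np.array([0., 0.]), np.array([2., 1.])
A, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
Sigma_c = A @ A.T + np.eye(2)    # random symmetric positive-definite
Sigma_cp = B @ B.T + np.eye(2)   # covariances for the two classes
pi_c, pi_cp = 0.3, 0.7

def Q(x, mu, Sigma):
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

for _ in range(5):
    x = rng.normal(size=2)
    # left side: log-ratio of pi * N(x | mu, Sigma) between the classes
    lhs = (np.log(pi_c) + multivariate_normal.logpdf(x, mu_c, Sigma_c)
           - np.log(pi_cp) - multivariate_normal.logpdf(x, mu_cp, Sigma_cp))
    # right side: the rearranged quadratic expression from the derivation
    rhs = (-0.5 * Q(x, mu_c, Sigma_c) + 0.5 * Q(x, mu_cp, Sigma_cp)
           + np.log(pi_c) - 0.5 * np.log(np.linalg.det(Sigma_c))
           - np.log(pi_cp) + 0.5 * np.log(np.linalg.det(Sigma_cp)))
    assert np.isclose(lhs, rhs)
```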
When doing linear discriminant analysis, we factor the joint distribution $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$ as
\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) p(\pmb x \mid y, \pmb \theta)\]
We then model the distribution $p(\pmb x \mid y = c, \pmb \theta)$ as
\[p(\pmb x \mid y = c, \pmb \theta _ c = (\pmb \mu _ c, \pmb \Sigma)) = \mathcal N(\pmb x \mid \pmb \mu _ c, \pmb \Sigma)\]
Quickly prove, in one go, that LDA has linear decision boundaries, and that
\[p(y = c \mid \pmb x, \pmb \theta) = \text{softmax}(\pmb \eta) _ c\]
for some vector $\pmb \eta$ (where the subscript $c$ denotes the $c$-th entry of the vector).
We have the general decision rule for generative models that
\[p(y = c \mid \pmb x, \pmb \theta) = \frac{p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)}{\sum^C _ {\hat c=1} p(y = \hat c \mid \pmb \theta)p(\pmb x \mid y = \hat c, \pmb \theta)}\]or, since the denominator does not depend on the class $c$,
\[p(y = c \mid \pmb x, \pmb \theta) \propto p(y = c \mid \pmb \theta) p(\pmb x \mid y = c, \pmb \theta)\]Then in the specific case of LDA,
\[\begin{aligned} p(y = c \mid \pmb x, \pmb \theta) &\propto (2\pi) ^{-D/2} \vert \pmb \Sigma \vert ^{-1/2} \pi _ c \exp\left(-\frac 1 2 (\pmb x - \pmb \mu _ c)^\top \pmb \Sigma^{-1}(\pmb x - \pmb \mu _ c)\right) \\\\ &\propto \pi _ c \exp\left(-\frac 1 2 (\pmb x - \pmb \mu _ c)^\top \pmb \Sigma^{-1}(\pmb x - \pmb \mu _ c)\right) \\\\ &= \pi _ c \exp\left( -\frac 1 2 \pmb x^\top \pmb \Sigma^{-1}\pmb x + \frac 1 2 \pmb x^\top \pmb \Sigma^{-1} \pmb \mu _ c + \frac 1 2 \pmb \mu _ c^\top \pmb \Sigma^{-1}\pmb x - \frac 1 2 \pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb \mu _ c\right) \\\\ &= \pi _ c \exp\left(\pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb x - \frac 1 2 \pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb \mu _ c \right)\cdot \exp\left(-\frac 1 2 \pmb x^\top \pmb \Sigma^{-1} \pmb x\right) \\\\ &\propto \pi _ c \exp\left(\pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb x - \frac 1 2 \pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb \mu _ c\right) \\\\ &= \exp\left(\pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb x - \frac 1 2 \pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb \mu _ c + \log \pi _ c\right) \\\\ &= \exp(\pmb \beta _ c^\top \pmb x + \gamma _ c) \end{aligned}\]where
\[\pmb \beta _ c = \pmb \Sigma^{-1} \pmb \mu _ c, \qquad \gamma _ c = - \frac 1 2 \pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb \mu _ c + \log \pi _ c\](note the proportionality rather than equality in the penultimate line: the rightmost factor depends only on $\pmb x$, not on the class $c$, so it can be absorbed into the normalising constant).
Hence if $\exp(\pmb \beta _ c^\top \pmb x + \gamma _ c) = \exp(\pmb \beta _ {c'}^\top \pmb x + \gamma _ {c'})$ (as occurs on the decision boundary), then taking logarithms gives $\pmb \beta _ c^\top \pmb x + \gamma _ c = \pmb \beta _ {c'}^\top \pmb x + \gamma _ {c'}$, which is a linear equation in $\pmb x$.
For the softmax part, note that
\[\begin{aligned} p(y = c \mid \pmb x, \pmb \theta) &= \frac{\exp(\pmb \beta _ c^\top \pmb x + \gamma _ c)}{\sum^{C} _ {c'=1} \exp(\pmb \beta _ {c'}^\top \pmb x + \gamma _ {c'})} \\\\ &= \text{softmax}(\pmb \eta) _ c \end{aligned}\]where
\[\pmb \eta := [\pmb \beta _ 1^\top \pmb x + \gamma _ 1, \cdots, \pmb \beta _ C^\top \pmb x + \gamma _ C]\]
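The softmax form can be verified numerically. A sketch with made-up parameters: compute $\pmb \beta _ c = \pmb \Sigma^{-1} \pmb \mu _ c$ and $\gamma _ c = -\frac 1 2 \pmb \mu _ c^\top \pmb \Sigma^{-1} \pmb \mu _ c + \log \pi _ c$, and check that $\text{softmax}(\pmb \eta)$ reproduces the exact Bayes posterior under the shared-covariance model.

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[1.5, 0.2], [0.2, 0.8]])   # shared covariance
mus = [np.array([0., 0.]), np.array([2., 1.]), np.array([-1., 2.])]
pis = np.array([0.2, 0.5, 0.3])              # class priors
x = np.array([0.7, 0.4])

# eta_c = beta_c^T x + gamma_c for each class
Sigma_inv = np.linalg.inv(Sigma)
eta = np.array([Sigma_inv @ mu @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
                for mu, pi in zip(mus, pis)])
softmax = np.exp(eta - eta.max())            # max-shift for stability
softmax /= softmax.sum()

# exact posterior p(y = c | x) via Bayes' rule on the full densities
joint = np.array([pi * multivariate_normal.pdf(x, mu, Sigma)
                  for mu, pi in zip(mus, pis)])
posterior = joint / joint.sum()
print(np.allclose(softmax, posterior))  # True
```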