Machine Learning MT23, Generative models


Flashcards

Highlight the difference between discriminative and generative models by stating the probability distribution we model given data $\pmb x$, $y$ and parameters $\pmb \theta$.


  • Discriminative: $p(y \mid \pmb x, \pmb \theta)$
  • Generative: $p(\pmb x, y \mid \pmb \theta)$

Why are generative models called generative models?


Because modelling the full joint distribution $p(\pmb x, y \mid \pmb \theta)$ means that you can sample from $p(\pmb x \mid \pmb \theta)$ to generate new samples.
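This sampling procedure can be sketched as ancestral sampling: first draw $y$ from the class prior, then draw $\pmb x$ from the class-conditional. The prior, Gaussian class-conditionals, and their parameters below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

priors = np.array([0.3, 0.7])     # hypothetical class prior p(y)
means = np.array([-2.0, 2.0])     # hypothetical class-conditional means

def sample(n):
    ys = rng.choice(len(priors), size=n, p=priors)  # y ~ p(y)
    xs = rng.normal(loc=means[ys], scale=1.0)       # x ~ p(x | y)
    return xs, ys

xs, ys = sample(1000)
# empirical class frequencies should be close to the prior
print(np.bincount(ys) / len(ys))
```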

Suppose we are using a generative model and have a distribution $p(\pmb x, y \mid \pmb \theta)$, where $y \in \{1, \ldots, C\}$. Suppose we are given a new $\pmb x'$. How can we determine a prediction $\hat y$?


\[\begin{aligned} \hat y &= \text{argmax} _ c \text{ }p(y = c \mid \pmb x', \pmb \theta) \\\\ &= \text{argmax} _ c \text{ }\frac{p(y = c \mid \pmb \theta)p(\pmb x' \mid y = c, \pmb \theta)}{\sum^C _ {c' = 1}p(y = c' \mid \pmb \theta)p(\pmb x' \mid y = c', \pmb \theta)} \end{aligned}\]
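A minimal sketch of this prediction rule, assuming hypothetical one-dimensional Gaussian class-conditionals $p(\pmb x' \mid y = c)$ and a made-up prior:

```python
import numpy as np

priors = np.array([0.5, 0.5])     # hypothetical p(y = c)
means = np.array([-2.0, 2.0])     # hypothetical class-conditional means

def gauss_pdf(x, mu):
    # standard-deviation-1 Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def predict(x_new):
    joint = priors * gauss_pdf(x_new, means)  # p(y = c) p(x' | y = c)
    posterior = joint / joint.sum()           # the normalising denominator
    return np.argmax(posterior), posterior

y_hat, post = predict(1.0)
# x' = 1.0 is closer to the mean of class 1, so the argmax picks class 1
```

Note that the denominator does not affect the argmax; it is only needed if you want the posterior probabilities themselves.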

Suppose we have a generative model (we are modelling $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$) where $\pmb \theta, \pmb \pi$ are parameters of the distribution. How can we “factorise” the model to show that we can model the distribution of the outputs $\pmb y$ and the distribution of the inputs $\pmb x$ given the output $\pmb y$?


\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi)p(\pmb x \mid y, \pmb \theta)\]

Suppose we have a generative model, i.e. we are modelling

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi)\]

and have factored it into

\[p(y \mid \pmb \pi)p(\pmb x \mid y, \pmb \theta)\]

Under the maximum likelihood framework, and just considering the distribution of $y$, what turns out (after a Lagrange optimisation) to be the MLE for

\[p(y = c \mid \pmb \pi)\]

?


\[\frac{\text{ occurrences of } \\{y = c\\}\\,}{ \text{total}\\,}\]
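In code, this MLE is just the empirical class frequencies. The labels below are made up for illustration.

```python
import numpy as np

y = np.array([0, 1, 1, 2, 1, 0, 2, 1])   # N = 8 labels, C = 3 classes
N_c = np.bincount(y, minlength=3)        # occurrences of each class
pi_hat = N_c / len(y)                    # pi_c = N_c / N
# pi_hat is [0.25, 0.5, 0.25]
```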

Suppose we have a generative model, i.e. we are modelling

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi)\]

and have factored it into

\[p(y \mid \pmb \pi)p(\pmb x \mid y, \pmb \theta)\]

Quickly prove that under the maximum likelihood framework, and just considering the distribution of $y$, the MLE estimate for $p(y = c \mid \pmb \pi)$ is given by

\[\frac{\text{ occurrences of } \\{y = c\\}\\,}{ \text{total}\\,}\]

Write

\[p(y = c \mid \pmb \pi) = \pi _ c\]

Then we have the constraint that

\[\sum^C _ {c = 1} \pi _ c = 1\]

Then the likelihood is given by

\[\begin{aligned} p(\mathcal D \mid \pmb \theta, \pmb \pi) &= \prod^N _ {i=1} \Bigg(\Big(\prod^C _ {c = 1} \pi _ c^{\mathbb 1(y _ i = c)}\Big) \cdot p(\pmb x _ i \mid y _ i, \pmb \theta)\Bigg) \end{aligned}\]

Then

\[\log p(\mathcal D \mid \pmb \theta, \pmb \pi) = \sum^C _ {c = 1} N _ c \log(\pi _ c) + \sum^N _ {i = 1} \log( p(\pmb x _ i \mid y _ i, \pmb \theta) )\]

(where we let $N _ c$ denote the number of occurrences of $y _ i = c$ in the dataset). Since the $\sum^N _ {i = 1} \log( p(\pmb x _ i \mid y _ i, \pmb \theta) )$ term does not depend on $\pmb \pi$, it suffices just to solve the optimisation problem

  • $\max \sum^C _ {c = 1} N _ c \log \pi _ c$
  • $\text{s.t. } \sum^C _ {c = 1} \pi _ c = 1$

Forming the Lagrangian:

\[\Lambda(\pi _ c, \lambda) = \sum^C _ {c = 1} N _ c \log(\pi _ c) -\lambda \left( \sum^C _ {c=1} \pi _ c - 1 \right)\]

Then

\[\begin{aligned} &\frac{\partial \Lambda}{\partial \pi _ c} = \frac{N _ c}{\pi _ c} - \lambda = 0 \\\\ &\frac{\partial \Lambda}{\partial \lambda} = -\left( \sum^C _ {c = 1} \pi _ c - 1 \right) = 0 \end{aligned}\]

So

\[\pi _ c = N _ c / \lambda\]

Hence, summing over $c$ and using the constraint $\sum^C _ {c=1} \pi _ c = 1$, we see

\[\lambda = \sum^C _ {c = 1} N _ c = N\]

so

\[\begin{aligned} \pi _ c &= \frac{N _ c}{N} \\\\ &=\frac{\text{ occurrences of } \\{y = c\\} }{ \text{total} } \end{aligned}\]
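A quick numerical sanity check of this result: $\pi _ c = N _ c / N$ should achieve at least as high a value of $\sum _ c N _ c \log \pi _ c$ as any other point on the simplex. The class counts below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

N_c = np.array([3.0, 5.0, 2.0])   # made-up class counts
pi_mle = N_c / N_c.sum()          # the MLE: [0.3, 0.5, 0.2]

def objective(pi):
    # the constrained objective: sum_c N_c log(pi_c)
    return np.sum(N_c * np.log(pi))

# compare against random distributions drawn from the simplex
for _ in range(1000):
    pi = rng.dirichlet(np.ones(3))
    assert objective(pi) <= objective(pi_mle)
```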


