# Notes - Machine Learning MT23, Generative models

### Flashcards

Highlight the difference between discriminative and generative models by stating the probability distribution we model given data $\pmb x$, $y$ and parameters $\pmb \theta$.

- Discriminative: $p(y \mid \pmb x, \pmb \theta)$
- Generative: $p(\pmb x, y \mid \pmb \theta)$

Why are generative models called generative models?

Because modelling the full joint distribution $p(\pmb x, y \mid \pmb \theta)$ means you can marginalise over $y$ to obtain $p(\pmb x \mid \pmb \theta)$ and then sample from it to generate new data points $\pmb x$.
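Sampling from such a model is typically done ancestrally: draw $y$ from its prior, then draw $\pmb x$ from the class-conditional. A minimal sketch, assuming (purely for illustration) two classes with Gaussian class-conditionals for a 1-D input:

```python
import random

random.seed(0)

# Illustrative parameters only: class priors pi, and per-class
# Gaussian (mean, std) parameters for a 1-D input x.
pi = [0.3, 0.7]
class_params = [(0.0, 1.0), (5.0, 1.0)]

def sample(n):
    """Ancestral sampling from the joint: y ~ p(y), then x ~ p(x | y)."""
    out = []
    for _ in range(n):
        y = random.choices(range(len(pi)), weights=pi)[0]
        mean, std = class_params[y]
        out.append((random.gauss(mean, std), y))
    return out

samples = sample(1000)
```

Discarding the sampled $y$ values leaves samples from the marginal $p(\pmb x)$.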

Suppose we are using a generative model and have a distribution $p(\pmb x, y \mid \pmb \theta)$, where $y \in \{1, \ldots, C\}$. Suppose we are given a new $\pmb x'$. How can we determine a prediction $\hat y$?

Apply Bayes' rule and take the most probable class; since the normaliser $p(\pmb x' \mid \pmb \theta)$ does not depend on $c$, this is just

\[\hat y = \arg\max_{c} p(y = c \mid \pmb x', \pmb \theta) = \arg\max_{c} p(\pmb x', y = c \mid \pmb \theta)\]
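As a concrete sketch: the normaliser does not change the $\arg\max$ over classes, so we can maximise the joint directly. A toy example with made-up joint probabilities for a discrete input:

```python
# Made-up joint probabilities p(x, y) for a discrete x in {0, 1, 2}
# and y in {0, 1}: joint[x][c] = p(x, y = c). Illustrative numbers only.
joint = {
    0: [0.20, 0.05],
    1: [0.10, 0.25],
    2: [0.05, 0.35],
}

def predict(x_new):
    # p(y = c | x) is proportional to p(x, y = c), so taking the
    # argmax over the joint gives the same prediction.
    probs = joint[x_new]
    return max(range(len(probs)), key=lambda c: probs[c])
```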

Suppose we have a generative model (we are modelling $p(\pmb x, y \mid \pmb \theta, \pmb \pi)$) where $\pmb \theta, \pmb \pi$ are parameters of the distribution. How can we "factorise" the model to show that we can model the distribution of the outputs $y$ and the distribution of the inputs $\pmb x$ given the output $y$?

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi) = p(y \mid \pmb \pi) \, p(\pmb x \mid y, \pmb \theta)\]

Suppose we have a generative model, i.e. we are modelling

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi)\]
and have factored it into

\[p(y \mid \pmb \pi)p(\pmb x \mid y, \pmb \theta)\]
Under the maximum likelihood framework, and just considering the distribution of $y$, what turns out (after a Lagrange optimisation) to be the MLE for

\[p(y = c \mid \pmb \pi)\]
?

\[\hat \pi_c = \frac{N_c}{N} = \frac{\text{occurrences of } \\{y = c\\}}{\text{total}}\]

(where $N_c$ is the number of occurrences of $y_i = c$ in the dataset, and $N$ is the total number of data points).

Suppose we have a generative model, i.e. we are modelling

\[p(\pmb x, y \mid \pmb \theta, \pmb \pi)\]
and have factored it into

\[p(y \mid \pmb \pi)p(\pmb x \mid y, \pmb \theta)\]
Quickly prove that under the maximum likelihood framework, and just considering the distribution of $y$, the MLE estimate for $p(y = c \mid \pmb \pi)$ is given by

\[\frac{\text{occurrences of } \\{y = c\\}}{\text{total}}\]

Write

\[p(y = c \mid \pmb \pi) = \pi_c\]

Then we have the constraint that

\[\sum^C_{c = 1} \pi_c = 1\]

Then the likelihood is given by

\[\begin{aligned} p(\mathcal D \mid \pmb \theta, \pmb \pi) &= \prod^N_{i=1} \Bigg(\Big(\prod^C_{c = 1} \pi_c^{\mathbb 1(y_i = c)}\Big) \cdot p(\pmb x_i \mid y_i, \pmb \theta)\Bigg) \end{aligned}\]

Then

\[\log p(\mathcal D \mid \pmb \theta, \pmb \pi) = \sum^C_{c = 1} N_c \log(\pi_c) + \sum^N_{i = 1} \log( p(\pmb x_i \mid y_i, \pmb \theta) )\]

(where we let $N_c$ denote the number of occurrences of $y_i = c$ in the dataset). Since the $\sum^N_{i = 1} \log( p(\pmb x_i \mid y_i, \pmb \theta) )$ term does not depend on $\pmb \pi$, it suffices just to solve the optimisation problem

- $\max \sum^C _ {c = 1} N _ c \log \pi _ c$
- $\text{s.t. } \sum^C _ {c = 1} \pi _ c = 1$

Forming the Lagrangian:

\[\Lambda(\pmb \pi, \lambda) = \sum^C_{c = 1} N_c \log(\pi_c) + \lambda \left( \sum^C_{c=1} \pi_c - 1 \right)\]

Then

\[\begin{aligned} &\frac{\partial \Lambda}{\partial \pi_c} = \frac{N_c}{\pi_c} + \lambda = 0 \\\\ &\frac{\partial \Lambda}{\partial \lambda} = \sum^C_{c = 1} \pi_c - 1 = 0 \end{aligned}\]

So

\[\pi_c = -N_c / \lambda\]

Hence, summing over $c$ and using the constraint, we see

\[\lambda = -N\]

so

\[\begin{aligned} \pi_c &= \frac{N_c}{N} \\\\ &=\frac{\text{occurrences of } \\{y = c\\} }{ \text{total} } \end{aligned}\]
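The closed-form result $\pi_c = N_c / N$ can be sanity-checked numerically. A minimal sketch with made-up labels, comparing the $\pmb \pi$-dependent part of the log-likelihood at the empirical frequencies against an alternative distribution:

```python
import math
from collections import Counter

# Made-up labels; the derivation above says the MLE class prior is
# pi_c = N_c / N.
labels = [0, 0, 1, 2, 2, 2, 1, 0, 2, 2]
counts = Counter(labels)
N = len(labels)
pi_hat = {c: counts[c] / N for c in counts}

def loglik(pi):
    """The pi-dependent part of the log-likelihood: sum_c N_c log(pi_c)."""
    return sum(counts[c] * math.log(pi[c]) for c in counts)

# The empirical frequencies should score at least as well as any other
# valid distribution, e.g. the uniform one.
uniform = {c: 1 / len(counts) for c in counts}
```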