# Notes - Machine Learning MT23, Naïve Bayes classifiers

### Flashcards

In the Naïve Bayes classifier, it is assumed that the features are conditionally independent given the class label. Can you write this mathematically, in terms of $\pmb x$, $y$ and some parameters $\pmb \theta$?

Suppose we have some variable $x$ that is categorical, taking one of $K$ values $\{1, \ldots, K\}$. What is the “multinoulli distribution”?

An extension of the Bernoulli distribution, associating a probability with each of the $K$ categories so that

\[p(x = c) = \pi_c\]

and

\[\sum_{c = 1}^K \pi_c = 1\]

In Naïve Bayes models, it is easy to work with both categorical and continuous data because of the assumptions we make about the conditional independence of the variables. What models are typically used for continuous and discrete variables in this context?

- Continuous: Gaussian
- Discrete: multinomial
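The multinoulli (categorical) distribution above can be sketched in a few lines; the probabilities `pi` here are made-up values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.2, 0.5, 0.3])    # hypothetical pi_c for c = 1..K (here K = 3)
assert np.isclose(pi.sum(), 1.0)  # the pi_c must sum to 1

def pmf(c, pi):
    """p(x = c) = pi_c (categories indexed from 0 here)."""
    return pi[c]

# Sampling: draw category indices with the given probabilities.
samples = rng.choice(len(pi), size=10_000, p=pi)
empirical = np.bincount(samples, minlength=len(pi)) / len(samples)
# For a large sample, the empirical frequencies should be close to pi.
```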

Why are Naïve Bayes classifiers easy to train?

The conditional independence assumption means that we only have to solve $D$ one-dimensional optimisation problems (each feature can be treated separately).

What type of model is a Naïve Bayes classifier? Generative or discriminative?

Generative.

How do Naïve Bayes classifiers find the parameters of the Gaussian to use?

The MLEs, which end up being the empirical mean and variance.
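A minimal sketch of these per-class Gaussian MLEs for a single feature, on made-up toy data (the values of `x` and `y` are purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])  # one real-valued feature
y = np.array([0, 0, 0, 1, 1, 1])                 # class labels

def gaussian_mle(x, y, c):
    """Empirical mean and (biased) variance of x restricted to class c."""
    xc = x[y == c]
    return xc.mean(), xc.var()  # np.var defaults to ddof=0, i.e. the MLE

mu0, var0 = gaussian_mle(x, y, 0)
mu1, var1 = gaussian_mle(x, y, 1)
```

Note that the MLE is the *biased* variance estimate (dividing by the class count, not the class count minus one), which is what `np.var` computes by default.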

Suppose:

- We have data $\langle \pmb x _ i, y _ i \rangle$
- $y _ i \in \{1, \cdots, N\}$
- Some entries of $\pmb x _ i$ are real-valued and some are categorical.
- We are using a Naïve Bayes classifier for this data

This involves modelling (using the class-conditional independence assumption)

\[p(\hat{\pmb x} \mid y = c, \pmb\theta) = \prod_j p(\hat{\pmb x}_j \mid y = c, \pmb \theta_j)\]
and then for the real-valued components we use:

\[p(\pmb x_j \mid y = c, \pmb \theta_j) = \mathcal N(\pmb x_j \mid \mu_{jc}, \sigma_{jc})\]
and for the categorical data we use:

\[p(\pmb x_j = \ell \mid y = c, \pmb \theta_j) = p_{jc,\ell}\]
($\ell$ for $\ell$abel, these are probabilities so that $\sum _ {\ell} p _ {jc, \ell} = 1$).
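The factorisation above can be sketched numerically: the class-conditional likelihood of a mixed feature vector is just the product of a Gaussian density for the real-valued feature and a probability-table lookup for the categorical one. All parameter values below are hypothetical, chosen only to illustrate the computation:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Hypothetical per-class parameters: one real-valued feature and one
# categorical feature taking labels {"a", "b"}.
params = {
    "c1": {"mu": 0.0, "sigma2": 1.0, "p_cat": {"a": 0.7, "b": 0.3}},
    "c2": {"mu": 2.0, "sigma2": 0.5, "p_cat": {"a": 0.2, "b": 0.8}},
}

def class_conditional(x_real, x_cat, c):
    """p(x | y = c, theta): product of the per-feature densities."""
    th = params[c]
    return gaussian_pdf(x_real, th["mu"], th["sigma2"]) * th["p_cat"][x_cat]

# A point near class c1's mean, with the label likelier under c1:
lik1 = class_conditional(0.1, "a", "c1")
lik2 = class_conditional(0.1, "a", "c2")
```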

Can you state the MLEs for these parameters $\mu _ {jc}, \sigma _ {jc}, p _ {jc, \ell}$?

$\mu _ {jc}$ and $\sigma _ {jc}$ are given by the empirical mean and variance of feature $j$ among the examples of class $c$,

\[\begin{aligned} \mu_ {jc} &= \frac{1}{N_c} \sum_ {i : y_i = c} (\pmb x_ i)_ j \\\\ \sigma_ {jc} &= \frac{1}{N_c} \sum_ {i : y_i = c} \big((\pmb x_ i)_ j - \mu_ {jc}\big)^2 \end{aligned}\]

where $N_c = \text{count}(y_i = c)$, and $p _ {jc, \ell}$ is given by the empirical frequency in the training data,

\[p_{jc, \ell} = \frac{\text{count}\big((\pmb x_i)_j = \ell \text{ and } y_i = c\big)}{\text{count}(y_i = c)}\]

Suppose:

- We have data $\langle \pmb x _ i, y _ i \rangle$
- $y _ i \in \{1, \cdots, N\}$
- Some entries of $\pmb x _ i$ are real-valued and some are categorical.
- We are using a Naïve Bayes classifier for this data

This involves modelling (using the class-conditional independence assumption)

\[p(\hat{\pmb x} \mid y = c, \pmb\theta) = \prod_j p(\hat{\pmb x}_j \mid y = c, \pmb \theta_j)\]
and then for the real-valued components we use:

\[p(\pmb x_j \mid y = c, \pmb \theta_j) = \mathcal N(\pmb x_j \mid \mu_{jc}, \sigma_{jc})\]
and for the categorical data we use:

\[p(\pmb x_j = \ell \mid y = c, \pmb \theta_j) = p_{jc,\ell}\]
($\ell$ for $\ell$abel, these are probabilities so that $\sum _ {\ell} p _ {jc, \ell} = 1$).

The MLE for $p _ {jc, \ell}$ is given by

\[p_{jc, \ell} = \frac{\text{count}\big((\pmb x_i)_j = \ell \text{ and } y_i = c\big)}{\text{count}(y_i = c)}\]

This has an issue: if no training example has $(\pmb x _ i) _ j = \ell$ and $y _ i = c$ together, then $p _ {jc, \ell} = 0$, and so the probability of that label occurring within that class is set to $0$. How does Laplace smoothing get around this?

We instead define

\[p_{jc, \ell} = \frac{\text{count}\big((\pmb x_i)_j = \ell \text{ and } y_i = c\big) + L}{\text{count}(y_i = c) + L \cdot d}\]

where $L \ge 1$ and $d$ is the number of values that feature $j$ can take. This ensures the probability is non-zero for each $p _ {jc, \ell}$ (we are essentially “hallucinating” $L$ extra examples for each label).
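A minimal sketch of the smoothed estimate for one categorical feature, on made-up toy data with $L = 1$; note that the label `"c"` never occurs in the data, yet still receives non-zero probability, and the smoothed probabilities for each class still sum to 1:

```python
# Toy data: one categorical feature x_j and class labels y (illustrative only).
x_j = ["a", "a", "b", "a", "b", "a"]
y   = ["c1", "c1", "c1", "c2", "c2", "c2"]
labels = ["a", "b", "c"]  # "c" never appears in the data
L = 1
d = len(labels)           # number of values the feature can take

def p_smoothed(label, cls):
    """Laplace-smoothed estimate of p_{jc,l}."""
    joint = sum(1 for xv, yv in zip(x_j, y) if xv == label and yv == cls)
    total = sum(1 for yv in y if yv == cls)
    return (joint + L) / (total + L * d)

# Without smoothing, p("c" | c1) would be 0; with smoothing it is 1/6.
p_unseen = p_smoothed("c", "c1")
```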