Notes - Machine Learning MT23, Bayesian machine learning
Flashcards
In terms of the data $\mathcal D$ and the parameters of the model $\pmb w$, what is the main goal of Bayesian machine learning?
Inferring the posterior $p(\pmb w \mid \mathcal D)$, the distribution of the parameters given the observed data.
What is aleatoric uncertainty, and is it reducible with more data?
- Uncertainty related to the inherently stochastic nature of the variables
- Not reducible with more data
In terms of the data $\mathcal D$ and the parameters of a model $\pmb w$, how is aleatoric uncertainty characterised?
The distribution of the data given the parameters
\[p(\mathcal D \mid \pmb w)\]What is epistemic uncertainty, and is it reducible with more data?
- Uncertainty about whether a model is correct given observed data
- Reducible with more data
In terms of the data $\mathcal D$ and the parameters of a model $\pmb w$, how is epistemic uncertainty characterised mathematically?
The distribution of the parameters given the data
\[p(\pmb w \mid \mathcal D)\]In terms of data $\mathcal D$ and the parameters of a model $\pmb w$, how is “likelihood” characterised mathematically?
The probability of the data given the model
\[p(\mathcal D \mid \pmb w)\]In terms of data $\mathcal D$ and the parameters of a model $\pmb w$, how is the posterior $p(\pmb w \mid \mathcal D)$ defined in terms of the likelihood $p(\mathcal D \mid \pmb w)$?
Via Bayes’ theorem:
\[p(\pmb w \mid \mathcal D) = \frac{p(\mathcal D \mid \pmb w) p(\pmb w)}{p(\mathcal D)}\]
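A minimal sketch of this update for a single scalar weight, assuming a toy model $y _ i \sim \mathcal N(w, 1)$, a standard normal prior, and a grid approximation for the evidence $p(\mathcal D)$:

```python
import numpy as np
from scipy.stats import norm

# Toy setup (assumed): scalar weight w, data y_i ~ N(w, 1), prior w ~ N(0, 1).
w_grid = np.linspace(-3, 3, 601)
prior = norm.pdf(w_grid, loc=0.0, scale=1.0)                 # p(w)
data = np.array([0.9, 1.4, 1.1])

# Likelihood p(D | w) evaluated for every candidate w on the grid
likelihood = np.prod(norm.pdf(data[:, None], loc=w_grid[None, :], scale=1.0), axis=0)

# Posterior p(w | D) = p(D | w) p(w) / p(D); p(D) approximated by a Riemann sum
unnormalised = likelihood * prior
dw = w_grid[1] - w_grid[0]
evidence = unnormalised.sum() * dw                           # p(D)
posterior = unnormalised / evidence
print("posterior mean:", (w_grid * posterior).sum() * dw)
```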
Suppose we have data $\mathcal D = \langle \pmb x _ i, y _ i \rangle^N _ {i=1}$. We have the following expression for the posterior in terms of the prior and likelihood:
\[p(\pmb w \mid \mathcal D) = \frac{p(\mathcal D \mid \pmb w) p(\pmb w)}{p(\mathcal D)}\]
How can you rewrite $p(\mathcal D \mid \pmb w)$ more familiarly?
As the likelihood of the data given the model: for supervised data this is $p(\pmb y \mid \pmb X, \pmb w)$ (ignoring the $\pmb w$-independent factor $p(\pmb X)$).
Suppose:
- We are using a Bayesian approach to model some problem of predicting outputs given some inputs for $\mathcal D = \langle \pmb x _ i, y _ i \rangle^N _ {i=1}$ (e.g. linear regression).
- We have a prior $p(\pmb w)$.
and now want to make an estimate for
\[p(y \mid \pmb x, \mathcal D)\]
What is this called, and how can this be done?
This is the Bayesian prediction distribution. Use the posterior to marginalise over the parameters:
\[p(y \mid \pmb x, \mathcal D) = \int_{\pmb w} p(y \mid \pmb x, \pmb w) p(\pmb w \mid \mathcal D) \text d \pmb w\]Suppose:
- We are using a Bayesian approach to model some problem of predicting outputs given some inputs for $\mathcal D = \langle \pmb x _ i, y _ i \rangle^N _ {i=1}$ (e.g. linear regression).
- We have a prior $p(\pmb w)$.
We can derive the “Bayesian prediction distribution” by marginalising over the parameters of the model $\pmb w$:
\[p(y \mid \pmb x, \mathcal D) = \int_{\pmb w} p(y \mid \pmb x, \pmb w) p(\pmb w \mid \mathcal D) \text d \pmb w\]
Now we want to actually predict the $y$ given some input $\pmb x$. How can this be done?
Taking the expected value of $y$ under the prediction distribution:
\[\mathbb E[y \mid \pmb x, \mathcal D] = \int y \cdot p(y \mid \pmb x, \mathcal D) \text dy\]Suppose:
- We are using a Bayesian approach to model some problem of predicting outputs given some inputs for $\mathcal D = \langle \pmb x _ i, y _ i \rangle^N _ {i=1}$ (e.g. linear regression).
- We have a prior $p(\pmb w)$.
We can derive the “Bayesian prediction distribution” by marginalising over the parameters of the model $\pmb w$:
\[p(y \mid \pmb x, \mathcal D) = \int_{\pmb w} p(y \mid \pmb x, \pmb w) p(\pmb w \mid \mathcal D) \text d \pmb w\]
Now we want to actually predict the $y$ given some input $\pmb x$. In the full Bayesian framework this can be done by taking the expected value of $y$ under the prediction distribution:
\[\mathbb E[y \mid \pmb x, \mathcal D] = \int y \cdot p(y \mid \pmb x, \mathcal D) \text dy\]
Alternatively, we can take the modal value of the posterior distribution to get $\pmb w _ \text{MAP}$, the “maximum a posteriori estimate”:
\[\pmb w_\text{MAP} = \text{argmax}_{\pmb w} \text{ } p(\pmb w \mid \mathcal D)\]
and instead use this to make predictions. What are the pros and cons of each?
- The full Bayesian approach is often intractable because it requires high-dimensional numerical integration, whereas the MAP estimate is much easier to compute.
- However, with the MAP estimate we forgo the ability to characterise our uncertainty over the parameters, i.e. our epistemic uncertainty (see the sketch below, which compares the two on a toy problem).
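A minimal sketch of this trade-off, assuming Bayesian linear regression with a Gaussian prior $\pmb w \sim \mathcal N(\pmb 0, \alpha^{-1} \pmb I)$ and known noise variance $\sigma^2$ (values chosen arbitrarily), where the posterior happens to have a closed form. The full Bayesian prediction is approximated by Monte Carlo over posterior samples; the epistemic variance it carries is exactly what the MAP point prediction discards:

```python
import numpy as np

# Toy Bayesian linear regression (assumed setup): prior w ~ N(0, (1/alpha) I),
# noise variance sigma2. The posterior p(w | D) is Gaussian with closed form
# S = (alpha I + X^T X / sigma2)^{-1}, m = S X^T y / sigma2.
rng = np.random.default_rng(0)
N, D = 20, 3
sigma2, alpha = 0.25, 1.0

w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)

S = np.linalg.inv(alpha * np.eye(D) + X.T @ X / sigma2)   # posterior covariance
m = S @ X.T @ y / sigma2                                  # posterior mean (= w_MAP here)

x_new = rng.normal(size=D)

# Full Bayesian prediction: marginalise over w via Monte Carlo posterior samples
w_samples = rng.multivariate_normal(m, S, size=5000)
pred_samples = w_samples @ x_new
print("predictive mean:", pred_samples.mean(), "vs MAP prediction:", x_new @ m)

# The MAP point prediction carries no parameter uncertainty, whereas the
# predictive distribution has an epistemic variance term x^T S x:
print("epistemic variance:", x_new @ S @ x_new, "~", pred_samples.var())
```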
What is the process of marginalisation in Bayesian machine learning?
Averaging the model over all possible weights, weighted by the posterior, so that predictions come from this “average” model rather than from a single point estimate of $\pmb w$.
What is the Bayesian prediction distribution?
The distribution over outputs $p(y \mid \pmb x, \mathcal D)$, obtained by averaging $p(y \mid \pmb x, \pmb w)$ over the posterior, i.e. marginalising out the weights.
What are conjugate models in the context of Bayesian machine learning?
Models where the posterior belongs to the same family of distributions as the prior (e.g. a Gaussian prior with a Gaussian likelihood gives a Gaussian posterior).
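A minimal sketch of conjugacy using an assumed Beta-Bernoulli example (a Beta prior is conjugate to the Bernoulli likelihood), where the posterior follows from simply updating the prior’s parameters, with no numerical integration:

```python
from scipy.stats import beta

# Assumed toy example: Beta prior over a coin's bias, Bernoulli likelihood.
# Conjugacy means the posterior is again a Beta distribution, obtained by
# updating the prior's parameters with the observed counts.
a0, b0 = 2.0, 2.0                       # Beta(a0, b0) prior
flips = [1, 0, 1, 1, 1, 0]              # observed data (1 = heads)

heads = sum(flips)
tails = len(flips) - heads
a_post, b_post = a0 + heads, b0 + tails

print(f"posterior: Beta({a_post}, {b_post})")
print("posterior mean:", beta(a_post, b_post).mean())
```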
When applying true Bayesian machine learning, what often makes the problem intractable in practice?
Numerical integration (used for finding $p(\mathcal D)$ and determining the marginalised distribution).
Why is a maximum a posteriori estimator used instead of using a full Bayesian approach (e.g. not computing the full posterior distribution)?
Numerical integration makes the problem intractable.
Can you summarise the “maximum a posteriori estimator” technique in the context of Bayesian machine learning, and give the mathematical formulation for finding the MAP estimator $\pmb w _ {\text{MAP}\,}$ in terms of the data $\mathcal D$ and the parameters $\pmb w$?
Finding the modal value of the posterior distribution.
\[\pmb w_{\text {MAP}\,} = \text{argmax}_{\pmb w} \text{ } p(\pmb w \mid \mathcal D)\]Suppose we are trying to estimate a parameter $\pmb w$ for data $\pmb X, \pmb y$, assuming a linear relationship with Gaussian noise, i.e.
\[p(\pmb y \mid \pmb X, \pmb w) = \frac{1}{(2\pi \sigma^2)^{N/2}} \cdot \exp\left(-\frac{(\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w)}{2\sigma^2}\right)\]
Recall that if we assume a Gaussian prior over the weights $\pmb w$, we have
\[\pmb w \sim \mathcal N(\pmb \mu, \pmb \Sigma) \implies p(\pmb w) = \frac{1}{(2\pi)^{D/2} |\pmb \Sigma|^{1/2}\,}\exp\left(-\frac 1 2(\pmb w - \pmb \mu)^T\pmb \Sigma^{-1} (\pmb w - \pmb \mu)\right)\]
and if we assume each parameter $\pmb w _ i$ in the weights has an independent Laplacian distribution, we have
\[\pmb w_i \sim \text{Lap}(\mu, b) \implies p(\pmb w_i) = \frac{1}{2b}\exp\left(-\frac{|\pmb w_i - \mu|}{b}\right)\]
By considering a prior of the form
\[p(\pmb w) \propto \exp(-\mathcal R(\pmb w))\]
quickly prove that ridge regression with parameter $\lambda$ corresponds to taking the prior over $\pmb w$ to be
\[p(\pmb w) = \mathcal N \left(0, \text{diag}(\sigma^2/\lambda)\right)\]
and then estimating $\pmb w$ with a MAP estimator and that LASSO with parameter $\lambda$ corresponds to taking the prior over $\pmb w$ to be
\[p(w_i) = \text{Lap}(0, 2\sigma^2/\lambda)\]
for each component $w _ i$ of $\pmb w$, and then estimating $\pmb w$ with a MAP estimator.
It’s easier to start from what using a MAP estimator for prior $p(\pmb w) \propto \exp(-\mathcal R(\pmb w))$ looks like, and then consider how we can get the same loss function as ridge and lasso.
The MAP estimator considers
\[\begin{aligned} \hat{\pmb w} &= \text{argmax}_{\pmb w} \text{ } p(\pmb w \mid X, \pmb y) \\\\ &= \text{argmin}_{\pmb w} \text{ } \mathcal L_{\mathsf{MAP}}(\pmb w \mid X, \pmb y) \end{aligned}\]where
\[\mathcal L_{\mathsf{MAP}}(\pmb w \mid X, \pmb y) = -\log(p(\pmb y \mid X, \pmb w) \cdot p(\pmb w))\](this is just using Bayes’ rule, and then negating and taking the log to turn products into sums. Technically, we should have $-\log(p(X, \pmb y \mid \pmb w) \cdot p(\pmb w))$, but since $p(X, \pmb y \mid \pmb w) = p(\pmb y \mid X, \pmb w) p(X \mid \pmb w)$ and we don’t care about $p(X \mid \pmb w)$, we can ignore this term and still get an equivalent loss function).
In our case
\[\begin{aligned} p(\pmb y \mid X, \pmb w) \cdot p(\pmb w) &\propto \exp\left(-\frac{(\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w)}{2\sigma^2}\right) \cdot \exp(-\mathcal R(\pmb w)) \\\\ &= \exp\left(-\frac{(\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w)}{2\sigma^2} - \mathcal R(\pmb w)\right) \end{aligned}\]This gives
\[\begin{aligned} \mathcal L_{\mathsf{MAP}}(\pmb w \mid X, \pmb y) &\propto \frac{(\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w)}{2\sigma^2} + \mathcal R(\pmb w) \\\\ &\propto (\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w) + 2\sigma^2\mathcal R(\pmb w) \end{aligned}\]Although these lines are only proportional to $\mathcal L_{\mathsf{MAP}}$ (they differ by additive constants and a positive scaling factor), neither of these changes which $\pmb w$ minimises the loss, so for the purpose of finding the MAP estimate they are for all intents and purposes equal.
So
\[\mathcal L_{\mathsf{MAP}}(\pmb w \mid X, \pmb y) = (\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w) + 2\sigma^2\mathcal R(\pmb w)\]How can we match this up to the loss function for ridge or LASSO? Note that
\[\mathcal L_{\mathsf{ridge}}(\pmb w \mid X, \pmb y) = (\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w) + \lambda \pmb w^\top \pmb w\]So if $2\sigma^2 \mathcal R(\pmb w) = \lambda \pmb w^\top \pmb w$, then we are done. This is true when
\[\mathcal R(\pmb w) = \frac{\lambda}{2\sigma^2} \pmb w^\top \pmb w\]so then
\[\begin{aligned} p(\pmb w) &\propto \exp\left(-\frac{\lambda}{2\sigma^2} \pmb w^\top \pmb w\right) \\\\ &\propto \frac{1}{(2\pi)^{D/2} \cdot |\pmb \Lambda|^{1/2} } \exp\left(-\frac{1}{2} \pmb w^\top \pmb \Lambda^{-1} \pmb w\right) \\\\ &= \mathcal N\left(\pmb 0, \pmb \Lambda\right) \end{aligned}\]where $\pmb \Lambda = \text{diag}(\sigma^2/\lambda)$ as required.
What about LASSO? This has
\[\mathcal L_{\mathsf{LASSO}}(\pmb w) = (\pmb y - \pmb X \pmb w)^\top (\pmb y - \pmb X \pmb w) + \lambda \sum^{D}_{i = 1} |w_i|\]As before, the losses match when $2\sigma^2 \mathcal R(\pmb w) = \lambda \sum^D_{i = 1} |w_i|$, i.e. when
\[\mathcal R(\pmb w) = \frac{\lambda}{2\sigma^2} \sum^D_{i = 1} |w_i|\]Then
\[\begin{aligned} p(\pmb w) &\propto \exp\left(-\frac{\lambda}{2\sigma^2} \sum^D_{i = 1} |w_i|\right) \\\\ &\propto \prod^D_{i = 1} \exp\left(-\frac{|w_i|}{2\sigma^2/\lambda}\right) \\\\ &\propto \prod^D_{i = 1} \frac{1}{2 \cdot (2\sigma^2/\lambda)} \exp\left(-\frac{|w_i|}{2\sigma^2/\lambda}\right) \\\\ &= \prod^D_{i = 1} \text{Lap}\left(0, \frac{2\sigma^2}{\lambda}\right) \end{aligned}\]which is equivalent to
\[p(w_i) = \text{Lap}(0, 2\sigma^2/\lambda)\]for each $w _ i$.
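A minimal numerical check of the ridge half of this result, on assumed toy data with arbitrary choices of $\sigma^2$ and $\lambda$: minimising the negative log posterior under the Gaussian prior should recover the ridge closed-form solution $(\pmb X^\top \pmb X + \lambda \pmb I)^{-1} \pmb X^\top \pmb y$:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data and hyperparameters; sigma2 and lam are arbitrary choices.
rng = np.random.default_rng(1)
N, D = 50, 4
sigma2, lam = 0.5, 2.0
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(scale=np.sqrt(sigma2), size=N)

def neg_log_posterior(w):
    # -log p(y | X, w) - log p(w), with p(w) proportional to
    # exp(-(lam / (2 sigma2)) w^T w), dropping w-independent constants
    resid = y - X @ w
    return resid @ resid / (2 * sigma2) + lam / (2 * sigma2) * (w @ w)

w_map = minimize(neg_log_posterior, np.zeros(D)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-4))   # expect True up to optimiser tolerance
```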
What is the link between the different regularisation schemes (e.g. ridge or lasso) for linear regression and Bayesian machine learning?
Each regularisation scheme corresponds to a different prior on the distribution of weights.
Suppose we have a prior over the model parameters $\pmb w$ in which each component $\pmb w _ i$ is an independent Laplacian, i.e.
\[\pmb w_i \sim \text{Lap}(\mu, b)\]
Can you give an expression for $p(\pmb w _ i)$?
\[p(\pmb w_i) = \frac{1}{2b}\exp\left(-\frac{|\pmb w_i - \mu|}{b}\right)\]
Assume we are training two classification models:
- In one, we just use a maximum likelihood estimate (MLE) in order to find weights
- In the other, we use the maximum a posteriori estimate (MAP) in order to find weights, using a Gaussian prior over the weights
Can you:
- Explain for which one you would expect to achieve a lower error on the training data
- Explain for which one you would expect to achieve a lower error on the test data
- Justify this by mentioning the bias-variance tradeoff
MAP can be viewed as a type of regularisation, which can reduce the expressive power of the model in exchange for better generalisation. So you would expect MLE to do better on the training data, but for MAP to generalise better to the test data.
Regularisation can be seen as increasing the bias in exchange for decreasing the variance.
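A minimal sketch of this comparison on assumed synthetic linear-regression data (with an arbitrarily chosen prior strength $\lambda$), where MLE corresponds to ordinary least squares and MAP with a Gaussian prior corresponds to ridge regression, as derived above:

```python
import numpy as np

# Assumed synthetic setup: few training points relative to the dimension, so
# the unregularised (MLE) fit has high variance.
rng = np.random.default_rng(2)
D, N_train, N_test, sigma = 30, 40, 1000, 1.0
w_true = rng.normal(size=D)

def make_data(n):
    X = rng.normal(size=(n, D))
    return X, X @ w_true + rng.normal(scale=sigma, size=n)

X_tr, y_tr = make_data(N_train)
X_te, y_te = make_data(N_test)

w_mle = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]          # MLE / ordinary least squares
lam = 2.0                                                    # arbitrary prior strength
w_map = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(D), X_tr.T @ y_tr)  # MAP / ridge

mse = lambda X, y, w: np.mean((y - X @ w) ** 2)
print("train MSE:", mse(X_tr, y_tr, w_mle), mse(X_tr, y_tr, w_map))  # MLE typically lower
print("test MSE: ", mse(X_te, y_te, w_mle), mse(X_te, y_te, w_map))  # MAP typically lower
```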