Notes - Machine Learning MT23, Maximum likelihood principle
[[Course - Machine Learning MT23]] - [[Notes - Machine Learning MT23, Linear regression]]
Flashcards
What is the maximum likelihood principle?
Can you summarise the maximum likelihood principle?
The best-fit model for a given dataset is the one under which the observed data has the highest probability.
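A minimal sketch of the principle in code (the coin-flip data and candidate parameters are made up for illustration, not from the course): evaluate the probability of the observed data under each candidate model and keep the one under which it is highest.

```python
import numpy as np

# Observed data: 10 coin flips, 7 heads (hypothetical).
flips = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Candidate models: Bernoulli distributions with different head probabilities.
candidates = [0.3, 0.5, 0.7, 0.9]

def log_likelihood(p, data):
    """Log-probability of the observed flips under Bernoulli(p)."""
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Maximum likelihood principle: pick the model under which the observed
# data has the highest probability.
best = max(candidates, key=lambda p: log_likelihood(p, flips))
print(best)  # 0.7, matching the empirical frequency of heads
```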
In the context of linear regression, when using the maximum likelihood principle we wish to learn a mapping $f _ {\pmb w} : \mathbb R^D \to \mathbb R$ which we assume is linear, but with a normally-distributed error term. How can we describe this mathematically? Assume the output is $y$.
then
\[y = \pmb w^T \pmb x + \epsilon, \quad \epsilon \sim \mathcal N(0, \sigma^2)\]
i.e. $y \sim \mathcal N(\pmb w^T \pmb x, \sigma^2)$.
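A small sketch of this generative model in code; the dimensions, weights and noise level below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 100                         # input dimension and sample count (arbitrary)
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical weight vector
sigma = 0.3

X = rng.normal(size=(N, D))           # design matrix, one row per x_i
eps = rng.normal(0.0, sigma, size=N)  # Gaussian error term
y = X @ w_true + eps                  # each y_i is linear in x_i plus noise
```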
Consider linear regression under the maximum likelihood framework. We assume that for each $y _ i$, $y _ i = \pmb w \cdot \pmb x _ i + \epsilon _ i$, where $\langle \pmb x _ i, y _ i\rangle^N _ {i = 1}$ is the data we observe and $\epsilon _ i \sim \mathcal N(0, \sigma^2)$. Find an expression for the negative log-likelihood, the function we wish to minimise.
The likelihood of the data is
\[p(\pmb y \mid \pmb X, \pmb w, \sigma) = \prod^N _ {i = 1} \mathcal N(y _ i \mid \pmb w \cdot \pmb x _ i, \sigma^2)\]
then, taking the negative logarithm and using matrix notation,
\[\text{NLL}(\pmb y \mid \pmb X, \pmb w, \sigma) = \frac{1}{2\sigma^2}(\pmb X \pmb w - \pmb y)^T(\pmb X \pmb w - \pmb y) + \frac{N}{2}\log(2\pi \sigma^2)\]
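A sketch of this NLL on synthetic data (same arbitrary setup as the sampling sketch above); since only the first term depends on $\pmb w$, minimising the NLL over $\pmb w$ is the same as minimising $(\pmb X \pmb w - \pmb y)^T(\pmb X \pmb w - \pmb y)$, so the MLE coincides with the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 100, 3, 0.3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, sigma, size=N)

def nll(w, X, y, sigma):
    """NLL(y | X, w, sigma) in matrix notation, as in the formula above."""
    resid = X @ w - y
    return resid @ resid / (2 * sigma**2) + len(y) / 2 * np.log(2 * np.pi * sigma**2)

# The w-dependent part of the NLL is the squared residual, so the MLE
# is the least-squares estimate.
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_mle, nll(w_mle, X, y, sigma))
```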
Under the MLE framework, if you use normally distributed errors vs Laplace distributed errors for a linear regression, what happens?
- Normally distributed errors: equivalent to least-squares (i.e. $l _ 2$) regression
- Laplace distributed errors: equivalent to $l _ 1$ regression, i.e. minimising the sum of absolute residuals (contrasted in the sketch below)
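A minimal sketch contrasting the two error models on hypothetical one-dimensional data with a single outlier; the Gaussian NLL reduces (up to constants) to a squared-error objective and the Laplace NLL to an absolute-error objective, which is less affected by the outlier.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, size=20)  # true slope 2.0 (hypothetical data)
y[-1] += 5.0                                 # a single large outlier

def gaussian_nll(w):   # up to constants: l2 / least-squares objective
    return np.sum((w * x - y) ** 2)

def laplace_nll(w):    # up to constants: l1 / least-absolute-deviations objective
    return np.sum(np.abs(w * x - y))

w_l2 = minimize(gaussian_nll, x0=[0.0]).x[0]
w_l1 = minimize(laplace_nll, x0=[0.0], method="Nelder-Mead").x[0]
print(w_l2, w_l1)      # the Laplace / l1 fit stays closer to the true slope
```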