Notes - Machine Learning MT23, Underfitting and overfitting
Note: a couple of the definitions in this document are not fully correct; they define the variance of a random variable $X$ as $\mathbb E[(X - \mathbb E[X])^2]$, which only really makes sense if $X$ is a scalar-valued random variable, whereas here $X$ might be a vector.
Flashcards
What is the correspondence between bias/variance and underfitting/overfitting?
- High bias, low variance: underfitting
- High variance, low bias: overfitting
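As a quick illustration (not from the original notes), here is a minimal sketch of this correspondence using polynomial regression on synthetic data; the target function, polynomial degrees and noise level are all arbitrary choices. The low-degree fit typically has high training and test error (high bias, underfitting), while the high-degree fit typically has much lower training error but higher test error (high variance, overfitting).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=20):
    # Noisy samples from a smooth target function
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()

for degree in [1, 3, 15]:
    # Least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```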
Suppose we have some model that describes the response variables $\pmb y$ given a training data matrix $\pmb X$ of $D$-dimensional inputs, and the model itself is described by some parameter $\pmb w$ (e.g. the weights in a linear regression).
Suppose further that the data was truly generated by this model, i.e. $\pmb y$ was produced from $\pmb X$ using some true parameter $\pmb w^\star$. Then a lot of algorithms can be formulated as finding an estimator for the true parameter $\pmb w^\star$, say $\hat{\pmb w}$. In this formulation, how is the bias of the estimator $\hat{\pmb w}$ defined, and what does it mean for it to be “unbiased”?
The bias is $\text{Bias}(\hat{\pmb w}) = ||\mathbb E[\hat{\pmb w}] - \pmb w^\star||_2$, and unbiased means that the bias is zero (equivalently, $\mathbb E[\hat{\pmb w}] = \pmb w^\star$).
Suppose:
- We have some model that describes the response variables $\pmb y$ given a training data matrix $\pmb X$ of $D$-dimensional inputs
- The model itself is described by some parameter $\pmb w$ (e.g. the weights in a linear regression)
- The data was truly generated by this model, i.e. $\pmb y$ was produced from $\pmb X$ using some true parameter $\pmb w^\star$.
A lot of algorithms can be formulated as finding an estimator for the true parameter $\pmb w^\star$, say $\hat{\pmb w}$. In this formulation:
- What is the bias?
- What is the “MSE”?
- What is the variance?
- What decomposition of the MSE into a function of the bias and the variance do we have?
MSE stands for “mean squared error”. The definitions are
\[\text{Bias}(\hat{\pmb w}) = ||\mathbb E[\hat{\pmb w}] - \pmb w^\star||_2\]
\[\text{Var}(\hat{\pmb w}) = \mathbb E[||\hat{\pmb w} - \mathbb E[\hat{\pmb w}]||^2]\]
\[\text{MSE}(\hat{\pmb w}) = \mathbb E\left[||\hat{\pmb w} - \pmb w^\star||^2\right]\]
and we have the decomposition
\[\text{MSE}(\hat{\pmb w}) = \text{Bias}(\hat{\pmb w})^2 + \text{Var}(\hat{\pmb w})\]
Suppose:
- We have some model that describes the response variables $\pmb y$ given a training data matrix $\pmb X$ of $D$-dimensional inputs
- The model itself is described by some parameter $\pmb w$ (e.g. the weights in a linear regression)
- The data was truly generated by this model, i.e. $\pmb y$ was produced from $\pmb X$ using some true parameter $\pmb w^\star$.
A lot of algorithms can be formulated as finding an estimator for the true parameter $\pmb w^\star$, say $\hat{\pmb w}$. In this formulation we have the following definitions:
\[\text{Bias}(\hat{\pmb w}) = ||\mathbb E[\hat{\pmb w}] - \pmb w^\star||_2\]
\[\text{V}(\hat{\pmb w})= \mathbb E[||\mathbb E[\hat{\pmb w}] - \hat{\pmb w}||^2]\]
\[\text{MSE}(\hat{\pmb w}) = \mathbb E\left[||\hat{\pmb w} - \pmb w^\star||^2\right]\]
(Note that the notation $\text{V}$ is non-standard and is not quite the variance: for a vector random variable the natural object is the covariance matrix, and $\text{V}$ as defined here is its trace.)
Quickly prove that we have the decomposition
\[\text{MSE}(\hat{\pmb w}) = \text{Bias}(\hat{\pmb w})^2 + \text{V}(\hat{\pmb w})\]
A quick proof, using an identity analogous to the fact that for any scalar random variable $x$
\[\text{Var}(x) = \mathbb E[x^2] - \mathbb E[x]^2 \implies \mathbb E[x^2] = \text{Var}(x) + \mathbb E[x]^2\]
We want to show that for a vector random variable $X$
\[\text{V}(X) = \mathbb E[X^\top X] - \mathbb E[X]^\top \mathbb E[X]\]
Expanding the definition:
\[\begin{aligned} \text{V}(X) &:= \mathbb E[(X - \mathbb E[X])^\top (X - \mathbb E[X])] \\\\ &= \mathbb E[X^\top X - X^\top \mathbb E[X] - \mathbb E[X]^\top X + \mathbb E[X]^\top \mathbb E[X]] \\\\ &= \mathbb E[X^\top X] - 2\mathbb E[X]^\top \mathbb E[X] + \mathbb E[X]^\top \mathbb E[X] \\\\ &= \mathbb E[X^\top X] - \mathbb E[X]^\top \mathbb E[X] \end{aligned}\]
Or equivalently, since $\mathbb E[X^\top X] = \mathbb E[||X||^2]$,
\[\mathbb E[||X||^2] = ||\mathbb E[X]||^2 + \mathbb E[||X - \mathbb E[X]||^2]\]
Letting $X = \hat{\pmb w} - \pmb w^\star$ in the above, and noting that $\pmb w^\star$ is deterministic, so $\mathbb E[X] = \mathbb E[\hat{\pmb w}] - \pmb w^\star$ and $X - \mathbb E[X] = \hat{\pmb w} - \mathbb E[\hat{\pmb w}]$, we see
\[\mathbb E\left[||\hat{\pmb w} - \pmb w^\star||^2\right] = ||\mathbb E[\hat{\pmb w}] - \pmb w^\star||^2 + \mathbb E\left[||\hat{\pmb w} - \mathbb E[\hat{\pmb w}]||^2\right]\]
i.e. $\text{MSE}(\hat{\pmb w}) = \text{Bias}(\hat{\pmb w})^2 + \text{V}(\hat{\pmb w})$, which is the result we wish to show.
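As a sanity check (not part of the original notes), the decomposition can also be verified numerically by Monte Carlo simulation. The sketch below uses the ordinary least-squares estimator $\hat{\pmb w} = (\pmb X^\top \pmb X)^{-1} \pmb X^\top \pmb y$ (computed via `np.linalg.lstsq`) on synthetic data generated from a known $\pmb w^\star$; the dimensions, noise level and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N, trials = 3, 20, 20000
w_star = np.array([1.0, -2.0, 0.5])      # true parameter w*
X = rng.normal(size=(N, D))              # fixed design matrix

w_hats = np.empty((trials, D))
for t in range(trials):
    # Generate y from the true model with Gaussian noise, then fit least squares
    y = X @ w_star + rng.normal(0, 1.0, N)
    w_hats[t] = np.linalg.lstsq(X, y, rcond=None)[0]

mean_w_hat = w_hats.mean(axis=0)
bias_sq = np.sum((mean_w_hat - w_star) ** 2)                     # Bias(w_hat)^2
variance = np.mean(np.sum((w_hats - mean_w_hat) ** 2, axis=1))   # V(w_hat)
mse = np.mean(np.sum((w_hats - w_star) ** 2, axis=1))            # MSE(w_hat)

# Up to Monte Carlo error, the two printed values agree
print(mse, bias_sq + variance)
```

Up to Monte Carlo error the two printed numbers agree, and the bias term is essentially zero here, since least squares is unbiased when the model is correctly specified.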
What is the idea of model averaging, and why is this useful in terms of the bias-variance tradeoff?
Train $k$ different models and then combine their predictions into one overall model. This is useful because it reduces the variance of the combined model (similar to how averaging gradients over a mini-batch reduces the variance of the gradient estimate in stochastic gradient descent).
One approach to model averaging to reduce variance is “bagging” (also known as “bootstrap aggregation”). Given a dataset $\mathcal D = \langle (\pmb x _ i, y _ i) \rangle^N _ {i=1}$, quickly describe how this works.
- Generate $k$ new datasets $\mathcal D _ 1, \cdots, \mathcal D _ k$, each of size $N$ by sampling from $\mathcal D$ with replacement.
- Train models $M _ 1, \cdots, M _ k$ on each of these datasets.
- When making predictions, combine their outputs in a sensible way (e.g. averaging for regression, or a majority vote for classification); see the sketch below.
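A minimal sketch of bagging for a regression problem (not from the original notes): the base model (a fixed-degree polynomial fit via `np.polyfit`), the number of models $k$ and the toy dataset are all arbitrary choices, and predictions are combined by averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset D: noisy samples from a smooth target function
N = 50
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

k, degree = 10, 8
models = []
for _ in range(k):
    # Bootstrap resample: draw N indices from D with replacement
    idx = rng.integers(0, N, N)
    models.append(np.polyfit(x[idx], y[idx], degree))

def predict(x_new):
    # Combine the k models by averaging their predictions (regression case)
    return np.mean([np.polyval(m, x_new) for m in models], axis=0)

print(predict(np.array([0.25, 0.5, 0.75])))
```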