Machine Learning MT23, Linear regression
I’m not sure the flashcards near the end about an alternative expression for $\pmb w$ are correct – I think the answer needs to be left in the “prediction form”, otherwise you still end up doing big matrix products.
Flashcards
Suppose we have the linear model $\hat{\mathbf y} = \mathbf{Xw}$ where there are $D$ features and $N$ inputs. Can you define a loss function $\mathcal L(\mathbf w)$?
Compute the gradient $\nabla _ {\mathbf w} \mathcal L$ where
\[\mathcal L (\mathbf w) = \frac{1}{2N} (\mathbf X \mathbf w - \mathbf y)^T (\mathbf X \mathbf w - \mathbf y)\]
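(A quick numerical check of the gradient, mine rather than from the course: the standard result is $\nabla _ {\mathbf w} \mathcal L = \frac{1}{N} \mathbf X^\top (\mathbf X \mathbf w - \mathbf y)$, and the sketch below compares it against central finite differences on random data.)

```python
import numpy as np

# Sketch (not from the notes): verify the gradient of the 1/(2N) squared-error
# loss against a central finite-difference approximation on random data.
rng = np.random.default_rng(0)
N, D = 50, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

def loss(w):
    r = X @ w - y
    return r @ r / (2 * N)

grad_analytic = X.T @ (X @ w - y) / N   # assumed closed-form gradient

eps = 1e-6
grad_numeric = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # expect True
```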
In the analytic solution of linear regression, we solve the following:
\[\frac{1}{N} \left((\mathbf X^T \mathbf X) \mathbf w - \mathbf X^T \mathbf y\right) = 0\]
What is the solution for $\mathbf w$ in terms of $\mathbf X$ in this case?
Suppose we have the linear model $\hat{\mathbf y} = \mathbf{Xw}$ where there are $D$ features and $N$ inputs. Assuming there is an analytic solution, what is the time complexity of calculating $\mathbf w$?
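A sketch of the closed-form fit (mine, not from the notes): forming $X^\top X$ costs $O(ND^2)$ and the $D \times D$ solve costs $O(D^3)$, which is where the overall cost comes from.

```python
import numpy as np

# Sketch: closed-form least-squares fit via the normal equations.
# Cost is dominated by X.T @ X (O(N D^2)) plus the D x D solve (O(D^3)).
rng = np.random.default_rng(1)
N, D = 1000, 10
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

# Prefer solve() over an explicit inverse for numerical stability.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_hat - w_true, 2))  # should be close to zero
```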
Suppose we have the linear model $\hat{\mathbf y} = \mathbf{Xw}$ where there are $D$ features and $N$ inputs. Can you define a loss function for ridge regression $\mathcal L _ {\text{ridge}\,}(\mathbf w)$, and explain the idea behind this?
\[\mathcal L _ {\text{ridge}\,}(\mathbf w) = \frac{1}{2N} (\mathbf X \mathbf w - \mathbf y)^T (\mathbf X \mathbf w - \mathbf y) + \lambda \sum^D_{i = 1} w_i^2\]The last term is a penalty term for the weights: it discourages large weights, which helps prevent overfitting.
When performing ridge regression, we penalise weight size with a penalty term
\[\lambda \sum^D_{i = 1} w_i^2\]
What is a potential issue here and how can we solve it?
The size of a weight can depend on the scale (or translation) of the corresponding component of the input, so the penalty would not treat all features comparably. Hence we should standardise the inputs to all have mean $0$ and variance $1$.
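A minimal sketch of this standardisation step (my own illustration; the function name is just for this example):

```python
import numpy as np

# Sketch: standardise each feature to mean 0 and variance 1 (using training
# statistics only), so the ridge penalty treats all weights on the same scale.
def standardise(X_train, X_test):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```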
Suppose we have the linear model $\hat{\mathbf y} = \mathbf{Xw}$ where there are $D$ features and $N$ inputs. Can you define a loss function for lasso regression $\mathcal L _ {\text{lasso}\,}(\mathbf w)$, and explain the idea behind this?
\[\mathcal L _ {\text{lasso}\,}(\mathbf w) = \frac{1}{2N} (\mathbf X \mathbf w - \mathbf y)^T (\mathbf X \mathbf w - \mathbf y) + \lambda \sum^D_{i = 1} |w_i|\]The last term is a penalty term for the weights: the $\ell _ 1$ penalty discourages large weights and encourages sparse solutions.
Why is there no analytic solution to linear regression with the lasso objective?
The penalty term is not differentiable.
What is one of the main advantages of using lasso in the context of linear regression (why is it called lasso)?
It also performs variable selection, i.e. it will set some of the weights to exactly zero – it’s “lasso”-ing the relevant variables (LASSO: least absolute shrinkage and selection operator).
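To see the variable selection concretely, a small sketch using scikit-learn’s `Lasso` (my own example, not from the notes): with only a few truly relevant features, most fitted coefficients come out exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch: lasso drives many weights to exactly zero when most features are irrelevant.
rng = np.random.default_rng(2)
N, D = 200, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 1.0]          # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=N)

model = Lasso(alpha=0.1).fit(X, y)
print(np.sum(model.coef_ != 0))         # typically close to 3 non-zero weights
```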
In the analytic solution of linear regression, we solve the following:
\[\frac{1}{N} \left((\mathbf X^T \mathbf X) \mathbf w - \mathbf X^T \mathbf y\right) = 0\]
The solution for $\mathbf w$ in terms of $\mathbf X$ in this case is then
\[\mathbf w = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y\]
However, we have an issue if $\mathbf X^\top \mathbf X$ is not invertible, since then there are actually infinitely many $\mathbf w$ that minimise the objective. Without using regularisation, how can we calculate one of these $\mathbf w$?
Use the Moore-Penrose pseudoinverse, which always exists
\[\mathbf w = (\mathbf X^\top \mathbf X)^+ \mathbf X^\top \mathbf y\]Suppose we have an arbitrary matrix $\mathbf A \in \mathbb R^{m \times n}$. How can we compute the Moore-Penrose pseudoinverse?
Take the SVD:
\[\mathbf A = \mathbf U \mathbf \Sigma \mathbf V^\top\]and then
\[\mathbf A^+ = \mathbf V \mathbf \Sigma^+ \mathbf U^\top\]$\mathbf \Sigma^+$ can be calculated by replacing all the non-zero diagonal entries $\sigma _ i$ with their reciprocals, and then taking the transpose of the resulting matrix.
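A quick numerical check of this recipe (mine), comparing the SVD construction against NumPy’s built-in `pinv`:

```python
import numpy as np

# Sketch: build the Moore-Penrose pseudoinverse from the SVD and compare
# against np.linalg.pinv for a rank-deficient matrix.
rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 4))   # rank <= 3, shape 5 x 4

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_plus = np.where(s > 1e-10, 1.0 / s, 0.0)              # reciprocals of non-zero singular values
A_plus = Vt.T @ np.diag(s_plus) @ U.T                   # A^+ = V Sigma^+ U^T

print(np.allclose(A_plus, np.linalg.pinv(A)))           # expect True
```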
Suppose you have training data with lots of features, but you believe the number of actually relevant features is quite small. What regularisation scheme would you then use?
$\ell _ 1$ regularisation, since this performs feature selection.
Quickly define the linear regression model with Gaussian noise. State what its parameters are and any assumptions made.
- We aim to take input $\pmb x \in \mathbb R^D$ and predict the value of a scalar as output.
- Output variable $y$ is a linear function of input variables, plus a noise / uncertainty term $\varepsilon$, i.e. $y = \pmb w^\top \pmb x + \varepsilon$.
- We assume $\pmb x$ includes a constant $1$ component (i.e. $\mathbf X$ has a column of $1$s) so that a bias term does not have to be treated separately.
- $\pmb w \in \mathbb R^D$ is a vector of weights, and the only parameter in the model.
- We model $p(y \mid \pmb w, \pmb x) = \mathcal N(\pmb w^\top \pmb x, \sigma^2)$.
In a linear regression model with Gaussian noise, we model
\[p(y \mid \pmb w, \pmb x) = \mathcal N(\pmb w^\top \pmb x, \sigma^2)\]
What is $p(\pmb y \mid \pmb X, \pmb w)$?
This is the likelihood of the data.
\[\frac{1}{(2\pi \sigma^2)^{N/2} } \cdot \exp\left( -\frac{(\pmb y - \pmb X \pmb w)^\top(\pmb y - \pmb X \pmb w)}{2\sigma^2} \right)\](A quick derivation below:)
\[\begin{aligned} p(y_1, \ldots, y_N | x_1, \ldots, x_N, w, \sigma) &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2} } \exp\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) \\\\ &= \left( \frac{1}{2\pi\sigma^2} \right)^{N/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - w^T x_i)^2 \right) \end{aligned}\]In the analytic solution of linear regression, we want to minimise:
\[\frac{1}{N}(X \pmb w - \pmb y)^\top (X\pmb w - \pmb y)\]
Derive the solution for $\mathbf w$ in terms of $\mathbf X$ in this case.
Ignoring $1/N$, and expanding:
\[\begin{aligned} (X\pmb w - \pmb y)^\top (X \pmb w - \pmb y) &= (\pmb w^\top X^\top - \pmb y^\top)(X\pmb w - \pmb y) \\\\ &= \pmb w^\top X^\top X\pmb w - \pmb w^\top X^\top \pmb y - \pmb y^\top X \pmb w + \pmb y^\top \pmb y \end{aligned}\]Then differentiating with respect to $\pmb w$ and setting to $\pmb 0$,
\[2X^\top X \pmb w - 2X^\top \pmb y = \pmb 0\]Then rearranging gives:
\[\mathbf w = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y\]In the analytic solution of linear regression, minimising
\[\frac 1 N (X \pmb w - \pmb y)^\top (X \pmb w - \pmb y)\]
(equivalently, maximising the likelihood) gives the estimate
\[\hat{\mathbf w} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y\]
Assuming that the data was truly generated by a linear model, why is this an “unbiased estimator”?
By the assumption, we have
\[\mathbb E[\pmb y \mid X] = X\pmb w^\star\]for some true parameter $\pmb w^\star$. Then
\[\begin{aligned} \mathbb E[\hat{\pmb w} \mid X] &= \mathbb E[(X^\top X)^{-1}X^\top \pmb y \mid X] \\\\ &= (X^\top X)^{-1}X^\top \mathbb E[\pmb y \mid X] \\\\ &= (X^\top X)^{-1}X^\top X \pmb w^\star \\\\ &= \pmb w^\star \end{aligned}\]so
\[||\mathbb E[\hat{\pmb w}] - \pmb w^\star ||^2_2 = 0\]Hence unbiased.
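A quick simulation of this fact (my own check, not part of the course notes): averaging $\hat{\pmb w}$ over many datasets generated from a fixed $\pmb w^\star$ should recover $\pmb w^\star$.

```python
import numpy as np

# Sketch: Monte Carlo check that the least-squares estimator is unbiased when
# the data really come from y = X w_star + Gaussian noise (X held fixed).
rng = np.random.default_rng(4)
N, D, sigma = 100, 3, 0.5
X = rng.normal(size=(N, D))
w_star = np.array([1.0, -2.0, 0.5])

estimates = []
for _ in range(2000):
    y = X @ w_star + sigma * rng.normal(size=N)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.round(np.mean(estimates, axis=0) - w_star, 3))  # close to [0, 0, 0]
```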
The standard (primal) form of the ridge regression estimator for $\pmb w$ is given by
\[\pmb w = (X^\top X + \lambda I)^{-1}X^\top \pmb y\]
What alternative expression do we have for $\pmb w$, and why is this useful?
We can equivalently write
\[\pmb w = X^\top (X X^\top + \lambda I_N)^{-1} \pmb y\]This is useful because:
- If we have small $N$ and large $D$, then $X^\top X$ is a $D \times D$ matrix, but $X X^\top$ is only an $N \times N$ matrix, so the second form is much cheaper to invert.
- Each entry $(X X^\top) _ {ij}$ is $x _ i^\top x _ j$, so this opens the door to kernel methods: if we have a feature map $\phi$ with transformed data matrix $\Phi$, then $\Phi \Phi^\top$ is the Gram matrix of the kernel (note this is the opposite way around for kernelised PCA, but that’s because here the rows of $X$ are the data points, whereas in PCA the columns are the data points). A quick numerical check of the equivalence is sketched below.
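A small numerical sketch of this equivalence (my own check, not from the notes) – both the $D \times D$ and the $N \times N$ solves should give the same $\pmb w$:

```python
import numpy as np

# Sketch: the primal (D x D solve) and dual (N x N solve) forms of the
# ridge solution coincide; the dual is cheaper when N << D.
rng = np.random.default_rng(5)
N, D, lam = 20, 50, 0.3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

print(np.allclose(w_primal, w_dual))  # expect True
```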
The standard (primal) form of the ridge regression estimator for $\pmb w$ is given by
\[\pmb w = (X^\top X + \lambda I)^{-1}X^\top \pmb y\]
Quickly prove that we have the equivalent expression
\[\pmb w = X^\top (X X^\top + \lambda I_N)^{-1} \pmb y\]
From the defining equation of the primal solution,
\[\begin{aligned} &(X^\top X + \lambda I_D) \pmb w = X^\top \pmb y \\\\ \implies& \lambda \pmb w = X^\top (\pmb y - X \pmb w) \\\\ \implies& \pmb w = X^\top \pmb \alpha, \quad \text{where } \pmb \alpha := \frac{1}{\lambda}(\pmb y - X \pmb w) \end{aligned}\]Substituting $\pmb w = X^\top \pmb \alpha$ back into the definition of $\pmb \alpha$:
\[\begin{aligned} &\lambda \pmb \alpha = \pmb y - X X^\top \pmb \alpha \\\\ \implies& (X X^\top + \lambda I_N) \pmb \alpha = \pmb y \\\\ \implies& \pmb \alpha = (X X^\top + \lambda I_N)^{-1} \pmb y \\\\ \implies& \pmb w = X^\top (X X^\top + \lambda I_N)^{-1} \pmb y \end{aligned}\]Suppose we want to use kernels when performing ridge regression on data $X, \pmb y$. How could you calculate the parameter $\pmb w$ and make new predictions?
Using the dual form of the ridge estimator with the transformed data matrix $\Phi$ (whose rows are $\phi(\pmb x _ i)^\top$), we have
\[\pmb w = \Phi^\top (\Phi \Phi^\top + \lambda I_N)^{-1} \pmb y = \Phi^\top (K + \lambda I_N)^{-1} \pmb y\]where $K _ {ij} = k(\pmb x _ i, \pmb x _ j)$ is the Gram matrix. We never form $\pmb w$ explicitly (it lives in feature space); instead we leave everything in “prediction form”: for a new input $\pmb x$,
\[\hat y(\pmb x) = \phi(\pmb x)^\top \pmb w = \pmb k(\pmb x)^\top (K + \lambda I_N)^{-1} \pmb y\]where $\pmb k(\pmb x) _ i = k(\pmb x _ i, \pmb x)$, so only kernel evaluations are ever needed.
The ridge regression objective is
\[(\pmb X \pmb w - \pmb y)^T(\pmb X\pmb w - \pmb y) + \lambda \pmb w^\top \pmb w\]
and the LASSO objective is
\[(\pmb X \pmb w - \pmb y)^T(\pmb X\pmb w - \pmb y) + \lambda \sum^D_{i = 1} |w_i|\]
Can you give equivalent formulations as constrained convex optimisation problems?
Ridge regression:
\[\begin{aligned} \min& &&(\pmb X \pmb w - \pmb y)^T(\pmb X\pmb w - \pmb y) \\\\ \text{s.t.}& &&\pmb w^\top \pmb w \le R \end{aligned}\]LASSO regression:
\[\begin{aligned} \min& &&(\pmb X \pmb w - \pmb y)^T(\pmb X\pmb w - \pmb y) \\\\ \text{s.t.}& &&\sum^D_{i = 1} |w_i| \le R \end{aligned}\]The MLE for $\pmb w$ in linear regression is given by
\[\pmb w_\text{MLE} = (X^\top X)^{-1} X^\top \pmb y\]
What is the MLE for the variance $\sigma^2$?
The MLE for $\pmb w$ in linear regression is given by
\[\pmb w_\text{MLE} = (X^\top X)^{-1}X^\top \pmb y\]
Quickly prove that the MLE for the variance $\sigma^2$ is given by:
\[\sigma^2_\text{MLE} = \frac 1 N (\pmb X \pmb w_\text{MLE} - \pmb y)^\top (\pmb X \pmb w_\text{MLE} - \pmb y)\]
The NLL for linear regression is given by
\[\text{NLL}(\pmb y \mid \pmb X, \pmb w, \sigma) = \frac{1}{2\sigma^2} (\pmb X\pmb w - \pmb y)^\top (\pmb X\pmb w - \pmb y) + \frac{N}{2} \log(2\pi \sigma^2)\]Differentiating with respect to $\sigma$:
\[\frac{\text d \text{NLL}(\pmb y \mid \pmb X, \pmb w, \sigma)}{\text d \sigma} = - \frac{1}{\sigma^3} (\pmb X \pmb w - \pmb y)^\top (\pmb X \pmb w - \pmb y) + \frac N \sigma\]Setting this to zero and substituting $\pmb w _ \text{MLE}$ for $\pmb w$ gives
\[\sigma^2_\text{MLE} = \frac 1 N (\pmb X \pmb w_\text{MLE} - \pmb y)^\top (\pmb X \pmb w_\text{MLE} - \pmb y)\]
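A short sketch checking this empirically (my own illustration): with enough data, $\sigma^2 _ \text{MLE}$ computed from the residuals should be close to the true noise variance.

```python
import numpy as np

# Sketch: fit w by least squares, then estimate the noise variance from the
# residuals; sigma2_mle should be close to the true sigma^2 for large N.
rng = np.random.default_rng(6)
N, D, sigma = 10000, 5, 0.7
X = rng.normal(size=(N, D))
w_star = rng.normal(size=D)
y = X @ w_star + sigma * rng.normal(size=N)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)
residuals = X @ w_mle - y
sigma2_mle = residuals @ residuals / N

print(sigma2_mle, sigma**2)   # both close to 0.49
```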