Notes - Machine Learning MT23, Principal component analysis


Flashcards

Suppose we have a set of points $\pmb x _ 1, \cdots, \pmb x _ n$ in $\mathbb R^m$, which we assume are centered, and that we have a target dimension of $k \ll m$. What is the goal of PCA?


Find an orthonormal set of $k$ vectors $\pmb u _ 1, \cdots, \pmb u _ k$ such that the “reconstruction error” (how well we can approximate each $\pmb x _ i$ with some linear combination of $\pmb u _ 1, \cdots, \pmb u _ k$) is minimised.

Suppose we have a set of points $\pmb x _ 1, \cdots, \pmb x _ n$ in $\mathbb R^m$, which we assume are centered, and that we have a target dimension of $k \ll m$. What are the “principal components”?


The orthonormal set of $k$ vectors $\pmb u _ 1, \cdots, \pmb u _ k$.

Suppose we have a set of points $\pmb x _ 1, \cdots, \pmb x _ n$ in $\mathbb R^m$, which we assume are centered, and that we have a target dimension of $k \ll m$. Can you:

  • Define the reconstruction error given points $\tilde{\pmb x} _ 1, \cdots, \tilde{\pmb x} _ n$ which lie in a subspace of $\mathbb R^m$ of dimension of at most $k$
  • Give this in terms of the Frobenius norm
  • Explain why the Eckart-Young theorem means that we can solve this problem using SVD?

Reconstruction error is

\[\sum^n_{i = 1} ||\pmb x_i - \tilde{\pmb x}_i||^2\]

Construct matrices $X$ and $\tilde X$ whose columns are the $\pmb x _ i$ and $\tilde{\pmb x} _ i$ respectively. Then the reconstruction error is

\[||X - \tilde X||^2_F\]

And by the Eckart-Young theorem, the $k$-truncated SVD gives the optimal choice of $\tilde X$ among all matrices of rank at most $k$.
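
As a rough numpy sketch of this (the toy data and dimensions are my own, not from the notes), the $k$-truncated SVD gives $\tilde X$, and the sum-over-points and Frobenius-norm forms of the reconstruction error agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 10, 50, 3

# Centered data matrix with the points as its columns.
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)

# k-truncated SVD: by Eckart-Young this minimises ||X - X_tilde||_F^2
# over all matrices X_tilde of rank at most k.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_tilde = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Reconstruction error, as a sum over points and as a squared Frobenius norm.
err_sum = sum(np.linalg.norm(X[:, i] - X_tilde[:, i]) ** 2 for i in range(n))
err_fro = np.linalg.norm(X - X_tilde, "fro") ** 2
print(err_sum, err_fro)  # equal up to floating-point error
```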

Suppose we have a set of points $\pmb x _ 1, \cdots, \pmb x _ n$ in $\mathbb R^m$, which we assume are centered, and that we have a target dimension of $k \ll m$. We form the matrix $X$ with these points as its columns, and get an SVD of $X = U\Sigma V^\top$. What are the two outputs of PCA?


  • $U _ k$, which gives the first $k$ left singular vectors of $X$
  • $Z = U _ k^\top X$, which gives the coefficients of each point with respect to this basis (see the sketch below)
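
A minimal sketch of computing both outputs in numpy (the function name is mine):

```python
import numpy as np

def pca(X, k):
    """PCA on a centered m x n matrix X whose columns are the data points.

    Returns U_k (the first k left singular vectors) and Z = U_k^T X
    (the coefficients of each point with respect to that basis)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]      # m x k orthonormal basis
    Z = U_k.T @ X       # k x n coefficients
    return U_k, Z
```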

Suppose we have a set of points $\pmb x _ 1, \cdots, \pmb x _ n$ in $\mathbb R^m$, which we assume are centered, and that we have a target dimension of $k \ll m$. What does it mean to perform kernelised PCA for some kernel function $\kappa : \mathbb R^m \times \mathbb R^m \to \mathbb R$?


This kernel function represents some implicit feature map $\phi(\pmb x)$. Kernelised PCA means finding the best vectors with which to approximate each $\phi(\pmb x _ i)$, without ever explicitly mapping the points into the expanded feature space.

If you have data that is very high-dimensional but whose relevant structure actually lies in a much lower-dimensional subspace, what technique might you consider for reducing the dimension of the input?


Using PCA.

Suppose you are performing $k$-means clustering on very high-dimensional data which you have pre-processed with PCA to cut down the dimension of the input to $d$. There are two hyperparameters to choose here: $k$ and $d$.

Why is it important to choose both separately, deciding which value of $d$ to use by looking at the reconstruction error and deciding which value of $k$ to choose by looking at the $k$-means objective, rather than just looking at the minimum of the $k$-means objective over all values of $d$?


Projecting onto fewer dimensions can only make points artificially closer together, and since the $k$-means objective is based on squared distances, a low value of $d$ gives an artificially low value of the $k$-means objective, regardless of how much structure has been discarded.
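
A small sketch of the mechanism (the data, sizes and choice of $d$ values are arbitrary): the mean squared pairwise distance between the projected points shrinks as $d$ decreases, so $k$-means objectives computed at different $d$ are not comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))   # here points are rows, for convenience
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
for d in (40, 20, 10, 2):
    Z = X @ Vt[:d].T             # coordinates in the top-d principal directions
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    print(d, D2.mean())          # decreases as d decreases
```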

Quickly describe the inputs to the kernel PCA method, and quickly derive a method for performing PCA using the kernel trick.


Inputs:

  • $\pmb x _ 1, \ldots, \pmb x _ n$, original points in $\mathbb R^m$
  • Feature map $\phi : \mathbb R^m \to \mathbb R^d$ (might be implicit)
  • Associated kernel $\kappa : \mathbb R^m \times \mathbb R^m \to \mathbb R$ given by $\kappa(\pmb x, \pmb x') = \phi(\pmb x)^\top \phi(\pmb x')$.

To perform PCA using this kernel, without ever actually applying the feature map, we need to compute the SVD in the feature space and then truncate to get the best low-rank approximation.

  • Let $\pmb \mu = \frac 1 n \sum^n _ {i = 1} \phi(\pmb x _ i)$, the mean of the transformed points (note that this is just used in the derivation, and doesn’t need to be calculated explicitly)
  • Let $\pmb x _ i^\star = \phi(\pmb x _ i) - \pmb \mu$, the mean-centered transformed points (recall PCA works on mean-centered data)
  • Let $X$ be the $d \times n$ matrix with columns $\pmb x _ 1^\star, \ldots, \pmb x _ n^\star$.

To find the SVD, we find the right singular vectors $\pmb v _ i$, then use these to calculate the left singular vectors $\pmb u _ i$. To do this, we calculate the eigenvectors of $X^\top X$. We can do this with the kernel:

\[\begin{aligned} (X^\top X)_ {ij} &= (\pmb x_ i^\star)^\top (\pmb x_ j^\star) \\\\ &= (\phi(\pmb x_i) - \mu)^\top (\phi(\pmb x_j) - \mu) \\\\ &= \phi(\pmb x_i)^\top \phi(\pmb x_j) - \frac 1 n\sum^n_{k = 1} \phi(\pmb x_i)^\top \phi(\pmb x_k) - \frac 1 n\sum^n_{k = 1} \phi(\pmb x_j)^\top \phi(\pmb x_k) + \frac{1}{n^2} \sum^n_{k, l = 1} \phi(\pmb x_k)^\top \phi(\pmb x_l) \\\\ &= \kappa(\pmb x_i, \pmb x_j) - \frac 1 n\sum^n_{k = 1} \kappa(\pmb x_i, \pmb x_k) - \frac 1 n\sum^n_{k = 1} \kappa(\pmb x_j, \pmb x_k) + \frac{1}{n^2} \sum^n_{k, l = 1} \kappa(\pmb x_k, \pmb x_l) \end{aligned}\]
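
In matrix form this is the usual double-centering of the Gram matrix. A sketch, assuming a two-argument kernel function `kappa` (the names here are mine):

```python
import numpy as np

def centered_gram(points, kappa):
    """Compute X^T X, the Gram matrix of the mean-centered feature vectors,
    using only kernel evaluations (the entrywise formula above)."""
    n = len(points)
    K = np.array([[kappa(xi, xj) for xj in points] for xi in points])
    ones = np.ones((n, n)) / n
    # K - (1/n) K 1 - (1/n) 1 K + (1/n^2) 1 K 1
    return K - K @ ones - ones @ K + ones @ K @ ones
```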

Then we can calculate the top $k$ eigenvectors $\pmb v _ 1, \cdots, \pmb v _ k$ of $X^\top X$ using a standard approach. Then we can calculate the top $k$ left singular vectors via

\[\begin{aligned} \pmb u_ i &= \frac{X \pmb v_ i}{||X\pmb v_ i||} \\\\ &= \frac{1}{||X\pmb v_ i||}\sum^n_ {j = 1} (v_ i)_ j \pmb{x_ j}^\star \\\\ &= \sum^n_ {j = 1} \alpha_ {ij} \pmb x^\star_ j \end{aligned}\]

where

\[\alpha_{ij} = \frac{(v_i)_j}{||X\pmb v_i||}\]

We can calculate $ \vert \vert X \pmb v _ i \vert \vert $ via $ \vert \vert X \pmb v _ i \vert \vert^2 = (X\pmb v _ i)^\top (X \pmb v _ i) = \pmb v _ i^\top X^\top X \pmb v _ i$. This gives us a way of expressing the left singular vectors as linear combinations of the transformed data vectors $\pmb x^\star _ 1, \cdots, \pmb x^\star _ n$. Let $U _ k$ be the $d \times k$ matrix whose columns are the top $k$ left singular vectors.

Then we can use the kernel function to calculate $Z = U^\top _ k X$, since each entry is the dot product of two linear combinations of the $\pmb x^\star$ vectors, which expands into kernel evaluations.

Then

\[X_k = U_k Z\]

And hence we have $Z$, which gives the coordinates of $X _ k$ in this rank-$k$ approximation.
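
Putting the derivation together, a minimal sketch of kernel PCA that only ever touches the kernel (it reuses the hypothetical `centered_gram` from the sketch above and assumes the top $k$ eigenvalues are positive):

```python
import numpy as np

def kernel_pca(points, kappa, k):
    """Return Z (k x n): the coordinates of each point in the top-k basis
    of the feature space, computed without applying the feature map."""
    Kc = centered_gram(points, kappa)        # X^T X via kernel evaluations
    evals, evecs = np.linalg.eigh(Kc)        # symmetric, so real eigenpairs
    idx = np.argsort(evals)[::-1][:k]        # indices of the top-k eigenvalues
    lam, V = evals[idx], evecs[:, idx]       # lam_i = ||X v_i||^2
    A = (V / np.sqrt(lam)).T                 # alpha_{ij} = (v_i)_j / ||X v_i||
    Z = A @ Kc                               # Z_{il} = u_i^T x_l* = sum_j alpha_{ij} (Kc)_{jl}
    return Z
```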

What important assumption does PCA make about the input data?


The data is mean-centered.

One of the steps in PCA involves calculating the top $k$ eigenvectors $\pmb v _ 1, \cdots, \pmb v _ k$ of a matrix $X^\top X$. Why don’t we have the potential to get complex-valued eigenvalues / eigenvectors?


Because $X^\top X$ is a symmetric, real-valued matrix, so by the spectral theorem its eigenvalues are real and it has a real orthonormal basis of eigenvectors.

Suppose we want to determine a good value of $k$ to use in a low rank approximation of a matrix $X$. One approach is to calculate the $k$-rank approximation for lots of different values of $k$ and see which one is sufficient.

How can we instead express the relative error of a low-rank approximation $X _ k$ in terms of the singular values of $X$?


\[\frac{||X - X_k||^2_F}{||X||^2_F} = \frac{\sigma_{k+1}^2 + \cdots + \sigma^2_r}{\sigma^2_1 + \cdots + \sigma^2_r}\]
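
A sketch of using this to pick $k$ directly from the singular values (the tolerance is arbitrary):

```python
import numpy as np

def smallest_sufficient_k(X, tol=0.05):
    """Smallest k whose rank-k approximation has relative squared Frobenius
    error at most tol, read off from the singular values alone."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2   # sigma_1^2 >= sigma_2^2 >= ...
    rel_err = 1.0 - np.cumsum(s2) / s2.sum()       # entry k-1 is the relative error of X_k
    return int(np.argmax(rel_err <= tol)) + 1      # first k meeting the tolerance
```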

What are the two final outputs of PCA?


  • $U _ k$, which gives the basis for the $k$-rank approximation (in kernel PCA, this is implicit)
  • $Z$, which gives the coefficients that represent $X _ k$ (in kernel PCA, this is the only output)

The final two outputs of PCA are

  • $U _ k$, which gives the basis for the $k$-rank approximation (in kernel PCA, this is implicit)
  • $Z$, which gives the coefficients that represent $X _ k$ (in kernel PCA, this is the only output)

How is $Z$ calculated?


\[U_k^\top X\]

The final two outputs of PCA are

  • $U _ k$, which gives the basis for the $k$-rank approximation (in kernel PCA, this is implicit)
  • $Z$, which gives the coefficients that represent $X _ k$ (in kernel PCA, this is the only output)

$Z$ is calculated as $U _ k^\top X$. How can you then write $X _ k$ in terms of $U _ k$ and $Z$?


\[X_k = U_k Z\]

The final two outputs of PCA are

  • $U _ k$, which gives the basis for the $k$-rank approximation (in kernel PCA, this is implicit)
  • $Z$, which gives the coefficients that represent $X _ k$ (in kernel PCA, this is the only output)

$Z$ is calculated as $U _ k^\top X$. $X _ k$ can then be written

\[X_k = U_k Z\]

Quickly prove that this is indeed the same as

\[X_k = \sum^k_{i = 1} \sigma_i u_i v_i^\top\]

Expanding,

\[U_k U_k^\top X = U_k U_k^\top U \Sigma V^\top\]

Then, since the columns of $U _ k$ are orthonormal and $u _ i$ is orthogonal to all of them for $i > k$, we have $U _ k U _ k^\top u _ i = u _ i$ for $i \le k$ and $U _ k U _ k^\top u _ i = \pmb 0$ for $i > k$. Hence

\[\begin{aligned} U_k U_k^\top U \Sigma V^\top &= \sum^n_{i = 1} \sigma_i (U_k U_k^\top u_i) v_i^\top \\\\ &= \sum^k_{i = 1} \sigma_i u_i v_i^\top \\\\ &= X_k \end{aligned}\]

as required.
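
A quick numerical check of the identity (toy sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 12))
k = 3

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]
lhs = U_k @ (U_k.T @ X)                                        # U_k U_k^T X = U_k Z
rhs = sum(S[i] * np.outer(U[:, i], Vt[i]) for i in range(k))   # sum of sigma_i u_i v_i^T
print(np.allclose(lhs, rhs))  # True
```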



