Notes - Machine Learning MT23, Model selection


Flashcards

What is a hyperparameter?


A parameter used to control the learning process.

What is a grid search in the context of optimising hyperparameters?


Training a model for each point on a grid of possible hyperparameter configurations, and choosing the configuration that maximises performance on the validation set.

What is $k$-fold cross validation?


  • Divide the training data into $k$ equal “folds”
  • Train $k$ different models, each time leaving out a different fold and training on the remaining $k - 1$ folds
  • Use the left-out fold for validation
  • Benchmark model performance by averaging the validation error over all $k$ folds (a sketch follows below)
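
A minimal sketch of this procedure in Python, where fit_model(X, y) and error(model, X, y) are hypothetical placeholders for the learner being benchmarked:

import numpy as np

def k_fold_cv_error(X, y, k, fit_model, error):
    # Shuffle the indices and split them into k (roughly) equal folds.
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_model(X[train_idx], y[train_idx])          # train on the other k - 1 folds
        errors.append(error(model, X[val_idx], y[val_idx]))    # validate on the held-out fold
    return np.mean(errors)  # average validation error over the k folds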

When are techniques like $k$-fold cross validation used rather than keeping aside a validation set?


When data is scarce.

Leave-one-out cross-validation is the limiting case of what?


$k$-fold cross validation, with $k = N$ where $N$ is the number of training examples (each fold contains a single example).

What is regularisation, and what is its purpose?


Any modification we make to a learning algorithm that is intended to reduce its generalisation error (i.e. prevent overfitting) but not its training error.
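
For example (a standard illustration of this definition), ridge regression adds an $L _ 2$ penalty on the weights to the least-squares objective, with $\lambda \geq 0$ controlling the strength of the regularisation:

\[\min _ w \; \sum _ {i = 1}^{N} (y _ i - w^\top x _ i)^2 + \lambda \lVert w \rVert^2\]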

Can you give three possible uses of a validation set?


  • Model selection: e.g. deciding whether it would be better to use an SVM or logistic regression
  • Hyperparameter tuning: e.g. deciding on the number of principal components to keep in PCA
  • Early stopping: deciding when to stop training based on the error on the validation set (a sketch follows below)
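
A minimal sketch of early stopping with patience, where train_one_epoch and validation_error are hypothetical placeholders for the model being trained:

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=5):
    best_error, best_model = float("inf"), model
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        model = train_one_epoch(model)              # one pass over the training set
        val_error = validation_error(model)         # evaluate on the validation set
        if val_error < best_error:
            best_error, best_model = val_error, model
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:  # stop once validation error stalls
            break
    return best_model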

Suppose we are trying to pick the value of some hyperparameter $\lambda$ (e.g. the penalty term in ridge regression), and plot the error on the validation set versus the value of the parameter.

What is the typical shape of such graphs, why, and how should we pick the value of $\lambda$?


It is typically $U$-shaped: when $\lambda$ is small we get overfitting (the model fits the training noise), and when $\lambda$ is large we get underfitting (the model is too constrained). We should choose the value of $\lambda$ at the bottom of the $U$, i.e. the one that minimises the validation error.
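
A minimal numpy sketch of this procedure for the ridge regression example, using synthetic data and a single train/validation split purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(50, 10)), rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y_train = X_train @ w_true + 0.5 * rng.normal(size=50)
y_val = X_val @ w_true + 0.5 * rng.normal(size=50)

def fit_ridge(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = np.logspace(-4, 4, 20)
val_errors = [np.mean((X_val @ fit_ridge(X_train, y_train, lam) - y_val) ** 2)
              for lam in lambdas]
best_lambda = lambdas[int(np.argmin(val_errors))]   # the bottom of the U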

Suppose we are trying to pick the values of a set of hyperparameters $\lambda _ 1, \ldots, \lambda _ n$. What is grid search in this context?


Form a grid from all combinations of candidate values for each hyperparameter, train a model for every combination, and pick the combination that gives the lowest validation error.
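
A minimal sketch of grid search in Python, where grid, fit_model and validation_error are hypothetical placeholders:

from itertools import product

def grid_search(fit_model, validation_error, grid):
    # grid maps hyperparameter names to lists of candidate values,
    # e.g. {"lam": [0.01, 0.1, 1.0], "degree": [1, 2, 3]}.
    best_config, best_error = None, float("inf")
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        model = fit_model(**config)          # train a model with this configuration
        err = validation_error(model)        # evaluate it on the validation set
        if err < best_error:
            best_config, best_error = config, err
    return best_config, best_error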

Why is having a small training set with a large number of features typically not a good thing?


There is likely to be overfitting, as the learning algorithm ends up fitting the noise in the training data.

Suppose we have a small training set with a large number $n$ of features. This situation is problematic because the algorithm is likely to overfit by fitting the noise in the training data.

Hence we typically apply some form of “feature selection” in order to find a subset of the features that gives the smallest validation error. In this context, can you state the forward search algorithm?


F = {}

while F != {1, ..., n}:  # i.e. while we haven't greedily added every feature
	for each i in {1, ..., n} \ F:
		F_i = F ∪ {i}
		E_{F_i} = error on the validation set when training with only the features in F_i

	F = F_i for the i for which E_{F_i} is minimal

return the F (from any iteration) with the smallest validation error

i.e. the greedy strategy uses $O(n^2)$ calls to the learning algorithm, rather than the $O(2^n)$ calls needed to evaluate every subset of features
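
A minimal runnable version of the above in Python, again with hypothetical fit_model and validation_error placeholders (fit_model is assumed to train using only the given feature indices):

def forward_search(n_features, fit_model, validation_error):
    # Greedy forward selection over feature indices {0, ..., n_features - 1}.
    selected = set()
    best_subset, best_error = set(), float("inf")
    while selected != set(range(n_features)):
        # Evaluate adding each remaining feature to the current subset.
        trial_errors = {}
        for i in set(range(n_features)) - selected:
            trial = selected | {i}
            model = fit_model(sorted(trial))            # train using only these features
            trial_errors[frozenset(trial)] = validation_error(model, sorted(trial))
        # Greedily keep the addition with the lowest validation error.
        best_trial = min(trial_errors, key=trial_errors.get)
        selected = set(best_trial)
        if trial_errors[best_trial] < best_error:
            best_subset, best_error = selected, trial_errors[best_trial]
    return sorted(best_subset)  # the subset with the smallest validation error seen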

Suppose $X$ and $Y$ are random variables. How is the mutual information between $X$ and $Y$ defined?


\[I(X, Y) = \sum_x \sum_y p(X = x, Y = y) \cdot \log \frac{p(X = x, Y=y)}{p(X = x) \cdot p(Y = y)}\]
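
A minimal numpy sketch of the plug-in estimate of this quantity from paired samples of two discrete random variables:

import numpy as np

def mutual_information(x, y):
    # Plug-in estimate of I(X, Y) for discrete samples, in nats.
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))            # empirical joint probability
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)    # empirical marginals
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi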

Suppose we have a small training set with a large number $n$ of features. This situation is problematic because the algorithm is likely to overfit by fitting the noise in the training data.

Given that the mutual information between random variables $X$ and $Y$ is defined by

\[I(X, Y) = \sum_x \sum_y p(X = x, Y = y) \cdot \log \frac{p(X = x, Y=y)}{p(X = x) \cdot p(Y = y)}\]

can you define the idea of “filter feature selection”?


Compute $I(x _ i, y)$ empirically for each feature $x _ i$ (its mutual information with the label $y$), and then keep the top $k$ features by this score.
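
A minimal sketch of this for discrete-valued features, using scikit-learn's mutual_info_score as the empirical estimate of $I(x _ i, y)$:

import numpy as np
from sklearn.metrics import mutual_info_score

def filter_select(X, y, k):
    # Score each (discrete-valued) feature column by its mutual information with the labels,
    # then return the indices of the k highest-scoring features.
    scores = np.array([mutual_info_score(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]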

Proofs




Related posts