Notes - Machine Learning MT23, Recurrent neural networks
Outstanding Questions
- The parameters are shared across every unrolled copy of an RNN. Why does running gradient descent on each of these copies produce updates that work well in general?
Flashcards
What is “sequential data” $\pmb x$, and what neural network architecture is often used to model it?
Data where each input $\pmb x _ i$ depends on the inputs that came before it, $\pmb x _ {<i}$; recurrent architectures are often used to model it.
Suppose we have a language model that assigns a probability to a sequence of words $\pmb w = (w _ 1, \cdots, w _ N)$. What typical “factorisation” of $p(\pmb w)$ is used to simplify the task?
What is the main task of a language model?
To model a probability distribution $p(\pmb w)$ over sequences of words (called utterances) given by $\pmb w = (w _ 1, \cdots, w _ N)$.
The main task of a language model is to describe a probability distribution $p(\pmb w)$ over sequences of words (called utterances) given by $\pmb w = (w _ 1, \cdots, w _ N)$. Say these words come from an alphabet $\Sigma$. What condition do we have on such a distribution?
The probabilities must be non-negative and sum to one over all possible utterances, i.e. $\sum _ {\pmb w \in \Sigma^\ast} p(\pmb w) = 1$.
What is the intuitive interpretation of the cross entropy of an utterance $w^N _ 1 = (\pmb w _ 1, \cdots, \pmb w _ N)$, given by
\[H(w^N_1) = -\frac{1}{N} \sum^N_{i=1} \log_2 p(\pmb w_i \mid \pmb w_{<i})\]
and what is the perplexity given this definition (here each $\pmb w _ i$ is a one-hot encoding of a word)?
The average number of bits per word needed to encode the utterance under the model, and perplexity is given by
\[\text{perplexity}(w^N_1) = 2^{H(w^N_1)}\]
A typical “factorisation” of $p(\pmb w)$ used to simplify the task of language models is
\[p(w_1, \cdots, w_N) = p(w_1)p(w_2 \mid w_1)p(w_3 \mid w_1, w_2) \cdots p(w_N \mid w_1, w_2, \cdots, w_{N-1})\]
What happens if we make the Markov assumption for a “unigram” model, and for a “bigram” model?
Unigram:
\[p(w_1, \cdots, w_N) \approx p(w_1)p(w_2)p(w_3) \cdots p(w_N)\]
Bigram:
\[p(w_1, \cdots, w_N) \approx p(w_1)p(w_2 \mid w_1)p(w_3 \mid w_2) \cdots p(w_N \mid w_{N-1})\]
Suppose we are making a trigram language model so that the probability $p(\pmb w)$ of an utterance is approximated as
\[p(w_1, \cdots, w_N) \approx p(w_1)p(w_2 \mid w_1)p(w_3 \mid w_1, w_2) \cdots p(w_N \mid w_{N-2}, w_{N-1})\]
What would the simplest approach, using an MLE multivariate distribution, model
\[p(w_3 \mid w_1, w_2)\]
as, and what is a disadvantage of this approach?
It would model it as the empirical conditional frequency, $\hat p(w_3 \mid w_1, w_2) = \frac{\operatorname{count}(w_1, w_2, w_3)}{\operatorname{count}(w_1, w_2)}$. The disadvantage is that any trigram unseen in the training data gets probability zero, so most utterances end up being assigned zero probability with such a model.
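As a concrete illustration, here is a minimal Python sketch of the counting approach (the toy corpus and names like `p_mle` are purely illustrative):

```python
# Maximum-likelihood trigram estimates from raw counts:
# p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2).
from collections import defaultdict

corpus = "the cat sat on the mat the cat sat on the hat".split()

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2, w3)] += 1
    bigram_counts[(w1, w2)] += 1

def p_mle(w3, w1, w2):
    """MLE estimate of p(w3 | w1, w2); zero for any unseen trigram."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(p_mle("sat", "the", "cat"))  # seen trigram, positive probability
print(p_mle("ran", "the", "cat"))  # unseen trigram: 0.0, the zero-probability problem
```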
Draw a picture representing a neural trigram network with one hidden layer and then state the forward equations assuming we use softmax as the final activation.
- 4 nodes
- Two inputs, $\pmb w _ 1$ and $\pmb w _ 2$
- One output, $\hat{\pmb p} _ 3$, probability distribution over words in the vocabulary
- One hidden layer $\pmb h$ in the middle
Then the forward equations are something like
\[\pmb h = \text{activation}(V[\pmb w_1 ; \pmb w_2] + \pmb c)\]
\[\hat{\pmb p}_3 = \text{softmax}(W\pmb h + \pmb b)\]
Suppose we have training data $\pmb w = (\pmb w _ 1, \cdots, \pmb w _ N)$, i.e. a long utterance where each $\pmb w _ i$ is a one-hot encoding of a word. What does the usual training objective look like for a simple language model?
The cross entropy loss of the data given the model,
\[-\frac{1}{N} \sum^N_{i=1} \pmb w_i^\intercal \log (\hat{\pmb p}_i)\](here $\hat{\pmb p} _ i$ does not just depend on $\pmb w _ i$, but on its position in the utterance).
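A minimal numpy sketch of the trigram forward pass and this objective (a sketch only: the tanh activation, the shapes and the random toy data are assumptions for illustration, while $V, W, \pmb b, \pmb c$ follow the notation above):

```python
# Neural trigram LM: h = tanh(V [w1; w2] + c), p_hat = softmax(W h + b),
# trained with the cross-entropy objective F = -(1/N) sum_i w_i^T log p_hat_i.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 8

V = rng.normal(scale=0.1, size=(hidden_size, 2 * vocab_size))  # input -> hidden
c = np.zeros(hidden_size)
W = rng.normal(scale=0.1, size=(vocab_size, hidden_size))      # hidden -> output
b = np.zeros(vocab_size)

def one_hot(i):
    v = np.zeros(vocab_size)
    v[i] = 1.0
    return v

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def predict(w1, w2):
    """p_hat = softmax(W h + b) with h = tanh(V [w1; w2] + c)."""
    h = np.tanh(V @ np.concatenate([w1, w2]) + c)
    return softmax(W @ h + b)

# Cross-entropy objective over a toy "utterance" of word indices.
words = [one_hot(i) for i in [1, 4, 2, 7, 3]]
costs = []
for i in range(2, len(words)):
    p_hat = predict(words[i - 2], words[i - 1])
    costs.append(words[i] @ np.log(p_hat))  # cost(w_i, p_hat_i) = w_i^T log p_hat_i
F = -np.mean(costs)                         # F = -(1/N) sum_i cost_i
print(F)
```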
Suppose we have the following trigram network:
Suppose we have training data $\pmb w = (\pmb w _ 1, \cdots, \pmb w _ N)$ i.e. a long utterance where each $\pmb w _ i$ is a one-hot encoding of a word. The standard training objective is the cross-entropy of the data given the model, equal to
\[\mathcal F = -\frac{1}{N} \sum^N_{i=1} \text{cost}(\pmb w_i, \hat{\pmb p}_i)\]
where
\[\text{cost}(\pmb a, \pmb b) = \pmb a^\top \log(\pmb b)\]
Quickly derive an expression for the gradients
\[\frac{\partial \mathcal F}{\partial W}, \frac{\partial \mathcal F}{\partial V}\]
(since we’re considering the training data $\pmb w _ i$ fixed).
\[\begin{aligned} \frac{\partial \mathcal F}{\partial V} &= \frac{\partial}{\partial V} \left( - \frac 1 N \sum^N_ {i=1} \text{cost}_ i(\pmb w_ i, \hat{\pmb p}_ i) \right) \\\\ &= -\frac 1 N \sum^N_ {i = 1} \frac{\partial \text{cost}_ i(\pmb w_ i, \hat{\pmb p}_ i)}{\partial V} \\\\ &= -\frac 1 N \sum^N_ {i = 1} \frac{\partial \text{cost}_ i}{\partial \hat{\pmb p}_ i} \frac{\partial \hat{\pmb p}_ i}{\partial V} \\\\ &= -\frac 1 N \sum^N_ {i = 1} \frac{\partial \text{cost}_ i}{\partial \hat{\pmb p}_ i} \frac{\partial \hat{\pmb p}_ i}{\partial \pmb h_ i} \frac{\partial \pmb{h}_ i}{\partial V} \end{aligned}\]
The derivation for $\frac{\partial \mathcal F}{\partial W}$ is analogous but stops one step earlier, since $W$ only enters through the output layer: $\frac{\partial \mathcal F}{\partial W} = -\frac 1 N \sum^N _ {i=1} \frac{\partial \text{cost} _ i}{\partial \hat{\pmb p} _ i} \frac{\partial \hat{\pmb p} _ i}{\partial W}$.
Suppose we are using the following recurrent neural network for language modelling:
and have training data $\pmb w = (\pmb w _ 1, \cdots, \pmb w _ N)$. This gives us the unrolled computational graph
(in the above image, $N = 4$).
The BPTT algorithm applies backpropagation to this graph in order to update the model parameters $V, W, b, c$. Quickly derive a heuristic for when the gradient information linking the effect of $w _ 0$ on $\text{cost} _ N$ will propagate, i.e. when
\[\frac{\partial \text{cost}_N}{\partial h_1}\]
does not explode and does not vanish.
We have
\[\begin{aligned} \frac{\partial \text{cost}_ N}{\partial h_ 1} = \frac{\partial \text{cost}_ N}{\partial \hat{\pmb p}_ N} \frac{\partial \pmb{\hat p}_ N}{\partial h_ N} \left( \prod_ {n \in \\{N, \cdots, 2\\} } \frac{\partial h_ n}{\partial h_ {n-1} } \right) \end{aligned}\](For the sake of analysis, all we care about is this repeated product of partial derivatives of hidden layers. Why then do we need to include the $\frac{\partial \text{cost} _ N}{\partial \hat{\pmb p} _ N} \frac{\partial \pmb{\hat p} _ N}{\partial h _ N}$ term, rather than just $\frac{\partial \text{cost} _ N}{\partial h _ N}$? I’m not 100% sure, but the former is the actual expression that would be used in the backpropagation algorithm, since the latter “jumps” in the computational graph). Rewriting
\[h_ n = g(V[x_ n ; h_ {n-1}] + c)\]as
\[h_ n = g(V_ x x_ n + V_ h h_ {n-1} + c)\]and defining $z _ n = V _ x x _ n + V _ h h _ {n-1} + c$, we have
\[\frac{\partial h_ n}{\partial h_ {n-1} } = \frac{\partial h_ n}{\partial z_ n} \frac{\partial z_ n}{\partial h_ {n-1} }\]And
\[\frac{\partial h_ n}{\partial z_ n } = \text{diag}(g'(z_ n))\](since $g$ is applied componentwise). Then
\[\frac{\partial h_ n}{\partial h_ {n-1} } = \frac{\partial h_ n}{\partial z_ n} \frac{\partial z_ n}{\partial h_ {n-1} } = \text{diag}(g'(z_ n)) V_ h\]so
\[\begin{aligned} \frac{\partial \text{cost}_ N}{\partial h_ 1} &= \frac{\partial \text{cost}_ N}{\partial \hat{\pmb p}_ N} \frac{\partial \pmb{\hat p}_ N}{\partial h_ N} \left( \prod_ {n \in \\{N, \cdots, 2\\} } \frac{\partial h_ n}{\partial h_ {n-1} } \right) \\\\ &= \frac{\partial \text{cost}_ N}{\partial \hat{\pmb p}_ N} \frac{\partial \pmb{\hat p}_ N}{\partial h_ N} \left( \prod_ {n \in \\{N, \cdots, 2\\} } \text{diag}(g'(z_ n)) V_ h \right) \end{aligned}\]Hence we need the repeated multiplication by $V _ h$ to neither explode nor vanish, which happens (heuristically) when the largest eigenvalue of $V _ h$ is close to $1$: if it is smaller the product tends to vanish, and if it is larger it tends to explode.
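A small numerical check of this heuristic (a sketch: $g = \tanh$, the $z _ n$ are drawn small so that $g'(z _ n) \approx 1$, and rescaling $V _ h$ to a chosen spectral radius is just a device for the experiment):

```python
# Multiply 50 Jacobians diag(g'(z_n)) V_h together and watch the norm: it tends
# to vanish when the largest eigenvalue of V_h is below 1, stay moderate near 1,
# and explode above 1.
import numpy as np

rng = np.random.default_rng(0)
d, steps = 20, 50

def product_norm(radius):
    """Rescale a random V_h to the given spectral radius, then form the product."""
    V_h = rng.normal(size=(d, d))
    V_h *= radius / np.max(np.abs(np.linalg.eigvals(V_h)))
    J = np.eye(d)
    for _ in range(steps):
        z = 0.1 * rng.normal(size=d)               # small z, so tanh'(z) is near 1
        J = np.diag(1 - np.tanh(z) ** 2) @ V_h @ J
    return np.linalg.norm(J)

for radius in [0.5, 1.0, 2.0]:
    print(radius, product_norm(radius))
```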
What’s the main disadvantage of language models that use the Markov assumption?
They can’t understand long-range dependencies in data.
What is the idea of the “backpropagation through time” and “truncated backpropagation through time” algorithms for training RNNs given training data, and what is the time complexity of each?
- Backpropagation through time: for every sequence in the training data, unroll the RNN and use backpropagation through the unrolled graph to update the weights. The cost per update is linear in the length of the longest sequence.
- Truncated backpropagation through time: split the data into fixed chunks of size $k$ and backpropagate only within each chunk. We can still do an exact forward pass by initialising the hidden state of each chunk with the last hidden state of the previous chunk. The cost per update depends only on the truncation length $k$, so it is constant in the full sequence length (see the sketch below).
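A sketch of the data flow in truncated BPTT (forward pass only, in numpy; gradients would be computed per chunk by an autodiff framework, and all names and sizes here are illustrative):

```python
# Split a long sequence into chunks of size k, carrying the hidden state across
# chunk boundaries so the forward pass is still exact. Backpropagation (not
# shown) would only run within each chunk.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T, k = 5, 8, 23, 4

V = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
c = np.zeros(d_h)
xs = rng.normal(size=(T, d_in))    # one long input sequence

h = np.zeros(d_h)                  # carried across chunk boundaries
for start in range(0, T, k):
    # In truncated BPTT, h would be treated as a constant ("detached") input to
    # this chunk; backprop would then only cover these (at most) k steps.
    for x in xs[start:start + k]:
        h = np.tanh(V @ np.concatenate([x, h]) + c)
    print(f"chunk starting at t={start}: ||h|| = {np.linalg.norm(h):.3f}")
```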
Why are recurrent neural networks typically difficult to train?
When unrolled they become very deep networks, so they suffer from problems like poor gradient propagation (vanishing or exploding gradients).
In a language model, representing
\[p(w_N \mid w_1, w_2, \cdots, w_{N-1})\]
with a network that had $N$ inputs would seemingly require very large networks (or would it?). How do RNNs “compress” the history $w _ 1, w _ 2, \cdots, w _ {N-1}$ in a way that keeps the network simple?
They reuse the hidden layer computed when modelling $p(w _ {N-1} \mid w _ 1, w _ 2, \cdots, w _ {N-2})$ as a fixed-size approximation of the whole history.
Draw a picture that represents the equation
\[\pmb h_n = g(V[\pmb x_n ; \pmb h_{n-1}] + c)\]
in the context of an RNN where we’ve unrolled the network to sequence length $3$.
You’re on your own here, maybe check out the lecture notes for lecture 17.
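In lieu of the picture, here is a small numpy sketch of the recurrence unrolled to sequence length $3$ (a sketch with $g = \tanh$ and illustrative shapes; note that the same $V$ and $\pmb c$ are reused at every step):

```python
# h_n = g(V [x_n ; h_{n-1}] + c), unrolled for n = 1, 2, 3.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
V = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
c = np.zeros(d_h)

x1, x2, x3 = rng.normal(size=(3, d_in))
h0 = np.zeros(d_h)                              # initial hidden state

h1 = np.tanh(V @ np.concatenate([x1, h0]) + c)  # same V, c at every step
h2 = np.tanh(V @ np.concatenate([x2, h1]) + c)
h3 = np.tanh(V @ np.concatenate([x3, h2]) + c)
print(h3.shape)  # (6,)
```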
Consider the unrolled computational graph for an RNN for language modelling.
Can you state how the value of
\[\frac{\partial \mathcal F}{\partial \mathbf h_2}\]
would be calculated, and also state the general recursion used for calculating derivatives in reverse-mode differentiation like this?
In general, if we want to find the derivative of $\mathcal F$ with respect to a variable $v _ i$, we have the recurrence
\[\frac{\partial \mathcal F}{\partial v_i} = \sum_{j : \text{ child of }v_i \text{ in comp. graph} } \frac{\partial \mathcal F}{\partial v_j} \frac{\partial v_j}{\partial v_i}\]where each $\partial \mathcal F / \partial v _ j$ is computed by the same recursion, and each local derivative $\partial v _ j / \partial v _ i$ can be read off the graph directly.
(this is a “generalisation” of the chain rule)
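For the first part of the question, assuming the usual unrolled graph in which $\mathbf h _ 2$ has exactly two children, $\hat{\pmb p} _ 2$ and $\mathbf h _ 3$, one application of the recursion gives
\[\frac{\partial \mathcal F}{\partial \mathbf h_2} = \frac{\partial \mathcal F}{\partial \hat{\pmb p}_2} \frac{\partial \hat{\pmb p}_2}{\partial \mathbf h_2} + \frac{\partial \mathcal F}{\partial \mathbf h_3} \frac{\partial \mathbf h_3}{\partial \mathbf h_2}\]
where $\frac{\partial \mathcal F}{\partial \mathbf h _ 3}$ has already been computed by the same recursion when working backwards through the graph, and the remaining factors are local derivatives.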
Suppose we are training an RNN.
Unrolling the computational graph, we get the expression $\mathcal F = \sum _ n \mathcal F _ n$ for the loss. The goal of BPTT is to compute $\nabla _ \theta \mathcal F$, where $\theta$ are the parameters of the RNN. Letting $\theta _ n$ be the copy of the RNN parameters at timestep $n$ in the computational graph, $\nabla _ \theta \mathcal F$ is the sum of the gradients with respect to each copy $\theta _ n$.
What is the general expression for how the gradients are calculated in BPTT?
We use the recurrence
\[\nabla_\theta \mathcal F = \sum^N_{n = 1} \frac{\partial \mathcal F}{\partial \mathbf h_n} \frac{\partial \mathbf h_n}{\partial \theta_n} = \sum^N_{n = 1} \left[ \frac{\partial \mathcal F}{\partial \mathbf h_{n+1} } \frac{\partial \mathbf h_{n+1} }{\partial \mathbf h_n} + \frac{\partial \mathcal F_n}{\partial \mathbf h_n} \right] \frac{\partial \mathbf h_n}{\partial \theta_n}\]
(with $\frac{\partial \mathcal F}{\partial \mathbf h _ {N+1}}$ taken to be zero, so the terms $\frac{\partial \mathcal F}{\partial \mathbf h _ n}$ are accumulated backwards through the sequence).
Suppose we are training an RNN.
Unrolling the computational graph, we get the expression $\mathcal F = \sum _ n \mathcal F _ n$ for the loss. The goal, as in BPTT, is to compute $\nabla _ \theta \mathcal F$, where $\theta$ are the parameters of the RNN. Letting $\theta _ n$ be the copy of the RNN parameters at timestep $n$ in the computational graph, $\nabla _ \theta \mathcal F$ is the sum of the gradients with respect to each copy $\theta _ n$.
What is the general expression for how the gradients are calculated in RTRL (real-time recurrent learning), where we use the forward mode of automatic differentiation rather than the backward mode?
We use the recurrence
\[\nabla_\theta \mathcal F = \sum^N_{n = 1} \frac{\partial \mathcal F_n}{\partial \pmb h_n} \frac{\partial \pmb h_n}{\partial \theta} = \sum^N_{n = 1} \frac{\partial \mathcal F_n}{\partial \pmb h_n} \left[ \frac{\partial \pmb h_n}{\partial \theta_n } + \frac{\partial \pmb h_n }{\partial \mathbf h_{n-1} }\frac{\partial \mathbf h_{n-1} }{\partial \theta} \right]\]
(here $\frac{\partial \pmb h _ n}{\partial \theta}$ denotes the total derivative, which is accumulated forwards through the sequence as it is processed).
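A toy sketch of this forward accumulation (assumptions for brevity: a $1$-dimensional RNN and a squared loss instead of the language-model softmax); the online gradient matches a finite-difference check:

```python
# RTRL in a scalar RNN h_n = tanh(v_x x_n + v_h h_{n-1} + c) with loss
# F = sum_n 0.5 (h_n - y_n)^2: carry s_n = dh_n/dv_h forwards and accumulate
# the gradient online, then compare against central finite differences.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=10)
ys = rng.normal(size=10)
v_x, v_h, c = 0.5, 0.8, 0.1

def loss(v_h):
    h, F = 0.0, 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(v_x * x + v_h * h + c)
        F += 0.5 * (h - y) ** 2
    return F

h, s, grad = 0.0, 0.0, 0.0
for x, y in zip(xs, ys):
    h_new = np.tanh(v_x * x + v_h * h + c)
    s = (1 - h_new ** 2) * (h + v_h * s)   # dh_n/dv_h, using the previous h and s
    grad += (h_new - y) * s                # dF_n/dh_n * dh_n/dv_h
    h = h_new

eps = 1e-6
print(grad, (loss(v_h + eps) - loss(v_h - eps)) / (2 * eps))  # these should agree
```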