# Notes - Machine Learning MT23, Neural networks

### Flashcards

What are the components of an artificial neuron / unit, and how are these components combined to find the output given an input $\pmb x \in \mathbb R^D$?

- Activation function: $f : \mathbb R \to \mathbb R$
- Weights: $\pmb w \in \mathbb R^D$
- Bias: $b \in \mathbb R$

Then output is given by

\[f(b + \pmb w^T \pmb x)\]

Suppose we have the $i$-th artificial neuron in layer $l$ with

- Activation function: $f : \mathbb R \to \mathbb R$
- Weights: $\pmb w _ i^l = (w _ {i1}^l, w _ {i2}^l, \ldots, w _ {in _ {l-1}\,}^l) \in \mathbb R^{n _ {l-1}\,}$
- Bias: $b^l _ i \in \mathbb R$
- Number of units in previous layer: $n _ {l-1}$

Can you define the pre-activation $z _ i^l$ and the activation $a _ i^l$, given the activations in the previous layer $\pmb a^{l-1}$, and draw a picture to represent this?

\[z _ i^l = b _ i^l + (\pmb w _ i^l)^\top \pmb a^{l-1}, \qquad a _ i^l = f(z _ i^l)\]

The picture should be a directed graph: an edge into unit $i$ from each unit in layer $l-1$, labelled by the corresponding weight.

What is a multilayer perceptron, and why is this name misleading?

Any neural network with more than one layer. The name is misleading because the units do not have to be perceptrons (they typically use smooth activation functions rather than a hard threshold).

Suppose we have a 3 layer neural network. If

- $\pmb z^l$ denotes the pre-activations for each unit in layer $l$
- $\pmb a^l$ denotes the activations for each unit in layer $l$
- $n^l$ denotes the number of units in layer $l$
- $\pmb W^l$ is the $n _ l \times n _ {l-1}$ matrix with rows consisting of weights of for each unit
- $\tanh$ is the activation function

Can you give a formula for $\pmb y$, the final output of the neural network, given input $\pmb x$ (i.e. the “forward equations” in this case)?

\[\pmb y = \tanh(\pmb W^3 \tanh(\pmb W^2 \tanh(\pmb W^1 \pmb x)))\]

(taking $\pmb a^0 = \pmb x$ and applying $\tanh$ at every layer; no biases are specified here).
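This forward computation can be sketched in NumPy; the layer sizes, the small random weight scale, and applying $\tanh$ at every layer are all illustrative assumptions:

```python
import numpy as np

def forward(x, params, f=np.tanh):
    """Forward pass a^l = f(W^l a^{l-1} + b^l) through each layer.
    `params` is a list of (W, b) pairs, one per layer."""
    a = x
    for W, b in params:
        a = f(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # illustrative: input dim 4, two hidden layers, output dim 2
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = forward(rng.standard_normal(4), params)
```

Each `(W, b)` pair plays the role of $(\pmb W^l, \pmb b^l)$, and the loop applies the two forward equations once per layer.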

Suppose:

- $\pmb z^l$ denotes the pre-activation at layer $l$
- $\pmb W^l$ denotes the weights at layer $l$
- $\pmb b^l$ denotes the bias at layer $l$
- $f$ denotes the activation function

Can you state the forward equations in order?

\[\begin{aligned}
&\pmb z^l = \pmb W^l \pmb a^{l-1} + \pmb b^l \\\\
&\pmb a^l = f(\pmb z^l)
\end{aligned}\]

for $l = 1, \ldots, L$, with $\pmb a^0 = \pmb x$.

Suppose:

- $\pmb z^l$ denotes the pre-activation at layer $l$
- $\pmb W^l$ denotes the weights at layer $l$
- $\pmb b^l$ denotes the bias at layer $l$
- $f$ denotes the activation function
- $\ell (\pmb x _ i, y _ i)$ denotes the loss function

Furthermore, we have the forward equations

\[\begin{aligned}
&\pmb z^l = \pmb W^l \pmb a^{l-1} + \pmb b^l \\\\
&\pmb a^l = f(\pmb z^l)
\end{aligned}\]
What is the goal of the 4 backward equations, i.e. what do they help you calculate, what are they in this case, and why is each one useful?

We want to calculate

\[\frac{\partial \ell}{\partial \pmb W^l}, \frac{\partial \ell}{\partial \pmb b^l}\]

for each $l$. In this context, the first two equations relate these partial derivatives to $\frac{\partial \ell}{\partial \pmb z^l}$, which it turns out we can compute efficiently.

\[\begin{aligned} \frac{\partial \ell}{\partial \pmb W^l} &= \left(\pmb a^{l-1} \frac{\partial \ell}{\partial \pmb z^l}\right)^\top \\\\ \frac{\partial \ell}{\partial \pmb b^l} &= \frac{\partial \ell}{\partial \pmb z^l} \end{aligned}\]

The next two provide such a procedure for calculating $\frac{\partial \ell}{\partial \pmb z^l}$:

\[\frac{\partial \ell}{\partial \pmb z^L} = \frac{\partial \ell}{\partial \pmb a^L} \frac{\partial \pmb a^L}{\partial \pmb z^L}\]

Note that this one is at the output layer, saying that we can determine the derivative $\frac{\partial \ell}{\partial \pmb z^L}$ “easily”, since $\ell$ is in terms of $\pmb a^L$ and $\frac{\partial \pmb a^L}{\partial \pmb z^L}$ is just the derivative of the activation function. Finally,

\[\frac{\partial \ell}{\partial \pmb z^l} = \frac{\partial \ell}{\partial \pmb z^{l+1}\\,} \pmb W^{l+1} \frac{\partial \pmb a^l\\,}{\partial \pmb z^l}\]

(again note that the last term here is just the derivative of the activation function).
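The four backward equations can be sketched in NumPy for the special case $f = \tanh$ and squared-error loss $\ell = \frac 1 2 \lVert \pmb a^L - \pmb y \rVert^2$ (a sketch under those assumptions; $\frac{\partial \ell}{\partial \pmb z^l}$ is stored as a 1-D array `delta`):

```python
import numpy as np

def backprop(x, y, params):
    """Gradients dl/dW^l and dl/db^l via the four backward equations,
    assuming f = tanh and loss l = 0.5 * ||a^L - y||^2."""
    # Forward pass, caching a^0, ..., a^L (with a^0 = x).
    activations = [x]
    for W, b in params:
        activations.append(np.tanh(W @ activations[-1] + b))
    # Equation (3) at the output layer: dl/da^L = (a^L - y),
    # and tanh'(z^L) = 1 - (a^L)^2, so delta^L = (a^L - y) * (1 - (a^L)^2).
    delta = (activations[-1] - y) * (1 - activations[-1] ** 2)
    grads = []
    for l in reversed(range(len(params))):
        # Equations (1) and (2): dl/dW^l has (i, j) entry delta_i * a^{l-1}_j
        # (an outer product), and dl/db^l = delta^l.
        grads.append((np.outer(delta, activations[l]), delta))
        if l > 0:
            # Equation (4): delta^l = (delta^{l+1} W^{l+1}) ⊙ tanh'(z^l).
            W = params[l][0]
            delta = (W.T @ delta) * (1 - activations[l] ** 2)
    return list(reversed(grads))
```

Because `delta` is 1-D, the row-vector product $\frac{\partial \ell}{\partial \pmb z^{l+1}} \pmb W^{l+1}$ appears as `W.T @ delta`.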

Suppose:

- $\pmb z^l$ denotes the pre-activation at layer $l$
- $\pmb W^l$ denotes the weights at layer $l$
- $\pmb b^l$ denotes the bias at layer $l$
- $f$ denotes the activation function
- $\ell (\pmb x _ i, y _ i)$ denotes the loss function

Furthermore, we have the forward equations

\[\begin{aligned}
&\pmb z^l = \pmb W^l \pmb a^{l-1} + \pmb b^l \\\\
&\pmb a^l = f(\pmb z^l)
\end{aligned}\]
Quickly prove that the backward equations are given by

\[\begin{aligned}
\frac{\partial \ell}{\partial \pmb W^l} &= \left(\pmb a^{l-1} \frac{\partial \ell}{\partial \pmb z^l}\right)^\top \quad (1) \\\\
\frac{\partial \ell}{\partial \pmb b^l} &= \frac{\partial \ell}{\partial \pmb z^l} \quad (2)\\\\
\frac{\partial \ell}{\partial \pmb z^L} &= \frac{\partial \ell}{\partial \pmb a^L} \frac{\partial \pmb a^L}{\partial \pmb z^L} \quad (3) \\\\
\frac{\partial \ell}{\partial \pmb z^l} &= \frac{\partial \ell}{\partial \pmb z^{l+1}\\,} \pmb W^{l+1} \frac{\partial \pmb a^l\\,}{\partial \pmb z^l} \quad (4)
\end{aligned}\]

The first two follow from

\[\pmb z^l = \pmb W^l \pmb a^{l-1} + \pmb b^l\]

For the first, it is clearer if we first break this equation down component-wise:

\[z_i^l = \sum_j W_{ij}^l a_j^{l-1} + b_i^l\]

Hence

\[\begin{aligned} \frac{\partial \ell}{\partial W_{ij}^l} &= \frac{\partial \ell}{\partial z^l_i} \frac{\partial z^l_i}{\partial W_{ij}^l} \\\\ &= \frac{\partial \ell}{\partial z_i^l} a_j^{l-1} \end{aligned}\]

This is a scalar – $\frac{\partial \ell}{\partial \pmb W^l}$ is the matrix whose $(i, j)$-th entry is this expression. This can be done with the outer product, but we need to take the transpose so that everything is the right way around:

\[\frac{\partial \ell}{\partial \pmb W^l} = \left(\pmb a^{l-1} \frac{\partial \ell}{\partial \pmb z^l}\right)^\top\]

(Note that $\pmb a^{l-1}$ is a column vector and $\frac{\partial \ell}{\partial \pmb z^l}$ is a row vector). The second is much simpler:

\[\frac{\partial \ell}{\partial \pmb b^l} = \frac{\partial \ell}{\partial \pmb z^l} \frac{\partial \pmb z^l}{\partial \pmb b^l} = \frac{\partial \ell}{\partial \pmb z^l} \cdot 1\]

Number three is just the chain rule. Number four is the chain rule twice:

\[\frac{\partial \ell}{\partial \pmb z^l} = \frac{\partial \ell}{\partial \pmb z^{l+1}\\,} \frac{\partial \pmb z^{l+1}\\,}{\partial \pmb z^l} = \frac{\partial \ell}{\partial \pmb z^{l+1}\\,} \frac{\partial \pmb z^{l+1}\\,}{\partial \pmb a^l} \frac{\partial \pmb a^l}{\partial \pmb z^l} = \frac{\partial \ell}{\partial \pmb z^{l+1}\\,} \pmb W^{l+1} \frac{\partial \pmb a^l}{\partial \pmb z^l}\]

Can you draw a picture showing what it means for the training of a neural network to be broken down into a “forward pass” and a “backward pass”?

An input $\pmb x$ passes forward through the layers until the loss is evaluated; then the derivatives of the loss with respect to each activation pass backwards through the layers.

What is the problem of saturation when training neural networks?

Saturation occurs when the pre-activation lies in a region of the activation function's domain where the function is very flat, so the gradient there is very small and learning stalls.

Given that the derivative of loss in a fully connected network is proportional to the product of derivatives of each of the pre-activations and also proportional to the product of the weights at each layer, can you explain the problem of exploding or vanishing gradient when training neural networks?

If the derivatives of the activations (or the weights) are small, the product shrinks exponentially with depth, so the gradient of the loss vanishes. If the weights are large, the product grows exponentially, so the gradient explodes.
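A toy numeric illustration of this: the $0.25$ factor below is the maximum of the sigmoid's derivative, while the depth of $50$ and the weight magnitude of $2$ are illustrative assumptions.

```python
# The gradient of the loss at layer l contains one factor of f'(z) and one
# factor of W per subsequent layer, so small/large factors compound:
sigmoid_grad_max = 0.25          # sigmoid'(z) = sigma(z) * (1 - sigma(z)) <= 1/4
vanish = sigmoid_grad_max ** 50  # 50 layers of saturating sigmoid derivatives
explode = 2.0 ** 50              # 50 layers of weights with magnitude ~2
```

Even in the best case for the sigmoid, 50 layers multiply the gradient by at most $0.25^{50} \approx 10^{-30}$.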

What’s an advantage of using a ReLU as opposed to an activation function like a sigmoid?

There is better “gradient propagation”: the ReLU only saturates on one side (it is flat for negative inputs, but has gradient exactly $1$ for positive inputs), whereas the sigmoid saturates on both sides.
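A small comparison of the two derivatives, at illustrative pre-activation values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])              # illustrative pre-activations
sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))  # tiny at BOTH extremes
relu_grad = (z > 0).astype(float)             # exactly 1 for every positive z
```

The sigmoid's gradient is near zero at both extremes, while the ReLU's gradient stays at $1$ for the whole positive half-line.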

What’s the idea of early stopping in the context of neural networks?

Evaluate the model on validation set after each gradient update. If performance starts plateauing, stop optimising.
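A sketch of the loop, with hypothetical callables `step` (one gradient update) and `val_loss` (evaluate on the validation set); using a `patience` parameter is one common way to decide that performance has plateaued.

```python
def train_with_early_stopping(step, val_loss, max_steps=1000, patience=10):
    """Run gradient updates, stopping once the validation loss has not
    improved for `patience` consecutive evaluations."""
    best, since_best = float("inf"), 0
    for _ in range(max_steps):
        step()             # one gradient update (hypothetical callable)
        loss = val_loss()  # evaluate on the validation set
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break      # performance has plateaued: stop optimising
    return best
```

In practice one would also checkpoint the model at the best validation loss and restore it when stopping.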

How might adding more layers to a neural network make it worse in a way that is not to do with overfitting?

Vanishing / exploding gradient, saturation might mean it does worse on training data

Why is it important to initialise the weights and biases in a neural network randomly?

This breaks symmetries in the training process: if two units start with identical parameters, they receive identical gradient updates and remain identical forever, which prevents the network from learning distinct features.

When choosing how to initialise the weights and biases of a particular unit of a neural network, such as a sigmoid unit, what important consideration should you make about the gradient signal?

Making sure that the network isn’t saturated on initialisation.

Supposing that we have a sigmoid unit in a neural network, given by

\[\pmb a^l = \sigma(\pmb W^l \pmb a^{l - 1} + \pmb b^l)\]
Assuming that the inputs satisfy $\mathbb E[x^2 _ i] \approx 1$, how should we initialise the weights and biases to ensure that the network isn’t saturated at the beginning of training?

We want the pre-activation $\pmb W^l \pmb a^{l-1} + \pmb b^l$ to be small in magnitude. If there are $D$ entries in the previous activation, pick each $W_{ij}$ independently from $\mathcal N(0, \frac 1 D)$ (and initialise the biases to zero), so that each pre-activation has variance approximately $1$.
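A quick NumPy check of this scheme (the dimension $D = 1000$ is illustrative): with $W_{ij} \sim \mathcal N(0, \frac 1 D)$, each pre-activation $z_i$ has variance close to $1$, well inside the sigmoid's non-flat region.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000                                      # illustrative input dimension
x = rng.standard_normal(D)                    # inputs with E[x_i^2] ≈ 1
W = rng.standard_normal((D, D)) / np.sqrt(D)  # W_ij ~ N(0, 1/D)
z = W @ x                                     # pre-activations
# Each z_i is a sum of D independent terms with variance ≈ 1/D,
# so Var[z_i] ≈ 1 and the sigmoid is not saturated at initialisation.
```

Without the $1/\sqrt D$ scaling, the pre-activations would have variance $\approx D$ and the sigmoid would start out saturated.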

What is the technique of dropout when training neural networks, and why is it useful?

Dropout is a regularisation technique (regularisation being anything used to prevent overfitting). In each training step, a random fraction of units are “dropped”, meaning all forward and backward connections through them are removed for that step.

It is useful since it prevents units from co-adapting, i.e. developing dependencies on one another.
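A sketch of dropout as a masking operation, using the common “inverted dropout” variant (the drop probability `p` and the helper name are assumptions of this sketch):

```python
import numpy as np

def dropout(a, p, rng):
    """Inverted dropout: zero each activation with probability p and scale
    the survivors by 1/(1 - p), so the expected activation is unchanged."""
    mask = (rng.random(a.shape) >= p).astype(float) / (1.0 - p)
    return a * mask
```

The mask is resampled at every training step; at test time the layer is left unmodified, which is why the $1/(1-p)$ rescaling is applied during training.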

Suppose:

- $\pmb z^l$ denotes the pre-activation at layer $l$
- $\pmb W^l$ denotes the weights at layer $l$
- $\pmb b^l$ denotes the bias at layer $l$
- $f$ denotes the activation function
- $\ell (\pmb x _ i, y _ i)$ denotes the loss function

Furthermore, we have the forward equations

\[\begin{aligned}
&\pmb z^l = \pmb W^l \pmb a^{l-1} + \pmb b^l \\\\
&\pmb a^l = f(\pmb z^l)
\end{aligned}\]
These are the equations for the whole layer in a fully-connected network. Can you give the equations for the $i$-th unit in this layer?

\[\begin{aligned} z_i^l &= \pmb w^\top \pmb a^{l-1} + b_i^l \\\\ a_i^l &= f(z_i^l) \end{aligned}\]

where $\pmb w$ are the weights for this unit, given by $\pmb w = \pmb W _ {i, :}^\top$.

A quick derivation:

\[\begin{aligned} z_i^l &= \sum^n_{j = 1} W_{ij}^l a_j^{l - 1} + b_i^l \\\\ &= (\pmb W_{i,:}^l) \pmb a^{l-1} + b_i^l \end{aligned}\]

(using the notation that $\pmb W _ {i, :}$ is the $i$-th row of $\pmb W$).
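A quick NumPy check that the per-unit equation agrees with the whole-layer equation (the shapes and random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # illustrative layer: 3 units, 4 inputs
b = rng.standard_normal(3)
a_prev = rng.standard_normal(4)  # a^{l-1}

i = 1                            # pick the i-th unit
w = W[i, :]                      # its weight vector: the i-th row of W
z_i = w @ a_prev + b[i]          # per-unit pre-activation
a_i = np.tanh(z_i)               # per-unit activation (f = tanh here)
```

Picking out the $i$-th entry of the layer-level computation gives the same numbers.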

How can you view (in a very informal sense) linear regression and logistic regression as the same “flavour” of a simple neural network?

- Linear regression is like the simplest possible neural network: one input layer, one output layer, and an identity activation function. We train on mean squared error.
- Logistic regression turns linear regression into a classifier by using sigmoid as a non-linearity and then training on cross-entropy loss.
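A sketch of this view, writing both models as a single unit that differs only in its activation function (the function names and parameter values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_regression(x, w, b):
    """One unit with the identity activation; trained with mean squared error."""
    return w @ x + b

def logistic_regression(x, w, b):
    """The same unit with a sigmoid non-linearity; trained with cross-entropy."""
    return sigmoid(w @ x + b)
```

The only structural difference is the non-linearity wrapped around the same affine map $\pmb w^\top \pmb x + b$.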

Suppose $\pmb a _ \ell \in \mathbb R^d$ is a layer in a neural network, and that $\pmb a _ {\ell+1}$ is a residual layer. Can you define this mathematically, and give an expression for the gradient of loss at this layer?

The residual layer is defined by

\[\pmb a_{\ell + 1} = \pmb a_\ell + \mathcal F(\pmb a_\ell)\]

for some function $\mathcal F$ (e.g. one or more standard layers). Then

\[\begin{aligned} \frac{\partial \mathcal L}{\partial \pmb a_\ell} &= \frac{\partial \mathcal L}{\partial \pmb a_{\ell+1}} \frac{\partial \pmb a_{\ell +1}}{\partial \pmb a_\ell} \\\\ &= \frac{\partial \mathcal L}{\partial \pmb a_{\ell+1}}\left( I + \frac{\partial \mathcal F}{\partial \pmb a_\ell} \right) \end{aligned}\]

The identity term means the gradient can flow straight through the layer, which mitigates vanishing gradients.

When doing classification with a neural network, the final layer is typically a one-hot encoding of each of the classes. What are the pros and cons of instead using a binary encoding of each of the classes?

- Cons: might introduce unnecessary dependencies between the classes
- Pros: much more space efficient
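A sketch of the two encodings (the helper names are illustrative):

```python
import numpy as np

def one_hot(c, n_classes):
    """One output per class; exactly one is active."""
    v = np.zeros(n_classes)
    v[c] = 1.0
    return v

def binary_code(c, n_bits):
    """Only ceil(log2(n_classes)) outputs; compact, but classes share bits."""
    return np.array([(c >> k) & 1 for k in range(n_bits)], dtype=float)
```

With $K$ classes, one-hot needs $K$ output units while the binary code needs only $\lceil \log_2 K \rceil$, at the cost of unrelated classes sharing output bits.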

Suppose we have $L$ layers in a neural network, and that every layer (including the input and output) has the same dimensionality, $M$. What is the time and space complexity of training the neural network on a batch of $B$ examples using backpropagation?

- Forward pass:
  - Time complexity: $O(M^2 L B)$
  - Space complexity: $O((L-1)(M^2 + 2BM)) = O(M^2 L B)$
- Backward pass:
  - Time complexity: $O(M^2 L B)$
  - Space complexity: $O(M^2 L B)$