Computer Vision MT25, Dropout
Flashcards
@Define the method of dropout and @state why it is useful.
Dropout randomly “drops out” (zeroes) each neuron of a layer with some probability $p$ at each iteration of training. It is not applied to the final layer.
It is useful because neurons learn not to rely on the presence of other neurons.
Bite-sized
The typical dropout probability used in practice is $p = 0.5$, meaning each neuron is set to zero with 50% probability on each training iteration.
Dropout is usually applied before the last fully-connected layer(s) of a CNN, and is not applied to the final prediction layer (so the output distribution is not corrupted during training).
@Describe how dropout behaves at training vs test time, and explain why the asymmetry is necessary.
- Training: each neuron is zeroed independently with probability $p$. The kept activations are typically rescaled by $1/(1 - p)$ (“inverted dropout”) so the expected output magnitude matches the no-dropout case.
- Test: dropout is turned off completely — all neurons fire normally with no zeroing.
Why: at test we want a deterministic, full-capacity prediction. The training-time stochasticity acts as a regulariser by preventing co-adaptation of features; at inference we want the ensemble-average behaviour, which corresponds to using all neurons. The $1/(1 - p)$ rescaling at train time keeps train- and test-time activations on the same scale, so weights learned during training stay calibrated.
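The train/test asymmetry above can be sketched in a few lines of plain Python. This is only an illustration, not any library's implementation: the helper name `dropout`, its `training` flag and the list-of-floats representation are assumptions made for the example.

```python
import random


def dropout(activations, p, training, rng=random):
    """Inverted dropout on a list of floats (hypothetical helper, for illustration only)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training:
        # Test time: dropout is off, so activations pass through unchanged.
        return list(activations)
    scale = 1.0 / (1.0 - p)
    # Training: each unit is zeroed independently with probability p;
    # survivors are scaled by 1/(1 - p) so the expected value is unchanged.
    return [0.0 if rng.random() < p else a * scale for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [1.0] * 10000
train_out = dropout(x, p=0.5, training=True, rng=rng)
test_out = dropout(x, p=0.5, training=False)
train_mean = sum(train_out) / len(train_out)
```

With $p = 0.5$ every training-time activation here is either $0$ or $2$, and `train_mean` stays close to the test-time value of $1$, which is the point of the $1/(1 - p)$ rescaling.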
Dropout was introduced in Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR 2014.
@Justify why dropout helps prevent overfitting in deep networks.
Without dropout, neurons can learn to co-adapt: neuron A can rely on the presence of neuron B’s activation, so the network develops fragile compound features that only work when every member is present together. Such features tend to memorise the training set rather than capture structure that generalises.
With dropout, on each iteration a random subset of neurons is suppressed, so each neuron must produce useful activations independently — it cannot count on any specific other neuron being there. This forces the network to learn more robust, distributed representations.
A useful intuition: dropout is approximately training an exponentially-large ensemble of subnetworks (one per dropout mask) that share weights, with the test-time forward pass approximating the ensemble average.
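The ensemble intuition can be checked exactly in the simplest case, a single linear unit. The sketch below enumerates every dropout mask over a small input, weights each masked output by the probability of that mask, and compares the result with the ordinary full forward pass. The function names and the example numbers are made up for illustration. For a linear unit with inverted dropout the two agree exactly; with non-linearities and several layers the test-time pass is only an approximation of the ensemble average.

```python
from itertools import product


def linear_unit(x, w, mask, p):
    """Output of one linear unit whose inputs pass through an inverted-dropout mask."""
    scale = 1.0 / (1.0 - p)
    return sum(wi * xi * mi * scale for wi, xi, mi in zip(w, x, mask))


def ensemble_average(x, w, p):
    """Probability-weighted average of the unit's output over all 2^n masks."""
    n = len(x)
    total = 0.0
    for mask in product((0, 1), repeat=n):  # 1 = kept, 0 = dropped
        kept = sum(mask)
        prob = (1.0 - p) ** kept * p ** (n - kept)
        total += prob * linear_unit(x, w, mask, p)
    return total


# Hypothetical inputs and weights, chosen only for the example.
x = [0.5, -1.0, 2.0, 3.0]
w = [1.0, 0.25, -0.5, 2.0]
p = 0.5

full = sum(wi * xi for wi, xi in zip(w, x))  # test-time pass, no dropout
avg = ensemble_average(x, w, p)              # average over 16 subnetworks
```

Here `full` and `avg` coincide up to floating-point error, because each input is kept with probability $1 - p$ and scaled by $1/(1 - p)$, so its expected contribution is unchanged.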