Paper - ADADELTA, An Adaptive Learning Rate Method
- Full title: ADADELTA: An Adaptive Learning Rate Method
- Author(s): Matthew D. Zeiler
- Year: 2012
- Link: https://arxiv.org/pdf/1212.5701
- Relevant for:
Summary
- New per-dimension (i.e. per-parameter) first-order learning rate method for gradient descent
- Key benefits
- No manual setting of a learning rate
- Insensitive to hyperparameters
- Separate dynamic learning rate per-dimension
- Minimal computation overhead
- Robust to large gradients, noise and architecture choice
- Problems with ADAGRAD
- Sensitive to the initial parameter values and their gradients: because all past squared gradients are remembered, large initial gradients make the effective per-dimension learning rates small for the rest of training
- Still requires a manually chosen global learning rate, so it remains sensitive to that hyperparameter
- Since the accumulation runs over all past squared gradients, the denominator grows monotonically, so the effective learning rate decays throughout training and eventually shrinks to zero, which can stall progress even when further progress is possible
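To make the shrinking-learning-rate problem concrete, here is a minimal NumPy sketch of a per-dimension ADAGRAD step (variable names and default values are my own, not from the paper); note that the accumulator only ever grows:

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.01, eps=1e-8):
    """One ADAGRAD update (sketch). `accum` sums *all* past squared
    gradients per dimension, so lr / sqrt(accum) only ever shrinks."""
    accum = accum + grad ** 2                    # never decays
    x = x - lr * grad / (np.sqrt(accum) + eps)   # effective learning rate -> 0 over time
    return x, accum
```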
- ADADELTA gives solutions to these problems
- Learning rates shrinking to zero: restrict the accumulation to a window over the previous $w$ squared gradients; in practice this is implemented as an exponentially decaying average $E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2$, so the denominator becomes the RMS of recent gradients, $RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$
- Picking a learning rate: motivated by matching units, the global learning rate is replaced with the RMS of the previous parameter updates, $RMS[\Delta x]_{t-1}$, so that each update has the same units as the parameters themselves (see the sketch below)
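A minimal sketch of one ADADELTA step combining both fixes, following the paper's Algorithm 1 (variable names and the default $\rho$, $\epsilon$ here are my own choices):

```python
import numpy as np

def adadelta_step(x, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One ADADELTA update (sketch of Algorithm 1)."""
    # Window over past squared gradients via an exponentially decaying
    # average -- fixes ADAGRAD's ever-growing denominator.
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # Replace the global learning rate with the RMS of previous parameter
    # updates, so the step has the same units as the parameters.
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * grad
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return x + dx, Eg2, Edx2
```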
Plan for paper summary
- Start instead with a description of the ADADELTA algorithm and vanilla SGD
- Motivate the differences
- Explain advantages of ADADELTA
- These advantages can be observed experimentally
- Explain disadvantages of ADADELTA
- Slightly larger memory footprint: two running averages per parameter ($E[g^2]$ and $E[\Delta x^2]$) must be stored
- Oscillations near a minimum can accumulate in the running averages and unnecessarily slow down convergence
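A toy usage example of the two sketch functions above on a hypothetical 2-D quadratic, mainly to show the extra per-parameter state ADADELTA carries compared to ADAGRAD:

```python
import numpy as np

A = np.diag([10.0, 1.0])       # ill-conditioned quadratic f(x) = 0.5 * x.T @ A @ x
grad_f = lambda x: A @ x

x_ada, accum = np.array([1.0, 1.0]), np.zeros(2)                    # ADAGRAD: one accumulator
x_add, Eg2, Edx2 = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)   # ADADELTA: two accumulators

for _ in range(500):
    x_ada, accum = adagrad_step(x_ada, grad_f(x_ada), accum)
    x_add, Eg2, Edx2 = adadelta_step(x_add, grad_f(x_add), Eg2, Edx2)

print("ADAGRAD :", x_ada)
print("ADADELTA:", x_add)
```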