Paper - ADADELTA: An Adaptive Learning Rate Method


Summary

  • New per-dimension (i.e. per-parameter) first-order learning rate method for gradient descent
  • Key benefits
    • No manual setting of a learning rate
    • Insensitive to hyperparameters
    • Separate dynamic learning rate per-dimension
    • Minimal computation overhead
    • Robust to large gradients, noise and architecture choice
  • Problems with ADAGRAD
    • Sensitive to the initial conditions of the parameters and the corresponding gradients: because all past squared gradients are remembered, large initial gradients keep the effective learning rate small for the rest of training
    • Still requires a manually chosen global learning rate and remains sensitive to it
    • Because the accumulation runs over all past squared gradients, the denominator grows monotonically, so the effective learning rate decays throughout training and eventually shrinks towards zero, which can slow training even when further progress is possible
  • ADADELTA gives solutions to these problems
    • Learning rates shrinking to zero: restrict the accumulation to a window over the previous $w$ steps rather than all history; in practice this is implemented as an exponentially decaying average of squared gradients, $E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2$, so the denominator becomes the RMS of recent gradients, $\mathrm{RMS}[g]_t = \sqrt{E[g^2]_t + \epsilon}$ (see the sketch after this list)
    • Picking a learning rate: motivated by matching the units of the update to the units of the parameters (as a second-order method would), the global learning rate is replaced by the RMS of the previous parameter updates, giving $\Delta x_t = -\frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t$
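
A minimal sketch of one per-dimension ADADELTA step following the two ideas above (NumPy; the function and variable names are my own, and the decay rate rho and conditioning constant eps are hyperparameters shown with illustrative values):

```python
import numpy as np

def adadelta_step(x, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    """One ADADELTA update; all operations are element-wise (per-dimension)."""
    # Decaying average of squared gradients (the "window" over recent steps).
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Step size is RMS of past updates over RMS of recent gradients:
    # no global learning rate, and the units match those of the parameters.
    delta = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # Decaying average of squared updates, used by the next step.
    avg_sq_update = rho * avg_sq_update + (1 - rho) * delta ** 2
    return x + delta, avg_sq_grad, avg_sq_update
```

Both accumulators start as zero vectors with the same shape as the parameters, which is why the $\epsilon$ terms are needed to get the first updates moving.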

Plan for paper summary

  • Start instead with a description of the ADADELTA algorithm and vanilla SGD (a plain SGD step is sketched after this list for contrast)
  • Motivate the differences
  • Explain advantages of ADADELTA
    • These advantages are demonstrated experimentally in the paper (MNIST classification and a large-scale speech recognition task)
  • Explain disadvantages of ADADELTA
    • Slightly larger memory footprint: two accumulators (squared gradients and squared updates) must be stored per parameter
    • Oscillations near a minimum can accumulate in the decaying averages and unnecessarily slow down convergence
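
For the planned contrast, a plain SGD step under the same conventions as the ADADELTA sketch above (names are my own; the learning rate lr must be hand-tuned and is shared by every dimension):

```python
def sgd_step(x, grad, lr=0.01):
    """Vanilla SGD: one fixed, manually chosen learning rate for all dimensions."""
    return x - lr * grad
```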

Flashcards




Related posts