# Paper - ADADELTA, An Adaptive Learning Rate Method

> Source: https://ollybritton.com/notes/uni/part-c/mt25/theories-of-deep-learning/reading/paper-adadelta-an-adaptive-learning-rate-method/ · Updated: 2025-11-15 · Tags: uni, notes

- **Full title**: *ADADELTA: An Adaptive Learning Rate Method*
- **Author(s)**: Matthew D. Zeiler
- **Year**: 2012
- **Link**: https://arxiv.org/pdf/1212.5701
- **Relevant for**:
	- [Course - Theories of Deep Learning MT25](https://ollybritton.com/notes/uni/part-c/mt25/theories-of-deep-learning/)

### Summary
- New per-dimension (i.e. per-parameter) first-order learning rate method for gradient descent
- Key benefits
	- No manual setting of a learning rate
	- Insensitive to hyperparameters
	- Separate dynamic learning rate per-dimension
	- Minimal computation overhead
	- Robust to large gradients, noise and architecture choice
- Problems with ADAGRAD
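	- Sensitive to the initial conditions of the parameters and corresponding gradients: every past squared gradient stays in the accumulated sum, so large initial gradients keep the learning rate low for the rest of training
	- Still requires a manually chosen global learning rate $\eta$
	- Since the denominator accumulates squared gradients over all of training, the effective learning rate keeps decreasing and eventually shrinks to zero, which can slow training even when it's not necessary to do so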
- ADADELTA gives solutions to these problems
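	- Learning rates shrinking to zero: accumulate squared gradients over a window of roughly the previous $w$ steps rather than over all of training; to avoid storing $w$ gradients, this is implemented as an exponentially decaying average of the squared gradients, and its square root is the RMS of the recent gradients up to time $t$
	- Picking a learning rate: motivated by considering units; the update $\Delta x$ should have the same units as the parameter $x$, which plain SGD and ADAGRAD updates do not, so the global learning rate in the numerator is replaced by the RMS of the previous updates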
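
For reference, here are the two fixes above written out as update rules, as far as I can reconstruct them from the paper. Here $g_t$ is the gradient at step $t$, $\rho$ is the decay constant of the running averages, $\epsilon$ is a small constant that keeps the square roots away from zero, and $\eta$ is ADAGRAD's global learning rate. ADAGRAD divides by a sum over all past gradients:

$$
\Delta x_t = -\frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}}\, g_t
$$

ADADELTA replaces the sum with a decaying average and $\eta$ with the RMS of the previous updates:

$$
\begin{aligned}
E[g^2]_t &= \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2 \\
\mathrm{RMS}[g]_t &= \sqrt{E[g^2]_t + \epsilon} \\
\Delta x_t &= -\frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \\
E[\Delta x^2]_t &= \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2 \\
x_{t+1} &= x_t + \Delta x_t
\end{aligned}
$$

Since $\mathrm{RMS}[\Delta x]$ carries the units of $x$ and $g_t / \mathrm{RMS}[g]_t$ is unitless, $\Delta x_t$ ends up in the units of $x$, which is the point of the units argument.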

### Plan for paper summary
- Start with a description of vanilla SGD and the ADADELTA algorithm, rather than with ADAGRAD
- Motivate the differences
- Explain advantages of ADADELTA
	- These advantages can be observed experimentally
- Explain disadvantages of ADADELTA
	- Slightly larger memory footprint (two extra running averages per parameter)
	- Oscillations near a minimum can accumulate and unnecessarily slow down convergence
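
A minimal sketch of the comparison the plan starts from, in plain Python so it runs without dependencies. It is not the paper's reference code: the class name `Adadelta`, the helper `sgd_step`, and the defaults `rho=0.95` and `eps=1e-6` are assumptions chosen for illustration. The two per-parameter lists are the extra memory mentioned above.

```python
import math


def sgd_step(x, grad, lr):
    """Vanilla SGD: one hand-picked learning rate shared by every dimension."""
    return [xi - lr * gi for xi, gi in zip(x, grad)]


class Adadelta:
    """Illustrative ADADELTA update (names and defaults are assumptions)."""

    def __init__(self, dim, rho=0.95, eps=1e-6):
        self.rho = rho
        self.eps = eps
        # Two running averages per parameter: the larger memory footprint.
        self.avg_sq_grad = [0.0] * dim   # E[g^2]
        self.avg_sq_delta = [0.0] * dim  # E[dx^2]

    def step(self, x, grad):
        new_x = []
        for i, (xi, gi) in enumerate(zip(x, grad)):
            # Decaying average of squared gradients (the "window").
            self.avg_sq_grad[i] = (
                self.rho * self.avg_sq_grad[i] + (1 - self.rho) * gi * gi
            )
            rms_grad = math.sqrt(self.avg_sq_grad[i] + self.eps)
            # RMS of previous updates takes the place of a learning rate.
            rms_delta = math.sqrt(self.avg_sq_delta[i] + self.eps)
            delta = -(rms_delta / rms_grad) * gi
            # Decaying average of squared updates, used on the next step.
            self.avg_sq_delta[i] = (
                self.rho * self.avg_sq_delta[i] + (1 - self.rho) * delta * delta
            )
            new_x.append(xi + delta)
        return new_x


if __name__ == "__main__":
    x = [1.0, 1.0]
    grad = [100.0, 1.0]  # gradients differing by a factor of 100
    print("SGD:     ", sgd_step(x, grad, lr=0.01))
    print("ADADELTA:", Adadelta(dim=2).step(x, grad))
```

With gradients of 100 and 1, the SGD step sizes differ by a factor of 100, while the first ADADELTA step is roughly the same size in both dimensions (about 0.0045 with these assumed defaults), since each gradient is divided by its own RMS. This is one small illustration of the per-dimension behaviour, not evidence about convergence.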

### Flashcards

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt