Lecture - Theories of Deep Learning MT25, VIII, Optimisation algorithms for training DNNs

Created: November 12, 2025 | Updated: November 15, 2025

Course - Theories of Deep Learning MT25

This lecture and the previous one (Lecture - Theories of Deep Learning MT25, VII, Stochastic gradient descent and its extensions) are effectively a mini-speedrun of Course - Optimisation for Data Science HT25. This lecture in particular covers momentum in the context of mini-batch stochastic gradient descent.
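As a reminder of what momentum looks like with mini-batches, here is a minimal sketch using the standard heavy-ball form $v_{t+1} = \beta v_t - \alpha g_t$, $w_{t+1} = w_t + v_{t+1}$, where $g_t$ is a mini-batch gradient. This is the textbook update and not necessarily the notation used in the lecture; the scalar least-squares model, the data, and the hyperparameters `lr`, `beta` and `batch_size` are all assumptions made for illustration.

```python
import random


def minibatch_grad(w, data, batch_size, rng):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over a random mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum((w * x - y) * x for x, y in batch) / batch_size


def sgd_momentum(data, lr=0.05, beta=0.9, batch_size=8, steps=1000, seed=0):
    """Mini-batch SGD with heavy-ball momentum on a scalar least-squares model."""
    rng = random.Random(seed)
    w, v = 0.0, 0.0  # parameter and velocity
    for _ in range(steps):
        g = minibatch_grad(w, data, batch_size, rng)
        v = beta * v - lr * g  # velocity: decaying sum of past gradient steps
        w = w + v              # move along the velocity
    return w


# Hypothetical noiseless data generated from y = 3x, so the minimiser is w = 3.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(-20, 21)]
w_hat = sgd_momentum(data)
print(w_hat)
```

Setting `beta=0` recovers plain mini-batch SGD, which makes it easy to compare the two on the same data.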

It also covers techniques not mentioned in Course - Optimisation for Data Science HT25, including:

Adaptive subgradients (AdaGrad)
RMSProp
AdaDelta
Adam
AdaGrad with an adaptive stepsize rule
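
To make the differences between some of these methods concrete, the following is a rough sketch of the commonly stated per-coordinate updates for AdaGrad, RMSProp and Adam. These are the standard textbook forms, which may differ in detail from how the lecture presents them. AdaDelta and the adaptive stepsize variant of AdaGrad are left out. The test function (a badly scaled quadratic) and all hyperparameter values are assumptions chosen only for illustration.

```python
import math


def adagrad_step(w, g, s, lr=0.5, eps=1e-8):
    # Accumulate the running SUM of squared gradients per coordinate.
    s["G"] = [G + gi * gi for G, gi in zip(s["G"], g)]
    return [wi - lr * gi / (math.sqrt(G) + eps)
            for wi, gi, G in zip(w, g, s["G"])]


def rmsprop_step(w, g, s, lr=0.01, rho=0.9, eps=1e-8):
    # Replace the sum with an exponential moving average of squared gradients.
    s["G"] = [rho * G + (1 - rho) * gi * gi for G, gi in zip(s["G"], g)]
    return [wi - lr * gi / (math.sqrt(G) + eps)
            for wi, gi, G in zip(w, g, s["G"])]


def adam_step(w, g, s, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Moving averages of the gradient (m) and squared gradient (v),
    # each divided by a bias-correction factor before use.
    s["t"] += 1
    s["m"] = [b1 * m + (1 - b1) * gi for m, gi in zip(s["m"], g)]
    s["v"] = [b2 * v + (1 - b2) * gi * gi for v, gi in zip(s["v"], g)]
    c1 = 1 - b1 ** s["t"]
    c2 = 1 - b2 ** s["t"]
    return [wi - lr * (m / c1) / (math.sqrt(v / c2) + eps)
            for wi, m, v in zip(w, s["m"], s["v"])]


def f(w):
    # Hypothetical badly scaled quadratic with minimiser at the origin.
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)


def grad(w):
    return [100.0 * w[0], w[1]]


def run(step, steps=2000):
    w = [1.0, 1.0]
    s = {"G": [0.0, 0.0], "m": [0.0, 0.0], "v": [0.0, 0.0], "t": 0}
    for _ in range(steps):
        w = step(w, grad(w), s)
    return w


w_adagrad = run(adagrad_step)
w_rmsprop = run(rmsprop_step)
w_adam = run(adam_step)
print(f(w_adagrad), f(w_rmsprop), f(w_adam))
```

All three divide each coordinate's gradient by a running estimate of its magnitude, which is why the two coordinates make similar progress despite the factor of 100 between their curvatures. With a constant `lr`, RMSProp typically ends up hovering near the minimiser rather than converging to it exactly.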

Papers mentioned

Paper - ADADELTA, An Adaptive Learning Rate Method
