Lecture - Theories of Deep Learning MT25, VIII, Optimisation algorithms for training DNNs
This lecture and the previous ([[Lecture - Theories of Deep Learning MT25, VII, Stochastic gradient descent and its extensions]]) are effectively a mini-speedrun of [[Course - Optimisation for Data Science HT25]]. This lecture in particular covers momentum in the context of mini-batch stochastic gradient descent (standard update rules sketched after the list):
- [[Notes - Optimisation for Data Science HT25, Heavy ball method]]
- [[Notes - Optimisation for Data Science HT25, Nesterov’s accelerated gradient method]]
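For quick reference, the standard forms of the two momentum updates (notation mine, not necessarily the lecture's): heavy ball adds a momentum term to the usual gradient step, while Nesterov evaluates the gradient at a look-ahead point.
$$
\begin{aligned}
\text{Heavy ball:}\quad & x^{k+1} = x^k - \alpha \nabla f(x^k) + \beta\,(x^k - x^{k-1}),\\
\text{Nesterov:}\quad & y^k = x^k + \beta\,(x^k - x^{k-1}), \qquad x^{k+1} = y^k - \alpha \nabla f(y^k),
\end{aligned}
$$
with stepsize $\alpha > 0$ and momentum parameter $\beta \in [0,1)$; in the mini-batch setting $\nabla f$ is replaced by a stochastic gradient estimate.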
It also covers techniques not mentioned in [[Course - Optimisation for Data Science HT25]], including (update rules sketched after the list):
- Adaptive subgradients (AdaGrad)
- RMSProp
- AdaDelta
- Adam
- AdaGrad with an adaptive stepsize rule
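As a reference sketch (notation mine; $g^k$ denotes the mini-batch gradient at $x^k$ and all operations are elementwise), the standard per-coordinate update rules for the first three methods are:
$$
\begin{aligned}
\text{AdaGrad:}\quad & v^k = v^{k-1} + (g^k)^2, && x^{k+1} = x^k - \frac{\alpha}{\sqrt{v^k} + \epsilon}\, g^k,\\
\text{RMSProp:}\quad & v^k = \rho\, v^{k-1} + (1-\rho)(g^k)^2, && x^{k+1} = x^k - \frac{\alpha}{\sqrt{v^k} + \epsilon}\, g^k,\\
\text{Adam:}\quad & m^k = \beta_1 m^{k-1} + (1-\beta_1) g^k, && v^k = \beta_2 v^{k-1} + (1-\beta_2)(g^k)^2,\\
& \hat m^k = \frac{m^k}{1-\beta_1^k}, \quad \hat v^k = \frac{v^k}{1-\beta_2^k}, && x^{k+1} = x^k - \frac{\alpha}{\sqrt{\hat v^k} + \epsilon}\, \hat m^k.
\end{aligned}
$$
AdaDelta modifies RMSProp by replacing the global stepsize $\alpha$ with a running RMS of recent parameter updates, making the method effectively stepsize-free.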