Course - Optimisation for Data Science HT25
This course analyses optimisation methods suitable for large-scale data science problems, mainly by deriving rates of convergence under increasingly strong assumptions (smooth, convex, strongly convex) on the objective function.
The course begins with some optimisation terminology and then covers gradient descent and the proximal method, which can be used to apply steepest descent techniques to regularised problems. It then covers acceleration techniques such as the heavy ball method, before moving on to stochastic gradient descent and accelerated techniques in that context. Finally, it covers coordinate descent methods.
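To make the proximal idea concrete, here is a minimal sketch (my own, not from the course materials) of proximal gradient / ISTA applied to $\ell_1$-regularised least squares, $\min _ x \tfrac{1}{2}\lVert Ax-b\rVert^2 + \lambda \lVert x\rVert _ 1$; the problem instance, stepsize choice and variable names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, b, lam, n_iters=500):
    """Proximal gradient (ISTA) for 0.5*||Ax - b||^2 + lam*||x||_1.

    Each iteration takes a steepest-descent step on the smooth part,
    then applies the proximal operator of the regulariser.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth gradient
    alpha = 1.0 / L                        # fixed stepsize 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)           # gradient of the smooth term
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]   # sparse ground truth
    b = A @ x_true + 0.01 * rng.standard_normal(50)
    print(proximal_gradient(A, b, lam=0.1).round(2))
```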
- Course Webpage
- Lecture Notes
- From the previous year:
- Lecture Notes
- 1, Scope and examples
- 2, Terminology and prerequisites
- 3, Method of steepest descent
- 4, The proximal method
- 5, Acceleration of gradient methods
- 6, Stochastic gradient descent
- 7, Reducing the noise floor in SGD
- 8, Coordinate descent
- 9, Practical coordinate descent
- 10, Outlook (non-examinable)
- Other courses this term: [[Courses HT25]]U
Notes
- [[Notes - Optimisation for Data Science HT25, Overview of convergence results]]U ⭐️
- [[Notes - Optimisation for Data Science HT25, Motivation and examples]]U
- [[Notes - Optimisation for Data Science HT25, Optimisation terminology]]U
- [[Notes - Optimisation for Data Science HT25, Smoothness and convexity]]U
- [[Notes - Optimisation for Data Science HT25, Subgradients]]U
- [[Notes - Optimisation for Data Science HT25, Steepest descent]]U
- [[Notes - Optimisation for Data Science HT25, Line search methods]]U
- [[Notes - Optimisation for Data Science HT25, Proximal methods]]U
- [[Notes - Optimisation for Data Science HT25, Accelerated methods]]U
- [[Notes - Optimisation for Data Science HT25, Stochastic gradient descent]]U
- [[Notes - Optimisation for Data Science HT25, Variance reduction methods]]U
- [[Notes - Optimisation for Data Science HT25, Coordinate descent]]U
- [[Notes - Optimisation for Data Science HT25, Useful miscellany]]U
- [[Notes - Optimisation for Data Science HT25, Homemade exam questions]]?
Problem Sheets
- From the previous year:
To-Do List
Differences between notes and slides
- 6: Stochastic gradient descent
- More details on taking expectations in the analysis of the algorithm (see the SGD sketch after this list)
- 7: Reducing the noise floor
- The proof for the dynamic stepsize seems easier to follow; it also uses Jensen’s inequality at one point, which the notes don’t
- Proves a stronger result about $\lim$ rather than $\liminf$
- The proof for SVRG seems clearer
- 8: Coordinate descent
- Wolfe conditions
- Convergence of cyclic coordinate descent with inexact line searches
- Interpretation of why convergence fails on Powell’s objective, in terms of the general convergence results
- Convergence of randomised coordinate descent with inexact line searches
- Dimensionality dilemma
- More detail about Lipschitz componentwise derivatives and cleaner proof by factoring out a “foundational lemma”
- Proof in general case with fixed stepsize
- Proof in convex case with fixed stepsize
- Proof in $\gamma$-strongly convex case with fixed stepsize
- 9: Practical coordinate descent
- The results on randomised coordinate descent assume that the step size is fixed at $1/L$; this lecture is about what happens when you instead use $\alpha _ k$ satisfying the Wolfe conditions (still assuming that $f$ is CL-smooth); see the coordinate descent sketch after this list
- Wolfe conditions for RCD with line search
- Proof of the overestimation lemma for coordinate line search (deterministic and in expectation)
- Proof of RCD in the general case
- Proof of RCD in the convex case
- Proof of RCD in the $\gamma$-strongly convex case
- CCD in convex case (no proof)
- CCD in $\gamma$-strongly convex case (no proof)
- Discussion of cycle complexity and iteration complexity
- Block coordinate descent algorithm
- Block Lipschitz smoothness
- Prove the block Lipschitz overestimation property; also check whether it is indeed $L _ \text{bmax}$ rather than $L _ \text{max}$ as the notes say
- Proof of RBCD for CL-smooth, $\gamma$-strongly convex functions with known block structure
- Proof of RBCD for CL-smooth, $\gamma$-strongly convex functions with equal-sized subsets
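The SGD sketch referenced under lecture 6 above: a toy finite-sum least-squares example (my own, not the course’s algorithm or notation) with a numerical check that the stochastic gradient is an unbiased estimate of the full gradient, which is the expectation step the analysis relies on.

```python
import numpy as np

def sgd_least_squares(A, b, alpha=0.01, n_iters=2000, seed=0):
    """SGD on f(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2.

    Each step uses the gradient of a single, uniformly sampled term,
    which is an unbiased estimate of the full gradient.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)                        # uniform random index
        x -= alpha * (A[i] @ x - b[i]) * A[i]      # stochastic gradient step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((100, 5))
    b = A @ rng.standard_normal(5)
    x = rng.standard_normal(5)
    full_grad = A.T @ (A @ x - b) / len(b)         # gradient of the averaged objective
    mean_stoch = np.mean([(A[i] @ x - b[i]) * A[i] for i in range(len(b))], axis=0)
    print(np.allclose(full_grad, mean_stoch))      # True: the estimator is unbiased
    print(np.round(sgd_least_squares(A, b), 3))
```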
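The coordinate descent sketch referenced under lecture 9 above: a minimal, illustrative implementation (my own, not taken from the notes) of randomised coordinate descent on a convex quadratic with a fixed stepsize of $1/L _ i$ per coordinate, where $L _ i$ is the coordinatewise Lipschitz constant; the exact fixed constant in the notes may differ, and a Wolfe-type line search would replace the fixed stepsize inside the loop.

```python
import numpy as np

def randomised_coordinate_descent(Q, c, n_iters=5000, seed=0):
    """RCD on the convex quadratic f(x) = 0.5*x^T Q x - c^T x (Q symmetric PD).

    At each iteration, pick a coordinate i uniformly at random and take a
    gradient step along e_i with fixed stepsize 1/L_i, where L_i = Q[i, i]
    is the Lipschitz constant of the i-th partial derivative.
    """
    rng = np.random.default_rng(seed)
    d = len(c)
    L = np.diag(Q).copy()                  # coordinatewise Lipschitz constants
    x = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(d)
        grad_i = Q[i] @ x - c[i]           # i-th partial derivative
        x[i] -= grad_i / L[i]              # fixed stepsize 1/L_i along coordinate i
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    M = rng.standard_normal((8, 8))
    Q = M @ M.T + np.eye(8)                # symmetric positive definite
    c = rng.standard_normal(8)
    x = randomised_coordinate_descent(Q, c)
    print(np.allclose(x, np.linalg.solve(Q, c), atol=1e-4))   # minimiser solves Qx = c
```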
Relevant reading
- “Optimisation for Data Science” course at ETH Zurich:
- #uni #notes #task #read 2: Theory of Convex Functions ✅ 2025-03-29
- #uni #notes #task #read 3: Gradient Descent
- #uni #notes #task #read 4: Projected Gradient Descent
- #uni #notes #task #read 5: Coordinate Descent
- #uni #notes #task #read 10: Subgradient Methods
- #uni #notes #task #read 11: Mirror Descent, Smoothing, Proximal Algorithms
- #uni #notes #task #read 12: Stochastic Optimisation
- #uni #notes #task #read 13: Finite Sum Optimisation
- “Convex Optimisation” at CMU:
- #uni #notes #task #read 1: Convex Sets
- #uni #notes #task #read 2: Convex Functions, Optimization Basics
- #uni #notes #task #read 3: Gradient Descent
- #uni #notes #task #read 4: More Gradient Descent and Subgradients
- #uni #notes #task #read 5: The Subgradient Method and Oracle Lower Bounds
- #uni #notes #task #read 6: Projected Gradient Descent and the Proximal Method
- #uni #notes #task #read 7: More Proximal Method
- #uni #notes #task #read 8: Stochastic Gradient Descent
- Convex Optimisation, Boyd and Vandenberghe
- #uni #notes #task #read 1: Introduction
- #uni #notes #task #read 2: Convex sets
- #uni #notes #task #read 3: Convex functions
- #uni #notes #task #read 4: Convex optimization problems
- #uni #notes #task #read 9: Unconstrained minimisation
- #uni #notes #task #read 10: Equality constrained minimisation
- Numerical Optimisation, Nocedal and Wright
- #uni #notes #task #read 1: Introduction
- #uni #notes #task #read 2: Fundamentals of Unconstrained Optimisation
- #uni #notes #task #read 3: Line-search methods
- Deep Learning, Goodfellow
- #uni #notes #task #read 8: Optimization for Training Deep Models
- Optimisation for Machine Learning, Sra, Nowozin and Wright
- #uni #notes #task #read 1: Optimization and Machine Learning
- #uni #notes #task #read 2: Convex Optimization with Sparsity-Inducing Norms
- #uni #notes #task #read 4: Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A survey