Optimisation for Data Science HT25, Coordinate descent
Flashcards
Basic definitions
Suppose we wish to solve the optimisation problem
\[\min _ {x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, can you @state the coordinate descent @algorithm?
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\](3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]Recall the coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@Define what is meant by:
- randomised,
- cyclic,
- Gauss-Southwell, and
- Block (describing two sampling variants)
coordinate descent in this context.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
- Randomised: Pick $\mathcal B _ k = \{i\}$ with $i$ sampled uniformly at random from $\{1, \ldots, n\}$ (so $ \vert \mathcal B _ k \vert = 1$)
- Cyclic: Cycle through the coordinates in order, $\mathcal B _ 0 = \{1\}$, $\mathcal B _ 1 = \{2\}$, $\ldots$, $\mathcal B _ {n-1} = \{n\}$, then $\mathcal B _ n = \{1\}$ again, and so on
- Gauss-Southwell: Pick $\mathcal B _ k = \{\text{argmax} _ {1 \le i \le n} \vert \frac{\partial f}{\partial x _ i}(x^k) \vert \}$ (so $ \vert \mathcal B _ k \vert = 1$).
- Block coordinate descent is the case where $ \vert \mathcal B _ k \vert > 1$.
- Sampling variants give different ways that the blocks of indices can be selected:
- If $f(x) = f(x _ {\mathcal B _ 1}, \ldots, x _ {\mathcal B _ \ell})$ has a natural block structure with $\ell$ blocks, sample $j \sim \mathcal U(\{1, \ldots, \ell\})$ and select $\mathcal B _ j$.
- If $f(x)$ does not have a natural block structure, sample $\mathcal B$ from $\mathcal P _ {n _ 0} = \{\mathcal B : \vert \mathcal B \vert = n _ 0\}$.
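As a concrete illustration (not part of the notes), the three single-coordinate selection rules can be sketched in a few lines of NumPy. The function name, fixed step size and least-squares test problem below are illustrative assumptions:

```python
import numpy as np

def coordinate_descent(grad, x0, alpha, n_iters=1000, rule="randomised", rng=None):
    """Single-coordinate descent sketch (fixed step size alpha).

    grad(x) returns the full gradient; only one entry is used per step, so in
    practice one would evaluate a single partial derivative instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    n = x.size
    for k in range(n_iters):
        g = grad(x)
        if rule == "randomised":          # B_k = {j}, j ~ Uniform({1, ..., n})
            i = rng.integers(n)
        elif rule == "cyclic":            # B_k cycles through 1, 2, ..., n, 1, ...
            i = k % n
        elif rule == "gauss-southwell":   # B_k = {argmax_i |df/dx_i(x^k)|}
            i = int(np.argmax(np.abs(g)))
        else:
            raise ValueError(rule)
        x[i] -= alpha * g[i]              # x^{k+1} = x^k - alpha_k d^k
    return x

# Illustrative test problem: f(x) = 0.5 ||A x - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f_grad = lambda x: A.T @ (A @ x - b)
print(coordinate_descent(f_grad, np.zeros(2), alpha=0.05, rule="cyclic"))
```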
What are some contexts in which a natural block partition of the variables appears in the problem formulation?
- When training neural networks, the parameters can be updated layer by layer
- Multistage optimisation problems
- Sparse linear least squares
CL-smoothness
This property is assumed in many of the convergence results to follow.
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State what is meant by $f$ being CL-smooth with parameters $L _ i$.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Each partial derivative of $f$ is “coordinatewise $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x _ i}(x + te _ i) - \frac{\partial f}{\partial x _ i}(x) \right| \le L _ i |t|\]Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State a result that relates being CL-smooth with parameters $L _ i$ to the global Lipschitz constant $L$, and a refinement of this result under the assumption $f$ is convex.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
- If $f$ is CL-smooth with parameters $L _ i$, then $f$ is also $L$-smooth for some $L$ with $L _ \max \le L$, where $L _ \max := \max _ i L _ i$.
- If $f$ is also convex, then $L _ \max \le L \le n L _ \max$ (a numerical check of these inequalities on a quadratic is sketched below).
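As a quick sanity check (not part of the notes), the inequalities $L _ \max \le L \le n L _ \max$ can be verified numerically on a convex quadratic $f(x) = \frac 1 2 x^\top A x$ with $A$ symmetric positive semidefinite, for which $L = \lambda _ \max(A)$ and $L _ i = A _ {ii}$. A minimal NumPy sketch; the matrix and dimension are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M                         # symmetric PSD, so f(x) = 0.5 x^T A x is convex

# For f(x) = 0.5 x^T A x: grad f(x) = A x, hence L = lambda_max(A) and L_i = A_ii.
L = np.linalg.eigvalsh(A).max()
L_max = A.diagonal().max()

print(L_max <= L <= n * L_max)      # True: L_max <= L <= n L_max
```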
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@Prove that then
- If $f$ is CL-smooth with parameters $L _ i$ and it is also convex, then $L _ \max \le L \le n L _ \max$.
(you may assume without proof that $f$ is $L$-smooth and $L _ \max \le L$, this is straightforward to prove).
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Since we may assume $L _ \text{max} \le L$, it remains to show that $L \le nL _ \max$. For fixed $y$, define
\[\varphi(x) := f(x) - f(y) - \nabla f(y)^\top (x - y)\]Then since $f$ is convex, $\varphi(x) \ge 0$ for all $x$. Note also that:
- $\nabla _ x \varphi(x) = \nabla f(x) - \nabla f(y)$, and
- $\varphi(x)$ is convex, as it is the sum of a convex function, a constant and a linear function, and
- $\varphi$ is CL-smooth with the same parameters $L _ i$, since $\nabla \varphi$ and $\nabla f$ differ only by the constant $\nabla f(y)$.
Now we apply the overestimation result: if $h$ is CL-smooth with parameters $L _ i$, then for each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$,
\[h(x - \alpha \nabla _ i h(x)) \le h(x) - \alpha \left( 1 - \frac{L _ \max \alpha}{2} \right)\left( \frac{\partial h}{\partial x _ i}(x) \right)^2\]
where $\nabla _ i h(x) = \frac{\partial h}{\partial x _ i}(x) e _ i$. Taking $h = \varphi$ and $\alpha = 1/L _ \max$, and using $\varphi \ge 0$, we obtain
\[0 \le \varphi(x - \alpha \nabla _ i \varphi(x)) \le \varphi(x) - \frac{1}{2L _ \max} \left( \frac{\partial \varphi}{\partial x _ i}(x) \right)^2\]Then
\[\begin{aligned} \varphi(x) &\ge \frac{1}{2 L _ \max} \max _ i \left( \frac{\partial \varphi}{\partial x _ i}(x) \right)^2 \\\\ &\ge \frac{1}{2L _ \max} \frac 1 n \sum _ i \left( \frac{\partial \varphi}{\partial x _ i}(x) \right)^2 \\\\ &= \frac{1}{2L _ \max} \frac 1 n ||\nabla \varphi(x)||^2 \end{aligned}\]Plugging in the definition of $\varphi$, we obtain that for all $x, y$,
\[f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{1}{2nL _ \max} ||\nabla f(x) - \nabla f(y)||^2\]and also symmetrically
\[f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2nL _ \max} ||\nabla f(x) - \nabla f(y)||^2\]Summing these inequalities and rearranging, we have
\[\begin{aligned} \frac{1}{nL _ \max} ||\nabla f(y) - \nabla f(x)||^2 &\le \big(\nabla f(y) - \nabla f(x)\big)^\top (y - x) \\\\ &\stackrel{\text{CS} }\le ||\nabla f(y) - \nabla f(x)|| \cdot ||y-x|| \end{aligned}\]Rearranging further gives
\[||\nabla f(y) - \nabla f(x)|| \le n L _ \max ||y - x||\]Therefore $f$ is $L$-smooth with $L \le n L _ \max$ as required.
BL-smoothness
@Define what it means for a function with a natural block structure
\[f(x) = f(x _ {\mathcal B _ 1}, \ldots, x _ {\mathcal B _ \ell})\]
to be block Lipschitz smooth (BL-smooth) and define the blockwise Lipschitz constant $L _ \text{bmax}$.
For each block $\mathcal B$, there exists a Lipschitz constant $L _ \mathcal B$ such that
\[||\nabla _ {\mathcal B} f(x - \alpha \nabla _ {\mathcal B} f(x)) - \nabla _ {\mathcal B} f(x)|| \le L _ {\mathcal B} \times ||\alpha \nabla _ {\mathcal B} f(x)||\]for all $x \in \mathbb R^n$ and $\alpha \in \mathbb R$. Then
\[L _ \text{bmax} := \max _ \mathcal B L _ {\mathcal B}\]Suppose $f$ is CL-smooth and convex. @State a result concerning the blockwise Lipschitz constant $L _ \text{bmax}$.
$f$ is BL-smooth and
\[L _ \text{bmax} \le \max _ {\mathcal B} |\mathcal B| \times L _ \max\]Convergence of cyclic CD when not CL-smooth
Example of failure
Give an @example of an objective function $f : \mathbb R^3 \to \mathbb R$ (“Powell’s objective”) on which the cyclic CD method fails to converge to a stationary point and describe its properties.
This function is non-convex, continuously differentiable, and its minimisers are at the corners $(1, 1, 1)^\top$ and $(-1, -1, -1)^\top$ of the cube $[-1, 1]^3$.
General $L$-smooth, line search case
Recall the cyclic coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
where $\alpha _ k$ is chosen to minimise $f$ along that direction.
@State a result concerning the convergence of this algorithm in the general $L$-smooth case when $\alpha _ k$ is found by line search.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- $f$ is $L$-smooth
- $f$ is bounded below
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
- $d^k = -\frac{\partial}{\partial x _ i} f(x^k)e _ i$
Then the sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies:
\[\sum^\infty _ {k = 0} \big(\cos^2 \theta _ k \times ||\nabla f(x^k)||^2\big) < \infty\]In particular $\cos^2 \theta _ k \, ||\nabla f(x^k)||^2 \to 0$, so on any subsequence along which $\theta _ k$ stays bounded away from $\frac\pi 2$, we have $\nabla f(x^k) \to 0$, where
\[\theta _ k = \angle(-\nabla f(x^k), d^k)\]Recall the cyclic coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
where $\alpha _ k$ is chosen to minimise $f$ along that direction.
@Prove that then if
- $f$ is $L$-smooth
- $f$ is bounded below
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
- $d^k = -\frac{\partial}{\partial x _ i} f(x^k)e _ i$
Then the sequence of iterates $(x^k) _ {k \in \mathbb N}$ satisfies:
\[\sum^\infty _ {k = 0} \big(\cos^2 \theta _ k \times ||\nabla f(x^k)||^2\big) < \infty\]
where
\[\theta _ k = \angle(-\nabla f(x^k), d^k)\]
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
Overall idea: Since we have the Wolfe conditions, we might hope to apply the general results concerning any line search method and conclude $ \vert \vert \nabla f(x^k) \vert \vert \to 0$ as $k \to \infty$. However, these results depend on an assumption additional to the Wolfe conditions, namely that we have “okay descent directions”: there exists some $\eta > 0$ such that $-\nabla f(x^k)^\top d^k \ge \eta \vert \vert \nabla f(x^k) \vert \vert \cdot \vert \vert d^k \vert \vert $ for all $k$.
The issue is that this cannot be guaranteed for any $\eta > 0$; coordinate descent makes a very specific choice for the descent direction by taking $d^k = \nabla _ {\mathcal B _ k} f(x^k)$. However, we can repeat the analysis used when analysing line search methods but instead let $\eta$ depend on the current iteration:
\[-\nabla f(x^k)^\top d^k = \eta _ k ||\nabla f(x^k)|| \cdot ||d^k||\]By rearranging, we see that $\eta _ k = \cos \theta _ k$ where $\theta _ k = \angle (-\nabla f(x^k), d^k)$, i.e. the angle between the steepest descent direction and the direction we have chosen to descend along. This gives an analogue of the foundational inequality:
\[f(x^{k+1}) \le f(x^k) - \frac{c _ 1(1-c _ 2)}{L} \times \cos^2 \theta _ k \times ||\nabla f(x^k)||^2\]and by summing over this inequality, we obtain the result.
Bound step size: By the curvature condition, we have that
\[\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k\]By subtracting $\nabla f(x^k)^\top d^k$ from both sides, we see:
\[\begin{aligned} -(1-c _ 2) \nabla f(x^k)^\top d^k &\le \left(\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)\right)^\top d^k \\\\ &\stackrel{\text{C.S.} } \le ||\nabla f(x^k + \alpha _ k d^k) - \nabla f(x^k)|| \cdot ||d^k|| \\\\ &\stackrel{\text{L.S.} } \le L \cdot \alpha _ k ||d^k||^2 \end{aligned}\]Solving for $\alpha _ k$ implies that
\[\alpha _ k \ge -\frac{1-c _ 2}{L||d^k||^2} \cdot \nabla f(x^k) ^\top d^k\]and hence by Assumption 1,
\[\alpha _ k \ge \frac{\eta _ k(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||}\]Use bound on step size to bound derivative: Then
\[\begin{aligned} f(x^{k+1}) &= f(x^k + \alpha _ k d^k) && (1\star) \\\\ &\le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k && (2\star) \\\\ &\le f(x^k) + c _ 1 \frac{\eta _ k(1 - c _ 2)||\nabla f(x^k)||}{L||d^k||} \nabla f(x^k)^\top d^k &&(3\star) \\\\ &\le f(x^k) - \frac{c _ 1 (1 - c _ 2) \eta _ k^2}{L} ||\nabla f(x^k)||^2 && (4\star) \end{aligned}\]where $(1\star)$ is by definition, $(2\star)$ is by Assumption 2, $(3\star)$ is by the previous result, and $(4\star)$ is by Assumption 1.
Now solving for $\eta _ k^2 \vert \vert \nabla f(x^k) \vert \vert ^2$ gives
\[\eta _ k^2||\nabla f(x^k)||^2 \le \frac{L}{c _ 1(1 - c _ 2)}(f(x^k) - f(x^{k+1}))\]Telescoping sum: Summing over $k$, it follows:
\[\begin{aligned} \sum^{T-1} _ {k = 0} \eta _ k^2||\nabla f(x^k)||^2 &\le \frac{L}{c _ 1 (1 - c _ 2)} \sum^{T-1} _ {k = 0}(f(x^k) - f(x^{k+1})) \\\\ &= \frac{L}{c _ 1 (1 - c _ 2)} (f(x^0) - f(x^T)) \\\\ &\le \frac{L}{c _ 1(1 - c _ 2)} (f(x^0) - C) \end{aligned}\]where $C$ is a lower bound on $f$. Since $\eta _ k = \cos \theta _ k$ (so $\eta _ k^2 = \cos^2 \theta _ k$) and the bound above holds for all $T$,
\[\sum^\infty _ {k = 0} \big(\cos^2 \theta _ k \times ||\nabla f(x^k)||^2\big) < \infty\]Unique coordinatewise minima
Recall the cyclic coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State a result concerning the convergence of this algorithm under some additional assumptions.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- $f$ is continuously differentiable
- $f$ is bounded below
- For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.
Then:
- Every limit point of the sequence of iterates $x^k$ where $k = n, 2n, 3n, \ldots$ is a stationary point of $f$.
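As an illustration (not from the notes), when $f(x) = \frac 1 2 x^\top A x - b^\top x$ with $A$ symmetric positive definite, each coordinate minimum is unique and has a closed form, and cyclic CD with exact coordinate minimisation is precisely the Gauss-Seidel iteration for $Ax = b$. A minimal sketch, with an arbitrary illustrative test problem:

```python
import numpy as np

def cyclic_cd_exact(A, b, x0, n_sweeps=200):
    """Cyclic CD with exact coordinate minimisation on f(x) = 0.5 x^T A x - b^T x.

    The exact minimiser along coordinate i has a closed form, and the resulting
    method is the Gauss-Seidel iteration for the linear system A x = b.
    """
    x = np.array(x0, dtype=float)
    n = x.size
    for _ in range(n_sweeps):
        for i in range(n):                        # one sweep = one pass over coordinates 1..n
            r = b[i] - A[i] @ x + A[i, i] * x[i]  # b_i - sum_{j != i} A_ij x_j
            x[i] = r / A[i, i]                    # argmin_t f(x_1, ..., t, ..., x_n)
    return x

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)                # SPD, so each coordinate minimum is unique
b = rng.standard_normal(4)
x = cyclic_cd_exact(A, b, np.zeros(4))
print(np.allclose(A @ x, b, atol=1e-6))    # limit point is stationary: grad f = A x - b = 0
```

The printed check confirms that, up to the tolerance used, the limit point is a stationary point of $f$, as the result above guarantees.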
Convex, fixed-step case
Recall the cyclic coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
where $\alpha _ k$ is a fixed step size.
@State a result concerning the convergence in the convex case, under some additional assumptions.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- $f$ is CL-smooth with parameters $L _ i$
- There is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
- $f$ is convex
- $\alpha _ k = \frac{1}{L _ \text{max} }$
Then for $k \in \{n ,2n, 3n, \ldots\}$
\[f(x^k) - f(x^\ast) \le \frac{4n D^2 L _ \max\left(1 + n \frac{L^2}{L^2 _ \max}\right)}{k+8}\]$\gamma$-strongly convex, fixed step case
Recall the cyclic coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
where $\alpha _ k$ is a fixed step size.
@State a result concerning the convergence in the $\gamma$-strongly convex case under some additional assumptions.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- $f$ is CL-smooth with parameters $L _ i$
- $f$ is $\gamma$-strongly convex
- $\alpha _ k = \frac{1}{L _ \text{max} }$
Then for $k \in \{n ,2n, 3n, \ldots\}$:
\[f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L _ \max \left(1 + n\frac{L^2}{L _ \max^2}\right)}\right)^{k/n} (f(x^0) - f(x^\ast))\]Convergence of randomised CD with line search when not CL-smooth
General line search case
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{j\}$ where $j \sim \mathcal U(\{1, \ldots, n\})$.
(2): Calculate $d^k = -\nabla _ {\mathcal B _ k} f(x^k) = -\frac{\partial f}{\partial x _ j}(x^k) e _ j$
(3): Calculate
\[x^{k+1} = x^k + \alpha _ k d^k\]
where $\alpha _ k$ is chosen to minimise $f$ along that direction.
@State a result concerning the convergence of this algorithm in the non-convex $L$-smooth case when $\alpha _ k$ is found by line search and $f$ is not assumed to be CL-smooth.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{j\}$ where $j \sim \mathcal U(\{1, \ldots, n\})$.
(2): Calculate $d^k = -\nabla _ {\mathcal B _ k} f(x^k) = -\frac{\partial f}{\partial x _ j}(x^k) e _ j$
(3): Calculate
Suppose:
- $f$ is $L$-smooth
- $f$ is bounded below by $f _ \text{low}$
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
Then:
\[\min _ {0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}\]Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{j\}$ where $j \sim \mathcal U(\{1, \ldots, n\})$.
(2): Calculate $d^k = -\nabla _ {\mathcal B _ k} f(x^k) = -\frac{\partial f}{\partial x _ j}(x^k) e _ j$
(3): Calculate
\[x^{k+1} = x^k + \alpha _ k d^k\]
where $\alpha _ k$ is chosen to minimise $f$ along that direction.
@Prove that if:
- $f$ is $L$-smooth
- $f$ is bounded below by $f _ \text{low}$
- (Wolfe conditions) $\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
then:
\[\min _ {0 \le j \le k-1} \mathbb E[||\nabla f(X^j)||^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}\]
You may assume that as a consequence of the Wolfe conditions, we have the so-called “local foundational inequality”:
\[f(x^{k+1}) \le f(x^k) - \frac{c _ 1(1-c _ 2)}{L} \times \cos^2 \theta _ k \times ||\nabla f(x^k)||^2\]
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{j\}$ where $j \sim \mathcal U(\{1, \ldots, n\})$.
(2): Calculate $d^k = -\nabla _ {\mathcal B _ k} f(x^k) = -\frac{\partial f}{\partial x _ j}(x^k) e _ j$
(3): Calculate
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition) Assumption 2: $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition) Assumption 3: $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
This is broken down into three claims:
- $\mathbb E[\eta _ k^2 \mid X^k] = \frac 1 n$ where $\eta _ k = \cos(\angle(-\nabla f(X^k), D^k))$
- $\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{c _ 1 (1 - c _ 2)}{n L} \mathbb E[ \vert \vert \nabla f(X^k) \vert \vert ^2]$
- $\min _ {0 \le i \le k-1} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le \frac{nL(f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$
Claim 1: We have
\[\begin{aligned} \eta _ k &= \frac{|e^\top _ j \nabla f(X^k)|}{||e _ j|| \times ||\nabla f(X^k)||} \\\\ &= \frac{\left|\frac{\partial}{\partial x _ j} f(X^k)\right|}{||\nabla f(X^k)||} \\\\ &=: \hat d _ j^k \end{aligned}\]where $\hat d^k$ is the componentwise absolute value of the normalised vector
\[\frac{\nabla f(X^k)}{||\nabla f(X^k)||}\]In particular, $\hat d^k$ is also a unit vector, and since the coordinate $j$ is sampled uniformly,
\[\mathbb E[\eta _ k^2 \mid X^k] = \sum^n _ {i = 1} \frac{1}{n} (\hat d^k _ i)^2 = \frac 1 n ||\hat d^k||^2 = \frac 1 n\]and hence, by the tower property of expectation,
\[\mathbb E[\eta^2 _ k] = \mathbb E[\mathbb E[\eta^2 _ k \mid X^k]] = \frac 1 n\]as required.
Claim 2: Taking expectations given $X^k$ on both sides of the local foundational inequality yields
\[\begin{aligned} \mathbb E[f(X^{k+1}) \mid X^k] &\le f(X^k) - \frac{c _ 1 (1 - c _ 2)}{L} \times \mathbb E[\eta^2 _ k \mid X^k] \times ||\nabla f(X^k)||^2 \\\\ &\le f(X^k) - \frac{c _ 1(1 - c _ 2)}{nL}||\nabla f(X^k)||^2 \end{aligned}\]where the second inequality follows from Claim 1. Taking total expectations on both sides gives
\[\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{c _ 1 (1 - c _ 2)}{n L} \mathbb E[||\nabla f(X^k)||^2]\]as required.
Claim 3: Rearranging, we obtain
\[\mathbb E[||\nabla f(X^k)||^2] \le \frac{nL}{c _ 1(1-c _ 2)} \Big(\mathbb E[f(X^k)] - \mathbb E[f(X^{k+1})]\Big)\]Changing indices and summing over $i$, we obtain
\[\begin{aligned} \min _ {0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] &\le \frac{1}{k} \sum^{k-1} _ {i = 0}\frac{nL}{c _ 1(1-c _ 2)} \Big(\mathbb E[f(X^i)] - \mathbb E[f(X^{i+1})]\Big) \\\\ &\le \frac{nL (f(x^0) - f _ \text{low})}{kc _ 1(1-c _ 2)} \end{aligned}\]Convergence of fixed-step randomised CD when CL-smooth
Overestimation property when CL-smooth
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State:
- An overestimation property for each coordinate direction given $f$ is CL-smooth with parameters $L _ i$, and
- A result that follows from this which is useful in proving the convergence of randomised coordinate descent.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
For each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have
\[f(x - \alpha \nabla _ i f(x)) \le f(x) - \alpha \left( 1 - \frac{L _ \max \alpha}{2} \right)\left( \frac{\partial f}{\partial x _ i}(x) \right)^2\]where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$.
If randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha _ k > 0$, then
\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha _ k}{n} \left( 1 - \frac{L _ \max \alpha _ k}{2} \right)||\nabla f(X^k)||^2\]where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
@important~ (overestimation property)
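As an aside (not part of the notes), the coordinatewise overestimation property is easy to spot-check numerically: sample random points, coordinates and step sizes for a smooth test function and verify the inequality directly. A minimal sketch, where the logistic-type test function and the bound $L _ i \le \frac 1 4 \vert \vert A _ {:,i} \vert \vert ^2$ are illustrative assumptions:

```python
import numpy as np

# Spot-check of the coordinatewise overestimation property
#   f(x - a * grad_i f(x)) <= f(x) - a (1 - L_max a / 2) (df/dx_i(x))^2
# on a smooth logistic-type test function, at random points, coordinates and steps.
rng = np.random.default_rng(5)
m, n = 20, 6
A = rng.standard_normal((m, n))

f = lambda x: np.sum(np.log1p(np.exp(A @ x)))          # smooth and convex
grad = lambda x: A.T @ (1.0 / (1.0 + np.exp(-A @ x)))
# (d/dz) log(1 + e^z) is the sigmoid, whose derivative is at most 1/4,
# so L_i <= ||i-th column of A||^2 / 4 (any upper bound on L_max suffices here).
L_max = np.max(np.sum(A * A, axis=0)) / 4.0

ok = True
for _ in range(1000):
    x = rng.standard_normal(n)
    i = rng.integers(n)
    a = rng.uniform(0.0, 2.0 / L_max)
    gi = grad(x)[i]
    lhs = f(x - a * gi * np.eye(n)[i])
    rhs = f(x) - a * (1 - L_max * a / 2) * gi ** 2
    ok = ok and (lhs <= rhs + 1e-10)
print(ok)
```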
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
In this context, we have
An overestimation property for each coordinate direction given $f$ is CL-smooth with parameters $L _ i$:
For each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have
\[f(x - \alpha \nabla _ i f(x)) \le f(x) - \alpha \left( 1 - \frac{L _ \max \alpha}{2} \right)\left( \frac{\partial f}{\partial x _ i}(x) \right)^2\]
where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$.
and a result that follows from this which is useful in proving the convergence of randomised coordinate descent:
If randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha _ k > 0$, then
\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha _ k}{n} \left( 1 - \frac{L _ \max \alpha _ k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2\]
where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
Can you contrast this to the foundational inequality of steepest descent?
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
For each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have
where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$.
If randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha _ k > 0$, then
where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@Prove that if $f$ is CL-smooth with parameters $L _ i$, then for each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have:
\[f(x - \alpha \nabla _ i f(x)) \le f(x) - \alpha \left( 1 - \frac{L _ \max \alpha}{2} \right)\left( \frac{\partial f}{\partial x _ i}(x) \right)^2\]
where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$, and therefore that if randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha _ k > 0$, then
\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha _ k}{n} \left( 1 - \frac{L _ \max \alpha _ k}{2} \right)||\nabla f(X^k)||^2\]
where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
By the fundamental theorem of calculus applied to $g(t) := f(x + td)$, we have
\[\begin{aligned} f(x + d) &= f(x) + \int^1 _ 0 \nabla f(x + td)^\top d \text{ } \text dt\\\\ &= f(x) + \nabla f(x)^\top d + \int^1 _ 0 (\nabla f(x + td) - \nabla f(x))^\top d \text{ d}t \end{aligned}\]Let $g = \nabla _ i f(x)$ and take $d = -\alpha g$ to deduce
\[\begin{aligned} f(x - \alpha g) &= f(x) + \nabla f(x)^\top (-\alpha g) + \int^1 _ 0 [\nabla f(x - t \alpha g) - \nabla f(x)]^\top (-\alpha g)\text dt \\\\ &\stackrel{\text{C.S} }\le f(x) - \alpha g^\top \nabla f(x) + \alpha ||g|| \int^1 _ 0 ||\nabla f(x - t\alpha g) - \nabla f(x)||\text dt \end{aligned}\]Note also that
\[g^\top \nabla f(x) = \left[\frac{\partial f}{\partial x _ i} (x)\right] e _ i^\top \nabla f(x) = \left[\frac{\partial f}{\partial x _ i} (x)\right]^2\]by definition.
Finally,
\[\begin{aligned} ||\nabla f(x - t\alpha g) - \nabla f(x)|| &= ||\nabla f(x - t\alpha \frac{\partial f}{\partial x _ i}(x) e _ i) - \nabla f(x)|| \\\\ &\le L _ i \left| t\alpha \frac{\partial f}{\partial x _ i}(x) \right| \\\\ &\le L _ \max t\alpha \left| \frac{\partial f}{\partial x _ i}(x) \right| \end{aligned}\]Combining these results together gives
\[f(x - \alpha g) \le f(x) - \alpha \left[ \frac{\partial f}{\partial x _ i}(x) \right]^2 + L _ \max \alpha^2 \left[ \frac{\partial f}{\partial x _ i}(x) \right]^2 \int^1 _ 0 t \text dt\]and hence we have the first result.
To prove the second result, let $x = X^k$, $g = d^k = \nabla _ {\mathcal B _ k} f(X^k) = \frac{\partial f}{\partial x _ {\mathcal B _ k} } (X^k) e _ {\mathcal B _ k}$ and $\alpha = \alpha _ k$ in the above, and apply expectation with respect to $\mathcal B _ k$ on both sides. Therefore we have
\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \alpha _ k \left(1 - \frac{L _ \max \alpha _ k}{2}\right) \mathbb E _ {\mathcal B _ k}\left[\left(\frac{\partial f}{\partial x _ {\mathcal B _ k} }(X^k)\right)^2\right]\]where we also used that $X^{k+1} = X^k - \alpha _ k d^k$. Then we also have by the definition of expectation that
\[\begin{aligned} \mathbb E _ {\mathcal B _ k}\left[\left(\frac{\partial f}{\partial x _ {\mathcal B _ k} }(X^k)\right)^2\right] &= \sum^n _ {i = 1} \mathbb E\left[\left(\frac{\partial f}{\partial x _ {\mathcal B _ k} }(X^k)\right)^2 \text{ }\Bigg|\text{ } \mathcal B _ k = \{i\}\right] \mathbb P(\mathcal B _ k = \{i\}) \\\\ &= \sum^n _ {i = 1}\left(\frac{\partial f}{\partial x _ i }(X^k)\right)^2 \cdot \frac 1 n \\\\ &= \frac 1 n ||\nabla f(X^k)||^2 \end{aligned}\]and hence we have the second result.
@important~ (overestimation property)
General fixed-step, CL-smooth case
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State a result characterising the convergence of the randomised CD method with fixed step size in the non-convex case when $f$ is CL-smooth.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
- $f$ is bounded below by $f _ \text{low}$
- $f$ is CL-smooth with parameters $L _ i$
Then for all $k \ge 1$,
\[\min _ {0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}\]Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@Prove that if
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
- $f$ is bounded below by $f _ \text{low}$
- $f$ is CL-smooth with parameters $L _ i$
then for all $k \ge 1$,
\[\min _ {0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}\]
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Recall we have the overestimation property that in this context
\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha _ k}{n} \left( 1 - \frac{L _ \max \alpha _ k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2\]
where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
Writing $\mathbb E _ k$ for the (total) expectation over the block choices $\mathcal B _ 0, \ldots, \mathcal B _ k$, taking this expectation on both sides, using the tower property and substituting $\alpha _ k = 1/L _ \max$, we have
\[\mathbb E _ {k} [f(X^{k+1})] \le \mathbb E _ {k-1}[f(X^k)] - \frac{1}{2n L _ \max} \mathbb E _ {k-1}[||\nabla f(X^k)||^2]\]for all $k \ge 0$, which then rearranges to give
\[\mathbb E _ {i - 1} [f(X^i)] - \mathbb E _ i [f(X^{i+1})] \ge \frac{1}{2n L _ \max} \mathbb E _ {i-1}[||\nabla f(X^i)||^2]\]where $\mathbb E _ {-1}[f(X^0)] = f(x^0)$ since $x^0$ is deterministic.
Then we sum this up from $i = 0$ to $i = k$ and deduce from the telescoping sum that
\[f(x^0) - \mathbb E _ k[f(X^{k+1})] \ge \frac{1}{2nL _ \max} \sum^{k} _ {i = 0} \mathbb E _ {i - 1} [||\nabla f(X^i)||^2]\]Using $f(X^{k+1}) \ge f _ \text{low}$ and bounding each term of the sum below by $\min _ {0 \le i \le k} \mathbb E _ {i-1}[ \vert \vert \nabla f(X^i) \vert \vert ^2]$, we obtain
\[f(x^0) - f _ \text{low} \ge \frac{1}{2nL _ \max} (k+1) \min _ {0 \le i \le k} \mathbb E _ {i-1} [||\nabla f(X^i)||^2]\]which gives the result with $k+1$ instead of $k$.
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
One result is that if:
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
- $f$ is bounded below by $f _ \text{low}$
- $f$ is CL-smooth with parameters $L _ i$
then for all $k \ge 1$,
\[\min _ {0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}\]
This only shows that
\[\liminf_{i \to \infty} \mathbb E[||\nabla f(X^i)||] = 0\]
@State the result which allows you to actually prove that
\[\lim_{i \to \infty} \mathbb E[||\nabla f(X^i)||] = 0\]
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
From the telescoping argument above, $\sum^{k} _ {i=0} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le 2nL _ \max (f(x^0) - f _ \text{low})$ for every $k$. Taking $k \to \infty$ lets you conclude the sum is finite, and hence the summands tend to zero: $\mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \to 0$, and therefore $\mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ] \to 0$ by Jensen’s inequality.
Convex, fixed-step, CL-smooth case
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “coordinatewise $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x _ i}(x + te _ i) - \frac{\partial f}{\partial x _ i}(x) \right| \le L _ i |t|\]
and define
\[L _ \max := \max _ {i \in \\{ 1, \ldots, n \\} } L _ i\]
@State an additional assumption on $f$ used to prove the convergence of coordinate descent when $f$ is convex.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
There is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State a result characterising the convergence of the randomised CD method with fixed step size in the convex case under some additional assumptions
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
- $f$ is bounded below by $f _ \text{low}$
- $f$ is convex
- $f$ is CL-smooth with parameters $L _ i$
- There is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$
Then:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \max D^2}{k}\]Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@Prove that if
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$, where $\mathcal B _ k$ is chosen uniformly at random and with replacement and $\alpha _ k := \frac{1}{L _ \max}$
- $f$ is convex
- $f$ is bounded below by $f _ \text{low}$
- $f$ is CL-smooth with parameters $L _ i$
- There is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
Then:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \max D^2}{k}\]
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Recall we have the overestimation property that in this context
\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha _ k}{n} \left( 1 - \frac{L _ \max \alpha _ k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2\]
where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
Writing $\mathbb E _ k$ for the (total) expectation over the block choices $\mathcal B _ 0, \ldots, \mathcal B _ k$, taking this expectation on both sides, using the tower property and substituting $\alpha _ k = 1/L _ \max$, we have
\[\mathbb E _ {k} [f(X^{k+1})] \le \mathbb E _ {k-1}[f(X^k)] - \frac{1}{2n L _ \max} \mathbb E _ {k-1}[||\nabla f(X^k)||^2]\]for all $k \ge 0$. Since $\mathbb E _ {k-1} [f(X^k)] = \mathbb E _ {k} [f(X^k)] = \mathbb E [f(X^k)]$ (and similarly for the gradient term), subtracting $f(x^\ast)$ from both sides we have
\[\mathbb E[f(X^{k+1})] - f(x^\ast) \le \mathbb E[f(X^k)] - f(x^\ast) - \frac{1}{2nL_\max} \mathbb E[||\nabla f(X^k)||^2]\]Since $f$ is convex, we also have that
\[\begin{aligned} \quad&f(x^\ast) \ge f(X^k) + \nabla f(X^k)^\top(x^\ast - X^k) \\\\ \implies&f(X^k) - f(x^\ast) \le \nabla f(X^k)^\top(X^k - x^\ast) \\\\ \implies&f(X^k) - f(x^\ast) \le ||\nabla f(X^k)|| \cdot ||X^k - x^\ast|| \\\\ \implies&f(X^k) - f(x^\ast) \le D||\nabla f(X^k)|| \end{aligned}\]where the final inequality follows from the fact that
\[f(X^k) \le f(x^0)\]for all $k$ (each step decreases $f$, by the overestimation property), and so $X^k$ lies in the $f(x^0)$-sublevel set of $f$. Squaring this, passing to total expectation, and using Jensen’s inequality, we have:
\[\begin{aligned} \mathbb E[(f(X^k) - f(x^\ast))]^2 &\le \mathbb E[(f(X^k) - f(x^\ast))^2] \\\\ &\le D^2 \mathbb E[||\nabla f(X^k)||^2] \end{aligned}\]Going back to the original overestimation property, this implies
\[\begin{aligned} \mathbb E[f(X^{k+1})] - f(x^\ast) &\le \mathbb E[f(X^k)] - f(x^\ast) - \frac{1}{2nL_\max} \mathbb E[||\nabla f(X^k)||^2] \\\\ &\le \mathbb E[f(X^k)] - f(x^\ast) - \frac{1}{2nL_\max} \frac{1}{D^2} (\mathbb E[f(X^k)] - f(x^\ast))^2 \end{aligned}\]To make this simpler, let
\[\Delta_k := \mathbb E[f(X^k)] - f(x^\ast)\]and so we have
\[\Delta_{k+1} \le \Delta_k - \frac{1}{2nL_\max D^2} \Delta_k^2\]We now aim to show by solving the recurrence that
\[\mathbb E[f(X^k)] - f(x^\ast) = \Delta_k \le \frac{2nL_\max D^2}{k}\]To do this, note that
\[\begin{aligned} \frac{1}{\Delta_{k+1}} - \frac{1}{\Delta_k} &= \frac{\Delta_k - \Delta_{k+1}}{\Delta_k \Delta_{k+1} } \\\\ &\ge \frac{\Delta_k - \Delta_{k+1}}{\Delta_k^2} \\\\ &= \frac{\mathbb E[f(X^k)] - \mathbb E[f(X^{k+1})]}{(\mathbb E[f(X^k)] - f(x^\ast))^2} \\\\ &\ge \frac{1}{2nL_\max D^2} \end{aligned}\]Summing over $k$, we obtain
\[\frac{k}{2nL_\max D^2} \le \sum^{k-1}_{i=0} \frac{1}{\Delta_{i+1}} - \frac{1}{\Delta_i} = \frac{1}{\Delta_k} - \frac{1}{\Delta_0} \le \frac{1}{\Delta_k} = \frac{1}{\mathbb E[f(X^k)] - f(x^\ast)}\]and then reciprocating both sides we obtain
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL_\max D^2}{k}\]$\gamma$-strongly convex, fixed-step, CL-smooth case
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the $\gamma$-strongly convex case under some additional assumptions.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- $f$ is CL-smooth with parameters $L _ i$
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
Then: for all $k \ge 0$,
\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \max} \right)^k (f(x^0) - f(x^\ast))\]Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@Prove that if:
- $f$ is CL-smooth with parameters $L _ i$
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
then for all $k \ge 0$,
\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \max} \right)^k (f(x^0) - f(x^\ast))\]
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Note that all the conditions required for the convergence of CD in the general case are also met, so following that proof and defining
\[\Delta _ k := \mathbb E[f(X^k)] - f(x^\ast)\]we have
\[\Delta _ {k + 1} \le \Delta _ k - \frac{1}{2n L _ \max} \mathbb E[||\nabla f(X^k)||^2]\]As $f$ is $\gamma$-strongly convex, we have that
\[f(x) - f(x^\ast) \le \frac 1 {2\gamma} ||\nabla f(x)||^2\]for all $x$. Therefore, letting $x := X^k$ and passing to total expectation, we have
\[\Delta _ k = \mathbb E[f(X^k)] - f(x^\ast) \le \frac{1}{2\gamma} \mathbb E[||\nabla f(X^k)||^2]\]Hence, by substituting into the above inequality $\Delta _ {k+1} \le \Delta _ k - \cdots$, we obtain
\[\Delta _ {k+1} \le \Delta _ k - \frac{2\gamma}{2nL _ \max} \Delta _ k\]where $k \ge 0$, and hence
\[\Delta _ {k+1} \le \left( 1 - \frac{\gamma}{nL _ \max} \right) \Delta _ k\]for all $k \ge 0$ also. Since $\gamma \le L \le nL _ \max$, the convergence factor is in $(0, 1]$ and so the result follows inductively.
Convergence of randomised CD with line search when CL-smooth
Componentwise Wolfe conditions
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
If $f$ is CL-smooth, one choice for $\alpha _ k$ is $1/L _ \text{max}$, but it might be difficult to know $L _ \text{max}$ in advance. One approach is to use line search methods to generate $\alpha _ k$ satisfying the Wolfe conditions:
$\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition): $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition): $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
@State a weakened form of the Wolfe conditions that are useful for analysing componentwise methods.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
$\alpha _ k$ is neither too short nor too long:
- There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$:
- (Armijo condition): $f(x^k + \alpha _ k d^k) \le f(x^k) + c _ 1 \alpha _ k \nabla f(x^k)^\top d^k$, and
- (Curvature condition): $\nabla f(x^k + \alpha _ k d^k)^\top d^k \ge c _ 2 \nabla f(x^k)^\top d^k$
There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$ and $1 \le i \le n$,
- $f\left(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i\right) \le f(x^k) - c _ 1 \alpha _ k \left( \frac{\partial f(x^k)}{\partial x _ i} \right)^2$
- $\left(\frac{\partial f(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i)}{\partial x _ i}\right) \left(-\frac{\partial f(x^k)}{\partial x _ i} \right) \ge -c _ 2 \left(\frac{\partial f(x^k)}{\partial x _ i}\right)^2$
(these are just the Wolfe conditions with $d^k$ as $-\frac{\partial f(x^k)}{\partial x _ i} e _ i$ and all the $\nabla f(x^k)$ replaced by $\frac{\partial f(x^k)}{\partial x _ i}$)
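In practice, a step satisfying at least the first (sufficient decrease) condition can be found by simple backtracking along the chosen coordinate; enforcing the curvature condition as well typically requires a bracketing search, which is omitted here. A minimal, hedged sketch, where the helper name, constants and test problem are illustrative assumptions rather than part of the notes:

```python
import numpy as np

def coordinate_armijo_step(f, dfi, x, i, c1=1e-4, alpha0=1.0, shrink=0.5, max_iter=50):
    """Backtracking along coordinate i: shrink alpha until the first (sufficient
    decrease) condition above holds. A full Wolfe search would also enforce the
    second (curvature) condition, e.g. by bracketing; that is omitted here.
    """
    g = dfi(x, i)                                    # df/dx_i at the current point
    alpha = alpha0
    for _ in range(max_iter):
        x_new = x.copy()
        x_new[i] -= alpha * g
        if f(x_new) <= f(x) - c1 * alpha * g ** 2:   # componentwise Armijo condition
            return alpha, x_new
        alpha *= shrink
    return alpha, x_new

# Illustrative use on f(x) = 0.5 ||A x - b||^2 with randomly chosen coordinates
rng = np.random.default_rng(6)
A, b = rng.standard_normal((15, 5)), rng.standard_normal(15)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
dfi = lambda x, i: A[:, i] @ (A @ x - b)

x = np.zeros(5)
for k in range(500):
    i = rng.integers(5)
    _, x = coordinate_armijo_step(f, dfi, x, i)
print(f(x))
```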
Overestimation property when CL-smooth and using line search
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
If $f$ is CL-smooth, one choice for $\alpha _ k$ is $1/L _ \text{max}$, but it might be difficult to know $L _ \text{max}$ in advance. Hence we have the following variant of the Wolfe conditions in order to analyse line search methods:
There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$ and $1 \le i \le n$,
- $f\left(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i\right) \le f(x^k) - c _ 1 \alpha _ k \left( \frac{\partial f(x^k)}{\partial x _ i} \right)^2$
- $\left(\frac{\partial f(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i)}{\partial x _ i}\right) \left(-\frac{\partial f(x^k)}{\partial x _ i} \right) \ge -c _ 2 \left(\frac{\partial f(x^k)}{\partial x _ i}\right)^2$
@State:
- An overestimation property for each coordinate direction given $f$ is CL-smooth with parameters $L _ i$, and
- A result that follows from this which is useful in proving the convergence of randomised coordinate descent with line search.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
There exist constants $0 < c _ 1 < c _ 2 < 1$ such that for all $k \in \mathbb N$ and $1 \le i \le n$,
- $f\left(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i\right) \le f(x^k) - c _ 1 \alpha _ k \left( \frac{\partial f(x^k)}{\partial x _ i} \right)^2$
- $\left(\frac{\partial f(x^k - \alpha _ k \frac{\partial f(x^k)}{\partial x _ i} e _ i)}{\partial x _ i}\right) \left(-\frac{\partial f(x^k)}{\partial x _ i} \right) \ge -c _ 2 \left(\frac{\partial f(x^k)}{\partial x _ i}\right)^2$
For each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha$ chosen to satisfy the componentwise Wolfe conditions,
\[f(x - \alpha \nabla _ i f(x)) \le f(x) - \frac{c _ 1(1-c _ 2)}{L _ i} \left( \frac{\partial f}{\partial x _ i}(x) \right)^2\]If randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha _ k > 0$, then
\[\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{c _ 1(1-c _ 2)}{nL _ \text{max} } \mathbb E[||\nabla f(X^k)||^2]\]@important~ (overestimation property)
General line search CL-smooth case
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State a result characterising the convergence of the randomised CD method with $\alpha _ k$ found by line search in the non-convex case when $f$ is assumed to be CL-smooth.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k$ is chosen to satisfy the componentwise Wolfe conditions
- $f$ is bounded below by $f _ \text{low}$
- $f$ is CL-smooth with parameters $L _ i$
Then for all $k \ge 1$,
\[\min _ {0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{nL _ \text{max} (f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}\]
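A sketch of where this bound comes from (a reconstruction of the standard argument, not quoted from the notes): telescoping the expected-decrease inequality from the overestimation property above and using $f \ge f _ \text{low}$,
\[\frac{c _ 1(1-c _ 2)}{nL _ \text{max} } \sum _ {i=0}^{k-1} \mathbb E[||\nabla f(X^i)||^2] \le \sum _ {i=0}^{k-1} \left(\mathbb E[f(X^i)] - \mathbb E[f(X^{i+1})]\right) = f(x^0) - \mathbb E[f(X^k)] \le f(x^0) - f _ \text{low},\]
and since the minimum of the $k$ terms on the left is at most their average, the stated bound follows.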
Convex, line search, CL-smooth case
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State a result characterising the convergence of the randomised CD method with $\alpha _ k$ found by line search in the convex case under some additional assumptions.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k$ is chosen to satisfy the componentwise Wolfe conditions
- $f$ is bounded below by $f _ \text{low}$
- $f$ is convex
- $f$ is CL-smooth with parameters $L _ i$
- There is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$
Then for all $k \ge 1$,
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}\]
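A sketch of the key extra step in the convex case (again a reconstruction, not quoted from the notes): convexity together with the bounded level-set assumption gives
\[f(x) - f(x^\ast) \le \nabla f(x)^T(x - x^\ast) \le ||\nabla f(x)|| \, ||x - x^\ast|| \le D \, ||\nabla f(x)||\]
for all $x$ in the level set of $x^0$, so $\mathbb E[||\nabla f(X^k)||^2] \ge (\mathbb E[f(X^k)] - f(x^\ast))^2 / D^2$ by Jensen's inequality. Feeding this into the expected-decrease inequality yields the recursion $\Delta _ {k+1} \le \Delta _ k - \frac{c _ 1(1-c _ 2)}{nL _ \text{max} D^2} \Delta _ k^2$ for $\Delta _ k = \mathbb E[f(X^k)] - f(x^\ast)$, which decays at the stated $O(1/k)$ rate.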
$\gamma$-strongly convex, line search, CL-smooth case
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State a result characterising the convergence of the randomised CD method with $\alpha _ k$ found by line search in the $\gamma$-strongly convex case under some additional assumptions.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
Suppose:
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k$ is chosen to satisfy the componentwise Wolfe conditions
- $f$ is bounded below by $f _ \text{low}$
- $f$ is $\gamma$-strongly convex
- $f$ is CL-smooth with parameters $L _ i$
Then for all $k \ge 1$,
\[\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} }\right)^k(f(x^0) - f(x^\ast))\]
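A sketch of how the linear rate arises (a reconstruction, not quoted from the notes): $\gamma$-strong convexity implies
\[||\nabla f(x)||^2 \ge 2\gamma\,(f(x) - f(x^\ast)),\]
so the expected-decrease inequality gives $\mathbb E[f(X^{k+1})] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} }\right)\left(\mathbb E[f(X^k)] - f(x^\ast)\right)$, and iterating from $k = 0$ gives the stated bound.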
We have the following results for the convergence of fixed-step randomised coordinate descent when the objective is CL-smooth:
- General case: $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{2n L _ \max (f(x^0) - f _ \text{low})}{k}$
- Convex case: $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL _ \text{max}D^2}{k}$
- $\gamma$-strongly convex case: $\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL _ \max} \right)^k (f(x^0) - f(x^\ast))$
How do things change when the step size $\alpha _ k$ is instead found by line search (satisfying the componentwise Wolfe conditions), and what is the general rule?
- General case: $\min _ {0 \le i \le k-1} \mathbb E[\vert\vert\nabla f(X^i)\vert\vert^2] \le \frac{nL _ \text{max} (f(x^0) - f _ \text{low})}{k c _ 1 (1 - c _ 2)}$
- Convex case: $\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}$
- $\gamma$-strongly convex case: $\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} }\right)^k(f(x^0) - f(x^\ast))$
Whenever $nL _ \text{max}$ appears in the fixed-step version, $\frac{nL _ \text{max} }{2c _ 1(1-c _ 2)}$ appears in the line-search version.
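As a quick check of this rule (a verification, not from the notes), in the $\gamma$-strongly convex case the substitution turns the fixed-step contraction factor into
\[1 - \frac{\gamma}{nL _ \text{max}/(2c _ 1(1-c _ 2))} = 1 - \frac{2\gamma c _ 1(1-c _ 2)}{nL _ \text{max} },\]
which is exactly the line-search factor; similarly $2nL _ \text{max}$ becomes $\frac{nL _ \text{max} }{c _ 1(1-c _ 2)}$ in the general and convex cases, matching the numerators above.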
Convergence of randomised block coordinate descent
Blockwise overestimation property
Recall the randomised coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
@State:
- An overestimation property for each block given $f$ is BL-smooth with blockwise Lipschitz constant $L _ \text{bmax}$, and
- A result that follows from this which is useful in proving the convergence of randomised blockwise coordinate descent when blocks are chosen uniformly from a fixed set $\mathcal B _ 1, \ldots, \mathcal B _ \ell$ and $\alpha _ k > 0$
- Another result that follows from this which is useful in proving the convergence of randomised blockwise coordinate descent when blocks are chosen randomly and have fixed size $n _ 0$.
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
For each block $\mathcal B \subset \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha > 0$, BL-smoothness gives
\[f(x - \alpha \nabla _ {\mathcal B} f(x)) \le f(x) - \alpha\left(1 - \frac{L _ \text{bmax} \alpha}{2}\right)||\nabla _ {\mathcal B} f(x)||^2\]
If blocks are chosen uniformly from a fixed set $\mathcal B _ 1, \ldots, \mathcal B _ \ell$ and $\alpha _ k > 0$:
\[\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{\alpha _ k}{\ell} \left(1 - \frac{L _ \text{bmax} \alpha _ k}{2}\right) \mathbb E[||\nabla f(X^k)||^2]\]
If blocks are chosen uniformly at random with fixed size $n _ 0$ and $\alpha _ k > 0$:
\[\mathbb E[f(X^{k+1})] \le \mathbb E[f(X^k)] - \frac{\alpha _ k n _ 0}{n} \left(1 - \frac{L _ \text{bmax} \alpha _ k}{2}\right) \mathbb E[||\nabla f(X^k)||^2]\]
@important~ (overestimation property)
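The step that links the blockwise decrease to the full gradient norm in the two expectation bounds above is the following identity (a reconstruction; the first case assumes the fixed blocks $\mathcal B _ 1, \ldots, \mathcal B _ \ell$ partition $\{1, \ldots, n\}$). For a block chosen uniformly from the fixed partition,
\[\mathbb E[||\nabla _ {\mathcal B} f(x)||^2] = \frac{1}{\ell} \sum _ {j=1}^{\ell} ||\nabla _ {\mathcal B _ j} f(x)||^2 = \frac{1}{\ell} ||\nabla f(x)||^2,\]
while for a block of fixed size $n _ 0$ chosen uniformly at random, each index is included with probability $n _ 0 / n$, so
\[\mathbb E[||\nabla _ {\mathcal B} f(x)||^2] = \sum _ {i=1}^{n} \mathbb P(i \in \mathcal B) \left(\frac{\partial f}{\partial x _ i}(x)\right)^2 = \frac{n _ 0}{n} ||\nabla f(x)||^2.\]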
$\gamma$-strongly convex, fixed step, BL-smooth case
Recall the randomised block coordinate descent algorithm:
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select block $\mathcal B _ k \subset \{1, \ldots, n\}$ according to some sampling scheme.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(d^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k d^k\]
In this context, @state a result characterising the convergence of the randomised block CD method with fixed step size in the $\gamma$-strongly convex case under some additional assumptions, when:
- Blocks are chosen from some fixed set $\{\mathcal B _ 1, \ldots, \mathcal B _ \ell\}$.
- Blocks are chosen to have size $n _ 0$
To calculate $\min _ {x \in \mathbb R^n} f(x)$, pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select block $\mathcal B _ k \subset \{1, \ldots, n\}$ according to some sampling scheme.
(2): Calculate $d^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
(3): Calculate
- $f$ is BL-smooth with parameter $L _ \text{bmax}$
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- $\alpha _ k = \frac{1}{L _ \text{bmax} }$.
Blocks are chosen from some fixed set $\{\mathcal B _ 1, \ldots, \mathcal B _ \ell\}$:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{\gamma}{\ell L _ \text{bmax} }\right)^k(f(x^0) - f(x^\ast))\]
Blocks are chosen to have size $n _ 0$:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \left(1 - \frac{n _ 0 \gamma}{n L _ \text{bmax} }\right)^k(f(x^0) - f(x^\ast))\]
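Here is a minimal Python sketch of the fixed-step randomised block CD variant with size-$n _ 0$ blocks, using an illustrative strongly convex quadratic. The test problem, the (conservative) $L _ \text{bmax}$ estimate, and all names are assumptions for illustration, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative strongly convex quadratic: f(x) = 0.5 * x^T A x - b^T x
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)            # symmetric positive definite
b = rng.standard_normal(n)

def f(x):
    return 0.5 * x @ A @ x - b @ x

def block_grad(x, block):
    """Gradient zeroed outside the chosen block, as in step (2)."""
    d = np.zeros(n)
    d[block] = (A @ x - b)[block]
    return d

# For this quadratic, the largest eigenvalue of A bounds every blockwise
# Lipschitz constant, so it is a safe (if conservative) choice for L_bmax.
L_bmax = np.linalg.eigvalsh(A)[-1]

def randomised_block_cd(x0, n0=3, iters=2000):
    x = x0.copy()
    alpha = 1.0 / L_bmax                                # fixed step 1 / L_bmax
    for _ in range(iters):
        block = rng.choice(n, size=n0, replace=False)   # uniform size-n0 block
        x = x - alpha * block_grad(x, block)
    return x

x_star = np.linalg.solve(A, b)
x_bcd = randomised_block_cd(np.zeros(n))
print(f(x_bcd) - f(x_star))        # optimality gap should be small
```

With the step size $1/L _ \text{bmax}$ used above, the optimality gap should contract roughly at the $\left(1 - \frac{n _ 0 \gamma}{n L _ \text{bmax} }\right)$ per-iteration rate stated in this card.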
Comparisons between cyclic CD and randomised CD
Rather than comparing iteration complexity for coordinate descent versus standard gradient descent, what does it make sense to compare instead?
Epoch complexity (one epoch $=$ $n$ single-coordinate iterations), since each iteration of standard gradient descent requires $O(n)$ times more work than a single coordinate update.
Here is one result concerning randomised coordinate descent in the convex case:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{nL _ \text{max}D^2}{kc _ 1 (1 - c _ 2)}\]
Here is another concerning cyclic coordinate descent in the convex case:
\[f(x^k) - f(x^\ast) \le \frac{4nD^2 L _ \max (1 + n\frac{L^2}{L^2 _ \max})}{k+8}\]
Why might randomised coordinate descent be preferred over cyclic coordinate descent despite the convergence being of the same order (sublinear)?
Since $L \ge L _ \text{max}$, the factor $n L^2 / L _ \text{max}^2 \ge n$, so the constant in the cyclic bound carries an extra factor of $O(n)$ compared with the randomised bound.
Why might you prefer
- Standard gradient descent with line search
over
- Randomised coordinate descent with line search
when $f$ is expensive to evaluate?
Randomised coordinate descent requires $O(n)$ times more function evaluations for its line searches: each coordinate line search still requires evaluations of the full objective $f$, and one epoch of coordinate descent performs $n$ line searches, compared with one per iteration of gradient descent.