Notes - Optimisation for Data Science HT25, Coordinate descent



Flashcards

Basic definitions

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, can you @state the coordinate descent @algorithm?


Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha_k g^k\]
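
As a concrete illustration, here is a minimal NumPy sketch of this loop (not part of the notes: the callables `grad`, `select_block` and `alpha` are hypothetical placeholders to be supplied by the user):

```python
import numpy as np

def coordinate_descent(grad, x0, select_block, alpha, num_iters):
    # grad(x): full gradient of f at x
    # select_block(k): returns the index set B_k (iterable of coordinates)
    # alpha(k): returns the step size alpha_k
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        B_k = list(select_block(k))        # step (1): choose the block
        g_k = np.zeros_like(x)             # step (2): partial gradient,
        g_k[B_k] = grad(x)[B_k]            #           zero outside B_k
        x = x - alpha(k) * g_k             # step (3): take the step
    return x
```

In practice one would evaluate only the partial derivatives indexed by $\mathcal B _ k$ rather than the full gradient; the sketch calls `grad` only for simplicity.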

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

@Define what is meant by:

  • randomised,
  • cyclic, and
  • Gauss-Southwell

coordinate descent in this context.


  • Randomised: pick $\mathcal B _ k = \{i _ k\}$, where $i _ k$ is drawn uniformly at random from $\{1, \ldots, n\}$ (so $ \vert \mathcal B _ k \vert = 1$)
  • Cyclic: cycle through the coordinates in order, $\mathcal B _ 0 = \{1\}$, $\mathcal B _ 1 = \{2\}$, ..., $\mathcal B _ {n-1} = \{n\}$, $\mathcal B _ n = \{1\}$, and so on
  • Gauss-Southwell: pick $ \vert \mathcal B _ k \vert = 1$ with $\mathcal B _ k = \{\text{argmax} _ {1 \le i \le n} \vert \frac{\partial f}{\partial x _ i}(x^k) \vert \}$, i.e. the coordinate with the largest absolute partial derivative (all three rules are sketched below).
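
A minimal sketch of the three selection rules (an illustration, not from the notes; `grad` is an assumed callable returning $\nabla f(x)$, and indices here are $0$-based):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomised_rule(k, x, grad, n):
    # pick a single coordinate uniformly at random from {0, ..., n - 1}
    return {int(rng.integers(n))}

def cyclic_rule(k, x, grad, n):
    # cycle through the coordinates 0, 1, ..., n - 1, 0, 1, ... in order
    return {k % n}

def gauss_southwell_rule(k, x, grad, n):
    # pick the coordinate with the largest absolute partial derivative at x
    return {int(np.argmax(np.abs(grad(x))))}
```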

Convergence of randomised CD

Assumptions on $f$

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

@State the assumptions on $f$ used to prove the convergence of this algorithm.


Each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

Let

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State a result that relates the $L$-smoothness of $f$ to $L _ \max$.


\[L_\max \le L \le n L_\max\]
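
For example, for a convex quadratic $f(x) = \frac 1 2 x^\top A x$ with $A$ symmetric positive semidefinite, $\frac{\partial f}{\partial x _ i}(x + te _ i) - \frac{\partial f}{\partial x _ i}(x) = t A _ {ii}$, so $L _ i = A _ {ii}$ and $L = \lambda _ \max(A)$, and the bounds can be checked numerically (a sketch, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M                          # random symmetric positive semidefinite matrix

L_i = np.diag(A)                     # coordinate-wise Lipschitz constants for f(x) = 0.5 x'Ax
L_max = L_i.max()
L = np.linalg.eigvalsh(A).max()      # global Lipschitz constant of the gradient

print(L_max, L, n * L_max)           # observe L_max <= L <= n * L_max
```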

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

Let

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@Prove that if $f$ is $L$-smooth, then

\[L_\max \le L \le n L_\max\]

@todo, after sheet 4.

Overestimation property

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State an overestimation property for each coordinate direction, and then a result that follows from it which is useful in proving the convergence of randomised coordinate descent.


For each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have

\[f(x - \alpha \nabla_i f(x)) \le f(x) - \alpha \left( 1 - \frac{L_\max \alpha}{2} \right)\left( \frac{\partial f}{\partial x_i}(x) \right)^2\]

where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$.


If randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha^k > 0$, then

\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L_\max \alpha^k}{2} \right)||\nabla f(X^k)||^2\]

where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
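
Since $\mathcal B _ k$ is uniform over the $n$ single coordinates, the conditional expectation on the left-hand side is just an average over $i = 1, \ldots, n$, so the inequality is easy to sanity-check numerically, e.g. on a quadratic $f(x) = \frac 1 2 x^\top A x$, for which $L _ \max = \max _ i A _ {ii}$ (a sketch, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M.T @ M                          # f(x) = 0.5 x'Ax with A symmetric PSD

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L_max = np.diag(A).max()

x = rng.standard_normal(n)
alpha = 1.0 / L_max

# left-hand side: average of f after a single-coordinate step, over all n coordinates
lhs = np.mean([f(x - alpha * grad(x)[i] * np.eye(n)[i]) for i in range(n)])
# right-hand side: the guaranteed expected decrease
rhs = f(x) - (alpha / n) * (1 - L_max * alpha / 2) * np.linalg.norm(grad(x)) ** 2

print(lhs, rhs)                      # lhs should not exceed rhs
```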

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@Prove that then for each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have

\[f(x - \alpha \nabla_i f(x)) \le f(x) - \alpha \left( 1 - \frac{L_\max \alpha}{2} \right)\left( \frac{\partial f}{\partial x_i}(x) \right)^2\]

where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$, and therefore that if randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha^k > 0$, then

\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L_\max \alpha^k}{2} \right)||\nabla f(X^k)||^2\]

where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.


Let $g = \nabla _ i f(x)$. We use the Taylor expansion

\[f(x + d) = f(x) + \nabla f(x)^\top d + \int^1_0 d^\top(\nabla f(x + td) - \nabla f(x)) \text dt\]

with $d := -\alpha g$ to deduce

\[\begin{aligned} f(x - \alpha g) &= f(x) + \nabla f(x)^\top (-\alpha g) + \int^1_0 [\nabla f(x - t \alpha g) - \nabla f(x)]^\top (-\alpha g)\text dt \\\\ &\le f(x) - \alpha g^\top \nabla f(x) + \alpha \left| \frac{\partial f}{\partial x_i}(x) \right| \int^1_0 \left| \frac{\partial f}{\partial x_i}(x - t\alpha g) - \frac{\partial f}{\partial x_i}(x) \right| \text dt \end{aligned}\]

where the inequality holds because $g = \frac{\partial f}{\partial x _ i}(x) e _ i$ is parallel to $e _ i$, so only the $i$-th component of the gradient difference contributes to the inner product, which we then bound by its absolute value.

Note also that

\[g^\top \nabla f(x) = \left[\frac{\partial f}{\partial x_i} (x)\right] e_i^\top \nabla f(x) = \left[\frac{\partial f}{\partial x_i} (x)\right]^2\]

by definition.

Finally, by the directional Lipschitz assumption,

\[\begin{aligned} \left| \frac{\partial f}{\partial x_i}(x - t\alpha g) - \frac{\partial f}{\partial x_i}(x) \right| &= \left| \frac{\partial f}{\partial x_i}\left(x - t\alpha \frac{\partial f}{\partial x_i}(x) e_i\right) - \frac{\partial f}{\partial x_i}(x) \right| \\\\ &\le L_i \left| t\alpha \frac{\partial f}{\partial x_i}(x) \right| \\\\ &\le L_\max t\alpha \left| \frac{\partial f}{\partial x_i}(x) \right| \end{aligned}\]

Combining these results together gives

\[f(x - \alpha g) \le f(x) - \alpha \left[ \frac{\partial f}{\partial x_i}(x) \right]^2 + L_\max \alpha^2 \left[ \frac{\partial f}{\partial x_i}(x) \right]^2 \int^1_0 t \text dt\]

and hence we have the first result.

To prove the second result, let $x = X^k$, $g = G^k = \nabla _ {\mathcal B _ k} f(X^k) = \frac{\partial f}{\partial x _ {\mathcal B _ k} } (X^k) e _ {\mathcal B _ k}$ and $\alpha = \alpha^k$ in the above, and apply expectation with respect to $\mathcal B _ k$ on both sides. Therefore we have

\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \alpha^k \left(1 - \frac{L_\max \alpha^k}{2}\right) \mathbb E_{\mathcal B_k}\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2\right]\]

where we also used that $X^{k+1} = X^k - \alpha^k G^k$. Then we also have by the definition of expectation that

\[\begin{aligned} \mathbb E_{\mathcal B_k}\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2\right] &= \sum^n_{i = 1} \mathbb E\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2 \text{ }\Bigg|\text{ } \mathcal B_k = \{i\}\right] \mathbb P(\mathcal B_k = \{i\}) \\\\ &= \sum^n_{i = 1}\left(\frac{\partial f}{\partial x_{i} }(X^k)\right)^2 \cdot \frac 1 n \\\\ &= \frac 1 n ||\nabla f(X^k)||^2 \end{aligned}\]

and hence we have the second result.

General case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the general case.


Suppose:

  • $f$ is as above
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then: for all $k \ge 1$,

\[\min_{0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L_\max (f(x^0) - f_\text{low})}{k}\]

and hence the randomised CD method needs at most $2n L _ \max(f(x^0) - f _ \text{low}) \frac 1 \varepsilon$ iterations to ensure $\min _ {0 \le i \le k-1} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le \varepsilon$.

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@Prove that if

  • $f$ is as above
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

then for all $k \ge 1$,

\[\min_{0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L_\max (f(x^0) - f_\text{low})}{k}\]

and hence the randomised CD method needs at most $2n L _ \max(f(x^0) - f _ \text{low}) \frac 1 \varepsilon$ iterations to ensure $\min _ {0 \le i \le k-1} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le \varepsilon$.


Recall we have the result that in this context

\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L _ \max \alpha^k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2\]

where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$

Taking expectations over the block choices $\mathcal B _ 0, \ldots, \mathcal B _ {k-1}$ on both sides (the tower property of expectation), and writing $\mathbb E _ k := \mathbb E _ {\mathcal B _ 0, \ldots, \mathcal B _ k}[\cdot]$ for the expectation over the first $k+1$ block choices, it follows that

\[\mathbb E_{k} [f(X^{k+1})] \le \mathbb E_{k-1}[f(X^k)] - \frac{1}{2n L_\max} \mathbb E_{k-1}[||\nabla f(X^k)||^2]\]

for all $k \ge 0$, where we have used $\alpha _ k = \frac 1 {L _ \max}$, so that $\frac{\alpha _ k}{n}\left(1 - \frac{L _ \max \alpha _ k}{2}\right) = \frac{1}{2nL _ \max}$. This rearranges to give

\[\mathbb E_{i - 1} [f(X^i)] - \mathbb E_i [f(X^{i+1})] \ge \frac{1}{2n L_\max} \mathbb E_{i-1}[||\nabla f(X^i)||^2]\]

where $\mathbb E _ {-1}[f(X^0)] = f(x^0)$ since $x^0$ is deterministic.

Then we sum this up from $i = 0$ to $i = k$ and deduce from the telescoping sum that

\[f(x^0) - \mathbb E_k[f(X^{k+1})] \ge \frac{1}{2nL_\max} \sum^{k}_{i = 0} \mathbb E_{i - 1} [||\nabla f(X^i)||^2]\]

Then, using $f(X^{k+1}) \ge f _ \text{low}$ and bounding each term $\mathbb E _ {i-1}[ \vert \vert \nabla f(X^i) \vert \vert ^2]$ below by $\min _ {0 \le j \le k} \mathbb E _ {j-1}[ \vert \vert \nabla f(X^j) \vert \vert ^2]$, we obtain

\[f(x^0) - f_\text{low} \ge \frac{1}{2nL_\max} (k+1) \min_{0 \le i \le k} \mathbb E_{i-1} [||\nabla f(X^i)||^2]\]

which gives the result with $k+1$ instead of $k$.

Convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State an additional assumption on $f$ used to prove the convergence of coordinate descent when $f$ is convex.


There is some finite minimiser $x^\ast$ and a constant $D > 0$ such that for all $x \in \mathbb R^n$ with $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence in the general case, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the convex case.


Suppose:

  • $f$ is as above (i.e. all the above assumptions are met)
  • $f$ is convex (not necessarily $\gamma$-strongly convex)
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then:

\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL_\max D^2}{k}\]

and hence the randomised CD method takes at most $2n L _ \max D^2 \frac 1 \varepsilon$ iterations to generate

\[\mathbb E[f(X^k)] - f(x^\ast) \le \varepsilon\]
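
As a rough empirical check (a sketch, not part of the notes: a single run is used as a stand-in for the expectation), one can run randomised CD with $\alpha _ k = \frac 1 {L _ \max}$ on a least-squares problem, for which $L _ i$ is the squared norm of the $i$-th column of $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

L_max = (A ** 2).sum(axis=0).max()            # L_i = ||i-th column of A||^2
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = f(x_star)

x0 = np.zeros(n)
# radius of the f(x0)-sublevel set around x_star (A has full column rank here)
lam_min = np.linalg.eigvalsh(A.T @ A).min()
D = np.sqrt(2 * (f(x0) - f_star) / lam_min)

x, K = x0.copy(), 2000
for k in range(1, K + 1):
    i = rng.integers(n)
    x[i] -= grad(x)[i] / L_max                # randomised CD step with alpha_k = 1 / L_max
    if k % 500 == 0:
        print(k, f(x) - f_star, 2 * n * L_max * D ** 2 / k)   # gap vs theoretical bound
```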

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

@Prove that if

  • $f$ is as above (i.e. all the above assumptions are met)
  • $f$ is convex (not necessarily $\gamma$-strongly convex)
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then:

\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL_\max D^2}{k}\]

Note that all the conditions required for the convergence of CD in the general case are also met, so following that proof and defining

\[\Delta_k := \mathbb E[f(X^k)] - f(x^\ast)\]

we have

\[\Delta_{k + 1} \le \Delta_k - \frac{1}{2n L_\max} \mathbb E[||\nabla f(X^k)||^2]\]

Since $f$ is convex, we also have that

\[\begin{aligned} 0 &\le f(X^k) - f(x^\ast) \\\\ &\le \nabla f(X^k)^\top (X^k - x^\ast) \\\\ &\le ||\nabla f(X^k)|| \cdot ||(X^k - x^\ast)|| \\\\ &\le D||\nabla f(X^k)|| \end{aligned}\]

where the final inequality follows from the fact that

\[f(X^k) \le f(x^0)\]

for all $k$ (the overestimation property with $\alpha _ k = \frac 1 {L _ \max}$ guarantees that $f$ decreases at every step), and so $X^k$ lies in the $f(x^0)$-sublevel set of $f$, where $ \vert \vert X^k - x^\ast \vert \vert \le D$.

Squaring this and passing to total expectation, it follows that

\[\mathbb E[(f(X^k) - f(x^\ast))^2] \le D^2 \mathbb E[||\nabla f(X^k)||^2]\]

and we also have that

\[\left(\mathbb E[f(X^k)] - f(x^\ast)\right)^2 \le \mathbb E[(f(X^k) - f(x^\ast))^2]\]

since

\[\text{var}(f(X^k) - f(x^\ast)) \ge 0\]

Therefore

\[\Delta^2_k \le D^2 \mathbb E[||\nabla f(X^k)||^2]\]

Hence we have

\[\Delta_{k+1} \le \Delta_k - \frac{1}{2nL_\max D^2} \Delta^2_k\]

or equivalently

\[\Delta_k - \Delta_{k+1} \ge \frac{1}{2nL_\max D^2} \Delta^2_k\]

for all $k \ge 0$. Since the recursion also gives $\Delta _ {k+1} \le \Delta _ k$, dividing by $\Delta _ k \Delta _ {k+1}$ (the result is trivial if any $\Delta _ {k+1} = 0$) yields

\[\frac{1}{\Delta_{k+1} } - \frac{1}{\Delta_k} \ge \frac{1}{2nL_\max D^2} \cdot \frac{\Delta_k}{\Delta_{k+1} } \ge \frac{1}{2nL_\max D^2}\]

and summing this from $0$ to $k - 1$ gives $\frac{1}{\Delta _ k} \ge \frac{1}{\Delta _ 0} + \frac{k}{2n L _ \max D^2} \ge \frac{k}{2n L _ \max D^2}$, which is the result.

$\gamma$-strongly convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the $\gamma$-strongly convex case.


Suppose:

  • $f$ is as above
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then: for all $k \ge 0$,

\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL_\max} \right)^k (f(x^0) - f(x^\ast))\]

and hence the randomised CD method takes at most $O(\log \frac 1 \varepsilon)$ iterations to generate

\[\mathbb E[f(X^k)] - f(x^\ast) \le \varepsilon\]
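
A rough numerical illustration of the linear rate (a sketch, not part of the notes; a single run stands in for $\mathbb E[f(X^k)]$) on a strongly convex quadratic, for which $\gamma = \lambda _ \min(A)$ and $L _ \max = \max _ i A _ {ii}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)              # positive definite, so f is strongly convex
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
f_star = f(x_star)

gamma = np.linalg.eigvalsh(A).min()
L_max = np.diag(A).max()
rate = 1 - gamma / (n * L_max)       # predicted per-iteration contraction factor

x = np.zeros(n)
gap0 = f(x) - f_star
for k in range(1, 3001):
    i = rng.integers(n)
    x[i] -= grad(x)[i] / L_max       # randomised CD step with alpha_k = 1 / L_max
    if k % 1000 == 0:
        print(k, f(x) - f_star, gap0 * rate ** k)   # observed gap vs theoretical bound
```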

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

@Prove that if:

  • $f$ is as above
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

then for all $k \ge 0$,

\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL_\max} \right)^k (f(x^0) - f(x^\ast))\]

Note that all the conditions required for the convergence of CD in the general case are also met, so following that proof and defining

\[\Delta_k := \mathbb E[f(X^k)] - f(x^\ast)\]

we have

\[\Delta_{k + 1} \le \Delta_k - \frac{1}{2n L_\max} \mathbb E[||\nabla f(X^k)||^2]\]

As $f$ is $\gamma$-strongly convex, we have that

\[f(x) - f(x^\ast) \le \frac 1 {2\gamma} ||\nabla f(x)||^2\]

for all $x$. Therefore, letting $x := X^k$ and passing to total expectation, we have

\[\Delta_k = \mathbb E[f(X^k)] - f(x^\ast) \le \frac{1}{2\gamma} \mathbb E[||\nabla f(X^k)||^2]\]

Hence, by substituting into the above inequality $\Delta _ {k+1} \le \Delta _ k - \cdots$, we obtain

\[\Delta_{k+1} \le \Delta_k - \frac{2\gamma}{2nL_\max} \Delta_k\]

where $k \ge 0$, and hence

\[\Delta_{k+1} \le \left( 1 - \frac{\gamma}{nL_\max} \right) \Delta_k\]

for all $k \ge 0$ also. Since $0 < \gamma \le L \le nL _ \max$, the convergence factor lies in $[0, 1)$ and so the result follows inductively.

Convergence of cyclic CD

Example of failure

Give an @example of an objective function $f : \mathbb R^3 \to \mathbb R$ on which the cyclic CD method fails to converge to a stationary point and describe its properties.


\[f(x_1, x_2, x_3) = -(x_1x_2 + x_2 x_3 + x_1 x_3) + \sum^3_{i=1} \max\big(|x_i| - 1, 0\big)^2\]

This function is non-convex, continuously differentiable and its minimisers are at the corners $(1, 1, 1)^\top$ and $(-1, -1, -1)^\top$ of the unit cube.
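
This is Powell's classical example. A short sketch (not part of the notes) of exact cyclic coordinate minimisation on it, as in the cyclic variant with exact line search described below: for fixed values of the other two coordinates with sum $s \ne 0$, the one-dimensional minimiser over $x _ i$ is $\text{sign}(s)\,(1 + \vert s \vert / 2)$, and starting from a slightly perturbed corner the iterates hop between neighbourhoods of the six mixed-sign corners instead of approaching a stationary point:

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return -(x1 * x2 + x2 * x3 + x1 * x3) + sum(max(abs(xi) - 1.0, 0.0) ** 2 for xi in x)

def exact_coordinate_min(x, i):
    # exact minimiser of f over x_i, with the other two coordinates held fixed
    s = x.sum() - x[i]
    x[i] = np.sign(s) * (1.0 + 0.5 * abs(s)) if s != 0 else 0.0
    return x

eps = 1e-3
x = np.array([-1 - eps, 1 + eps / 2, -1 - eps / 4])   # perturbed starting corner
for k in range(30):
    x = exact_coordinate_min(x, k % 3)
    if k % 3 == 2:
        print(np.round(x, 4), "f =", round(f(x), 6))
# the iterates cycle near the six mixed-sign corners, where f is close to 1,
# far above the value f(1, 1, 1) = f(-1, -1, -1) = -3
```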

General case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

where $\alpha _ k$ is chosen to minimise $f$ along that direction.

@State a result concerning the convergence in this case.


Suppose:

  • $f$ is continuously differentiable
  • $f$ is bounded below
  • For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.

Then:

  • Every limit point of the sequence of iterates $x^k$ where $k = n, 2n, 3n, \ldots$ is a stationary point of $f$

Convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

where $\alpha _ k = \frac 1 {L _ \max}$.

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

@State a result concerning the convergence in the convex case.


Suppose:

  • $f$ is as above
  • $f$ is convex

Then for $k \in \{n, 2n, 3n, \ldots\}$

\[f(x^k) - f(x^\ast) \le \frac{4n L_\max D^2 \left(1 + n \frac{L^2}{L^2_\max}\right)}{k+8}\]

$\gamma$-strongly convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

where $\alpha _ k = \frac 1 {L _ \max}$.

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State a result concerning the convergence in the $\gamma$-strongly convex case.


Suppose:

  • $f$ is as above
  • $f$ is $\gamma$-strongly convex

Then for $k \in \{n, 2n, 3n, \ldots\}$

\[f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L_\max (1 + nL^2 / L_\max^2)}\right)^{k/n} (f(x^0) - f(x^\ast))\]


