Notes - Optimisation for Data Science HT25, Coordinate descent



Flashcards

Basic definitions

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, can you @state the coordinate descent @algorithm?


Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha_k g^k\]
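
As a concrete illustration, here is a minimal NumPy sketch of this loop (not part of the notes: the callables `grad`, `select_block` and `alpha` are hypothetical placeholders to be supplied by the user):

```python
import numpy as np

def coordinate_descent(grad, x0, select_block, alpha, num_iters):
    # grad(x): full gradient of f at x
    # select_block(k): returns the index set B_k (iterable of coordinates)
    # alpha(k): returns the step size alpha_k
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        B_k = list(select_block(k))        # step (1): choose the block
        g_k = np.zeros_like(x)             # step (2): partial gradient,
        g_k[B_k] = grad(x)[B_k]            #           zero outside B_k
        x = x - alpha(k) * g_k             # step (3): take the step
    return x
```

In practice one would evaluate only the partial derivatives indexed by $\mathcal B _ k$ rather than the full gradient; the sketch calls `grad` only for simplicity.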

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

@Define what is meant by:

  • randomised,
  • cyclic, and
  • Gauss-Southwell

coordinate descent in this context.


  • Randomised: pick $\mathcal B _ k = \{i _ k\}$, where $i _ k$ is drawn uniformly at random from $\{1, \ldots, n\}$ (so $ \vert \mathcal B _ k \vert = 1$)
  • Cyclic: cycle through the coordinates in order, $\mathcal B _ 0 = \{1\}$, $\mathcal B _ 1 = \{2\}$, ..., $\mathcal B _ {n-1} = \{n\}$, $\mathcal B _ n = \{1\}$, and so on
  • Gauss-Southwell: pick $ \vert \mathcal B _ k \vert = 1$ with $\mathcal B _ k = \{\text{argmax} _ {1 \le i \le n} \vert \frac{\partial f}{\partial x _ i}(x^k) \vert \}$, i.e. the coordinate with the largest absolute partial derivative (all three rules are sketched below).
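
A minimal sketch of the three selection rules (an illustration, not from the notes; `grad` is an assumed callable returning $\nabla f(x)$, and indices here are $0$-based):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomised_rule(k, x, grad, n):
    # pick a single coordinate uniformly at random from {0, ..., n - 1}
    return {int(rng.integers(n))}

def cyclic_rule(k, x, grad, n):
    # cycle through the coordinates 0, 1, ..., n - 1, 0, 1, ... in order
    return {k % n}

def gauss_southwell_rule(k, x, grad, n):
    # pick the coordinate with the largest absolute partial derivative at x
    return {int(np.argmax(np.abs(grad(x))))}
```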

Convergence of randomised CD

Assumptions on $f$

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

@State the assumptions on $f$ used to prove the convergence of this algorithm.


Each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

Let

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State a result that relates the $L$-smoothness of $f$ to $L _ \max$.


\[L_\max \le L \le n L_\max\]
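
For example, for a convex quadratic $f(x) = \frac 1 2 x^\top A x$ with $A$ symmetric positive semidefinite, $\frac{\partial f}{\partial x _ i}(x + te _ i) - \frac{\partial f}{\partial x _ i}(x) = t A _ {ii}$, so $L _ i = A _ {ii}$ and $L = \lambda _ \max(A)$, and the bounds can be checked numerically (a sketch, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M                          # random symmetric positive semidefinite matrix

L_i = np.diag(A)                     # coordinate-wise Lipschitz constants for f(x) = 0.5 x'Ax
L_max = L_i.max()
L = np.linalg.eigvalsh(A).max()      # global Lipschitz constant of the gradient

print(L_max, L, n * L_max)           # observe L_max <= L <= n * L_max
```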

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

Let

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@Prove that if $f$ is $L$-smooth, then

\[L_\max \le L \le n L_\max\]

@todo, after sheet 4.

Overestimation property

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State an overestimation property for each coordinate direction, and then a result that follows from it which is useful in proving the convergence of randomised coordinate descent.


For each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have

\[f(x - \alpha \nabla_i f(x)) \le f(x) - \alpha \left( 1 - \frac{L_\max \alpha}{2} \right)\left( \frac{\partial f}{\partial x_i}(x) \right)^2\]

where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$.


If randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha^k > 0$, then

\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L_\max \alpha^k}{2} \right)||\nabla f(X^k)||^2\]

where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
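
Since $\mathcal B _ k$ is uniform over the $n$ single coordinates, the conditional expectation on the left-hand side is just an average over $i = 1, \ldots, n$, so the inequality is easy to sanity-check numerically, e.g. on a quadratic $f(x) = \frac 1 2 x^\top A x$, for which $L _ \max = \max _ i A _ {ii}$ (a sketch, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M.T @ M                          # f(x) = 0.5 x'Ax with A symmetric PSD

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L_max = np.diag(A).max()

x = rng.standard_normal(n)
alpha = 1.0 / L_max

# left-hand side: average of f after a single-coordinate step, over all n coordinates
lhs = np.mean([f(x - alpha * grad(x)[i] * np.eye(n)[i]) for i in range(n)])
# right-hand side: the guaranteed expected decrease
rhs = f(x) - (alpha / n) * (1 - L_max * alpha / 2) * np.linalg.norm(grad(x)) ** 2

print(lhs, rhs)                      # lhs should not exceed rhs
```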

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@Prove that then for each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have

\[f(x - \alpha \nabla_i f(x)) \le f(x) - \alpha \left( 1 - \frac{L_\max \alpha}{2} \right)\left( \frac{\partial f}{\partial x_i}(x) \right)^2\]

where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$, and therefore that if randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha^k > 0$, then

\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L_\max \alpha^k}{2} \right)||\nabla f(X^k)||^2\]

where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.


Let $g = \nabla _ i f(x)$. We use the Taylor expansion

\[f(x + d) = f(x) + \nabla f(x)^\top d + \int^1_0 d^\top(\nabla f(x + td) - \nabla f(x)) \text dt\]

with $d := -\alpha g$ to deduce

\[\begin{aligned} f(x - \alpha g) &= f(x) + \nabla f(x)^\top (-\alpha g) + \int^1_0 [\nabla f(x - t \alpha g) - \nabla f(x)]^\top (-\alpha g)\text dt \\\\ &\le f(x) - \alpha g^\top \nabla f(x) + \alpha \left| \frac{\partial f}{\partial x_i}(x) \right| \int^1_0 \left| \frac{\partial f}{\partial x_i}(x - t\alpha g) - \frac{\partial f}{\partial x_i}(x) \right| \text dt \end{aligned}\]

where the inequality holds because $g = \frac{\partial f}{\partial x _ i}(x) e _ i$ is parallel to $e _ i$, so only the $i$-th component of the gradient difference contributes to the inner product, which we then bound by its absolute value.

Note also that

\[g^\top \nabla f(x) = \left[\frac{\partial f}{\partial x_i} (x)\right] e_i^\top \nabla f(x) = \left[\frac{\partial f}{\partial x_i} (x)\right]^2\]

by definition.

Finally, by the directional Lipschitz assumption,

\[\begin{aligned} \left| \frac{\partial f}{\partial x_i}(x - t\alpha g) - \frac{\partial f}{\partial x_i}(x) \right| &= \left| \frac{\partial f}{\partial x_i}\left(x - t\alpha \frac{\partial f}{\partial x_i}(x) e_i\right) - \frac{\partial f}{\partial x_i}(x) \right| \\\\ &\le L_i \left| t\alpha \frac{\partial f}{\partial x_i}(x) \right| \\\\ &\le L_\max t\alpha \left| \frac{\partial f}{\partial x_i}(x) \right| \end{aligned}\]

Combining these results together gives

\[f(x - \alpha g) \le f(x) - \alpha \left[ \frac{\partial f}{\partial x_i}(x) \right]^2 + L_\max \alpha^2 \left[ \frac{\partial f}{\partial x_i}(x) \right]^2 \int^1_0 t \text dt\]

and hence we have the first result.

To prove the second result, let $x = X^k$, $g = G^k = \nabla _ {\mathcal B _ k} f(X^k) = \frac{\partial f}{\partial x _ {\mathcal B _ k} } (X^k) e _ {\mathcal B _ k}$ and $\alpha = \alpha^k$ in the above, and apply expectation with respect to $\mathcal B _ k$ on both sides. Therefore we have

\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \alpha^k \left(1 - \frac{L_\max \alpha^k}{2}\right) \mathbb E_{\mathcal B_k}\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2\right]\]

where we also used that $X^{k+1} = X^k - \alpha^k G^k$. Then we also have by the definition of expectation that

\[\begin{aligned} \mathbb E_{\mathcal B_k}\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2\right] &= \sum^n_{i = 1} \mathbb E\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2 \text{ }\Bigg|\text{ } \mathcal B_k = \{i\}\right] \mathbb P(\mathcal B_k = \{i\}) \\\\ &= \sum^n_{i = 1}\left(\frac{\partial f}{\partial x_{i} }(X^k)\right)^2 \cdot \frac 1 n \\\\ &= \frac 1 n ||\nabla f(X^k)||^2 \end{aligned}\]

and hence we have the second result.

General case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the general case.


Suppose:

  • $f$ is as above
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then: for all $k \ge 1$,

\[\min_{0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L_\max (f(x^0) - f_\text{low})}{k}\]

and hence the randomised CD method needs at most $2n L _ \max(f(x^0) - f _ \text{low}) \frac 1 \varepsilon$ iterations to ensure $\min _ {0 \le i \le k-1} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le \varepsilon$.

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@Prove that if

  • $f$ is as above
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

then for all $k \ge 1$,

\[\min_{0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L_\max (f(x^0) - f_\text{low})}{k}\]

and hence the randomised CD method needs at most $2n L _ \max(f(x^0) - f _ \text{low}) \frac 1 \varepsilon$ iterations to ensure $\min _ {0 \le i \le k-1} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le \varepsilon$.


Recall we have the result that in this context

\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L _ \max \alpha^k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2\]

where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$

Taking expectations over the block choices $\mathcal B _ 0, \ldots, \mathcal B _ {k-1}$ on both sides (the tower property of expectation), and writing $\mathbb E _ k := \mathbb E _ {\mathcal B _ 0, \ldots, \mathcal B _ k}[\cdot]$ for the expectation over the first $k+1$ block choices, it follows that

\[\mathbb E_{k} [f(X^{k+1})] \le \mathbb E_{k-1}[f(X^k)] - \frac{1}{2n L_\max} \mathbb E_{k-1}[||\nabla f(X^k)||^2]\]

for all $k \ge 0$, where we have used $\alpha _ k = \frac 1 {L _ \max}$, so that $\frac{\alpha _ k}{n}\left(1 - \frac{L _ \max \alpha _ k}{2}\right) = \frac{1}{2nL _ \max}$. This rearranges to give

\[\mathbb E_{i - 1} [f(X^i)] - \mathbb E_i [f(X^{i+1})] \ge \frac{1}{2n L_\max} \mathbb E_{i-1}[||\nabla f(X^i)||^2]\]

where $\mathbb E _ {-1}[f(X^0)] = f(x^0)$ since $x^0$ is deterministic.

Then we sum this up from $i = 0$ to $i = k$ and deduce from the telescoping sum that

\[f(x^0) - \mathbb E_k[f(X^{k+1})] \ge \frac{1}{2nL_\max} \sum^{k}_{i = 0} \mathbb E_{i - 1} [||\nabla f(X^i)||^2]\]

Then, using $f(X^{k+1}) \ge f _ \text{low}$ and bounding each term $\mathbb E _ {i-1}[ \vert \vert \nabla f(X^i) \vert \vert ^2]$ below by $\min _ {0 \le j \le k} \mathbb E _ {j-1}[ \vert \vert \nabla f(X^j) \vert \vert ^2]$, we obtain

\[f(x^0) - f_\text{low} \ge \frac{1}{2nL_\max} (k+1) \min_{0 \le i \le k} \mathbb E_{i-1} [||\nabla f(X^i)||^2]\]

which gives the result with $k+1$ instead of $k$.

Convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State an additional assumption on $f$ used to prove the convergence of coordinate descent when $f$ is convex.


There is some finite minimiser $x^\ast$ and a constant $D > 0$ such that for all $x \in \mathbb R^n$ with $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence in the general case, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the convex case.


Suppose:

  • $f$ is as above (i.e. all the above assumptions are met)
  • $f$ is convex (not necessarily $\gamma$-strongly convex)
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then:

\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL_\max D^2}{k}\]

and hence the randomised CD method takes at most $2n L _ \max D^2 \frac 1 \varepsilon$ iterations to generate

\[\mathbb E[f(X^k)] - f(x^\ast) \le \varepsilon\]
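
As a rough empirical check (a sketch, not part of the notes: a single run is used as a stand-in for the expectation), one can run randomised CD with $\alpha _ k = \frac 1 {L _ \max}$ on a least-squares problem, for which $L _ i$ is the squared norm of the $i$-th column of $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

L_max = (A ** 2).sum(axis=0).max()            # L_i = ||i-th column of A||^2
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = f(x_star)

x0 = np.zeros(n)
# radius of the f(x0)-sublevel set around x_star (A has full column rank here)
lam_min = np.linalg.eigvalsh(A.T @ A).min()
D = np.sqrt(2 * (f(x0) - f_star) / lam_min)

x, K = x0.copy(), 2000
for k in range(1, K + 1):
    i = rng.integers(n)
    x[i] -= grad(x)[i] / L_max                # randomised CD step with alpha_k = 1 / L_max
    if k % 500 == 0:
        print(k, f(x) - f_star, 2 * n * L_max * D ** 2 / k)   # gap vs theoretical bound
```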

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

@Prove that if

  • $f$ is as above (i.e. all the above assumptions are met)
  • $f$ is convex (not necessarily $\gamma$-strongly convex)
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then:

\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL_\max D^2}{k}\]

Note that all the conditions required for the convergence of CD in the general case are also met, so following that proof and defining

\[\Delta_k := \mathbb E[f(X^k)] - f(x^\ast)\]

we have

\[\Delta_{k + 1} \le \Delta_k - \frac{1}{2n L_\max} \mathbb E[||\nabla f(X^k)||^2]\]

Since $f$ is convex, we also have that

\[\begin{aligned} 0 &\le f(X^k) - f(x^\ast) \\\\ &\le \nabla f(X^k)^\top (X^k - x^\ast) \\\\ &\le ||\nabla f(X^k)|| \cdot ||(X^k - x^\ast)|| \\\\ &\le D||\nabla f(X^k)|| \end{aligned}\]

where the final inequality follows from the fact that

\[f(X^k) \le f(x^0)\]

for all $k$ (the overestimation property with $\alpha _ k = \frac 1 {L _ \max}$ guarantees that $f$ decreases at every step), and so $X^k$ lies in the $f(x^0)$-sublevel set of $f$, where $ \vert \vert X^k - x^\ast \vert \vert \le D$.

Squaring this and passing to total expectation, it follows that

\[\mathbb E[(f(X^k) - f(x^\ast))^2] \le D^2 \mathbb E[||\nabla f(X^k)||^2]\]

and we also have that

\[\left(\mathbb E[f(X^k)] - f(x^\ast)\right)^2 \le \mathbb E[(f(X^k) - f(x^\ast))^2]\]

since

\[\text{var}(f(X^k) - f(x^\ast)) \ge 0\]

Therefore

\[\Delta^2_k \le D^2 \mathbb E[||\nabla f(X^k)||^2]\]

Hence we have

\[\Delta_{k+1} \le \Delta_k - \frac{1}{2nL_\max D^2} \Delta^2_k\]

or equivalently

\[\Delta_k - \Delta_{k+1} \ge \frac{1}{2nL_\max D^2} \Delta^2_k\]

for all $k \ge 0$. Since the recursion also gives $\Delta _ {k+1} \le \Delta _ k$, dividing by $\Delta _ k \Delta _ {k+1}$ (the result is trivial if any $\Delta _ {k+1} = 0$) yields

\[\frac{1}{\Delta_{k+1} } - \frac{1}{\Delta_k} \ge \frac{1}{2nL_\max D^2} \cdot \frac{\Delta_k}{\Delta_{k+1} } \ge \frac{1}{2nL_\max D^2}\]

and summing this from $0$ to $k - 1$ gives $\frac{1}{\Delta _ k} \ge \frac{1}{\Delta _ 0} + \frac{k}{2n L _ \max D^2} \ge \frac{k}{2n L _ \max D^2}$, which is the result.

$\gamma$-strongly convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the $\gamma$-strongly convex case.


Suppose:

  • $f$ is as above
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

Then: for all $k \ge 0$,

\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL_\max} \right)^k (f(x^0) - f(x^\ast))\]

and hence the randomised CD method takes at most $O(\log \frac 1 \varepsilon)$ iterations to generate

\[\mathbb E[f(X^k)] - f(x^\ast) \le \varepsilon\]
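
A rough numerical illustration of the linear rate (a sketch, not part of the notes; a single run stands in for $\mathbb E[f(X^k)]$) on a strongly convex quadratic, for which $\gamma = \lambda _ \min(A)$ and $L _ \max = \max _ i A _ {ii}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)              # positive definite, so f is strongly convex
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
f_star = f(x_star)

gamma = np.linalg.eigvalsh(A).min()
L_max = np.diag(A).max()
rate = 1 - gamma / (n * L_max)       # predicted per-iteration contraction factor

x = np.zeros(n)
gap0 = f(x) - f_star
for k in range(1, 3001):
    i = rng.integers(n)
    x[i] -= grad(x)[i] / L_max       # randomised CD step with alpha_k = 1 / L_max
    if k % 1000 == 0:
        print(k, f(x) - f_star, gap0 * rate ** k)   # observed gap vs theoretical bound
```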

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

@Prove that if:

  • $f$ is as above
  • $f$ is $\gamma$-strongly convex
  • $f$ is bounded below by $f _ \text{low}$
  • The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
  • $\mathcal B _ k$ is chosen uniformly at random and with replacement
  • $\alpha _ k = \frac 1 {L _ \max}$

then for all $k \ge 0$,

\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL_\max} \right)^k (f(x^0) - f(x^\ast))\]

Note that all the conditions required for the convergence of CD in the general case are also met, so following that proof and defining

\[\Delta_k := \mathbb E[f(X^k)] - f(x^\ast)\]

we have

\[\Delta_{k + 1} \le \Delta_k - \frac{1}{2n L_\max} \mathbb E[||\nabla f(X^k)||^2]\]

As $f$ is $\gamma$-strongly convex, we have that

\[f(x) - f(x^\ast) \le \frac 1 {2\gamma} ||\nabla f(x)||^2\]

for all $x$. Therefore, letting $x := X^k$ and passing to total expectation, we have

\[\Delta_k = \mathbb E[f(X^k)] - f(x^\ast) \le \frac{1}{2\gamma} \mathbb E[||\nabla f(X^k)||^2]\]

Hence, by substituting into the above inequality $\Delta _ {k+1} \le \Delta _ k - \cdots$, we obtain

\[\Delta_{k+1} \le \Delta_k - \frac{2\gamma}{2nL_\max} \Delta_k\]

where $k \ge 0$, and hence

\[\Delta_{k+1} \le \left( 1 - \frac{\gamma}{nL_\max} \right) \Delta_k\]

for all $k \ge 0$ also. Since $0 < \gamma \le L \le nL _ \max$, the convergence factor lies in $[0, 1)$ and so the result follows inductively.

Convergence of cyclic CD

Example of failure

Give an @example of an objective function $f : \mathbb R^3 \to \mathbb R$ on which the cyclic CD method fails to converge to a stationary point and describe its properties.


\[f(x_1, x_2, x_3) = -(x_1x_2 + x_2 x_3 + x_1 x_3) + \sum^3_{i=1} \max\big(|x_i| - 1, 0\big)^2\]

This function is non-convex, continuously differentiable and its minimisers are at the corners $(1, 1, 1)^\top$ and $(-1, -1, -1)^\top$ of the unit cube.
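
This is Powell's classical example. A short sketch (not part of the notes) of exact cyclic coordinate minimisation on it, as in the cyclic variant with exact line search described below: for fixed values of the other two coordinates with sum $s \ne 0$, the one-dimensional minimiser over $x _ i$ is $\text{sign}(s)\,(1 + \vert s \vert / 2)$, and starting from a slightly perturbed corner the iterates hop between neighbourhoods of the six mixed-sign corners instead of approaching a stationary point:

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return -(x1 * x2 + x2 * x3 + x1 * x3) + sum(max(abs(xi) - 1.0, 0.0) ** 2 for xi in x)

def exact_coordinate_min(x, i):
    # exact minimiser of f over x_i, with the other two coordinates held fixed
    s = x.sum() - x[i]
    x[i] = np.sign(s) * (1.0 + 0.5 * abs(s)) if s != 0 else 0.0
    return x

eps = 1e-3
x = np.array([-1 - eps, 1 + eps / 2, -1 - eps / 4])   # perturbed starting corner
for k in range(30):
    x = exact_coordinate_min(x, k % 3)
    if k % 3 == 2:
        print(np.round(x, 4), "f =", round(f(x), 6))
# the iterates cycle near the six mixed-sign corners, where f is close to 1,
# far above the value f(1, 1, 1) = f(-1, -1, -1) = -3
```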

General case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

where $\alpha _ k$ is chosen to minimise $f$ along that direction.

@State a result concerning the convergence in this case.


Suppose:

  • $f$ is continuously differentiable
  • $f$ is bounded below
  • For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.

Then:

  • Every limit point of the sequence of iterates $x^k$ where $k = n, 2n, 3n, \ldots$ is a stationary point of $f$

Convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

where $\alpha _ k = \frac 1 {L _ \max}$.

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.

@State a result concerning the convergence in the convex case.


Suppose:

  • $f$ is as above
  • $f$ is convex

Then for $k \in \{n, 2n, 3n, \ldots\}$

\[f(x^k) - f(x^\ast) \le \frac{4n L_\max D^2 \left(1 + n \frac{L^2}{L^2_\max}\right)}{k+8}\]

$\gamma$-strongly convex case

Suppose we wish to solve the optimisation problem

\[\min_{x \in \mathbb R^n} f(x)\]

where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:

Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:

(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.

(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as

\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]

(3): Calculate

\[x^{k+1} = x^k - \alpha _ k g^k\]

where $\alpha _ k = \frac 1 {L _ \max}$.

In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,

\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]

and define

\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]

@State a result concerning the convergence in the $\gamma$-strongly convex case.


Suppose:

  • $f$ is as above
  • $f$ is $\gamma$-strongly convex

Then for $k \in \{n, 2n, 3n, \ldots\}$

\[f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L_\max (1 + nL^2 / L_\max^2)}\right)^{k/n} (f(x^0) - f(x^\ast))\]


