Notes - Optimisation for Data Science HT25, Coordinate descent
Flashcards
Basic definitions
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, can you @state the coordinate descent @algorithm?
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
@Define what is meant by:
- randomised,
- cyclic, and
- Gauss-Southwell
coordinate descent in this context.
- Randomised: Pick $\mathcal B _ k = \{i _ k\}$ where $i _ k$ is drawn uniformly at random from $\{1, \ldots, n\}$, independently at each iteration.
- Cyclic: Cycle through the coordinates in order: $\mathcal B _ 0 = \{1\}$, $\mathcal B _ 1 = \{2\}$, $\ldots$, $\mathcal B _ {n-1} = \{n\}$, then start again from $\{1\}$.
- Gauss-Southwell: Pick $\mathcal B _ k = \{\text{argmax} _ {1 \le i \le n} \vert \frac{\partial f}{\partial x _ i}(x^k) \vert \}$, i.e. the coordinate with the largest-magnitude partial derivative.
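To make these selection rules concrete, here is a minimal Python sketch of the update $x^{k+1} = x^k - \alpha _ k g^k$ under each rule (not from the course notes; the function names, the fixed step size `alpha` and the full-gradient oracle `grad` are illustrative assumptions, and for simplicity the full gradient is evaluated even though the randomised and cyclic rules only need one partial derivative per iteration):

```python
import numpy as np

def coordinate_descent(grad, x0, alpha, n_iters, rule="randomised", rng=None):
    """Minimal coordinate descent sketch: at each iteration update one coordinate
    of x with a fixed step size alpha, using the chosen selection rule.
    `grad(x)` is assumed to return the full gradient of f at x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for k in range(n_iters):
        g = grad(x)
        if rule == "randomised":          # index drawn uniformly at random
            i = int(rng.integers(n))
        elif rule == "cyclic":            # cycle through coordinates in order
            i = k % n
        elif rule == "gauss-southwell":   # largest-magnitude partial derivative
            i = int(np.argmax(np.abs(g)))
        else:
            raise ValueError(f"unknown rule: {rule}")
        x[i] -= alpha * g[i]              # x^{k+1} = x^k - alpha_k * g^k
    return x

# Example on a convex quadratic f(x) = 0.5 x^T A x - b^T x, grad f(x) = A x - b;
# here L_max = max_i A_ii = 3, so alpha = 1/3 corresponds to the step size 1/L_max.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_cd = coordinate_descent(lambda x: A @ x - b, np.zeros(2), alpha=1 / 3, n_iters=200)
print(x_cd, np.linalg.solve(A, b))        # the two should roughly agree
```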
Convergence of randomised CD
Assumptions on $f$
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
@State the assumptions on $f$ used to prove the convergence of this algorithm.
Each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
Let
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
@State a result that relates the $L$-smoothness of $f$ to $L _ \max$.
If $f$ is $L$-smooth, then
\[L_\max \le L \le n L_\max\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
Let
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
@Prove that if $f$ is $L$-smooth, then
\[L_\max \le L \le n L_\max\]
@todo, after sheet 4.
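Although the proof is deferred, the inequality can be sanity-checked numerically on a convex quadratic $f(x) = \frac 1 2 x^\top A x$ with $A$ symmetric positive semidefinite, for which $L _ i = A _ {ii}$ and $L = \lambda _ \max(A)$ (a minimal sketch under these assumptions, not part of the notes):

```python
import numpy as np

# Sanity check of L_max <= L <= n * L_max on a convex quadratic
# f(x) = 0.5 x^T A x with A symmetric positive semidefinite,
# for which L_i = A_ii and L = lambda_max(A).
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T                       # symmetric positive semidefinite
L_i = np.diag(A)                  # directional Lipschitz constants of the partials
L_max = L_i.max()
L = np.linalg.eigvalsh(A).max()   # L-smoothness constant (spectral norm of A)
print(L_max <= L <= n * L_max)    # expected: True
```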
Overestimation property
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
@State an overestimation property for each coordinate direction, and then a result that follows from this which is useful in proving the convergence of randomised coordinate descent.
For each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have
\[f(x - \alpha \nabla_i f(x)) \le f(x) - \alpha \left( 1 - \frac{L_\max \alpha}{2} \right)\left( \frac{\partial f}{\partial x_i}(x) \right)^2\]where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$.
If randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha^k > 0$, then
\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L_\max \alpha^k}{2} \right)||\nabla f(X^k)||^2\]where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
@Prove that then for each $i \in \{1, \ldots, n\}$, $x \in \mathbb R^n$ and $\alpha \ge 0$, we have
\[f(x - \alpha \nabla_i f(x)) \le f(x) - \alpha \left( 1 - \frac{L_\max \alpha}{2} \right)\left( \frac{\partial f}{\partial x_i}(x) \right)^2\]
where $\nabla _ i f(x) = \frac{\partial f}{\partial x _ i}(x) e _ i$, and therefore that if randomised coordinate descent is applied with $ \vert \mathcal B _ k \vert = 1$ and $\alpha^k > 0$, then
\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L_\max \alpha^k}{2} \right)||\nabla f(X^k)||^2\]
where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$.
Let $g = \nabla _ i f(x)$. We use the Taylor expansion
\[f(x + d) = f(x) + \nabla f(x)^\top d + \int^1_0 d^\top(\nabla f(x + td) - \nabla f(x)) \text dt\]with $d := -\alpha g$ to deduce
\[\begin{aligned} f(x - \alpha g) &= f(x) + \nabla f(x)^\top (-\alpha g) + \int^1_0 [\nabla f(x - t \alpha g) - \nabla f(x)]^\top (-\alpha g)\text dt \\\\ &\le f(x) - \alpha g^\top \nabla f(x) + \alpha \left|\frac{\partial f}{\partial x_i}(x)\right| \int^1_0 \left|\frac{\partial f}{\partial x_i}(x - t\alpha g) - \frac{\partial f}{\partial x_i}(x)\right|\text dt \end{aligned}\]since $g$ has only its $i$-th component non-zero, so the inner product in the integrand only involves the $i$-th partial derivative.
Note also that
\[g^\top \nabla f(x) = \left[\frac{\partial f}{\partial x_i} (x)\right] e_i^\top \nabla f(x) = \left[\frac{\partial f}{\partial x_i} (x)\right]^2\]by definition.
Finally,
\[\begin{aligned} \left|\frac{\partial f}{\partial x_i}(x - t\alpha g) - \frac{\partial f}{\partial x_i}(x)\right| &= \left|\frac{\partial f}{\partial x_i}\left(x - t\alpha \frac{\partial f}{\partial x_i}(x) e_i\right) - \frac{\partial f}{\partial x_i}(x)\right| \\\\ &\le L_i \left| t\alpha \frac{\partial f}{\partial x_i}(x) \right| \\\\ &\le L_\max t\alpha \left| \frac{\partial f}{\partial x_i}(x) \right| \end{aligned}\]by the directional Lipschitz assumption, since the displacement is along $e_i$ only. Combining these results together gives
\[f(x - \alpha g) \le f(x) - \alpha \left[ \frac{\partial f}{\partial x_i}(x) \right]^2 + L_\max \alpha^2 \left[ \frac{\partial f}{\partial x_i}(x) \right]^2 \int^1_0 t \text dt\]and hence we have the first result.
To prove the second result, let $x = X^k$, $g = G^k = \nabla _ {\mathcal B _ k} f(X^k) = \frac{\partial f}{\partial x _ {\mathcal B _ k} } (X^k) e _ {\mathcal B _ k}$ and $\alpha = \alpha^k$ in the above, and apply expectation with respect to $\mathcal B _ k$ on both sides. Therefore we have
\[\mathbb E_{\mathcal B_k} [f(X^{k+1})] \le f(X^k) - \alpha^k \left(1 - \frac{L_\max \alpha^k}{2}\right) \mathbb E_{\mathcal B_k}\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2\right]\]where we also used that $X^{k+1} = X^k - \alpha^k G^k$. Then we also have, by the definition of conditional expectation, that
\[\begin{aligned} \mathbb E_{\mathcal B_k}\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2\right] &= \sum^n_{i = 1} \mathbb E\left[\left(\frac{\partial f}{\partial x_{\mathcal B_k} }(X^k)\right)^2 \text{ }\Bigg|\text{ } \mathcal B_k = \{i\}\right] \mathbb P(\mathcal B_k = \{i\}) \\\\ &= \sum^n_{i = 1}\left(\frac{\partial f}{\partial x_{i} }(X^k)\right)^2 \cdot \frac 1 n \\\\ &= \frac 1 n ||\nabla f(X^k)||^2 \end{aligned}\]and hence we have the second result.
General case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the general case.
Suppose:
- $f$ is as above
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
Then: for all $k \ge 1$,
\[\min_{0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L_\max (f(x^0) - f_\text{low})}{k}\]and hence the randomised CD method takes at most $2n L _ \max(f(x^0) - f _ \text{low}) \frac 1 \varepsilon$ iterations to ensure $\min _ {0 \le i \le k-1} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le \varepsilon$.
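For a rough sense of scale (an illustrative calculation with made-up numbers, not from the notes): with $n = 10^4$, $L _ \max = 1$, $f(x^0) - f _ \text{low} = 1$ and $\varepsilon = 10^{-4}$, the bound gives at most
\[2 n L _ \max (f(x^0) - f _ \text{low}) \frac 1 \varepsilon = 2 \times 10^4 \times 1 \times 1 \times 10^4 = 2 \times 10^8\]
iterations, each of which only requires a single partial derivative rather than a full gradient evaluation.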
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
@Prove that if
- $f$ is as above
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
then for all $k \ge 1$,
\[\min_{0 \le i \le k-1} \mathbb E[||\nabla f(X^i)||^2] \le \frac{2n L_\max (f(x^0) - f_\text{low})}{k}\]
and hence the randomised CD method takes at most $2n L _ \max(f(x^0) - f _ \text{low}) \frac 1 \varepsilon$ iterations to ensure $\min _ {0 \le i \le k-1} \mathbb E[ \vert \vert \nabla f(X^i) \vert \vert ^2] \le \varepsilon$.
Recall we have the result that in this context
\[\mathbb E _ {\mathcal B _ k} [f(X^{k+1})] \le f(X^k) - \frac{\alpha^k}{n} \left( 1 - \frac{L _ \max \alpha^k}{2} \right) \vert \vert \nabla f(X^k) \vert \vert ^2\]
where $\mathbb E _ {\mathcal B _ k}$ denotes conditional expectation with respect to the random variable $\mathcal B _ k$
Substituting $\alpha^k = \frac 1 {L _ \max}$ gives $\frac{\alpha^k}{n}\left(1 - \frac{L _ \max \alpha^k}{2}\right) = \frac{1}{2nL _ \max}$. Taking expectations over $\mathcal B _ 0, \ldots, \mathcal B _ {k-1}$ on both sides and using the tower property of expectation, with $\mathbb E _ k := \mathbb E _ {\mathcal B _ 0, \ldots, \mathcal B _ k}[\cdot]$, we have
\[\mathbb E_{k} [f(X^{k+1})] \le \mathbb E_{k-1}[f(X^k)] - \frac{1}{2n L_\max} \mathbb E_{k-1}[||\nabla f(X^k)||^2]\]for all $k \ge 0$, which then rearranges to give
\[\mathbb E_{i - 1} [f(X^i)] - \mathbb E_i [f(X^{i+1})] \ge \frac{1}{2n L_\max} \mathbb E_{i-1}[||\nabla f(X^i)||^2]\]where $\mathbb E _ {-1}[f(X^0)] = f(x^0)$ since $x^0$ is deterministic.
Then we sum this up from $i = 0$ to $i = k$ and deduce from the telescoping sum that
\[f(x^0) - \mathbb E_k[f(X^{k+1})] \ge \frac{1}{2nL_\max} \sum^{k}_{i = 0} \mathbb E_{i - 1} [||\nabla f(X^i)||^2]\]Then, using $f(X^{k+1}) \ge f _ \text{low}$ and bounding each term below via $\mathbb E _ {i-1}[ \vert \vert \nabla f(X^i) \vert \vert ^2] \ge \min _ {0 \le i \le k} \mathbb E _ {i-1}[ \vert \vert \nabla f(X^i) \vert \vert ^2]$, we obtain
\[f(x^0) - f_\text{low} \ge \frac{1}{2nL_\max} (k+1) \min_{0 \le i \le k} \mathbb E_{i-1} [||\nabla f(X^i)||^2]\]which gives the result with $k+1$ instead of $k$.
Convex case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
@State an additional assumption on $f$ used to prove the convergence of coordinate descent when $f$ is convex.
There is some finite minimiser $x^\ast$ and a constant $D > 0$ such that for all $x \in \mathbb R^n$ with $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence in the general case, we make the additional assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the convex case.
Suppose:
- $f$ is as above (i.e. all the above assumptions are met)
- $f$ is convex (not necessarily $\gamma$-strongly convex)
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
Then:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL_\max D^2}{k}\]and hence the randomised CD method takes at most $k \le 2n L _ \max D^2 \frac 1 \varepsilon$ iterations to generate
\[\mathbb E[f(X^k)] - f(x^\ast) \le \varepsilon\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
@Prove that if
- $f$ is as above (i.e. all the above assumptions are met)
- $f$ is convex (not necessarily $\gamma$-strongly convex)
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
Then:
\[\mathbb E[f(X^k)] - f(x^\ast) \le \frac{2nL_\max D^2}{k}\]
Note that all the conditions required for the convergence of CD in the general case are also met, so following that proof and defining
\[\Delta_k := \mathbb E[f(X^k)] - f(x^\ast)\]we have
\[\Delta_{k + 1} \le \Delta_k - \frac{1}{2n L_\max} \mathbb E[||\nabla f(X^k)||^2]\]Since $f$ is convex, we also have that
\[\begin{aligned} 0 &\le f(X^k) - f(x^\ast) \\\\ &\le \nabla f(X^k)^\top (X^k - x^\ast) \\\\ &\le ||\nabla f(X^k)|| \cdot ||(X^k - x^\ast)|| \\\\ &\le D||\nabla f(X^k)|| \end{aligned}\]where the final inequality follows from the fact that
\[f(X^k) \le f(x^0)\]for all $k$ (the iteration decreases $f$ monotonically), and so $X^k$ lies in the $f(x^0)$-sublevel set of $f$, where the bound $\vert \vert X^k - x^\ast \vert \vert \le D$ applies.
Squaring this and passing to total expectation, it follows that
\[\mathbb E[(f(X^k) - f(x^\ast))^2] \le D^2 \mathbb E[||\nabla f(X^k)||^2]\]and we also have that
\[(\mathbb E[f(X^k)] - f(x^\ast))^2 \le \mathbb E[(f(X^k) - f(x^\ast))^2]\]since
\[\text{var}(f(X^k) - f(x^\ast)) \ge 0\]Therefore
\[\Delta^2_k \le D^2 \mathbb E[||\nabla f(X^k)||^2]\]Hence we have
\[\Delta_{k+1} \le \Delta_k - \frac{1}{2nL_\max D^2} \Delta^2_k\]or equivalently
\[\Delta_k - \Delta_{k+1} \ge \frac{1}{2nL_\max D^2} \Delta^2_k\]for all $k \ge 0$. Dividing through by $\Delta_k \Delta_{k+1}$ and using $\Delta_{k+1} \le \Delta_k$ gives
\[\frac{1}{\Delta_{k+1} } - \frac{1}{\Delta_k} \ge \frac{1}{2nL_\max D^2} \cdot \frac{\Delta_k}{\Delta_{k+1} } \ge \frac{1}{2nL_\max D^2}\]and summing this over iterations $0, 1, \ldots, k-1$ telescopes to $\frac{1}{\Delta_k} \ge \frac{1}{\Delta_0} + \frac{k}{2nL_\max D^2} \ge \frac{k}{2nL_\max D^2}$, which implies the result.
$\gamma$-strongly convex case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
In this context, @state a result characterising the convergence of the randomised CD method with fixed step size in the $\gamma$-strongly convex case.
Suppose:
- $f$ is as above
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
Then: for all $k \ge 0$,
\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL_\max} \right)^k (f(x^0) - f(x^\ast))\]and hence the randomised CD method takes at most $O(\log \frac 1 \varepsilon)$ iterations to generate
\[\mathbb E[f(X^k)] - f(x^\ast) \le \varepsilon\]
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, the coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k \subset \{1, \ldots, n\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
@Prove that if:
- $f$ is as above
- $f$ is $\gamma$-strongly convex
- $f$ is bounded below by $f _ \text{low}$
- The above algorithm is applied to $f$, starting at $x^0 \in \mathbb R^n$ with $ \vert \mathcal B _ k \vert = 1$
- $\mathcal B _ k$ is chosen uniformly at random and with replacement
- $\alpha _ k = \frac 1 {L _ \max}$
then for all $k \ge 0$,
\[\mathbb E[f(X^k)] - f(x^\ast) \le \left( 1 - \frac{\gamma}{nL_\max} \right)^k (f(x^0) - f(x^\ast))\]
Note that all the conditions required for the convergence of CD in the general case are also met, so following that proof and defining
\[\Delta_k := \mathbb E[f(X^k)] - f(x^\ast)\]we have
\[\Delta_{k + 1} \le \Delta_k - \frac{1}{2n L_\max} \mathbb E[||\nabla f(X^k)||^2]\]As $f$ is $\gamma$-strongly convex, we have that
\[f(x) - f(x^\ast) \le \frac 1 {2\gamma} ||\nabla f(x)||^2\]for all $x$. Therefore, letting $x := X^k$ and passing to total expectation, we have
\[\Delta_k = \mathbb E[f(X^k)] - f(x^\ast) \le \frac{1}{2\gamma} \mathbb E[||\nabla f(X^k)||^2]\]Hence, by substituting into the above inequality $\Delta _ {k+1} \le \Delta _ k - \cdots$, we obtain
\[\Delta_{k+1} \le \Delta_k - \frac{2\gamma}{2nL_\max} \Delta_k\]where $k \ge 0$, and hence
\[\Delta_{k+1} \le \left( 1 - \frac{\gamma}{nL_\max} \right) \Delta_k\]for all $k \ge 0$ also. Since $0 < \gamma \le L \le nL _ \max$, the convergence factor lies in $[0, 1)$ and so the result follows inductively.
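As an illustration (not part of the notes; the strongly convex quadratic, the choice $\gamma = \lambda _ \min(A)$ and the Monte Carlo averaging over repeated runs are assumptions), the empirical expected gap of randomised CD can be compared against the theoretical envelope $\left(1 - \frac{\gamma}{nL _ \max}\right)^k (f(x^0) - f(x^\ast))$:

```python
import numpy as np

# Randomised CD on the strongly convex quadratic f(x) = 0.5 x^T A x - b^T x,
# comparing a Monte Carlo estimate of E[f(X^k)] - f(x*) with the rate
# (1 - gamma / (n * L_max))^k * (f(x^0) - f(x*)).
rng = np.random.default_rng(2)
n, iters, repeats = 10, 400, 300
M = rng.standard_normal((n, n))
A = M @ M.T / n + np.eye(n)            # symmetric positive definite
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
gamma, L_max = np.linalg.eigvalsh(A).min(), np.diag(A).max()
alpha = 1.0 / L_max
x0 = np.ones(n)

gaps = np.zeros(iters + 1)
for _ in range(repeats):
    x = x0.copy()
    for k in range(iters):
        gaps[k] += f(x) - f(x_star)
        i = int(rng.integers(n))
        x[i] -= alpha * (A[i] @ x - b[i])   # uses a single partial derivative
    gaps[iters] += f(x) - f(x_star)
gaps /= repeats                             # empirical estimate of E[f(X^k)] - f(x*)

for k in (0, 100, 200, 400):
    bound = (1 - gamma / (n * L_max)) ** k * (f(x0) - f(x_star))
    print(k, gaps[k], bound)                # the empirical gap should sit below the bound
```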
Convergence of cyclic CD
Example of failure
Give an @example of an objective function $f : \mathbb R^3 \to \mathbb R$ on which the cyclic CD method fails to converge to a stationary point and describe its properties.
This function is non-convex, continuously differentiable and its minimisers are at the corners $(1, 1, 1)^\top$ and $(-1, -1, -1)^\top$ of the unit cube.
General case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
where $\alpha _ k$ is chosen to minimise $f$ along that direction.
@State a result concerning the convergence in this case.
Suppose:
- $f$ is continuously differentiable
- $f$ is bounded below
- For each $i$, the minimum $\min _ {t \in \mathbb R} f(x _ 1, x _ 2, \ldots, x _ {i-1}, t, x _ {i+1}, \ldots, x _ n)$ is uniquely attained for all $x _ 1, \ldots, x _ {i-1}, x _ {i+1}, \ldots, x _ n$.
Then:
- Every limit point of the sequence of iterates $x^k$ where $k = n, 2n, 3n, \ldots$ is a stationary point of $f$
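For intuition, on a symmetric positive definite quadratic the exact coordinate minimisation has a closed form, and cyclic CD with exact line search reduces to the Gauss-Seidel iteration (a sketch under this quadratic assumption, not part of the notes):

```python
import numpy as np

# Cyclic CD with exact coordinate minimisation on f(x) = 0.5 x^T A x - b^T x,
# A symmetric positive definite. Minimising over coordinate i with the others
# fixed gives x_i <- x_i - (A x - b)_i / A_ii, i.e. the Gauss-Seidel update.
rng = np.random.default_rng(3)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

x = np.zeros(n)
for sweep in range(50):                # each sweep visits coordinates 1, ..., n in order
    for i in range(n):
        x[i] -= (A[i] @ x - b[i]) / A[i, i]

print(np.allclose(x, np.linalg.solve(A, b)))   # expected: True
```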
Convex case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
where $\alpha _ k = \frac{1}{L _ \max}$.
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
Likewise, in order to prove convergence in the convex case, we also have the additional assumption that there is some finite minimiser $x^\ast$ and constant $D$ such that for all $x \in \mathbb R^n$ where $f(x) \le f(x^0)$, we have $ \vert \vert x - x^\ast \vert \vert \le D$.
@State a result concerning the convergence in the convex case.
Suppose:
- $f$ is as above
- $f$ is convex
Then for $k \in \{n ,2n, 3n, \ldots\}$
\[f(x^k) - f(x^\ast) \le \frac{4n L_\max D^2 \left(1 + n \frac{L^2}{L^2_\max}\right)}{k+8}\]
$\gamma$-strongly convex case
Suppose we wish to solve the optimisation problem
\[\min_{x \in \mathbb R^n} f(x)\]
where $f : \mathbb R^n \to \mathbb R$ is continuously differentiable. In this context, one version of the cyclic coordinate descent algorithm is as follows:
Pick some $x^0 \in \mathbb R^n$, then for $k = 0, 1, 2, \ldots$ repeat:
(1): Select $\mathcal B _ k = \{(k \bmod n) + 1\}$.
(2): Calculate $g^k = \nabla _ {\mathcal B _ k} f(x^k)$ where this is defined component-wise as
\[(g^k) _ i = \begin{cases}\frac{\partial f}{\partial x _ i} (x^k) &i \in \mathcal B _ k \\\\ 0 &\text{otherwise}\end{cases}\]
(3): Calculate
\[x^{k+1} = x^k - \alpha _ k g^k\]
where $\alpha _ k = \frac{1}{L _ \max}$.
In order to prove convergence, we make the assumption that each partial derivative of $f$ is “directionally $L _ i$-Lipschitz”: for each $i$, for all $x \in \mathbb R^n$ and $t \in \mathbb R$,
\[\left| \frac{\partial f}{\partial x_i}(x + te_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |t|\]
and define
\[L_\max := \max_{i \in \\{ 1, \ldots, n \\} } L_i\]
@State a result concerning the convergence in the $\gamma$-strongly convex case.
Suppose:
- $f$ is as above
- $f$ is $\gamma$-strongly convex
Then for $k \in \{n ,2n, 3n, \ldots\}$
\[f(x^k) - f(x^\ast) \le \left(1 - \frac{\gamma}{2L_\max (1 + nL^2 / L_\max^2)}\right)^{k/n} (f(x^0) - f(x^\ast))\]