Uncertainty in Deep Learning MT25, Kullback-Leibler divergence
Flashcards
@Define the Kullback-Leibler (KL) divergence between two probability distributions $q$, $p$.
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
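The definition is easy to check numerically in the discrete case; here is a minimal Python sketch (the function name `kl_divergence` and the example distribution are illustrative, not from the course):

```python
import numpy as np

def kl_divergence(q, p):
    """Discrete KL(q, p) = sum_k q_k * log(q_k / p_k).

    Terms with q_k == 0 are skipped, following the convention 0 log 0 := 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(q, q))  # 0.0 -- identical distributions
```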
@Justify that if the two distributions are the same, we get exactly $0$.
If $q = p$, then $\log \frac{q(x)}{p(x)} = \log 1 = 0$ everywhere, so the integral is exactly $0$.
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
@Prove that the KL divergence between any two distributions is non-negative.
Two proofs:
Via Jensen’s inequality (writing the discrete case; sums become integrals for densities):
\[\begin{aligned} \text{KL}(q, p) &= \mathbb E _ {q(x)} \left[ \log \frac{q(x)}{p(x)} \right] \\ &= \mathbb E _ {q(x)} \left[ -\log \frac{p(x)}{q(x)} \right] \\ &\ge - \log \mathbb E _ {q(x)} \left[ \frac{p(x)}{q(x)} \right] &(\text{Jensen's, since } -\log \text{ is convex})\\ &= - \log \sum q(x) \frac{p(x)}{q(x)} \\ &= -\log \sum p(x) \\ &= - \log 1 \\ &= 0 \end{aligned}\]Without Jensen’s inequality:
We first show $\log x \le x - 1$. Define $f(x) := \log x - x + 1$. Then $f'(x) = \frac{1}{x} - 1 = 0$ exactly when $x = 1$, and since $f''(1) = -1 < 0$, the stationary point $f(1) = 0$ is a maximum. So $\log x \le x - 1$. Then
\[\begin{aligned} -\text{KL}(q, p) &= -\sum q(x) \log \frac{q(x)}{p(x)} \\ &= \sum q(x) \log \frac{p(x)}{q(x)} \\ &\le \sum q(x) \left( \frac{p(x)}{q(x)} - 1 \right)\\ &= \sum \left( p(x) - q(x) \right) \\ &= 0\\ \end{aligned}\]so
\[\text{KL}(q, p) \ge 0\]Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
@Justify that the KL divergence is not symmetric.
Consider:
- $q = \left[ \frac 3 8, \frac 4 8, \frac 1 8 \right]$
- $p = \left[\frac 1 8, \frac 3 8, \frac 4 8\right]$
Or, algebraically:
\[\begin{aligned} \text{KL}(p, q) - \text{KL}(q, p) &= \sum p _ k \log(p _ k / q _ k) - \sum q _ k \log(q _ k / p _ k) \\ &= \sum (p _ k + q _ k) \log(p _ k / q _ k) \end{aligned}\]and there exist $p, q$ for which this is nonzero.
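Evaluating both orderings on the example distributions above makes the asymmetry concrete (a small numerical sketch; values computed with NumPy):

```python
import numpy as np

# The example distributions from the card
q = np.array([3, 4, 1]) / 8
p = np.array([1, 3, 4]) / 8

def kl(a, b):
    # discrete KL(a, b) = sum_k a_k log(a_k / b_k)
    return float(np.sum(a * np.log(a / b)))

print(kl(q, p))  # ~0.3825
print(kl(p, q))  # ~0.4479 -- not equal, so KL is not symmetric
```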
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
@Prove that if some $q(x) = 0$, then it is ignored in the KL divergence.
Since $\log 0$ is undefined, we have to take a limit. Consider a term in the sum where $q _ k = \varepsilon$. Then
\[\begin{aligned} \lim _ {\varepsilon \to 0^+} \varepsilon \log \frac{\varepsilon}{p _ k} &= \lim _ {\varepsilon \to 0^+} \left( \varepsilon \log \varepsilon - \varepsilon \log p _ k \right) \\ &= \lim _ {\varepsilon \to 0^+} \frac{\log \varepsilon}{1/\varepsilon} - \lim _ {\varepsilon \to 0^+} \varepsilon \log p _ k \\ &= \lim _ {\varepsilon \to 0^+} \frac{1/\varepsilon}{-1/\varepsilon^2} - 0 &(\text{L'H\^opital}) \\ &= \lim _ {\varepsilon \to 0^+} (-\varepsilon) \\ &= 0 \end{aligned}\]So terms for which $q _ k = 0$ contribute nothing to the sum, under the convention $0 \log 0 := 0$.
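The limit can also be observed numerically: shrinking $\varepsilon$ drives the term towards $0$ (a small sketch; the value $p _ k = 0.4$ is an arbitrary example):

```python
import numpy as np

# eps * log(eps / p_k) -> 0 as eps -> 0+, matching the convention 0 log 0 := 0
p_k = 0.4
for eps in [1e-2, 1e-4, 1e-8, 1e-16]:
    print(eps * np.log(eps / p_k))  # magnitudes shrink towards 0
```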
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \sum q _ k \log \frac{q _ k}{p _ k}\]
@Prove that if some $q _ k > 0$, it must be that $p _ k > 0$ for the KL to be finite.
In the discrete case, suppose there exists some index $i$ with $q _ i > 0$ and $p _ i = 0$. Then the corresponding term $q _ i \log(q _ i / p _ i)$ diverges to $+\infty$ (taking the limit $p _ i \to 0^+$), so the KL is infinite. Hence for the KL to be finite we require $p _ i > 0$ whenever $q _ i > 0$.
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
Suppose that:
- $q(x) = \mathcal N(x \mid m _ 0, s _ 0^2)$
- $p(x) = \mathcal N(x \mid m _ 1, s _ 1^2)$
@Prove that then
\[\text{KL}(q,p) = \frac{1}{2} \left( \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 1 - m _ 0)^2}{s _ 1^2} - 1 - \log\frac{s _ 0^2}{s _ 1^2} \right)\]
We have
\[\begin{aligned} \text{KL}(q, p) &= \int q(x) \log \frac{q(x)}{p(x)} \text dx \\ &= \int q(x) \log\left( \frac{\frac{1}{\sqrt{2\pi s _ 0^2}} \exp\left(-\frac{(x-m _ 0)^2}{2s _ 0^2}\right)}{\frac{1}{\sqrt{2\pi s _ 1^2}} \exp\left(-\frac{(x-m _ 1)^2}{2s _ 1^2}\right)} \right) \text dx \\ &= \int q(x) \log\left( \frac{s _ 1}{s _ 0} \exp\left( -\frac{(x-m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right) \right) \text dx \\ &= \mathbb E _ {q(x)} \left[ \log\left( \frac{s _ 1}{s _ 0} \exp\left( -\frac{(x-m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right) \right) \right] \\ &= \mathbb E _ {q(x)} \left[ \log\left(\frac{s _ 1}{s _ 0}\right) - \frac{(x - m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right] \\ &= \log \frac{s _ 1}{s _ 0} - \frac{1}{2s _ 0^2} \mathbb E _ {q(x)}[(x - m _ 0)^2] + \frac{1}{2s _ 1^2} \mathbb E _ {q(x)}\left[ (x - m _ 1)^2 \right] \end{aligned}\]Since $x \sim \mathcal N(m _ 0, s _ 0^2)$, we have:
- $\mathbb E _ {q(x)} [(x - m _ 0)^2] = s _ 0^2$, and
- $\mathbb E _ q[(x-m _ 1)^2] = \mathbb E _ q[((x - m _ 0) + (m _ 0 - m _ 1))^2] = s _ 0^2 + (m _ 0 - m _ 1)^2$, since the cross term vanishes ($\mathbb E _ q[x - m _ 0] = 0$)
Therefore
\[\begin{aligned} \text{KL}(q, p) &= \log \frac{s _ 1}{s _ 0} - \frac 1 2 + \frac{1}{2s _ 1^2} \left( s _ 0^2 + (m _ 0 - m _ 1)^2 \right) \\ &= \frac 1 2 \left( \log\left(\frac{s _ 1^2}{s _ 0^2}\right) + \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 0 - m _ 1)^2}{s _ 1^2} - 1\right) \\ &= \frac{1}{2} \left( \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 1 - m _ 0)^2}{s _ 1^2} - 1 + \log\frac{s _ 1^2}{s _ 0^2} \right) \\ &= \frac{1}{2} \left( \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 1 - m _ 0)^2}{s _ 1^2} - 1 - \log\frac{s _ 0^2}{s _ 1^2} \right) \end{aligned}\]Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
Suppose that:
- $X _ 1$ and $X _ 2$ are independent under $p$ and $q$, i.e.
- $q(X _ 1, X _ 2) = q _ 1(X _ 1)q _ 2(X _ 2)$
- $p(X _ 1, X _ 2) = p _ 1(X _ 1) p _ 2(X _ 2)$
@Prove that then:
\[\text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) = \text{KL}(q(X _ 1), p(X _ 1)) + \text{KL}(q(X _ 2), p(X _ 2))\]
- $q(X _ 1, X _ 2) = q _ 1(X _ 1)q _ 2(X _ 2)$
- $p(X _ 1, X _ 2) = p _ 1(X _ 1) p _ 2(X _ 2)$
Then since $\log \frac{q(X _ 1, X _ 2)}{p(X _ 1, X _ 2)} = \log \frac{q _ 1(X _ 1)}{p _ 1(X _ 1)} + \log \frac{q _ 2(X _ 2)}{p _ 2(X _ 2)}$, the KL splits into two expectations. For the first:
\[\begin{aligned} \mathbb E _ {q(x)} \left[ \log \frac{q _ 1(X _ 1)}{p _ 1(X _ 1)} \right] &= \iint q(x _ 1, x _ 2) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 \text dx _ 2 \\ &= \iint q _ 1(x _ 1) q _ 2(x _ 2) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 \text dx _ 2 \\ &= \int q _ 1(x _ 1) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 &(\text{marg. over }x _ 2) \\ &= \text{KL}(q(X _ 1), p(X _ 1)) \end{aligned}\]and similarly for the other term we have
\[\mathbb E _ {q(x)} \left[ \log \frac{q _ 2(X _ 2)}{p _ 2(X _ 2)} \right] = \text{KL}(q(X _ 2), p(X _ 2))\]Hence we have overall
\[\text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) = \text{KL}(q(X _ 1), p(X _ 1)) + \text{KL}(q(X _ 2), p(X _ 2))\]as required.
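The additivity result is easy to sanity-check on discrete marginals (a numerical sketch; the marginal distributions below are arbitrary examples):

```python
import numpy as np

def kl(a, b):
    # discrete KL(a, b) = sum_k a_k log(a_k / b_k)
    return float(np.sum(a * np.log(a / b)))

# Arbitrary example marginals
q1, p1 = np.array([0.3, 0.7]), np.array([0.5, 0.5])
q2, p2 = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])

# Joints under independence: outer products, flattened to vectors
q_joint = np.outer(q1, q2).ravel()
p_joint = np.outer(p1, p2).ravel()

# KL of the joint equals the sum of the marginal KLs
assert abs(kl(q_joint, p_joint) - (kl(q1, p1) + kl(q2, p2))) < 1e-12
```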
Suppose we have two $K$-dimensional multivariate Gaussians
- $q(\pmb x) = \mathcal N(\pmb x \mid \pmb m _ 0, \mathbf S _ 0)$, $\mathbf S _ 0 = \text{diag}(s _ {0,1}^2, \ldots, s _ {0,K}^2)$
- $p(\pmb x) = \mathcal N(\pmb x \mid \pmb m _ 1, \mathbf S _ 1)$, $\mathbf S _ 1 = \text{diag}(s _ {1,1}^2, \ldots, s _ {1,K}^2)$
@Prove that
\[\text{KL}(q,p) = \frac{1}{2} \sum _ k \left( \frac{s _ {0,k}^2}{s _ {1,k}^2} + \frac{(m _ {1,k} - m _ {0,k})^2}{s _ {1,k}^2} - 1 + \log\frac{s _ {1,k}^2}{s _ {0,k}^2} \right)\]
Since these Gaussians are diagonal, we have
\[\begin{aligned} q(\pmb x) = \mathcal N(\pmb x \mid \pmb m _ 0, \mathbf S _ 0) = \prod _ k \mathcal N(x _ k \mid m _ {0,k}, s _ {0,k}^2) \\ p(\pmb x) = \mathcal N(\pmb x \mid \pmb m _ 1, \mathbf S _ 1) = \prod _ k \mathcal N(x _ k \mid m _ {1,k}, s _ {1,k}^2) \end{aligned}\]Then by the result that says for independent $X _ 1, X _ 2$ we have
\[\text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) = \text{KL}(q(X _ 1), p(X _ 1)) + \text{KL}(q(X _ 2), p(X _ 2))\](applied across all $K$ dimensions), it follows that we can sum the univariate KL divergences from the earlier card to obtain
\[\text{KL}(q,p) = \frac{1}{2} \sum _ k \left( \frac{s _ {0,k}^2}{s _ {1,k}^2} + \frac{(m _ {1,k} - m _ {0,k})^2}{s _ {1,k}^2} - 1 + \log\frac{s _ {1,k}^2}{s _ {0,k}^2} \right)\]
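As a check, the per-dimension sum can be compared against the general multivariate Gaussian KL formula evaluated at diagonal covariances (a numerical sketch; the means and standard deviations are arbitrary examples, and `kl_mvn` uses the standard full-covariance formula, which is not derived on this card):

```python
import numpy as np

def kl_gauss_1d(m0, s0, m1, s1):
    # per-dimension closed form from the card
    return 0.5 * (s0**2 / s1**2 + (m1 - m0)**2 / s1**2 - 1
                  + np.log(s1**2 / s0**2))

def kl_mvn(m0, S0, m1, S1):
    # general Gaussian KL: 0.5*(tr(S1^-1 S0) + d^T S1^-1 d - K + log det S1/det S0)
    K = len(m0)
    S1inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - K
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Arbitrary example parameters for K = 3 diagonal Gaussians
m0 = np.array([0.0, 1.0, -2.0]); s0 = np.array([1.0, 0.5, 2.0])
m1 = np.array([0.5, 0.0, -1.0]); s1 = np.array([1.5, 1.0, 1.0])

per_dim = sum(kl_gauss_1d(a, b, c, d) for a, b, c, d in zip(m0, s0, m1, s1))
full = kl_mvn(m0, np.diag(s0**2), m1, np.diag(s1**2))
assert abs(per_dim - full) < 1e-10
```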