Uncertainty in Deep Learning MT25, Kullback-Leibler divergence
Flashcards
@Define the Kullback-Leibler (KL) divergence between two probability distributions $q$, $p$.
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
@Justify that if the two distributions are the same, we get exactly $0$.
If $q = p$, then $q(x)/p(x) = 1$ wherever $q$ has mass, so the integrand vanishes since $\log 1 = 0$, and the divergence is exactly $0$.
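As a quick numerical sanity check of this fact (a sketch; the `kl` helper below is ours, not from the notes):

```python
import numpy as np

# Discrete KL with natural log; terms with q(x) = 0 are dropped,
# consistent with the limit argument later in these notes
def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.2, 0.5, 0.3])
print(kl(p, p))  # 0.0 -- the log-ratio is exactly 1 everywhere
```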
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
@Prove that the KL divergence between any two distributions is non-negative.
Two proofs:
Via Jensen’s inequality:
\[\begin{aligned} \text{KL}(q, p) &= \mathbb E _ {q(x)} \left[ \log \frac{q(x)}{p(x)} \right] \\ &= \mathbb E _ {q(x)} \left[ -\log \frac{p(x)}{q(x)} \right] \\ &\ge - \log \mathbb E _ {q(x)} \left[ \frac{p(x)}{q(x)} \right] &(\text{Jensen's, as } -\log \text{ is convex})\\ &= - \log \int q(x) \frac{p(x)}{q(x)} \text dx \\ &= -\log \int p(x) \text dx \\ &= - \log 1 \\ &= 0 \end{aligned}\]
Without Jensen’s inequality:
We first show $\log x \le x - 1$ for all $x > 0$. Define $f(x) := \log x - x + 1$. Then $f'(x) = 1/x - 1 = 0$ exactly when $x = 1$, and since $f''(x) = -1/x^2 < 0$, $x = 1$ is a global maximum with $f(1) = 0$. So $\log x \le x - 1$. Then
\[\begin{aligned} -\text{KL}(q, p) &= \int q(x) \log \frac{p(x)}{q(x)} \text dx \\ &\le \int q(x) \left(\frac{p(x)}{q(x)} - 1\right) \text dx \\ &= \int p(x) \text dx - \int q(x) \text dx \\ &= 0 \end{aligned}\]so $\text{KL}(q, p) \ge 0$.
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
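The non-negativity proved above can be spot-checked numerically on random discrete distributions (a sketch; the `kl` helper is ours, not from the notes):

```python
import numpy as np

# Discrete KL with natural log; terms with q(x) = 0 are dropped
def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

rng = np.random.default_rng(0)
for _ in range(1000):
    q = rng.random(5); q /= q.sum()
    p = rng.random(5); p /= p.sum()
    assert kl(q, p) >= 0.0  # non-negativity holds on every draw
print("ok")
```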
@Justify that the KL divergence is not symmetric.
Consider the discrete distributions:
- $q = \left[ \frac 3 8, \frac 4 8, \frac 1 8 \right]$
- $p = \left[\frac 1 8, \frac 3 8, \frac 4 8\right]$
Direct computation gives $\text{KL}(q, p) \ne \text{KL}(p, q)$: swapping $q$ and $p$ both reweights the terms and inverts each log-ratio, and in general these effects do not cancel.
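Computing both directions for these exact distributions makes the asymmetry concrete (a sketch; the `kl` helper is ours, not from the notes):

```python
import numpy as np

# Discrete KL with natural log; terms with q(x) = 0 are dropped
def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = [3/8, 4/8, 1/8]
p = [1/8, 3/8, 4/8]
forward = kl(q, p)
reverse = kl(p, q)
print(forward, reverse)  # two different positive numbers
```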
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
@Prove that if some $q(x) = 0$, then it is ignored in the KL divergence.
Since $\log 0$ is undefined, we have to take a limit. Consider
\[\begin{aligned} \lim _ {\varepsilon \to 0^+} \varepsilon \log \frac{\varepsilon}{k} &= \lim _ {\varepsilon \to 0^+} \left( \varepsilon \log \varepsilon - \varepsilon \log k \right) \\ &= \lim _ {\varepsilon \to 0^+} \left( \frac{\log \varepsilon}{1/\varepsilon} - \varepsilon \log k \right) \\ &= \lim _ {\varepsilon \to 0^+} \left( \frac{1/\varepsilon}{-1/\varepsilon^2} - \varepsilon \log k \right) &(\text{L'H\^opital's rule}) \\ &= \lim _ {\varepsilon \to 0^+} \left( -\varepsilon - \varepsilon \log k \right) \\ &= 0 \end{aligned}\]
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
@Prove that if some $q(x) > 0$, it must be that $p(x) > 0$ for the KL to be finite.
In the discrete case, suppose there exists some index $i$ with $q _ i > 0$ but $p _ i = 0$. Then the sum contains the term $q _ i \log (q _ i / p _ i)$, which diverges to $+\infty$ as $p _ i \to 0^+$, so the KL is infinite. Hence finiteness requires $p _ i > 0$ wherever $q _ i > 0$.
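In floating point this divergence shows up directly (a sketch; the example distributions are ours):

```python
import numpy as np

q = np.array([0.5, 0.5])
p = np.array([1.0, 0.0])  # p puts no mass where q does

# The term q[1] * log(q[1] / p[1]) involves division by zero,
# so the sum evaluates to +inf
with np.errstate(divide='ignore'):
    val = float(np.sum(q * np.log(q / p)))
print(val)  # inf
```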
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
Suppose that:
- $q(x) = \mathcal N(x \mid m _ 0, s _ 0^2)$
- $p(x) = \mathcal N(x \mid m _ 1, s _ 1^2)$
@Prove that then
\[\text{KL}(q,p) = \frac{1}{2} \left( \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 1 - m _ 0)^2}{s _ 1^2} - 1 + \log\frac{s _ 1^2}{s _ 0^2} \right)\]
We have
\[\begin{aligned} \text{KL}(q, p) &= \int q(x) \log \frac{q(x)}{p(x)} \text dx \\ &= \int q(x) \log\left( \frac{\frac{1}{\sqrt{2\pi s _ 0^2}} \exp\left(-\frac{(x-m _ 0)^2}{2s _ 0^2}\right)}{\frac{1}{\sqrt{2\pi s _ 1^2}} \exp\left(-\frac{(x-m _ 1)^2}{2s _ 1^2}\right)} \right) \text dx \\ &= \int q(x) \log\left( \frac{s _ 1}{s _ 0} \exp\left( -\frac{(x-m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right) \right) \text dx \\ &= \mathbb E _ {q(x)} \left[ \log\left( \frac{s _ 1}{s _ 0} \exp\left( -\frac{(x-m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right) \right) \right] \\ &= \mathbb E _ {q(x)} \left[ \log\left(\frac{s _ 1}{s _ 0}\right) - \frac{(x - m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right] \\ &= \log \frac{s _ 1}{s _ 0} - \frac{1}{2s _ 0^2} \mathbb E _ {q(x)}[(x - m _ 0)^2] + \frac{1}{2s _ 1^2} \mathbb E _ {q(x)}\left[ (x - m _ 1)^2 \right] \end{aligned}\]
Since $x \sim \mathcal N(m _ 0, s _ 0^2)$, we have:
- $\mathbb E _ {q(x)} [(x - m _ 0)^2] = s _ 0^2$, and
- $\mathbb E _ q[(x-m _ 1)^2] = s _ 0^2 + (m _ 0 - m _ 1)^2$
Therefore
\[\begin{aligned} \text{KL}(q, p) &= \log \frac{s _ 1}{s _ 0} - \frac 1 2 + \frac{1}{2s _ 1^2} \left( s _ 0^2 + (m _ 0 - m _ 1)^2 \right) \\ &= \frac 1 2 \left( \log\left(\frac{s _ 1^2}{s _ 0^2}\right) + \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 0 - m _ 1)^2}{s _ 1^2} - 1\right) \\ &= \frac{1}{2} \left( \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 1 - m _ 0)^2}{s _ 1^2} - 1 + \log\frac{s _ 1^2}{s _ 0^2} \right) \end{aligned}\]
Recall the KL divergence between two probability distributions $q, p$:
\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]
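The Gaussian closed form derived above can be checked against direct numerical integration of the defining integral (a sketch; the parameter values and helper names are ours):

```python
import numpy as np

# Closed-form KL between N(m0, s0^2) and N(m1, s1^2), as derived above
def kl_gauss(m0, s0, m1, s1):
    return 0.5 * (s0**2 / s1**2 + (m1 - m0)**2 / s1**2 - 1
                  + np.log(s1**2 / s0**2))

# Arbitrary example parameters, chosen for illustration
m0, s0, m1, s1 = 0.3, 0.8, -0.5, 1.4

# Riemann-sum approximation of the integral of q(x) log(q(x)/p(x))
x = np.linspace(-10.0, 10.0, 200001)
q = np.exp(-(x - m0)**2 / (2 * s0**2)) / np.sqrt(2 * np.pi * s0**2)
p = np.exp(-(x - m1)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
numeric = float(np.sum(q * np.log(q / p)) * (x[1] - x[0]))

exact = kl_gauss(m0, s0, m1, s1)
print(exact, numeric)  # the two values agree closely
```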
Suppose that:
- $X _ 1$ and $X _ 2$ are independent under both $q$ and $p$, with marginals $q _ 1, q _ 2$ and $p _ 1, p _ 2$ respectively
@Prove that then:
\[\text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) = \text{KL}(q(X _ 1), p(X _ 1)) + \text{KL}(q(X _ 2), p(X _ 2))\]
By independence, $q(x _ 1, x _ 2) = q _ 1(x _ 1) q _ 2(x _ 2)$ and $p(x _ 1, x _ 2) = p _ 1(x _ 1) p _ 2(x _ 2)$, so the log-ratio splits into a sum of two terms. For the first,
\[\begin{aligned} \mathbb E _ {q(x _ 1, x _ 2)} \left[ \log \frac{q _ 1(X _ 1)}{p _ 1(X _ 1)} \right] &= \iint q(x _ 1, x _ 2) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 \text dx _ 2 \\ &= \iint q _ 1(x _ 1) q _ 2(x _ 2) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 \text dx _ 2 \\ &= \int q _ 1(x _ 1) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 &\left(\textstyle\int q _ 2(x _ 2) \text dx _ 2 = 1\right) \\ &= \text{KL}(q(X _ 1), p(X _ 1)) \end{aligned}\]and similarly for the second term. Hence we have overall
\[\text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) = \text{KL}(q(X _ 1), p(X _ 1)) + \text{KL}(q(X _ 2), p(X _ 2))\]as required.
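The additivity result can be verified numerically for discrete marginals, building the independent joint as an outer product (a sketch; the `kl` helper and example marginals are ours):

```python
import numpy as np

# Discrete KL with natural log; terms with q(x) = 0 are dropped
def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Independent joint distributions are outer products of the marginals
q1, q2 = np.array([0.3, 0.7]), np.array([0.1, 0.9])
p1, p2 = np.array([0.6, 0.4]), np.array([0.5, 0.5])
q_joint = np.outer(q1, q2).ravel()
p_joint = np.outer(p1, p2).ravel()

lhs = kl(q_joint, p_joint)
rhs = kl(q1, p1) + kl(q2, p2)
print(lhs, rhs)  # equal up to floating-point error
```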