Uncertainty in Deep Learning MT25, Kullback-Leibler divergence


Flashcards

@Define the Kullback-Leibler (KL) divergence between two probability distributions $q$, $p$.


\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

Recall the KL divergence between two probability distributions $q, p$:

\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

@Justify that if the two distributions are the same, we get exactly $0$.


If $q = p$, then $q(x)/p(x) = 1$ everywhere, so the integrand is $q(x) \log 1 = 0$ and the integral vanishes.
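As a quick numerical sanity check, here is a minimal sketch (the discrete distribution is a hypothetical example) showing that the KL divergence of a distribution with itself is exactly zero:

```python
import math

def kl(q, p):
    """Discrete KL divergence: sum_i q_i * log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q = [3/8, 4/8, 1/8]
print(kl(q, q))  # 0.0 -- every log term is log(1) = 0
```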

Recall the KL divergence between two probability distributions $q, p$:

\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

@Prove that the KL divergence between any two distributions is non-negative.


Two proofs, written for the discrete case (the continuous case replaces sums with integrals):

Via Jensen’s inequality:

\[\begin{aligned} \text{KL}(q, p) &= \mathbb E _ {q(x)} \left[ \log \frac{q(x)}{p(x)} \right] \\ &= \mathbb E _ {q(x)} \left[ -\log \frac{p(x)}{q(x)} \right] \\ &\ge - \log \mathbb E _ {q(x)} \left[ \frac{p(x)}{q(x)} \right] &(\text{Jensen's})\\ &= - \log \sum q(x) \frac{p(x)}{q(x)} \\ &= -\log \sum p(x) \\ &= - \log 1 \\ &= 0 \end{aligned}\]

Without Jensen’s inequality:

We first show $\log x \le x - 1$. Define $f(x) := \log x - x + 1$. Then $f'(x) = 1/x - 1 = 0$ when $x = 1$, and since $f''(1) = -1 < 0$, $f(1) = 0$ is a maximum. So $\log x \le x - 1$. Then

\[\begin{aligned} -\text{KL}(q, p) &= \sum q(x) \log \frac{p(x)}{q(x)} \\ &\le \sum q(x) \left(\frac{p(x)}{q(x)} - 1\right) \\ &= \sum p(x) - \sum q(x) \\ &= 0 \end{aligned}\]
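The non-negativity proved above can be checked numerically. This sketch draws random discrete distribution pairs (the dimension and sample count are arbitrary choices) and confirms the divergence is never negative:

```python
import math
import random

def kl(q, p):
    """Discrete KL divergence: sum_i q_i * log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def random_dist(n, rng):
    """A random distribution over n outcomes (strictly positive entries)."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(0)
for _ in range(1000):
    q = random_dist(5, rng)
    p = random_dist(5, rng)
    # Gibbs' inequality: KL >= 0 (tiny tolerance for floating-point rounding)
    assert kl(q, p) >= -1e-12
```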

Recall the KL divergence between two probability distributions $q, p$:

\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

@Justify that the KL divergence is not symmetric.


Consider the discrete distributions:

  • $q = \left[ \frac 3 8, \frac 4 8, \frac 1 8 \right]$
  • $p = \left[\frac 1 8, \frac 3 8, \frac 4 8\right]$

Direct computation gives $\text{KL}(q, p) \ne \text{KL}(p, q)$, so the divergence is not symmetric.
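A short sketch evaluating both directions for the two distributions above makes the asymmetry concrete:

```python
import math

def kl(q, p):
    """Discrete KL divergence: sum_i q_i * log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q = [3/8, 4/8, 1/8]
p = [1/8, 3/8, 4/8]

print(kl(q, p))  # ≈ 0.3825
print(kl(p, q))  # ≈ 0.4479 -- different, so KL is not symmetric
```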

Recall the KL divergence between two probability distributions $q, p$:

\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

@Prove that if some $q(x) = 0$, then it is ignored in the KL divergence.


Since $\log 0$ is undefined, the convention $0 \log 0 = 0$ is justified by a limit. Consider

\[\begin{aligned} \lim _ {\varepsilon \to 0^+} \varepsilon \log \frac{\varepsilon}{k} &= \lim _ {\varepsilon \to 0^+} \left( \varepsilon \log \varepsilon - \varepsilon \log k \right) \\ &= \lim _ {\varepsilon \to 0^+} \left( \frac{\log \varepsilon}{1/\varepsilon} - \varepsilon \log k \right) \\ &= \lim _ {\varepsilon \to 0^+} \left( \frac{1/\varepsilon}{-1/\varepsilon^2} - \varepsilon \log k \right) &(\text{L'H\^opital's}) \\ &= \lim _ {\varepsilon \to 0^+} \left( -\varepsilon - \varepsilon \log k \right) \\ &= 0 \end{aligned}\]
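The limit can be watched numerically. In this sketch the denominator $k = 0.25$ is an arbitrary choice; the term $\varepsilon \log(\varepsilon/k)$ shrinks toward $0$ as $\varepsilon \to 0^+$:

```python
import math

k = 0.25  # arbitrary fixed probability value p(x)
for eps in [1e-2, 1e-4, 1e-8, 1e-16]:
    # eps * log(eps / k) -> 0 as eps -> 0+, matching the limit above
    print(eps, eps * math.log(eps / k))
```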

Recall the KL divergence between two probability distributions $q, p$:

\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

@Prove that if some $q(x) > 0$, it must be that $p(x) > 0$ for the KL to be finite.


In the discrete case, suppose there exists some index $i$ with $q _ i > 0$ and $p _ i = 0$. Then the sum contains the term $q _ i \log (q _ i / p _ i)$, which diverges to $+\infty$, so the KL is infinite. Hence for the KL to be finite we require $p _ i > 0$ wherever $q _ i > 0$.
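The blow-up can be seen by letting one entry of $p$ shrink toward $0$ while the corresponding entry of $q$ stays fixed (the two-outcome distributions here are hypothetical examples):

```python
import math

def kl(q, p):
    """Discrete KL divergence: sum_i q_i * log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q = [0.5, 0.5]
# As p's second entry shrinks toward 0 while q's stays at 0.5,
# the KL divergence grows without bound.
for tiny in [1e-2, 1e-6, 1e-12]:
    p = [1 - tiny, tiny]
    print(tiny, kl(q, p))
```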

Recall the KL divergence between two probability distributions $q, p$:

\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

Suppose that:

  • $q(x) = \mathcal N(x \mid m _ 0, s _ 0^2)$
  • $p(x) = \mathcal N(x \mid m _ 1, s _ 1^2)$

@Prove that then

\[\text{KL}(q,p) = \frac{1}{2} \left( \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 1 - m _ 0)^2}{s _ 1^2} - 1 + \log\frac{s _ 1^2}{s _ 0^2} \right)\]

We have

\[\begin{aligned} \text{KL}(q, p) &= \int q(x) \log \frac{q(x)}{p(x)} \text dx \\ &= \int q(x) \log\left( \frac{\frac{1}{\sqrt{2\pi s _ 0^2}} \exp\left(-\frac{(x-m _ 0)^2}{2s _ 0^2}\right)}{\frac{1}{\sqrt{2\pi s _ 1^2}} \exp\left(-\frac{(x-m _ 1)^2}{2s _ 1^2}\right)} \right) \text dx \\ &= \int q(x) \log\left( \frac{s _ 1}{s _ 0} \exp\left( -\frac{(x-m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right) \right) \text dx \\ &= \mathbb E _ {q(x)} \left[ \log\left( \frac{s _ 1}{s _ 0} \exp\left( -\frac{(x-m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right) \right) \right] \\ &= \mathbb E _ {q(x)} \left[ \log\left(\frac{s _ 1}{s _ 0}\right) - \frac{(x - m _ 0)^2}{2s _ 0^2} + \frac{(x - m _ 1)^2}{2s _ 1^2} \right] \\ &= \log \frac{s _ 1}{s _ 0} - \frac{1}{2s _ 0^2} \mathbb E _ {q(x)}[(x - m _ 0)^2] + \frac{1}{2s _ 1^2} \mathbb E _ {q(x)}\left[ (x - m _ 1)^2 \right] \end{aligned}\]

Since $x \sim \mathcal N(m _ 0, s _ 0^2)$, we have:

  • $\mathbb E _ {q(x)} [(x - m _ 0)^2] = s _ 0^2$, and
  • $\mathbb E _ q[(x-m _ 1)^2] = s _ 0^2 + (m _ 0 - m _ 1)^2$

Therefore

\[\begin{aligned} \text{KL}(q, p) &= \log \frac{s _ 1}{s _ 0} - \frac 1 2 + \frac{1}{2s _ 1^2} \left( s _ 0^2 + (m _ 0 - m _ 1)^2 \right) \\ &= \frac 1 2 \left( \log\left(\frac{s _ 1^2}{s _ 0^2}\right) + \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 0 - m _ 1)^2}{s _ 1^2} - 1\right) \\ &= \frac{1}{2} \left( \frac{s _ 0^2}{s _ 1^2} + \frac{(m _ 1 - m _ 0)^2}{s _ 1^2} - 1 + \log\frac{s _ 1^2}{s _ 0^2} \right) \end{aligned}\]
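The closed form just derived can be cross-checked against a Monte Carlo estimate of $\mathbb E _ {q(x)}[\log q(x) - \log p(x)]$. In this sketch the parameter values and sample count are arbitrary choices:

```python
import math
import random

def gauss_kl(m0, s0, m1, s1):
    """Closed-form KL between N(m0, s0^2) and N(m1, s1^2), as derived above."""
    return 0.5 * (s0**2 / s1**2 + (m1 - m0)**2 / s1**2 - 1
                  + math.log(s1**2 / s0**2))

def log_pdf(x, m, s):
    """Log-density of N(m, s^2) at x."""
    return -0.5 * math.log(2 * math.pi * s**2) - (x - m)**2 / (2 * s**2)

rng = random.Random(0)
m0, s0, m1, s1 = 1.0, 2.0, -1.0, 3.0
n = 200_000
# Monte Carlo estimate: average log q(x) - log p(x) over samples x ~ q
est = sum(log_pdf(x, m0, s0) - log_pdf(x, m1, s1)
          for x in (rng.gauss(m0, s0) for _ in range(n))) / n
print(gauss_kl(m0, s0, m1, s1), est)  # the two should agree to ~2 decimal places
```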

Recall the KL divergence between two probability distributions $q, p$:

\[\text{KL}(q, p) = \int q(x) \log \frac{q(x)}{p(x)} \text dx\]

Suppose that:

  • $X _ 1$ and $X _ 2$ are independent under $p$ and $q$

@Prove that then:

\[\text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) = \text{KL}(q(X _ 1), p(X _ 1)) + \text{KL}(q(X _ 2), p(X _ 2))\]

\[\begin{aligned} \text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) &= \mathbb E _ {q(x)} \left[ \log \frac{q(X _ 1, X _ 2)}{p(X _ 1, X _ 2)} \right] \\ &= \mathbb E _ {q(x)} \left[ \log \frac{q _ 1(X _ 1)q _ 2(X _ 2)}{p _ 1(X _ 1) p _ 2(X _ 2)}\right] \\ &= \mathbb E _ {q(x)} \left[ \log \frac{q _ 1(X _ 1)}{p _ 1(X _ 1)} + \log \frac{q _ 2(X _ 2)}{p _ 2(X _ 2)} \right] \\ &= \mathbb E _ {q(x)} \left[ \log \frac{q _ 1(X _ 1)}{p _ 1(X _ 1)} \right] + \mathbb E _ {q(x)} \left[\log \frac{q _ 2(X _ 2)}{p _ 2(X _ 2)}\right] \end{aligned}\]

Then since

\[\begin{aligned} \mathbb E _ {q(x)} \left[ \log \frac{q _ 1(X _ 1)}{p _ 1(X _ 1)} \right] &= \iint q(x _ 1, x _ 2) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 \text dx _ 2 \\ &= \iint q _ 1(x _ 1) q _ 2(x _ 2) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 \text dx _ 2 \\ &= \int q _ 1(x _ 1) \log \frac{q _ 1(x _ 1)}{p _ 1(x _ 1)} \text dx _ 1 &\left(\textstyle\int q _ 2(x _ 2) \text dx _ 2 = 1\right) \\ &= \text{KL}(q(X _ 1), p(X _ 1)) \end{aligned}\]

and similarly for the other term. Hence we have overall

\[\text{KL}(q(X _ 1, X _ 2), p(X _ 1, X _ 2)) = \text{KL}(q(X _ 1), p(X _ 1)) + \text{KL}(q(X _ 2), p(X _ 2))\]

as required.
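A discrete sketch of this additivity property: build joint distributions as outer products of hypothetical marginals (so $X _ 1$ and $X _ 2$ are independent under both) and compare the joint KL with the sum of marginal KLs:

```python
import math

def kl(q, p):
    """Discrete KL divergence: sum_i q_i * log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q1, q2 = [0.2, 0.8], [0.5, 0.3, 0.2]
p1, p2 = [0.6, 0.4], [0.1, 0.4, 0.5]

# Joint distributions as outer products, encoding independence.
q_joint = [a * b for a in q1 for b in q2]
p_joint = [a * b for a in p1 for b in p2]

print(kl(q_joint, p_joint))      # joint KL
print(kl(q1, p1) + kl(q2, p2))   # sum of marginal KLs -- should match
```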

Suppose we have two $K$-dimensional multivariate Gaussians

  • $q(\pmb x) = \mathcal N(\pmb x \mid \pmb m _ 0, \mathbf S _ 0)$, $\mathbf S _ 0 = \text{diag}(s _ {0,1}^2, \ldots, s _ {0,K}^2)$
  • $p(\pmb x) = \mathcal N(\pmb x \mid \pmb m _ 1, \mathbf S _ 1)$, $\mathbf S _ 1 = \text{diag}(s _ {1,1}^2, \ldots, s _ {1,K}^2)$

@State $\text{KL}(q, p)$.


\[\text{KL}(q,p) = \frac{1}{2} \sum _ k \left( \frac{s _ {0,k}^2}{s _ {1,k}^2} + \frac{(m _ {1,k} - m _ {0, k})^2}{s _ {1,k}^2} - 1 + \log\frac{s _ {1,k}^2}{s _ {0,k}^2} \right)\]
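Because diagonal covariances make the dimensions independent, this is just the sum of per-dimension univariate Gaussian KLs. A minimal sketch with hypothetical three-dimensional parameters:

```python
import math

def gauss_kl_1d(m0, s0, m1, s1):
    """Closed-form KL between N(m0, s0^2) and N(m1, s1^2)."""
    return 0.5 * (s0**2 / s1**2 + (m1 - m0)**2 / s1**2 - 1
                  + math.log(s1**2 / s0**2))

# Hypothetical diagonal-covariance Gaussians: means and standard
# deviations per dimension (arbitrary example values).
m0, s0 = [0.0, 1.0, -2.0], [1.0, 0.5, 2.0]
m1, s1 = [0.5, 0.0, 0.0], [2.0, 1.0, 1.0]

# Multivariate KL = sum of per-dimension univariate KLs.
total = sum(gauss_kl_1d(a, b, c, d) for a, b, c, d in zip(m0, s0, m1, s1))
print(total)
```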


