Lecture - Uncertainty in Deep Learning MT25, Some very useful mathematical tools


In this lecture, we cover some mathematical tools that are useful for approximating the posterior distribution in situations more complicated than everything being Gaussian. Specifically, these tools are:

  • Monte-Carlo integration, in the context of estimating expectations
  • (Estimating) integral derivatives, since optimising the ELBO requires us to estimate the gradients of an integral rather than the integral itself
  • The reparameterisation trick
  • Stochastic optimisation

Monte-Carlo integration

How would Monte-Carlo integration estimate $E := \mathbb E _ {p(x)}[f(x)]$?


Sample $\hat x _ 1, \ldots, \hat x _ T \sim p(x)$, and then calculate:

\[\hat E := \frac 1 T \sum _ t f(\hat x _ t)\]
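
For concreteness, a minimal sketch of this estimator in JAX, assuming the illustrative choices $p(x) = \mathcal N(0, 1)$ and $f(x) = x^2$ (so the true value is $E = 1$):

```python
# Monte-Carlo estimate of E = E_{p(x)}[f(x)].
# Illustrative assumptions: p(x) = N(0, 1) and f(x) = x**2, so E = 1.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
T = 10_000

x_hat = jax.random.normal(key, (T,))  # samples x_1, ..., x_T ~ p(x)
E_hat = jnp.mean(x_hat ** 2)          # (1/T) * sum_t f(x_hat_t)

print(E_hat)  # close to the true value E = 1
```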

Integral derivative estimation

In a stochastic variational inference setting, we might wish to calculate gradients of the function

\[L(\mu, \sigma) = \int (W +W ^2) \mathcal N(W \mid \mu, \sigma^2) \text dW\]

Suppose we want to find the gradient of $L$ at $(\mu, \sigma) = (0, 1)$. You might assume that it’s possible to estimate this gradient by first estimating the integral with Monte-Carlo integration, and then differentiating the estimate with respect to the parameters.

Why does this not work, and what should you do instead to compute the correct gradients?


If you do this, you will get $0$: once the samples $\hat W _ 1, \ldots, \hat W _ T$ have been drawn, the Monte-Carlo estimate no longer depends on $\mu$ or $\sigma$, so its derivatives with respect to them vanish. The issue is that $W$ depends on $\mu, \sigma$ through the sampling process, so you can’t treat the samples as constants.
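
A minimal sketch of this failure mode, under the same illustrative setup as above: the samples are drawn once, outside the function being differentiated, so the estimate is constant in $(\mu, \sigma)$ and autodiff returns exactly zero:

```python
# Naive approach: estimate the integral first, then differentiate.
# The samples W_hat are fixed before differentiation, so L_hat is
# constant in (mu, sigma) and both gradients come out as exactly 0.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
W_hat = jax.random.normal(key, (10_000,))  # W_t ~ N(0, 1), drawn once

def L_hat(mu, sigma):
    # the Monte-Carlo estimate never touches mu or sigma
    return jnp.mean(W_hat + W_hat ** 2)

print(jax.grad(L_hat, argnums=(0, 1))(0.0, 1.0))  # (0.0, 0.0)
```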

To get around this, you reparameterise, expressing $W$ as a deterministic transformation of a parameter-free noise variable $\epsilon$:

\[W = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal N(0, 1)\]

Then, writing $f(W) := W + W ^2$ for the integrand, we have

\[L(\mu, \sigma) = \mathbb E _ {\epsilon \sim \mathcal N(0, 1)}[f(\mu + \sigma \epsilon)]\]

and hence, differentiating under the expectation,

\[\frac{\partial L}{\partial \mu} = \mathbb E _ {\epsilon \sim \mathcal N(0, 1)} [f'(\mu + \sigma \epsilon)], \qquad \frac{\partial L}{\partial \sigma} = \mathbb E _ {\epsilon \sim \mathcal N(0, 1)} [f'(\mu + \sigma \epsilon) \epsilon]\]

both of which can be estimated by Monte-Carlo integration over samples of $\epsilon$.
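
In this example $L(\mu, \sigma) = \mu + \mu^2 + \sigma^2$ in closed form, so the gradient at $(0, 1)$ should be $(1, 2)$; a minimal sketch of the reparameterised estimator (the sample size is an arbitrary illustrative choice) recovers this:

```python
# Reparameterised gradient estimate for f(W) = W + W**2.
# Closed form: L(mu, sigma) = mu + mu**2 + sigma**2, so the gradient
# at (mu, sigma) = (0, 1) is (1, 2).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
eps = jax.random.normal(key, (100_000,))  # parameter-free noise, eps ~ N(0, 1)

def L_hat(mu, sigma):
    W = mu + sigma * eps                  # W ~ N(mu, sigma^2) by construction
    return jnp.mean(W + W ** 2)

print(jax.grad(L_hat, argnums=(0, 1))(0.0, 1.0))  # approx (1.0, 2.0)
```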

Reparameterisation trick

Describe the general setup of the reparameterisation trick.


  • We have a function $f(W)$ and a variational distribution $q _ \theta(W)$
  • We want to estimate gradients of $L(\theta) := \int f(W) q _ \theta(W) \text dW$
  • We reparameterise $W$ as $W = g(\theta, \epsilon)$, where the noise $\epsilon \sim p(\epsilon)$ does not depend on $\theta$, $g$ is differentiable with respect to $\theta$, and $g(\theta, \epsilon)$ is distributed according to $q _ \theta(W)$

Then:

\[L(\theta) = \int f(W) q _ \theta(W) \text dW = \mathbb E _ \epsilon [f(g(\theta, \epsilon))]\]

and taking gradients

\[\begin{aligned} \frac{\partial}{\partial \theta}(L(\theta)) &= \mathbb E _ \epsilon [\nabla _ \theta f(g(\theta, \epsilon))] \\ &= \mathbb E _ \epsilon \left[ f'(g(\theta, \epsilon)) \frac{\partial g(\theta, \epsilon)}{\partial \theta} \right] \end{aligned}\]

so we have the Monte-Carlo estimate

\[\hat G(\theta) = \frac{1}{T} \sum^T _ {t = 1} f'(g(\theta, \epsilon _ t)) \frac{\partial g(\theta, \epsilon _ t)}{\partial \theta}\]

where the $\epsilon _ t \sim p(\epsilon)$ are sampled independently.
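
A minimal sketch of the general estimator $\hat G(\theta)$, under the assumed illustrative choice of a Gaussian $q _ \theta$ with $\theta = (\mu, \log \sigma)$ and $g(\theta, \epsilon) = \mu + e^{\log \sigma} \epsilon$; autodiff applies the chain-rule factor $\partial g / \partial \theta$ for us:

```python
# General reparameterised gradient estimator G_hat(theta).
# Illustrative assumptions: q_theta Gaussian with theta = (mu, log_sigma),
# g(theta, eps) = mu + exp(log_sigma) * eps, and p(eps) = N(0, 1).
import jax
import jax.numpy as jnp

def g(theta, eps):
    mu, log_sigma = theta[0], theta[1]
    return mu + jnp.exp(log_sigma) * eps  # W = g(theta, eps)

def G_hat(f, theta, eps):
    # gradient of the Monte-Carlo estimate (1/T) * sum_t f(g(theta, eps_t))
    L_hat = lambda th: jnp.mean(f(g(th, eps)))
    return jax.grad(L_hat)(theta)

key = jax.random.PRNGKey(0)
eps_t = jax.random.normal(key, (100_000,))  # eps_t ~ p(eps)
f = lambda W: W + W ** 2                    # the running example

# theta = (0, 0) means mu = 0, sigma = 1; expect a gradient of approx (1, 2)
print(G_hat(f, jnp.array([0.0, 0.0]), eps_t))
```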



