# Lecture - Uncertainty in Deep Learning MT25, Some very useful mathematical tools

> Source: https://ollybritton.com/notes/uni/part-c/mt25/uncertainty-in-deep-learning/lectures/tools/ · Updated: 2025-11-20 · Tags: uni, notes


In this lecture, we cover some mathematical tools that are useful for approximating the posterior distribution in situations more complicated than the all-Gaussian case. Specifically, these tools are:

- Monte-Carlo integration, in the context of estimating expectations
- (Estimating) integral derivatives, since optimising the ELBO requires us to estimate the gradients of an integral rather than the integral itself
- The reparameterisation trick
- Stochastic optimisation

### Monte-Carlo integration
How would Monte-Carlo integration estimate $E := \mathbb E_{p(x)}[f(x)]$?::

Sample $\hat x_1, \ldots, \hat x_T \sim p(x)$, and then calculate:
$$
\hat E := \frac 1 T \sum^T_{t = 1} f(\hat x_t)
$$
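As a minimal sketch of this estimator, assuming NumPy is available: the helper `mc_expectation` and its arguments are hypothetical names chosen for illustration, and the choice of $p(x) = \mathcal N(0, 1)$ with $f(x) = x^2$ is only an example (its true expectation is $1$).

```python
import numpy as np

def mc_expectation(f, sampler, num_samples, rng):
    """Estimate E_{p(x)}[f(x)] by averaging f over draws from p.

    `sampler(rng, n)` is assumed to return n independent samples from p(x).
    """
    samples = sampler(rng, num_samples)
    return np.mean(f(samples))

rng = np.random.default_rng(0)

# Illustrative choice: p(x) = N(0, 1) and f(x) = x^2, so the true value is 1.
estimate = mc_expectation(
    f=lambda x: x**2,
    sampler=lambda rng, n: rng.standard_normal(n),
    num_samples=100_000,
    rng=rng,
)
print(estimate)
```

The estimate gets closer to the true expectation as $T$ grows, with error shrinking roughly like $1/\sqrt T$.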

### Integral derivative estimation
In a stochastic variational inference setting, we might wish to calculate gradients of the function
$$
L(\mu, \sigma) = \int (W +W ^2) \mathcal N(W \mid \mu, \sigma^2) \text dW
$$
Suppose we want to find the gradient of $L$ with respect to $\mu$ and $\sigma$ at $(\mu, \sigma) = (0, 1)$. You might assume that it's possible to estimate this gradient by first estimating the integral with Monte-Carlo integration, and then differentiating the estimate with respect to the parameters.

Why does this not work, and what should you do instead to compute the correct gradients?::

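If you do this, you will get $0$: once the samples $\hat W_t$ have been drawn, the estimate $\frac 1 T \sum_t (\hat W_t + \hat W_t^2)$ contains no explicit $\mu$ or $\sigma$, so its derivative with respect to either is zero. The issue is that $W$ depends on $\mu, \sigma$ through the sampling process, so you can't treat the samples as constants.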
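The failure can be seen numerically. The sketch below (NumPy assumed; `naive_estimate` is a hypothetical name) freezes the samples and then takes a central finite difference of the Monte-Carlo estimate with respect to $\mu$. Because the frozen samples carry no dependence on $\mu$, the difference is exactly zero.

```python
import numpy as np

def integrand(w):
    # The integrand from the example above: W + W^2.
    return w + w**2

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0

# Draw the samples once; from here on they are treated as constants.
samples = rng.normal(mu, sigma, size=10_000)

def naive_estimate(mu, sigma):
    # mu and sigma are deliberately unused: the frozen samples do not
    # change when the parameters change, which is exactly the problem.
    return integrand(samples).mean()

h = 1e-4
naive_grad_mu = (naive_estimate(mu + h, sigma) - naive_estimate(mu - h, sigma)) / (2 * h)
print(naive_grad_mu)
```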

To get around this, you reparameterise by expressing $W$ as a deterministic transformation of a parameter-free noise variable $\epsilon$:
$$
W = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal N(0, 1)
$$
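Then, writing $f(W) := W + W^2$ for the integrand, we have
$$
L(\mu, \sigma) = \mathbb E_{\epsilon \sim \mathcal N(0, 1)}[f(\mu + \sigma \epsilon)]
$$
and hence, since the sampling distribution no longer depends on the parameters, we can differentiate inside the expectation:
$$
\frac{\partial L}{\partial \mu} = \mathbb E_{\epsilon \sim \mathcal N(0, 1)} [f'(\mu + \sigma \epsilon)], \quad \frac{\partial L}{\partial \sigma} = \mathbb E_{\epsilon \sim \mathcal N(0, 1)} [f'(\mu + \sigma \epsilon) \, \epsilon]
$$
At $(\mu, \sigma) = (0, 1)$, where $f'(W) = 1 + 2W$, these evaluate to $\mathbb E[1 + 2\epsilon] = 1$ and $\mathbb E[\epsilon + 2\epsilon^2] = 2$.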
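A minimal sketch of the reparameterised estimate for this example, assuming NumPy: the function name `reparam_grads` is hypothetical. The integrand $W + W^2$ has derivative $1 + 2W$, and by the chain rule each sample's gradient with respect to $\sigma$ picks up an extra factor of $\epsilon$. In this case the integral has the closed form $L(\mu, \sigma) = \mu + \mu^2 + \sigma^2$, so the estimates at $(0, 1)$ should be close to $1$ and $2$.

```python
import numpy as np

def integrand_prime(w):
    # Derivative of the integrand W + W^2.
    return 1.0 + 2.0 * w

def reparam_grads(mu, sigma, num_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(num_samples)   # parameter-free noise
    w = mu + sigma * eps                     # W = mu + sigma * eps
    grad_mu = integrand_prime(w).mean()            # dW/dmu = 1
    grad_sigma = (integrand_prime(w) * eps).mean() # dW/dsigma = eps
    return grad_mu, grad_sigma

grad_mu, grad_sigma = reparam_grads(0.0, 1.0)
print(grad_mu, grad_sigma)
```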

### Reparameterisation trick
@Describe the general setup of the reparameterisation trick.::

- We have a function $f(W)$ and a variational distribution $q_\theta(W)$
- We want to estimate gradients of $L(\theta) := \int f(W) q_\theta(W) \text dW$
- We reparameterise $W$ as $W = g(\theta, \epsilon)$, where $\epsilon \sim p(\epsilon)$ does not depend on $\theta$ and $g$ is differentiable with respect to $\theta$

Then:
$$
L(\theta) = \int f(W) q_\theta(W) \text dW = \mathbb E_\epsilon [f(g(\theta, \epsilon))]
$$
and taking gradients
$$
\begin{aligned}
\frac{\partial}{\partial \theta}(L(\theta)) &= \mathbb E_\epsilon [\nabla_\theta f(g(\theta, \epsilon))] \\
&= \mathbb E_\epsilon \left[ f'(g(\theta, \epsilon)) \frac{\partial g(\theta, \epsilon)}{\partial \theta} \right]
\end{aligned}
$$
so we have the Monte-Carlo estimate
$$
\hat G(\theta) = \frac{1}{T} \sum^T_{t = 1} f'(g(\theta, \epsilon_t)) \frac{\partial g(\theta, \epsilon_t)}{\partial \theta}
$$
where the $\epsilon_t \sim p(\epsilon)$ are drawn independently.
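The estimator $\hat G(\theta)$ can be sketched generically as follows, assuming NumPy and a scalar $W$. All names here (`reparam_gradient`, `f_prime`, `g`, `dg_dtheta`) are hypothetical, and the derivatives are supplied by hand rather than by automatic differentiation. The usage reuses the Gaussian example with $\theta = (\mu, \sigma)$, $g(\theta, \epsilon) = \mu + \sigma \epsilon$ and $f(W) = W + W^2$, for which the exact gradient is $(1 + 2\mu, 2\sigma)$.

```python
import numpy as np

def reparam_gradient(f_prime, g, dg_dtheta, theta, eps):
    """Monte-Carlo estimate of dL/dtheta via the reparameterisation trick.

    f_prime(w)        -> derivative of f at each sample, shape (T,)
    g(theta, eps)     -> reparameterised samples W, shape (T,)
    dg_dtheta(theta, eps) -> partial derivatives of g, shape (T, P)
    """
    w = g(theta, eps)
    jac = dg_dtheta(theta, eps)
    # Average f'(g(theta, eps_t)) * dg/dtheta over the T noise samples.
    return np.mean(f_prime(w)[:, None] * jac, axis=0)

rng = np.random.default_rng(1)
eps = rng.standard_normal(200_000)  # eps_t ~ N(0, 1)
theta = np.array([0.5, 2.0])        # illustrative (mu, sigma)

grad = reparam_gradient(
    f_prime=lambda w: 1.0 + 2.0 * w,
    g=lambda th, e: th[0] + th[1] * e,
    dg_dtheta=lambda th, e: np.stack([np.ones_like(e), e], axis=1),
    theta=theta,
    eps=eps,
)
print(grad)  # should be close to the exact gradient (2, 4)
```

In practice an automatic differentiation library would typically compute $f'$ and $\partial g / \partial \theta$ for you; they are written out here only to mirror the formula.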

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt