# Lecture - Uncertainty in Deep Learning MT25, Stochastic approximate inference in DNNs

> Source: https://ollybritton.com/notes/uni/part-c/mt25/uncertainty-in-deep-learning/lectures/dnns/ · Updated: 2025-11-20 · Tags: uni, notes

- [Course - Uncertainty in Deep Learning MT25](https://ollybritton.com/notes/uni/part-c/mt25/uncertainty-in-deep-learning/)
	- Previous lecture: [Lecture - Uncertainty in Deep Learning MT25, Some very useful mathematical tools](https://ollybritton.com/notes/uni/part-c/mt25/uncertainty-in-deep-learning/lectures/tools/)

In this lecture, we use the mathematical tools developed in [Lecture - Uncertainty in Deep Learning MT25, Some very useful mathematical tools](https://ollybritton.com/notes/uni/part-c/mt25/uncertainty-in-deep-learning/lectures/tools/) to derive an update rule for the approximate posterior in a shallow neural network, and then show how this generalises to deep neural networks.

### Reparameterisation trick in a shallow network
> It is a one-line proof if you skip many steps.

Suppose:

- We are considering a deep neural network with vector outputs $y$
- $W \in \mathbb R^{K \times D}$ is the weight matrix of the last layer
- $b = 0$ for the output layer
- We don't consider the rest of the weights of the network, so that we may write $f^W(\pmb x) = W^\top \phi(\pmb x)$, where $\phi(\pmb x) \in \mathbb R^K$ is the feature map implemented by the previous layers of the network

and consider the following generative story:

- Nature chose $W$, which defines a function $f^W(\pmb x) := W^\top \phi(\pmb x)$
- Nature then generated function values $f^W(x_n)$ at the inputs $x_1, \ldots, x_N$
- These were corrupted with additive Gaussian noise: $y_n := f^W(x_n) + \epsilon_n$, $\epsilon_n \sim \mathcal N(0, \sigma^2 I_D)$
- We then observe these corrupted values $\{(x_1, y_1), \ldots, (x_N, y_N)\}$

We have the prior:

- $p(w_{k,d}) = \mathcal N(w_{k,d} \mid 0, s^2)$ (i.e. each entry is Gaussian distributed)

and the likelihood:

- $p(Y \mid X, W) = \prod_n \mathcal N(y_n; f^W(x_n), \sigma^2 I_D)$

and finally, we wish to approximate the posterior via the variational distribution $q_{m, \sigma}(w_{k,d})$ where
$$
q_{m, \sigma}(w_{k, d}) = \mathcal N(w_{k,d} \mid m_{k, d}, \sigma^2_{k,d})
$$
and we collect the means $m_{k,d}$ and standard deviations $\sigma_{k,d}$ into the matrices $M \in \mathbb R^{K \times D}$ and $S \in \mathbb R^{K \times D}$, writing $\theta = (M, S)$ for the variational parameters. Here $\sigma_{k,d}$ is distinct from the observation noise $\sigma$. Note that this is quite a strong condition on the variational distribution, since the $w_{k,d}$ are independent under $q$.
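
The generative story, prior and likelihood above can be simulated directly. The following is a minimal sketch, assuming NumPy and a hypothetical three-feature map `phi` standing in for the earlier layers of the network; the names `sample_dataset` and `log_likelihood` and all default values are illustrative and not from the lecture.

```python
import numpy as np

def phi(X):
    """Hypothetical fixed feature map: K = 3 features for each scalar input."""
    return np.stack([np.ones_like(X), X, np.tanh(X)], axis=1)   # shape (N, K)

def sample_dataset(X, D=2, prior_std=1.0, noise_std=0.1, seed=0):
    """Follow the generative story: draw W, evaluate f^W, add Gaussian noise."""
    rng = np.random.default_rng(seed)
    Phi = phi(X)                                        # (N, K)
    K = Phi.shape[1]
    W = rng.normal(0.0, prior_std, size=(K, D))         # w_{k,d} ~ N(0, s^2)
    F = Phi @ W                                         # row n is W^T phi(x_n)
    Y = F + rng.normal(0.0, noise_std, size=F.shape)    # y_n = f^W(x_n) + eps_n
    return W, Y

def log_likelihood(W, X, Y, noise_std=0.1):
    """log p(Y | X, W) = sum_n log N(y_n; W^T phi(x_n), sigma^2 I_D)."""
    resid = Y - phi(X) @ W
    N, D = Y.shape
    return (-0.5 * np.sum(resid**2) / noise_std**2
            - 0.5 * N * D * np.log(2.0 * np.pi * noise_std**2))

X = np.linspace(-2.0, 2.0, 50)
W_true, Y = sample_dataset(X)
```

Evaluating `log_likelihood` at `W_true` and at some other weight matrix gives a quick check that the weights which generated the data explain it better.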

Recall that we optimise against the ELBO, which we may write
$$
\text{ELBO}(\theta) = \mathbb E_{q_{\theta}(W)} [\log p(Y \mid X, W)] - \text{KL}(q_\theta(W), p(W))
$$
We have shown previously that $\text{KL}(q_\theta(W), p(W))$ is analytic, so the tricky part is approximating the gradient of $\mathbb E_{q_{\theta}(W)} [\log p(Y \mid X, W)]$ with respect to $\theta$.
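
For a factorised Gaussian $q$ and the zero-mean Gaussian prior used here, the closed form is the standard KL divergence between univariate Gaussians, summed over entries:
$$
\text{KL}(q_\theta(W), p(W)) = \sum_{k,d} \left[ \log \frac{s}{\sigma_{k,d}} + \frac{\sigma_{k,d}^2 + m_{k,d}^2}{2 s^2} - \frac 1 2 \right]
$$
A minimal sketch of this, assuming NumPy and that `S` holds standard deviations rather than variances (the function name `kl_to_prior` is illustrative):

```python
import numpy as np

def kl_to_prior(M, S, prior_std):
    """KL(q || p) for q = prod N(m_kd, S_kd^2) and p = prod N(0, prior_std^2)."""
    M = np.asarray(M, dtype=float)
    S = np.asarray(S, dtype=float)
    per_entry = (np.log(prior_std / S)
                 + (S**2 + M**2) / (2.0 * prior_std**2)
                 - 0.5)
    return float(np.sum(per_entry))

# When q coincides with the prior, the divergence vanishes.
kl_same = kl_to_prior(np.zeros((3, 2)), np.ones((3, 2)), prior_std=1.0)
# Moving the means away from zero makes it strictly positive.
kl_shifted = kl_to_prior(np.ones((3, 2)), np.ones((3, 2)), prior_std=1.0)
```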

Derive an estimate $\hat G(\theta, \{\hat \epsilon\})$ using the reparameterisation trick.::

@todo.
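
One way this derivation might go, as a sketch rather than the lecture's own answer. It assumes that $\hat G$ denotes the estimate of the gradient of the expected log-likelihood term, and that $S$ holds the standard deviations $\sigma_{k,d}$. Writing $W = M + S \odot \epsilon$ with $\epsilon_{k,d} \sim \mathcal N(0, 1)$ gives $W \sim q_\theta(W)$, so the expectation can be taken over $\epsilon$, whose distribution does not depend on $\theta$:
$$
\nabla_\theta \mathbb E_{q_\theta(W)} [\log p(Y \mid X, W)] = \nabla_\theta \mathbb E_{p(\epsilon)} [\log p(Y \mid X, M + S \odot \epsilon)] = \mathbb E_{p(\epsilon)} [\nabla_\theta \log p(Y \mid X, M + S \odot \epsilon)]
$$
A Monte Carlo estimate with $T$ samples $\hat \epsilon_t \sim \mathcal N(0, I)$ is then
$$
\hat G(\theta, \{\hat \epsilon\}) = \frac 1 T \sum_{t=1}^T \nabla_\theta \log p(Y \mid X, M + S \odot \hat \epsilon_t)
$$
and the gradient of the analytic KL term is subtracted to obtain a gradient estimate for the ELBO. For the linear last layer, $\nabla_W \log p(Y \mid X, W) = \frac{1}{\sigma^2} \sum_n \phi(x_n) (y_n - W^\top \phi(x_n))^\top$, and the chain rule gives the gradient with respect to $M$ as this quantity and the gradient with respect to $S$ as this quantity multiplied elementwise by $\hat \epsilon_t$. The NumPy code below sketches the estimator; the function names and toy data are illustrative assumptions.

```python
import numpy as np

def grad_log_lik_W(W, Phi, Y, noise_std):
    """Gradient of log p(Y | X, W) w.r.t. W, where row n of Phi is phi(x_n)."""
    return Phi.T @ (Y - Phi @ W) / noise_std**2          # shape (K, D)

def reparam_grad(M, S, Phi, Y, noise_std, rng, num_samples=1):
    """Monte Carlo estimate of the gradient of E_q[log p(Y | X, W)] w.r.t. (M, S)."""
    grad_M = np.zeros_like(M)
    grad_S = np.zeros_like(S)
    for _ in range(num_samples):
        eps = rng.standard_normal(M.shape)   # eps_hat ~ N(0, I)
        W = M + S * eps                      # reparameterised sample from q
        grad_W = grad_log_lik_W(W, Phi, Y, noise_std)
        grad_M += grad_W                     # dW/dM = 1 elementwise
        grad_S += grad_W * eps               # dW/dS = eps elementwise
    return grad_M / num_samples, grad_S / num_samples

# Toy data (illustrative): N = 20 inputs, K = 3 features, D = 2 outputs.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 3))
Y = rng.standard_normal((20, 2))
M = np.zeros((3, 2))
S = 0.1 * np.ones((3, 2))
grad_M, grad_S = reparam_grad(M, S, Phi, Y, noise_std=0.5, rng=rng, num_samples=8)
```

Because this model is linear in $W$, the expected value of `grad_M` equals the log-likelihood gradient evaluated at $W = M$, which gives a simple sanity check. In a deep network the per-sample gradient would instead come from automatic differentiation.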

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt