# Lecture - Uncertainty in Deep Learning MT25, Bayesian probabilistic modelling of functions

> Source: https://ollybritton.com/notes/uni/part-c/mt25/uncertainty-in-deep-learning/lectures/functions/ · Updated: 2025-12-17 · Tags: uni, notes

In this lecture, we apply the ideas in [Lecture - Uncertainty in Deep Learning MT25, Bayesian probabilistic modelling](https://ollybritton.com/notes/uni/part-c/mt25/uncertainty-in-deep-learning/lectures/modelling/) to the last layer of a neural network (in the specific setting of scalar outputs and Gaussian priors and noise). Specifically, we derive the mean and (co)variance of the posterior and predictive distributions.

### Flashcards
Suppose:

- We are considering a deep neural network with scalar outputs $y$
- $W \in \mathbb R^{K \times 1}$ is the weight matrix of the last layer
- $b = 0$ for the output layer
- We don't consider the rest of the weights of the network, so that we may write $f^W(\pmb x) = \sum w_k \phi_k(\pmb x) = W^\top \phi(\pmb x)$, where $\phi(\pmb x)$ is the feature map implemented by the previous layers of the network

and consider the following generative story:

- Nature chose $W$ which defines a function $f^W(\pmb x) := W^\top \phi(x)$
- Then nature generated function values with inputs $x_1, \ldots, x_N$ given by $f^W(x_n)$
- These were corrupted with additive Gaussian noise $y_n := f^W(x_n) + \epsilon_n$, $\epsilon_n \sim \mathcal N(0, \sigma^2)$
- We then observe these corrupted values $\{(x_1, y_1), \ldots, (x_N, y_N)\}$

This means we have the prior:

- $p(w_k) = \mathcal N(w_k; 0, s^2)$

and the likelihood:

- $\mathbb P(y \mid W, x) = \mathcal N(y; W^\top \phi(x), \sigma^2)$

Derive a formula for the predictive distribution over $y^\ast$ for new $x^\ast$ in terms of the posterior over $W$ and the likelihood.::

In other words, we want an expression for $\mathbb P(y^\ast \mid x^\ast, X, Y)$.
$$
\begin{aligned}
\mathbb P(y^\ast \mid x^\ast, X, Y) &= \int \mathbb P(y^\ast, W \mid x^\ast, X, Y) \text dW &\text{sum rule} \\
&= \int \mathbb P(y^\ast \mid W, x^\ast, X, Y) \mathbb P(W \mid x^\ast, X, Y) \text dW &\text{product rule} \\
&= \int \mathbb P(y^\ast \mid x^\ast, W) \mathbb P(W \mid X, Y) \text dW &\text{model assumptions}
\end{aligned}
$$
where the model assumptions we use are that:

1. $\mathbb P(y^\ast \mid W, x^\ast, X, Y) = \mathbb P(y^\ast \mid x^\ast, W)$, since $y^\ast$ and $X, Y$ are conditionally independent given $W$ and $x^\ast$; this follows from the generative story
2. $\mathbb P(W \mid x^\ast, X, Y) = \mathbb P(W \mid X, Y)$; this is slightly involved to justify formally, but intuitively a new input $x^\ast$ without its output $y^\ast$ carries no information about $W$
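
The final integral can also be read as an average of the likelihood over posterior samples of $W$. The following is a minimal sketch of that Monte Carlo reading, assuming NumPy; the posterior used here is a made-up Gaussian (`post_mean`, `post_cov`, `phi_star` and `noise_var` are hypothetical values chosen only for illustration, not taken from the lecture).

```python
import numpy as np


def gaussian_pdf(y, mean, var):
    """Density of N(y; mean, var) for scalar y (broadcasts over mean)."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def mc_predictive_density(y_star, phi_star, W_samples, noise_var):
    """Approximate p(y* | x*, X, Y) by averaging p(y* | x*, W) over samples of W."""
    means = W_samples @ phi_star  # one likelihood mean W^T phi(x*) per sample
    return gaussian_pdf(y_star, means, noise_var).mean()


rng = np.random.default_rng(0)

# Hypothetical posterior over W (K = 2), standing in for p(W | X, Y).
post_mean = np.array([1.0, 0.5])
post_cov = np.array([[0.3, 0.0], [0.0, 0.2]])
W_samples = rng.multivariate_normal(post_mean, post_cov, size=50_000)

phi_star = np.array([1.0, -1.0])  # hypothetical features phi(x*)
noise_var = 0.5                   # hypothetical sigma^2

density = mc_predictive_density(0.0, phi_star, W_samples, noise_var)
print(density)
```

In the Gaussian setting of these notes the integral has a closed form, so sampling is not needed; the sketch only illustrates what the integral computes.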

Suppose:

- We are considering a deep neural network with scalar outputs $y$
- $W \in \mathbb R^{K \times 1}$ is the weight matrix of the last layer
- $b = 0$ for the output layer
- We don't consider the rest of the weights of the network, so that we may write $f^W(\pmb x) = \sum w_k \phi_k(\pmb x) = W^\top \phi(\pmb x)$, where $\phi(\pmb x)$ is the feature map implemented by the previous layers of the network

and consider the following generative story:

- Nature chose $W$ which defines a function $f^W(\pmb x) := W^\top \phi(x)$
- Then nature generated function values with inputs $x_1, \ldots, x_N$ given by $f^W(x_n)$
- These were corrupted with additive Gaussian noise $y_n := f^W(x_n) + \epsilon_n$, $\epsilon_n \sim \mathcal N(0, \sigma^2)$
- We then observe these corrupted values $\{(x_1, y_1), \ldots, (x_N, y_N)\}$

This means we have the prior:

- $p(w_k) = \mathcal N(w_k; 0, s^2)$

and the likelihood:

- $\mathbb P(y \mid W, x) = \mathcal N(y; W^\top \phi(x), \sigma^2)$

@Prove that, in this situation where the prior and likelihood are Gaussians, the posterior distribution over $W$ must be a Gaussian too.

Then prove that the parameters $\Sigma, \mu$ of this Gaussian are given by
$$
\begin{aligned}
&\Sigma = \left( \frac{1}{\sigma^2} \phi(X)^\top \phi(X) + \frac{1}{s^2} I_k \right)^{-1} \\
&\mu = \frac{1}{\sigma^2} \Sigma \phi(X)^\top Y
\end{aligned}
$$

::

We want to show that
$$
\mathbb P(W \mid X, Y) = \mathcal N(W \mid \mu, \Sigma)
$$
for some $\mu, \Sigma$, given that

- $\mathbb P(y \mid W, x) = \mathcal N(y; W^\top \phi(x), \sigma^2)$
- $\mathbb P(w_k) = \mathcal N(w_k; 0, s^2)$

**The posterior must be Gaussian**:

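From Bayes rule, writing $D = (X, Y)$, we have
$$
\mathbb P(W \mid D) = \frac{\mathbb P(D \mid W) \mathbb P(W)}{\mathbb P(D)}
$$
The prior is a Gaussian in $W$, and the likelihood, viewed as a function of $W$, is the exponential of a quadratic in $W$. Their product is therefore an unnormalised Gaussian in $W$:
$$
\mathbb P(D \mid W) \mathbb P(W) = c_1 \mathcal N(W \mid \mu_s, \Sigma_s)
$$
where $c_1$ is some constant that does not depend on $W$. Therefore
$$
\mathbb P(W \mid D) = \frac{c_1}{c_2} \mathcal N(W \mid \mu_s, \Sigma_s)
$$
where $c_2 = \mathbb P(D)$ is the model evidence. Then since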
$$
\begin{aligned}
&\int_W \mathbb P(W \mid D)\text dW = 1 \\
\implies& \int_W \frac{c_1}{c_2} \mathcal N(W \mid \mu_s, \Sigma_s) \text dW = 1 \\
\implies& \frac{c_1}{c_2} \int_W \mathcal N(W \mid \mu_s, \Sigma_s) \text dW = 1 \\
\implies& \frac{c_1}{c_2} = 1
\end{aligned}
$$
so this is indeed a Gaussian.
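
As a numerical sanity check of this argument, the sketch below normalises prior times likelihood on a grid in the one-dimensional case $K = 1$ and compares it with the Gaussian whose parameters are given in the question. It assumes NumPy and a hypothetical feature map $\phi(x) = x$; the data, `noise_std` and `prior_std` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_std, prior_std = 0.5, 2.0  # hypothetical sigma and s

# Hypothetical data from the generative story with K = 1 and phi(x) = x.
x = rng.uniform(-1.0, 1.0, size=10)
y = 0.7 * x + noise_std * rng.normal(size=10)

# Unnormalised log posterior on a grid: log prior + log likelihood (constants dropped).
w_grid = np.linspace(-5.0, 5.0, 20001)
dw = w_grid[1] - w_grid[0]
log_prior = -0.5 * w_grid**2 / prior_std**2
residuals = y[None, :] - w_grid[:, None] * x[None, :]  # shape (grid, N)
log_lik = -0.5 * (residuals**2).sum(axis=1) / noise_std**2
log_unnorm = log_prior + log_lik

# Normalise numerically (Riemann sum), which plays the role of dividing by the evidence.
unnorm = np.exp(log_unnorm - log_unnorm.max())
grid_posterior = unnorm / (unnorm.sum() * dw)

# Closed-form Gaussian posterior for K = 1.
post_var = 1.0 / ((x**2).sum() / noise_std**2 + 1.0 / prior_std**2)
post_mean = post_var * (x * y).sum() / noise_std**2
gauss = np.exp(-0.5 * (w_grid - post_mean) ** 2 / post_var) / np.sqrt(2.0 * np.pi * post_var)

max_abs_diff = np.abs(grid_posterior - gauss).max()
print(max_abs_diff)
```

The two curves should agree up to numerical error, reflecting that the only role of the evidence is to normalise an already Gaussian-shaped function of $W$.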

---

**Deriving the mean and variance**:

Deriving the mean and variance is a little bit more difficult. Since we know that it must be a Gaussian, it has the form
$$
\mathbb P(W \mid X, Y) \propto \exp\left(-\frac 1 2 (W - \mu)^\top \Sigma^{-1} (W - \mu)\right)
$$
But by Bayes law, we also have
$$
\begin{aligned}
\mathbb P(W \mid X, Y) &= \frac{\mathbb P(Y \mid W, X)\mathbb P(W \mid X)}{\mathbb P(Y \mid X)} \\
&\propto \mathbb P(Y \mid W, X) \mathbb P(W) \\
&\propto \exp\left( -\left(\frac{1}{2\sigma^2} \sum^N_{i = 1} ||y_i - W^\top \phi(x_i)||^2\right) - \frac{1}{2s^2}W^\top W\right)
\end{aligned}
$$
Now we expand the (negated) terms inside the exponential:
$$
\begin{aligned}
&\frac{1}{2s^2} W^\top W + \frac{1}{2\sigma^2} \sum^N_{i = 1} ||y_i - W^\top \phi(x_i)||^2\\
=&\frac{1}{2s^2}W^\top W + \frac{1}{2\sigma^2} \sum^N_{i = 1} (y_i - W^\top \phi(x_i))^\top (y_i - W^\top \phi(x_i)) \\
=&\frac{1}{2s^2} W^\top W + \frac{1}{2\sigma^2} \sum^N_{i = 1} (y_i^\top y_i - y_i^\top W^\top \phi(x_i) - \phi(x_i)^\top W y_i + \phi(x_i)^\top W W^\top \phi(x_i)) \\
=&\frac{1}{2s^2} W^\top W + \frac{1}{2\sigma^2} \sum^N_{i = 1} (y_i^2 - 2y_i W^\top \phi(x_i) + W^\top \phi(x_i)\phi(x_i)^\top W) &(\star)\\
\end{aligned}
$$

where $(\star)$ follows from the fact that $y_i^\top W^\top \phi(x_i)$ and $\phi(x_i)^\top W y_i$ are equal scalars, and likewise $\phi(x_i)^\top W W^\top \phi(x_i) = W^\top \phi(x_i) \phi(x_i)^\top W$. Expanding out the other form, we also have
$$
\mathbb P(W \mid X, Y) \propto \exp\left(-\frac 1 2\left( W^\top \Sigma^{-1} W - W^\top \Sigma^{-1} \mu - \mu^\top \Sigma^{-1}W + \mu^\top \Sigma^{-1} \mu \right)\right)
$$
Now we set these equal to one another (up to an additive constant that does not depend on $W$) and match up the quadratic and linear terms in $W$:
$$
\begin{aligned}
\frac{1}{2s^2} W^\top W + \frac{1}{2\sigma^2} \sum^N_{i=1} (y_i^2 &- 2y_i W^\top \phi(x_i) + W^\top \phi(x_i) \phi(x_i)^\top W)\\
&= \\
\frac{1}{2} (W^\top \Sigma^{-1}W &- 2W^\top \Sigma^{-1} \mu + \mu^\top \Sigma^{-1} \mu)
\end{aligned}
$$
Taking out $W$ from either side of the quadratic terms, we obtain that
$$
\Sigma^{-1} = \frac{1}{\sigma^2} \sum^N_{i = 1}\phi(x_i) \phi(x_i)^\top + \frac{1}{s^2} I_k
$$
and matching up the linear terms, we have
$$
W^\top \Sigma^{-1} \mu = \frac{1}{\sigma^2} \sum^N_{i = 1} y_i W^\top \phi(x_i)
$$
so therefore
$$
\mu = \frac{1}{\sigma^2} \Sigma \sum^N_{i = 1} y_i \phi(x_i)
$$
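Collecting the sums into matrix products, where $\phi(X) \in \mathbb R^{N \times K}$ is the matrix whose $i$-th row is $\phi(x_i)^\top$ and $Y \in \mathbb R^N$ stacks the $y_i$, we can summarise these results as: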

$$
\begin{aligned}
&\Sigma = \left( \frac{1}{\sigma^2} \phi(X)^\top \phi(X) + \frac{1}{s^2} I_k \right)^{-1} \\
&\mu = \frac{1}{\sigma^2} \Sigma \phi(X)^\top Y
\end{aligned}
$$

(I have been a little bit sloppy with the notation here; it should really be $\Sigma'$ and $\mu'$ throughout, rather than $\Sigma$ and $\mu$.)
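
The closed-form posterior above translates directly into a few lines of linear algebra. The following is a minimal sketch, assuming NumPy; the function name `last_layer_posterior`, the polynomial feature map and the data are all hypothetical stand-ins (in the lecture's setting, `Phi` would hold the features produced by the earlier layers of the network).

```python
import numpy as np


def last_layer_posterior(Phi, Y, noise_var, prior_var):
    """Posterior N(W; mu, Sigma) for Y = Phi W + noise, with prior W ~ N(0, prior_var * I).

    Phi: (N, K) matrix whose i-th row is phi(x_i)^T.
    Y:   (N,) vector of observed outputs.
    """
    K = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(K) / prior_var  # Sigma^{-1}
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ Phi.T @ Y / noise_var
    return mu, Sigma


rng = np.random.default_rng(0)

# Hypothetical feature map phi(x) = (1, x, x^2), standing in for the network's features.
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)

# Follow the generative story: nature picks W, then outputs are corrupted by noise.
true_W = np.array([0.5, -1.0, 2.0])
noise_var, prior_var = 0.1**2, 1.0
Y = Phi @ true_W + np.sqrt(noise_var) * rng.normal(size=x.shape)

mu, Sigma = last_layer_posterior(Phi, Y, noise_var, prior_var)
print(mu)
print(np.sqrt(np.diag(Sigma)))  # posterior standard deviation of each weight
```

Note that $\mu$ coincides with the ridge regression solution with regularisation strength $\sigma^2 / s^2$, which gives a convenient way to check an implementation.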

Suppose:

- We are considering a deep neural network with scalar outputs $y$
- $W \in \mathbb R^{K \times 1}$ is the weight matrix of the last layer
- $b = 0$ for the output layer
- We don't consider the rest of the weights of the network, so that we may write $f^W(\pmb x) = \sum w_k \phi_k(\pmb x) = W^\top \phi(\pmb x)$, where $\phi(\pmb x)$ is the feature map implemented by the previous layers of the network

and consider the following generative story:

- Nature chose $W$ which defines a function $f^W(\pmb x) := W^\top \phi(x)$
- Then nature generated function values with inputs $x_1, \ldots, x_N$ given by $f^W(x_n)$
- These were corrupted with additive Gaussian noise $y_n := f^W(x_n) + \epsilon_n$, $\epsilon_n \sim \mathcal N(0, \sigma^2)$
- We then observe these corrupted values $\{(x_1, y_1), \ldots, (x_N, y_N)\}$

This means we have the prior:

- $p(w_k) = \mathcal N(w_k; 0, s^2)$

and the likelihood:

- $\mathbb P(y \mid W, x) = \mathcal N(y \mid W^\top \phi(x), \sigma^2)$

In this context, we have that the posterior distribution of $W$ given $X, Y$ is
$$
\mathbb P(W \mid X, Y) = \mathcal N(W \mid \mu, \Sigma)
$$
where
$$
\begin{aligned}
&\Sigma = \left( \frac{1}{\sigma^2} \phi(X)^\top \phi(X) + \frac{1}{s^2} I_k \right)^{-1} \\
&\mu = \frac{1}{\sigma^2} \Sigma \phi(X)^\top Y
\end{aligned}
$$
Use this to derive the parameters of the predictive distribution, i.e. $\mu^\ast, \Sigma^\ast$ where
$$
\mathbb P(y^\ast \mid x^\ast, X, Y) = \mathcal N(y^\ast \mid \mu^\ast, \Sigma^\ast)
$$

::

To do this, we use moment matching. Since we know that the predictive distribution must be a Gaussian (all the distributions involved are Gaussian, and Gaussians have nice closure properties under products and marginalisation), we can directly calculate its parameters by finding the mean and variance of $\mathbb P(y^\ast \mid x^\ast, X, Y)$.

**The mean**:

We have
$$
\begin{aligned}
\mu^\ast &= \mathbb E_{\mathbb P(y^\ast \mid x^\ast, X, Y)}[y^\ast] \\
&= \int_{y^\ast} y^\ast \mathbb P(y^\ast \mid x^\ast, X, Y) \text dy^\ast \\
&= \int_{y^\ast} y^\ast \left( \int_W \mathbb P(y^\ast \mid x^\ast, W) \mathbb P(W \mid X, Y) \text dW \right) \text dy^\ast &(\text{sum rule}) \\
&= \int_{W} \left( \int_{y^\ast} y^\ast \mathbb P(y^\ast \mid x^\ast, W) \text dy^\ast \right) \mathbb P(W \mid X, Y) \text dW &(\text{swap order of integration}) \\
&= \int_W \mathbb E_{\mathbb P(y^\ast \mid x^\ast, W)}[y^\ast] \, \mathbb P(W \mid X, Y) \text dW \\
&= \int_W W^\top \phi(x^\ast) \mathbb P(W \mid X, Y) \text dW &(\text{mean of the likelihood}) \\
&= \left( \int_W W^\top \mathbb P(W \mid X, Y) \text dW \right) \phi(x^\ast) &(\text{regroup, take out }\phi(x^\ast)) \\
&= \mu^\top \phi(x^\ast)
\end{aligned}
$$
where here $\mu$ is the mean of the posterior distribution.

**The variance**: We make repeated use of the identity that
$$
\text{Var}(z) = \mathbb E[z^\top z] - \mathbb E[z]^\top \mathbb E[z]
$$
In this case, we have:
$$
\begin{aligned}
&\text{Var}(y^\ast \mid x^\ast, X, Y) \\
=& \mathbb E[(y^\ast)^\top (y^\ast)] - (\mathbb E[y^\ast])^\top (\mathbb E[y^\ast]) \\
=&\mathbb E[(y^\ast)^\top (y^\ast)] - (\mu^\top \phi(x^\ast))^\top (\mu^\top \phi(x^\ast)) \\
=&\mathbb E[(y^\ast)^\top (y^\ast )] - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) \\
=&\int_{y^\ast} (y^\ast)^\top(y^\ast) \mathbb P(y^\ast \mid x^\ast, X, Y) \text dy^\ast - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) \\
=& \int_{y^\ast}(y^\ast)^\top (y^\ast) \left( \int_W \mathbb P(y^\ast \mid x^\ast, W) \mathbb P(W \mid X, Y) \text dW \right) \text dy^\ast - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) \\
=&\int_W \left(\int_{y^\ast} (y^\ast)^\top (y^\ast) \mathbb P(y^\ast \mid x^\ast, W) \text dy^\ast\right) \mathbb P(W \mid X, Y) \text dW - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) \\
=&\int_W \mathbb E_{\mathbb P(y^\ast \mid x^\ast, W)}[(y^\ast)^\top (y^\ast)] \mathbb P(W \mid X, Y) \text dW - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) 
\end{aligned}
$$
where the initial expectations are taken with respect to $\mathbb P(y^\ast \mid x^\ast, X, Y)$. Now we deal with the inner expectation. By applying the (rearranged) identity, we obtain
$$
\begin{aligned}
\mathbb E[(y^\ast)^\top (y^\ast) \mid x^\ast, W] &= \text{Var}(y^\ast \mid x^\ast, W) + \mathbb E[y^\ast \mid x^\ast, W]^\top \mathbb E[y^\ast \mid x^\ast, W] \\
&= \sigma^2 + (W^\top \phi(x^\ast))^\top (W^\top \phi(x^\ast)) &\text{(from gen. story)} \\
&= \sigma^2 + \phi(x^\ast)^\top WW^\top \phi(x^\ast)
\end{aligned}
$$
Therefore
$$
\begin{aligned}
&\int_W \mathbb E_{\mathbb P(y^\ast \mid x^\ast, W)}[(y^\ast)^\top (y^\ast)] \mathbb P(W \mid X, Y) \text dW - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) \\
=&\int_W (\sigma^2 + \phi(x^\ast)^\top W W^\top \phi(x^\ast)) \mathbb P(W \mid X, Y) \text dW -  \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) \\
=& \int_W \sigma^2 \mathbb P(W \mid X, Y) \text dW + \int_W \phi(x^\ast) ^\top W W^\top \phi(x^\ast) \mathbb P(W \mid X, Y) \text dW  - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast)  \\
=& \sigma^2 + \phi(x^\ast)^\top \left( \int_W WW^\top \mathbb P(W\mid X, Y) \text dW \right) \phi(x^\ast) - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) \\
=&\sigma^2 + \phi(x^\ast)^\top \mathbb E[WW^\top \mid X, Y] \phi(x^\ast)  - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast)  \\
=& \sigma^2 + \phi(x^\ast)^\top (\text{Var}(W^\top \mid X, Y) + \mathbb E[W^\top \mid X, Y]^\top \mathbb E[W^\top \mid X, Y])\phi(x^\ast) - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast)  \\
=& \sigma^2 + \phi(x^\ast)^\top (\Sigma^\top + \mu \mu^\top) \phi(x^\ast)  - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) &(\text{via results on posterior})  \\
=& \sigma^2 + \phi(x^\ast)^\top (\Sigma + \mu \mu^\top)\phi(x^\ast) - \phi(x^\ast)^\top \mu \mu^\top \phi(x^\ast) &(\text{assump. imply }\Sigma^\top = \Sigma) \\
=&\sigma^2 + \phi(x^\ast)^\top \Sigma \phi(x^\ast)
\end{aligned}
$$

**Summarising these results**: We have

$$
\begin{aligned}
&\mu^\ast = \mu^\top \phi(x^\ast) \\
&\text{Var}(y^\ast \mid x^\ast, X, Y) = \sigma^2 + \phi(x^\ast)^\top \Sigma \phi(x^\ast)
\end{aligned}
$$
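
These two formulas can be checked by simulating the generative story directly: draw $W$ from the posterior, add observation noise, and compare the empirical moments of $y^\ast$ with the closed form. The following is a minimal sketch, assuming NumPy; the function name `predictive_moments` and the posterior parameters `mu`, `Sigma`, the features `phi_star` and `noise_var` are hypothetical values used only for illustration.

```python
import numpy as np


def predictive_moments(phi_star, mu, Sigma, noise_var):
    """Mean and variance of p(y* | x*, X, Y) given the posterior N(W; mu, Sigma)."""
    mean = mu @ phi_star
    var = noise_var + phi_star @ Sigma @ phi_star
    return mean, var


# Hypothetical posterior parameters (K = 2) and test-point features phi(x*).
mu = np.array([0.5, -1.0])
Sigma = np.array([[0.2, 0.05], [0.05, 0.1]])
phi_star = np.array([1.0, 2.0])
noise_var = 0.25

pred_mean, pred_var = predictive_moments(phi_star, mu, Sigma, noise_var)

# Monte Carlo check: W ~ posterior, then y* = W^T phi(x*) + eps with eps ~ N(0, sigma^2).
rng = np.random.default_rng(0)
n_samples = 200_000
W_samples = rng.multivariate_normal(mu, Sigma, size=n_samples)
y_samples = W_samples @ phi_star + np.sqrt(noise_var) * rng.normal(size=n_samples)

print(pred_mean, y_samples.mean())
print(pred_var, y_samples.var())
```

The variance splits into the noise term $\sigma^2$, which does not shrink with more data, and the term $\phi(x^\ast)^\top \Sigma \phi(x^\ast)$, which reflects remaining uncertainty about $W$.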

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt
