Uncertainty in Deep Learning MT25, Probability reference
Most of my notes on basic probability theory can be found in [[Course - Probability MT22]]U and interspersed throughout [[Course - Machine Learning MT23]]U (especially [[Notes - Machine Learning MT23, Bayesian machine learning]]U).
The notes here come primarily from chapters 2 and 3 of Probabilistic Machine Learning by Kevin Murphy.
Flashcards
Univariate models
Conditional independence
@Define what it means for two events $A$ and $B$ to be conditionally independent given an event $C$, and state the notation used to write this.
\[\mathbb P(A, B \mid C) = \mathbb P(A \mid C)\, \mathbb P(B \mid C)\]
This is written $A \perp B \mid C$.
Conditional moments
@State the law of total expectation.
@Prove the law of total expectation, i.e. that
\[\mathbb E[X] = \mathbb E _ Y[\mathbb E[X \mid Y]]\]
@State the law of total variance.
@Prove the law of total variance, i.e. that
\[\text{Var}(X) = \mathbb E _ Y(\text{Var}(X \mid Y)) + \text{Var} _ Y(\mathbb E[X \mid Y])\]
Define:
- $\mu _ {X \vert Y} = \mathbb E[X \mid Y]$
- $s _ {X \mid Y} = \mathbb E[X^2 \mid Y]$
- $\sigma^2 _ {X \vert Y} = \text{Var}(X \mid Y) = s _ {X \vert Y} - \mu^2 _ {X \vert Y}$
Then:
\[\begin{aligned} \text{Var}(X) &= \mathbb E[X^2] - (\mathbb E[X])^2 \\ &= \mathbb E _ Y[s _ {X \vert Y}] - (\mathbb E _ Y[\mu _ {X \vert Y}])^2 \\ &= \mathbb E _ Y[\sigma^2 _ {X \vert Y}] + \mathbb E _ Y[\mu^2 _ {X \vert Y}] - (\mathbb E _ Y[\mu _ {X \vert Y}])^2 \\ &= \mathbb E _ Y[\text{Var}(X \mid Y)] + \text{Var} _ Y(\mu _ {X \vert Y}) \end{aligned}\]
Properties of the sigmoid function
@Define the $\text{logit}$ function.
Dirac delta function as a limiting case of the Gaussian
What happens to a Gaussian as you shrink its variance to $0$?
Student $t$ distribution
@Define the probability density function of the Student $t$ distribution $\mathcal T(y \mid \mu, \sigma^2, \nu)$, and describe how the choice of $\nu$ (the “degrees of freedom” or “degree of normality”) affects the distribution.
\[\mathcal T(y \mid \mu, \sigma^2, \nu) = C \left( 1 + \frac 1 \nu \left( \frac{y - \mu}{\sigma} \right)^2 \right)^{-\left(\frac{\nu+1}{2}\right)}\]
where:
- $\mu$ is the mean
- $\sigma > 0$ is a scale parameter distinct from the standard deviation
- $\nu$ is the “degree of normality”; large values of $\nu$ make the distribution behave more like a Gaussian
- $C$ is a scaling parameter to make it integrate to one
Intuitively, why is the Student $t$ distribution $\mathcal T(y \mid \mu, \sigma^2, \nu)$ more robust to errors than the normal distribution $\mathcal N(y \mid \mu, \sigma^2)$?
The probability density decays as a polynomial function of the squared distance from the mean, rather than exponentially.
Recall that the probability density function of the Student $t$ distribution $\mathcal T(y \mid \mu, \sigma^2, \nu)$ is given by:
\[\mathcal T(y \mid \mu, \sigma^2, \nu) = C \left( 1 + \frac 1 \nu \left( \frac{y - \mu}{\sigma} \right)^2 \right)^{-\left(\frac{\nu+1}{2}\right)}\]
where:
- $\mu$ is the mean
- $\sigma > 0$ is a scale parameter distinct from the standard deviation
- $\nu$ is the “degree of normality”; large values of $\nu$ make the distribution behave more like a Gaussian
- $C$ is a scaling parameter to make it integrate to one
What is the mean, mode and variance of this distribution?
- Mean: $\mu$ (for $\nu > 1$)
- Mode: $\mu$
- Variance: $\frac{\nu \sigma^2}{\nu - 2}$ (for $\nu > 2$)
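Not from the book; a quick numerical sanity check of these moments using `scipy.stats.t`, where `df`, `loc` and `scale` play the roles of $\nu$, $\mu$ and $\sigma$:

```python
# Sanity check of the Student t moments (a sketch, not from Murphy):
# df, loc and scale play the roles of nu, mu and sigma.
from scipy import stats

mu, sigma, nu = 1.5, 2.0, 5.0
dist = stats.t(df=nu, loc=mu, scale=sigma)

print(dist.mean())               # mu = 1.5  (defined for nu > 1)
print(dist.var())                # nu * sigma^2 / (nu - 2) = 5 * 4 / 3 ≈ 6.67  (for nu > 2)
print(nu * sigma**2 / (nu - 2))  # matches the closed form above
```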
Cauchy distribution
@Define the probability density function of the Cauchy distribution $\mathcal C(x \mid \mu, \gamma)$ and the half Cauchy distribution $\mathcal C _ +$. In what situation is the half Cauchy distribution often used?
\[\mathcal C(x \mid \mu, \gamma) = \frac{1}{\pi \gamma}\left[ 1 + \left(\frac{x - \mu}{\gamma}\right)^2 \right]^{-1}\]
i.e. it is the Student $t$ distribution with $\nu = 1$ and scale $\gamma$. The half Cauchy distribution (with $\mu = 0$) folds this over at the origin:
\[\mathcal C _ +(x \mid \gamma) = \frac{2}{\pi \gamma}\left[ 1 + \left(\frac{x}{\gamma}\right)^2 \right]^{-1}\]
This is useful when you want a distribution over the positive reals with heavy tails but a finite density at the origin.
Empirical distribution
Suppose we have a set of $N$ samples $\mathcal D = \{x^{(1)}, \ldots, x^{(N)}\}$. @Define the empirical distribution.
The distribution formed by spikes around these points, i.e.
\[\hat p _ N (x) = \frac 1 N \sum^N _ {n = 1}\delta _ {x^{(n)}}(x)\]
Other distributions
- Truncated Gaussian distribution
- Cut the Gaussian off between $[a, b]$ and renormalise so it integrates to $1$
- Beta distribution:
- Has support over $[0, 1]$
- Gamma distribution
- Has support over $(0, \infty)$
- Exponential distribution
- Describes the times between events in a Poisson process, which is a process in which events occur continuously and independently at a constant average rate
- Chi-squared distribution
- Comes from the sum of squared Gaussian random variables
- Inverse Gamma distribution
Transformations of discrete random variables
Suppose:
- $X$ is a discrete random variable with probability mass function $p _ x$.
- $f$ is a deterministic function
- $Y = f(X)$
@State the probability mass function $p _ y$.
Transformations of continuous random variables
Suppose:
- $X$ is a scalar continuous random variable with probability density function $p _ x$.
- $f$ is a deterministic, monotonic (and so in particular invertible) function
- $Y = f(X)$
@State the probability density function $p _ y$ found using the change of variables formula.
Suppose:
- $\pmb X$ is a multidimensional continuous random variable with probability density function $p _ x$.
- $\pmb f$ is a deterministic and invertible function with inverse $\pmb g$, from $\mathbb R^n \to \mathbb R^n$
- $\pmb Y = \pmb f(\pmb X)$
@State the probability density function $p _ y$ found using the change of variables formula.
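A minimal numerical sketch (my own illustration, not from the book) of the scalar change-of-variables formula $p _ y(y) = p _ x(g(y)) \vert g'(y) \vert$ with $g = f^{-1}$, using $X \sim \mathcal N(0, 1)$ and $f = \exp$, so that $Y$ is log-normal:

```python
# Check the scalar change-of-variables formula p_y(y) = p_x(g(y)) |g'(y)|, g = f^{-1},
# for X ~ N(0, 1) and f(x) = exp(x), so that Y = f(X) is log-normal.
import numpy as np
from scipy import stats

y = np.linspace(0.1, 5.0, 50)
p_y_change_of_vars = stats.norm.pdf(np.log(y)) / y   # g(y) = log y, |g'(y)| = 1/y
p_y_reference = stats.lognorm(s=1.0).pdf(y)          # log-normal with sigma = 1, mu = 0

print(np.allclose(p_y_change_of_vars, p_y_reference))
```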
Moments of a linear transformation
Suppose:
- $\pmb x$ is a multidimensional random variable
- $\pmb y = \mathbf A \pmb x + \pmb b$
@State the mean and covariance of $\pmb y$, and what this reduces to when $\pmb y$ is a scalar (so that $\mathbf A = \pmb a^\top$).
- Mean: $\mathbf A \pmb \mu + \pmb b$ where $\pmb \mu = \mathbb E[\pmb x]$, or in particular $\pmb a^\top \pmb \mu + b$ when $\mathbf A = \pmb a^\top$
- Covariance: $\mathbf A \mathbf \Sigma \mathbf A^\top$ where $\mathbf \Sigma = \text{Cov}[\pmb x]$, or in particular $\pmb a^\top \mathbf \Sigma \pmb a$ when $\mathbf A = \pmb a^\top$
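A quick Monte Carlo sketch (not from the book) checking these moment formulas by sampling:

```python
# Monte Carlo check that E[Ax + b] ≈ A mu + b and Cov[Ax + b] ≈ A Sigma A^T.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
b = np.array([0.5, 1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # any distribution with these moments would do
y = x @ A.T + b

print(np.allclose(y.mean(axis=0), A @ mu + b, atol=0.05))
print(np.allclose(np.cov(y, rowvar=False), A @ Sigma @ A.T, atol=0.1))
```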
The convolution theorem
Suppose:
- $x _ 1, x _ 2$ are two independent random variables
- $y = x _ 1 + x _ 2$
@State the probability mass function $p _ y$ when these are discrete random variables, and the probability density function $p _ y$ when these are continuous random variables.
In both cases $p _ y = p _ {x _ 1} \ast p _ {x _ 2}$, where $\ast$ denotes convolution, so that
\[p _ y(y = j) = \sum _ k p(x _ 1 = k)\, p(x _ 2 = j-k)\]
in the discrete case, and in the continuous case
\[p(y) = \int p _ 1(x _ 1)\, p _ 2(y - x _ 1)\, \text dx _ 1\]
Central limit theorem
Suppose:
- We have $N$ i.i.d. random variables $X _ 1, \ldots, X _ N$, each with mean $\mu$ and variance $\sigma^2$
- $S _ N = \sum^N _ {n = 1}X _ n$
@State the central limit theorem.
As $N \to \infty$, the distribution of the sum approaches
\[p(S _ N = u) \approx \mathcal N(u \mid N\mu, N\sigma^2)\]
and so the distribution of the quantity $Z _ N := \frac{S _ N - N\mu}{\sigma \sqrt{N}}$ converges to a standard normal.
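A small simulation sketch (mine, not Murphy's): standardised sums of i.i.d. $\text{Exp}(1)$ variables ($\mu = \sigma = 1$) look approximately standard normal for large $N$:

```python
# CLT sketch: Z_N = (S_N - N*mu) / (sigma * sqrt(N)) for sums of i.i.d. Exponential(1)
# variables (mu = sigma = 1) should be approximately standard normal for large N.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, trials = 1_000, 10_000
S_N = rng.exponential(scale=1.0, size=(trials, N)).sum(axis=1)
Z_N = (S_N - N * 1.0) / (1.0 * np.sqrt(N))

# Empirical quantiles of Z_N vs standard normal quantiles.
for q in (0.1, 0.5, 0.9):
    print(q, np.quantile(Z_N, q), stats.norm.ppf(q))
```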
Multivariate models
Covariance matrix
Suppose $\pmb x$ is a $D$-dimensional random vector. @Define the covariance matrix $\text{Cov}[\pmb x]$.
What is $\mathbb E[\pmb x\pmb x^\top]$?
@Define the cross-covariance between two random vectors $\pmb x, \pmb y$.
Pearson correlation coefficient
@Define the Pearson correlation coefficient between random variables $X$ and $Y$, state why it is useful, and explain why “degree of linearity” might be a better term.
- It is useful because the covariance between two random variables can be any real number, whereas $\rho$ always lies between $-1$ and $1$.
- “Degree of linearity” might be a better term because two variables can be strongly related in nonlinear ways while still having $\rho = 0$.
The Pearson correlation coefficient is defined as
\[\rho = \text{Corr}[X, Y] = \frac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]\text{Var}[Y]}}\]
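A small sketch (my own) checking this formula against `np.corrcoef`, and illustrating the “degree of linearity” point: $Y = X^2$ with $X$ symmetric about $0$ is strongly dependent on $X$ but has $\rho \approx 0$:

```python
# Pearson correlation: manual formula vs np.corrcoef, plus a nonlinear
# relationship (Y = X^2 with X symmetric about 0) whose rho is ~0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)       # noisy linear relationship

rho_manual = np.cov(x, y)[0, 1] / np.sqrt(x.var(ddof=1) * y.var(ddof=1))
print(rho_manual, np.corrcoef(x, y)[0, 1])   # agree; both close to 2 / sqrt(5)

print(np.corrcoef(x, x**2)[0, 1])            # ≈ 0 despite a deterministic dependence
```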
@State a result about when $X$ and $Y$ are in a linear relationship.
Correlation matrix
Suppose $\pmb x$ is a random vector. @Define the correlation matrix $\text{corr}(\pmb x)$.
The correlation matrix $\text{corr}(\pmb x)$ of a random vector is defined as
\[\mathrm{corr}(\pmb x) =
\begin{pmatrix}
1 & \dfrac{\mathbb{E}[(X _ 1 - \mu _ 1)(X _ 2 - \mu _ 2)]}{\sigma _ 1 \sigma _ 2} & \cdots & \dfrac{\mathbb{E}[(X _ 1 - \mu _ 1)(X _ D - \mu _ D)]}{\sigma _ 1 \sigma _ D} \\[1em]
\dfrac{\mathbb{E}[(X _ 2 - \mu _ 2)(X _ 1 - \mu _ 1)]}{\sigma _ 2 \sigma _ 1} & 1 & \cdots & \dfrac{\mathbb{E}[(X _ 2 - \mu _ 2)(X _ D - \mu _ D)]}{\sigma _ 2 \sigma _ D} \\[1em]
\vdots & \vdots & \ddots & \vdots \\[1em]
\dfrac{\mathbb{E}[(X _ D - \mu _ D)(X _ 1 - \mu _ 1)]}{\sigma _ D \sigma _ 1} &
\dfrac{\mathbb{E}[(X _ D - \mu _ D)(X _ 2 - \mu _ 2)]}{\sigma _ D \sigma _ 2} &
\cdots & 1
\end{pmatrix}\]
How could you write this more compactly?
\[\text{corr}(\pmb x) = \big(\text{diag}(\mathbf K _ {xx})\big)^{-1/2}\, \mathbf K _ {xx}\, \big(\text{diag}(\mathbf K _ {xx})\big)^{-1/2}\]
where $\mathbf K _ {xx}$ is the auto-covariance matrix
\[\mathbf K _ {xx} = \mathbf \Sigma = \mathbb E[(\pmb x - \mathbb E[\pmb x])(\pmb x - \mathbb E[\pmb x])^\top] = \mathbf R _ {xx} - \pmb \mu \pmb \mu^\top\]
and $\mathbf R _ {xx} = \mathbb E[\pmb x \pmb x^\top]$ is the auto-correlation matrix.
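A quick check (not from the book) that this compact expression agrees with `np.corrcoef`:

```python
# corr(x) = diag(K_xx)^{-1/2} K_xx diag(K_xx)^{-1/2}, compared with np.corrcoef.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 1.0, -1.0],
                            cov=[[2.0, 0.5, 0.1],
                                 [0.5, 1.0, -0.3],
                                 [0.1, -0.3, 0.8]],
                            size=100_000)

K = np.cov(X, rowvar=False)                       # auto-covariance matrix K_xx
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(K)))   # diag(K_xx)^{-1/2}
corr = D_inv_sqrt @ K @ D_inv_sqrt

print(np.allclose(corr, np.corrcoef(X, rowvar=False)))
```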
Multivariate Gaussian distribution
See [[Notes - Machine Learning MT23, Multivariate Gaussians]]U, specifically ∆multivarite-gaussian-pdf and ∆multivariate-gaussian-eigenvectors.
Suppose that $\pmb y \sim \mathcal N(\pmb y \mid \pmb \mu, \mathbf \Sigma)$, i.e.
\[p(\pmb y) = \frac{1}{(2\pi)^{D/2} \vert \pmb \Sigma \vert ^{1/2}\,}\exp\left(-\frac 1 2(\pmb y - \pmb \mu)^\top\pmb \Sigma^{-1} (\pmb y - \pmb \mu)\right)\]
What is $\mathbb E[\pmb y \pmb y^\top]$ in this case?
Suppose that $\pmb y$ is a 2D random variable and $\pmb y \sim \mathcal N(\pmb y \mid \pmb \mu, \mathbf \Sigma)$ (a “bivariate Gaussian”), i.e.
\[p(\pmb y) = \frac{1}{(2\pi)^{D/2} \vert \pmb \Sigma \vert ^{1/2}\,}\exp\left(-\frac 1 2(\pmb y - \pmb \mu)^\top\pmb \Sigma^{-1} (\pmb y - \pmb \mu)\right)\]
@State a convenient parameterisation of $\mathbf \Sigma$.
\[\mathbf \Sigma = \begin{pmatrix} \sigma _ 1^2 & \rho \sigma _ 1 \sigma _ 2 \\ \rho \sigma _ 1 \sigma _ 2 & \sigma _ 2^2 \end{pmatrix}\]
where $\rho$ is the correlation coefficient of $Y _ 1$ and $Y _ 2$, the marginals of $\pmb y$, which turn out to also be Gaussian with parameters $\mu _ i$ and $\sigma _ i$. You can interpret a 2D Gaussian as a pair of correlated 1D Gaussians.
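A small sampling sketch (mine) confirming that this parameterisation gives marginal standard deviations $\sigma _ 1, \sigma _ 2$ and correlation $\rho$:

```python
# Build Sigma from (sigma_1, sigma_2, rho) and check the marginal standard
# deviations and the correlation coefficient on samples.
import numpy as np

rng = np.random.default_rng(0)
sigma1, sigma2, rho = 1.0, 2.0, 0.7
Sigma = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])

y = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)
print(y[:, 0].std(), y[:, 1].std())          # ≈ sigma1, sigma2
print(np.corrcoef(y[:, 0], y[:, 1])[0, 1])   # ≈ rho
```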
@Define a spherical/isotropic covariance matrix.
Covariance matrices of the form
\[\mathbf \Sigma = \sigma^2 I\]
Mahalanobis distance
The Mahalanobis distance between $\pmb x$ and $\pmb \mu$ is
\[\Delta(\pmb x, \pmb \mu) = \sqrt{(\pmb x - \pmb \mu)^\top \mathbf \Sigma^{-1} (\pmb x - \pmb \mu)}\]
Contours of constant Mahalanobis distance from $\pmb \mu$ are contours of constant probability density for the Gaussian $\mathcal N(\pmb \mu, \mathbf \Sigma)$.
Marginals and conditionals of an MVN
Suppose $\pmb y = (\pmb y _ 1, \pmb y _ 2)$ is jointly Gaussian with parameters
\[\pmb{\mu} =
\begin{pmatrix}
\mu _ 1 \\[4pt]
\mu _ 2
\end{pmatrix},
\quad
\pmb{\Sigma} =
\begin{pmatrix}
\pmb{\Sigma} _ {11} & \pmb{\Sigma} _ {12} \\[4pt]
\pmb{\Sigma} _ {21} & \pmb{\Sigma} _ {22}
\end{pmatrix},
\quad
\pmb{\Lambda} = \pmb{\Sigma}^{-1} =
\begin{pmatrix}
\pmb{\Lambda} _ {11} & \pmb{\Lambda} _ {12} \\[4pt]
\pmb{\Lambda} _ {21} & \pmb{\Lambda} _ {22}
\end{pmatrix}\]
@State the marginals $p(\pmb y _ i)$ and the posterior conditional $p(\pmb y _ 1 \mid \pmb y _ 2)$.
\[p(\pmb y _ i) = \mathcal N(\pmb y _ i \mid \pmb \mu _ i, \pmb \Sigma _ {ii})\]
and
\[p(\pmb y _ 1 \mid \pmb y _ 2) = \mathcal N(\pmb y _ 1 \mid \pmb \mu _ {1 \vert 2}, \pmb \Sigma _ {1 \vert 2})\]
where:
\[\begin{aligned} \pmb{\mu} _ {1 \vert 2} &= \pmb{\mu} _ 1 + \pmb{\Sigma} _ {12} \pmb{\Sigma} _ {22}^{-1} (\pmb{y} _ 2 - \pmb{\mu} _ 2) \\ &= \pmb{\mu} _ 1 - \pmb{\Lambda} _ {11}^{-1} \pmb{\Lambda} _ {12} (\pmb{y} _ 2 - \pmb{\mu} _ 2) \\ &= \pmb{\Sigma} _ {1 \vert 2} \bigl( \pmb{\Lambda} _ {11} \pmb{\mu} _ 1 - \pmb{\Lambda} _ {12} (\pmb{y} _ 2 - \pmb{\mu} _ 2) \bigr). \end{aligned}\]and
\[\begin{aligned} \pmb{\Sigma} _ {1 \vert 2} &= \pmb{\Sigma} _ {11} - \pmb{\Sigma} _ {12} \pmb{\Sigma} _ {22}^{-1} \pmb{\Sigma} _ {21} \\ &= \pmb{\Lambda} _ {11}^{-1} \end{aligned}\]so in particular, the marginal distributions are also Gaussian.
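A numerical sketch (not from the book) of the two equivalent expressions for $\pmb \Sigma _ {1 \vert 2}$, i.e. that the Schur complement $\pmb \Sigma _ {11} - \pmb \Sigma _ {12} \pmb \Sigma _ {22}^{-1} \pmb \Sigma _ {21}$ equals $\pmb \Lambda _ {11}^{-1}$:

```python
# Check that Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 = Lambda_11^{-1},
# where Lambda = Sigma^{-1} is partitioned the same way as Sigma.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5 * np.eye(5)        # a random positive-definite covariance
Lambda = np.linalg.inv(Sigma)

k = 2                                  # y_1 is 2-dimensional, y_2 is 3-dimensional
S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
L11 = Lambda[:k, :k]

schur = S11 - S12 @ np.linalg.inv(S22) @ S21
print(np.allclose(schur, np.linalg.inv(L11)))
```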
Linear Gaussian systems
Suppose:
- $\pmb z \in \mathbb R^L$ is an unknown vector of values
- $\pmb y \in \mathbb R^D$ is a noisy measurement of $\pmb z$
- $p(\pmb z) = \mathcal N(\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z)$
- $p(\pmb y \mid \pmb z) = \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y)$
- $\mathbf W$ is a matrix of size $D \times L$
What is the name of such a setup?
A linear Gaussian system.
Suppose:
- $\pmb z \in \mathbb R^L$ is an unknown vector of values
- $\pmb y \in \mathbb R^D$ is a noisy measurement of $\pmb z$
- $p(\pmb z) = \mathcal N(\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z)$
- $p(\pmb y \mid \pmb z) = \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y)$
- $\mathbf W$ is a matrix of size $D \times L$
@State the parameters of the corresponding joint distribution $p(\pmb z, \pmb y) = p(\pmb z) p(\pmb y \mid \pmb z)$.
This is an $L + D$-dimensional Gaussian, with mean and covariance given by
\[\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu} _ z \\ \mathbf{W} \boldsymbol{\mu} _ z + \mathbf{b} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ z \mathbf{W}^\top \\ \mathbf{W} \boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ y + \mathbf{W} \boldsymbol{\Sigma} _ z \mathbf{W}^\top \end{pmatrix}\]
Suppose:
- $\pmb z \in \mathbb R^L$ is an unknown vector of values
- $\pmb y \in \mathbb R^D$ is a noisy measurement of $\pmb z$
- $p(\pmb z) = \mathcal N(\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z)$
- $p(\pmb y \mid \pmb z) = \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y)$
- $\mathbf W$ is a matrix of size $D \times L$
The corresponding joint distribution $p(\pmb z, \pmb y) = p(\pmb z) p(\pmb y \mid \pmb z)$ is an $L + D$-dimensional Gaussian, with mean and covariance given by
\[\boldsymbol{\mu} =
\begin{pmatrix}
\boldsymbol{\mu} _ z \\
\mathbf{W} \boldsymbol{\mu} _ z + \mathbf{b}
\end{pmatrix},
\qquad
\boldsymbol{\Sigma} =
\begin{pmatrix}
\boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ z \mathbf{W}^\top \\
\mathbf{W} \boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ y + \mathbf{W} \boldsymbol{\Sigma} _ z \mathbf{W}^\top
\end{pmatrix}\]
@State “Bayes rule for Gaussians” to derive $p(\pmb z \mid \pmb y)$ by applying the formula for Gaussian conditioning, and interpret this result in terms of “conjugate priors”.
\[p(\pmb z \mid \pmb y) = \mathcal N(\pmb z \mid \pmb \mu _ {z \vert y}, \pmb \Sigma _ {z \vert y})\]
where
\[\begin{aligned} \pmb \Sigma _ {z \vert y} &= \left( \pmb \Sigma _ z^{-1} + \mathbf W^\top \pmb \Sigma _ y^{-1} \mathbf W \right)^{-1} \\ \pmb \mu _ {z \vert y} &= \pmb \Sigma _ {z \vert y}\left[ \mathbf W^\top \pmb \Sigma _ y^{-1} (\pmb y - \pmb b) + \pmb \Sigma _ z^{-1} \pmb \mu _ z \right] \end{aligned}\]
and
\[p(\pmb y) = \int \mathcal N (\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z) \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y) \text d\pmb z = \mathcal N(\pmb y \mid \mathbf W \pmb \mu _ z + \pmb b, \pmb \Sigma _ y + \pmb W \pmb \Sigma _ z \pmb W^\top)\]
This shows that a Gaussian prior $p(\pmb z)$ combined with a Gaussian likelihood $p(\pmb y \mid \pmb z)$ gives a Gaussian posterior $p(\pmb z \mid \pmb y)$, so Gaussians are closed under Bayesian conditioning.
In other words, the Gaussian prior is a conjugate prior for the Gaussian likelihood.
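A numerical sketch (mine) checking that the posterior from the precision-form expressions matches what you get by conditioning the explicit joint Gaussian on $\pmb y$:

```python
# Bayes rule for Gaussians: posterior from the precision-form formulas vs
# conditioning the joint Gaussian over (z, y) on the observed y.
import numpy as np

rng = np.random.default_rng(0)
mu_z = np.array([1.0, -1.0])
Sigma_z = np.array([[1.0, 0.2], [0.2, 0.5]])
W = rng.normal(size=(3, 2))                      # D = 3, L = 2
b = np.array([0.1, 0.0, -0.2])
Sigma_y = 0.3 * np.eye(3)
y = np.array([0.5, 1.5, -0.5])                   # an arbitrary observation

# Precision form.
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_z) + W.T @ np.linalg.inv(Sigma_y) @ W)
mu_post = Sigma_post @ (W.T @ np.linalg.inv(Sigma_y) @ (y - b) + np.linalg.inv(Sigma_z) @ mu_z)

# Conditioning the joint Gaussian.
Szy = Sigma_z @ W.T
Syy = Sigma_y + W @ Sigma_z @ W.T
mu_cond = mu_z + Szy @ np.linalg.inv(Syy) @ (y - (W @ mu_z + b))
Sigma_cond = Sigma_z - Szy @ np.linalg.inv(Syy) @ Szy.T

print(np.allclose(mu_post, mu_cond), np.allclose(Sigma_post, Sigma_cond))
```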
Exponential family
Suppose:
- We have a family of probability distributions parameterised by $\pmb \eta \in \mathbb R^K$
- These distributions have fixed support over $\mathcal Y^D \subseteq \mathbb R^D$.
@Define what it means for the distribution $p(\pmb y \mid \pmb \eta)$ to be in the exponential family.
The density can be written in the form
\[\begin{aligned} p(\pmb y \mid \pmb \eta) &= \frac{1}{Z(\pmb \eta)} h(\pmb y) \exp[\pmb \eta^\top \mathcal T(\pmb y)] \\ &= h(\pmb y) \text{exp}[\pmb \eta^\top \mathcal T(\pmb y) - A(\pmb \eta)] \end{aligned}\]where:
- $h(\pmb y)$ is a scaling constant
- $\mathcal T(\pmb y) \in \mathbb R^K$ are “sufficient statistics”
- $\pmb \eta$ are the natural parameters
- $Z(\pmb \eta)$ is a normalisation constant known as the partition function
- $A(\pmb \eta) = \log Z(\pmb \eta)$ is the log partition function
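As a worked example (a standard one, not specific to these notes), the Bernoulli distribution is in the exponential family:
\[\text{Ber}(y \mid \mu) = \mu^y (1 - \mu)^{1 - y} = \exp\left[ y \log\frac{\mu}{1 - \mu} + \log(1 - \mu) \right]\]
so $\eta = \text{logit}(\mu)$, $\mathcal T(y) = y$, $A(\eta) = -\log(1 - \mu) = \log(1 + e^\eta)$ and $h(y) = 1$.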
Mixture models
@Define a mixture model $p(\pmb y \mid \pmb \theta)$.
A convex combination of distributions, i.e.
\[p(\pmb y \mid \pmb \theta) = \sum^K _ {k = 1} \pi _ k\, p _ k(\pmb y)\]
where:
- $p _ k$ is the $k$th mixture component
- $\pi _ k$ are the mixture weights which satisfy $0 \le \pi _ k \le 1$
- $\sum^K _ {k = 1} \pi _ k = 1$
A mixture model $p(\pmb y \mid \pmb \theta)$ is defined via
\[p(\pmb y \mid \pmb \theta) = \sum^K _ {k = 1} \pi _ k\, p _ k(\pmb y)\]
@State how you can re-express this model as a hierarchical model.
Introduce the discrete latent variable $z \in \{1, \ldots, K\}$, which specifies which distribution to use for generating the output $\pmb y$, with the prior $p(z = k \mid \pmb \theta) = \pi _ k$ and the conditional $p(\pmb y \mid z = k, \pmb \theta) = p _ k(\pmb y) = p(\pmb y \mid \pmb \theta _ k)$. In other words, we have
\[\begin{aligned} p(z \mid \pmb \theta) &= \text{Cat}(z \mid \pmb \pi) \\ p(\pmb y \mid z = k, \pmb \theta) &= p(\pmb y \mid \pmb \theta _ k) \end{aligned}\]
where $\pmb \theta = (\pi _ 1, \ldots, \pi _ K, \pmb \theta _ 1, \ldots, \pmb \theta _ K)$ are the model parameters.
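A minimal sketch (mine) of ancestral sampling from this hierarchical form, using 1D Gaussian components:

```python
# Ancestral sampling from the hierarchical form of a mixture model:
# draw z ~ Cat(pi), then draw y from the chosen (here 1-D Gaussian) component.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.5, 0.2])        # mixture weights
mus = np.array([-4.0, 0.0, 5.0])      # component means
sigmas = np.array([1.0, 0.5, 2.0])    # component standard deviations

n = 10_000
z = rng.choice(len(pi), size=n, p=pi)          # latent component indicators
y = rng.normal(loc=mus[z], scale=sigmas[z])    # y | z = k  ~  N(mu_k, sigma_k^2)

# Empirical component frequencies match the prior p(z = k) = pi_k.
print(np.bincount(z, minlength=len(pi)) / n)
```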
Gaussian mixture models
@Define what it means for a random variable $\pmb y$ to be a mixture of Gaussians.
\[p(\pmb y \mid \pmb \theta) = \sum^K _ {k = 1} \pi _ k\, \mathcal N(\pmb y \mid \pmb \mu _ k, \pmb \Sigma _ k)\]
where the $\pi _ k$ are nonnegative and sum to $1$.
Probabilistic graphical models
See Graphical model on Wikipedia.
What is the name for probabilistic graphical models which are DAGs?
Bayesian networks.
Consider the following Bayesian network:

@State the general rule used for factorising the joint probability distribution given these diagrams, and factorise the joint probability distribution $\mathbb P(A, B, C, D)$ in this specific case.

The general rule is that
\[\mathbb P(X _ 1, \ldots, X _ n) = \prod^n _ {i = 1} \mathbb P(X _ i \mid \text{pa}(X _ i))\]@example~
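For example (an illustrative DAG, not necessarily the one on this card): with edges $A \to B$, $A \to C$, $B \to D$ and $C \to D$, the rule gives
\[\mathbb P(A, B, C, D) = \mathbb P(A)\, \mathbb P(B \mid A)\, \mathbb P(C \mid A)\, \mathbb P(D \mid B, C)\]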