Uncertainty in Deep Learning MT25, Probability reference


Most of my notes on basic probability theory can be found in [[Course - Probability MT22]] and interspersed throughout [[Course - Machine Learning MT23]] (especially [[Notes - Machine Learning MT23, Bayesian machine learning]]).

The notes here come primarily from chapters 2 and 3 of Probabilistic Machine Learning by Kevin Murphy.

Flashcards

Univariate models

Conditional independence

@Define what it means for two events $A$ and $B$ to be conditionally independent given an event $C$, and state the notation used to write this.

\[\mathbb P(A, B \mid C) = \mathbb P(A \mid C) \, \mathbb P(B \mid C)\]

This is written $A \perp B \mid C$.


Conditional moments

@State the law of total expectation.


\[\mathbb E[X] = \mathbb E _ Y[\mathbb E[X \mid Y]]\]

@Prove the law of total expectation, i.e. that

\[\mathbb E[X] = \mathbb E _ Y[\mathbb E[X \mid Y]]\]

\[\begin{aligned} \mathbb E _ Y[\mathbb E[X \mid Y]] &= \mathbb E _ Y \left[\sum _ x x \mathbb P(X = x \mid Y)\right] \\ &= \sum _ y \left[ \sum _ x x \mathbb P(X = x \mid Y = y) \right] \mathbb P(Y = y) \\ &= \sum _ {x,y} x\mathbb P(X = x, Y = y) \\ &= \mathbb E[X] \end{aligned}\]

@State the law of total variance.


\[\text{Var}(X) = \mathbb E _ Y(\text{Var}(X \mid Y)) + \text{Var} _ Y(\mathbb E[X \mid Y])\]

@Prove the law of total variance, i.e. that

\[\text{Var}(X) = \mathbb E _ Y(\text{Var}(X \mid Y)) + \text{Var} _ Y(\mathbb E[X \mid Y])\]

Define:

  • $\mu _ {X \vert Y} = \mathbb E[X \mid Y]$
  • $s _ {X \mid Y} = \mathbb E[X^2 \mid Y]$
  • $\sigma^2 _ {X \vert Y} = \text{Var}(X \mid Y) = s _ {X \vert Y} - \mu^2 _ {X \vert Y}$

Then:

\[\begin{aligned} \text{Var}(X) &= \mathbb E[X^2] - (\mathbb E[X])^2 \\ &= \mathbb E _ Y[s _ {X \vert Y}] - (\mathbb E _ Y[\mu _ {X \vert Y}])^2 \\ &= \mathbb E _ Y[\sigma^2 _ {X \vert Y}] + \mathbb E _ Y[\mu^2 _ {X \vert Y}] - (\mathbb E _ Y[\mu _ {X \vert Y}])^2 \\ &= \mathbb E _ Y[\text{Var}(X \mid Y)] + \text{Var} _ Y(\mu _ {X \vert Y}) \end{aligned}\]
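
Both laws can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint table `joint` and the values `x_vals` are made-up numbers, not taken from the notes.

```python
import numpy as np

# Hypothetical joint pmf P(X = x, Y = y); rows index x, columns index y.
x_vals = np.array([0.0, 1.0, 2.0])
joint = np.array([[0.10, 0.20],
                  [0.30, 0.10],
                  [0.05, 0.25]])

p_y = joint.sum(axis=0)            # marginal P(Y = y)
p_x = joint.sum(axis=1)            # marginal P(X = x)
p_x_given_y = joint / p_y          # column j holds P(X = x | Y = y_j)

cond_mean = x_vals @ p_x_given_y                       # E[X | Y = y]
cond_var = (x_vals**2) @ p_x_given_y - cond_mean**2    # Var(X | Y = y)

# Direct moments of X.
mean_x = x_vals @ p_x
var_x = (x_vals**2) @ p_x - mean_x**2

# Right-hand sides of the two laws.
total_expectation = cond_mean @ p_y
var_of_cond_mean = (cond_mean**2) @ p_y - (cond_mean @ p_y) ** 2
total_variance = cond_var @ p_y + var_of_cond_mean
```

Here `total_expectation` should agree with `mean_x` and `total_variance` with `var_x`, up to floating-point error.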
Properties of the sigmoid function

@Define the $\text{logit}$ function.


\[\text{logit}(p) = \sigma^{-1}(p) = \log\left(\frac{p}{1-p}\right)\]
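
A minimal sketch of the inverse relationship, assuming the usual definition $\sigma(a) = 1 / (1 + e^{-a})$ for the sigmoid (the function names are just illustrative):

```python
import math

def sigmoid(a):
    """Map a real number (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    """Map a probability in (0, 1) back to its log-odds."""
    return math.log(p / (1.0 - p))

# Round trip: logit undoes sigmoid.
a = 1.5
a_recovered = logit(sigmoid(a))
```

So `a_recovered` matches `a` up to rounding, and `logit(0.5)` is zero since even odds correspond to log-odds of $0$.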
Dirac delta function as a limiting case of the Gaussian

What happens to a Gaussian as you shrink its variance to $0$?


\[\lim _ {\sigma \to 0} \mathcal N(y \mid \mu, \sigma^2) = \delta(y - \mu)\]
Student $t$ distribution

@Define the probability density function of the Student $t$ distribution $\mathcal T(y \mid \mu, \sigma^2, \nu)$, and describe how the choice of $\nu$ (the “degrees of freedom” or “degree of normality”) affects the distribution.


\[\mathcal T(y \mid \mu, \sigma^2, \nu) = C \left( 1 + \frac 1 \nu \left( \frac{y - \mu}{\sigma} \right)^2 \right)^{-\left(\frac{\nu+1}{2}\right)}\]

where:

  • $\mu$ is the mean
  • $\sigma > 0$ is a scale parameter distinct from the standard deviation
  • $\nu$ is the “degree of normality”; large values of $\nu$ make the distribution act like a Gaussian
  • $C$ is a normalising constant chosen so that the density integrates to one

Intuitively, why is the Student $t$ distribution $\mathcal T(y \mid \mu, \sigma^2, \nu)$ more robust to errors than the normal distribution $\mathcal N(y \mid \mu, \sigma^2)$?


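The probability density decays as a polynomial function of the squared distance from the mean, rather than exponentially. The resulting heavy tails assign far more probability to outliers, so a few extreme points distort the fit much less.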
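
To make the difference in tails concrete, here is a rough sketch comparing log-densities at an outlier. The closed form used for $C$, namely $\Gamma(\frac{\nu+1}{2}) / (\Gamma(\frac{\nu}{2}) \sqrt{\nu\pi}\,\sigma)$, is the standard one and is not spelled out in these notes; the test point $y = 10$ and $\nu = 3$ are arbitrary choices.

```python
import math

def student_t_logpdf(y, mu, sigma, nu):
    """Log-density of T(y | mu, sigma^2, nu)."""
    # log of the normalising constant C
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    z = (y - mu) / sigma
    return log_c - (nu + 1) / 2 * math.log1p(z * z / nu)

def gaussian_logpdf(y, mu, sigma):
    """Log-density of N(y | mu, sigma^2)."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z

# An outlier ten scale units from the centre.
t_log = student_t_logpdf(10.0, 0.0, 1.0, nu=3)
g_log = gaussian_logpdf(10.0, 0.0, 1.0)
```

The Gaussian log-density falls quadratically in the distance, while the Student $t$ log-density falls only logarithmically, so `t_log` is far larger than `g_log` at the outlier.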

Recall that the probability density function of the Student $t$ distribution $\mathcal T(y \mid \mu, \sigma^2, \nu)$ is given by:

\[\mathcal T(y \mid \mu, \sigma^2, \nu) = C \left( 1 + \frac 1 \nu \left( \frac{y - \mu}{\sigma} \right)^2 \right)^{-\left(\frac{\nu+1}{2}\right)}\]

where:

  • $\mu$ is the mean
  • $\sigma > 0$ is a scale parameter distinct from the standard deviation
  • $\nu$ is the “degree of normality”; large values of $\nu$ make the distribution act like a Gaussian
  • $C$ is a normalising constant chosen so that the density integrates to one

What is the mean, mode and variance of this distribution?


  • Mean: $\mu$ (defined when $\nu > 1$)
  • Mode: $\mu$
  • Variance: $\frac{\nu \sigma^2}{(\nu - 2)}$ (defined when $\nu > 2$)
Cauchy distribution

@Define the probability density function of the Cauchy distribution $\mathcal C(x \mid \mu, \gamma)$ and the half Cauchy distribution $\mathcal C _ +$. In what situation is the half Cauchy distribution often used?


\[\mathcal C(x \mid \mu, \gamma) = \frac{1}{\gamma \pi} \left[ 1 + \left(\frac{x - \mu}{\gamma}\right)^2 \right]^{-1}\]

i.e. it is the Student $t$ distribution with $\nu = 1$. The half Cauchy distribution is the Cauchy with $\mu = 0$ folded over at the origin, so that all of its mass lies on the positive reals:

\[\mathcal C _ +(x \mid \gamma) = \frac{2}{\pi \gamma}\left[ 1 + \left(\frac{x}{\gamma}\right)^2 \right]^{-1}\]

This is useful when you want a heavy-tailed distribution over the positive reals that still has finite density at the origin.

Empirical distribution

Suppose we have a set of $N$ samples $\mathcal D = \{x^{(1)}, \ldots, x^{(N)}\}$. @Define the empirical distribution.


The distribution formed by placing a spike (a Dirac delta) of mass $1/N$ at each of these points, i.e.

\[\hat p _ N (x) = \frac 1 N \sum^N _ {n = 1}\delta _ {x^{(n)}}(x)\]
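
One consequence is that expectations under $\hat p _ N$ are just sample averages. A small sketch with made-up samples:

```python
import numpy as np

# Hypothetical samples; repeated values simply stack up mass.
samples = np.array([1.0, 2.0, 2.0, 5.0])

# Collapse the spikes: each distinct value gets mass count / N.
values, counts = np.unique(samples, return_counts=True)
weights = counts / samples.size

# Expectation of X under the empirical distribution.
expectation = values @ weights
```

Here `expectation` coincides with `samples.mean()`.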
Other distributions
  • Truncated Gaussian distribution
    • Restrict the Gaussian to the interval $[a, b]$ and renormalise so it integrates to $1$
  • Beta distribution:
    • Has support over $[0, 1]$
  • Gamma distribution
    • Has support over $(0, \infty)$
  • Exponential distribution
    • Describes the times between events in a Poisson process, which is a process in which events occur continuously and independently at a constant average rate
  • Chi-squared distribution
    • Comes from the sum of squares of independent standard Gaussian random variables
  • Inverse Gamma distribution
Transformations of discrete random variables

Suppose:

  • $X$ is a discrete random variable with probability mass function $p _ x$.
  • $f$ is a deterministic function
  • $Y = f(X)$

@State the probability mass function $p _ y$.


\[p _ y(y) = \sum _ {x \text{ s.t. } f(x) = y} p _ x(x)\]
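
In code this is just "add up the mass of every $x$ that maps to the same $y$". A minimal sketch, where `pushforward_pmf` is a hypothetical helper name and the fair die is an illustrative choice:

```python
from collections import defaultdict

def pushforward_pmf(p_x, f):
    """pmf of Y = f(X), given the pmf of X as a dict {x: probability}."""
    p_y = defaultdict(float)
    for x, prob in p_x.items():
        p_y[f(x)] += prob      # all x with f(x) = y contribute to p_y(y)
    return dict(p_y)

# X is a fair six-sided die; Y = 1 if X is odd, else 0.
p_x = {x: 1 / 6 for x in range(1, 7)}
p_y = pushforward_pmf(p_x, lambda x: x % 2)
```

Three faces map to each outcome, so both values of $Y$ end up with probability one half.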
Transformations of continuous random variables

Suppose:

  • $X$ is a scalar continuous random variable with probability density function $p _ x$.
  • $f$ is a deterministic, strictly monotonic (and so in particular invertible) function with inverse $g = f^{-1}$
  • $Y = f(X)$

@State the probability density function $p _ y$ found using the change of variables formula.


\[p _ y(y) = p _ x(g(y)) \left \vert \frac{\text d}{\text dy}g(y) \right \vert\]
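
As a quick worked example (an illustrative choice, not taken from the notes above): take $X \sim \text{Unif}(0, 1)$ and $Y = f(X) = 2X + 1$, so that the inverse is $g(y) = (y - 1)/2$ with $\frac{\text d}{\text dy} g(y) = \frac 1 2$. Then

\[p _ y(y) = p _ x\left(\frac{y - 1}{2}\right) \left\vert \frac 1 2 \right\vert = \frac 1 2 \, \mathbb I(1 \le y \le 3)\]

i.e. $Y \sim \text{Unif}(1, 3)$: stretching the interval by a factor of $2$ halves the density.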

Suppose:

  • $\pmb X$ is a multidimensional continuous random variable with probability density function $p _ x$.
  • $\pmb f$ is a deterministic and invertible function with inverse $\pmb g$, from $\mathbb R^n \to \mathbb R^n$
  • $\pmb Y = \pmb f(\pmb X)$

@State the probability density function $p _ y$ found using the change of variables formula.


\[p _ y(\pmb y) = p _ x(\pmb g(\pmb y)) \vert \det[\pmb J _ {\pmb g}(\pmb y)] \vert\]
Moments of a linear transformation

Suppose:

  • $\pmb x$ is a multidimensional random variable
  • $\pmb y = \mathbf A \pmb x + \pmb b$

@State the mean and covariance of $\pmb y$, and what this reduces to when $\pmb y$ is a scalar (so that $\mathbf A = \pmb a^\top$).


  • Mean: $\mathbf A \pmb \mu + \pmb b$ where $\pmb \mu = \mathbb E[\pmb x]$, or in particular $\pmb a^\top \pmb \mu + b$ when $\mathbf A = \pmb a^\top$
  • Covariance: $\mathbf A \mathbf \Sigma \mathbf A^\top$ where $\pmb \Sigma = \text{Cov}[\pmb x]$, or in particular $\pmb a^\top \pmb \Sigma \pmb a$ when $\mathbf A = \pmb a^\top$
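
These identities also hold exactly for sample moments, which gives an easy sanity check. The sketch below uses arbitrary choices of `A`, `b` and random data; none of these come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))          # each row is one sample of a 3-D vector

# Hypothetical linear map from R^3 to R^2.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
b = np.array([0.5, -2.0])
y = x @ A.T + b                         # y = A x + b, applied row by row

mu = x.mean(axis=0)
Sigma = np.cov(x, rowvar=False)

mean_y = y.mean(axis=0)                 # compare with A mu + b
cov_y = np.cov(y, rowvar=False)         # compare with A Sigma A^T
```

Note that the offset `b` shifts the mean but drops out of the covariance.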
The convolution theorem

Suppose:

  • $x _ 1, x _ 2$ are two independent random variables
  • $y = x _ 1 + x _ 2$

@State the probability mass function $p _ y$ when these are discrete random variables, and the probability density function $p _ y$ when these are continuous random variables.


\[p _ y = p _ 1 \ast p _ 2\]

where $\ast$ denotes convolution, so that

\[p _ y(y = j) = \sum _ k p _ 1(x _ 1 = k) \, p _ 2(x _ 2 = j-k)\]

in the discrete case, and in the continuous case

\[p(y) = \int p _ 1(x _ 1) p _ 2(y - x _ 1) \text dx _ 1\]
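
For the discrete case, the sum of two fair dice is a convenient illustration (the dice are an example of my choosing; `np.convolve` computes exactly the discrete sum over $k$):

```python
import numpy as np

# pmf of one fair die over faces 1..6 (index i corresponds to face i + 1).
die = np.full(6, 1 / 6)

# pmf of the total of two independent dice over totals 2..12
# (index i corresponds to total i + 2).
p_sum = np.convolve(die, die)

prob_seven = p_sum[7 - 2]
```

The result has 11 entries and peaks at a total of 7, with probability $6/36$.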
Central limit theorem

Suppose:

  • We have $N$ i.i.d. random variables $X _ 1, \ldots, X _ N$, each with mean $\mu$ and variance $\sigma^2$
  • $S _ N = \sum^N _ {n = 1}X _ n$

@State the central limit theorem.

\[p(S _ N = u) \approx \mathcal N(u \mid N\mu, N\sigma^2) \quad \text{as } N \to \infty\]

and so the distribution of the quantity $Z _ N := \frac{S _ N - N\mu}{\sigma \sqrt{N}}$ converges to a standard normal.
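
A deterministic way to see this, sketched under the assumption that each $X _ n$ is a fair coin flip on $\{0, 1\}$ (so $\mu = 0.5$ and $\sigma^2 = 0.25$): build the exact pmf of $S _ N$ by repeated convolution and compare it with a Gaussian of mean $N\mu$ and variance $N\sigma^2$.

```python
import numpy as np

coin = np.array([0.5, 0.5])     # pmf of one fair coin flip on {0, 1}
N = 200

# Exact pmf of S_N on {0, ..., N}, built by N convolutions.
pmf = np.array([1.0])
for _ in range(N):
    pmf = np.convolve(pmf, coin)

# Gaussian density with mean N * mu and variance N * sigma^2.
mu, var = 0.5, 0.25
u = np.arange(N + 1)
gauss = np.exp(-(u - N * mu) ** 2 / (2 * N * var)) / np.sqrt(2 * np.pi * N * var)

max_gap = np.abs(pmf - gauss).max()
```

For $N = 200$ the largest pointwise gap is already small compared with the peak height of roughly $0.056$.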


Multivariate models

Covariance matrix

Suppose $\pmb x$ is a $D$-dimensional random vector. @Define the covariance matrix $\text{Cov}[\pmb x]$.


\[\begin{aligned} \operatorname{Cov}[\mathbf{x}] &= \mathbb{E}\!\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^\top\right] \\ &= \boldsymbol{\Sigma} \\ &= \begin{pmatrix} \operatorname{Var}[X _ 1] & \operatorname{Cov}[X _ 1, X _ 2] & \cdots & \operatorname{Cov}[X _ 1, X _ D] \\ \operatorname{Cov}[X _ 2, X _ 1] & \operatorname{Var}[X _ 2] & \cdots & \operatorname{Cov}[X _ 2, X _ D] \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}[X _ D, X _ 1] & \operatorname{Cov}[X _ D, X _ 2] & \cdots & \operatorname{Var}[X _ D] \end{pmatrix} \end{aligned}\]

What is $\mathbb E[\pmb x\pmb x^\top]$?


\[\pmb \Sigma + \pmb \mu \pmb \mu^\top\]

@Define the cross-covariance between two random vectors $\pmb x, \pmb y$.


\[\text{Cov}[\pmb x, \pmb y] = \mathbb E[(\pmb x-\mathbb E[\pmb x])(\pmb y - \mathbb E[\pmb y])^\top]\]
Pearson correlation coefficient

@Define the Pearson correlation coefficient between random variables $X$ and $Y$, state why it is useful, and explain why “degree of linearity” might be a better term.


\[\rho = \text{Corr}[X, Y] = \frac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]\text{Var}[Y]}}\]
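  • It is useful because the covariance between two random variables can be any real number, whereas $\rho$ always lies between $-1$ and $1$, giving a normalised measure of dependence.
  • Two datasets can be highly related in nonlinear ways despite having a $\rho$ of $0$, so $\rho$ only measures how close the relationship is to linear.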

The Pearson correlation coefficient is defined as

\[\rho = \text{Corr}[X, Y] = \frac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]\text{Var}[Y]}}\]

@State a result about when $X$ and $Y$ are in a linear relationship.


\[\rho = 1 \iff Y = aX + b \text{ for some } a> 0, b\]
Correlation matrix

Suppose $\pmb x$ is a random vector. @Define the correlation matrix $\text{corr}(\pmb x)$.


\[\mathrm{corr}(x) = \begin{pmatrix} 1 & \dfrac{\mathbb{E}[(X _ 1 - \mu _ 1)(X _ 2 - \mu _ 2)]}{\sigma _ 1 \sigma _ 2} & \cdots & \dfrac{\mathbb{E}[(X _ 1 - \mu _ 1)(X _ D - \mu _ D)]}{\sigma _ 1 \sigma _ D} \\[1em] \dfrac{\mathbb{E}[(X _ 2 - \mu _ 2)(X _ 1 - \mu _ 1)]}{\sigma _ 2 \sigma _ 1} & 1 & \cdots & \dfrac{\mathbb{E}[(X _ 2 - \mu _ 2)(X _ D - \mu _ D)]}{\sigma _ 2 \sigma _ D} \\[1em] \vdots & \vdots & \ddots & \vdots \\[1em] \dfrac{\mathbb{E}[(X _ D - \mu _ D)(X _ 1 - \mu _ 1)]}{\sigma _ D \sigma _ 1} & \dfrac{\mathbb{E}[(X _ D - \mu _ D)(X _ 2 - \mu _ 2)]}{\sigma _ D \sigma _ 2} & \cdots & 1 \end{pmatrix}\]

The correlation matrix $\text{corr}(\pmb x)$ of a random vector is defined as

\[\mathrm{corr}(x) = \begin{pmatrix} 1 & \dfrac{\mathbb{E}[(X _ 1 - \mu _ 1)(X _ 2 - \mu _ 2)]}{\sigma _ 1 \sigma _ 2} & \cdots & \dfrac{\mathbb{E}[(X _ 1 - \mu _ 1)(X _ D - \mu _ D)]}{\sigma _ 1 \sigma _ D} \\[1em] \dfrac{\mathbb{E}[(X _ 2 - \mu _ 2)(X _ 1 - \mu _ 1)]}{\sigma _ 2 \sigma _ 1} & 1 & \cdots & \dfrac{\mathbb{E}[(X _ 2 - \mu _ 2)(X _ D - \mu _ D)]}{\sigma _ 2 \sigma _ D} \\[1em] \vdots & \vdots & \ddots & \vdots \\[1em] \dfrac{\mathbb{E}[(X _ D - \mu _ D)(X _ 1 - \mu _ 1)]}{\sigma _ D \sigma _ 1} & \dfrac{\mathbb{E}[(X _ D - \mu _ D)(X _ 2 - \mu _ 2)]}{\sigma _ D \sigma _ 2} & \cdots & 1 \end{pmatrix}\]

How could you write this more compactly?


\[\text{corr}(\pmb x) = (\text{diag}(\mathbf K _ {xx}))^{-1/2} \, \mathbf K _ {xx} \, (\text{diag}(\mathbf K _ {xx}))^{-1/2}\]

where $\mathbf K _ {xx}$ is the auto-covariance matrix

\[\mathbf K _ {xx} = \mathbf \Sigma = \mathbb E[(\pmb x - \mathbb E[\pmb x])(\pmb x - \mathbb E[\pmb x])^\top] = \mathbf R _ {xx} - \pmb \mu \pmb \mu^\top\]

and $\mathbf R _ {xx} = \mathbb E[\pmb x \pmb x^\top]$ is the auto-correlation matrix.
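
The compact form translates directly into code. A small sketch with a made-up covariance matrix `K`:

```python
import numpy as np

# Hypothetical positive-definite covariance matrix.
K = np.array([[4.0, 2.0, 0.0],
              [2.0, 9.0, -1.5],
              [0.0, -1.5, 1.0]])

# diag(K)^{-1/2}: a diagonal matrix of reciprocal standard deviations.
d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(K)))

corr = d_inv_sqrt @ K @ d_inv_sqrt
```

Each entry is divided by the two relevant standard deviations, so the diagonal becomes all ones and, for instance, `corr[0, 1]` is $2 / (2 \cdot 3) = 1/3$.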

Multivariate Gaussian distribution

See [[Notes - Machine Learning MT23, Multivariate Gaussians]], specifically ∆multivarite-gaussian-pdf and ∆multivariate-gaussian-eigenvectors.

Suppose that $\pmb y \sim \mathcal N(\pmb y \mid \pmb \mu, \mathbf \Sigma)$, i.e.

\[p(\pmb y) = \frac{1}{(2\pi)^{D/2} \vert \pmb \Sigma \vert ^{1/2}\,}\exp\left(-\frac 1 2(\pmb y - \pmb \mu)^\top\pmb \Sigma^{-1} (\pmb y - \pmb \mu)\right)\]

What is $\mathbb E[\pmb y \pmb y^\top]$ in this case?


\[\pmb \Sigma + \pmb \mu \pmb \mu^\top\]

Suppose that $\pmb y$ is a 2D random variable and $\pmb y \sim \mathcal N(\pmb y \mid \pmb \mu, \mathbf \Sigma)$ (a “bivariate Gaussian”), i.e.

\[p(\pmb y) = \frac{1}{(2\pi)^{D/2} \vert \pmb \Sigma \vert ^{1/2}\,}\exp\left(-\frac 1 2(\pmb y - \pmb \mu)^\top\pmb \Sigma^{-1} (\pmb y - \pmb \mu)\right)\]

@State a convenient parameterisation of $\mathbf \Sigma$.


\[\mathbf \Sigma = \begin{pmatrix} \sigma _ 1^2 & \rho \sigma _ 1 \sigma _ 2 \\ \rho \sigma _ 1 \sigma _ 2 & \sigma _ 2^2 \end{pmatrix}\]

where $\rho$ is the correlation coefficient of $Y _ 1$ and $Y _ 2$, the marginals of $\pmb y$, which turn out to also be Gaussian with means $\mu _ i$ and standard deviations $\sigma _ i$. You can interpret this as a 2D Gaussian being a pair of correlated 1D Gaussians.

@Define a spherical/isotropic covariance matrix.


Covariance matrices of the form

\[\mathbf \Sigma = \sigma^2 I\]
Mahalanobis distance

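The distance $\Delta = \sqrt{(\pmb y - \pmb \mu)^\top \pmb \Sigma^{-1} (\pmb y - \pmb \mu)}$ between a point $\pmb y$ and the mean $\pmb \mu$ of a Gaussian with covariance $\pmb \Sigma$. Contours of constant Mahalanobis distance from $\pmb \mu$ are contours of constant probability density.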
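
A minimal sketch of the computation, using the standard formula $\sqrt{(\pmb y - \pmb \mu)^\top \pmb \Sigma^{-1} (\pmb y - \pmb \mu)}$; the function name and the example covariance are illustrative assumptions.

```python
import numpy as np

def mahalanobis(y, mu, Sigma):
    """Mahalanobis distance of y from a Gaussian with mean mu, covariance Sigma."""
    diff = y - mu
    # Solve Sigma v = diff rather than forming the inverse explicitly.
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

mu = np.zeros(2)
Sigma = np.diag([4.0, 1.0])      # standard deviation 2 along axis 0, 1 along axis 1

d_wide = mahalanobis(np.array([2.0, 0.0]), mu, Sigma)
d_narrow = mahalanobis(np.array([0.0, 1.0]), mu, Sigma)
```

Both points sit one standard deviation from the mean along their respective axes, so they have the same Mahalanobis distance even though their Euclidean distances differ.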

Marginals and conditionals of an MVN

Suppose $\pmb y = (\pmb y _ 1, \pmb y _ 2)$ is jointly Gaussian with parameters

\[\pmb{\mu} = \begin{pmatrix} \mu _ 1 \\[4pt] \mu _ 2 \end{pmatrix}, \quad \pmb{\Sigma} = \begin{pmatrix} \pmb{\Sigma} _ {11} & \pmb{\Sigma} _ {12} \\[4pt] \pmb{\Sigma} _ {21} & \pmb{\Sigma} _ {22} \end{pmatrix}, \quad \pmb{\Lambda} = \pmb{\Sigma}^{-1} = \begin{pmatrix} \pmb{\Lambda} _ {11} & \pmb{\Lambda} _ {12} \\[4pt] \pmb{\Lambda} _ {21} & \pmb{\Lambda} _ {22} \end{pmatrix}\]

@State the marginals $p(\pmb y _ i)$ and the posterior conditional $p(\pmb y _ 1 \mid \pmb y _ 2)$.


\[\begin{aligned} p(\pmb y _ 1) &= \mathcal N(\pmb y _ 1 \mid \pmb \mu _ 1, \pmb \Sigma _ {11}) \\ p(\pmb y _ 2) &= \mathcal N(\pmb y _ 2 \mid \pmb \mu _ 2, \pmb \Sigma _ {22}) \end{aligned}\]

and

\[p(\pmb y _ 1 \mid \pmb y _ 2) = \mathcal N(\pmb y _ 1 \mid \pmb \mu _ {1 \vert 2}, \pmb \Sigma _ {1 \vert 2})\]

where:

\[\begin{aligned} \pmb{\mu} _ {1 \vert 2} &= \pmb{\mu} _ 1 + \pmb{\Sigma} _ {12} \pmb{\Sigma} _ {22}^{-1} (\pmb{y} _ 2 - \pmb{\mu} _ 2) \\ &= \pmb{\mu} _ 1 - \pmb{\Lambda} _ {11}^{-1} \pmb{\Lambda} _ {12} (\pmb{y} _ 2 - \pmb{\mu} _ 2) \\ &= \pmb{\Sigma} _ {1 \vert 2} \bigl( \pmb{\Lambda} _ {11} \pmb{\mu} _ 1 - \pmb{\Lambda} _ {12} (\pmb{y} _ 2 - \pmb{\mu} _ 2) \bigr). \end{aligned}\]

and

\[\begin{aligned} \pmb{\Sigma} _ {1 \vert 2} &= \pmb{\Sigma} _ {11} - \pmb{\Sigma} _ {12} \pmb{\Sigma} _ {22}^{-1} \pmb{\Sigma} _ {21} \\ &= \pmb{\Lambda} _ {11}^{-1} \end{aligned}\]

so in particular, both the marginal and the conditional distributions are also Gaussian.
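
The covariance-form and precision-form expressions can be checked against each other numerically. This is a sketch only: `condition_gaussian` is a hypothetical helper and the 3-D Gaussian is made up.

```python
import numpy as np

def condition_gaussian(mu1, mu2, S11, S12, S22, y2):
    """Mean and covariance of p(y1 | y2) for jointly Gaussian (y1, y2)."""
    gain = S12 @ np.linalg.inv(S22)      # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + gain @ (y2 - mu2)
    Sigma_cond = S11 - gain @ S12.T      # uses Sigma_21 = Sigma_12^T
    return mu_cond, Sigma_cond

# Hypothetical 3-D Gaussian; y1 is the first coordinate, y2 the other two.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
y2 = np.array([1.5, 0.0])

mu_cond, Sigma_cond = condition_gaussian(
    mu[:1], mu[1:], Sigma[:1, :1], Sigma[:1, 1:], Sigma[1:, 1:], y2)

# The same quantities from the precision-matrix form.
Lam = np.linalg.inv(Sigma)
Sigma_cond_prec = np.linalg.inv(Lam[:1, :1])
mu_cond_prec = mu[:1] - Sigma_cond_prec @ Lam[:1, 1:] @ (y2 - mu[1:])
```

The two routes should agree up to floating-point error. Note that the conditional covariance does not depend on the observed value `y2`.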

Linear Gaussian systems

Suppose:

  • $\pmb z \in \mathbb R^L$ is an unknown vector of values
  • $\pmb y \in \mathbb R^D$ is a noisy measurement of $\pmb z$
  • $p(\pmb z) = \mathcal N(\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z)$
  • $p(\pmb y \mid \pmb z) = \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y)$
  • $\mathbf W$ is a matrix of size $D \times L$

What is the name of such a setup?


A linear Gaussian system.

Suppose:

  • $\pmb z \in \mathbb R^L$ is an unknown vector of values
  • $\pmb y \in \mathbb R^D$ is a noisy measurement of $\pmb z$
  • $p(\pmb z) = \mathcal N(\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z)$
  • $p(\pmb y \mid \pmb z) = \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y)$
  • $\mathbf W$ is a matrix of size $D \times L$

@State the parameters of the corresponding joint distribution $p(\pmb z, \pmb y) = p(\pmb z) p(\pmb y \mid \pmb z)$.


This is an $L + D$-dimensional Gaussian, with mean and covariance given by

\[\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu} _ z \\ \mathbf{W} \boldsymbol{\mu} _ z + \mathbf{b} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ z \mathbf{W}^\top \\ \mathbf{W} \boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ y + \mathbf{W} \boldsymbol{\Sigma} _ z \mathbf{W}^\top \end{pmatrix}\]

Suppose:

  • $\pmb z \in \mathbb R^L$ is an unknown vector of values
  • $\pmb y \in \mathbb R^D$ is a noisy measurement of $\pmb z$
  • $p(\pmb z) = \mathcal N(\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z)$
  • $p(\pmb y \mid \pmb z) = \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y)$
  • $\mathbf W$ is a matrix of size $D \times L$

The corresponding joint distribution $p(\pmb z, \pmb y) = p(\pmb z) p(\pmb y \mid \pmb z)$ is an $L + D$-dimensional Gaussian, with mean and covariance given by

\[\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu} _ z \\ \mathbf{W} \boldsymbol{\mu} _ z + \mathbf{b} \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ z \mathbf{W}^\top \\ \mathbf{W} \boldsymbol{\Sigma} _ z & \boldsymbol{\Sigma} _ y + \mathbf{W} \boldsymbol{\Sigma} _ z \mathbf{W}^\top \end{pmatrix}\]

@State “Bayes rule for Gaussians”, i.e. the posterior $p(\pmb z \mid \pmb y)$ obtained by applying the formula for Gaussian conditioning to this joint, and interpret the result in terms of “conjugate priors”.


\[\begin{aligned} p(\pmb z \mid \pmb y) &= \mathcal N (\pmb z \mid \pmb \mu _ {z \mid y}, \mathbf \Sigma _ {z \mid y}) \\ \mathbf \Sigma _ {z \mid y}^{-1} &= \mathbf \Sigma _ z^{-1} + \mathbf W^\top \mathbf \Sigma _ y^{-1} \mathbf W \\ \pmb \mu _ {z \mid y} &= \mathbf \Sigma _ {z \mid y} [\mathbf W^\top \mathbf \Sigma _ y^{-1} (\pmb y - \pmb b) + \mathbf \Sigma _ z^{-1} \pmb \mu _ z] \end{aligned}\]

and

\[p(\pmb y) = \int \mathcal N (\pmb z \mid \pmb \mu _ z, \pmb \Sigma _ z) \mathcal N(\pmb y \mid \pmb W \pmb z + \pmb b, \pmb \Sigma _ y) \text d\pmb z = \mathcal N(\pmb y \mid \mathbf W \pmb \mu _ z + \pmb b, \pmb \Sigma _ y + \pmb W \pmb \Sigma _ z \pmb W^\top)\]

This shows that a Gaussian prior $p(\pmb z)$ combined with the Gaussian likelihood $p(\pmb y \mid \pmb z)$ gives a Gaussian posterior $p(\pmb z \mid \pmb y)$, so that Gaussians are closed under Bayesian conditioning.

In other words, the Gaussian prior is a conjugate prior for the Gaussian likelihood.
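
A minimal sketch of the posterior update, written directly from the formulas for $\mathbf \Sigma _ {z \mid y}$ and $\pmb \mu _ {z \mid y}$. The function name and the one-dimensional example are illustrative assumptions, and explicit inverses are used for readability rather than numerical robustness.

```python
import numpy as np

def gaussian_posterior(mu_z, Sigma_z, W, b, Sigma_y, y):
    """Posterior p(z | y) for prior N(mu_z, Sigma_z), likelihood N(W z + b, Sigma_y)."""
    Sz_inv = np.linalg.inv(Sigma_z)
    Sy_inv = np.linalg.inv(Sigma_y)
    Sigma_post = np.linalg.inv(Sz_inv + W.T @ Sy_inv @ W)
    mu_post = Sigma_post @ (W.T @ Sy_inv @ (y - b) + Sz_inv @ mu_z)
    return mu_post, Sigma_post

# Scalar example: prior z ~ N(0, 1), measurement y = z + noise with variance 1.
mu_post, Sigma_post = gaussian_posterior(
    mu_z=np.array([0.0]),
    Sigma_z=np.array([[1.0]]),
    W=np.array([[1.0]]),
    b=np.array([0.0]),
    Sigma_y=np.array([[1.0]]),
    y=np.array([2.0]),
)
```

With equally confident prior and measurement, the posterior mean lands halfway between the prior mean $0$ and the observation $2$, and the posterior variance halves to $0.5$.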

Exponential family

Suppose:

  • We have a family of probability distributions parameterised by $\pmb \eta \in \mathbb R^K$
  • These distributions have fixed support over $\mathcal Y^D \subseteq \mathbb R^D$.

@Define what it means for the distribution $p(\pmb y \mid \pmb \eta)$ to be in the exponential family.


The density can be written in the form

\[\begin{aligned} p(\pmb y \mid \pmb \eta) &= \frac{1}{Z(\pmb \eta)} h(\pmb y) \exp[\pmb \eta^\top \mathcal T(\pmb y)] \\ &= h(\pmb y) \text{exp}[\pmb \eta^\top \mathcal T(\pmb y) - A(\pmb \eta)] \end{aligned}\]

where:

  • $h(\pmb y)$ is a scaling constant (also known as the base measure)
  • $\mathcal T(\pmb y) \in \mathbb R^K$ are “sufficient statistics”
  • $\pmb \eta$ are the natural parameters
  • $Z(\pmb \eta)$ is a normalisation constant known as the partition function
  • $A(\pmb \eta) = \log Z(\pmb \eta)$ is the log partition function
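
A standard example, sketched here for illustration: the Bernoulli distribution with mean $\mu$ over $y \in \{0, 1\}$ can be rewritten as

\[\begin{aligned} \text{Ber}(y \mid \mu) &= \mu^y (1 - \mu)^{1 - y} \\ &= \exp\left[ y \log\left(\frac{\mu}{1 - \mu}\right) + \log(1 - \mu) \right] \end{aligned}\]

which matches the second form with $h(y) = 1$, $\mathcal T(y) = y$, natural parameter $\eta = \log\left(\frac{\mu}{1 - \mu}\right) = \text{logit}(\mu)$ and log partition function $A(\eta) = -\log(1 - \mu) = \log(1 + e^\eta)$.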
Mixture models

@Define a mixture model $p(\pmb y \mid \pmb \theta)$.


A convex combination of distributions, i.e.

\[p(\pmb y \mid \pmb \theta) = \sum^K _ {k = 1} \pi _ k p _ k(\pmb y)\]

where:

  • $p _ k$ is the $k$th mixture component
  • $\pi _ k$ are the mixture weights which satisfy $0 \le \pi _ k \le 1$
  • $\sum^K _ {k = 1} \pi _ k = 1$

A mixture model $p(\pmb y \mid \pmb \theta)$ is defined via

\[p(\pmb y \mid \pmb \theta) = \sum^K _ {k = 1} \pi _ k p _ k(\pmb y)\]

@State how you can re-express this model as a hierarchical model.


Introduce the discrete latent variable $z \in \{1, \ldots, K\}$, which specifies which distribution to use for generating the output $\pmb y$, with the prior $p(z = k \mid \pmb \theta) = \pi _ k$ and the conditional $p(\pmb y \mid z = k, \pmb \theta) = p _ k(\pmb y) = p(\pmb y \mid \pmb \theta _ k)$. In other words, we have

\[\begin{aligned} p(z \mid \pmb \theta) &= \text{Cat}(z \mid \pmb \pi) \\ p(\pmb y \mid z = k, \pmb \theta) &= p(\pmb y \mid \pmb \theta _ k) \end{aligned}\]

where $\pmb \theta = (\pi _ 1, \ldots, \pi _ K, \pmb \theta _ 1, \ldots, \pmb \theta _ K)$ are the model parameters.

Gaussian mixture models

@Define what it means for a random variable $\pmb y$ to be a mixture of Gaussians.


\[p(\pmb y \mid \pmb \theta) = \sum^K _ {k=1} \pi _ k \mathcal N(\pmb y \mid \pmb \mu _ k, \mathbf \Sigma _ k)\]

where the $\pi _ k$ sum to 1.
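
The hierarchical view gives a simple recipe for sampling: draw the component label $z$ first, then draw $\pmb y$ from that component. Below is a rough one-dimensional sketch; the weights, means and standard deviations are made-up values and the function names are hypothetical.

```python
import numpy as np

def gmm_pdf(y, weights, means, stds):
    """Density of a 1-D Gaussian mixture at a scalar point y."""
    comps = np.exp(-0.5 * ((y - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(weights @ comps)          # sum_k pi_k N(y | mu_k, sigma_k^2)

def sample_gmm(rng, n, weights, means, stds):
    """Ancestral sampling: z ~ Cat(pi), then y | z = k ~ N(mu_k, sigma_k^2)."""
    z = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[z], stds[z])

# Hypothetical two-component mixture.
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

rng = np.random.default_rng(0)
samples = sample_gmm(rng, 20_000, weights, means, stds)
density_at_3 = gmm_pdf(3.0, weights, means, stds)
```

The sample mean should be close to the mixture mean $\sum _ k \pi _ k \mu _ k = 1.5$, up to Monte Carlo error.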

Probabilistic graphical models

See Graphical model on Wikipedia.

What is the name for probabilistic graphical models which are DAGs?


Bayesian networks.

Consider the following Bayesian network:

@State the general rule used for factorising the joint probability distribution given these diagrams, and factorise the joint probability distribution $\mathbb P(A, B, C, D)$ in this specific case.


The general rule is that

\[\mathbb P(X _ 1, \ldots, X _ n) = \prod^n _ {i = 1} \mathbb P(X _ i \mid \text{pa}(X _ i))\]
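
The rule is mechanical enough to write as a few lines of code. In this sketch the graph `dag` is a hypothetical example (not the network from the card above), given as a mapping from each node to its list of parents.

```python
def factorise(parents):
    """Factorisation implied by a DAG given as {node: [parent nodes]}."""
    factors = []
    for node, pa in parents.items():
        if pa:
            factors.append(f"P({node} | {', '.join(pa)})")
        else:
            factors.append(f"P({node})")   # root nodes have no conditioning set
    return " ".join(factors)

# Hypothetical DAG with edges A -> B, A -> C, B -> D, C -> D.
dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
factorisation = factorise(dag)
```

For this graph the result reads `P(A) P(B | A) P(C | A) P(D | B, C)`: one factor per node, conditioned on that node's parents.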

@example~



