Computer Vision MT25, Generative models
Flashcards
@State the general setup of a generative model.
We have some dataset $\mathcal D = \{ x _ i \mid 1 \le i \le N \}$ which is assumed to come from some underlying distribution $p _ \text{data}(x)$. We aim to learn a distribution $p _ \text{model}(x)$ we can sample from so that $p _ \text{data}$ and $p _ \text{model}$ are similar.
What is the difference between discriminative, generative and conditional generative models?
- Discriminative: $p(y \mid x)$
- Generative: $p(x)$
- Conditional generative: $p(x \mid y)$.
@Justify how Bayes’ rule lets you build generative $p(x)$ models from other components (a discriminative model, a prior and a conditional generative model).
Consider the following form of Bayes' rule:
\[p(x \mid y) = \frac{p(y \mid x) p(x)}{p(y)}\]Here:
- $p(x \mid y)$ is a conditional generative model
- $p(y \mid x)$ is a discriminative model
- $p(y)$ is a prior over labels
- $p(x)$ is a generative model
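As a toy numeric sanity check of how these pieces fit together (all numbers invented for illustration), a generative $p(x)$ can be recovered from a conditional generative model and a prior by marginalising over labels, and Bayes' rule then gives back the discriminative model:

```python
import numpy as np

# Toy discrete example: 2 labels, 3 possible values of x
p_y = np.array([0.4, 0.6])                    # prior p(y)
p_x_given_y = np.array([[0.7, 0.2, 0.1],      # conditional generative p(x | y=0)
                        [0.1, 0.3, 0.6]])     # conditional generative p(x | y=1)

# Generative model by marginalisation: p(x) = sum_y p(x | y) p(y)
p_x = p_y @ p_x_given_y

# Discriminative model via Bayes' rule: p(y | x) = p(x | y) p(y) / p(x)
p_y_given_x = (p_x_given_y * p_y[:, None]) / p_x[None, :]
```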
In a generative model, we have some dataset $\mathcal D = \{ x _ i \mid 1 \le i \le N \}$ which is assumed to come from some underlying distribution $p _ \text{data}(x)$. We aim to learn a distribution $p _ \text{model}(x)$ we can sample from so that $p _ \text{data}$ and $p _ \text{model}$ are similar.
In this context, what is the difference between an explicit and an implicit generative model?
- Explicit: We obtain some representation of $p _ \text{model}(x)$ and can determine the density of samples
- Implicit: We may sample from $p _ \text{model}(x)$ but cannot determine the density of samples exactly
What’s the main idea behind autoregressive distribution estimation for a generative image model?
- Convert the image into a sequence of pixels
- Predict the sequence of pixels by factorising the joint distribution into a series of one-dimensional distributions, each depending only on the previous pixels


How would autoregressive distribution estimation for image generation factorise the joint distribution $p(x)$?
By the chain rule of probability:
\[p(x) = \prod^n _ {i = 1} p(x _ i \mid x _ 1, \ldots, x _ {i-1})\]
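The per-pixel factorisation can be sketched as a generic sampling loop; `predict_pixel` here is a hypothetical stand-in for whatever network models the one-dimensional conditionals:

```python
import numpy as np

def sample_autoregressive(predict_pixel, n_pixels, n_values=256, seed=0):
    """Sample an image pixel-by-pixel using the chain-rule factorisation
    p(x) = prod_i p(x_i | x_1, ..., x_{i-1})."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_pixels, dtype=np.int64)
    for i in range(n_pixels):
        probs = predict_pixel(x[:i])          # p(x_i | x_<i), shape (n_values,)
        x[i] = rng.choice(n_values, p=probs)  # sample the i-th pixel
    return x

# Toy "model": uniform over pixel values, ignoring the context entirely
uniform = lambda prefix: np.full(256, 1 / 256)
img = sample_autoregressive(uniform, n_pixels=16)
```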
@Visualise the architecture for autoregressive distribution estimation.

One approach to autoregressive distribution estimation for image generation might use an architecture like this:

Why do you have to take care about which neurons are connected to one another?

In order to preserve the autoregressive property, predictions are not allowed to depend on the future pixels, otherwise a model could cheat.
@Visualise the masks used in an autoregressive image generation CNN, and explain why this has to be done.

You need to preserve the autoregressive property that the prediction for a pixel is not allowed to depend on the values of the next pixels. Mask A is used for the first layer, and Mask B is used for all subsequent layers.
The difference between having a $0$ or $1$ in the centre of the mask is that, after the first layer, the centre of the computed features no longer conveys information about the current pixel itself (only about previous pixels), so subsequent layers may safely connect to it.
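A minimal sketch of how such masks could be constructed (a simplified single-channel version; real PixelCNN masks also handle colour channels):

```python
import numpy as np

def pixelcnn_mask(k, mask_type):
    """Build a k x k PixelCNN-style mask for a raster-scan pixel ordering.
    Mask 'A' zeroes the centre (first layer: the prediction may not see the
    current pixel); mask 'B' keeps the centre (later layers: the centre
    feature no longer carries information about the current pixel)."""
    mask = np.ones((k, k))
    c = k // 2
    # Zero everything to the right of the centre (and the centre for 'A')
    mask[c, c + (1 if mask_type == 'B' else 0):] = 0
    mask[c + 1:, :] = 0  # zero all rows below the centre
    return mask

# For k = 3: mask A hides the centre and all "future" pixels; mask B keeps the centre
mask_a = pixelcnn_mask(3, 'A')
mask_b = pixelcnn_mask(3, 'B')
```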
How did DALL-E improve on the standard autoregressive image generation model?
It instead generated the images auto-regressively in “token-space”, using tokens learned by a VQ-VAE.
@Visualise the architecture of a VQ-VAE. How do they work in high-level terms?

- Train a CNN encoder to generate continuous representations of images
- At training time, replace activations with the closest vector from a learned codebook
- Backpropagate by treating the quantisation step as the identity (the straight-through estimator)
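The nearest-neighbour codebook lookup can be sketched as follows (forward pass only, with an invented toy codebook; in training, gradients are copied straight through this lookup):

```python
import numpy as np

def quantise(z_e, codebook):
    """Nearest-neighbour codebook lookup (the VQ-VAE bottleneck).
    z_e: (n, d) continuous encoder outputs; codebook: (K, d) learned vectors."""
    # Squared distances between each encoding and each codebook vector
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)       # index of the closest codebook vector
    return codebook[idx], idx    # quantised activations and token indices

# Toy codebook of K = 2 vectors in d = 2 dimensions
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_q, idx = quantise(np.array([[0.1, -0.2], [0.9, 1.2]]), codebook)
```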
What is the main idea of a flow-based generative model?
- Start with a sample of a simple distribution
- Learn a neural field to represent the flow from $x _ T$ to $x _ 0$
- So $x _ {t-1} = f(x _ t) + x _ t$
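The residual update above can be sketched as a simple sampling loop; the flow field `f` here is an invented toy example, not a trained network:

```python
import numpy as np

def flow_sample(f, x_T, T):
    """Integrate the learned residual flow x_{t-1} = f(x_t) + x_t,
    starting from a sample x_T of a simple base distribution."""
    x = x_T
    for _ in range(T):
        x = x + f(x)  # one Euler-style step of the learned flow
    return x

# Toy flow field pushing samples toward the point (2, 2)
f = lambda x: 0.1 * (np.array([2.0, 2.0]) - x)
x_final = flow_sample(f, np.zeros(2), T=50)
# After many steps, samples concentrate near the target point
```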

At a high level, how do diffusion models for image generation work?
Generate an image in small steps from some noise $\epsilon$ by learning a denoising process.
@Visualise the diffusion vs reverse diffusion process in a diffusion model for image generation.


What is the aim of a diffusion model in this context?

Learn how to generate $p _ \theta (x _ {t-1} \mid x _ t)$.

What’s a typical form for $q$ in this context?

\[q(x _ t \mid x _ {t-1}) = \mathcal N(x _ t \mid \sqrt{1 - \beta _ t} x _ {t-1}, \beta _ t I)\]where $\beta _ t$ is a “variance schedule”, which is often fixed.

Typically, $q$ might look something like
\[q(x _ t \mid x _ {t-1}) = \mathcal N(x _ t \mid \sqrt{1 - \beta _ t} x _ {t-1}, \beta _ t I)\]
where $\beta _ t$ is a “variance schedule”, which is often fixed. What is the closed form solution for $x _ t$ in terms of $x _ 0$?
\[x _ t = \sqrt{\bar \alpha _ t} x _ 0 + \sqrt{1 - \bar \alpha _ t} \epsilon\]where $\alpha _ t = 1 - \beta _ t$, $\bar \alpha _ t = \prod^t _ {i = 1} \alpha _ i$ and $\epsilon \sim \mathcal N(0, I)$.

Typically, $q$ might look something like
\[q(x _ t \mid x _ {t-1}) = \mathcal N(x _ t \mid \sqrt{1 - \beta _ t} x _ {t-1}, \beta _ t I)\]
where $\beta _ t$ is a “variance schedule”, which is often fixed. @State an expression for $x _ t$ in terms of $x _ 0$ and some noise parameter $\epsilon$.

We rewrite to give the update
\[x _ t = \sqrt{\alpha _ t} x _ {t-1} + \sqrt{1 - \alpha _ t} \epsilon _ {t-1}\]where $\alpha _ t = 1 - \beta _ t$ and $\bar \alpha _ t = \prod^t _ {i = 1} \alpha _ i$. Applying this recursively yields
\[x _ t = \sqrt{\bar \alpha _ t} x _ 0 + \sqrt{1 - \bar \alpha _ t} \epsilon\]What is the idea of latent diffusion?
Do diffusion in latent space rather than pixel space.
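The closed-form forward process $x _ t = \sqrt{\bar \alpha _ t} x _ 0 + \sqrt{1 - \bar \alpha _ t} \epsilon$ can be sketched numerically; the linear variance schedule below is an illustrative assumption:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t directly from x_0 using the closed form
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t - 1]  # product of alpha_1 .. alpha_t
    eps = rng.standard_normal(x0.shape)    # eps ~ N(0, I)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Toy linear variance schedule over T = 100 steps
betas = np.linspace(1e-4, 0.02, 100)
rng = np.random.default_rng(0)
x0 = np.ones(4)
xT = forward_diffuse(x0, t=100, betas=betas, rng=rng)
# As t grows, alpha_bar shrinks and x_t approaches pure noise
```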