Computer Vision MT25, Unsupervised computer vision
Flashcards
@State the general setup of an unsupervised learning task.
- Dataset $\mathcal D = \{x _ i \mid 1 \le i \le N\}$, inputs only
- Dataset is split into training and validation sets, $\mathcal D = \mathcal D _ T \cup \mathcal D _ V$
- We aim to learn some $f$ which will be useful for a downstream task (e.g. assigning clusters)
- We have some downstream task $T = \{(\chi _ i, \nu _ i) \mid 1 \le i \le M\}$ (e.g. classification)
In unsupervised learning, we have:
- Dataset $\mathcal D = \{x _ i \mid 1 \le i \le N\}$, inputs only
- Dataset is split into training and validation sets, $\mathcal D = \mathcal D _ T \cup \mathcal D _ V$
- We aim to learn some $f$ which will be useful for a downstream task (e.g. assigning clusters)
- We have some downstream task $T = \{(\chi _ i, \nu _ i) \mid 1 \le i \le M\}$ (e.g. classification)
In order to find a useful $f$, we aim to find some learning signal that means $f$ incorporates some useful priors.
@State 7 techniques with a brief explanation that can be used to find a useful learning signal.
- Recovery: Let $M$ be some transformation, given $f(M(x))$, recover $x$.
- Bottleneck: Let $g$ be a function from some restricted class. Given $f(g(x))$, recover $x$.
- Dataset: Exploit structure across the dataset itself, e.g. contrasting patches of an image with unrelated patches drawn from other images to learn image representations.
- Invariance: Make sure $f$ is (approximately) invariant under some class of transformations to the input, so that $f(\pi(x)) \approx f(x)$.
- Equivariance: Make sure $f$ is (approximately) equivariant under some class of transformations to the input, so that $f(\pi(x)) \approx \pi'(f(x))$ for a corresponding transformation $\pi'$ on the output.
- Transformation estimation: Given a transformed input $x$, estimate the transformation: $f(\pi(x, \theta)) \approx \theta$
- Generative: Learn to generate samples from the dataset.
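As a toy illustration of the invariance objective, here is a minimal NumPy sketch. The column-mean "encoder" and horizontal-flip transformation are placeholder assumptions; a real $f$ would be a trained network that only becomes approximately invariant through training:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy "encoder": per-row means, which happen to be exactly invariant to
    # horizontal flips (a contrived choice, for illustration only)
    return x.mean(axis=1)

x = rng.random((8, 8))   # toy "image"
x_flipped = x[:, ::-1]   # horizontal flip: the transformation pi

# invariance loss: || f(pi(x)) - f(x) ||^2, which training would minimise
loss = np.sum((f(x_flipped) - f(x)) ** 2)
print(loss)  # effectively zero here, up to float rounding
```

In practice the loss is driven towards (not exactly to) zero over a family of augmentations, and a collapse-prevention mechanism is needed so $f$ does not become trivially constant.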
@State some ways an image could be corrupted in a way that an unsupervised learning algorithm could learn to recover the original image:
- Noise (i.e. $f$ learns denoising)
- Occlusion (i.e. $f$ learns inpainting)
- Grayscale
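A minimal sketch of the recovery objective for denoising. The identity "denoiser" is a placeholder; training would fit $f$ to minimise this loss:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((8, 8))                              # clean image
x_noisy = x + 0.1 * rng.standard_normal(x.shape)    # corruption M: additive noise

def f(img):
    # placeholder "denoiser" (identity); a real f would be a trained network
    return img

# recovery loss: how well f(M(x)) reconstructs the original x
loss = np.mean((f(x_noisy) - x) ** 2)
print(round(loss, 4))
```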

@State some requirements you could impose on the intermediate representation in the bottleneck setup.

- Low dimensional
- Sparse
- Activations drawn from a (learned) dictionary
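A minimal sketch of a sparse bottleneck objective, assuming a toy linear autoencoder with an L1 penalty on the code (the sizes and the 0.1 penalty weight are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: 16-dim inputs, 4-dim sparse code
W_enc = rng.standard_normal((4, 16)) * 0.1
W_dec = rng.standard_normal((16, 4)) * 0.1

def encode(x):
    # ReLU keeps codes non-negative, which encourages sparsity
    return np.maximum(W_enc @ x, 0.0)

x = rng.standard_normal(16)
z = encode(x)            # the bottleneck representation
x_hat = W_dec @ z        # reconstruction from the bottleneck

recon = np.sum((x_hat - x) ** 2)   # recovery term
sparsity = np.abs(z).sum()         # L1 penalty enforcing a sparse code
loss = recon + 0.1 * sparsity
print(round(loss, 4))
```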
What is the context prediction task?
Given two patches from an image, determine the spatial relationship between the two patches.
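The pair-sampling step of context prediction can be sketched as follows (the patch size, image size, and the eight neighbourhood offsets are illustrative assumptions; the task then trains a classifier to predict the label from the two patches):

```python
import numpy as np

rng = np.random.default_rng(0)

# the 8 possible spatial relations of a neighbouring patch in a 3x3 grid
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_pair(image, patch=8):
    # pick a centre patch away from the border, then one of its 8 neighbours;
    # the label is the index of the chosen offset
    r = rng.integers(1, image.shape[0] // patch - 1)
    c = rng.integers(1, image.shape[1] // patch - 1)
    label = rng.integers(len(OFFSETS))
    dr, dc = OFFSETS[label]
    p1 = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
    p2 = image[(r + dr) * patch:(r + dr + 1) * patch,
               (c + dc) * patch:(c + dc + 1) * patch]
    return p1, p2, label

img = rng.random((32, 32))
p1, p2, y = sample_pair(img)
print(p1.shape, p2.shape, y)   # two (8, 8) patches and a label in 0..7
```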

@Describe and @visualise the DINO architecture for unsupervised learning of image representations.
- Student network and teacher network, both with the same architecture
- These networks are randomly initialised
- From each input $x$, two views are produced
- $x _ 1$ for the student network, a random crop or transformation of the full input
- $x _ 2$, the full input
- The output probability distributions are compared via cross-entropy loss, and gradient descent optimises the parameters of the student
- The teacher’s parameters are an exponentially weighted moving average of the student’s parameters

A rough sketch of the DINO architecture is as follows:
- Student network and teacher network, both with the same architecture
- These networks are randomly initialised
- From each input $x$, two views are produced
- $x _ 1$ for the student network, a random crop or transformation of the full input
- $x _ 2$, the full input
- The output probability distributions (possibly after the teacher’s logits are centred) are compared via cross-entropy loss, and gradient descent optimises the parameters of the student
- The teacher’s parameters are an exponentially weighted moving average of the student’s parameters
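The sketch above can be written as a toy NumPy training step. The linear maps stand in for the shared ViT architecture, and the temperatures, EMA momentum, and centring rate are illustrative values in the range used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, tau):
    z = (z - z.max()) / tau          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# toy linear "networks"; in DINO both share the same (ViT) architecture
theta_s = rng.standard_normal((4, 8)) * 0.1   # student parameters
theta_t = theta_s.copy()                      # teacher starts as a copy
centre = np.zeros(4)                          # running centre of teacher logits

x2 = x = rng.standard_normal(8)               # full view -> teacher
x1 = x2 + 0.1 * rng.standard_normal(8)        # cropped/augmented view -> student (toy)

p_t = softmax(theta_t @ x2 - centre, tau=0.04)  # teacher: centred, sharpened
p_s = softmax(theta_s @ x1, tau=0.1)            # student

# cross-entropy loss; in training, gradients flow only into theta_s
loss = -np.sum(p_t * np.log(p_s + 1e-12))

# after the student update, the teacher tracks the student via EMA,
# and the centre tracks the teacher's outputs
m = 0.996
theta_t = m * theta_t + (1 - m) * theta_s
centre = 0.9 * centre + 0.1 * (theta_t @ x2)
print(round(loss, 4))
```

The centring step is one of the mechanisms DINO uses to avoid collapse, since without it the student and teacher could agree on a trivial constant distribution.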

What’s the intuition behind why the student receives only a random transformation or crop of the original input?
- $x _ 1$ for the student network, a random crop or transformation of the full input
- $x _ 2$, the full input

This encourages local-to-global features to be learned.
Why is DINO useful?
It provides a rich set of features for images, which can be used for downstream tasks.
What is weak supervision?
Supervision, but where the output labels come from a different task than the one we ultimately care about.