Computer Vision MT25, Unsupervised computer vision


Flashcards

@State the general setup of an unsupervised learning task.


  • Dataset $\mathcal D = \{ x _ i \mid 1 \le i \le N \}$, inputs only
  • Dataset is split into training / validation: $\mathcal D = \mathcal D _ T \cup \mathcal D _ V$
  • We aim to learn some $f$ which will be useful for a downstream task (e.g. assigning clusters)
  • We have some downstream task $T = \{ (\chi _ i, \nu _ i) \mid 1 \le i \le M \}$ (e.g. classification)

In order to find a useful $f$, we aim to find some learning signal that means $f$ incorporates some useful priors.

@State 7 techniques with a brief explanation that can be used to find a useful learning signal.


  1. Recovery: Let $M$ be some corrupting transformation; train $f$ so that $f(M(x)) \approx x$, i.e. $f$ recovers $x$ from its corrupted version.
  2. Bottleneck: Let $g$ be a function from some restricted class; train $f$ so that $f(g(x)) \approx x$, i.e. $x$ can be recovered from the restricted representation.
  3. Dataset: contrast parts of one input against unrelated parts drawn from the rest of the dataset, e.g. patches from the same image vs. patches from other images, to learn unsupervised image representations.
  4. Invariance: Make sure $f$ is (approximately) invariant under some class of transformations to the input, so that $f(\pi(x)) \approx f(x)$.
  5. Equivariance: Make sure $f$ is (approximately) equivariant under some class of transformations to the input, so that $f(\pi(x)) \approx \pi'(f(x))$.
  6. Transformation estimation: Given a transformed input $x$, estimate the transformation: $f(\pi(x, \theta)) \approx \theta$
  7. Generative: Learn to generate samples from the dataset.
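As a concrete illustration of technique 6 (transformation estimation), here is a minimal numpy sketch in which the hidden parameter $\theta$ is a rotation index; the function name is hypothetical and the network $f$ that predicts $\theta$ is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rotation_task(image):
    """Transformation estimation: apply pi(x, theta) for a random theta;
    theta itself is the training target for f."""
    theta = int(rng.integers(0, 4))    # hidden parameter: number of 90-degree turns
    rotated = np.rot90(image, theta)   # pi(x, theta)
    return rotated, theta              # train f so that f(rotated) is close to theta

image = rng.random((8, 8))
x_rot, theta = make_rotation_task(image)
```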

@State some ways an image could be corrupted in a way that an unsupervised learning algorithm could learn to recover the original image:

  • Noise (i.e. $f$ learns denoising)
  • Occlusion (i.e. $f$ learns inpainting)
  • Grayscale
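These corruptions can be sketched in numpy as follows (function names and sizes are illustrative assumptions; a real pipeline would use an augmentation library):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img, sigma=0.1):
    """Corrupt with Gaussian noise; f would learn denoising."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def occlude(img, size=4):
    """Zero out a random square patch; f would learn inpainting."""
    out = img.copy()
    y = rng.integers(0, img.shape[0] - size)
    x = rng.integers(0, img.shape[1] - size)
    out[y:y + size, x:x + size] = 0.0
    return out

def to_grayscale(rgb):
    """Simple channel average; f would learn colourisation."""
    return rgb.mean(axis=-1, keepdims=True)
```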

@State some requirements you could put in the middle of this bottleneck setup.


  • Low dimensional
  • Sparse
  • Activations from a dictionary
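A minimal numpy sketch of the bottleneck setup, assuming hypothetical weight matrices and a 2-dimensional code; the training loop that minimises the reconstruction error plus the sparsity penalty is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 16-dimensional inputs squeezed through a 2-dimensional code.
W_enc = 0.1 * rng.normal(size=(16, 2))   # g: the low-dimensional bottleneck
W_dec = 0.1 * rng.normal(size=(2, 16))   # f: the decoder that must recover x

def reconstruct(x):
    code = x @ W_enc       # restricted representation g(x)
    return code @ W_dec    # f(g(x)), trained so that this approximates x

def sparsity_penalty(code, lam=1e-3):
    # Alternative bottleneck: an L1 penalty encouraging sparse codes.
    return lam * np.abs(code).sum()

x = rng.normal(size=16)
loss = ((reconstruct(x) - x) ** 2).sum() + sparsity_penalty(x @ W_enc)
```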

What is the context prediction task?


Given two patches from an image, determine the spatial relationship between the two patches.
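A sketch of how such training pairs could be sampled, where the relative position of the second patch (one of 8 neighbours) is the classification target; the patch size and helper name are assumptions, not the original paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# The 8 possible neighbour offsets (in patch units), used as class labels.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def context_pair(image, patch=4):
    """Sample a centre patch and one of its 8 neighbours; the offset
    index is the classification target for the context prediction task."""
    H, W = image.shape
    cy = int(rng.integers(1, H // patch - 1))   # avoid the border patches
    cx = int(rng.integers(1, W // patch - 1))
    label = int(rng.integers(0, 8))
    dy, dx = OFFSETS[label]
    centre = image[cy * patch:(cy + 1) * patch, cx * patch:(cx + 1) * patch]
    neigh = image[(cy + dy) * patch:(cy + dy + 1) * patch,
                  (cx + dx) * patch:(cx + dx + 1) * patch]
    return centre, neigh, label
```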

@Describe and @visualise the DINO architecture for unsupervised learning of image representations.


  • Student network and teacher network, both with the same architecture
  • These networks are randomly initialised
  • Each input $x$ is split into two components
    • $x _ 1$ for the student network, a random crop or transformation of the full input
    • $x _ 2$, the full input
  • The output probability distributions (after the teacher's logits are centred) are compared via a cross-entropy loss, and gradient descent optimises the parameters of the student only
  • The teacher’s parameters are an exponentially weighted moving average of the student


What’s the intuition behind why the student receives only a random transformation or crop of the original input?


This encourages local-to-global features to be learned.

Why is DINO useful?


It provides a rich set of features for images, which can be used for downstream tasks.

What is weak supervision?


Supervision, but where the labels come from a different task, so the learning signal is only indirectly related to the task of interest.



