Computer Vision MT25, Unsupervised computer vision


Flashcards

@State the general setup of an unsupervised learning task.

  • Dataset $\mathcal D = \{x _ i \mid 1 \le i \le N\}$, inputs only
  • Dataset is split into training / validation $\mathcal D = \mathcal D _ T \sqcup \mathcal D _ V$
  • We aim to learn some $f$ which will be useful for a downstream task (e.g. assigning clusters)
  • We have some downstream task $T = \{(\chi _ i, \nu _ i) \mid 1 \le i \le M \}$ (e.g. classification)

To find a useful $f$, we look for a learning signal that pushes $f$ to incorporate useful priors.

@State 7 techniques with a brief explanation that can be used to find a useful learning signal.

  1. Recovery: Let $M$ be some transformation. Train $f$ to undo $M$, i.e. so that $f(M(x)) \approx x$.
  2. Bottleneck: Let $g$ be a function from some restricted class (e.g. a low-dimensional encoder). Train $f$ to recover $x$ from $g(x)$, i.e. so that $f(g(x)) \approx x$.
  3. Dataset: E.g. using patches of an image and unrelated patches from the dataset to learn unsupervised image representations.
  4. Invariance: Make sure $f$ is (approximately) invariant under some class of transformations to the input, so that $f(\pi(x)) \approx f(x)$.
  5. Equivariance: Make sure $f$ is (approximately) equivariant under some class of transformations to the input, so that $f(\pi(x)) \approx \pi'(f(x))$.
  6. Transformation estimation: Given a transformed input $\pi(x, \theta)$, estimate the transformation parameters: $f(\pi(x, \theta)) \approx \theta$.
  7. Generative: Learn to generate samples from the dataset.

@State some ways an image could be corrupted in a way that an unsupervised learning algorithm could learn to recover the original image:

  • Noise (i.e. $f$ learns denoising)
  • Occlusion (i.e. $f$ learns inpainting)
  • Grayscale (i.e. $f$ learns colourisation)
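
A minimal sketch of how such corrupted/original training pairs could be generated with NumPy. The function names, noise level and occlusion size are illustrative assumptions, not taken from the lecture; the image is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img, sigma=0.1):
    """Additive Gaussian noise; undoing it is denoising."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def occlude(img, size=8):
    """Zero out a random square; undoing it is inpainting."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = img.copy()
    out[top:top + size, left:left + size, :] = 0.0
    return out

def to_grayscale(img):
    """Average the colour channels; undoing it is colourisation."""
    return img.mean(axis=2, keepdims=True)

x = rng.random((32, 32, 3))   # stand-in for an RGB image with values in [0, 1]
# Each pair is (network input M(x), regression target x).
pairs = [(M(x), x) for M in (add_noise, occlude, to_grayscale)]
```

In each case the network only ever sees $M(x)$ as input and is trained to output $x$, so no human labels are needed.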

@State some requirements you could put in the middle of this bottleneck setup.

  • Low dimensional
  • Sparse
  • Activations from a dictionary
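
As a hedged illustration of the low-dimensional case, PCA can be viewed as a linear bottleneck: a linear encoder $g$ into a few dimensions and a linear decoder $f$ back out. The toy data and the bottleneck width of 2 are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 10-D that lie close to a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
mean = X.mean(axis=0)

# Principal directions from the SVD of the centred data.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:2]                      # (2, 10): top-2 principal directions

def g(x):
    """Encoder: project onto the 2-D bottleneck."""
    return (x - mean) @ W.T

def f(code):
    """Decoder: map the code back to input space."""
    return code @ W + mean

recon_error = np.mean((f(g(X)) - X) ** 2)   # small, because the data is nearly 2-D
```

The reconstruction is only good because the bottleneck matches the structure of the data, which is exactly the prior the bottleneck forces the model to discover.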

What is the context prediction task?

Given two patches from an image, determine the spatial relationship between the two patches.
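
A rough sketch of how training examples for this task might be sampled, assuming the 8-neighbour formulation on a regular grid. The helper name `sample_patch_pair` and the patch size are hypothetical; the original paper additionally uses gaps and jitter between patches to block trivial shortcuts, which this sketch omits.

```python
import numpy as np

# The 8 neighbour offsets (row, col) around a centre patch; the list index is the class label.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def sample_patch_pair(img, patch, rng):
    """Return (centre_patch, neighbour_patch, label) from a single image."""
    h, w = img.shape[:2]
    # Choose a centre location that leaves room for every neighbour.
    r = rng.integers(patch, h - 2 * patch + 1)
    c = rng.integers(patch, w - 2 * patch + 1)
    label = int(rng.integers(len(OFFSETS)))
    dr, dc = OFFSETS[label]
    centre = img[r:r + patch, c:c + patch]
    nr, nc = r + dr * patch, c + dc * patch
    neighbour = img[nr:nr + patch, nc:nc + patch]
    return centre, neighbour, label

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                 # stand-in image
centre, neighbour, label = sample_patch_pair(img, 16, rng)
```

The label comes for free from where the patches were cut, so the task is an 8-way classification problem with no human annotation.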

@Describe and @visualise the DINO architecture for unsupervised learning of image representations.

A rough sketch of the DINO architecture is as follows:

  • Student network and teacher network, both with the same architecture
  • These networks are randomly initialised
  • Each input $x$ is split into two components
    • $x _ 1$ for the student network, a random crop or transformation of the full input
    • $x _ 2$, the full input
  • The output probability distributions (possibly after the teacher’s logits are centred) are compared via cross-entropy loss and gradient descent optimises the parameters of the student
  • The teacher’s parameters are an exponentially weighted moving average of the student
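
The update rule above can be sketched in NumPy with single linear layers standing in for the two networks. Everything concrete here is an assumption made for illustration: the dimensions, learning rate, momentum, the masking used as a crude stand-in for a crop, and initialising the teacher as a copy of the student. Centring and sharpening of the teacher output are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

D, K, B = 16, 8, 32                          # input dim, output dim, batch size (arbitrary)
W_student = 0.1 * rng.normal(size=(D, K))    # random initialisation
W_teacher = W_student.copy()                 # assumption: teacher starts as a copy
momentum, lr = 0.99, 0.1

def dino_step(x1, x2, W_student, W_teacher):
    """One update: x1 is the student's view, x2 the teacher's view."""
    p_s = softmax(x1 @ W_student)
    p_t = softmax(x2 @ W_teacher)            # treated as a constant target
    loss = -np.mean(np.sum(p_t * np.log(p_s), axis=1))
    # Gradient of the cross-entropy with respect to the student's weights only.
    grad = x1.T @ (p_s - p_t) / len(x1)
    W_student = W_student - lr * grad
    # Teacher = exponential moving average of the student.
    W_teacher = momentum * W_teacher + (1.0 - momentum) * W_student
    return loss, W_student, W_teacher

x2 = rng.normal(size=(B, D))                 # "full" inputs
x1 = x2 * (rng.random((B, D)) < 0.5)         # crude stand-in for a crop: mask half the features
loss, W_s_new, W_t_new = dino_step(x1, x2, W_student, W_teacher)
```

Only the student receives a gradient step; the teacher changes solely through the moving average.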

What’s the intuition behind why the student receives only a random transformation or crop of the original input?

This encourages the student to learn local-to-global correspondences: it must predict the teacher’s output for the full image from only a partial view.

The DINO architecture (∆dino-architecture) trains a student network to match the predictions of a teacher whose weights are an exponential moving average of the student’s. Without additional precautions, this setup tends to collapse: both networks converge to outputting the same prediction for every input, which trivially minimises the loss while learning nothing useful.

@Describe the two complementary tricks DINO uses to prevent this collapse.

Centring: We centre the teacher’s pre-softmax logits:

\[f _ T(x _ i) \leftarrow f _ T(x _ i) - \frac 1 N \sum^N _ {j = 1} f _ T(x _ j),\]

where the average is taken over a batch of teacher outputs and, in practice, tracked as a running (exponential moving) average across batches.

This subtracts off the running mean of the teacher outputs. Without it, the EMA dynamics push the teacher towards a constant function, which is one mode of collapse.

Sharpening: We sharpen the teacher’s output distribution by decreasing the softmax temperature:

\[p _ T(x) = \text{softmax}(f _ T(x) / \tau _ T)\]

with $\tau _ T$ small. This makes the teacher’s distribution more peaked. Without it, centring alone tends to push the teacher towards the uniform distribution, which is another mode of collapse.
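
A small numerical sketch of the two operations, assuming an exponential-moving-average centre. The temperature $0.04$, centre momentum $0.9$ and the synthetic logits are illustrative values chosen for this example.

```python
import numpy as np

def softmax(z, tau):
    z = z / tau
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def teacher_probs(logits, centre, tau_t=0.04):
    """Centre the teacher logits, then sharpen with a small temperature."""
    return softmax(logits - centre, tau_t)

def update_centre(centre, logits, m=0.9):
    """Running (exponential moving) mean of the teacher logits."""
    return m * centre + (1.0 - m) * logits.mean(axis=0)

rng = np.random.default_rng(0)
# Synthetic teacher logits in which output dimension 0 dominates for every input.
logits = rng.normal(size=(64, 10)) + np.array([5.0] + [0.0] * 9)
centre = logits.mean(axis=0)          # pretend the running mean has already converged

uncentred = softmax(logits, 1.0).mean(axis=0)             # mass piles up on dimension 0
centred_sharp = teacher_probs(logits, centre).mean(axis=0)
```

After centring, dimension 0 no longer dominates the average prediction, and the small temperature keeps each individual prediction peaked rather than uniform.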

@exam~

Why is DINO useful?

It provides a rich set of features for images, which can be used for downstream tasks.

What is weak supervision?

Supervised learning in which the labels come from a different task than the one we ultimately care about.

@Describe the rotation-prediction task for unsupervised representation learning.

Setup: Take an unlabelled image $x$, apply a random rotation $r \in \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$ to produce $\tilde x = r(x)$, and train a network $f$ to predict which rotation was applied via cross-entropy over $4$ classes.

Why it works: To classify the rotation correctly, $f$ must learn features that depend on global spatial layout (e.g. which way up objects normally appear).
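
The data side of this setup can be sketched as follows, assuming square images stored as `(H, W, C)` arrays; the helper name `make_rotation_batch` is hypothetical.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Rotate each (H, W, C) image by a random multiple of 90 degrees."""
    # Label k in {0, 1, 2, 3} means a rotation by k * 90 degrees.
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k=int(k), axes=(0, 1))
                        for img, k in zip(images, labels)])
    return rotated, labels

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))      # square images keep their shape under rotation
x_tilde, y = make_rotation_batch(images, rng)
```

The pairs `(x_tilde, y)` can then be fed to any standard 4-way classifier trained with cross-entropy.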

@exam~

SimCLR is a contrastive self-supervised framework that learns image representations by maximising agreement between two augmented views of the same image.

@Describe the SimCLR framework.

  • how it constructs training pairs,
  • what loss it uses,
  • and the role of the non-linear projection head

Pipeline (per training image $x$):

  1. Apply two random augmentations $t, t' \sim \mathcal T$ (a fixed policy) to produce two views $\tilde x _ i = t(x), \tilde x _ j = t'(x)$.
  2. Pass each through a shared encoder $f$ to get representations $h _ i = f(\tilde x _ i), h _ j = f(\tilde x _ j)$.
  3. Pass each representation through a non-linear projection head $g$ (a small MLP) to get $z _ i = g(h _ i)$, $z _ j = g(h _ j)$.
  4. Train to maximise the agreement between $z _ i$ and $z _ j$ via a contrastive loss (∆contrastive-loss): the positives are $(z _ i, z _ j)$, the negatives are all other batch elements.

Non-linear projection head: The contrastive loss forces $z$ to be invariant to whatever augmentations are used (e.g. colour jitter would make $z$ colour-invariant). This invariance is good for the loss, but discards information downstream tasks may need (e.g. classifying a red vs green apple). Inserting a non-linear $g$ between $h$ and $z$ lets $g$ absorb the “invariance pressure” while $h$ retains the richer features. Downstream users can take $h$ and discard $g$.
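
A minimal NumPy sketch of the contrastive loss on the projections $z$, assuming cosine similarity with a temperature (the NT-Xent form used in the SimCLR paper). The row-pairing convention and the temperature value are assumptions of this sketch.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """Contrastive loss for 2N projections; rows 2k and 2k+1 are two views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors give cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # a view is never compared with itself
    n = len(z)
    positives = np.arange(n) ^ 1                       # partner index: 0<->1, 2<->3, ...
    log_denominator = np.log(np.exp(sim).sum(axis=1))
    return np.mean(log_denominator - sim[np.arange(n), positives])

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
aligned = np.repeat(base, 2, axis=0)    # both views of each image map to the same z
random_z = rng.normal(size=(8, 16))     # views of the same image are unrelated
loss_aligned, loss_random = nt_xent(aligned), nt_xent(random_z)
```

The loss is lower when the two views of each image agree, which is the pressure that makes $z$ invariant to the augmentations.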

Example augmentations: random crop, colour distortion and Gaussian blur.

@Define Hungarian matching and explain its role in unsupervised classification.

In unsupervised classification, a clustering algorithm produces $N$ cluster IDs with no semantic meaning. To evaluate against ground-truth labels, we need an optimal 1-to-1 assignment between clusters and labels.

Hungarian matching: given an $N \times N$ cost matrix $C$ where $C _ {ij}$ is the error of matching cluster $i$ to label $j$, find the permutation matrix $P$ that solves

\[\min _ P \text{trace}(PC),\]

i.e. that minimises the sum of the diagonal of the permuted cost matrix. The Hungarian algorithm does this in $O(N^3)$ time.

This separates the grouping (the model’s job) from the semantics (Hungarian matching’s job, naming the clusters). It finds the assignment of clusters to true labels that minimises the total error.
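
To make the matching objective concrete, here is a toy example that finds the optimal assignment by brute force over all permutations. This is deliberately not the Hungarian algorithm (it is $O(N!)$ and only usable for tiny $N$); in practice one would typically call `scipy.optimize.linear_sum_assignment`, which solves the same problem efficiently. The cost used here, negative agreement counts, is one common choice and an assumption of this sketch.

```python
import itertools
import numpy as np

def best_assignment(cost):
    """Brute-force the optimal cluster-to-label assignment (tiny N only)."""
    n = len(cost)
    best = min(itertools.permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)

# Toy example: 7 samples, 3 clusters, 3 ground-truth labels.
clusters = np.array([0, 0, 1, 1, 2, 2, 2])
labels = np.array([2, 2, 0, 0, 1, 1, 0])

# counts[i, j] = number of samples in cluster i whose true label is j.
counts = np.zeros((3, 3), dtype=int)
np.add.at(counts, (clusters, labels), 1)

cost = -counts                          # maximising agreement = minimising negative counts
mapping = best_assignment(cost)         # mapping[i] = label assigned to cluster i
accuracy = np.mean(np.array(mapping)[clusters] == labels)
```

Here the best mapping sends cluster 0 to label 2, cluster 1 to label 0 and cluster 2 to label 1, so 6 of the 7 samples count as correct.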

@Describe the self-labelling by clustering algorithm for unsupervised classification.

An alternating optimisation between learning a model and assigning pseudo-labels, both minimising the cross-entropy

\[H(q, p) = -\frac{1}{N} \sum^N _ {i=1} \sum _ y q(y \mid x _ i) \log p(y \mid x _ i, \Theta)\]
  • Learning step: fix labels $q$ and update CNN weights via $\min _ p H(q, p)$, which is equivalent to standard supervised training with the current pseudo-labels.
  • Labelling step: fix the model $p$ and update label assignments via $\min _ q H(q, p)$, subject to $\sum _ i q(y \mid x _ i) = N/K$ for every class $y$ (each cluster receives an equal share of samples). Without this constraint, the alternating scheme has a trivial solution: assign every sample to the same class, which the model can then predict perfectly.

Why the labelling step is optimal transport: with the equal-size constraint, the labelling problem has the structure of a discrete OT problem. Treat the $N$ samples as sources (mass $1$ each) and the $K$ classes as targets (each demanding mass $N/K$). The cost of assigning sample $i$ to class $y$ is the entry

\[C _ {iy} = -\log p(y \mid x _ i, \Theta),\]

so $H(q, p) = \tfrac{1}{N}\sum _ {i,y} q(y \mid x _ i)\, C _ {iy}$ is, up to the constant factor $1/N$, exactly the total transport cost. Minimising it over $q$ subject to the two marginals $\sum _ y q(y \mid x _ i) = 1$ (each sample fully assigned) and $\sum _ i q(y \mid x _ i) = N/K$ (equal class sizes) is the Kantorovich OT problem. It is solved efficiently with the Sinkhorn-Knopp algorithm (entropy-regularised OT, Cuturi 2013): iteratively rescale the rows and columns of $\exp(-C/\varepsilon)$ to match the marginals, at a cost of $O(NK)$ per iteration.

Iterate the two steps to convergence to obtain an unsupervised classifier. After training, ∆hungarian-matching maps cluster IDs to ground-truth labels for evaluation.
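
The labelling step can be sketched with a few lines of Sinkhorn-Knopp scaling. This is a simplified illustration, not the authors' implementation: the function name, the value of $\varepsilon$ and the iteration count are assumptions, and the result is a soft assignment (smaller $\varepsilon$ pushes it towards hard labels).

```python
import numpy as np

def sinkhorn_labels(log_p, eps=1.0, n_iters=200):
    """Soft pseudo-labels q (N x K): rows sum to 1, columns sum to N/K."""
    n, k = log_p.shape
    cost = -log_p                                  # C_iy = -log p(y | x_i)
    kernel = np.exp(-(cost - cost.min()) / eps)    # constant shift for numerical stability
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = 1.0 / (kernel @ v)                     # match row marginals (mass 1 per sample)
        v = (n / k) / (kernel.T @ u)               # match column marginals (mass N/K per class)
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
# A model that strongly prefers class 0 for every sample.
logits = rng.normal(size=(12, 3)) + np.array([8.0, 0.0, 0.0])
log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
q = sinkhorn_labels(log_p)
```

Even though the model's own argmax would put all 12 samples in class 0, the columns of `q` each carry mass $N/K = 4$, so the pseudo-labels are spread evenly across the three classes.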

Bite-sized

The rotation-prediction pretext task was introduced by Gidaris et al., ICLR 2018. Each unlabelled image is rotated by one of $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$, and the network is trained to predict the applied rotation via 4-way cross-entropy. Directly tested in CV exam 2024 Q2(g).

Source: Lecture 18, Transformation Estimation slide; Gidaris et al., ICLR 2018.

@bite~

@Describe the global-average-pool collapse trap for the rotation-prediction pretext task.

If your CNN architecture ends with a global average pooling (GAP) layer before the rotation classifier, the rotation-prediction task can end up being solved in an unhelpful way:

GAP is invariant to spatial permutation: averaging across all spatial positions discards the spatial layout of the feature map. But the rotation classifier needs spatial layout to figure out which way is “up”. So the classifier ends up keying on whatever cues survive the averaging, such as aspect ratio of objects or locally detectable colour gradients (e.g. sky tends to be lighter towards the top), rather than learning genuine visual features.

The fix: don’t use global pooling before the rotation head. Use a head that preserves spatial information (flatten + FC, or attention over spatial positions). Then the network is forced to learn features that respect spatial layout, which generalise better to downstream tasks like classification.
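
The permutation-invariance of GAP can be checked numerically. The feature map below is random and its shape is an arbitrary assumption. Note that the check operates on the feature map itself; rotating the input image does not in general simply rotate the feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((7, 7, 64))                 # (H, W, C) feature map

# Shuffle the 49 spatial positions, keeping each position's channel vector intact.
flat = features.reshape(-1, 64)
shuffled = flat[rng.permutation(len(flat))].reshape(7, 7, 64)

# Global average pooling gives the same vector for both layouts ...
gap_same = np.allclose(features.mean(axis=(0, 1)), shuffled.mean(axis=(0, 1)))
# ... and for a rotated feature map as well.
gap_rot_same = np.allclose(features.mean(axis=(0, 1)),
                           np.rot90(features, axes=(0, 1)).mean(axis=(0, 1)))
# A flatten + FC head would receive a different input vector.
flatten_same = np.allclose(features.reshape(-1), shuffled.reshape(-1))
```

Here `gap_same` and `gap_rot_same` are true while `flatten_same` is false, which is why a flatten-based head can use layout information that a GAP-based head cannot.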

This is the standard worked example of how pretext task design choices interact with architecture to determine whether self-supervision actually transfers.

Source: Lecture 18, Transformation Estimation slide discussion.

@bite~ @exam~

SimCLR (Chen, Kornblith, Norouzi, Hinton) was introduced at ICML 2020 in “A Simple Framework for Contrastive Learning of Visual Representations”. The best augmentation policy from their ablation study is random crop + colour distortion + Gaussian blur; the lecture’s heatmap shows Crop+Color gives top linear-eval accuracy.

Source: Lecture 18, SimCLR and SimCLR: Augmentations slides.

@bite~

DINO uses two anti-collapse mechanisms applied to the teacher’s output:

  • Centring: subtract the EMA of past teacher logits from the current teacher logits, preventing collapse to a constant.
  • Sharpening: divide by a small temperature in the teacher softmax, making the distribution peaked rather than uniform.

Without these two together, DINO collapses to either constant or uniform outputs.

Source: Lecture 18, DINO – Sharpening and Centring slide.

@bite~

@Describe four classical pretext tasks for unsupervised representation learning, naming the paper for each.

  • Jigsaw puzzle solving (Noroozi & Favaro, ECCV 2016): split an image into a 3×3 grid, shuffle the patches, train the network to predict the permutation (chosen from 1000 fixed permutations, treated as 1000-way classification).
  • Context prediction (Doersch, Gupta, Efros, ICCV 2015): given a centre patch and one of its 8 neighbours, predict the relative position. 8-way classification.
  • Inpainting (Pathak et al., CVPR 2016): mask out a random rectangular region and train the network to fill it in. Forces semantic understanding.
  • Colourisation (Zhang, Isola, Efros, ECCV 2016): convert image to grayscale and train the network to predict the colour (or chromatic) channels. Forces understanding of object-level colour priors.

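Inpainting and colourisation follow the recovery template: apply a known corruption and train the network to undo it. Jigsaw solving and context prediction instead have the network predict the applied permutation or the relative position, which is closer to transformation estimation. In all four cases the learned features (encoder activations) transfer to downstream tasks like classification.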
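
As an illustration of how such pretext labels are produced, here is a sketch of example generation for the jigsaw task. The helper name and the tiny permutation set are hypothetical; the paper uses a curated set of permutations rather than the first few in lexicographic order.

```python
import itertools
import numpy as np

def make_jigsaw_example(img, permutations, rng, grid=3):
    """Cut img into grid x grid tiles and shuffle them with a permutation from a fixed set."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    tiles = [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(grid) for c in range(grid)]
    label = int(rng.integers(len(permutations)))   # index of the chosen permutation
    shuffled = np.stack([tiles[i] for i in permutations[label]])
    return shuffled, label

rng = np.random.default_rng(0)
# Hypothetical permutation set: the first 10 of the 9! possible orderings.
permutations = list(itertools.islice(itertools.permutations(range(9)), 10))
img = rng.random((96, 96, 3))                      # stand-in image
tiles, label = make_jigsaw_example(img, permutations, rng)
```

The network receives the 9 shuffled tiles and is trained to classify which permutation index was applied.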

Source: Lecture 18, Recovery and Transformation Estimation slides.

@bite~

SCAN (Van Gansbeke et al., ECCV 2020) does unsupervised image classification in three steps: (1) self-supervised pretraining (e.g. SimCLR) to get image representations, (2) training a classifier under the constraint that nearest-neighbour pairs in the embedding space should be classified into the same class, plus a class-balance constraint, (3) optional self-training to refine.

Source: Lecture 18, Classifying Images without Labels (SCAN) slide.

@bite~

@Define invariance vs equivariance and give a CV example of each.

For a function $f$ and a transformation $\pi$:

  • Invariance: $f(\pi(x)) = f(x)$ — applying the transformation to the input doesn’t change the output.
    • Example: an image classifier should be invariant to small translations of the image — a cat shifted by 5 pixels is still a cat.
  • Equivariance: $f(\pi(x)) = \pi'(f(x))$ for some “matching” transformation $\pi'$ on the output space. Applying $\pi$ to the input changes the output in a predictable way.
    • Example: a semantic segmentation network should be equivariant to translation — translating the image translates the output mask by the same amount.

Convolutional layers are translation-equivariant by construction. Pooling adds partial translation invariance. Self-attention is permutation-equivariant unless positional encodings break the symmetry.
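
Both properties can be verified on a toy 1-D signal. The sketch assumes circular (wrap-around) padding, under which equivariance is exact; with zero padding it only holds away from the borders. The helper `circular_conv1d` is written out by hand for this example.

```python
import numpy as np

def circular_conv1d(x, kernel):
    """1-D cross-correlation with wrap-around padding."""
    n = len(x)
    return np.array([sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.random(16)
kernel = np.array([0.5, 1.0, -0.25])
shift = 3

# Equivariance: filtering a shifted signal equals shifting the filtered signal.
equivariant = np.allclose(circular_conv1d(np.roll(x, shift), kernel),
                          np.roll(circular_conv1d(x, kernel), shift))
# Invariance: global average pooling of the output ignores the shift entirely.
invariant = np.isclose(circular_conv1d(np.roll(x, shift), kernel).mean(),
                       circular_conv1d(x, kernel).mean())
```

Both flags come out true: the convolution commutes with the shift, and pooling over all positions removes the shift altogether.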

In unsupervised learning, both are used as objectives: SimCLR enforces invariance under data augmentations; rotation-prediction can be read as a form of equivariance, where $\pi'$ acts on the output by changing the predicted rotation label.

Source: Lecture 18, Invariance and Equivariance slides.

@bite~

@Justify the use of the non-linear projection head $g$ between the encoder output $h$ and the contrastive features $z$ in SimCLR.

The contrastive loss $\mathcal L(z _ i, z _ j) = -\log \frac{\exp \mathcal S(z _ i, z _ j)}{\sum _ k \exp \mathcal S(z _ i, z _ k)}$ forces $z$ to be invariant to whatever transformations are used during augmentation. If we augment with colour distortion, then $z$ must produce the same output for “red car” and “blue car” — i.e. colour information is destroyed.

But downstream classification needs colour information (distinguishing “red apple” from “green apple”, for example). If we used $h = z$ (no projection head), the encoder would have to destroy this information.

The fix is to insert a non-linear MLP $g$ between $h$ and $z$. The contrastive loss invariance pressure operates only on $z$; $h$ can retain richer features that are useful downstream. After training, the projection head $g$ is discarded and $h$ is used as the representation.

This was one of SimCLR’s biggest empirical findings: computing the contrastive loss on $z = g(h)$ rather than directly on $h$ substantially improves downstream transfer.

Source: Lecture 18, SimCLR slide.

@bite~

Hungarian matching has runtime complexity $\mathcal O(N^3)$ for finding the optimal $N \times N$ cluster-to-label assignment. This is what makes it tractable for unsupervised classification evaluation with $N \sim 1000$ classes (around $10^9$ operations), though it would struggle for $N \sim 10^6$ (around $10^{18}$ operations).

Source: Lecture 18, Hungarian Matching slide.

@bite~

@Justify why unsupervised and self-supervised learning are often used interchangeably, and where the distinction lies.

The lecture’s own Twitter-poll-style discussion shows there’s no universally agreed definition. Common positions:

  • “They are the same”: both produce a useful representation without human-annotated labels for the target task.
  • “Self-supervised uses problem-specific principles to derive supervisory signal from the data itself” (e.g. predict-the-rotation, predict-the-relative-patch): so the code looks like supervised learning but with synthetic labels.
  • “Unsupervised is older techniques that don’t work; self-supervised is newer methods that do”: half-joking but contains a kernel of truth — the modern self-supervised paradigm (SimCLR, MoCo, DINO) genuinely outperforms classical unsupervised methods (k-means, PCA, autoencoders) on representation learning.

A reasonable resolution: self-supervised is a technique (deriving labels from the data), and unsupervised is a goal (no human labels for the downstream task). Self-supervised methods are unsupervised methods that achieve this via the labels-from-data trick.

But: don’t lose sleep over the terminology distinction. Most papers use them interchangeably.

Source: Lecture 18, Unsupervised vs. Self-Supervised? slide.

@bite~

Self-labelling by clustering (Asano et al., ICLR 2020) alternates between (i) learning a classifier with current pseudo-labels via cross-entropy, and (ii) updating the pseudo-labels via optimal transport with an equal-cluster-size constraint. The equal-size constraint is crucial: it prevents the trivial solution of assigning all images to one cluster.

Source: Lecture 18, Self-labelling by Clustering slide; Asano et al., ICLR 2020.

@bite~