Computer Vision MT25, Representation learning
-
[[Course - Computer Vision MT25]]
- [[Notes - Computer Vision MT25, Image representation]]
- [[Notes - Computer Vision MT25, Image classification]]
- [[Notes - Computer Vision MT25, Scale-invariant feature transform]]
- [[Notes - Computer Vision MT25, Loss function design]]
- See also:
- Learning to Compare Image Patches via Convolutional Neural Networks (2015)
Flashcards
How does representation learning differ from more general machine learning setups?
Generally you learn a function mapping the input directly to the task output; in representation learning you instead learn a general-purpose representation which can then be used for downstream tasks.
How does representation learning differ in a supervised vs unsupervised context?
- Supervised: Given a specific task, learn a domain-specific representation which is often constrained to this particular task.
- Unsupervised: Given only the data, find a representation for it which often does not align exactly with the task.
What are some of the issues with hand-crafted representations?
- It’s hard to find the “discriminative signature” for a problem
- Even if you find the discriminative signature, it can be hard to implement programmatically
- Many different signals need to be combined, which itself is a difficult problem
Give an @example of a handcrafted image representation.
The SIFT descriptor.
@Describe and @visualise the approach you might use to learn a keypoint descriptor (i.e. a representation of an image patch such that different images of the same keypoint have similar descriptors).
- Use a dataset with known point correspondences
- Extract patches around keypoints
- Use positive examples of matching keypoints and negative examples using random keypoints
- Train a model to predict the similarity between two patches

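The pair-construction steps above can be sketched in NumPy. All names here (`build_training_pairs`, `patches_by_keypoint`) are illustrative, not from the lecture; the sketch assumes patches have already been extracted and grouped by keypoint identity:

```python
import numpy as np

def build_training_pairs(patches_by_keypoint, rng):
    """Build (patch_a, patch_b, label) training examples:
    label 1 for two views of the same keypoint, 0 for a random other keypoint."""
    keypoint_ids = list(patches_by_keypoint)
    pairs = []
    for kid in keypoint_ids:
        views = patches_by_keypoint[kid]
        # positive: two different views of the same keypoint
        if len(views) >= 2:
            a, b = rng.choice(len(views), size=2, replace=False)
            pairs.append((views[a], views[b], 1))
        # negative: pair with a patch from a random different keypoint
        other = rng.choice([k for k in keypoint_ids if k != kid])
        other_views = patches_by_keypoint[other]
        pairs.append((views[0], other_views[rng.integers(len(other_views))], 0))
    return pairs
```

A similarity model can then be trained on these labelled pairs.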
@Visualise the three different architectures considered in “Learning to Compare Image Patches via Convolutional Neural Networks (2015)” for determining if two image patches correspond to the same keypoint.

SIFT is an algorithm which determines a set of keypoints in an image and then calculates keypoint descriptors for each of these. What does LIFT stand for, and how does it differ?
- LIFT: Learned Invariant Feature Transform
- In SIFT, the keypoints to choose and their corresponding descriptors are determined by a hand-crafted algorithm.
- In LIFT, the keypoints to choose and their corresponding descriptors are determined using a trained model.
What are the typical problems with learned image descriptors versus hand-crafted ones?
- Might not generalise to unseen domains
- Typically slower than hand-crafted
In supervised image representation learning, what does the training data generally look like?
A set of samples $\{ x _ i \}$ and corresponding positive $\{ x _ {i, j}^+ \}$ and negative $\{ x _ {i, k}^- \}$ examples.
Suppose we have a set of samples $\{ x _ i \}$ and corresponding positive $\{ x _ {i, j}^+ \}$ and negative $\{ x _ {i, k}^- \}$ examples, and a function $f : \mathcal D \to \Phi$.
@Define the cosine similarity loss in this context, and how it is typically approximated.
Let $J _ i$ be the number of positive examples and $K _ i$ be the number of negative examples. Then:
\[\mathcal L _ {\cos}(\phi _ i) = -\frac{1}{J _ i} \sum^{J _ i} _ {j = 1} \mathcal S _ {\cos}(\phi _ i, \phi _ {i, j}^+) + \frac{1}{K _ i} \sum^{K _ i} _ {k = 1} \mathcal S _ {\cos} (\phi _ i, \phi _ {i, k}^-)\]where $\mathcal S _ {\cos}$ denotes cosine similarity. Since these sums might be large, it's typically approximated by randomly choosing one positive and one negative example:
\[\mathcal L _ {\cos}(\phi _ i) = -\mathcal S _ {\cos}(\phi _ i, \phi _ i^+) + \mathcal S _ {\cos}(\phi _ i, \phi _ i^-)\]Suppose:
- We have a set of samples $\{ x _ i \}$ and corresponding positive $\{ x _ {i, j}^+ \}$ and negative $\{ x _ {i, k}^- \}$ examples
- An embedding function $f : \mathcal D \to \Phi$
- $J _ i$ and $K _ i$ are the number of positive and negative examples respectively
In this context, the cosine similarity loss is given by:
\[\mathcal L _ {\cos}(\phi _ i) = -\frac{1}{J _ i} \sum^{J _ i} _ {j = 1} \mathcal S _ {\cos}(\phi _ i, \phi _ {i, j}^+) + \frac{1}{K _ i} \sum^{K _ i} _ {k = 1} \mathcal S _ {\cos} (\phi _ i, \phi _ {i, k}^-)\]
What’s the problem with this approach?
It forces embeddings of positive pairs towards being identical (cosine similarity $1$) and negative pairs towards being completely dissimilar (cosine similarity $-1$), rather than only enforcing a relative ordering between them.
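A minimal NumPy sketch of the cosine similarity loss defined above (function names are illustrative, not from the lecture):

```python
import numpy as np

def cosine_sim(a, b):
    # S_cos(a, b) = a.b / (|a| |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_similarity_loss(phi, positives, negatives):
    # L_cos = -(1/J) sum_j S_cos(phi, phi_j^+) + (1/K) sum_k S_cos(phi, phi_k^-)
    pos = np.mean([cosine_sim(phi, p) for p in positives])
    neg = np.mean([cosine_sim(phi, n) for n in negatives])
    return -pos + neg
```

The loss is minimised (at $-2$) when every positive is perfectly aligned with $\phi$ and every negative points in exactly the opposite direction, which is the "problem" discussed above.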
Suppose:
- We have a set of samples $\{ x _ i \}$ and corresponding positive $\{ x _ {i, j}^+ \}$ and negative $\{ x _ {i, k}^- \}$ examples
- An embedding function $f : \mathcal D \to \Phi$
@Define the triplet loss $\mathcal L _ \text{triplet}(\phi, \phi^+, \phi^-)$ in this context given an arbitrary similarity function $\mathcal S$, perhaps (negative) $L _ 2$ distance or cosine similarity. What’s the intuitive interpretation of this loss?
\[\mathcal L _ \text{triplet}(\phi, \phi^+, \phi^-) = \max(0, \mathcal S(\phi, \phi^-) - \mathcal S(\phi, \phi^+) + \epsilon)\]where $\epsilon$ is some constant margin.
Intuitively, this is enforcing a relative order on the similarities of positive and negative examples.
Suppose:
- We have a set of samples $\{ x _ i \}$ and corresponding positive $\{ x _ {i, j}^+ \}$ and negative $\{ x _ {i, k}^- \}$ examples
- An embedding function $f : \mathcal D \to \Phi$
In this context, the triplet loss $\mathcal L _ \text{triplet}(\phi, \phi^+, \phi^-)$ is given by:
\[\mathcal L _ \text{triplet}(\phi, \phi^+, \phi^-) = \max(0, \mathcal S(\phi, \phi^-) - \mathcal S(\phi, \phi^+) + \epsilon)\]
where $\epsilon$ is some constant. Why might this be slow, and how can you speed this up using batch-based training?
This requires three evaluations of $f$ per loss calculation, which might be slow if $f$ is a complicated neural network.
If training over a batch, you can avoid computing embeddings for dedicated negative samples by reusing the embeddings of the other batch elements as negative examples.
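A sketch of the triplet loss with in-batch negatives, taking $\mathcal S$ to be the dot product (all names here are illustrative):

```python
import numpy as np

def triplet_loss(phi, phi_pos, phi_neg, eps=0.2):
    # L = max(0, S(phi, phi^-) - S(phi, phi^+) + eps), with S = dot product here
    return max(0.0, np.dot(phi, phi_neg) - np.dot(phi, phi_pos) + eps)

def batch_triplet_loss(anchors, positives, eps=0.2):
    # anchors, positives: [B, d]; the positive of every *other* batch element
    # serves as a negative for anchor i (in-batch negatives), so no extra
    # embeddings need to be computed for negative samples
    B = anchors.shape[0]
    total = 0.0
    for i in range(B):
        for k in range(B):
            if k != i:
                total += triplet_loss(anchors[i], positives[i], positives[k], eps)
    return total / (B * (B - 1))
```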
Suppose:
- We have a set of samples $\{ x _ i \}$ and corresponding positive $\{ x _ {i}^+ \}$ examples.
- An embedding function $f : \mathcal D \to \Phi$
- We are training over a batch of $B$ elements
@Define the contrastive loss in this context.
Suppose:
- We have a set of samples $\{ x _ i \}$ and corresponding positive $\{ x _ {i}^+ \}$ examples.
- An embedding function $f : \mathcal D \to \Phi$
- We are training over a batch of $B$ elements
In this context, the contrastive loss is defined by
\[\mathcal L _ \text{cont}(\phi _ i, \phi _ i^+) = -\log \frac{\exp(\mathcal S(\phi _ i, \phi _ i^+))}{\sum^B _ {k=1} \exp(\mathcal S(\phi _ i, \phi _ k))}\]
How can you interpret this in terms of cross-entropy loss?
Imagine a classifier that computed the embedding $\phi _ i$ of the input, and then at the final layer computed the similarities to all $B$ batch elements, treating each as a class whose correct label is the positive example. Assuming the similarity function $\mathcal S$ were fixed, the contrastive loss is exactly the cross-entropy loss you would use to train the embedding model.
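This cross-entropy view can be sketched directly as a log-softmax over batch similarities, with $\mathcal S$ taken as the dot product (names illustrative):

```python
import numpy as np

def contrastive_loss(phi, batch_phis, pos_index):
    # L_cont = -log( exp(S(phi, phi^+)) / sum_k exp(S(phi, phi_k)) ),
    # with S as the dot product. This is cross-entropy over the batch,
    # treating pos_index as the correct "class".
    sims = batch_phis @ phi                          # [B] similarities
    log_probs = sims - np.log(np.sum(np.exp(sims)))  # log-softmax over the batch
    return -log_probs[pos_index]
```

When all similarities are equal the loss reduces to $\log B$, the cross-entropy of a uniform prediction over $B$ classes.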
@Visualise how the loss is computed in the CLIP model.

Then we apply cross-entropy loss along each row and column of the image–text similarity matrix so that, after softmax, the diagonal entries (matching pairs) are pushed towards $1$ and the off-diagonal entries towards $0$.

Give the @algorithm used for calculating the loss of the CLIP image encoder and text encoder in this context.

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) # [n, d_e]
T_f = text_encoder(T) # [n, d_e]
# scaled pairwise cosine similarities [n, n]
logits = dot(I_f, T_f.T) * exp(t)
# symmetric loss function
labels = arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
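The pseudocode above is notational (`cross_entropy_loss` with an `axis` argument is not a concrete library call). A self-contained NumPy sketch of the same symmetric loss, starting from precomputed embeddings rather than the encoders, might look like:

```python
import numpy as np

def clip_loss(I_f, T_f, t):
    # I_f, T_f: [n, d_e] image/text embeddings; t: learned log-temperature
    I_e = I_f / np.linalg.norm(I_f, axis=1, keepdims=True)  # l2-normalise
    T_e = T_f / np.linalg.norm(T_f, axis=1, keepdims=True)
    logits = I_e @ T_e.T * np.exp(t)  # [n, n] scaled cosine similarities

    def xent(lg):
        # cross-entropy with labels arange(n): the i-th row should
        # assign highest probability to its i-th (diagonal) entry
        lg = lg - lg.max(axis=1, keepdims=True)  # for numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    loss_i = xent(logits)    # each image against all texts (rows)
    loss_t = xent(logits.T)  # each text against all images (columns)
    return (loss_i + loss_t) / 2
```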
Give the @algorithm used for calculating the loss of the SigLIP image encoder and text encoder.
# img_emb : image model embedding [n, dim]
# txt_emb : text model embedding [n, dim]
# t_prime, b : learnable temperature and bias
# n : mini-batch size
t = exp(t_prime)
zimg = l2_normalize(img_emb)
ztxt = l2_normalize(txt_emb)
logits = dot(zimg, ztxt.T) * t + b
labels = 2 * eye(n) - ones(n) # -1 with diagonal 1
l = -sum(log_sigmoid(labels * logits)) / n
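A runnable NumPy sketch of this sigmoid loss, using `np.logaddexp` for a numerically stable `log_sigmoid` (the function name `siglip_loss` is illustrative):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t_prime, b):
    # Pairwise sigmoid loss: each (image, text) pair is an independent
    # binary classification, label +1 on the diagonal and -1 off it.
    n = img_emb.shape[0]
    t = np.exp(t_prime)
    zimg = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    ztxt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = zimg @ ztxt.T * t + b
    labels = 2 * np.eye(n) - np.ones((n, n))  # -1 with diagonal 1
    # log_sigmoid(x) = -log(1 + exp(-x)) = -logaddexp(0, -x)
    return -np.sum(-np.logaddexp(0.0, -labels * logits)) / n
```

Because no softmax normalisation over the whole batch is required, each pair's term can be computed independently.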
How does SigLIP compare to CLIP?
It replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss, which avoids normalising over the whole batch; this makes it more efficient and better suited to multi-GPU training.
What is “ranking loss” in the context of representation learning?
A loss used when there are multiple positive and negative examples to compare similarity against, and there is a ranking over how similar each positive and negative example should be.
Why is representation learning useful for retrieval problems?
In retrieval problems, there is either no predefined set of classes or the set of classes is too large to train a classifier. With representation learning we don't require a list of classes; we can instead embed the query and return the most similar examples under the learned representation.