Computer Vision MT25, Scale-invariant feature transform
Flashcards
What assumption is made the content of the image in the SIFT algorithm in order to identify corresponding points?
Everything lies on the same 3D plane.

@State three properties that keypoints should be stable under.
- Perspective changes (homographies)
- Contrast changes
- Lighting changes
What is the scale of a keypoint in the SIFT algorithm?
A description of how large to make the local neighbourhood around the keypoint in order for it to be a useful keypoint.

In high-level terms, how does the SIFT algorithm formulate the problem of finding keypoints?
They are local minima in an energy function $E(x, y, \sigma)$ where $(x, y)$ are the coordinates and $\sigma$ is the characteristic scale.
The SIFT algorithm formulates the problem of finding keypoints in an image as finding local minima in an energy function $E(x, y, \sigma)$ where $(x, y)$ are the coordinates and $\sigma$ is the characteristic scale.
How is this energy function formulated?
It’s the convolution at $(x, y)$ with the Laplacian of Gaussian (LoG) filter.
@Define the Laplacian of Gaussian filter $\text{LoG}$, define it in high-level terms in 2D, and give the precise formula in 1D.
Let $g(x, y)$ be the 2D Gaussian. Then
\[\text{LoG}(x, y) = \frac{\partial^2}{\partial^2 x} g + \frac{\partial^2}{\partial^2 y}g\]In 1D, this becomes
\[\text{LoG}(x) = -\frac{1}{\sigma^3 \sqrt{2\pi}} \left( 1 - \frac{x^2}{\sigma^2} \right) e^{\frac{-x^2}{2\sigma^2}}\]How does the SIFT descriptor characterise a keypoint?
- It computes edge orientations and a global orientation
- It rotates all edges so the global orientation is up
- It splits the local area around the keypoint into $4 \times 4$ regions
- It computes the edge histograms for each region
- It concatenates these histograms to make a $128$ dimensional vector
Describe how brute force local matching with SIFT descriptors might work.
For each descriptor vector in the first image, return the descriptor vector with the lowest Euclidean distance in the second image. Then display the top $N$ matches.

@Describe the technique of “visual words” in relation to the SIFT algorithm, and why it is useful for image matching.
- Compute the SIFT features on a large image dataset
- Compute $K$-means clustering and assign each feature to be one of $K$ “classes”
- An image is described with a histogram of feature classes it contains
- This reduces the size of the descriptor for an image from $\text{features} \times \text{dimensions}$ to $K$ (as if you did brute force matching of each SIFT feature vector).
The “visual words” technique for computing image descriptors is as follows:
- Compute the SIFT features on a large image dataset
- Compute $K$-means clustering and assign each feature to be one of $K$ “classes”
- An image is described with a histogram of feature classes it contains
- This reduces the size of the descriptor for an image from $\text{features} \times \text{dimensions}$ to $K$ (as if you did brute force matching of each SIFT feature vector).
How can you use this for image retrieval?
- Given a query image, compute keypoints and descriptors.
- Find keypoint labels from pre-computed $K$-means
- Compute bag of visual words for the query
- Compare bag of words query to all database bag of words and retrieve top $M$
- (potentially perform slower verification of top $M$ results)
Bite-sized
SIFT was introduced by David Lowe in “Distinctive image features from scale-invariant keypoints”, IJCV 2004. With ~77k citations, it is one of the most-cited computer vision papers ever — far exceeding LeNet (~73k) despite predating it by only a few years.
The SIFT pipeline has three high-level stages: (1) keypoint detection, (2) keypoint description, (3) keypoint matching.
@Describe what makes a good keypoint for matching, using Lecture 5’s worked example on Christ Church.

The lecture labels four regions A, B, C, D and contrasts them:
- A (untextured region, e.g. plain wall): bad — too smooth, no distinguishing local content, the same descriptor would be computed everywhere in a uniform area.
- B (a corner or blob — well-textured local content): good — stable under viewpoint changes, distinguishable from other locations, single best location for the descriptor.
- C (a repetitive pattern — e.g. a row of identical windows): stable but has multiple occurrences. A single descriptor could match any of several locations — which window is which becomes ambiguous.
- D (a long edge): hard to pinpoint a single point along the edge. Motion along the edge is
So good keypoints are blob- or corner-like local features with rich 2D gradient content, distinctive from other locations in the image.
The SIFT keypoint descriptor is constructed by computing per-cell histograms of edge orientations over a $4 \times 4$ grid around the keypoint, with 8 orientation bins per cell. Concatenating the histograms gives a $4 \times 4 \times 8 = 128$-dimensional descriptor vector.
@Justify why the SIFT descriptor rotates all edge orientations to align with the keypoint’s global orientation.
The keypoint’s “global orientation” is the dominant edge direction in its local neighbourhood (computed by histogramming all the edges and picking the largest peak).
Rotating all subsequent per-cell histograms so this global orientation points “up” makes the descriptor rotation-invariant: if the same keypoint is photographed at a different rotation, both observations produce the same descriptor after rotation-normalisation.
Without this step, two photos of the same scene taken from different angles would produce different SIFT descriptors at corresponding keypoints, and matching would fail.
The descriptor is also implicitly:
- Scale-invariant (via the characteristic scale step).
- Translation-invariant (the descriptor is location-relative).
- Roughly invariant to small viewpoint changes (homographies of small patches).
- Roughly invariant to small lighting changes (gradient-based, so additive constants cancel).
The Laplacian of Gaussian (LoG) at scale $\sigma$ acts as a blob detector — it responds maximally when convolved with a 2D blob whose width matches $\sigma$. SIFT scans across both spatial location and scale, finding $(x, y, \sigma)$ triples where the (squared) LoG response is a local extremum. In practice, this means examining a $3 \times 3 \times 3$ neighbourhood in $(x, y, \sigma)$ space and checking if the centre is the extremum among its 26 neighbours.
@Describe the bag of visual words image-retrieval pipeline at inference time.
Pre-computed offline (one-time):
- Run SIFT on a large image dataset to collect many local descriptors.
- K-means cluster the descriptors into $K$ centroids (the “vocabulary”).
- For each database image, compute SIFT keypoints, label each by its nearest cluster, and form a $K$-dim histogram. Store these histograms with the image IDs.
At query time:
- Compute SIFT keypoints + descriptors for the query image.
- Label each query keypoint by its nearest cluster centroid (a fast O(K) lookup per keypoint).
- Form the query’s $K$-dim histogram of visual words.
- Compare this histogram to every database histogram via cosine similarity (sparse, so cheap), return the top-$M$ matches.
- Optionally re-rank top-$M$ via slower geometric verification (RANSAC + homography check).
Speed-up vs brute-force feature matching: $10^{18}$ ops → $10^{10}$-$10^{12}$ ops (depending on $K$), making web-scale image retrieval feasible.
The motivation for visual words is that brute-force feature matching has cost $\sim <span class="cloze" tabindex="0">10^{18}</span>$ operations to search 10 billion images with 1000 keypoints each. Even on a 1000 PetaFLOPS supercomputer, that’s 1 second per image — utterly infeasible. Visual words drop the cost by converting per-image descriptors into a sparse global vector that compares in O(K) rather than O(#kp \cdot d) per image-pair.
The keypoint’s characteristic scale $\sigma$ in SIFT is found by searching across $\sigma$ for the local extremum of the LoG response in a 3D $(x, y, \sigma)$ space. The scale at which a blob detector responds most strongly is the natural size of the underlying feature.
