Computer Vision MT25, Scale-invariant feature transform


Flashcards

What assumption is made about the content of the image in the SIFT algorithm in order to identify corresponding points?


Everything lies on the same 3D plane.

@State three properties that keypoints should be stable under.


  • Perspective changes (homographies)
  • Contrast changes
  • Lighting changes

What is the scale of a keypoint in the SIFT algorithm?


A description of how large to make the local neighbourhood around the keypoint in order for the keypoint to be described and matched reliably.

In high-level terms, how does the SIFT algorithm formulate the problem of finding keypoints?


They are local minima in an energy function $E(x, y, \sigma)$ where $(x, y)$ are the coordinates and $\sigma$ is the characteristic scale.

The SIFT algorithm formulates the problem of finding keypoints in an image as finding local minima in an energy function $E(x, y, \sigma)$ where $(x, y)$ are the coordinates and $\sigma$ is the characteristic scale.

How is this energy function formulated?


It’s the convolution of the image with the Laplacian of Gaussian (LoG) filter at scale $\sigma$, evaluated at $(x, y)$.
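This isn’t the real SIFT detector (which approximates the LoG with a difference-of-Gaussians pyramid and refines the extrema), but as a rough sketch of the formulation, the following NumPy/SciPy snippet evaluates a scale-normalised LoG response $E(x, y, \sigma)$ over a handful of scales and keeps the local minima; the scale list and threshold are arbitrary choices for illustration.

```python
import numpy as np
from scipy import ndimage

def keypoint_candidates(image, sigmas=(1.6, 2.0, 2.5, 3.2, 4.0)):
    """Candidate keypoints as local minima of the LoG response E(x, y, sigma)."""
    # Scale-normalised LoG response at each scale: sigma^2 * (LoG_sigma * image).
    E = np.stack([s ** 2 * ndimage.gaussian_laplace(image.astype(float), sigma=s)
                  for s in sigmas])

    # Keep points that are the minimum of their 3x3x3 (scale, y, x) neighbourhood
    # and whose response is reasonably strong (ad-hoc threshold for illustration).
    is_local_min = ndimage.minimum_filter(E, size=3) == E
    is_strong = E < 0.1 * E.min()
    scale_idx, ys, xs = np.nonzero(is_local_min & is_strong)
    return [(int(x), int(y), sigmas[s]) for s, y, x in zip(scale_idx, ys, xs)]
```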

@Define the Laplacian of Gaussian filter $\text{LoG}$ in high-level terms in 2D, and give the precise formula in 1D.


Let $g(x, y)$ be the 2D Gaussian. Then

\[\text{LoG}(x, y) = \frac{\partial^2}{\partial x^2} g + \frac{\partial^2}{\partial y^2} g\]

In 1D, this becomes

\[\text{LoG}(x) = -\frac{1}{\pi \sigma^2} \left( 1 - \frac{x^2}{\sigma^2} \right) e^{\frac{-x^2}{2\sigma^2}}\]
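For concreteness, a minimal NumPy sketch that evaluates this 1D formula on a grid ($\sigma$ and the grid are arbitrary choices); plotting `y` against `x` gives the shape asked about in the next card.

```python
import numpy as np

def log_1d(x, sigma):
    """Evaluate the 1D Laplacian-of-Gaussian formula above."""
    return (-1.0 / (np.pi * sigma ** 2)) * (1 - x ** 2 / sigma ** 2) * np.exp(-x ** 2 / (2 * sigma ** 2))

sigma = 1.0                                   # arbitrary scale for illustration
x = np.linspace(-4 * sigma, 4 * sigma, 201)
y = log_1d(x, sigma)

print(y[100])        # negative: the central trough at x = 0
print(y.max() > 0)   # True: positive side lobes appear for |x| > sigma
```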

@Visualise what the 1D $\text{LoG}$ filter looks like.


(Figure: a “Mexican hat” shaped curve. With the sign convention above, it has a negative trough at $x = 0$, flanked by two small positive lobes for $|x| > \sigma$, and it decays to zero as $|x| \to \infty$.)

How does the SIFT descriptor characterise a keypoint?


  • It computes edge (gradient) orientations in the local neighbourhood and a dominant global orientation for the keypoint
  • It rotates all edges so the global orientation points up (giving rotation invariance)
  • It splits the local area around the keypoint into a $4 \times 4$ grid of regions
  • It computes an $8$-bin edge-orientation histogram for each region
  • It concatenates these histograms to make a $4 \times 4 \times 8 = 128$ dimensional vector
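Rather than building these histograms by hand, a short sketch using OpenCV’s SIFT implementation (assuming a recent `opencv-python` install and a hypothetical image file `query.jpg`) confirms the $4 \times 4 \times 8 = 128$ dimensional descriptor shape.

```python
import cv2

image = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical image path

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# One row per keypoint; 4 x 4 regions x 8 orientation bins = 128 columns.
print(descriptors.shape)                                       # (number_of_keypoints, 128)
print(keypoints[0].pt, keypoints[0].size, keypoints[0].angle)  # location, scale, orientation
```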

Describe how brute force local matching with SIFT descriptors might work.


For each descriptor vector in the first image, find the descriptor vector in the second image with the lowest Euclidean distance. Then display the top $N$ matches (those with the smallest distances).
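A hedged OpenCV sketch of this brute-force matching; the image paths and the number of displayed matches $N$ are illustrative assumptions. (In practice Lowe’s ratio test, which compares the best and second-best distances, is a more common filter than the cross-check used here.)

```python
import cv2

img1 = cv2.imread("image1.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical paths
img2 = cv2.imread("image2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matcher with Euclidean (L2) distance; crossCheck keeps only
# mutual nearest-neighbour matches.
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

N = 50                       # arbitrary number of matches to display
vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:N], None)
cv2.imwrite("matches.png", vis)
```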

@Describe the technique of “visual words” in relation to the SIFT algorithm, and why it is useful for image matching.


  • Compute the SIFT features on a large image dataset
  • Compute $K$-means clustering and assign each feature to be one of $K$ “classes”
  • An image is described with a histogram of feature classes it contains
  • This reduces the size of an image’s descriptor from $\text{features} \times \text{dimensions}$ (which is what brute-force matching of every SIFT feature vector would have to work with) to a single $K$-dimensional histogram.
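A sketch of this pipeline using OpenCV and scikit-learn’s `KMeans`; the vocabulary size, file paths and `sift_descriptors` helper are illustrative assumptions (for a genuinely large descriptor set, `MiniBatchKMeans` would be the usual choice).

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def sift_descriptors(path):
    """SIFT descriptors for one image (hypothetical helper)."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(img, None)
    return des

# 1. Pool descriptors from a large image dataset (paths are placeholders).
training_paths = ["db_0001.jpg", "db_0002.jpg"]   # ... many more in practice
all_descriptors = np.vstack([sift_descriptors(p) for p in training_paths])

# 2. Learn K visual words ("classes") by clustering the 128-d descriptors.
K = 1000                                          # vocabulary size, an arbitrary choice
vocabulary = KMeans(n_clusters=K, random_state=0).fit(all_descriptors)

# 3. Describe an image as a normalised histogram over the K visual words.
def bag_of_words(path):
    words = vocabulary.predict(sift_descriptors(path))
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()
```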

The “visual words” technique for computing image descriptors is as follows:

  • Compute the SIFT features on a large image dataset
  • Compute $K$-means clustering and assign each feature to be one of $K$ “classes”
  • An image is described with a histogram of feature classes it contains
  • This reduces the size of an image’s descriptor from $\text{features} \times \text{dimensions}$ (which is what brute-force matching of every SIFT feature vector would have to work with) to a single $K$-dimensional histogram.

How can you use this for image retrieval?


  • Given a query image, compute keypoints and descriptors.
  • Assign each descriptor a visual-word label using the pre-computed $K$-means clustering
  • Compute the bag of visual words (histogram) for the query
  • Compare the query’s bag of words to every database bag of words and retrieve the top $M$
  • (potentially perform slower verification of top $M$ results)
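Continuing the previous sketch (reusing the hypothetical `bag_of_words` helper), a minimal retrieval loop that compares the query’s histogram to every database histogram and returns the top $M$; L1 distance is used here, but cosine similarity or TF-IDF weighting are common alternatives.

```python
import numpy as np

# Pre-compute a bag-of-words histogram for every database image.
db_paths = ["db_0001.jpg", "db_0002.jpg"]          # placeholder database
db_hists = np.stack([bag_of_words(p) for p in db_paths])

def retrieve(query_path, M=5):
    """Return the M database images whose word histograms are closest to the query."""
    q = bag_of_words(query_path)
    distances = np.abs(db_hists - q).sum(axis=1)   # L1 distance between normalised histograms
    ranked = np.argsort(distances)[:M]
    return [(db_paths[i], distances[i]) for i in ranked]

# The top-M candidates could then be re-ranked with a slower verification step,
# e.g. matching raw SIFT descriptors and fitting a homography with RANSAC.
```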


