Computer Vision MT25, Object detection


- [[Course - Computer Vision MT25]]
- [[Notes - Computer Vision MT25, Image classification]]

Flashcards

How does object detection differ from image classification?


Rather than classify the subject of one image, we instead want to find the location of an object in the image, typically in the form of a labelled bounding box $(x, y, w, h, c)$.

@Describe the sliding window approach to object detection.


Train an image classifier which can also detect just background, and then slide a window across the big image and classify each sub-image.

What are the main problems with the sliding window approach to object detection with an image classifier?


  • It is computationally expensive
  • There are problems with occlusion and truncation
  • Often get multiple responses for the same object

Briefly describe how Dalal & Triggs did pedestrian detection?


  • Sliding window classifier
  • Binary SVM for person versus background using the Histogram of Oriented Gradients (HoG) feature

How could you compute the histogram of gradients using a CNN?


  • Apply edge filters for a discrete approximation of several angles
  • Use a pooling operation to combine these extracted edges
  • Use layer norm to normalise histograms
  • Flatten into a single feature vector
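The steps above could be sketched as follows. This is a toy numpy version: the four orientations, the $4 \times 4$ cells, and the use of finite differences in place of learned conv filters are all illustrative choices, not details from the notes.

```python
import numpy as np

def hog_like_features(img, n_bins=4, cell=4):
    """Rough HoG-style features built from CNN-like operations:
    oriented edge filters -> rectification -> pooling -> normalisation -> flatten."""
    # 1. "Edge filters": finite-difference gradients, projected onto
    #    n_bins discrete orientations (a stand-in for learned conv filters).
    gy, gx = np.gradient(img.astype(float))
    angles = np.arange(n_bins) * np.pi / n_bins
    responses = np.stack([np.maximum(0, gx * np.cos(a) + gy * np.sin(a))
                          for a in angles])          # ReLU-style rectification
    # 2. "Pooling": sum responses over non-overlapping cell x cell windows.
    h, w = img.shape
    h, w = h - h % cell, w - w % cell
    pooled = (responses[:, :h, :w]
              .reshape(n_bins, h // cell, cell, w // cell, cell)
              .sum(axis=(2, 4)))
    # 3. "Layer norm": normalise each cell's histogram to unit L2 norm.
    norms = np.linalg.norm(pooled, axis=0, keepdims=True)
    pooled = pooled / np.maximum(norms, 1e-8)
    # 4. Flatten into a single feature vector.
    return pooled.reshape(-1)
```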

Suppose you have trained an SVM to classify person versus background using HoG features. How could you visualise what the SVM $f(x) = w^\top x + b$ has “learned”?


Assuming that $f(x) > 0$ corresponds to person and $f(x) < 0$ corresponds to background, you could plot the histograms corresponding to the positive versus negative weights.

How would IoU calculate the score for the predicted bounding box versus the ground truth bounding box?


\[\text{IoU} = \frac{\text{Area}(\text{GT} \cap \text{Pred})}{\text{Area}(\text{GT} \cup \text{Pred})}\]

where $\text{GT}$ stands for “ground truth”.
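A minimal Python sketch of this formula, assuming $(x, y)$ is the top-left corner of each box:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x, y, w, h) form, with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if disjoint)
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```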

What slight bias is there for IoU?


It is favourable to large objects, since it’s harder to intersect small objects.

One way to get false positives is that a bounding box doesn’t correspond to any ground truth. What’s another common way this happens?


The ground truth annotation is already covered by a better-fitting prediction.

State the @algorithm which evaluates an object detection model which gives predictions $(\hat x, \hat y, \hat w, \hat h, \hat c)$ of bounding boxes in an image.


  • For each image $i$, class $c$ and IoU threshold $t _ \text{iou}$:
    • Create set of predictions $(\hat x _ i, \hat y _ i, \hat w _ i, \hat h _ i, \hat c _ {c, i})$
    • For each confidence threshold $t _ \text{conf}$:
      • Ignore any boxes with $\hat c _ {c, i} < t _ \text{conf}$
      • For each GT annotation $(x _ j, y _ j, w _ j, h _ j)$:
        • Find highest IoU prediction
        • Is $\text{IoU}((\hat x _ i, \hat y _ i, \hat w _ i, \hat h _ i), (x _ j, y _ j, w _ j, h _ j)) \ge t _ \text{iou}$?
          • Yes: True positive, remove from predictions
          • No: False negative
      • Remaining predictions: False positive
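The inner matching step (one image, one class, fixed thresholds) could be sketched like this; the function name and the greedy matching order are illustrative choices:

```python
def match_detections(preds, gts, t_iou=0.5, t_conf=0.5):
    """TP/FP/FN counts for one image and one class.

    preds: list of (x, y, w, h, conf); gts: list of (x, y, w, h)."""
    def iou(a, b):
        ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    cand = [p[:4] for p in preds if p[4] >= t_conf]   # drop low-confidence boxes
    tp = fn = 0
    for gt in gts:
        # Find the highest-IoU prediction still available for this GT box
        best = max(cand, key=lambda p: iou(p, gt), default=None)
        if best is not None and iou(best, gt) >= t_iou:
            tp += 1
            cand.remove(best)        # each prediction matches at most one GT
        else:
            fn += 1                  # this GT annotation was missed
    fp = len(cand)                   # remaining predictions match nothing
    return tp, fp, fn
```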

@Define the technique of bootstrapping / self-training / hard negative mining for improving a classification model for object detection.


  1. Create a training dataset of positive and negative patches.
  2. Train a classifier
  3. Detect objects in training data
  4. Add false positives to training data as negative examples
  5. Repeat step 2
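The loop above could be sketched with a toy 1-D “classifier” (a decision threshold between class means); the classifier, data, and round count are all illustrative stand-ins, not part of the technique itself:

```python
def mine_hard_negatives(pos, neg_pool, rounds=3):
    """Sketch of bootstrapping / hard negative mining.

    pos: positive example scores; neg_pool: all background patches we could scan."""
    negatives = neg_pool[:2]                   # 1. small initial negative set
    threshold = 0.0
    for _ in range(rounds):
        # 2. "Train": put the decision threshold midway between class means.
        threshold = (sum(pos) / len(pos) + sum(negatives) / len(negatives)) / 2
        # 3. "Detect" on training data: background patches scored above
        #    the threshold are false positives.
        false_pos = [x for x in neg_pool if x > threshold and x not in negatives]
        if not false_pos:
            break
        # 4. Add the false positives as new negative examples...
        negatives += false_pos
        # 5. ...and repeat from step 2.
    return threshold
```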

@Define the technique of non-maximum suppression in the context of object detection.


Often there are multiple detections for the same object; choose the one with the highest confidence and remove or down-weigh the others.
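The greedy remove-overlaps variant could be sketched as follows (the down-weighting variant, soft-NMS, would rescale scores instead of discarding boxes):

```python
def nms(boxes, scores, t_iou=0.5):
    """Greedy non-maximum suppression over (x, y, w, h) boxes.
    Returns indices of kept boxes, highest-confidence first."""
    def iou(a, b):
        ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = ix * iy
        return inter / (a[2] * a[3] + b[2] * b[3] - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)             # highest-confidence surviving box wins
        # Remove remaining boxes that overlap the winner too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < t_iou]
    return keep
```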

@Define the technique of cascaded classifiers in the context of object detection, and describe why it is useful.


  • Train a sequence of models, each getting more complicated, that can decide (with low precision but high recall) whether an image contains the object.
  • Useful since sliding window classifiers can be very slow.
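The control flow is just early rejection; a minimal sketch, where each stage is a hypothetical predicate ordered cheapest first:

```python
def cascade_classify(x, stages):
    """Cascaded classifiers: each stage is a (cheap -> expensive) predicate
    tuned for high recall. Reject as soon as any stage says 'background',
    so most windows never reach the expensive later stages."""
    for stage in stages:
        if not stage(x):
            return False      # early reject: later stages never run
    return True               # survived every stage: report a detection
```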

@Define the technique of object proposals in the context of computer vision, and describe why it is useful.


  • Have some algorithm / model suggest possible bounding boxes, and then run the classifier over these bounding boxes.
  • Useful since it prevents the need to slide the classifier over the whole image.

@Visualise what the selective search algorithm for object proposal looks like when applied to some image.


What goes wrong with the following approach to object detection?

  1. Compute proposals with selective search
  2. For each proposal, pass it to a classifier trained on ImageNet
  3. If the class probability exceeds some threshold, then count as a match

A classifier trained on ImageNet has no background class, so you get lots of false positives.

What was the architecture of R-CNN?

  • How were regions selected?
  • What model was used for features?
  • What model classified the features?

  • Regions: found via selective search
  • Features: AlexNet fine-tuned on particular problem
  • Classification: SVM

@Define the technique of bounding box regression in computer vision.


A model using an object proposal algorithm (e.g. selective search) modifies the proposed region during the prediction in order to improve localisation.
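One common parameterisation (the R-CNN-style targets, assuming here that boxes are given by their centre plus width and height) can be sketched as:

```python
import math

def bbox_regression_targets(proposal, gt):
    """Regression targets (t_x, t_y, t_w, t_h) mapping a proposal box onto a
    ground-truth box; boxes are (x, y, w, h) with (x, y) the box centre."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph,     # centre offsets, scale-invariant
            math.log(gw / pw), math.log(gh / ph))  # log-space size ratios

def apply_bbox_regression(proposal, t):
    """Inverse transform: apply predicted offsets to a proposal."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

The two functions are inverses: encoding a ground-truth box against a proposal and then decoding recovers the ground truth.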

What were the pros and cons of R-CNN?


  • Pro: More accurate than previous approaches using e.g. HoGs
  • Pro: Any deep architecture can be plugged in
  • Con: Not a single end-to-end system
  • Con: Slow training
  • Con: Slow inference, you have to run a CNN for every object proposal

One of the main problems with R-CNN is that you had to run a CNN for every region proposal. What was the insight of Fast R-CNN?


CNNs preserve local information, so you can instead:

  • Do one pass over the entire image
  • Map the region proposals back onto the feature map
  • Use a pooling layer to convert these pieces of the feature map corresponding to region proposals into a fixed-size format
  • Use a fully-connected layer to classify and update bounding boxes

@Visualise the architecture of Fast R-CNN.


What is the purpose of the RoI pooling layer?


The Conv5 feature maps are arbitrarily sized, so they can’t be passed directly to a fully-connected layer; RoI pooling converts each region into a fixed-size representation.

How does the RoIPool transform calculate the values for the fixed-size $2 \times 2$ grid?


The feature value for each output cell is the max over the input cells it covers. The input cells are obtained by splitting the snapped RoI into (possibly uneven) bins using integer division.
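A numpy sketch of this, assuming the RoI has already been snapped to integer feature-map coordinates:

```python
import numpy as np

def roi_pool(feat, roi, out=2):
    """RoIPool: max-pool an RoI on a 2-D feature map into an out x out grid.
    roi: (x0, y0, x1, y1) in integer feature-map coordinates. Bin edges use
    integer division, so bins can be uneven when out doesn't divide the size."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            ys, ye = y0 + i * h // out, y0 + (i + 1) * h // out
            xs, xe = x0 + j * w // out, x0 + (j + 1) * w // out
            pooled[i, j] = feat[ys:ye, xs:xe].max()   # max over the bin
    return pooled
```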

@Define what is meant by a region proposal network (RPN) in the context of object detection?


A small network which slides over the feature map and, for each of a set of fixed-size anchor boxes, predicts whether it is likely to contain an object.

What is meant by a “two-stage detector” in the context of object detection, and what is the name of the technique for a “one-stage detector”?


In two-stage object detectors, we:

  • Create proposals
  • Classify and update proposals

In one-stage detectors, “you only look once”.

@Visualise the architecture for object detection with YOLO, and describe in words roughly how it works.


  • Divide image into coarse grid
  • Predict class label and a few candidate bounding boxes for each grid cell
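Decoding such a grid back into image-space boxes could be sketched as follows; the tensor layout (one box of $(x, y, w, h, \text{conf})$ per cell, no class scores) is a simplification for illustration:

```python
import numpy as np

def decode_yolo_grid(pred, img_size=64, t_conf=0.5):
    """Decode an (S, S, 5) grid of per-cell predictions into boxes.
    x, y are the box centre relative to the cell; w, h are relative
    to the whole image."""
    S = pred.shape[0]
    cell = img_size / S
    boxes = []
    for i in range(S):          # grid row
        for j in range(S):      # grid column
            x, y, w, h, conf = pred[i, j]
            if conf < t_conf:
                continue
            # Cell-relative centre -> absolute image coordinates
            cx, cy = (j + x) * cell, (i + y) * cell
            boxes.append((cx, cy, w * img_size, h * img_size, conf))
    return boxes
```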


