Computer Vision MT25, Video


Flashcards

One problem with video processing models is that videos are too big. @Visualise and @describe one way that this is avoided at training time versus test time.


You can train on short clips with low resolution and low FPS, and then at test time run the model on overlapping clips and average the predictions.
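The test-time half of this can be sketched as follows. This is a minimal illustration, not a reference implementation: `clip_starts`, `predict_video`, the `clip_model` callable, and the default clip length and stride are all hypothetical names and values chosen for the example.

```python
import numpy as np

def clip_starts(num_frames, clip_len, stride):
    """Start indices of overlapping clips that together cover the video."""
    if num_frames <= clip_len:
        return [0]
    starts = list(range(0, num_frames - clip_len + 1, stride))
    # Add a final clip if the regular stride leaves the last frames uncovered.
    if starts[-1] != num_frames - clip_len:
        starts.append(num_frames - clip_len)
    return starts

def predict_video(video, clip_model, clip_len=16, stride=8):
    """Average clip-level class probabilities over overlapping clips.

    video: array of shape (T, H, W, C).
    clip_model: stand-in for a trained model; maps one clip to a
    probability vector over classes.
    """
    probs = [clip_model(video[s:s + clip_len])
             for s in clip_starts(len(video), clip_len, stride)]
    return np.mean(probs, axis=0)
```

With a stride smaller than the clip length, neighbouring clips share frames, so every part of the video contributes to the averaged prediction.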

How might a per-frame model classify the activity contained in a video?


  • Predict for each frame independently with an image model
  • Average the per-frame class probabilities to get a video-level prediction
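The two steps above can be sketched as below, assuming the image model has already produced one row of class scores per frame. The function names and the use of a softmax to turn scores into probabilities are assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_video_per_frame(frame_logits):
    """frame_logits: (T, num_classes), one row of image-model scores per frame."""
    frame_probs = softmax(frame_logits)     # (T, num_classes)
    video_probs = frame_probs.mean(axis=0)  # (num_classes,)
    return int(video_probs.argmax()), video_probs
```

Note that the frames are treated as an unordered set here: shuffling them would give the same prediction.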

@Define the object-bias of action recognition.


The observation that many actions can be classified from a single frame of the video, because the objects and scene visible in that frame already give the activity away.

@Visualise how “late fusion” would classify the activity in this image.


@State some fusion mechanisms other than concatenation.


  • Pooling
  • Averaging
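As a rough sketch, the fusion step can be written as one function over a matrix of per-frame feature vectors. The function name `fuse`, the option strings, and the choice of max pooling as the pooling variant are assumptions for the example.

```python
import numpy as np

def fuse(frame_features, how="concat"):
    """Combine per-frame features of shape (T, D) into one video-level vector."""
    if how == "concat":
        return frame_features.reshape(-1)   # (T * D,), grows with T
    if how == "average":
        return frame_features.mean(axis=0)  # (D,)
    if how == "max_pool":
        return frame_features.max(axis=0)   # (D,)
    raise ValueError(f"unknown fusion: {how}")
```

Averaging and pooling produce a fixed-size vector whatever the number of frames, whereas the concatenated vector grows with $T$ and so ties the classifier to one clip length.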

What is the primary problem with late fusion?


Low-level motion is often lost after compressing a frame into a feature vector.

@Visualise how “early fusion” would classify the activity in this image.


How could you implement “slow fusion” in a CNN?


Use 3D convolutions so that temporal information is compressed gradually, layer by layer, rather than all at once.
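To see what "slowly" means in terms of shapes, the sketch below tracks only the temporal length of the feature map through a stack of 3D convolutions, using the standard convolution output-size formula. The layer settings in the usage note are made up for illustration and are not taken from any particular architecture.

```python
def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1

def temporal_lengths(num_frames, layers):
    """Temporal length after each layer.

    layers: list of (temporal_kernel, temporal_stride) pairs, one per
    3D convolution (hypothetical settings supplied by the caller).
    """
    lengths = [num_frames]
    for kernel, stride in layers:
        lengths.append(conv_out_len(lengths[-1], kernel, stride))
    return lengths
```

For example, `temporal_lengths(10, [(4, 2), (2, 2), (2, 1)])` gives `[10, 4, 2, 1]`: each layer only sees a few neighbouring time steps, and the whole clip is merged only at the last layer.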

@Visualise the architecture used for action recognition with optical flow.


In “Two-stream convolutional networks for action recognition in videos”, we classify actions in videos using an architecture like the following:

Can you remember what the inputs to both of the two streams are?


  • Single input image in the first stream: $3 \times H \times W$
  • Stack of optical flow fields in the $x$ and $y$ directions between consecutive frames, across all $T$ frames in the video: $[2 \times (T - 1)] \times H \times W$.
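The two input shapes can be checked with a short sketch. Here `flow_fn` is a hypothetical placeholder for any optical flow estimator that returns a `(2, H, W)` array of $x$ and $y$ displacements, and taking the middle frame as the single image is an assumption made for the example.

```python
import numpy as np

def two_stream_inputs(video, flow_fn):
    """Build the inputs for the two streams.

    video: (T, 3, H, W) array of RGB frames.
    flow_fn(prev_frame, next_frame) -> (2, H, W) flow field (placeholder).
    """
    T = video.shape[0]
    # Appearance stream: one RGB frame (middle frame chosen here as an assumption).
    appearance = video[T // 2]                        # (3, H, W)
    # Motion stream: T - 1 flow fields stacked along the channel axis.
    flows = [flow_fn(video[t], video[t + 1]) for t in range(T - 1)]
    motion = np.concatenate(flows, axis=0)            # (2 * (T - 1), H, W)
    return appearance, motion
```

There are $T - 1$ flow fields because each one is computed between a pair of consecutive frames.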

@Visualise how a transformer and a CNN together could be used for action recognition in a video over a long temporal context.


@Visualise how a pure transformer architecture could be used for video understanding over a long temporal context. What are the names given to the tokens in this context? What do they represent intuitively?


The tokens are spatio-temporal tokens: each token comes from a patch in space and time.
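A minimal sketch of cutting a video into such tokens with plain array reshaping is given below. The function name and patch-size parameters are assumptions, and the learned linear projection that would normally turn each flattened patch into an embedding is left out.

```python
import numpy as np

def spatiotemporal_tokens(video, pt, ph, pw):
    """Split a video into non-overlapping space-time patches.

    video: (T, H, W, C); pt, ph, pw: patch size in time, height, width.
    Returns an array of shape (num_tokens, pt * ph * pw * C).
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("patch sizes must divide the video dimensions")
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)
```

Each row covers `pt` frames and a `ph` by `pw` spatial window, so a single token already carries a small amount of motion information.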

What is the name given to bounding boxes that change over time?


Tubelets.

@Define a tubelet.


A bounding box extended through time: a sequence of boxes, one per frame, that follows the same object or actor in a spatio-temporal detection model.
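One way to picture this as a data structure is a mapping from frame index to box plus a label. The class below is a hypothetical sketch for illustration; its name, fields, and box format are assumptions rather than the output format of any specific detector.

```python
class Tubelet:
    """A labelled sequence of per-frame bounding boxes for one object or actor."""

    def __init__(self, label, boxes):
        self.label = label
        # boxes: dict mapping frame index -> (x1, y1, x2, y2)
        self.boxes = dict(boxes)

    def span(self):
        """First and last frame index covered by the tubelet."""
        frames = sorted(self.boxes)
        return frames[0], frames[-1]

    def box_at(self, frame):
        """Box in the given frame, or None if the tubelet is absent there."""
        return self.boxes.get(frame)
```

For example, `Tubelet("wave", {3: (0, 0, 1, 1), 4: (0, 0, 2, 2)}).span()` returns `(3, 4)`.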



