Computer Vision MT25, Vision and language
Flashcards
What optimisation task determines the tokens used by a language model?
- Have a fixed budget of $N$ tokens (the vocabulary size)
- Want to find an assignment of strings to tokens that minimises the total number of tokens needed to represent the whole training data
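
Solving this optimisation exactly is expensive, so in practice it is usually approximated greedily. The sketch below uses byte-pair encoding (BPE) as one such heuristic; the notes above do not name a specific algorithm, so treat BPE here as an illustrative choice. The function name `train_bpe`, the whitespace pre-splitting, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def train_bpe(corpus: str, budget: int) -> tuple[list[str], int]:
    """Greedy BPE sketch: grow the vocabulary until it holds `budget` tokens.

    Returns the vocabulary and the total number of tokens needed to
    represent the corpus with it (the quantity being minimised).
    """
    # Assumption: split on whitespace and start from single characters.
    words = Counter(tuple(w) for w in corpus.split())
    vocab = sorted({ch for word in words for ch in word})

    # If the budget is already used up by single characters, no merges happen.
    while len(vocab) < budget:
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token

        # Merging the most frequent pair removes the most tokens in one step.
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.append(merged)

        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

    total_tokens = sum(len(word) * freq for word, freq in words.items())
    return vocab, total_tokens


if __name__ == "__main__":
    text = "low low low lower lowest"
    for n in (7, 9):
        _, total = train_bpe(text, n)
        print(n, total)
```

Each merge is only locally optimal, so the resulting vocabulary is not guaranteed to minimise the total token count for a given $N$; it is simply a cheap approximation to the objective stated above.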
@Visualise the architecture that incorporates a reward model into an LM.

What is the “referring expressions” task in vision models?
Given a description (e.g. text) of some part of a scene, locate it in the image.
What is the “visual question answering” task in vision models?
Given an image and a question about it, predict the answer.
What is “visual grounding” in the context of vision models?
Making the model's output refer directly to parts of the image, e.g. by describing regions as text or by outputting special region embeddings.
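
To make the "regions as text" option concrete, here is a minimal sketch of one way a bounding box could be written as a string a language model can emit. Everything about the format is an assumption for illustration: the `<loc_dddd>` token spelling, the 1000 bins per axis, the `(x0, y0, x1, y1)` pixel order, and the helper names `box_to_text` and `text_to_box`. Real models each use their own conventions.

```python
import re


def box_to_text(box, width, height, bins=1000):
    """Quantise a pixel box (x0, y0, x1, y1) into four location tokens."""
    sizes = (width, height, width, height)
    tokens = []
    for coord, size in zip(box, sizes):
        # Map the pixel coordinate to a bin index in [0, bins - 1].
        idx = min(bins - 1, max(0, int(coord * bins / size)))
        tokens.append(f"<loc_{idx:04d}>")
    return "".join(tokens)


def text_to_box(text, width, height, bins=1000):
    """Invert box_to_text, returning the centre of each bin in pixels."""
    idx = [int(m) for m in re.findall(r"<loc_(\d+)>", text)]
    if len(idx) != 4:
        raise ValueError("expected exactly four location tokens")
    sizes = (width, height, width, height)
    return tuple((i + 0.5) / bins * size for i, size in zip(idx, sizes))


if __name__ == "__main__":
    text = box_to_text((64, 48, 320, 240), width=640, height=480)
    print(text)
    print(text_to_box(text, width=640, height=480))
```

Because the box becomes ordinary tokens, the same decoder that writes an answer can also point at the region it is talking about, at the cost of quantisation error of up to one bin per coordinate.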