Computer Vision MT25, Attention and transformers


Flashcards

What does it mean to say that the Transformer architecture is permutation-equivariant?


Permuting the inputs $x _ i$ permutes the outputs $y _ i$ in exactly the same way; the output values themselves are unchanged.

The Transformer architecture is permutation-equivariant, which means that changing the order of the inputs $x _ i$ will just change the order of the outputs $y _ i$. How do models where the location of the inputs matters resolve this?


They add a positional encoding

\[x _ i' = x _ i + P(i)\]

which is either learned by the network or is hardcoded.
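A minimal NumPy sketch of the hardcoded (sinusoidal) variant of $P(i)$, with the result added to token embeddings as $x _ i' = x _ i + P(i)$; the function name and shapes are illustrative, not from a particular library:

```python
import numpy as np

def positional_encoding(n_tokens, d):
    # Hardcoded sinusoidal encoding P(i); the alternative is a learned table.
    pos = np.arange(n_tokens)[:, None]        # (n_tokens, 1)
    dim = np.arange(d // 2)[None, :]          # (1, d/2)
    angles = pos / (10000 ** (2 * dim / d))
    P = np.zeros((n_tokens, d))
    P[:, 0::2] = np.sin(angles)               # even dims: sine
    P[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return P

# x' = x + P(i): each token embedding gets its position's encoding added
x = np.random.randn(4, 8)                     # 4 tokens, d = 8
x_prime = x + positional_encoding(4, 8)
```

Because each position $i$ gets a distinct vector, the model can distinguish otherwise identical tokens at different locations.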

@State the general inputs and outputs of a transformer, and what it means to do “self-attention”, and separately what these would be in an English-French translation task.


  • $N$ input tokens $x _ i \in \mathbb R^d$
    • English-French: This would be embeddings of English word fragments
  • $M$ condition tokens $z _ j \in \mathbb R^d$:
    • Self-attention: This would be the input again
    • English-French: This would be the predicted translation so far.
  • $N$ output tokens $y _ i \in \mathbb R^d$
    • English-French: Each vector would be a probability distribution over the words in the vocabulary. During training, each of the output tokens contributes to the loss separately. During inference, you only sample from the next output token.

How does multi-head self-attention change the normal attention setup?


Instead of one attention operation, $H$ heads each compute attention with their own learned projections; the per-head outputs are stacked (concatenated) and passed through a final linear layer to form the output.
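A minimal NumPy sketch of this stacking, assuming per-head projection matrices and a final output projection `Wo` (names illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_self_attention(X, heads, Wo):
    # heads: list of (Wq, Wk, Wv) tuples, one per head; each head attends
    # independently, and the per-head outputs are stacked then mixed by Wo.
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outs.append(A @ V)                           # (N, d_head)
    return np.concatenate(outs, axis=-1) @ Wo        # (N, H*d_head) @ (H*d_head, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                      # N = 5 tokens, d = 8
heads = [tuple(rng.standard_normal((8, 4)) for _ in range(3)) for _ in range(2)]
Wo = rng.standard_normal((8, 8))                     # 2 heads * d_head 4 = 8
Y = multi_head_self_attention(X, heads, Wo)
```

Each head can thus specialise in a different attention pattern while the output dimension stays $d$.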

@Describe and @visualise the steps that a vision transformer uses to classify images.


  • Split images into fixed-size patches
  • A linear layer maps each patch to a vector
  • Add a 2D positional encoding
  • Add an extra “class” token to the sequence of 256 patch tokens
  • Train a transformer on sequences of length 257
  • Use the encoding of the class token as input to a simple MLP classifier
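The token-preparation steps above can be sketched in NumPy for a 256×256 image with 16×16 patches (so 256 patches, 257 tokens with the class token); the function and weight names are hypothetical, and random matrices stand in for learned parameters:

```python
import numpy as np

def vit_tokens(image, patch=16, d=64):
    # Sketch of ViT token preparation only, not the full transformer.
    rng = np.random.default_rng(0)
    H, W, C = image.shape
    # 1. split the image into fixed-size patches and flatten each one
    patches = np.stack([image[r:r+patch, c:c+patch].reshape(-1)
                        for r in range(0, H, patch)
                        for c in range(0, W, patch)])   # (256, patch*patch*C)
    # 2. a linear layer maps each patch to a d-vector
    W_embed = rng.standard_normal((patches.shape[1], d))
    tokens = patches @ W_embed                          # (256, d)
    # 3. add a positional encoding (here a random stand-in for a learned one)
    tokens = tokens + rng.standard_normal(tokens.shape)
    # 4. prepend the extra "class" token -> sequence length 257
    cls = rng.standard_normal((1, d))
    return np.concatenate([cls, tokens], axis=0)        # (257, d)

seq = vit_tokens(np.zeros((256, 256, 3)))
# After the transformer, seq[0] (the class token's encoding) would be fed
# to the MLP classifier.
```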
