Paper - Attention Is All You Need (2017)
- Full title: Attention Is All You Need
- Author(s): Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Year: 2017
- Link: https://arxiv.org/abs/1706.03762
- Relevant for:
- Links to:
See also:
- “You Could Have Invented Transformers” by Gwern
- “Transformers, the tech behind LLMs” by 3Blue1Brown
- “Attention in transformers, step-by-step” by 3Blue1Brown
Summary
This very influential paper introduced the Transformer architecture:
The Transformer is an architecture for sequence modelling: the input is a sequence of tokens (such as words, patches of an image, or snippets of audio), and the sequences are not necessarily all of the same length. From this you want to produce an output, which might be:
- A sequence of outputs (e.g. for translation or text generation)
- A single output (e.g. for classification)
Previous sequence-modelling architectures (such as RNNs) had problems that made them difficult to train and caused them to struggle on complex tasks:
- They struggled to capture long-range dependencies in the inputs and outputs, because unrolling the recurrence over long sequences leads to vanishing or exploding gradients
- They were slow to train, because their recurrent design processes the sequence one step at a time and is therefore hard to parallelise
Transformers addressed these two issues by:
- Using an attention mechanism which models dependencies between positions directly, regardless of how far apart they are in the sequence
- Implementing this attention mechanism in a “flat” way: the whole sequence is processed with a fixed number of matrix multiplications and nonlinearities rather than a step-by-step recurrence, so it parallelises well (see the sketch below)
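To make the parallelism point concrete, here is a minimal NumPy sketch (a toy illustration of my own, not code from the paper) contrasting an RNN-style sequential hidden-state update with self-attention computed as a fixed number of matrix multiplications over the whole sequence (here unprojected and unscaled for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                      # made-up sequence length and model width
X = rng.normal(size=(n, d))      # one toy input sequence

# RNN-style: the hidden state is updated one step at a time,
# so the n steps cannot be computed in parallel.
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(n):               # sequential loop over time steps
    h = np.tanh(h @ W_h + X[t] @ W_x)

# Attention-style: one (n, n) matrix of pairwise scores covers every pair of
# positions at once, so the whole sequence is handled by a fixed number of
# matrix multiplications (the scaled, projected version is in the flashcard below).
scores = X @ X.T                                  # all pairwise dot products
scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ X                                 # every position attends to every position
```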
Flashcards
Suppose $X \in \mathbb R^{n \times d}$. How is the self-attention output calculated?
Apply three projections $W _ Q$, $W _ K$, $W _ V$ to $X$ to obtain $Q, K, V$, and then calculate
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d _ k}}\right) V\]
Intuitively:
- $Q K^\top$ computes the dot product between every query vector and every key vector, measuring how relevant each position is to each other position
- Dividing by $\sqrt{d _ k}$ stops the dot products from growing with the key dimension, which would push the softmax into regions with extremely small gradients
- The softmaxed scores are used as weights, so each position’s output is a weighted mixture of the value vectors (see the sketch below)
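A minimal single-head NumPy sketch of the formula above (the shapes, variable names, and single-head simplification are just illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for a single head.

    X: (n, d) sequence of n token embeddings of width d
    W_Q, W_K, W_V: (d, d_k) projection matrices
    """
    Q = X @ W_Q                           # (n, d_k) queries
    K = X @ W_K                           # (n, d_k) keys
    V = X @ W_V                           # (n, d_k) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) scaled query-key dot products
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # (n, d_k) weighted mixtures of the values

# Tiny usage example with made-up shapes
rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (5, 4)
```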