Paper - Attention Is All You Need (2017)
- Full title: Attention Is All You Need
- Author(s): Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Year: 2017
- Link: https://arxiv.org/abs/1706.03762
- Relevant for:
- Links to:
See also:
- “You Could Have Invented Transformers” by Gwern
- “Transformers, the tech behind LLMs” by 3Blue1Brown
- “Attention in transformers, step-by-step” by 3Blue1Brown
Summary
This very influential paper introduced the Transformer architecture:
The Transformer is an architecture for sequence modelling: the input is a sequence of tokens (such as words, patches of an image, or snippets of audio), and the sequences are not necessarily all of the same length. From this you want to produce an output, which might be:
- A sequence of outputs (e.g. for translation or text generation)
- A single output (e.g. for classification)
Previous sequence-modelling architectures (such as RNNs) had problems that made them difficult to train and caused them to struggle on complex tasks:
- They struggled to capture long-range dependencies in the inputs and outputs, because unrolling the recurrence over long sequences leads to vanishing or exploding gradients
- They were slow to train, because their recurrent design processes the sequence one step at a time and is therefore hard to parallelise
Transformers addressed these two issues by:
- Using an attention mechanism which models dependencies between positions directly, regardless of how far apart they are in the sequence
- Implementing this attention mechanism in a “flat” way: the whole sequence is processed with a fixed number of matrix multiplications and nonlinearities rather than a step-by-step recurrence, so it parallelises well (see the sketch below)
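To make the parallelism point concrete, here is a minimal NumPy sketch (a toy illustration of my own, not code from the paper) contrasting an RNN-style sequential hidden-state update with self-attention computed as a fixed number of matrix multiplications over the whole sequence (here unprojected and unscaled for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                      # made-up sequence length and model width
X = rng.normal(size=(n, d))      # one toy input sequence

# RNN-style: the hidden state is updated one step at a time,
# so the n steps cannot be computed in parallel.
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(n):               # sequential loop over time steps
    h = np.tanh(h @ W_h + X[t] @ W_x)

# Attention-style: one (n, n) matrix of pairwise scores covers every pair of
# positions at once, so the whole sequence is handled by a fixed number of
# matrix multiplications (the scaled, projected version is in the flashcard below).
scores = X @ X.T                                  # all pairwise dot products
scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ X                                 # every position attends to every position
```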
Flashcards
Suppose $X \in \mathbb R^{n \times d}$. How is the self-attention output calculated?
Apply three projections $W _ Q$, $W _ K$, $W _ V$ to $X$ to obtain $Q, K, V$, and then calculate
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d _ k}}\right) V\]
Intuitively:
- $Q K^\top$ computes the dot product between every query vector and every key vector, measuring how relevant each position is to each other position
- Dividing by $\sqrt{d _ k}$ stops the dot products from growing with the key dimension, which would push the softmax into regions with extremely small gradients
- The softmaxed scores are used as weights, so each position’s output is a weighted mixture of the value vectors (see the sketch below)
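A minimal single-head NumPy sketch of the formula above (the shapes, variable names, and single-head simplification are just illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for a single head.

    X: (n, d) sequence of n token embeddings of width d
    W_Q, W_K, W_V: (d, d_k) projection matrices
    """
    Q = X @ W_Q                           # (n, d_k) queries
    K = X @ W_K                           # (n, d_k) keys
    V = X @ W_V                           # (n, d_k) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) scaled query-key dot products
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # (n, d_k) weighted mixtures of the values

# Tiny usage example with made-up shapes
rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (5, 4)
```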