# Paper - Attention Is All You Need (2017)

> Source: https://ollybritton.com/notes/uni/part-c/mt25/theories-of-deep-learning/reading/paper-attention-is-all-you-need-2017/ · Updated: 2026-05-10 · Tags: uni, notes

- **Full title**: *Attention Is All You Need*
- **Author(s)**: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- **Year**: 2017
- **Link**: https://arxiv.org/abs/1706.03762
- **Relevant for**:
	- [Course - Theories of Deep Learning MT25](https://ollybritton.com/notes/uni/part-c/mt25/theories-of-deep-learning/)
	- [Course - Computer Vision MT25](https://ollybritton.com/notes/uni/part-c/mt25/computer-vision/)
- **Links to**:
	- [Notes - Computer Vision MT25, Attention and transformers](https://ollybritton.com/notes/uni/part-c/mt25/computer-vision/notes/attention-and-transformers/)
- **See also**:
	- ["You Could Have Invented Transformers"](https://gwern.net/blog/2025/you-could-have-invented-transformers) by Gwern
	- ["Transformers, the tech behind LLMs"](https://www.3blue1brown.com/?v=gpt) by 3Blue1Brown
	- ["Attention in transformers, step-by-step"](https://www.3blue1brown.com/?v=attention) by 3Blue1Brown

### Summary
This very influential paper introduced the Transformer architecture:

![Screenshot 2025-11-05 at 13.47.15.png](https://ollybritton.com/assets/attachments/img/Screenshot 2025-11-05 at 13.47.15.png)

The Transformer is an architecture for sequence modelling. The input is a sequence of some sort, such as a list of words, patches of an image, or snippets of audio, and the sequences need not all be the same length. The goal is then to produce an output, which might be:

- A sequence of outputs (e.g. for translation or text generation)
- A single output (e.g. for classification)

Previous sequence modelling architectures (such as RNNs) had some problems which made them difficult to train and caused them to struggle at complex tasks:

- They struggled to capture long-range dependencies in the inputs and outputs, because gradients tend to vanish or explode once the network is unrolled over many time steps
- They were slow to train, since each step depends on the previous hidden state, which makes the computation hard to parallelise across the sequence
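
To make both points concrete, here is a sketch using a generic vanilla RNN. The symbols $h_t$, $x_t$, $W$, $U$, $f$ and the loss $\mathcal L_T$ are illustrative assumptions, not notation from the paper. The hidden state is updated as

$$
h_t = f(W h_{t-1} + U x_t)
$$

so by the chain rule, the gradient of a loss at step $T$ with respect to an early state $h_1$ contains a product of $T - 1$ Jacobians:

$$
\frac{\partial \mathcal L_T}{\partial h_1} = \frac{\partial \mathcal L_T}{\partial h_T} \, \frac{\partial h_T}{\partial h_{T-1}} \, \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_2}{\partial h_1}
$$

If the norms of these Jacobians are typically below 1, the product shrinks roughly exponentially in $T$ (vanishing gradients); if they are typically above 1, it can blow up (exploding gradients). The same recurrence also shows why the computation is sequential: $h_t$ cannot be computed before $h_{t-1}$.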

Transformers addressed these two issues by:

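- Using an attention mechanism, which relates any two positions in the sequence directly, regardless of how far apart they are
- Implementing this attention mechanism in a "flat" way using a fixed number of nonlinearities and matrix multiplications, so that all positions can be processed in parallel rather than one step at a time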

### Flashcards
Suppose $X \in \mathbb R^{n \times d}$. How is the self-attention matrix calculated?::

Apply three learned linear projections $W_Q$, $W_K$, $W_V$ to $X$ to obtain $Q = XW_Q$, $K = XW_K$, $V = XW_V$, and then calculate

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

Intuitively:

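- $Q K^\top$ computes the dot product between every query vector and every key vector; entry $(i, j)$ scores how relevant position $j$ is to position $i$
- Dividing by $\sqrt{d_k}$ (where $d_k$ is the dimension of the queries and keys) stops the dot products from growing large as $d_k$ increases, which would push the softmax into saturated regions where gradients are extremely small
- The softmax turns each row of scores into weights that sum to 1, and multiplying by $V$ makes each output row a weighted average of the value vectors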
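
The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the function names `softmax` and `self_attention`, the random weights, and the sizes `n`, `d`, `d_k`, `d_v` are all assumptions chosen for the example, and masking, multiple heads and batching are left out.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Subtract the row-wise max before exponentiating, for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    Q = X @ W_Q  # queries, shape (n, d_k)
    K = X @ W_K  # keys,    shape (n, d_k)
    V = X @ W_V  # values,  shape (n, d_v)
    d_k = Q.shape[-1]
    # Entry (i, j) scores how relevant position j is to position i.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # shape (n, n)
    # Each output row is a weighted average of the rows of V.
    return weights @ V, weights


# Hypothetical sizes and random weights, purely for demonstration.
rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 8, 8
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

out, weights = self_attention(X, W_Q, W_K, W_V)
print(out.shape)             # one output vector per input position
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Note that the whole computation is a handful of matrix products and one row-wise softmax, with no loop over sequence positions.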

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt