Section 03

The idea: every word talks to every word, simultaneously

Attention Is All You Need 2017

Imagine a classroom discussion. A teacher asks a question and, in the old way (the RNN way), students must answer one at a time: student 1 speaks, then student 2, then student 3. Each student hears what the previous ones said before speaking. The classroom can only process one person at a time.

Now imagine a different style: the teacher says “everyone compare notes with everyone else — simultaneously.” All students turn to face all other students, ask their questions, and receive answers at the same time. The room processes everything in parallel. By the end of one round, every student knows what every other student said.

This is self-attention. And this simultaneous comparison is what the Transformer makes possible.


Self-attention: every position attends to every position

In Bahdanau’s model, attention was cross-attention: the decoder (in one language) attended to the encoder (in another language). Self-attention is different: every position in a sequence attends to every position in the same sequence, itself included.

When encoding “The cat sat on the mat,” self-attention lets “sat” directly look at “cat” (who is doing the sitting) and “mat” (where the sitting happens). Both relationships are captured in a single operation, not through a chain of hidden states.

This is powerful for language because meaning is non-local. “The bank collapsed after the heavy rain” — does “bank” mean a financial institution or a riverbank? The answer lies in “rain” and “collapsed,” which might be several words away. Self-attention connects them directly.


Three roles: Query, Key, and Value

Self-attention works by giving every word three different vector representations:

  • Query (Q) — what this word is looking for. “I am word 3. What do I need to know about my surroundings?”
  • Key (K) — what this word advertises about itself. “I am word 7. This is what I contain.”
  • Value (V) — what this word actually contributes when selected. “If you attend to me, here is the information I send you.”

The analogy: think of a library. Your Query is the search term you type. Each book’s Key is its title and index entry (how it describes itself). The Value is the actual content of the book. You search with your Query, find the books whose Keys best match, and retrieve their Values.

The Query-Key match determines how much attention is paid. The Value is what is actually retrieved.
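The library analogy can be made concrete with a toy "soft lookup". This is only a sketch in plain Python: the three books, their two-dimensional key and value vectors, and the function name soft_lookup are all made up for illustration. A dictionary returns exactly one entry, whereas attention returns a blend of every Value, weighted by how well each Key matches the Query.

```python
import math

# Hypothetical toy "library": every book has a key vector and a value vector.
keys = {
    "cats":    [1.0, 0.0],
    "rivers":  [0.0, 1.0],
    "finance": [-1.0, 0.0],
}
values = {
    "cats":    [10.0, 0.0],
    "rivers":  [0.0, 10.0],
    "finance": [-10.0, 0.0],
}

def soft_lookup(query):
    """Return (match weight per book, blend of all values under those weights)."""
    names = list(keys)
    # Query-Key match: one dot product per book.
    scores = [sum(q * k for q, k in zip(query, keys[n])) for n in names]
    # Softmax turns the scores into weights that are positive and sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Retrieval: a weighted sum of the Values, not a single winner.
    dim = len(values[names[0]])
    blended = [sum(w * values[n][d] for w, n in zip(weights, names))
               for d in range(dim)]
    return dict(zip(names, weights)), blended

weights, blended = soft_lookup([2.0, 1.0])  # a query that mostly matches "cats"
print(weights)
print(blended)
```

With this query, "cats" gets the largest weight, "rivers" a smaller one, and "finance" almost none, so the retrieved vector is mostly the "cats" Value with a little "rivers" mixed in.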


How the scores are computed

For each position i, its query vector qᵢ is compared with the key vector of every position j by taking a dot product:

score(i, j) = qᵢ · kⱼ

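A higher dot product means the query and key are more similar, so that position receives more attention. The scores for position i are collected into a row [score(i,1), score(i,2), …, score(i,T)], where T is the sequence length. Each score is divided by √dₖ, where dₖ is the dimension of the key vectors, so that large dot products do not push softmax toward putting nearly all the weight on a single position. The scaled row is then passed through softmax to give the attention weights αᵢⱼ for j = 1, …, T, which are positive and sum to 1.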
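To see what the division by √dₖ buys, the sketch below applies softmax to the same row of scores with and without scaling. The scores are made up, and dₖ = 64 is just an assumed key dimension for the example.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that are positive and sum to 1."""
    top = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                                 # assumed key dimension (illustrative)
raw_scores = [16.0, 8.0, 0.0, -8.0]      # made-up dot products for one query

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 3) for w in unscaled])   # almost all weight on the first position
print([round(w, 3) for w in scaled])     # weight spread across several positions
```

Without scaling, the first position takes essentially all of the attention and the others contribute almost nothing. With scaling, the ranking is unchanged but the other positions still get a meaningful share.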

Finally, the output for position i is a weighted sum of all value vectors:

outputᵢ = Σⱼ αᵢⱼ · vⱼ

Every position does this simultaneously. In matrix form, it is a handful of matrix multiplications — exactly what GPUs are built for.
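In matrix form, the whole computation is softmax(QKᵀ / √dₖ) · V. Here is a minimal NumPy sketch of that formula for a single head. The function names, the random weights, and the sizes (T = 6, d_model = 16, dₖ = 8) are assumptions for illustration; in a trained model the projection matrices are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, subtracting the max first for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence.

    X: (T, d_model) input vectors, one row per position.
    W_q, W_k, W_v: (d_model, d_k) projection matrices (learned in practice).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T): score(i, j) for every pair
    alpha = softmax(scores, axis=-1)     # row i holds the weights for position i
    return alpha @ V, alpha              # outputs (T, d_k), weights (T, T)

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 16, 8               # illustrative sizes, not the paper's
X = rng.normal(size=(T, d_model))        # stand-in for word vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, alpha = self_attention(X, W_q, W_k, W_v)
print(out.shape, alpha.shape)            # (6, 8) (6, 6)
```

Notice that there is no loop over positions: all T × T scores come out of one matrix product, and all T outputs out of another.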


Multi-head attention: multiple perspectives at once

A single attention operation might learn to focus on one type of relationship — say, syntactic subject-verb agreement. But language has many types of structure simultaneously: who is doing the action, where, with what object, in what tone.

The Transformer uses multi-head attention: instead of one attention computation, it runs h independent attention operations in parallel (the paper uses h = 8). Each “head” uses its own Q, K, V projection matrices and learns to attend to different aspects of the sentence.

The outputs from all h heads are concatenated and projected back to the original dimension. The model thus looks at the sentence through 8 simultaneous lenses, each potentially learning something different.
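A rough NumPy sketch of the split, attend, concatenate, project sequence follows. It uses h = 8 as in the paper, but the model width of 64, the random weights, and the helper names are assumptions for illustration. One full-width projection cut into h slices is used here, which is equivalent to giving each head its own smaller projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, subtracting the max first for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(M, h):
    """Reshape (T, d_model) into (h, T, d_model // h): one slice per head."""
    T, d_model = M.shape
    return M.reshape(T, h, d_model // h).transpose(1, 0, 2)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Multi-head self-attention; all weight matrices are (d_model, d_model)."""
    T, d_model = X.shape
    d_k = d_model // h
    Q, K, V = (split_heads(X @ W, h) for W in (W_q, W_k, W_v))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, T, T)
    alpha = softmax(scores, axis=-1)                        # per-head weights
    heads = alpha @ V                                       # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)   # glue heads together
    return concat @ W_o, alpha                              # project back

rng = np.random.default_rng(0)
T, d_model, h = 6, 64, 8          # h = 8 as in the paper; other sizes illustrative
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

out, alpha = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape, alpha.shape)     # (6, 64) (8, 6, 6)
```

The second shape shows the point of the design: there are 8 separate 6 × 6 attention patterns, one per head, while the output has the same shape as the input.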


Positional encoding: injecting order without recurrence

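Here is a subtlety. Self-attention, as described, has no built-in sense of order: shuffle the input words and each word still receives exactly the same output vector, just at its new position. If order were ignored, “Cat eats fish” and “Fish eats cat” would give every word the same representation in both sentences, even though the two sentences mean different things.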
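This can be checked numerically. In the sketch below, the three "word vectors" and the projection matrices are random placeholders (assumptions for illustration, not real embeddings). The same attention function is run on the words in two different orders, and the output for "cat" is compared.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, subtracting the max first for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention, with no positional input."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    alpha = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return alpha @ V

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Hypothetical embeddings for "cat", "eats", "fish" (random, for illustration).
cat, eats, fish = rng.normal(size=(3, d_model))

out_a = self_attention(np.stack([cat, eats, fish]), W_q, W_k, W_v)
out_b = self_attention(np.stack([fish, eats, cat]), W_q, W_k, W_v)

# "cat" is first in one sentence and last in the other.
same_cat = np.allclose(out_a[0], out_b[2])
print(same_cat)
```

The two outputs for "cat" agree up to floating-point rounding, so nothing in the attention computation itself records whether "cat" came first or last.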

RNNs knew word order naturally — they processed words left to right. Without recurrence, the Transformer must add positional information explicitly.

The solution: before feeding words to the first attention layer, add to each word’s vector a positional encoding, a fixed vector that depends only on the word’s position. The paper uses sinusoidal functions: sine and cosine waves at different frequencies for different dimensions. Position 1 gets a specific vector, position 2 a different one, and so on.

Once the positional encodings are added, the same word at two different positions enters the model as two different vectors, so the model can tell positions apart and attention patterns can depend on position.
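The paper's sinusoidal encoding sets the even dimensions to sin(pos / 10000^(2i / d_model)) and the odd dimensions to cos of the same angle, where i indexes the sine/cosine pair. A small NumPy sketch is given below. The function name and the sizes are assumptions for illustration, and positions are counted from 0 rather than 1.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table; row `pos` encodes position `pos`.

    Assumes d_model is even. Positions are 0-indexed.
    """
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

max_len, d_model = 50, 16                             # illustrative sizes
pe = sinusoidal_positional_encoding(max_len, d_model)

# Placeholder word vectors: the same "word" repeated at every position.
X = np.ones((max_len, d_model))
X_with_pos = X + pe                                   # added, not concatenated

print(pe.shape)                                       # (50, 16)
```

Even though every row of X is identical here, every row of X_with_pos is different, which is what lets later layers tell positions apart.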


The full architecture in brief

The Transformer is an encoder-decoder model in which the encoder and the decoder each stack N = 6 identical layers.

Each encoder layer has two sub-layers:

  1. Multi-head self-attention (every position attends to every other)
  2. Feed-forward network (a small 2-layer MLP applied independently at each position)

Each sub-layer is wrapped in a residual connection and layer normalisation:

output = LayerNorm(x + SubLayer(x))
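As a sketch of this wrapper, the NumPy code below applies it to the feed-forward sub-layer. The layer normalisation here omits the learned gain and bias for brevity, and the sizes and random weights are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance.

    The learned gain and bias of a full LayerNorm are omitted in this sketch.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise 2-layer MLP with ReLU; each row is processed independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def residual_block(x, sublayer):
    """output = LayerNorm(x + SubLayer(x))"""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
T, d_model, d_ff = 4, 8, 32                    # illustrative sizes
x = rng.normal(size=(T, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = residual_block(x, lambda z: feed_forward(z, W1, b1, W2, b2))
print(out.shape)                               # (4, 8)
```

Because the sub-layer is passed in as a function, the same residual_block could wrap an attention sub-layer just as well. The only requirement is that the sub-layer returns the same shape it receives, so that x + SubLayer(x) is defined.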

Each decoder layer has three sub-layers:

  1. Masked multi-head self-attention (each position can attend only to itself and earlier positions, so there is no peeking at future words during training)
  2. Multi-head cross-attention (decoder queries attend to encoder outputs; the same role as Bahdanau’s attention, here scored with scaled dot products and computed in matrix form)
  3. Feed-forward network
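The masking in the first decoder sub-layer is commonly implemented by setting the scores of future positions to negative infinity before softmax, so that their weights come out as exactly zero. A minimal NumPy sketch, with random queries and keys and illustrative function names:

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, subtracting the max first for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(T):
    """(T, T) boolean matrix: True where position i may attend to j (j <= i)."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_attention_weights(Q, K):
    """Attention weights in which no position can see later positions."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(T), scores, -np.inf)   # block the future
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
T, d_k = 4, 8                                            # illustrative sizes
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))

alpha = masked_attention_weights(Q, K)
print(np.round(alpha, 2))                                # lower-triangular weights
```

The first row has a single nonzero entry (position 1 can only look at itself), and each later row spreads its weight over one more position than the row before.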

Stack 6 encoder layers. Stack 6 decoder layers. Train end-to-end on sentence pairs. That is the Transformer.
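To show how the pieces compose, here is a deliberately simplified NumPy sketch of the encoder side only. It is not the paper's full model: it uses single-head attention, drops biases and the learned LayerNorm parameters, and uses random placeholder weights, small made-up sizes, and hypothetical function names. It only illustrates N = 6 identical layers applied one after another.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, subtracting the max first for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Row-wise normalisation (learned gain and bias omitted in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, p):
    """Self-attention, then feed-forward, each wrapped as LayerNorm(x + SubLayer(x)).

    Single-head attention is used here to keep the sketch short.
    """
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V
    x = layer_norm(x + attn)
    ffn = np.maximum(0.0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ffn)

def init_layer(rng, d_model, d_ff):
    """Random placeholder weights; a real model learns these during training."""
    s = 1.0 / np.sqrt(d_model)
    return {
        "W_q": rng.normal(scale=s, size=(d_model, d_model)),
        "W_k": rng.normal(scale=s, size=(d_model, d_model)),
        "W_v": rng.normal(scale=s, size=(d_model, d_model)),
        "W1": rng.normal(scale=s, size=(d_model, d_ff)),
        "W2": rng.normal(scale=1.0 / np.sqrt(d_ff), size=(d_ff, d_model)),
    }

rng = np.random.default_rng(0)
N, T, d_model, d_ff = 6, 5, 16, 64     # N = 6 as in the paper; other sizes made up
layers = [init_layer(rng, d_model, d_ff) for _ in range(N)]

x = rng.normal(size=(T, d_model))      # stand-in for embeddings + positions
for p in layers:                       # the "stack": same layer shape, N times
    x = encoder_layer(x, p)
print(x.shape)                         # (5, 16)
```

Each layer takes a (T, d_model) array and returns one of the same shape, which is why the layers can be stacked without any glue code between them.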