4. The math — scaled dot-product attention and multi-head attention

🔴 Advanced undergrad. This section uses matrix multiplication and the transpose. Read Matrix Multiplication, Matrix Transpose, and Softmax Function first.

Scaled dot-product attention

The core formula of the Transformer:

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

Let’s dissect every piece.

Q, K, V — what they are:

Suppose you have a sequence of T words, each represented as a d_model-dimensional vector. Pack these into a matrix X of shape (T × d_model).

Three learned weight matrices project X into three different spaces:

Q = X · W^Q       shape: (T × dₖ)    ← query projections
K = X · W^K       shape: (T × dₖ)    ← key projections
V = X · W^V       shape: (T × dᵥ)    ← value projections

W^Q, W^K are (d_model × dₖ) and W^V is (d_model × dᵥ). In the original paper, dₖ = dᵥ = d_model / h = 512 / 8 = 64.

Q · Kᵀ — the attention score matrix:

Q  shape: (T × dₖ)
Kᵀ shape: (dₖ × T)      ← K transposed (see Matrix Transpose tutorial)

Q · Kᵀ shape: (T × T)

Entry [i, j] of Q · Kᵀ is the dot product of query i with key j: how well “what position i is looking for” matches “what position j is offering.” This (T × T) matrix holds all pairwise attention scores.

Dividing by √dₖ — scaling:

Why divide by √dₖ? When dₖ is large, the dot products grow large in magnitude. Feeding large numbers through softmax produces very sharp distributions (one value near 1, all others near 0), which causes vanishingly small gradients. Dividing by √dₖ keeps the scores in a moderate range.

Intuition: If each component of q and k is independently drawn from a distribution with mean 0 and variance 1, their dot product has mean 0 and variance dₖ. Dividing by √dₖ gives variance 1 — stable no matter how large dₖ is.
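The variance claim can be checked empirically. The sketch below is only an illustration: it assumes standard normal components, and the function name, trial count, and seed are arbitrary choices, not anything from the paper.

```python
import random
import statistics

def dot_product_variance(d_k, trials=4000, seed=0):
    """Empirical variance of q·k when all components are i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)

for d_k in (4, 64):
    raw = dot_product_variance(d_k)
    # Dividing each dot product by sqrt(d_k) divides its variance by d_k.
    print(d_k, round(raw, 2), round(raw / d_k, 2))
```

The raw variance should come out close to dₖ and the scaled variance close to 1, up to sampling noise.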

softmax — row-wise:

softmax( Q · Kᵀ / √dₖ )   shape: (T × T)

Softmax is applied to each row independently. Row i is a probability distribution over all T positions, including position i itself: “from position i’s perspective, how much should I attend to each position?” Every row sums to 1.
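A minimal sketch of row-wise softmax in plain Python follows. Subtracting the row maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an addition here, not something the formula above requires. The input scores are made up for illustration.

```python
import math

def softmax(row):
    """Softmax of one row; subtracting the max avoids overflow in exp."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_rows(matrix):
    """Apply softmax to every row independently."""
    return [softmax(row) for row in matrix]

# Hypothetical scores: two rows, three positions each.
weights = softmax_rows([[0.0, 1.0, 2.0], [5.0, 5.0, 5.0]])
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

Equal scores, as in the second row, give a uniform distribution.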

Multiplying by V:

softmax(Q·Kᵀ/√dₖ) shape: (T × T)
V                  shape: (T × dᵥ)

Product shape: (T × dᵥ)

Each output row i is a weighted sum of all value rows, weighted by position i’s attention weights. This has the same form as Bahdanau’s context vector cₜ = Σᵢ αₜᵢ hᵢ, except that the weights come from scaled dot products and all T positions are computed at once in a single matrix multiplication.
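Putting the four steps together (scores, scaling, softmax, weighted sum), here is one way the whole formula might look in plain Python. Matrices are lists of rows; the helper names and the toy inputs are assumptions made for this sketch.

```python
import math

def matmul(a, b):
    """(n × m) · (m × p) -> (n × p), with matrices as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Return (output, weights) for scaled dot-product attention."""
    d_k = len(q[0])
    scores = matmul(q, transpose(k))                       # (T × T)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]             # each row sums to 1
    return matmul(weights, v), weights                     # output is (T × d_v)

# Hypothetical toy inputs: T = 3 positions, d_k = 2, d_v = 1.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[10.0], [20.0], [30.0]]
out, weights = attention(q, k, v)
print(out)
```

Because every output row is a convex combination of the value rows, each output here lies between 10 and 30.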

Worked numerical example: scaled dot-product attention

Let T = 2 (two words: “Chai piyo”), d_model = 4, dₖ = dᵥ = 2.

Input embeddings:

X = [ 1.0  0.0  1.0  0.0 ]   ← "Chai"
    [ 0.0  1.0  0.0  1.0 ]   ← "piyo"

Projection matrices (hand-crafted for clean numbers):

W^Q = [[1,0],   W^K = [[0,1],   W^V = [[1,0],
       [0,1],          [1,0],          [0,1],
       [1,0],          [0,1],          [1,0],
       [0,1]]          [1,0]]          [0,1]]

Compute Q = X · W^Q:

q₁ = [1,0,1,0] · W^Q = [1×1+0×0+1×1+0×0,  1×0+0×1+1×0+0×1] = [2, 0]
q₂ = [0,1,0,1] · W^Q = [0×1+1×0+0×1+1×0,  0×0+1×1+0×0+1×1] = [0, 2]

Q = [[2, 0],
     [0, 2]]

Compute K = X · W^K:

k₁ = [1,0,1,0] · W^K = [0+0+0+0,  1+0+1+0] = [0, 2]
k₂ = [0,1,0,1] · W^K = [0+1+0+1,  0+0+0+0] = [2, 0]

K = [[0, 2],
     [2, 0]]

Compute V = X · W^V:

v₁ = [1,0,1,0] · W^V = [1+0+1+0,  0+0+0+0] = [2, 0]
v₂ = [0,1,0,1] · W^V = [0+0+0+0,  0+1+0+1] = [0, 2]

V = [[2, 0],
     [0, 2]]

Compute Q · Kᵀ:

Kᵀ = [[0, 2],
      [2, 0]]

Q · Kᵀ = [[2×0 + 0×2,   2×2 + 0×0],   =   [[0, 4],
           [0×0 + 2×2,   0×2 + 2×0]]        [4, 0]]

Entry [1,1] = 0: query “Chai” vs key “Chai” — low score (they don’t match well under these projections). Entry [1,2] = 4: query “Chai” vs key “piyo” — high score. Entry [2,1] = 4: query “piyo” vs key “Chai” — high score. Entry [2,2] = 0: query “piyo” vs key “piyo” — low score.

Scale by √dₖ = √2 ≈ 1.414:

Q · Kᵀ / √dₖ = [[0/1.414,  4/1.414],   =   [[0,      2.828],
                 [4/1.414,  0/1.414]]         [2.828,  0    ]]

Apply softmax row-wise:

Row 1: softmax([0, 2.828])

exp(0) = 1.000,  exp(2.8284) ≈ 16.919
Sum ≈ 17.919
α₁ = [1.000/17.919, 16.919/17.919] ≈ [0.056, 0.944]

Row 2: softmax([2.828, 0])

exp(2.8284) ≈ 16.919,  exp(0) = 1.000
Sum ≈ 17.919
α₂ = [16.919/17.919, 1.000/17.919] ≈ [0.944, 0.056]

Attention weight matrix A:

A = [[0.056, 0.944],
     [0.944, 0.056]]

Interpretation: “Chai” (row 1) attends 94.4% to “piyo” and only 5.6% to itself. “piyo” (row 2) attends 94.4% to “Chai” and 5.6% to itself. Each word’s output will be dominated by the other word’s value — they are looking across at each other.

Compute output Z = A · V:

A = [[0.056, 0.944],    V = [[2, 0],
     [0.944, 0.056]]         [0, 2]]

Z[1] = 0.056×[2,0] + 0.944×[0,2] = [0.112, 0.0] + [0.0, 1.888] = [0.112, 1.888]
Z[2] = 0.944×[2,0] + 0.056×[0,2] = [1.888, 0.0] + [0.0, 0.112] = [1.888, 0.112]

“Chai” output ≈ [0.11, 1.89]: mostly “piyo”’s value vector [0,2]. “piyo” output ≈ [1.89, 0.11]: mostly “Chai”’s value vector [2,0]. Each word has absorbed the other’s representation. This is what makes self-attention powerful: every word’s output blends information from the whole sequence.
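The arithmetic of this worked example can be checked with a few lines of plain Python. The sketch uses the same X, W^Q, W^K and W^V as above; only the variable and helper names are introduced here.

```python
import math

def matmul(a, b):
    """(n × m) · (m × p) -> (n × p), with matrices as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

X = [[1.0, 0.0, 1.0, 0.0],   # "Chai"
     [0.0, 1.0, 0.0, 1.0]]   # "piyo"
W_Q = [[1, 0], [0, 1], [1, 0], [0, 1]]
W_K = [[0, 1], [1, 0], [0, 1], [1, 0]]
W_V = [[1, 0], [0, 1], [1, 0], [0, 1]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
K_T = [list(col) for col in zip(*K)]          # transpose of K
d_k = len(Q[0])
scaled = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]

A = []                                         # row-wise softmax
for row in scaled:
    exps = [math.exp(s) for s in row]
    A.append([e / sum(exps) for e in exps])

Z = matmul(A, V)
for name, m in (("Q", Q), ("K", K), ("V", V), ("A", A), ("Z", Z)):
    print(name, [[round(val, 3) for val in row] for row in m])
```

Rounded to three decimals, A and Z should match the matrices computed by hand.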

Multi-head attention

Instead of one attention computation over the full d_model = 512 dimensions, the paper uses h = 8 independent attention “heads”, each working in a smaller subspace with dₖ = dᵥ = d_model / h = 64 dimensions.

Each head i has its own projection matrices: W^Q_i, W^K_i, W^V_i.

headᵢ = Attention(X·W^Q_i, X·W^K_i, X·W^V_i)

The h heads are computed in parallel, then concatenated:

MultiHead(X) = Concat(head₁, head₂, ..., headₕ) · W^O

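Where W^O is the output projection matrix. The concatenation has shape (T × h·dᵥ) = (T × 512), so W^O is (h·dᵥ × d_model) = (512 × 512), and the result is back in (T × d_model).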
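The concatenate-then-project step can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: T = 2, d_model = 4, h = 2, hand-picked weights (head 1 reuses the projections from the worked example, head 2 is made up), and an identity W^O so the concatenation is easy to see. A real model learns all of these matrices.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(x, heads, w_o):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate along the feature axis: row t is head_1[t] + head_2[t] + ...
    concat = [sum((head[t] for head in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
head_1 = ([[1, 0], [0, 1], [1, 0], [0, 1]],
          [[0, 1], [1, 0], [0, 1], [1, 0]],
          [[1, 0], [0, 1], [1, 0], [0, 1]])
head_2 = ([[1, 0], [0, 1], [0, 0], [0, 0]],    # hypothetical second head
          [[1, 0], [0, 1], [0, 0], [0, 0]],
          [[0, 0], [0, 0], [1, 0], [0, 1]])
w_o = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # identity
out = multi_head(x, [head_1, head_2], w_o)
print([[round(val, 3) for val in row] for row in out])
```

Each output row has d_model = 4 entries: the first two come from head 1 and the last two from head 2.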

Why multiple heads?

Each head can learn a different type of relationship. With 8 heads on the sentence “The cat sat on the mat”:

Head 1 might learn syntactic subject-verb relationships (cat → sat)
Head 2 might learn verb-object relationships (sat → mat)
Head 3 might learn determiner-noun relationships (the → cat, the → mat)
Head 4 might learn positional proximity (nearby words attend more)
Heads 5–8 might learn more abstract semantic patterns

No head is explicitly told what to specialise in. They learn from data.

Positional encoding

The sinusoidal positional encoding added to each input embedding:

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

Where pos is the position in the sequence (0, 1, 2, …) and i indexes the sine/cosine pair (0, 1, …, d_model/2 − 1), so dimensions 2i and 2i+1 share the same frequency.

Worked example with d_model = 4:

Position 0: PE = [sin(0/1), cos(0/1), sin(0/100), cos(0/100)]
                = [sin(0),  cos(0),   sin(0),     cos(0)   ]
                = [0.000,   1.000,    0.000,       1.000   ]

Position 1: PE = [sin(1/1), cos(1/1), sin(1/100), cos(1/100)]
                = [sin(1),  cos(1),   sin(0.01),  cos(0.01)]
                ≈ [0.841,   0.540,    0.010,       1.000   ]

Position 2: PE = [sin(2/1), cos(2/1), sin(2/100), cos(2/100)]
                ≈ [0.909,  −0.416,    0.020,       1.000   ]

Each position has a unique encoding. Nearby positions have similar encodings (smooth change), far positions have different ones. The model can read position from these values.
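The two formulas translate almost directly into code. The sketch below assumes an even d_model; the function name is chosen for illustration only.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; assumes d_model is even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dims 2i and 2i+1
    return pe

for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, 4)])
```

With d_model = 4 this should print the same three rows as the worked example, rounded to three decimals.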

Feed-forward sub-layer

After attention, each position passes through a simple two-layer MLP applied independently:

FFN(x) = max(0, x · W₁ + b₁) · W₂ + b₂

Where W₁ is (d_model × d_ff), d_ff = 2048 in the paper (4× the model dimension), and W₂ is (d_ff × d_model). The max(0, ·) is ReLU activation. This is the same operation applied at every position independently — a position-wise feed-forward network.

Why this? Attention mixes information across positions. The FFN then processes each position’s mixed representation with a small MLP, allowing non-linear transformation. Together, attention (communication) and FFN (computation) form a complete processing unit.
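A plain-Python sketch of the position-wise FFN, using tiny hand-picked weights (d_model = 2, d_ff = 3) that are purely illustrative. The point to notice is that each row of x is processed on its own: no information crosses positions here.

```python
def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward: each row of x is transformed independently."""
    out = []
    for row in x:
        # hidden = ReLU(row · W1 + b1); zip(*w1) iterates over the columns of W1
        hidden = [max(0.0, sum(r * w for r, w in zip(row, col)) + b)
                  for col, b in zip(zip(*w1), b1)]
        # output = hidden · W2 + b2
        out.append([sum(h * w for h, w in zip(hidden, col)) + b
                    for col, b in zip(zip(*w2), b2)])
    return out

# Hypothetical toy weights, not values from the paper.
w1 = [[1, 0, -1],
      [0, 1, 1]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1, 2],
      [3, 4],
      [5, 6]]
b2 = [0.5, -0.5]

x = [[1.0, -1.0],
     [2.0, 0.0]]
y = ffn(x, w1, b1, w2, b2)
print(y)
```

Feeding a single row through on its own gives the same result as feeding it as part of the full sequence, which is what “position-wise” means.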

Full encoder layer equation

A  = MultiHead(X)                  ← self-attention (Q, K, V all projected from X)
X' = LayerNorm(X + A)              ← residual connection + layer norm
Z  = FFN(X')                       ← feed-forward
X''= LayerNorm(X' + Z)             ← residual connection + layer norm

This is one encoder layer. Stack 6 and you have the Transformer encoder.
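The four lines of the layer equation can be sketched as a small function that takes the two sub-layers as arguments. Several things here are simplifying assumptions: the layer norm omits its learned gain and bias, and the two sub-layers used in the demo are crude placeholders standing in for real multi-head attention and a real FFN.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalise one position's vector to zero mean and unit variance.
    Assumption: the learned gain and bias are left out for brevity."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer norm, applied row by row."""
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]

def encoder_layer(x, self_attention, ffn):
    a = self_attention(x)          # A   = MultiHead(X)
    x1 = add_and_norm(x, a)        # X'  = LayerNorm(X + A)
    z = ffn(x1)                    # Z   = FFN(X')
    return add_and_norm(x1, z)     # X'' = LayerNorm(X' + Z)

def encoder(x, layers):
    """Apply a stack of (self_attention, ffn) pairs in order."""
    for self_attention, ffn in layers:
        x = encoder_layer(x, self_attention, ffn)
    return x

# Hypothetical placeholder sub-layers, NOT real attention or a real FFN.
def mean_mix(x):
    """Replace every row with the average row (mixes across positions)."""
    avg = [sum(col) / len(x) for col in zip(*x)]
    return [avg[:] for _ in x]

def relu_rows(x):
    """Apply ReLU to each position independently."""
    return [[max(0.0, v) for v in row] for row in x]

x = [[1.0, 2.0, 3.0, 4.0],
     [4.0, 3.0, 2.0, 1.0]]
out = encoder(x, [(mean_mix, relu_rows)] * 6)   # a stack of 6 layers
print([[round(v, 3) for v in row] for row in out])
```

The shape (T × d_model) is preserved from layer to layer, which is what makes stacking possible.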