5. Worked example — one full encoder layer on “The chai is hot”
🔴 Advanced undergrad. Read Section 4 first.
We trace one complete encoder layer for the 4-word sentence “The chai is hot”. We use d_model = 4, dₖ = dᵥ = 2, one attention head (multi-head would just repeat this h times in parallel).
In the original Transformer (base model): d_model = 512, dₖ = dᵥ = 64, h = 8 heads. The structure is identical.
Step 0: Input embeddings + positional encodings
Assume the embedding lookup gives us (before positional encoding):
word embeddings:
"The" → e₁ = [1.0, 0.0, 0.5, 0.2]
"chai" → e₂ = [0.2, 1.0, 0.8, 0.1]
"is" → e₃ = [0.5, 0.3, 1.0, 0.0]
"hot" → e₄ = [0.1, 0.5, 0.3, 1.0]
Add positional encodings PE(pos, d_model=4):
PE(0) = [0.000, 1.000, 0.000, 1.000] (position 0 = "The")
PE(1) = [0.841, 0.540, 0.010, 1.000] (position 1 = "chai")
PE(2) = [0.909, −0.416, 0.020, 1.000] (position 2 = "is")
PE(3) = [0.141, −0.990, 0.030, 1.000] (position 3 = "hot")
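The table above is consistent with the sinusoidal scheme of the original Transformer paper, in which even dimensions use a sine and odd dimensions a cosine, with the angle divided by 10000^(2i/d_model). Here is a minimal sketch in plain Python, assuming that formula; the function name `positional_encoding` is ours, not from the text.

```python
import math

def positional_encoding(pos, d_model=4):
    """Sinusoidal positional encoding for one position (assumed formula)."""
    pe = []
    for j in range(d_model):
        # Dimensions 2i and 2i+1 share the same frequency 1 / 10000^(2i / d_model).
        i = j // 2
        angle = pos / (10000 ** (2 * i / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        pe.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return pe

for pos, word in enumerate(["The", "chai", "is", "hot"]):
    print(word, [round(v, 3) for v in positional_encoding(pos)])
```

With d_model = 4 there are only two frequencies: the first pair of dimensions changes quickly with position, while the second pair (angle pos/100) barely moves over four words.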
Input matrix X = embedding + PE:
X = [ 1.000+0.000, 0.000+1.000, 0.500+0.000, 0.200+1.000 ]
[ 0.200+0.841, 1.000+0.540, 0.800+0.010, 0.100+1.000 ]
[ 0.500+0.909, 0.300−0.416, 1.000+0.020, 0.000+1.000 ]
[ 0.100+0.141, 0.500−0.990, 0.300+0.030, 1.000+1.000 ]
= [ 1.000, 1.000, 0.500, 1.200 ] ← "The"
[ 1.041, 1.540, 0.810, 1.100 ] ← "chai"
[ 1.409, −0.116, 1.020, 1.000 ] ← "is"
[ 0.241, −0.490, 0.330, 2.000 ] ← "hot"
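The element-wise sum can be checked in a few lines. This sketch simply copies the embedding and PE tables from above into nested lists (the variable names `E`, `PE` and `X` are ours).

```python
# Word embeddings and positional encodings, copied from the tables above.
words = ["The", "chai", "is", "hot"]
E = [[1.0, 0.0, 0.5, 0.2],
     [0.2, 1.0, 0.8, 0.1],
     [0.5, 0.3, 1.0, 0.0],
     [0.1, 0.5, 0.3, 1.0]]
PE = [[0.000, 1.000, 0.000, 1.000],
      [0.841, 0.540, 0.010, 1.000],
      [0.909, -0.416, 0.020, 1.000],
      [0.141, -0.990, 0.030, 1.000]]

# X = E + PE, element by element; round to 3 decimals to hide float noise.
X = [[round(e + p, 3) for e, p in zip(e_row, p_row)]
     for e_row, p_row in zip(E, PE)]

for word, row in zip(words, X):
    print(word, row)
```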
Step 1: Project to Q, K, V
Using learned projection matrices (simplified for clean numbers):
W^Q = [[1, 0],
[0, 1],
[0, 0],
[0, 0]]
W^K = [[0, 1],
[1, 0],
[0, 0],
[0, 0]]
W^V = [[1, 1],
[0, 0],
[1, 0],
[0, 1]]
Q = X · W^Q (only first two columns of X matter due to zero rows in W^Q):
q₁ = X[1] · W^Q = [1.000×1+1.000×0+0.500×0+1.200×0, 1.000×0+1.000×1+0.500×0+1.200×0]
= [1.000, 1.000]
q₂ = [1.041, 1.540]
q₃ = [1.409, −0.116]
q₄ = [0.241, −0.490]
Q = [[1.000, 1.000],
[1.041, 1.540],
[1.409, −0.116],
[0.241, −0.490]]
K = X · W^K (W^K picks the first two columns of X in swapped order):
k₁ = [1.000, 1.000] k₂ = [1.540, 1.041]
k₃ = [−0.116, 1.409] k₄ = [−0.490, 0.241]
K = [[1.000, 1.000],
[1.540, 1.041],
[−0.116, 1.409],
[−0.490, 0.241]]
V = X · W^V (first column = column 1 + column 3 of X, second column = column 1 + column 4 of X):
We compute all four value vectors here, although the later steps follow only “chai” in detail rather than tracking all 4 words through the full layer:
v₁ = [1.000+0.500, 1.000+1.200] = [1.500, 2.200]
v₂ = [1.041+0.810, 1.041+1.100] = [1.851, 2.141]
v₃ = [1.409+1.020, 1.409+1.000] = [2.429, 2.409]
v₄ = [0.241+0.330, 0.241+2.000] = [0.571, 2.241]
V = [[1.500, 2.200],
[1.851, 2.141],
[2.429, 2.409],
[0.571, 2.241]]
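The three projections are ordinary matrix products, so they are easy to verify. The sketch below uses plain Python lists and a small helper `matmul` (our own, not from the text), with X and the weight matrices copied from above.

```python
def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Input matrix X from Step 0 (rows: The, chai, is, hot).
X = [[1.000, 1.000, 0.500, 1.200],
     [1.041, 1.540, 0.810, 1.100],
     [1.409, -0.116, 1.020, 1.000],
     [0.241, -0.490, 0.330, 2.000]]

# Hand-picked toy projection matrices from the text (4 x 2 each).
W_Q = [[1, 0], [0, 1], [0, 0], [0, 0]]
W_K = [[0, 1], [1, 0], [0, 0], [0, 0]]
W_V = [[1, 1], [0, 0], [1, 0], [0, 1]]

Q, K, V = (matmul(X, W) for W in (W_Q, W_K, W_V))

for name, M in (("Q", Q), ("K", K), ("V", V)):
    print(name, [[round(v, 3) for v in row] for row in M])
```

Because the toy weights contain only 0s and 1s, each projection just selects or adds columns of X. Learned weights would mix all four input dimensions.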
Step 2: Compute Q · Kᵀ
Kᵀ = [[1.000, 1.540, −0.116, −0.490],
[1.000, 1.041, 1.409, 0.241]]
Q · Kᵀ:
Row 1 (q₁=[1.000, 1.000]):
[1.000×1.000+1.000×1.000, 1.000×1.540+1.000×1.041, 1.000×(−0.116)+1.000×1.409, 1.000×(−0.490)+1.000×0.241]
= [2.000, 2.581, 1.293, −0.249]
Row 2 (q₂=[1.041, 1.540]):
[1.041×1.000+1.540×1.000, 1.041×1.540+1.540×1.041, 1.041×(−0.116)+1.540×1.409, 1.041×(−0.490)+1.540×0.241]
= [2.581, 3.206, 2.049, −0.139]
Row 3 (q₃=[1.409, −0.116]):
[1.409×1.000+(−0.116)×1.000, 1.409×1.540+(−0.116)×1.041, 1.409×(−0.116)+(−0.116)×1.409, 1.409×(−0.490)+(−0.116)×0.241]
= [1.293, 2.049, −0.327, −0.718]
Row 4 (q₄=[0.241, −0.490]):
[0.241×1.000+(−0.490)×1.000, 0.241×1.540+(−0.490)×1.041, 0.241×(−0.116)+(−0.490)×1.409, 0.241×(−0.490)+(−0.490)×0.241]
= [−0.249, −0.139, −0.718, −0.236]
Raw score matrix:
"The" "chai" "is" "hot"
"The" [ 2.000 2.581 1.293 −0.249]
"chai" [ 2.581 3.206 2.049 −0.139]
"is" [ 1.293 2.049 −0.327 −0.718]
"hot" [−0.249 −0.139 −0.718 −0.236]
Sanity check: the matrix is symmetric because K is Q with its two columns swapped. This is a property of the toy weights, not of attention in general.
Step 3: Scale by √dₖ = √2 ≈ 1.414 and apply softmax
Scaled scores (divide each by 1.414):
"The" "chai" "is" "hot"
"The" [ 1.414 1.825 0.914 −0.176]
"chai" [ 1.825 2.267 1.449 −0.098]
"is" [ 0.914 1.449 −0.231 −0.508]
"hot" [−0.176 −0.098 −0.508 −0.167]
Apply softmax row by row (showing “chai” row in detail):
Row "chai": [1.825, 2.267, 1.449, −0.098]
exp values (from the unrounded scores): [6.203, 9.652, 4.259, 0.906]
Sum = 21.020
α = [0.295, 0.459, 0.203, 0.043]
“chai” attends 45.9% to itself, 29.5% to “The”, 20.3% to “is”, and only 4.3% to “hot”. With these hand-picked weights the pattern reflects the toy numbers rather than learned semantics, but it shows the mechanism: the largest scaled score gets the largest share, and the softmax turns a gap of about 2.4 between the “chai” and “hot” scores into a roughly tenfold difference in weight.
Full attention matrix (all rows softmaxed, approximate):
"The" "chai" "is" "hot"
"The" [0.301, 0.454, 0.183, 0.061]
"chai" [0.295, 0.459, 0.203, 0.043]
"is" [0.306, 0.523, 0.097, 0.074]
"hot" [0.263, 0.284, 0.188, 0.265]
Each row sums to 1 (up to rounding). ✓
Step 4: Output Z = A · V
V = [[1.500, 2.200],
[1.851, 2.141],
[2.429, 2.409],
[0.571, 2.241]]
Output for "chai" (row 2):
Z₂ = 0.295×[1.500, 2.200] + 0.459×[1.851, 2.141] + 0.203×[2.429, 2.409] + 0.043×[0.571, 2.241]
= [0.443, 0.649] + [0.850, 0.983] + [0.493, 0.489] + [0.025, 0.096]
≈ [1.81, 2.22]
“chai”’s output vector [1.81, 2.22] is a blend of all four words’ value vectors, weighted by attention. Its own value ([1.851, 2.141] at 45.9%) contributes most, with the rest of the context mixed in from the other three words.
Step 5: Residual + Layer Norm
The attention output Z is added to the original input X (residual connection), then layer-normalised:
X' = LayerNorm(X + Z)
For “chai”:
X_chai = [1.041, 1.540, 0.810, 1.100] (original input, 4-dim)
Z_chai = [1.81, 2.22, ?, ?] (2-dim attention output in this toy)
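In a real model, the outputs of all h heads are concatenated and multiplied by an output projection W^O, which maps them back to d_model = 512 dimensions. Z then has the same shape as X, so the residual sum X + Z is well defined; the ? entries above mark where this toy, which skips W^O, has nothing to add. Layer norm then rescales each position’s 512-dim vector to mean 0 and standard deviation 1, followed by a learned scale γ and shift β. See the Normalisation tutorial for the calculation.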
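Steps 2 to 5 can be reproduced end to end in plain Python. The sketch below recomputes the scores, the softmax weights and Z from the Q, K, V of Step 1, then applies the residual connection and layer norm. It rests on two assumptions that are not in the text. Because the toy Z has only 2 dimensions while X has 4, it uses a hypothetical output projection `W_O` that copies Z into the first two dimensions. It also fixes γ = 1 and β = 0 in the layer norm.

```python
import math

def matmul(A, B):
    """Multiply nested-list matrices A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalise one vector to mean 0, std 1 (gamma = 1, beta = 0 assumed)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Values from Steps 0 and 1 (rows: The, chai, is, hot).
X = [[1.000, 1.000, 0.500, 1.200], [1.041, 1.540, 0.810, 1.100],
     [1.409, -0.116, 1.020, 1.000], [0.241, -0.490, 0.330, 2.000]]
Q = [[1.000, 1.000], [1.041, 1.540], [1.409, -0.116], [0.241, -0.490]]
K = [[1.000, 1.000], [1.540, 1.041], [-0.116, 1.409], [-0.490, 0.241]]
V = [[1.500, 2.200], [1.851, 2.141], [2.429, 2.409], [0.571, 2.241]]
d_k = 2

# Step 2: raw scores Q · K^T (zip(*K) transposes K).
scores = matmul(Q, [list(col) for col in zip(*K)])
# Step 3: scale by sqrt(d_k), then softmax each row.
A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
# Step 4: attention output, one 2-dim vector per word.
Z = matmul(A, V)

# Step 5. HYPOTHETICAL output projection (2 -> 4 dims), not given in the text.
W_O = [[1, 0, 0, 0],
       [0, 1, 0, 0]]
Z_proj = matmul(Z, W_O)
X_prime = [layer_norm([x + z for x, z in zip(x_row, z_row)])
           for x_row, z_row in zip(X, Z_proj)]

print("A[chai] =", [round(a, 3) for a in A[1]])
print("Z[chai] =", [round(z, 3) for z in Z[1]])
print("X'[chai] =", [round(v, 3) for v in X_prime[1]])
```

A learned W^O would mix both components of Z into all four output dimensions. The copy-style projection here only makes the shapes line up so the residual sum can be formed.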
Step 6: Feed-forward sub-layer
After Residual + LN, each position passes independently through a 2-layer MLP:
FFN(x) = max(0, x · W₁ + b₁) · W₂ + b₂
In the original paper: W₁ maps from 512 to 2048 dimensions, W₂ maps back from 2048 to 512. The ReLU in the middle introduces non-linearity. A second Residual + LN follows.
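This is a minimal sketch of the position-wise feed-forward sub-layer in plain Python. The weights are random placeholders, since the text gives no trained values. The choice d_ff = 16 only keeps the 1:4 ratio of 512 to 2048, and the input reuses the Step 0 matrix X as a stand-in for the Step 5 output.

```python
import random

def matmul(A, B):
    """Multiply nested-list matrices A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def ffn(x_rows, W1, b1, W2, b2):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied to each row independently."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)]   # ReLU
              for row in matmul(x_rows, W1)]
    return [[o + b for o, b in zip(row, b2)]
            for row in matmul(hidden, W2)]

d_model, d_ff = 4, 16          # toy sizes; the paper uses 512 and 2048
rng = random.Random(0)         # placeholder weights, NOT trained values
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

# Stand-in input: the Step 0 matrix X (one row per word).
X = [[1.000, 1.000, 0.500, 1.200], [1.041, 1.540, 0.810, 1.100],
     [1.409, -0.116, 1.020, 1.000], [0.241, -0.490, 0.330, 2.000]]

out = ffn(X, W1, b1, W2, b2)
print(len(out), "positions x", len(out[0]), "dimensions")
```

Because the same W₁ and W₂ are applied to every row, feeding a single word through `ffn` gives the same vector as that word's row in the batched call. This is what "each position passes independently" means in practice.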
This completes one encoder layer. The output feeds into the next encoder layer, and so on for all 6 layers.
What the encoder “understands” after 6 layers
After 6 encoder layers, each position’s vector is a rich contextual representation:
- “chai” knows it is a noun, the subject of “is hot,” a drink
- “hot” knows it is an adjective describing “chai”
- “The” knows it is a determiner modifying “chai”
- “is” knows it is a linking verb between “chai” and “hot”
None of this was explicitly programmed. It emerged through training on millions of sentences. The attention patterns in each layer specialise to capture different aspects of this understanding.