Let’s trace through a complete example showing all three key operations.
Example: Processing a Single Token
Setup: We have a token embedding that flows through one transformer layer. We’ll trace:
- Pre-normalization with RMSNorm
- Self-attention (simplified)
- SwiGLU feedforward
Simplified scenario:
- Embedding dimension: d = 4 (normally 4096, but 4 for manual computation)
- Batch size: 1
- Sequence length: 1 (single token)
Step 1: Input Embedding
Raw embedding: $x = [0.5, -1.2, 0.8, 0.3]$
This comes from embedding the token “LLaMA”.
Step 2: Pre-Normalization (RMSNorm)
Operation: Normalize the input before attention.
Computation:
Sum of squares: $$\sum x_i^2 = (0.5)^2 + (-1.2)^2 + (0.8)^2 + (0.3)^2 = 0.25 + 1.44 + 0.64 + 0.09 = 2.42$$
RMS: $$\text{RMS}(x) = \sqrt{\frac{2.42}{4}} = \sqrt{0.605} \approx 0.778$$
Normalized (assuming $\gamma = 1.0$): $$\text{RMSNorm}(x) = \frac{[0.5, -1.2, 0.8, 0.3]}{0.778} = [0.643, -1.543, 1.029, 0.386]$$
After RMSNorm: $x_{\text{norm}} = [0.643, -1.543, 1.029, 0.386]$
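The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not LLaMA's actual implementation: the function name `rms_norm` and the `eps` parameter are illustrative. Practical implementations typically add a small epsilon under the square root for stability; it is set to 0 here so the result matches the hand computation.

```python
import math

def rms_norm(x, gamma=None, eps=0.0):
    """Divide x by its root mean square, then scale element-wise by gamma."""
    if gamma is None:
        gamma = [1.0] * len(x)  # gamma = 1.0, as assumed in the text
    mean_square = sum(v * v for v in x) / len(x)
    rms = math.sqrt(mean_square + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [0.5, -1.2, 0.8, 0.3]
x_norm = rms_norm(x)
print([round(v, 3) for v in x_norm])  # [0.643, -1.543, 1.029, 0.386]
```

After normalization the mean of the squared components is 1, which is the property RMSNorm enforces.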
Step 3: Self-Attention (Simplified)
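In a real transformer, $x_{\text{norm}}$ is projected into Query, Key, and Value vectors, attention weights are computed from the Query-Key dot products, and the weighted Values pass through an output projection. For brevity, assume the attention sub-layer outputs:
$$\text{attention_output} = [0.6, -1.5, 1.0, 0.4]$$
(With a sequence length of 1, the softmax runs over a single score, so the attention weight is exactly 1 and the output is just the projected Value vector.)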
Residual connection: $$\text{after_attention} = x + \text{attention_output} = [0.5, -1.2, 0.8, 0.3] + [0.6, -1.5, 1.0, 0.4] = [1.1, -2.7, 1.8, 0.7]$$
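To make the single-token case concrete, here is a small sketch of scaled dot-product attention followed by the residual addition. The query and key values are hypothetical, the value vector is chosen to equal the attention output assumed above, and the output projection is taken to be the identity. None of these numbers come from a real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    """Scaled dot-product attention for one query over lists of keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

# Hypothetical projections of the single token (illustrative values only).
q = [0.2, -0.4, 0.1, 0.3]
k = [0.5, 0.1, -0.2, 0.4]
v = [0.6, -1.5, 1.0, 0.4]  # chosen to equal the attention output assumed above

attn_out = attention(q, [k], [v])  # one key, so the softmax weight is 1.0

x = [0.5, -1.2, 0.8, 0.3]  # residual stream (un-normalized input)
after_attention = [a + b for a, b in zip(x, attn_out)]
print([round(t, 3) for t in after_attention])  # [1.1, -2.7, 1.8, 0.7]
```

Whatever `q` and `k` are, a single-token sequence returns `v` unchanged, which is why the example can assume the attention output without computing scores.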
Step 4: Pre-Normalization Again (Before FFN)
Input to RMSNorm: $y = [1.1, -2.7, 1.8, 0.7]$
Sum of squares: $$\sum y_i^2 = (1.1)^2 + (-2.7)^2 + (1.8)^2 + (0.7)^2 = 1.21 + 7.29 + 3.24 + 0.49 = 12.23$$
RMS: $$\text{RMS}(y) = \sqrt{\frac{12.23}{4}} = \sqrt{3.0575} \approx 1.749$$
Normalized: $$\text{RMSNorm}(y) = \frac{[1.1, -2.7, 1.8, 0.7]}{1.749} = [0.629, -1.544, 1.030, 0.400]$$
After RMSNorm: $y_{\text{norm}} = [0.629, -1.544, 1.030, 0.400]$
Step 5: SwiGLU Feedforward
Operation: $\text{FFN}_{\text{SwiGLU}}(y_{\text{norm}}) = \text{Swish}(y_{\text{norm}} W_1 + b_1) \otimes (y_{\text{norm}} W_2 + b_2)$
Simplified parameters:
- $W_1$: projects the input (4D) to the intermediate space (8D)
- $W_2$: projects the input (4D) to the same intermediate space (8D)
For manual computation, let’s shrink both to 4x2 matrices and compute only 2 intermediate dimensions:
First projection (gate input): $$z_1 = y_{\text{norm}} \cdot W_1 + b_1$$
Using $W_1 = [0.5, -0.3; -0.3, 0.4; -0.1, 0.6; 0.3, -0.2]$ (4x2 matrix, rows separated by semicolons) and $b_1 = [0.1, -0.1]$:
For dimension 1: $$z_{1,1} = 0.629 \cdot 0.5 + (-1.544) \cdot (-0.3) + 1.030 \cdot (-0.1) + 0.400 \cdot 0.3$$ $$= 0.315 + 0.463 - 0.103 + 0.120 = 0.795$$ $$\text{After bias: } 0.795 + 0.1 = 0.895$$
For dimension 2: $$z_{1,2} = 0.629 \cdot (-0.3) + (-1.544) \cdot 0.4 + 1.030 \cdot 0.6 + 0.400 \cdot (-0.2)$$ $$= -0.189 - 0.618 + 0.618 - 0.080 = -0.269$$ $$\text{After bias: } -0.269 - 0.1 = -0.369$$
So $z_1 = [0.895, -0.369]$
Apply Swish activation: $$\text{Swish}(z) = z \cdot \sigma(z) = z \cdot \frac{1}{1 + e^{-z}}$$
For $z_{1,1} = 0.895$: $$\sigma(0.895) = \frac{1}{1 + e^{-0.895}} = \frac{1}{1 + 0.407} = 0.711$$ $$\text{Swish}(0.895) = 0.895 \cdot 0.711 = 0.636$$
For $z_{1,2} = -0.369$: $$\sigma(-0.369) = \frac{1}{1 + e^{0.369}} = \frac{1}{1 + 1.447} = 0.408$$ $$\text{Swish}(-0.369) = -0.369 \cdot 0.408 = -0.151$$
So $\text{Swish}(z_1) = [0.636, -0.151]$
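The Swish values can be reproduced with the standard library. This is a minimal sketch with illustrative function names. Full-precision arithmetic can differ from the hand-rounded figures above by about 0.001 in the last digit.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    """Swish with beta = 1 (also called SiLU): z * sigmoid(z)."""
    return z * sigmoid(z)

for z in (0.895, -0.369):
    print(f"z = {z:+.3f}  sigmoid = {sigmoid(z):.4f}  swish = {swish(z):+.4f}")
```

Unlike ReLU, Swish returns a small negative value for negative inputs instead of clamping them to zero, which is what the second component shows.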
Second projection (linear branch, which the Swish gate scales): $$z_2 = y_{\text{norm}} \cdot W_2 + b_2$$
Using a different weight matrix $W_2 = [0.4, 0.2; -0.1, 0.5; 0.3, -0.2; -0.2, 0.4]$ and $b_2 = [0, 0.05]$:
For dimension 1: $$z_{2,1} = 0.629 \cdot 0.4 + (-1.544) \cdot (-0.1) + 1.030 \cdot 0.3 + 0.400 \cdot (-0.2)$$ $$= 0.252 + 0.154 + 0.309 - 0.080 = 0.635$$ $$\text{After bias: } 0.635 + 0 = 0.635$$
For dimension 2: $$z_{2,2} = 0.629 \cdot 0.2 + (-1.544) \cdot 0.5 + 1.030 \cdot (-0.2) + 0.400 \cdot 0.4$$ $$= 0.126 - 0.772 - 0.206 + 0.160 = -0.692$$ $$\text{After bias: } -0.692 + 0.05 = -0.642$$
So $z_2 = [0.635, -0.642]$
Element-wise multiplication (gating): $$\text{FFN}_{\text{SwiGLU}} = \text{Swish}(z_1) \otimes z_2 = [0.636, -0.151] \otimes [0.635, -0.642]$$ $$= [0.636 \cdot 0.635, -0.151 \cdot (-0.642)] = [0.404, 0.097]$$
FFN output: $[0.404, 0.097]$ (in 2D for this example; normally 8D)
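The whole gating computation fits in one short function. The sketch below is illustrative rather than a reference implementation: `swiglu` and its argument names are made up for this example, and each weight matrix is passed as a list of columns whose coefficients are copied from the hand-computed sums above (one list per intermediate dimension).

```python
import math

def swish(z):
    """z * sigmoid(z), written as a single division."""
    return z / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def swiglu(x, w1_cols, b1, w2_cols, b2):
    """Return Swish(x W1 + b1) * (x W2 + b2), multiplied element-wise.

    Each weight matrix is given as a list of columns, so one dot product
    per column yields one intermediate dimension.
    """
    z1 = [dot(x, col) + b for col, b in zip(w1_cols, b1)]  # gate pre-activation
    z2 = [dot(x, col) + b for col, b in zip(w2_cols, b2)]  # linear branch
    return [swish(a) * c for a, c in zip(z1, z2)]

y_norm = [0.629, -1.544, 1.030, 0.400]

# Coefficients as they appear in the dimension-1 and dimension-2 sums above.
w1_cols = [[0.5, -0.3, -0.1, 0.3], [-0.3, 0.4, 0.6, -0.2]]
b1 = [0.1, -0.1]
w2_cols = [[0.4, -0.1, 0.3, -0.2], [0.2, 0.5, -0.2, 0.4]]
b2 = [0.0, 0.05]

out = swiglu(y_norm, w1_cols, b1, w2_cols, b2)
print([round(v, 2) for v in out])  # [0.4, 0.1]
```

At three decimals the result agrees with the hand computation to within rounding of the intermediate steps.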
Step 6: Residual Connection and Next Layer
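Final output before the next layer: $$\text{output} = y + \text{FFN}_{\text{SwiGLU}} = [1.1, -2.7, 1.8, 0.7] + [0.404, 0.097, \ldots]$$
(The sum is schematic, because this example stops at the 2D gated activation. A real SwiGLU block applies one more projection from the intermediate width back to $d = 4$, so the FFN output has the same dimension as the input and every component is added.)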
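The overall data flow of Steps 2 through 6 can be sketched as a pre-norm layer with pluggable sub-layers. The function names are illustrative, and the two sub-layers below are stand-ins: the attention stub returns the vector assumed in Step 3, and the FFN stub returns zeros so that the residual stream is easy to inspect.

```python
import math

def rms_norm(x):
    """RMSNorm with gamma = 1 and no epsilon, as in the hand computation."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def transformer_layer(x, attention_fn, ffn_fn):
    """Pre-norm layer: each sub-layer reads a normalized copy of its input,
    and its output is added back to the un-normalized residual stream."""
    y = add(x, attention_fn(rms_norm(x)))
    return add(y, ffn_fn(rms_norm(y)))

def fixed_attention(x_norm):
    # Stand-in: the attention output assumed in Step 3.
    return [0.6, -1.5, 1.0, 0.4]

def zero_ffn(y_norm):
    # Stand-in: contributes nothing, so the output equals y.
    return [0.0] * len(y_norm)

out = transformer_layer([0.5, -1.2, 0.8, 0.3], fixed_attention, zero_ffn)
print([round(v, 3) for v in out])  # [1.1, -2.7, 1.8, 0.7]
```

Note that the residual additions use `x` and `y` directly; only the inputs to the sub-layers are normalized.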
Step 7: RoPE Example (Attention Computation)
Now let’s see how RoPE affects the attention computation for a 2-token sequence.
Token 1 query (after attention head projection): $q_1 = [1.0, 0.5]$
Token 1 key: $k_1 = [0.8, 0.6]$
Token 2 query: $q_2 = [0.9, 0.7]$
Token 2 key: $k_2 = [0.7, 0.5]$
RoPE angle basis: $\theta = 0.1$ rad/position
Without RoPE (Absolute position embeddings)
Add learned position embeddings:
- $p_1 = [0.1, 0.05]$
- $p_2 = [0.15, 0.08]$
Then: $$q'_1 = q_1 + p_1 = [1.1, 0.55]$$ $$k'_1 = k_1 + p_1 = [0.9, 0.65]$$ $$q'_2 = q_2 + p_2 = [1.05, 0.78]$$ $$k'_2 = k_2 + p_2 = [0.85, 0.58]$$
Attention score between $q_2$ and $k_1$: $$\text{score} = q'_2 \cdot k'_1 = 1.05 \cdot 0.9 + 0.78 \cdot 0.65 = 0.945 + 0.507 = 1.452$$
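A short sketch of the same score in Python, which also expands the product to show where the position vectors enter. The helper names are illustrative.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def add(a, b):
    return [p + q for p, q in zip(a, b)]

q2, k1 = [0.9, 0.7], [0.8, 0.6]
p1, p2 = [0.1, 0.05], [0.15, 0.08]

# Score with position vectors added to the query and key.
score_abs = dot(add(q2, p2), add(k1, p1))

# The same score expanded: one content term plus three position-dependent terms.
expanded = dot(q2, k1) + dot(q2, p1) + dot(p2, k1) + dot(p2, p1)

print(round(score_abs, 3))  # 1.452
print(round(expanded, 3))   # 1.452
```

Three of the four terms involve $p_1$ or $p_2$ individually, so the score is tied to the specific learned vectors at those two positions rather than to the distance between them.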
With RoPE
Apply rotation at each position:
Position 1, angle = 0.1: $$R(0.1) = \begin{bmatrix} \cos(0.1) & -\sin(0.1) \\ \sin(0.1) & \cos(0.1) \end{bmatrix} = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix}$$
$$q'_1 = R(0.1) \cdot q_1 = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.995 - 0.0499 \\ 0.0998 + 0.4975 \end{bmatrix} = \begin{bmatrix} 0.945 \\ 0.597 \end{bmatrix}$$
$$k'_1 = R(0.1) \cdot k_1 = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix} \begin{bmatrix} 0.8 \\ 0.6 \end{bmatrix} = \begin{bmatrix} 0.796 - 0.0599 \\ 0.0798 + 0.597 \end{bmatrix} = \begin{bmatrix} 0.736 \\ 0.677 \end{bmatrix}$$
Position 2, angle = 0.2: $$R(0.2) = \begin{bmatrix} \cos(0.2) & -\sin(0.2) \\ \sin(0.2) & \cos(0.2) \end{bmatrix} = \begin{bmatrix} 0.980 & -0.199 \\ 0.199 & 0.980 \end{bmatrix}$$
$$q'_2 = R(0.2) \cdot q_2 = \begin{bmatrix} 0.980 & -0.199 \\ 0.199 & 0.980 \end{bmatrix} \begin{bmatrix} 0.9 \\ 0.7 \end{bmatrix} = \begin{bmatrix} 0.882 - 0.1393 \\ 0.1791 + 0.686 \end{bmatrix} = \begin{bmatrix} 0.743 \\ 0.865 \end{bmatrix}$$
$$k'_2 = R(0.2) \cdot k_2 = \begin{bmatrix} 0.980 & -0.199 \\ 0.199 & 0.980 \end{bmatrix} \begin{bmatrix} 0.7 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.686 - 0.0995 \\ 0.1393 + 0.490 \end{bmatrix} = \begin{bmatrix} 0.586 \\ 0.629 \end{bmatrix}$$
Attention score between $q'_2$ and $k'_1$ (with RoPE): $$\text{score}_{\text{RoPE}} = q'_2 \cdot k'_1 = 0.743 \cdot 0.736 + 0.865 \cdot 0.677 = 0.547 + 0.586 = 1.133$$
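The rotations above can be reproduced as follows. This sketch covers only the 2-D case used in the text; real attention heads have more dimensions and rotate each pair of dimensions at its own frequency. The function name `rotate` is illustrative.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

theta = 0.1  # radians per position, as in the text
q2, k1 = [0.9, 0.7], [0.8, 0.6]

q2_rot = rotate(q2, 2 * theta)  # query at position 2
k1_rot = rotate(k1, 1 * theta)  # key at position 1
score_rope = dot(q2_rot, k1_rot)

print([round(v, 3) for v in q2_rot])  # [0.743, 0.865]
print([round(v, 3) for v in k1_rot])  # [0.736, 0.677]
print(round(score_rope, 2))           # 1.13
```

At full precision the score is about 1.132; the hand computation lands on 1.133 because it multiplies values already rounded to three decimals.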
Comparison:
- Without RoPE: score = 1.452
- With RoPE: score = 1.133
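The two numbers are not directly comparable; what matters is how position enters each score. With RoPE, $q'_2 \cdot k'_1 = q_2^\top R(0.2)^\top R(0.1)\, k_1 = q_2^\top R(-0.1)\, k_1$, so the score depends on the positions only through their offset (position 2 - position 1 = 1). With absolute embeddings, the score depends on the specific vectors $p_1$ and $p_2$, so the same offset at other positions gives a different positional contribution. This relative-position property is one reason RoPE tends to generalize better to longer sequences.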
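The relative-position property can be checked numerically: shifting both positions by the same amount leaves the RoPE score unchanged. This is a sketch for the 2-D case only, with an illustrative helper `rope_score`; the shift of 100 positions is an arbitrary choice.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of a query and key after rotating each by position * theta."""
    return dot(rotate(q, q_pos * theta), rotate(k, k_pos * theta))

q2, k1 = [0.9, 0.7], [0.8, 0.6]

s_near = rope_score(q2, k1, 2, 1)      # positions used in the text
s_far = rope_score(q2, k1, 102, 101)   # same offset, shifted by 100 positions
s_direct = dot(q2, rotate(k1, -0.1))   # rotate the key by the offset alone

print(abs(s_near - s_far) < 1e-9)      # True
print(abs(s_near - s_direct) < 1e-9)   # True
```

Changing the offset (for example, positions 3 and 1) does change the score, which is how the model can distinguish nearby tokens from distant ones.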
Summary: Full Trace
| Operation | Input | Output | Key Insight |
|---|---|---|---|
| RMSNorm | [0.5, -1.2, 0.8, 0.3] | [0.643, -1.543, 1.029, 0.386] | Normalizes via RMS, simpler than LayerNorm |
| Attention | [0.643, …] | [0.6, -1.5, 1.0, 0.4] | Computes similarities (simplified) |
| Residual | [0.5, -1.2, 0.8, 0.3] + [0.6, -1.5, 1.0, 0.4] | [1.1, -2.7, 1.8, 0.7] | Preserves information via skip connection |
| RMSNorm (2nd) | [1.1, -2.7, 1.8, 0.7] | [0.629, -1.544, 1.030, 0.400] | Normalize again before FFN |
| SwiGLU | [0.629, …] | [0.404, 0.097] | Smooth activation + gating > ReLU |
| RoPE (alt) | $q$, $k$ rotated by their own position angle | Score depends only on relative position | Tends to generalize to longer sequences |
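RMSNorm simplifies normalization, SwiGLU adds a smooth gate to the feedforward block, and RoPE injects relative position into attention; LLaMA uses all three in every layer.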