Prerequisite Tutorials
- Transformer Architecture — understand attention, feedforward layers
- Linear Algebra: Vectors and Matrices
- Neural Network Basics
- Layer Normalization
1. RMSNorm (Root Mean Square Normalization)
Standard LayerNorm
For reference, standard LayerNorm computes:
$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$$
$$\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
$$y_i = \gamma \hat{x}_i + \beta$$
where $\gamma$ (scale) and $\beta$ (shift) are learnable parameters.
RMSNorm
RMSNorm simplifies this by removing the mean subtraction:
$$\text{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}$$
$$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x) + \epsilon} \otimes \gamma$$
where $\otimes$ is element-wise multiplication, $\gamma$ is a learnable scale, and $\epsilon$ is a small constant for numerical stability.
Key difference: RMSNorm divides by the root mean square of the vector rather than by the standard deviation around the mean. There is no mean subtraction and no $\beta$ bias parameter.
Numerical Example
Input vector: $x = [2, -1, 3, 0]$ (dimension d = 4)
Step 1: Compute sum of squares $$\sum x_i^2 = 2^2 + (-1)^2 + 3^2 + 0^2 = 4 + 1 + 9 + 0 = 14$$
Step 2: Compute RMS $$\text{RMS}(x) = \sqrt{\frac{14}{4}} = \sqrt{3.5} \approx 1.871$$
Step 3: Normalize (assuming $\gamma = 1$, $\epsilon = 0$) $$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} = \frac{[2, -1, 3, 0]}{1.871} = [1.069, -0.535, 1.604, 0.000]$$
Verification: Check that the RMS of the output is 1: $$\text{RMS}(\text{output}) = \sqrt{\frac{(1.069)^2 + (-0.535)^2 + (1.604)^2 + 0^2}{4}} \approx \sqrt{\frac{4.00}{4}} = \sqrt{1.00} = 1.0 \checkmark$$
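The formula and worked example above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `rms_norm` and the default `eps=1e-6` are assumptions made for this sketch, and $\epsilon$ is added outside the square root to match the formula as written in this section.

```python
import math

def rms_norm(x, gamma=None, eps=1e-6):
    """Apply RMSNorm to a list of floats: x / (RMS(x) + eps), scaled by gamma."""
    d = len(x)
    # Root mean square: square root of the mean of the squared entries.
    rms = math.sqrt(sum(v * v for v in x) / d)
    # With no gamma supplied, use all ones (no learned rescaling).
    if gamma is None:
        gamma = [1.0] * d
    return [g * v / (rms + eps) for v, g in zip(x, gamma)]

# Same setting as the worked example: gamma = 1, eps = 0.
out = rms_norm([2.0, -1.0, 3.0, 0.0], eps=0.0)
print([round(v, 3) for v in out])  # [1.069, -0.535, 1.604, 0.0]
```

Note that there is no mean computation anywhere in the function, which is where the saving relative to LayerNorm comes from.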
Comparison: RMSNorm vs. LayerNorm
For the same input $x = [2, -1, 3, 0]$:
LayerNorm:
- Mean: $\mu = (2 - 1 + 3 + 0) / 4 = 1.0$
- Variance: $\sigma^2 = ((2-1)^2 + (-1-1)^2 + (3-1)^2 + (0-1)^2) / 4 = (1 + 4 + 4 + 1) / 4 = 2.5$
- Std: $\sigma = \sqrt{2.5} = 1.581$
- Output: $[(2-1)/1.581, (-1-1)/1.581, (3-1)/1.581, (0-1)/1.581] = [0.632, -1.265, 1.265, -0.632]$ (after scaling with $\gamma=1$)
RMSNorm:
- RMS: $\sqrt{14/4} = 1.871$
- Output: $[1.069, -0.535, 1.604, 0.000]$ (as computed above)
Both normalize, but LayerNorm centers around zero (output has mean ≈ 0), while RMSNorm does not. RMSNorm is simpler (no mean computation) and slightly faster.
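To make the centering difference concrete, the comparison can be reproduced with a short script. This is a sketch under the same assumptions as the example ($\gamma = 1$, $\beta = 0$, $\epsilon = 0$); the helper names `layer_norm` and `rms_norm` are chosen for illustration only.

```python
import math

def layer_norm(x, eps=0.0):
    """LayerNorm with gamma = 1 and beta = 0."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=0.0):
    """RMSNorm with gamma = 1."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d)
    return [v / (rms + eps) for v in x]

x = [2.0, -1.0, 3.0, 0.0]
ln = layer_norm(x)
rn = rms_norm(x)

print([round(v, 3) for v in ln])   # [0.632, -1.265, 1.265, -0.632]
print([round(v, 3) for v in rn])   # [1.069, -0.535, 1.604, 0.0]
# LayerNorm output is centered; RMSNorm output keeps a non-zero mean.
print(abs(sum(ln) / len(ln)) < 1e-12)   # True
print(round(sum(rn) / len(rn), 3))      # 0.535
```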
2. SwiGLU Activation Function
Standard FFN with ReLU
In the original Transformer, the feedforward network uses ReLU (GPT-3 uses the closely related GELU in the same position):
$$\text{FFN}_{\text{ReLU}}(x) = \text{ReLU}(x W_1 + b_1) \cdot W_2 + b_2$$
where ReLU$(z) = \max(0, z)$.
SwiGLU FFN
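LLaMA replaces the ReLU nonlinearity with a SwiGLU gated unit. The gating step is shown here with bias terms, for symmetry with the ReLU formula; LLaMA itself omits the biases and follows the gate with an output projection back to the model dimension: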
$$\text{FFN}_{\text{SwiGLU}}(x) = (\text{Swish}(x W_1 + b_1)) \otimes (x W_2 + b_2)$$
where:
- $\text{Swish}(z) = z \cdot \sigma(z)$ (Swish activation)
- $\sigma(z) = 1 / (1 + e^{-z})$ (sigmoid function)
- $\otimes$ is element-wise multiplication
The key difference: gating. The output of the first projection is gated (element-wise multiplied) by the output of a separate projection.
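For vector inputs, the gated unit defined above might look like the following sketch. It covers only the gating step as written in this section; the helper names (`swish`, `linear`, `swiglu_unit`) and the list-of-rows layout for the weight matrices are assumptions made for illustration.

```python
import math

def swish(z):
    """Swish(z) = z * sigmoid(z). Fine for moderate z; exp(-z) overflows for very negative z."""
    return z / (1.0 + math.exp(-z))

def linear(x, W, b):
    """Compute x W + b, with x a row vector (list) and W a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def swiglu_unit(x, W1, b1, W2, b2):
    """Swish(x W1 + b1) multiplied element-wise by (x W2 + b2)."""
    activated = linear(x, W1, b1)   # branch that goes through Swish
    gate = linear(x, W2, b2)        # separate linear branch
    return [swish(a) * g for a, g in zip(activated, gate)]

# Toy example with identity weights so the arithmetic is easy to follow.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
out = swiglu_unit([1.0, 2.0], identity, zeros, identity, zeros)
print(out)  # [swish(1.0) * 1.0, swish(2.0) * 2.0]
```

The two branches see the same input but have their own weights, so the network can learn to suppress or pass each hidden feature independently of how strongly it is activated.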
Numerical Example
Input: $x = 1.5$ (scalar, for simplicity)
Parameters: $W_1 = 2.0, b_1 = 0.5, W_2 = 3.0, b_2 = 0$
Step 1a: Compute pre-activation for first part $$z_1 = x W_1 + b_1 = 1.5 \cdot 2.0 + 0.5 = 3.5$$
Step 1b: Apply Swish $$\text{Swish}(z_1) = z_1 \cdot \sigma(z_1) = 3.5 \cdot \sigma(3.5)$$
where $\sigma(3.5) = 1 / (1 + e^{-3.5}) = 1 / (1 + 0.0302) = 0.9707$
$$\text{Swish}(3.5) = 3.5 \cdot 0.9707 = 3.397$$
Step 2: Compute gate $$z_2 = x W_2 + b_2 = 1.5 \cdot 3.0 + 0 = 4.5$$
Step 3: Multiply (gate) $$\text{FFN}_{\text{SwiGLU}}(1.5) = \text{Swish}(3.5) \otimes z_2 = 3.397 \cdot 4.5 = 15.29$$
For comparison, ReLU with the same parameters would give: $$\text{FFN}_{\text{ReLU}}(1.5) = \text{ReLU}(3.5) \cdot W_2 + b_2 = 3.5 \cdot 3.0 + 0 = 10.5$$
SwiGLU produces a higher value here (15.29 vs. 10.5), mainly because the Swish output is multiplied by the gate value $x W_2 = 4.5$ rather than by the weight $W_2 = 3.0$. The numbers illustrate the mechanics of gating; they are not a general ranking of the two layers.
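The scalar walkthrough above can be checked with a few lines of Python. The script uses the same parameter values as the example and computes at full floating-point precision, so hand calculations that round intermediate values may differ in the last digit.

```python
import math

x, W1, b1, W2, b2 = 1.5, 2.0, 0.5, 3.0, 0.0

z1 = x * W1 + b1                        # pre-activation of the Swish branch: 3.5
sig = 1.0 / (1.0 + math.exp(-z1))       # sigmoid(3.5)
swish_out = z1 * sig                    # Swish(3.5)
z2 = x * W2 + b2                        # gate branch: 4.5
swiglu_out = swish_out * z2             # gated output
relu_out = max(0.0, z1) * W2 + b2       # ReLU FFN with the same parameters

print(round(sig, 4))         # 0.9707
print(round(swish_out, 3))   # 3.397
print(round(swiglu_out, 2))  # 15.29
print(relu_out)              # 10.5
```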
Why SwiGLU?
Empirically, SwiGLU shows:
- Slightly better performance on language benchmarks (~2-3% improvements)
- No dead units (unlike ReLU, which outputs exactly 0, with zero gradient, for every negative input)
- More parameter efficiency (gating allows selective feature usage)
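The "dead unit" point can be illustrated by comparing derivatives at negative inputs. The sketch below uses the analytic derivative $\text{Swish}'(z) = \sigma(z) + z\,\sigma(z)(1 - \sigma(z))$; the function names are illustrative, and the ReLU derivative at exactly $z = 0$ is taken to be 0 by convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu_grad(z):
    """Derivative of max(0, z); zero for every non-positive input."""
    return 1.0 if z > 0 else 0.0

def swish_grad(z):
    """Derivative of z * sigmoid(z)."""
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

for z in (-3.0, -1.0, -0.1):
    # ReLU passes no gradient for these inputs; Swish passes a small non-zero one.
    print(z, relu_grad(z), round(swish_grad(z), 4))
```

Because the Swish gradient is small but non-zero for negative pre-activations, a unit that lands in the negative region can still receive updates and recover.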
3. Rotary Positional Embeddings (RoPE)
The Problem with Absolute Position Embeddings
Standard Transformers learn position embeddings $p_i$ for each position $i = 1, 2, \ldots, L$:
$$\text{input}_i = \text{embedding}(x_i) + p_i$$
Issues:
- Only defined for positions up to training length L
- Generalizes poorly to longer sequences (e.g., trained on 2048 tokens, cannot handle 4096)
- Uses more parameters
Rotary Embeddings (RoPE)
Instead of adding position embeddings, rotate the query and key vectors by an angle proportional to position.
For position $m$, apply a 2D rotation:
$$\mathbf{R}(m, \theta) = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix}$$
Then: $$q'_m = \mathbf{R}(m, \theta) \cdot q_m$$ $$k'_n = \mathbf{R}(n, \theta) \cdot k_n$$
where $q_m, k_n$ are query and key vectors (in practice, applied to pairs of dimensions).
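Applying the rotation "to pairs of dimensions" might be sketched as follows. Several details here are assumptions not fixed by the text: the function names, the choice to pair consecutive entries (some implementations pair the first half of the vector with the second half instead), and the example per-pair angles in `thetas`, which are hypothetical values.

```python
import math

def rotate_pair(v0, v1, angle):
    """Apply the 2D rotation matrix R to the pair (v0, v1)."""
    c, s = math.cos(angle), math.sin(angle)
    return c * v0 - s * v1, s * v0 + c * v1

def apply_rope(vec, position, thetas):
    """Rotate pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by position * thetas[i]."""
    assert len(vec) == 2 * len(thetas), "need one angle per pair of dimensions"
    out = []
    for i, theta in enumerate(thetas):
        a, b = rotate_pair(vec[2 * i], vec[2 * i + 1], position * theta)
        out.extend([a, b])
    return out

thetas = [0.1, 0.01]            # hypothetical angle per position for each pair
q = [1.0, 0.5, -0.3, 0.8]
q_rot = apply_rope(q, position=1, thetas=thetas)
print([round(v, 3) for v in q_rot[:2]])  # [0.945, 0.597]
```

Since each pair is only rotated, the vector's length is unchanged; position is injected purely through direction.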
Numerical Example: 2D Rotation
Query vector at position m=1: $q_1 = [1.0, 0.5]$
Angle basis: $\theta = 0.1$ rad/position
Position 1 angle: $1 \cdot 0.1 = 0.1$ rad
Rotation matrix for position 1: $$\mathbf{R}(1, 0.1) = \begin{bmatrix} \cos(0.1) & -\sin(0.1) \\ \sin(0.1) & \cos(0.1) \end{bmatrix} = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix}$$
Rotated query: $$q'_1 = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.995 - 0.0499 \\ 0.0998 + 0.4975 \end{bmatrix} = \begin{bmatrix} 0.945 \\ 0.597 \end{bmatrix}$$
Now, for key at position n=3:
Position 3 angle: $3 \cdot 0.1 = 0.3$ rad
$$\mathbf{R}(3, 0.1) = \begin{bmatrix} \cos(0.3) & -\sin(0.3) \\ \sin(0.3) & \cos(0.3) \end{bmatrix} = \begin{bmatrix} 0.955 & -0.296 \\ 0.296 & 0.955 \end{bmatrix}$$
If $k_3 = [1.0, 0.5]$: $$k'_3 = \begin{bmatrix} 0.955 & -0.296 \\ 0.296 & 0.955 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.955 - 0.148 \\ 0.296 + 0.4775 \end{bmatrix} = \begin{bmatrix} 0.807 \\ 0.774 \end{bmatrix}$$
Attention between position 1 and 3: $$\text{score} = q'_1 \cdot k'_3 = 0.945 \cdot 0.807 + 0.597 \cdot 0.774 = 0.763 + 0.462 = 1.225$$
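The key insight: this score depends only on the relative distance ($3 - 1 = 2$, an angle difference of $0.2$ rad), not on the absolute positions, because $\mathbf{R}(m, \theta)^\top \mathbf{R}(n, \theta) = \mathbf{R}(n - m, \theta)$. Any two positions that are two apart, such as 11 and 13, give the same attention score for the same $q$ and $k$.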
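The relative-position property is easy to check numerically. The sketch below reuses the example's values ($q = k = [1.0, 0.5]$, $\theta = 0.1$) and compares position pairs with the same offset; the function names and the extra position pair (11, 13) are chosen for illustration.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of q rotated to position m and k rotated to position n."""
    qr = rotate(q, m * theta)
    kr = rotate(k, n * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

q = k = [1.0, 0.5]
print(round(rope_score(q, k, 1, 3), 3))     # 1.225
print(round(rope_score(q, k, 11, 13), 3))   # 1.225 (same offset, same score)
print(round(rope_score(q, k, 1, 4), 3))     # different offset, different score
```

For this particular example $q = k$, so the score reduces to $\|q\|^2 \cos(0.2) = 1.25 \cdot 0.980 \approx 1.225$, matching the hand calculation.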
Generalization Property
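Because RoPE attention scores depend only on relative position (distance), a model trained on sequences of length 2048 can in principle be applied to sequences of 4096 or longer, although quality at distances never seen in training is not guaranteed: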
- Training: sequence length 2048, max angle difference = 2048 × 0.1
- Testing: sequence length 4096, max angle difference = 4096 × 0.1 (larger angle, but still interpretable as “relative position”)
With learned absolute embeddings, you have no way to represent positions beyond 2048.
Summary: The Mathematical Improvements
| Component | Benefit |
|---|---|
| RMSNorm | Simpler, faster than LayerNorm; no mean subtraction; fewer parameters |
| SwiGLU | Smoother activation; gating mechanism; ~2-3% better performance |
| RoPE | Encodes only relative position; generalizes to longer sequences |
None is revolutionary alone, but together they make training more efficient and inference faster while maintaining or improving quality.