The Math: RMSNorm, SwiGLU, and RoPE — LLaMA: Open and Efficient Foundation Language Models

Prerequisite Tutorials

Transformer Architecture — understand attention, feedforward layers
Linear Algebra: Vectors and Matrices
Neural Network Basics
Layer Normalization

1. RMSNorm (Root Mean Square Normalization)

Standard LayerNorm

For reference, standard LayerNorm computes:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$$

$$\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_i + \beta$$

Where $\gamma, \beta$ are learnable parameters.

RMSNorm

RMSNorm simplifies this by removing the mean subtraction:

$$\text{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}$$

$$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x) + \epsilon} \otimes \gamma$$

where $\otimes$ is element-wise multiplication, $\gamma$ is a learnable scale, and $\epsilon$ is a small constant for numerical stability.

Key difference: RMSNorm only normalizes by the root mean square of the vector, not by the variance. No mean subtraction, no $\beta$ bias parameter.

Numerical Example

Input vector: $x = [2, -1, 3, 0]$ (dimension d = 4)

Step 1: Compute sum of squares $$\sum x_i^2 = 2^2 + (-1)^2 + 3^2 + 0^2 = 4 + 1 + 9 + 0 = 14$$

Step 2: Compute RMS $$\text{RMS}(x) = \sqrt{\frac{14}{4}} = \sqrt{3.5} \approx 1.871$$

Step 3: Normalize (assuming $\gamma = 1$, $\epsilon = 0$) $$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} = \frac{[2, -1, 3, 0]}{1.871} = [1.069, -0.535, 1.604, 0.000]$$

Verification: Check that the RMS of the output is 1: $$\text{RMS}(\text{output}) = \sqrt{\frac{(1.069)^2 + (-0.535)^2 + (1.604)^2 + 0^2}{4}} \approx \sqrt{\frac{4.00}{4}} = \sqrt{1.00} = 1.0 \checkmark$$
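The worked example can be reproduced with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function name `rms_norm` and its defaults are illustrative choices, and $\epsilon$ is added to the RMS exactly as in the formula above.

```python
import math

def rms_norm(x, gamma=None, eps=0.0):
    """Apply RMSNorm to a list of floats, following the formula above."""
    d = len(x)
    if gamma is None:
        gamma = [1.0] * d  # identity scale, as assumed in the worked example
    rms = math.sqrt(sum(v * v for v in x) / d)
    # eps is added to the RMS itself (outside the square root)
    return [g * v / (rms + eps) for v, g in zip(x, gamma)]

x = [2.0, -1.0, 3.0, 0.0]
y = rms_norm(x)
print([round(v, 3) for v in y])  # [1.069, -0.535, 1.604, 0.0]

# The normalized vector should itself have an RMS of 1
rms_out = math.sqrt(sum(v * v for v in y) / len(y))
print(round(rms_out, 6))  # 1.0
```

Some implementations place $\epsilon$ inside the square root instead; with a small $\epsilon$ the two variants give nearly identical results.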

Comparison: RMSNorm vs. LayerNorm

For the same input $x = [2, -1, 3, 0]$:

LayerNorm:

Mean: $\mu = (2 - 1 + 3 + 0) / 4 = 1.0$
Variance: $\sigma^2 = ((2-1)^2 + (-1-1)^2 + (3-1)^2 + (0-1)^2) / 4 = (1 + 4 + 4 + 1) / 4 = 2.5$
Std: $\sigma = \sqrt{2.5} = 1.581$
Output: $[(2-1)/1.581, (-1-1)/1.581, (3-1)/1.581, (0-1)/1.581] = [0.632, -1.265, 1.265, -0.632]$ (after scaling with $\gamma=1$)

RMSNorm:

RMS: $\sqrt{14/4} = 1.871$
Output: $[1.069, -0.535, 1.604, 0.000]$ (as computed above)

Both normalize, but LayerNorm centers around zero (output has mean ≈ 0), while RMSNorm does not. RMSNorm is simpler (no mean computation) and slightly faster.
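The comparison can be checked side by side. The sketch below assumes $\gamma = 1$, $\beta = 0$, and $\epsilon = 0$, as in the hand calculation; the helper names are illustrative.

```python
import math

def layer_norm(x, eps=0.0):
    """LayerNorm with gamma = 1 and beta = 0: center, then divide by the std."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=0.0):
    """RMSNorm with gamma = 1: divide by the root mean square, no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]

def mean(v):
    return sum(v) / len(v)

x = [2.0, -1.0, 3.0, 0.0]
ln_out = layer_norm(x)
rn_out = rms_norm(x)

print([round(v, 3) for v in ln_out])  # [0.632, -1.265, 1.265, -0.632]
print([round(v, 3) for v in rn_out])  # [1.069, -0.535, 1.604, 0.0]
print(round(mean(rn_out), 3))         # 0.535 (RMSNorm output is not centered)
```

The mean of the LayerNorm output is zero up to floating-point error, while the RMSNorm output keeps the sign and relative size of the input mean.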

2. SwiGLU Activation Function

Standard FFN with ReLU

In the original Transformer, the position-wise feedforward network is:

$$\text{FFN}_{\text{ReLU}}(x) = \text{ReLU}(x W_1 + b_1) \cdot W_2 + b_2$$

where ReLU$(z) = \max(0, z)$.

SwiGLU FFN

LLaMA replaces the ReLU activation with SwiGLU:

$$\text{FFN}_{\text{SwiGLU}}(x) = (\text{Swish}(x W_1 + b_1)) \otimes (x W_2 + b_2)$$

where:

$\text{Swish}(z) = z \cdot \sigma(z)$ (Swish activation)
$\sigma(z) = 1 / (1 + e^{-z})$ (sigmoid function)
$\otimes$ is element-wise multiplication

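The key difference is gating: the Swish-activated output of the first projection is multiplied element-wise by the output of a separate linear projection of the same input. The formula above shows only this gated product; the full layer then applies an output projection, as $W_2$ does in the ReLU FFN, and LLaMA's implementation omits the bias terms.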
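For vector inputs, the gated product is followed by a projection back to the model dimension. The sketch below is one way to write such a layer in plain Python, under stated assumptions: the bias terms are dropped (as in LLaMA-style layers), and the output matrix `w_out` and the helper names are hypothetical additions that do not appear in the formula above.

```python
import math

def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def swiglu_ffn(x, w1, w2, w_out):
    """Bias-free SwiGLU feedforward layer: (Swish(x W1) * (x W2)) W_out."""
    activated = [swish(v) for v in matvec(x, w1)]   # Swish branch
    gate = matvec(x, w2)                            # linear gate branch
    hidden = [a * g for a, g in zip(activated, gate)]  # element-wise gating
    return matvec(hidden, w_out)                    # back to model dimension

# Toy shapes: model dimension 2, hidden dimension 3 (values are arbitrary)
x = [1.0, -0.5]
w1 = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
w2 = [[1.0, 0.5, -1.0], [0.0, 2.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
y = swiglu_ffn(x, w1, w2, w_out)
print(len(y))  # 2
```

Keeping the gate branch linear means each hidden unit's contribution is scaled by a learned, input-dependent factor rather than by a fixed weight.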

Numerical Example

Input: $x = 1.5$ (scalar, for simplicity)

Parameters: $W_1 = 2.0, b_1 = 0.5, W_2 = 3.0, b_2 = 0$

Step 1a: Compute pre-activation for first part $$z_1 = x W_1 + b_1 = 1.5 \cdot 2.0 + 0.5 = 3.5$$

Step 1b: Apply Swish $$\text{Swish}(z_1) = z_1 \cdot \sigma(z_1) = 3.5 \cdot \sigma(3.5)$$

where $\sigma(3.5) = 1 / (1 + e^{-3.5}) = 1 / (1 + 0.0302) = 0.9707$

$$\text{Swish}(3.5) = 3.5 \cdot 0.9707 = 3.397$$

Step 2: Compute gate $$z_2 = x W_2 + b_2 = 1.5 \cdot 3.0 + 0 = 4.5$$

Step 3: Multiply (gate) $$\text{FFN}_{\text{SwiGLU}}(1.5) = \text{Swish}(3.5) \otimes z_2 = 3.397 \cdot 4.5 = 15.29$$

For comparison, ReLU would give: $$\text{FFN}_{\text{ReLU}}(1.5) = \text{ReLU}(3.5) \cdot W_2 + b_2 = 3.5 \cdot 3.0 + 0 = 10.5$$

The SwiGLU output is larger here (15.29 vs. 10.5) because the Swish branch is multiplied by the input-dependent gate $z_2 = 4.5$ rather than by the fixed weight $W_2 = 3.0$. The magnitudes in this toy example illustrate the mechanics only; they say nothing about model quality.
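The scalar example can be recomputed at full floating-point precision with the short sketch below. The function names are illustrative, and the results may differ from hand-rounded figures in the last digit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def swiglu_gate(x, w1, b1, w2, b2):
    """Gated product Swish(x*w1 + b1) * (x*w2 + b2) for scalar inputs."""
    return swish(x * w1 + b1) * (x * w2 + b2)

def relu_ffn(x, w1, b1, w2, b2):
    """Scalar ReLU feedforward: ReLU(x*w1 + b1) * w2 + b2."""
    return max(0.0, x * w1 + b1) * w2 + b2

print(round(sigmoid(3.5), 4))                           # 0.9707
print(round(swish(3.5), 3))                             # 3.397
print(round(swiglu_gate(1.5, 2.0, 0.5, 3.0, 0.0), 2))   # 15.29
print(relu_ffn(1.5, 2.0, 0.5, 3.0, 0.0))                # 10.5
```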

Why SwiGLU?

Empirically, SwiGLU shows:

Slightly better performance on language benchmarks (~2-3% improvements)
No dead units (unlike ReLU, which outputs exactly 0, with zero gradient, for every negative input)
More parameter efficiency (gating allows selective feature usage)
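The dead-unit point comes down to gradients. The sketch below compares the derivative of ReLU with the derivative of Swish, $\sigma(z) + z\,\sigma(z)(1 - \sigma(z))$, at a few negative sample inputs; the sample points and function names are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish_grad(z):
    """Derivative of Swish(z) = z * sigmoid(z)."""
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

# ReLU passes no gradient for negative inputs; Swish still passes a small one
for z in (-4.0, -2.0, -0.5):
    print(z, relu_grad(z), round(swish_grad(z), 4))
```

At each of these points the ReLU gradient is exactly zero, while the Swish gradient is small but nonzero, so a unit pushed into the negative range can still receive updates.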

3. Rotary Positional Embeddings (RoPE)

The Problem with Absolute Position Embeddings

Transformers with learned absolute position embeddings store a vector $p_i$ for each position $i = 1, 2, \ldots, L$ and add it to the token embedding:

$$\text{input}_i = \text{embedding}(x_i) + p_i$$

Issues:

Only defined for positions up to training length L
Generalizes poorly to longer sequences (e.g., trained on 2048 tokens, cannot handle 4096)
Uses more parameters

Rotary Embeddings (RoPE)

Instead of adding position embeddings, rotate the query and key vectors by an angle proportional to position.

For position $m$, apply a 2D rotation:

$$\mathbf{R}(m, \theta) = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix}$$

Then: $$q'_m = \mathbf{R}(m, \theta) \cdot q_m$$ $$k'_n = \mathbf{R}(n, \theta) \cdot k_n$$

where $q_m, k_n$ are query and key vectors (in practice, applied to pairs of dimensions).
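A minimal sketch of the pairwise application follows. Two details are assumptions not stated above: adjacent dimensions $(2i, 2i+1)$ form each pair, and each pair gets its own angle $\theta_i = \text{base}^{-2i/d}$ with base 10000, the convention used in the RoPE paper. The function names are hypothetical.

```python
import math

def rope_frequencies(d, base=10000.0):
    """One angle per pair of dimensions: theta_i = base ** (-2i / d)."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope(vec, pos, thetas):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle pos * thetas[i]."""
    out = []
    for i, theta in enumerate(thetas):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([a * c - b * s, a * s + b * c])  # 2D rotation of the pair
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
thetas = rope_frequencies(len(q))

# Same relative distance (2) at different absolute positions
score_a = dot(apply_rope(q, 1, thetas), apply_rope(k, 3, thetas))
score_b = dot(apply_rope(q, 5, thetas), apply_rope(k, 7, thetas))
print(abs(score_a - score_b) < 1e-12)  # True
```

Because every pair is rotated independently, the vector's length is unchanged; only the angles between queries and keys carry the position information.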

Numerical Example: 2D Rotation

Query vector at position m=1: $q_1 = [1.0, 0.5]$

Angle basis: $\theta = 0.1$ rad/position

Position 1 angle: $1 \cdot 0.1 = 0.1$ rad

Rotation matrix for position 1: $$\mathbf{R}(1, 0.1) = \begin{bmatrix} \cos(0.1) & -\sin(0.1) \\ \sin(0.1) & \cos(0.1) \end{bmatrix} = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix}$$

Rotated query: $$q'_1 = \begin{bmatrix} 0.995 & -0.0998 \\ 0.0998 & 0.995 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.995 - 0.0499 \\ 0.0998 + 0.4975 \end{bmatrix} = \begin{bmatrix} 0.945 \\ 0.597 \end{bmatrix}$$

Now, for key at position n=3:

Position 3 angle: $3 \cdot 0.1 = 0.3$ rad

$$\mathbf{R}(3, 0.1) = \begin{bmatrix} \cos(0.3) & -\sin(0.3) \\ \sin(0.3) & \cos(0.3) \end{bmatrix} = \begin{bmatrix} 0.955 & -0.296 \\ 0.296 & 0.955 \end{bmatrix}$$

If $k_3 = [1.0, 0.5]$: $$k'_3 = \begin{bmatrix} 0.955 & -0.296 \\ 0.296 & 0.955 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.955 - 0.148 \\ 0.296 + 0.4775 \end{bmatrix} = \begin{bmatrix} 0.807 \\ 0.774 \end{bmatrix}$$

Attention between position 1 and 3: $$\text{score} = q'_1 \cdot k'_3 = 0.945 \cdot 0.807 + 0.597 \cdot 0.774 = 0.763 + 0.462 = 1.225$$

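The key insight: this score depends only on the relative distance (3 - 1 = 2), not on the absolute positions. Rotation matrices satisfy $\mathbf{R}(m, \theta)^\top \mathbf{R}(n, \theta) = \mathbf{R}(n - m, \theta)$, so $q'_m \cdot k'_n = q_m^\top \mathbf{R}(n - m, \theta)\, k_n$. Any pair of positions with the same angle difference ($0.2$ rad here) gives the same attention score for the same $q$ and $k$, regardless of starting position.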
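The relative-position property can be checked numerically with the vectors from the example. This is a small sketch in plain Python; the helper names and the extra position pairs (5, 7) and (1, 4) are illustrative choices.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

theta = 0.1
q = [1.0, 0.5]
k = [1.0, 0.5]

score_1_3 = dot(rotate(q, 1 * theta), rotate(k, 3 * theta))  # distance 2
score_5_7 = dot(rotate(q, 5 * theta), rotate(k, 7 * theta))  # distance 2
score_1_4 = dot(rotate(q, 1 * theta), rotate(k, 4 * theta))  # distance 3

print(round(score_1_3, 3))  # 1.225
print(round(score_5_7, 3))  # 1.225
print(round(score_1_4, 3))  # 1.194
```

The first two scores agree because both pairs are two positions apart; changing the distance to three changes the score.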

Generalization Property

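Because RoPE encodes position only through relative distance and the rotation is defined for every position, a model trained on sequences of length 2048 can still be applied to sequences of 4096 tokens or longer, although quality at distances never seen in training is not guaranteed: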

Training: sequence length 2048, max angle difference = 2048 × 0.1
Testing: sequence length 4096, max angle difference = 4096 × 0.1 (larger angle, but still interpretable as “relative position”)

With learned absolute embeddings, you have no way to represent positions beyond 2048.

Summary: The Mathematical Improvements

| Component | Benefit |
| --- | --- |
| RMSNorm | Simpler, faster than LayerNorm; no mean subtraction; fewer parameters |
| SwiGLU | Smoother activation; gating mechanism; ~2-3% better performance |
| RoPE | Encodes only relative position; generalizes to longer sequences |

None is revolutionary alone, but together they make training more efficient and inference faster while maintaining or improving quality.