LLaMA tackled two related problems:
Problem 1: Suboptimal Model Scaling
The old assumption: To get better language models, train bigger models.
- GPT-2: 1.5B parameters
- GPT-3: 175B parameters
- PaLM: 540B parameters
The pattern: bigger = better. So keep scaling up.
The issue: From Chinchilla (Paper 13), we know this is suboptimal. When you have a fixed compute budget, you should:
- Train a smaller model
- On MORE data
- For longer
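Concrete example: GPT-3's training run used roughly 3 × 10^23 FLOPs (using the common estimate of 6 × parameters × tokens). Compare how that budget was spent with the Chinchilla/LLaMA approach:
Old approach (GPT-3 style):
- Model: 175B parameters
- Training tokens: ~300B (fewer than 2 tokens per parameter)
- Cost: ~3 × 10^23 FLOPs
- Result: Strong, but data-starved
Chinchilla/LLaMA approach:
- Model: 7-13B parameters (roughly 13-25x smaller)
- Training tokens: ~1 trillion at these sizes (over 3x more); the larger 33B and 65B LLaMA models use 1.4 trillion
- Cost: ~0.4-0.8 × 10^23 FLOPs (a fraction of the GPT-3 budget)
- Result: Comparable or better performance with far fewer parameters
Why was GPT-3 suboptimal? Because 300B tokens is too little data for a 175B-parameter model. The Chinchilla analysis suggests roughly 20 training tokens per parameter for compute-optimal training, and GPT-3 saw fewer than 2, so much of its capacity was left underused. A smaller model has less capacity to spare, and feeding it far more data uses that capacity fully. LLaMA pushes past the Chinchilla optimum on purpose: it trains small models on even more tokens per parameter, because a small model is much cheaper to run at inference time.
The implication: You do not need a 175B model to match GPT-3. The LLaMA paper reports that LLaMA-13B outperforms GPT-3 on most benchmarks.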
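The budget comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the common approximation of 6 × parameters × tokens for training FLOPs, and takes about 1T training tokens for the 7B and 13B LLaMA models. The function and dictionary names are illustrative, and other figures can be substituted.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


# (parameter count, training tokens); these figures are assumptions of the sketch
runs = {
    "GPT-3 175B": (175e9, 300e9),
    "LLaMA 13B": (13e9, 1.0e12),
    "LLaMA 7B": (7e9, 1.0e12),
}


def summarize(runs):
    """Return approximate FLOPs and tokens-per-parameter for each run."""
    rows = {}
    for name, (n_params, n_tokens) in runs.items():
        rows[name] = {
            "flops": training_flops(n_params, n_tokens),
            "tokens_per_param": n_tokens / n_params,
        }
    return rows


summary = summarize(runs)
for name, row in summary.items():
    print(f"{name:12s} {row['flops']:.2e} FLOPs, "
          f"{row['tokens_per_param']:.1f} tokens/param")
```

Under these assumptions the small models see tens of tokens per parameter while GPT-3 sees fewer than two, which is the imbalance the section describes.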
Problem 2: Proprietary Models Block Research
Closed access: In late 2022, the state-of-the-art models were:
- GPT-3 (OpenAI): Closed. Only accessible via API. Weights not public.
- PaLM (Google): Closed. Weights not public.
- Chinchilla (DeepMind): Closed. Weights not public.
What could researchers do?
- Use existing open models: Models like BLOOM (176B) were open but less capable than GPT-3, and at that size still too large for most labs to run.
- Call the API: Use OpenAI’s API for GPT-3, but this was expensive (~$0.02 per 1K tokens) and rate-limited.
- Train from scratch: Train your own model, but this requires massive compute (thousands of GPUs, millions of dollars).
The problem for researchers:
- A PhD student at a university in Pune cannot train a 175B model; their institution lacks the compute.
- They cannot fine-tune GPT-3 weights (weights not available); they can only call the API.
- They cannot study what GPT-3 learned or make architectural modifications.
- Innovation is concentrated at a few labs with massive compute.
Barriers to progress:
- Only researchers at OpenAI, Google, DeepMind, and a few other labs can push the frontier.
- Thousands of talented researchers globally are locked out of working with frontier models.
- The field becomes less diverse; ideas come from fewer institutions.
Architectural Inefficiencies
Beyond scaling, LLaMA addressed known architectural inefficiencies:
LayerNorm Is Slow and Unstable
Standard LayerNorm (used in GPT-3):
For input x:
1. Compute mean: μ = (1/d) Σ x_i
2. Compute variance: σ² = (1/d) Σ (x_i - μ)²
3. Normalize: x̂ = (x - μ) / √(σ² + ε)
4. Scale: y = γ * x̂ + β (learnable γ, β)
This requires:
- Computing the mean (one reduction over all d features)
- Computing the variance (a second reduction, which depends on the mean)
- A square root and a division for every token
- Storing two learnable parameter vectors (γ and β)
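The four LayerNorm steps can be written out directly. This is a minimal pure-Python sketch on a single feature vector, not an optimized implementation. For comparison it also includes RMSNorm, the cheaper variant the LLaMA paper reports using, which skips the mean subtraction and the β shift. The function names and the `eps` default are assumptions of this sketch.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over one feature vector x (a list of floats)."""
    d = len(x)
    mean = sum(x) / d                                  # step 1: mean
    var = sum((xi - mean) ** 2 for xi in x) / d        # step 2: variance
    inv_std = 1.0 / math.sqrt(var + eps)               # step 3: normalize
    return [g * (xi - mean) * inv_std + b              # step 4: scale and shift
            for xi, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: rescale by the root mean square only (no mean, no beta)."""
    d = len(x)
    mean_sq = sum(xi * xi for xi in x) / d
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * xi * inv_rms for xi, g in zip(x, gamma)]


x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y_ln = layer_norm(x, ones, zeros)
y_rms = rms_norm(x, ones)
print(y_ln)
print(y_rms)
```

LayerNorm needs two passes over the features (mean, then variance) plus two parameter vectors, while the RMS variant needs one pass and one vector.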
Pre-normalization vs. Post-normalization:
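Post-norm (original Transformer style):
Self-Attention → Add → LayerNorm → FFN → Add → LayerNorm
(normalization applied after each residual addition)
Pre-norm (used by GPT-2, GPT-3, and LLaMA):
LayerNorm → Self-Attention → Add → LayerNorm → FFN → Add
(normalization applied to each sub-layer's input, before the residual addition)
Pre-norm is more stable during training because only the sub-layer input is normalized. The residual path itself is left untouched, so activations and gradients pass straight through the skip connections, which makes deep stacks less prone to exploding or vanishing signals.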
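The two orderings differ only in where the normalization sits relative to the residual addition. The following is a minimal sketch: `attn` and `ffn` are placeholder callables, not real attention or feedforward layers, and the normalization omits learnable parameters for brevity. With a sub-layer that outputs zeros, the pre-norm block leaves its input unchanged, while the post-norm block still rescales it.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable scale/shift, to keep the sketch short."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def add(a, b):
    """Element-wise residual addition."""
    return [ai + bi for ai, bi in zip(a, b)]


def post_norm_block(x, attn, ffn):
    """Sub-layer -> Add -> LayerNorm, applied twice."""
    x = layer_norm(add(x, attn(x)))
    x = layer_norm(add(x, ffn(x)))
    return x


def pre_norm_block(x, attn, ffn):
    """LayerNorm -> Sub-layer -> Add, applied twice."""
    x = add(x, attn(layer_norm(x)))
    x = add(x, ffn(layer_norm(x)))
    return x


def zero_sublayer(v):
    """Hypothetical stand-in for a sub-layer that contributes nothing."""
    return [0.0 for _ in v]


x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_block(x, zero_sublayer, zero_sublayer)
post = post_norm_block(x, zero_sublayer, zero_sublayer)
print(pre)   # residual stream unchanged: [1.0, 2.0, 3.0, 4.0]
print(post)  # rescaled even though the sub-layers added nothing
```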
SwiGLU vs. ReLU
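In the feedforward network, the original Transformer used ReLU (GPT-3 used the closely related GELU):
FFN(x) = ReLU(x W + b) V
LLaMA uses SwiGLU (Swish-Gated Linear Unit), which gates one linear projection with another and then projects back to the model dimension (biases omitted):
FFN(x) = (Swish(x W) ⊙ (x V)) W₂
where Swish(x) = x * sigmoid(x)
SwiGLU has been reported to improve model quality modestly at similar compute. The gated form needs three weight matrices instead of two, so LLaMA shrinks the hidden dimension to 2/3 of the usual 4d to keep the parameter count comparable.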
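The two feedforward variants can be compared side by side. This is a minimal pure-Python sketch on a single vector. It assumes the bias-free SwiGLU form with an output projection, here called `W2`. The helper names (`matvec`, `relu_ffn`, `swiglu_ffn`) are illustrative and do not come from any library.

```python
import math


def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def relu_ffn(x, W, V):
    """ReLU feedforward: ReLU(x W) V (bias omitted in this sketch)."""
    hidden = [max(0.0, h) for h in matvec(x, W)]
    return matvec(hidden, V)


def swiglu_ffn(x, W, V, W2):
    """SwiGLU feedforward: (Swish(x W) * (x V)) W2, with no biases."""
    gate = [swish(g) for g in matvec(x, W)]
    up = matvec(x, V)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, W2)


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, -2.0]
out_relu = relu_ffn(x, identity, identity)
out_swiglu = swiglu_ffn(x, identity, identity, identity)
print(out_relu)
print(out_swiglu)
```

With identity weights, the SwiGLU output is simply Swish(x_i) · x_i per element. Unlike ReLU, the gate passes a small signal for negative inputs instead of zeroing them.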
Fixed Positional Embeddings vs. Learned Absolute vs. Rotary
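GPT-3 used learned absolute positional embeddings, which have no entry for positions beyond the training length. LLaMA uses Rotary Positional Embeddings (RoPE), which encode position by rotating the query and key vectors so that their dot product depends only on the relative offset between two tokens. RoPE is often reported to generalize better to unseen sequence lengths (e.g., a model trained on 2048 tokens can handle 4096 tokens more gracefully), although long-context use in practice usually still needs further adaptation.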
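The relative-position property of RoPE can be demonstrated numerically. This is a minimal sketch, assuming the adjacent-pair rotation convention and a frequency base of 10000; real implementations differ in how they pair dimensions. The function names are illustrative.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)      # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(q, k, q_pos, k_pos):
    """Attention logit between a query at q_pos and a key at k_pos."""
    return dot(rope(q, q_pos), rope(k, k_pos))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near = score(q, k, 3, 1)          # offset of 2 near the start
far = score(q, k, 3000, 2998)     # same offset of 2, much later in the sequence
print(near, far)
```

The two scores agree up to floating-point error because rotating both vectors changes only their relative angle, which is what lets the same attention pattern apply at positions the model never saw during training.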
Summary: Two Concrete Problems
- Training inefficiency: Models like GPT-3 are trained with suboptimal compute allocation. A smaller model on more data would perform better.
- Access bottleneck: State-of-the-art models are proprietary and closed. Most researchers cannot access frontier-level models, limiting innovation and diversity in the field.
LLaMA addresses both: it is a family of smaller, more efficiently trained models whose weights were released to the research community.