LLaMA: Open and Efficient Foundation Language Models
What This Paper Did
Imagine two paths to build a powerful engine: (1) Build a massive engine with huge fuel tanks and run it for 6 months, or (2) Build a smaller, more efficient engine and run it for 5 years on the same total fuel budget. Which gives better performance?
LLaMA chose path 2.
Meta AI built on the Chinchilla scaling laws (from Paper 13), which favor training smaller models on more tokens over training large models on fewer tokens, and released language models at four sizes (7B, 13B, 33B, 65B parameters) with weights available to researchers. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being more than 10x smaller.
The key innovations:
- Chinchilla-style scaling: Train smaller models (7B to 65B instead of 175B) for longer (1.0 to 1.4 trillion tokens instead of ~300B)
- RMSNorm: Replace LayerNorm with RMS Normalization, applied to the input of each sub-layer (pre-normalization); it skips the mean-centering step, so it is simpler and cheaper to compute
- SwiGLU activation: Replace ReLU with Swish-Gated Linear Unit in the feedforward network
- Rotary Positional Embeddings (RoPE): Encode position as rotations of the query/key vectors, so attention scores depend on the relative distance between tokens
- Open weights: Release model weights publicly for research
The result: Open-source LLMs that rival proprietary models, sparking the open-source LLM revolution.
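To make the RoPE bullet concrete, here is a minimal sketch in plain Python. It assumes the common formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the token position, with a frequency base of 10000; the function names `rope` and `dot`, the pairing convention, and the toy vectors are illustrative choices, not taken from the LLaMA code.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]

# Same offset (2 positions apart) at two different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 103), rope(k, 101))
```

Because each pair is only rotated, the two scores agree up to floating-point error: the attention score depends on the offset between the two positions, not on where they sit in the sequence.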
Key numbers:
LLaMA-13B vs. GPT-3 (175B):
- Parameters: 13B vs. 175B (13.5x smaller)
- Training tokens: 1.0T vs. ~300B (3.3x more data)
- Benchmark performance: Outperforms GPT-3 on most benchmarks
- Inference cost: Much lower (can run on a single GPU)
Training compute:
- LLaMA-65B: about 21 days on 2,048 A100 (80GB) GPUs, roughly 43,000 A100 GPU days
- Total data: Publicly available (no proprietary data)
The Indian Analogy
Chinchilla scaling: Instead of training 175 experienced engineers for 6 months on basic company procedures, train 13 talented engineers intensively for 5 years. The smaller, well-trained team beats the larger, briefly-trained team.
Open release: Like IIT publishing its entire curriculum, lecture notes, and exam solutions online for free. Previously, only students who got into IIT could study this material. Now, any motivated student in Bhilai, Tirunelveli, or Patna can access the same resources. LLaMA democratized frontier AI research the same way.
RMSNorm: A simpler, faster way to normalize inputs to neural networks — like using a sleeker gear in a machine instead of a bulkier one. Same effect (stabilizing the signal), less friction.
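The difference from LayerNorm is easiest to see side by side. The sketch below is a plain-Python illustration, not the LLaMA implementation: `rms_norm` divides by the root mean square only, while `layer_norm` (shown without learned gain or bias) also subtracts the mean. The `eps` value and function names are assumptions made for the example.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain."""
    if gain is None:
        gain = [1.0] * len(x)
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned parameters: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x)

# After RMSNorm the vector has (approximately) unit root mean square.
rms_y = math.sqrt(sum(v * v for v in y) / len(y))
```

RMSNorm keeps the direction of `x` and only rescales it, which is why it needs one statistic per vector instead of two.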
Read in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01 - Context | Why pre-training at scale matters; the limits of proprietary models | Beginner | 5 min |
| 02 - The Problem | Proprietary models are expensive and closed; scaling laws suggest better alternatives | Intermediate | 5 min |
| 03 - The Idea | Chinchilla scaling, open-source release, architectural innovations | Intermediate | 8 min |
| 04 - The Math | RMSNorm formulation; SwiGLU; RoPE rotation mechanics; scaling laws | Intermediate | 10 min |
| 05 - Worked Example | Compute RMSNorm on example tensor; trace SwiGLU computation; illustrate RoPE | Intermediate | 8 min |
| 06 - The Code | Implement RMSNorm from scratch; load LLaMA from Hugging Face; run inference | Beginner | 6 min |
| 07 - Limitations | Limited context, English-centric, no RLHF in base; misuse risks | Advanced | 4 min |
| 08 - Impact | o1, Alpaca, Mistral, thousands of fine-tunes; open-source revolution | Intermediate | 3 min |
| 09 - Summary | One-line recap, key ideas, numbers, scaling principles, what next | Beginner | 1 min |
Before You Read: Math and AI Concepts You’ll Need
- Transformer Architecture (Paper 08): LLaMA is a transformer; understanding attention, feedforward layers, and layer norm is essential
- Scaling Laws (Paper 13 / Chinchilla): The Chinchilla-optimal allocation of compute to model size and data
- GPT-3 (Paper 12): The baseline that LLaMA is compared against
- Linear Algebra: Vectors, matrices, norms, rotations
- Layer Normalization: Basic understanding of how to stabilize neural network training
Visual Overview: LLaMA’s Architectural Components
Input Tokens
|
Embedding Layer (E)
|
Pre-Norm (RMSNorm)
|
Self-Attention (RoPE applied to queries and keys)
|
Residual Connection
|
Pre-Norm (RMSNorm again)
|
Feedforward (SwiGLU instead of ReLU)
|
Residual Connection
|
Stack x N Layers (32 for 7B, 40 for 13B, 60 for 33B, 80 for 65B)
|
Output Normalization (RMSNorm)
|
Token Prediction Head
|
Next Token Probability
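The feedforward path in the diagram (normalize, apply SwiGLU, add the result back to the input) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the weight names `W_gate`, `W_up`, and `W_down`, the 2-by-3 sizes, and the numeric values are all hypothetical, and the learned RMSNorm gain is omitted.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def rms_norm(x, eps=1e-6):
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """FFN(x) = W_down (silu(W_gate x) * (W_up x)), with no bias terms."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(W_down, hidden)

def pre_norm_ffn_sublayer(x, W_gate, W_up, W_down):
    """Pre-norm residual sub-layer: x + FFN(RMSNorm(x))."""
    update = swiglu_ffn(rms_norm(x), W_gate, W_up, W_down)
    return [a + b for a, b in zip(x, update)]

# Toy sizes: model width 2, hidden width 3 (hypothetical values).
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

x = [3.0, 4.0]
y = pre_norm_ffn_sublayer(x, W_gate, W_up, W_down)
```

Normalizing the sub-layer input rather than its output leaves the residual stream `x` untouched, so the identity path from input to output stays intact through every layer.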
Comparison: LLaMA vs. GPT-3
| Aspect | LLaMA-13B | GPT-3 (175B) | LLaMA-65B |
|---|---|---|---|
| Parameters | 13B | 175B | 65B |
| Training Tokens | 1.0T | ~300B | 1.4T |
| Architecture | Transformer + RMSNorm + RoPE + SwiGLU | Standard Transformer | Transformer + RMSNorm + RoPE + SwiGLU |
| Training Compute (FLOPs, estimated as 6 x params x tokens) | ~0.8e23 | ~3.1e23 | ~5.5e23 |
| Inference Cost | Low (1-2 GPUs) | Very High (8-32 GPUs) | Medium (4-8 GPUs) |
| MMLU Benchmark (5-shot) | 46.9% | 43.9% | 63.4% |
| Weights Released | Yes (research) | No | Yes (research) |
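Insight: LLaMA-65B outperforms GPT-3 (175B) with 2.7x fewer parameters. It was trained on about 4.7x more tokens, which puts its training compute in the same order of magnitude as GPT-3's (roughly 1.7x by the 6 x params x tokens estimate), while inference is far cheaper.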
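The ratios behind this comparison are quick to check. The sketch below uses the common rule of thumb that training costs about 6 FLOPs per parameter per token; that approximation and the helper name `train_flops` are assumptions for illustration, and the parameter and token counts are the rounded figures from the table.

```python
def train_flops(params, tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens

llama_65b = {"params": 65e9, "tokens": 1.4e12}
gpt3 = {"params": 175e9, "tokens": 300e9}

param_ratio = gpt3["params"] / llama_65b["params"]
token_ratio = llama_65b["tokens"] / gpt3["tokens"]
compute_ratio = train_flops(**llama_65b) / train_flops(**gpt3)

print(f"parameters: {param_ratio:.1f}x fewer")   # parameters: 2.7x fewer
print(f"tokens:     {token_ratio:.1f}x more")    # tokens:     4.7x more
print(f"compute:    {compute_ratio:.2f}x")       # compute:    1.73x
```

Under this estimate the two training budgets are the same order of magnitude, with LLaMA-65B somewhat higher; the savings show up at inference time, where cost scales with parameter count rather than with training tokens.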