LLaMA: Open and Efficient Foundation Language Models

What This Paper Did

Imagine two paths to build a powerful engine: (1) Build a massive engine with huge fuel tanks and run it for 6 months, or (2) Build a smaller, more efficient engine and run it for 5 years on the same total fuel budget. Which gives better performance?

LLaMA chose path 2.

Meta AI built on the Chinchilla scaling laws (from Paper 13): train smaller models on more tokens, rather than larger models on fewer tokens. It released a family of language models (7B, 13B, 33B, 65B parameters) with weights available to researchers. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.
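One way to see the "smaller model, more tokens" trade is to count training tokens per parameter. The sketch below uses rounded, publicly reported model sizes and token counts; the dictionary and function names are illustrative, not taken from the paper.

```python
# Tokens seen per parameter: a rough gauge of how long a model was trained
# relative to its size. Sizes and token counts are rounded public figures.
MODELS = {
    "GPT-3 175B": (175e9, 300e9),
    "Chinchilla 70B": (70e9, 1.4e12),
    "LLaMA-7B": (7e9, 1.0e12),
    "LLaMA-13B": (13e9, 1.0e12),
    "LLaMA-65B": (65e9, 1.4e12),
}


def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """Return how many training tokens the model saw per parameter."""
    return n_tokens / n_params


ratios = {name: tokens_per_parameter(p, t) for name, (p, t) in MODELS.items()}

for name, ratio in ratios.items():
    print(f"{name:>15}: {ratio:6.1f} tokens per parameter")
```

Chinchilla sits near 20 tokens per parameter, while the smaller LLaMA models go far beyond that. The paper's argument is that a model trained longer than the compute-optimal point costs more to train but is cheaper to serve at a given quality level.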

The key innovations:

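Chinchilla-style scaling: Train smaller models (7B to 65B instead of 175B) for longer (1.0 to 1.4 trillion tokens instead of 300B)
RMSNorm pre-normalization: Normalize the input of each sub-layer with RMSNorm instead of LayerNorm; it skips the mean subtraction, so it is simpler and cheaper per step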
SwiGLU activation: Replace ReLU with Swish-Gated Linear Unit in the feedforward network
Rotary Positional Embeddings (RoPE): Replace absolute positional embeddings with rotations of the query/key vectors at every layer, so attention scores depend on the relative position of tokens
Open weights: Release model weights publicly for research
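The RoPE item above can be made concrete with a small sketch. It rotates consecutive pairs of dimensions by an angle proportional to the token position, then checks that the query/key dot product depends only on the position offset. The pairing of adjacent dimensions and the base of 10000 are assumptions chosen for illustration; real implementations may pair dimensions differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # per-pair frequency: base^(-2k/d) with k = i/2
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])   # 2-D rotation of the pair
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]   # toy query vector
k = [0.2, -1.0, 0.7, 0.4]   # toy key vector

score_a = dot(rope(q, 5), rope(k, 2))       # offset 3, early in the sequence
score_b = dot(rope(q, 103), rope(k, 100))   # offset 3, later in the sequence
score_c = dot(rope(q, 5), rope(k, 4))       # offset 1

print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error because both pairs are three positions apart; the third differs because the offset changed. Rotation also leaves vector lengths unchanged, so position information enters only through the angles.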

The result: Open-source LLMs that rival proprietary models, sparking the open-source LLM revolution.

Key numbers:

LLaMA-13B vs. GPT-3 (175B):
- Parameters: 13B vs. 175B (13.5x smaller)
- Training tokens: 1.0T vs. ~300B (about 3.3x more data)
- Benchmark performance: Outperforms GPT-3 on most benchmarks
- Inference cost: Much lower (can run on a single GPU)
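The inference-cost gap follows largely from how much memory the weights occupy. Here is a back-of-the-envelope sketch that counts weight storage only; it ignores activations and the attention cache, and the helper name and precision table are illustrative.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("LLaMA-13B", 13e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: {weight_memory_gb(n_params):.0f} GB of fp16 weights")
```

At 16-bit precision the 13B model needs about 26 GB for weights, which fits on one large-memory accelerator, while 175B parameters need about 350 GB and must be split across several devices.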

Training compute:
- LLaMA-65B: about 21 days on 2,048 A100 (80GB) GPUs, roughly 1 million GPU-hours
- Total data: Publicly available (no proprietary data)

The Indian Analogy

Chinchilla scaling: Instead of training 175 experienced engineers for 6 months on basic company procedures, train 13 talented engineers intensively for 5 years. The smaller, well-trained team beats the larger, briefly-trained team.

Open release: Like IIT publishing its entire curriculum, lecture notes, and exam solutions online for free. Previously, only students who got into IIT could study this material. Now, any motivated student in Bhilai, Tirunelveli, or Patna can access the same resources. LLaMA democratized frontier AI research the same way.

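RMSNorm: A simpler, faster way to normalize inputs to neural networks. It rescales each vector by its root mean square and skips the mean subtraction that LayerNorm performs, like using a sleeker gear in a machine instead of a bulkier one. Same effect (stabilizing the signal), less friction.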
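To make the comparison concrete, here is a minimal sketch of RMSNorm next to a bare LayerNorm, written in plain Python on a single vector. The epsilon value and the default gain of all ones are assumptions for illustration; a trained model learns the gain.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Divide x by its root mean square, then apply a per-dimension gain."""
    if gain is None:
        gain = [1.0] * len(x)             # assumed default; learned in a real model
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def layer_norm(x, eps=1e-6):
    """Bare LayerNorm (no gain or bias): subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


x = [2.0, -1.0, 3.0, 0.0]
y_rms = rms_norm(x)
y_ln = layer_norm(x)

print("RMSNorm  :", [round(v, 4) for v in y_rms])
print("LayerNorm:", [round(v, 4) for v in y_ln])
```

RMSNorm needs one pass to compute the mean of squares, while LayerNorm also computes and subtracts the mean. The RMSNorm output has a mean square of about 1 but keeps a nonzero mean; the LayerNorm output is centered at zero.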

Read in This Order

Section	What You Will Learn	Difficulty	Time
01 - Context	Why pre-training at scale matters; the limits of proprietary models	Beginner	5 min
02 - The Problem	Proprietary models are expensive and closed; scaling laws suggest better alternatives	Intermediate	5 min
03 - The Idea	Chinchilla scaling, open-source release, architectural innovations	Intermediate	8 min
04 - The Math	RMSNorm formulation; SwiGLU; RoPE rotation mechanics; scaling laws	Intermediate	10 min
05 - Worked Example	Compute RMSNorm on example tensor; trace SwiGLU computation; illustrate RoPE	Intermediate	8 min
06 - The Code	Implement RMSNorm from scratch; load LLaMA from Hugging Face; run inference	Beginner	6 min
07 - Limitations	Limited context, English-centric, no RLHF in base; misuse risks	Advanced	4 min
08 - Impact	Alpaca, Vicuna, Mistral, thousands of fine-tunes; open-source revolution	Intermediate	3 min
09 - Summary	One-line recap, key ideas, numbers, scaling principles, what next	Beginner	1 min

Before You Read: Math and AI Concepts You’ll Need

Transformer Architecture (Paper 08): LLaMA is a transformer; understanding attention, feedforward layers, and layer norm is essential
Scaling Laws (Paper 13 / Chinchilla): The Chinchilla-optimal allocation of compute to model size and data
GPT-3 (Paper 12): The baseline that LLaMA is compared against
Linear Algebra: Vectors, matrices, norms, rotations
Layer Normalization: Basic understanding of how to stabilize neural network training

Visual Overview: LLaMA’s Architectural Components

                Input Tokens
                     |
            Embedding Layer (E)
                     |
        ┌──── Transformer Block ────┐
        |    Pre-Norm (RMSNorm)     |
        |            |              |
        |      Self-Attention       |
        |  (RoPE on queries/keys)   |
        |            |              |
        |  Residual Connection (+)  |
        |            |              |
        |    Pre-Norm (RMSNorm)     |
        |            |              |
        |       Feedforward         |
        | (SwiGLU instead of ReLU)  |
        |            |              |
        |  Residual Connection (+)  |
        └────────────┬──────────────┘
                     |
     Stack x N Layers (32 for 7B, 80 for 65B)
                     |
       Output Normalization (RMSNorm)
                     |
           Token Prediction Head
                     |
            Next Token Probability
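The feedforward half of one block can be sketched in a few lines: normalize, apply the SwiGLU feedforward network, then add the result back to the input. This is a toy illustration under stated assumptions: the weight names (W_gate, W_up, W_down), the tiny dimensions, and the values are all hypothetical, and the attention sub-layer is omitted.

```python
import math


def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def silu(z):
    """Swish with beta = 1: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(z) for z in matvec(W_gate, x)]    # gating branch
    up = matvec(W_up, x)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise product
    return matvec(W_down, hidden)                  # project back to model width


def ffn_sublayer(x, W_gate, W_up, W_down):
    """Pre-norm residual sub-layer: x + FFN(RMSNorm(x))."""
    update = swiglu_ffn(rms_norm(x), W_gate, W_up, W_down)
    return [xi + ui for xi, ui in zip(x, update)]


# Toy sizes: model width 2, hidden width 3 (values are arbitrary).
W_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
W_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_down = [[0.2, -0.1, 0.3], [0.0, 0.4, -0.2]]

x = [1.0, -2.0]
out = ffn_sublayer(x, W_gate, W_up, W_down)
print(out)
```

Because the normalization sits inside the residual branch, the input x passes through the skip connection untouched. SwiGLU uses three weight matrices instead of two, which is why the paper shrinks the hidden width to two thirds of the usual 4x model width.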

Comparison: LLaMA vs. GPT-3

Aspect	LLaMA-13B	GPT-3 (175B)	LLaMA-65B
Parameters	13B	175B	65B
Training Tokens	1.0T	~300B	1.4T
Architecture	Transformer + RMSNorm + RoPE + SwiGLU	Standard Transformer	Transformer + RMSNorm + RoPE + SwiGLU
Training Compute	~135K A100 GPU-hours	~3,640 petaflop/s-days	~1.02M A100 GPU-hours
Inference Cost	Low (1-2 GPUs)	Very High (8-32 GPUs)	Medium (4-8 GPUs)
MMLU (5-shot)	46.9%	43.9%	63.4%
Weights Released	Yes (research)	No	Yes (research)

Insight: LLaMA-65B outperforms GPT-3 (175B) with about 2.7x fewer parameters. Its training compute is of the same order of magnitude: the common 6 × parameters × tokens estimate puts it at roughly 1.7x GPT-3's.
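The compute comparison can be checked with the widely used approximation that training costs about 6 floating-point operations per parameter per token. This is a rough rule of thumb, not a figure from the LLaMA paper, and the function and constant names are illustrative.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


PFLOPS_DAY = 1e15 * 86_400   # FLOPs delivered by 1 petaflop/s sustained for one day

gpt3_flops = train_flops(175e9, 300e9)
llama65_flops = train_flops(65e9, 1.4e12)

gpt3_pfs_days = gpt3_flops / PFLOPS_DAY
ratio = llama65_flops / gpt3_flops

print(f"GPT-3 175B: {gpt3_flops:.2e} FLOPs ({gpt3_pfs_days:,.0f} petaflop/s-days)")
print(f"LLaMA-65B : {llama65_flops:.2e} FLOPs")
print(f"LLaMA-65B / GPT-3 compute ratio: {ratio:.2f}")
```

The estimate for GPT-3 lands close to its reported figure of about 3,640 petaflop/s-days, which suggests the approximation is reasonable here. LLaMA-65B comes out at under twice that budget despite having far fewer parameters.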
