Paper 17
Intermediate

LLaMA: Open and Efficient Foundation Language Models

What This Paper Did

Imagine two paths to build a powerful engine: (1) Build a massive engine with huge fuel tanks and run it for 6 months, or (2) Build a smaller, more efficient engine and run it for 5 years on the same total fuel budget. Which gives better performance?

LLaMA chose path 2.

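Meta AI built on the Chinchilla scaling laws (Paper 13), which favor smaller models trained on more tokens over larger models trained on fewer, and then trained past the compute-optimal point to keep inference cheap. The resulting models (7B, 13B, 33B, and 65B parameters) were released to the research community. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with much larger proprietary models.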

The key innovations:

  1. Chinchilla-style scaling: Train smaller models (7B to 65B instead of 175B) on more tokens (1.0 trillion for 7B and 13B, 1.4 trillion for 33B and 65B, versus roughly 300B for GPT-3)
  2. RMSNorm: Replace LayerNorm with RMS normalization, applied to the input of each sub-layer (pre-normalization); it skips mean-centering, so it is simpler and cheaper
  3. SwiGLU activation: Replace ReLU with the Swish-Gated Linear Unit in the feedforward network
  4. Rotary Positional Embeddings (RoPE): Encode position as rotations of the query/key vectors at every layer, so attention scores depend on relative position
  5. Open weights: Release model weights to the research community
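The rotation idea in item 4 can be made concrete with a small, dependency-free sketch. It assumes the usual RoFormer-style formulation (adjacent dimensions paired, frequency base 10000); real implementations may lay out the pairs differently, and the function name `rope` is purely illustrative. The point it demonstrates is that the query/key dot product depends only on the distance between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle proportional to pos.

    Assumption: pair j uses frequency base ** (-2j / d), as in the RoFormer paper.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):          # i = 2j, so base ** (-i / d) == base ** (-2j / d)
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]   # toy query vector
k = [0.2, -1.0, 0.7, 0.1]   # toy key vector

# Same relative offset (2 positions apart), different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is only rotated, the length of the vector is unchanged; position information shows up purely in the angles between queries and keys.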

The result: Open-source LLMs that rival proprietary models, sparking the open-source LLM revolution.

Key numbers:

LLaMA-13B vs. GPT-3 (175B):
- Parameters: 13B vs. 175B (13.5x smaller)
- Training tokens: 1.0T vs. ~300B (about 3.3x more data)
- Benchmark performance: Outperforms GPT-3 on most benchmarks
- Inference cost: Much lower (can run on a single GPU)

Training compute:
- LLaMA-65B: ~1.0 million A100-80GB GPU-hours (about 21 days on 2,048 GPUs)
- Total data: Publicly available (no proprietary data)
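As a rough sanity check on the training cost, the LLaMA paper reports a throughput of around 380 tokens/sec/GPU on 2,048 A100-80GB GPUs for the 65B model. The sketch below only multiplies those reported figures out; the function and variable names are illustrative, and the result is an estimate, not a measured number.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_tokens, tokens_per_sec_per_gpu, num_gpus):
    """Wall-clock days needed to process total_tokens at a fixed throughput."""
    tokens_per_day = tokens_per_sec_per_gpu * num_gpus * SECONDS_PER_DAY
    return total_tokens / tokens_per_day

# Figures reported in the LLaMA paper for the 65B model.
days = training_days(total_tokens=1.4e12, tokens_per_sec_per_gpu=380, num_gpus=2048)
gpu_hours = days * 24 * 2048

print(f"about {days:.0f} days, roughly {gpu_hours / 1e6:.2f} million GPU-hours")
```

This lands at about three weeks of wall-clock time and on the order of one million GPU-hours, which is why the cost is better quoted in GPU-hours than in a handful of GPU-days.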

The Indian Analogy

Chinchilla scaling: Instead of training 175 experienced engineers for 6 months on basic company procedures, train 13 talented engineers intensively for 5 years. The smaller, well-trained team beats the larger, briefly-trained team.

Open release: Like IIT publishing its entire curriculum, lecture notes, and exam solutions online for free. Previously, only students who got into IIT could study this material. Now, any motivated student in Bhilai, Tirunelveli, or Patna can access the same resources. LLaMA democratized frontier AI research the same way.

RMSNorm: A simpler, faster way to normalize inputs to neural networks — like using a sleeker gear in a machine instead of a bulkier one. Same effect (stabilizing the signal), less friction.
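In concrete terms, RMSNorm divides each vector by its root mean square and multiplies by a learned per-dimension gain: y_i = g_i * x_i / sqrt(mean(x^2) + eps). Unlike LayerNorm, it does not subtract the mean. A minimal from-scratch sketch, assuming eps = 1e-6 and a gain initialized to ones (both are illustrative defaults, not values taken from this page):

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Scale x so its root mean square is ~1, then apply a per-dimension gain."""
    if gain is None:
        gain = [1.0] * len(x)          # assumption: gain starts at 1
    mean_square = sum(v * v for v in x) / len(x)
    rms = math.sqrt(mean_square + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm (no gain/bias) for comparison: it also subtracts the mean."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y_rms = rms_norm(x)
y_ln = layer_norm(x)

print([round(v, 4) for v in y_rms])   # all positive: the mean is not removed
print([round(v, 4) for v in y_ln])    # centered around zero
```

RMSNorm needs one statistic per vector (the mean square) where LayerNorm needs two (mean and variance), which is where the "less friction" comes from.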


Read in This Order

Section | What You Will Learn | Difficulty | Time
01 - Context | Why pre-training at scale matters; the limits of proprietary models | Beginner | 5 min
02 - The Problem | Proprietary models are expensive and closed; scaling laws suggest better alternatives | Intermediate | 5 min
03 - The Idea | Chinchilla scaling, open-source release, architectural innovations | Intermediate | 8 min
04 - The Math | RMSNorm formulation; SwiGLU; RoPE rotation mechanics; scaling laws | Intermediate | 10 min
05 - Worked Example | Compute RMSNorm on example tensor; trace SwiGLU computation; illustrate RoPE | Intermediate | 8 min
06 - The Code | Implement RMSNorm from scratch; load LLaMA from Hugging Face; run inference | Beginner | 6 min
07 - Limitations | Limited context, English-centric, no RLHF in base; misuse risks | Advanced | 4 min
08 - Impact | o1, Alpaca, Mistral, thousands of fine-tunes; open-source revolution | Intermediate | 3 min
09 - Summary | One-line recap, key ideas, numbers, scaling principles, what next | Beginner | 1 min

Before You Read: Math and AI Concepts You’ll Need

  • Transformer Architecture (Paper 08): LLaMA is a transformer; understanding attention, feedforward layers, and layer norm is essential
  • Scaling Laws (Paper 13 / Chinchilla): The Chinchilla-optimal allocation of compute to model size and data
  • GPT-3 (Paper 12): The baseline that LLaMA is compared against
  • Linear Algebra: Vectors, matrices, norms, rotations
  • Layer Normalization: Basic understanding of how to stabilize neural network training

Visual Overview: LLaMA’s Architectural Components

                Input Tokens
                     |
            Embedding Layer (E)
                     |
            Pre-Norm (RMSNorm)
                     |
      Self-Attention (RoPE on queries/keys)
                     |
          Residual Connection (add)
                     |
            Pre-Norm (RMSNorm)
                     |
       Feedforward (SwiGLU instead of ReLU)
                     |
          Residual Connection (add)
                     |
    Stack x N Layers (32 for 7B, 80 for 65B)
                     |
        Output Normalization (RMSNorm)
                     |
           Token Prediction Head
                     |
            Next Token Probability
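The feedforward step and the pre-norm residual wiring of a LLaMA-style block can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the weight names (`W_gate`, `W_up`, `W_down`) are hypothetical, biases are omitted, and attention is passed in as a placeholder callable rather than implemented.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feedforward: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(z) for z in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(W_down, hidden)

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def transformer_block(x, attention, ffn):
    """Pre-norm residual block: normalize the input of each sub-layer, then add."""
    h = [a + b for a, b in zip(x, attention(rms_norm(x)))]
    return [a + b for a, b in zip(h, ffn(rms_norm(h)))]

# Toy weights: model dimension 2, hidden dimension 3.
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]

x = [1.0, -1.0]
ffn_out = swiglu_ffn(x, W_gate, W_up, W_down)

# Placeholder attention that returns zeros, just to exercise the wiring.
block_out = transformer_block(
    x,
    attention=lambda v: [0.0] * len(v),
    ffn=lambda v: swiglu_ffn(v, W_gate, W_up, W_down),
)
print(ffn_out, block_out)
```

The gate path decides, per hidden unit, how much of the linear "up" projection passes through, which is the difference from a plain ReLU feedforward layer with a single input projection.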

Comparison: LLaMA vs. GPT-3

Aspect | LLaMA-13B | GPT-3 (175B) | LLaMA-65B
Parameters | 13B | 175B | 65B
Training Tokens | 1.0T | ~300B | 1.4T
Architecture | Transformer + RMSNorm + RoPE + SwiGLU | Standard Transformer | Transformer + RMSNorm + RoPE + SwiGLU
Training Compute (approx., 6 × params × tokens) | ~0.8 × 10^23 FLOPs | ~3.1 × 10^23 FLOPs | ~5.5 × 10^23 FLOPs
Inference Cost | Low (~26 GB of fp16 weights) | Very high (~350 GB of fp16 weights) | Medium (~130 GB of fp16 weights)
MMLU Benchmark (5-shot) | 46.9% | 43.9% | 63.4%
Weights Released | Yes (research) | No | Yes (research)

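Insight: LLaMA-65B outperforms GPT-3 (175B) with about 2.7x fewer parameters. It is trained on roughly 4.7x more tokens, so its estimated training compute is higher than GPT-3's (around 1.7x by the 6 × params × tokens rule), not lower. The saving comes at inference time, where the smaller model is cheaper to serve.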
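The compute comparison can be checked with the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below applies that approximation to rounded parameter and token counts, so the outputs are estimates rather than reported measurements.

```python
PFS_DAY = 1e15 * 86_400   # FLOPs in one petaflop/s-day

def train_flops(params, tokens):
    """Approximate training FLOPs with the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens

# (parameters, training tokens), rounded.
models = {
    "LLaMA-13B": (13e9, 1.0e12),
    "GPT-3 (175B)": (175e9, 300e9),
    "LLaMA-65B": (65e9, 1.4e12),
}

flops = {name: train_flops(n, d) for name, (n, d) in models.items()}
pf_days = {name: f / PFS_DAY for name, f in flops.items()}

for name in models:
    print(f"{name}: {flops[name]:.2e} FLOPs, about {pf_days[name]:,.0f} petaflop/s-days")

ratio_65b_to_gpt3 = flops["LLaMA-65B"] / flops["GPT-3 (175B)"]
print(f"LLaMA-65B / GPT-3 compute ratio: about {ratio_65b_to_gpt3:.1f}x")
```

By this estimate LLaMA-13B uses about a quarter of GPT-3's training compute, while LLaMA-65B uses more than GPT-3, trading extra training cost for a model that is cheaper to run.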

