9. Summary — one page on Mixture of Experts
The paper in one sentence
Replace a dense feed-forward layer (placed between stacked LSTM layers in the original paper, and the Transformer FFN in later work) with up to thousands of specialised expert networks, learn a gating function that routes each token to only the top-k experts, and get enormous model capacity at roughly constant per-token compute cost.
The problem it solved
Dense neural networks use every parameter for every token, so doubling the parameter count roughly doubles the compute per token. In 2017 this hard coupling made it prohibitively expensive to scale models beyond a few billion parameters.
The core idea
n expert networks: Each is a standard two-layer FFN with its own weights. Identical structure, different learned specialisation.
Gating network: A small learnable function that takes the current token and scores how relevant each expert is:
h(x) = x · W_g ← raw score per expert
G(x) = Softmax( TopK(h(x), k) ) ← sparse weights, sum to 1
MoE output: A weighted blend of only the k selected experts:
MoE(x) = Σᵢ G(x)ᵢ · Eᵢ(x) ← only k terms are non-zero
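Auxiliary balancing loss: Prevents expert collapse by penalising unequal routing. The original paper used squared coefficient-of-variation penalties on expert importance and load; the simplified form below was popularised by Switch Transformer. Here fᵢ is the fraction of tokens routed to expert i, pᵢ is the mean gate probability assigned to expert i, and α is a small weighting coefficient: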
L_balance = α · n · Σᵢ fᵢ · pᵢ
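The gating, blending, and balancing formulas above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it omits the noise term the paper adds to the gate scores, the helper names (`gate_scores`, `top_k_gate`, `make_expert`, `moe`, `balance_loss`) and the toy dimensions are invented for this example, and `balance_loss` assumes fᵢ counts each token's single highest-scoring expert while pᵢ averages the full softmax over all experts.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gate_scores(x, W_g):
    """h(x) = x · W_g: one raw score per expert (W_g has shape d x n)."""
    n = len(W_g[0])
    return [sum(x[d] * W_g[d][i] for d in range(len(x))) for i in range(n)]


def top_k_gate(x, W_g, k):
    """G(x): softmax over the k largest scores, exactly 0 for every other expert."""
    h = gate_scores(x, W_g)
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    g = [0.0] * len(h)
    for i, w in zip(top, softmax([h[i] for i in top])):
        g[i] = w
    return g


def make_expert(seed, d, hidden):
    """Hypothetical toy expert: a two-layer ReLU FFN with random weights."""
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]

    def expert(x):
        h = [max(0.0, sum(x[a] * W1[a][b] for a in range(d))) for b in range(hidden)]
        return [sum(h[b] * W2[b][c] for b in range(hidden)) for c in range(d)]

    return expert


def moe(x, W_g, experts, k):
    """MoE(x) = sum_i G(x)_i * E_i(x), evaluating only the selected experts."""
    g = top_k_gate(x, W_g, k)
    out = [0.0] * len(x)
    for i, w in enumerate(g):
        if w == 0.0:
            continue  # unselected experts are never evaluated
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


def balance_loss(scores, alpha=0.01):
    """L_balance = alpha * n * sum_i f_i * p_i over a batch of raw score vectors.

    Assumption: f_i uses each token's top-1 expert, p_i the full softmax.
    """
    n, T = len(scores[0]), len(scores)
    f = [0.0] * n
    p = [0.0] * n
    for h in scores:
        f[max(range(n), key=lambda i: h[i])] += 1.0 / T
        for i, prob in enumerate(softmax(h)):
            p[i] += prob / T
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))


# Toy example: 2-dimensional token, 4 experts, k = 2 (all values are illustrative).
x = [1.0, 2.0]
W_g = [[0.1, 0.5, -0.3, 0.2],
       [0.4, -0.2, 0.3, 0.1]]
experts = [make_expert(seed, d=2, hidden=4) for seed in range(4)]
g = top_k_gate(x, W_g, k=2)   # non-zero only for the two highest-scoring experts
y = moe(x, W_g, experts, k=2)
```

Under these definitions the loss equals α when routing is perfectly even and approaches α · n when every token goes to a single expert, which is why minimising it pushes the gate toward balanced routing.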
The analogy
A hospital with 1,000 specialists and a gating doctor. Every patient sees the gating doctor briefly. The gating doctor sends each patient to the 2 most relevant specialists — not all 1,000. The hospital’s total knowledge is vast, but each consultation uses only a fraction of it.
What it unlocked
- Parameter count decoupled from compute cost
- 137 billion parameters at 2017 compute budgets (10× the largest dense models)
- Switch Transformer (2021): 1.6 trillion parameters, k=1 routing
- Mixtral 8×7B (2023): open-weight MoE release that brought the architecture to a wide audience
- GPT-4 (reportedly MoE, though unconfirmed) and Gemini 1.5 (described by Google as MoE): much of the frontier now runs on MoE
What it left open
- All-to-all communication overhead between expert machines
- Expert collapse without careful balancing
- Token dropping when experts overflow capacity
- Memory scales with total parameters, not active ones — expensive inference
- Training instability is more common than in dense models
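Token dropping, listed among the open problems above, is easiest to see with a small routing simulation. The sketch below assumes a fixed per-expert capacity of ceil(capacity_factor × tokens / n), a convention used in later work such as Switch Transformer rather than something specified in this summary; the function name `route_with_capacity` and the example assignments are hypothetical.

```python
import math


def route_with_capacity(assignments, n_experts, capacity_factor=1.0):
    """Keep tokens in arrival order until their expert is full; drop the rest.

    assignments[t] is the expert index chosen for token t (top-1 routing assumed).
    """
    capacity = math.ceil(capacity_factor * len(assignments) / n_experts)
    load = [0] * n_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # expert overflowed: this token skips the MoE layer
    return capacity, kept, dropped


# Eight tokens, four experts, and a gate that over-uses expert 0.
assignments = [0, 0, 0, 1, 2, 0, 3, 0]
capacity, kept, dropped = route_with_capacity(assignments, n_experts=4)
```

Raising `capacity_factor` drops fewer tokens but reserves more buffer space per expert, which is the trade-off behind this open problem.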
Key numbers from the paper
n experts: up to 131,072 (though 1,000–2,048 practical)
k (active experts): 1 or 2
Largest model: 137 billion parameters
Compute cost: roughly that of a dense model with only k/n of the expert parameters
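The k/n relationship can be checked with quick arithmetic. The layer sizes below (`d_model=512`, `d_hidden=2048`, 1,024 experts, k=2) are hypothetical values chosen for illustration, not figures from the paper, and biases are ignored.

```python
def moe_layer_params(d_model, d_hidden, n_experts, k):
    """Count stored vs. per-token active parameters in one MoE layer (no biases)."""
    per_expert = 2 * d_model * d_hidden      # two weight matrices per expert FFN
    return {
        "per_expert": per_expert,
        "stored_expert_params": n_experts * per_expert,  # must all sit in memory
        "active_expert_params": k * per_expert,          # touched per token
        "gate_params": d_model * n_experts,              # W_g, always active
    }


counts = moe_layer_params(d_model=512, d_hidden=2048, n_experts=1024, k=2)
active_fraction = counts["active_expert_params"] / counts["stored_expert_params"]
```

With these assumed sizes the layer stores about 2.1 billion expert parameters but uses only about 4.2 million of them per token, so `active_fraction` equals k/n = 2/1024. Memory, however, still has to hold every stored parameter.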
Difficulty
🔴 The math (Sections 4–5) is advanced undergrad — gating functions, softmax, auxiliary loss. 🟡 The concept and code (Sections 3, 6) are first-year college. 🟢 Sections 1–2, 8–9 are accessible to anyone.