9. Summary — one page on Mixture of Experts

The paper in one sentence

Insert a layer of up to thousands of specialised expert networks (between stacked LSTM layers in the original paper; in place of the dense FFN layer of a Transformer in later work), learn a gating function that routes each token to only the top-k experts, and get enormous model capacity at roughly constant per-token compute cost.

The problem it solved

Dense neural networks use every parameter for every token, so doubling the parameter count roughly doubles the compute per token. In 2017 this coupling made it prohibitively expensive to scale models much beyond a few billion parameters.

The core idea

n expert networks: Each is a standard two-layer FFN with its own weights. Identical structure, different learned specialisation.

Gating network: A small learnable function that takes the current token and scores how relevant each expert is:

h(x) = x · W_g                        ← raw score per expert
G(x) = Softmax( TopK(h(x), k) )       ← sparse weights, sum to 1
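
As a rough illustration of the two gating formulas, here is a minimal pure-Python sketch. The function names, the toy input x and the toy matrix W_g are made up for this example, and the noise term that some gating variants add to h(x) is left out.

```python
import math

def gate_scores(x, W_g):
    """h(x) = x · W_g, with x a length-d list and W_g a d × n list of rows."""
    n = len(W_g[0])
    return [sum(x[d] * W_g[d][j] for d in range(len(x))) for j in range(n)]

def top_k_gate(scores, k):
    """Softmax over the k largest scores; every other expert gets weight 0."""
    # Indices of the k highest raw scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (max subtracted for stability).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

# Toy values (assumptions for illustration): d = 2 input features, n = 4 experts.
x = [1.0, 2.0]
W_g = [[0.5, -1.0, 0.0, 2.0],
       [0.25, 0.5, 1.0, -1.0]]
h = gate_scores(x, W_g)    # raw scores: [1.0, 0.0, 2.0, 0.0]
G = top_k_gate(h, k=2)     # non-zero only for experts 2 and 0
```

Only the two selected experts receive non-zero weight, and those weights sum to 1.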

MoE output: A weighted blend of only the k selected experts:

MoE(x) = Σᵢ G(x)ᵢ · Eᵢ(x)            ← only k terms are non-zero
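
The weighted sum above can be sketched as follows. The experts here are hypothetical toy functions that just scale their input (real experts would be FFNs), and the gate weights are supplied by hand rather than computed by a gating network.

```python
def moe_output(x, gates, experts):
    """MoE(x) = sum_i G(x)_i * E_i(x), skipping experts whose gate weight is zero."""
    out = [0.0] * len(x)
    evaluated = 0
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # unselected expert is never run, so it costs no compute
        y = expert(x)
        evaluated += 1
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, evaluated

# Hypothetical toy experts: each scales its input by a different factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gates = [0.25, 0.0, 0.75, 0.0]          # k = 2 non-zero weights, summing to 1
y, n_run = moe_output([1.0, -2.0], gates, experts)
```

The point of the sketch is the `continue`: experts with zero gate weight are skipped entirely, which is where the compute saving comes from.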

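Auxiliary balancing loss: Prevents expert collapse by penalising unequal routing. In the form shown (the one used by Switch Transformer), fᵢ is the fraction of tokens routed to expert i and pᵢ is the mean gate probability for expert i: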

L_balance = α · n · Σᵢ fᵢ · pᵢ
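
A small sketch of how this loss might be computed for one batch, taking fᵢ as the fraction of tokens whose top-scoring expert is i and pᵢ as the mean gate probability for expert i. Those definitions, the function name and the default α = 0.01 are assumptions for illustration, not values confirmed by the text above.

```python
def balance_loss(router_probs, alpha=0.01):
    """L_balance = alpha * n * sum_i f_i * p_i over a batch of gate distributions."""
    num_tokens = len(router_probs)
    n = len(router_probs[0])
    # f_i: fraction of tokens whose highest-probability expert is i.
    f = [0.0] * n
    for probs in router_probs:
        f[probs.index(max(probs))] += 1.0 / num_tokens
    # p_i: mean gate probability assigned to expert i across the batch.
    p = [sum(probs[i] for probs in router_probs) / num_tokens for i in range(n)]
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

# Two experts, four tokens (toy data).
balanced = balance_loss([[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]])
collapsed = balance_loss([[0.9, 0.1]] * 4)   # every token prefers expert 0
```

With evenly spread routing the sum Σᵢ fᵢ · pᵢ is 1/n, so the loss equals α. Collapsed routing pushes it higher, and that difference is what the penalty acts on.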

The analogy

A hospital with 1,000 specialists and a gating doctor. Every patient sees the gating doctor briefly. The gating doctor sends each patient to the 2 most relevant specialists — not all 1,000. The hospital’s total knowledge is vast, but each consultation uses only a fraction of it.

What it unlocked

Parameter count decoupled from compute cost
137 billion parameters at 2017 compute budgets (10× the largest dense models)
Switch Transformer (2021): 1.6 trillion parameters, k=1 routing
Mixtral 8×7B (2023): open-weight MoE that brought the architecture to a wide audience
GPT-4 (likely MoE), Gemini (likely MoE) — the frontier runs on MoE

What it left open

All-to-all communication overhead between expert machines
Expert collapse without careful balancing
Token dropping when experts overflow capacity
Memory scales with total parameters, not active ones — expensive inference
Training instability more common than dense models
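
To make the token-dropping item concrete, here is a minimal sketch assuming each expert has a fixed capacity of ceil(tokens / n × capacity_factor) and tokens are served in order. The function name, the capacity factor of 1.25 and the toy assignments are hypothetical. Real systems differ in how they treat overflow; a common choice is to let dropped tokens pass through on the residual connection.

```python
import math

def route_with_capacity(assignments, n_experts, capacity_factor=1.25):
    """Serve tokens in order; drop any token whose chosen expert is already full."""
    capacity = math.ceil(len(assignments) / n_experts * capacity_factor)
    load = [0] * n_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # overflow: this expert does not process the token
    return capacity, kept, dropped

# Eight tokens, four experts, with expert 0 heavily over-subscribed (toy data).
assignments = [0, 0, 0, 0, 1, 0, 2, 3]
capacity, kept, dropped = route_with_capacity(assignments, n_experts=4)
```

Here each expert can take 3 tokens, so two of the five tokens sent to expert 0 are dropped while experts 1 to 3 sit mostly idle. That is the same imbalance the auxiliary loss tries to prevent.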

Key numbers from the paper

n experts:           up to 131,072 (though 1,000–2,048 practical)
k (active experts):  1 or 2
Largest model:       137 billion parameters
Compute cost:        roughly that of a dense model with k/n of the expert parameters
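
The k/n relationship can be checked with a little arithmetic. The sketch below counts only the expert FFN weights. The layer sizes are hypothetical, chosen for round numbers rather than taken from the paper, and biases and the gating network are ignored.

```python
def expert_param_counts(d_model, d_ff, n_experts, k):
    """Count expert FFN weights: total stored vs. used for a single token."""
    per_expert = 2 * d_model * d_ff      # two weight matrices, biases ignored
    total = n_experts * per_expert       # must all be held in memory
    active = k * per_expert              # actually multiplied for one token
    return total, active

# Hypothetical sizes for illustration only.
total, active = expert_param_counts(d_model=512, d_ff=2048, n_experts=1024, k=2)
ratio = active / total                   # equals k / n_experts
```

With these assumed sizes the layer stores about 2.1 billion expert weights but uses about 4.2 million per token. Memory grows with n while per-token compute grows with k.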

Difficulty

🔴 The math (Sections 4–5) is advanced undergrad — gating functions, softmax, auxiliary loss. 🟡 The concept and code (Sections 3, 6) are first-year college. 🟢 Sections 1–2, 8–9 are accessible to anyone.
