Paper 09
Intermediate

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer


Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean · ICLR 2017 · arXiv:1701.06538


What this paper did

It broke the link between model size and compute cost.

In a standard dense neural network, every parameter is used for every input, so doubling model size roughly doubles the compute per token. In 2017 this coupling made scaling much beyond a few billion parameters prohibitively expensive.

Shazeer’s team inserted a Mixture-of-Experts (MoE) layer between the stacked LSTM layers of their network: n expert networks (each a small feed-forward network) plus a learned gating function that routes each token to only k of them. In later Transformer variants the same kind of layer replaces the FFN sub-layer. With n=100 experts and k=2, the layer holds 100× the parameters of a single expert but pays the compute cost of only 2 experts plus a small gating matrix. The model’s knowledge capacity and its per-token inference cost become largely independent quantities.
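To make the n=100, k=2 arithmetic concrete, here is a minimal sketch that counts stored versus active parameters for one MoE layer. The function name `moe_param_budget` and the sizes `d_model=512` and `d_hidden=2048` are illustrative assumptions, not values from the paper, and each expert is assumed to be a two-matrix feed-forward network with biases ignored.

```python
def moe_param_budget(n_experts, k, d_model, d_hidden):
    """Count stored vs. per-token active parameters for one MoE layer.

    Assumes each expert is a two-matrix FFN (d_model -> d_hidden -> d_model)
    with no biases, and the gate is a single d_model x n_experts matrix.
    """
    per_expert = 2 * d_model * d_hidden          # W1 and W2
    gate = d_model * n_experts                   # W_g
    total = n_experts * per_expert + gate        # parameters held in memory
    active = k * per_expert + gate               # parameters touched per token
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_param_budget(n_experts=100, k=2, d_model=512, d_hidden=2048)
print(f"stored parameters: {total:,}")           # 209,766,400
print(f"active per token:  {active:,}")          # 4,245,504
print(f"ratio:             {total / active:.1f}x")
```

Under these assumptions the layer stores roughly 50 times more parameters than any single token uses, and that gap grows with n while per-token cost stays fixed by k.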

The key equations:

G(x) = Softmax( TopK( x · W_g, k ) )    ← sparse routing weights
MoE(x) = Σᵢ G(x)ᵢ · Eᵢ(x)             ← weighted blend of k active experts
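L_balance = α · n · Σᵢ fᵢ · pᵢ         ← auxiliary loss preventing expert collapse

Here fᵢ is the fraction of tokens routed to expert i and pᵢ is the mean gate probability for expert i. These are simplified forms: the 2017 paper adds tunable noise to the gate logits before TopK, and it balances experts with two penalties, the squared coefficient of variation of per-expert importance and of per-expert load. The compact fᵢ · pᵢ loss shown here is the version popularised later by the Switch Transformer.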
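The first and third equations can be sketched in plain Python as follows. This is a simplified illustration, not the paper's implementation: it omits gating noise, and it assumes fᵢ is the fraction of top-k routing slots that go to expert i and pᵢ is the mean full-softmax probability of expert i over the batch. The helper names and the toy logits are hypothetical.

```python
import math


def top_k_indices(logits, k):
    """Indices of the k largest logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def top_k_gating(logits, k):
    """G(x): softmax over the top-k logits, exactly 0 for every other expert."""
    top = top_k_indices(logits, k)
    m = max(logits[i] for i in top)               # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def balance_loss(batch_logits, k, alpha=0.01):
    """alpha * n * sum_i f_i * p_i over a batch of per-token gate logits."""
    n, t = len(batch_logits[0]), len(batch_logits)
    f = [0.0] * n                                 # routing fraction per expert
    p = [0.0] * n                                 # mean gate probability per expert
    for logits in batch_logits:
        for i in top_k_indices(logits, k):
            f[i] += 1.0 / (t * k)
        for i, prob in enumerate(softmax(logits)):
            p[i] += prob / t
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))


gates = top_k_gating([2.0, 1.0, 0.5, -1.0], k=2)  # approx [0.731, 0.269, 0.0, 0.0]

# Toy batches: each token prefers a different expert vs. all prefer expert 0.
balanced = [[3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0],
            [0.0, 0.0, 3.0, 0.0], [0.0, 0.0, 0.0, 3.0]]
collapsed = [[3.0, 0.0, 0.0, 0.0]] * 4
loss_balanced = balance_loss(balanced, k=1, alpha=1.0)
loss_collapsed = balance_loss(collapsed, k=1, alpha=1.0)
```

With perfectly uniform routing the loss equals α, and it rises when tokens pile onto one expert, which is the pressure that discourages collapse.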

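The result: a model with up to 137 billion parameters in its MoE layers, trained at a small fraction of the compute a dense model of that size would need. By 2023, MoE had been adopted in open models such as Mixtral and was widely reported to underlie several frontier systems.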


The Indian analogy

Picture a government hospital with 1,000 specialists. The gating doctor (gating network) briefly examines each patient (token) and routes them to the 2 most relevant specialists (top-k experts). The hospital’s total knowledge is vast, but each patient consults only a small fraction of it. The auxiliary balancing loss is the hospital administrator ensuring no single specialist gets a three-year waiting list while others sit idle.


Read in this order

Section | What you will learn | Difficulty | Time
1. Context | The compute wall of 2017, MoE’s 1990s origins | 🟢 | 4 min
2. The Problem | Every neuron firing for every token is wasteful | 🟢 | 3 min
3. The Idea | Hospital analogy, sparse routing, expert specialisation | 🟡 | 5 min
4. The Math | Gating function, TopK, auxiliary loss (worked by hand) | 🔴 | 10 min
5. Worked Example | 4 experts routing “chai bahut garam hai” token by token | 🔴 | 8 min
6. The Code | Full MoE forward pass and balancing loss in NumPy | 🟡 | 6 min
7. Limitations | Communication overhead, collapse, token dropping | 🟡 | 4 min
8. Impact | Switch Transformer, Mixtral, GPT-4, the frontier | 🟢 | 4 min
9. Summary | One-page recap | 🟢 | 3 min

Also: Glossary · Quiz · Further Reading




MoE layer at a glance

Input token x (d_model dimensions)
       │
       ▼
  ┌──────────────────────────────────┐
  │  GATING NETWORK                  │
  │  logits = x · W_g                │   (one score per expert)
  │  mask all but top-k to -∞        │
  │  G(x) = Softmax(masked logits)   │   (sparse weights, sum to 1)
  └──────────────────────────────────┘
       │
       ├──── Expert i  (if G(x)ᵢ > 0) → Eᵢ(x) → weight G(x)ᵢ
       ├──── Expert j  (if G(x)ⱼ > 0) → Eⱼ(x) → weight G(x)ⱼ
       └──── All other experts: SKIPPED (G = 0, no compute)

  MoE(x) = Σᵢ G(x)ᵢ · Eᵢ(x)   (only k terms non-zero)
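The flow in the diagram can be traced end to end with a small pure-Python sketch. Everything here is an illustrative assumption rather than the paper's setup: the `Expert` class, the tiny dimensions, the random weights, and the ReLU hidden layer are all hypothetical, and gating noise is omitted. Each expert counts its calls so you can confirm that unselected experts do no work.

```python
import math
import random


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


def make_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical two-layer feed-forward expert with a ReLU in between."""

    def __init__(self, d_model, d_hidden, rng):
        self.W1 = make_matrix(d_model, d_hidden, rng)
        self.W2 = make_matrix(d_hidden, d_model, rng)
        self.calls = 0                            # how many tokens reached this expert

    def __call__(self, x):
        self.calls += 1
        hidden = [max(0.0, v) for v in matvec(x, self.W1)]
        return matvec(hidden, self.W2)


def moe_forward(x, W_g, experts, k):
    """MoE(x) = sum_i G(x)_i * E_i(x), evaluating only the top-k experts."""
    logits = matvec(x, W_g)                       # one score per expert
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    gates = {i: exps[i] / z for i in top}         # sparse weights, sum to 1
    out = [0.0] * len(x)
    for i, g in gates.items():                    # experts outside `top` are skipped
        out = [o + g * v for o, v in zip(out, experts[i](x))]
    return out, gates


rng = random.Random(0)
d_model, d_hidden, n, k = 4, 8, 4, 2
W_g = make_matrix(d_model, n, rng)
experts = [Expert(d_model, d_hidden, rng) for _ in range(n)]

x = [0.5, -1.0, 0.25, 2.0]
y, gates = moe_forward(x, W_g, experts, k)
print("selected experts and weights:", gates)
print("expert call counts:", [e.calls for e in experts])
```

After one token, exactly k of the n call counters read 1 and the rest stay at 0, which is the sparsity the diagram describes.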
