Section 02

The problem: every neuron fires for every token

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017)

In a standard dense neural network, every parameter participates in every computation. Feed the word “chai” through a 1-billion-parameter model, and all 1 billion parameters are involved. Feed “piyo” through the same model — again, all 1 billion. Every token. Every parameter. Every time.

This is called dense computation, and it creates a hard ceiling on how large you can make a model for a given compute budget.


The arithmetic of dense scaling

Suppose you have a compute budget of C floating-point operations per token (FLOPs/token). A dense Transformer’s forward-pass FLOPs scale roughly as:

FLOPs per token ≈ 2 × number of parameters

(The factor of 2 comes from one multiply and one add per parameter in the dominant matrix multiplications.)

If your budget is C = 10¹¹ FLOPs/token, you can afford at most 10¹¹ / 2 = 5 × 10¹⁰, or about 50 billion, parameters. Want 500 billion? You need 10× the compute, and at 2017 hardware prices that was prohibitively expensive.
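The arithmetic above can be checked with a short Python sketch. It assumes only the rule of thumb stated in this section (about 2 FLOPs per parameter per token); the function names are illustrative, not taken from the paper.

```python
# Rule of thumb from the text: FLOPs per token ~= 2 x number of parameters.
FLOPS_PER_PARAM = 2.0


def max_dense_params(flops_per_token: float) -> float:
    """Largest dense model that fits a per-token compute budget."""
    return flops_per_token / FLOPS_PER_PARAM


def required_flops(num_params: float) -> float:
    """Per-token compute a dense model of this size needs."""
    return FLOPS_PER_PARAM * num_params


budget = 1e11  # C = 10^11 FLOPs/token, as in the example above

print(max_dense_params(budget))          # 50000000000.0 (about 50 billion)
print(required_flops(500e9) / budget)    # 10.0 (a 500B model needs 10x the budget)
```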

The compute budget is fixed by hardware and training time. Dense scaling hits that ceiling fast.


The waste in dense networks

Here is the uncomfortable truth about dense feed-forward (FFN) layers: not every neuron needs to fire for every input.

When a language model processes “Virat scored a century,” the neurons encoding cricketing context should be highly active. The neurons encoding chemistry, music theory, or legal vocabulary probably should not be. In a dense network, they all fire anyway — they just happen to produce small activations that contribute little to the output.

This is like calling in every member of the hospital staff for every patient consultation. The cardiologist, the dermatologist, the orthopaedic surgeon, and the neurologist all listen to the patient complain of a sore throat; most of them contribute nothing, and all of them are paid for their time.

The parameters exist, the compute is spent, but most of the capacity is wasted on this particular input.


What sparse computation promises

If you could route each input to only the relevant subset of parameters — a few specialists out of many — you could:

  1. Have a model with orders of magnitude more total parameters (more knowledge capacity)
  2. Pay compute cost proportional only to the active parameters (a small fraction)
  3. Keep inference fast because only a few experts run per token
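A toy sketch of the idea in the list above: pick a few experts per token and count how many parameters are actually touched. This is only an illustration under stated assumptions. The scores are made-up numbers, the expert size is hypothetical, and the selection rule (take the k highest scores) is a stand-in, not the paper's gating network.

```python
def top_k_experts(gate_scores, k):
    """Indices of the k highest-scoring experts (ties go to the lower index)."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])


def parameter_usage(params_per_expert, num_experts, k):
    """Total parameters stored vs. parameters active for one token."""
    total = params_per_expert * num_experts
    active = params_per_expert * k
    return total, active


# Hypothetical relevance scores for one token across 8 experts.
scores = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]

chosen = top_k_experts(scores, k=2)
total, active = parameter_usage(params_per_expert=1_000_000,
                                num_experts=len(scores), k=2)

print(chosen)         # [1, 3]
print(total, active)  # 8000000 2000000
```

Here the model stores 8 million expert parameters but only 2 million are used for this token, which is the gap between capacity and compute that the three points describe.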

A 137-billion-parameter model in which only 8 billion parameters are active per token costs about the same compute per token as a dense 8-billion-parameter model, but holds roughly 17× as many parameters, and so far more knowledge capacity.
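The 137-billion versus 8-billion comparison can be worked through with the same 2-FLOPs-per-parameter rule of thumb. This is a rough sketch: it ignores the cost of the routing step itself, and the variable names are illustrative.

```python
def flops_per_token(active_params: float) -> float:
    """Per-token compute, counting only the parameters that are actually used."""
    return 2 * active_params


total_params = 137e9    # parameters stored in the sparse model
active_params = 8e9     # parameters used for any one token
dense_params = 8e9      # dense model of comparable per-token cost

sparse_cost = flops_per_token(active_params)
dense_cost = flops_per_token(dense_params)
capacity_ratio = total_params / active_params

print(sparse_cost == dense_cost)   # True: same compute per token
print(round(capacity_ratio, 1))    # 17.1: about 17x the parameters
```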

This is the promise of the Mixture of Experts. The paper’s job was to make it actually work at scale.