2. The problem — every neuron fires for every token
In a standard dense neural network, every parameter participates in every computation. Feed the word “chai” through a 1-billion-parameter model, and all 1 billion parameters are involved. Feed “piyo” through the same model — again, all 1 billion. Every token. Every parameter. Every time.
This is called dense computation, and it creates a hard ceiling on how large you can make a model for a given compute budget.
The arithmetic of dense scaling
Suppose you have a compute budget of C floating-point operations per token (FLOPs/token). A dense Transformer’s FLOPs scale roughly as:
FLOPs per token ≈ 2 × number of parameters
(The factor of 2 comes from one multiply and one add per parameter in the dominant matrix multiplications.)
If your budget is C = 10¹¹ FLOPs/token, you can afford at most ~50 billion parameters. Want 500 billion? You need 10× the compute — and at 2017 hardware prices, that was prohibitively expensive.
The compute budget is fixed by hardware and training time. Dense scaling hits that ceiling fast.
The waste in dense networks
Here is the uncomfortable truth about dense FFN layers: not every neuron needs to fire for every input.
When a language model processes “Virat scored a century,” the neurons encoding cricketing context should be highly active. The neurons encoding chemistry, music theory, or legal vocabulary probably should not be. In a dense network, they all fire anyway — they just happen to produce small activations that contribute little to the output.
This is like calling every member of a hospital staff for every patient consultation. The cardiologist, the dermatologist, the orthopaedic surgeon, the neurologist — all of them listen to the patient complain of a sore throat, most of them contribute nothing, and all of them are paid for their time.
The parameters exist, the compute is spent, but most of the capacity is wasted on this particular input.
What sparse computation promises
If you could route each input to only the relevant subset of parameters — a few specialists out of many — you could:
- Have a model with orders of magnitude more total parameters (more knowledge capacity)
- Pay compute cost proportional only to the active parameters (a small fraction)
- Keep inference fast because only a few experts run per token
A 137-billion-parameter model where only 8 billion parameters fire per token costs about the same compute as a dense 8-billion-parameter model — but with 17× the knowledge capacity.
This is the promise of the Mixture of Experts. The paper’s job was to make it actually work at scale.