Expert collapse

Appears in 1 paper

A training failure mode where the gating network routes most tokens to a small number of popular experts, leaving the rest undertrained and effectively unused.

As used in Paper 09 (Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer)

Expert collapse is self-reinforcing once it starts: popular experts receive more training, improve, attract still more routing from the gating network, and so receive even more training. The paper counters this with auxiliary balancing losses added to the training objective, which penalize uneven expert usage across a batch.
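As a rough illustration of how such a balancing term can detect collapse, the sketch below computes an importance-style penalty: the squared coefficient of variation of the per-expert gate totals over a batch. It is a simplified sketch, not the paper's implementation. It assumes dense softmax gates rather than noisy top-k gating, uses population variance, and the names `importance`, `cv_squared`, `balancing_loss`, and the `weight` value are hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance(gate_logits):
    """Sum each expert's gate value over all tokens in the batch."""
    n_experts = len(gate_logits[0])
    totals = [0.0] * n_experts
    for logits in gate_logits:
        for i, g in enumerate(softmax(logits)):
            totals[i] += g
    return totals


def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation (population variance / mean^2)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / (mean ** 2 + eps)


def balancing_loss(gate_logits, weight=0.01):
    """Penalty that grows as routing concentrates on a few experts.

    `weight` is an illustrative hyperparameter, not a value from the paper.
    """
    return weight * cv_squared(importance(gate_logits))


# Eight tokens, four experts: uniform routing vs. one dominant expert.
balanced = [[0.0, 0.0, 0.0, 0.0] for _ in range(8)]
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(8)]

balanced_loss = balancing_loss(balanced)
collapsed_loss = balancing_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

The penalty is zero when every expert receives the same total gate weight and grows as routing concentrates, so adding it to the training loss pushes the gating network back toward spreading tokens across experts.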