Expert collapse
A training failure mode where the gating network routes most tokens to a small number of popular experts, leaving the rest undertrained and effectively unused. Once started, it is self-reinforcing: popular experts receive more tokens and therefore more gradient updates, improve faster, and so attract still more routing. The auxiliary balancing loss directly combats this by penalizing uneven routing across experts.
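
To make the balancing idea concrete, here is a minimal sketch of one common form of auxiliary loss. The entry does not say which formulation is meant, so this assumes the Switch Transformer style with top-1 routing: the number of experts times the sum over experts of (fraction of tokens routed to the expert) × (mean router probability for the expert). The names `softmax` and `load_balancing_loss` are illustrative, not taken from any particular library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits):
    """Assumed Switch-style auxiliary loss for top-1 routing.

    router_logits: list of per-token lists, one logit per expert.
    Returns num_experts * sum_i f[i] * P[i].
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: router probability for expert i, averaged over tokens.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Balanced routing: each of 4 tokens prefers a different expert.
balanced = [[2.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Collapsed routing: every token strongly prefers expert 0.
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(load_balancing_loss(balanced))
print(load_balancing_loss(collapsed))
```

Under this formulation the loss is 1 when routing is perfectly uniform and approaches the number of experts as routing collapses onto a single expert, so adding it to the training objective (typically with a small weight) pushes the router back toward even usage.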