Auxiliary balancing loss (L_balance)

Appears in 1 paper

An additional loss term added to the main cross-entropy language modelling loss during MoE training.

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

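The balancing loss penalises unequal routing of tokens across experts. Formula: L_balance = α · n · Σᵢ fᵢ · pᵢ, where n is the number of experts, fᵢ is the fraction of tokens dispatched to expert i, and pᵢ is the mean gating probability assigned to expert i. Under perfectly uniform routing (fᵢ = pᵢ = 1/n) the loss equals α. The term prevents expert collapse: fᵢ comes from a hard assignment and carries no gradient, so the gradients flow through pᵢ and push the gating network toward a uniform distribution over experts. The coefficient α (typically 0.01) controls how strongly balancing is enforced relative to the main task. Note that this f · p form is the one given in the Switch Transformer paper; Paper 09 itself penalises the squared coefficient of variation of per-expert importance and load.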
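To make the formula concrete, here is a minimal sketch in plain Python. It assumes top-1 routing (each token is sent to its highest-probability expert) and a softmax gate; the function names `softmax` and `balance_loss` and the toy logits are illustrative choices, not taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's expert logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def balance_loss(router_logits, alpha=0.01):
    """Compute alpha * n * sum_i f_i * p_i.

    router_logits: one list of n expert logits per token.
    Assumes top-1 routing (hypothetical setup for illustration).
    """
    num_tokens = len(router_logits)
    n = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * n
    for row in probs:
        counts[row.index(max(row))] += 1
    f = [c / num_tokens for c in counts]

    # p[i]: mean gating probability assigned to expert i.
    p = [sum(row[i] for row in probs) / num_tokens for i in range(n)]

    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

# Two tokens, two experts (toy values).
balanced = [[5.0, 0.0], [0.0, 5.0]]   # one token per expert
collapsed = [[5.0, 0.0], [5.0, 0.0]]  # both tokens go to expert 0

print(balance_loss(balanced))
print(balance_loss(collapsed))
```

In the balanced case fᵢ = pᵢ = 1/2, so the loss comes out at approximately α = 0.01; in the collapsed case it is close to twice that, which is the penalty that pushes the gate away from favouring a single expert.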
