Auxiliary balancing loss (L_balance)
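An additional loss term added to the main cross-entropy language modelling loss during MoE training. It penalises unequal routing of tokens across experts. Formula: L_balance = α · n · Σᵢ fᵢ · pᵢ, where, in the common Switch Transformer-style formulation, n is the number of experts, fᵢ is the fraction of tokens in the batch routed to expert i, and pᵢ is the mean gating probability assigned to expert i. Under perfectly uniform routing (fᵢ = pᵢ = 1/n) the term equals α, and it grows as routing concentrates on fewer experts. It prevents expert collapse by generating gradients that push the gating network toward a uniform distribution of tokens over experts. The coefficient α (typically 0.01) controls how strongly balancing is enforced relative to the main task.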
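The formula can be illustrated with a minimal sketch in plain Python. It assumes top-1 routing and softmax gating, which the entry does not specify; the function name `balance_loss` and its arguments are hypothetical and chosen only for this example.

```python
import math

def balance_loss(router_logits, alpha=0.01):
    """Compute alpha * n * sum_i f_i * p_i for one batch.

    router_logits: one list of per-expert logits for each token.
    Assumes top-1 routing and softmax gating (illustrative choices).
    """
    num_tokens = len(router_logits)
    n = len(router_logits[0])          # number of experts
    counts = [0] * n                   # tokens routed to each expert
    prob_sums = [0.0] * n              # summed gating probability per expert

    for logits in router_logits:
        # Numerically stable softmax over the experts.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]

        # Top-1 routing: the token goes to the highest-probability expert.
        top = max(range(n), key=probs.__getitem__)
        counts[top] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p

    f = [c / num_tokens for c in counts]      # f_i: fraction of tokens per expert
    p = [s / num_tokens for s in prob_sums]   # p_i: mean gating probability
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts: each token prefers a different expert.
balanced = balance_loss([[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]])
# Four tokens that all prefer expert 0 (collapsed routing).
collapsed = balance_loss([[10, 0, 0, 0]] * 4)
```

In this sketch `balanced` comes out at α, while `collapsed` is close to n · α, so the penalty grows as routing concentrates on one expert. In an autodiff framework fᵢ comes from a hard argmax and carries no gradient, so under these assumptions the gradient reaches the gating network through pᵢ.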