Load balancing

Appears in 2 papers

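The goal of spreading work evenly across parallel units so that none is overloaded while others sit idle. In the papers below, the units are the n experts of a mixture-of-experts layer (work measured in tokens routed to each expert) and the P GPUs of a Ring Attention ring (work measured in compute per round).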

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

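The goal of ensuring that all n experts receive roughly equal numbers of tokens over training. Under a balanced load, every expert sees enough varied data to train properly and can develop its own specialisation. Imbalance tends to be self-reinforcing: experts that the gate favours early receive more training, improve faster, and are selected even more often, until a few experts handle most tokens and the rest go largely unused (expert collapse).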
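One way to put a number on imbalance is the squared coefficient of variation (variance divided by squared mean) of the total gate weight each expert receives over a batch, which is the quantity the paper's importance loss penalises. The sketch below is illustrative only: the function name, the list-of-rows input layout, and the use of population variance are assumptions made here, not details taken from the paper.

```python
from statistics import fmean, pvariance


def importance_cv_squared(gates):
    """Squared coefficient of variation of per-expert importance.

    gates: list of rows, one per token; each row holds that token's gate
    weight for every expert (hypothetical layout chosen for this sketch).
    Returns 0.0 when every expert receives the same total gate weight.
    """
    num_experts = len(gates[0])
    # Importance of an expert = sum of its gate weights over the batch.
    importance = [sum(row[e] for row in gates) for e in range(num_experts)]
    mean = fmean(importance)
    if mean == 0:
        return 0.0  # no token was routed anywhere; treat as balanced
    # Population variance is an assumption of this sketch.
    return pvariance(importance) / (mean ** 2)


# Eight tokens, four experts.
balanced = [[0.25, 0.25, 0.25, 0.25]] * 8   # every expert gets equal weight
collapsed = [[1.0, 0.0, 0.0, 0.0]] * 8      # one expert gets everything

print(importance_cv_squared(balanced))
print(importance_cv_squared(collapsed))
```

Adding a term proportional to this value to the training loss pushes the gate toward equal importance across experts. The scaling weight is a hyperparameter and is left out of the sketch.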

As used in Paper 19 — Ring Attention with Blockwise Transformers for Near-Infinite Context →

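Ensuring all P GPUs have roughly equal work per round. Because the GPUs exchange blocks between rounds, a round effectively lasts as long as its slowest GPU takes, so an imbalanced load leaves the faster GPUs idle and reduces overall throughput. Balanced load is critical for Ring Attention's linear speedup.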
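The cost of imbalance can be illustrated with a deliberately simplified model in which every GPU must finish its work before the next round begins, so the round time equals the largest per-GPU work. The function names and the work figures below are hypothetical, and the model ignores communication cost and compute/communication overlap.

```python
def idle_times(work):
    """Time each GPU spends waiting in one round.

    work: non-empty list of positive per-GPU compute times for the round
    (hypothetical units). The round is assumed to end when the slowest
    GPU finishes.
    """
    slowest = max(work)
    return [slowest - w for w in work]


def utilization(work):
    """Fraction of total GPU time spent computing rather than waiting."""
    return sum(work) / (len(work) * max(work))


balanced = [1.0, 1.0, 1.0, 1.0]
imbalanced = [1.0, 1.0, 1.0, 2.0]   # one GPU has twice the work

print(utilization(balanced))     # 1.0
print(utilization(imbalanced))   # 0.625
print(idle_times(imbalanced))    # [1.0, 1.0, 1.0, 0.0]
```

In this model a single GPU with double the work drags utilization down to 62.5%, since the other three GPUs wait for half of every round.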