Switch Transformer

Appears in 1 paper

Google's 2021 simplification of MoE: k=1 routing (route each token to exactly one expert, no blending).

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

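Switch Transformer keeps the sparsely-gated MoE layer introduced in this paper but routes each token to a single expert (k=1) instead of blending the outputs of the top k≥2 experts. Its authors report that k=1 routing is simpler to implement and cheaper in router computation and communication than k=2, while still scaling to models of up to 1.6 trillion parameters. It also replaces the earlier pair of balancing losses with one auxiliary load-balancing loss: for each expert, the fraction of tokens actually routed to it (the hard assignment) is multiplied by the mean router probability it receives (the soft term), and the products are summed.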
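
The routing rule and balancing loss can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: it assumes the auxiliary loss has the form alpha * N * sum_i(f_i * P_i), where N is the number of experts, f_i is the fraction of tokens dispatched to expert i, and P_i is the mean router probability for expert i. The function names (`softmax`, `switch_route`, `load_balance_loss`) and the default `alpha` are hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(router_logits):
    """Top-1 (k=1) routing.

    router_logits: one list of per-expert logits for each token.
    Returns (assignments, probs_per_token), where assignments holds one
    (expert_index, gate_probability) pair per token. The gate probability
    would scale the chosen expert's output; no other expert contributes.
    """
    assignments = []
    probs_per_token = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        assignments.append((expert, probs[expert]))
        probs_per_token.append(probs)
    return assignments, probs_per_token


def load_balance_loss(assignments, probs_per_token, alpha=0.01):
    """Assumed auxiliary loss: alpha * N * sum_i f_i * P_i (alpha is illustrative)."""
    num_tokens = len(assignments)
    num_experts = len(probs_per_token[0])

    # f_i: fraction of tokens whose top-1 choice is expert i (hard assignment).
    f = [0.0] * num_experts
    for expert, _ in assignments:
        f[expert] += 1.0 / num_tokens

    # P_i: mean router probability given to expert i (soft term).
    p = [
        sum(token_probs[i] for token_probs in probs_per_token) / num_tokens
        for i in range(num_experts)
    ]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    balanced = switch_route([[2.0, 0.0], [0.0, 2.0]])
    collapsed = switch_route([[2.0, 0.0], [2.0, 0.0]])
    print("balanced loss: ", load_balance_loss(*balanced))
    print("collapsed loss:", load_balance_loss(*collapsed))
```

Under this form the loss is smallest when tokens and router probability are spread evenly across experts, and it grows when the router collapses onto one expert, which is why the two calls in the usage example give different values.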