Switch Transformer
Google's 2021 simplification of MoE: k=1 routing (route each token to exactly one expert, no blending).
Google's 2021 simplification of MoE: k=1 routing (route each token to exactly one expert, no blending). Proved more stable than k=2, easier to implement, and scalable to 1.6 trillion parameters. Simplified the balancing loss to just the soft probabilities, removing the hard-assignment term.