k (top-k experts)
The number of experts selected per token.
The paper uses k=1 or k=2. The Switch Transformer (2021) standardised k=1 for simplicity. With k=2, the output is a blend of the two selected experts' outputs, weighted by their router gate values. With k=1, no blending is needed: the output comes from the single selected expert (Switch Transformer still scales it by that expert's router probability so the router receives a gradient).
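The difference between k=1 and k=2 can be sketched for a single token as follows. This is a minimal illustration, not the paper's implementation: the names `softmax` and `top_k_route` are hypothetical, the expert outputs are scalars for readability, and the gate weights are assumed to be the softmax probabilities of the selected experts renormalised to sum to 1. Under that assumption the k=1 weight is exactly 1; implementations that keep the raw router probability (as Switch Transformer does) would scale the output instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, expert_outputs, k):
    """Combine the outputs of the k highest-scoring experts for one token.

    router_logits:  one score per expert (hypothetical router output).
    expert_outputs: one scalar per expert, standing in for a vector.
    Assumption: gate weights are the selected experts' softmax
    probabilities, renormalised so they sum to 1.
    """
    probs = softmax(router_logits)
    # Indices of the k experts with the highest router probability.
    selected = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in selected)
    weights = {i: probs[i] / norm for i in selected}
    # Weighted blend of the selected experts' outputs.
    output = sum(w * expert_outputs[i] for i, w in weights.items())
    return selected, weights, output


logits = [2.0, 0.5, 1.0, -1.0]       # made-up router scores for 4 experts
outputs = [10.0, 20.0, 30.0, 40.0]   # made-up expert outputs

print(top_k_route(logits, outputs, k=1))  # one expert, no blending
print(top_k_route(logits, outputs, k=2))  # blend of the top two experts
```

With k=1 only expert 0 is selected and its output passes through unchanged; with k=2 experts 0 and 2 are selected and the result lies between their two outputs, closer to expert 0 because it has the larger gate weight.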