k (top-k experts)

Appears in 1 paper

The number of experts selected per token.

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

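The paper keeps k small but greater than 1, so that the gate always compares at least two experts: its experiments use k=4 for flat MoE layers and k=2 at each level of hierarchical ones. The Switch Transformer (2021) later showed that k=1 is sufficient and standardised on it for simplicity. With k=2, the output is a weighted blend of two expert outputs, with the weights given by the gate values. With k=1, only one expert runs and no blending is needed; in the Switch Transformer, the selected expert's output is still scaled by its router probability, which keeps the router trainable.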
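
As a rough illustration of what k controls, here is a minimal sketch in plain Python. It assumes the router has already produced one logit per expert for a single token, keeps the k largest, and applies a softmax over the kept logits only. The names `top_k_gate` and `moe_output`, the toy experts, and the logits are hypothetical and chosen for illustration; the noise term used in the paper's gating and the Switch Transformer's variant (softmax over all experts, then scaling by the selected probability) are deliberately left out.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest router logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties keep the lower index first).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts, shifted for stability.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_output(x, experts, logits, k):
    """Blend the outputs of the k selected experts for one token."""
    gates = top_k_gate(logits, k)
    # Only the selected experts are evaluated; the rest are skipped.
    return sum(w * experts[i](x) for i, w in gates.items())

# Toy experts: scalar functions standing in for expert networks (assumption).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
logits = [0.0, 2.0, 2.0]  # hypothetical router scores for one token

y1 = moe_output(3.0, experts, logits, k=1)  # one expert, gate weight 1
y2 = moe_output(3.0, experts, logits, k=2)  # weighted blend of two experts
```

With k=1 the single gate weight is exactly 1 under this softmax-over-kept-logits scheme, so the result is just the chosen expert's output. With k=2 the two selected experts share the weight in proportion to their softmax scores.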