Top-k selection (TopK)
The operation that keeps the k largest values in a vector and sets all others to −∞.
Combined with softmax, this produces a sparse probability vector with exactly k non-zero entries, because exp(−∞) = 0. This is the key operation that makes MoE sparse: only k experts receive non-zero weight, so only those k experts need to be evaluated.
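A minimal sketch of the masking-then-softmax step, in plain Python. The function names `top_k_mask` and `softmax`, the tie-breaking rule (lower index wins), and the example `router_logits` are illustrative assumptions, not taken from any particular MoE implementation.

```python
import math

def top_k_mask(logits, k):
    """Keep the k largest logits; replace all others with -inf."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest values. sorted() is stable, so ties are
    # broken in favour of the lower index (an assumption of this sketch).
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    keep = set(order[:k])
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]

def softmax(values):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical router scores for 4 experts, with k = 2.
router_logits = [2.0, -1.0, 0.5, 3.0]
gates = softmax(top_k_mask(router_logits, k=2))

# Only experts with non-zero gate weight would need to be run.
active = [i for i, g in enumerate(gates) if g > 0.0]
print(active)  # [0, 3]
```

The masked entries come out as exactly 0.0 rather than merely small, which is what lets an MoE layer skip the corresponding experts entirely instead of computing and down-weighting them.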