Top-k selection (TopK)

Appears in 1 paper

The operation that keeps the k largest values in a vector and sets all others to −∞.

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

The operation that keeps the k largest values in a vector and sets all others to −∞. Applying softmax to the result gives a sparse probability vector with exactly k non-zero entries, because exp(−∞) = 0. This is what makes the MoE layer sparse: only k experts receive non-zero gate weight, so only those k experts need to be evaluated.
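
To make the definition concrete, here is a minimal pure-Python sketch of the operation followed by softmax. The function names `keep_top_k` and `softmax`, the example logits, and the tie-breaking rule (lower index wins) are assumptions made for illustration, not details taken from the paper. The sketch also assumes 1 ≤ k ≤ len(v).

```python
import math

def keep_top_k(v, k):
    """Keep the k largest entries of v; replace every other entry with -inf."""
    # Rank indices by value, largest first. Python's sort is stable, so ties
    # go to the lower index (an assumption of this sketch).
    ranked = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
    top = set(ranked[:k])
    return [x if i in top else -math.inf for i, x in enumerate(v)]

def softmax(v):
    """Numerically stable softmax; entries equal to -inf map to exactly 0.0."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical gate logits for four experts, with k = 2.
logits = [1.0, 3.0, 0.5, 2.0]
gates = softmax(keep_top_k(logits, k=2))
# Only the experts at indices 1 and 3 receive non-zero weight.
```

Because the masked entries become exactly zero after softmax, a caller can skip the corresponding experts entirely rather than multiplying their outputs by zero.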