Gating network

Appears in 1 paper

The learned routing function `G(x) = Softmax(TopK(H(x), k))`.

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

The learned routing function G(x) = Softmax(TopK(H(x), k)). Takes a token representation x and outputs a sparse vector of weights, one per expert, with only k non-zero entries. The gating matrix W_g (shape d_model × n) is the learnable component. The gating network sees every token and makes a routing decision in one matrix multiply — cheap relative to the experts themselves.

Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

Appears in papers