Gating network
The learned routing function `G(x) = Softmax(TopK(H(x), k))`.
The learned routing function G(x) = Softmax(TopK(H(x), k)). Takes a token representation x and outputs a sparse vector of weights, one per expert, with only k non-zero entries. The gating matrix W_g (shape d_model × n) is the learnable component. The gating network sees every token and makes a routing decision in one matrix multiply — cheap relative to the experts themselves.