MoE (Mixture of Experts) layer
A drop-in replacement for the FFN sub-layer in a Transformer. It contains n expert networks and a gating network. For each token, the gating network selects the top-k experts, and the layer outputs a weighted sum of those experts' outputs. The attention sub-layers stay dense and are shared across all tokens.
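
The per-token routing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. The entry does not specify several details, so the following are assumptions: the gate is a single linear map (W_gate), each expert is a two-layer ReLU FFN, and the mixing weights come from a softmax over the top-k gate logits only. All names (moe_layer, make_expert, W_gate) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_expert(d_model, d_hidden, rng):
    """Build one expert: a small two-layer ReLU FFN (assumed architecture)."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        h = [max(0.0, v) for v in matvec(W1, x)]  # ReLU
        return matvec(W2, h)

    return expert


def moe_layer(x, experts, W_gate, k):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = matvec(W_gate, x)  # one gate logit per expert
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: weights are a softmax over the selected logits only.
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights


# Usage: 4 experts, route each token to 2 of them.
rng = random.Random(0)
d_model, d_hidden, n_experts, k = 4, 8, 4, 2
experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
W_gate = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]
out, top, weights = moe_layer(token, experts, W_gate, k)
```

Only the k selected experts run for a given token, so per-token compute scales with k rather than n. The output has the same dimension as the input, which is what lets the layer stand in for a dense FFN.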