3. The idea — route each token to the best specialists

Imagine a large government hospital with 1,000 specialist doctors. When you arrive at the outpatient department, you do not see all 1,000 specialists. A general physician — the gating doctor — briefly examines you and says: “You need the cardiologist and the endocrinologist.” You see just those two. The other 998 specialists are busy with other patients — their knowledge exists in the hospital, but it is not consumed for your consultation.

Now suppose the hospital has 10 million patients a day. The gating doctor sees everyone, makes a quick routing decision, and sends each patient to the 2 most relevant specialists. The hospital’s total knowledge capacity is the combined expertise of all 1,000 doctors, but the compute cost per patient is just 2 consultations.

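This is exactly how a Mixture of Experts (MoE) layer works: the gating doctor is the gating network, the specialists are the experts, and each patient is a token.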

The MoE layer structure

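An MoE layer takes the place of a dense feed-forward layer. In the paper, MoE layers sit between stacked LSTM layers; in later Transformer models they replace some (or all) of the feed-forward sub-layers. Each MoE layer contains: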

n expert networks — each is a standard feed-forward network (two-layer MLP), identical in structure but with different learned weights. The paper used up to n = 131,072 experts (though in practice, a more manageable 1,000–2,048 was common in their experiments).
A gating network — a small learned function that takes the current token’s representation as input and outputs a probability score for each expert.
Top-k selection — only the top k experts (k = 1 or k = 2 in the paper) are actually used for a given token. All others are ignored.

The gating network in detail

For a token with representation vector x (a d_model-dimensional vector), the gating network computes:

G(x) = Softmax( TopK( x · W_g + noise, k ) )

Step by step:

Step 1: Compute raw gate scores. Multiply x by a learnable weight matrix W_g. This gives one score per expert — how relevant is expert i for this token?

Step 2: Add noise (training only). A small amount of random noise is added to the scores during training. This perturbs which experts land in the top k, so the same tokens are not always routed to the same experts and every expert is trained on a more diverse set of inputs.

Step 3: Keep only the top k. The TopK operation sets all but the k highest scores to −∞. When −∞ goes through softmax, e^(−∞) = 0, so those experts get zero weight.

Step 4: Softmax the remaining scores. The top-k scores become non-negative weights summing to 1 (over just the selected experts).

Output: A sparse vector G(x) with k non-zero entries summing to 1, and (n − k) zero entries.
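
The four steps above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function name top_k_gate, the toy values of x and W_g, and the fixed noise_std parameter are all assumptions made for the example (the paper learns a per-expert noise scale rather than using one fixed standard deviation).

```python
import math
import random

def top_k_gate(x, W_g, k, noise_std=0.0, rng=None):
    """Return a sparse list of n gate weights with exactly k non-zero entries."""
    n = len(W_g[0])
    # Step 1: raw gate scores, one per expert (x · W_g).
    scores = [sum(x[d] * W_g[d][i] for d in range(len(x))) for i in range(n)]
    # Step 2: optional Gaussian noise, intended for training only.
    # A fixed noise_std is a simplifying assumption of this sketch.
    if noise_std > 0.0:
        rng = rng or random.Random()
        scores = [s + rng.gauss(0.0, noise_std) for s in scores]
    # Step 3: keep the k highest scores and set the rest to -inf.
    keep = set(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    masked = [s if i in keep else -math.inf for i, s in enumerate(scores)]
    # Step 4: softmax. exp(-inf) is 0.0, so masked experts get zero weight.
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, -0.5, 2.0]               # toy token representation, d_model = 3
W_g = [[0.2, -0.1, 0.4, 0.0],      # toy gate weights, shape (d_model, n), n = 4
       [0.5, 0.3, -0.2, 0.1],
       [-0.3, 0.6, 0.1, 0.2]]
gates = top_k_gate(x, W_g, k=2)    # noise disabled, as at inference time
print(gates)
```

With these toy numbers the raw scores are -0.65, 0.95, 0.7 and 0.35, so experts 1 and 2 are kept and the other two entries of the result are exactly zero.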

Computing the MoE layer output

With k experts selected and their weights from G(x):

MoE(x) = Σᵢ G(x)ᵢ · Eᵢ(x)

Where Eᵢ(x) is the output of expert i applied to x. Because G(x)ᵢ = 0 for all but k experts, only k expert networks actually compute anything. The sum has only k non-zero terms.
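
As a rough sketch of this weighted sum, the snippet below skips every expert whose gate weight is zero and counts how many experts actually run. The helper names make_expert and moe_output are hypothetical, and the "experts" are simple scaling functions standing in for real feed-forward networks.

```python
def make_expert(scale):
    """Hypothetical stand-in for an expert FFN: scales its input elementwise."""
    def expert(x):
        return [scale * v for v in x]
    return expert

def moe_output(x, gates, experts):
    """Compute sum_i gates[i] * experts[i](x), skipping experts with zero gate."""
    out = [0.0] * len(x)
    calls = 0
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue              # unselected expert: never evaluated
        calls += 1
        y = expert(x)
        out = [o + g * v for o, v in zip(out, y)]
    return out, calls

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gates = [0.0, 0.75, 0.25, 0.0]    # sparse gate vector with k = 2 non-zero entries
x = [1.0, -2.0]
y, calls = moe_output(x, gates, experts)
print(y, calls)                   # [2.25, -4.5] 2
```

Only two of the four experts are called, which is the whole point of the sparse gate: the cost of the sum scales with k, not with n.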

In practice: if n = 1,000 experts and k = 2, only 2 expert networks run per token. The layer holds roughly 1,000× the parameters of a single FFN, but costs only about 2× the compute of a single FFN per token, plus the small cost of the gating network.
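
The arithmetic can be checked with a short calculation. The layer sizes d_model = 512 and d_ff = 2048 are assumed purely for illustration and are not taken from the paper; "active" parameters are used here as a rough proxy for per-token compute.

```python
def ffn_params(d_model, d_ff):
    """Weights and biases of a two-layer MLP: d_model -> d_ff -> d_model."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

d_model, d_ff = 512, 2048     # assumed sizes, for illustration only
n, k = 1000, 2

per_expert = ffn_params(d_model, d_ff)
gate_params = d_model * n                       # W_g has shape (d_model, n)
total_params = n * per_expert + gate_params     # stored in the layer
active_params = k * per_expert + gate_params    # touched for one token

print(total_params / per_expert)    # a little over 1000x a single FFN
print(active_params / per_expert)   # a little over 2x, the excess is the gate
```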

The load-balancing problem — and its fix

Here is the training trap that a naively trained MoE falls into:

Suppose expert 1 gets slightly better at a particular type of token through random chance early in training. The gating network notices and routes more of those tokens to expert 1. Expert 1 gets more training signal, becomes even better. The gating routes even more there. Eventually, expert 1 handles 90% of all tokens and experts 2 through 1,000 are nearly unused.

This is called expert collapse or load imbalance. It is a death spiral: rich experts get richer, idle experts learn nothing, and you end up with an expensive dense model hidden inside a sparse wrapper.

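The paper’s fix: auxiliary balancing terms added to the training objective. The original paper penalises variation in how much gate weight and how many tokens each expert receives.

A simpler formulation, popularised by later Transformer MoE work such as Switch Transformer and shown here, penalises unequal distribution of tokens across experts: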

L_balance = α · n · Σᵢ fᵢ · pᵢ

Where:

fᵢ = fraction of tokens in the batch routed to expert i (computed from the hard top-k assignments)
pᵢ = mean soft gating probability for expert i across the batch (computed from the soft scores before top-k)
n = number of experts
α = a small coefficient (e.g., 10⁻²) so this loss does not dominate the main language modelling loss

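When fᵢ and pᵢ are both large for one expert and small for others, L_balance is large. Because fᵢ comes from a hard top-k choice, the gradient flows only through pᵢ, so gradient descent lowers the probability given to over-used experts and pushes toward a uniform distribution: fᵢ ≈ pᵢ ≈ 1/n for all experts. With top-1 routing, the uniform case gives n · Σᵢ fᵢ · pᵢ = 1 and therefore L_balance = α, while a router that sends nearly everything to one expert gives a value close to α · n.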

This auxiliary loss is what makes large-scale MoE training stable. Without it, expert collapse is nearly inevitable.
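
A small sketch of L_balance = α · n · Σᵢ fᵢ · pᵢ on a toy batch makes the penalty concrete. The function name balance_loss and the batch contents are made up for the example. One assumption to note: fᵢ is normalised by the total number of routing slots so that it sums to 1 for any k, which matches "fraction of tokens" exactly when k = 1.

```python
def balance_loss(gate_probs, assignments, n, alpha=1e-2):
    """alpha * n * sum_i f_i * p_i for one batch of tokens.

    gate_probs:  per-token soft gate probabilities over all n experts (before top-k)
    assignments: per-token list of selected expert indices (the hard top-k choice)
    """
    tokens = len(gate_probs)
    slots = sum(len(a) for a in assignments)
    # f_i: share of routing slots given to expert i (assumption: normalised to sum to 1).
    f = [sum(a.count(i) for a in assignments) / slots for i in range(n)]
    # p_i: mean soft probability for expert i across the batch.
    p = [sum(row[i] for row in gate_probs) / tokens for i in range(n)]
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

n = 4
# Balanced batch: uniform probabilities, each expert chosen once (k = 1).
uniform = balance_loss([[0.25] * 4] * 4, [[0], [1], [2], [3]], n)
# Collapsed batch: every token strongly prefers expert 0.
collapsed = balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [[0]] * 4, n)
print(uniform, collapsed)
```

The balanced batch evaluates to about 0.01 (that is, α), and the collapsed batch to about 0.0388, close to α · n = 0.04.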

Where MoE fits in the Transformer

The paper integrates MoE layers into a stacked LSTM architecture, because it predates the original Transformer, which was published later the same year. In subsequent work, and in most modern MoE models, MoE layers replace the FFN sub-layer in Transformer encoder/decoder blocks:

Standard Transformer layer:
  Self-Attention → Add & Norm → FFN → Add & Norm

MoE Transformer layer:
  Self-Attention → Add & Norm → MoE Layer → Add & Norm

The attention mechanism is shared and dense — every token attends to every other token normally. Only the FFN is replaced by the sparse MoE. This is a sensible split: attention handles cross-token communication (where density is important for global context), while the FFN handles per-token computation (where specialisation is valuable).
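
The split can be sketched structurally: the layer below takes its per-token sub-layer as an argument, so swapping the dense FFN for an MoE changes nothing else. Everything here is a toy stand-in chosen for illustration (toy_attention averages the sequence instead of computing real self-attention, and toy_moe uses a hard-coded routing rule rather than a learned gate).

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance (no learned scale)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def block(tokens, attention, per_token_layer):
    """Attention -> Add & Norm -> per-token layer -> Add & Norm."""
    attended = attention(tokens)                  # mixes information across tokens
    h = [layer_norm(add(t, a)) for t, a in zip(tokens, attended)]
    # The second sub-layer sees each token independently.
    return [layer_norm(add(t, per_token_layer(t))) for t in h]

def toy_attention(tokens):
    """Stand-in for self-attention: every token receives the sequence mean."""
    d = len(tokens[0])
    mean = [sum(t[j] for t in tokens) / len(tokens) for j in range(d)]
    return [mean for _ in tokens]

def dense_ffn(t):
    """Stand-in for the dense FFN."""
    return [2.0 * a for a in t]

def toy_moe(t):
    """Stand-in for a top-1 MoE layer with two experts."""
    experts = [lambda v: [a * a for a in v], lambda v: [max(a, 0.0) for a in v]]
    chosen = 0 if t[0] >= 0 else 1    # hypothetical routing rule, not a learned gate
    return experts[chosen](t)

tokens = [[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]]
dense_out = block(tokens, toy_attention, dense_ffn)   # standard layer
moe_out = block(tokens, toy_attention, toy_moe)       # same layer, FFN swapped for MoE
```

Both calls share the same attention and normalisation code; only the per-token argument differs, which mirrors the two layer diagrams above.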