Feed-forward network (FFN)

Appears in 1 paper

A two-layer MLP applied position-wise after attention: `FFN(x) = max(0, xW₁ + b₁)W₂ + b₂`.

As used in Paper 08 — Attention Is All You Need →

A two-layer MLP applied position-wise after attention: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. W₁ projects from d_model to d_ff = 2048 (4×), then W₂ projects back. Applied independently at each position — there is no communication between positions inside the FFN. The ReLU adds non-linearity.

Paper 08 — Attention Is All You Need →

Appears in papers