Feed-forward network (FFN)
A two-layer MLP applied position-wise after attention: `FFN(x) = max(0, xW₁ + b₁)W₂ + b₂`.
W₁ projects from d_model to d_ff = 2048 (4× d_model), and W₂ projects back to d_model. The FFN is applied independently at each position with the same weights, so no information moves between positions inside it. The ReLU, the max(0, ·) term, supplies the non-linearity; without it the two projections would collapse into a single linear map.
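To make the position-wise behaviour concrete, here is a minimal pure-Python sketch of the formula. The helper names (`ffn`, `ffn_sequence`, `random_matrix`), the random weights, and the toy sizes (d_model = 8, d_ff = 32, keeping the 4× ratio) are assumptions chosen for illustration, not part of any particular library or trained model.

```python
import random

def ffn(x, W1, b1, W2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to one position vector x."""
    # First projection d_model -> d_ff, followed by ReLU.
    hidden = [
        max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    # Second projection d_ff -> d_model.
    return [
        sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]

def ffn_sequence(xs, W1, b1, W2, b2):
    """Apply the same FFN, with the same weights, to every position."""
    return [ffn(x, W1, b1, W2, b2) for x in xs]

def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes assumed for the demo; the text above uses d_ff = 2048.
d_model, d_ff, seq_len = 8, 32, 5
rng = random.Random(0)
W1 = random_matrix(d_model, d_ff, rng)
b1 = [0.0] * d_ff
W2 = random_matrix(d_ff, d_model, rng)
b2 = [0.0] * d_model
xs = random_matrix(seq_len, d_model, rng)

out = ffn_sequence(xs, W1, b1, W2, b2)

# Perturb position 0 only; every other position's output is unchanged.
xs_changed = [[v + 1.0 for v in xs[0]]] + xs[1:]
out_changed = ffn_sequence(xs_changed, W1, b1, W2, b2)
print(out[1:] == out_changed[1:])  # True
```

The final check perturbs the input at position 0 and confirms that the outputs at positions 1 onward are identical, which is what "no communication between positions" means in practice. A real implementation would typically express both projections as batched matrix multiplications over the whole sequence rather than Python loops.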