Add & Norm (Residual + Layer Norm)

Appears in 1 paper

The wrapping applied around every sub-layer: `output = LayerNorm(x + SubLayer(x))`.

As used in Paper 08 — Attention Is All You Need →

The wrapping applied around every sub-layer: output = LayerNorm(x + SubLayer(x)). The "Add" is the residual connection (adding input back to output). The "Norm" is layer normalisation. Together they enable training of very deep networks by keeping gradient flow stable. See the Normalisation tutorial.