Add & Norm (Residual + Layer Norm)
The wrapping applied around every sub-layer: `output = LayerNorm(x + SubLayer(x))`.
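The "Add" is the residual connection: the sub-layer's input `x` is added back to its output. The "Norm" is layer normalisation, which rescales each position's feature vector to zero mean and unit variance (typically followed by a learned scale and shift). Together they make very deep networks trainable: the residual path gives gradients a direct route around each sub-layer, and normalisation keeps activation scales consistent from layer to layer. See the Normalisation tutorial.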
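The formula can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it works on a single feature vector, omits the learned scale and shift that layer normalisation usually carries, and the names `layer_norm`, `add_and_norm`, and `toy_sublayer` are hypothetical, chosen only for this example.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalise one feature vector to zero mean and (near) unit variance.

    Assumption: the learned scale and shift are left out for brevity.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Apply the wrapping LayerNorm(x + SubLayer(x))."""
    # "Add": residual connection, input added back to the sub-layer output.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    # "Norm": layer normalisation over the feature dimension.
    return layer_norm(summed)


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return [0.5 * v for v in x]


out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

Whatever the sub-layer produces, `out` has a mean of approximately zero and a variance of approximately one, so the next sub-layer receives inputs at a predictable scale.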