Sequence Parallelism

Appears in 1 paper

Parallelising computation along the sequence dimension of tensors, so that each device holds and processes only a slice of every sequence.

As used in Paper 19: Ring Attention with Blockwise Transformers for Near-Infinite Context

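Sequence parallelism splits the sequence (token) dimension of tensors across devices, so each device stores and processes only part of every sequence. Ring Attention is one implementation. It complements the other forms of parallelism used in modern distributed training: data parallelism (which splits batches), tensor parallelism (which splits model weights), and pipeline parallelism (which splits layers).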
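
To make the idea concrete, here is a minimal single-process sketch in NumPy. It is not the paper's implementation. It assumes non-causal, single-head attention and simulates the devices with a Python loop. The function names (`full_attention`, `ring_attention_sim`) and the `num_devices` parameter are hypothetical, chosen for illustration. Each simulated device keeps its own query block and sees the key/value blocks one at a time, as if they were passed around a ring. It accumulates the softmax incrementally, so no device ever materialises the full attention matrix.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def ring_attention_sim(q, k, v, num_devices):
    """Simulate sequence-parallel attention on `num_devices` hypothetical devices."""
    # Shard the sequence dimension (axis 0): each device owns one block.
    q_blocks = np.array_split(q, num_devices, axis=0)
    k_blocks = np.array_split(k, num_devices, axis=0)
    v_blocks = np.array_split(v, num_devices, axis=0)
    scale = 1.0 / np.sqrt(q.shape[-1])

    outputs = []
    for dev in range(num_devices):
        q_local = q_blocks[dev]
        rows = q_local.shape[0]
        running_max = np.full((rows, 1), -np.inf)  # per-row max seen so far
        denom = np.zeros((rows, 1))                # running softmax denominator
        numer = np.zeros((rows, v.shape[-1]))      # running weighted sum of values
        for step in range(num_devices):
            # Key/value block that would arrive at this device on this ring step.
            src = (dev + step) % num_devices
            scores = (q_local @ k_blocks[src].T) * scale
            new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
            # Rescale earlier partial sums to the new max before adding this block.
            correction = np.exp(running_max - new_max)
            p = np.exp(scores - new_max)
            denom = denom * correction + p.sum(axis=-1, keepdims=True)
            numer = numer * correction + p @ v_blocks[src]
            running_max = new_max
        outputs.append(numer / denom)
    # Concatenating the per-device outputs restores the full sequence.
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
sharded = ring_attention_sim(q, k, v, num_devices=4)
reference = full_attention(q, k, v)
print(np.allclose(sharded, reference))
```

In a real multi-device setting the inner loop would be replaced by sending key/value blocks to a neighbouring device while computing on the current block; that overlap of communication and computation is omitted here.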