Context Parallelism

Appears in 1 paper

Distributing a long sequence across multiple GPUs along the sequence dimension (distinct from data, tensor, or pipeline parallelism).

As used in Paper 19: Ring Attention with Blockwise Transformers for Near-Infinite Context

Ring Attention is a prominent implementation of context parallelism: each GPU holds one block of the sequence and passes key and value blocks to its neighbor in a ring while attention is computed block by block. Because no single GPU ever has to hold the full sequence, this enables training and inference on sequences far longer than single-GPU memory allows.
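To make splitting along the sequence dimension concrete, here is a minimal single-process NumPy sketch. It is not the paper's implementation: the names `full_attention` and `ring_attention_sim` are hypothetical, the "devices" are just list entries, attention is non-causal and single-head, and the ring exchange is simulated by rotating a Python list rather than by real GPU communication. It assumes the sequence has at least as many tokens as there are devices. Each simulated device keeps its own query block, sees one key/value block per step, and merges the partial results with a running (streaming) softmax.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: ordinary softmax attention computed on one device."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def ring_attention_sim(q, k, v, num_devices):
    """Simulate context parallelism: shard the sequence, rotate K/V blocks in a ring."""
    d = q.shape[-1]
    # Each simulated device owns one contiguous block of the sequence.
    q_blocks = np.array_split(q, num_devices)
    kv_blocks = list(zip(np.array_split(k, num_devices),
                         np.array_split(v, num_devices)))

    # Per-device running statistics for a numerically stable streaming softmax.
    row_max = [np.full((qb.shape[0], 1), -np.inf) for qb in q_blocks]
    denom = [np.zeros((qb.shape[0], 1)) for qb in q_blocks]
    numer = [np.zeros((qb.shape[0], v.shape[-1])) for qb in q_blocks]

    for _ in range(num_devices):
        for dev in range(num_devices):
            k_blk, v_blk = kv_blocks[dev]  # the K/V block this device holds now
            scores = q_blocks[dev] @ k_blk.T / np.sqrt(d)
            new_max = np.maximum(row_max[dev], scores.max(axis=-1, keepdims=True))
            scale = np.exp(row_max[dev] - new_max)  # rescale earlier partial sums
            p = np.exp(scores - new_max)
            denom[dev] = denom[dev] * scale + p.sum(axis=-1, keepdims=True)
            numer[dev] = numer[dev] * scale + p @ v_blk
            row_max[dev] = new_max
        # Ring step: every device hands its K/V block to the next device.
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    # Stitch the per-device output blocks back into the full sequence.
    return np.concatenate([n / s for n, s in zip(numer, denom)])

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = ring_attention_sim(q, k, v, num_devices=4)
print(np.allclose(out, full_attention(q, k, v)))
```

In this sketch no simulated device ever materializes the full 16 x 16 score matrix; each step touches only one query block against one key/value block, which is where the memory saving comes from. A real system would overlap the block transfer with computation and handle causal masking, both of which are omitted here.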