Section 01

Context: The Long-Sequence Problem

Ring Attention with Blockwise Transformers for Near-Infinite Context 2023

Context: The Long-Sequence Problem

By 2023, language models had gotten very good. But they faced a hard ceiling: context length.

GPT-4 could handle 32,768 tokens (8K tokens; later 128K). But researchers wanted 1 million tokens — an entire book in one context window. Why? Because:

  1. Better reasoning: Long context allows the model to see the full problem before reasoning, not just fragments.
  2. Real applications: Legal documents, codebases, long conversations — the natural context for many tasks is large.
  3. Research interest: If models can truly understand long documents without information loss, they’re closer to human-like reasoning.

The Technical Ceiling

Standard Transformer attention has a hard limit: all key-value pairs must fit on a single GPU’s memory.

For a sequence of n tokens:

  • KV cache = 2 × n_heads × d_head × n (in float32)

For a 1 million-token sequence with a typical model (32 heads, 128 dims per head):

KV cache = 2 × 32 × 128 × 1,000,000
         = 8,192,000,000 floats
         = 32 GB (just for KV!)

An H100 GPU has 80 GB of VRAM. After loading model weights (~40 GB), you’re left with 40 GB — enough for the KV cache, but barely, and zero room for gradients if training.

For 4 million tokens (like Gemini 1.5 Pro uses):

KV cache = 32 GB × 4 = 128 GB

No single GPU has 128 GB. Not even possible.

Why Not Mistral’s Sliding Window?

Mistral 7B uses Sliding Window Attention — each token attends only to the last 4,096 tokens. This is brilliant for efficiency, but it’s not true long-context.

If your document is 1 million tokens and you need information from token #100 when answering a question at token #1,000,000, sliding window won’t help. SWA is optimised for local coherence, not true long-range dependencies.

Ring Attention’s goal is different: enable true full attention over arbitrarily long sequences.

The Distributed Computing Angle

The idea came from distributed systems thinking:

  • One GPU can’t hold the full KV cache, but…
  • P GPUs can each hold 1/P of the KV cache
  • If GPUs coordinate efficiently, they can compute full attention by each GPU processing its KV chunk, then exchanging chunks with neighbours

The challenge: how do you distribute attention so that each GPU only touches data it has, with minimal communication overhead?

Previous Approaches (Why They Didn’t Quite Work)

1. Naive Gradient Checkpointing

Store only some KV vectors, recompute others during backprop. Reduces memory but doesn’t enable long sequence in the forward pass.

2. Sparse Attention

Only attend to a subset of tokens (e.g., local windows, bigram patterns). Saves memory, but information is lost. Not true full attention.

3. KV Compression

Quantise or hash KV pairs to reduce their size. Saves memory, but lossy — model quality degrades.

4. Naive Data Parallelism

Shard the sequence across GPUs but compute full attention on each shard locally. Doesn’t help — each GPU still needs the full KV cache.

5. Sequence Parallelism v1

Split the sequence into chunks, assign to different GPUs, but require all-to-all communication to compute attention. Communication cost is huge — more expensive than saving memory.

Ring Attention’s Insight

The breakthrough: use a ring topology to pass KV chunks from GPU to GPU in a coordinated, pipelined fashion.

  • GPU 0 has KV chunk 0
  • GPU 1 has KV chunk 1
  • GPU 2 has KV chunk 2
  • GPU 3 has KV chunk 3

Each GPU computes attention for its query chunk against its local KV chunk, then:

  • Sends its KV chunk to the next GPU (clockwise around the ring)
  • Receives the previous GPU’s KV chunk (from the counterclockwise neighbor)

By the time we’ve gone around the ring P times, every GPU has seen every KV chunk. Full attention is computed, but distributed.

Why This Matters

  • Memory scales beautifully: With P GPUs, you reduce per-GPU memory from O(n) to O(n/P)
  • No lossy compression: Full attention, numerically exact
  • Communication is pipelined: Compute and data transfer happen in parallel, hiding latency
  • Context length is unbounded: Add more GPUs, handle longer sequences

This is the key difference from Mistral’s SWA: Ring Attention doesn’t sacrifice attention expressiveness to save memory. It scales memory usage linearly with the number of GPUs, maintaining full attention throughout.

By late 2023 / early 2024, multiple large labs (Google, DeepSeek, others) were using Ring Attention or similar ideas to train models with million-token context. It’s the foundation of modern long-context models.