Paper 19
Intermediate

Ring Attention with Blockwise Transformers for Near-Infinite Context

Ring Attention with Blockwise Transformers for Near-Infinite Context

The fundamental problem with Transformer attention is that it requires all key-value pairs to fit on a single GPU. For a sequence of 1 million tokens with standard attention, the KV cache alone would demand hundreds of gigabytes of memory — far beyond any single GPU.

Ring Attention solves this by distributing sequences across multiple GPUs arranged in a ring topology. Each GPU holds a chunk of the sequence; as computation proceeds, KV chunks circulate around the ring device-by-device. Computation and communication overlap, hiding latency. The result: context length scales with the number of GPUs.

Ring Attention enabled models like Gemini 1.5 Pro to handle 1-million-token contexts. It represents a fundamental shift: instead of asking “how long a sequence can fit on one GPU?”, ask “how many GPUs can we use, and scale context accordingly?”

What this paper did

The core contribution: A distributed attention algorithm where:

  • P GPUs are arranged in a ring topology
  • Each GPU holds a chunk of the KV cache (n/P tokens)
  • Each GPU computes attention for its query block against all KV blocks by circulating KV chunks around the ring
  • Computation overlaps with communication, hiding communication latency
  • Result: context length grows with P, not bounded by single-GPU memory
  • Sequence length can be 1 million or more tokens (tested up to 4 million)

Key equations (informal):

Memory per GPU: O((n/P) × d) — scales with sequence / num_GPUs
Computation: Total O(n²d) distributed across P devices
Communication: O(n × d × P) — each KV chunk traverses ring P times
Per-GPU work per iteration: O((n/P)² × d) locally + O(n × d) communication

Effective context = n, where n can be arbitrarily large as P increases

The Indian analogy

Imagine a very long cricket match scorecard (1 million records) that’s too long for one analyst to manage. You have 8 analysts sitting in a circle.

Without Ring Attention: All 8 analysts need a copy of the entire scorecard. Memory explosive.

With Ring Attention: Each analyst holds 1/8 of the scorecard. They pass their section to the next analyst clockwise. While analyst 1 is analyzing their section against the section analyst 8 just handed them, analyst 2 is simultaneously receiving analyst 1’s section. By the time the scorecard has circulated the full ring, every analyst has computed their piece of the answer — and no one was sitting idle waiting for data.

The key: synchronised passing and computing means communication latency is hidden by overlap.

Comparison: Context Handling Approaches

ApproachContext SizeMemory Per GPULatencyHow It Works
Standard AttentionLimited by GPU VRAMO(n × d)O(n²)All tokens on one GPU
Mistral SWAW tokens (4–8K)O(W × d)O(n × W)Local window only
Ring Attentionn (unbounded)O((n/P) × d)O(n² / P)Distributed across P GPUs
Sparse Attentionn (unbounded)O(n × d)O(n log n)Limited patterns
KV Compressionn (unbounded)O(c × d), c << nO(n × c)Lossy compression

Ring Attention’s advantage: clean memory scaling (linear with P) + full attention expressiveness (not sparse, not windowed, not lossy).

Read in this order

SectionWhat you will learnDifficultyTime
01 ContextWhy long sequences are hard; single-GPU limits🟡 Intermediate8 min
02 The ProblemDistributed attention is challenging; communication bottleneck🔴 Advanced10 min
03 The IdeaRing topology; blockwise attention; compute-comm overlap🔴 Advanced12 min
04 The MathFormal algorithm, complexity analysis, numerical examples🔴 Advanced14 min
05 Worked ExampleStep-by-step ring circulation on 6-token sequence🟡 Intermediate10 min
06 The CodePython simulation of ring attention; Colab-runnable🟢 Beginner6 min
07 LimitationsRing topology constraints, heterogeneity, causal masking🔴 Advanced8 min
08 ImpactGemini 1.5, context parallelism, million-token models🟢 Beginner8 min
09 SummaryOne-sentence recap; what to read next🟢 Beginner2 min

Before you read: Math tutorials you need

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│         Ring Attention: 4 GPUs Processing 1M Tokens          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Original Sequence: [t₁...t₂₅₀K, t₂₅₀K+1...t₅₀₀K, ...]    │
│                      (1,000,000 tokens)                      │
│                            ↓                                 │
│  Split across 4 GPUs (250K tokens each)                      │
│                            ↓                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  GPU Ring Topology:                                 │    │
│  │                                                     │    │
│  │      GPU 0 (tokens 0–249K)                          │    │
│  │         ↙              ↖                            │    │
│  │    GPU 1              GPU 3                         │    │
│  │    (tokens           (tokens                        │    │
│  │     250K–500K)       750K–1M)                       │    │
│  │         ↘              ↙                            │    │
│  │        GPU 2 (tokens 500K–750K)                     │    │
│  │                                                     │    │
│  │  During computation:                                │    │
│  │  Round 1: Each GPU computes Q @ KV[own]            │    │
│  │           KV chunks pass clockwise                  │    │
│  │  Round 2: GPU i receives KV from GPU i-1           │    │
│  │           Computes Q @ KV[i-1]                     │    │
│  │  Round 3, 4: Continue until each GPU has seen all  │    │
│  │           KV chunks                                │    │
│  └─────────────────────────────────────────────────────┘    │
│                            ↓                                 │
│  Output: Full attention computed; every Q attended to       │
│  every KV across entire 1M token sequence                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Key Innovations

1. Blockwise Attention: Compute attention in blocks (query block × KV block) using online softmax. Numerically equivalent to full attention but allows sequential processing.

2. Ring Topology: Arrange P GPUs in a ring. Each device passes its KV block to the next. No centralised bottleneck.

3. Compute-Communication Overlap: GPU i computes attention while simultaneously receiving the next KV block. Latency is hidden.

4. Memory Scaling: Instead of O(n) memory on a single GPU, O(n/P) per GPU. As P grows, you can handle arbitrarily long sequences.

Paper 18: Mistral 7B | Paper 20: Gemini →

Discussion

Questions about this paper? Spotted something unclear? Start a discussion below — powered by GitHub, no separate account needed.