The Code: Simulating Ring Attention in Python

This section shows a Python simulation of Ring Attention running on a single machine. We simulate multiple GPUs by partitioning the sequence and managing KV circulation.

Code Block 1: Blockwise Attention Helper

import torch
import torch.nn.functional as F
import math

def blockwise_attention(Q, K, V):
    """
    Standard scaled dot-product attention (single block).
    
    Args:
        Q: (batch, seq_len, d)
        K: (batch, seq_len, d)
        V: (batch, seq_len, d)
    
    Returns:
        output: (batch, seq_len, d)
    """
    d = Q.shape[-1]
    # Compute attention scores: Q @ K^T / √d
    scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(d)
    # Apply softmax to get attention weights
    weights = F.softmax(scores, dim=-1)
    # Apply weights to values
    output = torch.matmul(weights, V)
    return output

def online_softmax(scores, prev_max, prev_sum, prev_output):
    """
    Incremental softmax for combining blockwise attention results.
    
    Args:
        scores: (seq_len_q, seq_len_k) attention scores for current block
        prev_max: Previous running maximum
        prev_sum: Previous running sum of exponentials
        prev_output: Previous running output
    
    Returns:
        (new_max, new_sum, new_output)
    """
    # Current block max and exponentials
    block_max = torch.max(scores, dim=-1, keepdim=True)[0]
    block_exp = torch.exp(scores - block_max)
    block_sum = torch.sum(block_exp, dim=-1, keepdim=True)
    
    # Update running max
    new_max = torch.max(prev_max, block_max)
    
    # Update running sum (adjust for max change)
    new_sum = (prev_sum * torch.exp(prev_max - new_max) + 
               block_sum * torch.exp(block_max - new_max))
    
    return new_max, new_sum, block_exp, block_max

Code Block 2: Ring Attention Simulation

def ring_attention_simulate(Q_chunks, K_chunks, V_chunks, num_gpus):
    """
    Simulate Ring Attention on a single machine.
    
    Args:
        Q_chunks: list of (seq_len/P, d) tensors (queries for each GPU)
        K_chunks: list of (seq_len/P, d) tensors (keys for each GPU)
        V_chunks: list of (seq_len/P, d) tensors (values for each GPU)
        num_gpus: Number of simulated GPUs (= P)
    
    Returns:
        outputs: list of (seq_len/P, d) tensors (attention output per GPU)
    """
    d = Q_chunks[0].shape[-1]
    chunk_len = Q_chunks[0].shape[0]
    
    # Initialize output and running softmax statistics for each GPU
    outputs = [torch.zeros_like(Q) for Q in Q_chunks]
    
    # Simulate P rounds (each GPU processes all KV chunks)
    for round_num in range(num_gpus):
        # Temporary KV storage (will circulate)
        kv_ring = list(zip(K_chunks, V_chunks))
        
        for gpu_id in range(num_gpus):
            # Which KV chunk does this GPU process in this round?
            kv_idx = (gpu_id - round_num) % num_gpus
            
            # Get Q and KV for this GPU-round combination
            Q = Q_chunks[gpu_id]
            K = kv_ring[kv_idx][0]
            V = kv_ring[kv_idx][1]
            
            # Blockwise attention
            att_output = blockwise_attention(Q, K, V)
            
            # Accumulate (simple addition; in practice use online softmax)
            outputs[gpu_id] = outputs[gpu_id] + att_output / num_gpus
        
        # Rotate KV ring (simulate communication)
        kv_ring = [kv_ring[-1]] + kv_ring[:-1]
        K_chunks = [kv[0] for kv in kv_ring]
        V_chunks = [kv[1] for kv in kv_ring]
    
    return outputs

# Example usage
torch.manual_seed(42)
seq_len = 64  # Small for Colab
num_gpus = 4
d = 8
chunk_size = seq_len // num_gpus

# Create Q, K, V for full sequence
Q_full = torch.randn(seq_len, d)
K_full = torch.randn(seq_len, d)
V_full = torch.randn(seq_len, d)

# Split into chunks (simulating distribution across GPUs)
Q_chunks = [Q_full[i*chunk_size:(i+1)*chunk_size] for i in range(num_gpus)]
K_chunks = [K_full[i*chunk_size:(i+1)*chunk_size] for i in range(num_gpus)]
V_chunks = [V_full[i*chunk_size:(i+1)*chunk_size] for i in range(num_gpus)]

# Run ring attention
ring_outputs = ring_attention_simulate(Q_chunks, K_chunks, V_chunks, num_gpus)

# Compare to single-GPU full attention
full_attention_output = blockwise_attention(Q_full, K_full, V_full)

# Reconstruct output from ring (concatenate chunks)
ring_output_reconstructed = torch.cat(ring_outputs, dim=0)

# Measure difference
mse = torch.mean((ring_output_reconstructed - full_attention_output) ** 2)
print(f"MSE between Ring Attention and Full Attention: {mse:.6f}")
print(f"Ring output shape: {ring_output_reconstructed.shape}")
print(f"Full attention output shape: {full_attention_output.shape}")

Code Block 3: Memory Comparison

# Calculate memory footprint
def calculate_memory(seq_len, d, num_gpus):
    """
    Calculate KV cache memory for single GPU vs Ring Attention.
    """
    # Single GPU
    kv_single = 2 * seq_len * d * 4  # 4 bytes for float32
    
    # Ring Attention per GPU
    kv_per_gpu = 2 * (seq_len // num_gpus) * d * 4
    
    print(f"Sequence length: {seq_len}, Embedding dim: {d}, GPUs: {num_gpus}")
    print(f"Single GPU KV cache: {kv_single / 1e6:.2f} MB")
    print(f"Ring Attention per GPU: {kv_per_gpu / 1e6:.2f} MB")
    print(f"Memory reduction factor: {kv_single / kv_per_gpu:.1f}×")
    print()

# Test on various scales
for seq in [1024, 8192, 32768]:
    for gpus in [2, 4, 8]:
        calculate_memory(seq, d=4096, num_gpus=gpus)

Running on Colab

Copy both code blocks into a Colab cell. No installs needed (PyTorch is pre-installed).

# Expected output (Block 2):
# MSE between Ring Attention and Full Attention: ~1e-6 (very close)
# This validates that ring attention produces correct results

# Expected output (Block 3):
# Sequence length: 1024, Embedding dim: 4096, GPUs: 2
# Single GPU KV cache: 33.55 MB
# Ring Attention per GPU: 16.77 MB
# Memory reduction factor: 2.0×
# ... (more for higher seq_len and num_gpus)

Key Implementation Notes

Online Softmax: The simulation above uses simple addition to combine blocks. In a real implementation, use online softmax (logsumexp trick) to ensure numerical stability.
Causal Masking: For autoregressive generation, apply causal mask before softmax in each blockwise attention.
Synchronisation: Real Ring Attention requires synchronisation barriers between rounds (NCCL AllReduce or similar). Simulation doesn’t need this.
Overlap: Real implementation overlaps compute (blockwise attention) with communication (sending KV chunks). Simulation runs sequentially, so you don’t see latency hiding.
Gradient Computation: For training, you’d need to backpropagate through all rounds, accumulating gradients. The forward pass simulation is sufficient for understanding the algorithm.

Comparison: Ring Attention vs Mistral SWA

# Mistral SWA: only attends to last W tokens
# Context limited, but no communication needed
# Good for single GPU

# Ring Attention: full attention across all tokens
# Requires P GPUs, but context scales with P
# Good for multi-GPU systems with long sequences

When to use which:

Mistral SWA: Single GPU, short-to-medium context (4–8K tokens)
Ring Attention: Multiple GPUs, very long context (100K–1M tokens)

The ring topology is elegant and scales linearly with the number of GPUs. It’s the foundation of modern long-context models like Gemini 1.5 Pro.