Paper 21
Intermediate

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Paper by: Albert Gu, Tri Dao
Published: December 2023
Venue: ICLR 2024 (arXiv 2023)
URL: https://arxiv.org/abs/2312.00752


What This Paper Did

Transformers are powerful but slow for long sequences — their attention mechanism is O(n²) in sequence length. Mamba trades off some flexibility to achieve O(n) linear-time performance while matching or exceeding Transformer quality on language modeling.

The key innovation: Selective State Space Models (SSMs). Instead of using fixed state transition matrices (like prior SSMs), Mamba makes the state matrices input-dependent. This allows the model to:

  • Remember important information (slow “decay” for key facts)
  • Forget irrelevant details (fast “decay” for noise)
  • Run extremely efficiently — 5× faster than Transformers for sequences of 2K tokens and beyond

Mamba matches or beats Transformers on:

  • Language modeling perplexity (how well it predicts the next word)
  • Code generation (HumanEval)
  • Mathematical reasoning

But runs significantly faster and uses less memory.

Core Numbers

MetricMambaTransformer
Inference Speed (2K seq)5× slower
Memory (1M tokens)Constant O(1)O(n) (quadratic)
Training ComputeSimilarSimilar
Language Modeling (Chinchilla 7B)BetterBaseline
HumanEval (code)57%55%
Inference ModeSequential (O(1) per step)Parallel (O(n) total)

The Trade-off

Transformers:

  • ✓ Flexible attention (each position attends to any other)
  • ✗ O(n²) inference for sequential generation
  • ✓ Easy parallelization during training

Mamba:

  • ✓ O(n) linear-time inference (massively faster)
  • ✗ Less flexible than full attention
  • ✗ Harder to parallelize (recurrent structure)

Mamba’s bet: For most real-world tasks, selectivity (focusing on important tokens) beats flexibility (attending to everything).


The Indian Analogy

Imagine a student (the model) processing a long river of information (sequence of tokens).

Traditional Attention (Transformer): The student stops at every word and asks: “What does each previous word tell me about this one?” For a paragraph of 1,000 words, they ask this question 1,000² = 1,000,000 times. Powerful understanding, but exhausting and slow.

State Space Models (Fixed, like S4): The student processes words like a river flowing downstream. Information decays naturally (old information is forgotten). But the rate of decay is the same regardless of content. During monsoon season, information decays fast. During summer, it decays fast too. The student can’t adjust.

Selective State Space Models (Mamba): The student now decides dynamically: “This word is important (teacher said it will be on the exam), so let me remember it longer (slow decay).” vs. “This word is just filler, I’ll forget it quickly (fast decay).” The decay rate depends on the content.

Result: The student understands as well as the attentive approach, but processes information as fast as the fixed river (O(n), not O(n²)).


Read This Paper in This Order

SectionWhat You Will LearnDifficultyTime
01 — ContextWhy Transformers are slow; prior SSM work (S4); the state space model framework🟡 Intermediate8 min
02 — The ProblemFixed SSMs treat all tokens equally; selectivity is key🟡 Intermediate6 min
03 — The IdeaSelective SSM: make B, C, Δ input-dependent; hardware-aware algorithm🟡 Intermediate12 min
04 — The MathSSM equations; discretisation; selective parameters; hardware scan🔴 Advanced14 min
05 — Worked ExampleComplete trace of SSM forward pass with real numbers🟡 Intermediate10 min
06 — The CodeMinimal Mamba implementation in PyTorch🟡 Intermediate5 min
07 — LimitationsIn-context recall struggles; training overhead; adoption challenges🟡 Intermediate7 min
08 — ImpactMamba 2, hybrid models, state space model renaissance🟢 Beginner5 min
09 — SummaryOne-paragraph recap; what came next🟢 Beginner2 min

Total reading time: ~50 minutes


Before You Read: Math Tutorials You Need

You should already know these concepts. If not, read them first:

  1. Eigenvalues and Eigenvectors — State space stability depends on eigenvalues of A
  2. Matrix Multiplication and Projections — Core of SSM math
  3. Differential Equations (Brief) — SSMs model continuous-time dynamics (optional, advanced)

Architecture Diagram

Input Sequence u = [u₁, u₂, ..., u_n]
         |

Compute selective parameters at each step:
  Δₜ = softplus(Wₓ u_t)         (input-dependent step size)
  B_t = Linear(u_t)            (input-dependent input matrix)
  C_t = Linear(u_t)            (input-dependent output matrix)
         |

Discretise state transition matrix:
  Ā = exp(Δₜ × A)  (depends on current input)
  B̄ = Δₜ × B_t
         |

Recurrent update (same for all t):
  x_t = Ā × x_{t-1} + B̄ × u_t
         |

Output:
  y_t = C_t × x_t
         |

Predicted next token

Paper 20: Gemini | Paper 22: Claude Model Card

Discussion

Questions about this paper? Spotted something unclear? Start a discussion below — powered by GitHub, no separate account needed.