Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Paper by: Albert Gu, Tri Dao
Published: December 2023
Venue: ICLR 2024 (arXiv 2023)
URL: https://arxiv.org/abs/2312.00752

What This Paper Did

Transformers are powerful but slow for long sequences — their attention mechanism is O(n²) in sequence length. Mamba trades off some flexibility to achieve O(n) linear-time performance while matching or exceeding Transformer quality on language modeling.

The key innovation: Selective State Space Models (SSMs). Instead of using fixed state transition matrices (like prior SSMs), Mamba makes the state matrices input-dependent. This allows the model to:

Remember important information (slow “decay” for key facts)
Forget irrelevant details (fast “decay” for noise)
Run extremely efficiently — 5× faster than Transformers for sequences of 2K tokens and beyond

Mamba matches or beats Transformers on:

Language modeling perplexity (how well it predicts the next word)
Code generation (HumanEval)
Mathematical reasoning

But runs significantly faster and uses less memory.

Core Numbers

Metric	Mamba	Transformer
Inference Speed (2K seq)	1×	5× slower
Memory (1M tokens)	Constant O(1)	O(n) (quadratic)
Training Compute	Similar	Similar
Language Modeling (Chinchilla 7B)	Better	Baseline
HumanEval (code)	57%	55%
Inference Mode	Sequential (O(1) per step)	Parallel (O(n) total)

The Trade-off

Transformers:

✓ Flexible attention (each position attends to any other)
✗ O(n²) inference for sequential generation
✓ Easy parallelization during training

Mamba:

✓ O(n) linear-time inference (massively faster)
✗ Less flexible than full attention
✗ Harder to parallelize (recurrent structure)

Mamba’s bet: For most real-world tasks, selectivity (focusing on important tokens) beats flexibility (attending to everything).

The Indian Analogy

Imagine a student (the model) processing a long river of information (sequence of tokens).

Traditional Attention (Transformer): The student stops at every word and asks: “What does each previous word tell me about this one?” For a paragraph of 1,000 words, they ask this question 1,000² = 1,000,000 times. Powerful understanding, but exhausting and slow.

State Space Models (Fixed, like S4): The student processes words like a river flowing downstream. Information decays naturally (old information is forgotten). But the rate of decay is the same regardless of content. During monsoon season, information decays fast. During summer, it decays fast too. The student can’t adjust.

Selective State Space Models (Mamba): The student now decides dynamically: “This word is important (teacher said it will be on the exam), so let me remember it longer (slow decay).” vs. “This word is just filler, I’ll forget it quickly (fast decay).” The decay rate depends on the content.

Result: The student understands as well as the attentive approach, but processes information as fast as the fixed river (O(n), not O(n²)).

Read This Paper in This Order

Section	What You Will Learn	Difficulty	Time
01 — Context	Why Transformers are slow; prior SSM work (S4); the state space model framework	🟡 Intermediate	8 min
02 — The Problem	Fixed SSMs treat all tokens equally; selectivity is key	🟡 Intermediate	6 min
03 — The Idea	Selective SSM: make B, C, Δ input-dependent; hardware-aware algorithm	🟡 Intermediate	12 min
04 — The Math	SSM equations; discretisation; selective parameters; hardware scan	🔴 Advanced	14 min
05 — Worked Example	Complete trace of SSM forward pass with real numbers	🟡 Intermediate	10 min
06 — The Code	Minimal Mamba implementation in PyTorch	🟡 Intermediate	5 min
07 — Limitations	In-context recall struggles; training overhead; adoption challenges	🟡 Intermediate	7 min
08 — Impact	Mamba 2, hybrid models, state space model renaissance	🟢 Beginner	5 min
09 — Summary	One-paragraph recap; what came next	🟢 Beginner	2 min

Total reading time: ~50 minutes

Before You Read: Math Tutorials You Need

You should already know these concepts. If not, read them first:

Eigenvalues and Eigenvectors — State space stability depends on eigenvalues of A
Matrix Multiplication and Projections — Core of SSM math
Differential Equations (Brief) — SSMs model continuous-time dynamics (optional, advanced)

Architecture Diagram

Input Sequence u = [u₁, u₂, ..., u_n]
         |
         ↓
Compute selective parameters at each step:
  Δₜ = softplus(Wₓ u_t)         (input-dependent step size)
  B_t = Linear(u_t)            (input-dependent input matrix)
  C_t = Linear(u_t)            (input-dependent output matrix)
         |
         ↓
Discretise state transition matrix:
  Ā = exp(Δₜ × A)  (depends on current input)
  B̄ = Δₜ × B_t
         |
         ↓
Recurrent update (same for all t):
  x_t = Ā × x_{t-1} + B̄ × u_t
         |
         ↓
Output:
  y_t = C_t × x_t
         |
         ↓
Predicted next token

← Paper 20: Gemini | Paper 22: Claude Model Card →

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

What This Paper Did

Core Numbers

The Trade-off

The Indian Analogy

Read This Paper in This Order

Before You Read: Math Tutorials You Need

Architecture Diagram

Navigation

Discussion