Section 09

Summary: Mamba in one page

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023)


One Sentence

Mamba replaces attention with a selective state space model that learns which information to keep and which to discard, giving O(n) scaling in sequence length; the paper reports Transformer-level quality at comparable size and about 5× higher inference throughput.


The Problem It Solved

Transformers are O(n²) in sequence length.

For a sequence of n tokens:

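  • Memory: quadratic in n if the n × n attention matrix is materialized
  • Compute: quadratic in n
  • Inference: the KV cache grows linearly with n and must stay in GPU memory

At 100K tokens or beyond, standard Transformers become hard to run: the KV cache dominates memory and per-token latency climbs.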
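
The scaling above can be made concrete with rough arithmetic. The sketch below is only illustrative: the layer count, head count, head size, and 2-byte values are assumed numbers, not figures from the paper, and the score count assumes the attention matrix is materialized naively.

```python
def attention_costs(n_tokens, n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    """Rough per-sequence costs for a decoder-only Transformer (assumed sizes)."""
    # One n x n score matrix per head per layer when computed naively.
    score_entries = n_tokens * n_tokens * n_heads * n_layers
    # KV cache: one key and one value vector per token, head, and layer.
    kv_cache_bytes = 2 * n_tokens * n_layers * n_heads * head_dim * bytes_per_value
    return score_entries, kv_cache_bytes

for n in (2_000, 100_000):
    scores, kv = attention_costs(n)
    print(f"n={n:>7}: score entries={scores:.2e}, KV cache={kv / 2**30:.1f} GiB")
```

Going from 2K to 100K tokens multiplies the score count by 2500 but the KV cache by only 50, which is why compute is the quadratic term and the cache the linear one.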

State Space Models (SSMs) scale as O(n), but prior SSMs such as S4 had fixed, input-independent parameters and applied the same dynamics to every token. They couldn’t adapt to content.


Key Ideas

1. Selective State Space Model (Selective SSM)

  • Instead of fixed A, B, C matrices → make B, C, and Δ (step size) input-dependent
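  • At each step, the model decides how much of the current token to write into the state and how much of the old state to keep: a large Δ focuses on the current token and lets older state fade, while a small Δ preserves the state and largely skips the token
  • A itself stays a learned, input-independent matrix; the per-step decay is exp(ΔA), so it varies with the input through Δ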
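
A minimal NumPy sketch of such a selective recurrence, written as a plain sequential loop. The projection names (W_B, W_C, W_delta, b_delta) and the full-rank Δ projection are simplifying assumptions made for illustration; this is not the paper's fused kernel or its exact parameterization.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_delta, b_delta):
    """Sequential reference loop. x: (L, D) inputs, A: (D, N) with negative entries."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # fixed-size state: N values per channel
    ys = np.empty((L, D))
    for t in range(L):
        B_t = x[t] @ W_B                      # (N,) input-dependent B
        C_t = x[t] @ W_C                      # (N,) input-dependent C
        # softplus keeps the step size positive; (D,) input-dependent delta
        delta_t = np.logaddexp(0.0, x[t] @ W_delta + b_delta)
        A_bar = np.exp(delta_t[:, None] * A)  # per-step decay in (0, 1)
        B_bar = delta_t[:, None] * B_t[None, :]
        h = A_bar * h + B_bar * x[t][:, None] # keep some old state, write some new input
        ys[t] = h @ C_t                       # read the state out through C
    return ys

rng = np.random.default_rng(0)
L, D, N = 6, 4, 3
x = rng.normal(size=(L, D))
A = -np.exp(rng.normal(size=(D, N)))          # negative, so the state decays
y = selective_ssm(x, A, rng.normal(size=(D, N)), rng.normal(size=(D, N)),
                  rng.normal(size=(D, D)), np.zeros(D))
print(y.shape)
```

The state h has the same shape at every step, whatever the sequence length.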

2. Content-Aware Memory

  • Mamba’s hidden state is a summary that selectively compresses information
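  • Information worth keeping: persists as long as later tokens get a small Δ (decay factor exp(ΔA) close to 1)
  • Noise: also gets a small Δ, so it is barely written into the state; a large Δ (decay factor close to 0) resets the state in favor of the current token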
  • Result: better use of fixed-size state vs. fixed-rate SSMs
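
To get a feel for how strongly the step size controls retention, here is a small sketch assuming a single scalar state with A = -1 and a constant Δ over 100 steps. Both are illustrative assumptions; in Mamba, Δ changes per token and per channel.

```python
import math

def retained_fraction(delta, steps, a=-1.0):
    """Fraction of a stored value left after `steps` updates with a constant step size."""
    # Each step multiplies the state by exp(delta * a); steps compound multiplicatively.
    return math.exp(delta * a * steps)

for delta in (0.01, 1.0):
    print(f"delta={delta}: {retained_fraction(delta, 100):.3g} of the state survives 100 steps")
```

A small Δ leaves a sizeable share of the stored value intact after 100 steps, while a large Δ wipes it out almost immediately.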

3. Hardware-Aware Algorithm

  • Uses a parallel scan (Blelloch scan) during training to parallelize the recurrence
  • Switches to recurrence mode during inference (O(1) per token)
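  • Relies on a fused custom CUDA kernel that keeps the expanded state in fast on-chip SRAM; a plain PyTorch implementation of the same recurrence runs, but is much slower and uses more memory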
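
The reason a scan applies at all is that the recurrence h_t = a_t · h_{t-1} + b_t can be expressed with an associative combine on (a, b) pairs. The sketch below uses the simpler Hillis-Steele scan in NumPy rather than the work-efficient Blelloch variant, and scalar a_t, b_t instead of Mamba's per-channel states; it is meant only to show that the log-step scan reproduces the sequential loop.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h = np.zeros(len(a))
    prev = 0.0
    for t in range(len(a)):
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h

def log_step_scan(a, b):
    """Hillis-Steele inclusive scan; combine (a1, b1), (a2, b2) -> (a1*a2, a2*b1 + b2)."""
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    shift = 1
    while shift < len(a):
        # Each position merges with the partial result `shift` places to its left.
        a_left = a[:-shift].copy()
        b_left = b[:-shift].copy()
        b[shift:] = a[shift:] * b_left + b[shift:]   # uses a before it is updated
        a[shift:] = a[shift:] * a_left
        shift *= 2                                   # about log2(n) rounds in total
    return b

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=37)
b = rng.normal(size=37)
print(np.allclose(sequential_scan(a, b), log_step_scan(a, b)))
```

Every round updates all positions at once, which is the part a GPU can parallelize; the Blelloch formulation reaches the same result with O(n) total work.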

4. Linear Recurrence at Inference

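  • Training: O(n) work via parallel scan (the FFT convolution used by S4 no longer applies once parameters depend on the input)
  • Inference: O(1) time per generated token and a fixed-size state, independent of sequence length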
  • No KV cache needed; state stays constant size
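
The difference in decode-time footprint can be traced with a toy loop. The update rule and the key/value contents below are placeholders with assumed sizes, chosen only so that the shapes behave like a recurrent state and a KV cache.

```python
import numpy as np

def decode_footprint(n_steps, d_model=8, d_state=4):
    """Track stored values per step: fixed SSM state vs. growing KV cache."""
    rng = np.random.default_rng(0)
    state = np.zeros((d_model, d_state))           # recurrent state, shape never changes
    kv_cache = []                                  # attention keeps one (k, v) per token
    sizes = []
    for _ in range(n_steps):
        x_t = rng.normal(size=d_model)
        state = 0.9 * state + 0.1 * x_t[:, None]   # placeholder update, same shape
        kv_cache.append((x_t.copy(), x_t.copy()))  # placeholder key and value
        sizes.append((state.size, 2 * d_model * len(kv_cache)))
    return sizes

sizes = decode_footprint(1000)
print(sizes[0], sizes[-1])
```

The first number in each pair stays fixed while the second grows by 2 × d_model per generated token.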

Key Numbers

Metric                                   Value
Speedup over Transformer (2K+ tokens)    About 5× higher inference throughput
State memory at 1M tokens                O(1), constant
Transformer KV cache at 1M tokens        O(n), grows with every token
Language modeling (Chinchilla 7B)        Matches or beats Transformer
HumanEval (code)                         57% (Mamba) vs 55% (Transformer)
Largest open Mamba model                 7B parameters (as of 2025)
Training time                            Similar to Transformer (not faster)

Indian Analogy Recap

A student is processing a long river of information. Instead of stopping at every word to compare it with every previous word (O(n²), exhausting), the student learns to say either “I’ll remember this concept deeply (important for exams)” or “I’ll skim over this filler (can forget).”

Result: Same understanding, much faster processing.


What Came Next

Mamba 2 (Dao & Gu, 2024)

  • Theoretical reformulation connecting SSMs to attention
  • Core layer (SSD) reported as 2–8× faster than Mamba’s selective scan
  • Deeper understanding of why selective SSMs work

Jamba (AI21 Labs, 2024)

  • Hybrid model: alternates Mamba and Attention blocks
  • First commercial Mamba-based LLM
  • Evidence that hybrids are practical at production scale
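
As a rough picture of what a hybrid stack looks like, the sketch below builds a layer schedule with one attention layer per group of Mamba layers. The function name and the ratio used here are hypothetical, chosen for illustration, and are not claimed to be Jamba's actual configuration.

```python
def hybrid_schedule(n_layers, attention_every):
    """Layer types for a hybrid stack: every `attention_every`-th layer is attention."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

print(hybrid_schedule(8, 4))
```

Most layers keep the fixed-size state of Mamba, while the occasional attention layer restores exact lookup over earlier tokens.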

Hybrid Architectures

  • StripedHyena (Together AI)
  • Upcoming LLaMA variants with Mamba layers
  • Future: strategic mixing, not pure Mamba or pure Attention

When to Use Mamba

Use Mamba (or Jamba hybrid):

  • Processing very long sequences (10K+ tokens)
  • On-device inference (limited memory)
  • Real-time systems where latency per token matters
  • Building efficient models with small compute budgets

Use Transformers:

  • Most NLP tasks (text under 4K tokens)
  • Retrieval-heavy problems (“needle in haystack”)
  • Vision-language (images + text)
  • Fine-tuning on large-scale instruction data
  • When you need mature ecosystems (Hugging Face, vLLM, etc.)

Limitations Worth Knowing

  1. Weaker in-context recall: a fixed-size state can’t look up token position 5000 exactly the way Attention can
  2. Sequential generation: the recurrence yields one token at a time, as with any autoregressive decoder, so decoding itself is hard to parallelize
  3. Training not faster in practice: at typical sequence lengths the speed gain shows up mainly at inference on long sequences
  4. Limited ecosystem: fewer libraries, tools, and pre-trained models than Transformers
  5. Less proven at scale: the largest pure Mamba models are around 7B parameters, while Transformers have been scaled to hundreds of billions

The Verdict

Mamba is not a Transformer killer. It’s a complementary approach that proved:

  • ✓ Linear-time sequence modeling works
  • ✓ Selectivity beats inflexibility
  • ✓ Hybrid architectures are the practical future
  • ✗ But it’s not universally better; context-specific trade-offs matter

In 2025, the field is settling on hybrid models (Mamba + Attention blocks) for balanced efficiency and capability. Pure Mamba shines in niche (but important) use cases.

