Summary: Mamba in one page
One Sentence
Mamba trades attention’s flexibility for linear-time O(n) efficiency by learning which information to remember and which to forget; the original paper reports quality on par with Transformers of similar size and up to 5× higher inference throughput.
The Problem It Solved
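Self-attention in Transformers costs O(n²) in sequence length.
For a sequence of n tokens:
- Memory: quadratic in n when the full n × n attention matrix is materialized
- Compute: quadratic in n
- Inference: the KV pairs of every previous token must stay on the GPU, so the cache grows with n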
At 100K tokens or beyond, Transformers become expensive to run: memory explodes and speed crawls.
State Space Models (SSMs) are O(n), but prior SSMs (like S4) had fixed, time-invariant parameters that treated every token the same way. They couldn’t adapt to content.
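To make the scaling gap concrete, here is a small arithmetic sketch. It only counts entries in a single n × n attention score matrix against the number of state updates in a recurrent scan; the function names are illustrative, and real costs also depend on head count, layer count, and kernel implementation.

```python
def attention_score_entries(n: int) -> int:
    """Entries in one full n x n attention score matrix (one head, one layer)."""
    return n * n


def scan_state_updates(n: int) -> int:
    """State updates performed by a recurrent scan over n tokens."""
    return n


for n in (1_000, 10_000, 100_000):
    entries = attention_score_entries(n)
    updates = scan_state_updates(n)
    # The ratio between the two grows linearly with n.
    print(f"n={n:>7,}  attention entries={entries:>15,}  scan updates={updates:>7,}  ratio={entries // updates:,}")
```

Going from 1K to 100K tokens multiplies the scan's work by 100 but the attention matrix by 10,000.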
Key Ideas
1. Selective State Space Model (Selective SSM)
- Instead of fixed A, B, C matrices → make B, C, and Δ (step size) input-dependent
- At each step, the model decides: “Is this token important? Remember it longer (slow decay). Or is it noise? Forget it fast (fast decay).”
- Decay is set by the discretized factor exp(ΔA); A itself is learned but not input-dependent, so Mamba adjusts the decay per token through Δ
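The selective recurrence can be sketched in a few lines of NumPy. This is a readable reference loop under stated assumptions, not Mamba's fused kernel: A is diagonal and negative, the discretization is simplified to Ā = exp(ΔA) and B̄ = ΔB, and the projection weights `W_B`, `W_C`, `W_delta`, `b_delta` are hypothetical names chosen for illustration.

```python
import numpy as np


def softplus(z):
    """Numerically stable softplus; keeps the step size positive."""
    return np.logaddexp(0.0, z)


def selective_ssm(x, A, W_B, W_C, W_delta, b_delta):
    """Sequential selective SSM.

    x: (seq_len, D) inputs; A: (D, N) negative diagonal entries per channel;
    W_B, W_C: (D, N) projections; W_delta: (D, D); b_delta: (D,).
    """
    seq_len, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))            # fixed-size hidden state
    y = np.empty((seq_len, D))
    for t in range(seq_len):
        delta = softplus(x[t] @ W_delta + b_delta)   # (D,) input-dependent step size
        B = x[t] @ W_B                               # (N,) input-dependent B
        C = x[t] @ W_C                               # (N,) input-dependent C
        A_bar = np.exp(delta[:, None] * A)           # (D, N) per-token decay in (0, 1)
        B_bar = delta[:, None] * B[None, :]          # (D, N) simplified discretization
        h = A_bar * h + B_bar * x[t][:, None]        # state update
        y[t] = h @ C                                 # (D,) readout
    return y


rng = np.random.default_rng(0)
seq_len, D, N = 6, 4, 3
x = rng.standard_normal((seq_len, D))
A = -np.exp(rng.standard_normal((D, N)))             # strictly negative, so decay < 1
W_B = 0.1 * rng.standard_normal((D, N))
W_C = 0.1 * rng.standard_normal((D, N))
W_delta = 0.1 * rng.standard_normal((D, D))
b_delta = np.zeros(D)
y = selective_ssm(x, A, W_B, W_C, W_delta, b_delta)
print(y.shape)
```

The state `h` has shape (D, N) no matter how long the sequence is; only Δ, B, and C change from token to token.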
2. Content-Aware Memory
- Mamba’s hidden state is a summary that selectively compresses information
- Important facts: slow decay (discretized factor exp(ΔA) close to 1)
- Noise: fast decay (discretized factor exp(ΔA) close to 0)
- Result: better use of fixed-size state vs. fixed-rate SSMs
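A quick numeric sketch of how the step size controls retention, assuming a single diagonal entry a = -1 of A (an illustrative value, not one taken from a trained model). The per-step retention factor is exp(Δ·a), so a small Δ keeps almost all of the previous state while a large Δ wipes most of it.

```python
import numpy as np


def retention(delta: float, a: float = -1.0) -> float:
    """Fraction of the previous state that survives one step: exp(delta * a)."""
    return float(np.exp(delta * a))


def half_life_steps(delta: float, a: float = -1.0) -> float:
    """Steps until the state halves, if delta stayed constant."""
    return float(np.log(2.0) / (delta * -a))


for delta in (0.01, 0.1, 1.0, 5.0):
    print(f"delta={delta:<5} retention per step={retention(delta):.4f}  half-life={half_life_steps(delta):.2f} steps")
```

Because Δ is recomputed from each token, the same state can hold one fact for hundreds of steps and drop another almost immediately.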
3. Hardware-Aware Algorithm
- Uses a parallel scan (Blelloch scan) during training to parallelize the recurrence
- Switches to recurrence mode during inference (O(1) per token)
- Relies on a fused custom CUDA kernel that keeps the expanded state in fast on-chip memory; a naive implementation in plain PyTorch ops is much slower
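The reason a recurrence can be parallelized at all is that the update h_t = a_t·h_{t-1} + b_t composes associatively: applying (a₁, b₁) and then (a₂, b₂) equals applying (a₁a₂, a₂b₁ + b₂). The sketch below checks this with a simple Hillis-Steele-style scan in NumPy. It is not the work-efficient Blelloch variant and not Mamba's CUDA kernel; it does O(n log n) work in log₂(n) vectorized rounds purely to show the idea on scalar inputs.

```python
import numpy as np


def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h = 0.0
    out = np.empty(len(a))
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out[t] = h
    return out


def scan_parallel(a, b):
    """Inclusive scan using the associative combine
    (a_l, b_l) then (a_r, b_r) -> (a_l * a_r, a_r * b_l + b_r)."""
    a = np.array(a, dtype=float)
    b = np.array(b, dtype=float)
    n = len(a)
    shift = 1
    while shift < n:
        # Each position t >= shift absorbs the partial result ending at t - shift.
        new_b = a[shift:] * b[:-shift] + b[shift:]
        new_a = a[shift:] * a[:-shift]
        a[shift:] = new_a
        b[shift:] = new_b
        shift *= 2
    return b  # with h_{-1} = 0, the accumulated b is h_t


rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, size=13)   # decay factors
b = rng.standard_normal(13)          # input contributions
print(np.allclose(scan_sequential(a, b), scan_parallel(a, b)))
```

Every round updates all positions at once, which is the part a GPU can spread across threads.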
4. Linear Recurrence at Inference
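- Training: O(n) work via a parallel scan with O(log n) depth; the FFT convolution used by S4 no longer applies once the parameters are input-dependent
- Inference: O(1) compute per generated token and a fixed-size state, independent of sequence length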
- No KV cache needed; state stays constant size
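A back-of-the-envelope comparison of the two memory footprints. All model dimensions here are assumptions for illustration (32 layers, d_model = 4096, 16-bit values, full multi-head attention without grouped-query sharing, and Mamba-style defaults expand = 2 and d_state = 16); the small convolution state in each Mamba block is ignored.

```python
def kv_cache_bytes(n_tokens, n_layers=32, d_model=4096, bytes_per_value=2):
    """Keys + values for every cached token in every layer (assumed dimensions)."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value


def ssm_state_bytes(n_layers=32, d_model=4096, expand=2, d_state=16, bytes_per_value=2):
    """Recurrent state per layer: (expand * d_model) channels x d_state values."""
    return n_layers * (expand * d_model) * d_state * bytes_per_value


GIB = 1024 ** 3
for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  KV cache={kv_cache_bytes(n) / GIB:9.2f} GiB  SSM state={ssm_state_bytes() / GIB:.4f} GiB")
```

Under these assumptions the KV cache grows by a fixed amount per token, while the SSM state stays the same size at any context length.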
Key Numbers
| Metric | Value |
|---|---|
| Inference throughput vs. similar-size Transformer | Up to 5× higher (reported) |
| Mamba state memory at 1M tokens | O(1), constant |
| Transformer KV-cache memory at 1M tokens | O(n), grows with context |
| Language modeling (original paper, models up to 2.8B) | Matches or beats same-size Transformer |
| HumanEval (code) | 57% (Mamba) vs 55% (Transformer) |
| Largest open Mamba model | 7B parameters (as of 2025) |
| Training time | Similar to Transformer (not faster) |
Indian Analogy Recap
Picture a student processing a long river of information. Instead of stopping at every word to compare it to every previous word (O(n²), exhausting), the student learns to say either “I’ll remember this concept deeply (important for exams)” or “I’ll skim over this filler (can forget).”
Result: Same understanding, much faster processing.
What Came Next
Mamba-2 (Dao & Gu, 2024)
- Theoretical reformulation connecting SSMs to attention
- Core layer reported as 2–8× faster than Mamba’s selective scan
- Deeper understanding of why selective SSMs work
Jamba (AI21 Labs, 2024)
- Hybrid model: interleaves Mamba and Attention blocks, with mixture-of-experts layers
- Billed by AI21 as the first production-grade Mamba-based LLM
- Evidence that hybrids are the practical future
Hybrid Architectures
- StripedHyena (Together AI)
- Upcoming LLaMA variants with Mamba layers
- Future: strategic mixing, not pure Mamba or pure Attention
When to Use Mamba
Use Mamba (or Jamba hybrid):
- Processing very long sequences (10K+ tokens)
- On-device inference (limited memory)
- Real-time systems where latency per token matters
- Building efficient models with small compute budgets
Use Transformers:
- Most NLP tasks (text under 4K tokens)
- Retrieval-heavy problems (“needle in haystack”)
- Vision-language (images + text)
- Fine-tuning on large-scale instruction data
- When you need mature ecosystems (Hugging Face, vLLM, etc.)
Limitations Worth Knowing
- Weaker in-context recall: a fixed-size state can’t look up an exact earlier token (say, position 5000) the way Attention can
- Sequential state updates: generation advances the recurrence one token at a time, as with any autoregressive model, and the recurrence parallelizes only through a scan
- Training not faster: the speed gain shows up mainly at inference on long sequences
- Limited ecosystem: fewer libraries, tools, and pre-trained models than Transformers
- Unproven at scale: largest Mamba models are 7B; Transformers scale to 700B+
The Verdict
Mamba is not a Transformer killer. It’s a complementary approach that proved:
- ✓ Linear-time sequence modeling works
- ✓ Selectivity beats inflexibility
- ✓ Hybrid architectures are the practical future
- ✗ But it’s not universally better; context-specific trade-offs matter
In 2025, the field is settling on hybrid models (Mamba + Attention blocks) for balanced efficiency and capability. Pure Mamba shines in niche (but important) use cases.