Paper 21

Further Reading — Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Original Paper

  • “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” — Albert Gu, Tri Dao (2023)
    arXiv:2312.00752 — Core paper: selective SSM design, discretisation, hardware algorithms, language modeling benchmarks.
    https://arxiv.org/abs/2312.00752
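
To make "selective SSM design" and "discretisation" concrete, here is a minimal sketch of a single-channel selective SSM with a diagonal state matrix, using the zero-order-hold rule the paper describes. It is a toy under stated assumptions, not the authors' implementation: the scalar weights `w_B`, `w_C`, `w_dt`, and `b_dt` are hypothetical stand-ins for the learned projections, and the loop is a plain sequential scan rather than the paper's hardware-aware kernel.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps the step size positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_ssm(xs, A, w_B, w_C, w_dt, b_dt):
    """Toy single-channel selective SSM with a diagonal state matrix.

    xs   : list of scalar inputs x_t
    A    : list of negative floats (diagonal of the continuous-time A)
    w_B, w_C, w_dt, b_dt : hypothetical projection weights (assumptions)
           that make B_t, C_t and the step size dt_t depend on the input.
    """
    h = [0.0] * len(A)
    ys = []
    for x in xs:
        dt = softplus(w_dt * x + b_dt)       # input-dependent step size
        B = [w * x for w in w_B]             # input-dependent B_t
        C = [w * x for w in w_C]             # input-dependent C_t
        for i, a in enumerate(A):
            a_bar = math.exp(dt * a)         # zero-order hold: exp(dt * A)
            b_bar = (a_bar - 1.0) / a * B[i]  # zero-order hold for B
            h[i] = a_bar * h[i] + b_bar * x  # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append(sum(c * hi for c, hi in zip(C, h)))  # y_t = C_t . h_t
    return ys
```

In this sketch a step size near zero leaves the state almost unchanged (`a_bar` close to 1, `b_bar` close to 0), while a large step size lets the current input overwrite it. That input-dependent gate is what "selective" refers to.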

Essential Follow-Ups

Mamba 2 & Theoretical Foundations

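  • “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality” (Mamba-2) — Dao & Gu (2024)
    arXiv:2405.21060 — Theoretical reformulation showing that selective SSMs and a masked form of linear attention compute the same structured-matrix transformation. The paper reports that the resulting SSD layer is 2–8× faster than Mamba’s selective scan. Read this for the theory that links SSMs to attention.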
    https://arxiv.org/abs/2405.21060
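
The duality can be checked numerically on a toy example. The sketch below assumes the simplified setting of one scalar decay `a[t]` per step and a small vector state; it computes the same outputs once as a linear-time recurrence and once as a quadratic, attention-like sum over earlier positions, where each score `C[i] . B[j]` is weighted by the product of decays between `j` and `i`. Function and variable names are illustrative, not taken from the paper's code.

```python
def ssm_recurrent(a, B, C, x):
    """Linear-time view: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    n = len(B[0])
    h = [0.0] * n
    ys = []
    for t in range(len(x)):
        h = [a[t] * h[k] + B[t][k] * x[t] for k in range(n)]
        ys.append(sum(C[t][k] * h[k] for k in range(n)))
    return ys


def ssm_matrix_form(a, B, C, x):
    """Quadratic view: y_i = sum over j <= i of decay(i, j) * (C_i . B_j) * x_j."""
    ys = []
    for i in range(len(x)):
        total = 0.0
        decay = 1.0  # product a[j+1] * ... * a[i]; empty product when j == i
        for j in range(i, -1, -1):
            score = sum(c * b for c, b in zip(C[i], B[j]))  # attention-like score
            total += decay * score * x[j]
            decay *= a[j]  # extend the product for the next, earlier j
        ys.append(total)
    return ys
```

Both functions return the same sequence up to floating-point error. The first costs O(T) in sequence length, the second O(T^2); choosing between them, or blending them block by block, is the design space the paper explores.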

Predecessor: Structured State Spaces (S4)

  • “Efficiently Modeling Long Sequences with Structured State Spaces” (S4) — Gu et al. (2021)
    arXiv:2111.00396 — The parent architecture that Mamba improves upon. A time-invariant (not selective) SSM with HiPPO-based structure: its parameters do not depend on the input. Read this to understand the evolution.
    https://arxiv.org/abs/2111.00396
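
Parallel Work: Linear-Time RNNs

  • “RWKV: Reinventing RNNs for the Transformer Era” — Peng et al. (2023)
    arXiv:2305.13048 — Parallel line of work: an RNN with linear-time inference that can still be trained in parallel like a Transformer. The mechanism differs from Mamba’s, but the goal is the same: O(n) cost in sequence length. Shows that multiple paths exist.
    https://github.com/BlinkDL/RWKV-LM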

Hybrid Models in Practice

  • “Jamba: A Hybrid Transformer-Mamba Language Model” — AI21 Labs (2024)
    arXiv:2403.19887 — A production-grade hybrid that interleaves Mamba and attention layers and adds mixture-of-experts (MoE) blocks. AI21 presented it as the first production-grade Mamba-based model. Study this for real-world lessons.
    https://huggingface.co/ai21labs/Jamba-v0.1

Explainers & Blog Posts

  • “Mamba: The Hard Way” (Annotated Mamba) — Sasha Rush
    In the spirit of the classic “Annotated Transformer”; builds up the Mamba scan step by step in code.

  • SSM blog posts and visual guides — Various authors
    Accessible tutorials exist on Substack, Medium, and personal blogs. Maarten Grootendorst’s “A Visual Guide to Mamba and State Space Models” is a good starting point.

  • “Parallel Scan Algorithms” — CS literature
    If you want to understand the training algorithm deeply, papers on prefix scans and work-efficient parallel algorithms are illuminating, starting with Blelloch’s “Prefix Sums and Their Applications” (1990).
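
The reason a scan applies at all is that the recurrence h_t = a_t * h_{t-1} + b_t can be written with an associative combine operator on (a, b) pairs. The sketch below is a log-depth scan in the Hillis-Steele style, simulated sequentially in plain Python; it is neither the work-efficient Blelloch variant nor Mamba's fused GPU kernel, and it assumes a zero initial state. All names are illustrative.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan_log_depth(elems):
    """Inclusive scan in ceil(log2(n)) rounds.

    Within a round, every position reads only the previous round's values,
    so on parallel hardware all positions could update at the same time.
    """
    out = list(elems)
    step = 1
    while step < len(out):
        out = [
            out[i] if i < step else combine(out[i - step], out[i])
            for i in range(len(out))
        ]
        step *= 2
    return out


def linear_recurrence_sequential(a, b):
    """Reference: the same recurrence computed one step at a time."""
    h = 0.0
    hs = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        hs.append(h)
    return hs
```

With a zero initial state, the second component of each scanned pair equals h_t, so `[p[1] for p in scan_log_depth(list(zip(a, b)))]` matches `linear_recurrence_sequential(a, b)` up to rounding.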


Code & Implementation

  • Official Mamba Repository — Gu & Dao
    Reference implementation with training and inference code.
    https://github.com/state-spaces/mamba

  • Mamba 2 Code — Same repository
    The Mamba-2 layer and its SSD kernels ship alongside the original Mamba code, with improvements to the kernel and algorithm.

  • Jamba (HuggingFace)
    Pre-trained weights, inference examples, fine-tuning guides.
    https://huggingface.co/ai21labs/Jamba-v0.1

  • Mamba-in-a-Nutshell (Educational)
    Simplified implementations for understanding, such as the community mamba-minimal project; not production-ready.


Foundational SSM Theory

  • State-space theory in classical control — Textbooks on linear systems
    Mamba borrows from decades of control theory. For deep understanding, read a textbook on linear systems (e.g., Kailath’s “Linear Systems”).

  • Signal processing and state space models — Rigorous mathematical foundation
    Mamba’s framework is rooted in signal processing. Papers on Kalman filtering and stochastic control are relevant.


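Related Work on Efficient Attention

  • “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” — Dao et al. (2022)
    arXiv:2205.14135 — Not about SSMs: an IO-aware kernel that makes exact attention faster and less memory-hungry. Complementary to Mamba; some hybrid models use both.
    https://arxiv.org/abs/2205.14135

  • “Generating Long Sequences with Sparse Transformers” — Child et al. (2019)
    arXiv:1904.10509 — Replaces dense attention with sparse attention patterns. A different approach from Mamba, aimed at the same goal of cheaper long-sequence modeling.
    https://arxiv.org/abs/1904.10509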


Benchmarks & Evaluation

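  • Language Modeling (Chinchilla-style scaling, models up to about 2.8B parameters)
    Mamba matches or beats Transformer baselines of the same size; the paper reports that its roughly 3B model matches Transformers twice its size. See the paper’s Section 4.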

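  • HumanEval (Code generation) — Mamba: 57%, Transformer: 55% (small edge). These figures do not appear in the original paper, which does not evaluate on HumanEval; treat them as unverified.

  • MMLU (Knowledge) — Transformer typically wins in follow-up comparisons; not reported in the original paper

  • GSM8K (Math) — Mixed results in follow-up work, depending on model size; not reported in the original paper

  • SuperGLUE — Not reported in the original paper, whose downstream language evaluations are zero-shot (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande)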


Open Questions & Research Directions

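  1. Does Mamba scale to 70B+? Unknown. Public pure-Mamba models top out at around 7B parameters (for example, Falcon Mamba 7B and Codestral Mamba); nothing near 70B has been released (as of 2025).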

  2. Can we combine Mamba with LoRA for efficient fine-tuning? Likely yes; Jamba supports this.

  3. How does Mamba handle multi-modal (image + text)? Early explorations; not yet clear.

  4. Is pure Mamba or hybrid (Mamba+Attention) the future? Consensus: hybrid seems to win in practice.

  5. Can SSM ideas improve attention (and vice versa)? Yes — Mamba 2 shows SSMs and attention are dual. Expect more cross-pollination.


Practical Guides

  • “Deploying Mamba Models” — How to serve Mamba efficiently
    Memory-efficient inference, streaming generation, batching strategies.

  • “Fine-tuning Mamba on Custom Data” — HuggingFace tutorials
    LoRA, full fine-tune, prompt engineering — lessons from Jamba.

  • “Comparing Mamba vs Transformer for Your Use Case” — Decision tree
    Long sequence? Use Mamba. In-context recall? Use Transformer. Uncertain? Use hybrid.
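
The "long sequence" rule of thumb above comes down to how inference memory scales. As a rough sketch, the functions below compare a Transformer's per-sequence KV cache, which grows with every generated token, against a Mamba-style recurrent state, which stays fixed. The formulas are simplified (no batching, no weights or activations), the defaults `expand=2`, `d_state=16`, `d_conv=4` are assumed to mirror the reference implementation's defaults, and the model shapes in the usage note are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys and values cached for every past token in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def mamba_state_bytes(n_layers, d_model, expand=2, d_state=16, d_conv=4,
                      bytes_per_elem=2):
    """Per-layer SSM state plus short-convolution buffer.

    Independent of sequence length. Assumes the state is stored at
    bytes_per_elem; real deployments may keep it in higher precision.
    """
    d_inner = expand * d_model
    return n_layers * d_inner * (d_state + d_conv) * bytes_per_elem


def crossover_length(n_layers, n_kv_heads, head_dim, mamba_bytes,
                     bytes_per_elem=2):
    """Smallest sequence length at which the KV cache exceeds mamba_bytes."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return mamba_bytes // per_token + 1
```

For a hypothetical 32-layer Transformer with 32 KV heads of dimension 128, `kv_cache_bytes(32, 32, 128, 1024)` is 512 MiB at 16-bit precision and doubles each time the context doubles. A hypothetical 64-layer Mamba with `d_model=2560` needs `mamba_state_bytes(64, 2560)`, about 12.5 MiB, at any context length.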


Community & Ecosystem

  • GitHub (mamba-ssm, Jamba, etc.)
    Community implementations, fine-tuned variants, applications.

  • Hugging Face Model Hub
    Official Mamba checkpoints (130M to 2.8B parameters), Jamba, and variants. Community fine-tunes.

  • arXiv & Papers With Code
    Tracking papers that cite or build on Mamba.


Theoretical Path

  1. S4 paper (Gu et al., 2021)
  2. Original Mamba paper (Gu & Dao, 2023)
  3. Mamba 2 / State Space Duality (Dao & Gu, 2024)
  4. Control theory & signal processing textbooks (optional, advanced)

Practical Path

  1. Original Mamba paper (focus on “The Idea” section)
  2. Jamba paper (see how to deploy in production)
  3. HuggingFace Jamba tutorials
  4. Fine-tune on your data

Comparison Path

  1. Flash Attention (Dao et al., 2022) — efficient Transformers
  2. Mamba paper
  3. Jamba paper
  4. Decide: pure Transformer, pure Mamba, or hybrid?

Breadth Path

  1. Transformer (Paper 08) — the baseline
  2. Mamba (this paper) — linear-time alternative
  3. RWKV — another alternative
  4. Jamba — practical hybrid
  5. Understand the trade-offs

Quotes to Remember

“Transformers are not the only way to model sequences.” — Implied by Mamba’s results

“Selectivity (remembering what matters) beats flexibility (attending to everything).” — Core insight of Mamba

“The future is hybrid architectures, not pure Mamba or pure Attention.” — Emerging consensus, 2024–2025


Key Takeaway

Mamba doesn’t replace Transformers. It:

  1. Showed that linear-time, O(n), sequence modeling is viable at language-model scale
  2. Inspired theoretical unification work (Mamba 2)
  3. Catalyzed practical hybrids (Jamba)
  4. Shifted research conversation from “What’s best?” to “What trade-offs suit my task?”

In practice, hybrid models are emerging as the sweet spot. Pure Mamba shines in niche use cases (very long sequences, on-device); Transformers remain the standard for most tasks; hybrids balance both.

