Researchers have unveiled Mamba-3, a significant evolution in state space models (SSMs) that shifts the optimization focus squarely onto inference efficiency. This marks a departure from its predecessor, Mamba-2, which prioritized training speed. The latest iteration aims to tackle the growing demand for faster LLM deployment and agentic workflows.
Developed through a collaboration between Carnegie Mellon University, Princeton University, Cartesia AI, and Together AI, Mamba-3 introduces a more expressive recurrence formula, complex-valued state tracking, and a multi-input, multi-output (MIMO) variant. These enhancements reportedly boost accuracy without compromising decoding speed.
At the 1.5 billion parameter scale, Mamba-3's single-input, single-output (SISO) version achieves lower prefill and decode latency than Mamba-2, Gated DeltaNet, and even the Transformer Llama-3.2-1B across various sequence lengths. The team has also open-sourced the underlying kernels, built with Triton, TileLang, and CuTe for strong hardware performance.
Shifting Gears: From Training to Inference
The LLM landscape is increasingly centered on post-training optimization and deployment, areas heavily reliant on inference speed. While Mamba-2's training efficiency gains led to broad adoption, the subsequent focus on applications like reinforcement learning and agentic workflows has amplified inference demands.
Many existing linear architectures, including Mamba-2, were designed with training as the primary bottleneck. Simplifying the SSM mechanism for faster pretraining, however, left the inference-time recurrence overly simple and memory-bound. Mamba-3 seeks to bridge this gap by optimizing for the quality-efficiency frontier.
Architectural Innovations in Mamba-3
Mamba-3 addresses the inherent challenge of compressing all past information into a fixed-size state—a core limitation compared to Transformers' growing KV cache. It pulls three key levers: making the recurrence more expressive, employing a richer transition matrix, and incorporating more parallel computation within each update.
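The fixed-size state at the heart of this trade-off is easiest to see in the basic SSM recurrence itself: the state is updated in place at every timestep and never grows with sequence length, unlike a KV cache. Below is a minimal NumPy sketch of a diagonal-transition SSM scan; all shapes and names are illustrative and not taken from the Mamba-3 paper or its kernels.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal SSM recurrence (illustrative only):
        h_t = A * h_{t-1} + B * x_t     (state update)
        y_t = C . h_t                   (readout)
    x: (T,) input sequence; A, B, C: (N,) per-channel parameters.
    The state h stays size N regardless of sequence length T.
    """
    T, N = x.shape[0], A.shape[0]
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = A * h + B * x[t]  # compress all history into a fixed-size state
        y[t] = C @ h          # project the state to an output
    return y

# Toy run: constant input, decaying transition.
out = ssm_scan(np.ones(4), np.full(3, 0.5), np.ones(3), np.ones(3) / 3)
print(out)  # → [1.    1.5   1.75  1.875]
```

Making this recurrence "more expressive" means enriching the transition `A` (e.g. complex-valued eigenvalues) and the per-step update, while keeping the fixed-size state that gives linear models their inference advantage.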
Key architectural changes include the addition of QKNorm (or BCNorm) for training stabilization, replacing the optional RMSNorm from Mamba-2. The short causal convolution, a staple in earlier Mamba versions, has been removed, with its functionality implicitly handled by the new discretization-based recurrence and BC bias.
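The QKNorm/BCNorm idea is borrowed from attention stabilization: normalize the projections that play the roles of queries and keys (here, the SSM's B and C projections) before they enter the recurrence, so their scale cannot blow up during training. A hedged sketch using RMS normalization; the values and the exact placement of the norm are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    """RMS-normalize a vector: v / sqrt(mean(v^2) + eps)."""
    return v / np.sqrt(np.mean(v ** 2) + eps)

# Illustrative "BCNorm": control the scale of the input (B) and output (C)
# projections before the SSM recurrence, mirroring QKNorm in attention.
B = np.array([10.0, -20.0, 5.0])   # made-up unnormalized input projection
C = np.array([0.1, 0.3, -0.2])     # made-up unnormalized output projection
B_n, C_n = rms_norm(B), rms_norm(C)

print(np.sqrt(np.mean(B_n ** 2)))  # ~1.0: RMS is pinned near one
```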
Furthermore, Mamba-3 integrates RoPE modules to express complex-valued SSMs efficiently and MIMO projections to support the multi-input, multi-output variants. These components, along with interleaved MLP layers, bring the architecture in line with contemporary language models.
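The connection between RoPE and complex-valued SSMs rests on a standard identity: multiplying a complex state by a unit-magnitude eigenvalue e^{iθ} is exactly a 2D rotation of its (real, imaginary) coordinate pair, which is what a rotary module applies to real-valued features. A minimal sketch of that equivalence (purely illustrative, not Mamba-3's kernel code):

```python
import numpy as np

def rotate(pair, theta):
    """RoPE-style 2D rotation of a (real, imag) pair,
    equivalent to multiplying a complex number by e^{i*theta}."""
    c, s = np.cos(theta), np.sin(theta)
    x, y = pair
    return np.array([c * x - s * y, s * x + c * y])

theta = 0.3
z = complex(1.0, 2.0)

# Complex-state update: h <- e^{i*theta} * h ...
z_next = np.exp(1j * theta) * z
# ... is exactly a real 2x2 rotation of the (real, imag) coordinates:
pair_next = rotate(np.array([z.real, z.imag]), theta)
assert np.allclose(pair_next, [z_next.real, z_next.imag])
```

This is why rotary machinery lets the model track complex-valued dynamics while keeping all computation real-valued and hardware-friendly.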
Empirical Performance Gains
Evaluations show Mamba-3 surpassing prior linear models such as Mamba-2 and Gated DeltaNet on language modeling tasks. The MIMO variant improves accuracy by more than one percentage point at the 1B scale without increasing decoding latency, though it does lengthen training time.
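The reason MIMO can add accuracy without adding decode latency is that it raises the rank of each state update rather than the size of the state: memory traffic per step is unchanged, while more arithmetic is done per step. A toy NumPy sketch of the contrast; all shapes (N, P, r) are made-up illustrations, not the paper's configuration.

```python
import numpy as np

# SISO vs MIMO update on the same (N, P) state matrix.
# SISO adds a rank-1 term; MIMO adds a rank-r term from r input channels.
# State size (and thus memory traffic at decode time) is identical;
# MIMO performs roughly r times the FLOPs per step.
N, P, r = 8, 16, 4
rng = np.random.default_rng(0)
H = np.zeros((N, P))

# SISO step: outer product of one B vector with one input row.
B1, x1 = rng.standard_normal(N), rng.standard_normal(P)
H_siso = H + np.outer(B1, x1)   # rank-1 update

# MIMO step: B is (N, r), X is (r, P) -- same state shape, more compute.
Br, Xr = rng.standard_normal((N, r)), rng.standard_normal((r, P))
H_mimo = H + Br @ Xr            # rank-r update

assert H_siso.shape == H_mimo.shape == (N, P)
```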
This training-inference dichotomy stems from differing bottlenecks: training is largely compute-bound, while autoregressive decoding is memory-bound. Current linear models exploit GPU tensor cores for fast training, but at inference time each decoding step performs so little computation that the hardware sits underutilized. Mamba-3's design aims to fill that idle compute capacity.
Although linear models inherently lag behind Transformers in retrieval tasks due to their fixed-state nature, Mamba-3 demonstrates strong performance within the sub-quadratic alternative class. The addition of MIMO further aids retrieval without enlarging the state size.
The researchers predict that future hybrid models, combining linear layers with self-attention's KV cache, will become dominant, offering superior performance and efficiency. Understanding the precise interaction between these components remains an active area of research.
