The Transformer architecture powering many of today’s most capable large language models faces a well-documented challenge: its computational demands grow quadratically as input sequences get longer.
Tackling this efficiency hurdle, IBM Research, in partnership with Carnegie Mellon University, Princeton University, and the University of Illinois Urbana-Champaign, has introduced Bamba-9B-v2. This newly released open-source model employs a hybrid design, combining Transformer components with the Mamba2 State-Space Model (SSM) architecture.
Addressing the Transformer Bottleneck
Standard Transformers, first detailed in the 2017 paper “Attention Is All You Need,” owe much of their success to the self-attention mechanism.
This allows the model to weigh the relevance of every token in a sequence against every other token simultaneously. However, this all-to-all comparison means computation grows quadratically with sequence length, while the KV cache storing attention states accumulates entries for every processed token, steadily inflating memory use. This “quadratic bottleneck” makes processing very long contexts increasingly slow and costly, an issue highlighted in industry discussions earlier this year regarding the sustainability of scaling AI models.
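To make the scaling concrete, here is a rough illustration of how the attention score count balloons with context length. The figures are generic and not tied to any particular model:

```python
# How self-attention work grows with context length (illustrative figures only).
for seq_len in (4_096, 32_768, 131_072):
    pairwise_scores = seq_len ** 2  # every token attends to every other token
    print(f"{seq_len:>7} tokens -> {pairwise_scores:.2e} attention scores per layer")
# Going from 4k to 128k tokens (a 32x longer context) means roughly 1,000x more scores.
```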
Hybrid Approach: Merging Transformers and State-Space Models
Bamba incorporates State-Space Models (SSMs), a concept from control theory adapted for deep learning, to mitigate Transformer inefficiencies. The specific variant used is Mamba2, developed by Albert Gu (CMU) and Tri Dao (Princeton).
SSMs utilize a compressed, fixed-size “hidden state” to represent past information, allowing sequence processing potentially in linear time during training (via a convolutional view) and constant time per token during inference (via a recurrent view). Ankit Gupta, an IBM researcher involved in foundational SSM work, noted their traditional role: “They are the bread and butter of electrical engineering — signal processing, robotics, and control theory.”
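The recurrent view is easy to illustrate with a toy example. The sketch below implements a generic linear state-space update in NumPy; it is purely illustrative and not Mamba2's actual selective-scan algorithm:

```python
import numpy as np

# Toy linear state-space model: h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t.
# The hidden state h is a fixed-size summary of everything seen so far, so each
# new token costs the same amount of work regardless of how long the context is.
state_dim, input_dim = 16, 8
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(state_dim, state_dim))
B = rng.normal(size=(state_dim, input_dim))
C = rng.normal(size=(input_dim, state_dim))

def ssm_step(h, x):
    h_new = A @ h + B @ x   # constant-size state update
    y = C @ h_new           # output for this token
    return h_new, y

h = np.zeros(state_dim)
for x in rng.normal(size=(1000, input_dim)):  # a 1,000-token sequence
    h, y = ssm_step(h, x)                     # O(1) memory and work per token
```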
The Bamba architecture strategically interleaves these efficient Mamba2 layers with standard Transformer attention blocks. The goal is to leverage SSMs for handling long-range dependencies efficiently while retaining attention for its strong contextual understanding capabilities.
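Conceptually, the layer plan of such a hybrid stack could look like the sketch below, where most layers use the Mamba2 mixer and a full-attention block appears at a fixed interval. The counts and interval are hypothetical placeholders, not Bamba's published configuration:

```python
# Illustrative layer plan for a hybrid decoder stack (counts are hypothetical).
num_layers, attention_every = 32, 8

layer_plan = [
    "attention" if (i + 1) % attention_every == 0 else "mamba2"
    for i in range(num_layers)
]
print(layer_plan)
# Attention layers keep a growing KV cache; Mamba2 layers carry only a
# fixed-size state, so the overall cache footprint stays small.
```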
IBM’s performance claims for the Bamba-9B-v2 model are promising; an 8-bit quantized version of the model halves its footprint from 18GB to 9GB. The model, trained on 3 trillion tokens, reportedly matches Meta’s Llama 3.1 8B on key benchmarks, despite Llama 3.1’s much larger training dataset (15T+ tokens).
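The memory figures follow directly from the parameter count and numeric precision, ignoring overheads such as embeddings and activation buffers:

```python
params = 9e9                          # roughly nine billion parameters
print(f"{params * 2 / 1e9:.0f} GB")   # 16-bit weights, 2 bytes each -> ~18 GB
print(f"{params * 1 / 1e9:.0f} GB")   # 8-bit weights, 1 byte each   -> ~9 GB
```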
IBM states Bamba currently runs inference 2 to 2.5 times faster than similar-sized Transformers, attributing this primarily to reduced KV cache demands. IBM’s Raghu Ganti, leading the Bamba project, emphasized, “Everything comes back to the KV cache reduction… More throughput, lower latency, longer context length.”
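A back-of-the-envelope comparison shows why replacing most attention layers with SSM layers shrinks the cache. The layer counts and dimensions below are illustrative assumptions, not Bamba's actual architecture:

```python
# KV-cache memory for a full-attention stack vs. a hybrid stack (illustrative
# shapes only): fp16 cache, 32k-token context, 4096 hidden dimension.
seq_len, hidden_dim, bytes_per_value = 32_768, 4096, 2

def kv_cache_gb(num_attention_layers):
    # 2x for keys and values; one hidden_dim vector per token per attention layer.
    return 2 * num_attention_layers * seq_len * hidden_dim * bytes_per_value / 1e9

print(f"all 32 layers use attention:      {kv_cache_gb(32):.1f} GB")
print(f"only 4 attention layers (hybrid): {kv_cache_gb(4):.1f} GB")
```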
Evaluating these speed benefits and potential power consumption differences across diverse real-world scenarios will be an important next step.
An Open Development and Training Process
IBM and its collaborators are releasing Bamba as a fully open model, providing access to model weights, training details, and code via the Hugging Face Bamba collection and the project’s GitHub repository.
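In practice, the checkpoint can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch; the repository id is taken from the Bamba collection and should be verified there, along with any minimum transformers version required for the architecture:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed from the Hugging Face Bamba collection; verify before use.
model_id = "ibm-ai-platform/Bamba-9B-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer(
    "The quadratic bottleneck in Transformers refers to", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```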
The creation of Bamba v2 involved several stages, starting from the initial 2T token Bamba v1 (released around Christmas 2024). First, training was extended to 2.5T tokens using the Olmo Mix dataset. Then, two separate models were trained up to 3T tokens using a custom mix including Nemotron-CC data, each with a different learning rate schedule (constant vs. cosine decay). Finally, both 3T models were “annealed” on 100B high-quality tokens before being merged using MergeKit’s weighted averaging.
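The merge step itself boils down to a weighted average of the two checkpoints' parameters. The following sketch shows the idea with generic PyTorch state dicts; the file names are hypothetical, and this is not MergeKit's actual API or the exact weighting IBM applied:

```python
import torch

def merge_state_dicts(sd_a, sd_b, weight_a=0.5):
    """Weighted average of two checkpoints with identical architectures."""
    return {
        name: weight_a * sd_a[name] + (1.0 - weight_a) * sd_b[name]
        for name in sd_a
    }

# Hypothetical file names for the two 3T-token runs (constant vs. cosine schedule).
sd_constant = torch.load("bamba_3t_constant_lr.pt", map_location="cpu")
sd_cosine = torch.load("bamba_3t_cosine_lr.pt", map_location="cpu")

merged = merge_state_dicts(sd_constant, sd_cosine, weight_a=0.5)
torch.save(merged, "bamba_9b_v2_merged.pt")
```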
Optimizing inference performance remains a key focus. The team is actively working with the vLLM community to enhance support for Mamba2’s state management, which differs from standard KV caching.
Tyler Smith, a technical staff member at Red Hat and vLLM committer involved in the effort, noted, “SSMs are difficult to support, because you need bespoke state management.” Future improvements target chunked prefill and faster custom decode kernels, potentially boosting Bamba’s speed advantage to 4-5x over traditional Transformers. The team invites the open-source community to contribute, particularly on testing long-context scaling and improving mathematical performance.
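Once that support matures, serving the model through vLLM should look much like serving any other checkpoint. The sketch below assumes a vLLM build with hybrid Mamba2 support and the same Hugging Face repository id used above:

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM version with support for this hybrid Mamba2/attention checkpoint.
llm = LLM(model="ibm-ai-platform/Bamba-9B-v2")
params = SamplingParams(max_tokens=128, temperature=0.7)

outputs = llm.generate(
    ["Explain why a smaller KV cache improves inference throughput."], params
)
print(outputs[0].outputs[0].text)
```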
The Bamba architecture represents more than just a research exploration. IBM has confirmed that key features from the Bamba project will be incorporated into its upcoming IBM Granite 4.0 enterprise models, set for release in the coming months. This planned integration highlights the growing industry interest in hybrid AI architectures as a practical path towards more efficient and scalable language models capable of handling the increasingly long context demands of modern AI applications.