Microsoft Releases BitNet b1.58 2B4T, a 1.58-Bit AI Model That Runs on Standard CPUs

BitNet b1.58 2B4T from Microsoft Research aims for efficient AI use on CPUs with a native 1.58-bit architecture and custom frameworks.

Microsoft researchers have put a new contender into the AI arena with BitNet b1.58 2B4T, an open-source large language model operating with extremely low-precision weights. What sets this 2-billion parameter model apart is that it was trained natively using a 1.58-bit architecture, rather than being quantized after training.

The promise, according to its technical report, is performance comparable to conventional models of similar size but with drastically reduced computational demands.

The core claim revolves around efficiency. While many LLMs require hefty hardware, Microsoft suggests BitNet b1.58 2B4T, trained on 4 trillion tokens, can operate effectively even on standard CPUs. Their technical report highlights a non-embedding memory footprint of just 0.4GB, a sharp contrast to figures ranging from 1.4GB (Gemma-3 1B) to 4.8GB (MiniCPM 2B) for competitors.

Furthermore, Microsoft estimates its energy consumption per token is significantly lower (0.028 Joules vs. a range of 0.186J to 0.649J for others) and claims faster CPU decoding latency (29 milliseconds per token vs. 41ms-124ms) when run using its specialized framework on test hardware (an Intel Core i7-13800H).

Under the Hood: The BitNet Approach

How does BitNet achieve this purported efficiency? Its architecture swaps standard linear layers for custom BitLinear layers that employ aggressive quantization during training. Instead of typical 16-bit numbers, the model’s weights are constrained during the forward pass to just three possible values: -1, 0, or +1.

This ternary (three-state) system, using an “absmean” quantization technique, theoretically requires only ~1.58 bits of information per weight (from log₂(3) ≈ 1.58). Microsoft argues, based on research presented in the original BitNet paper, that this “native 1-bit” training approach sidesteps the performance losses often associated with compressing models after they have been trained (post-training quantization, or PTQ).
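
To make the idea concrete, here is a minimal NumPy sketch of absmean ternarization as the papers describe it; the function name and epsilon value are illustrative, not Microsoft’s actual implementation.

```python
import numpy as np

def absmean_quantize_weights(w: np.ndarray, eps: float = 1e-5):
    """Ternarize a weight matrix with absmean scaling (illustrative sketch)."""
    # Scale by the mean absolute weight value, as described for BitNet b1.58.
    scale = np.mean(np.abs(w)) + eps
    # Round to the nearest integer, then clip to the ternary set {-1, 0, +1}.
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q, s = absmean_quantize_weights(w)
print(w_q)  # entries are only -1.0, 0.0, or 1.0
print(s)    # one scale per matrix; at inference, w is approximated by s * w_q
```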

Alongside the ternary weights, the values passed between layers (activations) are quantized to 8-bit integers using a per-token “absmax” method, a configuration known as W1.58A8 (1.58-bit weights, 8-bit activations). The model architecture is Transformer-based but incorporates adjustments suited to this low-bit regime: it replaces SwiGLU with squared ReLU (ReLU²) activation functions, employs standard Rotary Position Embeddings (RoPE) for positional data, uses subln normalization (cited for stability benefits in quantized training), and omits bias terms in its layers. Tokenization relies on the Llama 3 tokenizer.
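
A similarly hedged sketch of the per-token absmax activation path, again with illustrative names and simplified details:

```python
import numpy as np

def absmax_quantize_activations(x: np.ndarray, bits: int = 8, eps: float = 1e-5):
    """Per-token absmax quantization to signed 8-bit integers (sketch)."""
    q_max = 2 ** (bits - 1) - 1  # 127 for int8
    # One scale per token (row), taken from that token's largest absolute value.
    scale = np.max(np.abs(x), axis=-1, keepdims=True) + eps
    x_q = np.clip(np.round(x / scale * q_max), -q_max, q_max)
    return x_q.astype(np.int8), scale

x = np.random.randn(2, 8).astype(np.float32)  # two tokens, hidden size 8
x_q, s = absmax_quantize_activations(x)
# Dequantize as x ≈ x_q * s / 127; combined with ternary weights this
# is the W1.58A8 arithmetic the article describes.
```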

Training and Performance Claims

Developing BitNet b1.58 2B4T involved three training stages. Pre-training used the 4-trillion-token mix of web data, code, and synthetic math data, with a tailored two-stage learning-rate and weight-decay schedule.

This was followed by supervised fine-tuning (SFT) using public and synthetic instruction datasets (like WizardLM Evol-Instruct and SlimOrca) to teach instruction following. Finally, Direct Preference Optimization (DPO)—a method for preference alignment without needing a separate reward model—was applied using datasets including UltraFeedback to refine its conversational abilities and safety profile.
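
For readers unfamiliar with DPO, a minimal sketch of its objective (following Rafailov et al., 2023) shows why no separate reward model is needed; the beta value below is illustrative and not taken from Microsoft’s report.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective sketch. Inputs are summed log-probabilities of full
    responses under the policy and under a frozen reference model."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written with logaddexp for numerical stability.
    return np.logaddexp(0.0, -margin)
```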

Microsoft’s evaluations, detailed in the technical report, place BitNet b1.58 2B4T competitively against established full-precision 1B-2B parameter models. It reportedly shows stronger results on certain benchmarks like GSM8K (math), PIQA (physical commonsense), and WinoGrande (commonsense), while performing comparably on others.

The report states, “Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency.” It also claims superior performance compared to models subjected to standard INT4 PTQ methods.

The Catch: Getting the Efficiency Gains

Accessing the model’s heralded efficiency improvements isn’t straightforward with standard tools. The Hugging Face model card carries a prominent warning: “Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library… For achieving the efficiency benefits demonstrated in the technical paper, you MUST use the dedicated C++ implementation: bitnet.cpp.”

This is because standard libraries like transformers, and current GPU hardware, lack optimized kernels for the specific W1.58A8 arithmetic BitNet employs. Realizing the efficiency gains requires Microsoft’s dedicated, open-source inference frameworks.
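
For functional experimentation, the model can still be loaded through the standard transformers API, just without the speedups. A minimal sketch follows; the model ID reflects the Hugging Face listing, and any version requirements should be checked against the model card:

```python
# Functional, but NOT efficient: per the model card, the standard
# transformers path gives no speed, latency, or energy benefits.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"  # as listed on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What is 1.58-bit quantization?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```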

For CPUs, the bitnet.cpp GitHub repository provides a C++ library (based on the popular llama.cpp) that uses lookup-table methods, described in a related paper, to deliver the reported gains: speedups of 1.37x to 6.17x and energy reductions of 55% to 82% over other CPU frameworks, depending on the chip (ARM/x86) and model size.
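
The underlying intuition is that ternary weights turn matrix products into additions and subtractions, which lookup tables can then batch over small groups of weights. The toy sketch below shows only the multiplication-free property; bitnet.cpp’s real kernels are far more elaborate.

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Toy illustration: with weights in {-1, 0, +1}, a matrix-vector
    product needs no multiplications, only adds and subtracts."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

w_t = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 3.0], dtype=np.float32)
print(ternary_matvec(w_t, x))  # [-2.5, 1.0]
```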

For GPUs, custom CUDA kernels are needed, involving packing and unpacking weights for computation—a step acknowledging current GPUs aren’t ideal for this type of model. Whether these custom solutions maintain performance and stability across diverse hardware setups will require broader community testing. Microsoft plans future support for NPUs and improved GPU handling within bitnet.cpp.

Availability and Context

Microsoft has made BitNet b1.58 2B4T available on Hugging Face under the permissive MIT License. Users can find the packed 1.58-bit weights for efficient inference, separate BF16 master weights solely for retraining or fine-tuning, and a GGUF-format version for use with bitnet.cpp. The model operates with a 4096-token context window.

This release culminates work that began conceptually with a paper published in February 2024 and continued with the bitnet.cpp framework in October 2024. It marks the research group's first scaled-up, open model built on this native 1-bit training approach (the group's homepage is at https://aka.ms/GeneralAI). Microsoft researchers outlined future plans including training larger BitNet models, exploring hardware co-design, extending context lengths, and adding multilingual features.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
