Microsoft has released a new open-source library called DeepSpeed which, when combined with its ZeRO module, can train models with 100 billion or more parameters without the hardware resources traditionally required at that scale.
“DeepSpeed is compatible with PyTorch. One piece of that library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained,” explained the company. “Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), the largest publicly known language model at 17 billion parameters.”
With the release, Microsoft hopes to help AI developers gain the increased accuracy that can be had through training larger models. ZeRO achieves its memory savings by partitioning model states across data-parallel processes instead of replicating them on every device. Microsoft says it then uses a dynamic communication schedule to share state across distributed devices.
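The partitioning idea can be illustrated with a minimal sketch. This is plain Python, not DeepSpeed's actual code; the function names are invented for illustration. The point is that with classic data parallelism every worker holds a full copy of the optimizer state, while a ZeRO-style partition gives each worker only its own shard:

```python
# Conceptual sketch only (not DeepSpeed's implementation): partition
# optimizer state across data-parallel workers instead of replicating it.

def partition(state_ids, world_size):
    """Assign each piece of optimizer state to exactly one worker.

    Rank r owns state_ids[r::world_size]; together the shards cover
    every state exactly once, with no replication.
    """
    return [state_ids[r::world_size] for r in range(world_size)]

def per_worker_elems_replicated(num_elems, world_size):
    # Classic data parallelism: every worker holds the full state.
    return num_elems

def per_worker_elems_partitioned(num_elems, world_size):
    # ZeRO-style: each worker holds only ~1/world_size of the state.
    return -(-num_elems // world_size)  # ceiling division

# Ten state tensors split across four workers: each owned exactly once.
shards = partition(list(range(10)), 4)
assert sorted(s for shard in shards for s in shard) == list(range(10))

print(per_worker_elems_replicated(1_000_000, 64))   # 1000000 per worker
print(per_worker_elems_partitioned(1_000_000, 64))  # 15625 per worker
```

The per-worker footprint for this state shrinks roughly linearly with the number of workers, which is what frees memory for larger models or larger batches.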
To test the approach, Microsoft trained a Turing-NLG model with 17 billion parameters. It said the memory savings allowed for 4x less model parallelism and a 4x larger batch size, yielding a 3x throughput gain. With a combination of ZeRO and Nvidia's Megatron-LM, it was able to train with a batch size of 512 on 256 GPUs, rather than the 1,024 GPUs required with Megatron-LM alone.
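In practice, DeepSpeed exposes ZeRO through a JSON configuration passed to its training engine. A minimal sketch, based on the keys documented in current DeepSpeed releases (the exact option names have evolved since the original announcement):

```json
{
  "train_batch_size": 512,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 }
}
```

Training is then launched with the `deepspeed` launcher, and the model and optimizer are wrapped via `deepspeed.initialize(...)` rather than being run as plain PyTorch.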
All of this is obviously quite complicated if you’re not familiar with model training. The bottom line is that Microsoft has created a faster, more efficient way to train AI models, and it’s sharing it with everyone for free. Though it could have kept the techniques internal, CEO Satya Nadella has previously spoken about the need to democratize AI, and this is just one example of that philosophy at play.
Last Updated on March 9, 2025 7:49 pm CET