Microsoft DeepSpeed with ZeRO Can Train 100 Billion Parameter AI Models

Microsoft's DeepSpeed framework and its ZeRO module can significantly decrease the hardware resources and time needed to train an AI model.

Microsoft has released a new open-source library called DeepSpeed which, when combined with its ‘ZeRO’ module, can train 100 billion parameter models without the hardware resources traditionally required at that scale.

“DeepSpeed is compatible with PyTorch. One piece of that library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained,” explained the company. “Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), the largest publicly known language model at 17 billion parameters.”

With the release, Microsoft hopes to help AI developers gain the increased accuracy that can be had through training large models. ZeRO achieves its memory savings by partitioning model states across data-parallel processes instead of replicating them on every device. Microsoft says it then uses a dynamic communication schedule to share the necessary state across distributed devices.
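The partitioning idea can be shown with a toy sketch. This is illustrative only — the real mechanism lives inside DeepSpeed's optimizer, not user code, and the shard/gather helpers below are hypothetical stand-ins for the collective operations a framework would use:

```python
# Toy sketch of ZeRO-style state partitioning (illustrative, not DeepSpeed's API):
# instead of every data-parallel rank holding a full copy of the optimizer
# state, each rank stores only a 1/world_size slice and the full state is
# reassembled (all-gathered) only when needed.

def shard(state, rank, world_size):
    """Give each data-parallel rank a contiguous 1/world_size slice."""
    n = len(state)
    lo = rank * n // world_size
    hi = (rank + 1) * n // world_size
    return state[lo:hi]

def allgather(shards):
    """Reassemble the full state from every rank's slice."""
    full = []
    for s in shards:
        full.extend(s)
    return full

optimizer_state = list(range(8))   # stand-in for per-parameter optimizer state
world_size = 4                     # number of data-parallel ranks
shards = [shard(optimizer_state, r, world_size) for r in range(world_size)]

assert all(len(s) == 2 for s in shards)        # each rank stores only 1/4
assert allgather(shards) == optimizer_state    # nothing is lost, only sharded
```

Each rank's memory cost for that state shrinks by a factor of `world_size`, which is why the savings grow with the number of GPUs.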

To test the approach, Microsoft trained a Turing-NLG model with 17 billion parameters. It said the memory savings allowed for 4x smaller model parallelism and a 4x larger batch size, yielding a 3x throughput gain. With a combination of ZeRO and Nvidia’s Megatron-LM, it was able to train with a batch size of 512 on 256 GPUs, rather than the 1,024 GPUs required with Megatron-LM alone.
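A back-of-the-envelope calculation shows why sharding matters at this scale. The ~16 bytes per parameter figure below is an assumption taken from the common mixed-precision Adam accounting (fp16 parameters and gradients plus fp32 optimizer states), not a number quoted in this article:

```python
# Rough memory estimate for a 17B-parameter model under mixed-precision Adam.
# Assumption: ~16 bytes/parameter total (2 fp16 params + 2 fp16 grads +
# 12 bytes of fp32 optimizer state). Activations are ignored, and in
# practice only some of these states are sharded at each ZeRO stage.

params = 17e9          # Turing-NLG parameter count
bytes_per_param = 16   # assumed mixed-precision Adam footprint

total_gb = params * bytes_per_param / 1e9
per_gpu_sharded_gb = total_gb / 256   # states sharded across 256 GPUs

print(f"fully replicated: {total_gb:.0f} GB per GPU")   # far beyond any single GPU
print(f"fully sharded   : {per_gpu_sharded_gb:.2f} GB per GPU")
```

Fully replicated, the model states alone would need roughly 272 GB on every GPU; sharded across the 256-GPU run described above, the same states fit in about 1 GB per device, leaving room for activations and a larger batch.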

All of this is obviously quite complicated if you’re not familiar with model training. The bottom line is that Microsoft has created a faster, more efficient way to train AI models, and it’s sharing it with everyone for free. Though it could have kept the techniques internal, CEO Satya Nadella has previously spoken about the need to democratize AI, and this is just one example of that philosophy at play.


Last Updated on March 9, 2025 7:49 pm CET

Source: Microsoft
Ryan Maskell – https://ryanmaskell.co.uk
Ryan has had a passion for gaming and technology since early childhood. Fusing the skills from his Creative Writing and Publishing degree with profound technical knowledge, he enjoys covering news about Microsoft. As an avid writer, he is also working on his debut novel.
