Microsoft has released a new open-source library called DeepSpeed which, when combined with its ‘ZeRO’ module, can train models with 100 billion parameters without the hardware resources such scale has traditionally required.
“DeepSpeed is compatible with PyTorch. One piece of that library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained,” explained the company. “Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), the largest publicly known language model at 17 billion parameters.”
With the release, Microsoft hopes to help AI developers gain the increased accuracy that comes from training larger models. ZeRO achieves its savings by partitioning model states across data-parallel processes instead of replicating them on every device. Microsoft says it then uses a dynamic communication schedule to share the necessary state across distributed devices.
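To see why partitioning rather than replicating matters, here is a minimal back-of-the-envelope sketch in plain Python. This is not the DeepSpeed API; the model size, the 12 bytes of optimizer state per parameter, and the worker count are illustrative assumptions.

```python
# Hypothetical sketch (not DeepSpeed code): compare per-device memory for
# optimizer states under standard data parallelism (full replication) versus
# ZeRO-style partitioning (each worker owns only its 1/N shard).

NUM_PARAMS = 100_000_000    # toy model size (assumption)
BYTES_PER_STATE = 12        # e.g. fp32 momentum + variance + master copy (assumption)
NUM_WORKERS = 64            # number of data-parallel processes (assumption)

# Standard data parallelism: every device holds a full copy of the states.
replicated_bytes_per_device = NUM_PARAMS * BYTES_PER_STATE

# ZeRO-style: states are partitioned; each device stores one shard and the
# full state is reassembled on demand via scheduled communication.
partitioned_bytes_per_device = NUM_PARAMS * BYTES_PER_STATE // NUM_WORKERS

print(f"replicated : {replicated_bytes_per_device / 1e9:.2f} GB per device")
print(f"partitioned: {partitioned_bytes_per_device / 1e9:.2f} GB per device")
```

The per-device footprint shrinks linearly with the number of workers, which is what lets the same GPUs hold far larger models.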
To test the approach, Microsoft trained a Turing-NLG model with 17 billion parameters. It said the memory savings allowed for a 4x smaller model-parallelism degree and a 4x larger batch size, yielding a 3x throughput gain. With a combination of ZeRO and Nvidia’s Megatron-LM, it was able to train with a batch size of 512 on 256 GPUs, rather than the 1,024 GPUs required with Megatron-LM alone.
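The GPU counts follow from simple arithmetic: the total number of GPUs is the model-parallel degree times the data-parallel degree. The sketch below assumes a data-parallel degree of 64 and model-parallel degrees of 16 and 4, hypothetical figures chosen only because they are consistent with the 4x reduction and the 1,024-vs-256 GPU counts reported above.

```python
# Illustrative arithmetic (assumed degrees, not published configs):
# total GPUs = model-parallel degree x data-parallel degree.

DATA_PARALLEL = 64          # assumed data-parallel degree

megatron_alone_gpus = 16 * DATA_PARALLEL   # model-parallel degree 16 (assumed)
zero_megatron_gpus = 4 * DATA_PARALLEL     # 4x smaller degree with ZeRO

print(f"Megatron-LM alone : {megatron_alone_gpus} GPUs")
print(f"ZeRO + Megatron-LM: {zero_megatron_gpus} GPUs")
```

Cutting the model-parallel degree by 4x is what shrinks the cluster from 1,024 GPUs to 256 while leaving room for the larger batch.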
All of this is obviously quite complicated if you’re not familiar with model training. The bottom line is that Microsoft has created a faster, more efficient way to train AI models, and it’s sharing it with everyone for free. Though it could have kept the techniques internal, CEO Satya Nadella has previously spoken about the need to democratize AI, and this is just one example of that philosophy at play.