Microsoft took to the stage at the Supercomputing 2019 conference to announce the availability of NDv2, which Nvidia calls the “World's largest GPU-accelerated cloud-based supercomputer”.
Available via Azure, NDv2 can scale up to 800 Nvidia Tensor Core GPUs, which are interconnected with Mellanox InfiniBand. Interested parties can rent an entire supercomputer to meet their needs, with substantial gains in training speed.
With a pre-release version of the cluster, Microsoft and Nvidia managed to train the conversational AI model BERT in around three hours. That doesn't beat Nvidia's previous record of 53 minutes, but for a model with 8.3 billion parameters, customers are still getting very good speeds. Here are the base specs of a single NDv2 VM:
- 8 Nvidia Tesla V100 NVLink GPUs (32GB HBM2 memory each)
- Intel Xeon Platinum 8168 processor with 40 non-hyperthreaded cores
- 672 GiB memory
If needed, researchers can quickly open multiple NDv2 instances to train their AI models even faster. The VMs are currently in preview, alongside new NVv4 VMs from AMD, and NDv3, which features the Graphcore IPU.
“Now you can open up an instance, you grab one of the stacks … in the container, you launch it, on Azure, and you're doing science,” said Jensen Huang, Nvidia CEO. “It's really quite fantastic. This puts a supercomputer in the hands of every scientist in the world.”
Of course, that assumes researchers can afford them. Microsoft hasn't released pricing information yet, but TheNextPlatform says it's heard it'll be $26.44 per NDv2 instance per hour. For reference, renting a single instance around the clock for a year would add up to $231,614. That may well be competitive, but it's no small cost.
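To make the arithmetic behind that yearly figure explicit, here is a rough Python sketch. It assumes the reported (unconfirmed) rate of $26.44 per hour, a 365-day year of continuous use, and no reserved-instance or volume discounts. The `rental_cost` helper and the linear scaling to a full 800-GPU cluster are illustrative assumptions, not published Azure pricing.

```python
# Rough cost estimate based on the reported, unconfirmed hourly rate.
REPORTED_HOURLY_RATE = 26.44   # USD per NDv2 instance per hour (rumored)
HOURS_PER_YEAR = 24 * 365      # assumes a non-leap year of continuous use
GPUS_PER_INSTANCE = 8          # from the NDv2 spec list
MAX_GPUS = 800                 # maximum scale quoted for NDv2


def rental_cost(hourly_rate, hours, instances=1):
    """Return the total cost in USD, assuming simple linear pricing."""
    return hourly_rate * hours * instances


# One instance, running all year.
annual_single = rental_cost(REPORTED_HOURLY_RATE, HOURS_PER_YEAR)
print(f"One instance for a year: ${annual_single:,.0f}")

# Number of instances needed to reach the full 800-GPU scale.
instances_at_max = MAX_GPUS // GPUS_PER_INSTANCE
hourly_full = rental_cost(REPORTED_HOURLY_RATE, 1, instances_at_max)
print(f"Instances at full scale: {instances_at_max}")
print(f"Full scale for one hour (assuming linear pricing): ${hourly_full:,.2f}")
```

Short bursts are where the pay-per-hour model pays off: under the same assumptions, the cost of a training run is just the hourly rate times the number of instances and hours used.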