In December, Microsoft open-sourced its ONNX Runtime inference engine. Now the company says it has also open-sourced an optimized version of BERT, a natural language model from Google, for ONNX.
By using ONNX Runtime with BERT, users can lower latency for language representation tasks on the Bing search platform. Microsoft has previously said BERT brought “the largest improvement in search experience” for Bing.
With ONNX Runtime support, developers can run BERT inference at latencies as low as 1.7 milliseconds (on an Nvidia V100 GPU). Microsoft told VentureBeat that this capability was previously available only to major tech companies.
“Since the BERT model is mainly composed of stacked transformer cells, we optimize each cell by fusing key sub-graphs of multiple elementary operators into single kernels for both CPU and GPU, including Self-Attention, LayerNormalization and Gelu layers. This significantly reduces memory copy between numerous elementary computations,” Microsoft senior program manager Emma Ning said today in a blog post.
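To see what that fusion buys, it helps to look at how the named layers decompose. The sketch below is an illustration in NumPy, not Microsoft's actual kernels: Gelu (here, the common tanh approximation) and LayerNormalization each break down into a chain of elementary operators, and each intermediate result in that chain is normally written to and read back from memory, which is the copy overhead fusion into a single kernel eliminates.

```python
import numpy as np

def gelu(x):
    # Gelu via the tanh approximation: a chain of elementary ops
    # (pow, mul, add, tanh, mul) that a fused kernel computes in one pass.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, gamma, beta, eps=1e-5):
    # LayerNormalization decomposes into mean, subtract, variance,
    # sqrt, divide, scale, shift -- seven elementary ops, each of which
    # would otherwise materialize an intermediate tensor in memory.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# A toy activation tensor standing in for a transformer cell's output.
x = np.random.randn(2, 4).astype(np.float32)
y = layer_norm(gelu(x),
               gamma=np.ones(4, dtype=np.float32),
               beta=np.zeros(4, dtype=np.float32))
```

Unfused, the two calls above materialize roughly a dozen temporaries; a runtime that recognizes these sub-graphs can emit one kernel per layer and keep everything in registers or cache.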
ONNX
The Open Neural Network Exchange (ONNX) is an open standard for representing AI models so that they work across frameworks.
ONNX Runtime is a high-performance inference engine for machine learning models across Windows, Linux, and Mac. Developers can train AI models in any framework and deploy those models to production in the cloud and at the edge.
Developed with Facebook and Amazon, the platform has grown in popularity. By its full launch in 2017, Facebook said, several major tech companies had joined, among them AMD, ARM, IBM, Intel, Huawei, NVIDIA, and Qualcomm.