By optimizing BERT to run on CPUs, Microsoft has made inference cost-effective.
According to the published benchmark, BERT inference on an Azure Standard F16s_v2 CPU virtual machine takes only 9 ms, a 17x speedup over the unoptimized model.
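To illustrate how such a latency figure is typically measured, here is a minimal benchmark harness. This is a hypothetical sketch, not Microsoft's actual benchmark; `run_bert_inference` is a stand-in stub for the optimized model.

```python
import time

def run_bert_inference(batch):
    # Stand-in stub; a real benchmark would invoke the deployed
    # inference runtime on this batch of inputs.
    return [0.0 for _ in batch]

def mean_latency_ms(fn, batch, warmup=10, iters=100):
    # Warm up before timing to exclude one-time startup costs.
    for _ in range(warmup):
        fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        fn(batch)
    return (time.perf_counter() - start) * 1000.0 / iters

def speedup(baseline_ms, optimized_ms):
    # A 17x speedup at 9 ms optimized latency implies a baseline
    # of roughly 153 ms per inference.
    return baseline_ms / optimized_ms

print(speedup(153.0, 9.0))  # 17.0
```

The harness averages over many iterations because a single timed call is dominated by noise such as cache state and scheduler jitter.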
Microsoft also partnered with NVIDIA to optimize BERT for the GPUs powering Azure NV6 virtual machines, which are backed by NVIDIA Tesla M60 GPUs. The optimization involved reimplementing the neural network with the TensorRT C++ APIs, building on the CUDA and cuBLAS libraries. Microsoft claims that the improved Bing search platform, running the optimized model on NVIDIA GPUs, serves more than one million BERT inferences per second within Bing's latency limits.
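The one-million-inferences-per-second figure is a fleet-wide throughput claim, not a per-GPU number. A back-of-the-envelope fleet-sizing calculation looks like the sketch below; the per-GPU throughput used here is purely hypothetical, since the source does not report one.

```python
import math

def gpus_needed(target_qps, per_gpu_qps):
    # Minimum number of GPUs needed to sustain the target aggregate
    # throughput, assuming linear scaling across the fleet.
    return math.ceil(target_qps / per_gpu_qps)

# Hypothetical: if each GPU sustained 500 inferences/sec, serving
# one million inferences/sec would require 2,000 GPUs.
print(gpus_needed(1_000_000, 500))  # 2000
```

In practice the fleet would be overprovisioned beyond this minimum to absorb traffic spikes while staying within the latency limits mentioned above.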