Before optimization, Bing’s original transformer model had a 95th percentile latency of 4.76 seconds per batch (20 queries) and a throughput of 4.2 queries per second per instance.
With TensorRT-LLM, the latency was reduced to 3.03 seconds per batch, and throughput increased to 6.6 queries per second per instance.
That works out to a 36% reduction in latency, and Bing reports a 57% decrease in operational costs.
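As a sanity check, the percentage improvements follow directly from the published figures. The numbers below come from the article; the calculation itself is our own illustration:

```python
# Published Bing figures (before vs. after TensorRT-LLM optimization).
baseline_latency_s = 4.76   # 95th-percentile latency per 20-query batch
optimized_latency_s = 3.03
baseline_qps = 4.2          # queries per second per instance
optimized_qps = 6.6

# Relative improvement: (old - new) / old for latency, (new - old) / old for throughput.
latency_reduction = (baseline_latency_s - optimized_latency_s) / baseline_latency_s
throughput_gain = (optimized_qps - baseline_qps) / baseline_qps

print(f"Latency reduction: {latency_reduction:.0%}")  # ~36%
print(f"Throughput gain:   {throughput_gain:.0%}")    # ~57%
```

Note that the 57% figure matches the throughput gain per instance; the cost savings Bing cites follow from serving the same query volume with fewer instances.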
The company states:
“… our product is built on the foundation of providing the best results, and we will not compromise on quality for speed. This is where TensorRT-LLM comes into play, reducing model inference time and, consequently, the end-to-end experience latency without sacrificing result quality.”
Benefits For Bing Users
This update brings several potential benefits to Bing users:
Faster search results through optimized inference and quicker response times
Improved accuracy from the enhanced capabilities of SLM models, delivering more contextualized results
Cost efficiency that frees Bing to invest in further innovations and improvements
Why Bing’s Move to LLM/SLM Models Matters
Bing’s switch to LLM/SLM models and TensorRT-LLM optimization could shape the future of search.
As users ask more complex questions, search engines need to better understand and deliver relevant results quickly. Bing aims to do that using smaller language models and advanced optimization techniques.
While we’ll have to wait and see the full impact, Bing’s move sets the stage for a new chapter in search.
Featured Image: mindea/Shutterstock