How we improved TensorFlow Serving performance by over 70%
TensorFlow Serving is a flexible server architecture designed to deploy and serve machine learning models. It provides monitoring components, a configurable architecture, and support for multiple ML models or versions. The size of the "servable" matters: smaller models use less memory and storage and load faster. Prediction latency can be reduced on both the prediction server and the client; techniques such as building a CPU-optimized serving binary, enabling server-side batching, and implementing client-side batching can cut latency significantly. For "offline" inference over massive volumes, hardware acceleration such as GPUs may also be worth considering.
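To make the client-side batching idea concrete, here is a minimal sketch against TensorFlow Serving's standard gRPC Predict API: rather than issuing one RPC per input, the client stacks several inputs into a single request tensor. The model name "my_model", signature "serving_default", and input tensor name "input" are placeholders, not details from the article.

```python
# Client-side batching sketch: send N examples in one PredictRequest
# instead of N separate gRPC calls, amortizing per-RPC overhead.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

def predict_batch(examples):
    """Batch a list of same-shaped numpy arrays into a single Predict call."""
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "my_model"            # placeholder model name
    request.model_spec.signature_name = "serving_default"
    batch = np.stack(examples).astype(np.float32)   # shape (N, ...)
    request.inputs["input"].CopyFrom(               # placeholder tensor name
        tf.make_tensor_proto(batch, shape=batch.shape))
    return stub.Predict(request, timeout=10.0)
```

Server-side batching complements this: TensorFlow Serving can be started with `--enable_batching` and a `--batching_parameters_file` so the server itself coalesces concurrent requests; the tuning values there are workload-dependent.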
Company: Mux
Date published: Feb. 26, 2019
Author(s): Masroor Hasan
Word count: 1852
Hacker News points: None found.
Language: English