
How we improved TensorFlow Serving performance by over 70%

What's this blog post about?

TensorFlow Serving is a flexible server architecture designed to deploy and serve machine learning models. It provides monitoring components, a configurable architecture, and support for multiple models or model versions. The size of the "servable" matters: smaller models use less memory and storage and load faster. Latency can be reduced on both the prediction server and the client. Techniques such as building a CPU-optimized serving binary, enabling server-side batching, and batching requests on the client can significantly cut prediction latency. For "offline" inference over massive volumes, hardware acceleration such as GPUs may also be worth considering.
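Client-side batching, one of the techniques mentioned above, amounts to grouping individual inputs into a single prediction request so that each round trip to the server amortizes its overhead. A minimal sketch of that grouping step is below; the helper name and usage are illustrative assumptions, not code from the article:

```python
from typing import Iterable, Iterator, List


def batch(items: Iterable, batch_size: int) -> Iterator[List]:
    """Group individual inputs into fixed-size batches so that one
    PredictRequest can carry many inputs instead of one per call."""
    buf: List = []
    for item in items:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf  # flush the final partial batch


# Hypothetical usage: each yielded batch would become one request
# to the prediction server rather than batch_size separate requests.
batches = list(batch(range(10), batch_size=4))
```

Server-side batching achieves a similar effect transparently: TensorFlow Serving queues incoming requests and merges them into batches based on a tunable batching configuration, trading a small queuing delay for higher throughput.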

Company
Mux

Date published
Feb. 26, 2019

Author(s)
Masroor Hasan

Word count
1852

Language
English

Hacker News points
None found.


By Matt Makai. 2021-2024.