5 recommendations when running Thanos and Prometheus

Company

Zapier

Date Published

Jan. 13, 2023

Author

Ihor Horak

Word count

1031

Language

English

Hacker News points

None

URL

zapier.com/blog/five-recommendations-when-running-thanos-and-prometheus

Summary

We have built a robust monitoring system on Thanos and Prometheus that is fast, reliable, highly available, and scalable. Along the way we learned five lessons: use caching to improve query performance; downsample metrics to reduce storage requirements; keep metrics in good shape by storing only the important ones; shard long-term storage so that large amounts of data can be served efficiently; and scale Prometheus shards manually to achieve scalability and high availability. These strategies have improved the performance and reliability of our monitoring system, which currently stores 130 TB of metrics and answers a high volume of queries within a few seconds.
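
To make the downsampling recommendation concrete, here is a minimal sketch of the general idea: raw samples are grouped into fixed time windows and each window keeps only a few aggregates, so long-range queries read far fewer points. This is an illustration only, not Thanos's implementation. The function name `downsample`, the `Aggregate` type, the 300-second window, and the choice of aggregates are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Aggregate(NamedTuple):
    """Hypothetical per-window summary kept in place of the raw samples."""
    count: int
    total: float
    minimum: float
    maximum: float


def downsample(samples, window_seconds=300):
    """Group (timestamp_seconds, value) pairs into fixed-size windows.

    Returns a dict that maps each window's start timestamp to an Aggregate,
    ordered by window start.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align the timestamp to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {
        start: Aggregate(len(values), sum(values), min(values), max(values))
        for start, values in sorted(buckets.items())
    }


# Five raw samples collapse into two 5-minute windows.
raw = [(0, 1.0), (15, 3.0), (30, 2.0), (300, 5.0), (315, 7.0)]
result = downsample(raw)
```

The average for a window can still be recovered as `total / count`, which is why the sketch stores both values rather than a precomputed mean.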