How Cloudflare runs Prometheus at scale
Prometheus is a powerful monitoring system that can handle high-cardinality time series data, but this strength is also a weakness: large numbers of metrics or labels can overload an instance. To tackle this issue, we developed two custom patches for Prometheus. The first enforces a total limit on the number of stored time series. The second provides graceful degradation by capping the number of time series per scrape while still allowing appends to existing time series once the limit is reached. Together, these patches help prevent overloaded instances, improve performance, and provide a safety net for high-cardinality data. We also maintain extensive internal documentation that guides engineers through the entire process of working with metrics in Prometheus, from defining metrics to visualizing them in dashboards.
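The behaviour of the two limits can be illustrated with a toy model. This is a minimal sketch in Python, not the actual patches (Prometheus itself is written in Go). The names `ToyTSDB`, `scrape`, `labels_of`, `total_limit`, and `per_scrape_limit` are hypothetical, and the exact way the per-scrape cap is counted is an assumption based only on the description above.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# A series is identified by its label set (hypothetical simplification).
Labels = FrozenSet[Tuple[str, str]]
Sample = Tuple[Labels, float]


def labels_of(**kwargs: str) -> Labels:
    """Build a hashable label set, e.g. labels_of(path="/a")."""
    return frozenset(kwargs.items())


class ToyTSDB:
    """Toy store with a hard cap on the total number of series."""

    def __init__(self, total_limit: int) -> None:
        self.total_limit = total_limit
        self.series: Dict[Labels, List[float]] = {}

    def append(self, labels: Labels, value: float, allow_new: bool = True) -> bool:
        """Append a sample; return False if it had to be dropped."""
        if labels not in self.series:
            # Creating a series is refused when the caller forbids it
            # or when the global cap is already reached.
            if not allow_new or len(self.series) >= self.total_limit:
                return False
            self.series[labels] = []
        # Appends to series that already exist always succeed.
        self.series[labels].append(value)
        return True


def scrape(db: ToyTSDB, samples: Iterable[Sample], per_scrape_limit: int) -> int:
    """Ingest one scrape and return the number of dropped samples.

    Assumption: once per_scrape_limit samples have been accepted, only
    samples for already-known series are stored; the rest are dropped
    instead of failing the whole scrape.
    """
    accepted = 0
    dropped = 0
    for labels, value in samples:
        under_cap = accepted < per_scrape_limit
        if db.append(labels, value, allow_new=under_cap):
            accepted += 1
        else:
            dropped += 1
    return dropped
```

In this sketch an oversized scrape loses only the excess new series, while series that already exist keep receiving samples, which is the graceful degradation the summary describes.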
Company
Cloudflare
Date published
March 3, 2023
Author(s)
Lukasz Mierzwa
Word count
6846
Hacker News points
40
Language
English