ClickHouse Capacity Estimation Framework
Cloudflare uses ClickHouse extensively for its internal analytics workloads, bot management, customer dashboards, and other systems. The largest cluster has over 100 nodes, and several others have at least three nodes each. The clusters follow the standard ClickHouse design: a cluster holds shards, and each node is a physical server. For capacity planning, the team periodically collects extensive information about running operations from ClickHouse's system tables. They previously predicted disk usage manually but have now automated the process with Python and Facebook's Prophet library for time-series forecasting. The collected metrics are stored in ClickHouse itself, and actual and predicted values are displayed on the same Grafana dashboard. This automation has helped Cloudflare spot unexpected disk space issues across multiple clusters and provides a valuable tool for planning server purchases. The project was carried out by the Core SRE team to improve its daily work and is now used by other teams within the company.
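The summary does not include code, but a minimal sketch of the forecasting step might look like the following: pull the periodically collected disk-usage snapshots out of ClickHouse, fit a Prophet model, and forecast usage a few months ahead. The metrics.disk_usage_daily table, its columns, the cluster name, and the connection details are illustrative assumptions, not Cloudflare's actual schema.

    # Minimal sketch, assuming a hypothetical metrics.disk_usage_daily table that
    # stores the snapshots periodically collected from ClickHouse system tables.
    import pandas as pd
    from prophet import Prophet            # packaged as "fbprophet" in older releases
    from clickhouse_driver import Client

    client = Client(host="localhost")

    # One row per day and cluster: total bytes on disk at collection time.
    rows = client.execute(
        "SELECT day, bytes_used FROM metrics.disk_usage_daily "
        "WHERE cluster = 'main' ORDER BY day"
    )
    df = pd.DataFrame(rows, columns=["ds", "y"])   # Prophet expects columns ds and y

    model = Prophet()
    model.fit(df)

    # Forecast 90 days ahead; yhat is the predicted disk usage, with an
    # uncertainty interval in yhat_lower / yhat_upper.
    future = model.make_future_dataframe(periods=90)
    forecast = model.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())

Predictions produced this way could then be written back into a ClickHouse table so that, as the post describes, real and forecast values can be charted side by side on the same Grafana dashboard.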
Company
Cloudflare
Date published
Nov. 5, 2020
Author(s)
Oxana Kharitonova
Word count
1617
Hacker News points
47
Language
English