
ClickHouse Capacity Estimation Framework

What's this blog post about?

Cloudflare uses ClickHouse extensively for its internal analytics workloads, bot management, customer dashboards, and other systems. The largest cluster has over 100 nodes, and several others have at least three nodes each. They follow the standard ClickHouse schema design, where a cluster is made up of shards and each node is a physical machine. For capacity planning, they periodically collect extensive information from system tables about running processes. They previously predicted disk usage manually but have now automated the process using Python and Facebook's Prophet library for time-series forecasting. The metrics are stored in ClickHouse itself, and real and predicted data are displayed on the same Grafana dashboard. This automation has helped Cloudflare spot unexpected disk space issues across multiple clusters and gives them a valuable tool for planning server purchases. The project was carried out by the Core SRE team to improve their daily work and is now used by other teams within the company.
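The post itself does not include code, but the workflow it describes (query the collected disk-usage metrics, forecast with Prophet, write predictions back to ClickHouse for Grafana) can be sketched roughly as below. The host, table names, columns, cluster name, and 90-day horizon are illustrative assumptions, not Cloudflare's actual setup.

```python
# Minimal sketch of the forecasting step, assuming disk-usage metrics have
# already been collected into a ClickHouse table. All identifiers below
# (host, capacity.disk_usage, capacity.disk_usage_forecast) are hypothetical.
from clickhouse_driver import Client
from prophet import Prophet  # published as "fbprophet" at the time of the 2020 post
import pandas as pd

client = Client(host="clickhouse.internal")  # hypothetical host

# Pull daily disk usage for one cluster from a hypothetical metrics table.
rows = client.execute(
    """
    SELECT toDate(ts) AS day, max(used_bytes) AS used_bytes
    FROM capacity.disk_usage            -- hypothetical table
    WHERE cluster = 'logs-main'         -- hypothetical cluster name
    GROUP BY day
    ORDER BY day
    """
)
df = pd.DataFrame(rows, columns=["ds", "y"])  # Prophet expects 'ds' and 'y' columns

# Fit a time-series model and forecast the next 90 days of disk usage.
model = Prophet(daily_seasonality=False)
model.fit(df)
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
forecast["ds"] = forecast["ds"].dt.date  # Date values for the ClickHouse column

# Write predictions back into ClickHouse so Grafana can plot real and
# predicted usage on the same dashboard (insert target is hypothetical).
client.execute(
    "INSERT INTO capacity.disk_usage_forecast "
    "(day, yhat, yhat_lower, yhat_upper) VALUES",
    list(forecast.itertuples(index=False, name=None)),
)
```

A job like this can be scheduled per cluster, which matches the post's point that the same dashboard then shows both the real usage curve and its projection.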

Company
Cloudflare

Date published
Nov. 5, 2020

Author(s)
Oxana Kharitonova

Word count
1617

Hacker News points
47

Language
English

