Company
Date Published
Author
Emily Chang
Word count
3870
Language
English
Hacker News points
3

Summary

Datadog operates more than 40 Kafka and ZooKeeper clusters that process trillions of datapoints daily across multiple platforms, data centers, and regions. Scaling these clusters to support diverse workloads taught the team several lessons, which the post shares: coordinating changes to maximum message size, dealing with unclean leader elections, investigating data reprocessing issues on low-throughput topics, and understanding why low-traffic topics can retain data longer than expected. The post also notes that monitoring certain metrics helps ensure data durability and cluster availability.