Lessons learned from running Kafka at Datadog
Datadog operates over 40 Kafka and ZooKeeper clusters that process trillions of datapoints daily across multiple platforms, data centers, and regions, and has learned valuable lessons from scaling these clusters to support diverse workloads. The post shares insights on coordinating changes to the maximum message size across brokers, producers, and consumers; handling unclean leader elections; investigating data-reprocessing issues on low-throughput topics; and why low-traffic topics can retain data longer than expected. It also highlights metrics to monitor to help ensure data durability and cluster availability.
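One of the lessons summarized above is that a maximum-message-size change must be applied consistently at the broker, topic, producer, and consumer levels, or oversized messages will be rejected somewhere along the path. A minimal sketch of the relevant Kafka settings follows; the topic name, bootstrap address, and 2 MB size are illustrative assumptions, not values from the post:

```shell
# Broker-side defaults (server.properties) — illustrative values:
# message.max.bytes=2097152          # largest record batch the broker will accept
# replica.fetch.max.bytes=2097152    # must be >= message.max.bytes so followers can replicate

# Raise the limit for one topic (hypothetical topic name):
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name example-topic \
  --add-config max.message.bytes=2097152

# Producer side (producer.properties) — requests must be allowed to grow too:
# max.request.size=2097152

# Consumer side (consumer.properties) — fetches must accommodate the larger records:
# max.partition.fetch.bytes=2097152
```

Changing only one of these layers is a common pitfall: for example, raising the topic's `max.message.bytes` without raising the brokers' `replica.fetch.max.bytes` can leave followers unable to replicate large messages.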
Company
Datadog
Date published
June 25, 2019
Author(s)
Emily Chang
Word count
3870
Hacker News points
3
Language
English