Lessons learned from running Kafka at Datadog
Datadog operates over 40 Kafka and ZooKeeper clusters that process trillions of datapoints daily across multiple platforms, data centers, and regions, and has learned valuable lessons from scaling these clusters to support diverse workloads. The post shares insights on coordinating changes to the maximum message size across brokers, producers, and consumers; handling unclean leader elections; investigating data-reprocessing issues on low-throughput topics; and why low-traffic topics can retain data longer than expected. It also highlights metrics to monitor to help ensure data durability and cluster availability.
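One of the lessons summarized above is that a maximum-message-size change must be applied consistently at the broker, topic, producer, and consumer levels, or oversized messages will be rejected somewhere along the path. A minimal sketch of the relevant Kafka settings follows; the topic name, bootstrap address, and 2 MB size are illustrative assumptions, not values from the post:

```shell
# Broker-side defaults (server.properties) — illustrative values:
# message.max.bytes=2097152          # largest record batch the broker will accept
# replica.fetch.max.bytes=2097152    # must be >= message.max.bytes so followers can replicate

# Raise the limit for one topic (hypothetical topic name):
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name example-topic \
  --add-config max.message.bytes=2097152

# Producer side (producer.properties) — requests must be allowed to grow too:
# max.request.size=2097152

# Consumer side (consumer.properties) — fetches must accommodate the larger records:
# max.partition.fetch.bytes=2097152
```

Changing only one of these layers is a common pitfall: for example, raising the topic's `max.message.bytes` without raising the brokers' `replica.fetch.max.bytes` can leave followers unable to replicate large messages.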
Company
Datadog
Date published
June 25, 2019
Author(s)
Emily Chang
Word count
3870
Hacker News points
3
Language
English