Datadog operates more than 40 Kafka and ZooKeeper clusters that process trillions of datapoints daily across multiple platforms, data centers, and regions. The company has learned valuable lessons from scaling these clusters to support diverse workloads. It shares insights on coordinating changes to the maximum message size, handling unclean leader elections, investigating data reprocessing on low-throughput topics, and understanding why low-traffic topics can retain data longer than expected. Monitoring the right metrics helps ensure data durability and cluster availability.
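One of those coordination points can be sketched with Kafka's standard admin API. The snippet below is not taken from the article; the topic name, bootstrap address, and 2 MiB limit are illustrative assumptions. It raises a topic's `max.message.bytes`, and in practice the broker-side `message.max.bytes`, follower `replica.fetch.max.bytes`, and consumer fetch limits must be raised in step so that replication and consumption keep up with the larger messages.

```java
// Minimal sketch (not Datadog's tooling): raise a topic's max.message.bytes
// using Kafka's AdminClient. Topic name, bootstrap address, and the 2 MiB
// value are assumptions for illustration only.
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RaiseMaxMessageSize {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "example-topic");

            // Raise the topic-level message size limit to 2 MiB.
            AlterConfigOp raiseLimit = new AlterConfigOp(
                new ConfigEntry("max.message.bytes", "2097152"),
                AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(raiseLimit)))
                 .all()
                 .get();
        }
    }
}
```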