Intelligent, automatic restarts for unhealthy Kafka consumers
When building distributed systems on Kubernetes, a common challenge is ensuring the health of every component. In a system where microservices consume data from Apache Kafka topics, liveness checks can verify that consumers are actively processing messages. A naive approach is a simple Kafka connectivity check, but a consumer can stay connected to the broker and still be stuck, so connectivity alone is not enough for systems with multiple partitions and replicas. A better health check focuses on message ingestion by comparing two offsets per partition: the latest offset (the most recent message written to the partition) and the committed offset (the last message the consumer has processed). If the committed offset equals the latest offset, the consumer is caught up; if it is behind but still advancing between checks, the consumer is making progress. A committed offset that is behind and has stopped moving indicates a stalled consumer that should be restarted. One complication is that Kafka rebalances can reassign partitions to different consumers, which produces incorrect health checks if each service instance keeps tracking offsets for partitions it no longer owns. To solve this, use the Sarama library's functionality to observe when a rebalance happens and update the in-memory map of offsets accordingly. Overall, smart health checks like these can help prevent cascading failures in Kubernetes systems by ensuring that microservices are actively processing messages from Apache Kafka topics.
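The offset comparison described above can be sketched in Go using only the standard library. This is an illustrative sketch, not the authors' implementation: the `Checker` type and its `Healthy` and `Reset` methods are hypothetical names, and the latest and committed offsets are assumed to be fetched elsewhere (for example from the Kafka client). It also assumes a partition seen for the first time should be treated as healthy, since there is no earlier reading to compare against.

```go
package main

import (
	"fmt"
	"sync"
)

// Checker is a hypothetical in-memory liveness tracker (not part of Sarama).
// It remembers the committed offset seen for each partition at the previous check.
type Checker struct {
	mu   sync.Mutex
	last map[int32]int64 // partition -> committed offset at the previous check
}

// NewChecker returns a Checker with no tracked partitions.
func NewChecker() *Checker {
	return &Checker{last: make(map[int32]int64)}
}

// Reset forgets all tracked partitions. The intent is to call it after a
// rebalance, so that offsets of partitions no longer assigned to this
// instance cannot cause a false "stalled" result.
func (c *Checker) Reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.last = make(map[int32]int64)
}

// Healthy reports whether the consumer appears to be processing partition p.
// latest is the newest offset in the partition; committed is the offset the
// consumer has committed. How these are obtained is outside this sketch.
func (c *Checker) Healthy(p int32, latest, committed int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	prev, seen := c.last[p]
	c.last[p] = committed

	switch {
	case committed > latest:
		return false // committed offset should never be ahead of the log
	case committed == latest:
		return true // caught up: nothing left to process
	case !seen:
		return true // assumption: no baseline yet, so give the benefit of the doubt
	default:
		return committed > prev // behind: healthy only if it advanced since last check
	}
}

func main() {
	c := NewChecker()
	fmt.Println(c.Healthy(0, 100, 90))  // first observation of partition 0
	fmt.Println(c.Healthy(0, 110, 90))  // behind and not advancing
	fmt.Println(c.Healthy(0, 120, 100)) // behind but advancing
	c.Reset()                           // e.g. after a rebalance
	fmt.Println(c.Healthy(1, 50, 10))   // newly assigned partition 1
}
```

In a Sarama-based consumer group, one plausible place to call `Reset` is the `Setup` method of a `sarama.ConsumerGroupHandler`, which runs at the start of each new session after a rebalance. A Kubernetes liveness endpoint could then return an error status when `Healthy` is false for any assigned partition.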
Company
Cloudflare
Date published
Jan. 24, 2023
Author(s)
Chris Shepherd, Andrea Medda
Word count
1737
Language
English
Hacker News points
2