How We Tracked Down a Linux Kernel Bug with Fallout
The article describes a complex bug that took weeks to debug and spanned multiple layers of the software stack. It started with a performance test timing out after 16 hours because of an issue in the Linux kernel hrtimer code. Fallout, an open-source distributed systems testing service, was instrumental in iterating quickly and gathering new data to validate or invalidate guesses about the underlying bug. The author used tools such as nodetool tpstats, jstack, and a BPF script to understand the issue at different levels of the stack. The kernel bug, which caused the red-black tree to become inconsistent, had already been fixed upstream in Linux 5.12, but the fix had not yet been pulled into Ubuntu's kernel. The author recommends keeping a bag of tools and techniques for understanding an application's behavior at various levels of the stack, along with services like Fallout that automatically deploy and provision virtual machines for running tests.
Company
DataStax
Date published
Sept. 27, 2021
Author(s)
Matt Fleming
Word count
2754
Language
English
Hacker News points
1