The issue was caused by a regression introduced in the Linux kernel between versions 4.9 and 4.10, which increased CPU usage on servers running Kafka and degraded performance. Bisection pinpointed the exact version in which the regression first appeared.
The fix was to enable TCP segmentation offload (TSO) and other network offload features on VLAN interfaces in the kernel configuration. With these features enabled, CPU usage dropped and performance improved significantly.
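As a rough illustration of how the state of these features can be inspected, the sketch below parses the text printed by `ethtool -k <interface>` and builds the matching `ethtool -K` command for anything that is off. The interface name `eth0.100`, the sample output, and the choice of features beyond TSO are assumptions made for the example; they are not taken from the incident itself.

```python
# Long names printed by `ethtool -k`, mapped to the short names that
# `ethtool -K` accepts. Only TSO is named in the text above; the other
# entries are an assumed, illustrative selection.
WANTED = {
    "tcp-segmentation-offload": "tso",
    "scatter-gather": "sg",
    "tx-checksumming": "tx",
    "generic-segmentation-offload": "gso",
}


def disabled_features(ethtool_k_output):
    """Return short names of wanted features that are off and not [fixed]."""
    result = []
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep or name not in WANTED:
            continue
        fields = value.split()
        # "[fixed]" marks features the driver does not allow changing.
        if fields and fields[0] == "off" and "[fixed]" not in fields:
            result.append(WANTED[name])
    return result


def enable_command(interface, features):
    """Build the `ethtool -K` argument list that turns the features on."""
    cmd = ["ethtool", "-K", interface]
    for feature in features:
        cmd += [feature, "on"]
    return cmd


# Hypothetical output for a VLAN interface with offloads disabled.
SAMPLE = """\
Features for eth0.100:
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
generic-segmentation-offload: on
"""

if __name__ == "__main__":
    missing = disabled_features(SAMPLE)
    if missing:
        print(" ".join(enable_command("eth0.100", missing)))
```

The script only prints the command rather than running it, so it can be tried safely on any machine before being pointed at real `ethtool -k` output.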
In addition, a workaround was implemented that automatically enables these offload features on VLAN interfaces when they are disabled at boot. A ticket about the issue was also filed upstream with systemd.
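The text does not say how the workaround was built, so the following is only one possible shape for it: a small script, run once at boot, that discovers VLAN interfaces from `/proc/net/vlan/config` (exposed by the `8021q` module) and issues `ethtool -K` for each. The feature list, the dry-run default, and the way the script would be hooked into boot are all assumptions.

```python
import subprocess
from pathlib import Path

VLAN_CONFIG = Path("/proc/net/vlan/config")  # present when 8021q is loaded
FEATURES = ["tx", "sg", "tso"]  # assumed feature set, not from the source


def vlan_interfaces(config_text):
    """Extract VLAN device names from the contents of /proc/net/vlan/config."""
    names = []
    for line in config_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows look like "eth0.100 | 100 | eth0"; the two header
        # lines do not have three fields with a numeric VLAN ID.
        if len(parts) == 3 and parts[1].isdigit():
            names.append(parts[0])
    return names


def offload_commands(interfaces, features=FEATURES):
    """Build one `ethtool -K` argument list per interface."""
    commands = []
    for iface in interfaces:
        cmd = ["ethtool", "-K", iface]
        for feature in features:
            cmd += [feature, "on"]
        commands.append(cmd)
    return commands


def main(apply=False):
    """Print the commands; run them only when apply=True (needs root)."""
    if not VLAN_CONFIG.exists():
        return
    for cmd in offload_commands(vlan_interfaces(VLAN_CONFIG.read_text())):
        print(" ".join(cmd))
        if apply:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main()
```

By default the script is a dry run that only lists what it would execute; calling `main(apply=True)` as root would apply the settings.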