Debugging Hardware Performance on Gen X Servers
In this post, Cloudflare hardware engineer Yasir Jamal describes an issue in which servers from one vendor (SKU-B) consistently performed 5-10% worse than servers from another vendor (SKU-A). The team initially suspected CPU performance and ran AMD's DGEMM high-performance computing tool, which showed that the underperforming servers had lower Thermal Design Power (TDP) and a lower floating-point computation rate. After trying several debugging options, including disabling the idle power saving mode, checking the network interface, and enabling AMD Preferred I/O functionality, the team used AMD's HSMP tool and discovered a difference in the Infinity Fabric memory clock frequency between the two SKUs. They asked the vendor to provide a new BIOS that fixed the frequency at 1467 MHz at compile time, which resolved the issue and brought SKU-B performance up to match or exceed that of SKU-A.
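The 5-10% gap described above is a relative difference between the average results of the two SKUs. The following is a minimal sketch of that arithmetic, not the team's actual tooling: the function names, the 5% threshold default, and the sample GFLOPS values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def relative_gap_percent(baseline, candidate):
    """Return how far the candidate's mean falls below the baseline's mean, in percent."""
    base = mean(baseline)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return (base - mean(candidate)) / base * 100.0


def is_underperforming(baseline, candidate, threshold_percent=5.0):
    """Flag the candidate when its gap meets or exceeds the (hypothetical) threshold."""
    return relative_gap_percent(baseline, candidate) >= threshold_percent


# Hypothetical benchmark samples, NOT measurements from the post.
sku_a_gflops = [100.0, 102.0, 98.0]
sku_b_gflops = [93.0, 92.0, 94.0]

gap = relative_gap_percent(sku_a_gflops, sku_b_gflops)
print(f"SKU-B trails SKU-A by {gap:.1f}%")
print("Underperforming:", is_underperforming(sku_a_gflops, sku_b_gflops))
```

Comparing means over several runs, rather than single runs, helps separate a consistent deficit like the one described from ordinary run-to-run noise.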
Company
Cloudflare
Date published
May 17, 2022
Author(s)
Yasir Jamal
Word count
920
Language
English
Hacker News points
5