How we used OpenBMC to support AI inference on GPUs around the world
Cloudflare recently announced Workers AI, which lets developers run serverless, GPU-powered AI inference on its global network. To manage the underlying servers effectively, Cloudflare uses OpenBMC, an open-source firmware stack from the Open Compute Project (OCP). One key area of focus was updating Baseboard Management Controllers (BMCs), the embedded microprocessors responsible for remote power management and other features on most servers. With OpenBMC, Cloudflare adjusted its BMC firmware to accommodate new GPUs while keeping thermals and power consumption efficient. The company applied PID controller theory to tune the fan settings for better cooling of the GPU-equipped servers. It also established communication with the GPU over the SMBus protocol and used OpenBMC's applications for simple configuration, making sensor data and inventory information available via IPMI or Redfish. Overall, OpenBMC gave Cloudflare more control and flexibility over its server configurations without sacrificing efficiency at the core of its network. This demonstrates the value of being able to modify server firmware without being locked to traditional device update cycles.
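To make the PID fan-tuning idea concrete, here is a minimal sketch of a textbook PID loop that maps a temperature reading to a fan duty cycle. It is not Cloudflare's or OpenBMC's actual implementation: the class name `FanPID`, the gains, the 70 °C setpoint, and the duty-cycle limits are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FanPID:
    """Illustrative PID loop: temperature error in, fan duty cycle (percent) out.

    All gains and limits are hypothetical; real values come from tuning
    against the actual chassis, fans, and GPU.
    """
    kp: float
    ki: float
    kd: float
    setpoint_c: float
    min_duty: float = 20.0
    max_duty: float = 100.0
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, temp_c: float, dt_s: float) -> float:
        # Positive error means the sensor is hotter than the target.
        error = temp_c - self.setpoint_c

        # Skip the derivative term on the first sample, when there is
        # no previous error to compare against.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt_s

        candidate_integral = self.integral + error * dt_s
        raw = (self.kp * error
               + self.ki * candidate_integral
               + self.kd * derivative)

        # Clamp to the range the fans can actually be driven at.
        duty = min(self.max_duty, max(self.min_duty, raw))

        # Simple anti-windup: only accumulate the integral while the
        # output is not saturated.
        if raw == duty:
            self.integral = candidate_integral

        self.prev_error = error
        return duty


if __name__ == "__main__":
    pid = FanPID(kp=4.0, ki=0.5, kd=0.0, setpoint_c=70.0)
    for reading in (60.0, 80.0, 120.0):  # made-up temperatures in Celsius
        print(reading, pid.update(reading, dt_s=1.0))
```

The proportional term reacts to how far the temperature is from the target, the integral term removes steady-state offset, and the derivative term damps fast swings. Clamping the output and freezing the integral while saturated keeps the loop from overshooting after a long period at full fan speed.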
Company
Cloudflare
Date published
Dec. 6, 2023
Author(s)
Ryan Chow, Giovanni Pereira Zantedeschi, Nnamdi Ajah
Word count
1745
Language
English
Hacker News points
None found.