Is this thing on? Using OpenBMC and ACPI power states for reliable server boot

What's this blog post about?

Cloudflare uses Baseboard Management Controllers (BMCs) to manage its servers worldwide. The BMC is a dedicated processor that monitors and manages a server independently of the host Central Processing Unit (CPU). Cloudflare customizes and deploys OpenBMC, an open-source firmware stack designed for a range of systems, including enterprise, telco, and cloud-scale data centers. Because the firmware is open source, Cloudflare has greater flexibility and ownership over this critical server subsystem and can develop custom features and fixes faster. While developing its OpenBMC firmware, the team ran into several boot problems: servers failing to boot, missing memory modules, and thermal telemetry issues. To diagnose them, the team implemented host ACPI state tracking in OpenBMC, so the BMC can follow the host's power state changes during the boot process. They then fixed the problems by controlling the power mode of interfering devices, preventing the BMC from reading DIMM temperature sensors during specific phases, and flagging non-functional thermal sensors in their configuration. These changes give Cloudflare better observability into boot progress and the "last state" of its systems, which makes operations more reliable and strengthens its ownership of the firmware that powers its servers.
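To make the idea of host ACPI state tracking concrete, here is a minimal Python sketch. It is not Cloudflare's or OpenBMC's implementation: the `HostStateTracker` class and its method names are hypothetical, and the rule that DIMM temperature polling is only allowed once the host reports S0 is an assumption made for illustration. The S0 to S5 state names themselves come from the ACPI specification.

```python
from dataclasses import dataclass, field
from enum import Enum


class AcpiState(Enum):
    """A subset of the ACPI system sleep states."""
    S0 = "working"
    S3 = "suspend-to-RAM"
    S4 = "hibernate"
    S5 = "soft-off"


@dataclass
class HostStateTracker:
    """Hypothetical BMC-side record of the host's ACPI state."""
    current: AcpiState = AcpiState.S5
    history: list = field(default_factory=list)

    def update(self, new_state: AcpiState) -> None:
        # Record only real transitions so the history reflects boot progress.
        if new_state is not self.current:
            self.history.append((self.current, new_state))
            self.current = new_state

    def dimm_polling_allowed(self) -> bool:
        # Assumption for this sketch: the BMC reads DIMM temperature
        # sensors only while the host is in S0.
        return self.current is AcpiState.S0


tracker = HostStateTracker()
tracker.update(AcpiState.S0)
print(tracker.current.name, tracker.dimm_polling_allowed())
```

Keeping the transition history alongside the current state is what would let an operator see the "last state" of a server that failed to come up.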

Company
Cloudflare

Date published
Oct. 22, 2024

Author(s)
Nnamdi Ajah, Ryan Chow, Giovanni Pereira Zantedeschi

Word count
3324

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.