Autonomous hardware diagnostics and recovery at scale

What's this blog post about?

Cloudflare's global network spans over 310 cities in more than 120 countries, with thousands of servers across its data centers. Dealing with broken servers was a persistent challenge, because manual troubleshooting and repair were time-consuming and inefficient. To address this, the Infrastructure Software Systems and Automation team built Phoenix, an autonomous diagnostics and recovery system. Phoenix runs at regular intervals to discover Cloudflare data centers with broken servers, runs diagnostics on the servers it finds, recovers those that pass diagnostics by re-provisioning them, and finally re-enables them in the safest and least obtrusive way possible, all without human intervention. It also handles failures, updating the relevant tickets and reverting a server's state when needed. Phoenix is designed to be aware of other automations operating in the same data center, so that its recovery operations do not interfere with work already in progress. It provides transparency by logging every operation and sharing information in communication channels such as chat rooms and Jira tickets. It also helps manage error budgets, which define how much error an automation can accumulate over a given period before it causes significant harm to the system or excessive noise for SREs. With Phoenix, Cloudflare has seen the potential of autonomous systems first-hand, along with benefits such as reduced energy waste and cost savings. The company plans to keep investing in engineering initiatives focused on building better and smarter systems.
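
The discover, diagnose, recover, and re-enable cycle described above can be sketched roughly as follows. This is a minimal illustration and not Cloudflare's implementation: the names `Server`, `ErrorBudget`, `Outcome`, and `run_cycle` are hypothetical, diagnostics and re-provisioning are reduced to boolean flags, and the error budget is assumed to be a simple count of failed recoveries per cycle.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Outcome(Enum):
    RECOVERED = "recovered"        # passed diagnostics, re-provisioned, re-enabled
    NEEDS_REPAIR = "needs_repair"  # failed diagnostics, left for a human
    REVERTED = "reverted"          # recovery failed, server state rolled back
    SKIPPED = "skipped"            # another automation is busy, or budget is spent


@dataclass
class Server:
    # Hypothetical stand-in for a broken server found during discovery.
    name: str
    passes_diagnostics: bool
    reprovision_ok: bool = True
    locked_by_other_automation: bool = False


@dataclass
class ErrorBudget:
    # Assumption: the budget is a plain count of failed recoveries.
    limit: int
    used: int = 0

    def exhausted(self) -> bool:
        return self.used >= self.limit

    def spend(self) -> None:
        self.used += 1


def run_cycle(
    servers: List[Server],
    budget: ErrorBudget,
    log: Callable[[str], None] = print,
) -> Dict[str, Outcome]:
    """Run one recovery pass over the broken servers and report each outcome."""
    results: Dict[str, Outcome] = {}
    for server in servers:
        if budget.exhausted():
            # Stop acting once the automation has accumulated too many errors.
            log(f"{server.name}: error budget exhausted, skipping")
            results[server.name] = Outcome.SKIPPED
        elif server.locked_by_other_automation:
            # Do not interfere with an operation that is already in progress.
            log(f"{server.name}: in use by another automation, skipping")
            results[server.name] = Outcome.SKIPPED
        elif not server.passes_diagnostics:
            log(f"{server.name}: failed diagnostics, ticket updated")
            results[server.name] = Outcome.NEEDS_REPAIR
        elif server.reprovision_ok:
            log(f"{server.name}: re-provisioned and re-enabled")
            results[server.name] = Outcome.RECOVERED
        else:
            # A failed recovery consumes budget and the server is reverted.
            budget.spend()
            log(f"{server.name}: recovery failed, state reverted")
            results[server.name] = Outcome.REVERTED
    return results
```

Calling `run_cycle` with a list of `Server` objects and an `ErrorBudget(limit=1)` processes servers until one recovery fails, after which the remaining servers are skipped. Passing a custom `log` callable is a stand-in for posting to a chat room or ticket.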

Company
Cloudflare

Date published
March 25, 2024

Author(s)
Jet Mariscal, Aakash Shah, Yilin Xiong

Word count
2063

Language
English

Hacker News points
None found.