Automatic Remediation of Kubernetes Nodes

What's this blog post about?

Cloudflare uses Kubernetes to manage diverse services at its edge locations, running five geographically distributed clusters with hundreds of nodes in the largest one. The clusters are self-managed on bare-metal machines, which provides flexibility but also means node failures have to be handled by the team itself. One common issue is the accumulation of network interfaces owned by the Container Network Interface (CNI) plugin, which can cause a node to become unhealthy. To address this and similar problems, Cloudflare developed Sciuro, an open-source tool that synchronizes Kubernetes node conditions with the alerts currently firing in Alertmanager. This lets automatic node remediation build on the existing monitoring and alerting infrastructure. The team used automatic node remediation on 571 nodes in the past 30 days, saving considerable human effort and reducing time to repair for some issues. Sciuro is available on GitHub, and Cloudflare is looking for more people passionate about Kubernetes to join its team and contribute to the tool's development.
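The synchronization idea described above can be illustrated with a minimal Python sketch. It is not Sciuro's actual implementation: the `node` label, the `Alert` condition-type prefix, the reason strings, and the alert names are all assumptions chosen for illustration. The input mimics the `labels` field of alerts returned by Alertmanager, and the output mimics the `type`/`status`/`reason` fields of a Kubernetes node condition.

```python
from typing import Dict, List


def conditions_for_node(
    node_name: str,
    firing_alerts: List[Dict],
    tracked_alerts: List[str],
    node_label: str = "node",  # assumption: alerts carry the node name in this label
) -> List[Dict[str, str]]:
    """Map currently firing alerts onto node-condition-like dicts for one node."""
    # Collect the names of alerts that are firing for this particular node.
    firing_names = {
        alert["labels"].get("alertname")
        for alert in firing_alerts
        if alert["labels"].get(node_label) == node_name
    }
    # Emit one condition per tracked alert, so that a resolved alert
    # flips its condition back to "False" instead of lingering.
    return [
        {
            "type": f"Alert{name}",  # hypothetical condition-type naming
            "status": "True" if name in firing_names else "False",
            "reason": "AlertIsFiring" if name in firing_names else "AlertIsNotFiring",
        }
        for name in tracked_alerts
    ]


# Hypothetical alerts, shaped like entries in an Alertmanager alert list.
example_alerts = [
    {"labels": {"alertname": "NodeCNIInterfaceLeak", "node": "node-a"}},
    {"labels": {"alertname": "NodeDiskFailing", "node": "node-b"}},
]

for condition in conditions_for_node(
    "node-a", example_alerts, ["NodeCNIInterfaceLeak", "NodeDiskFailing"]
):
    print(condition)
```

In a real deployment, a separate remediation component would watch for conditions like these and act on the node. The sketch covers only the mapping from alerts to conditions.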

Company
Cloudflare

Date published
July 15, 2021

Author(s)
Andrew DeMaria

Word count
2302

Hacker News points
9

Language
English


By Matt Makai. 2021-2024.