A Byzantine failure in the real world

What's this blog post about?

On November 2, 2020, Cloudflare experienced an incident that impaired the availability of its API and dashboard for six hours and 33 minutes. The cause was a Byzantine fault that set off a cascading series of events: a partial switch failure, etcd errors, the promotion of new primary databases, and overloaded authentication databases. Although each system had redundancy, the combination of degraded states made the chain of events difficult to model and anticipate. In response, Cloudflare revisited the configuration parameters of its auto-remediation processes and began further research into Byzantine Fault Tolerance (BFT) consensus protocols.

Company
Cloudflare

Date published
Nov. 27, 2020

Author(s)
Tom Lianza, Chris Snook

Word count
1913

Hacker News points
16

Language
English


By Matt Makai. 2021-2024.