A Byzantine failure in the real world
On November 2, 2020, Cloudflare experienced an incident that impacted the availability of its API and dashboard for six hours and 33 minutes. The issue was caused by a Byzantine fault, which led to a cascading series of events involving a partial switch failure, etcd errors, the promotion of new primary databases, and overloaded authentication databases. Although each system had redundancy, the combination of degraded states made the resulting chain of events difficult to model and anticipate. The incident led Cloudflare to revisit the configuration parameters of its auto-remediation processes and prompted further research into Byzantine Fault Tolerance (BFT) consensus protocols.
Company
Cloudflare
Date published
Nov. 27, 2020
Author(s)
Tom Lianza, Chris Snook
Word count
1913
Language
English
Hacker News points
16