PIPEFAIL: How a missing shell option slowed Cloudflare down
On December 16, 2021, Cloudflare experienced a slowdown lasting approximately 30 minutes. The root cause was a missing shell option, "pipefail", which left a Quicksilver key empty: a Kubernetes cron job failed to populate the key with valid data. The empty key caused dosd to fail; dosd provides protection against large attacks and relies on Quicksilver for its configuration data. As a result, the Front Line's in-memory cache was flushed, and requests slowed down while they waited for dosd to reply. The issue was resolved by manually re-running the Kubernetes cron job. Lessons learned from the incident include scaling out services to handle high request rates and ensuring that code and systems are resilient to failure.
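To illustrate the shell option named above, here is a minimal bash sketch of how a pipeline can hide a failure when "pipefail" is not set. It is not the script involved in the incident; `generate_data` is a hypothetical stand-in for any command that fails before producing output.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a data-producing step that fails
# without writing anything to stdout.
generate_data() {
  return 1
}

# Without pipefail, a pipeline's exit status is that of its LAST command.
# cat succeeds on empty input, so the failure upstream goes unnoticed.
set +o pipefail
generate_data | cat > /dev/null
status_without=$?

# With pipefail, the pipeline's exit status is that of the rightmost
# command that exited non-zero, so the failure is visible.
set -o pipefail
generate_data | cat > /dev/null
status_with=$?
set +o pipefail

echo "without pipefail: $status_without"
echo "with pipefail: $status_with"
```

In the first case a script would carry on with empty output as though the pipeline had succeeded. Scripts that should stop on such failures commonly combine `set -o pipefail` with `set -e`.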
Company: Cloudflare
Date published: April 5, 2022
Author(s): Alex Forster
Word count: 1983
Hacker News points: 30
Language: English