Hardening Workers KV
Cloudflare recently experienced a series of incidents affecting Workers KV (Key-Value), the service used to store configuration and data for applications running on Cloudflare's serverless platform. The root cause was an incorrectly deployed code change that caused keys in affected locations to be persisted with invalid configurations across requests, leaving the Worker "frozen" until a rollback was performed 10 minutes later. A bug in the deployment logic of a newly introduced progressive release process for Workers KV prolonged the incident by dropping some traffic until that change was also rolled back. Cloudflare estimates that the affected traffic accounted for 0.2-0.5% of KV's global traffic and that impacted customers saw error rates approaching 20%. To improve reliability and reduce the risks associated with Workers KV, Cloudflare plans several improvements: enhancing observability tooling for unhandled exceptions, improving safety around environment variable mutations in a Worker, expanding test coverage, refining release processes, adding better logging, adjusting alerting thresholds, and addressing maturity issues with progressive deployment tooling. Cloudflare acknowledges that these incidents did not meet its customers' expectations for the KV service and says it is working to address both the specific issues behind this cycle of incidents and broader reliability concerns across the Cloudflare services that rely on Workers KV.
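The mention of environment variable mutations can be made concrete with a small sketch. The TypeScript below is illustrative only and is not Cloudflare's code: the `Env` shape, the `UPSTREAM_URL` field, and both handler functions are hypothetical, and it assumes a runtime that hands the same bindings object to successive requests. It shows how a write to that shared object can outlive the request that made it, and how deriving a per-request configuration avoids that.

```typescript
// Hypothetical shape of a Worker's environment bindings (illustrative only).
interface Env {
  UPSTREAM_URL: string;
}

// Risky pattern: writes to the shared env object, so the change is still
// visible to every later request that receives the same object.
function handleUnsafe(env: Env, overrideUrl?: string): string {
  if (overrideUrl !== undefined) {
    env.UPSTREAM_URL = overrideUrl;
  }
  return env.UPSTREAM_URL;
}

// Safer pattern: build a per-request config and never write to env.
function handleSafe(env: Readonly<Env>, overrideUrl?: string): string {
  const config: Env = { UPSTREAM_URL: overrideUrl ?? env.UPSTREAM_URL };
  return config.UPSTREAM_URL;
}

// Simulate two consecutive requests sharing one bindings object.
const sharedUnsafe: Env = { UPSTREAM_URL: "https://good.example" };
handleUnsafe(sharedUnsafe, "https://bad.example"); // request 1 sets an override
const leaked = handleUnsafe(sharedUnsafe);          // request 2 sets nothing

const sharedSafe: Readonly<Env> = Object.freeze({ UPSTREAM_URL: "https://good.example" });
handleSafe(sharedSafe, "https://bad.example");      // request 1 sets an override
const isolated = handleSafe(sharedSafe);            // request 2 sets nothing

console.log({ leaked, isolated });
```

In the unsafe variant the second request inherits the first request's override, while in the safe variant it still sees the original value. Freezing the bindings object is one additional guard in this sketch, since an accidental write then fails rather than silently persisting.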
Company
Cloudflare
Date published
Aug. 2, 2023
Author(s)
Matt Silverlock, Charles Burnett, Rob Sutter, Kris Evans
Word count
2576
Language
English
Hacker News points
8