The Real Failure Rate of EBS

Company

PlanetScale

Date Published

March 18, 2025

Author

Word count

1087

Language

English

Hacker News points

113

URL

planetscale.com/blog/the-real-fail-rate-of-ebs

Summary

PlanetScale has deployed millions of Amazon Elastic Block Store (EBS) volumes across the world, which gives the company a unique view into the failure rate and failure mechanisms of EBS. It sees EBS failures on a daily basis, frequently enough that it has built systems to monitor EBS volumes directly and minimize their impact. According to the post, the true rate of failure is constant, variable, and by design: EBS provides no performance guarantee for the periods when a volume is not operating to its specification. To handle these failures, PlanetScale monitors volume metrics closely and performs zero-downtime reparents, which move traffic to another node in the cluster within seconds. It has also built a shared-nothing architecture that uses local storage instead of network-attached storage like EBS, so that other shards and nodes in a database continue operating without problems.
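
The monitoring-then-reparent mitigation described in the summary could look roughly like the sketch below. It is only an illustration under stated assumptions, not PlanetScale's implementation: the `VolumeSample` type, the `is_degraded` and `should_reparent` helpers, and the latency and fraction thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VolumeSample:
    """One latency observation for a volume (hypothetical shape)."""
    read_latency_ms: float
    write_latency_ms: float


def is_degraded(
    samples: Sequence[VolumeSample],
    max_latency_ms: float = 50.0,     # assumed limit, not an EBS specification
    min_bad_fraction: float = 0.75,   # assumed share of slow samples that counts as degraded
) -> bool:
    """Return True when enough samples in the window exceed the latency limit."""
    if not samples:
        # No data is treated as "no evidence of degradation" in this sketch.
        return False
    bad = sum(
        1
        for s in samples
        if max(s.read_latency_ms, s.write_latency_ms) > max_latency_ms
    )
    return bad / len(samples) >= min_bad_fraction


def should_reparent(samples: Sequence[VolumeSample], is_primary: bool) -> bool:
    """Only a primary node on a degraded volume needs traffic moved away."""
    return is_primary and is_degraded(samples)


if __name__ == "__main__":
    healthy = [VolumeSample(1.0, 2.0)] * 5
    slow = [VolumeSample(120.0, 300.0)] * 4 + [VolumeSample(2.0, 2.0)]
    print(should_reparent(healthy, is_primary=True))
    print(should_reparent(slow, is_primary=True))
```

Requiring a sustained fraction of slow samples, rather than reacting to a single slow one, is one way to avoid reparenting on a momentary latency spike; the right window and thresholds would depend on the workload.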