Company
Date Published
June 11, 2024
Author
Jake Cooper
Word count
979
Language
English
Hacker News points
None

Summary

The Railway platform experienced a significant outage on June 11, 2024, affecting both the US and EU fleets, with approximately 20% of instances entering a degraded state because of slow response times. The root cause was an errant migration workflow that set off a chain of problems: high IO pressure, queue buildup, and service degradation. The incident lasted several hours, and some workloads saw up to 20 minutes of downtime. To prevent similar incidents, Railway has implemented fixes: caching status responses from machines, adding global rate limits to worker processes, and optimizing scheduler behavior to prevent cross-cluster write amplification. The infrastructure team has also worked closely with customers to help harden their applications and provide proactive support.
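
Two of the mitigations named in the summary, caching status responses and rate limiting workers, can be illustrated with a small sketch. This is not Railway's implementation: the class names `StatusCache` and `TokenBucket`, the `fetch` callback, and every parameter are hypothetical, chosen only to show the general technique. The limiter here is shared within a single process; a limit that is truly global across worker processes would need a shared store, which this sketch does not attempt.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class StatusCache:
    """Serve a machine's last status for up to ttl_seconds (hypothetical helper).

    Repeated status checks within the TTL are answered from memory instead
    of hitting the machine again.
    """

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # assumed callback that queries one machine
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, machine_id: str) -> str:
        now = self._clock()
        entry = self._entries.get(machine_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: skip the fetch
        status = self._fetch(machine_id)
        self._entries[machine_id] = (now, status)
        return status


class TokenBucket:
    """Token-bucket limiter shared by all workers in one process."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start with a full bucket
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            refill = (now - self._last) * self._rate
            self._tokens = min(self._capacity, self._tokens + refill)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False
```

A worker would call `try_acquire()` before starting a unit of work and back off when it returns `False`, so a burst of queued jobs (such as an errant migration workflow) cannot drive unbounded load onto the hosts.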