The Railway platform experienced a significant outage on June 11th, 2024, affecting both the US and EU fleets. Approximately 20% of instances entered a degraded state with slow response rates. The root cause was an errant migration workflow that set off a chain of problems: high IO pressure, queue buildup, and service degradation. The incident lasted several hours and caused up to 20 minutes of downtime for some workloads. To reduce the risk of similar incidents, Railway has implemented fixes that include caching status responses from machines, adding global rate limits to worker processes, and optimizing scheduler behavior to prevent cross-cluster write amplification. The platform's infrastructure team has also worked closely with customers to help harden their applications and provide proactive support.
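Two of the fixes named above, caching status responses and rate limiting workers, can be illustrated with a minimal sketch. This is not Railway's implementation: the class names `TTLStatusCache` and `TokenBucket`, the `fetch` callback, and parameters such as `ttl_seconds` and `rate_per_second` are all hypothetical, and the limiter here is shared only within one process (a fleet-wide limit would presumably need a shared store).

```python
import threading
import time
from typing import Callable, Dict, Tuple


class TTLStatusCache:
    """Caches per-machine status so repeated polls do not hit the machine.

    Hypothetical illustration: `fetch` stands in for whatever call
    actually queries a machine for its status.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self._lock = threading.Lock()

    def get(self, machine_id: str) -> str:
        now = self._clock()
        with self._lock:
            entry = self._entries.get(machine_id)
            if entry is not None and now - entry[0] < self._ttl:
                return entry[1]  # still fresh: serve the cached status
        # Cache miss or stale entry: query the machine and store the result.
        status = self._fetch(machine_id)
        with self._lock:
            self._entries[machine_id] = (now, status)
        return status


class TokenBucket:
    """Token-bucket limiter shared by all workers in this process."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Returns True if the caller may proceed, False if it should back off."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self._last
            self._last = now
            self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False
```

In this sketch a worker would call `try_acquire()` before starting a unit of work and requeue or delay the work when it returns `False`, while status polls go through `TTLStatusCache.get()` so that a burst of requests for the same machine results in at most one underlying query per TTL window.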