Company
Railway
Date Published
Aug. 27, 2024
Author
Jake Cooper
Word count
976
Language
English
Hacker News points
None

Summary

The Railway platform experienced a 30-minute outage on August 27, 2024, affecting approximately 30% of traffic using the new proxy or public TCP proxy and mostly impacting new customers. The incident began when a pull request merged at 10:04 PM UTC and recreated all instances of the new Railway proxy simultaneously while they were serving live traffic, causing timeouts and elevated latency. The underlying cause was an issue in an external vendor's Terraform provider that made it impossible to resize boot disks without recreating the instances, which led to a flawed infrastructure-as-code (IaC) orchestration process. The incident highlighted the need for deletion protection in Terraform configurations; it was not enabled by default in production environments. To prevent similar incidents, Railway is modifying its internal Terraform policies and adding supplemental alerting, making it harder to destroy resources without a privileged escalation request.
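
As an illustration of the kind of guardrail described above, the following is a minimal sketch of a pre-apply check that flags destructive changes in a Terraform plan. It is not Railway's actual policy. It assumes the plan has been exported with `terraform show -json` and reads the `resource_changes` entries of that JSON, where a replacement appears as a change whose `actions` include `delete`. The resource addresses, the `approved` flag, and the function names are hypothetical.

```python
from typing import Dict, List


def destructive_changes(plan: Dict) -> List[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as ["delete", "create"] or ["create", "delete"].
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


def check_plan(plan: Dict, approved: bool = False) -> bool:
    """Allow the plan only if it destroys nothing or was explicitly approved.

    `approved` stands in for a privileged escalation (hypothetical).
    """
    return approved or not destructive_changes(plan)


# Hypothetical plan: two proxy instances replaced, one disk updated in place.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "example_instance.proxy[0]",
         "change": {"actions": ["delete", "create"]}},
        {"address": "example_instance.proxy[1]",
         "change": {"actions": ["delete", "create"]}},
        {"address": "example_disk.data",
         "change": {"actions": ["update"]}},
    ]
}

if __name__ == "__main__":
    for address in destructive_changes(SAMPLE_PLAN):
        print(f"would destroy or recreate: {address}")
    print("allowed:", check_plan(SAMPLE_PLAN))
```

A check like this could run in CI before an apply step, so that a plan recreating every proxy instance at once is blocked unless someone explicitly approves it.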