We recently experienced an outage on our platform that affected 1% of our compute nodes and less than 1% of our workloads, primarily impacting the Asia-Southeast cluster. The incident began at approximately 03:39 UTC, when internal monitoring alerted us to exhausted compute capacity in the region, followed shortly by low disk-capacity alarms. Investigation revealed that all instances in the region were experiencing high memory usage and were dipping into swap to cover burst capacity requirements.

A partial outage was declared, and mitigations included deploying a new compute node and manually culling workloads from existing nodes. However, one instance remained unresponsive because it was irrecoverably soft-locked, which made further debugging difficult. The root cause was identified as a self-service customer's workload exceeding the capacity of the region, causing memory spikes that triggered kernel OOM reaper actions.

To mitigate similar incidents, Railway plans to adjust the kernel OOM reaper configuration, implement workload prioritization, tune deployment algorithms to account for historical metrics, and adjust compute node capacity in Asia-Southeast to decrease the likelihood of saturation.
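As a rough illustration of what OOM-reaper tuning combined with workload prioritization can look like on Linux, the sketch below shows a node agent biasing the kernel OOM killer via `/proc/<pid>/oom_score_adj`, so that best-effort workloads are reclaimed before critical ones under memory pressure. This is not Railway's implementation; the `Workload` type, the priority tiers, and the chosen `oom_score_adj` values are assumptions for illustration only.

```go
// A minimal sketch (not Railway's actual agent) of per-workload OOM biasing.
// Lower-priority workloads get a higher oom_score_adj, so the kernel's OOM
// killer prefers them when a node runs out of memory.
package main

import (
	"fmt"
	"os"
)

// Workload is a hypothetical description of a process the node agent manages.
type Workload struct {
	PID      int
	Priority int // 0 = critical, larger values = lower priority (assumed tiers)
}

// oomScoreAdj maps a priority tier to a value in the kernel's accepted
// oom_score_adj range of -1000 (never kill) to 1000 (kill first).
func oomScoreAdj(priority int) int {
	switch {
	case priority == 0:
		return -500 // protect critical workloads
	case priority == 1:
		return 0 // kernel default behaviour
	default:
		return 500 // prefer reaping best-effort workloads under memory pressure
	}
}

// applyOOMScore writes the adjustment to /proc/<pid>/oom_score_adj,
// the standard Linux interface consulted by the OOM killer.
func applyOOMScore(w Workload) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", w.PID)
	value := fmt.Sprintf("%d\n", oomScoreAdj(w.Priority))
	return os.WriteFile(path, []byte(value), 0o644)
}

func main() {
	// Hypothetical workload list; in practice this would come from the
	// agent's view of deployments running on the node.
	workloads := []Workload{
		{PID: 1234, Priority: 0},
		{PID: 5678, Priority: 2},
	}
	for _, w := range workloads {
		if err := applyOOMScore(w); err != nil {
			fmt.Fprintf(os.Stderr, "failed to set oom_score_adj for pid %d: %v\n", w.PID, err)
		}
	}
}
```

The design choice here is simply to make the kernel's existing reclamation behaviour priority-aware rather than replacing it; the actual prioritization and capacity-planning work described above would live in the scheduler and deployment layer.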