Company
Date Published
Aug. 21, 2024
Author
John Spray
Word count
1264
Language
English
Hacker News points
None

Summary

This incident review covers a recent outage in Neon's us-east-1 region that left approximately 0.4% of customer projects unavailable for up to 2 hours. The outage was caused by an EC2 instance failure, and roughly 30 minutes elapsed between the initial node failure and the decision to migrate projects away from the failed node. The incident highlighted the need for better fault tolerance and a faster response to node failures. To address this, Neon is introducing a new service called the Storage Controller, which uses a reconciliation-loop model to schedule users' Projects onto pageservers and can respond dynamically to changes in node availability or load. The Storage Controller has been in production since May 2024 and is being used to manage high-capacity Projects first, with plans to migrate all paying customers to it in the near future.
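
To illustrate the reconciliation-loop model mentioned in the summary, here is a minimal sketch in Python. It is not Neon's implementation: the `Pageserver` class, the `reconcile` function, and the least-loaded placement rule are all assumptions made for illustration. The idea it shows is that each pass compares the desired state (every Project is placed on an available pageserver) with the observed state, then makes only the placements needed to close the gap.

```python
from dataclasses import dataclass, field


@dataclass
class Pageserver:
    """Hypothetical model of a pageserver node (not Neon's actual type)."""
    name: str
    available: bool = True
    projects: set[str] = field(default_factory=set)


def reconcile(nodes: list[Pageserver], projects: set[str]) -> list[tuple[str, str]]:
    """Run one reconciliation pass and return the (project, node) placements made."""
    healthy = [n for n in nodes if n.available]
    if not healthy:
        return []  # nowhere to schedule; a later pass can try again

    # Observed state: Projects currently placed on an available node.
    placed = set().union(*(n.projects for n in healthy))

    # An unavailable node no longer counts as hosting anything.
    for n in nodes:
        if not n.available:
            n.projects.clear()

    # Close the gap: place each missing Project on the least-loaded node
    # (assumed policy; ties are broken by node name for determinism).
    moves = []
    for project in sorted(projects - placed):
        target = min(healthy, key=lambda n: (len(n.projects), n.name))
        target.projects.add(project)
        moves.append((project, target.name))
    return moves
```

Calling `reconcile` repeatedly is safe: when every Project is already on an available node, a pass makes no changes. After a node is marked unavailable, the next pass moves its Projects to the remaining nodes without operator involvement, which is the property the summary attributes to the reconciliation-loop approach.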