Company
Date Published
Author
Alexis Lê-Quôc
Word count
798
Language
English
Hacker News points
None

Summary

Last month, AWS customers experienced partial network outages that affected an availability zone's connectivity to the internet. The failures highlighted the importance of monitoring key metrics and of preparing for infrastructure failures before they happen. Monitoring error distributions, and especially their outliers, can help detect "grey" partial failures. When the problem lies in shared infrastructure, it is also crucial to be able to fall back on infrastructure that is already deployed in another zone or region. Building for failure and having a contingency plan are essential to minimizing the impact of such incidents.
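
To make the point about outliers concrete, here is a minimal sketch of one way to flag a zone whose error rate stands apart from its peers. It is not the method described in the article: the function name `find_outlier_zones`, the zone labels, the sample rates, and the choice of a modified z-score (median and median absolute deviation) with a cutoff of 3.5 are all assumptions made for illustration.

```python
from statistics import median


def find_outlier_zones(error_rates, threshold=3.5):
    """Return the zones whose error rate is unusually high compared to peers.

    error_rates maps a zone label to its error rate (a fraction of requests).
    Uses the modified z-score, 0.6745 * (x - median) / MAD, which is less
    sensitive to the outlier itself than a mean and standard deviation.
    The threshold of 3.5 is a common rule of thumb, assumed here.
    """
    values = list(error_rates.values())
    med = median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = median(abs(v - med) for v in values)

    if mad == 0:
        # At least half of the zones report the same rate, so the score is
        # undefined. Crude fallback (an assumption): flag anything above
        # the median.
        return sorted(z for z, v in error_rates.items() if v > med)

    # Only high error rates are of interest, so the score is one-sided.
    return sorted(
        zone
        for zone, rate in error_rates.items()
        if 0.6745 * (rate - med) / mad > threshold
    )


# Hypothetical per-zone error rates; one zone is partially degraded.
sample = {
    "zone-a": 0.011,
    "zone-b": 0.009,
    "zone-c": 0.010,
    "zone-d": 0.085,
}
degraded = find_outlier_zones(sample)
```

With the sample above, `degraded` contains only `zone-d`. Comparing each zone to the median of its peers, rather than to a single fleet-wide average, is what keeps a partial failure in one zone from being diluted by the healthy ones.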