Learning from AWS failure
Last month, AWS customers experienced partial network outages that affected an availability zone's connectivity to the internet. The failures showed why it matters to monitor key metrics and to prepare for infrastructure failures ahead of time. Monitoring error distributions, especially outliers, can help detect "grey" partial failures. When the problem lies in shared infrastructure, relying on infrastructure already deployed in another zone or region is crucial. Building for failure and having a contingency plan are essential to minimizing the impact of such incidents.
Company
Datadog
Date published
Oct. 23, 2013
Author(s)
Alexis Lê-Quôc
Word count
798
Hacker News points
None found.
Language
English