
Learning from AWS failure

What's this blog post about?

AWS customers experienced partial network outages last month that affected one availability zone's connectivity to the internet. The failures highlight the importance of monitoring key metrics and preparing for infrastructure failures. Monitoring the distribution of errors, and outliers in particular, can help detect "grey" partial failures. When shared infrastructure is affected, it is crucial to rely on infrastructure already deployed in another zone or region. Building for failure and having a contingency plan are essential to minimizing the impact of such incidents.
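To make the idea of watching error outliers concrete, here is a minimal sketch in Python. It is not taken from the blog post: the function name, the zone labels, and the `ratio` and `floor` thresholds are all hypothetical, and it assumes per-zone error rates have already been collected by some monitoring system.

```python
from statistics import median


def flag_outlier_zones(error_rates, ratio=3.0, floor=0.01):
    """Return zones whose error rate is well above the median across zones.

    error_rates: dict mapping a zone name to its error rate (0.0 to 1.0).
    ratio, floor: illustrative thresholds, not values from the post.
    """
    baseline = median(error_rates.values())
    # The floor avoids flagging noise when every zone is almost error-free.
    threshold = max(baseline * ratio, floor)
    return sorted(zone for zone, rate in error_rates.items() if rate > threshold)


# Hypothetical sample: one zone is partially degraded, the others are healthy.
rates = {"zone-a": 0.002, "zone-b": 0.003, "zone-c": 0.041}
print(flag_outlier_zones(rates))
```

Comparing each zone against the median rather than against a single aggregate keeps a partial failure in one zone from being averaged away by the healthy ones.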

Company
Datadog

Date published
Oct. 23, 2013

Author(s)
Alexis Lê-Quôc

Word count
798

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.