2023-03-08 Incident: A Deep Dive into the Platform-level Recovery
On March 8, 2023, Datadog experienced a global outage that affected all services across multiple regions due to an unexpected system patch. The company's teams worked for approximately 13 hours to restore the majority of their compute capacity across all regions. They faced several challenges during this process, including different responses required for each region, limitations on the maximum number of VM instances in a peering group, and reaching subnet capacity limits. Despite these obstacles, Datadog managed to recover its platform-level capabilities and continue working towards complete recovery.
Company
Datadog
Date published
June 16, 2023
Author(s)
Laurent Bernaille
Word count
4508
Language
English
Hacker News points
None found.