2023-03-08 Incident: A Deep Dive into Our Incident Response

Company

Datadog

Date Published

June 1, 2023

Author

Laura de Vesine

Word count

3798

Language

English

Hacker News points

None

URL

www.datadoghq.com/blog/engineering/2023-03-08-deep-dive-into-incident-response

Summary

On March 8, 2023, Datadog experienced a global outage, the first of its kind for the company. Several hundred engineers worked in shifts and coordinated across multiple communication channels to resolve it. This post describes Datadog's incident response process, including monitoring systems, high-severity incident management, training, and a blameless culture. The outage provided lessons on improving internal response, customer communications, and overall preparedness for future incidents.