How we manage incidents at Datadog
This article describes how Datadog manages incidents at scale, covering its end-to-end process and the tools it has built to support that process. The company emphasizes two core components of incident management: a culture of resilience and blameless organizational accountability, and monitoring of its own systems. Datadog uses its own Incident Management tool to declare incidents, assign severity levels, set up communication channels, and designate first-line responders. It also relies on support roles such as workstream leads, communications leads, and executive leads during incident response. To gauge the success of its incident management process, Datadog tracks several metrics, including a low rate of recurrence, increasing incident complexity, decreased time to detection, and a low rate of spurious alerts.
Company
Datadog
Date published
Nov. 6, 2023
Author(s)
Laura de Vesine, Aaron Kaplan
Word count
2517
Language
English
Hacker News points
3