This article describes how Datadog manages incidents at scale, covering its end-to-end process and the tools it has built to support that process. Datadog emphasizes two core components of incident management: a culture of resilience and blameless organizational accountability, and monitoring of its own systems. The company uses its own Incident Management tool to declare incidents, assign severity levels, set up communication channels, and designate first-line responders. During incident response, it also relies on support roles such as workstream leads, communications leads, and executive leads. To gauge the success of its incident management process, Datadog tracks several metrics: a low rate of recurrence, increasing incident complexity, decreased time to detection, and a low rate of spurious alerts.
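
As a rough illustration of how three of these metrics could be computed, the sketch below works from a simple list of incident and alert records. It is not Datadog's implementation or schema: the `Incident` and `Alert` classes, their field names, and the choice to treat a repeated `root_cause` label as a recurrence are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    # Hypothetical record shape, not Datadog's actual data model.
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when monitoring or a human noticed it
    root_cause: str        # free-form label used to spot repeats


@dataclass
class Alert:
    # Hypothetical record shape, not Datadog's actual data model.
    fired_at: datetime
    actionable: bool       # False means the alert was spurious


def mean_time_to_detection(incidents: List[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.detected_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


def recurrence_rate(incidents: List[Incident]) -> float:
    """Fraction of incidents whose root cause was already seen earlier."""
    if not incidents:
        raise ValueError("need at least one incident")
    seen = set()
    repeats = 0
    for incident in sorted(incidents, key=lambda i: i.started_at):
        if incident.root_cause in seen:
            repeats += 1
        seen.add(incident.root_cause)
    return repeats / len(incidents)


def spurious_alert_rate(alerts: List[Alert]) -> float:
    """Fraction of alerts that did not correspond to a real problem."""
    if not alerts:
        raise ValueError("need at least one alert")
    return sum(1 for a in alerts if not a.actionable) / len(alerts)
```

Tracking these values over successive time windows, rather than as single numbers, is what would show whether detection time and spurious alerts are actually trending down.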