
How we optimized LLM use for cost, quality, and safety to facilitate writing postmortems

What's this blog post about?

Writing a comprehensive postmortem after an incident is time-consuming, especially once responders have moved on to the next urgent issue. To ease the process, Datadog's Bits AI includes a feature that uses large language models (LLMs) to generate a first draft of the postmortem for human authors to build upon. The draft combines structured metadata from Datadog's Incident Management app with unstructured discussion from the related Slack channels. The project faced three main challenges: ensuring data quality and minimizing hallucinations; balancing cost, speed, and quality; and addressing trust and privacy concerns. To reduce hallucinations, the team refined the LLM instructions and grounded them in both the structured and the unstructured data. An experimentation framework enabled rapid iteration across datasets, incident information, and model architectures, with adjustable parameters such as model type, input settings, and output token limits. To improve speed and accuracy, the team used a multi-step instruction approach that breaks the postmortem into sections and generates them in parallel. To build trust and protect privacy, the team added secret scanning and filtering, citations of the relevant sources, feedback mechanisms, and clear labeling of AI-generated content. The first implementation was more effective for incidents with mid to lower severities (e.g., from SEV5 to SEV2). Looking ahead, the team is exploring additional customization options, in-the-moment assistance while editing the postmortem, and the inclusion of more relevant incident context.
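The section-by-section, parallel approach and the secret filtering described above can be illustrated with a minimal Python sketch. It is not Datadog's implementation: the function names, the section list, and the single regular expression are all hypothetical, and `draft_section` is a stand-in that returns a stub instead of calling a model so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern for illustration; a real secret scanner covers far more formats.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]


def scrub(text: str) -> str:
    """Replace anything matching a secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def draft_section(section: str, context: str) -> str:
    """Stand-in for an LLM call (assumption): returns a stub draft for one section."""
    return f"## {section}\n(draft based on {len(context)} characters of context)"


def draft_postmortem(sections, metadata, slack_messages):
    """Scrub the combined context, then draft each section in parallel."""
    # Combine structured fields with unstructured chat before any model call.
    lines = [f"{key}: {value}" for key, value in metadata.items()] + list(slack_messages)
    context = scrub("\n".join(lines))
    with ThreadPoolExecutor(max_workers=max(1, len(sections))) as pool:
        # pool.map returns results in the same order as the input sections.
        drafts = list(pool.map(lambda s: draft_section(s, context), sections))
    return "\n\n".join(drafts)
```

Calling `draft_postmortem(["Summary", "Timeline"], {"severity": "SEV3"}, ["token=abc123 was rotated"])` returns the two stub sections in order, with the token value replaced in the context before any drafting step sees it. Threads suit this shape because each section's model call would be I/O-bound and independent of the others.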

Company
Datadog

Date published
Sept. 23, 2024

Author(s)
Tran Le, Till Pieper, Gillian McGarvey

Word count
2814

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.