Monitoring 101: Investigating performance issues

Company

Datadog

Date Published

July 16, 2015

Author

Alexis Lê-Quôc

Word count

1050

Language

English

Hacker News points

None

URL

www.datadoghq.com/blog/monitoring-101-investigation

Summary

This article describes an effective approach to diagnosing the root cause of infrastructure problems using monitoring data. It highlights three types of monitoring data that help identify issues: work metrics, resource metrics, and events. The process starts with top-level work metrics to characterize the problem, then examines the resources the system uses, then checks for changes or events that correlate with the issue, and ends with fixing the problem and adding instrumentation if necessary. Building dashboards in advance is recommended to speed up investigation during an outage. Throughout, the article emphasizes a systematic approach to diagnosis.
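
The investigation order summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the article: the `System` class, the `investigate` function, the metric names, and the thresholds are all hypothetical, and a real setup would pull these values from a monitoring backend rather than from hard-coded dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Hypothetical snapshot of one system's monitoring data."""
    name: str
    work_metrics: dict = field(default_factory=dict)      # e.g. error rate, latency
    resource_metrics: dict = field(default_factory=dict)  # utilization as 0.0-1.0
    events: list = field(default_factory=list)            # recent changes, e.g. deploys


def investigate(system, work_limits, resource_limit=0.9):
    """Return findings in the order the summary describes (assumed thresholds)."""
    findings = []

    # Step 1: start with top-level work metrics to characterize the problem.
    for metric, value in system.work_metrics.items():
        limit = work_limits.get(metric)
        if limit is not None and value > limit:
            findings.append(f"work: {metric}={value} exceeds {limit}")
    if not findings:
        return findings  # work metrics look healthy, nothing to dig into

    # Step 2: look at the resources the system uses.
    for metric, value in system.resource_metrics.items():
        if value >= resource_limit:
            findings.append(f"resource: {metric} at {value:.0%}")

    # Step 3: list recent events so they can be correlated with the issue.
    for event in system.events:
        findings.append(f"event: {event}")

    return findings


# Hypothetical example data for a web tier.
web = System(
    name="web",
    work_metrics={"error_rate": 0.07, "p95_latency_s": 0.4},
    resource_metrics={"cpu": 0.95, "memory": 0.40},
    events=["deploy at 14:02"],
)
report = investigate(web, work_limits={"error_rate": 0.01, "p95_latency_s": 1.0})
for line in report:
    print(line)
```

The sketch stops at collecting findings. The final step in the summary, fixing the problem and adding instrumentation, is left to the operator. Returning early when work metrics are healthy reflects the idea of starting from the top-level symptoms before looking at resources.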