Outage Post-Mortem for August 22nd

What's this blog post about?

On August 22, 2016, Buildkite experienced a severe unplanned outage, and the response was delayed because PagerDuty was misconfigured and phones were on "silent". The team woke up to find the website offline. The main culprit turned out to be a downgraded PostgreSQL database that failed under heavy load, which led to failing health checks, replacement servers that never became healthy, and issues with AWS services. Lessons learned include keeping an eye on AWS credits, load testing after significant infrastructure changes, rethinking health checks, ensuring every on-call team member has the correct alert settings, and temporarily suspending auto scaling processes during high-churn incidents in AWS. The Buildkite team apologized for the downtime and promised not to make the same mistakes again.

Company
Buildkite

Date published
Aug. 23, 2016

Author(s)
Keith Pitt

Word count
2281

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.