Outage Post-Mortem for August 22nd
On August 22, 2016, Buildkite experienced a severe unplanned outage. Because PagerDuty was misconfigured and phones were set to silent, the alerts went unnoticed, and the team woke up to find the website offline. The main culprit was a downgraded PostgreSQL database that failed under heavy load. This led to failing health checks, replacement servers that never became healthy, and issues with AWS services. Lessons learned include keeping an eye on AWS credits, load testing after significant infrastructure changes, rethinking health checks, ensuring every on-call team member has correct alert settings, and temporarily suspending auto scaling processes during high-churn incidents in AWS. The Buildkite team apologized for the downtime and promised not to make the same mistakes again.
Company
Buildkite
Date published
Aug. 23, 2016
Author(s)
Keith Pitt
Word count
2281
Language
English
Hacker News points
None found.