
We took down production by misconfiguring our ETL

What's this blog post about?

On November 7, 2022, Hex Cloud experienced its longest-ever service interruption when an ETL configuration issue caused its production Postgres database to run out of disk space. The incident lasted about 2 hours and 40 minutes and affected thousands of users globally. Hex has published a detailed post-mortem describing how the outage was resolved and the steps being taken to prevent similar issues in the future. Key takeaways from the incident:

1. Monitoring alerts for free database disk space and for Fivetran failures were missed; catching either could have prevented the outage (see the disk-space sketch after this list).
2. Hex is improving its monitoring systems and adding specific alerts to catch this class of problem earlier.
3. Postgres 13 has a feature that limits the size of the WAL (write-ahead log), preventing larger outages, and Hex plans to upgrade from Postgres 12 (see the WAL-retention sketch after this list).
4. Hex is exploring Teleport for Fivetran as an alternative solution to reduce the risks associated with the ETL integration.
5. Maintaining incident runbooks and conducting quarterly fire drills are crucial for preparing engineers for crisis situations.
6. Redundancy, such as the read replica that saved Hex during this incident, is essential for mitigating potential outages.
7. Taking a moment to reflect before diving into solutions helps avoid unexpected failure modes and surfaces alternative approaches.
8. Every system that touches production should have the same level of reliability, monitoring, and alerting as the critical systems themselves.
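Takeaways 1 and 2 center on alerting for database disk space. As a rough illustration only (the post does not describe Hex's monitoring stack), the sketch below shows the kind of free-space check such an alert typically evaluates; the mount path, 90% threshold, and notify() hook are all assumptions:

```python
# A minimal sketch of a disk free-space check, assuming a hypothetical mount
# point, threshold, and notification hook (not Hex's actual tooling).
import shutil


def check_disk_usage(path: str = "/var/lib/postgresql", threshold: float = 0.90) -> bool:
    """Return True (and fire an alert) when the database volume is nearly full."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= threshold:
        notify(f"Database volume {path} is {used_fraction:.0%} full")
        return True
    return False


def notify(message: str) -> None:
    # Placeholder: a real alert would page on-call via PagerDuty, Slack, etc.
    print(f"ALERT: {message}")


if __name__ == "__main__":
    check_disk_usage()
```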
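On takeaway 3, the Postgres 13 feature is presumably the max_slot_wal_keep_size setting, which caps how much WAL a replication slot (such as the one an ETL tool like Fivetran uses) can hold back; that interpretation, the psycopg2 dependency, and the PG_DSN environment variable are my assumptions, not details from the post. A minimal sketch for inspecting the setting and per-slot WAL retention:

```python
# Sketch: inspect the Postgres 13 WAL-retention cap and how much WAL each
# replication slot is currently holding back. Assumes psycopg2 is installed
# and a connection string is provided via the PG_DSN environment variable.
import os

import psycopg2


def report_wal_retention(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Cap on WAL kept for replication slots; -1 means unlimited, the
        # pre-13 behavior that lets a stuck slot fill the disk.
        cur.execute("SHOW max_slot_wal_keep_size")
        print("max_slot_wal_keep_size:", cur.fetchone()[0])

        # WAL currently retained on behalf of each replication slot.
        cur.execute(
            """
            SELECT slot_name,
                   pg_size_pretty(
                       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
                   ) AS retained_wal
            FROM pg_replication_slots
            """
        )
        for slot_name, retained_wal in cur.fetchall():
            print(f"slot {slot_name}: {retained_wal} of WAL retained")


if __name__ == "__main__":
    report_wal_retention(os.environ["PG_DSN"])
```

Note that `SHOW max_slot_wal_keep_size` only works on Postgres 13 and later, which is part of why the post treats the upgrade from Postgres 12 as a mitigation.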

Company
Hex

Date published
Nov. 17, 2022

Author(s)
Caitlin Colgrove, Amanda Fioritto

Word count
1885

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.