Deleting Production in a Few Easy Steps (and How to Fix It)

Company

InfluxData

Date Published

June 24, 2022

Author

Wojciech Kocjan

Word count

2146

Language

English

Hacker News points

None

URL

www.influxdata.com/blog/deleting-production-steps-how-fix-it

Summary

The incident began when a single line of code added to a configuration file caused ArgoCD to apply changes to production incorrectly, resulting in the loss of a core workload and additional workloads. The team responded by reviewing the code, creating a recovery plan, and restoring state and data, including etcd, Kafka, and the storage engine. They also improved their process for handling public-facing incidents and made changes to prevent similar errors, such as using custom annotations to prevent deletion of resources and improving their tooling to detect duplicates when generating YAML files. The incident highlighted the importance of a well-planned disaster recovery strategy and of effective monitoring and alerting.
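
The duplicate check mentioned in the summary can be illustrated with a minimal sketch. The summary does not say how InfluxData's tooling works, so everything here is an assumption: the function names `resource_key` and `find_duplicates`, the file names, and the choice to treat two manifests as duplicates when they share `apiVersion`, `kind`, `metadata.namespace`, and `metadata.name`. The manifests are assumed to be already parsed into dictionaries.

```python
from collections import defaultdict


def resource_key(manifest):
    """Return the identity tuple used here to compare manifests (an assumption)."""
    meta = manifest.get("metadata", {})
    return (
        manifest.get("apiVersion", ""),
        manifest.get("kind", ""),
        meta.get("namespace", ""),
        meta.get("name", ""),
    )


def find_duplicates(manifests_by_file):
    """Map each identity that appears more than once to the files defining it."""
    seen = defaultdict(list)
    for path, manifests in manifests_by_file.items():
        for manifest in manifests:
            seen[resource_key(manifest)].append(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


# Hypothetical generated output: two files define the same Application.
generated = {
    "team-a/app.yaml": [
        {
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"namespace": "argocd", "name": "core"},
        }
    ],
    "team-b/app.yaml": [
        {
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"namespace": "argocd", "name": "core"},
        }
    ],
    "team-c/app.yaml": [
        {
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"namespace": "argocd", "name": "billing"},
        }
    ],
}

duplicates = find_duplicates(generated)
for key, paths in duplicates.items():
    print("duplicate resource", key, "defined in", paths)
```

A check along these lines could run before generated YAML is committed and fail the build when `duplicates` is non-empty, so that a name collision is caught before anything is applied to a cluster.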