Company
Date Published
Author
Wojciech Kocjan
Word count
2146
Language
English
Hacker News points
None

Summary

The incident began when a single line added to a configuration file caused ArgoCD to incorrectly apply changes to production, resulting in the loss of a core workload and additional workloads. The team responded by reviewing the code, creating a recovery plan, and restoring state and data, including etcd, Kafka, and the storage engine. They also improved their process for handling public-facing incidents and made changes to prevent similar errors in the future, such as using custom annotations to prevent resources from being deleted and improving tooling to detect duplicates when generating YAML files. The incident highlighted the importance of a well-planned disaster recovery strategy and of effective monitoring and alerting.
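The duplicate-detection idea can be illustrated with a minimal sketch. It is not the team's actual tooling. It assumes that "duplicate" means two generated manifests sharing the same apiVersion, kind, namespace, and name, and that the manifests have already been parsed into dictionaries. The function names `resource_key` and `find_duplicates` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed identity of a resource: (apiVersion, kind, namespace, name).
ResourceKey = Tuple[str, str, str, str]


def resource_key(manifest: dict) -> ResourceKey:
    """Build the identity tuple for one parsed manifest (hypothetical helper)."""
    meta = manifest.get("metadata", {})
    return (
        manifest.get("apiVersion", ""),
        manifest.get("kind", ""),
        meta.get("namespace", ""),
        meta.get("name", ""),
    )


def find_duplicates(manifests: Iterable[dict]) -> Dict[ResourceKey, List[int]]:
    """Return each identity seen more than once, with the positions where it occurs."""
    seen: Dict[ResourceKey, List[int]] = defaultdict(list)
    for index, manifest in enumerate(manifests):
        seen[resource_key(manifest)].append(index)
    return {key: positions for key, positions in seen.items() if len(positions) > 1}
```

A generation step could call `find_duplicates` on its full output and fail the build when the result is non-empty, so that a colliding resource is caught before a deployment tool ever sees it.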