Company
Date Published
Author
Aaditya Talwai, Emily Chang
Word count
1542
Language
English
Hacker News points
1

Summary

Datadog recently ran a game day on one of its Elasticsearch clusters to test the resilience of its systems. The team stopped Elasticsearch on various nodes, including the leader node and the client nodes serving recent and long-term data, and observed how its applications responded. The lessons learned include being prepared for 503 responses during leader election, handling dangling indices, and implementing health checks for client nodes. Game day exercises are a useful way to test a system's fault tolerance and to improve alerts and fixes.
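
One way an application might prepare for 503 responses during leader election is to retry with exponential backoff. The sketch below is illustrative only and is not taken from the article: `call_with_retry`, its parameters, and the `send` callable (which stands in for whatever HTTP client issues the Elasticsearch request) are all hypothetical names.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-503 status or attempts run out.

    `send` is a hypothetical stand-in for the real request function.
    The wait doubles after each 503: base_delay, 2 * base_delay, and so on.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    response = send()
    for attempt in range(1, max_attempts):
        if response[0] != 503:
            break
        # The cluster may still be electing a leader; wait, then try again.
        sleep(base_delay * 2 ** (attempt - 1))
        response = send()
    # Either a non-503 response or the last 503 once attempts are exhausted.
    return response
```

Passing `sleep` as a parameter lets a caller substitute a no-op or a recorder in tests. The attempt limit and delays here are placeholders; suitable values depend on how long leader election takes in a given cluster.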